How do I use the Beautiful Soup 4 (BS4) library in Python for web page parsing and data extraction?

<body><html> <table border="1" width="100%" cellspacing="0" cellpadding="1"> <tr bgcolor="#3366FF"> <td align="left" width="10%" valign="top"><font color="#FFFFFF"> Date </font></td> <td align="left" width="10%" valign="top"><font color="#FFFFFF"> Day </font></td> <td align="left" width="10%" valign="top"><font color="#FFFFFF"> Time </font></td> <td align="left" width="10%" valign="top"><font color="#FFFFFF"> Course </font></td> <td align="left" width="40%" valign="top"><font color="#FFFFFF"> Course Title </font></td> <td align="left" width="10%" valign="top"><font color="#FFFFFF"> Duration </font></td> <tr align="yes" valign="yes" bgcolor="#99CCFF"> <td align="left" width="10%" valign="top"> 24 November 2017 </td> <td align="left" width="10%" valign="top"> Friday </td> <td align="left" width="10%" valign="top"> 9.00 am </td> <td align="left" width="10%" valign="top"> AC1101 </td> <td align="left" width="40%" valign="top"> ACCOUNTING I </td> <td align="left" width="10%" valign="top"> 2.5 </td> </tr> <tr align="yes" valign="yes" bgcolor="#FFFFFF"> <td align="left" width="10%" valign="top"> 24 November 2017 </td> <td align="left" width="10%" valign="top"> Friday </td> <td align="left" width="10%" valign="top"> 9.00 am </td> <td align="left" width="10%" valign="top"> AD1101 </td> <td align="left" width="40%" valign="top"> FINANCIAL ACCOUNTING </td> <td align="left" width="10%" valign="top"> 2.5 </td> </tr> <tr align="yes" valign="yes" bgcolor="#99CCFF"> <td align="left" width="10%" valign="top"> 24 November 2017 </td> <td align="left" width="10%" valign="top"> Friday </td> <td align="left" width="10%" valign="top"> 9.00 am </td> <td align="left" width="10%" valign="top"> BA3201 </td> <td align="left" width="40%" valign="top"> LIFE CONTINGENCIES AND DEMOGRAPHY </td> <td align="left" width="10%" valign="top"> 3 </td> <tr align="yes" valign="yes" bgcolor="#FFFFFF"> </table> </body></html>

Given an HTML file like this, I want to export it to JSON in this format:

{"AC1101":{"date":"21 April 2017","day":"Friday","time":"9.00 am","code":"AC1101","name":"ACCOUNTING I","duration":"2.5"},"AD1101":{"date":"21 April 2017","day":"Friday","time":"9.00 am","code":"AD1101","name":"FINANCIAL ACCOUNTING","duration":"2.5"},"BA2201":{"date":"21 April 2017","day":"Friday","time":"9.00 am","code":"BA2201","name":"ACTUARIAL ECONOMICS","duration":"2.5"}}

https://gist.github.com/wudaown/c4f46daa4bd6edc42b8d870fd77c7322

Please help — how do I do this with bs4? I'd rather not use regex.

Thanks!
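For reference, the table rows can be walked directly with bs4's find_all — no regex needed. A minimal self-contained sketch (the row markup below is condensed from the sample above; in practice you would read your saved HTML file instead of an inline string):

```python
import json
from bs4 import BeautifulSoup

# Inline sample shaped like the table in the question (condensed for illustration).
html = """
<table>
<tr><td> Date </td><td> Day </td><td> Time </td><td> Course </td><td> Course Title </td><td> Duration </td></tr>
<tr><td> 24 November 2017 </td><td> Friday </td><td> 9.00 am </td><td> AC1101 </td><td> ACCOUNTING I </td><td> 2.5 </td></tr>
<tr><td> 24 November 2017 </td><td> Friday </td><td> 9.00 am </td><td> AD1101 </td><td> FINANCIAL ACCOUNTING </td><td> 2.5 </td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
result = {}
for tr in soup.find_all('tr')[1:]:  # skip the header row
    # get_text(strip=True) drops the padding whitespace inside each <td>
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if len(cells) != 6:  # guard against malformed rows in the source HTML
        continue
    date, day, time, code, name, duration = cells
    result[code] = {'date': date, 'day': day, 'time': time,
                    'code': code, 'name': name, 'duration': duration}

print(json.dumps(result))
```

html.parser copes with the unclosed `<tr>` tags in the sample, though the lxml parser is generally more robust on badly broken markup.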



5 Replies

In [1]: from lxml import etree

In [2]: with open('tmp.html', 'r') as f:
   ...:     tree = etree.HTML(f.read())

In [10]: tmp = tree.xpath('//tr')

In [29]: import json

In [37]: out = list()
    ...: for tmp1 in tmp[1:]:
    ...:     i = 0
    ...:     dict_d = {1: 'Date', 2: 'Day', 3: 'Time', 4: 'Course', 5: ' Course Title', 6: 'Duration'}
    ...:     t1 = dict()
    ...:     for t in tmp1:
    ...:         i = i + 1
    ...:         t2 = t.xpath('text()')[0]
    ...:         t1[dict_d[i]] = t2
    ...:     out.append(t1)

In [45]: out2 = dict()
    ...: for o in out:
    ...:     try:
    ...:         out2[o['Course']] = {'Course Title': o[' Course Title'], 'Date': o['Date'], 'Day': o['Day'], 'Duration': o['Duration'], 'Time': o['Time']}
    ...:     except:
    ...:         pass

In [46]: out2
Out[46]:
{' AC1101 ': {'Course Title': ' ACCOUNTING I ',
  'Date': ' 24 November 2017 ',
  'Day': ' Friday ',
  'Duration': ' 2.5 ',
  'Time': ' 9.00 am '},
 ' AD1101 ': {'Course Title': ' FINANCIAL ACCOUNTING ',
  'Date': ' 24 November 2017 ',
  'Day': ' Friday ',
  'Duration': ' 2.5 ',
  'Time': ' 9.00 am '},
 ' BA3201 ': {'Course Title': ' LIFE CONTINGENCIES AND DEMOGRAPHY ',
  'Date': ' 24 November 2017 ',
  'Day': ' Friday ',
  'Duration': ' 3 ',
  'Time': ' 9.00 am '}}
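Note that the values in `Out[46]` keep the surrounding whitespace from the table cells, and `json` is imported but never used. A small follow-up step, assuming a dict shaped like that output (one row reproduced inline here), strips the padding and dumps the JSON the question asks for:

```python
import json

# One row copied from the Out[46] result above, whitespace and all.
out2 = {' AC1101 ': {'Course Title': ' ACCOUNTING I ',
                     'Date': ' 24 November 2017 ',
                     'Day': ' Friday ',
                     'Duration': ' 2.5 ',
                     'Time': ' 9.00 am '}}

# Strip the cell padding from both keys and values before serialising.
cleaned = {code.strip(): {k.strip(): v.strip() for k, v in row.items()}
           for code, row in out2.items()}

print(json.dumps(cleaned, indent=2))
```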


# Basic usage examples
from bs4 import BeautifulSoup
import requests

# 1. Fetch a page and create a parse tree
url = 'https://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')  # or the faster 'lxml' parser

# 2. Common lookup methods
# Find the first match by tag name
title = soup.find('h1')  # returns the first <h1> tag
all_links = soup.find_all('a')  # returns a list of all <a> tags

# 3. Find by CSS class
articles = soup.find_all('div', class_='article')  # 'class' is a Python keyword, hence the underscore

# 4. Find by attribute
img = soup.find('img', {'src': 'logo.png'})  # find the image whose src is logo.png

# 5. Extract data
# Get a tag's text
print(title.text)  # or title.get_text()
# Get an attribute value
link = soup.find('a')
print(link['href'])  # get the href attribute
print(link.get('href'))  # safer: returns None instead of raising when the attribute is missing

# 6. Navigating the tree
# Via parent/child nodes
div = soup.find('div')
first_child = div.contents[0]  # list of direct children
children = div.children  # iterator over children

# Via siblings
next_sib = title.next_sibling
prev_sib = title.previous_sibling

# 7. CSS selectors (recommended)
items = soup.select('div.item')  # all <div> elements with class="item"
nested = soup.select('ul > li.active')  # direct <li> children of <ul> whose class includes "active"

Key points:

  • find() returns a single element; find_all() returns all matches
  • The CSS-selector method select() is the most powerful, with jQuery-like syntax
  • Remember to handle encodings and exceptions (error handling is omitted above)
  • The lxml parser is faster and more fault-tolerant than html.parser

One-line recommendation: prefer CSS selectors; combined with find_all() they cover about 90% of extraction needs.
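As a concrete illustration of the select() recommendation, the data rows in the question's table can be picked out with a CSS attribute selector, since every data row carries a bgcolor attribute (one row reproduced inline here for illustration):

```python
from bs4 import BeautifulSoup

# A minimal snippet shaped like one data row from the question's table.
html = '<table><tr bgcolor="#99CCFF"><td> AC1101 </td><td> ACCOUNTING I </td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

# 'tr[bgcolor]' matches every <tr> that has a bgcolor attribute, whatever its value.
rows = [[td.get_text(strip=True) for td in tr.select('td')]
        for tr in soup.select('tr[bgcolor]')]
print(rows)
```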

from lxml import etree
import json

with open('tmp.html', 'r') as f:
    tree = etree.HTML(f.read())
tmp = tree.xpath('//tr')

out = list()
for tmp1 in tmp[1:]:
    i = 0
    dict_d = {1: 'Date', 2: 'Day', 3: 'Time', 4: 'Course', 5: ' Course Title', 6: 'Duration'}
    t1 = dict()
    for t in tmp1:
        i = i + 1
        t2 = t.xpath('text()')[0]
        t1[dict_d[i]] = t2
    out.append(t1)

out2 = dict()
for o in out:
    try:
        out2[o['Course']] = {'Course Title': o[' Course Title'], 'Date': o['Date'], 'Day': o['Day'], 'Duration': o['Duration'], 'Time': o['Time']}
    except KeyError:
        pass
print(out2)

Thank you very much for the answer — this is all new to me and will take some time to digest. While waiting, I already got it working with dict, list, and bs4; the code just looks pretty beginner-level.

Why not use pyquery? (grin)
