How do I use the Beautiful Soup 4 (BS4) library in Python for HTML parsing and data extraction?
I have an HTML file (linked below) that I want to export to JSON in this format:
{"AC1101":{"date":"21 April 2017","day":"Friday","time":"9.00 am","code":"AC1101","name":"ACCOUNTING I","duration":"2.5"},"AD1101":{"date":"21 April 2017","day":"Friday","time":"9.00 am","code":"AD1101","name":"FINANCIAL ACCOUNTING","duration":"2.5"},"BA2201":{"date":"21 April 2017","day":"Friday","time":"9.00 am","code":"BA2201","name":"ACTUARIAL ECONOMICS","duration":"2.5"}}
https://gist.github.com/wudaown/c4f46daa4bd6edc42b8d870fd77c7322
How can I do this with bs4? I don't want to use regex.
Thanks!
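For what it's worth, here is a minimal bs4 sketch of one way to do this, assuming the gist's table is plain <tr>/<td> rows in the column order Date, Day, Time, Course, Course Title, Duration; the file name tmp.html and the exact column order are assumptions based on the desired JSON, not taken from the gist:

import json
from bs4 import BeautifulSoup

# Assumed column order, matching the keys in the desired JSON (check your table).
FIELDS = ["date", "day", "time", "code", "name", "duration"]

with open("tmp.html", "r") as f:  # tmp.html is a placeholder file name
    soup = BeautifulSoup(f.read(), "html.parser")

result = {}
for row in soup.find_all("tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) != len(FIELDS):
        continue  # skip separator or malformed rows
    record = dict(zip(FIELDS, cells))
    result[record["code"]] = record  # key each record by course code

print(json.dumps(result, ensure_ascii=False))

dict(zip(...)) keyed by course code gives the requested structure directly, and get_text(strip=True) avoids the stray whitespace that raw cell text usually carries.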
In [1]: from lxml import etree

In [2]: with open('tmp.html', 'r') as f:
   ...:     tree = etree.HTML(f.read())

In [10]: tmp = tree.xpath('//tr')

In [29]: import json

In [37]: out = list()
    ...: for tmp1 in tmp[1:]:
    ...:     i = 0
    ...:     dict_d = {1: 'Date', 2: 'Day', 3: 'Time', 4: 'Course', 5: ' Course Title', 6: 'Duration'}
    ...:     t1 = dict()
    ...:     for t in tmp1:
    ...:         i = i + 1
    ...:         t2 = t.xpath('text()')[0]
    ...:         t1[dict_d[i]] = t2
    ...:     out.append(t1)

In [45]: out2 = dict()
    ...: for o in out:
    ...:     try:
    ...:         out2[o['Course']] = {'Course Title': o[' Course Title'], 'Date': o['Date'], 'Day': o['Day'], 'Duration': o['Duration'], 'Time': o['Time']}
    ...:     except KeyError:
    ...:         pass

In [46]: out2
Out[46]:
{' AC1101 ': {'Course Title': ' ACCOUNTING I ',
  'Date': ' 24 November 2017 ',
  'Day': ' Friday ',
  'Duration': ' 2.5 ',
  'Time': ' 9.00 am '},
 ' AD1101 ': {'Course Title': ' FINANCIAL ACCOUNTING ',
  'Date': ' 24 November 2017 ',
  'Day': ' Friday ',
  'Duration': ' 2.5 ',
  'Time': ' 9.00 am '},
 ' BA3201 ': {'Course Title': ' LIFE CONTINGENCIES AND DEMOGRAPHY ',
  'Date': ' 24 November 2017 ',
  'Day': ' Friday ',
  'Duration': ' 3 ',
  'Time': ' 9.00 am '}}
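Note that the keys and values above still carry the surrounding whitespace from the table cells, and json is imported but never used. A small follow-up sketch, assuming the out2 dict from the session above, that strips the whitespace and serializes to the shape the question asks for:

# Strip stray whitespace from keys and values, then write JSON.
cleaned = {
    code.strip(): {k.strip(): v.strip() for k, v in rec.items()}
    for code, rec in out2.items()
}
with open('timetable.json', 'w') as f:  # timetable.json is a placeholder name
    json.dump(cleaned, f, ensure_ascii=False, indent=2)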
# Basic usage examples
from bs4 import BeautifulSoup
import requests

# 1. Fetch the page and create the parse tree
url = 'https://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')  # or use the faster 'lxml' parser

# 2. Common lookup methods
title = soup.find('h1')          # returns the first <h1> tag
all_links = soup.find_all('a')   # returns a list of all <a> tags

# 3. Find by CSS class
articles = soup.find_all('div', class_='article')  # class is a Python keyword, hence the trailing underscore

# 4. Find by attribute
img = soup.find('img', {'src': 'logo.png'})  # find the image whose src attribute is logo.png

# 5. Extract data
# Get tag text
print(title.text)  # or title.get_text()
# Get attribute values
link = soup.find('a')
print(link['href'])      # get the href attribute
print(link.get('href'))  # safer: returns None instead of raising KeyError

# 6. Navigating the tree
# Parent/child nodes
div = soup.find('div')
first_child = div.contents[0]  # list of direct children
children = div.children        # iterator over direct children
# Sibling nodes
next_sib = title.next_sibling
prev_sib = title.previous_sibling

# 7. CSS selectors (recommended)
items = soup.select('div.item')         # all <div> elements with class="item"
nested = soup.select('ul > li.active')  # direct <li> children of <ul> whose class includes active
Key points:
- Use find() for a single element, find_all() for multiple.
- CSS selectors via select() are the most powerful option, with jQuery-like syntax.
- Remember to handle encodings and exceptions (error handling is omitted above; see the sketch after this list).
- Prefer the lxml parser over html.parser: it is faster and more fault-tolerant.
One-line advice: reach for CSS selectors first; combined with find_all() they cover about 90% of extraction needs.
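A minimal sketch of the error handling omitted above, assuming the same example.com fetch (the URL and the <h1> selector are placeholders):

import requests
from bs4 import BeautifulSoup

try:
    response = requests.get('https://example.com', timeout=10)  # placeholder URL
    response.raise_for_status()                      # raise on 4xx/5xx responses
    response.encoding = response.apparent_encoding   # guard against mis-detected encodings
    soup = BeautifulSoup(response.text, 'lxml')      # requires the lxml package
except requests.RequestException as e:
    print(f'request failed: {e}')
else:
    title = soup.find('h1')
    # find() returns None when nothing matches, so check before dereferencing
    print(title.get_text(strip=True) if title else 'no <h1> found')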
from lxml import etree
import json

with open('tmp.html', 'r') as f:
    tree = etree.HTML(f.read())

tmp = tree.xpath('//tr')

out = list()
for tmp1 in tmp[1:]:  # skip the header row
    i = 0
    dict_d = {1: 'Date', 2: 'Day', 3: 'Time', 4: 'Course', 5: ' Course Title', 6: 'Duration'}
    t1 = dict()
    for t in tmp1:
        i = i + 1
        t2 = t.xpath('text()')[0]
        t1[dict_d[i]] = t2
    out.append(t1)

out2 = dict()
for o in out:
    try:
        out2[o['Course']] = {'Course Title': o[' Course Title'], 'Date': o['Date'], 'Day': o['Day'],
                             'Duration': o['Duration'], 'Time': o['Time']}
    except KeyError:  # rows without a Course cell (e.g. separator rows) are skipped
        pass

print(json.dumps(out2, ensure_ascii=False))
Thanks a lot for the answer; this is all new to me and will take some time to digest. While waiting I already got it working myself with dict, list and bs4, though my code looks pretty beginner-level.
Why not use pyquery? (grin)
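For reference, a hedged pyquery version of the same extraction, assuming the same tmp.html and column order as in the sketches above; pyquery exposes jQuery-style selectors on top of lxml:

import json
from pyquery import PyQuery as pq

FIELDS = ["date", "day", "time", "code", "name", "duration"]  # assumed column order

doc = pq(filename='tmp.html')  # tmp.html is a placeholder file name
result = {}
for tr in doc('tr').items():
    cells = [td.text() for td in tr('td').items()]  # .text() normalizes whitespace
    if len(cells) != len(FIELDS):  # skips <th>-only header rows and separator rows
        continue
    record = dict(zip(FIELDS, cells))
    result[record["code"]] = record

print(json.dumps(result, ensure_ascii=False))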

