How to strip unwanted HTML attributes from scraped data in Python

For example, given the following data

<p id="a">data</p>

I only want to keep

<p>data</p>

How should I do this? Is there a quick way?

ionicwang, reply #1 (original poster)

Try extracting the text with text(); that should give you data.
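
A minimal sketch of this suggestion, assuming BeautifulSoup's get_text() as a stand-in for text() (the reply does not say which library it means). Note that it returns only the string data, not the <p>data</p> element the question asks for; the variable name text_only is illustrative.

```python
from bs4 import BeautifulSoup

# Parse the snippet from the question.
soup = BeautifulSoup('<p id="a">data</p>', 'html.parser')

# get_text() returns the text content only; the <p> tag itself is dropped.
text_only = soup.p.get_text()
print(text_only)
```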

gougou168, reply #2

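To strip unwanted HTML attributes from scraped data, BeautifulSoup is the most direct tool. The idea is to iterate over the tags and delete specific attributes through each tag's attrs dictionary.

Core code example: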

from bs4 import BeautifulSoup

html_doc = """
<div id="old" class="container" style="color:red;" data-temp="remove" onclick="alert()">
    <p class="text" style="font-size:14px;">测试文本</p>
</div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# 1. Blacklist: delete specific attributes (e.g. style, onclick) from every tag
attrs_to_remove = ['onclick', 'data-temp']
for tag in soup.find_all(True):  # find_all(True) matches every tag
    # Delete a single attribute
    if 'style' in tag.attrs:
        del tag.attrs['style']
    # Delete several attributes
    for attr in attrs_to_remove:
        if attr in tag.attrs:
            del tag.attrs[attr]

# 2. Whitelist: keep only the listed attributes
allowed_attrs = ['class', 'id']
for tag in soup.find_all(True):
    # Copy the attribute names first so the dict is not modified while iterating
    attrs = list(tag.attrs.keys())
    for attr in attrs:
        if attr not in allowed_attrs:
            del tag.attrs[attr]

print(soup.prettify())
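
Applied to the snippet in the original question, the whitelist approach with an empty whitelist removes every attribute. The following is a minimal sketch under that assumption; the helper name strip_attributes and its keep parameter are hypothetical, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

def strip_attributes(html, keep=()):
    """Return html with every attribute removed except those named in keep.

    Hypothetical helper for illustration; not a BeautifulSoup API.
    """
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all(True):
        # Rebuild the attribute dict with only the whitelisted names.
        tag.attrs = {name: value for name, value in tag.attrs.items()
                     if name in keep}
    return str(soup)

cleaned = strip_attributes('<p id="a">data</p>')
print(cleaned)
```

Passing keep=('class',) would retain class attributes while dropping everything else.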

Key points: