HTML Parsing

A clean HTML parsing library, meant to replace the more complex beautifulsoup4, pyquery, and lxml.

github: https://github.com/gaojiuli/htmlparsing

Installation

pip install htmlparsing

or

pip install git+https://github.com/gaojiuli/htmlparsing

Usage

import requests
from htmlparsing import Element
url = 'https://python.org'
r = requests.get(url)

Initialization

e = Element(text=r.text, base_url=url)

Getting the links on the page

e.links
"""{...'/users/membership/', '/events/python-events', '//docs.python.org/3/tutorial/controlflow.html#defining-functions'}"""

e.absolute_links
"""{...'https://python.org/download/alternatives', 'https://python.org/about/success/#software-development', 'https://python.org/download/other/', 'https://python.org/community/irc/'}"""
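
To illustrate how the relative entries in e.links could map onto the entries in e.absolute_links, here is a minimal sketch built on the standard library's urllib.parse.urljoin. It is not taken from htmlparsing's source: the helper name to_absolute is hypothetical, and the library's own resolution logic may differ.

```python
from urllib.parse import urljoin


def to_absolute(links, base_url):
    """Resolve every link against base_url (hypothetical helper, not htmlparsing's API)."""
    return {urljoin(base_url, link) for link in links}


base_url = 'https://python.org'
links = {
    '/users/membership/',
    '/events/python-events',
    # Scheme-relative link: urljoin fills in the scheme of base_url
    '//docs.python.org/3/tutorial/controlflow.html#defining-functions',
}

absolute = to_absolute(links, base_url)
for link in sorted(absolute):
    print(link)
```

Passing base_url at initialization is what makes this kind of resolution possible, since the HTML text alone does not say where the page came from.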

Selectors and attribute access

e.xpath('//a')[0].attrs
"""{'href': '#content', 'title': 'Skip to content'}"""
e.xpath('//a')[0].attrs.title
"""Skip to content"""
e.css('a')[0].attrs
"""{'href': '#content', 'title': 'Skip to content'}"""
e.parse('<a href="#content" title="Skip to content">{}</a>')
"""<Result ('Skip to content',) {}>"""
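
Two conveniences appear in the calls above: attributes readable as attrs.title, and a template with a {} placeholder that captures text. The sketch below shows one way each could work using only the standard library. It is an assumption for illustration, not htmlparsing's actual implementation; the names AttrDict and match_template are hypothetical.

```python
import re


class AttrDict(dict):
    """A dict whose keys can also be read as attributes (hypothetical, for illustration)."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


def match_template(template, text):
    """Return a tuple with the text captured by each '{}' placeholder, or None if no match."""
    # Escape the literal pieces and put a lazy capture group where each '{}' was
    literal_parts = [re.escape(part) for part in template.split('{}')]
    pattern = '(.+?)'.join(literal_parts)
    found = re.search(pattern, text)
    return found.groups() if found else None


attrs = AttrDict({'href': '#content', 'title': 'Skip to content'})
print(attrs.title)

html = '<p><a href="#content" title="Skip to content">Skip to content</a></p>'
captured = match_template('<a href="#content" title="Skip to content">{}</a>', html)
print(captured)
```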

Getting text content and the full HTML

e.xpath('//a')[5].text
"""PyPI"""
e.xpath('//a')[5].html
"""<a href="https://pypi.python.org/" title="Python Package Index">PyPI</a>"""
e.xpath('//a')[5].markdown
"""PyPI"""

Currently supported selectors: xpath, css, parse


How to parse HTML in Python with the htmlparsing library

sinazl #1

With all due respect, this looks to me like a cut-rate imitation of Kenneth's requests-html (https://github.com/kennethreitz/requests-html)...

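nodeper #2

For parsing HTML in Python I usually use BeautifulSoup; it is much more convenient than the standard library's html.parser. Install it first: pip install beautifulsoup4.

Here is an example. Suppose we want to grab every link from a web page: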

from bs4 import BeautifulSoup
import requests

# Fetch the page content
url = 'https://example.com'
response = requests.get(url)
html_content = response.text

# Create the BeautifulSoup object
soup = BeautifulSoup(html_content, 'html.parser')

# Find all <a> tags
links = soup.find_all('a')

# Extract the href attribute and the link text
for link in links:
    href = link.get('href')
    text = link.get_text(strip=True)
    if href:
        print(f"Link text: {text}, URL: {href}")

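To find elements with a specific class, pass class_ to find_all():

# Find divs whose class is "article"
articles = soup.find_all('div', class_='article')
for article in articles:
    heading = article.find('h2')
    # find() returns None when there is no <h2>, so check before reading its text
    if heading is not None:
        print(heading.get_text(strip=True))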

CSS selectors are more direct:

# Select the li elements directly under any ul with class "menu"
menu_items = soup.select('ul.menu > li')
for item in menu_items:
    print(item.get_text())

Parsing a local HTML file works the same way:

with open('page.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')

Summary: BeautifulSoup with either the html.parser or the lxml parser is all you need.
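
For comparison, here is a minimal sketch of the same link extraction using only the standard library's html.parser, with nothing extra to install. The class name LinkCollector and the sample HTML string are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)


html_content = (
    '<p><a href="/about">About</a> '
    '<a name="top">Top</a> '
    '<a href="https://example.com/">Home</a></p>'
)

collector = LinkCollector()
collector.feed(html_content)
print(collector.links)
```

The callback style takes more code than find_all('a'), which is the main reason to reach for BeautifulSoup on anything beyond a quick scan.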

nodeper #3

Right, all I did was make one that is not tied to requests.

ionicwang #4

The parsing powerhouse pyquery is still nicer to use.

zlyuanteng #5

With pyquery you get the links with just d("a"). Isn't xpath more of a hassle?

htzhanglong #6

With all due respect, if nobody had pointed it out, would you have mentioned that this "took reference" from Kenneth's project?

bupafengyu #7

Do you really have to press him this hard? Haha, nicely called out though.

caililin #8

I filed an issue with him first, about treating html and element as one and the same thing. He did not reply, so I implemented one myself.

wuwangju #9

Submitting it as a PR would have been too large a change.

sinazl #10

Actually, pyquery really is better.