How do I run XPath on an excerpted snippet in Python when lxml.etree cannot handle the incomplete HTML?


    from lxml import etree

    html = etree.parse(htmlStr, etree.HTMLParser())  # htmlStr comes from the complete HTML file; this step works
    result = html.xpath('//*[@div="info"]')
    tmpStr = ''
    for st in result:
        divSetion = etree.tostring(st, encoding="unicode", pretty_print=True, method="html")
        if (xxxxxxx) in divSetion:
            tmpStr = divSetion  # got the snippet successfully
        else:
            exit(0)

    # At this point tmpStr definitely has content; if the condition holds, I want to run XPath on this snippet
    # html = etree.parse(tmpStr, etree.HTMLParser())
    html = etree.parse(tmpStr)  # this step fails
    result = html.xpath('//*[@class="homeinfo"]')
    for st in result:  # test output to see whether anything matched
        print(st)

Excerpt of the error output from PyCharm:


Traceback (most recent call last):
  File "D:/Mycode/tedital.py", line 55, in <module>
    html = etree.parse(MatchDetailed)
  File "src\lxml\etree.pyx", line 3435, in lxml.etree.parse
  File "src\lxml\parser.pxi", line 1840, in lxml.etree._parseDocument
  File "src\lxml\parser.pxi", line 1866, in lxml.etree._parseDocumentFromURL
  File "src\lxml\parser.pxi", line 1770, in lxml.etree._parseDocFromFile
  File "src\lxml\parser.pxi", line 1163, in lxml.etree._BaseParser._parseDocFromFile
  File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 638, in lxml.etree._raiseParseError
OSError: Error reading file '

7 replies

The line html = etree.parse(MatchDetailed) does not appear anywhere in your code, so nobody can help you debug this.

That exit(0) left me completely baffled.


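Core of the problem: etree.parse() expects a file name, URL, or file-like object, so when it is handed the markup string itself lxml tries to open that string as a file, which is where the OSError comes from. Without an HTMLParser, lxml.etree also parses as strict XML and rejects markup that is not well-formed. To parse an HTML fragment held in a string, use fromstring or fragment_fromstring from lxml.html.

Solution: use the lxml.html module, which is designed for the incomplete, messy HTML found in the real world.

Code example: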

from lxml import html

# Your HTML fragment
html_fragment = """
<div class="post">
    <p>This is a paragraph</p>
    <a href="/link">link</a>
</div>
"""

# Method 1: html.fromstring - the most common choice; accepts a document or a fragment
tree = html.fromstring(html_fragment)
# XPath now works as usual
links = tree.xpath('//a[@href]')
for link in links:
    print(f"Found link: {link.get('href')} - text: {link.text_content()}")

# Method 2: html.fragment_fromstring - parses strictly as a fragment
# create_parent='div' wraps the content in a new <div>; without it, exactly one top-level element is required
fragment = html.fragment_fromstring(html_fragment, create_parent='div')
# XPath works here too
paragraphs = fragment.xpath('.//p')
for p in paragraphs:
    print(f"Paragraph content: {p.text_content()}")

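Key differences:

  • html.fromstring(): chooses the result based on the input. A full document comes back as the <html> root, a single-element fragment comes back as that element, and several sibling elements are wrapped in a <div> (or <span>). Suitable for most cases.
  • html.fragment_fromstring(): parses the input strictly as a fragment. It expects exactly one top-level element unless create_parent is given, in which case the content is wrapped in that parent tag.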
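
To make the difference concrete, here is a minimal sketch. The two sample strings are made up for illustration, and the exact wrapper tag that fromstring picks for sibling elements is left unchecked on purpose.

```python
from lxml import etree, html

# Illustrative inputs (not from the original question)
single = '<div class="post"><p>one</p></div>'
multiple = '<p>one</p><p>two</p>'

# A single-element fragment comes back as that element itself.
single_root = html.fromstring(single)
print(single_root.tag)

# Several sibling elements come back inside a synthetic wrapper element.
wrapped = html.fromstring(multiple)
print(wrapped.tag, len(wrapped))

# fragment_fromstring refuses more than one top-level element...
rejected = False
try:
    html.fragment_fromstring(multiple)
except etree.ParserError:
    rejected = True
print("rejected without create_parent:", rejected)

# ...unless it is asked to create a parent for them.
parent = html.fragment_fromstring(multiple, create_parent='div')
print(parent.tag, len(parent))
```

If the snippet is known to be a single element, either function should do; create_parent is mainly useful when the snippet may contain several siblings or leading text.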

One-line advice: when handling HTML fragments held in strings, use lxml.html instead of etree.parse().

Oops, I mixed up what I pasted with another test. The exit(0) is a habit left over from C++: if there is no point in continuing, why not just quit with exit(0)? 6666666

Corrected error output:

Traceback (most recent call last):
  File "D:/Mycode/tedital.py", line 55, in <module>
    html = etree.parse(tmpStr)
  File "src\lxml\etree.pyx", line 3435, in lxml.etree.parse
  File "src\lxml\parser.pxi", line 1840, in lxml.etree._parseDocument
  File "src\lxml\parser.pxi", line 1866, in lxml.etree._parseDocumentFromURL
  File "src\lxml\parser.pxi", line 1770, in lxml.etree._parseDocFromFile
  File "src\lxml\parser.pxi", line 1163, in lxml.etree._BaseParser._parseDocFromFile
  File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 638, in lxml.etree._raiseParseError
OSError: Error reading file '

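Looking at your code, why not locate the element in a single step? You should be able to express that logic in XPath. Otherwise just switch to regex.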
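
A rough sketch of what locating it in one step could look like. The page content, the keyword, and the assumption that the outer block is a div with class="info" are all placeholders, since the real page is not shown. Note that contains(., ...) tests the element's text content, not its serialized markup as the original loop does.

```python
from lxml import etree

# Hypothetical page; the structure and the keyword are assumptions for illustration
page = """
<html><body>
  <div class="info"><span class="homeinfo">Team A</span> keyword</div>
  <div class="info"><span class="homeinfo">Team B</span></div>
</body></html>
"""
root = etree.HTML(page)

# Option 1: keep the loop, but query each matched element directly with a
# relative XPath instead of serializing it and parsing it again.
loop_hits = []
for div in root.xpath('//div[@class="info"]'):
    if "keyword" in div.xpath('string()'):
        loop_hits.extend(div.xpath('.//*[@class="homeinfo"]'))

# Option 2: fold the text test into a single XPath expression.
one_step = root.xpath(
    '//div[@class="info"][contains(., "keyword")]//*[@class="homeinfo"]'
)

print([e.text for e in loop_hits])
print([e.text for e in one_step])
```

Either way the tostring/parse round trip goes away, because an element returned by xpath() can itself be queried with xpath().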

The parse() function is used to parse from files and file-like objects.

Don't you read the docs?
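
If you want to stay with parse(), one way that should work is to wrap the string in a file-like object. A minimal sketch, with a made-up snippet standing in for tmpStr:

```python
import io
from lxml import etree

# Stand-in for tmpStr; the real snippet is not shown in the question
snippet = '<div class="info"><p class="homeinfo">text</p></div>'

# parse() reads from files and file-like objects, so give it one.
# The HTMLParser keeps lxml from treating the input as strict XML.
tree = etree.parse(io.StringIO(snippet), etree.HTMLParser())

nodes = tree.xpath('//*[@class="homeinfo"]')
print([n.text for n in nodes])
```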

You probably need to use etree.HTML.
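
For reference, a small sketch of that approach. etree.HTML() takes the markup string directly and returns the root element; the snippet below is invented and deliberately leaves its tags unclosed.

```python
from lxml import etree

# Made-up snippet with unclosed tags, standing in for tmpStr
snippet = '<div class="info"><p class="homeinfo">text'

# etree.HTML parses a string (not a file name) with the HTML parser
# and returns the root <html> element of the repaired tree.
root = etree.HTML(snippet)
print(root.tag)

nodes = root.xpath('//*[@class="homeinfo"]')
print([n.text for n in nodes])
```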
