How do I use XPath in Python to extract the text inside title?

<div class="item-pic">
<a href="//2.taobao.com/item.htm?id=560088094729" target="_blank" title="[转卖] 创得 小米 5 手机壳小米 5s 保护套小米 6 防摔硅"></a>

</div>

Could someone tell me how to get the text in title with XPath?

import os
import requests
from lxml import etree

res = requests.get('https://s.2.taobao.com/list/list.htm?_input_charset=utf8&q=小米 6&st_edtime=1').content
txt = etree.HTML(res)
txt2 = txt.xpath('//div/a/text()')

for tt in txt2:
    print(tt)

os.system("pause")
This does not return the title text.


3 replies
import requests
from lxml import etree

# Fetch the page content
url = "https://example.com"
response = requests.get(url)
html_content = response.content

# Parse the HTML
parser = etree.HTMLParser()
tree = etree.fromstring(html_content, parser)

# Extract the text of the <title> tag with XPath
# Method 1: locate the title tag anywhere in the document
title_text = tree.xpath('//title/text()')[0]
print(f"Method 1 result: {title_text}")

# Method 2: a more precise path (spelling out the html and head levels)
title_text = tree.xpath('/html/head/title/text()')[0]
print(f"Method 2 result: {title_text}")

# Method 3: use string() to get all text inside the tag (including child tags)
title_text = tree.xpath('string(//title)')
print(f"Method 3 result: {title_text}")

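Key points:

  1. //title matches every title tag anywhere in the document
  2. /text() returns the tag's direct text nodes only
  3. If the title tag contains other tags, use string(//title) to get all of the text
  4. Node queries such as //title/text() return a list, so use [0] to take the first match; string(...) returns a single string instead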
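
To make points 2 to 4 concrete, here is a minimal sketch on a made-up snippet (the HTML string and variable names are assumptions for illustration, not taken from any real page). It uses an a element rather than title, since the behavior of text() and string() is the same for any element:

```python
from lxml import etree

# Hypothetical HTML, invented for this example.
html = '<div><a href="#">Hello <b>nested</b> world</a></div>'
tree = etree.HTML(html)

# text() returns only the direct text nodes of <a>, as a list.
direct_text = tree.xpath('//a/text()')

# string() concatenates all descendant text and returns one string.
full_text = tree.xpath('string(//a)')

print(direct_text)
print(full_text)
```

The text inside the nested b tag only shows up in the string() result, which is why [0] on the text() list would give just the first fragment.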

Handling common problems:

# Handle a title tag that may not exist
titles = tree.xpath('//title/text()')
if titles:
    print(titles[0])
else:
    print("No title tag found")

# Handle multiple title tags (uncommon, but possible)
all_titles = tree.xpath('//title/text()')
for title in all_titles:
    print(title)

One-line advice: //title/text() is usually enough; switch to string() when there are nested tags.
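
Note that in the snippet from the question, title is an attribute of the a tag rather than a tag of its own, so text() has nothing to return there. A minimal sketch of reading an attribute with @title, assuming a simplified copy of that markup (the title value below is a placeholder, not the real listing text):

```python
from lxml import etree

# Simplified version of the markup in the question; the title value is a
# placeholder chosen for this example.
html = '''
<div class="item-pic">
  <a href="//2.taobao.com/item.htm?id=560088094729" target="_blank" title="sample listing title"></a>
</div>
'''
tree = etree.HTML(html)

# text() looks for text nodes inside <a>; this <a> has none.
link_text = tree.xpath('//div/a/text()')

# @title selects the value of the title attribute instead.
titles = tree.xpath('//div/a/@title')

# Equivalent route: select the element, then read the attribute in Python.
first_link = tree.xpath('//div/a')[0]
title_via_get = first_link.get('title')

print(link_text, titles, title_via_get)
```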


import os
import requests
from lxml import etree

res = requests.get('https://s.2.taobao.com/list/list.htm?_input_charset=utf8&q=小米 6&st_edtime=1').content
txt = etree.HTML(res)
items = txt.xpath('//div[@class="ks-waterfall"]')
for item in items[1:]:
    print(item.attrib)
    a = item.xpath('./div/div[@class="item-info"]/div/a')
    for d in a:
        print(d.attrib)


import requests
from lxml import etree

res = requests.get('https://s.2.taobao.com/list/list.htm?_input_charset=utf8&q=ddr3 1866&st_edtime=1').content
txt = etree.HTML(res)

# Attribute tests in XPath need the @ prefix: [@class="..."]
txt4 = txt.xpath('//div/div[@class="item-info"]/div/a')

print(txt4[0].attrib.get('title'))
It works now.
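
Indexing with [0] raises IndexError when the XPath matches nothing, and .get('title') returns None when a link has no title attribute. A minimal sketch that guards against both, run on made-up markup that only mimics the structure the XPath above expects (the helper name extract_titles and the HTML are assumptions for illustration):

```python
from lxml import etree

# Hypothetical markup shaped like the path used above; not real page content.
html = '''
<div class="ks-waterfall">
  <div class="item-info"><div><a href="/item/1" title="first item">first</a></div></div>
</div>
<div class="ks-waterfall">
  <div class="item-info"><div><a href="/item/2">no title here</a></div></div>
</div>
'''

def extract_titles(page_source):
    """Return the title attribute of every matching link, skipping links without one."""
    tree = etree.HTML(page_source)
    links = tree.xpath('//div/div[@class="item-info"]/div/a')
    return [a.get('title') for a in links if a.get('title') is not None]

titles = extract_titles(html)
no_match = extract_titles('<html><body><p>nothing</p></body></html>')

print(titles)
print(no_match)
```

Returning a list and letting the caller check whether it is empty avoids the crash that txt4[0] would cause on a page with no matching links.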
