How to fix the error raised when Scrapy (Python) parses Chinese text with XPath

Problem description

links = sel.xpath('//i[contains(@title,"置顶")]/following-sibling::a/@href').extract()

Error: ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

sinazl #1

See the article: [Fixing the error when using Chinese in XPath with Scrapy][1]

## Solutions ##
Option 1: make the whole XPath expression a Unicode string

    links = sel.xpath(u'//i[contains(@title,"置顶")]/following-sibling::a/@href').extract()

Option 2: build the XPath expression from a title variable that is already a Unicode string

    title = u"置顶"
    links = sel.xpath('//i[contains(@title,"%s")]/following-sibling::a/@href' % (title)).extract()

Option 3: use XPath variable syntax (a $ sign followed by the variable name): write $title in the expression and pass title as a keyword argument

    links = sel.xpath('//i[contains(@title,$title)]/following-sibling::a/@href', title="置顶").extract()

[1]: http://www.revotu.com/solve-unicode-erros-using-xpath-in-scrapy.html

htzhanglong #2

When Scrapy raises an error while parsing Chinese text with XPath, the cause is usually an encoding problem. Here is the fix:

import scrapy
from scrapy.selector import Selector


class MySpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # Request the page; the response encoding is dealt with in parse()
        yield scrapy.Request(
            url='your target URL',  # placeholder: replace with a real URL
            callback=self.parse,
            meta={'dont_redirect': True}  # do not follow redirects, to avoid encoding mix-ups
        )

    def parse(self, response):
        # Option 1: query the response directly (Scrapy decodes the body into response.text)
        title = response.xpath('//title/text()').get()

        # Option 2: decode the body manually (more reliable)
        # Use the detected encoding, falling back to UTF-8
        encoding = response.encoding or 'utf-8'
        html_content = response.body.decode(encoding, errors='ignore')

        # Build a new selector from the decoded text
        selector = Selector(text=html_content)

        # Parse with that selector
        chinese_text = selector.xpath('//div[@class="content"]/text()').get()

        # Option 3: use a CSS selector (sometimes more stable)
        chinese_text_css = response.css('div.content::text').get()

        yield {
            'title': title,
            'content': chinese_text,
            'css_content': chinese_text_css
        }

If that still does not work, add the following to settings.py:

FEED_EXPORT_ENCODING = 'utf-8'  # encoding of exported feed files
DEFAULT_REQUEST_HEADERS = {
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
}

The main points: check the response encoding, decode manually, and keep CSS selectors as a fallback.
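
One caveat about manual decoding can be shown with the standard library alone. The sketch below assumes a page whose bytes are GBK but which gets decoded as UTF-8; `decode_body` is a hypothetical helper, not a Scrapy API. It suggests that `errors='ignore'` hides a wrong encoding by silently dropping bytes, while `errors='replace'` at least leaves visible markers.

```python
# "中文" encoded as GBK, standing in for the raw body of a GBK page.
raw_body = "中文".encode("gbk")


def decode_body(body, declared_encoding=None, errors="replace"):
    """Hypothetical helper: decode with the declared encoding, else UTF-8."""
    encoding = declared_encoding or "utf-8"
    return body.decode(encoding, errors=errors)


# Right encoding: the text survives intact.
correct = decode_body(raw_body, "gbk")
# Wrong encoding with errors="ignore": undecodable bytes vanish without a trace.
dropped = decode_body(raw_body, errors="ignore")
# Wrong encoding with errors="replace": undecodable bytes become U+FFFD markers.
marked = decode_body(raw_body, errors="replace")

print(repr(correct), repr(dropped), repr(marked))
```

If the decoded text comes back shorter than expected or full of U+FFFD characters, the encoding guess is the first thing to check.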

ionicwang #3

I usually just add the u prefix.

bupafengyu #4 (original poster)

nice

gougou168 #5

For a standalone crawler project, please use py3.
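
Why Python 3 sidesteps the problem can be sketched with the standard library: a Python 3 string literal is already Unicode text, whereas a plain Python 2 literal held encoded bytes, and non-ASCII bytes are what the error message objects to. The helper name `is_ascii_bytes` is made up for illustration.

```python
# The XPath expression from the question, written as a plain literal.
xpath_expr = '//i[contains(@title,"置顶")]/following-sibling::a/@href'


def is_ascii_bytes(data):
    """Hypothetical helper: True if every byte is in the ASCII range."""
    return all(b < 0x80 for b in data)


# Python 3: the literal is text (str), so no u prefix is needed.
print(type(xpath_expr).__name__)

# Roughly what a plain Python 2 literal contained in a UTF-8 source file:
# encoded bytes, some of them outside the ASCII range.
as_py2_bytes = xpath_expr.encode('utf-8')
print(is_ascii_bytes(as_py2_bytes))

# A pure-ASCII expression has no such bytes, which is why English-only
# XPath expressions never triggered the error.
print(is_ascii_bytes(b'//title/text()'))
```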