How to handle Scrapy errors in Python

2019-01-15 18:22:41 [zhipin][scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.zhipin.com/gongsi/b3f8f20dd9098f711nJ539W-FA~~.html> (failed 1 times): User timeout caused connection failure: Getting https://www.zhipin.com/gongsi/b3f8f20dd9098f711nJ539W-FA~~.html took longer than 180.0 seconds..



1 Reply

The key to handling Scrapy errors is reading the error message and locating where the problem comes from. Common errors fall into a few categories:

  1. Import/environment errors: check that `pip install scrapy` succeeded and that the correct virtual environment is activated.
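For the environment case, a quick stdlib-only check (no Scrapy assumptions) can confirm whether `scrapy` is importable from the interpreter you are actually running — a frequent cause of "module not found" is installing into a different environment than the one executing the spider:

```python
import importlib.util
import sys

# Show which interpreter is active; compare this with where pip installed scrapy.
print(sys.executable)

# find_spec returns None when the module cannot be imported from this environment.
spec = importlib.util.find_spec("scrapy")
if spec is None:
    print("scrapy is NOT importable in this environment")
else:
    print(f"scrapy found at {spec.origin}")
```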

  2. Spider/selector errors

```python
# A common Selector mistake:
# response.xpath('//div[@class="content"]') may return an empty list,
# so indexing the result directly can raise IndexError.
# Safer approach:
items = response.xpath('//div[@class="content"]')
if items:
    pass  # processing logic goes here
else:
    self.logger.warning('Target element not found')
```
  3. Request/response errors
```python
# settings.py: add/adjust middleware while debugging
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
```

```python
# Or catch exceptions in the spider callback
def parse(self, response):
    try:
        if response.status != 200:
            self.logger.error(f'Unexpected response status: {response.url}')
        # ... parsing logic ...
    except Exception as e:
        self.logger.error(f'Parse failed: {e}')
```
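One caveat about the status check above: by default, Scrapy's HttpError spider middleware filters out non-2xx responses before they ever reach `parse`, so the `!= 200` branch may never run. To have error responses delivered to your callback, allow them explicitly (a `settings.py` sketch; the status codes listed are only examples):

```python
# settings.py -- deliver these non-2xx responses to spider callbacks
# instead of letting HttpErrorMiddleware drop them (codes are examples)
HTTPERROR_ALLOWED_CODES = [403, 404, 500]
```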
  4. Item Pipeline errors
```python
from scrapy.exceptions import DropItem  # needed for raise DropItem below

class MyPipeline:
    def process_item(self, item, spider):
        try:
            # data-processing logic goes here
            return item
        except Exception as e:
            spider.logger.error(f'Pipeline failed: {e}')
            raise DropItem(f'Failed to process: {item}')
```
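For the pipeline to run at all, it must also be registered in `settings.py` (the module path `myproject.pipelines.MyPipeline` here is an assumption; adjust it to your project layout):

```python
# settings.py -- enable the pipeline; lower numbers (0-1000) run earlier
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,  # hypothetical module path
}
```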

Key debugging techniques

  • `scrapy shell <url>` to test selectors interactively
  • Tune the log level: `scrapy crawl spider -L DEBUG` (DEBUG is the most verbose; INFO trims the noise)
  • Check middleware ordering and the settings configuration
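For the specific error in the question — a download timing out after 180 seconds — the relevant knobs live in `settings.py`. The values below are illustrative starting points, not recommendations:

```python
# settings.py -- tuning for "User timeout caused connection failure"
DOWNLOAD_TIMEOUT = 60        # default is 180 seconds; fail faster, retry sooner
RETRY_ENABLED = True         # RetryMiddleware is on by default
RETRY_TIMES = 3              # retries per request, on top of the first attempt
CONCURRENT_REQUESTS = 8      # lower concurrency can ease server-side throttling
```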

One-line takeaway: read the traceback carefully — start from the last line and work upward until you hit the first frame that points at your own code.
