How to handle Scrapy errors in Python
2019-01-15 18:22:41 [zhipin][scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.zhipin.com/gongsi/b3f8f20dd9098f711nJ539W-FA~~.html> (failed 1 times): User timeout caused connection failure: Getting https://www.zhipin.com/gongsi/b3f8f20dd9098f711nJ539W-FA~~.html took longer than 180.0 seconds..
1 Reply
The key to handling Scrapy errors is reading the error message and locating the problem. Common errors fall into a few categories:

- Import/environment errors: check that `pip install scrapy` completed successfully and that the right virtual environment is activated.
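A quick way to confirm the environment, a minimal sketch using only the standard library (the helper name `module_available` is just for illustration):

```python
import importlib.util

def module_available(name):
    """Return True if `name` can be imported in the current environment."""
    return importlib.util.find_spec(name) is not None

# Run this with the same interpreter you use to launch scrapy;
# False here means the install went into a different environment.
print(module_available("scrapy"))
```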
- Spider/selector errors:

```python
# Common mistake: assuming an XPath always matches.
# response.xpath(...) may return an empty SelectorList, so check first:
items = response.xpath('//div[@class="content"]')
if items:
    # processing logic goes here
    ...
else:
    self.logger.warning('Target element not found')
```
- Request/response errors:

```python
# settings.py: adjust downloader middlewares (e.g. enable a proxy)
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
```

```python
# Or handle problems in the spider callback
def parse(self, response):
    if response.status != 200:
        self.logger.error(f'Bad response: {response.status} {response.url}')
        return
    try:
        # parsing logic goes here
        ...
    except Exception as e:
        self.logger.error(f'Parse failed: {e}')
```
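The log at the top of this thread is a download timeout: 180 seconds is Scrapy's default `DOWNLOAD_TIMEOUT`, which is exactly what the "took longer than 180.0 seconds" message reflects. A hedged settings.py sketch for a slow host (the values are illustrative, not recommendations):

```python
# settings.py (fragment) -- tune the values for your target site
DOWNLOAD_TIMEOUT = 60   # default is 180 s; fail faster on hung connections
RETRY_ENABLED = True    # on by default
RETRY_TIMES = 3         # extra attempts after the first failure (default is 2)
DOWNLOAD_DELAY = 1      # throttle requests to reduce server-side timeouts
```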
- Item pipeline errors:

```python
from scrapy.exceptions import DropItem

class MyPipeline:
    def process_item(self, item, spider):
        try:
            # data-processing logic goes here
            return item
        except Exception as e:
            spider.logger.error(f'Pipeline failed: {e}')
            raise DropItem(f'Failed to process: {item}')
```
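To see the drop-on-failure pattern run without a live crawl, here is a self-contained sketch. `PricePipeline` and the `price` field are hypothetical, and the local `DropItem` class is a stand-in for `scrapy.exceptions.DropItem`:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""

class PricePipeline:
    """Hypothetical pipeline: normalizes a 'price' field, drops items without one."""
    def process_item(self, item, spider=None):
        try:
            # float() / round() raise on missing or malformed values
            item["price"] = round(float(item["price"]), 2)
        except (KeyError, TypeError, ValueError) as e:
            raise DropItem(f"bad price in {item!r}: {e}")
        return item
```

In a real project you would raise the real `DropItem`, and Scrapy logs the dropped item for you.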
Key debugging techniques:

- Test selectors interactively with `scrapy shell <url>`
- Turn on verbose logging: `scrapy crawl spider -L DEBUG`
- Check middleware ordering and your settings configuration

One-line tip: read the error traceback carefully, from the last line upward, until you hit the first frame that points at your own code.
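The "read from the bottom up" advice in code form, using a deliberately failing helper (`parse_price` is hypothetical):

```python
import traceback

def parse_price(raw):
    # Deliberately fragile: float() raises ValueError on non-numeric input.
    return float(raw)

try:
    parse_price("N/A")
except ValueError:
    tb = traceback.format_exc()
    # The last line names the exception; the frames just above it
    # point at parse_price, the first piece of *your* code to check.
    print(tb)
```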

