Common Scrapy Problems and Solutions in Python

During this minute (the 1-minute interval between two log lines) it crawled one page and processed 0 items…

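Common Scrapy problems and solutions

Scrapy is a pleasure to use, but it has plenty of pitfalls. Here are some I have run into and how I got around them: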

1. Requests blocked by the site. This is the most common problem. Many sites block Scrapy's default User-Agent.

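# Configure in settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
DOWNLOAD_DELAY = 2  # seconds between requests to the same site, so we don't hit it too fast
CONCURRENT_REQUESTS = 16  # maximum number of concurrent requests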

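2. Pages rendered by JavaScript. Some pages load their data dynamically with JS, so plain Scrapy never sees it.

# Option 1: use Splash (recommended)
# Install: pip install scrapy-splash (a running Splash instance is also required)
# settings.py
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    # Listed in the scrapy-splash README alongside the two entries above
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'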

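# In the spider
import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = 'js_spider'  # placeholder; every spider needs a name

    def start_requests(self):
        yield SplashRequest(
            url='http://example.com',
            callback=self.parse,
            args={'wait': 2},  # seconds to wait for the JS to run
        )

    def parse(self, response):
        # response.text now holds the rendered HTML
        self.logger.info('Rendered %d characters', len(response.text))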

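# Option 2: call the API endpoint directly
# Many sites actually fetch their data from an API; look for the XHR requests in the browser's developer tools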
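
As a rough sketch of option 2: once the XHR request has been found in the developer tools, the callback usually only has to decode JSON. The payload shape used here (a top-level `data` list whose entries have `id` and `title`) and the helper name `items_from_api` are assumptions for illustration; a real endpoint will look different.

```python
import json


def items_from_api(body):
    """Turn a JSON response body into a list of item dicts.

    Assumes a hypothetical payload shaped like
    {"data": [{"id": 1, "title": "..."}, ...]}.
    """
    payload = json.loads(body)
    items = []
    for row in payload.get("data", []):
        items.append({"id": row.get("id"), "title": row.get("title")})
    return items


# Inside a spider callback this would be used as:
#     for item in items_from_api(response.text):
#         yield item
sample = '{"data": [{"id": 1, "title": "first"}, {"id": 2, "title": "second"}]}'
print(items_from_api(sample))
```

Keeping the decoding in a plain function makes it easy to try against a body copied from the network tab before wiring it into a spider.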

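3. Login and sessions. For pages that can only be reached after logging in:

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login_spider'  # placeholder name

    def start_requests(self):
        return [scrapy.FormRequest(
            'http://example.com/login',
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.after_login,
        )]

    def after_login(self, response):
        # Check whether the login succeeded
        if "logout" in response.text:
            # Logged in; the session cookie is sent automatically from here on
            yield scrapy.Request('http://example.com/protected',
                                 callback=self.parse_protected)
        else:
            self.logger.error('Login failed')

    def parse_protected(self, response):
        # Parse the protected page here
        self.logger.info('Fetched %s', response.url)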

4. Storing data

# Use an Item Pipeline to store the data
# pipelines.py
import pymongo

class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
    
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )
    
    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
    
    def close_spider(self, spider):
        self.client.close()
    
    def process_item(self, item, spider):
        self.db[spider.name].insert_one(dict(item))
        return item

# Enable the pipeline in settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MongoPipeline': 300,
}
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'scrapy_data'
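
If MongoDB is not at hand, the same pipeline shape (`open_spider`, `close_spider`, `process_item`) works with a local file. The following is only a sketch: the class name `JsonLinesPipeline` and the default path `items.jl` are made up for this example.

```python
import json


class JsonLinesPipeline:
    """Write each item as one JSON object per line (hypothetical example class)."""

    def __init__(self, path="items.jl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII text readable in the file
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item


# settings.py (module path is an assumption):
# ITEM_PIPELINES = {'myproject.pipelines.JsonLinesPipeline': 300}
```

For a plain dump to a file, Scrapy's built-in feed exports (`scrapy crawl <name> -o items.jl`) already do this; a custom pipeline is mainly worth it when items need extra processing on the way out.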

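5. Dealing with anti-scraping measures

# Use proxy IPs
# settings.py
# HttpProxyMiddleware is enabled by default, so this entry is optional.
# If DOWNLOADER_MIDDLEWARES is already defined, add to that dict instead of redefining it.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Set the proxy dynamically in the spider
import random

import scrapy

class ProxySpider(scrapy.Spider):
    name = 'proxy_spider'  # placeholder name
    start_urls = ['http://example.com']

    def start_requests(self):
        proxies = ['http://proxy1:port', 'http://proxy2:port']
        for url in self.start_urls:
            proxy = random.choice(proxies)
            yield scrapy.Request(url, meta={'proxy': proxy})

    def parse(self, response):
        self.logger.info('Fetched %s', response.url)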
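
Picking the proxy inside `start_requests` only covers the first requests. One way to cover every request is a small downloader middleware, sketched here. `RandomProxyMiddleware` and the `PROXY_LIST` setting are names invented for this example, not part of Scrapy; only `request.meta['proxy']` and `settings.getlist` are Scrapy's own.

```python
import random


class RandomProxyMiddleware:
    """Downloader middleware that assigns a random proxy to each request."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical custom setting, not a built-in one.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Leave the request alone if it already has a proxy or none is configured.
        if self.proxies and "proxy" not in request.meta:
            request.meta["proxy"] = random.choice(self.proxies)
        return None  # let the request continue down the middleware chain


# settings.py (names and module path are assumptions):
# PROXY_LIST = ['http://proxy1:port', 'http://proxy2:port']
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RandomProxyMiddleware': 350}
```

The priority 350 is chosen so that this runs before the built-in HttpProxyMiddleware (750 by default), which then picks up the `proxy` value from `request.meta`.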

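6. Debugging tips

# Use scrapy shell for quick tests
# Command line: scrapy shell "http://example.com"
# Then try selectors interactively, e.g. response.css('title::text').get()

# Debugging inside the code
from scrapy.utils.response import open_in_browser

def parse(self, response):
    # Look at the start of the response body
    print(response.text[:500])

    # Open the response, as Scrapy received it, in the local browser
    open_in_browser(response)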

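7. Handling common errors

# Handle 404 and other HTTP errors
from scrapy.spidermiddlewares.httperror import HttpError

# The errback only runs if it is attached to the request:
#     yield scrapy.Request(url, callback=self.parse, errback=self.errback)

def parse(self, response):
    # Normal parsing logic
    pass

def errback(self, failure):
    # Non-2xx responses arrive here wrapped in HttpError
    if failure.check(HttpError):
        response = failure.value.response
        self.logger.error('HTTP error %s on %s', response.status, response.url)
    else:
        # DNS failures, timeouts and similar
        self.logger.error(repr(failure))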

8. Distributed crawling

# Use scrapy-redis for distributed crawling
# Install: pip install scrapy-redis
# settings.py
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = 'redis://localhost:6379'

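# The spider inherits from RedisSpider
from scrapy_redis.spiders import RedisSpider

class MyDistributedSpider(RedisSpider):
    name = 'distributed_spider'
    redis_key = 'myspider:start_urls'  # Redis key the spider reads its start URLs from
    # Seed it with: redis-cli lpush myspider:start_urls http://example.com

    def parse(self, response):
        yield {'url': response.url}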

Summary: when something goes wrong, read the log first. Most error messages are clear enough.

bupafengyu #3

What do you mean by "the one-minute interval between two log lines"? Is it something like the log below?

2018-11-09 22:20:26 [scrapy.extensions.logstats] INFO: Crawled 99 pages (at 2 pages/min), scraped 0 items (at 0 items/min)
2018-11-09 22:20:28 [scrapy.core.engine] DEBUG:

caililin #4

#2 Set the log level to INFO and you will see it.
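
To make that concrete: with the log level at INFO the per-request DEBUG lines disappear and what remains are the periodic logstats lines. The first number in each pair is the running total, and the "at N .../min" part is the rate over the last interval. A minimal sketch follows; `LOG_LEVEL` and `LOGSTATS_INTERVAL` are real Scrapy settings, while `parse_logstats` is a helper made up here to pull the numbers out of such a line.

```python
import re

# settings.py
LOG_LEVEL = "INFO"         # hide per-request DEBUG lines
LOGSTATS_INTERVAL = 60.0   # seconds between logstats lines (Scrapy's default)

# Hypothetical helper, not part of Scrapy.
LOGSTATS_RE = re.compile(
    r"Crawled (\d+) pages \(at (\d+) pages/min\), "
    r"scraped (\d+) items \(at (\d+) items/min\)"
)


def parse_logstats(line):
    """Return the four counters from a logstats line, or None if it does not match."""
    match = LOGSTATS_RE.search(line)
    if match is None:
        return None
    pages, page_rate, items, item_rate = map(int, match.groups())
    return {"pages": pages, "page_rate": page_rate,
            "items": items, "item_rate": item_rate}


line = ("2018-11-09 22:20:26 [scrapy.extensions.logstats] INFO: "
        "Crawled 99 pages (at 2 pages/min), scraped 0 items (at 0 items/min)")
print(parse_logstats(line))
```

The same level can also be set for a single run with `scrapy crawl <name> -L INFO`.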

nodeper #5

OK, thank you.