Why do received_count and scraped_count differ so much in a Python Scrapy crawl?

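A question for everyone: my Scrapy report shows received_count 69941 and scraped_count 66392. What accounts for the difference of more than 3,000?
Is it just miscellaneous, unimportant data from the pages,

or is it real data that I wanted but failed to scrape? Scrapy didn't report any errors, though...

Thanks.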

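Also, when you write scraped data to MySQL from Scrapy, do you all use twisted?
Could it be that, because I'm not using twisted, some data was scraped but could not be inserted into the database in time?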

h691938207 Reply #1

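In Scrapy, a large gap between received_count and scraped_count usually means the crawler received far more responses than it produced items. The first figure (response_received_count in the stats) counts every response that came back from the downloader; the second (item_scraped_count) counts only the items that made it through all the pipelines.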

The main causes are:

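Failed requests: responses with an error status, such as anti-scraping blocks (429/403) or pages that no longer exist (404), are counted as received, but by default they are discarded before your parse callback runs, so they never produce an item. Pure network failures (timeouts, refused connections) return no response at all and are not counted as received either.
Parsing problems: the request succeeded, but in your parse callback either the selectors matched nothing (None or an empty list), or an exception was raised. Scrapy logs the exception and keeps crawling instead of stopping, so it is easy to miss, and no item gets yielded.
Filtered out: an item pipeline or custom filter rule dropped the item (for example as a duplicate or because it failed validation); these show up in item_dropped_count. Note that the request-level DupeFilter drops duplicate requests before they are sent, so those requests never reach received in the first place.
No item yielded by design: some pages (list pages, for example) only extract new links and yield new Requests without producing an item themselves, so they naturally add to received only.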
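To make the causes above concrete, here is a rough sketch that splits the gap using the stats dict Scrapy prints at the end of a run. The stat keys are Scrapy's own; the function name explain_gap and the sample numbers other than 69941 and 66392 are made up for illustration. It also assumes each page is supposed to yield exactly one item, otherwise the arithmetic is only approximate.

```python
def explain_gap(stats):
    """Split (responses received - items scraped) into likely causes.

    `stats` is a plain dict such as the one returned by
    crawler.stats.get_stats(). Assumes one item per page.
    """
    received = stats.get("response_received_count", 0)
    scraped = stats.get("item_scraped_count", 0)
    gap = received - scraped

    # Non-2xx responses discarded before the callback runs.
    http_errors = stats.get("httperror/response_ignored_count", 0)
    # Items rejected by a pipeline (DropItem).
    dropped = stats.get("item_dropped_count", 0)
    # Exceptions raised inside callbacks, one key per exception class.
    exceptions = sum(
        value for key, value in stats.items()
        if key.startswith("spider_exceptions/")
    )

    return {
        "gap": gap,
        "http_errors": http_errors,
        "dropped_items": dropped,
        "callback_exceptions": exceptions,
        # Whatever is left: list pages, empty selectors, and so on.
        "no_item_yielded": gap - http_errors - dropped - exceptions,
    }


# Hypothetical stats; only the first two numbers come from the question.
sample = {
    "response_received_count": 69941,
    "item_scraped_count": 66392,
    "httperror/response_ignored_count": 1200,
    "item_dropped_count": 300,
    "spider_exceptions/KeyError": 49,
}
result = explain_gap(sample)
print(result)
```

If no_item_yielded comes out roughly equal to the number of list pages you crawl, the gap is probably harmless.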

To pin it down, just add logging or look at the stats:

import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'

    def parse(self, response):
        # 1. Confirm that the request itself succeeded
        self.logger.info(f"Parsing: {response.url} - Status: {response.status}")

        # 2. Check whether the selector matched anything
        items = response.css('div.item')
        self.logger.info(f"Found {len(items)} candidate elements")

        if not items:
            # This may be the culprit: the page structure changed and nothing was extracted
            self.logger.warning(f"No data extracted from {response.url}")
            # Consider saving the empty page here for debugging
            # with open('debug_empty.html', 'w', encoding='utf-8') as f:
            #     f.write(response.text)
            return

        for item in items:
            # 3. Extract the fields and check for empty ones
            title = item.css('h2::text').get()
            if not title:
                # If a key field is empty, you may not want to yield this item
                continue

            yield {
                'title': title,
                'url': response.url
            }

    def closed(self, reason):
        # Print the final statistics when the spider finishes
        stats = self.crawler.stats.get_stats()
        failed = (stats.get('downloader/response_status_count/404', 0)
                  + stats.get('downloader/response_status_count/503', 0))
        self.logger.info(f"Received: {stats.get('response_received_count', 0)}")
        self.logger.info(f"Scraped: {stats.get('item_scraped_count', 0)}")
        self.logger.info(f"Failed requests: {failed}")

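After the crawl finishes, focus on these keys in the stats:

response_received_count: total responses received (the received figure); downloader/request_count is the number of requests sent
item_scraped_count: number of items scraped successfully (the scraped figure)
downloader/response_status_count/404 and similar: counts per response status code, which show how many responses were failures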
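If you want the status-code counts in one place instead of reading them key by key, something like the following sketch works on the same stats dict. The helper name status_breakdown and the sample numbers are made up; only the key prefix is Scrapy's.

```python
PREFIX = "downloader/response_status_count/"


def status_breakdown(stats):
    """Return (counts per status code, non-2xx total, non-2xx share)."""
    # Collect every per-status counter, keyed by the integer status code.
    counts = {
        int(key[len(PREFIX):]): value
        for key, value in stats.items()
        if key.startswith(PREFIX)
    }
    total = sum(counts.values())
    non_2xx = sum(v for code, v in counts.items() if not 200 <= code < 300)
    share = non_2xx / total if total else 0.0
    return counts, non_2xx, share


# Hypothetical stats for illustration only.
sample = {
    PREFIX + "200": 950,
    PREFIX + "404": 30,
    PREFIX + "503": 20,
    "item_scraped_count": 900,
}
counts, non_2xx, share = status_breakdown(sample)
print(counts, non_2xx, f"{share:.1%}")
```

In a real run you would pass in self.crawler.stats.get_stats() from the spider's closed method rather than a hand-written dict.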

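Summary: a gap like this is normal. Focus on the proportion of failed requests and empty parses, then adjust your request headers and parsing logic accordingly.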