How to deduplicate with a Bloom filter (bloom_filter) in Scrapy with Python

I want to deduplicate in a Scrapy downloader middleware with a BloomFilter. The idea is to load the content-page URLs already downloaded (stored in the database) into a BloomFilter() and check new requests against it. The problem I'm running into: as soon as the spider starts, the start_url I defined is already in the BloomFilter(), so the crawl cannot proceed. Code as follows:

```python
from bloom_filter import BloomFilter
from scrapy import signals
import pymysql
import pandas as pd
from scrapy.conf import settings  # deprecated in newer Scrapy; use crawler.settings
from scrapy.exceptions import IgnoreRequest

have_met = BloomFilter()  # instantiate the Bloom filter

class JianshuSpiderMiddleware(object):
    def __init__(self):
        self.conn = pymysql.connect(host=settings['MYSQL_HOST'],
                                    database=settings['MYSQL_DB'],
                                    user=settings['MYSQL_USER'],
                                    password=settings['MYSQL_PWD'],
                                    charset="utf8")
        self.cur = self.conn.cursor()
        self.counter = 0
        sql = 'SELECT url FROM jianshu_1;'
        df = pd.read_sql(sql, self.conn)
        for url in df['url'].get_values():
            # add the URLs crawled on the previous run
            have_met.add(url)

    def process_request(self, request, spider):
        if request.url in have_met:
            print('-------already seen---------')
            raise IgnoreRequest()
        else:
            return None
```

Could anyone shed some light on this? Many thanks.
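One common fix for the actual symptom in the question (start URLs from a previous run are already in the filter, so the spider never gets going) is to let specific requests opt out of the check, e.g. via a meta flag set on the start requests. Below is a minimal sketch of that control flow only; a plain set stands in for the Bloom filter, a tiny stub stands in for scrapy.Request, and the `skip_bloom` flag name is hypothetical:

```python
class FakeRequest:
    """Stand-in for scrapy.Request, just enough for this sketch."""
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta or {}

class IgnoreRequest(Exception):
    """Stand-in for scrapy.exceptions.IgnoreRequest."""

have_met = set()  # stand-in for the BloomFilter in the question
have_met.add("https://www.jianshu.com/")  # start URL seen on a previous run

def process_request(request, spider=None):
    # Start requests carry a hypothetical skip_bloom flag and bypass dedup.
    if request.meta.get("skip_bloom"):
        return None
    if request.url in have_met:
        raise IgnoreRequest()
    return None

# The start request passes even though its URL is in the filter:
process_request(FakeRequest("https://www.jianshu.com/", meta={"skip_bloom": True}))

# A content page already crawled is dropped:
try:
    process_request(FakeRequest("https://www.jianshu.com/"))
    dropped = False
except IgnoreRequest:
    dropped = True
print(dropped)  # True
```

In a real spider you would set the flag in `start_requests()` (e.g. `scrapy.Request(url, meta={"skip_bloom": True})`); alternatively, simply don't add the start URLs to the filter when loading from the database.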



3 replies

To use a Bloom filter for deduplication in Scrapy, you need to implement your own DupeFilter. By default Scrapy keeps seen fingerprints in an in-memory set, which doesn't scale once the URL count gets large. A Bloom filter uses far less memory, at the cost of a small false-positive rate (acceptable for a dedup scenario).
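To illustrate the principle, here is a minimal self-contained Bloom filter (not the pybloom-live implementation; hashing and sizing are deliberately simplified): k bit positions are derived per item, `add` sets them, and membership means all k bits are set, which is why false positives are possible but false negatives are not.

```python
import hashlib

class TinyBloom:
    """Minimal Bloom filter: k hash positions over an m-bit array."""

    def __init__(self, m_bits=8192, k=4):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)  # bit array, all zeros

    def _positions(self, item):
        # Derive k positions from one MD5 digest via double hashing.
        digest = hashlib.md5(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # Member only if every one of the k bits is set.
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))

bf = TinyBloom()
bf.add("https://example.com/a")
print("https://example.com/a" in bf)   # True
print("https://example.com/b" in bf)   # almost certainly False (false positives possible)
```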

First, install pybloom-live:

```shell
pip install pybloom-live
```

Then write a custom DupeFilter. Create dupefilter.py in your project:

```python
from scrapy.dupefilters import BaseDupeFilter
from pybloom_live import BloomFilter
import logging
import os

class BloomFilterDupeFilter(BaseDupeFilter):
    def __init__(self, path=None, capacity=1000000, error_rate=0.001):
        self.file_path = path
        self.capacity = capacity
        self.error_rate = error_rate
        self.filter = BloomFilter(capacity=capacity, error_rate=error_rate)
        self.logger = logging.getLogger(__name__)

        # Reload the filter from disk if a previous run saved one.
        if path and os.path.exists(path):
            with open(path, 'rb') as f:
                self.filter = BloomFilter.fromfile(f)
            self.logger.info(f"Loaded BloomFilter from {path}")

    @classmethod
    def from_settings(cls, settings):
        return cls(
            path=settings.get('BLOOM_FILTER_PATH'),
            capacity=settings.getint('BLOOM_FILTER_CAPACITY', 1000000),
            error_rate=settings.getfloat('BLOOM_FILTER_ERROR_RATE', 0.001)
        )

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.filter:
            return True
        self.filter.add(fp)
        return False

    def request_fingerprint(self, request):
        # Note: request_fingerprint is deprecated in Scrapy >= 2.7
        # (the REQUEST_FINGERPRINTER_CLASS machinery replaces it).
        from scrapy.utils.request import request_fingerprint
        return request_fingerprint(request)

    def close(self, reason):
        # Persist the filter so the next run skips already-seen URLs.
        if self.file_path:
            with open(self.file_path, 'wb') as f:
                self.filter.tofile(f)
            self.logger.info(f"Saved BloomFilter to {self.file_path}")
```

Then configure it in settings.py:

```python
DUPEFILTER_CLASS = 'your_project.dupefilter.BloomFilterDupeFilter'
BLOOM_FILTER_PATH = 'bloom_filter.blm'
BLOOM_FILTER_CAPACITY = 1000000
BLOOM_FILTER_ERROR_RATE = 0.001
```

That gives you Bloom-filter deduplication. Remember to set capacity according to the number of URLs you expect to crawl; a smaller error_rate reduces false positives, at the cost of more memory.
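To make that trade-off concrete, the standard sizing formulas give the bit-array size m and hash count k needed for n items at false-positive rate p: m = -n·ln(p)/(ln 2)² and k = (m/n)·ln 2. A quick check for the defaults above (n = 1,000,000, p = 0.001):

```python
import math

n = 1_000_000   # expected number of URLs (BLOOM_FILTER_CAPACITY)
p = 0.001       # target false-positive rate (BLOOM_FILTER_ERROR_RATE)

m = -n * math.log(p) / math.log(2) ** 2   # bits needed: ~14.4 million
k = m / n * math.log(2)                   # optimal number of hash functions: ~10

print(f"bits: {m:.0f} (~{m / 8 / 2**20:.2f} MiB), hashes: {round(k)}")
```

So a million URLs at 0.1% error costs under 2 MiB of filter, versus tens of megabytes for storing a million fingerprint strings in a set.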

In short: implement your own DupeFilter and store the request fingerprints in a pybloom-live BloomFilter.


First time posting a question and I don't know how to format code here... sorry, everyone.

I can't believe anyone will actually read code formatted like this.
