How to deduplicate with a Bloom filter in Scrapy (Python)
I want to deduplicate requests with a BloomFilter in a Scrapy downloader middleware. The idea: load the article-page URLs already stored in the database into a BloomFilter(), then check each request against it. The problem I'm hitting is that as soon as the spider starts, the start_url I defined is already in the BloomFilter(), so the crawl can't proceed. Code below:

    from bloom_filter import BloomFilter
    from scrapy import signals
    import pymysql
    import pandas as pd
    from scrapy.conf import settings
    from scrapy.exceptions import IgnoreRequest

    have_met = BloomFilter()  # instantiate the Bloom filter container

    class JianshuSpiderMiddleware(object):
        def __init__(self):
            self.conn = pymysql.connect(host=settings['MYSQL_HOST'],
                                        database=settings['MYSQL_DB'],
                                        user=settings['MYSQL_USER'],
                                        password=settings['MYSQL_PWD'],
                                        charset="utf8")
            self.cur = self.conn.cursor()
            self.counter = 0
            sql = 'SELECT url FROM jianshu_1;'
            df = pd.read_sql(sql, self.conn)
            for url in df['url'].get_values():
                # add the URLs crawled in previous runs, from the database
                have_met.add(url)

        def process_request(self, request, spider):
            if request.url in have_met:
                print('-------already seen---------')
                raise IgnoreRequest()
            else:
                return None
Could anyone shed some light on this? Many thanks.
To use a Bloom filter for deduplication in Scrapy, the clean approach is to implement your own DupeFilter rather than a downloader middleware. Scrapy's default dedup keeps fingerprints in an in-memory set, which doesn't scale to large crawls. A Bloom filter uses far less memory, at the cost of occasional false positives (a URL wrongly reported as already seen), which is usually acceptable for dedup.
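To make the trade-off concrete, here is a toy Bloom filter sketched in pure Python with stdlib hashing. It is for illustration only (the class and parameter names are mine, not from any library): k salted hashes set k bits per item, membership checks all k bits, so "no" is always correct while "yes" can rarely be a false positive.

```python
import hashlib

class TinyBloom:
    """Toy Bloom filter: k hash functions over an m-bit array (illustration only)."""

    def __init__(self, m_bits=8192, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # derive k bit positions from salted SHA-1 digests
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = TinyBloom()
bf.add("https://example.com/page/1")
print("https://example.com/page/1" in bf)  # True
print("https://example.com/page/2" in bf)  # almost certainly False
```

Note there is no way to delete an item: clearing bits could erase other entries, which is why production filters like pybloom-live only grow.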
First install pybloom-live:
pip install pybloom-live
Then write a custom DupeFilter. Create dupefilter.py in your project:
    import logging
    import os

    from pybloom_live import BloomFilter
    from scrapy.dupefilters import BaseDupeFilter
    from scrapy.utils.request import request_fingerprint


    class BloomFilterDupeFilter(BaseDupeFilter):
        def __init__(self, path=None, capacity=1000000, error_rate=0.001):
            self.file_path = path
            self.capacity = capacity
            self.error_rate = error_rate
            self.filter = BloomFilter(capacity=capacity, error_rate=error_rate)
            self.logger = logging.getLogger(__name__)
            # reload a previously persisted filter, if one exists
            if path and os.path.exists(path):
                with open(path, 'rb') as f:
                    self.filter = BloomFilter.fromfile(f)
                self.logger.info(f"Loaded BloomFilter from {path}")

        @classmethod
        def from_settings(cls, settings):
            return cls(
                path=settings.get('BLOOM_FILTER_PATH'),
                capacity=settings.getint('BLOOM_FILTER_CAPACITY', 1000000),
                error_rate=settings.getfloat('BLOOM_FILTER_ERROR_RATE', 0.001),
            )

        def request_seen(self, request):
            # fingerprint the request, then test-and-add against the filter
            fp = request_fingerprint(request)
            if fp in self.filter:
                return True
            self.filter.add(fp)
            return False

        def close(self, reason):
            # persist the filter so the next run skips already-seen URLs
            if self.file_path:
                with open(self.file_path, 'wb') as f:
                    self.filter.tofile(f)
                self.logger.info(f"Saved BloomFilter to {self.file_path}")
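The request_fingerprint step canonicalizes the request (method, URL, body) and hashes it, so that, for example, ?a=1&b=2 and ?b=2&a=1 count as the same request. Roughly, and purely as a stdlib illustration (this is not Scrapy's actual algorithm):

```python
import hashlib
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def rough_fingerprint(method, url, body=b""):
    """Sketch of a request fingerprint: hash of method + canonical URL + body.
    Illustration only, not Scrapy's real implementation."""
    parts = urlsplit(url)
    # sort query parameters so parameter order doesn't change the fingerprint
    query = urlencode(sorted(parse_qsl(parts.query)))
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
    h = hashlib.sha1()
    h.update(method.encode())
    h.update(canonical.encode())
    h.update(body)
    return h.hexdigest()

a = rough_fingerprint("GET", "http://example.com/p?a=1&b=2")
b = rough_fingerprint("GET", "http://example.com/p?b=2&a=1")
print(a == b)  # True: same request, different parameter order
```

This is also why fingerprints are stored in the filter instead of raw URLs: two textually different URLs can denote the same request.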
Configure it in settings.py:
DUPEFILTER_CLASS = 'your_project.dupefilter.BloomFilterDupeFilter'
BLOOM_FILTER_PATH = 'bloom_filter.blm'
BLOOM_FILTER_CAPACITY = 1000000
BLOOM_FILTER_ERROR_RATE = 0.001
That's it: Scrapy will now route every request through the Bloom filter. Size capacity to your expected number of URLs, and use a smaller error_rate to reduce false positives (at the cost of slightly more memory).
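If you want to estimate memory up front, the standard sizing formula m = -n·ln(p) / (ln 2)² gives the number of bits for n items at false-positive rate p, and k = (m/n)·ln 2 the optimal number of hash functions. A back-of-envelope helper (pybloom-live's exact on-disk layout differs slightly):

```python
import math

def bloom_bits(n_items, error_rate):
    """Bits needed for a Bloom filter holding n_items at the given error rate,
    per the standard formula m = -n*ln(p) / (ln 2)^2."""
    return math.ceil(-n_items * math.log(error_rate) / (math.log(2) ** 2))

n, p = 1_000_000, 0.001
bits = bloom_bits(n, p)
k = round((bits / n) * math.log(2))  # optimal number of hash functions
print(f"{bits / 8 / 1024 / 1024:.1f} MiB, k={k} hashes")  # about 1.7 MiB, k=10
```

So a million URLs at a 0.1% false-positive rate fit in under 2 MiB, versus tens of megabytes for a plain set of fingerprint strings.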
In short: implement a custom DupeFilter and store request fingerprints in a pybloom-live filter.

