How do I use middleware in Python's Scrapy framework?
from scrapy.http.headers import Headers
from Espider.tools.get_cookies import get_cookies
from Espider.tools.user_agents import user_agents
from fake_useragent import UserAgent
import pymongo
import random

class zhipincookiemiddleware:
    def __init__(self, mongodbHost, mongodbPort, mongodbName):
        self.mongodbHost = mongodbHost
        self.mongodbPort = mongodbPort
        self.mongodbName = mongodbName

    @classmethod
    def from_crawler(cls, crawler):
        # Pull the connection details from settings.py when Scrapy builds the middleware.
        return cls(mongodbHost=crawler.settings.get('MONGODB_HOST'),
                   mongodbPort=crawler.settings.get('MONGODB_PORT'),
                   mongodbName=crawler.settings.get('MONGODB_DBNAME'))

    def process_request(self, request, spider):
        # process_request runs once for every outgoing request, so a new cookie
        # is chosen each time (it also reconnects to MongoDB on every request,
        # which you may want to move into __init__).
        ua = UserAgent()
        self.client = pymongo.MongoClient(self.mongodbHost, self.mongodbPort)
        self.mongodb = self.client[self.mongodbName]
        self.collection = self.mongodb[spider.name + '_cookie']
        # Assumes the 'cookie' field stores a list of cookie strings;
        # random.choice over a plain string would return a single character.
        self.cookies_str = self.collection.find_one()['cookie']
        self.headers = {
            "User-Agent": ua.random,
            "cookie": random.choice(self.cookies_str)}
        request.headers = Headers(self.headers)
I wrote this cookie middleware in the framework, and it picks a random cookie. Will the cookie be re-randomized on every request?
Scrapy's middleware is a framework of hooks into request and response processing, used to globally modify Scrapy's input and output. It comes in two main kinds: downloader middleware and spider middleware.
Downloader middleware example (processing requests and responses):

# middlewares.py
class CustomDownloaderMiddleware:
    def process_request(self, request, spider):
        # Process the request: add a proxy, modify headers, etc.
        # (a proxy-rotation sketch follows this block)
        request.headers['User-Agent'] = 'Mozilla/5.0'
        return None  # continue processing

    def process_response(self, request, response, spider):
        # Process the response: rewrite content, retry, etc.
        if response.status == 403:
            new_request = request.copy()
            new_request.dont_filter = True  # bypass the duplicate filter
            return new_request  # reschedule the request
        return response
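Expanding on the proxy comment above, here is a minimal sketch of proxy rotation. PROXY_LIST is an assumed custom setting, not a built-in Scrapy option:

import random

class RandomProxyMiddleware:
    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is hypothetical, e.g. ['http://1.2.3.4:8080', ...]
        return cls(crawler.settings.getlist('PROXY_LIST'))

    def process_request(self, request, spider):
        if self.proxies:
            # request.meta['proxy'] sets the proxy for this single request.
            request.meta['proxy'] = random.choice(self.proxies)
        return None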
Spider middleware example (processing spider input and output):

class CustomSpiderMiddleware:
    def process_spider_input(self, response, spider):
        # Process responses going into the spider.
        return None

    def process_spider_output(self, response, result, spider):
        # Process whatever the spider yields.
        for item in result:
            yield item
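Note that result is everything the spider yields, both items and follow-up requests, so process_spider_output has to pass requests through as well, not just items.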
Enabling the middleware (settings.py):
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
}
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.CustomSpiderMiddleware': 543,
}
Typical use cases:
- Request handling: adding proxies, automatic retries, request deduplication
- Response handling: response validation, exception handling
- Data cleaning: normalizing extracted data (see the sketch after this list)
- Monitoring: collecting crawler runtime metrics
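A minimal sketch of the data-cleaning case, assuming items that carry a 'title' field:

from scrapy import Item

class StripWhitespaceMiddleware:
    def process_spider_output(self, response, result, spider):
        for obj in result:
            # Only touch items; pass requests through unchanged.
            if isinstance(obj, Item) and 'title' in obj:
                obj['title'] = obj['title'].strip()
            yield obj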
The numeric priority determines execution order (for process_request, smaller numbers run first; the order is reversed for process_response). The return value controls the flow: return None to continue, a Request object to reschedule, a Response object to pass a response on, or raise IgnoreRequest to drop the request.
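For instance, a sketch that raises IgnoreRequest to drop requests to unwanted hosts (the blocklist here is hypothetical):

from scrapy.exceptions import IgnoreRequest

class BlocklistMiddleware:
    BLOCKED_HOSTS = ('ads.example.com',)  # hypothetical blocklist

    def process_request(self, request, spider):
        if any(host in request.url for host in self.BLOCKED_HOSTS):
            # IgnoreRequest tells the engine to silently drop this request.
            raise IgnoreRequest(f'blocked: {request.url}')
        return None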
In short: middleware is Scrapy's plugin mechanism, and the process_* methods are how you hook into the crawl flow.
By the way, is your company still hiring?

