How do I fix this error when using a proxy with a Scrapy spider in Python?
2019-01-04 16:26:57 [csrc][scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.csrc.gov.cn/pub/newsite/xxpl/yxpl/index.html> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl3_get_record', 'wrong version number')]>]
2019-01-04 16:27:04 [csrc][scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.csrc.gov.cn/pub/newsite/xxpl/yxpl/index_1.html> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl3_get_record', 'wrong version number')]>]
2019-01-04 16:27:09 [csrc][scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.csrc.gov.cn/pub/newsite/xxpl/yxpl/index_2.html> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl3_get_record', 'wrong version number')]>]
2019-01-04 16:27:16 [csrc][scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.csrc.gov.cn/pub/newsite/xxpl/yxpl/index_3.html> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl3_get_record', 'wrong version number')]>]
2019-01-04 16:27:21 [csrc][scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.csrc.gov.cn/pub/newsite/xxpl/yxpl/index_4.html> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl3_get_record', 'wrong version number')]>]
I searched foreign sites but couldn't find the cause. Other sites work fine; only this one throws the error, and I don't know why.
The site is http://www.csrc.gov.cn/pub/newsite/xxpl/yxpl/index.html
When a Scrapy spider fails while using a proxy, there are a few common causes, each with its own fix. Below is a complete, runnable example, with the key points explained.
import scrapy
from scrapy.crawler import CrawlerProcess

class ProxySpider(scrapy.Spider):
    name = 'proxy_spider'
    start_urls = ['http://httpbin.org/ip']

    # Option 1: set the proxy directly on each request (most common)
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                callback=self.parse,
                meta={'proxy': 'http://your-proxy-ip:port'},  # replace with a real proxy
                errback=self.errback_handler
            )

    # Option 2: enable the built-in proxy middleware (more flexible)
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1,
        },
        'HTTPPROXY_ENABLED': True,
    }

    def parse(self, response):
        self.logger.info(f"Response body: {response.text}")

    def errback_handler(self, failure):
        self.logger.error(f"Proxy failed: {failure.value}")

# For the middleware approach, configure it like this:
class CustomProxyMiddleware:
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://your-proxy-ip:port'

# Run the spider
if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'DOWNLOADER_MIDDLEWARES': {
            '__main__.CustomProxyMiddleware': 100,  # use the custom middleware
        }
    })
    process.crawl(ProxySpider)
    process.start()
Common errors and fixes:
- Wrong proxy format: make sure the address is well-formed, i.e. http://ip:port for an HTTP proxy and https://ip:port for an HTTPS proxy
- Proxy authentication: if a username and password are required, use the http://user:pass@ip:port form
- Connection timeouts: add meta={'download_timeout': 10} to the Request to set a timeout
- Dead proxies: implement a proxy pool and rotate proxies dynamically in the middleware

Debugging tip: first test the proxy against http://httpbin.org/ip, which echoes back the IP address you are currently using.
Summary: checking the proxy format, the credentials, and the network connection is the key.
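The proxy-pool rotation mentioned above can be sketched as a small downloader middleware. This is a minimal sketch, not a production implementation; the proxy addresses below are placeholders, not real servers:

```python
import random

# Placeholder proxy addresses; substitute your own working proxies.
PROXY_POOL = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
]

class RotatingProxyMiddleware:
    def process_request(self, request, spider):
        # Pick a (possibly different) proxy for each outgoing request.
        request.meta['proxy'] = random.choice(PROXY_POOL)
```

Enable it through the DOWNLOADER_MIDDLEWARES setting, e.g. {'myproject.middlewares.RotatingProxyMiddleware': 100} (the module path is whatever your project uses). A real pool would also drop proxies that repeatedly fail, for example from the spider's errback.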
import base64

class proxy_middleware(object):
    def __init__(self):
        proxy_host = "w.t.16yn"
        proxy_port = "***"
        self.username = "*"
        self.password = ""
        self.proxies = {"http": "http://{}:{}/".format(proxy_host, proxy_port)}
        self.proxy_server = 'https://w5.t.16yun.cn:6469'
        self.proxy_authorization = 'Basic ' + base64.urlsafe_b64encode(
            bytes((self.username + ':' + self.password), 'ascii')).decode('utf8')

    def process_request(self, request, spider):
        request.meta['proxy'] = self.proxy_server
        request.headers['Proxy-Authorization'] = self.proxy_authorization
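As a quick offline sanity check, the Proxy-Authorization value built the same way as in the middleware above can be verified against a known username/password pair. The credentials here are illustrative only, not real ones:

```python
import base64

def basic_proxy_auth(username, password):
    # Build a Basic auth header value the same way the middleware does.
    token = base64.urlsafe_b64encode(
        (username + ':' + password).encode('ascii')).decode('utf8')
    return 'Basic ' + token

# Illustrative credentials only.
print(basic_proxy_auth('user', 'pass'))  # Basic dXNlcjpwYXNz
```

If the printed value does not match what you expect (for example if the username or password contains characters outside ASCII), the header itself may be the problem rather than the proxy address.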
I changed it like this and it still doesn't work:

self.proxy_server = 'https://w5.t.16yun.cn:6469'

changed to

self.proxy_server = 'http://w5.t.16yun.cn:6469'
https://github.com/scrapy/scrapy/issues/1855 take a look at this and see whether it matches your situation.
Why would that be the cause?
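A likely explanation, stated as an assumption rather than a confirmed diagnosis: an OpenSSL "ssl3_get_record / wrong version number" error means a TLS handshake was started but the peer answered in plaintext. With a proxy, that commonly happens when the proxy URL carries an https:// scheme (so the client tries to speak TLS to the proxy itself) while the proxy port only accepts plain HTTP. A hypothetical helper to flag that mismatch before running the spider:

```python
from urllib.parse import urlparse

def check_proxy_scheme(proxy_url):
    # Hypothetical helper: warn about proxy-URL schemes that can
    # trigger "wrong version number" errors. With https://, a TLS
    # handshake is attempted with the proxy itself; if that port only
    # speaks plain HTTP, OpenSSL sees a plaintext reply and fails.
    scheme = urlparse(proxy_url).scheme
    if scheme == 'https':
        return 'TLS will be attempted to the proxy; use http:// if the proxy port is plaintext'
    if scheme == 'http':
        return 'plaintext connection to the proxy'
    return 'unrecognized scheme: ' + scheme

print(check_proxy_scheme('https://w5.t.16yun.cn:6469'))
```

If switching the scheme to http:// still fails, the proxy itself may be rejecting the connection (wrong port, expired credentials, or a dead endpoint), which is worth ruling out with the httpbin.org/ip test above.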

