How to fix errors when using a proxy with a Scrapy crawler in Python?

2019-01-04 16:26:57 [csrc][scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.csrc.gov.cn/pub/newsite/xxpl/yxpl/index.html> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl3_get_record', 'wrong version number')]>]
2019-01-04 16:27:04 [csrc][scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.csrc.gov.cn/pub/newsite/xxpl/yxpl/index_1.html> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl3_get_record', 'wrong version number')]>]
2019-01-04 16:27:09 [csrc][scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.csrc.gov.cn/pub/newsite/xxpl/yxpl/index_2.html> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl3_get_record', 'wrong version number')]>]
2019-01-04 16:27:16 [csrc][scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.csrc.gov.cn/pub/newsite/xxpl/yxpl/index_3.html> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl3_get_record', 'wrong version number')]>]
2019-01-04 16:27:21 [csrc][scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.csrc.gov.cn/pub/newsite/xxpl/yxpl/index_4.html> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'ssl3_get_record', 'wrong version number')]>]


I searched foreign sites but couldn't find the cause. Other sites work fine; it's only this one that throws the error, and I have no idea why.

The site in question: http://www.csrc.gov.cn/pub/newsite/xxpl/yxpl/index.html

7 Replies

When a Scrapy crawler fails with a proxy error, there are a few common causes and corresponding fixes. Below is a complete, runnable example, followed by the key points.

import scrapy
from scrapy.crawler import CrawlerProcess

class ProxySpider(scrapy.Spider):
    name = 'proxy_spider'
    start_urls = ['http://httpbin.org/ip']

    # Method 1: set the proxy directly on each request (most common)
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                callback=self.parse,
                meta={'proxy': 'http://your-proxy-ip:port'},  # replace with your actual proxy
                errback=self.errback_handler
            )

    # Method 2: configure it via a downloader middleware (more flexible)
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1,
        },
        'HTTPPROXY_ENABLED': True,
    }

    def parse(self, response):
        self.logger.info(f"Response body: {response.text}")

    def errback_handler(self, failure):
        self.logger.error(f"Proxy failed: {failure.value}")

# For the middleware approach, a custom middleware looks like this:
class CustomProxyMiddleware:
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://your-proxy-ip:port'

# Run the spider
if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'DOWNLOADER_MIDDLEWARES': {
            '__main__.CustomProxyMiddleware': 100,  # use the custom middleware
        }
    })
    process.crawl(ProxySpider)
    process.start()

Common errors and how to fix them:

  1. Malformed proxy address: make sure the address is well formed; use http://ip:port for an HTTP proxy and https://ip:port for an HTTPS proxy
  2. Proxy authentication: if a username and password are required, use the format http://user:pass@ip:port
  3. Connection timeouts: add meta={'download_timeout': 10} to the Request to set a timeout
  4. Dead proxies: build a proxy pool and rotate it, switching proxies dynamically in a middleware (see the sketch after this list)
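
As a rough illustration of point 4, here is a minimal rotating-proxy middleware sketch. The proxy addresses and the module path in the comment are placeholders, not working endpoints, and a real pool would also drop proxies that keep failing.

import random

class RotatingProxyMiddleware:
    # Minimal sketch: pick a random proxy for every outgoing request.
    # The addresses below are placeholders, not real proxies.
    PROXIES = [
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080',
        'http://user:pass@proxy3.example.com:8080',  # authenticated proxy
    ]

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(self.PROXIES)

# Enable it in settings.py (the module path is hypothetical):
# DOWNLOADER_MIDDLEWARES = {
#     'myproject.middlewares.RotatingProxyMiddleware': 350,
# }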

Debugging tip: first test against http://httpbin.org/ip to confirm the proxy is actually taking effect; that endpoint returns the IP address you are currently using.
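
To rule Scrapy out entirely, you can also verify the proxy with a plain requests call first. A minimal sketch, assuming the requests library is installed and the placeholder address is replaced with your real proxy:

import requests

proxies = {
    'http': 'http://your-proxy-ip:port',   # placeholder: replace with the real proxy
    'https': 'http://your-proxy-ip:port',  # note the http:// scheme even for https targets
}

# httpbin echoes back the IP it sees; it should be the proxy's IP, not yours.
resp = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=10)
print(resp.json())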

In short: checking the proxy format, the authentication details, and the network connection is the key.

import base64


class proxy_middleware(object):

    def __init__(self):
        proxy_host = "w.t.16yn"   # host masked in the original post
        proxy_port = "***"
        self.username = "***"     # credentials masked in the original post
        self.password = "***"

        self.proxies = {"http": "http://{}:{}/".format(proxy_host, proxy_port)}
        self.proxy_server = 'https://w5.t.16yun.cn:6469'
        self.proxy_authorization = 'Basic ' + base64.urlsafe_b64encode(
            bytes((self.username + ':' + self.password), 'ascii')).decode('utf8')

    def process_request(self, request, spider):
        request.meta['proxy'] = self.proxy_server
        request.headers['Proxy-Authorization'] = self.proxy_authorization
I changed it to this and it still doesn't work.
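
One thing worth checking before blaming the proxy itself: the middleware above has to be registered in DOWNLOADER_MIDDLEWARES, otherwise Scrapy never calls it. A minimal sketch, assuming the class lives in a module named middlewares.py inside a project called myproject (both names are hypothetical):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.proxy_middleware': 543,
}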

Try changing

self.proxy_server = 'https://w5.t.16yun.cn:6469'

to

self.proxy_server = 'http://w5.t.16yun.cn:6469'
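
The "wrong version number" error usually means a TLS handshake was attempted against an endpoint that speaks plain HTTP, which is what happens when the proxy URL uses an https:// scheme but the proxy port only accepts plain-HTTP connections. You can reproduce the same OpenSSL error outside Scrapy by forcing a TLS handshake against a plain-HTTP port; a minimal sketch, with the host and port as placeholders:

import socket
import ssl

HOST, PORT = 'w5.t.16yun.cn', 6469  # placeholders: any plain-HTTP port behaves the same

ctx = ssl.create_default_context()
try:
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=HOST):
            pass
except (ssl.SSLError, OSError) as exc:
    print(exc)  # typically: [SSL: WRONG_VERSION_NUMBER] wrong version number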

Take a look at https://github.com/scrapy/scrapy/issues/1855 and see whether it matches your situation.

Why is that the cause?
