How do I fix Scrapy (Python) being redirected straight to a 404 page when using a proxy?

When I crawl Anjuke at http://beijing.anjuke.com/community/ with Scrapy through a proxy IP, I get redirected to http://beijing.anjuke.com/404/?from=antispam. How can I solve this?

phonegap100 (reply #1)

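When Scrapy gets a 404 as soon as a proxy is used, the cause is most likely the proxy IP itself or request headers that have not been set up properly. The from=antispam parameter in the redirect URL suggests the site's anti-crawler check flagged the request.

Start with the most direct debugging step: add this downloader middleware to your project so you can see what is actually being sent: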

# Add to middlewares.py
class ProxyDebugMiddleware:
    """Log the proxy, headers, and URL of every proxied request."""

    def process_request(self, request, spider):
        if 'proxy' in request.meta:
            spider.logger.info(f"Using proxy: {request.meta['proxy']}")
            spider.logger.info(f"Request headers: {dict(request.headers)}")
            spider.logger.info(f"Request URL: {request.url}")
        # Returning None lets Scrapy continue processing the request

# Enable in settings.py
DOWNLOADER_MIDDLEWARES = {
    # Lower number runs first, so the proxy is already set when we log
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
    'your_project.middlewares.ProxyDebugMiddleware': 543,
}

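Then check the following points in particular:

Is the proxy IP usable? Many free proxies are unreliable to begin with. This snippet gives a quick test: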

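import requests

proxy = "http://your_proxy:port"
try:
    # httpbin echoes the IP it sees, which should be the proxy's address
    resp = requests.get("http://httpbin.org/ip", proxies={"http": proxy, "https": proxy}, timeout=5)
    print(f"Proxy works, returned IP: {resp.json()['origin']}")
except (requests.RequestException, ValueError):
    # Connection errors, timeouts, or a non-JSON reply from the proxy
    print("Proxy is unusable or timed out")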

Request headers: some sites check the User-Agent. Set it in Scrapy like this:

# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
# Or set headers per request, e.g. scrapy.Request(url, headers=custom_headers)
custom_headers = {
    'User-Agent': 'Mozilla/5.0...',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
}

Proxy authentication: if the proxy requires a username and password, set it like this:

# In the spider, on the request before it is yielded
proxy_with_auth = "http://user:pass@proxy_ip:port"
request.meta['proxy'] = proxy_with_auth
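
A proxy URL with embedded credentials breaks if the username or password contains characters such as @ or :. The sketch below percent-encodes them before building the URL. The helper name build_proxy_url and the sample values are hypothetical, and it assumes your Scrapy version decodes the credentials again when it builds the Proxy-Authorization header, which is worth confirming with the debug middleware.

```python
from urllib.parse import quote


def build_proxy_url(user, password, host, port, scheme="http"):
    """Return a proxy URL whose credentials are safely percent-encoded."""
    # safe="" forces reserved characters such as '@', ':' and '/' to be encoded
    encoded_user = quote(user, safe="")
    encoded_password = quote(password, safe="")
    return f"{scheme}://{encoded_user}:{encoded_password}@{host}:{port}"


if __name__ == "__main__":
    # Placeholder credentials and address, for illustration only
    print(build_proxy_url("user", "p@ss:word", "10.0.0.1", 8080))
```

The result can be assigned to request.meta['proxy'] in the same way as the hand-written string.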

Timeout: proxies can be slow to respond, so increase the timeout a little:

# settings.py
DOWNLOAD_TIMEOUT = 30

If you still get the 404, request the same URL without the proxy to confirm whether the site is blocking the proxy IP itself.
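
To compare runs with and without the proxy, it helps to recognize the block page reliably. Here is a minimal sketch that detects the redirect target from the question by its /404/ path and from=antispam query parameter. The function name is_antispam_redirect is hypothetical, and the pattern is taken only from the single URL quoted above, so the site may use other variants.

```python
from urllib.parse import parse_qs, urlsplit


def is_antispam_redirect(url):
    """Return True if the URL looks like the anti-spam 404 page from the question."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    # Path must end in /404 (with or without a trailing slash)
    on_404_page = parts.path.rstrip("/").endswith("/404")
    # parse_qs maps each parameter to a list of its values
    flagged = "antispam" in query.get("from", [])
    return on_404_page and flagged


if __name__ == "__main__":
    print(is_antispam_redirect("http://beijing.anjuke.com/404/?from=antispam"))
    print(is_antispam_redirect("http://beijing.anjuke.com/community/"))
```

In a spider callback, one option is to call this on response.url and log response.meta.get('proxy') when it returns True, so that blocked proxies can be identified and dropped.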

Summary: test whether the proxy IP works first, then check the request header settings.

ionicwang (reply #2)

You need to use high-anonymity (elite) proxy IPs.
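
One rough way to check whether a proxy is high-anonymity is to send a request through it to a header-echo endpoint (for example http://httpbin.org/headers, which may itself add or hide some headers) and inspect what the server received. The sketch below classifies the echoed headers using the common three-level convention of transparent, anonymous, and elite. The function name classify_proxy and the header list are assumptions for illustration, not an exhaustive or official rule set.

```python
# Headers that proxies commonly add; an assumed, non-exhaustive list
PROXY_HEADERS = {"via", "x-forwarded-for", "x-real-ip", "forwarded", "proxy-connection"}


def classify_proxy(echoed_headers, real_ip):
    """Classify a proxy from the headers the target server saw.

    echoed_headers: dict of header name -> value as reported by an echo endpoint.
    real_ip: your own public IP address, obtained without the proxy.
    """
    lowered = {name.lower(): str(value) for name, value in echoed_headers.items()}
    # Transparent: your real IP leaks through some header
    if real_ip and any(real_ip in value for value in lowered.values()):
        return "transparent"
    # Anonymous: real IP hidden, but the request is visibly proxied
    if PROXY_HEADERS.intersection(lowered):
        return "anonymous"
    # Elite (high-anonymity): no proxy-revealing headers at all
    return "elite"


if __name__ == "__main__":
    # Made-up sample headers; 203.0.113.7 is a documentation-range address
    sample = {"Via": "1.1 squid", "User-Agent": "Mozilla/5.0"}
    print(classify_proxy(sample, "203.0.113.7"))
```

Only proxies that come back as "elite" match the high-anonymity requirement mentioned above; a site can still block such a proxy by its IP reputation alone.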