Python crawler for 58.com (58同城): the amount of scraped data dropped sharply, from over a thousand records to a few dozen. How can I fix this?

import time
import requests
from bs4 import BeautifulSoup

# Request headers used by both functions (placeholder User-Agent).
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}


def get_links():
    urls = []
    for page in range(1, 71):
        list_view = 'http://sz.58.com/tech/pn{}'.format(page)
        url_1 = 'http://sz.58.com/tech/'
        url_2 = 'x.shtml'
        wb_data = requests.get(list_view, headers=headers)
        soup = BeautifulSoup(wb_data.text, 'html.parser')
        for link in soup.select('div.job_name a'):
            # Rebuild the detail-page URL from the link's urlparams attribute.
            urls.append(url_1 + link.get('urlparams').split('=')[-1].strip('_q') + url_2)
    return urls


def get_info():
    urls = get_links()
    # Note: the try/except wraps the whole loop, so the first IndexError or
    # connection error silently ends the crawl for every remaining URL.
    try:
        for url in urls:
            wb_data = requests.get(url, headers=headers)
            soup = BeautifulSoup(wb_data.text, 'html.parser')
            time.sleep(2)
            data = {
                'job': soup.select('.pos_title')[0].text,
                'salary': soup.select('.pos_salary')[0].text,
                'condition': soup.select('.item_condition')[1].text,
                'experience': soup.select('.item_condition')[2].text
            }
            print(data)
    except IndexError:
        pass
    except requests.exceptions.ConnectionError:
        pass

get_info()


5 replies

It's pretty obviously 58 updating its anti-crawling measures. Also, I'd rather not read your code.


A sharp drop in the amount of data collected from 58.com almost always means you have triggered the site's anti-crawling mechanism. Straight to a solution:

import requests
import time
import random
from fake_useragent import UserAgent

class Crawler58:
    def __init__(self):
        self.ua = UserAgent()
        self.session = requests.Session()
        
    def get_headers(self):
        return {
            'User-Agent': self.ua.random,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Referer': 'https://bj.58.com/'
        }
    
    def crawl_page(self, url):
        try:
            # Random 1-3 second delay between requests
            time.sleep(random.uniform(1, 3))
            
            # Set up a proxy (if needed)
            # proxies = {'http': 'http://your_proxy:port'}
            
            response = self.session.get(
                url, 
                headers=self.get_headers(),
                timeout=10,
                # proxies=proxies
            )
            
            # Check the status code
            if response.status_code == 200:
                return response.text
            elif response.status_code == 403:
                print("Anti-crawling triggered; switch IP or increase the delay")
                return None
            else:
                print(f"Request failed, status code: {response.status_code}")
                return None

        except Exception as e:
            print(f"Request exception: {e}")
            return None

# Usage example
if __name__ == "__main__":
    crawler = Crawler58()
    
    # Crawl page by page
    base_url = "https://bj.58.com/ershoufang/pn{}/"
    
    for page in range(1, 10):
        url = base_url.format(page)
        html = crawler.crawl_page(url)
        
        if html:
            # Add your parsing logic here
            print(f"Page {page} fetched successfully, length: {len(html)}")

            # Add a longer pause every few pages
            if page % 3 == 0:
                time.sleep(random.uniform(5, 10))
        else:
            print(f"Failed to fetch page {page}; pausing for a while")
            time.sleep(10)

Key points:

  1. Random User-Agent: generated with the fake_useragent library
  2. Request spacing: a random 1-3 second delay per page, plus a longer 5-10 second delay every 3 pages
  3. Session reuse: keep cookies across requests with requests.Session()
  4. Referer header: mimic normal browsing behavior
  5. Exception handling: watch for 403 status codes and adjust the strategy promptly

If that still doesn't work, consider:

  • Adding rotating proxy IPs (paid proxies are more stable; see the sketch after this list)
  • Lowering the request rate
  • Checking whether the page structure has changed
  • Simulating a login to gain access to more data
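
For the proxy rotation option, a minimal sketch is shown below. It assumes you already have a list of proxy addresses from a provider; PROXY_POOL and the get_with_proxy helper are placeholders for illustration, not part of the code above.

import random
import requests

# Placeholder addresses; replace them with proxies from your own (paid) provider.
PROXY_POOL = [
    'http://111.111.111.111:8888',
    'http://222.222.222.222:8888',
]

def get_with_proxy(url, headers, retries=3):
    """Try the request through randomly chosen proxies, dropping ones that fail."""
    pool = PROXY_POOL[:]
    for _ in range(retries):
        if not pool:
            break
        proxy = random.choice(pool)
        try:
            resp = requests.get(url, headers=headers,
                                proxies={'http': proxy, 'https': proxy},
                                timeout=10)
            if resp.status_code == 200:
                return resp.text
        except requests.exceptions.RequestException:
            pool.remove(proxy)  # this proxy looks dead, stop picking it
    return None

Each call draws a fresh proxy, so consecutive requests no longer come from a single IP, which helps against IP-based rate limiting when combined with the random delays above.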

In short: the main problem is anti-crawling, and the key is to make the crawler look like a normal user.

+1 to the reply above.

There is some anti-crawling, but it's not strict. I have a crawler running right now and it's rock solid.

The real problem is that the phone-number information can no longer be collected.
