Python crawler for 58.com (58同城): the amount of scraped data dropped sharply, from over a thousand records to a few dozen. How can I fix this?

import time
import requests
from bs4 import BeautifulSoup

# Request headers used by both functions (placeholder User-Agent).
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}


def get_links():
    urls = []
    for page in range(1, 71):
        list_view = 'http://sz.58.com/tech/pn{}'.format(page)
        url_1 = 'http://sz.58.com/tech/'
        url_2 = 'x.shtml'
        wb_data = requests.get(list_view, headers=headers)
        soup = BeautifulSoup(wb_data.text, 'html.parser')
        for link in soup.select('div.job_name a'):
            # Rebuild the detail-page URL from the link's urlparams attribute.
            urls.append(url_1 + link.get('urlparams').split('=')[-1].strip('_q') + url_2)
    return urls


def get_info():
    urls = get_links()
    # Note: the try/except wraps the whole loop, so the first IndexError or
    # connection error silently ends the crawl for every remaining URL.
    try:
        for url in urls:
            wb_data = requests.get(url, headers=headers)
            soup = BeautifulSoup(wb_data.text, 'html.parser')
            time.sleep(2)
            data = {
                'job': soup.select('.pos_title')[0].text,
                'salary': soup.select('.pos_salary')[0].text,
                'condition': soup.select('.item_condition')[1].text,
                'experience': soup.select('.item_condition')[2].text
            }
            print(data)
    except IndexError:
        pass
    except requests.exceptions.ConnectionError:
        pass

get_info()


5 replies

It's pretty obviously 58 updating its anti-crawling measures. Also, I'd rather not read your code.


A sharp drop in the amount of data collected from 58.com almost always means you have triggered the site's anti-crawling mechanism. Straight to a solution:

import requests
import time
import random
from fake_useragent import UserAgent

class Crawler58:
    def __init__(self):
        self.ua = UserAgent()
        self.session = requests.Session()
        
    def get_headers(self):
        return {
            'User-Agent': self.ua.random,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Referer': 'https://bj.58.com/'
        }
    
    def crawl_page(self, url):
        try:
            # Random 1-3 second delay between requests
            time.sleep(random.uniform(1, 3))
            
            # Set up a proxy (if needed)
            # proxies = {'http': 'http://your_proxy:port'}
            
            response = self.session.get(
                url, 
                headers=self.get_headers(),
                timeout=10,
                # proxies=proxies
            )
            
            # Check the status code
            if response.status_code == 200:
                return response.text
            elif response.status_code == 403:
                print("Anti-crawling triggered; switch IP or increase the delay")
                return None
            else:
                print(f"Request failed, status code: {response.status_code}")
                return None

        except Exception as e:
            print(f"Request exception: {e}")
            return None

# Usage example
if __name__ == "__main__":
    crawler = Crawler58()
    
    # Crawl page by page
    base_url = "https://bj.58.com/ershoufang/pn{}/"
    
    for page in range(1, 10):
        url = base_url.format(page)
        html = crawler.crawl_page(url)
        
        if html:
            # Add your parsing logic here
            print(f"Page {page} fetched successfully, length: {len(html)}")

            # Add a longer pause every few pages
            if page % 3 == 0:
                time.sleep(random.uniform(5, 10))
        else:
            print(f"Failed to fetch page {page}; pausing for a while")
            time.sleep(10)

Key points:

  1. Random User-Agent: generated with the fake_useragent library
  2. Request spacing: a random 1-3 second delay per page, plus a longer 5-10 second delay every 3 pages
  3. Session reuse: keep cookies across requests with requests.Session()
  4. Referer header: mimic normal browsing behavior
  5. Exception handling: watch for 403 status codes and adjust the strategy promptly

If that still doesn't work, consider:

  • Adding rotating proxy IPs (paid proxies are more stable; see the sketch after this list)
  • Lowering the request rate
  • Checking whether the page structure has changed
  • Simulating a login to gain access to more data
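
For the proxy rotation option, a minimal sketch is shown below. It assumes you already have a list of proxy addresses from a provider; PROXY_POOL and the get_with_proxy helper are placeholders for illustration, not part of the code above.

import random
import requests

# Placeholder addresses; replace them with proxies from your own (paid) provider.
PROXY_POOL = [
    'http://111.111.111.111:8888',
    'http://222.222.222.222:8888',
]

def get_with_proxy(url, headers, retries=3):
    """Try the request through randomly chosen proxies, dropping ones that fail."""
    pool = PROXY_POOL[:]
    for _ in range(retries):
        if not pool:
            break
        proxy = random.choice(pool)
        try:
            resp = requests.get(url, headers=headers,
                                proxies={'http': proxy, 'https': proxy},
                                timeout=10)
            if resp.status_code == 200:
                return resp.text
        except requests.exceptions.RequestException:
            pool.remove(proxy)  # this proxy looks dead, stop picking it
    return None

Each call draws a fresh proxy, so consecutive requests no longer come from a single IP, which helps against IP-based rate limiting when combined with the random delays above.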

In short: the main problem is anti-crawling, and the key is to make the crawler look like a normal user.

+1 to the reply above.

There is some anti-crawling, but it's not strict. I have a crawler running right now and it's rock solid.

The real problem is that the phone-number information can no longer be collected.
