Python crawler help: how do I fix common problems and optimize my code?

How can I automatically crawl data on multiple companies from the 企查查 (Qichacha) website?

1 reply

Reply:

Hey man, scraping really is full of pitfalls. Here are code-level fixes and optimization ideas for the most common problems; just adapt them to your own code.

1. Anti-scraping: your User-Agent gets flagged. Full working code, rotating dynamically with fake-useragent:

from fake_useragent import UserAgent
import requests

ua = UserAgent()
headers = {'User-Agent': ua.random}
response = requests.get('https://example.com', headers=headers)
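
One thing to watch: ua.random is read once here, so every request made with that headers dict reuses the same UA string. To actually rotate, pull a fresh value per request. A minimal sketch (the URL list is a placeholder):

from fake_useragent import UserAgent
import requests

ua = UserAgent()
urls = ['https://example.com/page/1', 'https://example.com/page/2']  # placeholder URLs

for url in urls:
    headers = {'User-Agent': ua.random}  # fresh random UA for every request
    response = requests.get(url, headers=headers, timeout=5)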

2. IP bans: add a proxy pool

import requests

proxies = {
    'http': 'http://your-proxy-ip:port',
    'https': 'https://your-proxy-ip:port'
}
response = requests.get('https://example.com', proxies=proxies, timeout=5)
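
The snippet above hard-codes one proxy; an actual pool is just a list you rotate through, and random choice per request is the simplest policy. A minimal sketch, with placeholder addresses:

import random
import requests

# Placeholder addresses: swap in proxies you actually control
proxy_pool = [
    'http://proxy1-ip:port',
    'http://proxy2-ip:port',
    'http://proxy3-ip:port',
]

def get_with_proxy(url):
    proxy = random.choice(proxy_pool)  # pick a random proxy each time
    proxies = {'http': proxy, 'https': proxy}
    return requests.get(url, proxies=proxies, timeout=5)

A natural extension is to drop a proxy from the list whenever it raises requests.RequestException, so dead ones get weeded out.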

3. Parsing optimization: use lxml instead of regex

from lxml import etree
import requests

html = requests.get('https://example.com').text
tree = etree.HTML(html)
# XPath queries are far more robust than regex for pulling data out of HTML
titles = tree.xpath('//h2[@class="title"]/text()')

4. Async speedup (aiohttp example)

import aiohttp
import asyncio

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, f'https://example.com/page/{i}') for i in range(10)]
        results = await asyncio.gather(*tasks)
        return results

asyncio.run(main())
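
One caveat with gather(): it fires all ten requests at once, which is exactly the kind of burst that gets IPs banned. Capping concurrency with asyncio.Semaphore keeps the speedup while staying polite; a minimal variant of the fetch above:

import aiohttp
import asyncio

async def fetch_limited(session, semaphore, url):
    async with semaphore:  # blocks until a slot is free
        async with session.get(url) as response:
            return await response.text()

async def main():
    semaphore = asyncio.Semaphore(3)  # at most 3 requests in flight
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_limited(session, semaphore, f'https://example.com/page/{i}')
                 for i in range(10)]
        return await asyncio.gather(*tasks)

asyncio.run(main())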

5. Structured storage optimization

import pandas as pd
from sqlalchemy import create_engine

# Accumulate scraped records in a DataFrame, then write them in one batch
df = pd.DataFrame(data_list)  # data_list: your list of scraped dicts
engine = create_engine('sqlite:///data.db')
df.to_sql('table_name', engine, if_exists='append', index=False)  # about 10x faster than inserting row by row
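
If data_list runs long, to_sql also accepts a chunksize argument (e.g. chunksize=1000), so pandas writes in batches instead of building one enormous statement.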

Key takeaways:

  • Rotating UAs and proxies handle basic anti-bot measures
  • lxml is more robust than regex
  • Async I/O raises throughput
  • Batch your database writes

One-line advice: the core of a crawler is mimicking a real user. Don't request too often, and add delays where they belong.

(Note: in real deployments, remember to respect robots.txt and throttle your request rate; see the sketch below.)
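
Both points in that note are covered by the standard library plus a sleep. A minimal sketch with urllib.robotparser and a randomized delay (URLs are placeholders):

import random
import time
from urllib import robotparser

import requests

# Parse the site's robots.txt once up front (placeholder site)
rp = robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

url = 'https://example.com/some/page'
if rp.can_fetch('*', url):  # only fetch what robots.txt allows
    response = requests.get(url, timeout=5)
    time.sleep(random.uniform(1, 3))  # random pause to mimic human pacing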
