How do I fix this Python multiprocessing + coroutine crawler?
I want to combine multiprocessing, gevent, and requests to crawl job listings for various languages from Lagou. The idea is one process per language: if Java has 30 pages of listings, for example, the Java process uses gevent to spawn 30 coroutines, one per page. But when I run it, I find that once I start more processes, multiprocessing + coroutines is actually slower than plain multiprocessing, and the more processes I start, the more pages time out and fail. Plain multiprocessing, on the other hand, stays at around 34s and fetches essentially every page. Here are my test results:
1 process + coroutines: 12s
1 process: 34s
2 processes + coroutines: 26s
2 processes: 34s
3 processes + coroutines: 51s
3 processes: 34s
Below is my main code. Am I using processes and coroutines the wrong way, or is something else going on? Why do the coroutines actually slow the crawler down once more processes are running? This is my first question here, so please bear with me if anything is off. Thanks, everyone.
import csv
import json
import os
import random
import time
from multiprocessing import Process

from gevent import monkey
monkey.patch_all()  # patch blocking I/O so that requests yields to gevent

import gevent
import requests


class Spider:
    # __init__ (base_url, base_referer, page_list, user_agents) is not shown in the post

    def get_profession_jobs(self, professions):
        # one process per language
        process = []
        for profession in professions:
            p = Process(target=self.get_all_pages, args=(profession, self.page_list))
            p.start()
            process.append(p)
        for p in process:
            p.join()

    def get_all_pages(self, profession, pages):
        # inside each process, one coroutine per page
        print(os.getpid())
        jobs = [gevent.spawn(self.get_detail_page, profession, page) for page in pages]
        gevent.joinall(jobs)
        # for page in pages:
        #     self.get_detail_page(profession, page)

    def get_detail_page(self, profession, page):
        user_agent = self.user_agents[random.randint(0, 3)]
        header = {
            'Host': 'www.lagou.com',
            'Referer': self.base_referer + profession,
            'User-Agent': user_agent,
            'Origin': 'https://www.lagou.com',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'
        }
        data = {'pn': page, 'kd': profession}
        print(profession + ' page ' + str(page) + ' start')
        response = requests.post(self.base_url, headers=header, data=data, timeout=20)
        self.clear_data(response.content, profession)
        print(profession + ' page ' + str(page) + ' finished')

    def clear_data(self, page, profession):
        results = json.loads(page.decode('utf-8'))['content']['positionResult']['result']
        for result in results:
            job_name = result['positionName']
            job_class = profession
            publish_date = result['createTime']
            money = result['salary']
            experience = result['workYear']
            education = result['education']
            location = result['city']
            with open('jobs.csv', 'a', newline='') as jobs:
                writer = csv.writer(jobs)
                writer.writerow([job_name, job_class, publish_date, money,
                                 experience, education, location])


if __name__ == '__main__':
    start_time = time.time()
    professions = ['PHP', 'Python', 'Go', 'Java']
    spider = Spider()
    spider.get_profession_jobs(professions)
    end_time = time.time()
    print('All finished')
    print('Used ' + str(end_time - start_time) + ' seconds')
The core idea: use multiple processes for CPU-bound work (such as parsing) and coroutines for I/O-bound work (such as network requests). Create a process pool with multiprocessing, and inside each process run an asyncio event loop to manage the coroutines.
import asyncio
import aiohttp
from multiprocessing import Pool

async def fetch(session, url):
    """Single coroutine: fetch one page."""
    async with session.get(url) as response:
        return await response.text()

async def crawl_urls(urls):
    """Coroutine task inside one process: fetch a batch of URLs concurrently."""
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        # CPU-bound work such as parsing can go here
        return results

def process_urls(urls):
    """Function run by each process: create an event loop and run the coroutines."""
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    return loop.run_until_complete(crawl_urls(urls))

def main():
    # Example URL list (chunk it in a real application)
    all_urls = ['http://example.com/page1', 'http://example.com/page2', ...]
    # Split the URL list into N chunks (N = number of processes)
    chunk_size = len(all_urls) // 4
    url_chunks = [all_urls[i:i+chunk_size] for i in range(0, len(all_urls), chunk_size)]
    # Create a process pool; each process handles one chunk of URLs
    with Pool(processes=4) as pool:
        results = pool.map(process_urls, url_chunks)
    # Merge the results from all processes
    all_results = [item for sublist in results for item in sublist]
    print(f"Fetched {len(all_results)} pages in total")

if __name__ == '__main__':
    main()
Key points:
- Across processes, use multiprocessing.Pool for parallelism, which sidesteps the GIL.
- Inside each process, use asyncio + aiohttp for highly concurrent I/O.
- Each process owns its own event loop, which avoids sharing coroutines across processes.
Note: in practice you will also want error handling, rate limiting (e.g. asyncio.Semaphore), and a distributed task queue (e.g. Celery) if you need to scale further.
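For instance, a bounded-concurrency fetch with per-URL error handling could look roughly like the sketch below; the limit of 10, the 20-second timeout, and the fetch_limited / crawl_urls_limited names are illustrative assumptions, not part of the code above.

import asyncio
import aiohttp

# Sketch only: cap in-flight requests with asyncio.Semaphore and catch per-URL
# failures so one timeout does not kill the whole batch. Limit and timeout are
# assumed values; tune them to the target site.
async def fetch_limited(session, url, sem):
    async with sem:  # at most `limit` requests in flight at once
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=20)) as resp:
                return await resp.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            return None  # caller can log or retry the failed URL

async def crawl_urls_limited(urls, limit=10):
    sem = asyncio.Semaphore(limit)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_limited(session, url, sem) for url in urls]
        return await asyncio.gather(*tasks)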
In one sentence: split the CPU load across processes, and squeeze the network I/O with coroutines.
Just print out how long each page takes to download and you'll see what's going on...
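(A minimal sketch of that suggestion against the requests-based code in the question; timed_post is a hypothetical helper and the print format is just illustrative.)

import time
import requests

# Wrap each request with a timer so the per-page download time shows up in the
# output. Pass in the same url/headers/data that get_detail_page already builds.
def timed_post(url, headers, data):
    start = time.time()
    response = requests.post(url, headers=headers, data=data, timeout=20)
    print('%s page %s took %.2fs' % (data.get('kd'), data.get('pn'), time.time() - start))
    return response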
The problem is your network speed.


