When writing a crawler with aiohttp in Python, I hit a "Too many open files" error. How do I handle having too many tasks?
Full code
import time
import asyncio
import aiohttp
from bs4 import BeautifulSoup as bs

BASE_URL = "http://www.biqudu.com"

TITLE2URL = dict()
CONTENT = list()


async def fetch(url, callback=None, **kwarags):
    headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    sem = asyncio.Semaphore(5)
    with (await sem):
        async with aiohttp.ClientSession() as session:
            async with session.get(url, headers=headers) as res:
                page = await res.text()
                if callback:
                    callback(page, **kwarags)
                else:
                    return page


def parse_url(page):
    soup = bs(page, "lxml")
    dd_a_doc = soup.select("dd > a")
    for a_doc in dd_a_doc:
        article_page_url = a_doc['href']
        article_title = a_doc.get_text()
        if article_page_url:
            TITLE2URL[article_title] = article_page_url


def parse_body(page, **kwarags):
    title = kwarags.get('title', '')
    print("{}".format(title))
    soup = bs(page, "lxml")
    content_doc = soup.find("div", id="content")
    content_text = content_doc.get_text().replace('readx();', '').replace(' ', "\r\n")
    content = "%s\n%s\n\n" % (title, content_text)
    CONTENT.append(content)


def main():
    t0 = time.time()
    loop = asyncio.get_event_loop()
    loop.run_until_complete(fetch(BASE_URL + "/43_43074/", callback=parse_url))
    tasks = [fetch(BASE_URL + page_url, callback=parse_body, title=title)
             for title, page_url in TITLE2URL.items()]
    loop.run_until_complete(asyncio.gather(*tasks[:500]))
    loop.close()
    elapsed = time.time() - t0
    print("cost {}".format(elapsed))


if __name__ == "__main__":
    main()
Error message
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/aiohttp/connector.py", line 797, in _wrap_create_connection
    return (yield from self._loop.create_connection(*args, **kwargs))
  File "/usr/lib/python3.5/asyncio/base_events.py", line 695, in create_connection
    raise exceptions[0]
  File "/usr/lib/python3.5/asyncio/base_events.py", line 662, in create_connection
    sock = socket.socket(family=family, type=type, proto=proto)
  File "/usr/lib/python3.5/socket.py", line 134, in __init__
    _socket.socket.__init__(self, family, type, proto, fileno)
OSError: [Errno 24] Too many open files

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/xsy/Workspace/Self/aiotest/aiotest.py", line 58, in <module>
    main()
  File "/home/xsy/Workspace/Self/aiotest/aiotest.py", line 52, in main
    loop.run_until_complete(asyncio.gather(*tasks[:500]))
  File "/usr/lib/python3.5/asyncio/base_events.py", line 387, in run_until_complete
    return future.result()
  File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
    raise self._exception
  File "/usr/lib/python3.5/asyncio/tasks.py", line 239, in _step
    result = coro.send(None)
  File "/home/xsy/Workspace/Self/aiotest/aiotest.py", line 18, in fetch
    async with session.get(url, headers=headers) as res:
  File "/usr/local/lib/python3.5/dist-packages/aiohttp/client.py", line 690, in __aenter__
    self._resp = yield from self._coro
  File "/usr/local/lib/python3.5/dist-packages/aiohttp/client.py", line 267, in _request
    conn = yield from self._connector.connect(req)
  File "/usr/local/lib/python3.5/dist-packages/aiohttp/connector.py", line 402, in connect
    proto = yield from self._create_connection(req)
  File "/usr/local/lib/python3.5/dist-packages/aiohttp/connector.py", line 749, in _create_connection
    _, proto = yield from self._create_direct_connection(req)
  File "/usr/local/lib/python3.5/dist-packages/aiohttp/connector.py", line 860, in _create_direct_connection
    raise last_exc
  File "/usr/local/lib/python3.5/dist-packages/aiohttp/connector.py", line 832, in _create_direct_connection
    req=req, client_error=client_error)
  File "/usr/local/lib/python3.5/dist-packages/aiohttp/connector.py", line 804, in _wrap_create_connection
    raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host www.biqudu.com:443 ssl:True [Too many open files]
The workarounds I can think of so far:
- Raise the Linux limit on the maximum number of open files
- Slice the task list and run it in several batches (sketched below)
But both of these feel pretty clumsy. Is there a better way?
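By slicing I mean roughly the following, replacing the single gather call in main (batch_size is just an example value):

# Run the coroutines in fixed-size batches so that at most
# batch_size requests (and therefore sockets) are in flight at once.
batch_size = 100  # example value
for i in range(0, len(tasks), batch_size):
    loop.run_until_complete(asyncio.gather(*tasks[i:i + batch_size]))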
Consider having multiple requests share one client session; each ClientSession holds at least one connection of its own.
The "Too many open files" error usually means you are opening more simultaneous connections than the system allows. The key to fixing it is to cap the concurrency. Here is a complete example that uses asyncio.Semaphore to limit the number of concurrent requests:
import aiohttp
import asyncio
from asyncio import Semaphore


async def fetch(session, url, semaphore):
    async with semaphore:  # limit concurrency via the semaphore
        async with session.get(url) as response:
            return await response.text()


async def main(urls, max_concurrent=50):
    # Cap the number of concurrent connections, e.g. at 50
    semaphore = Semaphore(max_concurrent)
    connector = aiohttp.TCPConnector(limit=max_concurrent)  # also cap the connector
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch(session, url, semaphore) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return results


# Usage example
if __name__ == "__main__":
    urls = ["http://example.com"] * 1000  # suppose there is a large batch of URLs
    results = asyncio.run(main(urls, max_concurrent=50))
Key points:
- asyncio.Semaphore: ensures that no more than max_concurrent fetch coroutines run at the same time.
- TCPConnector(limit=...): caps aiohttp's own connection pool; keep it consistent with the semaphore.
- Tuning max_concurrent: adjust it according to your system's ulimit (check with ulimit -n) and what the target server can handle; 50-200 is a reasonable starting range.
You can also consider asyncio.as_completed() to process results as they finish, so you don't pile up too many pending task objects at once.
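A rough sketch of that variant, reusing the fetch above (main_as_completed is just an illustrative name, and the print is a placeholder for real processing):

async def main_as_completed(urls, max_concurrent=50):
    semaphore = Semaphore(max_concurrent)
    connector = aiohttp.TCPConnector(limit=max_concurrent)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch(session, url, semaphore) for url in urls]
        # Handle each page as soon as its request finishes instead of
        # waiting for the whole batch in gather().
        for future in asyncio.as_completed(tasks):
            page = await future
            print(len(page))  # placeholder for real processing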
In short: use a semaphore to control the concurrency.
OK, I'll give it a try.
Have a look at the Linux limit on open file handles?
If it's "Too many open files", shouldn't you just raise the maximum number of open files?
Got it, that does work, but it doesn't feel like an elegant fix...
Actually, these two lines of yours puzzle me:
sem = asyncio.Semaphore(5)
with (await sem):
Are you trying to limit concurrency with the Semaphore? But if every fetch creates its own independent Semaphore, what is actually limiting the concurrency?
That treats the symptom, not the cause... This crawl is a bit over 1000 links; if the next one is 10000 I'd have to change it again... I'll just go with reply #1's approach...
Also, as the point above says, why does every fetch create its own ClientSession?
In fact, either a Semaphore or a ClientSession alone can control the concurrency. A Semaphore limits how many tasks run at the same time, while a ClientSession can cap the maximum number of connections (you do have to pass the parameter). In either case you must share a single object.
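For instance, the ClientSession-only route would look roughly like this sketch (crawl and limit=10 are just illustrative):

import asyncio
import aiohttp


async def crawl(urls, limit=10):  # the limit value here is just an example
    # One shared session whose connector caps simultaneous connections;
    # no semaphore needed if you go this route.
    connector = aiohttp.TCPConnector(limit=limit)
    async with aiohttp.ClientSession(connector=connector) as session:
        async def get(url):
            async with session.get(url) as res:
                return await res.text()
        return await asyncio.gather(*(get(u) for u in urls))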
There is nothing inelegant about it. Every distribution ships with a different default open-files limit, and on cloud instances the provider has usually already raised it to 65535 or more.
Thanks everyone, I have a clear idea how to solve it now. Thank you all~~~
The comment above is right: the aiohttp docs say a single ClientSession per app is enough. You can pass the session into fetch as a parameter.
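Applied to the code in the question, that would look roughly like this sketch (only fetch and a crawl driver shown; parse_url, parse_body, BASE_URL and TITLE2URL are reused from the question, and the semaphore value is illustrative):

SEM = asyncio.Semaphore(10)  # shared by all fetches; the value is illustrative


async def fetch(session, url, callback=None, **kwarags):
    headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    async with SEM:
        async with session.get(url, headers=headers) as res:
            page = await res.text()
            if callback:
                callback(page, **kwarags)
            else:
                return page


async def crawl():
    # One ClientSession for the whole crawl, passed into every fetch.
    async with aiohttp.ClientSession() as session:
        await fetch(session, BASE_URL + "/43_43074/", callback=parse_url)
        tasks = [fetch(session, BASE_URL + page_url, callback=parse_body, title=title)
                 for title, page_url in TITLE2URL.items()]
        await asyncio.gather(*tasks)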
From the code you're only running 500 tasks, aren't you? How is it 1000+? *tasks[:500]))
Can I ask, did you ever come up with a good solution to this problem?
Reply #1.

