How long does it take to scrape 10,000 web pages in Python?

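How is anyone supposed to answer a question this open-ended...
Which site? What tools? How much bandwidth? How many IPs do you have? There are way too many constraints...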

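This question is way too broad, man. How long for 10,000 pages? That depends entirely on your machine, your network, the target site, and how you write the code.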

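Simply put, single-threaded synchronous requests are the slowest option. Assume each request (network latency plus parsing) averages 1 second, which is already optimistic: 10,000 pages then take 10,000 seconds, roughly 2.8 hours. In reality, many sites have anti-scraping measures and may respond slowly, so that time can easily multiply.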
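To make that back-of-the-envelope math reusable, here is a minimal sketch. The helper name estimate_seconds and its parameters are made up for illustration, and it assumes an idealized model where every request takes the same time and requests run in full "waves" of the given concurrency. Real numbers will be worse.

```python
import math


def estimate_seconds(num_pages: int, avg_latency_s: float, concurrency: int = 1) -> float:
    """Idealized wall-clock estimate (hypothetical helper, not a benchmark).

    Assumes every request takes avg_latency_s and that requests run in
    waves of `concurrency`, with no rate limiting, retries, or parsing cost.
    """
    if num_pages < 0 or avg_latency_s < 0 or concurrency < 1:
        raise ValueError("num_pages and avg_latency_s must be >= 0, concurrency >= 1")
    waves = math.ceil(num_pages / concurrency)  # number of sequential batches
    return waves * avg_latency_s


if __name__ == "__main__":
    sequential = estimate_seconds(10_000, 1.0)
    concurrent = estimate_seconds(10_000, 1.0, concurrency=50)
    print(f"one at a time: {sequential:.0f} s (~{sequential / 3600:.1f} h)")
    print(f"50 at a time:  {concurrent:.0f} s (~{concurrent / 60:.1f} min)")
```

Treat the result as a lower bound: bandwidth, the target site's throttling, and failed requests all push the real figure up.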

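Want it fast? You pretty much have to go async. With aiohttp and asyncio, plus a semaphore to cap the number of concurrent requests, you can get a speedup of tens of times. Here's sample code you can run as is; tweak it a bit and it's ready to use.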

import asyncio
import aiohttp
import time
from typing import List

async def fetch_page(session: aiohttp.ClientSession, url: str, semaphore: asyncio.Semaphore):
    """Fetch a single page."""
    async with semaphore:  # cap concurrency with a semaphore so you don't knock their server over
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as response:
                response.raise_for_status()  # count 4xx/5xx responses as failures
                html = await response.text()
                # process the html here, e.g. parse it with BeautifulSoup
                return html[:100]  # example: return the first 100 characters
        except Exception as e:
            print(f"Request to {url} failed: {e}")
            return None

async def main(urls: List[str], concurrency: int = 50):
    """Main coroutine: fetch all URLs with bounded concurrency."""
    semaphore = asyncio.Semaphore(concurrency)
    connector = aiohttp.TCPConnector(limit=0)  # no cap on the connection pool; the semaphore does the limiting
    timeout = aiohttp.ClientTimeout(total=30)  # session-wide default; a per-request timeout overrides it

    async with aiohttp.ClientSession(connector=connector, timeout=timeout) as session:
        tasks = [fetch_page(session, url, semaphore) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return results

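if __name__ == "__main__":
    # Example: build a list of test URLs (the same one repeated here; swap in your own list)
    base_url = "https://httpbin.org/delay/1"  # this test endpoint waits 1 second before responding
    urls = [base_url for _ in range(100)]  # try 100 first, then bump it to 10000 once that works

    start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    # run the async main coroutine
    results = asyncio.run(main(urls, concurrency=50))
    end = time.perf_counter()

    # only string results count as successes (failures come back as None or as exception objects)
    print(f"Fetched {len(urls)} pages, {sum(1 for r in results if isinstance(r, str))} succeeded")
    print(f"Total time: {end - start:.2f} s")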

Key points: