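An asynchronous crawler framework based on Python asyncio: how is it used and implemented?

aspider, a lightweight asynchronous crawler framework built on asyncio

Introduction

For a single page, you only need to implement the Item class defined by the framework: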

import asyncio
from aspider import AttrField, TextField, Item

class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

    async def clean_title(self, value):
        # Returns the extracted title unchanged
        return value

items = asyncio.get_event_loop().run_until_complete(HackerNewsItem.get_items(url="https://news.ycombinator.com/"))
for item in items:
    print(item.title, item.url)


Notorious ‘ Hijack Factory ’ Shunned from Web https://krebsonsecurity.com/2018/07/notorious-hijack-factory-shunned-from-web/
 ......

For multi-page sites, use Spider:

import aiofiles
from aspider import AttrField, TextField, Item, Spider

class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

    async def clean_title(self, value):
        # Returns the extracted title unchanged
        return value

class HackerNewsSpider(Spider):
    start_urls = ['https://news.ycombinator.com/', 'https://news.ycombinator.com/news?p=2']

    async def parse(self, res):
        items = await HackerNewsItem.get_items(html=res.html)
        for item in items:
            # Append each title to a local text file
            async with aiofiles.open('./hacker_news.txt', 'a') as f:
                await f.write(item.title + '\n')

if __name__ == '__main__':
    HackerNewsSpider.start()

[2018-07-11 17:50:12,430]-aspider-INFO  Spider started!
[2018-07-11 17:50:12,430]-Request-INFO  <GET: https://news.ycombinator.com/>
[2018-07-11 17:50:12,456]-Request-INFO  <GET: https://news.ycombinator.com/news?p=2>
[2018-07-11 17:50:14,785]-aspider-INFO  Time usage: 0:00:02.355062
[2018-07-11 17:50:14,785]-aspider-INFO  Spider finished!

JavaScript loading is supported as well:

import asyncio
from aspider import Request

request = Request("https://www.jianshu.com/", load_js=True)
response = asyncio.get_event_loop().run_until_complete(request.fetch())
print(response.body)

To load JavaScript inside an Item or a Spider, likewise just pass load_js=True.

Project on GitHub: aspider


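It supports JS loading, which looks pretty awesome.

For an asyncio-based asynchronous crawler framework, the core idea is to make network requests with aiohttp and combine them with asyncio's concurrency control. Below is an example that can be run directly (it requires aiohttp and beautifulsoup4):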

import asyncio
import aiohttp
from bs4 import BeautifulSoup

class AsyncCrawler:
    def __init__(self, concurrency=10):
        self.semaphore = asyncio.Semaphore(concurrency)
    
    async def fetch(self, session, url):
        async with self.semaphore:  # limit the number of concurrent requests
            async with session.get(url) as response:
                return await response.text()
    
    async def parse(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        # Put your own parsing logic here
        titles = soup.find_all('h1')
        return [title.get_text() for title in titles]
    
    async def crawl(self, url):
        async with aiohttp.ClientSession() as session:
            html = await self.fetch(session, url)
            data = await self.parse(html)
            return data
    
    async def crawl_many(self, urls):
        tasks = [self.crawl(url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return results

# Usage example
async def main():
    crawler = AsyncCrawler(concurrency=5)
    urls = [
        'https://httpbin.org/html',
        'https://httpbin.org/html',
        'https://httpbin.org/html'
    ]
    results = await crawler.crawl_many(urls)
    for url, result in zip(urls, results):
        if isinstance(result, Exception):
            print(f"{url} failed: {result}")
        else:
            print(f"{url}: {result}")

if __name__ == "__main__":
    asyncio.run(main())

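Key implementation points:

Use aiohttp.ClientSession to manage the HTTP session
Use asyncio.Semaphore to cap the number of concurrent connections and avoid getting blocked
Use asyncio.gather to run multiple crawl tasks concurrently
Handle exceptions so that one failed task does not affect the rest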
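The Semaphore and gather points can be checked without any network access. The sketch below is only an illustration: `run_limited` and `FakeSite` are hypothetical names that do not come from aspider or aiohttp, and `FakeSite` simply simulates a server while recording how many requests were in flight at once.

```python
import asyncio


async def run_limited(coros, limit):
    """Run coroutines concurrently, at most `limit` at a time.

    Exceptions are returned in place of results instead of being raised.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros), return_exceptions=True)


class FakeSite:
    """Hypothetical stand-in for a web server; records peak concurrency."""

    def __init__(self):
        self.active = 0
        self.peak = 0

    async def get(self, url):
        self.active += 1
        self.peak = max(self.peak, self.active)
        try:
            await asyncio.sleep(0.01)  # simulate network latency
            if "bad" in url:
                raise ValueError(f"cannot fetch {url}")
            return f"<h1>{url}</h1>"
        finally:
            self.active -= 1


async def demo():
    site = FakeSite()
    urls = [f"page-{i}" for i in range(6)] + ["bad-page"]
    results = await run_limited([site.get(u) for u in urls], limit=2)
    return site.peak, results


if __name__ == "__main__":
    peak, results = asyncio.run(demo())
    print("peak concurrency:", peak)
    for result in results:
        print(repr(result))
```

With `limit=2`, no more than two simulated requests overlap, and the failing URL shows up as an exception object in the result list while the other pages still return normally.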

Suggestion: use a producer-consumer pattern for large-scale crawling.
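As a rough sketch of that producer-consumer idea, assuming an `asyncio.Queue` as the shared work list: a fixed pool of workers consumes URLs from the queue, and each fetched page may produce new URLs. The names `crawl_with_workers`, `fake_fetch` and `FAKE_SITE` are hypothetical, and the in-memory site stands in for real HTTP requests so the example runs offline.

```python
import asyncio


async def crawl_with_workers(start_urls, fetch, num_workers=3, max_pages=100):
    """Producer-consumer crawl: workers take URLs from a queue and may enqueue new ones.

    `fetch` is an assumed coroutine function: fetch(url) -> (data, list_of_new_urls).
    """
    queue = asyncio.Queue()
    seen = set()
    results = {}

    def enqueue(url):
        # Skip URLs already queued and stop growing past max_pages
        if url not in seen and len(seen) < max_pages:
            seen.add(url)
            queue.put_nowait(url)

    async def worker():
        while True:
            url = await queue.get()
            try:
                data, new_urls = await fetch(url)
                results[url] = data
                for new_url in new_urls:
                    enqueue(new_url)
            except Exception as exc:  # keep the worker alive on a bad page
                results[url] = exc
            finally:
                queue.task_done()

    for url in start_urls:
        enqueue(url)

    workers = [asyncio.create_task(worker()) for _ in range(num_workers)]
    await queue.join()  # wait until every queued URL has been processed
    for task in workers:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


# Hypothetical in-memory "site" so the sketch runs without network access.
FAKE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": [],
    "/c": ["/"],
}


async def fake_fetch(url):
    await asyncio.sleep(0)
    if url not in FAKE_SITE:
        raise KeyError(url)
    return f"content of {url}", FAKE_SITE[url]


if __name__ == "__main__":
    pages = asyncio.run(crawl_with_workers(["/"], fake_fetch))
    for url in sorted(pages):
        print(url, "->", pages[url])
```

The queue decouples discovering URLs from fetching them, so the number of workers bounds concurrency even when the list of URLs is not known up front.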

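Bookmarked, ha.

htzhanglong #4

Starred.

Haha, thanks.

Thanks~

Starred. I hope you can explain the design thinking behind a framework like this; I'd like to learn how to write one.

If you're interested, have a look at the source code and join the development.

Does the OP have a discussion group? ☺

For questions, email me or open an issue.

I've been learning Python lately. I'll take a look when I have time; starred for now.

Let's learn together.

yuanlaile #13

Reporting a small bug:

In the Spider section of the docs, res has no html attribute; it should be body. The example is correct.

h691938207 #14

Already fixed, thanks.

h691938207 #15

It supports JS, which is awesome.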

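zlyuanteng #16

If you're interested, take a look at my project: https://github.com/kkyon/botflow
It wraps the asyncio details.
Botflow is a fast, data-driven Python programming framework for data pipeline work (web crawlers, machine learning, quantitative trading, etc.): http://docs.botflow.org/

OK.

OK, expert, I'll take a look.

htzhanglong #19

Why does the code formatting look like a mess on my end? Joking aside, I still haven't found a suitable job. When are you planning to leave that company?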

vueper #20 (author)

It raises an error on Windows:
Traceback (most recent call last):
  File "weibospider.py", line 26, in <module>
    HackerNewsSpider.start()
  File "C:\Users\hwywhywl\StudioProjects\weibo_splider\lib\site-packages\aspider\spider.py", line 92, in start
    spider_ins.loop.add_signal_handler(_signal, lambda: asyncio.ensure_future(spider_ins.stop(_signal)))
  File "C:\Users\hwywhywl\Anaconda3\lib\asyncio\events.py", line 499, in add_signal_handler
    raise NotImplementedError
NotImplementedError

The event loop's add_signal_handler is not supported on Windows; please add a check for it.
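One possible shape for such a check, shown only as a sketch and not as the fix that aspider actually shipped: catch the NotImplementedError seen in the traceback and skip signal registration on loops that lack it. `install_stop_handlers` and its parameters are hypothetical names.

```python
import signal


def install_stop_handlers(loop, on_stop):
    """Register on_stop for SIGINT/SIGTERM where the event loop supports it.

    Returns the list of signals that were actually registered.
    """
    installed = []
    for sig in (signal.SIGINT, signal.SIGTERM):
        try:
            loop.add_signal_handler(sig, on_stop, sig)
        except NotImplementedError:
            # Raised by event loops without signal support, e.g. on Windows
            continue
        installed.append(sig)
    return installed
```

On a loop without signal support the function returns an empty list and the program keeps running; stopping it then relies on the default KeyboardInterrupt behavior rather than on a registered handler.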

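vueper #21 (author)

Got it, thanks. I'll fix it.

htzhanglong #22

If possible, please open an issue.

zlyuanteng #23

Done.

bupafengyu #24

Fixed.

htzhanglong #25

I'm back again. Replying first; I'll track it down afterwards.

This may be a bug:
If start_urls contains a link that does not match the rules, all the links after it fail.
For example:
http://www.example.com/article/123.html
http://www.example.com/article/123.html
http://www.example.com/article/123.html
http://www.example.com/article/123.html

Emmm... I posted before I finished typing.

For example:
http://<domain>/article/123.html
http://<domain>/article/124.html
http://<domain>/
http://<domain>/article/126.html
http://<domain>/article/127.html

With URLs like these, the third one is not fetched, and that causes the last two to fail.
I'll investigate further.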

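bupafengyu #27

Hello, thanks for reporting the bug, but I don't quite understand what you mean. Could you tidy it up and open an issue with the specific code?

phonegap100 #28

Is pyppeteer a hassle to install? Is a VPN required to get past the firewall? This is on Windows 10.

zlyuanteng #29

It installed fine on my side, or you can install it manually. You can also keep an eye on the splash plugin I'm writing for aspider, which makes loading JS easy as well: https://github.com/aspider-plugins/aspider-splash

songsunli #30

I downloaded it through the http://npm.taobao.org/mirrors mirror. OK, your framework is nice; I'm trying it out.

bupafengyu #31

OK, feedback is welcome.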

request = Request("http://www.lutec.com/", load_js=True) times out right away and then never exits. Is there some parameter that needs to be set?
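The thread does not say which aspider option controls this, so the following is only a generic asyncio sketch, not an aspider setting: `asyncio.wait_for` puts a deadline on any awaitable so that a hung call is cancelled instead of blocking forever. `fetch_with_deadline` and `never_finishes` are hypothetical names, and whether cancellation also cleans up a headless browser started for JS loading is not something this sketch can confirm.

```python
import asyncio


async def fetch_with_deadline(coro, seconds):
    """Await `coro`, but give up (returning None) after `seconds`."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return None


async def never_finishes():
    # Hypothetical stand-in for a page load that hangs
    await asyncio.Event().wait()


if __name__ == "__main__":
    print(asyncio.run(fetch_with_deadline(never_finishes(), 0.1)))
```

In the situation described above, the awaitable passed in would be something like `request.fetch()`.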

gougou168 #33

Just use the QR code on this page to join the discussion group and talk through your problem there: https://github.com/howie6879/aspider/blob/master/docs/cn/README.md
