Gain

GitHub: https://github.com/gaojiuli/gain/

gain is meant to make it easy for everyone to write Python web crawlers. It is built on asyncio, uvloop and aiohttp.

Requirements

Python 3.5+

Installation

pip install gain

Usage

Write spider.py:

from gain import Css, Item, Parser, Spider

class Post(Item):
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        # Append each post title to a text file, one title per line.
        with open('scrapinghub.txt', 'a+') as f:
            f.write(self.results['title'] + '\n')

class MySpider(Spider):
    start_url = 'https://blog.scrapinghub.com/'
    # Raw strings keep the regex backslashes literal.
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9-]+/', Post)]

MySpider.run()

Run it with: python spider.py
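
The two Parser patterns are ordinary regular expressions, so they can be checked with Python's re module before running the spider. The sketch below is only an illustration: the classify helper and the sample URLs are made up, and it assumes a URL has to match a pattern in full, which may differ from how gain applies the patterns internally.

```python
import re

# Patterns copied from the spider above.
PAGE_PATTERN = r'https://blog.scrapinghub.com/page/\d+/'
POST_PATTERN = r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9-]+/'


def classify(url):
    """Return 'post', 'page', or None depending on which pattern the URL matches."""
    if re.fullmatch(POST_PATTERN, url):
        return 'post'
    if re.fullmatch(PAGE_PATTERN, url):
        return 'page'
    return None


# Hypothetical URLs, invented for this check.
samples = [
    'https://blog.scrapinghub.com/page/2/',
    'https://blog.scrapinghub.com/2017/01/01/some-post-title/',
    'https://blog.scrapinghub.com/about/',
]
for url in samples:
    print(url, '->', classify(url))
```

A URL that matches neither pattern, such as the /about/ page here, comes back as None.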

Examples

Examples are in the /example/ directory.

gain, a new Python crawler framework based on asyncio, uvloop and aiohttp: how to use it

ionicwang (reply #1)

mark

gougou168 (reply #2)

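I haven't used gain in production, but going by its docs and source it is an asynchronous crawler framework built on asyncio, uvloop and aiohttp. The design is somewhat like Scrapy's, except that it is fully asynchronous. Here is a very basic example that should make it clear.

First, install it:

pip install gain

Then a minimal spider looks roughly like this (untested; check the imported names against the README above, which only shows Css, Item, Parser and Spider):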

from gain import Spider, Request, Item, Field

# 1. Define the data structure you want to scrape (similar to a Scrapy Item)
class PostItem(Item):
    title = Field()
    link = Field()

# 2. Write your spider class, inheriting from Spider
class MySpider(Spider):
    # Starting URL
    start_url = 'https://httpbin.org/ip'

    # Limit on the number of concurrent requests
    concurrency = 3

    # Entry point; the framework calls it automatically
    async def parse(self, response):
        # Yield a plain dict here; the framework converts it into a PostItem
        yield {
            'title': 'Test Title',
            'link': response.url
        }

        # To keep following links, yield a Request object
        # yield Request('https://httpbin.org/get', callback=self.parse_detail)

    # Example of another callback method
    async def parse_detail(self, response):
        # Handle the detail page...
        pass

# 3. Run the spider
if __name__ == '__main__':
    MySpider.run()

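Key points:

Item class: defines the data fields you want to scrape, declared with Field()
Spider class: your crawling logic goes here
- start_url: the starting address (a start_urls list is also supported)
- concurrency: controls the number of concurrent requests
- parse method: the default callback; it handles the response and yields data or new requests
Two kinds of things you can yield:
- a dict: converted automatically into the Item you defined
- a Request object: used to fetch further pages; you must specify a callback method
Running: call Spider.run() directly and it takes care of the event loop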
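
To make the concurrency point concrete, here is a small sketch of how a cap on concurrent requests is commonly implemented with plain asyncio. It is not gain's code: fetch, crawl and run_demo are hypothetical names, the example.com URLs are placeholders, and asyncio.sleep stands in for a real HTTP request. It uses asyncio.run, so it needs Python 3.7 or later.

```python
import asyncio


async def fetch(url, semaphore, stats):
    """Pretend to download url; the semaphore caps how many run at once."""
    async with semaphore:
        stats['active'] += 1
        stats['peak'] = max(stats['peak'], stats['active'])
        await asyncio.sleep(0.01)  # stand-in for a real HTTP request
        stats['active'] -= 1
        return url


async def crawl(urls, concurrency=3):
    """Fetch all urls, never running more than `concurrency` fetches at a time."""
    semaphore = asyncio.Semaphore(concurrency)
    stats = {'active': 0, 'peak': 0}
    results = await asyncio.gather(*(fetch(u, semaphore, stats) for u in urls))
    return results, stats['peak']


def run_demo():
    # Placeholder URLs; nothing is actually downloaded.
    urls = ['https://example.com/page/%d' % i for i in range(10)]
    return asyncio.run(crawl(urls, concurrency=3))


if __name__ == '__main__':
    results, peak = run_demo()
    print(len(results), 'pages fetched, peak concurrency', peak)
```

All ten tasks are created up front, but only as many as the semaphore allows get past the `async with` line at the same time.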

A slightly more practical example, scraping a list of blog articles:

class ArticleSpider(Spider):
    start_url = 'https://blog.example.com/articles'

    async def parse(self, response):
        # Assume each article has an <a class="title"> tag
        for link in response.css('a.title'):
            url = link.attr('href')
            title = link.text()

            # Fetch the detail page
            yield Request(url, callback=self.parse_article)

        # Pagination (if there is any)
        next_page = response.css('a.next-page').attr('href')
        if next_page:
            yield Request(next_page, callback=self.parse)

    async def parse_article(self, response):
        # Extract the article content
        yield {
            'title': response.css('h1::text').get(),
            'content': response.css('.article-body').text(),
            'url': response.url
        }
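
One detail the example above glosses over: href values taken from a page are often relative, so they usually need to be resolved against the page URL before being requested. Whether gain does this for you is not something I have checked. A minimal sketch with the standard library, where absolute_links is a hypothetical helper and the URLs are placeholders:

```python
from urllib.parse import urljoin


def absolute_links(page_url, hrefs):
    """Resolve hrefs against page_url, dropping empty values and duplicates."""
    seen = set()
    links = []
    for href in hrefs:
        if not href:
            continue  # skip missing or empty href attributes
        url = urljoin(page_url, href)
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


page = 'https://blog.example.com/articles'
hrefs = ['/articles/42', '?page=2', 'https://other.example.org/x', '/articles/42', None]
print(absolute_links(page, hrefs))
```

Absolute URLs pass through unchanged, and the repeated /articles/42 link is returned only once.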

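Selector syntax: I believe gain uses the parsel library (the same one Scrapy uses), in which case css() and xpath() would behave exactly as they do in Scrapy. Check the source to confirm.

Configuration: you can set these on the spider class: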

class MySpider(Spider):
    headers = {'User-Agent': 'MyBot/1.0'}
    proxy = 'http://proxy:8080'  # proxy
    timeout = 10  # timeout (seconds)
    retry_times = 2  # number of retries
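
For a sense of what timeout and retry_times would mean in practice, here is a rough sketch of the usual pattern in plain asyncio. It is not taken from gain: with_retry and make_flaky are hypothetical helpers, and the failing request is simulated rather than sent over the network.

```python
import asyncio


async def with_retry(make_request, timeout=10, retry_times=2):
    """Await make_request() with a timeout, retrying up to retry_times extra times."""
    last_error = None
    for _ in range(retry_times + 1):
        try:
            return await asyncio.wait_for(make_request(), timeout)
        except (asyncio.TimeoutError, OSError) as exc:
            last_error = exc  # remember the failure and try again
    raise last_error


def make_flaky(failures):
    """Build a fake request coroutine that fails `failures` times, then succeeds."""
    state = {'calls': 0}

    async def request():
        state['calls'] += 1
        if state['calls'] <= failures:
            raise OSError('simulated network error')
        return 'ok after %d calls' % state['calls']

    return request


if __name__ == '__main__':
    print(asyncio.run(with_retry(make_flaky(2), timeout=1, retry_times=2)))
```

With retry_times = 2 the request is attempted at most three times; if the third attempt also fails, the last error is raised to the caller.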

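Data storage: by default items are printed to the console, but you can use pipelines to write them to a file or a database. (The README example above instead persists each item by overriding the Item's async save() method.) To be honest, gain's ecosystem is less mature than Scrapy's, and you end up implementing many features yourself.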
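
If you do end up writing the storage yourself, one simple option is a JSON Lines file, one item per line. The sketch below is independent of gain: append_item and load_items are hypothetical helpers, and the item dicts are invented sample data.

```python
import json
import os
import tempfile


def append_item(path, item):
    """Append one scraped item (a dict) to a JSON Lines file."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')


def load_items(path):
    """Read every item back from the JSON Lines file."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == '__main__':
    # Write to a temporary directory so the demo leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'items.jsonl')
        append_item(path, {'title': 'First post', 'url': 'https://blog.example.com/1'})
        append_item(path, {'title': 'Second post', 'url': 'https://blog.example.com/2'})
        print(load_items(path))
```

Appending line by line means a crash partway through a crawl loses at most the item being written.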

In one sentence: gain is worth a try for small projects, but for mature projects I would still recommend Scrapy together with scrapy-playwright.