How do I write data to a MySQL database from pipelines.py in a Python crawler?

[Scrapy crawler growth diary: writing scraped content to a MySQL database - 秋楓 - 博客园](http://www.cnblogs.com/rwxwsblog/p/4572367.html)

To save data to MySQL from Scrapy's pipelines.py, the core steps are as follows.

First, configure the database connection in settings.py:

MYSQL_HOST = 'localhost'
MYSQL_DATABASE = 'your_db'
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'your_password'

Then write a pipeline class in pipelines.py:

import pymysql
from itemadapter import ItemAdapter

class MySQLPipeline:
    def __init__(self, mysql_host, mysql_db, mysql_user, mysql_password):
        self.host = mysql_host
        self.db = mysql_db
        self.user = mysql_user
        self.password = mysql_password
    
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mysql_host=crawler.settings.get('MYSQL_HOST'),
            mysql_db=crawler.settings.get('MYSQL_DATABASE'),
            mysql_user=crawler.settings.get('MYSQL_USER'),
            mysql_password=crawler.settings.get('MYSQL_PASSWORD')
        )
    
    def open_spider(self, spider):
        self.connection = pymysql.connect(
            host=self.host,
            user=self.user,
            password=self.password,
            database=self.db,
            charset='utf8mb4',
            cursorclass=pymysql.cursors.DictCursor
        )
        self.cursor = self.connection.cursor()
    
    def close_spider(self, spider):
        self.connection.close()
    
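    def process_item(self, item, spider):
        # Adjust the SQL and the field names to match your item structure
        # ItemAdapter gives uniform access to dicts, scrapy.Item objects and dataclasses
        adapter = ItemAdapter(item)
        sql = """
        INSERT INTO your_table (title, url, content, created_at)
        VALUES (%s, %s, %s, NOW())
        """
        values = (
            adapter.get('title'),
            adapter.get('url'),
            adapter.get('content')
        )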
        
        try:
            self.cursor.execute(sql, values)
            self.connection.commit()
        except Exception as e:
            self.connection.rollback()
            spider.logger.error(f"MySQL error: {e}")
        
        return item

Don't forget to enable the pipeline in settings.py:

ITEM_PIPELINES = {
    'your_project.pipelines.MySQLPipeline': 300,
}

In short: configure the connection, write the insert logic, and remember to commit the transaction.

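ionicwang #3

Ha, I've been working on this recently too. One tip: don't run an insert for every item you process, it is very slow. Do a batch insert instead.

h691938207 #4

Is there an example of a batch insert? Thanks!

songsunli #5

Thanks, I'll go take a look first.

caililin #6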

https://stackoverflow.com/questions/29063215/batch-bulk-sql-insert-in-scrapy-pipelines-postgresql
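
For the MySQL case asked about above, here is a minimal sketch of a batching pipeline. It is an illustration under assumptions, not the code from the linked answer: the class name `BatchMySQLPipeline`, the setting `MYSQL_BATCH_SIZE` and the table `your_table` are hypothetical, and a failed batch is simply logged and dropped. Items are buffered in memory and written with `cursor.executemany()` once the buffer is full, with a final flush when the spider closes.

```python
INSERT_SQL = (
    "INSERT INTO your_table (title, url, content, created_at) "
    "VALUES (%s, %s, %s, NOW())"
)


class BatchMySQLPipeline:
    """Buffers rows and writes them in batches (hypothetical example class)."""

    def __init__(self, conn_kwargs, batch_size=100):
        self.conn_kwargs = conn_kwargs
        self.batch_size = batch_size
        self.buffer = []
        self.connection = None

    @classmethod
    def from_crawler(cls, crawler):
        s = crawler.settings
        conn_kwargs = dict(
            host=s.get("MYSQL_HOST"),
            user=s.get("MYSQL_USER"),
            password=s.get("MYSQL_PASSWORD"),
            database=s.get("MYSQL_DATABASE"),
            charset="utf8mb4",
        )
        # MYSQL_BATCH_SIZE is an assumed, project-specific setting name
        return cls(conn_kwargs, batch_size=s.getint("MYSQL_BATCH_SIZE", 100))

    def open_spider(self, spider):
        # Imported here so the module can be loaded without the driver installed
        import pymysql

        self.connection = pymysql.connect(**self.conn_kwargs)

    def process_item(self, item, spider):
        self.buffer.append((item.get("title"), item.get("url"), item.get("content")))
        if len(self.buffer) >= self.batch_size:
            self.flush(spider)
        return item

    def flush(self, spider):
        """Write all buffered rows in one executemany() call and one commit."""
        if not self.buffer:
            return
        try:
            with self.connection.cursor() as cursor:
                cursor.executemany(INSERT_SQL, self.buffer)
            self.connection.commit()
        except Exception as exc:
            self.connection.rollback()
            spider.logger.error(f"MySQL batch insert failed: {exc}")
        finally:
            # The buffer is cleared either way, so a failed batch is dropped
            self.buffer.clear()

    def close_spider(self, spider):
        self.flush(spider)  # write whatever is left in the buffer
        self.connection.close()
```

The trade-off is that up to `batch_size` items live only in memory until the next flush, so a crash loses them. A smaller batch size narrows that window at the cost of more round trips.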

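itying888 #7

Thanks. The code in the article seems to target Python 2, and I ran into the following problems when running it under Python 3. I'd appreciate any pointers:
1. Under Win7 + Python 3, should the MySQLdb package be replaced with pymysql?
2. What is the Python 3 equivalent of dbpool=adbapi.ConnectionPool('MySQLdb',**dbargs)?
3. What is the Python 3 equivalent of cursorclass=MySQLdb.cursors.DictCursor?
4. I can't find where the conn in conn.execute() is defined. The article also calls conn.fetchone(), which suggests conn is a cursor, but adbapi.ConnectionPool has no cursor attribute, so how is the cursor defined?
5. The program performs INSERTs but never calls commit() to commit the transaction. Won't that cause problems?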

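ionicwang #8

The same article also contains the following code. I searched online but couldn't find a detailed explanation of how these three statements are used. Where can I find one?
The third statement in particular is one I can't figure out.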

d=self.dbpool.runInteraction(self._do_upinsert,item,spider)
d.addErrback(self._handle_error,item,spider)
d.addBoth(lambda _:item)

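nodeper #9

1. MySQLdb supports py3 (through the mysqlclient fork, which is still imported as MySQLdb)
2. The syntax is the same
3. from MySQLdb.cursors import DictCursor

conn and addBoth both appear to be part of how twisted manages the connection pool. The pool commits automatically.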

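h691938207 #10

Thanks. Which package should I install to be able to import MySQLdb?
I saw this statement online: "MySQLdb now looks like a project that is no longer maintained. For MySQL on Python 3.x, another project, pymysql, is worth considering instead. It is DB-API compatible."
Since I'm on Python 3.6, wouldn't pymysql be the better fit?
If pymysql really is the recommended choice, how should this code from the article be rewritten for pymysql?

dbpool=adbapi.ConnectionPool('MySQLdb',**dbargs)
cursorclass=MySQLdb.cursors.DictCursor

Also, what does the statement d.addBoth(lambda _:item) do?
I'd appreciate any pointers!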

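gougou168 #11

I haven't moved to 3.6 yet, so I'm not sure whether it is supported. Check the project on GitHub to see whether it supports 3.6.

I'd suggest going through the basics of Python syntax before writing crawlers.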

bupafengyu #12

https://github.com/rmax/dirbot-mysql/blob/master/dirbot/pipelines.py
Here is an example of a pipeline that works with MySQL asynchronously.

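zlyuanteng #13

In the blog post's code, is a transaction committed for every single record on insert and update? If not, where can I set how many records go into one transaction?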