Does the Scrapy framework in Python support publishing content directly via wordpress-rpc?

1 reply

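Yes. Scrapy can publish content directly to WordPress through wordpress-rpc (that is, WordPress's XML-RPC interface), although the XML-RPC support comes from a third-party library rather than from Scrapy itself.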

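Scrapy itself is an asynchronous crawling framework and has no built-in XML-RPC client, but you can integrate a third-party library such as python-wordpress-xmlrpc into a Pipeline or Spider to handle publishing. The core steps: after scraping the data, convert the Item into a WordPress post object (WordPressPost) inside the Pipeline, then send it over XML-RPC.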
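
To make the "send it over XML-RPC" step concrete, here is a minimal sketch of the request body such a client sends, built with only Python's standard-library xmlrpc.client. It assumes WordPress's wp.newPost method with the argument order (blog id, username, password, content struct). The helper name build_new_post_request and the sample values are illustrative only and are not part of any library.

```python
import xmlrpc.client


def build_new_post_request(username, password, title, content,
                           status="draft", blog_id=0):
    """Serialize a wp.newPost call to XML without sending it anywhere."""
    # Field names follow the WordPress XML-RPC content struct (assumed here).
    post_struct = {
        "post_title": title,
        "post_content": content,
        "post_status": status,
    }
    params = (blog_id, username, password, post_struct)
    return xmlrpc.client.dumps(params, methodname="wp.newPost")


request_xml = build_new_post_request("your_username", "your_password",
                                     "Hello", "<p>Body</p>")

# Round-trip the XML to see what a server would decode from it.
decoded_params, method_name = xmlrpc.client.loads(request_xml)
```

A library like python-wordpress-xmlrpc hides this serialization behind its Client and method classes, so in practice you rarely build the request by hand.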

Below is a complete example:

  1. Install the required libraries
pip install python-wordpress-xmlrpc scrapy
  2. Create a Pipeline in the Scrapy project
# pipelines.py
from wordpress_xmlrpc import Client, WordPressPost
from wordpress_xmlrpc.methods.posts import NewPost
from scrapy.exceptions import DropItem

class WordPressPublishPipeline:
    def __init__(self, wp_url, wp_username, wp_password):
        self.wp_url = wp_url
        self.wp_username = wp_username
        self.wp_password = wp_password
        self.client = None

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            wp_url=crawler.settings.get('WORDPRESS_URL'),
            wp_username=crawler.settings.get('WORDPRESS_USERNAME'),
            wp_password=crawler.settings.get('WORDPRESS_PASSWORD')
        )

    def open_spider(self, spider):
        # Create the WordPress XML-RPC client
        self.client = Client(self.wp_url, self.wp_username, self.wp_password)

    def process_item(self, item, spider):
        # Build the WordPress post object
        post = WordPressPost()
        post.title = item['title']
        post.content = item['content']
        post.post_status = 'publish'  # or 'draft' to save as a draft
        post.terms_names = {
            'category': item.get('categories', ['Uncategorized']),
            'post_tag': item.get('tags', [])
        }

        try:
            # Publish the post
            post_id = self.client.call(NewPost(post))
            spider.logger.info(f'Successfully published post ID: {post_id}')
            return item
        except Exception as e:
            spider.logger.error(f'Failed to publish post: {e}')
            raise DropItem(f"Publishing failed: {e}")

    def close_spider(self, spider):
        self.client = None
  3. Configure settings.py
# settings.py
ITEM_PIPELINES = {
    'your_project.pipelines.WordPressPublishPipeline': 300,
}

WORDPRESS_URL = 'https://your-site.com/xmlrpc.php'
WORDPRESS_USERNAME = 'your_username'
WORDPRESS_PASSWORD = 'your_password'
  4. Example Spider
import scrapy

class SampleSpider(scrapy.Spider):
    name = 'sample'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {
            'title': response.css('h1::text').get(),
            'content': response.css('article').get(),
            'categories': ['Scrapy'],
            'tags': ['web-scraping', 'wordpress']
        }
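
Note that response.css(...).get() returns None when the selector matches nothing, so an item can reach the pipeline without a usable title or body. Below is a minimal sketch of a check that could run at the start of process_item. The helper name missing_fields and the list of required fields are assumptions made for illustration.

```python
REQUIRED_FIELDS = ("title", "content")  # assumed minimum for a publishable post


def missing_fields(item, required=REQUIRED_FIELDS):
    """Return the required fields that are absent, None, or blank."""
    missing = []
    for field in required:
        value = item.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


good = {"title": "Hello", "content": "<article>Body</article>"}
bad = {"title": None, "content": "   "}

good_result = missing_fields(good)  # nothing missing
bad_result = missing_fields(bad)    # both fields are unusable
```

In the pipeline, a non-empty result could be turned into a DropItem before any WordPressPost is built, so incomplete pages never reach WordPress.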

Key points

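  • Use the python-wordpress-xmlrpc library to handle the XML-RPC communication
  • Implement the publishing logic in a Pipeline so the spider code stays clean
  • Keep the WordPress credentials in settings rather than hardcoding them in the pipeline
  • Pay attention to exception handling and logging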
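
The settings.py above still stores the credentials as string literals. One way to keep them out of the source tree is to read them from environment variables. This is a sketch that assumes the variables WORDPRESS_URL, WORDPRESS_USERNAME and WORDPRESS_PASSWORD are exported before the crawl starts. The helper load_wordpress_settings is hypothetical.

```python
import os


def load_wordpress_settings(environ=os.environ):
    """Collect WordPress settings from the environment, failing early if any is unset."""
    names = ("WORDPRESS_URL", "WORDPRESS_USERNAME", "WORDPRESS_PASSWORD")
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}


# Demonstration with an explicit mapping instead of the real environment.
example_env = {
    "WORDPRESS_URL": "https://your-site.com/xmlrpc.php",
    "WORDPRESS_USERNAME": "your_username",
    "WORDPRESS_PASSWORD": "your_password",
}
wp_settings = load_wordpress_settings(example_env)
```

In settings.py the same idea can be as simple as WORDPRESS_URL = os.environ["WORDPRESS_URL"], repeated for each value. The pipeline's from_crawler code does not need to change.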

Summary: integrating the python-wordpress-xmlrpc library in a Pipeline is enough to automate publishing from Scrapy to WordPress.
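
One thing to watch: client.call(...) is a blocking network request, and Scrapy calls pipelines from its single reactor thread, so a slow WordPress server can stall the whole crawl. A possible mitigation is to run the call in a worker thread. The sketch below uses only asyncio and a stand-in function. It assumes Python 3.9+ and, if used inside Scrapy, an async def process_item with the asyncio reactor enabled. The names publish_in_thread and fake_publish are hypothetical.

```python
import asyncio
import time


def fake_publish(post):
    """Stand-in for a blocking call such as client.call(NewPost(post))."""
    time.sleep(0.05)  # simulate network latency
    return f"published:{post['title']}"


async def publish_in_thread(blocking_publish, post):
    """Await a blocking publish function without blocking the event loop."""
    return await asyncio.to_thread(blocking_publish, post)


async def demo():
    posts = [{"title": "a"}, {"title": "b"}]
    # Both publishes run in worker threads while the event loop stays free.
    return await asyncio.gather(
        *(publish_in_thread(fake_publish, p) for p in posts)
    )


results = asyncio.run(demo())
```

Whether this is worth the extra complexity depends on how many items you publish. For a handful of posts per crawl, the blocking version is usually fine.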
