The Python crawler framework Scrapy: usage and common problems
Scrapy provides a built-in Telnet console for inspecting and controlling a running Scrapy process. The Telnet console is just a regular Python shell running inside the Scrapy process, so you can do pretty much anything in it.
What is this Telnet console actually for in Scrapy? My script running on a server throws the following error:
[scrapy.utils.signal][ERROR] Error caught on signal handler:
Traceback (most recent call last):
  File "/home/www/.local/lib/python3.4/site-packages/twisted/internet/defer.py", line 151, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/lib/python3.4/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/home/www/.local/lib/python3.4/site-packages/scrapy/extensions/telnet.py", line 63, in stop_listening
    self.port.stopListening()
AttributeError: 'TelnetConsole' object has no attribute 'port'
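For reference, the console is controlled from settings.py; a minimal sketch for switching it off or pinning its port range (setting names are from the Scrapy docs, values are just examples):

# settings.py -- telnet console options (values are examples)
TELNETCONSOLE_ENABLED = False          # turn the console off entirely if you don't need it
# TELNETCONSOLE_PORT = [6023, 6073]    # or pin the port range it tries to bind to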
Scrapy really is a good framework; anyone doing serious crawling pretty much has to know it. The core is just a handful of components, and once you understand them it is all quite simple.
First, create a project:
scrapy startproject myproject
cd myproject
scrapy genspider example example.com
The key files:
- items.py - defines the data structures
- spiders/ - where the spider logic lives
- pipelines.py - item processing pipelines
- middlewares.py - middlewares
- settings.py - configuration
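The spider example below imports MyItem, so a minimal items.py to pair with it could look like this (field names match what the spider fills in):

# items.py
import scrapy

class MyItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()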
A simple spider example:
import scrapy
from myproject.items import MyItem


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    def parse(self, response):
        # extract the data
        item = MyItem()
        item['title'] = response.css('h1::text').get()
        item['url'] = response.url
        # pagination
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
        yield item
Common problems:
- 403/429 errors - set a User-Agent and add a download delay
# settings.py
USER_AGENT = 'Mozilla/5.0...'
DOWNLOAD_DELAY = 2
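If a fixed delay isn't enough, Scrapy's built-in AutoThrottle extension adapts the delay to server response times; a sketch of the relevant settings (values are just examples):

# settings.py -- AutoThrottle adjusts the delay automatically
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10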
- Dynamically loaded content - use Splash or a Selenium middleware
# requires scrapy-splash (pip install scrapy-splash)
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725,
}
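With those middlewares enabled, pages that need JavaScript rendering are fetched with SplashRequest instead of a plain Request; a minimal sketch (the wait value is just an example):

# in the spider (assumes scrapy-splash is installed and Splash is running)
from scrapy_splash import SplashRequest

def start_requests(self):
    for url in self.start_urls:
        # render the page in Splash, waiting 1s for JavaScript to run
        yield SplashRequest(url, self.parse, args={'wait': 1})

The scrapy-splash README also lists a spider middleware and a dupefilter setting; check it for the full configuration.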
- Storing data - handle it in a pipeline
class MyPipeline:
    def process_item(self, item, spider):
        # save to a database or a file here
        return item
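A pipeline only runs once it is enabled in settings.py; the number is its priority (lower runs first), and the module path assumes the default project layout:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}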
- Concurrency control - adjust the settings
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
- Proxies - rotate IPs with a downloader middleware
class ProxyMiddleware:
    def process_request(self, request, spider):
        # point the request at a proxy (placeholder address)
        request.meta['proxy'] = 'http://proxy_ip:port'
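Like pipelines, the middleware has to be registered before it takes effect; a sketch assuming it lives in myproject/middlewares.py (the priority value is just an example):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyMiddleware': 350,
}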
Run the spider:
scrapy crawl example -o data.json
Use the Scrapy shell for debugging:
scrapy shell 'http://example.com'
# then test selectors interactively
response.css('div.content').get()
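You can also drop into the same shell from inside a running spider when a particular response looks wrong, using Scrapy's inspect_response helper:

# inside a spider callback
from scrapy.shell import inspect_response

def parse(self, response):
    if not response.css('div.content'):
        # opens an interactive shell bound to this response, then resumes the crawl
        inspect_response(response, self)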
Remember that Scrapy is asynchronous: don't put blocking code in parse. Using ItemLoader for data cleaning also keeps things tidier.
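A rough sketch of what that looks like (note the processor import path differs between versions: itemloaders.processors in recent Scrapy, scrapy.loader.processors in older releases):

from scrapy.loader import ItemLoader
from itemloaders.processors import MapCompose, TakeFirst
from myproject.items import MyItem

class MyItemLoader(ItemLoader):
    default_item_class = MyItem
    default_output_processor = TakeFirst()   # single value instead of a list
    title_in = MapCompose(str.strip)         # strip whitespace on input

# in a spider callback:
def parse(self, response):
    loader = MyItemLoader(response=response)
    loader.add_css('title', 'h1::text')
    loader.add_value('url', response.url)
    yield loader.load_item()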
Summary: the core of Scrapy is understanding its component architecture and its asynchronous machinery.
You have already posted this once; please don't post it again.

