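Using Scrapy in Python to scrape Baidu Tieba: how do I filter posts by keywords in the content or title? What should I do when my XPath never matches anything?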

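For example: I want to scrape a Tieba forum and collect every post whose title or content contains any of the keywords 征婚 (marriage seeking), 交友 (making friends), or 美女 (beautiful women).

The code below scrapes all posts: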

item['title'] = response.xpath('//h1[@style="width: 470px"]/text()').extract()[0].strip()  # post title
item['url'] = response.meta['text_url']  # post URL
item['content'] = response.xpath('//*[starts-with(@id, "post_content_")]/text()').extract()[0].strip()  # post content
item['time'] = response.xpath('//div[@class="l_post j_l_post l_post_bright noborder "]').re(r"\d+-\d+-\d+ \d+:\d+")  # posting time
item['click'] = random.randint(0, 20)  # click count; just a random value

I tried the following two approaches: check the content first, then decide whether to keep the post.

################ Method 2

okok = response.xpath('//*[starts-with(@id, "post_content_")]/text()').extract()[0].strip()
if '交友' or '征婚' or '美女' in okok:
    item['content'] = response.xpath('//*[starts-with(@id, "post_content_")]/text()').extract()[0].strip()
    item['title'] = response.xpath('//h1[@style="width: 470px"]/text()').extract()[0].strip()
    item['url'] = response.meta['text_url']
    item['time'] = response.xpath('//div[@class="l_post j_l_post l_post_bright noborder "]').re(r"\d+-\d+-\d+ \d+:\d+")
    item['click'] = random.randint(0, 20)
    print(item)
    yield item

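Neither of these works. I also tried contains(str1, str2), but perhaps I used it incorrectly; it never succeeded.

I don't know what approach would let me scrape Baidu Tieba posts by a set of keywords.

Thanks.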

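Core problem: an incorrect XPath means no data gets extracted. Baidu Tieba's page structure is complex, and matching directly on text content tends to fail. Here are two approaches you can use as-is:

Approach 1: match on both the link text and the @title attribute (more reliable)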

import scrapy

class TiebaSpider(scrapy.Spider):
    name = 'tieba'
    start_urls = ['https://tieba.baidu.com/f?kw=python']

    def parse(self, response):
        # Key point: check both the title text and the @title attribute to avoid interference from dynamic loading
        for post in response.xpath('//li[@class=" j_thread_list clearfix"]'):
            title_text = post.xpath('.//a[@class="j_th_tit "]/text()').get()
            title_attr = post.xpath('.//a[@class="j_th_tit "]/@title').get()

            # Filter: keep the post if the keyword appears in the title text or the title attribute
            keyword = "爬虫"
            if (title_text and keyword in title_text) or (title_attr and keyword in title_attr):
                yield {
                    # title_text can be None when only the attribute matched
                    'title': (title_text or title_attr).strip(),
                    'link': response.urljoin(post.xpath('.//a[@class="j_th_tit "]/@href').get())
                }

Approach 2: CSS selectors plus a substring check (more concise)

def parse(self, response):
    keyword = "爬虫"
    for post in response.css('li.j_thread_list'):
        # Join all text nodes under the title link, then test the combined string
        full_text = ''.join(post.css('a.j_th_tit ::text').getall())
        if keyword in full_text:
            yield {
                'title': full_text.strip(),
                'link': response.urljoin(post.css('a.j_th_tit::attr(href)').get())
            }
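
Both snippets above test a single keyword, while the question asks about a group of keywords. It is also worth looking at the condition in the question's Method 2: `'交友' or '征婚' or '美女' in okok` is parsed as `'交友' or '征婚' or ('美女' in okok)`, and since `'交友'` is a non-empty string the whole expression is always truthy, so every post passes. A minimal sketch of a multi-keyword check follows; the helper name `matches_any` is made up for illustration and is not part of Scrapy.

```python
# Keywords taken from the question; adjust as needed.
KEYWORDS = ("征婚", "交友", "美女")


def matches_any(text, keywords=KEYWORDS):
    """Return True if text contains at least one of the keywords."""
    if not text:
        # None or an empty string can never match
        return False
    return any(kw in text for kw in keywords)


# The condition from the question: `or` returns its first truthy operand,
# which is the non-empty string '交友', regardless of the text being tested.
always_true = bool('交友' or '征婚' or '美女' in "nothing relevant here")

print(always_true)
print(matches_any("nothing relevant here"))
print(matches_any("本人诚心征婚"))
```

Inside a spider callback this would be used as `if matches_any(title) or matches_any(content): yield item`.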

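Why your XPath may be failing:

Tieba pages contain many <script> blocks; a broad expression such as //*[contains(text(),"keyword")] can match script content as well
The displayed title text may live in the @title attribute rather than in the element's text node
Lazy loading can leave some elements empty in the initial HTML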
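
Since the question mentions that contains(str1, str2) did not work: in XPath 1.0, contains() takes exactly two strings, so testing several keywords means joining several contains() calls with `or`. Testing `.` (the element's full string value) instead of `text()` also avoids the case where only the first text node is examined. The sketch below only builds the expression string; the helper name `keyword_predicate` is hypothetical, the `j_th_tit` class is taken from the snippets above, and it assumes the keywords contain no double quotes.

```python
def keyword_predicate(keywords, target="."):
    """Build an XPath 1.0 predicate body that is true if target contains any keyword.

    Assumes keywords contain no double-quote characters.
    """
    tests = ['contains({}, "{}")'.format(target, kw) for kw in keywords]
    return " or ".join(tests)


KEYWORDS = ["征婚", "交友", "美女"]

# Title links whose text contains any keyword
title_xpath = '//a[contains(@class, "j_th_tit")][{}]'.format(keyword_predicate(KEYWORDS))

# Same idea, but testing the @title attribute instead of the element text
attr_xpath = '//a[contains(@class, "j_th_tit")][{}]'.format(keyword_predicate(KEYWORDS, "@title"))

print(title_xpath)
print(attr_xpath)
```

The resulting string can be passed to `response.xpath(...)` like any other expression.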

Debugging tip (add temporarily):

# Add temporarily at the top of the parse method to save the page structure Scrapy actually received
with open('debug.html', 'w', encoding='utf-8') as f:
    f.write(response.text)

Summary: prefer Approach 1, which checks both the attribute and the text as a double safeguard.

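h691938207, reply #2

XPath tip: in Chrome, the Elements panel of the developer tools has a Ctrl+F search that supports find by string, selector, or XPath.

If nothing is extracted, either the element you want is loaded by an asynchronous AJAX request, in which case you can simulate that request,

or your XPath expression is wrong, which you can check with the method above.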

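It's not that I can't extract anything. I can extract the information for all posts. What I can't work out is how to pick out only the posts whose content contains specific keywords such as 交友, 征婚, or 美女.
I'm not sure whether to filter while scraping or when saving to the database. How should I filter out the information I need?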
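
Either place can work. Filtering in the spider means simply not yielding the item; filtering at storage time is usually done in an item pipeline that raises `DropItem` for unwanted items. Below is a minimal sketch of the pipeline variant. The class name `KeywordFilterPipeline` is made up for illustration, the field names `title`, `content`, and `url` follow the question's item, and the fallback `DropItem` class exists only so the sketch can be run without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch can run without Scrapy installed.
    class DropItem(Exception):
        pass


class KeywordFilterPipeline:
    """Hypothetical pipeline: keep only items whose title or content has a keyword."""

    keywords = ("征婚", "交友", "美女")

    def process_item(self, item, spider):
        # Combine title and content; missing or None fields count as empty text
        text = "{} {}".format(item.get("title") or "", item.get("content") or "")
        if any(kw in text for kw in self.keywords):
            return item
        # Items that raise DropItem are not passed to later pipelines
        raise DropItem("no keyword in post: {}".format(item.get("url")))
```

To enable it, the class would be listed in the `ITEM_PIPELINES` setting, for example under a path like `myproject.pipelines.KeywordFilterPipeline` (the project name here is a placeholder). Keeping the filter in a pipeline leaves the spider responsible only for extraction, at the cost of building items that are later dropped.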
