How do I sort items by a given field value in Scrapy pipelines (Python)?

For example, the item has a field called infoid, and item['infoid'] holds some data.

In pipelines, how can I sort the items by the value of item['infoid'] before the later pipelines process them?

sorted(item.items(), key=lambda infoid:infoid[1])

Sorting this way always fails with: TypeError: string indices must be integers, not str
Is there another way, inside pipelines, to sort the items by a given field value before they are written to the database?

sinazl #1

A pipeline processes items in no particular order. Only the pipelines themselves are ordered, by their priority values.
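
To illustrate the priority ordering mentioned above, here is a minimal sketch of an `ITEM_PIPELINES` setting. The module path `myproject.pipelines` and the class names are hypothetical. Scrapy runs pipelines from the lowest number to the highest, but that number only orders the pipelines, not the items flowing through them.

```python
# settings.py (sketch; the module path and class names are hypothetical)
ITEM_PIPELINES = {
    "myproject.pipelines.DatabasePipeline": 300,  # runs second
    "myproject.pipelines.CleanPipeline": 100,     # runs first
}

# Order in which each item passes through the pipelines (lowest value first)
pipeline_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```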

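sinazl #2

Sorting items directly in a Scrapy pipeline is not a good fit, because pipelines process data as a stream. This is what I usually do.

Collect and sort the data in the spider (this orders the items produced from a single response):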

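import scrapy


# Item definition assumed for this example
class MyItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()


class MySpider(scrapy.Spider):
    name = 'myspider'

    def parse(self, response):
        items = []
        # Collect every item on the page first
        for element in response.css('div.item'):
            item = MyItem()
            item['name'] = element.css('::text').get()
            # Read the data-price attribute of the element
            item['price'] = float(element.css('::attr(data-price)').get())
            items.append(item)

        # Sort by the price field
        sorted_items = sorted(items, key=lambda x: x['price'])

        # Yield in sorted order
        for item in sorted_items:
            yield item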

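If you must sort in a pipeline, you can buffer the items and sort them when the spider closes (later pipelines still receive the items in arrival order):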

import json


class SortingPipeline:
    def __init__(self):
        self.items = []

    def process_item(self, item, spider):
        # Buffer a plain-dict copy so it can be sorted and serialized later
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # Sort by the chosen field, e.g. 'price'
        sorted_items = sorted(self.items, key=lambda x: x.get('price', 0))

        # Save the sorted data here
        with open('sorted_output.json', 'w', encoding='utf-8') as f:
            json.dump(sorted_items, f, ensure_ascii=False)

Recommendation: sorting in the spider is more straightforward.

yuanlaile #3

Try sorting with OrderedDict.
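
One possible reading of this suggestion, as a sketch only: buffer the items, sort them by `infoid`, and keep them in a `collections.OrderedDict` keyed by `infoid` so that later code iterates in sorted order. The sample data is made up for illustration, and plain dicts stand in for Scrapy items.

```python
from collections import OrderedDict

# Hypothetical sample items; in Scrapy these would be the buffered items
items = [
    {"infoid": 30, "title": "c"},
    {"infoid": 10, "title": "a"},
    {"infoid": 20, "title": "b"},
]

# Sort by infoid, then key the ordered mapping by infoid
ordered = OrderedDict(
    (item["infoid"], item) for item in sorted(items, key=lambda i: i["infoid"])
)

# Iteration follows insertion order, which is now ascending infoid
for infoid, item in ordered.items():
    print(infoid, item["title"])
```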

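sinazl #4

Pipelines process data in the order it arrives (streaming). If you are scraping only a small amount of data, you can buffer it in a cache, then sort it and write it to the database at the end. Otherwise, write it to the database directly and create an index on the infoid field.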
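
Both options described above can be sketched roughly as follows, assuming SQLite via the standard `sqlite3` module. The class name `BufferedSqlitePipeline`, the table `info`, and the `title` column are hypothetical. The pipeline buffers items, inserts them sorted by `infoid` when the spider closes, and also creates an index on `infoid` so rows can be read back in order regardless of insertion order.

```python
import sqlite3


class BufferedSqlitePipeline:
    """Hypothetical pipeline: buffer items, then insert them sorted by infoid."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS info (infoid INTEGER, title TEXT)"
        )
        # Index on infoid so ORDER BY infoid queries stay cheap
        self.conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_info_infoid ON info (infoid)"
        )
        self.buffer = []

    def process_item(self, item, spider):
        # Keep a plain-dict copy; the item itself continues down the chain
        self.buffer.append(dict(item))
        return item

    def close_spider(self, spider):
        # Sort the buffered items once, then insert them in that order
        rows = sorted(self.buffer, key=lambda i: i["infoid"])
        self.conn.executemany(
            "INSERT INTO info (infoid, title) VALUES (:infoid, :title)", rows
        )
        self.conn.commit()


# Usage outside Scrapy, with plain dicts standing in for items
pipeline = BufferedSqlitePipeline()
for raw in ({"infoid": 3, "title": "c"}, {"infoid": 1, "title": "a"}):
    pipeline.process_item(raw, spider=None)
pipeline.close_spider(spider=None)
rows = pipeline.conn.execute("SELECT infoid FROM info ORDER BY infoid").fetchall()
```

Buffering keeps every item in memory until the crawl ends, so for large crawls the index plus an `ORDER BY infoid` query is likely the safer choice.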

itying888 #5

sorted(item.items(), key=lambda i: i["infoid"])

zlyuanteng #6

Reply #4 was wrong:
data = item.items()
sorted(data, key=lambda i: i["infoid"])
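
For context on why calls of this shape fail, here is a minimal sketch that uses a plain dict in place of a Scrapy item (the sample data is made up). `item.items()` yields `(field, value)` tuples, so indexing one of them with `"infoid"` raises a `TypeError`. Sorting by `infoid` needs a collection of items rather than the fields of a single item.

```python
# A plain dict stands in for a Scrapy item here (hypothetical data)
item = {"infoid": 3, "title": "c"}

try:
    sorted(item.items(), key=lambda i: i["infoid"])
    failed = False
except TypeError:
    # Each element is a (field, value) tuple, which cannot be indexed by a string
    failed = True

# Sorting works on a list of items, keyed by each item's infoid
items = [item, {"infoid": 1, "title": "a"}, {"infoid": 2, "title": "b"}]
by_infoid = sorted(items, key=lambda i: i["infoid"])
```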