How can I efficiently read a million files in one folder with Python?

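I saved all the source files fetched by my crawler into a single folder, and there are now more than a million of them. I need to read them back, but calling os.walk(path) directly has been stuck at that step for several hours. Is there a faster way to read them?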

caililin (reply 1)

ls /xxx >list
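
A minimal sketch of how a listing produced this way might be consumed from Python, assuming the listing holds one file name per line. The helper name iter_listed_paths and its parameters are hypothetical, not part of the reply. If plain ls is itself slow on a directory this large, its default sorting may be the cause; ls -f lists entries unsorted.

```python
import os
from typing import Iterator


def iter_listed_paths(list_file: str, base_dir: str) -> Iterator[str]:
    """Yield full paths for the names stored one per line in list_file.

    Assumption: list_file was written by something like `ls DIR > list`,
    so each line is a bare file name relative to base_dir.
    """
    with open(list_file, "r", encoding="utf-8") as f:
        # A file object is iterated lazily, so only one line is held at a time.
        for line in f:
            name = line.rstrip("\n")
            if name:  # skip blank lines
                yield os.path.join(base_dir, name)
```

Because the function is a generator, the caller can start processing the first file immediately instead of waiting for the whole listing to be loaded.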

sinazl (reply 2)

Core idea: process in batches and iterate lazily, so that the full set of file paths is never loaded into memory at once.

Here is the code. Use the lazy iterator from os.scandir() or pathlib, combined with a generator that yields one batch at a time:

import os
from pathlib import Path

def process_million_files(folder_path, batch_size=1000):
    """
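    Generator for walking a folder that holds millions of files.
    Each iteration yields one batch: a list of file paths.
    """
    # Option 1: os.scandir() - lower level, usually the best performance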
    with os.scandir(folder_path) as entries:
        batch = []
        for entry in entries:
            if entry.is_file():  # only handle files, skip directories
                batch.append(entry.path)
                if len(batch) >= batch_size:
                    yield batch
                    batch = []
        if batch:  # yield the final, partial batch
            yield batch

    # Option 2: pathlib (more concise code, slightly slower but acceptable)
    # folder = Path(folder_path)
    # batch = []
    # for file_path in folder.iterdir():
    #     if file_path.is_file():
    #         batch.append(str(file_path))
    #         if len(batch) >= batch_size:
    #             yield batch
    #             batch = []
    # if batch:
    #     yield batch

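# Usage example
folder = "/path/to/your/million/files"
for file_batch in process_million_files(folder, batch_size=5000):
    # Process each batch of files here
    for file_path in file_batch:
        # Run your own file-handling logic, for example:
        # with open(file_path, 'r') as f:
        #     content = f.read()
        #     # process the content...
        pass  # replace with the actual processing code
    # Once a batch is done, its list can be garbage-collected before the next batch arrives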
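
The same "lazy iteration plus batching" idea can also be split into two small generators, which keeps the batching logic reusable for any iterable. This is only a sketch of one possible arrangement; the names iter_file_paths and in_batches are hypothetical and do not appear in the reply above.

```python
import os
from itertools import islice
from typing import Iterable, Iterator, List


def iter_file_paths(folder_path: str) -> Iterator[str]:
    """Lazily yield the path of every regular file directly inside folder_path."""
    with os.scandir(folder_path) as entries:
        for entry in entries:
            if entry.is_file():  # skip subdirectories
                yield entry.path


def in_batches(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group any iterable into lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        # islice pulls at most batch_size items without consuming the rest.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

A caller would then write something like `for batch in in_batches(iter_file_paths(folder), 5000): ...`, and only one batch of paths exists in memory at any moment.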

Key points: