Using the Rust full-text search library BM25: probabilistic document scoring and keyword search


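BM25 is an algorithm for scoring how relevant each document in a corpus is to a query. The sections below show how to use the bm25 crate in Rust, with example code.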

Embed

use bm25::{Embedder, EmbedderBuilder, Embedding, TokenEmbedding, Language};

// Define the corpus
let corpus = [
    "The sky blushed pink as the sun dipped below the horizon.",
    "Apples, oranges, papayas, and more papayas.",
    "She found a forgotten letter tucked inside an old book.",
    "A single drop of rain fell, followed by a thousand more.",
];

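// Create the embedder, fitting the average document length to the corpus
let embedder: Embedder = EmbedderBuilder::with_fit_to_corpus(Language::English, &corpus).build();

// Check the average document length
assert_eq!(embedder.avgdl(), 5.75);

// Embed a document
let embedding = embedder.embed(corpus[1]);

// Check the embedding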
assert_eq!(
    embedding,
    Embedding(vec![
        TokenEmbedding {
            index: 1777144781,
            value: 1.1422123,
        },
        TokenEmbedding {
            index: 3887370161,
            value: 1.1422123,
        },
        TokenEmbedding {
            index: 2177600299,
            value: 1.5037148,
        },
        TokenEmbedding {
            index: 2177600299,
            value: 1.5037148,
        },
    ])
)
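
The values in this embedding are consistent with the term-frequency part of the standard BM25 formula, using k1 = 1.2, b = 0.75, the fitted avgdl of 5.75, and a length of 4 meaningful tokens for the papayas sentence. The following is a minimal sketch of that calculation, not the crate's actual code; `tf_weight` and its parameter names are hypothetical, and the token count of 4 is an assumption about what the tokenizer keeps.

```rust
/// Term-frequency component of BM25 for a single token.
/// `tf` = occurrences of the token in the document, `dl` = document length
/// in tokens, `avgdl` = average document length across the corpus.
/// (Hypothetical helper for illustration; not part of the bm25 crate.)
fn tf_weight(tf: f32, dl: f32, avgdl: f32, k1: f32, b: f32) -> f32 {
    // Length normalisation: exactly 1.0 when dl == avgdl (or when b == 0),
    // larger for longer documents, smaller for shorter ones.
    let norm = 1.0 - b + b * dl / avgdl;
    tf * (k1 + 1.0) / (tf + k1 * norm)
}
```

Under these assumptions, `tf_weight(1.0, 4.0, 5.75, 1.2, 0.75)` comes out at about 1.1422 (the tokens that occur once) and `tf_weight(2.0, 4.0, 5.75, 1.2, 0.75)` at about 1.5037 (the token that occurs twice), which lines up with the asserted values.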

Score

use bm25::{Embedder, EmbedderBuilder, Language, Scorer, ScoredDocument};

// Define the corpus and the query
let corpus = [
    "The sky blushed pink as the sun dipped below the horizon.",
    "She found a forgotten letter tucked inside an old book.",
    "Apples, oranges, pink grapefruits, and more pink grapefruits.",
    "A single drop of rain fell, followed by a thousand more.",
];
let query = "pink";

// Create the scorer
let mut scorer = Scorer::<usize>::new();

// Create the embedder
let embedder: Embedder =
    EmbedderBuilder::with_fit_to_corpus(Language::English, &corpus).build();

// Embed every document and add it to the scorer
for (i, document) in corpus.iter().enumerate() {
    let document_embedding = embedder.embed(document);
    scorer.upsert(&i, document_embedding);
}

// Embed the query
let query_embedding = embedder.embed(query);

// Check the score of a single document
let score = scorer.score(&0, &query_embedding);
assert_eq!(score, Some(0.7046783));

// Get the matching documents
let matches = scorer.matches(&query_embedding);
assert_eq!(
    matches,
    vec![
        ScoredDocument {
            id: 2,
            score: 0.9639215
        },
        ScoredDocument {
            id: 0,
            score: 0.7046783
        }
    ]
);
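
The scores asserted above are consistent with each matching query token contributing its inverse document frequency (IDF) times the document's embedded weight for that token. A minimal sketch under that assumption follows; `idf`, `score`, and the map-based layout are hypothetical and do not describe the crate's internals.

```rust
use std::collections::HashMap;

/// Inverse document frequency in the "+1" form given on Wikipedia.
/// `total_docs` = number of documents in the corpus,
/// `docs_with_token` = how many of them contain the token.
fn idf(total_docs: usize, docs_with_token: usize) -> f32 {
    let n = total_docs as f32;
    let df = docs_with_token as f32;
    ((n - df + 0.5) / (df + 0.5) + 1.0).ln()
}

/// Score one document against a query (hypothetical helper, not the crate's API).
/// `doc_weights` maps token index -> embedded weight for this document;
/// `doc_freq` maps token index -> number of documents containing that token.
/// Assumes `query_tokens` lists each token once. Tokens missing from the
/// document contribute nothing.
fn score(
    query_tokens: &[u32],
    doc_weights: &HashMap<u32, f32>,
    doc_freq: &HashMap<u32, usize>,
    total_docs: usize,
) -> f32 {
    query_tokens
        .iter()
        .filter_map(|token| {
            let weight = doc_weights.get(token)?;
            let df = doc_freq.get(token).copied().unwrap_or(0);
            Some(idf(total_docs, df) * weight)
        })
        .sum()
}
```

With four documents, two of which contain "pink", `idf(4, 2)` is ln 2, roughly 0.6931. Multiplied by the weight document 0 would get under k1 = 1.2 and b = 0.75 (about 1.0166, assuming 6 tokens against an average of 6.25), this lands close to the 0.7046783 asserted above.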

Search

use bm25::{Document, Language, SearchEngineBuilder, SearchResult};

// Define the corpus
let corpus = [
    "The rabbit munched the orange carrot.",
    "The snake hugged the green lizard.",
    "The hedgehog impaled the orange orange.",
    "The squirrel buried the brown nut.",
];

// Build the search engine
let search_engine = SearchEngineBuilder::<u32>::with_corpus(Language::English, corpus).build();

// Run the search
let limit = 3;
let search_results = search_engine.search("orange", limit);

// Check the search results
assert_eq!(
    search_results,
    vec![
        SearchResult {
            document: Document {
                id: 2,
                contents: String::from("The hedgehog impaled the orange orange."),
            },
            score: 0.9530774,
        },
        SearchResult {
            document: Document {
                id: 0,
                contents: String::from("The rabbit munched the orange carrot."),
            },
            score: 0.6931472,
        },
    ]
);
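
Note that the search above returns two results even though the limit is 3: only documents that match the query appear, ordered by descending score. The ranking step can be sketched as follows. This is an illustration of the behaviour visible in the example, assuming non-matching documents score zero; `top_k` is a hypothetical helper, not a function of the crate.

```rust
/// Keep only documents with a positive score, order them by descending
/// score, and return at most `limit` of them as (id, score) pairs.
fn top_k(mut scored: Vec<(u32, f32)>, limit: usize) -> Vec<(u32, f32)> {
    // Drop documents that did not match the query at all.
    scored.retain(|&(_, score)| score > 0.0);
    // `total_cmp` gives a total order on f32, so sorting cannot panic on NaN.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(limit);
    scored
}
```

Feeding it the four documents of the example with their scores (two of them zero) and a limit of 3 yields document 2 followed by document 0.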

Complete example

The following is a complete BM25 search example:

use bm25::{Document, Language, SearchEngineBuilder};

fn main() {
    // Prepare the documents
    let documents = vec![
        Document {
            id: 1,
            contents: String::from("Rust is a systems programming language focused on safety, speed, and concurrency"),
        },
        Document {
            id: 2,
            contents: String::from("BM25 is a relevance scoring algorithm widely used in information retrieval"),
        },
        Document {
            id: 3,
            contents: String::from("Cargo, the package manager for Rust, makes project management simple"),
        },
        Document {
            id: 4,
            contents: String::from("BM25 is based on a probabilistic model and is well suited to full-text search"),
        },
    ];

    // Build the search engine (u32 document ids; the documents are in English)
    let search_engine =
        SearchEngineBuilder::<u32>::with_documents(Language::English, documents).build();

    // Run the search
    let query = "Rust";
    let results = search_engine.search(query, 5);

    // Print the results
    println!("Results for '{}':", query);
    for result in results {
        println!("Document ID: {}, score: {:.4}", result.document.id, result.score);
        println!("Contents: {}\n", result.document.contents);
    }
}

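Key features

  • Multilingual tokenizer with stemming, stop-word removal, and Unicode normalization
  • Language detection
  • Fully configurable BM25 parameters
  • Parallelism for fast batch processing
  • Modular and customizable
  • Configurable via compile-time features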

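The BM25 algorithm assumes that you know the average (meaningful) word count of your documents in advance. If that assumption does not hold for your use case, you have two options: (1) make a reasonable guess (for example, based on a sample); or (2) configure the algorithm to ignore document length. The former is recommended.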
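
Both options can be sketched in a few lines. The first function estimates the average length from a sample; the second shows the length-normalisation factor from the standard BM25 denominator, which collapses to 1 when b = 0, so that length no longer matters. Both helpers are hypothetical illustrations. Counting whitespace-separated words is a simplifying assumption, since a real tokenizer also drops stop words and would give a smaller number.

```rust
/// Rough estimate of the average document length: the mean number of
/// whitespace-separated words over a sample. Returns 0.0 for an empty sample.
/// (Simplification: stop words are not removed here.)
fn estimate_avgdl(sample: &[&str]) -> f32 {
    if sample.is_empty() {
        return 0.0;
    }
    let total: usize = sample.iter().map(|doc| doc.split_whitespace().count()).sum();
    total as f32 / sample.len() as f32
}

/// Length-normalisation factor of BM25: 1 - b + b * dl / avgdl.
/// With b = 0 it is always 1, so document length has no effect.
fn length_norm(b: f32, dl: f32, avgdl: f32) -> f32 {
    1.0 - b + b * dl / avgdl
}
```

For instance, a sample of one three-word and one five-word document gives an estimate of 4.0, and `length_norm(0.0, dl, avgdl)` is 1.0 for any `dl`.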

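BM25 has three parameters: b, k1, and avgdl. These terms match the formula given on Wikipedia. avgdl ("average document length") is the average meaningful word count described above; you should always provide a value for it. b controls document length normalization: 0 means no normalization (length does not affect the score), while 1 means full normalization. If you know avgdl, 0.75 is usually a good choice for b. k1 controls how much weight repeated terms receive. For almost all use cases, a value of 1.2 is suitable.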
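
For reference, this is the formula as it is commonly written (for example on Wikipedia). Here f(q_i, D) is the number of times query term q_i occurs in document D, |D| is the length of D in words, N is the number of documents, and n(q_i) is the number of documents containing q_i:

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\,
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
\qquad
\mathrm{IDF}(q_i) = \ln\!\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)
```

Setting b = 0 removes |D|/avgdl from the denominator, and a larger k1 lets repeated occurrences keep raising the score for longer before it saturates.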


1 reply


The following is a complete, commented example of BM25 full-text search built on the tantivy library:

use tantivy::{
    collector::TopDocs,
    doc,
    query::{BooleanQuery, Occur, QueryParser},
    schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions, STORED},
    Document, Index, IndexWriter, TantivyDocument,
};

// Written against the tantivy 0.22 API; other releases may differ in details.
fn main() -> tantivy::Result<()> {
    // 1. Build the schema, which defines the structure of the index
    let mut schema_builder = Schema::builder();

    // Text field options: index term frequencies and positions, keep fieldnorms,
    // and store the original text
    let text_options = TextOptions::default()
        .set_indexing_options(
            TextFieldIndexing::default()
                .set_index_option(IndexRecordOption::WithFreqsAndPositions)
                .set_fieldnorms(true), // BM25 uses fieldnorms for length normalization
        )
        .set_stored(); // store the original content so it can be shown in results

    // Add the fields
    let title = schema_builder.add_text_field("title", text_options.clone());
    let body = schema_builder.add_text_field("body", text_options);
    let is_published = schema_builder.add_bool_field("published", STORED);
    let schema = schema_builder.build();

    // 2. Create the index in RAM. For a large on-disk index, Index::create_in_dir
    //    (backed by MmapDirectory) is the usual choice.
    let index = Index::create_in_ram(schema);

    // 3. BM25 parameters: tantivy scores with BM25 by default. Its scorer appears
    //    to use fixed k1 = 1.2 and b = 0.75; this example assumes there is no
    //    public API to change them, so nothing is configured here.

    // 4. Add documents
    let mut index_writer: IndexWriter = index.writer(50_000_000)?; // 50 MB memory budget

    // Document 1
    index_writer.add_document(doc!(
        title => "The Rust Programming Language",
        body => "Rust is a systems programming language that runs blazingly fast, prevents segfaults, and guarantees thread safety.",
        is_published => true
    ))?;

    // Document 2
    index_writer.add_document(doc!(
        title => "Learning Rust",
        body => "This book will teach you about the Rust programming language. Learn about ownership, borrowing and lifetimes.",
        is_published => true
    ))?;

    // Document 3
    index_writer.add_document(doc!(
        title => "Advanced Rust",
        body => "For experienced Rust programmers. Covers async programming, unsafe Rust and FFI.",
        is_published => false // unpublished document
    ))?;

    index_writer.commit()?; // commit the writes

    // 5. Create the query parser and set per-field boosts
    //    (set_field_boost mutates the parser, so it cannot be chained)
    let mut query_parser = QueryParser::for_index(&index, vec![title, body]);
    query_parser.set_field_boost(title, 3.0); // matches in the title weigh more
    query_parser.set_field_boost(body, 1.0);

    // 6. Run the queries
    let reader = index.reader()?;
    let searcher = reader.searcher();

    // Example 1: simple query
    println!("Simple query results:");
    let query = query_parser.parse_query("rust programming")?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(5))?;
    print_results(&searcher, top_docs)?;

    // Example 2: Boolean query in which both terms are required
    println!("\nBoolean query results:");
    let bool_query = BooleanQuery::new(vec![
        (Occur::Must, query_parser.parse_query("rust")?),
        (Occur::Must, query_parser.parse_query("language")?),
    ]);

    let top_docs = searcher.search(&bool_query, &TopDocs::with_limit(5))?;
    print_results(&searcher, top_docs)?;

    Ok(())
}

// Helper: print the search results
fn print_results(
    searcher: &tantivy::Searcher,
    docs: Vec<(f32, tantivy::DocAddress)>,
) -> tantivy::Result<()> {
    for (score, doc_address) in docs {
        let retrieved_doc: TantivyDocument = searcher.doc(doc_address)?;
        println!("Score: {:.3}, Doc: {}", score, retrieved_doc.to_json(searcher.schema()));
    }
    Ok(())
}

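Key points

  1. Schema design

    • Index term frequencies (WithFreqs or WithFreqsAndPositions) and keep fieldnorms enabled, since BM25 uses both; positions are only needed for phrase queries
    • Stored (STORED) and non-stored fields can be mixed
  2. BM25 tuning

    • k1: controls term-frequency saturation; the larger the value, the more repeated terms count
    • b: controls document length normalization; 0 disables it, 1 applies it fully
    • tantivy's built-in BM25 scorer uses the usual defaults (k1 = 1.2, b = 0.75)
  3. Query optimization

    • Use BooleanQuery to combine multiple query conditions
    • Use set_field_boost to adjust the weight of individual fields
    • Use MmapDirectory for large indexes
  4. Result handling

    • Each result carries its BM25 relevance score
    • The original document content is available if the field was configured as STORED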
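
Combining conditions so that every one of them is required (the Boolean query in the example) amounts to intersecting the sets of documents that match each term. A minimal sketch of that idea over sorted document-id lists follows. It is a textbook two-pointer merge, not tantivy's implementation, and `intersect` is a hypothetical name.

```rust
use std::cmp::Ordering;

/// Intersect two ascending lists of document ids.
/// Advances whichever list currently points at the smaller id and
/// emits an id only when both lists agree on it.
fn intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}
```

Numbering the three example documents 0, 1, 2 in insertion order for illustration, "rust" occurs in all three and "language" only in the first two, so `intersect(&[0, 1, 2], &[0, 1])` returns `[0, 1]`.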

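This example walks through the full BM25 workflow in Rust, from creating the index to running Boolean queries. Adjust the settings and query logic to fit your needs.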
