Using the Rust full-text search library BM25: probabilistic document scoring and keyword search
BM25 is an algorithm for scoring how relevant each document in a corpus is to a query. The usage notes and example code below cover the bm25 crate for Rust.
Embed
use bm25::{Embedder, EmbedderBuilder, Embedding, TokenEmbedding, Language};
// Define the corpus
let corpus = [
    "The sky blushed pink as the sun dipped below the horizon.",
    "Apples, oranges, papayas, and more papayas.",
    "She found a forgotten letter tucked inside an old book.",
    "A single drop of rain fell, followed by a thousand more.",
];
// Create the embedder, fitting avgdl to the corpus
let embedder: Embedder = EmbedderBuilder::with_fit_to_corpus(Language::English, &corpus).build();
// Check the average document length
assert_eq!(embedder.avgdl(), 5.75);
// Embed a document
let embedding = embedder.embed(corpus[1]);
// Check the embedding ("papayas" occurs twice, so its token appears twice)
assert_eq!(
    embedding,
    Embedding(vec![
        TokenEmbedding {
            index: 1777144781,
            value: 1.1422123,
        },
        TokenEmbedding {
            index: 3887370161,
            value: 1.1422123,
        },
        TokenEmbedding {
            index: 2177600299,
            value: 1.5037148,
        },
        TokenEmbedding {
            index: 2177600299,
            value: 1.5037148,
        },
    ])
);
Score
use bm25::{Embedder, EmbedderBuilder, Language, Scorer, ScoredDocument};
// Define the corpus and the query
let corpus = [
    "The sky blushed pink as the sun dipped below the horizon.",
    "She found a forgotten letter tucked inside an old book.",
    "Apples, oranges, pink grapefruits, and more pink grapefruits.",
    "A single drop of rain fell, followed by a thousand more.",
];
let query = "pink";
// Create the scorer, keyed by document index
let mut scorer = Scorer::<usize>::new();
// Create the embedder
let embedder: Embedder =
    EmbedderBuilder::with_fit_to_corpus(Language::English, &corpus).build();
// Embed every document and add it to the scorer
for (i, document) in corpus.iter().enumerate() {
    let document_embedding = embedder.embed(document);
    scorer.upsert(&i, document_embedding);
}
// Embed the query
let query_embedding = embedder.embed(query);
// Score a single document against the query
let score = scorer.score(&0, &query_embedding);
assert_eq!(score, Some(0.7046783));
// Retrieve all matching documents, best match first
let matches = scorer.matches(&query_embedding);
assert_eq!(
    matches,
    vec![
        ScoredDocument {
            id: 2,
            score: 0.9639215
        },
        ScoredDocument {
            id: 0,
            score: 0.7046783
        }
    ]
);
Search
use bm25::{Document, Language, SearchEngineBuilder, SearchResult};
// Define the corpus
let corpus = [
    "The rabbit munched the orange carrot.",
    "The snake hugged the green lizard.",
    "The hedgehog impaled the orange orange.",
    "The squirrel buried the brown nut.",
];
// Create the search engine; document ids are the u32 positions in the corpus
let search_engine = SearchEngineBuilder::<u32>::with_corpus(Language::English, corpus).build();
// Run the search, asking for at most `limit` results
let limit = 3;
let search_results = search_engine.search("orange", limit);
// Check the results
assert_eq!(
    search_results,
    vec![
        SearchResult {
            document: Document {
                id: 2,
                contents: String::from("The hedgehog impaled the orange orange."),
            },
            score: 0.9530774,
        },
        SearchResult {
            document: Document {
                id: 0,
                contents: String::from("The rabbit munched the orange carrot."),
            },
            score: 0.6931472,
        },
    ]
);
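Only two results come back even though the limit is 3, because the other two documents do not contain the query term. The ranking step can be pictured with the sketch below. It assumes the scores have already been computed, and `top_matches` is a hypothetical helper for illustration, not the crate's implementation.

```rust
/// Hypothetical helper: keeps documents with a positive score, sorts them by
/// descending score, and returns at most `limit` of them.
fn top_matches(scores: &[(u32, f32)], limit: usize) -> Vec<(u32, f32)> {
    let mut matches: Vec<(u32, f32)> = scores
        .iter()
        .copied()
        .filter(|&(_, score)| score > 0.0)
        .collect();
    // total_cmp is a total order on f32, so the comparison never fails.
    matches.sort_by(|a, b| b.1.total_cmp(&a.1));
    matches.truncate(limit);
    matches
}
```

Feeding in the four scores from the example (two of them zero) with a limit of 3 yields documents 2 and 0, in that order.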
Complete example
The following is a complete BM25 search example:
use bm25::{Document, Language, SearchEngineBuilder};
fn main() {
    // Prepare the documents
    let documents = vec![
        Document {
            id: 1,
            contents: String::from("Rust is a systems programming language focused on safety, speed, and concurrency"),
        },
        Document {
            id: 2,
            contents: String::from("BM25 is a relevance scoring algorithm widely used in information retrieval"),
        },
        Document {
            id: 3,
            contents: String::from("The Rust package manager Cargo makes project management simple"),
        },
        Document {
            id: 4,
            contents: String::from("BM25 is based on a probabilistic model and is well suited to full-text search"),
        },
    ];
    // Create the search engine (the documents are English, so use Language::English)
    let search_engine = SearchEngineBuilder::with_documents(Language::English, documents)
        .build();
    // Run the search, asking for at most 5 results
    let query = "Rust";
    let results = search_engine.search(query, 5);
    // Print the results
    println!("Results for '{}':", query);
    for result in results {
        println!("Document ID: {}, score: {:.4}", result.document.id, result.score);
        println!("Contents: {}\n", result.document.contents);
    }
}
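Key features
- Multilingual tokenizer with stemming, stop word removal, and Unicode normalization
- Language detection
- Fully configurable BM25 parameters
- Parallelism for fast batch processing
- Modular and customizable
- Configurable via compile-time features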
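The BM25 algorithm assumes that you know the average number of (meaningful) words in your documents ahead of time. If that assumption does not hold for your use case, you have two options: (1) make a sensible guess (for example, based on a sample), or (2) configure the algorithm to ignore document length. The former is recommended.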
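As a rough sketch of option (1), the average can be estimated from a sample of documents. The helper `estimate_avgdl` and its whitespace-based word count are assumptions made for illustration. The crate's tokenizer also stems words and removes stop words, so real counts of meaningful words will usually be lower.

```rust
/// Hypothetical helper: estimates avgdl from a sample of documents.
/// Counts whitespace-separated words, which only approximates the number
/// of "meaningful" words (no stemming or stop word removal here).
fn estimate_avgdl(sample: &[&str]) -> Option<f32> {
    if sample.is_empty() {
        return None; // no data, no estimate
    }
    let total_words: usize = sample
        .iter()
        .map(|doc| doc.split_whitespace().count())
        .sum();
    Some(total_words as f32 / sample.len() as f32)
}
```

The resulting value is what you would supply as avgdl when configuring the embedder instead of fitting it to the full corpus.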
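BM25 has three parameters: b, k1, and avgdl. These names match the formula given on Wikipedia. avgdl ("average document length") is the average number of meaningful words described above; you should always provide a value. b controls document length normalization: 0 means no normalization (length does not affect the score), while 1 means full normalization. If you know avgdl, 0.75 is usually a good choice. k1 controls how much weight repeated terms receive. For almost all use cases, a value of 1.2 is suitable.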
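To make the roles of the three parameters concrete, here is a minimal sketch of the per-term BM25 score following the Wikipedia formula. The names `Bm25Config`, `tf_component`, `idf`, and `term_score` are illustrative and not part of the bm25 crate's API. The IDF variant is the one from the Wikipedia article, and the crate may differ in details.

```rust
/// The three BM25 parameters, named as in the text above.
#[derive(Debug, Clone, Copy)]
struct Bm25Config {
    k1: f32,
    b: f32,
    avgdl: f32,
}

/// Term-frequency part of BM25 for one term in one document.
/// `tf` is the number of occurrences of the term and `doc_len` is the
/// number of meaningful words in the document.
fn tf_component(tf: f32, doc_len: f32, p: Bm25Config) -> f32 {
    // With b = 0 this factor is 1, so document length has no effect.
    let length_norm = 1.0 - p.b + p.b * (doc_len / p.avgdl);
    tf * (p.k1 + 1.0) / (tf + p.k1 * length_norm)
}

/// Inverse document frequency: ln((N - n + 0.5) / (n + 0.5) + 1).
fn idf(total_docs: f32, docs_with_term: f32) -> f32 {
    ((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1.0).ln()
}

/// Score contributed by a single query term to a single document.
fn term_score(tf: f32, doc_len: f32, total_docs: f32, docs_with_term: f32, p: Bm25Config) -> f32 {
    idf(total_docs, docs_with_term) * tf_component(tf, doc_len, p)
}
```

Under these assumptions, and taking each document in the Search example to contain 4 meaningful words (so avgdl is 4), `term_score` with k1 = 1.2 and b = 0.75 gives about 0.6931 for one occurrence of "orange" and about 0.9531 for two, in line with the scores shown in that example. Doubling the term frequency raises the score by well under a factor of two, which is the saturation effect that k1 controls.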
1 reply
Below is a complete example that implements BM25 full-text search with the tantivy library, with detailed comments:
// Written against the tantivy 0.21 API. In 0.22 and later, Searcher::doc
// needs a document type annotation such as tantivy::TantivyDocument.
use tantivy::{
    collector::TopDocs,
    doc,
    query::{BooleanQuery, Occur, QueryParser},
    schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions, STORED},
    Index,
};
fn main() -> tantivy::Result<()> {
    // 1. Create the schema, which defines the index structure
    let mut schema_builder = Schema::builder();
    // Configure the text fields with everything BM25 scoring relies on
    let text_options = TextOptions::default()
        .set_indexing_options(
            TextFieldIndexing::default()
                .set_index_option(IndexRecordOption::WithFreqsAndPositions)
                .set_fieldnorms(true), // BM25 needs field norms (field lengths)
        )
        .set_stored(); // store the original text so it can be shown in results
    // Add the fields
    let title = schema_builder.add_text_field("title", text_options.clone());
    let body = schema_builder.add_text_field("body", text_options);
    let is_published = schema_builder.add_bool_field("published", STORED);
    let schema = schema_builder.build();
    // 2. Create the index in RAM for this example. For a large on-disk index,
    // open a tantivy::directory::MmapDirectory and pass it to Index::open_or_create.
    let index = Index::create_in_ram(schema.clone());
    // 3. BM25 parameters: tantivy scores with fixed constants (k1 = 1.2, b = 0.75)
    // and has no public API for changing them, so nothing is configured here.
    // 4. Add documents
    let mut index_writer = index.writer(50_000_000)?; // 50 MB memory budget
    // Document 1
    index_writer.add_document(doc!(
        title => "The Rust Programming Language",
        body => "Rust is a systems programming language that runs blazingly fast, prevents segfaults, and guarantees thread safety.",
        is_published => true
    ))?;
    // Document 2
    index_writer.add_document(doc!(
        title => "Learning Rust",
        body => "This book will teach you about the Rust programming language. Learn about ownership, borrowing and lifetimes.",
        is_published => true
    ))?;
    // Document 3
    index_writer.add_document(doc!(
        title => "Advanced Rust",
        body => "For experienced Rust programmers. Covers async programming, unsafe Rust and FFI.",
        is_published => false // unpublished document
    ))?;
    index_writer.commit()?; // commit the writes so they become searchable
    // 5. Create the query parser and set per-field boosts.
    // set_field_boost takes &mut self and returns (), so it cannot be chained.
    let mut query_parser = QueryParser::for_index(&index, vec![title, body]);
    query_parser.set_field_boost(title, 3.0); // matches in the title weigh more
    query_parser.set_field_boost(body, 1.0);
    // 6. Run queries
    let reader = index.reader()?;
    let searcher = reader.searcher();
    // Example 1: simple query
    println!("Simple query results:");
    let query = query_parser.parse_query("rust programming")?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(5))?;
    print_results(&searcher, top_docs)?;
    // Example 2: boolean query in which both sub-queries must match
    println!("\nBoolean query results:");
    let bool_query = BooleanQuery::new(vec![
        (Occur::Must, query_parser.parse_query("rust")?),
        (Occur::Must, query_parser.parse_query("language")?),
    ]);
    let top_docs = searcher.search(&bool_query, &TopDocs::with_limit(5))?;
    print_results(&searcher, top_docs)?;
    Ok(())
}
// Helper: print each hit with its BM25 score and stored fields
fn print_results(searcher: &tantivy::Searcher, docs: Vec<(f32, tantivy::DocAddress)>) -> tantivy::Result<()> {
    for (score, doc_address) in docs {
        let retrieved_doc = searcher.doc(doc_address)?;
        println!("Score: {:.3}, Doc: {:?}", score, retrieved_doc);
    }
    Ok(())
}
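Key points
- Schema design:
  - Index the text fields with term frequencies (WithFreqs or WithFreqsAndPositions) and with fieldnorms enabled, since BM25 scoring relies on both
  - Stored (STORED) and non-stored fields can be mixed
- BM25 tuning:
  - k1: controls term-frequency saturation; the larger the value, the more repeated terms matter
  - b: controls document length normalization; 0 disables it and 1 applies it fully
  - tantivy itself scores with fixed values (k1 = 1.2, b = 0.75)
- Query optimization:
  - Use BooleanQuery to combine several query conditions
  - Use set_field_boost to adjust the weight of individual fields
  - Use MmapDirectory for large on-disk indexes
- Result handling:
  - Every result carries its BM25 relevance score
  - The original document content is available if the field was configured as STORED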
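The effect of field boosts can be pictured as a boost-weighted sum of per-field scores. The sketch below is a simplification for intuition, with hypothetical names (`FieldScore`, `combined_score`); it is not tantivy's internal code.

```rust
/// BM25 score of one field for a document, plus the boost set for that field.
struct FieldScore {
    bm25: f32,
    boost: f32,
}

/// Combines per-field BM25 scores as a boost-weighted sum.
fn combined_score(fields: &[FieldScore]) -> f32 {
    fields.iter().map(|f| f.bm25 * f.boost).sum()
}
```

With a boost of 3.0 on the title, a title score of 0.5 contributes 1.5 to the total, more than a body score of 1.0 at boost 1.0.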
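This example walks through the complete BM25 workflow in Rust, from creating the index to running more complex queries. Adjust the parameters and query logic to suit your own requirements.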