Lindera Dictionary


A morphological analysis dictionary library for Lindera.

This package contains the dictionary structures and the Viterbi algorithm.
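
To make the Viterbi step concrete, here is a minimal sketch of a least-cost path search over a word lattice, where each path pays a word cost per entry and a connection cost between neighbouring context IDs. It is an illustration under simplifying assumptions, not Lindera's implementation: the `Entry` and `Node` structs, the `viterbi` function, the toy lexicon, and the convention that context ID 0 stands for sentence boundaries are all hypothetical.

```rust
/// A toy lexicon entry: surface form, context IDs, and word cost.
#[derive(Debug, Clone)]
struct Entry {
    surface: &'static str,
    left_id: usize,
    right_id: usize,
    cost: i64,
}

/// A lattice node: one lexicon entry placed at one position in the text.
#[derive(Clone)]
struct Node {
    entry: Option<usize>,         // None marks the beginning-of-sentence node
    prev: Option<(usize, usize)>, // (end offset, node index) of the best predecessor
    total: i64,                   // cheapest cost of any path ending at this node
}

/// Returns the cheapest segmentation of `text` and its total cost, or `None`
/// if the lexicon cannot cover the whole input.
/// `conn(prev_right_id, next_left_id)` is the connection cost.
fn viterbi(
    text: &str,
    lexicon: &[Entry],
    conn: impl Fn(usize, usize) -> i64,
) -> Option<(Vec<&'static str>, i64)> {
    let n = text.len();
    // ends[pos] lists every node whose surface ends at byte offset `pos`.
    let mut ends: Vec<Vec<Node>> = vec![Vec::new(); n + 1];
    ends[0].push(Node { entry: None, prev: None, total: 0 });
    // Right context ID of a node; 0 is assumed for sentence boundaries.
    let right_of = |node: &Node| node.entry.map_or(0, |i| lexicon[i].right_id);

    for start in 0..n {
        if ends[start].is_empty() {
            continue; // no path reaches this offset
        }
        for (ei, e) in lexicon.iter().enumerate() {
            if e.surface.is_empty() || !text[start..].starts_with(e.surface) {
                continue;
            }
            // Forward step: keep only the cheapest predecessor for this node.
            let (pi, total) = ends[start]
                .iter()
                .enumerate()
                .map(|(pi, p)| (pi, p.total + conn(right_of(p), e.left_id) + e.cost))
                .min_by_key(|&(_, t)| t)?;
            ends[start + e.surface.len()].push(Node {
                entry: Some(ei),
                prev: Some((start, pi)),
                total,
            });
        }
    }

    // Close the lattice at the end of the sentence and pick the cheapest path.
    let (mut idx, total) = ends[n]
        .iter()
        .enumerate()
        .map(|(i, p)| (i, p.total + conn(right_of(p), 0)))
        .min_by_key(|&(_, t)| t)?;

    // Backward step: follow predecessor links to recover the surfaces.
    let mut pos = n;
    let mut path = Vec::new();
    while let (Some(ei), Some((p, i))) = (ends[pos][idx].entry, ends[pos][idx].prev) {
        path.push(lexicon[ei].surface);
        pos = p;
        idx = i;
    }
    path.reverse();
    Some((path, total))
}
```

Calling `viterbi("ab", &lexicon, |_, _| 1)` with entries for `a`, `b`, and `ab` returns whichever segmentation has the lower sum of word and connection costs, so changing either kind of cost can change the result.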

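Dictionary format

IPADIC

This repository uses mecab-ipadic.

IPADIC dictionary format

For details on the IPADIC dictionary format and part-of-speech tags, refer to the manual.

Index	Name (Japanese)	Name (English)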
0	表層形	Surface
1	左文脈ID	Left context ID
2	右文脈ID	Right context ID
3	コスト	Cost
4	品詞	Major POS classification
5	品詞細分類1	Middle POS classification
6	品詞細分類2	Small POS classification
7	品詞細分類3	Fine POS classification
8	活用型	Conjugation type
9	活用形	Conjugation form
10	原形	Base form
11	読み	Reading
12	発音	Pronunciation
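
As an illustration of this layout, the following sketch splits one dictionary row into the 13 fields listed in the table. The `IpadicEntry` struct and `parse_ipadic_row` function are hypothetical names, not part of lindera-dictionary, and the plain comma split assumes that no field contains a quoted comma.

```rust
/// One IPADIC row, with fields named after the English column of the table.
#[derive(Debug, PartialEq)]
struct IpadicEntry {
    surface: String,
    left_context_id: u16,
    right_context_id: u16,
    cost: i16,
    pos: [String; 4], // major, middle, small, and fine POS classification
    conjugation_type: String,
    conjugation_form: String,
    base_form: String,
    reading: String,
    pronunciation: String,
}

/// Parses a comma-separated row with exactly 13 fields.
fn parse_ipadic_row(line: &str) -> Result<IpadicEntry, String> {
    let f: Vec<&str> = line.trim_end().split(',').collect();
    if f.len() != 13 {
        return Err(format!("expected 13 fields, found {}", f.len()));
    }
    // Indexes 1 to 3 are numeric; everything else is kept as text.
    let left_context_id = f[1].parse::<u16>().map_err(|e| format!("field 1: {e}"))?;
    let right_context_id = f[2].parse::<u16>().map_err(|e| format!("field 2: {e}"))?;
    let cost = f[3].parse::<i16>().map_err(|e| format!("field 3: {e}"))?;
    Ok(IpadicEntry {
        surface: f[0].to_string(),
        left_context_id,
        right_context_id,
        cost,
        pos: [f[4].to_string(), f[5].to_string(), f[6].to_string(), f[7].to_string()],
        conjugation_type: f[8].to_string(),
        conjugation_form: f[9].to_string(),
        base_form: f[10].to_string(),
        reading: f[11].to_string(),
        pronunciation: f[12].to_string(),
    })
}
```

For example, `parse_ipadic_row("東京,1293,1293,3003,名詞,固有名詞,地域,一般,*,*,東京,トウキョウ,トーキョー")` yields an entry whose `reading` is `トウキョウ`; the context IDs and cost in that row are placeholder values chosen for the example.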

IPADIC user dictionary format (CSV)

IPADIC user dictionary, simple version

Index	Name (Japanese)	Name (English)
0	表層形	Surface
1	品詞	Major POS classification
2	読み	Reading
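
A simple user dictionary row carries just these three columns. The sketch below parses one such row; `SimpleUserEntry` and `parse_simple_row` are hypothetical names used for illustration, not library API.

```rust
/// The three columns of a simple user dictionary row.
#[derive(Debug, PartialEq)]
struct SimpleUserEntry {
    surface: String,
    pos: String,
    reading: String,
}

/// Parses `surface,pos,reading`; returns `None` for any other shape.
fn parse_simple_row(line: &str) -> Option<SimpleUserEntry> {
    let mut fields = line.trim().split(',').map(str::trim);
    let entry = SimpleUserEntry {
        surface: fields.next()?.to_string(),
        pos: fields.next()?.to_string(),
        reading: fields.next()?.to_string(),
    };
    // Reject rows with extra columns or an empty surface form.
    if fields.next().is_some() || entry.surface.is_empty() {
        return None;
    }
    Some(entry)
}
```

An invented row such as `東京スカイツリー,カスタム名詞,トウキョウスカイツリー` parses into a surface, a part-of-speech label, and a reading, while rows with two or four columns are rejected.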

IPADIC user dictionary, detailed version

Index	Name (Japanese)	Name (English)	Notes
0	表層形	Surface
1	左文脈ID	Left context ID
2	右文脈ID	Right context ID
3	コスト	Cost
4	品詞	POS
5	品詞細分類1	POS subcategory 1
6	品詞細分類2	POS subcategory 2
7	品詞細分類3	POS subcategory 3
8	活用型	Conjugation type
9	活用形	Conjugation form
10	原形	Base form
11	読み	Reading
12	発音	Pronunciation
13	-	-	Columns from index 13 onward can be freely extended.
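
Because the columns from index 13 onward are free-form, a reader can split a detailed row into its 13 fixed fields and a tail of extension fields. The sketch below does that and also tells simple rows from detailed ones by column count. The `UserRow` enum and `classify_user_row` function are hypothetical, and treating exactly 3 columns as simple and 13 or more as detailed is an assumption drawn from the two tables rather than documented library behaviour.

```rust
/// A user dictionary row, classified by its column count.
#[derive(Debug, PartialEq)]
enum UserRow<'a> {
    Simple { surface: &'a str, pos: &'a str, reading: &'a str },
    Detailed { fixed: Vec<&'a str>, extra: Vec<&'a str> },
}

/// Classifies one CSV row (assumes no quoted commas inside fields).
fn classify_user_row(line: &str) -> Result<UserRow<'_>, String> {
    let fields: Vec<&str> = line.trim_end().split(',').collect();
    match fields.len() {
        3 => Ok(UserRow::Simple {
            surface: fields[0],
            pos: fields[1],
            reading: fields[2],
        }),
        n if n >= 13 => {
            // Indexes 0 to 12 are fixed; anything after that is an extension.
            let (fixed, extra) = fields.split_at(13);
            Ok(UserRow::Detailed { fixed: fixed.to_vec(), extra: extra.to_vec() })
        }
        n => Err(format!("unsupported column count: {n}")),
    }
}
```

A row with 14 columns therefore comes back as `Detailed` with one entry in `extra`, which the caller is free to interpret.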

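IPADIC NEologd

This repository uses mecab-ipadic-neologd.

IPADIC NEologd dictionary format

For details on the IPADIC dictionary format and part-of-speech tags, refer to the manual.

Index	Name (Japanese)	Name (English)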
0	表層形	Surface
1	左文脈ID	Left context ID
2	右文脈ID	Right context ID
3	コスト	Cost
4	品詞	Major POS classification
5	品詞細分類1	Middle POS classification
6	品詞細分類2	Small POS classification
7	品詞細分類3	Fine POS classification
8	活用型	Conjugation type
9	活用形	Conjugation form
10	原形	Base form
11	読み	Reading
12	発音	Pronunciation

IPADIC NEologd user dictionary format (CSV)

IPADIC NEologd user dictionary, simple version

Index	Name (Japanese)	Name (English)
0	表層形	Surface
1	品詞	Major POS classification
2	読み	Reading

IPADIC NEologd user dictionary, detailed version

Index	Name (Japanese)	Name (English)	Notes
0	表層形	Surface
1	左文脈ID	Left context ID
2	右文脈ID	Right context ID
3	コスト	Cost
4	品詞	POS
5	品詞細分類1	POS subcategory 1
6	品詞細分類2	POS subcategory 2
7	品詞細分類3	POS subcategory 3
8	活用型	Conjugation type
9	活用形	Conjugation form
10	原形	Base form
11	読み	Reading
12	発音	Pronunciation
13	-	-	Columns from index 13 onward can be freely extended.

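UniDic

This repository uses unidic-mecab.

UniDic dictionary format

For details on the unidic-mecab dictionary format and part-of-speech tags, refer to the manual.

Index	Name (Japanese)	Name (English)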
0	表層形	Surface
1	左文脈ID	Left context ID
2	右文脈ID	Right context ID
3	コスト	Cost
4	品詞大分類	Major POS classification
5	品詞中分類	Middle POS classification
6	品詞小分類	Small POS classification
7	品詞細分類	Fine POS classification
8	活用型	Conjugation type
9	活用形	Conjugation form
10	語彙素読み	Lexeme reading
11	語彙素（語彙素表記 + 語彙素細分類）	Lexeme
12	書字形出現形	Orthography appearance type
13	発音形出現形	Pronunciation appearance type
14	書字形基本形	Orthography basic type
15	発音形基本形	Pronunciation basic type
16	語種	Word type
17	語頭変化型	Prefix of a word type
18	語頭変化形	Prefix of a word form
19	語末変化型	Suffix of a word type
20	語末変化形	Suffix of a word form
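
With 21 columns, positional access by bare number is easy to get wrong, so one option is to name the indexes. The sketch below encodes a few of the column positions from the table as enum discriminants; `UniDicField`, `split_row`, and `get_field` are hypothetical helpers for illustration, not library API, and only some columns are listed.

```rust
/// Selected column positions in a UniDic row, following the table.
#[derive(Debug, Clone, Copy)]
#[allow(dead_code)]
enum UniDicField {
    Surface = 0,
    Cost = 3,
    MajorPos = 4,
    LexemeReading = 10,
    Lexeme = 11,
    WordType = 16,
}

/// Splits a CSV row on commas (assumes no quoted commas inside fields).
fn split_row(line: &str) -> Vec<&str> {
    line.trim_end().split(',').collect()
}

/// Looks up one named column; returns `None` if the row is too short.
fn get_field<'a>(row: &[&'a str], field: UniDicField) -> Option<&'a str> {
    row.get(field as usize).copied()
}
```

Using `get_field(&row, UniDicField::LexemeReading)` instead of `row[10]` keeps the meaning of the index next to its use and turns a short row into `None` rather than a panic.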

UniDic user dictionary format (CSV)

UniDic user dictionary, simple version

Index	Name (Japanese)	Name (English)
0	表層形	Surface
1	品詞大分類	Major POS classification
2	語彙素読み	Lexeme reading

UniDic user dictionary, detailed version

Index	Name (Japanese)	Name (English)	Notes
0	表層形	Surface
1	左文脈ID	Left context ID
2	右文脈ID	Right context ID
3	コスト	Cost
4	品詞大分類	Major POS classification
5	品詞中分類	Middle POS classification
6	品詞小分類	Small POS classification
7	品詞細分類	Fine POS classification
8	活用型	Conjugation type
9	活用形	Conjugation form
10	語彙素読み	Lexeme reading
11	語彙素（語彙素表記 + 語彙素細分類）	Lexeme
12	書字形出現形	Orthography appearance type
13	発音形出現形	Pronunciation appearance type
14	書字形基本形	Orthography basic type
15	発音形基本形	Pronunciation basic type
16	語種	Word type
17	語頭変化型	Prefix of a word type
18	語頭変化形	Prefix of a word form
19	語末変化型	Suffix of a word type
20	語末変化形	Suffix of a word form
21	-	-	Columns from index 21 onward can be freely extended.

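ko-dic

This repository uses mecab-ko-dic.

ko-dic dictionary format

The dictionary format and part-of-speech tags used by mecab-ko-dic are documented in this Google Spreadsheet, which is linked from the mecab-ko-dic repository README.

Note that ko-dic has one fewer feature column than NAIST JDIC and carries an entirely different set of information (for example, it does not provide the "original form" of a word).

The tags are a slight modification of those specified by 세종 (Sejong). The mapping from Sejong tag names to mecab-ko-dic tag names is given on the 태그 v2.0 tab of the spreadsheet linked above.

The dictionary format is fully specified (in Korean) on the 사전 형식 v2.0 tab of the spreadsheet. Any blank values default to *.

Index	Name (Korean)	Name (English)	Notes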
0	표면	Surface
1	왼쪽 문맥 ID	Left context ID
2	오른쪽 문맥 ID	Right context ID
3	비용	Cost
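4	품사 태그	part-of-speech tag	See the `태그 v2.0` tab of the spreadsheet
5	의미 부류	meaning	(too few examples to be sure)
6	종성 유무	presence or absence of a final consonant	`T` for true; `F` for false; otherwise `*`
7	읽기	reading	Usually matches the surface, but may differ for foreign words, e.g. Chinese-character words
8	타입	type	One of: `Inflect` (활용); `Compound` (복합명사); or `Preanalysis` (기분석)
9	첫번째 품사	first part-of-speech	For example, given the part-of-speech tag "VV+EM+VX+EP", returns `VV`
10	마지막 품사	last part-of-speech	For example, given the part-of-speech tag "VV+EM+VX+EP", returns `EP`
11	표현	expression	`활용, 복합명사, 기분석이 어떻게 구성되는지 알려주는 필드`: a field describing how inflections, compound nouns, and pre-analyzed entries are composed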
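
Two details of this format lend themselves to a short sketch: blank values defaulting to `*`, and the first and last part-of-speech being the outer components of a `+`-joined tag. The helper names `normalize` and `first_and_last_pos` are hypothetical. Returning the tag itself for a tag without `+` is a simplification made for this example, not a statement about what mecab-ko-dic stores in those columns.

```rust
/// Replaces a blank field with `*`, the default for missing values.
fn normalize(field: &str) -> &str {
    if field.trim().is_empty() {
        "*"
    } else {
        field
    }
}

/// Returns the first and last components of a `+`-joined POS tag.
fn first_and_last_pos(tag: &str) -> (&str, &str) {
    // `split` and `rsplit` always yield at least one item, so the
    // fallback to `tag` is only there to avoid an unwrap.
    let first = tag.split('+').next().unwrap_or(tag);
    let last = tag.rsplit('+').next().unwrap_or(tag);
    (first, last)
}
```

For the tag `VV+EM+VX+EP` this gives `VV` and `EP`, matching the notes for indexes 9 and 10.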

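ko-dic user dictionary format (CSV)

ko-dic user dictionary, simple version

Index	Name (Korean)	Name (English)	Notes
0	표면	Surface
1	품사 태그	part-of-speech tag	See the `태그 v2.0` tab of the spreadsheet
2	읽기	reading	Usually matches the surface, but may differ for foreign words, e.g. Chinese-character words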

ko-dic user dictionary, detailed version

Index	Name (Korean)	Name (English)	Notes
0	표면	Surface
1	왼쪽 문맥 ID	Left context ID
2	오른쪽 문맥 ID	Right context ID
3	비용	Cost
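4	품사 태그	part-of-speech tag	See the `태그 v2.0` tab of the spreadsheet
5	의미 부류	meaning	(too few examples to be sure)
6	종성 유무	presence or absence of a final consonant	`T` for true; `F` for false; otherwise `*`
7	읽기	reading	Usually matches the surface, but may differ for foreign words, e.g. Chinese-character words
8	타입	type	One of: `Inflect` (활용); `Compound` (복합명사); or `Preanalysis` (기분석)
9	첫번째 품사	first part-of-speech	For example, given the part-of-speech tag "VV+EM+VX+EP", returns `VV`
10	마지막 품사	last part-of-speech	For example, given the part-of-speech tag "VV+EM+VX+EP", returns `EP`
11	표현	expression	`활용, 복합명사, 기분석이 어떻게 구성되는지 알려주는 필드`: a field describing how inflections, compound nouns, and pre-analyzed entries are composed
12	-	-	Columns from index 12 onward can be freely extended.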

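CC-CEDICT

This repository uses CC-CEDICT-MeCab.

CC-CEDICT dictionary format

For details on the CC-CEDICT-MeCab dictionary format and part-of-speech tags, refer to the manual.

Index	Name (Chinese)	Name (English)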
0	表面形式	Surface
1	左语境ID	Left context ID
2	右语境ID	Right context ID
3	成本	Cost
4	词类	Major POS classification
5	词类1	Middle POS classification
6	词类2	Small POS classification
7	词类3	Fine POS classification
8	拼音	pinyin
9	繁体字	traditional
10	简体字	simplified
11	定义	definition

CC-CEDICT user dictionary format (CSV)

CC-CEDICT user dictionary, simple version

Index	Name (Chinese)	Name (English)
0	表面形式	Surface
1	词类	Major POS classification
2	拼音	pinyin

CC-CEDICT user dictionary, detailed version

Index	Name (Chinese)	Name (English)	Notes
0	表面形式	Surface
1	左语境ID	Left context ID
2	右语境ID	Right context ID
3	成本	Cost
4	词类	POS
5	词类1	POS subcategory 1
6	词类2	POS subcategory 2
7	词类3	POS subcategory 3
8	拼音	pinyin
9	繁体字	traditional
10	简体字	simplified
11	定义	definition
12	-	-	Columns from index 12 onward can be freely extended.
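
The detailed user dictionary tables differ mainly in how many fixed columns come before the free extension area: 13 for IPADIC and IPADIC NEologd, 21 for UniDic, and 12 for ko-dic and CC-CEDICT. The sketch below records those counts and checks a row against them. The `Kind` enum and `extension_columns` function are hypothetical names chosen for this example and are not the library's `DictionaryKind`.

```rust
/// The dictionary families described in this document.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Ipadic,
    IpadicNeologd,
    UniDic,
    KoDic,
    CcCedict,
}

impl Kind {
    /// Number of fixed columns in a detailed row, per the tables.
    fn fixed_columns(self) -> usize {
        match self {
            Kind::Ipadic | Kind::IpadicNeologd => 13,
            Kind::UniDic => 21,
            Kind::KoDic | Kind::CcCedict => 12,
        }
    }
}

/// Returns how many extension columns a detailed row carries, or an
/// error if the row is shorter than the fixed part for `kind`.
fn extension_columns(kind: Kind, line: &str) -> Result<usize, String> {
    let found = line.trim_end().split(',').count();
    let fixed = kind.fixed_columns();
    if found < fixed {
        return Err(format!("{kind:?} rows need at least {fixed} columns, found {found}"));
    }
    Ok(found - fixed)
}
```

A check like this catches rows written for one dictionary family and loaded under another, for instance a 13-column IPADIC row passed off as UniDic.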

Complete example code

use lindera_dictionary::{Dictionary, DictionaryKind};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create an IPADIC dictionary instance
    let dictionary = Dictionary::from_kind(DictionaryKind::IPADIC)?;

    // Tokenize the text with the dictionary
    let text = "日本語の形態素解析を行います";
    let tokens = dictionary.tokenize(text)?;

    // Print the tokenization result
    for token in tokens {
        println!("Surface: {}, POS: {:?}", token.text, token.detail);
    }

    Ok(())
}

use lindera_dictionary::{UserDictionary, DictionaryKind};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a user dictionary
    let user_dict = UserDictionary::from_csv(
        "path/to/user_dict.csv",
        DictionaryKind::IPADIC
    )?;

    // Tokenize the text with the user dictionary
    let text = "特定の専門用語を含む文章";
    let tokens = user_dict.tokenize(text)?;

    // Print the tokenization result
    for token in tokens {
        println!("Surface: {}, Reading: {:?}", token.text, token.reading);
    }

    Ok(())
}

use lindera_dictionary::{DictionaryBuilder, DictionaryKind};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create the dictionary with the builder pattern
    let dictionary = DictionaryBuilder::new()
        .kind(DictionaryKind::IPADIC)
        .build()?;

    // Tokenize the text
    let text = "Rustで日本語処理";
    let tokens = dictionary.tokenize(text)?;

    // Print detailed token information
    for token in tokens {
        println!(
            "表層形: {}, 品詞: {}, 原形: {}, 読み: {}",
            token.text,
            token.detail.get(4).unwrap_or(&"*".to_string()),
            token.detail.get(10).unwrap_or(&"*".to_string()),
            token.detail.get(11).unwrap_or(&"*".to_string())
        );
    }

    Ok(())
}

zlyuanteng (reply #1)

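Guide to using lindera-dictionary, a Rust tokenization dictionary library

Introduction

lindera-dictionary is an efficient Japanese tokenization dictionary library for Rust, built on the Lindera tokenizer. It provides dictionary management, supports multiple dictionary formats, and is optimized for Japanese text processing.

Key features

Supports multiple dictionary formats, including IPADIC, UniDic, and CC-CEDICT
Provides dictionary building, loading, and lookup
Memory-efficient and fast to load
Fully compatible with the Lindera tokenizer

Installation

Add the dependency to Cargo.toml: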

[dependencies]
lindera-dictionary = "0.10.0"

Basic usage

1. Loading a dictionary

use lindera_dictionary::Dictionary;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the IPADIC dictionary
    let dictionary = Dictionary::load("ipadic")?;

    // Tokenize with the dictionary
    // ... this can be combined with the lindera tokenizer here

    Ok(())
}

2. Dictionary lookup example

use lindera_dictionary::{Dictionary, DictionaryKind};

fn lookup_word() -> Result<(), Box<dyn std::error::Error>> {
    // Propagate a loading failure to the caller with `?`
    let dict = Dictionary::load_from_kind(DictionaryKind::IPADIC)?;

    // Look up information about a word
    if let Some(entry) = dict.get("日本語") {
        println!("Entry: {:?}", entry);
    }

    Ok(())
}

3. Building a custom dictionary

use lindera_dictionary::DictionaryBuilder;

fn build_custom_dictionary() -> Result<(), Box<dyn std::error::Error>> {
    // Create a dictionary builder
    let mut builder = DictionaryBuilder::new();

    // Add a custom entry
    builder.add_word(
        "Rust言語",      // surface form
        "rust,言語",     // reading
        "名詞,一般,*,*", // part-of-speech tags
        1000             // cost
    );

    // Build and save the dictionary
    builder.build("custom_dict")?;

    Ok(())
}

Advanced features

Adding entries in bulk

use lindera_dictionary::DictionaryBuilder;

fn batch_add_words() -> Result<(), Box<dyn std::error::Error>> {
    let mut builder = DictionaryBuilder::new();
    
    let words = vec![
        ("AI", "えーあい", "名詞,一般,*,*", 800),
        ("機械学習", "きかいがくしゅう", "名詞,サ変,*,*", 1200),
    ];
    
    for (surface, reading, pos, cost) in words {
        builder.add_word(surface, reading, pos, cost);
    }
    
    builder.build("tech_dict")?;
    Ok(())
}

Merging dictionaries

use lindera_dictionary::{Dictionary, DictionaryMerger};

fn merge_dictionaries() -> Result<(), Box<dyn std::error::Error>> {
    let dict1 = Dictionary::load("ipadic")?;
    let dict2 = Dictionary::load("custom_dict")?;

    let merger = DictionaryMerger::new();
    let merged_dict = merger.merge(&[dict1, dict2])?;

    // Save the merged dictionary
    merged_dict.save("merged_dict")?;

    Ok(())
}

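Performance tips

Preload the dictionary: load it into memory when the application starts
Choose a suitable dictionary format: IPADIC (standard) or UniDic (academic use), depending on your needs
Memory mapping: for large dictionaries, consider memory-mapped files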
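
The first tip, loading the dictionary once and reusing it, can be sketched with the standard library's `OnceLock`. The `LoadedDictionary` type and `load_dictionary` function below are stand-ins invented for this example; a real program would put its actual dictionary loading call in their place. The counter exists only to show that loading runs a single time.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Stand-in for a loaded dictionary (hypothetical type).
struct LoadedDictionary {
    name: String,
}

/// Counts how often the expensive load actually runs.
static LOAD_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Process-wide slot that is filled on first access.
static DICTIONARY: OnceLock<LoadedDictionary> = OnceLock::new();

/// Stand-in for the expensive loading step (hypothetical function).
fn load_dictionary() -> LoadedDictionary {
    LOAD_COUNT.fetch_add(1, Ordering::SeqCst);
    LoadedDictionary { name: "ipadic".to_string() }
}

/// Returns the shared dictionary, loading it on the first call only.
fn dictionary() -> &'static LoadedDictionary {
    DICTIONARY.get_or_init(load_dictionary)
}
```

Calling `dictionary()` early in `main` moves the loading cost to startup, and every later call, from any thread, gets the same instance without reloading.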

Error handling

use lindera_dictionary::Dictionary;

fn safe_dictionary_loading() {
    match Dictionary::load("ipadic") {
        Ok(_dict) => {
            // The dictionary loaded successfully
            println!("Dictionary loaded successfully");
        }
        Err(e) => {
            eprintln!("Failed to load dictionary: {}", e);
        }
    }
}

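Notes

Make sure the dictionary file path is correct
Consider the size of the dictionary files and their memory footprint
Update the dictionary regularly to pick up new vocabulary

This library gives Rust developers Japanese text processing capabilities and is particularly suited to applications that need high-performance Japanese tokenization.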
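
The first two notes can be checked before any loading is attempted, using only the standard library. The sketch below assumes the dictionary lives in a directory of files, which may not hold for every setup; `check_dictionary_dir` and `dictionary_size_bytes` are hypothetical helper names.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Verifies that `path` exists and is a directory.
fn check_dictionary_dir(path: &Path) -> Result<(), String> {
    if !path.exists() {
        return Err(format!("{} does not exist", path.display()));
    }
    if !path.is_dir() {
        return Err(format!("{} is not a directory", path.display()));
    }
    Ok(())
}

/// Sums the sizes of the regular files directly inside `path`,
/// as a rough estimate of what loading will read.
fn dictionary_size_bytes(path: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(path)? {
        let metadata = entry?.metadata()?;
        if metadata.is_file() {
            total += metadata.len();
        }
    }
    Ok(total)
}
```

Running both checks up front gives a clearer error message for a mistyped path than a failure deep inside the loading code, and the size estimate helps decide whether preloading is affordable.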

Complete example demo

use lindera_dictionary::{Dictionary, DictionaryBuilder, DictionaryKind, DictionaryMerger};
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Example 1: load a standard dictionary
    println!("=== Loading the IPADIC dictionary ===");
    let ipadic_dict = Dictionary::load_from_kind(DictionaryKind::IPADIC)?;

    // Look up a sample word
    if let Some(entry) = ipadic_dict.get("日本語") {
        println!("Lookup result: {:?}", entry);
    }

    // Example 2: build a custom dictionary
    println!("\n=== Building a custom dictionary ===");
    let mut builder = DictionaryBuilder::new();

    // Add custom entries
    builder.add_word(
        "Rust言語",      // surface form
        "rust,言語",     // reading
        "名詞,一般,*,*", // part-of-speech tags
        1000             // cost
    );

    builder.add_word(
        "AI開発",
        "えーあいかいはつ",
        "名詞,サ変,*,*",
        1500
    );

    // Build and save the custom dictionary
    builder.build("my_custom_dict")?;
    println!("Custom dictionary built");

    // Example 3: load the custom dictionary
    println!("\n=== Loading the custom dictionary ===");
    let custom_dict = Dictionary::load("my_custom_dict")?;

    // Look up a custom entry
    if let Some(entry) = custom_dict.get("Rust言語") {
        println!("Custom entry lookup result: {:?}", entry);
    }

    // Example 4: merge dictionaries
    println!("\n=== Merging dictionaries ===");
    let merger = DictionaryMerger::new();
    let merged_dict = merger.merge(&[ipadic_dict, custom_dict])?;

    // Save the merged dictionary
    merged_dict.save("merged_dictionary")?;
    println!("Dictionaries merged and saved");

    // Example 5: error handling
    println!("\n=== Error handling example ===");
    match Dictionary::load("nonexistent_dict") {
        Ok(_) => println!("Dictionary loaded successfully"),
        Err(e) => println!("Failed to load dictionary: {}", e),
    }

    Ok(())
}

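This complete example demonstrates the main features of the lindera-dictionary library:

Loading the standard IPADIC dictionary and looking up a word
Building a custom dictionary with custom entries
Loading and querying the custom dictionary
Merging multiple dictionaries
Handling errors

To run this example, make sure the correct dependency has been added to Cargo.toml and that the required dictionary files have been downloaded.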
