Using zipf, a Rust Zipf distribution generator crate, for efficient word-frequency statistics and data analysis in natural language processing


[!CAUTION] This crate is deprecated. Use rand_distr::Zipf instead.

rust-zipf

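A fast, discrete, bounded Zipf-distributed random number generator implemented in Rust. Compared with the implementation provided by randomkit (which binds to NumPy's fork of RandomKit), this crate is described as roughly twice as fast. In the benchmark run below, where bench_us is this crate, the gap is closer to 5x (68 ns versus 339 ns per iteration):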

$ cargo +nightly bench
test tests::bench_randomkit ... bench:         339 ns/iter (+/- 18)
test tests::bench_us        ... bench:          68 ns/iter (+/- 1)
test tests::bench_threadrng ... bench:          11 ns/iter (+/- 0)

It is driven by a Rust random number generator and is itself exposed through rand's Distribution trait.

The implementation is effectively a direct port of Apache Commons' RejectionInversionZipfSampler, which is written in Java. It is based on the method described by Wolfgang Hörmann and Gerhard Derflinger in "Rejection-inversion to generate variates from monotone discrete distributions", ACM Transactions on Modeling and Computer Simulation (TOMACS) 6.3 (1996).
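To give a feel for the method named above, here is a minimal std-only sketch of rejection-inversion sampling for a bounded Zipf distribution. It is written from the general shape of the published algorithm and is not copied from the crate's source. The names `SplitMix64`, `ZipfSketch`, and the helper functions are illustrative assumptions, and the hand-rolled generator only stands in for a real RNG.

```rust
// Sketch of rejection-inversion sampling for a bounded Zipf distribution.
// All names here are illustrative; this is not the zipf crate's API.

/// Small SplitMix64 generator so the sketch needs no external crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f64 in [0, 1).
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// (e^x - 1) / x, with a series expansion near zero.
fn expm1_over_x(x: f64) -> f64 {
    if x.abs() > 1e-8 {
        x.exp_m1() / x
    } else {
        1.0 + x * 0.5 * (1.0 + x / 3.0 * (1.0 + 0.25 * x))
    }
}

/// ln(1 + x) / x, with a series expansion near zero.
fn ln1p_over_x(x: f64) -> f64 {
    if x.abs() > 1e-8 {
        x.ln_1p() / x
    } else {
        1.0 - x * (0.5 - x * (1.0 / 3.0 - 0.25 * x))
    }
}

/// h(x) = x^(-s): the unnormalised weight of rank x.
fn h(x: f64, s: f64) -> f64 {
    (-s * x.ln()).exp()
}

/// H(x) = (x^(1-s) - 1) / (1 - s), which tends to ln(x) as s approaches 1.
fn h_integral(x: f64, s: f64) -> f64 {
    let log_x = x.ln();
    expm1_over_x((1.0 - s) * log_x) * log_x
}

/// Inverse of `h_integral`.
fn h_integral_inv(y: f64, s: f64) -> f64 {
    let t = (y * (1.0 - s)).max(-1.0);
    (ln1p_over_x(t) * y).exp()
}

struct ZipfSketch {
    n: f64,
    s: f64,
    h_x1: f64,
    h_n: f64,
    shortcut: f64,
}

impl ZipfSketch {
    fn new(n: usize, s: f64) -> Self {
        assert!(n >= 1 && s > 0.0, "need n >= 1 and s > 0");
        ZipfSketch {
            n: n as f64,
            s,
            h_x1: h_integral(1.5, s) - 1.0,
            h_n: h_integral(n as f64 + 0.5, s),
            shortcut: 2.0 - h_integral_inv(h_integral(2.5, s) - h(2.0, s), s),
        }
    }

    /// Draws a rank in 1..=n.
    fn sample(&self, rng: &mut SplitMix64) -> usize {
        loop {
            // Pick u uniformly under the hat function, then invert to x.
            let u = self.h_n + rng.next_f64() * (self.h_x1 - self.h_n);
            let x = h_integral_inv(u, self.s);
            // Round x to the nearest rank, clamped to the valid range.
            let k = (x + 0.5).floor().clamp(1.0, self.n);
            // Accept via the cheap shortcut test, or via the exact test.
            if k - x <= self.shortcut || u >= h_integral(k + 0.5, self.s) - h(k, self.s) {
                return k as usize;
            }
        }
    }
}

fn main() {
    let zipf = ZipfSketch::new(100, 1.0);
    let mut rng = SplitMix64(42);
    let draws: Vec<usize> = (0..10).map(|_| zipf.sample(&mut rng)).collect();
    println!("{:?}", draws);
}
```

The appeal of this approach is that setup cost and memory stay constant regardless of n, whereas a table-based inverse-CDF sampler needs a table of n cumulative probabilities.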

Installation

Run the following Cargo command in your project directory:

cargo add zipf

Or add the following line to Cargo.toml:

zipf = "7.0.2"

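Example code

The following complete example shows how to generate Zipf-distributed samples with the zipf crate and use them for word-frequency counting: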

use rand::distributions::Distribution; // brings `sample` into scope
use rand::thread_rng;
use zipf::ZipfDistribution;

fn main() {
    // Create a Zipf distribution with exponent s = 1.0 over ranks 1..=100
    let zipf = ZipfDistribution::new(100, 1.0).unwrap();
    let mut rng = thread_rng();

    // Simulated word-frequency table, one slot per rank
    let mut freq = vec![0; 100];
    let total_samples = 100_000;

    // Draw Zipf-distributed samples
    for _ in 0..total_samples {
        let sample = zipf.sample(&mut rng);
        freq[sample - 1] += 1; // samples are 1-based
    }

    // Print the frequencies of the first 10 ranks
    println!("Rank\tFrequency\tRelative Frequency");
    for (rank, count) in freq.iter().take(10).enumerate() {
        println!(
            "{}\t{}\t\t{:.4}%",
            rank + 1,
            count,
            (*count as f64 / total_samples as f64) * 100.0
        );
    }

    // Check the counts against Zipf's law; with s = 1.0 the expected
    // count at rank r is the rank-1 count divided by r
    println!("\nChecking Zipf's law:");
    for rank in 0..5 {
        let expected = (freq[0] as f64) / ((rank + 2) as f64);
        println!(
            "Rank {}: actual {} vs expected {:.1}",
            rank + 2,
            freq[rank + 1],
            expected
        );
    }
}

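This example:

  1. Creates a Zipf distribution with exponent s=1.0 and maximum n=100
  2. Draws 100,000 random samples
  3. Counts how often each "word" occurs
  4. Prints the 10 most frequent words with their frequencies
  5. Checks whether the result follows Zipf's law (the frequency of the n-th most frequent word is proportional to 1/n)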
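Step 5 compares a handful of ranks by eye. As a rough complement, the exponent can be estimated from the whole count table with a least-squares fit of ln(count) against ln(rank). The sketch below is std-only; `estimate_exponent` is a hypothetical helper, not part of the zipf crate, and this simple fit is only a coarse estimator that sampling noise in the low-count tail can skew.

```rust
/// Estimates the Zipf exponent s from counts indexed by rank (index 0 = rank 1)
/// as the negated least-squares slope of ln(count) against ln(rank).
/// Returns None when fewer than two ranks have a non-zero count.
fn estimate_exponent(counts: &[u64]) -> Option<f64> {
    // Keep only ranks with a non-zero count, since ln(0) is undefined.
    let mut points: Vec<(f64, f64)> = Vec::new();
    for (i, &c) in counts.iter().enumerate() {
        if c > 0 {
            points.push((((i + 1) as f64).ln(), (c as f64).ln()));
        }
    }
    if points.len() < 2 {
        return None;
    }

    let m = points.len() as f64;
    let mean_x = points.iter().map(|p| p.0).sum::<f64>() / m;
    let mean_y = points.iter().map(|p| p.1).sum::<f64>() / m;

    let mut sxy = 0.0;
    let mut sxx = 0.0;
    for (x, y) in &points {
        sxy += (x - mean_x) * (y - mean_y);
        sxx += (x - mean_x) * (x - mean_x);
    }

    // Under Zipf's law ln(count) = c - s * ln(rank), so s is minus the slope.
    Some(-sxy / sxx)
}

fn main() {
    // Synthetic counts that follow count(r) = 1e6 / r^1.5 up to rounding.
    let counts: Vec<u64> = (1..=50)
        .map(|r| (1e6 / (r as f64).powf(1.5)).round() as u64)
        .collect();
    match estimate_exponent(&counts) {
        Some(s) => println!("estimated exponent: {:.3}", s),
        None => println!("not enough data"),
    }
}
```

Passing the `freq` table from the example above (converted to `u64`) should give a value near the exponent the distribution was built with, provided enough samples were drawn.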

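Notes

Although this crate performs well, it has been marked as deprecated; rand_distr::Zipf is the recommended replacement.

Complete example code

The following fuller example shows how to use the zipf crate for word-frequency statistics in natural language processing: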

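use rand::distributions::Distribution; // brings `sample` into scope
use rand::thread_rng;
use zipf::ZipfDistribution;
use std::collections::HashMap;

fn main() {
    // A small stand-in vocabulary of common English words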
    let vocabulary: Vec<&str> = vec![
        "the", "be", "to", "of", "and", "a", "in", "that", "have", "I",
        "it", "for", "not", "on", "with", "he", "as", "you", "do", "at",
        "this", "but", "his", "by", "from", "they", "we", "say", "her", "she",
        "or", "an", "will", "my", "one", "all", "would", "there", "their", "what",
        "so", "up", "out", "if", "about", "who", "get", "which", "go", "me",
        "when", "make", "can", "like", "time", "no", "just", "him", "know", "take",
        "people", "into", "year", "your", "good", "some", "could", "them", "see", "other",
        "than", "then", "now", "look", "only", "come", "its", "over", "think", "also",
        "back", "after", "use", "two", "how", "our", "work", "first", "well", "way",
        "even", "new", "want", "because", "any", "these", "give", "day", "most", "us"
    ];

    // Create a Zipf distribution with exponent s = 1.5 over the vocabulary
    let exponent = 1.5;
    let zipf = ZipfDistribution::new(vocabulary.len(), exponent).unwrap();
    let mut rng = thread_rng();

    // Word-frequency table
    let mut freq = HashMap::new();
    let total_samples = 50_000;

    // Draw Zipf-distributed samples and count them
    for _ in 0..total_samples {
        let sample = zipf.sample(&mut rng) - 1; // convert to a 0-based index
        let word = vocabulary[sample];
        *freq.entry(word).or_insert(0) += 1;
    }

    // Convert to a vector and sort by count, descending
    let mut freq_vec: Vec<(&str, u32)> = freq.into_iter().collect();
    freq_vec.sort_by(|a, b| b.1.cmp(&a.1));

    // Print the 20 most frequent words
    println!("{:<6} {:<15} {:<10} {}", "Rank", "Word", "Count", "Percentage");
    for (i, (word, count)) in freq_vec.iter().take(20).enumerate() {
        let percentage = (*count as f64 / total_samples as f64) * 100.0;
        println!("{:<6} {:<15} {:<10} {:.2}%", i + 1, word, count, percentage);
    }

    // Check Zipf's law: the expected count at rank r is the
    // rank-1 count divided by r^s, using the same exponent as above
    println!("\nChecking Zipf's law:");
    for i in 0..5 {
        let rank = i + 1;
        let actual = freq_vec[i].1 as f64;
        let expected = freq_vec[0].1 as f64 / (rank as f64).powf(exponent);
        println!(
            "Rank {}: '{}' - actual {} vs expected {:.1}",
            rank, freq_vec[i].0, actual, expected
        );
    }
}

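This extended example:

  1. Builds a vocabulary of real English words
  2. Simulates word frequencies with a Zipf distribution (s=1.5, a steeper falloff than the s close to 1 usually reported for natural-language text)
  3. Counts word frequencies with a HashMap
  4. Sorts the result and shows the 20 most frequent words
  5. Checks whether the result follows Zipf's law
  6. Prints each word's count and percentage

The output shows the typical shape of a Zipf distribution: a few high-frequency words account for most occurrences, while a large number of low-frequency words appear only rarely.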
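How concentrated the mass is can be worked out directly from the distribution, without sampling. The std-only sketch below computes the share of all occurrences that falls on the top k ranks for a bounded Zipf distribution; `top_k_share` is a hypothetical helper introduced only for illustration.

```rust
/// Fraction of the total probability mass carried by ranks 1..=k
/// in a Zipf distribution over ranks 1..=n with exponent s.
fn top_k_share(n: usize, s: f64, k: usize) -> f64 {
    // Unnormalised weight of rank r is r^(-s).
    let weights: Vec<f64> = (1..=n).map(|r| (r as f64).powf(-s)).collect();
    let total: f64 = weights.iter().sum();
    let top: f64 = weights.iter().take(k).sum();
    top / total
}

fn main() {
    // Same parameters as the extended example: 100 words, s = 1.5.
    for k in [1, 5, 10, 20] {
        println!("top {:>2} ranks: {:.1}%", k, top_k_share(100, 1.5, k) * 100.0);
    }
}
```

With the extended example's parameters, the top 10 of 100 words carry well over three quarters of all draws, which is why the printed top-20 table covers nearly every sample.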


1 reply

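A guide to using zipf, the Rust Zipf distribution generator crate

Introduction

zipf is a Rust crate for generating random numbers that follow a Zipf distribution. The Zipf distribution is particularly useful in natural language processing because it models phenomena such as word-frequency distributions.

Zipf's law states that, in a natural-language corpus, the frequency of a word is inversely proportional to its rank in the frequency table. This makes the crate useful for word-frequency statistics, load testing, data analysis, and similar scenarios.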
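The inverse-proportionality statement can be made concrete. For a bounded Zipf distribution over ranks 1..=n with exponent s, the probability of rank k is k^(-s) divided by the sum of r^(-s) over all ranks. The std-only sketch below computes that table; `zipf_pmf` is a hypothetical helper and not a function of the zipf crate.

```rust
/// Probability of each rank 1..=n under a bounded Zipf distribution
/// with exponent s. Index 0 holds the probability of rank 1.
fn zipf_pmf(n: usize, s: f64) -> Vec<f64> {
    let weights: Vec<f64> = (1..=n).map(|r| (r as f64).powf(-s)).collect();
    // The normalising constant is the generalised harmonic number H(n, s).
    let norm: f64 = weights.iter().sum();
    weights.into_iter().map(|w| w / norm).collect()
}

fn main() {
    let pmf = zipf_pmf(100, 1.0);
    // Expected counts for a run of 100,000 samples.
    for (i, p) in pmf.iter().take(5).enumerate() {
        println!("rank {}: p = {:.4}, expected count = {:.0}", i + 1, p, p * 100_000.0);
    }
}
```

Multiplying each probability by the number of samples gives the expected count per rank, which is a more direct baseline for checking simulated frequencies than dividing the observed rank-1 count.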

Installation

Add the dependencies to Cargo.toml:

[dependencies]
zipf = "7.0.0"
rand = "0.8.5"

Basic usage

1. Creating a Zipf distribution generator

use zipf::ZipfDistribution;
use rand::distributions::Distribution;

fn main() {
    // Create a Zipf distribution over 100 elements with exponent 1.03
    let zipf = ZipfDistribution::new(100, 1.03).unwrap();

    // Draw 10 random numbers
    for _ in 0..10 {
        let sample = zipf.sample(&mut rand::thread_rng());
        println!("{}", sample);
    }
}

2. Simulating word frequencies for natural language processing

use zipf::ZipfDistribution;
use rand::distributions::Distribution;

fn simulate_word_frequencies() {
    // Assume a vocabulary of 10,000 distinct words
    let num_words = 10_000;
    let zipf = ZipfDistribution::new(num_words, 1.

## Complete example demo: Zipf distribution simulation and word-frequency analysis

```rust
use zipf::ZipfDistribution;
use rand::distributions::Distribution;
use std::collections::{BTreeMap, HashMap};

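fn main() {
    // Example 1: basic Zipf sampling
    basic_zipf_example();

    // Example 2: simulating a word-frequency distribution
    word_frequency_simulation();

    // Example 3: data analysis
    data_analysis_application();

    // Example 4: advanced usage with a seeded generator
    seeded_generator_example();

    // Example 5: practical use, text generation and analysis
    text_generation_and_analysis();
}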

fn basic_zipf_example() {
    println!("\n=== Basic Zipf distribution example ===");
    let zipf = ZipfDistribution::new(20, 1.03).unwrap();

    println!("20 Zipf-distributed random numbers:");
    for _ in 0..20 {
        print!("{} ", zipf.sample(&mut rand::thread_rng()));
    }
    println!();
}

fn word_frequency_simulation() {
    println!("\n=== Word-frequency simulation ===");
    let num_words = 50;
    let zipf = ZipfDistribution::new(num_words, 1.0).unwrap();

    // Samples are 1-based, so subtract 1 to index the count table
    let mut word_counts = vec![0; num_words];
    for _ in 0..1000 {
        let idx = zipf.sample(&mut rand::thread_rng()) - 1;
        word_counts[idx as usize] += 1;
    }

    println!("Top 10 most frequent words:");
    let mut words: Vec<_> = word_counts.iter().enumerate().collect();
    words.sort_by(|a, b| b.1.cmp(a.1));

    for (i, (word_idx, count)) in words.iter().take(10).enumerate() {
        println!("{}. word #{}: {} occurrences", i+1, word_idx+1, count);
    }
}

fn data_analysis_application() {
    println!("\n=== Data analysis ===");
    let num_items = 30;
    let exponent = 1.2;
    let zipf = ZipfDistribution::new(num_items, exponent).unwrap();

    let mut frequency = HashMap::new();
    for _ in 0..500 {
        let item = zipf.sample(&mut rand::thread_rng());
        *frequency.entry(item).or_insert(0) += 1;
    }

    // Items that were never drawn are reported with a count of 0
    println!("Item\tFrequency");
    for i in 1..=num_items {
        println!("{}\t{}", i, frequency.get(&i).unwrap_or(&0));
    }
}

fn seeded_generator_example() {
    println!("\n=== Seeded generator example ===");
    use rand::rngs::StdRng;
    use rand::SeedableRng;

    let mut rng = StdRng::seed_from_u64(12345);
    let zipf = ZipfDistribution::new(15, 1.0).unwrap();

    println!("Reproducible random sequence:");
    for _ in 0..10 {
        print!("{} ", zipf.sample(&mut rng));
    }
    println!();
}

fn text_generation_and_analysis() {
    println!("\n=== Text generation and analysis ===");

    // Build a vocabulary of the 26 single-letter "words" a..z
    let vocab: Vec<String> = (1..=26)
        .map(|i| (b'a' + i as u8 - 1) as char)
        .map(|c| c.to_string())
        .collect();

    // Create the Zipf-based analyzer
    let analyzer = ZipfTextAnalyzer::new(
        vocab.len(), 
        1.03, 
        vocab.clone()
    );

    // Generate simulated text
    let text = analyzer.generate_text(200);
    println!("Generated text excerpt:\n{}\n", text.split_whitespace().take(30).collect::<Vec<_>>().join(" "));

    // Analyze word frequencies; BTreeMap iterates in key order,
    // so this lists the first 10 words alphabetically
    let freqs = analyzer.analyze_frequencies(&text);
    println!("Word frequencies:");
    for (i, (word, count)) in freqs.iter().take(10).enumerate() {
        println!("{}. {}: {}", i+1, word, count);
    }
}

struct ZipfTextAnalyzer {
    zipf: ZipfDistribution,
    vocabulary: Vec<String>,
}

impl ZipfTextAnalyzer {
    fn new(num_words: usize, exponent: f64, vocab: Vec<String>) -> Self {
        assert_eq!(num_words, vocab.len(), "vocabulary size must match");
        Self {
            zipf: ZipfDistribution::new(num_words, exponent).unwrap(),
            vocabulary: vocab,
        }
    }
    
    fn generate_text(&self, length: usize) -> String {
        let mut rng = rand::thread_rng();
        (0..length)
            .map(|_| {
                let idx = self.zipf.sample(&mut rng) - 1;
                self.vocabulary[idx as usize].clone()
            })
            .collect::<Vec<_>>()
            .join(" ")
    }
    
    fn analyze_frequencies(&self, text: &str) -> BTreeMap<String, usize> {
        let mut freqs = BTreeMap::new();
        for word in text.split_whitespace() {
            *freqs.entry(word.to_string()).or_insert(0) += 1;
        }
        freqs
    }
}
```

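This complete example demonstrates the main features of the zipf crate:

  1. Basic generation of Zipf-distributed random numbers
  2. Simulating and analyzing a word-frequency distribution
  3. A data analysis application
  4. Reproducible results with a seeded generator
  5. A complete text generation and analysis workflow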
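One detail of the workflow in item 5: `analyze_frequencies` returns a `BTreeMap`, which iterates in key order, not by count. When a true frequency ranking is wanted, the counts can be collected into a vector and sorted. The std-only sketch below shows one way to do that; `top_k_words` is a hypothetical helper and not part of the demo above.

```rust
use std::collections::HashMap;

/// Counts whitespace-separated words and returns the k most frequent,
/// sorted by count descending, with ties broken alphabetically.
fn top_k_words(text: &str, k: usize) -> Vec<(String, usize)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for word in text.split_whitespace() {
        *counts.entry(word).or_insert(0) += 1;
    }

    let mut ranked: Vec<(String, usize)> = counts
        .into_iter()
        .map(|(word, count)| (word.to_string(), count))
        .collect();
    // The alphabetical tie-break keeps the output deterministic,
    // since HashMap iteration order is unspecified.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked.truncate(k);
    ranked
}

fn main() {
    let text = "a b a c b a d";
    for (rank, (word, count)) in top_k_words(text, 3).iter().enumerate() {
        println!("{}. {}: {}", rank + 1, word, count);
    }
}
```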

You can adjust the parameters (vocabulary size, exponent, number of samples, and so on) to explore how the Zipf distribution behaves.
