How can I use Python to count word frequencies in English API documentation (such as Javadoc)?

Original post: https://segmentfault.com/q/1010000010016451

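As the title says. The basic requirement: how do I count word frequencies for a set of English API documentation? (The documentation may be several HTML files or a CHM file, not a simple TXT file.)

A more advanced requirement: API documentation contains many class, function, and method names in which words are run together, so ideally these are split apart before counting (this involves English word segmentation).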

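More advanced still: a raw frequency count of a single document is not very useful on its own, so how should the counted words be processed further?

Remove simple words that mean little to a developer, such as the, are, to, is ...
Identify the computing terms and programming-language keywords among them (which depend on the language the documentation covers);
Annotate the final list of words with explanations (in Chinese) ...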

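If I were to build software with the features above, which technologies would be involved? Python? English word segmentation? Machine learning? Ideas are welcome.

Well, my real pain point is this: when I read English documentation I run into too many unfamiliar words and keep stopping to look them up, which is very inefficient. A tool that extracts and analyzes the vocabulary of a document would let me get roughly familiar with the words before reading. It would also help with naming things during development.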

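phonegap100 #1

Same pain point here.
I used to watch American TV shows this way: count the words in the subtitles, go through the vocabulary once, and then watch.
Subtitle files are plain text, though, so they are easy to process.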

h691938207 #2

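import html
import os
import re
from collections import Counter

def count_api_doc_word_frequency(doc_path, output_file='word_frequency.txt'):
    """
    Count word frequencies in API documentation.
    
    Args:
        doc_path: path to the documentation (a file or a directory)
        output_file: name of the output file
    """
    # Supported file extensions
    supported_extensions = {'.txt', '.java', '.js', '.py', '.md', '.html', '.xml'}
    
    def extract_text_from_file(file_path):
        """Extract the text content of a single file."""
        try:
            with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
                content = f.read()
            
            # Drop <script>/<style> blocks, whose contents are not prose
            content = re.sub(r'<(script|style)\b.*?</\1\s*>', ' ', content,
                             flags=re.DOTALL | re.IGNORECASE)
            
            # Strip HTML tags (for Javadoc and similar documentation)
            content = re.sub(r'<[^>]+>', ' ', content)
            
            # Decode HTML entities so that &nbsp; or &amp; are not counted as words
            content = html.unescape(content)
            
            # Remove fenced code blocks (```code``` format)
            content = re.sub(r'```.*?```', ' ', content, flags=re.DOTALL)
            
            # Remove inline code spans
            content = re.sub(r'`[^`]+`', ' ', content)
            
            # Remove URLs
            content = re.sub(r'http[s]?://\S+', ' ', content)
            
            # Remove digits and special characters, keeping only letters and hyphens
            content = re.sub(r'[^a-zA-Z\s-]', ' ', content)
            
            # Collapse runs of whitespace into a single space
            content = re.sub(r'\s+', ' ', content)
            
            return content.lower()  # lowercase so that counting is case-insensitive
            
        except Exception as e:
            print(f"Error reading file {file_path}: {e}")
            return ""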
    
    # Collect all files
    all_files = []
    if os.path.isfile(doc_path):
        all_files.append(doc_path)
    elif os.path.isdir(doc_path):
        for root, dirs, files in os.walk(doc_path):
            for file in files:
                # Compare extensions case-insensitively (e.g. INDEX.HTML)
                if os.path.splitext(file)[1].lower() in supported_extensions:
                    all_files.append(os.path.join(root, file))
    
    if not all_files:
        print("No supported documentation files found")
        return
    
    # Concatenate the text of all files
    all_text = ""
    for file_path in all_files:
        print(f"Processing: {file_path}")
        all_text += extract_text_from_file(file_path) + " "
    
    # Split into words
    words = re.findall(r'\b[a-zA-Z][a-zA-Z-]*\b', all_text)
    
    # Filter common stop words (extend as needed)
    stop_words = {
        'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
        'of', 'with', 'by', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
        'this', 'that', 'these', 'those', 'it', 'they', 'we', 'you', 'he', 'she',
        'get', 'set', 'void', 'return', 'public', 'private', 'protected', 'class',
        'interface', 'extends', 'implements', 'throws', 'exception', 'error'
    }
    
    filtered_words = [word for word in words if word not in stop_words and len(word) > 2]
    
    # Count word frequencies
    word_counts = Counter(filtered_words)
    
    # Sort by frequency, most common first
    sorted_words = word_counts.most_common()
    
    # Write the results
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write("API documentation word frequencies:\n")
        f.write("=" * 50 + "\n")
        f.write(f"Documents: {len(all_files)}\n")
        f.write(f"Total words: {len(filtered_words)}\n")
        f.write(f"Unique words: {len(sorted_words)}\n")
        f.write("=" * 50 + "\n\n")
        
        for word, count in sorted_words:
            f.write(f"{word}: {count}\n")
    
    print(f"Done! Results saved to {output_file}")
    print("Top 10 most frequent words:")
    for word, count in sorted_words[:10]:
        print(f"  {word}: {count}")

# Usage example
if __name__ == "__main__":
    # Example 1: a single file
    # count_api_doc_word_frequency('api_documentation.txt')
    
    # Example 2: a whole directory
    count_api_doc_word_frequency('./docs/', 'api_word_freq.txt')

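This approach handles text cleaning and word counting directly: regular expressions strip code and tags, and Counter does the counting. It is suited to a quick look at the keywords in a set of API documentation.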

eggper #3

墨墨背单词 (a vocabulary-memorization app) has a word-book feature that can roughly extract a list of words.

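h691938207 #4

Whether the input is several HTML files, a single CHM file, or a TXT file, you first need to convert it into a plain-text stream of words. That may require preprocessing such as stripping HTML tags, unpacking the CHM file, and removing irrelevant content.
Ready-made tools for English tokenization exist; a quick search will turn them up.
Stop words such as the, are, to, is are covered by existing stop-word lists; just use one of those.
You may need to compile part of the technical vocabulary yourself; for keywords, refer to the language specification.
Annotating words with explanations requires an open dictionary API.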
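
As a rough illustration of the first step (HTML to a plain-text word stream), here is a minimal sketch built on the standard library's html.parser. The names TextExtractor and html_to_text are made up for this example, and it assumes the page is already loaded into a string; CHM files would have to be unpacked into HTML by some other tool first.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &lt; and &amp; in text.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(markup):
    """Return the visible text of an HTML string as one line of words."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<p>Returns a <code>List&lt;String&gt;</code> of names.</p>"
    "<script>var x = 1;</script></body></html>"
)
plain = html_to_text(sample)
```

Because the pieces are joined with spaces, a word broken up by inline tags would come out as two tokens; for documentation pages that is usually an acceptable trade-off.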

sinazl #5

tfidf
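
For anyone unfamiliar with the term: TF-IDF weights a word by how often it appears in one document and how rare it is across all documents, so words common to every page (the, returns) score low. Below is a minimal pure-Python sketch, assuming each documentation page has already been reduced to a list of lowercase words. The function name tf_idf and the sample file names are invented for illustration, and tf = count / length with idf = log(N / df) is only one of several common variants.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs maps a document name to its list of lowercase words.

    Returns {doc_name: {word: score}} with tf = count / len(doc)
    and idf = log(N / df).
    """
    n_docs = len(docs)

    # Document frequency: the number of documents each word appears in.
    df = Counter()
    for words in docs.values():
        df.update(set(words))

    scores = {}
    for name, words in docs.items():
        counts = Counter(words)
        total = len(words)
        scores[name] = {
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        }
    return scores


docs = {
    "List.html": "returns the list of elements in the list".split(),
    "Map.html": "returns the value mapped to the key".split(),
}
scores = tf_idf(docs)
# Words present in every document get a score of 0; page-specific words rank higher.
top_for_list = max(scores["List.html"], key=scores["List.html"].get)
```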

gougou168 #6

nltk

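yibo5220 #7

Here is the simplest implementation I could think of.

Maintain a text file with one word per line as the exclusion list.

Then use a regex to extract every word on the page:

([a-zA-Z]+((’|-)[a-zA-Z]+)?)
This matches all the words in
I’m a google-based programmer.

Then check whether each word is in the exclusion list; what remains is the list of words that need looking up.

Then call an API to look up each word once and store the results in a dict.

easy job!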
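
The steps above could be sketched roughly as follows. The helper names (words_to_look_up, build_glossary) are made up, the exclusion list is a small in-memory set standing in for the one-word-per-line file, and the dictionary API is replaced by a local dict because no particular API is named here.

```python
import re

# The pattern from this reply with the second character class written as
# [a-zA-Z]; it also accepts a straight apostrophe besides the typographic one.
WORD_RE = re.compile(r"[a-zA-Z]+(?:['’-][a-zA-Z]+)?")


def words_to_look_up(text, excluded):
    """Return the distinct words of text not in excluded, in first-seen order."""
    seen = set()
    result = []
    for match in WORD_RE.finditer(text):
        word = match.group(0).lower()
        if word not in excluded and word not in seen:
            seen.add(word)
            result.append(word)
    return result


def build_glossary(words, lookup):
    """Call lookup once per word and keep the answers in a dict."""
    return {word: lookup(word) for word in words}


# Stand-in for the exclusion file (one known word per line).
excluded = {"i’m", "a"}
todo = words_to_look_up("I’m a google-based programmer.", excluded)

# Hypothetical stand-in for a real dictionary API call.
fake_dictionary = {"google-based": "基于 Google 的", "programmer": "程序员"}
glossary = build_glossary(todo, lambda w: fake_dictionary.get(w, "?"))
```

Swapping the lambda for a function that calls a real dictionary service is the only part that depends on which API you pick.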

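zlyuanteng #8

爱英阅, which I developed, seems to roughly cover what the OP needs ^-^: http://iyingyue.net/iyingyue/index.html
CHM documents can be converted to PDF first and the text extracted from that.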

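itying888 #9

Since I can't edit the question, here is a correction:
Splitting run-together words is indeed not word segmentation; I got that wrong earlier.

For the updated question, see the original post: https://segmentfault.com/q/1010000010016451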

bupafengyu #10

Any decent programmer uses one of -, _, or letter case to separate the words in a name, right?
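
If names do follow one of those three conventions, splitting them takes a couple of regular expressions. A minimal sketch (split_identifier is an invented name, and names that follow none of the conventions, such as "hashcode", stay in one piece):

```python
import re

# Break points inside a chunk: a lowercase letter or digit followed by an
# uppercase letter (getUser -> get|User), or an uppercase run followed by an
# uppercase-then-lowercase pair (HTTPServer -> HTTP|Server).
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def split_identifier(name):
    """Split an identifier on -, _ and case changes; return lowercase parts."""
    parts = []
    for chunk in re.split(r"[-_]+", name):
        if chunk:
            spaced = _CAMEL_BOUNDARY.sub(" ", chunk)
            parts.extend(piece.lower() for piece in spaced.split())
    return parts


examples = {
    name: split_identifier(name)
    for name in ("getUserName", "HTTPServerError", "max_retry-count")
}
```

Note that this has to run before any lowercasing step, since the case changes are the only word boundaries a camelCase name carries.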

songsunli #11

Export to plain text. Analyze only the body text.

bupafengyu #12

Download it first, save it as TXT, run a word count over it with Hadoop, then filter the words by hand.
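
For reference, the word-count job has a simple map/reduce shape. The sketch below only simulates it locally in one process (it is not Hadoop itself, and the function names are invented); under Hadoop Streaming the mapper and reducer would be separate scripts reading standard input, with the framework doing the sort between them.

```python
import re
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map phase: emit (word, 1) for every word in every input line."""
    for line in lines:
        for word in re.findall(r"[a-z]+", line.lower()):
            yield word, 1


def reducer(sorted_pairs):
    """Reduce phase: sum the counts of each run of identical keys."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    # sorted() plays the role of the shuffle/sort step between the phases.
    return dict(reducer(sorted(mapper(lines))))


counts = word_count(["Returns the list", "the list is empty"])
```

For a single set of documentation, collections.Counter on one machine gives the same numbers with far less setup.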