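Using prose, a Go library for English text processing and named-entity recognition

prose is a natural language processing library written in pure Go (currently English only). It supports tokenization, sentence segmentation, part-of-speech tagging, and named-entity extraction.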

Installation

$ go get github.com/jdkato/prose/v2

Usage examples

Basic usage

package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document with the default configuration
    doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the document's tokens
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag, tok.Label)
        // Go NNP B-GPE
        // is VBZ O
        // an DT O
        // ...
    }

    // Iterate over the document's named entities
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Go GPE
        // Google GPE
    }

    // Iterate over the document's sentences
    for _, sent := range doc.Sentences() {
        fmt.Println(sent.Text)
        // Go is an open-source programming language created at Google.
    }
}
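
The token labels printed above (`B-GPE`, `O`) suggest a BIO-style scheme, in which an entity starts at a `B-` token and presumably continues through following `I-` tokens. The sketch below shows how such labels could be grouped into entity spans. It is illustrative only: the `token` and `entity` types and `groupEntities` are hypothetical stand-ins rather than prose's API, the `I-` prefix is an assumption (only `B-GPE` and `O` appear in the output above), and the sample labels are hand-written.

```go
package main

import (
	"fmt"
	"strings"
)

// token is a hypothetical stand-in for a labeled token (not prose's type).
type token struct {
	Text  string
	Label string // "O", "B-<TYPE>", or (assumed) "I-<TYPE>"
}

// entity is a hypothetical stand-in for an extracted entity span.
type entity struct {
	Text  string
	Label string
}

// groupEntities merges each B-<TYPE> token with the I-<TYPE> tokens that
// immediately follow it into a single entity.
func groupEntities(toks []token) []entity {
	var ents []entity
	open := false // whether the last entity in ents can still be extended
	for _, t := range toks {
		switch {
		case strings.HasPrefix(t.Label, "B-"):
			ents = append(ents, entity{Text: t.Text, Label: t.Label[2:]})
			open = true
		case strings.HasPrefix(t.Label, "I-") && open &&
			ents[len(ents)-1].Label == t.Label[2:]:
			ents[len(ents)-1].Text += " " + t.Text
		default:
			open = false
		}
	}
	return ents
}

func main() {
	// Hand-written labels for illustration, not real model output.
	toks := []token{
		{"Lebron", "B-PERSON"}, {"James", "I-PERSON"}, {"plays", "O"},
		{"basketball", "O"}, {"in", "O"},
		{"Los", "B-GPE"}, {"Angeles", "I-GPE"}, {".", "O"},
	}
	for _, e := range groupEntities(toks) {
		fmt.Println(e.Text, e.Label)
	}
}
```

In practice `doc.Entities()` already returns grouped spans, so a helper like this is only useful for understanding how per-token labels relate to entities.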

Document creation follows this sequence of steps:

tokenization -> POS tagging -> NE extraction
            \
             segmentation

Disabling specific steps

A specific step can be disabled by passing the corresponding functional option:

doc, err := prose.NewDocument(
        "Go is an open-source programming language created at Google.",
        prose.WithExtraction(false))
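
As a rough illustration of how boolean options such as `prose.WithExtraction(false)` can switch pipeline steps on and off, here is a minimal sketch of the functional-options pattern using only the standard library. It is not prose's implementation: `opts`, `option`, `withSegmentation`, `withExtraction`, and `enabledSteps` are hypothetical names, and the only thing taken from the text above is the list of steps in the diagram.

```go
package main

import "fmt"

// opts records which pipeline steps are enabled (hypothetical type).
type opts struct {
	segment, tokenize, tag, extract bool
}

// option is a function that adjusts opts.
type option func(*opts)

// withSegmentation and withExtraction are hypothetical option constructors.
func withSegmentation(on bool) option { return func(o *opts) { o.segment = on } }
func withExtraction(on bool) option   { return func(o *opts) { o.extract = on } }

// enabledSteps starts with every step on, applies the given options in
// order, and returns the names of the steps that remain enabled.
func enabledSteps(options ...option) []string {
	o := opts{segment: true, tokenize: true, tag: true, extract: true}
	for _, apply := range options {
		apply(&o)
	}
	var steps []string
	if o.tokenize {
		steps = append(steps, "tokenization")
	}
	if o.tag {
		steps = append(steps, "POS tagging")
	}
	if o.extract {
		steps = append(steps, "NE extraction")
	}
	if o.segment {
		steps = append(steps, "segmentation")
	}
	return steps
}

func main() {
	fmt.Printf("%q\n", enabledSteps())
	fmt.Printf("%q\n", enabledSteps(withExtraction(false)))
}
```

The appeal of this design is that callers who want the defaults pass nothing, and each option reads as a self-describing argument at the call site.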

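Tokenizing

prose includes a tokenizer capable of handling modern text, including the following non-word character spans:

Type	Example
Email addresses	`Jane.Doe@example.com`
Hashtags	`#trending`
Mentions	`@jdkato`
URLs	`https://github.com/jdkato/prose`
Emoticons	`:-)`, `>:(`, `o_0`, etc.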

package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document
    doc, err := prose.NewDocument("@jdkato, go to http://example.com thanks :).")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the tokens
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag)
        // @jdkato NN
        // , ,
        // go VB
        // to TO
        // http://example.com NN
        // thanks NNS
        // :) SYM
        // . .
    }
}
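
Because the tokenizer keeps mentions, hashtags, URLs, and email addresses as single tokens, downstream code can sort them with simple string checks. The sketch below works on plain token strings like those printed above; `classify` and its category names are hypothetical helpers, not part of prose, and the checks are deliberately rough.

```go
package main

import (
	"fmt"
	"strings"
)

// classify assigns a rough category to a token's text using string checks.
func classify(text string) string {
	switch {
	case strings.HasPrefix(text, "http://"), strings.HasPrefix(text, "https://"):
		return "url"
	case len(text) > 1 && strings.HasPrefix(text, "@"):
		return "mention"
	case len(text) > 1 && strings.HasPrefix(text, "#"):
		return "hashtag"
	case len(text) > 2 && strings.Contains(text[1:len(text)-1], "@"):
		// An "@" that is neither the first nor the last character.
		return "email"
	default:
		return "other"
	}
}

func main() {
	// Token texts copied from the example output above.
	tokens := []string{"@jdkato", ",", "go", "to", "http://example.com", "thanks", ":)", "."}
	for _, t := range tokens {
		fmt.Printf("%-20s %s\n", t, classify(t))
	}
}
```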

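Segmenting

According to its README, prose includes one of the most accurate sentence segmenters available, as measured against the "Golden Rules" test set from the pragmatic_segmenter project.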

package main

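import (
    "fmt"
    "log"
    "strings"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document from two sentences joined by a space
    doc, err := prose.NewDocument(strings.Join([]string{
        "I can see Mt. Fuji from here.",
        "St. Michael's Church is on 5th st. near the light."}, " "))
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the sentences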
    sents := doc.Sentences()
    fmt.Println(len(sents)) // 2
    for _, sent := range sents {
        fmt.Println(sent.Text)
        // I can see Mt. Fuji from here.
        // St. Michael's Church is on 5th st. near the light.
    }
}
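
To see why abbreviation handling matters, compare with a naive splitter. The sketch below uses only the standard library and cuts the same text after every period followed by a space; it yields five fragments rather than the two sentences reported above. `naiveSplit` is a hypothetical helper for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveSplit cuts text after every ". " with no knowledge of abbreviations.
func naiveSplit(text string) []string {
	var parts []string
	for _, p := range strings.SplitAfter(text, ". ") {
		if p = strings.TrimSpace(p); p != "" {
			parts = append(parts, p)
		}
	}
	return parts
}

func main() {
	text := "I can see Mt. Fuji from here. " +
		"St. Michael's Church is on 5th st. near the light."
	parts := naiveSplit(text)
	fmt.Println(len(parts)) // 5
	for _, p := range parts {
		fmt.Printf("%q\n", p)
	}
}
```

The abbreviations "Mt.", "St.", and "st." each trigger a false sentence break here, which is exactly the case a trained segmenter is meant to handle.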

Tagging

prose includes a tagger based on TextBlob's "fast and accurate" POS tagger. The full list of supported POS tags follows:

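Tag	Description
`(`	left round bracket
`)`	right round bracket
`,`	comma
`:`	colon
`.`	period
`''`	closing quotation mark
`` `` ``	opening quotation mark
`#`	number sign
`$`	currency
`CC`	coordinating conjunction
`CD`	cardinal number
`DT`	determiner
`EX`	existential there
`FW`	foreign word
`IN`	subordinating conjunction or preposition
`JJ`	adjective
`JJR`	adjective, comparative
`JJS`	adjective, superlative
`LS`	list item marker
`MD`	modal verb
`NN`	noun, singular or mass
`NNP`	proper noun, singular
`NNPS`	proper noun, plural
`NNS`	noun, plural
`PDT`	predeterminer
`POS`	possessive ending
`PRP`	personal pronoun
`PRP$`	possessive pronoun
`RB`	adverb
`RBR`	adverb, comparative
`RBS`	adverb, superlative
`RP`	adverb, particle
`SYM`	symbol
`TO`	infinitival to
`UH`	interjection
`VB`	verb, base form
`VBD`	verb, past tense
`VBG`	verb, gerund or present participle
`VBN`	verb, past participle
`VBP`	verb, non-3rd person singular present
`VBZ`	verb, 3rd person singular present
`WDT`	wh-determiner
`WP`	personal wh-pronoun
`WP$`	possessive wh-pronoun
`WRB`	wh-adverb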
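
Several tags in the table share a prefix (`NN` for nouns, `VB` for verbs, `JJ` for adjectives, `RB` for adverbs), which makes coarse filtering straightforward. The sketch below maps a tag to a broad word class. `coarseCategory` and the class names are hypothetical, and the sample (text, tag) pairs are copied from the tokenizing example output earlier in this post.

```go
package main

import (
	"fmt"
	"strings"
)

// coarseCategory collapses fine-grained POS tags into broad word classes
// based on their shared prefixes. Tags without such a prefix (including
// MD, RP, and the wh-tags) fall into "other".
func coarseCategory(tag string) string {
	switch {
	case strings.HasPrefix(tag, "NN"):
		return "noun"
	case strings.HasPrefix(tag, "VB"):
		return "verb"
	case strings.HasPrefix(tag, "JJ"):
		return "adjective"
	case strings.HasPrefix(tag, "RB"):
		return "adverb"
	default:
		return "other"
	}
}

func main() {
	tagged := [][2]string{
		{"@jdkato", "NN"}, {"go", "VB"}, {"to", "TO"}, {"thanks", "NNS"},
	}
	for _, t := range tagged {
		fmt.Printf("%-8s %-4s %s\n", t[0], t[1], coarseCategory(t[1]))
	}
}
```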

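Named-entity recognition (NER)

prose v2.0.0 includes an improved named-entity recognizer that, by default, identifies people (PERSON) and geographical/political entities (GPE).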

package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    doc, err := prose.NewDocument("Lebron James plays basketball in Los Angeles.")
    if err != nil {
        log.Fatal(err)
    }
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Lebron James PERSON
        // Los Angeles GPE
    }
}

More hands-on tutorials on using prose for English text processing and named-entity recognition in Go are available at https://www.itying.com/category-94-b0.html

eggper (reply #1)

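Text processing and named-entity recognition with the Go prose library

prose is a natural language processing library written in Go. It focuses on English text and provides tokenization, part-of-speech tagging, named-entity recognition (NER), and related features. The sections below walk through how to use it.

Installing prose

Install the library first:

go get github.com/jdkato/prose/v2

Basic examples

1. Tokenization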

package main

import (
	"fmt"
	"github.com/jdkato/prose/v2"
)

func main() {
	// Create the document
	doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
	if err != nil {
		panic(err)
	}

	// Print each token with its POS tag
	for _, tok := range doc.Tokens() {
		fmt.Printf("Token: %s (POS: %s)\n", tok.Text, tok.Tag)
	}
}

2. Named-entity recognition

package main

import (
	"fmt"
	"github.com/jdkato/prose/v2"
)

func main() {
	text := "Apple is looking at buying U.K. startup for $1 billion. " +
		"Tim Cook is the CEO of Apple Inc."

	// Create the document
	doc, err := prose.NewDocument(text)
	if err != nil {
		panic(err)
	}

	// Extract the named entities
	for _, ent := range doc.Entities() {
		fmt.Printf("Entity: %s (Label: %s)\n", ent.Text, ent.Label)
	}
}

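Output:

The default model only produces the PERSON and GPE labels, so labels such as ORG or MONEY will not appear for "Apple Inc." or "$1 billion". Each recognized entity is printed on its own line, for example (assuming the model picks up the name):

Entity: Tim Cook (Label: PERSON)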

Advanced usage

1. Custom processing options

package main

import (
	"fmt"
	"github.com/jdkato/prose/v2"
)

func main() {
	text := "Dr. John Smith lives in New York and works for Google."

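	// Create the document with custom options. Extraction depends on
	// tokenization and tagging (see the pipeline above), so only the
	// independent sentence-segmentation step is disabled here.
	doc, err := prose.NewDocument(
		text,
		prose.WithSegmentation(false), // skip sentence segmentation
	)
	if err != nil {
		panic(err)
	}

	// Print the extracted entities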
	for _, ent := range doc.Entities() {
		fmt.Printf("%s - %s\n", ent.Text, ent.Label)
	}
}

2. Processing long text

Long text can be processed in chunks:

package main

import (
	"fmt"
	"github.com/jdkato/prose/v2"
	"strings"
)

func processChunk(chunk string) {
	doc, err := prose.NewDocument(chunk)
	if err != nil {
		fmt.Printf("Error processing chunk: %v\n", err)
		return
	}

	for _, ent := range doc.Entities() {
		fmt.Printf("Found entity: %s (%s)\n", ent.Text, ent.Label)
	}
}

func main() {
	longText := "This is a long text about Microsoft and Bill Gates. " +
		"Microsoft is a technology company based in Redmond, Washington. " +
		"Bill Gates co-founded Microsoft with Paul Allen."

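	// Naive chunking on ". "; real applications will likely need smarter
	// splitting, since this also breaks after abbreviations such as "Mt. "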
	chunks := strings.SplitAfter(longText, ". ")
	for _, chunk := range chunks {
		if chunk != "" {
			processChunk(chunk)
		}
	}
}
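
When text is processed chunk by chunk, the same entity (for example "Bill Gates" in the text above) can be reported once per chunk. A minimal sketch of merging per-chunk results while dropping repeats follows. The `entity` type and `mergeUnique` are hypothetical stand-ins, and the sample entities are hand-written for illustration rather than actual prose output.

```go
package main

import "fmt"

// entity is a hypothetical stand-in for an extracted entity.
type entity struct {
	Text  string
	Label string
}

// mergeUnique flattens per-chunk entity lists, keeping the first occurrence
// of each (Text, Label) pair and preserving the original order.
func mergeUnique(chunks ...[]entity) []entity {
	seen := make(map[entity]bool)
	var out []entity
	for _, chunk := range chunks {
		for _, e := range chunk {
			if !seen[e] {
				seen[e] = true
				out = append(out, e)
			}
		}
	}
	return out
}

func main() {
	// Hand-written sample data, not real model output.
	first := []entity{{"Bill Gates", "PERSON"}}
	second := []entity{{"Redmond", "GPE"}, {"Washington", "GPE"}}
	third := []entity{{"Bill Gates", "PERSON"}, {"Paul Allen", "PERSON"}}
	for _, e := range mergeUnique(first, second, third) {
		fmt.Println(e.Text, e.Label)
	}
}
```

Using the whole struct as the map key means two entities with the same text but different labels are kept as distinct entries.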

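Performance tips

Reuse the model, not the Document: a Document is built from a single text, so each input needs its own; when processing many texts, consider passing a shared model through the prose.UsingModel option rather than loading one per call
Enable only what you need: switch off the steps (segmentation, tagging, or entity extraction) that your application does not use
Preprocess the text: strip unnecessary characters and normalize formatting before creating the document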

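Limitations

prose currently supports English text only
Entity recognition accuracy may fall short of specialized commercial APIs
Long texts may need to be processed in chunks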

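Summary

prose gives Go developers a simple yet capable tool for English text processing, and it is well suited to applications that need lightweight NLP. It is less comprehensive than Python's NLTK or spaCy, but it covers many basic tasks and benefits from Go's efficiency and concurrency support.

For production use, add appropriate error handling and performance monitoring around critical functionality, and tune the processing options to your actual needs.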