Differences in how Go regular expressions match English and Greek text

I want to find whole words in Greek text using Go.

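Using \b as the word boundary in the regular expression pattern works for English but not for Greek.

When I run the code below, one occurrence of joy is found, which is correct.

For the Greek text, however, nothing is returned. If I remove the \b on both sides of the Greek word, two occurrences of ἀγαπη are found.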

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// English: "joy" as a whole word, with \b on both sides.
	pattern := "\\bjoy\\b"
	text := "sing joyfully with joy"
	matcher, err := regexp.Compile(pattern)
	if err != nil {
		fmt.Println(err)
		return
	}
	indexes := matcher.FindAllStringIndex(text, -1)
	expect := 1
	got := len(indexes)
	if expect != got {
		fmt.Printf("expected %d, got %d\n", expect, got)
	}

	// Greek: the same pattern shape around ἀγαπη.
	pattern = "\\bἀγαπη\\b"
	text = "Ὡς ἀγαπη τὰ σκηνώματά σου. Ὡς ἀγαπητὰ τὰ σκηνώματά σου."
	matcher, err = regexp.Compile(pattern)
	if err != nil {
		fmt.Println(err)
		return
	}
	indexes = matcher.FindAllStringIndex(text, -1)
	expect = 1
	got = len(indexes)
	if expect != got {
		fmt.Printf("expected %d, got %d\n", expect, got)
	}
}

phonegap100 #1

\B is a non-word boundary. I need a word boundary.

gougou168 #2

Yes, sorry. I think I misread the description in my haste and associated "non" with "ASCII" rather than with "word boundary"...

gougou168 #3

The suggestion kosmo made here works:

pattern = "(\\s+|\\w+)ἀγαπη(\\s+|\\w+)"

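h691938207 #4

\b is an ASCII word boundary; you may want to use \B.

google/re2

RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library. - google/re2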

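gougou168 #5

In Go's regular expressions, \b is an ASCII word boundary. It matches the position between a \w (word character) and a \W (non-word character), and \w matches only the ASCII set [0-9A-Za-z_].

Greek letters are Unicode letters, but they are not in the ASCII \w range, so Go treats them as \W. At the position between a space and a Greek letter both neighbours are non-word characters, which means \b never matches there and \bἀγαπη\b finds nothing.

The solution is to build the boundary yourself from Unicode character classes: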
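To make the explanation above concrete, here is a small sketch that classifies a few single characters with \w and \p{L} and then counts matches for the patterns from the question. The helper names (isASCIIWordChar, isUnicodeLetter, countMatches) are made up for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	wordChar = regexp.MustCompile(`^\w$`)    // Perl class \w: [0-9A-Za-z_] only
	letter   = regexp.MustCompile(`^\p{L}$`) // any Unicode letter
)

// isASCIIWordChar reports whether s is a single character matched by \w.
func isASCIIWordChar(s string) bool { return wordChar.MatchString(s) }

// isUnicodeLetter reports whether s is a single character matched by \p{L}.
func isUnicodeLetter(s string) bool { return letter.MatchString(s) }

// countMatches returns the number of non-overlapping matches of pattern in text.
func countMatches(pattern, text string) int {
	return len(regexp.MustCompile(pattern).FindAllStringIndex(text, -1))
}

func main() {
	for _, ch := range []string{"j", "γ", "η"} {
		fmt.Printf("%q  \\w: %-5v  \\p{L}: %v\n", ch, isASCIIWordChar(ch), isUnicodeLetter(ch))
	}

	english := "sing joyfully with joy"
	greek := "Ὡς ἀγαπη τὰ σκηνώματά σου. Ὡς ἀγαπητὰ τὰ σκηνώματά σου."
	fmt.Println(countMatches(`\bjoy\b`, english)) // whole word "joy"
	fmt.Println(countMatches(`\bἀγαπη\b`, greek)) // \b never fires next to Greek letters
	fmt.Println(countMatches(`ἀγαπη`, greek))     // without \b, prefixes match too
}
```

The Greek letters fail \w but pass \p{L}, which is why a boundary for Greek has to be written in terms of \p{L} (or a script class) rather than \b.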

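package main

import (
    "fmt"
    "regexp"
)

func main() {
    // Baseline: \b is ASCII-only in Go, and there is no (?u) flag to change
    // that (regexp.Compile rejects a (?u) prefix), so this finds no matches.
    pattern := `\bἀγαπη\b`
    text := "Ὡς ἀγαπη τὰ σκηνώματά σου. Ὡς ἀγαπητὰ τὰ σκηνώματά σου."
    
    matcher, err := regexp.Compile(pattern)
    if err != nil {
        fmt.Println(err)
        return
    }
    
    indexes := matcher.FindAllStringIndex(text, -1)
    fmt.Printf("ASCII \\b boundary matched %d result(s)\n", len(indexes))
    
    // Approach 1: explicit boundaries. Before the word: start of text or a
    // non-letter. After it: a non-letter or end of text. Each match span
    // includes those delimiter characters.
    pattern2 := `(^|\P{L})ἀγαπη(\P{L}|$)`
    matcher2, err := regexp.Compile(pattern2)
    if err != nil {
        fmt.Println(err)
        return
    }
    
    indexes2 := matcher2.FindAllStringIndex(text, -1)
    fmt.Printf("Explicit boundary matched %d result(s)\n", len(indexes2))
    
    // Approach 2: use a Unicode script class to list every run of Greek
    // characters, then compare the runs against the word you are looking for.
    pattern3 := `\p{Greek}+`
    matcher3, err := regexp.Compile(pattern3)
    if err != nil {
        fmt.Println(err)
        return
    }
    
    words := matcher3.FindAllString(text, -1)
    fmt.Printf("Greek words: %v\n", words)
}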

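Key points:

Go's regexp package (RE2 syntax) has no flag that makes \b or \w Unicode-aware. The supported flags are i, m, s and U (ungreedy); (?u) is not one of them and fails to compile.
\P{L} matches any character that is not a letter (punctuation, whitespace, digits, and so on).
\p{Greek} matches characters in the Greek script.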
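Flag support differs between regex engines, so it may be worth checking which patterns Go's regexp package actually accepts before relying on them. The sketch below is only a quick check; compiles is a hypothetical helper name. To my knowledge the flags documented for regexp/syntax are i, m, s and U, so a (?u) prefix should be reported as an error.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether Go's regexp package accepts the pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	patterns := []string{
		`(?u)\bἀγαπη\b`,           // (?u) is not a Go regexp flag
		`(?i)\bἀγαπη\b`,           // (?i) is supported, but \b stays ASCII-only
		`(^|\P{L})ἀγαπη(\P{L}|$)`, // explicit boundaries from Unicode classes
		`\p{Greek}+`,              // runs of Greek-script characters
	}
	for _, p := range patterns {
		fmt.Printf("%-28s compiles: %v\n", p, compiles(p))
	}
}
```

Checking err from regexp.Compile, rather than using regexp.MustCompile, keeps a rejected pattern from panicking at startup.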

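For whole-word matching in Greek text, I recommend:

// findGreekWord returns the index pairs of every whole-word occurrence of
// word in text. It needs the fmt and regexp packages.
func findGreekWord(text, word string) [][]int {
    // The word must be preceded by the start of the text or a non-Greek
    // character, and followed by a non-Greek character or the end of the text.
    pattern := fmt.Sprintf(`(^|[^\p{Greek}])%s([^\p{Greek}]|$)`, regexp.QuoteMeta(word))
    re, err := regexp.Compile(pattern)
    if err != nil {
        return nil
    }
    return re.FindAllStringIndex(text, -1)
}

This handles word boundaries in Greek text for the common case. Two caveats: each returned span includes the delimiter character on either side of the word, and because matches cannot overlap, two occurrences separated by a single character share that delimiter, so the second one is missed.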
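If you need the exact span of the word, or if occurrences may be separated by a single delimiter, one option is to skip the regex boundary entirely. Go's regexp has no lookaround, so a regex boundary has to consume the neighbouring characters. The sketch below swaps in a plain string scan that inspects the runes on each side of a candidate. findWholeWord and isWordRune are hypothetical names, and treating "letter or combining mark" as a word character is an assumption you may want to adjust.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// isWordRune reports whether r counts as part of a word: any Unicode letter,
// or a combining mark (for accents stored as separate code points).
// This definition is an assumption; extend it if digits should count too.
func isWordRune(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsMark(r)
}

// findWholeWord returns the [start, end) byte offsets of every occurrence of
// word in text that is not directly adjacent to another word rune.
func findWholeWord(text, word string) [][]int {
	var spans [][]int
	if word == "" {
		return spans
	}
	for from := 0; from <= len(text)-len(word); {
		i := strings.Index(text[from:], word)
		if i < 0 {
			break
		}
		start := from + i
		end := start + len(word)
		// At either edge of text the decoders return utf8.RuneError, which is
		// neither a letter nor a mark, so the edges count as boundaries.
		prev, _ := utf8.DecodeLastRuneInString(text[:start])
		next, _ := utf8.DecodeRuneInString(text[end:])
		if !isWordRune(prev) && !isWordRune(next) {
			spans = append(spans, []int{start, end})
			from = end
		} else {
			// Not a whole word: step past the first rune of this candidate.
			_, size := utf8.DecodeRuneInString(text[start:])
			from = start + size
		}
	}
	return spans
}

func main() {
	text := "Ὡς ἀγαπη τὰ σκηνώματά σου. Ὡς ἀγαπητὰ τὰ σκηνώματά σου."
	for _, span := range findWholeWord(text, "ἀγαπη") {
		fmt.Printf("%d-%d %q\n", span[0], span[1], text[span[0]:span[1]])
	}
}
```

Because nothing around the word is consumed, the spans cover only the word itself, and inputs such as "ἀγαπη ἀγαπη" yield both occurrences.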