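Using xurls, a Go library for efficiently extracting URLs from text

xurls is a Go library that extracts URLs from text using regular expressions. It requires Go 1.23 or later.

Basic usage

Here is a basic example of using xurls: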

package main

import "mvdan.cc/xurls/v2"

func main() {
    // Relaxed mode also matches URLs that have no scheme
    rxRelaxed := xurls.Relaxed()
    rxRelaxed.FindString("Do gophers live in golang.org?")  // "golang.org"
    rxRelaxed.FindString("This string does not have a URL") // ""

    // Strict mode only matches URLs that have a scheme
    rxStrict := xurls.Strict()
    rxStrict.FindAllString("must have scheme: http://foo.com/.", -1) // []string{"http://foo.com/"}
    rxStrict.FindAllString("no scheme, no match: foo.com", -1)       // []string{}
}

The xurls command-line tool

To install the tool globally:

go install mvdan.cc/xurls/v2/cmd/xurls@latest

Usage example:

$ echo "Do gophers live in http://golang.org?" | xurls
http://golang.org

Complete example

The following complete program shows how to use xurls to extract URLs from text:

package main

import (
	"fmt"

	"mvdan.cc/xurls/v2"
)

func main() {
	text := `
	Here are some example URLs:
	- https://golang.org
	- http://github.com
	- www.example.com (only matched by Relaxed)
	- contact@example.com (an email address, also matched by Relaxed)
	`

	// Extract all URLs in Relaxed mode
	relaxed := xurls.Relaxed()
	urls := relaxed.FindAllString(text, -1)
	fmt.Println("Relaxed matches:")
	for _, url := range urls {
		fmt.Println(url)
	}

	// Extract all URLs in Strict mode
	strict := xurls.Strict()
	urls = strict.FindAllString(text, -1)
	fmt.Println("\nStrict matches:")
	for _, url := range urls {
		fmt.Println(url)
	}
}

Output (Relaxed mode matches email addresses as well as scheme-less URLs):

Relaxed matches:
https://golang.org
http://github.com
www.example.com
contact@example.com

Strict matches:
https://golang.org
http://github.com

More hands-on tutorials about using xurls, the Go library for efficiently extracting URLs from text, are available at https://www.itying.com/category-94-b0.html

sinazl, reply #1


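Efficiently extracting URLs from text with xurls

xurls is a Go library dedicated to extracting URLs from text. It is built on the standard regexp package and ships tuned patterns that are more accurate than a typical hand-written URL regular expression, covering many URL formats.

Installing xurls

go get mvdan.cc/xurls/v2

Basic usage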

package main

import (
	"fmt"

	"mvdan.cc/xurls/v2"
)

func main() {
	text := `Visit my website at https://example.com or send email to mailto:contact@example.com.
	You can also check our FTP site at ftp://files.example.com.`

	// Relaxed mode: extract all URLs
	urls := xurls.Relaxed().FindAllString(text, -1)
	fmt.Println("All URLs:")
	for _, url := range urls {
		fmt.Println(url)
	}

	// Strict mode: extract only URLs that have a scheme
	strictURLs := xurls.Strict().FindAllString(text, -1)
	fmt.Println("\nStrict URLs:")
	for _, url := range strictURLs {
		fmt.Println(url)
	}
}

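Advanced features

1. Custom URL patterns

package main

import (
	"fmt"
	"log"

	"mvdan.cc/xurls/v2"
)

func main() {
	// StrictMatchingScheme returns (*regexp.Regexp, error). The expression
	// must match the whole scheme prefix, including the "://" separator.
	custom, err := xurls.StrictMatchingScheme(`https?://|ftp://`)
	if err != nil {
		log.Fatal(err)
	}

	text := "Links: http://a.com, https://b.com, ftp://c.com, ssh://d.com"

	urls := custom.FindAllString(text, -1)
	fmt.Println("URLs with the chosen schemes:")
	for _, url := range urls {
		fmt.Println(url)
	}
	// Prints http://a.com, https://b.com, ftp://c.com (one per line); ssh://d.com is skipped
}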

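2. Replacing URLs

package main

import (
	"fmt"

	"mvdan.cc/xurls/v2"
)

func main() {
	text := "Visit https://old.com for more information"

	// Replace every matched URL
	replaced := xurls.Relaxed().ReplaceAllString(text, "https://new.com")
	fmt.Println(replaced)
	// Output: Visit https://new.com for more information
}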

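3. Getting URL positions

package main

import (
	"fmt"

	"mvdan.cc/xurls/v2"
)

func main() {
	text := "This text has several URLs: https://first.com and http://second.org"

	// Each location is a [start, end) pair of byte offsets into text
	indexes := xurls.Relaxed().FindAllStringIndex(text, -1)
	for _, loc := range indexes {
		start, end := loc[0], loc[1]
		fmt.Printf("found URL: %q (position: %d-%d)\n", text[start:end], start, end)
	}
}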

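Performance tips

Reuse the extractor: obtain the *regexp.Regexp once, outside the loop, rather than on every iteration. This matters most for StrictMatchingScheme, which compiles a new regular expression on each call.

// Requires the "mvdan.cc/xurls/v2" import.
func extractURLs(texts []string) [][]string {
	extractor := xurls.Relaxed() // obtained only once
	var allURLs [][]string

	for _, text := range texts {
		urls := extractor.FindAllString(text, -1)
		allURLs = append(allURLs, urls)
	}

	return allURLs
}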

Parallel processing: large batches of text can be extracted concurrently

// Requires the "sync" and "mvdan.cc/xurls/v2" imports.
func extractURLsParallel(texts []string) [][]string {
	extractor := xurls.Relaxed() // a *regexp.Regexp is safe for concurrent use
	var wg sync.WaitGroup
	allURLs := make([][]string, len(texts))

	for i, text := range texts {
		wg.Add(1)
		go func(idx int, t string) {
			defer wg.Done()
			// Each goroutine writes only its own slot, so no mutex is needed
			allURLs[idx] = extractor.FindAllString(t, -1)
		}(i, text)
	}

	wg.Wait()
	return allURLs
}

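Comparison with other approaches

xurls is built on the standard library's regexp package, and every extractor it returns is an ordinary *regexp.Regexp. Compared with writing your own URL regular expression, it offers the following advantages:

More accurate recognition of URL formats, since its patterns are built from lists of known schemes and top-level domains
Predictable performance, because Go's regexp engine guarantees matching in time linear in the size of the input
Support for custom schemes through StrictMatchingScheme
Active maintenance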

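Notes

xurls does not check whether a URL actually exists; it only performs pattern matching
Unusual URL formats may need a custom pattern
Relaxed mode can match text that was not meant as a URL, such as "1.2.3.4"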
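
Because xurls only matches patterns, the strings it returns may need a clean-up pass before use. The sketch below is one possible post-processing step using only the standard library. The function cleanCandidates and its rules are assumptions made for illustration, not part of xurls: it drops scheme-less entries that look like email addresses or bare IP addresses, prefixes the remaining scheme-less entries with https://, keeps only strings that net/url parses with a non-empty host, and removes duplicates. The input slice stands in for the result of FindAllString.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// cleanCandidates is a hypothetical helper that filters and normalizes
// strings such as those returned by FindAllString.
func cleanCandidates(candidates []string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, c := range candidates {
		hasScheme := strings.Contains(c, "://")
		if !hasScheme {
			if strings.Contains(c, "@") {
				continue // email-like match, not a web URL
			}
			c = "https://" + c // assumed default scheme
		}
		u, err := url.Parse(c)
		if err != nil || u.Host == "" {
			continue // not parsable as a URL with a host
		}
		if !hasScheme && net.ParseIP(u.Hostname()) != nil {
			continue // bare "1.2.3.4" is more likely a version number
		}
		key := u.String()
		if seen[key] {
			continue // duplicate after normalization
		}
		seen[key] = true
		out = append(out, key)
	}
	return out
}

func main() {
	candidates := []string{
		"https://golang.org",
		"golang.org",
		"contact@example.com",
		"1.2.3.4",
		"http://10.0.0.1/status",
	}
	for _, u := range cleanCandidates(candidates) {
		fmt.Println(u)
	}
}
```

Whether to drop bare IP addresses or email addresses depends on the application; a log analyzer, for example, might want to keep both.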

xurls is an efficient tool for extracting URLs from text, well suited to log analysis, crawlers, text processing, and similar tasks.
