How to measure the time cost of cgo calls in Golang

I am looking for tools that can measure the time overhead of each function involved in a cgo call.

My Go version is 1.16.6 linux/amd64.

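As far as I know, when C calls a Go function it goes through functions in _obj/_cgo_export.c. First _cgo_wait_runtime_init_done is called, then crosscall2.

All of this happens on the C side.

After crosscall2 we enter the assembly side, in cgocallback from src/runtime/asm_amd64.s.

cgocallback then calls cgocallbackg on the Go side, which in turn calls cgocallbackg1. Finally, through cb(frame) (src/runtime/cgocall.go), we reach the actual Go function.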

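So my question is: since the whole process is split across two different languages, how can I measure the time overhead of each function along the way?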

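I have heard of pprof, but it cannot show the time spent on the C side. I have used clock() (time.h) to measure the total time C spends calling the Go function, but that still tells me nothing about what happens on the Go side.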

Here is what I tried:

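// caller.c
#include <stdio.h>
#include <time.h>

extern int sum(int a, int b); // Go function exported through cgo

void caller(void)
{
    clock_t start = clock(); // note: clock() reports CPU time, not wall-clock time
    int result = sum(1, 2);  // C calls the Go function
    clock_t end = clock();
    clock_t elapsed = end - start;
    printf("result=%d elapsed=%ld ticks\n", result, (long)elapsed);
}

int main(void)
{
    caller();
    return 0;
}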

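// callee.go
package main

import "C"

//export sum
func sum(a, b C.int) C.int {
	return a + b
}

// -buildmode=c-archive requires a main package with a main function,
// even though it is never called.
func main() {}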

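I built with go build -o sum.so -buildmode=c-archive callee.go and then gcc -g -o caller caller.c sum.so -lpthread to get the executable. I also ran go tool cgo callee.go to generate the _obj files, which makes debugging with gdb easier.

But obviously this measurement is not fine-grained enough.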
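One reason a single start/end pair is too coarse is the timer itself: if the clock only advances in steps comparable to the call being measured, the difference is mostly noise. As a rough check, the sketch below estimates the smallest step the clock reports from Go. The function name timerResolution and the iteration count are my own choices for illustration, not part of any library.

```go
package main

import (
	"fmt"
	"time"
)

// timerResolution takes n back-to-back clock readings and returns the
// smallest nonzero difference seen between two consecutive readings.
// It returns -1 if no nonzero difference was observed.
func timerResolution(n int) time.Duration {
	best := time.Duration(-1)
	prev := time.Now()
	for i := 0; i < n; i++ {
		now := time.Now()
		if d := now.Sub(prev); d > 0 && (best < 0 || d < best) {
			best = d
		}
		prev = now
	}
	return best
}

func main() {
	fmt.Println("smallest observed clock step:", timerResolution(1000000))
}
```

If the reported step is close to the time you expect one call to take, timing a single call will not tell you much and you need to aggregate over many calls.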

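I have an idea: modify _cgo_export.c in the _obj directory (that is, add start/end timestamps to each important function) and recompile.

But honestly, I do not know how to do that. 😔

So, can anyone recommend tools for measuring cgo call timing, or help me make this daydream real? Any guidance would be greatly appreciated.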


nodeper #1

If there are other, more efficient tools, please let me know. Any ideas are much appreciated.


gougou168 #2

You can use the time package, as in the example below.

start := time.Now() // keep the monotonic clock reading; calling .UTC() would strip it
...
fmt.Println(time.Since(start))
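Because one cgo crossing is short, timing a single call this way mostly measures noise. A common workaround is to time many calls and divide. A minimal sketch, where avgLatency and work are placeholder names and work stands in for the cgo call under test:

```go
package main

import (
	"fmt"
	"time"
)

// avgLatency runs fn n times and returns the mean wall-clock time per call.
// Loop overhead is included, so treat the result as an upper bound.
func avgLatency(n int, fn func()) time.Duration {
	if n <= 0 {
		return 0
	}
	start := time.Now()
	for i := 0; i < n; i++ {
		fn()
	}
	return time.Since(start) / time.Duration(n)
}

var sink int

// work is a stand-in; replace its body with the call you want to measure.
func work() { sink++ }

func main() {
	fmt.Println("mean time per call:", avgLatency(1000000, work))
}
```

This gives only the total cost per call, not the split between the C side and the Go side, but it is a cheap baseline to compare the finer-grained methods against.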

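caililin #3

Since I cannot import the time package, how can I get the time from inside the runtime package? (I cannot edit the question above, so I am posting this as a reply.)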

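// Example: getting time information from within the runtime package
package main

import (
    "runtime"
)

func main() {
    // Code that uses the runtime package would go here.
    _ = runtime.Version() // keeps the import in use so the file compiles
}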

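songsunli #4

import "time" causes an import cycle (sorry, I did not explain the reason clearly in my last reply).

You can try adding time.Now().UTC() anywhere in the runtime package source; when you then run go build on a cgo-related package, the compiler reports an import cycle.

The same goes for the fmt package. Presumably both depend on the runtime package.

Anyway, thanks for your help!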

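yuanlaile #5

After two days of searching, I think I have figured out how to measure time overhead inside the runtime package (for example in the cgocallback function in runtime/cgocall.go).

nanotime() is itself implemented in the runtime package, so it can be called there without causing an import cycle.

Note: when the runtime is built with the faketime build tag, nanotime returns simulated time (runtime/time_fake.go). A normal linux/amd64 build uses the real clock (runtime/time_nofake.go).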

// runtime/cgocall.go
func cgocallback() {
   start := nanotime()
   // ...
   end := nanotime()
}

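In addition, printint, println and print can also be used inside the runtime package (they are implemented there as well).

However, in func cgocallback in runtime/cgocall.go you cannot call printint, println or print directly, because doing so causes a stack split.

My guess is that the stack split happens because func cgocallback runs on g0's stack while print uses a different stack, and stack splits are not allowed in func cgocallback (I am not sure about this; it is only a guess).

Anyway, the following code solved the printing problem: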

func cgocallback() {
    start := nanotime()
    //...
    end := nanotime()
    // Print on the system stack so the call does not trigger a stack split.
    systemstack(func() {
        printint(end - start)
    })
}

This calls printint on the system stack. It does not cause a stack split, and it does print the elapsed time.

I am not entirely sure that what I said is correct.
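Printing on every callback is slow and will itself distort the numbers. One alternative, sketched here in ordinary user-level Go rather than runtime code, is to count samples into power-of-two buckets and print them once at the end. The names numBuckets, bucketIndex and record are hypothetical; inside the runtime package the imports would have to be replaced by the runtime's internal equivalents, which I have not tried.

```go
package main

import (
	"fmt"
	"math/bits"
	"sync/atomic"
)

const numBuckets = 32

// buckets[i] counts samples with floor(log2(ns)) == i (clamped to the last bucket).
var buckets [numBuckets]uint64

// bucketIndex maps a duration in nanoseconds to its bucket.
// Non-positive values go into bucket 0.
func bucketIndex(ns int64) int {
	if ns <= 0 {
		return 0
	}
	i := bits.Len64(uint64(ns)) - 1
	if i >= numBuckets {
		i = numBuckets - 1
	}
	return i
}

// record adds one sample without allocating or printing.
func record(ns int64) {
	atomic.AddUint64(&buckets[bucketIndex(ns)], 1)
}

func main() {
	// Made-up sample durations, in nanoseconds.
	for _, ns := range []int64{90, 120, 130, 1500} {
		record(ns)
	}
	for i := range buckets {
		if n := atomic.LoadUint64(&buckets[i]); n > 0 {
			fmt.Printf("[%d ns, %d ns): %d samples\n", int64(1)<<i, int64(1)<<(i+1), n)
		}
	}
}
```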

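zlyuanteng #6

Measuring the time of CGO calls in Go does require combining several tools, because the call crosses a language boundary. Here are a few practical approaches:

1. Use runtime/trace to trace CGO calls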

package main

/*
#include <time.h>
#include <stdio.h>

static void c_function() {
    // Simulate work on the C side
    struct timespec ts = {0, 1000000}; // 1ms
    nanosleep(&ts, NULL);
}
*/
import "C"
import (
    "context"
    "log"
    "os"
    "runtime/trace"
)

//export go_function
func go_function() C.int {
    // Go-side function (exported to C, but not called in this example)
    return 42
}

func main() {
    // Create the trace file
    f, err := os.Create("trace.out")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Start tracing
    if err := trace.Start(f); err != nil {
        log.Fatal(err)
    }
    defer trace.Stop()

    ctx, task := trace.NewTask(context.Background(), "cgo-call")
    defer task.End()

    // Mark the Go -> C call; this example does not call back into Go
    trace.Log(ctx, "event", "Go->C call start")
    C.c_function()
    trace.Log(ctx, "event", "Go->C call end")
}

After running it, analyze the result with go tool:

go run main.go
go tool trace trace.out
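If you want the trace viewer to report a duration for the call rather than two log events, runtime/trace also offers regions. A minimal sketch, assuming the cgo call is wrapped in a closure; the helper traceRegion is hypothetical and writes the trace to an in-memory buffer only to keep the example self-contained (in practice you would write to a file as above):

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"runtime/trace"
)

// traceRegion runs fn inside a named trace region and returns the number
// of trace bytes produced. Only one trace can be active at a time.
func traceRegion(name string, fn func()) (int, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return 0, err
	}
	trace.WithRegion(context.Background(), name, fn)
	trace.Stop() // returns once all trace data has been written
	return buf.Len(), nil
}

func main() {
	n, err := traceRegion("cgo-call", func() {
		// The cgo call under test would go here.
	})
	if err != nil {
		fmt.Println("trace failed:", err)
		return
	}
	fmt.Println("trace bytes written:", n)
}
```

Regions recorded this way should show up with their durations in the user-defined regions view of go tool trace.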

2. Use perf (Linux)

# Build the Go program
go build -o app main.go

# Record with perf
perf record -g ./app
perf report

# Or produce a flame graph
perf record -F 99 -g ./app
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

3. Manual instrumentation

package main

/*
#include <time.h>
#include <stdint.h>

static uint64_t get_nanoseconds() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}

// Exported Go function, defined below.
extern int measure_go_function(void);

// C-side wrapper: times the whole C -> Go -> C round trip.
static int call_go(uint64_t *elapsed) {
    uint64_t start = get_nanoseconds();
    int result = measure_go_function();
    *elapsed = get_nanoseconds() - start;
    return result;
}
*/
import "C"
import (
    "fmt"
    "sync/atomic"
)

var (
    goSideStart uint64
    goSideEnd   uint64
)

//export measure_go_function
func measure_go_function() C.int {
    // Record the Go-side start time. Each C.get_nanoseconds() is itself a
    // Go -> C call, so its cost is included in the figures below.
    atomic.StoreUint64(&goSideStart, uint64(C.get_nanoseconds()))

    // No runtime.LockOSThread here: the runtime already keeps the goroutine
    // on its OS thread for the duration of a cgo callback.

    // Simulated work
    result := 0
    for i := 0; i < 1000; i++ {
        result += i
    }

    // Record the Go-side end time
    atomic.StoreUint64(&goSideEnd, uint64(C.get_nanoseconds()))
    return C.int(result)
}

func main() {
    // C-side measurement of the full round trip
    var elapsed C.uint64_t
    result := C.call_go(&elapsed)
    total := uint64(elapsed)

    // Go-side time
    goSide := atomic.LoadUint64(&goSideEnd) - atomic.LoadUint64(&goSideStart)

    fmt.Printf("Total C->Go->C: %d ns\n", total)
    fmt.Printf("Go side only: %d ns\n", goSide)
    fmt.Printf("CGO overhead: %d ns\n", total-goSide)
    fmt.Printf("Result: %d\n", result)
}
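A single run of the measurement above is noisy (scheduling, cache state, the cost of the timer calls themselves), so it usually helps to repeat it and look at the minimum and median rather than one sample. A small sketch of that aggregation; summarize and the sample values are made up for illustration:

```go
package main

import (
	"fmt"
	"sort"
)

// summarize returns the minimum and the median of the samples, which are
// durations in nanoseconds. For an even number of samples it returns the
// upper of the two middle values. The input slice is not modified.
func summarize(samples []int64) (lo, med int64) {
	if len(samples) == 0 {
		return 0, 0
	}
	s := append([]int64(nil), samples...)
	sort.Slice(s, func(i, j int) bool { return s[i] < s[j] })
	return s[0], s[len(s)/2]
}

func main() {
	// Hypothetical per-call overheads collected from repeated runs.
	samples := []int64{500, 120, 130, 9000, 125}
	lo, med := summarize(samples)
	fmt.Printf("min=%d ns median=%d ns\n", lo, med)
}
```

The minimum approximates the cost with the least interference, while the median shows what a typical call looks like; a large gap between them suggests the outliers come from the environment rather than from cgo itself.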

4. Use SystemTap for deeper tracing

Create a SystemTap script, cgo_timing.stp:

global cgo_enter_times

probe process("/path/to/your/binary").function("crosscall2") {
    cgo_enter_times[tid()] = gettimeofday_ns()
}

probe process("/path/to/your/binary").function("cgocallback") {
    if (tid() in cgo_enter_times) {
        latency = gettimeofday_ns() - cgo_enter_times[tid()]
        printf("CGO call latency: %d ns\n", latency)
        delete cgo_enter_times[tid()]
    }
}

probe process("/path/to/your/binary").function("runtime.cgocall") {
    printf("Go -> C call at %d\n", gettimeofday_ns())
}

probe process("/path/to/your/binary").function("runtime.cgocallback") {
    printf("C -> Go callback at %d\n", gettimeofday_ns())
}

Run it:

stap -g cgo_timing.stp -c ./your_program

5. Modify the Go runtime source

If you need the most precise measurement, you can modify the Go runtime source:

// Add to src/runtime/cgocall.go

+var cgoCallDuration int64     // nanoseconds spent in the most recent cgocall (not synchronized)
+var cgoCallbackDuration int64 // nanoseconds spent in the most recent cgocallbackg (not synchronized)

 func cgocall(fn, arg unsafe.Pointer) int32 {
+    start := nanotime()
     // ... existing code
+    cgoCallDuration = nanotime() - start
     return ret
 }

 func cgocallbackg() {
+    start := nanotime()
     // ... existing code
+    cgoCallbackDuration = nanotime() - start
 }

Then rebuild the Go toolchain:

cd src
./make.bash

6. Use eBPF for dynamic tracing

// cgo_trace.c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// bpf_printk is a GPL-only helper, so the program must declare a GPL-compatible license.
char LICENSE[] SEC("license") = "GPL";

// Entry timestamp per thread id.
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u32);
    __type(value, __u64);
} start SEC(".maps");

SEC("uprobe//path/to/binary:crosscall2")
int crosscall2_entry(struct pt_regs *ctx) {
    __u32 tid = bpf_get_current_pid_tgid(); // lower 32 bits are the thread id
    __u64 ts = bpf_ktime_get_ns();
    bpf_map_update_elem(&start, &tid, &ts, BPF_ANY);
    return 0;
}

SEC("uretprobe//path/to/binary:crosscall2")
int crosscall2_exit(struct pt_regs *ctx) {
    __u32 tid = bpf_get_current_pid_tgid();
    __u64 *tsp = bpf_map_lookup_elem(&start, &tid);
    if (tsp) {
        __u64 latency = bpf_ktime_get_ns() - *tsp;
        bpf_printk("crosscall2 latency: %llu ns", latency);
        bpf_map_delete_elem(&start, &tid);
    }
    return 0;
}

Compile and load:

clang -O2 -g -target bpf -c cgo_trace.c -o cgo_trace.o
bpftool prog load cgo_trace.o /sys/fs/bpf/cgo_trace

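These approaches range from simple to complex; choose according to your needs. For production, perf or eBPF is recommended; for development and debugging, manual instrumentation combined with the trace tool is usually enough.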