A discussion of cgo performance problems in Golang

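I benchmarked my C library through a Go front end and found a significant performance penalty when Go calls a C function. Take a call to my function "MqReadD" as an example.

My C code (#1) takes 10 ms, while the Go wrapper (#2) and the CGO wrapper (#3) take 110 ms in total. -> As you can see, only about 10% of the time is real work; roughly 90% is Go/CGO overhead.

-> Is there any way to speed up the Go code?

(I already use export GODEBUG=cgocheck=0 to get the best performance.)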

(pprof) list ReadD
Total: 2.71s
ROUTINE ======================== MqReadD in ...NHI1/theLink/libmsgque/read_mq.c
         0       10ms (flat, cum)  0.37% of Total
         .          .   1204:  struct MqS * const context,
         .          .   1205:  MQ_DBL * const valP
         .          .   1206:)
         .          .   1207:{
         .          .   1208:  check_CTX(MQ_HDL_NULL_ERROR(MqC))
         .       10ms   1209:  return sReadA8(context, (union MqBufferAtomU * const) valP, MQ_DBLT);
         .          .   1210:}
         .          .   1211:
         .          .   1212:enum MqErrorE
         .          .   1213:MqReadC (
         .          .   1214:  struct MqS * const context,
ROUTINE ======================== gomsgque.ReadD..1gomsgque.MqC in /.../NHI1/theLink/gomsgque/src/gomsgque/MqC.go
         0      120ms (flat, cum)  4.43% of Total
         .          .   1282:}
         .          .   1283:
         .          .   1284:/// \refdoc{ReadD}
         .          .   1285:func (this *MqC) ReadD () float64 {
         .          .   1286:  hdl := this.getCTX()
         .       10ms   1287:  var val_out C.MQ_DBL
         .      110ms   1288:  var errVal C.enum_MqErrorE = C.MqReadD (hdl, &val_out)
         .          .   1289:  if (errVal > C.MQ_CONTINUE) { MqErrorC_Check(C.MQ_MNG(hdl), errVal) }
         .          .   1290:  return (float64)(val_out)
         .          .   1291:}
         .          .   1292:
         .          .   1293:/// \refdoc{ReadF}
ROUTINE ======================== gomsgque._Cfunc_MqReadD in /tmp/go-build/b001/_cgo_gotypes.go
         0      110ms (flat, cum)  4.06% of Total
         .          .   3028://extern _cgo_a12d8325d15e_Cfunc_MqReadC
         .          .   3029:func _cgo_a12d8325d15e_Cfunc_MqReadC(p0 _Ctype_MQ_CTX, p1 *_Ctype_MQ_CST) uint32
         .          .   3030:
         .          .   3031:func _Cfunc_MqReadD(p0 _Ctype_MQ_CTX, p1 *_Ctype_MQ_DBL) uint32 {
         .          .   3032:   defer syscall.CgocallDone()
         .       60ms   3033:   syscall.Cgocall()
         .       10ms   3034:   r := _cgo_a12d8325d15e_Cfunc_MqReadD(p0, p1)
         .          .   3035:   return r
         .       40ms   3036:}
         .          .   3037://extern _cgo_a12d8325d15e_Cfunc_MqReadD
         .          .   3038:func _cgo_a12d8325d15e_Cfunc_MqReadD(p0 _Ctype_MQ_CTX, p1 *_Ctype_MQ_DBL) uint32
         .          .   3039:
         .          .   3040:func _Cfunc_MqReadF(p0 _Ctype_MQ_CTX, p1 *_Ctype_MQ_FLT) uint32 {
         .          .   3041:   defer syscall.CgocallDone()


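sinazl, reply #1

An expert points out:

Calls from C to Go take a long time: is this cgo overhead or my mistake? https://groups.google.com/d/msg/golang-nuts/B44pEq-uso8/uvL69eCxCgAJ

A reasonable rule of thumb is that a call from Go to C takes about as long as ten function calls, and a call from C to Go is worse. There are several reasons for this, and we are certainly interested in making it faster, but it is a hard problem.

Unfortunately, this means you should not design your program to call casually between Go and C. Where possible, batch up calls, and try to build data structures entirely in one language before passing them to the other.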

Ian
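
The batching advice above can be made concrete with a small back-of-the-envelope model. The sketch below is pure Go and only does arithmetic; it assumes the Go-to-C overhead is a fixed cost per call and ignores any cost of copying data into a batch. The function name overheadShare is hypothetical, and the input figures are read off the profile in the first post (110ms at the call site, 10ms of it inside MqReadD).

```go
package main

import "fmt"

// overheadShare returns the fraction of total time spent on call overhead
// when `batch` items are processed per Go-to-C call.
// overhead is the assumed fixed cost of one call, work the cost of
// processing one item; both must use the same unit.
func overheadShare(overhead, work float64, batch int) float64 {
	perItem := overhead / float64(batch) // overhead is paid once per batch
	return perItem / (perItem + work)
}

func main() {
	// Figures taken from the profile in this thread (assumption: they are
	// representative of a single call).
	const overhead, work = 100.0, 10.0
	for _, batch := range []int{1, 10, 100} {
		fmt.Printf("batch=%3d  overhead share=%.0f%%\n",
			batch, 100*overheadShare(overhead, work, batch))
	}
}
```

Under this model the overhead share falls from about 91% with one item per call to 50% with 10 items and about 9% with 100 items, which is why batching is the first thing to try.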


sinazl, reply #2

Thanks for the answer → I think the core problem is the CGO wrapper and the two functions "Cgocall" and "CgocallDone".

         .          .    3031:func _Cfunc_MqReadD(p0 _Ctype_MQ_CTX, p1 *_Ctype_MQ_DBL) uint32 {
         .          .    3032:   defer syscall.CgocallDone()
         .       60ms    3033:   syscall.Cgocall()
         .       10ms    3034:   r := _cgo_a12d8325d15e_Cfunc_MqReadD(p0, p1)
         .          .    3035:   return r
         .       40ms    3036:}

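→ I found some of the source code in libgo/runtime/go-cgo.c - native_client/nacl-gcc - Git at Google. Its documentation says:

Prepare to call from Go code to C or C++ code. This takes the current goroutine out of the Go scheduler, as though it were making a system call. Otherwise the program may deadlock if the C code goes to sleep on a mutex or for some other reason. The idea is to call this function, then immediately call the C/C++ function. After the C/C++ function returns, call syscall_cgocalldone. The usual Go code would look like:

    syscall.Cgocall()
    defer syscall.Cgocalldone()
    cfunction()

OK → Could Go generate a wrapper that does not contain "Cgocall" and "CgocallDone"? → Perhaps controlled by a switch in the top-level Go code:

This would help C functions that run only briefly and cannot block.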

         .          .   1284:/// \refdoc{ReadD}
         .          .   1285:func (this *MqC) ReadD () float64 {
         .          .   1286:  hdl := this.getCTX()
         .       10ms   1287:  var val_out C.MQ_DBL
                               // noblock            -> !! NEW !!
         .      110ms   1288:  var errVal C.enum_MqErrorE = C.MqReadD (hdl, &val_out)
         .          .   1289:  if (errVal > C.MQ_CONTINUE) { MqErrorC_Check(C.MQ_MNG(hdl), errVal) }
         .          .   1290:  return (float64)(val_out)
         .          .   1291:}

yibo5220, reply #3

The profiling data shows that the CGO call overhead is indeed significant. Here are several ways to reduce it:

1. Batch calls to cross the CGO boundary less often

Merge many calls into a single call:

// C side: batch interface (process_single is a placeholder for the per-item work)
void batch_process(MQ_CTX ctx, MQ_DBL* inputs, MQ_DBL* outputs, int count) {
    for (int i = 0; i < count; i++) {
        outputs[i] = process_single(inputs[i]);
    }
}

// Go side: one CGO call for the whole slice
func (this *MqC) BatchReadD(values []float64) []float64 {
    if len(values) == 0 {
        return nil // &cInputs[0] below would panic on an empty slice
    }
    hdl := this.getCTX()
    count := len(values)

    cInputs := make([]C.MQ_DBL, count)
    cOutputs := make([]C.MQ_DBL, count)

    for i, v := range values {
        cInputs[i] = C.MQ_DBL(v)
    }

    C.batch_process(hdl, &cInputs[0], &cOutputs[0], C.int(count))

    outputs := make([]float64, count)
    for i := range outputs {
        outputs[i] = float64(cOutputs[i])
    }
    return outputs
}

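2. Operate directly on C memory

Avoid converting between Go and C types on every call:

// Preallocate C memory and operate on it directly.
// C.malloc and C.free need #include <stdlib.h> in the cgo preamble.
type MqCBatch struct {
    ctx   C.MQ_CTX
    data  *C.MQ_DBL
    count C.int
}

func NewMqCBatch(ctx C.MQ_CTX, size int) *MqCBatch {
    return &MqCBatch{
        ctx:   ctx,
        data:  (*C.MQ_DBL)(C.malloc(C.size_t(size) * C.sizeof_MQ_DBL)),
        count: C.int(size),
    }
}

// Free releases the C buffer; the Go garbage collector does not manage it.
func (b *MqCBatch) Free() {
    C.free(unsafe.Pointer(b.data))
    b.data = nil
}

func (b *MqCBatch) Process() {
    // A single CGO call processes all the data
    C.process_batch(b.ctx, b.data, b.count)
}

func (b *MqCBatch) SetValue(idx int, val float64) {
    // Write to C memory directly, without a CGO call.
    // unsafe.Slice (Go 1.17+) gives a bounds-checked view of the buffer.
    unsafe.Slice(b.data, int(b.count))[idx] = C.MQ_DBL(val)
}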

3. Asynchronous call pattern

Run CGO calls on a dedicated worker goroutine. This does not make an individual call cheaper; it only keeps the caller from waiting on it:

type CGOJob struct {
    fn   func()
    done chan struct{}
}

type CGOWorker struct {
    jobs chan *CGOJob
}

func NewCGOWorker() *CGOWorker {
    w := &CGOWorker{
        jobs: make(chan *CGOJob, 100),
    }
    go w.run()
    return w
}

func (w *CGOWorker) run() {
    // Pin the worker to one OS thread for its whole lifetime
    runtime.LockOSThread()
    defer runtime.UnlockOSThread()

    for job := range w.jobs {
        job.fn()
        close(job.done)
    }
}

// Usage example: queue the call on the worker and return a channel for the result
func (this *MqC) ReadDAsync(w *CGOWorker) <-chan float64 {
    result := make(chan float64, 1)
    w.jobs <- &CGOJob{
        fn: func() {
            hdl := this.getCTX()
            var val_out C.MQ_DBL
            var errVal C.enum_MqErrorE = C.MqReadD(hdl, &val_out)
            if errVal > C.MQ_CONTINUE {
                MqErrorC_Check(C.MQ_MNG(hdl), errVal)
            }
            result <- float64(val_out)
        },
        done: make(chan struct{}),
    }
    return result
}

4. Memory pool to reduce allocations

Reuse the memory involved in the CGO call:

// readDBuf holds the output value and error code of one MqReadD call
type readDBuf struct {
    val C.MQ_DBL
    err C.enum_MqErrorE
}

var cgoPool = sync.Pool{
    New: func() interface{} { return new(readDBuf) },
}

func (this *MqC) ReadDPooled() float64 {
    hdl := this.getCTX()
    item := cgoPool.Get().(*readDBuf)
    defer cgoPool.Put(item)

    item.err = C.MqReadD(hdl, &item.val)
    if item.err > C.MQ_CONTINUE {
        MqErrorC_Check(C.MQ_MNG(hdl), item.err)
    }
    return float64(item.val)
}

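5. Calling through assembly directly (advanced)

On a specific platform, the C function can be called from an assembly stub. Go has no inline assembly, so the stub has to live in a separate .s file:

// Linux x86_64 example
// asmCall is a placeholder for a hand-written assembly stub; it is not defined here.
func callMqReadD(ctx uintptr) float64 {
    var result float64
    // Calls the C function directly, bypassing CGO entirely
    // Requires exact knowledge of the C calling convention
    asmCall(ctx, uintptr(unsafe.Pointer(&result)))
    return result
}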

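On performance-critical paths, keep the number of CGO calls as low as possible and prefer batching. For simple functions that are called very often, consider rewriting them in pure Go or optimizing them with assembly.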