Go: comparing time spent in Syscall6 on CentOS 8 and CentOS 7

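This is an application that performs multithreaded network I/O (a userspace NFS client written in Go). It should be able to move data at close to line rate on a 10Gb NIC. On CentOS 7 it does: 10GiB of data completes in about 10-11 seconds. On CentOS 8, however, it is about 2-3 seconds slower and consistently takes around 14 seconds.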

Both VMs have the same specs and run on the same physical host.

Project: GitHub - sile16/fbcp: FlashBlade Fast File transfer tool

Command:

time ./fbcp -threads 16 -sizeMB 8 -profile pp.txt 172.19.0.20:/DEMO-NFS-1/10GRand |cat > /dev/null

Go version 1.19.1

Here are the profiles for each system.

CentOS 7 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

File: fbcp
Type: cpu
Time: Sep 15, 2022 at 8:51am (CDT)
Duration: 11.22s, Total samples = 31.89s (284.32%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 28.68s, 89.93% of 31.89s total
Dropped 241 nodes (cum <= 0.16s)
Showing top 20 nodes out of 90
      flat  flat%   sum%        cum   cum%
    12.28s 38.51% 38.51%     12.28s 38.51%  runtime/internal/syscall.Syscall6
     5.19s 16.27% 54.78%      5.19s 16.27%  runtime.memclrNoHeapPointers
     4.66s 14.61% 69.39%      4.66s 14.61%  runtime.memmove
     2.68s  8.40% 77.80%      2.68s  8.40%  runtime.futex
     1.02s  3.20% 81.00%      1.02s  3.20%  runtime.epollwait
     0.64s  2.01% 83.00%      0.64s  2.01%  runtime.madvise
     0.43s  1.35% 84.35%      1.12s  3.51%  runtime.stealWork
     0.36s  1.13% 85.48%      0.36s  1.13%  runtime.(*randomEnum).next (inline)
     0.19s   0.6% 86.08%      0.20s  0.63%  runtime.casgstatus
     0.18s  0.56% 86.64%      6.14s 19.25%  runtime.findRunnable
     0.16s   0.5% 87.14%      0.21s  0.66%  runtime.lock2
     0.12s  0.38% 87.52%      9.39s 29.44%  internal/poll.(*FD).Read
     0.12s  0.38% 87.90%      0.16s   0.5%  runtime.checkTimers
     0.12s  0.38% 88.27%     12.40s 38.88%  syscall.RawSyscall6
     0.11s  0.34% 88.62%      1.22s  3.83%  runtime.notesleep
     0.11s  0.34% 88.96%      0.20s  0.63%  runtime.reentersyscall
     0.10s  0.31% 89.28%      1.16s  3.64%  runtime.netpoll
     0.08s  0.25% 89.53%      5.64s 17.69%  runtime.mallocgc
     0.07s  0.22% 89.75%      9.48s 29.73%  net.(*conn).Read
     0.06s  0.19% 89.93%      9.76s 30.61%  bufio.(*Reader).Read
(pprof) quit

CentOS 8 4.18.0-408.el8.x86_64 #1 SMP Mon Jul 18 17:42:52 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Type: cpu
Time: Sep 15, 2022 at 8:51am (CDT)
Duration: 14.32s, Total samples = 27.91s (194.87%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 25.39s, 90.97% of 27.91s total
Dropped 213 nodes (cum <= 0.14s)
Showing top 20 nodes out of 69
      flat  flat%   sum%        cum   cum%
    15.77s 56.50% 56.50%     15.77s 56.50%  runtime/internal/syscall.Syscall6
     3.15s 11.29% 67.79%      3.15s 11.29%  runtime.memclrNoHeapPointers
     3.14s 11.25% 79.04%      3.14s 11.25%  runtime.memmove
     1.24s  4.44% 83.48%      1.24s  4.44%  runtime.futex
     1.01s  3.62% 87.10%      1.01s  3.62%  runtime.epollwait
     0.19s  0.68% 87.78%      0.45s  1.61%  runtime.stealWork
     0.11s  0.39% 88.18%      3.43s 12.29%  runtime.findRunnable
     0.11s  0.39% 88.57%      1.15s  4.12%  runtime.netpoll
     0.11s  0.39% 88.96%     15.88s 56.90%  syscall.RawSyscall6
     0.08s  0.29% 89.25%      0.15s  0.54%  runtime.exitsyscall
     0.08s  0.29% 89.54%      3.41s 12.22%  runtime.mallocgc
     0.07s  0.25% 89.79%     10.97s 39.30%  bufio.(*Reader).Read
     0.06s  0.21% 90.00%     10.59s 37.94%  internal/poll.(*FD).Read
     0.06s  0.21% 90.22%     10.70s 38.34%  net.(*conn).Read
     0.05s  0.18% 90.40%     11.03s 39.52%  io.ReadAtLeast
     0.04s  0.14% 90.54%     10.30s 36.90%  syscall.read
     0.03s  0.11% 90.65%      0.16s  0.57%  github.com/rasky/go-xdr/xdr2.(*Encoder).encodeStruct
     0.03s  0.11% 90.76%     10.64s 38.12%  net.(*netFD).Read
     0.03s  0.11% 90.86%      3.61s 12.93%  runtime.schedule
     0.03s  0.11% 90.97%      0.52s  1.86%  runtime.stopm


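sinazl, reply #1

Hi @sile16, welcome to the forum.

~~You may well find an answer on the Go Bridge forum, but I think your question is technical enough to be worth filing as an issue on the Go project's GitHub issue tracker. Good luck!~~

Edit: After searching for similar questions, I found this one, which was closed because the issue tracker is not the place to ask questions (in that case, it was about why memory usage differed between kernel versions). I suspect they would treat your question about the runtime the same way. You can still try asking there, of course, but StackOverflow or one of the other sites the Go team lists on its questions page may be a better fit.

One last thing: I do not want to discourage you from asking on this forum. But since your problem seems to be tied to the Linux kernel version, you may need more people to see it before you find someone who has an environment in which they can reproduce your scenario.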


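htzhanglong, reply #2

Judging from the profiles, the share of CPU samples spent in runtime/internal/syscall.Syscall6 is markedly higher on CentOS 8 (56.50%, up from 38.51%), and its flat time rises from 12.28s to 15.77s. This is the key difference between the two profiles and the likely cause of the slowdown. It usually points to changes in the kernel's system call path or in the virtualization layer.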
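
To put the two profiles side by side, here is a small sketch of the arithmetic. It assumes exactly 10 GiB was transferred, as stated in the question, and takes the durations and Syscall6 flat times from the pprof output above. The helper names are made up for illustration.

```go
package main

import "fmt"

// gbitPerSec converts "gib GiB moved in the given number of seconds"
// into gigabits per second (1 Gbit = 1e9 bits).
func gbitPerSec(gib, seconds float64) float64 {
	bits := gib * 1024 * 1024 * 1024 * 8
	return bits / seconds / 1e9
}

// pctChange returns the relative change from before to after, in percent.
func pctChange(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	// Durations come from the "Duration:" lines of the two profiles.
	fmt.Printf("CentOS 7: %.2f Gbit/s\n", gbitPerSec(10, 11.22))
	fmt.Printf("CentOS 8: %.2f Gbit/s\n", gbitPerSec(10, 14.32))
	// Flat times of runtime/internal/syscall.Syscall6 from the top20 listings.
	fmt.Printf("Syscall6 flat time change: %+.1f%%\n", pctChange(12.28, 15.77))
}
```

Under those assumptions this works out to roughly 7.7 Gbit/s on CentOS 7 against roughly 6.0 Gbit/s on CentOS 8, with about 28% more CPU time in Syscall6.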

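Problem analysis

System call overhead: CentOS 8 runs a 4.18 kernel, while CentOS 7 runs 3.10. The 4.18 kernel differs substantially in its system call entry path and in its Spectre mitigations, which may add overhead to every system call.
Virtualization factors: although both VMs run on the same physical host, the CentOS 8 VM may be affected by a different virtualization configuration.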
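
One way to check the mitigation hypothesis is to compare what each kernel reports under /sys/devices/system/cpu/vulnerabilities. The sketch below simply prints every file in that directory. It assumes the directory exists on both kernels, and the function name is hypothetical. Run it on both VMs and diff the output.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

// readMitigations returns a map from file name to trimmed file contents
// for every regular entry in dir.
func readMitigations(dir string) (map[string]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	out := make(map[string]string, len(entries))
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		data, err := os.ReadFile(filepath.Join(dir, e.Name()))
		if err != nil {
			return nil, err
		}
		out[e.Name()] = strings.TrimSpace(string(data))
	}
	return out, nil
}

func main() {
	m, err := readMitigations("/sys/devices/system/cpu/vulnerabilities")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read mitigation status:", err)
		os.Exit(1)
	}
	names := make([]string, 0, len(m))
	for name := range m {
		names = append(names, name)
	}
	sort.Strings(names) // stable order makes the two outputs easy to diff
	for _, name := range names {
		fmt.Printf("%-24s %s\n", name, m[name])
	}
}
```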

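Solutions

1. Reduce system call frequency

Use larger buffers to reduce the number of read system calls. The snippet below requests larger kernel socket buffers; how many read calls are issued also depends on the size of the user-space buffer passed to each read:

import "net"

// dialWithLargeBuffers opens a TCP connection and requests larger socket buffers.
func dialWithLargeBuffers(address string) (net.Conn, error) {
    conn, err := net.Dial("tcp", address)
    if err != nil {
        return nil, err
    }
    // Request 1MB socket buffers; the kernel may cap the value it actually applies.
    if tcpConn, ok := conn.(*net.TCPConn); ok {
        if err := tcpConn.SetReadBuffer(1 * 1024 * 1024); err != nil {
            conn.Close()
            return nil, err
        }
        if err := tcpConn.SetWriteBuffer(1 * 1024 * 1024); err != nil {
            conn.Close()
            return nil, err
        }
    }
    return conn, nil
}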
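
The profiles show reads going through bufio.(*Reader).Read, so the size of that user-space buffer is one knob that directly affects how many reads reach the socket. The sketch below illustrates the effect with a counting reader standing in for the connection. What buffer size fbcp actually uses is not shown in the question, so the sizes here are assumptions, and all names are hypothetical. On a real socket a read can return less than the buffer size, so the saving is an upper bound.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// countingReader counts Read calls on the wrapped reader. It stands in for
// a socket, where each Read would be one read system call.
type countingReader struct {
	r     io.Reader
	calls int
}

func (c *countingReader) Read(p []byte) (int, error) {
	c.calls++
	return c.r.Read(p)
}

// readsNeeded drains total bytes in chunk-sized requests through a
// bufio.Reader of size bufSize and reports how many Read calls reached
// the underlying reader.
func readsNeeded(total, bufSize, chunk int) int {
	src := &countingReader{r: strings.NewReader(strings.Repeat("x", total))}
	br := bufio.NewReaderSize(src, bufSize)
	buf := make([]byte, chunk)
	for {
		if _, err := io.ReadFull(br, buf); err != nil {
			break // io.EOF once the source is drained
		}
	}
	return src.calls
}

func main() {
	const total = 1 << 20 // 1 MiB of data, consumed 512 bytes at a time
	fmt.Println("4 KiB buffer: ", readsNeeded(total, 4<<10, 512), "underlying reads")
	fmt.Println("64 KiB buffer:", readsNeeded(total, 64<<10, 512), "underlying reads")
}
```

With a real connection the equivalent change would be bufio.NewReaderSize(conn, size) in place of bufio.NewReader(conn).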

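2. Use syscall.RawSyscall6 directly

For performance-critical paths, consider issuing the raw system call directly. Note that the profiles above already show syscall.RawSyscall6 on the read path, so the gain is likely to be small, and a raw call does not tell the Go scheduler that the thread may block:

import (
    "syscall"
    "unsafe"
)

// rawRead issues read(2) on fd without the scheduler bookkeeping that
// syscall.Syscall6 performs around the call.
func rawRead(fd int, p []byte) (n int, err error) {
    var _p0 unsafe.Pointer
    if len(p) > 0 {
        _p0 = unsafe.Pointer(&p[0])
    }
    // read(2) takes three arguments; the remaining three are unused.
    r0, _, e1 := syscall.RawSyscall6(
        syscall.SYS_READ,
        uintptr(fd),
        uintptr(_p0),
        uintptr(len(p)),
        0, 0, 0,
    )
    n = int(r0) // -1 when the call failed
    if e1 != 0 {
        err = e1
    }
    return
}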
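
If the goal is to control the read call while still cooperating with the runtime's network poller, syscall.RawConn is a documented alternative: its Read method invokes a callback with the file descriptor and waits for readiness when the callback returns false. The following is a sketch rather than a drop-in replacement. readViaRawConn and loopbackDemo are hypothetical names, and the loopback server exists only to make the example runnable.

```go
package main

import (
	"fmt"
	"net"
	"syscall"
)

// readViaRawConn performs read(2) on the connection's descriptor. When the
// socket has no data (EAGAIN), the callback returns false, so the runtime
// parks the goroutine until the descriptor is readable and calls it again.
func readViaRawConn(conn *net.TCPConn, p []byte) (int, error) {
	rc, err := conn.SyscallConn()
	if err != nil {
		return 0, err
	}
	var n int
	var readErr error
	err = rc.Read(func(fd uintptr) bool {
		for {
			n, readErr = syscall.Read(int(fd), p)
			if readErr != syscall.EINTR {
				break // retry only when interrupted by a signal
			}
		}
		return readErr != syscall.EAGAIN
	})
	if err != nil {
		return 0, err
	}
	if readErr != nil {
		return 0, readErr
	}
	return n, nil // n == 0 means the peer closed the connection
}

// loopbackDemo sends msg over a local TCP connection and reads it back
// with readViaRawConn.
func loopbackDemo(msg string) (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer ln.Close()
	go func() {
		c, err := ln.Accept()
		if err != nil {
			return
		}
		defer c.Close()
		c.Write([]byte(msg))
	}()
	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return "", err
	}
	defer conn.Close()
	buf := make([]byte, len(msg))
	got := 0
	for got < len(buf) {
		n, err := readViaRawConn(conn.(*net.TCPConn), buf[got:])
		if err != nil {
			return "", err
		}
		if n == 0 {
			break
		}
		got += n
	}
	return string(buf[:got]), nil
}

func main() {
	got, err := loopbackDemo("hello")
	fmt.Println(got, err)
}
```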

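3. Adjust Go runtime parameters

Set environment variables before starting the program to tune the runtime. None of them targets system call cost directly, so measure the effect of each one:

# Limit the number of OS threads that run Go code simultaneously
export GOMAXPROCS=16
# Disable signal-based asynchronous preemption
export GODEBUG=asyncpreemptoff=1
# Start a GC cycle at 50% heap growth (default 100); this runs the GC more often
export GOGC=50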
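
GOMAXPROCS and GOGC can also be set from inside the program, which makes it easier to try several values in one benchmark run. This is a minimal sketch using runtime.GOMAXPROCS and debug.SetGCPercent; applyRuntimeSettings is a hypothetical helper. GODEBUG=asyncpreemptoff=1 has no in-process equivalent here and still has to come from the environment.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// applyRuntimeSettings sets GOMAXPROCS and the GC target percentage and
// returns the values that were in effect before the call.
func applyRuntimeSettings(procs, gcPercent int) (prevProcs, prevGC int) {
	prevProcs = runtime.GOMAXPROCS(procs)
	prevGC = debug.SetGCPercent(gcPercent)
	return prevProcs, prevGC
}

func main() {
	prevProcs, prevGC := applyRuntimeSettings(16, 50)
	fmt.Printf("GOMAXPROCS: %d -> 16, GOGC: %d -> 50\n", prevProcs, prevGC)
}
```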

4. Kernel parameter tuning

Adjust kernel parameters on CentOS 8:

# Increase socket buffer sizes
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.rmem_default=16777216
sysctl -w net.core.wmem_default=16777216

# Tune TCP parameters
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
sysctl -w net.ipv4.tcp_window_scaling=1

# Reduce context-switch overhead
sysctl -w kernel.sched_min_granularity_ns=10000000
sysctl -w kernel.sched_wakeup_granularity_ns=15000000
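
Before changing anything, it helps to record the current values on both systems, since the two kernels may already differ in their defaults. The sketch below reads the same keys through /proc/sys. The helper names are hypothetical, and the dotted-name mapping is a simplification that does not handle keys whose components themselves contain dots.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// sysctlPath maps a dotted sysctl name to its file under /proc/sys.
func sysctlPath(name string) string {
	return "/proc/sys/" + strings.ReplaceAll(name, ".", "/")
}

// readSysctl returns the current value of a sysctl key as a trimmed string.
func readSysctl(name string) (string, error) {
	data, err := os.ReadFile(sysctlPath(name))
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(data)), nil
}

func main() {
	// The keys are the ones adjusted in the sysctl commands above.
	keys := []string{
		"net.core.rmem_max",
		"net.core.wmem_max",
		"net.core.rmem_default",
		"net.core.wmem_default",
		"net.ipv4.tcp_rmem",
		"net.ipv4.tcp_wmem",
		"net.ipv4.tcp_window_scaling",
		"kernel.sched_min_granularity_ns",
		"kernel.sched_wakeup_granularity_ns",
	}
	for _, k := range keys {
		v, err := readSysctl(k)
		if err != nil {
			v = "unavailable: " + err.Error()
		}
		fmt.Printf("%s = %s\n", k, v)
	}
}
```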

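5. Use io_uring (if the kernel supports it)

io_uring was added upstream in Linux 5.1, which is newer than the 4.18-based kernel shown above, so first check that your kernel actually provides it. Where it is available, this newer asynchronous I/O interface can be used:

// Requires a third-party library such as github.com/iceber/iouring-go
import "github.com/iceber/iouring-go"

func readWithIOUring() {
    ur, err := iouring.New(1024)
    if err != nil {
        panic(err)
    }
    defer ur.Close()

    // Perform asynchronous reads through io_uring
    // ... the details depend on the actual requirements
}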
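
A quick guard against trying io_uring on a kernel that predates it is to compare the running kernel's release string with 5.1. This sketch reads /proc/sys/kernel/osrelease; kernelAtLeast is a hypothetical helper. A passing version check is only a hint, because distributions can disable or backport features independently of the version number.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// kernelAtLeast reports whether a release string such as
// "4.18.0-408.el8.x86_64" is at least major.minor.
func kernelAtLeast(release string, major, minor int) bool {
	var maj, mnr int
	if n, _ := fmt.Sscanf(release, "%d.%d", &maj, &mnr); n != 2 {
		return false // unparseable release: treat as unsupported
	}
	return maj > major || (maj == major && mnr >= minor)
}

func main() {
	data, err := os.ReadFile("/proc/sys/kernel/osrelease")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read kernel release:", err)
		os.Exit(1)
	}
	release := strings.TrimSpace(string(data))
	fmt.Printf("kernel %s, 5.1 or newer: %v\n", release, kernelAtLeast(release, 5, 1))
}
```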

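6. Performance comparison test code

Add monitoring code to quantify system call overhead:

import (
    "sync/atomic"
    "syscall"
    "time"
)

// SyscallMonitor accumulates the number of monitored calls and the total
// time spent in them. It is safe for use from multiple goroutines.
type SyscallMonitor struct {
    calls   int64 // number of completed calls
    totalNs int64 // cumulative time in nanoseconds
}

// Begin returns the start time of one monitored call.
func (m *SyscallMonitor) Begin() time.Time {
    return time.Now()
}

// End records one call that started at start.
func (m *SyscallMonitor) End(start time.Time) {
    atomic.AddInt64(&m.totalNs, int64(time.Since(start)))
    atomic.AddInt64(&m.calls, 1)
}

// Average returns the mean time per call, or 0 if nothing was recorded.
func (m *SyscallMonitor) Average() time.Duration {
    calls := atomic.LoadInt64(&m.calls)
    if calls == 0 {
        return 0
    }
    return time.Duration(atomic.LoadInt64(&m.totalNs) / calls)
}

// Wrap the critical system call with Begin and End
var monitor SyscallMonitor

func monitoredRead(fd int, buf []byte) (int, error) {
    defer monitor.End(monitor.Begin())
    return syscall.Read(fd, buf)
}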

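Verification

Use perf to count read system calls (together with the elapsed time that perf stat reports, the counts give a rough per-call cost):

perf stat -e syscalls:sys_enter_read,syscalls:sys_exit_read ./fbcp

Compare strace summaries on both systems (-f follows the threads created by the Go runtime):

strace -f -c -e trace=read ./fbcp

These measures may reduce system call overhead on CentOS 8 and bring performance closer to the CentOS 7 level. The key points are to cut the number of system calls through buffering and to tune kernel parameters where appropriate.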