[Problem Description]
The user reported that two servers hang intermittently during normal operation; the hangs follow no discernible pattern and occur at random times. During the periods when the hangs occurred, the following messages appear in the operating system kernel log.
Jul 12 02:22:59 db311 kernel: [23555142.837565] AMD-Vi: Completion-Wait loop timed out
Jul 12 02:23:00 db311 kernel: [23555142.981989] AMD-Vi: Completion-Wait loop timed out
Jul 12 02:23:00 db311 kernel: [23555143.126500] AMD-Vi: Completion-Wait loop timed out
Jul 12 02:23:00 db311 kernel: [23555143.195712] ------------[ cut here ]------------
Jul 12 02:23:00 db311 kernel: [23555143.195715] NETDEV WATCHDOG: eth1 (mlx5_core): transmit queue 56 timed out
Jul 12 02:23:00 db311 kernel: [23555143.195791] WARNING: CPU: 113 PID: 158090 at net/sched/sch_generic.c:467 dev_watchdog+0x260/0x270
.....
Jul 12 02:23:00 db311 kernel: [23555143.195969] ---[ end trace a7ca51f7a7ce4070 ]---
Jul 12 02:23:00 db311 kernel: [23555143.195973] mlx5_core 0000:01:00.1 eth1: TX timeout detected
Jul 12 02:23:00 db311 kernel: [23555143.270927] AMD-Vi: Completion-Wait loop timed out
Jul 12 02:23:00 db311 kernel: [23555143.415381] AMD-Vi: Completion-Wait loop timed out
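To check whether other hosts are hitting the same failure, the kernel log can be searched for these signatures. A minimal sketch; the patterns are taken verbatim from the excerpt above:

```bash
# Search the live kernel ring buffer for the IOMMU and NIC timeout signatures.
dmesg -T | grep -E "Completion-Wait loop timed out|NETDEV WATCHDOG|TX timeout detected"
# Search kernel messages persisted across reboots (systemd-journald):
journalctl -k | grep -E "Completion-Wait loop timed out|NETDEV WATCHDOG"
```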
[Problem Analysis]
Hardware configuration:

| Component | Details |
|---|---|
| CPU | AMD EPYC 7713 64-Core Processor |
| Memory | MTA36ASF8G72PZ-3G2E1 64GB D4-32R *8 |
| OCP NIC | Mellanox CX-4 Lx 10/25GbE SFP28 2-port OCP Ethernet Adapter |
| SSD | 2.5" 5300 480G Entry SATA SSD *2, SAMSUNG MZQL23T8HCLS-00B7C 3840.7 GB/2.5"/NVMe *8 |
| RAID adapter | 930-8i 2GB Flash PCIe 12Gb Adapter |
| UEFI version | CFE132K |
| BMC version | AMBT34O |
BMC log:
The BMC on each server recorded no hardware problems. On the day of the incident, the log contains power-state entries showing that the server was restarted:
9b | 07/12/2023 | 05:56:02 | System ACPI Power State ACPI_State | S5/G2: soft-off | Asserted
9c | 07/12/2023 | 05:56:06 | System ACPI Power State ACPI_State | S0/G0: working | Asserted
9d | 07/12/2023 | 05:56:14 | System ACPI Power State ACPI_State | S5/G2: soft-off | Asserted
9e | 07/12/2023 | 05:56:46 | System ACPI Power State ACPI_State | S0/G0: working | Asserted
9f | 07/12/2023 | 05:57:39 | System Event | Timestamp Clock Sync | Asserted
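Entries like the above can be pulled from the BMC with ipmitool. A sketch assuming in-band IPMI access (the module list later in this report shows ipmi_si/ipmi_devintf loaded); the out-of-band address and credentials are placeholders:

```bash
# List the BMC System Event Log with decoded timestamps (in-band).
ipmitool sel elist
# Out-of-band alternative; <bmc-ip>, <user> and <password> are placeholders.
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> sel elist
```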
Operating system logs:
- OS version information
root@db311:~$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
root@db311:~$ uname -r
5.10.0-18-amd64
root@db311:~$ uname -a
Linux db311.org 5.10.0-18-amd64 #1 SMP Debian 5.10.140-1 (2022-09-02) x86_64 GNU/Linux
Neither machine's OS logs report any hardware error messages.
- IOMMU address translation is enabled on both machines:
Jul 12 06:27:20 db311 kernel: [ 1.444777] iommu: Default domain type: Translated
Jul 12 06:00:26 db312 kernel: [ 1.269894] iommu: Default domain type: Translated
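Besides the boot message, the active domain type can be inspected per IOMMU group through sysfs. A sketch; the group `type` attribute is available on kernels of this vintage:

```bash
# Boot-time message showing the default domain type:
dmesg | grep -i "Default domain type"
# Per-group domain type: "DMA"/"DMA-FQ" means translated,
# "identity" means pass-through.
for g in /sys/kernel/iommu_groups/*/type; do
    echo "$g: $(cat "$g")"
done
```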
- As shown in the hardware configuration table above, both servers are fitted with 8 NVMe SSDs:
Jul 12 06:00:26 db311 kernel: [ 2.868820] nvme nvme0: pci function 0000:c1:00.0
Jul 12 06:00:26 db311 kernel: [ 2.868903] nvme nvme1: pci function 0000:c2:00.0
Jul 12 06:00:26 db311 kernel: [ 2.868974] nvme nvme2: pci function 0000:c3:00.0
Jul 12 06:00:26 db311 kernel: [ 2.869045] nvme nvme3: pci function 0000:c4:00.0
Jul 12 06:00:26 db311 kernel: [ 2.869118] nvme nvme4: pci function 0000:82:00.0
Jul 12 06:00:26 db311 kernel: [ 2.869177] nvme nvme5: pci function 0000:83:00.0
Jul 12 06:00:26 db311 kernel: [ 2.869261] nvme nvme6: pci function 0000:02:00.0
Jul 12 06:00:26 db311 kernel: [ 2.869328] nvme nvme7: pci function 0000:03:00.0
Jul 12 06:27:20 db312 kernel: [ 2.985214] nvme nvme0: pci function 0000:c1:00.0
Jul 12 06:27:20 db312 kernel: [ 2.985292] nvme nvme1: pci function 0000:c2:00.0
Jul 12 06:27:20 db312 kernel: [ 2.985357] nvme nvme2: pci function 0000:c3:00.0
Jul 12 06:27:20 db312 kernel: [ 2.985426] nvme nvme3: pci function 0000:c4:00.0
Jul 12 06:27:20 db312 kernel: [ 2.985500] nvme nvme4: pci function 0000:82:00.0
Jul 12 06:27:20 db312 kernel: [ 2.985577] nvme nvme5: pci function 0000:83:00.0
Jul 12 06:27:20 db312 kernel: [ 2.985683] nvme nvme6: pci function 0000:02:00.0
Jul 12 06:27:20 db312 kernel: [ 2.985753] nvme nvme7: pci function 0000:03:00.0
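The same inventory can be cross-checked at runtime. A sketch assuming the nvme-cli and pciutils packages are installed:

```bash
# Enumerate NVMe controllers and namespaces:
nvme list
# Cross-check the PCI functions reported in the kernel log:
lspci | grep -i "non-volatile memory"
```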
- Examining the OS logs for the window in which the hang occurred shows the OS hit a lockup. In the excerpt below, a soft lockup is detected on CPU#28 with fq_flush_timeout (the IOVA flush-queue timer) on the stack; the resulting timeouts brought the system down.
Jul 12 04:59:24 db312 kernel: [23564544.868826] AMD-Vi: Completion-Wait loop timed out
Jul 12 04:59:24 db312 kernel: [23564544.869509] watchdog: BUG: **soft lockup** - CPU#28 stuck for 22s! [swapper/28:0]
Jul 12 04:59:24 db312 kernel: [23564544.869971] Modules linked in: tcp_diag inet_diag bonding amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass ghash_clmulni_intel ipmi_ssif aesni_intel libaes crypto_simd cryptd glue_helper rapl wmi_bmof pcspkr xfs ast libcrc32c drm_vram_helper drm_ttm_helper ttm drm_kms_helper cec i2c_algo_bit sg ccp rng_core sp5100_tco watchdog k10temp evdev acpi_cpufreq button acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler psmouse usbhid drm hid fuse configfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic mlx5_ib ib_uverbs xhci_pci ses ib_core xhci_hcd enclosure sd_mod ahci scsi_transport_sas libahci crc32_pclmul crc32c_intel mlx5_core libata megaraid_sas nvme usbcore scsi_mod nvme_core t10_pi mlxfw crc_t10dif ptp crct10dif_generic i2c_piix4 usb_common pps_core crct10dif_pclmul crct10dif_common pci_hyperv_intf wmi
-------------[ cut here ]------------
Jul 12 04:59:24 db312 kernel: [23564544.870023] PKRU: 55555554
Jul 12 04:59:24 db312 kernel: [23564544.870023] Call Trace:
Jul 12 04:59:24 db312 kernel: [23564544.870025] <IRQ>
Jul 12 04:59:24 db312 kernel: [23564544.870028] **fq_flush_timeout**+0x6d/0xa0
Jul 12 04:59:24 db312 kernel: [23564544.870032] ? fq_ring_free+0x100/0x100
Jul 12 04:59:24 db312 kernel: [23564544.870034] call_timer_fn+0x29/0x100
Jul 12 04:59:24 db312 kernel: [23564544.870037] __run_timers.part.0+0x1d9/0x250
Jul 12 04:59:24 db312 kernel: [23564544.870057] </IRQ>
Jul 12 04:59:24 db312 kernel: [23564544.870061] do_softirq_own_stack+0x37/0x50
Jul 12 04:59:24 db312 kernel: [23564544.870064] irq_exit_rcu+0x92/0xc0
Jul 12 04:59:24 db312 kernel: [23564544.870066] sysvec_apic_timer_interrupt+0x36/0x80
Jul 12 04:59:24 db312 kernel: [23564544.870068] asm_sysvec_apic_timer_interrupt+0x12/0x20
Jul 12 04:59:24 db312 kernel: [23564544.870070] RIP: 0010:cpu_idle_poll.isra.0+0x35/0xd0
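For reference, the soft-lockup watchdog reports a CPU as stuck after roughly twice kernel.watchdog_thresh seconds (default 10, hence the ~20s threshold behind the "stuck for 22s!" message). The current setting can be read with sysctl:

```bash
# Soft-lockup detection threshold; the watchdog fires after about
# 2 * watchdog_thresh seconds of a CPU hogging the kernel.
sysctl kernel.watchdog_thresh
```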
- Querying the MDTS value of all the NVMe SSDs:
root@db311:~$ less nvme0.ctl.log | grep -i MDTS
mdts : 9
root@db311:~$ cat nvme* | grep -i MDTS
mdts : 9
mdts : 9
mdts : 9
mdts : 9
mdts : 9
mdts : 9
mdts : 9
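The values above come from previously collected id-ctrl logs; MDTS can also be read live from each controller with nvme-cli. A sketch assuming device names /dev/nvme0 through /dev/nvme7 as enumerated earlier:

```bash
# Read MDTS directly from each NVMe controller.
for dev in /dev/nvme[0-7]; do
    printf '%s: ' "$dev"
    nvme id-ctrl "$dev" | grep -i '^mdts'
done
```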
[Conclusion]
Analysis shows that the failure is caused by the way the IOMMU handles IO:
- The customer's 8 NVMe drives report MDTS=9. Per the NVMe specification, the maximum transfer size is 2^MDTS * MPSMIN; since MPSMIN is at least 4KB, the maximum IO size is at least 2^9 * 4KB = 2048KB, well above 128KB (a worked example follows this list).
- During IO transfers, the IOMMU allocates IOVA ranges. The allocator performs a linear search over existing allocations and holds a spinlock while doing so. Under heavy IO throughput, contention on this spinlock degrades performance significantly and can even trigger soft/hard lockups or RCU stalls.
- Linux kernel v4.7 addressed this with the patch "iommu/iova: introduce per-cpu caching to iova allocation", which optimizes allocations for IO sizes below 128KB. For IO sizes above 128KB, soft/hard lockups or RCU stalls can still occur under heavy load.
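A worked version of the size calculation in the first bullet, assuming the common case of CAP.MPSMIN = 0, i.e. a 4KB minimum page (verify on real hardware with `nvme show-regs`):

```bash
# Maximum transfer size = 2^MDTS * MPSMIN.
# Assumption: MPSMIN = 4KB (CAP.MPSMIN = 0).
mdts=9
mpsmin_bytes=4096
echo $(( (1 << mdts) * mpsmin_bytes / 1024 ))   # -> 2048 KB (2MB), > 128KB
```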
[Solution]
For this IOMMU issue, the user has the following three options:
- Add the kernel boot parameter iommu=pt to run the IOMMU in pass-through mode (a GRUB sketch follows this list)
- Backport the fix into the kernel source in use, then rebuild and install the kernel: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/nvme/host/pci.c?id=3710e2b056cb92ad816e4d79fa54a6a5b6ad8cbd
- Upgrade the kernel to v6.4 or later
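For the first option, a sketch of enabling pass-through mode on this Debian 11 system via GRUB (back up /etc/default/grub before editing; the variable name can differ across distributions):

```bash
# Append iommu=pt to the kernel command line (Debian 11 / GRUB):
sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&iommu=pt /' /etc/default/grub
update-grub
reboot
# After reboot, confirm pass-through mode is active:
cat /proc/cmdline
dmesg | grep -i "Default domain type"   # should now report "Passthrough"
```

With iommu=pt, devices are attached to an identity domain for kernel DMA, so the expensive IOVA allocation path is bypassed entirely while the IOMMU remains available for virtualization use cases.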