I have a server running CentOS 8. The kernel crashed recently, and I found the following three files in /var/crash: vmcore, vmcore-dmesg.txt, and kexec-dmesg.log.

I first looked at vmcore-dmesg.txt, which ends with the following:

[291071.552140] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[291071.552141] {2}[Hardware Error]: event severity: fatal
[291071.552141] {2}[Hardware Error]:  Error 0, type: fatal
[291071.552142] {2}[Hardware Error]:   section_type: PCIe error
[291071.552142] {2}[Hardware Error]:   port_type: 4, root port
[291071.552142] {2}[Hardware Error]:   version: 3.0
[291071.552143] {2}[Hardware Error]:   command: 0x0547, status: 0x4010
[291071.552143] {2}[Hardware Error]:   device_id: 0000:16:01.0
[291071.552143] {2}[Hardware Error]:   slot: 82
[291071.552144] {2}[Hardware Error]:   secondary_bus: 0x18
[291071.552144] {2}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x2031
[291071.552145] {2}[Hardware Error]:   class_code: 000406
[291071.552145] {2}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0013
[291071.552145] {2}[Hardware Error]:   aer_uncor_status: 0x00000020, aer_uncor_mask: 0x00100000
[291071.552146] {2}[Hardware Error]:   aer_uncor_severity: 0x00062030
[291071.552146] {2}[Hardware Error]:   TLP Header: 00000000 00000000 00000000 00000000
[291071.552146] Kernel panic - not syncing: Fatal hardware error!
[291071.552147] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 4.18.0-305.3.1.el8.x86_64 #1
[291071.552147] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./EPC621D8A, BIOS P2.10 04/03/2019
[291071.552148] Call Trace:
[291071.552148]  <NMI>
[291071.552148]  dump_stack+0x5c/0x80
[291071.552149]  panic+0xe7/0x2a9
[291071.552149]  __ghes_panic.cold.32+0x21/0x21
[291071.552149]  ghes_notify_nmi+0x273/0x310
[291071.552149]  nmi_handle+0x63/0x110
[291071.552150]  default_do_nmi+0x49/0x100
[291071.552150]  do_nmi+0x17e/0x1e0
[291071.552150]  end_repeat_nmi+0x16/0x6f
[291071.552151] RIP: 0010:intel_idle+0x6b/0xb0
[291071.552151] Code: 40 5c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 75 19 e9 07 00 00 00 0f 00 2d 1e 01 55 00 c1 ee 18 b9 01 00 00 00 89 f0 0f 01 c9 <65> 48 8b 04 25 40 5c 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[291071.552152] RSP: 0018:ffffffff8fe03e40 EFLAGS: 00000002
[291071.552152] RAX: 0000000000000020 RBX: ffffffff8ff30ba8 RCX: 0000000000000001
[291071.552153] RDX: 0000000000000000 RSI: 0000000000000020 RDI: 0000000000000003
[291071.552153] RBP: ffff9e4a20835ad8 R08: 0000000000000002 R09: 0000000000029700
[291071.552154] R10: 0002cd7f37820a74 R11: ffff9e4a20828be4 R12: ffffffff8ff30a40
[291071.552154] R13: 0000000000000003 R14: 0000000000000003 R15: 0000000000000003
[291071.552154]  ? intel_idle+0x6b/0xb0
[291071.552154]  ? intel_idle+0x6b/0xb0
[291071.552155]  </NMI>
[291071.552155]  cpuidle_enter_state+0x87/0x3c0
[291071.552155]  cpuidle_enter+0x2c/0x40
[291071.552156]  do_idle+0x234/0x260
[291071.552156]  cpu_startup_entry+0x6f/0x80
[291071.552156]  start_kernel+0x518/0x538
[291071.552157]  secondary_startup_64_no_verify+0xc2/0xcb
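
The aer_uncor_* registers in this record can be decoded bit by bit. Below is a minimal sketch in Python, assuming the bit names of the PCIe AER Uncorrectable Error Status register as I understand them from the PCIe specification. The table and the helper name decode_bits are illustrative and not taken from any tool, so the names should be checked against the spec revision in use.

```python
# Bit names for the PCIe AER Uncorrectable Error Status register.
# Assumption: taken from my reading of the PCIe spec; verify before relying on it.
UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}


def decode_bits(value, names):
    """Return the names of all set bits in a 32-bit register value."""
    return [names.get(bit, f"bit {bit}") for bit in range(32) if (value >> bit) & 1]


# Values copied from the vmcore-dmesg.txt record above.
status = 0x00000020    # aer_uncor_status
mask = 0x00100000      # aer_uncor_mask
severity = 0x00062030  # aer_uncor_severity (set bit = treated as fatal)

print("reported:          ", decode_bits(status, UNCOR_BITS))
print("masked:            ", decode_bits(mask, UNCOR_BITS))
print("fatal if raised:   ", decode_bits(severity, UNCOR_BITS))
print("reported and fatal:", decode_bits(status & severity, UNCOR_BITS))
```

If the table is right, the only reported bit is bit 5 (Surprise Down Error), and the severity register marks that bit as fatal, which would match the "event severity: fatal" line.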

Using lspci, I found that 0000:16:01.0 is "16:01.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port B (rev 02)", which seems to be a PCIe root port. The devices behind it are:

lspci -s 16:01.0 -tvv
0000:16:01.0-[18-1b]----00.0-[19-1b]----03.0-[1a-1b]--+-00.0  Intel Corporation Ethernet Connection X722 for 1GbE
                                                      +-00.1  Intel Corporation Ethernet Connection X722 for 1GbE
                                                      +-00.2  Intel Corporation Ethernet Connection X722 for 1GbE
                                                      \-00.3  Intel Corporation Ethernet Connection X722 for 1GbE

Then I looked at the kexec-dmesg.log file, which says

[Thu Jun 10 20:02:45 2021] Memory manager not clean during takedown.
[Thu Jun 10 20:02:45 2021] WARNING: CPU: 0 PID: 399 at drivers/gpu/drm/drm_mm.c:999 drm_mm_takedown+0x1f/0x30 [drm]
[Thu Jun 10 20:02:45 2021] Modules linked in: amdgpu(+) sd_mod t10_pi sg iommu_v2 gpu_sched i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crc32c_intel drm ahci libahci uas libata usb_storage dm_mirror dm_region_hash dm_log dm_mod fuse overlay squashfs loop
[Thu Jun 10 20:02:45 2021] CPU: 0 PID: 399 Comm: systemd-udevd Tainted: G        W        --------- -  - 4.18.0-305.3.1.el8.x86_64 #1
[Thu Jun 10 20:02:45 2021] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./EPC621D8A, BIOS P2.10 04/03/2019
[Thu Jun 10 20:02:45 2021] RIP: 0010:drm_mm_takedown+0x1f/0x30 [drm]
[Thu Jun 10 20:02:45 2021] Code: f6 c3 48 8d 41 c0 eb bb 0f 1f 00 0f 1f 44 00 00 48 8b 47 38 48 83 c7 38 48 39 c7 75 01 c3 48 c7 c7 58 57 1b c0 e8 da b6 f6 c0 <0f> 0b c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f 44 00 00
[Thu Jun 10 20:02:45 2021] RSP: 0018:ffffc90000747a10 EFLAGS: 00010282
[Thu Jun 10 20:02:45 2021] RAX: 0000000000000000 RBX: ffff88805d44caf0 RCX: ffffffff8265f1c8
[Thu Jun 10 20:02:45 2021] RDX: 0000000000000001 RSI: 0000000000000096 RDI: 0000000000000246
[Thu Jun 10 20:02:45 2021] RBP: ffff888050e65030 R08: 00000000000005e6 R09: 0000000000aaaaaa
[Thu Jun 10 20:02:45 2021] R10: 0000000000000000 R11: ffffc900009e0320 R12: ffff88805d44ca00
[Thu Jun 10 20:02:45 2021] R13: ffff888050e64f68 R14: 0000000000000000 R15: 0000000000000000
[Thu Jun 10 20:02:45 2021] FS:  00007f16a3901180(0000) GS:ffff88805ea00000(0000) knlGS:0000000000000000
[Thu Jun 10 20:02:45 2021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Thu Jun 10 20:02:45 2021] CR2: 0000564d0235b008 CR3: 000000005d5b6002 CR4: 00000000007706b0
[Thu Jun 10 20:02:45 2021] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[Thu Jun 10 20:02:45 2021] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[Thu Jun 10 20:02:45 2021] PKRU: 55555554
[Thu Jun 10 20:02:45 2021] Call Trace:
[Thu Jun 10 20:02:45 2021]  amdgpu_gtt_mgr_fini+0x2d/0x80 [amdgpu]
[Thu Jun 10 20:02:45 2021]  ttm_bo_clean_mm+0xa8/0xc0 [ttm]
[Thu Jun 10 20:02:45 2021]  amdgpu_ttm_fini+0x98/0xe0 [amdgpu]
[Thu Jun 10 20:02:45 2021]  amdgpu_bo_fini+0xe/0x30 [amdgpu]
[Thu Jun 10 20:02:45 2021]  gmc_v9_0_sw_fini+0x59/0xa0 [amdgpu]
[Thu Jun 10 20:02:45 2021]  amdgpu_device_fini+0x297/0x4af [amdgpu]
[Thu Jun 10 20:02:45 2021]  amdgpu_driver_unload_kms+0x3e/0x70 [amdgpu]
[Thu Jun 10 20:02:45 2021]  amdgpu_driver_load_kms+0x122/0x2a0 [amdgpu]
[Thu Jun 10 20:02:45 2021]  amdgpu_pci_probe+0xd1/0x150 [amdgpu]
[Thu Jun 10 20:02:45 2021]  local_pci_probe+0x41/0x90
[Thu Jun 10 20:02:45 2021]  pci_device_probe+0x105/0x1c0
[Thu Jun 10 20:02:45 2021]  really_probe+0x255/0x4a0
[Thu Jun 10 20:02:45 2021]  driver_probe_device+0x49/0xc0
[Thu Jun 10 20:02:45 2021]  device_driver_attach+0x50/0x60
[Thu Jun 10 20:02:45 2021]  __driver_attach+0x61/0x130
[Thu Jun 10 20:02:45 2021]  ? device_driver_attach+0x60/0x60
[Thu Jun 10 20:02:45 2021]  bus_for_each_dev+0x77/0xc0
[Thu Jun 10 20:02:45 2021]  ? klist_add_tail+0x3b/0x70
[Thu Jun 10 20:02:45 2021]  bus_add_driver+0x14d/0x1e0
[Thu Jun 10 20:02:45 2021]  ? 0xffffffffc07d3000
[Thu Jun 10 20:02:45 2021]  driver_register+0x6b/0xb0
[Thu Jun 10 20:02:45 2021]  ? 0xffffffffc07d3000
[Thu Jun 10 20:02:45 2021]  do_one_initcall+0x46/0x1c3
[Thu Jun 10 20:02:45 2021]  ? do_init_module+0x22/0x220
[Thu Jun 10 20:02:45 2021]  ? kmem_cache_alloc_trace+0x131/0x270
[Thu Jun 10 20:02:45 2021]  do_init_module+0x5a/0x220
[Thu Jun 10 20:02:45 2021]  load_module+0x14c5/0x17f0
[Thu Jun 10 20:02:45 2021]  ? __switch_to_asm+0x35/0x70
[Thu Jun 10 20:02:45 2021]  ? __switch_to_asm+0x41/0x70
[Thu Jun 10 20:02:45 2021]  ? __switch_to_asm+0x35/0x70
[Thu Jun 10 20:02:45 2021]  ? __switch_to_asm+0x41/0x70
[Thu Jun 10 20:02:45 2021]  ? apic_timer_interrupt+0xa/0x20
[Thu Jun 10 20:02:45 2021]  ? __do_sys_init_module+0x13b/0x180
[Thu Jun 10 20:02:45 2021]  __do_sys_init_module+0x13b/0x180
[Thu Jun 10 20:02:45 2021]  do_syscall_64+0x5b/0x1a0
[Thu Jun 10 20:02:45 2021]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[Thu Jun 10 20:02:45 2021] RIP: 0033:0x7f16a24df80e
[Thu Jun 10 20:02:45 2021] Code: 48 8b 0d 7d 16 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4a 16 2c 00 f7 d8 64 89 01 48
[Thu Jun 10 20:02:45 2021] RSP: 002b:00007ffc5a383dd8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[Thu Jun 10 20:02:45 2021] RAX: ffffffffffffffda RBX: 0000558aa33c7ee0 RCX: 00007f16a24df80e
[Thu Jun 10 20:02:45 2021] RDX: 0000558aa33c85e0 RSI: 00000000009621ec RDI: 0000558aa3def1a0
[Thu Jun 10 20:02:45 2021] RBP: 0000558aa33c85e0 R08: 0000558aa33c301a R09: 0000000000000003
[Thu Jun 10 20:02:45 2021] R10: 0000558aa33c3010 R11: 0000000000000246 R12: 0000558aa3def1a0
[Thu Jun 10 20:02:45 2021] R13: 0000558aa33dabf0 R14: 0000000000020000 R15: 0000000000000000
[Thu Jun 10 20:02:45 2021] ---[ end trace 0950097d77ca3e03 ]---

This seems to be related to the GPU driver.

To my understanding, when the kernel crashes, kdump uses kexec to boot a second kernel, which then dumps the memory of the crashed one. So the logs seem to say that a PCIe hardware error crashed the main kernel, and that the kdump kernel then crashed again on startup because of a GPU driver error. Am I understanding this correctly? Or is the log in kexec-dmesg.log actually the stack trace of the main kernel?

My second question is how to interpret these error messages. Since only the NIC appears to be connected to this PCIe root port, is something wrong with my motherboard or CPU, or is the problem more likely in the kernel?

As a side note, I found in /var/log that the following error occurs often, although it does not crash the kernel:

Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]: It has been corrected by h/w and requires no further action
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]: event severity: corrected
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]:  Error 0, type: corrected
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]:   section_type: PCIe error
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]:   port_type: 5, upstream switch port
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]:   version: 3.0
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]:   command: 0x0147, status: 0x0010
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]:   device_id: 0000:18:00.0
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]:   slot: 82
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]:   secondary_bus: 0x19
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x37c0
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]:   class_code: 000406
Jun  7 11:12:20 localhost kernel: {1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0013
Jun  7 11:12:20 localhost kernel: pcieport 0000:18:00.0: aer_status: 0x00003000, aer_mask: 0x00002000
Jun  7 11:12:20 localhost kernel: pcieport 0000:18:00.0:    [12] Timeout               
Jun  7 11:12:20 localhost kernel: pcieport 0000:18:00.0: aer_layer=Data Link Layer, aer_agent=Transmitter ID

where 18:00.0 is a PCI bridge ("18:00.0 PCI bridge: Intel Corporation Device 37c0 (rev 09)") and

 lspci -s 18:00.0 -tvv
0000:18:00.0-[19-1b]----03.0-[1a-1b]--+-00.0  Intel Corporation Ethernet Connection X722 for 1GbE
                                      +-00.1  Intel Corporation Ethernet Connection X722 for 1GbE
                                      +-00.2  Intel Corporation Ethernet Connection X722 for 1GbE
                                      \-00.3  Intel Corporation Ethernet Connection X722 for 1GbE
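
The aer_status and aer_mask values in the corrected-error record above can be decoded the same way as any bit field. This is a rough sketch, assuming the bit names of the PCIe AER Correctable Error Status register as I understand them from the PCIe specification. The table and the function name set_bits are hypothetical and should be checked against the spec.

```python
# Bit names for the PCIe AER Correctable Error Status register.
# Assumption: taken from my reading of the PCIe spec; verify before relying on it.
COR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def set_bits(value, names):
    """Return the names of all set bits in a 32-bit register value."""
    return [names.get(bit, f"bit {bit}") for bit in range(32) if (value >> bit) & 1]


# Values copied from the /var/log record for pcieport 0000:18:00.0.
aer_status = 0x00003000
aer_mask = 0x00002000

print("status bits:  ", set_bits(aer_status, COR_BITS))
print("masked bits:  ", set_bits(aer_mask, COR_BITS))
# Bits that are set in the status register and not masked off.
print("unmasked bits:", set_bits(aer_status & ~aer_mask, COR_BITS))
```

Under that assumption, the only unmasked bit is bit 12 (Replay Timer Timeout), which lines up with the "[12] Timeout" and "aer_layer=Data Link Layer" lines in the log.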

Any help will be greatly appreciated.
