
I have Proxmox 7.2 installed on a fairly recent desktop that I am trying to use as a server. The problem is that it keeps crashing randomly:

  • Sometimes it just hangs and I have to manually reboot the server
  • Sometimes the VMs/LXCs get random segmentation faults
  • Sometimes the Proxmox system gets random segmentation faults

I've already tried re-installing Proxmox several times, but it did not solve the issue.

The hardware I have:

  • Motherboard: Asus Z170 Pro Gaming Socket LGA1151
  • Processor: Intel Core i7-6700K
  • Memory: 32GB (2x 16GB) G.Skill RipJaws V red DDR4-2133 DIMM
  • NVMe: 500GB WD Blue SSD SN570
  • Power Supply: Rhombutech Saving Power MP-700p 700W

I bought the NVMe and memory new, but I got the processor, motherboard, power supply and case second hand from a friend of a friend. It was working fine for him.

I also updated the BIOS to the latest version recently.

I also send some logs and system metrics to the cloud (Elastic Cloud), so it should be quite easy to look back in time for investigation. These are the log files I'm collecting:

  • /var/log/secure
  • /var/log/messages
  • /var/log/syslog
  • /var/log/auth.log

Some extra information about the hardware:

lspci

00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers (rev 07)
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06)
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:17.0 SATA controller: Intel Corporation Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode] (rev 31)
00:1b.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #17 (rev f1)
00:1b.2 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #19 (rev f1)
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #1 (rev f1)
00:1d.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #9 (rev f1)
00:1f.0 ISA bridge: Intel Corporation Z170 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation 100 Series/C230 Series Chipset Family HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V (rev 31)
02:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8192CE PCIe Wireless Network Adapter (rev 01)
03:00.0 USB controller: ASMedia Technology Inc. ASM1142 USB 3.1 Host Controller
04:00.0 Non-Volatile memory controller: Sandisk Corp Device 501a

lsusb and lsusb.py (with a segfault as a bonus)

root@pve2:~# lsusb
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 002: ID 0bda:0151 Realtek Semiconductor Corp. Mass Storage Device (Multicard Reader)
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
root@pve2:~# lsusb.py 
Segmentation fault
root@pve2:~# lsusb.py 
 WARNING: Failure to read usb.ids
usb1              1d6b:0002 09 1IF  [USB 2.00,   480 Mbps,   0mA] (xhci-hcd 0000:00:14.0) hub
  1-12              0bda:0151 00 1IF  [USB 2.00,   480 Mbps, 500mA] (Generic USB2.0-CRW 20060413092100000)
usb2              1d6b:0003 09 1IF  [USB 3.00,  5000 Mbps,   0mA] (xhci-hcd 0000:00:14.0) hub
usb3              1d6b:0002 09 1IF  [USB 2.00,   480 Mbps,   0mA] (xhci-hcd 0000:03:00.0) hub
usb4              1d6b:0003 09 1IF  [USB 3.10, 10000 Mbps,   0mA] (xhci-hcd 0000:03:00.0) hub
root@pve2:~#

Some of the errors I got today:

On Proxmox (running for a couple of days):

Jun  6 10:01:17 pve2 kernel: [132395.611477] traps: qm[270669] general protection fault ip:55e6b31483b6 sp:7ffe1e4cbda0 error:0 in perl[55e6b3070000+185000]
Jun  6 10:01:24 pve2 kernel: [132402.614702] traps: qm[270689] general protection fault ip:55e83b49b390 sp:7ffc5be81eb8 error:0 in perl[55e83b496000+185000]
Jun  6 10:01:31 pve2 kernel: [132409.403267] traps: qm[270694] general protection fault ip:565233b4b3b6 sp:7fff1e751200 error:0 in perl[565233a73000+185000]
Jun  6 10:01:45 pve2 kernel: [132423.938508] traps: pvestatd[1112] general protection fault ip:559c06c86ae6 sp:7ffd62a3d4b0 error:0 in perl[559c06bae000+185000]
Jun  6 10:03:01 pve2 kernel: [132499.542258] traps: pvescheduler[270889] general protection fault ip:56452c5e761c sp:7ffccedf9f80 error:0 in perl[56452c52c000+185000]
Jun  6 10:04:06 pve2 kernel: [132565.087986] traps: pveproxy worker[197138] general protection fault ip:56276de94870 sp:7ffe779a3ea0 error:0 in perl[56276ddc6000+185000]
Jun  6 10:08:22 pve2 kernel: [132821.060717] traps: pve-firewall[1117] general protection fault ip:7fa7576fb054 sp:7ffd822915b0 error:0 in libc-2.31.so[7fa757697000+14b000]
Jun  6 10:08:44 pve2 kernel: [132842.428074] traps: pveproxy worker[197140] general protection fault ip:7fbc7d1ba53f sp:7ffe779a3900 error:0 in liblttng-ust-tracepoint.so.0.0.0[7fbc7d1b8000+7000]
Jun  6 10:11:10 pve2 kernel: [132988.398936] pveproxy worker[271699]: segfault at 56272b9ab830 ip 00007fbc876f8bce sp 00007ffe779a3df8 error 6 in libc-2.31.so[7fbc875bb000+14b000]
Jun  6 10:11:10 pve2 kernel: [132988.398945] Code: 48 89 f1 48 29 f9 83 f9 3f 76 73 48 89 d1 f3 a4 c3 80 fa 10 73 17 80 fa 08 73 27 80 fa 04 73 33 80 fa 01 77 3b 72 05 0f b6 0e <88> 0f c3 c5 fa 6f 06 c5 fa 6f 4c 16 f0 c5 fa 7f 07 c5 fa 7f 4c 17
Jun  6 10:19:13 pve2 kernel: [133471.441967] traps: pveproxy worker[271057] general protection fault ip:56276de940e9 sp:7ffe779a3e80 error:0 in perl[56276ddc6000+185000]
Jun  6 10:19:45 pve2 kernel: [133504.036239] traps: pveproxy worker[271799] general protection fault ip:56276de7dde9 sp:7ffe779a3950 error:0 in perl[56276ddc6000+185000]
Jun  6 10:19:45 pve2 kernel: [133504.103915] traps: pveproxy worker[271801] general protection fault ip:56276de89118 sp:7ffe779a3ec0 error:0 in perl[56276ddc6000+185000]
Jun  6 10:25:29 pve2 kernel: [133847.236931] traps: lsusb.py[271853] general protection fault ip:598294 sp:7ffff4f178c0 error:0 in python3.9[41f000+288000]

On Proxmox after a reboot:

I rebooted it using systemctl reboot after many commands started failing with segfaults and the VMs were no longer accessible. Here are some logs:

root@pve2:~# cat /var/log/messages  |grep -i error
Jun  4 12:33:42 pve2 kernel: [    0.231938] acpi PNP0A08:00: _OSC: platform retains control of PCIe features (AE_ERROR)
Jun  4 12:33:42 pve2 kernel: [    0.888108] RAS: Correctable Errors collector initialized.
Jun  4 12:33:42 pve2 kernel: [    2.473960] GPT: Use GNU Parted to correct GPT errors.
Jun  4 12:33:42 pve2 kernel: [    3.493375] platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
Jun  4 20:53:47 pve2 kernel: [30008.678581] traps: pvedaemon worke[64752] general protection fault ip:7f6d059cd17a sp:7ffe28952578 error:0 in libc-2.31.so[7f6d05896000+14b000]
Jun  4 21:14:45 pve2 kernel: [    0.235240] acpi PNP0A08:00: _OSC: platform retains control of PCIe features (AE_ERROR)
Jun  4 21:14:45 pve2 kernel: [    0.901385] RAS: Correctable Errors collector initialized.
Jun  4 21:14:45 pve2 kernel: [    3.551520] platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
Jun  6 10:01:17 pve2 kernel: [132395.611477] traps: qm[270669] general protection fault ip:55e6b31483b6 sp:7ffe1e4cbda0 error:0 in perl[55e6b3070000+185000]
Jun  6 10:01:24 pve2 kernel: [132402.614702] traps: qm[270689] general protection fault ip:55e83b49b390 sp:7ffc5be81eb8 error:0 in perl[55e83b496000+185000]
Jun  6 10:01:31 pve2 kernel: [132409.403267] traps: qm[270694] general protection fault ip:565233b4b3b6 sp:7fff1e751200 error:0 in perl[565233a73000+185000]
Jun  6 10:01:45 pve2 kernel: [132423.938508] traps: pvestatd[1112] general protection fault ip:559c06c86ae6 sp:7ffd62a3d4b0 error:0 in perl[559c06bae000+185000]
Jun  6 10:03:01 pve2 kernel: [132499.542258] traps: pvescheduler[270889] general protection fault ip:56452c5e761c sp:7ffccedf9f80 error:0 in perl[56452c52c000+185000]
Jun  6 10:04:06 pve2 kernel: [132565.087986] traps: pveproxy worker[197138] general protection fault ip:56276de94870 sp:7ffe779a3ea0 error:0 in perl[56276ddc6000+185000]
Jun  6 10:08:22 pve2 kernel: [132821.060717] traps: pve-firewall[1117] general protection fault ip:7fa7576fb054 sp:7ffd822915b0 error:0 in libc-2.31.so[7fa757697000+14b000]
Jun  6 10:08:44 pve2 kernel: [132842.428074] traps: pveproxy worker[197140] general protection fault ip:7fbc7d1ba53f sp:7ffe779a3900 error:0 in liblttng-ust-tracepoint.so.0.0.0[7fbc7d1b8000+7000]
Jun  6 10:11:10 pve2 kernel: [132988.398936] pveproxy worker[271699]: segfault at 56272b9ab830 ip 00007fbc876f8bce sp 00007ffe779a3df8 error 6 in libc-2.31.so[7fbc875bb000+14b000]
Jun  6 10:19:13 pve2 kernel: [133471.441967] traps: pveproxy worker[271057] general protection fault ip:56276de940e9 sp:7ffe779a3e80 error:0 in perl[56276ddc6000+185000]
Jun  6 10:19:45 pve2 kernel: [133504.036239] traps: pveproxy worker[271799] general protection fault ip:56276de7dde9 sp:7ffe779a3950 error:0 in perl[56276ddc6000+185000]
Jun  6 10:19:45 pve2 kernel: [133504.103915] traps: pveproxy worker[271801] general protection fault ip:56276de89118 sp:7ffe779a3ec0 error:0 in perl[56276ddc6000+185000]
Jun  6 10:25:29 pve2 kernel: [133847.236931] traps: lsusb.py[271853] general protection fault ip:598294 sp:7ffff4f178c0 error:0 in python3.9[41f000+288000]
Jun  6 10:31:11 pve2 kernel: [134189.777343] traps: qm[271922] general protection fault ip:55b19a2d3717 sp:7ffcb05d3de0 error:0 in perl[55b19a205000+185000]
Jun  6 10:31:53 pve2 kernel: [134231.851463] traps: qm[271926] general protection fault ip:7f141401d73c sp:7ffe7af84630 error:0 in libc-2.31.so[7f1413fb8000+14b000]
Jun  6 10:32:12 pve2 kernel: [134250.394295] qm[271935]: segfault at 55810bead000 ip 0000558105d50660 sp 00007ffed25618c0 error 4 in perl[558105c98000+185000]
Jun  6 10:32:14 pve2 kernel: [134252.866954] traps: qm[271936] general protection fault ip:563591efcde9 sp:7ffc15efcb50 error:0 in perl[563591e45000+185000]
Jun  6 10:33:46 pve2 kernel: [134344.633276] traps: pvescheduler[271956] general protection fault ip:56185ff7fc41 sp:7ffe042968b0 error:0 in perl[56185ff50000+185000]
Jun  6 10:33:46 pve2 kernel: [134344.812012] traps: vzdump[272033] general protection fault ip:55bc2b1c7843 sp:7ffea086b7a0 error:0 in perl[55bc2b107000+185000]
Jun  6 10:33:47 pve2 kernel: [134345.143812] traps: pvesh[272035] general protection fault ip:55a6110b1629 sp:7ffc8798c2b0 error:0 in perl[55a610ff6000+185000]
Jun  6 10:33:47 pve2 kernel: [134345.435018] traps: pve-ha-lrm[272040] general protection fault ip:5579d0581de9 sp:7fff34110c30 error:0 in perl[5579d04ca000+185000]
Jun  6 10:34:50 pve2 kernel: [    0.234922] acpi PNP0A08:00: _OSC: platform retains control of PCIe features (AE_ERROR)
Jun  6 10:34:50 pve2 kernel: [    0.903312] RAS: Correctable Errors collector initialized.
Jun  6 10:34:50 pve2 kernel: [    3.578699] platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
Jun  6 10:35:36 pve2 kernel: [   50.109334] traps: pveproxy worker[1169] general protection fault ip:556beb7b9f9b sp:7fff9f29c4d0 error:0 in perl[556beb7b5000+185000]
Jun  6 10:35:36 pve2 kernel: [   50.119784] traps: pveproxy worker[1448] general protection fault ip:556beb85d112 sp:7fff9f29c4a0 error:0 in perl[556beb7b5000+185000]
Jun  6 10:35:36 pve2 kernel: [   50.136214] traps: pveproxy worker[1449] general protection fault ip:556beb86cf94 sp:7fff9f29c460 error:0 in perl[556beb7b5000+185000]
Jun  6 10:35:38 pve2 kernel: [   51.573947] pveproxy worker[1167]: segfault at 0 ip 00007f0f8554abdb sp 00007fff9f29c568 error 6 in libc-2.31.so[7f0f8540d000+14b000]
Jun  6 10:35:44 pve2 kernel: [   57.694434] traps: pveproxy worker[1450] trap stack segment ip:556beb883c77 sp:7fff9f29c4f0 error:0 in perl[556beb7b5000+185000]
Jun  6 10:35:44 pve2 kernel: [   57.695836] traps: pveproxy worker[1452] general protection fault ip:556beb88d3b6 sp:7fff9f29c4f0 error:0 in perl[556beb7b5000+185000]
Jun  6 10:35:50 pve2 kernel: [   63.848312] traps: pveproxy worker[1472] general protection fault ip:556beb86f2d0 sp:7fff9f29be90 error:0 in perl[556beb7b5000+185000]
Jun  6 10:35:54 pve2 kernel: [   67.435972] traps: qm[1501] general protection fault ip:5620074ecc41 sp:7fffcdc0ab80 error:0 in perl[5620074bd000+185000]
Jun  6 10:36:11 pve2 kernel: [   84.584635] traps: pvedaemon worke[1156] general protection fault ip:557265692843 sp:7fff2f81a740 error:0 in perl[5572655d2000+185000]
Jun  6 10:36:44 pve2 kernel: [  117.406580] traps: pveproxy worker[1479] general protection fault ip:556beb86cf90 sp:7fff9f29c220 error:0 in perl[556beb7b5000+185000]
Jun  6 10:37:00 pve2 kernel: [  133.515279] traps: pvescheduler[1732] general protection fault ip:56093490b752 sp:7fff923748b0 error:0 in perl[5609348da000+185000]
Jun  6 10:38:13 pve2 kernel: [  207.027173] traps: pvestatd[1152] general protection fault ip:55b8958a23b6 sp:7fff65a70370 error:0 in perl[55b8957ca000+185000]
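
To make the log above easier to reason about, here is a minimal sketch in Python that tallies the fault lines by the binary or library they occurred in, and records the offset of the faulting instruction pointer inside the mapped region (ip minus the base address shown in brackets). The function name summarize_faults and the regular expression are illustrative assumptions; the pattern was written only against the two kernel message formats visible in this post ("traps: ..." and "... segfault at ...") and may need adjusting for other formats. Faults scattered over many unrelated binaries and offsets would be consistent with a hardware cause such as faulty memory, though that alone does not prove it.

```python
import re
from collections import Counter

# Matches the two kernel formats seen in the log above, for example:
#   kernel: [132395.611477] traps: qm[270669] general protection fault ip:... in perl[base+size]
#   kernel: [132988.398936] pveproxy worker[271699]: segfault at ... ip ... in libc-2.31.so[base+size]
FAULT_RE = re.compile(
    r"kernel: \[\s*\d+\.\d+\] (?:traps: )?(?P<proc>[^\[]+)\[(?P<pid>\d+)\]:? "
    r"(?P<kind>general protection fault|trap stack segment|segfault)"
    r"(?: at [0-9a-f]+)? ip[: ](?P<ip>[0-9a-f]+) .*? in "
    r"(?P<module>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def summarize_faults(lines):
    """Return (fault count per module, set of ip offsets per module)."""
    by_module = Counter()
    offsets = {}
    for line in lines:
        match = FAULT_RE.search(line)
        if match is None:
            continue  # not a fault line (e.g. the ACPI or RAS messages)
        module = match["module"]
        by_module[module] += 1
        # Offset of the faulting instruction inside the mapped region.
        offset = int(match["ip"], 16) - int(match["base"], 16)
        offsets.setdefault(module, set()).add(offset)
    return by_module, offsets


# Three lines copied from the log above, used as sample input.
SAMPLE = [
    "Jun  6 10:01:17 pve2 kernel: [132395.611477] traps: qm[270669] general protection fault ip:55e6b31483b6 sp:7ffe1e4cbda0 error:0 in perl[55e6b3070000+185000]",
    "Jun  6 10:08:22 pve2 kernel: [132821.060717] traps: pve-firewall[1117] general protection fault ip:7fa7576fb054 sp:7ffd822915b0 error:0 in libc-2.31.so[7fa757697000+14b000]",
    "Jun  6 10:11:10 pve2 kernel: [132988.398936] pveproxy worker[271699]: segfault at 56272b9ab830 ip 00007fbc876f8bce sp 00007ffe779a3df8 error 6 in libc-2.31.so[7fbc875bb000+14b000]",
]

if __name__ == "__main__":
    counts, seen_offsets = summarize_faults(SAMPLE)
    for module, count in counts.most_common():
        hex_offsets = sorted(hex(o) for o in seen_offsets[module])
        print(f"{module}: {count} fault(s), offsets {hex_offsets}")
```

To run it over a whole log instead of the sample, the lines of the file can be passed directly, for example summarize_faults(open("/var/log/messages")).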



Any ideas on how I can get to the root cause of this issue? Could this be a hardware issue?
Tiago Queiroz