I am running fio jobs on my NVMe SSD and then hot-unplugging it while they run. The platform supports hot-plug and the system runs CentOS 7.0. Several seconds after I pull the drive out, the system crashes and prints the following:

================

[ 1026.468414] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1

[ 1026.468422] pciehp 0000:5d:02.0:pcie04: Card present on Slot(6-1)

[ 1026.468432] pciehp 0000:5d:02.0:pcie04: slot(6-1): Link Down event

[ 1026.468451] pciehp 0000:5d:02.0:pcie04: Link Down event queued on slot(6-1): currently getting powered on

[ 1026.468457] pciehp 0000:5d:02.0:pcie04: Already enabled on slot(7-1)

[ 1026.468705] {1}[Hardware Error]: event severity: fatal

[ 1026.468744] {1}[Hardware Error]: Error 0, type: fatal

[ 1026.468782] {1}[Hardware Error]: section_type: PCIe error

[ 1026.468825] {1}[Hardware Error]: port_type: 0, PCIe end point

[ 1026.468867] {1}[Hardware Error]: version: 3.0

[ 1026.468915] {1}[Hardware Error]: command: 0x0102, status: 0x4010

[ 1026.468961] {1}[Hardware Error]: device_id: 0000:00:00.0

[ 1026.469901] {1}[Hardware Error]: slot: 0

[ 1026.469032] {1}[Hardware Error]: secondary_bus: 0x00

[ 1026.469070] {1}[Hardware Error]: vendor_id: 0x1ded, device_id: 0x3010

[ 1026.469117] {1}[Hardware Error]: class_code: 008001

[ 1026.469155] Kernel panic - not syncing: Fatal hardware error!

================
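
To make the `command: 0x0102, status: 0x4010` line in the log easier to read, here is a small sketch that decodes those two values. It assumes they are the standard PCI configuration-space Command and Status registers; the bit names are the ones I know from the PCI header layout, and the `decode` helper is just my own illustration.

```python
# Bit positions in the standard PCI Command register (config offset 0x04).
COMMAND_BITS = {
    0: "I/O Space Enable",
    1: "Memory Space Enable",
    2: "Bus Master Enable",
    6: "Parity Error Response",
    8: "SERR# Enable",
    10: "Interrupt Disable",
}

# Bit positions in the standard PCI Status register (config offset 0x06).
STATUS_BITS = {
    3: "Interrupt Status",
    4: "Capabilities List",
    8: "Master Data Parity Error",
    11: "Signaled Target Abort",
    12: "Received Target Abort",
    13: "Received Master Abort",
    14: "Signaled System Error",
    15: "Detected Parity Error",
}


def decode(value, names):
    """Return the names of all known bits that are set in value (hypothetical helper)."""
    return [name for bit, name in sorted(names.items()) if value & (1 << bit)]


print(decode(0x0102, COMMAND_BITS))  # ['Memory Space Enable', 'SERR# Enable']
print(decode(0x4010, STATUS_BITS))   # ['Capabilities List', 'Signaled System Error']
```

If that reading is right, the device had SERR# reporting enabled and has Signaled System Error set, which would at least be consistent with the fatal error being escalated to the platform.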

One possible root cause of the crash is that the contradictory pair of events, "Card present" and "Link Down", has confused the hot-plug logic. What puzzles me is that pciehp reports both "Card present" and "Link Down" at the same time. In my experience, "Card present" usually comes with "Link Up", and "Link Down" normally goes with "Card not present".

Could anybody give me some clues about how this strange situation can happen? Or which bits in the PCIe registers trigger the "Card present" and "Link Down" events?
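
For reference, these are the bits I believe are relevant, written as a small sketch. The masks follow the PCI Express Capability layout (Slot Status and Link Status registers) with the names Linux uses in `pci_regs.h`. How pciehp in my kernel maps them to its messages is my assumption, and `decode_hotplug_state` and the sample values are hypothetical, not read from my system.

```python
# Slot Status register bits (PCI Express Capability, offset 0x1A).
PCI_EXP_SLTSTA_PDC = 0x0008    # Presence Detect Changed
PCI_EXP_SLTSTA_PDS = 0x0040    # Presence Detect State (1 = card present)
PCI_EXP_SLTSTA_DLLSC = 0x0100  # Data Link Layer State Changed

# Link Status register bit (PCI Express Capability, offset 0x12).
PCI_EXP_LNKSTA_DLLLA = 0x2000  # Data Link Layer Link Active


def decode_hotplug_state(slot_status, link_status):
    """Summarize presence and link state from raw register values (hypothetical helper)."""
    return {
        "presence_changed": bool(slot_status & PCI_EXP_SLTSTA_PDC),
        "card_present": bool(slot_status & PCI_EXP_SLTSTA_PDS),
        "link_state_changed": bool(slot_status & PCI_EXP_SLTSTA_DLLSC),
        "link_active": bool(link_status & PCI_EXP_LNKSTA_DLLLA),
    }


# Hypothetical values reproducing the confusing combination:
# both change bits set, presence still reported, but the link is down.
state = decode_hotplug_state(slot_status=0x0148, link_status=0x0000)
print(state)
```

With those made-up values the card is still reported as present while the link is inactive, which is the same combination pciehp printed. On a live system the registers of the downstream port could presumably be read with something like `setpci -s 5d:02.0 CAP_EXP+1a.w` (Slot Status) and `CAP_EXP+12.w` (Link Status).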
