
I have an ESXi host that has crashed several times due to hardware issues. Each time, I see this in the logs:

A bus fatal error was detected on a component at bus 64 device 2 function 0.
A bus fatal error was detected on a component at slot 4.

On the console I see: (screenshot not preserved)

64 in decimal is 40 in hex. If I do:

[root@localhost:~] lspci | grep 0000:40:02.0
0000:40:02.0 Bridge: Intel Corporation Xeon E7 v2/Xeon E5 v2/Core i7 PCI Express Root Port 2a [PCIe RP[0000:40:02.0]]
[root@localhost:~] 
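
For reference, the decimal-to-hex step above can be sketched with `printf`. The helper name `pci_addr` is hypothetical, and segment 0 is assumed here because the error message does not state it (the `esxcfg-info` output for this device does report segment 0x0000):

```shell
# Hypothetical helper: format decimal segment/bus/device/function numbers
# as the hexadecimal address that lspci prints.
pci_addr() {
    printf '%04x:%02x:%02x.%x\n' "$1" "$2" "$3" "$4"
}

# "bus 64 device 2 function 0" from the log message
pci_addr 0 64 2 0
```

The resulting string can then be passed to `lspci | grep` as shown above.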

When doing:

esxcfg-info

and looking for SLOT 4 I get:

        \==+PCI Device : 
           |----Segment.........................................0x0000 
           |----Bus.............................................0x40 
           |----Slot............................................0x02 
           |----Function........................................0x00 
           |----Runtime Owner...................................vmkernel
           |----Has Configured Owner............................false
           |----Configured Owner................................
           |----Vendor Id.......................................0x8086 
           |----Device Id.......................................0x0e04 
           |----Sub-Vendor Id...................................0x0000 
           |----Sub-Device Id...................................0x0000 
           |----Vendor Name.....................................Intel Corporation
           |----Device Name.....................................Xeon E7 v2/Xeon E5 v2/Core i7 PCI Express Root Port 2a
           |----Device Class....................................1540 
           |----Device Class Name...............................PCI bridge
           |----PIC Line........................................15 
           |----Old IRQ.........................................255 
           |----Vector..........................................0 
           |----PCI Pin.........................................0 
           |----Spawned Bus.....................................66 
           |----Flags...........................................12803 
           \==+BAR Info : 
              \==+BAR0 : 
                 |----Type......................................0 
                 |----Address...................................0 
                 |----Size......................................0 
                 |----Flags.....................................0 
              \==+BAR1 : 
                 |----Type......................................0 
                 |----Address...................................0 
                 |----Size......................................0 
                 |----Flags.....................................0 
           |----Module Id.......................................0 
           |----Chassis.........................................0 
           |----Physical Slot...................................4294967295 
           |----VmKernel Device Name............................PCIe RP[0000:40:02.0]
           |----Slot Description................................SLOT 4
           |----Passthru Capable................................false
           |----Parent Device...................................
           |----Dependent Device................................
           |----Reset Method....................................5
           |----FPT Shareable...................................true

Does this mean that the CPU is failing?

Dovid Bender
  • Have you tried to run any [diagnostics](https://www.dell.com/support/article/us/en/19/sln283546/how-to-run-hardware-diagnostics-on-your-poweredge-server?lang=en) on your server? – Appleoddity Nov 26 '17 at 20:43
  • @Appleoddity No but I will now. – Dovid Bender Nov 26 '17 at 20:59
  • Do you have anything populated in the server's PCIe slots? – ewwhite Nov 26 '17 at 21:06
  • 1
    Did you fully update the VMware software or the BIOS late last week? If it has an Intel processor they released microcode for it and then pulled it back because it is causing problems like this. If you downloaded an update before the pulled it that could explain it. – Todd Wilcox Jan 24 '18 at 18:00

1 Answer

Doesn't the iDRAC show any hardware issues? You could also run the full diagnostics from the boot screen.

If I remember correctly:

Press F10 at startup. In the left pane of Lifecycle Controller, click Hardware Diagnostics. In the right pane, click Run Hardware Diagnostics. The diagnostics utility is launched.

G3ph4z