1

We have a Supermicro SuperServer 2029U-TN24R4T that currently holds 8 U.2 NVMe drives (Samsung PM1725a 1.6 TB), running CentOS 7 with kernel 5.0.10-1.el7.elrepo.x86_64. After adding a new drive (PM1725b 1.6 TB), it shows up in /dev for a few seconds (but only as nvme8, not nvme8n1 as one would expect) and then disappears. This is reproducible with different SSD trays of the chassis, and even with a drive of exactly the same model as the ones we currently use (the drive we first tried is a newer model). Adding the drive produces the following in the kernel log:

Jul 5 21:54:25 nvme02 kernel: pciehp 10002:02:05.0:pcie204: Slot(37): Card present
Jul 5 21:54:27 nvme02 kernel: pcieport 10002:02:05.0: Data Link Layer Link Active not set in 1000 msec
Jul 5 21:54:27 nvme02 kernel: pciehp 10002:02:05.0:pcie204: Failed to check link status
Jul 5 21:54:31 nvme02 kernel: pciehp 10002:02:08.0:pcie204: Slot(136): Card present
Jul 5 21:54:31 nvme02 kernel: pciehp 10002:02:08.0:pcie204: Slot(136): Link Up
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 13: no space for [io size 0x1000]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 13: failed to assign [io size 0x1000]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 13: no space for [io size 0x1000]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 13: failed to assign [io size 0x1000]
Jul 5 21:54:31 nvme02 kernel: pci 10002:07:00.0: BAR 6: assigned [mem 0xc2400000-0xc240ffff pref]
Jul 5 21:54:31 nvme02 kernel: pci 10002:07:00.0: BAR 0: assigned [mem 0xc2410000-0xc2413fff 64bit]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: PCI bridge to [bus 07]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: bridge window [mem 0xc2400000-0xc24fffff]
Jul 5 21:54:31 nvme02 kernel: nvme nvme8: pci function 10002:07:00.0
Jul 5 21:54:31 nvme02 kernel: nvme 10002:07:00.0: enabling device (0000 -> 0002)
Jul 5 21:54:31 nvme02 kernel: pciehp 10002:02:08.0:pcie204: Slot(136): Attention button pressed
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:00:00.0: can't derive routing for PCI INT A
Jul 5 21:54:31 nvme02 kernel: pciehp 10002:02:08.0:pcie204: Slot(136): Powering off due to button press
Jul 5 21:54:31 nvme02 kernel: nvme 10002:07:00.0: PCI INT A: not connected
Jul 5 21:54:31 nvme02 libvirtd: 2019-07-05 19:54:31.593+0000: 15899: error : virPCIDeviceNew:1774 : internal error: dev->name buffer overflow: 10002:07:00.0
Jul 5 21:54:34 nvme02 ipmievd: Unknown sensor ff
Jul 5 21:54:40 nvme02 kernel: nvme nvme8: failed to mark controller CONNECTING
Jul 5 21:54:40 nvme02 kernel: nvme nvme8: Removing after probe failure status: 0
Jul 5 21:54:44 nvme02 ipmievd: Unknown sensor ff

The BIOS is only one version behind, and the changelog doesn't mention this issue. The IPMI lists the new drive without any problems, and the locate function also works properly. I assume a reboot could help; however, the disks must be (and in general are) hot-swappable, although we haven't tested that yet because we haven't had any disk failures. Given the behaviour described above, we don't want to pull a production disk just for testing.

Any ideas would be greatly appreciated.

2 Answers

0

Given these messages:

Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 13: no space for [io size 0x1000]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 13: failed to assign [io size 0x1000]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 13: no space for [io size 0x1000]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 13: failed to assign [io size 0x1000]

Try adding pci=realloc to your kernel command line.
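
A minimal sketch of how this could be applied and verified, assuming a stock CentOS 7 install where grubby manages the boot entries (adjust if your bootloader is configured differently). The helper name has_kernel_arg is made up for illustration. Note that the parameter only takes effect at the next boot, so a reboot is needed.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the kernel command line given as $1
# contains the exact, space-separated parameter given as $2.
has_kernel_arg() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# To add the parameter persistently to all installed kernels
# (assumption: grubby is in use; run as root, then reboot):
#   grubby --update-kernel=ALL --args="pci=realloc"

# After the reboot, confirm that the running kernel was booted with it.
if [ -r /proc/cmdline ]; then
    if has_kernel_arg "$(cat /proc/cmdline)" "pci=realloc"; then
        echo "pci=realloc is active"
    else
        echo "pci=realloc is not on the running kernel's command line"
    fi
fi
```

If the parameter is active and the "no space for" / "failed to assign" messages still appear when the drive is inserted, that would suggest the bridge windows are not the only problem.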

0

This seems like it would be worth a call to the manufacturer if you suspect hardware.

Can you try this with a more stable kernel revision, or are you bound to that particular OS and kernel combination?

ewwhite