
We have a SuperMicro GPU server with:

  • 2x Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
  • 512GB memory
  • more than enough disk space
  • X10DRG-O+-CPU (BIOS Version : 2.0a [current])
  • X9DRG-O-PCIE PCI-E expander card
  • 8x GTX 1080

It is set up with Ubuntu 16.04.1 LTS, NVIDIA driver 367.57 and CUDA 8.0. When it runs, it runs fine for a while. With the stock kernel (v4.4), however, it is completely useless: the system freezes almost immediately when doing anything non-trivial on any GPU. We therefore suspected a hardware issue, but cooling is fine, and a second, almost identical machine (only the maker of the GPUs differs) shows exactly the same behaviour.

To make it run fine for some time, you have to downgrade the kernel to v3.14.1-trusty (we tested almost every version before that one). Even then there are still random freezes, usually with nothing in the logs. Sometimes the whole machine freezes; other times only the GPU-related processes do.

Other people seem to be having this issue as well [1] [2], but no solution has been posted there.

Is anyone having the same experience with this type of machine?

Update: The machines seem to run stable (regardless of any software) if the cards are inserted only on one side of the PCI-E expander, which means all cards are driven by the same CPU. Another machine however seems to run stable with 8 cards (uptime of about 4 months right now) with Kernel 3.19 after months of having the problems described above. Bizarre.
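To check which CPU each card actually hangs off, one option is to read the NUMA node that Linux reports for every NVIDIA display device in sysfs. The following is only a rough sketch: the function name `gpus_by_numa_node` is made up for illustration, and it assumes a Linux system that exposes the usual `vendor`, `class` and `numa_node` attributes under `/sys/bus/pci/devices`. A reported node of -1 means the kernel has no NUMA information for that device.

```python
from collections import defaultdict
from pathlib import Path


def gpus_by_numa_node(sysfs_root="/sys/bus/pci/devices"):
    """Map NUMA node -> sorted PCI addresses of NVIDIA display devices.

    Hypothetical helper for illustration; returns {} if sysfs_root is missing.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}
    groups = defaultdict(list)
    for dev in root.iterdir():
        try:
            vendor = (dev / "vendor").read_text().strip()
            pci_class = (dev / "class").read_text().strip()
            node = int((dev / "numa_node").read_text().strip())
        except (OSError, ValueError):
            continue  # attribute missing or unreadable; skip this device
        # 0x10de is NVIDIA's PCI vendor ID; class 0x03xxxx is a display controller.
        if vendor == "0x10de" and pci_class.startswith("0x03"):
            groups[node].append(dev.name)
    return {node: sorted(addrs) for node, addrs in groups.items()}


if __name__ == "__main__":
    for node, addrs in sorted(gpus_by_numa_node().items()):
        print(f"NUMA node {node}: {len(addrs)} GPU(s): {', '.join(addrs)}")
```

If all cards show up under a single node, they are all attached to the same CPU, which is the configuration that appeared stable here. `nvidia-smi topo -m` gives a similar picture from the driver's side.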

[1] https://devtalk.nvidia.com/default/topic/958927/gpu-job-fail-/

[2] https://devtalk.nvidia.com/default/topic/959699/linux/nvidia-smi-periodically-crashes-system-on-ubuntu-16-04-lts/

pks
  • Does your PSU provide enough power? – Gerald Schneider Feb 08 '17 at 15:49
  • It has 4 1600W (2+2 redundancy) power supplies, so yeah I guess they should. See here https://www.supermicro.com/products/system/4U/4028/SYS-4028GR-TR.cfm – pks Feb 08 '17 at 18:42
  • It's not clear if you're testing or production. In your place, I'd try with four cards, two per cpu. I'd try to swap failover psu's with online ones. I'd try to monitor power consumption and system/CPU/GPU temperatures. I'd come back to the community with more details then. – Marco May 08 '17 at 10:39
  • We have the same problems with two machines, fresh Ubuntu 16.04 install, kernel 4.4.0-75. Configuration: SuperMicro GPU server with 2x Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 128GB memory, board X10DRG-O+-CPU (BIOS version 2.0b) and 8x NVIDIA GTX 1080. Driver version 367.44 seems to be a lot more stable than any newer or beta version, but it is still far from perfect. We also see random freezes. – emjotde May 08 '17 at 10:20
  • We're also facing the same problem with five different machines that have a couple similar configurations: Supermicro X10DRG-O+-CPU bios 2.0a, 2x E5-2650 v4 @ 2.20GHz, kernel 4.4.0-91, with 8x Nvidia GTX 1080, on the 384.66 driver. Seems we are not alone: I am interested if anybody has found a solution to this problem. – David Bau Sep 04 '17 at 02:39
  • See @tinkerthinker 's solution below, which appears to have worked for me. – David Steinhauer Apr 11 '18 at 14:52

2 Answers

1

I had the exact same issue on the same computer. To fix this, you will need to disable the on-board VGA by changing jumper JPG1 on the motherboard. Unfortunately, you'll need to remove the daughterboard to do so. Note that, to re-install the daughterboard, you may need to apply quite a bit of pressure for it to connect properly with the motherboard again.

  • I was fighting this same issue on an identical system for more than a year. We tried this solution, and so far it appears to have resolved our issue! Thanks for posting this. You have truly saved us from a lot of trouble! – David Steinhauer Apr 11 '18 at 14:47
  • With CentOS 7.3, the way we were able to force the hangs (for troubleshooting) was by running a program which repeatedly queried the GPU temperatures, using NVML. This generally hung the server within a couple of hours. After the jumper change, the system has been operating for about 20 days with no hangs. – David Steinhauer Apr 11 '18 at 14:50
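A reproduction loop along the lines of the comment above could look roughly like this. It is a sketch, not the program that was actually used: instead of calling NVML directly it shells out to `nvidia-smi` (which reads the same values through NVML), and the names `parse_temperatures` and `poll_temperatures`, the one-second interval and the 30-second timeout are all assumptions made for illustration.

```python
import subprocess
import time

# Ask nvidia-smi for "index, temperature" per GPU as bare CSV lines.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temperatures(csv_text):
    """Parse 'index, temperature' lines into {gpu_index: degrees C or None}."""
    temps = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        index, temp = (field.strip() for field in line.split(",", 1))
        try:
            temps[int(index)] = int(temp)
        except ValueError:
            temps[int(index)] = None  # e.g. "[N/A]" when no reading is available
    return temps


def poll_temperatures(interval_s=1.0, timeout_s=30.0):
    """Query all GPUs repeatedly; stop at the first failed or timed-out query."""
    while True:
        started = time.time()
        try:
            result = subprocess.run(
                QUERY, capture_output=True, text=True,
                timeout=timeout_s, check=True,
            )
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError,
                FileNotFoundError) as exc:
            print(f"{time.ctime()}: query failed: {exc!r}")
            break
        elapsed = time.time() - started
        print(f"{time.ctime()}: {parse_temperatures(result.stdout)} ({elapsed:.2f}s)")
        time.sleep(interval_s)
```

Calling `poll_temperatures()` and leaving it running, with the output redirected to a file, gives a timestamp for the last successful query. If the driver hangs hard, the call may never return at all, in which case the last timestamp in the log is the useful piece of information.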
0

There is a known issue with the PCI bus (power management) that seems to have been resolved by SuperMicro. We have just received a flashable BIOS and firmware update from them and are testing it. I don't think I can share the update (I am unsure about the licensing), so I would advise you to contact SuperMicro.

adev