
I have a number of servers with NVIDIA GRID K2 (Tesla) cards in them.

Initially these were working fine, but I recently upgraded the kernel driver and have found that CUDA-based apps no longer detect any GPUs as present.

On closer inspection, the details in /proc/driver/nvidia/gpus/*/information no longer show a valid GPU UUID or Video BIOS version. Instead I'm getting the following, while on a working node I get normal details (no ?'s):

Bus Location:    0000:89:00.0
Model:           GRID K2
IRQ:             46
GPU UUID:        GPU-????????-????-????-????-????????????
Video BIOS:      ??.??.??.??.??
Bus Type:        PCIe
DMA Size:        37 bits
DMA Mask:        0x1fffffffff
Bus Location:    0000:8a:00.0

I have tried cold rebooting the machines into the previous known-working configuration (these servers are netbooted), but the problem persists even with the old drivers.

What could be going wrong here? Are the cards toast?
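In case it helps anyone scripting a check across nodes: a simple way to flag affected machines is to test whether the information files contain the all-'?' UUID pattern. This is just a sketch; for illustration it checks a sample string copied from the broken output above rather than reading the real /proc/driver/nvidia/gpus/*/information files.

```shell
#!/bin/sh
# Sketch: detect the masked (all-'?') GPU UUID seen on the broken nodes.
# In a real script you would loop over /proc/driver/nvidia/gpus/*/information;
# here we use a sample line from the question instead.
sample='GPU UUID:        GPU-????????-????-????-????-????????????'

case "$sample" in
  *'GPU-????????'*) echo "invalid-uuid" ;;  # masked UUID -> card not initialised properly
  *)                echo "uuid-ok"      ;;  # normal-looking UUID
esac
```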

hookenz
  • Hardware problem would be my first suspicion. Time to get your hands dirty and start shuffling GPUs around. – Michael Hampton Jun 15 '15 at 03:38
  • I don't think that is the case. There are 3 machines doing the same thing and it's happening across all cards. I've tried a cold boot but that hasn't helped. It's like the updated driver has done something to them that's stopped them working correctly. – hookenz Jun 15 '15 at 03:43
  • Pulling and shuffling cards is a bit of a problem, I'm approx 6640 miles away from them. – hookenz Jun 15 '15 at 03:47
  • You should probably be chatting with NVIDIA, then, rather than us. And grabbing your passport... – Michael Hampton Jun 15 '15 at 03:55
  • Yeah I'm going to do that... thanks anyway I thought it might be simple :/ – hookenz Jun 15 '15 at 03:56

0 Answers