
I am running into problems with NVLink'd RTX video cards, and I wonder if someone more experienced with this tech could kindly look at the output below and tell me whether there is a problem.

The setup: a pair of MSI RTX 2080 Ti cards with an ASUS RTX NVLink bridge in a Ryzen/X370 system, running Ubuntu 18.04 Linux; I have tried several versions of Nvidia's driver.

nvidia-smi invocations run very slowly, and Caffe and the CUDA example programs misbehave.

Programs like Caffe misbehave badly when run on both GPUs (i.e. using caffe --gpu 0,1). Setup and scaffolding can take 20 minutes to complete (for GoogLeNet, which on a single GPU would stand up in a few seconds), and then training either proceeds in the expected manner or freezes after a few iterations.

I am seeing the following output, which seems wrong to me. Am I mistaken, or is something actually broken here? Any help much appreciated!

I am running nvidia-persistenced as a daemon under my user account ID.

Details...

$ nvidia-smi -L    # Takes over a minute to finish running.
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-dd1093e0-466f-7322-e214-351b015045d9)
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-2a386612-018c-e3fe-3fd4-1dde588af45d)
$ nvidia-smi nvlink --status    # Takes over a minute to finish running.
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-dd1093e0-466f-7322-e214-351b015045d9)
         Link 0: 25.781 GB/s
         Link 1: <inactive>
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-2a386612-018c-e3fe-3fd4-1dde588af45d)
         Link 0: 25.781 GB/s
         Link 1: <inactive>

Shouldn't both links (0 and 1) be active?
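
In case it is useful, the same per-link state can also be read programmatically through NVML (the library behind nvidia-smi). A minimal, untested sketch, assuming the NVML development header is installed and linking against -lnvidia-ml (file name is my own):

// nvlink_state.cpp -- minimal sketch: print the NVML state of each NVLink link.
// Build (assumption): g++ nvlink_state.cpp -o nvlink_state -lnvidia-ml
#include <cstdio>
#include <nvml.h>

int main() {
    unsigned int count = 0;
    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(i, &dev);
        for (unsigned int link = 0; link < NVML_NVLINK_MAX_LINKS; ++link) {
            nvmlEnableState_t state;
            // Link indices the GPU does not expose return an error; skip those.
            if (nvmlDeviceGetNvLinkState(dev, link, &state) != NVML_SUCCESS)
                continue;
            std::printf("GPU %u link %u: %s\n", i, link,
                        state == NVML_FEATURE_ENABLED ? "active" : "inactive");
        }
    }
    nvmlShutdown();
    return 0;
}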

$ nvidia-smi nvlink -c     # Takes several minutes to finish running.
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-dd1093e0-466f-7322-e214-351b015045d9)
         Link 0, P2P is supported: true
         Link 0, Access to system memory supported: true
         Link 0, P2P atomics supported: true
         Link 0, System memory atomics supported: true
         Link 0, SLI is supported: true
         Link 0, Link is supported: false
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-2a386612-018c-e3fe-3fd4-1dde588af45d)
         Link 0, P2P is supported: true
         Link 0, Access to system memory supported: true
         Link 0, P2P atomics supported: true
         Link 0, System memory atomics supported: true
         Link 0, SLI is supported: true
         Link 0, Link is supported: false

Shouldn't there be both Link 0 and Link 1 here?
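
The "Link is supported: false" line looks like it maps to NVML's per-link capability query (NVML_NVLINK_CAP_VALID). A companion sketch to the one above, same assumptions about headers and linking, that dumps the P2P and link-supported flags:

// nvlink_caps.cpp -- minimal sketch: per-link NVLink capability flags via NVML.
// Build (assumption): g++ nvlink_caps.cpp -o nvlink_caps -lnvidia-ml
#include <cstdio>
#include <nvml.h>

int main() {
    unsigned int count = 0;
    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(i, &dev);
        for (unsigned int link = 0; link < NVML_NVLINK_MAX_LINKS; ++link) {
            unsigned int p2p = 0, valid = 0;
            // Skip link indices this GPU does not expose at all.
            if (nvmlDeviceGetNvLinkCapability(dev, link, NVML_NVLINK_CAP_P2P_SUPPORTED,
                                              &p2p) != NVML_SUCCESS)
                continue;
            nvmlDeviceGetNvLinkCapability(dev, link, NVML_NVLINK_CAP_VALID, &valid);
            std::printf("GPU %u link %u: P2P supported=%u, link supported=%u\n",
                        i, link, p2p, valid);
        }
    }
    nvmlShutdown();
    return 0;
}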

$ nvidia-smi topo --matrix    # Takes over a minute to finish running.

        GPU0    GPU1    CPU Affinity
GPU0     X      NV1     0-11
GPU1    NV1      X      0-11

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

Shouldn't we see NV2 here (i.e. the 2080 Ti bridge carries a pair of NVLink links)?
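
As a cross-check from the CUDA side (this will not show the link count, but it does show whether the runtime sees a P2P path and native atomics between the two boards), a minimal, untested sketch, names mine:

// p2p_attr.cu -- minimal sketch: ask the CUDA runtime about the GPU0<->GPU1 P2P path.
// Build (assumption): nvcc p2p_attr.cu -o p2p_attr
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int access = 0, rank = 0, atomics = 0;
    cudaDeviceGetP2PAttribute(&access,  cudaDevP2PAttrAccessSupported,       0, 1);
    cudaDeviceGetP2PAttribute(&rank,    cudaDevP2PAttrPerformanceRank,       0, 1);  // relative rank of the link
    cudaDeviceGetP2PAttribute(&atomics, cudaDevP2PAttrNativeAtomicSupported, 0, 1);
    std::printf("P2P 0->1: access=%d, performance_rank=%d, native_atomics=%d\n",
                access, rank, atomics);
    return 0;
}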

$ simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
> GPU0 = "GeForce RTX 2080 Ti" IS  capable of Peer-to-Peer (P2P)
> GPU1 = "GeForce RTX 2080 Ti" IS  capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer access from GeForce RTX 2080 Ti (GPU0) -> GeForce RTX 2080 Ti (GPU1) : Yes
> Peer access from GeForce RTX 2080 Ti (GPU1) -> GeForce RTX 2080 Ti (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Checking GPU0 and GPU1 for UVA capabilities...
> GeForce RTX 2080 Ti (GPU0) supports UVA: Yes
> GeForce RTX 2080 Ti (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 22.52GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed

AFAIK cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1 should manage 44 GB/s or so, not 22 GB/s?
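
For context, the 22.52 GB/s figure is simpleP2P timing a 64 MB cudaMemcpyPeer; a stripped-down, untested sketch of the same kind of measurement (file and variable names are mine, not from the CUDA samples) would be roughly:

// peer_bw.cu -- rough sketch: time unidirectional cudaMemcpyPeerAsync from GPU0 to GPU1.
// Build (assumption): nvcc peer_bw.cu -o peer_bw
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64ull << 20;   // 64 MB, like simpleP2P
    const int reps = 100;
    void *src = nullptr, *dst = nullptr;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);   // let GPU0 reach GPU1 directly
    cudaMalloc(&src, bytes);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&dst, bytes);

    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyPeerAsync(dst, 1, src, 0, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // The status output above shows 25.781 GB/s per link per direction, so one active
    // link tops out in the low 20s GB/s and two bonded links should roughly double that.
    std::printf("GPU0 -> GPU1: %.2f GB/s\n", (double)bytes * reps / (ms * 1e6));
    return 0;
}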

$ p2pBandwidthLatencyTest

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce RTX 2080 Ti, pciBusID: a, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce RTX 2080 Ti, pciBusID: b, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0       1     1
     1       1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 529.48   3.20
     1   3.19 532.01
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1
     0 531.71  24.23
     1  24.23 530.74
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 533.58   6.30
     1   6.31 526.98
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 525.50  48.37
     1  48.41 523.52
P2P=Disabled Latency Matrix (us)
   GPU     0      1
     0   1.26  12.74
     1  15.19   1.44

   CPU     0      1
     0   3.92   9.03
     1   8.86   3.82
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1
     0   1.25   0.92
     1   0.96   1.44

   CPU     0      1
     0   4.22   2.78
     1   2.78   3.76

But here, shouldn't the "Bidirectional P2P=Enabled Bandwidth Matrix" show roughly 96 GB/s, not 48.41 GB/s?
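
On the arithmetic: if the 25.781 GB/s per link per direction reported above is right, two bonded links should give roughly 50 GB/s each way and about 100 GB/s bidirectional, so the ~24 / ~48 GB/s numbers look like exactly one active link. A rough, untested sketch of measuring the bidirectional case (one stream per direction, wall-clocked; names are mine):

// bidir_bw.cu -- rough sketch: overlap peer copies in both directions and wall-clock them.
// Build (assumption): nvcc bidir_bw.cu -o bidir_bw
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64ull << 20;
    const int reps = 100;
    void *a0, *b0, *a1, *b1;
    cudaStream_t s0, s1;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaMalloc(&a0, bytes); cudaMalloc(&b0, bytes);
    cudaStreamCreate(&s0);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&a1, bytes); cudaMalloc(&b1, bytes);
    cudaStreamCreate(&s1);

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        cudaSetDevice(0);
        cudaMemcpyPeerAsync(a1, 1, a0, 0, bytes, s0);   // GPU0 -> GPU1
        cudaSetDevice(1);
        cudaMemcpyPeerAsync(b0, 0, b1, 1, bytes, s1);   // GPU1 -> GPU0
    }
    cudaSetDevice(0); cudaStreamSynchronize(s0);
    cudaSetDevice(1); cudaStreamSynchronize(s1);
    double sec = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();

    // One copy in each direction per iteration, so 2 * bytes * reps moved in total.
    std::printf("bidirectional: %.2f GB/s\n", 2.0 * bytes * reps / sec / 1e9);
    return 0;
}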

Eric M
1 Answer


I had a very similar experience, and the root cause was that the NVLink bridge was not seated correctly. You may want to double-check whether nvidia-smi behaves as expected when you remove the NVLink bridge. The bridge has connectors on both sides, and both must make proper contact with the GPUs; otherwise nvidia-smi nvlink --status shows links as "inactive", and that can also produce the slow or hanging responses.

BPARK
  • Hi, thanks for this. I found the same thing. The NVLink bridge looked like it was inserted all the way, but it was not. – Eric M Apr 25 '21 at 21:56