PCI-E bottleneck when transferring data between CPU and GPU

I've read that the transfer overhead between CPU and GPU is a big bottleneck in achieving high performance in GPU/CPU applications. Why is this so?

According to Nvidia's bandwidthTest program, my CPU/GPU bandwidth is about 4 to 5 GBps. Is this the peak figure, with actual performance likely to be much lower? My application can only reach ~17 Gbps when data transfer is included in the performance measurement, a large drop from the 100+ Gbps rate when measuring only the GPU computation without data transfer.

Rayne

Posted 2011-04-28T04:12:22.200

Reputation: 479

Answers

I've read that the transfer overhead between CPU and GPU is a big bottleneck in achieving high performance in GPU/CPU applications. Why is this so?

There are two senses to your question:

Why is this (the transfer) the bottleneck?
What are the physical reasons for it?

In the first sense, it's because everything else on your machine moves soooo much faster:

While this chart is for the CPU and memory, similar charts hold true for the GPU. The upshot of this is that if you want to get good performance, you need to make as much use out of each memory load as possible.

You can plot this with the roofline model:

The x-axis shows how many times each byte is used when loaded from memory. The y-axis is performance in operations per second. The diagonal lines on the left-hand side are regions where the speed of memory limits computation.

In that region you can achieve greater performance by using faster memory (like the GPU's local memory or the CPU's L3, L2, or L1 caches), so that a higher diagonal line limits you, or by increasing arithmetic intensity, so you move to the right. The flat lines on the top are limits on raw computation speed once memory is loaded. There can be a line for straight floating-point operations, SSE2, AVX, &c. In this case, the diagram shows that some deep-learning kernels reuse their data a lot and can make full use of all the special math operators in a GPU, so the only way to make them faster is to build a new device: the TPU.
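A minimal sketch of the roofline bound described above, assuming made-up device numbers (the peak rate and the two bandwidths below are illustrative, not measurements of any real GPU). Attainable performance is the lower of the flat compute ceiling and the memory diagonal, which is bandwidth times the number of operations done per byte loaded.

```python
def attainable_gflops(ops_per_byte, peak_gflops, bandwidth_gb_per_s):
    """Roofline bound: the lower of the compute ceiling and the memory diagonal."""
    return min(peak_gflops, bandwidth_gb_per_s * ops_per_byte)


def ridge_point(peak_gflops, bandwidth_gb_per_s):
    """Operations per byte needed before the compute ceiling becomes the limit."""
    return peak_gflops / bandwidth_gb_per_s


# Hypothetical numbers, chosen only to show the shape of the model.
PEAK_GFLOPS = 1000.0   # assumed compute ceiling
DRAM_GB_PER_S = 100.0  # assumed on-card memory bandwidth
PCIE_GB_PER_S = 5.0    # roughly the host/device figure from the question

for ops_per_byte in (0.5, 4.0, 32.0, 512.0):
    on_card = attainable_gflops(ops_per_byte, PEAK_GFLOPS, DRAM_GB_PER_S)
    over_bus = attainable_gflops(ops_per_byte, PEAK_GFLOPS, PCIE_GB_PER_S)
    print(f"{ops_per_byte:6.1f} ops/byte: {on_card:7.1f} GFLOP/s from DRAM, "
          f"{over_bus:7.1f} GFLOP/s if every byte crosses the bus")
```

With these assumed numbers, a kernel fed over the slower link needs far more reuse per byte before it stops being memory-bound, which is the "move to the right" argument in numeric form.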

Your second question: why is the bus the bottleneck? There are a number of technological reasons, but ignore those. Intel's chips are approaching 8nm between transistors. GPUs are somewhere in that ballpark. The bus, on the other hand, is easily measured in inches: an inch is about 25 million nanometers, so that's roughly 3 million times farther.

Assuming a 3 GHz processor, it takes about 0.3ns to do an operation.

It takes 0.25ns for a bit to move down a 3 inch bus. Since each bit sent down the bus requires at least 1 cycle to send and 1 cycle to receive, a full bit transfer takes about 0.9ns. (This ignores quite a bit of additional overhead on messages, which is captured by models like LogP.) Multiply this by a million such transfers for 1MB and you get about 1ms for a data transfer. In the time that transfer took, you could have done a few million other operations. So physics says the bus is a fundamental limiter of performance.
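The back-of-the-envelope estimate above can be reproduced in a few lines. This is only a sketch under loud assumptions: the signal is taken to travel at the vacuum speed of light (real traces are slower), 1MB is counted as one million sequential transfers, and all protocol overhead is ignored. The function name and parameters are made up for illustration.

```python
def bus_estimate(clock_hz=3e9, bus_inches=3.0, n_transfers=1_000_000,
                 signal_speed_m_per_s=3e8):
    """Rough latency model: one cycle to send, time of flight, one cycle to receive."""
    cycle_ns = 1e9 / clock_hz
    # Assumption: vacuum speed of light; signals on a real board are slower.
    flight_ns = bus_inches * 0.0254 / signal_speed_m_per_s * 1e9
    per_transfer_ns = cycle_ns + flight_ns + cycle_ns
    total_ms = per_transfer_ns * n_transfers / 1e6
    # Cycles the processor could have spent computing during the transfer.
    cycles_forgone = total_ms * 1e-3 * clock_hz
    return {
        "cycle_ns": cycle_ns,
        "flight_ns": flight_ns,
        "per_transfer_ns": per_transfer_ns,
        "total_ms": total_ms,
        "cycles_forgone": cycles_forgone,
    }


for name, value in bus_estimate().items():
    print(f"{name}: {value:.3g}")
```

The exact values depend entirely on the assumed clock, length, and signal speed; the point is only that the per-transfer cost is several times the cost of an on-chip operation.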

Richard

Posted 2011-04-28T04:12:22.200

Reputation: 2 565

Because that's the PCI-e bandwidth, see http://en.wikipedia.org/wiki/PCI_Express

5GB/sec seems reasonable: in a real system you can't do entirely back-to-back transfers all the time, because you have to let go of the bus for other peripherals from time to time.

On-GPU traffic only has to reach the card's own DRAM, and sometimes not even that (cache hits within the GPU), so on-GPU bandwidth is much higher.
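To see how a link of this speed can pull a 100+ Gbps kernel down to the tens of Gbps, here is a rough serial model. It is a sketch, not a description of the asker's application: it assumes "Gbps" in the question means gigabits per second, that every bit is copied to the GPU and back (two trips over the link), and that copies and computation do not overlap. The function name is hypothetical.

```python
def effective_gbps(compute_gbps, link_gbps, copies=2):
    """End-to-end rate when each bit crosses the link `copies` times and is
    processed once, with no overlap between transfer and computation."""
    seconds_per_gbit = copies / link_gbps + 1.0 / compute_gbps
    return 1.0 / seconds_per_gbit


link_gbps = 4.5 * 8    # 4.5 GB/s (middle of the measured range) in gigabits per second
compute_gbps = 100.0   # GPU-only rate reported in the question

print(f"copy in and out: {effective_gbps(compute_gbps, link_gbps, copies=2):.1f} Gbps")
print(f"copy in only:    {effective_gbps(compute_gbps, link_gbps, copies=1):.1f} Gbps")
```

Under these assumptions the model lands in the same range as the measured figure, which suggests the transfers rather than the kernel dominate the runtime. Overlapping copies with computation would change the formula.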

The answer to your next question, "why isn't there more bandwidth in PCIe", is basically down to cost/power/size/latency tradeoffs. A PCIe lane is slower than 10G ethernet but the bus transceivers are cheaper; a higher bandwidth system would drive up the cost of all expansion cards.

pjc50

Posted 2011-04-28T04:12:22.200

Reputation: 5 786