Hardware limitations in data transfer (high throughput)


I am currently looking into the hardware limitations of a scientific setup. We are running into high-load-related loss of data. I will first explain the problem and propose a solution, which I hope you can verify.

We have a camera providing four 120 px x 120 px images at 10 kHz. These are gathered by a frame grabber (NI PCIe-1433). The frame grabber is connected to a PCI slot.
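
For reference, a back-of-the-envelope estimate of the raw data rate coming off the camera (a sketch only, assuming 8-bit monochrome pixels and ignoring any Camera Link/PCIe protocol overhead):

    #include <cstdio>

    int main() {
        // Assumptions: 8-bit monochrome pixels, no protocol overhead.
        const double images_per_trigger = 4.0;        // four sub-images per acquisition
        const double pixels_per_image   = 120.0 * 120.0;
        const double trigger_rate_hz    = 10000.0;    // 10 kHz
        const double bytes_per_pixel    = 1.0;        // 8 bit/px

        const double bytes_per_second = images_per_trigger * pixels_per_image
                                      * trigger_rate_hz * bytes_per_pixel;

        // Prints ~576 MB/s of raw pixel data for all four images combined.
        std::printf("Required throughput: %.0f MB/s\n", bytes_per_second / 1e6);
        return 0;
    }

So, counting all four images, roughly 576 MB/s of raw pixel data has to be moved continuously.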

If I understood correctly, the data will transfer from the frame grabber to the CPU. (Frame grabber -> bus -> south bridge -> bus -> north bridge -> front side bus -> CPU -> on-chip memory controller -> bus -> RAM?)

We then load the data onto the high-end GPU, which means the CPU requests the data from RAM (RAM -> bus -> CPU memory controller?) and loads it onto the GPU (CPU -> front side bus -> north bridge -> bus -> NVIDIA GPU?).
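
Purely as an illustration of that RAM-to-GPU step (this is not the actual LabVIEW/DLL code; buffer names and sizes are assumptions), a minimal CUDA runtime sketch: with a pinned host buffer, the GPU's copy engine can pull the frame from RAM by DMA instead of going through an extra CPU staging copy.

    #include <cuda_runtime.h>

    int main() {
        // Assumed frame geometry: four 120x120 8-bit images per acquisition.
        const size_t frame_bytes = 4 * 120 * 120;   // ~57.6 kB per acquisition

        // Pinned (page-locked) host buffer: enables true DMA transfers instead
        // of an extra copy through a pageable staging buffer.
        unsigned char* h_frame = nullptr;
        cudaHostAlloc((void**)&h_frame, frame_bytes, cudaHostAllocDefault);

        unsigned char* d_frame = nullptr;
        cudaMalloc((void**)&d_frame, frame_bytes);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // ... frame grabber / driver fills h_frame here ...

        // Asynchronous copy: the GPU's DMA engine pulls the data from host RAM
        // while the CPU stays free to service the next frame from the grabber.
        cudaMemcpyAsync(d_frame, h_frame, frame_bytes,
                        cudaMemcpyHostToDevice, stream);

        // ... launch the processing kernel(s) on `stream` here ...

        cudaStreamSynchronize(stream);

        cudaStreamDestroy(stream);
        cudaFree(d_frame);
        cudaFreeHost(h_frame);
        return 0;
    }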

The frame grabber's specifications themselves are quite clear, and it should be able to handle this data rate. The current thinking is that the double CPU load (writing to RAM; RAM -> GPU) is causing a bottleneck. The likely fixes would then be to upgrade the CPU to a model with a higher single-core clock speed and/or to upgrade the RAM.

I'm also looking for a resource that succinctly explains these data transfers (probably without the frame grabber) and how to assess the speeds and find potential bottlenecks.

Daimonie


Is it PCI or PCI-E? At 144,000,000 pixels per second (assuming no data compression is being done) and 8 bits per pixel, that would be a data stream of 1,152,000,000 bit/s, roughly 137 MiB/s. Normal PCI would cap out somewhere around 133 MB/s. The name would suggest it's a PCI-E model? As modern computer architecture doesn't really have a classic north/south bridge like it used to, and technologies like DMA might be used as well, it's not that easy to tell which path the data will take. How are you loading the data into the GPU? – Seth – 2016-12-02T09:58:22.397

It should be PCIe. The motherboard is a Dell 0K240Y, the processor is an Intel Xeon 1620 v3. The data is accessed by LabVIEW, which calls a C++/CUDA DLL that loads the data and computes what we want. – Daimonie – 2016-12-02T10:16:56.733

BTW, the calculation you did above doesn't take into account the 8B/10B encoding, does it? – Daimonie – 2016-12-02T10:25:22.803

No, it doesn't. It was just a quick calculation in case it's really just PCI, ignoring any special encodings, compression or similar that could take place. Your idea about the data path from the grabber to RAM is pretty accurate, even if the FSB and north bridge have been pretty much eliminated. As for the CPU-to-GPU path, it would depend on how that module is designed. Depending on that, the GPU might read directly from RAM. I'm not sure how you'd benchmark either connection. There are probably some RAM speed tests available. Also consider checking your code for any bottlenecks like expensive ops in loops. – Seth – 2016-12-02T10:31:04.397
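
In the absence of a dedicated RAM speed test, a crude host-memory bandwidth check can be done with a timed memcpy; this is only a rough sketch (buffer size and repeat count are arbitrary, and caches, NUMA and the allocator all influence the result):

    #include <chrono>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    int main() {
        // Copy a large buffer repeatedly and report the effective bandwidth.
        const size_t bytes = 512ull * 1024 * 1024;   // 512 MiB
        std::vector<char> src(bytes, 1), dst(bytes, 0);

        const int repeats = 10;
        const auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < repeats; ++i)
            std::memcpy(dst.data(), src.data(), bytes);
        const auto t1 = std::chrono::steady_clock::now();

        const double seconds = std::chrono::duration<double>(t1 - t0).count();
        std::printf("memcpy bandwidth: ~%.1f GB/s\n",
                    (double)bytes * repeats / seconds / 1e9);
        return 0;
    }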

Thanks - I was just reading about 8B/10B, so I was curious. Right, I'm glad I got some things right! The code doesn't do many expensive ops. The GPU code is C++/CUDA, so while the CPU works with pointers to the data in RAM (fast), it still requires a host/device transfer (slow). If I got this right, the PCIe lanes for the grabber and the GPU are different, so there should not be much contention there. However, there might be a CPU-RAM bottleneck from what I gather. It is beginning to look like an i7 Skylake might really make a difference (southbridge bus speed, single-core speed). – Daimonie – 2016-12-02T10:37:38.127
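
To check whether the host-to-device transfer really is the slow step, it can be timed directly with CUDA events; the following is only a sketch (buffer size and copy count are assumptions, not the production DLL), timing one second's worth of 10 kHz acquisitions:

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const size_t bytes = 4 * 120 * 120;   // one acquisition, 8 bit/px assumed

        unsigned char* h_buf = nullptr;
        unsigned char* d_buf = nullptr;
        cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocDefault);  // pinned
        cudaMalloc((void**)&d_buf, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Time a burst of small copies; a single ~58 kB copy is latency-dominated.
        const int copies = 10000;             // one second's worth at 10 kHz
        cudaEventRecord(start);
        for (int i = 0; i < copies; ++i)
            cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        std::printf("%d copies of %zu B took %.1f ms (%.1f MB/s)\n",
                    copies, bytes, ms,
                    (double)bytes * copies / (ms / 1000.0) / 1e6);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }

If such a burst finishes in well under a second, the PCIe copy itself is unlikely to be the bottleneck; if not, pinned buffers and batching several acquisitions per copy are the usual first things to try.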

The southbridge isn't a part of the CPU; it is part of the chipset. And the southbridge probably won't be used here at all unless the PCIe slot you're using is connected to it. Most of the PCIe lanes are connected directly to the CPU. It would help if you were to explain what exactly your problem is. Also, RAM bandwidth likely isn't the issue here, as quad-channel DDR4 has far more bandwidth than the PCIe bus - 64 GB/s is the max bandwidth supported by the CPU. As far as assessing the bottlenecks goes, there are various bus analyzers, but they are rather expensive. – Muh Fugen – 2016-12-02T10:40:39.363

After looking at the datasheet for that card: it does have its own DMA engine, and your NVIDIA GPU will as well. It doesn't seem like a faster CPU clock speed would do much at all here, as the processors on both cards will be accessing the memory of the host system directly. You will likely need to contact your vendors for support here. – Muh Fugen – 2016-12-02T10:47:19.643

I saw that Skylake uses DMI 3.0, which features a faster link between the CPU and the chipset. Figured that would help. Thanks for the clear answers! As for describing the problem: it is simply that we are losing data somewhere in the process, depending on the acquisition frequency. Above 8 kHz, we start losing more and more data. This is due to a bottleneck somewhere, and we are trying to assess where. I can run the CPU/GPU part of the process on an old GTX 580 without any problems at up to 14 kHz. This is what suggests a hardware bottleneck to us. – Daimonie – 2016-12-02T11:04:06.300

Just to add to this: I've spoken with an application engineer at National Instruments, who agrees that the card should support the throughput of the system. As a result, we are now taking a very critical look at our software implementation, in particular the LabVIEW part. – Daimonie – 2016-12-12T11:40:00.213

I wanted to update this issue for future reference. Because of shifting priorities, we decided to get a stronger PC to try to resolve the issue without spending more time on it (ASUS ROG Maximus Hero IX, Intel i7-7700K). Sadly, this did not solve it. The promised critical look at the C++/CUDA implementation showed that this was not the issue, and neither was a minimal LabVIEW implementation. The issue seems to lie in the acquisition, which is not acquiring all the captured frames. This will be resolved in the near future. – Daimonie – 2017-10-05T22:11:53.280

No answers