
We are planning to build a pair of multi-GPU Linux servers for machine learning and data science tasks. Per our requirements, we need to put a lot of RAM in these machines; we're planning on 24x 64GiB LRDIMMs for a total of 1.5TiB. For GPUs, we were going to use Titan Xs for the best bang for the buck, but according to Nvidia's Linux driver documentation, current-gen cards can't handle more than 1TiB of host system RAM. I've heard "rumours" that the Pascal architecture will come with increased addressing capabilities, but I can't find any reliable documentation to confirm or contradict this. If that turns out to be true, we might go with the newer GTX 1080 cards, even though they have 4GiB less on-board RAM.

Hence my question: is there some documentation of Pascal's addressing capabilities that I'm missing?

Or alternatively, could somebody with access to a GTX 1080 run

$ grep DMA /proc/driver/nvidia/gpus/domain:bus:device.function/information

for me?
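For a machine with several cards, here's a small sketch that checks every GPU the driver has registered. The `list_dma` helper name and the optional base-path argument are my own additions; the default path is the standard location of the nvidia driver's procfs interface.

```shell
# Print the DMA capability lines for every registered NVIDIA GPU.
# Pass an alternative base directory as $1 for testing.
list_dma() {
    base="${1:-/proc/driver/nvidia/gpus}"
    for info in "$base"/*/information; do
        [ -e "$info" ] || continue   # no GPUs found under this base
        echo "== $info =="
        grep DMA "$info"
    done
}

list_dma
```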

mvoelske
  • I'd be really worried about putting Titan Xs into any server due to their power draw and cooling requirements - can you not just use Teslas? They're designed for the job, and you can fit lots inside a single server. – Chopper3 Jul 12 '16 at 10:51
  • @Chopper3 That's a valid concern, but the server chassis we are considering are capable of handling this (as is the A/C in our server room). In any case, according to the link in my question, current-gen Tesla cards would have the same issue regarding host system RAM. – mvoelske Jul 12 '16 at 11:02

1 Answer


Answering my own question for future reference. We decided to go with the GTX 1080 cards. Under driver version 367.57, they report the following DMA capabilities:

$ grep DMA /proc/driver/nvidia/gpus/0000\:04\:00.0/information
DMA Size:    47 bits
DMA Mask:    0x7fffffffffff

As such, they should be able to address 2^47 bytes of host system RAM -- 128 TiB, or about 140 terabytes -- more than enough for our use case.
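For reference, the conversion from the reported mask to addressable memory (the variable names are just for illustration):

```shell
# A 47-bit DMA mask means the card can generate 2^47 distinct host
# addresses. mask is the value the driver reports above.
mask=0x7fffffffffff
bytes=$(( mask + 1 ))        # 2^47 bytes
tib=$(( bytes >> 40 ))       # divide by 2^40 to get TiB
echo "$tib TiB addressable"  # prints "128 TiB addressable"
```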

mvoelske