Two systems with identical GPUs show very different performance when running a TensorFlow script on the GPU

I have two computers with the same GPU (GTX 1080), the same OS, and the same software installed. But when I run my TensorFlow program (an RNN model), the speeds are very different: one machine is about 1.5x faster than the other.

Here are the key specs of the two:

SystemA: Asus Z170-P, i7 6700T, 32 GB RAM, GTX 1080.
SystemB: Asus X99 E-WS, i7 5930K, 128 GB RAM, GTX 1080. (The problematic one)

Both have the following installed (using the same method):

OS: Ubuntu 16.04
GPU driver version: 378.13
CUDA version: 8.0
cuDNN version: 5.1
TensorFlow: installed with pip install tensorflow-gpu==1.0.1
Python: Anaconda 3.6

Sample code:

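import tensorflow as tf
import numpy as np
from tqdm import trange

np.random.seed(111)  # fixed seed so both machines see the same data
h, w = 3000, 2000
steps = 1000

# Graph: multiply a fed [h, w] matrix by a constant [w, w] matrix.
x = tf.placeholder(dtype=tf.float32, shape=[h, w], name='x')
t = tf.constant(np.random.random(size=[w, w]), dtype=tf.float32)
m = tf.matmul(x, t)

x0 = np.random.random(size=[h, w])
sess = tf.Session()
# Each step feeds the previous result back in, so every iteration
# passes an [h, w] array into the session and fetches one back.
for i in trange(steps):
    x0 = sess.run(m, feed_dict={x: x0})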

SystemA runs at 75 iterations/sec while SystemB only reaches 50 iterations/sec. Yes, the lower-spec machine is actually faster.
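To compare the two throughput figures on equal terms, it may help to time the loop explicitly and leave a few warm-up calls out of the measurement, since the first iterations can include one-off start-up costs. The following is a minimal sketch using only the standard library; `iterations_per_second`, `step`, and the `warmup` count are hypothetical names, and `step` stands in for the per-iteration call in the script above.

```python
import time


def iterations_per_second(step, steps=1000, warmup=10):
    """Time `steps` calls of `step()` after `warmup` untimed calls."""
    for _ in range(warmup):
        step()
    start = time.perf_counter()
    for _ in range(steps):
        step()
    elapsed = time.perf_counter() - start
    return steps / elapsed


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with the real per-iteration call.
    rate = iterations_per_second(lambda: sum(range(10000)), steps=200)
    print("%.1f iter/sec" % rate)
```

Running the same harness on both machines gives a number that does not depend on how the progress bar averages its rate.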

Key observations:

  1. SystemB incurs far more page faults while running the program.
  2. Monitoring Volatile GPU-Util in nvidia-smi, SystemA sits steadily at about 40% while SystemB is at about 30%.
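The page-fault observation can also be checked from inside the Python process, which makes it possible to measure only the training loop rather than the whole run. This is a rough sketch using the standard `resource` module (Unix only); the helper name `measure` and the stand-in workload are assumptions for illustration, not part of the original script.

```python
import resource
import time


def measure(fn):
    """Run fn() and return deltas of a few getrusage counters for this process."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    t0 = time.perf_counter()
    fn()
    wall = time.perf_counter() - t0
    after = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "wall_s": wall,
        "user_s": after.ru_utime - before.ru_utime,
        "system_s": after.ru_stime - before.ru_stime,
        "minor_faults": after.ru_minflt - before.ru_minflt,
        "major_faults": after.ru_majflt - before.ru_majflt,
        "involuntary_ctx_switches": after.ru_nivcsw - before.ru_nivcsw,
    }


if __name__ == "__main__":
    # Stand-in workload: allocating memory typically triggers minor faults.
    stats = measure(lambda: bytearray(64 * 1024 * 1024))
    for name, value in stats.items():
        print(name, value)
```

Wrapping the `for` loop of the benchmark in a function and passing it to `measure` would show whether the extra faults happen during the iterations or during start-up.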

Things I have tried on systemB:

  1. Upgraded the BIOS to the latest version and reset it to default settings.
  2. Called Asus customer service for help.
  3. Swapped the GPU card with SystemA's.
  4. Changed the PCI-e slot to make sure the card is running at x16 Gen3.
  5. Added LD_PRELOAD="/usr/lib/libtcmalloc.so" to the .bashrc file.
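For the LD_PRELOAD attempt, it may be worth confirming that the library is really loaded into the Python process. A plain assignment in .bashrc without `export` is generally not passed on to child processes, so the setting could silently have no effect. The sketch below reads /proc/self/maps (Linux only); `mapped_libraries` is a hypothetical helper name, and this is a possibility to rule out rather than a diagnosis.

```python
import os


def mapped_libraries(needle, maps_path="/proc/self/maps"):
    """Return the sorted mapped file paths whose file name contains `needle`."""
    found = set()
    with open(maps_path) as maps:
        for line in maps:
            fields = line.split(None, 5)
            # The sixth field, when present, is the path of the mapped file.
            if len(fields) == 6:
                path = fields[5].strip()
                if needle in os.path.basename(path):
                    found.add(path)
    return sorted(found)


if __name__ == "__main__":
    print("LD_PRELOAD =", os.environ.get("LD_PRELOAD"))
    print("tcmalloc mappings:", mapped_libraries("tcmalloc"))
```

An empty list printed from inside the benchmark process would suggest that tcmalloc was never in use during the earlier test.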

The main differences in the output of /usr/bin/time -v are:

Metric                                        SystemB    SystemA
System time (seconds)                         7.28       2.95
Percent of CPU this job got                   85%        106%
Elapsed (wall clock) time (h:mm:ss or m:ss)   0:22.41    0:14.89
Minor (reclaiming a frame) page faults        684695     97853
Involuntary context switches                  164        91063
File system inputs                            0          24
File system outputs                           8          0
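To put the /usr/bin/time figures in proportion, the SystemB-to-SystemA ratios can be worked out directly. The values below are copied from the output above; the dictionary keys and the helper name `b_over_a` are made up for this sketch.

```python
# (SystemB, SystemA) values copied from the /usr/bin/time -v output.
measurements = {
    "system_time_s": (7.28, 2.95),
    "wall_time_s": (22.41, 14.89),
    "minor_page_faults": (684695, 97853),
}


def b_over_a(values):
    """Ratio of the SystemB value to the SystemA value for each metric."""
    return {name: b / a for name, (b, a) in values.items()}


ratios = b_over_a(measurements)
for name, ratio in ratios.items():
    print("%-18s B/A = %.2f" % (name, ratio))

# How much of the extra wall-clock time the extra system (kernel) time could cover.
extra_system = measurements["system_time_s"][0] - measurements["system_time_s"][1]
extra_wall = measurements["wall_time_s"][0] - measurements["wall_time_s"][1]
print("extra system time / extra wall time = %.2f" % (extra_system / extra_wall))
```

SystemB shows roughly 7x the minor page faults and about 2.5x the system time for a 1.5x longer run. The extra system time is a bit over half the size of the wall-clock gap, though system time is summed across threads, so this is only a rough comparison.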

Can anybody point me in the right direction on how to profile/debug this issue? Many thanks in advance!

Update:

My RAM is Corsair DDR4 3000MHz CMK32GBX4M2B3000C15, which does not seem to be on the motherboard's support list. Maybe this is the cause? But I have used this computer for a year now with no issues whatsoever.

UPDATE:

Through a discussion with Stack Overflow user wontonimo, we found that it is the bus + CPU causing the problem. Hopefully someone here can point me in a direction to fix this. The post is here.

Xer

Posted 2017-05-08T12:19:30.000

Reputation: 175

Those systems are not the same. The GPU might be, but the systems aren't. – Stese – 2017-05-08T12:24:55.343

Did you actually try to do this with the same random input on both machines? I don't believe numpy utilizes the GPU by default and by having different inputs (as you're using random values) the run time might vary. This could be further impacted by the amount of available entropy depending on how the random implementation works. – Seth – 2017-05-08T12:26:11.640

@StevenDavison, thanks for the reply. Yes, you are right, sorry for the confusion. My bad English. – Xer – 2017-05-08T12:30:30.710

@Seth, thank you for your reply. This issue actually appears with every model that uses the GPU, including ones that don't use np.random. Setting a fixed random seed still gives the same results. – Xer – 2017-05-08T12:33:32.573

The Intel page for the 5930K lists it as supporting only 64GB: http://ark.intel.com/products/82931/Intel-Core-i7-5930K-Processor-15M-Cache-up-to-3_70-GHz Maybe a memory problem? – Mokubai – 2017-05-08T13:30:42.417

@Mokubai, thanks for your reply. Yes, however my system successfully detects all 128 GB of RAM. I also tried removing 64 GB of RAM, but the problem remains. – Xer – 2017-05-08T14:19:54.647

"which seems like its not listed on the motherboard support list, may be this is the cause?" - The memory compatibility list is not a complete list of compatible memory; it's simply the list of memory modules ASUS used during their compatibility testing. You should use that list to determine whether a motherboard supports ECC/non-ECC memory, what frequency ranges it was tested with, and what voltage the tested modules used. It should be used as a guide, not as the sole source to determine whether memory will work. – Ramhound – 2017-05-08T17:44:06.430

@Ramhound, that's very helpful information, many thanks. – Xer – 2017-05-08T19:36:12.290

No answers