
I need some help determining whether the memory bandwidth I'm seeing under Linux on my server is normal or not. Here's the server spec:

HP ProLiant DL165 G7
2x AMD Opteron 6164 HE 12-Core
40 GB RAM (10 x 4 GB DDR3-1333)
Debian 6.0

Using mbw on this server I get the following numbers:

foo1:~# mbw -n 3 1024
Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 3 runs per test.
0   Method: MEMCPY  Elapsed: 0.58047    MiB: 1024.00000 Copy: 1764.082 MiB/s
1   Method: MEMCPY  Elapsed: 0.58012    MiB: 1024.00000 Copy: 1765.152 MiB/s
2   Method: MEMCPY  Elapsed: 0.58010    MiB: 1024.00000 Copy: 1765.201 MiB/s
AVG Method: MEMCPY  Elapsed: 0.58023    MiB: 1024.00000 Copy: 1764.811 MiB/s
0   Method: DUMB    Elapsed: 0.36174    MiB: 1024.00000 Copy: 2830.778 MiB/s
1   Method: DUMB    Elapsed: 0.35869    MiB: 1024.00000 Copy: 2854.817 MiB/s
2   Method: DUMB    Elapsed: 0.35848    MiB: 1024.00000 Copy: 2856.481 MiB/s
AVG Method: DUMB    Elapsed: 0.35964    MiB: 1024.00000 Copy: 2847.310 MiB/s
0   Method: MCBLOCK Elapsed: 0.23546    MiB: 1024.00000 Copy: 4348.860 MiB/s
1   Method: MCBLOCK Elapsed: 0.23544    MiB: 1024.00000 Copy: 4349.230 MiB/s
2   Method: MCBLOCK Elapsed: 0.23544    MiB: 1024.00000 Copy: 4349.359 MiB/s
AVG Method: MCBLOCK Elapsed: 0.23545    MiB: 1024.00000 Copy: 4349.149 MiB/s

On one of my other servers (based on an Intel Xeon E3-1270):

foo2:~# mbw -n 3 1024
Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 3 runs per test.
0   Method: MEMCPY  Elapsed: 0.18960    MiB: 1024.00000 Copy: 5400.901 MiB/s
1   Method: MEMCPY  Elapsed: 0.18922    MiB: 1024.00000 Copy: 5411.690 MiB/s
2   Method: MEMCPY  Elapsed: 0.18944    MiB: 1024.00000 Copy: 5405.491 MiB/s
AVG Method: MEMCPY  Elapsed: 0.18942    MiB: 1024.00000 Copy: 5406.024 MiB/s
0   Method: DUMB    Elapsed: 0.14838    MiB: 1024.00000 Copy: 6901.200 MiB/s
1   Method: DUMB    Elapsed: 0.14818    MiB: 1024.00000 Copy: 6910.561 MiB/s
2   Method: DUMB    Elapsed: 0.14820    MiB: 1024.00000 Copy: 6909.628 MiB/s
AVG Method: DUMB    Elapsed: 0.14825    MiB: 1024.00000 Copy: 6907.127 MiB/s
0   Method: MCBLOCK Elapsed: 0.04362    MiB: 1024.00000 Copy: 23477.623 MiB/s
1   Method: MCBLOCK Elapsed: 0.04262    MiB: 1024.00000 Copy: 24025.151 MiB/s
2   Method: MCBLOCK Elapsed: 0.04258    MiB: 1024.00000 Copy: 24048.849 MiB/s
AVG Method: MCBLOCK Elapsed: 0.04294    MiB: 1024.00000 Copy: 23847.599 MiB/s

For reference, here's what I get on my Intel-based laptop:

laptop:~$ mbw -n 3 1024
Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 3 runs per test.
0   Method: MEMCPY  Elapsed: 0.40566    MiB: 1024.00000 Copy: 2524.269 MiB/s
1   Method: MEMCPY  Elapsed: 0.38458    MiB: 1024.00000 Copy: 2662.638 MiB/s
2   Method: MEMCPY  Elapsed: 0.38876    MiB: 1024.00000 Copy: 2634.043 MiB/s
AVG Method: MEMCPY  Elapsed: 0.39300    MiB: 1024.00000 Copy: 2605.600 MiB/s
0   Method: DUMB    Elapsed: 0.30707    MiB: 1024.00000 Copy: 3334.745 MiB/s
1   Method: DUMB    Elapsed: 0.30425    MiB: 1024.00000 Copy: 3365.653 MiB/s
2   Method: DUMB    Elapsed: 0.30342    MiB: 1024.00000 Copy: 3374.849 MiB/s
AVG Method: DUMB    Elapsed: 0.30491    MiB: 1024.00000 Copy: 3358.328 MiB/s
0   Method: MCBLOCK Elapsed: 0.07875    MiB: 1024.00000 Copy: 13003.670 MiB/s
1   Method: MCBLOCK Elapsed: 0.08374    MiB: 1024.00000 Copy: 12228.034 MiB/s
2   Method: MCBLOCK Elapsed: 0.07635    MiB: 1024.00000 Copy: 13411.216 MiB/s
AVG Method: MCBLOCK Elapsed: 0.07961    MiB: 1024.00000 Copy: 12862.006 MiB/s

So according to mbw, my laptop is three times faster than the server! Please help me explain this. I've also tried mounting a ramdisk and benchmarking it with dd, and I get similar differences, so I don't think mbw is to blame.
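
For reference, the ramdisk test was along these lines (a sketch; the mount point, block size and count are illustrative, not necessarily the exact values I used):

foo1:~# mount -t tmpfs -o size=2G tmpfs /mnt/ramdisk
foo1:~# dd if=/dev/zero of=/mnt/ramdisk/test bs=1M count=1024

dd reports the throughput when it finishes, which is what I compared across machines.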

I've checked the BIOS settings and the memory seems to be running at full speed. According to the hosting company, the modules are all OK.

Could this have something to do with NUMA? It seems like Node Interleaving is disabled on this server. Will enabling it (thus turning off NUMA) make a difference?

foo1:~# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 8190 MB
node 0 free: 7898 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 12288 MB
node 1 free: 12073 MB
node 2 cpus: 18 19 20 21 22 23
node 2 size: 12288 MB
node 2 free: 12034 MB
node 3 cpus: 12 13 14 15 16 17
node 3 size: 8192 MB
node 3 free: 8032 MB
node distances:
node   0   1   2   3 
  0:  10  20  20  20 
  1:  20  10  20  20 
  2:  20  20  10  20 
  3:  20  20  20  10 
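
If NUMA placement were the whole story, pinning the benchmark's CPUs and memory to specific nodes with numactl should show a clear local-vs-remote difference (a sketch; node numbers as reported above):

foo1:~# numactl --cpunodebind=0 --membind=0 mbw -n 3 1024    # node-local memory
foo1:~# numactl --cpunodebind=0 --membind=1 mbw -n 3 1024    # remote node's memory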

UPDATE:

I have disabled NUMA (numa=off on the Linux boot command line) and disabled ECC in the BIOS. No change; still the same numbers as above.
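
To sanity-check that the boot option actually took effect:

foo1:~# cat /proc/cmdline      # should now include numa=off
foo1:~# numactl --hardware     # should report a single node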

UPDATE 2:

Here's the layout of the memory according to dmidecode:

PROC 1 DIMM 1
PROC 1 DIMM 4
PROC 1 DIMM 7
PROC 1 DIMM 10
PROC 1 DIMM 12

PROC 2 DIMM 1
PROC 2 DIMM 4
PROC 2 DIMM 7
PROC 2 DIMM 10
PROC 2 DIMM 12

These are all 4 GB Samsung modules (part no. M393B5270CH0-CH9).

I've had a look at the HP docs on how to populate the memory in this server, and if I understand them correctly, the modules that are currently in DIMM 12 should have been placed in the DIMM 3 slot. Can such a misconfiguration explain the results I'm getting?
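
For anyone wanting to reproduce this listing, dmidecode can dump the populated slots together with size and speed (exact field labels vary a bit between BIOS versions):

foo1:~# dmidecode -t memory | grep -E 'Locator|Size|Speed'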

UPDATE 3:

I have now removed 2 modules to get 4 x 4 GB on each side (4-4), placed in slots 1-4-7-10. Unfortunately I'm not seeing any difference in the benchmarks. Shouldn't the server be able to use all four channels now? I've also tried the stream benchmark with multiple threads (built and run as sketched below), and the results are very disappointing. The only thing I can think of now is to ask the hosting company to replace the whole server...
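
For reference, stream was built and run roughly like this (the usual OpenMP build; these flags are illustrative, not necessarily my exact ones):

foo1:~# gcc -O3 -fopenmp stream.c -o stream
foo1:~# OMP_NUM_THREADS=24 ./stream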

UPDATE 4:

I must have done something wrong when I tested the last setup (32 GB) with stream yesterday, because today I'm seeing excellent results:

foo1:~# ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 24
-------------------------------------------------------------
Printing one line per active thread....
  (repeated 24 times, once per thread)
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 703 microseconds.
   (= 703 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       36873.0022       0.0009       0.0009       0.0010
Scale:      34699.5160       0.0009       0.0009       0.0010
Add:        30868.8427       0.0016       0.0016       0.0017
Triad:      25558.7904       0.0019       0.0019       0.0020
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

(I've abandoned mbw since it only runs single-threaded. It still gives the same crappy results on this server.)

So the problem must have been those last two 4 GB modules, which forced the server to run in single-channel mode, just like @chx pointed out below. Now the only remaining question is whether it is possible to use 40 GB and still get the full bandwidth. Can I use 2 x 8 GB + 6 x 4 GB? Does it matter in which channel I place the larger modules?

  • Well - your laptop isn't running ECC, so that would explain that. Is ECC running on the Intel server? – pauska Sep 25 '12 at 15:33
  • The RAM modules in the AMD server are ECC modules. I don't know about the Intel server. dmidecode doesn't give any info on the modules it uses. But could ECC really explain the huge difference? Googling suggests that ECC RAM gives a few % penalty. I'm seeing a lot more than that here! – ntherning Sep 25 '12 at 15:51
  • What's the layout of the memory? Is it registered? – Chris S Sep 25 '12 at 17:27
  • @ChrisS: I've updated the question with more info on the memory modules and the current layout. – ntherning Sep 25 '12 at 20:03
  • The wrong organization can cause havoc, but yours is correct. I'm really not sure what's going on here, but I know current Opterons beat current Intel processors in various memory benchmarks on account of the Opterons having 4 channels and Intel only having 3. There might be something going on with the single-threaded nature of the `mbw` software; though `dd` is showing similar results... Not sure, but not right. – Chris S Sep 25 '12 at 20:33

1 Answer


You are forcing the system to operate in single-channel (!) mode by using 5-5 modules per CPU instead of 4-4 or 8-8. That's the reason. Try removing one module per CPU and report back.

The 6164 is a G34-socket CPU, which is capable of quad-channel operation if the memory modules are set up right. Your setup is the worst possible.
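
A back-of-the-envelope check supports this (rough peak numbers for DDR3-1333 on a 64-bit channel; real-world figures will be somewhat lower):

1 channel:   1333 MT/s x 8 B             ~ 10.7 GB/s peak
copy test:   each byte read + written    ~  5.3 GB/s effective copy rate
4 channels:  4 x 10.7 GB/s               ~ 42.7 GB/s peak per socket

Your ~4.3 GiB/s MCBLOCK average is right in single-channel territory, while a properly populated dual-socket board should be able to go roughly an order of magnitude higher.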

chx
  • Good catch on the DIMM population! :) – ewwhite Sep 25 '12 at 21:42
  • The hosting company I rent it from has changed the layout of the modules to slots 1-4-7-10-3, which is what HP says in the manual. No noticeable difference in my tests, though. I have now asked them to remove that last 4 GB module on each side. – ntherning Sep 26 '12 at 05:25
  • @chx - it seems like you were right about the single-channel mode. Please see my last update. – ntherning Sep 27 '12 at 11:40
  • You need four identical modules per CPU, end of story. And because it's dual-CPU, you actually need eight to get both of them running in quad-channel mode. So either 32 GB or 64 GB; there's no middle ground. – chx Sep 27 '12 at 13:18
  • OK, that's what I feared. I don't understand why they let me buy 40 GB if that's not going to work optimally. :-( Thanks for all the help! – ntherning Sep 27 '12 at 13:31
  • Because it's *extremely* unlikely you will be memory bandwidth constrained. You are fretting over the wrong thing. – chx Sep 27 '12 at 13:33
  • Well, you're probably right. But the nerd in me just can't let this be! :-) If you're not already tired of me, please help me explain this: they have now added back the 4 GB modules that were taken out before, making it 40 GB in total again and 5-5. stream again gives poor results. But I just tried removing the numa=off boot option, and after a reboot stream gives me close to the excellent results I saw with 32 GB in my last update. – ntherning Sep 27 '12 at 14:02