34

I have written a piece of multi-threaded software that runs a bunch of simulations per day. This is a very CPU-intensive task, and I have been running this program on cloud services, usually on configurations like 1GB per core.

I am running CentOS 6.7, and /proc/cpuinfo reports that my four VPS cores run at 2.5GHz.

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
stepping        : 2
microcode       : 1
cpu MHz         : 2499.992
cache size      : 30720 KB
physical id     : 3
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 3
initial apicid  : 3
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good unfair_spinlock pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm arat xsaveopt fsgsbase bmi1 avx2 smep bmi2 erms invpcid
bogomips        : 4999.98
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

With the rise in exchange rates, my VPS became more expensive, and I came across a "great deal" on used bare-metal servers.

I purchased four HP DL580 G5 servers, each with four Intel Xeon X7350 processors. Each machine therefore has 16x 2.93GHz cores and 16GB of RAM, keeping the configuration similar to my VPS cloud.

processor       : 15
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU           X7350 @ 2.93GHz
stepping        : 11
microcode       : 187
cpu MHz         : 1600.002
cache size      : 4096 KB
physical id     : 6
siblings        : 4
core id         : 3
cpu cores       : 4
apicid          : 27
initial apicid  : 27
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca lahf_lm dts tpr_shadow vnmi flexpriority
bogomips        : 5866.96
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

Essentially it seemed like a great deal, as I could stop using VPSes for these batch jobs. Now comes the weird part...

  1. On the VPSes I have been running 1.25 threads per core, just as I do on the bare metal. (The extra 0.25 thread compensates for idle time caused by network use.)
  2. On my VPSes, using 44x 2.5GHz cores in total, I get nearly 900 simulations per minute.
  3. On my DL580s, using 64x 2.93GHz cores in total, I am only getting 300 simulations per minute (see the per-core breakdown below).
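
Doing the simple division on the numbers above (just arithmetic, not a benchmark):

900 simulations/min / 44 cores ≈ 20.5 simulations/min per VPS core
300 simulations/min / 64 cores ≈ 4.7 simulations/min per X7350 core

So each bare-metal core is delivering roughly a quarter of the throughput of a VPS core, despite the higher nominal clock speed.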

I understand the DL580 has an older processor. But if I am running one thread per core, and the bare-metal server has a faster core, why is it performing worse than my VPS?

There is no memory swapping happening on any of the servers.

top says my processors are running at 100%, and I get a load average of 18 (vs. 5 on the VPS).

Is it just going to be this way, or am I missing something?

Running lscpu reports 1.6GHz on my bare-metal server, and the same value shows up in /proc/cpuinfo.

Is this information correct, or is it caused by some incorrect power-management setting?

[BARE METAL] $ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             4
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 15
Stepping:              11
**CPU MHz:               1600.002**
BogoMIPS:              5984.30
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
NUMA node0 CPU(s):     0-15


[VPS] $ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             4
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Stepping:              2
**CPU MHz:               2499.992**
BogoMIPS:              4999.98
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0-3
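
For reference, one way to check whether that 1.6GHz reading is just frequency scaling (rather than the CPU's real maximum) is to watch the reported clock and the governor while the simulation is running. This assumes the cpufreq driver is loaded and exposes its sysfs files on this CentOS 6 kernel:

[BARE METAL] $ grep MHz /proc/cpuinfo
[BARE METAL] $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
[BARE METAL] $ cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq

If the MHz values stay at 1600 even with all 16 cores at 100%, that points to a BIOS/power-management cap rather than normal idle down-clocking.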
  • 32
    Because eight year old processors do far less per clock cycle than current processors. – Michael Hampton Nov 28 '15 at 14:48
  • 3
    You'll want to reset your BIOS settings to default. These servers sound like they had a non-optimal configuration on them as well. See my edit below. – ewwhite Nov 28 '15 at 15:39
  • You should try running only 1 thread per core. If the core is slow, a high load can mean that the CPU is spending a lot of time switching tasks and getting less done. – Nemo Nov 28 '15 at 21:30
  • You can search the Internet for "CPU benchmarks" to find performance comparisons. My favorite such resource is [CPUBenchmark.net](https://www.cpubenchmark.net/). –  Nov 28 '15 at 23:02
  • 6
    Take a look at the cache size difference as well. Cache misses can be terrible. – acelent Nov 29 '15 at 00:59
  • Try `7z b` - it will sense the true clock speed of the CPU – Patrick May 11 '16 at 05:21

2 Answers

44

Processor advancements, clock speed, and IPC differences make it almost impossible to reasonably compare decade-old CPUs to modern ones. Not only do instructions per cycle vary, but newer processors also have instruction sets dedicated to complex calculations (Intel added AES-NI, for example), so clock speed is no longer a reasonable comparator (did I mention multi-core vs. hyperthreading...). With enough time and patience you could certainly figure out how many older processors equal one newer processor, but the calculation will end up showing that it's cheaper and faster to buy a new CPU.
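
If you want a rough per-core comparison on your own hardware (this is only a sketch: it is not a substitute for running your actual simulation, and the results also depend on how each benchmark binary was compiled), run a simple CPU-bound tool on one machine of each type and compare, for example:

$ openssl speed sha256
$ 7z b

openssl speed runs single-threaded by default, while 7z b (from the p7zip package) uses all cores unless told otherwise, so divide its total by the core count. That gives a ballpark per-core ratio without recompiling anything.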

Jim B
  • 2
    There are a bunch of websites out there that already do this, by comparing CPU benchmarks of each processor. – Michael Hampton Nov 28 '15 at 16:33
  • 4
    Not exact but here is the Passmark benchmark for [Intel Xeon E5-2680 @ 2.70GHz](http://www.cpubenchmark.net/cpu.php?cpu=Intel+Xeon+E5-2680+%40+2.70GHz) vs [Intel Xeon X7350 @ 2.93GHz](https://www.cpubenchmark.net/cpu.php?cpu=Intel+Xeon+X7350+%40+2.93GHz&id=2156&cpuCount=4) – chue x Nov 28 '15 at 17:15
  • The problem with a benchmark is that by its very nature, it's not optimized for that particular processor. It's not bad for a rough estimate, but you would have to rewrite and recompile with the best instruction set for each. Very few tools measure anything other than "how many times can x be calculated". – Jim B Nov 28 '15 at 19:59
  • 1
    @JimB, yes, but the OPs simulation might also not be optimized for a particular CPU. (if it is, I missed it, sorry) – David Balažic Nov 28 '15 at 20:00
  • 1
    Probably not, in which case the simulation is de facto the benchmark unless the program IS recompiled. A third-party benchmark would be less accurate. – Jim B Nov 28 '15 at 20:03
  • BTW, yes Xeon E... v3 is Haswell. If the OP didn't compile with `-march=native` for each machine separately, they'll both be using the same instruction set, so AVX and FMA aren't giving any advantage. If the simulation bottlenecks on memory bandwidth (which is a shared resource), having a lot more slow cores isn't going to help if you have the same or less total memory bandwidth. (@Glauco Cattalini Lins: A NUMA kernel might help some). – Peter Cordes Nov 29 '15 at 08:31
32

I don't want to sound terrible by emphasizing something that should be obvious here, but you're comparing a high-end server processor from 2014 to a high-end server processor from 2007.

I don't think this requires much more explanation.

There's a reason an HP ProLiant DL580 G5 is available so inexpensively today. They were large, slow, and lacked many of the features that are desirable in more modern servers. I sold my last one in 2009. This was a bad purchase, and you would be better served by a CPU from the Nehalem or Westmere families if you are forced to buy used equipment.

In addition, the servers you purchased are very inefficient in terms of power consumption, so they will be costly to operate.


It appears as though your physical servers are running in a power-saving mode that scales back the CPU clock speed. You'll want to go into the BIOS (press F9 at boot) and reset the server to factory defaults (who knows what else was changed from the defaults?).


ewwhite
  • I am unable to open the X7350 page. It gives me an error. I understand the X7350 has fewer cores compared to the E5 on the VPS. But when it comes down to threads per core, shouldn't the clock speed count for that? – Glauco Cattalini Lins Nov 28 '15 at 15:05
  • 7
    @GlaucoCattaliniLins No. – ewwhite Nov 28 '15 at 15:07
  • 1
    Could you elaborate on that? I am having trouble digesting it. I have the option to switch to other models, so I want to get it right if it comes to that. -- At first I thought it could be the cache size of the E5 (30MB), but then it would be shared among other VPSes. – Glauco Cattalini Lins Nov 28 '15 at 15:13
  • 1
    @GlaucoCattaliniLins Don't buy old servers. Try to keep it within 3 years of current system levels. – ewwhite Nov 28 '15 at 15:14
  • 11
    @GlaucoCattaliniLins The X7350 is based on the Core 2 microarchitecture. In fact, it's comparable to a _Core 2 Quad_ (how long has it been since you last heard of those?). It's so old that it doesn't support SSE4+, AVX(2), FMA or AES instructions, so if your simulations are numerical, they take a >2x penalty right there, and AES crypto speed suffers even more. Lastly, Intel has released 6 microarchitectural improvements since Core 2, and each one increased the CPU's ability to run instructions in parallel or out of order, as well as the memory bandwidth. – Iwillnotexist Idonotexist Nov 28 '15 at 16:59
  • 10
    @GlaucoCattaliniLins By contrast your VPS server supports FMA, so it's at least as new as the Haswell microarchitecture. The FMA instruction allows one to do a multiplication and addition two-in-one, and everything in Haswell (the instruction decoders, reorder buffer, branch predictors, memory bandwidth, ALUs) has been tuned so that the dual vector FMAs can be kept fed. Haswell can thus sustain, **in a single clock cycle:** 1) Two 8-element vector operations of the form `float d = a + b*c`, 2) Two 32-byte loads (the `a` and `b`) and 3) one 32-byte store (the `d`). It's amazingly well-tuned. – Iwillnotexist Idonotexist Nov 28 '15 at 17:14
  • @IwillnotexistIdonotexist, you deserved the right answer for the time and effort to explain all that. – Glauco Cattalini Lins Nov 28 '15 at 23:50
  • 2
    @IwillnotexistIdonotexist: he almost certainly didn't compile for each machine separately with `-march=native`, so I'd guess his code is only using SSE2 on either system. I'd guess memory bandwidth is probably a bottleneck, esp. if his kernel doesn't have NUMA support, or his sim's allocation patterns aren't NUMA-friendly. This is what, quad socket quad core, with dual channel memory controllers on each socket? – Peter Cordes Nov 29 '15 at 08:38
  • @ewwhite: He probably just looked at the clock speed while the system was idle. 1.6GHz is normal for an idle Core2. (My old desktop Core2 clocked down to that, but stayed pegged at its normal speed when there was enough load to saturate all the cores). Resetting the BIOS prob. won't make a diff. I know you didn't suggest this, but the OP might try: you don't *want* to disable power saving. It works very well on modern CPUs. – Peter Cordes Nov 29 '15 at 08:41
  • 1
    @PeterCordes Running `lscpu` on that platform and CPU combination would never show 1600MHz unless the server were forced into `HP Static Low Power Mode`. I suggested resetting the BIOS to default because: 1) the defaults on that platform are sane and would revert to "Dynamic Power Savings Mode", 2) the old age of the servers, and 3) the unknown origin of the equipment. – ewwhite Nov 29 '15 at 12:26
  • @ewwhite: ah ok, if you're sure about that. On my SnB desktop, I guess I have a newer version of lscpu, because it shows current (1659MHz), as well as min and max MHz (1600 and 3800). Being pinned at min frequency could help explain some of the factor of 3 perf difference between the core2 and Haswell CPUs. Microarch improvements aren't sufficient to explain a factor of 3. I think he's comparing four 4-CPU VMs against his one 16-core clunker, so memory bandwidth is probably way different, though. – Peter Cordes Nov 29 '15 at 12:33
  • 1
    @PeterCordes This is related to a [bug in VirtualBox](https://www.virtualbox.org/wiki/Changelog). Regarding recompilation, if he's using numpy for instance or an auto-tuning BLAS library, it wouldn't need recompilation. – Iwillnotexist Idonotexist Nov 29 '15 at 17:16