Same Code, Different Computers - Counterintuitive and HUGE Differences in Performance


TL;DR: I'm doing some performance computing, and I've found that the ostensibly 'weaker' machine is outperforming the 'stronger' machine by orders of magnitude. Why?

I wrote some C code for a project. It involves 10,000 iterations of a lengthy process that generates pseudorandom data, and after each iteration, writes the data to a file. I used #pragma omp parallel for to multi-thread the task.

I can run my program on two machines: let's call them s and d. Here are the relevant specs (please ask for any other specs that might matter):

  • s: Linux Mint 15, Samsung 840 EVO SSD, 8 GB RAM, quad-core Intel i3 CPU @ 2.40 GHz
  • d: Linux Mint 16, Intel SSD, 8 GB RAM, eight-core AMD FX-8320 CPU @ 3.5 GHz

Here's the big surprise: s completes the task an order of magnitude faster than d. I've run the program a few times on both machines, and s completes the task in about 3-4 minutes, whereas d takes anywhere from 12 to upwards of 30 minutes (I lost track of time). Both of them fully exhaust their cores (i.e. all cores at 100%) while computing. This holds even with auxiliary programs (Firefox, etc.) open on s and nothing else running on d.

But the code is the same. The compiler flags are the same. Even the output is the same. I even removed the drives from both s and d, swapped them, and ran the program again, just to test that it wasn't in some way operating-system related. The phenomenon persisted: the quad-core 2.4 GHz CPU vastly outperformed the eight-core 3.5 GHz CPU.

This is, of course, really puzzling and totally counterintuitive. Can anyone tell me what's going on?

Newb

Posted 2014-02-18T03:41:49.133

Reputation: 279

Get a profiler and measure. – Dour High Arch – 2014-02-18T03:52:21.227
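Short of a full profiler, a first measurement could be as simple as timing each phase separately on both machines. The following is a minimal sketch in C, assuming a C11 library that provides timespec_get; now_seconds, time_phase, and busy_work are hypothetical names, and busy_work is only a stand-in for one phase of the real program (for example, generating one iteration's data).

```c
#include <time.h>

/* Wall-clock time in seconds, read with C11 timespec_get. */
double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Hypothetical stand-in for one phase of the real program.
   The argument is unused; the volatile sink keeps the loop
   from being optimized away. */
void busy_work(void *unused)
{
    volatile double sink = 0.0;
    (void)unused;
    for (int i = 0; i < 1000000; i++)
        sink += (double)i * 0.5;
}

/* Run fn(arg) once and return the elapsed wall-clock seconds. */
double time_phase(void (*fn)(void *), void *arg)
{
    double start = now_seconds();
    fn(arg);
    return now_seconds() - start;
}
```

Wrapping the data generation and the file writes in separate time_phase calls, and comparing the totals on s and d, would show which phase accounts for the gap before reaching for heavier tools.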

Amdahl's law and that different systems will do the sequential fraction at a different speed in addition to the parallel portion. – Brian – 2014-02-18T03:52:26.603
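To make the Amdahl's law point concrete: if a fraction p of the work can run in parallel on n threads, the best-case speedup is 1 / ((1 - p) + p / n). Below is a minimal sketch in C; the function name amdahl_speedup is an assumption for illustration, and any fractions plugged into it are hypothetical, not measurements from either machine.

```c
/* Best-case speedup under Amdahl's law.
   parallel_fraction: share of the runtime that can be parallelized (0.0 to 1.0).
   threads: number of threads working on the parallel share (>= 1). */
double amdahl_speedup(double parallel_fraction, int threads)
{
    double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / threads);
}
```

For example, amdahl_speedup(0.5, 8) is about 1.78, so if half the runtime were serial (say, file writes), eight cores would buy less than a 2x speedup and the speed of the serial part would dominate.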

Are you sure the Intel processor is an i3 and four cores? Even for 4th gen. i3, the ark only shows 2-core versions; perhaps you meant Xeon E3? (If it is only 4 threads, this would make the comparison even more skewed.) A 4x performance difference does seem odd. SPEC CPU2006 FP Rate results for higher performance but "similar" systems (AMD FX-8150 vs. Intel Xeon E3-1220--34% Intel advantage) seem to imply that the Intel system should be roughly "only" 10-15% faster.

– Paul A. Clayton – 2014-02-18T15:02:48.267

Also, are you sure that they are executing the same code? A portable binary might include multiple code paths to support different systems and the selection might be suboptimal for the AMD system (the Intel C compiler used to have this kind of issue). – Paul A. Clayton – 2014-02-18T15:22:45.133

@PaulA.Clayton According to my system diagnostics, the Intel processor has four cores. Here's a screenshot: http://imgur.com/fYKceHe You have a good point with regard to the execution of code: perhaps the selection is suboptimal for the AMD processor. What can I do about this? How can I test this? Is there a different compiler I can use? (I'm using GCC at the moment.)

– Newb – 2014-02-18T19:33:31.747

That processor has 2 cores but 4 virtual processors (a.k.a. threads). That makes the performance difference look even more bizarre. For gcc you might try "-march=bdver2" for the AMD system, but a roughly 6x difference from expectations seems odd (~4x performance difference, ~1.35x μarch difference, ~2x core count). "-march=native" can be a convenient option for system-specific binaries where target=local; "-mtune=" provides more broadly compatible but specifically tuned binaries. – Paul A. Clayton – 2014-02-18T21:18:31.627
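One way to check what a given -march setting actually changes is to look at the instruction-set macros GCC predefines for the build. The sketch below assumes GCC (or a compatible compiler) on x86; report_isa_macros is a hypothetical helper name, and the list of macros is a small sample rather than a complete inventory.

```c
#include <stdio.h>

/* Print each ISA-extension macro the compiler defined for this build
   and return how many of the checked macros were found. */
int report_isa_macros(void)
{
    int enabled = 0;
#ifdef __SSE2__
    puts("SSE2");   enabled++;
#endif
#ifdef __SSE4_2__
    puts("SSE4.2"); enabled++;
#endif
#ifdef __AVX__
    puts("AVX");    enabled++;
#endif
#ifdef __AVX2__
    puts("AVX2");   enabled++;
#endif
#ifdef __FMA__
    puts("FMA");    enabled++;
#endif
#ifdef __FMA4__
    puts("FMA4");   enabled++;
#endif
#ifdef __XOP__
    puts("XOP");    enabled++;
#endif
    return enabled;
}
```

Calling report_isa_macros() from main in a build made with plain "gcc -O2" and again in one made with "gcc -O2 -march=native" (or "-march=bdver2" on the AMD system) should show whether the compiler was allowed to use the extra extensions at all. It does not show whether the hot loop benefits from them; that still needs measuring.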

Answers


What you're talking about is the Megahertz Myth: a bigger number doesn't always mean better, because actual computational speed depends on architecture and design factors. Here's a nice webpage on the issue.

user270595

Posted 2014-02-18T03:41:49.133

Reputation:

The AMD processor is also a design somewhere between simultaneous multithreading and traditional multicore. Two "cores" form a module that shares a front end (and FP/SIMD functionality) but has separate integer execution units and L1 data caches. I.e., there is also a "core myth" effect. – Paul A. Clayton – 2014-02-18T13:57:17.913