How to mitigate the multi-core "performance penalty"

I've been doing some computations at home and at work and have noticed some unexpected performance issues. The work machine is quite a bit more "serious" than the home machine, but sometimes the home machine outperforms the work machine. I'm curious what the dynamic is -- why this should be, and if I can tweak it at all.

Ultimately, on both machines the computations are just a lot of very large arbitrary-precision integer linear-algebra computations (based on the GNU Multiple Precision (GMP) library): reducing many sparse but "large" integer matrices, finding the vertices on the boundary of high-dimensional polyhedra, and so on.
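
A minimal sketch of the kind of arithmetic GMP provides, using its C++ wrapper -- purely illustrative, not the actual matrix code:

    // Illustrative only: arbitrary-precision integers via GMP's C++ interface.
    // Build with: g++ demo.cpp -lgmpxx -lgmp
    #include <gmpxx.h>
    #include <iostream>

    int main() {
        mpz_class f = 1;
        for (int i = 2; i <= 100; ++i)
            f *= i;                   // 100! has 158 decimal digits; no overflow
        std::cout << f << std::endl;  // the exact value, printed in full
        return 0;
    }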

On my home computer (which has two cores), if I run a standard computation on only one core (the 2nd core near-idle), it takes 282s.

On the home computer running two identical parallel computations (the same computation as above), it takes approx 320s on each core.

On my office computer, with all cores essentially idle except for one core running this computation, it takes 196s.

On my office computer, if I have all 8 cores running full-out, and one of the cores is doing the computation above, it takes 356s on the one core that's running the computation.

Here are the details on my home computer:

 cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 15
model name  : Intel(R) Core(TM)2 Duo CPU     T7700  @ 2.40GHz
stepping    : 10
cpu MHz     : 800.000
cache size  : 4096 KB
physical id : 0
siblings    : 2
core id     : 0
cpu cores   : 2
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 10
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm ida tpr_shadow vnmi flexpriority
bogomips    : 4787.65
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model       : 15
model name  : Intel(R) Core(TM)2 Duo CPU     T7700  @ 2.40GHz
stepping    : 10
cpu MHz     : 800.000
cache size  : 4096 KB
physical id : 0
siblings    : 2
core id     : 1
cpu cores   : 2
apicid      : 1
initial apicid  : 1
fpu     : yes
fpu_exception   : yes
cpuid level : 10
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm ida tpr_shadow vnmi flexpriority
bogomips    : 4787.98
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

rybu@rybu-laptop:~/prog/regina/exercise/4M-census/rank1/t$ 

and my office computer:

cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 26
model name  : Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz
stepping    : 5
cpu MHz     : 1600.000
cache size  : 8192 KB
physical id : 0
siblings    : 8
core id     : 0
cpu cores   : 4
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida tpr_shadow vnmi flexpriority ept vpid
bogomips    : 6147.45
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model       : 26
model name  : Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz
stepping    : 5
cpu MHz     : 1600.000
ca

Ryan Budney

Posted 2010-08-20T23:58:54.390

Reputation: 156

Your work machine is only a 4-core machine. It appears as an 8-core machine because you have hyperthreading turned on. – None – 2011-04-23T00:57:06.117

Answers

Gridengine doesn't do what you want: it's meant for distributing work across processors that don't share a memory bus, whereas the cores in your SMP machines do share one.

You have the right idea about trying to keep your task's memory in cache. You do that by having all the memory accesses as close to each other as possible, for both instructions and data. Don't jump around from one area to another.
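
As a concrete illustration of what "close together" buys you (a minimal sketch, not the poster's code): summing a contiguous std::vector walks consecutive cache lines, while summing a std::list chases separately-allocated nodes scattered around the heap, so far more of its accesses miss the cache.

    // Illustrative only -- contrasts contiguous vs. pointer-chasing access patterns.
    #include <list>
    #include <vector>

    long sum_vector(const std::vector<long>& v) {
        // Elements are adjacent in memory: each cache line fetched holds several
        // useful values, and the hardware prefetcher can stay ahead of the loop.
        long s = 0;
        for (std::vector<long>::const_iterator it = v.begin(); it != v.end(); ++it)
            s += *it;
        return s;
    }

    long sum_list(const std::list<long>& l) {
        // Each node was allocated separately, so successive elements may sit on
        // different cache lines (or pages); many of these reads miss the cache.
        long s = 0;
        for (std::list<long>::const_iterator it = l.begin(); it != l.end(); ++it)
            s += *it;
        return s;
    }

    int main() {
        std::vector<long> v(1000000);
        for (long i = 0; i < static_cast<long>(v.size()); ++i)
            v[i] = i;
        std::list<long> l(v.begin(), v.end());
        return sum_vector(v) == sum_list(l) ? 0 : 1;
    }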

SMP systems are really optimized for throughput rather than latency. If latency is really that important to you, you are best off dividing the work into one task per processor.

Karl Bielefeldt

Posted 2010-08-20T23:58:54.390

Reputation: 1 050

Thanks, I'll see what I can do about that. It seems like the active code is in the range of about 3 MB, and the processor cache is about 8 MB. The problem is there's a std::list being built in memory which occupies over 100 MB of RAM (populated only via std::list::push_back calls). I wonder if there's some voodoo I can do to sync between threads when they touch that list. – Ryan Budney – 2010-08-21T03:21:49.263
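
The straightforward way to let several threads push into one shared std::list safely is to put a mutex around it. A minimal sketch with POSIX pthreads follows; the element type, thread count, and work loop are placeholders, not the real program:

    // Illustrative sketch: serialize access to a shared std::list with a mutex.
    // Build with: g++ demo.cpp -lpthread
    #include <pthread.h>
    #include <list>

    static std::list<long> results;                               // shared result list
    static pthread_mutex_t results_mutex = PTHREAD_MUTEX_INITIALIZER;

    static void* worker(void* arg) {
        long id = reinterpret_cast<long>(arg);
        for (long i = 0; i < 1000; ++i) {
            long value = id * 1000 + i;          // stand-in for the real computation
            pthread_mutex_lock(&results_mutex);  // only one thread touches the list at a time
            results.push_back(value);
            pthread_mutex_unlock(&results_mutex);
        }
        return 0;
    }

    int main() {
        pthread_t threads[4];
        for (long t = 0; t < 4; ++t)
            pthread_create(&threads[t], 0, worker, reinterpret_cast<void*>(t));
        for (int t = 0; t < 4; ++t)
            pthread_join(threads[t], 0);
        return results.size() == 4000 ? 0 : 1;
    }

The catch is that locking on every push_back makes the threads queue up on that one mutex; if the list only needs to be assembled at the end anyway, giving each thread its own private list and splicing them together after the joins avoids the contention entirely (a sketch of that structure appears after the last comment below).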

It also helps to pin the process to a core. You lose performance when it is moved to a different core after an interrupt. – BillThor – 2010-08-21T13:25:00.327

Is there a way to send a request to Linux for one processor to "take ownership" of a thread, i.e. not to toss it from core to core? My 8-core system, when running eight 100% tasks, seems to keep to that philosophy, but it juggles things around every once in a while. I don't have an estimate of how often; not very often, as I've never actually witnessed it happen. – Ryan Budney – 2010-08-22T04:52:56.410

Upgraded the office computer to 24 GB of RAM. Strangely, it seems like there's an increase in performance. Thanks for your comments. – Ryan Budney – 2010-08-24T09:19:48.250

See http://www.cyberciti.biz/tips/setting-processor-affinity-certain-task-or-process.html for how to lock a process to a core.

Some RAM is faster than others. That's probably what you were seeing.

– Karl Bielefeldt – 2010-08-24T18:38:45.287
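
The linked page covers pinning from the shell with taskset ("taskset -c 0 ./program" launches a program pinned to core 0, and "taskset -cp 0 <pid>" re-pins an already-running process). The same thing can be done from inside the program with the Linux-specific sched_setaffinity call, or pthread_setaffinity_np for an individual thread. A minimal sketch -- the core numbers and wrapper-function names are just for illustration:

    // Linux-specific sketch: pin the calling process, or one pthread, to a core
    // so the scheduler stops migrating it. Needs _GNU_SOURCE (g++ defines it by
    // default; otherwise compile with -D_GNU_SOURCE). Build with: g++ demo.cpp -lpthread
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    int pin_process_to_core(int core) {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(core, &mask);
        // pid 0 means "the calling process".
        return sched_setaffinity(0, sizeof(mask), &mask);
    }

    int pin_thread_to_core(pthread_t thread, int core) {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(core, &mask);
        // Same idea per thread, for the pthread-based version of the program.
        return pthread_setaffinity_np(thread, sizeof(mask), &mask);
    }

    int main() {
        if (pin_process_to_core(0) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        // ... run the long computation here; it stays on core 0 ...
        return 0;
    }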

I've recently revised the program so that there's a single task spawning many threads (using the POSIX pthread library), instead of several identical applications running side by side. This also appears to help, as there's far less code overhead. It makes the organization of the tasks far simpler too -- though the coding is a bit more of a headache. – Ryan Budney – 2010-08-31T01:38:58.923
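
A minimal sketch of that one-process, many-pthreads structure (the work split and the squaring loop are made up for illustration): each worker fills a private std::list, and the lists are spliced together after the joins, so the hot loop needs no locking at all.

    // Illustrative sketch: one process, N worker pthreads, private result lists
    // merged at the end. Build with: g++ demo.cpp -lpthread
    #include <pthread.h>
    #include <list>

    struct Job {
        long begin, end;        // half-open range of work items for this thread
        std::list<long> out;    // private results, merged after the join
    };

    static void* worker(void* arg) {
        Job* job = static_cast<Job*>(arg);
        for (long i = job->begin; i < job->end; ++i)
            job->out.push_back(i * i);   // stand-in for the real computation
        return 0;
    }

    int main() {
        const int nthreads = 8;
        Job jobs[nthreads];
        pthread_t threads[nthreads];

        for (int t = 0; t < nthreads; ++t) {
            jobs[t].begin = t * 1000L;
            jobs[t].end = (t + 1) * 1000L;
            pthread_create(&threads[t], 0, worker, &jobs[t]);
        }

        std::list<long> all;
        for (int t = 0; t < nthreads; ++t) {
            pthread_join(threads[t], 0);
            all.splice(all.end(), jobs[t].out);  // constant-time merge, no copying
        }
        return all.size() == 8000 ? 0 : 1;
    }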