Multiple CPUs, multithreaded performance

0

My program is entirely CPU- and RAM-bound: it reads its data from the HDD into RAM at startup, then performs mathematical calculations. There is no communication between threads, and all threads take (almost) the same amount of time.

Question:

If my program uses as many threads as the CPUs provide, what kind of performance can I expect from a two-CPU system?

Say I use two 8-core Xeons, each with 16 hardware threads, so 16 x 2 = 32 threads total. If my program uses 32 threads, all at 100% usage, will I get double the performance over a single such CPU?
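For concreteness, a minimal sketch (my own illustration, not code from the question) of how a program can ask the C++ standard library how many hardware threads the machine offers; on the box described (2 sockets x 8 cores x 2 hyper-threads) this would report 32:

    #include <cstdio>
    #include <thread>

    int main() {
        // May report 0 if the library cannot determine the count.
        unsigned hw = std::thread::hardware_concurrency();
        std::printf("hardware threads available: %u\n", hw);
    }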

ShadowHero

Posted 2013-12-05T02:47:43.360

Reputation: 957

You're going to have to be more specific about what you're attempting to parallelize if you want an answer to this question. – Ryan C. Thompson – 2013-12-05T03:03:47.267

More details added. – ShadowHero – 2013-12-05T03:07:31.313

With shared memory you're never going to get 100% out of the second, third, fourth... CPU, except in very rare circumstances. Early dual-processor systems were lucky to get 50% added throughput from a second CPU (though that has improved as cache designs have gotten better). You do have an "advantage" in this case in that the throughput has already been "discounted" for multiple cores, so the "hit" might not be too bad. However, the 8-core design is likely "tuned" to make full use of memory bandwidth, so the system could "choke" with two units. – Daniel R Hicks – 2013-12-16T12:46:26.427

Answers

3

Really, the best answer anyone can give is "probably not, but it depends". You have twice the raw CPU horsepower available, but:

  1. You won't really have twice the usable memory bandwidth.

  2. It will take time to "ping pong" some cache lines between the two CPUs (see the false-sharing sketch after this list).

  3. Sometimes one thread will have to wait for another, and the more threads you have, the more that happens.

  4. Sometimes, even though you have a lot of work to do, you can't do it all at once.

And so on.
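To make point 2 concrete, here is a rough, self-contained sketch (my own illustration, not the asker's code) of cache-line "ping pong" via false sharing: two threads increment completely independent counters, once packed into the same cache line and once padded onto separate lines. On a two-socket machine the packed version is typically much slower, even though the threads never touch each other's data; the exact numbers depend entirely on the hardware.

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    struct Packed {                       // both counters share one 64-byte cache line
        std::atomic<long> a{0};
        std::atomic<long> b{0};
    };
    struct Padded {                       // each counter gets its own cache line
        alignas(64) std::atomic<long> a{0};
        alignas(64) std::atomic<long> b{0};
    };

    template <class Counters>
    double run() {
        Counters c;
        auto work = [](std::atomic<long>& x) {
            for (long i = 0; i < 50'000'000; ++i)
                x.fetch_add(1, std::memory_order_relaxed);
        };
        auto t0 = std::chrono::steady_clock::now();
        std::thread t1(work, std::ref(c.a));
        std::thread t2(work, std::ref(c.b));
        t1.join(); t2.join();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        std::printf("shared cache line: %.2f s\n", run<Packed>());
        std::printf("padded counters:   %.2f s\n", run<Padded>());
    }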

In very rare cases, you can actually get more than twice the performance. If an operation is cache-limited, having more cores may mean a thread can run uninterrupted for longer (because its core has less other work to do, since the other cores are handling it), allowing the CPU caches to stay hot for longer.

David Schwartz

Posted 2013-12-05T02:47:43.360

Reputation: 58 310

So, given the updated info, what's your guess regarding the performance improvement? – ShadowHero – 2013-12-05T03:14:03.057

Given the updated info, I'd say most likely yes, though you may have to make some adjustments to your program to avoid things like false sharing or implicit synchronization due to things like changing the process memory map. It may "just work", but it may take someone who understands these issues to "tweak" the code to make it work. – David Schwartz – 2013-12-05T03:18:28.807
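As an illustration of the kind of "tweaks" being described (a sketch under my own assumptions; the function names and workload are hypothetical), the usual pattern is to give each thread a cache-line-padded result slot so nearby writes don't falsely share a line, and to allocate all working memory before the threads start so the hot loop never touches the allocator or changes the process memory map:

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    // One cache line (64 bytes on typical x86) per thread's result.
    struct alignas(64) Slot { double value = 0.0; };

    // Hypothetical worker: reads its own input slice, writes its own scratch
    // slice and its own padded slot; never allocates inside the loop.
    void compute(const double* in, std::size_t n, double* scratch, Slot& out) {
        double acc = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            scratch[i] = in[i] * in[i];   // placeholder "math"
            acc += scratch[i];
        }
        out.value = acc;
    }

    int main() {
        const unsigned threads = std::max(1u, std::thread::hardware_concurrency());
        std::vector<double> input(std::size_t{1} << 24, 2.0);
        const std::size_t chunk = input.size() / threads;  // remainder ignored for brevity

        std::vector<double> scratch(input.size());   // allocated once, up front
        std::vector<Slot>   results(threads);        // padded per-thread results
        std::vector<std::thread> pool;

        for (unsigned t = 0; t < threads; ++t)
            pool.emplace_back(compute, input.data() + t * chunk, chunk,
                              scratch.data() + t * chunk, std::ref(results[t]));
        for (auto& th : pool) th.join();
    }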

0

The ultimate answer to performance questions is don't guess it, test it!
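In that spirit, one way to "test it" on hardware you already have (a sketch; work() here is just a stand-in for the real calculation) is to run a fixed total workload at 1, 2, 4, ... threads and watch how close the speedup stays to linear. If it already tails off well before the core count of one CPU, a second CPU is unlikely to double anything:

    #include <algorithm>
    #include <chrono>
    #include <cmath>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Stand-in for the real calculation.
    double work(long iterations) {
        double x = 0.0;
        for (long i = 1; i <= iterations; ++i) x += std::sqrt(static_cast<double>(i));
        return x;
    }

    int main() {
        const long total = 400'000'000;   // fixed total work, split among the threads
        const unsigned max_threads = std::max(1u, std::thread::hardware_concurrency());

        for (unsigned n = 1; n <= max_threads; n *= 2) {   // powers of two for brevity
            auto t0 = std::chrono::steady_clock::now();
            std::vector<std::thread> pool;
            for (unsigned t = 0; t < n; ++t)
                pool.emplace_back([=] { volatile double r = work(total / n); (void)r; });
            for (auto& th : pool) th.join();
            double seconds =
                std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
            std::printf("%2u threads: %.2f s\n", n, seconds);
        }
    }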

jwalker

Posted 2013-12-05T02:47:43.360

Reputation: 111

Cool, but how do I test it if I don't have the system? – ShadowHero – 2013-12-05T03:16:06.503

There are any number of ways you can "borrow" a system and change its configuration. Amazon's EC2 service is the most obvious example. – David Schwartz – 2013-12-05T03:22:38.433

@ShadowHero A more practical question would be: what if you get the system based on a guess and find out there is no gain? – jwalker – 2013-12-05T03:24:39.590

The bad thing is, those Xeons are expensive... :/ – ShadowHero – 2013-12-05T03:32:57.193

@ShadowHero That's exactly why you shouldn't guess. – jwalker – 2013-12-05T04:57:53.263

@ShadowHero BTW, did you consider CUDA? – jwalker – 2013-12-05T05:01:18.000

No, is it "faster"? – ShadowHero – 2013-12-05T07:57:49.840

@ShadowHero "For data parallel applications, accelerations of more than two orders of mangitude have been seen." – jwalker – 2013-12-05T09:31:04.710

@ShadowHero Sorry, I haven't used any of the GPU APIs. You should ask this question on Stack Overflow, not Super User. – jwalker – 2013-12-05T09:59:24.360

0

It sounds like you're working on an embarrassingly parallel computing task, in which case the answer is yes: your throughput will scale nearly linearly with the total number of CPU threads used.
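For reference, this is roughly the shape of an embarrassingly parallel job (a sketch built on assumptions about the workload, not the asker's actual code): the data is split into independent chunks, each chunk goes to its own task, and partial results are combined only at the very end. That is the pattern that comes closest to linear scaling.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <future>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<double> data(std::size_t{1} << 26, 3.0);   // ~512 MB of inputs
        const unsigned tasks = std::max(1u, std::thread::hardware_concurrency());

        // One task per hardware thread, each owning a disjoint slice of the data.
        std::vector<std::future<double>> parts;
        for (unsigned t = 0; t < tasks; ++t) {
            std::size_t begin = data.size() * t / tasks;
            std::size_t end   = data.size() * (t + 1) / tasks;
            parts.push_back(std::async(std::launch::async, [&data, begin, end] {
                double local = 0.0;
                for (std::size_t i = begin; i < end; ++i) local += std::cbrt(data[i]);
                return local;             // each task returns only its partial result
            }));
        }

        // The only "communication": combining the partial results at the end.
        double total = 0.0;
        for (auto& p : parts) total += p.get();
        (void)total;
    }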

Ryan C. Thompson

Posted 2013-12-05T02:47:43.360

Reputation: 10 085