Why is a single thread spread across CPUs?

24

3

I'm just curious why the scheduler constantly moves an app between CPUs, rather than keeping it on one. It looks a bit silly to have 4 cores at 25% rather than one at 100%.

Does it have to do with heat, or is it more efficient somehow? Do other OSes do it differently?

Insights or links to in-depth stuff would be nice. (Couldn't find much myself.)

Update:

By "spread out" I don't mean that it executes on several CPUs at once, but that it is being moved from one to the other several times per second, making the effect that it looks spread out.

Macke

Posted 2009-08-20T07:04:10.847

Reputation: 963

It's a goomba. SMB, not LBP. :) – Macke – 2017-04-18T08:22:55.070

In my "answer", I showed a single threaded program behaving exactly as you describe, i.e. "being moved from one to the other several times per second, making the effect that it looks spread out." – Evan Rosica – 2017-11-24T21:16:30.320

Even when "nothing else is executing", there are always system threads competing for CPU. For example, the OS has a thread to zero out reclaimed memory pages so that when memory is required, it'll have some pages ready to go. When your thread goes to execute again, the CPU you were on may be in use by one of these threads. What should the OS do? Wait for it or move you to a new CPU? Whatever it does, you end up with undesirable behavior in some cases. – Tony Lee – 2009-08-20T13:22:11.813

Answers

8

I think wierob has described the point fairly well.
Here is an older article discussing processor affinity settings with a quad-core QX6800.
(the link points to the second page of that article).

If you do not force process affinity to a core, do you lose performance?

  • While the Windows scheduler needs to take such affinity into account to avoid cache thrashing,
    the processor design itself also considers such things.
  • The Intel QX6800 quad-core (since I referred to it earlier in this answer)
    has 8 MB of L2 cache in total: 2 × 4 MB, each half shared by one pair of cores.

It should be noted that while you may have chosen to run just this one single-threaded process on the system, the OS itself would have several other tasks running which also need to be scheduled. The scheduler balances all this activity across the available processor pool (or cores).


Going forward, with the Nehalem architecture and NUMA,
processors across multiple sockets will also be able to better address access thrash.
Here is a quick picture from an ArsTechnica page on NUMA.

[Image: NUMA diagram from the Ars Technica article]

If Nehalem and i7 interest you, I have some more links at this answer.

nik

Posted 2009-08-20T07:04:10.847

Reputation: 50 788

What makes you think that "Going forward, with the Nehalem architecture and NUMA, processors across multiple sockets will also be able to better address access thrash."? As I see it, NUMA makes memory even more local and tied to a particular processor, therefore worsening the effects of thrashing. – Roland Pihlakas – 2017-01-15T22:14:56.833

@RolandPihlakas, been a while since this answer, but looking at the arstechnica article and these points I think I was accounting for the ability of new platforms to have better memory connectivity and the software to take advantage of that (over not having that option with multiple socket configurations at that time; i.e. before Nehalem). – nik – 2017-01-17T07:45:04.580

6

The scheduler just executes the next thread that is ready for execution on a "free" core/CPU.

You can assign a process to a specific CPU via the Windows task manager.
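The same pinning can also be requested from code. What follows is only a minimal sketch, not what Task Manager does internally: it assumes a platform where Python exposes `os.sched_setaffinity` (Linux and some other Unix systems; the call is not available on Windows, where Task Manager or the Win32 affinity APIs would be used instead). The helper name `pin_to_cpu` is made up for this illustration.

```python
import os


def pin_to_cpu(cpu_index):
    """Restrict the calling process to one logical CPU, if the platform allows it.

    Returns the resulting set of allowed CPUs, or None when the affinity
    calls are missing (for example on Windows).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)  # pid 0 means "the calling process"
    if cpu_index not in allowed:
        raise ValueError(f"CPU {cpu_index} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {cpu_index})
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        # Pick the lowest-numbered CPU this process is currently allowed to use.
        first = min(os.sched_getaffinity(0))
        print("now restricted to:", pin_to_cpu(first))
    else:
        print("affinity calls are not available on this platform")
```

Once pinned, the scheduler may still preempt the process, but it should only resume it on the chosen CPU.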

Having 4 cores at 25% means that 4 threads are executed simultaneously. Whereas, one core at x% means that only one thread is executed. So the former is more efficient in some cases.

But during its execution the cache of the CPU is filled with data accessed by the thread. So if the thread gets executed on another CPU, it will experience more cache misses, which are costly, since the data is not in the cache of this CPU.
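To get a feel for that cost, here is a toy model, not a measurement of any real CPU. It assumes a small LRU cache and, pessimistically, that the destination core holds none of the thread's data after every migration (modelled by clearing the cache). The function name, cache size, and access pattern are all invented for the illustration.

```python
from collections import OrderedDict


def count_misses(accesses, migrate_every=None, cache_lines=64):
    """Count cache misses for one thread in a toy single-cache model.

    The thread "migrates" after every `migrate_every` accesses (None means
    it never migrates). A migration is modelled by emptying the cache,
    i.e. the new core is assumed to start completely cold.
    """
    cache = OrderedDict()  # keys are addresses, ordered from least to most recently used
    misses = 0
    for i, addr in enumerate(accesses):
        if migrate_every is not None and i > 0 and i % migrate_every == 0:
            cache.clear()  # assumption: nothing useful is cached on the new core
        if addr in cache:
            cache.move_to_end(addr)  # hit: mark as most recently used
        else:
            misses += 1
            cache[addr] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


# A thread that keeps cycling over a 32-line working set, 10,000 accesses in total.
accesses = [i % 32 for i in range(10_000)]
pinned = count_misses(accesses)
migrating = count_misses(accesses, migrate_every=1000)
print("misses when pinned:   ", pinned)     # 32
print("misses when migrating:", migrating)  # 320
```

In this model the pinned thread only pays for warming the cache once, while the migrating thread pays again after every move. How much that matters in practice depends on the working set and on how often the move happens.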

What does your thread do? If the thread "sleeps" for a very short time, the core it was executed on before might be occupied by another thread, and thus your thread is executed on the next available core. What happens if you specify only one core to be used by your process (e.g. via Task Manager)?

wierob

Posted 2009-08-20T07:04:10.847

Reputation: 130

@PärBjörklund from my experience at least Windows XP does not. I think the "cache-bouncing" problem was fixed in Vista or later – Waxhead – 2016-03-29T23:28:26.930

"Having 4 cores at 25% means that 4 threads are executed simultaneously." No, it means one thread is executed, a bit on one core, then on another and so on. As Task Manager shows average use, it will show 25% for each core (on a 4-core system; on a two-core system it would show 50%). It means the core was fully utilized one quarter of the time and was idle the rest of the time. – David Balažic – 2016-12-04T17:08:26.117

AFAIK, the Windows scheduler does a pretty good job of keeping threads on the same CPU/core for their duration to avoid that issue. – Paxxi – 2009-08-20T07:36:28.100

@Pär: My thread seems to be executing on each core actually. – Macke – 2009-08-20T08:06:43.670

Yeah, it's probably the OS procs that bump my thread around. How to accept two answers? :) – Macke – 2009-08-20T18:58:24.463

0

The OS migrates the thread across CPU cores (quickly, several times per second). It is more efficient to run it on the same core all the time. This can be enforced by the "Set affinity" context menu item in Task Manager.

Note that usually (typical home use) the difference is in the range of a few percent.

The "4 cores each at 25% usage" means, since Task Manager shows average use, that each core was fully utilized one quarter of the time and idle the rest of the time.
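The averaging can be illustrated with a few lines of arithmetic. This is only a sketch under simple assumptions: time is cut into equal slices, the thread is busy in every slice, and a made-up round-robin pattern stands in for whatever the real scheduler does. The function name `per_core_utilization` is invented for the example.

```python
def per_core_utilization(timeline, cores):
    """Average utilization per core.

    `timeline` lists, for each equal-length time slice, the index of the
    core that ran the (always busy) thread during that slice.
    """
    busy = [0] * cores
    for core in timeline:
        busy[core] += 1
    return [count / len(timeline) for count in busy]


# One fully busy thread, rotated over 4 cores versus kept on core 0.
round_robin = [slice_no % 4 for slice_no in range(1000)]
pinned = [0] * 1000

print(per_core_utilization(round_robin, 4))  # [0.25, 0.25, 0.25, 0.25]
print(per_core_utilization(pinned, 4))       # [1.0, 0.0, 0.0, 0.0]
```

In both cases the total adds up to one core's worth of work; only the way the average is distributed across the cores differs.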

The description is for Windows, but it is similar on other operating systems too.

David Balažic

Posted 2009-08-20T07:04:10.847

Reputation: 1 242

0

It's not. One thread can only run on one processor. However, some processes have multiple threads, which can be spread out.

The reasoning, believe it or not, never considered what it looks like. The system tries to spread threads out because it has no way to know when one will spike.

tsilb

Posted 2009-08-20T07:04:10.847

Reputation: 2 492

See my added clarification. This is one thread, running at full throttle, that is quickly being moved around so that, over time, each core (out of four) is 25% busy. (All other processes/threads are negligible.) – Macke – 2009-08-20T07:45:17.490

-1

If anyone's still reading this, I've noticed this, too, and performed quite a few tests to see if it's not just a fluke. It turns out it's not! I believe spreading a single thread over all cores is more efficient for several reasons:

  1. Spreading one thread across all cores allows for a lower power consumption. Most processors lower their frequencies and, more importantly, voltage according to load, so a Core 2 Quad, for example, will consume a lot less power and produce less heat by spreading one thread across all 4 cores rather than using one core (which would lead to the voltage increasing across ALL cores, since there's only one voltage regulator* - that's pretty ineffective).
  2. It ensures that the thread always runs at maximum/constant speed. If the thread suddenly requests more processing power, one core could become overloaded and there will be a delay in the execution. By spreading it across cores, any sudden spike will be handled smoothly without lags and delays.

Also, because of the above two observations, I have come to believe that Turbo Boost and IDA are ineffective. They might be useful on older operating systems, but Linux and Windows 7 spread everything across all cores pretty efficiently. So, a Core 2 Quad Q9100 @ 2.26 GHz will almost always (there are always exceptions :-) be faster than a Core 2 Duo X9100 @ 3.06 GHz, and I've rarely seen it use IDA (basically the predecessor to Turbo Boost; it increases the frequency on one or two cores only, for single-threaded apps).

  * The Core 2 Quad has two clock domains thanks to the fact that there are two physical dies, so two cores can run at full frequency, while two are at the lowest frequency. I don't know whether there are two voltage regulators, though - I've noticed that the voltage is uniform across all 4 cores, so there must be only one regulator for the whole package.

JakL

Posted 2009-08-20T07:04:10.847

Reputation: 23

There is no such thing as "spreading a thread" (no, not even 5 years later). There is a single thread, executed on one core. And then later on another. And so on. At each moment one core is running at 100% and the others are idling. So there is no saving. Especially, as you mention, when all cores are at full voltage all the time anyway (as you said, they share voltage). Also, as already addressed, being on the same core ensures the thread gets all the processing power there is. As that core is already 100% used, the OS will schedule other threads to other, less utilized cores. – David Balažic – 2016-12-04T16:55:34.047

This sounds dubious for several reasons. Please provide references to your "facts". First, why does computing stuff at 25% on four cores consume less power than 100% on one? (I can agree that heat is more evenly spread out, but...) Also, the thread in my question is running at full tilt (100%), so it won't "request more processing power", because it's already doing as much as possible. – Macke – 2011-06-25T18:13:33.780

Well, that's just from my own observations - I was intrigued by IDA and Turbo Boost and decided to do some tests. It was quite a while ago, but I arrived at the above conclusions. The processor consumes less power, as all cores run at a lower voltage - a 0.1V reduction saves about 6-10 Watts in power consumption (if one core is loaded 100%, all cores run at a higher voltage, whether they're idling or not). This is especially true in Core2Duo with SLFM mode. You are right about the thread running at full tilt not requesting any more processor cycles, but there are apps that indeed do this. – JakL – 2011-06-26T19:12:10.313