Why do we have CPUs with all the cores at the same speeds and not combinations of different speeds?

79

In general, when buying a new computer you would choose a processor based on your expected workload. Performance in games tends to be determined by single-core speed, whereas applications like video editing are determined by the number of cores.

In terms of what is available on the market, all the CPUs seem to have roughly the same speed, with the main differences being more threads or more cores.

For example:

  • Intel Core i5-7600K, base frequency 3.80 GHz, 4 cores, 4 threads
  • Intel Core i7-7700K, base frequency 4.20 GHz, 4 cores, 8 threads
  • AMD Ryzen 5 1600X, base frequency 3.60 GHz, 6 cores, 12 threads
  • AMD Ryzen 7 1800X, base frequency 3.60 GHz, 8 cores, 16 threads

So why do we see this pattern of increasing cores with all cores having the same clock speed?

Why do we not have variants with differing clock speeds? For example, two 'big' cores and lots of small cores.

For example's sake: instead of, say, four cores at 4.0 GHz (i.e. 4 x 4.0 GHz ~ 16 GHz maximum), what about a CPU with two cores running at 4.0 GHz and four cores running at 2.0 GHz (i.e. 2 x 4.0 GHz + 4 x 2.0 GHz ~ 16 GHz maximum)? Wouldn't the second option be equally good at single-threaded workloads, but potentially better at multi-threaded workloads?

I ask this question as a general point - not specifically about the CPUs listed above, or about any one specific workload. I am just curious as to why the pattern is as it is.

Jamie

Posted 2017-06-24T13:25:56.087

Reputation: 949

15There are many mobiles with fast and slow cores, and on nearly all modern multi-core servers the CPU cores clock independently depending on the load; some even switch off cores when not used. On a general-purpose computer where you do not design for saving energy, however, having only two types of cores (CPU and GPU) just makes the platform more flexible. – eckes – 2017-06-24T13:29:30.547

5Before the thread scheduler could make an intelligent choice about which core to use, it would have to determine whether a process can take advantage of multiple cores. Doing that reliably would be highly problematic and prone to error, particularly when this can change dynamically according to the needs of the application. In many cases the scheduler would have to make a suboptimal choice when the best core was in use. Identical cores make things simpler, provide maximum flexibility, and generally have the best performance. – LMiller7 – 2017-06-24T15:34:49.447

33Clock speeds cannot reasonably be said to be additive in the manner you described. Having four cores running at 4 GHz does not mean you have a "total" of 16 GHz, nor does it mean that this 16 GHz could be partitioned up into 8 processors running at 2 GHz or 16 processors running at 1 GHz. – Bob Jarvis - Reinstate Monica – 2017-06-24T23:43:05.780

1In a similar manner - consider how dreadnought battleships, which had a uniform main battery, replaced the pre-dreadnought battleships, which had a main battery of the largest guns, an intermediate battery that was smaller, and an anti-torpedo-boat battery which was smaller still. – Bob Jarvis - Reinstate Monica – 2017-06-24T23:54:55.727

4 cores@4GHz doesn't mean that it's running at 16GHz. Parallel processing doesn't work that way. And AFAIK AMD has supported different clock speeds for different cores for a very long time – phuclv – 2017-06-25T03:06:41.840

16The premise of the question is simply wrong. Modern CPUs are perfectly capable of running cores at different speeds – phuclv – 2017-06-25T03:17:05.373

1Voted to reopen. Also, big.LITTLE designs in ARM SoCs are common, where the smaller cores are an entirely different design (sometimes different architecture), lower clocked and much more power efficient, while the big ones are used while the screen is on for apps in the foreground. – allquixotic – 2017-06-25T03:19:08.783

2See the discussions here: big.LITTLE x86: Why not?, Intel and the big.LITTLE concept – phuclv – 2017-06-25T03:46:10.747

@LưuVĩnhPhúc Of course the calculation doesn't work like that - if it did, the question would be comparing equals; that is literally the entire point of the question. The example is simply a means of comparison. CPUs being capable of running different cores at different speeds would apply to any combination of cores. Thanks for the links nonetheless. – Jamie – 2017-06-25T13:56:16.200

1Another point to make is that most modern CPUs from Intel and AMD can dynamically scale clock speed based on the task they're doing. My 4790K usually sits at around 2GHz when I'm just browsing the web, but then kicks up to 4GHz+ when I'm gaming. – SGR – 2017-06-26T09:21:19.230

@LưuVĩnhPhúc Intel has also been able to run cores at different clock speeds for a long time as well. – Baldrickk – 2017-06-26T10:19:31.413

@Baldrickk AMD is more blatant, especially with FX, and very especially with unlocked "latent" cores; these were locked for a reason and generally need to be hobbled. – mckenzm – 2017-06-29T02:37:47.587

@BobJarvis: 16 GHz can't exactly be partitioned into 8 processors of 2 GHz, of course, but can't it come pretty close - in contrast with the opposite direction? – user541686 – 2017-06-29T04:37:10.493

These days, people have such problems interpreting what Intel Core i5-7600K, base frequency 3.80 GHz, 4 cores, 4 threads means - can you imagine if you had a list of tech jargon about each individual core in the package? It would be marketing insanity, and everyone except for True Nerds would be confused. Intel has spent 30 years trying to make its chip designations accessible to consumers, which is why they (somewhat) recently moved to the i3/i5/i7 labeling, because otherwise people had no idea whether a particular processor was "fast" or "slow". – Christopher Schultz – 2017-06-30T13:32:46.237

Answers

85

This is known as heterogeneous multiprocessing (HMP) and is widely adopted by mobile devices. In ARM-based devices which implement big.LITTLE, the processor contains cores with different performance and power profiles, e.g. some cores run fast but draw lots of power (faster architecture and/or higher clocks) while others are energy-efficient but slow (slower architecture and/or lower clocks). This is useful because power usage tends to increase disproportionately as you increase performance once you get past a certain point. The idea here is to get performance when you need it and battery life when you don't.

On desktop platforms, power consumption is much less of an issue so this is not truly necessary. Most applications expect each core to have similar performance characteristics, and scheduling processes for HMP systems is much more complex than scheduling for traditional SMP systems. (Windows 10 technically has support for HMP, but it's mainly intended for mobile devices that use ARM big.LITTLE.)

Also, most desktop and laptop processors today are not thermally or electrically limited to the point where some cores need to run faster than others even for short bursts. We've basically hit a wall on how fast we can make individual cores, so replacing some cores with slower ones won't allow the remaining cores to run faster.

While there are a few desktop processors that have one or two cores capable of running faster than the others, this capability is currently limited to certain very high-end Intel processors (as Turbo Boost Max Technology 3.0) and only involves a slight gain in performance for those cores that can run faster.


While it is certainly possible to design a traditional x86 processor with both large, fast cores and smaller, slower cores to optimize for heavily-threaded workloads, this would add considerable complexity to the processor design and applications are unlikely to properly support it.

Take a hypothetical processor with two fast Kaby Lake (7th-generation Core) cores and eight slow Goldmont (Atom) cores. You'd have a total of 10 cores, and heavily-threaded workloads optimized for this kind of processor may see a gain in performance and efficiency over a normal quad-core Kaby Lake processor. However, the different types of cores have wildly different performance levels, and the slow cores don't even support some of the instructions the fast cores support, like AVX. (ARM avoids this issue by requiring both the big and LITTLE cores to support the same instructions.)

Again, most Windows-based multithreaded applications assume that every core has the same or nearly the same level of performance and can execute the same instructions, so this kind of asymmetry is likely to result in less-than-ideal performance, perhaps even crashes if an application uses instructions not supported by the slow cores. While Intel could modify the slow cores to add advanced instruction support so that all cores can execute all instructions, this would not resolve issues with software support for heterogeneous processors.
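To make the software-support problem concrete, here is a minimal C sketch of the runtime dispatch pattern applications use today (the kernel names are hypothetical; __builtin_cpu_supports is a GCC/Clang builtin). The catch on an asymmetric chip is that the check runs once, on whichever core the thread happens to occupy, so a thread later migrated to a weaker core could still fault on an unsupported instruction:

    #include <stdio.h>

    /* Hypothetical kernels: an AVX-optimized build and a plain-x86 fallback. */
    void sum_avx(const float *a, int n);
    void sum_scalar(const float *a, int n);

    int main(void) {
        /* Queries the CPU we are running on right now. Safe when every core
           supports the same instructions; unsafe on the mixed chip above. */
        if (__builtin_cpu_supports("avx"))
            printf("would dispatch to sum_avx\n");
        else
            printf("would dispatch to sum_scalar\n");
        return 0;
    }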

A different approach to application design, closer to what you're probably thinking about in your question, would use the GPU for acceleration of highly parallel portions of applications. This can be done using APIs like OpenCL and CUDA. As for a single-chip solution, AMD promotes hardware support for GPU acceleration in its APUs, which combine a traditional CPU and a high-performance integrated GPU onto the same chip, as Heterogeneous System Architecture, though this has not seen much industry uptake outside of a few specialized applications.

bwDraco

Posted 2017-06-24T13:25:56.087

Reputation: 41 701

1Windows already has a notion of 'Apps', 'Background Processes' and 'Windows Processes'. So this doesn't extend to a hardware level? – Jamie – 2017-06-25T12:45:37.607

2@Jamie A "background" process gets smaller time slices and is more likely to be interrupted. Windows 10 does, to some extent, account for HMP systems, though there isn't much information yet on how. – Bob – 2017-06-26T05:36:25.597

So I think that after the edit @bwDraco has pretty much answered it for me. If there was a 'mixed' processor it could easily support the same instruction set if it was built that way, so then we would need some sort of scheduler to pick the right core. I'm thinking that really the applications which benefit from going to lots of small cores would probably benefit even more from going to lots and lots of really small cores. Thus we have GPU acceleration. – Jamie – 2017-06-26T21:57:48.077

3Note that the GPU case isn't trading 2 big cores for 10 small and slow cores, but rather the (very rough) equivalent of trading 2 big cores for 1024 small and slow cores. Massively parallel, not just a little bit more parallel. – Yakk – 2017-06-27T15:36:57.077

This question is about CPUs, but I think for the implied question, it's important to note that computers actually already kind of do this across the motherboard. While it doesn't make sense to run the CPU at different speeds rather than just the fastest available, different chips and buses on the motherboard already run at slower clock speeds, designed for a trade-off between cost of materials and development vs performance. – JFA – 2017-06-27T18:10:06.633

4Intel could probably get a Goldmont core to run AVX2 instructions without much extra silicon (slowly, by decoding to pairs of 128b ops). Knight's Landing (Xeon Phi) has Silvermont-based cores with AVX512, so it's not like it's impossible to modify Silvermont. But KNL adds out-of-order execution for vector instructions, while normal Silver/Goldmont only does OOO for integer, so they'd probably want to design it closer to Goldmont than KNL. Anyway, insn sets are not a real problem. It's OS support and small benefit that are the real obstacles to spending die-area on a low-power core. – Peter Cordes – 2017-06-28T20:30:02.793

If I look at the individual core speeds I can see some cores run faster than others, but the max speed is the same for all cores. – Suici Doga – 2017-07-15T07:01:01.977

68

What you're asking is why current systems use symmetric multiprocessing (SMP) rather than asymmetric multiprocessing (AMP).

Asymmetric multiprocessing was used in the old days, when a computer was enormous and housed in several units.

Modern CPUs are cast as one unit, on one die, where it is much simpler not to mix CPUs of different types, since they all share the same bus and RAM.

There is also the constraint of the clock that governs the CPU cycles and RAM access, which becomes much harder to manage when mixing CPUs of different speeds. Clock-less experimental computers did exist and were even pretty fast, but the complexities of modern hardware imposed a simpler architecture.

For example, Sandy Bridge and Ivy Bridge cores can't be running at different speeds at the same time since the L3 cache bus runs at the same clock speed as the cores, so to prevent synchronization problems they all have to either run at that speed or be parked/off (link: Intel's Sandy Bridge Architecture Exposed). (Also verified in the comments below for Skylake.)

[EDIT] Some people have mistaken my answer to mean that mixing CPUs is impossible. For their benefit I state: mixing differing CPUs is not beyond today's technology, but it is not done - "why not" is the question. As answered above, it would be technically complicated, therefore costlier, and for too little or no financial gain, so it does not interest the manufacturers.

Here are answers to some comments below:

Turbo boost changes CPU speeds so they can be changed

Turbo boost is done by speeding up the clock and changing some multipliers, which is exactly what people do when overclocking, except that the hardware does it for us. The clock is shared between cores on the same CPU, so this uniformly speeds up the entire CPU and all its cores.
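(For reference, the arithmetic under discussion: a core's frequency is the base clock times a multiplier ratio, e.g. 100 MHz × 40 = 4.0 GHz. A minimal sketch with made-up ratios; whether the ratio may differ per core at the same instant is exactly what the comments below debate:)

    #include <stdio.h>

    int main(void) {
        double bclk_mhz = 100.0;            /* base clock shared by all cores */
        int ratios[] = { 40, 40, 20, 8 };   /* illustrative per-core multipliers */
        for (int i = 0; i < 4; i++)
            printf("core %d: %.1f GHz\n", i, bclk_mhz * ratios[i] / 1000.0);
        return 0;
    }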

Some phones have more than one CPU of different speeds

Such phones typically have a custom firmware and software stack associated with each CPU, more like two separate CPUs (or like a CPU and GPU), and they lack a single view of system memory. This complexity is hard to program, so asymmetric multiprocessing was left in the mobile realm, since it requires low-level, close-to-the-hardware software development, which is shunned by general-purpose desktop OSes. This is the reason such configurations aren't found in the PC (except for CPU/GPU, if we stretch the definition enough).

My server with 2x Xeon E5-2670 v3 (12 cores with HT) currently has cores at 1.3 GHz, 1.5 GHz, 1.6 GHz, 2.2 GHz, 2.5 GHz, 2.7 GHz, 2.8 GHz, 2.9 GHz, and many other speeds.

A core is either active or idle. All cores that are active at the same time run at the same frequency. What you are seeing is just an artifact of either timing or averaging. I have myself also noted that Windows does not park a core for a long time, but rather parks/unparks all cores separately, far faster than the refresh rate of Resource Monitor, but I don't know the reason for this behavior, which is probably behind the above remark.

Intel Haswell processors have integrated voltage regulators that enable individual voltages and frequencies for every core

Individual voltage regulators are a separate matter from clock speed. Not all cores are identical - some are faster. Faster cores are given slightly less power, creating the headroom to boost the power given to weaker cores. Core voltage regulators will be set as low as possible in order to maintain the current clock speed. The Power Control Unit on the CPU regulates voltages and will override OS requests where necessary for cores that differ in quality. Summary: individual regulators are for making all cores operate economically at the same clock speed, not for setting individual core speeds.

harrymc

Posted 2017-06-24T13:25:56.087

Reputation: 306 093

3Ah, much shorter and to the point. +1 – Hennes – 2017-06-24T14:00:48.050

My understanding is that if a core has a speed of 4.0 GHz, that might break down as 40 × 100 MHz. So if you had a core at 4.0 GHz and another core at 2.0 GHz, could they not break down as 40 × 100 MHz and 20 × 100 MHz respectively? Is that what you mean by the "clock"? So is that an issue? The argument of it being simpler to cast one die is only an argument if there is not a sufficient benefit to casting two different-sized cores. – Jamie – 2017-06-24T14:19:05.377

3The clock pulses govern everything the CPU does, since data flows in it in steps that are governed by the clock. The clock is not here for telling the time, but for marking the time between data entering and exiting sub-circuits, so for computations to pass from one step to another, as well as RAM access stages. The clock is used for synchronization, and it would be hard to synchronize two CPUs that don't have the same timing between steps, or even the same steps. – harrymc – 2017-06-24T16:49:39.037

6@harrymc there are synchroniser blocks that manage it perfectly well; DRAM runs slower than core speed, and you can have Intel cores running at different speeds dynamically on the same chip. – pjc50 – 2017-06-24T21:47:31.690

1@Jamie the clock multiplication (see "PLL") is usually "multiply by X divide by Y", where X is limited to a few choices and Y can be varied more widely. You can have one core at 4GHz and another at 2GHz or even 3.9GHz if you want, but there's a penalty of a few cycles for crossing clock domains. – pjc50 – 2017-06-24T21:51:06.873

1@pjc50: Synchroniser blocks etc. between CPUs will make an architecture that is much too complicated and costly. Any price advantage that is gained in creating such a "middle-class" CPU will be lost that way, so there is no point. In addition, most OS today are uniquely oriented toward Symmetric multiprocessing. – harrymc – 2017-06-25T11:34:32.743

10Intel Core-series processors run at different speeds on the same die all the time. – Nick T – 2017-06-25T15:32:25.673

@NickT: All at the same time. – harrymc – 2017-06-25T15:46:30.860

2@Bob: The question is why are the processors all the same. It's well known that modern OS can vary power consumption and even park cores. – harrymc – 2017-06-26T05:44:09.740

9The sole existence of big.LITTLE architectures and core-indepenendent clock boosting proves you wrong. Heterogeneous multiprocessing is mainstream. It can be done, it is done in phones, but for some reason not in desktops. – Agent_L – 2017-06-26T11:02:02.747

9@Agent_L: The reason is the complexity. Desktop CPUs are costly enough already. So I repeat: Everything is possible, but the actual question is why it is not done, not whether it can be done. Do not attack me as if I have claimed this is impossible - all I say is that it's too complicated and costly and for too little gain to interest the manufacturers. – harrymc – 2017-06-26T12:56:47.873

2It's better now, but IMHO you should dive more into the details of why it's done in phones and less so in PCs. I believe that is the root of the question, and you've merely mentioned it for now, without any real explanation. Mentioning clockless designs is just a distraction; I'd drop it. You've literally written "impossible", and it's still there on RAM clock access - when it is clearly possible and done, on desktops: single-core turbo boost introduces a clock difference. Nobody attacks you, only the obviously false statements you've made. Or back them up better - maybe it's me who gets turbo boost wrong. – Agent_L – 2017-06-26T14:45:49.283

2@Agent_L: I don't know exactly how turbo boost is done, but guess that it speeds up the clock and some multipliers, same as overclocking. The clock is shared, so this speeds up the entire CPU and all its cores. For phones: They typically have a custom firmware and software stack associated with every CPU, more like two separate CPUs (or like CPU and GPU), and lacking a single view of system memory. This complexity is hard to program and so left AMP in the mobile realm, as it requires low-level close-to-the-hardware software development, which is shunned by general-purpose desktop OS . – harrymc – 2017-06-26T15:09:44.020

3"The clock is shared between cores on the same CPU, so this speeds uniformly up the entire CPU and all its cores." Wrong. Plenty of us have given plenty of evidence that this different cores run at different clocks on the same die at the same time. Pretty much every large modern processor does this. – Grant Wu – 2017-06-26T19:46:53.323

2My server with 2x Xeon E5-2670 v3 (12 cores with HT) currently has cores at 1.3 GHz, 1.5 GHz, 1.6 GHz, 2.2 GHz, 2.5 GHz, 2.7 GHz, 2.8 GHz, 2.9 GHz, and many other speeds. In fact, it's rare that cat /proc/cpuinfo | grep MHz | uniq -c ever shows duplicates. – Nick T – 2017-06-26T22:26:01.443

3@NickT: A core is either active or idle. All cores that are active at the same time run at the same frequency. What you are seeing is just an artifact of either timing or averaging. For example, Sandy Bridge and Ivy Bridge cores can't be running at different speeds at the same time since the L3 cache bus runs at the same clock speed as the cores, so to prevent synchronization problems they all have to either run at that speed or shut off (link). – harrymc – 2017-06-27T06:07:15.737

1@harrymc Thanks, I have learned something new today. – Agent_L – 2017-06-27T08:33:39.213

1Please remove the incorrect information about the E5-2670 v3. To quote http://ieeexplore.ieee.org/document/7284406/ : "The recently introduced Intel Xeon E5-1600 v3 and E5-2600 v3 series processors–codenamed Haswell-EP–implement major changes compared to their predecessors. Among these changes are integrated voltage regulators that enable individual voltages and frequencies for every core." – Grant Wu – 2017-06-27T14:41:24.303

1@GrantWu: Individual voltage regulators differ from clock speed. Not all cores are identical - some are faster. Faster cores are given slightly less power, creating the headroom to boost the power given to weaker cores. Core voltage regulators will be set as low as possible in order to maintain the current clock speed. The Power Control Unit on the CPU regulates voltages and will override OS requests where necessary for cores that differ in quality. Summary: Individual regulators are for making all cores operate economically at the same clock speed, not for setting individual core speeds. – harrymc – 2017-06-27T15:57:00.357

"that enable individual voltages and frequencies for every core" "This enables per-core pstates (PCPS) [14] instead of one p-state for all cores as in previous products. The finer granularity of voltage and frequency domains enables energy-aware runtimes and operating systems to lower the power consumption of single cores while keeping the performance of other cores at a high level." "Previous Intel processor generations used either a fixed uncore frequency (Nehalem-EP and Westmere-EP) or a common frequency for cores and uncore (Sandy Bridge-EP and Ivy Bridge-EP)." – Grant Wu – 2017-06-28T06:10:26.217

1@GrantWu: That does not contradict what I said, just gives more hardware details. – harrymc – 2017-06-28T06:13:05.150

1Yes it does. It says "individual... frequencies" for every core. Or look at https://stackoverflow.com/questions/2619745/mutli-core-processors-does-each-core-run-at-the-full-clock-speed-or-some-frac – Grant Wu – 2017-06-28T06:16:56.940

Or, look at the abstract of https://aspire.eecs.berkeley.edu/wp/wp-content/uploads/2014/07/Per-Core-DVFS-With-Switched-Capacitor.pdf : "it is highly desirable to independently control the supply and the clock frequency for each core" – Grant Wu – 2017-06-28T06:18:23.743

1@GrantWu: That does not replace the CPU clock - it is only used to adjust the speed to follow the clock. This is probably the mechanism used for implementing turbo boost and for homogenizing the cores (cores performance might differ as not all cores are identical when manufactured). – harrymc – 2017-06-28T06:21:29.550

On closer look, I think @harrymc is correct. As of Skylake, all cores still share a clock domain. Though the publicly available literature is a little bit vague in whether it is merely referring to the base clock or the cores also share a multiplier; the latter is implied. – Bob – 2017-06-28T09:10:40.587

46

Why do we not have variants with differing clock speeds? ie. 2 'big' cores and lots of small cores.

It's possible that the phone in your pocket sports exactly that arrangement - the ARM big.LITTLE architecture works exactly as you described. There it's not even just a clock speed difference; they can be entirely different core types - typically, the slower-clocked ones are even "dumber" (no out-of-order execution and other CPU optimizations).

It's a nice idea, essentially a way to save battery, but it has its own shortcomings: the bookkeeping needed to move work between different CPUs is more complicated, the communication with the rest of the peripherals is more complicated and, most importantly, to use such cores effectively the task scheduler has to be extremely smart (and often to "guess right").

The ideal arrangement is to run non-time-critical background tasks or relatively small interactive tasks on the "little" cores, and wake the "big" ones only for big, long computations (where the extra time spent on the little cores ends up eating more battery) or for medium-sized interactive tasks, where the user would feel sluggishness on the little cores.

However, the scheduler has limited information about the kind of work each task may be running, and has to resort to some heuristic (or external information, such as forcing some affinity mask on a given task) to decide where to schedule them. If it gets this wrong, you may end up wasting a lot of time/power to run a task on a slow core, and give a bad user experience, or using the "big" cores for low priority tasks, and thus wasting power/stealing them away from tasks that would need them.
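As an illustration of "forcing some affinity mask on a given task": on Linux this looks roughly like the sketch below (the assumption that cores 0 and 1 are the "big" ones is invented for the example):

    /* Minimal sketch: pin the calling process to cores 0 and 1, which we
       assume (for the sake of the example) are the "big" cores. Linux-only. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);   /* hypothetical big core */
        CPU_SET(1, &set);   /* hypothetical big core */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned to cores 0-1\n");
        return 0;
    }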

Also, on an asymmetric multiprocessing system it's usually more costly to migrate tasks to a different core than it would be on an SMP system, so the scheduler generally has to make a good initial guess instead of trying to run on a random free core and moving it around later.


Intel's choice here is instead to have a smaller number of identical, intelligent, fast cores, but with very aggressive frequency scaling. When the CPU gets busy, it quickly ramps up to the maximum clock speed, does the work as fast as it can, and then scales back down to the lowest power usage mode. This doesn't place a particular burden on the scheduler, and avoids the bad scenarios described above. Of course, even in low-clock mode these cores are "smart" ones, so they'll probably consume more than the low-clock "stupid" big.LITTLE cores.

Matteo Italia

Posted 2017-06-24T13:25:56.087

Reputation: 1 490

1Heuristics should be pretty simple. Any involuntary task switch (use of full timeslice) is an indication that the slow cpu is inappropriate for the task. Very low utilization and all voluntary task switches is indication that the task could be moved to the slow cpu. – R.. GitHub STOP HELPING ICE – 2017-06-25T02:12:42.567

3Another problem is that 4 stupid 2 GHz cores may take more die area than 2 smart 4 GHz cores; or they may be smaller and take much less power than the 4 GHz cores, but also run much, much slower. – phuclv – 2017-06-25T03:43:45.057

2@R.: in line of principle I agree with you, but even enabling some basic scheduler support for this I saw ridiculous core jostling on an ARM board I used, so there must be something else to it. Besides, most "regular" multithreaded software is written with SMP in mind, so it's not untypical to see thread pools as big as the total number of cores, with jobs dragging on the slow cores. – Matteo Italia – 2017-06-27T07:03:19.103

1@Ramhound: A 120W 10-core part has a power budget of 12W per core (except in single-core turbo mode). This is why the highest single-core clocks are found in the quad-core parts, where e.g. Intel's i7-6700k has a power budget of 91W for 4 cores: 22.75W per core sustained with all cores active (at 4.0GHz even with an AVX2+FMA workload like Prime95). This is also why the single-core Turbo headroom is only an extra 0.2GHz, vs. a 22-core Broadwell E5-2699v4 with 2.2GHz base@145W, 3.6GHz turbo. – Peter Cordes – 2017-06-28T18:59:52.897

@Ramhound: I added an answer that expands on this. A many-core Xeon seems to be exactly what the OP is looking for: operate as many low-power cores, or spend a lot of power running a single thread fast when possible (turbo). – Peter Cordes – 2017-06-28T20:10:12.837

14

Performance in games tends to be determined by single core speed,

In the past (DOS era games): Correct.
These days, it is no longer true. Many modern games are threaded and benefit from multiple cores. Some games are already quite happy with 4 cores and that number seems to rise over time.

whereas applications like video editing are determined by number of cores.

Sort of true.

Number of cores × speed of the core × efficiency.
If you compare a single identical core to a set of identical cores, then you are mostly correct.

In terms of what is available on the market - all the CPUs seem to have roughly the same speed with the main differences being more threads or more cores. For example:

  • Intel Core i5 7600k, Base Freq 3.80 GHz, 4 Cores
  • Intel Core i7 7700k, Base Freq 4.20 GHz, 4 Cores, 8 Threads
  • AMD Ryzen 1600x, Base Freq 3.60 GHz, 6 Cores, 12 Threads
  • AMD Ryzen 1800x, Base Freq 3.60 GHz, 8 Cores, 16 Threads

Comparing different architectures is dangerous, but ok...

So why do we see this pattern of increasing cores with all cores having the same clock speed?

Partially because we ran into a barrier. Increasing clock speed further means more power needed and more heat generated, and more heat means even more power needed. We tried going that way; the result was the horrible Pentium 4: hot and power-hungry, hard to cool, and not even faster than the smartly designed Pentium-M (a P4 at 3.0 GHz was roughly as fast as a Pentium-M at 1.7 GHz).

Since then, we have mostly given up on pushing clock speed and instead build smarter solutions. Part of that was to use multiple cores over raw clock speed.

E.g. a single 4GHz core might draw as much power and generate as much heat as three 2GHz cores. If your software can use multiple cores, it will be much faster.

Not all software could do that, but modern software typically can.
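As a back-of-the-envelope check of the power claim two paragraphs up: dynamic CPU power follows roughly P ≈ C·V²·f, and if we make the rough assumption that voltage scales linearly with frequency across the usable range, power grows roughly as f³:

    #include <stdio.h>

    int main(void) {
        /* Relative dynamic power under the crude P ~ f^3 assumption. */
        double p4 = 4.0 * 4.0 * 4.0;   /* one 4 GHz core: 64 units */
        double p2 = 2.0 * 2.0 * 2.0;   /* one 2 GHz core:  8 units */
        printf("2 GHz cores per one 4 GHz core's power: %.0f\n", p4 / p2);
        return 0;
    }

Under this crude model one 4 GHz core burns as much power as eight 2 GHz cores; real voltage scaling is shallower than linear, so "three" above is a conservative figure.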

Which partially answers why we have chips with multiple cores, and why we sell chips with different numbers of cores.

As to clock speed, I think I can identify three points:

  • Low-power CPUs make sense for quite a few cases in which raw speed is not needed, e.g. domain controllers, NAS setups, ... For these, we do have lower-frequency CPUs, sometimes even with more cores (e.g. 8 slow cores can make sense for a web server).
  • For the rest, we are usually near the maximum frequency we can reach without the design getting too hot (say 3 to 4 GHz with current designs).
  • And on top of that, we do binning. Not all CPUs are created equal. Some CPUs score badly overall, or score badly in part of their chip; those parts are disabled and the chip is sold as a different product.

The classic example of this was a four-core AMD chip. If one core was broken, it was disabled and the chip was sold as a three-core chip. When demand for these three-core chips was high, even some fully working four-core chips were sold as the three-core version, and with the right software hack you could re-enable the fourth core.

And this is not only done with the number of cores; it also affects speed. Some chips run hotter than others. If a chip runs too hot, it is sold as a lower-speed CPU (where lower frequency also means less heat generated).

And then there is production and marketing and that messes it up even further.

Why do we not have variants with differing clock speeds? ie. 2 'big' cores and lots of small cores.

We do. In places where it makes sense (e.g. mobile phones), we often have a SoC with a slow core CPU (low power), and a few faster cores. However, in the typical desktop PC, this is not done. It would make the setup much more complex, more expensive, and there is no battery to drain.

Hennes

Posted 2017-06-24T13:25:56.087

Reputation: 60 739

1As I pointed out - "I ask this question as a general point - not specifically about those CPUs I listed above" - and there was a reason I gave two examples from each architecture. If we treat the two scenarios as 1. all big cores, and 2. two big and two small, then I think all the points you mention apply to both cases - i.e. a theoretical max single-core speed, binning of chips, and downclocking when not in use. – Jamie – 2017-06-24T14:34:52.640

A single max-speed core is not all that interesting when it does not get chosen, though. Schedulers would need to be updated to actually prefer the high-speed core(s). – Hennes – 2017-06-24T19:37:34.383

10

Why do we not have variants with differing clock speeds? For example, two 'big' cores and lots of small cores.

Unless we were extremely concerned about power consumption, it would make no sense to accept all the cost associated with an additional core and not get as much performance out of that core as possible. The maximum clock speed is determined largely by the fabrication process, and the entire chip is made by the same process. So what would the advantage be to making some of the cores slower than the fabrication process supported?

We already have cores that can slow down to save power. What would be the point to limiting their peak performance?

David Schwartz

Posted 2017-06-24T13:25:56.087

Reputation: 58 310

2This is what I was thinking. Why intentionally use some inferior components when they could all be elite? +1. – MPW – 2017-06-26T13:14:40.620

1@MPW The choice isn't between creating a big core and then neutering it; it is between all big cores vs. a few big and lots of small cores. Because you have two competing scenarios - single-thread performance and multi-thread performance - why not maximise both? Do we know that you can't fabricate a chip with a few big and lots of small cores? – Jamie – 2017-06-26T21:33:35.253

@Jamie You could fabricate a chip with a few big and lots of small cores. But the smaller cores wouldn't run at a lower clock speed. – David Schwartz – 2017-06-26T22:54:12.133

They would if they were designed that way... The question is why aren't they designed that way from scratch, not taking an existing fabrication process and neutering it. – Jamie – 2017-06-27T00:26:12.373

@Jamie I don't understand what you're saying. The whole CPU has to be made with the same fabrication process, and the maximum clock speed is largely a characteristic of the fabrication processes. Cores that require a lower clock speed at the same fabrication level would generally be more complex and take more space, otherwise why would they require a lower clock speed? – David Schwartz – 2017-06-27T00:27:58.507

Maybe I don't know enough about the fabrication process to understand. Could you not create two different cores on the same CPU within the same fabrication process? - i.e. a 4.0 GHz (40 × 100 MHz) core and a 2.0 GHz (20 × 100 MHz) core. Some CPUs have on-chip GPUs; is this part of the fabrication process, or is it added later? There is clearly currency in adding complexity - if the end result is worth it. – Jamie – 2017-06-27T00:39:36.023

@Jamie Sure, you could do that. But likely the 2.0GHz core would be larger and more complex, requiring it to run at a lower frequency. (Why else would it need to run at a lower frequency even though it's built with the same fabrication process?) – David Schwartz – 2017-06-27T01:17:56.937

9

Why do we not have variants with differing clock speeds? For example, two 'big' cores and lots of small cores.

Nominal clock speeds don't really mean too much for most larger processors nowadays since they all have the capability to clock themselves up and down. You're asking whether or not they can clock different cores up and down independently.

I'm kind of surprised by many of the other answers. Modern processors can and do do this. You can test this by, for example, opening up CPU-Z on a smartphone - my Google Pixel is perfectly capable of running different cores at different speeds:

It is nominally 2.15 GHz, but two cores are at 1.593 GHz and two are at 1.132 GHz.
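On Linux, a rough equivalent of that CPU-Z observation is to read each core's current frequency from sysfs. A minimal sketch, assuming the cpufreq interface is exposed (this depends on the driver, and each read is just a momentary snapshot):

    #include <stdio.h>

    int main(void) {
        char path[128];
        for (int cpu = 0; cpu < 8; cpu++) {
            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
            FILE *f = fopen(path, "r");
            if (!f)
                continue;  /* core absent or cpufreq not exposed */
            long khz;
            if (fscanf(f, "%ld", &khz) == 1)
                printf("cpu%d: %.3f GHz\n", cpu, khz / 1e6);
            fclose(f);
        }
        return 0;
    }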

In fact, since 2009 mainstream Intel CPUs have had logic to boost individual cores higher while underclocking other cores, allowing better single core performance while remaining within a TDP budget: http://www.anandtech.com/show/2832/4

Newer Intel processors with "Favored Core" (an Intel marketing term) have each core characterized at the factory, with the fastest cores being able to boost extra high: http://www.anandtech.com/show/11550/the-intel-skylakex-review-core-i9-7900x-i7-7820x-and-i7-7800x-tested/7

AMD's Bulldozer chips had a primitive version of this: http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/4

AMD's new Ryzen chips probably have this as well, although it's not explicitly stated here: http://www.anandtech.com/show/11170/the-amd-zen-and-ryzen-7-review-a-deep-dive-on-1800x-1700x-and-1700/11

Grant Wu

Posted 2017-06-24T13:25:56.087

Reputation: 141

You are answering a different question. The question is about lots of big cores vs a couple of big cores and lots of small cores - the merits of the two scenarios. In both situations you can clock up and down dependent on demand, or boost a core. – Jamie – 2017-06-25T12:41:48.327

3That's not how I read the question. The question does not mention architecturally different cores, despite using the words "big" and "small". It focuses exclusively on clock speed. – Grant Wu – 2017-06-26T19:42:46.307

8

On a modern system you often do have all of the cores running at different speeds. Clocking down a core that isn't heavily used reduces power usage and thermal output, which is good, and features like "turbo boost" let one or two cores run significantly faster as long as the other cores are idle, and therefore the power usage and heat output of the entire package don't go too high. In the case of a chip with such a feature, the speed you see in the listing is the highest speed you can get with all the cores at once. And why would all of the cores have the same maximum speed? Well, they're all of an identical design, on the same physical chip, laid down with the same semiconductor process, so why should they be different?

The reason all of the cores are identical is because that makes it easiest for a thread that's running on one core at one point to start running on a different core at another point. As mentioned elsewhere, there are commonly-used chips that don't follow this principle of identical cores, namely the ARM "big.LITTLE" CPUs. Although in my mind the most important difference between the "big" and "little" cores isn't clock speed (the "big" cores tend to be fancier, wider, more speculative cores that get more instructions per clock at the cost of higher power usage, while the "little" cores hew closer to ARM's single-issue, in-order, low-power roots), since they're different designs on the same chip they will generally have different maximum clock speeds as well.

And getting further into the realm of heterogeneous computing, it's also becoming common to see "CPU" and "GPU" cores integrated onto the same chip. These have thoroughly different designs, run different instruction sets, are addressed differently, and generally will be clocked differently as well.

hobbs

Posted 2017-06-24T13:25:56.087

Reputation: 701

7

Fast single-thread performance and very high multi-thread throughput is exactly what you get with a CPU like Intel's Xeon E5-2699v4.

It's a 22-core Broadwell. The sustained clock speed is 2.2GHz with all cores active (e.g. video encoding), but the single-core max turbo is 3.6GHz.

So while running a parallel task, it uses its 145 W power budget as 22 cores at 6.6 W each. But while running a task with only a few threads, that same power budget lets a few cores turbo up to 3.6 GHz. (The lower single-core memory and L3-cache bandwidth in a big Xeon means it might not run as fast as a desktop quad-core at 3.6 GHz, though. A single core in a desktop Intel CPU can use a lot more of the total memory bandwidth.)

The 2.2GHz rated clock speed is that low because of thermal limits. The more cores a CPU has, the slower they have to run when they're all active. This effect isn't very big in the 4 and 8 core CPUs you mention in the question, because 8 isn't that many cores, and they have very high power budgets. Even enthusiast desktop CPUs noticeably show this effect: Intel's Skylake-X i9-7900X is a 10c20t part with base 3.3GHz, max turbo 4.5GHz. That's much more single-core turbo headroom than i7-6700k (4.0GHz sustained / 4.2GHz turbo without overclocking).

Frequency/voltage scaling (DVFS) allows the same core to operate over a wide range of the performance / efficiency curve. See also this IDF2015 presentation on Skylake power management, with lots of interesting details about what CPUs can do efficiently, and trading off performance vs. efficiency both statically at design time, and on the fly with DVFS.

At the other end of the spectrum, Intel Core-M CPUs have very low sustained frequency, like 1.2GHz at 4.5W, but can turbo up to 2.9GHz. With multiple cores active, they'll run their cores at a more efficient clock-speed, just like the giant Xeons.

You don't need a heterogeneous big.LITTLE style architecture to get most of the benefit. The small cores in ARM big.LITTLE are pretty crappy in-order cores that aren't good for compute work. The point is just to run a UI with very low power. Lots of them would not be great for video encoding or other serious number crunching. (@Lưu Vĩnh Phúc found some discussions about why x86 doesn't have big.LITTLE. Basically, spending extra silicon on a very-low-power extra-slow core wouldn't be worth it for typical desktop/laptop usage.)


whereas applications like video editing are determined by number of cores. [Wouldn't 2x 4.0 GHz + 4x 2.0 GHz be better at multi-threaded workloads than 4x 4GHz?]

This is your key misunderstanding. You seem to be thinking that the same number of total clock ticks per second is more useful if spread over more cores. That's never the case. It's more like

cores * perf_per_core * (scaling efficiency)^cores

(perf_per_core is not the same thing as clock speed, because a 3GHz Pentium4 will get a lot less work per clock cycle than a 3GHz Skylake.)

More importantly, it's very rare that the efficiency is 1.0. Some embarrassingly parallel tasks do scale almost linearly (e.g. compiling multiple source files). But video encoding is not like that. For x264, scaling is very good up to a few cores, but gets worse with more cores. e.g. going from 1 to 2 cores will almost double the speed, but going from 32 to 64 cores will help much, much less for a typical 1080p encode. The point at which speed plateaus depends on the settings. (-preset veryslow does more analysis on each frame, and can keep more cores busy than -preset fast).
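A minimal sketch that just plugs numbers into the formula above to show the plateau; the 0.98 per-core efficiency factor is an arbitrary illustrative figure, not a measured x264 value:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double perf_per_core = 1.0;   /* arbitrary units */
        double eff = 0.98;            /* assumed per-core scaling efficiency */
        for (int cores = 1; cores <= 64; cores *= 2) {
            double throughput = cores * perf_per_core * pow(eff, cores);
            printf("%2d cores -> %5.2f\n", cores, throughput);
        }
        return 0;
    }

Doubling from 1 to 2 cores nearly doubles throughput here, while doubling from 32 to 64 adds almost nothing - the same shape as the x264 behaviour described above.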

With lots of very slow cores, the single-threaded parts of x264 would become bottlenecks. (e.g. the final CABAC bitstream encoding. It's h.264's equivalent of gzip, and doesn't parallelize.) Having a few fast cores would solve that, if the OS knew how to schedule for it (or if x264 pinned the appropriate threads to fast cores).

x265 can take advantage of more cores than x264, since it has more analysis to do, and h.265's WPP design allows more encode and decode parallelism. But even for 1080p, you run out of parallelism to exploit at some point.


If you have multiple videos to encode, doing multiple videos in parallel scales well, except for competition for shared resources like L3 cache capacity and bandwidth, and memory bandwidth. Fewer faster cores could get more benefit from the same amount of L3 cache, since they wouldn't need to work on so many different parts of the problem at once.

Peter Cordes

Posted 2017-06-24T13:25:56.087

Reputation: 3 141

4

While it's possible to design computers that have different parts running at different independent speeds, arbitration of resources often requires being able to quickly decide which request to service first, which in turn requires knowing whether any other request might have come in soon enough to win priority. Deciding such things, most of the time, is pretty simple. Something like a "quiz buzzer" circuit could be implemented with as few as two transistors. The problem is that making quick decisions that are reliably unambiguous is hard. The only practical way to do that in many cases is to use a device called a "synchronizer", which can avoid ambiguities but introduces a two-cycle delay. One could design a caching controller which would reliably arbitrate between two systems with separate clocks if one were willing to tolerate a two-cycle delay on every operation to determine who won arbitration. Such an approach would be less than useful, however, if one would like a cache to respond immediately to requests in the absence of contention, since even uncontested requests would still have a two-cycle delay.

Running everything off a common clock avoids the need for synchronization, which in turn avoids a two-cycle communications delay every time it's necessary to pass information or control signals between clock domains.

supercat

Posted 2017-06-24T13:25:56.087

Reputation: 1 649

4

Desktop computers do this already.

They have a (set of) CPU(s), with 1-72 threads active at once, and a (set of) GPU(s), with 16-7168 computing units.

Graphics is an example of a task for which we have found massively parallel work to be efficient. The GPU is optimized to do the kinds of operations that we want for graphics (but it isn't limited to that).

This is a computer with a few big cores, and lots of small cores.

In general, trading one core at X FLOPS for three cores at X/2 FLOPS is not worth it; but trading one core at X FLOPS for one hundred cores at X/5 FLOPS is very much worth it.

When programming for this, you generate very different code for the CPU and for the GPU. Lots of work is done to divide the workload, so that the GPU gets tasks that are best done on the GPU, and the CPU gets tasks that are best done on the CPU.

It is arguably much easier to write code for a CPU, because massively parallel code is harder to get right. So only when the payoff is large is it worth trading single-core performance for multi-core situations. GPUs give a large payoff when used properly.

Now, mobile devices do this for a different reason. They have low-power cores that are significantly slower, but use significantly less power per unit of compute as well. This lets them stretch battery life much longer when not doing CPU intensive tasks. Here we have a different kind of "large payoff"; not performance, but power efficiency. It still takes a lot of work on the part of the OS and possibly application writer to get this to work right; only the large payoff made it worth it.

Yakk

Posted 2017-06-24T13:25:56.087

Reputation: 153

-1

The reason common systems have cores at the same speed is a simple math problem: input and output timing (with optimizations) is based on a single set of constants (which are scalable, i.e. multipliable by a number of units).

And someone here said mobile devices have multiple CPUs with different speeds. That's just not true. It's not a central processing unit if it is not the unit of central processing, no matter what the manufacturer says it is or is not. In that case [not a CPU] it's just a "support package".

Hypersoft Systems

Posted 2017-06-24T13:25:56.087

Reputation: 99

-10

I don't think the OP understands basic electronics. All computers require one thing to function - a clock. Clock cycles generated by an internal clock are the metronome for the movement of all data. To achieve synchronicity, all operations must be tied to a common clock. This is true both for internal data execution on an isolated computer and for entire networks.

If you wanted to isolate cores on a CPU by running them at different frequencies, you could certainly design such a platform. However, it would require engineering a motherboard solution that ties each individual core to its own isolated subset of motherboard features. You would be left with four individual computers instead of a quad-core computer.

Alternatively, as another person pointed out, you can add code to your kernel that adjusts core frequency on an individual basis. This will cause hits on performance, though. You can have speed or power efficiency - but you can't have both.

RyRoUK

Posted 2017-06-24T13:25:56.087

Reputation: 1

1I don't, hence my question. Comparing an Intel i5 7600 to an i5 7600K, we see that the base clock is 100 MHz for both and the difference is the core ratio. So you could have two cores with the same base clock of 100 MHz but with different core ratios - does this scenario violate the synchronicity requirement? – Jamie – 2017-06-24T21:01:48.010

4Yeah, this is oversimplifying too much; it's not really true that all operations must be tied to the same clock, there are lots of clock domains and it's perfectly possible to run different cores at the same speed. Bus clock is not the same as internal clock, etc. – pjc50 – 2017-06-24T21:45:52.643

11Modern chips already have multiple clock domains (even the RTC of a cheap&dumb microcontroller usually runs on a separate 32.7kHz domain). You just have to synchronize between clock domains. Even with a common clock you could divide it by 2, 4, 8 and so on. – Michael – 2017-06-24T21:46:18.897

1All true. But it still reduces efficiency of operation. And that is always the goal in regards to performance. That was my point. Sure, you can do it. But you'll take a hit on performance. – RyRoUK – 2017-06-27T10:41:18.480

"Reduces performance" - compared to what? You are assuming a base state where you have n processors running with the same clock. That doesn't have to be the case. Processor X + processor Y is a more powerful/flexible solution than processor X alone, no matter what exactly processor Y is. – hmijail mourns resignees – 2017-06-29T22:36:37.753

Compared to its own max voltage + frequency. If all cores are maxed out in both V & f, then scaling down any core would result in lower potential performance. – RyRoUK – 2017-07-01T06:38:33.013