Why can't you have both high instructions per cycle and high clock speed?



The Megahertz Myth became a promotional tactic due to differences between the PC's Intel 8086 processor and Apple's Rockwell 6502 processor. The 8086 ran at 4.77 MHz while the 6502 ran at 1 MHz. However, instructions on the 6502 needed fewer cycles; so many fewer, in fact, that it ran faster than the 8086. Why do some instructions need fewer cycles? And why can't the instructions of the 6502, needing fewer cycles, be combined with the fast clock of the 8086?

Wikipedia's article for instructions per cycle (IPC) says

Factors governing IPC
A given level of instructions per second can be achieved with a high IPC and a low clock speed...or from a low IPC and high clock speed.

Why can't you have both high instructions per cycle and high clock speed?

Maybe this has to do with what a clock cycle is? Wikipedia mentions synchronization of circuits? Not sure what that means.

Or maybe this has to do with how a pipeline works? I'm not sure why instructions in a short pipeline are different from instructions in a long pipeline.

Any insight would be great! Just trying to understand the architecture behind the myth. Thanks!


Instruction per Cycle vs Increased Cycle Count




Posted 2012-07-12T03:20:12.637

Reputation: 371

"Why do some instructions need fewer cycles?" RISC/CISC (well, sort of). "And why can't the instructions of the 6502, needing fewer cycles, be combined with a fast cycling processor of the 8086?" They can and have. The problem is that once you have already established a base, it is hard to ditch everything and start the next model from scratch. – Synetech – 2012-07-12T04:08:52.943

@Synetech, Intel kinda sorta did that by presenting a CISC instruction set to programmers, then converting that to RISCier instructions on the chip – soandos – 2012-07-12T04:16:52.317

Well when I said that the two have been combined, I meant by completely different chip makers. I don't have a list on hand, but there have been others (non-Intel/AMD) that have done things like that. (Most people forget that there are plenty of chip makers because Intel and AMD now dominate the desktop market.) – Synetech – 2012-07-12T05:46:11.713




Shorter pipelines mean faster clock speeds, but may reduce throughput. Also, see answers #2 and 3 at the bottom (they are short, I promise).

Longer version:

There are a few things to consider here:

  1. Not all instructions take the same time
  2. Not all instructions depend on what was done immediately before (or even ten or twenty instructions back)

A very simplified pipeline (what happens in modern Intel chips is beyond complex) has several stages:

Fetch -> Decode -> Memory Access -> Execute -> Writeback -> Program counter update

At each -> there is a time cost that is incurred. Additionally, every tick (clock cycle), everything moves from one stage to the next, so your slowest stage becomes the speed for ALL stages (it really pays for them to be as similar in length as possible).
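To make the "slowest stage sets the pace" point concrete, here is a minimal sketch. The per-stage delays are made-up illustrative numbers, not measurements of any real chip, and the function names are hypothetical.

```python
# Hypothetical per-stage logic delays in nanoseconds (illustrative values only).
stage_delay_ns = {
    "fetch": 30,
    "decode": 25,
    "memory_access": 40,
    "execute": 35,
    "writeback": 20,
    "pc_update": 15,
}

def clock_period_ns(delays):
    """All stages advance on the same tick, so the tick must fit the slowest stage."""
    return max(delays.values())

def idle_time_ns(delays):
    """Time each stage sits waiting for the tick after finishing its own work."""
    period = clock_period_ns(delays)
    return {name: period - delay for name, delay in delays.items()}

period = clock_period_ns(stage_delay_ns)
print(f"clock period: {period} ns ({1000 / period:.1f} MHz)")
for name, idle in idle_time_ns(stage_delay_ns).items():
    print(f"{name:14s} idle {idle} ns per tick")
```

The more uneven the stages are, the more time the fast ones waste each tick, which is why it pays to balance them.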

Let's say you have 5 instructions, and you want to execute them (picture taken from Wikipedia; here the PC update is not done). It would look like this:

[Image: Wikipedia pipeline diagram showing five instructions overlapping as they move through five stages]

Even though each instruction takes 5 clock cycles to complete, a finished instruction comes out of the pipeline every cycle. If the time it takes for each stage is 40 ns, and 15 ns for the intermediate bits (using my six stage pipeline above), it will take 40 * 6 + 5 * 15 = 315 ns to get the first instruction out.

In contrast, if I were to eliminate the pipeline entirely (but keep everything else the same), it would take a mere 240 ns to get the first instruction out. (This difference in speed to get the "first" instruction out is called latency. It is generally less important than throughput, which is the number of instructions per second).

The real difference, though, is that in the pipelined example I get a new instruction done (after the first one) every 55 ns (one 40 ns stage plus one 15 ns hand-off). In the non-pipelined one, it takes 240 ns every time. This shows that pipelines are good at improving throughput.
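The latency and throughput arithmetic can be checked with a few lines of Python. This is only a sketch of the simplified model used here; in particular it assumes the pipelined clock tick is one stage (40 ns) plus one hand-off (15 ns), and the function names are mine.

```python
STAGES = 6        # stages in the simplified pipeline above
STAGE_NS = 40     # time spent in each stage
OVERHEAD_NS = 15  # cost of each hand-off between stages

def pipelined_latency_ns(stages=STAGES, stage_ns=STAGE_NS, overhead_ns=OVERHEAD_NS):
    """Time for the first instruction to pass through every stage and hand-off."""
    return stages * stage_ns + (stages - 1) * overhead_ns

def unpipelined_latency_ns(stages=STAGES, stage_ns=STAGE_NS):
    """Without a pipeline there are no hand-offs, only the stage work itself."""
    return stages * stage_ns

def pipelined_interval_ns(stage_ns=STAGE_NS, overhead_ns=OVERHEAD_NS):
    """Assumption: one tick = one stage plus one hand-off."""
    return stage_ns + overhead_ns

def total_time_ns(n_instructions, first_ns, interval_ns):
    """First instruction pays the full latency, each later one pays one interval."""
    return first_ns + (n_instructions - 1) * interval_ns

n = 5
piped = total_time_ns(n, pipelined_latency_ns(), pipelined_interval_ns())
plain = total_time_ns(n, unpipelined_latency_ns(), unpipelined_latency_ns())
print(f"first instruction: {pipelined_latency_ns()} ns pipelined, "
      f"{unpipelined_latency_ns()} ns unpipelined")
print(f"{n} instructions:    {piped} ns pipelined, {plain} ns unpipelined")
```

The pipelined design loses on latency for the first instruction but wins clearly once more than a couple of instructions are in flight.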

Taking it a step further, it would seem that in the memory access stage, I will need an addition unit (to do address calculations). That means that if there is an instruction that does not use the mem stage that cycle, then I can do another addition. I can thus do two execute stages (with one being in the memory access stage) on one processor in a single tick. The scheduling is a nightmare, but let's not go there. Additionally, the PC update stage will also need an addition unit in the case of a jump, so I can do three addition execute stages in one tick. By having a pipeline, it can be designed such that two (or more) instructions can use different stages (or leapfrog stages, etc.), saving valuable time.

Note that in order to do this, processors do a lot of "magic" (out of order execution, branch prediction and much much more), but this allows multiple instructions to come out faster than without a pipeline (note that pipelines that are too long are very hard to manage, and incur a higher cost just by waiting between stages). The flip side is that if you make the pipeline too long, you can get an insane clock speed, but lose much of the original benefits (of having the same type of logic that can exist in multiple places, and be used at the same time).

Answer #2:

SIMD (single instruction, multiple data) processors (like most GPUs) do a lot of work on many pieces of data at once, but it takes them longer to do. Reading in all the values takes longer (which means a slower clock, though this is offset to some extent by having a much wider bus), but you can get many more instructions done at a time (more effective instructions per cycle).

Answer #3:

Because you can "cheat" and artificially lengthen the cycle so that you can do two instructions every cycle (just halve the clock speed). It is also possible to only do something every two ticks as opposed to one (giving a 2x clock speed, but no change in instructions per second).
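A tiny sketch of that "cheat", assuming the usual relation instructions per second = IPC x clock rate. The 2 GHz baseline is an arbitrary example figure.

```python
def instructions_per_second(ipc, clock_hz):
    """Throughput is the product of instructions per cycle and cycles per second."""
    return ipc * clock_hz

baseline = instructions_per_second(1, 2_000_000_000)             # 1 IPC at 2 GHz
half_clock = instructions_per_second(2, 1_000_000_000)           # 2 IPC at 1 GHz
double_clock = instructions_per_second(0.5, 4_000_000_000)       # 0.5 IPC at 4 GHz

# All three designs retire the same number of instructions per second.
print(baseline, half_clock, double_clock)
```

Redefining what counts as a cycle changes both numbers on the spec sheet without changing how much work gets done.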



Reputation: 22 744

Short pipelines mean slower clockspeeds! Pentium 4 had high clocks due to long pipelines, here's WP: "NetBurst differed from P6 (Pentium III, II, etc.) by featuring a very deep instruction pipeline to achieve very high clock speeds". The point is that you do little per stage to achieve high speeds. This didn't prove workable, though, and Intel lost huge momentum to AMD due to this. They went back to Pentium 3 architecture, and came up with "Core". – stolsvik – 2012-07-17T21:45:09.143

@stolsvik, can you explain this? It makes no sense to me (having fewer interstitial stages means that, all else equal, the clock cycles will be shorter, giving a higher clock speed) – soandos – 2012-07-18T01:12:07.807

One pipeline stage is done per clock cycle; the entire pipeline advances one step per clock - fetching new instructions at the bottom, "emitting" finished instructions at the top. Therefore, the idea with the Pentium 4 was to make very small steps that were quick to perform, giving high clocks, but thus requiring a long pipeline. The point of a pipeline (all processors employ one) is that you have several instructions in progress at any time. A long pipeline means that many instructions are in progress - and if a branch prediction fails, then you'll have to flush the entire pipe. – stolsvik – 2012-07-18T07:16:56.100

For your answer #2, the CPU only accesses the data through the cache (memory access is usually transparent from the instruction's perspective). Slowing down the clock frequency won't affect how long the data will take to come from RAM (if it's not in the cache). Also, the bus width only affects speed of SIMD operations relative to what size your operands are (i.e. I can load 8 8-bit operands on a 64-bit bus at a time, but I still have to manually load 8 64-bit values if I have 64-bit operands). – Breakthrough – 2012-07-18T14:06:27.720


Also for answer #1, when you say "if there is an instruction that does not use the mem stage that cycle, then I can do another addition", this is false. Out of order execution is applied at the instruction level, not micro-operation level. If an instruction did require two executes in the pipeline, this would cause a bubble in the pipeline. Lastly, the x86 architecture has a separate ALU to compute memory addresses on-the-fly during memory reads/writes (allows for the [EBX+ECX*4+100] style addressing). – Breakthrough – 2012-07-18T14:12:59.070


I'm greatly oversimplifying this, but the important point to remember is that these terms are comparing apples to oranges. A "Cycle" is not a single unified unit of measurement that is the same across all processors, like a "second" is a unified measurement of time. Instead, a cycle represents a certain unit of work, which is defined somewhat arbitrarily but bounded by the complexity of the pipeline design and of course by physics.

In many cases, doing a lot of work in one cycle could enable you to clear the entire pipeline. If successful, this means that your next cycle is going to be un-optimized because you have to fill the pipeline again, which can take some time.

I could design a very simplistic processor that processes one stage of one RISC instruction every cycle, and if this were the basis of my CPU, I could probably achieve a very, very high number of cycles per second due to the reduced complexity of what constitutes "a cycle".

The details get into a lot of physics and electrical engineering that I don't really understand, but remember that clock rate is not achieved by just naively adding input voltage to the processor and hoping for the best. At the very least, thermal profile is another necessary concern.



Reputation: 32 256

This does not really answer his question (which has nothing to do with why can't things just be sped up). He is asking how more cycles != more work all the time – soandos – 2012-07-12T03:52:20.817

This answer does however address an issue I didn't see in the other answers, that is it talks about the inclusion of particular instruction sets that complete operations at fewer clock cycles and the ability to measure clock cycles based on the slowest instruction sets that may not be as efficient. (I could be very wrong though...I find architecture to be fascinating but I won't consider myself an expert by any means) – Stephen R – 2012-07-12T18:29:45.293


Here's a very simple (perhaps grossly oversimplified) explanation: Say you have a particular job to do, say add two 32-bit numbers. You can take two approaches. You can split it into a very large number of very small steps or you can split it into a small number of very large steps.

For example, you could just say "add the two numbers". Now you only have one step. But that step has multiple parts and will take longer to do. So you have high instructions per cycle -- one in this case. But your clock speed can't be high because you have a lot to do in that cycle.

You could alternatively say, "Fetch the first number into a register. Then fetch the second number. Then add the least significant bits. Then add the second-least significant bit with the carry from before. Then add the third-least .... Then add the most significant bits. If there was a carry, set the overflow flag. Then write the result to memory." Now you have a huge number of steps. But each step can be absurdly fast. So you have low instructions per cycle (1/36 or so in this case). But your clock speed can be very high since each cycle only has a very small bit to do.

To have both high instructions per cycle and a high clock speed, you'd have to divide a complex instruction into a very small number of very simple steps. But that can't be done because the instruction is complex.

The actual specific trade-offs and cycle numbers are vastly different because modern CPUs are pipelined and overlap instructions. But the basic idea is correct.
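The same trade-off as a rough sketch in Python. It assumes a non-pipelined design in which one add costs a fixed total of 36 ns of logic delay (a made-up figure chosen to match the "1/36 or so" above) and that the work splits evenly across steps with no per-step overhead.

```python
TOTAL_WORK_NS = 36.0  # hypothetical total logic delay of one 32-bit add

def design(steps, total_work_ns=TOTAL_WORK_NS):
    """Split one instruction into `steps` equal cycles (no overlap, no overhead)."""
    cycle_ns = total_work_ns / steps   # less work per cycle -> shorter cycle
    clock_mhz = 1000.0 / cycle_ns      # shorter cycle -> higher clock
    ipc = 1.0 / steps                  # but each instruction needs more cycles
    adds_per_us = ipc * clock_mhz      # million adds per second
    return clock_mhz, ipc, adds_per_us

for steps in (1, 4, 36):
    clock_mhz, ipc, adds_per_us = design(steps)
    print(f"{steps:2d} steps: {clock_mhz:7.1f} MHz, IPC {ipc:.3f}, "
          f"{adds_per_us:.1f} M adds/s")
```

In this toy model the clock and the IPC move in opposite directions and the number of adds per second never changes; real CPUs escape that only by overlapping instructions, as noted above.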

David Schwartz


Reputation: 58 310


You can have both high instructions per cycle and a high clock speed. Where you run into limits is when the digital circuit's propagation delay exceeds a single clock cycle's pulse width. This can be overcome by increasing the CPU voltage, but it should be noted that this will increase power consumption (and thus, heat dissipated).

So, if you want a faster clock speed, you have to increase the voltage (increasing the electron drift velocity) to reduce the propagation delay. If this delay does exceed a clock cycle, the CPU will most likely not behave as expected, and the software running on it will crash or throw an exception. There's obviously a limit to the voltage you can run through a processor however, and this is dictated by the design of the CPU itself - mainly, the current-carrying capacity of the internal electrical pathways.
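As a rough sketch of that limit: the circuit only works if its propagation delay fits inside one clock period. The 2 ns delay below is an arbitrary example value and the helper names are hypothetical.

```python
def max_clock_mhz(propagation_delay_ns):
    """Highest clock whose period is still as long as the propagation delay."""
    return 1000.0 / propagation_delay_ns

def meets_timing(propagation_delay_ns, clock_mhz):
    """True if signals settle before the next clock edge arrives."""
    period_ns = 1000.0 / clock_mhz
    return propagation_delay_ns <= period_ns

delay_ns = 2.0  # hypothetical worst-case delay through the logic
print(max_clock_mhz(delay_ns))      # 500.0
print(meets_timing(delay_ns, 400))  # True: 2.5 ns period leaves slack
print(meets_timing(delay_ns, 600))  # False: 1.67 ns period is too short
```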

Pipelining allows for higher clock speeds in some cases, because each instruction is divided into several smaller "micro-operations". These micro-operations are very simple operations, using much smaller circuits interconnected in a chain (in the physical sense, as the less distance the electrons need to travel, the shorter the propagation delay through a particular sub-unit).

The added advantage to a pipelined CPU is that you can greatly increase the number of instructions executed per unit-time, at the expense of a more complex design.

As for why some instructions need more or fewer cycles, it depends on what instruction you're executing. For example, in the x86 instruction set, there is a MOVS instruction which (with a REP prefix) can move an entire string in memory from one place to another. Clearly, you can't instantaneously copy a long string, but you can copy it word-by-word, taking multiple clock cycles. Thus, the instruction takes a variable amount of time (depending on the number of characters to be copied).

The effect of multi-cycle operations is less noticeable on a RISC design (e.g. ARM) as opposed to a CISC design (e.g. x86). This is because RISC-based designs will only have the most commonly used elementary operations, and are much easier to pipeline in a way that achieves a throughput of one instruction per cycle.



Reputation: 32 927


How long your computer takes to finish a particular task doesn't depend on the clock speed of the computer... it depends on how the computational units are designed and engineered.

The clock speed is actually a (more or less) arbitrary decision made by the CPU designer, sometimes for good reasons (efficiency), sometimes for poor ones (advertising).

Let's say that a given CPU has a mixture of instructions that take between 1 and 100 nanoseconds (ns) to finish. You could set the clock rate such that 1 "tick" is 100 ns (10 MHz), meaning every instruction would finish in exactly 1 tick. However, if the instruction execution times are evenly distributed, this means that your computational units would be idle 50% of the time (the average execution time would be 50 ns, leaving the other 50 ns of the tick idle). If, on the other hand, you set your tick to be 10 ns, the instructions would range between 1 and 10 ticks, but the unit would never be idle more than 9 ns before the next instruction began, and the average idle would be 5 ns. That brings your average idle time down from 50% (an average of 50 ns out of every 100) to 9%, since the average time per instruction is now 55 ns (average execution of 50 ns + average idle of 5 ns).
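Here is a small sketch of that idle-time estimate. It assumes one instruction of each whole-nanosecond duration from 1 to 100 ns (my stand-in for "evenly distributed") and that each instruction occupies a whole number of ticks.

```python
def idle_fraction(tick_ns, durations_ns):
    """Share of occupied clock time in which the unit has already finished its work."""
    busy = sum(durations_ns)
    # Each instruction is rounded up to a whole number of ticks (integer ceiling).
    occupied = sum(-(-d // tick_ns) * tick_ns for d in durations_ns)
    return (occupied - busy) / occupied

durations = range(1, 101)  # one instruction of each length, 1..100 ns

for tick in (100, 10):
    print(f"tick {tick:3d} ns: idle {idle_fraction(tick, durations):.1%}")
```

With these discrete durations the result comes out at roughly 49.5% and 8.2%, close to the rounded 50% and 9% figures above.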

During development, a CPU will be designed to run at a certain speed, based on how much work the CPU is actually capable of carrying out. If you raise or lower the clock speed, you're not actually changing the amount of work the CPU can accomplish, you're just messing with the efficiency ratio of it.

(And before you cry about overclocking CPUs: this gives you two advantages that result in real-world speed gains: fast executing instructions (that take less than 1 cycle) end up with faster execution times, and all instructions have less idle time. Both of these can in fact increase the amount of work your computer can perform, but you'll find that overclocking a CPU by X% doesn't always equal X% increase in work done when you benchmark it.)


A CPU can accomplish X work in a second. If you use H clock speed and I IPC, we have I=X/H. Changing H doesn't change X, but it inversely affects I.

Benjamin Chambers


Reputation: 143

Clock speed is far from an arbitrary decision. It needs to be carefully chosen as a function of the CPU supply voltage, as well as IC trace lengths (to avoid excessive propagation delays). – Breakthrough – 2012-07-18T13:45:10.070

I think you missed the fact that a CPU is a synchronous digital circuit. Instructions don't take X nanoseconds (assuming your clock cycle is less than propagation delay), everything happens on a rising or falling clock edge - or both. Instructions take X cycles, not X units of time. Yes, you can modify how long a cycle is, but the distinction is what happens when. And lastly, the amount of work a CPU can do in a second is a function of clock speed, so your formula doesn't really check out here. – cp2141 – 2012-07-18T14:20:30.213

A CPU is a synchronous amalgamation of several asynchronous units. Clock ticks are used to line things up nicely, but they don't determine how long execution takes... For instance, an integer add will take a certain amount of time based on how far current must travel through the CPU and how quickly transistors will switch states. The result is READ at the next clock tick, but the actual computation is done asynchronously throughout the tick. – Benjamin Chambers – 2012-07-18T14:35:18.620


One cannot have both high instructions per cycle and high clock speed because the requirements are contradictory.

One can show that, in a first approximation, the IPC depends on complexity (A) of the design as

IPC = a sqrt(A)

whereas max frequency (F) achievable by the design scales as [1]

F = 1 / { b + c sqrt(A) }

with a, b and c parameters.

So increasing the complexity of the microarchitecture increases the IPC at the expense of reducing the working frequency, whereas reducing the complexity increases the frequency at the expense of the IPC. This corresponds to the two extreme cases mentioned in the Wikipedia article, but the article fails to mention their names: Brainiac and speed-demon.

  • Brainiac design: High IPC and low frequency
  • Speed-demon design: High frequency and low IPC.

[1] Some authors claim the expression for the frequency is "1 / { b + c A}" instead, but in both cases increasing complexity reduces the maximum achievable frequency.
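A quick numerical sketch of the two scaling laws above. The parameter values a = 1, b = 1, c = 0.5 and the complexity values are arbitrary choices for illustration, not fitted to any real design.

```python
import math

# Hypothetical parameters; only the shape of the curves matters here.
A_PARAM, B_PARAM, C_PARAM = 1.0, 1.0, 0.5

def ipc(complexity, a=A_PARAM):
    """IPC = a * sqrt(A): grows with design complexity."""
    return a * math.sqrt(complexity)

def max_frequency(complexity, b=B_PARAM, c=C_PARAM):
    """F = 1 / (b + c * sqrt(A)): shrinks with design complexity."""
    return 1.0 / (b + c * math.sqrt(complexity))

for complexity in (1, 4, 16, 64):
    print(f"A={complexity:3d}  IPC={ipc(complexity):5.2f}  "
          f"F={max_frequency(complexity):5.3f}")
```

Low A lands near the speed-demon end (high F, low IPC) and high A near the Brainiac end (high IPC, low F).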



Reputation: 101