Why has the size of L1 cache not increased very much over the last 20 years?

36

6

The Intel i486 has 8 KB of L1 cache. The Intel Nehalem has 32 KB L1 instruction cache and 32 KB L1 data cache per core.

The amount of L1 cache hasn't increased at nearly the rate the clock rate has.

Why not?

eleven81

Posted 2009-11-18T16:45:41.147

Reputation: 12 423

You are comparing apples to oranges. Clock rates have increased, but that has no bearing on the need for more cache. Just because you can do something faster doesn't mean you benefit from a bigger bucket. – Keltari – 2013-05-26T04:45:24.170

Excess cache and the management overhead can slow a system down. They've found the sweet spot and there it shall remain. – Fiasco Labs – 2013-05-26T04:54:32.363

Answers

18

30K of Wikipedia text isn't as helpful as an explanation of why too large a cache is less optimal. When the cache gets too large, the latency to find an item in it (factoring in cache misses) begins to approach the latency of looking the item up in main memory. I don't know what proportions CPU designers aim for, but I would think it is something analogous to the 80-20 guideline: you'd like to find your most common data in the cache 80% of the time, and go to main memory for the other 20%. (Or whatever proportions the CPU designers intended.)

EDIT: I'm sure it's nowhere near 80%/20%, so substitute X and 1-X. :)
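To put numbers on that trade-off, the usual back-of-the-envelope formula is average memory access time (AMAT) = hit time + miss rate × miss penalty. Below is a minimal sketch in C; every cycle count in it is an illustrative assumption, not a figure for any real CPU:

```c
#include <stdio.h>

/* Average memory access time: hit_time + miss_rate * miss_penalty.
 * All latencies below are made-up cycle counts for illustration. */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* Small, fast L1: 4-cycle hits, 10% miss rate, 200-cycle memory. */
    printf("small L1: %.1f cycles\n", amat(4.0, 0.10, 200.0));
    /* Bigger but slower L1: fewer misses, but every hit costs more. */
    printf("big L1:   %.1f cycles\n", amat(12.0, 0.08, 200.0));
    return 0;
}
```

With these assumed numbers the small cache averages 24 cycles per access and the big one 28: the extra hits don't pay for the slower hit time, which is exactly the diminishing-returns effect described above.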

JMD

Posted 2009-11-18T16:45:41.147

Reputation: 4 427

6"When the cache gets too large the latency to find an item in the cache (factoring in cache misses) begins to approach the latency of looking up the item in main memory." Are you sure about this? For example doubling the amount of installed RAM will certainly not increase it's latency, why would this be true for cache? And also, why would the L2 cache grow bigger with new CPUs, if this is a problem? I'm no expert in this, I really want to know :) – sYnfo – 2009-11-18T19:18:36.080

I had prepared a big, long description of caching in software, and of measuring when your cache has outgrown itself and should be dumped/rebuilt, but then I decided it might be best to admit that I'm not a hardware designer. :) In any case, I suspect the answer can be summed up by the law of diminishing returns, i.e. more is not always better. – JMD – 2009-11-18T19:48:03.807

From my long history of fiddling with hardware at low levels, but not actually being a designer, I'd say that latency appears to be related to how many ways the cache is associative, not to its size. My guess is that the extra transistors that would go into the cache have proven to be more effective elsewhere for overall performance. – Brian Knoblauch – 2009-11-18T20:09:53.263

@JMD I'd be interested in that description nevertheless ;) Although comments are probably not the best place for this, true.
@Brian So, if I understand it correctly, they decided to put fewer transistors in the L1 cache and at the same time put many more in L2, which is significantly slower? Please take no offense, I'm just curious :)
– sYnfo – 2009-11-18T20:30:38.603

10

One factor is that L1 fetches start before the TLB translation is complete, so as to decrease latency. With a small enough cache and a high enough associativity (number of ways), the index bits of the cache are the same in the virtual and the physical address. This probably decreases the cost of maintaining memory coherency with a virtually-indexed, physically-tagged cache.
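Put numerically: the set index must fit inside the page offset (the bits that virtual and physical addresses share), which caps the cache at page size × associativity. A small C sketch of that arithmetic; the parameter values are simply the common x86 case, not figures taken from this answer:

```c
#include <stdio.h>

int main(void) {
    /* Common x86 values: 4 KB pages, 64-byte lines, 8-way associative. */
    unsigned page_size = 4096;
    unsigned line_size = 64;
    unsigned ways      = 8;

    /* In a virtually-indexed, physically-tagged (VIPT) cache the set
     * index must come from the page-offset bits, so each way can be at
     * most one page.  Total capacity is therefore page_size * ways. */
    unsigned max_size = page_size * ways;
    unsigned sets     = max_size / (line_size * ways);

    printf("max VIPT L1: %u KB (%u sets x %u ways)\n",
           max_size / 1024, sets, ways);  /* 32 KB (64 sets x 8 ways) */
    return 0;
}
```

Growing such an L1 means either larger pages or more ways, both expensive, which matches the numbers in the comment below.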

AJW

Posted 2009-11-18T16:45:41.147

Reputation: 101

I believe this is the reason, but let me give the numbers. The page size on the x86 architecture is 4096 bytes. The cache wants to choose the cache bucket in which to look for the entry of the cache line (64 bytes) before the page translation is complete. It would be expensive to have to decide between too many entries in a bucket, so each bucket only has 8 entries in it. As a result, for the last ten years, all the expensive x86 CPUs have had exactly 32768 bytes (512 cache lines) in their L1 data cache. – b_jonas – 2015-09-03T19:53:04.537

As this is so hard to increase, CPUs add middle levels of cache, so we have separate L2 and L3 caches now. Also, the L1 code cache and L1 data cache are separate, because the CPU knows whether it's accessing code or data. – b_jonas – 2015-09-03T19:54:31.093

Most interesting answer :) – CoffeDeveloper – 2014-01-03T16:52:53.100

8

Cache size is influenced by many factors:

  1. Speed of electrical signals (which is, if not the speed of light, of the same order of magnitude):

    • 300 meters in one microsecond.
    • 30 centimeters in one nanosecond. At 3 GHz a clock cycle lasts about 0.33 ns, in which a signal travels at most about 10 cm, so a cache must stay physically small to be reachable within a cycle.
  2. Economic cost (circuits at different cache levels may be built differently, and certain cache sizes may not be worth the expense):

    • Doubling cache size does not double performance (even if physics allowed that size to work): for small sizes, doubling gives much more than double the performance; for big sizes, it gives almost no extra performance.
    • On Wikipedia you can find a chart showing, for example, how little it pays to make caches bigger than 1 MB (bigger caches do exist, but keep in mind that those belong to multi-core processors).
    • For L1 caches there should be other charts (that vendors don't show) that make 64 KB a convenient size.

If L1 cache size hasn't changed since reaching 64 KB, it's because growing it further was no longer worth it. Also note that there is now a greater "culture" of cache awareness, and many programmers write "cache-friendly" code and/or use prefetch instructions to reduce latency, as in the sketch below.
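For instance, GCC and Clang expose a prefetch hint as __builtin_prefetch. Here is a minimal sketch of a traversal that requests data a few iterations ahead of use; the look-ahead distance of 8 elements is an arbitrary illustrative choice, not a recommendation:

```c
#include <stddef.h>

/* Sum an array while hinting upcoming elements into cache.
 * __builtin_prefetch(addr, rw, locality) is a GCC/Clang builtin;
 * the 8-element look-ahead is an assumption, tuned per workload. */
long sum_with_prefetch(const long *a, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8], 0 /* read */, 1 /* low temporal locality */);
        total += a[i];
    }
    return total;
}
```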

I once tried creating a simple program that accessed random locations in an array of several megabytes: that program almost froze the computer, because each random read moved a whole cache line from RAM to cache, and since that happened very often, this simple program drained all the memory bandwidth, leaving very few resources for the OS.
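A minimal reconstruction of that experiment (my sketch under stated assumptions, not the author's original code): it touches random bytes in a buffer far larger than any cache, so nearly every access misses and pulls a fresh 64-byte line from RAM.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE (64u * 1024u * 1024u)  /* 64 MB: far larger than any cache */

int main(void) {
    unsigned char *buf = calloc(BUF_SIZE, 1);
    if (!buf) return 1;

    uint64_t x = 88172645463325252ULL;  /* xorshift64 PRNG state */
    uint64_t sum = 0;
    for (uint32_t i = 0; i < 100u * 1000u * 1000u; i++) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;  /* xorshift64 step */
        /* Each random read almost certainly misses every cache level,
         * so a whole cache line is fetched from RAM just for one byte. */
        sum += buf[x % BUF_SIZE];
    }
    printf("%llu\n", (unsigned long long)sum);  /* defeat dead-code elimination */
    free(buf);
    return 0;
}
```

Timing this against a sequential pass over the same buffer shows the bandwidth cost the answer describes.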

CoffeDeveloper

Posted 2009-11-18T16:45:41.147

Reputation: 179

6

I believe it can be summed up simply by stating that the bigger the cache, the slower the access will be. So a larger cache simply doesn't help, since the whole point of a cache is to avoid slow bus communication with RAM.

Since processor speed has been increasing rapidly, a same-sized cache must perform faster and faster in order to keep up. So the caches may be significantly better in terms of speed, but not in terms of storage.

(I'm a software guy so hopefully this isn't woefully wrong)

Andrew Flanagan

Posted 2009-11-18T16:45:41.147

Reputation: 1 680

3

From L1 cache:

The Level 1 cache, or primary cache, is on the CPU and is used for temporary storage of instructions and data organised in blocks of 32 bytes. Primary cache is the fastest form of storage. Because it's built into the chip with a zero wait-state (delay) interface to the processor's execution unit, it is limited in size.

SRAM typically uses six transistors per bit and can hold data without external assistance for as long as power is supplied to the circuit. This contrasts with dynamic RAM (DRAM), which must be refreshed many times per second in order to hold its data contents.

Intel's P55 MMX processor, launched at the start of 1997, was noteworthy for the increase in size of its Level 1 cache to 32 KB. The AMD K6 and Cyrix M2 chips launched later that year upped the ante further by providing Level 1 caches of 64 KB. 64 KB has remained the standard L1 cache size, though various multiple-core processors may utilise it differently.

EDIT: Please note that this answer is from 2009 and CPUs have evolved enormously in the last 10 years. If you have arrived at this post, don't take all our answers here too seriously.

harrymc

Posted 2009-11-18T16:45:41.147

Reputation: 306 093

This is just a description of the situation; it does not explain anything about why. – Eonil – 2019-01-23T03:45:57.953

@Eonil - We could not provide the "why" answer even if we wanted to. However, diminishing returns on performance is a viable, reasonable explanation. When the question was written nearly a decade ago, it was much more expensive to increase the size without incurring a performance hit. This answer attempted, at the very least, to answer the intended question that was asked. – Ramhound – 2019-01-23T10:07:54.953

A typical SRAM cell is made up of six MOSFETs. Each bit in an SRAM is stored on four transistors (M1, M2, M3, M4) that form two cross-coupled inverters. (Source, Second Source) – lukecampbell – 2013-05-28T16:44:19.150

-2

Actually, L1 cache size IS the biggest bottleneck for speed in modern computers. The pathetically tiny L1 cache sizes may be the sweet spot for the price, but not for performance. L1 cache can be accessed at GHz frequencies, the same as processor operations, unlike RAM access, which is some 400x slower.

It is expensive and difficult to implement in the current two-dimensional design; however, it is technically doable, and the first company that does it successfully will have computers hundreds of times faster that still run cool, something which would produce major innovations in many fields that are currently accessible only through expensive and hard-to-program ASIC/FPGA configurations.

Some of these issues have to do with proprietary/IP concerns and corporate greed spanning decades now, where a puny and ineffectual cadre of engineers are the only ones with access to the inner workings, and are mostly given marching orders to squeeze out cost-effective, obfuscated, protectionist nonsense. Overly privatized research always leads to such technological stagnation or throttling (as we have seen in aerospace and autos from the big manufacturers, and soon in pharma). Open source and more sensible patent and trade-secret regulation benefiting the inventors and the public (rather than company bosses and stockholders) would help a lot here.

It should be a no-brainer to develop much larger L1 caches; this could and should have been done decades ago. We would be a lot further ahead in computers, and in the many scientific fields that use them, if we had.

Zack Barkley

Posted 2009-11-18T16:45:41.147

Reputation: 11