CPU Cache Implementation Details

Cache implementers have the problem that each cell in the huge main memory potentially(可能) has to be cached.

If the working set of a program is large enough this means there are many main memory locations which fight for each place in the cache.

Previously it was noted that a ratio of 1-to-1000 for cache versus main memory size is not uncommon.



It would be possible to implement a cache where each cache line can hold a copy of any memory location (see Figure 3.5).

This is called a fully associative cache.

To access a cache line the processor core would have to compare the tags of each and every cache line with the tag for the requested address.

The tag would be comprised(构成) of the entire part of the address which is not the offset into the cache line (that means, S in the figure on page 15 is zero).

There are caches which are implemented like this but, by looking at the numbers for an L2 in use today, will show that this is impractical(不切实际).



Given a 4MB cache with 64B cache lines the cache would have 65,536 entries.

To achieve adequate performance the cache logic would have to be able to pick from all these entries the one matching a given tag in just a few cycles.

The effort(功夫) to implement this would be enormous(巨大).





For each cache line a comparator is needed to compare the large tag (note, S is zero).

The letter next to each connection indicates the width in bits.

If none is given it is a single bit line.

Each comparator has to compare two T-bit-wide values.

Then, based on the result, the appropriate cache line content is selected and made available.


This requires merging as many sets of O data lines as there are cache buckets.


The number of transistors(晶体管) needed to implement a single comparator is large especially since it must work very fast.

No iterative comparator is usable.

The only way to save on the number of comparators is to reduce the number of them by iteratively comparing the tags.

This is not suitable for the same reason that iterative comparators are not: it takes too long.

ps: 消耗了太多的时间。

Fully associative caches are practical for small caches (for instance, the TLB caches on some Intel processors are fully associative) but those caches are small, really small.

  • 一般的机器
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              40960K


We are talking about a few dozen(一打,12 个) entries at most.


For L1i, L1d, and higher level caches a different approach is needed.

What can be done is to restrict(限制,约束) the search.

In the most extreme(极端) restriction each tag maps to exactly one cache entry.

The computation is simple: given the 4MB/64B cache with 65,536 entries we can directly address each entry by using bits 6 to 21 of the address (16 bits).

The low 6 bits are the index into the cache line.


Such a direct-mapped cache is fast and relatively easy to implement as can be seen in Figure 3.6.


It requires exactly one comparator, one multiplexer (复用器)(two in this diagram where tag and data are separated, but this is not a hard requirement on the design), and some logic to select only valid cache line content.

The comparator is complex due to the speed requirements but there is only one of them now;

as a result more effort can be spent on making it fast.

  • 改进后收益




The real complexity in this approach lies in the multiplexers.

The number of transistors in a simple multiplexer grows with O(log N), where N is the number of cache lines.

This is tolerable(可容忍的) but might get slow, in which case speed can be increased by spending more real estate on transistors in the multiplexers to parallelize some of the work and to increase the speed.

  • 空间换时间


虽然 cache size 的增长,晶体管可以增长的很慢。

The total number of transistors can grow slowly with a growing cache size which makes this solution very attractive(有吸引力).


But it has a drawback:

it only works well if the addresses used by the program are evenly distributed with respect to the bits used for the direct mapping.

If they are not, and this is usually the case, some cache entries are heavily used and therefore repeated evicted while others are hardly used at all or remain empty.

  • 均匀分布

说道数据的分布,就会想到 hash。

说道均匀分布,那么增删节点,其实可以参考 一致性 hash


解决 cache 分布均匀的问题




This problem can be solved by making the cache set associative.

A set-associative cache combines the good features of the full associative and direct-mapped cachesto largely avoid the weaknesses of those designs.

ps: 这种设计将前面两种的优缺点进行整合,融合了二者的长度。

Figure 3.7 shows the design of a set-associative cache.

The tag and data storage are divided into sets, one of which is selected by the address of a cache line.

This is similar to the direct-mapped cache.

But instead of only having one element for each set value in the cache a small number of values is cached for the same set value.

The tags for all the set members are compared in parallel, which is similar to the functioning of the fully associative cache.



The result is a cache which is not easily defeated by unfortunate–or deliberate–selection of addresses with the same set numbers and at the same time the size of the cache is not limited by the number of comparators which can be implemented economically.


If the cache grows it is (in this figure) only the number of columns which increases, not the number of rows.

The number of rows (and therefore comparators) only increases if the associativity of the cache is increased.

Today processors are using associativity levels of up to 24 for L2 caches or higher.

L1 caches usually get by with 8 sets.



Given our 4MB/64B cache and 8-way set associativity the cache we are left with has 8,192 sets and only 13 bits of the tag are used in addressing the cache set.

To determine(确认) which (if any) of the entries in the cache set contains the addressed cache line 8 tags have to be compared.

That is feasible to do in very short time.

With an experiment we can see that this makes sense.

鉴于我们的 4MB/64B 高速缓存和8路组关联性,我们剩下的高速缓存有 8,192 组,并且只有 13 位标签用于寻址高速缓存集。

为了确定高速缓存组中的哪些条目(如果有的话)包含所寻址的高速缓存行 8,必须比较标签。



Table 3.1 shows the number of L2 cache misses for a program (gcc in this case, the most important benchmark of them all, according to the Linux kernel people) for changing cache size, cache line size, and associativity set size.

In section 7.2 we will introduce the tool to simulate the caches as required for this test.


Just in case this is not yet obvious(明显), the relationship of all these values is that the cache size is

cache line size × associativity × number of sets

The addresses are mapped into the cache by using

O = log2 cache line size
S = log2 number of sets

in the way the figure on page 15 shows.



Figure 3.8 makes the data of the table more comprehensible(容易理解的,直观的).

It shows the data for a fixed cache line size of 32 bytes.

Looking at the numbers for a given cache size we can see that associativity can indeed help to reduce the number of cache misses significantly(显著地).

For an 8MB cache going from direct mapping to 2-way set associative cache saves almost 44% of the cache misses.

The processor can keep more of the working set in the cache with a set associative cache compared with a direct mapped cache.



In the literature(文献) one can occasionally(偶尔) read that introducing associativity has the same effect as doubling cache size.

This is true in some extreme cases as can be seen in the jump from the 4MB to the 8MB cache.

But it certainly is not true for further doubling of the associativity.

As we can see in the data, the successive gains are much smaller.

We should not completely discount the effects, though.

In the example program the peak memory use is 5.6M.

So with a 8MB cache there are unlikely to be many (more than two) uses for the same cache set.

With a larger working set the savings can be higher as we can see from the larger benefits of associativity for the smaller cache sizes.


  • 命中率与缓存大小


缓存无限大,把所有数据都缓存进去。命中率 100%。




In general, increasing the associativity of a cache above 8 seems to have little effects for a single-threaded workload(负载).

With the introduction of hyper-threaded(超线程) processors where the first level cache is shared and multi-core processors which use a shared L2 cache the situation changes.

Now you basically have two programs hitting on the same cache which causes the associativity in practice to be halved (or quartered for quad-core processors).

So it can be expected that, with increasing numbers of cores, the associativity of the shared caches should grow.

Once this is not possible anymore (16-way set associativity is already hard) processor designers have to start using shared L3 caches and beyond, while L2 caches are potentially(可能) shared by a subset of the cores.


Another effect we can study in Figure 3.8 is how the increase in cache size helps with performance.

This data cannot be interpreted without knowing about the working set size.

Obviously, a cache as large as the main memory would lead to better results than a smaller cache, so there is in general no limit to the largest cache size with measurable benefits.


As already mentioned above, the size of the working set at its peak is 5.6M.

This does not give us any absolute number of the maximum beneficial cache size but it allows us to estimate(估计) the number.

ps: 这个数据的大小需要根据实际的业务来进行界定调整。

The problem is that not all the memory used is contiguous and, therefore, we have, even with a 16M cache and a 5.6M working set, conflicts (see the benefit of the 2-way set associative 16MB cache over the direct mapped version).

ps: 这里还有个问题需要注意,并不是所有的内存都是连续的。而不连续的内存就会导致访问变慢。顺序读永远是最快的。

But it is a safe bet that with the same workload the benefits of a 32MB cache would be negligible.(微不足道)

动态的看待 workload & cache size

But who says the working set has to stay the same?

Workloads are growing over time and so should the cache size.

When buying machines, and one has to choose the cache size one is willing to pay for, it is worthwhile to measure the working set size.

Why this is important can be seen in the figures on page 21.


Two types of tests are run.

In the first test the elements are processed sequentially.

The test program follows the pointer n but the array elements are chained so that they are traversed in the order in which they are found in memory.

This can be seen in the lower part of Figure 3.9.

There is one back reference from the last element.

In the second test (upper part of the figure) the array elements are traversed in a random order.

In both cases the array elements form a circular single-linked list.


如果类比数据结构的话,就是一个是 arrayList,另一个是 linkedList。