The purpose of prefetching is to hide the latency of a memory access.
The command pipeline and out-of-order (OOO) execution capabilities of today’s processors can hide some latency but, at best, only for accesses which hit the caches.
To cover the latency of main memory accesses, the command queue would have to be incredibly long.
Some processors without OOO try to compensate by increasing the number of cores, but this is a bad trade unless all the code in use is parallelized.
Prefetching can further help to hide latency.
The processor performs prefetching on its own, triggered by certain events (hardware prefetching) or explicitly requested by the program (software prefetching).
The trigger for the CPU to start hardware prefetching is usually a sequence of two or more cache misses in a certain pattern.
These cache misses can be to succeeding or preceding cache lines.
In old implementations only cache misses to adjacent cache lines are recognized.
With contemporary hardware, strides are recognized as well, meaning that skipping a fixed number of cache lines is recognized as a pattern and handled appropriately.
It would be bad for performance if every single cache miss triggered a hardware prefetch.
Random memory access patterns, for instance to global variables, are quite common and the resulting prefetches would mostly waste FSB bandwidth.
This is why, to kickstart prefetching, at least two cache misses are needed.
Processors today all expect there to be more than one stream of memory accesses.
The processor tries to automatically assign each cache miss to such a stream and, if the threshold is reached, start hardware prefetching.
CPUs today can keep track of eight to sixteen separate streams for the higher level caches.
The units responsible for the pattern recognition are associated with the respective cache.
There can be a prefetch unit for the L1d and L1i caches.
There is most probably a prefetch unit for the L2 cache and higher.
The prefetch unit for L2 and higher is shared by all the other cores and hyper-threads using the same cache.
The number of eight to sixteen separate streams therefore is quickly reduced.
Prefetching has one big weakness: it cannot cross page boundaries.
The reason should be obvious when one realizes that the CPUs support demand paging.
If the prefetcher were allowed to cross page boundaries, the access might trigger an OS event to make the page available.
This by itself can be bad, especially for performance.
What is worse is that the prefetcher does not know about the semantics of the program or the OS itself.
It might therefore prefetch pages which, in real life, never would be requested.
That means the prefetcher would run past the end of the memory region the processor accessed in a recognizable pattern before.
This is not only possible, it is very likely.
If the processor, as a side effect of a prefetch, triggered a request for such a page the OS might even be completely thrown off its tracks if such a request could never otherwise happen.
It is therefore important to realize that, regardless of how good the prefetcher is at predicting the pattern, the program will experience cache misses at page boundaries unless it explicitly prefetches or reads from the new page.
This is another reason to optimize the layout of data as described in section 6.2 to minimize cache pollution by keeping unrelated data out.
Because of this page limitation the processors today do not have terribly sophisticated logic to recognize prefetch patterns.
With the still predominant 4k page size there is only so much which makes sense.
The address range in which strides are recognized has been increased over the years, but it probably does not make much sense to go beyond the 512 byte window which is often used today.
Currently prefetch units do not recognize non-linear access patterns.
It is more likely than not that such patterns are truly random or, at least, sufficiently non-repeating that it makes no sense to try recognizing them.
If hardware prefetching is accidentally triggered there is only so much one can do.
One possibility is to try to detect this problem and change the data and/or code layout a bit.
This is likely to prove hard.
There might be special localized solutions like using the ud2 instruction on x86 and x86-64 processors.
This instruction, which cannot be executed itself, is used after an indirect jump instruction; it is used as a signal to the instruction fetcher that the processor should not waste efforts decoding the following memory since the execution will continue at a different location.
This is a very special situation, though.
In most cases one has to live with this problem.
It is possible to completely or partially disable hardware prefetching for the entire processor.
On Intel processors a Model Specific Register (MSR) is used for this (IA32_MISC_ENABLE, bit 9 on many processors; bit 19 disables only the adjacent cache line prefetch).
This, in most cases, has to happen in the kernel since it is a privileged operation.
If profiling shows that an important application running on a system suffers from bandwidth exhaustion and premature cache evictions due to hardware prefetches, using this MSR is a possibility.
The advantage of hardware prefetching is that programs do not have to be adjusted.
The drawbacks, as just described, are that the access patterns must be trivial and that prefetching cannot happen across page boundaries.
For these reasons we now have more possibilities, software prefetching the most important of them.
Software prefetching does require modification of the source code by inserting special instructions.
Some compilers support pragmas to more or less automatically insert prefetch instructions.
On x86 and x86-64 Intel’s convention for compiler intrinsics to insert these special instructions is generally used:

    #include <xmmintrin.h>
    enum _mm_hint
    {
      _MM_HINT_T0 = 3,
      _MM_HINT_T1 = 2,
      _MM_HINT_T2 = 1,
      _MM_HINT_NTA = 0
    };
    void _mm_prefetch(void *p,
                      enum _mm_hint h);

Programs can use the _mm_prefetch intrinsic on any pointer in the program.
Most processors (certainly all x86 and x86-64 processors) ignore errors resulting from invalid pointers which makes the life of the programmer significantly easier.
If the passed pointer references valid memory, the prefetch unit will be instructed to load the data into cache and, if necessary, evict other data.
Unnecessary prefetches should definitely be avoided since this might reduce the effectiveness of the caches and it consumes memory bandwidth (possibly for two cache lines in case the evicted cache line is dirty).
The different hints to be used with the _mm_prefetch intrinsic are implementation defined.
That means each processor version can implement them (slightly) differently.
What can generally be said is that _MM_HINT_T0 fetches data to all levels of the cache for inclusive caches and to the lowest level cache for exclusive caches.
If the data item is in a higher level cache it is loaded into L1d.
The _MM_HINT_T1 hint pulls the data into L2 and not into L1d.
If there is an L3 cache the _MM_HINT_T2 hint can do something similar for it.
These are details, though, which are weakly specified and need to be verified for the actual processor in use.
In general, if the data is to be used right away using _MM_HINT_T0 is the right thing to do.
Of course this requires that the L1d cache size is large enough to hold all the prefetched data.
If the size of the immediately used working set is too large, prefetching everything into L1d is a bad idea and the other two hints should be used.
The fourth hint, _MM_HINT_NTA, allows telling the processor to treat the prefetched cache line specially.
NTA stands for non-temporal aligned which we already explained in section 6.1.
The program tells the processor that polluting caches with this data should be avoided as much as possible since the data is only used for a short time.
The processor can therefore, upon loading, avoid reading the data into the lower level caches for inclusive cache implementations.
When the data is evicted from L1d the data need not be pushed into L2 or higher but, instead, can be written directly to memory.
There might be other tricks the processor designers can deploy if this hint is given.
The programmer must be careful using this hint: if the immediate working set size is too large and forces eviction of a cache line loaded with the NTA hint, reloading from memory will occur.
Figure 6.7 shows the results of a test using the now familiar pointer chasing framework.
The list is randomly laid out in memory.
The difference from the previous tests is that the program actually spends some time at each list node (about 160 cycles).
As we learned from the data in Figure 3.15, the program’s performance suffers badly as soon as the working set size is larger than the last-level cache.
We can now try to improve the situation by issuing prefetch requests ahead of the computation.
I.e., in each round of the loop we prefetch a new element.
The distance between the prefetched node in the list and the node which is currently worked on must be carefully chosen.
Given that each node is processed in 160 cycles and that we have to prefetch two cache lines (NPAD=31), a distance of five list elements is enough.
The results in Figure 6.7 show that the prefetch does indeed help.
As long as the working set size does not exceed the size of the last level cache (the machine has 512kB of L2) the numbers are identical.
The prefetch instructions do not add a measurable extra burden.
As soon as the L2 size is exceeded the prefetching saves between 50 and 60 cycles, up to 8%.
The use of prefetch cannot hide all the penalties but it does help a bit.
AMD implements, in their family 10h of the Opteron line, another instruction: prefetchw.
This instruction has so far no equivalent on the Intel side and is not available through intrinsics.
The prefetchw instruction tells the CPU to prefetch the cache line into L1 just like the other prefetch instructions.
The difference is that the cache line is immediately put into ’M’ state.
This will be a disadvantage if no write to the cache line follows later.
If there are one or more writes, they will be accelerated since the writes do not have to change the cache state; that transition already happened when the cache line was prefetched.
This is especially important for contended cache lines where a simple read of a cache line in another processor’s cache would first change the state to ’S’ in both caches.
Prefetching can have bigger advantages than the meager 8% we achieved here.
But it is notoriously hard to do right, especially if the same binary is supposed to perform well on a variety of machines.
The performance counters provided by the CPU can help the programmer to analyze prefetches.
Events which can be counted and sampled include hardware prefetches, software prefetches, useful/used software prefetches, cache misses at the various levels, and more.
In section 7.1 we will introduce a number of these events.
All these counters are machine specific.
When analyzing programs one should first look at the cache misses.
When a large source of cache misses is located one should try to add a prefetch instruction for the problematic memory accesses.
This should be done in one place at a time.
The result of each modification should be checked by observing the performance counters measuring useful prefetch instructions.
If those counters do not increase, the prefetch might be wrong, it might not be given enough time to load from memory, or the prefetch might evict memory from the cache which is still needed.
gcc today is able to emit prefetch instructions automatically in one situation.
If a loop is iterating over an array the following option can be used:

    -fprefetch-loop-arrays
The compiler will figure out whether prefetching makes sense and, if so, how far ahead it should look.
For small arrays this can be a disadvantage and, if the size of the array is not known at compile time, the results might be worse.
The gcc manual warns that the benefits highly depend on the form of the code and that in some situations the code might actually run slower.
Programmers have to use this option carefully.
Special Kind of Prefetch: Speculation
The out-of-order (OOO) execution capability of a modern processor allows moving instructions around if they do not conflict with each other.
For instance (using this time IA-64 for the example):
    st8 [r4] = 12
    add r5 = r6, r7;;
    st8 [r18] = r5
This code sequence stores 12 at the address specified by register r4, adds the content of registers r6 and r7 and stores it in register r5.
Finally it stores the sum at the address specified by register r18.
The point here is that the add instruction can be executed before–or at the same time as–the first st8 instruction since there is no data dependency.
But what happens if one of the addends has to be loaded?
    st8 [r4] = 12
    ld8 r6 = [r8];;
    add r5 = r6, r7;;
    st8 [r18] = r5
The extra ld8 instruction loads the value from the address specified by the register r8.
There is an obvious data dependency between this load instruction and the following add instruction (this is the reason for the ;; after the instruction, thanks for asking).
What is critical here is that the new ld8 instruction–unlike the add instruction–cannot be moved in front of the first st8.
The processor cannot determine quickly enough during the instruction decoding whether the store and load conflict, i.e., whether r4 and r8 might have same value.
If they do have the same value, the st8 instruction would determine the value loaded into r6.
What is worse, the ld8 might also bring with it a large latency in case the load misses the caches.
The IA-64 architecture supports speculative loads for this case:
    ld8.a r6 = [r8];;
    [... other instructions ...]
    st8 [r4] = 12
    ld8.c.clr r6 = [r8];;
    add r5 = r6, r7;;
    st8 [r18] = r5
The new ld8.a and ld8.c.clr instructions belong together and replace the ld8 instruction in the previous code sequence.
The ld8.a instruction is the speculative load.
The value cannot be used directly but the processor can start the work.
At the time when the ld8.c.clr instruction is reached the content might have been loaded already (given there is a sufficient number of instructions in the gap).
The arguments for this instruction must match that for the ld8.a instruction.
If the preceding st8 instruction does not overwrite the value (i.e., r4 and r8 are the same), nothing has to be done.
The speculative load does its job and the latency of the load is hidden.
If the store and load do conflict the ld8.c.clr reloads the value from memory and we end up with the semantics of a normal ld8 instruction.
Speculative loads are not (yet?) widely used.
But as the example shows it is a very simple yet effective way to hide latencies.
Prefetching is basically equivalent and, for processors with few registers, speculative loads probably do not make much sense.
Speculative loads have the (sometimes big) advantage of loading the value directly into the register and not into the cache line where it might be evicted again (for instance, when the thread is descheduled).
If speculation is available it should be used.
When one tries to use software prefetching one often runs into problems with the complexity of the code.
If the code has to iterate over a data structure (a list in our case) one has to implement two independent iterations in the same loop:
the normal iteration doing the work and the second iteration, which looks ahead, to use prefetching.
This easily gets complex enough that mistakes are likely.
Furthermore, it is necessary to determine how far to look ahead.
Too little and the memory will not be loaded in time.
Too far and the just loaded data might have been evicted again.
Another problem is that prefetch instructions, although they do not block and wait for the memory to be loaded, take time.
The instruction has to be decoded, which might be noticeable if the decoder is too busy, for instance, due to well written/generated code.
Finally, the code size of the loop is increased.
This decreases the L1i efficiency.
If one tries to avoid parts of this cost by issuing multiple prefetch requests in a row (in case the second load does not depend on the result of the first) one runs into problems with the number of outstanding prefetch requests.
An alternative approach is to perform the normal operation and the prefetch completely separately.
This can happen using two normal threads.
The threads must obviously be scheduled so that the prefetch thread is populating a cache accessed by both threads.
There are two special solutions worth mentioning:
1. Use hyper-threads (see page 29) on the same core.
In this case the prefetch can go into L2 (or even L1d).

2. Use “dumber” threads than SMT threads which can do nothing but prefetch and other simple operations.
This is an option processor manufacturers might explore.
The use of hyper-threads is particularly intriguing.
As we have seen on page 29, the sharing of caches is a problem if the hyper-threads execute independent code.
If, instead, one thread is used as a prefetch helper thread this is not a problem.
To the contrary, it is the desired effect since the lowest level cache is preloaded.
Furthermore, since the prefetch thread is mostly idle or waiting for memory, the normal operation of the other hyper-thread is not disturbed much if it does not have to access main memory itself.
The latter is exactly what the prefetch helper thread prevents.
The only tricky part is to ensure that the helper thread is not running too far ahead.
It must not completely pollute the cache so that the oldest prefetched values are evicted again.
On Linux, synchronization is easily done using the futex system call or, at a little bit higher cost, using the POSIX thread synchronization primitives.
The benefits of the approach can be seen in Figure 6.8.
This is the same test as in Figure 6.7 only with the additional result added.
The new test creates an additional helper thread which runs about 100 list entries ahead and reads (not only prefetches) all the cache lines of each list element.
In this case we have two cache lines per list element (NPAD=31 on a 32-bit machine with 64 byte cache line size).
The two threads are scheduled on two hyper-threads of the same core.
The test machine has only one core but the results should be about the same if there is more than one core.
The affinity functions, which we will introduce in section 6.4.3, are used to tie the threads down to the appropriate hyper-thread.
To determine which two (or more) processors the OS knows are hyper-threads, the NUMA_cpu_level_mask interface from libNUMA can be used (see Appendix D).
    ssize_t NUMA_cpu_level_mask(size_t destsize,
                                cpu_set_t *dest,
                                size_t srcsize,
                                const cpu_set_t *src,
                                unsigned int level);
This interface can be used to determine the hierarchy of CPUs as they are connected through caches and memory.
Of interest here is level 1 which corresponds to hyper-threads.
To schedule two threads on two hyper-threads the libNUMA functions can be used (error handling dropped for brevity):

    cpu_set_t self;
    NUMA_cpu_self_current_mask(sizeof(self), &self);
    cpu_set_t hts;
    NUMA_cpu_level_mask(sizeof(hts), &hts,
                        sizeof(self), &self, 1);
    CPU_XOR(&hts, &hts, &self);
After this code is executed we have two CPU bit sets.
self can be used to set the affinity of the current thread and the mask in hts can be used to set the affinity of the helper thread.
This should ideally happen before the thread is created.
In section 6.4.3 we will introduce the interface to set the affinity.
If there is no hyper-thread available the NUMA_cpu_level_mask function will return 1.
This can be used as a sign to avoid this optimization.
The result of this benchmark might be surprising (or perhaps not).
If the working set fits into L2, the overhead of the helper thread reduces the performance by between 10% and 60% (mostly at the lower end, ignore the smallest working set sizes again, the noise is too high).
This should be expected since, if all the data is already in the L2 cache, the prefetch helper thread only uses system resources without contributing to the execution.
Once the L2 size is no longer sufficient the picture changes, though.
The prefetch helper thread helps to reduce the runtime by about 25%.
We still see a rising curve simply because the prefetches cannot be processed fast enough.
The arithmetic operations performed by the main thread and the memory load operations of the helper thread do complement each other, though.
The resource collisions are minimal which causes this synergistic effect.
The results of this test should be transferable to many other situations.
Hyper-threads, often not useful due to cache pollution, shine in these situations and should be taken advantage of.
The NUMA library introduced in Appendix D makes finding thread siblings very easy (see the example in that appendix).
If the library is not available the sys file system allows a program to find the thread siblings (see the thread_siblings column in Table 5.3).
Once this information is available the program just has to define the affinity of the threads and then run the loop in two modes: normal operation and prefetching.
The amount of memory prefetched should depend on the size of the shared cache.
In this example the L2 size is relevant and the program can query the size using sysconf(_SC_LEVEL2_CACHE_SIZE).
Whether or not the progress of the helper thread must be restricted depends on the program.
In general it is best to make sure there is some synchronization since scheduling details could otherwise cause significant performance degradations.
Direct Cache Access
One source of cache misses in a modern OS is the handling of incoming data traffic.
Modern hardware, like Network Interface Cards (NICs) and disk controllers, has the ability to write the received or read data directly into memory without involving the CPU.
This is crucial for the performance of the devices we have today, but it also causes problems.
Assume an incoming packet from a network: the OS has to decide how to handle it by looking at the header of the packet.
The NIC places the packet into memory and then notifies the processor about the arrival.
The processor has no chance to prefetch the data since it does not know when the data will arrive, and maybe not even where exactly it will be stored.
The result is a cache miss when reading the header.
Intel has added technology in their chipsets and CPUs to alleviate this problem.
The idea is to populate the cache of the CPU which will be notified about the incoming packet with the packet’s data.
The payload of the packet is not critical here, this data will, in general, be handled by higher-level functions, either in the kernel or at user level.
The packet header is used to make decisions about the way the packet has to be handled and so this data is needed immediately.
The network I/O hardware already has DMA to write the packet.
That means it communicates directly with the memory controller which potentially is integrated in the Northbridge.
Another side of the memory controller is the interface to the processors through the FSB (assuming the memory controller is not integrated into the CPU itself).
The idea behind Direct Cache Access (DCA) is to extend the protocol between the NIC and the memory controller.
In Figure 6.9 the first figure shows the beginning of the DMA transfer in a regular machine with North- and Southbridge.
The NIC is connected to (or is part of) the Southbridge.
It initiates the DMA access but provides the new information about the packet header which should be pushed into the processor’s cache.
The traditional behavior would be, in step two, to simply complete the DMA transfer with the connection to the memory.
For the DMA transfers with the DCA flag set the Northbridge additionally sends the data on the FSB with a special, new DCA flag.
The processor always snoops the FSB and, if it recognizes the DCA flag, it tries to load the data directed to the processor into the lowest cache.
The DCA flag is, in fact, a hint; the processor can freely ignore it. After the DMA transfer is finished the processor is signaled.
The OS, when processing the packet, first has to determine what kind of packet it is.
If the DCA hint is not ignored, the loads the OS has to perform to identify the packet most likely hit the cache.
Multiply this saving of hundreds of cycles per packet with tens of thousands of packets which can be processed per second, and the savings add up to very significant numbers, especially when it comes to latency.
Without the integration of I/O hardware (a NIC in this case), chipset, and CPUs such an optimization is not possible.
It is therefore necessary to make sure to select the platform wisely if this technology is needed.
- Hardware Prefetching
- Software Prefetching
- Special Kind of Prefetch: Speculation
- Helper Threads
- Direct Cache Access