Instruction Cache

Not only the data used by the processor is cached; the instructions executed by the processor are cached as well.

However, this cache is much less problematic than the data cache.

ps: Not only data but also instructions can be cached.

Reasons

There are several reasons:

  • Amount of code

The quantity of code which is executed depends on the size of the code that is needed.

The size of the code in general depends on the complexity of the problem.

The complexity of the problem is fixed.

  • The compiler knows the rules for generating instructions

While the program’s data handling is designed by the programmer, the program’s instructions are usually generated by a compiler.

The compiler writers know about the rules for good code generation.

  • Program flow is easier to predict

Program flow is much more predictable than data access patterns.

Today’s CPUs are very good at detecting patterns.

This helps with prefetching.

  • Principle of locality

Code always has quite good spatial and temporal locality.

Some Rules

There are a few rules programmers should follow, but these mainly consist of rules on how to use the tools.

We will discuss them in section 6.

Here we talk only about the technical details of the instruction cache.

Development of CPU Pipelines

Ever since the core clock of CPUs increased dramatically and the difference in speed between cache (even first level cache) and core grew, CPUs have been designed with pipelines.

That means the execution of an instruction happens in stages(阶段).

First an instruction is decoded, then the parameters are prepared, and finally it is executed.

Such a pipeline can be quite long (> 20 stages for Intel’s NetBurst architecture).

Long Pipelines

A long pipeline means that if the pipeline stalls (i.e., the instruction flow through it is interrupted) it takes a while to get up to speed again.

Pipeline stalls happen, for instance, if the location of the next instruction cannot be correctly predicted or if it takes too long to load the next instruction (e.g., when it has to be read from memory).

As a result, CPU designers spend a lot of time and chip real estate on branch prediction so that pipeline stalls happen as infrequently as possible.

ps: CPU designers do everything they can to keep the pipeline from stalling.
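
To make the link between branches and pipeline stalls more concrete, here is a small C sketch that is not part of the original text. It counts the elements above a limit in two ways: once with a data-dependent branch, and once by adding the comparison result directly. The function names are made up for illustration, and an optimizing compiler may well generate the same machine code for both, so this shows the idea rather than a guaranteed speedup.

```c
#include <stddef.h>

/* Branchy version: whether the jump is taken depends on the data, so
   with random input the branch predictor has little to work with. */
size_t count_above_branchy(const int *v, size_t n, int limit)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (v[i] > limit)
            ++count;
    }
    return count;
}

/* Branch-free source form: the comparison result (0 or 1) is added
   directly, so the loop body contains no data-dependent if statement. */
size_t count_above_branchless(const int *v, size_t n, int limit)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        count += (size_t)(v[i] > limit);
    return count;
}
```

Both functions return the same result. Any difference in speed would only show up with unpredictable input, and only if the compiler keeps the branch in the first version.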

Modern Processors

On CISC processors the decoding stage can also take some time.

The x86 and x86-64 processors are especially affected.

In recent years these processors therefore do not cache the raw byte sequence of the instructions in L1i; instead they cache the decoded instructions.

L1i in this case is called the “trace cache”.

Trace caching allows the processor to skip over the first steps of the pipeline in case of a cache hit which is especially good if the pipeline stalled.

L2 Cache

As said before, the caches from L2 on are unified caches which contain both code and data.

Obviously here the code is cached in the byte sequence form and not decoded.

Best Practices

To achieve the best performance there are only a few rules related to the instruction cache:

  • Generate code which is as small as possible.

There are exceptions when software pipelining for the sake of using pipelines requires creating more code or where the overhead of using small code is too high.

  • Help the processor make good prefetching decisions.

This can be done through code layout or with explicit prefetching.

These rules are usually enforced by the code generation of a compiler.

There are a few things the programmer can do and we will talk about them in section 6.

ps: Prefetching can improve performance. Keeping the code as small as possible lets more of it fit in the cache.
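
As one possible illustration of influencing code layout, the sketch below uses the GCC/Clang builtin `__builtin_expect` to tell the compiler which outcome of a branch is expected. The `likely`/`unlikely` macros and the `sum_checked` helper are illustrative names, not something defined in these notes, and the example assumes a GCC-compatible compiler.

```c
#include <stddef.h>

/* GCC/Clang builtin: states the expected value of a condition so the
   compiler can lay out the common path as straight-line code. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical helper: sums the entries of v. At the first negative
   entry it stores that index in *bad_index and returns -1. */
long sum_checked(const int *v, size_t n, size_t *bad_index)
{
    long sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (unlikely(v[i] < 0)) {   /* hint: the error path is rare */
            *bad_index = i;
            return -1;
        }
        sum += v[i];
    }
    return sum;
}
```

The hint does not change what the function computes. It only gives the compiler a reason to keep the frequently executed instructions together, which is the kind of layout decision the rules above refer to.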

Self Modifying Code

In early computer days memory was at a premium.

People went to great lengths to reduce the size of the program to make more room for program data.

One trick frequently deployed was to change the program itself over time.

SMC

Such Self Modifying Code (SMC) is occasionally still found, these days mostly for performance reasons or in security exploits.

SMC should in general be avoided.

Though it is generally executed correctly, there are boundary cases which are not, and it creates performance problems if not done correctly.

Obviously, code which is changed cannot be kept in the trace cache which contains the decoded instructions.

But even if the trace cache is not used because the code has not been executed at all (or for some time) the processor might have problems.

If an upcoming instruction is changed while it already entered the pipeline the processor has to throw away a lot of work and start all over again.

There are even situations where most of the state of the processor has to be tossed away.

SI Protocol

Finally, since the processor assumes (for simplicity reasons and because it is true in 99.9999999% of all cases) that the code pages are immutable, the L1i implementation does not use the MESI protocol but instead a simplified SI protocol.

This means if modifications are detected a lot of pessimistic assumptions have to be made.

Recommendations

It is highly advised to avoid SMC whenever possible.

Memory is not such a scarce resource anymore.

It is better to write separate functions instead of modifying one function according to specific needs.
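
A minimal sketch of that advice, assuming the variants are known at build time: instead of patching one function in place, each variant is written as its own function and one of them is selected through a function pointer. All names here (`scale_by_two`, `scale_by_three`, `select_scale`, `scale_all`) are hypothetical.

```c
#include <stddef.h>

/* Two separate variants of the same operation; neither is ever modified. */
static int scale_by_two(int x)   { return x * 2; }
static int scale_by_three(int x) { return x * 3; }

typedef int (*scale_fn)(int);

/* Hypothetical selector: picks one variant at run time. */
scale_fn select_scale(int factor)
{
    return factor == 3 ? scale_by_three : scale_by_two;
}

/* Applies the chosen variant to every element of v. */
void scale_all(int *v, size_t n, scale_fn fn)
{
    for (size_t i = 0; i < n; ++i)
        v[i] = fn(v[i]);
}
```

The code pages stay read-only, and the only thing that changes at run time is a pointer stored in data memory.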

Maybe one day SMC support can be made optional and we can detect exploit code trying to modify code this way.

If SMC absolutely has to be used, the write operations should bypass the cache so as not to create problems with data in L1d needed in L1i.

See section 6.1 for more information on these instructions.

How Linux Handles SMC

On Linux it is normally quite easy to recognize programs which contain SMC.

All program code is write-protected when built with the regular toolchain.

The programmer has to perform significant magic at link time to create an executable where the code pages are writable.

When this happens, modern Intel x86 and x86-64 processors have dedicated performance counters which count uses of self-modifying code.

With the help of these counters it is quite easy to recognize programs with SMC even if the program will succeed due to relaxed permissions.
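
The performance counters themselves are not shown here. As a rougher, complementary check, the following sketch looks for mappings that are both writable and executable, assuming the Linux `/proc/<pid>/maps` line format (`start-end perms offset dev inode [path]`). The function names are made up for illustration, and a writable code mapping is only a hint: programs such as JIT compilers create them legitimately.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if a line in /proc/<pid>/maps format describes a mapping
   that is both writable and executable, 0 otherwise. */
int mapping_is_writable_code(const char *line)
{
    char perms[5];
    /* Skip the address range, then read the 4-character perms field,
       for example "r-xp" or "rwxp". */
    if (sscanf(line, "%*s %4s", perms) != 1)
        return 0;
    return strlen(perms) == 4 && perms[1] == 'w' && perms[2] == 'x';
}

/* Counts writable and executable mappings of the calling process.
   Returns -1 if /proc/self/maps cannot be opened. */
int count_writable_code_mappings(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (f == NULL)
        return -1;

    char line[512];
    int count = 0;
    int at_line_start = 1;
    while (fgets(line, sizeof line, f) != NULL) {
        /* Only parse chunks that begin a new line of the file. */
        if (at_line_start)
            count += mapping_is_writable_code(line);
        at_line_start = strchr(line, '\n') != NULL;
    }
    fclose(f);
    return count;
}
```

For a program built with the regular toolchain and no run-time code generation, `count_writable_code_mappings()` would normally be expected to return 0 on Linux.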

References

P32