In the previous section we have already pointed out the problem we have when multiple processors come into play.
Even multi-core processors have the problem for those cache levels which are not shared (at least the L1d).
It is completely impractical to provide direct access from one processor to the cache of another processor.
The connection is simply not fast enough, for a start.
The practical alternative is to transfer the cache content over to the other processor in case it is needed.
Note that this also applies to caches which are not shared on the same processor.
The question now is when does this cache line transfer have to happen?
This question is pretty easy to answer:
when one processor needs a cache line which is dirty in another processor’s cache for reading or writing.
But how can a processor determine（确定） whether a cache line is dirty in another processor’s cache?
Assuming it just because a cache line is loaded by another processor would be suboptimal (at best).
Usually the majority of memory accesses are read accesses and the resulting cache lines are not dirty.
Processor operations on cache lines are frequent (of course, why else would we have this paper?) which means broadcasting information about changed cache lines after each write access would be impractical.
What developed over the years is the MESI cache coherency protocol (Modified, Exclusive, Shared, Invalid).
The protocol is named after the four states a cache line can be in when using the MESI protocol:
Modified: The local processor has modified the cache line. This also implies it is the only copy in any cache.
Exclusive: The cache line is not modified but known to not be loaded into any other processor’s cache.
Shared: The cache line is not modified and might exist in another processor’s cache.
Invalid: The cache line is invalid, i.e., unused.
This protocol developed over the years from simpler versions which were less complicated（不太复杂） but also less efficient（不太高效）.
With these four states it is possible to efficiently implement write-back caches while also supporting concurrent use of read-only data on different processors.
The state changes are accomplished without too much effort by the processors listening, or snooping（窥探）, on the other processors’ work.
Certain operations a processor performs are announced on external pins and thus make the processor’s cache handling visible to the outside.
The address of the cache line in question is visible on the address bus.
In the following description of the states and their transitions (shown in Figure 3.18) we will point out when the bus is involved.
Initially all cache lines are empty and hence also Invalid.
If data is loaded into the cache for writing the cache changes to Modified.
If the data is loaded for reading the new state depends on whether another processor has the cache line loaded as well.
If this is the case then the new state is Shared, otherwise Exclusive.
Request For Ownership
If a Modified cache line is read from or written to on the local processor, the instruction（指令） can use the current cache content and the state does not change.
If a second processor wants to read from the cache line the first processor has to send the content of its cache to the second processor and then it can change the state to Shared.
The data sent to the second processor is also received and processed by the memory controller which stores the content in memory.
If this did not happen the cache line could not be marked as Shared.
If the second processor wants to write to the cache line the first processor sends the cache line content and marks the cache line locally as Invalid.
This is the infamous（臭名昭著的） “Request For Ownership” (RFO) operation.
Performing this operation in the last level cache, just like the I->M transition is comparatively（相对） expensive.
For write-through caches we also have to add the time it takes to write the new cache line content to the next higher-level cache or the main memory, further increasing the cost.
ps: 这种 RFO 操作的代价是非常大的。
If a cache line is in the Shared state and the local processor reads from it no state change is necessary and the read request can be fulfilled from the cache.
If the cache line is locally written to the cache line can be used as well but the state changes to Modified.
It also requires that all other possible copies of the cache line in other processors are marked as Invalid.
Therefore the write operation has to be announced（声明） to the other processors via an RFO message.
If the cache line is requested for reading by a second processor nothing has to happen.
The main memory contains the current data and the local state is already Shared.
In case a second processor wants to write to the cache line (RFO) the cache line is simply marked Invalid.
No bus operation is needed.
The Exclusive state is mostly identical to the Shared state with one crucial（关键的） difference:
a local write operation does not have to be announced on the bus.
The local cache is known to be the only one holding this specific cache line.
This can be a huge advantage so the processor will try to keep as many cache lines as possible in the Exclusive state instead of the Shared state.
The latter is the fallback（倒退） in case the information is not available at that moment.
The Exclusive state can also be left out（抛弃） completely without causing functional problems.
It is only the performance that will suffer since the E->M transition is much faster than the S->M transition.
There are two situations when RFO messages are necessary:
A thread is migrated（迁移） from one processor to another and all the cache lines have to be moved over to the new processor once.
A cache line is truly needed in two different processors.
In multi-thread or multi-process programs there is always some need for synchronization;
this synchronization is implemented using memory.
So there are some valid RFO messages.
They still have to be kept as infrequent（罕见的） as possible.
There are other sources of RFO messages, though.
In section 6 we will explain these scenarios（场景）.
The Cache coherency（相关性） protocol messages must be distributed among the processors of the system.
A MESI transition cannot happen until it is clear that all the processors in the system have had a chance to reply to the message.
That means that the longest possible time a reply can take determines the speed of the coherency protocol.
Collisions on the bus are possible, latency can be high in NUMA systems, and of course sheer traffic volume can slow things down.
All good reasons to focus on avoiding unnecessary traffic.
There is one more problem related to having more than one processor in play.
The effects are highly machine specific but in principle the problem always exists:
the FSB is a shared resource.
In most machines all processors are connected via one single bus to the memory controller (see Figure 2.1).
If a single processor can saturate（饱和） the bus (as is usually the case) then two or four processors sharing the same bus will restrict（限制） the bandwidth available to each processor even more.
Even if each processor has its own bus to the memory controller as in Figure 2.2 there is still the bus to the memory modules.
Usually this is one bus but, even in the extended model in Figure 2.2, concurrent accesses to the same memory module will limit the bandwidth.
The same is true with the AMD model where each processor can have local memory.
All processors can indeed concurrently access their local memory quickly, especially with the integrated memory controller.
But multithread and multi-process programs at least from time to time have to access the same memory regions to synchronize.
Concurrency is severely limited by the finite bandwidth available for the implementation of the necessary synchronization.
Programs need to be carefully designed to minimize accesses from different processors and cores to the same memory locations.
The following measurements will show this and the other cache effects related to multi-threaded code.