When it comes to multi-threading, there are three different aspects of cache use which are important:

• Concurrency

• Atomicity

• Bandwidth

These aspects also apply to multi-process situations but, because multiple processes are (mostly) independent, it is not so easy to optimize for them.

The possible multiprocess optimizations are a subset of those available for the multi-thread scenario.

So we will deal exclusively with the latter here.

In this context concurrency refers to the memory effects a process experiences when running more than one thread at a time.

A property of threads is that they all share the same address space and, therefore, can all access the same memory.

In the ideal case, the memory regions used by the threads most of the time are distinct, in which case those threads are coupled only lightly (common input and/or output, for instance).

If more than one thread uses the same data, coordination is needed; this is when atomicity comes into play.

Finally, depending on the machine architecture, the available memory and interprocessor bus bandwidth available to the processors is limited.

We will handle these three aspects separately in the following sections–although they are, of course, closely linked.

# Concurrency Optimizations

Initially, in this section, we will discuss two separate issues which actually require contradictory（矛盾） optimizations.

Normal cache optimization calls for keeping data together so that the footprint of the application is small, thus maximizing the amount of memory which fits into the caches at any one time.

## 问题

There is a problem with this approach, though: if multiple threads write to a memory location, the cache line must be in ‘E’ (exclusive) state in the L1d of each respective core.

This means that a lot of RFO messages are sent, in the worst case one for each write access.

So a normal write will be suddenly very expensive.

If the same memory location is used, synchronization is needed (maybe through the use of atomic operations, which is handled in the next section).

The problem is also visible, though, when all the threads are using different memory locations and are supposedly independent.

## 性能对比图

Figure 6.10 shows the results of this “false sharing”.

The test program (shown in section A.3) creates a number of threads which do nothing but increment a memory location (500 million times).

The measured time is from the program start until the program finishes after waiting for the last thread.

The threads are pinned to（固定到） individual processors.

The machine has four P4 processors.

The blue values represent runs where the memory allocations assigned to each thread are on separate cache lines.

The red part is the penalty（惩罚） occurred when the locations for the threads are moved to just one cache line.

### 蓝色

The blue measurements (time needed when using individual cache lines) match what one would expect.

The program scales without penalty to many threads.

Each processor keeps its cache line in its own L1d and there are no bandwidth issues since not much code or data has to be read (in fact, it is all cached).

The measured slight increase is really system noise and probably some prefetching effects (the threads use sequential cache lines).

### 测量过载

The measured overhead, computed by dividing the time needed when using one single cache line versus a separate cache line for each thread, is 390%, 734%, and 1,147% respectively.

These large numbers might be surprising at first sight but, when thinking about the cache interaction（相互作用） needed, it should be obvious（明显）.

The cache line is pulled from one processor’s cache just after it has finished writing to the cache line.

All processors, except the one which has the cache line at any given moment, are delayed and cannot do anything.

Each additional processor will just cause more delays.

## 如何解决这个问题

It is clear from these measurements that this scenario（脚本） must be avoided in programs.

Given the huge penalty, this problem is, in many situations, obvious (profiling will show the code location, at least) but there is a pitfall with modern hardware.

### 单核的测量信息

Figure 6.11 shows the equivalent measurements when running the code on a single processor, quad core machine (Intel Core 2 QX 6700).

Even with this processor’s two separate L2s the test case does not show any scalability issues.

There is a slight overhead when using the same cache line more than once but it does not increase with the number of cores.

If more than one of these processors were used we would, of course, see results similar to those in Figure 6.10.

Despite the increasing use of multi-core processors, many machines will continue to use multiple processors and, therefore, it is important to handle this scenario correctly, which might mean testing the code on real SMP machines.

## 简单的解决方案

There is a very simple “fix” for the problem: put every variable on its own cache line.

This is where the conflict with the previously mentioned optimization comes into play, specifically, the footprint of the application would increase a lot.

This is not acceptable; it is therefore necessary to come up with a more intelligent solution.

## 如何区分变量是否被属于竞态资源

What is needed is to identify which variables are used by only one thread at a time, those used by only one thread ever, and maybe those which are contested at times.

Different solutions for each of these scenarios are possible and useful.

The most basic criterion（标准） for the differentiation of variables is:

are they ever written to and how often does this happen.

## 变量的各种状态标记

Variables which are never written to and those which are only initialized once are basically constants.

Since RFO messages are only needed for write operations, constants can be shared in the cache (‘S’ state).

So, these variables do not have to be treated specially;

grouping them together is fine.

If the programmer marks the variables correctly with const, the tool chain will move the variables away from the normal variables into the .rodata (read-only data) or .data.rel.ro (read-only after relocation) section No other special action is required.

If, for some reason, variables cannot be marked correctly with const, the programmer can influence their placement by assigning them to a special section.

## 将基本不变的变量分组的可能性

When the linker constructs the final binary, it first appends the sections with the same name from all input files;

those sections are then arranged in an order determined by the linker script.

This means that, by moving all variables which are basically constant but are not marked as such into a special section, the programmer can group all of those variables together.

There will not be a variable which is often written to between them.

By aligning the first variable in that section appropriately（适当）,it is possible to guarantee that no false sharing happens.

## 例子

Assume this little example:

int foo = 1;
int bar __attribute__((section(".data.ro"))) = 2;
int baz = 3;
int xyzzy __attribute__((section(".data.ro"))) = 4;


If compiled, this input file defines four variables.

The interesting part is that the variables foo and baz, and bar and xyzzy are grouped together respectively（分别）.

Without the attribute definitions the compiler would allocate all four variables in the sequence in which they are defined in the source code the a section named .data. With the code as-is the variables bar and xyzzy are placed in a section named .data.ro.

The section name .data.ro is more or less arbitrary.

A prefix of .data. guarantees that the GNU linker will place the section together with the other data sections.

## 技术的其他应用场景

The same technique can be applied to separate out variables which are mostly read but occasionally also written to.

Simply choose a different section name.

This separation seems to make sense in some cases like the Linux kernel.

If a variable is only ever used by one thread, there is another way to specify the variable.

In this case it is possible and useful to use thread-local variables.

The C and C++ language in gcc allow variables to be defined as per-thread using the __thread keyword.

### gcc 中如何指定

ps: 可以将不变的进行分组，同理也就可以对主要读取+偶尔写入的进行分组。

int foo = 1;
int baz = 3;


The variables bar and xyzzy are not allocated in the normal data segment;

instead each thread has its own separate area where such variables are stored.

The variables can have static initializers.

Due to the variable being thread-local, false sharing is not a problem–unless the program artificially creates a problem.

This solution is easy to set up (the compiler and linker do all the work), but it has its cost.

When a thread is created, it has to spend some time on setting up the thread-local variables, which requires time and memory.

### 代价

• 浪费内存 & 耗时

ps: 应该就是 java 中的 ThreadLocal 这样可以避免不同线程间的信息

• 每个线程的副本都不同

Another drawback of using thread-local storage (TLS) is that, if the use of the variable shifts over to another thread, the current value of the variable in the old thread is not available to new thread.

Each thread’s copy of the variable is distinct.

Often this is not a problem at all and, if it is, the shift over to the new thread needs coordination, at which time the current value can be copied.

• 更大的问题资源的浪费

A bigger problem is possible waste of resources.

If only one thread ever uses the variable at any one time, all threads have to pay a price in terms of memory.

If a thread does not use any TLS variables, the lazy allocation of the TLS memory area prevents this from being a problem (except for TLS in the application itself).

If a thread uses just one TLS variable in a DSO, the memory for all the other TLS variables in this object will be allocated, too.

This could potentially（可能） add up if TLS variables are used on a large scale（规模大）.

# 最佳实践

In general the best advice which can be given is

Maybe extend this separation to read-mostly variables as a third category.

（2） Group read-write variables which are used together into a structure.

Using a structure is the only way to ensure the memory locations for all of those variables are close together in a way which is translated consistently by all gcc versions..

（3） Move read-write variables which are often written to by different threads onto their own cache line.

This might mean adding padding at the end to fill a remainder of the cache line. If combined with step 2, this is often not really wasteful.

## 拓展上述案例

Extending the example above, we might end up with code as follows (assuming bar and xyzzy are meant to be used together):

int foo = 1;
int baz = 3;
struct {
struct al1 {
int bar;
int xyzzy;
};