Explicit NUMA Optimizations

All the local memory and affinity rules cannot help out if all threads on all the nodes need access to the same memory regions.

It is, of course, possible to simply restrict the number of threads to a number supportable by the processors which are directly connected to the memory node.

This does not take advantage of SMP NUMA machines, though, and is therefore not a real option.

If the data in question is read-only there is a simple solution: replication.

Each node can get its own copy of the data so that no inter-node accesses are necessary.







Code to do this can look like this:

void *local_data(void) {
    /* One pointer per memory node; NNODES is an upper bound on
       the number of memory nodes in the system. */
    static void *data[NNODES];
    int node = NUMA_memnode_self_current_idx();
    if (node == -1)
        /* Cannot get node, pick one. */
        node = 0;
    if (data[node] == NULL)
        data[node] = allocate_data();
    return data[node];
}

void worker(void) {
    void *data = local_data();
    for (...)
        compute using data
}

In this code, the function worker prepares by obtaining a pointer to the local copy of the data with a call to local_data.

Then it proceeds with the loop, which uses this pointer.

The local_data function keeps a list of the already allocated copies of the data around.

Each system has a limited number of memory nodes, so the array holding the pointers to the per-node memory copies is of limited size.

The NUMA_memnode_system_count function from libNUMA returns this number.

If memory has not yet been allocated for the current node (recognized by a null pointer in data at the index returned by the NUMA_memnode_self_current_idx call), a new copy is allocated.
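As a variant, the per-node array can be sized at run time rather than with a compile-time NNODES. A minimal sketch, assuming the NUMA_memnode_system_count interface described above (error handling omitted):

#include <stdlib.h>   /* calloc */

static void **data;   /* one slot per memory node */
static int nnodes;

void init_local_data(void) {
    /* Query the number of memory nodes once at start-up. */
    nnodes = NUMA_memnode_system_count();
    data = calloc(nnodes, sizeof(*data));
}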


It is important to realize that nothing terrible happens if the threads get scheduled onto another CPU connected to a different memory node after the getcpu system call.


It just means that the accesses using the data variable in worker access memory on another memory node.

This slows the program down until data is computed anew, but that is all.

The kernel will always avoid gratuitous rebalancing of the per-CPU run queues.

If such a transfer happens it is usually for a good reason and will not happen again in the near future.






Things are more complicated when the memory area in question is writable.

Simple duplication will not work in this case.

Depending on the exact situation there might be a number of possible solutions.


For instance, if the writable memory region is used to accumulate results, it might be possible to first create a separate region for each memory node in which the results are accumulated.


Then, when this work is done, all the per-node memory regions are combined to get the total result.

This is essentially the map-reduce idea.

This technique can work even if the work never really stops, but intermediate results are needed.

The requirement for this approach is that the accumulation of a result is stateless, i.e., it does not depend on the previously collected results.
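To make this concrete, here is a minimal sketch of stateless per-node accumulation. It reuses the NUMA_memnode_self_current_idx interface and the NNODES bound from above; the atomic add and the helper names are assumptions of the sketch, not part of the text:

#include <stddef.h>
#include <stdint.h>

static uint64_t per_node_sum[NNODES];  /* one partial result per memory node */

void worker_accumulate(const uint64_t *chunk, size_t n) {
    int node = NUMA_memnode_self_current_idx();
    if (node == -1)
        /* Cannot get node, pick one. */
        node = 0;
    uint64_t local = 0;
    for (size_t i = 0; i < n; ++i)
        local += chunk[i];             /* stand-in for the real work */
    /* Stateless: just add into this node's partial result. */
    __atomic_fetch_add(&per_node_sum[node], local, __ATOMIC_RELAXED);
}

uint64_t combined_result(void) {
    /* After (or between) work phases: combine the per-node results. */
    uint64_t total = 0;
    for (int i = 0; i < NNODES; ++i)
        total += per_node_sum[i];
    return total;
}

For best effect each per-node accumulator should itself live on its node, for example allocated with the local_data technique shown earlier; the single static array here merely keeps the sketch short.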


The DMA (direct memory access) approach.

It will always be better, though, to have direct access to the writable memory region.


If the number of accesses to the memory region is substantial, it might be a good idea to force the kernel to migrate the memory pages in question to the local node.

If the number of accesses is really high, and the writes on different nodes do not happen concurrently, this could help.

But be aware that the kernel cannot perform miracles: the page migration is a copy operation and as such it is not cheap.

This cost has to be amortized.
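On Linux, such a migration can be requested with the move_pages(2) interface from libnuma; note this is the real kernel interface, not the proposed libNUMA used above, and the helper name is mine. A minimal sketch:

#define _GNU_SOURCE
#include <numa.h>      /* numa_node_of_cpu; link with -lnuma */
#include <numaif.h>    /* move_pages, MPOL_MF_MOVE */
#include <sched.h>     /* sched_getcpu */
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>    /* sysconf */

/* Migrate the pages backing [buf, buf+len) to the memory node of
   the CPU this thread currently runs on. */
long migrate_here(void *buf, size_t len) {
    long pagesize = sysconf(_SC_PAGESIZE);
    int target = numa_node_of_cpu(sched_getcpu());
    if (target < 0 || len == 0)
        return -1;
    uintptr_t start = (uintptr_t)buf & ~((uintptr_t)pagesize - 1);
    uintptr_t end = (uintptr_t)buf + len;
    size_t npages = (end - start + pagesize - 1) / pagesize;
    void **pages = malloc(npages * sizeof(*pages));
    int *nodes = malloc(npages * sizeof(*nodes));
    int *status = malloc(npages * sizeof(*status));
    long rc = -1;
    if (pages && nodes && status) {
        for (size_t i = 0; i < npages; ++i) {
            pages[i] = (void *)(start + i * pagesize);
            nodes[i] = target;
        }
        /* The migration is a copy operation; it only pays off when
           many subsequent accesses are local afterwards. */
        rc = move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE);
    }
    free(pages); free(nodes); free(status);
    return rc;
}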






The numbers in Figure 5.4 show that access to remote memory when the caches are ineffective is not measurably slower than access to local memory.


A while ago I looked at RDMA (Remote Direct Memory Access) and found it quite interesting.

This means a program could possibly save bandwidth to the local memory by writing data it does not have to read again into memory attached to another processor.

The bandwidth of the connection to the DRAM modules and the bandwidth of the interconnects are mostly independent, so parallel use could improve overall performance.
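Subject to the caveats that follow, such placement could be expressed with the real libnuma interface numa_alloc_onnode; the helper function and its parameters are illustrative assumptions:

#include <numa.h>     /* numa_alloc_onnode, numa_free; link with -lnuma */
#include <stddef.h>

/* Place a write-mostly output buffer on a remote node so that the
   stores travel over the interconnect instead of consuming local
   DRAM bandwidth. remote_node must be chosen with care. */
double *alloc_remote_output(size_t n, int remote_node) {
    return numa_alloc_onnode(n * sizeof(double), remote_node);
}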



Whether this is really possible depends on many factors.

One really has to be sure that caches are ineffective since otherwise the slowdown related to remote accesses is measurable.

Another big problem is whether the remote node has any needs for its own memory bandwidth.

This possibility must be examined in detail before the approach is taken.

In theory, using all the bandwidth available to a processor can have positive effects.

A family 10h Opteron processor can be directly connected to up to four other processors.

Utilizing all that additional bandwidth, perhaps coupled with appropriate prefetches (especially prefetchw), could lead to improvements if the rest of the system plays along.
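As an illustration, GCC's __builtin_prefetch with write intent can emit prefetchw on processors that support it; the prefetch distance below is a tuning assumption, and whether any of this helps is, as said, system-dependent:

#include <stddef.h>

/* Stream computed results into a (possibly remote) buffer while
   hinting write intent ahead of the stores. With rw == 1 GCC can
   emit prefetchw on targets that support it. */
void stream_results(double *out, const double *in, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        if (i + 8 < n)
            __builtin_prefetch(&out[i + 8], 1 /* write */, 0);
        out[i] = in[i] * in[i];  /* stand-in for the real computation */
    }
}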