Swapping and Policies

If physical memory runs out, the system has to drop clean pages and save dirty pages to swap.

The Linux swap implementation discards node information when it writes pages to swap.

That means that when the page is reused and paged in, the node to be used is chosen from scratch.

The policies for the thread will likely cause a node which is close to the executing processors to be chosen, but the node might be different from the one used before.



This changing association means that the node association cannot be stored by a program as a property of the page.

The association can change over time.

For pages which are shared with other processes this can also happen because a process asks for it (see the discussion of mbind below).

The kernel by itself can migrate pages if one node runs out of space while other nodes still have free space.

Any node association the user-level code learns about can therefore be true for only a short time.

It is more of a hint than absolute information.

Whenever accurate knowledge is required, the get_mempolicy interface should be used (see section 6.5.5).
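
Such a lookup could be sketched as follows. This is a minimal sketch, assuming Linux and the raw get_mempolicy system call invoked through syscall(2) so that no libnuma headers are needed. The helper name current_node_of is hypothetical, and the flag constants are repeated locally for illustration only.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values from the kernel's <linux/mempolicy.h>, repeated here only so the
   sketch builds without libnuma's <numaif.h>. */
#ifndef MPOL_F_NODE
#define MPOL_F_NODE (1 << 0)   /* return a node ID instead of a policy mode */
#endif
#ifndef MPOL_F_ADDR
#define MPOL_F_ADDR (1 << 1)   /* look the information up for an address */
#endif

/* Hypothetical helper: returns the node that backs the page containing
   addr at the time of the call, or -1 with errno set if the query fails. */
int current_node_of(void *addr)
{
    int node = -1;
    long rc = syscall(SYS_get_mempolicy, &node, NULL, 0UL, addr,
                      (unsigned long)(MPOL_F_NODE | MPOL_F_ADDR));
    return rc == 0 ? node : -1;
}
```

The result is only a snapshot: for the reasons given above, the page may be on a different node by the time the caller acts on the answer.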

VMA Policy


To set the VMA policy for an address range a different interface has to be used:

#include <numaif.h>
long mbind(void *start, unsigned long len,
           int mode,
           unsigned long *nodemask,
           unsigned long maxnode,
           unsigned flags);


This interface registers a new VMA policy for the address range [start, start + len).

Since memory handling operates on pages, the start address must be page-aligned.

The len value is rounded up to the next multiple of the page size.
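
When the region of interest is an arbitrary buffer rather than a fresh mapping, the caller has to widen it to whole pages first. The arithmetic can be sketched like this; the helper name page_span is hypothetical, and the page size is assumed to be a power of two such as the value returned by sysconf(_SC_PAGESIZE).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: widen [addr, addr + len) to whole pages.
   page_size must be a power of two. */
void page_span(uintptr_t addr, size_t len, size_t page_size,
               uintptr_t *start, size_t *span_len)
{
    uintptr_t mask = (uintptr_t)page_size - 1;
    uintptr_t end = (addr + len + mask) & ~mask;  /* round the end up */

    *start = addr & ~mask;                        /* round the start down */
    *span_len = (size_t)(end - *start);
}
```

Note that widening the range means the policy also covers whatever other data happens to live in the first and last page.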


The mode parameter specifies, again, the policy; the values must be chosen from the list in section 6.5.1.

As with set_mempolicy, the nodemask parameter is only used for some policies.

Its handling is identical.
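
As an illustration of the nodemask layout, the following sketch builds such a mask as an array of unsigned long words in which bit n stands for node n. The helper and macro names are hypothetical; maxnode is taken here to be the number of bits the caller's array provides.

```c
#include <limits.h>
#include <string.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MASK_WORDS(maxnode) (((maxnode) + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical helper: clear all bits of a mask holding maxnode bits. */
void nodemask_clear(unsigned long *mask, unsigned long maxnode)
{
    memset(mask, 0, MASK_WORDS(maxnode) * sizeof(unsigned long));
}

/* Hypothetical helper: mark one node; returns -1 if it does not fit. */
int nodemask_set(unsigned long *mask, unsigned long maxnode,
                 unsigned long node)
{
    if (node >= maxnode)
        return -1;
    mask[node / BITS_PER_WORD] |= 1UL << (node % BITS_PER_WORD);
    return 0;
}
```

A mask for nodes 0 and 3 built this way has the value 0x9 in its first word.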

Semantics of the mbind Interface

The semantics of the mbind interface depend on the value of the flags parameter.

By default, if flags is zero, the system call sets the VMA policy for the address range.

Existing mappings are not affected.


If this is not sufficient, there are currently three flags to modify this behavior; they can be selected individually or together:


MPOL_MF_STRICT

The call to mbind will fail if not all pages are on the nodes specified by nodemask.

If this flag is used together with MPOL_MF_MOVE and/or MPOL_MF_MOVEALL, the call will fail if any page cannot be moved.


MPOL_MF_MOVE

The kernel will try to move any page in the address range allocated on a node not in the set specified by nodemask.

By default, only pages used exclusively by the current process’s page tables are moved.


MPOL_MF_MOVEALL

Like MPOL_MF_MOVE, but the kernel will try to move all pages, not just those used by the current process’s page tables alone.

This operation has system-wide implications since it influences the memory access of other processes, which are possibly not even owned by the same user.

Therefore MPOL_MF_MOVEALL is a privileged operation (the CAP_SYS_NICE capability is needed).

Current kernel and libnuma headers spell this flag MPOL_MF_MOVE_ALL.


Note that support for MPOL_MF_MOVE and MPOL_MF_MOVEALL was added only in the 2.6.16 Linux kernel.

Calling mbind without any flags is most useful when the policy for a newly reserved address range has to be specified before any pages are actually allocated.


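#include <sys/mman.h>

/* Reserve the range first; no pages are allocated yet. */
void *p = mmap(NULL, len, PROT_READ|PROT_WRITE,
               MAP_PRIVATE|MAP_ANON, -1, 0);
if (p != MAP_FAILED)
  /* flags == 0: only set the VMA policy. */
  mbind(p, len, mode, nodemask, maxnode, 0);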

This code sequence reserves an address space range of len bytes and specifies that the policy mode referencing the memory nodes in nodemask should be used.

Unless the MAP_POPULATE flag is used with mmap, no memory will have been allocated by the time of the mbind call and, therefore, the new policy applies to all pages in that address space region.


The MPOL_MF_STRICT flag alone can be used to determine whether any page in the address range described by the start and len parameters to mbind is allocated on nodes other than those specified by nodemask.

No allocated pages are changed.

If all pages are allocated on the specified nodes, the VMA policy for the address space region will be changed according to mode.
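
A check of this kind could be wrapped as shown below. This is a sketch under stated assumptions: it targets Linux, calls mbind through syscall(2) instead of libnuma, and relies on the kernel reporting EIO when existing pages do not follow the requested policy, as the mbind manual page documents. The function name apply_policy_if_in_place is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values from the kernel's <linux/mempolicy.h>, repeated here only so the
   sketch builds without libnuma's <numaif.h>. */
#ifndef MPOL_BIND
#define MPOL_BIND 2
#endif
#ifndef MPOL_MF_STRICT
#define MPOL_MF_STRICT (1 << 0)
#endif

/* Hypothetical helper. Returns 1 if every allocated page already sits on
   the nodes in nodemask (the VMA policy is then set), 0 if at least one
   page is elsewhere (assumed to be signalled by EIO), and -1 on any other
   error, with errno left as set by the system call. */
int apply_policy_if_in_place(void *start, unsigned long len, int mode,
                             const unsigned long *nodemask,
                             unsigned long maxnode)
{
    long rc = syscall(SYS_mbind, start, len, (long)mode, nodemask, maxnode,
                      (unsigned long)MPOL_MF_STRICT);
    if (rc == 0)
        return 1;
    return errno == EIO ? 0 : -1;
}
```

A return value of -1 covers unrelated failures as well, for example a start address that is not page-aligned or an environment that does not permit the system call.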


Sometimes rebalancing of memory is needed, in which case it might be necessary to move pages allocated on one node to another node.

Calling mbind with MPOL_MF_MOVE set makes a best effort to achieve that.

Only pages which are solely referenced by the process’s page table tree are considered for moving.

There can be multiple users in the form of threads or other processes which share that part of the page table tree.

It is not possible to affect other processes which happen to map the same data.

These pages do not share the page table entries.






If both the MPOL_MF_STRICT and MPOL_MF_MOVE bits are set in the flags parameter passed to mbind, the kernel will try to move all pages which are not allocated on the specified nodes.

If this is not possible, the call will fail.

Such a call might be useful to determine whether there is a node (or set of nodes) which can house all the pages.


Several combinations can be tried in succession until a suitable node is found.
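
Trying candidates in succession could look roughly like this. The sketch assumes Linux, a raw mbind system call through syscall(2), the MPOL_BIND policy, and single-node candidates numbered from 0; the function name find_node_for_range and the parameter node_count are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <limits.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values from the kernel's <linux/mempolicy.h>, repeated here only so the
   sketch builds without libnuma's <numaif.h>. */
#ifndef MPOL_BIND
#define MPOL_BIND 2
#endif
#ifndef MPOL_MF_STRICT
#define MPOL_MF_STRICT (1 << 0)
#endif
#ifndef MPOL_MF_MOVE
#define MPOL_MF_MOVE (1 << 1)
#endif

/* Hypothetical helper: try nodes 0 .. node_count-1 one at a time and
   return the first node that can hold every page of the range, or -1
   if no single node works. */
int find_node_for_range(void *start, unsigned long len,
                        unsigned long node_count)
{
    /* One-word mask; the loop stays below the top bit to be conservative
       about how the kernel interprets maxnode. */
    const unsigned long max_bits = sizeof(unsigned long) * CHAR_BIT;

    for (unsigned long node = 0;
         node < node_count && node + 1 < max_bits; ++node) {
        unsigned long mask = 1UL << node;
        long rc = syscall(SYS_mbind, start, len, (long)MPOL_BIND, &mask,
                          max_bits,
                          (unsigned long)(MPOL_MF_STRICT | MPOL_MF_MOVE));
        if (rc == 0)
            return (int)node;   /* all pages are now on this node */
    }
    return -1;
}
```

A failed attempt is not necessarily free of side effects: some pages may already have been moved before the kernel gave up, so the search itself can shift memory around.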

The use of MPOL_MF_MOVEALL is harder to justify unless running the current process is the main purpose of the computer.

The reason is that even pages that appear in multiple page tables are moved.

That can easily affect other processes in a negative way.

This operation should thus be used with caution.