Measuring Memory Usage
Knowing how much memory a program allocates and possibly where the allocation happens is the first step to optimizing its memory use.
There are, fortunately, some easy-to-use programs available which do not require that the program be recompiled or specifically modified.
For the first tool, called massif, it is sufficient to not strip the debug information which the compiler can automatically generate.
It provides an overview of the accumulated memory use over time.
Figure 7.7 shows an example of the generated output.
Like cachegrind (section 7.2), massif is a tool using the valgrind infrastructure.
It is started using
valgrind --tool=massif command arg
where command arg is the program which is to be observed and its parameter(s).
The program will be simulated and all calls to memory allocation functions are recognized.
The call site is recorded along with a timestamp value;
the new allocation size is added to both the whole-program total and total for the specific call site.
The same applies to the functions which free memory where, obviously, the size of the freed block is subtracted from the appropriate sums.
This information can then be used to create a graph showing the memory use over the lifetime of the program, splitting each time value according to the location which requested the allocation.
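The bookkeeping just described can be pictured with a small sketch. This is an illustration only, not massif's actual implementation; the function names are made up:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative sketch of the bookkeeping described above: every
   allocation adds its size to a whole-program total and to a
   per-call-site total, and every free subtracts it again. */
static size_t total_in_use;            /* whole-program total */

void *tracked_malloc(size_t size, size_t *site_total) {
    void *p = malloc(size);
    if (p != NULL) {
        total_in_use += size;          /* program-wide sum    */
        *site_total += size;           /* this call site only */
    }
    return p;
}

void tracked_free(void *p, size_t size, size_t *site_total) {
    free(p);
    total_in_use -= size;
    *site_total -= size;
}
```

Sampling these two sums over time yields exactly the kind of per-call-site curves shown in the graph.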
Before the process terminates massif creates two files: massif.XXXXX.txt and massif.XXXXX.ps, where XXXXX is, as before, the PID of the process.
The .txt file is a summary of the memory use for all call sites and the .ps file is what can be seen in Figure 7.7.
Massif can also record the program’s stack usage, which can be useful to determine the total memory footprint of an application.
But this is not always possible.
In some situations (some thread stacks or when sigaltstack is used) the valgrind runtime cannot know about the limits of the stack.
In these situations, it also does not make much sense to add these stacks’ sizes to the total.
There are several other situations where it makes no sense.
If a program is affected by this, massif should be started with the additional option --stacks=no.
Note, this is an option for valgrind and therefore must come before the name of the program which is being observed.
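Concretely, the ordering looks like this (command arg stands, as above, for the observed program and its parameters):

```shell
# --stacks=no is an option for valgrind/massif, so it must appear
# before the name of the observed program; anything after the
# program name is passed to the program itself
valgrind --tool=massif --stacks=no command arg
```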
Some programs provide their own memory allocation implementation or wrapper functions around the system’s allocation functions.
In the first case, allocations are normally missed; in the second case, the recorded call sites hide information, since only the address of the call in the wrapper function is recorded.
For this reason, it is possible to add additional functions to the list of allocation functions.
The --alloc-fn=xmalloc parameter would specify that the function xmalloc is also an allocation function, which is often the case in GNU programs.
Calls to xmalloc are recorded, but not the allocation calls made from within xmalloc.
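To see why such wrappers hide information, consider a typical GNU-style xmalloc. Every allocation in the program now goes through the single malloc call inside the wrapper, so a profiler recording only the immediate caller of malloc sees the same call site every time. This is a sketch of the common idiom, not the code of any particular program:

```c
#include <stdio.h>
#include <stdlib.h>

/* Common GNU-style wrapper: aborts instead of returning NULL.
   All allocations funnel through the single malloc call below,
   so without --alloc-fn=xmalloc a profiler would attribute every
   allocation to this one call site. */
void *xmalloc(size_t size) {
    void *p = malloc(size);
    if (p == NULL) {
        fprintf(stderr, "out of memory\n");
        exit(EXIT_FAILURE);
    }
    return p;
}
```

With --alloc-fn=xmalloc, massif instead records the callers of xmalloc, which are the allocation sites the programmer actually cares about.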
The second tool is called memusage;
it is part of the GNU C library.
It is a simplified version of massif (but existed a long time before massif).
It only records the total memory use for heap (including possible calls to mmap etc.
if the -m option is given) and, optionally, the stack.
The results can be shown as a graph of the total memory use over time or, alternatively, linearly over the calls made to allocation functions.
The graphs are created separately by the memusage script which, just as with valgrind, has to be used to start the application:
memusage command arg
The -p IMGFILE option must be used to specify that the graph should be generated in the file IMGFILE.
This is a PNG file.
The code to collect the data is run in the actual program itself; it is not a simulation like valgrind.
This means memusage is much faster than massif and usable in situations where massif would not be useful.
Besides total memory consumption, the code also records allocation sizes and, on program termination, it shows a histogram of the used allocation sizes.
This information is written to standard error.
Sometimes it is not possible (or feasible) to call the program which is supposed to be observed directly.
An example is the compiler stage of gcc, which is started by the gcc driver program.
In this case the name of the program which should be observed must be provided to the memusage script using the -n NAME parameter.
This parameter is also useful if the program which is observed starts other programs.
If no program name is specified, all started programs will be profiled.
Both programs, massif and memusage, have additional options.
A programmer finding herself in the position of needing more functionality should first consult the manual or help messages to make sure the additional functionality is not already implemented.
Now that we know how the data about memory allocation can be captured, it is necessary to discuss how this data can be interpreted in the context of memory and cache use.
The main aspects of efficient dynamic memory allocation are linear allocation and compactness of the used portion.
This goes back to making prefetching efficient and reducing cache misses.
A program which has to read in an arbitrary amount of data for later processing could do this by creating a list where each of the list elements contains a new data item.
The overhead for this allocation method might be minimal (one pointer for a single-linked list) but the cache effects when using the data can reduce the performance dramatically.
One problem is, for instance, that there is no guarantee that sequentially allocated memory is laid out sequentially in memory.
There are many possible reasons for this:
memory blocks inside a large memory chunk administrated by the memory allocator are actually returned from the back to the front;
a memory chunk is exhausted and a new one is started in a different part of the address space;
the allocation requests are for different sizes which are served from different memory pools;
the interleaving allocations in the various threads of multi-threaded programs.
If data must be allocated up front for later processing, the linked-list approach is clearly a bad idea.
There is no guarantee (or even likelihood) that the consecutive elements in the list are laid out consecutively in memory.
To ensure contiguous allocations, the memory must not be allocated in small chunks.
Another layer of memory handling must be used; it can easily be implemented by the programmer.
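Such a layer can be as simple as a bump (arena) allocator: one large chunk is requested from the system's allocator and handed out sequentially. The following is a minimal sketch under simplifying assumptions (fixed-size arena, no growth, no per-object freeing); the names are illustrative:

```c
#include <stddef.h>
#include <stdlib.h>

/* Minimal bump allocator: one large chunk obtained from the
   system allocator, handed out front to back.  Objects are
   contiguous by construction.  Sketch only: no growing, and
   individual objects cannot be freed. */
struct arena {
    char  *base;    /* start of the chunk        */
    size_t used;    /* bytes handed out so far   */
    size_t size;    /* total size of the chunk   */
};

int arena_init(struct arena *a, size_t size) {
    a->base = malloc(size);
    a->used = 0;
    a->size = size;
    return a->base != NULL;
}

void *arena_alloc(struct arena *a, size_t n) {
    /* round up to pointer size so objects stay aligned */
    n = (n + sizeof(void *) - 1) & ~(sizeof(void *) - 1);
    if (a->size - a->used < n)
        return NULL;            /* chunk exhausted */
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

void arena_free(struct arena *a) {
    free(a->base);              /* releases all objects at once */
}
```

Successive arena_alloc calls return adjacent addresses, which is exactly the layout the prefetcher rewards.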
An alternative is to use the obstack implementation available in the GNU C library.
This allocator requests large blocks of memory from the system’s allocator and then hands arbitrarily large or small blocks of memory out.
These allocations are always sequential unless the large memory chunk is exhausted, which is, depending on the requested allocation sizes, pretty rare.
Obstacks are not a complete replacement for a memory allocator since they have only limited abilities to free objects.
See the GNU C library manual for details.
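A minimal usage sketch, assuming glibc's obstack interface: the two obstack_chunk_* macros must be defined before the obstack macros are used, and the helper obstack_distance is ours, purely for illustration:

```c
#include <obstack.h>
#include <stddef.h>
#include <stdlib.h>

/* obstack obtains its large chunks through these two macros */
#define obstack_chunk_alloc malloc
#define obstack_chunk_free  free

/* Illustrative helper: allocate two equal-sized blocks from one
   obstack and return the distance between them.  Sequential
   obstack allocations normally come from the same chunk, so the
   distance is just the (alignment-rounded) size of the first
   block; there are no per-block headers in between. */
ptrdiff_t obstack_distance(size_t n) {
    struct obstack ob;
    obstack_init(&ob);
    char *a = obstack_alloc(&ob, n);
    char *b = obstack_alloc(&ob, n);
    ptrdiff_t d = b - a;
    obstack_free(&ob, NULL);   /* frees every object at once */
    return d;
}
```

Note the all-or-nothing obstack_free(&ob, NULL) at the end; freeing back to an earlier object is possible, but freeing arbitrary individual objects is not.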
So, how can a situation where the use of obstacks (or similar techniques) is advisable be recognized from the graphs?
Without consulting the source, possible candidates for the changes cannot be identified, but the graph can provide an entry point for the search.
If many allocations are made from the same location, this could mean that allocation in bulk might help.
In Figure 7.7, we can see such a possible candidate in the allocations at address 0x4c0e7d5.
From about 800ms into the run until 1,800ms into the run this is the only area (except the top, green one) which grows.
Moreover, the slope is not steep, which means we have a large number of relatively small allocations.
This is, indeed, a candidate for the use of obstacks or similar techniques.
Another problem the graphs can show is when the total number of allocations is high.
This is especially easy to see if the graph is not drawn linearly over time but, instead, linearly over the number of calls (the default with memusage).
In that case, a gentle slope in the graph means a lot of small allocations.
memusage will not say where the allocations took place, but the comparison with massif’s output can say that, or the programmer might recognize it right away.
Many small allocations should be consolidated to achieve linear memory use.
But there is another, equally important, aspect to this latter class of cases:
many allocations also means higher overhead in administrative data.
This by itself might not be that problematic.
The red area named “heap-admin” represents this overhead in the massif graph and it is quite small.
But, depending on the malloc implementation, this administrative data is allocated along with the data blocks, in the same memory.
For the current malloc implementation in the GNU C library, this is the case:
every allocated block has at least a 2-word header (8 bytes for 32-bit platforms, 16 bytes for 64-bit platforms).
In addition, block sizes are often a bit larger than necessary due to the way memory is administrated (rounding up block sizes to specific multiples).
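This rounding can be observed directly through glibc's malloc_usable_size extension. The helper name below is ours, and the exact numbers depend on the allocator in use:

```c
#include <malloc.h>   /* malloc_usable_size: a glibc extension */
#include <stdlib.h>

/* Return how much usable space the allocator really attached to
   a request of req bytes.  The result is rounded up to the
   allocator's size multiples and is therefore usually larger
   than req; the per-block header is not included in it. */
size_t usable_for(size_t req) {
    void *p = malloc(req);
    size_t usable = malloc_usable_size(p);
    free(p);
    return usable;
}
```

For instance, usable_for(1) on a 64-bit glibc system typically comes out well above one byte, showing the padding the program pays for without ever using it.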
This all means that memory used by the program is interspersed with memory only used by the allocator for administrative purposes.
We might see something like the following layout, where each block represents one memory word.
In this small region of memory we have four allocated blocks.
The overhead due to the block header and padding is 50%.
Due to the placement of the header, this automatically means that the effective prefetch rate of the processor is lowered by up to 50% as well.
If the blocks were processed sequentially (to take maximum advantage of prefetching), the processor would read all the header and padding words into the cache, even though they are never supposed to be read from or written to by the application itself.
Only the runtime uses the header words, and the runtime only comes into play when the block is freed.
One could at this point argue that the implementation should be changed to put the administrative data somewhere else.
This is indeed done in some implementations, and it might prove to be a good idea.
There are many aspects to be kept in mind, though, security not being the least of them.
Regardless of whether we might see a change in the future, the padding issue will never go away (amounting to 16% of the data in the example, when ignoring the headers).
Only if the programmer directly takes control of allocations can this be avoided.
When alignment requirements come into play there can still be holes, but this is also something under control of the programmer.