Oracle Blog

Performance and Scalability

Wicked fast memstat

The memstat command is commonly used to diagnose memory issues on
Solaris systems. It characterizes every physical page on a system and
provides a high-level summary of memory usage.

However, memstat is horribly slow on large systems. Its running
time grows as O(physmem × NCPU), and can take an hour or more on
the largest systems, which makes it practically unusable. I
have recently worked with Pavel Tatashin to optimize memstat,
and if you use memstat, you will like the results.

memstat is an mdb command; its source code is in the file
usr/src/cmd/mdb/common/modules/genunix/memory.c.
For every page that memstat examines, it reads the page_t
structure describing the page, and reads the vnode_t structure
describing the page's identity. Each read of a kernel data
structure is expensive - it is a system call; specifically, a
pread() from the special file /dev/kmem. In his blog, Max Bruning
suggested the first optimization: rather than finding
non-free pages through the page_hash[] and reading them
one at a time, memstat should read dense arrays of page_t's
from the memsegs. These arrays include free pages, which must be
ignored, but the approach reduces the number of system calls and is a
net win.
net win. Max reports more than a 2X speedup. This is a good start,
but is just the tip of the iceberg.
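The effect of this change can be sketched as follows. The names here (kread(), the page_t size, the batch size) are simplified stand-ins for illustration, not the actual mdb interfaces: one simulated pread() per page versus one per chunk of the memseg's dense page_t array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins: kread() models one pread() from /dev/kmem,
 * and page_t is sized roughly like the kernel structure. */
typedef struct { char pad[80]; } page_t;

static long nreads;                     /* simulated system-call count */

static void kread(void *buf, size_t len) {
    memset(buf, 0, len);                /* stand-in for pread(kmem_fd, ...) */
    nreads++;
}

/* page_hash[] style: one kread() per page_t => O(npages) system calls. */
long scan_per_page(long npages) {
    page_t p;
    nreads = 0;
    for (long i = 0; i < npages; i++)
        kread(&p, sizeof (p));
    return nreads;
}

/* memseg style: read the dense page_t array in chunks of `batch` pages;
 * free pages come along for the ride and are skipped during accounting. */
long scan_batched(long npages, long batch) {
    static page_t buf[1024];
    assert(batch > 0 && batch <= 1024);
    nreads = 0;
    for (long i = 0; i < npages; i += batch) {
        long n = (npages - i < batch) ? npages - i : batch;
        kread(buf, (size_t)n * sizeof (page_t));
        /* ...examine the n page_t's in buf, ignoring free pages... */
    }
    return nreads;
}
```

For 100,000 pages, the per-page scan issues 100,000 simulated reads while the batched scan issues under a hundred, even though the batched scan transfers more bytes overall.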

The next big cost is reading the vnode_t per page. The key
observation is that many pages point to the same vnode_t; thus,
if we save the vnode_t in mdb when we first read it, we can
avoid subsequent reads of the same vnode_t. In practice,
there are too many vnode_t's on a production system to save
every one, as this would greatly increase the memory consumption of
mdb, so we implement a cache of up to 10000 vnode_t's, with LRU
replacement, organized in a hash table for rapid lookup by
vnode_t address. To save space, we cache only the vn_flag field of
each vnode_t, since the flag alone is enough to characterize a
page's identity. The cache eliminates most vnode_t-related reads,
gaining another 2X in performance.
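A minimal sketch of such a cache, keyed by vnode_t kernel address and storing only the flag, might look like the following. The sizes and names here are illustrative (the real mdb cache holds up to 10000 entries; this one is deliberately tiny so eviction is easy to exercise):

```c
#include <stdint.h>
#include <stdlib.h>

#define VN_HASH_SIZE 64
#define VN_CACHE_MAX 8          /* illustrative; the real cache holds 10000 */

typedef struct vn_entry {
    uintptr_t addr;             /* vnode_t kernel address (cache key) */
    int vn_flag;                /* the only field worth caching */
    struct vn_entry *h_next;    /* hash chain */
    struct vn_entry *lru_prev, *lru_next;   /* LRU list, head = most recent */
} vn_entry_t;

static vn_entry_t *vn_hash[VN_HASH_SIZE];
static vn_entry_t *lru_head, *lru_tail;
static int vn_count;

static unsigned vn_bucket(uintptr_t addr) {
    return (unsigned)((addr >> 4) % VN_HASH_SIZE);  /* vnodes are aligned */
}

static void lru_unlink(vn_entry_t *e) {
    if (e->lru_prev) e->lru_prev->lru_next = e->lru_next;
    else lru_head = e->lru_next;
    if (e->lru_next) e->lru_next->lru_prev = e->lru_prev;
    else lru_tail = e->lru_prev;
}

static void lru_push(vn_entry_t *e) {
    e->lru_prev = NULL;
    e->lru_next = lru_head;
    if (lru_head) lru_head->lru_prev = e;
    lru_head = e;
    if (lru_tail == NULL) lru_tail = e;
}

/* Look up a vnode's flag; returns 1 on a hit and stores the flag. */
int vn_cache_lookup(uintptr_t addr, int *flagp) {
    vn_entry_t *e;
    for (e = vn_hash[vn_bucket(addr)]; e != NULL; e = e->h_next) {
        if (e->addr == addr) {
            lru_unlink(e);      /* move to front: most recently used */
            lru_push(e);
            *flagp = e->vn_flag;
            return 1;
        }
    }
    return 0;                   /* miss: caller must pread() the vnode_t */
}

/* Insert a flag just read from /dev/kmem; evicts the LRU entry when full.
 * Callers look up first, so addr is assumed not to be present already. */
void vn_cache_insert(uintptr_t addr, int vn_flag) {
    vn_entry_t *e;
    if (vn_count >= VN_CACHE_MAX) {
        vn_entry_t **pp;
        e = lru_tail;           /* evict the least recently used entry */
        lru_unlink(e);
        for (pp = &vn_hash[vn_bucket(e->addr)]; *pp != e; pp = &(*pp)->h_next)
            ;
        *pp = e->h_next;
        vn_count--;
    } else {
        e = malloc(sizeof (*e));
    }
    e->addr = addr;
    e->vn_flag = vn_flag;
    e->h_next = vn_hash[vn_bucket(addr)];
    vn_hash[vn_bucket(addr)] = e;
    lru_push(e);
    vn_count++;
}
```

On a miss, the caller reads the vnode_t once via pread() and inserts the flag; every later page pointing at the same vnode_t is then resolved without a system call.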

The next cost is a redundant traversal of the pages. memstat traverses
and reads the pages twice, performing a slightly different accounting
on the second traversal. We eliminated the second traversal and
did all accounting on the first pass, gaining another 2X in performance.
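The change amounts to folding two classification loops over the pages into one. The categories below are illustrative, not memstat's exact buckets:

```c
#include <stddef.h>

/* Illustrative page classes; memstat's real buckets include kernel,
 * anon, exec, vnode/page cache, and free. */
typedef enum { PG_FREE, PG_ANON, PG_VNODE, PG_KERNEL, PG_NCLASS } pg_class_t;

typedef struct {
    pg_class_t pclass;
} page_info_t;

/* Single traversal: every counter is updated as each page is visited,
 * rather than re-reading all page_t's for a second accounting pass. */
void count_pages(const page_info_t *pages, size_t npages,
                 size_t counts[PG_NCLASS]) {
    size_t c, i;
    for (c = 0; c < PG_NCLASS; c++)
        counts[c] = 0;
    for (i = 0; i < npages; i++)
        counts[pages[i].pclass]++;
}
```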

The last big cost relates to virtual memory management, and is the
reason that the running time grows as O(NCPU). The pread system
call jumps to the kernel module for /dev/kmem, whose
source code is in usr/src/uts/common/io/mem.c. For each read
request, the code determines the physical address (PA), creates a
temporary virtual address (VA) mapping to this address, copies the data
from kernel to user space, and unmaps the VA. The unmap operation must be
broadcast to all CPUs to make sure no CPU retains a stale VA-to-PA
translation in
its TLB. To avoid this cost, we extended and leveraged a
Solaris capability called Kernel Physical Mapping (KPM), in which
all of physical memory is pre-assigned to a dedicated range of
kernel virtual memory that is never mapped for any other purpose.
Thus a KPM mapping never needs to be purged from the CPU TLBs,
and the memstat running time is no longer a function of NCPU.
This optimization yields an additional 10X or more speedup on large
CPU count systems.
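The essence of KPM is that PA-to-VA translation becomes pure arithmetic. A sketch, with a made-up base constant (the real segkpm base is chosen by the kernel and is platform-dependent):

```c
#include <stdint.h>

/* Hypothetical KPM base address, for illustration only. */
#define KPM_BASE  0xfffff80000000000ULL

/* With KPM, every physical page already has a permanent kernel VA, so
 * translating a PA is just an add: no mapping is created, nothing has
 * to be unmapped, and no cross-CPU TLB invalidation is broadcast. */
static inline uint64_t kpm_va(uint64_t pa) {
    return KPM_BASE + pa;
}

/* Contrast with the old /dev/kmem read path, which per request did:
 *   va = map_temporary(pa);    install a private VA->PA translation
 *   copy to user space via va;
 *   unmap(va);                 broadcast TLB invalidation to all CPUs
 * The broadcast is why the cost grew with NCPU. */
```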

Finally, the punchline: the combined speedup from all
optimizations is almost 500X in the best case, and memstat
completes in seconds to minutes. Here are the memstat run
times before versus after on various systems:

(The E25K speedup is "only" 14X because it does not support our
KPM optimization; KPM is more complicated on UltraSPARC IV+
and older processors due to possible VA conflicts in their
L1 cache).

As a bonus, all mdb -k commands are
somewhat faster on large CPU count systems due to the KPM
optimization. For example, on a T5440 running 10000 threads,
an mdb pipeline to walk all threads and print their stacks
took 64 seconds before, and 27 seconds after.

But wait, there's more! Thanks to a suggestion from Jonathan Adams,
we exposed the fast method of traversing pages via memsegs with
a new mdb walker which you can use:

> ::walk allpages

These optimizations are coming soon to a Solaris near you, tracked
by the following CR:
6708183 poor scalability of mdb memstat with increasing CPU count
They are available now in OpenSolaris developer build 118, and will be
in OpenSolaris 2010.02. They will also be in Solaris 10 Update 8,
which is patch 141444-08 for SPARC and 141445-08 for x86.