Thursday, January 12, 2012

Linux Memory Reporting

As much as I like Linux (and, really, I do: I ran Linux 0.99.something on my desktop in college, and wrote my class papers using vim and LaTeX), there are certain things about it that drive me crazy, and the way it reports memory usage is definitely on the list. It should be possible for a reasonably intelligent human being (in which category I place myself) to answer simple questions about system memory usage, such as "How much memory is my database using?" or "How much memory is my web server using?" relatively simply.
Unfortunately, I can't, and I don't think I'm alone. The first thing I typically do is run "free -m" to get a picture of overall system memory utilization. This is generally pretty accurate, and if there's only one process on your machine that's using any significant amount of memory, that may be all you need, in which case you're lucky. Next, I run "top", hit capital "M" to sort by memory usage, and start looking at the individual processes. That's where the wheels come off.

I just did this on an otherwise idle system and 10 httpd processes popped up nearly to the top of the list, each reporting VIRT as 186MB, RES as 3304, and SHR as 768. Which number or combination of numbers represents my true memory usage? According to the top man page, the definition of VIRT is: "The total amount of virtual memory used by the task. It includes all code, data, and shared libraries plus pages that have been swapped out." It makes sense, therefore, that this number is much higher than the actual memory usage of the process: code, data, and shared library pages will only be faulted into memory as needed, but they're still counted against VIRT. Furthermore, those 10 httpd processes consist of one parent process and nine children, and some of the parent's data or stack pages may be shared with its children via copy-on-write. VIRT, therefore, is a gross overestimate of the real memory impact of running an idle httpd on this machine. Just to be clear, I think VIRT is a useful piece of information for the system to report, but it doesn't tell me what I want to know right now, which is how much memory httpd is using.

The top man page defines RES as "The non-swapped physical memory a task has used" and adds the note that "RES = CODE + DATA". That sounds more like a measure of current (rather than theoretical) memory usage. Right now, the parent httpd process is reporting RES = 6892, and all of the children are reporting RES = 3304, except for one, which is reporting 2988. That adds up to about 35MB, which sounds plausible, but it turns out not to be right, because when I run free, stop httpd, and run free again, memory usage only drops by about 11MB, presumably because the RES number doesn't account (or doesn't fully account?) for sharing between the parent process and its children. The equation "RES = CODE + DATA" is also evidently false, because if I enable those additional fields inside top (f-r-s-enter), I find out that the processes with RES = 3304 have CODE = 332 and DATA = 2280, a discrepancy of almost 700kB.
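The arithmetic in the paragraph above is easy to check. The following is only a sketch that replays the numbers quoted in this post (in kB, as top reports them); the variable names are made up for illustration, and the 11MB figure is the approximate drop observed with free.

```python
# Figures quoted above, in kB as top reports them.
parent_res = 6892
child_res = [3304] * 8 + [2988]  # nine children, one slightly smaller

total_res_kb = parent_res + sum(child_res)
print(f"sum of RES: {total_res_kb} kB = {total_res_kb / 1024:.1f} MB")

# Approximate drop in used memory when httpd was stopped, according to free.
observed_drop_mb = 11
overcount_mb = total_res_kb / 1024 - observed_drop_mb
print(f"RES total exceeds the observed drop by about {overcount_mb:.0f} MB")

# The man page's claim RES = CODE + DATA, checked for one child process.
res, code, data = 3304, 332, 2280
discrepancy_kb = res - (code + data)
print(f"RES - (CODE + DATA) = {discrepancy_kb} kB")
```

Summing RES counts every page shared between the parent and its children once per process, which is consistent with the total coming out roughly three times larger than what free says was released.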

Although the example above talks about httpd, the same problems all apply to PostgreSQL. When I configure shared_buffers = 200MB and start up the server, it reports VIRT = 369MB and RES = 16MB. The shared memory segment, including the 200MB of shared_buffers, is reflected in the virtual size but not the resident size. This is understandable: starting the server doesn't actually access the memory allocated for shared_buffers, and Linux won't allocate it until it's really used. But, as it turns out, the postmaster (i.e. the parent postgres process) will never reflect a resident size higher than 16MB, even after every block in the shared memory segment has been used. Instead, each individual backend that touches any part of the shared memory segment will count just the portions it touched in its shared memory total; thus, new backends will appear to start out using a small amount of memory and then grow (sometimes quite precipitously) as they begin to do actual work. If a backend eventually touches all of shared_buffers, it should level off at a value approximately equal to the total size of PostgreSQL's shared memory segment plus some amount that reflects the private memory it is using.

What this means in practice is that it's just about impossible to look at the output of top and have any idea how much memory PostgreSQL is using, or even whether its memory usage is growing over time. In fact, memory leaks in PostgreSQL are fairly rare, because we use a system of memory contexts to track allocations. There are per-tuple and per-query contexts where many allocations are done; once we're done processing a given tuple or query we eradicate the entire contents of the memory context in a bulk operation. This is a very effective way of preventing leaks; only when we're allocating memory in a session-lifespan memory context do we need to worry about a long-term leak. However, if we do spring a leak, it's hard to spot it from looking at the top (or ps) output unless it's pretty egregious. If you have a postgres process which is using much more memory than the other postgres processes, and you can correlate that with an overall decrease in system free memory, then you've got a leak. A small leak, however, is likely to go unobserved, because there's so much noise in the reported numbers that a real problem looks just like an artifact. Even a 10MB leak (which is pretty significant) would blend right in unless you happened to be running with a very small value for shared_buffers.

On the flip side, the tendency of each new postgres process to start out with a small resident size and then grow as it begins touching shared_buffers can easily create the perception of a real leak where none exists. In fact, no new memory is being allocated at all: the apparent growth in resident size is really just the result of faulting already-allocated pages into the address space of a process where the kernel hadn't previously chosen to map them. But seeing a process start out small and then within a minute balloon to multiple gigabytes can be alarming to system administrators, to say the least.

top advertises a few other potentially interesting values as well, but they don't really add much to the total picture. The SHR number, according to the man page, "simply reflects memory that could be potentially shared with other processes". In theory, that ought to help clarify things, but since memory that is opportunistically shared (like copy-on-write pages) is not distinguished from actual shared memory, it's not that helpful. There's also an optional column for SWAP, but it's not the number of pages that process has pushed out to the swapfile; it's just the portion of VIRT that isn't currently resident. So a demand-paged shared library that has never been fully loaded (because it hasn't been accessed) counts the same as an unshared stack page that's been evicted due to extreme memory pressure, which is bizarre.

I haven't run across smem before, but it does look like its idea of proportional set size (PSS) might be a useful concept. smem's notion of RSS is the same as what top reports as RES, so it has the same pros and cons. There's also a USS column which, at least in my copy of smem, isn't documented, so it's hard to know whether that's useful or not.

On my system, the "real" memory utilization of httpd, as determined by starting and stopping it, is about 11M. The total of the PSS columns for httpd is 7871kB and the total of the USS columns is 5624kB. So PSS alone is clearly less than the real memory utilization, and PSS+USS is slightly more (but maybe that's rounding error?).
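Here is the same comparison as a small calculation, using only the totals quoted above. One assumption is worth stating loudly: as PSS is usually defined, it already counts a process's private pages in full and adds a proportional share of its shared pages, so adding USS to PSS counts the private pages twice. If that definition is the one smem uses, the difference PSS - USS is the more meaningful derived number.

```python
# Totals quoted above for all httpd processes, in kB.
pss_total_kb = 7871
uss_total_kb = 5624
observed_kb = 11 * 1024  # "about 11M", measured by stopping httpd

print(f"PSS alone:  {pss_total_kb} kB ({pss_total_kb - observed_kb:+d} kB vs. observed)")

pss_plus_uss_kb = pss_total_kb + uss_total_kb
print(f"PSS + USS:  {pss_plus_uss_kb} kB ({pss_plus_uss_kb - observed_kb:+d} kB vs. observed)")

# Assumption: PSS = private pages + proportional share of shared pages,
# so this difference is httpd's proportional share of shared pages.
shared_share_kb = pss_total_kb - uss_total_kb
print(f"PSS - USS:  {shared_share_kb} kB attributed to shared pages")
```

On these numbers PSS falls about 3.3MB short of the observed figure and PSS+USS overshoots by about 2.2MB, which looks like more than rounding error either way.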

In short: seems interesting, but it's hard to tell what it's really doing. It has a blizzard of command-line options that are described with phrases like "show library-oriented view" but there's no clear definition of what that actually means, which makes it not that useful IMV.

PostgreSQL already has functions like "pg_database_size" which give information which can also be obtained from the operating system.

Likewise, PostgreSQL functions concerning memory usage can be considered; this would be operating system independent, avoiding the issues with Linux. Of course, allocator overhead (space) cannot be taken into account, but leaks would still show up. If the information is maintained not per each allocation but per the larger units of allocation you mentioned, the slowdown might be acceptable.

See this LWN article on PSS/USS. USS is just the unshared page count, i.e. the memory that is returned when the process is killed.

Random fact: Chrome's also in the same multiprocess boat, and its built-in task manager and about:memory page use USS/PSS for accounting.

I think what would be interesting and is missing from this picture is a tool that presents USS for a set of processes. This would be useful for multiprocess applications like PostgreSQL, Chrome, etc. Simply summing the processes' USS would undercount, as any shared pages would be left unaccounted for. The hypothetical tool would need to determine which pages are shared exclusively among the processes belonging to the set in question (probably fine to assume a tree, so just pointing it at a parent process would work).

This was a very helpful blog post. It highlights the vague and inconsistent method developers have used to show memory.

There are various methods for representing memory that is shared, either via SysV shared memory, fork's copy-on-write, or shared libraries. Does every process get charged the full amount, or do they split it among themselves, e.g. if five processes use shared memory, is each process charged 20% of the total size? (If another process attaches, does your percentage decrease?) What happens when you map in a large shared memory area but only access part of it? When do you stop using that memory?
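To make the competing answers concrete, here is a toy model (not how any particular tool is implemented) that takes, for each process, the set of pages it has actually touched and computes three charges: the full amount (RSS-style), only the unshared pages (USS-style), and an even split among the processes sharing each page (PSS-style). The function name, the process names, and the 4kB page size are assumptions for illustration.

```python
from collections import Counter

def account(touched, page_kb=4):
    """Charge each process for the pages it touched under three policies.

    touched maps a process name to the set of page numbers it has touched.
    """
    # How many processes have each page resident.
    sharers = Counter(page for pages in touched.values() for page in pages)
    report = {}
    for proc, pages in touched.items():
        report[proc] = {
            "rss": len(pages) * page_kb,                            # full charge
            "uss": sum(page_kb for p in pages if sharers[p] == 1),  # unshared only
            "pss": sum(page_kb / sharers[p] for p in pages),        # even split
        }
    return report

# Two processes map the same area; each has touched a different part of it.
usage = account({"a": set(range(0, 100)), "b": set(range(50, 150))})
print(usage)
```

In this model the sum of the PSS column equals the memory actually in use (150 pages), the sum of RSS overstates it, and the sum of USS understates it. It also shows the side effect asked about above: when another process touches a page you already have, your proportional charge for that page goes down even though you did nothing.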

There don't seem to be any clear answers, because the answer depends on what you want to do with the number. What hasn't happened is anyone really laying out these issues and defining exactly what their tool does. What is happening now is that each developer is deciding on their own how to answer these questions, and it isn't making it out into the documentation, probably because few users would even understand the explanation.

This uncertainty has lingered for years, and I don't see it getting resolved anytime soon, unless someone steps up to the plate, defines what these numbers mean, and then records how each tool handles them.

As Robert described, PostgreSQL isn't alone in having this issue, but maybe we have to take the lead in explaining it. In an ideal world, someone would define these tradeoffs, and every tool would use the same terms and allow output matching each of the possible use cases.

You've inspired me to write up what I know (and have discovered) about the various sorts of memory usage information that you can get from Linux with top, ps, and smem. I've posted the result as Linux memory stats.

(The short version is that the question is ambiguous but you can get various sorts of useful information, especially with smem.)

I do confess I struggle to interpret these values on a regular basis. What I hate most is that SHM is accounted for as "cached" by each and every tool I know... even /proc/meminfo!

However, on the plus side: I totally love atop (http://www.atoptool.nl). It is based on Linux process accounting, gives very detailed information on almost every aspect of the system and its processes, has a comparably small footprint (compared to top at least) etc.

It can save snapshots of the system (default every 10 mins.) which sum up the usage in the past timespan, incl. exited processes. Especially that last feature has saved my back many times, as it makes it very easy to point at the process/user that was consuming the most memory. Comes in very handy when you dip into swapspace or experience OOM situations.

As a bonus, you get disk I/O and network traffic (even per process with some patches), context switches, migrations etc.