Thursday, June 28, 2012

Absurd Shared Memory Limits

Today, I fixed a problem. Or at least, I think I fixed it. Time will tell. But Thom Brown seems pretty happy, and so does Dan Farina. So let me tell you about it. Here's the executive summary: assuming the patch I committed today holds up, PostgreSQL 9.3 will largely eliminate the need to fiddle with operating system shared memory limits.
A PostgreSQL database cluster involves multiple processes - one per session plus a few extras - that need to be able to communicate via a shared memory segment. For historical reasons, most UNIX-like operating systems provide at least three different methods of creating a shared memory segment. In the beginning, there was System V shared memory. Then, a bunch of people got together and decided that they didn't like the System V shared memory interface very much, so they created a new interface called POSIX shared memory which did mostly the same thing - but not quite. Both System V shared memory and POSIX shared memory involve creating named shared memory segments (though they use different naming conventions); to attach to one of these segments, you identify it by name. At some point along the way, it became possible to create shared memory segments in yet another way, using a system call named mmap() and passing it the options MAP_SHARED and MAP_ANONYMOUS. Shared memory segments created this way don't have a name, so they can only be shared between a parent process and its descendants.

PostgreSQL uses System V shared memory, because it provides a feature that is available via neither of the other two systems: the ability to atomically determine the number of processes attached to the shared memory segment. When the first PostgreSQL process attaches to the shared memory segment, it checks how many processes are attached. If the result is anything other than "one", it knows that there's another copy of PostgreSQL running which is pointed at the same data directory, and it bails out. This is really good, because having two PostgreSQL processes pointed at the same data directory at the same time is a sure way to corrupt your database.

Despite the fact that System V shared memory is the only available shared memory implementation that provides this feature, most operating system vendors frown on it, and encourage users to use the newer POSIX shared memory facilities instead. They do this by limiting the amount of System V shared memory that can be allocated by default to absurdly small values, often 32MB. On some older systems, you actually had to recompile the kernel to raise the limit; thankfully, on modern systems, it's usually as simple as editing /etc/sysctl.conf and running sysctl -f /etc/sysctl.conf to read the updated settings. Still, it poses a needless obstacle for new PostgreSQL users, who now have a choice between (1) terrible database performance and (2) fiddling with kernel settings that they don't understand.
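On Linux, for example, the relevant entries look something like the following (parameter names are Linux-specific - the BSDs use kern.ipc.shmmax and friends - and the values shown are just illustrative; the right numbers depend on your shared_buffers setting and page size):

```shell
# Illustrative /etc/sysctl.conf entries for raising the System V limits.
kernel.shmmax = 17179869184   # max size of a single segment, in bytes (16GB)
kernel.shmall = 4194304       # max total shared memory, in 4kB pages (16GB)

# Apply without rebooting:
#   sysctl -f /etc/sysctl.conf
```

Note the unit mismatch - shmmax is in bytes while shmall is in pages - which is one of the things that trips up first-time users.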

In my opinion, the decision to tightly limit the amount of System V shared memory that can be allocated is a poor one on the part of OS vendors. POSIX shared memory and mmap's anonymous shared memory have much higher limits, or none at all; and as far as I can see, limiting System V shared memory makes things inconvenient for users of programs like PostgreSQL without any compensating advantage.

The good news is that the above-mentioned commit contains a workaround. We allocate a very small System V shared memory segment (48 bytes, on the systems I tested; it could vary slightly by platform) which provides the interlock to prevent multiple instances of PostgreSQL from attaching to the same data directory at the same time, and allocate a large anonymous shared memory block for everything else. Assuming the patch doesn't get reverted for one reason or another, this means that in PostgreSQL 9.3 it will be possible to start PostgreSQL on all platforms I'm familiar with - using an arbitrarily high shared_buffers setting - without any adjustment of default operating system limits. That should hopefully make things easier for first-time users.

39 comments:

This sounds very good. I hate it when small projects depend on custom kernel settings. With large projects you usually have to do some kernel tuning, but if your project runs on a single-node machine and isn't rocket science, it should just work out of the box.

I'm quite happy about this change, thanks Robert for seeing it through! I've often made the mistake of "fixing" a PG server which wouldn't start due to SHMMAX/SHMALL by using only sysctl -w ..., only to have the same problem a few months later after the machine reboots itself, because I neglected to update sysctl.conf as well, or made a typo in sysctl.conf.

And your post is actually the first time I've seen the tip of using sysctl -f /etc/sysctl.conf, instead of the steps our docs recommend for Linux. Hrmph, the -f flag isn't even in my man page for sysctl.

For most applications/daemons, there's a reason you don't want multiple running concurrently, but if it somehow happens, it isn't the end of the world. For example, if you have a mail server running and you try to start a second, it isn't going to work well (only one can bind to TCP Port 25). But other than that, you aren't likely to experience pain.

With a database, there's a much greater risk. Having two PostgreSQL servers running concurrently and accessing the same database files will basically guarantee corrupted data. As Postgres goes to extremes to protect your data (as it should), this is an unacceptable situation. In order to ensure it doesn't happen, Postgres uses the most reliable and robust method available to make sure there is only one Postgres instance using a given data directory.

Other methods tend to have drawbacks (flock() doesn't work on NFS, etc). This one works, and works well.

"flock() doesn't work on NFS". Are you saying that this SysV shm locking scheme does work on NFS? Or are the downsides to other locking mechanisms in that "etc"?

I'm not necessarily disagreeing with what you've written, but as glyph said: "I'm not saying I'm sure there's no good reason, I've just never seen one mentioned in the previous mailing list threads." This answer still does not address this. Can you point us at a comparison of locking methods that shows that SysV shm is the most reliable and robust method for locking?

Given that SysV and POSIX shared memory chunks have named identifiers, I'm not sure why you'd compare them to mmap() with MAP_ANONYMOUS. The logical equivalent is an mmap()'d file on the actual file system, not an anonymous chunk of mmap()'d memory.

SysV shared memory has a fixed overhead, as do many other things in the kernel. The higher you raise those limits, the higher your fixed overhead becomes. The reason we as OS vendors do not ship with the ability to use many gigabyte SysV shared memory segments by default (historically) is that few people use it and we do not want to put the burden of that fixed overhead on everyone who does not need it.

This patch will reduce performance outright on BSD kernels for users who previously leveraged the shm_use_phys optimization (pretty much everyone who runs a serious database) because the kernel will have to manage pv entries for all of those mmap'd pages. It will also create additional memory pressure on those systems because more pv entries will need to be allocated.

On FreeBSD you have to enable shared memory for jails if you want to jail your postgres process, and you can't have more than one postgres instance in a jail. It's always been considered insecure because the system-v shared memory data is readable by all jails. Is this going to solve the FreeBSD jails problem?

There is a very good reason we OS vendors do not ship with SysV default limits high enough to run a serious PostgreSQL database. There is very little software that uses SysV in any serious way other than PostgreSQL, and there is a fixed overhead to increasing those limits. You end up wasting RAM for all the users who do not need the limits to be that high. That said, you are late to the party here; vendors have finally decided that the fixed overheads are low enough relative to modern RAM sizes that the defaults can be raised quite high. DragonFly BSD has shipped with greatly increased limits for a year or so, and I believe FreeBSD has also.

There is a serious problem with this patch on BSD kernels. All of the BSD sysv implementations have a shm_use_phys optimization which forces the kernel to wire up memory pages used to back SysV segments. This increases performance by not requiring the allocation of pv entries for these pages and also reduces memory pressure. Most serious users of PostgreSQL on BSD platforms use this well-documented optimization. After switching to 9.3, large and well optimized Pg installations that previously ran well in memory will be forced into swap because of the pv entry overhead.

Is there some kind of workaround for this? It would seem pretty odd if this shm_use_phys optimization were only available to processes using System V shared memory. If there's a way we can request that same optimization for an mmap'd segment I certainly think we'd do that.

Robert, there is nothing analogous to this for mmap, nor is there a workaround. Basically, if this change sticks (and I am not necessarily saying it shouldn't), the BSDs will need to do major work on their VMs to reduce the pv_entry overhead in order to stay relevant as a platform for PostgreSQL. Matt Dillon has seen this and is trying to come up with a solution for the DragonFly VM presently, which may or may not be portable to the other BSDs. See: http://developers.slashdot.org/comments.pl?sid=3107463&cid=41308595 (and further up that thread)

On FreeBSD (I believe this is also available on other *BSDs) you could use mlock() and munlock() with the starting virtual address and length of the memory chunk to wire the pages, but this requires root privileges at this time (we are working on changing this, to allow unprivileged processes to request a certain administrator-adjustable amount in the meantime). Will that be helpful?

As for PV entries, the FreeBSD VM subsystem now has the capability to promote contiguous normal pages to "superpages" transparently to the application, if the mappings are aligned and sufficiently large. [1]

For those coming here because someone recently linked to this post (*cough*Slashdot*cough*), the issue seems to have been resolved (as mentioned at said site), at least for DragonFly: http://lists.dragonflybsd.org/pipermail/users/2012-October/017536.html

I'm kinda late to the party; I stumbled across this while researching PostgreSQL optimizations. I use the shm_use_phys sysctl to tell the FreeBSD kernel to wire down, and not swap, the pages allocated to SysV shared segments. Like sjg said, this patch will affect performance on BSD systems. The only way around that would be to use the mlock system call on mmap'd anonymous memory, but that requires the process to have root privileges.

Anyway, that being said, couldn't this be a compile-time option? I understand the need for this patch, but those SysV memory limits no longer apply on most BSD kernels, last I checked.

Given the description, this is essentially a "political" patch, and we all know that politics are finicky and have little to do with technical merits. On the technical side, this means pgsql now depends on two of the three available ways to do the same thing, which actually enlarges the political vulnerability surface (maybe some wit will insist on disabling mmap for yet another political stance), so it'd be nice to be able to do just that, should the resident master tuner want to.

Given that the goal is to not require arcane tuning knowledge from non-tuners, there shouldn't be anything against the ability to jump back to the pre-mmap behavior via some tunable, and the dubious joy of tuning SysV shmem parameters.

Of course there's technical elegance in the One True Solution, but that's not a valid argument for patches that are essentially political in nature. Politics are messy, so don't try to impose neatness where flexibility is worth that much more, should you suddenly and unexpectedly need it.