Thursday, July 21, 2011

Read Scaling Out to 32 Cores

With the exception of a week's vacation, my last month has been mostly absorbed by PostgreSQL performance work. Specifically, I've been looking at the workload generated by "pgbench -S", which essentially fires off lots and lots of SELECT queries that all do primary key lookups against a single table. Even more specifically, I've been looking at the way this workload performs on systems with many CPU cores where (I found) PostgreSQL was not able to use all of the available CPU time to answer queries. Although the single-core performance was around 4,300 tps, performance with 36 clients (on a 24-core server) was only about 36,000 tps.

Research revealed that performance was being limited mostly by PostgreSQL's lock manager. Each SELECT query needed to lock the table being queried - and its index - against a concurrent DROP operation. Since PostgreSQL 8.2, the lock manager has been partitioned: a lock request against a database object will be assigned to one of 16 "partitions", and lock requests against objects that fall into different partitions can proceed in parallel. Unfortunately, that doesn't help much in this case, because only two objects are being locked: the table, and its index. Most of the traffic therefore targets just two of the sixteen lock manager partitions. Furthermore, because the query itself is so trivial, the rate of lock and unlock requests is extremely high - on a more complex query, the bottleneck wouldn't be as severe.
On Monday, I committed a patch which greatly improves performance for this workload. Even on my 2-core laptop, there's about a 4% improvement when using a large number of concurrent clients, and in a 24-core configuration, an earlier version of this test weighed in at around 129,000 tps - about 3.5 times the previous performance. Both the hardware and test configuration have changed a bit since then: on 32 cores, with 32 clients, I now see about 179,000 tps. Another user, running a slightly different test (using prepared queries, and a lot more concurrent connections), reported a 20-80% speedup on an 8-core system.

That's a lot of speedup, so you might be wondering how it works. Basically, the observation I made is that the sorts of locks that queries take during normal running almost never conflict. Reads and writes on the same table can, of course, proceed in parallel, so the relation taking a lock is really only trying to make sure that someone else can't, for example, drop the table while it's being read or written. Since reading or writing a table is a lot more common than dropping it, it makes sense to optimize for that case. The patch introduces two new data structures that allow that to happen.

First, whenever any PostgreSQL backend takes a "strong" table lock (one that would conflict with ordinary reads and writes to the table), it hashes the table's internal identifier into one of 1024 partitions, and increments a counter for that partition. When the lock is released, the counter is decremented. Thus, each counter tracks the number of "strong" locks on tables that fall into that partition. By looking at the appropriate counter, a backend which only wants a "weak" lock (enough to read or write the table) can very quickly determine whether a "strong" lock could potentially be present. If so, it takes the lock using the old method, to guarantee that lock conflicts will be detected, and deadlocks broken.

Second, if a backend verifies that no strong lock can possibly be present, it can record the lock in a "fast path" queue instead of routing it through the main lock manager. Each backend has its own "fast path" queue, so contention is minimized. The queues are small - only 16 entries - but that's OK: it keeps the code simple, while still allowing the fast path mechanism siphon off a great deal of lock traffic. Backends wanting a "strong" lock must examine these queues and move any conflicting locks found there to the main lock manager, again to ensure proper conflict and deadlock detection.

While working on this patch, I found a few more things that can be optimized to squeeze out a little bit more performance, and to make the performance gains generated by this patch more robust. I'll write about those projects in my next blog post. But here's the bottom line: when the two follow-on patches I have in the works are finished and committed, I expect PostgreSQL to have robust read scalability out to 32 cores, even for tiny queries.

As great as this improvement is, it is only going to be in 9.2 because we are already beta-testing 9.1.

Also, we always knew we had lock scalability problems on many-cored servers but the solution was usually to try to improve the lock manager --- this takes the creative approach of avoiding the lock manager completely for common use cases, which is brilliant.

@erdman indeed. we'll be reviewing the changes with that in mind, but from the blog post, it sounds like it might be fairly invasive.

@robert are you doing any testing with 32+ core machines? 32 is about the limit you can run with postgres, so while it's really good to see performance increases, it would also be good to know if this might help when going beyond 32 cores. Any thoughts?