Sunday, June 13, 2010

Making PostgreSQL Faster

Although we've made great progress in speeding up PostgreSQL over the last few years, there's always more to be done. Performance, with PostgreSQL as with any other database, is largely determined by the availability of three resources: CPU, memory, and disk. What could we do to use each of these resources most efficiently?

PostgreSQL is already pretty efficient at using the CPU. For high-concurrency databases, I don't anticipate that things will get much better than they already are. For low-concurrency databases, we need parallel query - that is, the ability to use more than one CPU to process the same query.

Memory is a little bit more of a problem. We do a good job keeping our memory footprint small, but we don't manage it terribly well. work_mem limits the maximum size of a sort or hash, but takes no account of current conditions: if the system is swapping due to memory pressure, you get the same plan as if the system has 40GB of free memory. And all the memory allocated to shared_buffers remains allocated even when it isn't truly needed.

I/O is perhaps the biggest problem. I don't think this problem is unique to PostgreSQL - I believe all databases probably share this pain point to some degree. Disks are slow. With respect to PostgreSQL specifically, there are a number of things we need to do to minimize our I/O bandwidth, including index-only scans and further improvements to VACUUM. Partial vacuum (implemented in 8.4) is a pretty big deal, but there's more that needs to be done.

We also need to put more effort into minimizing our on-disk format and WAL volume. The actual disk space is cheap, but the time needed to read and write a larger volume of data hurts performance.

Solid state disks aren't a serious competitor to disk arrays for large database systems yet. They may be some day, but they're neither large enough (except for very small installations) nor reliable enough at this point.

Partial vacuum is a feature whereby the system tries to skip vacuuming of pages that contain only tuples which are visible to all transactions. So repeated vacuums of large tables will just re-vacuum the parts that have had updates or deletes, and pass over the rest.

That other Robert is wrong, mostly. For existing flat-file image "databases", rotating rust will be cheaper than NAND type SSD.

OTOH, if money is less important than performance for even such abominations, then terabyte SSD (NAND type) are already available.

For those willing/able to either build a new application, or re-factor (coders do it all the time, why not databasers?) those flat-file monsters, a BCNF schema can be an order of magnitude smaller than the flat-file image "database", and for such SSD is perfectly feasible. The performance improves not only because the storage is SSD, but also because the volume of bytes is so much smaller.

For insight into where SSD has been and where it is going: www.storagesearch.com (not my site, by the way).

I totally agree with you, Robert.In my experience with Postgres I have more troubles with setting memory limits on a machine with lots of RAM to get optimal performance for DW-type queries. Just as you put it - there is no change in plan depending on the current mem utilization on machine and settings for work_mem can eat up all memory depending on the plan.

More options for making Postgres faster, which I advocate, include making some subtle changes to the behaviour of user code. For example, provide the option for everything to work in set semantics rather than multiset semantics. When this option is enabled, then it should be much easier to optimize SQL code because more changes can be made in the query plans which in the presence of duplicates would change the answers you get from operations but without duplicates these wouldn't, and so optimizations that used to be unsafe are now safe. Of course, backwards compatibility is important, and to help achieve both, I would hoist some troublesome features such as multiset support from the core algorithms, making the core set oriented, and then emulate the old behaviour over top using things like count attributes. Since users typically want set semantics anyway, indicating explicitly that they want them should boost performance significantly, and users who want multiset performance can still get it. There are many other such improvements that can be made, which essentially comes down to altering the query language that Postgres takes.

By saying Postgres should support set semantics over multiset semantics, I am not saying that possibly-costly duplicate elimination must be done all over the place; rather I'm mainly saying to support the concept that keeping track of how many instances of a tuple are in a relation is not necessary all the time, besides knowing whether the count is zero or nonzero. If the number of duplicates is deemed unimportant, then it is safe for the optimizer to change the order of operations or substitute operations such that the changes might alter how many duplicates you have, but wouldn't alter whether you have zero or nonzero instances of each tuple.

Therefore, this change can be tested with practically no alteration to most of the Postgres code; we're essentially just saying that the optimizer isn't blocked from many things it could do because we're declaring the consequences of that to not be a problem.

Also, the code can decide case-by-case as to whether it actually eliminates duplicates in the middle of processing, because that would reduce the amount of data to process subsequently.

Depending on how it worked before, the MINUS/EXCEPT operator may need a tweak if it previously used multiset semantics since for things to not break it would have to eliminate all occurrences of source tuples matched by any filter tuples, even if there are more of the former and one-by-one elimination would leave some.

Obviously, any aggregate operators would still need to receive duplicates as appropriate (including COUNT), with the semantics as if all duplicates were eliminated from any joins or other operations that are supposed to evaluate prior to the group-by or aggregation.

Actual duplicates in the result can also be eliminated simply so the user gets something clean, but doing this aside from helping performance can also be triggered by another preference. Eg, the user could just set a preference with 3 options: dup-count-significant (legacy SQL semantics), dup-count-insignificant (best performance option), no-dups (most consistent results).

One question on this, it may look bit simple. How could we keep maximum of database tables in Buffer, so that there is less I/O when users querying on those tables. Is there anyway where we could cache all the tables in buffer, which make faster retrieval of the queries.