Tuesday, January 12, 2016

PostgreSQL Past, Present, and Future: Moving The Goalposts

It's nice to see that PostgreSQL 9.5 is finally released! There are a number of blog posts out about that already, not to mention stories in InfoWorld, V3, and a host of other publications. Of all the publicity, though, I think my favorite piece is a retrospective post by Shaun Thomas reviewing how far PostgreSQL has come over the last five years. As Shaun notes, both the scalability and the feature set of PostgreSQL have increased enormously over the last five years. It's easy to miss when you look at one release, or even (as I do) one commit, at a time, but the place where we are now is a whole new world compared to where we were back then.

Back in 2010, I wrote a blog post entitled Big Ideas and a follow-up post called Lots and Lots of PostgreSQL Feature Requests summarizing first the ideas that I had and then the responses that I heard from other people about what was missing from PostgreSQL. Five years later, many of the items on those lists appear somewhat dated. PostgreSQL 9.5 adds INSERT .. ON CONFLICT, which answers many of the use cases that led people to ask for MERGE; and it adds row-level security. Index-only scans, granular collation support, SE-Linux integration, LATERAL, updatable views, and materialized views are all features we've had for so long now that most PostgreSQL users probably don't even think of them as new features any more. Previous releases have also added new data types, particularly jsonb, which allows easy manipulation of JSON data and which has been further enhanced in PostgreSQL 9.5; as well as new index types like SP-GiST. PostgreSQL 9.5 adds BRIN indexes, part of a growing body of work to adapt PostgreSQL to progressively larger workloads.
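For anyone who hasn't tried the new upsert support yet, here is a minimal sketch of the 9.5 syntax; the table and column names are made up for illustration:

```sql
-- Hypothetical hit counter: insert a row, or bump the count on conflict.
CREATE TABLE page_hits (
    url  text PRIMARY KEY,
    hits bigint NOT NULL DEFAULT 1
);

INSERT INTO page_hits (url) VALUES ('/index.html')
ON CONFLICT (url) DO UPDATE
    SET hits = page_hits.hits + 1;

-- Or, when duplicates should simply be skipped:
INSERT INTO page_hits (url) VALUES ('/index.html')
ON CONFLICT DO NOTHING;
```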

Many of the really important features from those lists which haven't yet been released are well underway. For example, we don't have multi-master replication in core just yet, but enormous progress has been made with the inclusion of logical decoding in PostgreSQL 9.4, and there will in all likelihood be more progress on logical replication in PostgreSQL 9.6. A very simple version of parallel query has already been committed to PostgreSQL 9.6 and further enhancements are on the way. Work is also under way on partitioning syntax.

It's time, then, to look to the future. What are the major gaps that exist in PostgreSQL today, as opposed to five years ago? Specifically, once we get through the twin knotholes of parallel query and logical replication, long-overdue projects where the slow progress we've made is a direct result of just how very difficult it is to get them off the ground, what comes next? Leave a comment below with your thoughts.

I outlined some of my own ideas about this in a presentation called The Elephants in the Room, which I gave at both pgconf.us 2015 and pgconf.eu 2015. Both video and slides are online. That presentation mentions both parallel query and logical replication, of course, plus a few other things:

1. Horizontal Scalability. While PostgreSQL 9.5 scales to large boxes better than any previous release (and probably worse than any future release, since we keep making improvements!), what happens when you need more than one server for your workload? PostgreSQL 9.5 has a little-noted feature that allows foreign tables to participate in table inheritance, which is a long way of spelling "sharding" if you tilt your head just right, but the query planner and executor capabilities needed to make it a really killer feature are not there yet. More generally, regardless of how we get there, leveraging one box effectively is good, but leveraging multiple boxes is better.

2. On-Disk Format. At each of the last two instances of PGCon (protip: you should go), there has been some discussion about making PostgreSQL's table storage pluggable, just as our indexing system has been for many years. This would open the door either to replacing PostgreSQL's storage format entirely with something better, without thereby breaking backward compatibility; or, perhaps more likely, to introducing specialized storage formats which are better for certain applications. Storage formats which are more compact, and which therefore allow reading the same amount of useful data from the disk with less physical I/O, seem particularly important.

3. Built-In Connection Pooling. I'm not sure how many users are out there using an external connection pool and wishing they could be rid of it, but I'll bet there are some.
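The foreign-table inheritance feature mentioned under item 1 can be sketched like this with postgres_fdw; server names, credentials, and table names here are hypothetical:

```sql
-- One local parent table with a foreign table as an inheritance child,
-- so queries against the parent also reach the remote shard.
CREATE EXTENSION postgres_fdw;

CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard1.example.com', dbname 'app');

CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
    OPTIONS (user 'app', password 'secret');

CREATE TABLE measurements (
    ts        timestamptz,
    device_id int,
    reading   numeric
);

-- New in 9.5: foreign tables can participate in inheritance.
CREATE FOREIGN TABLE measurements_shard1 ()
    INHERITS (measurements)
    SERVER shard1 OPTIONS (table_name 'measurements');
```

As the post says, the planner and executor smarts to make this a killer feature (join pushdown, parallel shard access, and so on) are the part that is still missing.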

The presentation also talks about direct I/O, but in some sense that's not really a feature from a user perspective. If somebody implements it and that turns out to be beneficial, the user-visible feature will be improved performance. Of course, one can never have enough performance, whatever the source. Again, thoughts on other things PostgreSQL needs are very welcome. Please comment below on what else you think should be added. Thanks.

I bugged you about these things briefly at PGConfUS 2015. I get that most PG users don't realize they need these things or are content to work around them, but for Oracle users looking at moving to PG they are likely showstoppers.

#3 is being worked on for 9.6 and we have proposed patches. With GIN indexes, I don't see the value to #5, and in fact our TODO list now mentions on-disk bitmap indexes as undesired because of GIN being better in almost every use-case. The TODO text is:

The rigidity of on-disk bitmap indexes, and the existence of GIN and in-memory bitmaps make this undesirable.

I'd like to see two priorities:

1. In-place upgrades. I've been spoiled by SQL Server in the past, and find it really painful to upgrade servers to new versions.

2. Cross-platform completeness. No more features that are platform dependent, such as replication that does not support Windows.

Neither is sexy, but both represent big steps in operational maturity.

Maybe not in the exact order, a mixed bag of small and big features:

1. Declarative partitioning with proper FK propagation and global indices
2. Optimization in pushing down joins on partitions' keys/FKs (cascading them down), so related to item 1
3. IOT: Index Organized Tables
4. On-disk pluggable formats
5. Pluggable tablespaces: much easier to manage when dealing with huge tables, also easier to move between different I/O paths
6. Long-term dream: merge of Postgres-XL into the core. Sharding will never be enough unless there's a mesh/actor/load-balancing logic in there. I wish XL/X2 would get GTM integrated into the core, or get rid of it, and then multiple coordinators/datanodes could be configured per node using standard PG.
7. In-place upgrades
8. CPU/IO vertical parallelization; this could also be higher in the list, and I'm very happy to see it's getting there steadily.

Declarative partitioning is being worked on, probably for 9.7. Pushdown of joins is also for 9.6 or 9.7. #5 is hard because so much is shared between databases. #6 is being done as part of sharding, which is a multi-year project we have started. #7 is covered by pg_upgrade, unless you want zero-downtime upgrades, which requires logical replication and pg_upgrade.

I am so happy about INSERT ON CONFLICT; that is the biggest thing I am excited about in 9.5.

Going forward I would agree with others that in place upgrades would be very helpful.

The other one, which I don't know whether anything can be done about, is the amount of locking done during ALTER TABLE. Anything that reduces that locking would help a lot, because it would make it possible to run schema changes while users are still using the database.

One thing that I would love to see a little work done on is enum types. We use them extensively for our application, and they are amazing for not having to have a ton of static lookup tables, yet still being type safe.

I would love if there was some love given to being able to rename / delete an enum value without having to re-create the enum entirely (with renamed / deleted values) and migrate all table columns to the new enum, then drop the old one.

It's a usability request, but I've spent too much time on this because other developers were too eager in creating types, and the enum value names didn't match the standards we had laid out. Fixing them is a real pain in the butt now.
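The workaround being complained about looks roughly like this; the type, table, and value names are hypothetical. The whole dance is needed just to fix one misspelled label:

```sql
-- Existing type with a typo in one label.
CREATE TYPE order_status AS ENUM ('pending', 'shiped', 'delivered');
CREATE TABLE orders (id serial PRIMARY KEY, status order_status);

-- Renaming 'shiped' to 'shipped' means rebuilding the type and
-- rewriting every column that uses it.
BEGIN;
ALTER TYPE order_status RENAME TO order_status_old;
CREATE TYPE order_status AS ENUM ('pending', 'shipped', 'delivered');
ALTER TABLE orders
    ALTER COLUMN status TYPE order_status
    USING (CASE status::text
               WHEN 'shiped' THEN 'shipped'
               ELSE status::text
           END)::order_status;
DROP TYPE order_status_old;
COMMIT;
```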

Oh, one other thing I'd absolutely love to see is being able to push down a join condition / WHERE clause into a CTE (where applicable). They are such a useful tool for writing easy-to-understand queries, but having them work as an optimization fence makes them unusable in a ton of places I'd love to use them.

It's sad there is so much opposition to changing the current behavior regarding this.

There are a couple things about CTEs that make this difficult or impossible right now.

1. Because they can be recursive, CTEs currently exist as a temporary in-memory structure. This is an implicit optimization fence that can't really be avoided: the CTE has to be executed independently of the rest of the query to produce its results. It's basically just a temp table you're not creating manually.

2. Due to this, a large number of DBAs and other advanced SQL users have been using CTEs as an optimization fence to circumvent planner mistakes. It's possible to make a query orders of magnitude faster by using CTEs this way. If they were ever made an integral part of the planner instead of a temporary in-memory object, I'm pretty sure the outcry would be audible from space.
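The fence is easy to see with EXPLAIN; the table name here is hypothetical, and this reflects the behavior as of 9.5:

```sql
-- The CTE form materializes all of big_table first, then filters:
EXPLAIN
WITH recent AS (
    SELECT * FROM big_table
)
SELECT * FROM recent WHERE id = 42;

-- The equivalent subquery lets the planner push the predicate down,
-- so an index on big_table(id) can be used:
EXPLAIN
SELECT * FROM (SELECT * FROM big_table) AS sub WHERE id = 42;
```

The second plan can use an index scan; the first is stuck with a full scan of the CTE's materialized result.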

Now, this isn't to say pushdown is bad. The planner is generally missing this in several areas, including (and maybe especially) foreign tables and complicated views. It's just that opposition isn't really the only thing preventing the use case you're suggesting.

I understand that writable CTEs are not applicable for pushdown, and I can reason that recursive ones are also not applicable (though if you know the exact reasons, I'd love to learn).

All of this should be known at planning time, though, so it seems possible to enable pushdown for CTEs that meet a set of criteria deemed "safe".

I do understand the need to preserve the status quo for existing users, but don't underestimate what this does to users wanting to migrate to Postgres from other systems. I worked with SQL Server for years, and when moving to Postgres it was quite painful to realize I couldn't use CTEs the same way I was used to, for performance reasons. Having the fence be a GUC, or even part of the CTE syntax, would be infinitely better than what we have now, IMO.

If PostgreSQL got pgbouncer's transaction pooling built in, without the caveats, then that would really be a performance-enhancing killer feature: cheap database connections with dynamically managed forked workers.

"Just turn on database pooling in your application" (most of those suck; they keep way too many connections open, which is a problem for PostgreSQL).

And people that already have that setting turned on will almost certainly see a big reduction in worker processes and probably an increase in available cache memory.

First, a fair disclaimer: as a longtime (22+ years) Oracle DBA my opinion is certainly biased, but since we're actively looking for a "cheaper" alternative (for at least some workloads), I see only PostgreSQL as a possible alternative. I love everything about PostgreSQL. Almost. The lack of two crucial features is preventing us from deploying PG at the moment. First is the "neglected" state of backup and recovery in PG. The "feature" we evaluate first with every DB engine is how strong its backup and recovery tools are. Sorry to say, but we're spoiled by Oracle RMAN, hence it feels like we're in the stone age while dealing with backup and recovery procedures in PG. OK, we could somehow survive with the current state of backup and recovery. But the complete lack of built-in auditing capabilities is a showstopper for us (and no, log_statement='all' doesn't count as a proper audit, though the pgaudit extension looks promising). Of course, we're looking forward to other already-mentioned enhancements as well (declarative partitioning being at the top of our list), but our vote goes to: #1 true incremental binary backup and recovery, #2 full support for DML/DDL auditing.

We've started doing #1 here - https://github.com/2ndquadrant-it/barman/issues/21. Page-level increments, compression, parallelism. Both for backups and recovery, of course. But as of now it is just a PoC implementation.

Thanks for the tip about #1, looks promising, I hope someday this feature will end in core PG, because PostgreSQL certainly deserves solid backup and recovery tools.

About the #2 proposal: as far as we know, pgaudit doesn't log bind variable values, and without that it can't really be used for auditing. Laws in the EU that cover processing/auditing personal data are way more restrictive and demanding than US laws (where personal data is just a commodity). Right now we're forced to look at commercial solutions (such as Imperva). That's really a pity, because PG could prosper in various EU-funded projects in the public sector, where protection and auditing of personal data processing in RDBMS databases effectively mandates using a commercial database (mostly Oracle). In other words, what we badly need is a replacement for Oracle's FGA feature (allowing us to place an audit policy on particular columns). If that's not possible, then we would at least love to see pgaudit gain an option to log bind variable values.

With row level security introduced in PG 9.5, the lack of proper auditing is the last obstacle for many governmental projects to consider PG as a new DB platform. As much as I love working with Oracle database, I would enjoy even more migrating our stuff to PostgreSQL.

Indeed. We are using `refresh materialized view concurrently`, but it seems that the refresh operation is always blocked by some reader lock on the view. So, yes, possibly auto-refresh of materialized views.
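For reference, concurrent refresh (available since 9.4) requires a unique index on the materialized view; the schema below is hypothetical:

```sql
-- Matview with the unique index that CONCURRENTLY requires.
CREATE MATERIALIZED VIEW daily_totals AS
    SELECT day, sum(amount) AS total
    FROM sales
    GROUP BY day;

CREATE UNIQUE INDEX ON daily_totals (day);

-- Refresh without taking an exclusive lock against readers.
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_totals;
```

Even so, as the comment notes, the refresh always recomputes the full query; only the final application of changes is done row by row.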

#1 is coming as part of sharding, perhaps in 9.7. #3 has been proposed as a patch and hopefully will be in 9.6. I don't think #7 will be done anytime soon. #8 might be done in 9.5 with the shared buffer improvements --- more testing is necessary.

One feature I'd love to have for cstore_fdw is support for indexes. cstore currently uses built-in min/max indexes to skip over unrelated data segments. This helps, but being able to leverage PostgreSQL's indexing subsystem would be huge and help cstore support many more use-cases.

For newcomers to FDWs, I think tutorial-like documentation that gets people started with the simple APIs and gradually teaches them new ones would help. Also, most FDW writers copy and paste pieces from other FDWs. Having a reference FDW with a modular design could help developers get started.

But seriously, this is all thanks to you and the other devs really kicking ass on the engine these last few years.

For me, there are only a few big "missing" things I've always wanted in Postgres:

1. Sharding. I work with a few huge databases that need horizontal scalability, and this has always been a manual process. Postgres-XL gets us some of the way, but depending on a coordinator precludes independent node loading and drastically reduces import speed. CitusDB still has major issues with transaction limitations. We're getting there, but it's still very much a roll-your-own situation right now.

2. Correlated column statistics. I know this is an extremely difficult problem to solve, but it's seriously hurting the planner in some very specific ways. I always cry a little inside when I have to move WHERE conditions outside of an optimization fence to speed up queries, because the multiplied probabilities seriously screwed up the row estimates.

3. Not query hints, but... something. What we have right now is arguably worse, since we're hard-coding optimization fences like CTEs, OFFSET 0, and other such quirks directly into our queries. Those are a lot harder to ignore, disable, or deprecate than some SQL decorators. I've seen this discussion come up several times in the mailing lists, but it always vanishes into The Ether within a few days. :(

Parallel queries *used* to be on my list too, but that's clearly being addressed in 9.6 and subsequent versions. Putting the background workers in was really a genius move to get this moving, considering the implications for unrelated extensions leveraging it as well.

In any case, keep up the good work; I'm not sure you'll ever really comprehend just how grateful many of us really are.

I may be in the minority here, but having seen the lies and havoc that MySQL's "storage engines" cause, I am against having different on-disk formats. I suppose my worries could be mitigated if it were guaranteed that there would not be substantial features limited to certain formats but this feels like the kind of thing where once you open the door, you can't be sure what will walk in.

Part of the problem with MySQL's approach is that a wide variety of pre-existing, loosely related storage engines were taken and wrapped by MySQL, kind of like how Pg's FDWs work. If Postgres does pluggable storage engines right, any variety of engines will be largely designed for Postgres from the start and work as well as the single engine we have today. Switching engines should not change the feature set you get, just how various use cases perform.

I'm sympathetic to the concerns about fragmentation and I share them. That's certainly something to watch out for. And I don't entirely agree with Darren's comment that every storage engine must offer the same features - e.g. a read-only FDW could save a lot of space on disk by not supporting writes. But I think there's too much potential upside to variant storage formats to continue ignoring those possibilities.

My understanding is that refresh is not incremental, i.e., a view refresh reruns the whole query rather than just diffing, which would obviously be much faster (like react.js does with DOM updates). The 9.4 change was just to refresh concurrently, i.e., to not lock out the old view while it is being refreshed. Ideal would be the ability to combine the materialized state with an on-the-fly query for any new tuples not yet in it. Configurable auto-refresh, such as updating immediately or during slack periods, would be great as well, since for many views, such as period summaries, it's fine to have some lag.

Yes, I think you are right. You are looking for per-row updates. The 9.4 code only _updates_ the changed rows, but it does compute the entire query. I don't think per-row updates are always possible, e.g. AVG() columns, but we could offer it where it is possible. You could get around AVG() by storing the SUM and COUNT, but more complex functions might not be possible.
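The SUM/COUNT trick for AVG() can be sketched as follows, conveniently using the new 9.5 upsert syntax; the schema is hypothetical:

```sql
-- Store the decomposed state instead of the average itself.
CREATE TABLE sales_summary (
    day        date PRIMARY KEY,
    sale_sum   numeric NOT NULL,
    sale_count bigint  NOT NULL
);

-- Apply a single new sale incrementally, no full recomputation:
INSERT INTO sales_summary (day, sale_sum, sale_count)
VALUES ('2016-01-12', 19.99, 1)
ON CONFLICT (day) DO UPDATE
    SET sale_sum   = sales_summary.sale_sum + EXCLUDED.sale_sum,
        sale_count = sales_summary.sale_count + EXCLUDED.sale_count;

-- The average is derived on read:
SELECT day, sale_sum / sale_count AS avg_sale FROM sales_summary;
```

This works because SUM and COUNT are decomposable aggregates; as the comment says, more complex functions may have no such decomposition.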

Yes, I would expect aggregates to require recalculation if any of their inputs had changed, but if one structures multiple views hierarchically, such as daily totals, then monthly totals, then yearly, the heavy lifting would be done at the detail level, where the most benefit from incremental refresh would be obtained; recalculating months and years based on daily totals would be relatively trivial. Also very useful, I think, for storing the transitive closure of large graphs: only recalculating nodes where a sub-node has changed is obviously much more efficient than recalculating all nodes. That's something I currently do, and it enables lightning-fast graph traversal queries.

Yes, such recalculations are certainly possible, though with Postgres's plug-in aggregates, we could only do it for built-in aggregates that support this approach. I don't know of anyone working on this, however.

+1 for "logical replication". Right now there's no online way to split a cluster or migrate a single database to another cluster. If you use pg_dump, you have to stop all writes until the database has dumped & restored, or risk losing transactions. The only other way is to use replication to duplicate the whole cluster, then remove the databases you don't want to migrate. If you're splitting because of space, having to duplicate databases that you'll immediately throw away is a waste of time and storage. Being able to scale horizontally means having convenient tools to manage scaling out. Right now the inability to split a cluster online is a big impediment.

Not mentioned here yet is a real need for better tools for parsing and understanding query plans. [explain] output is too hard to read and too cryptic for the casual database user. Query tuning is a black art and I don't think it should be. I was thinking of putting together some better data visualizations of explain outputs for pgconf us this year, but unfortunately I'm not sure now that I'll have the time to do it.

see: http://www.postgresql.org/docs/current/static/plpgsql-control-structures.html#PLPGSQL-STATEMENTS-RETURNING : The current implementation of RETURN NEXT and RETURN QUERY stores the entire result set before returning from the function, as discussed above. That means that if a PL/pgSQL function produces a very large result set, performance might be poor: data will be written to disk to avoid memory exhaustion, but the function itself will not return until the entire result set has been generated. A future version of PL/pgSQL might allow users to define set-returning functions that do not have this limitation.

So in other words, return cursors instead of sets. I too have this issue with larger data sets. I have worked around it in the past with views, putting any required PL/pgSQL functionality into functions that are called from the WHERE clause to filter rows that cannot be filtered via the view's SQL. I also think that cursors might help reduce memory use when neither the client nor the server can hold the whole result set.
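A minimal sketch of the cursor workaround; the function, cursor, and table names are hypothetical:

```sql
-- Return a refcursor so the caller can fetch rows incrementally,
-- instead of having RETURN QUERY materialize the whole set.
CREATE FUNCTION big_result() RETURNS refcursor AS $$
DECLARE
    c refcursor := 'big_result_cur';  -- fixed portal name for the caller
BEGIN
    OPEN c FOR SELECT * FROM big_table ORDER BY id;
    RETURN c;
END;
$$ LANGUAGE plpgsql;

-- Usage: the cursor only lives inside a transaction.
BEGIN;
SELECT big_result();
FETCH 1000 FROM big_result_cur;
-- ...fetch more batches as needed...
COMMIT;
```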

PostgreSQL's query rewriter is nice. Views in PostgreSQL work much better than in other database software. I wish that it also applied to CTEs.

Replacing WAL logs with incremental backups sounds like replacing two very similar things. Your backup solution would have to be doing something intelligent with its own incremental backups to have an advantage (such as merging replaced blocks, which sounds like a hot standby). Otherwise, WAL logs are not functionally different than your own incremental backups. I have heard of taking backups from the standby server before. Maybe this is why. The standby is acting as an incremental backup solution.

For me (us), incremental backup done properly translates to the way the leading commercial RDBMS vendor does it (at the block level). Our typical data processing looks like this:

- Full backup on Saturday (a non-working day).
- From Monday to Friday, our Oracle db generates ~150GB of redo log per day. At night we take an incremental daily backup, which is just ~3-5GB (the changed database blocks).

If we need to recover the db on Friday, we take the full backup from Saturday + ~20GB of incremental backups + some redo logs from Friday. If we were using PG, we would need to apply 600GB of WAL files. In addition, if we find that we need to do point-in-time recovery due to some logical error, back to some previous time, say the current time minus 2 days, this is done more easily and quickly with the help of incremental backups. As you can see, it's not about replacing WAL logs altogether; also, I don't see how a standby can truly replace incremental backup (for example, by allowing faster PITR).

With a hot standby on its own I/O (not part of a SAN, perhaps local RAID), you can back up daily without affecting production, so you would not have to accumulate that many WAL files. But optimizing for faster restores is obviously a plus. Even with improvements to that, it is nicer to have a time delayed hot stand by, which you can do now. Even with the best restore technology, something that is "more online" is a better fall back plan.

So basically, set shared buffers higher and add a feature to make a temp-like table error out instead of flushing to disk when it gets too big.

Someone has asked about using /dev/shm for a tablespace before, but the problem there is that the schema / table definition needs to live separately from the tuples for that to not cause issues for the rest of the cluster. Also, tuples take up room in both /dev/shm and shared buffers, so you need double the memory. /dev/shm might not be the best use case for disposable table spaces.

Having ephemeral storage that is not /dev/shm could still be useful for other things. Maybe temp spaces that are not RAIDed or something, and the process that needs that space can be re-run without too much pain.

My wish list contains many of the improvements mentioned above, plus flashback queries (e.g. SELECT ... AS OF 5 minutes ago). Since PostgreSQL has MVCC, I think this feature could be implemented, and it should degrade gracefully when an involved dead row has already been deleted by vacuum (the error could be something like: flashback query has no visibility for the required timestamp).

This kind of feature makes transactions "oops-safe" :-)

Another big dream on my wish list, not related to the PostgreSQL core, is an Apache mod_plpgsql module (capable of executing a function over HTTP), on top of which a database-centric development framework could be built. With that, any SQL/plpgsql developer in the world becomes a super-productive Web developer.

Incremental refresh of materialised views which does not need a full re-computation of the view. It would be nice if it could deal with nested materialised views, joins (LEFT, INNER) and some windowing and aggregate functions (e.g row_number, count, min/max, sum, string_agg). This would really reduce the time to produce analytics from large warehouse datasets that only change a small amount on each update.

Configure the server to retain a snapshot of the database that is at least x minutes old (like a rolling replication slot) and offer the ability to revert a table (or set of tables) back to a specific trans ID.

Yes, I was aware of that. But for smaller setups, setting up something dedicated like that is not as convenient, and they are the ones who probably need it the most. The syntax, I imagine, would be like a truncate, but that it would accept a trans ID. Also, maybe instead of reverting the table, it could optionally put the rows into a new table, or maybe you could query it like the old time travel functionality, but with trans IDs. Anyway, I am sure that it would be a hit with the devs and DBAs who sometimes forget to use transactions and WHERE clauses at the same time!

#3: Connection pooling on the server side is a hack. Clients should always do their own pooling because even if the server has a pool, the TCP handshake is an unavoidable delay. Client based pools skip the TCP handshake when a connection is reused. Therefore, server based pools are always inferior to client pools. No hacks please.

I agree that server side pools are inferior to client side pools; but that doesn't mean server side pools are useless. For example, suppose there are several - perhaps even many - different applications accessing the same database server. Even though each application may have its own connection pool, the total number of connections to the server across all of those pools may be very high. If the server can handle that without becoming overwhelmed, that is better.

Then the correct thing to do is to tune the client side pools to close down idle connections instead of aggressively keeping so many unused connections open for long periods of time. I suppose it might help to enforce this on the server side. Maybe the DBAs know better than the application developers :-)

Regarding pooling, it might be useful to have something LISTEN/NOTIFY-ish with expiring client-side caches that have a volatility of STABLE. That, I would love to see.

I would like to see SQL hints. I know that PG should choose the best query plan based on available stats. Sadly, we cannot (or do not want to) gather such stats because of the high data volume we manage. There should be a way to manually improve queries when needed.

CREATE INDEX with parallel processes! We want to use Postgres for some very large BI warehouses, but index creation is seriously hurting. 9.5 was a huge win for performance, but parallelism is what we really need.

I would really like to have an option to hint to Postgres that a table should be held in memory. I don't mean temporary tables: if an update occurs, the data should still be written to disk. It would be really nice to have such a feature.

PostgreSQL is proving it's growing and maturing, and that it is strong enough to stand against the existing giants, Oracle or any other. I had a chance to bid on a project proposing 9.5 against Oracle, where the users were very impressed by what our cloud app with PostgreSQL can offer instead of Oracle. This is a huge challenge in Thailand. Keep up the good work, PostgreSQL.