Wednesday, November 11, 2015

Parallel Sequential Scan is Committed!

I previously suggested that we might be able to get parallel sequential scan committed to PostgreSQL 9.5. That did not happen. However, I'm pleased to report that I've just committed the first version of parallel sequential scan to PostgreSQL's master branch, with a view toward having it included in the upcoming PostgreSQL 9.6 release.

Parallel query for PostgreSQL - for which this is the first step - has been a long-time dream of mine, and I have been working on it for several years, starting really in the PostgreSQL 9.4 release cycle, where I added dynamic background workers and dynamic shared memory; continuing through the PostgreSQL 9.5 release cycle, where I put in place a great deal of additional fundamental infrastructure for parallelism; and culminating in today's commits. I'd like to tell you a little bit about today's commits, and what comes next.

But first, I'd like to give credit where credit is due. First, Amit Kapila has been a tremendous help in completing this project. Both Amit and I wrote large amounts of code that ended up being part of this feature, and that code is spread across many commits over the last several years. Both of us also wrote large amounts of code that did not end up being part of what got committed. Second, I'd like to thank Noah Misch, who helped me very much in the early stages of this project, when I was trying to get my head around the problems that needed to be solved. Third, I'd like to thank the entire PostgreSQL community and in particular all of the people who helped review and test patches, suggested improvements, and in many other ways made this possible.

Just as importantly, however, I'd like to thank EnterpriseDB. Without the support of EnterpriseDB management, first Tom Kincaid and more recently Marc Linster, among others, it would not have been possible for me to devote the amount of my time and Amit's time to this project that was necessary to make it a success. Equally, without the support of my team at EnterpriseDB, who have patiently covered for me in many ways whenever I was too busy with this work to handle other issues, this project could not have gotten done. Thanks to all.

The new Gather node launches a number of workers, and those workers all execute the subplan in parallel. Because the subplan is a Parallel Seq Scan rather than an ordinary Seq Scan, the workers coordinate with each other so that each block in the relation is scanned just once. Each worker therefore produces a subset of the final result set, and the Gather node collects all of those results.
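As an illustration, a plan of this shape might look roughly like the following. This is only a sketch: the table name, costs, and row counts here are made up, and the exact EXPLAIN output format may differ from what the committed code actually produces.

```sql
-- Illustrative sketch only; pgbench_accounts and all numbers are hypothetical.
EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%a%';

--  Gather  (cost=1000.00..218096.00 rows=101010 width=97)
--    Number of Workers: 4
--    ->  Parallel Seq Scan on pgbench_accounts
--          (cost=0.00..207000.00 rows=101010 width=97)
--          Filter: (filler ~~ '%a%'::text)
```

The shape is the point: a single Gather node sitting directly on top of a Parallel Seq Scan, with the filter evaluated by each worker on its share of the blocks.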

One rather enormous limitation of the current feature is that we only generate Gather nodes immediately on top of Parallel Seq Scan nodes. This means that this feature doesn't currently work for inheritance hierarchies (which are used to implement partitioned tables), because there would be an Append node in between. Nor is it possible to push a join down into the workers at present. The executor infrastructure is capable of running plans of either type, but the planner is currently too stupid to generate them. This is something I'm hoping to fix before we run out of time in the 9.6 release cycle, but we'll see how that goes. With things as they are, about the only case that benefits from this feature is a sequential scan of a table that cannot be index-accelerated but can be made faster by having multiple workers test the filter condition in parallel. Pushing joins beneath the Gather node would make this much more widely applicable.

Also, my experience so far is that adding a few workers tends to help a lot, but the benefits do not scale very well to a large number of workers. More investigation is needed to figure out why this is happening and how to improve on it. As you can see, even a few workers can improve performance quite a bit, so this isn't quite as critical to address as the previous limitation. However, it would be nice to improve it as much as we can; CPU counts are growing all the time!

Finally, I'd like to note that there are still a number of loose ends that need to be tied up before we can really call this feature, even in its basic form, done. There are likely also bugs. Testing is very much appreciated, and please report issues you find to pgsql-hackers (at) postgresql.org. Thanks.

That's up to the operating system scheduler, but presumably yes. If the total number of processes trying to run on the machine is greater than the number of CPU cores, then they can't all run simultaneously, so in that case you could potentially end up with workers that aren't actually running, which would be just a waste of resources. But assuming that's not the case, the system will presumably run all the workers at once.

How do we deal with varying I/O overhead? E.g. on a 16-core box I'd love to have 10+ workers doing a parallel sequential scan over data that is already in memory. But in case the data must be read from storage, I want to throttle the worker count so that they do not compete for I/O. Micromanaging such details for each query seems tedious. Is there any chance PostgreSQL would figure out the best plan (varying the worker count) automatically?

There is a ton of interesting research work to be done to solve problems like this. Right now, it's very unsophisticated. The emphasis up until now has been on getting it to work at all - from here, we need to shift to more of a performance focus, but ironing out complicated problems like this will take time and, importantly, reports of real-world experiences from the field.

I agree with anonymous, and I hope you get time to make some progress on optimizing the number of workers for a given query. The `max_parallel_degree` setting should be a maximum number of workers, as the name implies, not a fixed number. I presume this is why you gave it this name?

Parallel query is more of a vertical scalability solution than a horizontal scalability solution. The goal is to effectively use multiple CPUs on a single machine, not multiple machines as CitusDB does. Using multiple machines would be a cool project, too, but it wouldn't be this project, even though many of the query optimization problems are related.

For horizontal scalability I would rather refer to Postgres-X2, and especially XL, which has a similar Scatter-Gather approach with the optimization that each worker node can push down work to other worker nodes, so it is more like a mesh/grid and not a tree. I do hope XL, X2, and core will be merged at some point. Robert, I missed the last EU pgconf unconference - I think this was supposed to be discussed there? Big thanks for your work, of course!

> adding a few workers tends to help a lot, but the benefits do not scale very well to a large number of workers.

This is because with a large number of concurrent readers you start to get I/O waits. Basically, for data stored on a single disk, 4 workers would be optimal if: (time of reading a block of data from disk) == 3 x (time of processing a block of data).

This is a really great feature! Do you think it makes sense to put in a setting on the total number of parallel workers? My concern is that if each query can spawn 4 threads and we get 100 connections, we may end up with 400 threads. It would be nice to have a global limit on PQs. Then we would not end up like work_mem, which can eat all your memory. Thanks!

max_worker_processes limits the total number of worker processes that can exist in the system at any one time for any purpose. Parallel worker processes count against this limit, as do any other background worker processes that are running.
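For reference, the relevant settings live in postgresql.conf. The values below are a hedged sketch, not recommendations; note that `max_parallel_degree` is the name in the 9.6 development branch and could change before release.

```
# postgresql.conf -- illustrative values only
max_worker_processes = 8     # system-wide cap on background worker processes of all kinds
max_parallel_degree  = 4     # per-query parallel worker limit (9.6 development name)
```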

I just tested that myself on a fresh build :) It seems it automatically picks the highest available number, in my case 5, which seems to be max_worker_processes - autovacuum_max_workers (8 - 3 on a default install).

Vibhor Kumar shared the news yesterday, that today made HackerNews home page. This is a huge game changer! I'm pretty sure it also impacts development and testing complexity. I hope it evolves into a horizontal scalability solution sometime soon. Thanks everyone!

How large can one create the dynamic shared memory which is used internally for communication and transferring of tuples from workers to the main process? And does increasing it give better performance?

postgres=# insert into test select generate_series(1,10000000);
INSERT 0 10000000
postgres=# explain (verbose,analyze,costs,timing,buffers) select count(*) from test;
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
        This probably means the server terminated abnormally before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> \q

An OOM occurred.


How can we implement parallel queries in version 9.4? I tried searching, but all blogs refer to 9.6. Actually, we will be scaling to more than 1,000,000 users, so we need to improve performance for concurrent users, as the response time has shot up to almost 15 s per request, which gradually goes to a timeout.