The Power of Subqueries in PostgreSQL

I’m pretty sure everything I’m writing here is true for most other RDBMS, too, but since I’m currently using PostgreSQL I had a chance to test it and show some hard figures here.

The problem to solve is actually a common one and conceptually easy: take attributes from table2 and store them in the corresponding rows of table1, using a common id to join them. The straightforward (and almost direct) translation into SQL is therefore:
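A minimal sketch of that naive statement, assuming table2 references table1 via a hypothetical person_id column and the misplaced attribute is called attr (both names are my assumption, not from the actual schema):

```sql
-- Naive UPDATE ... FROM: joins the huge table2 against table1 directly.
-- Column names (attr, person_id) are hypothetical placeholders.
UPDATE table1 a
SET    attr = b.attr
FROM   table2 b
WHERE  b.person_id = a.id;
```

Note that with UPDATE ... FROM, when several rows of table2 match one row of table1, PostgreSQL uses an arbitrary matching row, and the join still has to be computed in full.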

There’s nothing wrong with that statement, and it does exactly what’s intended. It’s just not very clever, and hence not very performant. That obviously only matters if your tables are on the bigger end. In my case table1 (the one to update) has 576,806 rows, while table2 (the one providing the attribute data) has a whopping 848,664,485 rows. I should also mention that table2 contains multiple rows for each corresponding row in table1. Concretely, table2 holds data about point locations (latitude, longitude, timestamp) of people whose attributes (age, gender, etc.) are stored in table1. And there is this one attribute which is wrongly and inefficiently stored with each point location, although it depends only on the person and should therefore live in table1.

Now, let’s have a look at the query plan PostgreSQL produces, to see how inefficient the statement really is:
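The plan can be inspected without running the update; a sketch of the invocation, using the same hypothetical column names as above:

```sql
-- EXPLAIN prints the planner's chosen strategy without executing the UPDATE.
EXPLAIN
UPDATE table1 a
SET    attr = b.attr
FROM   table2 b
WHERE  b.person_id = a.id;
```

(EXPLAIN ANALYZE would additionally execute the statement and report actual timings; wrap it in a transaction and ROLLBACK if you don’t want the update to stick.)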

Yep, that looks busy and not like it’s going to finish fast… I actually did a timed test run with only the first 99 rows of table1 (a.id < 100). Notice that the query plan changed slightly to account for this limitation:
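The timed test run amounts to adding the id restriction to the naive statement, roughly like this (attr and person_id are again hypothetical names):

```sql
-- Restrict the update to the first 99 rows of table1 for a timed test run.
EXPLAIN ANALYZE
UPDATE table1 a
SET    attr = b.attr
FROM   table2 b
WHERE  b.person_id = a.id
AND    a.id < 100;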

More than 17 minutes is a long time to update 99 rows. Too long! Luckily, salvation is just around the corner, and it comes in a very slick and efficient form: a subquery. Often overlooked, it can speed up query (and update) statements like ours significantly, without making the code the least bit more complicated:
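Rewritten with a correlated scalar subquery in the SET clause, the statement might look like this (assuming, as before, hypothetical columns attr and person_id):

```sql
-- For each row of table1, fetch one matching attribute value from table2.
-- With a suitable index on table2.person_id this becomes a cheap index probe
-- per person instead of a join over the full 848-million-row table.
UPDATE table1 a
SET    attr = (
         SELECT b.attr
         FROM   table2 b
         WHERE  b.person_id = a.id
         LIMIT  1
       );
```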

The LIMIT 1 is owed to the aforementioned fact that table2 contains multiple rows for each row in table1. It’s necessary because the SET clause expects a single value, so we have to make sure the subquery returns exactly one record. And this is how it performs in the real-world example:
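For a like-for-like comparison, the subquery variant can be timed over the same 99 rows (hypothetical column names as above):

```sql
-- Timed run of the subquery variant over the same 99 test rows.
EXPLAIN ANALYZE
UPDATE table1 a
SET    attr = (
         SELECT b.attr
         FROM   table2 b
         WHERE  b.person_id = a.id
         LIMIT  1
       )
WHERE  a.id < 100;
```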

Wow, we’re down to 206 milliseconds now. That’s just 0.02% of the previous execution time, or a speed-up by a factor of roughly 5,030! That can definitely be labeled “performance tweaking”. So next time you have a query involving two large tables, think about employing a subquery. This works with both SELECTs and (as shown) UPDATEs, and should also work with DELETEs – please let me know in the comments if you have any experience there.