Tuesday, June 08, 2010

Why Join Removal Is Cool

When people talk to me about the (limited implementation of) join removal that will be part of PostgreSQL 9.0, the conversation usually goes in two ways. Some people ask how the feature works and then say something like "oh, I guess that could be useful every once in a while". Other people already know exactly how the feature works and usually say some variant of "this is an amazingly wonderful feature that I am looking forward to with great enthusiasm".

The difference between these two groups of people (I think) is not so much their level of technical knowledge or how closely they've been following pgsql-hackers, but their use case. If your database is primarily a data warehouse, my guess is that you won't have many occasions to benefit from join removal. Where this feature really comes in handy is in OLTP workloads with highly normalized data, in situations where users are generating queries against views (perhaps through some sort of reporting interface) and expecting to get results back immediately.

Let's take an example. Suppose you're writing a bug-tracking system. Each bug has a number of properties associated with it: who reported it, who's working on it, current status, priority, date opened, release for which it's slated to be fixed, date of last status change, date resolved, people who want to get an email when it's updated, comments, etc. If like me you're a big fan of database normalization, you'll not want to store all of these as text fields. So you might end up with a table like this:

You'll probably also end up with some supplementary tables for the items that can exist multiple times, like bug_comment and bug_watchers. Now, to make reporting easier, you'll probably want to define a view over the bug table that joins to all the other tables, so that it's easy to get the text values for the reporter, status, etc.

And now you can pretty easily write an engine that will let users select the columns they'd like to see from bug_view and the filter conditions they'd like to apply (only open bugs, only bugs slated to be resolved in release X, etc.) via a spiffy web interface. Note that the reporter, bug status, and priority fields can't be null, so we can use a plain old JOIN, but the bug might be assigned to no one or have no target release, so we use LEFT JOIN in those cases. (Otherwise, rows where those fields were NULL would not appear in the output.)

Over time, you'll tend to add more fields. Scalar fields like open_date don't cause much heartache, but as you add more fields that require joins, your view will tend to slow down. Some people might say that the answer is simply to denormalize - use natural keys in the bug table, and don't join. While that solution may be appropriate for some people, it is not without its downsides: database normalization was invented for a reason. The good news is that PostgreSQL is fast and has an excellent query planner, so even fairly complex queries run quite quickly. The bad news is that every query against the view is going to hit every table that's part of the view definition, so if you add enough of them, it's eventually going to be slow.

And, realistically, most of the time, users aren't going to want all the columns anyway. In a web application, 8-12 columns of output in an HTML table is typically about as much as you can squeeze in without starting to have a lot of line wrapping issues. This leads pretty naturally to the following question: if you don't need all of the columns, can you skip some of those joins and speed up the query?

Yes. In PostgreSQL 9.0, we can drop a join against a base table if (1) it's a left join, (2) there is a unique index on all or a subset of the join columns, and (3) none of the attributes from the nullable side of the join are used elsewhere in the query. So, in the above example, we could skip the joins to person at or release tr if the assigned_to or target_release columns, respectively, are not selected, assuming those tables have unique indexes on their id columns (if they don't, the join might change the number of rows in the output, so we must perform it).

We can't skip joining to any of the other tables, because those are inner joins. That's an implementation restriction which I hope will be lifted in PostgreSQL 9.1, but some more logic is needed to make that safe. In the meantime, a useful workaround may be to write those joins as LEFT JOINs rather the INNER JOINs, in cases where either join type will produce the same results.

11 comments:

That is handy. Helps keep the query generator on the app-side rather small as it can always leverage the view's abstraction, and/or avoids nearly duplicate views. Well, maybe not always at this point, but it certainly seems like a step in the right direction.

While I definitely think this is a very useful feature, I'm not sure that ORMs (Django specifically) can benefit from this as much as some have suggested. The article suggests that the left join can be removed as long as the fields are not selected upon, but in Django most model calls will select all fields modeled thus not allowing the left joined table to be eliminated

"join removal is a dumb feature, rather learn to design proper queries!"

it's removing the heavylifting from programmers, we can write generalized query(this article has good example, letting the user select which field needed be displayed) and let the database engine intelligently decide if join operation(s) can be omitted from execution

About 2 minutes into the tutorial "8. Demonstrating Table Elimination" http://screenr.com/iQG we show a second kind, dependent on foreign keys being declared in the attributes, and putting a condition on the attribute. Do you know if PostgreSQL supports this too?

If you want to know more about the technique described in the tutorial, head over to http://www.anchormodeling.com, and don't forget to try out our online modeling tool :)