Sunday, February 26, 2006

We all know that SQL can be a harsh mistress, and relational theorists such as Chris Date have long argued that the language is fundamentally flawed and that vendors have been misapplying relational theory from the start.

Now a Perl developer on perlmonks.org has been reading Date's book and finds that it explains all of his frustrations with databases. One tip for addressing their shortcomings is to add the handy DISTINCT keyword to every single query, because stupid old SQL doesn't automatically apply the degree of uniqueness you might have in mind:

In fact, one of the founders of relational theory, C.J. Date, recommends that every SQL SELECT statement be written as SELECT DISTINCT ... Unfortunately, many folks get upset with this for a variety of reasons. First, DBMS do a horrible job of optimizing DISTINCT queries. There are many cases where the DISTINCT can be safely ignored but in practice, using DISTINCT with large sets of data and complicated queries can have an abysmal impact on performance. I think this can be put into the "premature optimization" category, though. Until we know that DISTINCT is causing an impact, isn't it better to improve the chance of getting correct results and worry about performance afterwards?

I had to read that a couple of times to realise that adding a DISTINCT to every single query in the hope that it might mask some unknown deficiency in your model, your query, or the SQL language itself is not the "premature optimization" being referred to here - he means the idea that doing so might affect performance. After all, it might not, right?

7 comments:

I'll agree that we usually don't want repeated rows and that they're typically the result of a poor approach. Lets consider our options:

1a. Write SQL query that returns distinct rows.2. Add DISTINCT to it and watch the database spend resources on the unnecessary.

1b. Write SQL query that does not return distinct rows.2. Add DISTINCT to it and hide the fact that the underlying query needs to be re-written.

While it'd be lovely to see a rdbms that could deal with the above in a performant manner, at that point we're getting into ontological systems. Now we want systems that can discern that you ought to have joined tables t1 & t2 via intersection table q instead intersection table p. I'm reminded of the quip that every description of good public policy can be improved by adding "and everyone gets a pony too." (See John and Belle have a blog)

I think 10g R3 is going to come with ponies for everyone. Too little too late, is what I say. Why couldn't they have simply invented a query language that can correctly guess when I'm thinking of distinct rows and when I'm not?

btw I should perhaps mention that I have not read Chris Date's book and so cannot confirm whether he is accurately quoted, in the right context, as seriously suggesting we all indiscriminately slap the keyword "DISTINCT" on all SQL queries like a bunch of complete idiots.

I have read the book, but don't have it with me at the moment. One place I have seen this suggested is in an article called "The Askew Wall" by Hugh Darwen here:

http://www.thethirdmanifesto.com/

You need to understand the context in which it is said, which is that a "true" RDBMS (TRDBMS) should never return duplicate rows. So "select deptno from emp;" should return 3 rows not 14. So if you want SQL to behave more like a TRDBMS, you have to add DISTINCT to all queries.

This is of course NOT meant as advice about writing performant queries for production Oracle systems.

Presumably a Truly Madly Deeply Relational Database could default to DISTINCT unless you explicitly specified ALL, if for example (as someone pointed out on the perlmonks thread) you actually wanted a list of salaries so you could add them up. I expect this is all made clear in the book.