DISTINCTly UnSETtling

In a recent Perlmonks thread, I managed to tick a lot of programmers off by pointing out some problems with SQL ignoring set theory. I showed how what superficially appear to be a bunch of logically equivalent queries return different results. However, they all return the correct results if the "DISTINCT" keyword is used.

Unfortunately I used an unnormalized schema as an example in order to keep things simple. Many programmers focused on this. When I write things "off the cuff", I have an annoying tendency to think that providing a simplistic example will make the problem clearer. This is often true, but people frequently focus an obvious problem and ignore the larger ones. I sometimes do this myself. On the plus side, if I never posted anything controversial, that would suggest that either I'm following the herd or I'm not thinking enough, so on with the controversy!

The problem is that SQL returns bags of data instead of sets. Rarely do we ever want bags. Some argue that we really are getting sets instead of bags because "under the hood" the query engine is fetching off all of the data and merely masking out the data we don't want. The problem with this logic is that most people merely want to use SQL, not worry about the fiddly bits inside. Do you really want to write more garbage collection code in C? I don't want to have to stop at every query and wonder about whether or not I need to throw a "DISTINCT" in there. So the counter-example I provided was this a fictional story about an accountant who should be fired.

So a salesperson walks up to an accountant and says "Bob, for our customers with excellent credit ratings, what cities do they live in? I need to plan flights through those cities."

"What do you mean by 'excellent credit rating', Alice?"

"For the sake of argument, let's include everyone with a credit rating greater than 700."

To make that do what you probably want, you need a "SELECT DISTINCT" in there. That's what SQL should do by default, with an optional "NODISTINCT" or something similar in there.

Ben Tilly offered an interesting rebuttal. He mentioned how he was once a junior programmer bitten by Microsoft Access's default behavior of returning distinct results. He was summing receipts, but Access, having DISTINCT on by default, was discarding duplicate values, not duplicate results. That's silly. It's like saying all guys named "Bob" are the same guy. The Microsoft programmers had the right idea and the wrong implementation. Further, since so many Access databases I've worked on have been poorly designed, I suspect this was a common problem.

For this to work properly, DISTINCT should discard duplicate elements from the same tuple. If you truly have duplicates from different tuples, you have a denormalized database and you need to either normalize it or adjust your query to take this into account. (For example, a customer table might list "CITY" instead of "CITY_ID").

The argument I hear against this is an all too common one. Every piece of data should theoretically be tagged with the data value and the tuple of origin (and also the data type, but the reasons behind this are beyond the scope of this post). This means that queries would be more memory intensive and thus slower.

Does this sound familiar? We usually call this "premature optimization". Database vendors have focused so long on making things fast and worried about making things correct as an afterthought. The original relational theory proposed by Codd and evangelized by Date did not have this problem, but SQL won the query language wars and we seem to be stuck with it.

Can anyone provide a counter-argument? What I mean, specifically, can you argue that the default behavior should be to return true duplicates? Do we really want bags more often than sets? Obviously, the following query might return apparent duplicates if we had a properly implemented "DISTINCT" on by default:

SELECT total FROM order;

However, that apparent duplication would only be there because you really can have multiple orders totalling to $100.00 (yes, that's probably an unnormalized example. Imagine that I referenced an ORDER_ITEM table and summed the line items.)

Frankly, I can't imagine why anyone would want bags of results other than the obvious fact that most databases frequently fail to optimize queries with "DISTINCT" in them. But that's telling me I'm supposed to worry about the implementation details again. I just want to focus getting the results I want, not how I'm getting them. SQL is a declarative language and those languages' greatest strength is that they are result oriented.

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Without JavaScript enabled, you might want to
use the classic discussion system instead. If you login, you can remember this preference.

Please Log In to Continue

A wise man said any SQL query without an ORDER BY clause is a stealth bug waiting to emerge when the implicit order dependency explodes into view.

In the customer-visit itinerary, ORDER BY may actually be a better solution than DISTINCT -- the repetition value in the BAG will be needed to help plan how long to lay over in each city, as it measures how many visits are required there. GROUP BY (City) COUNT(Customer) might be better, and would get you the equivalent of DISTINCT.

If you need to know counts that's what "GROUP BY" is for, so we can do:

SELECT city, COUNT(customer_id) as customer_count FROM CustomerGROUP BY cityORDER BY city

This will have no dupes and gives you the same numerical information as a dupe-containing result, but in a nice sane way, with no need to count dupes programatically in the program that executes that query.

You've come up with a rather artificial reason why one might, in this case, want a bag instead of a set, but your example is arbitrary. Who knows why Alice is in a particular city? Maybe she has some weird contractual agreement that she needs to at least be in those cities once a year? The point is the same: if I ask for information, why should I want my query, by default, to repeat itself rather than just give me the information I ask for?

I think we're in violent agreement that naive SQL is bad, and we're just arguing over which form of naivety is worst!

This is not just a artificial quibble on the artificial example. I gave a different, more general rule that explains why the artificial example is wrong at an even deeper level. I claim that in general both general rules need to be addressed together.

I agree that SQL having been designed abstractly by mathematicians (with whom I happily self-identify) so that it encourages thinking SETs y