My Links

Advertisement

News

Welcome to my weblog. My name is Jeff Smith, I am software developer in Boston, MA and I was recently named a 2009 SQL Server MVP. Check in frequently for tips, tricks, commentary and ideas on SQL Server and .NET programming.

A quick follow up to the "Nested WHERE-IN" anti-pattern post from yesterday ... If you didn't get a chance, be sure to read the comments from that post as well, there are some great points in there so far.

Going back to Northwind and our example of returning all customers that have ordered productID 1, let's assume that we didn't check the schema carefully and we did not realize that the CustomerID column is not in the Order Details table. So, we might try to write our SELECT like this:

Now, try running that ... it works! Well, sort of ... it runs, returns results, generates no errors -- and it produces the wrong results! All customers are returned, not just those that ordered productID 1. This is an issue with correlated sub-queries in general, in that if a column is not prefixed with a table name, and it doesn't exist in the table(s) in the sub-query, SQL will assume it is a reference to a column in the parent query. In other words, the above SQL is checking to see if customerID is equal to itself, so it returns all of the results. This is completely logical and makes perfect sense from a processing perspective (I would not call it a bug), but it can lead to confusion when programmers are not aware of why and how this can happen.

Which is what we expect -- an error message telling us that we are trying to use a column that does not exist in the table.

Perhaps this is a minor issue, but it does happen all the time, and it can be maddening to troubleshoot .. especially for beginners just learning the language. As mentioned, this is a general issue with correlated sub-queries in general, and part of the reason why I prefer to use derived tables.

When using derived tables, each sub-SELECT is fully self-contained and can contain no references to anything outside of it (except for parameters or variables); you can cut and paste that SELECT into a separate Query Analyzer window, run it, tweak it, test it, and so on, and when you are done, you can paste it right back in to your original SELECT confident that it works and does its job. To me, this is a huge benefit and a really great way to break down complicated SQL statements into small, more manageable parts. (more on that here.) I feel that learning to do this is crucial for a good SQL programmer, and using WHERE-IN and correlated subqueries as a first instinct instead doesn't cultivate a clean "break the problem down into smaller parts" mindset, but instead leads to a thought process of "keep adding things to the SELECT until it works."

In general, just remember that while WHERE-IN clauses do have their place, learning to think in terms of JOINS and derived tables is a really important skill to acquire when working with SQL. Sometimes, beginners find the "Nested WHERE-IN" technique to be most natural, but if instead you focus on learning JOINS and relations and derived tables, you will be a much better SQL programmer in the long run.

You do not want DISTINCT on all columns in the customer table; you just want to return 1 row per customer. Thus, the distinct (or GROUP BY) should be done in the derived. Examine both the execution plan and the performance and you will see that using a derived table is more efficient.

As for your second question: if you just want to return customerID, there is no need to include the Customer table -- that is unnecessary and adds overhead to your SELECT. If you just want to return a distinct list of CustomerID's, you just write it the same way I did in my derived query:

That's kind of the entire point of this post and the last; you know you want to find distinct CustomerID's for those that ordered productID 1, so you write the SQL for that. Then, you know you want to return the customer columns such as Name and so on, so you join those results to the Customer table. And there's your results.