Introduction

Welcome to this, my first article in many years. I am intending on doing a series of articles on SQL, one a week. I know I am probably not writing anything that’s not already on the site, but my goal is to provide a series of articles that cover things in an organised way, almost like a book that people can work through. I suspect a lot of developers work with databases, and know a little bit of SQL, but really don’t understand how to really get the most out of it. Hopefully I can help change that.

Background

This first article is going to start with basics. I assume you know how to write a ‘where’ clause, and I’m going to start by talking about the different ways of connecting tables to select data between them. If you didn’t download the code for this article, please do so now. It contains all the SQL you’ll want to run in Management Studio while reading through this.
Download it now. I’ll wait….
OK, back ? The first thing you’ll see is code that looks like this:

Don’t worry if you have no idea what this is ( these are Common Table Expressions, the subject of the second article ). Suffice it to say, this creates three in memory tables, one of the sequence 1-10, one of only odd numbers to 11 and one of only even, to 10.

INNER JOIN

I assume that everyone knows what this does:

select * from listall

It returns the contents of the listall table ( 1-10). Now, what does this do:

select * from listall la innerjoin listodd lo on lo.id = la.id

It returns the odd sequence to 9, because an inner join will return only rowsets where the two tables can be connected. This is useful where you have, for example, an order table with a client id, and a client table where that id gives you all the client’s details. Now, what does this do ?

select * from listeven le innerjoin listodd lo on lo.id = le.id

It returns nothing, because the odd and even rows have no rows in common.

OUTER JOIN

Now, what does this do ?

select * from listall lo leftouterjoin listeven le on lo.id = le.id

It returns all our rows from the table on the LEFT of the select, even if the join does not match. Where the join does not match, a null is returned. This can be useful where you are using the join to work out if something exists, you can use a case statement to set a value according to if a match was found, like so:

As you may be aware, coalesce, like isnull, will return the second value if the first one is null. The difference is that coalesce ONLY does this, and ISNULL can evaluate an expression as the first argument.
One thing to add, the word ‘OUTER’ is optional, you can just specify ‘left join’, ‘right join’ or ‘full join’.

CROSS JOIN

This is probably the one people know the least. What do you expect this to do ?

select * from listeven le crossjoin listodd lo

If you run it, you will get 30 results. There are 5 numbers in listeven and 6 in listodd. 5 times 6 is 30 results. Cross join returns every row in table 1, combined with every row in table 2. This is a very powerful technique, when used correctly. There will likely be an article down the track on cool things you can do with cross joins, and I am certain that cross joins will feature in next week’s article. As you can see, because it just returns EVERY possible combination, there is no column to join on, in fact, if the two tables have columns in common, both values are returned, and you will have two columns with the same names, unless you alias them like this:

select le.id, lo.id as id2 from listeven le crossjoin listodd lo

You are not obliged to return ALL the data cross joined, you can specify columns and alias them, just like any other select.
A short hand for a cross join, is a comma, as in:

select * from listeven, listodd

However, while you may see this used in online examples, it has not been 'valid' SQL for over 20 years, and no database implementation is required to support it, so you should not use it in your own code.

INTERSECT

Although they are not really joins, I am going to cover two other things here. The first is ‘intersect’. It works like this:

select * from listall
intersectselect * from listodd

In this case, listodd has one value not in listall ( 11), so the result is the contents of listodd, minus the 11. Sometimes you will have two result sets, and you will want to know what values are in common between them. INTERSECT allows you to do this. Of course, this can be done by intersecting two select statements that contain as many joins and other operations as you wish. Select * is acceptable syntax, but it’s required that both tables have the same column names and data types. Be careful, sometimes you will only care about an intersection of one column ( usually a primary key ), in which case, it probably makes sense to have CTEs that return the primary key values only, and one that returns all the data, so you can end up with code like this ( this one is not in the file, as it won’t run )

select * from myClientList where clientid in ( select clientid from list1 intersectselect clientid from list2)

This would of course be at the bottom of statements defining three possibly complex CTEs. Hopefully you can see how CTEs increase readability and maintainability of SQL code.

EXCEPT

I recently had the task of transforming a large data set from one database/format to another. We had a ton of complex business rules, so I ended up with a view that grabbed all the data from the old DB, a lot of views to process different data types, then a view at the top to aggregate all those values, and turn them in to one list again. It was very complex, but at the end, we wanted to make sure we had not missed any records. This is exactly the sort of situation where the ‘EXCEPT’ operation is invaluable.

select * from listall
exceptselect * from listodd

This returns everything in the first table that is not in the second ( but NOT vice versa ). So, while the number ‘11’ is in listodd and not in listall, it is not returned by this operation. In my case, this returned all of the documents in my base list, that were not already in the top level list, and which had therefore not fulfilled any business rules and would otherwise have been lost. Given the nature of our data transformations, my query was a little more complex ( it was a union of three selects ), and I had to do an EXCEPT on my primary key only, then select all my data on the basis of that operation. If ANY row differs, the data will be returned as being different, so if you want to find only what records are not there ( regardless of if they have been transformed in any way ), then an operation on just the key is what you need.

UNION and UNION ALL

I nearly forgot to cover these. What if, given our data set, we wanted a sequence from 1 to 11 ? This will do that:

select * from listall
unionselect * from listodd
orderby id

It returns the combination of values from both tables, removing duplicates. If you want to see all values, including duplicates, do this:

select * from listall
unionallselect * from listodd
orderby id

I have seen some truly awful SQL code based on unions. Remember, union means, run these separate SQL queries, then merge them. If you can instead create one query, it’s bound to be faster ( but UNION itself is VERY fast, what I mean is, if you run two queries that are complex, you pay the cost of both queries, instead of just running one ). However, UNION was invaluable in the task I was describing earlier, because I had a base view, and different views for different business rules, which then needed to be pulled together again. It WAS slow, but it was the best possible tool for the task at hand.

Several people suggested I add a visual representation of the different join types. There is already an excellent article on CP that does this, you can see it here. I see no reason to duplicate it, when it is better for people seeking more information to read that article, as well as mine. I will add, the 'excluding joins' the author shows are perfectly valid, for the purposes of the techniques I've shown, I'd use the 'EXCEPT' keyword to get the same result. I am not sure which is faster.

As you can see, there’s a lot more to SQL than just the select statement and inner joins. Next week I’ll set out to explain how to get the most out of CTEs. If you have any thing you’d like me to specifically cover, or even any general SQL questions, please ask. I promise I’ll answer every question, even if I sometimes have to say that I’m not sure ( but I’ll try to find out ).

Share

About the Author

Programming computers ( self taught ) since about 1984 when I bought my first Apple ][. Was working on a GUI library to interface Win32 to Python, and writing graphics filters in my spare time, and then building n-tiered apps using asp, atl and asp.net in my job at Dytech. After 4 years there, I've started working from home, at first for Code Project and now for a vet telemedicine company. I owned part of a company that sells client education software in the vet market, but we sold that and I worked for the owners for five years before leaving to get away from the travel, and spend more time with my family. I now work for a company here in Hobart, doing all sorts of Microsoft based stuff in C++ and C#, with a lot of T-SQL in the mix.

It's never too late I am still writing these and keeping an eye on them, and I welcome all questions.

You're right, there are always many ways to do the same thing in SQL.

I think that using EXCEPT is more readable than a left join. It's immediately apparent what you want to do, and I always like to say, getting code that a computer understands is easy, but we should work hard to write codes that humans can easily understand. The same is true for INTERSECT. I doubt there are any performance gains to be found, so I'd say the main reasons to know about these things are readability, and ability to do well in job interviews

It returns all our rows from the table on the LEFT of the select, even if the join does not match.

should actually say that it joins all rows from the "LEFT of the JOIN". I have seen people have issues with left vs right joins because they thought they applied to the joining convention. example [ la.numbers = lo.numbers vs lo.numbers = la.numbers ] so I have made a point to always refer to it as being on the left or right of the join keyword itself.

Yes, good point. I was thinking of ugly code I've seen that UNIONS on complex selects ( four times ) and which I was able to break down to one select and which sped things up. I'll update the article to make that clear, that the cost of UNION is the cost of the separate selects.

I've found it's one of those personal preference things - presumably from some of us old farts that were writing DBMS code before the inner join was invented I've noticed it especially where SQL is being built on the fly in code.As I said, it was a pedantic comment, but you do still come across SQL like that, as you have found

It's really quite common still. And if you think about it there's no wonder either.ANSI joins and implicit joins isn't handled the same way by the optimizer. With an ANSI join you're telling the optimizer which Tables to join, and by using parantheses also in which order. With an implicit join you're telling the optimizer which FIELDS to join. The difference might seem to be subtle, but it can sometimes make a big difference in the execution plan.

The more interesting difference occurs when you add a filter predicate to the query. Compare

SELECT *
FROM A,B,C
WHERE A.X = B.X
AND B.Y = C.Y
AND A.Z = 1

TO

SELECT *
FROM A JOIN B ON A.X = B.X
JOIN C ON B.Y = C.Y
WHERE A.Z = 1

When you have an ANSI JOIN the filtering clause A.Z = 1 would always be executed on a potentially very large resultset after all the joining is done.In the case of an implicit join this filtering clause could be executed together with the first join and therefore letting the server work with a much smaller resultset for the second join. Effectively the same as

SELECT *
FROM (A JOIN B ON A.X = B.X AND A.Z = 1)
JOIN C ON B.Y = C.Y

But watch out, adding filter predicates to the join can give you unexpected results when you're doing outer joins.

SELECT *
FROM A LEFTOUTERJOIN B
ON A.X = B.X
AND B.Z = 1

is not the same as

SELECT *
FROM A LEFTOUTERJOIN B
ON A.X = B.X
WHERE B.Z = 1

That said, the optimizers get smarter for every version and my understanding of how things work might not be true anymore, or in the next version, or between different databases.

You are not obliged to return ALL the data cross joined, you can specify columns and alias them, just like any other select. A short hand for a cross join, is a comma, as in:

select * from listeven, listodd

I think this shouldn't be "advertised" as a shortcut - it's a legacy notation that has been removed from the official ANSI/ISO SQL Standard with the SQL-92 version (more than 20 years ago) and should be avoided if ever possible.

=============================Marc Scheuner, Berne, Switzerlandmscheuner - at - gmail.comMay The Source Be With You!

I'd like to see information on which parts are standard SQL and which parts are T-SQL-specific.

And regarding UNION ALL -- I usually use UNION ALL when I know that there are no duplicates. I have no numbers to back it up, but I feel that it saves the system from wasting its time looking for duplicates.

You're right about that, but there's more to it. When using a UNION the database needs to Create two resultsets, store them in memory and make a DISTINCT on the values. Potentially using a huge amount of memory if the resultsets are large.In the case of UNION ALL there is the possibility to PIPE the first resultset to output as the records are created and just continue to PIPE the second resultset to output when the first one is finished. Potentially using very little memory even when the resultsets are huge.

Now I'm fully aware that PIPELINING is Oracle specific, but it would surprise me if MSSQL doesn't have the same functionality. (Compare it with YIELD in C#)

You can use standard inserts. The point is, in this case, it's FAR faster to do it without actually creating a table. Tally tables are widely used, and will get their own article. The 'classic' way of doing it is to create a temp table, and it's significantly slower than the method I showed. I've done an article on CTEs, although I didn't talk about using them for inserts, because I'd typically use them to segment out parts of a complex select query.

But using either of these techniques all you to create and fill the tables once and run multiple queries against them. which allows you then to run one or more queries so that you can see the difference between them.

Look forward to reading the rest in the series so I can see if I can learn from them.

Thanks for the comments. Its true with the CTEs, you can't run more than one query, at once. But I thought it was a pretty neat way to get three data sets that showed the concepts clearly. I am intending on using the AdventureWorks database to get access to more complicated data sets.

Christian Graus

Driven to the arms of OSX by Vista.

Read my blog to find out how I've worked around bugs in Microsoft tools and frameworks.

Hi - thanks for the feedback. I felt I covered an intro and background in my first paragraph, although I didn't have headers to that effect ?

I also thought the site had tagged and formatted my SQL - was that just the formatting I got from SQL Server ? What is the benefit of the pre tags ?

I won't add the results of the queries because they can be long and they are a waste of space. Anyone wanting to read the article, should have SQL Server open. My intention ( although it's not really easy to do in this example ) is for people to run my SQL, then change it to see what happens, and thus learn more than if they just read the article.

Yes, I did add a link to another article this morning that has, I think, a very nice graphical representation of the different join types. But, I agree, I should also link to some MSDN articles on the topics in question, not least because they will give comprehensive documentation on syntax, etc.

Christian Graus

Driven to the arms of OSX by Vista.

Read my blog to find out how I've worked around bugs in Microsoft tools and frameworks.

Yeah, agreed on the pre tags, and I did it. And I see if I added the results, they'd be compressible. I still feel like I'd rather people generate the result in SQL Server for themselves, so they get the most out of the articles.