Stats

SQL Joins and Subqueries

A discussion about SQL joins and subqueries with information about how to format join statements properly

This is a discussion about SQL joins and subqueries with information about how to format join statements properly.

In any non-trivial task, developers need to join tables together. Joins are important when we want to:

Display/get data from more than 1 table in a result set.

Filter out records based on a particular restriction which requires using a column from another table.

Connect to an intermediary [relationship] table that helps connect two primary [entity] tables which have a many to many relationship. For example, if Student and School are two entity tables, then EnrollmentRecords might be the relationship table connecting the two.

There are certainly variations and extensions of why joins might be needed, but the above 3 cover most of the circumstances.

There is an alternative to using joins in some scenarios – subqueries. In instances where the goal is to filter on a restriction, a subquery can be used. Consider the following 2 SQL statements [using the Adventureworks database]:

-- Use a SubquerySELECT * FROM AdventureWorks.Person.Address
WHERE StateProvinceID IN
(
SELECT StateProvinceID
FROM AdventureWorks.Person.StateProvince
WHERE StateProvinceCode = 'CA'
)
-- Use a JoinSELECT addr.*
FROM AdventureWorks.Person.Address addr
INNERJOIN AdventureWorks.Person.StateProvince state
ON addr.StateProvinceID = state.StateProvinceID
WHERE state.StateProvinceCode = 'CA'

These are both functionally equivalent. Which of these is better? In this trivial example, both are fine. However, when query speed becomes an important factor, joins are generally going to outperform subqueries. I'll explore this issue more in my next post.

Also, the queries, as-is, use select *. This is generally not a good practice – instead, a column list should be provided in the select statement. However, having the select * raises one important difference between the 2 queries above – the columns from the joined table will be available in the result set whereas the columns from the table in the subquery will not. This is why I specified the table alias in the 2nd query [for the select *] to make the results display equivalently.

With the join, notice there are two conditions – one that the StateProvinceID must be equal and one that the state code needs to be 'CA'. Why did I put the StateProvinceID check in the 'on' clause and the state code check in the 'where' clause? I think doing it this way is cleaner. When I join tables, I consider there to be three types of join conditions:

Those that are fundamental to the tables [these would normally be identified by foreign key relationships or, lacking those, identically named columns. These should always be in the 'on' clause.

Those that aren't fundamental to the tables but that are almost always used as filters when joining on a table. For example, on the student enrollment table at Boston Public Schools, we have a field for whether the record is the most recent one and whether the withdrawal code is set for that record. Almost all queries we use check to ensure the record is the most recent one and that the student isn't withdrawn. While these two filters are not fundamental join conditions when joining with enrollment table, they are so frequent that it simply makes sense and is cleaner to have these in the 'on' clause instead of in the 'where' clause.

Those that serve as potentially modifiable filters. For example, a check that the state code is 'CA' or that some test score is > 30 should probably be in the where clause. Theoretically, if this logic is in a stored procedure, there should be an input parameter for the value to check/numerical threshold for filters in this category.

It is also possible to join the tables without any 'on' condition. For example, the following is functionally equivalent to the above two statements:

This is an implicit join with a comma separating the joined tables and all conditions in the where clause. The original join [with an 'on' condition] is an explicit join. Is the implicit join syntax recommended? Functionally and speed-wise, these two are identical. However, for queries with a number of tables being joined, the implicit syntax can become unmaintainable and difficult to debug/extend. I always prefer the explicit join syntax.