SQL EXISTS Exploration (Thanks Christoph Schmitz!)

This week has been a very good week for me in terms of learning some great SQL information. My newest nugget of gold is the SQL EXISTS clause. I recently came across a situation where using an INNER JOIN blew away the performance of a SQL IN clause; on that blog post, Christoph Schmitz suggested that I take a look at the SQL EXISTS clause stating that it was generally faster than the IN clause and that NOT EXISTS can be insanely fast when compared to a NOT IN clause.

Some of Christoph's advice:

Given proper indexes, EXISTS is much faster than IN. And, considering that NOT IN is even slower than IN, NOT EXISTS can really bring dramatic performance improvements.

You still have to keep in mind that different database engines handle things differently. For example, with MSSQL (2000) it is generally better to use SELECT * in the subquery (instead of SELECT fieldname), because the query optimizer will choose the best index or combination of indexes.

The rule I use is: if I expect the sub-select to return more than 12 records 20% or more of the time, use EXISTS. The speed gain for IN on small lists is not as dramatic as the speed loss for EXISTS on large lists.

More importantly, the difference between NOT IN and NOT EXISTS can be as much as 20:1 on large sub-selects, as opposed to IN and EXISTS, where I have rarely seen a difference of more than 3:1. As I understand it, this is because NOT EXISTS can use optimized join algorithms to locate matching rows, whereas NOT IN must compare each row against every possible matching value in the subselect.

It also makes a difference whether or not the referenced field(s) in the subselect is indexed. EXISTS will often use an index to compare the values in the master query to the sub-query. As far as I know, IN can use an index to retrieve the subquery values, but not to sort or compare them after they have been retreived into memory.

.... Incidentally, the dramatic differences between IN and EXISTS are not only a "PostgreSQL Thing". The same rules apply to MS SQL Server and SQL Anywhere, for the same reasons.

From Chris' and Josh's advice, it would seem that I should err on the side of SQL EXISTS most of the time. Of course, I am not excluding all uses of the SQL IN clause - it has its time and place and even ColdFusion's CFQueryParam tag has a List attribute to make the SQL IN clause easier to use.

Anyway, I thought I would build upon my UPDATE and DELETE posts to explore the EXISTS statement. As before, I will be building in-memory tables to test with. Since I don't want to repeat this code for the different examples, I am using ColdFusion's CFSaveContent tag to store the SQL table build scripts and then just including them into the subsequent examples:

<!---

Save SQL that builds temporary in-memory tables so

that we can use them in more than one example without

taking up too much room.

--->

<cfsavecontent variable="strSQLBuild">

<!--- Declare in-memory data tables. --->

DECLARE

@boy TABLE

(

id INT,

name VARCHAR( 30 )

)

;

DECLARE

@girl TABLE

(

id INT,

name VARCHAR( 30 )

)

;

DECLARE

@relationship TABLE

(

boy_id INT,

girl_id INT,

date_started DATETIME,

date_ended DATETIME

)

;

<!--- Populate the boy table with some information. --->

INSERT INTO @boy

(

id,

name

)(

SELECT 1, 'Ben' UNION ALL

SELECT 2, 'Arnold' UNION ALL

SELECT 3, 'Vincent'

);

<!--- Populate the girl table with some information. --->

INSERT INTO @girl

(

id,

name

)(

SELECT 1, 'Maria Bello' UNION ALL

SELECT 2, 'Christina Cox' UNION ALL

SELECT 3, 'Winona Ryder'

);

<!--- Populate the relationship table. --->

INSERT INTO @relationship

(

boy_id,

girl_id,

date_started,

date_ended

)(

SELECT 1, 1, '2007/01/01', NULL UNION ALL

SELECT 1, 3, '2004/09/15', '2005/06/15' UNION ALL

SELECT 2, 1, '2006/05/14', '2006/05/23' UNION ALL

SELECT 3, 2, '2007/02/14', NULL

);

</cfsavecontent>

Now that we have that taken care of, I am going to show you an old-school example that uses the SQL IN clause. Here, we are going to select all boys that are currently in an open relationship (not in the swinging sense, but rather as in the relationship has not yet ended):

<!--- Query for boys using SQL EXISTS construct. --->

<cfquery name="qINTest" datasource="#REQUEST.DSN.Source#">

<!--- Include SQL build statements. --->

#PreserveSingleQuotes( strSQLBuild )#

<!---

Select all the boys who are CURRENTLY in relationships.

This will denoted by records in the relationship table

that doesn't have an END date.

--->

SELECT

b.id,

b.name

FROM

@boy b

WHERE

b.id IN

(

SELECT

r.boy_id

FROM

@relationship r

<!---

Notice that our sub query can reference

row/column-specific values in the parent

query (b.id).

--->

WHERE

r.boy_id = b.id

AND

r.date_ended IS NULL

)

;

</cfquery>

<!--- Dump out the returned record set. --->

<cfdump

var="#qINTest#"

label="SQL IN Test: @boy"

/>

Notice that the subquery in the SQL IN clause can make row-specific references to the data in the parent query. Running the above code, we get the following CFDump output:

As you can see, Ben is returned since he is still dating Maria Bello (lucky bastard!) and Vincent is returned as he is still dating Christina Cox.

Ok, now, I am going to swap out the SQL IN clause with the SQL EXISTS clause. Since the subquery of an EXISTS clause can also reference row-specific data in the parent query, almost nothing has to change:

<!--- Query for boys using SQL EXISTS construct. --->

<cfquery name="qExistsTest" datasource="#REQUEST.DSN.Source#">

<!--- Include SQL build statements. --->

#PreserveSingleQuotes( strSQLBuild )#

<!---

Select all the boys who are CURRENTLY in relationships.

This will denoted by records in the relationship table

that doesn't have an END date.

--->

SELECT

b.id,

b.name

FROM

@boy b

WHERE

EXISTS

(

SELECT

*

FROM

@relationship r

<!---

Notice that our sub query can reference

row/column-specific values in the parent

query (b.id).

--->

WHERE

r.boy_id = b.id

AND

r.date_ended IS NULL

)

;

</cfquery>

<!--- Dump out the returned record set. --->

<cfdump

var="#qExistsTest#"

label="SQL EXISTS Test: @boy"

/>

Running the above code, we get the following CFDump output:

While my in-memory data tables are very small and don't have any indexing or good stuff like that, this should be a better way to go (or at least a better general plan of attack). I have looked at the SQL EXISTS clause before and something about it always confused me; I am happy to finally have my head wrapped at least mostly around it at this point. I am sure other will have more tips on when to use and when to not use it, but this is another good step on my way to SQL mastery.

Yeah, I like that idea. The more info that can be gotten across to the programmer, the better. Actually, I was just reading over on The SQL Authority where Pinal Dave was talking about the same sort of thing, but in his case, I think it was in regards to using COUNT():

"with MSSQL (2000) it is generally better to use SELECT * in the subquery (instead of SELECT fieldname), because the query optimizer will choose the best index or combination of indexes"

SELECT 1 might be the same as SELECT myFirstColumn.

I've just changed a query that did NOT IN to a NOT EXISTS on a large table that is forever growing. No visible change to the speed of the page load but I'm hoping it's made some difference as the server is up and down at the moment and this page is the prime suspect :)

Hmmm, interesting. You could be right. However, as a rule, my tables usually have an autoincrementing ID column as the first column, which should be index (I think). But, that is just a convention. I guess using * might be the best way to go.