NULL-Thing to See Here

I’ve been writing quite a bit of SQL lately, and although I’ve been messing around with SQL for many, many years, there are still things to learn. The most recent lesson involved about 3 hours of staring at a query trying to figure out why it wasn’t working they way it was supposed to. You quickly learn that in SQL you can’t test for NULL using = (or !=) and instead you must use IS NULL and IS NOT NULL. Initially this just seems to be a quirk with SQL, but the true relevance of this hit home in my latest battle with a query.
I’ve been writing code in C for over 20 years, I’ve gotten quite familiar with, and comfortable with the uses of pointers, and in particular with null pointers. In the world of pointers, we assign null for a lot of reason… when we need to indicate that the pointer has no value, we use null, we use it in linked lists to indicate that there is no next link (or no previous link for doubly linked lists), just about all the uses though get interpreted as meaning “no value” or “nothing”. A null string, is not the same as an empty string, instead it is one that you have not created yet, it’s nothing, it indicates something that does not exist.

So, it’s easy to think of NULL in SQL the same way, as nothing. The problem is that in SQL, NULL is NOT nothing, even though the most common uses of NULL in database columns is to indicate no the absence of a value. When SQL was first created there the creators were basing many of it’s features on mathematical set theory. In set theory a null is undefined, or unknown, it is used when you don’t know the value of something, or can’t (yet) know the value of something. Set theory does not allow you to not have a value, the closest it comes to this would be an empty set. As in all mathematics though it is possible to not know the value of something while still knowing that that something does indeed exist. SQL inherits this definition of NULL from mathematics.

This is why we have to write IS NULL and IS NOT NULL instead of = NULL and != NULL. If it were nothing, then it would make sense to be able to compare nothing to nothing, and it would make sense that nothing compared to nothing would be a match. Undefined is quite different. Two unknowns are not equal, or at least we can’t know if they are equal or not since we don’t know their value. In fact it is more likely that two unknowns would be different (once we finally discover their values) than it would for them to be the same.

Okay, so I’ve explained why we can’t compare two NULL values. This isn’t new, anyone working in SQL learns this quickly enough, but they rarely understand why it can’t be done, just what the “workaround” is. Understanding this becomes crucial when you JOIN tables in SQL and the columns that you join on can contain NULLs. This is where I spent more than 3 hours scratching my head in wonder. I joined two tables, or more accurately, I joined a table with a previous version of itself in order to identify rows that existed in one table and not the other based on their data rather than their numeric id columns. I knew the id columns had changed, but that the data had not. The query didn’t work, in fact in some cases it almost looked like the query was producing the opposite result of what it should have. I was convinced that SQL Server 2000 was buggy and I’d somehow managed to confuse the SQL parser.

I was wrong. The tables had to be joined on 12 columns (yeah, I know … it was ugly), and all but one of those could contain NULLs, reality was that most of the columns did contain NULL. The problem is in how a join is specifed in SQL, here is an example:

SELECT foo, bar FROM quux INNER JOIN baz ON quux.code = baz.code

The join is specified by identifying columns that reference each other, and the join is then performed with rows from each table that contain rows with equal values. Ah! there is the problem, when NULLs are involved, nothing is equal to them, not even another NULL. So joins with columns that have nulls produce results that just plain don’t seem to work. It all makes sense when we remember that NULL does not mean nothing, but means unknown/undefined. The solution is quite simple:

Unfortunately, this will impact performance quite significantly for large tables. Assuming there was an index on the code columns of both tables, the index would not be usable, since we are now applying the ISNULL() function, and the index was generated on the column itself, not the result of that column being fed to a function. The result is a full table scan of both tables involved.

The lesson here is that as you design a database (or refactor one) you should put some serious thought into any column that allows the use of NULL. I’d even go so far as to recommend that you try to avoid them and think of other ways to represent the same data. Often the same result can be achieved with a second table that links to a “base” table, and only contains data when a matching row in the base table requires it. When implemented in this manner the absence of data rather than the presence of a NULL is a much better, and clearer indication of nothing than using a NULL. if you do it this way, you won’t run into weird behavior from JOINs when you least expect it.

…and that’s is… NULL-Thing to see here, carry on…

Advertisements

Share this:

Like this:

Related

Responses

[…] and has some actual content, as opposed to the non-content you usually find here at phasorburn. The topic is SQL and the treatment of NULL when used in JOIN situations. SQL is something I’ve mostly shied away from, but I can use it in anger if I […]