NULL - The database's black hole

I once read a scientific article on black holes that started of on the observation that, since a hole is the absence of anything and black is invisible in space, a black hole is in fact an invisible nothing – so what the heck are we talking about? Well, almost the same can be said about NULL in databases.

Let’s first take a look at what NULL is supposed to be. Here is the definition of NULL from the SQL-2003 standard:

null value: A special value that is used to indicate the absence of any data value.

This definition differs significantly from some common misconceptions about NULL:

·NULL is not the same as the numeric value 0, even though they are pronounced the same in some languages (like Dutch or German).

·NULL is also not the same as the string value ‘’ (also known as the empty string), even though many Oracle developers would like to believe so. However, they can’t be blamed for this as this is completely Larry Ellison’s fault.

·NULL is definitely not the same as either of the date values January 1st 1753, January 1st 1900, or December 31st 9999, even though there might be valid (performance related) reasons to use either of those values as a magic value instead of NULL in a specific situation.

·Moving to the more controversial stuff, NULL does not mean “not applicable”. Of course, a NULL in a table is often a result of the attribute not being applicable for a specific occurrence of the entity stored in the table (e.g., a column “birthday” in a customer table that stores details of businesses as well as humans) – but in other columns and other tables, or even in another row of the same table, there might be a completely different reason for the data being missing (absent)!

·And saving the best (read: most controversial) for the last, NULL is also definitely not meant to signify “unknown”. Again, a NULL in a table might result from the value being unknown at data entry time (e.g., when we forget to ask a customer for his or her birthday), but there might be other reasons as well. Unfortunately, many text books insist on explaining the behaviour of NULL in expressions by describing NULL as unknown, rather than missing or absent, causing this misconception about NULL to be the most widespread and the hardest to combat.

Allow me to dwell some more on this whole unknown/not applicable thing, as it’s able to confuse even highly-appraised scientists – as is aptly demonstrated in “Much Ado About Nothing”, an exchange between Dr. E.F. Codd and C.J. Date about NULL, in which Date gets Codd to agree that there should be more than one kind of NULL. The fact that both Codd and Date apparently missed, is that the examples used in their debate failed to meet the basic rules of normalization! Moving back to the birthday example to illustrate this, it is true that NULL in this column can have different causes – but that doesn’t change the meaning of this NULL, which is limited to “the birthday for this particular customer is not in the database”. If the business doesn’t care why a birthday is missing, then the fact that there might be different causes doesn’t have any consequences on the database. If, on the other hand, the business does case about the reason that a birthday is missing, then this reason should of course be modeled and stored in the database – but not in the same column as the birthday! “Birthday” is one attribute and “Reason birthday is missing” is a different attribute – heck, they even have completely disjunctive domains! The fact that these attributes are mutually exclusive doesn’t warrant violating first normal form by stuffing these two attributes in a single column! In this case, a proper design for a SQL Server table would look like this:

CREATE TABLE Customers

(CustomerID int NOT NULL,

Birthday datetime NULL,

ReasonNoBirthday int NULL,

-- Other columns,

CONSTRAINT PK_Customers PRIMARY KEY (CustomerID),

--Time part for birthday has to be midnight

CONSTRAINT CK_Birthday CHECK

(CONVERT(char(8),Birthday,108)='00:00:00'),

-- Birthday mutually exclusive with ReasonNoBirthday

CONSTRAINT CK_ReasonNoBirthday CHECK

((Birthday IS NULL AND ReasonNoBirthday IS NOT NULL)

OR (Birthday IS NOT NULL AND ReasonNoBirthday IS NULL))

);

Things get more complex when nullable columns are used in expressions and predicates. In a procedural language, this wouldn’t have been a problem – if a procedural program fails to find the information it needs, it enters a conditional branch to handle this situation, as defined by the programmer. In a declarative, set-based language such as SQL,this was not possible. The alternatives were either to have the SQL developer add conditional expressions for each nullable column in a query to handle missing data, or to define a decent default behavior in SQL for missing data so that developers only have to write explicit conditional expressions if they need to override the default behavior.

The default NULL handling in expressions is very much based on how humans would handle similar situations. For instance, stop and think a moment what you would reply if I asked you to calculate my age, and to mimic database behavior, I’d also constrain your answer to be within the domain of integer values. You’d probably first ask me for my birthday. But if I refuse to specify my birthday, you would be unable to answer the question – so you wouldn’t answer it at all. And this is exactly what a database does – if any value to be used in an expression is missing (NULL) , there won’t be any result of the expression; in other words, the result is missing (NULL) as well. This is known as the rule of NULL propagation: any expression in SQL returns NULL if any of its input arguments is NULL (with the notable exception of a few functions that are specifically conceived for NULL handling). And that is a second similarity between NULL and a black hole: the gravity of a black hole pulls in everything that gets too close, causing it to “disappear” in the hole, and the rule of NULL propagation causes any expression that has a NULL in it to have no result, as if the other input values disappear as well.

Just to confuse matters more, default handling of NULL in a predicate is different. There is a valid reason for this, though. First of all, expressions are supposed to return a value that adheres to the rule of a datatype, but “Hey dude, I can’t answer that” is obviously not a valid value in any numeric, date, time, or datetime domain. It would of course be valid in a domain for character strings, but you can’t just overload what might already be a valid “regular” value with a special meaning. NULL however is valid, in any domain. So there really was no other choice but to return NULL if input data for an expression is missing. The reason that predicates can’t be handled the same way, is that a predicate is typically used in a WHERE expression, so there has to be a result for each row – even if some input data is missing, the DBMS still has to decide whether or not to include the row in the output. This problem was also solved by mimicking what humans would do. Suppose I’d ask you to tell me whether I am older than thirty-five. This time, I won’t restrict your answer to any domain, but I do force you to give an answer. And of course, I still refuse to disclose my date of birth. So, since you obviously can’t say “yes” or “no” without having a 50% chance of being wrong, what will your answer be? Probably some more or less polite variation of “bugger off, dude, how am I supposed to know that if you don’t disclose your birthday?” Translated back to database terms, you answer would not be true or false, but unknown. Not unknown because some input data is unknown (as some text books write), but unknown because some input data is missing.

This choice solves the problem, since we are now able to evaluate each predicate in a WHERE clause for each row, with the result being either true (row is included), false (row is omitted), or unknown (in which case the row is omitted as well). In a future post, I will cover how this affects logical expressions that involve AND, OR, or NOT to modify predicates, and how unknown is sometimes treated the same as false, yet other times treated the same as true, depending on the context of the predicate.

Comment Notification

Comments

I find this a very interesting discussion of NULL. Well done. I especially like pointing out the difference between "Birthday" and "Why Birthday is missing" as being separate attributes. That being said, I'm not sure I agree. I'm still chewing on this, but I love angle you've taken (catchy title too :) ).

In probability theory, the birthday paradox states that in a group of 23 (or more) randomly chosen people, there is more than 50% probability that some pair of them will have the same birthday. For 57 or more people, the probability is more than 99%

Actually the chewing is more overall - I would agree with the separation of attributes. After chewing on it (resulting in very little sleep and dreams about playing poker with Codd, Date and Larry Ellison - we ended up throwing Ellison out of the game), I believe I have my finger on what gave me pause. It results in me babbling significantly to explain, so I'm going to respond fully in separate post.

I still really, really your discussion of things. You took theory and moved it to real world very nicely.

A functional dependecy x -> y over r must satisfy that x = y forall tuples in r - not the case if the table contains nulls. (For example see "Theory of R. DB." (Maier) p42 for the formal definitions and some important consequences).

Normalization of optional attributes can be done by decomposition. Fairly obviously, the Birthday and ReasonNoBirthday attributes belong in separate tables since they refer to different entities - "Customer with birthday" and "Customer without birthday".

Oops! Obviously the second sentence I wrote was rubbish! The requirement is that an FD satisfies the predicate T1{x} = T1{y} implies T2{x} = T2{y} for every pair of tuples T1 and T2. Therefore, no such FD is satisfied if either x or y are null.

Thanks for your comments. However, I don't think I can agree. I had to do some googling in order to find formal definitions for 2NF, since I am used to working with a modeling method that skips normalisation and takes the model straight into 5NF. I found some contradicting ways to formulate the requirements for 2NF, but no real disagreements.

I'll go with the definition on Wikipedia: "a 1NF table is in 2NF if and only if, given any candidate key and any attribute that is not a constituent of a candidate key, the non-key attribute depends upon the whole of the candidate key rather than just a part of it".

I did not google for a definition of functional dependency, since I know that Y is functionally dependent on X if for any given value of X, there is at most one value of Y.

By this definition of functional dependency, both Birthday and ReasonNoBirthday are definitely functional dependent on CustomerID, since for each CustomerID, there will be either exacly one Birthday or none at all (NULL), and exactly one ReasonNoBirthday or none at all (NULL).

The Customers table has one candidate key: CustomerID. It has two non-key columns: Birthday and ReasonNoBirthday. Both are functionally dependent on the candidate key, ergo the table is in second normal form.

Unless I am really being led astray as to the 2NF definition or as to the definition of functional dependency, but I'm sure you'll tell me if I am!

I think you are making excellent points, I loved your post! Just wanted to note that I think that Birthday is functionally dependent on both CustomerID and ReasonNoBirthday . If ReasonNoBirthday is not null, then Birthday must be NULL. In some cases ReasonNoBirthday is also functionally dependent on both CustomerID and Birthday - it is when you must specify ReasonNoBirthday when you omit Birthday. However, I think in some cases you can omit both Birthday and ReasonNoBirthday - it depends on your business rules. What do you think?

Birthday would be functionally dependent on both CustomerID and ReasonNoBirthday if you have to specify both before you can tell for sure what (if any) the birthday is. That is clearly not the case - given a CustomerID, I can tell you without a trace of a doubt (assuming I trust the contents of my DB) whether a birthday is stored for this customer and what it is.

An example of functional dependency on more than one attribute would be the AmountOrdered being dependant on the combination of OrderID and OrderLineNo. If you only know the OrderID, you can't tell me the AmountOrdered (unless you're lucky and it's a one-line order), and neither can you tell me the AmountOrdered if I supply OrderLineNo but not OrderID. You really need both before you can tell me the AmountOrdered.

Sure, there is a relationship between Birthday and ReasonNoBirthday (as captured in the CHECK constraint), but it's not a relationship of functional dependancy.

The quote "any expression in SQL returns NULL if any of its input arguments is NULL (with the notable exception of a few functions that are specifically conceived for NULL handling)." appears a bit misleading? AFAIK sum, min, max, and other statistical functions were Not "specifically conceived for NULL handling": Consider SELECT

You make a fair point. I admit that I forgot to include aggregate functions in my post.

Since aggregate functions are quite simple, I won't write a seperate post on them. Their treatment of NULLs is very simple - they just dsiregard them. So if you have a table with three rows, with values 10, 90, and NULL in an integer column, then the results of MIN(column), MAX(column), SUM(column), AVG(column), and COUNT(column) will be 10, 90, 100, 50, and 2 - just as if the row with the NULL value doesn't exist at all.

If the elimitanion of NULL values results in no rows being left, COUNT(..) will return 0 and all other aggregate functions will return NULL (just as when they are run on an empty table).

Returning to the Dates of Birth - I've taken to seperating Day, Month and Year into seperate fields in SQL Server instead of treating Date Of Birth as a Date. Why ? My analysis has shown that in most systems containing public or customer data that a higher than expected proportion of people are being recorded as being born on the 1st of any specific month (5-10% more than statistically probable, depending on the type of system). Unless its absolutely required otherwise I've got into the habit of making this three nullable fields - this means that when someone simply gives their age, (or possibly guesses at their own date of birth - common for those that have no cerain birth record) we can record without forced inaccuracy, exactly the information given. A simple calculated field will give an implied date of birth if neccessary, but it prevents the operator being forced to enter inaccurate data just because the data type demands it.

Leave a Comment

About Hugo Kornelis

Hugo is co-founder and R&D lead of perFact BV, a Dutch company that strives to improve analysis methods and to develop computer-aided tools that will generate completely functional applications from the analysis deliverable. The chosen platform for this development is SQL Server. In his spare time, Hugo likes to visit the SQL Server newsgroups, in order to share and enhance his knowledge of SQL Server.