Alf Pedersen On NULL Value Usage In Database Design

As you know, I am not a huge fan of NULL values in databases. There are times, when I do like them (such as with undefined date/time fields), but for the most part, I think they are horrible, cause confusion, and create much more overhead than they are worth from any sort of business requirements standpoint. I have gotten a lot of heat for this, but every now and then, someone like Alf Pedersen reaches out to me to lend some moral support.

Alf Pedersen, who has spent over 30 years in the computer and database world, has written some really good stuff on database design. But, more specifically related to my feelings, he has a well written article on the downside of allowing NULL values in your database schema. If you don't agree with me, maybe his article, NULL values in a database: A programmer's nightmare, will change your mind.

Thanks Alf for your insight and your moral support :)

If I could make one note about his article, it would be about the COUNT() and SUM() stuff. In his examples, he is demonstrating that NULL values in the database return "unexpected" results (for those not 100% familiar with the database schema). While this is true, the fact that COUNT() and SUM() (and other aggregates) do not take into account NULL values can certainly be leveraged to our advantage. For instance, you can use CASE statements inside of your aggregates to conditionally exclude row values by returning NULL. This is especially powerful when you need to run several different types of conditional aggregates over a single group.

But, of course, this type of logic has nothing to do with the actual value in the database row and can still be done for a database column that doesn't allow NULL values.

Reader Comments

Ben,I have posted before in disagreement over the concept of NULL in the DB. I know everyone has their own viewpoints, and I am not really posting to argue good vs. bad. If we all thought the same way, the world would be a pretty boring place!but...I cannot let some of the things mentioned in that article stand without commenting.He is doing a lot of handwaiving to get his point across.

For example:he uses "WHERE EXISTS (SELECT COUNT(AVAILABLE)..." as an example of why you can't trust NULL... The fact is that aggregate functions always return a value, so the above example is something that would never fail, becaus there will always exist a returned value from that selection.Someone with very little database experience could potentially read his article and incorrectly believe that there is something wrong with the implementation of count and other aggregates.

He further picks on using SUM on a column with NULLs. Since the definition of these aggregates is that they perform their calculations on non-NULL records, then in english, the sum functions purpose is to "return the summation of all known values". The functions are not broken, they are doing exactly what they are defined to do.

Going back up a bit in his article, he mentions the programmer having to use NVL() as a way to define a default value for the NULLs for compairison like "NVL(available, 'X') <> 'Y'".

In english, this is "where the value of AVAILABLE is defined and it is not equal to 'Y' OR the value of AVAILABLE is not defined and 'X' is not equal to 'Y'".As he states, this requires the programmer to guess at a default.

The proper thing to do here is "available <> 'Y' OR available IS NOT NULL"

In english, this is "where the value of AVAILABLE is defined and not equal to 'Y' or the value of AVAILABLE is not defined"This makes a bit more sense, and it does not require the programmer to make any assumptions about the NULLs.

</rant>

Ok, I feel better! LOLWow... this was definatley a rant! I hope I was able to get my point across without sounding like a loon!

Thanks for the link, Benihana! Good to see a DB expert with a sense of humor for a change. :)

One system I worked on used NULL values to indicate if a given value had been provided yet. The public-facing application was basically a wizard with many many forms. Users had the option of saving their work and coming back days or weeks later to continue filling out forms.

NULL values were understood by convention to represent data that had not yet been captured. Providing a default value for a true/false field would be misleading: the field has NO value, since the user has not yet submitted that data.

There were 10-15 developers on this project, and we found this convention very easy to work with. I can't recall a single issue arising from this approach in 2.5 years.

Please never ever feel bad about expressing an opinion for or against anything that I ever write or link to. This whole blog is a learning experience, especially for me, so the more opinions I get, the more of an educated decision I can make in the things that I do; so, thank you for your notes.

And, while I agree with you on what you are saying about Aggregates, I think part of the point that he might be trying to get across is that some things shouldn't be "unknown" and that allowing some of the data to be unknown can return unexpected results unless the SQL developer users some workarounds. For example, the account/balance scenario where one of the balances is unknown. This should really NEVER happen. That's like saying, "knowing whether or not you are alive" is optional. The reason d'etre for an account is to track a balance and if no balance is unknown, the data is corrupt. One way to enforce this would be to not allow NULLs.

@Dave,

I think the whole questionnaire scenario in which we are collection multi-state data in whole or in part is probably the best use-case for NULL values. The reason is, NULL values have a business meaning in the context of the problem. NULL means "answer not yet collected". This is a business requirement and a very valid reason to require NULL.

I think there are other use cases for NULL, but again, on the whole, NULLs do not fulfill a business requirement, only a theoretical DB design purpose.

Ben,I told you that you'd put the heat on me! :-) But you really do have a nice comment to Ken, and I will join you there. I really like the style of your blog.

Now, let's get straight on with it:

Ken, I may do a lot of arm-waiving to get my point across. Never did see it that way, but you may be right.

If so, it is because I have tried to explain a complex matter in a popularized and (yes, Dave) a bit humorous text, all in order to reach as many DB people as possible: Not all are deeply educated (and sometimes it shows, unfortunally).

The reason for showing the trouble with some aggregate functions was precisely because they disclose so clearly how the use of NULLs can generate big trouble.

Especially, the SUM example is very relevant, because, you can't get away from the fact that:

1.000 + something unknown = something unknown

and in one of my rows there is a NULL...

Anyone trying to tell me something different will not have access to my ears.

And the COUNT function goes the same way, with special emphasis on Chris Date's comment near the bottom of this, precicely on COUNT:

@Ben,I know I can speak freely on your blog, I just wanted to preface what I was saying so that I wouldn't come across as attacking or anything.Thanks again for all the work you do on this blog. You tackle a lot of issues, so it is great to open dialogs into these things!

@Alf,I hope that I did not come across too strongly here. You indeed have written a piece that puts very complex concepts into an easier to understand format. Thank you for that!

I still don't think I agree with everything you are saying...To pick on the sum example you give here:1,000 + something unknown = something unknownI totally agree with that in mathematical terms.The problem I have here is that you cannot relate this to the way sum works...

If I gave you a basket and inside the basket were 3 apples and a black bag that obviously has something in it, but you cannot tell what it is... and I ask you "how many visible apples are there?" the answer is 3.The answer is not "3 and an unknown in the black bag".See, I didn't ask anything about the unknown. Sum is the same way.It specifically returns the summation of all known things.

in math if you are asked to sum 3, 5 and "x", then you would have3+5+x = 8+x.You would not be able to reduce this any further because x is not defined.but in SQL, if you call the Sum function on a column containing 3, 5 and NULL... then the Sum function will return 8 because that is exactly what the Sum function is designed to do.It may be different than your definition of what Sum *should* do, but it is not broken.I guess that is my real point...Anyway... Again, thank you for your article, and thanks to both of you for opening up this dialog!

No offense taken at all! Just a pleasure talking to you. And yes, Ben runs a good blog: credits given.

The problem you state with SUM is relevant; very much so indeed! The same goes for other aggregates.

But: WHERE lies the problem?

Actually, Ken, you answered it yourself:

3+5+x = 8+x.

It is unknown...

And still the SUM function dares to deliver an answer?

So where is the error? As you say, the SUM function is designed to perform this way. So much for that design: The SUM function should in this case have returned: "I really don't know: Something I don't know anything about is going on in this database...".

We are now talking 3-valued logic logic instead of good, old 2-valued logic: Yes or No. That's exactly what we need databases for. Either true or false.

Do not believe that a design is correct just because some "authority" says so: All the relational flaws in Standard SQL are well-documented, and allowing NULL is really one of the worse.

The problem is in a totally different sphere, actually. Why didn't Codd create the NULL concept? Because it was unthinkable: A database should deliver an answer:

That's what we use queries for: To get an answer: Yes or No. Is there available tickets for flights from Oslo to Santorini on Monday? That's a query.

I do not want to see the word I don't know on my screen: I want an answer. Either true or false. Never I don't know. but that's what you get with nulls.

And relational theory states that relations contains tuples that hold TRUE propositions.

So where do the concept of NULL come from, since it was non-existent in Codd's revolutionary relational model, which we all try to use each day?

My answer would be:

Performance.

It is a fact that joining many tables will slow you down considerably HW-wise and DB-wise. This is the main reason why NULLs are discussed so much, and it is the main reason why so many want it.

Another reason: Programmer convenience. Not having to write all those joins. Just face it. You hate it.

Wouldn't you agree that if multi-joins were not a problem; if the DB system/HW didn't care about how many tables/joins were needed; if you had better tools for setting up such complex queries, you wouldn't think of using NULLs?

So what are we left with?

In order to cope with inadequate HW/DB platforms, we break the rules of consistent, reliable storage of information in our databases: We allow NULLs all over the place, and we denormalize our data in order to "help" those inadequate solutions from vendors. What if we did put some pressure on them to deliver a real relational solution instead?

Ok, I got carried away. But for some considerations on denormalization, another perversion in relation(!) to the relational data model, may I suggest you read this:

That's what SUM is delivering, according to you, and according to me...

Same goes for COUNT.

Let us leave these falsely defined functions that will let you believe that not is not not :-), and look at another, even more important effect of accepting NULLs:

Incomplete DB Design

Whenever you feel that an attribute (OK, column if you like) must be NULL, what does it MEAN?

It means you are uncertain of the need for that column at that place: It MAY be needed, and it MAY NOT. It is a strong first sign/signal of a missing entity. And you are very well aware of it, but so many times, you think: "Oh, another table; it will hurt performance", and there you go...

I will give you an example of how vendors drive us:

In the beginning of the Warehousing era (early 1990s), at Oracle Openworld in SF Oracle was handing out t-shirts with "Size matters" on them. Next year, they had discovered the limitations on the harddisk (and whatever else) side. What did they do? Handing out t-shirts with "Any size matters" on them.

It is all about selling. Unfortunately, it is clogging our knowledge and understanding of basics. I try to write about basics.

Alf,I would be interested to hear how you would implement David's scenario above.Lets make it a bit simpler for discussion.An application has a data entry screen that allows the user to input data into 10 fields that all have a 1-to-1 relationship with the primary key for the record.The application has a business need to allow the user to save the form in "draft" status before they are finished filling in all the fields.For text fields, this is a simple matter... if the user hasn't input anything than the value that you will store in the DB will be ''.For integer or date fields, what would you store in the DB?would you really create additional tables to store each of these ancillary fields?

I understand your point of having additional tables to overcome the need to store NULLs for some instances, but I could not imagine doing this for every case.

I could do that, but it would be just like delivering you the answers without presenting you with the questions. Besides, I am very busy right now checking a database model for a friend (and myself).

But to help you on your way, I have found the following links very valuable: Please read all of it: They get my highest recommendations, and I am certain that if you mean business in your field, you will enjoy this:

Fabian stopped maintaining this site a while ago (he was really feed up) But fortunately he kept the stuff he had produced, online. (Yes I know him: I consider him a friend of mine, so now you're warned! :-)

One of the worst websites I've seen, with the best content I've ever read, is this (can't believe where they got their colors from...:

Wow Ben, you get enough comments, you were bound to get some spam! lolAlso, I think you must have a timeout on your comment entry screen as every time I type a long comment, I submit it and get an error and I have to resubmit. Is this on purpose to try to prevent some of the spam somehow?*shrug*

Alf,I will do some reading over the course of the next week... you have given me a lot of links!!!

I just perused "How to Handle Missing Information without Using NULL"...wow.I understand the theory and the point, but given today's technology it doesn't seem all that practical... You end up exerting an awful lot of energy trying to do something that can be done with today's tools by putting a bit more business logic in the layer that talks to the DB...

6NF seems a bit extreme to me, and I can see why you chose not to answer my question in detail yourself, as this paper does a good job of explaining how one would take that type of scenario and rip it apart into potentially 30+ tables.

I know this is way out of scope of this blog, but I like where this discussion is going, so I wanted to continue it a bit...

Looking at the example in the paper I mention above, they use new tables like Job_Unk and Unemployed to store the information about rows of data that do not have a job defined either because the job is unknown or the record [person] is unemployeed.If you are going to write business logic to populate special tables with data when someone is entering the information in an application, then why not write logic that places special values into the existing table columns instead of the NULLs... I think it is splitting hairs to say that NULL is a rediculous concept, but creating a table to store references such as Job_Unk is OK... if it is ok to state that the job is unknown, how is that different than storing NULL in today's existing DB infrastructure?

I didn't see anything that dealt with data such as an employees end date... it is very tipical to have to store employee information such as when they started with a company and when they left the employement of the company... I guess I wold make the same point here that I did above... what is the difference in the following 2 ways of storing end_date:table: employeeemp_id, name, end_date1, Ken, 1/1/20072, Bob, NULL

and from what I can gather would be the proposed method:table: CALLEDid, name1, Ken2, Bob

table: date_leftid, date1, /1/2007

table: end_date_unk2

My method handles an unknown date with a concept built to handle unknowns...the alternate method handles the same unknown by making its own handling process for unknown values... what does the second method buy you other than complexity?

I guess what I am getting at is...I had a great DB Design class in college and the main thing that I got out of that class is that there is Database Design Theory and Database Design... The theory is something for the acadamia world to ponder, but when you are actually building something "in the wild", the theory is just that... a theory. I agree it is important to understand the "hows" and "whys", but in my opinion at some point "good enough" has to be good enough.

Your table end_date_unk should not contain a record: you are still employed.

"My method handles an unknown date with a concept built to handle unknowns..."

Fine. Teach your concept to the programmers coming in after you leave, so they'll act in the same way, and all will be OK...

"the alternate method handles the same unknown by making its own handling process for unknown values... "

Fine. Teach your handling process to the programmers coming in after you leave, so they'll act in the same way, and all will be OK...

"what does the second method buy you other than complexity?"

Flexibility.

Teach your method to the programmers coming in after you leave, so they'll act in the same way, and all will be OK...

"I guess what I am getting at is...I had a great DB Design class in college and the main thing that I got out of that class is that there is Database Design Theory and Database Design... "

You weren't exposed to a teacher but to an unskilled ignorant. It is quite common. If they weren't, they wouldn't be teachers, but working in the industry, producing real, working database systems.

"The theory is something for the acadamia world to ponder, but when you are actually building something "in the wild", the theory is just that... a theory."

And theory is practical, according to Date. And me ;-)

I once quit a job and went sailing across the Atlantic. When I returned I started working again for the same company, So far, I have 2 start dates and 2 end dates. I don't think I'll be working for them again, but you never know...

If anything can be done(recorded) more than once, it can be done many times. So you'll have a 1:many relationship. But of course, that is in the real world, not in your DB model...

As I said earlier, I am only the messenger boy. Argue with Codd (he's dead, but his thougts are alive), or Date/Pascal and every other significant theoretical. You will of course find a few teachers in our industry. They are easy to spot, "teaching" all but theory...

Your post displays that you are not thinking about logic (the actual business model), but how a computer is able to perform on your logic (The physical implementation). It is very common.

Have you changed your view about aggregate functions and the obvious errors they expose?

Enough for crossing words, but you'll get there. I do hope that Ben's readers get the point.

I haven't read through those links that you posted just yet, but I wanted to jump in with my 2 cents. I think there is something to be said for the having the "Best" solution at a given time, and not the best solution of all time. Application, in general, need to be refactored as time goes on. No one knows ever solution when a project starts (least of all, the clients).

So, in a situation where there is employment, you might start out with:

ID, Name, StartDate, EndDate

But then you get someone like you who leaves and comes back to the same company. Then, developers sit down, re-think some business scenarios and decide to split "employee" and "employment" records into:

Employee:ID, Name

Employment:EmployeeID, StartDate, EndDate

Now, they have an application that can be fairly easily converted into another; running a script to populate the new table would be child's play. Updating the code that references might take some more work, but more grunt work than complication.

The point is that "theory" and "practicality" are different in that they each have a time-cost trade-off. Building something that is theoretically the "Best" way to do something might take more time/money/effort to build. Sure, it might provide you with the most flexibility and future proofing, but it might take too long, or even go above the head of your employed programmers. The simpler solution, while not as effective, could be less time consuming, less expensive, and less demanding on the employed programmers.

Now, you might think, If the programmers cannot handle the complex nature of the "Best" solution, then get different programmers; but, this is where practicality vs. theory comes into play. Getting new programmers, making sure everyone can write effective SQL is not always the easiest option to fill.

That is not to say that I don't agree with what you are saying; I think we should all strive for what are to be considered best practices. I am just saying that at times, we don't want Best be the enemy of Good. In the interest of moving forward, allowing for, or even expecting to have to re-factor an application in the future will allow you be more innovative in the long run (IMO).

Yeah, the comment form has a life span that is valid. I can't remember what it is, but I think its like > 10 seconds and < 10 minutes or something. The point was that I didn't want spam forms storing the POST data and then submitting later. Therefor, the form basically dies if you try to submit it after a given amount of time, as you have seen. Then you just have to re-submit (as the form now has a new life span).

I have to clean up my anti-spam stuff at some point. It's too complicated.

Alf,Thank you for taking the time to give such detailed responses!I will try to digest a lot of the reading material this week...

I guess I was a bit hasty in my example of start/end dates of employees. you are correct, that is obviously potentially 1 to many...I guess a better example would be birth/death dates.I was trying to come up with an example of a date that needs to be stored but could potentially not yet exist. I can understand the point that if someone is alive we know that they are in the "still alive" state, and therefore the death date is not unknown in the sense that unknown could mean either they are dead and we do not know the date, or it could mean the person is still alive...So having ways to store those 2 scenarios differently in the DB does make sense, I just can't yet agree that this justifies the extra overhead of creating extra tables to store this information.I am a contractor paid hourly, and I couldn't look my client in the eye and try to tell them that they need to pay me to spend 5 times as long developing the DB schema because it is the "better" way to do it, when the shortcut method "just works".

I think I agree with Ben's last post, and don't think I could state it better myself.

Thanks again for taking the time to reply!

Ben,Thanks again for opening this discussion and so eloquently putting into words what I have been struggling to get out! lol

This is a thoroughly interesting discussion! My thanks to all those doing the thinking and writing. :)

It's easy for developers who are contending with deadlines, customers, and uncooperative applications to get impatient with theory. But I think a healthy respect for theory can make you a better developer. We should always be looking for a better way to do what we do, and academia gives us more good ideas than we know what to do with.

We have academics to thank for object-oriented programming, the relational database, the quicksort algorithm. . . heck, I think we should all be thankful we don't have to allocate physical memory because they invented 3rd-generation programming languages! And these ideas are not motivated by marketing, but by a desire to find the optimal solution for our thorniest problems.

That said, I do have a very simple use case to ask Alf about. How do you recommend we handle a column that stores middle names in a contact table? Many people do not have a middle name. I think that, despite this handicap, these contacts belong in the same table with the rest of humanity.

While I may be in my ivory tower, after reading the referenced article, I find the programming logic solid. However, I would argue the database design which used used to argue the logic is flawed. Why would anyone ever allow a balance field to be nullable, or an indicator field of "Y" or "N" to allow nulls in these situations.

I often see the problem is we dont' take the time to understand the data and therefor make assumptions which prove to be false and then blame the database design, poor coding or what not. The key here is balance. There IS a time to use NULLS, there are also times where they are terrible to use. You must UNDERSTAND the data you are working with before you criticize it. In the referenced articles sake, whoever designed that system really didn't understand the data and as a result has a flawed structure.

Great we have a primary system key of EmployeeID and it should not allow nulls

What about First Name? Well everyone has one right? Here's where we make an assumption about the data. We think everone has one so we don't allow null values.

What about the Last Name? Well again everyone has one right? WRONG! There are those who legally only have one name. Is it their first or last? Kinda makes you think about weather first name needs to allow nulls.

So if they don't have a last name what are you going to store? (Space?) Ok, so do Prince and Madona share the same last name? Sadly NULL (unknown) seems inappropriate as well because we do know they don't have one, therefor it is known to be undefined (maybe we should request the user to enter "N/A" here if they don't have one.

Moving on lets look at Middle Initial similar to last name not everyone has one, so "N/A" again? well the field is one character in length so now what? allow three when it's an initial field? or allow NULLS

Moving on lets look at birthdate. At time of entry we knew everyones birthdate except Jack's. But since the system forced us to enter something we picked a date. Wow this seems dangerous. Without nulls, how would I handle such things as dates?

If I refer to the article, I should normalize my date out so that such data would be managed in a different table and the absense of the data would indicate a "null" value. If I took this apporach I'd achieve a 5th normal form database design; and while theory indicates this is a good idea, practice states otherwise. The number of joins which would be forced upon my developers may actually run them into the engine limits on allowed joins.

If I leave it in the table, I must allow for NULL values or suffer allowing invalid data into the system. And while this violates the theory for 5th normal form Relational Database, it does improve system performance by limiting the involved joins.

Now what about active? Null values seem inappropriate here. A person is either Active, or not. There's no inbetween. If there is an inbetween then a bit field would be the wrong choice. So what's the business need? Are we only interested in Active or not?

My arguement against the article is based on the poor choices made by the designer of the database.

Simply put there is no right or wrong answer when it comes to should null values be allowed. Each column has to be considered for its merits. If at time of entry or update, could a person simply NOT KNOW what to put into this field. If the answer is yes then consider allowing NULLS. Also consider if functionality is based on a value in the column. If so, maybe they shouldn't be allowed to save the record if they don't know what to put into it.

As much as thinking about data and the database design is important, I think it's also important to realize that databases do not exist in a vacuum - they exist to solve a specific problem for a specific business with a planned interface (user input device). As such, the database is as only as good as what's allowed to be moved into it.

So, for example, you might have a field, "doohicky" which can be NULL in the database from a "database design" standpoint. However, if the interface built for it does not allow NULL (ex. user selected from a drop down box that has no "None of the above" choice), then does it make sense to have it nullable in the DB?

If the business logic does not allow null (ie. Interface comes back with a "Form error" if field is not entered/selected), should the database allow null?

I am not answering these as much as I simply posing them to allow people to remember that databases are not a end in and of themselves - they are a means to an end. This is why, as you say, violating 5th normalized form is good if it improves performance and developer sanity... because the application is the ultimate goal, not the database.

Plus, I think so much of is a weird judgement call. For example, if Prince can legally turn his name to be a "Symbol", then can someone legally change their name to be the "empty string"? If so, how can we determine an "unanswered question" from an explicitly "empty string" answer?

.... which really goes back to one of my very original arguments which was that NULLs don't add value (most of the time). That a NULL does not add value to the overall application more so than an empty string or default zero does.

I need null values in many to many with foreign key designs like customer and registered_customer and customer_unite tables. I have 3 columns in customer_unite table customer_id and reg_customer_id and customerID. so customer_id has foreign key referencing to customer table and reg_customer_id has foreign_key on registed_customer table.So if I have proper int id value on customer_id colunm then reg_customer_id is set to null and vise verce. So my opinion is that null are not not bad when using them in many to many relation not in "standalone" tables and you know to expext that another column has null.I cannot combine customer and registered_customer tables that would create so mutch problems to handle.

I agree that you should use NULL values when they make your life easier. I personally, would use ZERO rather than NULL, but that is simply because the server-side language that I use does not understand NULL values as falsey values - rather, it treats it as an empty string. As such, zero has more use as it will be evaluated as a falsey value. But again, that is because of the way my server-side language works.

@Ben thanks for replying me :) I think I maybe found place where to ask questions and share thoughts.. I`ve been developing my own webstore about 4 years (only when I got free time) without any help from any person. I had finded my way to (almost)complete my webshop.I`m using php and mysql.I`ve always thought that when I need help it takes too long to someone on blogs or forums to answer.But when I posted comment earlier that was so fast reply that I couldn`t believe.I think that I`m gonna be regular visitor of this site. :)

I am the co-founder and lead engineer at InVision App, Inc — the world's leading prototyping,
collaboration & workflow platform. I also rock out in JavaScript and ColdFusion 24x7 and I dream about
promise resolving asynchronously.