Nice summary! Of course we all know that the database WILL be expected to parse pieces of data that are allowed to be combined in one column. (Just witness the number of articles devoted to separating comma-separated strings!) Among the ways that MS-SQL helps us accept this situation and be productive anyway: full-text indexing, XML and spatial datatypes, and column sets.

If whether the combined data violates 1NF depends on how the client app uses it, what about the primary key rule? That is, does it still count as 1NF if the table has a primary key, but the application ignores it (never requests a row based on it)?

A table is in 1NF when every column has a fixed simple data-type that is understood by the database system, and every row has a primary key.

This is your own definition ?If not, please indicate references.

For example, people may leave off the primary key requirement because they know that the base relations of relational database theory are required to have a primary key

In fact there is no primary key in relational theory. Or at least, that expression is obsolete (Date on database: writings 2000-2006, p117).And there is no requierement.By definition a relation has at least one key (a "candidate" key); it's not possible to have duplicate tuples.

Concerning the "A Table which is not in 1NF" part.

I'm not sure to understand. Are you really saying that 1NF violation is somewhat related to the application using the database ?

this is a claim made only by people who don't understand domain theory

1. I find this article problematic because it uses terminology and ideas that are not going to be universally understood in the same way by all readers. In particular it uses the term "NULL" without explaining it. Are we really talking about SQL-style nulls here? Relational database theory doesn't include anything like SQL's kind of null. It certainly is not part of Codd's original relational model or even his later revisions of it. There are several contradictory definitions of nulls and Tom hasn't explained which one he is talking about.

2. The suggestion that domain theory means that relations can support nulls appears to be speculation on Tom's part. At least that's how it seems, because he hasn't referenced any other source for that claim. I don't recall ever seeing that from any other source.

3. Tom says that "every row [sic] has a primary key". I guess this means every relation has a primary key. Keys are usually understood to be a subset of the attributes of a relation or relation variable rather than a property that tuples have.

4. He goes on to say: "a relation in relational database theory is required to conform to 1NF". This is true of Codd's original relational model but then Codd's original relational model did not support nulls whereas Tom appears to be discussing some other system which does have nulls. Codd's RM2 on the other hand does NOT require every relation to conform to 1NF. Codd's RM2 specifically allows derived relations which do not have primary keys at all. Again, to give readers a chance of understanding his explanations Tom needs to explain what definitions he is referring to.

5. The claim that relations require "primary" keys is a bit ambiguous. What Codd referred to as a "primary key" in his 1970 paper is more usually today called a Candidate Key. A "primary" key is nowadays usually understood to be just one of the candidate keys which is somehow designated to be a "primary" one. Depending on the implementation that designation of a primary key may be part of the logical data model in the database or it may not be. There is nothing in the relational model that absolutely requires a key to be so designated. Candidate keys are fundamental but the concept of having a "primary" key or not having one is unimportant.

David Portas (6/30/2011)1. I find this article problematic because it uses terminology and ideas that are not going to be universally understood in the same way by all readers. In particular it uses the term "NULL" without explaining it. Are we really talking about SQL-style nulls here? Relational database theory doesn't include anything like SQL's kind of null. It certainly is not part of Codd's original relational model or even his later revisions of it. There are several contradictory definitions of nulls and Tom hasn't explained which one he is talking about.

It doesn't really matter which sort of nulls we are talking about, as long as they are not used to violate the principles of the relational model. If you want me to be specific, by NULL you can take me to mean the domain-specific BOTTOM value in each domain as used in the dentational semantics for recursive function calculus - ie for all effective computer languages - produced by Dana Scott and Chris Strachey a very long time ago. It really was a very long time ago - I was less than 30 years old when Scott came to Oxford and teamed up with Strachey.

2. The suggestion that domain theory means that relations can support nulls appears to be speculation on Tom's part. At least that's how it seems, because he hasn't referenced any other source for that claim. I don't recall ever seeing that from any other source.

Heath and Codd were talking about a value to denote absence of information as early as 1971, as part of their efforts to devise a join (which today we would call an outer join) that would include all information from each relation involved. That was during their collaboration on defining normal forms, when Codd decided to allow key elements to be dependent on partial keys in 3NF, while Heath advocated not doing so. That, incidentally, is why Date stated that Heath defined BCNF three years before Boyce got involved (the definition is in one of Heath's 1971 papers). You'll see references in Codd's later work to "private communications" from Heath about the full join problem (for example on page 407 of ACM ToDS 4/4, December 1979). The idea that domains have undefined values was certainly understood in the 1930s and probably well before that - I learnt domain theory from a 1930s mathematical logic text (although it was Scott's work in the 70s that made it a fundamental part of the foundations of computer science) and an Oxford educated mathematician like Codd was certainly familiar with that. His consistent use of the word "domain" in the 1970 paper doesn't suggest that he was trying to avoid the concept.

3. Tom says that "every row [sic] has a primary key". I guess this means every relation has a primary key. Keys are usually understood to be a subset of the attributes of a relation or relation variable rather than a property that tuples have.

Drat! I should not have used that statement. It was very sloppy. I should have said that the relation has a primary key (or at leat one primary key).

4. He goes on to say: "a relation in relational database theory is required to conform to 1NF". This is true of Codd's original relational model but then Codd's original relational model did not support nulls whereas Tom appears to be discussing some other system which does have nulls. Codd's RM2 on the other hand does NOT require every relation to conform to 1NF. Codd's RM2 specifically allows derived relations which do not have primary keys at all. Again, to give readers a chance of understanding his explanations Tom needs to explain what definitions he is referring to.

I really don't want to get into a long discussion of Nulls here - but I will say that I'm fed up with hearing from Date's followers that NULLs were not around until a decade after Codd's 1970 paper. It's absolutely clear that he and Heath discussed them from 1970 onwards (perhaps after the internal IBM paper, but certainly before the first published paper). I think perhaps you have forgotten that first normal form was introduced by Codd to ensure that the relational calculus could use first order predicate calculus: it would not require higher order predicate calculus. Nulls don't have any influence at all on whether a higher order predicate calculus is needed or not.I am always amused when someone tells me that derived relations have to obey the rules for normal forms - someone recently suggested that derived relations ought to fit 5NF, and it took me some time to stop laughing. I don't think a special case can be made for 1NF for derived relations.

5. The claim that relations require "primary" keys is a bit ambiguous. What Codd referred to as a "primary key" in his 1970 paper is more usually today called a Candidate Key. A "primary" key is nowadays usually understood to be just one of the candidate keys which is somehow designated to be a "primary" one. Depending on the implementation that designation of a primary key may be part of the logical data model in the database or it may not be. There is nothing in the relational model that absolutely requires a key to be so designated. Candidate keys are fundamental but the concept of having a "primary" key or not having one is unimportant.

I was sticking to the original definition (1970) which allows a relation to one of more primary keys. ("A relation may possess more than one nonredundant primary key": CACM 13/6, Dec 1970, page 380). Only one of the primary keys is chosen and called The primary key. Of course SQL has a restriction that there can be only one primary key. Modern terminology is as you say usually "candidate keys" rather than "primary keys". But the individual attriutes in a candidate key are still "prime" attributes, not "candidate" attributes, so the terminology isn't very pure.

edit: fix quote brackets. edit agin: [quite] didn't work nearly as well as [quote] - stupid copmputer reads what I type instead of what I mean to type.

I found this article interesting. However, granted I'm not a full-time DB designer, I am slightly confused by the statement:

Some people would still use the three columns, using a silly number like 000 000 0000 to indicate "no such number". That's better than using NULL to indicate "no such number".

For existing databases (in cases where I can't design the table from the ground up) I prefer not to use some arbitrary value to denote "no value" and do use NULL values in this case. I find using LEFT JOINs or searches WHERE x IS NULL easy enough in these cases...maybe I'm missing something?

However, I do agree that it's always best to extract the columns out into their own table - removes the problem completely.

Is there an article/source that explains why it is better to use a predefined value instead of NULL?

A table is in 1NF when every column has a fixed simple data-type that is understood by the database system, and every row has a primary key.

This is your own definition ?If not, please indicate references.

Since SQLServerCentral has a no-plagiarism rule, it has to be my own definition, doesn't it. "row" should of course have been "relation".

For example, people may leave off the primary key requirement because they know that the base relations of relational database theory are required to have a primary key

In fact there is no primary key in relational theory. Or at least, that expression is obsolete (Date on database: writings 2000-2006, p117).And there is no requierement.

I don't accept Date as an ultimate authority. He has written some good stuff, and also some not so good stuff (all extremely able people have the odd off day). Since the elements of a candidate key are called "prime" attribiutes, there should be no objection to calling it a "primary key". Because first normal form was defined in 1970 using the term "primary key" not "candidate key", and I was attempting to define and describe first normal form, I stuck to that terminology.

By definition a relation has at least one key (a "candidate" key); it's not possible to have duplicate tuples.

Concerning the "A Table which is not in 1NF" part.

I'm not sure to understand. Are you really saying that 1NF violation is somewhat related to the application using the database ?

No. I'm saying it is related to what the database is expected to do with the data. The application, once it has obtained the data from the database, can do what it likes. What must not happen is that the database be required to understand some encoding which is not presented to the database as part of the type of (the domain of) an attribute.

this is a claim made only by people who don't understand domain theory

In mathematics, a doman contains elements which can be fully defined, partly defined, undefined, or overdefined. Trivial domains contain only fully defined elements and a single undefined element (usually called BOTTOM, but sometimes NIL and sometimes NULL) and maybe a single overdefined element (usually called TOP if it exists). For example a domain of unsigned integers range 0 to 255 contains the 256 integers 0...255 which are fully defined elements and the BOTTOM element about which nothing is known except that it a member of that domain; it might or might not contain a TOP element that is what you get when you add 1 to 255 (perhaps called "Arithmetic Overflow" or something like that); it might or might not contain partly defined elements (such as an integer known to lie between 31 and 94). Dana Scott bought domain theory to his collaboration with Chris Strachey in 1970, and their subsequent work on denotational semantics made it into a fundamental part of theoretical computer science (making extensive use of partly defined values, as well as bottom and fully defined values), but of course the basic concept of a bottom value predates that by at least three decades. If you want to learn about it, maybe Joe Stoy's "Denotational semantics: the Scott-Strachey approach to programming language theory" will work for you or maybe I'm just prejudiced because I like him as a person - so try to borrow it, not buy it, in case it doesn't suit. I seem to remember that there is something much older, perhaps by Garrett Birkhoff (or maybe even by his father), and there was a section on domain theory in an old math logic textbook that I used back in 1963 (that's where I first came across mathematical domains: I can't remember who that book was by - it certainly wasn't Church, who was a far better writer than whoever it was but didn't cover domain theory in his math logic text). Or maybe you can get hold of Scott's OxPRG reports.

I started to forward this to a colleague who is just starting to design databases. After reading it thoroughly I thought better of it. I think the author has just started designing databases himself.

Why in an article about designing a database would you confuse it by saying it's ok to denormalize columns if the application is going to take care of it.

I question the value of technique "Usually a still better solution (better by far) is to introduce a new table OfficeFax that has a name column and a FaxNumber column". What?!?! You would create a PhoneTypes table listing the types of phone numbers you're going to store with a primary key and a description of each. Then create a PhoneNumbers table with a number column and a couple foreign keys relating it to the original information table and the PhoneTypes table. This creates a many to many relationship saying for a given phone type you can have many people that have them and many people may have a given phone types.

Overall a very poor article to give to a db newbee so I won't be doing that.

I found this article interesting. However, granted I'm not a full-time DB designer, I am slightly confused by the statement:

Some people would still use the three columns, using a silly number like 000 000 0000 to indicate "no such number". That's better than using NULL to indicate "no such number".

For existing databases (in cases where I can't design the table from the ground up) I prefer not to use some arbitrary value to denote "no value" and do use NULL values in this case. I find using LEFT JOINs or searches WHERE x IS NULL easy enough in these cases...maybe I'm missing something?

However, I do agree that it's always best to extract the columns out into their own table - removes the problem completely.

Is there an article/source that explains why it is better to use a predefined value instead of NULL?

Thanks, George.NET Programmer Analyst

Well, NULL should always mean "The database doesn't contain this value", so it would be used in the case where the value is applicable but unknown for one reason or another as well as in the case where there is no value because it's not applicable; if you want to be able to tell for certain that the value is inapplicable, you can't use NULL unless you can guarantee that there are no cases where the value is applicable but unknown. That's my only reason for preferringto avoid using NULL for this particular purpose.Other people have other reasons - C.J.Date for example is quite famous for his anti-NULL stance (as well as for many far more important things) so you might find something in one of his books or papers, and both David and Steven who have commented above have a much stronger anti-null attitude than mine so perhaps they could point you towards something suitable for you to read.