The What and Why of Normalization

Normalization is the process of designing tables so that each fact is
stored in exactly one place. A "fact" in this case is any detail that
we have to keep track of, such as a product's description, a product's
price, an employee's social security number, and so forth.

The process is all about figuring out what tables you need and what
columns each table will have. If we are talking about an employee's
social security number, then we can guess right from the start that
will have a table of EMPLOYEES, and that one of the columns will be
SSN. As we get more details, we add more tables and columns.

The advantage of normalization comes when your application writes
data to the database. In the simplest terms, when the application
needs to store some fact, it only has to go to one place to do it.
Writing this kind of code is very easy. Easy to write, easy to debug,
easy to maintain and improve.

When the database is not normalized, you end up spending more time
writing more complicated application code that is harder to debug.
The chances of bad data in your production database go way up.
When a shop first experiences bad data in production, it starts to
become tempting to "lock down" access to the database, either by
forcing updates to go through stored procedures or by trying to
enforce access to certain tables through certain codepaths. Both
of these strategies: stored procedures and code paths, are the
actually the same strategy implemented in different tiers, they
both try to prevent bugs by routing access through some bit of
code that "knows what to do." But if the database is normalized,
you do not need any magic code that "knows what to do."

So that, in brief, is what normalization is and why we do it.
Let's move on now to denormalization.

Denormalization is Harder to Talk About

Normalization is easy to explain because there is a clearly
stated end-goal: correct data. Moreover, there are well-defined
methods for reaching the goal, which we call the normal forms,
First Normal Form, Second Normal Form,
and higher forms. By contrast, denormalization is much harder
to talk about because there is no agreed-upon end goal. To make
matters worse, denormalization violates the original theory of
Relational Databases, so you still have plenty of people screaming
not to do it all, making things even more confusing. What we have
now in our industry is different
shops denormalizing in different ways for different reasons.

The arguments that I have heard in my career boil down to two
basic groups. The first set of arguments centers around
calculated or derived values, and the second set centers
around programmer convenience.

Arguments for Derived Values

My own experience comes down heavily in favor of denormalizing
by storing derived values directly into the tables, with the
extremely signficant caveat that you must have a way to ensure
that they are always correct. In this paradigm you maintain
strict normalization for facts supplied from the outside,
and then layer on additional facts that are calculated during
write operations and saved permanently.

Here is a very simple example.
A strictly normalized database happens to be missing data
that many programmers would automatically assume should be
stored. Believe it or not, a simple value in a shopping
cart like EXTENDED_PRICE is forbidden by 3rd normal form
because it is a non-key dependency, or, in plain
English, since it can be derived from other values (QTY * PRICE),
then it is redundant, and we no longer have each fact stored
in exactly one place. The value of EXTENDED_PRICE is only
correct if it always equals QTY * PRICE, and so there is now
a "fact" that is spread across three locations.
If you store EXTENDED_PRICE, but do not have a way to ensure
that it will always 100% of the time equal QTY * PRICE,
then you will get bad data.

So, given the risk of bad data, what is to be gained by
putting EXTENDED_PRICE into the cart? The answer is that
it adds value to the database and actually simplifies
application code. To see why, imagine a simple eCommerce
shopping cart that does not store any derived values.
Every single display of the cart to the user must go all
over the place to gather lots of details and recalculate
everything. This means re-calculating not just the
EXTENDED_PRICE, but adding in item level discounts, taking
account of possible tax exemptions for different items,
rolling
the totals to the cart, adding in tax, shipping, perhaps
a customer discount, a coupon, and who knows what else.
All of this just to display the cart, every time, no matter
what the purpose.

This situation leads to three problems. A pitifully slow
application (too many disk reads and lots of cycles calculating
the values), maddening bugs when an application update
has subtle changes to the calculations so the customer's
order no longer displays the same numbers as it did yesterday,
and the frustrating requirement that the simplest of reports
must route through application code to calculate these values
instead of simply reading them off the disk, which leads to
reporting systems that are orders of magnitude slower than they
could be and horribly more complicated than they need to be
because they can't just read straight from the tables.

Now let's look at how that same shopping cart would be used
if all of those calculated values were generated and saved
when the order is written. Building on your foundation of
normalized values (price, qty), you need only one body of code
that has to perform calculations. This magic body of code
takes the user-supplied values, adds in the calculations,
and commits the changes. All other subsequent operations
need only to read and display the data, making them faster,
simpler, and more robust.

So the obvious question is how to make sure the derived
values are correct. If they are correct, we gain the
benefits with no down side. If there is the smallest chance
of bad data, we will quickly pay back any benefit we gained
by chasing down the mistakes.

From a technical standpoint, what we really need is some
technology that will make sure the calculations cannot
be subverted, it cannot be possible for a stray
bit of program code or SQL Statement to
put the wrong value in for EXTENDED_PRICE. There are a
few generally accepted ways to do this:

Require all writes to go through a certain codepath.
The only PRO here is that you keep the logic in the
application code, and since most shops have more programmers
than database people, this makes sense. The only CON is that
it never works. One programmer working alone can maintain
discipline, but a team cannot. All it takes is one programmer
who did not know about the required codepath to screw it all
up. Also, it makes your system inflexible, as it is no longer
safe to write to the database except through a single application.

Require all writes to go through stored procedures.
This is nominally better than the codepath solution because it
is not subvertible, and you can allow different side apps and
utilities to safely write to the database. But it makes a lot
of work and tends to be very inflexible.

Putting triggers onto tables that perform the calculations
and throw errors if a SQL statement attempts to explicitly
write to a derived column. This makes the values completely
non-subvertible, ensures they will always be correct, and allows
access from any application or utility. The downside is that
the triggers cannot be coded by hand except at extreme cost, and
so must be generated from a data dictionary, which is fairly easy
to do but tends to involve extreme psychological barriers. In
these days of ORM many programmers mistakenly believe their
class files define reality, but this is not true. Reality is
defined by the users who one way or another create the paychecks, and
by the database, which is the permanent record of facts. But
a programmer who thinks his classes define reality simply cannot
see this and will reject the trigger solution for any number of
invalid reasons.

So denormalizing by putting in derived values can make a database
much more valuable, but it does require a clear systematic
approach to generating the derived values. There is no technical
problem associated with ensuring the values are correct because
of course the application has to do that somehow somewhere anyway,
the real barriers tend to be the psychological and political.

Arguments For Programmer Convenience

The second set of arguments for denormalization tend to be
rather weak, and come down to something like this (you have to
picture the programmer whining like a child when he
says this), "I don't like
my data scattered around so many tables, can't we play some
other game instead?"

Many programmers, when they first learn about normalization
and build a normalized database,
discover that the data they need to build a screen is "scattered"
about in many tables, and that it is tedious and troublesome to
get it all together for presentation to the user. A simple
example might be a contacts list. The main table is CONTACTS,
and it contains not much more than first and last name. A second
table is a list of PHONES for each contact, and a third
table is a list of various mailing addresses. A fourth table
of EMAILS stores their email addresses. This makes four tables
just to store a simple contact! We programmers look at this and
something inside of us says, "That's just way too complicated,
can't I do something else instead?"

This is a case of programmer convenience clashing with correctness
of data. Nobody argues (at least not that I've heard) that they
do not want the data to be correct, they just wonder if it is possible
to simplify the tables so that they do not have to go out to so
many places to get what they need.

In this case, programmers argue that denormalization will make
for simpler code if they deliberately skip one or more steps
in the normalizing process. (Technically I like to call the
result a "non-normalized" database instead of denormalized, but
most people call it denormalized, so we will go with that.)

The argument goes something like this: I know for a fact that
nobody in the contacts list will have more than 3 emails, so
I'm going to skip the EMAILS table and just put columns EMAIL1,
EMAIL2, and EMAIL3 into the main CONTACTS table. In this case,
the programmer has decided to skip 1st Normal Form and put a
repeating group into the CONTACTS table. This he argues
makes for simpler database retrieval and easier coding.

The result is painfully predictable. The simplification the
programmer sought at one stage becomes a raft of complications
later on. Here is an example that will appear trivial but really
gets to the heart of the matter. How do you count how many
emails a user has? A simple SELECT COUNT(*)...GROUP BY CONTACT
that would have worked before now
requires more complicated SQL. But isn't this trivial? Is it
really that bad? Well, if all you are coding is a CONTACTS
list probably not, but if you are doing a real application with
hundreds of tables and this "convenience" has been put out there
in dozens of cases,
than it becomes a detail that programmers need to know on a
table-by-table basis, it is an exception to how things ought
to be that has to be accounted for by anybody who touches the
table. In any shop with more than 5 programmers, whatever
convenience the original programmer gained is lost quickly
in the need to document and communicate these exceptions.
And this is only a single trivial example.

Other examples come when it turns out you need more than
three slots for phone. In the normalized case this never comes
up. Any user can have any number of phones, and the code to
display the phones is running through a loop, so it does not
need to be modified for the case of 1 phone, 2 phones, etc.
But in the "convenient" denormalized case you now must
modify the table structure and the code that displays the contacts,
making it quite inconvenient.

Then you have the case of how to define unused slots. If the
user has only one email, do we make EMAIL2 and EMAIL3 empty
or NULL? This may also seem like a silly point until you've sat
through a flamewar at the whiteboard and discovered just how
passionate some people are about NULL values. Avoiding that argument
can save your shop a lot of wasted time.

In short, programmer convenience should never lead to a shortcut
in skipping normalization steps because it introduces far
more complications than it can ever pay for.

4 comments:

When I described surrogates as deferring information and natural keys as making information immediate, I was writing in context of the overall database design where tables refer to each other via foreign keys. If a table references another table's natural key, then it too must contain the real information contained within the natural key (i.e. the information is immediate). Referencing a surrogate key defers the look up of the *real* information until it is read from the other table.

You keep mentioning in this article and others that the point of normalization is to "store each fact in exactly one place". This is either erroneous or misleading. The point of normalization is to structure the database in such a way that the data is consistent, thereby preventing data anomalies. That is the goal, although it appears that "storing a fact in exactly one place" is your theory to how that goal should be achieved. But the goal itself and how the goal might be accomplished are two different things.

One very important issue I take with such a portrayal of normalization is that it is very likely to confuse beginners concerning the proper choice of keys. Whether a key is a surrogate key or a natural key is an orthogonal issue to normalization. Surrogates defer information to another table, while natural keys make information immediate to the given table. Your definition makes it sound as if you recommend that all keys should automatically be surrogates in order to "store each fact in exactly one place", when in reality the relational model (and normalization that depends on the relational model) is always first and foremost concerned about the real keys (natural keys). You can't even talk about things such as functional dependencies and key definitions without knowing what real data is depended upon by other data.

Enforcing consistency in an SQL database requires the use of foreign keys, but notice that foreign keys MUST rely on redundant data stored throughout the database. In order for two tables to check that the data shared by the foreign key matches, they must each have a copy of that data. Perhaps one might say that if only surrogates are used, then "facts" are ultimately deferred to a single table. The only problem is that in the meantime, all tables involved are checking only meaningless surrogates, not meaningful data. The result is that "facts" might be stored in one place, but you end up with terrible lack of consistency in your database.

Perhaps all this should be addressed in another article, emphasizing that normalization is NOT about "storing each fact in exactly one place"?

You keep mentioning in this article and others that the point of normalization is to "store each fact in exactly one place". This is either erroneous or misleading. The point of normalization is to structure the database in such a way that the data is consistent, thereby preventing data anomalies. That is the goal, although it appears that "storing a fact in exactly one place" is your theory to how that goal should be achieved. But the goal itself and how the goal might be accomplished are two different things.

One very important issue I take with such a portrayal of normalization is that it is very likely to confuse beginners concerning the proper choice of keys. Whether a key is a surrogate key or a natural key is an orthogonal issue to normalization. Surrogates defer information to another table, while natural keys make information immediate to the given table. Your definition makes it sound as if you recommend that all keys should automatically be surrogates in order to "store each fact in exactly one place", when in reality the relational model (and normalization that depends on the relational model) is always first and foremost concerned about the real keys (natural keys). You can't even talk about things such as functional dependencies and key definitions without knowing what real data is depended upon by other data.

Enforcing consistency in an SQL database requires the use of foreign keys, but notice that foreign keys MUST rely on redundant data stored throughout the database. In order for two tables to check that the data shared by the foreign key matches, they must each have a copy of that data. Perhaps one might say that if only surrogates are used, then "facts" are ultimately deferred to a single table. The only problem is that in the meantime, all tables involved are checking only meaningless surrogates, not meaningful data. The result is that "facts" might be stored in one place, but you end up with terrible lack of consistency in your database.

Perhaps all this should be addressed in another article, emphasizing that normalization is NOT about "storing each fact in exactly one place"?

Pages

In the Top 200 Tech Blogs!

In 2008 Datamation listed The Database Programmer as one of the top 200 tech blogs, on the same page with James Gosling (!), Coding Horror, Dzone, and the author's own personal hero, Dave Thomas (if you haven't bought "The Pragmatic Programmer", go online and buy it now):