Singular Value Consulting

Rise and Fall of the Third Normal Form

The ideas for relational databases were worked out in the 1970s and the first commercial implementations appeared around 1980. By the 1990s relational databases were the dominant way to store data. There were some non-relational databases in use, but these were not popular. Hierarchical databases seemed quaint, clinging to pre-relational approaches that had been deemed inferior by the march of progress. Object databases just seemed weird.

Now the tables have turned. Relational databases are still dominant, but all the cool kids are using NoSQL databases. A few years ago the implicit assumption was that nearly all data lived in a relational database; now you hear statements such as “Relational databases are still the best approach for some kinds of projects,” implying that such projects are a small minority. Some of the praise for NoSQL databases is hype, but some is deserved: there are numerous successful applications using NoSQL databases because they need to, not because they want to be cool.

So why the shift to NoSQL, and why now? I’m not an expert in this area, but I’ll repeat some of the explanations that I’ve heard that sound plausible.

Relational databases were designed in an era of small, expensive storage media. They were designed to conserve a resource that is now abundant. Non-relational databases may be less conservative with storage.

Relational databases were designed for usage scenarios that not all applications follow. In particular, they were designed to make writing easier than reading. But it’s not uncommon for a web application to do 10,000 reads for every write.

The scalability of relational database transactions is fundamentally limited by Brewer’s CAP theorem. It says that you can’t have consistency, availability and partition tolerance all in one distributed system. You have to pick two out of three.

Part of the justification for relational databases was that multiple applications can share the same data by going directly to the same tables. Now applications share data through APIs rather than through tables. With n-tier architecture, applications don’t access their own data directly through tables, much less another application’s data.

The object oriented worldview of most application developers maps more easily to NoSQL databases than to relational databases. But in the past, this objection was brushed off. A manager might say “I don’t care if it takes a lot of work for you to map your objects to tables. Data must be stored in tables.”

And why must data be stored in tables? One reason would be consistency, but #3 above says you’ll have to relax your ideas of consistency if you want availability and partition tolerance. Another reason would be in order to share data with other applications, but #4 explains why that isn’t necessary. Still another reason would be that the relational model has a theoretical basis. But so do NoSQL databases, or as Erik Meijer calls them, CoSQL databases.

I’m not an advocate of SQL or NoSQL databases. Each has its place. A few years ago developers assumed that nearly all data was best stored in a relational database. That was a mistake, just as it would be a mistake now to assume that all data should now move to a non-relational database.


18 thoughts on “Rise and Fall of the Third Normal Form”

One issue with the 3rd normal form is that it is defined in terms of the functional dependencies present in your data. The trouble is that, in real life, it can be quite difficult to determine all of the functional dependencies. So it is rather a theoretical result: “assuming you have all the functional dependencies, then a 3rd normal form is a *necessary* (but not sufficient) condition.” How exciting is this?

How do you know that you have all the functional dependencies? It sounds easy in a textbook. And it was probably easy in the 70s… but today, with thousands of attributes and hundreds of tables, who is keeping track of the functional dependencies?
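The difficulty can be made concrete. Here is a minimal sketch, my own illustration with hypothetical attribute names, of empirically testing whether a functional dependency A → B holds in a given set of rows. Note the asymmetry: data can refute a claimed dependency, but can never establish that it holds in general.

```python
# Hypothetical sketch: check whether the functional dependency a -> b
# holds in a concrete dataset, i.e. no two rows agree on attribute a
# but differ on attribute b.

def fd_holds(rows, a, b):
    """rows: iterable of dicts mapping attribute names to values."""
    seen = {}
    for row in rows:
        key, val = row[a], row[b]
        if key in seen and seen[key] != val:
            return False  # same a-value maps to two different b-values
        seen[key] = val
    return True

employees = [
    {"emp": 1, "dept": "sales", "city": "Austin"},
    {"emp": 2, "dept": "sales", "city": "Austin"},
    {"emp": 3, "dept": "ops",   "city": "Denver"},
]

print(fd_holds(employees, "dept", "city"))  # True in this sample
print(fd_holds(employees + [{"emp": 4, "dept": "sales", "city": "Boston"}],
               "dept", "city"))             # False: dependency refuted
```

Even when such a check passes, a dependency that holds in today’s data may be an accident of the sample, which is exactly the commenter’s point.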

Another issue is that once you have NULL values, everything goes out the window. The NULL value is outside of your type domain, so it tends to disrupt everything (for example, (x=a) OR (x!=a) is not necessarily true once you have NULL values). Yet developers often feel the need for NULL values, in part because dealing with the constraints of the pure relational model can be too expensive in practice. Life is just complicated sometimes. Of course, some would argue that you can always do away with NULL values, but they are everywhere, which, maybe, tells you something important about the usability of the system.
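The NULL behavior described above is easy to demonstrate in any SQL engine. A minimal sketch using Python’s built-in sqlite3 module (the table and values are purely illustrative):

```python
# With NULLs, the predicate (x = 1 OR x <> 1) is NOT true for every row:
# comparisons against NULL evaluate to UNKNOWN, and the row is filtered out.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])

total = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]
matched = con.execute(
    "SELECT COUNT(*) FROM t WHERE x = 1 OR x <> 1").fetchone()[0]

print(total)    # 3
print(matched)  # 2 -- the NULL row satisfies neither x = 1 nor x <> 1
```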

Overall, what you get is that the 3rd normal form does not buy you much at all. And maintaining that 3rd normal form is expensive in practice. It is still a neat concept, but it is nowhere near as strong and useful as its advocates claim.

A related issue is that most of what we teach in database courses (including the 3rd normal form) can be considered “overrated” and much is “irrelevant in practice.” With A. Badia, I wrote a paper describing some of these issues.

I think that the problem is that when people think of databases or learn about databases, it’s SQL only. Ignoring the recent NoSQL movement, there have been plenty of non-SQL database products out there for years. For example, to the best of my knowledge, the Sabre travel reservation system has never used a SQL database. (They looked at the technology once and decided it couldn’t handle their transaction volume.)
The thing that I find annoying about NoSQL is that too many of the stories that I’ve read are along the lines of “MySQL was too slow, so we abandoned SQL,” which strikes me as being more about MySQL than SQL technology in general.

The problem with the NoSQL movement is that it tends to depict SQL as evil. SQL is here to stay for a long time. The amount of money invested in it and the amount of data it manages are just too large and valuable to throw away the advantages of SQL just because the next kid on the block believes NoSQL is cool.

You will often see people reimplementing, on top of their NoSQL databases, functionality already available in SQL, because they actually need it. In these cases, it is not clear whether NoSQL was the right choice in the first place.

I believe both types address different needs and should be viewed as such. It is not SQL against NoSQL; it is SQL and NoSQL, each where it fits the bill.

His relevant point (or what I took away) is that not everyone means the same thing when they say NoSQL. In particular, you can recognize that SQL is flawed as a DSL while relational databases may not carry most of those flaws.

Relational databases began life as a set of related mathematical proofs which demonstrated that data stored in such a database would be retrieved when queried. And businesses which did and do require that all data entered be accurately stored *or* *rejected* at that time, and be guaranteed retrievable later, did and still do today use an RDBMS. (We’re not talking about system failures here which require some type of backup and restore strategy, we’re just talking about the guarantees made by the system itself).

NoSQL systems do not themselves guarantee that data which is accepted by the system will be kept or returned (you have to design the system to handle that). And for many applications that’s fine. The internet protocol itself is designed to handle loss of information in a robust way (it just retransmits packets which were lost or whose receipt acknowledgment was lost). But one of the benefits of the RDBMS application model is that the programmers don’t have to worry about that data loss, whereas any robust NoSQL system must be designed with that limitation constantly in mind.

That’s the essential difference. And that describes when you need to use an RDBMS and when you can and should use some other type of datastore. And in fact, you could put a wrapper around a NoSQL datastore server which could make the same guarantees as an RDBMS, but then it would suffer from the same distributed scalability issues that plague RDBMSs.

As to the plausible-sounding reasons you repeated: #1 is wrong because RDBMSs weren’t only designed in the 1970s; they are still being designed today, and while the interfaces remain backward compatible, the back-end implementations and storage formats are not at all the same and are continually updated to reflect the current computing environment. #2 is just plain wrong: writing to an RDBMS is always more costly than reading, because so much validation and indexing occurs at write time precisely so that reading is guaranteed fast. #4 is entirely wrong too: the relational model requires that all interaction with the tables be handled through the server via APIs, to ensure the integrity of the database and its contained data is not in any way compromised.

To be fair to relational databases, there are a few other reasons why they’ve lasted so long.

First off, relational databases have good theory behind them. Relational calculus, relational algebra, and certain well-understood subsets of logic programming (e.g. Datalog) have all been shown to have equivalent power. In the past, whenever we’ve discovered sets of interesting formulations which are formally equivalent (e.g. DFAs and regular expressions, Turing-complete programming languages, the Curry-Howard isomorphism), we rightly consider those clusters to be interesting, and they usually turn out to be useful.

Secondly, SQL is an API, at least in theory. In practice, despite the existence of ANSI standards, nontrivial SQL still needs to be ported from one database vendor to another.

NoSQL, on the other hand, isn’t a thing, and I don’t think it really makes sense to speak of it as a thing. Well, not yet, anyway. It’s an umbrella term for a bunch of things which aren’t SQL, and do sort-of similar things. The lack of standards has both positive (can pick the technology for the application, fast-moving innovation) and negative (separate learning curve for each system, potential vendor lock-in) aspects.

Oh, and on Dan’s point, don’t forget WAIS/Z39.50. NoSQL databases do go back a long way.

It may be worth defining NoSQL. It does not mean “non SQL”. It rather means “Not only SQL”.

That makes NoSQL much more useful than no SQL. Indeed, NoSQL does not exclude the use of relational databases. Granted, NoSQL covers cases where relational databases aren’t needed. But it also covers use cases where a relational database is valuable but needs to be complemented by other data storage tools.

Still wet behind the ears, and clueless. There was a time when such folks were relegated to doing backups. Why are they allowed to make decisions? If the data matters to the developers and users, those that use any file storage (and, face it, NoSQL is just a primitive file store) have to build a TPM from scratch (or wire in CICS; yes, it’s available for most all OSs). If you don’t care about data integrity…

As to Not Only SQL: that expansion is disputed. Some of the insurrectionists use this as a sop, hoping the smart folks will be gulled. The point of NoSQL is perpetual employment of coders building siloed applications. Nothing more.

The reason Dr. Codd devised the RM wasn’t primarily storage footprint, rather it was data integrity (the need for joins on 1970s DASD was a sticking point). That benefit remains. Additionally, since the RM is, by definition, an orthogonal datastore, modification is trivial. With either hierarchical, network, or file datastores, you have to modify far more client side code. The Agile Crowd (not necessarily congruent with the NoSQL crowd, but close) take the shoot, ready, aim approach. For them, having to continually start from scratch is a good thing, since it means more LoC can be charged to the development. Less thinking time, more typing time. Since they get paid for typing, that’s what they maximize.

As to the notion that application data is necessarily application siloed: not really. The main problem facing development today is multiple client platforms. Putting data management in such various clients (desktop, laptop, tablet, phone), all written in different languages and OSs, is foolish. Put the data logic and data together; kind of like an object, ya know? While spinning rust per device has gotten to the TB level, throughput remains about where it’s been for a decade or so. Moreover, network speed will never approach board-level speed. With server machines being multi-processor/core/SSD with hundreds of gigs of memory, centralization in an intelligent way is the winner. All of the big players are now going the in-memory database route (through acquisition), as well. Now, there’s a platform that really benefits from a 5NF minimalist footprint.

So, the answer to the development problem is to keep data and control in one place, letting clients just get data from the server and sending user input back (with locally copied edits). This is an old paradigm: *nix, database, VT-100 (or 370/3270 for the big iron folks). Now, there’s a lot to like about that paradigm. NetWare (remember that?) started back to the future soon after the PC emerged. The host/terminal paradigm remains the most robust way to build an application.

At the device level, the deduplication folks, whether they understand it or not, are building on-the-fly/ad-hoc 5NF data stores. If storage were no longer an issue, dedupe wouldn’t even be a business.

(Aside)
— the Sabre travel reservation system’s never used a SQL database.

Sabre was first developed before Codd wrote his paper and didn’t use either of the database engines then available, IMS or IDMS. It was built on an IBM mainframe with a bespoke OS (the Airline Reservation Operating System, what’s now TPF), a bespoke file system, and a bespoke language (similar to BAL). It was designed to a minimal data footprint (as opposed to NoSQL), a minimal language, and minimal code. It was built to do a few things very fast with very few bytes. One might consider it the apotheosis (if one ignores the minimalism) of NoSQL; and it’s 50 years old. That’s how modern NoSQL is. There’s a wiki piece about it. It now runs on a mostly mainstream platform.

3NF = ‘implied’ consistency and data governance; however, I’ve yet to witness an organization that doesn’t grapple with these topics irrespective of its architecture. Most companies are a mixture of NoSQL, MySQL, LAMP, possibly Oracle, Teradata, etc., whatever the CTO or COO saw fit from personal experience. Full disclosure: I work for Vertica and joined because the product works, plain and simple. If you want a rigorous model, no problem. If you want 200 columns, no problem. If you use Hadoop, no problem.

I have come across too many developers who would prefer to play with the latest toy (EAV, object DB, NoSQL…) rather than learn to code/query databases correctly. Trying to get a developer to design beyond 3NF is difficult when they can do so much more with Java and a database layer that can run on any RDBMS platform. When I say more I do mean: more time spent coding, more code, more bugs… aka job security.

I do recognise there will always be a place for the alternatives, where they will be better than a well-designed RDBMS. But on the whole, the majority of us will not have the 100-petabyte database with millions of transactions per second. Instead we will have systems that need to help the business stay out of the red and be cost-effective.

It’s a shame, but this really goes to show that it’s about time people got back to computational science rather than pure speculation/utilitarianism. The luxuries of near-unlimited space and processing power seem to be making people lazy. There is a reason TNF is de rigueur, and that’s down to the fact that it works. The reason it works is because the theory is sound. Mathematically sound. Most people who use it may never know the theory behind it, much as those who use structured programming don’t tend to understand why structure is important. It is the corollaries of the theory which are important – provability and ease of testing for structure, and consistency and reliability for TNF. The real benefit of structure and TNF is more tangible than theoretical – you are more likely to end up with reliable and maintainable systems as a result.

The rise of NotOnlySQL is unsurprising in one sense; it circumvents a number of disciplines which are a problem when dealing with large volumes of data. It also sidesteps the issues of data analysis and conversion and places the responsibility of this on error handling rather than proper data management. This may allow developers to deal with the peculiarities of high volume or variant source data, but the danger is that this becomes popular when volume is no issue at all, or where source data is easily understood – in short, in the vast majority of situations. What then appears to be an “improvement” over TNF turns out to be a nightmare for software owners and a boon for lazy programmers who don’t mind being handsomely rewarded for endless, unnecessary reworking of code. My personal take is that if I found someone using NOSQL as a methodology where I worked, outside of document management or terabytes of data, I would follow the same steps as if I found them using GOTOs in their code – they would need to justify it in the most convincing terms, or find themselves somewhere else to work.

… and BTW although XEmacs is better than vi, vi is quicker and far more elegant than basic Emacs.

TNF is not restricted to RDBMS. It’s relevant for Network databases (http://en.wikipedia.org/wiki/Network_database) as well.
Besides, nowadays any self-respecting RDBMS offers the means necessary to work with normalized data in a denormalized way (instead-of triggers, materialized views, function-based indices etc.).
As a programmer I’ve got so many better things to do than to reinvent transactions with every use case. With that in mind I want ACID and Two-Phase-Commit. The relational model isn’t necessary. The ongoing use of IMS is a testament to that.
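As a rough illustration of working with normalized data in a denormalized way, here is a sketch using Python’s built-in sqlite3 module. SQLite lacks materialized views and function-based indices, so this shows only a plain view with an INSTEAD OF trigger; all table and column names are hypothetical:

```python
# Keep the data normalized, but expose a denormalized view; an INSTEAD OF
# trigger lets clients write through the view as if it were one wide table.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE emp  (id INTEGER PRIMARY KEY, name TEXT,
                   dept_id INTEGER REFERENCES dept(id));

CREATE VIEW emp_flat AS
  SELECT emp.name AS emp_name, dept.name AS dept_name
  FROM emp JOIN dept ON emp.dept_id = dept.id;

CREATE TRIGGER emp_flat_ins INSTEAD OF INSERT ON emp_flat
BEGIN
  INSERT OR IGNORE INTO dept (name) VALUES (NEW.dept_name);
  INSERT INTO emp (name, dept_id)
    SELECT NEW.emp_name, id FROM dept WHERE name = NEW.dept_name;
END;
""")

# Clients insert into the "wide table"; normalization happens underneath.
con.execute("INSERT INTO emp_flat VALUES ('Ada', 'research')")
con.execute("INSERT INTO emp_flat VALUES ('Grace', 'research')")

print(con.execute("SELECT COUNT(*) FROM dept").fetchone()[0])  # 1: no duplicate dept rows
print(con.execute("SELECT emp_name FROM emp_flat ORDER BY emp_name").fetchall())
```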

There are some other reasons relational databases are advantageous which weren’t covered in the blog post, although Robert Young alluded to them.

A big reason will probably not be well received, but here goes. Data integrity rules don’t just ‘save storage’, they also enforce data integrity in a declarative manner. This is quite beautiful. Depending on developers to handle all this in their code is an iffy proposition at best. Yes, rules change, but that’s life, life is hard. Better to handle this in a declarative manner that’s well defined in a standard way that’s well understood by many than have to rely on Joe Coder (who may have moved on to another company anyway) to modify his code.
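The declarative-integrity point can be sketched in a few lines. This example uses Python’s built-in sqlite3 module with a hypothetical schema; note that SQLite leaves foreign-key enforcement off unless you enable it:

```python
# Declarative integrity: the CHECK and FOREIGN KEY constraints below are
# enforced by the database itself, so no application code path can sneak
# bad data in.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
con.executescript("""
CREATE TABLE account (id INTEGER PRIMARY KEY,
                      balance INTEGER NOT NULL CHECK (balance >= 0));
CREATE TABLE payment (id INTEGER PRIMARY KEY,
                      account_id INTEGER NOT NULL REFERENCES account(id));
""")
con.execute("INSERT INTO account VALUES (1, 100)")

try:
    con.execute("INSERT INTO account VALUES (2, -5)")   # violates CHECK
except sqlite3.IntegrityError as e:
    print("rejected:", e)

try:
    con.execute("INSERT INTO payment VALUES (1, 999)")  # no such account
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```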

It’s the same story with queries. SQL is declarative, so, 9 times out of 10 (I’m being kind of generous) saying what you want instead of how you’re going to do it works better, because in a lot of cases you are using an optimizer to formulate (based on your specified needs) the best way to go about getting data. Yes, you still need to handle indexing and tuning (life is hard). But still you benefit from the work of some very big brains, hardened over years of use in a variety of industries. Yes, for a very specific query, you may be able to hand-code a faster way to grab that data, but the key here is flexibility (specifics evolve).
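The flexibility claim can be seen with SQLite’s EXPLAIN QUERY PLAN (via Python’s built-in sqlite3 module; the schema is hypothetical): the declarative query text never changes, yet the optimizer switches to a different access path on its own once an index exists.

```python
# Same query text before and after adding an index; only the plan changes.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
con.executemany("INSERT INTO orders (customer) VALUES (?)",
                [("c%d" % i,) for i in range(1000)])

query = "SELECT id FROM orders WHERE customer = 'c42'"

def plan():
    # Last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
    rows = con.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    return " ".join(r[-1] for r in rows)

before = plan()                                     # full table scan
con.execute("CREATE INDEX idx_customer ON orders(customer)")
after = plan()                                      # optimizer picks the index

print(before)
print(after)
```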

Some of my comments here are based on personal experience. I worked at a place that did not use a relational database, they used a ‘MultiValue’ database, which shares some properties with NoSQL approaches. But the main point is there was a very heavy dependency on developers to do the bookkeeping (number 4 in the blog post). Being on the reporting team, I saw that this led to numerous weird inconsistencies, bad data integrity, etc. Part of my job was straightening this out, so I couldn’t complain too much, it paid the bills. When the data was moved to a relational database, it was cleaner, it stayed cleaner, and extracting information from it was far simpler and faster.

I am not saying there is no place for NoSQL. Big players like Google and Amazon use it, and they need it. In that sort of environment, the trade-off makes sense. But as the great Ted Dziuba once said ‘You Are (Probably) Not Amazon.com’. So yes, it’s important to keep up with new technology, and yes, NoSQL is here to stay. What I see, though, is a dangerous tendency to avoid or minimize relational databases by people with more influence than solid knowledge of what the relational model is even about.

Another argument for SQL databases, which tend to be created and maintained by staff with expertise in dealing with data, is the prevention of duplicate data: people tend to forget that after data is stored somewhere, it most likely will need to be retrieved and used again. Yes, you can prevent duplication with NoSQL, but unless you truly understand data relationships and data duplication issues, what you build today will become your data quality nightmare next year.