Sunday, February 10, 2013

Those Who Don't Know the Past ...

It's been long my contention that a core problem of the database management field is poor foundation knowledge, in which I include familiarity with its history. Consider The Rise and Fall of the Third Normal Form. The title signals a rich debunking target. John D. Cook writes:

The ideas for relational databases were worked out in the 1970’s and the first commercial implementations appeared around 1980. By the 1990’s relational databases were the dominant way to store data. There were some non-relational databases in use, but these were not popular. Hierarchical databases seemed quaint, clinging to pre-relational approaches that had been deemed inferior by the march of progress. Object databases just seemed weird.

How many times do I need to reiterate that industry's persistent confusion of SQL with the relational model does not make SQL DBMSs relational?

The claim that SQL DBMSs became dominant because non-relational DBMSs were just "quaint" or object DBMSs "weird" is a rather cavalier characterization of their flaws. The reality is that hierarchic and CODASYL DBMSs were so prohibitively complex and rigid, that IT organizations outright could not cope with them anymore. That is why even SQL, despite its poor relational fidelity and language design, readily proved superior.

Now the tables have turned. Relational databases are still dominant, but all the cool kids are using NoSQL databases. A few years ago the implicit assumption was that nearly all data lived in a relational database, now you hear statements such as "Relational databases are still the best approach for some kinds of projects", implying that such projects are a small minority. Some of the praise for NoSQL databases is hype, but some is deserved: there are numerous successful applications using NoSQL databases because they need to, not because they want to be cool.

Points arising:

1. The "cool kids" were not around when the IT industry struggled with the horrendous problems of non-relational DBMSs and they are reinventing, albeit with relabeling, all of them. Worse, even SQL implementations increasingly bring back old flaws. Here's what a friend developer emailed me about MySQL not recovering unused space:

The official solution is dump the tables, drop them and recreate them. This is even more fun when you find out how slow standard restores can be, or how slow data copying is. And people pay good money for support contracts based on this.

2. As I have demonstrated over and over again in my writings and teaching, trite, meaningless statements such as "good for some projects, but not for others", or "use the proper tool for the job" are indications of poor foundation knowledge (what a DBMS and databases are, what their functions are relative to applications, and the significance and importance of and requirements from, a data model--by which I don't mean a business model or a logical model--are. What do you exactly learn from the following?

I’m not an advocate of SQL or NoSQL databases. Each has its place. A few years ago developers assumed that nearly all data was best stored in a relational database. That was a mistake, just as it would be a mistake now to assume that all data should now move to a non-relational database.

3. The notion that SQL DBMSs are now in a minority is bunk. It takes decades for a technology to overcome resistance, be widely adopted, used and mastered. It took roughly a decade for SQL to be implemented and four decades to dominate, despite the horrible problems of preceding technologies. And SQL implementations are far from conferring all the practical advantages of relational technology. Marketing fads notwithstanding, can you name a current technology that comes even close?

So why the shift to NoSQL, and why now? I’m not an expert in this area, but I’ll repeat some of the explanations that I’ve heard that sound plausible.

Relational databases were designed in an era of small, expensive storage media. They were designed to conserve a resource that is now abundant. Non-relational databases may be less conservative with storage.

Aside from the constant logical-physical confusion, this is sort of backwards. In fact, one of the main initial objections to SQL DBMSs, that slowed their adoption, was that they were resource hogs. It is true that in time hardware became cheaper and SQL DBMSs became more efficient, but efficiency was also considerably predicated on whatever little relational fidelity they had: their performance was optimizable by the vendors at the database level, relieving application developers from a huge burden, including dependence on physical storage and access methods. The human productivity and improvements that SQL DBMSs brought were so significant relative to what preceded them, that even their initial higher resource consumption could not prevent their adoption over time.

Relational databases were designed for usage scenarios that not all applications follow. In particular, they were designed to make writing easier than reading. But its not uncommon for a web application to do 10,000 reads for every write.

This is backwards too. In fact, SQL's name in itself is due to the fact that it was initially a query-only language, without any integrity features important for updates. The foremost objective was to make databases more accessible to end-users.

It may well be that much of Web data is not important enough to require consistency. As one of my readers commented:

I think NoSQL has a place in the web because most of the information "web developers" want to store is useless and forgettable crap anyway. (I recall a .com I consulted at once who tried to sell my consulting firm its "user database" - I was able to point out to the owner that I could easily generate a mountainous list of duplicates of Mickey Mouse and Adolf Hitler myself without needing their "data.") When the people that employ these "developers" later find out that the "data" they have paid for is less substantial than wet toilet-tissue they then have an opportunity to learn the only way most humans seem capable of learning - by an abrasive experience. And the data itself can be (and probably would be even if it was queryable) safely flushed into the bitbucket. (I won't cry for the missed chances of marketers to sell more dirty ice to brain-damaged Eskimos.)

But not all Web data is unimportant--not everybody is Facebook, Google, or Twitter. And as all the hoopla over analytics and BI attests, there is now much focus on extracting useful information from that data. A variety of ad-hoc, proprietary NoSQL niche products proliferate, promising to be magic wands that will enable information extraction from data that was never structured with consideration of this possible need. There is no better evidence of the delusional aspects of the IT industry than (1) XML and the notion that by turning HTML tags into "semantic" tags, the Web will become usefully structured (2) BigData and the notion that "unstructured" data can be readily and usefully mined. What good is mining inconsistent data?

Unfortunately, the industry persists the pattern of extending technologies invented for different purposes e.g. XML (data transfer) and object-orientation (programming) to data management.

The scalability of relational database transactions is fundamentally limited by Brewer’s CAP theorem. It says that you can’t have consistency, availability and partition tolerance all in one distributed system. You have to pick two out of three.

A data model is purely logical and has nothing to do with scalability, which is part of te physical implementation. If SQL products do not scale, that is not because they are relational (which they really are not). If the promotion is accurate--it never is--then maybe there is finally some evidence in support of this argument.

More importantly, if consistency is not important, there is nothing even in SQL products that prevent its relaxation. What TRDBMSs (and I don't mean SQL) can do and NoSQL ones cannot is to guarantee consistency when necessary and relax it when it's not. Why trade down instead up?

The object oriented worldview of most application developers maps more easily to NoSQL databases than to relational databases. But in the past, this objection was brushed off. A manager might say "I don’t care if it takes a lot of work for you to map your objects to tables. Data must be stored in tables."

If application developers have an OO worldview, that's because OO is an application programming paradigm. But that is exactly its problem, the view is application-specific, not of a shared data resource. The paradigm for the latter is a data model, which OO lacks. This is why extending it to databases is akin to turning DBMSs into applications, defeating the whole purpose: a DBMS is not an application, it was invented precisely to relieve applications from functions that need to be centralized and serve multiple applications.

But here's the $64,000 question: What exactly does all this have to do with 3NF???