Perhaps the single toughest question in all database technology is: Which different purposes can a single data store serve well? — or to phrase it more technically — Which different usage patterns can a single data store support efficiently? Ted Codd was on multiple sides of that issue, first suggesting that relational DBMS could do everything and then averring they could not. Mike Stonebraker too has been on multiple sides, first introducing universal DBMS attempts with Postgres and Illustra/Informix, then more recently suggesting the world needs 9 or so kinds of database technology. As for me — well, I agreed with Mike both times.

Since this is MUCH too big a subject for a single blog post, what I’ll do in this one is simply race through some background material. To a first approximation, this whole discussion is mainly about data layouts — but only if we interpret that concept broadly enough to comprise:

Every level of storage (disk, RAM, etc.).

Indexes, aggregates and raw data alike.

To date, nobody has ever discovered a data layout that is efficient for all usage patterns. As a general rule, simpler data layouts are often faster to write, while fancier ones can boost query performance. Specific tradeoffs include, but hardly are limited to:

Big blocks of data compress better, and can also be faster to retrieve than a number of smaller blocks holding the same amount of data. Small blocks of data can be less wasteful to write. And different kinds of storage have different minimum block sizes.

Operating on compressed data offers multiple significant efficiencies. But you have to spend cycles (de)compressing it, and it’s only practical for some compression schemes.

Fixed-length tabular records can let you compute addresses rather than looking them up in indexes. Yay! But they also waste space.

Pointers are wonderfully efficient for some queries, at least if you’re not using spinning disk. But they can create considerable overhead to write and update.

Indexes, materialized views, etc. speed query performance, but can be costly to write and maintain.

Storing something as a BLOB (Binary Large OBject), key-value payload, etc. is super-fast — but if you want to look at it, you usually have to pay for retrieving the whole thing.

What’s more, different data layouts can have different implications for logging, locking, replication, backup and more.
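The fixed-length-record tradeoff above can be made concrete. In a minimal sketch (the record layout, field names, and sizes are invented for illustration), a row's byte offset is plain arithmetic rather than an index lookup:

```python
import struct

# Hypothetical fixed-length record layout (invented for illustration):
# 4-byte int id, 8-byte float balance, 20-byte padded name = 32 bytes/record.
RECORD = struct.Struct("<id20s")

def write_record(buf, row_num, rec_id, balance, name):
    offset = row_num * RECORD.size              # address is computed, not looked up
    buf[offset:offset + RECORD.size] = RECORD.pack(rec_id, balance, name.encode())

def read_record(buf, row_num):
    offset = row_num * RECORD.size              # O(1) access by row number
    rec_id, balance, raw_name = RECORD.unpack_from(buf, offset)
    return rec_id, balance, raw_name.rstrip(b"\0").decode()

table = bytearray(RECORD.size * 100)            # preallocated space for 100 rows
write_record(table, 42, 7, 123.5, "alice")      # note the padding wasted on "name"
print(read_record(table, 42))                   # -> (7, 123.5, 'alice')
```

The "Yay!" and the "waste space" both show up here: row 42 is found with one multiplication, but the 5-byte name "alice" still occupies its full 20-byte slot.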

So what would happen if somebody tried to bundle all conceivable functionality into a single DBMS, with a plan to optimize the layout of any particular part of the database as appropriate? I think the outcome would be tears – for the development effort would be huge, while the benefits would be scanty. The most optimistic cost estimates could run in the 100s of millions of dollars, with more realistic ones adding a further order of magnitude. But no matter what the investment, the architects would be on the horns of a nasty dilemma:

If there’s much commonality among the component DBMS, each one would be sub-optimal.

If there’s little commonality among them, then there’s also little benefit to the combination.

Adding insult to injury, all the generality would make it hard to select optimum hardware for this glorious DBMS — unless, of course, a whole other level of development effort made it work well across very heterogeneous clusters.

Less megalomaniacally, there have been many attempts to combine two or more alternate data layouts in a single DBMS, with varying degrees of success. In the relational-first world:

Non-relational DBMS commonly have indexes that at least support relational-like SELECTs. JOINs can be more problematic, but MarkLogic finally has them. Tokutek even offers a 3rd-party indexing option for MongoDB.

Hadoop is growing into what is in effect a family of DBMS and other data stores — generic HDFS, HBase, generic Hive, Impala, and so on. At the moment, however, none of them is very mature. BDAS/Spark/Shark ups the ante further, but of course that’s less mature yet.

Comments

Complexity is a fundamental issue in thinking about the optimal way to manage data. Assuming you *could* build the one singing and dancing DBMS, the design trade-offs current stores make would turn into knobs and dials within the DBMS, which would in turn make use by humans exceedingly difficult. At least some of the current revolt against RDBMS is a reaction to the fact that they are already quite complicated.

On the other hand having multiple stores creates an integration problem, which has its own complexities. The interesting question is how many store types you can afford to integrate before your system collapses under its own weight. With the current state of the art the answer is “not many.” Maybe it’s time to get serious at the computer science level about improving that.

aaron on
February 21st, 2013 2:55 pm

The hardest part technically is around very large numbers of updating users (distributed locks, etc.) and large data crossing servers (distributed queries or contention against a shared store).

Point of fact, this is a solved problem for most companies’ usage – a few tens of TB, a few million IOPS. It gets problematic for only the top 1% of companies, and likewise the data would shrink by 99% if consolidated.

The real issues here (technically) are governance and logical vs. physical models. To have a common store requires the ability to guarantee priority where appropriate, which is often impossible to predict, but can be alleviated by allocating guaranteed resources. The real hard problem is that physical design is generally bad for shared use, and we have not progressed toward a logical design that compiles to an efficient physical one; this is a solvable and fun problem, but a hard one to productize.

The real issues here (actually) are politics and how the human animal interacts. Database-using apps are models of the *politics* of the users, and DBMS data and use models the corporate power structure. There are a variety of reasons for this — sometimes acquisitions or departmental apps scaling, sometimes divide and conquer, sometimes defensive borders, etc.

The byzantine proliferation of databases is politically completely justified. Large database apps tend not to work well. Big db apps that are over a decade old are obsolete; ones that are under a decade old are incomplete and not fully integrated. The technical failure at creating logical models that are completely independent from physical ones means that database implementations reflect only a current view of issues. We’re just not wired to do big coordination well; look at Congress.

Hi Curt! Indeed we do agree. What do you think about trying to get integration technology to work better? There has been tremendous progress in database management over the last decade but it seems as if integration technology (e.g., messaging and replication) is still stuck in the 90s.


aaron on
February 22nd, 2013 6:02 am

1% – number of companies whose relational data is too big for consolidation to a single db currently available (this is not cumulative.)
99% – likely reduction in total relational data size if all databases were consolidated (e.g., if your total data is a petabyte, you likely really have 10tb of actual data)


(apologies for the grammar above) I’m not referring to non-prod data here, db virtualization/snapshots, etc. This is what I see in the field. Data in production is a hundred parodies of the core truth.

Consider a typical company, perhaps 20K staff and 1M item sales a day. Likely each staffer generates a few K of data a day, and each sale is perhaps a fraction of 1K. The company likely generates a fraction of 1G a day of actual data, only slightly augmented with external enrichment.
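That back-of-envelope estimate checks out arithmetically. In the sketch below, the exact per-staffer and per-sale byte counts are assumptions, chosen only to match "a few K" and "a fraction of 1K" above:

```python
# Rough figures for the hypothetical company described above; the exact
# byte counts are assumptions, not measurements.
staff = 20_000
sales_per_day = 1_000_000
bytes_per_staffer = 3 * 1024        # "a few K" of data per staffer per day
bytes_per_sale = 512                # "a fraction of 1K" per item sale

daily_bytes = staff * bytes_per_staffer + sales_per_day * bytes_per_sale
print(f"{daily_bytes / 2**30:.2f} GB/day")   # ~0.53 GB: indeed a fraction of 1G
```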

That company is likely recording the bulk of the data, sale item and fulfillment workflow journal entries, literally one hundred times. There are reasons for this. Sales may need a compensation-centric view of the data and keep a copy of the sales, often with a crumb trail of dbms copies used in ETL. Operations may have real time replication to a message bus and to its dbms. Accounting, finance, marketing, etc., etc….

That is how this poor example company ends up with the data amplified. We can call this restating the data repeatedly with alternate semantics. That 1G of actual data generated (agreed – this is not information) appears as 100G of database size (I’m ignoring nontraditional RDBMS data, such as logs and sensors.)

This is a truth that practitioners will vouch for (please chime in!). And this is the real justification that can be made for the universal database. A huge amount of any company’s effort goes into attempts to reconcile these conflicting versions.

The points I was making are that:
– It is technically feasible to have a universal database, but it is organizationally really hard to do so.
– Tech vendors are not helping the situation at all, and are not creating any technology that facilitates unification, because it is easier not to. Established vendors increase sales with the data amplification, and new vendors with point solutions are not able to attack problems this big.

If Aaron is saying that “core” data is typically much smaller than representations of data + search structures & derived searches, I may agree, especially with nosql approaches that have as their core strategy the enumeration of all possible searches that could be done on a dataset. Even in a highly normalized rdbms schema, indexes, keys, xact logs, and row format overhead will typically be several times bigger than pure “payload” represented as bytes of simple text or compactly represented numbers.

Once upon a time, I developed an extremely efficient small-device database (SQL! in 16K of RAM). Even in that hugely constrained environment, the ratio of “payload” to representational and search storage overhead was about 70% with a typical number of indexes.

And as companies grow, different groups end up needing different schemas for the various ways they interact with a given dataset. Nothing evil there, and no magic software bullet is going to “fix” it.


aaron on
February 27th, 2013 3:15 pm

Try a thought experiment. Take a company you know well and compute how much *data* they actually generate. It should correlate tightly to the large activities they do. Then ask them how much new database data is generated daily. You will see the 100x. Please try this.

A database typically uses ~3X more storage than source data. Compression is offset with online backups, sort areas, indices, etc. Double that for disaster recovery, double again (sometimes) for non-production – and you get perhaps a 10x factor.

There is something else going on. Key data such as sales, which is the bulk of the new data, is stored many times at many grains. This accounts for the second 10x amplification of the data. Companies routinely amplify critical data a lot.
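The two 10x factors compound roughly as follows (all factors are the rough estimates from the comments above, not measurements):

```python
source = 1.0                       # normalize daily source data to 1 unit

# First ~10x: per-database storage overhead (rough estimates).
db_overhead = 3                    # indexes, online backups, sort areas, etc.
disaster_recovery = 2              # mirrored DR copy
non_production = 2                 # dev/test copies (sometimes)
storage_factor = db_overhead * disaster_recovery * non_production  # 12, "perhaps 10x"

# Second ~10x: the same key data restated at many grains across departments.
restatements = 10

total = source * storage_factor * restatements
print(total)                       # ~120 units stored per unit generated, i.e. ~100x
```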

The reason I’m harping on this is that this hyperamplification is the true question to ask in thinking about a single database. What is the meaning of the data as it moves through the various workflows, and can it be represented in a unified way?

There are historical reasons for data to be represented repeatedly in mutated forms. For example, there may be divisions or business units with different frequencies of data need. There is often historical software that is batch-processing in nature and doesn’t deal with real-time data. There are vendor packages that shred and duplicate data. And there is a lot of data movement, with trails of replicated data between all of these.

Much of this is a legacy of a time where data would not be processed interactively and where it was unreasonable to expect the millions of IOPS needed to support all needs to come from a single DBMS.

To me, this is the interesting question in thinking about a single unified database.

Larry Dooley on
March 15th, 2013 4:26 pm

Curt

If I look at what Oracle is doing, they are looking to engineered hardware solutions to get past this: Exadata, Exalytics, and I’m pretty sure more will be coming down the road. A lot of stuff gets stuffed into memory in these solutions. They also cost a good deal.

Although for smaller installations, Oracle ODA with ZFS storage is cost-efficient and fast.

Don’t the benefits outweigh the costs? Divide and conquer. Time is money? Etc. With a structured worldwide list of areas of, e.g., research subjects, known duplication and unresearched areas would be time-saving. I think recently the BBC had a report about the NHS currently collating medical information. Is there such a list of research at universities, businesses, national health service people, etc.?

There are many “NoSQL” data models. I definitely wouldn’t advise a doc or name-value-pair database for this sort of thing. One of the graph databases may be good if you need extreme performance, as representing many-to-many relationships and doing fast…