Wednesday, April 16, 2014

Why The Clock is Ticking for MongoDB

Last month, ZDNet published an interview with MongoDB CEO Max Schireson which took the position that document databases, such as MongoDB, are better suited to today's applications than traditional relational databases; the title of the article implies that the days of relational databases are numbered. But it is not, as Schireson would have us believe, that the relational database community is ignorant of, or has not tried, the design paradigms he advocates; rather, those paradigms have been tried and found, in many cases, to be anti-patterns. Certainly, there are some cases in which the schemaless design pattern that is perhaps MongoDB's most distinctive feature is just the right tool for the job, but it is misleading to suggest that such designs require a document store. Relational databases can also handle such workloads, and their capabilities in this area are improving rapidly.

Let's look at his example of entering an order into a database. In this example, it is postulated that the order is split between 150 different relational tables, including an order header table, an order line table, an address information table and, apparently, 147 others. Relational databases do encourage users to break up data across multiple tables in this way, a process called normalization. But not for no reason. Storing every order in one large document may be ideal if all access will be strictly by order number, but this is rarely the case. When a user wants to run a report on all orders of one particular product, an index on the order line table can be used to efficiently find and retrieve just those order lines. If all order data is lumped together, the user will be forced to retrieve the entirety of each order that contains a relevant order line - or perhaps even to scan the entire database and examine every order to see whether it contains a relevant order line.
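To make the point concrete, here is a minimal sketch of a normalized order schema with a secondary index on the order line table. The schema and table names are hypothetical, and SQLite (via Python's standard library) stands in for a full relational database; the principle is the same in PostgreSQL.

```python
import sqlite3

# A hypothetical normalized order schema: an order header table and an
# order line table, with an index on product_id for product-based queries.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE order_lines (
        order_id   INTEGER REFERENCES orders(order_id),
        product_id INTEGER,
        quantity   INTEGER
    );
    CREATE INDEX idx_lines_product ON order_lines(product_id);
""")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "alice"), (2, "bob")])
conn.executemany("INSERT INTO order_lines VALUES (?, ?, ?)",
                 [(1, 100, 2), (1, 200, 1), (2, 100, 5)])

# The index lets us fetch just the order lines for product 100,
# without retrieving either order in its entirety.
rows = conn.execute(
    "SELECT order_id, quantity FROM order_lines WHERE product_id = ?",
    (100,)).fetchall()
print(rows)  # [(1, 2), (2, 5)]
```

With each order stored as one opaque document instead, answering the same question would mean fetching (or scanning) every order that might contain a matching line.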

Of course, like any good thing, normalization can be overdone. Few database schemas are so heavily normalized that a simple order entry task touches 150 tables, but if yours is, you may well wish to consider denormalizing. But you need not go so far as to denormalize completely, as Schireson appears to advocate. Instead, you should determine what degree of normalization will best meet your current and future business needs. Seasoned relational database professionals understand the trade-offs between normalization and denormalization, and can help companies make good decisions about when and how to normalize. Schireson appears not to understand this trade-off, or else understands it but advocates for total denormalization anyway because that is the only paradigm his product can support.

Schireson also mentions another advantage of document stores: schema flexibility. Of course, he again ignores the possible advantages, for some users, of a fixed schema, such as better validity checking. But more importantly, he ignores the fact that relational databases such as PostgreSQL have had similar capabilities since before MongoDB existed. PostgreSQL's hstore, which provides the ability to store and index collections of key-value pairs in a fashion similar to what MongoDB provides, was first released in December of 2006, the year before MongoDB development began. True JSON capabilities were added to the PostgreSQL core as part of the 9.2 release, which went GA in September of 2012. The 9.4 release, expected later this year, will greatly expand those capabilities. In today's era of rapid innovation, any database product whose market advantage is based on the format in which it is able to store data will not retain that advantage for very long.
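As a sketch of the schema-flexibility point, the following stores free-form JSON documents in an ordinary relational table and queries inside them. It uses SQLite's JSON1 functions via Python's standard library purely for illustration; PostgreSQL offers the analogous json/jsonb types and operators (with different syntax), and the column and key names here are made up.

```python
import sqlite3

# "Schemaless" storage inside a relational table: the body column holds
# arbitrary JSON, with no fixed schema declared for its contents.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO docs (body) VALUES (?)", [
    ('{"type": "order", "customer": "alice", "total": 40}',),
    ('{"type": "ticket", "severity": "high"}',),
])

# Query on fields inside the documents, just as a document store would.
rows = conn.execute("""
    SELECT json_extract(body, '$.customer')
    FROM docs
    WHERE json_extract(body, '$.type') = 'order'
""").fetchall()
print(rows)  # [('alice',)]
```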

The advantages of relational databases are not so easily emulated. Relational databases allow complex transactions that affect multiple records, synchronous commit so that each transaction is guaranteed to be durable on disk before the client is notified that the commit has succeeded, support not only for JSON but also for other complex datatypes such as XML and geospatial data types, mature query optimizers that not only support combining data from multiple indexes (which Schireson mentions as a forthcoming feature; PostgreSQL added that capability in 2005) but also the ability to combine data from multiple tables via joins. While some MongoDB users may not require any of these features, many will, and I believe that MongoDB will find that adding these features to a product that natively supports JSON is much harder than adding JSON support to a product that already possesses these - and many other - enterprise features.
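The first of those advantages, multi-record transactions, is worth a small sketch. In the hypothetical transfer below, both updates commit together or neither does; at the time of writing, MongoDB guaranteed atomicity only within a single document. SQLite again stands in for a full relational database.

```python
import sqlite3

# A multi-record transaction: debit one account and credit another,
# rolling both changes back if an invariant is violated.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    conn.execute("UPDATE accounts SET balance = balance - 150 "
                 "WHERE name = 'alice'")
    (bal,) = conn.execute(
        "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()
    if bal < 0:
        # The invariant is checked in application code here; a CHECK
        # constraint could enforce it declaratively instead.
        raise ValueError("insufficient funds")
    conn.execute("UPDATE accounts SET balance = balance + 150 "
                 "WHERE name = 'bob'")
    conn.commit()
except ValueError:
    conn.rollback()  # neither update survives

balances = conn.execute(
    "SELECT balance FROM accounts ORDER BY name").fetchall()
print(balances)  # [(100,), (0,)]
```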

This is not to deny that MongoDB offers some compelling advantages. Many users have found that they can get up and running on MongoDB very quickly, an area where PostgreSQL and other relational databases have traditionally struggled. And the sharding capabilities of MongoDB are clearly useful to some users, but the process of scaling out is not as transparent as the documentation might imply, and sometimes goes badly wrong. In the end, a large, complex system requiring continuous uptime typically requires that the application developer and DBA work together and have very specific knowledge of which data is stored where. Auto-sharding may succeed in hiding the complexity from the user in some cases, but it does not eliminate it.

In short, I don't expect MongoDB, or any similar product, to spell the end of the relational database. Rather, I think it's likely that PostgreSQL and other database engines will continue to innovate, providing many of the features that have caught the imagination of developers who are now choosing NoSQL engines; and that NoSQL systems will struggle to add features which relational databases have had for years.

66 comments:

I fully agree with your point of view. The whole NoSQL vs. relational debate reminds me much of the "relational" vs. "object oriented" debate in the late 90s, when every vendor of an object-oriented database claimed that relational databases would be gone in 10 years.

I think every tool has a problem domain where it fits and also has many domains where it does not make sense to use it. Claiming that NoSQL (or document-oriented systems) can cope with every problem domain is just as wrong as claiming that relational databases can cope with every problem domain (though I do think there are more domains that can make use of a relational DB than there are for document DBs).

Consistently, I have failed to see what a "document store" has that another common technology called a "file system" does not. Save files on file systems, structured data in a database. How is this discussion even a thing?

Actually, there are several things it can give you (IMHO): first, you get parsing for free (a document is always a map of keys to values, a file is a binary blob); second, it gives you secondary indexing (depending on the db); third, it gives you easy sharding (which you can do with filesystems, but it may be a pain, and you sometimes end up adding a lot of complexity, as with HDFS).

Ben, the advantage of a document store is the ability to create indexed datasets (e.g. with map/reduce) and to query them very fast. Another point is the possibility to replicate/shard data over many nodes with conflict detection. You can do all this file-based too, but it is very hard to get it done correctly.

Hi Robert, a single "order" document, as described in your example, can have multiple secondary indexes (similar to Postgres) in MongoDB. Using an index to search on Order location, type, amount, etc. would all be accomplished with additional secondary indexes. I think you are confusing MongoDB with a key/value store. Postgres is a better option for truly relational data, MongoDB will be a great option for a lot of other use cases.

It's very useful to be able to use an index to find all the relevant orders. But unless you can compute the query result entirely from the index, you'll still need to retrieve the *entirety* of each relevant order rather than only the specific parts of that order that are needed. That potentially means accessing a lot more data to compute the same query result.

Projection means that the whole document doesn't have to be sent from the server to the client; but it doesn't reduce the number of data pages that have to be scanned on the server side. The only way to do that is if the data is stored separately somewhere, either in a side table, or an index, or some similar data structure.
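The "index as side structure" case above can be sketched concretely. In the hypothetical example below, an index on (product_id, quantity) covers the query, so the database can answer it from the index alone without touching the table's data pages; SQLite (via Python's standard library) is used as a stand-in, and its query plan output reports this explicitly.

```python
import sqlite3

# An index on (product_id, quantity) covers a query that filters on
# product_id and returns only quantity: no table pages need be read.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE order_lines
                (order_id INTEGER, product_id INTEGER, quantity INTEGER)""")
conn.execute("CREATE INDEX idx_cover ON order_lines(product_id, quantity)")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT quantity FROM order_lines WHERE product_id = 7
""").fetchall()
# The plan's detail column mentions a COVERING INDEX, e.g.
# "SEARCH order_lines USING COVERING INDEX idx_cover (product_id=?)"
print(plan[0][3])
```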

I used MongoDB for two years on several projects. In each project the schema started out denormalized, and as the requirements grew, the schema evolved to be increasingly normalized, until we ended up with what we would have had in a relational database anyway. The main disadvantage of denormalized data is that ad-hoc and unanticipated queries are hard, which is especially damaging in a start-up situation where the needs of the app are not well understood ahead of time.

In MongoDB's defense, I will say that setting up a replica is trivial. Just launch it and seconds later the data has been fully replicated. It's magical.

I too agree. I've been using a NoSQL database in production. I think it has its place, but in many cases it adds unnecessary complexity. You really make a trade-off of scalable performance vs. complexity. To me, it feels like with NoSQL a lot more of the database problem scope leaks its way into your code.

Mongo is certainly easy to set up and something like sharding is easier than if you had to do it in a relational db. You pay the price for that ease later when you are maintaining it, in my experience.

I don't want to be too critical of anyone who picks MongoDB because it, and its replication in particular, are easy to set up. Those are legitimate advantages, and we shouldn't dismiss them. Instead, we should try to make relational databases (and PostgreSQL in particular) better in those areas. What I really object to about the original article is not that it highlights the strengths that MongoDB does have (because that's totally legitimate) but that it attempts to characterize the design patterns that have served relational database developers very well for many decades as, essentially, silly, because they are not. Unfortunately, I think that as relational database professionals, we too often make the mistake of assuming that it's obvious that these arguments are specious rather than taking the time to explain *why* they are specious.

Right, many large websites are written in PHP because PHP is easy to set up. If setting up PG takes an extra hour (for an inexperienced dev), that's a lot of start-ups that will pick something easier instead, then get stuck with it. I don't know how much of that initial hour (authentication and schema) is essential and how much is accidental. PG has a safety-first policy that gets in the way here. You can imagine a default config that lets anyone connect, and the act of connecting to a database creates it (like MongoDB). You could go even further and automatically create tables on single-column JSON inserts. It's a horrifying default to think about, but don't underestimate the value of being up-and-running quickly when doodling around. (And large websites often start as doodles.)

Robert, when you say: "In today's era of rapid innovation, any database product whose market advantage is based on the format in which it is able to store data will not retain that advantage for very long.", I can see your point, but I don't fully agree with you.

Graph databases, for instance, seem like a very good alternative to relational dbs in order to work with data with many relationship levels.

But I still think that 80% or more of the use cases we have nowadays are easily handled by a relational db. And for some exceptional cases, you must look carefully for which other alternative better fits your need, not relying on "my product works better for every scenario".

Discrediting NoSQL database systems because they don't compare apples to apples to a relational model is like discrediting Microsoft Word as a valid spreadsheet creation tool.

I can just imagine the review. "We tried to convert all of our worksheets into this trendy new 'word document' and we found that fundamental things were a real problem to do, like referencing a calculated cell. I mean it was great for things that didn't have numbers... and required fancy formatting, but eventually ALL our documents had numbers and then... Finally, after much expense, we decided to get rid of the 'fluff' and go back to what we are comfortable with... Microsoft Excel. At least there we can make a table if we want to and do our basic calculations."

The technologies are not mutually exclusive. This problem was the very first thing I remember about what the MongoDB folks said at the Chicago conference this year. I think it went something like this: "If you need the following things from your database: ACID, transactions, defined entities and relationships, normalized data; you are in the wrong place. We are not here to replace relational database systems, they still have a purpose. We are here to fill a need for an accessible unstructured data store that can scale."

But Google drive is cloud based. It was built for the cloud from the very beginning. Doesn't matter if older, more mature products are also providing that functionality now. Google Drive was built with the intent of working on the cloud. Hence it is the perfect tool for cloud work.

I don't think that the technologies are mutually exclusive; many shops are using both. But there's clearly some overlap, or else the CEO of MongoDB wouldn't be publishing a ZDNet article predicting the demise of the relational database at the hands of MongoDB.

Robert, if the article implied all web development is better suited here, then I'll bring the torches to the bonfire.

I think what was meant, though, by much of 'today's' web apps is those using html5, jquery, etc., where you read and write directly to an arbitrary store with JSON, for things like blog posts, social media replies like this one, status updates, and other simple-relationship data applications like perhaps a company's internal ECM repository, where docs are coming in unstructured like a cruise ship on a beach and need to be queryable in unpredictable ways later. NoSQL is hands down the better option.

And are these "blog posts, social media replies like this one, status updates, and other simple-relationship data applications like perhaps a company's internal ECM repository, where docs are coming in unstructured like a cruise ship on a beach and need to be queryable in unpredictable ways later" the only kind of data, or even the majority of data today? Nope! Then why did the original article make statements like the end of relational databases is near?

Most times people in RDBMS world do not react to such stupid claims by NoSQL companies, but sometimes it becomes necessary to remind people why exactly these claims are stupid. That's what Robert's article is trying to do.

I think the problem with the SQL-clingers is that they don't look past the database when it comes to application development. They model the data looking only at the entity's structure, not at how it will actually be fetched and stored.

NoSQL allows the application to control the schema and organize the storage in the same structure in which it is consumed. This eliminates ORMs and code generators that translate the data into a usable format.

The author likes to talk about how you may want to pull ad-hoc reports, etc., but with any high-volume system you can't report off of the transaction system anyway without bringing down your client sites, and you end up having to create complicated ETLs to push the data into a separate reporting environment.

I'd rather say that, compared to "SQL-clingers" who know how to extract data from a database, "NoSQL-clingers" don't know how to work with databases, so they use .txt files to store their data. Now they have a system that organizes their .txt file collection for easy retrieval and is easy to set up for someone not even able to learn something as simple as SQL.

Often, I think the SQL-clingers, as you name them, have the wider vision: not just the single app talking to the single db, but the entire business process being supported by that app and data.

In the real world, you will always run into the need to do additional processes to the data that are not the proper domain of the "primary" app, be it reports or maintenance operations. Yes, reporting off the live system can be a performance problem, but it's guaranteed to be one in an environment that does many of its data manipulation functions only in the app.

Your talk of ORMs and code generators tells me that you're actually in the camp of developers who don't think past their specific app. The data and models are not there to serve the app; they are there to model the reality of the business. The app is there to serve the data (in more ways than one).

"In the real world, you will always run into the need to do additional processes to the data that are not the proper domain of the "primary" app, be it reports or maintenance operations."

But the ways that whole systems are written keeps evolving. It certainly used to be the case that what you call the "primary" app would rarely be the only consumer of the data in a database. But with the rise of SOA, the idea is that all data access goes via the service, so the service is the only app to directly access the database.

I'm not a zealous SOA fan, so I don't mean to debate the merits of that approach, just to mention that often enough these days the assumption that only one app will access a database is valid from a systems perspective and not just an app developer's perspective.

If all one wants is ease of setup, and all coding (transactions, etc.) in some client code, then any of the NoSQL candidates will do. They're just slightly glorified files, after all. Anyone remember dBase II? As one of the comments said, eventually you end up writing your own bespoke CICS if you actually care about the data. And the notion that an extra hour to set up a database engine is material to the development of an application is silly beyond belief.

The degree of ignorance in this is stunning. When you have equal knowledge and experience of the RDBMS of your choice and NoSQL, come back and compare them. The founders of MongoDB built and sold a multi-billion-dollar company on RDBMS technology, and MongoDB kernel developers have worked on and for MSSQL, Oracle, and MySQL. Actually, they may have some insight into how to build a new database to meet the challenges for which relational DBs are sub-optimal.

Maybe so, but then why does the CEO of MongoDB need to fall back on misrepresenting the capabilities of relational databases in order to promote his product? He states clearly in the ZDNet article to which I linked that you should pick MongoDB because it handles schema-less data, and as I tried to explain here, PostgreSQL can do the same. He also implies that inserting a sales order will be really, really hard - you may need 150 tables! - which is not representative of any commercial or custom order entry application I have seen. I don't pretend to be an expert on what MongoDB can do, but I know enough about relational databases to know that those criticisms are unfair.

The fact that someone is using market cap as an argument in a technical discussion could itself easily be described as "a degree of ignorance that is stunning". Here is a hint: Oracle's market cap is US$185bn; does that make it 100x "better" than MongoDB?

Maybe some people should go and see what the different tradeoffs are between relational, hierarchical, and network models. This war was already fought and was won by the relational model, not because of a transient technological advantage (actually, quite the opposite).

Amazing how old ideas come back to life and same discussions are entertained again and again... if only we had the means to record these discussions and the learnings we got and somehow distribute them to other people so they could learn what was done in the past.

Competition is great; it pushes everyone to innovate, and the result is that the difference between a relational DB and NoSQL can be very thin.

As a PostgreSQL expert, you probably know the JSON(B) store built into PostgreSQL 9.4 (and before that, hstore), which lets you mix relational and non-relational data in a painless way, keeping the full power of relations when they are useful and combining it with other PostgreSQL extensions like PostGIS.

In reply to all the posts that propose NoSQL "document" system == file system storage: you don't get it indeed. Try tuning your file system to allow the use of a JSON-formatted query to return documents that contain certain elements within petabytes of data located across hundreds of hosts, potentially in many different folders, and tell me how that goes for you. For that matter, try that in an RDBMS: find all the rows (no matter what structure) in all tables that contain specific elements across a distributed DB scaled to the multi-petabyte range. Of course, then you might say something like 'I consistently don't get the difference between petabytes and megabytes', at which point I would probably realize I'm wasting my time.

Good post Robert; this for me was not a waste of time, though, and I really appreciate your view on MongoDB.

I agree. I think the people offering those analogies are thinking of a system that functions only as a key-value store; and imagining the filename as a key. That sort of setup is one valid way to organize a key-value store with short keys and very large values, and I've done it myself to good effect, because storing very large values (i.e. files) in a relational database with no knowledge of their structure doesn't always make sense.

Of course, when you move away from the idea of a NoSQL database as a pure key-value store and start to talk about query languages and indexing, then it gets a lot more interesting. But that's also true in an SQL database: it's the capabilities of the query language that let you do powerful things that a filesystem (or simple key-value store) couldn't.

You might be surprised to hear that PostgreSQL 9.4 will have the ability to search for JSON elements inside complex documents via a sophisticated indexing strategy. I would expect some problems running unmodified PostgreSQL on a petabyte of data, but I think you won't find many MongoDB deployments on that scale, either.

Given that the number of "multi-petabyte" installations is orders of magnitude smaller than the run of the mill database installations, the hyperbole in the ZDNet article should be toned down. "Today's applications" are not hugely different than yesterday's; not everyone is building a Facebook-scale operation.

And, I run a multi-gigabyte installation that performs quite well on one server on an SQL database. (I also have used MongoDB in different contexts.)

MongoDB is (re)fighting the same, self-inflicted fight that object-oriented database did ten years ago.

I'd like to replicate from live models into historical documents for transparent reporting of both in a consistent manner, without having to hand-roll my app to talk to an ACID MySQL cluster for the live models and a graph DB for historical data.

A lot of the commenters seem upset at the hyperbole used by Max Schireson. Compared to the various CEOs of Microsoft, or Larry Ellison of Oracle, I think the MongoDB folks are still very much ahead in terms of polite behavior. I could write a thousand articles like this one about the RDBMS world's CEOs saying much less believable drivel than Max. Since everyone seems to be keeping score.

I do agree that RDBMS, and for that matter open source RDBMS like PostgreSQL, have been innovating and providing lots of additional features w.r.t. data formats. But consider this: Word allows you to store data in tabular format, and still we cannot let go of our Excel sheets.

IMHO Max Schireson was not implying that "we must replace RDBMS with MongoDB" or "we must always consider MongoDB over RDBMS". I think all he wanted to say was that for certain requirements MongoDB would be a better choice.

I strongly believe RDBMS have a long life to live and the world will always have applications which can not be designed without RDBMS.

Truthfully it sounds like Max is saying "MongoDB is a much better alternative to doing RDBMS horribly wrong"

...and after fully reading Max's commentary in the ZDNet article, I see that he markets Mongo by weaving bits of truth with utter fallacy; who would put an object in 150 different places with the expectation of reassembling it fast? _Yes_, with an RDBMS you must commit/put structure to your objects before storing them, but *every* time you go to query Mongo you've got to put structure to the data in the query itself.

A tool is only as effective as he who wields it, and many experienced people still (and will continue to) use RDBMSs for a reason.