There’s been a lot of back and forth lately from the NoSQL crowd around Michael Stonebraker’s contention that reliance on relational technology and MySQL has trapped Facebook in a ‘fate worse than death.’ This was reported in a GigaOm post by Derrick Harris. Harris reports in a later post that most of the reaction to Stonebraker’s contention was negative:

By and large, the responses weren’t positive. Some singled out Stonebraker as out of touch or as just trying to sell a product. Some pointed to the popularity of MySQL as evidence of its continued relevance. Many questioned how Stonebraker dare question the wisdom of Facebook’s top-of-the-line database engineers.

Harris, Jim Starkey, Paul Mikesell, and Curt Monash all take a stab at rehabilitating Stonebraker’s argument in the second post. Their argument boils down to, “Yeah, Facebook did it, but only because they have great engineers, spent a fortune, and endured a lot of pain. There are easier ways.”

Sorry fellas, time to annoy the digerati again, and so soon after bashing Social Media. I disagree with their contention, which is well expressed in the article by this Jim Starkey quote:

If a company has plans for its web application to scale and start driving a lot of traffic, Starkey said, he can’t imagine why it would build that new application using MySQL.

In fact, I would argue that starting with NoSQL because you think you might someday have enough traffic and scale to warrant it is a premature optimization, and as such, should be avoided by smaller and even medium-sized organizations. You will have plenty of time to switch to NoSQL if and when it becomes helpful. Until that time, NoSQL is an expensive distraction you don’t need.

The best example I see for why that’s the way to look at NoSQL comes from Netflix, which is mentioned towards the end of the article. I went through several expositions by Netflix engineers on their experience transitioning from an Oracle relational data center to one based on NoSQL in the form of Amazon’s SimpleDB and then later Cassandra (the latter is still an ongoing transition as I understand it). You’re welcome to read the same sources; I’ve listed them at the bottom.

Netflix decided to move to the Cloud in late 2008 to early 2009 after an outage prompted them to consider what it would take to engineer their way to significantly higher uptime. They concluded they couldn’t build data centers fast enough, and that as soon as one was built it was swamped for capacity and out of date. They agreed with Amazon’s Werner Vogels that building data centers represented “undifferentiated heavy lifting” and was therefore to be avoided, so they bet heavily on the Cloud. These are smart technologists who have been very transparent about their experiences, so it’s worth learning from them. Werner Vogels’s reaction to Stonebraker’s remarks about Facebook is an apt way to start:

Scaling data systems in real life has humbled me. I would not dare criticize an architecture that holds social graphs of 750M and works.

The gist of the argument for NoSQL being a premature optimization is straightforward and rests on 3 points:

Point 1: NoSQL technologies require more investment than Relational to get going with.

Adopting the non-relational model in general is not easy, and Netflix has been paying a steep pioneer tax while integrating these rapidly evolving and still maturing NoSQL products. There is a learning curve and an operational overhead.

Or, as Sid Anand says, “How do you translate relational concepts, where there is an entire industry built up on an understanding of those concepts, to NoSQL?”

Companies embarking on NoSQL are dealing with less mature tools, less available talent that is familiar with the tools, and in general fewer available patterns and know-how with which to apply the new technology. This creates a greater tax on being able to adopt the technology. That sounds a lot like what we expect to see in premature optimizations to me.

Point 2: There is no particular advantage to NoSQL until you reach scales that require it. In fact it is the opposite, given Point 1.

It’s harder to use. You wind up doing more in your application layer to make up for the things Relational does for you, and that you may rely on, that NoSQL can’t. Take consistency, for example. As Anand says in his video, “Non-relational systems are not consistent. Some, like Cassandra, will heal the data. Some will not. If yours doesn’t, you will spend a lot of time writing consistency checkers to deal with it.” This is just one of many issues involved with being productive with NoSQL.
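To give a feel for the kind of work Anand is describing, here is a minimal Python sketch of a consistency checker. It is only an illustration under stated assumptions: the replicas are plain in-memory dictionaries, every record carries a version number, and "newest version wins" is the repair rule. The names `find_inconsistencies` and `heal` are hypothetical and do not come from Netflix, SimpleDB, or Cassandra.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (version, value). A real store would differ.
Record = Tuple[int, str]


def find_inconsistencies(replicas: List[Dict[str, Record]]) -> Dict[str, Record]:
    """Return the newest record for every key whose replicas disagree."""
    repairs: Dict[str, Record] = {}
    all_keys = set().union(*(replica.keys() for replica in replicas))
    for key in all_keys:
        copies = [replica.get(key) for replica in replicas]
        present = [copy for copy in copies if copy is not None]
        # Tuples compare by version first, so max() picks the newest copy.
        newest = max(present)
        # A missing copy (None) or a stale copy both count as divergence.
        if any(copy != newest for copy in copies):
            repairs[key] = newest
    return repairs


def heal(replicas: List[Dict[str, Record]], repairs: Dict[str, Record]) -> None:
    """Overwrite every replica with the agreed-upon newest records."""
    for replica in replicas:
        replica.update(repairs)


replica_a = {"user1": (1, "basic"), "user2": (2, "premium")}
replica_b = {"user1": (1, "basic"), "user2": (1, "basic")}
replica_c = {"user1": (1, "basic")}

repairs = find_inconsistencies([replica_a, replica_b, replica_c])
heal([replica_a, replica_b, replica_c], repairs)
```

Even this toy version has to decide what "correct" means when copies disagree, and that is exactly the sort of decision a relational database normally makes for you.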

Point 3: If you are fortunate enough to need the scaling, you will have the time to migrate to NoSQL and it isn’t that expensive or painful to do so when the time comes.

The root of premature optimization is engineers hating the thought of rewriting. Their code has to do everything exactly right the first time or it’s crap code. But what if you don’t even understand the problem well enough to write “good” code at first? Maybe you need to see how users interact with it, what sorts of bottlenecks exist, and how the code will evolve. Perhaps your startup will have to pivot a time or two before you’ve even started building the right product. Wouldn’t it be great to be able to use more productive tools while you go through that process? Isn’t that how we think about modern programming?

Yes it is, and the only reason not to think that way is if we have reason to believe that a migration will be, to use Stonebraker’s words, “a fate worse than death.” The trouble is, it isn’t a fate worse than death. And yes, it will help to have great engineers, but by the time you get to the volumes that require NoSQL, you’ll be able to afford them, and even then, it isn’t that bad.

Netflix’s story is a great one in this respect. They went about their NoSQL migration in a clever way. They built bi-directional replication between Oracle and SimpleDB, and then they started moving over one app at a time. They did this against a mature system rather than a new, buggy system that users had never tested. As a result, things went pretty quickly and pretty smoothly. That’s how engineers are supposed to work: bravo Netflix!
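The shape of that approach can be sketched in a few lines of Python. This is a toy model, not Netflix’s implementation: the `Store`, `Router`, `sync`, and `replicate_both_ways` names are hypothetical, the stores are in-memory dictionaries, and resolving conflicts by "latest timestamp wins" is my assumption, since the sources I read don’t spell out how conflicts were handled.

```python
class Store:
    """Stand-in for either database; rows map key -> (timestamp, value)."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def write(self, key, value, timestamp):
        self.rows[key] = (timestamp, value)


def sync(source, target):
    """Copy rows that are missing or older in target; return how many moved."""
    copied = 0
    for key, (timestamp, value) in source.rows.items():
        current = target.rows.get(key)
        if current is None or current[0] < timestamp:
            target.rows[key] = (timestamp, value)
            copied += 1
    return copied


def replicate_both_ways(a, b):
    """One replication pass in each direction (assumed: latest timestamp wins)."""
    return sync(a, b) + sync(b, a)


class Router:
    """Sends each app to the legacy store until that app has been migrated."""

    def __init__(self, legacy, new):
        self.legacy = legacy
        self.new = new
        self.migrated = set()

    def store_for(self, app):
        return self.new if app in self.migrated else self.legacy


legacy = Store("relational")
new = Store("nosql")
router = Router(legacy, new)

# An app that has not moved yet still writes to the legacy store.
router.store_for("queue").write("movie:1", "queued", timestamp=1)

# Move one app over; from now on it writes to the new store.
router.migrated.add("ratings")
router.store_for("ratings").write("rating:7", "5 stars", timestamp=2)

moved = replicate_both_ways(legacy, new)
```

Because replication runs in both directions, migrated and unmigrated apps see the same data, which is what makes it possible to move one app at a time instead of cutting everything over at once.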

I have a note out to Adrian Cockcroft to ask how long it took, but already I have found a reference to Sid Anand doing the initial “forklifting” of a billion records from Oracle to SimpleDB in about 9 months, and they went on from there. When Sid Anand was asked what the most complex query to convert from Oracle to NoSQL was, he said, “There weren’t really any.” He went on to say you wouldn’t convert your transactional data anyway, and that was pretty much it.

Conclusion

The world loves to see things in black and white. It sells more papers. Therefore, because some situations benefit from NoSQL for scaling, we hear a hue and cry that everyone must embrace NoSQL immediately. Poppycock. You can go a long, long way with SQL-based approaches. They’re more proven, they’re cheaper, and they’re easier. Start out there, and if the horse you’re riding is strong enough to carry you to NoSQL scaling levels, you can tackle that when the time comes. Meanwhile, avoid premature optimizations. You don’t have time for them. Let all these guys with NoSQL startups make their money elsewhere. You need to stay agile and focused on your next minimum viable deliverable.

4 responses to “NoSQL is a Premature Optimization”

You make very good points about NoSQL and the almost rabid frenzy in the marketplace to adopt Hadoop/MapReduce, even when that square peg doesn’t fit the round hole companies have now or can foresee ever having.

As another proof point for your Point #1, Dr. Usama Fayyad, former Chief Data Officer at Yahoo!, told the audience at Enzee Universe that Yahoo! discovered that Hadoop was 10-50X more expensive for a production deployment than a data store.

Hadoop is simply in the early stages and doesn’t have the wealth of enterprise readiness tools, such as Business Intelligence, ETL, Visualization, etc. that exist for data warehouses and that make it feasible for mainstream operational deployment. Yes it will get there but it will take some time.

To further point #3, you see that the marketplace is creating hybrid solutions, such as in-database MapReduce processing, that combine the strengths of each of these approaches: databases and Hadoop.

The fourth point I would add is that there are some really handy uses for SQL. There are cases where SQL is simply A LOT simpler and faster since it is random access.

When we say NOSQL, it means NOT ONLY SQL; we are not saying no to RDBMS databases. A NOSQL solution can coexist with an RDBMS solution. Any data that is non-transactional in nature can be moved to a NOSQL solution like Cassandra, for example log data or session state data. Again, if you know your access pattern, the way your data is going to be accessed, you don’t even need a Hadoop-based solution. With a NOSQL solution like Cassandra you can have the schema configuration defined and store parsed data at runtime.

Agreed, in a NOSQL solution we give up consistency; that’s what the CAP theorem states: you cannot have Consistency, Availability, and Partition tolerance all at the same time. Unless you give up consistency you cannot achieve scalability.

So finally, the data which needs to adhere to ACID properties and is transactional in nature can stay in the RDBMS, but non-transactional data can be moved to NOSQL, ensuring fewer hits to RDBMS data 🙂