Tuesday, March 16, 2010

The Case for the Bit Bucket

By Michael Cohen

Mr. Cohen is most famous here for the discussions we've had before about Application Developers vs. Database Developers. Part I is here, Part II is here. Mr. Cohen is a friend of mine. I have a great deal of respect for him. We obviously disagree in some areas, but I've learned to appreciate his push-back and learn from it.

Modern RDBMSs are quite powerful today. Pretty much every one of them has full support for SQL, including vendor extensions, all of the features we've come to expect from a relational database, a full-fledged programming language built in, and quite often support for extras like full text search or native handling of XML. Most now also ship with highly feature-specific add-ons - PostgreSQL has a geospatial package that makes it the de facto standard in that domain, MySQL has hot replication in a master-slave paradigm, Oracle has... well, Oracle has all kinds of things: a full object system and Java inside, a message broker, an HTTP server, a complete UI toolkit, among other things.

So the question arises as to how much of this capability one should use. I think it's becoming apparent that the answer to this is, "not much." Why shouldn't you take advantage of as much of the database's feature set as possible? The answer is performance and scalability. But wait, aren't stored procedures faster than ad hoc queries? Yes (theoretically). Won't it be more performant to execute business logic as close as possible to the data it operates on? Why should we introduce yet another component into the architecture when the database is perfectly capable of handling a particular task?

For one thing, the programming languages and environments offered by relational databases are now relatively long in the tooth, and have been eclipsed by modern OO languages. Developers are much more productive building applications with these new languages, and find it painful and tedious to work within the relational model, with SQL. You can see proof of this now with the overwhelming popularity of ORM frameworks in most of the popular OO languages out there. Java has Hibernate/EJB/JPA and many others. Ruby has ActiveRecord, DataMapper, and Sequel. Python has SQLAlchemy and Django's ORM. And it's not because these developers lack the skills to work with the database directly. Quite the contrary, actually: it takes intimate knowledge of the database to work effectively with an ORM. What's more, the ORM is often able to make runtime optimizations that would be difficult or prohibitively time consuming to hand code. Finally, clustered caches offer massive performance and scalability improvements, handling writes back to the database transparently behind the scenes, but for the most part they preclude implementing complex business logic in the database.
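To make the caching point concrete, here is a minimal sketch of a write-behind cache. It is an illustration only: the class name WriteBehindCache, the dict standing in for the database, and the explicit flush() call are all assumptions made for the example, not how any particular clustered cache product works (real ones flush asynchronously and across nodes).

```python
class WriteBehindCache:
    """Toy write-behind cache (hypothetical; a dict stands in for the database)."""

    def __init__(self, store):
        self.store = store   # backing "database": any dict-like object
        self.cache = {}      # in-memory copy that the application talks to
        self.dirty = set()   # keys written to the cache but not yet persisted

    def get(self, key):
        if key not in self.cache:
            self.cache[key] = self.store[key]  # read through on a cache miss
        return self.cache[key]

    def put(self, key, value):
        # The write lands in memory only; the database is updated later.
        self.cache[key] = value
        self.dirty.add(key)

    def flush(self):
        # Persist every pending write, then mark the cache clean.
        for key in self.dirty:
            self.store[key] = self.cache[key]
        self.dirty.clear()


db = {"balance": 100}
cache = WriteBehindCache(db)
cache.put("balance", 80)
stale = db["balance"]  # the database still holds the old value here
cache.flush()          # only now does the database see the write
```

Between put() and flush() the database holds a stale value, which is why logic that runs inside the database (triggers, stored procedures) cannot be relied on to see current data in this kind of setup.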

The overall trend is clear, across languages and platforms. It's the movement of data out of the database and into the application layer. Less and less reliance on the database, perhaps only for archival purposes. John Davies has a good comment on this. He's operating in a unique environment with extremely rigorous performance requirements, but we're now starting to see similar constraints imposed by the web. There's a whole class of software that has come about due to the inability to scale the relational database beyond a certain point. Facebook developed Cassandra, now used by Twitter, Reddit, and Digg, among others. LinkedIn built Voldemort. My employer doesn't deal with the massive scale of these companies, but we do large scale data processing with Hadoop. HBase, another non-relational persistent data store, is a natural fit, and just about the only option really. We use MySQL less and less.

Of course, not everybody is building applications with such high scalability requirements. But even for applications with less intensive scalability requirements I would argue the same tendency to minimize the workload on the database should apply. Cameron Purdy has a good quote, "If you don't pick your bottlenecks, they'll pick you." Design your application to bottleneck, he says. What he means is, your application is going to bottleneck on something, so you need to explicitly decide what it will bottleneck on. Unfortunately, most applications bottleneck on the database, as this is the hardest layer to scale. It's pretty easy to scale the front end: we just throw more instances of Apache out there. It's a little bit harder, but not much, to scale the app server. But it's pretty hard to scale the database tier, particularly for write intensive applications. For well funded organizations, Oracle RAC is the standard. MySQL's master-slave setup and hot replication saw it win out over PostgreSQL despite the fact that Postgres is a much better database in just about every other respect. The NoSQL projects listed above grew out of the inability even to scale out MySQL.

The trend is clear. We're collecting and processing more data than ever before, and this will only increase as we go forward. Unfortunately, the relational database (at least in its current form) isn't well suited to the scale of data processing an already significant and growing number of organizations deal with on a daily basis. We're now seeing new solutions come forth to address the shortcomings of the traditional RDBMS, and the same forces that have necessitated such developments are at work even in smaller organizations. At all levels, developers would do well to require as little functionality as possible from the database, essentially, to treat it as a bit bucket.

23 comments:

Some well constructed arguments there.

The point about productivity in 'new' languages does seem to be the market position. It is simply cheaper to build apps using these tools and the expense of 'squeezing' the best out of the database doesn't pay off.

The second point, about the benefits of non-SQL databases, I am not as convinced. Yes, there are use-cases where in-memory databases are better suited, especially for ephemeral data, but this seems foreign to the concept of database as a 'persistence layer'. It certainly wouldn't be relevant to most of the applications I've worked with.

The document databases will be more interesting to watch, to see if they have any real impact over the object databases and XML databases which have already been hyped and have now settled into their niches.

I would say these are more tools that you, the developer/architect, have at your disposal. I don't buy database independence or bit bucket speak. The database is still a critical component of the infrastructure, and databases aren't going away. They are tried and proven, but so is Cassandra. I think each organization needs to look at what it is trying to accomplish.

NOSQL is not a panacea. By going to Cassandra, there are things you have to give up.

For example, Cassandra doesn't do joins, whereas in a DB related data can be brought together. So with Cassandra your app has to bring it together. It doesn't guarantee referential integrity. Last I checked, companies NEED that integrity. Cassandra will "eventually" be in sync, but conflicts can and do occur. If I am running a search engine or something like that, I might not care about referential integrity as much as a bank does.
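A rough sketch of what "your app has to bring it together" can look like. Nothing here is Cassandra's API: the users and orders structures are made-up stand-ins for the results of two separate key lookups, and join_orders_to_users is a hypothetical helper.

```python
# Hypothetical data: in a real system these would come from two separate
# lookups against the store; plain Python structures stand in for them.
users = {
    1: {"name": "alice"},
    2: {"name": "bob"},
}
orders = [
    {"order_id": 100, "user_id": 1, "total": 25.0},
    {"order_id": 101, "user_id": 2, "total": 10.0},
    {"order_id": 102, "user_id": 3, "total": 99.0},  # no such user; nothing prevented this write
]


def join_orders_to_users(orders, users):
    """Join in the application, and report orders whose user is missing."""
    joined, orphans = [], []
    for order in orders:
        user = users.get(order["user_id"])
        if user is None:
            orphans.append(order["order_id"])  # the store never enforced the reference
        else:
            joined.append({**order, "name": user["name"]})
    return joined, orphans


joined, orphans = join_orders_to_users(orders, users)
```

With a foreign key constraint in an RDBMS, order 102 would have been rejected at insert time. Here the application has to notice the orphan itself and decide what to do with it.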

Don't get me wrong, Cassandra is really good stuff, but I think it can easily be abused for the wrong situations where it should not apply.

This post is basically a contradiction of The Helsinki Declaration. I can just about follow the claim that the new languages are more productive. But if you need to take care of your data, let's say you need 2 or more of the ACID properties, you will have to come back to an RDBMS. It's really fun troubleshooting application-side implementations of sequences or constraints, and how gloriously they fail when the system gets scaled ;-)

Here we go: the old 'performance and scalability' and 'long in the tooth' nonsense.

Which have never been proven, btw.

No, Michel: google, amazon and others do not use a SINGLE one of the long list of technologies you mentioned - and your followers like to roll out.

Not a SINGLE one!!!

Yeah: they don't need your 'performance and scalability'.

This may come as a total surprise to you, but the number of businesses that require the dizzy heights of your 'high performance and scalability' is incredibly small.

(Please THINK before you reply with another roll list, ta?)

What the vast majority of businesses - outside of the rarefied sphere of web companies - need is the ability to process and record very large volumes of data.

Which is a totally different problem.

As to the 'long in the tooth': you have to cease confusing 'proven and reliable' with childish 'age' arguments.

In 12 years, I have not seen a SINGLE application using the technologies you push that has been reliable, maintainable and expandable.

Not a SINGLE one, Michel!!!

In case you have not noticed, any FOOL can cobble together an 'architecture', slap on it some acronym to give it a 'standard' patina and declare it as yet another 'success'.

Then when the problems start, said fool is long gone to the next 'technology du jour' and can never be made responsible for the disasters left behind…

Familiar? It is the history of every single 'bespoke application' I've seen in the last 12 years.

To actually build applications that stand the test of time, are maintainable and can be augmented without a complete re-write, requires a little bit more thought than just configuring the latest 'framework' fad: you need technologies that have proven themselves against the test of time.

Which are what? Ah yes, the 'long in the tooth' ones: that's why they lasted!

You and your kind constantly confuse productivity with churning out code to do what can be done in three sentences in a real application environment.

'The overall trend is clear, across languages and platforms. It's the movement of data out of the database and into the application layer'

Dude, perfect example of the totally irresponsible and ignorant attitude and argumentation of your kind, John Davies' comments included.

You completely and totally ignore in your 'everything in memory' that data volumes are growing at exponential rates, much larger than the memory capacity of ANY system, fragmented and disjointed as your 'scalability' might be.

No, a farm of Apache servers is NOT an application processing environment, you dumbkopf!!! Far from it!

Companies are spending more and more to record, organize and analyze the immense volumes of data we see now. R-E-L-I-A-B-L-Y!

6 years ago I'd be hard pressed to produce a TB-class database used in a small to mid-size classic company.

Now we process, analyze and archive – in a retrievable manner – 7TB daily!

DAILY, Michel! And I'm eyeing 20 TB, DAILY, by the end of this year. NO end in sight, BTW.

No, this is not some vaporous web would-be non-entity!

This is a mid-size classic commercial property management business.

Do you even grasp the data scale we're talking about here?

We do this with a system that cost half a megabuck. The software cost less than that, developed by a team of - wait for it! - 3 people!

Not some "Apache farm" whose data will vanish on the next power glitch, and a development environment with a cast of thousands.

When you've done this, REPEATABLY AND RELIABLY, with similar costs and volumes, come back and talk "IT technology".

Once again, you've clearly shown how dangerous and irresponsible your approach to development is.

Here we go: the old 'performance and scalability' and 'long in the tooth' nonsense.

Which have never been proven, btw.

No, Michel: google, amazon and others who must have performance and scalability do not use a SINGLE one of the long list of technologies you mentioned - and your followers like to roll out.

Not a SINGLE one!!!

Yeah: they don't need your 'performance and scalability'.

They use totally different architectures suited to their very specific environments.

This may come as a total surprise to you but the number of businesses that require the dizzy heights of your supposed 'high performance and scalability' is incredibly small.

(Please THINK before you reply with another roll list, ta?)

What the vast majority of businesses - outside of the rarefied sphere of web companies - need is the ability to handle very large volumes of data.

Which is a totally different problem.

As to the 'long in the tooth': you have to cease confusing 'proven and reliable' with 'age'.

In 12 years, I have not seen a SINGLE application using the technologies you push that has been reliable, maintainable and expandable.

Not a SINGLE one, Michel!!!

In case you have not noticed, any FOOL can cobble together an 'architecture', slap on it some hasty acronym to give it a 'standard' patina and declare it as yet another 'success'.

Then when the problems start, said fool is long gone to the next 'technology du jour' and can never be made responsible for the disasters left behind...

Familiar? It is the history of every single 'bespoke application' I've seen in the last 12 years.

To actually build applications that stand the test of time, are maintainable and can be augmented without a complete re-write, requires a little bit more thought than just configuring the latest 'framework' fad: you need technologies that have proven themselves against the test of time.

Which are what? Ah yes, the 'long in the tooth' ones. That's precisely why they lasted!

You and your kind constantly confuse productivity with churning out code to do what can be done in three sentences in a real application environment.

'The overall trend is clear, across languages and platforms. It's the movement of data out of the database and into the application layer'

Dude, perfect example of the totally irresponsible and ignorant attitude and argumentation of your kind, John Davies' comments included.

You completely and totally ignore in your 'everything in memory' mantra that data volumes are growing at exponential rates, much larger than the memory capacity of ANY system, fragmented and disjointed as your 'scalable systems' might be.

No, a farm of Apache servers is NOT an application processing environment, dumbkopf!!! Far from it!

Companies are spending more and more to record, organize and analyze the immense volumes of data we see now. R-E-L-I-A-B-L-Y!

6 years ago I'd be hard pressed to produce a TB-class database used in a small to mid-size classic company.

Now we process, analyze and archive – in a manageable manner – 7TB daily.

DAILY, Michel! And I'm eyeing 20 TB, DAILY, by the end of this year. NO end in sight, BTW.

No, this is not some vaporous web would-be non-entity!

This is a classic mid-size commercial property management business.

Do you even grasp the data scale we're talking about here?

We do this with a system that cost half a megabuck. The software cost less than that, developed by a team of - wait for it! - 3 people!

Not some 'Apache farm' whose data will vanish on the next power glitch, and a development environment with a cast of thousands.

When you've done this, REPEATABLY AND RELIABLY, with similar costs and volumes, come back and talk 'IT technology'.

Once again, you've clearly shown how dangerous and irresponsible your approach to development is.

Noons, such vitriol and venom. I can just imagine the spittle frothing out the corner of your mouth onto the keyboard. You should strive to always be dispassionate about technology, for passion will blind you to alternate choices.

"No, Michel: google, amazon and others do not use a SINGLE one of the long list of technologies you mentioned - and your followers like to roll out. Not a SINGLE one!!!"

Actually, Amazon uses Dynamo. See the white paper at http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

from which Cassandra and Voldemort were derived. Google of course uses BigTable. Both of these of course are not built on the relational model.

Please do your homework before you reply, ta?

"In 12 years, I have not seen a SINGLE application using the technologies you push that has been reliable, maintainable and expandable. Not a SINGLE one, Michel!!!"

Noons, just because you've never been involved in such a project doesn't mean they don't exist. You're making a classic blunder here. http://www.nizkor.org/features/fallacies/hasty-generalization.html

"You completely and totally ignore in your 'everything in memory' that data volumes are growing at exponential rates, much larger than the memory capacity of ANY system, fragmented and disjointed as your 'scalability' might be."

Wow, this comment suggests you just totally don't understand distributed architecture. Terabytes of data are easily held in RAM - distributed across nodes in a cluster. The entire dataset need not be stored on one machine.
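As a toy illustration of spreading a dataset across nodes, the sketch below assigns each key to a node by hashing it. The names node_for and Cluster are invented for the example, each "node" is just a dict in one process, and plain modulo partitioning is used for brevity. Dynamo-style systems use consistent hashing instead, so that adding a node does not remap most keys.

```python
import hashlib


def node_for(key: str, num_nodes: int) -> int:
    """Map a key to a node index with a stable hash (plain modulo partitioning)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


class Cluster:
    """Toy cluster: each 'node' is just a dict held in this process."""

    def __init__(self, num_nodes: int):
        self.nodes = [{} for _ in range(num_nodes)]

    def put(self, key: str, value) -> None:
        self.nodes[node_for(key, len(self.nodes))][key] = value

    def get(self, key: str):
        return self.nodes[node_for(key, len(self.nodes))].get(key)


cluster = Cluster(num_nodes=4)
for i in range(1000):
    cluster.put(f"trade-{i}", {"qty": i})

# No single node holds the whole dataset; each holds only its share of keys.
sizes = [len(node) for node in cluster.nodes]
```

Any client can compute node_for() on its own, so a read goes straight to the node that owns the key without consulting a central index.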

"No, a farm of Apache servers is NOT an application processing environment, you dumbkopf!!! Far from it!"

You're really going off the deep end here. Based on this comment it's not clear you even understand where in an architecture an Apache web server fits.

What actually surprises me is how "fast" Cassandra is, given that it's built on Java :)

I have actually played with Cassandra. Very slick technology. The thing is, there is no integrity and none of the relational ability that I am used to seeing out of an RDBMS. SQL was lacking, obviously (NOSQL). So you have to fetch multiple keys out and join them in the app (hope you have enough RAM on a large dataset) and even then, you aren't "guaranteed" that the data is consistent.

So Cassandra and many others are fast by removing things that get in their way like integrity etc... and everything is denormalized... And if you need these things (or normalization), then you probably should stick with an RDBMS. I'd comment more but I am in the middle of several things and that is what was on my mind at the moment :)

First, here's an oxymoron: "Why shouldn't you take advantage of as much of the database's feature set as possible? The answer is performance and scalability."

My experience has been the absolute opposite. If you treat the database as a bit bucket and try to handle all data processing in the application layer, your application will not scale, period.

Here's a contradiction:

"Developers are much more productive building applications with these new languages, and find it painful and tedious to work within the relational model, with SQL."

followed by

"And it's not because these developers lack the skills to work with the database directly."

If it is painful and tedious to work with the relational model, with SQL, they simply don't have the skills to work with the database directly. You cannot get more direct than SQL.

"it takes intimate knowledge of the database to work effectively with an ORM."

Really?! Well, I have news for you. Many OO developers love ORM tools because they don't need to have intimate knowledge of the database. ORM tools make it easier to treat the database as a bit bucket. Why bother writing SQL when the ORM tool will do it for you?

"The overall trend is clear, across languages and platforms. It's the movement of data out of the database and into the application layer"

Why? Because a few large web sites are implementing NoSQL alternatives? Please! That's far from being a trend.

"But even for applications with less intensive scalability requirements I would argue the same tendency to minimize the workload on the database should apply."

Minimize the workload on the database... and move it where, to the app layer? Good luck with that! Same result... poor scalability and performance. Optimized SQL based on a sound data model scales extremely well. It's even better when you apply SQL features (e.g. analytics). I love the look on OO developers' faces when I show them a single query that gives them the data they need when they were getting ready to write a program to crunch the data in the app layer.
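A small sketch of the contrast being drawn here, using Python's built-in sqlite3 module. The orders table and its rows are invented for the example, and a plain GROUP BY stands in for the richer analytic functions mentioned above.

```python
import sqlite3
from collections import defaultdict

# In-memory database with a made-up table, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("acme", 10.0), ("acme", 15.0), ("globex", 7.5)],
)

# App-layer approach: pull every row across, then aggregate in Python.
totals_app = defaultdict(float)
for customer, amount in conn.execute("SELECT customer, amount FROM orders"):
    totals_app[customer] += amount

# Database approach: one query returns the finished answer.
totals_sql = dict(
    conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)
```

Both produce the same totals. The difference is that the first ships every row to the application while the second ships one row per customer, which matters more as the table grows.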

"But it's pretty hard to scale the database tier"

Not really. It only takes intimate database knowledge. It's been done many times before.

"We're collecting and processing more data than ever before, and this will only increase as we go forward."

Bring it on! Can you say Exadata?

"Unfortunately, the relational database (at least in its current form) isn't well suited to the scale of data processing an already significant and growing number of organizations deal with on a daily basis."

A significant and growing number? Whatever number that may be it pales in comparison with the number of organizations that depend on established relational databases that implement ACID and scale to meet their needs.

"At all levels, developers would do well to require as little functionality as possible from the database, essentially, to treat it as a bit bucket."

What an appropriate way to close. Yes, developers will do great because they'll be able to code more in less time but performance and scalability will suffer tremendously due to the bit bucket mentality.

Facebook is built on Cassandra. True, that's pretty cool. When the data is as important as Aunt Salley's picture of her grandkids, who gives a flying fig if it's lost to a client query. Cassandra is "Eventually Consistent". That doesn't work for 99% of business apps. Can't run AP or AR, or HR or ERP or Billing or Trading or Banking, or Manufacturing or well, just about anything important with "Eventually consistent". Digg and Facebook and Bigtable, etc. Who cares. They are big, they run well on Cassandra, it's a cool technology. For that small slice of large websites full of detritus, it's a perfect fit... costless database for worthless data.

As far as Amazon and their eventual consistency, that just proves that it takes a very large site with very specific requirements spending an enormous amount of money to make something fit into an ACID scenario. Which eventually means Oracle, right?

I'd call that an argument against doing such things, in general.

"costless database for worthless data" that's a gem of a quote, Skipjack. It so well describes Facebook et al and google et al.

Here I am using a google account. If it screws up (as it has so many times in the past, they slipstreamed a google groups change in a couple of days ago), so what? It's not like you need to find the same thing twice or care if it loses your post. Though somehow, I get upset when those problems happen. Too much ACID in my past, I guess.

I love how you guys basically ignore the comments of an industry heavyweight like John Davies, who literally tells you databases are on the way out, and works on Wall Street, the most ACIDic environment there is.

As you said John Davies is "operating in a unique environment". In his world, yesterday's activity sits in an archive and they generally aren't interested in it. Keeping 'current' information in memory for performance fits his world. Same with the facebook/twitter world. They are interested in 'now' and forget 'last week'.

The persistence layer is more about recording stuff today that you'll look at next week. That's what a lot of businesses do. They get an order, move some physical boxes around, send them off and, a month or two later, get the payment for it.

There is an information/virtual world, where stuff happens at the speed of light. Then there's a physical world of food and iron ore and doctors performing operations, which is governed by physical limits of speed and scalability.

"As you said John Davies is "operating in a unique environment". In his world, yesterday's activity sits in an archive and they generally aren't interested in it. Keeping 'current' information in memory for performance fits his world. "

Unique in the stratospheric performance requirements, not necessarily the persistent storage requirements. Are you sure "yesterday's activity sits in an archive?" I think there's a pretty high probability that they need access to "historical" data quite a bit more often than you think, perhaps most of the time, what with the algorithm based trading and such that goes on in the financial world. At any rate, "keeping 'current' information in memory" doesn't mean they're not persisting it to disk; they are. And they're still using relational databases all over the place, they're just moving away from them as a general trend. Keeping things in memory doesn't mean you're not using a database. Cameron Purdy is a multi-millionaire because he figured out how to solve the RDBMS scalability problem, while keeping the transactional semantics of the database largely intact.

"Are you sure "yesterday's activity sits in an archive?"" A quote from the comment you linked to states

"You can now put an entire days trades from any of the world's largest banks into memory, true, at the end of the day we need to "archive" it...but you rarely need to index it for complex searches once it's archived, in effect this is pure archival. "

So I'm not sure, but that is the impression I get from the reference you provided.

They don't teach you this in college, but the fundamental theorem of the software industry is the idea that everything needs to be rewritten all the time. As a corollary, web startup engineers believe that there is no problem but scalability, and architecture is its solution. And thus, the NoSQL movement was born.

The idea is that object relational databases like MySQL and PostgreSQL have lapsed their useful lifetimes, and that document-based or schemaless databases are the wave of the future. Never mind of course that MySQL was the perfect solution to everything a few years ago when Ruby on Rails was flashing in the pan. Never mind that real businesses track all of their data in SQL databases that scale just fine. (For Silicon Valley readers, Walmart is a real business, Twitter is not.)

Invariably, all web projects start off with something like Rails or Django, most likely backed by MySQL. The data relationships are easy to model, and the application works well. If you are lucky enough that people actually use your application, eventually you will start to see some performance issues. At this point, a developer who values technological purity over gettin' shit done will advocate "rewriting the whole thing in a weekend using Cassandra". And if he's smart enough, he might just pull it off. (Of course, said developer has only migrated the app to use a different data store - all of the ancillary support code was conveniently ignored)

So you've magically changed your backend from MySQL to Cassandra. Stuff will just work now, right? Well, no. Did you know that Cassandra requires a restart when you change the column family definition? Yeah, the MySQL developers actually had to think out how ALTER TABLE works, but according to Cassandra, that's a hard problem that has very little business value. Right.

I'm not just singling out Cassandra - by replacing MySQL or Postgres with a different, new data store, you have traded a well-enumerated list of limitations and warts for a newer, poorly understood list of limitations and warts, and that is a huge business risk.

You Are Not Google

The sooner your company admits this, the sooner you can get down to some real work. Developing the app for Google-sized scale is a waste of your time, plus, there is no way you will get it right. Absolutely none. It's not that you're not smart enough, it's that you do not have the experience to know what problems you will see at scale.

Besides, did you know that Google Adwords is implemented on top of MySQL? What, that business critical code that operates at massive scale doesn't use BigTable? No, in fact there is such enormous value in sticking with what works that Google identifies problems with InnoDB at scale and submits patches, instead of saying "MySQL doesn't scale, let's dump it for something else".

NoSQL will never die, but it will eventually get marginalized, like how Rails was marginalized by NoSQL. In the meantime, DBAs should not be worried, because any company that has the resources to hire a DBA likely has decision makers who understand business reality.

Reading the recent flamory piece “I Can’t Wait for NoSQL to Die” from Ted Dziuba, I thought the author is wrong on so many levels. Not that I’m a NoSQL zealot, see my The Dark Side of NoSQL, but Ted is hilarious.....