Disk-less databases?

As a company with an alternative database (one of the first “NOSQL” databases) and a long history of solving problems that relational databases cannot, we often see system and application architects resorting to memory in a last-ditch attempt to squeeze just a bit more performance out of RDBMSs that simply were not designed to perform at scale. And so it was with some interest, and a little bias, that I read a new blog post over on ODBMS.org this week titled “The future of data management: ‘Disk-less’ databases? An interview with Goetz Graefe” and decided to share some additional thoughts.

Goetz believes that “with no disks and thus no seek delays, assembly of complex objects will have different performance tradeoffs”. He thinks “a lot of options in physical database design will change, from indexing to compression and clustering and replication.”

It’s a valid assessment, but one that invites a larger discussion.

Turns out, this discussion has been going on for 20 years, starting with TimesTen. Perst and eXtremeDB use RAM to speed things up as well. We’ve shown that running with SSDs can give up to an 80x speed increase on reads and 4x on writes. Developers can also configure Objectivity/DB, our flagship data management product, for purely cached applications, e.g. in the telecom and process control worlds, where there’s a lot of lookup data. You’ll find much of this covered in one of our older white papers titled “Flexible Deployment”, available in both web/HTML and PDF formats on our site.

In today’s world of big data, it’s easy to build something that becomes I/O bound, that is, limited by how fast a disk can read and write information. Goetz is one of the most brilliant data management experts on the planet, but I think this interview neglects the business end of the equation (memory is expensive). It would be better to make clear that, if you’re worried about becoming I/O bound as your system grows, you actually have *two* choices:

You can move part or all of your application data into memory via memcache or ramdisk components, or buy a super beefy machine with terabytes of memory.

This could improve your application performance by anywhere from a few percent to several multiples. It will, however, add complexity to your application, require management of potentially numerous new component layers, and force you to choose between hot, warm and cold data (because most companies can only afford to move hot data into RAM while leaving everything else on cheaper disks). And complexity adds cost. $500K or more for each super beefy multi-terabyte RAM machine, plus the added cost of engineering, maintenance and IT management, can all add up pretty quickly. You might get your web-facing system to run 30% faster. Is it worth the price you paid? Did you get a return on that investment?
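To make the hot-versus-cold decision concrete, here is a minimal sketch of a RAM tier sitting in front of slower storage. Everything in it is an illustrative assumption: the `HotDataCache` class, the plain dictionary standing in for disk, and the tiny capacity are hypothetical and not taken from any product mentioned here. It only shows why the size of the RAM tier decides how often you still pay for a trip to disk.

```python
from collections import OrderedDict


class HotDataCache:
    """Tiny LRU cache standing in for a RAM tier in front of slower storage."""

    def __init__(self, backing_store, capacity):
        self.backing_store = backing_store  # hypothetical stand-in for data on disk
        self.capacity = capacity            # how many records "fit in RAM"
        self.hot = OrderedDict()            # hot records, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.hot:
            self.hits += 1
            self.hot.move_to_end(key)       # mark as most recently used
            return self.hot[key]
        self.misses += 1
        value = self.backing_store[key]     # the slow path: go to "disk"
        self.hot[key] = value
        if len(self.hot) > self.capacity:
            self.hot.popitem(last=False)    # evict the least recently used record
        return value


# Assumed sample data: 1,000 records on "disk", room for only 2 in "RAM".
disk = {f"user:{i}": {"id": i} for i in range(1000)}
cache = HotDataCache(disk, capacity=2)
for key in ["user:1", "user:2", "user:1", "user:3", "user:2"]:
    cache.get(key)
```

With room for only two records, four of the five reads above still go to the backing store. Raising `capacity` buys more hits, and that is exactly the spend the paragraph above asks you to justify.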

But there is another option…

Distribute your data and processing.

Depending on the data store you use, each machine in your cluster (including significantly cheaper commodity hardware) could be used to break the problem into little pieces that are processed much more quickly, or in some cases the database can leverage the processing power of each machine to give you a near-linear performance increase as machines are added. I know… many of you who are dealing with sharded databases are seeing significantly reduced performance as your joins increase. But what if (and this is kind of the whole point of “Not Only SQL” or “No SQL” data technologies) you could eliminate joins? What if, by just switching your data model and programming paradigms a bit, you could access your data anywhere it lived, in milliseconds or less?
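As a rough illustration of breaking the problem into little pieces, the sketch below hash-partitions records across a handful of nodes and then combines per-node partial results. It is a toy under stated assumptions: `NODE_COUNT`, `node_for`, `load`, `local_total`, `cluster_total` and `lookup` are hypothetical names, the nodes are in-process dictionaries, and the thread pool merely stands in for separate machines. It is not how Objectivity/DB or any other product named here distributes data.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NODE_COUNT = 4  # hypothetical cluster size


def node_for(key):
    """Pick a node with a stable hash so a key always lands in the same place."""
    return zlib.crc32(key.encode("utf-8")) % NODE_COUNT


def load(records):
    """Spread records across the nodes; each node holds only its own slice."""
    nodes = [dict() for _ in range(NODE_COUNT)]
    for key, value in records.items():
        nodes[node_for(key)][key] = value
    return nodes


def local_total(node):
    """Work a node does on its own slice, with no cross-node join."""
    return sum(node.values())


def cluster_total(nodes):
    """Scatter the work to every node, then gather and combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        return sum(pool.map(local_total, nodes))


def lookup(nodes, key):
    """A single-key read touches exactly one node."""
    return nodes[node_for(key)][key]


# Assumed sample data: 100 orders with made-up amounts.
orders = {f"order:{i}": i * 10 for i in range(100)}
nodes = load(orders)
```

Here `cluster_total(nodes)` returns the same answer as summing every order in one place, but each node only ever scans its own slice, and `lookup(nodes, "order:7")` goes straight to the one node that owns the key.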

You most likely already use an object-oriented programming language (C# or Java), but you also most likely find yourself needing to normalize and map objects into a relational schema of rows and columns. The nice thing about the technological landscape today is that if your data really doesn’t need to live in rows and columns, you don’t have to force it.

Welcome to the New World.

The general consensus is: Memory is expensive. Disks are cheap. Developers are resorting to memory to overcome performance and other bottlenecks inherent in older and/or relational technologies. We often see memory being used as a short-term band-aid, a treatment of symptoms that doesn’t actually address the underlying disease.

No, relational databases aren’t a disease. Please don’t flame me. They do many things better than any other data technology. But they don’t do everything. The “one size fits all” approach is dead. If you need performance at scale, but don’t need to force all your data into rows and columns, and you also don’t see any return on investment in expensive memory solutions, then perhaps it is time to look at one of these new “NOSQL” products. It doesn’t take much effort to build a proof of concept, and see which problems you can solve with one of these products. Sure, you might need to take a polyglot application approach (which may also include some complexity issues), but in most cases I believe you’ll find you can achieve results that give you more freedom, fewer sleepless nights, and a system that just works.

If you need fast lookups of values, you can use a key-value store like Citrusleaf or Riak. If you’re dealing with related collections of objects that resemble a “document” then you can download Mongo, BigCouch or OrientDB. Need to walk a complex graph, where objects and connections (nodes and edges) can answer some deep social network analysis questions? Then get a graph database (we recommend InfiniteGraph of course) that treats edges as first class citizens and can traverse those connections thousands of times faster than a recursive join SQL query.
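For readers who have not worked with a graph model, here is a minimal sketch of what walking connections looks like when edges are stored as objects in their own right. It is plain Python and explicitly not the InfiniteGraph API: the `Edge` tuple, the `Graph` class, `within_hops` and the sample names are all hypothetical. The point is only that each hop follows stored edges directly instead of joining a table to itself once per level.

```python
from collections import deque, namedtuple

# An edge is its own record with a type, not just a foreign key in a row.
Edge = namedtuple("Edge", ["source", "target", "kind"])


class Graph:
    def __init__(self):
        self.outgoing = {}  # vertex -> list of Edge objects leaving it

    def add_edge(self, source, target, kind):
        edge = Edge(source, target, kind)
        self.outgoing.setdefault(source, []).append(edge)
        self.outgoing.setdefault(target, [])
        return edge

    def within_hops(self, start, max_hops):
        """Breadth-first walk returning {vertex: hops from start}."""
        seen = {start: 0}
        queue = deque([start])
        while queue:
            vertex = queue.popleft()
            if seen[vertex] == max_hops:
                continue  # do not expand past the hop limit
            for edge in self.outgoing.get(vertex, []):
                if edge.target not in seen:
                    seen[edge.target] = seen[vertex] + 1
                    queue.append(edge.target)
        del seen[start]
        return seen


# Assumed sample data: a tiny social graph.
g = Graph()
g.add_edge("ann", "bob", "knows")
g.add_edge("bob", "cat", "knows")
g.add_edge("cat", "dan", "knows")
g.add_edge("ann", "eve", "works_with")
```

Calling `g.within_hops("ann", 2)` reaches bob and eve at one hop and cat at two, and leaves dan out because he is three hops away. In a relational design, each extra hop would typically mean another self-join.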

This is the space where we play.

<- START shameless self-promotion –>

Objectivity/DB is the original distributed, massively scalable data management and object persistence product that we have sold into leading government and enterprise systems for roughly the past 20 years (we’re on version 10 of the product now).

Last year, we developed InfiniteGraph, an API above our distributed core which allows developers to easily handle their graph data problems without having to learn thousands of methods and all the bells and whistles of our core database. Just download InfiniteGraph (we offer a completely free version), install it, grab some sample code from our Developer Wiki to help start your project, and voilà!

We think InfiniteGraph can solve your relationship analytics, intelligence and social network analysis problems better than anything else out there. InfiniteGraph uses memory and cache to help you get the best performance on your live data, while also persisting relationship information to disk so you never have to worry about losing it all in a *flash* (pun intended).

If you need more of the data management functionality of Objectivity/DB, you can have that too. We’ll consult with and train you, and ensure you can make use of all the best practices we have learned and applied to the mission-critical government, security and intelligence, commercial, telecom, science and enterprise applications we have supported over the years.

<- END shameless self-promotion –>

On a related note about InfiniteGraph: we’re seeing complete applications built on InfiniteGraph, fully tested and deployed, in a few weeks on average. It wasn’t too long ago that it took many months, or even a year or more, just to build anything interesting, and even then the slightest breeze or sudden spike in traffic (which was nowhere near today’s “Digg Effect”) could send the whole thing crashing, sending every IT person, in the building or remote, scrambling from their dimly lit offices and Quake games to see what was the matter. And after each crisis was resolved (often temporarily), organizations found themselves wondering (again) how they could economically exploit RAM and other off-disk schemes to speed response from over-burdened relational databases.

It’s good to see this discussion continue… Using memory as a band-aid, given the cost and added complexity, is not a real solution. But as alternative data technologies continue to mature and become more mainstream, and as new and cost-efficient memory technologies are produced, I believe the band-aids will give way to truly amazing and blazingly fast systems that solve all our problems… until, that is, the continued exponential growth in data once again overloads those solutions, giving us a whole new set of problems to solve (again).

Thomas Krafft is the Director of Marketing at Objectivity, Inc. (the company behind Objectivity/DB and InfiniteGraph). He oversees all marketing efforts, including communications and PR, demand generation and content development. Having joined the company in 2008, Thomas brings diverse experience from more than 15 years of working with Fortune companies including Intuit and Veritas, successful startup ventures (including one acquired by Barnes & Noble), and hundreds of clients to whom he provided marketing and internet consulting for several years. Thomas holds a B.A. in Political Science, International Relations, from California Polytechnic State University, San Luis Obispo.