Last week, Dan Weinreb tipped me off to something very cool: Mike Stonebraker and a group of MIT/Brown/Yale colleagues are calling for a complete rewrite of OLTP DBMSs. And they have a plan for how to do it, called H-Store, as per a paper and an associated slide presentation.

On the system side, some of their most radical suggestions include:

No disks or other persistent storage at all.

No multi-threading.

No locks.

No redo logs (and perhaps not a lot of undo logs either).

Their programming wish list is equally dramatic. It includes:

All transactions implemented via a kind of stored procedure.

No more separate database language. SQL bad, Ruby good.

The relational model replaced by something more hierarchical.

Some limitations on the types of transactions allowed.

Mike and I have agreed for a while that current market-leading DBMS products aren’t optimal for much besides high-end OLTP. But now he’s saying that Oracle, SQL Server, and the like are utterly obsolete for OLTP as well.

There seem to be three main assumptions underlying the H-Store design, two of which Mike seems fairly certain of, and the third of which he regards as subject to further research. The first assumption is that there is no need any more for such a thing as a long-running transaction or cursor. Transactions aren’t held open any more for input from users at dumb terminals. Records aren’t sent down in batches to be scrolled through. Instead, transactions are fired off from web pages via internet protocols, with results being sent back upon transaction completion.

This assumption has three major consequences. First, multi-threading is no longer needed. That gets rid of huge overhead around connection pooling and b-tree consistency, to name just two areas. Second, traditional locking isn’t needed; H-Store relies on optimistic locking instead. Third, it’s a really pointless and high-overhead idea to call out to a separate data manipulation language like SQL. Instead, Mike favors programming languages that mix data manipulation into other logic, like Ruby on Rails (which will be used in the next iteration of H-Store), or the fourth-generation languages of the 1980s.
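To make the optimistic-locking idea concrete, here is a minimal sketch: each record carries a version stamp, and a transaction validates at commit time instead of acquiring locks up front. The names and structure are my own illustration, not H-Store's actual mechanism.

```python
# Sketch of optimistic concurrency control: each record carries a version
# number; a transaction validates at commit time instead of locking up front.
# All names here are illustrative, not H-Store's actual API.

class Record:
    def __init__(self, value):
        self.value = value
        self.version = 0

def optimistic_update(record, transform, max_retries=10):
    """Apply `transform` to record.value, retrying if another writer
    committed in between (detected via the version stamp)."""
    for _ in range(max_retries):
        read_version = record.version          # snapshot the version
        new_value = transform(record.value)    # compute without holding a lock
        if record.version == read_version:     # validate: nobody else committed
            record.value = new_value
            record.version += 1                # publish the new version
            return True
    return False                               # give up after too many conflicts

acct = Record(100)
optimistic_update(acct, lambda v: v - 30)      # debit 30
print(acct.value, acct.version)                # -> 70 1
```

In this single-threaded demo the validation always succeeds; the retry loop only matters when concurrent writers can bump the version between read and commit.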

The second of H-Store’s three big assumptions is that you can do without disks. Disk rotation is the big technical bottleneck of database management. While most other measures of computing performance double every 2 years or so, disk rotation speeds have only increased 12.5-fold in the past half century. And unlike the case of data warehousing, the nature of OLTP makes it pretty impossible to be disk-centric without doing huge numbers of random disk seeks.
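Some back-of-the-envelope arithmetic makes the rotation bottleneck concrete (ballpark figures for a 2008-era 15,000 RPM enterprise drive, not measurements):

```python
# Rough arithmetic on why random-seek OLTP workloads are disk-bound.
# Figures are ballpark numbers for illustration, not measurements.

rpm = 15_000                            # fast enterprise drive
half_rotation_ms = 60_000 / rpm / 2     # average rotational delay: 2 ms
seek_ms = 3.5                           # average seek time
io_time_ms = half_rotation_ms + seek_ms
iops = 1000 / io_time_ms                # random reads per second, per spindle
print(round(io_time_ms, 1), int(iops))  # -> 5.5 181

# A RAM access costs on the order of 100 ns, i.e. tens of thousands of
# times cheaper than one random disk I/O.
```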

The H-Store way of getting data persistence is simply to keep multiple RAM copies of the data, widely dispersed geographically (to protect against power grid failures and physical disasters), with the (presumably identical) individual systems being as robust as the user deems fit. Clearly, that allows for hugely faster processing than anything disk-centric, even with the same data access methods (i.e., b-trees). Changing the data structure – e.g., to something like solidDB’s – should provide yet further speedups.

That said, I don’t think going entirely without persistent storage is a great idea. There’s no way to be sure you designed out single points of failure – if nothing else, there are killer bugs, hacks, etc. And since it’s both radical and not-obviously-safe, I don’t think the no-persistence idea is likely to gain traction in the market. Fortunately for the H-Store project, persistence can easily be added in; if storage on magnetic or optical media is desired, it would be easy for one of the H-Store nodes to provide persistent checkpointing.
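A node doing persistent checkpointing could work roughly like this: periodically snapshot the in-memory state to durable media, off the transaction path. A minimal sketch, with an atomic rename so that a crash mid-write never corrupts the previous checkpoint (file names are made up):

```python
# Sketch of the checkpointing idea: one replica periodically snapshots its
# in-memory state to durable media, off the transaction path. Illustrative
# only; a real system would checkpoint incrementally and compress.

import json, os, tempfile

def checkpoint(state, path):
    # Write atomically: dump to a temp file, then rename over the old
    # checkpoint, so a crash never leaves a half-written file behind.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)               # atomic rename on POSIX

state = {"acct:42": 100}
checkpoint(state, "store.ckpt")
with open("store.ckpt") as f:
    print(json.load(f))                 # -> {'acct:42': 100}
```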

The third of H-Store’s three big assumptions – and the one that is called out as requiring further research – is that transactions fall primarily into certain specific categories. I want to understand that part better before making any attempt to write about it. Stay tuned.

Comments

I suggest you don’t ditch your WD/Seagate stocks just yet. Data persistence cannot be guaranteed with RAM alone, regardless of the number of replicas kept (but you can, on the other hand, get almost infinitely high availability figures with these). While some data may be classified as “transient” in nature, other data just has to be stored in non-volatile storage in order to survive a possible database catastrophe. This does not mean, however, that RAM’s contents can’t be persisted to disk for persistence’s sake and nothing else.

Also, I don’t think SQL can be dismissed this easily, even if it really is bad. Borrowing from the joke: “1 billion Chinese can’t be wrong”, how many SQL programmers are there? What about the number of legacy applications?

At the risk of advertising oneself, or rather one’s employer, I suggest you look at Xeround Sound. It’s a product whose design and implementation share many similarities with the H-Store proposal.

[…] Stonebraker isn’t done with creating new kinds of databases. He is working with a group of MIT/Brown/Yale scientists to completely rethink the OLTP database concept. Curt Monash has coverage of it in his DBMS2 blog. […]

The H-Store paper in fact references ANTs as foreshadowing some of the same ideas.

However, I find them a difficult company to deal with, so I’m not in a position to say much interesting about what ANTs actually does and doesn’t do in worthwhile products today.

CAM

bwtaylor on
February 19th, 2008 6:05 pm

The problem with doing away with persistence is that being geographically diverse introduces latency because the speed of light is finite. It isn’t enough to be in a different building; you’ve got to be on a different power grid, and that means being far away. This means really expensive networking. It also means that you may need to give up on global data consistency due to the CAP theorem. eBay does this kind of geographical redundancy, for example (and they use Oracle – that should tell you something), and the upshot of it is that you deal with use-case-specific synchronization. The complexities here dwarf those of RDBMS’s, and it’s only for the internet-scale outfits with 1000+ developers that this is viable.
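The latency point is easy to quantify. A back-of-the-envelope sketch, using the standard rule of thumb that light in fiber covers roughly 200 km per millisecond (everything else, routing and queueing included, is idealized away):

```python
# Why geographic dispersion costs latency: minimum round-trip time at the
# speed of light in fiber (~2/3 c), ignoring routing and queueing delays.

C_FIBER_KM_PER_MS = 200   # rule of thumb: ~200 km per millisecond in fiber

def round_trip_ms(distance_km):
    return 2 * distance_km / C_FIBER_KM_PER_MS

# Replica on a different power grid, say ~1,000 km away:
print(round_trip_ms(1_000))   # -> 10.0
```

A 10 ms round trip dwarfs a transaction measured in microseconds, which is why synchronous cross-grid replication is the contentious part of the design.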

If you want to have data consistency, you must be ACID compliant, you WILL have redo logs and locking. You can give up on data consistency and then use a strategy called BASE, but to suggest that this will be simpler is a pipe dream.

Dave Gudeman on
February 19th, 2008 6:11 pm

I don’t know what the issue is that you had with ANTs, Curt, but why don’t you write me directly and I’ll see if I can’t get it resolved.

Also, I don’t get the point about “no multi-threading”. Surely you can’t mean that we don’t need multi-processor hardware or that we don’t need parallel execution to take advantage of it?

Sorry, I assumed you could get at the email address that I entered in the comment form. You can reach me at dave DOT gudeman AT ants DOT com. I’m a dev manager and one of the original ADS engineers.

Sorry I missed your point on multi-threading. In my defense, I’m peripherally aware that other dbs have used multi-threading as a way to parallelize user activity, but all of my own work has involved high performance, and you just don’t do that when you want high performance, so it’s not what comes to mind when you say “multi-threading”. For high performance, you should only use threads to parallelize CPU activity; user operations should be parallelized using a different mechanism. In this model you don’t have to worry about connection pooling (which is basically a hack to get around the problems created by using threads to parallelize user activity), but you still have to worry about b-tree consistency and all of the other issues surrounding simultaneous access to shared data. You can’t avoid that problem, but you don’t need locks either. Locks are just one of a range of solutions.

Of course, Stonebraker’s research is very cool. It’s always interesting to take a step back and try to approach an old set of problems from a completely different angle.

The email address thing was a brain-cramp on my part. I saw that you didn’t leave a URL, and acted as if it’s an email address you hadn’t left.

As for multi-threading, now we’re more on the same page. However, the H-Store approach is so aggressive that they even wish away the b-tree consistency problem. If you assume that most transactions take 50 microseconds and the longer ones take 1 millisecond, that’s not totally crazy. However, as I’ve been noting, just how single-threaded they do or don’t wind up being is still a rather open question.

Ok – I have been a DBA for 11 years now, and this article hurt a little to read. I understand where H-Store is coming from, and it is definitely a radical idea. DBMSs need ideas like this in order to evolve, as they (or the companies developing them) are generally large lumbering beasts that resist significant change. I’m going to caveat this comment with the fact that I have used mostly Oracle DBs, from 7.3 all the way to 10g.

I don’t really believe that there are many databases that can survive without persistence. Someone mentioned eBay. Now eBay is not the best example for an application of this technology due to its large size and the nature of the business that they perform. A great deal of what they do there is examine trends over time, and that smells more OLAP than OLTP to me. There is certainly an OLTP aspect of their business. IMO, this is typical of most established businesses.

The reliability of network connections, power systems, etc. would be a real issue with a database stored entirely in RAM, regardless of replication efforts. Lose your link to the replication network and data doesn’t replicate. If you then lose power it is gone forever.

I do believe that H-Store (and other systems like it) can be effective in “front-ending” another DBMS based on persistent data. That way, you persist the data eventually but gain the performance benefits for the users.

As for replacing SQL…I have never understood the aversion to SQL. SQL is a DSL that is very well suited to its application. It can be wonky at times, and the syntax is not nearly as easy as I would like, but it is really nothing to be afraid of. I have yet to see any framework/language/DSL that comes close to providing the functionality offered by SQL in regards to manipulating sets of data.

The part about accessing data with stored procedures – YES. I use them all the time. I disagree about the Ruby bit, however (although I am a Ruby nut.) I am a big believer in using the right language for the right task, and Ruby doesn’t seem to be the best choice for the domain of set manipulation. I would, however, be happy if I could use Ruby inside the database for everything else.

suresh on
February 21st, 2008 7:32 am

Well, if this is something related to in-memory, then apart from the hierarchical concept this is nothing new compared to the Oracle in-memory database product, which is already in production in the market.

I completely disagree about the need for disks. I know it seems counterintuitive, but think about it. The reason you need disks is for a redundant place to hold your storage, that won’t fail if your machine crashes. In H-store, you just do that with another machine with RAM.

Now, as bwtaylor very correctly pointed out, it is crucial that there be a copy of everything at a site with extremely independent failure modes. Indeed, you have to be on a different power grid.

This means that the claim that we can “scrap” all other OLTP systems is rather hyperbolic. H-Store technology is suited for heavy-duty applications with very serious high uptime requirements. Not everybody needs that, and not everybody can pay for a distant recovery site. But lots of companies really do need that for their business-critical systems, more and more each year as the real world becomes more dependent on computer technology and “the computer is down now” becomes less acceptable as an excuse for bad customer service.

Yes, that introduces latency, but only in the unusual case of a “disaster” scenario in which your entire primary site goes down; you arrange as much redundancy as possible at the primary site so that “disasters” are rare enough that the latency introduced is acceptable. After all, even with a conventional disk system, you can still have a “disaster” locally, and then your disks have no power, and do you no good whatsoever.

And there’s nothing that stops you from doing checkpoints to disk, asynchronously, just for extra safety. In real life, you’d probably be copying the data out to a data warehouse anyway, but you could write it to persistent storage in any form. The key point is that you need not use the disk to execute an OLTP transaction. Streaming it out in the background to a disk has no effect on that basic point.

He says “This means really expensive networking.” On the contrary, the good old Internet is just fine, since you make disasters rare. Remember, even with a conventional DBMS, disasters happen and had better be rare.

He says “You may have to give up on global data consistency due to the CAP theorem.” Theoretically, yes. In practice, it’s quite acceptable. And, one more time, you’re no worse off than with a conventional DBMS, which can also suffer from a “disaster”.

Use-case-specific synchronization isn’t so bad. Again, this hardly ever happens. You just fix it up manually from your recent paper records. Again, this can happen with any DBMS.

“It’s only for Internet-scale outfits”: yes. With “1000+ developers”? No. You buy the DBMS from a company, and they hire the developers, and it hardly takes 1000+ of them!

You’re just wrong that you need locking to achieve ACID behavior. I know it’s counter-intuitive, but this is really a radical idea. Please read the paper. Yes, you do need redo logs, but you can purge them after the transaction has committed on all copies; you don’t need a disk-based ARIES-style log.

Regarding SQL and the relational model: Curt has gotten it somewhat wrong here. H-Store is, in fact, relational, and is intended to use SQL (although they haven’t written that part yet). However, it works a lot better if the schema has certain characteristics, which have to do with “hierarchy”, but in a complicated way. You’d have to read the paper to understand all this.

When they say “no multi-threading”, what they mean is that for each core, there is only one query-execution thread. A multi-core processor would have many threads, but they would not interact any more than the threads running on separate machines. There are other kinds of auxiliary threads running around, if you look at the paper, for servicing request queues and such. But the key point is: one thread per core executing transactions. You do not need multiple threads because there is no benefit from concurrency, because you don’t ever wait for anything, there being no disk, and no communication during transaction execution. [Except in the case of “general transactions”; see the paper.]
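A toy sketch of that one-thread-per-core scheme: each partition’s data is owned by exactly one thread draining its own request queue, so no locks or latches are ever needed within a partition. This is my own simplified illustration, not H-Store code.

```python
# Sketch of single-threaded-per-partition execution: each worker owns its
# partition's data outright, so transactions on it run serially with no
# locks. Simplified illustration, not H-Store's actual architecture.

from queue import Queue
from threading import Thread

class PartitionWorker:
    def __init__(self):
        self.data = {}            # owned exclusively by this worker's thread
        self.queue = Queue()      # serialized stream of transactions

    def run(self):
        while True:
            txn = self.queue.get()
            if txn is None:       # shutdown sentinel
                break
            txn(self.data)        # runs to completion; never blocks on I/O or locks

workers = [PartitionWorker() for _ in range(4)]
threads = [Thread(target=w.run) for w in workers]
for t in threads:
    t.start()

def submit(key, txn):
    # Route each transaction to the single worker that owns the key.
    workers[hash(key) % len(workers)].queue.put(txn)

submit("acct:42", lambda data: data.update({"acct:42": 100}))
for w in workers:
    w.queue.put(None)             # shut down
for t in threads:
    t.join()
```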

B-tree consistency is no problem at all. No locks, no sharing. You just write a simple B-tree implementation and it just works. That makes H-store simpler than conventional DBMS’s.

Suresh: Yes, there is plenty new compared to Oracle’s existing product (TimesTen). Read the paper and you’ll see.

Interesting and smart point about the data warehouse back there. If the updates are getting streamed to a data warehouse system that does persistent storage itself, discussion over. There’s no absolute need for disk on the OLTP system. It’s just a different version of the TimesTen/Oracle split.

Even so — no matter what the configuration, I want removable backups. Should the dread bug or ueberhack come that corrupts data, I want an uncorrupted old copy, insulated by a very secure air gap.

I see backups and data persistence as different issues in the context of an in-memory database. As an explanation I offer the following (half-)analogy: You are a security officer in a big department store. Your job is sitting in the monitor room and watching the CCTV screens so you can report what’s happening. In front of you, you also have two buttons: Print and Freeze. The Print button causes the printer in the corner to dump a hardcopy of the images, and the Freeze button can be used to freeze the images on screen for any length of time and is auto-activated during power failures. This is where the analogy sort of breaks, because I also need you to assume that the store and everything in it can be entirely reconstructed from these images, possibly via the clever use of time-energy-matter manipulation and a VHS recorder. However, if you just need the data in these images…

A database is made mostly of data. A backup is triggered by an event (time-, user- or whatever-based) and allows you to take a “print” snapshot of your data. The snapshot can be kept anywhere, from a disk file to another (possibly also in-memory) database, and in any number of copies. Persistence of data, on the other hand, is the data’s ability to survive a power-off. While some storage devices (e.g., disks) inherently provide this feature, RAM does not, and therefore (some – see previous comment) data in it has to be saved to a persistent store. However, persistence alone does not guarantee you can roll back to a previous version of your data, but only to the point in time you “froze” it. Since persistence is (ideally) continuous, it is not the answer for uberhacks/dread bugs, as it will (and should) dutifully record these as well.

So, to conclude, I see no problem with in-memory data in terms of survivability if the database provides both data persistence and backup capabilities. In case of failure, recovery would consist of reloading the persisted data, whereas restoration could be used to recover from data corruption or a meteorite hitting the data center.

The H-Store scheme gets persistence from geographically diverse replicated RAM copies. That’s not crazy. One has to posit an extreme disaster scenario for it not to work. And in extreme disasters, one has backups.

To look at it another way — unless there’s a massive system outage encompassing half or more of the continental US, H-Store isn’t going down. And if there IS such an outage, customers are likely to be forgiving while you recover from disk.

Geographically-distributed RAM replicas are swell – we use them all the time. However, IMHO, they can only promise unrelenting availability of the data. My point is that these ensure data resiliency, not its persistence. Recovery from disk in both cases is taken for granted, the only differentiator being the data’s staleness. A backup is only current as of the point in time at which it was taken, while persistence usually provides fresher data.

The thing about *not* persisting to disk is that it only works for an isolated failure model, where failures of machines will be isolated to those machines. That’s usually the case for hardware and power failure, but not necessarily software failure.

If you are running all the same code on all the same OS, then your distributed system is still vulnerable to security failures (dos/intrusion/virus/worms), and time and data-induced bugs. To address these failure scenarios properly you would need an ecosystem of machines running different OS’s and independent implementations of the same software. Who can afford that, just to avoid writing to disk?

However, the H-Store guys do have one interesting rejoinder. Namely, assume that all the data is being streamed to a data warehouse (different software, probably WITH disk of its own). Then that argument is weakened a lot.

Even so — I think fairly frequent checkpointing is the way to go. Leaving it out doesn’t save enough to make up for the doubt caused by discussions like this.

It sounds like the assumption is that the data processing engine is basically going to just throw the data changes on a message bus or forward the data to a data warehouse staging area for upload.

So here’s a question… how will this “H-Store” view of the world be different from the raft of transaction coordinating middleware that’s out there already? Most MQ’s/TC’s are tuned for throughput rather than persistence, support 3/4GL’s instead of SQL and have little or no persistent storage. The only thing Mike &c are taking away is the ACID properties of the transactions?

I guess what I’m saying is – hasn’t this already been invented and stabilized as a discrete set of products for 10 years or more already?

1. ACID isn’t being taken away.
2. See my later post on ObjectGrid vs. H-Store. You’re not all wrong.
3. In essence, this is an attempt to roll a lot of other technologies into one, which can lead to savings in all areas of TCO (programming, admin, hardware/power, etc.)

CAM

Dave Gudeman on
February 26th, 2008 6:54 pm

Well, I’ve finally finished reading the paper. I guess I have to be reasonably polite since I’ve revealed my company affiliations, but I have to say that I’m skeptical.

As I said before, you always have to deal with b-tree consistency and other problems around simultaneous access to shared data. H-Store doesn’t avoid the problem; H-store simply replaces locking with partitioning (or I should say that H-store uses spatial partitioning as opposed to ADS which uses what might be called temporal partitioning). But spatial partitioning is not a magical solution; it has its own overheads:

1. Idle processors: each processor can only work on a particular subset of data, so if there is no work to do on that subset of data then the processor is idle.
2. Moving data: any work that requires sharing data among multiple processors requires moving data around.
3. Synchronization: any work that requires distributed commits has the overhead of synchronizing the distributed commit.

The claim of this paper is essentially that for “OLTP applications” 2 and 3 are either not significant portions of the work or can be largely eliminated by program re-writing. This claim remains to be proven and I rather doubt from my own experience that the claim is correct. The paper doesn’t address 1 at all.
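To put rough numbers on overheads 2 and 3, here is a toy message-count model using classic two-phase commit; the counts are my own illustration, not from the paper:

```python
# Rough message-count model for partitioning overheads: a single-partition
# transaction needs no coordination traffic, while a distributed commit
# (classic two-phase commit) costs extra messages per participant.
# Illustrative only; real protocols vary in their exact message counts.

def commit_messages(participants):
    if participants == 1:
        return 0                  # local commit, no coordination traffic
    # 2PC: prepare + vote + commit + ack, per participant
    return 4 * participants

print(commit_messages(1))   # -> 0
print(commit_messages(3))   # -> 12
```

The H-Store bet, as stated above, is that the vast majority of transactions land in the `participants == 1` case.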

Then there is the issue of having no persistent store. Other commenters have pointed out the problems with this.

My experience is that applications are never what the model says they should be. There was a comment in the paper to the effect that OLTP applications don’t do 5-way joins. I remember when we used to think that. We also thought that performance critical applications would use prepared statements and wouldn’t update indexed columns and would use tree-like schemas and other things. I fondly remember those days of optimistic naivete :-).

So, while I grant that Stonebraker knows much more about databases than I do, I’m skeptical as to whether this will work. ANTs also is able to get 80X performance over commercial databases for certain kinds of work loads. But when we do real applications, the actual benefit is more like 3X to 10X.

In reply to Dave Gudeman: Your comments are very cogent. Here’s my best take on what the answers would be if you put this to the H-store guys. (I am not one of them, so these are just my own best understanding.)

1. Idle processors: Yes, I think that’s true. I’m not sure it’s a problem. The real measure of a system is whether it gets the latency and throughput that you want. Saying that idle processors are a drawback is like saying that not using the whole disk is a drawback. So what if it leaves processors idle? Processors are cheap. Scaling is what matters. (You may or may not buy this point…)

2. Moving data: Yes, that’s right. The paper calls this “general transactions”: the ones where the different processors need to interact during the execution of commands. It admits that these won’t work well. The hope is that the great majority of the commands whose speed is critical can be done as single-node or one-shot transactions (see paper). The degree to which this will work depends a lot on your schema and your queries. In some cases, it’s very easy to make this work. For example, in the database of one prominent company I know, there are only two kinds of data: data that is specific to one user account, and a small amount of slow-changing configuration data. The former can be partitioned trivially, and the latter just gets replicated everywhere. Now all transactions can run on a single node. In other applications, it’s much harder to make this work. We won’t know until and unless H-store gets productized and applied to a wide variety of applications. This is certainly a potential drawback of the H-store approach.
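That two-kinds-of-data scheme can be sketched roughly like this (all names hypothetical): per-account data is hash-partitioned across nodes, while the small slow-changing configuration data is replicated everywhere, so any single-account transaction runs on exactly one node.

```python
# Sketch of the partitioning scheme described above: per-account data is
# hash-partitioned, small slow-changing config data is replicated to every
# node. Any single-account transaction then touches exactly one node.
# Hypothetical names throughout; not from the H-Store paper.

N_NODES = 4
CONFIG = {"fee_schedule": "standard"}      # replicated to all nodes

nodes = [{"accounts": {}, "config": dict(CONFIG)} for _ in range(N_NODES)]

def node_for(account_id):
    return nodes[hash(account_id) % N_NODES]

def deposit(account_id, amount):
    node = node_for(account_id)            # one node holds everything we need
    accounts = node["accounts"]
    accounts[account_id] = accounts.get(account_id, 0) + amount
    return accounts[account_id]

print(deposit("alice", 50))   # -> 50
print(deposit("alice", 25))   # -> 75
```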

3. Synchronization: Yes, you’re right. H-store has a synchronization mechanism. The one in the paper depends on guaranteed hard maximum realtime limits on network transmission, which is unfortunately not a condition obtainable in real-world circumstances. However, the authors of the paper are working on improving this (private communication). Assuming they do, the claim is that the overhead will be amortized over a relatively large amount of actual work per command, so that its percentage of overhead will be acceptably low. Also, a single-node read transaction can all happen on one processor so it doesn’t run into this problem, and probably the hope is that those will be relatively common.

What appears to be the big issue is that you need, as you say, to do program re-writing. You have to arrange things so that your commands to H-store are of high enough granularity (lots of actual work per command). Furthermore, a single command always performs one ACID transaction. So you can’t interleave application logic with transaction execution. Some applications will be easy to rewrite this way and some will be hard. Again, they’ll need more experience to see how this works out.

About “not having any persistent store”, please read my earlier post, which I feel takes care of this issue.

About doing 5-way joins: well, it’ll also depend a lot on the application. If the five-way joins are on five tables that are generally used together, they can be co-located on single hosts, and there’s nothing to stop H-store from acquiring a more sophisticated query optimizer down the line if it turns out that it’s needed more often than the paper says to expect. So I don’t think this is an inherent problem with the H-store concept.

On the whole, I agree that only time will tell, and it all depends on the specifics of the particular application and database structure.

I’m glad they’re working on better synchronization; despite quite a bit of time talking with them I never got to exactly that point.

Where you confused me is where you seemed to be saying that a stored procedure has to have all of the following properties:

A. No application logic.
B. Single ACID transaction.
C. Coarse-grained and doing a lot of work.

While that’s what I read, it surely can’t be exactly what you meant.

Best,

CAM

Dave Gudeman on
March 7th, 2008 10:02 pm

Good answers, Daniel. I guess we both agree that we’ll have to see.

I’d like to expand my point about idle processors. Basically, you can always make a database application faster by throwing hardware at it. For many applications, the big modern database products don’t scale very well with multiple machines, so typically you have to buy faster computers, which is expensive because price goes up much faster than computer speed. That is, to get a box that’s twice as fast, you pay lots more than twice as much. What H-Store is trying to take advantage of is that buying twice as many computers of the same speed costs exactly twice as much (less if you get bulk discounts). So it’s less expensive to have a database product that scales and buy several cheap computers to run it on than to buy one that doesn’t scale and buy big iron.

However, if you have idle processors, that changes the equation. If your processors are idle half the time then you have to buy four computers, not just two. Suddenly the equation doesn’t look quite so good for multiple computers. At that rate, you are better off going with a computer that is twice as fast, so long as it is less than four times as expensive.
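The break-even arithmetic from the two paragraphs above, spelled out (the utilization and throughput numbers are illustrative):

```python
# Break-even arithmetic for idle processors: if each machine is only partly
# utilized, the scale-out deployment needs proportionally more machines,
# which changes the comparison against one fast expensive box.

def machines_needed(target_throughput, per_machine_throughput, utilization):
    return target_throughput / (per_machine_throughput * utilization)

# Want 2 units of throughput from 1-unit machines running half idle:
print(machines_needed(2, 1, 0.5))   # -> 4.0
# Fully utilized, the same target needs only 2 machines:
print(machines_needed(2, 1, 1.0))   # -> 2.0
```

So at 50% utilization, the single machine that is twice as fast wins whenever it costs less than four of the cheap ones, exactly as Dave says.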

The H-Store claim is that they get an order of magnitude or whatever single-processor performance advantage over disk-based systems, even for database sizes that disk-based systems can put entirely in cache. (Indeed, the starting number is more like 15X, but that’s the really rose-colored view.) That’s supposed to make up for any parallelization awkwardness.

The same team did C-Store/Vertica, and they followed a similar approach. I’m not aware of Vertica having the same level of interprocessor data movement (or, better yet, data movement prevention) sophistication as some of its competitors. But based on their sales figures, it seems they’re winning a lot of POCs even so.

Great, this needs to be more widely distributed. Traditional IT managers are afraid to go for anything else but Oracle, DB2 etc. Good job on the part of these vendors. It is about time to shatter their ideas and destroy taboos.

We have worked on a new database/platform for the last 6 years; we have been running enterprises on an in-memory model with many of the points listed in this article. From our research we concur: the relational DB model is no good. It is not only about performance but also inflexibility and the resultant introduction of unnecessary complexity.

We have gone much further than all this. Combining these DB thoughts with semantics leads to all kinds of other weird and amazing properties for a DB. Radical reductions in the size of a DB are possible. Atomisation of databases increases speed, etc., etc. To discover what is right, all the angles have to be investigated. Ultimately a DB is just a component to keep/act on knowledge/information.

[…] technology. There also are strong similarities to the MPP in-memory row store project H-Store/VoltDB, although I don’t know whether Plattner would go so far as to adopt the H-Store view […]

Mike Chen on
November 12th, 2009 4:16 am

I’m pondering the validity of the TPC-C comparison in the H-Store paper. To achieve 70k transactions per second, one is required to run on a database of 200TB. There is no hope of any machine holding such a huge amount of data in memory any time soon. The paper did point out that real-world workloads don’t require such a huge data set. Is TPC-C being ridiculous, or is the assumption in the paper?

I don’t know what you’re talking about, because I don’t take benchmark descriptions seriously enough to read and remember them. See the “Benchmarking” section here or search on “TPC” to see what I mean.