You may have read somewhere that Facebook has introduced a new Social Inbox integrating email, IM, SMS, text messages, on-site Facebook messages. All-in-all they need to store over 135 billion messages a month. Where do they store all that stuff? Facebook's Kannan Muthukkaruppan gives the surprise answer in The Underlying Technology of Messages: HBase. HBase beat out MySQL, Cassandra, and a few others.

Why a surprise? Facebook created Cassandra and it was purpose built for an inbox type application, but they found Cassandra's eventual consistency model wasn't a good match for their new real-time Messages product. Facebook also has an extensive MySQL infrastructure, but they found performance suffered as data set and indexes grew larger. And they could have built their own, but they chose HBase.

HBase is a scaleout table store supporting very high rates of row-level updates over massive amounts of data. Exactly what is needed for a Messaging system. HBase is also a column based key-value store built on the BigTable model. It's good at fetching rows by key or scanning ranges of rows and filtering. Also what is needed for a Messaging system. Complex queries are not supported however. Queries are generally given over to an analytics tool like Hive, which Facebook created to make sense of their multi-petabyte data warehouse, and Hive is based on Hadoop's file system, HDFS, which is also used by HBase.

Facebook chose HBase because they monitored their usage and figured out what the really needed. What they needed was a system that could handle two types of data patterns:

A short set of temporal data that tends to be volatile

An ever-growing set of data that rarely gets accessed

Makes sense. You read what's current in your inbox once and then rarely if ever take a look at it again. These are so different one might expect two different systems to be used, but apparently HBase works well enough for both. How they handle generic search functionality isn't clear as that's not a strength of HBase, though it does integrate with various search systems.

Facebook is not going to standardize on a single database platform, they will use separate platforms for separate tasks.

I wouldn't sleep on the idea that Facebook already having a lot of experience with HDFS/Hadoop/Hive as being a big adoption driver for HBase. It's the dream of any product to partner with another very popular product in the hope of being pulled in as part of the ecosystem. That's what HBase has achieved. Given how HBase covers a nice spot in the persistence spectrum--real-time, distributed, linearly scalable, robust, BigData, open-source, key-value, column-oriented--we should see it become even more popular, especially with its anointment by Facebook.

Reader Comments (12)

Isn't it still use case driven Andy? Eventual consistency for an inbox when order is important may not be a good match, but it is for other problems. Operationally neither seem to be on the easy side, so plugging into your hadoop skills has a lot of value.

None of those three services has stopped using Cassandra, Andy. Twitter is furthering investment, just taking a slow approach. Facebook's investment has stagnated, and Digg saw some problems but is continuing to use Cassandra. Each use case should be considered individually, but FUD has no place in rational technology decisions.

Cassandra is suffering from its own PR strategy. It has been largely promoted as the system being used by Facebook, Digg, and Twitter. At first, this seemed to be a smart strategy, because countless hipsters are dumb enough to think that if some software is good for Google/Facebook/Twitter then it'll be good enough for its startups too. But Cassandra, as well as HBase, is immature software so the bumps will come, sooner or later. It's natural, albeit not fair, to blame the software infrastructure that supposedly would provide scalability and fault tolerance! To make things worse, Digg has not released ANY single detailed outage post-mortem post as has been done by Foursquare or even Google in the past.

But you are right when you say that Twitter is investing on Cassandra, mainly for new products/services. I am particularly curious to see their real-time analytical engine that is based on Cassandra, by the way. Hopefully, they realized that Cassandra is not mature enough to support its scale of working NOW, so they are working on a smaller scale until Cassandra gets rid of its problems/limitations (auto load balancing, compression, out of memory exceptions, poor client api, data corruption, gc storming, lack of a query language, etc). Some of those issues are already being addressed while others are still delayed. Cassandra problems go far beyond

When it comes to Digg, at least three of its Cassandra experts have jumped out of company, so the future of Cassandra in Digg seems problematic until stated otherwise. But this company is doomed anyway so let it go for good.

Finally, as you pointed out, Cassandra investment inside Facebook has stopped, so it's natural to suppose that it will be legacy system there, to say the best. If I understood correctly, FB used Cassandra to Inbox search and they will replace it with Messages service so what would be the use of Cassandra in there?

Nevertheless, I guess one year from now we'll have a better picture of how Cassandra is being used in those companies.

Vladimir: indeed HBase 0.20.6 is far less stable than the current trunk, which we're about to release as 0.90. All of Facebook's work has been incorporated in trunk - they've been working with the open source community for months.

This "pointless invective" is an eye opener, just that. I don't know how long you have been working in software industry, but if you have never witnessed someone arguing along the lines of "look, Ruby on Rails is used by Twitter and they are pretty happy with it, let's use it now!" then you don't have been in the software industry for long enough (or maybe you are an incredibly lucky person).

It's amazing how many engineers adopt a technology just because "X" company (you name it) uses it. Of course, nobody will shamelessly admit this, but it's the way the software business works. Just a few examples: EJB was overhyped, Ruby on Rails was overhyped, Linux was overhyped, Application Servers were overhyped, and now it's NoSQL systems' turn.

Many of the technologies cited above are pretty good, but it takes a LOT of balance between open mindness and conservativism not to fall into traps along the way (Digg?).

In time, engineers gain experience with "A" tech (WHEN to use, when NOT to use), hipsters are axed, the hype fades out, some "A" systems vanish into thin air, while the fittest "A" systems survive and grow (Darwinism is wonderful, isn't it?)

No, I took the liberty to name it so, but it's not the formal PR that you usually find in companies. It's an ad-hoc and word-of-mouth advertisement, an endless reverberation of news done by fanboys on twitter, quora, etc. "Look, the new real-time analytics service by Twitter is backed by Cassandra. Cool, isn't it? You should use Cassandra too!" The week is almost folding up, but this post is still getting lost of retweets. :)

Nevertheless, it's not Cassandra community's fault, because this is done by ALL NoSQL systems so far, as well as Hadoop folks too. The problem is when this is used against the software project. I doubt Twitter/Quora is a good channel to advertise software (best alternative: tech conferences), but twitter is so full of fanboys that you have to endure this. :(