Big Data Debate: Will HBase Dominate NoSQL?

HBase offers both scalability and the economy of sharing the same infrastructure as Hadoop, but will its flaws hold it back? NoSQL experts square off.

HBase is modeled after Google BigTable and is part of the world's most popular big data processing platform, Apache Hadoop. But will this pedigree guarantee HBase a dominant role in the competitive and fast-growing NoSQL database market?

Michael Hausenblas of MapR argues that Hadoop's popularity and HBase's scalability and consistency ensure success. The growing HBase community will surpass those of other open-source NoSQL projects and will work out the few technical wrinkles that remain.

Jonathan Ellis of DataStax, the support provider behind open-source Cassandra, argues that HBase's flaws are too numerous and too intrinsic to Hadoop's HDFS architecture to overcome. These flaws will forever limit HBase's applicability to high-velocity workloads, he says.

Read what our two NoSQL experts have to say, and then weigh in with your opinion in the comments section below.

For The Motion

Michael Hausenblas
Chief Data Engineer EMEA, MapR Technologies

Integration With Hadoop Will Drive Adoption

The answer to the question is a crystal-clear "Yes, but…"

In order to appreciate this response, we need to step back a bit and understand the question in context. Both Martin Fowler, in 2011, and Mike Stonebraker, in 2005, took up the polyglot persistence argument that "one size does not fit all."

Hence, I'm going to interpret the "dominant" in the question not in the sense of the market-share measures applied to relational databases over the past 10 years, but along the lines of: "Will Apache HBase be used across a wider range of use cases and have a bigger community behind it than other NoSQL databases?"

This is a bold assertion given that there are more than 100 different NoSQL options to choose from, including MongoDB, Riak, Couchbase, Cassandra and many, many others. But in this big-data era, the trend is away from specialized information silos to large-scale processing of varied data, so even a popular solution such as MongoDB will be surpassed by HBase.

Why? MongoDB has well-documented scalability issues, and with the fast-growing adoption of Hadoop, the NoSQL solution that integrates directly with Hadoop has a marked advantage in scale and popularity. HBase has a huge and diverse community under its belt in all respects: users, developers, multiple commercial vendors and availability in the cloud, the last through Amazon Web Services (AWS), for example.

HBase and Cassandra have a lot in common historically. HBase was created in 2007 at Powerset (later acquired by Microsoft), was initially part of Hadoop, and then became an Apache Top-Level Project. Cassandra originated at Facebook in 2007, was open sourced, incubated at Apache, and is nowadays also a Top-Level Project. Both HBase and Cassandra are wide-column key-value datastores that excel at ingesting and serving huge volumes of data while remaining horizontally scalable, robust and elastic.
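
The wide-column model both systems share can be pictured as a sorted map: row key to column family to column qualifier to timestamped versions of a value. The sketch below is a toy illustration of that data model only, not of either system's actual storage engine; all class and method names are hypothetical.

```python
import time
from collections import defaultdict

class WideColumnTable:
    """Toy model of a wide-column store: row key -> (family, qualifier)
    -> timestamped versions, newest first, with a version limit."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.rows = defaultdict(dict)  # row -> {(family, qualifier): [(ts, value), ...]}

    def put(self, row, family, qualifier, value, ts=None):
        ts = ts if ts is not None else time.time()
        cell = self.rows[row].setdefault((family, qualifier), [])
        cell.insert(0, (ts, value))
        cell.sort(key=lambda v: v[0], reverse=True)  # newest version first
        del cell[self.max_versions:]                 # drop versions past the limit

    def get(self, row, family, qualifier):
        versions = self.rows.get(row, {}).get((family, qualifier))
        return versions[0][1] if versions else None  # newest value wins

    def scan(self, start_row, stop_row):
        # Range scan over lexicographically sorted row keys
        for key in sorted(self.rows):
            if start_row <= key < stop_row:
                yield key, self.rows[key]

t = WideColumnTable()
t.put("user#100", "profile", "name", "Ada", ts=1)
t.put("user#100", "profile", "name", "Ada L.", ts=2)   # newer version wins
t.put("user#200", "profile", "name", "Grace", ts=1)
print(t.get("user#100", "profile", "name"))            # Ada L.
print([k for k, _ in t.scan("user#100", "user#999")])  # ['user#100', 'user#200']
```

Sorted row keys are what make the short range scans discussed later in this debate cheap in this model.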

There are philosophical differences in the architectures: Cassandra borrows many design elements from Amazon's Dynamo system, has an eventual consistency model and is write-optimized, while HBase is a Google BigTable clone with read optimization and strong consistency. An interesting proof point for the superiority of HBase is the fact that Facebook, the creator of Cassandra, replaced Cassandra with HBase for its internal use.

From an application developer's point of view, HBase is preferable because it offers strong consistency, which makes life easier. One of the misconceptions about eventual consistency is that it improves write speed: under sustained write traffic, latency suffers anyway, and one ends up paying the "eventual consistency tax" without getting its benefits.
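
Whether a replicated store behaves as strongly or eventually consistent comes down to quorum arithmetic: with N replicas, a read quorum R and a write quorum W, a read is guaranteed to see the latest acknowledged write only when R + W > N. A minimal sketch of that rule follows; the function name and the configuration comments are mine, not either project's API.

```python
def is_strongly_consistent(n, r, w):
    """R + W > N guarantees the read and write quorums overlap in at
    least one replica, so every read sees the latest acknowledged write."""
    return r + w > n

# HBase-style: one authoritative server per region (N=1, R=1, W=1).
print(is_strongly_consistent(1, 1, 1))  # True

# Eventual consistency: read one, write one, over 3 replicas.
print(is_strongly_consistent(3, 1, 1))  # False

# Quorum reads and writes over 3 replicas restore strong consistency,
# but each request now waits on 2 replicas: the "consistency tax"
# moves from background repair into request latency.
print(is_strongly_consistent(3, 2, 2))  # True
```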

There are some technical limitations with almost all NoSQL solutions, like compactions affecting consistent low latency, inability to shard automatically, reliability issues and long recovery times for node outages. Here at MapR, we've created a "next version" of enterprise HBase that includes instant recovery, seamless sharding and high availability, and that gets rid of compactions. We brought it into GA under the label M7 in May 2013 and it's available in the cloud via AWS Elastic MapReduce.

Last but not least, HBase has -- through its legacy as a Hadoop contribution project -- a strong and solid integration into the entire Hadoop ecosystem, including Apache Hive and Apache Pig.

In summary, HBase will be the dominant NoSQL platform for use cases that require fast, small-size updates and lookups at scale. Recent innovations have also provided architectural advantages that eliminate compactions and provide truly decentralized coordination.

Michael Hausenblas is chief data engineer, EMEA, at MapR Technologies. His background is in large-scale data integration research and development, advocacy and standardization.

Against The Motion

Jonathan Ellis
Co-founder & CTO, DataStax

HBase Is Plagued By Too Many Flaws

NoSQL includes several specialties such as graph databases and document stores where HBase does not compete, but even within its category of partitioned row store, HBase lags behind the leaders. The technical shortcomings driving HBase's lackluster adoption fall into two major categories: engineering problems that can be addressed given enough time and manpower, and architectural flaws that are inherent to the design and cannot be fixed.

Engineering Problems

-- Operations are complex and failure-prone. Deploying HBase involves configuring, at a minimum, a ZooKeeper ensemble, a primary HMaster, a secondary HMaster, RegionServers, an active NameNode, a standby NameNode, an HDFS quorum journal manager and DataNodes. Installation can be automated, but if it's too difficult to install without help, how are you going to troubleshoot it when something goes wrong during, for instance, RegionServer failover or a lower-level NameNode failure? HBase requires substantial expertise even to know what to monitor, and God help you if you need regular backups.

-- RegionServer failover takes 10 to 15 minutes. HBase partitions rows into regions, each managed by a RegionServer. The RegionServer is a single point of failure for its region; when it goes down, a new one must be selected and write-ahead logs must be replayed before writes or reads can be served again.
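
The recovery path described here, replaying the write-ahead log before a region can serve traffic again, can be sketched in miniature. This is a toy illustration of write-ahead logging in general, not HBase's actual implementation; all names are hypothetical.

```python
# Toy sketch of write-ahead logging and crash recovery: every edit is
# appended to a durable log before the in-memory store is updated, and
# replaying the log after a crash rebuilds the lost in-memory state.
class RegionSketch:
    def __init__(self):
        self.wal = []       # durable write-ahead log (survives a crash)
        self.memstore = {}  # in-memory state (lost on crash)

    def put(self, key, value):
        self.wal.append((key, value))  # log first...
        self.memstore[key] = value     # ...then apply

    def crash(self):
        self.memstore = {}  # memory is gone; the WAL is not

    def recover(self):
        # Replay the log in order. The region cannot serve reads or writes
        # until this finishes, which is why long WALs mean long failovers.
        for key, value in self.wal:
            self.memstore[key] = value

region = RegionSketch()
region.put("row1", "a")
region.put("row2", "b")
region.crash()
region.recover()
print(region.memstore)  # {'row1': 'a', 'row2': 'b'}
```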

-- Developing against HBase is painful. HBase's API is clunky and Java centric. Non-Java clients are relegated to the second-class Thrift or REST gateways. Contrast that with the Cassandra Query Language, which offers developers a familiar, productive experience in all languages.

-- The HBase community is fragmented. The Apache mainline is widely understood to be unstable. Cloudera, Hortonworks, and advanced users maintain their own patch trees on top. Leadership is divided and there is no clear roadmap. Conversely, the open-source Cassandra community includes committers from DataStax, Netflix, Spotify, Blue Mountain Capital, and others working together without cliques or forks.

Overall, the engineering gap between HBase and other NoSQL platforms has increased since I've been observing the NoSQL ecosystem. When I first evaluated them, I would have put HBase six months behind Cassandra in engineering progress, but today that lead has widened to about two years.

Architectural Flaws

-- Master-oriented design makes HBase operationally inflexible. Routing all reads and writes through the RegionServer master means that active/active asynchronous replication across multiple datacenters is not possible for HBase, nor can you perform workload separation across different replicas in a cluster. By contrast, Cassandra's peer-to-peer replication allows seamless integration of Hadoop, Solr and Cassandra with no ETL while allowing you to opt in to lightweight transactions in the rare cases when you need linearizability.

-- Failover means downtime. Even one minute of downtime is simply not acceptable in many applications, and this is an intrinsic problem with HBase's design; each RegionServer is a single point of failure. A fully distributed design instead means that when one replica goes down, there is no need for special-case histrionics to recover; the system keeps functioning normally with the other replicas and can catch up the failed one later.
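
The fully distributed alternative Ellis describes can be sketched as follows: any live replica serves reads, and writes destined for a down replica are queued as hints to be applied later. This is a toy illustration of the idea (roughly Dynamo-style hinted handoff), not Cassandra's actual implementation; all names are hypothetical.

```python
import random

# Toy sketch of a fully replicated store: one node's failure causes no
# downtime, because any live replica can serve, and the failed replica
# is caught up later from queued hints.
class ReplicaSet:
    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.down = set()

    def write(self, key, value):
        hints = []
        for i, replica in enumerate(self.replicas):
            if i in self.down:
                hints.append((i, key, value))  # apply to this replica later
            else:
                replica[key] = value
        return hints

    def read(self, key):
        live = [i for i in range(len(self.replicas)) if i not in self.down]
        return self.replicas[random.choice(live)].get(key)

rs = ReplicaSet()
rs.write("row1", "a")
rs.down.add(0)                  # one replica fails...
print(rs.read("row1"))          # ...reads still succeed: 'a'
hints = rs.write("row2", "b")   # writes still succeed too
print(rs.read("row2"))          # 'b'
print(len(hints))               # 1 hint queued for the failed replica
```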

The same design that makes HBase's foundation, HDFS, a good fit for batch analytics will ensure that it remains inherently unsuited for the high-velocity, random-access workloads that characterize the NoSQL market.

Jonathan Ellis is chief technology officer and co-founder at DataStax, where he sets the technical direction and leads Apache Cassandra as project chair.

At the recent 2014 Strata + Hadoop World conference in New York (mostly about Spark, really), one pattern emerged strongly: near-real-time, stream-based processing. Google, the original large-scale user of MapReduce, has moved to a streaming-based solution it calls MillWheel. As MapReduce is tossed out in favor of streaming frameworks such as Storm, Samza and Spark Streaming to achieve a competitive analytic advantage, I believe that technologies that are traditionally write-optimized will be used as the distributed data store.

Moreover, among all NoSQL databases, eventual consistency (the AP model) may provide the solution for a distributed data store. People can use message queues or Kafka brokers to handle blockages or bursts, but as a store, Cassandra has all the right tick marks against HBase for such a use case. With that said, processing frameworks such as Spark will be successful and will become the poster child of processing frameworks, as they offer caching and the ability to rebuild data from RDD lineage information. And for the analytics use cases (the primary driver of the big data market) that are shifting toward streaming, solutions will be better placed to scale and optimize with Cassandra.

At the same time, companies that worship data consistency may not be able to leverage Cassandra due to its inherently AP-based, eventually consistent design. They have no option but to go for something like HBase, but even that may add no more than a couple of 9s of consistency, since event semantics themselves are an issue that may be solved by an AMQP-enabled data ingestion framework. HBase will still be the database of choice for batch workloads where updates are frequent, but I see that diminishing over time.

In summary, it depends on how use cases shift in the future. But with current use cases, the trends are against HBase and in favor of Cassandra. Please do not take it personally. I am a Hadoop lover AND WANT HBase to succeed so that I do not have to relearn ;-)

Cassandra's "flexible data placement" (a.k.a. SSD support) is not that good. You put the whole column family onto SSD; eventually the CF will exceed the SSD size, and then what? It is not hot-data-set caching per se.

Valid point, yes. The argument went along these lines: FB created Cassandra in the first place, then replaced it with something else (which happened to be HBase). Not the strongest argument, I admit; more an indicator.

However, as I said in the first paragraph: it's all relative, really. One size doesn't fit all in the data storage and processing world (a.k.a. polyglot persistence). In this context I'd encourage everyone who hasn't done so already to read Stonebraker's excellent piece (from 2005!): http://citeseerx.ist.psu.edu/v...

I am not sure how this holds up as a proof point: "An interesting proof point for the superiority of HBase is the fact that Facebook, the creator of Cassandra, replaced Cassandra with HBase for their internal use." Why does Facebook choosing it mean that it's superior?

This is argument-from-authority logic. In other words, if most of what Facebook engineering does is right and they chose HBase, then it must be right. There is certainly no question as to whether or not Facebook is full of brilliant engineers. But there are plenty of other companies that do amazing things with technology who have made the decision to go with Cassandra. You can't say that HBase is a good choice simply because Facebook uses it.

"Compaction throttling to avoid spikes in application response time.- M7 does not have any Compactions - Done"

No compactions? Does M7 overwrite data in place?

The major issue with M3/M5/M7 is that it does not provide an easy migration/upgrade path from an existing Hadoop/HBase deployment to MapR's distribution. At least, this was the case in late 2011. Besides this, it's proprietary technology.

Mr. Ellis, everyone here understands that your analyses and opinions, as well as all the test results you refer to, are highly biased in favor of Cassandra. I lmao (yeah, I know some basic slang) when I read the PDF you posted a link to here. 90 ms read latency? Did the authors read the data from another data center? In the case of HBase, when all data fits in the block cache or OS page cache, the read latency is less than 1 ms (actually 0.4-0.5 ms on average). We (the company I work for) have been routinely running different workloads on HBase in dev, staging and production for more than three years already, and the stability, performance and feature set of HBase get better with every new version. For me (and for many others), the major advantages of HBase are:

1. Tight integration with the Hadoop/HDFS stack. I think it's the major one, and it will eventually bring HBase to the top of the NoSQL crowd.

2. Extensibility. Coprocessors are a very good feature for anyone trying to implement something more complex than a simple K-V lookup.

3. Can I say that HBase is more SQL-friendly? Phoenix, Hive?
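
The coprocessor extensibility mentioned in point 2 is essentially server-side hooks that run around each operation, so, for example, a secondary index can be maintained without any client changes. The sketch below is a toy observer pattern illustrating the idea in Python, not HBase's actual Java coprocessor API; all names are hypothetical.

```python
# Toy sketch of a coprocessor-style observer hook: code registered on
# the "server" side runs before and after each write.
class Table:
    def __init__(self):
        self.data = {}
        self.observers = []  # server-side hooks, coprocessor-style

    def register(self, observer):
        self.observers.append(observer)

    def put(self, key, value):
        for obs in self.observers:
            value = obs.pre_put(key, value)  # e.g. validate or transform
        self.data[key] = value
        for obs in self.observers:
            obs.post_put(key, value)         # e.g. maintain a secondary index

class IndexObserver:
    """Maintains a value -> keys secondary index, a classic coprocessor use."""
    def __init__(self):
        self.index = {}
    def pre_put(self, key, value):
        return value                         # pass the value through unchanged
    def post_put(self, key, value):
        self.index.setdefault(value, []).append(key)

table = Table()
idx = IndexObserver()
table.register(idx)
table.put("row1", "blue")
table.put("row2", "blue")
print(idx.index["blue"])  # ['row1', 'row2']
```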

HBase (properly tuned and configured) is unbeatable on write-heavy workloads. We can get far more than 1M writes per second from a 20-node cluster (not from 200, as the Netflix guy did). Yes, the cluster and clients are tuned and use all recommended performance tips. Complex? Maybe, but eventually everything will be available out of the box, without any additional tuning.

You are so proud of Cassandra's random-read "domination" (due mostly to the row cache in Cassandra and the lack thereof in HBase), but I would like to point out that Cassandra's caches (both key and row) are half-baked and the implementation is far from optimal (you still keep keys on the Java heap?). Sorry, I am not following the latest advancements in Cassandra development now. Moreover, the lack of a good block cache in Cassandra makes it less suitable for short scan operations (one of the reasons Facebook decided in favor of HBase). For me, personally, it's a deal breaker, because so many real customer workloads fall into the "short scan operation" category. Another deal breaker is the lack of real Hadoop integration.

Random read performance in HBase (I do not think it's really worse than Cassandra's) can be increased by introducing a RowCache into HBase, and when that happens, I think we will get an indisputable winner, Mr. Ellis. It's doable, and it is going to happen pretty soon.

"Dominant" doesn't discount the opportunity for diversity, though I'll admit it's a somewhat simplistic construction meant to spark debate. The question was NOT posed as an either/or. DataStax chose (for obvious reasons) to focus on HBase vs. Cassandra. I do think many people have big expectations for HBase because of its tie to Hadoop. Perhaps a bigger role will emerge if some of the flaws DataStax points to can be addressed.

A bit of a silly premise, and definitely not an either/or scenario: HBase will clearly be used when Hadoop is used -- end of story. Cassandra isn't going to displace HBase, but it will co-exist to handle other, related use cases more elegantly. Plus, MongoDB will be used as a more modern alternative to MySQL, HANA will be used to fly through SAP analytics, and MarkLogic excels at content-oriented apps. And there are several dedicated cloud databases too. The NoSQL (Not Only SQL) movement gains strength from diversity, and it has pushed Oracle, IBM and Microsoft to offer up columnar options, for example. But at this point NONE of the NoSQL databases could be considered dominant, and despite the growing popularity of Hadoop, there is no way HBase is going to extend into a more general-purpose DB: it lacks the architectural chops (pointed out nicely by Mr. Ellis), and it lacks the expertise base with the chops.

When the day comes that there are more production Hadoop implementations than SAS, ODW, Teradata, IBM's many options, SAP BW and HANA, MicroStrategy, Tableau, etc., etc., combined -- well, maybe then we can talk about dominance down one DNA strand of the industry. That will take quite a while.