Tag Archive for 'hbase'

Following up on the recent blog about Hadoop Summit 2014, I wanted to share an update on the state of consensus-based replication (CBR) for HBase. As some of our readers might know, we are working on this technology directly in the Apache HBase project. As you may also know, we are big fans and proponents of strong consistency in distributed systems; however, I think the phrase “strong consistency” is a made-up tautology, since anything else should not be called “consistency” at all.

When we first looked into availability in HBase, we noticed that it relies heavily on the ZooKeeper (ZK) layer. There’s nothing wrong with ZK per se, but the current implementation makes it an integral part of the HBase source code. This makes sense from a historical perspective, since ZK has been virtually the only technology providing shared-memory storage and distributed coordination for most of HBase’s lifespan. JINI, developed back in the day by Bill Joy, is worth mentioning in this regard, but I digress and will leave that discussion for another time.

The idea behind CBR is pretty simple: instead of trying to sync all replicas in the system after an operation has already been applied (post-factum), such a system coordinates the intent of the operation. If consensus on the feasibility of an operation is reached, each node applies it independently. If consensus is not reached, the operation simply won’t happen. That’s pretty much the whole philosophy.
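The intent-first flow above can be sketched in a few lines. This is purely illustrative (a toy majority vote in one process), not WANdisco’s implementation or any HBase API; all names here are made up for the example.

```python
# Toy sketch of consensus-based replication (CBR): nodes vote on the
# *intent* of an operation before anyone applies it, instead of syncing
# replicas after the fact.

class Node:
    def __init__(self, name):
        self.name = name
        self.state = {}

    def vote(self, op):
        # A node agrees if the operation looks feasible from its point of
        # view. Toy rule: reject deletes of keys it has never seen.
        if op[0] == "delete" and op[1] not in self.state:
            return False
        return True

    def apply(self, op):
        # Once consensus is reached, every node applies the operation
        # independently and deterministically.
        kind, key, *rest = op
        if kind == "put":
            self.state[key] = rest[0]
        elif kind == "delete":
            del self.state[key]


def propose(nodes, op):
    # Majority agreement on the intent; if it is not reached, the
    # operation simply does not happen anywhere.
    votes = sum(1 for n in nodes if n.vote(op))
    if votes * 2 > len(nodes):
        for n in nodes:
            n.apply(op)
        return True
    return False


cluster = [Node(f"n{i}") for i in range(3)]
propose(cluster, ("put", "row1", "v1"))   # accepted, applied on every node
propose(cluster, ("delete", "row9"))      # rejected: no node knows row9
```

Because rejected proposals are never applied anywhere, the replicas cannot drift apart: they either all perform the operation or none of them do.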

Now, the details are more intricate, of course. We think that CBR is beneficial for any distributed system that requires strong consistency (learn more on the topic from the recent Michael Stonebraker interview [6] on Software Engineering Radio). In the Hadoop ecosystem, this means that HDFS, HBase, and possibly other components can benefit from a common API to express the coordination semantics. Such an approach will help accommodate a variety of coordination engine (CE) implementations specifically tuned for network throughput, performance, or low latency. Introducing this concept to HBase is somewhat more challenging, however, because unlike HDFS it doesn’t have a single HA architecture: the HMaster fail-over process relies solely on ZK, whereas HRegionServer recovery additionally depends on write-ahead log (WAL) splitting. Hence, before any meaningful progress on CBR can be made, we need to abstract most, if not all, concrete implementations of ZK-based functionality behind a well-defined set of interfaces. This will provide the ability to plug in alternative concrete CEs as the community sees fit.
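The interface-extraction step described above might look something like the following sketch. Every name here (CoordinationEngine, InMemoryEngine, submit_proposal) is an assumption invented for illustration, not an actual HBase or ZooKeeper API; the point is only the shape of the abstraction.

```python
# Hedged sketch: HBase code would talk to a CoordinationEngine interface,
# and concrete engines (ZK-backed, or one tuned for latency or throughput)
# plug in behind it without touching the calling code.

from abc import ABC, abstractmethod


class CoordinationEngine(ABC):
    """What the calling code would program against instead of ZK directly."""

    @abstractmethod
    def submit_proposal(self, proposal) -> bool:
        """Return True once the proposal is agreed, False if rejected."""

    @abstractmethod
    def register_listener(self, callback):
        """Deliver agreed proposals, in order, to every member."""


class InMemoryEngine(CoordinationEngine):
    """Trivial single-process engine: every proposal trivially agrees.
    A ZK-backed engine would implement the same two methods."""

    def __init__(self):
        self.listeners = []

    def submit_proposal(self, proposal) -> bool:
        # Agreed immediately; fan out to listeners in submission order.
        for cb in self.listeners:
            cb(proposal)
        return True

    def register_listener(self, callback):
        self.listeners.append(callback)


# Usage: fail-over logic expressed against the interface, so a different
# concrete CE can be swapped in as the community sees fit.
engine = InMemoryEngine()
log = []
engine.register_listener(log.append)
engine.submit_proposal(("become_active_master", "master-1"))
```

The design point is that only the two abstract methods are visible to callers, so the concrete engine can be chosen per deployment.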

Below you can find the slides from my recent talk at the HBase Birds of a Feather session during Hadoop Summit [1] that cover the current state of development. References [2-5] will lead you directly to the ASF JIRA tickets that track the project’s progress.

This month, we launched a trio of innovative Hadoop products: the world’s first production-ready distro; a wizard-driven management dashboard; and the first and only 100% uptime solution for Apache Hadoop.

We started this string of Big Data announcements with WANdisco Distro (WDD), a fully tested, free-to-download version of Apache Hadoop 2. WDD is based on the most recent Hadoop release, includes all the latest fixes, and undergoes the same rigorous quality assurance process as our enterprise software solutions.

This release paved the way for our enterprise Hadoop solutions, and we announced the WANdisco Hadoop Console (WHC) shortly after. WHC is a plug-and-play solution that makes it easy for enterprises to deploy, monitor and manage their Hadoop implementations, without the need for expert HBase or HDFS knowledge.

The final product in this month’s Big Data announcements was WANdisco Non-Stop NameNode. Our patented technology makes WANdisco Non-Stop NameNode the first and only 100% uptime solution for Hadoop, offering a string of benefits for enterprise users:

Automatic failover and recovery

Automatic continuous hot backup

Removes single point of failure

Eliminates downtime and data loss

Every NameNode server is active and supports simultaneous read and write requests

Full support for HBase

To support the needs of the Apache Hadoop community, we’ve also launched a dedicated Hadoop forum. At this forum, users can get advice on their Hadoop installation and connect with fellow users, including WANdisco’s core Apache Hadoop developers Dr. Konstantin V. Shvachko, Dr. Konstantin Boudnik, and Jagane Sundar.

For Apache Subversion users, we announced the next webinars in our free training series:

Hook Scripts – how to use hook scripts to automate tasks such as email notifications, backups and access control

Advanced Hook Scripts – an advanced look at hook scripts, including using a config file with hook scripts and passing data to hook scripts

We’ve announced an ongoing series of free webinars, which demonstrate how you can overcome the challenges of enterprise Subversion deployments from an administrative, business and IT perspective, and get the most out of deploying Subversion in an enterprise environment. These ‘Scaling Subversion for the Enterprise’ webinars will be conducted by our expert Solution Architect three times a week (Tuesday, Wednesday and Thursday) at 10.00am PST/1.00pm EST, and will cover:

The latest technology that can help you overcome the limitations and risks associated with globally distributed deployments

Answers to your business-specific questions

How to solve critical issues

The free resources and offers that can help solve your business challenges

We’re pleased to announce the release of the WANdisco Non-Stop NameNode, the only 100% uptime solution for Apache Hadoop. Built on our patented Non-Stop technology, it removes the NameNode as Hadoop’s single point of failure, delivering immediate and automatic failover and recovery whenever a server goes offline, without any downtime or data loss.

“This announcement demonstrates our commitment to enterprises looking to deploy Hadoop in their production environments today,” said David Richards, President and CEO of WANdisco. “If the NameNode is unavailable, the Hadoop cluster goes down. With other solutions, a single NameNode server actively supports client requests and complex procedures are required if a failure occurs. The Non-Stop NameNode eliminates those issues and also allows for planned maintenance without downtime. WANdisco provides 100% uptime with unmatched scalability and performance.”

Additional benefits of Non-Stop NameNode include:

Every NameNode server is active and supports simultaneous read and write requests.

All servers are continuously synchronized.

Automatic continuous hot backup.

Immediate and automatic recovery after planned or unplanned outages, without the need for administrator intervention.

Protection from “split-brain,” where the backup server becomes active before the active server is completely offline, which can result in data corruption.

Full support for HBase.

Works with Apache Hadoop 2.0 and CDH 4.1.
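The split-brain protection in the list above rests on a simple quorum principle, which can be sketched as follows. This is an illustration of the general technique only, not WANdisco’s actual implementation.

```python
# Illustrative only: a common way to prevent split-brain is to require a
# majority quorum before a server may act as active. A backup that cannot
# reach a majority (e.g. during a network partition) refuses to promote
# itself, so two "active" servers cannot diverge.

def may_become_active(reachable_peers: int, cluster_size: int) -> bool:
    # Count yourself plus the peers you can reach; act only with a majority.
    return (reachable_peers + 1) * 2 > cluster_size

# In a 3-node cluster, a server cut off from both peers must stay passive.
may_become_active(0, 3)  # False
may_become_active(1, 3)  # True
```

Since at most one side of any partition can hold a majority, at most one server can ever consider itself active at a time.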

“Hadoop was not originally developed to support real-time, mission critical applications, and thus its inherent single point of failure was not a major issue of concern,” said Jeff Kelly, Big Data Analyst at Wikibon. “But as Hadoop gains mainstream adoption, traditional enterprises rightly are looking to Hadoop to support both batch analytics and mission critical apps. With WANdisco’s unique Non-Stop NameNode approach, enterprises can feel confident that mission critical applications running on Hadoop, and specifically HBase, are not at risk of data loss due to a NameNode failure because, in fact, there is no single NameNode. This is a major step forward for Hadoop.”

If you’d like to get first-hand experience of the Non-Stop NameNode and are attending the Strata Conference in Santa Clara this week, you can find us at booth 317, where members of the WANdisco team will be doing live demos of Non-Stop NameNode throughout the event.

We are pleased to announce the latest release in our string of Big Data announcements: the WANdisco Hadoop Console (WHC). WHC is a plug-and-play solution that makes it easy for enterprises to deploy, monitor and manage their Hadoop implementations, without the need for expert HBase or HDFS knowledge.

This innovative Big Data solution offers enterprise users:

An S3-enabled HDFS option for securely migrating from Amazon’s public cloud to a private in-house cloud

An intuitive UI that makes it easy to install, monitor and manage Hadoop clusters

Full support for Amazon S3 features (metadata tagging, data object versioning, snapshots, etc.)

The option to implement WHC in either a virtual or physical server environment

Improved server efficiency

Full support for HBase

“WANdisco is addressing important issues with this product including the need to simplify Hadoop implementation and management as well as public to private cloud migration,” said John Webster, senior partner at storage research firm Evaluator Group. “Enterprises that may have been on the fence about bringing their cloud applications private can now do so in a way that addresses concerns about both data security and costs.”