Hadoop, bigdata, cloud computing and mobile BI


Monthly Archives: July 2012

HBase is a NoSQL database. It is based on Google’s Bigtable distributed storage system, which the Google research paper describes as follows: “A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.” If you want a detailed explanation of what each word in this scary definition means, I suggest checking out this post.
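To make that definition a bit more concrete, here is a toy sketch of the logical data model only (a plain in-memory Java class of my own, purely illustrative — it has nothing to do with how Bigtable or HBase actually store data): a sorted map indexed by row key, column key, and timestamp, holding uninterpreted byte arrays.

```java
import java.util.Collections;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of Bigtable's logical view:
// (row : String, column : String, timestamp : long) -> byte[]
// Rows and columns sort ascending; timestamps sort descending,
// so the first entry of each cell is the newest version.
public class BigtableModel {
    private final NavigableMap<String,
            NavigableMap<String, NavigableMap<Long, byte[]>>> map = new TreeMap<>();

    public void put(String row, String col, long ts, byte[] value) {
        map.computeIfAbsent(row, r -> new TreeMap<>())
           .computeIfAbsent(col, c -> new TreeMap<>(Collections.reverseOrder()))
           .put(ts, value);
    }

    // Latest version of a cell, or null if absent
    // ("sparse": cells that were never written cost nothing).
    public byte[] getLatest(String row, String col) {
        NavigableMap<String, NavigableMap<Long, byte[]>> cols = map.get(row);
        if (cols == null) return null;
        NavigableMap<Long, byte[]> versions = cols.get(col);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        BigtableModel t = new BigtableModel();
        t.put("com.cnn.www", "contents:", 5L, "<html>v1".getBytes());
        t.put("com.cnn.www", "contents:", 6L, "<html>v2".getBytes());
        System.out.println(new String(t.getLatest("com.cnn.www", "contents:")));
    }
}
```

The nesting mirrors the three index dimensions in the definition; reading a cell returns the most recent timestamped version, which is also HBase’s default read behaviour.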

HBase scales far beyond the capabilities of a traditional RDBMS: it supports automatic sharding and massively parallel processing via MapReduce. HBase is built on top of HDFS and provides fast lookups over large datasets. See more details about the HBase architecture here.

HBase can be used as both a data source and a data sink for MapReduce jobs. The example in this post will use HBase as a data sink. If you are interested in other examples, have a look at the Hadoop wiki: HBase as MapReduce job data source and data sink.

HBase distributed storage for stock price information

The example is going to process Apple stock prices downloaded from the Yahoo Finance website; this is the same dataset – Apple stock prices – that we used previously to demonstrate Hive capabilities on Amazon Elastic MapReduce. It is stored in an AWS S3 bucket called stockprice. The MapReduce job will retrieve the file from there using an s3n://<AWS Access Key ID>:<AWS Secret Access Key>@bucket/object URL and will store the output in an HBase table called aapl_marketdata. The test environment was based on Hadoop-0.20.2 and HBase-0.90.6.
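Each line of the downloaded file is a Yahoo Finance historical-prices CSV record (the usual layout of that export is `Date,Open,High,Low,Close,Volume,Adj Close`). The small helper below — the class and method names are my own, not from this post — sketches how one line could be turned into the row key and column values that the job would write into the aapl_marketdata table, using the trade date as the row key so rows sort chronologically.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Parses one line of a Yahoo Finance historical-prices CSV
// (assumed layout: Date,Open,High,Low,Close,Volume,Adj Close)
// into the row key and column values for an HBase write.
// Names are illustrative, not taken from the original post.
public class StockLineParser {
    public static final String[] COLUMNS =
            {"open", "high", "low", "close", "volume", "adjClose"};

    // Row key: the trade date, e.g. "2012-07-20".
    public static String rowKey(String csvLine) {
        return csvLine.split(",")[0];
    }

    // Column qualifier -> value, skipping the leading date field.
    public static Map<String, String> columns(String csvLine) {
        String[] fields = csvLine.split(",");
        Map<String, String> out = new LinkedHashMap<>();
        for (int i = 0; i < COLUMNS.length; i++) {
            out.put(COLUMNS[i], fields[i + 1]);
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "2012-07-20,603.21,609.68,600.25,604.30,20206000,604.30";
        System.out.println(rowKey(line) + " -> " + columns(line));
    }
}
```

In the actual MapReduce job, the mapper would wrap each parsed line in an HBase `Put` keyed by the date and emit it to the output table.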

Now we are ready to run the MapReduce job. It is advisable to have a driver that runs the job and sets all the required arguments, for easier configuration, but in essence it is just plain old Java code.
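A minimal driver might look like the sketch below. This is an outline, not the post’s actual code: the class names are assumptions, it follows the map-only “HBase as sink” pattern from the HBase reference guide (`TableMapReduceUtil.initTableReducerJob` with a null reducer and zero reduce tasks), and it will only compile and run against a real Hadoop/HBase installation, so treat it as job wiring rather than a drop-in implementation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Sketch of a driver that reads the stock price CSV (e.g. from an
// s3n:// URL passed as args[0]) and writes into the aapl_marketdata
// HBase table -- HBase acting as the MapReduce data sink.
public class StockImportDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "aapl-stock-import");
        job.setJarByClass(StockImportDriver.class);

        // Input: the CSV file location, e.g. the s3n:// URL above.
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // StockImportMapper (hypothetical) parses each CSV line and
        // emits (ImmutableBytesWritable rowKey, Put) pairs.
        job.setMapperClass(StockImportMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        // Output: the HBase table; null reducer + zero reduce tasks
        // makes this a map-only import straight into the table.
        TableMapReduceUtil.initTableReducerJob("aapl_marketdata", null, job);
        job.setNumReduceTasks(0);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With the HBase jars on the job classpath, such a driver is launched like any other Hadoop job via `hadoop jar`.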

Amazon Web Services recently launched HBase on its Elastic MapReduce service. It runs on the Amazon distribution of Hadoop 0.20.205 (as of this writing, it is not yet available on the MapR M3 or M5 distributions).

You can configure it using the Create a New Job Flow menu:

Then select the EC2 instance type (the instances need to be Large or bigger). If you like, you can also add Hive or Pig:

Then you can define EC2 key pairs (if you want to log in to the instances using SSH, you need to add your key):