MapReduce Integration

Through this part of the HBase tutorial you will learn about HBase integration with MapReduce: the framework classes (InputFormat, Mapper, Reducer, OutputFormat), the supporting classes, running MapReduce over HBase, static and dynamic provisioning, and using HBase as a data source and sink.

HBase MapReduce Integration Examples

One of the great features of HBase is its tight integration with Hadoop’s MapReduce framework.

Framework

MapReduce Introduction

MapReduce was designed to solve the problem of processing data in excess of terabytes in a scalable way. There should be a way to build such a system that increases in performance linearly with the number of physical machines added, and that is what MapReduce strives to do. It follows a divide-and-conquer approach, splitting the data located on a distributed filesystem so that the available servers (or rather CPUs, or, in more modern terms, cores) can each access a chunk of the data and process it as fast as they can. The drawback of this approach is that you have to consolidate the results at the end; again, MapReduce has this built right in.

InputFormat

The first stage is the InputFormat class. It splits the input data and returns a RecordReader instance that defines the classes of the key and value objects, and provides a next() method used to iterate over each input record.

Mapper

In this step, each record read by the RecordReader is processed by the map() method.
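As a sketch of what such a map() implementation might look like, the following class extends HBase's TableMapper helper. The table layout, the "data:author" column, and the class name are assumptions made purely for illustration:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Sketch: emit (author, 1) for every row that has a "data:author" column.
public class AuthorCountMapper extends TableMapper<Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private static final byte[] FAMILY = Bytes.toBytes("data");
  private static final byte[] QUALIFIER = Bytes.toBytes("author");

  @Override
  protected void map(ImmutableBytesWritable row, Result columns, Context context)
      throws IOException, InterruptedException {
    // Each call receives one row of the scanned table as a Result instance.
    byte[] value = columns.getValue(FAMILY, QUALIFIER);
    if (value != null) {
      context.write(new Text(Bytes.toString(value)), ONE);
    }
  }
}
```

Note that TableMapper fixes the input key and value types for you: the key is always the row key as an ImmutableBytesWritable, and the value is the row's columns as a Result.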

Reducer

The Reducer stage and class hierarchy are very similar to those of the Mapper stage. This time we take the output of a Mapper class and process it after the data has been shuffled and sorted.
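A matching Reducer sketch, using HBase's TableReducer helper so the result is written back to a table. The column names, the class name, and the use of the newer Put.addColumn() client API are assumptions for illustration:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Sketch: sum the counts per key and persist the total back to HBase.
public class CountToTableReducer
    extends TableReducer<Text, IntWritable, ImmutableBytesWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();  // consolidate the shuffled, sorted counts
    }
    // Emit a Put; the TableOutputFormat's record writer persists it.
    Put put = new Put(Bytes.toBytes(key.toString()));
    put.addColumn(Bytes.toBytes("data"), Bytes.toBytes("count"), Bytes.toBytes(sum));
    context.write(null, put);
  }
}
```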

OutputFormat

The final stage is the OutputFormat class, and its job is to persist the data in various locations. There are specific implementations that allow output to files, or to HBase tables in the case of the TableOutputFormat class. It uses a TableRecordWriter to write the data into the specific HBase output table.

Supporting Classes

The MapReduce support comes with the TableMapReduceUtil class, which helps in setting up MapReduce jobs over HBase. It has static methods that configure a job so that you can run it with HBase as the source and/or the target.
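A short sketch of what such a setup looks like for a job that reads from HBase. The table name and the mapper class are placeholders for a table and a TableMapper subclass you would define yourself; the Scan tuning values are common suggestions, not requirements:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class JobSetupSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "hbase-source-job");

    Scan scan = new Scan();
    scan.setCaching(500);       // fetch more rows per RPC for scan-heavy jobs
    scan.setCacheBlocks(false); // don't fill the region server block cache

    // One static call wires up TableInputFormat, the scan, the mapper,
    // and its output types. "mytable" and MyTableMapper are placeholders.
    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, MyTableMapper.class,
        Text.class, IntWritable.class, job);
  }
}
```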

MapReduce over HBase

Preparation

To run a MapReduce job that needs classes from libraries not shipped with Hadoop or the MapReduce framework, you’ll need to make those libraries available before the job is executed. You have two choices: static preparation of all task nodes, or supplying everything needed with the job.

Static Provisioning

For a library that is used often, it is useful to permanently install its JAR file(s) locally on the task tracker machines, that is, the machines that run the MapReduce tasks. This is done as follows:

Copy the JAR files into a common location on all nodes.

Add the JAR files, with their full paths, to the HADOOP_CLASSPATH variable in the hadoop-env.sh configuration file:
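For example (the paths are illustrative; adjust them to where the JARs live on your nodes):

```shell
# In hadoop-env.sh on every task node: prepend the shared HBase JARs
# to the classpath used by the MapReduce daemons.
export HADOOP_CLASSPATH="/usr/local/hbase/lib/hbase-client.jar:/usr/local/hbase/conf:$HADOOP_CLASSPATH"
```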

Obviously this technique is quite static, and every update (e.g., to add new libraries) requires a restart of the task tracker daemons.

Dynamic Provisioning

If you need to provide different libraries to each job you want to run, or you want to update the library versions along with your job classes, the dynamic provisioning approach is more useful.
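One way to do this, sketched below, is the helper that TableMapReduceUtil provides for shipping dependencies with the job itself: it adds the HBase JARs (and their transitive dependencies) to the job's distributed cache, so each task node receives them at run time without any local installation. The class and method names here are real HBase API; the surrounding class is a placeholder:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

// Dynamic provisioning sketch: instead of installing JARs on every node,
// ship the job's dependencies with the job via the distributed cache.
public class DynamicProvisioningSketch {
  static void configure(Job job) throws IOException {
    TableMapReduceUtil.addDependencyJars(job);
  }
}
```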

Data Source and Sink

The source or target of a MapReduce job can be an HBase table, and it is also possible for a job to use HBase as both input and output. In other words, a third kind of MapReduce template uses a table for both the input and output types. This involves setting the TableInputFormat and TableOutputFormat classes into the respective fields of the job configuration. This blog will help you get a better understanding of HBase!
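A driver for such a table-to-table job could be sketched as follows. The table names and the mapper/reducer classes (MySourceMapper, MySinkReducer) are placeholders for classes you would define; TableMapReduceUtil sets the TableInputFormat and TableOutputFormat fields for you behind the scenes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class TableToTableDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "table-to-table");
    job.setJarByClass(TableToTableDriver.class);

    Scan scan = new Scan();

    // HBase as the data source: configures TableInputFormat.
    TableMapReduceUtil.initTableMapperJob(
        "source_table", scan, MySourceMapper.class,
        Text.class, IntWritable.class, job);

    // HBase as the data sink: configures TableOutputFormat.
    TableMapReduceUtil.initTableReducerJob(
        "target_table", MySinkReducer.class, job);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```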