Wednesday, August 28, 2013

Apache Hadoop and Spring Data: Configuring a MapReduce Job

Spring for Apache Hadoop simplifies developing Apache Hadoop applications by providing a unified configuration model and easy-to-use APIs for HDFS, MapReduce, Pig, and Hive. It also provides integration with other Spring ecosystem projects such as Spring Integration and Spring Batch, enabling you to develop solutions for big data ingest/export and Hadoop workflow orchestration.

In this tutorial I am going to demonstrate how to configure a MapReduce job with Spring. The complete source code is available on the GitHub location.

I assume that your Hadoop cluster is up and running.

Let's set up a simple Java project using Maven and add the following dependencies in the pom.xml.
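A minimal sketch of the dependency section; the artifact versions shown here are assumptions and should be aligned with your cluster's Hadoop version:

```xml
<!-- Versions are illustrative; match them to your cluster. -->
<dependencies>
    <dependency>
        <groupId>org.springframework.data</groupId>
        <artifactId>spring-data-hadoop</artifactId>
        <version>1.0.1.RELEASE</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-core</artifactId>
        <version>1.2.1</version>
    </dependency>
</dependencies>
```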

In our main class we execute the job through an ApplicationContext, so we need to set one up. Before that, though, we need to provide the application-specific properties file, application.properties, as follows.
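The main class can be as small as the sketch below: it simply loads the Spring context from the classpath, which in turn triggers the configured job runner. The class and file names are assumptions, and running it requires a live Hadoop cluster:

```java
import org.springframework.context.support.AbstractApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

// Hypothetical driver class: loading the context starts the job runner,
// which submits the configured MapReduce job to the cluster.
public class SpringDataHadoop {
    public static void main(String[] args) {
        AbstractApplicationContext ctx =
                new ClassPathXmlApplicationContext("applicationContext.xml");
        // Close the context (and any Hadoop resources) cleanly on JVM exit.
        ctx.registerShutdownHook();
    }
}
```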

In the application.properties file we provide the NameNode location, the JobTracker location, and the input and output paths for the data. The input data can be downloaded from Here.
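As a sketch, the properties file might look like the following; the hosts, ports, and HDFS paths are placeholders you would replace with your own cluster's values:

```properties
# Hypothetical values; point these at your own NameNode and JobTracker.
hd.fs=hdfs://localhost:9000
mapred.job.tracker=localhost:9001
input.path=/user/hadoop/input
output.path=/user/hadoop/output
```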

Now we need to set up our ApplicationContext. In the provided code we supply the Hadoop configuration using hdp:configuration; the job setup is done under the job element, where we provide our input and output paths, our job driver class, and our Mapper and Reducer.
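A minimal sketch of that context file, assuming the property names from application.properties above; the mapper and reducer class names are placeholders for your own classes:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:context="http://www.springframework.org/schema/context"
       xmlns:hdp="http://www.springframework.org/schema/hadoop"
       xsi:schemaLocation="
         http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
         http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd
         http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

    <!-- Pull in application.properties so ${...} placeholders resolve -->
    <context:property-placeholder location="classpath:application.properties"/>

    <!-- Hadoop connection settings: NameNode and JobTracker -->
    <hdp:configuration>
        fs.default.name=${hd.fs}
        mapred.job.tracker=${mapred.job.tracker}
    </hdp:configuration>

    <!-- Job definition: I/O paths plus mapper and reducer classes -->
    <hdp:job id="mapReduceJob"
             input-path="${input.path}"
             output-path="${output.path}"
             mapper="com.example.MyMapper"
             reducer="com.example.MyReducer"/>
</beans>
```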

The next step is to configure our job runner, which will invoke the configured job. If you have multiple MapReduce jobs, you can configure them all in the same applicationContext.xml.
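The job runner is one more element in the same context file; this sketch assumes the job id used above, and job-ref can take a comma-separated list of job ids when you have several jobs:

```xml
<!-- Submits the referenced job(s) as soon as the context starts -->
<hdp:job-runner id="jobRunner"
                run-at-startup="true"
                job-ref="mapReduceJob"/>
```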

Now it's time to package our jar and run it on the cluster. For that I have provided assembly.xml under the resources folder; run mvn assembly:assembly, which will create a zip file in which you will find SpringDataHadoop.jar.
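For reference, a minimal assembly descriptor along these lines would bundle the project jar and its dependencies into a zip; the id, format, and output directory here are assumptions, not the exact descriptor from the repository:

```xml
<!-- Sketch of an assembly.xml for "mvn assembly:assembly" -->
<assembly xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2">
    <id>distribution</id>
    <formats>
        <format>zip</format>
    </formats>
    <includeBaseDirectory>false</includeBaseDirectory>
    <dependencySets>
        <dependencySet>
            <!-- Project jar plus dependency jars go under lib/ in the zip -->
            <outputDirectory>lib</outputDirectory>
            <useProjectArtifact>true</useProjectArtifact>
        </dependencySet>
    </dependencySets>
</assembly>
```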