Custom Input Format in Hadoop

In this post, we will look at how to implement a custom input format in Hadoop. As an example, we use the Titanic data set and solve the following problem statement.

Problem Statement:

Find the number of people who died and the number who survived, broken down by gender.

Here, we need a custom key that combines two columns: the 2nd column, which records whether the person died or survived, and the 5th column, which contains the person's gender. So, let’s prepare a custom key from these two columns and sort the keys by the gender column.

To begin with, we need to prepare our custom key. A custom key must implement the WritableComparable interface, so that Hadoop can serialize it and sort it. Below is the source code, which contains the implementation of the custom key.

In the compareTo method, we have written the logic that sorts the keys by the gender column. We use Guava's ComparisonChain class to compare the gender column first and then the 1st column of the key (the survival status). As a result, the keys come out sorted by the gender column.
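To make the comparison order concrete, here is a minimal sketch of such a key that runs with the plain JDK. The class name SurvivalGenderKey and its field names are hypothetical, and Comparator.comparing(...).thenComparing(...) is used as a stand-in for ComparisonChain so that no extra library is needed. The write and readFields methods have the same shape as the ones Hadoop's Writable interface requires; in a real job the class would additionally declare that it implements WritableComparable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Objects;

// Hypothetical key class. In a Hadoop job it would also declare
// "implements org.apache.hadoop.io.WritableComparable<SurvivalGenderKey>".
class SurvivalGenderKey implements Comparable<SurvivalGenderKey> {
    private String survived = ""; // 2nd column of the data set
    private String gender = "";   // 5th column of the data set

    // Gender first, then survival status (plain-JDK stand-in for ComparisonChain).
    private static final Comparator<SurvivalGenderKey> ORDER =
            Comparator.comparing((SurvivalGenderKey k) -> k.gender)
                      .thenComparing(k -> k.survived);

    SurvivalGenderKey() { } // no-argument constructor, used before readFields

    SurvivalGenderKey(String survived, String gender) {
        this.survived = survived;
        this.gender = gender;
    }

    // Serializes both fields, in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(survived);
        out.writeUTF(gender);
    }

    // Reads the fields back in the same order they were written.
    public void readFields(DataInput in) throws IOException {
        survived = in.readUTF();
        gender = in.readUTF();
    }

    @Override
    public int compareTo(SurvivalGenderKey other) {
        return ORDER.compare(this, other);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SurvivalGenderKey)) return false;
        SurvivalGenderKey k = (SurvivalGenderKey) o;
        return survived.equals(k.survived) && gender.equals(k.gender);
    }

    @Override
    public int hashCode() {
        return Objects.hash(survived, gender);
    }

    @Override
    public String toString() {
        return survived + " " + gender;
    }

    public static void main(String[] args) throws IOException {
        List<SurvivalGenderKey> keys = new ArrayList<>(Arrays.asList(
                new SurvivalGenderKey("1", "male"),
                new SurvivalGenderKey("0", "female"),
                new SurvivalGenderKey("0", "male"),
                new SurvivalGenderKey("1", "female")));
        // Sorted: female keys first, then male; within a gender, "0" before "1".
        Collections.sort(keys);
        System.out.println(keys);

        // Round trip through write/readFields to check the serialization.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        keys.get(0).write(new DataOutputStream(buffer));
        SurvivalGenderKey copy = new SurvivalGenderKey();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        System.out.println(copy.equals(keys.get(0)));
    }
}
```

Keeping equals and hashCode consistent with compareTo is a design choice in this sketch: a key whose hash code depends on both fields behaves predictably wherever keys are hashed as well as sorted.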

Note: If you compare only one column, keys that differ only in the other column compare as equal, so they are treated as a single key and their records are grouped together.
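The effect described in the note can be reproduced outside Hadoop. In the sketch below, a TreeMap stands in for the sort-and-group step, since it also decides key equality through the comparator alone. The class name, the helper method and the sample keys are all made up for illustration; each key is a two-element array holding the survival status and the gender.

```java
import java.util.Comparator;
import java.util.TreeMap;

class OneColumnCompareDemo {
    // Compares the gender only (index 1 of each key).
    static final Comparator<String[]> GENDER_ONLY =
            Comparator.comparing(k -> k[1]);

    // Compares the gender first, then the survival status (index 0).
    static final Comparator<String[]> GENDER_THEN_SURVIVED =
            GENDER_ONLY.thenComparing(k -> k[0]);

    // Counts how many distinct groups the keys fall into under the given ordering.
    static int countGroups(Comparator<String[]> order, String[][] keys) {
        TreeMap<String[], Integer> counts = new TreeMap<>(order);
        for (String[] key : keys) {
            counts.merge(key, 1, Integer::sum); // keys that compare as equal share one entry
        }
        return counts.size();
    }

    public static void main(String[] args) {
        String[][] keys = {
            {"0", "male"}, {"1", "male"}, {"0", "female"}, {"1", "female"}, {"0", "male"}
        };
        // Gender only: survivors and non-survivors collapse into one group per gender.
        System.out.println(countGroups(GENDER_ONLY, keys));          // prints 2
        // Both columns: every (survival, gender) combination keeps its own group.
        System.out.println(countGroups(GENDER_THEN_SURVIVED, keys)); // prints 4
    }
}
```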

Now that we have written the custom key, we need to write an input format class that extends the default FileInputFormat.
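In Hadoop's org.apache.hadoop.mapreduce API, a FileInputFormat subclass supplies a RecordReader through its createRecordReader method, and the record reader is where each input line is turned into a key and a value. The sketch below covers only that per-line step, using the plain JDK so that it can be run on its own. The class name TitanicLineParser is hypothetical, the sample line is made up, and the code assumes comma-separated fields with no commas inside a field, which may not hold for the real file.

```java
import java.util.Optional;

class TitanicLineParser {
    static final int SURVIVED_INDEX = 1; // 2nd column: died or survived
    static final int GENDER_INDEX = 4;   // 5th column: gender

    // Returns {survived, gender} for one input line,
    // or an empty Optional when the line has too few fields.
    static Optional<String[]> toKey(String line) {
        // Assumption: a plain comma split is enough (no quoted fields with commas).
        String[] fields = line.split(",", -1);
        if (fields.length <= GENDER_INDEX) {
            return Optional.empty();
        }
        return Optional.of(new String[] {
            fields[SURVIVED_INDEX].trim(),
            fields[GENDER_INDEX].trim()
        });
    }

    public static void main(String[] args) {
        // Made-up sample line, not taken from the real data set.
        String line = "1,0,3,Some Name,male,22";
        toKey(line).ifPresent(key -> System.out.println(key[0] + " " + key[1]));
        // A line that is too short is skipped instead of throwing an exception.
        System.out.println(toKey("1,0,3").isPresent());
    }
}
```

In a record reader, a step like this would typically run inside nextKeyValue, after the underlying line reader has produced the next line of the split.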