We have already seen examples of a Combiner and a Custom Partitioner in MapReduce programming. In this tutorial, I am going to show you an example of a map-side join in Hadoop MapReduce. If you want to dig deeper into MapReduce and how it works, then you may like this article on how MapReduce works.
MapReduce processes big data sets, and processing large data sets often requires joining datasets on a common key, much like we almost always do in an RDBMS using the primary/foreign key concept.

In this tutorial, I am going to explain the usage of a map-side join.

Map-side Join
You can use a map-side join in two different ways, depending on which of the below conditions your datasets meet -

Both datasets must be divided into the same number of partitions and must already be sorted by the same key.

One of the two datasets must be small (something like a master dataset) and able to fit into the memory of each node.

In this tutorial, I will show you the second approach; to use it, you need the distributed cache to keep the small (master) dataset in the memory of each node.

OK, let's find user activity on social media: what actions a user performed on a popular social network, like commenting on a post, sharing something, liking something, etc.
And for this we have two different log files -

user.log

user_activity.log

Here is the tabular view of these datasets,
1. user.log

2. user_activity.log

3. Expected output
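Conceptually, a map-side join with a distributed cache works like an in-memory hash join: each mapper loads the small user.log into a hash map up front, then looks up every user_activity.log record as it streams by. Here is a minimal plain-Java sketch of that lookup logic; the field layouts (a user id/name pair and a user id/action pair) and the sample values are my assumption, not the exact file format used later in this project:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapSideJoinSketch {

    public static void main(String[] args) {
        // Small "master" dataset (user.log): user_id -> user_name.
        // In the real job this would be loaded from the distributed cache
        // once per mapper, before any records are processed.
        Map<String, String> users = new HashMap<>();
        users.put("1", "Alice");
        users.put("2", "Bob");

        // Large dataset (user_activity.log): user_id, activity.
        // In the real job each mapper streams these records one at a time.
        List<String[]> activities = new ArrayList<>();
        activities.add(new String[] {"1", "commented on a post"});
        activities.add(new String[] {"2", "shared a photo"});
        activities.add(new String[] {"1", "liked a page"});

        // The join itself: look each activity's user_id up in the in-memory map.
        for (String[] activity : activities) {
            String name = users.get(activity[0]);
            if (name != null) { // inner join: skip activities for unknown users
                System.out.println(name + "\t" + activity[1]);
            }
        }
    }
}
```

Because the small dataset is available in memory on every node, no shuffle or reduce phase is needed for the join itself; that is the whole appeal of a map-side join.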

Tools and Technologies we are using here:

Java 8

Eclipse Mars

Hadoop 2.7.1

Maven 3.3

Ubuntu 14 (Linux OS)

Step 1. Create a new maven project
Go to the File menu, then New -> Maven Project, and provide the required details; see the below attached screen.

Step 2. Edit pom.xml
Double-click on your project's pom.xml file; it will look like this, with very limited information.
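A freshly generated pom.xml typically looks something like the below once the Hadoop dependency has been added. The groupId, artifactId, and version here are placeholders for illustration (use whatever you entered in the wizard); the hadoop-client 2.7.1 dependency matches the Hadoop version listed above:

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <!-- These coordinates come from the New Maven Project wizard; yours will differ. -->
  <groupId>com.example.hadoop</groupId>
  <artifactId>mapside-join</artifactId>
  <version>0.0.1-SNAPSHOT</version>

  <dependencies>
    <!-- Hadoop client libraries, matching the Hadoop 2.7.1 install used in this tutorial. -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.7.1</version>
    </dependency>
  </dependencies>
</project>
```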

You can verify with the web UI as well, using the "http://localhost:50070/explorer.html#/" URL.

iii. Create an input folder on HDFS with the below command.

subodh@subodh-Inspiron-3520:~/software$ hadoop fs -mkdir /input

The above command will create an /input folder on HDFS; you can verify it using the web UI. Now it is time to move the input files we need to process. Below is the command to copy the user.log and user_activity.log input files into the /input folder on HDFS.
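Assuming both log files sit in your current local directory (the local paths here are my assumption), the copy would look something like this:

```shell
hadoop fs -put user.log /input
hadoop fs -put user_activity.log /input
```

You can confirm both files landed with `hadoop fs -ls /input` or through the same web UI explorer page.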

Note - the user.log and user_activity.log files are available inside this project's source code; you will find a download link at the end of this tutorial.