Replicated Join in Pig

A join brings together two data sets. Pig supports the usual kinds of join: inner joins and left, right, and full outer joins. Beyond these simple joins, Pig also supports specialized joins: replicated, skewed, and merge joins. This post covers the replicated join.

Suppose there is a big data file containing the landline numbers of people across all cities in India, and a smaller file containing the STD code (a 3-digit number) for each city. If the STD code has to be prefixed to every number in the bigger file according to its city, a replicated join is best suited.

This is because, instead of shuffling and sorting the big file so that a reduce phase can process each phone number, it is easier to copy the smaller file of STD codes to every machine. Each machine then holds its own replica of the small file and can prefix the STD code to the landline numbers it reads, entirely on the map side.

To demonstrate replicated joins in Pig, we will use the apache_nobots_tsv.txt and nobots_ip_country_tsv.txt datasets.

In the demonstration below, the bigger file is apache_nobots_tsv.txt and the smaller file is nobots_ip_country_tsv.txt.

The apache_nobots_tsv.txt dataset contains around 515 records.

Description of the above dataset:

1st Column: IP address

2nd Column: Timestamp

3rd Column: Page name

4th Column: HTTP status

5th Column: Payload

6th Column: User agent

Step 1: Loading the large dataset into a Pig relation.

In this step we load apache_nobots_tsv.txt into the relation weblogs_nobots.
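
A minimal sketch of what this load could look like in the Grunt shell. The file location, the field names, and the field types are assumptions chosen for illustration; only the column order comes from the dataset description above.

```pig
-- Assumption: the file sits in the current working directory on HDFS;
-- adjust the path to wherever the file was actually uploaded.
-- Assumption: field names and types are illustrative; the column order
-- follows the dataset description (IP, timestamp, page, status, payload, agent).
weblogs_nobots = LOAD 'apache_nobots_tsv.txt' USING PigStorage('\t')
    AS (ip:chararray,
        ts:chararray,
        page:chararray,
        http_status:int,
        payload:int,
        user_agent:chararray);
```

PigStorage already splits on tabs by default, so the explicit '\t' only makes the TSV format visible to the reader.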


Step 2: Loading the smaller dataset into a Pig relation.

In this step we load nobots_ip_country_tsv.txt into the relation ip_address_country.
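
The load of the smaller file might look like the sketch below. The post does not describe this file's columns, so the two-field schema (an IP address followed by a country name) is an assumption inferred from the file name, as is the path.

```pig
-- Assumption: two tab-separated columns, IP address and country.
-- Assumption: the file sits in the current working directory on HDFS.
ip_address_country = LOAD 'nobots_ip_country_tsv.txt' USING PigStorage('\t')
    AS (ip:chararray, country:chararray);
```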

Step 3: Joining the two relations.

In this step we perform a replicated join on the two relations.

Pig will load the right-most relation, ip_address_country, into memory and join its data with the weblogs_nobots relation. It is important that the right-most relation be small enough to fit into a mapper's memory.
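
As a sketch, the join statement could be written as follows. It assumes both relations were loaded with a field named ip to join on; the alias weblog_country_jnd is a hypothetical name used only for illustration.

```pig
-- The large relation is listed first; the relation after it is the one
-- Pig replicates into each mapper's memory.
-- Assumption: both relations expose a join key called ip.
weblog_country_jnd = JOIN weblogs_nobots BY ip,
                          ip_address_country BY ip USING 'replicated';
```

The order of the relations matters here: putting the big file on the right would ask Pig to hold it in memory, which is exactly what the replicated join is meant to avoid.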

Step 4: Dumping the final results.

In this step we display the final results of the join operation, limiting the output to the first 5 records.
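
One way to do this, assuming the joined relation from Step 3 is called weblog_country_jnd (a hypothetical alias), is to apply LIMIT and then DUMP the result:

```pig
-- Assumption: weblog_country_jnd is the alias given to the join result.
-- Keep only 5 records of the joined output.
first_five = LIMIT weblog_country_jnd 5;

-- Run the job and print the 5 records to the console.
DUMP first_five;
```

Without an ORDER BY before the LIMIT, Pig does not guarantee which 5 records are returned.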

We hope this blog helped you understand replicated joins in Pig. In our next blog we will discuss merge joins in Pig. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.