
Use Secondary Sort to Keep Different Inputs in Order in Hadoop

Secondary Sort is a technique that lets you control the order in which inputs arrive at your Reducers.

For example, suppose we want to join two different datasets, where one dataset contains an attribute that the other dataset needs. For simplicity, we call the first dataset the ATTRIBUTE set and the other the DATA set. Since the ATTRIBUTE set is also very large, it is not practical to put it in the Distributed Cache.

Now we want to join these two datasets: for each record in the DATA set, we get its ATTRIBUTE. If we don't use Secondary Sort, then after the map step the DATA and ATTRIBUTE records will arrive at the Reducer in arbitrary order. So if we want to append the ATTRIBUTE to each DATA record, we have to buffer all the DATA records in memory, and only when the ATTRIBUTE finally shows up can we assign it to them.

The issue here is that if the DATA set is huge, buffering it all in memory is painful: it slows the computation dramatically, or even crashes the job. The root cause is that we don't know when the ATTRIBUTE will arrive.

But if we can guarantee that the ATTRIBUTE comes first, then for each DATA record we receive from the input stream we can append the ATTRIBUTE to it right away, without buffering anything, and write the result to the output stream.

(The two datasets have different input formats, so we will use MultipleInputs and GenericWritable here; check Here to see how to use MultipleInputs, and Here to see how to use GenericWritable. This is my use case; you may have a use case that needs Secondary Sort but not MultipleInputs or GenericWritable.)
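To make the idea concrete, here is a sketch of the map step: each mapper emits a composite key that carries the original join key plus a small "secondary order" flag, 0 for ATTRIBUTE records and 1 for DATA records. The class names (AttributeMapper, DataMapper) and the input line format are my own illustrative assumptions; only SecondarySortableTextKey and its secondarySortOrder field follow the names used in this post.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper for the ATTRIBUTE set; assumes each line is
// "joinKey<TAB>attribute".
public class AttributeMapper extends Mapper<Object, Text, SecondarySortableTextKey, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t", 2);
        // Secondary order 0: ATTRIBUTE records sort before DATA records
        // that share the same join key.
        context.write(new SecondarySortableTextKey(new Text(fields[0]), 0),
                      new Text(fields[1]));
    }
}

// A DataMapper would be identical except that it emits secondary order 1,
// so DATA records for the same join key arrive after the ATTRIBUTE record.
```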

(4) Write the Reducer. As you see in the code, we assume the ATTRIBUTE comes first and the DATA comes later. For each DATA input, if the attribute is still null, it means there is no corresponding ATTRIBUTE for this DATA record.
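A minimal sketch of such a Reducer, assuming ATTRIBUTE records carry secondary order 0 and DATA records carry 1 (the class name JoinReducer and the getSecondarySortOrder() accessor are illustrative assumptions):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JoinReducer extends Reducer<SecondarySortableTextKey, Text, Text, Text> {
    @Override
    protected void reduce(SecondarySortableTextKey key, Iterable<Text> values,
                          Context context) throws IOException, InterruptedException {
        String attribute = null;
        // Hadoop updates the key object as we iterate over the values,
        // so checking the secondary order inside the loop is safe.
        for (Text value : values) {
            if (key.getSecondarySortOrder() == 0) {
                // The ATTRIBUTE record: remember it, nothing to emit yet.
                attribute = value.toString();
            } else {
                // A DATA record: the ATTRIBUTE (if any) has already arrived,
                // so we can append it immediately without buffering DATA.
                context.write(new Text(value.toString()),
                              new Text(attribute == null ? "" : attribute));
            }
        }
    }
}
```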

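For reference, the composite key might look like the following sketch. The class name SecondarySortableTextKey and the secondarySortOrder field match the ones used in this post; the rest of the implementation is an assumption on my part:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class SecondarySortableTextKey
        implements WritableComparable<SecondarySortableTextKey> {
    private Text key = new Text();   // the real join key
    private int secondarySortOrder;  // 0 = ATTRIBUTE, 1 = DATA

    public SecondarySortableTextKey() {}

    public SecondarySortableTextKey(Text key, int secondarySortOrder) {
        this.key = key;
        this.secondarySortOrder = secondarySortOrder;
    }

    public Text getKey() { return key; }
    public int getSecondarySortOrder() { return secondarySortOrder; }

    @Override
    public void write(DataOutput out) throws IOException {
        key.write(out);
        out.writeInt(secondarySortOrder);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        key.readFields(in);
        secondarySortOrder = in.readInt();
    }

    @Override
    public int compareTo(SecondarySortableTextKey other) {
        // Compare the real key first, then the secondary order, so
        // ATTRIBUTE (0) sorts before DATA (1) within the same key.
        int cmp = key.compareTo(other.key);
        return cmp != 0 ? cmp
                        : Integer.compare(secondarySortOrder, other.secondarySortOrder);
    }
}
```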
As you can see, the secondarySortOrder value determines the order.
(6) Set our own Partitioner. Since we are now using SecondarySortableTextKey as the key at the Reducer, if we don't change the partitioner, by default it will partition on the whole SecondarySortableTextKey. But we only want to change the input order, not the partitioning, so we should write a custom Partitioner:
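A sketch of such a Partitioner (assuming the map output value type is Text and the key exposes a getKey() accessor; both are assumptions):

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SecondarySortPartitioner
        extends Partitioner<SecondarySortableTextKey, Text> {
    @Override
    public int getPartition(SecondarySortableTextKey key, Text value,
                            int numPartitions) {
        // Partition on the original key only, ignoring secondarySortOrder,
        // so ATTRIBUTE and DATA records for the same key land on the same
        // Reducer. The mask keeps the hash non-negative.
        return (key.getKey().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```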

As you can see, we partition on the original key only, instead of the whole SecondarySortableTextKey.
(7) Custom group comparator. The Partitioner only guarantees that the same keys go to the same Reducer; it doesn't guarantee they arrive as one group in a single reduce() call, so we also need to write this comparator.

As you can see, the sort comparison compares the key first, and if the keys are the same, it compares the secondary order we set in the mapper; this guarantees the record with the lower secondary order comes first. The group comparator, by contrast, compares only the original key, so that all records sharing that key are grouped into one reduce() call.
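The group comparator could be sketched like this, assuming the SecondarySortableTextKey layout described above (the class name NaturalKeyGroupingComparator is illustrative):

```java
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class NaturalKeyGroupingComparator extends WritableComparator {
    public NaturalKeyGroupingComparator() {
        // true: let WritableComparator create key instances so we can
        // compare deserialized objects.
        super(SecondarySortableTextKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Group on the original key only: records that differ only in
        // secondarySortOrder still go to the same reduce() call.
        SecondarySortableTextKey left = (SecondarySortableTextKey) a;
        SecondarySortableTextKey right = (SecondarySortableTextKey) b;
        return left.getKey().compareTo(right.getKey());
    }
}
```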
(8) Write the driver to execute the MapReduce job; all the classes we wrote above are wired up here.
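A driver for this job might look like the sketch below. The class names match the illustrative ones used earlier in the sketches; the input and output paths are taken from the command line, and the specific argument layout is an assumption.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SecondarySortJoinDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "secondary sort join");
        job.setJarByClass(SecondarySortJoinDriver.class);

        // Two inputs with different formats get their own mappers.
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, AttributeMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
                TextInputFormat.class, DataMapper.class);

        job.setReducerClass(JoinReducer.class);

        // Wire up the secondary-sort machinery: composite key, custom
        // partitioner, and grouping comparator. The sort order itself comes
        // from SecondarySortableTextKey.compareTo().
        job.setMapOutputKeyClass(SecondarySortableTextKey.class);
        job.setMapOutputValueClass(Text.class);
        job.setPartitionerClass(SecondarySortPartitioner.class);
        job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```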