I contributed a patch to MAPREDUCE-2454 to make the sort stage in the MR data flow pluggable. Some of the benefits it brings are:

1. One can avoid sorting by providing an external sort implementation, which gives a performance benefit for jobs that do not require sorting. Once the patch for MAPREDUCE-2454 is committed, the problems discussed in MAPREDUCE-4039 and MAPREDUCE-1928 can be solved. In other words, MAPREDUCE-4039 and MAPREDUCE-1928 become special cases of a sort plugin implementation.

2. A full join (inner and outer) done in the reducer can instead be done more efficiently in the reduce sort plugin when both sides of the join are huge. The reason is that both sides of the join can be sorted separately, and data coming from the disk in the final merges can be joined right away.

3. One can implement specialized sorting algorithms based on the data being processed in order to optimize performance.
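
To illustrate point 2, here is a minimal sketch of the idea of joining two separately sorted streams during the final merge. It is plain Python and not the plugin API from the patch: the function names merge_join and _groups and the (key, value) tuple layout are assumptions made only for this example.

```python
from itertools import groupby
from operator import itemgetter

_END = object()  # sentinel marking an exhausted stream


def _groups(stream):
    """Yield (key, [values]) for each run of equal keys in a key-sorted stream."""
    for key, items in groupby(stream, key=itemgetter(0)):
        yield key, [value for _, value in items]


def merge_join(left, right):
    """Full outer join of two (key, value) streams, each already sorted by key.

    Yields (key, left_value, right_value); the missing side is None.
    Hypothetical helper for illustration only.
    """
    lgroups, rgroups = _groups(left), _groups(right)
    l = next(lgroups, _END)
    r = next(rgroups, _END)
    while l is not _END or r is not _END:
        if r is _END or (l is not _END and l[0] < r[0]):
            # Key present only on the left side.
            for lv in l[1]:
                yield (l[0], lv, None)
            l = next(lgroups, _END)
        elif l is _END or r[0] < l[0]:
            # Key present only on the right side.
            for rv in r[1]:
                yield (r[0], None, rv)
            r = next(rgroups, _END)
        else:
            # Same key on both sides: emit the cross product of the two runs.
            for lv in l[1]:
                for rv in r[1]:
                    yield (l[0], lv, rv)
            l = next(lgroups, _END)
            r = next(rgroups, _END)


if __name__ == "__main__":
    left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
    right = [(2, "x"), (3, "y"), (4, "z")]
    for row in merge_join(left, right):
        print(row)
```

Because each side is consumed in key order exactly once, matching records can be emitted as soon as both runs for a key have been read, without holding either side fully in memory.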

I have incorporated the developers' suggestions into the Jira. The patch passed the Apache QA build and tests.

I request that all committers take a look at the patch and make any suggestions so that it can be committed.

Thanks.
-- Asokan
