This article explains the amplify technique, which is useful for improving prediction scores.

Iterations are mandatory in machine learning (e.g., in stochastic gradient descent) to obtain good prediction models. However, MapReduce is known to be ill-suited for iterative algorithms because the input and output of each MapReduce job pass through HDFS.

Using training_x3 instead of the plain training table results in a better AUC (0.746214) in this example.
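For reference, a minimal sketch of how such an amplified view can be defined with the amplify() UDTF; the table name, column list, and the ${xtimes} variable here are illustrative assumptions, not taken from this article:

```sql
-- Illustrative sketch: emit each row ${xtimes} times, then shuffle the
-- amplified rows globally via CLUSTER BY rand().
-- The table and column names below are assumptions.
set hivevar:xtimes=3;

create or replace view training_x3
as
select
  *
from (
  select
    amplify(${xtimes}, *) as (rowid, label, features)
  from
    training
) t
CLUSTER BY rand();
```

Note that CLUSTER BY rand() distributes the amplified rows to reducers, so materializing this view requires a full MapReduce job whose shuffle/merge phase is the stage-1 bottleneck described next.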

A problem with amplify() is that the shuffle (copy) and merge phases of stage 1 can become a bottleneck.
When the training table is so large that it involves 100 map tasks, the merge operator needs to merge at least 100 files by (external) merge sort!

Note that the actual bottleneck is not the M/R iterations but the shuffling of training instances. Iterating without shuffling (as in the Spark example) leads to very slow convergence and thus requires more iterations. Moreover, shuffling cannot be avoided even in iterative MapReduce variants.

Amplify and shuffle training examples in each Map task

To deal with large training data, Hivemall provides the rand_amplify UDTF, which randomly shuffles input rows within each map task.
The rand_amplify UDTF outputs rows in a random order whenever the local buffer, whose size is specified by ${shufflebuffersize}, becomes full.

With rand_amplify(), the view definition of training_x3 becomes as follows:
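(The sketch below reuses the illustrative table and column names from the amplify() sketch above; the ${xtimes} and ${shufflebuffersize} values are likewise assumptions, not taken from this article.)

```sql
-- Illustrative sketch: rand_amplify shuffles rows within each map task's
-- local buffer, so no CLUSTER BY (and no reduce-side merge) is needed.
set hivevar:xtimes=3;
set hivevar:shufflebuffersize=1000;

create or replace view training_x3
as
select
  rand_amplify(${xtimes}, ${shufflebuffersize}, *) as (rowid, label, features)
from
  training;
```

Because the shuffling happens inside each map task, this version avoids the reduce-side merge of the amplify() approach entirely.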