The current implementation is during grouping process, all tuples that belongs to the same key are materialized into a DataBag, and the DataBag(s) are passed to the UDFs. This causes performance and memory problem. For a large key, if its tuples can not fit into memory, performance has to sacrifice to spill extra data into disk.

Since many UDFs do not really need to see all the tuples that belongs to a key at the same time, it is possible to pass those tuples as batches. A good example would be like COUNT(), SUM(). Tuples can be passed to UDFs in accumulative manner. When all the tuples are passed, the final method is called to retrieve the value. This way, we can minimize the memory usage and improve performance by avoiding data spill.

UDF change

An Accumulator interface is defined. UDFs that are able to process tuples in accumulative manner should implement this interface. It is defined as following:

public interface Accumulator <T> {
/**
* Pass tuples to the UDF. You can retrive DataBag by calling b.get(index).
* Each DataBag may contain 0 to many tuples for current key
*/
public void accumulate(Tuple b) throws IOException;
/**
* Called when all tuples from current key have been passed to accumulate.
* @return the value for the UDF for this key.
*/
public T getValue();
/**
* Called after getValue() to prepare processing for next key.
*/
public void cleanup();
}

UDF should still extend EvalFunc as before. The PIG engine would detect based on context whether tuples can be processed accumulatively. If not, then regular EvalFunc would be called. Therefore, for a UDF, both interfaces should be implemented properly

Use Cases

PIG engine would process tuples accumulatively only when all of the UDFs implements Accumulator interface. If one of the UDF is not Accumulator, then all UDFs are called by their EvalFunc interface as regular UDFs. Following are examples accumulator interface of UDFs would be called:

When to Call Accumulator

MR plan is evaluated by an AccumulatorOptimizer to check if it is eligible to run in accumulative mode. Before AccumulatorOptimizer is called, another optimizer, SecondaryKeyOptimizer, should be called first. This optimizer checks if POSort or PODistinct in the inner plan of foreach can be removed/replaced by using secondary sorting key supported by hadoop. If it is POSort, then it is removed. If it is PODistinct, it is replaced by POSortedDistinct. Because of this optimizer, the last two use cases with order by and distinct inside foreach inner plan can still run in accumulative mode. The AccumulatorOptimizer checks the reducer plan and enables accumulator if following criteria are met:

The reducer plan uses POPackage as root, not any of its sub-classes. POPackage is not for distinct, and any of its input is not set as inner.

The successor of POPackage is a POForeach.

The leaves of each POForEach input plan is an ExpressionOperator and it must be one of the following:

POUserFunc that implements Accumulator interface and its inputs contains only ExpressionOperation, POForEach, or POSortedDistinct, but not another POUserFunc.

Therefore, if under POForEach, there are multiple UDFs, some are Accumulators,while some are not, the Accumulator would be off.

Design

Once the optimizer detects that the reduce plan can run accumulatively, it set a flag to POPackage and POForEach to indicate the data is going to be processed in accumulative mode. POForEach in turn sets this flat to all the operations of its input plans.

During runtime, POPackages creates a tuple with AccumultiveBag as its fields. This bag wraps up an AccumulativeTupleFeeder, which has a handler to the reducer Iterator to pull next batch of tuples. It also has a buffer to hold tuples of current batch. All AccumulativeBag shares the same feeder. The tuple generated by POPackage is passed to POForeach, POForeach is able to get AccumulativeTupleFeeder from AccumulativeBag. It then calls feeder.nextBatch() to fill the AccumulativeBag with first batch of tuples, pass them to the POUserFunc. Because POUserFunc is marked as accumulative, it would call the accumulate() of the UDF. The POUserFunc returns with a code of STATUS_BATCH_OK. Then POForeach pulls next batch, and so on until the last batch of tuples are retrieved and processed. At the end, POForeach notifies POUserFunc that accumulation is done. It makes a final call to POUserFunc, which in turn calls getValue() to return the final result.

Following is the sequence diagram of the data flow:

Internal Changes

Accumulator

A new interface that UDF can implement if it can run in accumulative mode.

PhysicalOperator

Add new methods setAccumulative(), setAccumStart(), setAccumEnd() to flag a physical operator to run in accumulative mode, and mark the start and end of accumulation. This change is in patch of PIG-1038.

MapReduceLauncher

AccumulatorOptimizer

Another MROpPlanVisitor. It checks the reduce plan, if it meets all the criteria, it sets the "accumulative" flag to POPackage and POForEach. It is created and invoked by MapReducerLauncher.

POStatus

Add a new state "STATUS_BATCH_OK" to indicate a batch is processed successfully in accumulative mode.

POForEach

If its "accumulative" flag is set, the bags passed to it through a tuple are AccumulativeBag as opposed to regular tuple bags. It gets AccumulativeTupleBuffer from the bag. Then it runs a while loop of calling nextBatch() of AccumulativeTupleBuffer, pass the input to inner plans. If an inner plan contains any UDF, the inner plan returns POStatus.STATUS_BATCH_OK if current batch is processed successfully. When there are no more batches to process, POForEach notifies each inner plan that accumulation is done, it makes a final call to get result and out of the while loop. At the end, POForEach returns the result to its successor in reducer plan. The operators that called POForEach doesn't need to know whether POForEach gets its result through regular mode or accumulative mode.

AccumulativeBag

An implementation of DataBag use by POPackage for processing data in accumulative mode. This bag doesn't contain all tuples from iterator. Instead, it wrapps up AccumultiveTupleBuffer, which contains iterator to pull tuples out in batches. Call the iterator() of this call only gives you the tuples for current batch.

AccumulativeTupleBuffer

An underlying buffer that is shared by all AccumulativeBags (one bag for group by, multiple bags for cogroup) generated by POPackage. POPackage has an inner class which implements this interface. POPackage creates an instance of this buffer and set it into the AccumulativeBags. This buffer has methods to retrieve next batch of tuples, which in turn calls methods of POPackage to read tuples out of iterator, and put them in an internal list. The AccumulativeBag has access to that list to return iterator of tuples.

ExpressionOperator

Add new methods getChildExpression(), containUDF() and accumChild(). The accumChild() is called by all expression operators that has more than one child operator. The expression operator needs to drive all child operators that contain UDF to process batched data. If it is in accumulative mode, accumChild() returns POStatus.STATUS_BATCH_OK. If it is not in accumulative mode, accumChild() returns null.

POUserFunc

Which method of UDF to call is changed based on its state. If accumulative flag is on, and accumulation is started, it calls accumulate(), if accumulation is ended, it calls getValue(), followed by cleanup(). If accumulative flag is off, call exec().

Other ExpressionOperators

All of the following operations are changed to call accumChild() in getNext(), if it returns null, then call its regular logic.

Performance Test

Performance tests shows biggest gain when the data set to be processed is large and skewed. We run several tests on different size of inputs with the same script. It first groups by a key, then for each group, run the bag through 3 UDFs.