Details

Description

Implement RollupDimensions UDF which performs aggregation from most detailed level of dimensions to the most general level (grand total) in hierarchical order. Provide support for ROLLUP clause in CUBE operator.

Activity

This patch contains the following features
1) RollupDimensions UDF
2) Support for ROLLUP clause in CUBE operator
3) Testcases for both
4) Removes "dimensions::" namespace from the output schema of cube operator

Prasanth Jayachandran
added a comment - 03/Jul/12 22:38 Since the second version of this patch is generated using git, I have created a new review board request https://reviews.apache.org/r/5733/
Please let me know in case of any issues.

Since the old review is based on SVN trunk, I am not able to upload git patch to it. So I created a separate RB request with git trunk (pig-git) and uploaded the patch. Also, I uploaded PIG-2726 patch as parent diff (PIG-2765 depends on PIG-2726) for this patch, which is supported only for git and other DVCS. Couldn't find a better way to handle this . Sorry for the inconvenience.

Prasanth Jayachandran
added a comment - 10/Jul/12 19:30 Since the old review is based on SVN trunk, I am not able to upload git patch to it. So I created a separate RB request with git trunk (pig-git) and uploaded the patch. Also, I uploaded PIG-2726 patch as parent diff ( PIG-2765 depends on PIG-2726 ) for this patch, which is supported only for git and other DVCS. Couldn't find a better way to handle this . Sorry for the inconvenience.

Prasanth Jayachandran
added a comment - 11/Jul/12 19:47 Updated the old review with a new patch. Please use the old review https://reviews.apache.org/r/5521/ to look at the changed bits based on your review comment. Ignore my previous comment too

Dmitriy V. Ryaboy
added a comment - 31/Jul/12 06:19 Ok this looks good to commit. Please add release notes to this ticket (or to documentation, via a different ticket) that explain what Rollup does, how to use them, and what the syntax looks like.

IntroductionData cube
In Online Analytical Processing Systems (OLAP), a data cube is a way of organizing and visualizing multi-dimensional data to get useful insights over some measure of interest. Measure is a term for numerical facts that are analyzed against various dimensions of data. Based on the type of aggregations performed, measures can be algebraic (like SUM, COUNT, MAX etc.) or holistic (TOP-K, MEDIAN etc.). The following articles on wikipedia provides good illustrations for data cubeshttp://en.wikipedia.org/wiki/Online_analytical_processinghttp://en.wikipedia.org/wiki/OLAP_cube

Cube operation
For a specified set of dimensions, cube operations computes aggregates for all possbile combinations of group by dimensions. For example, CUBE(product, location) will generate

{(product,location), (product,null), (null,location), (null,null)}

combinations for all tuples in product and location dimensions, where (null, null) represents the grand total.
The number of group by combinations generated by cube for n dimensions will be 2^n.

Rollup operation
For a specified set of dimensions, rollup operations computes multiple levels of aggregates based on hierarchical ordering. For example, ROLLUP(region,state, city) will generate

combinations for all tuples in region, state and city dimensions. It computes aggregates from the most-details level (region, state, city) to the grand total (null, null, null). Rollup is useful when there is hierarchical ordering on the dimensions.
The number of group by combinations generated by rollup for n dimensions will be n+1.

Cubing in Pig
Pig supports CUBE operator which can be used to perform cube or rollup operations. The patches (PIG-2710, PIG-2765) adds syntactic sugar to the existing builtin CubeDimensions and RollupDimensions UDF. With this new addition, aggregations across multiple dimensions can be easily represented using CUBE operator. Following example illustrates the CUBE operator usage:

outalias - The name of the output relation
inalias - The name of the input relation
CUBE, BY, ROLLUP - keywords
expression - projections (dimensions) of the relation.
PARALLEL n - Increase the parallelism of a job by specifying the number of reduce tasks, n. The default value for n is 1 (one reduce task).

Note the second column, ‘cube’ field which is a bag of all tuples that belong to ‘group’. Also note that the measure attribute ‘sales’ along with other unused dimensions in load statement are pushed down so that it can be referenced later while computing aggregates on the measure, like in this case SUM(cube.sales).

Basic usage of CUBE and ROLLUP operation combined
If CUBE and ROLLUP operations are used together, the output groups will be the cross product of all groups generated by cube and rollup operation. If there are m dimensions in cube operations and n dimensions in rollup operation then overall number of combinations will be (2^m) * (n+1).

Handling null values in dimensions
Since null values are used to represent subtotals in cube and rollup operation, in order to differentiate the legitimate null values that already exists as dimension values, CUBE operator converts any null values in dimensions to "unknown" value before performing cube or rollup operation. For example, for CUBE(product,location) with a sample tuple (car,null) the output will be