Edward Capriolo

1) Promote a cool open source addition to Hive by m6d (Rank UDF)
2) Promote the upcoming "Programming Hive" book (that I am co-authoring)

What better way than to give a preview of the m6d rank case study in the "Programming Hive" book?

--M6d UDF Pseudo Rank--

by David Ha and Rumit Patel

Sorting data and identifying the top n elements is straightforward. You order the whole data set by some criterion and limit the result set to n. But there are times when you need to group like elements together and find the top n elements within each group only. For example, identifying the top ten requested songs for each recording artist, or the 100 best-selling items per product category and country. Several database platforms define a rank() function that can support these scenarios, but until Hive provides an implementation, we can create a user-defined function to produce the results we want. We will call this function p_rank() for pseudo-rank, leaving the name rank() for the Hive implementation.
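To make the difference between the two problems concrete, here is a small Python sketch. It is not Hive code and not the m6d UDF; the rows, the product names, and the helper names top_n and top_n_per_group are made up for illustration.

```python
from collections import defaultdict

# Made-up rows for illustration only: (category, country, product, sales).
rows = [
    ("books", "us", "book_a", 50),
    ("books", "us", "book_b", 80),
    ("books", "us", "book_c", 20),
    ("music", "us", "album_a", 90),
    ("music", "us", "album_b", 70),
    ("music", "uk", "album_c", 60),
]


def top_n(rows, n):
    """Global top n: order everything by sales descending, then limit to n."""
    return sorted(rows, key=lambda r: r[3], reverse=True)[:n]


def top_n_per_group(rows, n):
    """Top n within each (category, country) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row[0], row[1])].append(row)
    return {key: top_n(members, n) for key, members in groups.items()}


print(top_n(rows, 2))            # two best sellers overall
print(top_n_per_group(rows, 2))  # up to two best sellers per group
```

The global version only ever returns n rows in total, while the grouped version returns up to n rows for every category and country pair, which is the result a rank-style function is meant to support.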

Say we have the following product sales data and we want to see the top three items per category and country:

To achieve the same result using HiveQL, the first step is partitioning the data into groups, which we can achieve using the distribute by clause. We must ensure that all rows with the same category and country are sent to the same reducer.
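The guarantee that matters here is that equal keys always land on the same reducer. The Python sketch below imitates that idea under stated assumptions: the rows are made up, distribute_by is a hypothetical helper, and the CRC32-modulo scheme is only a stand-in, not the hash function Hive actually uses.

```python
from collections import defaultdict
from zlib import crc32

# Made-up rows for illustration only: (category, country, product, sales).
rows = [
    ("books", "us", "book_a", 50),
    ("books", "us", "book_b", 80),
    ("music", "us", "album_a", 90),
    ("music", "us", "album_b", 70),
    ("music", "uk", "album_c", 60),
]


def distribute_by(rows, key_of, num_reducers):
    """Assign each row to a reducer by hashing its key (stand-in hash, not Hive's)."""
    reducers = defaultdict(list)
    for row in rows:
        key_bytes = "\x01".join(key_of(row)).encode("utf-8")
        reducers[crc32(key_bytes) % num_reducers].append(row)
    return dict(reducers)


assignment = distribute_by(rows, lambda r: (r[0], r[1]), num_reducers=2)

# Record which reducers each (category, country) pair was sent to.
homes = defaultdict(set)
for reducer_id, assigned in assignment.items():
    for row in assigned:
        homes[(row[0], row[1])].add(reducer_id)

# Each pair should have exactly one home reducer.
print(all(len(ids) == 1 for ids in homes.values()))
```

A reducer may still receive several different groups; the only promise is that no single group is split across reducers.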

The next step is ordering the data in each group by descending sales using the sort by clause. While order by effects a total ordering across all data, sort by affects only the ordering of data on a specific reducer. In the sort by clause you must repeat the partition columns named in the distribute by clause, ahead of the sales column, so that the rows of each group stay together on the reducer.
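Why the partition columns need to appear in the sort can be seen with a small Python sketch of a single reducer's input. This is an illustration only, not Hive's implementation; the rows and the helper sort_by are made up.

```python
# Made-up rows as they might arrive at one reducer, in no particular order:
# (category, country, product, sales).
reducer_rows = [
    ("music", "us", "album_b", 70),
    ("books", "us", "book_a", 50),
    ("music", "us", "album_a", 90),
    ("books", "us", "book_b", 80),
]


def sort_by(rows, key):
    """Order the rows held by a single reducer; other reducers are unaffected."""
    return sorted(rows, key=key)


# Sorting on sales alone interleaves the two groups on this reducer.
by_sales_only = sort_by(reducer_rows, key=lambda r: -r[3])

# Repeating the partition columns first keeps each group contiguous,
# with sales descending inside it.
by_group_then_sales = sort_by(reducer_rows, key=lambda r: (r[0], r[1], -r[3]))

for row in by_group_then_sales:
    print(row)
```

With the group columns leading the sort key, each category and country pair forms one contiguous run with its best seller first, which is the layout a row-by-row ranking function can work with.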