A Brief Tour of DataFu

DataFu (now Apache DataFu) is a collection of user-defined functions (UDFs) for working with large-scale data in Hadoop and Pig. The library was born out of the need for a stable, well-tested set of UDFs for data mining and statistics. It is used at LinkedIn in many of our offline workflows for data-derived products like “People You May Know” and “Skills”. It contains functions for:

13.
Session Statistics
• What if we want to count views per page per user?
    pv_counts = FOREACH (GROUP pv BY (memberId,url)) GENERATE
        group.memberId as memberId,
        group.url as url,
        COUNT(pv) as cnt;
• But refreshes and go-backs are not that significant.
• Multiple views across sessions are more meaningful.

14.
Session Statistics
• Use TimeCount to sessionize the counts:
    define TimeCount datafu.pig.date.TimeCount('10m');
    pv_counts = FOREACH (GROUP pv BY (memberId,url)) {
        ordered = order pv by time;
        GENERATE
            group.memberId as memberId,
            group.url as url,
            TimeCount(ordered.(time)) as cnt;
    }
• Uses the same principle as the Sessionize UDF.
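To make the sessionized count concrete, here is a minimal Python sketch of the idea. It is a model of the behavior described on the slide, not DataFu's actual implementation. The function name `time_count`, the millisecond timestamps, and the rule that a gap of at least the timeout starts a new session are all assumptions made for illustration.

```python
from typing import Iterable


def time_count(times_ms: Iterable[int], timeout_ms: int) -> int:
    """Count views, collapsing views that fall within the same session.

    Assumption: timestamps are sorted ascending (the Pig script orders
    the bag by time first), and a gap of at least timeout_ms between
    consecutive views starts a new session.
    """
    count = 0
    last = None
    for t in times_ms:
        if last is None:
            count = 1          # first view opens the first session
        elif t < last:
            raise ValueError("timestamps must be sorted ascending")
        elif t - last >= timeout_ms:
            count += 1         # long gap: a new session begins
        last = t
    return count


TEN_MINUTES_MS = 10 * 60 * 1000

# Three quick refreshes, then two more views 45 minutes later.
views = [0, 30_000, 60_000, 45 * 60_000, 46 * 60_000]
sessions = time_count(views, TEN_MINUTES_MS)
```

A plain COUNT over the same five views would report 5, while the sessionized count treats the refreshes as part of one visit and the later views as a second visit.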

19.
CountEach
• Suppose we have a recommendation system, and we've tracked what items have been recommended.
    items = FOREACH items GENERATE memberId, itemId;
• We want to produce a bag of the items shown to each user, with a count for each item.
• Output should look like:
    {memberId: int,items: {(itemId: long,cnt: long)}}

20.
CountEach
• Typically, we would first count (member,item) pairs:
    items = GROUP items BY (memberId,itemId);
    items = FOREACH items GENERATE
        group.memberId as memberId,
        group.itemId as itemId,
        COUNT(items) as cnt;

21.
CountEach
• Then we would group again on member:
    items = GROUP items BY memberId;
    items = FOREACH items GENERATE
        group as memberId,
        items.(itemId,cnt) as items;
• But this requires two MR jobs.

22.
CountEach
• Using CountEach, we can accomplish the same thing with one MR job and less code:
    items = FOREACH (GROUP items BY memberId) GENERATE
        group as memberId,
        CountEach(items.(itemId)) as items;
• Better performance too! In one test I ran:
    – Wall clock time: 50% reduction
    – Total task time: 33% reduction
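The equivalence between the two-step grouping and the single-pass CountEach approach can be sketched in Python. This is only a model of the data flow, not how Pig or DataFu executes it. The function names `two_step` and `count_each` and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def two_step(rows):
    """Model of the two-job approach: count pairs, then regroup by member."""
    pair_counts = Counter(rows)               # job 1: count (member, item) pairs
    grouped = defaultdict(list)
    for (member_id, item_id), cnt in pair_counts.items():
        grouped[member_id].append((item_id, cnt))   # job 2: group by member
    return {m: sorted(items) for m, items in grouped.items()}


def count_each(rows):
    """Model of the one-job approach: group by member, count items per group."""
    per_member = defaultdict(Counter)
    for member_id, item_id in rows:
        per_member[member_id][item_id] += 1
    return {m: sorted(c.items()) for m, c in per_member.items()}


# Hypothetical (memberId, itemId) impressions.
rows = [(1, 10), (1, 10), (1, 20), (2, 10)]
result = count_each(rows)
```

Both functions yield the same member-to-(item, count) mapping; the second does the counting inside each member's group, which mirrors why the Pig version needs only one grouping.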

23.
AliasableEvalFunc
• Pig has great support for UDFs.
• But UDFs with many positional parameters are sometimes error-prone.
• Let's look at an example.