10
Pig Latin: Hello Word Count
input_lines = LOAD '/tmp/book.txt' AS (line:chararray);

-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';

-- create a group for each word
word_groups = GROUP filtered_words BY word;

-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/book-word-count.txt';
Map Reduce

45
Pig Joins
Inner join: As shown (default)
Self join: Copy an alias and join with that
Outer joins:
– LEFT / RIGHT / FULL
Cross product:
– CROSS
You guys know (or remember) what an INNER JOIN is, versus an OUTER JOIN (LEFT / RIGHT / FULL), versus a CROSS PRODUCT?
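As a quick refresher on the join types listed above, here is a minimal sketch in Pig Latin. The file names, schemas, and alias names are assumptions made up for illustration; only the JOIN and CROSS keywords come from the slide. The self join loads the same data under a second alias, which is one common way to get the copy to join against.

```pig
-- Hypothetical inputs: file names and schemas are assumptions for illustration
A = LOAD 'orders.txt' AS (prod:chararray, qty:int);
B = LOAD 'products.txt' AS (name:chararray, price:double);

-- Inner join (default): keeps only rows whose keys match on both sides
inner_j = JOIN A BY prod, B BY name;

-- Outer joins: unmatched rows are kept, with nulls filling the missing side
left_j  = JOIN A BY prod LEFT OUTER,  B BY name;   -- keep all rows of A
right_j = JOIN A BY prod RIGHT OUTER, B BY name;   -- keep all rows of B
full_j  = JOIN A BY prod FULL OUTER,  B BY name;   -- keep all rows of both

-- Cross product: every row of A paired with every row of B
cross_j = CROSS A, B;

-- Self join: load the same data under a second alias, then join the two
A2 = LOAD 'orders.txt' AS (prod:chararray, qty:int);
self_j = JOIN A BY prod, A2 BY prod;
```

Note that the cross product yields |A| × |B| rows, so it is usually only practical on small relations.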

46
Pig Aggregate/Join Implementations
Custom partitioning / number of reducers:
– PARTITION BY specifies a UDF for partitioning
– PARALLEL specifies number of reducers
X = JOIN A BY prod, B BY name PARTITION BY org.udp.Partitioner PARALLEL 5;
X = GROUP A BY hour PARTITION BY org.udp.Partitioner PARALLEL 5;

49
Pig: Other Operators
FILTER: Filter tuples by an expression
LIMIT: Only return a certain number of tuples
MAPREDUCE: Run a native Hadoop .jar
ORDER BY: Sort tuples
SAMPLE: Sample tuples
UNION: Concatenate two relations
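The operators above could be combined roughly as follows. This is a sketch only: the input files, schemas, and alias names are hypothetical, and MAPREDUCE is left out because it needs an actual Hadoop jar to point at.

```pig
-- Hypothetical inputs with identical schemas (assumed for illustration)
A = LOAD 'events_a.txt' AS (user:chararray, hour:int);
B = LOAD 'events_b.txt' AS (user:chararray, hour:int);

-- FILTER: keep only tuples that satisfy the expression
evening = FILTER A BY hour >= 18;

-- UNION: concatenate two relations (no duplicate removal, no ordering)
combined = UNION A, B;

-- ORDER BY: sort tuples
by_hour = ORDER combined BY hour DESC;

-- LIMIT: return at most 10 tuples
top10 = LIMIT by_hour 10;

-- SAMPLE: keep roughly 10% of the tuples, chosen at random
sampled = SAMPLE combined 0.1;
```

Applying LIMIT directly after ORDER BY, as here, returns the first tuples in sorted order; on an unsorted relation, which tuples LIMIT returns is not guaranteed.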