1.
Pig Setup and Test run
By Kannan Kalidasan

2.
Pig Introduction
Pig is a platform whose data flow language, Pig Latin, lets you write Hadoop operations without
writing MapReduce code in Java.
Pig is a layer of abstraction on top of Hadoop that simplifies its use by providing a SQL-like
interface for processing data on Hadoop.
It helps increase productivity because you do not have to write many lines of Java code.
It supports a variety of data types and also supports user-defined functions (UDFs), so custom
operations can be written in Java, Python and JavaScript.
To learn more, I recommend the book Programming Pig by Alan Gates.
The author explains the concepts in a clear and simple way.

4.
A Pig session runs in one of two modes
Local Mode : Access to a single machine. All files are installed and run using your local host and file
system. This mode helps you debug a Pig script before running it on a cluster. The -x flag is used to
specify the mode.
pig -x local
MapReduce Mode : Access to a Hadoop cluster and an HDFS installation. MapReduce mode is the
default mode.
To add the Hadoop configuration directory to the Pig classpath:
export PIG_CLASSPATH=$HADOOP_HOME/conf/
The two commands below are equivalent; both start a Pig session in MapReduce mode:
pig
pig -x mapreduce

5.
Notes to Remember ...
● Hadoop services must be running before you start Pig in MapReduce mode, so that Pig can connect
to HDFS and proceed with the work.
● Pig translates Pig Latin scripts into MapReduce jobs internally and runs them on the Hadoop cluster.
● In MapReduce mode, Pig takes its input files from HDFS only and stores the results back to HDFS.

14.
Script Explanation
Load the file into a relation, specifying the delimiter (';') and each column's name and type.
Use commas to separate the columns when the file has more than one. By default, Pig loads
tab-delimited files, so any other delimiter character must be specified explicitly.
SampleRecord = LOAD '/user/hduser/piginput/pigcsv'
USING PigStorage(';') AS (Year:chararray);
Group the loaded records by year:
GroupByYear = GROUP SampleRecord BY Year;
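The LOAD and GROUP steps above can be pictured in plain Python. This is only a rough sketch of the data flow, not how Pig executes it: the sample lines and the names load_records and group_by_year are made up for illustration, and the sketch simply keeps the first field of each line as Year.

```python
from collections import defaultdict

def load_records(lines, delimiter=";"):
    """Rough analogue of LOAD ... USING PigStorage(';') AS (Year:chararray).

    Splits each line on the delimiter and keeps the first field as Year
    (a simplification made for this sketch).
    """
    records = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        records.append((fields[0],))  # one-column tuple: (Year,)
    return records

def group_by_year(records):
    """Rough analogue of GROUP SampleRecord BY Year.

    Returns a mapping from each year (the group key) to the list of
    records that share that year (the "bag").
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[0]].append(record)
    return dict(groups)

# Hypothetical input standing in for /user/hduser/piginput/pigcsv
sample_lines = ["2011;a", "2012;b", "2011;c"]
sample_record = load_records(sample_lines)
grouped = group_by_year(sample_record)
```

Each entry of grouped pairs a group key with the bag of records that share it, which is the shape the next step of the script works on.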

15.
Script Explanation ...
Count the records in each group and generate the output as Key:Value. The output format is your
choice. $0 is the group key (the year) and $1 is the bag of records in that group, which COUNT counts.
CountByYear = FOREACH GroupByYear
GENERATE CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));
Store the result in an output location:
STORE CountByYear
INTO '/user/hduser/pigoutput' USING PigStorage('\t');
For the complete set of script commands, refer to
http://pig.apache.org/docs/r0.10.0/start.html#data-results
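The counting and storing steps of the script can be pictured the same way in plain Python. Again this is a sketch under assumptions, not Pig's implementation: the grouped data and the names count_by_year and store are hypothetical, and store writes a single local file, whereas Pig writes its results to an output location on HDFS.

```python
def count_by_year(groups):
    """Rough analogue of FOREACH ... GENERATE CONCAT(key, CONCAT(':', COUNT(bag))).

    For every group, joins the key and the number of records in its bag
    into one "Key:Value" string.
    """
    return [f"{year}:{len(bag)}" for year, bag in sorted(groups.items())]

def store(lines, path):
    """Rough analogue of STORE: write one output record per line."""
    with open(path, "w", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")

# Hypothetical grouped data: year -> bag of records in that group
groups = {"2011": [("2011",), ("2011",)], "2012": [("2012",)]}
count_result = count_by_year(groups)
print(count_result)
```

Because every output record here is a single Key:Value string, the field delimiter given to the storage step has no visible effect on this particular output.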

17.
Thank You !!!
mail : kannanpoem1984@gmail.com
@kannanpoem on twitter
Blog: http://kannandreams.wordpress.com/about/
FB Community: www.facebook.com/groups/huge360/
HUGE - Hadoop User Group & Enthusiasts
HUGE: yes, it's all about "BIG" Data
This group was created to bring together expertise and experts in Hadoop and Big Data.