Archives for May 2015

Apache Pig is one of the hottest languages in the Hadoop ecosystem. Right now the average salary for a Pig developer is $124,563, according to a report in InfoWorld. A Pig developer can process both unstructured and semi-structured data in Hadoop. If that sounds appealing, keep reading, because in this example you will learn to process data with Apache Pig.

What is Pig?

Pig (Pig Latin) is a high-level language developed to abstract away the complexity of writing MapReduce jobs. Pig lets developers write scripts in a simple data-flow language, expressing a job as a series of transformations on the data. Just think of it as breaking everything down into simple steps. Pig can process data from all kinds of formats and sources.

Pig was created out of the need to let developers and analysts write MapReduce jobs without having to write them in Java. The motivation was to give non-Java developers a SQL-like language for working with large data sets in Hadoop and performing ETL tasks. Where Hive excels at querying data in a structured format, Pig excels at putting data into a format that can be easily queried. A common workflow is to take unstructured data in raw form, use Pig to put it into a structured format, and then use Hive to query that data.

Data for Processing

We are going to use a set of population data. The data contains population totals for Minnesota broken down by age, gender, and year. The data was also used for a Pig Latin Eval series and can be accessed here on the Pig Example Github project.

Pig Environment

The development environment we will use is the HDP Sandbox. Pig is one of the applications pre-loaded in the sandbox and also available from the web-based HUE interface. It can run on your current desktop/laptop in a virtual machine.

Example Pig Script

First thing we need to do is load our data using the file browser. Once we have our population.csv uploaded we should see it in the file list.

Once we have our population data uploaded we can navigate to the Pig editor and begin entering our Pig script.

The first thing we will do is declare a relation called raw to hold our population data. We will use the LOAD operator and pass in our file path. Since we uploaded our data to the default directory, the path will be /user/hue/population.csv. Next, we will use PigStorage(',') to parse our CSV file. Finally, we will declare our fields and cast them as integers.

Pig Script

raw = LOAD '/user/hue/population.csv' USING PigStorage(',') AS
    (year:int, age:int, gender:int, popsize:int);

On our next line we will declare another relation called final and set it equal to a FOREACH that iterates over raw and pulls out the year and popsize fields.

Pig Script

final = FOREACH raw GENERATE year, popsize;

Now we are ready to output our results. To do this we use the DUMP operator to display our final relation on the screen. Once our script is ready we can use the execute button to get the results.

Pig Script

DUMP final;

Final Pig Script

raw = LOAD '/user/hue/population.csv' USING PigStorage(',') AS
    (year:int, age:int, gender:int, popsize:int);

final = FOREACH raw GENERATE year, popsize;

DUMP final;

Remember, Pig runs as a batch process, so the results might take a few minutes depending on your machine. When your results are ready, you will see a green progress bar under the Pig editor. If you get a red bar, no problem: just check that your Pig script is correct and that population.csv was uploaded correctly. Once your results have been processed, scroll down to see the year and popsize values.

Congratulations, you just processed data using Pig! Walking through this short demo, you can begin to see how simple it is to write Pig scripts. In two steps we were able to load CSV data using PigStorage and iterate over that data with a FOREACH. Even though Pig is simple to learn, it is a very powerful tool and will take practice to master.

Want more Pig?

Ready to learn more? How about going through my course Pig Latin: Getting Started at Pluralsight. In this course we will go over the basics of the Pig Latin language in order to prepare developers to use Pig in Hadoop. The course is filled with demos that show you step by step how to solve real world problems with Pig Latin.

Today we are going to talk about how to concatenate fields using Pig Latin.

Image courtesy of Goldy at FreeDigitalPhotos.net

For this week’s example we are going to use a different data set than we used in the Apache Pig Latin Eval Function series. Our new data set is a sample of fictitious names, phone numbers, addresses, and so on, that we will call leads.csv.

If you are familiar with SQL, you know there are a couple of ways to concatenate fields, and some of the options depend on the type of database you are using. MySQL uses the CONCAT() function to concatenate fields together. The simplest example: suppose you have a table with first name and last name, and you want to output the results in the format Last Name First Name. To get that format you can use the CONCAT() function.

Example of MySQL CONCAT() function

SELECT CONCAT(last_name, ' ', first_name) AS name FROM leads;

Now let's look at using Pig Latin's CONCAT() function.

Pig Latin CONCAT() Function

First we will load our leads.csv data using PigStorage. Once we have the data loaded, let's create a new relation to cut our fields down to last_name, first_name, and address.
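The original post showed this step as a screenshot. A minimal sketch of what the script might look like, assuming leads.csv is comma-delimited and assuming the column order and field names below (they are illustrations, not the actual file layout):

```pig
-- Load the raw leads data; field names and positions are assumed for illustration
leads = LOAD '/user/hue/leads.csv' USING PigStorage(',')
    AS (first_name:chararray, last_name:chararray, address:chararray);

-- Keep only the fields we need
trimmed = FOREACH leads GENERATE last_name, first_name, address;
```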

Cleaning fields

Before we use the CONCAT() function, let's see if we can clean up the " characters surrounding each field. We are going to use REPLACE in the FOREACH we just created. Using this function we will replace every " with an empty string.
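A sketch of how that REPLACE might look, assuming the relation and field names from the load step above (REPLACE takes the field, a pattern to match, and a replacement string):

```pig
-- Strip the double quotes from each field; relation and field names are assumed
trimmed = FOREACH leads GENERATE
    REPLACE(last_name, '"', '')  AS last_name,
    REPLACE(first_name, '"', '') AS first_name,
    REPLACE(address, '"', '')    AS address;
```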

Last Step

Now that our fields are formatted without the quotes, we can use the CONCAT() function to put last_name and first_name into the same field. Just like the MySQL function we looked at above, we will use CONCAT() to pass in both fields and add a space between the two.
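Assuming the trimmed relation from the cleanup step, the final step might look like the sketch below. Note that in older Pig versions CONCAT() takes exactly two arguments, so the calls are nested to add the space:

```pig
-- CONCAT takes two arguments, so nest the calls to insert the space
names = FOREACH trimmed GENERATE
    CONCAT(last_name, CONCAT(' ', first_name)) AS name;

DUMP names;
```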

Wow, that was easy

We were able to walk through a simple use case for the Pig Latin concatenation function. Even with this simple example you can see how you would use the CONCAT() function to merge fields from different data sources. We also saw how to use the REPLACE function to clean our data. If you would like to learn more about Pig Latin, check out my Pluralsight course Pig Latin: Getting Started.

Is there a function built in Pig Latin that will average a particular field?

The answer is yes, Pig Latin has an Eval function that averages fields.

How does it compare to SQL?

If you are familiar with SQL, then you know about the AVG() function and how to use it to get the average value of a column. For the following example, assume we are still using our population data and we want to get the average of the population column in SQL. We would write the following:

Sample Population Data

SQL AVG() Function

AVG() in SQL

SELECT AVG(pop_size) FROM Population;

SQL provides an easy way to get the average of a column of data. Pig Latin has a similar function for averaging results, but it must be used after a GROUP BY, because the AVG() function requires a bag as its input. So the SQL expression above has no direct one-line equivalent; instead, we can get the average population size by grouping our population data by year.

AVG() with Pig Latin

Let’s walk through how to get the average population size per year in Pig Latin. Remember, each row in the data gives the population size for a specific age and gender. For example, the population size for 30-year-old males in 1850 was 639,636.

The first thing we are going to do is load our population data, using PigStorage(',') to parse our CSV.

Pig Script

population = LOAD '/user/hue/population.csv' USING PigStorage(',')
    AS (year:int, age:int, gender:int, popsize:int);

DUMP population;

Results

(1850,0,1,1483789)
(1850,0,2,1450376)
(1850,5,1,1411067)
(1850,5,2,1359668)
(1850,10,1,1260099)
(1850,10,2,1216114)
(1850,15,1,1077133)
...........

Next we want to group each row by year. We will do this using the GROUP BY operator.
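The original post showed this step as a screenshot. A sketch of the grouping step, assuming the alias grouped (the alias name is an assumption):

```pig
-- Group the population rows by year; each group becomes a bag of tuples
grouped = GROUP population BY year;
```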

Now that we have our rows grouped by year, we can get the average population size for each year. To do this we will create a FOREACH to iterate over each grouped year and average the popsize field.
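Assuming the relation was grouped by year into grouped in the previous step, the final step might look like this (alias names are assumptions):

```pig
-- For each year group, compute the average of popsize across the bag;
-- inside a grouped relation the bag keeps the original alias, population
avg_pop = FOREACH grouped GENERATE group AS year, AVG(population.popsize);

DUMP avg_pop;
```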

It’s been an incredible journey shooting my first Pluralsight course. I’ve certainly learned a great deal throughout the process.

For the last month I’ve been recording away and trying to get my first course ready ("Pig Latin: Getting Started"), learning all the tips and tricks of Camtasia, setting up my own microphone, and more. I had no background in setting up quality sound equipment, but I have an awesome Pluralsight editor who helped me through the process.

Finally, all the hard work has paid off and my course is live.

My first course is “Pig Latin: Getting Started.” This course is a beginner course on Pig Latin and tries to help users who are familiar with SQL translate those skills into Apache Hadoop’s Pig application. Typically anywhere you have a Hadoop Distributed File System (HDFS) installed, you will have Pig running as well.

Why Big Data?

Data is consuming the world as we speak, and Big Data developers are in high demand. The median salary for Big Data developers is around $103,000/year, which makes this a great time to begin a career in Big Data or transition to a Big Data position.

One problem is knowing where to start. When I first started out, I didn’t have a direction or a roadmap for what to learn about the Hadoop Stack.

Do I learn Hive or Pig?

What is Oozie and Zookeeper?

What is this, an animal farm?

This is the reason I decided to become a Pluralsight author. I wanted to help others who are just starting out in the Hadoop stack, whether it's job-related, as it was in my case, or simply because you're ready for a new challenge. Pluralsight gives me the opportunity to reach a huge audience. Together we can make a roadmap to help you navigate the Hadoop stack.

Why Start with Pig Latin

Why did I start with Pig Latin? There are other applications I could have started with, but Pig Latin is a great first step. I feel Pig offers the right mix of ease of use and query power, and it gives a good picture of what exactly MapReduce is. In fact, for my first MapReduce job I used Pig Latin, not Java. Before developing with Pig I did have some experience in Java, but it's not necessary. To learn Pig Latin, all you really need is a basic understanding of SQL, and you can begin writing powerful MapReduce jobs in 10 minutes. Check out this example Pig Script.

If you're ready for the challenge of conquering the Hadoop stack, then let's get started with Pig Latin. Pig Latin: Getting Started will take you through:

Setting up a Hadoop development environment

Comparing Java MapReduce to Pig Latin

Loading and Storing Data

Examples on where to find Data

Using Pig from the Grunt Shell

Writing User Defined Functions

Check out the course to find out more about my journey into MapReduce and the Pig Grunt shell.