Data Analytics

Packt Publishing

The premier open source ETL tool is at your command with this recipe-packed cookbook. Learn to use data sources in Kettle, avoid pitfalls, and dig out the advanced features of Pentaho Data Integration the easy way.

Introduction

Data Analytics is the art of taking data and deriving information from it in order to make informed decisions. A large part of building and validating datasets for the decision making process is data integration—the moving, cleansing, and transformation of data from the source to a target. This article will focus on some of the tools that take Kettle beyond the normal data processing capabilities and integrate processes into analytical tools.

Reading data from a SAS datafile

SAS is one of the leading analytics suites, providing robust commercial tools for decision making in many different fields. Kettle can read files written in SAS's specialized data format, known as sas7bdat, using an input step called SAS Input (new since Version 4.3). While SAS does support other formats (such as CSV and Excel), sas7bdat is a proprietary format comparable to other analytics packages' native formats (such as Weka's ARFF file format). This recipe will show you how to read it.

Why read a SAS file?

There are two main reasons for wanting to read a SAS file as part of a Kettle process. The first is that a dataset created by a SAS program is already in place, but the output of this process is used elsewhere in other Business Intelligence solutions (for instance, using the output for integration into reports, visualizations, or other analytic tools). The second is when there is already a standard library of business logic and rules built in Kettle that the dataset needs to run through before it can be used.

Getting ready

To be able to use the SAS Input step, a sas7bdat file will be required. The Centers for Disease Control and Prevention have some sample datasets as part of the NHANES Dietary dataset. Their tutorial datasets can be found at their website at http://www.cdc.gov/nchs/tutorials/dietary/downloads/downloads.htm. We will be using the calcmilk.sas7bdat dataset for this recipe.

How to do it...

Perform the following steps to read in the calcmilk.sas7bdat dataset:

Open Spoon and create a new transformation.

From the input folder of the Design palette, bring over a Get File Names step.

Open the Get File Names step. Click on the Browse button, find the calcmilk.sas7bdat file downloaded for the recipe, and click on OK.

From the input folder of the Design palette, bring over a SAS Input step. Create a hop from the Get File Names step to the SAS Input step.

Open the SAS Input step. For the Field in the input to use as filename field, select the Filename field from the dropdown.

To clean the stream up and only have the calcmilk data, add a Select Values step and add a hop from the SAS Input step to the Select Values step. Open the Select Values step and switch to the Remove tab. Select the fields generated from the Get File Names step (filename, short_filename, path, and so on). Click on OK to close the step.

Preview the Select Values step. The data from the SAS Input step should appear in a data grid, as shown in the following screenshot:

How it works...

The SAS Input step takes advantage of Kasper Sørensen's Sassy Reader project (http://sassyreader.eobjects.org). Sassy is a Java library used to read datasets in the sas7bdat format and is derived from the R package created by Matt Shotwell (https://github.com/BioStatMatt/sas7bdat). Before those projects, it was not possible to read the proprietary file format outside of SAS's own tools.

The SAS Input step requires the processed filenames to be provided from another step (like the Get File Names step). Also of note, while the sas7bdat format only has two format types (strings and numbers), PDI is able to convert fields to any of the built-in formats (dates, integers, and so on).
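Conceptually, the conversion the step performs looks something like the following plain-Python sketch. This is not PDI's actual code, and the field names and values are invented for illustration; it only shows the idea of promoting sas7bdat's two raw types to richer ones:

```python
from datetime import datetime

# sas7bdat stores only strings and numbers; the SAS Input step lets PDI
# convert them to richer Kettle types. A rough stand-in for that
# conversion in plain Python (field names invented for illustration):
raw_row = {"SEQN": 21005.0, "EXAM_DATE": "2002-01-15"}

converted_row = {
    "SEQN": int(raw_row["SEQN"]),  # raw number -> integer ID
    "EXAM_DATE": datetime.strptime(raw_row["EXAM_DATE"], "%Y-%m-%d"),  # raw string -> date
}
print(converted_row)
```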

Studying data via stream statistics

While Kettle's forte is extracting, manipulating, and loading data, there is an entire set of tools built for generating statistics and analytic-style data from the data stream. This recipe will focus on several of those tools that allow for even more insight into your data. Kettle treats the data worked on in transformations as a stream going from an input to an output. The tools discussed in this recipe will show how to learn more about the data stream by gathering statistics about the data for analysis.

Getting ready

This recipe will not be a single large process, but made up of smaller recipes around the same subject. We will be using the Baseball salary dataset that can be found on the book's website or from Lahman's Baseball Archive website, found at http://www.seanlahman.com/baseball-archive/statistics/. The code for this recipe can also be found on the book's website.

The recipe will be broken into smaller recipes that will focus on three steps: Analytic Query, Group by, and Univariate Statistics. These steps will allow us to gain some insight into the baseball player's salaries, such as the salary change from one contract to the next, frequency of being traded, and so on.

How to do it...

Perform the following steps to learn how to use the Analytic Query step:

Create a new transformation and add a Text file input step from the palette to the canvas.

Have the Text file input step point to the Salaries.csv file. On the Content tab, be sure to change the Separator from ; to ,. On the Fields tab, use the Get Fields button to get the fields of the file.

Click on OK to close the Text file input step.

Add a Sort rows step from the pallet to the canvas. Create a hop from the Text file input to the Sort rows step.

The data needs to be sorted by playerID in the ascending order, with yearID in the descending order. Your Sort rows step should look similar to the following:

From the Statistics folder, select the Analytic Query step and add it to the canvas. Create a hop from the Sort rows step to the Analytic Query step.

For the Group field, select playerID. Fill in the Analytic Functions grid as follows:

Add a Calculator step and create a hop from the Analytic Query step to the Calculator step.

Open the Calculator step. Create a new field with the following criteria:

Finally, preview the Calculator step. You should receive an output similar to the following:

Now, the salary information provides a little more detail and can show how much a player gained (or lost) over the course of their career. Now, let's look at another step that can help show even more detail around the baseball player salary dataset—Univariate Statistics.
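What the Analytic Query plus Calculator combination computes can be sketched in plain Python. The salary figures below are invented for illustration, and PDI does this internally rather than via code, but the logic is the same: within each player's group, fetch the previous row's salary (a LAG of 1) and subtract:

```python
from itertools import groupby

# Rows sorted by playerID ascending, yearID descending (as the Sort rows
# step produced them). Salary figures are invented for illustration.
rows = [
    {"playerID": "aardsda01", "yearID": 2007, "salary": 387500},
    {"playerID": "aardsda01", "yearID": 2006, "salary": 300000},
    {"playerID": "abreubo01", "yearID": 2007, "salary": 15000000},
]

# For each player, compare every row's salary with the previous row's
# (what the Analytic Query step provides as a LAG field), then subtract
# -- the job the Calculator step performed in the recipe.
out = []
for _, group in groupby(rows, key=lambda r: r["playerID"]):
    prev_salary = None
    for row in group:
        change = None if prev_salary is None else row["salary"] - prev_salary
        out.append({**row, "salary_change": change})
        prev_salary = row["salary"]

print(out)
```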

Perform the following steps to learn how to use the Univariate Statistics step:

Create a new transformation and add a Text file input step to the canvas.

Have the Text file input step point to the Salaries.csv file. On the Content tab, be sure to change the Separator from ; to ,. On the Fields tab, use the Get Fields button to get the fields of the file. Click on OK to close the Text file input step.

Bring a Univariate Statistics step from the palette to the canvas and create a hop from the Text file input step to the Univariate Statistics step.

Open the Univariate Statistics step. For the Input field, select salary. Set the value of N, Mean, Std dev, Min, Max, and Median to true.

Click on OK to close the Univariate Statistics step and then preview the step. A row showing the various statistics around the salary will be displayed in the preview:

The data stream is processed, returning the salary statistics for the entire dataset. Now, as the last part of this recipe, let's explore the Group by step.
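The measures the Univariate Statistics step was asked for are all standard summary statistics; a minimal stdlib-Python sketch of the same calculation (over invented salary values, not the real dataset) looks like this:

```python
import statistics

# Invented values standing in for the salary column of Salaries.csv.
salaries = [300000, 387500, 1500000, 9000000, 15000000]

# The same measures enabled in the Univariate Statistics step:
stats = {
    "N": len(salaries),
    "Mean": statistics.mean(salaries),
    "Std dev": statistics.stdev(salaries),  # sample standard deviation
    "Min": min(salaries),
    "Max": max(salaries),
    "Median": statistics.median(salaries),
}
print(stats)
```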

Perform the following steps to learn how to use the Group by step:

Create a new transformation and add a Text file input step to the canvas.

Have the Text file input step point to the Salaries.csv file. On the Content tab, be sure to change the Separator from ; to ,. On the Fields tab, use the Get Fields button to get the fields of the file. Click on OK to close the Text file input step.

Add a Sort rows step to the canvas and create a hop from the Text file input step to the Sort rows step.

Open the Sort rows step and sort the data on playerID in the ascending order.

Add a Group by step to the canvas and create a hop from the Sort rows step to the Group by step.

Open the Group by step. For the Group field, select playerID. Fill in the Aggregates data grid, as shown in the following screenshot:

Click on OK to close the Group by step and preview the data. The data stream will be grouped by the individual players and show salary statistics per player:
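The grouping and aggregation the Group by step performs can be sketched in plain Python as follows. The rows are invented for illustration; the real step streams sorted rows rather than building dictionaries in memory:

```python
import statistics
from collections import defaultdict

# Invented rows standing in for Salaries.csv, already sorted by playerID
# (the Group by step requires sorted input).
rows = [
    ("aardsda01", 300000), ("aardsda01", 387500),
    ("abreubo01", 9000000), ("abreubo01", 15000000),
]

# Collect salaries per player, then aggregate -- the Group by step's job.
by_player = defaultdict(list)
for player, salary in rows:
    by_player[player].append(salary)

summary = {
    player: {"mean": statistics.mean(s), "min": min(s), "max": max(s)}
    for player, s in by_player.items()
}
print(summary)
```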

How it works...

This recipe covered three different ways to find out more information about the data being processed, each collecting statistics about the data that can be reported on without having to be recalculated every time a report or analysis is run. For each of the steps, there are two things to consider:

The data must be sorted based on the query requirements

The original data will not continue through the stream after being processed by these steps

The Analytic Query step provides the ability to compare multiple records through a data stream, which has historically been a complicated thing to do with just SQL. Quite often, comparisons need to be made within a group of data, usually in the form of tracking changes from one record/period to the next. For our baseball salary dataset, we looked at each player as they changed from season to season and how their salaries changed.

The Univariate Statistics step provides common statistics for the data stream being analyzed. Having these values can be used for data validation, comparisons between data loads, and for reporting. In the baseball salary dataset, we used the Univariate Statistics step to see the metrics around salary (specifically the mean, min, and max salary numbers for all the records).

The Group by step not only provides the same types of statistics as the Univariate Statistics step, but also allows for grouping the data together. For the baseball dataset, we used the Group by step to see the metrics around the mean, min, and max salary numbers for each player.

Building a random data sample for Weka

Weka is another open source tool, officially supported by Pentaho, that focuses on data mining. Like its cousins R and RapidMiner, Weka provides a library of statistical analysis tools that can be integrated into complex decision making systems. For this recipe, we will go over how to build a random dataset for Weka using Kettle.

How to do it...

Perform the following steps to build a random data sample for Weka:

Create a new transformation and add a Text file input step to the canvas.

Have the Text file input step point to the Salaries.csv file. On the Content tab, be sure to change the Separator from ; to ,. On the Fields tab, use the Get Fields button to get the fields of the file. Click on OK to close the Text file input step.

Add a Reservoir Sampling step from the Statistics folder in the palette to the canvas. Create a hop from the Text file input step to the Reservoir Sampling step.

Open the Reservoir Sampling step. Change the value of the Sample size (rows) field to 1000. Click on OK to close the step.

Add an ARFF Output step to the canvas. Create a hop from the Reservoir Sampling step to the ARFF Output step.

Open the ARFF Output step. For the File name field, call the file baseball_salaries.

Switch to the Content tab. Ensure the Format matches the environment that you are working in (DOS for Windows, Unix for *nix).

Switch to the Fields tab. Use the Get Fields button to get the data stream fields into the Fields data grid. The step will make a best guess at the ARFF type for each data element. Click on OK to close the step.

Run the transformation. An ARFF file will be generated and can be used to work with the data within Weka.

How it works...

This recipe utilizes two steps: the first (Reservoir Sampling) can be used by any process that only needs a random sample of the data; the second (ARFF Output) transforms the dataset into the standard format for Weka.

Reservoir Sampling takes large datasets and randomly selects records to create a smaller representative sample of the data. The two options in the step, sample size and random seed, control how big the sample set should get and how the records are randomly selected. For more details on the step, check out the Pentaho wiki at http://wiki.pentaho.com/display/DATAMINING/Using+the+Reservoir+Sampling+Plugin.
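The classic technique behind the step is Vitter's Algorithm R, which keeps a uniform sample in a single pass without knowing the dataset's size up front. A plain-Python sketch (not PDI's actual implementation) shows how the sample size and random seed options interact:

```python
import random

def reservoir_sample(stream, sample_size, seed=None):
    """One-pass uniform sampling (Algorithm R): a sketch of what the
    Reservoir Sampling step does, not PDI's actual implementation."""
    rng = random.Random(seed)  # the step's 'random seed' option
    reservoir = []
    for i, row in enumerate(stream):
        if i < sample_size:
            reservoir.append(row)  # fill the reservoir first
        else:
            # Each later row replaces a reservoir slot with probability
            # sample_size / (i + 1), keeping the sample uniform.
            j = rng.randint(0, i)
            if j < sample_size:
                reservoir[j] = row
    return reservoir

sample = reservoir_sample(range(100000), 1000, seed=42)
print(len(sample))
```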

The ARFF Output step takes the data stream and stores the data in the standard format that Weka uses to process data. The first part of the file is the header, which provides the field details (name, type, and so on) and can also store the data source details (who created it, when the dataset was created, and so on). The second part fits the typical comma-separated values format, with each record's fields separated by a comma. To learn more about the format, check out the Weka wiki at http://weka.wikispaces.com/ARFF+(stable+version).
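The two-part layout described above is easy to see in miniature. The following sketch builds an ARFF document as a string (field names and values invented; the real step writes the file and infers attribute types for you):

```python
# A minimal sketch of the ARFF layout the ARFF Output step produces:
# a header declaring the relation and its attributes, followed by
# comma-separated data rows. Field names and values are invented.
rows = [
    ("aardsda01", 2007, 387500),
    ("abreubo01", 2007, 15000000),
]

header = "\n".join([
    "@relation baseball_salaries",
    "",
    "@attribute playerID string",
    "@attribute yearID numeric",
    "@attribute salary numeric",
    "",
    "@data",
])
data = "\n".join(",".join(str(v) for v in row) for row in rows)
arff = header + "\n" + data + "\n"
print(arff)
```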

There's more...

There is another Kettle plugin that will actually take advantage of a model built in Weka and return the results back for further processing within Kettle. The step is called Weka Scoring and is a great place to start integrating machine learning into normal data integration processes. To learn more about the plugin, check out the wiki at http://wiki.pentaho.com/display/DATAMINING/Using+the+Weka+Scoring+Plugin.
