Complete Pentaho Installation on Ubuntu, Part 6

Install PDI/Kettle and Agile PDI in a Development Environment

This is where you’ll install and play with one of the most interesting, well-crafted, user-friendly and enjoyable applications I have ever used.

It is also the heart of the Pentaho BI suite, as the tool to Extract, Transform & Load (ETL) data, process it and execute jobs. Its former name was Kettle; it is now known as PDI, for Pentaho Data Integration, and also as Spoon, the name of its executable file.

We’ll install PDI as a desktop development tool.

1. Get and Install PDI.

Go to the Pentaho files on SourceForge here and download the latest stable release (version 4.2 should be up by July 2011; it’s a remarkable new version, check the improvements).

Double-click on the file and extract its contents into a new folder:

/Pentaho/data-integration
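If you prefer the terminal, the extraction is a one-liner. The archive name below is an assumption for illustration; use the file you actually downloaded:

```shell
# Create the target folder and extract the PDI archive into it.
# The zip already contains a data-integration/ top-level folder.
mkdir -p ~/Pentaho
unzip -q ~/Downloads/pdi-ce-4.2.0-stable.zip -d ~/Pentaho/
```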

You can delete the .bat files and make all the .sh files, especially spoon.sh, executable (right-click, Permissions tab). Then, in a command terminal, start it with:

./spoon.sh
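The same cleanup can be done from the terminal. The path below is an assumption; adjust it to wherever you extracted the archive:

```shell
# Assumed install location from the extraction step above.
PDI_HOME="$HOME/Pentaho/data-integration"

rm -f "$PDI_HOME"/*.bat     # the .bat files are Windows launchers, not needed on Ubuntu
chmod +x "$PDI_HOME"/*.sh   # make spoon.sh and the other scripts executable
```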

Close the window dialog that offers to open a repository. You should now be in PDI, the development environment:

PDI

Note: You don’t have to configure anything or add drivers for common databases. I told you: this open source application is the result of a great community and is a very well-crafted product, you’ll see.

[Edit]
Oops, there is a glitch between the interface and the new Ubuntu 11.04 scrollbars: they don’t work and won’t let you drop steps onto the canvas. The solution I took is to disable them, as shown on the PDI forums here.
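A sketch of that workaround, assuming the culprit is Ubuntu’s new overlay scrollbars: the environment variable below disables them only for applications launched from that shell, so your other programs keep the new look.

```shell
# Turn off the Ubuntu 11.04 overlay scrollbars for this shell session,
# then launch Spoon from the same terminal with ./spoon.sh as usual.
export LIBOVERLAY_SCROLLBAR=0
```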

2. Meet the Application

There are several resources you should browse and revisit as you familiarize yourself with the concepts of ETL and this tool:

An 18-slide presentation by the project founder, manager and lead developer of PDI, explaining its capabilities. Slide #11, ‘use-cases’, lists some of its uses. [video].

Check the guide that comes with your download at
/Pentaho/data-integration/docs/English/getting_started_with_pdi.pdf

A nice way to start learning about PDI, ETL and data warehousing is to open the samples folder and check the component names and their notes, which are self-explanatory. If you double-click on a component you will see the parameters that specify its behavior. If you right-click on it you can select options to see the description, the input or output fields, the text description (you should document the intention of the activity here), a preview of a sample run, etc.

Once you have reviewed some transformations, I recommend starting by creating an object fundamental to multidimensional analysis: the time dimension. This is a table with a row for each day in the calendar and columns for special attributes like month, quarter, year and weekend flags. That makes it easy to select dates based on those columns and then fetch the values in the fact table using just the indexed records that contain those dates.
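To make the idea concrete, here is a minimal sketch of a time dimension written as a CSV file, assuming GNU date. In PDI you would build this with steps like Generate Rows and Calculator, but the columns are the same idea: one row per calendar day, with attributes you can filter on.

```shell
# Generate a small date dimension: one row per day with calendar attributes.
start=2010-01-01
days=31    # one month as a demo; use ~3650 for a ten-year dimension

echo "date_key,iso_date,year,quarter,month,day_of_week,is_weekend" > dim_date.csv
for i in $(seq 0 $((days - 1))); do
    d=$(date -d "$start + $i day" +%Y-%m-%d)
    month=$(date -d "$d" +%m)
    dow=$(date -d "$d" +%u)                  # 1 = Monday ... 7 = Sunday
    quarter=$(( (10#$month - 1) / 3 + 1 ))   # 10# avoids octal parsing of "08", "09"
    [ "$dow" -ge 6 ] && weekend=1 || weekend=0
    echo "$(date -d "$d" +%Y%m%d),$d,$(date -d "$d" +%Y),Q$quarter,$month,$dow,$weekend" >> dim_date.csv
done
```

A surrogate key in YYYYMMDD form, as used here, is a common convention for date dimensions because it stays readable while still being an integer you can index.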

A nice specification for a time dimension table is listed in this post by Nicholas Goodman. His blog has very interesting information too.

Check these pages, download the examples and run them in your environment.

Kettle Tip: Using java locales for a Date Dimension – Sept 2007 (link).
In this post, Roland Bouman shows a simplified extraction and then proceeds to show how to connect to a database, use SQL to create the table and execute it.
Then he explains three more steps to generate the data.
Here you will see the most difficult part of using PDI: the JavaScript step.

HowTo: Create a date dimension with PDI (Kettle) – March 2010 (link)
The author adds more characteristics for a day and uses more PDI steps to obtain them: calculators, filters (select), lookups. Think of it as version 2.0 of the previous example.

Building a detailed Date Dimension with Pentaho Kettle – Sept 2010 (link)
Here, Slawomir Chodnicki briefly explains his design considerations. One important thing is how he introduces the concept of updating your dimension data just by re-running the transformation; this is something we must get used to. It is important if your job crashes and you have to rebuild the process, or if you want to be able to continue from a given point.
The file contains some errors in the JavaScript steps (some variables are referenced but not defined); it is an opportunity to see PDI’s debugger messages.

5. Working without a Repository

If you are working with a developer team you should create a repository. It’s simple: just click on the New button and, with a user that has DB privileges on MySQL, create the database.

Then you will get a single area for your programs and avoid versioning and syncing problems; your connections also get stored there, etc. But if you are one or two people (normal for a pilot project) it is best to avoid using one. You can just sync and back up your program folders, and you don’t need to change the normal way the BI server looks for programs.
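A minimal back-up sketch for the no-repository setup; both paths are assumptions for illustration, and the folders are created if they don’t exist yet:

```shell
SRC="$HOME/Pentaho/etl"     # assumed folder where you save your .ktr and .kjb files
DEST="$HOME/backups"

# Archive all transformations and jobs into one dated tarball.
mkdir -p "$SRC" "$DEST"
tar czf "$DEST/etl-$(date +%F).tar.gz" -C "$SRC" .
```

Running this from cron (or syncing the folder with any file-sharing tool) gives a small team most of what a repository would, minus the shared connection definitions.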

There is a good sample chapter from the book “Pentaho Data Integration 4 Cookbook”: a transformation, a report from PDI data, PDI jobs run from the BI Server process/PUC, PUC-PDI-CDA, and dashboards and data from PDI.

In the JavaScript step you can see very useful sample code for each function. DEinspanjer explains it with more detail than this: in the left panel, open Transform Functions, then Date Functions, right-click on dateDiff and select Sample.

8. Ruby Plugin

[Edit August 31, 2011]
It’s news to me that there is another way to do the scripting besides the JavaScript step. Now you can write your process in Ruby with another plugin that you just need to unzip into the plugin directory: ruby-scripting-plugin.
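The install can be sketched like this; the zip name and the plugins sub-folder are assumptions, so check the plugin’s own README for the real ones:

```shell
# Assumed PDI install location from the earlier steps.
PDI_HOME="$HOME/Pentaho/data-integration"

# Step plugins live under the plugins folder of the PDI install;
# unzip the downloaded plugin archive there and restart Spoon.
unzip ~/Downloads/ruby-scripting-plugin.zip -d "$PDI_HOME/plugins/steps/"
```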

7 thoughts on “Complete Pentaho Installation on Ubuntu, Part 6”

hello Martinez
Reached this post. Skipped ahead to schema editor. All is not well. Still couldn’t get Foodmart cubes in front-end. Actually I cannot find the xml files for this schema in BI installation. Thank you
mahamood