Saturday, June 10, 2006

In my previous blog entry, I wrote about how I'm currently checking out the Pentaho open source Business Intelligence platform. I've only covered a small part of what I planned to check out, but here I'd like to present some of the things I've found so far.

In this blog entry, I will focus on some features of the latest addition to the Pentaho platform: K.E.T.T.L.E, which forms the principal ETL component of the platform.

In case you're interested in Pentaho in general: I just heard that on June 21, MySQL will be conducting a web seminar on BI with Pentaho and MySQL, so you might want to register.

Huh? Kettle!? I thought you said Pentaho?!

I didn't mention this in my previous blog entry, but most of Pentaho is built on and from existing components rather than developed from scratch. For example, the reporting layer runs JasperReports, BIRT (discussed in one of my earlier blog entries) and JFreeReport, OLAP Services are provided by Mondrian, and one of the core components, the Workflow engine, is the Enhydra Shark Engine.

Pentaho actively contributes and strives to improve some of these existing components, for example, JFreeReport. The same goes for Kettle, which was actually acquired by Pentaho. Pentaho also bundles the components and provides the framework to let them work together.

Right now, the Pentaho developers are busy integrating Kettle with the rest of the platform, and probably sooner rather than later, the services that are now provided by Kettle will be known as 'Pentaho Data Integration'. (In fact, I just noticed that the welcome screen of the latest development release now announces itself as "Pentaho Data Integration - Previously Kettle".)

So, what'd you say Kettle was...?

Anyway, back to Kettle. Kettle is a free, open source (LGPL) ETL (Extraction, Transformation and Loading) tool. The product name should actually be spelled as K.E.T.T.L.E, which is a recursive acronym for "Kettle Extraction, Transport, Transformation and Loading Environment".

Kettle was first conceived about four years ago by Matt Casters, who needed a platform-independent ETL tool for his work as a BI Consultant. Matt's now working for Pentaho as Chief of Data Integration. Although there's a growing number of contributors, a lot of the work on Kettle is still being done by Matt himself.

E.T.L. and Datawarehousing

Being an ETL tool, Kettle is an environment that's designed to:

collect data from a variety of sources (extraction)

move and modify data (transport and transform) while cleansing, denormalizing, aggregating and enriching it in the process

frequently (typically on a daily basis) store data (loading) in the final target destination, which is usually a large, dimensionally modelled database called a data warehouse

Although most of these concepts are equally applicable to almost any data importing or exporting processes, ETL is most frequently encountered in datawarehousing environments.
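The three ETL phases can be pictured with a tiny, self-contained Python sketch. This is purely illustrative: the function names, the CSV layout, and the list standing in for a warehouse table are all made up here, and none of this is Kettle code.

```python
# Minimal extract-transform-load sketch (illustrative only).
import csv
import io

def extract(source):
    """Extraction: read raw CSV text into a stream of records (dicts)."""
    return list(csv.DictReader(io.StringIO(source)))

def transform(records):
    """Transformation: cleanse (strip whitespace) and enrich (derive a field)."""
    out = []
    for r in records:
        r = {k: v.strip() for k, v in r.items()}
        r["full_name"] = f'{r["first"]} {r["last"]}'
        out.append(r)
    return out

def load(records, target):
    """Loading: store the transformed records in the final destination
    (a plain list standing in for a data warehouse table)."""
    target.extend(records)

warehouse = []
raw = "first,last\n Ann ,Smith\n Bob , Jones\n"
load(transform(extract(raw)), warehouse)
print(warehouse[0]["full_name"])  # Ann Smith
```

In a real datawarehousing setup, the extract step would read from operational systems, and the load step would write to dimension and fact tables on a schedule.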

Kettle Architecture

Kettle is built with the Java programming language. It consists of four distinct applications (tip: follow the links for relevant .pdf documentation):

Spoon

is a graphically oriented end-user tool to model the flow of data from input through transformation to output. One such model is also called a transformation

Pan

is a commandline tool that executes transformations modelled with Spoon

Chef

is a graphically oriented end-user tool used to model jobs. Jobs consist of job entries such as transformations, FTP downloads etc. that are placed in a flow of control

Kitchen

is a commandline tool used to execute jobs created with Chef

Model-driven

An interesting feature of Kettle is that it is model-driven. Both Spoon and Chef offer a graphical user interface to define the ETL processes on a high level. Typically, this involves no actual programming at all - rather, it's a purely declarative task which results in a model.

The command line tools Pan and Kitchen (or rather the underlying API) know how to read and interpret the models created by Spoon and Chef respectively. These tools actually execute the implied ETL processes. This is done all in one go: there is no intermediate code generation or compilation involved.
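The idea of directly interpreting a declarative model can be sketched in a few lines of Python. This is a hypothetical illustration of the principle only: the dict-based model format and step names are invented here and bear no relation to Kettle's actual XML format or API.

```python
# A toy "engine" that interprets a declarative model directly,
# with no intermediate code generation or compilation.
# The model format and step types below are invented for illustration.
STEP_TYPES = {
    "uppercase": lambda rows: [r.upper() for r in rows],
    "prefix":    lambda rows, text="": [text + r for r in rows],
}

def run_model(model, rows):
    """Walk the declared steps in order, applying each one to the stream."""
    for step in model["steps"]:
        func = STEP_TYPES[step["type"]]
        rows = func(rows, **step.get("args", {}))
    return rows

# A purely declarative description of the process: no programming,
# just a model that the engine knows how to execute.
model = {"steps": [
    {"type": "uppercase"},
    {"type": "prefix", "args": {"text": ">> "}},
]}
print(run_model(model, ["hello", "world"]))  # ['>> HELLO', '>> WORLD']
```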

Repository-Based

Models can be saved to file in a particular XML format, or they can be stored into a relational database: a repository. Using a repository can be a major advantage, especially when handling many models. Because the models are stored in a structured manner, arbitrary queries can be written against the repository.
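To give a flavor of what "arbitrary queries against the repository" could look like, here is a toy example using sqlite3. Note that the table and column names below are made up for illustration and do not match Kettle's actual repository schema.

```python
# Because models live in relational tables, questions like
# "which transformations write to a database table?" become SQL queries.
# Schema and data below are invented, not Kettle's real repository layout.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transformation (id INTEGER, name TEXT)")
con.execute("CREATE TABLE step (transformation_id INTEGER, step_type TEXT)")
con.executemany("INSERT INTO transformation VALUES (?, ?)",
                [(1, "load_customers"), (2, "load_orders")])
con.executemany("INSERT INTO step VALUES (?, ?)",
                [(1, "TextFileInput"), (1, "TableOutput"),
                 (2, "TableInput"), (2, "TableOutput")])

rows = con.execute("""
    SELECT t.name
    FROM transformation t
    JOIN step s ON s.transformation_id = t.id
    WHERE s.step_type = 'TableOutput'
    ORDER BY t.name
""").fetchall()
print([r[0] for r in rows])  # ['load_customers', 'load_orders']
```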

The repository may also be used to store the logs that are generated when executing transformations and jobs. Certain environments, such as banks, require that every manipulation that is performed with financial data be stored for longer periods of time for auditing purposes. The repository sure seems to be the place to do that, at least, as far as the ETL process is concerned.

Installing and Running Kettle

In case you're worried: no, we do not need to build a datawarehouse to show how Kettle can help you schedule the extraction, transformation and loading of 'hello world' into it. All we need is a Java 1.4 runtime environment (which is probably sitting on your system anyway) and Kettle itself.

If you don't have a suitable Java version, you can find one on the Sun Microsystems download page. Pick the distribution appropriate for your operating system, and install it according to the accompanying instructions.

Go to the Kettle downloads page to pick up Kettle. From there, you can download the stable 2.2.2 release (as well as a lot of other stuff, such as documentation, the latest development release, and the source downloads).

Installation is a breeze: just unzip the distribution in a location of your choice. Executables for the Kettle tools are located in the topmost Kettle directory. For the Windows operating system, these are .bat (batch) files: spoon.bat, pan.bat, chef.bat and kitchen.bat.

For *nix operating systems, .sh files (shell scripts) are provided: spoon.sh, pan.sh, chef.sh and kitchen.sh. These need to be made executable before they can be used. This can be done by executing this command from inside the Kettle directory:

chmod +x *.sh

Kettle Documentation is provided in the docs directory right below the main Kettle directory.

Putting the Kettle on

Ok, let's give this tool a spin, and see what it looks like. In the remainder of this article, I will be using a transformation and some data files I prepared. You can download them here (2k .zip download). Just unzip the archive into a directory of your choice. Be sure to use this directory wherever the remainder of this article refers to a particular filename.

Launching Spoon

Launch Spoon by executing the .bat (Windows) or .sh (*nix) file. The Kettle splash screen pops up, and right after that, you're invited to connect to a repository. We'll be skipping that for now, so hit the 'No Repository' button at the bottom of the dialog.

I think that a repository is a great feature, and I think it's even better that Kettle lets you decide if you want to use it. Although we won't be using the repository right away, we can still connect to one at any time once Spoon is fully up and running.

Here's a picture of Spoon:

On the upper left side, a treeview is visible. The treeview acts as an explorer of the currently loaded model, as well as a menu from which we can drag items into the graphical view.

On the lower left side, right below the treeview, there's a little window that maintains a history of the different types of steps that were added to the model.

The right side shows a number of tabbed views. In the picture, the graphical view of the transformation model is selected. The model shown here is the 'hello, world!' transformation I just referred to. It's included in the zip, but you can also pick it up right here.

Spoon concepts

Although the transformation shown here is very simple, it does illustrate some of the most important concepts of Spoon. So, let's take a closer look at its ingredients: the steps and the hops.

Steps

First of all, the graphical view shows a number of square icons. These are called transformation Steps. One such step denotes a particular kind of action that is performed on data. Steps are easily created by dragging them from the treeview and dropping them on the graphical model view.

Once a step is created, it can be opened by double-clicking it. A dialog window appears that can be used to parameterize the step to specify its exact behaviour. Most steps have multiple property pages, corresponding to the different categories of properties applicable to steps of that type.

In all cases, the name of the step may (and should) be modified to clarify the function of the step in the light of the entire ETL process. Most step types also have a separate property sheet to define the fields flowing in or out of the step.

Kettle provides a lot of different step types, and you can extend Kettle and plug in your own. However, fundamentally, there are only three different kinds of steps: Inputs, Transformations, and Outputs.

Input

Input steps process some kind of 'raw' resource, such as a file, a database query or system variables, and create an output stream of records from it.

Transformation

Transforming steps process input streams and perform particular actions on them, often adding new fields or even new records. The result is fed to one or more output streams. Kettle offers many transformation steps out of the box. Some steps in this category perform very simple tasks, such as renaming fields; others perform complex tasks, such as normalizing data or maintaining a slowly changing dimension in a datawarehouse.

Output

Output steps are like the reverse of input steps: they accept records, and store them in some external resource, such as a file or a database table.

Hops

In the graphical representation of the model, lines are visible that form connections between the steps. In Kettle, these connections are called hops. Hops between steps behave like pipelines. Records may flow through them from one step to the other. The records indeed travel in a stream-like manner, and steps may buffer records until the next step in line is ready to accept them. This is actually implemented by creating a separate thread of execution for each hop.
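The pipeline behaviour of a hop can be pictured as a bounded queue sitting between two concurrently running steps: the producing step buffers records until the consuming step is ready. The following is a generic Python sketch of that idea, not Kettle's actual implementation.

```python
# A hop modelled as a bounded queue between two concurrent steps.
import threading
import queue

def producer_step(hop):
    """The upstream step: emit records into the hop."""
    for n in range(5):
        hop.put(n)          # blocks if the hop's buffer is full
    hop.put(None)           # sentinel: no more records

def consumer_step(hop, results):
    """The downstream step: accept records as they become available."""
    while True:
        record = hop.get()
        if record is None:
            break
        results.append(record * 10)   # some stand-in transformation

hop = queue.Queue(maxsize=2)   # small buffer between the two steps
results = []
t1 = threading.Thread(target=producer_step, args=(hop,))
t2 = threading.Thread(target=consumer_step, args=(hop, results))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [0, 10, 20, 30, 40]
```

Because both steps run at the same time, the downstream step starts processing long before the upstream step has finished producing, which is exactly what makes this streaming model fast.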

Hops can be created by placing the mouse pointer over the source step, holding the shift key and then dragging (holding the left mouse button and the shift key) to the destination step. Hops may also be created by dragging the 'Hops' node in the treeview onto the canvas. A dialog then appears that lets you select the source and destination steps from dropdown list boxes.

A Simple Recipe

Now that these spoon concepts are introduced, let's see how they are used to actually create an ETL process. Tip: click the picture to download the zipped data files and Kettle transformation if you haven't done so already.

By just looking at the graphical representation of the model, most people will grasp immediately what the implied ETL process looks like. First, in the step named Text file input, data is presumably read from a text file. From there, the data then flows into a step named Filter rows. There, a condition is checked for each record that flows through the input stream. The records are then split into two separate streams: data that passes the filter, and data that does not. Finally, each stream of filtered data is written to its own text file.

Text file input

Double-clicking the text file input step reveals a number of Property pages: File, Content, Error handling and Fields. The text file that is read for input is specified in the File tab. Use the Browse button to locate files. After that, hit the Add button to select the file for input.

A filename (or pattern) may be specified that defines the files that are to be read. Although you can add as many files as you like, only one file is specified in this case: spoon_hello_world_input.txt. Here are the contents:

Hello, World
Welcome to Spoon!

(Actually, this way of specifying the file is not really recommended for production purposes. The filename is 'hardcoded' into the transformation, which makes the transformation inconvenient to deploy. Check out this technical tip for a more robust method of working with input files.)

Use the Content tab to specify the format of the input file. For this example, I modified the field separator to a comma, and I unchecked the header row option.

Use the Fields tab to define the record layout. You should be able to use the Get Fields button, which tries to discover the record layout using data from the specified files. (Although I've successfully used this function before, this particular setup seems to provoke a bug.) After that, you can rename the fields and adorn them with more specific properties, such as type, length and format.

Use the Preview rows button to check if the data is extracted in the desired format:

Well, this definitely seems correct: the lines are interpreted as records. The first line is split at the comma. Because the second line did not contain a comma, only the first field got a value.
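What the preview illustrates can be sketched in a few lines: each line is split at the field separator, and when a line has fewer separators than there are defined fields, the trailing fields simply stay empty. This is an illustrative sketch of the behaviour, not Kettle's code.

```python
# Splitting lines into fields at a separator, padding missing fields.
def parse_line(line, separator=",", field_count=2):
    parts = [p.strip() for p in line.split(separator)]
    parts += [None] * (field_count - len(parts))   # pad missing fields
    return parts[:field_count]

lines = ["Hello, World", "Welcome to Spoon!"]
rows = [parse_line(line) for line in lines]
print(rows)  # [['Hello', 'World'], ['Welcome to Spoon!', None]]
```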

A lot of the things we've seen so far, such as the Fields tab and the Preview rows button, are common to a lot of step types.

Filter rows

There's just one hop that leaves the Text file input step, and it arrives at the Filter rows step. The filter rows step is a transforming step. It changes the contents of the inputstream by filtering out the records that do not match a criterion specified in the step properties. Double-clicking the step pops up the following dialog:

In the lower part of the dialog, a condition has been defined:

Field2 IS NOT NULL

The two dropdown list boxes must be used to link to subsequent steps. The first one specifies the step that will receive the data that passed the criterion. The second one specifies the step that will receive all records that did not match the criterion.
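In miniature, the Filter rows step applies the condition to each record and routes it to one of two streams, each of which feeds its own target step. The sketch below is illustrative Python, not Kettle's code; the sample rows match the 'hello, world!' data and the Field2 IS NOT NULL condition.

```python
# A condition splits the incoming stream into a "true" stream and a
# "false" stream.
def filter_rows(stream, condition):
    passed, rejected = [], []
    for record in stream:
        (passed if condition(record) else rejected).append(record)
    return passed, rejected

rows = [["Hello", "World"], ["Welcome to Spoon!", None]]
# Field2 IS NOT NULL
passed, rejected = filter_rows(rows, lambda r: r[1] is not None)
print(passed)    # [['Hello', 'World']]
print(rejected)  # [['Welcome to Spoon!', None]]
```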

Output steps

There are two output steps in the model: one to catch the records that passed the filter, and one to catch those that were rejected. Both are simple Text file output steps. This type of step is quite like the reverse of the Text file input step, and the methods to specify the filename and fields are virtually the same. So, here too, a file name and location must be specified, and fields need to be defined so the step knows which fields it should write to file.

Verify, Preview and Run

Now that we have a transformation model, we can actually do things with it. Spoon's got a special transformation menu with items to work with the transformation model.

First of all, we should verify the model. The Verify action can be found in Spoon's Transformation menu, and also on the toolbar. Verification logically checks the model. It will report unreachable steps, field name mismatches and other logical errors.

We can then preview the transformation. Preview is just like a normal execution of a transformation, except that it is run for a limited number of rows. The Preview action can be found in the menu and on the toolbar. The action is also accessible from the bottom of the Log View. When you choose to do a Preview, Kettle switches automatically to the Log View:

The log contains all kinds of information on the process execution: number of rows that entered a step, the errors that occurred, the time it took to run the task and much, much more. The log view is a very valuable source for troubleshooting and debugging the transformation.

If you're happy with the preview, you can then run the transformation for real. This action too is available from the menu, the toolbar and the log view.

This suggests the transformation works fine. The Filter rows step nicely separated the records that had comma-separated values from the records that didn't.

Final thoughts

There are so many things I would still like to write about Pentaho and Kettle, and I will most certainly do that in forthcoming blog entries. Just read on and see what you can expect in the remainder of this month and the next.

Realistic use case

The example shown here is of course very easy and does not approach the complexity of a real-world application. However, it did touch on a few of the fundamental concepts behind Spoon, and served that purpose well. I will be providing and explaining a more detailed transformation shortly.

What about Chef?

Although Spoon transformations are an important part, Kettle entails more than just that. We haven't discussed Chef and jobs at all; I'll write something about that soon.

Using the repository

Although I mentioned the repository, we didn't actually use one. I will most certainly explain this feature in greater detail soon.

JDBC configuration

KETTLE comes with database support in the form of a lot of drivers. Sometimes you want to upgrade a particular driver. Easy, if you know how...

(Pentaho) deployment

Here, we used Kettle as a standalone tool: we never deployed the transformation. It would be nice to see how ETL hooks in with the workflow engine.

109 comments:

As a developer of Kettle I have to wonder about the screenshots though: on which platform were they made? Please give the new 2.3.0 a try; this version has new SWT libs with better-looking graphics on WinXP.

Then again, I'm not sure that's all too important. In the ETL/Data integration space, much worse examples of windowing interfaces can be found, hehe :-)

thank you for this great product! There doesn't seem to be much FOSS ETL going on, and it's really a pleasant surprise to experience the high level of quality of a tool like Kettle (and of course - Pentaho as a whole).

The screenshots were taken on my IBM Thinkpad - running Win XP Pro with 32 bit colors. I made them using the quick and dirty method I use for all pictures on the blog.

Kettle is great; it's amazingly functional and I'm usually pleasantly surprised at the corners I turn to find very helpful ETL/DW developer features that make life easy. :) I hope you get a lot of value out of Kettle!

Thanks for this great and simple introduction about Kettle. I have just started using this wonderful product and your article has helped me in giving a good head-start. I look forward to your blog for inputs on Chef and other modules in Kettle.

Your article was an excellent guide to help me understand how the product works. After spending my whole career in the mainframe world (more years than I want to admit), I've been given the opportunity to transform data from Progress DB to Oracle DB using Kettle/Spoon.

"Why the strict distinction within the Kettle GUI environment between a TRANSFORMATION and a JOB?"

Well, this is a good question. Many people are at first puzzled by the distinction. It is however not hard when you realize a few things:

- steps in jobs are executed sequentially: all work done inside a step must complete before the next step is entered and executed

- steps in transformations are asynchronous: each step transforms input rows into output rows, but the rows can flow through the diagram at their own pace. A step is done when all its input rows are transformed into output rows, but the rows that come out of the step proceed immediately to the next step and do not wait for the other input rows. In a running transformation, almost all steps execute simultaneously. In a job it is also possible that multiple steps execute simultaneously if you have a split in the flow, but in a job, the upstream and downstream steps do not execute simultaneously, only sequentially.

- jobs are task oriented and that is indeed also why execution is sequential. By using a transformation as a job step, you can first completely perform the entire task of transforming rows, and then (and only then) move on to the next step.

- transformations are data (or row) oriented. Parallel, asynchronous flow through the transformation steps is good for performance. Transformations would be incredibly slow if the rows would have to wait for each other to enter the next step.

- the difference in flow has a few important consequences. One of them is that jobs have one and only one entry point. Transformations don't care about that - in one transformation, you can have multiple streams of data that are not connected to each other and that will execute in parallel. Another consequence is that transformations do not allow loops, whereas you can have loops in jobs. Conceptually, loops are maybe not impossible in transformations, but as I understand it, they are pretty hard to implement correctly - the sequential execution of job steps allows more opportunities for that type of thing, and that is why it is possible there.
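The sequential, task-oriented behaviour of jobs described above can be sketched in a few lines. This is a generic illustration of the principle, not Kettle internals; the entry names are made up.

```python
# A job runs its entries strictly one after another: each entry must
# finish completely before the next one starts.
def run_job(entries, state):
    for entry in entries:
        entry(state)        # an entry only returns when its task is done
    return state

log = []
run_job([
    lambda s: s.append("download file"),
    lambda s: s.append("run transformation"),
    lambda s: s.append("send mail"),
], log)
print(log)  # ['download file', 'run transformation', 'send mail']
```

Contrast this with a transformation, where all steps would be running at once and rows would be flowing between them continuously.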

"I mean, hierarchicaly, one or more transformations (i.e. steps) are included within a kettle job ?"

Well indeed, transformations are probably among the most used types of steps. But there is more - download a file, XSL transformation, ping a server...Many of these don't translate to the row oriented model inherent to transformations, and are thus only available as job steps.

"Is the converse possible ?"

I guess that by now, you can guess that the answer to this question is "No!" ;) However, it is possible to use transformations in transformations.

One thing might be a bit unclear at this point. You might have concluded that job steps cannot transfer data to one another. That is not true - they can. But the transfer is in bulk: a transformation step that appears in a job first has to finish; then the results can be transferred to the next step, either in bulk or row by row. The difference with the way rows are transferred between steps in a transformation is that in a job, the step might also fail - in that case no results are transferred at all. In the case of a transformation, many rows might have flowed through the transformation before a problem occurs, at which point the transformation is brought to a stop.

Hi, I am using Pentaho Data Integration (Kettle 2.5) on a Postgresql 8.4 DBMS. I am running a simple transaction that selects data from one table and inserts that data into another table having the same columns and types. I noticed that the SELECT is carried out fine, however an error is being generated upon INSERT: INSERT statements are missing quotation marks (opening and closing) for the data to insert. e.g. INSERT INTO schema1.user(id, username, logintime) VALUES ( 1, joe, 2007-03-24 15:44:10) and this statement is causing the error. How can I solve this issue? Maybe there is some function to edit SQL statements style? Thank you, Jones

you can define error handling for some steps by connecting the step to the one that is to receive the errors, and then right-clicking on the step. Pick "define error handling" in the context menu to control what info you want to forward to the error handler.

Thank you for this great product. I am facing some problem with log information. In the logging view (read, written, input and output), it shows that all the records were moved, but some records were not moved. I believe there is some problem in the logging view. Please provide some documents related to the logging view.

I am not Matt and Matt is unlikely to scavenge someone else's blog for feedback on the Kettle product.

You should probably post on the Pentaho forums (forums.pentaho.org), and only if that does not resolve your issue, post a bug at jira.pentaho.org. A word of advice though: you are going to have to do a better job of describing the problems you are experiencing. After reading your comment, I have no idea what problem you are describing.

Hi Roland,Nice article on kettle and spoon.I have a different requirement here.I hope you can help me with it.I am building a product where in the beginning phase i need to do ETL.I plan to use Kettle for this purpose.Now the problem is tht i dont want to use the GUI.My application has to use Kettle to convert from say a flat file to db using ONLY the jars and the classes available. How do i go about doing that? Any pointers in this perspective. My applciation would user this interally - and hence all calls would be java based.


I am a newbie on Kettle and I got a problem with it. Hope you can help me.

I've tried to load data from a text file input into a database, but I got a problem when I tried to specify the fields. The data does not have any separator, and a row is recognized by a new line. Each field needs to be specified with its position and length. For example: 2009051320090514. I'd like to store it in two fields: 1. submissiontime, string, position: 1, length: 8; 2. deliverytime, string, position: 9, length: 8.


About Kettle vs. Talend: well, I mostly like the way there is choice in open source land too - a little bit of healthy competition helps to make both products better.

As for a more technical comparison: Talend is a code generator (I know it can generate Perl and Java - not sure about other target languages), whereas Kettle is an engine-driven solution (it interprets and executes the models).

I haven't had much experience with Talend so I can't comment in depth on that topic. When I downloaded and tried Talend, I did find it less intuitive than Kettle, but maybe that's just me and there are many people that find the reverse true. So I guess you should try for yourself.

Concerning performance - there are quite a few blogs and papers on the internet on that topic. There is no easy answer as the results from various authors seem to contradict each other.

What I really like about kettle is that you can transparently parallelize transformations, and that you can even run them on multiple machines. I don't believe such a feature is available in Talend (pls. check - I could be wrong). This means you can just keep scaling by adding more nodes.

Sounds great, and I think I will stay with Kettle then. But if I ever decide to use TOS someday anyway: will it be possible to integrate this tool into the Pentaho suite (i.e. regarding xaction workflows), or could I use TOS only as a third-party tool without any Pentaho integration in such a scenario?

"will it be possible to integrate this tool into the Pentaho suite (i.e. regarding xaction workflows) or could i use TOS only like a thirdparty tool without any Pentaho integration in such a scenario?"

Well, everything is possible of course - I mean, if you'd really want to you could write a xaction plugin to do whatever you please. But the point is, what problems does it solve, and is it worth the effort.

To tell you the truth for actual ETL, I don't recommend running kettle transformations through an action sequence either. I rather use the operating system scheduler to take care of that - I find it more manageable to use the server just for serving BI content to users, and independently run ETL to load the DWH. So in that case, you would be able to use any other ETL tool including TOS without impacting how you use your Pentaho server.

That said, there may be cases where you have a complex report or some complex logic to burst a report where you can benefit from having kettle doing some of the work (parallel data collection, aggregation) for you. In that case, it is extremely convenient to be able to run a transformation from an xaction. Odds are that you will lose with regard to out-of-the-box flexibility in case you'd wanna solve that using a hybrid talend/pentaho solution.

Hi, could you kindly tell me how to make use of the Kettle functionalities externally? That is, using the specific .jar files, I would like to create my own Java file which would refer to those jar files to execute. I tried, but there seems to be some problem in the path of the .jar files... I'm getting a (does not exist) error. Kindly throw me a light. Thanks in advance...

First of all, congratulations! Good job! I'm developing a transformation, and I'd like to test if the information is already recorded in the "to" database, and so avoid inserting the same information again. Thanks in advance. Leonardo from Rio de Janeiro.

Hi there, I need some help with text output. I need to generate a text output for the invoice data. I will process the text input, which will add an 'H' header record and a 'D' detail record to the same file. These records are not in sequence: first the 'H' records are written and then the 'D' records. How do I write the header and the corresponding line item details? Any help is appreciated.

there are several possible approaches. You could just setup two streams, one for the header and one for the detail records, then use a formula, calculator or javascript step to concatenate the strings to make a single field containing all the data. If you use parallel streams you can use the "append streams" steps to make a union of them (but ordered) and write that to file. (for serial stream you obviously control the order yourself)

If creating the header and detail records are in two separate transformations, you can use the "append" checkbox on the content tab of the text file output step.

there are many online resources that explain how to do this. Typically, for loading denormalized dimension tables, you would first use a "Table input" step to get the data from the lowest level, and then use "Database Lookup" steps to look up data from parent tables.

Hi Roland, I'm new to Kettle and I need some help. I have to read an ASCII file and write into a MySQL db. I have as input an ASCII file with 3 types of records (header, details, footer). Each type contains a different number of fields and different kinds of fields. Can you please help me to do it? Thanks, Stefano

Do you know in advance how many rows there are in the footer and details? If so, you should simply use three text inputs and use the header/footer configuration to specify which chunk you need.

If you don't know in advance how many rows there are in each section, use a generic text file input and treat it all as one big field. Then, use a regex step to distinguish between header/details/footer records. Lead the different record types each into their own stream, and use another regex to parse the individual fields.
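The suggested approach can be sketched as follows: read each raw line as one big field, use patterns to route header/detail/footer records into their own streams, then parse the fields per stream. The line layout and patterns below are invented for illustration; this is not a Kettle regex step configuration.

```python
# Route mixed record types into separate streams, then parse the fields.
import re

lines = ["H|2006-06-10", "D|item1|3", "D|item2|7", "F|2"]
streams = {"header": [], "detail": [], "footer": []}
patterns = {
    "header": re.compile(r"^H\|"),
    "detail": re.compile(r"^D\|"),
    "footer": re.compile(r"^F\|"),
}
for line in lines:
    for name, pattern in patterns.items():
        if pattern.match(line):
            # second pass: split the raw line into individual fields
            streams[name].append(line.split("|")[1:])
            break

print(streams["detail"])  # [['item1', '3'], ['item2', '7']]
```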

Hi Roland, thanks a lot for your fast reply, but I still have another problem: reading an ASCII file with 3 types of records (header, details, sub-details) to write them into a MySQL db, I have to respect the referential integrity of the db, writing an error file. How can I handle this in Kettle?

Hello, I have this problem: I'm reading a text file sequentially, and each line of the file should be written to a database table. How do I handle errors of the procedure for each row? Example: the file contains 10 lines; after the first row errors while writing to the database, the other 9 are not processed because the processing is interrupted. Let me know how I can make sure the next 9 rows are still processed. Thank you

Hi Roland, I wanted to know if there is a way to handle database errors similarly to the step error handling done on rows. I have a transformation containing a database update that is executed in a loop, and I need to ignore database errors during the update without stopping the procedure.

Roland, thanks for your feedback. I need more help. I want an output field to contain the value of a table column concatenated with a string, for example: column_item + ": Error write to file". How can I do that? I tried writing {{column_item}} + ": Error write to file", but the procedure returns an error. Have you any idea? Thanks

Chef is a - now deprecated - GUI for designing kettle jobs. The Job designer is now (and has been for quite a while actually) integrated in Spoon, offering a single workbench/IDE for designing kettle jobs and transformations.

If you're interested in PDI / Kettle, Wiley recently published "Pentaho Kettle Solutions" (ISBN: 978-0-470-63517-9, check out the Wiley website for a sample chapter, TOC and index http://tinyurl.com/33r7a8m), authored by Matt Casters (Kettle founder and lead developer), Jos van Dongen and me. You can check it out on amazon here: http://tinyurl.com/33a7gpu

BTW - you really should go to sourceforge.net and grab yourself a PDI 4.0.1 - there's a few nasty bugs in 4.0

As you said, for production purposes we should not hardcode the file name in a transformation. You provided a link for this, but it points to the Pentaho site. Can you give me some idea of how to handle file names for production purposes?

You can either store configuration data in a file or database table, and read from there. The file input step (and many other steps) can use the value from a field of the incoming stream to specify the file name (and path)

Another way is using variables. Most steps allow you to use variables inside configuration values. There are a few built-in variables, such as the directory of the current transformation. See this example to learn how to use variables for specifying file names:

Hello Roland, I bought both of your books, they are on the way :) In the meantime, could you point me to some documentation on error handling? I saw you sent another gentleman to http://www.ibridge.be/?p=32 but, for example, it is not immediately clear to me what the "error fields fieldname" and "error code fieldname" should contain. Moreover, in that example they were populated with "errorfield" and "errorcode" respectively, and it is unclear to me whether these are internal Kettle variables, made-up names, etc. For example, if I need to find out, in a failed insert, what primary key was associated with it: is that information mapped in the "Error descriptor" fieldname? Somewhere else? Thanks for your help

I am able to read data stored in properties files as configuration files and to get the data stored in those files. In the configuration file I stored the variable name and the location of the XML files. I want to read the data from those XML files. I am using the "Get data from XML" input for that, but whenever I define the file name as a variable and define the loop XPath there, I do not get the fields stored in those XML files. Can you help me resolve this problem?

I have an XML file which has many elements, with a structure similar to the root element structure below.

files-root element

UserUserTypeaddressfileType

UserTypeaddressfileType

file

I want to put user, usertype, address, filetype as child elements in an element. If there is userType="registered", then only the user element is found. There are many records stored in the XML file. How can I do this using PDI?

Anonymous, it's not really clear what XML structure you have and what you want to do with it. You should probably take some time to examine the samples included in kettle. If you still have questions, please go to the pentaho forums: forums.pentaho.org

I am not exactly sure about the 3.2. beginners guide, but the "Pentaho Kettle Solutions" book contains very detailed explanations of both importing and exporting XML, and also covers XML validation, SOAP webservices, JSON, etc etc.

So do yourself a favour and save yourself a ton of time with these books.

this is a good point: as far as I know, the "Get Data From XML" step will try to read the DTD, even if the "Validate XML?" checkbox is not checked.

I do not know why it does that - perhaps it is a bug in kettle, perhaps the cause is deeper and has to do with the libraries used by kettle.

My advice is to file a bug in JIRA for it, to at least get this question answered. To read XML files regardless of this problem, I have successfully used a workaround that is really ugly but does in fact work: first, use an ordinary text file input step to read lines of plain text, and then filter out the lines that contain the DTD reference. After that, either concatenate the lines using a "group by" step and a "concatenate strings" function, and process the XML inline; or write the remaining lines back to a new text file, and read from there.
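Stripped of the Kettle steps, the workaround amounts to something like the following. This is a minimal sketch; it assumes the whole DOCTYPE declaration sits on a single line, which a real file need not guarantee:

```python
# A small XML document with an (unreachable) DTD reference, as lines of text.
xml_lines = [
    '<?xml version="1.0"?>',
    '<!DOCTYPE order SYSTEM "http://example.com/unreachable.dtd">',
    '<order><item>widget</item></order>',
]

# Filter step equivalent: drop any line carrying the DOCTYPE declaration.
kept = [line for line in xml_lines if "<!DOCTYPE" not in line]

# "Group by" + "concatenate strings" equivalent: re-join into one XML string
# that can be parsed inline without triggering a DTD fetch.
clean_xml = "\n".join(kept)
```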

I have created a job where I want to rename a main folder and then copy a few selected folders from the main folder. I have one /index folder which contains many folders named userid-test, userid-result, userid-sem, userid-transaction. I want to rename that folder to index-2010 and then copy only the users' test and result folders from that index folder.

I created a job; first I retrieved the test folder names in one transformation and the result folder names in another transformation. I have also renamed the folder, but somehow I am not able to copy those test and result folders to the newly renamed folder. Please help me.

Hi, I am doing my final year project, "Web Usage Mining Using a Proxy Server". I have log files from a proxy server that are to be analysed to produce reports like top 10 sites visited, web navigation patterns, etc., using Pentaho and Mondrian.

But sadly this software was not taught to us, and I am a complete newbie to this. So I would appreciate it if someone could help me with which components are to be used when, please.

this is certainly a task you can perform with components from the pentaho stack.

One of the first steps you'll need to take is to design an appropriate database structure to serve the requirements. For Mondrian, it works best if you end up with a dimensional model of the data, which you can structure as either a star or a snowflake. There are a few books around that can help you with this task. I would suggest taking a look at "Pentaho Solutions" (Wiley, ISBN: 978-0470484326), since this book covers general data warehouse & BI concepts but also has concrete examples of applying them to the Pentaho business intelligence suite.

If you know what the database design looks like, you have to fill it with data. This is where Kettle comes in. Kettle has actually a few examples of analyzing webserver logs. These were provided by Daniel Einspanjer, a member of the Mozilla data metrics team. They are using kettle for this type of work, so if kettle is good enough for them, then it's probably good enough for anyone :)

Pentaho Solutions has no less than 3 chapters devoted to using kettle and I think this should go a long way. But if you need more information on using Kettle, there are currently two books I'd recommend: "Pentaho 3.2 Data Integration: Beginner's Guide" (Packt, ISBN 978-1847199546) and "Pentaho Kettle Solutions" (Wiley, ISBN: 978-0470635179)

A final note: the title of your research mentions "mining". I should point out that Mondrian is an Analysis tool, not so much a data mining tool. But luckily, Pentaho offers data mining in the form of Weka. This is also integrated in Kettle. "Pentaho Solutions" contains one chapter about data mining with Weka.

The book covers all components of the Pentaho BI Suite. You'll learn to install, use, and maintain Pentaho, and find plenty of background discussion that will bring you thoroughly up to speed on BI and Pentaho concepts.

Hi, I need Kettle to manage transferring data between some network elements. I want to know how I can build an ETL solution with Kettle and integrate it as part of an application. I also want to know what I need to install; I'm confused: do I just need to install Kettle (Spoon, Pan and Kitchen), the Pentaho server (biserver-ce) and Pentaho Design Studio? Do I need to reinstall them (with all the .bat files) each time I turn on my computer? To integrate the solution I want to build, what do I need? How can I create a link with the solution? What programming language is the most suitable to develop the application that will use the ETL solution I want to produce with Kettle?

kettle is just software - you install it once and use it as many times as you need after that. Installing kettle means simply unzipping the archive you download from the pentaho project page on sourceforge. Kettle (like all other parts of pentaho) relies on Java, but I assume you've got that installed already.

So for ETL, you only need kettle. Only if you want your pentaho BI server to run ETL directly do you need the BI server and design studio, to create an xaction that runs a kettle job or transformation. But it's certainly not a requirement to have your BI server run the ETL - many people just use kettle and run it outside the BI server (I do). And if you need scheduling, you can simply rely on the operating system's task scheduler (e.g. cron).

What you need to really integrate kettle with an application depends on what level of integration you need. If you want data input and output into and from your application, it might be a good solution to build a kettle plugin.

Or maybe your application has standard XML or web services to import/export data, and in that case I would probably not build a plugin but start with building a normal kettle transformation that does the job without actually integrating the app with kettle.

Or you may need to generate complex jobs and/or transformations dynamically from within your application, and in that case it seems a good idea to actually link to the kettle API and use that. This last approach involves directly addressing the java classes that kettle is made of. It will be easiest if your application language is java or java-based (like clojure, scala or groovy).

For more detailed information on stuff like kettle plugins and using the kettle api to directly integrate with applications, take a look at our book, "Pentaho Kettle Solutions":

hi, I want to integrate pentaho with my web application, which is in Struts 2.0. I don't know how to integrate it; can you please tell me the step-by-step procedure for how to integrate it and how to generate a report in the web application?

Thank you for this nice blog on Pentaho and Kettle. A few years ago (I was a junior developer at the time) I used Spoon, and it was a great help for designing data flows.

Now I have joined a project where we have to deal with very complex SQL statements to extract the required business data. So I was thinking of having a tool which will generate those optimized SQL statements.

Is there any possibility to convert *.ktr transformations into sql format?

Kettle does a little bit of SQL generation, for instance in steps like database lookup, update, etc.

However, it is all very simple SQL.

I played with the idea of turning an entire ktr into a SQL statement too - this would be an interesting case of "database push down" for those databases that can handle that and actually perform better that way. However, I never really worked on the idea, and I don't think anyone else has a solution like that either.

Could you perhaps give an example of the kind of statements you'd like to generate? It would be interesting to me to see if I could whip up something that does what you want.

I just sent you by mail a query existing in production. We deal here only with selects and inserts. Usually for external tables we create aliases via DB links. As you can see, theoretically we are supposed to use only DB readers and DB updaters, while for transformation we can use whatever is present in the toolbox.

I am using Kettle for performing ETL on some very large tables in the DB (20-30M rows). I have multiple Join Rows steps in the transformation. The processing speed seems to be extremely slow (15 r/s). I have tried merge join (left outer and inner); they do not achieve what I am trying to do with the data. So I was wondering if there are some configuration settings that might improve processing speed, or is there another way to do this?

this is bad news. The "Join Rows" step will perform a cartesian product. If you specify a filter, then I believe that filter is applied after the product.

"I have tried merge join (left outer and inner). They do not achieve what am trying to do with the data."

Note that the merge join requires data from both streams to be ordered by the join key.

Now, concerning solutions:

You say you have very large database tables and you want to join them in the transformation.

There are 2 steps you can use:

1) The Database join step. This takes an input stream from the transformation, and for each row in the input stream, it fetches the matching rows from the database, based on some join key you configured. This is what you need if there is a one-to-many relationship between the input stream and the joined table respectively.

2) The Database lookup step. This takes an input stream, and for each row in the input stream it looks up exactly one row in the database. You can use this if there is a one-to-many relationship between the database lookup table and the input stream respectively. If the lookup table is relatively small, you can gain a lot of performance by configuring this step to cache the looked-up rows.

In both cases, the input stream can be taken from a table input step.

Note that in both cases, the database still does the work of matching the rows from the joined table to the join key; the difference with an in-database join is that kettle will perform a query on the database for each row from the input stream. This is useful in case an in-database join would lead to a very large intermediary result, choking the database; in such cases doing many queries sequentially has the effect of throttling the join, keeping the system as a whole more responsive (although the total workload is likely larger)

In some cases, it might still be best to let the database do the joining entirely. In that case, simply use a table input step and enter the join query.

You should consider each of these methods and pick the one that best suits your needs.
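To see why the caching option on the Database lookup step matters, here is a minimal memoized lookup in plain Python. query_db is just a stand-in for the per-row database round trip, not a Kettle API:

```python
query_count = {"n": 0}

def query_db(key):
    """Stand-in for one round trip to the database lookup table."""
    query_count["n"] += 1
    return {"A": "alpha", "B": "beta"}.get(key)

cache = {}

def cached_lookup(key):
    # On a cache hit there is no database round trip - which is exactly
    # the saving you get by enabling the cache on a small lookup table.
    if key not in cache:
        cache[key] = query_db(key)
    return cache[key]

# Five input rows, but only two distinct keys: only two queries are issued.
results = [cached_lookup(k) for k in ["A", "B", "A", "A", "B"]]
```

With millions of input rows and a small dimension table, this is the difference between millions of queries and a handful.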

1) Create a text file input step. Set the limit to 1.
2) Lead the outgoing stream of the text file input step into a filter step. Configure a dummy "always true" condition. Lead the "true" output to a dummy step. Lead the "false" output to a text file output step.

If you already have a transformation that uses the text file input for processing, you could technically use that, however, the filter step will cost some processing power. So if you only need to create the output file in advance, I would isolate the steps I described in a separate job which you can run whenever you need that empty file.

I am getting a file over FTP. My job fails when there is no file to get from the FTP. On the failure condition, I want to use a wait step and then run the same job (the get-file job) again. My problem comes into the picture when my job fails again after the failure condition. So is there any way to re-execute the job until it succeeds?

in a job, you can create multiple hops going out of a job entry. For each hop you can use the right click menu to mark if the hop is to be followed unconditionally, upon success or upon failure. You can then lead the failure hop to an earlier point in the stream and thus create a loop.

However, what you don't want to have is this loop running around as fast as possible and consuming CPU. You should build in some sleep or something to allow other processes to run. I'm not sure how to do that.
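The loop-with-a-sleep described above boils down to logic like this. fetch_file is a stand-in for the FTP get-file job, and the attempt limit is my own addition so the loop cannot spin forever:

```python
import time

def run_with_retry(job, max_attempts=5, delay_seconds=1.0):
    """Re-run a job until it succeeds, sleeping between failed attempts."""
    for attempt in range(1, max_attempts + 1):
        if job():
            return attempt
        time.sleep(delay_seconds)  # the sleep keeps the loop from hogging the CPU
    raise RuntimeError("job still failing after %d attempts" % max_attempts)

# Simulated FTP job: the file only "appears" on the third attempt.
calls = {"n": 0}

def fetch_file():
    calls["n"] += 1
    return calls["n"] >= 3

attempts_needed = run_with_retry(fetch_file, max_attempts=5, delay_seconds=0)
```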

Hi Roland, this blog is very nice. I'm new to Pentaho Spoon, but I already tried some examples, which are working fine. I know how to retrieve data from a file and store it in a DB. But what is the logic to use data from a db or file and print it on an image, for example a map? I don't want to use Google Maps with Pentaho, even if it's a good API; for security reasons I prefer to use an image. Thank you for your help. Regards

Hello Roland, your blog is very nice and I learn a lot. I'm new to Pentaho and hope it's not too late for me. I know how to retrieve data from a file and store it into a DB, but what is the logic to print data on an image such as a map? I don't want to use the Google Maps API. Is there a plugin for this type of work?

I think you should checkout the community dashboard framework and community dashboard editor. These are Pentaho bi server (get downloads here http://community.pentaho.com/) plugins which allow you to create and run dashboards. You can install these from the pentaho marketplace (this is a sort of plugin manager for the bi server)

One of the cdf components is based on open street maps. This allows you to plot data on maps.

If that doesn't suit your needs, you can always write your own cdf component. Or maybe you might be able to repurpose an existing component like a scatter plot and set an image as backdrop.

Please go to forums.pentaho.org or ##pentaho freenode irc channel for detailed community support.

This blog is very nice. Well, I am new to the Pentaho DI world, and I need to know how to "fill" an OLAP cube and publish it on the Pentaho BI Server, but I don't know how to do it using Kettle (transformations/jobs).

I have searched the net, and I found some solutions using Mondrian to create the Mondrian cube schema file, which contains the definition of the OLAP cubes, and the Mondrian Schema Workbench to create dimensions, measures and hierarchies, and then publish it on the Pentaho BI Server. But now, the 'mondrian.war' does not exist anymore!

Also, I found some articles talking about loading dimension tables and then joining all of them in the fact table (using the "Merge Join" step). So is that the right solution to create the OLAP cube?

1. Should I load the table from the database (MySQL) as "input", then use the "Dimension lookup / update" step in order to "convert it" into a dimension, and then assemble all of them in the fact table using the "Merge Join" step?

2. How can I generate/load the "OLAP cube" in the end? Will it be an XML file too, or what? In other words, what will be the "output"?

3. Finally, how can I integrate/publish it on the Pentaho BI Server?

A cube is a logical, not a physical, concept. In the OLAP world, a cube is basically a set of metrics (some people say measures or facts) that can be navigated, filtered and aggregated along a set of dimensions.

Cubes are the backend as far as any OLAP client applications are concerned. Typically, the client applications use MDX as the query language to let the user select portions of the data from the cube. The client typically displays that to the user in graphs and pivot tables, or combinations thereof in a dashboard.

Pentaho uses Mondrian as its OLAP engine. Mondrian is based on the ROLAP principle: Relational OLAP. ROLAP simply means that the cube is implemented using direct SQL queries on an RDBMS backend. In ROLAP, the physical definition of the cube is just a description of how the tables in the RDBMS map to dimensions and metrics. For mondrian, this is done in a mondrian schema file. Such files are in XML format and can define multiple cubes.

Typically, the cube is not the only level of data organization. One will usually want to organize the data warehouse (the database where you store your OLAP data and keep track of its history) in a way that allows you to easily create cubes. This is usually done by creating a set of star schemas.

A star schema is a single fact table which contains a set of columns which hold the metrics; usually these are counts, amounts etc. Apart from the metrics, the fact table also contains foreign keys (which may or may not be enforced at the RDBMS level) which point to its corresponding dimension tables. And typically, all data for a dimension is stored in a single dimension table. (Sometimes, a logical dimension is built off of several related dimension tables, this is referred to as snowflaking)

Creating and loading the fact and dimension tables, that is where kettle comes in. In many cases people use kettle to extract data from one or more source systems, then to enrich and validate that, and to organize and load that into the fact and dimension tables.

If you download pentaho ba server, mondrian is shipped as part of the pentaho web application. So there is no separate mondrian.war - you only have to deal with that if you set up mondrian manually and independently of pentaho.

If you want more information, there's a couple of books that are still quite ok to learn more about this subject:

I'm a beginner in Pentaho Data Integration [Kettle], and I am doing my final year project. Well, I want to do an ETL process from a conventional relational database model to a dimensional database model (loading dimension/fact tables).

I am just wondering how to go about doing this! I have read around and looked at examples, but I am having difficulties wrapping my head around how this works.

1. I have read that there is an ETL process to "create/fill" a TimeDimension; how come? Should I create a process (transformation) in ETL, then load it into a table in the database, and after that load that table from the database into a transformation (Kettle), or what?

2. Now, after loading the dimension tables and the fact table: how will I associate the date with a row in the database? Because in the database, there is no table created called "TimeTable"! In other words, when we execute a query, we are going to interrogate the database, but there is no relationship between the Product table and the TimeTable (which doesn't even exist), for example.

3. In the example above, concerning the "fact table": how can I add a row to it, in order to calculate the "SalesDollars"?

1. Almost every star schema has at least one dimension table that has at its lowest granularity the calendar day.

Typically it has a hierarchy along year, quarter, month, day, but alternate hierarchies are possible to express fiscal year, semesters or weeks.

Some people call this the "Time dimension" - i prefer to call it "Date dimension".

The Date dimension table is typically used multiple times in an OLAP cube: the same physical dimension table can be used by logical date dimensions like orderdate, shipdate, paymentdate etc. (This is called "role-playing".)

Like all dimension tables, the date dimension table should have an integer column as key, and the fact table would have one corresponding column for each relevant different type of date. The key column can either be filled with a truly generated artificial key, but often it is more convenient to use a format like 20140318 to represent March 18, 2014. This is convenient when you have to load the fact table, since you can then directly convert the order date, shipment date etc. into the corresponding date dimension key. If the key of the date table is a truly artificial one, then you would need to do an explicit lookup to find the date dimension key for an actual date from the source system.

Typically, people fill their date dimension table initially, independently of the daily dwh loading process. The reason for that is that you can: you know in advance what the calendar looks like, and you can generate all the data in advance. Of course people don't fill it with every possible date; they simply make sure they have a starting day that is early enough to capture all business transactions, and they will typically generate data for 10 or 20 years ahead of time.

In the kettle samples directory you will find a number of transformations (.ktr files) to generate a date dimension. You typically create the table yourself, and then have kettle generate and load the data.

Here's an example I wrote some years back: http://rpbouman.blogspot.nl/2007/04/kettle-tip-using-java-locales-for-date.html

Here's another example: http://type-exit.org/adventures-with-open-source-bi/2010/09/building-a-detailed-date-dimension-with-pentaho-kettle/
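The pre-generation idea is easy to express in plain Python. The "smart key" below follows the yyyymmdd convention mentioned above; the column set is just an illustrative minimum, not what the sample transformations actually produce:

```python
from datetime import date, timedelta

def date_dimension(start, end):
    """Yield one row per calendar day, keyed by a smart integer like 20140318."""
    d = start
    while d <= end:
        yield {
            "date_key": d.year * 10000 + d.month * 100 + d.day,
            "date": d.isoformat(),
            "year": d.year,
            "quarter": (d.month - 1) // 3 + 1,
            "month": d.month,
            "day": d.day,
        }
        d += timedelta(days=1)

# Generate the rows once, ahead of time, outside the daily load.
rows = list(date_dimension(date(2014, 3, 17), date(2014, 3, 19)))
```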

2. In a star schema, there are only relationships between the fact table and its dimension tables. Both date and product are dimension tables in your example. You don't need any date inside your product table; you only need for the fact table to have a column that points to the product dimension, and any number of relevant columns to point to the date dimension table (for order date, shipdate, whatever dates you would like to use to navigate and filter your fact table)

3. The typical loading process of a star schema is: 1) make sure all dimension tables are up to date; 2) load the fact table. While loading the fact table, you typically use the column values from the source system to look up the corresponding keys in the dimension tables. As I pointed out, if you use a logical key for the date dimension, you can conveniently translate dates from the source system directly into date keys.

For product, you will need a lookup. There are several ways to do that, but in all cases it relies on comparing values from the source system with columns in the dimension table to retrieve the dimension key. And then that dimension key is going to end up in the fact table. Please take a look at the various lookup steps in kettle, and check the documentation to find out more: http://wiki.pentaho.com/display/EAI/Pentaho+Data+Integration+Steps In particular, take a look at the Dimension lookup/update step and the Database lookup step.

Now for the sales amount: this is the only fact I've seen in your schema so far. (Another typical metric could be sales quantity.) It looks like you have QuantityOrdered in your line item source table, and price in your product source table. There are 2 options: either you can load 1 row for each individual item that was sold on a particular order, or you can load one row for each row in the OrderDetail table. In both cases, you will need to pull in the price from the product table to calculate the sales amount.
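Putting the lookup and the amount calculation together, the fact-loading logic can be sketched like this. The in-memory dict stands in for the product dimension table, and all names, codes and prices are made-up examples:

```python
# In-memory stand-in for the product dimension: natural key -> surrogate key + price.
product_dim = {"WIDGET": {"key": 1, "price": 9.99},
               "GADGET": {"key": 2, "price": 4.50}}

def load_fact_row(source_row):
    """Translate one OrderDetail-like source row into a fact table row:
    a lookup for the product key, a direct conversion for the date key."""
    # Smart date key: no lookup needed, just reformat the source date.
    date_key = int(source_row["order_date"].replace("-", ""))
    product = product_dim[source_row["product_code"]]
    return {
        "date_key": date_key,
        "product_key": product["key"],
        # Price pulled in from the product dimension to compute the metric.
        "sales_dollars": source_row["quantity"] * product["price"],
    }

fact = load_fact_row({"order_date": "2014-03-18",
                      "product_code": "WIDGET",
                      "quantity": 3})
```

In Kettle the dict lookup would be a Database lookup (or Dimension lookup/update) step, and the date conversion a simple calculator or select-values operation.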

A few final notes. I noticed you have a store dimension; however, the source system does not seem to provide that kind of data. I also noticed that your star schema does not have a customer dimension. That is odd, since that is typically one of the most important ones. When designing a star schema, there is a very simple checklist to consider to see if you got all your dimensions: "what, who, where, when, why". In your example, "what" is represented by the product dimension, "where" is represented by the store dimension, and "when" is represented by the date dimension. The "why" is not always needed, but a "marketing campaign" dimension would be a good example of a "why". What should typically not be missing is one or more "who" dimensions, in particular the customer. Another type of "who" is the sales person or order picker.

I'm a beginner in Pentaho PDI; I downloaded Pentaho Spoon 5.0.1 and it's working fine. I tried the different sample files shipped with the software. So my question is: is it possible to extract data from multiple .txt files and store it in a database or in another file such as xls? I tried the Text file input step, but I have to mention the file name; I don't want to mention it because I have too many files in a folder.

Hi Roland, thank you very much for your help. After some tests, I can now read data from all txt files in a directory. I just add the folder path and .*\.txt as the wildcard; the system accepts it and shows all txt files, and in the Fields tab, if I click Preview rows, it shows the data from all files. That's good for me at this step. Now can I calculate the average of all the data? What step should I use? Thank you

About Me

I'm Roland Bouman (@rolandbouman on twitter). I'm a software (web) application developer and I work on both the front end as well as the back end. I do data modeling, database design, ETL, Analytics, and Business Intelligence.