It was an extremely interesting project to participate in, full of new experiences. Although the act of writing was time-consuming and at times very trying for me as well as my family, it was completely worth it. I have nothing but happy memories of the collaboration with my co-author Jos van Dongen, our technical editors Jens Bleuel, Jeroen Kuiper, Tom Barber and Thomas Morgner, several of the Pentaho developers, and last but not least, the team at Wiley, in particular Robert Elliot and Sara Shlaer.

When the book was finally published, in late August 2009, I was very proud - as a matter of fact, I still am :) Both Jos and I have received a lot of positive feedback, and so far, book sales are meeting the publisher's expectations. We've had mostly positive reviews in places like Amazon and elsewhere on the web. I'd like to use this opportunity to thank everybody who took the time to review the book: thank you all - it is very rewarding to get this kind of feedback, and I appreciate it enormously that you all took the time to spread the word. Beer is on me next time we meet :)

Announcing "Pentaho Kettle Solutions"

In the autumn of 2009, just a month after "Pentaho Solutions" was published, Wiley contacted Jos and me to find out if we were interested in writing a more specialized book on ETL and data integration using Pentaho. I felt honoured, and took the fact that Wiley, an experienced and renowned publisher in the field of data warehousing and business intelligence, voiced interest in another Pentaho book by Jos and me as a token of confidence and encouragement that I value greatly. (For Pentaho Solutions, we had heard that Wiley was interested, but it was we who contacted them.) At the same time, I admit I had my share of doubts, with the memories of what it took to write Pentaho Solutions still fresh in my mind.

As it happens, Jos and I both attended the 2009 Pentaho Community Meeting, and there we seized the opportunity to talk to Matt Casters, Pentaho's Chief of Data Integration and founding developer of Kettle (a.k.a. Pentaho Data Integration). Neither Jos nor I expected Matt to be able to free up any time in his ever-busy schedule to help us write the new book. Needless to say, he made us both very happy when he rather liked the idea and expressed immediate interest in becoming a full co-author!

Our working copy of the outline is quite detailed, but it may still change, which is why I won't publish it here until we have finished our first draft of the book. I am 99% confident that the top level of the outline is stable, though, and I have no reservations about releasing that already:

Part I: Getting Started

ETL Primer

Kettle Concepts

Installation and Configuration

Sample ETL Solution

Part II: ETL Subsystems

Overview of the 34 Subsystems of ETL

Data Extraction

Cleansing and Conforming

Handling Dimension Tables

Fact Tables

Loading OLAP Cubes

Part III: Management and Deployment

Testing and Debugging

Scheduling and Monitoring

Versioning and Migration

Lineage and Auditing

Securing your Environment

Documenting

Part IV: Performance and Scalability

Performance Tuning

Parallelization and Partitioning

Dynamic Clustering in the Cloud

Real-time and Streaming Data

Part V: Integrating and Extending Kettle

Pentaho BI Integration

Third-party Kettle Integration

Extending Kettle

Part VI: Advanced Topics

Webservices and Web APIs

Complex File Handling

Data Vault Management

Working with ERP Systems

Feel free to ask me any questions about this new book. If you're interested, stay tuned - I will probably be posting 2 or 3 updates as we go.

Looks awesome (as the previous comment stated). Spend time on the Webservices and Web APIs section. More and more external data, like geo-location and government statistics, is enhancing internal corporate data. The key to unlocking this potential, as you know, is proper data integration. Show us how to do that properly!

Also... another plug... Realtime and Streaming are hot! Beyond performance considerations, show us how to use Kettle to sift for the nuggets in a HUGE data stream.

Richard, that chapter is actually already done. It shows you how to create, configure and monitor never-ending streaming real-time transformations. I left looking for the nuggets as an exercise for the reader. :-)

Sometimes I find it hard to believe myself, but in about a month and a half we'll be done writing, and then it's off to process the technical reviews.

If all goes well, the book will be published according to schedule, somewhere in September 2010.

Anyway - thanks a lot for your support - it is really good to hear you're looking forward to it. I hope we'll manage to deliver a book that meets or even exceeds your expectations, but frankly, with Matt being part of the author team, I think we should be able to go a long way.

Thanks for your interest! Perhaps I should mention that "Pentaho Kettle Solutions" will be geared primarily at experienced ETL developers who want to learn how to use Kettle.

You might be interested in getting a copy of "Pentaho Solutions" first. That book is more general and geared more towards beginning BI developers. Nevertheless, it has pretty deep coverage of ETL, with three chapters (about 100 pages) devoted entirely to ETL with Kettle.

Hello Roland,

Great to hear you guys are writing a new book on Kettle. Looking forward to it! All I have to ask of you is one small favor: PLEASE hire new proofreaders! The first book is full of typos, contradictions, and missing and misleading information. I don't mean to be harsh or evil, I just wanted to point that out.

Cheers!
Renato

"Pentaho Solutions" was reviewed by a team of three different technical reviewers, edited by a professional copy editor from Wiley, and then proofread again by us. We did spot a few errors (mostly typos) after publication, but I would expect that in any first edition of a book this size, and frankly my impression is that everybody did a pretty good job on that.

Even though you're the first to point out that the book is "full of typos, contradictions, missing and misleading information", I do take this very seriously. Would you be able to let us know exactly which issues you encountered, and on which pages they occur? Please send it to me, Jos, or Wiley.

You might have earned yourself a free copy of Kettle Solutions with this information. :)

Hopefully v4 will fix the rather involved process needed to iterate through lists of files without creating a job that calls an xForm for filenames, then another job, etc. The forums don't get this right, in any case; hopefully the "Pentaho 3.2 Data Integration: Beginner's Guide" will get it right...

Hi Anonymous, I'm not sure if the last two messages are by the same person, but here goes:

@Anonymous #1: What do you mean exactly? Input steps that handle files all support regular expressions for specifying files/directories. If the pattern happens to match multiple files, they are all processed, one by one. So no need to build a Job to pick them up one by one.

@Anonymous #2: Yeah, PostgreSQL is a great product. But "Pentaho Kettle Solutions" is a database-agnostic book. We have a few samples based on MySQL, because it happens to be more widespread. But the samples will run on any database with a JDBC driver.

As far as @Anonymous #1 is concerned: I do realize there is regex support for filenames, but I have a different requirement. Let's say that at certain time intervals I need to scan a directory for files. More files can come in randomly. So I need to capture that set of files and do what I need to do (transform and move to archive) just on those files. If more files come in, then the next scan will get them. Using a regex in transformations will not work when more qualifying files come in before the transformation is complete, as they get inadvertently moved.

@Anonymous #1: OK, I see. But still, you don't need a complex job like you described to do that. Thinking quickly about this problem, I can see at least two approaches that seem adequate to me:

1) Use a regular expression to have the text file input step read all files in the directory. Be sure to add the file name to the stream (use the "Include filename in output" flag on the content tab of the file input step). After the file processing pipeline, use something (a step like Group By, or Analytic Query, or even JavaScript) to identify the last row coming out of each file. Then use a "Process files" step (in the Utility folder) to move the processed file out of the directory.

Of course, this changes the situation somewhat, as you now have two directories: one for the input files and one for the processed files. But if you think about it, that isn't such a bad situation: keeping only a single input directory is going to lead to a problem at some point, as the directory will just keep filling up with files.
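Outside of Kettle, the same process-then-archive pattern can be sketched in a few lines of Python. This is a minimal illustration of the idea described above, not the Kettle implementation itself; the `process` callback stands in for the real per-row transformation logic, and all names here are my own for the sake of the example.

```python
import glob
import shutil
from pathlib import Path

def process_and_archive(input_dir, archive_dir, pattern="*.txt", process=print):
    """Process every file matching the pattern, then move it to the archive.

    Mirrors the two-directory idea above: the input directory only ever
    holds unprocessed files, because each file is moved out ("Process
    files" step) as soon as its rows have gone through the pipeline.
    """
    Path(archive_dir).mkdir(parents=True, exist_ok=True)
    moved = []
    for name in sorted(glob.glob(str(Path(input_dir) / pattern))):
        with open(name) as f:
            for line in f:
                process(line.rstrip("\n"))   # stand-in for the row pipeline
        target = Path(archive_dir) / Path(name).name
        shutil.move(name, target)            # archive the fully processed file
        moved.append(target.name)
    return moved
```

Files that arrive after the `glob` snapshot are simply left in place and picked up by the next scan, which is exactly the behaviour the commenter asked for.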

2) Suppose solution #1 is not what you want, and you really want to hang on to your single directory that receives all files. Then you can use a "Get filenames" input step and compare the contents of the directory against the filenames you stored in a control file, using a "Merge diff" step. You can use a Switch/Case step to take the appropriate path according to the value of the diff flag field: if the file from the directory is identical to what you found in the control file, the file was already processed and you do nothing. If the file from the directory does not occur in your control file, it is new and must be processed. Then the only thing left to do is append the newly processed file name to the control file.
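The control-file comparison in solution #2 can also be sketched in Python. Again, this is an illustrative sketch under my own assumptions (one filename per line in the control file, no spaces in names), not the Kettle "Merge diff" step itself.

```python
from pathlib import Path

def find_new_files(input_dir, control_file, pattern="*"):
    """Return files in the directory not yet listed in the control file.

    Sketch of the "Get filenames" + "Merge diff" idea: names already in
    the control file are skipped (identical -> do nothing); new names are
    returned for processing and appended, so the next scan ignores them.
    """
    control = Path(control_file)
    seen = set(control.read_text().split()) if control.exists() else set()
    new_files = [p for p in sorted(Path(input_dir).glob(pattern))
                 if p.name not in seen]
    with control.open("a") as f:           # append newly seen names
        for p in new_files:
            f.write(p.name + "\n")
    return new_files
```

Each call behaves like one scheduled scan: the first call sees everything, later calls see only what arrived in between.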

I hope that in the new book you do a complete section on file watching/moving/copying, since that is a topic that is not ***CLEAR*** from the docs/forums. Worse than unclear, the solutions often do not work.

Got it working but I was getting confused since those other "Error" and "Warning" folders were not getting filled as expected.

What I did was filter on an error count of zero, send those rows to a success process, and then send the others to an error file. I do wish I could capture the malformed input row verbatim and write it out along with the error info, thereby splitting the file.

I cannot pass the bad data out, since it causes the same format errors when it gets written back out to the same fields. Maybe we should be able to get at the row buffer somehow?
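The row-splitting the commenter describes can be sketched generically: keep the raw, unparsed row alongside the error, instead of re-emitting typed fields that would fail again on output. This is a hand-rolled illustration of the idea, not Kettle's own error-handling mechanism; `parse` stands in for whatever field conversion the transformation performs.

```python
def split_good_bad(lines, parse):
    """Route rows that fail to parse into an error list with the raw text.

    Good rows come out parsed; bad rows come out as (verbatim row,
    error message) pairs, so the malformed input can be written to an
    error file exactly as it arrived, together with the error info.
    """
    good, bad = [], []
    for raw in lines:
        try:
            good.append(parse(raw))
        except Exception as exc:
            bad.append((raw, str(exc)))   # verbatim row + error info
    return good, bad
```

The key point is capturing the row before any field conversion happens, which sidesteps the "same format errors on the way back out" problem.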

Will there be anything concerning the use of "custom data sources", like Hibernate, within Kettle or the BI platform? I know I read somewhere that this is possible, but I cannot find it in the forums. I'd like to use a special data source that has security built in...

I am glad this one is coming out. What is even more gladdening is that it is gonna be complementary to the Pentaho 3.2 Data Integration Beginner's Guide from Packt Publishing.

When I saw the title on Amazon, I first thought it was gonna be a "repetition" of the one from Packt. But wow, it is great to see an outline so packed full of wonderful topics.

I am practically two months old in using Pentaho and so in love with it. My entire goal is to perfect what has been done, and then I will quickly want to become an active contributor to the community. I am working on trying out some deployments for some small-sized firms.

There is no end to possibility, I can see. This book has an entry in my budget.

About Me

I'm @rolandbouman, a web and BI developer and information analyst. I have worked for MySQL AB and Sun Microsystems, and I'm currently working as a software engineer for Pentaho (a Hitachi Data Systems company).

Together with Jos van Dongen I wrote a book called "Pentaho Solutions" (Wiley, ISBN: 978-0-470-48432-6, 630+ pages). This book is intended for people who want to get started with Business Intelligence, and it provides lots of practical examples for working with the open source Pentaho Business Intelligence Suite.

Together with Matt Casters and Jos van Dongen, I authored another book for Wiley called "Pentaho Kettle Solutions" (750+ pages, Wiley, ISBN: 978-0-470-63517-9). This book is more specialized and focuses on Pentaho data integration (Kettle) and ETL.