Codebits is an annual 3-day conference about software and, well, code. It's organized by SAPO, and this year's edition will be held November 10 through 12 at the Pavilhão Atlântico, Sala Tejo in Lisbon, Portugal.

I've never attended SAPO Codebits before, but I've heard good things about it from Datacharmer Giuseppe Maxia. The interesting thing about the way this conference is organized is that all proposals are made available to the public, who can also vote on them. This year's submissions already look very interesting, with high-quality proposals from …

Most of you usually use a data integration engine to process data in a batch-oriented way. Pentaho Data Integration (Kettle) is typically deployed to run monthly, nightly, or hourly workloads; sometimes folks run micro-batches of work every minute or so. However, it's less well known that our beloved transformation engine can also be used to stream data indefinitely (never ending) from a source to a target. This sort of data integration is sometimes referred to as “streaming”, “real-time”, “near real-time”, “continuous”, and so on. Typical examples of situations where you have a never-ending supply of data that needs to be processed the instant it becomes available are JMS (Java Message Service), RDBMS log sniffing, on-line fraud analysis, web or application …
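To make that concrete, here is a minimal stand-alone Java sketch of the never-ending pattern, assuming an ActiveMQ broker at a placeholder URL and a hypothetical queue name. Inside Kettle the same loop would live in a step that hands each message downstream as a row:

    import javax.jms.*;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class NeverEndingConsumer {
      public static void main(String[] args) throws JMSException {
        // Hypothetical broker URL and queue name, for illustration only.
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageConsumer consumer = session.createConsumer(session.createQueue("incoming.events"));

        // receive() blocks until the next message arrives, so this loop keeps
        // producing rows for as long as the source keeps sending them.
        while (true) {
          Message message = consumer.receive();
          if (message instanceof TextMessage) {
            String payload = ((TextMessage) message).getText();
            // In a transformation this is where the row would be handed
            // to the next step; here we simply print it.
            System.out.println(payload);
          }
        }
      }
    }

The transformation never signals “output done”: it simply keeps passing rows along until you stop it.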

On occasion we need to support environments where not only a lot of data needs to be processed, but also in frequent batches. For example, a new data file with hundreds of thousands of rows arrives in a folder every few seconds.

In this setting we want to use clustering to put “commodity” computing resources to work in parallel. In this blog post I'll detail what the general architecture looks like and how to tune memory usage in this environment.

Kettle clustering was first created around the end of 2006. Back then it looked like this.

The master

This is the most important part of our cluster. It takes care of administering the network configuration and topology. It also keeps track of the state of dynamically added slave servers.
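In practice such a master is just a Carte instance started from a small XML configuration file. A minimal sketch, with placeholder server name, host name and port:

    <slave_config>
      <!-- The master server: dynamic slaves register themselves here at startup. -->
      <slaveserver>
        <name>master1</name>
        <hostname>localhost</hostname>
        <port>8080</port>
        <master>Y</master>
      </slaveserver>
    </slave_config>

You would start it with something like “sh carte.sh carte-config-master.xml”, and each dynamic slave's own configuration file then points at this master so it can report in.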

A couple of years ago I wrote a post about key/value tables and how they can ruin the day of any honest person who wants to create BI solutions. The obvious advice I gave back then was not to use those tables in the first place if you're serious about a BI solution. And if you have to, do some denormalization.

However, there are occasions where you need to query such a source system and get some report going on those tables. Let's take a look at an example:
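Suppose (hypothetically) the source stores customer data in a key/value table customer_attribute(customer_id, attr_key, attr_value). The classic workaround is to flatten it with conditional aggregation:

    SELECT customer_id,
           MAX(CASE WHEN attr_key = 'first_name' THEN attr_value END) AS first_name,
           MAX(CASE WHEN attr_key = 'last_name'  THEN attr_value END) AS last_name,
           MAX(CASE WHEN attr_key = 'city'       THEN attr_value END) AS city
      FROM customer_attribute
     GROUP BY customer_id;

Every additional attribute costs another CASE expression (or another self-join), which is exactly why these tables can ruin your day.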

Some time ago while I visited the nice folks from Human
Inference in Arnhem, I ran into Kasper Sørensen, the lead
developer of DataCleaner.

DataCleaner is an open source data quality tool released (like Kettle) under the LGPL license. It is essentially to blame for the lack of a profiling tool inside Kettle: having DataCleaner available to our users was enough to push the priority of building our own data profiling tool far down the list.

In the past, Kasper worked on DataCleaner pretty much in his spare time. Now that Human Inference has taken over the project, I was expecting more frequent updates, and that's what we …

Now that we’re blogging again I thought I might as well continue
to do so.

Today we’re reading data from MongoDB with Pentaho Data Integration.
We haven't had a lot of requests for MongoDB support, so there is no step to read from it yet. However, it is surprisingly simple to do with the “User Defined Java Class” step.

For the following sample to work you need to be on a recent
4.2.0-M1 build. Get it from here.

Then download mongo-2.4.jar and put it in the libext/ folder of
your PDI/Kettle distribution.

Then you can read from a collection with the following “User
Defined Java Class” code:
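The listing below is a minimal sketch of such a class, assuming a local MongoDB instance and hypothetical database, collection and field names; a matching String field (here “name”) must be declared in the step's Fields tab:

    import com.mongodb.*;
    import org.pentaho.di.core.row.RowDataUtil;

    public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
    {
      try {
        // Hypothetical host, database and collection names.
        Mongo mongo = new Mongo("localhost", 27017);
        DB db = mongo.getDB("test");
        DBCollection collection = db.getCollection("people");

        // Send every document in the collection downstream as a row.
        DBCursor cursor = collection.find();
        while (cursor.hasNext()) {
          DBObject document = cursor.next();
          Object[] row = RowDataUtil.allocateRowData(data.outputRowMeta.size());
          row[0] = document.get("name");
          putRow(data.outputRowMeta, row);
        }
        mongo.close();
      } catch(Exception e) {
        throw new KettleException("Error reading from MongoDB", e);
      }

      setOutputDone();
      return false;
    }

Since this step generates rows rather than transforming them, it never calls getRow(); it simply signals that it is done and returns false once the cursor is exhausted.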

Last year, right after the summer, we introduced the notion of dynamically inserted ETL metadata in version 4.1 of Pentaho Data Integration (YouTube video here). Since then we have received a lot of positive feedback on this functionality, which encouraged me to extend it to a few more steps. With support for “CSV Input” and “Select Values” we could already do a lot of dynamic things. However, we can clearly do a lot better by extending our initiative to a few more steps: “Microsoft Excel Input” (which can also read ODS, by the way), “Row Normalizer” and “Row De-normalizer”.

Below I'll describe an actual (obfuscated) example that you will probably recognize, as it is as hideous as it is simple in its horrible complexity.

It has been a while since I posted on my blog - in fact, I
believe this is the first time ever that more than one month
passed between posts since I started blogging. There are a couple
of reasons for the lag:

Matt Casters, Jos van Dongen and I have spent a lot of time finalizing our forthcoming book, Pentaho Kettle Solutions (Wiley, ISBN: 978-0-470-63517-9). The book is currently being produced and should, according to schedule, be available in early September 2010. If you're interested, you might like to read …

A few weeks ago, when I was stuck in the US after the MySQL User
Conference, a new book was published by Packt Publishing.

That all by itself is not too remarkable. However, this time it's a book about my brainchild Kettle, and that makes this book very special to me. The full title is Pentaho 3.2 Data Integration: Beginner's Guide (Amazon, Packt). The title alone explains the purpose of this book: to give the reader a quick start when it comes to Pentaho Data Integration (Kettle).

I had a great time at the conference and met a lot of nice folks, friends, customers, partners and colleagues. After the conference I was unable to get back home, like so many of you, because of the Eyjafjallajökull volcano in Iceland.

So I ended up flying over to Orlando for a week of brutal PDI 4.0 RC1 hacking with the rest of the l33t super Pentaho development team. However, after 2+ weeks away from home, even a severe storm …
