Category Archives: Pentaho

What I am going to showcase in this tutorial is how to load web stats from Google Analytics into a fact table with Penthao Kettle/PDI. And then how to represent that fact table with Mondrian 3.6 schema so we can visualize the data with Saiku on Pentaho BI Server. In the end I’ll give my two cents on Saiku Analytics and possible options involving d3.js and Roland Bouman‘s xmla4js.

In case you are new to this I recommend reading my articles on the following topics involved here:

In a traditional star schema the dimensions are located within specialized tables which are referred to by numeric keys from the fact table. A dimension can represent anything from the gender (“male”, “female”, “intersex”) over a hierarchy representing a location (“Germany”, “RLP“, “Mainz“) to an individual user’s profile (name, address, date of birth, …). Now thanks to Mr. Kimball we know there are different types of what he refers to as Slow Changing Dimensions (SCD – “slow” because they are expected to change only infrequently):

Pentaho’s BI Server or BA platform allows you to access business data in the form of dashboards, reports or OLAP cubes via a convient web interface. Additionally it provides an interface to administer your BI setup and schedule processes. The aim of this tutorial is to illustrate how to get started with the BI Server and for that purpose I am going to use a small and artifical data set – as otherwise I would have to get deeper into further technologies – so I can keep this text lean. I am going to elaborate on Mondrian schemas, data warehouse design, MDX and further related concepts in separate articles. When you are through with this text and still hungry, make sure you check out “Mondrian Schema for OLAP Cube Definition ft. Google Analytics and Saiku” for a more advanced use case.

First – let me whet your appetite by showing you what a pretty pretty report you will be able to compose after you finished this tutorial – and there you go (click on the picture to bask in the report’s full glory!):

Two weeks ago I wrote about Kettle 4.4 and its database repository and how working with it is truly no fun due to excessive latency connected with loading and saving of jobs and transformations. Logging the queries sent to a MySQL database made it obvious that the reason is a tornado of thousands of database commands and replies sent back and forth.

Did Kettle 5 learn something new?

We are using Pentaho Kettle 4.4 for realizing ETL processes and to make the involved jobs centrally accessible those are stored in a database repository. Given that several proxies separate my VPN access to the database holding the repo a latency was to be expected and in the beginning not too bothersome. But proportional to the complexity of the processes grew the time involved in loading and storing of jobs and transformations until a point was reached where this was no longer acceptable (up to a minute for storing an edited job f.x.) and a solution to this had to be found.

In the past few months I have been using data sets provided by Eurostat a lot and so I crafted a Kettle job that loads SDMX files (an XML document keeping the data), sets up custom-tailored tables in a Postgres database and fills them. Because those SDMX files often weigh a couple of hundred MB it wouldn’t be good idea to deal with the XML by representing the full structure in memory and reading it using XPath. A more efficient solution is to stream the document (using the “XML Input Stream (StAX)” step) and deal with it in a piece-meal fashion.

For executing an HTTPS request with a Java software like Kettle / PDI you need to provide a certificate for the domain the request is going to address. A possible error message for a request to https://www.example.org using the “HTTP client” step from the “Lookup” section might look like this:

In my last post I was describing how to calculate “returning visitors” in a customizable way depending on how you want to define “returning”. At work as well as for personal projects I use for ETL processes Pentaho Data Integration (PDI) aka Kettle. PDI provides a step for fetching data from Google Analytics and I am going to describe in this post how to use this feature on the basis of the job I clicked together for the article on “returning visitors”. I will focus on the steps and aspects relevant to the subject.

Currently I am working on a project with the goal to download all available data sets on Eurostat as SDMX documents and load the data into a local database, so I can investigate it more conveniently. For this ETL process I use Pentaho Kettle aka PDI (Pentaho Data Integration). And one of quite a few small challenges was to download a list of files via HTTP. I must admit I was and still am surprised that this cannot be accomplished easier with Kettle – but what the heck, it’s still a great tool – and first and foremost it IS possible. So in this article I am going to describe the most straightforward way to implement a batch download. I assume basic knowledge on how to use Kettle.

The MAIN job

First of all there is no transformation step to download a file via HTTP, but a job entry for that purpose, which is why we have to use a job (DOWNLOAD.kjb) within a job (MAIN.kjb). The file list will most likely have to provided by steps executed within a transformation, which extracts this information from a file. In my above mentioned use case, the list of to be downloaded files is provied within an XML document which is parsed and which offers a great opportunity for another article.