We are using Pentaho Kettle 4.4 for realizing ETL processes and to make the involved jobs centrally accessible those are stored in a database repository. Given that several proxies separate my VPN access to the database holding the repo a latency was to be expected and in the beginning not too bothersome. But proportional to the complexity of the processes grew the time involved in loading and storing of jobs and transformations until a point was reached where this was no longer acceptable (up to a minute for storing an edited job f.x.) and a solution to this had to be found.

Setting up PowerPivot – that means all the way from downloading it and successfully fetching data – keeps a number of pitfalls in readiness. To spare you the sweat I will give you some guidance on this topic. I assume that you are using Excel 2010 on Windows 7 and that you want to connect to MySQL 5.

Let me introduce this post with an example. You offer a web service and every request to it is logged in a database table – keeping the type of request, the user, a lot of other stuff and first and foremost the (date)time it happened. Now you want to know the number of requests within an hour, chronologically ordered, for the last four weeks – to get an idea about when your customers are using the service. If you encounter this situation for the first time your query might look like this:

Using the ‘week’ as a time unit can be a bit confusing for at least two reasons:

Does the week start on Sunday or Monday?

How do you count a week that stretches two years?

Now probably your and anybody else’s take on those two questions will not just depend on cultural upbringing – but even more on – how it has been done so far. Nobody wants to change old statistics to a new counting or maintain two parallel statistics for one year with both countings. So to make a long story short, I’ll just stick with ISO 8601 standard because it seems reasonable to me. Here is what ISO 8601 has to say about the two questions:

When you are dealing with large amounts of big data sets it is much more efficient to organizes those in database tables instead of CVSs or other files. Just yesterday I set up R for fetching data from a MySQL DBMS loading a table of stock quotes consisting of more than 300’000 rows into a data frame within seconds. That is pretty cool – and if necessary you can join huge tables in no time benefting of the indexing infrastructure of the DBMS of your choice.