more ramblings from a computer nerd

Filling the gaps in open source ETL

For the last 5 years, I have spent a great deal of time developing data marts and general data integration solutions with Pentaho’s PDI/Kettle (http://kettle.pentaho.com/). I’ve used both the community (OSS) edition and the commercial edition. That work has included a lot of pulling data out of poorly designed legacy systems, which is where things like legstar-pdi (https://code.google.com/p/legstar-pdi/), coupled with MVSDS, come in handy. As you build more and more data integration solutions, the lack of an effective metadata catalog becomes a real problem. One of the biggest pieces of that problem is understanding the lineage of information: where a given value came from and what depends on it. Lineage is invaluable both when questions arise about the correctness of a piece of data and when you need to do impact analysis before changing something. Kettle has some primitive capabilities in this space, but they are quite underwhelming, in my opinion.
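
To make the lineage idea a bit more concrete, here is a minimal sketch of what I mean, written in plain Python. It treats the catalog as a directed graph of "this field feeds that field" edges. To be clear, this is not how Kettle or any other tool does it; the class name, the method names, and the field names are all hypothetical, made up for illustration.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy metadata catalog: a directed graph of 'source feeds target' edges.

    Hypothetical illustration only; not part of Kettle or any real tool.
    """

    def __init__(self):
        self._downstream = defaultdict(set)  # source -> fields it feeds
        self._upstream = defaultdict(set)    # target -> fields it is built from

    def add_flow(self, source, target):
        """Record that `source` is used to populate `target`."""
        self._downstream[source].add(target)
        self._upstream[target].add(source)

    @staticmethod
    def _walk(start, edges):
        """Breadth-first walk returning every field reachable from `start`."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def lineage(self, field):
        """Everything `field` was derived from (the 'is this correct?' question)."""
        return self._walk(field, self._upstream)

    def impact(self, field):
        """Everything that depends on `field` (the impact analysis question)."""
        return self._walk(field, self._downstream)


# Made-up field names for a legacy -> staging -> mart flow.
catalog = LineageGraph()
catalog.add_flow("legacy.CUSTMAST.CUST_NM", "staging.customer.name")
catalog.add_flow("staging.customer.name", "mart.dim_customer.customer_name")
catalog.add_flow("staging.customer.region", "mart.dim_customer.region")

print(sorted(catalog.lineage("mart.dim_customer.customer_name")))
print(sorted(catalog.impact("legacy.CUSTMAST.CUST_NM")))
```

The first call answers "where did this number come from?" and the second answers "what breaks if I change this source field?". The hard part in practice is not the graph, it is getting the edges out of the transformations automatically, which this sketch does not attempt.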

I have never found an effective OSS solution for metadata cataloging and lineage, and from time to time I have bounced around the idea of implementing something myself. However, every now and then I see a glimmer of hope when I uncover tools such as ++Spicy (http://www.db.unibas.it/projects/spicy/). Does anyone else have any nuggets they’d like to share? If so, share them with me on Twitter @zpratt.