In my last article I showed an analysis of 617 movie scripts, identifying the most frequently spoken words in those movies and the trends in positive and negative word usage. That analysis was built from several different data sets, which meant I had to do some data cleaning and blending first. Today I’ll show you exactly what I did to clean and prepare the final data set using Pentaho Data Integration, a.k.a. Kettle.

The impact of words is so strong that even when they’re not directed at us, they can change how we feel. Movies stir up a lot of emotions in us spectators, with all the scenery, the action, the story, and the characters. But what about the impact of the words in movies? I did a small analysis and found out some interesting things.

According to data published by Statistics Canada (the government agency responsible for producing official statistics), crime in Canada has been falling for almost a decade now. But how does Montréal fit into the overall crime picture?

Whether you’re using the CSV Input step or the Table Input step, you might have noticed the lazy conversion checkbox and wondered what it means. Or perhaps you have already run into an error caused by lazy conversion being enabled, such as:

There was a data type error: the data type of java.lang.String object [Hello, world!] does not correspond to value meta [String<binary-string>]

Often the first reaction is to simply turn off lazy conversion, but that might (and probably will) hurt the overall performance of your transformation, especially if the input has thousands or hundreds of thousands of rows.
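To give an intuition for why lazy conversion helps performance, here is a minimal, purely illustrative Java sketch (this is not PDI’s actual API, and the class name is made up): the idea is that a field read from the file is kept as raw bytes, and the conversion to a typed value only happens if and when a step actually needs it. Rows that just pass through to an output never pay the parsing cost.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the "lazy conversion" idea: keep the raw
// bytes from the CSV and defer the String conversion until first access.
class LazyField {
    private final byte[] raw;  // bytes exactly as read from the file
    private String parsed;     // cached result of the deferred conversion

    LazyField(byte[] raw) {
        this.raw = raw;
    }

    // Conversion happens only on the first call; subsequent calls reuse
    // the cached value.
    String value() {
        if (parsed == null) {
            parsed = new String(raw, StandardCharsets.UTF_8);
        }
        return parsed;
    }
}

public class LazyConversionDemo {
    public static void main(String[] args) {
        LazyField field =
            new LazyField("Hello, world!".getBytes(StandardCharsets.UTF_8));
        // Only a step that needs the typed value triggers the conversion:
        System.out.println(field.value());
    }
}
```

The error message above comes from the flip side of this trick: a downstream step received a plain `java.lang.String` where it expected the binary-string representation, so the metadata and the data no longer match.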

Sometimes you need to split your data stream into multiple flows, do some kind of manipulation on each, and then merge them back into the single stream you started with. Today I’ll be helping you consolidate the replicated columns that are created after joining those flows back together in Pentaho Data Integration, a.k.a. Kettle.
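To make the problem concrete, here is a small Java sketch of the consolidation idea, with made-up field names: after a join, a column that existed in both flows comes back duplicated with a suffix (here `name` and `name_1`), and we keep the first non-null value and drop the replica. In Kettle itself you would typically do this with steps such as Select values rather than hand-written code; this is just an illustration of the logic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: coalesce a column and its "_1" replica produced by
// a join, keeping the first non-null value and removing the duplicate.
public class ConsolidateColumns {

    static Map<String, Object> consolidate(Map<String, Object> row, String base) {
        String replica = base + "_1";  // suffix convention assumed for the demo
        Object kept = row.get(base) != null ? row.get(base) : row.get(replica);
        Map<String, Object> out = new LinkedHashMap<>(row);
        out.put(base, kept);
        out.remove(replica);
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 42);
        row.put("name", null);     // left flow had no value for this row
        row.put("name_1", "Ana");  // right flow did
        System.out.println(consolidate(row, "name"));
    }
}
```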