The hero of this story is a fictional character; any resemblance to reality is NOT accidental. The idea for this post came over a beer (well, beers) with friends, when we got onto the topic of data munging (data preparation, by its birth name) and how much misery it can be. Laughing, we tried to outbid each other with wilder and wilder data quality monsters, the kind that used to make us tear our hair out on projects.

We reached a point where we could see the Data-Frankenstein materialising before our (slightly clouded) eyes. Here are the monster's most important attributes, so that those unfortunate enough to run into it can recognize it:

Data tables split into multiple pieces, typically by year, but in the worst cases by month

Tables without headers, or with the header repeated every 50 rows

The number of columns in each extract is almost random

Judging by the value sets, the order of the columns is mixed up

The delimiter also appears inside the free-text field, so the content is shifted when the data is imported into the analytical tool

The formats of obvious columns (like date or financial fields) are completely inconsistent

The end of critically important fields (such as ID) is cut off

As the result of a well-meaning join, the records are duplicated

There are characters in the free text field we have never seen before

The apostrophe is missing from one end of the text
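A few of these attributes can be tamed with very little code. The following is only a minimal sketch using Python's standard library, not a general solution: the column names, the `;` delimiter and the sample extracts are all made up for illustration. It stitches two yearly pieces together, drops header rows embedded in the data, and sets aside rows whose column count is off (for instance because of an unquoted delimiter in the free text) instead of guessing how to repair them.

```python
import csv
import io

# Hypothetical header; the extracts themselves may or may not repeat it.
HEADER = ["id", "date", "comment"]

def read_extract(text, header, delimiter=";"):
    """Parse one extract into (good rows as dicts, suspect raw rows)."""
    good, suspect = [], []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not row or row == header:
            # Skip blank lines and header rows embedded in the data.
            continue
        if len(row) == len(header):
            good.append(dict(zip(header, row)))
        else:
            # Wrong column count: keep for manual review, do not guess.
            suspect.append(row)
    return good, suspect

# Made-up sample extracts, split by year as in the list above.
extract_2019 = (
    "id;date;comment\n"
    "1;2019-01-05;ok\n"
    "id;date;comment\n"
    '2;2019-02-10;"late; refunded"\n'
)
extract_2020 = (
    "3;2020-03-01;fine\n"
    "4;2020-03-02;broken; unquoted delimiter\n"
)

good, suspect = [], []
for text in (extract_2019, extract_2020):
    g, s = read_extract(text, HEADER)
    good.extend(g)
    suspect.extend(s)

print(len(good), "usable rows,", len(suspect), "rows to review")
```

The properly quoted comment survives intact, while the row with the unquoted delimiter ends up in `suspect`, where a human can decide what to do with it.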

Here comes the question: what do you do when you come across a freak like this? There are several options:

Run! – It is the first thing that comes to mind, but a brave data fella does not flinch.

Return it to the sender! – Ask for a version that is at least acceptable. If possible, get the database dump file, or get close to the colleague doing the extract; you might spare yourself a few unpleasant tasks.

Fight it! – In the end you will have no choice but to roll up your sleeves and do it. It is not easy, but after a few nerve-racking data cleaning cases you will have scripts and procedures that turn the beast into a harmless kitty. Through this struggle you will get to know the data, and your brain will already be working on the reports and variables to build from the domesticated Frankenstein.
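As an illustration of the kind of small script that accumulates during such fights, here is a hedged sketch that brings mixed date formats to one form and then removes duplicated records. The list of formats, the field names and the sample records are assumptions made for this example; real extracts will need their own list, and genuinely ambiguous dates (is 02/10 February or October?) cannot be resolved by code alone.

```python
from datetime import datetime

# Assumed formats for this example; order decides how ambiguous dates are read.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")

def normalize_date(value):
    """Return the date as ISO text, or None if no known format matches."""
    value = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def deduplicate(records, key_fields):
    """Keep the first record for each combination of key field values."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Made-up records: the first two are the same row in two date formats.
records = [
    {"id": "1", "date": "2019-01-05"},
    {"id": "1", "date": "05.01.2019"},
    {"id": "2", "date": "02/10/2019"},
    {"id": "3", "date": "sometime in March"},
]

for rec in records:
    rec["date"] = normalize_date(rec["date"])

clean = deduplicate(records, ["id", "date"])
print(clean)
```

Normalizing before deduplicating matters here: the two versions of record 1 only become recognizable as duplicates once their dates share a format. Unparseable dates are returned as `None` so they can be reviewed rather than silently dropped.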

Do not overdo it! – It is surprisingly easy to sink into this kind of work, but always keep your eyes on the target and do not spend time fixing fields that are useless from a business point of view.

P.S.: We thought about creating a dataset like this artificially, but realized that it is actually a complex task. After all, the systems that create these data monsters took many years to develop.

Hiflylabs creates business value from data. The core of the team has been working together for 15 years, and the company currently has more than 50 passionate employees.

A short introduction does not make it clear that this attractive-looking job is, for the most part, donkey work because of data preparation.

As one survey published in Forbes shows, 76% of data scientists enjoy these kinds of tasks the least, yet they spend around 80% of their time on them. Interestingly, the ratio was similar 10 years ago, when we wrote about it in our book released back then.

While spending that much time on data cleansing, we have coined our own expressions for it: data digging, massaging, play-doh-ing, plucking… Still, I know only a handful of people who left the profession, got bored or started to hate working with data because of this. We love cooking, so we do the vegetable peeling as well. Besides, the quality of data cleansing matters a lot. The final result often depends on this phase: the withered parts have to be cut out while the delicious bits are processed.

Meanwhile, the technology that supports data analysis is developing at a dizzying pace. Many developments target data cleansing, aiming to make it less time-consuming and give data experts more time for the actual analysis. Venture capital flows to startups that concentrate on Big Data interpretation, but naturally the giants of the data industry are also working on their own solutions.

So can we hope that the split of 80% data cleansing to 20% analysis will become a thing of the past? I doubt it, and I do not expect any significant change in the next few years.

Tools that support data cleansing are going to get better and better. However, this will lead us to involve data sources that we would not even think of using today. Faster road vehicles did not only mean less time spent travelling; they also allowed us to reach more distant destinations.

There is one more area that promises to take over the data cleansing phase: there are initiatives to use artificial intelligence (AI) for data interpretation. Of course, AI is setting foot in more and more fields; for example, within a few years there will be less need for drivers.

There will always be chefs, even when machines help them with peeling vegetables. And there will be data analysts as well, with more and more useful tools to support their work.
