How important is Data Quality, really?

The answer to this question might be given away in the form of other questions, like:

How many top-class athletes do you see hanging out at greasy fast-food joints?

How important is diet to an Olympic athlete?

How important are good tyres (ok, tires for the US audience, there!) to racing car performance?

The GIGO (Garbage In --> Garbage Out) principle most certainly applies when it comes to data quality. For BI to be pervasive in the organization, the organization must first trust the data. Good data quality leads to increased trust, which in turn leads to wider and faster adoption of BI.

In so many of the BI implementations I do, the initial pilot program is spec'd out on the understanding that the data is in place, clean and usable. In almost every instance, as we start to dive deeper, it becomes glaringly obvious that this is not the case and that a lot of time is needed to improve the data quality. Why is this so?

The business in general gets on with the primary activities of the business, which SHOULD be to make a profit in the chosen field / market the business operates in. For this, certain personality types are required. Designing proper data structures, on the other hand, requires the services and knowledge of a good data analyst / data custodian who understands the business and has hopefully done this before in a similar environment. Sadly, many folks use scant and basic knowledge to design data, then keep adding to it and modifying it to the point that the corporate data looks like a patchwork quilt. This is just a harsh reality of living in an imperfect world where, more often than not, we don't get the luxury of being able to start again and implement perfectly.

A good data analyst / data quality analyst is a details person. Finding patterns in the data and proving to the business that the data does or does not represent a given way of thinking is something that most people would rather hand off to someone else, some nutter who likes to analyze bytes and low-level data. Sadly, I resemble that remark! A good data analyst will be at a barbecue or a movie or driving when a sudden realization about the data enters their mind, or a way of looking at the data that has not been thought of before. This is usually followed by a mad scramble to get onto the system and interrogate the data to verify / refute this thinking. It's a mental disease (this non-stop analytical way of thinking) and it will not go away. I feel a lot better after accepting this diagnosis.

In one of my recent implementations, we implemented BI over a national retail chain and found links to suppliers that did not exist, products that were grouped all wrong in terms of the product hierarchy, stores that did not match the patterns the business took for granted, and a variety of other problems. This makes for very difficult BI, and it is probably one of the major causes of slow BI adoption, because the organization does not trust the information to be correct!
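A check for "links to suppliers that did not exist" can be sketched in a few lines of Python. This is only an illustration: the field names (`supplier_id`, `sku`) and the sample rows are assumptions made up for the example, not the retail chain's actual schema.

```python
def find_orphans(child_rows, parent_keys, fk_field):
    """Return the child rows whose foreign key has no matching parent key."""
    known = set(parent_keys)
    return [row for row in child_rows if row[fk_field] not in known]


# Hypothetical sample data: two known suppliers, three products.
suppliers = [{"supplier_id": 1}, {"supplier_id": 2}]
products = [
    {"sku": "A100", "supplier_id": 1},
    {"sku": "B200", "supplier_id": 7},  # points at a supplier that does not exist
    {"sku": "C300", "supplier_id": 2},
]

orphans = find_orphans(
    products,
    (s["supplier_id"] for s in suppliers),
    "supplier_id",
)
for row in orphans:
    print(f"Product {row['sku']} references missing supplier {row['supplier_id']}")
```

Running a check like this per relationship, before any report is built, gives the business a concrete list of records to fix rather than a vague sense that "the data is off".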

In the initial pilot phase, about 75% of the time was spent just dealing with the data. What was required were reports and dashboards, but all of this presumes that the data is available, that it is accurate and that it matches what the business "know" to be true about their operations. The importance of this phase cannot be overstated. Now that we have this information, subsequent phases of the BI life cycle can use the knowledge gained to make life easier / faster, or at least to identify what can and cannot be done with the data.

So where do we fix data quality issues? Fortunately that question is the easiest to answer: at its source!

We ran a commission reconciliation report for an insurance brokerage in Australia, breaking down the premium renewal summaries by state. This report showed us that of the 8 states and territories in Australia (Victoria, New South Wales, Northern Territory, Western Australia, South Australia, Tasmania, Queensland and ACT [Australian Capital Territory, kind of like DC in the USA]), we had summaries for 25 states. How can this be?

Well, we showed totals for VIC, Vic, Victoria, Victorai, New South Wales, New South Whales (yes, the large sea-faring mammals), etc. This pointed to a critical flaw in the source system where premium renewals were being captured: the state field was a freeform text field instead of a drop-down picklist. This leaves the door wide open for human error and the all-too-famous PEBKAC error (Problem Exists Between Keyboard And Chair). Once the data was fixed and the culprit online system changed to accept only drop-down values, we ran the report again and the information was correct.
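A quick profile of the distinct values in a freeform field is usually enough to expose this kind of problem. The sketch below is a minimal illustration in Python, not the report we actually ran: the canonical list, the function name `profile_states` and the sample values are assumptions for the example.

```python
from collections import Counter

# Assumed canonical picklist: the eight Australian states and territories.
CANONICAL_STATES = {
    "VIC": "Victoria",
    "NSW": "New South Wales",
    "QLD": "Queensland",
    "SA": "South Australia",
    "WA": "Western Australia",
    "TAS": "Tasmania",
    "NT": "Northern Territory",
    "ACT": "Australian Capital Territory",
}


def profile_states(values):
    """Count distinct raw values and split them into recognised / unrecognised."""
    counts = Counter(values)
    # Accept either the code or the full name, ignoring case and outer spaces.
    lookup = {}
    for code, name in CANONICAL_STATES.items():
        lookup[code.casefold()] = code
        lookup[name.casefold()] = code
    recognised, unrecognised = {}, {}
    for raw, n in counts.items():
        key = raw.strip().casefold()
        if key in lookup:
            recognised[raw] = lookup[key]
        else:
            unrecognised[raw] = n
    return counts, recognised, unrecognised


sample = ["VIC", "Vic", "Victoria", "Victorai ", "New South Wales", "New South Whales"]
counts, recognised, unrecognised = profile_states(sample)
print(f"{len(counts)} distinct raw values for {len(set(recognised.values()))} real states")
print("Needs manual review:", sorted(unrecognised))
```

Note that the sketch deliberately does not guess at misspellings such as "Victorai"; those are listed for a human to review, and the lasting fix is still the picklist in the source system.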

So often, we use our tool BI Plus as a data quality tool to check summaries and drilldown summaries and to verify that totals match, right down to the detailed level transactions. I have found that before I start to use BI Plus as a reporting / dashboarding / advanced Pivot table tool, I use it almost solely as a data quality tool, profiling data, verifying relationships, record counts, summaries, etc., then checking with the business analysts I work with to verify that what I am discovering is actually on track with what they know about the business. It is with this information that I dared to pose the question "Why is it that some stores are permitted to open for the day to trade when the GP generated for the entire day does not even nearly cover the costs of keeping the store open?". On the drilldown, we saw a detail of every such day in the last 6 months. Business benefit? We can now start to predict which stores are about to go into bankruptcy well before the stores themselves can, and start to take action. This would not be possible without the data quality being in a reasonably healthy state!
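The idea of verifying that summary totals match the detailed transactions can be sketched outside any particular tool. The Python below is a rough illustration under assumed names and made-up figures; it says nothing about how BI Plus does this internally.

```python
from collections import defaultdict
from decimal import Decimal


def reconcile(summary, transactions):
    """Compare reported totals per key with the sum of the detail transactions.

    Returns {key: (reported, recomputed)} for every key where the two differ.
    """
    recomputed = defaultdict(Decimal)  # missing keys start at Decimal("0")
    for key, amount in transactions:
        recomputed[key] += Decimal(amount)

    mismatches = {}
    for key in set(summary) | set(recomputed):
        reported = Decimal(summary.get(key, "0"))
        actual = recomputed.get(key, Decimal("0"))
        if reported != actual:
            mismatches[key] = (reported, actual)
    return mismatches


# Hypothetical figures: reported store totals versus transaction-level detail.
summary = {"store_a": "150.00", "store_b": "80.00"}
transactions = [
    ("store_a", "100.00"),
    ("store_a", "50.00"),
    ("store_b", "75.50"),
]

mismatches = reconcile(summary, transactions)
for key, (reported, actual) in sorted(mismatches.items()):
    print(f"{key}: summary says {reported}, detail adds up to {actual}")
```

Amounts are held as `Decimal` rather than floats so that currency totals compare exactly instead of failing on rounding noise.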

So getting back to the question of "How important is Data Quality, really?", if BI were a singer, it would be the Hollies and the song it would sing to Data Quality would be "You're the air that I breathe!". Think of large-city smog and this analogy takes on a more visual aspect. Breathe in enough smog and it will have adverse effects on your health, for sure!