In Data, Sometimes the Truth is just an Opinion…

Data Warehouses were born of the finance and regulatory age. When you peel away the buzzwords, the principal goal of this initial phase of business intelligence was the certification of truth. Warehouses helped to close the books and analyze results. Regulations like Dodd-Frank wanted to make sure that you took special care to certify the accuracy of financial results, Basel wanted certainty around capital liquidity, and on and on. Companies would spend months or years developing common metrics, KPIs, and descriptions so that a warehouse would accurately represent this truth.

In our professional lives, many items still require this certainty. There can only be one reported quarterly earnings figure. There can only be one number of beds in a hospital or factories available for manufacturing. However, an increasing number of questions do not have this kind of tidy right and wrong answer. Consider the following:

Who are our best customers?

Is that loan risky?

Who are our most effective employees?

Should I be concerned about the latest interest rate hike?

Words like best, risky, and effective are, by their very nature, subjective. My colleague at Qlik, Jordon Morrow (@analytics_time), writes and speaks extensively about the importance of data literacy and uses a phrase that has always resonated with me: data literacy requires the ability to argue with data. This is key when the very nature of what we are evaluating does not have neat, tidy truths.

Let’s give an example. A retail company is trying to liquidate its winter inventory and has asked three people to evaluate the best target list for an e-mail campaign.

John downloads last year’s campaign results and collects the names and e-mail addresses of the 2% that responded to the campaign last year with an order.

Jennifer thinks about the problem differently. She looks through sales records for anyone who, during the month of March in the past 5 years, bought winter merchandise at a discount of more than 25%. She notices that these people often come to the web site to learn about sales before purchasing. Her reasoning is that a certain type of person who likes discounts and winter clothes is the target.

Juan takes yet another approach. He looks at social media feeds of brand influencers. He notices that there are 100 people with 1 million or more followers and that social media posts by these people about product sales traditionally cause a 1% spike in sales for the day as their followers flock to the stores. This is his target list.

So who is right? This is where the ability to argue with data becomes so critical. In theory, each of these people should feel confident developing a sales forecast based on his or her model. They should understand the metric that they are trying to drive, and they should be able to experiment with different ideas to drive a better outcome and confidently state their case.
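To make the three positions concrete, here is a minimal sketch of how each selection rule could be written as a filter over the same customer records. Everything in it is an assumption for illustration: the `Person` fields, the made-up records, and the function names are hypothetical, and the thresholds simply mirror the story above (a response last year, a March winter purchase at more than a 25% discount, 1 million or more followers).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    email: str
    responded_last_year: bool     # ordered from last year's campaign
    march_winter_discount: float  # best discount on a March winter purchase (0.0 if none)
    followers: int                # social media followers


# Made-up records, for illustration only.
PEOPLE = [
    Person("a@example.com", True, 0.30, 1_200),
    Person("b@example.com", False, 0.40, 800),
    Person("c@example.com", False, 0.00, 2_500_000),
    Person("d@example.com", True, 0.10, 1_000_000),
    Person("e@example.com", False, 0.00, 50),
]


def john_targets(people):
    """John: anyone who responded to last year's campaign with an order."""
    return {p.email for p in people if p.responded_last_year}


def jennifer_targets(people, min_discount=0.25):
    """Jennifer: March winter buyers whose discount exceeded the threshold."""
    return {p.email for p in people if p.march_winter_discount > min_discount}


def juan_targets(people, min_followers=1_000_000):
    """Juan: influencers with at least the given number of followers."""
    return {p.email for p in people if p.followers >= min_followers}


target_lists = {
    "John": john_targets(PEOPLE),
    "Jennifer": jennifer_targets(PEOPLE),
    "Juan": juan_targets(PEOPLE),
}

for analyst, targets in target_lists.items():
    print(analyst, sorted(targets))

# Where the three opinions agree, and everyone that at least one of them picks.
in_all_lists = set.intersection(*target_lists.values())
in_any_list = set.union(*target_lists.values())
print("In all three lists:", sorted(in_all_lists))
print("In at least one list:", sorted(in_any_list))
```

The point of the sketch is not that one filter wins. Once each opinion is written down as an explicit rule, the lists can be compared, the overlaps and gaps become visible, and each analyst has something specific to argue about.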

While this feels intuitive, enterprise processes and technologies are rarely set up to support this kind of vibrant analytics effort. This kind of analytics often starts with the phrase “I wonder if…”, while conventional IT and data governance frameworks are generally not able to deal with questions that a person did not know they had 6 months before. And yet, “I wonder if” relies upon data whose use may have been unforeseen. In fact, it usually requires connecting data sets that have never been connected before to drive break-out thinking. My friend Bill Schmarzo describes it succinctly in the linked blog: “Data Science is about identifying those variables and metrics that might be better predictors of performance.” This relies on the analysis of new, potentially unexpected data sets like social media followers, campaign results, web clicks, and sales behavior. Each of these items might be important for an analysis, but in a world in which it is unclear what is and is not important, how can a governance organization anticipate and apply the same dimensions of quality to all of the hundreds of data sets that people might use? And how can they apply the same kind of rigor to data quality standards for the hundreds of thousands of data elements available, as opposed to the 100-300 critical data elements?

They can’t. And that’s why we need to re-evaluate the nature of data governance for different kinds of analytics. In my next blog, I will explore a new framework for data governance that flexes to include not only conventional data that drives reporting and regulatory outcomes, but also data analytics in a data-democratized world.

In his first post, @JoeDosSantos discusses data governance and the importance of arguing with #data