Can Big Data Avoid The Correlation Trap?

A recent vendor conference brought this topic to mind, but what follows is less about the vendor and more about what I view as a major opportunity for software technology going forward: telling causality apart from correlation. Be assured, this eventually relates to business strategy.

One of the first things I learned in statistics was to beware of "spurious correlation." The example given was the fact that growth in the use of radios tracked nicely with growth in population, the economy, etc. So, it was easy to conclude that selling more radios would ensure a booming economy forever and a day. In fact, the two had nothing to do with each other; it was just that both of them happened to be growing pretty much monotonically during a particular period.
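To make the radio example concrete, here is a minimal Python sketch. All of the numbers are invented for illustration: two series that share nothing except an upward trend, with hypothetical names like radios_in_use and economy_size. It compares the correlation of the raw levels with the correlation of the period-to-period changes.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def changes(xs):
    """Period-to-period differences, which strip out a steady trend."""
    return [later - earlier for earlier, later in zip(xs, xs[1:])]

rng = random.Random(0)  # fixed seed so the run is repeatable
periods = range(200)

# Two made-up series that are generated independently of each other.
# The only thing they have in common is that both grow over time.
radios_in_use = [100 + 5 * t + rng.gauss(0, 3) for t in periods]
economy_size = [500 + 20 * t + rng.gauss(0, 10) for t in periods]

level_corr = pearson(radios_in_use, economy_size)
change_corr = pearson(changes(radios_in_use), changes(economy_size))

print(f"correlation of levels:  {level_corr:.3f}")
print(f"correlation of changes: {change_corr:.3f}")
```

The levels should come out almost perfectly correlated, simply because both lines go up, while the changes should show little or no relationship. Differencing is only one rough check, not a proof of anything; it just removes the shared trend that creates the illusion.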

The obvious way to avoid spurious correlation is to establish causation: This is or is not caused by that. Loosely speaking, it appears to me that establishing causation between groups of events/data points A and B involves two key elements:

Establishing that group of events B follows group of events A in time; and

Establishing a physical link between the events in group A and the events in group B (e.g., your return of serve in tennis is physically linked to my serve: I’m serving to you, not to someone else).

Establishing Causation

The problem in the real world is that while statistics and analytics are awash in ways to establish correlation, there is an incredible paucity of ways to establish causation. I wandered through the byways of Wikipedia when, a while ago, I considered fulfilling my lifelong Asimov-Foundation-fueled desire to write about mathematical tools for predicting the future – and found nothing close to globally applicable.

There was a major example of this in economics recently. Two economists, Reinhart and Rogoff, published a paper warning of the perils of high government debt to the future growth path of economies. To their data, whatever its merits, they applied standard regression techniques showing some sort of decrease in growth correlated with higher and higher levels of debt. But nowhere did they consider whether the data really did show that high debt preceded slow growth in time.

Critics, for their part, simply questioned whether high debt preceded slow growth in time. Until very recently, no one attempted to figure out statistically what preceded what. Of course, once someone did that, it turned out that the data gave more support to slow growth preceding high debt than to the reverse. Even so, this only covered the temporal aspect, not the physical link, and the statistical technique used was a very awkward comparison of regressions.

The result of this, I think, is a pervasive, unconscious assumption that if we keep beating at correlations over and over, we will somehow establish causation. To me, this leads to the kind of business problem cited in a recent tweet: 80 percent of CEOs think they supply a superior customer experience, while just 8 percent of customers think so. These CEOs are taking things that have been shown to correlate with more positive customer response, and assuming they will work in the case of their own customers.

I could cite many other examples in and out of business; but for businesses, this means money thrown down a rathole trying to provide a superior customer experience and failing – although certainly businesses are doing better than when the first spurious this-focus-group-likes-the-idea-so-it'll-work correlations were used. I am asserting that the correlation trap is costing businesses lots of cash in wasted strategies, whether they are using Big Data or not.

What Could Big Data Do?

I recently tweeted that maybe "Big Data" should be called "Deep Analytical Processes" now because it is becoming apparent that any real value of Big Data beyond having more data to play with is that a change of degree is a change in kind. That is, the ability to use more data leads to another layer of understanding of its meaning – e.g., personalization of the customer rather than segmentation – and that allows us to change fundamentally our processes for decision-making and action-taking in response to these insights. So for example, a company might change the call-center process to treat the call-center customer as a unique bundle of attributes instead of a category.

One of these dimensions of depth that can potentially lead to a change in process is time. Now, optionally, the business can attach a sequence of interactions and changes in attributes over time to the individual customer. Not only can it know and remember what has happened in previous interactions, it can also detect how the customer has changed since the last interaction, so it can offer baby-related information to the new parent, for example.

Thus, if the business is smart enough, it can eliminate many spurious correlations by seeing that they do not fit the customer sequence of actions and states. Think of the money wasted on call center redesigns that infuriated users!

However, three main improvements are still needed in Big Data before we can really establish causation and begin to truly escape the correlation trap:

Vendor products need to be beefed up to allow the temporal aspect of causation to be included, unobtrusively but pervasively, in analytics. There ought to be a "time-checker" that says, for example, that radio sales follow economic growth rather than lead it. Making sure of temporality should be a given in Big Data, not an effortful exception.

Comparable attention must be paid to the physical link. The obvious place to do this is in the modeling phase. However, even in Big Data, the modeling phase does not appear to be a common step. Models need to include the concept of the physical link, and users need to use that capability by default. That’s a bit of a tall order.

Statistics really needs to provide far better tools, and far greater use of them, for establishing both time and physical link. I would be happy if at least the time aspect were upgraded substantially, since development of models is still a bit of a black art. But some sort of minimal toolkit that is as easy to use as regression, that includes time, and that indicates how far the timing goes toward establishing causation, would be really useful.
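As a rough illustration of what a "time-checker" might do, here is a minimal Python sketch under stated assumptions. The data are invented, the function names (lagged_corr, best_lag) are hypothetical, and the method is a simple lagged correlation, not any vendor's feature. It looks at the temporal element only and says nothing about the physical link.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def lagged_corr(a, b, lag):
    """Correlation of a[t] with b[t + lag]; a positive lag means a comes first."""
    if lag > 0:
        return pearson(a[:-lag], b[lag:])
    if lag < 0:
        return pearson(a[-lag:], b[:lag])
    return pearson(a, b)

def best_lag(a, b, max_lag):
    """Lag in [-max_lag, max_lag] where the two series line up most strongly."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda k: abs(lagged_corr(a, b, k)))

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 300

# Invented data: changes in economic growth, and changes in radio sales
# that are built to echo growth two periods later, plus unrelated noise.
growth_change = [rng.gauss(0, 1) for _ in range(n)]
radio_sales_change = [
    (0.8 * growth_change[t - 2] if t >= 2 else 0.0) + rng.gauss(0, 0.5)
    for t in range(n)
]

lag = best_lag(growth_change, radio_sales_change, max_lag=5)
if lag > 0:
    print(f"radio sales follow growth by about {lag} periods")
elif lag < 0:
    print(f"radio sales lead growth by about {-lag} periods")
else:
    print("no clear ordering in time")
```

Because the follow-on effect is built into the made-up data, the check should report that radio sales follow growth. On real data the answer will be far less clean, and knowing which series moves first still leaves the physical link to be established.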

Business Benefits of Avoiding the Correlation Trap (via Big Data)

I have briefly alluded to examples of saving money by avoiding strategies based on spurious correlations. I believe, however, that avoiding the correlation trap provides a more fundamental business benefit from analytics.

It seems to me that business benefits, crassly speaking, run along two lines: first, avoiding bad things such as costs and disasters, and second, achieving good things such as increased revenues and new markets. Business cases based on good things are necessarily more speculative, because before a CEO can achieve superior performance he or she has to keep the business alive – especially in these slow-revenue-growth times. Up to now, Big Data has leaned toward a good things message: Big Data use leads to new insights that improve revenues or decrease costs while increasing customer satisfaction, followed by – trust me – another such insight, followed by – trust me more – yet another such insight.

However, avoiding the correlation trap is much more about avoiding bad things and much more long term. It's also about process, in the sense that your processes are now based on well-established insights where your action is better guaranteed to be a cause of a good outcome rather than being irrelevant and money-wasting or positively harmful. I like to compare this kind of strategy to effective boxing. A short, better-targeted punch is more powerful and more likely to strike home than a roundhouse right in the vague direction of the opponent.

Big Data, at least potentially, takes us a lot closer to that kind of strategic targeting by avoiding the correlation trap. It's not just about the latest and greatest insight. It's about not wasting money on the fad du jour, but taking full advantage of what's really going on, based not only on deep analysis but deep understanding. Fix the three problems cited above, and you're well on the way there.

So for vendors, I believe the challenge is to get your products in shape to handle causation much better, as I detailed above. However, for users, there is no need to wait. You can start infusing your analytics with time-based analysis on a skunkworks or pilot-project basis right now, with a little careful piecing through the masses of detail your vendors give you. Then demand that your vendors up their game in this area.

If all this comes to pass, I can dream that eventually even the statistics profession will pay attention to the real-world problems of establishing causality. If that happens, I will be happy, Isaac Asimov in his grave will be happy, and businesses might begin to appreciate statisticians more. In fact, I can envision a bumper sticker: Have you hugged your statistician today?

On second thought ... naaah. That's a bit much, even for me. Businesses will just have to settle for long-term strategic benefits.

Wayne Kernochan is the president of Infostructure Associates, an affiliate of Valley View Ventures that aims to identify ways for businesses to leverage information for innovation and competitive advantage. Wayne has been an IT industry analyst for 22 years. During that time, he has focused on analytics, databases, development tools and middleware, and ways to measure their effectiveness, such as TCO, ROI and agility measures. He has worked for respected firms such as Yankee Group, Aberdeen Group and Illuminata, and has helped craft marketing strategies based on competitive intelligence for vendors ranging from Progress Software to IBM.