A few notes from my Big Data and Analytics classes – MBA Learnings

Technology analyst Benedict Evans shared an interesting image from the classic 1960 film “The Apartment.” The scene is set in the office of a large insurance company in New York – drones laid out at desks almost as far as the eye can see. Each desk has a telephone, rolodex, typewriter and a large electro-mechanical calculating machine.

As Ben points out – “In effect, every person on that floor is a cell in a spreadsheet. The floor is a worksheet and the building is an Excel file, with thousands of cells each containing a single person. The links between cells are made up of a typewriter, carbon copies (‘CC’) and an internal mail system, and it takes days to refresh whenever someone on the top floor presses ‘F9’.”

(Incidentally, as the protagonists are a desk worker and an elevator attendant, this is actually a romance between a button and a spreadsheet cell.)

It is clear that the capabilities of analysis tools in 1960 were far below our ability to analyze data. So, Microsoft Excel and other spreadsheet programs offered a huge benefit simply because they helped bridge the gap between the average manager’s ability to analyze data and the tools available to do so. This, in turn, spurred businesses to collect more data in the hope of extracting insights. So, over the late 90s and the 2000s, every junior consultant and investment banker became an Excel ninja. Being able to use the tool to its full extent added real value.

All was well. Until “big data” entered the picture.

Excel spreadsheets had a major capacity upgrade recently that finally allowed just over a million rows (1,048,576) per worksheet. However, that still leaves Excel massively inadequate for a real-world “big data” dataset. So, what is “big data”?

The consensus is that big data refers to data sets that have 3 V’s – volume (i.e. size), velocity (speed of data in and out) and variety (range of data types and sources). These data sets can run to hundreds of millions of rows, with inputs coming in every second. To picture a big data set, imagine a massive spreadsheet that receives point-of-sale data for McDonald’s or Wal-Mart in real time.

The next question, then, is – how do we make sense of all of this? It is hard to have a simple answer to this “big” question. So, I’ll share a couple of observations from my big data and analytics classes –

As any person who has analyzed data in Excel will tell you, you can make your data tell any number of stories. The presence of large amounts of data doesn’t change that fact. If anything, it becomes easier to manipulate the data to tell the story you want.

Additionally, the biggest problem plaguing poor analysis – confusing correlation with causation – definitely doesn’t go away. While correlation can be instructive in itself (sophisticated retailers have used correlated buys successfully to push the right coupons), it is dangerous to infer causal relationships for a number of reasons – e.g. there could be a third factor that causes both.
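To make the "third factor" point concrete, here is a minimal simulation sketch. Everything in it is made up for illustration: the ice cream and sunscreen sales, the temperature variable and the numbers are my own assumptions, not data from class. Two products are each driven by temperature and have no effect on each other, yet they still come out correlated.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing its best straight-line fit on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n_days = 5000

# Hypothetical third factor: how warm each day is (standardized units).
temperature = [rng.gauss(0, 1) for _ in range(n_days)]
# Both products depend on temperature, but not on each other.
ice_cream = [t + rng.gauss(0, 1) for t in temperature]
sunscreen = [t + rng.gauss(0, 1) for t in temperature]

# Correlation as a naive analyst would see it.
raw = pearson(ice_cream, sunscreen)
# Correlation once the effect of temperature is removed from both series.
adjusted = pearson(residuals(ice_cream, temperature),
                   residuals(sunscreen, temperature))

print(f"raw correlation:            {raw:.2f}")
print(f"after removing temperature: {adjusted:.2f}")
```

By construction, the raw correlation should land near 0.5 and the adjusted one near zero. The catch in real life is that you can only adjust for a third factor if you thought to measure it, which is where judgment comes in.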

Big data has increased our ability to experiment with different campaigns and messages. However, unless the experiments are well designed and run on groups that are randomly assigned (or close to it), the results can be misleading.
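Here is a rough sketch of why random assignment matters. The setup is hypothetical: I assume a campaign with a known true lift of 2.0 in spend per customer, and compare what you would measure if high spenders select themselves into the campaign versus what you would measure with a coin-flip assignment. The names and numbers are illustrative assumptions only.

```python
import random

TRUE_LIFT = 2.0  # assumed true effect of the campaign on spend


def mean(values):
    return sum(values) / len(values)


def run_experiment(n_customers=20000, seed=1):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # Each customer's usual spend, before any campaign.
    baseline = [rng.gauss(50, 10) for _ in range(n_customers)]

    def estimated_lift(treated_flags):
        """Difference in average spend between treated and control groups."""
        treated, control = [], []
        for usual_spend, is_treated in zip(baseline, treated_flags):
            spend = usual_spend + rng.gauss(0, 5)
            if is_treated:
                treated.append(spend + TRUE_LIFT)
            else:
                control.append(spend)
        return mean(treated) - mean(control)

    # Poor design: customers who already spend a lot end up in the campaign.
    self_selected = [usual_spend > 55 for usual_spend in baseline]
    # Better design: a coin flip decides who sees the campaign.
    randomized = [rng.random() < 0.5 for _ in range(n_customers)]

    return estimated_lift(self_selected), estimated_lift(randomized)


biased_lift, randomized_lift = run_experiment()
print(f"true lift:                  {TRUE_LIFT:.2f}")
print(f"self-selected groups say:   {biased_lift:.2f}")
print(f"randomly assigned groups:   {randomized_lift:.2f}")
```

In this sketch the self-selected comparison overstates the lift by a wide margin, because it mixes the campaign's effect with the fact that the treated customers were bigger spenders to begin with. The randomized comparison should land close to the assumed true lift.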

In some ways, this gets to the root of the fundamental issue with analysis – analysis, by itself, is generally useless. Analysis, supported by business judgment, can be incredibly powerful.

The effect of this issue gets magnified when we have huge amounts of data. There are often more variables than we know what to do with. And, while machine learning tools like neural networks can help us find relationships between them, they won’t mean much if they aren’t combined with good business judgment.
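A small sketch of the "more variables than we know what to do with" problem, under assumptions I picked for illustration: 1,000 candidate variables that are pure random noise, 50 observations each, and a target that is also pure noise. None of them has any real relationship with the target, yet a search across all of them will still turn up something that looks like one.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)  # fixed seed so the run is repeatable
n_obs = 50       # e.g. 50 weeks of data (hypothetical)
n_vars = 1000    # candidate variables, all of them noise

# The metric we want to "explain": here, just noise.
target = [rng.gauss(0, 1) for _ in range(n_obs)]
# Candidate explanatory variables, generated independently of the target.
candidates = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

correlations = [pearson(candidate, target) for candidate in candidates]
best = max(abs(r) for r in correlations)

print(f"strongest correlation found among {n_vars} noise variables: {best:.2f}")
```

The strongest "relationship" found this way is a product of the search itself, not of anything in the business. A tool will happily report it; deciding whether it makes sense is the manager's job.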

All this leads me to conclude that we’re now in a situation that is very different from the 1960s picture we started with. At that time, an average manager’s analysis capabilities were far ahead of the tools and data available. Now, it is safe to say that the tools and data available far outstrip the average manager’s analysis capabilities. Forget the average manager: even the most sophisticated managers will struggle to drive the right analysis and then interpret the results correctly. While we can expect the tools to become easier to use in 5-10 years, the need for analysis and insights is not going to go away any time soon.

If you are wondering why I am not referencing the sophisticated data science teams that exist to solve this problem in leading companies, I’d like to go back to the key driver of great analysis – “good business judgment.” The ideal analyst is someone who combines amazing tool capability with business judgment. Very few of these people exist. Great analysis is driven by managers AND data scientists. And, for managers to work well with data scientists, they need to become good consumers of analysis.

So, if there’s one thing I’ve taken away from these classes, it is the importance of doing whatever it takes to get on board the big data train.

For those who don’t plan to attend graduate school, consider online statistics courses that cover the basic tools. Open source tools like R put serious analysis within reach of anyone willing to learn them.

And, if you are fortunate enough to attend a graduate school that is emphasizing big data and analytics, take full advantage of the opportunity.