TR: How would you describe the prevailing idea about Big Data inside the tech community?

Fader: "More is better." If you can give me more data about a customer—if you can capture more aspects of their behavior, their connections with others, their interests, and so on—then I can pin down exactly what this person is all about. I can anticipate what they will buy, and when, and for how much, and through what channel.

So what exactly is wrong with that?

It reminds me a lot of what was going on 15 years ago with CRM (customer relationship management). Back then, the idea was "Wow, we can start collecting all these different transactions and data, and then, boy, think of all the predictions we will be able to make." But ask anyone today what comes to mind when you say "CRM," and you’ll hear "frustration," "disaster," "expensive," and "out of control." It turned out to be a great big IT wild-goose chase. And I’m afraid we’re heading down the same road with Big Data.

Further, illustrating the difference between Big Data and Right Data (see my post on this same subject, about the synthesis in the offing: How Hadoop is revolutionizing…)

A true data scientist would have a decent sense of how to answer these questions, with an eye toward practical decision-making. But a Big Data zealot might say, "Save it all—you never know when it might come in handy for a future data-mining expedition." That’s the distinction that separates "old school" and "new school" analysts.

and

Big Data and data scientists seem to have such a veneer of respectability.

In investing, you have "technical chartists." They watch [stock] prices bouncing up and down, hitting what is called "resistance" at 30 or "support" at 20, for example. Chartists are looking at the data without developing fundamental explanations for why those movements are taking place—about the quality of a company’s management, for example.

Among financial academics, chartists tend to be regarded as quacks. But a lot of the Big Data people are exactly like them. They say, "We are just going to stare at the data and look for patterns, and then act on them when we find them." In short, there is very little real science in what we call "data science," and that’s a big problem.

Does any industry do it right?

Yes: insurance. Actuaries can say with great confidence what percent of people with your characteristics will live to be 80. But no actuary would ever try to predict when you are going to die. They know exactly where to draw the line.

Even with infinite knowledge of past behavior, we often won’t have enough information to make meaningful predictions about the future. In fact, the more data we have, the more false confidence we will have. Not only won’t our hit rate be perfect, it will be surprisingly low. The important part, as both scientists and businesspeople, is to understand what our limits are and to use the best possible science to fill in the gaps. All the data in the world will never achieve that goal for us.

The thing to note about this interview is that, for all the truth captured here, it will not make one iota of difference to the current Big Data frenzy. And I dare say that much of this frenzy will come to naught, but that time is far off yet.

I came across a SaaS forecasting technology provider named Lokad in one of my sojourns on the interwebs. I must confess at the very outset, though, that I'm a biased reviewer at best. Biased how, and why? Two salient points:

1. I am of the firm belief (note the word) that forecasting is best done in hindsight, where it may do the least harm. Forecast all you wish about the past – I have no issue with you. I have no problem with economists creating all sorts of economic models to forecast the state of the economy, because I rarely if ever pay any attention to them. So also the Rapture – same thing. This thread of skepticism pervades everything from the macro to the micro forecasting world. That's the first point – do no harm.

2. I am a firm believer (note the word again) in the primacy of execution – the best forecast I can think of is the one that creates the world ahead through the design, the plan, the intent to dominate, etc. As you may imagine, this is restricted to the growth phase, not the maturity and decline phases of a product's lifecycle. While each of these three phases has its own specific variables and levers of interest, my mantra here is not to predict the world but to respond to it in the least amount of time. Thus, I'm not a big fan of long lead times, centralized planning et al.

Please do keep in mind these points as I delve into this review.

"We benchmarked Lokad on client data (a beverage distributor) against a model we specifically developed for the case. Following a deep analysis of the data we combined different forecasting techniques like ARIMA, VAR, LOESS, HOLT-WINTER and others using R, the statistical computing software. Lokad performed very good, the values of MAPE were similar to our results, after 3 months of analysis of the case. I am really impressed of this accuracy. Lokad is also very fast and provides a high level of automation." Mauro Coletto, Business Intelligence Consultant

Empiricism – always a good idea.
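The accuracy metric cited in that testimonial, MAPE, is simple to reproduce. Here is a minimal sketch – the demand and forecast numbers are invented for illustration and have nothing to do with the actual beverage-distributor benchmark:

```python
import numpy as np

def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))

# Invented weekly demand figures and forecasts (not from the benchmark)
actual = [120, 135, 128, 150]
forecast = [115, 140, 130, 145]
error = mape(actual, forecast)  # roughly 3.2%
```

Note that MAPE is undefined for zero-demand periods and punishes over- and under-forecasting asymmetrically in percentage terms – which is exactly why comparable MAPE values across two very different modeling efforts, as in the testimonial, are a meaningful result.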

The good practices that I see at Lokad’s forecasting engine

1. Getting as close to the point-of-demand data as possible – vital for execution, and even more so for forecasting. But from the looks of it, if you, as the client of forecasting-as-a-service, don't have really granular data, then Lokad's service can only be as good as your own execution efforts are likely to be. The implication is that you really can't expect Lokad to cure your sorry-ass case of datatitis.

Our technology is designed to deal with your data in their current form.

One point to note here, though, is that most companies firmly believe they've got a good handle on their data – that is, until they see where their industry/segment leaders benchmark.

Verdict: Neutral.

2. Getting the statistician(s) out of your firm. Spouting statistics on this and that is a finely honed skill of considerable use to your career progression – the higher you go, the more access you have to utterly useless statistical Gordian knots. I don't think I'd be far off the mark in saying that the higher honchos who browbeat you with statistics know quantitatively how poor their own competence and consequent results are, without having any insight into what qualitative actions could surmount, nay transcend, the pitfalls marked on the Pareto charts. You can already outsource the real statisticians and their professional output to Lokad, but there may be a further opportunity Lokad is leaving on the table. Statisticians both inside and outside a firm are quite likely to be treated as black boxes anyway – so why incur the cost of having that black box on your payroll? I can predict quite easily that as Lokad grows, they will go after that opportunity, and I believe it can be a big, big one. Lokad's real competitors in that space are the business/operations consultants. Upsell the interpretation once you have a lock on the statistical computation service.

Verdict: A big plus for Lokad.

3. The use of computational power – Call it the cloud, compute storm, global warming – whatever. What cloud computing does for every client is a very old idea in new clothes. When you need a burst of computational power to solve a problem with millions of variables for a short period of time, doesn't it make better sense to pay for your slice of time rather than buy a whole set of computers and store them in the basement? Companies like Lokad doing this today are set to ride a powerful wave of adoption – the time and assets needed to solve these large problems are now shared. Talk about cooperation without actually talking about it. Lokad is riding the wave, and that's a good way to piggyback on the successes of others riding the same wave.

Verdict: A plus for Lokad.

4. Finally, the technology – the nitty-gritty. First up, quantile forecasting. I did struggle more than a bit to understand this, and I'm not entirely sure that I do (has that ever stopped me from blustering on? But I digress…). The idea seems to be to introduce a weight, correlated to the difference in payoffs between the positive and negative realization of a particular event. To me, this is part of modeling, i.e. appreciating the sensitivity of a forecast and erring on the side of the lower cost. Makes sense.

What I understand about the use of quantiles in forecasting is the improvement (marginal or more) one gets from cost-weighting the forecast rather than simply extrapolating the expected demand from a symmetric distribution. Is this the heart of it? Over one SKU the delta between a plain forecast and a cost-weighted one is probably quite small, but extend that over thousands of SKUs and the numbers begin to add up.
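If my reading is right, the mechanics can be sketched with the standard pinball (quantile) loss, which penalizes under- and over-forecasting asymmetrically. The demand numbers and the 90% service level below are invented for illustration; this is my interpretation, not Lokad's actual method:

```python
import numpy as np

def pinball_loss(actual, forecast, q):
    """Quantile (pinball) loss: under-forecasting is penalized with
    weight q, over-forecasting with weight (1 - q)."""
    diff = np.asarray(actual, dtype=float) - forecast
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

# Invented demand history for a single SKU, with one spike
demand = np.array([12, 9, 15, 11, 30, 10, 13])

# A median (symmetric) forecast vs. a 90th-percentile forecast; the
# latter is what you want when a stockout costs far more than
# carrying a little extra inventory.
median_forecast = np.quantile(demand, 0.5)  # 12.0
q90_forecast = np.quantile(demand, 0.9)     # 21.0
```

Under the 0.9 pinball loss the higher forecast wins, and under the 0.5 loss the median wins – which is the "erring on the side of the lower cost" point in numbers.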

Verdict: A plus for Lokad (Remains a plus if it is what I understand it to be as above)

In all, I find such a move by Lokad quite interesting. Validated empiricism would make it compelling as well. However, as I outlined above, my biases and experience are in execution. Good execution with good forecasting in your frontal view, rather than in the rear-view mirror, is something of a mythical beast – but strange things are beginning to happen in the world of the cloud.

I believe that this sub-domain of Big Data is where a significant portion of the future’s wrangles and intense competition is going to be. TechCrunch has an article on how Facebook (which by the way is one of the big Hadoop users) is using Analytics tools more or less in real-time in order to “glean” information about the activity on their site.

Facebook’s analytics tool Insights will soon begin showing Page performance data in real time or near real time, rather than on an average 48-hour delay, the company plans to announce at Wednesday’s Facebook Marketing Conference in New York City, according to our sources.

and

Making real-time Insights data available through the API “will give Page owners an opportunity to see how their Page actually lives and breathes,” says Chad Wittman, founder of Facebook analytics tool provider EdgeRank Checker.

So Target started sending coupons for baby items to customers according to their pregnancy scores. Duhigg shares an anecdote — so good that it sounds made up — that conveys how eerily accurate the targeting is. An angry man went into a Target outside of Minneapolis, demanding to talk to a manager:

“My daughter got this in the mail!” he said. “She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?”

The manager didn’t have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man’s daughter and contained advertisements for maternity clothing, nursery furniture and pictures of smiling infants. The manager apologized and then called a few days later to apologize again.

and

On the phone, though, the father was somewhat abashed. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”

What Target discovered fairly quickly is that it creeped people out that the company knew about their pregnancies in advance.

“If we send someone a catalog and say, ‘Congratulations on your first child!’ and they’ve never told us they’re pregnant, that’s going to make some people uncomfortable,” Pole told me. “We are very conservative about compliance with all privacy laws. But even if you’re following the law, you can do things where people get queasy.”

So we’ve gone from POS (Point of Sale) data to gleaning information about customers from the patterns they exhibit. Information has been so free in the past that it sounds like it is going to become a whole lot more expensive.

Though what I loved best about this article is the way that the Forbes writer included a picture of Andrew Pole of Target Corp (linked from LinkedIn) in the article itself. A bit of reverse information gleaning…

The Age of Big Data is an article by Steve Lohr in the NY Times Sunday Review. Some of the key takeaways,

They help businesses make sense of an explosion of data — Web traffic and social network comments, as well as software and sensors that monitor shipments, suppliers and customers — to guide decisions, trim costs and lift sales.

And of course, something like this means jobs as well…

A report last year by the McKinsey Global Institute, the research arm of the consulting firm, projected that the United States needs 140,000 to 190,000 more workers with “deep analytical” expertise and 1.5 million more data-literate managers, whether retrained or hired.

As for the impact

The story is similar in fields as varied as science and sports, advertising and public health — a drift toward data-driven discovery and decision-making. “It’s a revolution,” says Gary King, director of Harvard’s Institute for Quantitative Social Science. “We’re really just getting under way. But the march of quantification, made possible by enormous new sources of data, will sweep through academia, business and government. There is no area that is going to be untouched.”

and

Research by Professor Brynjolfsson and two other colleagues, published last year, suggests that data-guided management is spreading across corporate America and starting to pay off. They studied 179 large companies and found that those adopting “data-driven decision making” achieved productivity gains that were 5 percent to 6 percent higher than other factors could explain.

Finally, if you thought that our lives are going to get easier, be aware

Big Data also supplies more raw material for statistical shenanigans and biased fact-finding excursions. It offers a high-tech twist on an old trick: I know the facts, now let’s find ’em. That is, says Rebecca Goldin, a mathematician at George Mason University, “one of the most pernicious uses of data.”

IBM was among the select companies that Forrester invited to participate in The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012, (February 2, 2012). Technologies evaluated were IBM InfoSphere BigInsights (IBM’s Hadoop-based offering), and IBM Netezza Analytics. In this evaluation, IBM was placed in the Leaders category of the Wave and achieved the highest possible score in both the Strategy and Market Presence segments. In the third segment, Current Offering, IBM received the second highest score.

The Forrester report on the current players in the Big Data space can be downloaded from IBM’s site here.

Pre-Hadoop (and Hadoop-like infrastructure) – BI applications access the data available in a store such as a database or a data warehouse and produce actionable items from it. As time moves on, that data gets archived and essentially disappears, dies, or gets aggregated/reduced for offline storage.

Below is the new antithesis:

Post-Hadoop – The approach here is to have live data available at all times, in raw and/or processed form. The Hadoop approach is to take the application to the data – distributed data, with distributed applications acting on and exploring it.

The antithesis has this form largely because, while data storage has become commoditized (and rather large), computation and pipes have not (and perhaps need not have) expanded as much as storage has. At the same time, it has become a human imperative to put out as much junk as possible – er, be more creative – and big data apps and their providers (Facebook, Google etc.) have followed suit.
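The "take the application to the data" idea can be sketched as a toy map/reduce in plain Python. This is a single-process simulation, not Hadoop itself; in a real cluster each map task would run on the node that holds its data block, so only the small (key, value) pairs travel over the network:

```python
from collections import defaultdict

def map_phase(record):
    # In Hadoop, each mapper runs next to the data block it reads;
    # here we just emit a (word, 1) pair for every word in the record.
    for word in record.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reducers receive all the values emitted for a key and aggregate them.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

# Hypothetical raw-data "blocks" that would live on different nodes
blocks = ["Big Data big compute", "data lives where it lands"]
pairs = [kv for block in blocks for kv in map_phase(block)]
result = reduce_phase(pairs)  # e.g. result["big"] == 2
```

The point of the sketch is the division of labor: the data never moves in bulk, only the mappers' lightweight intermediate output does – which is exactly why the post-Hadoop approach copes with storage growing faster than pipes.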

The synthesis – Yet to be.

But here’s a guess: Right-Compute and Right-Data. The premise of Big Compute and Big Data is that in the pile of horse manure there must be a pony in there somewhere – a white stallion, to be sure. As many past Masters will tell you (who is a Master? Think Sun Tzu, Newton), the objective of dealing with Big Data (and is there any bigger data out there than the human, natural and metaphysical world?) is to elicit the laws that underlie it.

Now in the past, we as the curious ones have depended on intuiting and hypothesizing about the Big Data out there. Today, it seems that we’re done with the hypothesizing and are jumping straight into letting the Data speak for itself.

Right-Compute and Right-Data is about getting back to the hypothesize-and-test scheme that has proven remarkably successful in our developmental journey.

In January 2012, we sit again on the cusp of three grand technological transformations with the potential to rival that of the past century. All find their epicenters in America: big data, smart manufacturing and the wireless revolution.

Now, that’s what I call timing, because I’ve been staking out the ground on two of those technological transformations – Smarter Manufacturing (on my @ Supply Chain Management blog) and Big Data here on this blog. My views on Smarter Manufacturing are here.

As for Big Data, this is what the authors have to say,

Information technology has entered a big-data era. Processing power and data storage are virtually free. A hand-held device, the iPhone, has computing power that shames the 1970s-era IBM mainframe. The Internet is evolving into the "cloud"—a network of thousands of data centers any one of which makes a 1990 supercomputer look antediluvian. From social media to medical revolutions anchored in metadata analyses, wherein astronomical feats of data crunching enable heretofore unimaginable services and businesses, we are on the cusp of unimaginable new markets.

While much of this is true, does it sound like a prediction? To me, it sounds like the inevitable inference – except that I foresee a different actualization of Big Data's potential. While we're at the point of Big Data storage, retrieval, analysis, manipulation etc., that is not the point of Big Data <anything>.

I see Big Data as a Resource – like water or oil, for example – a vast landscape to be discovered, molded and valued. That is the new economy: powered by a new resource altogether…