Is Trump's unexpected victory a failure for big data? Not really

Polls go by the old IT maxim: 'Garbage in, garbage out'

Most election prediction shops and public polls in recent days foresaw Republican Donald Trump losing the U.S. presidential race to Democrat Hillary Clinton.

They got it wrong, bigly. And the failed predictions could cast doubts on some hot technology sectors, including big data and customer relationship management.

Not so fast, say some data experts, who argue that the problem with the polls and with forecasters like FiveThirtyEight may have more to do with data collection than with data crunching.

Data analysis worked well in the Moneyball model for the Oakland Athletics, but baseball stats are different from election polling data, said CRM analyst Denis Pombriant, founder of Beagle Research Group. Statisticians have been collecting "highly reliable" baseball data for more than a century, while polling data is more squishy, he said.

All data analysis depends "on good, clean, and valid data, and for that reason, should only be applied in its pure form outside of baseball with care," he wrote in a research note Wednesday. "Without care, you revert to the old IT maxim of GIGO, or 'garbage in, garbage out.'"

Polls aren't really big data, added Nik Rouda, senior analyst at the Enterprise Strategy Group. "The sample sizes were certainly good enough for a poll, but maybe didn’t meet the definitions around volumes of data, variety of data, [and] historical depth contrasted against real-time immediacy, machine learning, and other advanced analytics," he said by email. "If anything, I’d argue that more application of big data techniques would have given a better forecast."

While both presidential teams relied on voter profiles, more robust profiles and "analyzing cohorts by behavior would have shown a clear picture," Rouda said. "This was a failure of the traditional approach, not a failure of math or failure of big data."

There may be more room for prediction shops to combine polling with social media data, he suggested, and number crunchers could have looked more closely at subtle financial indicators.

It's worth noting that most national polls had Clinton winning by only a small margin, with some polls within the margin of error. As of Wednesday morning, Clinton was winning the popular vote by about 185,000 votes in an election in which more than 118.5 million votes were cast. But Clinton lost at least 27 states, including several states the polls predicted she'd win, and will lose in the Electoral College.
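
To put those vote totals in perspective, the lead can be expressed as a share of all ballots. The short calculation below is only a back-of-the-envelope sketch: it treats the article's rounded figures ("about 185,000" and "more than 118.5 million") as exact, which they are not, and the variable names are made up for illustration.

```python
# Rounded figures from the article, treated as exact for illustration only.
votes_cast = 118_500_000   # "more than 118.5 million votes"
popular_vote_lead = 185_000  # "about 185,000 votes"

# Lead as a percentage of all votes cast.
lead_pct = popular_vote_lead / votes_cast * 100

print(f"Popular-vote lead: {lead_pct:.2f}% of votes cast")
```

On these numbers the lead works out to well under 1 percent of the votes cast.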

The incorrect forecasts, however, don't appear to be explained by the polls' margin of error, said Professor Samuel Wang, manager of the Princeton Election Consortium, which gave Clinton a 99 percent chance of winning as of Tuesday morning.

The polling resulted in "a systematic error," Wang said by email. "The entire group of polls was off, as a group. This was a really large error, around 4 points at presidential and Senate levels, up and down the ticket."
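
The difference between a systematic error and ordinary sampling noise can be shown with a toy simulation. The sketch below is not Wang's model or anyone's actual polling method. It assumes a hypothetical candidate with a true 48 percent share, 20 polls of 1,000 respondents each, and a shared bias of 4 points in that candidate's share, chosen only to echo the size of the error Wang describes. All function and variable names are invented for the example.

```python
import random
import statistics

def simulate_poll(true_share, n_respondents, bias, rng):
    """Simulate one poll and return the candidate's estimated share.

    Each respondent reports support with probability true_share + bias.
    The bias term stands in for an error shared by every poll
    (a hypothetical modeling choice, not a real polling method).
    """
    p = true_share + bias
    supporters = sum(1 for _ in range(n_respondents) if rng.random() < p)
    return supporters / n_respondents

def average_of_polls(true_share, n_polls, n_respondents, bias, seed=0):
    """Average the estimates from n_polls independent simulated polls."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    estimates = [
        simulate_poll(true_share, n_respondents, bias, rng)
        for _ in range(n_polls)
    ]
    return statistics.mean(estimates)

TRUE_SHARE = 0.48  # hypothetical true support for the candidate

unbiased_avg = average_of_polls(TRUE_SHARE, n_polls=20, n_respondents=1000, bias=0.0)
biased_avg = average_of_polls(TRUE_SHARE, n_polls=20, n_respondents=1000, bias=0.04)

print(f"True share:              {TRUE_SHARE:.3f}")
print(f"Average, no shared bias: {unbiased_avg:.3f}")
print(f"Average, shared bias:    {biased_avg:.3f}")
```

In this toy setup, averaging 20 polls washes out most of the random noise, so the unbiased average lands close to the true share. The shared bias does not average away: every poll is off in the same direction, so the combined estimate stays off by roughly the size of the bias.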

Wang said he's still evaluating the results. One reason for the incorrect predictions may have been late decisions by so-called undecided voters.

"Pollsters use the word 'undecided,' but what that really means is that they are unable to verbalize their preference -- they may not even know what it is themselves," he said.

Early in the campaign, about 20 percent of Republican-leaning voters were undecided, Wang said. "For them, voting for Trump was a battle between party loyalty and reaction to a fairly radical candidate," he added. "It may be that party loyalty has won."

Wang and Tony Baer, a big data analyst at Ovum, both suggested that polls may have undercounted hard-to-reach voters. With many U.S. residents dropping their landlines, it's also more difficult for pollsters to target the right people, Baer said.

In other cases, people responding to the polls may have lied. Pollsters may have gotten "the wrong signals," Baer added by email.

"When you have data sets that are large enough, you can find signals for just about anything," he added. "So this places a premium on identifying the right data sets and asking the right questions, and relentlessly testing out your hypothesis with test cases extending to more or different data sets."

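One way to read Baer's warning is as a multiple-comparisons problem: search enough candidate variables and some will appear to track the outcome by chance alone. The sketch below illustrates that idea with pure random noise. It is an assumption-laden toy, not a method attributed to Baer; the function names, the 30 observations, and the counts of 5 and 2,000 candidate variables are all hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def strongest_noise_correlation(n_features, n_obs, seed=0):
    """Return the largest |correlation| between a random target and
    n_features random, unrelated variables. Everything here is noise."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(feature, target)))
    return best

few = strongest_noise_correlation(n_features=5, n_obs=30)
many = strongest_noise_correlation(n_features=2000, n_obs=30)

print(f"Best |r| among 5 noise variables:    {few:.2f}")
print(f"Best |r| among 2000 noise variables: {many:.2f}")
```

None of the variables has any real relationship to the target, yet the best match among 2,000 candidates looks far stronger than the best among five. Testing a promising "signal" on fresh data, as Baer suggests, is what separates such flukes from real patterns.
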
Asked for his reaction to Trump's victory, Baer said he was "just as perplexed as anybody."