Big Data's Human Error Problem

We humans are our own worst enemies in the quest for better data quality, says one expert. Think false memory syndrome, typos, slips of the tongue and confirmation bias.

Has the problem of bad data grown worse in the era of big data? No, not really, says author and industry analyst Joe Maguire, one of the organizers of the MIT Chief Data Officer and Information Quality (CDOIQ) Symposium, to be held July 17-19 in Cambridge, Mass.

The event, now in its 7th year, focuses on the issues of information quality and the need for a chief data officer (CDO) role within enterprises. In addition to being one of the conference organizers, Maguire will moderate a panel on human factors in information quality.

When it comes to information, digital or otherwise, one fact never changes: humans and data quality errors are inseparable, Maguire told InformationWeek in a phone and email interview. Furthermore, data that's too clean -- devoid of any signs of human blunders -- is immediately suspect.

"Sure, bad data touches human lives -- and vice versa. Humans are known to make a certain number of typos. In certain contexts, immaculate data could be a sign of fraud. If humans are involved in the production of data, you should expect it to be imperfect," Maguire wrote via email.

Problems resulting from poor data quality -- some serious, others lighthearted -- are often in the news. On the somber side, foreign names of terrorism suspects often have multiple spellings in U.S. intelligence databases, a common error that makes it difficult for security officials to track potential troublemakers.

On a lighter note, Yiddish scholars last week griped that "knaidel," the winning word in last month's Scripps National Spelling Bee, should actually have been spelled "kneydl." The data quality issue stemmed from the fact that Scripps contestants, including the 13-year-old boy from Queens, New York, who won the event by spelling "knaidel," studied for the contest with Webster's Third New International Dictionary rather than, say, a Yiddish-English dictionary with an alternative spelling, The New York Times reported.

"Bad data is first and foremost a human phenomenon: false memory syndrome, typos, slips of the tongue, confirmation bias and too many others to list," Maguire wrote.

Big data also has the potential to expand bad data "shenanigans" by providing people with a much larger mass of data from which to cherry-pick morsels of information that justify their positions, added Maguire.

"Confirmation bias deserves special attention. Besides producing bad data -- as when researchers rationalize discarding inconvenient data points -- it can also yield dismissive responses to good data," he wrote via email. "Think of those who cannot or will not be dissuaded from believing that vaccines cause autism, or those who could not swallow Nate Silver's predictions about the 2012 presidential election. Most noteworthy about confirmation bias is the sincerity felt at the very moment it occurs."

Another big data factor that makes bad data trickier to limit: Enterprises often don't have control over the data sources they're analyzing, including social media feeds and data sets available from public repositories such as Data.gov.

One of the best things about the MIT CDOIQ Symposium, at least according to Maguire, is that it draws a self-selecting group of data quality professionals who are passionate about the topic.

Beyond the usual typos and misspellings that characterize bad data, attendees will discuss other measures of information quality, including data timeliness and appropriateness, Maguire said.

Enterprises can take steps to reduce (but not eradicate) bad data, such as implementing data governance guidelines and establishing programs to incentivize employees to value information quality, Maguire added.
