Quality Data, a must-have for AI

With the advent of Artificial Intelligence (AI), we can analyse data in more depth than ever before.

AI techniques like Machine Learning (ML) can uncover deeper insights from datasets than traditional statistical techniques can. Big Data both requires and enables these new methods: we can now access large amounts of data through mass storage and high-performance computing.

The first two V’s of Big Data, Volume and Variety, have to be met in order to get Machine Learning working. For instance: with large amounts of data about visits to a web shop, you can classify visitors and predict how particular types of visitors will use the website. This way you can create product recommendations for your visitors, even for first-timers.
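
As a sketch of what such a classifier could look like, the snippet below trains a model on web-shop session data. The file name and feature columns are illustrative assumptions, not a real dataset or a production recommender.

    # A minimal sketch, assuming visitor sessions were exported to a CSV
    # with hypothetical columns; not a production recommender.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    sessions = pd.read_csv("webshop_sessions.csv")
    X = sessions[["pages_viewed", "seconds_on_site", "referrer_type"]]
    y = sessions["bought_category"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)

    # A first-time visitor has no purchase history, but the behavioural
    # features above are available from the very first session onwards.
    first_timer = pd.DataFrame([[5, 120, 1]], columns=X.columns)
    print(model.predict(first_timer))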

When using a relatively simple ML technique like decision trees, every decision node needs at least ten occurrences in the training and test data. With tens of thousands of decision nodes, this can easily require millions of data records to cover the complete model. Collecting these vast quantities of data can be challenging.
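
A back-of-envelope calculation makes the point; the node count below is an illustrative assumption.

    # Rough lower bound implied by the rule of thumb above.
    decision_nodes = 50_000       # tens of thousands of decision nodes
    min_per_node = 10             # at least ten occurrences per node

    lower_bound = decision_nodes * min_per_node
    print(f"At least {lower_bound:,} records")  # At least 500,000 records
    # Records spread unevenly over the tree, so covering every node in
    # practice pushes the requirement into the millions.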

Garbage In, Garbage Out

But to fully exploit the possibilities of Artificial Intelligence, and its promise of truthful and correct predictions and advice, we also need data of the right quality. The old saying about computing is still valid: “Garbage In, Garbage Out.” But why is this becoming more problematic with AI and Machine Learning? With traditional data analysis, when bad data is discovered in our dataset, we can exclude it and start over. This is cumbersome but manageable.

But such data cleaning cannot be done at scale. With Big Data and Machine Learning, bad data cannot be detected or pulled out of the system that easily. Artificial Intelligence techniques draw conclusions from large masses of data, which may or may not include garbage. At a certain moment it becomes impossible to determine which data elements a prediction is based on. In this way, Artificial Intelligence becomes a black-box technology: you don’t know where it draws its conclusions from. “Unlearning” something is nearly impossible: remove one part and the entire model ceases to work. Just like a brain! When bad data is detected, you are usually forced to restart the whole learning process from the beginning, which is time- and cost-intensive.

Questions to be asked

So how do we establish whether our data is of good quality? This is a matter of Veracity, another V of Big Data. Veracity refers to the trustworthiness of the data. Without digging too deep into the subject, there are some basic questions we can ask about the data we are going to use.

Why & Who: Data from a reputable source typically implies better accuracy than a random online poll. Data is sometimes collected, or even fabricated, to serve an agenda. We should establish the credibility of the data source and the purpose for which the data was collected. Dare to ask whether the data is biased because it was gathered to prove a political, business, ethnic or ideological point of view.

Where: Almost all data is geographically or culturally biased. Consumer data collected in the United States may not be representative of consumers in Asia, and the cultural differences within Asia are huge as well. Even when we measure data objectively, like temperatures, the interpretation of that data can differ: what counts as cold or warm? And of course, temperature readings from Paris are not very useful for weather predictions in Mumbai.

When: Validity is also one of the V’s of Big Data. Most data is linked to time in some way: it might be a time series, or a snapshot from a specific period. Out-of-date data should be omitted. But when an AI system is used over a longer time span, data can become old or obsolete while it is in use. “Machine unlearning” will then be needed to get rid of data that is no longer valid.
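
A minimal sketch of the crude form of such unlearning: drop stale records and retrain on what remains. The file name, column name and one-year validity window are hypothetical assumptions.

    # Drop out-of-date records before retraining; the model is then
    # rebuilt on the filtered set.
    from datetime import datetime, timedelta
    import pandas as pd

    records = pd.read_csv("training_data.csv", parse_dates=["collected_at"])
    cutoff = datetime.now() - timedelta(days=365)   # assumed validity window

    fresh = records[records["collected_at"] >= cutoff]
    print(f"Kept {len(fresh)} of {len(records)} records")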

How: It’s worth getting to know the gist of how the data of interest was collected. Domain knowledge is of the essence here. For instance, when collecting consumer data, we can fall back on the decades-old methods of market research. Answers to an ill-constructed questionnaire will certainly yield poor-quality data.

What: Ultimately, you want to know what your data is about, but before you can do that, you should know what surrounds the numbers. Humans can sometimes detect bad data or outliers because the values look illogical, and we should investigate where strange-looking data comes from. But Artificial Intelligence doesn’t have this form of common sense; it tends to take all data as true.
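
We can at least automate part of that common sense. The sketch below flags suspicious values with the standard interquartile-range rule; the file and column names are hypothetical.

    # Flag values outside 1.5 * IQR, the classic box-plot outlier rule.
    import pandas as pd

    data = pd.read_csv("measurements.csv")
    q1, q3 = data["temperature_c"].quantile([0.25, 0.75])
    iqr = q3 - q1

    outliers = data[(data["temperature_c"] < q1 - 1.5 * iqr) |
                    (data["temperature_c"] > q3 + 1.5 * iqr)]
    # Flagged rows are not automatically garbage: investigate where the
    # strange values come from before excluding them, as the model won't.
    print(outliers)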

Taking care of your data

In order to answer these questions in an orderly manner, you need to organize the research into the quality of your data. Data quality procedures should of course be in place and used, but more needs to be done. Establishing the veracity of data is part of the process of data (and content) curation.

“Curation is the end-to-end process of creating good data through the
identification and formation of resources with long-term value. (…) The goal of
data curation in the enterprise is twofold: to ensure compliance and that data
can be retrieved for future research or reuse.” (Mary Ann Richardson)

I strongly believe data curation should be expanded beyond the description above. Like a curator in a museum who establishes whether an exhibit is genuine or fake, a data curator should do the same for their data. This requires not only data analytics skills, but also domain expertise about the subject from which the data stems.

When you want to use data for Machine Learning and Artificial Intelligence, you have to go beyond the standard criteria of data quality. Yes, those criteria, like availability, usability and reliability, are still valid. But we should also take veracity into account: is the data truthful? And you need methods and roles to establish the truthfulness of your data.