Novelty and Outlier Detection

In my last few articles, I've looked at a number of ways
machine learning can help make predictions. The basic idea is
that you create a model using existing data and then ask that model to
predict an outcome based on new data.

So, it's not surprising that one of the most amazing ways machine
learning is being applied is in predicting the future. Just a few days
before writing this piece, it was announced that machine learning
models actually might be able to predict earthquakes—a goal that
has eluded scientists for many years and that has the potential to
save thousands, and maybe even millions, of lives.

But as you've also seen, machine learning can be used to
"cluster" data—that is, to find patterns that humans might not see
on their own, and to put the data into various "clusters", or
machine-driven categories. By asking the computer to divide data into distinct
groups, you gain the opportunity to find and make use of previously
undetected patterns.

Just as clustering can be used to divide data into a number of
coherent groups, it also can be used to decide which data points
belong inside a group and which don't. In "novelty
detection", you
have a data set that contains only good data, and you're trying to
determine whether new observations fit within the existing data
set. In "outlier detection", the data may contain outliers,
which you
want to identify.
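
To make the distinction concrete, here's a minimal sketch using
scikit-learn's LocalOutlierFactor, which supports both modes; the
data here is made up purely for illustration:

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(42)
    clean = rng.normal(10, 2, size=(100, 1))  # training data, assumed outlier-free
    new = np.array([[10.5], [30.0]])          # one typical value, one extreme one

    # Novelty detection: train on clean data, then judge new observations
    model = LocalOutlierFactor(novelty=True)
    model.fit(clean)
    print(model.predict(new))   # 1 means "fits the data", -1 means "novelty"

    # Outlier detection: fit and label a data set that may contain outliers
    mixed = np.vstack([clean, [[30.0]]])
    print(LocalOutlierFactor().fit_predict(mixed))  # -1 marks suspected outliers

Note that with novelty=True, the model offers predict() for scoring
new data; without it, fit_predict() labels the training set itself.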

Where could such detection be useful? Consider just a few
questions you could answer with such a system:

Is there an unusual number of login attempts from a particular IP
address?

Are any customers buying more than the typical number of products
at a given hour?

Which homes are consuming above-average amounts of water during a
drought?

Which judges convict an unusual number of defendants?

Should a patient's blood tests be considered normal, or are there
outliers that require further checks and examinations?

In all of those cases, you could set thresholds for minimum and maximum
values and then tell the computer to use those thresholds in
determining what's suspicious. But machine learning turns that
around, letting the computer figure out what counts as "normal"
and then flag the anomalies for humans to investigate. This lets
people concentrate their energies on
understanding whether the outliers are indeed problematic, rather than
on identifying them in the first place.
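
For comparison, the hand-set threshold approach is easy to write by
hand; here's a tiny sketch with invented limits and data:

    # Fixed thresholds chosen by a human, rather than learned from the data
    MIN_OK, MAX_OK = 2, 20
    logins_per_hour = [3, 7, 5, 41, 6, 1, 8]   # made-up hourly login counts

    suspicious = [n for n in logins_per_hour if not MIN_OK <= n <= MAX_OK]
    print(suspicious)   # [41, 1]

The hard part, of course, is choosing those limits well—which is
precisely the part the machine-learning approach automates.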

So in this article, I look at a number of ways you can try to
identify outliers using the tools and libraries that Python provides
for working with data: NumPy, Pandas and scikit-learn. Just which
technique and tools will be appropriate for your data depends on what
you're doing, but the basic theory and practice presented here should
at least provide you with some food for thought.

Finding Anomalies

Humans are excellent at finding patterns, and they're also quite good at
finding things that don't fit a pattern. But what sort of algorithm
can look at a collection of data points and figure out which ones are
unlike the others?

One simple way to do this is to set a cutoff, often at one or two
standard deviations from the mean. For those of you without a background in
statistics (or who have forgotten what a "standard deviation" is),
it's a measurement of how spread out the data is. For example:
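
    import numpy as np

    # Seven instances of the number 10
    a = np.array([10, 10, 10, 10, 10, 10, 10])
    print(a.mean())   # 10.0
    print(a.std())    # 0.0 -- the values don't differ at all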

In the above example, I have a NumPy array containing seven instances
of the number ten. People often think of the mean as describing the data,
and it does, but it's only when combined with the standard deviation
that you can know how much the numbers differ from one another. In
this case, they're all identical, so the standard deviation is 0.

In the next example, the mean remains the same, but the standard deviation is
quite different:
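
    # Different (made-up) values with the same mean of 10
    b = np.array([5, 15, 0, 20, 10, 8, 12])
    print(b.mean())   # 10.0
    print(b.std())    # about 6.07 -- the values vary widely around the mean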

Here, the mean has not changed, but the standard deviation
has. You can see, from just those two numbers, that although the numbers
remain centered around 10, they also are spread out quite a bit.

One simple way to detect unusual data is to look for all of the values
that lie more than two standard deviations from the mean; in normally
distributed data, about 95% of the values fall within that range. (You can go further out if you
want; 99.73% of data points are within three standard deviations, and
99.994% are within four.) If you're looking for outliers in an
existing data set, you can do something like this:
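
    import numpy as np

    # Normally distributed sample data (made up for illustration)
    rng = np.random.default_rng(0)
    data = rng.normal(loc=10, scale=2, size=1000)

    mean = data.mean()
    std = data.std()

    # Keep only the values more than two standard deviations from the mean
    outliers = data[np.abs(data - mean) > 2 * std]
    print(outliers)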