The good news is that this is simple—simple to understand, simple to
implement and simple to automate.

However, it's also too simple for most data. You're unlikely to be
looking at a single-dimensional vector. The baseline (mean) is likely
to shift over time. And besides, there must be other, better ways to
measure whether something is "inside" or
"outside", right?

Getting More Sophisticated

For real-world anomaly detection, you're going to need to improve on a
few fronts. You'll need to consider the data and determine what's
"in" and what's "out". You'll also need to figure out ways to evaluate
your model.

Let's consider novelty detection: there is initial data, and you want to
know if a new piece of data would fit inside the existing data or
if it would be considered an outlier. For example, consider a patient
who comes in with values from a blood test. Do those tests indicate
that the patient is normal, because the data's values are similar to
the ones you've already seen? Or are those new values statistical
outliers, indicating that the patient needs additional attention?

In order to experiment with novelty and outlier detection, I
downloaded historic precipitation data for an area of Pennsylvania
(Wyncote), just outside Philadelphia, for every day in
2016. Because I'm a scientific kind of guy, I downloaded the data in
metric units. The data came from the US government.
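Loading and preparing the data might look something like this. The column names (DATE, PRCP) follow the usual layout of US government weather CSV exports, but they're an assumption here, as is the inline sample standing in for the downloaded file:

```python
import pandas as pd
from io import StringIO

# A few sample rows standing in for the downloaded file; the real data
# has one row per day of 2016, with precipitation in mm. The DATE and
# PRCP column names are assumptions based on typical NOAA-style exports.
csv_data = StringIO("""DATE,PRCP
2016-01-01,0.0
2016-07-04,12.7
2016-12-25,3.3
""")

df = pd.read_csv(csv_data)

# Break the date apart: separate numeric columns are easier for a model
# to work with than a single date-time column.
df['DATE'] = pd.to_datetime(df['DATE'])
df['month'] = df['DATE'].dt.month
df['day'] = df['DATE'].dt.day

# Keep rainfall, month and day; the year is the same for every record,
# so it can't help as a predictor.
df = df[['PRCP', 'month', 'day']]
```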

Why would I break the date apart? Because it'll likely be easier for
models to work with three separate numeric columns, rather than a
single date-time column. Besides, having these columns as part of my
model will make it easier to understand whether snow in July is
abnormal. I ignore the year, since it's the same for every
record, which means that it can't help me as a predictor in this
model.

My data frame now contains 353 rows—I'm not sure why it's not
365—of data from 2016, with columns indicating the amount of rain (in
mm), the date and the month.

Based on this, how can you build a model to indicate whether rainfall
on a given day is normal or an outlier?

In scikit-learn, you always use the same pattern: you import the
estimator class, create an instance of that class and then fit the
model. In the case of supervised learning, "fitting" means teaching
the model which inputs go with which outputs. In the case of
unsupervised learning, which I'm doing here, you use "fit" with just
a set of inputs, allowing the model to distinguish between inliers and
outliers.
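That three-step pattern looks the same for every estimator. Here's a minimal sketch of the unsupervised case, using KMeans purely as a stand-in example:

```python
# Step 1: import the estimator class.
from sklearn.cluster import KMeans

# Unsupervised learning: inputs only, no output labels.
X = [[0.0], [0.1], [10.0], [10.2]]

# Step 2: create an instance of the class.
model = KMeans(n_clusters=2, n_init=10, random_state=0)

# Step 3: fit the model with just the inputs.
model.fit(X)
```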

Creating a Model

In the case of this data, there are several types of models that I can
build. I experimented a bit and found that the
IsolationForest
estimator gave me the best results. Here's how I create and train the
model:
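A minimal sketch follows; the small stand-in data frame (rainfall in mm, month, day) and the random_state setting are my assumptions, since the real frame has 353 rows and the article doesn't specify the estimator's parameters:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# A small stand-in for the real 353-row data frame, with the same
# columns: rainfall in mm, month and day.
df = pd.DataFrame({
    'PRCP':  [0.0, 12.7, 3.3, 0.0, 25.4],
    'month': [1, 7, 12, 3, 6],
    'day':   [1, 4, 25, 15, 20],
})

# Create the estimator and fit it on the inputs alone -- unsupervised,
# so there are no output labels to supply.
model = IsolationForest(random_state=0)
model.fit(df)
```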

The model now has been trained, so I can find out whether a given amount
of rain, on a certain month and day, is considered normal.
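Asking about a single new observation is just a one-row predict call. A sketch, again with made-up stand-in training data:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Stand-in training data with the article's columns (rain in mm, month, day).
df = pd.DataFrame({
    'PRCP':  [0.0, 1.2, 3.3, 0.0, 2.4, 0.5],
    'month': [1, 2, 3, 4, 5, 6],
    'day':   [1, 14, 25, 15, 20, 30],
})
model = IsolationForest(random_state=0).fit(df)

# Ask about a single hypothetical day: 50 mm of rain on July 1st.
# predict returns 1 for an inlier and -1 for an outlier.
query = pd.DataFrame({'PRCP': [50.0], 'month': [7], 'day': [1]})
result = model.predict(query)
```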

To try things out, I check the model against its own inputs:

>>> Series(model.predict(df)).value_counts()

In the above code, I run model.predict(df). This gives the inputs to
the model and asks it to predict whether these are normal, expected
values (indicated by 1) or outlier values (indicated by –1). By
turning the result into a Pandas series and then calling
value_counts,
I see:

 1    317
-1     36

Although the model marked 36 days as outliers, perhaps those days
really were unusual. The model certainly would be improved if it had
multiple years' worth of data, rather than just one year's worth.