Overfitting Machine Learning Experiments: When Cross-validation is No Silver Bullet

A few years ago, I attended a very good talk about identifying influencers in social media based on
textual features. To evaluate the results, the researchers employed
cross-validation,
a very popular technique in machine learning where the training set is
split into n parts (called folds). The machine
learning system is then trained and evaluated n times, each time training
on all the training data minus one fold and then evaluating on the
remaining fold. In that way it is possible to obtain evaluation results
over a set the same size as the training set without committing
the "mortal sin" of evaluating on the training data. The technique is very
useful and widely employed. However, it doesn't stop you from overfitting at
the methodological level: if you repeat multiple
experiments over the same data, you will gain enough insight into it to
"overfit" it. This methodological problem is quite common,
so I decided to write it down. It is also not very easy to spot, due to
the Warm Fuzzy Feeling (TM) that comes with using
cross-validation. That is, many times we as practitioners feel that by using
cross-validation we buy some magical insurance policy against
overfitting.
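
For concreteness, here is a minimal ten-fold cross-validation sketch in Python using scikit-learn. The classifier and the synthetic dataset are placeholders of my own choosing, not the system from the talk:

    # Minimal k-fold cross-validation sketch; the classifier and the
    # synthetic data are placeholders, not the system from the talk.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1000, random_state=0)
    clf = LogisticRegression(max_iter=1000)

    # Train on 9 folds, evaluate on the held-out fold, 10 times over.
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")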

Overfitting as it relates to the authors' work

After evaluating how their system performed with all its
components, the authors evaluated each of the components separately
and then assembled the "best" system. My issue is
with the claim that the performance results of this assembled "best"
system were not overfitted. Taking the performance of multiple
components and assembling the best combination is a type of
meta-learning. Even though the performance of each individual
component is cross-validated, and the selected components
are arguably the best for the task, the performance of this
combination of best components is overfit to the training data. To
obtain non-overfit numbers, you need either a two-level
cross-validation or an evaluation on a held-out set.

A two-level cross-validation goes like this: split the
training data into two parts (I choose two parts to maximize the amount
of training data in the remaining fold). On the first half, do a
ten-fold cross-validation training the sub-components and assemble the
best possible system. Then, on the second half, do a ten-fold
cross-validation training and testing the full system. Then repeat,
switching the halves. The final evaluation number is representative of
the performance of a "best" system. Interestingly, the best system
components chosen in the first half might not be the same as in the
second!
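
Here is a sketch of that protocol in Python with scikit-learn, under the assumption that the "components" being selected are candidate estimators; the candidates and the data are illustrative placeholders:

    # Two-level cross-validation sketch: component selection happens
    # only inside each outer half, and the "best" assembled system is
    # scored on the half it never saw during selection.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, random_state=0)
    candidates = [LogisticRegression(max_iter=1000),
                  DecisionTreeClassifier(random_state=0)]

    # Two outer folds: select on one half, evaluate on the other,
    # then switch the halves.
    outer = KFold(n_splits=2, shuffle=True, random_state=0)
    outer_scores = []
    for select_idx, eval_idx in outer.split(X):
        # First half: ten-fold CV to pick the "best" component.
        inner_scores = [
            cross_val_score(c, X[select_idx], y[select_idx], cv=10).mean()
            for c in candidates]
        best = candidates[int(np.argmax(inner_scores))]
        # Second half: ten-fold CV of the assembled system on unseen data.
        outer_scores.append(
            cross_val_score(best, X[eval_idx], y[eval_idx], cv=10).mean())

    print(f"non-overfit estimate: {np.mean(outer_scores):.3f}")

Note that the two outer iterations may pick different "best" components, which is exactly the point made above.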

At any rate, this is a very small point that does not invalidate
the work in that paper. It does, however, make for nice background
for the larger methodological point I discuss next.

An even less obvious overfitting peril

OK, while this meta-learning issue was tough to spot, there's
also a tricky overfitting problem that is even more common: when
developing a system using multiple features, it is easy to keep
evaluating using cross-validation and gain intuitions and insights
about the data to the point of, in practice, overfitting on
it.

Interestingly, the meta-learning issue discussed before is a
fitting metaphor for a scientist trying different features and feature
variations against the same dataset: a human-in-the-middle type of
meta-learning. The way to avoid it is to do this adaptation process on
a small development set before moving to a large-scale evaluation
(which can very well be done using cross-validation). I did that in my
thesis and it works well for
a build-once-and-test situation. For something with multiple
iterations (like the Watson
system), you'll need large amounts of data and a data release
protocol so you can ensure that new features and components are added
in a healthy, non-overfitting manner.
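
As a hedged sketch of the build-once-and-test setup, here is one way it could look with scikit-learn; the split size and model are assumptions for illustration, not a prescription from my thesis or from the Watson protocol:

    # Carve off a small development set for the human-in-the-middle
    # feature iteration, and only cross-validate on data you never
    # iterated against.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split

    X, y = make_classification(n_samples=5000, random_state=0)

    # Small dev set (here 10%, an arbitrary choice) for trying out
    # features and feature variations...
    X_main, X_dev, y_main, y_dev = train_test_split(
        X, y, test_size=0.1, random_state=0)

    # ...iterate on (X_dev, y_dev) as much as you like, then report
    # cross-validated numbers on the untouched portion once.
    clf = LogisticRegression(max_iter=1000)
    print(cross_val_score(clf, X_main, y_main, cv=10).mean())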