2/14/2005

“Overfitting” is traditionally defined as training some flexible representation so that it memorizes the data but fails to predict well in the future. For this post, I will define overfitting more generally as over-representing the performance of systems. There are two styles of general overfitting: overrepresenting performance on particular datasets and (implicitly) overrepresenting performance of a method on future datasets.

We should all be aware of these methods, avoid them where possible, and take them into account otherwise. I have used “reproblem” and “old datasets”, and may have participated in “overfitting by review”—some of these are very difficult to avoid.

Name

Method

Explanation

Remedy

Traditional overfitting

Train a complex predictor on too-few examples.

Hold out pristine examples for testing.

Use a simpler predictor.

Get more training examples.

Integrate over many predictors.

Reject papers which do this.

Parameter tweak overfitting

Use a learning algorithm with many parameters. Choose the parameters based on the test set performance.

For example, choosing the features so as to optimize test set performance can achieve this.

Same as above.

Brittle measure

Use a measure of performance which is especially brittle to overfitting.

“entropy”, “mutual information”, and leave-one-out cross-validation are all surprisingly brittle. This is particularly severe when used in conjunction with another approach.

Prefer less brittle measures of performance.

Bad statistics

Misuse statistics to overstate confidences.

One common example is pretending that cross validation performance is drawn from an i.i.d. gaussian, then using standard confidence intervals. Cross validation errors are not independent. Another standard method is to make known-false assumptions about some system and then derive excessive confidence.

Don’t do this. Reject papers which do this.

Choice of measure

Choose the best of Accuracy, error rate, (A)ROC, F1, percent improvement on the previous best, percent improvement of error rate, etc.. for your method. For bonus points, use ambiguous graphs.

This is fairly common and tempting.

Use canonical performance measures. For example, the performance measure directly motivated by the problem.

Incomplete Prediction

Instead of (say) making a multiclass prediction, make a set of binary predictions, then compute the optimal multiclass prediction.

Sometimes it’s tempting to leave a gap filled in by a human when you don’t otherwise succeed.

Reject papers which do this.

Human-loop overfitting.

Use a human as part of a learning algorithm and don’t take into account overfitting by the entire human/computer interaction.

This is subtle and comes in many forms. One example is a human using a clustering algorithm (on training and test examples) to guide learning algorithm choice.

Chose to report results on some subset of datasets where your algorithm performs well.

The reason why we test on natural datasets is because we believe there is some structure captured by the past problems that helps on future problems. Data set selection subverts this and is very difficult to detect.

Use comparisons on standard datasets. Select datasets without using the test set. Good Contest performance can’t be faked this way.

Reprobleming

Alter the problem so that your performance improves.

For example, take a time series dataset and use cross validation. Or, ignore asymmetric false positive/false negative costs. This can be completely unintentional, for example when someone uses an ill-specified UCI dataset.

Discount papers which do this. Make sure problem specifications are clear.

Old datasets

Create an algorithm for the purpose of improving performance on old datasets.

After a dataset has been released, algorithms can be made to perform well on the dataset using a process of feedback design, indicating better performance than we might expect in the future. Some conferences have canonical datasets that have been used for a decade…

Prefer simplicity in algorithm design. Weight newer datasets higher in consideration. Making test examples not publicly available for datasets slows the feedback design process but does not eliminate it.

Overfitting by review

10 people submit a paper to a conference. The one with the best result is accepted.

This is a systemic problem which is very difficult to detect or eliminate. We want to prefer presentation of good results, but doing so can result in overfitting.

Be more pessimistic of confidence statements by papers at high rejection rate conferences.

Some people have advocated allowing the publishing of methods with poor performance. (I have doubts this would work.)

I have personally observed all of these methods in action, and there are doubtless others.

49 Comments to “Clever Methods of Overfitting”

Twice I’ve tried to realistically present the performance of the algorithm. Twice was my paper rejected because of “unfinished methods” or “disappointing results”. There’s a whole culture of “rounding-up”, and trying to do the evaluations fairly just gives you trouble. When fair evaluations get rejected and rounders-up pass through, what do you do?

On any given paper, there is an incentive to “cheat” with some of the above methods. This can be hard to resist when so much rides on a paper acceptance _and_ some of the above cheats are not easily detected. Nevertheless, it should be resisted because “cheating” of this sort inevitably fools you as well as others. Fooling yourself in research is a recipe for a career that goes nowhere. Your techniques simply won’t apply well to new problems, you won’t be able to tackle competitions, and ultimately you won’t even trust your own intuition, which is fatal in research.

My best advice for anonymous is to accept that life is difficult here. Spend extra time testing on many datasets rather than a few. Spend extra time thinking about what make a good algorithm, or not. Take the long view and note that, in the long run, the quantity of papers you write is not important, but rather their level of impact. Using a “cheat” very likely subverts long term impact.

How about an index of negative results in machine learning? There’s a Journal of Negative Results in other domains: Ecology & Evolutionary Biology, Biomedicine, and there is Journal of Articles in Support of the Null Hypothesis. A section on negative results in machine learning conferences? This kind of information is very useful in preventing people from taking pathways that lead nowhere: if one wants to classify an algorithm into good/bad, one certainly benefits from unexpectedly bad examples too, not just unexpectedly good examples.

I visited the workshop on negative results at NIPS 2002. My impression was that it did not work well.

The difficulty with negative results in machine learning is that they are too easy. For example, there are a plethora of ways to say that “learning is impossible (in the worst case)”. On the applied side, it’s still common for learning algorithms to not work on simple-seeming problems. In this situation, positive results (this works) are generally more valuable than negative results (this doesn’t work).

This discussion reminds of some interesting research on “anti-learning“, by Adam Kowalczyk. This research studies (empirically and theoretically) machine learning algorithms that yield good performance on the training set but worse than random performance on the independent test set.

Hmm, rereading this post. What do you mean by “brittle”? Why is mutual information brittle?

Standard deviation of loss across the CV folds is not a bad summary of variation in CV performance. I’m not sure one can just reject a paper where the authors bothered to disclose the variation, rather than just plopping out the average. Standard error carries some Gaussian assumptions, but it is still a valid summary. The distribution of loss is sometimes quite close to being Gaussian, too.

As for significance, I came up with the notion of CV-values that measure how often method A is better than method B in a randomly chosen fold of cross-validation replicated very many times.

What I mean by brittle: Suppose you have a box which takes some feature values as input and predicts some probability of label 1 as output. You are not allowed to open this box or determine how it works other than by this process of giving it inputs and observing outputs.

Let x be an input.
Let y be an output.
Assume (x,y) are drawn from a fixed but unknown distribution D.
Let p(x) be a prediction.

For classification error I(|y – p(x)| < 0.5) you can prove a theorem of the rough form:
forall D, with high probability over the draw of m examples independently from D,
expected classification error rate of the box with respect to D is bounded by a function of the observations.

What I mean by “brittle” is that no statement of this sort can be made for any unbounded loss (including log-loss which is integral to mutual information and entropy). You can of course open up the box and analyze its structure or make extra assumptions about D to get a similar but inherently more limited analysis.

The situation with leave-one-out cross validation is not so bad, but it’s still pretty bad. In particular, there exists a very simple learning algorithm/problem pair with the property that the leave-one-out estimate has the variance and deviations of a single coin flip. Yoshua Bengio and Yves Grandvalet in fact proved that there is no unbiased estimator of variance. The paper that I pointed to above shows that for K-fold cross validation on m examples, all moments of the deviations might only be as good as on a test set of size $m/K$.

I’m not sure what a ‘valid summary’ is, but leave-one-out cross validation can not provide results I trust, because I know how to break it.

I have personally observed people using leave-one-out cross validation with feature selection to quickly achieve a severe overfit.

Thanks for the explanation of brittleness! This is a problem with log-loss, but I’d say that it is not a problem with mutual information. Mutual information has well-defined upper bounds. For log-loss, you can put a bound into effect by mixing the prediction with a uniform distribution over y, bounding the maximum log-loss in a way that’s analogous to the Laplace probability estimate. While I agree that unmixed log-loss is brittle, I find classification accuracy noisy.

A reasonable compromise is Brier score. It’s a proper loss function (so it makes good probabilistic sense), and it’s a generalization of classification error where the Brier score of a non-probabilistic classifier equals its classification error, but a probabilistic classifier can benefit from distributing the odds. So, the result you mention holds also for Brier score.

If I perform 2-replicated 5-fold CV of the NBC performance on the Pima indians dataset, I get the following [0.76 0.75 0.87 0.76 0.74 0.77 0.79 0.72 0.78 0.82 0.81 0.79 0.73 0.74 0.82 0.79 0.74 0.77 0.83 0.75 0.79 0.73 0.79 0.80 0.76]. Of course, I can plop out the average of 0.78. But it is nicer to say that the standard deviation is 0.04, and summarize the result as 0.78 +- 0.04. The performance estimate is a random quantity too. In fact, if you perform many replications of cross-validation, the classification accuracy will have a Gaussian-like shape too (a bit skewed, though).

I too recommend against LOO, for the simple reason that the above empirical summaries are often awfully strange.

Very very interesting. However, I still feel (but would love to be convinced otherwise) that when the dataset is small and no additional data can be obtained, LOO-CV is the best among the (admittedly non-ideal) choices. What do you suggest as a practical alternative for a small dataset?

I’m not convinced by your observation about people using LOO-CV with feature selection to overfit. Isn’t this just a problem with reusing the same validation set multiple times? Even if I use a completely separately drawn validation set, which Bengio and Grandvalet show yield an unbiased estimtae of the variance of the prediction error, I can still easily overfit the validation set when doing feature selection, right?

This is my first post on your blog. Thanks so much for putting it up — a very nice resource!

Aleks’s technique for bounding log loss by wrapping the box in a system that mixes with the uniform distribution has a problem: it introduces perverse incentives for the box. One reason why people consider log loss is that the optimal prediction is the probability. When we mix with the uniform distribution, this no longer becomes true. Mixing with the uniform distribution shifts all probabilistic estimates towards 0.5, which means that if the box wants to minimize log loss, it should make an estimate p such that after mixing, you get the actual probability.

David McAllester advocates truncation as a solution to the unboundedness. This has the advantage that it doesn’t create perverse incentives over all nonextreme probabilities.

Even when we swallow the issues of bounding log loss, rates of convergence are typically slower than for classification, essentially because the dynamic range of the loss is larger. Thus, we can expect log loss estimates to be more “noisy”.

Before trusing mutual information, etc…, I want to see rate of convergence bounds of the form I mentioned above.

I’m not sure what Brier score is precisely, but justing using L(p,y)=(p-y)^2 has all the properties mentioned.

I consider reporting standard deviation of cross validation to be problematic. The basic reason is that it’s unclear what I’m supposed to learn. If it has a small deviation, this does not mean that I can expect the future error rate on i.i.d. samples to be within the range of the +/-. It does not mean that if I cut the data in another way (and the data is i.i.d.), I can expect to get results in the same range. There are specific simple counterexamples to each of these intuitions. So, while reporting the range of results you see may be a ‘summary’, it does not seem to contain much useful information for developing confidence in the results.

One semi-reasonable alternative is to report the confidence interval for a Binomial with m/K coin flips, which fits intuition (1), for the classifier formed by drawing randomly from the set of cross-validated classifiers. This won’t leave many people happy, because the intervals become much broader.

The notion that cross validation errors are “gaussian-like” is also false in general, on two counts:

The distribution being drawn from is not a gaussian. (Actually, it’s a Binomial, assuming i.i.d. samples.)

The draws are not independent. (as discussed in the papers above)

This is an important issue because it’s not always obvious from experimental results (and intuitions derived from experimental results) whether the approach works. The math says that if you rely on leave-one-out cross-validation in particular you’ll end up with bad inuitions about future performance. You may not encounter this problem on some problems, but the monsters are out there.

For rif’s questions — keep in mind that I’m only really considering methods of developing confidence here. I’m ok with people using whatever ugly nasty hacks they want in producing a good predictor. You are correct about the feature selection example being about using the same validation set multiple times. (Bad!) The use of leave-one-out simply aggravated the effect of this with respect to using a holdout set because it’s easier to achieve large deviations from the expectation on a leave-one-out estimate than on a holdout set.

Developing good confidence on a small dataset is a hard problem. The simplest solution is to accept the need for a test set even though you have few examples. In this case, it might be worthwhile to compute very exact confidence intervals (code here). Doing K-fold cross validation on m examples and using confidence intervals for m/K coin flips is better, but by an unknown (and variable) amount. The theory approach, which has never yet worked well, is to very carefully use the examples for both purposes. A blend of these two approaches can be helpful, but the computation is a bit rough. I’m currently working with Matti KÃ¤Ã¤riÃ¤inen on seeing how well the progressive validation approach can be beat into shape.

And of course we should remember that all of this is only meaningful when the data is i.i.d, which it often clearly is not.

I think we have a case where the assumptions of applied machine learners differ from the assumptions of the theoretical machine learners. Let’s hash it out!

==

* (Half-)Brier score is 0.5(p-y)^2, where p and y are vectors of probabilities (p-predicted, y-observed).

* A side consequence of mixing is also truncation; but mixing is smooth, whereas truncation results in discontinuities of the gradient. There is a good justification for mixing: if you see that you misclassify in 10% of the cases on the unseen test data, you can anticipate similar error in the future, and calibrate the predictions by mixing with the uniform distribution.

* Standard deviation of the CV results is a foundation for bias/variance decomposition and a tremendous amount of work in applied statistics and machine learning. I wouldn’t toss it away so lightly, and especially not based on the argument of non-independence of folds. The purpose of non-independence of folds in the first place is that you get a better estimate of the distribution over all the training/test splits of a fixed proportion (one could say that the split is chosen by i.i.d., not the instances). You get a better estimate with 10-fold CV than by picking 10 train/test splits by random.

* Both binomial and Gaussian model of the error distribution are just models. Neither of them is ‘true’, but they are based on slightly different assumptions. I generally look at the histogram and eyeball it for gaussianity, as I have done in my example. The fact that it is a skewed distribution (with the truncated hump at ~85%) empirically invalidates the binomial error model too. One can compute the first two moments as a “finite” summary as an informative summary even if the underlying distribution has more of them.

I am not advocating ‘tossing’ cross-validation. I am saying that caution should be exercised in trusting it.

Do you have a URL for this other analysis?

You are right to be skeptical about models, but the ordering of skepticism seems important. Models which make more assumptions (and in particular which makes assumptions that are clearly false) should be viewed with more skepticism.

What is standard deviation of cross validation errors is supposed to describe? I listed and dismissed a couple possibilities, so now I’m left without an understanding.

I’d like to follow up a bit on your comment that “It’s easier to achieve large deviations from the expectation on a leave-one-out estimate than on a holdout set.” I was not familiar with this fact. Could you discuss this in more detail, or provide a reference that would help me follow this up? Quite interesting.

I didn’t mean to imply that you’d disagree with cross-validation in general. The issue at hand is whether the standard deviation of CV errors is useful or not. I can see two reasons for why one can be unhappy about it:

a) It can happen that you get accuracy of 0.99 +- 0.03. What could that mean? The standard deviation is a summary. If you provide a summary consisting of the first two moments, it does not mean that you believe in the Gaussian model – of course those statistics are not sufficient. It is a summary that roughly describes the variance of the classifier, inasmuch that the mean accuracy indicates its bias.

b) The instances in a training and test set are not i.i.d. Yes, but the above summary relates to the question: “Given a randomly chosen training/test 9:1 split of instances, what can we say about the classifier’s accuracy on the test set?” This is a different question than “Given a randomly chosen instance, what will be the classifier’s expected accuracy?”

Several people have a problem with b) and use bootstrap instead of cross-validation in bias/variance analysis. Still, I don’t see a problem with the formulation, if one doesn’t attempt to perceive CV as an approximation to making statements about i.i.d. samples.

Aleks, I regard the 0.99 +/- 0.3 issue as a symptom that the wrong statistics are being used (i.e. assuming gaussianity on obviously non-gaussian draws).

I’m not particularly interested in â€œGiven a randomly chosen training/test 9:1 split of instances, what can we say about the classifierâ€™s accuracy on the test set?â€ because I generally think the goal of learning is doing well on future examples. Why should I care about this?

Reporting 0.99 +- 0.03 does not imply that one who wrote it believes that the distribution is Gaussian. Would you argue that reporting 0.99 +- 0.03 is worse than just reporting 0.99? Anyone surely knows that the classification accuracy cannot be more than 1.0, it would be most arrogant to assume such ignorance.

CV is the de facto standard method of evaluating classifiers, and many people trust the results that come out of this. Even if I might not like this approach, it is a standard, it’s an experimental bottom line. “Future examples” are something you don’t have, something you can only make assumptions about. Cross-validation and learning curves employ the training data as to empirically demonstrate the stability and convergence of the learning algorithm on what effectively *is* future data for the algorithm, under the weak assumption of permutability of the training data. Permutability is a weaker assumption than iid. My main problem with most applications of CV is that people don’t replicate the cross-validation on multiple assignments to folds, something that’s been pointed out quite nicely by, e.g.,Estimating Replicability of Classifier Learning Experiments. ICML, 2004.

The problem with LOO is that you *cannot* perform multiple replications.

If your assumptions grow from iid, you shouldn’t use cross-validation, it’s a) not solving your problem, and b) you could get better results with an evaluation method that assumes more. It is unfair to criticize CV on these grounds. One can grow a whole different breed of statistics based on permutability and training/test splitting.

Reporting 0.99 +- 0.03 does mean that the inappropriate statistics are being used.

I am not trying to claim anything about the belief of the person making the application (and certainly not trying to be arrogant).

I have a problem with reporting the +/- 0.03. It seems that it has no interesting interpretation, and the obvious statistical interpretation is simply wrong.

The standard statistical “meaning” of 0.99 +- 0.03 is a confidence interval about an observation. A confidence interval [lower_bound(observation), upper_bound(observation)] has the property that, subject to your assumptions, it will contain the true value of some parameter with high probability over the random draw of the observation. The parameter I care about is the accuracy, the probability that the classifier is correct. Since the true error rate can not go above 1, this confidence interval must be constructed with respect to the wrong assumptions about the observation generating process. This isn’t that damning though – what’s really hard to swallow is that this method routinely results in intervals which are much narrower than the standard statistical interpretation would suggest. In other words, it generates overconfidence.

> Would you argue that reporting 0.99 +- 0.03 is worse than just reporting 0.99?

Absolutely. 0.99 can be interpreted as an unbiased monte carlo estimate of the “true” accuracy. I do not have an interpretation of 0.03, and the obvious interpretations are misleading due to nongaussianity and nonindependence in the basic process. Using this obvious interpretation routinely leads to overconfidence which is what this post was about.

I don’t regard the distinction between “permutable” and “independent” as significant here, because DeFinetti’s theorem says that all exchangeable (i.e. permutable) sequences can be thought of as i.i.d. samples conditioned on the draw of a hidden random variable. We do not care what the value of this hidden random variable is because a good confidence interval for accuracy works no matter what the datageneration process is. Consequently, the ‘different breed’ you speak of will end up being the same breed.

Many people use cross validation in a way that I don’t disagree with. For example, tuning parameters might be reasonable. I don’t even have a problem with using cross validation error to report performance (except when this creates a subtle instance of “reproblem”). What seems unreasonable is making confidence interval-like statements subject to known-wrong assumptions. This seems especially unreasonable when there are simple alternatives which don’t make known-wrong assumptions.

I think you are correct: many other people (I would not say it’s quite “the” standard) try to compute (and report) confidence interval-like summaries. I think it’s harmful to do so because of the routine overconfidence this creates.

rif — Another reason LOO CV is bad because it asymptotically suboptimal. For example if you use Leave One Out cross-validation for feature selection, you might end up selecting suboptimal subset, even with infinite training sample. Te neural-nets FAQ talks about it: http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html

Experimentally, Ronny Kohavi and Breiman found independently that 10 is the best number of folds for CV.

The FAQ says “cross-validation is markedly superior [to split sample validation] for small data sets; this fact is demonstrated dramatically by Goutte (1997)”. (google scholar has the paper), but I’m not sure their conclusions extend beyond their Gaussian synthetic data.

I agree with you regarding the inappropriateness of +- notation, and I also agree about general overconfidence of confidence intervals. Over here it says: “LTCMâ€™s loss in August 1998 was a -10.5 sigma-event on the firmâ€™s risk model, and a -14 sigma-event in terms of the actual previous price movements. Sometimes overfitting is very expensive LTCM “lost” quite a few hundred million US$ (“lost” — financial transactions are largely a zero-sum game).

What if I’d had written 0.99(0.03), without implying that 0.03 is a confidence interval (because it is not)? It is quite rare in statistics to provide confidence intervals – usually one provides either the standard deviation of the distribution or the standard error of the estimate of the mean. Still, I consider the 0.03 a very useful piece of information, and I’m grateful to any author that is dilligent enough to provide some information about the variation in the performance. I’d reject a paper that only provides the mean for a small dataset, or didn’t perform multiply replicated experiments.

As much as I’m concerned this is The Right Way of dealing with confidence intervals of cross-validated loss is to perform multiple replications of cross-validation, and provide the scores at appropriate percentiles. My level of agreement with the binomial model is about at the same level as your agreement with the Gaussian model. Probability of error is meaningless: there are instances that you can almost certainly predict right, there are instances that you usually misclassify, and there are boundary instances where the predictions of the classifier vary, depending on the properties of the split. Treating all these groups as one would be misleading.

Regarding de Finetti, one has to be careful: there is a difference between finite and infinite exchangeability. The theorem goes from *infinite* exchangeability to iid. When you have an infinite set, there is no difference between forming a finite sample by sampling-with-replacement (bootstrap) versus sampling-without-replacement (cross-validation). When you have a finite set to sample from, it’s two different breeds.

As for assumptions, they are all wrong… But some are more agreeable than others.

0.99(0.03) is somewhat better, but I suspect people still interpret it as a confidence interval, even when you explicitly state that it is not.

Another problem is that I still don’t know why it’s interesting. You assert it’s very interesting, but can you explain why? How do you use it? Saying 0.99(0.03) seems semantically equivalent to saying “I achieved test set performance of 0.99 with variation 0.03 across all problems on the UCI database”, except not nearly as reassuring because the cross-validation folds do not encompass as much variation across real-world problems.

On Binomial vs. Gaussian model: the Binomial model (at least) has the advantage that it is not trivially disprovable.

On probability of error: it’s easy to criticize any small piece of information as incomplete. Nevertheless, we like small pieces of information because we can better understand and use them. “How often should I expect the classifier to be wrong in the future” seems like an interesting (if incomplete) piece of information to me. A more practical problem with your objection is that distinguishing between “always right”, “always wrong” and “sometimes right” examples is much harder, requiring more assumptions, than distinguishing error rate. Hence, such judgements will be more often wrong.

I had assumed you were interested in infinite exchangeability because we are generally interested in what the data tells us about future (not yet seen) events. Analysis which is only meaningful with respect to known labeled examples simply doesn’t interest me, in the same way that training error rate doesn’t interest me.

No, 0.99(0.03) means 0.99 classification error across 90:10 training-test splits on a single data set. It is quite meaningless to try to assume any kind of average classification error across different data sets.

Regarding probability of error, if it’s easy to acquire this kind of information, why not do it?

Infinite exchangeability does not apply to a finite population. What do you do when I gather *all* the 25 cows from the farm and measure them? You cannot pretend that there are infinitely many cows in the farm. You can, however, wonder about the number of cows (2,5, 10, 25?) you really need to measure to characterize all the 25 with reasonable precision.

I maintain that future is unknowable. Any kind of a statement regarding the performance of a particular classifier trained from data should always be seen as relative to the data set.

This still isn’t answering my question: Why is 0.03 useful? I can imagine using an error rate in decision making. I can imagine using a confidence interval on the error rate in decision making. But, I do not know how to use 0.03 in any useful way.

Your comment on exchangeability makes more sense now. In this situation, what happens is that (basically) you trade using a Binomial distribution for a Hypergeometric distribution to analyze the number of errors on the portion of the set you haven’t seen. The trade Binomial->Hypergeometric doesn’t alter intuitions very much because the distributions are fairly similar (Binomial is a particlular limit of the Hypergeometric, etc…)

This still isn’t the answer I want. How is 0.03 useful? How do you use it?

The meaning of “stability” here seems odd. It seems to imply nothing about how the algorithm would perform for new problems or even for a new draw of the process generating the current training examples. Why do we care about this very limited notion of stability?

If you don’t mind a somewhat philosophical argument, examine the Figure 5 in Modelling Modelled. The NBC becomes highly stable beyond 150 instances. On the other hand, C4.5 has a higher average utility, but also a greater variation in its utility on the test set. Is it meaningful to compare both methods when the training set consists of ~100 instances? The difference in expected utility is negligible in comparison to the total amount of variation in performance.

(0.03) indicates how much the classification accuracy is affected by the choice of the training data across the experiments. It quantifies the variance of the learned model. It describes that the estimate of classification accuracy across test sets of a certain size is not a number, it is a distribution.

I get my distribution of expected classification accuracy through sampling, and the only assumption is the fixed choice of the relative size of the training and test set. The purpose of (0.03) is to stress that the classification accuracy estimate depends on the assignment of instances to training or test set. You get your confidence interval starting from an arbitrary point estimate “0.99” along with a very strong binomial assumption, one that is invalidated by the above sampling experiments. It’s a simple answer alright, but a very dubious set of assumptions.

By now, I’ve listed sufficiently many papers that attempt to justify the bias/variance problem, and the purpose of (0.03) should be apparent in the context of this problem. Do you have a good reason for disagreeing with with the whole issue of bias/variance decomposition?

I know what (0.03) indicates, but this still doesn’t answer my question. How do we _use_ it? How is this information supposed to affect the choices that we make? The central question is whether or not (0.03) is relevant to decision making, and I don’t yet see that relevance.

“Binomial distribution” is not the assumption. Instead, it is the implication. The assumption is iid samples. This assumption is not always true, but none of the experiments in the ‘modelling modeled’ reference seem to be the sort which disprove the correctness of the assumption. In particular, cutting up the data in several different ways and learning different classifiers with different observed test error rates cannot disprove the independence assumption.

This reminds me of Lance’s post on calibrating weather prediction numbers. The weatherman tells us that (subjective) probability of rain tomorrow is 0.8 How do (should) we use that? Now suppose we know something about the prior he used to come up with the 0.8 estimate. Does that change the way we use the number?

Re: Yaroslav – Yes, if the prior doesn’t match our own prior, we can squeeze out the update and update *our* prior.

Re: John – If you accept the bias/variance issue, then (0.03) is interesting therefore intrinsically useful I guess you don’t buy this. It concerns the estimation of risk, second-order probability (probability-of-probability), etc. The issue is that you cannot characterize the error rate reliably, and must therefore use a probability distribution. This is the same pattern as with introducing error rate because you cannot say whether a a classifier is always correct or always wrong.

A more practical utility is comparing two classifiers in two cases. In one case, the classifier A gets the classification accuracy of 0.88(0.31) and B gets 0.90(0.40). What probability would you assign to the statement “A is better than B?” in the absence of any other information? Now consider another experiment, where you get 0.88(0.01) for A and 0.90(0.01) for B.

a) I am generally inclined to avoid model selection because it is a source of overfitting. I would generally rather make a weighted integration of predictions. If pressed for computational reasons, I might choose the classifier with the smallest cross validation or validation set error rate.

I still don’t understand why you want to assign a probability.

b) I don’t understand your response. You give examples of 0.88(0.01) and 0.90(0.01). How do you use the 0.01 to decide?

c) I agree with your definition of better, as long as the test set is not involved in the cross validation process.

Interesting! Now I understand: all the stuff I’ve been talking about in this thread is very much about the tools and tricks in order to do model selection. But you dislike model selection, so obviously these tools and tricks may indeed seem useless.

a) If you have to make a choice, how easy is it for you to then state that A is better than B? It’s very rare that A would always be better than B. Instead, it may usually be better. Probability captures the uncertainty inherent to making such a choice. The probability of 0.9 means that in 90% of the test batches, A will be better.

b) With A:0.88(0.01) vs B:0.90(0.01), B will almost always be better than A. With A:0.88(0.1) vs B:0.90(0.1), we can’t really say which one will be better, and a choice could be arbitrary.

c) OK, but assume you have a certain batch of the data. That’s all you have. What do you do? Create a single test/train split, or create a bunch of them and ‘integrate out’ the dependence of your estimate on the particular choice?

Regarding the purpose of model selection. I’m sometimes working with experts, e.g. MD’s, who gathered the data and want to see the model. I train SVM, I train classification trees, I train NBC, I train many other things. Eventually, I would like to give them a single nicely presented model. They cannot evaluate or teach this ensemble of models. They won’t get insights from an overly complex model, they need something simpler, something they can teach/give to their ambulance staff to make decisions. So the nitty-gritty reality of practical machine learning has quite an explicit model complexity cost.

And one way of dealing with model complexity is model selection. It’s cold and brutal, but it gets the job done. The above probability is a way of quantifying how unjustified or arbitrary it is in a particular case. If it’s too brutal and if the models are making independent errors, then one can think about how to approximate or present the ensemble. Of course, I’d want to hand the experts the full Bayesian posterior, but how do I print it out on an A4 sheet of paper so that the expert can compare it to her intuition and experience?

Of course, I’m not saying that everyone should be concerned about model complexity and presentability. I am just trying to justify its importance to applied data analysis.

I understand that some form of predictor simplification/model selection is sometimes necessary.

a) I still don’t understand why you want to assign a probability to one being better than another. If we accept that model selection/simplification must be done, then it seems like you must make a hard choice. Why are probabilities required?

b) The reasoning about B and A does not hold on future data in general (and I am not interested in examples where we have already measured the label). In particular, I can give you learning algorithm/problem pairs in which there is a very good chance you will observe something which looks like a significant difference over cross validation folds, but which is not significant. The extreme example mentioned in this post shows you can get 1.00(0.00) and 0.00(0.00) for two algorithms producing classifiers with the same error rate.

c) If I thought there was any chance of a time ordering in the data, I would using a single train/test split with later things in the test set. I might also be tempted to play with “progressive validation” (although that’s much less standard). If there was obviously no time dependence, I might use k-fold cross validation (with _small_ k) and consider the average error rate a reasonable predictor of future performance. If I wanted to know roughly how well I might reasonably expect to do in the future and thought the data was i.i.d. (or effectively so), I would use the test set bound.

a) I consider 10-fold cross-validation to be a series of 10 experiments. For each of these experiments, we obtain a particular error rate. For a particular experiment, A might be better than B, but for a different experiment B would be better than A. Both probability and the standard deviations are ways of modelling the uncertainty that comes with this. If I cannot make a sure choice, and if modelling uncertainty is not too expensive, why not model it?

b) Any fixed method can be defeated by an adaptive adversary. I’m looking for a sensible evaluation protocol that will discount both overfitting and underfitting, and I realize that nothing is perfect.

c) I agree with your suggestions, especially with the choice of a small ‘k’. Still, I would stress that cross-validation is to be replicated multiple times, with several different permutations of the fold-assignment vector. Otherwise, the results are excessively dependent on a particular assignment to folds. If something affects your results, and if you are unsure about it, then you should not keep it fixed, but vary it.

a) I consider the notion that 10-fold cross validation is 10 experiments very misleading, because there can exist very strong dependencies between the 10 “experiments”. It’s like computing the average and standard deviations of the wheel locations of race car #1 and race car #2. These simply aren’t independent, and so the amount of evidence they provide towards “race car #1 is better than race car #2″ is essentially the same as the amount of evidence given by “race car #1 is in front of race car #2″.

b) Pleading “but nothing works in general” is not convincing to me. In the extreme, this argument can be used to justify anything. There are some things which are more robust than other things, and it seems obvious that we should prefer the more robust things. If you use confidence intervals, this nasty example will not result in nonsense numbers, as it does with the empirical variance approach.

You may try to counterclaim that there are examples where confidence intervals fail, but the empirical variance approach works. If so, state them. If not, the confidence interval approach at least provides something reasonable subject to a fairly intuitive assumption. No such statement holds for the empirical variance approach.

I agree about b), but continue to disagree about a). The argument behind it is somewhat intricate. We’re estimating something random with a non-random set of experiments. Let me pose a small problem/analogy: if you wanted to use monte carlo sampling to estimate the area of a certain shape in 2D, but you can only take 10 samples, would you draw these samples purely at random? You would not, because you would risk the chance that you’d sample the same point twice, and would gain no information. Cross-validation is a bit like that: it tries to diversify the samples in order to get a better estimate with fewer samples. Does it make sense?

Back to this tar baby I understand your concern, but it is inherent to *sampling without replacement* of instances as contrasted to *sampling with replacement* of instances. I was not arguing bootstrap or iid versus training/test split or cross-validation. I was arguing for cross-validation compared to random splitting into the training and test set.

It’s quite clear that i.i.d. is often incompatible with sampling without replacement, and I can demonstrate this experimentally. In some cases, i.i.d. is appropriate (large populations, random sampling), and in other cases splitting is appropriate (finite populations, exhaustive or stratified sampling). These two stances should be kept apart and not mixed, as seems to be the fashion. What should be a challenge is to study learning in the latter case.

Assuming m independent samples, what we know (detailed here) is that K-fold cross validation has a smaller variance, skew, or other higher order moment then a random train/test split with the test set of size m/K. We do not and cannot (fully) know how much smaller this variance is. There exist examples where K-fold cross validation has the same behavior as a random train/test split.

If you want to argue that cross-validation is a good idea because it removes variance, I can understand that. If you want to argue that the individual runs with different held out folds are experiments, I disagree. This really is like averaging the position of wheels on a race car. It reduces variance (i.e. doesn’t let a race car with a missing wheel win), but it is still only one experiment (i.e. one race). If you want more experiments, you should not share examples between runs of the learning algorithm.

Incompatible means that assuming i.i.d. within the classifier will penalize you if the classifier is evaluated using cross-validation: the classifier is not as confident as it can afford to be. I’m not arguing that CV is better, I’m just arguing that it’s different. I try to be agnostic with respect to evaluation protocols, and adapt to the problem at hand. CV tests some things, bootstrap other things, each method has its pathologies, but advocating a single individual train/test split is complete rubbish unless you’re in highly cost-constrained adversial situation.

But now I’ll play the devil’s advocate again. Assume that I’m training on 10% and testing on 90% of data in “-10″-fold CV. Yes, the experiments are not independent. Why should they be? Why shouldn’t I exhaustively test all the tires of the car in four *dependent* experiments? Why shouldn’t I test the blood pressure of every patient just once, even if this makes my experiments dependent? Why shouldn’t I hold out for validation each and every choice of 10% of instances? Why is having this kind of dependence any less silly than sampling the *same* tire multiple times in order to keep the samplings “independent”? Would it be less silly than sampling just one tire and compute a bound based on that single measurement, as any additional measure could be dependent? Why is using a Gaussian to model the heights of *all* the players in a basketball team silly, even if the samples are not independent?

The notion that “advocating a single individual train/test split is complete rubish except in a cost constrained adversarial situation” is rubbish. As an example, suppose you have data from wall street and are trying to predict stock performance. This data is cheap and plentiful, but the notion of using cross validation is simply insane due to the “survivor effect”: future nonzero stock price is a strong and unfair predictor of past stock price. If you try to use cross validation, you will simply solve the wrong problem.

What’s happening here is that cross validation relies upon identicality of data in a far more essential manner than just having a training set and a test set. It is essential to understand this in considering methods for looking at your performance.

For your second point, I agree with the idea of reducing variance via cross validation (see second paragraph of comment 42) when the data is IID. What I disagree with is making confidence interval-like statements about the error rate based upon these nonindependent tests. If you want to know that one race car is better than another, you run them both on different tracks and observe the outcome. You don’t average over their wheel positions in one race and pretend that each wheel position represents a different race.

Well, of course neither cross-validation nor bootstrap makes sense when the assumption of instance exchangeability is clearly not justified. It was very funny to see R. Kalman make this mistake in http://www.pnas.org/cgi/content/abstract/101/38/13709/ – a journalist noticed this and wrote a pretty devastating paper on why peer review is important. My comment on “rubbish” was in the context of the validity of instance exchangeability, of course.

Regarding your note on “reducing variance”: I believe that you’re trying to find some benefit of cross-validation in the context of IID. Although you might do that, the crux of my message is that finite exchangeability (FEX) exercised by CV is different from infinite exchangeability (iid) exercised by bootstrap. Finite exchangeability has value on its own, not just as an approximation to infinite exchangeability. In fact, I’d consider finite exchangeability as primary, and infinite exchangeability as n approximation to it. I guess that your definition of confidence interval is based upon IID, so if I do “confidence intervals” based on FEX, it may look wrong.

I hope that I understand you correctly. What I’m suggesting is to allow for and appreciate the assumption of finite exchangeability, and build theory that accomodates for it. Until then, it would be unfair to dismiss empirical work assuming FEX in some places just because most theory work assumes IID.

I’ve worked on FEX confidence intervals here. The details change, but not the basic message w.r.t. the IID assumption.

The basic issue we seem to be debating, regardless of assumptions about the world, is whether we should think of the different runs of cross validation as “different” experiments. I know of no reasonable assumption under which the answer is “yes” and many reasonable assumptions under which the answer is “no”. For this conversation to be further constructive, I think you need to (a) state a theorem and (b) argue that it is relevant.

[...] Drug studies. Pharmaceutical companies make predictions about the effects of their drugs and then conduct blind clinical studies to determine their effect. Unfortunately, they have also been caught using some of the more advanced techniques for cheating here: including “reprobleming”, “data set selection”, and probably “overfitting by review”. It isn’t too surprising to observe this: when the testers of a drug have $109 or more riding on the outcome the temptation to make the outcome “right” is extreme. [...]