01 December 2008

Let's say we're learning a predictor f and it doesn't get 100% test accuracy. We can ask ourselves: why not? Typically, the reason I ask myself this question is that I want to get closer to 100% test accuracy and want to know how I should go about it. Should I add more labeled data (and what should it look like)? Should I add more features (and what should they look like)? Or am I just hosed?

In order to try to get a handle on this, we can ask ourselves: where can error come from? There are (at least) four sources:

1. Noise in the training data (which we have fit and is now screwing us up at test time)
2. Noise in the test data (the "correct" label on a test example is simply wrong or ambiguous)
3. An insufficient feature representation (the features just don't carry enough information to make the right prediction)
4. Insufficient training examples (we haven't sampled densely enough in the relevant regions of the space)

These are not mutually exclusive. For instance, maybe by reducing or changing the feature set (3), it would turn out that we have sampled densely enough in the region of interest (4).

I'm going to use (binary) text categorization as a working example because it's simple yet exhibits NLP-like properties (ever growing feature sets, sparse feature representations, etc.). So let's say we train a model (perhaps linear) on a bag of words feature set and apply it on test data. Say we get 90% accuracy on dev data. Now what can we do?
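As a toy illustration of that setup, bag-of-words features are just word counts (the function name here is made up):

```python
from collections import Counter

def bag_of_words(text):
    # Map a document to a sparse {word: count} feature vector.
    return Counter(text.lower().split())

features = bag_of_words("The cat sat on the mat")
# features["the"] == 2, features["cat"] == 1
```

A Counter conveniently returns 0 for unseen words, which matches the sparse, ever-growing vocabularies we see in NLP.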

Let's take a single example that we misclassified. Perhaps take the "worst" (i.e., most misclassified in a margin sense), though I don't know that it matters. Which of the four reasons above tells us why this example was misclassified?

By looking at the example (yes, by hand!) we can probably tell if it's a problem due to issue (2: Noise in the test data). We might like to try to automate this ascertainment, but honestly I'm at a loss for how to do this. Let's say we decide that the problem isn't due to test set noise. Now what?

Let's consider the following approach. We are going to retrain our classifier several times (okay, this is expensive, but we can worry about this later). What we're going to do is add this misclassified dev point into the training data with a varying cost. The currently trained model we will call f(0), since it is based on giving this dev point a cost of zero. We can then train f(c) for a large range of costs c and let C be some reasonable upper bound (i.e., we guarantee that C is big enough that f(C) correctly classifies this point -- for any reasonable classifier, there should be such a finite C). Go ahead and replace "correctly classifies" with "classifies with a sufficiently large margin" if you'd prefer; I doubt it matters.
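Sketching this concretely (a toy, with several assumptions: a pure-Python weighted logistic regression standing in for whatever classifier you actually use, 1-D data, and a grid search standing in for finding the exact minimal cost):

```python
import math

def train_weighted(X, y, costs, epochs=500, lr=0.5):
    # Logistic regression by SGD with a per-example cost; a stand-in
    # for "any reasonable classifier". Labels live in {-1, +1}.
    w = [0.0] * (len(X[0]) + 1)              # w[-1] is the bias term
    for _ in range(epochs):
        for xi, yi, ci in zip(X, y, costs):
            if ci == 0.0:
                continue                      # cost zero: point is ignored
            z = sum(wj * xj for wj, xj in zip(w, xi)) + w[-1]
            g = ci * yi / (1.0 + math.exp(yi * z))   # weighted log-loss gradient
            for j, xj in enumerate(xi):
                w[j] += lr * g * xj
            w[-1] += lr * g
    return w

def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + w[-1] > 0 else -1

# Toy 1-D training set; the last row is the misclassified dev point,
# labeled +1 but sitting in the region the training data calls -1.
X = [[0.0], [1.0], [2.0], [3.0], [0.6]]
y = [-1, -1, 1, 1, 1]

def f(c):
    return train_weighted(X, y, [1.0, 1.0, 1.0, 1.0, c])

f0 = f(0.0)  # f(0): the dev point carries no weight at all
# Grid search approximates the minimal cost c' at which the dev point
# flips to being correctly classified.
c_prime = next(c for c in (0.5 * k for k in range(1, 41))
               if predict(f(c), X[-1]) == y[-1])
```

Here f0 misclassifies the dev point while f(c_prime) does not; comparing their predictions on the first four (training) rows is then the diagnostic signal.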

Now, we're going to look at two of these fs. We'll look at f(0) and f(c'), where c' is the minimal value of c such that this dev example becomes correctly classified. Let's say we now run these two fs on the training data. We know that f(0) will misclassify our selected dev point and that f(c') will not. The big question is what the fs do on the other points.
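That comparison is just counting the training points f(c') newly gets wrong (a hypothetical helper; predictions are lists of ±1 labels):

```python
def new_mistakes(preds_f0, preds_fc, labels):
    # Indices of points that f(0) got right but f(c') now gets wrong.
    return [i for i, (p0, pc, yi) in enumerate(zip(preds_f0, preds_fc, labels))
            if p0 == yi and pc != yi]

# If f(c') only moved where it had to, this stays (nearly) empty:
new_mistakes([-1, -1, 1, 1], [-1, 1, 1, 1], [-1, -1, 1, 1])  # -> [1]
```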

1. Suppose that f(c') doesn't make any (many?) more mistakes than f(0). That is, they're basically the same, just now classifying our selected point correctly. This suggests that the problem is (3) or (4) above (features or examples).

2. Suppose that f(c') makes many more mistakes than f(0). Now, we see that in order to get this selected dev point correct, we have to pay by getting other things wrong (that we didn't before). This suggests that the problem is (1) or (3) above (noise or features). Importantly, it's not (4: examples).

At this point, we have separated causes (3 or 4) from (1 or 3), but haven't separated them completely.

Now, let's go back and do the same process for all of the dev points that were misclassified. What can happen?

1. Almost all of the f(c')s make no more errors on other training points. Unless all of these erroneous dev points are markedly different from the training data (in which case we really have a domain adaptation problem), this is almost certainly a feature issue (3).

2. Almost all of the f(c')s make many more errors on the other training points, and the set of training points on which they make these errors is usually the same. Then this is almost certainly noisy training data (or noisy test data).

3. Almost all of the f(c')s make many more errors on the other training points, but the set of training points on which they err is substantially different each time. Then this is almost certainly a feature issue (3).

4. Mixed results: some f(c')s make more errors on training, some don't. This is harder to say, but my gut tells me that this is probably a (4: example) issue.
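One could caricature the decision procedure above in code (the thresholds are entirely hypothetical knobs standing in for "almost all", "many", and "usually the same"; the point is the shape of the logic, not the numbers):

```python
def diagnose(extra_error_sets, many=3, most=0.9, overlap=0.5):
    # For each misclassified dev point, extra_error_sets holds the set
    # of training points that f(c') newly gets wrong relative to f(0).
    n = len(extra_error_sets)
    harmless = [s for s in extra_error_sets if len(s) < many]
    harmful = [s for s in extra_error_sets if len(s) >= many]
    if len(harmless) >= most * n:
        return "features (3), unless the dev data looks like a different domain"
    if len(harmful) >= most * n:
        common = set.intersection(*harmful)
        union = set.union(*harmful)
        if len(common) >= overlap * len(union):
            return "noise (1): the same training points get sacrificed each time"
        return "features (3): different points get sacrificed each time"
    return "examples (4)?"
```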

There's a lot of hedging going on here: "almost", "many", "different", etc. Maybe you could formalize these things, but I'd rather get the intuition right first.

(If you're using a margin based classifier, you might not have to exactly retrain each time. Koby Crammer's passive aggressive algorithm will essentially give you a closed form solution for "closest (in L2) weight vector to the current weight vector such that a given point is correctly classified," which could save some computational effort.)
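For reference, that closed form is the basic passive-aggressive step (sketched here with a margin-1 target): find the nearest weight vector, in L2, that attains margin at least 1 on the given point.

```python
def pa_update(w, x, y):
    # Closest weight vector (in L2) to w that classifies (x, y) with
    # margin >= 1: w' = w + tau * y * x, with tau = hinge_loss / ||x||^2.
    score = sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - y * score)           # hinge loss at (x, y)
    if loss == 0.0:
        return list(w)                          # margin satisfied: stay passive
    tau = loss / sum(xi * xi for xi in x)       # closed-form step size
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

pa_update([0.0, 0.0], [1.0, 0.0], 1)   # -> [1.0, 0.0]
```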

Note that I haven't actually tried this. I'd definitely be interested to, but wanted to think about it a bit before I shot off to implement something.

20 comments:

Thanks for a very interesting post, Hal! I think you might be on to something but I have some questions about the intuition you have. You wrote:

"1. Suppose that f(c') doesn't make any (many?) more mistakes than f(0)... This suggests that the problem is (3) or (4) above (features or examples)."

I created some visual examples to test this. Consider an example here (link to JPG file). Here, circles and crosses are training examples of two classes. Our linear classifier (vertical line at x=0) misclassifies the dev point (triangle).

Now, as we increase c', the dev point gets correctly classified by moving the new classifier to a vertical line at x=0.7. This corresponds to the first scenario you mentioned, since the training points remain correctly classified.

Suppose we had more examples: in this case I follow your claim here that having more examples will fix the problem and make sure I get the classifier at x=0.7 in the first place. [Note, however, that these extra samples are not generated i.i.d. over the whole feature space. I'm not sure whether it makes more sense to consider a more standard probabilistic model of data generation.]

What I don't follow is why this can't be a noisy training label problem. Suppose some of the crosses had been mislabeled (they should've been circles). If the noisy labels were fixed, you would get a correct classifier (at x=1.25 in this case). Again, note that I am assuming some biased way of generating noisy labels (in particular, my noisy labels are at the boundary). If I had assumed that labels are flipped randomly with some probability (as assumed in many statistical learning theory papers), then my counter-example doesn't hold.

The point I'm trying to make is that it seems that you need to have some probabilistic model of data generation and noise generation in order to be sure that the claims are true. If one can arbitrarily generate data or create noisy labels, it seems that one can find counter-examples to your claim.

By the way, a random thought: I wonder whether you can gain any information by looking at the distribution of c' for all dev points. Can we draw any relationship to boosting? i.e. does high c' have any relation to "harder to classify" points that get higher weights in boosting?

If we consider the typical NLP problems to which linear models are being applied (broadly: local-context tagging (WSD/POS/B-I-O or whatever), parser action prediction, re-ranking of rich structures (trees, sequences), and the various text classification tasks), then clearly the problem is, by and large, (3) (do any of you really believe that our current feature representations for these tasks are suitable ones?), and when we try to augment the feature space we face intractability, or (4), or both.

So when facing an NLP problem, I don't need[*] a fancy algorithm to tell me where the problem lies -- it's in the feature representation.

The interesting question, in my view, is then "can we identify these test cases for which our features are not sufficient?". Note this is not the same as asking "can we identify in advance where our classifier will fail". Rather, it's about sacrificing recall in favor of precision, by allowing our linear classifier to say "I don't know" on some of the cases.

* Algorithms for identifying cases of (1) or (2), e.g. annotation errors/inconsistencies, are, of course, of interest. But for a different reason.
