Apply (1) and (2) to some really huge data set leading to a smaller labeled set (that has high precision)

Train a new classifier on the output of (3)

Apply the new classifier to the original data set and go to (4)

Note that the hand-built classifier can alternatively just be a hand-labeling of a small number of "representative" examples.

I've never actually used such an algorithm before, but I hear that they work pretty well.

There's always been one thing that's surprised me, though. And that's the fact that they work pretty well.

There are two reasons why the fact that these algorithms work surprises me. First, there is often a danger that the classifier learned in (4) simply memorizes what the rule-based classifier does. Second, there is a domain adaptation problem.

To be more explicit, let me give an example. Suppose I want to classify sentences as subjective or objective. This is a well studied problem to which bootstrapping algorithms have been applied. The first two steps involve inventing rule-based classifiers that get high precision on the "predict-subjective" and "predict-objective" tasks. One way to do this is to create word lists. Get a list of highly subjective words and a list of highly objective words (okay this latter is hard and you have to be clever to do it, but let's suppose that we could actually do it). Now, to make our rule based classifier, we might say: if a sentence contains at least two positive words and no negative words it's positive; if a sentence contains at least two negative words and no positive words it's negative; else, punt.

Now, we apply this word-lookup-based classifier on a large data set and extract some subset of the sentences as labeled. In the very simplest case, we'll extract bag-of-words features from these sentences and train up a simple classifier.

This is where things go haywire for my understanding.

First, by using a bag-of-words model, there is always the possibility that the classifier we learn will simply mimic the rule-based classifiers on the training data and do random stuff on everything else. In other words, since the information used to create the rule-based classifier exists in the training data, there's no guarantee that the classifier will actually learn to generalize. Essentially what we hope is that there are other, stronger, and simpler signals in the data that the classifier can learn instead.

One might think that to get around this problem we would remove from the bag of words any of the words from the initial word list, essentially forcing the classifier to learn to generalize. But people don't seem to do this. In fact, in several cases that I've seen, people actually include extra features of the form "how many positive (resp. negative) words appear in this sentence." But now we have a single feature that will allow us to replicate the rule-based classifier. This scares me greatly.

Second, our goal from this process is to learn a classifier that has high accuracy on the distribution of "all English sentences (perhaps from some particular domain)." But the classifier we train in the first iteration is trained on the distribution of "all English sentences (perhaps...) such that the sentence contains >=2 positive and 0 negative, or >=2 negative and 0 positive words." The fact that this will generalize is totally not obvious.

And yet these things seem to work pretty well. I just don't quite understand why.

17 comments:

I find the best way to understand the common approach to "bootstrapping" in NLP (not to be confused with Efron's statistical bootstrap) is to think of it as a highly quantized EM with annealing.

Specifically, you build an initial model, ideally with high precision. Then you run EM iterations with quantization. By that I mean that you quantize the expectations with a high threshold (e.g. [0.0,0.9] to 0 and (0.9,1.0] to 1). As you go, that threshold presumably gets lowered, making the whole thing a kind of annealing.

Of course, this assumes an underlying classifier that can infer conditional estimates of categories given inputs. I suppose you can do this with Yarowsky-style word list classifiers by recasting them as decision trees and estimating probabilities.

The EM connection makes sense, but I didn't mention it because it doesn't really address either of my concerns, or at best partially addresses only the first one :).

In a sense bootstrapping is a family of algorithms for solving semi-sup learning. I guess my concern---especially concern #2---is that we know that we can do much better than semi-sup learning when different domains are involved. (Thanks to John Blitzer for teaching us this.)

The short answer is that bootstrapping works for the same reason naive Bayes works -- tokens in docs are highly correlated with topics, and with even a few example docs, we can build a better-than-chance classifier.

Problem 1 worries that we might never get off the ground (continuing the "bootstrap" analogy). The reason this doesn't happen is that from a few set of seed words, we'll pick up some documents. From even a small set of docs, we can train a classifier that is not embarassing on 0/1 loss at high confidence. If this doesn't happen, we're grounded, so the annealing param needs to be set right here. If we do pick up more docs, we can build up flight speed by inferring that the words in those docs are associated with the categories. Many of these inferences will be wrong, but the reason it works is the same as for naive Bayes -- we're accumulating lots of little pieces of evidence in a voting scheme and the method's fairly low variance. These new docs are then enough to push some more relevant docs over the threshold, and then we're flying.

Problem 2 questions whether we can estimate a good distribution over all words. The answer is that just like for naive Bayes, it doesn't matter much if what we care about is 0/1 loss. The method's surprisingly robust because all the words tend to provide evidence. The lack of modeling correlation is why naive Bayes does so poorly on log (cross-entropy) loss compared to models that model dependency better (make fewer independence assumptions).

Unfortunately, like EM, we're left with a residual problem 3: we may get stuck in a local optimum. This indeed seems to happen in practice. Luckily, annealing through high precision examples helps push the initial steps of EM in the right direction. And if it doesn't we can just choose a different seed set.

I confess I'm puzzled as to why these bootstrapping methods seem to work better than EM. The informal story seems to be that with EM, the unlabeled data overwhelms the labeled data, which seems a reasonable enough explanation of why EM goes wrong. But then the question is: why doesn't this happen with bootstrapping, especially if it is just a kind of approximation to EM as Bob suggests.

hrm... the statement "bootstrapping works for the same reason naive Bayes works" doesn't make sense to me. afaik, there's nothing that says that i couldn't bootstrap, say, with an svm or maxent model. but in these models, there's a huge chance that you'll just memorize the data and not generalize... i can easily construct distributions on which this will happen.

i think problem 1 is deeper than not getting off the ground: (to continue the analogy) i think we can actually start digging a hole. why doesn't this happen?

i think i'm starting to understand problem 2 and this may be kind of cool. essentially what bob seems to be saying (buried somewhere)---which i agree with---is that while D^0 (the distribution over sentences from the output of our rules) and D (the true distribution over sentences) maybe be different in an "adaptation" sense, by doing this "incrementally add a few more examples" thing, we're essentially constructing a sequence of distributions D^0, D^1, D^2, ..., D^T, where D^T = D.

looking at this from the domain adaptation perspective, this is actually quite interesting. i don't think anyone's looked at the problem in this way before. there may be some new D.A. algorithm lurking in there somewhere.

mark -- yes, this is weird. i don't have a good answer. one thing that i can suggest based on some work a student of mine is doing right now is that maybe what's going wrong is that when we usually do EM, we do it on a naive bayes model, which is very poorly calibrated for predicting probabilities (as bob points out). by doing this thresholding in bootstrapping, we may be effectively trying to remedy this problem....

In regards to the generalization problem, aren't there enough similarities between bootstrapping and boosting that could help explain the apparent lack of overfitting? In both instances, it appears you've got a weighted combination of classifiers that are specialized on certain areas of the problem space, and (at least in boosting) adding in additional features apparently does not lead immediately to overfitting.

"... I hear that they work pretty well". Don't trust everything you hear. As far as I know, no bootstrapping algorithm has ever been shown to work widely beyond its first reported application. In other words, the bootstrap parameters have been seriously overfitted to the initial application. Bootstrapping is intuitively a very cool idea, but we are missing a sketchy theoretical understanding of under what conditions it would work. Until we do, bootstrapping methods will be one-offs from which we can learn little.

Bootstrapping, self-training, semi-supervising etc. have been tried for at least 4 decades (some OCR research from the 60's already had same idea).

The main problem with bootstrapping with a 'single' classifier was that the samples classified with high precision by the classifier are not uniformly distributed in the feature space. I.e., the errors made by a classifier are near the boundary, and if we throw these samples away the remaining samples don't add much to the existing classifier.

The novelty of some of the latest approaches (like co-training) is that they assume that there are two classifiers where the errors of each one are 'uniformly distributed' for the other. In NLP we can often find a natural analogue because some intrinsic properties of tokens (like its identity) are usually independent of its context given its label.

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..sesli sohbetsesli chatsesli sohbet siteleri