16 September 2013

I have a colleague who wants to look through large amounts of (text) data for examples of a pretty rare phenomenon (maybe 1% positive class, at most). We have about 20 labeled positive examples and 20 labeled negative examples. The natural thing to do at this point is some sort of active learning.

But here's the thing. We have no need for a classifier. And we don't even care about being good at finding negative examples. All we care about is finding as many positive examples from a fixed corpus as possible.

That is to say: this is really a find-a-needle-in-a-haystack problem.

The best discussion I've found of this is Section 4 of Inactive Learning, by Attenberg and Provost. I came across a couple of other papers that basically do some sort of balancing to deal with the imbalanced data problem in active learning, but the Attenberg and Provost results suggest that even this is tricky to get right.

But I think this only partially addresses the problem. Here's why.

Let's say that my colleague (call her Alice) is willing to spend one hour of her life looking at examples. Alice estimates that she can look at about 500 examples in that time period (small caveat: some examples are larger than others and therefore slower, but let's ignore this for now). Alice's goal is to find as many positive examples in that one hour as possible. Here are some possible strategies Alice could deploy:

1. Look at 500 randomly selected examples, getting 5 (more) positive examples in expectation (likely somewhere between 2 and 12, with 95% confidence). This gives her a total of 25 positive examples.

2. Train a classifier on her 40 labeled examples, have it rank the entire set (minus those 40). Then look at the top 500 examples. If the learned model's top 500 is r times richer in positives than a random sample (loosely, its recall in the top 500), then she should expect to get 5*r more examples, giving her a total of 20+5r examples. (Caveat: this is a bit dangerous, because with very few training examples, results can often be anti-correlated with the desired output and you get better performance by flipping the predicted label. There was a paper on this maybe 5-10 years ago but I can't find it... Maybe someone knows what I'm referring to.)

3. Train a classifier, spend 30 minutes looking at its top-ranked 250 examples, getting 2.5r more positive examples, then train another classifier, then spend 30 minutes looking at its top-ranked 250 examples. If the second classifier's top 250 is r' times richer in positives than random, then she should expect to get another 2.5*r', and she'll have a total of 20+2.5r+2.5r' = 20 + 2.5(r+r'). So long as r' > r, this should be better than strategy 2.

4. Taking this to the extreme, Alice could annotate one example at a time (subject to constraints on either the learning being very fast or Alice not actually being a human). As long as the recall-at-one of the classifiers learned is non-decreasing, this should (I think) be the optimal strategy.
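The arithmetic behind the first three strategies can be checked with a short sketch. It assumes the 1% positive rate from above and treats r and r' as multipliers over the random baseline (an interpretation, not something the post pins down). The function names and the values r=3, r'=4 are made up for illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k positives among n independent draws."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_between(lo, hi, n, p):
    """Probability that the number of positives falls in [lo, hi]."""
    return sum(binom_pmf(k, n, p) for k in range(lo, hi + 1))

def expected_totals(r, r_prime, seed_positives=20, budget=500, rate=0.01):
    """Expected positive counts for strategies 1-3.

    r and r_prime are ASSUMED lift factors over random sampling;
    the defaults mirror the numbers used in the post.
    """
    base = budget * rate                       # 5 positives expected at random
    random_only = seed_positives + base        # strategy 1
    rank_once = seed_positives + base * r      # strategy 2
    retrain_halfway = seed_positives + (base / 2) * (r + r_prime)  # strategy 3
    return random_only, rank_once, retrain_halfway

# Probability mass that a random sample of 500 yields between 2 and 12 positives.
coverage = prob_between(2, 12, 500, 0.01)

# Illustrative lift values only; strategy 3 wins here because r_prime > r.
totals = expected_totals(r=3.0, r_prime=4.0)
```

Strategy 3 only beats strategy 2 in this sketch when the retrained classifier is better than the first one, which is the condition the list relies on.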

So it seems that in a setting like this, what the classifier should optimize for is simply recall-at-one. In other words, it doesn't care about anything except that its top-ranked output is correct.
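Here is a minimal sketch of the one-example-at-a-time loop, just to make the procedure concrete. Everything in it is an assumption for illustration: examples are plain numeric feature vectors, the "classifier" is a toy nearest-centroid scorer, and `oracle` is a hypothetical stand-in for Alice's judgment.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def score(x, pos_c, neg_c):
    """Higher means closer to the positive centroid than to the negative one."""
    d_pos = sum((a - b) ** 2 for a, b in zip(x, pos_c))
    d_neg = sum((a - b) ** 2 for a, b in zip(x, neg_c))
    return d_neg - d_pos

def greedy_search(positives, negatives, pool, oracle, budget):
    """Label the current top-ranked pool item, retrain, and repeat.

    `oracle(x)` stands in for Alice and returns True for a positive.
    Returns the positives discovered, in the order they were found.
    """
    positives, negatives, pool = list(positives), list(negatives), list(pool)
    found = []
    for _ in range(min(budget, len(pool))):
        # "Retraining" here is just recomputing the two centroids.
        pos_c, neg_c = centroid(positives), centroid(negatives)
        best = max(range(len(pool)), key=lambda i: score(pool[i], pos_c, neg_c))
        x = pool.pop(best)
        if oracle(x):
            positives.append(x)
            found.append(x)
        else:
            negatives.append(x)
    return found

# Toy one-dimensional data: positives live near 10, negatives near 0.
found = greedy_search(
    positives=[[9.0], [10.0]],
    negatives=[[0.0], [1.0]],
    pool=[[0.5], [9.5], [2.0], [11.0], [1.5]],
    oracle=lambda x: x[0] > 5,
    budget=2,
)
```

With a real learner the retraining step is the expensive part, which is exactly the "learning being very fast" constraint.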

Okay, so why is this difficult? Basically because we don't have that much data. If the corpus that represents the haystack is huge, then we want to ensure that the top-one example from that haystack is a positive example. So even if all we're trying to do is select between two different hypotheses (say our hypothesis class at each time step has cardinality two), in principle we should use as much data as possible to evaluate them. In particular, looking at the recall-at-one on the subset of 40 labeled examples that we have is probably not representative, essentially because this loss function doesn't decompose nicely.

So what could we do instead? Well, we could pool the labeled and unlabeled examples and predict on those instead. Certainly if one of the positive labeled examples outranked everything else, that would be a good thing. But if truly 1% of the unlabeled dataset is positive, then this is not a necessary condition. In particular, the top-ranked positive labeled example could be as far as 1% down the list and we could still (in principle) be doing a good job.
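One way to make the pooled check concrete is to measure how deep in the unlabeled ranking the best labeled positive sits and compare that depth to the assumed positive rate. This is only a sketch of a sanity check, not a solution to the problem. The function names are hypothetical, and the 1% default is the estimate from above.

```python
def top_positive_depth(pos_scores, unlabeled_scores):
    """Fraction of the unlabeled pool scored strictly above the best labeled positive."""
    best = max(pos_scores)
    above = sum(1 for s in unlabeled_scores if s > best)
    return above / len(unlabeled_scores)

def plausibly_good(pos_scores, unlabeled_scores, positive_rate=0.01):
    """True if the best labeled positive sits within the top `positive_rate`
    of the pooled ranking. A loose necessary-looking check, not a guarantee."""
    return top_positive_depth(pos_scores, unlabeled_scores) <= positive_rate

# Toy scores: 1000 unlabeled items spread evenly over [0, 1).
unlabeled_scores = [i / 1000 for i in range(1000)]
depth = top_positive_depth([0.40, 0.9925], unlabeled_scores)
ok = plausibly_good([0.40, 0.9925], unlabeled_scores)
too_deep = plausibly_good([0.90], unlabeled_scores)
```

Passing the check does not show that the items ranked above the labeled positive are themselves positive, so it can only rule hypotheses out.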

Ok I'll admit: I really don't know how to do this. Maybe it's enough in most cases to optimize for zero-one loss (or a weighted version thereof) and just cross your fingers that the recall-at-one is good. But it does feel like there should be a more direct way to go about this problem.

14 comments:

I've been thinking about this as well, because it seems to be a key issue in intrusion detection and server operations management.

I had this thought recently:

You indicated your colleague was looking for a class which occurs rarely. Suppose a positive example is "obvious" to the human. That suggests in some space there is a clear decision boundary. Combining "it's rare" with "obvious decision boundary" suggests the decision boundary will be in an area of low empirical support, i.e., "anomalous" from a density estimation or one-class estimation standpoint. Thus I suspect you could beneficially incorporate some unsupervised techniques.

I actually just wrote a paper about something very similar in the context of information extraction: (http://www.phontron.com/paper/neubig13lpci.pdf).

Basically, we take the 4th approach, learning from one (or actually 5) examples at a time. We do this for one week's worth of tweets, and the reason we can scale up to real time is that we're doing a bunch of approximations (e.g. caching scores and deciding which examples to look at based on the old scores) that definitely aren't correct from a theoretical perspective, so it'd be interesting to think more about what this means in theory.

@Paul: yeah, I'm actually not so sure the rare examples are obvious, unfortunately, in this case. In other cases, perhaps so. But yeah, some unsupervised preprocessing might be in order, but the worry would be that if the positives are unclear, the unsupervised learning will never pick them out.

Breck and I dealt with this all the time for customers. For customers, sentiment analysis is usually about finding positive comments and finding negative comments and grouping them, not about 0/1 loss on a random held out data set.

The IR literature is all over this precision-at-N kind of evaluation.

What we did was take a very active approach. We'd almost always have a domain expert on hand for the problem who had a bunch of positive examples in hand and could find more using a search engine. So we'd build a search engine and let users find some positive seed examples.

Then we'd build a classifier and evaluation infrastructure.

After that, the active learning strategy we used was to get new labels only for the top scoring results that weren't already labeled. This gave us a nice discriminative set of negative examples that looked like positive examples to early-round classifiers.

Then we just iterated. If the customer later finds or thinks about different kinds of positive examples, we'd just add them to the mix.

The evaluation was always precision-at-N for some relevant N.

Often we were trying to get as much recall as possible at a given level of precision set by the customer.
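A minimal sketch of the two evaluation quantities mentioned here, precision-at-N and the best recall available under a precision floor. It assumes a held-out list of 0/1 labels already sorted by classifier score, best first. The function names are illustrative and not from any particular library.

```python
def precision_at_n(ranked_labels, n):
    """Fraction of positives among the top n items (n >= 1) of a ranked 0/1 list."""
    top = ranked_labels[:n]
    return sum(top) / len(top)

def best_cutoff_for_precision(ranked_labels, min_precision):
    """Largest cutoff n whose precision-at-n meets the floor, plus its recall.

    Because the hit count never decreases with n, the largest qualifying
    cutoff also has the highest recall. Returns (0, 0.0) if none qualifies.
    """
    total_pos = sum(ranked_labels)
    best_n, best_recall, hits = 0, 0.0, 0
    for n, label in enumerate(ranked_labels, start=1):
        hits += label
        if total_pos and hits / n >= min_precision:
            best_n, best_recall = n, hits / total_pos
    return best_n, best_recall

# Toy ranking: 1 marks a positive, listed from highest score to lowest.
ranked = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
p_at_4 = precision_at_n(ranked, 4)
cutoff, recall = best_cutoff_for_precision(ranked, min_precision=0.7)
```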

So it's sort of like a mixture of your 3, followed by a continuous run of 4, even after the app is fielded.

I haven't seen any direct analysis of this kind of approach, but I'll check out some of the links in the comments.

I've also been thinking about this problem, which I like to call active recall maximization. I presume the goal is to maximize recall with a fixed budget, or to achieve 100% recall with the smallest possible budget.

I recall Sanjoy Dasgupta working on a problem similar to this. He gave a talk once about an idea motivated by the problem of helping a legal team find all the relevant corporate documents for a subpoena, which basically amounts to finding all the needles in a haystack.

Is there any concern that the category might be heterogeneous? Recall at 1 (or precision at 1; are they equivalent?) doesn't take this into consideration. The rare category work focuses more on diversity.

The problem is one of great interest in e-discovery (identifying electronic documents that need to be turned over in legal matters). That's the area I spend 95% of my time on these days. I've been working the past couple of years to get more computer scientists and statisticians interested (Sanjoy being one of my successes).

Doug Oard, William Webber, Mossaab Bagdouri and I have been working on this problem, with the additional constraint that in legal cases you often also must pay the labeling cost to statistically validate the level of recall you achieve. So there's a tradeoff between spending labels to get good recall and spending labels to prove you got good recall. We had a short paper at SIGIR 2013 and will have a long one at CIKM 2013, with a journal article in progress.

As far as I can understand, this is just a classic detection problem from statistics: you have a signal embedded in noise and you have to test for the signal (noise is what you call negative labeled samples). Perhaps a lot of the machinery developed there could be leveraged?

This is indeed exactly the problem I have been calling "active search": I simply want to find as many positives as possible, and don't care about building a classifier at all. I also call it the "generalized Battleship problem," because it's the same situation you're faced with when playing the board game.

In our ICML paper we derive the optimal policy from a Bayesian decision-theoretic point of view. It turns out that the policy you propose (annotating/retraining one example at a time and selecting the top-ranked point each time) is exactly the greedy/myopic 1-step lookahead approximation to the optimal policy. Two- and more-step lookahead can lead to nontrivial and perhaps surprising behavior, because we start thinking about the impact of our current choices on future selections. I might want to label a point ranked further down the list if it happens to be highly similar to other unlabeled points, because if I got lucky, I might discover more positives along with it.

We also show how to construct cases where farther lookahead can benefit you by an arbitrary amount, which means that the expectation of diminishing returns is not necessarily true. Unfortunately, the cost of lookahead is, in general, exponential in the number of unlabeled examples; however, we show in the paper how, for some models, we can achieve massive pruning of the search space.

If your colleague is still interested in this problem and this approach, I have code implementing the algorithm in the paper for arbitrary models (including the pruning) and would be happy to discuss this further.

Sounds like PU learning to me. There are some results from Dell Zhang, "Learning Classifiers without Negative Examples: A Reduction Approach". What you basically need is a classifier that is learned by optimizing AUC. This is it. You treat your big/unexplored corpus as noisy negatives during the learning stage. Once trained, you compute predictions on it and voilà, your needles (positives) will start to shine among the hay (negatives). An additional benefit is that class imbalance is not an issue if you optimize for AUC.
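A rough sketch of that idea, under loud assumptions: instead of training a model, it picks between a handful of candidate scoring functions by their AUC, with labeled positives on one side and the unlabeled corpus treated as noisy negatives on the other. It is not the method from the cited paper. The names `auc` and `pick_scorer` and the toy data are made up for illustration.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative
    (ties count half). Quadratic in the inputs, fine for a sketch."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))

def pick_scorer(scorers, positives, unlabeled):
    """Choose the scoring function with the highest AUC when the
    unlabeled items are treated as (noisy) negatives."""
    return max(
        scorers,
        key=lambda f: auc([f(x) for x in positives], [f(x) for x in unlabeled]),
    )

# Toy data: items are single numbers and larger values tend to be positive.
positives = [8, 9]
unlabeled = [1, 2, 3, 9.5, 4]      # 9.5 is a hidden needle in the hay

def ascending(x):
    return x

def descending(x):
    return -x

pu_auc = auc(positives, unlabeled)
best = pick_scorer([ascending, descending], positives, unlabeled)
top_candidate = sorted(unlabeled, key=best, reverse=True)[0]
```

The hidden positive drags the measured AUC below 1.0 even for the right scorer, which is the "noisy negatives" effect in miniature.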