PU Learning

PU LearningDealing with a negative class hidden in unlabelled dataPhil MassieBlockedUnblockFollowFollowingJan 21PU Learning — finding a needle in a haystackA challenge that keeps presenting itself at work is one of not having a labelled negative class in the context of needing to train a binary classifier.

Typically, the issue is paired with horribly imbalanced data sets and pressed for time, I have often taken the simplistic route of sub-sampling the unknown set and treating them as unknowns.

Obviously this isn’t ideal as the unknown set is contaminated and as a result the classifiers don’t train that well.

Nevertheless, out in the wild, with real-life deadlines, the approach was time efficient, and the results were often surprisingly useful.

Recently, I was lucky to have a few days to read around the topic a little.

I found some interesting approaches and thought it would be worth taking a few notes, and they turned into this post.

There are a few different PU approaches around.

All the approaches involve isolating a set of so-called Reliable Negatives (RNs) from the unknown data set.

As I read, the most widely cited, initial approaches are attributed to Liu et al.

2002 and 2003 wherein a set of RNs are iteratively grown from within the unknown class.

Another approach which appealed to me was described by Fusilier et al.

2015.

In their paper the authors describe an approach which iteratively reduces the set of RNs from within the unknown class, effectively tightening the net around those cases which are the most dissimilar to the positive class.

This approach appealed to me as it implicitly deals with class imbalance.

The third approach I came across (Mordelet & Vert 2013) which also implicitly accounts for class imbalance involves bagging, or randomly sampling from the unknown class and treating the sample as negatives.

Where this deviates from my naive approach mentioned above is that the process is repeated many times and a series of models are trained.

The models characterise the positive class against unknown data sets with varying degrees of contamination.

The resulting model scores are ensembled and the result should better isolate the reliable negatives from the unknown class.

Below, I will go into a little detail about each of the three approaches.

Methods‘Original’ approach (Liu et al.

2002 and 2003)Given a training set containing only positives (P) and unknown (U) classes follow the following steps:Treating all U as negatives (N) train a classifier P vs.

UUsing the classifier, score the unknown class and isolate the set of ‘reliable’ negatives (RN).

Train a new classifier on P vs.

RN, use it to score the remaining U, isolate additional RN and enlarge RN.

Repeat step 3, iteratively enlarging the set of RN until the stopping condition is met.

The stopping condition is met when no new negative cases are classified.

Where Q is defined as the set of unknowns classified as negatives and i is the iterator, the stopping condition is defined as:|Qi| > 0Modified approach (Fusilier et al.

2014)Given a training set containing only positives (P) and unknown (U) classes follow the following steps:Treating all U as negatives (N) train a classifier P vs.

UUsing the classifier, score the unknown class and isolate the set of ‘reliable’ negatives (RN).

Where Q is defined as the set of unknowns classified as negatives and i is the iterator, the stopping condition is defined as:|Qi| <= |Q(i-1)| & |P| < |Qi|The stopping condition ensures that Q reduces in size (avoiding sudden large reductions in RN size) while the RN set never gets smaller than the P set.

More explicitly:while the size of the set of unknowns classified as negatives in this iteration is smaller than or equal to the size of the set of unknowns classified as negatives in the previous iteration and the size of the set of positive classes is smaller than the set of refined RNs resulting from this iterationBagging approach (Inductive) (Mordelet & Vert 2013)Given a training set containing only positives (P) and unknowns (U), where K = size of bootstrap samples and T = number of samples, follow the following steps: 1.

Draw a bootstrap sample Ut of size K from U 2.

Train a classifier P vs Ut 3.

Repeat steps 1 and 2 T times 4.

Score the test data with an ensemble approach using the bagged models.

The stopping criterion here is determined by the value of T and the authors suggest that there is typically not much additional value to be gained by setting T > 100.

Judging from their plots however, where |P| and K are both large, there is little change above T = 5.

I suspect it’s worth trying to keep track of this during training if possible or setting up an early stopping type criterion in your function because depending on your time constraints as training 100 models may not be viable.

Things to follow up on:- Most of the articles use SVM, but they also tend to be NLP problems.

Does the classifier family matter much?- How do the original papers identify the cut-off for determining ‘reliability’- Modified approach: WRT the stopping criterion, why could Q get larger with the iterations?- Bagging approach: Consider how best to penalise false negatives.

– Cutoff selection?ConclusionThese three methods provide sensible approaches to the problem of PU learning but only the modified and bagging approaches provide inherent ways to deal with imbalanced data.

My plan is to try and implement these 2 approaches and compare their results.

While I cant share the data publicly, I will try and share the code and general results on the blog and in GitHub.