This is a general question, not specific to any method or data set. How do we deal with a class imbalance problem in supervised machine learning where roughly 90% of the labels in the dataset are 0 and roughly 10% are 1? How do we optimally train the classifier?

One approach I follow is to sample the dataset to make it balanced, train the classifier, and repeat this for multiple samples.

This feels ad hoc to me. Is there a framework for approaching these kinds of problems?

4 Answers

Undersampling. Select a subsample of the set of zeros such that its size matches the set of ones. There is an obvious loss of information, unless you use a more complex framework (for instance, I would split the set of zeros into 9 smaller, mutually exclusive subsets, train a model on each of them, and ensemble the models).
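The split-and-ensemble variant can be sketched roughly like this (toy data and function names are my own, not from the answer): partition the majority class into k disjoint chunks, pair each chunk with the full minority class, train one model per pair, and average the predicted probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy imbalanced data: 900 zeros, 100 ones.
X0 = rng.normal(0.0, 1.0, size=(900, 2))
X1 = rng.normal(2.0, 1.0, size=(100, 2))

k = 9  # 900 / 100 -> nine balanced, mutually exclusive folds
models = []
for chunk in np.array_split(rng.permutation(900), k):
    X = np.vstack([X0[chunk], X1])          # 100 zeros + all 100 ones
    y = np.array([0] * len(chunk) + [1] * len(X1))
    models.append(LogisticRegression().fit(X, y))

def predict_proba_ensemble(X_new):
    # Average the minority-class probability across the k models.
    return np.mean([m.predict_proba(X_new)[:, 1] for m in models], axis=0)

print(predict_proba_ensemble(np.array([[2.0, 2.0], [0.0, 0.0]])))
```

Averaging probabilities is only one way to combine the models; majority voting over hard labels is another common choice.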

Oversampling. Produce artificial ones until the proportion is 50%/50%. My previous employer used this by default. There are many frameworks for it (I think SMOTE is the most popular, but I prefer simpler tricks like noisy PCA).
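To make the SMOTE idea concrete, here is a minimal hand-rolled sketch (my own simplification, not the full algorithm): for each synthetic sample, pick a minority point, pick one of its k nearest minority neighbours, and interpolate between them. The imbalanced-learn package provides a production implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    """Return n_new synthetic minority samples (simplified SMOTE)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # k nearest neighbours of each point (column 0 is the point itself).
    neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)          # random base points
    partner = neigh[base, rng.integers(0, k, size=n_new)]   # random neighbour each
    lam = rng.random((n_new, 1))                            # interpolation weight
    return X_min[base] + lam * (X_min[partner] - X_min[base])

rng = np.random.default_rng(1)
X_min = rng.normal(2.0, 0.5, size=(100, 2))  # the 100 ones
X_new = smote(X_min, n_new=800)              # bring them up toward the 900 zeros
print(X_new.shape)
```

Each synthetic point lies on a segment between two real minority points, so the new data stays inside the minority region rather than being random noise.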

One-Class Learning. Just assume your data has a few real points (the ones) and lots of random noise that doesn't physically exist that leaked into the dataset (anything that is not a one is noise). Use an algorithm to denoise the data instead of a classification algorithm.
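One concrete way to do this (my choice of estimator, not the answer's) is scikit-learn's `OneClassSVM`: fit it on the "real" points only and treat whatever it flags as an outlier as noise. The contamination parameter `nu` is an assumption you must supply.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
X_ones = rng.normal(0.0, 1.0, size=(200, 2))    # the real points (the ones)
X_noise = rng.uniform(-6.0, 6.0, size=(20, 2))  # scattered noise

# nu ~ expected fraction of contamination among the training points.
clf = OneClassSVM(nu=0.1, gamma="scale").fit(X_ones)

pred = clf.predict(np.vstack([X_ones, X_noise]))  # +1 = inlier, -1 = outlier
print((pred == -1).sum())
```

`IsolationForest` or `EllipticEnvelope` would serve the same role; the common thread is learning the support of one class instead of a boundary between two.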

Cost-Sensitive Training. Use an asymmetric cost function to artificially balance the training process.
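In scikit-learn the usual handle for this is `class_weight`, which scales each class's contribution to the loss; `"balanced"` sets weights inversely proportional to class frequencies (the toy data below is mine):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (900, 2)), rng.normal(1, 1, (100, 2))])
y = np.array([0] * 900 + [1] * 100)

plain = LogisticRegression().fit(X, y)
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

# The cost-sensitive model predicts the minority class far more often.
print(plain.predict(X).sum(), weighted.predict(X).sum())
```

SVMs (`SVC(class_weight=...)`) and tree ensembles accept the same parameter, so no algorithm needs to be coded from scratch.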

Some literature reviews, in increasing order of technical complexity/level of detail:

Just noticed that the noisy PCA I cited can be seen as oversampling, regularization, or jitter, depending on implementation details.
– Lucas Gallindo, Jan 14 '15 at 14:53

Thanks Lucas for the resources, they help a lot. I have a peculiar problem at hand where all my samples are labelled '1'. However, in reality these samples have a minimal impurity, i.e., there are some records which are actually supposed to be '0' but are labelled '1'. I believe this kind of problem belongs to one-class classification. Is my understanding correct? Is there a common framework used to identify them? Initially I tried clustering, but that is not working.
– NG_21, Jan 19 '15 at 10:08


All of these algorithms need some data labelled as zero and some as one, with 100% certainty about the correctness of the labels (or something very close to 100%). You have all ones, but you know that a small percentage of this data is mislabelled, which is a different situation. Without any knowledge of the application domain, I would attack it with anomaly detection, then label the anomalies as zero, and then try some classification algorithm (one-class learning, perhaps). With knowledge of the application domain, I would seek help from a domain expert before anything else.
– Lucas Gallindo, Jan 21 '15 at 18:45

This depends heavily on the learning method. Most general-purpose approaches have one (or several) ways to deal with it. A common fix is to assign a higher misclassification penalty to the minority class, forcing the classifier to recognize it (SVM, logistic regression, neural networks, ...).

Changing the sampling is also a possibility, as you mention. In this case, oversampling the minority class is usually a better solution than undersampling the majority class.
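The simplest oversampling variant, plain random duplication of minority rows, needs no special library; one way to sketch it (the toy data is mine) is with `sklearn.utils.resample`:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (900, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 900 + [1] * 100)

# Draw minority rows with replacement until they match the majority count.
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, replace=True, n_samples=900, random_state=0)

X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
print(np.bincount(y_bal))  # [900 900]
```

For R, the `ROSE` and `caret` packages offer comparable up-/down-sampling helpers.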

Thanks. Can you point to any resource where this is explained with some examples? Is there a way to achieve it in R/Python without coding the algorithm from scratch?
– NG_21, Jan 6 '15 at 8:15

Often the problem is not the relative frequency but the absolute number of cases in the minority class. If you do not have enough variation in the target compared against the variation in the features, the algorithm may not be able to classify things very accurately.

Note that the misclassification penalty can be applied at the classification step rather than in the parameter-estimation step, if there is one. Some methods have no concept of parameters; they just produce outright class labels or class probabilities.

When you have a probabilistic estimator, you can make the classification decision on information-theoretic grounds or in combination with business value.
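As a sketch of the business-value route (the cost numbers below are made up for illustration): given a predicted probability p of class 1 and asymmetric error costs, predict 1 whenever the expected cost of doing so is lower, which reduces to a threshold on p.

```python
import numpy as np

C_fp = 1.0   # cost of predicting 1 when the truth is 0
C_fn = 10.0  # cost of missing a real 1 (assumed 10x worse)

# Predict 1 when p * C_fn > (1 - p) * C_fp, i.e. p > C_fp / (C_fp + C_fn).
threshold = C_fp / (C_fp + C_fn)

p = np.array([0.05, 0.10, 0.40, 0.95])  # probabilities from any estimator
print(threshold, (p > threshold).astype(int))
```

With a 10:1 cost ratio the cutoff drops to roughly 0.09 instead of the default 0.5, so the rare class is predicted much more readily without retraining anything.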

Two additional tricks:
1. Use the CDF: count the minority-class frequency in your training data, or in a very large validation set (this assumes your test set will not change, and the validation set must have the same distribution as the training set). Then sort your predictions and label the top X% (the frequency you counted before) as the one class and the rest as the other.
2. Weighted samples: the model will tend toward the more heavily weighted class. You can derive weights from the sample variance v, e.g. weight_i = 1/2 * (1 - (v_max - v_i)/v_max).
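Trick 1 amounts to a quantile cutoff on the scores; a sketch under the assumption that 10% of the training labels were ones (the scores below are made up):

```python
import numpy as np

train_pos_rate = 0.10  # minority frequency counted on the training data
scores = np.array([0.2, 0.9, 0.1, 0.8, 0.3, 0.4, 0.15, 0.5, 0.05, 0.7])

# Cut at the (1 - rate) quantile so the top 10% of scores become ones.
cutoff = np.quantile(scores, 1.0 - train_pos_rate)
labels = (scores > cutoff).astype(int)
print(cutoff, labels)
```

This forces the predicted class proportions to match the training proportions, regardless of how miscalibrated the raw scores are.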