
I'm training a model (NN) that gets some data as input and outputs a single value in the range of $[0, 1]$. Right now, the average of the outputs in my dataset is around 0.5, but I know that future data will largely consist of 0.0s, and thus there will eventually be a strong data imbalance towards 0.0. I want the training procedure to be future-proof and scalable, so I'm trying to find a way to automatically rebalance the dataset. My library (Keras) supports sample weights in training, which seems like a straightforward way to do this without losing any information.

Basically, I think what I'm looking for is a function $w(y_i) \rightarrow w_i$ that given a training example $y_i$ gives me a weight $w_i$, so that the weighted average of all training examples $Y$ with the weights $W$ is $0.5$. I know there are many configurations of weights that have this property, but of course the weights should be as close to 1 as possible and definitely $> 0$. I also realize that this is not possible for cases where e.g. all numbers are the same, or all are $< 0.5$. But let's assume my data is diverse enough.

I'm sure I'm not the first one to think of this, but I can't find any solutions / best practices for this case. I guess I could treat it as a little optimization problem of its own, but I hope there is something simpler.
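As a minimal version of that optimization: one closed-form scheme (my own sketch, not an established best practice; `balance_weights` is a made-up name) gives one shared weight to all samples below 0.5 and another to all samples at or above 0.5, then solves a 2×2 linear system so that the weighted mean is 0.5 and the average weight is 1.

```python
# Sketch: split samples at the target (0.5), assign one shared weight per
# side, and solve for the two weights so that the weighted mean equals the
# target and the mean weight equals 1. Requires samples on both sides.
import numpy as np

def balance_weights(y, target=0.5):
    """Per-sample weights with weighted mean == target and mean weight == 1."""
    y = np.asarray(y, dtype=float)
    below = y < target
    s_lo = np.sum(target - y[below])     # total "pull" below the target
    s_hi = np.sum(y[~below] - target)    # total "pull" at/above the target
    if s_lo == 0 or s_hi == 0:
        raise ValueError("need samples on both sides of the target")
    n_lo, n_hi = below.sum(), (~below).sum()
    # Solve: a * s_lo = b * s_hi           (weighted mean == target)
    #        a * n_lo + b * n_hi = n       (mean weight == 1)
    b = (n_lo + n_hi) / (n_hi + n_lo * s_hi / s_lo)
    a = b * s_hi / s_lo
    return np.where(below, a, b)

y = np.array([0.0, 0.1, 0.2, 0.9])
w = balance_weights(y)
print(np.average(y, weights=w))  # ≈ 0.5
```

The resulting array can be passed directly as `sample_weight` to Keras's `fit`. It is only one of the many weightings satisfying the constraint, but it keeps all weights strictly positive and the mean weight at 1.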

2 Answers

There are a few things which are unclear so I am going to have to make some assumptions.

You say it is binary classification/regression. Are you trying to find the probability of each binary class?

When you say "future data", do you mean future training data or future test data? If it is test data, then you do not need to worry about this effect at all. If it is future training data, then yes, you may have a problem: you do not want to do model building on a training set that differs from the one you would actually use in the productionized model. I would suggest producing a number of validation plots for this case.

If you really are doing binary classification, most models in Python have a method (e.g. predict_proba()) that will produce a probability and let you decide on a class based on that probability, with the default threshold being 50%. For this reason I do not think you want the weighted average to be 0.5; rather, you want the "probability" to represent the true probability. Please refer to this tutorial: http://scikit-learn.org/stable/modules/calibration.html

As to your original question: you want the weights to be $M/N_i$, where $N_i$ is the number of samples in class $i$ and $M$ is the total number of samples. This is a form of pseudo-upsampling, weighting each class inversely by its sample count.
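That formula can be sketched in a few lines (the helper name `class_weights` is made up; the resulting dict maps directly onto Keras's `class_weight` argument to `fit`):

```python
# Sketch of inverse-class-frequency weights: w_i = M / N_i, where M is the
# total sample count and N_i the count of class i. Rarer classes get
# proportionally larger weights.
from collections import Counter

def class_weights(labels):
    counts = Counter(labels)
    m = len(labels)
    return {c: m / n for c, n in counts.items()}

labels = [0, 0, 0, 0, 0, 0, 1, 1]
print(class_weights(labels))  # minority class 1 gets weight 8/2 = 4.0
```

Note that, as discussed below, this only applies when the targets form discrete classes.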

Thanks for your reply, Keith! With future data I mean both future training and test data. I'm working on a product that will collect more data as it evolves, but I know already that this additional data will be heavily skewed towards 0.0. As for binary classification/regression, I guess I myself am confused as to what my problem setting is actually called. Basically I'm predicting a single probability. My target data has lots of 0s and 1s, but also values in between, like 0.4 and 0.734. I can't use class-based weighting as you proposed, as the target values are continuous.
– cpury Nov 7 '17 at 8:38


Ahh, got it. So it is regression: you are trying to predict a probability and you do not know the truth of the probability (i.e. whether it is actually 1 or 0). So let's just say you have a target to predict in [0, 1] and not use the word probability, because it confuses the issue. The distribution of the target not being uniform might not be an issue. If you want to reweight, I would try what I suggested above: make a histogram of the target and fit it to an analytic curve. The inverse of that curve can then be used to weight your data as you want. Refit the curve for future data.
– Keith Nov 7 '17 at 19:21

Sorry for the confusion! Hmm, so I would bucket the range to build a histogram, then find a function that approximates that histogram, and finally use the inverse of that function as the weight for a given sample?
– cpury Nov 8 '17 at 12:59

Yes, that is the method I would use. It may not be effective, so you should try it both with and without the weights. To simplify, think about a linear fit $x \in [0,1] \to y \in [0,1]$: having more data near $y = 0$ means you will have less prediction error for the corresponding range of $x$. Following this analogy, by changing the weights you are messing with the error bars in your fit, which would translate to something statistically deep in the NN. This could make your fit worse. Notice how this is not common practice in a linear fit, since there the error is exactly accounted for. Hope this makes sense.
– Keith Nov 8 '17 at 18:26

This is an interesting problem because, obviously, you want to train a model that performs well on unseen data, and therefore you'd like to train it on data resembling what you'll encounter later on.

If you already know the true distribution of your data, I would use what Keith mentioned in the comments: fit a curve to the histogram of that data, and use its inverse as the weight.
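A histogram-based sketch of that idea (using the raw histogram rather than a fitted analytic curve; `inverse_density_weights` is a made-up name, and `bins`/`eps` are arbitrary choices):

```python
# Sketch: estimate the target density with a histogram over [0, 1] and use
# its inverse as the per-sample weight, so samples in dense regions of the
# target distribution are down-weighted.
import numpy as np

def inverse_density_weights(y, bins=10, eps=1e-6):
    y = np.asarray(y, dtype=float)
    hist, edges = np.histogram(y, bins=bins, range=(0.0, 1.0), density=True)
    # Map each sample to its bin; edges[1:-1] are the interior bin edges.
    idx = np.clip(np.digitize(y, edges[1:-1]), 0, bins - 1)
    w = 1.0 / (hist[idx] + eps)      # inverse of the estimated density
    return w * len(y) / w.sum()      # normalize so the mean weight is 1

# Heavily skewed example: 90% of targets near 0, 10% near 1.
y = np.concatenate([np.full(90, 0.05), np.full(10, 0.95)])
w = inverse_density_weights(y)
```

Fitting a smooth curve instead of using the raw histogram, as Keith suggests, would make the weights less sensitive to the bin count.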

What you could also do if you know the true mean of your distribution is build some type of "discriminator", which guesses whether a data point will fall on the "left" or "right" of the mean. Since your data is skewed, you could also have a discriminator for the median, which might perform better.

Then, you can build models for each "side" of the mean. Since your data is heavily skewed, you will have a lot of data for one side and not so much for the other.

Finally, you could have a final model that takes as input the discriminator's output (how certain it is that the data point falls left or right) and the outputs of both models. That final model should smooth out some of the mistakes made by the discriminator(s) and the two side models.
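The pipeline above could be sketched as follows (all model choices are hypothetical stand-ins, using simple scikit-learn models on synthetic data; in practice the pieces would be Keras models):

```python
# Sketch: a "discriminator" predicts which side of the median a sample
# falls on, two regressors are trained on the two sides, and a final model
# blends their outputs using the discriminator's confidence.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = np.clip(0.2 + 0.3 * X[:, 0] + 0.05 * rng.normal(size=500), 0.0, 1.0)

side = (y >= np.median(y)).astype(int)           # left/right of the median
disc = LogisticRegression().fit(X, side)         # the discriminator
left = LinearRegression().fit(X[side == 0], y[side == 0])
right = LinearRegression().fit(X[side == 1], y[side == 1])

# Final model: learns to blend both side models using the discriminator's
# probability as an extra feature.
p = disc.predict_proba(X)[:, 1]
stack = np.column_stack([left.predict(X), right.predict(X), p])
final = LinearRegression().fit(stack, y)

def predict(X_new):
    p = disc.predict_proba(X_new)[:, 1]
    s = np.column_stack([left.predict(X_new), right.predict(X_new), p])
    return final.predict(s)
```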

Interesting approach! Is there some research or example of someone using something similar in practice?
– cpury Nov 9 '17 at 11:13

I have used such an approach for ecommerce spend prediction. Instead of choosing the mean, I chose to discriminate between spenders and non-spenders; in your case this means trying to identify the 0s. I had a binary predictor which did zero vs non-zero, and then, for those classified as non-zero, a regression model which predicted the expected value. The point is to get a split which produces classes as balanced as possible. In my case 99% were zero, so the classification was still imbalanced.
– Keith Nov 9 '17 at 18:26
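The two-stage ("hurdle") approach from the comment above could be sketched like this (hypothetical model choices on synthetic data, with the zero/non-zero rule invented for illustration):

```python
# Sketch: stage 1 classifies zero vs non-zero; stage 2 regresses the value
# only for samples classified as non-zero. Zeros pass through unchanged.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
# Synthetic "spend": zero unless the first feature exceeds a threshold.
spend = np.where(X[:, 0] > 0.5, np.maximum(X[:, 1], 0.1), 0.0)

is_nonzero = (spend > 0).astype(int)
clf = LogisticRegression().fit(X, is_nonzero)
reg = LinearRegression().fit(X[is_nonzero == 1], spend[is_nonzero == 1])

def predict(X_new):
    nz = clf.predict(X_new).astype(bool)
    out = np.zeros(len(X_new))
    out[nz] = reg.predict(X_new[nz])   # regress only the non-zero cases
    return out
```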

Cool, thanks! I'll make sure to try that approach once I have more data.
– cpury Nov 10 '17 at 10:18