I'm faced with a fairly typical class-imbalance problem across a dataset of nearly 9 million rows (hard drive failures) that's not stored locally (it's in a Postgres table; downloading a .csv of it is not possible). I want to build a classification model that predicts failure [0,1] (limited to the Top-10 drive models by frequency) using a tree-based method across 11 features. Here's what my R code looks like so far:
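(The snippet below is a sketch: the connection details and the drive_stats table/column names are placeholders, not my real schema.)

```r
library(DBI)
library(RPostgres)

# Placeholder connection details
con <- dbConnect(Postgres(), dbname = "drives", host = "db.example.com",
                 user = "ray", password = Sys.getenv("PGPASSWORD"))

# Top-10 drive models by row count, computed inside the database
top_models <- dbGetQuery(con, "
  SELECT model, COUNT(*) AS n
  FROM drive_stats            -- placeholder table name
  GROUP BY model
  ORDER BY n DESC
  LIMIT 10
")$model
```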

In theory, I understand the trade-offs between over/under-sampling, but how would you approach this problem in the context of memory constraints?

It's not feasible to load all 8+ million rows (combined across all top models) into local memory, partition them into training/test sets, and run the analysis. So instead my ideas were to either:

1. Build/train/cross-validate the model on one drive model only, then test it on the others (sketched after this list). Here's a plot of the log-density of failure rate (across those drives that had at least one failure):

2. Initially down-sample the "good" drives (where failure == 0) to 10-100K rows per drive model, keeping all the "bad" ones (also sketched below).
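Here's a sketch of the first idea, assuming the connection from above and ranger for the tree ensemble (table and column names are still placeholders):

```r
library(ranger)

# Idea 1: a single drive model's rows are small enough to pull locally;
# train on one model, then score a different model's rows as the test set.
one_model <- dbGetQuery(con, "SELECT * FROM drive_stats WHERE model = $1",
                        params = list(top_models[1]))
other     <- dbGetQuery(con, "SELECT * FROM drive_stats WHERE model = $1",
                        params = list(top_models[2]))

# Use only the numeric feature columns (all my features are numeric)
feats <- setdiff(names(one_model)[sapply(one_model, is.numeric)], "failure")

fit    <- ranger(x = one_model[, feats], y = factor(one_model$failure),
                 probability = TRUE)
p_fail <- predict(fit, data = other[, feats])$predictions[, "1"]
```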

The second seems like it would introduce bias, but when the class imbalance is this high, does that concern become negligible? From a high level, which of these (if any) would be the preferred method? Or, alternatively, is there a technical way to solve this without blowing up the memory usage on my local machine?
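For concreteness, here is the second idea as a sketch, with the down-sampling pushed into Postgres so that only the reduced set ever reaches R (the table name and the 0.01 fraction are placeholders):

```r
# Keep every failure row for the top models, but let Postgres down-sample the
# non-failures so that only the reduced set ever reaches local memory.
model_list <- paste0("'", top_models, "'", collapse = ", ")
sampled <- dbGetQuery(con, sprintf("
  SELECT * FROM drive_stats
  WHERE model IN (%s) AND failure = 1
  UNION ALL
  SELECT * FROM drive_stats
  WHERE model IN (%s) AND failure = 0
    AND random() < 0.01   -- tune toward ~10-100K 'good' rows per drive model
", model_list, model_list))
```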

How many covariates do you have and what kind of model would you like to fit?
– whuber♦ Jan 16 at 16:10

@whuber There are 14-15 features (all numeric). As for which type of model, I was going to start with a tree-based method.
– Ray Jan 16 at 16:16

That looks like important information to include prominently in your post.
– whuber♦ Jan 16 at 17:37

$\begingroup$"Good" vs. "not good" are classical ill-conditioned problems. Not only becaus failures are rare, but also because failures are probably not a well-defined group but can be due to various causes. You may want to look into one-class classification (which doesn't stop you from modeling, say, "good" as well as a bunch of classes for specific failure groups).$\endgroup$
– cbeleitesJan 17 at 13:28
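A minimal sketch of that one-class idea in R, assuming the sampled data frame and feats column list from the snippets above, plus e1071's one-class SVM (the nu value is a guess):

```r
library(e1071)

# Train only on the majority ("good") class; rows the model then places
# outside the learned region are candidate failures.
good <- subset(sampled, failure == 0)

oc <- svm(good[, feats], type = "one-classification", nu = 0.01,
          kernel = "radial")

# predict() returns TRUE for rows judged to lie inside the "good" region
inside <- predict(oc, newdata = sampled[, feats])
table(predicted_good = inside, actual_failure = sampled$failure)
```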