Model accuracy is a poor measure of performance when the data has a
very imbalanced distribution of outcomes. For example, if positive
cases account for just 1% of all cases, as might be the situation in
an insurance dataset recording cases of fraud or in medical diagnoses
of rare but terminal diseases, then a highly accurate, but useless,
model is one that predicts no fraud or diagnoses no disease in all
cases. It will be 99% accurate, yet it identifies none of the cases
of interest. In such situations, the usual goal of the model builder,
which is to build the most accurate model, does not match the actual
goal of the modelling, which is to identify the rare positive cases.
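
The arithmetic can be checked with a minimal sketch in Python. The
data here is hypothetical (1,000 cases, 10 of them positive), and the
"model" is simply a rule that always predicts the negative class.
Recall, the proportion of actual positives that the model identifies,
is introduced only as one illustrative alternative to accuracy.

```python
# Hypothetical data: 1,000 cases, of which 1% are positive (e.g. fraud).
n_cases = 1000
n_positive = 10
actual = [1] * n_positive + [0] * (n_cases - n_positive)

# A "model" that always predicts the majority (negative) class.
predicted = [0] * n_cases

# Accuracy: proportion of cases where the prediction matches the outcome.
correct = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = correct / n_cases

# Recall: proportion of the actual positive cases that were identified.
true_positives = sum(
    1 for a, p in zip(actual, predicted) if a == 1 and p == 1
)
recall = true_positives / n_positive

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.99
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

The model scores 99% on accuracy while finding none of the positive
cases, which is why a measure focused on the rare class is more
informative here.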

There are two common approaches to dealing with imbalance: sampling
and cost-sensitive learning.

Before describing these two approaches, it is worth noting that some
algorithms cope comparatively well with training data that has
imbalanced classes. Random forests, for example, can often build
models that capture under-represented classes quite well without any
such treatment of the training data.