Special Missing Values

Nicholas Tierney

2020-04-30

Data sometimes have special missing values to indicate specific reasons for missingness. For example, “9999” is sometimes used in weather data, such as the Global Historical Climatology Network (GHCN) data, to indicate specific types of missingness, such as instrument failure.

You might be interested in creating your own special missing values so that you can mark specific, known reasons for missingness. Examples include an individual dropping out of a study, known instrument failure in weather instruments, or values being censored in analysis. In these cases, the data are missing, but we have information about why they are missing. Coding these cases as NA would cause us to lose this valuable information. Other statistical programming languages like Stata, SAS, and SPSS have this capacity, but currently R does not. So, we need a way to create these special missing values.

We can use recode_shadow() to record the reason for missingness by recoding a special missing value as something like NA_reason. naniar stores these values in the shadow part of nabular data, which is a special data frame that contains missingness information.

This vignette describes how to add special missing values using the recode_shadow() function. First, we cover some terminology to explain these ideas, in case you are not familiar with the workflows in naniar.

Terminology

Missing data can be represented as a binary matrix of “missing” or “not missing”, which in naniar we call a “shadow matrix”, a term borrowed from Swayne and Buja, 1998.
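As a minimal sketch of this idea, the block below builds a small made-up data frame (the columns wind and temp and their values are illustrative assumptions, not data from a real source) and converts it to a shadow matrix with naniar's as_shadow(). It then uses bind_shadow() to attach the shadow matrix to the original data, which gives the nabular form mentioned earlier.

```r
library(naniar)

# Hypothetical data for illustration only: -99 stands in for a
# special missing value, and temp has one ordinary NA.
df <- data.frame(
  wind = c(-99, 68, 72),
  temp = c(45, NA, 25)
)

# One shadow column per variable, named <variable>_NA.
# Each is a factor marking a value as "!NA" (present) or "NA" (missing).
shadow <- as_shadow(df)
shadow

# Bind the shadow columns to the original data to get nabular data.
dfs <- bind_shadow(df)
names(dfs)
```

Note that -99 is not yet treated as missing here: as far as R is concerned it is an ordinary number, so only the NA in temp shows up in the shadow matrix.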

A call such as recode_shadow(wind = .where(wind == -99 ~ "broken_machine")) reads as “recode shadow for wind where wind is equal to -99, and give it the label ‘broken_machine’”. The .where function is used to help make our intent clearer, and reads very much like the dplyr::case_when() function, but takes care of encoding extra factor levels into the missing data.
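A runnable sketch of such a recode might look like the following. The data frame is a made-up example (its values are assumptions for illustration), and the call is written without a pipe so that it only needs naniar to be attached.

```r
library(naniar)

# Hypothetical data: -99 in wind marks a known machine failure.
df <- data.frame(
  wind = c(-99, 68, 72),
  temp = c(45, NA, 25)
)

# recode_shadow() works on nabular data, so bind the shadow first.
dfs <- bind_shadow(df)

# Label the shadow of wind as "broken_machine" wherever wind is -99.
dfs_recode <- recode_shadow(
  dfs,
  wind = .where(wind == -99 ~ "broken_machine")
)

dfs_recode
```

The label is stored in the shadow column wind_NA with an NA_ prefix, following the NA_reason pattern described above.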

The extra types of missingness are recoded in the shadow part of the nabular data as additional factor levels:
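One way to see this is to compare the factor levels of a shadow column before and after recoding. The sketch below again uses a small made-up data frame (an assumption for illustration) and inspects the levels with base R's levels().

```r
library(naniar)

# Hypothetical data: -99 in wind marks a known machine failure.
df <- data.frame(
  wind = c(-99, 68, 72),
  temp = c(45, NA, 25)
)

dfs <- bind_shadow(df)
dfs_recode <- recode_shadow(
  dfs,
  wind = .where(wind == -99 ~ "broken_machine")
)

# Levels of the shadow column before recoding
levels(dfs$wind_NA)

# Levels of the shadow columns after recoding
levels(dfs_recode$wind_NA)
levels(dfs_recode$temp_NA)
```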