Missing

Missing values present challenges to data mining and modelling in
general. There can be many reasons for missing values, including the
fact that the data is hard to collect, and so not always available
(e.g., results of an expensive medical test), or that it is simply not
recorded because it is in fact 0 (e.g., spouse income for someone
without a spouse). Knowing why the data is missing is important in
deciding how to deal with the missing values.

The Show Missing check button of the Summary option of the Explore
tab provides a summary of missing values in our dataset, as displayed
in Figure 6.1. Such information is useful in understanding the
structure of the missing data, and perhaps in coming to an
understanding of why the data is missing.

Figure 6.1:
Missing value summary for a modified version of the
audit dataset.

The missing value summary table is presented with the variables from
the dataset listed along the top. Each row corresponds to a pattern of
missing values. A 1 indicates a value is present, whereas a
0 indicates a value is missing. Each pattern is generally shared by a
collection of entities.

The left-hand column records the number of entities that exhibit that
pattern, so that the sum of this column (which is not actually shown
in the output) will equal the number of entities in our dataset. The
right-hand column records the number of variables with missing values
for each pattern. So the first row, corresponding to no missing values
for any variables, has a 0 in this column.

The final row records the number of missing values
over the whole dataset for each of the variables, with the total
number of missing values recorded at the bottom right.

The rows and columns are sorted in ascending order according to the
amount of missing data.
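
The layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code Rattle runs: the records are made-up toy data, None stands in for a missing value, and the function name missing_pattern_table is hypothetical.

```python
from collections import Counter

# Hypothetical toy records (not the audit dataset); None marks a missing value.
records = [
    {"Age": 38, "Marital": "Married", "Income": 81838.0},
    {"Age": None, "Marital": "Absent", "Income": 72099.0},
    {"Age": 35, "Marital": None, "Income": None},
    {"Age": 32, "Marital": "Divorced", "Income": 15419.0},
    {"Age": None, "Marital": "Married", "Income": None},
    {"Age": 45, "Marital": "Married", "Income": None},
]


def missing_pattern_table(rows):
    """Build a missing value pattern table (hypothetical helper)."""
    variables = list(rows[0])
    # Final row: number of missing values for each variable.
    totals = {v: sum(r[v] is None for r in rows) for v in variables}
    # Columns are sorted in ascending order of missing values.
    columns = sorted(variables, key=lambda v: totals[v])
    # A pattern is a tuple with 1 for present and 0 for missing.
    counts = Counter(
        tuple(0 if r[v] is None else 1 for v in columns) for r in rows
    )
    # Rows are sorted in ascending order of missing variables per pattern.
    ordered = sorted(counts.items(), key=lambda item: item[0].count(0))
    # Each body row: (number of entities, pattern, variables missing).
    body = [(n, pattern, pattern.count(0)) for pattern, n in ordered]
    final_row = [totals[v] for v in columns]
    return columns, body, final_row, sum(final_row)


columns, body, final_row, total = missing_pattern_table(records)

print("entities", columns, "missing")
for n, pattern, n_missing in body:
    print(n, list(pattern), n_missing)
print("total", final_row, total)
```

The left-hand counts sum to the number of records, and the bottom-right total equals the sum of the final row, matching the description of the table.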

Generally, the first row records the number of entities that have no
missing values, as is the case in Figure 6.1, where 1575 entities are
complete.

The second row corresponds to a pattern of missing values for the
variable Age. There are 39 entities that have just
Age missing (and there are 42 entities that have
Age missing, overall). This particular row's pattern has
just a single variable missing, as indicated by the 1 in the
final column.
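
The difference between the two counts (entities with just one variable missing versus entities with that variable missing at all) can be made concrete with a small Python sketch. The records are made-up toy data rather than the audit dataset, None stands in for a missing value, and both helper names are hypothetical.

```python
# Hypothetical toy records (not the audit dataset); None marks a missing value.
records = [
    {"Age": 38, "Marital": "Married", "Income": 81838.0},
    {"Age": None, "Marital": "Absent", "Income": 72099.0},
    {"Age": 35, "Marital": None, "Income": None},
    {"Age": None, "Marital": "Married", "Income": None},
]


def missing_only(rows, variable):
    """Count entities where `variable` is the only missing value."""
    return sum(
        r[variable] is None
        and all(value is not None for name, value in r.items() if name != variable)
        for r in rows
    )


def missing_overall(rows, variable):
    """Count entities where `variable` is missing, whatever else is missing."""
    return sum(r[variable] is None for r in rows)


age_only = missing_only(records, "Age")        # a single pattern row
age_overall = missing_overall(records, "Age")  # the final row entry for Age
print(age_only, age_overall)
```

The first count corresponds to the left-hand entry of a single pattern row, while the second corresponds to the entry for that variable in the final row of the table.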

The final row indicates that there are, for example, 37 missing values
for the variable Marital, and that there are 560 missing
values altogether in this dataset.