Monday, August 23, 2010

Some time ago, I wrote about missing values and how they complicate the life of an applied statistician. A particularly tricky case concerns logical variables, and I give a more detailed explanation here.

Suppose X is a variable representing whether a person is at risk for developing type-2 diabetes. Two known risk factors are (A) being older and (B) being overweight. If we had a database containing the age and weight of each person in a group, we could compute X using the following logical expression:

X = A OR B.

(X, A, and B are known as logical variables, and they each take values TRUE or FALSE according to whether the corresponding condition holds.) But what happens if some ages and weights are missing from the database? Fortunately statistical software packages like R and SPSS have built-in rules that will correctly evaluate the logical expression, even if A or B (or both) are missing. The complete truth table is as follows (where T means TRUE, F means FALSE, and a dot means missing):Note that if A is FALSE and B is missing, the result is missing. That makes sense because if the actual value of B were TRUE, the result would be TRUE, but if the actual value of B were FALSE, the result would be FALSE. Thus it is not possible to say what the value of X is.

The trouble is, this logic can sometimes be perverse. Suppose X instead represents whether a patient tests positive for infection with a certain virus. But there may be two different blood tests (A and B), and patients may receive one or the other, or perhaps both. Suppose that if any of the tests is positive, the patient will be considered to be infected. The logical variables A and B take the values TRUE, FALSE, or missing according to whether the corresponding tests were positive, negative, or simply not performed. Shouldn't the logical expression X = A OR B handle this situation correctly? Unfortunately not. Suppose only one test was performed, and it was negative. Then the truth table shows that X will be missing, even though the patient tested negative!

Why does the logical expression handle missing values the way we want in the first case, but fail to do so in the second? The answer is that in the first case a missing value represents the fact that the age or weight of a given person is not available, whereas in the second case, when a test outcome is missing from the database, it means that no test was performed, thus the variable representing the outcome is not applicable. Another common case of variables that are not applicable occurs with data representing observations on multiple occasions. For example, suppose a database records whether hotel guests eat at the hotel restaurant on the first day of their stay (EAT1), the second day (EAT2), or the third day (EAT3). Some guests stay for just one day, while others stay longer. The database may look like this:This is an example of a ragged array, and as with the blood test, the issue is that the denominator (the number of tests performed, or the number of days a guest stays at the hotel) varies. To determine whether a guest ate at the hotel restaurant at least once (which we will represent by the logical variable EAT), we might try:

EAT = EAT1 OR EAT2 OR EAT3.

Unfortunately, as with the blood test example, this can fail when there are missing values. Guest number 4 in the table above stayed just one day at the hotel and did not eat at the restaurant, so EAT should be FALSE, but the expression above gives a missing value.

Workaround in R

In R, the vertical bar operator | represents OR, and missing values are represented by NA. For the diabetes example, the following behaviour is just what we want:> FALSE | NA[1] NA

In other words, when a person does not have one of the risk factors, but we don't know about the other one, then we don't know if the person is at risk. But for the blood test example, we need to use the following code:> sum(FALSE,NA,na.rm=TRUE)>0[1] FALSE

The sum function adds up logical values by treating TRUE as 1 and FALSE as zero. If the sum of the logical values is greater than zero, then at least one of the values must have been TRUE. Setting na.rm=TRUE tells sum to ignore missing values.

Workaround in SPSS

The situation is much the same in SPSS. For the diabetes example, the following works: COMPUTE X = A OR B.EXECUTE.

But for the blood test example, we need to use:COMPUTE X = SUM(A,B)>0.EXECUTE.

Note that the SPSS function SUM ignores missing values.

Missing value mistakes

The hard part, of course, is thinking through how the missing values in a given situation should be handled. I suspect that this issue has resulted in countless errors in data analyses. Proceed with caution: a miss is as good as a mile!