Wednesday, February 11, 2015

R's Tricky == Operator, or "It depends on what the meaning of the word 'is' is"

One place where R can trip up a programmer is the == operator and its relatives (<, <=, and so on). The help page notes that "NA values are regarded as non-comparable", so any comparison involving an NA returns NA rather than TRUE or FALSE, which can lead to unexpected behavior.
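
A quick sketch of that rule at the console may help. The vector x is just an illustrative name, not part of the example that follows:

```r
# A comparison involving NA yields NA, not TRUE or FALSE
NA == 4      # NA
NA == NA     # NA, since NA is not even comparable to itself

x <- c(3, NA, 4)   # x is a made-up vector for illustration
x == 4       # FALSE NA TRUE
is.na(x)     # FALSE TRUE FALSE, the reliable way to test for NA
```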

As a toy example, look at what happens when subsetting on a column that includes NA values.

df <- data.frame(a=11:15,b=c(3,NA,4,4,NA))

df

df[df$b==4,]

df[df$b<=4,]
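
To see what the subset is working from, it can help to look at the logical index on its own. This is a small sketch that rebuilds the same df so the block runs by itself; idx is a name introduced here for illustration:

```r
df <- data.frame(a = 11:15, b = c(3, NA, 4, 4, NA))

idx <- df$b == 4   # idx is a hypothetical helper name
idx                # FALSE NA TRUE TRUE NA

# Only two values of b equal 4, yet the subset has four rows
nrow(df[idx, ])          # 4
sum(idx)                 # NA
sum(idx, na.rm = TRUE)   # 2
```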

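In each case, the result contains an extra row for every NA in the b column, filled entirely with NA values. This can be surprising, and it is easy to miss when the subset is wrapped inside an aggregation such as nrow or sum. A safer way to do the equality subsetting is the %in% operator, which returns FALSE rather than NA when the value being tested is NA. Like so: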