Friday, November 30, 2012

Data types part 4: Logical class

First, an update: a commenter has asked me to post my code so that it is easier to practice the examples I show here. It will take me a little time to get all of my code for past posts well documented and readable, but I have uploaded the code and data for the last 4 posts, including this one, here:

Unfortunately, I could not find a way to attach it to Blogger, so sorry for the extra step.
_________________________________________________________________________

Ok, now on to Data types part 4: Logical

I started this series of posts on data types by saying that when you have a dataframe like this called mydata:

you can't do this in R:

Age<25

This fails because Age does not exist as an object in R, so you get the error below:
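Here is a minimal sketch that reproduces the problem. The mydata dataframe below is a made-up stand-in (the ages are invented for illustration, not the data from this post), and I wrap the call in tryCatch() only so the script keeps running and the error message can be printed:

```r
# Hypothetical stand-in for mydata; the values are invented for illustration
mydata <- data.frame(Age = c(18, 22, 31, 45, 24))

# Age only exists as a column inside mydata, so the bare name cannot be found.
# tryCatch() captures the error message instead of stopping the script.
msg <- tryCatch(
  {
    Age < 25
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```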

But then what happens when I do,

mydata$Age<25

This is perfectly legal to do in R, but it's not going to drop observations. With this kind of statement, you are asking R to evaluate the logical question "Is it true that mydata$Age is less than 25?". Well, that depends on which element of the Age vector, of course. Which is why this is what you get when you run that code:

At first glance, this looks like a character vector. After all, it is a string of entries made up of letters. But it's not the character class, it's the logical class. If you save this string of TRUE and FALSE entries into an object and print its class, this is what you get:
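As a rough sketch of that check, here it is on a small made-up mydata (the ages are invented for illustration). I save the result as mydata$Ageunder25, the name used further down in this post:

```r
# Hypothetical stand-in for mydata; the values are invented for illustration
mydata <- data.frame(Age = c(18, 22, 31, 45, 24))

# Comparing a vector to a number returns one TRUE/FALSE per element
mydata$Ageunder25 <- mydata$Age < 25
print(mydata$Ageunder25)

# The result is stored as logical, not character
print(class(mydata$Ageunder25))
```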

The logical class can only take on two values, TRUE or FALSE (plus NA when a value is missing). We've already seen logical operations being evaluated, first in subsetting, like this:

mysubset<-mydata[mydata$Age<40,]

Check out my post on subsetting if this syntax is confusing. In a nutshell, R evaluates the condition for every row and keeps only the rows that meet it, here the rows where Age is under 40, along with all columns.

Or here, in ifelse() statements:
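To make the role of the logical vector explicit, here is a small sketch on made-up data (the ID column and all values are assumptions for illustration). The intermediate object keep is a hypothetical name I use just to show the TRUE/FALSE vector that does the row selection:

```r
# Hypothetical stand-in for mydata; the values are invented for illustration
mydata <- data.frame(ID = 1:5, Age = c(18, 22, 31, 45, 24))

# One TRUE/FALSE per row: TRUE means "keep this row"
keep <- mydata$Age < 40
print(keep)

# Rows where keep is TRUE are retained; leaving the column slot empty keeps all columns
mysubset <- mydata[keep, ]
print(mysubset)
```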

mydata$Young<-ifelse(mydata$Age<25,1,0)

More on ifelse() statements here. The ifelse() function is really useful, but is actually overkill when you're just creating a binary variable. This can be done faster by taking advantage of the fact that logical values of TRUE always have a numeric value of 1, while logical values of FALSE always have a numeric value of 0.

That means all I need to do to create a binary variable of under age 25 is to convert my logical mydata$Ageunder25 vector into numeric. This is very easy with R's as.numeric() function. I do it like this:

mydata$Ageunder25_num<-as.numeric(mydata$Ageunder25)

or directly without that intermediate step like this:

mydata$Ageunder25_num<-as.numeric(mydata$Age<25)

Let's check out the relevant columns in our dataframe:

We can see that the Ageunder25_num variable is an indicator of whether the Age variable is under 25.
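Here is a quick sketch on made-up data (the ages are invented for illustration) that prints the relevant columns and also checks that the as.numeric() shortcut gives the same 0/1 values as the ifelse() version from earlier:

```r
# Hypothetical stand-in for mydata; the values are invented for illustration
mydata <- data.frame(Age = c(18, 22, 31, 45, 24))

# Logical version, then the numeric indicator built directly from the comparison
mydata$Ageunder25 <- mydata$Age < 25
mydata$Ageunder25_num <- as.numeric(mydata$Age < 25)

# The ifelse() version, for comparison
mydata$Young <- ifelse(mydata$Age < 25, 1, 0)

print(mydata[, c("Age", "Ageunder25", "Ageunder25_num")])

# TRUE if both approaches agree on every row
print(all(mydata$Ageunder25_num == mydata$Young))
```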

Now the really, really useful part of this is that you can use this feature to turn on and off a variable depending on its value. For example, say you got your data and realized that some of the height values were in inches and some were in centimeters, like this:

Those heights of 152 and 170 are in centimeters while everything else is in inches. There are various ways to fix it, but one way is to check which values are less than, say, 90, which is probably a safe cutoff, and create a new column that keeps the values under 90 as they are but converts the values over 90. We can do this in this way:
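A minimal sketch of this kind of calculation, on made-up data: only the 152 and 170 values and the Height_wrong column name come from this post, while the other heights and the new column name Height_fixed are assumptions for illustration. I also use >= 90 for the second half so that a value of exactly 90 is not lost.

```r
# Hypothetical data: 152 and 170 are in centimeters, the rest are in inches
mydata <- data.frame(Height_wrong = c(64, 152, 70, 170, 61))

# First half: keeps the value when it is under 90 (already inches)
# Second half: converts the value when it is 90 or more (centimeters to inches)
mydata$Height_fixed <-
  as.numeric(mydata$Height_wrong < 90) * mydata$Height_wrong +
  as.numeric(mydata$Height_wrong >= 90) * mydata$Height_wrong / 2.54

print(mydata)
```

For any given row exactly one of the two logical tests is TRUE, so exactly one half contributes and the other is 0.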

So the first half of the calculation is "turned on" when Height_wrong is less than 90: the logical statement evaluates to TRUE, which is numerically a 1, and this 1 is multiplied by the original Height_wrong value. The second half of the statement is FALSE for those rows, so it is just 0 times something, which is 0. If Height_wrong is greater than 90, then the first half is just 0 and the second half is turned on, so Height_wrong is divided by 2.54 (the number of centimeters in an inch), converting it into inches. We get the result below:

Another useful way to use as.numeric() and the logical class to your advantage is a situation like this:

I have in my dataset the age of the last child born (and probably other characteristics of this child not shown), and then just the number of other children for each woman. I want to get a total number of children variable. I can do it simply in the following way.

First, a note about the is.na() function. If you want to check if a variable is missing in R, you don't use syntax like "if variable==NA" or "if variable==.". Comparing anything to NA with == just returns NA, so it will not flag a missing value. What you want to use instead is is.na(variable), like this:

is.na(newdata$Child1age)

Which gives you a logical vector that looks like this:

If you want to check if a variable is not missing, you use the ! sign (meaning "Not") in front and check it like this:
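A short sketch of both checks on made-up data (the Child1age values are invented for illustration, and has_child is a hypothetical name I use for the result):

```r
# Hypothetical stand-in for newdata; NA means no age was recorded
newdata <- data.frame(Child1age = c(3, NA, 7, NA, 1))

# Comparing with == NA does not work: every element comes back as NA
print(newdata$Child1age == NA)

# is.na() flags the missing entries
print(is.na(newdata$Child1age))

# Putting ! in front flips it: TRUE where the value is NOT missing
has_child <- !is.na(newdata$Child1age)
print(has_child)
```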

We've seen this kind of thing before! Now we can translate this logical vector into numeric and add it to the number of other children, like this:
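A minimal sketch of that last step, on made-up data. Only Child1age is a column name from this post; Otherchildren and Totalchildren are names I am assuming for illustration, and I am also assuming that a missing Child1age means the woman has no last-born child recorded:

```r
# Hypothetical stand-in for newdata; all values are invented for illustration
newdata <- data.frame(
  Child1age = c(3, NA, 7, NA, 1),
  Otherchildren = c(2, 0, 0, 0, 1)
)

# !is.na() is TRUE when a last child exists; as.numeric() turns that into 1 or 0,
# which is then added to the count of other children
newdata$Totalchildren <-
  as.numeric(!is.na(newdata$Child1age)) + newdata$Otherchildren

print(newdata)
```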

Data and Code Download

NB: It's been pointed out to me that some images don't show up on IE, so you'll need to switch to Chrome or Firefox if you are using IE. Thanks!
