R for Public Health

Monday, March 9, 2015

Lists are a data type in R that are perhaps a bit daunting at first, but soon become amazingly useful. They are especially wonderful once you combine them with the powers of the apply() functions. This post will be part 1 of a two-part series on the uses of lists. In this post, we will discuss the basics - how to create lists, manipulate them, describe them, and convert them. In part 2, we'll see how using lapply() and sapply() on lists can really improve your R coding.
Creating a list
Let's start with some basics: what lists are and how to create them. A list is a data structure that can hold any number of any types of other data structures. If you have vector, a dataframe, and a character object, you can put all of those into one list object like so:
create three different classes of objects
add all three objects to one list using list() function
print list
list1
We can also turn an object into a list by using the as.list() function. Notice how every element of the vector becomes a different component of the list.
coerce vector into a list
Manipulating a list
We can put names on the components of a list using the names() function. This is useful for extracting components. We could have also named the components when we created the list. See below:
name the components of the list
list1
could have named them when we created list
Extract components from a list (many ways): the first is using the [[ ]] operator (notice two brackets, not just one). Note that we can use the single [ ] operator on a list, but it will return a list rather than the data structure that is the component of the list, which is normally not what we would want to do. See what I mean here:
extract 3rd component using -> this returns a *string*
list1[[3]]
print a list containing the 3rd component -> this returns a *list*
list1[3]
It is also possible to extract components using the operator or the components name. Again, be careful about the [ ] vs [[ ]] operator in the second way. You need the [[ ]] to return the data structure of the component.
extract 3rd component using
extract 3rd component using [[ ]] and the name of the component
Subsetting a list - use the single [ ] operator and c() to choose the components
subset the first and third components
list1[c(1,3)]
We can also add a new component to the list or replace a component using the $ or [[ ]] operators. This time I'll add a linear model to the list (remember we can put anything into a list).
add new component to existing list using
add a new component to existing list using
Finally, we can delete a component of a list by setting it equal to NULL.
delete a component of existing list
list1
Notice how the 5th component doesn't have a name because we didn't assign it one when we added it in. Now if we want to extract the dataframe we have in the list, and just look at it's first row, we would do list1[[2]][1,]. This code would take the second component in the list using the [[ ]] operator (which is the dataframe) and once it has the dataframe, it subsets the first row and all columns using only the [ ] operator since that is what is used for dataframes (or matrices).
For help on subsetting matrices and dataframes, check out [this post](http://rforpublichealth.blogspot.com/2012/10/quick-and-easy-subsetting.html).
extract first row of dataframe that is in a list
list1[[2]][1,]
Describing a list
To describe a list, we may want to know the following:
the class of the list (which is a list class!) and the class of the first component of the list.
describe class of the whole list
class(list1)
describe the class of the first component of the list
class(list1[[1]])
the number of components in the list - use the length function()
find out how many components are in the list
length(list1)
a short summary of each component in the list - use str(). (I take out the model because the output is really long)
take out the model from list and then show summary of what's in the list
str(list1)
Now we can combine these functions to append a component to the end of the list, by assigning it to the length of the list plus 1:
Initializing a list
To initialize a list to a certain number of null components, we use the vector function like this:
initialize list to have 3 null components and print
list2
Converting list into matrix or dataframe
Finally, we can convert a list into a matrix, dataframe, or vector in a number of different ways. The first, most basic way is to use unlist(), which just turns the whole list into one long vector:
convert to one long string - use unlist
unlist(list1)
But often we have matrices or dataframes as components of a list and we would like to combind them or stack them into one dataframe. The following shows the two good ways I've found to do this (from [this StackOverflow](http://stackoverflow.com/questions/4227223/r-list-to-data-frame) page) using ldply() from the plyr package or rbind().
Here, we first create a list of matrices and then convert the list of matrices into a dataframe.
create list of matrices and print
mat.list
convert to data frame
1. use ldply
require(plyr)
ldply(mat.list, data.frame)
2. use rbind
do.call(rbind.data.frame, mat.list)
Get ready for part 2 next time, when we'll get to what we can use lists for and why we should use them at all.

Friday, December 26, 2014

Happy New Year everyone! For the last post of the year, I thought I'd have a little fun with the new animation package in R. It's actually really easy to use. I recently had some fun with it when I presented my research at an electronic poster session, and had an animated movie embedded into the powerpoint.
All of the GIFs above use ggplot and the animation packages. The main idea is to iterate the same plot over and over again, changing incrementally whatever it is that you want to move in the graph, and then save all those plots together into one GIF.
Let's start with first plot that traces a regression line over the scatterplot of the points. We'll make up some data and fit a loess of two degrees (default). You could easily do this with any kind of regression.
#make up some data
tracedat

Monday, October 20, 2014

Public health data can often be hierarchical in nature; for example, individuals are grouped in hospitals which are grouped in counties. When units are not independent, then regular OLS standard errors are biased.
One way to correct for this is using clustered standard errors. This post will show you how you can easily put together a function to calculate clustered SEs and get everything else you need, including confidence intervals, F-tests, and linear hypothesis testing.
Under standard OLS assumptions, with independent errors,
$$V_{OLS} = \sigma^2(X'X)^{-1}$$
We can estimate $\sigma^2$ with $s^2$:
$$s^2 = \frac{1}{N-K}\sum_{i=1}^N e_i^2$$
where N is the number of observations, K is the rank (number of variables in the regression), and $e_i$ are the residuals from the regression.
But if the errors are not independent because the observations are clustered within groups, then confidence intervals obtained will not have $1-\alpha$ coverage probability. To fix this, we can apply a sandwich estimator, like this:
$$V_{Cluster} = (X'X)^{-1} \sum_{j=1}^{n_c} (u_j'*u_j) (X'X)^{-1}$$
where $n_c$ is the total number of clusters and $u_j = \sum_{j_{cluster}}e_i*x_i$.
$x_i$ is the row vector of predictors including the constant. Programs like Stata also use a degree of freedom adjustment (small sample size adjustment), like so:
$$\frac{M}{M-1}*\frac{N-1}{N-K} * V_{Cluster}$$
where M is the number of clusters, N is the sample size, and K is the rank.
In this example, we'll use the Crime dataset from the plm package. It includes yearly data on crime rates in counties across the United States, with some characteristics of those counties. Let's load in the libraries we need and the Crime data:
library(plm)
library(lmtest)
data(Crime)
We would like to see the effect of percentage males aged 15-24 (pctymle) on crime rate, adjusting for police per capita (polpc), region, and year. However, there are multiple observations from the same county, so we will cluster by county.
In Stata the commands would look like this.
reg crmrte pctymle polpc i.region year, cluster(county)
In R, we can first run our basic ols model using lm() and save the results in an object called m1. Unfortunately, there's no 'cluster' option in the lm() function. But there are many ways to get the same result
#basic linear model with standard variance estimate
Crime$region

Monday, July 7, 2014

My [previous blog post](http://rforpublichealth.blogspot.com/2014/06/how-to-write-and-debug-r-function.html) went over the basics of the syntax and debugging of user-written functions. In this post, I'll show you examples of useful functions that you can write to make your life easier when using R.
Here is the data we'll be using for this post:
set.seed(10)
bpdata

Sunday, June 8, 2014

I've been asked on a few occasions what is the deal with R user-written functions. First of all, how does the syntax work? And second of all, why would you ever want to do this?
In Stata, we don't write functions; we execute built-in commands like **browse** or **gen** or **logit**. You can write your own programs that create new commands (like ado files) but it's less common for users to do so.
In R, there are built-in functions like **summary()** or **glm()** or **median()**, but you can also write your own functions. You can write a quick, one-line function or long elaborate functions. I use functions all the time to make my code cleaner and less repetitive.
In this post I'll go over the basics of how to write functions. In the next post, I'll explain what kinds of functions I have used commonly in public health research that have improved my data cleaning and analyses.
Basic syntax of a function
A function needs to have a name, probably at least one argument (although it doesn't have to), and a body of code that does something. At the end it usually should (although doesn't have to) return an object out of the function. The important idea behind functions is that objects that are created within the function are local to the environment of the function - they don't exist outside of the function. But you can "return" the value of the object from the function, meaning pass the value of it into the global environment. I'll go over this in more detail.
Functions need to have curly braces around the statements, like so:
name.of.function

Monday, February 17, 2014

In the third and last of the ggplot series, this post will go over interesting ways to visualize the distribution of your data. I will make up some data, and make sure to set the seed.
library(ggplot2)
library(gridExtra)
set.seed(10005)
xvar

Data and Code Download

NB: It's been pointed out to me that some images don't show up on IE, so you'll need to switch to Chrome or Firefox if you are using IE. Thanks!

Why R for public health?

I created this blog to help public health researchers that are used to Stata or SAS to begin using R. I find that public health data is unique and this blog is meant to address the specific data management and analysis needs of the world of public health.

R is a very powerful tool for programming but can have a steep learning curve. In my experience, people find it easier to do it the long way with another programming language, rather than try R, because it just takes longer to learn. I think all statistical packages are useful and have their place in the public health world. However, I am a strong proponent of R and I hope this blog can help you move toward using it when it makes sense for you.

Please email me with posts you would like to see or R questions, and I'll try my best to answer them. Thanks for following!