The US Census provides an incredible wealth of data but it’s not always easy to work with it. In the past, working with the tabular and spatial census data generally meant downloading a table from FactFinder and a shapefile from the boundary files site and joining the two, perhaps in a GIS system. These files could also be handled in R but getting the data, reading it into R and, in particular, merging tabular and spatial data can be a chore. Working with slots and all the different classes of data and functions can be challenging.

A recent interesting post on stackoverflow by Claire Salloum prompted me to revisit this issue in R and I've definitely found some valuable new packages for capturing and manipulating Census data in R.

Edited with an update related to ggplot2 version 2.0 on May 6, 2016

Census data the hard way

The tabular data described in the stackoverflow post can be downloaded from this FactFinder link. Note that I clicked on download and then chose “Data and annotations in a separate file” and included the descriptive data element names. I also posted a version with just the variables used in this post. The spatial data for tracts comes from this link. We are using data from the US Census' American Community Survey.

And now the tabular data we downloaded from FactFinder. We will use dplyr functions to manipulate the data and clean it up. If these functions are unfamiliar to you, take a look at a previous blog post on dplyr.

Plotting census data with ggplot2

In order to plot with ggplot2 we need to combine our spatial and tabular data AND the result needs to be a data frame. Luckily, ggplot2 has a nice function, fortify, to help us convert the polygons to a data frame and we can merge the ACS data to this data frame:

This works. Not particularly pretty, but functional. We could play around with the color scale to come up with something better or we could invest our time in creating something more useful, an interactive map. Let’s give this a try.

By the way, if you’d like more discussion on mapping with ggplot2, you could look at this blog post.

Interactive plotting with the leaflet package

In a previous blog post I showed how to use the leafletR package but I've now moved to using the leaflet package from RStudio instead primarily because it does not require that the user convert a spatial object to a geoJSON object first. Robin Lovelace has a nice summary of leaflethere. There is also a nice post by Marco Sciaini here.

The leaflet package requires the data input to be a spatial object. In our example, we started with a SpatialPolygonsDataFrame (which we created by reading using readORG) and then we converted this to a vanilla data frame using the fortify function. Given our example there are two approaches. You can convert your data.frame back to a spatial data frame and map with leaflet or you can make use of your existing spatial data frame (called tract above). I'll show the code for both versions (though, spoiler alert, using the existing data frame is much easier).

Option 1: Convert the data.frame back to a SpatialPolygonsDataFrame

Here you need to go through a lot of hoops. There may be an easier approach I have not considered but my experience is that you need to convert each tract to a Polygon, then to a Polygons object, then to a SpatialPolygons object and finally to a SpatialPolygonsDataFrame. No, this is not not fun!

Here is the code. Note that I start by creating a function that will extract the records for a particular tract and then convert this tract data to a Polygon and then Polygons object:

Options 2: Make use of the existing SpatialPolygonsDataFrame

Rather than go through the hoops of converting the SpatialPolygonsDataFrame to a data.frame and back let's make use of the original SpatialPolygonsDataFrame that we read in with readOGR.

class(tract)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
# create a new version
df.polygon2<-tract #tract is the
# create a rec-field to make sure that we have the order correct
# this probably is unnecessary but it helps to be careful
df.polygon2@data$rec<-1:nrow(df.polygon2@data)
tmp <- left_join(df.polygon2@data, data, by=c("GEOID"="id")) %>%
arrange(rec)
# replace the original data with the new merged data
df.polygon2@data<-tmp
# limit to NYC
df.polygon2 <- df.polygon2[grep("Kings|Bronx|New York County|Queens|Richmond", df.polygon2$geography),]
#df.polygon2 <- df.polygon2[order(df.polygon2$percent),]

That was definitely a simpler way to map using downloaded census data. But what if we did not have to manually download data and spatial data? Enter the packages acs and tigris.

Census data the easy(er) way

There are two great packages that make downloading tabular and spatial data from the Census websites unnecessary for many cases. You can instead use the acs package by Ezra Glenn to download tabular data and the tigrispackage by Kyle Walker and Bob Rudis. These packages make the process quite a bit smoother and self-contained.

The code from this example is extracted in a large part from Kyle Walker's helpful page on tigris.

1) Set up the packages

library(tigris)
library(acs)
library(stringr) # to pad fips codes

2) Get the spatial data (tigris)

# note that you can use county names in the tigris package but
# not in the acs.fetch function from the acs package so I'm using
# fips numbers here.
# grab the spatial data (tigris)
counties <- c(5, 47, 61, 81, 85)
tracts <- tracts(state = 'NY', county = c(5, 47, 61, 81, 85), cb=TRUE)

3) Get the tabular data (acs)

In order to do this you will need an API key from the US Census. Go to this site to request one (takes a minute or two) and then use the api.key.install function in the `acs` package to use the key.

If you get an error pandoc document conversion failed your data may be too big to save and you may need to use a less detailed version of the spatial data. Thanks to this post for pointing this out.

One issue with the leaflet map

An astute observer will notice that the stroke lines (outlines) are not being rendered in these maps. In RStudio the lines appear (see image below). I experimented with opacity, width, color and even within the Chrome developer tools. I tried to save from the RStudio plot window and I tried saveWidget. If you have ideas please let me know.

Rarely is the solution to an issue like this so simple. Thanks to Bob Rudis for pointing out that I should be using the hex codes for colors rather than using named colors so that they can be recognized in a web setting. I’ve made the changes above and it worked perfectly.

I updated the post with three things: a better link to the data at Census, a link to a ZIP with the data and an update about the API key needed to get ACS data. If you’re willing to test out the data links that would be helpful.

Very nice post! thanks a lot!
Is there an easy way to produce maps for all of the USA instead of just a state? (Is there a “state code” that gives all the states, and if yes, is there a file with all the counties?)
Or do we need to loop over states?

Wonderful tutorial. I’ve been playing with census data regarding American Indian land and populations. I didn’t know about the acs package. The link you provide to handling other spatial files is great–I’ll be playing with this all week.

Hi, is it possible you missed a step? We just re-ran the code with the newest ACS and it worked. Both Hollie and I tested it out she was using acs 2.0, tigris 0.2.2 and R 3.2.5 and I’m using acs 1.2, tigris 0.2.2.9000, r 3.2.3. Let me know if you still have trouble.

FAQS:
Q: What is the income_df creation with str_pad doing?
A: Constructing GEOIDs. These will need to match the GEOID from the “tracts” object in geo_join(). The tracts object as written in this example is only valid if you are going the level of tracts. If you go up or down levels, see other functions (block_groups(), blocks(), etc). Basically look at the number of numerals in your GEOID from acs.get(), compared to the GEOIDs link above, and make sure you have the same numeral length.

Hi Zev, I have a question. I follow your code which goes well until I run “ggtract <- fortify(tract, region = "GEOID")". It keep giving me "isTRUE(gpclibPermitStatus()) is not TRUE". I tried many methods, but still cannot solve this problem. What is the aim of this step, is it trying to convert mapping data to the data frame? Thanks in advance!