Species distribution models in R

Update, 20 August 2013: Many readers have requested a copy of the Joshua tree data set used as an example in this post, and I’ve finally secured permission from the coauthors of the original study to post it to Dryad. You can download it right here. Happy coding!

One of our most consistently popular posts of the past few months has been Kim Gilbert’s introduction to using geographic data to make maps in R. But once you’ve made a nice map of your collection sites, you’ve only just started to tap the possibilities of spatial data in R. With a suite of packages anchored by dismo, you can use R and open-sourced climate data to determine the environmental conditions your favorite species requires—by building a species distribution model.

Species distribution models (SDMs) are handy any time you want to extrapolate where a species might be based on where you know it actually is. Maybe you’re trying to figure out where would be fruitful to do more sampling; maybe you want to know where your favorite critters probably lived back during the last ice age; maybe you want to know what regions will be suitable for your favorite critters after another century of global climate change. The basic idea is,

Take a list of locations where you know you can find a species and identify the climate (or other environmental conditions) at those locations;

Build a statistical model (using one or more of several available methods) that differentiates the climate (or other conditions) at the points where your species is found from other points where your species isn’t found; and

Apply the model to climate (or other conditions) from some other time or place to estimate a probability that your species would be happy there.

This post provides a little demonstration of what you can do given a reasonably good set of collection sites and a no-longer-cutting-edge laptop. But caveat lector: actually building SDMs for publication-grade analysis requires a lot more work that what I’m presenting here. If you like the possibilities, you should start by reading the “vingnette” documents “Species distribution modeling in R” [pdf] (by dismo developers Roger Hijmans and Jane Elith) and “Boosted regression trees for ecological modeling” [PDF] (by Elith and John Leathwick). Those provide lots of detail and further reading lists, including a relatively recent review by Hijmans and Elith. They’re also the starting point for the demonstrations below.

First, you’ll want some distribution data. For this demonstration, I’ll use coordinates from a 2009 SDM study of Joshua trees by Godsoe et al. (fullest possible disclosure: I’m part of et al.). These are more than 5,000 locations where Joshua trees have been spotted—which is relatively easy, since they’re the tallest things that grow in most of the Mojave Desert—compiled from public databases, then ground-truthed and supplemented by the coauthors. All of those points are plotted in the map up at the beginning of the post; here’s the code that will set that up:

The first thing you might notice is that they’re not very evenly distributed. Some of that is because Joshua trees aren’t very evenly distributed, but some of this is sampling effort isn’t very evenly distributed. To compensate for that, we can divide the map up into a grid and randomly draw a single point from each grid square. I’m using grid squares of 0.05 degrees of latitude x 0.05 degrees of longitude, which are equivalent to somewhat less than 5 kilometers square at these latitudes. (This is rather crude at this scale; over much larger areas, an accurate version of this calculation gets complicated, because the Earth, incoveniently, is not flat.)

And, when you plot the new, subsampled dataset on top of the original, you should see a more even distribution.

Subsampled points in lighter green, superimposed on the original points in darker green.

Now you have your location points. But you also need some points that define regions where Joshua trees aren’t found. This is a bit tricky, philosophically speaking. The best you can do is hike the Mojave Desert to identify points where a Joshua tree isn’t growing—but how do you know it wouldn’t grow at any given point, given the chance? Conversely, you could sample points where you know Joshua trees will never grow—like, say, the Canadian Rockies—but that wouldn’t make a very informative model. One option is to define a “background” region to sample at random, which (we hope) captures environmental conditions that Joshua trees could disperse to, but hasn’t.

To do this, we’ll define circular regions, each with a radius of 50km, centered on each point in our presence list, then draw random points that must fall within those circles. Effectively, this draws random points that must be within 50 of a point where Joshua trees have been spotted. These will be in regions that have reasonably similar environments to the points where we’ve identified Joshua trees.

If we plot the circles (dark areas) and the random “pseudo-absence” points (orange circles) in bg on our map, here’s what they look like:

Our “pseudo-absences,” drawn within 50 km of at least one Joshua tree.

We have presence points and absence points. Now we need to know what the climate is like at those points. For this, we’re ready for climate data. Conveniently, the raster package provides functions to download the Bioclim open-source climate dataset, 19 variables derived from the WORLDCLIM database of temperature and rainfall measurements taken worldwide from 1950 to 2000. Bioclim is available at several different spatial resolutions; this demostration is for the 2.5 arc-minute resolution dataset, which contains worldwide data in a single file. At the highest resolution, you can only download it in 30-arc-second-square “tiles.” Even at this lower resolution, the file is hundreds of megabytes; we’re going to need a fair bit of memory.

Note also the path option in the getData() function; this lets us specify a location to place the downloaded data, and a place that R will check before it downloads again, in a future session. If the file is unwieldy, you might want to cut it down by defining the geographic “extent” you’re interested in, and using crop() to, er, crop the data file. The extent is defined as (1) the westernmost longitude, (2) the easternmost longitude, (3) the southernmost latitude, and (4) the northermost latitude:

To estimate the effectiveness of a predictive model, it’s common practice to set aside part of the dataset to validate the model. We’ll estimate the model from the total dataset minus the set-aside, then check the model’s ability to correctly identify the points in the set-aside as presences or absences. If we then re-divide the data, then re-estimate the model, then re-validate, we can begin to get a sense for how effective a given method is at estimating the SDM. For this purpose, the function kfold() creates a vector of values you can use to break your data into k groups of equal size, each of which can be used as the set-aside for testing over several iterations of this validation cycle.

Now we have the data we can (finally!) use to build a species distribution model. One common method for this is the algorithm implemented in Maxent, which uses a “maximum entropy” method to identify differences between conditions at presence points and conditions at absence points. The dismo package includes functions that call and manipulate Maxent, but first you need to install the javascript maxent.jar in an appropriate location. When you call Maxent from within R for the first time, if this file isn’t in place, R will complain that it’s not there, and tell you where it should be. On my MacBook, the folder is /Library/Frameworks/R.framework/Versions/3.0/Resources/ library/dismo/java/.

Then, we’re ready to go. Since we defined 5 groups, let’s pick a number between 1 at 5 to use for our validation set-aside:

test = 3

Then we use test and the kfold groupings to divide the presence and absence points:

Now, estimate a maxent model using the “training” points and the Bioclim data. This may take a few moments to run:

me = maxent(BClim, p=train_p, a=train_a)

To validate the model, we use the appropriately named function.

e = evaluate(test_p, test_a, me, BClim)
e

Among the stats you’ll see when you call the variable e is the AUC, or area under curve, statistic. This is a measure of how good the model me is at identifying a given geographic point as either a presence or an absence; when AUC = 0.5, the model is no better than a coin flip. For the data and I’ve been using up to now, the AUC is about 0.842, which isn’t terrible. If I were serious about developing a publishable SDM, I’d now repeat the model estimation procedure with different values for test, to use different portions of the total data for validation; the range of AUC scores I got would tell me, in general, how well the default Maxent procedure works, and I could tweak the procedure I use for selecting a set of pseudo-absence points, or the range of Bioclim variables I use to try and improve the average AUC.

In the meantime, to visualize the predictions from our model, we can estimate its predicted probability of Joshua tree presence for every cell in our Bioclim data set, then plot those probabilities:

Probability of Joshua trees, given our Maxent model and the Bioclim data.

With our presence and pseudo-absence points plotted over it, you can see there are some places where the model is good at differentiating between them, and others where it’s not so good.

That puts R through most of its SDM-estimation paces. Again, this is a starting point at best! Before you get too far with this, you’ll want to look into alternative estimation methods, such as boosted regression trees (Godsoe et al. used both Maxent and BRTs), and think very carefully about how you select presence and absence points, what environmental data you use, and how you handle them all.