Random thoughts on ecology, biodiversity, and science in general

Category: R

Being an ecologist is all about the trade-off between effort on one hand, and time and money on the other. Given infinite amounts of both, we would undoubtedly sample the heck out of nature. But it would be an exercise in diminishing returns: after a certain amount of sampling, there would be no stone left to turn. Thus, ecology is a balancing act of effort: too little, and we have no real insight; too much, and we’ve wasted a lot of time and money.

In a nutshell, the paper introduces a method for “assessing sample-size adequacy in studies of ecological communities.” Put slightly differently, the authors have devised a technique for determining when additional sampling no longer improves one’s ability to describe whole communities — both the number of species, and their relative abundances. Perfect for evaluating when enough is enough, and adjusting the outlay of time and money!

In this post, I dig into this technique, show its applications using an example, and introduce a new R function to assess multivariate precision quickly and easily.

Recently, I was exploring techniques to interpolate some missing environmental data, and stumbled across something called ‘random forest’ analysis. Random what now? I did a little digging and came across the massive and insanely complicated field of machine learning. I couldn’t find a concise guide to machine learning techniques, or when I might want to use one or the other, so I thought I would cobble together a brief guide on my own. Below is a rough stab at explaining and exploring different machine learning techniques, from CARTs to GBMs, using R.
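As a quick taste of where that guide ends up, here is a minimal sketch (not code from the post itself) of fitting a random forest with the `randomForest` package, using the built-in `iris` data. The variable names and settings here are illustrative assumptions, not the post's actual analysis.

```r
# Illustrative sketch: a random forest classifier on the iris data,
# using the randomForest package (assumed installed)
library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)

rf              # prints the OOB error rate and confusion matrix
importance(rf)  # which predictors matter most?
```

The out-of-bag (OOB) error gives a built-in estimate of predictive accuracy without a separate test set, which is part of what makes the method so appealing.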

Nature is complex. This seems like an obvious statement, but too often we reduce it to straightforward models. y ~ x and that sort of thing. Not that there’s anything wrong with that: sometimes y is actually directly a function of x and anything else would be, in the words of Brian McGill, ‘statistical machismo.’

But I would wager that, more often than not, y is not directly a function of x. Rather, y may be affected by a host of direct and indirect factors, which themselves affect one another directly and indirectly. If only there were some way to translate this network of interacting factors into a statistical framework to better and more realistically understand nature. Oh wait, structural equation modeling.
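To make that concrete, here is a toy sketch (with invented variable names and simulated data, not an example from the post) of a structural equation model fit with the `lavaan` package, where x affects y both directly and indirectly through a mediator m:

```r
# Toy SEM sketch: y responds to x directly and indirectly via a mediator m.
# Data are simulated; variable names are purely illustrative.
library(lavaan)

set.seed(1)
d <- data.frame(x = rnorm(100))
d$m <- 0.5 * d$x + rnorm(100)
d$y <- 0.3 * d$x + 0.6 * d$m + rnorm(100)

fit <- sem("m ~ x
            y ~ x + m", data = d)
summary(fit, standardized = TRUE)
```

The model string encodes the hypothesized causal network, and the fitted object lets you compare direct versus indirect pathways.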

I’ve been hacking away at this post for a while now, for a few reasons. First, I’m a git novice, so I’m still trying to learn my way around the software. Second, this is an intimidating topic for those who are not used to things like the command line, so it was a challenge to identify which ideas were critical to cover, and which could be ignored without too much of a loss in functionality. Finally, there are always lots of little kinks to work out, especially in software that is cross-platform. Therefore, please take the following with a grain of salt and let me know if anything is unclear, needs work, or is flat out wrong!

Lately I’ve been running a lot of complex models with huge datasets, which is grinding my computer to a halt for hours. Streamlining code can only go so far, but R is limited because the default session runs on only 1 core. In a time when computers have at least 2 cores, if not more, why not take advantage of that extra computing power? (Heck, even my phone has 2 cores.*)

Luckily, R comes bundled with the “parallel” package, which helps to distribute the workload across multiple cores. It’s a cinch to set up on a local machine:
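A minimal setup might look something like the following (a generic sketch, not necessarily the post's exact code): create a cluster, hand it work with `parLapply`, and shut it down when done.

```r
# Minimal sketch: distribute work across local cores with the parallel package
library(parallel)

n_cores <- max(1, detectCores() - 1)  # leave one core free for the OS
cl <- makeCluster(n_cores)

# run a trivially parallel task across the cluster
out <- parLapply(cl, 1:100, function(i) mean(rnorm(1000)))

stopCluster(cl)  # always shut the cluster down when finished
```

Note that each worker is a fresh R session, so any packages or objects the task needs must be shipped to the workers (e.g., with `clusterExport` or `clusterEvalQ`).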

Lately, I’ve been using loops to fit a number of different models and storing the models (or their predictions) in a list (or matrix), for instance when bootstrapping. The problem I was running into was the for loop screeching to a halt as soon as a model kicked back an error. I wanted the function to register an error for that entry, then skip to the next one and finish off the loop.
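The basic pattern uses base R's `tryCatch`; here is a small sketch (with a deliberately simulated failure, not a real bootstrapping run) of registering `NA` for a bad fit and carrying on:

```r
# Sketch: wrap the model fit in tryCatch so one failure doesn't kill the loop
results <- vector("list", 10)

for (i in 1:10) {
  results[[i]] <- tryCatch({
    if (i == 3) stop("model failed to converge")  # simulate a bad fit
    lm(rnorm(20) ~ rnorm(20))
  }, error = function(e) NA)  # register NA for this entry, move to the next
}

# results[[3]] is NA; the other nine entries hold fitted models
```

The error handler receives the condition object, so you could just as easily store `conditionMessage(e)` instead of `NA` to keep a record of what went wrong.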

Linear mixed effects models are a powerful technique for the analysis of ecological data, especially in the presence of nested or hierarchical variables. But unlike their purely fixed-effects cousins, they lack an obvious criterion to assess model fit.
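One widely used answer is the marginal and conditional R² of Nakagawa & Schielzeth (2013), which partition variance among the fixed effects, random effects, and residuals. Below is a hand-rolled sketch of that calculation for a random-intercept `lmer` model, using lme4's built-in `sleepstudy` data; it is an illustration of the idea, not the full implementation behind `sem.model.fits`.

```r
# Sketch: marginal and conditional R^2 (Nakagawa & Schielzeth 2013)
# for a random-intercept model, computed by hand with lme4
library(lme4)

m <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

var_fixed  <- var(as.vector(model.matrix(m) %*% fixef(m)))  # fixed-effect variance
var_random <- sum(sapply(VarCorr(m), function(v) v[1]))     # random-intercept variance
var_resid  <- attr(VarCorr(m), "sc")^2                      # residual variance

# marginal R^2: variance explained by fixed effects alone
r2_marginal <- var_fixed / (var_fixed + var_random + var_resid)
# conditional R^2: variance explained by fixed AND random effects
r2_conditional <- (var_fixed + var_random) / (var_fixed + var_random + var_resid)
```

By construction the conditional R² is at least as large as the marginal R², and the gap between the two tells you how much the grouping structure is doing.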

[Updated October 13, 2015: Development of the R function has moved to my piecewiseSEM package, which can be found here under the function sem.model.fits]