Random thoughts on ecology, biodiversity, and science in general

Category: Statistics

Recently, I was exploring techniques to interpolate some missing environmental data, and stumbled across something called ‘random forest’ analysis. Random what now? I did a little digging and came across the massive and insanely complicated field of machine learning. I couldn’t find a concise guide to machine learning techniques, or to when I might want to use one over another, so I thought I would cobble together a brief guide of my own. Below is a rough stab at explaining and exploring different machine learning techniques, from CARTs to GBMs, using R.

Nature is complex. This seems like an obvious statement, but too often we reduce it to straightforward models: y ~ x and that sort of thing. Not that there’s anything wrong with that: sometimes y is actually directly a function of x, and anything else would be, in the words of Brian McGill, ‘statistical machismo.’

But I would wager that, more often than not, y is not directly a function of x. Rather, y may be affected by a host of direct and indirect factors, which themselves affect one another directly and indirectly. If only there were some way to translate this network of interacting factors into a statistical framework to better and more realistically understand nature. Oh wait, structural equation modeling.