Data simulation and randomization tests

James S. Santangelo

Lesson preamble

Learning objectives

Understand the basic logic underlying randomization tests and how they differ from standard parametric statistic tests.

Apply randomization test to real ecological community data.

Lesson outline

Total lesson time: 2 hours

Brief intro to randomization tests (10 min)

Simulating small-scale data in R (30 mins)

General procedure for conducting a randomization test (10 mins)

Simulate data and perform t-test (10 mins)

Perform randomization test on same simulated data (30 mins)

Perform randomization test on real ecological data (30 mins)

Setup

install.packages('dplyr') (or tidyverse)

install.packages('ggplot2') (or tidyverse)

install.packages("reshape2")

install.packages("EcoSimR")

Introduction to randomization tests

As we saw in previous weeks, standard parametric statistical tests (e.g. ANOVA, standard linear regression) have numerous assumptions that should be respected prior to drawing conclusions from their results. Violation of these assumptions can lead to erroneous conclusions about the population(s) you are trying to understand. Mixed-effects models provided us with a few ways of overcoming the limitations of standard models, while still using all of our data and remaining in a statistical framework that we generally understand and are comfortable with. Randomization tests provide yet another alternative solution to parametric tests.

Many ecological and evolutionary hypotheses ask: Is the observed pattern different than what we would expect by random chance? In other words, the null hypothesis is randomness. Randomization tests (sometimes called permutation tests) allow us to test whether the observed data is different from a random distribution generated by reordering our observed data. If the pattern is random, then it should be just as likely as any other pattern generated by reordering the data. If it is not random, then it should occur more or less frequently than we expect under this distribution.

Randomization tests do not require that we make the same assumptions about our data as standard parametric tests. We do not require that samples be randomly drawn from the population since the goal of randomization tests is not to estimate parameters about the population (parameter estimation requires random samples to estimate means and variances). Similarly, we do not require normality of the data nor do we need equal variances among treatment groups since we are not really thinking about the population from which the data came and are not testing against a distribution (e.g. t distribution) with known shape. This will all become more clear when we walk through an example. But first, an aside.

An aside on simulating data in R

R allows us to generate fake data (i.e. simulate data) rather easily. This is useful if we just want some toy data to play around with, want to simulate data similar to the type we will collect in an experiment to test our analyses, or want to simulate data collection many times to estimate population parameters (e.g. variance or standard deviation). We can sample existing (i.e. real) data using the sample() function, or take random samples from known statistical distributions (e.g. Normal distribution). We will return to sample() later in this lesson. For now, let’s focus on sampling from known distributions.

We’ll start by taking random samples from a normal distribution with mean equal to 1 and standard deviation equal to 0.

Note how the mean from the above distribution (0.0325) is not exactly the same as the mean we specified (0). This should not be too surprising since we took random samples from a normal distribution and there is likely going to be variance within these samples. However, if we were to repeat this sampling many times (e.g. 10,000), each time plotting the mean of the distribution, we would expect the mean of this new distribution (i.e. the mean of the 10,000 means) to be much closer to what we specified. Let’s try this.

As expected, the above distribution is normal and it’s mean is much closer to 0 than the distribution generated by sampling only a single time. Interestingly, the same would be true even if the original samples did not come from a normal distribution. For those who took introductory stats, you might remember that this is called the Central Limit Theorem, which states that the distribution of sample means approaches a normal distribution regardless of the shape of the original population’s distribution.

Note that we could also simulate data from a linear model (or any other model), as the example below shows.

Calculate the proportion of times the actual test statistic is outside the distribution of test-statistics.

We will perform the above approach on some simulated data. Imagine we had ventured to South America and collected 10 male and 10 female Hercules beetles. We brought the beetles back to the lab and measure the width of their bodies at the largest point. The question we are interested in is: Do male and female Hercules beetles differ in body width? We will assume that body width is normally distributed (i.e. we will draw from a normal distribution).

Males have a mean body width of 17.3656816 while females have a mean body width of 15.0282489

We will use a randomization test to test if males and females differ in body width. If they do not differ, then each body width measurement in our dataframe should be equally likely to belong to a male or to a female. In other words, body width should independent of beetle sex. We can simulate this by randomly rearranging the body width values, effectively removing the effect of sex on body width, and then calculating a test statistic (e.g. t statistic, difference in means). We can do this many times (e.g. 5,000) and ask how many times our observed test statistic falls outside the randomly generated distribution of test statistics. We will use the difference in mean body width as our test statistic so let’s go ahead and calculate the observed difference in body width between male and female beetles.

The observed difference in mean body width between male and female Hercules beetles is 2.3374326

Traditionally, if we wanted to determine if there is a significant difference in means between two groups, we could use a t-test. Let’s perform a t-test on these data so we could compare it to the results of the randomization test that we will perform. Note the alternative = two.sided argument passed to t.test(). Since we are interested in whether males and females differ in mean body length without reference to whether one sex is specifically larger or smaller than the other, we are performing a two-sided hypothesis test. Realistically, males could either have larger or smaller body widths so a two-tailed test is appropriate here.

From the t-test above, we see that our t-statistic has a value of 2.49 and that males and females differ significantly in body width (P < 0.05). From the means reported in the t-test, it looks like males have larger body widths than females (17.36 vs. 15.03). Let’s now perform a randomization test. To get a better sense of what we’re doing, let’s first perform a single reshuffling of the data and look at the resulting distribution of the data.

The mean difference from a single reshuffling of the data is -1.0902614

We can see that the mean difference from the simulated data is considerably different from the observed mean difference. This makes sense since the body width values were randomly assigned to males and females. Let’s now do this 5,000 times, each time calculating the mean difference in body width.

We now have a numeric vector containing 5,000 simulated differences in mean body width between males and females. Let’s plot a histogram of the simulated values and overlay onto the histogram our observed mean difference.

Finally, to get out P-value, we calculate the number of times the simulated mean difference exceeded the observed mean difference from our data. Because we are performing a two-tailed test, this amounts to determining the number of times the simulated mean difference is either greater or lesser than the observed difference. We can do this by asking how many times the absolute value of the simulated mean difference is greater or equal to the absolute value of the observed mean difference.

As we can see, the randomization test provides results that are largely consistent with the t-test. This is not surprising since we specifically sampled independent data from normal distributions with similar variances. In other words, our sampled data do not violate any of the assumptions of the t-test and in such a case the difference in means is by definition analogous to the t statistic used in the t-test. There are some cases where randomization tests provide more accurate (or exact) P-values than parametric tests (e.g. small samples from skewed distributions) but even more diverse applications of randomization tests are found in ecology and evolutionary biology.

Randomization tests: A real-world application

Randomization tests are applied in all corners of ecological and evolutionary research but for our purpose heres we’ll turn to the field of community ecology for an example. Community ecologists have long been interested in understanding the factors that shape the organization of ecological communities. Are the species in an ecological community just a random assortment of species available from the regional species pool? Conversely, do species interactions and shared resources determine the local distribution of species, such that some species are found together more often than we would expect by chance whereas other never or only rarely co-occur. The proposition that competition and shared resources are important in driving community assembly are known as assembly rules and were first proposed by Diamond (1975). An important null model based on randomization tests was later developed to test what species assemblages would look like in the absence of competition and shared resources (Connor and Simberloff 1979). We will apply one such randomization test in this section.

As you can see, the data is a matrix with sites (e.g. islands, quadrats, etc.) as columns and species as rows. Each cell represents whether a species is present (1) or absent (0) at that site. This type of data is often referred to as a presence-absence matrix. The column totals (using colSums()) gives us a sense of how species richness varies across the sites while the row totals (using rowSums()) tells us how rare or common a species is across all sites. The question we are interested in addressing is: Are the observed species co-occurrence patterns the result of species interactions or random chance?

Let’s read in some real data before proceeding. The data are from the National Ecological Observatory Network and represent the presence and absence of plant species in eight square meter plots during the month of July, 2017, at a single site around Harvard, Boston, MA. Note this is only a subset of the very large dataset provided by NEON, which dates back to 2013 and includes about 46 different sites.

Clearly this data is not in the form that we need it to be. As I’m sure you know by now, R has many useful libraries and functions that should allow us to change this relatively easily. We’ll rely on the reshape2 package to transform our dataset into a presence-absence matrix.

Well that was painless! We now have our data in a form that is usable in our randomization test but we need to decide two things first: (1) What is our test statistic? This is what we’ll use to determine if co-occurrence patterns are random or not, similar to how we used the difference in means to tell if male and female beetles differed in length. (2) How do we randomize such a large matrix? Do we randomly permute rows and columns so that each cell is equally likely to contain a 1 or 0? Do we constrain row permutations so that row totals (i.e. species rarity) must be preserved but site richness is free to vary? Do we constrain column permutations so that column totals (i.e. site richness) must be preserved but species rarity can vary? It turns out there are exactly 9 different ways a matrix like this can be randomized, each involving a different set of constraints and biological assumptions (Gotelli 2000).

For our purpose, we will preserve species rarity but will assume that each site is equally likely to be colonized. In other words, we are assuming that sites do not vary in habitat quality or anything else that may change the number of species that can occupy it. However, we want rare species to remain rare throughout the permutations. This model is analogous to the SIM2 algorithm in Gotelli (2000) and is reasonable in our case since the sites are relatively close in proximity (i.e. likely have similar habitat quality).

For our index, we will use the C-score (Stone and Roberts 1990). I will not go into the calculation of this score here but suffice it to say that the higher the C-score, the less co-occurrence (on average) between all pairs of species in the matrix (i.e. a more segregated matrix). Smaller C-scores indicate an aggregated matrix (i.e. more co-occurrence among species pairs). Thankfully for us, there is an R package that implements the randomization algorithm for us. Let’s go ahead and conduct the analysis.

As we can see, our observed C-score (red line) is much lower (significantly so) than we would expect by random chance given the species in our community. This suggests the community is aggregated such that some species are more likely to occur with other species in the community than we would expect if the communities were assembling themselves randomly. This can happen if there is facilitation and some species create ecological niches or favourable habitat for other species such that they occur more often. Importantly, the C-score is an average across all pairwise comparisons in the matrix and thus doesn’t tell us anything about which specific species pairs may be more or less likely to co-occur. Identifying which species pairs are driving this pattern would require quite a bit more time and identifying the ecological mechanisms behind co-occurrence patterns of specific species pairs requires careful experimentation.

For the sake of completeness, we can also visualize a single permutation of the matrix compared against our observed matrix.

Notice how all of the rows in the two figures above always have the same number of filled squares. This is confirmation that the rarity of species (i.e. row totals) has been preserved during the permutation.

Other uses of randomization tests

Testing for mean differences among groups and for putative rules governing community assembly are only two of the many uses for randomization tests. In this section, I briefly highlight other uses of data randomization in statistics, ecology, and evolutionary biology.

Randomization tests are additionally used numerous fields in ecology and evolutionary biology such as spatial ecology (e.g. testing spatial patterns in animal distribution and allowing for spatial autocorrelations), phylogenetics (e.g. permutations of species phylogenies), community ecology and environmental monitoring. In addition to randomization tests in the way that we’ve demonstrated here, randomization procedures play an important role in other aspects biology and statistics.

Bootstrapping involves sampling data with replacement to estimate the sampling distribution of a population estimator (e.g. mean, median, standard deviation). This is often done to estimate the error (e.g. standard error, confidence interval) around the estimated population parameter.

Cross-validation involves randomly subsampling a dataset, which is then held as a model validation set on which the predictive ability of a model can be tested. A model is tested on the remaining data (i.e. the training set) and used to predict the validation set. This is done many times to estimate the accuracy of the model.

Monte-carlo methods are a huge class of methods that generally involve simulating random numbers to estimate probability distributions and for obtaining numerical solution to really complicated problems. For our purposes, these are especially common in computational biology, genomics and in Bayesian inference (e.g. building phylogenies).