Resampling is a term used in statistics to describe a variety of methods for computing summary statistics using subsets of available data (jackknife), drawing randomly with replacement from a set of data points (bootstrapping), or switching labels on data points when performing significance tests (permutation test, also called exact test, randomization test, or re-randomization test).


Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by resampling with replacement from the original sample, most often with the purpose of deriving robust estimates of standard errors and confidence intervals of a population parameter, for example a mean, median, proportion, odds ratio, correlation coefficient or regression coefficient. It may also be used for the construction of hypothesis tests. It is often used as a robust alternative to inference based on parametric assumptions when those assumptions are in doubt, or where parametric inference is impossible or requires very complicated formulas for the calculation of standard errors. See also jackknife.
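The resampling step described above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the function names `bootstrap` and `percentile_interval`, the number of resamples, the seed and the data values are all assumptions made for the example, and the percentile interval is just one of several ways to turn bootstrap replicates into a confidence interval.

```python
import random
import statistics

def bootstrap(sample, statistic, n_resamples=2000, seed=0):
    """Return bootstrap replicates of `statistic` (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(sample)
    replicates = []
    for _ in range(n_resamples):
        # Draw n observations with replacement from the original sample.
        resample = rng.choices(sample, k=n)
        replicates.append(statistic(resample))
    return replicates

def percentile_interval(replicates, level=0.95):
    """Simple percentile interval read off the sorted replicates."""
    ordered = sorted(replicates)
    alpha = (1.0 - level) / 2.0
    low = ordered[int(alpha * len(ordered))]
    high = ordered[int((1.0 - alpha) * len(ordered)) - 1]
    return low, high

# Made-up data, for illustration only.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 6.1, 3.3, 4.9]
reps = bootstrap(data, statistics.median)
std_error = statistics.stdev(reps)        # bootstrap standard error of the median
ci_low, ci_high = percentile_interval(reps)
```

The standard deviation of the replicates serves as the estimated standard error, so no formula for the standard error of the median is needed.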

See also particle filter for the general theory of Sequential Monte Carlo methods, as well as details on some common implementations.

The jackknife is a statistical method introduced by Maurice Quenouille as a technique for bias reduction, and later extended and named by John Tukey. It is related to bootstrapping in the sense that both methods are used to estimate and compensate for bias and to derive robust estimates of standard errors and confidence intervals. Both methods have in common that the variability of a statistic is estimated from the variability within a sample, rather than from parametric assumptions. The jackknife is a less general technique than the bootstrap, and it explores the sample variation in a different way. Jackknifed statistics are developed by systematically dropping out subsets of data one at a time and assessing the resulting variation in the studied parameter (Mooney & Duval).

Jackknife and bootstrap may in many situations be used to obtain similar results. A difference between them is that when used to obtain an estimate of the standard error of a statistic, bootstrapping will give slightly different results when the process is repeated on the same data, whereas jackknife will give exactly the same result each time.
A situation where jackknife is regarded as the preferred alternative is the analysis of data from complex sampling schemes, for example multi-stage sampling with varying sampling weights.
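As a rough illustration of the leave-one-out idea, the following Python sketch computes a jackknife standard error. The function name `jackknife_se` and the data are assumptions made for the example; the variance formula used is the usual delete-one jackknife formula, (n - 1)/n times the sum of squared deviations of the leave-one-out estimates from their mean.

```python
import math
import statistics

def jackknife_se(sample, statistic):
    """Delete-one jackknife standard error (hypothetical helper)."""
    n = len(sample)
    # Recompute the statistic n times, each time with observation i removed.
    leave_one_out = [statistic(sample[:i] + sample[i + 1:]) for i in range(n)]
    centre = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((v - centre) ** 2 for v in leave_one_out)
    return math.sqrt(variance)

# Made-up data, for illustration only.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 6.1, 3.3, 4.9]
se_mean = jackknife_se(data, statistics.mean)
```

No random numbers are involved, so repeating the call on the same data gives the same value every time. For the sample mean, the result coincides with the familiar standard error s/sqrt(n).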

Preamble: All statistical tests use observations from a data set to compute a test statistic that characterises a hypothesis of interest. This test statistic is then compared to an expected reference distribution, to assess the probability of a value at least as extreme occurring by chance under a null hypothesis. If this probability, the p-value, is small (a threshold of 1/20 or less is often used in medical, econometric or social science applications) then the null hypothesis is rejected and a complementary, alternative hypothesis is accepted.

A permutation test (a particular type of statistical significance test, sometimes called a randomization test, re-randomization test, or an exact test) is a statistical test in which a reference distribution is obtained by permuting the observed data points across all possible outcomes, given a set of conditions consistent with the null hypothesis.
The theory has evolved from the works of R.A. Fisher and E.J.G. Pitman in the 1930s.
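For a very small two-group comparison, the reference distribution can be built by listing every way of reassigning the group labels. The Python sketch below does this for the absolute difference in means; the function name `exact_permutation_test`, the choice of test statistic and the data are assumptions made for the example.

```python
from itertools import combinations
from statistics import mean

def exact_permutation_test(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    hits = 0
    total = 0
    # Every choice of n_a positions is one possible relabelling of the data.
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in range(len(pooled)) if i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point rounding.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            hits += 1
    return hits / total

# Made-up data: every treatment value exceeds every control value.
treatment = [12.1, 14.3, 13.8, 15.2]
control = [9.8, 10.4, 11.1, 10.9]
p_value = exact_permutation_test(treatment, control)  # 2/70, about 0.029
```

With four observations per group there are 70 relabellings. Only the observed split and its mirror image are as extreme as the data, which gives the p-value of 2/70.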

Permutation tests form a branch of non-parametric statistics. In contrast to permutation tests, the reference distributions for many popular ‘classical’ statistical tests, such as the t-test, the z-test, and the chi-squared test, are obtained from theoretical probability distributions. Many researchers believe this invalidates or, at least, critically weakens their use because the assumptions relating the theoretical distributions to the empirically obtained test statistics may not be valid. The extent to which this is true, in various real-world settings, is an area of active statistical investigation. Researchers may be forced to make these assumptions in some situations because there is no other alternative, and a non-optimal statistical test is usually considered better than none at all.

Fisher's exact test is a commonly used permutation test for evaluating the association between two dichotomous variables, and contrasts with Pearson's chi-squared test, which can be used for the same purpose. When sample sizes are small, the chi-squared test statistic can no longer be accurately compared against the chi-squared reference distribution, and the use of Fisher’s exact test becomes most appropriate.
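For a 2×2 table, the permutation distribution with fixed margins is hypergeometric, so the p-value can be computed directly. The Python sketch below is illustrative only: the function name `fisher_exact_two_sided` and the example table are assumptions, and the two-sided rule used here (summing all tables no more probable than the observed one) is one common convention among several.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over all tables that are no more probable than the observed one.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Made-up 2x2 table: 3 and 1 in the first row, 1 and 3 in the second.
p_value = fisher_exact_two_sided(3, 1, 1, 3)  # 34/70, about 0.486
```

With only eight observations, the possible tables have probabilities 1/70, 16/70, 36/70, 16/70 and 1/70, so the calculation can be checked by hand.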

All parametric tests have a corresponding permutation test version, which uses the same test statistic as the parametric test but obtains the p-value from the sample-specific permutation distribution of that statistic, rather than from the theoretical distribution derived from the parametric assumption.
It is for example possible in this manner to construct a permutation t-test, a permutation chi-squared test of association, a permutation two-sample Kolmogorov-Smirnov test and so on.
Many parametric tests define the test statistic as a ratio t/s, where t measures the deviation of an observable parameter from its expected value when the null hypothesis is true, and s is an estimate of the standard error of t. A permutation test need not in general take into account the value of s, as for many statistics s is the same for all permutations of a sample, or the ratio is a monotone function of t alone. This is an advantage when constructing new permutation tests, as it will not be necessary to find an expression for the standard error of the test statistic. Finding the standard error (or variance) of a new test statistic is often the trickiest part of developing new significance tests, requiring deep mathematical knowledge. So the construction of a permutation test rather than a parametric test to solve a certain problem may be regarded as a way of replacing mathematical skill with raw computing power.
The most commonly used non-parametric tests are in their original form defined as permutation tests on ranks; these include, for example, the Mann-Whitney U test and Spearman’s rank correlation test. Pitman’s original formulation (in 1937) of the general permutation test of association between two variables describes a general test procedure that gives a permutation test of Pearson's correlation coefficient when applied to two numeric variables on linear scales, Spearman's rank correlation test when applied to ranked data points, a permutation t-test when applied to one numeric and one dichotomous variable, the Mann-Whitney U test (also known as the Wilcoxon rank sum test) when applied to one ranked and one dichotomous variable, and Fisher's exact test when applied to two dichotomous variables.
In general, the most important advantage of permutation tests is that the results are reliable even for small samples and when the data strongly violate the distributional assumptions of the corresponding parametric test. For larger sample sizes, the central limit theorem will in most situations ensure that the results obtained from parametric tests are very similar to the results from the related permutation test. It may therefore be concluded that even when the parametric assumptions are not met, parametric tests are often good approximations to the corresponding ‘exact’ permutation test, provided the sample is large enough.
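The idea that one general procedure yields different named tests depending on the scale of the data can be sketched in Python. The helper names `ranks` and `permutation_association_test` and the data are assumptions made for the example. The statistic is the absolute sum of centred cross products, which orders the permutations in the same way as the absolute correlation coefficient because the two standard deviations do not change when one variable is permuted.

```python
from itertools import permutations

def ranks(values):
    """Ranks 1..n of the values, assuming there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def permutation_association_test(x, y):
    """Two-sided exact permutation p-value for association between x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    x_centred = [v - mean_x for v in x]

    def stat(y_perm):
        return abs(sum(a * (b - mean_y) for a, b in zip(x_centred, y_perm)))

    observed = stat(y)
    hits = 0
    total = 0
    # Enumerate all n! pairings of y with x (feasible only for tiny n).
    for y_perm in permutations(y):
        total += 1
        if stat(y_perm) >= observed - 1e-9:
            hits += 1
    return hits / total

# Made-up data, for illustration only.
x = [1.2, 2.3, 3.1, 4.8, 5.0, 6.7]
y = [2.0, 2.9, 3.7, 5.1, 5.9, 7.2]
p_raw = permutation_association_test(x, y)                 # Pearson-type test
p_rank = permutation_association_test(ranks(x), ranks(y))  # Spearman-type test
```

Passing the raw values gives a permutation test of Pearson's correlation, while passing ranks gives a test of Spearman's rank correlation. No standard error is calculated at any point.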
Prior to the 1980s, the burden of creating the reference distribution was overwhelming except for data sets with small sample sizes. Since the 1980s, however, the confluence of cheap fast computers and the development of new sophisticated path algorithms applicable in special situations has made the application of permutation test methods practical for a wide range of problems. It also initiated the addition of exact-test options in the main statistical software packages and the appearance of specialized software for performing a wide range of uni- and multi-variable exact tests and computing test-based ‘exact’ confidence intervals.
During the 1990s a totally general short-cut method for finding the reference distribution was introduced: the Monte Carlo method. Even with today's most advanced computers, the task of performing a general permutation test on continuous data is still overwhelming unless the sample size is very small.
The number of permutations is N! for data with no ties. For N = 10 this is 3,628,800; for N = 20 it is about 2.4E18 and for N = 50 about 3.0E64.
Therefore it was an important breakthrough in the area of applied statistics when it was realised that by using Monte Carlo sampling, i.e. taking a small (relative to the total number of permutations) number of random samples with replacement from the permutation distribution, it was possible to accurately estimate the reference distribution of any permutation test on any data. A "small" sample in this case means at least 10,000 permutations.
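A minimal Python sketch of the Monte Carlo approach follows. Instead of enumerating all relabellings, it shuffles the pooled data a fixed number of times. The function name `monte_carlo_permutation_test`, the test statistic, the seed and the data are assumptions made for the example; the default of 10,000 shuffles follows the figure given above.

```python
import random

def monte_carlo_permutation_test(group_a, group_b, n_resamples=10_000, seed=0):
    """Estimate a two-sided permutation p-value from random relabellings."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    hits = 0
    for _ in range(n_resamples):
        # Each shuffle is one random draw from the permutation distribution.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:
            hits += 1
    return hits / n_resamples

# Made-up data; the exact permutation p-value for these groups is 2/70.
treatment = [12.1, 14.3, 13.8, 15.2]
control = [9.8, 10.4, 11.1, 10.9]
p_estimate = monte_carlo_permutation_test(treatment, control)
```

The data set here is small enough for the exact answer to be known, which makes it easy to see that the estimate lands close to it. The same code runs unchanged on samples for which full enumeration would be infeasible.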

Limitations of tests based on the permutation principle:
There are two important assumptions behind a permutation test: that the observations are independent and that they are exchangeable under the null hypothesis.
An important consequence of the exchangeability assumption is that tests of difference in location (like a permutation t-test) require equal variance, otherwise the observations are not exchangeable. In this respect the permutation t-test shares the same weakness as the classical Student’s t-test.
Another weakness of permutation tests is that they return a p-value as the only outcome of a statistical analysis, which means that they do not satisfy the common requirement today that results should be presented as confidence intervals for the parameter of interest, and not (only) as p-values. However, there are methods for calculating ‘exact’ confidence intervals by inverting a permutation test.