I recently used bootstrapping to estimate confidence intervals for a project. Someone who doesn't know much about statistics recently asked me to explain why bootstrapping works, i.e., why is it that resampling the same sample over and over gives good results. I realized that although I'd spent a lot of time understanding how to use it, I don't really understand why bootstrapping works.

Specifically: if we are resampling from our sample, how is it that we are learning something about the population rather than only about the sample? There seems to be a leap there which is somewhat counter-intuitive.

I have found a few answers to this question here which I half-understand. Particularly this one. I am a "consumer" of statistics, not a statistician, and I work with people who know much less about statistics than I do. So, can someone explain, with a minimum of references to theorems, etc., the basic reasoning behind the bootstrap? That is, if you had to explain it to your neighbor, what would you say?

$\begingroup$(+1) You might mention briefly the questions you have looked at, but that don't quite satisfy you. There are lots of questions on the bootstrap here. :)$\endgroup$
– cardinal, Apr 8 '12 at 21:11

$\begingroup$@cardinal Thanks, I updated the original post. Hopefully it is more clear. :)$\endgroup$
– Alan H., Apr 9 '12 at 3:17

$\begingroup$Basically, bootstrap works because it is nonparametric maximum likelihood. So, when there are problems with maximum likelihood, you can expect problems with the bootstrap.$\endgroup$
– kjetil b halvorsen, Mar 18 '15 at 20:43


$\begingroup$Jake VanderPlas had a great talk at PyCon 16 about bootstrapping and some other related techniques. See the slides starting at slide 71 and the video recording.$\endgroup$
– thm, May 24 '17 at 14:30

11 Answers

You want to ask a question of a population but you can't. So you take a sample and ask the question of it instead. Now, how confident you should be that the sample answer is close to the population answer obviously depends on the structure of the population. One way you might learn about this is to take samples from the population again and again, ask them the question, and see how variable the sample answers tended to be. Since this isn't possible you can either make some assumptions about the shape of the population, or you can use the information in the sample you actually have to learn about it.

Imagine you decide to make assumptions, e.g. that it is Normal, or Bernoulli or some other convenient fiction. Following the previous strategy you could again learn about how much the answer to your question when asked of a sample might vary depending on which particular sample you happened to get by repeatedly generating samples of the same size as the one you have and asking them the same question. That would be straightforward to the extent that you chose computationally convenient assumptions. (Indeed particularly convenient assumptions plus non-trivial math may allow you to bypass the sampling part altogether, but we will deliberately ignore that here.)

This seems like a good idea provided you are happy to make the assumptions. Imagine you are not. An alternative is to take the sample you have and sample from it instead. You can do this because the sample you have is also a population, just a very small discrete one; it looks like the histogram of your data. Sampling 'with replacement' is just a convenient way to treat the sample like it's a population and to sample from it in a way that reflects its shape.
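To make "treating the sample like a population" concrete, here is a minimal sketch in Python (the numbers are made up and numpy is assumed): each resample is drawn with replacement, has the same size as the original sample, and the spread of the answers it gives to the question (here, the mean) is what we read off.

```python
import numpy as np

rng = np.random.default_rng(0)

# The one sample we actually have (hypothetical numbers).
sample = np.array([2.1, 3.5, 3.7, 4.0, 4.4, 5.2, 5.9, 6.3, 7.1, 8.0])

# One resample: drawn with replacement, so it reflects the sample's shape.
resample = rng.choice(sample, size=sample.size, replace=True)

# Ask the same question of many resamples and see how variable the answer is.
means = [rng.choice(sample, size=sample.size, replace=True).mean()
         for _ in range(2000)]
print(np.std(means))  # rough estimate of how much the sample mean wobbles
```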

This is a reasonable thing to do because the sample you have is not only the best, but indeed the only, information you have about what the population actually looks like, and also because most samples, if they're randomly chosen, will look quite like the population they came from. Consequently it is likely that yours does too.

For intuition it is important to think about how you could learn about variability by aggregating sampled information that is generated in various ways and on various assumptions. Completely ignoring the possibility of closed form mathematical solutions is important to get clear about this.

$\begingroup$(+1) This is a good answer. I think there may be a way to further draw out a very important point, though. In the way the bootstrap is normally carried out, there are two effects that are happening. First, we are pretending that the sample we have obtained is a proxy for our population. This is nominally a reasonable thing to do, provided our sample size is reasonably large. However, we usually have a hard time calculating the actual quantities of interest from that pretend distribution. So, we have to estimate them, and this is why we draw lots of bootstrap samples. If we could.../...$\endgroup$
– cardinal, Apr 9 '12 at 0:29


$\begingroup$.../...calculate the quantities of interest directly for our pretend distribution, we'd prefer to do that. And, that would be the real bootstrap. But, usually we can't, so we're reduced to having to resample, instead.$\endgroup$
– cardinal, Apr 9 '12 at 0:32


$\begingroup$@naught101: "Reasonably large" can be quantified pretty well by the D-K-W inequality (if you'd like, you can look at my answer in the link in the OP's question) and regarding lots, it depends on the sample statistic of interest, but if we have $B$ bootstrap samples, then with simple Monte Carlo we know that the standard error is of order roughly $O(B^{-1/2})$.$\endgroup$
– cardinal, Apr 9 '12 at 2:11


$\begingroup$@cardinal: Nice comment. A lot of people think that the bootstrap and resampling are the same thing when in fact the latter is a tool used for the former. A similar misconception is that many users of statistics tend to get MCMC and Bayesian analysis confused.$\endgroup$
– MånsT, Apr 10 '12 at 7:34

+1 to @ConjugatePrior, I just want to bring out one point which is implicit in his answer. The question asks, "if we are resampling from our sample, how is it that we are learning something about the population rather than only about the sample?" Resampling is not done to provide an estimate of the population distribution--we take our sample itself as a model of the population. Rather, resampling is done to provide an estimate of the sampling distribution of the sample statistic in question.
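A small numerical illustration of this point, on made-up data: the spread of the bootstrap means approximates the sampling distribution of the sample mean, and its standard deviation comes out close to the textbook standard error $s/\sqrt{n}$ (a sketch, numpy assumed).

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=50, scale=10, size=100)  # a hypothetical sample of size 100

# Bootstrap approximation to the sampling distribution of the sample mean.
boot_means = np.array([rng.choice(x, size=x.size, replace=True).mean()
                       for _ in range(5000)])

print("bootstrap SE of the mean:", boot_means.std(ddof=1))
print("classical SE, s/sqrt(n): ", x.std(ddof=1) / np.sqrt(x.size))
```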

$\begingroup$(+1) This is close to the point I was trying to make in the comment to ConjugatePrior's answer, though you've stated it more concisely and clearly. In some special cases, we can calculate the sampling distribution of the test statistic exactly under the empirical distribution obtained from the sample. But, usually, we can't and so we're forced into simulation. :)$\endgroup$
– cardinal, Apr 9 '12 at 1:13


$\begingroup$I see, so if I understand you, then this technique assumes that the sample is an adequate model of the population, and therefore that resampling over that sample on a large enough scale will reveal something about the population, but only to the extent that the original sample is a good one. Now that I put it that way it seems almost obvious...$\endgroup$
– Alan H., Apr 9 '12 at 1:31


$\begingroup$@AlanH., I just want to change "... will reveal something about the population" to "... will reveal something about the sampling distribution" (of the statistic at issue, eg mean). But, yes, you have it there$\endgroup$
– gung♦, Apr 9 '12 at 1:52

$\begingroup$You're all correct, of course. Personally, and purely for pedagogical reasons, I save this point for my 'longer version', because in my particular audiences this point tends to knock their young and still unsteady intuitions a bit off balance if applied too soon.$\endgroup$
– conjugateprior, Jun 25 '12 at 12:37


$\begingroup$@ErosRam, bootstrapping is to determine the sampling distribution of something. You can do it for a sample statistic (eg 56th percentile) or a test statistic (t), etc. In my binomial ex, the sampling distribution will obviously be 0 heads - 25%; 1 head - 50%; 2 heads - 25%; this is clear w/o resampling. Cardinal has a comment somewhere that explains this (many of the best answers on the site are cardinal's comments), but it's hard to find b/c it's a comment.$\endgroup$
– gung♦, Mar 30 '15 at 22:16

This is probably a more technical explanation aimed at people who understand some statistics and mathematics (calculus, at least). Here's a slide from a course on survey bootstraps that I taught some while ago:

Some explanations are needed, of course. $T$ is the procedure to obtain the statistic from the existing data (or, to be technically precise, a functional from the distribution function to real numbers; e.g., the mean is $E[X]=\int x {\rm d}F$, where for the sample distribution function $F_n()$, the ${\rm d}F$ is understood as a point mass at a sample point). In the population, denoted by $F()$, application of $T$ gives the parameter of interest $\theta$. Now, we've taken a sample (the first arrow on the top), and have the empirical distribution function $F_n()$ -- we apply $T$ to it to obtain the estimate $\hat\theta_n$. How far is it from $\theta$, we wonder? What is the distribution that the random quantity $\hat\theta_n$ may have around $\theta$? This is the question mark in the lower left of the diagram, and this is the question the bootstrap tries to answer. To restate gung's point, this is not the question about the population, but the question about a particular statistic and its distribution.

If we could repeat our sampling procedure, we could get that distribution and learn more. Well, that usually is beyond our capabilities. However, if

$F_n$ is close enough to $F$, in a suitable sense, and

the mapping $T$ is smooth enough, i.e., if we take small deviations from $F()$, the results will be mapped to numbers close to $\theta$,

we can hope that the bootstrap procedure will work. Namely, we pretend that our distribution is $F_n()$ rather than $F()$, and with that we can entertain all possible samples -- and there will be $n^n$ such samples, which is only practical for $n\le 5$. Let me repeat again: the bootstrap works to create the sampling distribution of $\hat\theta_n^*$ around the "true" parameter $\hat\theta_n$, and we hope that with the two above conditions, this sampling distribution is informative about the sampling distribution of $\hat\theta_n$ around $\theta$:

Now, instead of just going one way along the arrows, and losing some information/accuracy along these arrows, we can go back and say something about variability of $\hat\theta_n^*$ around $\hat\theta_n$.

The above conditions are spelled out in utmost technicality in Hall's (1991) book. The understanding of calculus that I said may be required as a prerequisite to staring at this slide comes in with the second assumption concerning smoothness: in more formal language, the functional $T$ must possess a weak derivative. The first condition is, of course, an asymptotic statement: the larger your sample, the closer $F_n$ should become to $F$; and the distances from $\hat\theta_n^*$ to $\hat\theta_n$ should be of the same order of magnitude as those from $\hat\theta_n$ to $\theta$. These conditions may break, and they do break in a number of practical situations with weird enough statistics and/or sampling schemes that do not produce empirical distributions that are close enough to $F$.

Now, where do those 1000 samples, or whatever the magic number might be, come from? They come from our inability to draw all $n^n$ samples, so we just take a random subset of these. The rightmost "simulate" arrow states another approximation that we are making on our way to get the distribution of $\hat\theta_n$ around $\theta$, namely that our Monte Carlo simulated distribution of $\hat\theta_n^{(*r)}$ is a good enough approximation of the complete bootstrap distribution of $\hat\theta_n^*$ around $\hat\theta_n$.
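As a sketch of these two approximations, here is a made-up sample of size $n=3$, small enough that all $n^n=27$ equally likely resamples can be enumerated exactly; the Monte Carlo run is the random subset of resamples we settle for when enumeration is impossible (numpy assumed).

```python
import numpy as np
from itertools import product

x = np.array([1.0, 4.0, 10.0])  # hypothetical sample, n = 3
n = x.size

# "Complete" bootstrap: all n**n = 27 ordered resamples, each equally likely.
complete = np.array([np.mean(r) for r in product(x, repeat=n)])

# Monte Carlo bootstrap: a random subset of B resamples drawn with replacement.
rng = np.random.default_rng(2)
B = 1000
mc = np.array([rng.choice(x, size=n, replace=True).mean() for _ in range(B)])

print("complete bootstrap SD of the mean:", complete.std())
print("Monte Carlo approximation:        ", mc.std())  # off by roughly O(B**-0.5)
```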

I am answering this question because I agree that this is a difficult thing to do and there are many misconceptions. Efron and Diaconis attempted to do it in their 1983 Scientific American article and in my view they failed. There are several books out now devoted to the bootstrap that do a good job. Efron and Tibshirani do a great job in their article in Statistical Science in 1986. I tried especially hard to make the bootstrap accessible to practitioners in my bootstrap methods book and my introduction to the bootstrap with applications to R. Hall's book is great but very advanced and theoretical. Tim Hesterberg has written a great supplemental chapter to one of David Moore's introductory statistics books. The late Clifford Lunneborg had a nice book. Chihara and Hesterberg recently came out with an intermediate level mathematical statistics book that covers the bootstrap and other resampling methods. Even advanced books like Lahiri's or Shao and Tu's give good conceptual explanations. Manly does well with his book that covers permutations and the bootstrap. There is no reason to be puzzled about the bootstrap anymore.

It is important to keep in mind that the bootstrap depends on the bootstrap principle: "Sampling with replacement behaves on the original sample the way the original sample behaves on a population." There are examples where this principle fails. It is important to know that the bootstrap is not the answer to every statistical problem.

$\begingroup$@Procrastinator. I am doing that more often. In some cases I am in a hurry to get my answer posted and come back to clean it up later. I haven't got the hang of converting link addresses to links by title and I am not sure that it is all that necessary. It is a single click either way. But if you can't wait for that I don't mind you doing the edits. In fact I appreciate it.$\endgroup$
– Michael Chernick, Jul 24 '12 at 18:56


$\begingroup$I was going to change my comment to "I don't mind you doing the edits" with the "But if you can't wait" taken out. I see how what you did is neater and easier and probably takes less time but I just haven't learned it yet and I don't see this as such a big deal the way some moderators and other members do.$\endgroup$
– Michael Chernick, Jul 24 '12 at 19:05


$\begingroup$(+1) I confer on you the power of the $10,000$ points @Michael Chernick.$\endgroup$
– user10525, Jul 24 '12 at 19:11

$\begingroup$Thank you, Procrastinator. I was anticipating possibly reaching that total today.$\endgroup$
– Michael Chernick, Jul 24 '12 at 20:44

Through bootstrapping you are simply taking samples over and over again from the same group of data (your sample data) to estimate how accurate your estimates about the entire population (what really is out there in the real world) are.

If you were to take only one sample and make estimates about the real population, you might not be able to tell how accurate your estimates are - you have only one estimate and have not identified how this estimate would vary across the different samples you might have encountered.

With bootstrapping, we use this main sample to generate multiple samples. For example, if we measured the profit every day over 1000 days we might take random samples from this set. We might take the profit from one random day, record it, get the profit from another random day (which might happen to be the same day as before - sampling with replacement), record it, and so forth, until we get a "new" sample of 1000 days (from the original sample).

This "new" sample is not identical to the original sample - indeed we might generate several "new" samples as above. When we look at the variations in the means and estimate, we are able to get a reading on how accurate the original estimates were.

Edit - in response to comment

The "newer" samples are not identical to the first one and the new estimates based on these will vary. This simulates repeated samples of the population. The variations in the estimates of the "newer" samples generated by the bootstrap will shed a light on how the sample estimates would vary given different samples from the population. This is in fact how we can get try to measure the accuracy of the original estimates.

Of course, instead of bootstrapping you might instead take several new samples from the population but this might be infeasible.

$\begingroup$Thanks! This much I understand. I am particularly wondering how it is that resampling from a sample of the population helps to understand the underlying population. If we are resampling from a sample, how is it that we are learning something about the population rather than only about the sample? There seems to be a leap there which is somewhat counter-intuitive.$\endgroup$
– Alan H., Apr 8 '12 at 21:26

I realize this is an old question with an accepted answer, but I'd like to provide my view of the bootstrap method. I'm in no way an expert (more of a statistics user, as the OP) and welcome any corrections or comments.

I like to view the bootstrap as a generalization of the jackknife method. So, let's say you have a sample S of size 100 and estimate some parameter by using a statistic T(S). Now, you would like to know a confidence interval for this point estimate. In case you don't have a model and an analytical expression for the standard error, you may go ahead and delete one element from the sample, creating a subsample $S_i$ with element $i$ deleted. Now you can compute $T(S_i)$ and get 100 new estimates of the parameter from which you can compute e.g. a standard error and create a confidence interval. This is the jackknife method JK-1.

You may consider all subsets of size 98 instead and get JK-2 (2 elements deleted) or JK-3 etc.

Now, the bootstrap is just a randomized version of this. By resampling via selection with replacement, you "delete" a random number of elements (possibly none) and "replace" them by one (or more) replicates.

By replacing with replicates, the resampled dataset always has the same size. For the jackknife you may ask what the effect is of jackknifing on samples of size 99 instead of 100, but if the sample size is "sufficiently large" this is likely a non-issue.

In the jackknife you never mix delete-1 and delete-2, etc., to make sure the jackknifed estimates come from samples of the same size.
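For concreteness, here is a sketch comparing JK-1 with the bootstrap on a made-up sample; note that the jackknife standard error below uses the usual $(n-1)/n$ inflation factor, an extra detail not spelled out above (numpy assumed).

```python
import numpy as np

rng = np.random.default_rng(4)
s = rng.normal(size=100)  # hypothetical sample of size 100
n = s.size

# JK-1: delete one element at a time and recompute the statistic (here, the mean).
jack = np.array([np.delete(s, i).mean() for i in range(n)])
se_jack = np.sqrt((n - 1) / n * np.sum((jack - jack.mean()) ** 2))

# Bootstrap: resample with replacement, keeping the sample size fixed at n.
boot = np.array([rng.choice(s, size=n, replace=True).mean() for _ in range(5000)])
se_boot = boot.std(ddof=1)

print(se_jack, se_boot)  # both estimate the standard error of the sample mean
```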

You may also consider splitting the sample of size 100 into e.g. 10 samples of size 10. This would in some theoretical aspects be cleaner (independent subsets) but reduces the sample size (from 100 to 10) so much as to be impractical (in most cases).

You could also consider partially overlapping subsets of certain size. All this is handled in an automatic and uniform and random way by the bootstrap method.

Further, the bootstrap method gives you an estimate of the sampling distribution of your statistic from the empirical distribution of the original sample, so you can analyze further properties of the statistic besides standard error.

$\begingroup$The link above is defunct so I don't know what Fox said. But none of this addresses my concern that bootstrapping creates error. Suppose you wanted to know about the relative frequency of languages on earth. If you took your sample from the internet and just resampled that sample, you would miss all the languages not on the net.$\endgroup$
– aquagremlin, Dec 25 '15 at 16:06

A finite sampling of the population approximates the distribution the same way a histogram approximates it. By re-sampling, each bin count is changed and you get a new approximation. Large count values fluctuate less than small count values, both in the original population and in the sampled set. Since you are explaining this to a layperson, you can argue that for large bin counts the fluctuation is roughly the square root of the bin count in both cases.

If I find $20$ redheads and $80$ others out of a sample of $100$, re-sampling would estimate the fluctuation of redheads as $\sqrt{(0.2 \times 0.8) \times 100}$, which is just like assuming that the original population was truly distributed $1:4$. So if we approximate the true probability as the sampled one, we can get an estimate of sampling error "around" this value.
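A quick numerical check of this, with the 20/80 sample coded as ones and zeros (a sketch, numpy assumed): the bootstrap fluctuation of the redhead count and the binomial formula both come out near 4.

```python
import numpy as np

rng = np.random.default_rng(5)

# 20 redheads (1) and 80 others (0) out of a sample of 100.
sample = np.array([1] * 20 + [0] * 80)

# Binomial approximation to the fluctuation of the redhead count.
print(np.sqrt(0.2 * 0.8 * 100))  # about 4

# Bootstrap estimate of the same fluctuation.
counts = np.array([rng.choice(sample, size=100, replace=True).sum()
                   for _ in range(5000)])
print(counts.std(ddof=1))        # also about 4
```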

I think it is important to stress that the bootstrap does not uncover "new" data; it is just a convenient, nonparametric way to approximately determine the sample-to-sample fluctuations if the true probability is given by the sampled one.

$\begingroup$I made slight formatting changes in your answer - feel free to revert them if you find them unsuitable. What may need some further clarification is why there is a square root?$\endgroup$
– Tim♦, Mar 10 '16 at 13:30

Note that in classic inferential statistics the theoretical entity that connects a sample to the population as a good estimator of the population is the sampling distribution (all the possible samples that could be drawn from the population). The bootstrap method is creating a kind of sampling distribution (a distribution based on multiple samples). Sure, it is a maximum likelihood method, but the basic logic is not that different from that of the traditional probability theory behind classic normal distribution-based statistics.

$\begingroup$This doesn't seem to distinguish the bootstrap from any other statistical procedure that begins with the raw data. It seems only to distinguish those from procedures that are based on summary statistics or binned frequencies.$\endgroup$
– whuber♦, Jun 30 '17 at 20:23

When explaining to beginners I think it helps to take a specific example...

Imagine you've got a random sample of 9 measurements from some population. The mean of the sample is 60. Can we be sure that the average of the whole population is also 60? Obviously not because small samples will vary, so the estimate of 60 is likely to be inaccurate. To find out how much samples like this will vary, we can run some experiments - using a method called bootstrapping.

The first number in the sample is 74 and the second is 65, so let's imagine a big "pretend" population comprising one ninth 74's, one ninth 65's, and so on. The easiest way to take a random sample from this population is to take a number at random from the sample of nine, then replace it so you have the original sample of nine again and choose another one at random, and so on until you have a "resample" of 9. When I did this, 74 did not appear at all but some of the other numbers appeared twice, and the mean was 54.4. (This is set up on the spreadsheet at http://woodm.myweb.port.ac.uk/SL/resample.xlsx - click on the bootstrap tab at the bottom of the screen.)

When I took 1000 resamples in this way their means varied from 44 to 80, with 95% between 48 and 72. This suggests that there is an error of up to 16-20 units (44 is 16 below the pretend population mean of 60, 80 is 20 units above) in using samples of size 9 to estimate the population mean, and that we can be 95% confident that the error will be 12 or less. So we can be 95% confident that the population mean will be somewhere between 48 and 72.
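For readers who would rather reproduce something like this in code than in the spreadsheet, here is a sketch in Python; only 74 and 65 come from the text above, and the remaining seven values are invented so that the sample mean is 60 (numpy assumed).

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical sample of 9 measurements: only 74 and 65 are from the example.
sample = np.array([74, 65, 60, 55, 48, 52, 70, 58, 58])  # mean = 60

# 1000 resamples of size 9, drawn with replacement, recording each mean.
boot_means = np.array([rng.choice(sample, size=9, replace=True).mean()
                       for _ in range(1000)])

# 95% percentile interval for the mean, analogous to the 48-to-72 range above.
print(np.percentile(boot_means, [2.5, 97.5]))
```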

There are a number of assumptions glossed over here, the obvious one being the assumption that the sample gives a useful picture of the population - experience shows this generally works well provided the sample is reasonably large (9 is a bit small but makes it easier to see what's going on).
The spreadsheet at http://woodm.myweb.port.ac.uk/SL/resample.xlsx enables you to see individual resamples, plot histograms of 1000 resamples, experiment with larger samples, etc. There's a more detailed explanation in the article at https://arxiv.org/abs/1803.06214.
