According to Wikipedia, the jackknife technique “was developed by Maurice Quenouille from 1949 and refined in 1956”, and was so named by John Tukey in 1958 (who also expanded on the technique). Quenouille’s 1956 paper “Notes on Bias in Estimation” gives a clear explanation of the intuition behind the jackknife estimate.

In the general statistical set-up, we are interested in estimating some parameter $\theta$, using observations $x_1, \dots, x_n$. Let $\hat{\theta}_n$ denote our estimate for $\theta$ based on these $n$ observations. (For concreteness, one example you could consider is $\theta = \operatorname{Var}(X)$, i.e. the variance of the distribution, and $\hat{\theta}_n = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$.)

There are several different properties we might want our estimate to have, e.g. efficiency, sufficiency, consistency, unbiasedness. The maximum likelihood estimate (MLE) has the first 3 properties but not the 4th (except in rare situations). If we have an MLE (or some other consistent estimate), how can we adjust it to obtain an estimate that is less biased?

Under some conditions, we can perform a Taylor expansion to get

$$\mathbb{E}[\hat{\theta}_n] = \theta + \frac{a_1}{n} + \frac{a_2}{n^2} + O\left(\frac{1}{n^3}\right).$$

If this is the case, then if we define $\hat{\theta}' = n\hat{\theta}_n - (n-1)\hat{\theta}_{n-1}$ (where $\hat{\theta}_{n-1}$ is the same estimate computed from the first $n-1$ observations), the equation above gives

$$\mathbb{E}[\hat{\theta}'] = \theta - \frac{a_2}{n(n-1)} + O\left(\frac{1}{n^3}\right),$$

so $\hat{\theta}'$ is biased to order $1/n^2$, while our original estimate was biased to order $1/n$.

In defining $\hat{\theta}'$, we used $\hat{\theta}_{n-1}$, which was based on $x_1, \dots, x_{n-1}$. There is no a priori reason why we should use this particular set of $n-1$ points: we could just as easily have used $x_2, \dots, x_n$, or any other set of $n-1$ points instead. In fact, Quenouille argues that in order to minimize the loss in efficiency, one should use all $n$ possible sets of $n-1$ observations.

If we define $\hat{\theta}_{(i)}$ as the estimate of $\theta$ based on all the observations except $x_i$, then replacing $\hat{\theta}_{n-1}$ by the mean of these $\hat{\theta}_{(i)}$'s (which we denote by $\bar{\theta}_{(\cdot)} = \frac{1}{n}\sum_{i=1}^n \hat{\theta}_{(i)}$) yields the jackknife estimate:

$$\hat{\theta}_{\text{jack}} = n\hat{\theta}_n - (n-1)\bar{\theta}_{(\cdot)}.$$

Instead of bias estimation, Tukey saw the utility of the jackknife for robust interval estimation. Define the pseudo-values $\tilde{\theta}_i = n\hat{\theta}_n - (n-1)\hat{\theta}_{(i)}$, so that $\hat{\theta}_{\text{jack}}$ is their mean. Arguing that the $\tilde{\theta}_i$'s could be treated as approximately i.i.d. random variables, the statistic

$$\frac{\sqrt{n}\left(\hat{\theta}_{\text{jack}} - \theta\right)}{s}, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^n \left(\tilde{\theta}_i - \hat{\theta}_{\text{jack}}\right)^2,$$

should have an approximate $t$ distribution with $n-1$ degrees of freedom. Thus, we can make use of this pivotal statistic to do interval estimation.
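To make this concrete, here is a short Python sketch (the function names are my own, for illustration) of the jackknife estimate applied to the plug-in variance. For this particular estimator, the jackknife correction recovers exactly the familiar unbiased sample variance:

```python
import numpy as np

def plugin_var(x):
    # Plug-in variance estimate (divides by n); its bias is of order 1/n.
    return np.mean((x - np.mean(x)) ** 2)

def jackknife(x, estimator):
    # Jackknife bias-corrected estimate:
    # n * theta_hat_n - (n - 1) * mean of the leave-one-out estimates.
    n = len(x)
    loo = np.array([estimator(np.delete(x, i)) for i in range(n)])
    return n * estimator(x) - (n - 1) * loo.mean()

rng = np.random.default_rng(0)
x = rng.normal(size=20)
# For the plug-in variance, the jackknife correction coincides exactly
# with the unbiased sample variance (the 1/(n-1) version).
print(np.isclose(jackknife(x, plugin_var), np.var(x, ddof=1)))
```

The bias of the plug-in variance is exactly $-\sigma^2/n$ (no higher-order terms), which is why the jackknife removes it completely here rather than just reducing its order.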

When creating a presentation in LaTeX using Beamer, the \tableofcontents command automatically creates a table of contents for you based on the \section, \subsection and \subsubsection commands. For example, in one of my documents,

\begin{frame}{Outline of talk}
\tableofcontents
\end{frame}

creates this slide:

Sometimes, in addition to a table of contents slide at the beginning of the presentation, I want a slide at the beginning of each section to remind the audience of where we are in the presentation. We can do this by adding the following lines of code before the \begin{document} command:
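A standard snippet for this uses Beamer's \AtBeginSection hook:

```latex
\AtBeginSection[]
{
  \begin{frame}{Outline of talk}
    \tableofcontents[currentsection]
  \end{frame}
}
```

The [currentsection] option highlights the current section in the outline while showing the other sections in a semi-transparent way.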

In this previous post, we introduced quantile regression and demonstrated how to run it in R. In this post, we dig into the math behind quantile regression.

Actually, before we get into quantile regression, let’s talk about quantiles. Given some observations $x_1, \dots, x_n$, how do we define the 50th quantile, better known as the median? In high school, we are taught that if we line up all our observations in ascending order, the median is the element right in the middle if $n$ is odd, or the mean of the two middle elements if $n$ is even. Quantiles can be defined in a similar manner: for $\tau \in (0, 1)$, the $\tau$th quantile is a value such that at least a $\tau$ fraction of the observations are less than or equal to it.

This is all well and good, but this notion is a bit hard to generalize. Instead, we can define the $\tau$th quantile as the solution to a minimization problem; the “algorithm” for finding the $\tau$th quantile in the previous paragraph can be viewed as a way to solve this minimization problem.

Here is the minimization problem: if we define the check loss $\rho_\tau(u) = u\,(\tau - 1\{u < 0\})$, then for a set of observations $x_1, \dots, x_n$, we define its $\tau$th quantile as

$$\hat{q}_\tau = \operatorname*{argmin}_q \sum_{i=1}^n \rho_\tau(x_i - q).$$

When $\tau = 1/2$, we see that the above reduces to

$$\hat{q}_{1/2} = \operatorname*{argmin}_q \frac{1}{2}\sum_{i=1}^n |x_i - q|,$$

i.e. the median minimizes the sum of absolute deviations.

Here is some hand-wavy reasoning for why the sample $\tau$th quantile minimizes the expression above. Assume that a proportion $p$ of the terms are in the first summation $\sum_{x_i \geq q} \tau\,(x_i - q)$, and that the remaining $1 - p$ of the terms are in the second summation $\sum_{x_i < q} (1 - \tau)\,(q - x_i)$. If we increase $q$ by some little $\varepsilon > 0$ so that the number of terms in each summation doesn’t change, then overall the expression increases by

$$n\varepsilon\left[(1-p)(1-\tau) - p\tau\right] = n\varepsilon\,(1 - \tau - p).$$

If $p > 1 - \tau$, then this change is negative and we can decrease the objective function value by making $q$ larger. If $p < 1 - \tau$, then we can decrease the objective function value by making $q$ smaller. The only place where we can’t have any improvement is when $p = 1 - \tau$, i.e. $q$ chosen such that a $\tau$ fraction of the observations are less than it.
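We can check this reasoning numerically. The Python sketch below (an illustration; the check loss is also called the “pinball” loss) minimizes the empirical check loss over a fine grid of candidates and compares the minimizer to the sample quantile:

```python
import numpy as np

def pinball(u, tau):
    # Check loss rho_tau(u) = u * (tau - 1{u < 0}).
    return u * (tau - (u < 0))

rng = np.random.default_rng(1)
x = rng.normal(size=1001)
tau = 0.9

# Minimize the empirical check loss over a fine grid of candidate q's.
grid = np.linspace(x.min(), x.max(), 20001)
losses = np.array([pinball(x - q, tau).sum() for q in grid])
q_hat = grid[losses.argmin()]

# The minimizer agrees with the sample 0.9-quantile (up to grid spacing).
print(abs(q_hat - np.quantile(x, tau)) < 0.01)
```

The objective is piecewise linear in q with kinks at the data points, so the minimizer is always attained at one of the observations; the grid search here is just the simplest way to see that.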

Let’s extend this minimization problem idea to quantile regression. For each response value $y_i$, we have a corresponding estimate $\hat{y}_i$. We can interpret the setting above as trying to minimize

$$\sum_{i=1}^n \rho_\tau(y_i - \hat{y}_i),$$

where the $\hat{y}_i$'s are all forced to be equal to one value, $q$. In quantile regression, we instead require $\hat{y}_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}$, where $x_{i1}, \dots, x_{ip}$ are the values for the $i$th observation for the $p$ features on hand. Thus, we can rewrite the minimization problem as

$$\operatorname*{argmin}_{\beta_0, \dots, \beta_p} \sum_{i=1}^n \rho_\tau\left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\right),$$

or in vector notation (and incorporating the intercept as a feature),

$$\operatorname*{argmin}_{\beta} \sum_{i=1}^n \rho_\tau(y_i - x_i^T \beta).$$

Notice that this is very similar to ordinary least squares regression. In that case, instead of applying $\rho_\tau$ to each residual $y_i - x_i^T \beta$, we apply the square function $u \mapsto u^2$.
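As a sanity check, here is a small Python sketch that fits a median regression line by brute-force minimization of the check loss on simulated data. (This is only an illustration; real quantile regression solvers such as quantreg use linear programming.)

```python
import numpy as np

def pinball(u, tau):
    # Check loss rho_tau(u) = u * (tau - 1{u < 0}).
    return u * (tau - (u < 0))

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500)
y = 2 + 3 * x + rng.normal(size=500)   # true intercept 2, slope 3

# Fit y = b0 + b1*x for the median (tau = 0.5) by grid search over (b0, b1).
b0s = np.linspace(0, 4, 81)
b1s = np.linspace(2, 4, 81)
best = min((pinball(y - b0 - b1 * x, 0.5).sum(), b0, b1)
           for b0 in b0s for b1 in b1s)
_, b0_hat, b1_hat = best

print(abs(b1_hat - 3) < 0.3)   # recovered slope is close to the true slope
```

With symmetric noise, the conditional median and conditional mean coincide, so the median regression fit here should agree closely with an OLS fit on the same data.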

Let $y \in \mathbb{R}$ be some response variable of interest, and let $x \in \mathbb{R}^p$ be a vector of features or predictors that we want to use to model the response. In linear regression, we are trying to estimate the conditional mean function, $\mathbb{E}[y \mid x]$, by a linear combination of the features.

Let’s make this more concrete. Say Uber came up with a new algorithm for dispatching drivers and we are interested in how this algorithm fares in terms of wait times for consumers. A simple (linear regression) model for this is

$$W = \beta_0 + \beta_1 T + \epsilon,$$

where $W$ is the wait time, $T = 1$ if the new algorithm was used to dispatch a driver, and $T = 0$ if the previous algorithm was used. From this model, we can say that under the old algorithm, mean wait time was $\beta_0$, but under the new algorithm, mean wait time is $\beta_0 + \beta_1$. So if $\beta_1 < 0$, I would infer that my new algorithm is “doing better”.

But is it really? What if the new algorithm improves wait times for 90% of customers by 1 min, but makes the wait times for the remaining 10% longer by 5 min? Overall, I would see a decrease in mean wait time, but things got significantly worse for a segment of my population. What if that 10% whose wait times became 5 minutes longer were already having the longest wait times to begin with? That seems like a bad situation to have, but our earlier model would not pick it up.
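A quick numerical illustration of this scenario (all numbers here are made up):

```python
import numpy as np

# Hypothetical wait times (minutes): under the new algorithm, 90% of
# customers wait 1 minute less, but the slowest 10% wait 5 minutes more.
rng = np.random.default_rng(0)
old = np.sort(rng.uniform(2, 10, size=1000))
new = old.copy()
new[:900] -= 1.0   # the 90% with the shortest waits improve by 1 minute
new[900:] += 5.0   # the longest-waiting 10% get 5 minutes worse

print(new.mean() - old.mean())                           # mean goes down
print(np.quantile(new, 0.95) - np.quantile(old, 0.95))   # tail gets worse
```

The mean wait time drops by 0.4 minutes even though the 95th percentile wait time rises by 5 minutes, which is exactly the situation a mean-based model cannot detect.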

One way to pick up such situations is to model conditional quantile functions instead. That is, instead of trying to estimate the mean of $y$ given the features $x$, let’s try to estimate a quantile of $y$ given the features $x$. In our example above, instead of trying to estimate the mean wait time, we could estimate the 95th quantile wait time to catch anything going wrong out in the tails of the distribution.

Another example where estimating conditional quantiles is useful is in growth charts. You have probably seen a chart like the one below in a doctor’s office before. Each line in the growth chart represents some quantile for length/weight given the person’s age. We track our children’s length and weight on this chart to see if they are growing normally or not.

WHO growth chart for boys. The lines represent conditional quantile functions for different quantiles of length given age and of weight given age.

Quantile regression is a regression method for estimating these conditional quantile functions. Just as linear regression estimates the conditional mean function as a linear combination of the predictors, quantile regression estimates the conditional quantile function as a linear combination of the predictors.

Quantile regression in R

We can perform quantile regression in R easily with the quantreg package. I will demonstrate how to use it on the mtcars dataset. (For more details on the quantreg package, you can read the package’s vignette here.)

Let’s load our packages and data:

library(quantreg)
data(mtcars)

We can perform quantile regression using the rq function. We can specify a tau option which tells rq which conditional quantile we want. The default value for tau is 0.5, which corresponds to median regression. Below, we fit a quantile regression of miles per gallon vs. car weight:

fit <- rq(mpg ~ wt, data = mtcars, tau = 0.5)
summary(fit)

In the table above, the lower bd and upper bd columns represent the endpoints of confidence intervals for the model coefficients. There are a number of ways for these confidence intervals to be computed; this can be specified using the se option when invoking the summary function. The default value is se = "rank", with the other options being "iid", "nid", "ker", "boot" and "BLB" (type ?summary.rq for details).

This next block of code plots the data along with the quantile regression line in blue and the linear regression line in red:

plot(mpg ~ wt, data = mtcars)
abline(lm(mpg ~ wt, data = mtcars), col = "red")
abline(rq(mpg ~ wt, data = mtcars, tau = 0.5), col = "blue")

Median regression (i.e. 50th quantile regression) is sometimes preferred to linear regression because it is “robust to outliers”. The next plot illustrates this. We add two outliers to the data (colored in orange) and see how it affects our regressions. The dotted lines are the fits for the original data, while the solid lines are for the data with outliers. As before, red is for linear regression while blue is for quantile regression. See how the linear regression fit shifts a fair amount compared to the median regression fit (which barely moves!)?

Simulated annealing is a probabilistic technique for approximating the global optimum of a given objective function. Because it is not guaranteed to find the global optimum, it is known as a metaheuristic.

According to Wikipedia, annealing is “a heat treatment that alters the physical and sometimes chemical properties of a material to increase its ductility and reduce its hardness, making it more workable. It involves heating a material above its recrystallization temperature, maintaining a suitable temperature for a suitable amount of time, and then cooling” (emphasis mine). As we will see, simulated annealing mimics this physical process in trying to find a function’s global optimum.

There are several versions of simulated annealing. Below, we give a high-level outline of a basic version for minimizing a function $f : \mathcal{X} \to \mathbb{R}$. (Simulated annealing can be used for constrained optimization as well; for simplicity we limit our exposition to unconstrained optimization.) Let $n$ index the iteration number. Simulated annealing has a positive temperature parameter, denoted by $T_n$ at iteration $n$, whose evolution strongly influences the result.

Start at some initial point $x_0 \in \mathcal{X}$. For each iteration $n = 0, 1, 2, \dots$, if the stopping criterion is not satisfied:

Choose some random proposal point $y$ from a probability distribution. The lower the temperature, the more concentrated the probability distribution’s mass is around $x_n$.

With some acceptance probability, set $x_{n+1} = y$. If not, set $x_{n+1} = x_n$. The acceptance probability is a function of the function values $f(x_n)$ and $f(y)$, and the current temperature $T_n$.

Update the temperature parameter.

That’s all there is to it! The devil, of course, is in the details. Here are some examples of what each of the building blocks could be.

Stopping criterion:

Terminate after a fixed number of iterations, or when the computational budget has been fully utilized.

Terminate when the objective function does not improve much, e.g. $|f(x_{n+1}) - f(x_n)| < \epsilon$ for several consecutive iterations.

Choosing a random proposal point: This depends very much on what the underlying state space looks like. A benefit of simulated annealing is that it can be applied to very general state spaces $\mathcal{X}$, even discrete ones, as long as there is a notion of distance between two points in the space. (In particular, first-order information about the space (i.e. gradients) is not needed.)

If $\mathcal{X} = \mathbb{R}^d$, we could set $y = x_n + T_n Z$, where $Z \sim \mathcal{N}(0, I_d)$ or $Z \sim \mathrm{Unif}([-1, 1]^d)$.

The dependence on $T_n$ could be different as well, e.g. $y = x_n + \sqrt{T_n}\, Z$.

If $\mathcal{X}$ is the space of permutations of $\{1, \dots, m\}$, then $y$ could be selected uniformly at random from the set of permutations which are at most some number of transpositions (depending on $T_n$) from $x_n$.

In the proposals above, the smaller $T_n$ is, the closer $y$ is likely to be to $x_n$.

Acceptance probability: The probability of accepting $y$ as the next value in the sequence (i.e. as $x_{n+1}$) is a function which depends on $f(x_n)$, $f(y)$ and (crucially) the temperature $T_n$. Typically if $f(y) \leq f(x_n)$, we would accept $y$ with probability 1, but this does not have to be the case. In general, the acceptance probability should decrease as $T_n$ decreases and as $f(y) - f(x_n)$ becomes more positive.

This is the most popular acceptance function in practice: if $f(y) \leq f(x_n)$, accept $y$. If not, accept $y$ with probability $\exp\left(-\frac{f(y) - f(x_n)}{T_n}\right)$.
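Putting these building blocks together, here is a minimal Python sketch for a 1-dimensional problem. The geometric cooling schedule, Gaussian proposal, and Metropolis acceptance rule are just one possible set of choices:

```python
import math
import random

def anneal(f, x0, steps=20000, t0=1.0):
    # Minimal 1-D simulated annealing sketch (assumed geometric cooling).
    random.seed(0)
    x = x0
    for n in range(steps):
        t = t0 * 0.999 ** n               # temperature decays geometrically
        y = x + t * random.gauss(0, 1)    # proposal concentrates near x as t falls
        delta = f(y) - f(x)
        # Metropolis acceptance: always take improvements; take uphill
        # moves with probability exp(-delta / t).
        if delta <= 0 or random.random() < math.exp(-delta / t):
            x = y
    return x

x_min = anneal(lambda x: (x - 3) ** 2, x0=0.0)
print(abs(x_min - 3) < 0.5)   # ends up near the global minimum at x = 3
```

On this smooth convex example any method would succeed; the interest of simulated annealing is that the same loop applies unchanged to multimodal or discrete problems, since it only ever evaluates f.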