Medians in high dimensions

The median is a common statistical measure of central tendency, i.e. a value that lies at the “center” of a sample of observations. In one dimension it is easy to describe. If the observations are $x_1, \dots, x_n$, we simply line them up along the real number line and report the one right in the middle (or the mean of the two right in the middle if $n$ is even).

It turns out that in higher dimensions, there are several “natural” ways to define the median. These medians are defined for points in $\mathbb{R}^d$; when $d = 1$, they all coincide with the definition of the one-dimensional (univariate) median.

One way to define the median in higher dimensions is to take the univariate median along each dimension. More concretely, if $x_{ij}$ is the $j$th coordinate of the $i$th observation, then the median is a $d$-dimensional vector such that its $j$th coordinate is the median of $x_{1j}, \dots, x_{nj}$. This is known as the marginal median.
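The marginal median is simple to compute. The sketch below assumes the observations are stored as the rows of an array with one column per coordinate; the function name `marginal_median` and the sample data are illustrative only.

```python
import numpy as np

def marginal_median(X):
    """Coordinate-wise median of an (n, d) array of observations."""
    X = np.asarray(X, dtype=float)
    # Take the univariate median down each column (i.e. along each dimension).
    return np.median(X, axis=0)

# Illustrative data: 5 observations in 2 dimensions.
X = np.array([[0.0, 0.0],
              [1.0, 5.0],
              [2.0, 1.0],
              [10.0, 2.0],
              [3.0, 9.0]])

print(marginal_median(X))  # [2. 2.]
```

Note that in this example the marginal median (2, 2) is not one of the observations: each coordinate is a median of its own column, but the columns' medians come from different rows.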

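Another way to define the median in higher dimensions is to note that in one dimension, the median minimizes the sum of distances to the points, i.e.

$$\operatorname{median}(x_1, \dots, x_n) \in \underset{m \in \mathbb{R}}{\operatorname{argmin}} \sum_{i=1}^n |x_i - m|.$$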

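We can take this to be the definition of the median and extend it easily to higher dimensions:

$$\underset{m \in \mathbb{R}^d}{\operatorname{argmin}} \sum_{i=1}^n \| x_i - m \|_2,$$

where $\| \cdot \|_2$ is Euclidean distance in $\mathbb{R}^d$. This is also known as the geometric median or the $L_1$ median.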
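The geometric median has no closed form in general, so it is usually found iteratively. One common choice (not discussed in this post, and used here only as an illustration) is Weiszfeld's algorithm, which repeatedly replaces the current guess with a weighted average of the observations, with weights inversely proportional to their distance from the guess. The function name, tolerance, and the small constant guarding against division by zero are assumptions of this sketch.

```python
import numpy as np

def geometric_median(X, tol=1e-9, max_iter=1000):
    """Approximate the geometric median of the rows of X (Weiszfeld iteration)."""
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)  # start from the centroid
    for _ in range(max_iter):
        dist = np.linalg.norm(X - m, axis=1)
        # Guard against division by zero if the iterate lands on an observation.
        dist = np.maximum(dist, 1e-12)
        w = 1.0 / dist
        m_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(m_new - m) < tol:
            return m_new
        m = m_new
    return m

# Four points forming a convex quadrilateral whose diagonals cross at the origin.
X = np.array([[-1.0, 0.0],
              [0.0, -1.0],
              [3.0, 0.0],
              [0.0, 2.0]])

print(geometric_median(X))  # close to (0, 0)
print(X.mean(axis=0))       # the centroid, (0.5, 0.25), for comparison
```

For four points in convex position the geometric median is the intersection of the diagonals, which gives a handy sanity check: the iteration starts at the centroid (0.5, 0.25) and moves to the origin.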

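One is not limited to using Euclidean distance in the definition above; for any distance metric $\rho$ over $\mathbb{R}^d$ we could define a median as

$$\underset{m \in \mathbb{R}^d}{\operatorname{argmin}} \sum_{i=1}^n \rho(x_i, m).$$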

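The medoid takes this approach, except that it limits the possible values that the median can take on to the set of observations itself:

$$\underset{m \in \{x_1, \dots, x_n\}}{\operatorname{argmin}} \sum_{i=1}^n \rho(x_i, m),$$

where $\rho$ is the chosen distance metric.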
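Because the medoid must be one of the observations, it can be found by brute force over all pairwise distances. The sketch below assumes Euclidean distance; any other metric could be substituted when building the distance matrix. The function name `medoid` and the data are illustrative.

```python
import numpy as np

def medoid(X):
    """Return (index, point) of the observation minimizing total distance to the rest."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances, shape (n, n). This uses O(n^2 d) memory.
    diff = X[:, None, :] - X[None, :, :]
    D = np.linalg.norm(diff, axis=2)
    # Row sums give each observation's total distance to all observations.
    idx = int(np.argmin(D.sum(axis=1)))
    return idx, X[idx]

# Four corners of the unit square plus one far-away outlier.
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [10.0, 10.0]])

idx, point = medoid(X)
print(idx, point)  # 3 [1. 1.]
```

Here the outlier pulls the medoid to the corner nearest to it, (1, 1), but the result is still one of the original observations.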

Yet another way to generalize the univariate median is to realize that the median divides the one-dimensional space such that half the points lie on each side of it. The centerpoint generalizes this to $d$ dimensions: it is a point such that any hyperplane passing through it divides the observations into two sets, each containing at least a $1/(d+1)$ fraction of the observations.

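The centerpoint is closely related to the notion of Tukey depth, introduced by John Tukey in 1974. Given a set of observations $x_1, \dots, x_n \in \mathbb{R}^d$, the Tukey depth of a point $x$ is the size of the smallest subset of points on either side of any hyperplane passing through $x$. Mathematically,

$$\operatorname{depth}(x) = \min_{u \in \mathbb{R}^d,\ \|u\|_2 = 1} \#\left\{ i : u^\top x_i \geq u^\top x \right\}.$$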

Donoho & Gasko (1992) noted that this could be the basis of a definition for a multivariate median which is now known as the Tukey median: a point that maximizes Tukey depth, i.e.

$$\underset{x \in \mathbb{R}^d}{\operatorname{argmax}}\ \operatorname{depth}(x).$$
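Computing Tukey depth exactly requires minimizing over all directions, which is expensive in high dimensions. As a rough illustration only, the sketch below approximates the depth by sampling random unit directions (this yields an upper bound on the true depth) and then searches for the deepest point among the observations themselves rather than over all of the space. Both simplifications, and the names `tukey_depth` and `deepest_observation`, are assumptions of this example and not part of the definitions above.

```python
import numpy as np

def tukey_depth(x, X, n_dirs=2000, seed=0):
    """Approximate Tukey depth of x with respect to the rows of X.

    Minimizes over randomly sampled unit directions only, so the result
    is an upper bound on the true depth.
    """
    X = np.asarray(X, dtype=float)
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n_dirs, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    # proj[i, k] = u_k . (x_i - x); count points in each closed half-space.
    proj = (X - x) @ U.T
    counts = (proj >= 0).sum(axis=0)
    return int(counts.min())

def deepest_observation(X, **kwargs):
    """Index and depth of the observation with the largest approximate depth."""
    depths = [tukey_depth(x, X, **kwargs) for x in X]
    idx = int(np.argmax(depths))
    return idx, depths[idx]

# Unit-square corners, the square's center, and one far-away outlier.
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.5, 0.5],
              [10.0, 10.0]])

print([tukey_depth(x, X) for x in X])
print(deepest_observation(X))
```

In this example the center of the square is the deepest observation, while the outlier and three of the corners lie on the boundary of the convex hull with nothing behind them and so have depth 1.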