Definition (Relative entropy): The relative entropy between two discrete probability distributions $p$ and $q$ on the probability simplex $\Delta = \{p \in \mathbb{R}^d_+ : \sum_{i=1}^d p_i = 1\}$ is defined to be the positive quantity

$$D(p \| q) = \sum_{i=1}^d p_i \log \frac{p_i}{q_i}.$$

The relative entropy $D(p \| q)$ is only finite when the distribution $p$ is absolutely continuous with respect to the distribution $q$. Whenever $p_i$ is zero, the contribution of the $i$th term is taken to be zero as $\lim_{x \downarrow 0} x \log x = 0$.
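The definition, together with the $0 \log 0$ convention and the absolute continuity requirement, translates directly into a few lines of Python. A minimal sketch (the helper name `rel_entropy` is my own):

```python
import math

def rel_entropy(p, q):
    """Relative entropy D(p||q) = sum_i p_i log(p_i / q_i).

    Terms with p_i == 0 contribute zero (x log x -> 0 as x -> 0);
    the result is +inf when p is not absolutely continuous w.r.t. q.
    """
    d = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # 0 * log 0 convention
        if qi == 0.0:
            return math.inf  # p not absolutely continuous w.r.t. q
        d += pi * math.log(pi / qi)
    return d
```

For example, `rel_entropy([0.5, 0.5], [0.5, 0.5])` returns `0.0`, while a `q` with a zero entry where `p` has mass yields `inf`.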

Properties (Relative entropy): The relative entropy enjoys the following properties:

Information inequality: $D(p \| q) \ge 0$ for all $p$, $q$ in $\Delta$, while $D(p \| q) = 0$ if and only if $p = q$.

Convexity: $D(p \| q)$ is jointly convex in $(p, q)$.

Lower semicontinuity: $D(p \| q)$ is lower semicontinuous in $(p, q)$.

The relative entropy defines two convex measures of distance on the unit simplex. Indeed, due to the asymmetry of the relative entropy we can define in this context two distinct kinds of pseudo balls. The sublevel set of the first kind

$$B_r'(q) = \{p \in \Delta : D(p \| q) \le r\}$$

and the sublevel set of the second kind

$$B_r''(q) = \{p \in \Delta : D(q \| p) \le r\}$$

indeed characterize two distinct convex geometries of distance on the unit simplex. From the convexity property of the relative entropy it follows that either kind of pseudo ball $B_r'(q)$ or $B_r''(q)$ is convex for any positive radius $r$.
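The convexity property can be sanity-checked numerically. The sketch below verifies midpoint convexity of $(p, q) \mapsto D(p \| q)$ for a pair of arbitrary illustrative distributions:

```python
import math

def rel_entropy(p, q):
    # D(p||q) = sum_i p_i log(p_i / q_i), with the 0 log 0 convention
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two arbitrary (distribution, reference) pairs on the 2-simplex.
p1, q1 = [0.7, 0.3], [0.5, 0.5]
p2, q2 = [0.2, 0.8], [0.6, 0.4]

# Midpoint of the two pairs, taken jointly in (p, q).
pm = [(a + b) / 2 for a, b in zip(p1, p2)]
qm = [(a + b) / 2 for a, b in zip(q1, q2)]

lhs = rel_entropy(pm, qm)
rhs = 0.5 * rel_entropy(p1, q1) + 0.5 * rel_entropy(p2, q2)
assert lhs <= rhs + 1e-12  # joint convexity at the midpoint
```

Joint convexity in $(p, q)$ is what makes both kinds of sublevel sets convex: fixing either argument preserves convexity in the other.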

Pseudo balls of the first kind

Pseudo balls of the first kind were encountered previously in the context of Sanov’s theorem. The figure below illustrates the pseudo balls $B_r'(q)$ of the first kind for various radii $r$.

[Figure: relative entropy balls of the first kind]

Theorem (Pseudo balls of the first kind): Pseudo balls of the first kind can be characterized as

$$B_r'(q) = \Big\{p \in \Delta : H(p) - \sum_{i=1}^d p_i \log q_i \le r\Big\}$$

using the negative entropy function $H(p) = \sum_{i=1}^d p_i \log p_i$.

The negative entropy function $H$ is a convex function which can be canonically represented using the exponential cone. We will see that the pseudo balls of the second kind admit a representation in terms of the geometric mean.
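The decomposition behind this characterization, $D(p \| q) = \sum_i p_i \log p_i - \sum_i p_i \log q_i$, can be checked numerically; the distributions below are arbitrary illustrative values:

```python
import math

p = [0.2, 0.5, 0.3]
q = [0.1, 0.6, 0.3]

neg_entropy = sum(pi * math.log(pi) for pi in p)        # H(p) = sum_i p_i log p_i
cross = sum(pi * math.log(qi) for pi, qi in zip(p, q))  # sum_i p_i log q_i
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# The pseudo ball of the first kind {p : D(p||q) <= r} is thus the sublevel
# set of the exponential-cone-representable function H(p) - sum_i p_i log q_i.
assert abs((neg_entropy - cross) - kl) < 1e-12
```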

Pseudo balls of the second kind

The figure below illustrates the pseudo balls $B_r''(q)$ of the second kind for various radii $r$.

[Figure: relative entropy balls of the second kind]

In practice, pseudo balls of the second kind are mostly encountered around empirical distributions

$$\hat p_n = \frac{1}{n} \sum_{k=1}^n \delta_{x_k}$$

of data samples $x_1, \dots, x_n$. In this case, the elements of the probability vector $\hat p_n$ are fractions with $n$ as a common denominator. The set of all such distributions is denoted further as $\Delta_n$.
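Constructing such an empirical distribution is a one-liner; the sample values below are hypothetical:

```python
from collections import Counter

def empirical_distribution(samples, alphabet):
    """Empirical distribution of n samples: every entry is a fraction k_i / n."""
    counts = Counter(samples)
    n = len(samples)
    return [counts[a] / n for a in alphabet]

# Hypothetical sample of n = 8 draws from the alphabet {1, 2, 3}.
samples = [1, 2, 2, 3, 1, 2, 3, 2]
p_hat = empirical_distribution(samples, alphabet=[1, 2, 3])
assert p_hat == [2/8, 4/8, 2/8]  # all entries share the common denominator n = 8
```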

Theorem (Pseudo balls of the second kind): Pseudo balls of the second kind around distributions $q \in \Delta_n$, say with entries $q_i = k_i / n$, can be characterized as

$$B_r''(q) = \Big\{p \in \Delta : \Big(\prod_{i=1}^d p_i^{k_i}\Big)^{1/n} \ge e^{-r} \prod_{i=1}^d q_i^{q_i}\Big\}.$$

Note that as $\Delta_n$ becomes dense in $\Delta$ with increasing $n$, the previous theorem can be used to construct a second-order cone representation (of arbitrary precision) of the pseudo balls $B_r''(q)$ for any $q$ in $\Delta$. Indeed, the function $p \mapsto (\prod_{i=1}^d p_i^{k_i})^{1/n}$ is recognized as a geometric mean, which is a positive concave function and is canonically represented using the second-order cone.
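The equivalence between the divergence sublevel set $\{p : D(q \| p) \le r\}$ and its geometric-mean form $(\prod_i p_i^{k_i})^{1/n} \ge e^{-r} \prod_i q_i^{q_i}$, for $q_i = k_i / n$, follows by exponentiating the inequality $\sum_i q_i \log p_i \ge \sum_i q_i \log q_i - r$. It can be verified numerically; the reference distribution and test points below are my own illustrative values:

```python
import math

# Empirical reference distribution with denominator n = 4: q = (1/4, 2/4, 1/4).
n = 4
counts = [1, 2, 1]
q = [k / n for k in counts]
r = 0.1  # ball radius

def in_ball_kl(p):
    # D(q||p) <= r: the pseudo ball of the second kind around q
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p)) <= r

def in_ball_geomean(p):
    # Equivalent form: (prod_i p_i^{k_i})^{1/n} >= e^{-r} * prod_i q_i^{q_i}
    gm = math.prod(pi ** ki for pi, ki in zip(p, counts)) ** (1 / n)
    threshold = math.exp(-r) * math.prod(qi ** qi for qi in q)
    return gm >= threshold

for p in ([0.25, 0.5, 0.25], [0.3, 0.4, 0.3], [0.05, 0.9, 0.05]):
    assert in_ball_kl(p) == in_ball_geomean(p)
```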

It turns out that the Chernoff bound previously discussed has a neat generalization in large deviation theory. In this setting the object under study is the entire empirical distribution $\hat p_n$ instead of merely the empirical average of the data. Now we can study extreme events in which the empirical distribution realizes in a convex set $C$ of distributions.

Csiszár’s Theorem

Let $x_1, \dots, x_n$ be data samples in the finite set $\{1, \dots, d\}$ drawn independently and identically from the distribution $q$. An important class of undesirable events can be expressed as the empirical distribution $\hat p_n$ of the data realizing in a closed convex set $C$. The closed convex set $C$ is a subset of the probability simplex $\Delta$ of distributions on $\{1, \dots, d\}$. For the topology employed here and further technical conditions on the set $C$, the reader is referred to the seminal paper of Csiszár. For the sake of readability, these technical points are suppressed here.

Csiszár’s Theorem: The probability that the empirical distribution $\hat p_n$ of $x_1, \dots, x_n$ realizes in a closed convex set $C$ is bounded by

$$\mathbb{P}(\hat p_n \in C) \le \exp\Big(-n \inf_{p \in C} D(p \| q)\Big), \tag{1}$$

where $D(p \| q)$ is the Kullback-Leibler divergence between $p$ and $q$.
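As a concrete check, consider the simplest binary case: $q = \mathrm{Bernoulli}(1/2)$ on the alphabet $\{0, 1\}$ and the closed convex set $C = \{p : p_1 \ge t\}$. Since $D(\mathrm{Bern}(p_1) \| \mathrm{Bern}(1/2))$ is increasing for $p_1 \ge 1/2$, the infimum is attained at the boundary $p_1 = t$, and the exact probability is a binomial tail. The values $n = 50$ and $t = 0.8$ below are illustrative:

```python
import math

def kl_bern(a, b):
    # D(Bern(a) || Bern(b))
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

n, q1, t = 50, 0.5, 0.8  # sample size, true success probability, threshold

# Exact probability that the empirical frequency of ones is at least t.
k_min = math.ceil(n * t)
exact = sum(math.comb(n, k) * q1**k * (1 - q1)**(n - k) for k in range(k_min, n + 1))

# The bound (1): inf_{p in C} D(p||q) is attained at the boundary p_1 = t.
bound = math.exp(-n * kl_bern(t, q1))

assert exact <= bound  # the bound holds, and is in the right ballpark
```

For these values the exact tail is roughly $1.2 \times 10^{-5}$ against a bound of roughly $6.5 \times 10^{-5}$: off by a constant factor, but with the correct exponential order.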

Geometric Interpretation

Csiszár’s theorem can be seen to subsume Chernoff’s result by recognizing that the empirical average $\bar x_n$ realizing in a closed convex set $U$ is stated equivalently as the empirical distribution $\hat p_n$ realizing in the closed convex set $C = \{p \in \Delta : \sum_{i=1}^d p_i a_i \in U\}$, where $a_1, \dots, a_d$ denote the points of the alphabet. In fact both bounds are exactly the same, as it can be shown that

$$\inf_{p \in C} D(p \| q) = \inf_{u \in U} \Lambda^*(u),$$

where $\Lambda^*$ is the convex dual of the log moment generating function $\Lambda$ of the distribution $q$ of the data. Observe that Csiszár’s inequality (1) admits a nice geometrical interpretation as done in the figure below. The probability of the empirical distribution realizing in a set $C$ is bounded above in terms of the distance, as measured by the Kullback-Leibler divergence, between the distribution $q$ generating the data and the set of extreme events $C$.

As the Kullback-Leibler divergence is jointly convex in both its arguments, Csiszár’s bound (1) is stated in terms of a convex optimization problem; that is, a convex optimization problem over distributions in $\Delta$, rather than vectors in $\mathbb{R}^d$.
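In the binary case the identity between the two bounds can be verified numerically: for samples on the alphabet $\{0, 1\}$ the empirical mean equals the empirical frequency of ones, and $\Lambda^*(u)$ coincides with $D((u, 1-u) \| q)$. A sketch with $q = \mathrm{Bernoulli}(1/2)$, approximating the supremum defining $\Lambda^*$ on a grid:

```python
import math

q1 = 0.5  # Bernoulli(1/2) on {0, 1}; the mean of p equals its mass p_1 on 1

def log_mgf(theta):
    # Lambda(theta) = log E_q[e^{theta x}] = log((1 - q1) + q1 * e^theta)
    return math.log((1 - q1) + q1 * math.exp(theta))

GRID = [k / 1000 for k in range(-5000, 5001)]  # theta grid for the supremum

def conjugate(u):
    # Lambda*(u) = sup_theta theta*u - Lambda(theta), approximated on the grid
    return max(theta * u - log_mgf(theta) for theta in GRID)

def kl(u):
    # D((u, 1-u) || (1/2, 1/2))
    return u * math.log(2 * u) + (1 - u) * math.log(2 * (1 - u))

for u in (0.6, 0.7, 0.8):
    assert abs(conjugate(u) - kl(u)) < 1e-4  # the two rate functions agree
```

The grid approximation is accurate here because the optimal $\theta^* = \log(u / (1 - u))$ lies well inside the grid and the objective is smooth and concave in $\theta$.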

Sanov’s Theorem

Csiszár’s bound is furthermore exponentially tight, as it correctly identifies the exponential rate with which the probability of the event diminishes to zero. This fact is codified in Sanov’s theorem.

Sanov’s Theorem: The probability that the empirical distribution $\hat p_n$ of $x_1, \dots, x_n$ realizes in an open convex set $C$ diminishes with exponential rate

$$\lim_{n \to \infty} \frac{1}{n} \log \mathbb{P}(\hat p_n \in C) = -\inf_{p \in C} D(p \| q). \tag{2}$$

Together, the previous results show that the Chernoff and Csiszár inequalities accurately quantify the probability of extreme events taking place. Furthermore, due to the convexity of the objects involved, computing these bounds amounts to solving a convex optimization problem.

When working with statistical data, it is often desirable to be able to quantify the probability of certain undesirable events taking place. In this post we discuss an interesting connection between convex optimization and extreme event analysis. We start with the classical Chernoff bound for the empirical average.

Chernoff’s Bound

Let $x_1, \dots, x_n$ be data samples in $\mathbb{R}^d$ drawn independently and identically from the distribution $q$ with mean $\mu$. An important class of undesirable events can be expressed as the empirical average $\bar x_n = \frac{1}{n} \sum_{k=1}^n x_k$ of the data realizing in a convex set $U$. When for instance the samples have an interpretation as losses, then knowing the probability of the average loss exceeding a critical value $u$ is paramount. In that case, knowing the probability that the empirical average realizes in the half space $U = \{x \in \mathbb{R}^d : \langle c, x \rangle \ge u\}$ would be of great interest. Chernoff’s classical inequality quantifies the probability of such events quite nicely.

Chernoff’s Theorem: The probability that the empirical average $\bar x_n$ of $x_1, \dots, x_n$ realizes in a closed convex set $U$ is bounded by

$$\mathbb{P}(\bar x_n \in U) \le \exp\Big(-n \inf_{u \in U} \Lambda^*(u)\Big), \tag{1}$$

with $\Lambda^*$ the convex dual of the log moment generating function $\Lambda(\theta) = \log \mathbb{E}_q\big[e^{\langle \theta, x \rangle}\big]$ and $q$ the distribution of the data.

Proof: Let $\theta \in \mathbb{R}^d$ and $c \in \mathbb{R}$, and consider the positive function $f(x) = e^{n(\langle \theta, x \rangle - c)}$. If the function $f$ satisfies $f(x) \ge 1$ for all $x$ in $U$, then we may conclude that

$$\mathbb{P}(\bar x_n \in U) \le \mathbb{E}\big[f(\bar x_n)\big] = e^{-nc}\, \mathbb{E}\big[e^{\langle \theta, x_1 + \dots + x_n \rangle}\big].$$

Using the independence of distinct samples and taking the logarithm on both sides of the previous inequality establishes

$$\frac{1}{n} \log \mathbb{P}(\bar x_n \in U) \le \Lambda(\theta) - c.$$

It is clear that $f(x) \ge 1$ for all $x$ in $U$ if and only if $\langle \theta, x \rangle \ge c$ for all $x$ in $U$. From this we obtain the general form of Chernoff’s bound

$$\frac{1}{n} \log \mathbb{P}(\bar x_n \in U) \le \inf_{\theta}\, \sup_{x \in U}\, \big[\Lambda(\theta) - \langle \theta, x \rangle\big] = -\inf_{u \in U}\, \sup_{\theta}\, \big[\langle \theta, u \rangle - \Lambda(\theta)\big] = -\inf_{u \in U} \Lambda^*(u).$$

The last equality follows from the minimax theorem for convex optimization.

Geometric Interpretation

The Chernoff bound (1) expresses the probability of the extreme event in terms of the convex conjugate $\Lambda^*$ of the log moment generating function. This means that Chernoff’s bound can be computed by solving a convex optimization problem.
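As a concrete instance of the bound, take standard normal data: then $\Lambda(\theta) = \theta^2 / 2$, the conjugate is $\Lambda^*(u) = u^2 / 2$, and for the half line $U = \{x \ge u\}$ the infimum is attained at the boundary. The exact tail of the sample mean, which is $N(0, 1/n)$, is available via the error function for comparison; `n` and `u` below are arbitrary example values:

```python
import math

n, u = 25, 0.5  # sample size and deviation threshold (illustrative values)

# Chernoff bound: inf over U = {v >= u} of Lambda*(v) = v^2/2 is u^2/2.
chernoff = math.exp(-n * u**2 / 2)

# Exact tail: the empirical average of n standard normals is N(0, 1/n),
# so P(mean >= u) = P(Z >= u * sqrt(n)) = erfc(u * sqrt(n) / sqrt(2)) / 2.
exact = 0.5 * math.erfc(u * math.sqrt(n) / math.sqrt(2))

assert exact <= chernoff  # the bound holds, with the right exponential order
```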

The function $\Lambda(\theta) = \log \mathbb{E}_q\big[e^{\langle \theta, x - \mu \rangle}\big]$ is the log moment generating function of the recentered distribution generating the data and is always convex. We give the cumulant generating function for some common standardized (zero mean, unit variance) distributions in the table below.

Distribution    $\Lambda(\theta)$              dom $\Lambda$
Normal          $\theta^2 / 2$                 $\mathbb{R}$
Laplace         $-\log(1 - \theta^2 / 2)$      $(-\sqrt{2}, \sqrt{2})$

Furthermore, it comes with a nice geometric interpretation as well. The function $\Lambda^*$ is positive and convex, satisfies $\Lambda^*(\mu) = 0$, and thus defines a distance between the mean $\mu$ of the distribution generating the data and its empirical mean.

The minimum distance $\inf_{u \in U} \Lambda^*(u)$, as measured by the convex dual of the log moment generating function, between the mean of the distribution generating the data and the set $U$ bounds the probability of the event $\bar x_n \in U$ taking place.
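This distance interpretation can be sketched numerically for the standardized Laplace distribution, whose log moment generating function is $\Lambda(\theta) = -\log(1 - \theta^2/2)$ on $(-\sqrt{2}, \sqrt{2})$; the supremum defining $\Lambda^*$ is approximated on a grid:

```python
import math

def log_mgf(theta):
    # Standardized (zero-mean, unit-variance) Laplace: Lambda(theta) = -log(1 - theta^2/2)
    return -math.log(1 - theta**2 / 2)

GRID = [k / 1000 for k in range(-1414, 1415)]  # grid inside dom = (-sqrt(2), sqrt(2))

def conjugate(u):
    # Lambda*(u) = sup_theta theta*u - Lambda(theta), approximated on the grid
    return max(theta * u - log_mgf(theta) for theta in GRID)

assert abs(conjugate(0.0)) < 1e-9       # zero distance at the mean
assert conjugate(0.5) > 0               # positive away from the mean
assert conjugate(1.0) > conjugate(0.5)  # grows with the size of the deviation
```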

Cramér’s Theorem

Chernoff’s bound is furthermore exponentially tight as it correctly identifies the exact exponential rate with which the probability of the event diminishes to zero. This last surprising fact is codified in Cramér’s theorem.

Cramér’s Theorem: Assume that the distribution $q$ satisfies $0 \in \operatorname{int}(\operatorname{dom} \Lambda)$. Then the probability that the empirical average $\bar x_n$ of $x_1, \dots, x_n$ realizes in an open set $U$ diminishes with exponential rate

$$\lim_{n \to \infty} \frac{1}{n} \log \mathbb{P}(\bar x_n \in U) = -\inf_{u \in U} \Lambda^*(u). \tag{2}$$

The lower bound in Cramér’s theorem shows that the Chernoff inequality accurately quantifies the probability of extreme events taking place as the number of samples $n$ tends toward infinity. Notice that for Cramér’s theorem convexity of the set $U$ is not a requirement. Note that as $\mathbb{E}_q[e^{\langle 0, x \rangle}] = 1$ by definition of $q$ being a probability distribution, it follows that $\Lambda(0) = 0$. The condition $0 \in \operatorname{int}(\operatorname{dom} \Lambda)$ requires the distribution $q$ to be light-tailed: the tails of the distribution must diminish at an exponential rate.