I was reading the Wikipedia article on the Jeffreys prior and saw that, after each example, it describes how a variance-stabilizing transformation turns the Jeffreys prior into a uniform prior.

As an example, it states that for a coin that comes up heads with probability $\gamma \in [0,1]$, the Jeffreys prior for the parameter $\gamma$ in the Bernoulli trial model is:

$$
p(\gamma) \propto \frac{1}{\sqrt{\gamma ( 1-\gamma)}}
$$

It then states that this is a beta distribution with $\alpha = \beta = \frac{1}{2}$. It also states that if $\gamma = \sin^2(\theta)$, then the Jeffreys prior for $\theta$ is uniform in the interval $\left[0, \frac{\pi}{2}\right]$.
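To make the claim concrete, here is a quick numerical check I put together (a sketch in Python, not from the article): substituting $\gamma = \sin^2(\theta)$ turns the Jeffreys density into a constant, and the integral comes out to $\pi = B(1/2, 1/2)$, the Beta$(1/2, 1/2)$ normalizing constant.

```python
import math

def jeffreys_unnorm(g):
    # Unnormalized Jeffreys prior for the Bernoulli parameter gamma
    return 1.0 / math.sqrt(g * (1.0 - g))

# Integrate over (0, 1) via the substitution gamma = sin^2(theta),
# d(gamma) = 2 sin(theta) cos(theta) d(theta): the integrand collapses
# to the constant 2, so the integral is 2 * (pi/2) = pi = B(1/2, 1/2).
n = 100_000
h = (math.pi / 2) / n
total = 0.0
for i in range(n):
    theta = (i + 0.5) * h           # midpoint rule on (0, pi/2)
    g = math.sin(theta) ** 2
    dg = 2 * math.sin(theta) * math.cos(theta)
    total += jeffreys_unnorm(g) * dg * h

print(total)                        # ~3.141593, i.e. pi
```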

I recognize the transformation as that of a variance-stabilizing transformation. What confuses me is:

Why would a variance-stabilizing transformation result in a uniform prior?

Why would we even want a uniform prior? (since it seems it may be more susceptible to being improper)

In general, I'm not quite sure why the squared-sine transformation is given and what role it plays. Does anyone have any ideas?

2 Answers

The Jeffreys prior is invariant under reparametrization. For that reason, many Bayesians consider it to be a “non-informative prior”. (Hartigan showed that there is a whole space of such priors $J^\alpha H^\beta$ for $\alpha + \beta=1$ where $J$ is Jeffreys' prior and $H$ is Hartigan's asymptotically locally invariant prior. — Invariant Prior Distributions)

It is an often-repeated falsehood that the uniform prior is non-informative. A uniform prior on your parameters and a uniform prior after an arbitrary transformation of those parameters mean completely different things. If an arbitrary change of parametrization affects your prior, then your prior is clearly informative.

From a mathematical standpoint, using the Jeffreys prior and using a flat prior after applying the variance-stabilizing transformation are, by definition, equivalent. From a human standpoint, the latter is probably nicer because the parameter space becomes "homogeneous" in the sense that a given difference means the same thing everywhere in the parameter space, no matter where you are.

Consider your Bernoulli example. Isn't it a little bit weird that scoring 99% on a test is the same distance from 90% as 59% is from 50%? After the variance-stabilizing transformation the former pair is more separated, as it should be. This matches our intuition about actual distances in the space. (Mathematically, the variance-stabilizing transformation makes the curvature of the log-loss equal to the identity matrix.)
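To put numbers on this (my own sketch, not from the answer; the variance-stabilizing transformation here is $\theta = \arcsin\sqrt{\gamma}$, the inverse of $\gamma = \sin^2\theta$):

```python
import math

def vst(p):
    # Variance-stabilizing transformation for the Bernoulli parameter:
    # theta = arcsin(sqrt(p)), the inverse of p = sin^2(theta)
    return math.asin(math.sqrt(p))

# In the raw parametrization both pairs are 9 percentage points apart,
# but after the transformation the pair near the boundary is farther apart.
d_high = vst(0.99) - vst(0.90)   # ~0.2216
d_mid = vst(0.59) - vst(0.50)    # ~0.0905
print(d_high > d_mid)            # True
```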

1. I agree that a uniform prior does not mean a "non-informative" prior, but my point about not valuing one value over another still holds (under that particular parameterization). 2. The properness of a prior is a real concern. If you have an improper prior and data, you are not guaranteed a proper posterior.
– Greenparker, May 10 '16 at 21:15

1. But that's the whole point: the parametrization is arbitrary, so it is meaningless to say that you're not valuing one value over another. 2. In practice, I've never found it concerning. It might be concerning to other people, I guess.
– Neil G, May 10 '16 at 21:17

1. Fair point. 2. I am not sure what problems you deal with, but even the simple Gaussian likelihood with a Jeffreys prior can have an improper posterior. See my answer here.
– Greenparker, May 10 '16 at 21:19

I don't think the edit is correct. If the posterior is improper, then MCMC is certainly nonsensical, since you are trying to draw from an undefined distribution. Imagine trying to sample from a Uniform$(0,\infty)$ with any sampling scheme. The MCMC algorithm might still be ergodic (when you have null recurrence), but your samples will be useless.
– Greenparker, May 10 '16 at 21:35

The Wikipedia page that you linked does not actually use the term "variance-stabilizing transformation". The term generally refers to transformations that make the variance of the random variable constant. Although in the Bernoulli case that is indeed what the transformation does, it is not exactly the goal here: the goal is to get a uniform distribution, not merely a stabilized variance.

Recall that one of the main reasons for using the Jeffreys prior is that it is invariant under reparametrization: if you re-parameterize the variable and transform the prior by the usual change-of-variables rule, you get exactly the prior that Jeffreys' rule would produce directly in the new parametrization.

1.

The Jeffreys prior in this Bernoulli case, as you pointed out, is a Beta$(1/2, 1/2)$:
$$p_{\gamma}(\gamma) \propto \dfrac{1}{\sqrt{\gamma(1-\gamma)}}.$$

Now apply the change of variables $\gamma = \sin^2(\theta)$, so that $d\gamma/d\theta = 2\sin(\theta)\cos(\theta)$:
$$p_{\theta}(\theta) = p_{\gamma}\!\left(\sin^2(\theta)\right) \left|\dfrac{d\gamma}{d\theta}\right| \propto \dfrac{2\sin(\theta)\cos(\theta)}{\sqrt{\sin^2(\theta)\cos^2(\theta)}} = 2.$$

Thus $\theta$ has the uniform distribution on $(0, \pi/2)$. This is why the $\sin^2(\theta)$ transformation is used: the re-parametrization leads to a uniform distribution. The uniform distribution is now the Jeffreys prior on $\theta$ (since the Jeffreys prior is invariant under reparametrization). This answers your first question.
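One can also check this directly from the definition of the Jeffreys prior: compute the Fisher information of a single Bernoulli trial in the $\theta$-parametrization; it comes out constant, so the Jeffreys prior $\propto \sqrt{I(\theta)}$ is flat. A quick numerical sketch (mine, not from the answer):

```python
import math

def fisher_info_theta(theta):
    # Fisher information of one Bernoulli trial parametrized by theta,
    # where gamma = sin^2(theta). The score of a single observation x is
    #   d/dtheta [x log sin^2(theta) + (1-x) log cos^2(theta)]
    #     = 2*x/tan(theta) - 2*(1-x)*tan(theta),
    # and I(theta) = E[score^2] under P(x=1) = sin^2(theta).
    g = math.sin(theta) ** 2
    score1 = 2.0 / math.tan(theta)    # score when x = 1
    score0 = -2.0 * math.tan(theta)   # score when x = 0
    return g * score1 ** 2 + (1 - g) * score0 ** 2

# Constant (= 4) everywhere, so sqrt(I(theta)) is flat on (0, pi/2)
for theta in (0.2, 0.5, 0.8, 1.2):
    print(round(fisher_info_theta(theta), 10))   # 4.0 every time
```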

2.

Often in Bayesian analysis one wants a uniform prior when there is not enough information or prior knowledge about the distribution of the parameter. Such a prior is also called a "diffuse prior" or "default prior". The idea is not to commit to any value in the parameter space more than to other values. In that case the posterior depends entirely on the data likelihood, since with a flat prior $f(\theta) \propto 1$,
$$q(\theta|x) \propto f(x|\theta) f(\theta) \propto f(x|\theta).$$

If the transformation is such that the transformed space is bounded (like $(0, \pi/2)$ in this example), then the uniform distribution will be proper. If the transformed space is unbounded, the uniform prior will be improper, but often the resulting posterior will still be proper; one should always verify that this is the case.
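To illustrate the last point, here is a sketch (my own, with made-up data, not from the answer) of the classic case of a Normal likelihood with known variance and a flat, improper prior on the mean over the whole real line: the unnormalized posterior still integrates to a finite constant, so the posterior is proper.

```python
import math

# Flat (improper) prior on the mean mu of a Normal(mu, sigma^2) likelihood
# with known sigma: the unnormalized posterior equals the likelihood, which
# integrates to a finite constant, so the posterior is proper.
data = [1.2, 0.7, 1.9, 1.4, 0.8]   # illustrative data
sigma = 1.0
n = len(data)
xbar = sum(data) / n

def unnorm_posterior(mu):
    return math.exp(-sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2))

# Riemann sum over a wide grid stands in for the integral over (-inf, inf);
# the likelihood decays fast enough that the truncation error is negligible.
lo, hi, m = xbar - 10, xbar + 10, 200_000
h = (hi - lo) / m
z = sum(unnorm_posterior(lo + i * h) for i in range(m + 1)) * h
print(math.isfinite(z) and z > 0)   # True: the posterior normalizes

# Analytically the posterior is Normal(xbar, sigma^2/n); check the mean
mean = sum((lo + i * h) * unnorm_posterior(lo + i * h)
           for i in range(m + 1)) * h / z
print(abs(mean - xbar) < 1e-6)      # True
```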

This idea that you are "not committing to any value" by using a diffuse prior is wrong. The proof is that you can take any transformation of the space and the diffuse prior will mean something completely different.
– Neil G, May 10 '16 at 20:48

My comment on "not committing to any value" refers only to that particular parameterization. Of course, transformations will change how the mass is distributed (just as in this Bernoulli example).
– Greenparker, May 10 '16 at 21:17

Like I said below your other comment, the parametrization is arbitrary, which is why the statement "not committing to any value" is meaningless.
– Neil G, May 10 '16 at 21:18