Andrew then went on to say: “You should be able to do the same thing if you have information on a nonlinear function of parameters too, but then you need to fix the Jacobian, or maybe there’s some way to do this in Stan.”

I disagreed with this idea of “fixing the Jacobian”. Deep in the comments I discussed how this works and how to understand it versus how to deal with “Jacobian corrections” when describing priors in terms of probability measures on a target space. Whether you need a Jacobian is determined by the role the information plays: whether you have absolute knowledge of a probability measure, or relative knowledge of how much an underlying probability density should be modified (i.e., “masked”, if you’re familiar with using masks in Photoshop). I thought I’d post it here so I can refer to it easily:

The first thing you have to understand is what a Jacobian correction is for.

Essentially, a Jacobian “correction” allows you to sample in one space A in such a way that you induce a particular known density on B, where B is a known function of A.

If B = F(A) is a one-to-one and onto (invertible) mapping and you know what density you want B to have, pb(B), then there is only one density you can define for A that will cause A to be sampled correctly so that B has the right density. Most people might guess that since B = F(A) the density should be pb(F(A))… but instead…

To figure this out, we can use nonstandard analysis, in which dA and dB are infinitesimal numbers, and we can do algebra on them. We will do algebra on *probability values*. Every density must be multiplied by an infinitesimal increment of the variable over which the density is defined in order to keep the “units” of probability (densities are probability per unit something).

We want to define a density pa(A) such that any small set of width dA around any given point A* has total probability pb(F(A*)) abs(dB*), where dB* is the width of the image of that set under F.

That is, we have the infinitesimal equation:

pa(A) dA = pb(F(A)) abs(dB)

Solving for pa gives pa(A) = pb(F(A)) abs(dB/dA).

If we said pa(A) = pb(F(A)) we’d be wrong by a factor involving the derivative dB/dA = dF/dA evaluated at A, which is itself a function of A. The absolute value ensures that everything remains positive.

abs(dB/dA) is the Jacobian “correction” to the pb(F(A)) we derived naively at first.
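To make this concrete, here is a small numerical check (a sketch in Python; the choices of pb and F are my own illustrations, not from the discussion above). We take B = F(A) = exp(A) and ask for B to have an Exponential(1) density pb(b) = exp(-b). The density with the Jacobian factor integrates to 1 over A, while the naive pb(F(A)) does not:

```python
import math

# Sketch (illustrative densities): we want B = F(A) = exp(A) to end up
# with an Exponential(1) density pb(b) = exp(-b).  The induced density
# on A is pa(A) = pb(F(A)) * abs(dB/dA) = exp(-exp(A)) * exp(A).

def pb(b):
    return math.exp(-b)

def F(a):
    return math.exp(a)

def dFda(a):
    return math.exp(a)   # dB/dA

def pa_with_jacobian(a):
    return pb(F(a)) * abs(dFda(a))

def pa_naive(a):
    return pb(F(a))      # missing the Jacobian factor

def integrate(f, lo=-20.0, hi=5.0, n=100000):
    # trapezoid rule over a range wide enough to capture all the mass
    h = (hi - lo) / n
    s = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        s += f(lo + i * h)
    return s * h

I_correct = integrate(pa_with_jacobian)
I_naive = integrate(pa_naive)
print(I_correct)  # ~1.0: a proper density on A
print(I_naive)    # much larger than 1: not a density at all
```

Sampling A from pa_with_jacobian and pushing the draws through F would then give B with the desired exponential density.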

—————

So, the applicability is when:

1) you know the density in one space
2) you want to sample in a different space
3) there is a straightforward transformation between the two spaces

In Andrew’s example, this isn’t the case. We are trying to decide on a probability density *in the space where we’re sampling*, A, and we’re not basing it on a known probability density in another space. Instead we’re basing it on:

1) Information we have about the plausibility of values of A based on examination of the value of A itself: p(A)

2) Information about the relative plausibility of a given A value after we calculate some function of A… as I said, this is a kind of “mask”: pi(F(A))

Now we’re defining the full density P(A) in terms of some “base” density, little p, p(A), multiplying it by a masking function pi(F(A)), and then dividing by Z, where Z is a normalization factor. So the density on space A is *defined* as P(A) = p(A) pi(F(A)) / Z.

Notice that if p(A) is a properly normalized density, then pi(F(A)) K is a perfectly good mask function for every value of K, because the K is eliminated by the normalization constant Z, which changes with K. In other words, pi tells us only *relatively* how “good” a given A value is in terms of its F(A) value. It need not tell us any “absolute” goodness quantity.
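The K-invariance claim is easy to check numerically. Here is a sketch (the base density, the function F, and the mask are my own illustrative choices): scaling the mask by any constant K leaves the normalized density unchanged, because Z scales by K too.

```python
import math

# Sketch: P(a) = p(a) * pi(F(a)) / Z on a grid, with the mask scaled by K.

def p(a):                 # "base" density: standard normal (illustrative)
    return math.exp(-a * a / 2) / math.sqrt(2 * math.pi)

def F(a):                 # some nonlinear function of the parameter
    return a * a

def mask(x):              # downweights large F(a); NOT a density in x
    return math.exp(-x)

grid = [i * 0.01 - 10 for i in range(2001)]   # a in [-10, 10]

def normalized(K):
    w = [p(a) * K * mask(F(a)) for a in grid]
    Z = sum(w) * 0.01     # grid approximation to the normalizing constant
    return [v / Z for v in w]

P1 = normalized(1.0)
P2 = normalized(123.4)
max_diff = max(abs(x - y) for x, y in zip(P1, P2))
print(max_diff)  # ~0: the constant K cancels
```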

Should “how much we like A” depend on how many different possible A values converge to the region around F(A)? I think this seems wrong. If you have familiarity with Photoshop, think of it like this: you want to mask away a “green screen”. Should whether you mask a given pixel depend on “how green that pixel is” or on “how many total green pixels there are”? The *first* notion is the masking notion I’m talking about; it’s local information about what is going on in the vicinity of A. The second notion is the probability notion: how many total spatial locations get mapped to “green” — that’s “probability of being green”.

For example, pi(F(A)) = 1 is a perfectly good mask; it says “all values of F(A) are equally good” (it doesn’t matter what color the pixel is). Clearly pi is not a probability density, since you can’t normalize that. You could also say “A is totally implausible if it results in negative F(A)” (if the pixel is green, don’t include it), so pi(F(A)) = 1 for all F(A) >= 0 and 0 otherwise is a good mask function as well. It’s also not normalizable in general.
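A quick sketch of that last point (with p = standard normal and F(a) = a as illustrative choices): the indicator mask is not normalizable as a function of F(a), yet p(a) pi(F(a)) / Z is still a perfectly good density on a — here a half-normal, with Z coming out to 0.5.

```python
import math

# Sketch: base density p = standard normal, mask = indicator of F(a) >= 0,
# with F(a) = a for simplicity.  Z = integral of p(a) * mask(a) ~ 0.5, so
# P(a) = p(a) * mask(a) / Z is a normalized (half-normal) density.

def p(a):
    return math.exp(-a * a / 2) / math.sqrt(2 * math.pi)

def mask(x):
    return 1.0 if x >= 0 else 0.0

h = 0.001
grid = [i * h - 10 for i in range(20001)]   # a in [-10, 10]
Z = sum(p(a) * mask(a) for a in grid) * h
print(Z)  # ~0.5
```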

If you start including “Jacobian corrections” in your mask function, then your mask function isn’t telling you information like “mask this based on how green it is”; it’s telling you some mixed-up information instead, involving “how rapidly varying the greenness measurement is in the vicinity of this level of greenness”. That isn’t the nature of the information you have, so you shouldn’t blindly assume that because you’re doing a nonlinear transform of a parameter, you’re obligated to start taking derivatives and calculating Jacobians.

This is very interesting, and a great way to design complex informative priors when the parameters are not independent.

This seems so easy that I had to convince myself with an example, where I set:
– a / (sqrt(2)/2) ~ N(1, 0.5)
– b / (sqrt(2)/2) ~ N(1, 0.5)
– sqrt(a^2 + b^2) ~ N(1, 0.1)
Basically, in (a, b) space, the first two priors are centered around (r, phi) = (1, pi/4) in polar coordinates, and the last prior acts as a mask to keep a ring whose radius is normally distributed around 1.

FYI, it works, but I was wondering: how would you even adjust for the Jacobian in this context?
Like, if you consider the transformation (a, b) -> (a, sqrt(a^2 + b^2)) (we can assume a and b are positive for simplicity), you would do the change of variables for the first and third priors, but you would still need to define a prior for b.
Alternatively, if you go to polar coordinates with a = r * cos(phi) and b = r * sin(phi), you could define the third prior for r, but you would have trouble defining the first and second priors, unless you define something like tan(phi) ~ N(1, 0.5) / N(1, 0.5).
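The construction in that example can be sanity-checked without any sampler at all. Here is a sketch (grid evaluation in Python rather than Stan; the grid bounds and step are my own choices): evaluate the unnormalized masked prior P(a, b) ∝ N(a/c | 1, 0.5) N(b/c | 1, 0.5) N(sqrt(a^2 + b^2) | 1, 0.1), with c = sqrt(2)/2 and no Jacobian term anywhere, and confirm that the mass concentrates on a ring of radius about 1:

```python
import math

# Sketch: base prior on (a, b) times a ring-shaped mask on the radius.
# The mean radius under the masked prior should come out close to 1.

def npdf(x, mu, sd):
    return math.exp(-((x - mu) / sd) ** 2 / 2) / (sd * math.sqrt(2 * math.pi))

c = math.sqrt(2) / 2
h = 0.02
pts = [i * h - 2 for i in range(201)]   # a, b in [-2, 2]

total = 0.0
sum_r = 0.0
for a in pts:
    for b in pts:
        r = math.hypot(a, b)
        # base prior factors times the radius mask -- no Jacobian anywhere
        w = npdf(a / c, 1, 0.5) * npdf(b / c, 1, 0.5) * npdf(r, 1, 0.1)
        total += w
        sum_r += w * r

mean_r = sum_r / total
print(mean_r)  # close to 1: the radius mask dominates
```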