Nomenclatural abomination

The terminology used throughout this document enormously overloads the symbol p(): in each line of this discussion, the function p() means something different, and its meaning is set by the letters used in its arguments. That is a nomenclatural abomination. I apologize, and encourage my readers to do things that aren’t so ambiguous (like maybe add informative subscripts), but it is so standard in our business that I won’t change (for now).

I found this terribly confusing when I started doing statistics. The meaning is not explicit in the notation but implicit in the conventions surrounding its use, conventions that were foreign to me since I was trained in mathematics and came to statistics later. When I would use letters like f and g for functions, collaborators would say “I don’t know what you’re talking about.” Neither did I understand what they were talking about, since they used one letter for everything.

I also find myself apologizing to students (and cursing the statistics community for not cleaning this up) for having

- N(mu, sigma^2) in books
- dnorm(mean, sigma) in R
- dnorm(mean, precision) in JAGS and WinBUGS
- back to normal(mean, sigma) in Stan (but the BDA3 book also uses N(mu, sigma^2)), so that now JAGS, WinBUGS, and Stan are inconsistent

Especially for beginning students, the fact that the lecture notes use N(mu, sigma^2) but in R one writes dnorm(mean, sigma) causes endless confusion. And then one starts Bayesian modeling, and things rapidly descend into a mess.
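The parameterization mismatch is easy to demonstrate in code. Below is a minimal Python sketch (the function names dnorm_sd, dnorm_var, and dnorm_prec are made up for illustration) showing that the book’s N(mu, sigma^2), R’s standard-deviation scale, and the JAGS/WinBUGS precision scale all describe the same density once the second parameter is converted:

```python
import math

def dnorm_sd(x, mean, sd):
    # Parameterized by standard deviation, as in R's dnorm(x, mean, sd).
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def dnorm_var(x, mean, var):
    # Parameterized by variance, as in the textbook N(mu, sigma^2).
    return dnorm_sd(x, mean, math.sqrt(var))

def dnorm_prec(x, mean, tau):
    # Parameterized by precision tau = 1/variance, as in JAGS and WinBUGS.
    return dnorm_sd(x, mean, 1.0 / math.sqrt(tau))

# Same density under all three conventions: sd = 2, var = 4, tau = 0.25.
assert abs(dnorm_sd(1.0, 0.0, 2.0) - dnorm_var(1.0, 0.0, 4.0)) < 1e-12
assert abs(dnorm_sd(1.0, 0.0, 2.0) - dnorm_prec(1.0, 0.0, 0.25)) < 1e-12
```

A beginner who passes a variance where R expects a standard deviation gets a silently wrong density, which is exactly the endless confusion described above.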

I’m so glad someone is raising his voice about this… It’s also the case in thermodynamics, where you can see something like F(P,V) and later F(U,T), where F is the same physical quantity (free energy) but a different mathematical function. The worst I’ve ever seen, in a published paper on Bayesian inference, is this: “Let’s define the likelihood as: p(theta|x) = p(x|theta)”. I stared at that for 10 minutes until I fell off my chair. (The LHS p here is not a probability distribution over its first argument, theta; I wonder how the author would write the “posterior” probability of theta given x.) I agree it’s not exactly the same problem (it’s worse).

As Michael Collins pointed out in the comments and as I’ve subsequently seen in practice, the probability theorists follow a convention that satisfies notational purists.

A joint probability function (density, mass, or mixed) over random variables X_1,…,X_n is written p_{X_1,…,X_n}. A conditional probability function of random variables X_1,…,X_n given random variables Y_1,…,Y_m is written p_{X_1,…,X_n|Y_1,…,Y_m}. Now you can supply any arguments you want without confusion.

For example, if X and Y are random variables, the first step of deriving Bayes’s rule for X and Y is unambiguously written as

p_{X|Y}(a|b) = p_{Y|X}(b|a) * p_{X}(a) / p_{Y}(b)

even if you use x for a and y for b.
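The subscripted notation translates directly into code, where every distribution must get its own name and nothing can be overloaded. Here is a toy Python sketch of exactly the Bayes’s rule step above (the binary variables and the probability values are invented for illustration):

```python
# Discrete toy example: X is a disease indicator, Y a test result,
# both in {0, 1}. The function names carry the subscripts explicitly.

def p_X(a):
    # Prior P(X = a): 1% prevalence (illustrative number).
    return {0: 0.99, 1: 0.01}[a]

def p_Y_given_X(b, a):
    # Likelihood P(Y = b | X = a): 95% accurate test (illustrative).
    table = {(1, 1): 0.95, (0, 1): 0.05, (1, 0): 0.05, (0, 0): 0.95}
    return table[(b, a)]

def p_Y(b):
    # Marginal P(Y = b), by the law of total probability.
    return sum(p_Y_given_X(b, a) * p_X(a) for a in (0, 1))

def p_X_given_Y(a, b):
    # Posterior P(X = a | Y = b) via Bayes's rule, mirroring
    # p_{X|Y}(a|b) = p_{Y|X}(b|a) * p_X(a) / p_Y(b).
    return p_Y_given_X(b, a) * p_X(a) / p_Y(b)
```

Because the functions are named p_X, p_Y_given_X, and so on, the call p_X_given_Y(1, 1) is unambiguous in a way that a bare p(x|y) is not.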

This also clears up event notation for probabilities. So we can define the cumulative distribution function for random variable X as F_X(x) =def= Pr[X ≤ x], and nothing gets confused (unless you're on a board or piece of paper or have poor eyesight and are working with a sans-serif font). It also explains why random variables are written in capitals — to distinguish them from plain old variables.
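As a sketch of that event notation, here is the CDF of a fair six-sided die in Python (the die is my example, not from the text). F_X takes the plain variable x as its argument, while the capital X in its name identifies the random variable:

```python
from fractions import Fraction

def F_X(x):
    # F_X(x) = Pr[X <= x], where X is a fair six-sided die roll.
    # Sum the probability mass 1/6 over every outcome k with k <= x.
    return sum(Fraction(1, 6) for k in range(1, 7) if k <= x)

assert F_X(0) == 0              # no outcome is <= 0
assert F_X(3) == Fraction(1, 2) # outcomes 1, 2, 3
assert F_X(6) == 1              # every outcome is <= 6
```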

In applied statistics, it's rather tedious to type all those subscripts, so people tend to use x and y for variables ranging over random variables X and Y, so that p(x|y) is implicitly taken to mean p_{X|Y}(x|y).

Things get even more confusing for learners when you use the convention of Gelman et al.'s Bayesian Data Analysis (as we do with Stan) and simultaneously drop the random-variable subscripts on p and use the same notation x for a random variable and a plain old variable. (Gelman argues that the capitalization convention is problematic for Greek letters like Sigma and for matrices like M, which are conventionally written in capitals anyway.) I’ve gotten used to this convention in practice, but we sometimes have to clarify which random variables we’re talking about (as when defining cumulative distribution functions).

I wrote a little about the broader issue here with a specific example of functions vs independent variables in coordinate geometry.

As I say there: “It’s interesting how many notational shortcuts mathematicians take would never be tolerated by a programmer.”

Gerry Sussman writes in his book Functional Differential Geometry:

It is surprisingly easy to get the right answer with unclear and informal symbol manipulation. To address this problem we use computer programs to communicate a precise understanding of the computations… Expressing the methods … in a computer language forces them to be unambiguous and computationally effective.

James: It’s a little odd that Functional Differential Geometry uses weakly typed code. I suppose you can’t do everything at once, but it would have been interesting to write such a book in Haskell rather than Lisp, so that the code had a type system that mirrored the geometry.

How about programming languages, where you can use any symbol you want, as long as it appeared on the IBM Selectric typewriter? Instead of using ←, →, ≤, ≠, ¬, ∅, ⇒, ∨, ∧, or ⊗, we use =, ->, <=, !=, !, {}, =>, ||, &&, and ^. This continuing re-use of typewriter characters leads to context-sensitive interpretation, rules for multi-character lexing, and so on.
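The re-use of a single glyph for unrelated operations is easy to see with ^: in C, Java, and Python it is bitwise XOR, while in R and much mathematical writing the same character means exponentiation. A quick Python check:

```python
# In Python (as in C and Java), ^ is bitwise XOR, not a power operator.
assert 2 ^ 3 == 1    # XOR of binary 10 and 11 is 01
# Exponentiation needs a multi-character lexeme instead: **.
assert 2 ** 3 == 8
```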