What’s Wrong with Probability Notation?

[Update 21 October 2009: Check out Michael Collins’s comment and my reply. Michael points out that introductions to probability theory carefully subscript probability functions with their random variables and distinguish random variables from regular variables by capitalizing the former. I replied that it’s really the practice that’s problematic, and I’m talking about the notation in Gelman et al.’s Bayesian Data Analysis or Blei, Ng and Jordan’s LDA paper.]

What’s Wrong?

What’s wrong with the probability notation used in Bayesian stats papers? The triple whammy of

overloading $p()$ for every probability function,

using bound variables named after random variables, and

using the bound variable names to distinguish probability functions.

Probability Notation is Bad

The first and third issues arise explicitly and the second implicitly in the usual expression of the first step of Bayes’s rule,

$p(x|y) = p(y|x) \, p(x) / p(y)$,

where each of the four uses of $p()$ corresponds to a different probability function! In computer science, we’re used to using names to distinguish functions. So $f(x)$ and $f(y)$ are the same function $f$ applied to different arguments. In probability notation, $p(x)$ and $p(y)$ are different probability functions, picked out by their arguments.
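To see how odd this overloading looks to a programmer, here is a minimal Python sketch of a single function `p` that behaves the way the notation does: which mass function you get depends on the names of its arguments. Nothing here comes from a library; the table `JOINT`, the tuple `NAMES`, and the `given` keyword are all made up for illustration.

```python
from fractions import Fraction

# Hypothetical joint mass function; each key is a cell (x, y).
JOINT = {
    (0, "a"): Fraction(1, 8), (0, "b"): Fraction(3, 8),
    (1, "a"): Fraction(1, 4), (1, "b"): Fraction(1, 4),
}
NAMES = ("x", "y")  # position of each variable name within a cell


def _total(fixed):
    """Sum the joint over every cell that agrees with the name=value pairs in `fixed`."""
    return sum(
        mass
        for cell, mass in JOINT.items()
        if all(cell[NAMES.index(name)] == value for name, value in fixed.items())
    )


def p(given=None, **query):
    """One overloaded 'p': the argument names select which function is meant."""
    given = given or {}
    return _total({**query, **given}) / _total(given)


# Four different functions, all spelled p, as in the first step of Bayes's rule.
bayes_lhs = p(x=1, given={"y": "a"})
bayes_rhs = p(y="a", given={"x": 1}) * p(x=1) / p(y="a")
```

Calling `p(x=1)` and `p(y="a")` reads like the paper notation, but the dispatch on argument names is exactly what a language with ordinary named functions would not let you get away with.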

Random Variables Don’t Help

As Michael (and others) pointed out in the comments, if these are densities determined by random variables, we use the capitalized random variables to distinguish the distributions and bound variables in the usual mathematical sense, disambiguating our random variables with

$p_{X|Y}(x|y) = p_{Y|X}(y|x) \, p_X(x) / p_Y(y)$.

When we have dozens of parameters in our multivariate densities, this gets messy really quickly. So practitioners fall back to the unsubscripted notation.

Great Expectations

The third issue appears in expectation notation (and in information theory notation for entropy, mutual information, etc.). Here, statisticians write $X$ and $Y$ for random variables with the probability function and sample space implicit. The way you then see expectation notation written in applied Bayesian modeling is often:

$E[x] = \sum_x x \, p(x)$.

What’s really sneaky here is the use of $x$ as a global random variable on the left side of the equation and as a bound variable on the right hand side. Distinguishing random variables with capital letters, this would look like

$E[X] = \sum_x x \, p_X(x)$.

Continuous vs. Discrete

The definitions are even more overloaded than they first appear, because of the different definitions for continuous and discrete probabilities.

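In Bayes’s rule, if $p(x)$ is continuous in $x$, we’re meant to understand integration

$p(y) = \int p(y|x) \, p(x) \, dx$,

in place of summation

$p(y) = \sum_x p(y|x) \, p(x)$.

Similarly, for expectations of continuous densities, we write

$E[x] = \int x \, p(x) \, dx$,

or if we’re being very careful with random variables,

$E[X] = \int x \, p_X(x) \, dx$.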
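In code the two readings can’t share a definition, which makes the overloading visible. The following is a rough sketch only: the function names, the fair-die mass function, and the density $2x$ on $[0, 1]$ are assumptions chosen for illustration, and the integral is approximated with a plain midpoint rule rather than anything more careful.

```python
from fractions import Fraction


def expectation_discrete(pmf):
    """E[X] as a sum over the support of a mass function given as {value: mass}."""
    return sum(x * mass for x, mass in pmf.items())


def expectation_continuous(pdf, lo, hi, steps=10_000):
    """E[X] as a midpoint-rule approximation to the integral of x * pdf(x) over [lo, hi]."""
    width = (hi - lo) / steps
    midpoints = (lo + (i + 0.5) * width for i in range(steps))
    return sum(x * pdf(x) * width for x in midpoints)


# Hypothetical examples: a fair six-sided die and the density 2x on [0, 1].
die = {face: Fraction(1, 6) for face in range(1, 7)}


def triangle(x):
    return 2.0 * x


mean_die = expectation_discrete(die)
mean_triangle = expectation_continuous(triangle, 0.0, 1.0)
```

The same symbol $E$ covers both functions on paper; here they need different names, different argument types, and (for the integral) explicit bounds.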

Intros to probability theory often use $f$ for continuous probability density functions and reserve $p$ for discrete probability mass functions. They’ll start with $\mbox{Pr}$ notation (or $\mathbb{P}$) for the event probability function.

Sample Spaces to the Rescue?

In applied work, we rarely, if ever, need to talk about the sample space $\Omega$, measures from subsets of $\Omega$ to $[0,1]$, or actually consider a random variable $X$ as a function from $\Omega$ to $\mathbb{R}$. I don’t recall ever seeing a model defined in this way.

Instead, we typically construct multivariate densities modularly by combining simpler distributions. For instance, a hierarchical beta-binomial model, such as the one I used for the post about Bayesian batting averages, would typically be expressed as:

$p(y, \theta | n, \alpha, \beta) = \prod_j \mbox{Binomial}(y_j | n_j, \theta_j) \, \mbox{Beta}(\theta_j | \alpha, \beta)$,

or in sampling notation by stating $y_j \sim \mbox{Binomial}(n_j, \theta_j)$ and $\theta_j \sim \mbox{Beta}(\alpha, \beta)$. In fact, that’s just how it gets coded in the model interpreter BUGS (Bayesian inference using Gibbs sampling) and compiler HBC (hierarchical Bayes compiler).
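Sampling notation reads almost like a program, which is part of its appeal. Here is a small forward-simulation sketch in plain Python (not BUGS or HBC code). The symbol names are assumptions: `n[j]` trials for unit j, a chance of success `theta_j` drawn from a beta distribution with parameters `alpha` and `beta`, and `y_j` successes drawn from a binomial.

```python
import random


def simulate_beta_binomial(n, alpha, beta, seed=0):
    """Draw (thetas, ys) from a hierarchical beta-binomial model by forward simulation."""
    rng = random.Random(seed)
    thetas, ys = [], []
    for n_j in n:
        # theta_j ~ Beta(alpha, beta)
        theta_j = rng.betavariate(alpha, beta)
        # y_j ~ Binomial(n_j, theta_j), drawn as a sum of n_j Bernoulli(theta_j) trials
        y_j = sum(rng.random() < theta_j for _ in range(n_j))
        thetas.append(theta_j)
        ys.append(y_j)
    return thetas, ys


# Hypothetical trial counts for three units.
trials = [10, 50, 200]
thetas, ys = simulate_beta_binomial(trials, alpha=2.0, beta=5.0, seed=42)
```

Note that no sample space or random-variable machinery appears anywhere: the model is just the two sampling statements, executed in dependency order.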

Since we only really ever talk about probability density functions, why not get rid of random variable notation altogether? We could start with a joint density function $p$ over a vector $x = (x_1, \ldots, x_N)$, and consider projections of $x$ in lieu of random variables. This we can fully specify with high-school calculus.

If we use consistent naming for the dimensions, we can get away without ever formally defining a random variable. In practice, it’s not really that hard to keep our sampling distributions $p(y|\theta)$ separate from our posteriors $p(\theta|y)$, even if we write them both as $p$.

Technically, we could take $\mathbb{R}^N$ as the sample space and then define the random variables as projections. But this formal definition of random variables doesn’t buy us much other than a connection to the usual way of writing things in theory textbooks. If it makes you feel better, you can treat the normal ambiguous definitions this way; some of the lowercase letters are random variables, some are bound variables, and we just drop all the subscripts to pick out densities.

Lambda Calculus to the Rescue?

Maybe we can do better. We could express our models as joint densities, and borrow a ready-made language for talking about functions, the lambda calculus.

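For instance, we could define a discrete bivariate $p(x,y)$ for reference. We could then distinguish the marginals $p(x)$ and $p(y)$, using

$\lambda x. \, \sum_{y} p(x,y)$

instead of $p(x)$, and

$\lambda y. \, \sum_{x} p(x,y)$

instead of $p(y)$.

We could similarly distinguish the conditionals, writing

$\lambda x,y. \, p(x,y) / \sum_{x'} p(x',y)$

instead of $p(x|y)$, and

$\lambda y,x. \, p(x,y) / \sum_{y'} p(x,y')$

instead of $p(y|x)$.

Bayes’s rule now becomes:

$\lambda x,y. \, \frac{p(x,y)}{\sum_{x'} p(x',y)} = \lambda x,y. \, \frac{\left( \lambda y,x. \, \frac{p(x,y)}{\sum_{y'} p(x,y')} \right)(y,x) \ \left( \lambda x. \, \sum_{y'} p(x,y') \right)(x)}{\left( \lambda y. \, \sum_{x'} p(x',y) \right)(y)}$.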

Clear, no?
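For comparison, lambda terms for marginals and conditionals translate almost directly into Python. This is only a sketch: the joint mass function `joint` and the value lists `xs` and `ys` are a made-up example, and binding lambdas to names (which Python style guides discourage) is done here purely to mirror the notation.

```python
from fractions import Fraction

# Hypothetical discrete bivariate mass function over xs x ys; the masses sum to 1.
xs = [0, 1]
ys = [0, 1, 2]
joint = lambda x, y: Fraction(x + y + 1, 15)

# Marginals: sum out the other variable.
marg_x = lambda x: sum(joint(x, y2) for y2 in ys)
marg_y = lambda y: sum(joint(x2, y) for x2 in xs)

# Conditionals: joint divided by the marginal of the conditioning variable.
cond_x_given_y = lambda x, y: joint(x, y) / marg_y(y)
cond_y_given_x = lambda y, x: joint(x, y) / marg_x(x)

# Bayes's rule, checked pointwise with every function explicitly named.
bayes_holds = all(
    cond_x_given_y(x, y) == cond_y_given_x(y, x) * marg_x(x) / marg_y(y)
    for x in xs
    for y in ys
)
```

Every function now has its own name, so nothing is overloaded, but even with only two variables the bookkeeping is heavier than the three-symbol formula it replaces.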

Perhaps Not

OK, so maybe the statisticians are onto something here with their sloppy notation in practice.

All attempts to distinguish function names in stats only seem to make matters worse. This is especially problematic for Bayesian stats, where a fixed prior in one model becomes an estimated parameter in the next.

Conventionally, X is the r.v., and x is a value in the range space of X. So it makes literally no sense to write E[x]. And you have to really understand that X is a measurable function. If you do this, the notation is actually airtight.

The convention of dropping subscripts on individual densities, in most cases, aids understanding. Putting the subscripts in just adds to the alphabet soup.

A final point about why this came to pass: in measure-theoretic probability, the odds of every random variable are determinable (in principle) from a single measure, which may be a product measure, and it is conventionally called P. With this understanding, it’s always OK to have a single “P” for every event. I think this is another behind-the-scenes reason why people just use one “P”.

@John I’m particularly amused by one paper where I saw (you know who you are!).

@Mark I wish I’d gotten your comment a year ago. I actually prefer writing models out in sampling notation, ideally using something like BUGS to make sure they’re precisely specified enough to compile.

I’m trying to figure out how to write this all for an intro to Bayesian classifiers I’m writing. I started out with the traditional notion of random variables (sample space with prob distro plus mapping to values). But the combo of a shared sample space and distribution with functions to values determining the random variables’ joint distributions is kind of confusing to try to write down.

Andrew Gelman and Jennifer Hill described random variables succinctly in their applied regression book. There’s a giant urn, you pull a ball out, and it has the value of every random variable written on it. Of course, you have to be happy with uncountably many balls in the urn. But to make the continuous urn notion precise, I’d need to pull out Lebesgue integrals. (I’d really like to pull out the even more general Stieltjes integrals to do away with the awkward alternations between summations and integrals.)

Instead, I’m leaning toward just defining joint densities on a per-model basis and working from there. You don’t ever really need the sample space and I’m not going to be able to define general integration in an intro text.

I like the notation in the book — in general I think it’s a great book — and I think it avoids some of the pitfalls you’ve described in your blog.

(Ignore the later chapters on statistics though, particularly frequentist
statistics, where the notation becomes a little more controversial — this led to long discussions with the authors and my co-lecturer about notation once the move from probability to statistics is made).

I think writing $p(x)$, $p(y)$, etc. is sloppy, although we do it in research papers all the time. The book always uses $p_X(x)$, $p_Y(y)$, etc.

You would never write $E[x]$, always write $E[X]$ (assuming a convention that capitalization is used to denote random variables; otherwise be careful about what is a r.v. as opposed to the value of a r.v.). Actually, writing $E[x]$ leads to real notational problems I think, particularly when you get to conditional expectation. You don’t need subscripting on expectations, assuming that whenever you introduce a random variable you’ve been careful to specify its pmf/pdf.

[Michael then went on in a second post to add the following. – Bob]

The book is really great. The first chapter just goes over sample spaces, probability measures, events etc., with no mention of random variables. The second chapter introduces random variables — and that notation
is used throughout the book — in fact once r.v.s are introduced you rarely need to refer back to the underlying sample space (although it’s useful to have that in mind).

One thing I like about the book is that it’s basically a book about probability (although it covers statistics in later chapters) — as a consequence it gets the basics of probability and probability notation down very solidly, before even mentioning statistics, which I think is really important. The mathematical probabilists are very precise.

… let all the work be done by the r.v. — if you’ve taken care to define the set of random variables in the first place, then there will be no problems. is better because you really do want to be able to write things like — i.e., you do want to treat
this as a function. The / thing is a convention but is unnecessary at that point. The Greek letter thing may be an inconvenience, but simply using capital Greek letters may work I think.

Indeed, that’s how the notation is used to distinguish random variables $X$ from regular old bound variables $x$. But you typically only see this in careful discussions of probability theory or intro stats texts.

In practical modeling papers, where there are parameters, matrices, etc., the upper-case/lower-case thing gets difficult to maintain. And it’s so rare to see anything other than that it seems awfully pedantic to include all those subscripts.

The real kicker is that you almost never see random variables defined as maps . In fact, you never see the sample space even mentioned. Instead, you’ll see statements like “assume is a random variable distributed as .”

I have read way too many papers where, for instance, Normal distributions were happily applied over R+, or even over circular spaces. Consider, oh, “almost” all of the robotics navigation and mapping literature of the 90’s: positions x, y and orientations \theta of robots were put together in a single “pose” variable, and linear Gaussian models were applied directly on this 3D variable… :)

[…] One hurdle newcomers have to applied Bayesian work is understand the notation at work. Understanding that p(x) is not the same function as p(y). Typically these refer to the marginal density (or mass) function for x and y, respectively. Similarly p(x|y) is not the same function as p(y|x), but instead the first is the function describing the conditional density (or mass) function of x given y and the second is the conditional density (or mass) function of y given x. Attempts to rectify this notation seem to make the notation overly complicated and therefore, the differences are made implicit. For more discussion of this, please see this post. […]

Can somebody help me here? I started off trying to understand the concept of nested designs in ANOVA, then moved into trying to clarify what was meant by the terms variate, bivariate, etc. Then the term random variable as being a real valued function defined on a sample space (only dealing with discrete ones at the moment). I then tried to sort out the term independent variables from the term statistically independent random variables. All of the definitions I have seen define two or more independent random variables over the SAME sample space. Why could we not have different random variables defined over different sample spaces (discrete ones)? The more I think about probability theory the more confused I get. Something has got to be done about this to bring about some sort of simplicity and consistency in this very important area.
Hope there is someone out there who can help.

Once you understand the notation, it’s consistent. That’s not to say that some people don’t use it inconsistently. The problem for beginners is just how much the notation’s overloaded.

I’m afraid it doesn’t make sense to talk about the independence of random variables over different sample spaces. Keep in mind that it’s OK to have different outcomes in the same sample space; the sample space itself is abstract. For instance, if you have two discrete distros with outcomes {A,B} and {X,Y,Z}, then you can have a sample space with six points, {AX, AY, AZ, BX, BY, BZ}. The value of the first variable is A for samples AX, AY and AZ, and B for samples BX, BY and BZ. Similarly, the value of the second variable is X for AX and BX, and so on.

For instance, writing independence out in full, variables $X$ and $Y$ over a sample space $\Omega$ are independent if and only if $p_{X,Y}(x,y) = p_X(x) \, p_Y(y)$ for all $x, y$. The reason you need this to be over a single sample space is that you can’t define the joint distribution otherwise.

In practice, the sample space is rarely mentioned and random variables are defined by their distributions. It’s just assumed everything comes from the same sample space. Often you define $p_{X,Y}$ by first defining $p_X$ and $p_Y$ and then stating that they’re independent.
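Here is the six-point construction written out as a small Python sketch. The particular masses are invented for illustration, as are the names `first`, `second`, `omega`, `V1`, `V2` and `prob`. The point is that both variables are plain functions on one shared sample space, and independence is a checkable property of the measure on that space.

```python
from fractions import Fraction
from itertools import product

# Hypothetical marginal masses for the two discrete distros.
first = {"A": Fraction(1, 3), "B": Fraction(2, 3)}
second = {"X": Fraction(1, 2), "Y": Fraction(1, 4), "Z": Fraction(1, 4)}

# One shared sample space with six points; using the product of the marginal
# masses as the measure is what makes the two variables independent.
omega = {a + b: first[a] * second[b] for a, b in product(first, second)}


def V1(sample):
    """First random variable: a function from sample points to {A, B}."""
    return sample[0]


def V2(sample):
    """Second random variable: a function from sample points to {X, Y, Z}."""
    return sample[1]


def prob(event):
    """Probability of an event, given as a predicate on sample points."""
    return sum(mass for sample, mass in omega.items() if event(sample))


independent = all(
    prob(lambda s: V1(s) == a and V2(s) == b)
    == prob(lambda s: V1(s) == a) * prob(lambda s: V2(s) == b)
    for a in first
    for b in second
)
```

Replacing the product measure with any other assignment of masses to the six points keeps `V1` and `V2` well defined but will generally make `independent` false.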

Concerning your general issue with classical probabilistic notation: I could not agree more. However, I believe an elegant solution already exists, and was proposed by Jaynes.

Its basis is simple: p(x), or P(x), is an object which does not exist. The correct tool in probabilities is conditional probabilities: one should always specify the preliminary knowledge that conditions the state of uncertainty about a quantity x. Therefore, p(X | c) is the distribution about variable X under the assumption that c holds. If preliminary knowledge is different, say c’, then p(X | c’) can be another mathematical distribution. Function p(. | .), in this case, is not overloaded (contrary to what many reviewers of my papers asserted, but that’s not the point. :) ).

Complement this with a few conventions, for instance, use p(. | .) in the continuous case, P(. | .) in the discrete case. Use capitalized symbols when referring to variables (i.e., domains), and lowercase symbols for values. So that P([X=x] | c) (or P(X=x | c), when you’re lazy) is a probability value, whereas P(X | c) is a probability distribution. And you should be all set.

The use of right-hand side symbols to make probability distributions non-ambiguous does not even need to be tied to “subjectivist” stances about the meaning of probabilities as states of knowledge of agents. Indeed, it is also the basis of (purely Bayesian statistics or machine learning inspired) methods of model selection: both P(X | c) and P(X | c’) being defined, a new variable C can then be introduced, with domain C={c, c’}, and the model P(C | D) P(X | C D) can then be introduced to carry out model selection by computing P(C | X D).
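As a rough illustration of that model-selection step, here is a tiny Python sketch that computes P(C | X D) from P(C | D) and P(X | C D). The two candidate models, their probabilities, and the names `likelihood`, `prior` and `posterior_over_models` are invented for the example and are not taken from Jaynes.

```python
from fractions import Fraction

# Hypothetical P(X | C D): two models of the same coin, c (fair) and c' (biased).
likelihood = {
    "c": {"heads": Fraction(1, 2), "tails": Fraction(1, 2)},
    "c'": {"heads": Fraction(9, 10), "tails": Fraction(1, 10)},
}

# Hypothetical P(C | D): no initial preference between the two models.
prior = {"c": Fraction(1, 2), "c'": Fraction(1, 2)}


def posterior_over_models(x):
    """P(C | X=x D), proportional to P(C | D) P(X=x | C D)."""
    joint = {c: prior[c] * likelihood[c][x] for c in prior}
    evidence = sum(joint.values())  # P(X=x | D)
    return {c: mass / evidence for c, mass in joint.items()}


after_heads = posterior_over_models("heads")
```

The right-hand-side symbol `c` is an ordinary dictionary key here, which is one way to read the claim that the conditioning context, rather than the function name, picks out the distribution.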

2. And what about mixtures, like spike and slab or Dirichlet process which are part discrete and part continuous?

3. The objection to the cap/lowercase convention is about matrices and to a lesser extent Greek letters. If we also want to capitalize matrices or bold them, we run into conflicting conventions. Not all the Greek letters have easily distinguishable caps.

4. While writing $P(X=x | c)$ may clear up some confusions, it also runs head-on into the notation used for events, where $X=x$ is shorthand for the event $\{ \omega \in \Omega : X(\omega) = x \}$. And obviously $\mbox{Pr}[X=x] = 0$ if we’re talking events and $X$ is continuous.

5. I don’t see how using conditionals cleans up the distinction between $p(x)$ and $p(y)$, which would be written in probability theory as $p_X(x)$ and $p_Y(y)$. This notation gets cumbersome when we have a dozen parameters.

I’ve never heard anyone say that the problem is purely Bayesian or frequentist — this is just about probability theory, about which everyone is in agreement. The frequentist/Bayesian debate is about what can be the object of a probability distribution, not how the laws of probability work.

I totally agree that random variables should be distinguished from their values and that there should be a regular way of specifying which variable’s probability density is meant in places where it is not obvious. Maybe the underlining notation I wrote about above would be a better way for both of these aims (the second aim seems to assume the first).

As for the “continuous versus discrete” overloading, I think it is not a big problem, as there is no mathematical controversy here: sum is a special case of integration (sum is the Lebesgue integral with respect to the natural measure on the integers, the counting measure). When doing computations, one must be aware of the codomains of the variables (are the variables integer-valued, real-valued, vector-valued, unit-circle-valued etc.), but that also applies if we have only continuous variables (Julien says something similar in his comment https://lingpipe-blog.com/2009/10/13/whats-wrong-with-probability-notation/#comment-16022 ).
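To spell out the special case, here is a sketch in which $\nu$ stands for the dominating measure and $p_X$ for the density of $X$ with respect to it (both symbols are introduced here for illustration). The single integral covers both familiar formulas:

```latex
\mathbb{E}[X] \;=\; \int x \, p_X(x) \, \nu(dx)
\;=\;
\begin{cases}
\displaystyle \sum_{x} x \, p_X(x)
  & \text{if } \nu \text{ is the counting measure (discrete } X\text{),} \\[1.5ex]
\displaystyle \int x \, p_X(x) \, dx
  & \text{if } \nu \text{ is the Lebesgue measure (continuous } X\text{).}
\end{cases}
```

Read this way, the overloading of $p$ across the discrete and continuous cases is a choice of measure rather than two unrelated definitions.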

Mark Berliner from OSU has used the notation “[x]” to refer to the “density of x” in some of his papers, and “[x|y]” for a conditional density. He does not claim this is new, but I don’t recall where he wrote he saw it. is used for expectation, and is used for variance. I s’ppose one could propose that, since expectations are common, maybe would suffice for expectation.

Thank you so much for such a great post! This helps a lot in writing better text.

Even though this seems like a non-issue for statisticians, many non-statisticians can experience significant confusion due to the notion that a density function is just another static function as in ordinary calculus.

I am particularly comfortable with using p all the time and rely on the actual variables inside the parentheses to signify the corresponding density function.

However, we often find that some densities in our application are significant enough to justify giving them their own names like f, g or h. Another situation where we genuinely need specific names for density functions is when a single random variable has multiple possible densities depending on the context.

This practice, however, leads many to think that f, g or h are just ordinary calculus functions, which is not true, because density functions can be marginalized to reduce their dimensions, or conditioned upon another random variable, while ordinary functions cannot. For example, say f(a | b) is the conditional density of a given b. Then the shape or location of f is affected by b, which is also a random variable. Hence, f has no static shape or location. What we really want to communicate by f is a law of dependence between a and b, and this law can be static, which is why we are motivated to use f instead of the generic notation p in the first place.

For statisticians, it is quite natural to also write that integrating b out of g(a,b) will give g(a), which is to use g for both the joint and marginal density. This is justifiable because the joint density implies its marginal densities. Or sometimes people also write h(a,b) = f(a)g(b | a) which looks utterly confusing to most outsiders. However, this notation immediately makes sense when we realize that replacing f, g and h with p will give the familiar formula of the joint density of a and b.

In conclusion, I think the confusion is genuine and significant among many researchers. However, as I mentioned, this confusion seems to have legitimate reasons and a simple solution is not yet available. My current solution to this situation is to keep reminding myself that f, g or h are nothing else but specific variants of p, and hence they are not just calculus functions.

From my teaching experience, the main issue when starting with probability for most people is indeed the notation. But, from my mathematicians’ point of view, the main problem, which just seems unsolvable, is the lack of the mathematical foundation, namely the knowledge of measure and Lebesgue theory. With the proper mathematical framework there is no need to define two different notions of discrete and continuous random variables.

We start with a probability space $(\Omega,\mathcal F, \mathbb P)$, where

– $\Omega \neq \emptyset$ the sample space (which is indeed just a set),

– $\mathcal F$ is a $\sigma$-algebra, normally just the power set $\mathcal P(\Omega) := \{ A \mid A \subset \Omega \}$ of $\Omega$,

$\{X = x\} := \{X \in \{x\}\}$
etc. Now, you may set $p_X(x) := \mathbb P(X=x)$ which is not ambiguous at all.

But: A whole different object is the law or cumulative distribution function of a random variable $X$ which is defined as the image measure of a random variable $X$ by, for any $B \in \mathcal B(\mathbb R)$,

where $u : \mathbb R \to \mathbb R$ is a measurable map and the last equality comes from a Lebesgue integration analogue of the change of variables theorem in Riemann integration (transformation rule for image measures).

The measurable function $u$ can be quite simple, e.g. $u(x) = x$, which gives the expectation of $X$, or $u(x) = (x - \mathbb EX)^2$, which gives the variance of $X$.

From the abstract definition we now recover the discrete and continuous random variables like this:

(1) A discrete random variable only sees isolated points, i.e. we want to end up with a distribution of the form $X \sim \mathrm{Poi}(\lambda)$, $\mathbb P(X = n) = e^{-\lambda} \frac{\lambda^n}{n!}$, for $n = 0, 1, 2, \ldots$ and $\lambda > 0$.

Finally, the conditional probability and conditional expectation are two more generalised (but again completely different!) objects, denoted in a similar fashion to emphasise its strong connection to the probability and expectation of function.

which is again a random variable. So actually one should write $\mathbb P(A \mid \mathcal A)(\omega)$. In particular, the existence of both notions is not obvious at all. There is quite some work to be done. Secondly, $A \mapsto \mathbb P(A \mid \mathcal A)(\omega)$ is not a measure. But the good news is, two particular cases are exactly what we want:

Also, I would like to add that there is often a problem in representing the realization of a random variable. It should be lowercase, as vectors are, but because realizations can be tensors they are often shown in capital letters, which fails to differentiate between matrices, vectors, and tensors, as explained here with elegance.