7.2 Change Point Models

The first example is a model of coal mining disasters in the U.K. for the years 1851–1962.15

Model with Latent Discrete Parameter

(Fonnesbeck et al. 2013 Section 3.1) provides a Poisson model of disaster
\(D_t\) in year \(t\) with two rate parameters, an early rate (\(e\))
and late rate (\(l\)), that change at a given point in time \(s\). The
full model expressed using a latent discrete parameter \(s\) is

The last line uses the conditional operator (also known as the ternary
operator), which is borrowed from C and related languages. The
conditional operator has the same behavior as its counterpart in C++.16

It uses a compact notation involving separating its three arguments by
a question mark (?) and a colon (:). The conditional
operator is defined by

The log sum of exponents function allows the model to be coded
directly in Stan using the built-in function log_sum_exp,
which provides both arithmetic stability and efficiency for mixture
model calculations.

Coding the Model in Stan

The Stan program for the change point model is shown in
the figure below. The transformed parameter
lp[s] stores the quantity \(\log p(s, D \, | \, e, l)\).

A change point model in which disaster rates D[t] have one rate,
e, before the change point and a different rate, l, after the
change point. The change point itself, s, is marginalized out as
described in the text.

Although the change-point model is coded directly, the doubly nested
loop used for s and t is quadratic in T. Luke Wiklendt pointed
out that a linear alternative can be achieved by the use of dynamic
programming similar to the forward-backward algorithm for Hidden
Markov models; he submitted a slight variant of the following code to
replace the transformed parameters block of the above Stan program.

As should be obvious from looking at it, it has linear complexity in
T rather than quadratic. The result for the mining-disaster
data is about 20 times faster; the improvement will be greater for
larger T.

The key to understanding Wiklendt’s dynamic programming version is to
see that head(lp_e) holds the forward values, whereas
lp_l[T + 1] - head(lp_l, T) holds the backward values; the
clever use of subtraction allows lp_l to be accumulated
naturally in the forward direction.

Fitting the Model with MCMC

This model is easy to fit using MCMC with NUTS in its default
configuration. Convergence is fast and sampling produces roughly
one effective sample every two iterations. Because it is a relatively
small model (the inner double loop over time is roughly 20,000 steps),
it is fast.

The value of lp for each iteration for each change point is
available because it is declared as a transformed parameter. If the
value of lp were not of interest, it could be coded as a local
variable in the model block and thus avoid the I/O overhead of saving
values every iteration.

Posterior Distribution of the Discrete Change Point

The value of lp[s] in a given iteration is given by \(\log p(s,D|e,l)\) for the values of the early and late rates, \(e\) and \(l\),
in the iteration. In each iteration after convergence, the early and
late disaster rates, \(e\) and \(l\), are drawn from the posterior
\(p(e,l|D)\) by MCMC sampling and the associated lp calculated.
The value of lp may be normalized to calculate \(p(s|e,l,D)\) in
each iteration, based on on the current values of \(e\) and \(l\).
Averaging over iterations provides an unnormalized probability
estimate of the change point being \(s\) (see below for the normalizing
constant),

where \(\mbox{lp}[m,s]\) represents the value of lp in
posterior draw \(m\) for change point \(s\). By averaging over draws,
\(e\) and \(l\) are themselves marginalized out, and the result has no
dependence on a given iteration’s value for \(e\) and \(l\). A final
normalization then produces the quantity of interest, the posterior
probability of the change point being \(s\) conditioned on the data \(D\),

\[
p(s | D)
=
\frac{q(s|D)}{\sum_{s'=1}^T q(s' | D)}.
\]

A plot of the values of \(\log p(s|D)\) computed using Stan 2.4’s
default MCMC implementation is shown in the posterior plot.

Log probability of change point being in year, calculated analytically.

Figure 7.1: Analytical change-point posterior

The frequency of change points generated by sampling the discrete change
points.

Figure 7.2: Sampled change-point posterior

id:change-point-posterior.figure

In order their range of estimates be visible, the first plot is on the log
scale and the second plot on the linear scale; note the narrower range
of years in the right-hand plot resulting from sampling. The posterior
mean of \(s\) is roughly 1891.

Discrete Sampling

The generated quantities block may be used to draw discrete parameter
values using the built-in pseudo-random number generators. For
example, with lp defined as above, the following program
draws a random value for s at every iteration.

A posterior histogram of draws for \(s\) is shown on the second change
point posterior figure above.

Compared to working in terms of expectations, discrete sampling is
highly inefficient, especially for tails of distributions, so this
approach should only be used if draws from a distribution are
explicitly required. Otherwise, expectations should be computed in
the generated quantities block based on the posterior distribution for
s given by softmax(lp).

Posterior Covariance

The discrete sample generated for \(s\) can be used to calculate
covariance with other parameters. Although the sampling approach is
straightforward, it is more statistically efficient (in the sense of
requiring far fewer iterations for the same degree of accuracy) to
calculate these covariances in expectation using lp.

Multiple Change Points

There is no obstacle in principle to allowing multiple change points.
The only issue is that computation increases from linear to quadratic
in marginalizing out two change points, cubic for three change points,
and so on. There are three parameters, e, m, and
l, and two loops for the change point and then one over time,
with log densities being stored in a matrix.