This notebook demonstrates how to implement user defined probability functions in Stan language. As an example I use the generalized Pareto distribution (GPD) to model geomagnetic storm data from the World Data Center for Geomagnetism.

The largest event in the data is the March 1989 geomagnetic storm, also known as Quebec blackout event (https://en.wikipedia.org/wiki/March_1989_geomagnetic_storm). Now we are interested in estimating the probability of observing a same magnitude event in the future which would be helpful, for example, for insurance companies (for a short term geomagnetic storm predictions there are very elaborate models using observations closer to sun, too). Extreme value theory says that many distributions have a tail which is well modelled with generalized Pareto distribution given the cutoff point is far enough in the tail. For a more detailed model we could also take into account that geomagnetic storms are temporally correlated and tend to appear in bursts.

Generalized Pareto distribution

The Generalized Pareto distribution is defined as \[
p(y|u,\sigma,k)=
\begin{cases}
\frac{1}{\sigma}\left(1+k\left(\frac{y-u}{\sigma}\right)\right)^{-\frac{1}{k}-1}, & k\neq 0 \\
\frac{1}{\sigma}\exp\left(\frac{y-u}{\sigma}\right), & k = 0,
\end{cases}
\] where \(u\) is a lower bound parameter, \(\sigma\) is a scale parameter, \(k\) is a shape parameter, and \(y\) is the data restricted to the range \((u, \inf)\) (see, e.g., https://en.wikipedia.org/wiki/Generalized_Pareto_distribution for cdf and random number generation).

For probability distributions implemented in the Stan math library there are functions

_lpdf (or _lpmf): log probability density (mass) function

_cdf: cumulative distribution function

_lcdf: log cumulative distribution function

_lccdf: log complementary cumulative distribution function

_rng: generate random variables from the distribution

Unlike the functions implemented in the C++ Stan math library, the user defined functions can have only signature. In this example, I have chosen the function signatures so that the behavior is as close as possible to most usual ways to call these functions. The main feature is that all these function return a scalar real value (except _rng would return scalar integer for a discrete integer valued distribution).

For the Generalized Pareto distribution we implement

real gpareto_lpdf(vector y | real ymin, real k, real sigma)

real gpareto_cdf(vector y | real ymin, real k, real sigma)

real gpareto_lcdf(vector y | real ymin, real k, real sigma)

real gpareto_lccdf(vector y | real ymin, real k, real sigma)

real gpareto_rng(real ymin, real k, real sigma)

As we can define only one function signature for user defined functions, I have chosen to have vector type for y and real types for the parameters, while builtin functions can handle vectors, arrays and scalars (denoted by generic type reals). For vector valued y, _lpdf, _lcdf, and _lccdf return sum of log values computed with each element of y. For vector valued y, _cdf returns product of values computed with each element of y. _rng returns a single random number generated from the generalized Pareto distribution.

Stan code with user defined functions

The whole code for functions, the basic model and the generated quantities for posterior predictive checking (ppc), leave-one-out cross-validation (loo), and prediction of rare events is shown below. I will next go through some practical details.

For each function we do basic argument checking. In addition of invalid values due to user errors, due to the limited accuracy of the floating point presentation of the values, sometimes we may get invalid parameters. For example, due to the underflow, we may get sigma equal to 0, even if we have declared it as real<lower=0> sigma;. For these latter cases, it is useful to use reject statement so the corresponding MCMC proposal will be rejected and the sampling can continue.

Stan documentation warns about use of fabs and conditional evaluation as they may lead to discontinuous energy or gradient. In GPD we need to handle a special case of \(k \rightarrow 0\). The following will cause discontinuity in theory, but the magnitude of the discontinuity is smaller than the accuracy of the floating point presentation, and thus this should not cause any additional numerical instability.

By defining gpareto_lpdf we can use also the common ~ notation in Stan to write familiar looking model code.

model {
y ~ gpareto(ymin, k, sigma);
}

In generated quantities we compute log_lik values for LOO and yrep values for posterior predictive checking. We had defined the first argument of pareto_lpdf to be a vector. Now a single element of vector y[n] has a type real, and as Stan has strong typing and we can’t overload the user defined functions, we need to cast y[n] to be a vector by making a 1 element vector rep_vector(y[n],1). Alternatively we could write another function with a different name which could accept scalar y[n], but here I wanted to demonstrate the minimal approach.

Before running the Stan model, it’s a good idea to check that we have not made mistakes in the implementation of the user defined functions. RStan makes it easy to call the user defined functions in R, and thus easy to make some checks.

# the next set of tests compares the duer defined functions to alternative implemenations
# check that gparetp_lpdf with k=0 matches dexp() in stats (dgpd in extraDistr does not work with k=0 !!!)
all.equal(gpareto_lpdf(yr, ymin, k, sigma),dexp(yr-ymin, rate = 1/sigma, log = TRUE))