Stan Case Studies, Volume 5 (2018)

Multilevel Linear Models using Rstanarm

In this tutorial, we illustrate how to fit a multilevel linear model within a full Bayesian framework using rstanarm. This tutorial is aimed primarily at educational researchers who have used lme4 in R to fit models to their data and who may be interested in learning how to fit Bayesian multilevel models. However, for readers who have not used lme4 before, we briefly review the use of the package for fitting multilevel models.

Predator-Prey Population Dynamics: the Lotka-Volterra model in Stan

Lotka (1925) and Volterra (1926) formulated parametric differential
equations that characterize the oscillating populations of predators
and prey. A statistical model to account for measurement error and
unexplained variation uses the deterministic solutions to the
Lotka-Volterra equations as expected population sizes. Stan is used to
encode the statistical model and perform full Bayesian inference to
solve the inverse problem of inferring parameters from noisy data. The
model is fit to Canadian lynx and snowshoe hare populations between
1900 and 1920, based on the number of pelts collected annually by the
Hudson’s Bay Company. Posterior predictive checks for replicated data
show the model fits this data well. Full Bayesian inference may be
used to estimate future (or past) populations.
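
As a rough illustration of the deterministic part of such a model, the sketch below integrates the classical Lotka-Volterra equations, du/dt = (alpha - beta v) u for prey and dv/dt = (-gamma + delta u) v for predators, with a hand-written fourth-order Runge-Kutta step. The function names, the parameter values, and the initial populations are illustrative assumptions and are not taken from the case study, which solves the equations inside Stan.

```python
def lotka_volterra(u, v, alpha, beta, gamma, delta):
    """Return (du/dt, dv/dt) for prey population u and predator population v."""
    du_dt = (alpha - beta * v) * u
    dv_dt = (-gamma + delta * u) * v
    return du_dt, dv_dt


def rk4_step(u, v, dt, params):
    """Advance (u, v) by one classical fourth-order Runge-Kutta step of size dt."""
    k1u, k1v = lotka_volterra(u, v, *params)
    k2u, k2v = lotka_volterra(u + 0.5 * dt * k1u, v + 0.5 * dt * k1v, *params)
    k3u, k3v = lotka_volterra(u + 0.5 * dt * k2u, v + 0.5 * dt * k2v, *params)
    k4u, k4v = lotka_volterra(u + dt * k3u, v + dt * k3v, *params)
    u_next = u + dt / 6.0 * (k1u + 2.0 * k2u + 2.0 * k3u + k4u)
    v_next = v + dt / 6.0 * (k1v + 2.0 * k2v + 2.0 * k3v + k4v)
    return u_next, v_next


def simulate(u0, v0, params, years, steps_per_year=1000):
    """Return the populations at year 0, 1, ..., years as a list of (u, v) pairs."""
    dt = 1.0 / steps_per_year
    u, v = u0, v0
    trajectory = [(u, v)]
    for _ in range(years):
        for _ in range(steps_per_year):
            u, v = rk4_step(u, v, dt, params)
        trajectory.append((u, v))
    return trajectory


if __name__ == "__main__":
    # Hypothetical parameter values (alpha, beta, gamma, delta), for illustration only.
    params = (0.55, 0.028, 0.80, 0.024)
    for year, (prey, predators) in enumerate(simulate(30.0, 4.0, params, years=20)):
        print(year, round(prey, 1), round(predators, 1))
```

In a statistical model of the kind described above, these yearly solutions would play the role of expected population sizes, with a noise model placed around them.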

Nearest neighbor Gaussian process (NNGP) models in Stan

Nearest neighbor Gaussian process (NNGP) models are a
family of highly scalable Gaussian process-based models. In brief,
NNGP extends Vecchia’s approximation (Vecchia 1988) to a process
using conditional independence given information from neighboring
locations. This case study shows how to express and fit these models
in Stan.

Stan Case Studies, Volume 4 (2017)

Extreme value analysis and user defined probability functions in Stan

This notebook demonstrates how to implement user-defined
probability functions in the Stan language. As an example I use the
generalized Pareto distribution (GPD) to model geomagnetic storm data
from the World Data Center for Geomagnetism.
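
To make the idea of a user-defined probability function concrete, here is a minimal Python sketch of a GPD log density and its cumulative distribution function. The parameterization (lower bound ymin, shape k, scale sigma) and the function names are assumptions made for this illustration; the notebook itself writes the corresponding functions in the Stan language.

```python
import math


def gpd_lpdf(y, ymin, k, sigma):
    """Log density of the generalized Pareto distribution at y.

    ymin is the lower bound, k the shape, and sigma > 0 the scale.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (y - ymin) / sigma
    if z < 0:
        return -math.inf  # below the lower bound
    if abs(k) < 1e-12:
        return -z - math.log(sigma)  # exponential limit as k -> 0
    t = 1.0 + k * z
    if t <= 0:
        return -math.inf  # above the upper bound, which exists when k < 0
    return -(1.0 / k + 1.0) * math.log(t) - math.log(sigma)


def gpd_cdf(y, ymin, k, sigma):
    """Cumulative distribution function matching gpd_lpdf."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (y - ymin) / sigma
    if z < 0:
        return 0.0
    if abs(k) < 1e-12:
        return 1.0 - math.exp(-z)
    t = 1.0 + k * z
    if t <= 0:
        return 1.0
    return 1.0 - t ** (-1.0 / k)
```

A quick sanity check for a hand-written density of this kind is to compare exp(gpd_lpdf(...)) with a finite-difference derivative of gpd_cdf at a few points.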

Modelling Loss Curves in Insurance with RStan

Loss curves are a standard actuarial technique for helping
insurance companies assess the amount of reserve capital they need to
keep on hand to cover claims from a line of business. Claims made and
reported for a given accounting period are tracked separately over
time. This enables the use of historical patterns of claim development
to predict expected total claims for newer policies.

We model the growth of the losses in each accounting period as an
increasing function of time, and use the model to estimate the
parameters which determine the shape and form of this growth. We also
use the sampler to estimate the values of the “ultimate loss ratio”,
i.e. the ratio of the total claims on an accounting period to the
total premium received to write those policies. We treat each
accounting period as a cohort.
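
The following sketch shows one way such a growth model could be written down. It assumes a loglogistic growth curve with shape omega and scale theta, and it assumes that the expected cumulative loss at development time t is premium times ultimate loss ratio times the growth factor. The functional form, the parameter names, and the numbers are assumptions chosen for illustration, not a statement of the model used in the case study.

```python
def loglogistic_growth(t, omega, theta):
    """Fraction of ultimate losses developed by time t (0 at t = 0, tends to 1)."""
    return t ** omega / (t ** omega + theta ** omega)


def expected_cumulative_loss(t, premium, loss_ratio, omega, theta):
    """Expected cumulative loss of one cohort at development time t."""
    return premium * loss_ratio * loglogistic_growth(t, omega, theta)


if __name__ == "__main__":
    # Hypothetical cohort: premium of 1000, ultimate loss ratio of 0.7.
    for t in range(1, 11):
        print(t, round(expected_cumulative_loss(t, 1000.0, 0.7, 1.5, 4.0), 1))
```

Under these assumptions the ultimate loss ratio is the value the ratio of cumulative loss to premium approaches as development time grows.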

Splines in Stan

In this document, we discuss the implementation of splines in Stan. We
start by providing a brief introduction to splines and then explain
how they can be implemented in Stan. We also discuss a novel prior
that alleviates some of the practical challenges of spline models.

This case study shows how to efficiently encode and compute an
Intrinsic Conditional Auto-Regressive (ICAR) model in Stan.
When data has a neighborhood structure, ICAR models provide spatial smoothing
by averaging measurements of directly adjoining regions.
The Besag, York, and Mollié (BYM) model is a Poisson GLM which
includes both an ICAR component and an ordinary
random-effects component for non-spatial heterogeneity.
We compare two variants of the BYM model and fit two datasets
taken from epidemiological studies over 56 and 700 regions, respectively.

Typical Sets and the Curse of Dimensionality

This case study illustrates the so-called “curse of
dimensionality” using simple examples based on simulation to show that
all points are far apart in high dimensions and that the mode is an
atypical draw from a multivariate normal. The information-theoretic
concept of typical set is illustrated with both discrete and
continuous cases, which show that probability mass is a product of
volume and density (or count and mass in the discrete case). It also
illustrates Monte Carlo methods and relates distance to the log
density of the normal distribution and the chi-squared distribution.
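
A small simulation in the spirit of this description: draw points from a standard multivariate normal and measure their Euclidean distance from the mode at the origin. The function name, the number of draws, and the seed are assumptions made for this sketch.

```python
import math
import random


def distances_from_mode(dim, n_draws=2000, seed=1234):
    """Euclidean distances from the origin (the mode) of standard normal draws."""
    rng = random.Random(seed)
    return [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim)))
        for _ in range(n_draws)
    ]


if __name__ == "__main__":
    for dim in (1, 10, 100):
        d = distances_from_mode(dim)
        print("dim", dim, "min", round(min(d), 2), "mean", round(sum(d) / len(d), 2))
```

Because the squared distance follows a chi-squared distribution with dim degrees of freedom, the distances concentrate near the square root of dim, so in high dimensions essentially no draw lands near the mode.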

A Primer on Bayesian Multilevel Modeling using PyStan

This case study replicates the analysis of home radon levels using
hierarchical models of Lin, Gelman, Price, and Kurtz (1999). It
illustrates how to generalize linear regressions to hierarchical models with
group-level predictors and how to compare predictive inferences and
evaluate model fits. Along the way it shows how to get data into Stan
using pandas, how to sample using PyStan, and how to visualize the results
using Seaborn.

The Impact of Reparameterization on Point Estimates

When changing variables, a Jacobian adjustment needs to be
provided to account for the rate of change of the transform. Applying
the adjustment ensures that inferences that are based on expectations
over the posterior are invariant under reparameterizations. In
contrast, the posterior mode changes as a result of the
reparameterization. In this note, we use Stan to code a repeated
binary trial model parameterized by chance of success, along with its
reparameterization in terms of log odds in order to demonstrate the
effect of the Jacobian adjustment on the Bayesian posterior and the
posterior mode. We contrast the posterior mode to the maximum
likelihood estimate, which, like the Bayesian estimates, is invariant
under reparameterization. Along the way, we derive the logistic
distribution by transforming a uniformly distributed variable.
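
The effect on the mode can be checked numerically with a short sketch. It assumes y successes in n trials and a uniform prior on the chance of success theta, which is an assumption made here for concreteness. Under that assumption the posterior mode on the theta scale is y / n, while on the log-odds scale the Jacobian term theta (1 - theta) moves the mode to the value corresponding to theta = (y + 1) / (n + 2). The helper names and the grid search are illustrative only.

```python
import math


def inv_logit(alpha):
    """Map log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-alpha))


def log_post_theta(theta, y, n):
    """Unnormalized log posterior of theta under a uniform prior."""
    return y * math.log(theta) + (n - y) * math.log1p(-theta)


def log_post_alpha(alpha, y, n):
    """Unnormalized log posterior of alpha = logit(theta), with Jacobian term."""
    theta = inv_logit(alpha)
    log_jacobian = math.log(theta) + math.log1p(-theta)  # log |d theta / d alpha|
    return log_post_theta(theta, y, n) + log_jacobian


def argmax_on_grid(f, lo, hi, n_points=200001):
    """Return the grid point in [lo, hi] at which f is largest."""
    step = (hi - lo) / (n_points - 1)
    best = max(range(n_points), key=lambda i: f(lo + i * step))
    return lo + best * step


if __name__ == "__main__":
    y, n = 3, 10
    mode_theta = argmax_on_grid(lambda t: log_post_theta(t, y, n), 0.001, 0.999)
    mode_alpha = argmax_on_grid(lambda a: log_post_alpha(a, y, n), -10.0, 10.0)
    print("mode on theta scale:", round(mode_theta, 4))
    print("mode on log-odds scale, mapped back to theta:", round(inv_logit(mode_alpha), 4))
```

Dropping the Jacobian term from log_post_alpha would give the likelihood alone on the log-odds scale, whose maximizer maps back to y / n, in line with the invariance of the maximum likelihood estimate.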

Hierarchical Two-Parameter Logistic Item Response Model

This case study documents a Stan model for the two-parameter logistic model (2PL) with hierarchical priors. A brief simulation indicates that the Stan model successfully recovers the generating parameters. An example using a grade 12 science assessment is provided.

This case study documents a Stan model for the rating scale model (RSM) and the generalized rating scale model (GRSM) with latent regression. The latent regression portion of the models may be restricted to an intercept only, yielding a standard RSM or GRSM. A brief simulation indicates that the Stan models successfully recover the generating parameters. An example using a survey of public perceptions of science and technology is provided.

This case study documents a Stan model for the partial credit model (PCM) and the generalized partial credit model (GPCM) with latent regression. The latent regression portion of the models may be restricted to an intercept only, yielding a standard PCM or GPCM. A brief simulation indicates that the Stan models successfully recover the generating parameters. An example using the TIMSS 2011 mathematics assessment is provided.

This case study documents Stan models for the Rasch and two-parameter logistic models with latent regression. The latent regression portion of the models may be restricted to an intercept only, yielding standard versions of the models. Simulations indicate that the two models successfully recover generating parameters. An example using a grade 12 science assessment is provided.

Two-Parameter Logistic Item Response Model

This tutorial introduces the R package edstan for estimating
two-parameter logistic item response models using Stan without knowing
the Stan language. Subsequently, the tutorial explains how the model
can be expressed in the Stan language and fit using the rstan
package. Specification of prior distributions and assessment of
convergence are discussed. Using the Stan language directly has the
advantage that it becomes quite easy to extend the model, and this is
demonstrated by adding a latent regression and differential item
functioning to the model. Posterior predictive model checking is also
demonstrated.
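
For readers who have not seen the model written out, the sketch below computes the response probability of a two-parameter logistic model in one common parameterization, inv_logit(alpha * (theta - beta)), where theta is a person's ability, alpha an item's discrimination, and beta its difficulty. The parameterization and the function names are assumptions for this illustration and may differ from the one used in the tutorial.

```python
import math


def inv_logit(x):
    """Logistic function mapping the real line to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def prob_correct_2pl(theta, alpha, beta):
    """Probability of a correct response under a two-parameter logistic model."""
    return inv_logit(alpha * (theta - beta))


if __name__ == "__main__":
    # Hypothetical item with discrimination 1.2 and difficulty 0.5.
    for theta in (-2.0, -1.0, 0.0, 0.5, 1.0, 2.0):
        print(theta, round(prob_correct_2pl(theta, 1.2, 0.5), 3))
```

In this parameterization a person whose ability equals the item difficulty answers correctly with probability one half, and a larger discrimination makes the probability change more sharply around that point.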

Cognitive Diagnosis Model: DINA model with independent attributes

This case study documents a Stan model for the DINA model with independent attributes. A simulation indicates that the Stan model successfully recovers the generating parameters and predicts respondents’ attribute mastery. A Stan model with no structure of the attributes is also discussed and applied to the simulated data. An example using a subset of the fraction subtraction data is provided.

Pooling with Hierarchical Models for Repeated Binary Trials

This note illustrates the effects on posterior inference of
pooling data (aka sharing strength) across items for repeated binary
trial data. It provides Stan models and R code to fit and check
predictive models for three situations: (a) complete pooling, which
assumes each item is the same, (b) no pooling, which assumes the items
are unrelated, and (c) partial pooling, where the similarity among the
items is estimated. We consider two hierarchical models to estimate
the partial pooling, one with a beta prior on chance of success and
another with a normal prior on the log odds of success. The note
explains with working examples how to (i) fit models in RStan and plot
the results in R using ggplot2, (ii) estimate event probabilities,
(iii) evaluate posterior predictive densities to evaluate model
predictions on held-out data, (iv) rank items by chance of success,
(v) perform multiple comparisons in several settings, (vi) replicate
new data for posterior p-values, and (vii) perform graphical posterior
predictive checks.
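
The three pooling situations can be contrasted with simple point estimates. In the sketch below, complete pooling uses one overall success rate, no pooling uses each item's own rate, and partial pooling is approximated by the posterior mean under a beta prior whose mean is the overall rate and whose strength prior_count is fixed by hand. Fixing the prior is a simplification made for this illustration; the hierarchical models in the note estimate the amount of pooling from the data. The data and the function names are hypothetical.

```python
def complete_pooling(successes, trials):
    """One shared chance of success for every item."""
    return sum(successes) / sum(trials)


def no_pooling(successes, trials):
    """A separate chance of success for each item."""
    return [y / k for y, k in zip(successes, trials)]


def partial_pooling_fixed_prior(successes, trials, prior_count):
    """Posterior means under a Beta(a, b) prior centered on the pooled rate.

    prior_count = a + b controls how strongly items are pulled together.
    """
    pooled = complete_pooling(successes, trials)
    a = prior_count * pooled
    b = prior_count * (1.0 - pooled)
    return [(y + a) / (k + a + b) for y, k in zip(successes, trials)]


if __name__ == "__main__":
    successes = [2, 5, 9]  # hypothetical data
    trials = [10, 10, 10]
    print("complete:", complete_pooling(successes, trials))
    print("none:    ", no_pooling(successes, trials))
    print("partial: ", partial_pooling_fixed_prior(successes, trials, prior_count=20.0))
```

Each partially pooled estimate is a weighted average of the item's own rate and the overall rate, so it always lies between the no-pooling and complete-pooling estimates.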

Stan Case Studies, Volume 2 (2015)

Multiple Species-Site Occupancy Model

This case study replicates the analysis and output graphs of
Dorazio et al.’s (2006) noisy-measurement occupancy model for multiple
species abundance of butterflies. Going beyond the paper, the
supercommunity assumptions are tested to show they are invariant to
sizing, and posterior predictive checks are provided.