State space model (SSM) refers to a class of probabilistic graphical models (Koller and Friedman, 2009) that describe the probabilistic dependence between a latent state variable and an observed measurement. The state or the measurement can be either continuous or discrete. The term “state space” originated in the 1960s in the area of control engineering (Kalman, 1960). The SSM provides a general framework for analyzing deterministic and stochastic dynamical systems that are measured or observed through a stochastic process. The SSM framework has been successfully applied in engineering, statistics, computer science and economics to solve a broad range of dynamical systems problems. Other terms used to describe SSMs include hidden Markov models (HMMs) (Rabiner, 1989) and latent process models. The most thoroughly studied SSM is the linear Gaussian model, for which the Kalman filter defines the optimal inference algorithm.

An important objective of computational neuroscience is to develop statistical techniques to characterize the dynamic features inherent in the neural and behavioral responses of experimental subjects collected during neurophysiological experiments. In neuroscience experiments, measurements of neural or behavioral data are often dynamic, noisy and rich in temporal structure. Examples include intracellular or extracellular recordings, neuronal spike trains, local field potentials, EEG, MEG, fMRI, calcium imaging, and behavioral measures (such as reaction time and decision choice). Questions of interest may include how to analyze spike trains from ensembles of hippocampal place cells to infer the rodent’s position in the environment, or how to localize dipole sources from multi-channel MEG recordings. Regardless of the specific modality and application, the SSM provides a unified and powerful paradigm to model and analyze these signals in a dynamic fashion in both time and space.

Formalism and theory

The objective of state space modeling is to compute the optimal estimate of the hidden state given the observed data, which can be derived in a recursive form from Bayes’s rule (Brown et al., 1998; Chen, Barbieri and Brown, 2010). In a general state space formulation, let x(t) denote the state and y(0:t) the cumulative observations up to time t. The filtering posterior probability distribution of the state conditional on the observations y(0:t) is

$$p(x(t)\,|\,y(0{:}t)) \propto p(y(t)\,|\,x(t),y(0{:}t-1))\,p(x(t)\,|\,y(0{:}t-1))\tag{1}$$

$$p(x(t)\,|\,y(0{:}t-1)) = \int p(x(t)\,|\,x(t-1))\,p(x(t-1)\,|\,y(0{:}t-1))\,dx(t-1)\tag{2}$$

where the probability distribution (or density) p(x(t)|x(t-1)) is specified by the state equation, and the probability distribution (or density) p(y(t)|x(t),y(0:t-1)) is specified by the observation equation. Equations (1) and (2) provide the fundamental relations for developing state space models and analyses.
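For a concrete illustration of the recursion in equations (1) and (2), the following sketch (in Python) approximates the filtering posterior on a discretized state grid; the one-dimensional random-walk state, Gaussian observation model, and all parameter values are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Grid-based sketch of the recursive Bayes filter in equations (1)-(2),
# for an assumed 1-D random-walk state with Gaussian observations.
grid = np.linspace(-5, 5, 201)            # discretized state space
dx = grid[1] - grid[0]

def gauss(z, mean, var):
    return np.exp(-0.5 * (z - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# p(x(t)|x(t-1)): random walk, variance q; p(y(t)|x(t)): Gaussian, variance r
q, r = 0.1, 0.5
trans = gauss(grid[:, None], grid[None, :], q)   # trans[i, j] = p(x_i | x_j)
trans /= trans.sum(axis=0, keepdims=True)        # normalize each column

posterior = np.full_like(grid, 1.0 / (len(grid) * dx))  # flat prior p(x(0))
for y in [0.3, 0.5, 0.4]:                  # toy observation sequence
    prior = trans @ (posterior * dx)       # one-step prediction, equation (2)
    posterior = gauss(y, grid, r) * prior  # Bayes update, equation (1)
    posterior /= posterior.sum() * dx      # renormalize to a density

state_estimate = grid[np.argmax(posterior)]  # posterior mode
```

The grid approximation is only feasible for low-dimensional states; the Kalman filter discussed below replaces it with closed-form Gaussian updates.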

For the purpose of illustration, consider a discrete-time multivariate linear Gaussian system, in which the SSM is characterized by two linear equations.

State equation. The n-dimensional hidden state process x(t+1) follows first-order Markovian dynamics, as it only depends on the previous state at time t and is corrupted by a (correlated or uncorrelated) state noise process n(t)

$$\textbf{x}(t+1) = \textbf{Ax}(t) + \textbf{n}(t)\tag{3}$$

Observation equation. The m-dimensional measurement y(t) is subject to a linear transformation of the hidden state x(t) and is further corrupted by a measurement noise process $\textbf{v}(t)$

$$\textbf{y}(t) = \textbf{Bx}(t) + \textbf{v}(t)\tag{4}$$

When the noise processes n(t) and v(t) are both Gaussian with zero mean and respective covariance matrices Q and R, y(t) is a Gaussian process. Equation (3) defines a first-order autoregressive (AR) process. Although a first-order Markovian transition is assumed in equation (3), a higher-order AR structure can always be transformed into a first-order AR formulation by concatenating several state vectors into a new state vector (for example, ${\bf x}_{new}$(t) = [x(t), x(t-1)]). Given a series of observations Y={y(1), …, y($T_0$)} during a time interval [0,$T_0$], in light of equations (3) and (4), the joint (complete data) likelihood function for the linear Gaussian SSM is

$$p(X,Y|\theta) = \prod_{t=1}^{T_0}\frac{\exp\!\left(-\frac{1}{2}[\textbf{x}(t)-\textbf{Ax}(t-1)]^T\textbf{Q}^{-1}[\textbf{x}(t)-\textbf{Ax}(t-1)]\right)}{(2\pi)^{n/2}|\textbf{Q}|^{1/2}}\times\frac{\exp\!\left(-\frac{1}{2}[\textbf{y}(t)-\textbf{Bx}(t)]^T\textbf{R}^{-1}[\textbf{y}(t)-\textbf{Bx}(t)]\right)}{(2\pi)^{m/2}|\textbf{R}|^{1/2}}\tag{5}$$

where the superscript T denotes the vector or matrix transpose operator, and the augmented variable θ={A, B, Q, R, x(0)} includes the initial state condition and parameters that fully characterize the linear Gaussian SSM.
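As a minimal numerical sketch of the state and observation equations (3) and (4) (in Python), one can simulate the two sequences directly; the dimensions and parameter values below are arbitrary illustrative choices.

```python
import numpy as np

# Simulate the linear Gaussian SSM of equations (3)-(4) with assumed,
# purely illustrative dimensions and parameters.
rng = np.random.default_rng(1)
n, m, T0 = 2, 3, 100
A = np.array([[0.9, 0.1], [0.0, 0.8]])   # state transition matrix
B = rng.standard_normal((m, n))          # observation matrix
Q = 0.05 * np.eye(n)                     # state noise covariance
R = 0.10 * np.eye(m)                     # measurement noise covariance

x = np.zeros((T0 + 1, n))                # initial condition x(0) = 0
y = np.zeros((T0, m))
for t in range(T0):
    x[t + 1] = A @ x[t] + rng.multivariate_normal(np.zeros(n), Q)   # eq. (3)
    y[t] = B @ x[t + 1] + rng.multivariate_normal(np.zeros(m), R)   # eq. (4)
```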

Variants

The linear Gaussian SSM can be extended to a broad class of dynamic Bayesian networks (Ghahramani, 1998) by changing one or more of the following conditions on the state or measurement variables (Chen, Barbieri and Brown, 2010): (i) from a continuous to a discrete or mixed-value state variable in equation (3); (ii) from a continuous to a discrete or mixed-value observation in equation (4); (iii) from Gaussian to non-Gaussian noise processes in equation (3) or (4); and (iv) inclusion of nonlinearity in equation (3) or (4). For instance, changing condition (i) may result in a discrete-time HMM or a switching SSM; changing condition (ii) or (iii) may result in a generalized SSM with a generalized linear model (GLM) (McCullagh and Nelder, 1989) in place of equation (4); and changing condition (iv) may result in a nonlinear neural filter. In addition, a control variable can be incorporated into the state equation (3), which results in a standard linear quadratic Gaussian (LQG) control system for which the optimal solution can be derived analytically (Bertsekas, 2005).

In modeling discrete neural signals, such as neuronal spike trains (point processes) or spike counts (Poisson processes), new variants of the SSM emerge. For simplicity, consider a single neural point process whose conditional intensity function λ(t) is modulated by a constant baseline rate parameter μ, a latent continuous state variable x(t) and an observed covariate u(t). When the bin width Δ is sufficiently small, the product λ(t)Δ is approximately equal to the probability of observing a spike within the interval ((t-1)Δ, tΔ] (in which there is at most one spike). For simplicity of illustration, assume that the state equation is characterized by a first-order AR process, and the observation equation is characterized by a point process likelihood function (Smith and Brown, 2003)

$$x(t+1) = \rho x(t) + n(t)\tag{6}$$

$$\lambda(t) = \exp\left(\mu + \alpha x(t) + \beta u(t)\right)\tag{7}$$

$$p(dy(t)\,|\,x(t),\theta) = \left(\lambda(t)\Delta\right)^{dy(t)}\exp\!\left(-\lambda(t)\Delta\right)\tag{8}$$

where θ={σ, ρ, μ, α, β, $x_0$}, n(t) denotes a zero-mean Gaussian variable with variance $\sigma^2$, |ρ|<1 denotes the AR coefficient, and the indicator variable $dy(t)$ is equal to 1 when there is a spike within the interval ((t-1)Δ, tΔ] and 0 otherwise. From equations (6) and (8), the complete data likelihood function is derived as

$$p(X,Y|\theta) = \prod_{t=1}^{T_0} p(x(t)\,|\,x(t-1),\theta)\left(\lambda(t)\Delta\right)^{dy(t)}\exp\!\left(-\lambda(t)\Delta\right)\tag{9}$$

Provided that the parameter θ is known, direct optimization of p(X,Y|θ) or log p(X,Y|θ) will yield a globally optimal state estimate of {x(t)}.
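The generative model of equations (6)-(8) can be simulated in a few lines (Python), accumulating the point process log likelihood along the way; the parameter values and the sinusoidal covariate below are illustrative assumptions.

```python
import numpy as np

# Simulate the point process SSM of equations (6)-(8): a latent AR(1)
# state modulates the conditional intensity; at most one spike per bin.
rng = np.random.default_rng(2)
T0, delta = 500, 0.001                    # number of bins, bin width (s)
rho, sigma = 0.98, 0.1                    # AR coefficient, state noise s.d.
mu, alpha, beta = np.log(20.0), 1.0, 0.5  # baseline log-rate and gains
u = np.sin(2 * np.pi * np.arange(T0) / T0)  # observed covariate (assumed)

x = np.zeros(T0 + 1)
dy = np.zeros(T0, dtype=int)
loglik = 0.0
for t in range(T0):
    x[t + 1] = rho * x[t] + sigma * rng.standard_normal()   # eq. (6)
    lam = np.exp(mu + alpha * x[t + 1] + beta * u[t])       # eq. (7)
    p_spike = min(lam * delta, 1.0)       # lam*delta ~ spike probability
    dy[t] = rng.random() < p_spike
    # log of the point process likelihood term, eq. (8)
    loglik += dy[t] * np.log(lam * delta) - lam * delta
```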

In the presence of multivariate point process observations $Y=\{y_c(1:t)\}_{c=1}^C$, provided that the observations are driven by a common state process $X=\{x(t)\}_{t=1}^{T_0}$ and the multivariate point process observations are conditionally independent at any time $t$ given a parameter θ, the generic complete data likelihood is derived as

$$p(X,Y|\theta) = \prod_{t=1}^{T_0} p(x(t)\,|\,x(t-1),\theta)\prod_{c=1}^{C}\left(\lambda_c(t)\Delta\right)^{dy_c(t)}\exp\!\left(-\lambda_c(t)\Delta\right)\tag{10}$$

Such a probabilistic model is often used to characterize population neuronal spike trains (Brown et al., 1998; Brown et al., 2003; Chen, Barbieri and Brown, 2010).

Statistical inference and learning

A common objective of statistical inference for the SSM is to infer the state (including its uncertainty) from the time series observations. In light of equation (1), the goal of state space analysis is to estimate the posterior probability distribution (or density) p(x(t)|Y). In the special case of the linear Gaussian SSM, the posterior distribution is fully characterized by the conditional mean and conditional covariance of a Gaussian distribution. When the state and observation equations of the linear Gaussian SSM are known, the optimal inference algorithm is given by recursive Kalman filtering (where Y={y(1), …, y(t)} is used for online operation) or fixed-interval Kalman smoothing (where Y={y(1), …, y($T_0$)} is used for offline operation). When the state is discrete (as in the HMM) and the state and observation equations are known, the filtering distribution is computed by the forward algorithm and the most likely state sequence by the Viterbi algorithm. In the presence of point process observations, the discrete analog of the Kalman filtering operation is point process filtering (Brown et al., 1998; Smith and Brown, 2003). For the simple example above (equations (6)-(8)), the following point process filtering equations are used to infer the state $\{x(t)\}$

$$x(t+1|t) = \rho\,x(t|t)\tag{11}$$

$$\sigma^2_x(t+1|t) = \rho^2\sigma^2_x(t|t) + \sigma^2\tag{12}$$

$$x(t+1|t+1) = x(t+1|t) + \sigma^2_x(t+1|t)\,\alpha\left[dy(t+1) - \exp\!\left(\mu + \alpha x(t+1|t+1) + \beta u(t+1)\right)\Delta\right]\tag{13}$$

$$\sigma^2_x(t+1|t+1) = \left[\frac{1}{\sigma^2_x(t+1|t)} + \alpha^2\exp\!\left(\mu + \alpha x(t+1|t+1) + \beta u(t+1)\right)\Delta\right]^{-1}\tag{14}$$

where x(t+1|t+1) and $σ^2_x$(t+1|t+1) denote the posterior state mode and posterior state variance, respectively. Equations (11)-(14) are reminiscent of Kalman filtering: equations (11) and (12) for the one-step mean and variance predictions are the same as in the Kalman filter, but equations (13) and (14) differ because of the non-Gaussian observations and the nonlinear operation in (13). In equation (13), [dy(t+1) − exp(μ + αx(t+1|t+1) + βu(t+1))Δ] is viewed as the innovations term, and $σ^2_x$(t+1|t) may be interpreted as a “Kalman gain”, whose magnitude determines the “step size” of the error correction. In equation (14), the posterior state variance is obtained by inverting the second derivative of the log-posterior probability density log p(X|Y,θ), based on a Gaussian approximation of the posterior distribution around its mode (Brown et al., 1998; Smith and Brown, 2003; Eden et al., 2004).
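A compact implementation sketch of this point process filter is given below (Python). The simulated data, parameter values, and the fixed-point iteration used to solve the implicit posterior-mode update of equation (13) are all assumptions made here for concreteness; the text does not prescribe a particular solver.

```python
import numpy as np

# Point process filtering (equations (11)-(14)) on spikes simulated from
# the model of equations (6)-(8); all parameter values are illustrative.
rng = np.random.default_rng(3)
T0, delta = 1000, 0.001
rho, sigma2 = 0.98, 0.01
mu, alpha, beta = np.log(20.0), 1.0, 0.0
u = np.zeros(T0)                          # covariate switched off here

# Simulate latent state and spike indicators
x_true = np.zeros(T0 + 1)
dy = np.zeros(T0)
for t in range(T0):
    x_true[t + 1] = rho * x_true[t] + np.sqrt(sigma2) * rng.standard_normal()
    dy[t] = rng.random() < np.exp(mu + alpha * x_true[t + 1] + beta * u[t]) * delta

# Point process filter
x_post, v_post = np.zeros(T0 + 1), np.zeros(T0 + 1)
v_post[0] = sigma2                        # assumed initial variance
for t in range(T0):
    x_pred = rho * x_post[t]                        # eq. (11)
    v_pred = rho ** 2 * v_post[t] + sigma2          # eq. (12)
    x_new = x_pred
    for _ in range(10):                             # implicit eq. (13)
        lam = np.exp(mu + alpha * x_new + beta * u[t])
        x_new = x_pred + v_pred * alpha * (dy[t] - lam * delta)
    x_post[t + 1] = x_new
    v_post[t + 1] = 1.0 / (1.0 / v_pred + alpha ** 2 * lam * delta)  # eq. (14)
```

Because the observation term contributes only weakly per bin (λΔ is small), the fixed-point map is a contraction here and a handful of iterations suffices.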

When the state and observation equations are completely or partially unknown, the parameter θ and the state {x(t)} need to be estimated jointly. In statistics, likelihood inference is a well-established and asymptotically efficient approach for parameter estimation. Specifically, the expectation-maximization (EM) algorithm (Dempster, Laird and Rubin, 1977) provides a general framework for maximizing or increasing the likelihood by iteratively updating the latent state and parameter variables. Reconsidering the example of equations (6)-(8), the corresponding EM algorithm consists of the following two steps.

E-step: At the k-th step, compute the expected complete data log likelihood (Q-function) based on the estimate ${\it θ}^{(k)}$

$$Q(\theta\,|\,\theta^{(k)}) = E\left[\log p(X,Y|\theta)\,\big|\,Y,\theta^{(k)}\right]\tag{15}$$

and estimate the expected sufficient statistics of the latent process (E[x(t) | Y, ${\it θ}^{(k)}$], E[$x^2$(t) | Y, ${\it θ}^{(k)}$] and E[x(t)x(t+1) | Y, ${\it θ}^{(k)}$]) using point process filtering (equations (11) through (14)) followed by fixed-interval smoothing.

M-step: Update the parameter estimate via ${\it θ}^{(k+1)} = \arg\max_{\theta} Q(\theta\,|\,{\it θ}^{(k)})$. This can be achieved by setting the partial derivative of the Q-function to zero (i.e., $\frac{\partial Q}{\partial \theta}=0$), which may be solved via either numerical optimization (in the case of μ, α and β) or closed-form solutions (in the case of σ, ρ and $x_0$).

The E and M steps are executed iteratively until the likelihood reaches a local maximum. Upon convergence, the EM algorithm yields a point estimate of θ; confidence intervals for θ can then be assessed from the likelihood principle (Pawitan, 2001; Brown et al., 2003).
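To make the E-step/M-step alternation concrete, the sketch below (Python) applies the same structure to a simpler stand-in model, a scalar linear Gaussian SSM x(t+1)=ρx(t)+n(t), y(t)=x(t)+v(t) with known observation noise, where both steps have closed form; the model choice, parameter values, and variable names are illustrative assumptions rather than the point process EM of the text.

```python
import numpy as np

# EM for a scalar linear Gaussian SSM (illustrative stand-in):
# E-step = Kalman filter + fixed-interval (RTS) smoother;
# M-step = closed-form updates of rho and sigma^2.
rng = np.random.default_rng(4)
T0, rho_true, sig2_true, r = 400, 0.9, 0.25, 0.5
x = np.zeros(T0 + 1)
for t in range(T0):
    x[t + 1] = rho_true * x[t] + np.sqrt(sig2_true) * rng.standard_normal()
y = x[1:] + np.sqrt(r) * rng.standard_normal(T0)

rho, sig2 = 0.5, 1.0                      # initial parameter guesses
for _ in range(50):
    # E-step: Kalman filter
    xf, vf = np.zeros(T0 + 1), np.zeros(T0 + 1)
    xp, vp = np.zeros(T0 + 1), np.zeros(T0 + 1)
    vf[0] = 1.0
    for t in range(T0):
        xp[t + 1] = rho * xf[t]                       # one-step prediction
        vp[t + 1] = rho ** 2 * vf[t] + sig2
        k = vp[t + 1] / (vp[t + 1] + r)               # Kalman gain
        xf[t + 1] = xp[t + 1] + k * (y[t] - xp[t + 1])
        vf[t + 1] = (1 - k) * vp[t + 1]
    # E-step: fixed-interval (RTS) smoother and lag-one covariance
    xs, vs = xf.copy(), vf.copy()
    cov = np.zeros(T0 + 1)                            # Cov(x(t), x(t-1) | Y)
    for t in range(T0 - 1, -1, -1):
        j = rho * vf[t] / vp[t + 1]
        xs[t] = xf[t] + j * (xs[t + 1] - xp[t + 1])
        vs[t] = vf[t] + j ** 2 * (vs[t + 1] - vp[t + 1])
        cov[t + 1] = j * vs[t + 1]
    # M-step: closed-form updates from expected sufficient statistics
    s11 = np.sum(vs[1:] + xs[1:] ** 2)
    s00 = np.sum(vs[:-1] + xs[:-1] ** 2)
    s10 = np.sum(cov[1:] + xs[1:] * xs[:-1])
    rho = s10 / s00
    sig2 = (s11 - 2 * rho * s10 + rho ** 2 * s00) / T0
```

For the point process model of equations (6)-(8), the E-step instead uses the point process filter and smoother, and the M-step mixes closed-form and numerical updates as described above.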

Alternatively, Bayesian inference (Gelman et al., 2005) provides an approach that aims to estimate the full posterior of the state and parameters. When the state variable is non-Gaussian, particle filtering or smoothing may be used to numerically approximate the posterior distribution; when the parameter variable is non-Gaussian, Gaussian approximation, Gibbs sampling or Markov chain Monte Carlo (MCMC) approaches may be employed. The common goal of these inference methods is to estimate the joint posterior of the state and the parameters using Bayes's rule

$$p(X,\theta\,|\,Y) \approx p(X\,|\,Y)\,p(\theta\,|\,Y)$$

$$p(X,\theta\,|\,Y) = \frac{p(Y\,|\,X,\theta)\,p(X)\,p(\theta)}{\int\!\!\int p(Y\,|\,X,\theta)\,p(X)\,p(\theta)\,dX\,d\theta}\tag{16}$$

where the first equation assumes a factorial form of the posterior for the state and the parameters, and p(X) and p(θ) denote the prior distributions for the state and the parameters, respectively. The denominator of equation (16) is a normalizing constant known as the partition function. In general, Monte Carlo-based Bayesian inference or learning is powerful yet computationally expensive (Doucet, de Freitas, and Gordon, 2001; Gilks, Richardson and Spiegelhalter, 1996). A trade-off between tractable computational complexity and good performance is to exploit various approximate Bayesian inference methods, such as expectation propagation (Minka, 1999), mean-field approximation (Opper and Saad, 2001) and variational approximation (Jordan et al., 1999; Ghahramani, 1998; Beal and Ghahramani, 2006). These approximation techniques can also be combined to produce new methods, such as Monte Carlo EM or variational MCMC algorithms (McLachlan and Krishnan, 2008; Andrieu et al., 2003).
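As an illustration of the sampling-based approach, the bootstrap particle filter sketch below (Python) approximates p(x(t)|y(1:t)) for a model with heavy-tailed observation noise, where the Kalman filter no longer applies exactly; the model, noise scale, and particle count are illustrative assumptions.

```python
import numpy as np

# Bootstrap particle filter: propagate particles through the state
# equation, weight by the (Laplace) observation likelihood, resample.
rng = np.random.default_rng(5)
T0, n_particles = 50, 2000
rho, sigma = 0.95, 0.3

# Simulate data with heavy-tailed (Laplace) observation noise
x_true = np.zeros(T0 + 1)
for t in range(T0):
    x_true[t + 1] = rho * x_true[t] + sigma * rng.standard_normal()
y = x_true[1:] + rng.laplace(scale=0.3, size=T0)

particles = rng.standard_normal(n_particles)
est = np.zeros(T0)
for t in range(T0):
    # propagate through the state equation (proposal = prior)
    particles = rho * particles + sigma * rng.standard_normal(n_particles)
    # weight by the Laplace observation likelihood
    w = np.exp(-np.abs(y[t] - particles) / 0.3)
    w /= w.sum()
    est[t] = np.sum(w * particles)            # posterior mean estimate
    # multinomial resampling to combat weight degeneracy
    particles = particles[rng.choice(n_particles, n_particles, p=w)]
```

Resampling at every step is the simplest scheme; adaptive resampling based on the effective sample size is a common refinement.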

Numerous applications of SSM to dynamic analyses of neuroscience data can be found in the literature (Paninski et al., 2009; Chen, Barbieri and Brown, 2010). Several important and representative applications are highlighted here.

Research topics

An important topic in state space modeling is model selection, specifically selecting the (discrete- or continuous-valued) state dimensionality. Classical likelihood-based approaches rely on Akaike's information criterion (AIC) or the Bayesian information criterion (BIC), but these measures are often impractical, especially in the presence of sparse data samples. Following the Bayesian principle of “letting the data speak for themselves”, the model selection problem has recently been tackled using nonparametric Bayesian inference, for instance in the cases of the infinite HMM (Beal et al., 2002; Teh et al., 2006) and the switching SSM (Fox et al., 2010, 2011). Moreover, inference for large-scale SSMs of neuroscience data remains another important research topic. Exploiting structure in the system, such as sparsity, smoothness and convexity, may allow for employing efficient state-of-the-art optimization routines and imposing domain-dependent priors for regularization (Paninski et al., 2009; Paninski, 2010). Finally, developing consistent goodness-of-fit assessments for neuroscience data would help to validate and compare different statistical models (Brown et al., 2003).
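As a small worked example of likelihood-based model selection (Python), AIC and BIC can be computed as −2 log L plus their respective penalties for a family of candidate AR orders; the simulated data, candidate orders, and least-squares likelihood approximation are illustrative assumptions.

```python
import numpy as np

# Compare AIC = -2logL + 2p and BIC = -2logL + p*log(n) across candidate
# AR orders for the state equation; the true process here is AR(2).
rng = np.random.default_rng(6)
T0 = 300
x = np.zeros(T0)
for t in range(2, T0):
    x[t] = 0.5 * x[t - 1] + 0.3 * x[t - 2] + rng.standard_normal()

scores = {}
for p in (1, 2, 3, 4):                    # candidate model orders
    # Least-squares AR(p) fit; residual variance gives the Gaussian loglik
    X = np.column_stack([x[p - k - 1:T0 - k - 1] for k in range(p)])
    target = x[p:]
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    n = len(target)
    loglik = -0.5 * n * (np.log(2 * np.pi * resid.var()) + 1)
    aic = -2 * loglik + 2 * p
    bic = -2 * loglik + p * np.log(n)
    scores[p] = (aic, bic)

best_bic = min(scores, key=lambda p: scores[p][1])  # order minimizing BIC
```

BIC's log(n) penalty punishes extra parameters more strongly than AIC's, which is why BIC tends to favor the smaller of two models that fit comparably well.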