Login using

You can login by using one of your existing accounts.

We will be provided with an authorization token (please note: passwords are not shared with us) and will sync your accounts for you. This means that you will not need to remember your user name and password in the future and you will be able to login with the account you choose to sync, with the click of a button.

The behavior of many real-world phenomena can be modeled by non-linear dynamical systems whereby a latent system state is observed through a filter. We are interested in interacting subsystems of this form, which we model by a set of coupled maps as a synchronous update graph dynamical system. Specifically, we study the structure learning problem for spatially distributed dynamical systems coupled via a directed acyclic graph. Unlike established structure learning procedures that find locally maximum posterior probabilities of a network structure containing latent variables, our work exploits the properties of dynamical systems to compute globally optimal approximations of these distributions. We arrive at this result by the use of time delay embedding theorems. Taking an information-theoretic perspective, we show that the log-likelihood has an intuitive interpretation in terms of information transfer.

1. Introduction

Complex systems are broadly defined as systems that comprise interacting non-linear components (Boccaletti et al., 2006). Discrete-time complex systems can be represented using graphical models such as graph dynamical systems (GDSs) (Mortveit and Reidys, 2001; Wu, 2005), where spatially distributed dynamical units are coupled via a directed graph. The task of learning the structure of such a system is to infer directed relationships between variables; in the case of dynamical systems, these variables are typically hidden (Kantz and Schreiber, 2004). In this paper, we study the structure learning problem for complex networks of non-linear dynamical systems coupled via a directed acyclic graph (DAG). Specifically, we formulate synchronous update GDSs as dynamic Bayesian networks (DBNs) and study this problem from the perspective of information theory.

In this paper, we exploit the properties of discrete-time multivariate dynamical systems in inferring coupling between latent variables in a DAG. Specifically, the main focus of this paper is to analytically derive a measure (score) for evaluating the fitness of a candidate DAG, given data. We assume the data are generated by a certain family of multivariate dynamical system and are thus able to overcome the issue of latent variables faced by established structure learning algorithms. That is, under certain assumptions of the dynamical system, we are able to employ time delay embedding theorems (Stark et al., 2003; Deyle and Sugihara, 2011) to compute our scores.

2. Related Work

We are interested in classes of systems whereby dynamical units are coupled via a graph structure. These types of systems have been studied under several names, including complex dynamical networks (Boccaletti et al., 2006), spatially distributed dynamical systems (Kantz and Schreiber, 2004; Schumacher et al., 2015), master-slave configurations (or systems with a skew product structure) (Kocarev and Parlitz, 1996), and coupled maps (Kaneko, 1992). Common to each of these definitions is that the multivariate state of the system comprises individual subsystem states, the dynamics of which are given by a set of either discrete-time maps or first-order ordinary differential equations (ODEs), called a flow. We assume the discrete-time formulation, where a map can be obtained numerically by integrating differential equations or recording experimental data (observations) at discrete-time intervals (Kantz and Schreiber, 2004). The literature on coupled dynamical systems is often focused on the analysis of characteristics such as stability and synchrony of the system. In this work, we draw on the fields of BN structure learning and non-linear time series analysis to infer coupling between spatially distributed dynamical systems.

BN structure learning comprises two subproblems: evaluating the fitness of a graph and identifying the optimal graph given this fitness criterion (Chickering, 2002). The evaluation problem is particularly challenging in the case of graph dynamical systems, which include both latent and observed variables. A number of theoretically optimal techniques exist for the evaluation problem for BNs with complete data (Bouckaert, 1994; Lam and Bacchus, 1994; Heckerman et al., 1995), which have been extended to DBNs (Friedman et al., 1998). With incomplete data, however, the common approach is to resort to approximations that find local optima, e.g., expectation-maximization (EM) (Friedman et al., 1998; Ghahramani, 1998). An additional caveat with respect to structure learning is that algorithms find an equivalence class of networks with the same Markov structure, and not a unique solution (Chickering, 2002).

In non-linear time series analysis, the problem of inferring coupling strength and causality in complex systems has received significant attention recently (Schreiber, 2000; Hoyer et al., 2009). Early work by Granger defined causality in terms of the predictability of one system linearly coupled to another (Granger, 1969). Although this measure is popular for identifying coupling, it requires systems are linear statistical models and is considered insufficient for inferring coupling between dynamical systems due to inseparability (Sugihara et al., 2012). Another method popular in neuroscience is transfer entropy, which was introduced to quantify the information transfer between non-linear (finite-order Markov) systems (Schreiber, 2000). Transfer entropy has been used to recover interaction networks in numerous fields such as multi-agent systems (Cliff et al., 2016) and effective networks in neuroscience (Lizier et al., 2011; Vicente et al., 2011; Lizier and Rubinov, 2012). More recently, researchers have used the additive noise model (Hoyer et al., 2009; Peters et al., 2011) to infer unidirectional cause and effect relationships with observed random variables and find a unique DAG (as opposed to an equivalence class). These studies have been extended by exploring weakly additive noise models for learning the structure of systems of observed variables with non-linear coupling (Gretton et al., 2009).

A recent approach to inferring causality is convergent cross-mapping (CCM), which is based on Takens theorem (Takens, 1981) and tests for causation (predictability) by considering the history of observed data of a hidden variable in predicting the outcome of another (Sugihara et al., 2012). Using a similar approach, Schumacher et al. (2015) used Stark’s bundle delay embedding theorem (Stark, 1999; Stark et al., 2003) to predict one subsystem from another using Gaussian processes. This algorithm can thus be used to infer the driving systems in spatially distributed dynamical systems in a similar manner to our work. However, both papers do not consider the problem of inference over the entire network structure, or formally derive the measures used therein. In our work, we provide a rigorous proof based on established structure learning procedures and discuss the problem of inference within a distributed dynamical system.

3. Background

This section summarizes relevant technical concepts used throughout the paper. First, a stochastic temporal process X is defined as a sequence of random variables (X1, X2, …, XN) with a realization (x1, x2, …, xN) for countable time indices n∈N. Consider a collection of M processes, and denote the ith process Xi to have associated realization xni at temporal index n, and xn as all realizations at that index xn=⟨xn1,xn2,…,xnM⟩. If Xni is a discrete random variable, the number of values the variable can take on is denoted |Xni|. The following sections collect results from DBN literature, attractor reconstruction, and information theory that are relevant to this work.

3.1. Dynamic Bayesian Networks

DBNs are a general graphical representation of a temporal model, representing a probability distribution over infinite trajectories of random variables (Z1, Z2, …) compactly (Friedman et al., 1998). These models are a more expressive framework than the hidden Markov model (HMM) and Kalman filter model (KFM) (or linear dynamical system) (Friedman et al., 1998). In this work, we denote Zn = {Xn, Yn} as the set of hidden and observed variables, respectively, where n ∈ {1, 2, …} is the temporal index.

BNs B = (G, Θ) represent a joint distribution p(z ) graphically and consist of: a DAG G and a set of conditional probability distribution (CPD) parameters Θ. DBNs B = (B1, B→) extend the BN to model temporal processes and comprise two parts: the prior BN B1 = (G1, Θ1), which defines the joint distribution pB1(z1); and the two-time-slice Bayesian network (2TBN) B→ = (G→, Θ→), which defines a first-order Markov process pB→(zn+1|zn) (Friedman et al., 1998). This formulation allows for a variable to be conditioned on its respective parent set ΠG→(Zn+1i) that can come from the preceding time slice or the current time slice, as long as G→ forms a DAG. The 2TBN probability distribution factorizes according to G→with a local CPD pD estimated from an observed dataset. That is, given a set of stochastic processes (Z1, Z2, …, ZN), the realization of which constitutes a dataset D = (z1, z2, …, zN), we obtain the 2TBN distribution as

pB→(zn+1|zn)=∏ipB→(zn+1i|πG→(Zn+1i)),(1)

where πG→(Zn+1i) denotes the (index-ordered) set of realizations {zoj:Zoj∈ΠG→(Zn+1i)}.

3.2. Embedding Theory

Embedding theory refers to methods from differential topology for inferring the (hidden) state of a dynamical system from a reconstructed sequence of observations. The state of a discrete-time dynamical system is given by a point xn confined to a d-dimensional manifold M. The time evolution of this state is described by a map f : M→M, so that the sequence of states (xn) is given by xn+1 = f(xn). In many situations, we only have access to a filtered, scalar representation of the state, i.e., the measurement yn = ψ (xn) given by some measurement functionψ:M→R (Takens, 1981; Stark, 1999). The celebrated Takens’ theorem (Takens, 1981) shows that for typical f and ψ, it is possible to reconstruct f from the observed time series up to some smooth coordinate change. More precisely, fix some κ (the embedding dimension) and some τ (the time delay), then define the delay embedding mapΦf,ψ: M→Rκ by

Φf,ψ(xn)=yn(κ)=⟨yn,yn−τ,yn−2τ,…,yn−(κ−1)τ⟩.(2)

In differential topology, an embedding refers to a smooth map Ψ: M→N between manifolds M and N if it maps M diffeomorphically onto its image; therefore, Φf,ψ has a smooth inverse Φf,ψ−1. The implication of Takens’ theorem is that for typical f and ψ, the image Φf,ψ(M) of M is completely equivalent to M itself, apart from the smooth invertible change of coordinates given by the mapping Φf,ψ. An important consequence of this theorem is that we can define a map F=Φf,ψ∘f∘Φf,ψ−1 on Φf,ψ, such that yn+1(κ)=F(yn(κ)) (Stark, 1999).

There are technical assumptions for Takens’ theorem (and the generalized versions employed herein) to hold. These assumptions require: (f, ψ) to be generic functions (in terms of Baire space), a restricted number of periodic points, and distinct eigenvalues at each neighborhood of these points (Takens, 1981; Stark, 1999; Stark et al., 2003; Deyle and Sugihara, 2011).

3.3. Information Theoretic Measures

Conditional entropy represents the uncertainty of a random variable X after taking into account the outcomes of another random variable Y by

H(X|Y)=−∑x,yp(x,y)log2 p(x|y).(3)

Multivariate transfer entropy is a measure that computes the information transfer from a set of source processes to a set of destination process (Lizier et al., 2011). In this work, we use the formulation of collective transfer entropy (Lizier et al., 2010), where the information transfer from m source processes V = {Y1, Y2, …, Ym} to a single destination process Y can be decomposed as a sum of conditional entropy terms:

TV→Y=HYn+1|Yn(κ)−HYn+1|Yn(κ),⟨Yni,(κi)⟩,(4)

where Yni,(κi)=⟨Yni,Yn−τii,Yn−2τii,…,Yn−(κi−1)τii⟩ for some κi and τi, and similarly for Yn(κ).

4. Representing Non-Linear Dynamical Networks as DBNs

We express multivariate dynamical systems as a synchronous update GDS to allow for generic maps. With this model, we can express the time evolution of the GDS as a stationary DBN, and perform inference and learning on the subsequent graph. We formally state the network of dynamical systems as a special case of the sequential GDS (Mortveit and Reidys, 2001) with an observation function for each vertex.

• a multivariate statexn=⟨xni⟩, composed of states for each vertex Vi confined to a di-dimensional manifoldxni∈Mi;

• an M-variate observationyn=⟨yni⟩, composed of scalar observations for each vertexyni∈R;

• a set of local maps {fi} of the formfi:M→Mi, which update synchronously and induce a global mapf:M→M; and

• a set of local observation functions {ψ1, ψ2, …, ψM} of the formψi:Mi→R.

Without loss of generality, we can use local functions to describe the time evolution of the subsystems:

xn+1i=fi(xni,⟨xnij⟩j)+υfi(5)

yn+1i=ψi(xn+1i)+υψi.(6)

Here, υfi is i.i.d. additive noise and υψi is noise that is either i.i.d. or dependent on the state, i.e., υψi(xn+1i). The subsystem dynamics [equation (5)] are therefore a function of the subsystem state xni and the subsystem parents’ state ⟨xnij⟩j at the previous time index such that fi:Mi×jMij→Mi. Each subsystem observation is given by equation (6). We assume the functions {fi} and {ψi} are invariant w.r.t. time and thus the graph G is stationary.

The time evolution of a synchronous GDS can be modeled as a DBN. First, each subsystem vertex Vi = Xni, Yni has an associated state variable Xni and observation variable Yni; the parents of subsystem Vi are denoted ΠG(V i). Since the graph G→ is stationary and synchronous, parents of Xn+1i come strictly from the preceding time slice, and additionally ΠG→(Yn+1i)=Xn+1i. Thus, we can build the edge set E={E1,E2,…,EM} in the GDS by means of the DBN. That is, each edge subset Ei is built by the DBN edges

Ei={Vj→Vi:Xnj∈ΠG→(Xn+1i)∧Vj∈V\Vi},

so long as G forms a DAG. As an example, consider the synchronous GDS in Figure 1A. The subsystem V3 is coupled to both subsystem V1 and V2 through the edge set E={V1→V3,V2→V3}. The time-evolution of this network is shown in Figure 1B, where the top two rows (processes X1 and Y1) are associated with subsystem V1, and similarly for V2 and V3. The distributions for the state [equation (5)] and observation [equation (6)] of M arbitrary subsystems can therefore be factorized according to equation (1):

pB→(zn+1|zn)=∏i=1MpD(xn+1i|xni,⟨xnij⟩j)⋅pD(yn+1i|xn+1i).(7)

FIGURE 1

Figure 1. Representation of (A) the synchronous GDS with three vertices (V1, V2 and V3), and (B) the rolled-out DBN of the equivalent structure. Subsystem V3 is coupled to both subsystems V1 and V2 by means of the edges between latent variables Xn1→Xn+13 and Xn2→Xn+13.

In the rest of the paper, we use simplified notation, given this constrained graph structure. First, since our focus is on learning coupling between distributed systems, the superscripts refer to individual subsystems, not variables. Thus, although the 2TBN B→is constrained such that ΠG→(Yni)=Xni, the notation Ynij denotes the measurement variable of the jth parent of subsystem i, e.g., in Figure 1, an arbitrary ordering of the parents gives Yn3,1=Yn1 and Yn3,2=Yn2. Second, the scoring functions for the 2TBN network B→can be computed independently of the prior network B1 (Friedman et al., 1998). We will assume the prior network is given, and focus on learning the 2TBN. As a result, we drop the subscript and note that all references to the network B are to the 2TBN. Since B→is stationary, learning B→is equivalent to learning the synchronous GDS.

5. Learning Synchronous GDSs from Data

In this section, we develop the theory for learning the synchronous update GDS from data. We will focus on techniques for learning graphical models using the score and search paradigm, the objective of which is to find a DAG G* that maximizes a score g(B : D). Given such a score, we can then employ established search procedures to find the optimal graph G*. Thus, we can state that our main goal is to derive a tractable scoring function g(B : D) for synchronous GDSs that gives a parsimonious model for describing the data.

To derive the score, we use the DBN formulation of synchronous GDSs (Sec. 4) to show that we cannot directly compute the posterior probability of the network structure (Sec. 5.1). By making some assumptions about the system, however, we are able to compute scores for GDSs by use of attractor reconstruction methods (Sec. 5.2). We conclude this section by giving an interpretation of the log-likelihood in terms of information transfer (Sec. 5.3).

5.1. Structure Learning for DBNs

Ideally, we want to be able to compute the posterior probability of the network structure G, given data D. Using Bayes’ rule, we can express this distribution as pG|D∝pD|GpG, where p(G) encodes any prior assumptions we want to make about the network G. Thus, the problem becomes that of computing the likelihood of the data, given the model, p(D|G). The likelihood can be written in terms of distributions over network parameters (Friedman et al., 1998):

p(D|G)=∫p(D|G,Θ)p(Θ|G)dΘ,(8)

where we denote ℓ(Θ^G:D)=log p(D|G,Θ^G) as the log-likelihood function for a choice of parameters Θ^G that maximize p(D|G, Θ), given a graph G.

A common approach to compute equation (8) in closed form is by using Dirichlet priors. This leads to the BD (Bayesian–Dirichlet) score and variants (Heckerman et al., 1995; Friedman et al., 1998). However, to obtain this analytic solution, we require counts of the tuples (zni,πG(Zni)), which involve hidden variables. We will instead use Schwarz’s (Schwarz, 1978) asymptotic approximation of the posterior distribution, which states that

limN→∞log p(D|G)≈ℓ(Θ^G:D)−log N2C(G)+O(1),(9)

where C(G) is the model dimension (i.e., number of parameters needed for the graph G (Friedman et al., 1998)) and O(1) is a constant bounded by the number of potential models. The approximation of the posterior [equation (9)] requires that data come from an exponential family of likelihood functions with conjugate priors over the model G, and the parameters given the model ΘG (Schwarz, 1978).

Akaike (1974) gives a similar criterion by approximating the KL-divergence of any model from the data. We can compute both criteria in terms of the log-likelihood function ℓ(Θ^G:D) and the model dimension C(G), and thus the problem can be generalized to that of deriving an information criterion for scoring the graph of the form

5.2. Deriving the Scores for Synchronous GDSs

To calculate the information criterion [equation (10)], we require tractable expressions for the log-likelihood function ℓ(Θ^G:D) and the model dimension C(G). The form of the CPD in equation (7) specifies these functions, and for equation (9) to hold, this distribution must come from an exponential family (Schwarz, 1978). We do not assume the underlying model is linear-Gaussian or other known distributions, and thus express the log-likelihood as the maximum likelihood estimate for multinomial distributions (Friedman et al., 1998). From equation (7), the log-likelihood then decomposes as

Note that although we describe the states and observations as discrete in equation (11), we assume the data are generated by a continuous and stationary process. In theory, it is conceivable to have access to an infinite dataset containing realizations of all potential states and observations. In practice, we have a limited dataset and therefore must implement a discretization scheme. Modeling the dynamical systems with non-parametric techniques requires that the number of parameters scales linearly in the size of the data, and thus C(G) scales linearly with N. Instead, later, we will assume the observation data are discretized, such that there are |Yni| possible outcomes for an observed random variable Yni.

The log-likelihood function [equation (11)] involves distributions over latent variables, and thus we resort to state-space (attractor) reconstruction. First, Lemma 1 shows that a future observation from a given subsystem can be predicted from a sequence of past observations. Building on this result, we present a computable formulation of the 2TBN distribution pB→(zn+1|zn) via Lemma 2. We then derive a tractable form of the log-likelihood function, presented in Lemma 1. It is then shown in Theorem 2 that these lemmas allow us to compute the information criterion equation (10).

Lemma 1. Consider a synchronous GDS (G, xn, yn, {fi}, {ψi}), where the graph G is a DAG. Each subsystem state follows the dynamicsxn+1i=fi(xni,⟨xnij⟩j)and emits an observationyn+1i=ψi(xn+1i); the subsystem observation can be estimated, for some mapGi, by

yn+1i=Giyni,(κi),⟨ynij,(κij)⟩j.(12)

Proof. Consider a forced system xn+1 = f(xn, wn) with forcing dynamics wn+1 = h(wn) and observation yn = ψ(xn+1). Given this type of forced system, the bundle delay embedding theorem (Stark, 1999; Stark et al., 2003) states that the delay map Φf,h,ψ(xn,wn)=yn(κ) is an embedding for generic f, ψ, and h. Stark (1999) proved this result in the case of forcing dynamics h that are independent of the state x.1 For notational simplicity, we omit dependence on the noise process for the map Φf,h,ψ; the noise can be considered an additional forcing system so long as υf is i.i.d and υψ is either i.i.d or dependent on the state (Stark et al., 2003).

Given a DAG G, any ancestor of the subsystem V i is not dependent on V i. As such, the sequence

yni,(κi)=Φfi,⟨fij⟩j,ψixni,⟨xnij⟩j(13)

is an embedding, since ⟨xnij⟩j is independent of xni. Let ⟨xnijk⟩k be the index-ordered set of parents of node Xnij (which itself is the jth parent of the node Xni). Under the constraint that G is a DAG, where the state xn+1i=fi(xni,⟨xnij⟩j)+υfi, it follows from the bundle delay embedding theorem (Stark, 1999; Stark et al., 2003) that there exists a map Fi that is well defined and a diffeomorphism between observation sequences. From equation (13), we can write this map

Proof. The generalized time delay embedding theorem (Deyle and Sugihara, 2011) states that, under certain technical assumptions, and given M inhomogeneous observation functions {ψ1,ψ2,…,ψM}, the map

Φf,ψ(x)=⟨Φf1,ψ1(x),Φf2,ψ2(x),…,ΦfM,ψM(x)⟩(16)

is an embedding where each subsystem (local) map Φfi,ψi:M→Rκi, and, at time index n is described by

Φfi,ψi(xn)=yni,(κi)=⟨ψixn,ψi(xn−τi),ψi(xn−2τi),…,ψi(xn−(κi−1)τi)⟩

where ∑iκi=2d+1 (Deyle and Sugihara, 2011).2 Therefore, the global map equation (16) is given by Φf,ψ(xn)=⟨yni,(κi)⟩ and there must exist an inverse map xn=Φf,ψ−1⟨yni,(κi)⟩. Given Lemma 1, the existence of Φf,ψ−1, and since ∀i,{yni,(κi),⟨ynij,(κij)⟩j}⊆⟨yni,(κi)⟩, we arrive at the following equation:

Lemma 2 shows that the distributions can be reformulated by conditioning on delay vectors. The RHS of equation (15) can be used to perform inference in the 2TBN (7). The numerator is a product of local CPDs of scalar variables, and can thus be computed by either counting (for discrete variables) or density estimation (for continuous variables). The denominator is used to compute the probability that the hidden state occured, given an observed delay vector; fortunately, Casdagli et al. (1991) established methods to compute this CPDs for a variety of practical scenarios. Therefore, Lemma 2 provides a method to perform exact inference. Using this delay vector representation, we arrive at the following theorem.

Theorem 1. Consider a synchronous GDS(G,xn,yn,{fi},{ψi}), where the graph G is a DAG. Each subsystem state follows the dynamicsxn+1i=fi(xni,⟨xnij⟩j)and generates an observationyn+1i=ψi(xn+1i); a complete dataset is given by the sequence of observations D = (y1, y2, …, yN). The log-likelihood of the data given a network structure can be computed in terms of conditional entropy:

In equation (19), we have removed arguments of the joint distributions that will be nullified when multiplied with the CPD. Expressing equation (19) in terms of conditional entropy [equation (3)], we arrive at equation (18).

Proof. The distributions for the first term in equation (18) do not depend on the parents of a subsystem and thus are independent of the graph G being considered. Therefore, we have the following equation for maximum log-likelihood:

Since we are searching for the graph G* = maxG g(B : D), holding N constant, we can substitute equation (21) and equation (22) into equation (10) and ignore the constant term O(N) in (21).

5.3. The Log-Likelihood and Information Transfer

To conclude our study of the scores, we look at the log-likelihood in the context of information transfer. First, rearranging the terms of collective transfer entropy [equation (4)], we can rewrite the log-likelihood function [equation (18)], leading to the following result.

Again, the first two terms in equation (23) do not depend on the proposed graph structure, and thus maximizing log-likelihood is equivalent to maximizing collective transfer entropy. This becomes clear when we consider the log-likelihood ratio. This ratio quantifies the gain in likelihood by modeling the data D by a candidate network B instead of the empty network B∅, i.e.,

ℓ(Θ^G:D)−ℓ(Θ^G∅:D)∝log p(B|D)p(B∅|D).

Recall that the empty DAG G∅ is one with no parents for all vertices ∀i,ΠG(Vi)=⟨Ynij,(κij)⟩j=∅. Substituting this definition into equation (18) [or, alternatively equation (23)] gives the following result.

Proposition 2. The ratio of the log-likelihood [equation (18)] of a candidate DAG G to the empty networkG∅can be expressed as

ℓ(Θ^G:D)−ℓ(Θ^G∅:D)=N⋅∑i=1MT⟨Yij⟩j→Yi.

6. Discussion and Future Work

We have presented a principled method to score the structure of non-linear dynamical networks, where dynamical units are coupled via a DAG. We approached the problem by modeling the time evolution of a synchronous GDS as a DBN. We then derived the AIC and BIC scoring functions for the DBN based on time delay embedding theorems. Finally, we have shown that the log-likelihood of the synchronous GDS can be interpreted in the context of information transfer.

The representation of synchronous GDSs as DBNs allows for inference of coupling in dynamical networks and facilitates techniques for synthesis in these systems. DBNs are an expressive framework that allows representation of generic systems, as well as a numerous general purpose inference techniques that can be used for filtering, prediction, and smoothing (Friedman et al., 1998). Our representation therefore allows for probabilistic reasoning for purposes of planning and prediction in complex systems.

Theorem 2 captures an interesting parallel between learning from complete data and learning non-linear dynamical networks. If the embedding dimension κ and time delay τ are unity, then the information criterion becomes identical to learning a DBN from complete data (Friedman et al., 1998). Thus, our result could be considered a generalization of typical structure learning procedures.

The results presented here provoke new insights into the concepts of structure learning, non-linear time series analysis, and effective network analysis (Sporns et al., 2004; Park and Friston, 2013) based on information transfer (Honey et al., 2007; Lizier et al., 2011; Cliff et al., 2013, 2016). The information-theoretic interpretation of the log-likelihood has interesting consequences in the context of information dynamics and information thermodynamics of non-linear dynamical networks. The transfer entropy terms in Propositions 1 and 2 show that the optimal structure of a synchronous GDS is immediately related to the information processing of distributed computation (Lizier et al., 2008), as well as the thermodynamic costs of information transfer (Prokopenko and Lizier, 2014).

In the future, we aim to perform empirical studies to exemplify the properties of the presented scoring functions. Specifically, the empirical studies should yield insight into the effect of weak, moderate and strong coupling between dynamical units. An important concept to consider in stochastic systems is the convergence of the shadow (reconstructed) manifold to the true manifold (Sugihara et al., 2012); we have implicitly accounted for this phenomena by using CPDs in our model, however, it is important to investigate the property of convergence with different density estimation techniques. In addition, we are interested in the effect of synchrony in these networks and the relationship to previous results for dynamical systems coupled by spanning trees (Wu, 2005). We conjecture that approach used here will allow us to derive scoring functions without the assumption of multinomial observations, and thus afford the use of non-parametric density estimators. Parametric techniques, such as learning the parameters of dynamical systems (Ghahramani and Roweis, 1999; Hefny et al., 2015), could be considered in place of the posterior approximations.

Finally, the reconstruction theorems used in this paper typically make the assumption that the map (or flow) is a diffeomorphism (invertible in time). Thus, given any state, the past and future are uniquely determined and the time delay τ can be taken positive or negative. In certain cases, however, the time-reversed system is acausal, giving a map that is not time-invertible (an endomorphism). Ideally, we would aim to have methods to infer coupling for both endomorphisms and diffeomorphisms. Takens (2002) showed that if the map is an endomorphism, taking the delay vector of temporally previous observations forms an embedding. The generalized theorems in Stark (1999), Stark et al. (2003), and Deyle and Sugihara (2011), however, were established for diffeomorphisms, rather than endomorphisms; we can only conjecture that taking a delay of past observations (as we have done throughout this paper) follows for these results. Empirical studies using the measures presented in this paper would indicate whether it is an important line of inquiry to prove the generalized reconstruction theorems for endomorphisms.

Author Contributions

OC co-wrote the manuscript, derived and proved the theorems, lemmas, and propositions. MP co-wrote the manuscript, assisted with the proofs, and supervised. RF co-wrote the manuscript, assisted with the proofs, and supervised.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

We would like to thank Joseph Lizier, Jürgen Jost, and Wolfram Martens for many helpful discussions, particularly in regards to embedding theory. This work was supported in part by the Australian Centre for Field Robotics; the New South Wales Government; and the Faculty of Engineering & Information Technologies, The University of Sydney, under the Faculty Research Cluster Program.

Footnotes

^Stark (1999) conjectures that the theorem should generalise to functions h that are not independent of x. To the best of our knowledge, this result remains to be proven.

^The original proof (Deyle and Sugihara, 2011) uses positive lags; however, the authors note that the use of negative lags also applies [and should be used in the case of endomorphisms (Takens, 2002)].