Abstract

Two or more Bayesian network structures are Markov equivalent when the corresponding acyclic digraphs encode the same set of conditional independencies. Therefore, the search space of Bayesian network structures may be organized in equivalence classes, where each of them represents a different set of conditional independencies. The collection of sets of conditional independencies obeys a partial order, the so-called “inclusion order.” This paper discusses in depth the role that the inclusion order plays in learning the structure of Bayesian networks. In particular, this role involves the way a learning algorithm traverses the search space. We introduce a condition for traversal operators, the inclusion boundary condition, which, when it is satisfied, guarantees that the search strategy can avoid local maxima. This is proved under the assumptions that the data is sampled from a probability distribution which is faithful to an acyclic digraph, and the length of the sample is unbounded. The previous discussion leads to the design of a new traversal operator and two new learning algorithms in the context of heuristic search and the Markov Chain Monte Carlo method. We carry out a set of experiments with synthetic and real-world data that show empirically the benefit of striving for the inclusion order when learning Bayesian networks from data.
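Markov equivalence, as defined at the start of the abstract, can be tested without enumerating conditional independencies. A well-known graphical characterization states that two acyclic digraphs over the same vertices are Markov equivalent exactly when they share the same skeleton and the same v-structures (a -> c <- b with a and b non-adjacent). The sketch below illustrates that characterization only; it is not the paper's algorithm. The function names and the edge-list representation are assumptions made for this example, and both graphs are assumed to be acyclic and defined over the same vertex set.

```python
from itertools import combinations


def skeleton(edges):
    """Undirected adjacencies of a DAG given as (parent, child) pairs."""
    return {frozenset(edge) for edge in edges}


def v_structures(edges):
    """Triples (a, c, b) with a -> c <- b where a and b are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, pa in parents.items():
        # Sorting makes each unordered parent pair appear in one fixed order.
        for a, b in combinations(sorted(pa), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found


def markov_equivalent(edges1, edges2):
    """Same skeleton and same v-structures (hypothetical helper)."""
    return (skeleton(edges1) == skeleton(edges2)
            and v_structures(edges1) == v_structures(edges2))


chain = [("A", "B"), ("B", "C")]            # A -> B -> C
reversed_chain = [("C", "B"), ("B", "A")]   # A <- B <- C
collider = [("A", "B"), ("C", "B")]         # A -> B <- C

print(markov_equivalent(chain, reversed_chain))  # True
print(markov_equivalent(chain, collider))        # False
```

The two chains encode the same single independence (A and C given B), so they fall into one equivalence class, while the collider encodes a different one and belongs to another class.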

Citations

...s to converge to an asymptotic value smaller than 3.7. This was observed up to 10 vertices. 2. Frydenberg (1990) also proved it but under the additional condition of the fifth graphoid axiom CI5 (see =-=Pearl, 1988-=-). In Figure 1 we see the cardinalities of DAG-space and EG-space plotted, up to 10 vertices. From a non-causal perspective one is interested in learning equivalence classes of Bayesian networks f...

...ctorization in (1) allows us to obtain a closed formula for the marginal likelihood of the data D given a Bayesian network M ≡ D(G), p(D|M), under a certain set of assumptions about D (Buntine, 1991, =-=Cooper and Herskovits, 1992-=-, Heckerman et al., 1995). The logarithm of the marginal likelihood and the prior of the model, log[p(D|M)p(M)], is often used as a scoring metric for Bayesian networks. Throughout this paper we have ...

...to obtain a closed formula for the marginal likelihood of the data D given a Bayesian network M ≡ D(G), p(D|M), under a certain set of assumptions about D (Buntine, 1991, Cooper and Herskovits, 1992, =-=Heckerman et al., 1995-=-). The logarithm of the marginal likelihood and the prior of the model, log[p(D|M)p(M)], is often used as a scoring metric for Bayesian networks. Throughout this paper we have used the BDeu scoring me...

...ns. Each equivalence class has a canonical representation in the form of an acyclic partially directed graph where the edges may be directed and undirected and satisfy some characterizing conditions (=-=Spirtes et al., 1993-=-, Chickering, 1995, Andersson et al., 1997a). This representation has been introduced independently by several authors under different names: pattern (Spirtes et al., 1993), completed PDAG (Chickering...

...the ratio of the cardinalities of the neighborhoods |N(G)|/|N(G′)| which is known as the candidate-generating ratio. In our experimentation we have assumed a symmetric candidate-generating density (=-=Chib and Greenberg, 1995-=-), where |N(G)| = |N(G′)|. This is reasonable in our context since G and G′ will differ in a single adjacency. The eMC³ algorithm of Figure 7 needs the specification of some Bayesian network as a ...
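The role of the candidate-generating ratio in the excerpt above can be made concrete with a generic Metropolis-Hastings acceptance step. The sketch below assumes that a candidate is drawn uniformly from the neighborhood of the current structure, so the proposal probabilities are 1/|N(G)| and 1/|N(G')|. The function name and its arguments are hypothetical, and the log scores stand for log[p(D|M)p(M)]. It is not the algorithm of the paper's Figure 7.

```python
import math


def acceptance_probability(log_score_current, log_score_candidate,
                           n_neighbors_current, n_neighbors_candidate):
    """Metropolis-Hastings acceptance for a move G -> G' (illustrative).

    The log scores are log[p(D|M)p(M)] for the current and candidate
    structures; the neighbor counts are |N(G)| and |N(G')|.
    """
    # Score ratio times the candidate-generating ratio |N(G)|/|N(G')|,
    # computed in log space to avoid underflow.
    log_ratio = (log_score_candidate - log_score_current
                 + math.log(n_neighbors_current)
                 - math.log(n_neighbors_candidate))
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


# With a symmetric candidate-generating density the neighbor counts cancel,
# and only the score difference matters.
print(acceptance_probability(-10.0, -9.0, 5, 5))
print(acceptance_probability(-9.0, -10.0, 5, 5))
```

When |N(G)| = |N(G')|, as the excerpt assumes, the acceptance probability depends only on the score difference, so neither neighborhood has to be counted.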

... the results for heuristic learning, and in Subsection 5.3 we show the results for MCMC learning. 5.1 Synthetic and Real-World data We have used two kinds of synthetic data. One is the Alarm dataset (=-=Beinlich et al., 1989-=-), which has become a standard benchmark dataset for the assessment of learning algorithms for Bayesian networks on discrete data. The Alarm dataset was sampled from the Bayesian network in Figure 8, ...

... is known (Cooper and Herskovits, 1992), or they search for a good causal ordering that may help in providing later a better result (Bouckaert, 1992, Singh and Valtorta, 1993, Larrañaga et al., 1996, =-=Friedman and Koller, 2000-=-). However, the causal ordering reduces the already small part of the inclusion boundary that was reachable from the NR and AR neighborhoods. Therefore, errors in the ordering may easily lead to very ...

...he recursive factorization in (1) allows us to obtain a closed formula for the marginal likelihood of the data D given a Bayesian network M ≡ D(G), p(D|M), under a certain set of assumptions about D (=-=Buntine, 1991-=-, Cooper and Herskovits, 1992, Heckerman et al., 1995). The logarithm of the marginal likelihood and the prior of the model, log[p(D|M)p(M)], is often used as a scoring metric for Bayesian networks. T...

... may learn different models through different runs, and this permits trading time for multiple local maxima. 4.3 The Markov Chain Monte Carlo Method The need to account for the uncertainty of models (=-=Draper, 1995-=-) has led to the development of computational methods that implement the full Bayesian approach to modeling. Recall the Bayes’ theorem: p(M|D) = p(D|M)p(M) / p(D), (3) where p(D) is known as the normal...

...alized subgraph induced by the smallest ancestral set of A ∪ B ∪ S. The DGMP is the sharpest possible graphical criterion that permits reading CI restrictions from a given DAG (Pearl and Verma, 1987, =-=Lauritzen et al., 1990-=-). An alternative way of reading conditional independencies in a DAG is using the d-separation criterion of Pearl and Verma (1987), which we review now. A vertex vi in a path v0, v1, ..., vn, n > 1, i...

...and AR neighborhoods. Therefore, errors in the ordering may easily lead to very bad local maxima, as shown by Chickering et al. (1995). Heuristic algorithms that use EG-space (Spirtes and Meek, 1995, =-=Chickering, 1996-=-, 2002a,b) do not assume that any form of causal ordering is known probably because, in general, they can work better with complex domains. We introduce here a new heuristic algorithm which works in D...

...al representation in the form of an acyclic partially directed graph where the edges may be directed and undirected and satisfy some characterizing conditions (Spirtes et al., 1993, Chickering, 1995, =-=Andersson et al., 1997-=-a). This representation has been introduced independently by several authors under different names: pattern (Spirtes et al., 1993), completed PDAG (Chickering, 1995) and essential graph (Andersson et ...

...ssume that a causal ordering between the variables is known (Cooper and Herskovits, 1992), or they search for a good causal ordering that may help in providing later a better result (Bouckaert, 1992, =-=Singh and Valtorta, 1993-=-, Larrañaga et al., 1996, Friedman and Koller, 2000). However, the causal ordering reduces the already small part of the inclusion boundary that was reachable from the NR and AR neighborhoods. Therefo...

...t, the inclusion boundary condition has been implicitly taken into consideration by most of the learning algorithms for undirected and decomposable models (Havránek, 1984, Edwards and Havránek, 1985, =-=Giudici and Green, 1999-=-) and surprisingly ignored by most authors in the context of Bayesian networks. 4. Inclusion-driven structure learning In this section we describe an efficient implementation of the ENR and ENCR neigh...

...ng between the variables is known (Cooper and Herskovits, 1992), or they search for a good causal ordering that may help in providing later a better result (Bouckaert, 1992, Singh and Valtorta, 1993, =-=Larrañaga et al., 1996-=-, Friedman and Koller, 2000). However, the causal ordering reduces the already small part of the inclusion boundary that was reachable from the NR and AR neighborhoods. Therefore, errors in the orderi...

...s reachable from the NR and AR neighborhoods. Therefore, errors in the ordering may easily lead to very bad local maxima, as shown by Chickering et al. (1995). Heuristic algorithms that use EG-space (=-=Spirtes and Meek, 1995-=-, Chickering, 1996, 2002a,b) do not assume that any form of causal ordering is known probably because, in general, they can work better with complex domains. We introduce here a new heuristic algorith...

...en class of GMMs. In fact, the inclusion boundary condition has been implicitly taken into consideration by most of the learning algorithms for undirected and decomposable models (Havránek, 1984, =-=Edwards and Havránek, 1985-=-, Giudici and Green, 1999) and surprisingly ignored by most authors in the context of Bayesian networks. 4. Inclusion-driven structure learning In this section we describe an efficient implementation ...

...th, that ends in the true Bayesian network, the score will increase as it is shown in Theorem 3.3. This result is equivalent to Lemmas 8 and 9 from Chickering (2002b) where the optimality of the GES (=-=Meek, 1997-=-) algorithm for structure learning of Bayesian networks is proved. However, the inclusion boundary condition provides us with a general policy for the design of effective traversal operators for any g...

...one uses a score equivalent scoring metric. In such situation it makes sense to use EG-space instead of DAG-space. This argument has been further supported by several authors (Heckerman et al., 1995, =-=Madigan et al., 1996-=-) who cite the following advantages: 1. The cardinality of EG-space is smaller than in DAG-space. 2. The scoring metric is no longer constrained to give equal scores to Markov equivalent Bayesian netw...

...ates A and B in the moralized subgraph induced by the smallest ancestral set of A ∪ B ∪ S. The DGMP is the sharpest possible graphical criterion that permits reading CI restrictions from a given DAG (=-=Pearl and Verma, 1987-=-, Lauritzen et al., 1990). An alternative way of reading conditional independencies in a DAG is using the d-separation criterion of Pearl and Verma (1987), which we review now. A vertex vi in a path v0...

...tors for any given class of GMMs. In fact, the inclusion boundary condition has been implicitly taken into consideration by most of the learning algorithms for undirected and decomposable models (=-=Havránek, 1984-=-, Edwards and Havránek, 1985, Giudici and Green, 1999) and surprisingly ignored by most authors in the context of Bayesian networks. 4. Inclusion-driven structure learning In this section we describe ...

...some algorithms assume that a causal ordering between the variables is known (Cooper and Herskovits, 1992), or they search for a good causal ordering that may help in providing later a better result (=-=Bouckaert, 1992-=-, Singh and Valtorta, 1993, Larrañaga et al., 1996, Friedman and Koller, 2000). However, the causal ordering reduces the already small part of the inclusion boundary that was reachable from the NR and...

...lies to every type of GMM. As we shall see throughout the paper, this concept is the key to understanding the relevance of the inclusion order in the learning task. Definition 3.1 Inclusion boundary (=-=Kočka, 2001-=-) Let M(H), M(L) be two GMMs determined by the graphs H and L. Let M_I(H) ≺ M_I(L) denote that M_I(H) ⊂ M_I(L) and for no graph K, M_I(H) ⊂ M_I(K) ⊂ M_I(L). The inclusion boundary of the GMM M(G...