
UNCERTAINTY IN ARTIFICIAL INTELLIGENCE PROCEEDINGS 2000

Variational Approximations between Mean Field Theory and the Junction Tree Algorithm

Wim Wiegerinck
RWCP, Theoretical Foundation SNN, University of Nijmegen, Geert Grooteplein 21, 6525 EZ, Nijmegen, The Netherlands

Abstract

Recently, variational approximations such as the mean field approximation have received much interest. We extend the standard mean field method by using an approximating distribution that factorises into cluster potentials. This includes undirected graphs, directed acyclic graphs and junction trees. We derive generalised mean field equations to optimise the cluster potentials. We show that the method bridges the gap between the standard mean field approximation and the exact junction tree algorithm. In addition, we address the problem of how to choose the structure and the free parameters of the approximating distribution. From the generalised mean field equations we derive rules to simplify the approximation in advance without affecting the potential accuracy of the model class. We also show how the method fits into some other variational approximations that are currently popular.

1 INTRODUCTION

Graphical models, such as Bayesian networks, Markov fields, and Boltzmann machines, provide a rich framework for probabilistic modelling and reasoning (Pearl, 1988; Lauritzen and Spiegelhalter, 1988; Jensen, 1996; Castillo et al., 1997; Hertz et al., 1991). Their graphical structure provides an intuitively appealing modularity and is well suited to the incorporation of prior knowledge. The invention of algorithms for exact inference during the last decades has led to a rapid increase in the popularity of graphical models in modern AI. However, exact inference is NP-hard (Cooper, 1990). This means that large, densely connected networks are intractable for exact computation, and approximations are necessary.
In this context, variational methods are gaining increasing interest (Saul et al., 1996; Jaakkola and Jordan, 1999; Jordan et al., 1999; Murphy, 1999). An advantage of these methods is that they provide bounds on the approximation error and that they fit well into a generalised EM framework for learning (Saul et al., 1996; Neal and Hinton, 1998; Jordan et al., 1999). This is in contrast to stochastic sampling methods (Castillo et al., 1997; Jordan, 1998), which may yield unreliable results due to finite sampling times. Until now, however, variational approximations have been less widely applied than Monte Carlo methods, arguably because their use is not so straightforward.

One of the simplest and most prominent variational approximations is the so-called mean field approximation, which has its origin in statistical physics (Parisi, 1988). In the mean field approximation, the intractable distribution P is approximated by a completely factorised distribution Q by minimisation of the Kullback-Leibler (KL) divergence between P and Q. Optimisation of Q leads to the so-called mean field equations, which can be solved efficiently by iteration. A drawback of the standard mean field approximation is its limited accuracy due to the restricted distribution class. For this reason, extensions of the mean field approximation have been devised by allowing the approximating distributions Q to have a richer, but still tractable, structure (Saul and Jordan, 1996; Jaakkola and Jordan, 1998; Ghahramani and Jordan, 1997; Wiegerinck and Barber, 1998; Barber and Wiegerinck, 1999; Haft et al., 1999; Wiegerinck and Kappen, 2000).

In this paper, we further develop this direction. In section 2 we present a general variational framework for approximate inference in an (intractable) target distribution using a (tractable) approximating distribution that factorises into overlapping cluster potentials.
Generalised mean field equations are derived which are used in an iterative algorithm to optimise the cluster potentials of the approximating distribution. This procedure is guaranteed to lead to a local minimum of the KL-divergence. In section 3 we show the link between this procedure and standard exact inference methods. In section 4 we give conditions under which the complexity of the approximating model class can be reduced in advance without affecting its potential accuracy. In sections 5 and 6 we consider approximating directed graphs and we construct approximating junction trees. In section 7, we consider the approximation of target distributions for which the standard approach of KL minimisation is intractable.

2 VARIATIONAL FRAMEWORK

2.1 TARGET DISTRIBUTIONS

Our starting point is a probability distribution P(x) on a set of discrete variables $x = (x_1, \ldots, x_n)$ in a finite domain, $x_i \in \{1, \ldots, n_i\}$. Our goal is to find its marginals $P(x_i)$ on single variables or on small subsets of variables $P(x_i, \ldots, x_k)$. We assume that P can be written in the following factorisation

$$P(x) = \frac{1}{Z_P} \prod_\alpha \Psi_\alpha(d_\alpha) = \exp\Big[\sum_\alpha \psi_\alpha(d_\alpha) - \mathcal{Z}_P\Big], \tag{1}$$

in which $\Psi_\alpha$ are potential functions that depend on a small number of variables, denoted by the clusters $d_\alpha$. $Z_P$ is a normalisation factor that might be unknown. Note that the potential representation is not unique. When it is convenient, we will use the logarithmic form of the potentials, $\psi_\alpha = \log \Psi_\alpha$, $\mathcal{Z}_P = \log Z_P$.

An example is a Boltzmann machine with binary units (Hertz et al., 1991),

$$P(x) = \frac{1}{Z_P} \exp\Big(\sum_{i<j} w_{ij} x_i x_j + \sum_k h_k x_k\Big), \tag{2}$$

that fits in our form (1) with $d_{ij} = (x_i, x_j)$, $i < j$, $d_k = x_k$ and potentials $\psi_{ij}(x_i, x_j) = w_{ij} x_i x_j$, $\psi_k(x_k) = h_k x_k$. Another example of a distribution that fits in our framework is a Bayesian network given evidence e, which can be expressed in terms of the potentials $\Psi_j(d_j) = P(x_j \mid \pi_j)$, with $d_j = (x_j, \pi_j)$ and the normalisation $Z_P = P(e)$.
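The Boltzmann machine example (2) can be made concrete with a small brute-force sketch. It is not part of the original paper: the function names are hypothetical, the units are assumed to take values ±1, and the marginals are obtained by enumerating all joint states, which is only feasible for a handful of variables and is exactly the cost that the approximations in this paper aim to avoid.

```python
import itertools
import numpy as np

def boltzmann_log_potentials(w, h):
    """Log-potentials psi_ij(x) = w_ij x_i x_j (i < j) and psi_k(x) = h_k x_k, keyed by cluster."""
    n = len(h)
    pots = {}
    for i in range(n):
        for j in range(i + 1, n):
            pots[(i, j)] = lambda x, i=i, j=j: w[i, j] * x[i] * x[j]
    for k in range(n):
        pots[(k,)] = lambda x, k=k: h[k] * x[k]
    return pots

def exact_marginals(pots, n, values=(-1, 1)):
    """Single-node marginals and log Z_P by enumerating all len(values)**n states."""
    states = list(itertools.product(values, repeat=n))
    log_p = np.array([sum(psi(x) for psi in pots.values()) for x in states])
    log_z = np.log(np.sum(np.exp(log_p - log_p.max()))) + log_p.max()
    p = np.exp(log_p - log_z)
    marg = np.zeros((n, len(values)))
    for prob, x in zip(p, states):
        for i, xi in enumerate(x):
            marg[i, values.index(xi)] += prob
    return marg, log_z

# Illustrative parameters (assumed): without couplings the model factorises.
w = np.zeros((3, 3))
h = np.array([0.5, -1.0, 0.0])
marg, log_z = exact_marginals(boltzmann_log_potentials(w, h), 3)
```

With all weights set to zero the units are independent, so each marginal reduces to a logistic function of the bias, which gives a simple sanity check on the enumeration.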
The Bayesian network example shows that our inference problem includes the problem of computing conditionals given evidence, since conditioning can be included by absorbing the evidence into the model definition via $P_e(x) = P(x, e)/P(e)$.

The complexity of computing marginals in P depends on the underlying graphical structure of the model, and is exponential in the maximal clique size of the triangulated moralised graph (Lauritzen and Spiegelhalter, 1988; Jensen, 1996; Castillo et al., 1997). This may lead to intractable models, even if the clusters $d_\alpha$ are small. An example is a fully connected Boltzmann machine: the clusters contain at most two variables, while the model has one clique that contains all the variables in the model.

2.2 APPROXIMATING DISTRIBUTIONS

In the variational method the intractable probability distribution P(x) is approximated by a tractable distribution Q(x). This distribution can be used to compute probabilities of interest. In the standard (mean field) approach, Q is a completely factorised distribution, $Q(x) = \prod_i Q(x_i)$. We take the more general approach in which Q is a tractable distribution that factorises according to a given structure. By tractable we mean that marginals over small subsets of variables are computationally feasible. To construct Q we first define its structure,

$$Q(x) = \frac{1}{Z_Q} \prod_\gamma \Phi_\gamma(c_\gamma) = \exp\Big[\sum_\gamma \varphi_\gamma(c_\gamma) - \mathcal{Z}_Q\Big], \tag{3}$$

in which $c_\gamma$ are predefined clusters whose union contains all variables, $\Phi_\gamma(c_\gamma)$ are nonnegative potentials of the variables in the clusters, $\varphi_\gamma = \log \Phi_\gamma$ and $\mathcal{Z}_Q = \log Z_Q$. The only restriction on the potentials is the global normalisation

$$\sum_{\{x\}} \prod_\gamma \Phi_\gamma(c_\gamma) = Z_Q. \tag{4}$$

2.3 VARIATIONAL OPTIMISATION

The approximation Q is optimised such that the Kullback-Leibler (KL) divergence between Q and P,

$$D(Q, P) = \sum_{\{x\}} Q(x) \log \frac{Q(x)}{P(x)} = \Big\langle \log \frac{Q(x)}{P(x)} \Big\rangle,$$

is minimised. In this paper, $\langle \ldots \rangle$ denotes the average with respect to Q. The KL-divergence is related to the difference of the probabilities of Q and P,

$$\max_A |P(A) - Q(A)| \le \sqrt{D(Q, P)},$$

for any event A in the sample space (see (Whittaker, 1990)). In the logarithmic potential representations of P and Q, the KL-divergence is

$$D(Q, P) = \Big\langle \sum_\gamma \varphi_\gamma(c_\gamma) - \sum_\alpha \psi_\alpha(d_\alpha) \Big\rangle + \text{constant},$$

which shows that D(Q, P) is tractable (up to a constant) when Q is tractable and the clusters in P and Q are small.

To optimise Q under the normalisation constraint (4), we do a constrained optimisation of the KL-divergence with respect to $\varphi_\gamma$ using Lagrange multipliers. In this optimisation, the other potentials $\varphi_\beta$, $\beta \neq \gamma$, remain fixed. This leads to the solution $\varphi^*_\gamma(c_\gamma)$, given by the generalised mean field equations

$$\varphi^*_\gamma(c_\gamma) = \Big\langle \sum_{\alpha \in D_\gamma} \psi_\alpha(d_\alpha) - \sum_{\beta \in C_\gamma} \varphi_\beta(c_\beta) \Big\rangle_{c_\gamma} - z. \tag{5}$$

The average $\langle \ldots \rangle_{c_\gamma}$ is taken with respect to the conditional distribution $Q(x \mid c_\gamma)$. In (5), $D_\gamma$ (resp. $C_\gamma$) are the sets of clusters $\alpha$ in P (resp. $\beta \neq \gamma$ in Q) that depend on $c_\gamma$. In other words, $\alpha \notin D_\gamma$ implies $Q(d_\alpha \mid c_\gamma) = Q(d_\alpha)$, etc. Finally, z is a constant that can be inferred from the normalisation (4), i.e.

$$z = \log \sum_{\{x\}} \exp\Big[\sum_{\beta \neq \gamma} \varphi_\beta(c_\beta) + \Big\langle \sum_{\alpha \in D_\gamma} \psi_\alpha(d_\alpha) - \sum_{\beta \in C_\gamma} \varphi_\beta(c_\beta) \Big\rangle_{c_\gamma}\Big] - \mathcal{Z}_Q. \tag{6}$$

Since $Q(x \mid c_\gamma)$ is independent of the potential $\varphi_\gamma$, z in (6) is independent of $\varphi_\gamma$. Consequently, the right hand side of (5) is independent of $\varphi_\gamma$ as well. So (5) provides a unique solution $\varphi^*_\gamma$ to the optimisation of the potential of cluster $\gamma$. This solution corresponds to the global minimum of D(Q, P) given that the potentials of the other clusters $\beta \neq \gamma$ are fixed. This means that in a sequence where at each step a different potential is selected and updated, the KL-divergence does not increase at any step. Since $D(Q, P) \ge 0$, we conclude that this iteration over all clusters of variational potentials leads to a local minimum of D(Q, P).

In the mean field equations (5), the constant z plays only a minor role and can be set to zero if desired. This can be achieved by simultaneously shifting $\mathcal{Z}_Q$ and $\varphi_\gamma$ by the same amount before we optimise $\varphi_\gamma$. (This shift does not affect Q.)

The generalised mean field equations (5) straightforwardly generalise the standard mean field equations for fully factorised approximations (see e.g. (Haft et al., 1999)).
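The iteration of the generalised mean field equations (5) can be illustrated with a minimal sketch for a very small model. It is an illustration under stated assumptions rather than the paper's implementation: all function names are hypothetical, variables are binary with states 0/1, the target is a randomly drawn Boltzmann machine, and every average is computed by brute-force enumeration of the joint states. For simplicity the update averages the full log P minus all other potentials of Q; the terms outside $D_\gamma$ and $C_\gamma$ only contribute a constant, which is absorbed by the normalisation.

```python
import itertools
import numpy as np

def cluster_values(phi, states, cluster):
    """Evaluate a tabulated log-potential phi on every joint state."""
    return phi[tuple(states[:, cluster].T)]

def q_joint(states, clusters, phis):
    """Normalised Q(x) proportional to exp(sum_gamma phi_gamma(c_gamma))."""
    log_q = sum(cluster_values(phi, states, c) for phi, c in zip(phis, clusters))
    q = np.exp(log_q - log_q.max())
    return q / q.sum()

def kl(q, log_p):
    """D(Q, P) for an unnormalised log P given on all joint states."""
    p = np.exp(log_p - log_p.max())
    p /= p.sum()
    return float(np.sum(q * (np.log(q) - np.log(p))))

def update(states, log_p, clusters, phis, g):
    """Replace phis[g] by <log P - sum_{b != g} phi_b> under Q(x | c_g)."""
    q = q_joint(states, clusters, phis)
    others = sum((cluster_values(phis[b], states, clusters[b])
                  for b in range(len(clusters)) if b != g),
                 np.zeros(len(states)))
    f = log_p - others
    new_phi = np.zeros_like(phis[g])
    for idx in np.ndindex(*new_phi.shape):
        mask = np.all(states[:, clusters[g]] == idx, axis=1)
        new_phi[idx] = np.dot(q[mask], f[mask]) / q[mask].sum()
    phis[g] = new_phi

# Assumed toy target: a fully connected Boltzmann machine on four 0/1 units.
rng = np.random.default_rng(0)
n = 4
states = np.array(list(itertools.product([0, 1], repeat=n)))
w = np.triu(rng.normal(size=(n, n)), k=1)
h = rng.normal(size=n)
log_p = np.einsum("si,ij,sj->s", states, w, states) + states @ h

def run(clusters, sweeps=20):
    """Cycle through the clusters and record D(Q, P) after every update."""
    phis = [np.zeros((2,) * len(c)) for c in clusters]
    trace = [kl(q_joint(states, clusters, phis), log_p)]
    for _ in range(sweeps):
        for g in range(len(clusters)):
            update(states, log_p, clusters, phis, g)
            trace.append(kl(q_joint(states, clusters, phis), log_p))
    return trace

kl_factorised = run([[0], [1], [2], [3]])      # standard mean field
kl_chain = run([[0, 1], [1, 2], [2, 3]])       # overlapping pair clusters
```

Both traces should be non-increasing, in line with the argument that each update is the exact minimiser for one potential with the others held fixed. The enumeration scales exponentially in the number of variables, so this sketch is only meant to make the update rule tangible.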
The main difference from the standard mean field equations is that the contribution of the other potentials $\varphi_\beta$, $\beta \in C_\gamma$, vanishes in the fully factorised approximation. In figure 1, a simple example is given.

Figure 1: Chest clinic model (ASIA), from (Lauritzen and Spiegelhalter, 1988). (a): Exact (target) distribution P with marginal probabilities. (b-c): Approximating distributions with approximated marginal probabilities. In (b) Q is fully factorised (KL = 0.43). In (c), Q is a tree (KL = 0.03). KL is the KL-divergence D(Q, P) between the approximating distribution Q and the target distribution P.

3 GLOBAL CONVERGENCE

In this section, we link the mean field approximation with exact computation by showing global convergence for approximations Q that satisfy the following two conditions: (1) Each cluster $d_\alpha$ of P is contained in at least one of the clusters $c_\gamma$ of Q. (2) Q satisfies the so-called running intersection property.

For a definition of the running intersection property, we follow (Castillo et al., 1997): Q satisfies the running intersection property if there is an ordering of the clusters of Q, $(c_1, \ldots, c_m)$, such that $s_\gamma = c_\gamma \cap (c_1 \cup \ldots \cup c_{\gamma-1})$ is contained in at least one of the clusters $(c_1, \ldots, c_{\gamma-1})$. If a cluster $c_\gamma$ intersects with the separator $s_\delta$ of a successor $c_\delta$, there are three possibilities: $s_\delta$ is contained in another successor $c_\eta$ ($\delta > \eta > \gamma$), or $s_\delta$ is contained in $c_\gamma$ itself, or $s_\delta$ intersects only with the separator $s_\gamma$ (since $s_\delta$ is contained in a predecessor of $c_\gamma$, which is separated by $s_\gamma$). We denote $\Delta_\gamma = \{s_\delta \subset c_\gamma \mid s_\delta \not\subset c_\eta,\ \delta > \eta > \gamma\}$. So each separator is contained in exactly one $\Delta_\gamma$. Finally, we define $A_\gamma = \{d_\alpha \subset c_\gamma \mid d_\alpha \not\subset c_\eta,\ \eta > \gamma\}$. Each cluster of P is contained in exactly one $A_\gamma$.

With these preliminaries, we consider the mean field equations (5) applied to the potentials of Q. We consider a decreasing sequence of updates. At first, the last potential $\varphi_m$ is updated. This results in

$$\varphi^*_m(c_m) = \sum_{\alpha \in A_m} \psi_\alpha(d_\alpha) + \mu_m(s_m),$$

where $\mu_m(s_m)$ is a function that depends only on the value of the separator $s_m$. If $s_m$ is empty, $\mu_m(s_m)$ is a constant. When, in this sequence, potential $\varphi_\gamma$ has its turn, the result is

$$\varphi^*_\gamma(c_\gamma) = \sum_{\alpha \in A_\gamma} \psi_\alpha(d_\alpha) - \sum_{\delta \in \Delta_\gamma} \mu_\delta(s_\delta) + \mu_\gamma(s_\gamma), \tag{7}$$

where, again, $\mu_\gamma(s_\gamma)$ is a function that depends only on the value of the separator $s_\gamma$. Finally, after all potentials have been updated, we add up all potentials and obtain

$$\sum_\gamma \varphi^*_\gamma(c_\gamma) = \sum_\alpha \psi_\alpha(d_\alpha) + \text{constant},$$

which shows that Q has converged to P in one sweep of updates. If the sequence of updates is in random order, the result shows convergence in finite time. Note that if condition 2 (the running intersection property) is not satisfied, the mean field procedure does not need to converge to the global optimum, even if the model class of Q is rich enough to model P exactly (condition 1).

Standard exact inference methods (Lauritzen and Spiegelhalter, 1988; Jensen, 1996; Castillo et al., 1997), after constructing cluster-sets that satisfy the two conditions stated above, are very similar to (7). The difference is that standard exact methods just keep the separator functions $\mu_\gamma(s_\gamma)$ equal to zero (which is of course much more efficient). The advantage of the generalised mean field approximation is that it generalises to Q's that do not meet the required conditions for exact computation.

4 EXPLOITING SUBSTRUCTURES

An obviously important question is how to choose the structure of Q to get the best compromise between approximation error and complexity. Another question is whether our approach, in which all the potentials of Q are fully adaptive, is the best way to go. An alternative approach, originally proposed in (Saul and Jordan, 1996), is to copy the target distribution P and remove the potentials that make P intractable. The removed potentials are compensated by introducing additional variational parameters in the remaining potentials. In the context of our paper, this can be expressed as: pick a subset A of clusters $\alpha$ of the distribution P, copy the potentials of A and parametrise the approximation Q as

$$Q(x) = \exp\Big[\sum_\gamma \varphi_\gamma(c_\gamma) + \sum_{\alpha \in A} \psi_\alpha(d_\alpha) - \mathcal{Z}_Q\Big]. \tag{8}$$

The (variational) potentials $\varphi_\gamma$ are to be optimised. The potentials $\psi_\alpha$, $\alpha \in A$, are copies of the potentials in the target distribution, and are fixed. The clusters $c_\gamma$ and $d_\alpha$, $\alpha \in A$, define the cluster-set of Q, and they contain all variables.

The approximation (8) is of the same general form as (3). The difference is that in (8) some potentials are set in advance to specific values, and do not need to be optimised any more. This obviously has big computational advantages. A disadvantage is that the copied potentials might be suboptimal, and that by fixing these potentials the method might be weaker than one in which they are adaptive.

From the mean field equations (5), one can infer conditions under which the optimisation effectively uses copied potentials of P, and simplifies free parameters of Q, and thus effectively restricts the model class. This is stated in the following.

Lemma. Let Q be parametrised as in (8) and let $c_\kappa$ be one of the clusters of Q. If $c_\kappa$ can be written as a union

$$c_\kappa = \bigcup_{\alpha \in A_\kappa} d_\alpha \cup \bigcup_u c_{\kappa u},$$

with $u = 1, \ldots, u_{\max}$ (n.b., the $c_{\kappa u}$'s are not in the cluster-set of Q), such that for all of the remaining clusters t in P and Q, i.e., $t \in \{d_\alpha, c_\gamma \mid \alpha \notin A \cup A_\kappa,\ \gamma \neq \kappa\}$, the independency

$$Q(t \mid c_\kappa) = Q(t \mid c_{\kappa u})$$

holds for at least one $u \in \{1, \ldots, u_{\max}\}$, regardless of the values of the potentials, then the optimised approximating distribution Q (8) takes the form

$$Q(x) = \exp\Big[\sum_u \varphi_{\kappa u}(c_{\kappa u}) + \sum_{\gamma \neq \kappa} \varphi_\gamma(c_\gamma) + \sum_{\alpha \in A \cup A_\kappa} \psi_\alpha(d_\alpha) - \mathcal{Z}_Q\Big].$$

This is straightforwardly verified by applying the mean field optimisation to $\varphi_\kappa$ in Q.

From this lemma, considerable simplifications can be deduced. Consider, for example, a fully connected Boltzmann machine P (cf. (2)) approximated by Q.
If Q consists of potentials of non-overlapping clusters $c_\gamma$, it can be inferred that the optimised Q will consist of the fixed copies of the weights of P that are within the clusters $c_\gamma$ of Q, adaptive biases for the nodes that are connected with weights in P which are not copied into Q, and fixed copies of biases for the remaining nodes.

Figure 2: Example of redundant structure. (a): Graph of exact distribution P(A)P(B|A)P(C|A). (b): Optimisation of an approximating distribution with structure Q(A)Q(B, C) leads to a distribution with simpler structure Q(A)Q(B)Q(C). The variables B and C become independent in Q, although they are marginally dependent in P (via A).

Note that optimal weights in an approximation of a Boltzmann machine are not always just copies from the target distribution. An illustrative counterexample is the target distribution $P(x_1, x_2, x_3) \propto \exp(w_{12} x_1 x_2 + w_{13} x_1 x_3 + w_{23} x_2 x_3)$ with $w_{12} = \infty$, so $x_1$ and $x_2$ are hard coupled ($x_i = \pm 1$). The optimal approximation of the form $Q(x_1, x_2, x_3) \propto \Phi(x_1, x_2)\Phi(x_2, x_3)$ is given by $\Phi(x_1, x_2) \propto \exp(w_{12} x_1 x_2)$ and $\Phi(x_2, x_3) \propto \exp([w_{13} + w_{23}] x_2 x_3)$. The approximation in which the weight between $x_2$ and $x_3$ in Q is copied from P (i.e. $w_{23}$ instead of $w_{13} + w_{23}$) is suboptimal.

The convergence times of approximate models with and without copied potentials may differ, even if their potential accuracies are the same. As an example, consider the target $P(x_1, x_2) \propto \exp(w_{12} x_1 x_2)$. The approximation $Q(x_1, x_2) \propto \Phi(x_1, x_2)$ converges in one step. On the other hand, in $Q(x_1, x_2) \propto \exp(w_{12} x_1 x_2 + \varphi_1(x_1) + \varphi_2(x_2))$, the potentials $\varphi_i$ decay only exponentially.

The lemma generalises the result on graph partitioning in Boltzmann machines as presented in (Barber and Wiegerinck, 1999). It shows and clarifies in which cases the copied potentials of tractable substructures as originally proposed in (Saul and Jordan, 1996) are optimal. A nice example in which the copied potentials are optimal is the application to Factorial Hidden Markov Models in (Ghahramani and Jordan, 1997; Jordan et al., 1999).
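The redundancy shown in Figure 2 can be checked numerically with a short sketch. It is an illustration only: the conditional probability tables are drawn at random, all variables are assumed binary, and the variable names are hypothetical. Because the clusters of Q(A)Q(B, C) do not overlap, each mean field update reduces to exponentiating the averaged log-potentials of P and normalising.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_cpt(shape):
    """Random positive table, normalised over its first axis."""
    t = rng.random(shape) + 0.1
    return t / t.sum(axis=0, keepdims=True)

def normalise(log_t):
    t = np.exp(log_t - log_t.max())
    return t / t.sum()

# Target P(A) P(B|A) P(C|A); tables are indexed as p_b_a[b, a] = P(B=b | A=a).
p_a = random_cpt((2,))
p_b_a = random_cpt((2, 2))
p_c_a = random_cpt((2, 2))

# Approximation Q(A) Q(B, C), started from an arbitrary dependent Q(B, C).
q_a = np.full(2, 0.5)
q_bc = random_cpt((4,)).reshape(2, 2)

for _ in range(50):
    # Q(B, C) proportional to exp < log P(b|A) + log P(c|A) >_{Q(A)}
    log_q_bc = (np.log(p_b_a) @ q_a)[:, None] + (np.log(p_c_a) @ q_a)[None, :]
    q_bc = normalise(log_q_bc)
    # Q(A) proportional to P(a) exp < log P(B|a) + log P(C|a) >_{Q(B, C)}
    log_q_a = (np.log(p_a)
               + q_bc.sum(axis=1) @ np.log(p_b_a)
               + q_bc.sum(axis=0) @ np.log(p_c_a))
    q_a = normalise(log_q_a)

# Product of the marginals of Q(B, C), for comparison with q_bc itself.
q_b_times_q_c = np.outer(q_bc.sum(axis=1), q_bc.sum(axis=0))
```

The updated log Q(B, C) is a sum of a term in b and a term in c, so `q_bc` coincides with `q_b_times_q_c` after the first update: the pair cluster adds no accuracy over Q(A)Q(B)Q(C) for this target.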
The lemma provides a basis for the intuition that adding structure to Q that is not present in P might be redundant. (The lemma is still valid if $A_\kappa$ is empty.) In fig. 2 a simple example is given. A similar result for approximations using directed graphs (cf. section 5) is obtained in (Wiegerinck and Kappen, 2000). Finally, we note that the lemma only provides sufficient conditions for simplification.

5 DIRECTED APPROXIMATIONS

A slightly different class of approximating distributions are the 'directed' factorisations. These have been considered previously in (Wiegerinck and Barber, 1998; Barber and Wiegerinck, 1999; Wiegerinck and Kappen, 2000), but they fit well in the more general framework of this paper. Directed factorisations can be written in the same form (3), but the clusters need to have an ordering $c_1, c_2, c_3, \ldots$. We define separator sets $s_\gamma = c_\gamma \cap \{c_1 \cup \ldots \cup c_{\gamma-1}\}$ and residual sets $r_\gamma = c_\gamma \setminus s_\gamma$. We restrict the potentials $\Phi_\gamma(c_\gamma) = \Phi_\gamma(r_\gamma, s_\gamma)$ to satisfy the local normalisation

$$\sum_{\{r_\gamma\}} \Phi_\gamma(r_\gamma, s_\gamma) = 1. \tag{9}$$

We can identify $\Phi_\gamma(r_\gamma, s_\gamma) = Q(r_\gamma \mid s_\gamma)$ and (3) can be written in the familiar directed notation $Q(x) = \prod_\gamma Q(r_\gamma \mid s_\gamma)$.

To optimise the potentials $\varphi_\gamma(r_\gamma, s_\gamma)$ ($= \log Q(r_\gamma \mid s_\gamma)$), we again do a constrained optimisation, now with the constraints (9). This leads to generalised mean field equations for directed distributions

$$\varphi^*_\gamma(r_\gamma, s_\gamma) = \Big\langle \sum_{\alpha \in D_\gamma} \psi_\alpha(d_\alpha) - \sum_{\beta \in C_\gamma} \varphi_\beta(c_\beta) \Big\rangle_{c_\gamma} - z(s_\gamma),$$

in which $D_\gamma$ (resp. $C_\gamma$) is the set of clusters $\alpha$ in P (resp. $\beta \neq \gamma$ in Q) that depend on $r_\gamma$. $z(s_\gamma)$ is a local normalisation factor that can be inferred from (9), i.e.

$$z(s_\gamma) = \log \sum_{\{r_\gamma\}} \exp \Big\langle \sum_{\alpha \in D_\gamma} \psi_\alpha(d_\alpha) - \sum_{\beta \in C_\gamma} \varphi_\beta(c_\beta) \Big\rangle_{c_\gamma}.$$

6 JUNCTION TREES

For the definition of junction trees, we follow (Jensen, 1996): A cluster tree is a tree of clusters of variables which are linked via separators. These consist of the variables common to the adjacent clusters. A cluster tree is a junction tree if for each pair of clusters $c_\gamma$, $c_\delta$, all nodes in the path between $c_\gamma$ and $c_\delta$ contain the intersection.
In a consistent junction tree, the potentials $\Phi_\gamma$ and $\Phi_\delta$ of the nodes $c_\gamma$, $c_\delta$ with intersection I satisfy

$$\sum_{c_\gamma \setminus I} \Phi_\gamma(c_\gamma) = \sum_{c_\delta \setminus I} \Phi_\delta(c_\delta).$$

We consider consistent junction tree representations of Q of the form

$$Q(x) = \frac{\prod_\gamma \Phi_\gamma(c_\gamma)}{\prod_{(\gamma,\delta)} \Phi_{\gamma\delta}(s_{\gamma\delta})},$$

in which the product in the denominator is taken over the separators. The separator potentials are defined by the cluster potentials,

$$\Phi_{\gamma\delta}(s_{\gamma\delta}) = \sum_{c_\gamma \setminus s_{\gamma\delta}} \Phi_\gamma(c_\gamma).$$

The junction tree representation is convenient, because the cluster probabilities can be read directly from the cluster potentials: $Q(c_\gamma) = \Phi_\gamma(c_\gamma)$. For a more detailed treatment of junction trees we refer to (Jensen, 1996).

In the following, we show how approximations can be optimised while maintaining the junction tree representation. Taking one of the clusters, $c_\kappa$, separately, we write Q as the potential $\Phi_\kappa$ times $Q(x \mid c_\kappa)$,

$$Q(x) = \Phi_\kappa(c_\kappa) \times \frac{\prod_{\gamma \neq \kappa} \Phi_\gamma(c_\gamma)}{\prod_{(\gamma,\delta)} \Phi_{\gamma\delta}(s_{\gamma\delta})}.$$

Subsequently, we update $\Phi_\kappa$ according to the mean field equations (5),

$$\Phi^*_\kappa(c_\kappa) = \frac{1}{Z} \exp \Big\langle \sum_{\alpha \in D_\kappa} \log \Psi_\alpha(d_\alpha) - \sum_{\gamma \in C_\kappa} \log \Phi_\gamma(c_\gamma) + \sum_{(\gamma,\delta) \in S_\kappa} \log \Phi_{\gamma\delta}(s_{\gamma\delta}) \Big\rangle_{c_\kappa}, \tag{10}$$

where $S_\kappa$ is the set of separators that depend on $c_\kappa$. Z makes sure that $\Phi^*_\kappa$ is properly normalised. Now, however, the junction tree is not consistent anymore. We can fix this by applying the standard DistributeEvidence($c_\kappa$) operation to the junction tree (see (Jensen, 1996)). In this routine, $c_\kappa$ sends a message to all its neighbours $c_\gamma$ via

$$\Phi^*_\gamma(c_\gamma) = \Phi_\gamma(c_\gamma) \frac{\Phi^*_{\gamma\kappa}(s_{\gamma\kappa})}{\Phi_{\gamma\kappa}(s_{\gamma\kappa})} \quad \text{and} \quad \Phi^*_{\gamma\kappa}(s_{\gamma\kappa}) = \sum_{c_\kappa \setminus s_{\gamma\kappa}} \Phi^*_\kappa(c_\kappa).$$

Recursively, the neighbours $c_\gamma$ send messages to all their neighbours except the one from which the message came. After this procedure, the junction tree is consistent again, and another potential can be updated by (10). Since the DistributeEvidence routine does not change the distribution Q (it only makes it consistent), the global convergence result (section 3) applies if the structure of Q is a junction tree of P. This links the mean field theory with the exact junction tree algorithm.
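The link between the mean field updates and exact computation can be illustrated with a small sketch. It rests on assumptions that are not in the paper: the target is a four-variable binary chain with randomly drawn potentials, the clusters of Q are the three adjacent pairs (a junction tree of the chain), and the update (5) is evaluated by brute-force enumeration on the plain product form (3) instead of the consistent junction tree representation with DistributeEvidence. All names are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n = 4
states = np.array(list(itertools.product([0, 1], repeat=n)))

# Target: chain x0 - x1 - x2 - x3 with random pairwise tables and biases.
pair_tables = [rng.normal(size=(2, 2)) for _ in range(n - 1)]
log_p = sum(t[states[:, i], states[:, i + 1]] for i, t in enumerate(pair_tables))
log_p = log_p + states @ rng.normal(size=n)

clusters = [[0, 1], [1, 2], [2, 3]]                   # junction tree of the chain
phis = [rng.normal(size=(2, 2)) for _ in clusters]    # arbitrary initial potentials

def on_states(phi, cluster):
    """Evaluate a tabulated log-potential on every joint state."""
    return phi[tuple(states[:, cluster].T)]

def q_joint():
    log_q = sum(on_states(phi, c) for phi, c in zip(phis, clusters))
    q = np.exp(log_q - log_q.max())
    return q / q.sum()

def kl():
    q = q_joint()
    p = np.exp(log_p - log_p.max())
    p /= p.sum()
    return float(np.sum(q * (np.log(q) - np.log(p))))

def update(g):
    """phi_g <- <log P - sum_{b != g} phi_b> under Q(x | c_g), by enumeration."""
    q = q_joint()
    f = log_p - sum(on_states(phis[b], clusters[b])
                    for b in range(len(clusters)) if b != g)
    new_phi = np.empty((2, 2))
    for idx in np.ndindex(2, 2):
        mask = np.all(states[:, clusters[g]] == idx, axis=1)
        new_phi[idx] = q[mask] @ f[mask] / q[mask].sum()
    phis[g] = new_phi

kl_before = kl()
for g in reversed(range(len(clusters))):   # one decreasing sweep: last cluster first
    update(g)
kl_after = kl()
```

After the single decreasing sweep, `kl_after` should vanish up to rounding, whatever the initial potentials were, which is the one-sweep convergence described in section 3.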
7 APPROXIMATED MINIMISATION

The complexity of the variational method is at least proportional to the number of states in the clusters $d_\alpha$ of the target distribution P, since it requires the computation of averages of the form $\langle \psi(d_\alpha) \rangle$. In other words, the method presented in this paper can only be computationally tractable if the number of states in $d_\alpha$ is reasonably small. If the cluster potentials are explicitly tabulated, the required storage space is also proportional to the number of possible cluster states.

In practice, potentials with a large number of cluster states are parametrised. In these cases, one can try to exploit the parametrisation and approximate $\langle \psi(d_\alpha) \rangle$ by a tractable quantity. Examples are target distributions P with conditional probabilities $P(x_i \mid \pi_i)$ that are modelled as noisy-or gates (Pearl, 1988; Jensen, 1996) or as weighted sigmoid functions (Neal, 1992). For these parametrisations $\langle \log P(x_i \mid \pi_i) \rangle$ can be approximated by a tractable quantity $\mathcal{E}_i(Q, \xi)$ (which may be defined using additional variational parameters $\xi$). As an example, consider tables parametrised as sigmoid functions,

$$P(x_i = 1 \mid \pi_i) = \sigma(z_i) = \frac{1}{1 + e^{-z_i}},$$

where $z_i$ is the weighted input of the node, $z_i = \sum_k w_{ik} x_k + h_i$. In this case, the averaged log probability is intractable for large parent sets. To proceed we can use the approximation proposed in (Saul et al., 1996),

$$\langle \log(1 + e^{z_i}) \rangle \le \xi_i \langle z_i \rangle + \log \big\langle e^{-\xi_i z_i} + e^{(1 - \xi_i) z_i} \big\rangle = \mathcal{E}_i(Q, \xi),$$

which is tractable if Q is tractable (Wiegerinck and Barber, 1998; Barber and Wiegerinck, 1999). Numerical optimisation of $\mathcal{L}(Q, \xi) = \langle \log Q \rangle - \mathcal{E}(Q, \xi)$ with respect to Q and $\xi$ leads to a local minimum of an upper bound of the KL-divergence. Note, however, that iteration of fixed point equations derived from $\mathcal{L}(Q, \xi)$ does not necessarily lead to convergence, due to the nonlinearity of $\mathcal{E}$ with respect to Q.
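The bound of (Saul et al., 1996) quoted above can be checked numerically with a minimal sketch. The assumptions are not from the paper: Q is taken to be fully factorised over binary (0/1) parents with hypothetical parameters `q[k] = Q(x_k = 1)`, so that the averages on the right hand side have closed forms, and the exact average on the left hand side is computed by enumerating all parent states for comparison.

```python
import itertools
import numpy as np

def exact_average(w, h, q):
    """<log(1 + exp(z))> with z = w.x + h, by enumerating all parent states."""
    total = 0.0
    for x in itertools.product([0, 1], repeat=len(w)):
        x = np.array(x)
        prob = np.prod(np.where(x == 1, q, 1 - q))
        total += prob * np.log1p(np.exp(w @ x + h))
    return total

def mean_exp(a, w, h, q):
    """<exp(a z)> under a factorised Q; the average factorises over the parents."""
    return np.exp(a * h) * np.prod(1 - q + q * np.exp(a * w))

def bound(xi, w, h, q):
    """xi <z> + log <exp(-xi z) + exp((1 - xi) z)>, cost linear in the number of parents."""
    mean_z = w @ q + h
    return xi * mean_z + np.log(mean_exp(-xi, w, h, q) + mean_exp(1 - xi, w, h, q))

# Illustrative weights, bias and parent marginals (assumed values).
w = np.array([1.5, -2.0, 0.7])
h = 0.3
q = np.array([0.2, 0.6, 0.9])

exact = exact_average(w, h, q)
xis = np.linspace(0.0, 1.0, 11)
bounds = np.array([bound(xi, w, h, q) for xi in xis])
best_xi = xis[np.argmin(bounds)]   # tightest bound on this grid
```

The inequality follows from writing log(1 + e^z) as ξz + log(e^{-ξz} + e^{(1-ξ)z}) and applying Jensen's inequality to the logarithm, so every entry of `bounds` should lie at or above `exact`; minimising over ξ tightens the bound.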
In (Wiegerinck and Kappen, 2000) numerical simulations are performed on artificial target distributions P that had tractable substructures as well as sigmoidal nodes with large parent sets. Target distributions with varying system size were approximated by fully factorised distributions as well as by distributions with structure. The results showed that an approximation using structure can significantly improve the accuracy of the approximation within feasible computer time. This seemed independent of the problem size.

Another example is a hybrid Bayesian network (which has continuous and discrete variables). In the remainder of this section we closely follow (Murphy, 1999). For expositional clarity we consider the distribution P(r|x)P(x|t)P(t), in which r and t are binary variables and x is a continuous variable. The conditional distribution P(r|x) is parametrised by a sigmoid, $P(r = 1 \mid x) = \sigma(wx + b)$ with parameters w and b. The conditional distribution P(x|t) is a conditional Gaussian, $P(x \mid t) = \exp(g_t + x h_t + x^2 K_t / 2)$, in which $(h_t, K_t)$ are parameters depending on t, and P(t) is a simple table with two entries. As (Murphy, 1999) showed, computation of the conditional distributions of x and t given observation of r is difficult. In (Murphy, 1999), it is proposed to approximate the KL-divergence by using the quadratic lower bound of the sigmoid function (Jaakkola, 1997; Murphy, 1999),

$$\log \sigma(x) \ge x/2 + \lambda(\xi) x^2 + \log \sigma(\xi) - \xi/2 - \xi^2 \lambda(\xi),$$

with $\lambda(\xi) = -\frac{1}{4\xi} \tanh(\xi/2)$. By fixing $\xi$, this bound leads to a tractable upper bound of the KL-divergence

$$\mathcal{L}(Q, \xi) = \big\langle \log Q(t) + \log Q(x \mid t) - \log P(t) - \big(g_t + g(\xi) + x(h_t + h(\xi)) + x^2 (K_t + K(\xi))/2\big) \big\rangle,$$

in which

$$g(\xi) = \log \sigma(\xi) + \tfrac{1}{2}(2r - 1) b - \tfrac{1}{2}\xi + \lambda(\xi)(b^2 - \xi^2),$$
$$h(\xi) = \tfrac{1}{2}(2r - 1) w + 2\lambda(\xi) b w,$$
$$K(\xi) = 2\lambda(\xi) w^2.$$

For given $\xi$, the optimal distribution Q(x, t) is simply given by

$$Q(x, t) \propto P(t) \exp\big(g_t + g(\xi) + x(h_t + h(\xi)) + x^2 (K_t + K(\xi))/2\big).$$

In other words, Q is the product of a conditional Gaussian and a table.

Since the parameters of the conditional Gaussian Q(x|t) depend on t, an obvious extension of this scheme is to make the contribution that depends on $\xi$ depend on t as well.
In other words, we replace the single parameter $\xi$ by two parameters $\xi_t$, t = 0, 1, and bound the KL-divergence by

$$\mathcal{L}(Q, \xi_t) = \big\langle \log Q(t) + \log Q(x \mid t) - \log P(t) - \big(g_t + g(\xi_t) + x(h_t + h(\xi_t)) + x^2 (K_t + K(\xi_t))/2\big) \big\rangle.$$

Then it follows that for given $\xi_t$, the optimal distribution Q(x, t) is given by

$$Q(x, t) \propto P(t) \exp\big(g_t + g(\xi_t) + x(h_t + h(\xi_t)) + x^2 (K_t + K(\xi_t))/2\big).$$

To optimise $\xi_t$, we find in analogy with (Murphy, 1999)

$$\xi_t^2 = \langle (wx + b)^2 \rangle_t.$$

Figure 3: The effect of using approximations with structure in hybrid networks. We show results for a network P(t)P(x|t)P(r|x), in which t and r are binary (0/1) and x is continuous. P(t = 1) = 0.3, P(x|t) is a conditional Gaussian with $\mu_{0,1} = (10, 20)$, $\sigma_{0,1} = 1$ and P(r|x) is defined using a sigmoid with w = -1, b = 5. (This example is based on the crop network, in which t is 'subsidy', x is 'price' and r is 'buy'. 'Crop' is assumed to be observed in its mean value; see (Murphy, 1999) for details.) In (a) we plot $\sigma(-(wx - b))$ as a function of x (solid), as well as the variational lower bound using one optimised unconditional parameter $\xi$ (dotted) and the two bounds for the optimised conditional variational parameters $\xi_t$ (dashed). In (b) we plot P(r = 0, x) as a function of x using the exact probability (solid), the approximation using one unconditional parameter $\xi$ (dotted) and two conditional parameters $\xi_t$ (dashed; this graph coincides with the exact graph).

The largest improvements of this extension can be expected when the posterior distribution (given observation of r) is multi-modal. In figure 3, an example is given.

8 DISCUSSION

Finding accurate approximations of graphical models such as Bayesian networks is crucial if their application to large scale problems is to be realised. We have presented a general scheme to use a (simpler) approximating distribution that factorises according to a given structure.
The scheme includes approximations using undirected graphs, directed acyclic graphs and junction trees. The approximating distribution is tuned by minimisation of the Kullback-Leibler divergence. We have shown that the method bridges the gap between standard mean field theory and exact computation. We have contributed to a solution for the question of how to select the structure of the approximating distribution, and of when potentials of the target distribution can be exploited. Parametrised dis-
