Abstract

We study the natural gradient method for learning in deep Bayesian networks, including neural networks. There are two natural geometries associated with such learning systems consisting of visible and hidden units. One geometry is related to the full system, the other to the visible sub-system. These two geometries imply different natural gradients. In a first step, we demonstrate a great simplification of the natural gradient with respect to the first geometry, due to locality properties of the Fisher information matrix. This simplification does not directly translate to a corresponding simplification with respect to the second geometry. We develop the theory for studying the relation between the two versions of the natural gradient and outline a method for simplifying the natural gradient with respect to the second geometry based on the first one. This method suggests incorporating a recognition model as an auxiliary model for the efficient application of the natural gradient method in deep networks.