Foundations and Trends® in Machine Learning Vol. 2, No. 1 (2009) © 2009 Y. Bengio

Learning Deep Architectures for AI

Yoshua Bengio
Dept. IRO, Université de Montréal, C.P. 6128, Montreal, Qc, H3C 3J7, Canada

Abstract

Theoretical results suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g., in vision, language, and other AI-level tasks), one may need deep architectures. Deep architectures are composed of multiple levels of non-linear operations, such as in neural nets with many hidden layers or in complicated propositional formulae re-using many sub-formulae. Searching the parameter space of deep architectures is a difficult task, but learning algorithms such as those for Deep Belief Networks have recently been proposed to tackle this problem with notable success, beating the state-of-the-art in certain areas. This monograph discusses the motivations and principles regarding learning algorithms for deep architectures, in particular those exploiting as building blocks unsupervised learning of single-layer models such as Restricted Boltzmann Machines, used to construct deeper models such as Deep Belief Networks.

1 Introduction

Allowing computers to model our world well enough to exhibit what we call intelligence has been the focus of more than half a century of research. To achieve this, it is clear that a large quantity of information about our world should somehow be stored, explicitly or implicitly, in the computer. Because it seems daunting to formalize manually all that information in a form that computers can use to answer questions and generalize to new contexts, many researchers have turned to learning algorithms to capture a large fraction of that information. Much progress has been made to understand and improve learning algorithms, but the challenge of artificial intelligence (AI) remains. Do we have algorithms that can understand scenes and describe them in natural language? Not really, except in very limited settings. Do we have algorithms that can infer enough semantic concepts to be able to interact with most humans using these concepts? No. If we consider image understanding, one of the best specified of the AI tasks, we realize that we do not yet have learning algorithms that can discover the many visual and semantic concepts that would seem to be necessary to interpret most images on the web. The situation is similar for other AI tasks.

Fig. 1.1 We would like the raw input image to be transformed into gradually higher levels of representation, representing more and more abstract functions of the raw input, e.g., edges, local shapes, object parts, etc. In practice, we do not know in advance what the right representation should be for all these levels of abstractions, although linguistic concepts might help guessing what the higher levels should implicitly represent.

Consider for example the task of interpreting an input image such as the one in Figure 1.1. When humans try to solve a particular AI task (such as machine vision or natural language processing), they often exploit their intuition about how to decompose the problem into subproblems and multiple levels of representation, e.g., in object parts and constellation models [138, 179, 197] where models for parts can be re-used in different object instances. For example, the current state-of-the-art in machine vision involves a sequence of modules starting from pixels and ending in a linear or kernel classifier [134, 145], with intermediate modules mixing engineered transformations and learning,

e.g., first extracting low-level features that are invariant to small geometric variations (such as edge detectors from Gabor filters), transforming them gradually (e.g., to make them invariant to contrast changes and contrast inversion, sometimes by pooling and sub-sampling), and then detecting the most frequent patterns. A plausible and common way to extract useful information from a natural image involves transforming the raw pixel representation into gradually more abstract representations, e.g., starting from the presence of edges, the detection of more complex but local shapes, up to the identification of abstract categories associated with sub-objects and objects which are parts of the image, and putting all these together to capture enough understanding of the scene to answer questions about it. Here, we assume that the computational machinery necessary to express complex behaviors (which one might label "intelligent") requires highly varying mathematical functions, i.e., mathematical functions that are highly non-linear in terms of raw sensory inputs, and that display a very large number of variations (ups and downs) across the domain of interest. We view the raw input to the learning system as a high-dimensional entity, made of many observed variables, which are related by unknown intricate statistical relationships. For example, using knowledge of the 3D geometry of solid objects and lighting, we can relate small variations in underlying physical and geometric factors (such as position, orientation, lighting of an object) with changes in pixel intensities for all the pixels in an image. We call these factors of variation because they are different aspects of the data that can vary separately and often independently.
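The early stage of such a pipeline (an engineered edge filter followed by pooling and sub-sampling) can be sketched in a few lines. This is a minimal illustration rather than a real vision system: the Sobel kernel, used here as a crude stand-in for a Gabor filter, the toy image, and all function names are assumptions of this sketch.

```python
import numpy as np

def conv2d(img, kernel):
    """Valid-mode 2D convolution (no padding), implemented directly."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: keep the strongest response in each
    block, giving tolerance to small translations."""
    h, w = fmap.shape
    h, w = h - h % size, w - w % size
    return fmap[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

# A vertical-edge detector (Sobel kernel), standing in for a Gabor filter.
sobel_x = np.array([[-1., 0., 1.],
                    [-2., 0., 2.],
                    [-1., 0., 1.]])

# Toy image: dark left half, bright right half, i.e., one vertical edge.
img = np.zeros((8, 8))
img[:, 4:] = 1.0

edges = np.abs(conv2d(img, sobel_x))   # strong response along the edge
pooled = max_pool(edges)               # sub-sampled, translation-tolerant map
```

The pooled map still responds to the vertical edge after small shifts within each pooling block, which is the invariance to small geometric variations mentioned above.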
In this case, explicit knowledge of the physical factors involved allows one to get a picture of the mathematical form of these dependencies, and of the shape of the set of images (as points in a high-dimensional space of pixel intensities) associated with the same 3D object. If a machine captured the factors that explain the statistical variations in the data, and how they interact to generate the kind of data we observe, we would be able to say that the machine understands those aspects of the world covered by these factors of variation. Unfortunately, in general and for most factors of variation underlying natural images, we do not have an analytical understanding of these factors of variation. We do not have enough formalized

prior knowledge about the world to explain the observed variety of images, even for such an apparently simple abstraction as MAN, illustrated in Figure 1.1. A high-level abstraction such as MAN has the property that it corresponds to a very large set of possible images, which might be very different from each other from the point of view of simple Euclidean distance in the space of pixel intensities. The set of images for which that label could be appropriate forms a highly convoluted region in pixel space that is not even necessarily a connected region. The MAN category can be seen as a high-level abstraction with respect to the space of images. What we call abstraction here can be a category (such as the MAN category) or a feature, a function of sensory data, which can be discrete (e.g., the input sentence is in the past tense) or continuous (e.g., the input video shows an object moving at 2 meters/second). Many lower-level and intermediate-level concepts (which we also call abstractions here) would be useful to construct a MAN-detector. Lower-level abstractions are more directly tied to particular percepts, whereas higher-level ones are what we call more abstract because their connection to actual percepts is more remote, and goes through other, intermediate-level abstractions. In addition to the difficulty of coming up with the appropriate intermediate abstractions, the number of visual and semantic categories (such as MAN) that we would like an intelligent machine to capture is rather large. The focus of deep architecture learning is to automatically discover such abstractions, from the lowest-level features to the highest-level concepts. Ideally, we would like learning algorithms that enable this discovery with as little human effort as possible, i.e., without having to manually define all necessary abstractions or having to provide a huge set of relevant hand-labeled examples.
If these algorithms could tap into the huge resource of text and images on the web, it would certainly help to transfer much of human knowledge into machine-interpretable form.

1.1 How do We Train Deep Architectures?

Deep learning methods aim at learning feature hierarchies with features from higher levels of the hierarchy formed by the composition of

lower-level features. Automatically learning features at multiple levels of abstraction allows a system to learn complex functions mapping the input to the output directly from data, without depending completely on human-crafted features. This is especially important for higher-level abstractions, which humans often do not know how to specify explicitly in terms of raw sensory input. The ability to automatically learn powerful features will become increasingly important as the amount of data and the range of applications of machine learning methods continue to grow. Depth of architecture refers to the number of levels of composition of non-linear operations in the function learned. Whereas most current learning algorithms correspond to shallow architectures (1, 2 or 3 levels), the mammal brain is organized in a deep architecture [173], with a given input percept represented at multiple levels of abstraction, each level corresponding to a different area of cortex. Humans often describe such concepts in hierarchical ways, with multiple levels of abstraction. The brain also appears to process information through multiple stages of transformation and representation. This is particularly clear in the primate visual system [173], with its sequence of processing stages: detection of edges, primitive shapes, and moving up to gradually more complex visual shapes. Inspired by the architectural depth of the brain, neural network researchers had wanted for decades to train deep multi-layer neural networks [19, 191], but no successful attempts were reported before 2006: researchers reported positive experimental results with typically two or three levels (i.e., one or two hidden layers), but training deeper networks consistently yielded poorer results. Something that can be considered a breakthrough happened in 2006: Hinton et al.
at the University of Toronto introduced Deep Belief Networks (DBNs) [73], with a learning algorithm that greedily trains one layer at a time, exploiting an unsupervised learning algorithm for each layer, a Restricted Boltzmann Machine (RBM) [51]. Shortly after, related algorithms based on auto-encoders were proposed [17, 153], apparently exploiting the

1 Except for neural networks with a special structure called convolutional networks, discussed in Section 4.5.

same principle: guiding the training of intermediate levels of representation using unsupervised learning, which can be performed locally at each level. Other algorithms for deep architectures were proposed more recently that exploit neither RBMs nor auto-encoders and that exploit the same principle [131, 202] (see Section 4). Since 2006, deep networks have been applied with success not only in classification tasks [2, 17, 99, 111, 150, 153, 195], but also in regression [160], dimensionality reduction [74, 158], modeling textures [141], modeling motion [182, 183], object segmentation [114], information retrieval [154, 159, 190], robotics [60], natural language processing [37, 130, 202], and collaborative filtering [162]. Although auto-encoders, RBMs and DBNs can be trained with unlabeled data, in many of the above applications they have been successfully used to initialize deep supervised feedforward neural networks applied to a specific task.

1.2 Intermediate Representations: Sharing Features and Abstractions Across Tasks

Since a deep architecture can be seen as the composition of a series of processing stages, the immediate question that deep architectures raise is: what kind of representation of the data should be found as the output of each stage (i.e., the input of another)? What kind of interface should there be between these stages? A hallmark of recent research on deep architectures is the focus on these intermediate representations: the success of deep architectures belongs to the representations learned in an unsupervised way by RBMs [73], ordinary auto-encoders [17], sparse auto-encoders [150, 153], or denoising auto-encoders [195]. These algorithms (described in more detail in Section 7.2) can be seen as learning to transform one representation (the output of the previous stage) into another, at each step perhaps better disentangling the factors of variation underlying the data.
As we discuss at length in Section 4, it has been observed again and again that once a good representation has been found at each level, it can be used to initialize and successfully train a deep neural network by supervised gradient-based optimization.
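The greedy layer-wise strategy can be sketched compactly. The sketch below uses a tied-weight one-layer auto-encoder as the unsupervised learner for each level, standing in for the RBM of a DBN; the network sizes, learning rate, and toy data are illustrative assumptions, not values from the literature.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder_layer(X, n_hidden, lr=0.5, epochs=200):
    """Train one tied-weight auto-encoder layer by gradient descent on the
    squared reconstruction error; return the encoder parameters (W, b)."""
    n_vis = X.shape[1]
    W = rng.normal(0.0, 0.1, size=(n_vis, n_hidden))
    b = np.zeros(n_hidden)   # hidden-unit biases
    c = np.zeros(n_vis)      # reconstruction biases
    for _ in range(epochs):
        H = sigmoid(X @ W + b)        # encode
        R = sigmoid(H @ W.T + c)      # decode with tied weights
        dA = (R - X) * R * (1 - R)    # gradient at the decoder pre-activation
        dZ = (dA @ W) * H * (1 - H)   # gradient at the encoder pre-activation
        W -= lr * (dA.T @ H + X.T @ dZ) / len(X)
        b -= lr * dZ.mean(axis=0)
        c -= lr * dA.mean(axis=0)
    return W, b

def greedy_pretrain(X, layer_sizes):
    """Greedy layer-wise pretraining: train each layer unsupervised on the
    codes produced by the layers below, freeze it, and move up."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_autoencoder_layer(H, n_hidden)
        params.append((W, b))
        H = sigmoid(H @ W + b)   # representation fed to the next layer
    return params

X = rng.random((64, 16))              # toy unlabeled data
params = greedy_pretrain(X, [8, 4])   # a 16-8-4 feature hierarchy
```

The resulting parameters would then initialize a deep feedforward network, which is fine-tuned by supervised gradient descent as described in the text.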

Each level of abstraction found in the brain consists of the activation (neural excitation) of a small subset of a large number of features that are, in general, not mutually exclusive. Because these features are not mutually exclusive, they form what is called a distributed representation [68, 156]: the information is not localized in a particular neuron but distributed across many. In addition to being distributed, it appears that the brain uses a representation that is sparse: only around 1–4% of the neurons are active together at a given time [5, 113]. Section 3.2 introduces the notion of sparse distributed representation and Section 7.1 describes in more detail the machine learning approaches, some inspired by the observation of sparse representations in the brain, that have been used to build deep architectures with sparse representations. Whereas dense distributed representations are one extreme of a spectrum, and sparse representations are in the middle of that spectrum, purely local representations are the other extreme. Locality of representation is intimately connected with the notion of local generalization. Many existing machine learning methods are local in input space: to obtain a learned function that behaves differently in different regions of data-space, they require different tunable parameters for each of these regions (see more in Section 3.1). Even though statistical efficiency is not necessarily poor when the number of tunable parameters is large, good generalization can be obtained only when adding some form of prior (e.g., that smaller values of the parameters are preferred). When that prior is not task-specific, it is often one that forces the solution to be very smooth, as discussed at the end of Section 3.1.
In contrast to learning methods based on local generalization, the total number of patterns that can be distinguished using a distributed representation scales possibly exponentially with the dimension of the representation (i.e., the number of learned features). In many machine vision systems, learning algorithms have been limited to specific parts of such a processing chain. The rest of the design remains labor-intensive, which might limit the scale of such systems. On the other hand, a hallmark of what we would consider intelligent machines includes a large enough repertoire of concepts. Recognizing MAN is not enough. We need algorithms that can tackle a very large

set of such tasks and concepts. It seems daunting to manually define that many tasks, and learning becomes essential in this context. Furthermore, it would seem foolish not to exploit the underlying commonalities between these tasks and between the concepts they require. This has been the focus of research on multi-task learning [7, 8, 32, 88, 186]. Architectures with multiple levels naturally provide such sharing and re-use of components: the low-level visual features (like edge detectors) and intermediate-level visual features (like object parts) that are useful to detect MAN are also useful for a large group of other visual tasks. Deep learning algorithms are based on learning intermediate representations which can be shared across tasks. Hence they can leverage unsupervised data and data from similar tasks [148] to boost performance on large and challenging problems that routinely suffer from a poverty of labeled data, as has been shown by [37], beating the state-of-the-art in several natural language processing tasks. A similar multi-task approach for deep architectures was applied in vision tasks by [2]. Consider a multi-task setting in which there are different outputs for different tasks, all obtained from a shared pool of high-level features. The fact that many of these learned features are shared among m tasks provides sharing of statistical strength in proportion to m. Now consider that these learned high-level features can themselves be represented by combining lower-level intermediate features from a common pool. Again statistical strength can be gained in a similar way, and this strategy can be exploited for every level of a deep architecture.
In addition, learning about a large set of interrelated concepts might provide a key to the kind of broad generalizations that humans appear able to do, which we would not expect from separately trained object detectors, with one detector per visual category. If each high-level category is itself represented through a particular distributed configuration of abstract features from a common pool, generalization to unseen categories could follow naturally from new configurations of these features. Even though only some configurations of these features would be present in the training examples, if they represent different aspects of the data, new examples could meaningfully be represented by new configurations of these features.
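The statistical advantage of distributed representations over purely local ones, mentioned earlier in this section, can be illustrated by simply counting patterns: d one-hot (local) units distinguish only d patterns, while d binary distributed features jointly distinguish 2^d. The code below is a minimal counting illustration, not a learning algorithm.

```python
from itertools import product

def local_patterns(d):
    """One-hot (purely local) codes over d units: one pattern per unit."""
    return [tuple(int(i == j) for j in range(d)) for i in range(d)]

def distributed_patterns(d):
    """Binary distributed codes over d units: each unit is an independent
    feature, so the units jointly distinguish 2**d patterns."""
    return list(product([0, 1], repeat=d))

d = 10
print(len(local_patterns(d)))        # 10
print(len(distributed_patterns(d)))  # 1024
```

With d = 10 the local code distinguishes 10 patterns and the distributed code 1024; the gap grows exponentially with the dimension of the representation.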

1.3 Desiderata for Learning AI

Summarizing some of the above issues, and trying to put them in the broader perspective of AI, we put forward a number of requirements we believe to be important for learning algorithms to approach AI, many of which motivate the research described here:

- Ability to learn complex, highly-varying functions, i.e., with a number of variations much greater than the number of training examples.
- Ability to learn with little human input the low-level, intermediate, and high-level abstractions that would be useful to represent the kind of complex functions needed for AI tasks.
- Ability to learn from a very large set of examples: computation time for training should scale well with the number of examples, i.e., close to linearly.
- Ability to learn from mostly unlabeled data, i.e., to work in the semi-supervised setting, where not all the examples come with complete and correct semantic labels.
- Ability to exploit the synergies present across a large number of tasks, i.e., multi-task learning. These synergies exist because all the AI tasks provide different views on the same underlying reality.
- Strong unsupervised learning (i.e., capturing most of the statistical structure in the observed data), which seems essential in the limit of a large number of tasks and when future tasks are not known ahead of time.

Other elements are equally important but are not directly connected to the material in this monograph. They include the ability to learn to represent context of varying length and structure [146], so as to allow machines to operate in a context-dependent stream of observations and produce a stream of actions, the ability to make decisions when actions influence the future observations and future rewards [181], and the ability to influence future observations so as to collect more relevant information about the world, i.e., a form of active learning [34].

1.4 Outline of the Paper

Section 2 reviews theoretical results (which can be skipped without hurting the understanding of the remainder) showing that an architecture with insufficient depth can require many more computational elements, potentially exponentially more (with respect to input size), than architectures whose depth is matched to the task. We claim that insufficient depth can be detrimental for learning. Indeed, if a solution to the task is represented with a very large but shallow architecture (with many computational elements), a lot of training examples might be needed to tune each of these elements and capture a highly varying function. Section 3.1 is also meant to motivate the reader, this time to highlight the limitations of local generalization and local estimation, which we expect to avoid using deep architectures with a distributed representation (Section 3.2). In later sections, the monograph describes and analyzes some of the algorithms that have been proposed to train deep architectures. Section 4 introduces concepts from the neural networks literature relevant to the task of training deep architectures. We first consider the previous difficulties in training neural networks with many layers, and then introduce unsupervised learning algorithms that could be exploited to initialize deep neural networks. Many of these algorithms (including those for the RBM) are related to the auto-encoder: a simple unsupervised algorithm for learning a one-layer model that computes a distributed representation for its input [25, 79, 156]. To fully understand RBMs and many related unsupervised learning algorithms, Section 5 introduces the class of energy-based models, including those used to build generative models with hidden variables such as the Boltzmann Machine. Section 6 focuses on the greedy layer-wise training algorithms for Deep Belief Networks (DBNs) [73] and Stacked Auto-Encoders [17, 153, 195].
Section 7 discusses variants of RBMs and auto-encoders that have been recently proposed to extend and improve them, including the use of sparsity, and the modeling of temporal dependencies. Section 8 discusses algorithms for jointly training all the layers of a Deep Belief Network using variational bounds. Finally, we consider in Section 9 forward looking questions such as the hypothesized difficult optimization

problem involved in training deep architectures. In particular, we follow up on the hypothesis that part of the success of current learning strategies for deep architectures is connected to the optimization of lower layers. We discuss the principle of continuation methods, which minimize gradually less smooth versions of the desired cost function, to make a dent in the optimization of deep architectures.

2 Theoretical Advantages of Deep Architectures

In this section, we present a motivating argument for the study of learning algorithms for deep architectures, by way of theoretical results revealing potential limitations of architectures with insufficient depth. This part of the monograph (this section and the next) motivates the algorithms described in the later sections, and can be skipped without making the remainder difficult to follow. The main point of this section is that some functions cannot be efficiently represented (in terms of number of tunable elements) by architectures that are too shallow. These results suggest that it would be worthwhile to explore learning algorithms for deep architectures, which might be able to represent some functions otherwise not efficiently representable. Where simpler and shallower architectures fail to efficiently represent (and hence to learn) a task of interest, we can hope for learning algorithms that could set the parameters of a deep architecture for this task. We say that the expression of a function is compact when it has few computational elements, i.e., few degrees of freedom that need to be tuned by learning. So for a fixed number of training examples, and short of other sources of knowledge injected in the learning algorithm,

we would expect that compact representations of the target function would yield better generalization. More precisely, functions that can be compactly represented by a depth k architecture might require an exponential number of computational elements to be represented by a depth k − 1 architecture. Since the number of computational elements one can afford depends on the number of training examples available to tune or select them, the consequences are not only computational but also statistical: poor generalization may be expected when using an insufficiently deep architecture for representing some functions. We consider the case of fixed-dimension inputs, where the computation performed by the machine can be represented by a directed acyclic graph where each node performs a computation that is the application of a function on its inputs, each of which is the output of another node in the graph or one of the external inputs to the graph. The whole graph can be viewed as a circuit that computes a function applied to the external inputs. When the set of functions allowed for the computation nodes is limited to logic gates, such as {AND, OR, NOT}, this is a Boolean circuit, or logic circuit. To formalize the notion of depth of architecture, one must introduce the notion of a set of computational elements. An example of such a set is the set of computations that can be performed by logic gates. Another is the set of computations that can be performed by an artificial neuron (depending on the values of its synaptic weights). A function can be expressed by the composition of computational elements from a given set. It is defined by a graph which formalizes this composition, with one node per computational element. Depth of architecture refers to the depth of that graph, i.e., the longest path from an input node to an output node.
When the set of computational elements is the set of computations an artificial neuron can perform, depth corresponds to the number of layers in a neural network. Let us explore the notion of depth with examples of architectures of different depths. Consider the function f(x) = x sin(ax + b). It can be expressed as the composition of simple operations such as addition, subtraction, multiplication,

1 The target function is the function that we would like the learner to discover.

Fig. 2.1 Examples of functions represented by a graph of computations, where each node is taken in some element set of allowed computations. Left, the elements are {×, +, −, sin} ⊂ R. The architecture computes x sin(ax + b) and has depth 4. Right, the elements are artificial neurons computing f(x) = tanh(b + w'x); each element in the set has a different (w, b) parameter. The architecture is a multi-layer neural network of depth 3.

and the sin operation, as illustrated in Figure 2.1. In the example, there would be a different node for the multiplication a × x and for the final multiplication by x. Each node in the graph is associated with an output value obtained by applying some function on input values that are the outputs of other nodes of the graph. For example, in a logic circuit each node can compute a Boolean function taken from a small set of Boolean functions. The graph as a whole has input nodes and output nodes and computes a function from input to output. The depth of an architecture is the maximum length of a path from any input of the graph to any output of the graph, i.e., 4 in the case of x sin(ax + b) in Figure 2.1. If we include affine operations and their possible composition with sigmoids in the set of computational elements, linear regression and logistic regression have depth 1, i.e., have a single level. When we put a fixed kernel computation K(u, v) in the set of allowed operations, along with affine operations, kernel machines [166] with a fixed kernel can be considered to have two levels. The first level has one element computing

K(x, x_i) for each prototype x_i (a selected representative training example) and matches the input vector x with the prototypes x_i. The second level performs an affine combination b + Σ_i α_i K(x, x_i) to associate the matching prototypes x_i with the expected response. When we put artificial neurons (affine transformation followed by a non-linearity) in our set of elements, we obtain ordinary multi-layer neural networks [156]. With the most common choice of one hidden layer, they also have depth two (the hidden layer and the output layer). Decision trees can also be seen as having two levels, as discussed in Section 3.1. Boosting [52] usually adds one level to its base learners: that level computes a vote or linear combination of the outputs of the base learners. Stacking [205] is another meta-learning algorithm that adds one level. Based on current knowledge of brain anatomy [173], it appears that the cortex can be seen as a deep architecture, with 5–10 levels just for the visual system. Although depth depends on the choice of the set of allowed computations for each element, graphs associated with one set can often be converted to graphs associated with another by a graph transformation in a way that multiplies depth. Theoretical results suggest that it is not the absolute number of levels that matters, but the number of levels relative to how many are required to represent efficiently the target function (with some choice of set of computational elements).

2.1 Computational Complexity

The most formal arguments about the power of deep architectures come from investigations into computational complexity of circuits. The basic conclusion that these results suggest is that when a function can be compactly represented by a deep architecture, it might need a very large architecture to be represented by an insufficiently deep one.
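The notion of depth as the longest input-to-output path in a graph of computations can be made concrete. The sketch below encodes the depth-4 example f(x) = x sin(ax + b) from Figure 2.1 as a small directed acyclic graph; the dictionary encoding and the function and node names are assumptions of this illustration.

```python
import math

# Each node: (function, list of parent node names). Names not present in the
# graph ("x", "a", "b") are external inputs. This graph computes x*sin(a*x + b).
graph = {
    "mul1": (lambda a, x: a * x, ["a", "x"]),
    "add":  (lambda u, b: u + b, ["mul1", "b"]),
    "sin":  (math.sin,           ["add"]),
    "mul2": (lambda x, s: x * s, ["x", "sin"]),
}

def depth(node, graph):
    """Longest path from any external input to this node."""
    if node not in graph:               # external input
        return 0
    _, parents = graph[node]
    return 1 + max(depth(p, graph) for p in parents)

def evaluate(node, graph, env):
    """Recursively evaluate a node given values for the external inputs."""
    if node not in graph:
        return env[node]
    fn, parents = graph[node]
    return fn(*(evaluate(p, graph, env) for p in parents))

print(depth("mul2", graph))   # 4
```

Swapping in a different element set (e.g., artificial neurons) changes the nodes but not the definition of depth, which remains the longest path in the graph.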

A two-layer circuit of logic gates can represent any Boolean function [127]. Any Boolean function can be written as a sum of products (disjunctive normal form: AND gates on the first layer with optional negation of inputs, and an OR gate on the second layer) or a product of sums (conjunctive normal form: OR gates on the first layer with optional negation of inputs, and an AND gate on the second layer). To understand the limitations of shallow architectures, the first result to consider is that with depth-two logical circuits, most Boolean functions require an exponential (with respect to input size) number of logic gates [198] to be represented. More interestingly, there are functions computable with a polynomial-size logic gate circuit of depth k that require exponential size when restricted to depth k − 1 [62]. The proof of this theorem relies on earlier results [208] showing that d-bit parity circuits of depth 2 have exponential size. The d-bit parity function is defined as usual:

parity : (b_1, ..., b_d) ∈ {0,1}^d ↦ 1 if Σ_{i=1}^d b_i is even, and 0 otherwise.

One might wonder whether these computational complexity results for Boolean circuits are relevant to machine learning. See [140] for an early survey of theoretical results in computational complexity relevant to learning algorithms. Interestingly, many of the results for Boolean circuits can be generalized to architectures whose computational elements are linear threshold units (also known as artificial neurons [125]), which compute

f(x) = 1_{w·x + b ≥ 0}    (2.1)

with parameters w and b. The fan-in of a circuit is the maximum number of inputs of a particular element. Circuits are often organized in layers, like multi-layer neural networks, where elements in a layer only take their input from elements in the previous layer(s), and the first layer is the neural network input.
The size of a circuit is the number of its computational elements (excluding input elements, which do not perform any computation).
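The difficulty of parity for shallow architectures can be illustrated at the smallest scale. The sketch below implements the parity function and the linear threshold unit of Eq. (2.1), checks by exhaustive search over a coarse integer weight grid that no single threshold unit computes 2-bit parity (it is not linearly separable), and shows a depth-2 threshold circuit that does; the grid range and the particular circuit weights are choices made for this illustration.

```python
from itertools import product

def parity(bits):
    """d-bit parity as defined above: 1 iff the number of 1-bits is even."""
    return 1 if sum(bits) % 2 == 0 else 0

def ltu(w, b, x):
    """Linear threshold unit, Eq. (2.1): 1 if w.x + b >= 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0

inputs = list(product([0, 1], repeat=2))
target = [parity(x) for x in inputs]   # 2-bit parity: [1, 0, 0, 1]

# No single LTU on this weight grid matches 2-bit parity; the function is not
# linearly separable, so no real weights would work either.
grid = range(-3, 4)
single = any(
    all(ltu((w1, w2), b, x) == t for x, t in zip(inputs, target))
    for w1 in grid for w2 in grid for b in grid
)
print(single)  # False

def depth2_parity(x):
    """A depth-2 threshold circuit computing 2-bit parity."""
    h1 = ltu((1, 1), -2, x)            # AND(x1, x2)
    h2 = ltu((-1, -1), 0, x)           # NOR(x1, x2)
    return ltu((1, 1), -1, (h1, h2))   # OR(h1, h2)

print(all(depth2_parity(x) == parity(x) for x in inputs))  # True
```

For larger d, depth-2 threshold circuits for parity still exist but their size grows exponentially, which is the point of the results cited above.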
