Neural networks
- single neurons are not able to solve complex tasks (e.g. they are restricted to linear calculations)
- creating networks by hand is too expensive; we want to learn from data
- nonlinear features also have to be generated by hand; tessellations become intractable for larger dimensions
- we want a generic model that can adapt to some training data
- basic idea: multi layer perceptron (Werbos 1974; Rumelhart, McClelland, Hinton 1986), also named feed forward networks

Multi layer perceptrons, more formally:
A MLP is a finite directed acyclic graph.
- nodes that are not the target of any connection are called input neurons. A MLP that is applied to input patterns of dimension n must have n input neurons, one for each dimension. Input neurons are typically enumerated as neuron 1, neuron 2, neuron 3, ...
- nodes that are not the source of any connection are called output neurons. A MLP can have more than one output neuron. The number of output neurons depends on the way the target values (desired values) of the training patterns are described.
- all nodes that are neither input neurons nor output neurons are called hidden neurons.
- since the graph is acyclic, all neurons can be organized in layers, with the set of input neurons forming the first layer.

Multi layer perceptrons (continued)
- connections that hop over several layers are called shortcuts
- most MLPs have a connection structure with connections from all neurons of one layer to all neurons of the next layer, without shortcuts
- all neurons are enumerated
- Succ(i) is the set of all neurons j for which a connection i → j exists
- Pred(i) is the set of all neurons j for which a connection j → i exists
- all connections are weighted with a real number. The weight of the connection i → j is named w_ji
- all hidden and output neurons have a bias weight. The bias weight of neuron i is named w_i0

Multi layer perceptrons: variables for calculation
- hidden and output neurons have a variable net_i ("network input")
- all neurons have a variable a_i ("activation" / "output")
applying a pattern x = (x_1, ..., x_n)^T to the MLP:
- for each input neuron, the respective element of the input pattern is presented, i.e. a_i ← x_i
- for all hidden and output neurons i: after the values a_j have been calculated for all predecessors j ∈ Pred(i), calculate net_i and a_i as:
  net_i ← w_i0 + Σ_{j ∈ Pred(i)} (w_ij · a_j)
  a_i ← f_log(net_i)
- the network output is given by the a_i of the output neurons

Algorithm (forward pass):
Require: pattern x, MLP, enumeration of all neurons in topological order
Ensure: calculate output of MLP
1: for all input neurons i do
2:   set a_i ← x_i
3: end for
4: for all hidden and output neurons i in topological order do
5:   set net_i ← w_i0 + Σ_{j ∈ Pred(i)} w_ij · a_j
6:   set a_i ← f_log(net_i)
7: end for
8: for all output neurons i do
9:   assemble a_i in output vector y
10: end for
11: return y
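As an illustration, here is a minimal Python sketch of this forward pass (not from the slides). It assumes a hand-rolled graph representation: predecessor lists pred[i], weights w[(i, j)] for the connection j → i (w_ij in the slide notation), and bias weights bias[i]; the neuron numbering and example weights are made up for illustration.

import math

def f_log(x):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-x))

def forward_pass(x, pred, w, bias, input_neurons, hidden_and_output, output_neurons):
    """Forward pass of an MLP whose hidden/output neurons are given in topological order."""
    a = {}
    # present the pattern to the input neurons
    for i, xi in zip(input_neurons, x):
        a[i] = xi
    # propagate through hidden and output neurons
    for i in hidden_and_output:
        net_i = bias[i] + sum(w[(i, j)] * a[j] for j in pred[i])
        a[i] = f_log(net_i)
    # assemble the network output
    return [a[i] for i in output_neurons]

if __name__ == "__main__":
    # tiny example: inputs are neurons 1 and 2, neuron 3 is hidden, neuron 4 is the output
    pred = {3: [1, 2], 4: [3]}
    w = {(3, 1): 0.5, (3, 2): -0.4, (4, 3): 1.2}
    bias = {3: 0.1, 4: -0.3}
    y = forward_pass([1.0, 0.0], pred, w, bias,
                     input_neurons=[1, 2],
                     hidden_and_output=[3, 4],
                     output_neurons=[4])
    print(y)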

Variants of the activation function:
Neurons with logistic activation can only output values between 0 and 1. To enable outputs in a wider range of real numbers, variants are used:
- neurons with tanh activation function:
  a_i = tanh(net_i) = (e^(net_i) - e^(-net_i)) / (e^(net_i) + e^(-net_i))
- neurons with linear activation:
  a_i = net_i
(figure: plot comparing f_log(2x), tanh(x) and linear activation)
- the calculation of the network output works as in the case of logistic activation, except that the relationship between net_i and a_i is different
- the activation function is a local property of each neuron
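For reference, the three activation functions as plain Python (nothing beyond the formulas above; the plot legend compares f_log(2x) and tanh(x), which are related by tanh(x) = 2·f_log(2x) - 1).

import math

def logistic(net):
    """Logistic activation: output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

def tanh_act(net):
    """tanh activation: output in (-1, 1)."""
    return (math.exp(net) - math.exp(-net)) / (math.exp(net) + math.exp(-net))

def linear(net):
    """Linear activation: output unrestricted."""
    return net

if __name__ == "__main__":
    for net in (-2.0, 0.0, 2.0):
        print(net, logistic(net), tanh_act(net), linear(net))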

Typical network topologies:
- for regression: output neurons with linear activation
- for classification: output neurons with logistic/tanh activation
- all hidden neurons with logistic activation
- layered layout: input layer → first hidden layer → second hidden layer → ... → output layer, with connections from each neuron in layer i to each neuron in layer i+1, and no shortcut connections (a vectorized sketch of this layout follows below)
Lemma: Any boolean function can be realized by a MLP with one hidden layer. Any bounded continuous function can be approximated with arbitrary precision by a MLP with one hidden layer.
Proof: given by Cybenko (1989). Idea: partition the input space into small cells.
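The layered layout without shortcuts can be written with one weight matrix per layer. Below is a small vectorized sketch (numpy, the layer sizes and random weights are assumptions of this example): logistic hidden layers and a linear output layer, as typically used for regression.

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def layered_forward(x, weights, biases):
    """Forward pass of a layered MLP without shortcut connections.

    weights[k] connects layer k to layer k+1 (shape: n_{k+1} x n_k),
    biases[k] holds the bias weights of layer k+1. Hidden layers use the
    logistic activation, the output layer is linear (regression setting).
    """
    a = np.asarray(x, dtype=float)
    for k, (W, b) in enumerate(zip(weights, biases)):
        net = W @ a + b
        is_output_layer = (k == len(weights) - 1)
        a = net if is_output_layer else logistic(net)
    return a

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 3 inputs, one hidden layer with 5 neurons, 2 outputs
    weights = [rng.normal(size=(5, 3)), rng.normal(size=(2, 5))]
    biases = [rng.normal(size=5), rng.normal(size=2)]
    print(layered_forward([0.2, -1.0, 0.5], weights, biases))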

Optimization theory
discusses mathematical problems of the form:
  min_u f(u)
u can be any vector of suitable size. But which u solves this task, and how can we calculate it?

Optimization theory
- A global minimum u* is a point such that f(u*) ≤ f(u) for all u.
- A local minimum u+ is a point such that there exists r > 0 with f(u+) ≤ f(u) for all points u with ||u - u+|| < r.
(figure: one-dimensional function with several local minima and one global minimum)

Optimization theory: analytical way to find a minimum
For a local minimum u+, the gradient of f becomes zero:
  ∂f/∂u_i (u+) = 0   for all i
Hence, calculating all partial derivatives and looking for zeros is a good idea (cf. linear regression).
But: there are also other points for which ∂f/∂u_i = 0, and resolving these equations is often not possible.
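A tiny worked example (not from the slides) of this analytical approach, for a quadratic where the equations can actually be solved:

% Worked example: f(u_1, u_2) = (u_1 - 1)^2 + (u_2 + 2)^2
\[
\frac{\partial f}{\partial u_1} = 2(u_1 - 1) = 0, \qquad
\frac{\partial f}{\partial u_2} = 2(u_2 + 2) = 0
\]
% Both partial derivatives vanish only at u^+ = (1, -2)^T, which here is also
% the global minimum. For general f the resulting equations are usually not
% solvable in closed form, and points with zero gradient may also be maxima
% or saddle points.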

Optimization theory: numerical way to find a minimum, searching
Assume we are starting at a point u. Which is the best direction to search for a point v with f(v) < f(u)?
- if the slope is negative (descending), go right
- if the slope is positive (ascending), go left
Which is the best stepwidth?
- if the slope is small, take a small step
- if the slope is large, take a large step
general principle:
  v_i ← u_i - ε · ∂f/∂u_i
ε > 0 is called the learning rate.
(figures: one-dimensional function with search point u, illustrating search direction and stepwidth)

Gradient descent approach:
Require: mathematical function f, learning rate ε > 0
Ensure: returned vector is close to a local minimum of f
1: choose an initial point u
2: while grad f(u) not close to 0 do
3:   u ← u - ε · grad f(u)
4: end while
5: return u
open questions:
- how to choose the initial u
- how to choose ε
- does this algorithm really converge?
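A minimal Python sketch of this procedure (not from the slides), assuming grad f is available as a function; the stopping tolerance and iteration cap are illustrative additions.

import numpy as np

def gradient_descent(f_grad, u0, eps=0.1, tol=1e-6, max_iter=10_000):
    """Plain ("vanilla") gradient descent.

    f_grad   -- function returning grad f(u) as a numpy array
    u0       -- initial point
    eps      -- learning rate (epsilon > 0)
    tol      -- stop when the gradient norm is close to 0
    max_iter -- safety bound, since convergence is not guaranteed
    """
    u = np.asarray(u0, dtype=float)
    for _ in range(max_iter):
        g = f_grad(u)
        if np.linalg.norm(g) < tol:
            break
        u = u - eps * g
    return u

if __name__ == "__main__":
    # minimize f(u) = (u_1 - 1)^2 + (u_2 + 2)^2, grad f = (2(u_1 - 1), 2(u_2 + 2))
    grad = lambda u: np.array([2.0 * (u[0] - 1.0), 2.0 * (u[1] + 2.0)])
    print(gradient_descent(grad, u0=[0.0, 0.0], eps=0.1))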

Gradient descent: the choice of ε is crucial
- only small ε guarantee convergence
- for small ε, learning may take very long
- it depends on the scaling of f: an optimal learning rate for f may lead to divergence for 2·f
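To see the scaling effect concretely, a short calculation (not from the slides) for a one-dimensional quadratic:

% Illustration: f(u) = a u^2 / 2 with a > 0; the gradient descent update is
\[
u_{t+1} = u_t - \epsilon\, f'(u_t) = (1 - \epsilon a)\, u_t .
\]
% The iteration converges to the minimum u^* = 0 iff |1 - \epsilon a| < 1,
% i.e. iff 0 < \epsilon < 2/a. For the scaled function 2f the condition
% becomes 0 < \epsilon < 1/a, so any \epsilon with 1/a < \epsilon < 2/a
% converges for f but diverges for 2f.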

Gradient descent: some more problems
- flat spots and steep valleys: we need a larger ε at a point u to jump over an uninteresting flat area, but a smaller ε at a point v in a steep valley to meet the minimum
- zig-zagging in higher dimensions: a single ε is not appropriate for all dimensions
(figure: one-dimensional function with a flat spot near u and a steep valley near v)

Gradient descent: conclusion
Pure gradient descent is a nice theoretical framework but of limited power in practice. Finding the right ε is annoying, and approaching the minimum is time consuming.

Gradient descent with adaptive, dimension-wise learning rates:
1: choose an initial point u
2: set initial learning rates ε_i
3: set former gradient γ ← 0
4: while grad f(u) not close to 0 do
5:   calculate gradient g ← grad f(u)
6:   for all dimensions i do
7:     ε_i ← η+ · ε_i   if g_i · γ_i > 0
           η- · ε_i   if g_i · γ_i < 0
           ε_i        otherwise
8:     u_i ← u_i - ε_i · g_i
9:   end for
10:  γ ← g
11: end while
12: return u
η+ ≥ 1 and η- ≤ 1 are additional parameters that have to be adjusted by hand. For η+ = η- = 1 we get vanilla gradient descent.
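A minimal Python sketch of this dimension-wise learning-rate adaptation (not from the slides); the gradient function, parameter values and the badly scaled quadratic test problem are illustrative assumptions.

import numpy as np

def adaptive_lr_descent(f_grad, u0, eps0=0.1, eta_plus=1.1, eta_minus=0.5,
                        tol=1e-6, max_iter=10_000):
    """Gradient descent with one adaptive learning rate per dimension.

    eps_i grows (factor eta_plus) while the sign of the partial derivative
    stays the same and shrinks (factor eta_minus) when the sign flips; the
    update itself still uses the gradient magnitude.
    """
    u = np.asarray(u0, dtype=float)
    eps = np.full_like(u, eps0)
    gamma = np.zeros_like(u)               # former gradient
    for _ in range(max_iter):
        g = f_grad(u)
        if np.linalg.norm(g) < tol:
            break
        same_sign = g * gamma > 0
        flipped = g * gamma < 0
        eps = np.where(same_sign, eta_plus * eps,
              np.where(flipped, eta_minus * eps, eps))
        u = u - eps * g
        gamma = g
    return u

if __name__ == "__main__":
    # gradient of f(u) = (u_1 - 1)^2 + 10 (u_2 + 2)^2
    grad = lambda u: np.array([2.0 * (u[0] - 1.0), 20.0 * (u[1] + 2.0)])
    print(adaptive_lr_descent(grad, u0=[0.0, 0.0]))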

Gradient descent with adaptive, dimension-wise step lengths Δ_i, using only the sign of the gradient:
1: choose an initial point u
2: set initial step lengths Δ_i
3: set former gradient γ ← 0
4: while grad f(u) not close to 0 do
5:   calculate gradient g ← grad f(u)
6:   for all dimensions i do
7:     Δ_i ← η+ · Δ_i   if g_i · γ_i > 0
           η- · Δ_i   if g_i · γ_i < 0
           Δ_i        otherwise
8:     u_i ← u_i - Δ_i   if g_i > 0
           u_i + Δ_i   if g_i < 0
           u_i         otherwise
9:   end for
10:  γ ← g
11: end while
12: return u
η+ ≥ 1 and η- ≤ 1 are additional parameters that have to be adjusted by hand. For MLPs, good heuristics exist for the parameter settings: η+ = 1.2, η- = 0.5, initial Δ_i = 0.1.
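This sign-based scheme corresponds to the Rprop (resilient backpropagation) update rule. A minimal Python sketch (not from the slides), again with an assumed gradient function and an illustrative test problem:

import numpy as np

def rprop(f_grad, u0, delta0=0.1, eta_plus=1.2, eta_minus=0.5,
          tol=1e-6, max_iter=10_000):
    """Sign-based gradient descent with adaptive step lengths (Rprop-style).

    Only the sign of each partial derivative determines the direction of the
    update; the step length delta_i is adapted per dimension depending on
    whether the sign of the partial derivative changed since the last step.
    """
    u = np.asarray(u0, dtype=float)
    delta = np.full_like(u, delta0)
    gamma = np.zeros_like(u)              # former gradient
    for _ in range(max_iter):
        g = f_grad(u)
        if np.linalg.norm(g) < tol:
            break
        same_sign = g * gamma > 0
        flipped = g * gamma < 0
        delta = np.where(same_sign, eta_plus * delta,
                np.where(flipped, eta_minus * delta, delta))
        u = u - np.sign(g) * delta        # move against the gradient sign only
        gamma = g
    return u

if __name__ == "__main__":
    # same badly scaled quadratic as before: the step lengths adapt per dimension
    grad = lambda u: np.array([2.0 * (u[0] - 1.0), 20.0 * (u[1] + 2.0)])
    print(rprop(grad, u0=[0.0, 0.0]))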

Calculating partial derivatives
The calculation of the network output of a MLP is done step-by-step: neuron i uses the outputs of the neurons j ∈ Pred(i) as arguments and calculates some output which in turn serves as an argument for all neurons j ∈ Succ(i).
Hence: apply the chain rule!

Calculating partial derivatives: calculations within a neuron i
Assume we already know ∂e/∂a_i.
Observation: e depends indirectly on a_i, and a_i depends on net_i. Apply the chain rule:
  ∂e/∂net_i = (∂e/∂a_i) · (∂a_i/∂net_i)
What is ∂a_i/∂net_i?

Calculating partial derivatives: ∂a_i/∂net_i
a_i is calculated as a_i = f_act(net_i), where f_act is the activation function. Hence:
  ∂a_i/∂net_i = ∂f_act(net_i)/∂net_i = f'_act(net_i)
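For the activation functions used earlier, the derivatives work out as follows (standard calculus, not spelled out on the slide):

% logistic: f_log(x) = 1 / (1 + e^{-x})
\[
f_{\log}'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = f_{\log}(x)\,\bigl(1 - f_{\log}(x)\bigr)
\]
% tanh:
\[
\tanh'(x) = 1 - \tanh^2(x)
\]
% linear activation: a_i = net_i, so the derivative is constantly 1.
% Note that f_log' and tanh' can be computed directly from the already known
% activation a_i, which backpropagation implementations exploit.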

Back to MLP training: bringing the building blocks of MLP learning together
- we can calculate ∂E/∂w_ij
- we have discussed methods to minimize a differentiable mathematical function

Generic MLP learning algorithm (learning by epoch):
1: choose an initial weight vector w
2: initialize the minimization approach
3: while the error did not converge do
4:   for all (x, d) ∈ D do
5:     apply x to the network and calculate the network output
6:     calculate ∂e(x)/∂w_ij for all weights w_ij
7:   end for
8:   calculate ∂E(D)/∂w_ij for all weights w_ij, summing over all training patterns
9:   perform one update step of the minimization approach
10: end while
learning by epoch: all training patterns are considered for one update step of function minimization
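A compact, runnable sketch of this batch ("learning by epoch") scheme (not from the slides), for a network with one logistic hidden layer, linear outputs and squared error e(x) = 1/2 ||y - d||^2; the layer size, learning rate and the toy sin(x) data are illustrative assumptions, and vanilla gradient descent stands in for the generic minimization approach.

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_by_epoch(patterns, targets, n_hidden=5, eps=0.002, n_epochs=5000, seed=0):
    """Batch MLP training: accumulate dE(D)/dw over all patterns, then update."""
    rng = np.random.default_rng(seed)
    n_in, n_out = patterns.shape[1], targets.shape[1]
    W1 = rng.normal(scale=0.5, size=(n_hidden, n_in)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=(n_out, n_hidden)); b2 = np.zeros(n_out)

    for _ in range(n_epochs):
        gW1 = np.zeros_like(W1); gb1 = np.zeros_like(b1)
        gW2 = np.zeros_like(W2); gb2 = np.zeros_like(b2)
        # loop over all (x, d) in D and sum the per-pattern gradients de(x)/dw
        for x, d in zip(patterns, targets):
            h = logistic(W1 @ x + b1)              # forward pass
            y = W2 @ h + b2
            delta_out = y - d                      # de/dnet for linear output neurons
            delta_hid = (W2.T @ delta_out) * h * (1.0 - h)
            gW2 += np.outer(delta_out, h); gb2 += delta_out
            gW1 += np.outer(delta_hid, x); gb1 += delta_hid
        # one update step of the minimization approach (vanilla gradient descent)
        W1 -= eps * gW1; b1 -= eps * gb1
        W2 -= eps * gW2; b2 -= eps * gb2
    return W1, b1, W2, b2

if __name__ == "__main__":
    # toy regression problem: approximate sin(x) on a few points
    X = np.linspace(-2, 2, 20).reshape(-1, 1)
    D = np.sin(X)
    W1, b1, W2, b2 = train_by_epoch(X, D)
    Y = np.array([W2 @ logistic(W1 @ x + b1) + b2 for x in X])
    print("mean squared error on the training set:", np.mean((Y - D) ** 2))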

Generic MLP learning algorithm (learning by pattern):
1: choose an initial weight vector w
2: initialize the minimization approach
3: while the error did not converge do
4:   for all (x, d) ∈ D do
5:     apply x to the network and calculate the network output
6:     calculate ∂e(x)/∂w_ij for all weights w_ij
7:     perform one update step of the minimization approach
8:   end for
9: end while
learning by pattern: only one training pattern is considered for one update step of function minimization (only works with vanilla gradient descent!)

Real-world example: sales rate prediction
- Bild-Zeitung is the most frequently sold newspaper in Germany, approx. 4.2 million copies per day
- it is sold in sales outlets all over Germany, which differ in a lot of facets
- problem: how many copies are sold in which sales outlet?
- neural approach: train a neural network for each sales outlet; the network predicts next week's sales rates
- the system has been in use since the mid 1990s
