This is a question of terminology. Sometimes I see people refer to deep neural networks as "multi-layered perceptrons". Why is this? A perceptron, I was taught, is a single-layer classifier (or regressor) with a binary threshold output, trained with a specific weight update rule (not back-prop): if the output of the perceptron doesn't match the target output, we add or subtract the input vector to/from the weights (depending on whether the perceptron gave a false positive or a false negative). It's quite a primitive machine learning algorithm, and the training procedure doesn't appear to generalize to the multi-layer case (at least not without modification). A deep neural network, by contrast, is trained via back-propagation, which uses the chain rule to propagate gradients of the cost function back through all the weights of the network.
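For concreteness, here is a minimal sketch of the update rule I mean (plain NumPy; the function name and signature are just my own illustration):

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """Classical perceptron rule: X is (n_samples, n_features), y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else -1   # binary threshold output
            if pred != yi:
                w += yi * xi                      # add or subtract the input vector
                b += yi
    return w, b
```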

So, the question is: is a "multi-layer perceptron" the same thing as a "deep neural network"? If so, why is this terminology used? It seems unnecessarily confusing. In addition, assuming the terms are somewhat interchangeable, I've only seen "multi-layer perceptron" used for a feed-forward network made up of fully connected layers (no convolutional layers or recurrent connections). How broad is this terminology? Would one use the term "multi-layer perceptron" when referring to, for example, Inception net? How about a recurrent network using LSTM modules, as used in NLP?

It's just rebranding. MLPs were hyped in the '90s and supplanted by SVMs, so they needed a different name in the 2000s. The suggestion is that DNNs have more layers, but the difference is not that big; e.g. LeNet [MLP/CNN] (1998) has 2 convolutional and 2 fully connected layers, while AlexNet = DNN (2012) has 5 convolutional and 3 fully connected layers.
– seanv507 Feb 24 at 11:52

3 Answers

One can consider the multi-layer perceptron (MLP) to be a subset of deep neural networks (DNNs), but the two terms are often used interchangeably in the literature.

The assumption that perceptrons are named based on their learning rule is incorrect: the classical "perceptron update rule" is only one of the ways a perceptron can be trained. The early rejection of neural networks stemmed from exactly this limitation, since the perceptron update rule does not extend to networks with more than one layer (and a single-layer perceptron cannot represent non-linearly-separable functions such as XOR).

Training a network with back-propagation requires differentiable activation functions, which led to the use of squashing activations such as tanh and the sigmoid in place of the hard threshold.
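A quick sketch (my own illustration, not from any particular source) of two such activations and the derivatives that back-propagation pushes through the chain rule; unlike the hard threshold, these have well-defined gradients everywhere:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # smooth, nonzero gradient for back-propagation

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2    # likewise differentiable everywhere
```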

So, to answer the questions,

Is a "multi-layer perceptron" the same thing as a "deep neural network"?

An MLP is a subset of DNNs: a DNN can have loops, while an MLP is always feed-forward. That is,

A multi-layer perceptron (MLP) is a finite acyclic graph.

why is this terminology used?

Much of the terminology used in the scientific literature has to do with the trends of its time, and some of it simply catches on.

How broad is this terminology? Would one use the term "multi-layer perceptron" when referring to, for example, Inception net? How about a recurrent network using LSTM modules, as used in NLP?

So yes, Inception net, convolutional networks, ResNet, etc. are all MLPs in this sense, because there is no cycle between connections. Even if there are shortcut connections skipping layers, as long as they point in the forward direction, the network can be called a multi-layer perceptron (see the sketch below). LSTMs and vanilla RNNs, however, have cyclic connections, and hence cannot be called MLPs, although they are a subset of DNNs.
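To illustrate the "forward direction only" point, here is a toy sketch (my own, in NumPy) of a forward pass with a shortcut connection skipping a layer; the computation graph remains acyclic:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W1 = rng.standard_normal((8, 8))
W2 = rng.standard_normal((8, 8))

h1 = np.tanh(W1 @ x)           # ordinary fully connected layer
h2 = np.tanh(W2 @ h1) + x      # shortcut connection skipping a layer: still strictly forward
```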

Just out of curiosity: I thought logistic regression is a regression technique, because you estimate the probability of class-1 membership instead of the class membership itself. As such it does not seem like a classification technique to me (the researcher/analyst has to decide on a probability cut-off in order to classify based on logistic regression).
– IWS Nov 24 '17 at 12:57

@IWS you're right. Various users on this site have repeatedly made the point that logistic regression is a model for (conditional) probability estimation, not a classifier. See for example here.
– DeltaIV Nov 24 '17 at 21:22


Edited the response to fix this. For example, "logistic regression" should not be termed a classification technique, so to speak. The link shared by @DeltaIV makes it very clear why it is a regression and not a classifier.
– m1cro1ce Nov 25 '17 at 18:25

Good question: note that in the field of Deep Learning things are not always as clear-cut and well defined as in Statistical Learning (also because there's a lot of hype), so don't expect to find definitions as rigorous as in mathematics. Anyway, the multilayer perceptron is a specific feed-forward neural network architecture, in which you stack up multiple fully connected layers (so, no convolution layers at all), where the activation functions of the hidden units are often a sigmoid or a tanh. The nodes of the output layer usually have softmax activation functions (for classification) or linear activation functions (for regression). Typical MLP architectures are not "deep", i.e., we don't have many hidden layers; you usually have, say, 1 to 5 hidden layers. These neural networks were common in the '80s, and they are trained by back-propagation.
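For concreteness, a minimal sketch of such a classical MLP forward pass (my own illustration in NumPy; one tanh hidden layer, softmax output):

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Classical MLP: one fully connected tanh hidden layer, softmax output."""
    h = np.tanh(W1 @ x + b1)             # hidden layer
    logits = W2 @ h + b2                 # output layer
    e = np.exp(logits - logits.max())    # numerically stable softmax
    return e / e.sum()
```

With `W1` of shape `(hidden, in)` and `W2` of shape `(10, hidden)`, this returns a probability vector over the 10 classes.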

Now, by Deep Neural Network we mean a network which has many layers (19, 22, 152, ... even > 1200, though that admittedly is very extreme). Note that

we haven't specified the architecture of the network, so this could be feed-forward, recurrent, etc.

we haven't specified the nature of the connections, so we could have fully connected layers, convolutional layers, recurrence, etc.

"many" layers admittedly is not a rigorous definition.

So, why does it still make sense to speak of DNNs (apart from hype reasons)? Because when you start stacking more and more layers, you actually need to use new techniques (new activation functions, new kinds of layers, new optimization strategies... even new hardware) to be able to 1) train your model and 2) make it generalize to new cases. For example, suppose you take a classical MLP for 10-class classification, with tanh activation functions, input and hidden layers with 32 units each, and an output layer with 10 softmax units $\Rightarrow 32\times32+32\times10 = 1344$ weights. You add 10 layers $\Rightarrow 11584$ weights. This is a minuscule NN by today's standards. However, when you go on to train it on a suitably large data set, you find that the convergence rate has slowed down tremendously. This is not only due to the larger number of weights, but also to the vanishing gradient problem: back-propagation computes the gradient of the loss function by multiplying error terms layer by layer, and since these factors are small, the product becomes exponentially smaller the more layers you add. Thus, the errors don't propagate (or propagate very slowly) back through your network, and it looks like the error on the training set stops decreasing with training epochs.
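If you want to check the weight counts, a quick sketch (biases ignored, as in the text):

```python
def count_weights(sizes):
    """Total weights in a fully connected stack, biases ignored."""
    return sum(a * b for a, b in zip(sizes, sizes[1:]))

shallow = [32, 32, 10]            # input, one hidden layer, 10-unit softmax output
deep = [32] + [32] * 11 + [10]    # ten extra hidden layers of 32 units each

print(count_weights(shallow), count_weights(deep))   # 1344 11584
```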

And this was a small network: AlexNet, the deep Convolutional Neural Network, had 5 convolutional and 3 fully connected layers but 60 million weights, and it's considered small by today's standards! When you have so many weights, then any data set is "small"; even ImageNet, a data set of images used for classification, has "only" about 1 million images, thus the risk of overfitting is much larger than for a shallow network.

Deep Learning can thus be understood as the set of tools which are used in practice to train neural networks with a large number of layers and weights while achieving low generalization error. This task poses more challenges than training smaller networks. You can definitely build a deep Multilayer Perceptron and train it, but (apart from the fact that it's not the optimal architecture for many tasks where Deep Learning is used today) you will probably use tools different from those used when networks were "shallow". For example, you may prefer ReLU activation units to sigmoid or tanh, because they soften the vanishing gradient problem.
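As a deliberately crude illustration of that last point (my own sketch, not a rigorous experiment; real gradients also involve the weight matrices), compare the average per-layer gradient scaling factor of tanh vs. ReLU raised to the network depth:

```python
import numpy as np

rng = np.random.default_rng(42)
z = 2.0 * rng.standard_normal(10_000)          # pre-activations, partly saturated

tanh_factor = np.mean(1.0 - np.tanh(z) ** 2)   # tanh' < 1 everywhere, ~0 when saturated
relu_factor = np.mean(z > 0)                   # ReLU' is exactly 1 on the active half

for depth in (5, 15, 30):
    # Back-propagated error is scaled by roughly factor**depth across the stack
    print(depth, tanh_factor ** depth, relu_factor ** depth)
# The tanh factor decays toward zero much faster: the vanishing gradient effect
```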

The previous answer by m1cro1ce says that a conv-net (like Inception) can also be classified as an MLP, whereas you specify that an MLP can't have convolutional layers (and it seems you're implying that the choice of activation functions also affects what can be called an MLP or not?). Is there agreement in the literature (or within the ML community) on what exactly MLP means and what it doesn't mean? If someone said to me "I want you to build an MLP for task X", what would I be restricted to doing?
– enumaris Nov 24 '17 at 19:34

@enumaris you're not restricted by law to do anything. Last time I checked, it was still legal to build a CNN and call it an MLP. I would of course reject such a paper/poster/whatever, but that's me, and I can't speak for the whole DL community, which isn't exactly famous for its strict use of terminology. Anyway, my definition (a feed-forward neural network with fully connected layers and at least some nonlinear activation function, since otherwise, no matter how many layers, it's always equivalent to a single-layer linear network) is the same as you can find in...
– DeltaIV Nov 25 '17 at 17:25

...Wikipedia. Note the line in the layers section: "Since MLPs are fully connected [..]". This rules out CNNs. You can find the same definition (feed-forward, fully connected, at least the hidden layers have nonlinear activation functions) in this book. Concerning the activation functions, I didn't mean to imply anything; I just said that MLPs usually have tanh or sigmoid activation functions, but that's not mandatory.
– DeltaIV Nov 25 '17 at 17:34

I would like to mark one of these two answers as the accepted answer, but since they give conflicting answers, I'd like to know which answer is the one more commonly found in the literature or among the ML community.
– enumaris Nov 28 '17 at 2:32

@enumaris the title of your question is "Multi-layer perceptron vs deep neural network", and you ask if a "multi-layer perceptron" is the same thing as a "deep neural network": this question has been answered in detail, both in my answer and in m1cro1ce's. Now you're asking "are CNNs a subset of MLPs?"; the Stack Exchange sites have a policy of one question per post.
– DeltaIV Nov 28 '17 at 23:57