As a newcomer to neural networks, I started, like everyone, from the basics (perceptrons, MLPs) and how backpropagation works before diving into harder deep learning concepts.

Now I am trying to solve exercises (theoretical and practical) in order to practice, and I have two questions.

1) I read that the Perceptron (single neuron) relates to a logistic regression classifier; in particular, that a (binary) logistic regression classifier is the same as a single-neuron Perceptron with a sigmoid activation function. Is this true? How can I prove it? Any link or worked example would help.

2) A multi-layer NN without activation functions is equivalent to applying a linear transformation to the input data. Can someone explain why this is true, ideally with an example?

1 Answer

1- We have 2 logits before the softmax, $x=[x_0,x_1]$, where $x_0=w_0a+b_0$ and $x_1=w_1a+b_1$ ($a$ is the input vector). The softmax has 2 outputs: $S(x) = [S(x)_0, S(x)_1]$. In this binary case, we can make a prediction just by checking whether $S(x)_0$ is greater than $0.5$ or not, so we only need the value of a single neuron (e.g. $S(x)_0$). If we show that $S(x)_0$ can be computed with a single sigmoid, we are done:
$$
S(x)_0 = \dfrac{e^{x_0}}{e^{x_0}+e^{x_1}}= \dfrac{1}{1+\frac{e^{x_1}}{e^{x_0}}}= \dfrac{1}{1+e^{-(x_0-x_1)}}=\sigma(z)\quad\text{(where $z=x_0-x_1$)}
$$

But $z$ is still a function of both $x_0$ and $x_1$ (2 neurons). Let's write $z$ in a different way. Remember that $x_0=w_0a+b_0$ and $x_1=w_1a+b_1$, thus:
$$z=x_0-x_1=(w_0-w_1)a+(b_0-b_1)=w'a+b'$$

So all you need to do is use a single weight vector $w'=w_0-w_1$ and a single bias $b'=b_0-b_1$ to calculate $z$, and then proceed to take the sigmoid of $z$.
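A quick numerical check of this equivalence (a minimal NumPy sketch; the input $a$, weights, and biases below are made-up random values, not from the question):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(x):
    e = np.exp(x - x.max())  # shift logits for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
a = rng.normal(size=4)                      # arbitrary input vector
w0, w1 = rng.normal(size=4), rng.normal(size=4)
b0, b1 = 0.3, -0.7

x = np.array([w0 @ a + b0, w1 @ a + b1])    # the two logits x_0, x_1
z = (w0 - w1) @ a + (b0 - b1)               # collapsed single-neuron logit z = x_0 - x_1

print(softmax(x)[0], sigmoid(z))            # both print the same probability
```

Running this prints the same number twice: the two-logit softmax output $S(x)_0$ and the single-neuron sigmoid $\sigma(z)$ agree exactly.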

2- I'll use a simple 3-layer neural network without activation functions as an example, so that it is easier to see. $X$ is our input matrix where each column contains an input vector, and $W_i$ and $A_i$ denote the weight matrix and outputs of the $i$'th layer respectively:
$$
A_1 = W_1X\\
A_2 = W_2A_1\\
A_3 = W_3A_2
$$
Then we can simply write $A_3$ as:
$$
A_3 = W_{3}(A_2) = W_{3}(W_{2}A_1)= W_{3}(W_{2}(W_1X))=(W_{3}W_{2}W_1)X= W'X
$$
So instead of using a 3-layer neural network without non-linearities, you can just apply a single linear transformation $W'$ to your inputs. (The same argument works if the layers have biases: the biases collapse into a single bias vector, so the network is still just one affine map.)
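To see this concretely, here is a small NumPy sketch (layer shapes and random weights are made-up for illustration) that checks the three stacked linear layers against the single collapsed matrix $W'=W_3W_2W_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))          # 5 input features, 8 examples as columns
W1 = rng.normal(size=(6, 5))         # layer widths are arbitrary
W2 = rng.normal(size=(4, 6))
W3 = rng.normal(size=(3, 4))

A3 = W3 @ (W2 @ (W1 @ X))            # forward pass through 3 linear layers
W_prime = W3 @ W2 @ W1               # collapsed single linear map W'
print(np.allclose(A3, W_prime @ X))  # True: the two computations agree
```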