This is the sixth post of our series on classification from scratch. The latest one was on the lasso regression, which was still based on a logistic regression model, assuming that the variable of interest Y has a Bernoulli distribution. From now on, we will discuss a technique that did not originate from those probabilistic models, even if they might still have a probabilistic interpretation. Somehow. Today, we will start with neural nets.

Maybe I should start with a disclaimer. The goal is not to replicate well-designed R functions, used for predictive modeling. It is simply to get a basic understanding of what's going on.

Networks, Nodes, and Edges

First of all, neural nets are nets, or networks. I will skip the parallel with the "neural" part, because it does not help me understand what is happening (all apologies for my poor knowledge of biology and cells). So, it's about networks. Networks have nodes, and edges that connect nodes. Or maybe, to be more specific (at least it helped me understand what's going on), some sort of flow network. In such a network, we usually have sources (here multiple sources, {s1, s2, s3}), on the left, and a sink (here {t}), on the right. To continue with this metaphorical introduction, information from the sources should reach the sink. Usually, the sources are explanatory variables, {x1, ..., xp}, and the sink is our variable of interest, y. We want to create a graph from the sources to the sink, with directed edges (each with one unique direction), on which we will put weights. It is not really a flow, though, and the parallel with flow networks will stop here. For instance, the simplest network is the following one, with no layer (i.e., no node between the sources and the sink). The output here is a binary variable, $y\in\{0,1\}$.

It can also be $y\in\{-1,+1\}$, but here, it's not a big deal. In our network, our output will be $y\in(0,1)$, because it is easier to handle. For instance, consider $y=f(\text{something})$, for some function $f$ taking values in $(0,1)$. One can consider the sigmoid function

$$f(x)=\frac{1}{1+e^{-x}}=\frac{e^{x}}{e^{x}+1}$$

This is actually the logistic function (so we should not be surprised to have results somehow close to the logistic regression...). This function $f$ is called the activation function, and there are thousands of such functions. If $y\in\{-1,+1\}$, people consider the hyperbolic tangent

$$f(x)=\tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$$

or the inverse tangent function

$$f(x)=\tan^{-1}(x).$$

And as input for such a function, we consider a weighted sum of the incoming nodes. So here,

$$y_i=f\left(\sum_{j=1}^p \omega_j x_{j,i}\right).$$

We can actually also add a constant,

$$y_i=f\left(\omega_0+\sum_{j=1}^p \omega_j x_{j,i}\right).$$
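These alternative activation functions are directly available in base R; just to have them at hand (the function names are mine, for illustration only):

f_tanh = function(x) tanh(x)   # hyperbolic tangent activation, values in (-1, +1)
f_atan = function(x) atan(x)   # inverse tangent activation, values in (-pi/2, +pi/2)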

So far, we are not far away from the logistic regression. Except that our starting point was a probabilistic model, in the sense that the latter was interpreted as a probability (the probability that Y=1) and we wanted the model with the highest likelihood. But we'll talk about the selection of weights later on. First, let us construct our first (very simple) neural network. We start with the sigmoid function:

sigmoid = function(x) 1 / (1 + exp(-x))

Then consider some weights. In our model with seven explanatory variables, we need 7 weights. Or 8 if we include the constant term. Let us consider $\omega=\mathbf{1}$, i.e. all weights equal to one.
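As a minimal sketch of that computation, assuming the data sit in a data frame df whose 0/1 response is named y and whose seven explanatory variables are named x1, ..., x7 (these names are illustrative, not the ones of the original dataset):

y = df$y                                          # 0/1 response (illustrative name)
X = as.matrix(cbind(1, df[, paste0("x", 1:7)]))   # constant term + the 7 explanatory variables
weights_ones = rep(1, ncol(X))                    # all 8 weights equal to 1
p_ones = sigmoid(X %*% weights_ones)              # predicted values, using sigmoid() defined above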

That's not bad for a very first attempt. Except that we've been cheating here, since we did use...

How, for real, should we choose those weights?

Using a Loss Function

Well, if we want an "optimal" set of weights, we need to "optimize" an objective function. So we need to quantify the loss of a mistake between the prediction and the observation. Consider here a quadratic loss function

$$\ell(y,\hat{y})=(y-\hat{y})^2.$$

It might be stupid to use a quadratic loss function for a classification problem, but here, that's not the point. We just want to understand what the algorithm we use is doing, and the loss function $\ell$ is just one of its parameters. Then we want to solve

$$\boldsymbol{\omega}^\star=\underset{\boldsymbol{\omega}}{\text{argmin}}\left\{\frac{1}{n}\sum_{i=1}^n \ell\Big(y_i,\,f\big(\omega_0+\mathbf{x}_i^\top\boldsymbol{\omega}\big)\Big)\right\}$$
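In R, with the illustrative X and y defined above, this empirical loss can be coded along the following lines (a sketch, not necessarily the original code; OLS estimates are used as starting values, as mentioned below):

loss = function(weights) mean( (y - sigmoid(X %*% weights))^2 )   # empirical quadratic loss
weights_0 = lm.fit(X, y)$coefficients                             # OLS estimates, used as starting point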

Thus, consider

weights_1 = optim(weights_0,loss)$par

(where the starting point is the OLS estimate). Again, to see what's going on, let us visualize the ROC curve.
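The ROC curve itself can be computed from scratch, on a grid of thresholds; a sketch with the illustrative objects used above:

p_1 = sigmoid(X %*% weights_1)                     # predictions with the optimized weights
thresholds = seq(0, 1, by = .01)
roc = t(sapply(thresholds, function(s) c(
  FPR = mean(p_1[y == 0] > s),                     # false positive rate at threshold s
  TPR = mean(p_1[y == 1] > s))))                   # true positive rate at threshold s
plot(roc[, "FPR"], roc[, "TPR"], type = "s",
     xlab = "False positive rate", ylab = "True positive rate")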

A Single Layer

Let us add a single layer in our network. Those nodes are connected to the sources (incoming edges) on the left, and then connected to the sink, on the right. Those nodes are not inter-connected. And again, for that network, we need edges (i.e., series of weights). For instance, in the network above, we added one single layer, with (only) three nodes.

For such a network, the prediction formula is

$$p(\mathbf{x})=f\left(\sum_{h=1}^3 \omega_h\, f\left(\omega_{h,0}+\sum_{j=1}^p \omega_{h,j}x_j\right)\right)$$

or, more synthetically,

$$p(\mathbf{x})=f\left(\sum_{h=1}^3 \omega_h\, f\big(\omega_{h,0}+\mathbf{x}^\top\boldsymbol{\omega}_h\big)\right)$$

Usually, we consider the same activation function everywhere. Don't ask me why; I find that weird.
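To make that formula concrete, here is a small R sketch of the corresponding prediction function, for one hidden layer with three nodes; the way the 27 weights are stacked in a single vector is my own convention, purely for illustration:

# weights = (omega_{1,0},...,omega_{1,7}, omega_{2,0},...,omega_{3,7}, omega_1, omega_2, omega_3)
nnet_predict = function(weights, X) {
  W_hidden = matrix(weights[1:24], nrow = 8)   # 8 weights (constant + 7 variables) per hidden node
  w_out = weights[25:27]                       # the 3 weights from the hidden nodes to the sink
  Z = sigmoid(X %*% W_hidden)                  # outputs of the three hidden nodes
  sigmoid(Z %*% w_out)                         # output of the network
}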

Now, we have a lot of weights to choose. Let us again use OLS estimates:
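For instance, something along the following lines, where each hidden node is fed by a (hypothetical) subset of the explanatory variables and the OLS coefficients give that node's weights (again using the illustrative df and variable names from above; the actual subsets are not the point here):

w_1 = lm(y ~ x1 + x2 + x3, data = df)$coefficients   # weights feeding hidden node 1
w_2 = lm(y ~ x3 + x4 + x5, data = df)$coefficients   # weights feeding hidden node 2
w_3 = lm(y ~ x5 + x6 + x7, data = df)$coefficients   # weights feeding hidden node 3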

In that case, we did specify the edges, i.e., which sources (explanatory variables) should be used for each additional node. Actually, here, other techniques could have been used, like a PCA: each node would then be one of the principal components. But we'll use that idea later on...

On Back Propagation

Now, we need some optimal selection of those weights. Observe that with only 3 nodes, there are already (7 + 1) x 3 + 3 = 27 parameters in that model! Clearly, parsimony is not the major issue when you start using neural nets! If we want to optimize all of those weights at once, the standard strategy is back propagation: compute the gradient of the empirical loss with respect to the weights, layer by layer, starting from the output, and update the weights by gradient descent.
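Here, in the spirit of this series, one can also simply hand all 27 weights to a generic optimizer rather than coding the back-propagation updates explicitly; a sketch, reusing nnet_predict() and the illustrative X and y from above:

loss_nnet = function(weights) mean( (y - nnet_predict(weights, X))^2 )   # empirical quadratic loss
weights_init = rnorm(27, 0, .1)                                          # small random starting values
opt = optim(weights_init, loss_nnet, method = "BFGS", control = list(maxit = 1000))
p_nnet = nnet_predict(opt$par, X)                                        # fitted values of the network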

Using neuralnet()

Again, for the same network structure, with one (hidden) layer, and three nodes in it.
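Something along the following lines should do it (a sketch, with the illustrative df and variable names used above; hidden = 3 asks for one hidden layer with three nodes):

library(neuralnet)
model_1 = neuralnet(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7,
                    data = df, hidden = 3)   # one hidden layer, three nodes
plot(model_1)                                # nodes, edges, and fitted weights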

Network With Multiple Layers

The good thing is that it is possible to add more layers. Like two layers. Nodes from the first layer are no longer connected with the sink, but with nodes in the second layer, and those nodes will then be connected to the sink. We now have something like

$$p(\mathbf{x})=f\left(\sum_{h} \omega_h\, z^{(2)}_h(\mathbf{x})\right)$$

where

$$z^{(2)}_h(\mathbf{x})=f\left(\omega^{(2)}_{h,0}+\sum_{k}\omega^{(2)}_{h,k}\, z^{(1)}_k(\mathbf{x})\right)\qquad\text{and}\qquad z^{(1)}_k(\mathbf{x})=f\left(\omega^{(1)}_{k,0}+\mathbf{x}^\top\boldsymbol{\omega}^{(1)}_k\right)$$

I may be rambling here (a little bit) but that's a lot of parameters. Here is the visualization of such a network:
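With neuralnet(), such a network is obtained by giving one size per hidden layer; a sketch, with (illustrative) layers of three and two nodes:

model_2 = neuralnet(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7,
                    data = df, hidden = c(3, 2))   # two hidden layers: 3 nodes, then 2
plot(model_2)                                      # visualization of the network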