AI News, Single-Layer Neural Networks and Gradient Descent

Single-Layer Neural Networks and Gradient Descent

This article offers a brief glimpse of the history and basic concepts of machine learning.

We will take a look at the first algorithmically described neural network and the gradient descent algorithm in context of adaptive linear neurons, which will not only introduce the principles of machine learning but also serve as the basis for modern multilayer neural networks in future articles.

The perceptron is not only the first algorithmically described learning algorithm [1], but it is also very intuitive, easy to implement, and a good entry point to the (re-discovered) modern state-of-the-art machine learning algorithms: Artificial neural networks (or “deep learning” if you like).

To put the perceptron algorithm into the broader context of machine learning: The perceptron belongs to the category of supervised learning algorithms, single-layer binary linear classifiers to be more specific.

Next, we define an activation function g(\mathbf{z}) that takes a linear combination of the input values \mathbf{x} and weights \mathbf{w} as input (\mathbf{z} = w_1x_{1} + \dots + w_mx_{m}), and if g(\mathbf{z}) is greater than a defined threshold \theta we predict 1 and -1 otherwise;

in this case, this activation function g is an alternative form of a simple “unit step function,” which is sometimes also called “Heaviside step function.” (Please note that the unit step is classically defined as being equal to 0 if % <![CDATA[ z

To summarize the main points from the previous section: A perceptron receives multiple input signals, and if the sum of the input signals exceed a certain threshold it either returns a signal or remains “silent” otherwise.

What made this a “machine learning” algorithm was Frank Rosenblatt’s idea of the perceptron learning rule: The perceptron algorithm is about learning the weights for the input signals in order to draw linear decision boundary that allows us to discriminate between the two linearly separable classes +1 and -1.

Rosenblatt’s initial perceptron rule is fairly simple and can be summarized by the following steps: The output value is the class label predicted by the unit step function that we defined earlier (output =g(\mathbf{z})) and the weight update can be written more formally as w_j := w_j + \Delta w_j.

The value for updating the weights at each increment is calculated by the learning rule where \eta is the learning rate (a constant between 0.0 and 1.0), “target” is the true class label, and the “output” is the predicted class label.

If the two classes can’t be separated by a linear decision boundary, we can set a maximum number of passes over the training dataset (“epochs”) and/or a threshold for the number of tolerated misclassifications.

Our intuition tells us that a decision boundary with a large margin between the classes (as indicated by the dashed line in the figure below) likely has a better generalization error than the decision boundary of the perceptron.

In contrast to the perceptron rule, the delta rule of the adaline (also known as Widrow-Hoff” rule or Adaline rule) updates the weights based on a linear activation function rather than a unit step function;

(The fraction \frac{1}{2} is just used for convenience to derive the gradient as we will see in the next paragraphs.) In order to minimize the SSE cost function, we will use gradient descent, a simple yet useful optimization algorithm that is often used in machine learning to find the local minimum of linear systems.

mentioned above, each update is updated by taking a step into the opposite direction of the gradient \Delta \mathbf{w} = - \eta \nabla J(\mathbf{w}), thus, we have to compute the partial derivative of the cost function for each weight in the weight vector: \Delta w_j = - \eta \frac{\partial J}{\partial w_j}.

Another advantage of online learning is that the classifier can be immediately updated as new training data arrives, e.g., in web applications, and old training data can be discarded if storage is an issue.

In later articles, we will take a look at different approaches to dynamically adjust the learning rate, the concepts of “One-vs-All” and “One-vs-One” for multi-class classification, regularization to overcome overfitting by introducing additional information, dealing with nonlinear problems and multilayer neural networks, different activation functions for artificial neurons, and related concepts such as logistic regression and support vector machines.

On Monday, January 21, 2019

Perceptron Training

Watch on Udacity: Check out the full Advanced Operating Systems course for free ..

Neural Network Training (Part 3): Gradient Calculation

In this video we will see how to calculate the gradients of a neural network. The gradients are the individual error for each of the weights in the neural network.

Perceptron Update

Professor Abbeel steps through a multi-class perceptron looking at one training data item, and updating the perceptron weight vectors.