
Before we get into the nitty-gritty, here is the vanilla gradient descent update:

Fig. Gradient Descent

# Vanilla update
w += -learning_rate * dw

Here, learning_rate is a constant hyperparameter. The idea is to keep it small enough that the update does not overshoot the minimum.
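To see the vanilla update in action, here is a minimal sketch that runs it on an illustrative quadratic loss f(w) = (w - 3)², whose gradient is dw = 2(w - 3) (the loss function and starting point are assumptions for the demo, not from the article):

```python
# Vanilla gradient descent on f(w) = (w - 3)**2.
learning_rate = 0.1
w = 0.0  # starting point (illustrative)

for _ in range(100):
    dw = 2 * (w - 3)          # gradient of the loss at w
    w += -learning_rate * dw  # vanilla update

# w has converged toward the minimum at w = 3
```

Each step moves w a fraction of the gradient downhill; with a small enough learning_rate the error shrinks geometrically toward the minimum.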

1. Simple momentum update

The physics class has started. Well, this is how it goes.


Think of the loss surface as hilly roller-coaster terrain; a particle rolling on it has potential energy U.

U(potential energy) = mgh

This simply implies that U (energy) ∝ h (height). That is exactly what we want: when the particle is high on the slope, it should move toward the bottom faster, and near the bottom of the curve it should slow down so it does not miss the minimum.

The force on the particle is F = ma, directed along the negative gradient.
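In code, this becomes the standard momentum update. The sketch below runs it on the same illustrative quadratic loss f(w) = (w - 3)² (the loss and starting point are assumptions for the demo; dw is the gradient at the current weights):

```python
# Momentum update on f(w) = (w - 3)**2 (illustrative loss).
learning_rate = 0.1
mu = 0.9          # momentum coefficient ("friction")
w, v = 0.0, 0.0   # v is the velocity, initialized at zero

for _ in range(500):
    dw = 2 * (w - 3)                 # gradient of the loss at w
    v = mu * v - learning_rate * dw  # integrate velocity; mu damps it
    w += v                           # integrate position
```

Note that the gradient now changes the velocity rather than the position directly; the position changes by the accumulated velocity, which is what lets the particle pick up speed on long slopes.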

Here, v is the velocity of the particle, initialized at zero (it starts at rest at the top of the hill). mu is referred to as momentum. Think of it as a coefficient of friction that counteracts v as the particle heads toward the bottom. Its value usually lies between 0.1 and 0.9 (typically 0.9). This variable damps the energy of the system, allowing v to come to a stop.

Sometimes the value of mu is increased from 0.5 to 0.9 over the course of training to further optimize; this provides a relatively small boost to the speed of the system.

2. Nesterov momentum

This is a distant cousin of the normal momentum update, but it is quite popular owing to its consistency in reaching the minimum and the speed at which it does so.

A car going 60 km/h in a straight line will end up 60 km from its origin after an hour.

So the core concept of Nesterov momentum is that if you know an object's velocity and direction, you can predict its location at time T.

(left) The old way: instead of following the gradient step, the movement sometimes heads in a different direction, wasting time. (right) Nesterov momentum calculates the step that will be taken in the future and applies the corrective action.

Say the current position is x and the velocity is mu * v. The predicted position at the next step (time t) is x + mu * v. We can use this as a 'lookahead', a future prediction for the point, and evaluate the gradient there so that the update moves us toward the right position.
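The lookahead idea can be sketched as follows, again on an illustrative quadratic loss f(w) = (w - 3)² (the loss, starting point, and grad helper are assumptions for the demo):

```python
# Nesterov momentum on f(w) = (w - 3)**2 (illustrative loss).
learning_rate = 0.1
mu = 0.9
w, v = 0.0, 0.0

def grad(w):
    return 2 * (w - 3)  # gradient of the illustrative loss

for _ in range(500):
    w_ahead = w + mu * v                        # lookahead position
    v = mu * v - learning_rate * grad(w_ahead)  # gradient at the lookahead
    w += v
```

The only difference from plain momentum is where the gradient is evaluated: at the predicted future position w + mu * v instead of at the current one, which is what supplies the corrective action described above.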