Gradient Descent in Machine Learning


Before going into the details of Gradient Descent, let’s first understand what a cost function is and how it relates to a machine learning model.

In supervised learning, a machine learning algorithm builds a model by examining many examples and attempting to find a function that minimizes loss. Since the true distribution of the data the algorithm will work on is unknown, we instead measure the algorithm’s performance on a known set of data, i.e. the training dataset. This process is known as Empirical Risk Minimization.

Loss is the penalty for a bad prediction; in simple terms, it measures how well the algorithm is performing on the given dataset. Loss is thus a number that indicates how bad the model’s prediction was on a single example. If the model’s prediction is perfect, the loss is zero; otherwise the loss is greater than zero. The function responsible for calculating this penalty is generally referred to as the cost function.

In the above image, the farther a point is from the straight red line, the higher the error in the predicted value with respect to the ground-truth value.

In this blog post we will discuss one of the most popular error/loss functions, Mean Squared Error.

Mean Squared Error (MSE) is calculated by taking the difference between the predicted output and the ground-truth output, squaring that difference, and then averaging it across the whole dataset.

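$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$

Here N is the number of samples in the dataset, y_i is the ground-truth output for sample i, and ŷ_i is the predicted output for that sample.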
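To make the calculation concrete, here is a minimal sketch of MSE in plain Python. The function name `mean_squared_error` and the sample values are assumptions made up for illustration; they are not taken from any particular library or dataset.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between ground truth and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)  # N, the number of samples
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


# Hypothetical values, chosen only for illustration.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# Squared differences are 1, 0 and 4, so the result is 5 / 3 (about 1.667).
print(mean_squared_error(y_true, y_pred))
```

A perfect prediction gives an MSE of zero, and because the differences are squared, a few large errors raise the value much more than many small ones.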

The primary goal of training a machine learning model is to find a set of weights and biases that has low loss, on average, across all examples, i.e. to minimize the cost function.

This matters because a lower error between the actual and predicted values indicates that the model has learnt well, i.e. it has fit the training dataset well.

How to minimize the cost function?

Gradient Descent, in simple terms, is an algorithm that minimizes a function. The general idea is to tweak the parameters (weights and biases) iteratively in order to minimize a cost function. Suppose you are standing at the top of a mountain and want to get down to the bottom of the valley as quickly as possible; a good strategy is to go downhill in the direction of the steepest slope. This is exactly how gradient descent works: it measures the local gradient of the loss function with respect to the parameters and moves in the direction of the negative gradient. Once the gradient is zero, you have reached a minimum.

Concretely, start by initializing the parameters with random values and then improve them gradually, taking one small step at a time, each step attempting to decrease the cost function (e.g. MSE), until the algorithm converges to a minimum.

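The procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the model is a single-feature line y = w * x + b, the cost is MSE, and the function name `gradient_descent`, the toy data, the step count and the default learning rate are all made up for this example. The parameters start at zero rather than at random values so that the run is reproducible.

```python
def gradient_descent(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y = w * x + b by minimizing MSE with plain (batch) gradient descent."""
    w, b = 0.0, 0.0  # fixed start for reproducibility; a random start also works
    n = len(xs)
    for _ in range(n_steps):
        # Prediction error for every sample with the current parameters.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Take one small step in the direction of the negative gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = gradient_descent(xs, ys)
print(w, b)  # should end up very close to 2 and 1
```

Each pass through the loop computes the gradient over the whole dataset and then nudges both parameters slightly downhill, which is the "one small step at a time" idea from the text.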

An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, the algorithm has to go through many iterations to converge, which takes a long time. On the other hand, if the learning rate is too high, we might jump across the valley and end up on the other side, possibly even higher than before. This can make the algorithm diverge, with larger and larger values, failing to find a good solution.
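The effect of the learning rate can be seen on a deliberately simple function. The sketch below assumes the cost is f(w) = w², whose gradient is 2w and whose minimum is at w = 0; the function name `minimize_quadratic`, the starting point and the three learning rates are hypothetical choices for illustration only.

```python
def minimize_quadratic(learning_rate, start=5.0, n_steps=50):
    """Run gradient descent on f(w) = w**2 and return the final w."""
    w = start
    for _ in range(n_steps):
        gradient = 2.0 * w            # derivative of w**2
        w -= learning_rate * gradient
    return w


for lr in (0.001, 0.1, 1.1):
    final_w = minimize_quadratic(lr)
    print(f"learning rate {lr}: w after 50 steps = {final_w:.4g}")
```

With the smallest rate, w has barely moved from its starting value of 5 after 50 steps. With the middle rate it is very close to the minimum at 0. With the largest rate every step overshoots to the other side of the valley and lands farther away than before, so the value keeps growing instead of converging.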