Linear Regression 4 – Learning Rate and Initial Weights

Choosing Learning Rate

We introduced an important parameter, the learning rate α, in Linear Regression 2 – Gradient Descent without discussing how to choose its value. In fact, the choice of learning rate significantly affects the performance of the algorithm. It determines the convergence speed of gradient descent, that is, the number of iterations needed to reach the minimum. The figures below, which we call learning graphs, show how different learning rates affect the speed of the algorithm.

Figure (a) shows gradient descent with a small learning rate (α = 0.03). The algorithm takes a long time to converge because the weights change only slightly in each iteration. Figure (b) shows that a larger learning rate results in faster convergence, with fewer steps needed to reach the optimum.
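The effect on convergence speed can be reproduced with a minimal sketch. The synthetic data (y = 1 + 2x), the function name gradient_descent, the stopping tolerance, and the larger rate α = 0.3 are all assumptions made for illustration; only α = 0.03 comes from the figure above.

```python
import numpy as np

def gradient_descent(X, y, alpha, max_iters=10000, tol=1e-9):
    """Batch gradient descent for linear regression.

    Returns the final weights and the cost recorded at each iteration.
    Stops early once the cost changes by less than `tol` between iterations.
    """
    m, n = X.shape
    w = np.zeros(n)
    costs = []
    for _ in range(max_iters):
        error = X @ w - y
        costs.append(float(error @ error) / (2 * m))
        # Move the weights against the gradient of the cost
        w = w - alpha * (X.T @ error) / m
        if len(costs) > 1 and abs(costs[-2] - costs[-1]) < tol:
            break
    return w, costs

# Synthetic data (assumed for illustration): y = 1 + 2x with x in [0, 1]
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])  # first column is the bias term
y = 1 + 2 * x

_, costs_small = gradient_descent(X, y, alpha=0.03)
_, costs_large = gradient_descent(X, y, alpha=0.3)

print("iterations with alpha=0.03:", len(costs_small))
print("iterations with alpha=0.3: ", len(costs_large))
```

Plotting costs_small and costs_large against the iteration number gives learning graphs of the kind described above.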

In fact, the learning rate not only determines the convergence speed; it can also determine whether the algorithm converges at all. If the learning rate is too large, gradient descent makes a large update in each iteration and jumps from one side of the error surface to the other. The minimum is overshot, the cost can increase, and the algorithm may never converge, as shown in the figure below.
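Divergence is easy to observe numerically. The sketch below is an illustration under assumed conditions: the synthetic data, the helper name cost_history, and the two rates (0.3 and 2.0) are hypothetical choices, with 2.0 picked to be too large for this particular data set.

```python
import numpy as np

def cost_history(X, y, alpha, iters=20):
    """Run a fixed number of gradient descent steps and record the cost."""
    m, n = X.shape
    w = np.zeros(n)
    costs = []
    for _ in range(iters):
        error = X @ w - y
        costs.append(float(error @ error) / (2 * m))
        w = w - alpha * (X.T @ error) / m
    return costs

# Synthetic data (assumed for illustration): y = 1 + 2x with x in [0, 1]
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x

stable = cost_history(X, y, alpha=0.3)    # cost shrinks over the iterations
unstable = cost_history(X, y, alpha=2.0)  # cost grows: each step overshoots

print("alpha=0.3: first cost %.4f, last cost %.4f" % (stable[0], stable[-1]))
print("alpha=2.0: first cost %.4f, last cost %.4g" % (unstable[0], unstable[-1]))
```

A cost that rises from one iteration to the next is the practical warning sign that α should be reduced.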

How should we choose the learning rate so that gradient descent converges quickly? Unfortunately, this is an empirical process, and the usual practice is trial and error. We can start with a small learning rate to ensure convergence, then increase α step by step, hoping the algorithm converges faster without misbehaving. An example range for α is from 10^-4 to 10, with each value roughly 3 times the previous one, e.g. 0.0001, 0.0003, 0.001, etc. However, the actual setting depends on the particular problem. In many situations, inspecting the learning graph is helpful.
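The trial-and-error search can be sketched as a simple sweep over the candidate values. Everything specific here is an assumption for illustration: the synthetic data, the helper name final_cost, the budget of 100 iterations, and the rule of keeping the rate with the lowest finite final cost. In practice one would also look at the learning graph for each candidate.

```python
import numpy as np

def final_cost(X, y, alpha, iters=100):
    """Cost after a fixed number of gradient descent steps (inf if it blows up)."""
    m, n = X.shape
    w = np.zeros(n)
    # Diverging runs may overflow; silence the warnings and report inf instead
    with np.errstate(all="ignore"):
        for _ in range(iters):
            w = w - alpha * (X.T @ (X @ w - y)) / m
        error = X @ w - y
        cost = float(error @ error) / (2 * m)
    return cost if np.isfinite(cost) else float("inf")

# Candidate rates from 10^-4 to 10, each roughly 3 times the previous one:
# 0.0001, 0.0003, 0.001, 0.003, ..., 1, 3, 10
alphas = [base * 10.0 ** e for e in range(-4, 1) for base in (1, 3)] + [10.0]

# Synthetic data (assumed for illustration): y = 1 + 2x with x in [0, 1]
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x

results = {alpha: final_cost(X, y, alpha) for alpha in alphas}
best_alpha = min(results, key=results.get)

for alpha in alphas:
    print("alpha=%-7g final cost=%.4g" % (alpha, results[alpha]))
print("best alpha in this sweep:", best_alpha)
```

The best value found this way is specific to the data and the iteration budget, which is why the sweep has to be repeated for each new problem.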

Choosing Initial Weights

As with the learning rate, choosing the initial weights is largely empirical; they are often chosen randomly. One suggestion is to draw them uniformly from the range [-0.2, 0.2] [1]. To make the choice of weights easier and more reasonable, we should first normalize the input features so that all input values lie in a similar range. We shall discuss this technique in an upcoming article.
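Drawing the initial weights uniformly from [-0.2, 0.2] takes only a few lines with NumPy. The function name init_weights and the seed parameter are hypothetical, added here only to make the example reproducible. Note that NumPy's uniform sampler draws from the half-open interval [low, high), which makes no practical difference for this purpose.

```python
import numpy as np

def init_weights(n_features, low=-0.2, high=0.2, seed=None):
    """Draw one initial weight per feature uniformly from [low, high)."""
    rng = np.random.default_rng(seed)
    return rng.uniform(low, high, size=n_features)

# Example: a bias term plus two input features gives three weights
w0 = init_weights(3, seed=0)
print(w0)
```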