Most of the machine learning optimization I work with involves minimizing error to find values for neural network weights and biases, but there are many kinds of ML optimization algorithms. A classical optimization technique that tends to confuse newcomers to ML involves the Hessian.

The Hessian is the matrix of all the second partial derivatives of a function. The Hessian can be used in two ways. The first is the so-called second derivative test, which determines whether a critical point is a function minimum, a maximum, or a saddle point (the test can also be inconclusive). The second way to use the Hessian is directly, to iteratively get closer and closer to a minimum error.
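To make the second derivative test concrete, here's a minimal sketch in Python. The function f and the finite-difference step h are my own illustrative choices, not anything from a particular library: the idea is to estimate every second partial derivative numerically, then look at the signs of the eigenvalues.

```python
import numpy as np

def f(w):
    # hypothetical function of two variables: f = w0^2 + 3*w1^2
    return w[0]**2 + 3.0 * w[1]**2

def hessian(func, w, h=1e-4):
    # numerically estimate the matrix of all second partial derivatives
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            wpp = w.copy(); wpp[i] += h; wpp[j] += h
            wpm = w.copy(); wpm[i] += h; wpm[j] -= h
            wmp = w.copy(); wmp[i] -= h; wmp[j] += h
            wmm = w.copy(); wmm[i] -= h; wmm[j] -= h
            H[i, j] = (func(wpp) - func(wpm) - func(wmp) + func(wmm)) / (4.0 * h * h)
    return H

H = hessian(f, np.array([0.0, 0.0]))  # (0, 0) is a critical point of f
eigvals = np.linalg.eigvalsh(H)
# second derivative test: all eigenvalues > 0 means a local minimum,
# all < 0 means a local maximum, mixed signs means a saddle point,
# and a zero eigenvalue means the test is inconclusive
print(eigvals)  # prints approximately [2. 6.] -- all positive, so a minimum
```

For this f the true Hessian is constant, with diagonal entries 2 and 6, so the eigenvalue signs correctly flag (0, 0) as a minimum.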

Suppose you want to minimize some error function E which depends on a set of weights a = (a1, a2, . . an). The classical Newton technique computes an update epsilon like so:

epsilon = -1 * H(a)^-1 * gradient(E(a))
a(new) = a + epsilon

So you’d add epsilon to the guesses a and then repeat. In words: evaluate the Hessian (all second derivatives) at the guesses a, then invert that matrix, then multiply by the gradient of the error function at a. The factor of -1 makes the update an addition instead of a subtraction.
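The iteration described above can be sketched in a few lines of Python. The quadratic error function E and its analytic gradient and Hessian here are made up purely for illustration:

```python
import numpy as np

def grad_E(a):
    # analytic gradient of the hypothetical error E(a) = (a0 - 3)^2 + 2*(a1 + 1)^2
    return np.array([2.0 * (a[0] - 3.0), 4.0 * (a[1] + 1.0)])

def hess_E(a):
    # analytic Hessian of the same E (constant, because E is quadratic)
    return np.array([[2.0, 0.0], [0.0, 4.0]])

a = np.array([0.0, 0.0])  # initial guesses for the weights
for _ in range(10):
    # epsilon = -1 * (inverse Hessian at a) * (gradient at a)
    eps = -np.linalg.inv(hess_E(a)) @ grad_E(a)
    a = a + eps  # add epsilon to the guesses and repeat
print(a)  # converges to [3, -1] -- in a single step here, since E is quadratic
```

In practice you'd call np.linalg.solve(hess_E(a), grad_E(a)) rather than explicitly forming the inverse, but the explicit inverse mirrors the description above.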

Notice that you need the inverse of the Hessian matrix. If you have 10 weights to solve for, the Hessian is 10×10; with n weights it’s n×n, and inverting an n×n matrix is expensive. So this direct approach won’t work if n gets very, very large. Therefore there are several variations of this technique, such as the quasi-Newton methods, that estimate the Hessian (or its inverse) in various ways.
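As one toy illustration of the estimation idea (my own crude sketch, not a standard named algorithm), you can approximate just the diagonal of the Hessian with finite differences. That costs O(n) function evaluations instead of O(n^2), and a diagonal matrix inverts elementwise, so no matrix inversion is needed:

```python
import numpy as np

def E(a):
    # hypothetical error function, chosen for illustration
    return (a[0] - 3.0)**2 + 2.0 * (a[1] + 1.0)**2

def grad_E(a):
    # analytic gradient of E
    return np.array([2.0 * (a[0] - 3.0), 4.0 * (a[1] + 1.0)])

def diag_hessian(func, a, h=1e-4):
    # estimate only the n diagonal second derivatives of func at a
    d = np.zeros(len(a))
    for i in range(len(a)):
        ap = a.copy(); ap[i] += h
        am = a.copy(); am[i] -= h
        d[i] = (func(ap) - 2.0 * func(a) + func(am)) / (h * h)
    return d

a = np.array([0.0, 0.0])
for _ in range(10):
    # dividing elementwise by the diagonal replaces the matrix inversion
    a = a - grad_E(a) / diag_hessian(E, a)
print(a)  # approximately [3. -1.]
```

This works perfectly here only because the made-up E is separable; on a function with strong cross-derivatives the diagonal approximation is cruder, which is why serious quasi-Newton methods build richer Hessian estimates.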

All of this is fairly deep stuff, but if you work with machine learning, it slowly but surely starts to make sense over time.