
In 2012, I wrote a paper that I probably should have called “truncated bi-level optimization”. I vaguely remember telling the reviewers I would release some code, so I’m finally getting around to it.

The idea of bilevel optimization is quite simple. Imagine that you would like to minimize some function $L(w)$. However, $L$ itself is defined through some optimization. More formally, suppose we would like to solve

$$\min_{w, y} Q(y) \quad \text{s.t.} \quad y = \arg\min_{y'} E(y', w).$$

Or, equivalently,

$$\min_{w} L(w) = Q(y^*(w)),$$

where $y^*$ is defined as $y^*(w) := \arg\min_{y} E(y, w)$. This seems a little bit obscure at first, but it actually comes up in several different ways in machine learning and related fields.

Hyper-parameter learning

The first example would be in learning hyperparameters, such as regularization constants. Inevitably in machine learning, one fits parameters to optimize some tradeoff between the quality of a fit to training data and a regularization function being small. Traditionally, the regularization constant is selected by optimizing on a training dataset with a variety of values, and then picking the one that performs best on a held-out dataset. However, if there are a large number of regularization parameters, a high-dimensional grid search will not be practical. In the notation above, suppose that $w$ is a vector of regularization constants, and that $y$ are the training parameters. Let $E(y, w)$ be the regularized empirical risk on a training dataset, and let $Q(y)$ be how well the parameters $y$ perform on some held-out validation dataset.
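As a minimal sketch of this setup (in Python, not the Matlab code discussed later), consider one-dimensional ridge regression. Everything here is an illustrative assumption: the toy data, the function names, and the choice to parametrize the regularization constant as exp(w) so that it stays positive. The inner problem has a closed-form solution in this case, which makes the bilevel structure easy to see.

```python
import math

# Hypothetical toy data: (input, target) pairs, not taken from the post.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
valid = [(1.5, 3.0), (2.5, 5.1)]

def E(y, w):
    """Inner objective: training squared error plus exp(w) * y^2."""
    return sum((y * x - t) ** 2 for x, t in train) + math.exp(w) * y ** 2

def y_star(w):
    """Closed-form minimizer of E(., w) for this 1-D ridge problem."""
    sxt = sum(x * t for x, t in train)
    sxx = sum(x * x for x, _ in train)
    return sxt / (sxx + math.exp(w))

def Q(y):
    """Outer objective: squared error on the validation set."""
    return sum((y * x - t) ** 2 for x, t in valid)

def L(w):
    """Bilevel loss: validation error of the parameters trained with w."""
    return Q(y_star(w))

# Stronger regularization (larger w) shrinks the fitted parameter.
for w in (-4.0, 0.0, 4.0):
    print(w, y_star(w), L(w))
```

With a single hyperparameter one could simply scan over w as in the loop above; the bilevel view matters once w has too many dimensions for that.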

Energy-based models

Another example (and the one suggesting the notation) is an energy-based model. Suppose that we have some “energy” function $E_x(y, w)$ which measures how well an output $y$ fits an input $x$. The energy is parametrized by $w$. For a given training input/output pair $(\hat{x}, \hat{y})$, we might have that $Q_{\hat{y}}(y^*(w))$ measures how the predicted output $y^*$ compares to the true output $\hat{y}$, where $y^*(w) = \arg\min_{y} E_{\hat{x}}(y, w)$.

Computing the gradient exactly

Even if we just have the modest goal of following the gradient of $L(w)$ to a local minimum, computing that gradient is not so simple. Clearly, even evaluating $L(w)$ requires solving the “inner” minimization of $E$. It turns out that one can compute $\nabla_w L(w)$ by first solving the inner minimization, and then solving a linear system.
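One standard way to get such a gradient is implicit differentiation of the optimality condition $\nabla_y E(y^*, w) = 0$, which gives $\nabla_w L(w) = -\frac{\partial^2 E}{\partial w \partial y^T} H^{-1} \nabla_y Q(y^*)$ with $H = \frac{\partial^2 E}{\partial y \partial y^T}$, both evaluated at $(y^*, w)$. The Python sketch below applies this to a hypothetical one-dimensional ridge problem (toy data, function names, and the exp(w) parametrization are all assumptions), where the "linear system" reduces to a scalar division, and checks the result against a finite difference.

```python
import math

# Hypothetical toy data: (input, target) pairs, not taken from the post.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
valid = [(1.5, 3.0), (2.5, 5.1)]

def solve_inner(w):
    """Exact minimizer of E(y, w) = sum (y*x - t)^2 + exp(w) * y^2."""
    sxt = sum(x * t for x, t in train)
    sxx = sum(x * x for x, _ in train)
    return sxt / (sxx + math.exp(w))

def Q(y):
    return sum((y * x - t) ** 2 for x, t in valid)

def dQ(y):
    return sum(2.0 * x * (y * x - t) for x, t in valid)

def loss_and_grad(w):
    y = solve_inner(w)                                   # 1. solve inner problem
    g = dQ(y)                                            # 2. gradient of Q at y*
    H = 2.0 * (sum(x * x for x, _ in train) + math.exp(w))  # d^2E/dy^2
    z = g / H                                            # 3. solve H z = g
    E_wy = 2.0 * math.exp(w) * y                         # d^2E/dw dy at y*
    return Q(y), -E_wy * z                               # 4. dL/dw

w = 0.5
loss, grad = loss_and_grad(w)

# Central finite difference of L(w) as a sanity check.
eps = 1e-6
fd = (loss_and_grad(w + eps)[0] - loss_and_grad(w - eps)[0]) / (2 * eps)
print(grad, fd)
```

In higher dimensions step 3 becomes a genuine linear solve with the Hessian of $E$, which is one reason this route can be costly.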

Overall, this is a decent approach, but it can be quite slow, simply because one must solve an “inner” optimization in order to compute each gradient of the “outer” optimization. Often, the inner optimization needs to be solved to very high accuracy in order to estimate a gradient accurately enough to reduce $L$; this is higher accuracy than is needed when one is simply using the predicted value $y^*$ itself.

Truncated optimization

To get around this expense, a fairly obvious idea is to re-define the problem. The slowness of exactly computing the gradient stems from needing to exactly solve the inner optimization. Hence, perhaps we re-define the problem such that an inexact solve of the inner problem nevertheless yields an “exact” gradient?

Re-define the problem as solving

$$\min_{w} L(w) = Q(y^*(w)), \quad y^*(w) := \operatorname{opt-alg}_{y} E(y, w),$$

where “opt-alg” denotes an approximate solve of the inner optimization. In order for this to work, opt-alg must be defined in such a way that $y^*(w)$ is a continuous function of $w$. With standard optimization methods such as gradient descent or BFGS, this can be achieved by assuming there are a fixed number of iterations applied, with a fixed step-size. Since each iteration of these algorithms is continuous, this clearly defines $y^*(w)$ as a continuous function. Thus, in principle, $L$ could be optimized efficiently through automatic differentiation of the code that optimizes $E$. That’s fine in principle, but often inconvenient in practice.
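To make the idea of a fixed-iteration inner solve concrete, here is a small Python sketch on a hypothetical one-dimensional ridge problem. The data, the step size of 0.02, the starting point of 0, and the function names are all assumptions chosen for illustration. It shows that a few gradient descent steps give a well-defined output that differs from the exact minimizer, and that the two agree as the number of steps grows.

```python
import math

# Hypothetical toy data: (input, target) pairs, not taken from the post.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

def grad_E(y, w):
    """d/dy of E(y, w) = sum (y*x - t)^2 + exp(w) * y^2."""
    return sum(2.0 * x * (y * x - t) for x, t in train) + 2.0 * math.exp(w) * y

def truncated_solve(w, n_steps, step=0.02, y0=0.0):
    """A fixed number of fixed-step gradient descent iterations on E(., w)."""
    y = y0
    for _ in range(n_steps):
        y = y - step * grad_E(y, w)
    return y

def exact_solve(w):
    """Closed-form minimizer, available only because this toy E is quadratic."""
    sxt = sum(x * t for x, t in train)
    sxx = sum(x * x for x, _ in train)
    return sxt / (sxx + math.exp(w))

w = 0.0
for n in (1, 3, 10, 50):
    print(n, truncated_solve(w, n), exact_solve(w))
```

Because the loop has no data-dependent stopping rule, the output of `truncated_solve` is a smooth function of `w` for any fixed `n_steps`, which is what makes its gradient well defined.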

It turns out, however, that one can derive “backpropagating” versions of algorithms like gradient descent that take as input only a procedure to compute $E$ along with its first derivatives. These algorithms can then produce the gradient of $L$ in the same time as automatic differentiation.

Back Gradient-Descent

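If the inner optimization is gradient descent for $N$ steps with a step-size of $\lambda$, the algorithm to compute the loss is simple:

Input $w$ (and a fixed starting point $y_0$)

For $k = 0, 1, \dots, N-1$

(a) $y_{k+1} \leftarrow y_k - \lambda \nabla_y E(y_k, w)$

Return $L(w) = Q(y_N)$

How to compute the gradient of this quantity? The following algorithm does the trick.

$\overleftarrow{y_N} \leftarrow \nabla Q(y_N)$

$\overleftarrow{w} \leftarrow 0$

For $k = N-1, \dots, 0$

(a) $\overleftarrow{w} \leftarrow \overleftarrow{w} - \lambda \frac{\partial^2 E(y_k, w)}{\partial w \partial y^T} \overleftarrow{y_{k+1}}$

(b) $\overleftarrow{y_k} \leftarrow \overleftarrow{y_{k+1}} - \lambda \frac{\partial^2 E(y_k, w)}{\partial y \partial y^T} \overleftarrow{y_{k+1}}$

Return $\nabla L(w) = \overleftarrow{w}$.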
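The forward and backward passes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the released Matlab code: the toy data, the per-coordinate exp(w) regularizer, the starting point of zero, and all function names are hypothetical. For simplicity the two second-derivative products are written analytically for this particular quadratic energy; with only first derivatives available, they could presumably be approximated by finite differences of the gradient of E.

```python
import math

# Hypothetical toy data: (feature list, target) pairs, not from the post.
train = [([1.0, 0.5], 1.2), ([0.3, 2.0], 2.1), ([1.5, 1.0], 2.4)]
valid = [([0.8, 1.2], 1.9), ([1.1, 0.4], 1.3)]

def predict(x, y):
    return sum(xj * yj for xj, yj in zip(x, y))

def sq_err_grad(y, data):
    """Gradient in y of sum (predict(x, y) - t)^2 over data."""
    g = [0.0] * len(y)
    for x, t in data:
        r = predict(x, y) - t
        for j in range(len(y)):
            g[j] += 2.0 * r * x[j]
    return g

def grad_y_E(y, w):
    """Gradient of E(y, w) = training error + sum_j exp(w_j) * y_j^2."""
    g = sq_err_grad(y, train)
    return [gj + 2.0 * math.exp(wj) * yj for gj, wj, yj in zip(g, w, y)]

def hess_yy_times(w, v):
    """(d^2E / dy dy^T) v; independent of y since E is quadratic in y."""
    out = [2.0 * math.exp(wj) * vj for wj, vj in zip(w, v)]
    for x, _ in train:
        s = predict(x, v)
        for j in range(len(v)):
            out[j] += 2.0 * s * x[j]
    return out

def hess_wy_times(y, w, v):
    """(d^2E / dw dy^T) v; diagonal for this regularizer."""
    return [2.0 * math.exp(wj) * yj * vj for wj, yj, vj in zip(w, y, v)]

def Q(y):
    return sum((predict(x, y) - t) ** 2 for x, t in valid)

def back_gd(w, n_steps, step):
    """Return L(w) = Q(y_N) and its gradient with respect to w."""
    ys = [[0.0] * len(w)]                      # y_0 = 0 (an assumption)
    for _ in range(n_steps):                   # forward: N gradient steps
        g = grad_y_E(ys[-1], w)
        ys.append([yj - step * gj for yj, gj in zip(ys[-1], g)])
    y_bar = sq_err_grad(ys[-1], valid)         # gradient of Q at y_N
    w_bar = [0.0] * len(w)
    for k in range(n_steps - 1, -1, -1):       # backward: k = N-1, ..., 0
        hw = hess_wy_times(ys[k], w, y_bar)    # both products use y_bar_{k+1}
        hy = hess_yy_times(w, y_bar)
        w_bar = [a - step * b for a, b in zip(w_bar, hw)]   # step (a)
        y_bar = [a - step * b for a, b in zip(y_bar, hy)]   # step (b)
    return Q(ys[-1]), w_bar

loss, grad = back_gd([0.0, 0.5], n_steps=20, step=0.05)
print(loss, grad)
```

Note that the forward pass stores every iterate `ys[k]`, since the backward pass revisits them in reverse order; memory therefore grows linearly with the number of inner steps.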

Similar algorithms can be derived for the heavy-ball algorithm (with a little more complexity) and limited memory BFGS (with a lot more complexity).

Code

So, finally, here is the code, and I’ll give a simple example of how to use it. There are just four simple files:

I think the meanings of these are pretty straightforward, so I’ll just quickly step through the demo here. I’ll start off by taking one of Matlab’s built-in datasets (on cities), so that we are trying to predict a measure of crime from measures of climate, housing, health, transportation, arts, recreation, and economy, as well as a constant. There are 329 data points in total, which I split into a training set of size 40, a validation set of size 160, and a test set of size 129.
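For readers following along outside Matlab, a split of these sizes could be sketched in Python as below. The random shuffle and the fixed seed are assumptions made for illustration; the post does not say how the rows were assigned to each set, and the variable names are hypothetical.

```python
import random

n_total, n_train, n_valid = 329, 40, 160

rng = random.Random(0)            # fixed seed, assumed for reproducibility
idx = list(range(n_total))
rng.shuffle(idx)                  # random assignment is an assumption

train_idx = idx[:n_train]
valid_idx = idx[n_train:n_train + n_valid]
test_idx = idx[n_train + n_valid:]

print(len(train_idx), len(valid_idx), len(test_idx))  # 40 160 129
```

The validation set is deliberately larger than the training set here, which suits a setting where the validation data are used to fit hyperparameters.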

Next, I’ll set up some simple constants that will be used later on, and define the optimization parameters for minFunc, which I will be using for the outer optimization. In particular, here I choose the inner optimization to use 20 iterations.

Now, I’ll define the training risk function ($E$ in the notation above). This computes the risk for a given regularization constant, as well as derivatives. I’ll also define the validation risk ($Q$ in the notation above).