Tag: optimization

In 2012, I wrote a paper that I probably should have called “truncated bi-level optimization”. I vaguely remembered telling the reviewers I would release some code, so I’m finally getting around to it.

The idea of bilevel optimization is quite simple. Imagine that you would like to minimize some function . However, itself is defined through some optimization. More formally, suppose we would like to solve

Or, equivalently,

where is defined as . This seems a little bit obscure at first, but actually comes up in several different ways in machine learning and related fields.

Hyper-parameter learning

The first example would be in learning hyperparameters, such as regularization constants. Inevitably in machine learning, one fits parameters parameters to optimize some tradeoff between the quality of a fit to training data and a regularization function being small. Traditionally, the regularization constant is selected by optimizing on a training dataset with a variety of values, and then picking the one that performs best on a held-out dataset. However, if there are a large number of regularization parameters, a high-dimensional grid-search will not be practical. In the notation above, suppose that is a vector of regularization constants, and that are training parameters. Let, be the regularized empirical risk on a training dataset, and let be how well the parameters perform on some held-out validation dataset.

Energy-based models

Another example (and the one suggesting the notation) is an energy-based model. Suppose that we have some “energy” function which measures how well an output fits to an input . The energy is parametrized by . For a given training input/output pair , we might have that measures how how the predicted output compares to the true output , where .

Computing the gradient exactly

Even if we just have the modest goal of following the gradient of to a local minimum, even computing the gradient is not so simple. Clearly, even to evaluate requires solving the “inner” minimization of . It turns out that one can compute through first solving the inner minimization, and then solving a linear system.

Overall, this is a decent approach, but it can be quite slow, simply because one must solve an “inner” optimization in order to compute each gradient of the “outer” optimization. Often, the inner-optimization needs to be solved to very high accuracy in order to estimate a gradient accurately enough to reduce — higher accuracy than is needed when one is simply using the predicted value itself.

Truncated optimization

To get around this expense, a fairly obvious idea is to re-define the problem. The slowness of exactly computing the gradient stems from needing to exactly solve the inner optimization. Hence, perhaps we re-define the problem such that an inexact solve of the inner problem nevertheless yields an “exact” gradient?

Re-define the problem as solving

,

where denotes an approximate solve of the inner optimization. In order for this to work, must be defined in such a way that is a continuous function of . With standard optimization methods such as gradient descent or BFGS, this can be achieved by assuming there are a fixed number of iterations applied, with a fixed step-size. Since each iteration of these algorithms is continuous, this clearly defines as a continuous function. Thus, in principle, it could be optimized efficiently through automatic differentiation of the code that optimizes . That’s fine in principle, but often inconvenient in practice.

It turns out, however, that one can derive “backpropagating” versions of algorithms like gradient descent, that take as input only a procedure to compute along with it’s first derivatives. These algorithms can then produce the gradient of in the same time as automatic differentiation.

Back Gradient-Descent

If the inner-optimization is gradient descent for steps with a step-size of , the algorithm to compute the loss is simple:

Input

For

(a)

Return

How to compute the gradient of this quantity? The following algorithm does the trick.

For

(a)

(b)

Return .

Similar algorithms can be derived for the heavy-ball algorithm (with a little more complexity) and limited memory BFGS (with a lot more complexity).

Code

So, finally, here is the code, and I’ll give a simple example of how to use it. There are just four simple files:

I think the meanings of this are pretty straightforward, so I’ll just quickly step through the demo here. I’ll start off by grabbing taking one of Matlab’s built-in datasets (on cities) so that we are trying to predict a measure of crime from measures of climate, housing, health, transportation, arts, recreation, and economy, as well as a constant. There are 329 data, total, which I split into a training set of size 40, a validation set of size 160, and a test set of size 129.

Next, I’ll set up some simple constants that will be used later on, and define the optimization parameters for minFunc, that I will be using for the outer optimization. In particular, here I choose the inner optimization to use 20 iterations.

Now, I’ll define the training risk function ( in the notation above). The computes the risk with a regularization constant of , as well as derivatives. I’ll also define the validation risk ( in the notation above).

One recent trend seems to be the realization that one can get better performance by tuning a CRF (Conditional Random Field) to a particular inference algorithm. Basically, forget about the distribution that the CRF represents, and instead only care how accurate are the results that pop out of inference. An extreme example of this is the recent paper Learning Real-Time MRF Inference for Image Denoising by Adrian Barbu.

The basic idea is to fit a FoE (Field of Experts) image prior such that when one takes a very few gradient descent steps on a denoising posterior, the results are accurate. From the abstract:

We argue that through appropriate training, a MRF/CRF model can be trained to perform very well on a suboptimal inference algorithm. The model is trained together with a fast inference algorithm through an optimization of a loss function […] We apply the proposed method to an image denoising application […] with a 1-4 iteration gradient descent inference algorithm. […] the proposed training approach obtains an improved benchmark performance as well as a 1000-3000 times speedup compared to the Fields of Experts MRF.

The implausible-sounding 1000-fold speedup comes simply from using only 4 iterations of gradient descent rather than several thousand. (Incidentally, L-BFGS requires far fewer iterations for this problem.) The results are a bit better than the generative FoE model– that takes much more work for training and inference.

I have every confidence that this does work well, and similar strategies could probably be used to create fast inference models/algorithms for many different problems. My thesis was largely an attempt to do the same thing for marginal, rather than MAP inference.

The disturbing/embarrassing question, for me, is does this really have anything to do with probabilistic modeling any more? Formally speaking, a probability density is being fit, but I doubt it would transfer to, say, inpainting, or that samples from the density would look like natural images. The best interpretation of what is happening might be that one is simply fitting a big, nonlinear, black box function approximation.

It seems that the more effort we expend to wring the best performance out of a probabilistic model, the less “probabilistic” it is.

P.S. Some of my friends have invited me to never mention autodiff again, ever, but this is one of the many papers where I think the learning optimization would be made much easier/faster by using it.

Anyway, the notes give an explanation of the Gauss-Seidel iterative method for solving linear systems that is so clear, I feel a little cheated that it was never explained to me before. Let’s start with the typical (confusing!) explanation of Gauss-Seidel (as in, say, Wikipedia). You want to solve some linear system

,

for . What you do is decompose into a diagonal matrix , a strictly lower triangular matrix , and a strictly upper triangular matrix , like

.

To solve this, you iterate like so:

.

This works when is symmetric positive definite.

Now, what the hell is going on here? The first observation is that instead of inverting , you can think of minimizing the function

.

(It is easy to see that the global minimum of this is found when .) Now, how to minimize this function? One natural way to approach things would be to just iterate through the coordinates of , minimizing the function over each one. Say we want to minimize with respect to . So, we are going to make the change . Thus, we want to minimize (with respect to d), this thing:

Expanding this, taking the derivative w.r.t. , and solving gives us (where is the unit vector in the th direction)

,

or, equivalently,

.

So, taking the solution at one timestep, , and solving for the solution at the next timestep, we will have, for the first index,

.

Meanwhile, for the second index,

.

And, more generally

.

Now, all is a matter of cranking through the equations.

Which is exactly the iteration we started with.

The fact that this would work when is symmetric positive definite is now obvious– that is when the objective function we started with is bounded below.

The understanding that Gauss-Seidel is just a convenient way to implement coordinate descent also helps to get intuition for the numerical properties of Gauss-Seidel (e.g. quickly removing "high frequency error", and very slowly removing "low frequency error").

You have some function . You have figured out how to compute it’s gradient, . Now, however, you find that you are implementing some algorithm (like, say, Stochastic Meta Descent), and you need to compute the product of the Hessian with certain vectors. You become very upset, because either A) you don’t feel like deriving the Hessian (probable), or B) the Hessian has elements, where is the length of and that is too big to deal with (more probable). What to do? Behold:

Consider the Hessian-Vector product we want to compute, . For small ,

And so,

This trick above has been around apparently forever. The approximation becomes exact in the limit . Of course, for small , numerical problems will also start to kill you. Pearlmutter’s algorithm is a way to compute with the same complexity, with out suffering rounding errors. Unfortunately, Pearlmutter’s algorithm is kind of complex, while the above is absolutely trivial.

Update: Perlmutter himself comments below that if we want to use the finite difference trick, we would do better to use the approximation:

This expression will be closer to the true value for larger , meaning that we are less likely to get hurt by rounding error. This is nicely illustrated on page 6 of these notes.

When fitting statistical models, we usually need to “regularize” the model. The simplest example is probably linear regression. Take some training data, . Given a vector of weights , the total squared distance is

So to fit the model, we might find to minimize the above loss. Commonly, (particularly when has many dimensions), we find t\hat the above procedure leads to overfitting: very large weights t\hat fit the training data very well, but poorly predict future data. “Regularization” means modifying the optimization problem to prefer small weights. Consider

Here, trades off how much we care about the model fitting well versus how we care about being small. Of course, this is a much more general concept that linear regression, and one could pick many alternatives to the squared norm to measure how simple a model is. Let’s abstract a bit, and just write some function to represent how well the model fits the data, and some function to represent how simple it is.

The obvious question is: how do we pick ? This, of course, has been the subject of much theoretical study. What is commonly done in practice is this:

Divide the data into a training set and test set.

Fit the model for a range of different s using only the training set.

For each of the models fit in step 2, check how well the resulting weights fit the test data.

Output the weights that perform best on test data.

(One can also retrain on all the data using the that did best in step 2.)

Now, there are many ways to measure simplicity. (In the example above, we might consider , , , , …). Which one to use? If you drank too much coffee one afternoon, you might decide to include a bunch of regulizers simultaneously, each with a corresponding regularization parameter, ending up with a problem like

.

And yet, staring at that equation, things take a sinister hue: if we include too many regularization parameters, won’t we eventually overfit the regularization parameters? And furthermore, won’t it be very hard to test all possible combinations of to some reasonable resolution? What should we do?

And now I will finally get to the point.

Recently, I’ve been looking at regularization in a different way, which seems very obvious in retrospect. The idea is that when searching for the optimal regularization parameters, we are fitting a model– a model that just happens to be defined in an odd way.

Where is the function measuring how well the model fits the training data. So, for a given regularization parameter, returns the weights solving the optimization problem. Given that, the optimal \lambda will the the one minimizing this equation:

This extends naturally to the setting where we have a vector of regularization parameters .

From this perspective, doing an optimization over regularization parameters makes sense: it is just fitting the parameters to the data — it just so happens that the objective function is implicitly defined by a minimization over the data .

Regularization works because it is fitting a single parameter to some data. (In general, one is safe fitting a single parameter to any reasonably sized dataset although there are exceptions.) Fitting multiple regularization parameters should work, supposing the test data is big enough to support them. (There are computational issues I’m not discussing here with doing this, stemming from the fact that isn’t known in closed form, but is implicitly defined by a minimization. So far as I know, this is an open problem, although I suspect the solution is “implicit differentiation”)

Even regularizing the regularization parameters isn’t necessarily a crazy idea, though clearly isn’t the way to do it. (Simpler models correspond to larger regularization constants. Maybe ?.) Then, you can regularize those parameters!

Well. In retrospect that wasn’t quite so simple as it seemed.

It’s a strong bet that this is a standard interpretation in some communities.