I am trying to solve some unconstrained nonlinear optimization problems on GPU (CUDA).

The objective function is a smooth nonlinear function, and its gradient is relatively cheap to compute analytically, so I don't need to bother with numerical approximation.

I want to solve this problem with mostly fp32 math ops (for various reasons), so which nonlinear optimization method is more robust against round-off errors while still offering good performance (e.g. conjugate gradient / quasi-Newton / trust region)? Has anyone tried BFGS on a GPU with good results?

BTW, the Hessian, if needed, is relatively small in my case (typically < 64x64), but I need to solve thousands of these small-scale optimization problems concurrently.

$\begingroup$Given the small size of your problems I don't think the specific choice of algorithm (e.g., BFGS) is going to be your most significant challenge. Instead, it will be minimizing GPU<->CPU communication overhead. Probably the best way to do that is going to be to solve lots of instances of your problems in parallel on the GPU. Load them all up at once, solve them all at once, download the results all at once. I don't have specific advice on the algorithm, but I will say that GPUs are better with loops than with branches.$\endgroup$
– Michael Grant Apr 27 '13 at 13:14


$\begingroup$@Michael C. Grant: Well, the communication overhead can easily be hidden by computation in my case, so it is not a bottleneck there. I am very inclined to use limited-memory BFGS or standard BFGS here, but I'm not sure whether there is a better approach.$\endgroup$
– user0002128 Apr 27 '13 at 22:43

L-BFGS is in general a very solid choice, and unless you're really strapped for memory it's probably the best place to start.

Both conjugate gradient and BFGS require line searches, though, which is where fp32 becomes a problem. Rather than using the standard Wolfe conditions for the line search, I would suggest using the approximate Wolfe condition suggested here. The paper is a little involved, but the important part is equation 4.1. Essentially, they explicitly introduce the precision to which you can evaluate your function.
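To make that concrete, here is a minimal single-precision sketch of an acceptance test in the spirit of that approximate Wolfe condition. It is an illustration under the usual parameter assumptions (0 < delta < 0.5, delta <= sigma < 1), not a drop-in replacement for the paper's full line search:

```c
/* Approximate Wolfe acceptance test (a sketch, in the spirit of the
 * approximate Wolfe condition referenced above).
 * phi0p = phi'(0), the directional derivative at the current point
 * phiap = phi'(alpha), the directional derivative at the trial step
 * delta, sigma are the usual Wolfe parameters, 0 < delta < 0.5 <= sigma < 1.
 * Uses only derivative information, which stays meaningful when f itself
 * can only be evaluated to fp32 accuracy. Returns 1 to accept the step. */
static int approx_wolfe(float phi0p, float phiap, float delta, float sigma)
{
    return (2.0f * delta - 1.0f) * phi0p >= phiap  /* upper bound on phi'(alpha) */
        && phiap >= sigma * phi0p;                 /* curvature (lower) bound */
}
```

For a descent direction phi'(0) < 0, the test brackets phi'(alpha) between sigma * phi'(0) and (2*delta - 1) * phi'(0), so it never compares function values that may be drowned in rounding noise.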

Considerations for the GPU:

You have a lot of small problems, which is slightly different from my use case of one large problem. Consider running one problem per GPU block (or rather per warp) if you can parallelize the function and gradient evaluations across all the threads in a block. That way it is not a problem if different problems require different numbers of iterations.

If this is not an option, I would go with the L-BFGS solver. If your function is well behaved, you might get away with simply using a step size of 1 (avoiding the line search) and just running all the problems for a fixed number of iterations.
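The direction for such a fixed-step L-BFGS iteration comes from the standard two-loop recursion. Here is a minimal single-precision sketch, with the dimension capped at 64 to match the problem sizes above; the function names and the fixed-width array layout are illustrative:

```c
#include <stddef.h>

static float dotf(const float *a, const float *b, size_t n)
{
    float r = 0.0f;
    for (size_t i = 0; i < n; ++i) r += a[i] * b[i];
    return r;
}

/* L-BFGS two-loop recursion (a sketch). Given the last m curvature pairs
 * s_i = x_{i+1} - x_i and y_i = g_{i+1} - g_i (oldest first, newest last),
 * overwrite g (the current gradient) with the quasi-Newton direction -H*g.
 * n is the problem dimension (< 64 here); alpha is scratch of length m. */
static void lbfgs_direction(float *g, float s[][64], float y[][64],
                            int m, size_t n, float *alpha)
{
    for (int i = m - 1; i >= 0; --i) {            /* first loop: newest to oldest */
        float rho = 1.0f / dotf(y[i], s[i], n);
        alpha[i] = rho * dotf(s[i], g, n);
        for (size_t j = 0; j < n; ++j) g[j] -= alpha[i] * y[i][j];
    }
    /* scale by gamma = s.y / y.y (initial Hessian guess from the newest pair) */
    float gamma = dotf(s[m - 1], y[m - 1], n) / dotf(y[m - 1], y[m - 1], n);
    for (size_t j = 0; j < n; ++j) g[j] *= gamma;
    for (int i = 0; i < m; ++i) {                 /* second loop: oldest to newest */
        float rho = 1.0f / dotf(y[i], s[i], n);
        float beta = rho * dotf(y[i], g, n);
        for (size_t j = 0; j < n; ++j) g[j] += (alpha[i] - beta) * s[i][j];
    }
    for (size_t j = 0; j < n; ++j) g[j] = -g[j];  /* descent direction */
}
```

On a GPU you would map the inner loops over j across the threads of the block handling that problem; the dot products become small block-level reductions.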

I suggest using Levenberg-Marquardt (a trust-region variant), as it is used in many practical applications and has demonstrated very good speed-versus-accuracy performance. Moreover, there are some GPU libraries (e.g. cuLM, https://github.com/zitmen/cuLM) that you could try out. If they don't do the job, there are tons of resources to help you implement your own; implementing LM is not hard at all. The main thing to take care of is minimizing GPU communication. To get a brief idea:
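As a sketch of the core of an LM iteration (not cuLM's API; the function name, the two-parameter restriction, and the Cramer's-rule solve are all illustrative), one damped Gauss-Newton step solves (J^T J + lambda I) dp = -J^T r:

```c
#include <math.h>

/* One Levenberg-Marquardt step for a hypothetical 2-parameter
 * least-squares problem: solves (J^T J + lambda*I) dp = -J^T r.
 * J is m x 2, row-major; r has m residual entries; lambda >= 0 is the
 * damping parameter. Returns 0 if the damped system is singular.
 * For the < 64x64 problems in the question, the 2x2 Cramer's-rule solve
 * would be replaced by a small Cholesky factorization. */
static int lm_step2(const float *J, const float *r, int m,
                    float lambda, float dp[2])
{
    float a = lambda, b = 0.0f, c = lambda;  /* entries of J^T J + lambda*I */
    float g0 = 0.0f, g1 = 0.0f;              /* entries of J^T r */
    for (int i = 0; i < m; ++i) {
        float j0 = J[2 * i], j1 = J[2 * i + 1];
        a += j0 * j0; b += j0 * j1; c += j1 * j1;
        g0 += j0 * r[i]; g1 += j1 * r[i];
    }
    float det = a * c - b * b;
    if (fabsf(det) < 1e-12f) return 0;
    dp[0] = (-g0 * c + g1 * b) / det;        /* Cramer's rule, 2x2 solve */
    dp[1] = (-g1 * a + g0 * b) / det;
    return 1;
}
```

The outer LM loop then accepts the step and shrinks lambda when the cost decreases, or rejects it and grows lambda otherwise; with thousands of independent problems, each block can run this loop entirely on the GPU and only the final parameters need to come back.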