I'm trying to understand how the adjoint-based optimization method works for a PDE constrained optimization. Particularly, I'm trying to understand why the adjoint method is more efficient for problems where the number of design variables is large, but the "number of equations is small".

What I understand:

Consider the following PDE constrained optimization problem:

$$\min_\beta \; I(\beta,u(\beta)) \quad \text{s.t.}\quad R(u(\beta),\beta)=0$$

where $I$ is a (sufficiently smooth) objective function of a vector of design variables $\beta$ and a vector of field variable unknowns $u(\beta)$ which depend on the design variables, and $R(u,\beta)$ is the residual form of the PDE (which in general also depends on $\beta$, since the gradient below involves $\partial R/\partial \beta$).

Taking variations of the objective gives $\delta I = \frac{\partial I}{\partial u}\delta u + \frac{\partial I}{\partial \beta}\delta \beta$, while the linearized constraint gives $\frac{\partial R}{\partial u}\delta u + \frac{\partial R}{\partial \beta}\delta \beta = 0$, which we may add to $\delta I$ after multiplying by $\lambda^T$. The term involving $\delta u$ then drops out if we are able to solve for $\lambda$ such that $$\frac{\partial I}{\partial u} + \lambda^T\frac{\partial R}{\partial u}=0 \text{ (adjoint equation)}$$

Then the gradient $\delta I= \left[\frac{\partial I}{\partial \beta} + \lambda^T\frac{\partial R}{\partial \beta}\right]\delta \beta$ can be evaluated without ever computing the sensitivities $\delta u$ of the field variables with respect to the design variables.

Thus, an adjoint-based optimization algorithm would loop over the following steps:

1. Given current design variables $\beta$
2. Solve for the field variables $u$ (from the PDE)
3. Solve for the Lagrange multipliers $\lambda$ (from the adjoint equation)
4. Calculate the total gradient $\frac{dI}{d\beta} = \frac{\partial I}{\partial \beta} + \lambda^T\frac{\partial R}{\partial \beta}$
5. Update the design variables $\beta$
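To make the loop concrete, here is a minimal sketch in Python/NumPy for a toy linear "PDE" of my own construction (not from the question): $R(u,\beta) = Au - \beta = 0$ with $I = \tfrac12\|u - u_d\|^2$, where the matrix $A$ and target $u_d$ are made up. Here $\partial R/\partial u = A$, $\partial R/\partial \beta = -I$, and $\partial I/\partial u = (u - u_d)^T$:

```python
import numpy as np

# Toy setup (illustrative only): state equation R(u, beta) = A u - beta = 0,
# objective I(beta, u) = 0.5 * ||u - u_d||^2.
n = 30
A = 2.0 * np.eye(n) - 0.5 * np.eye(n, k=1) - 0.5 * np.eye(n, k=-1)  # made-up SPD matrix
u_d = np.sin(np.linspace(0.0, np.pi, n))                            # made-up target state

beta = np.zeros(n)                          # initial design variables
step = 0.5                                  # fixed gradient-descent step
for it in range(400):
    u = np.linalg.solve(A, beta)            # 1. solve the PDE for the field u
    lam = np.linalg.solve(A.T, -(u - u_d))  # 2. adjoint solve: (dR/du)^T lam = -(dI/du)^T
    grad = -lam                             # 3. dI/dbeta = lam^T dR/dbeta = -lam^T here
    beta -= step * grad                     # 4. update the design variables
```

Note that each iteration costs one forward solve and one adjoint solve, regardless of how many entries $\beta$ has.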

My question

How does this adjoint 'trick' improve the cost of the optimization per iteration in the case where the number of design variables is large? I've heard that the cost of gradient evaluation for the adjoint method is 'independent' of the number of design variables. But how exactly is this true?

By the way, the Lagrange multiplier is usually added to the objective functional, not the variation; thus $\min_{u,\beta}\max_\lambda I(u,\beta) + \lambda^T R(u,\beta)$. Setting the derivative with respect to $u$ to zero yields the adjoint equation, and inserting this (and the solution $u$ of the state equation $R(u,\beta)=0$) into the derivative with respect to $\beta$ yields the gradient. If you start with the weak formulation of the PDE, things get even simpler: Just insert the Lagrange multiplier in place of the test function. No need for the strong form or partial integration anywhere.
– Christian Clason, Jul 25 '14 at 23:52


The most expensive part of any simulation is the solve phase. By using the adjoint you get the gradient in two solves (one forward, one adjoint), much cheaper than finite differences, where you need at least $n+1$ solves, $n$ being the number of free parameters in your model.
– stali, Jul 26 '14 at 1:46

2 Answers

How does this adjoint 'trick' improve the cost of the optimization per iteration in the case where the number of design variables is large?

I think about the cost from a linear algebra perspective. (See these notes by Stephen G. Johnson, which I find more intuitive than the Lagrange multiplier approach.) Differentiating the constraint $R(u,\beta)=0$ gives
$$\frac{\partial R}{\partial u}\frac{du}{d\beta} = -\frac{\partial R}{\partial \beta},$$
so the forward approach amounts to solving for the sensitivities $du/d\beta$ directly, one linear solve per design variable, and then forming
$$\frac{dI}{d\beta} = \frac{\partial I}{\partial \beta} + \frac{\partial I}{\partial u}\frac{du}{d\beta}.$$
The adjoint approach regroups the terms:
$$\frac{dI}{d\beta} = \frac{\partial I}{\partial \beta} - \frac{\partial I}{\partial u}\left(\frac{\partial R}{\partial u}\right)^{-1}\frac{\partial R}{\partial \beta} = \frac{\partial I}{\partial \beta} + \lambda^T\frac{\partial R}{\partial \beta}, \qquad \left(\frac{\partial R}{\partial u}\right)^T\lambda = -\left(\frac{\partial I}{\partial u}\right)^T.$$
This regrouping of terms requires only one linear solve, instead of a linear solve for each parameter, which makes adjoint evaluation cheap for the many-parameter case.
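The equivalence of the two groupings, and the solve counts involved, can be checked numerically. A small sketch with a made-up linear state equation $R(u,\beta) = Au - B\beta = 0$ and $I = \tfrac12\|u-u_d\|^2$ (my own illustration, with many design variables and a comparatively small state):

```python
import numpy as np

# Illustrative setup: R(u, beta) = A u - B beta = 0, I = 0.5 * ||u - u_d||^2,
# so dR/du = A, dR/dbeta = -B, and dI/du = (u - u_d)^T.
rng = np.random.default_rng(1)
n_state, n_design = 40, 200
A = rng.standard_normal((n_state, n_state)) + n_state * np.eye(n_state)  # well-conditioned
B = rng.standard_normal((n_state, n_design))
u_d = rng.standard_normal(n_state)
beta = rng.standard_normal(n_design)

u = np.linalg.solve(A, B @ beta)
dIdu = u - u_d                          # dI/du (stored as a 1-D array)
dRdbeta = -B                            # dR/dbeta

# Forward (direct) approach: one linear solve per design variable
# (batched here as n_design right-hand sides).
dudbeta = np.linalg.solve(A, -dRdbeta)  # du/dbeta, an n_state x n_design matrix
grad_forward = dudbeta.T @ dIdu

# Adjoint approach: a single linear solve with A^T, independent of n_design.
lam = np.linalg.solve(A.T, -dIdu)
grad_adjoint = lam @ dRdbeta            # dI/dbeta = lam^T dR/dbeta (dI/dbeta_expl = 0 here)

assert np.allclose(grad_forward, grad_adjoint)
```

The forward path performs 200 solves (batched as columns), the adjoint path exactly one, and both produce the same gradient.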

I've heard that the cost of gradient evaluation for the adjoint method is 'independent' of the number of design variables. But how exactly is this true?

It's not totally independent; presumably the cost of evaluating $(\partial{I}/\partial{\beta})$ and $(\partial{R}/\partial{\beta})$ will increase with the number of parameters. The linear solves, however, will still be of the same size, as long as the size of $u$ does not change. The assumption is that the solves are much more expensive than the function evaluations.

In a nutshell, the advantage comes from the fact that to compute derivatives of the reduced objective $I(\beta,u(\beta))$, you do not really need to know the derivative of $u(\beta)$ with respect to $\beta$ as a separate object, but only that part of it that leads to variations in $I(\beta,u(\beta))$.

Let me switch to a notation I'm a bit more comfortable with:
$$\min_{y,u} J(y,u) \quad\text{subject to}\quad e(y,u)=0$$
($u$ being the design variable, $y$ being the state variable, and $J$ being the objective).
Let's say $e(y,u)$ is nice enough to apply the implicit function theorem, so the equation $e(y,u)=0$ has a unique solution $y(u)$ which is continuously differentiable with respect to $u$, and the derivative $y'(u)$ is given by the solution of
$$e_y(y(u),u)y'(u) + e_u(y(u),u) = 0\tag{1}$$
($e_y$ and $e_u$ being the partial derivatives).

This means you can define the reduced objective $j(u):=J(y(u),u)$, which is differentiable as well (if $J(y,u)$ is).
One way to characterize the gradient $\nabla j(u)$ is via directional derivatives (e.g., compute all the partial derivatives with respect to a basis of the design space).
Here, the directional derivative in direction $h$ is given by the chain rule as
$$j'(u;h) = \langle J_y(y(u),u),y'(u)h \rangle + \langle J_u(y(u),u),h\rangle.\tag{2}$$
If $J$ is nice, the only difficult thing to compute is $y'(u)h$ for given $h$. This can be done by multiplying $(1)$ with $h$ from the right and solving for $y'(u)h$ (which the implicit function theorem allows), i.e., computing
$$[y'(u)h] = -e_y(y(u),u)^{-1} [e_u(y(u),u)h]\tag{3}$$
and plugging this expression into $(2)$.
In PDE-constrained optimization, this amounts to solving a linearized PDE for every basis vector $h$ of the design space.

However, if we find an operator $\nabla j$ such that
$$j'(u;h) = \langle \nabla j,h\rangle\qquad \text{for all }h,$$
then this must be the desired gradient. Looking at $(2)$, we can write
$$ \langle J_y(y(u),u),y'(u)h \rangle = \langle y'(u)^*J_y(y(u),u),h \rangle $$
(with $y'(u)^*$ being the adjoint operator), so all we need to compute is $y'(u)^*J_y(y(u),u)$. Using that $(AB)^* = B^* A^*$, this can be done using $(3)$, i.e., computing
$$\lambda:= -e_y(y(u),u)^{-*}J_y(y(u),u)$$
and setting
$$\nabla j(u) = e_u(y(u),u)^*\lambda +J_u(y(u),u).$$
In PDE-constrained optimization, $J_y(y(u),u)$ is usually some sort of residual, and computing $\lambda$ involves solving a single (linear) adjoint PDE, independent of the dimension of the design space. (In fact, this even works for distributed parameters, i.e., if $u$ is a function in some infinite-dimensional Banach space, where the first approach is infeasible.)
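As a sanity check on this formula, here is a small numerical verification for a linear-quadratic example of my own construction (the matrix $A$, target $y_d$, and weight $\alpha$ are made up): $e(y,u) = Ay - u$ and $J(y,u) = \tfrac12\|y - y_d\|^2 + \tfrac{\alpha}{2}\|u\|^2$, so that $e_y = A$, $e_u = -I$, $J_y = y - y_d$, and $J_u = \alpha u$.

```python
import numpy as np

# Linear-quadratic example: e(y, u) = A y - u = 0,
# J(y, u) = 0.5 * ||y - y_d||^2 + 0.5 * alpha * ||u||^2.
rng = np.random.default_rng(2)
n, alpha = 30, 1e-2
A = rng.standard_normal((n, n)) + n * np.eye(n)  # made-up well-conditioned operator
y_d = rng.standard_normal(n)
u = rng.standard_normal(n)

y = np.linalg.solve(A, u)               # state solve: e(y, u) = 0
lam = -np.linalg.solve(A.T, y - y_d)    # adjoint solve: lam = -e_y^{-*} J_y
grad = -lam + alpha * u                 # nabla j = e_u^* lam + J_u, with e_u^* = -I

# Compare against a central finite difference of the reduced objective j(u)
# in a random direction h.
def j(u):
    y = np.linalg.solve(A, u)
    return 0.5 * np.sum((y - y_d) ** 2) + 0.5 * alpha * np.sum(u ** 2)

h = rng.standard_normal(n)
eps = 1e-6
fd = (j(u + eps * h) - j(u - eps * h)) / (2.0 * eps)
assert abs(fd - grad @ h) < 1e-6 * (1.0 + abs(fd))
```

The adjoint gradient matches the directional derivative for any direction $h$, at the cost of one state solve and one adjoint solve in total.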