I am not an optimizer by training. My road to optimization went through convex analysis. I started with variational methods for inverse problems and mathematical imaging with the goal of deriving properties of minimizers of convex functions. Hence, I studied a lot of convex analysis. Later I got interested in how to actually solve convex optimization problems and started to read books about (convex) optimization. At first I was always distracted by the way optimizers treated constraints. To me, a convex optimization problem always looks like

$$\min_x F(x).$$

Everything can be packed into the convex objective $F$. If you have a convex objective $f$ and a constraint $g(x)\leq 0$ with a convex function $g$, just take $F = f + I_{\{g\leq 0\}}$, i.e., add the indicator function of the constraint to the objective (for some strange reason, Wikipedia has the name and notation for indicator and characteristic function the other way round than I, and many others…). Here $I_C(x) = 0$ if $x\in C$ and $I_C(x) = +\infty$ otherwise. Similarly for multiple constraints or linear equality constraints and such.

In this simple world it is particularly easy to characterize all solutions of convex minimization problems: They are just those $x$ for which

$$0\in\partial F(x).$$

Simple as that. Only take the subgradient of the objective and that’s it.

When reading the optimization books and seeing how difficult the treatment of constraints is there, I was especially puzzled by how complicated optimality conditions such as KKT looked in contrast to $0\in\partial F(x)$, and also by the notion of constraint qualifications.

These constraint qualifications are additional assumptions that are needed to ensure that a minimizer fulfills the KKT-conditions. For example, if one has constraints $g_i(x)\leq 0$ then the linear independence constraint qualification (LICQ) states that all the gradients $\nabla g_i(x)$ for constraints that are “active” (i.e. $g_i(x) = 0$) have to be linearly independent.

It took me a while to realize that there is a similar issue in my simple “convex analysis view” on optimization: When passing from the gradient of a function to the subgradient, many things stay as they are. But not everything. One thing that does change is the simple sum-rule. If $f$ and $g$ are differentiable, then $\nabla(f+g)(x) = \nabla f(x) + \nabla g(x)$, always. That’s not true for subgradients! You always have that $\partial f(x) + \partial g(x)\subset\partial(f+g)(x)$. The reverse inclusion is not always true but holds, e.g., if there is some point for which $g$ is finite and $f$ is continuous. At first glance this sounds like a very weak assumption. But in fact, this is precisely in the spirit of constraint qualifications!

Take two constraints $g_1(x)\leq 0$ and $g_2(x)\leq 0$ with convex and differentiable $g_1, g_2$. We can express these by $x\in K_i$ ($K_i = \{x : g_i(x)\leq 0\}$, $i=1,2$). Then it is equivalent to write

$$\min_x f(x)\quad\text{s.t.}\quad g_1(x)\leq 0,\ g_2(x)\leq 0$$

and

$$\min_x f(x) + I_{K_1}(x) + I_{K_2}(x).$$

So characterizing solutions to either of these is just saying that $0\in\partial(f + I_{K_1} + I_{K_2})(x)$. Oh, there we are: Are we allowed to pull the subgradient apart? We need to apply the sum rule twice and at some point we need that there is a point at which one indicator function is finite and the other one is continuous (or vice versa)! But an indicator function is only continuous in the interior of the set where it is finite. So the simplest form of the sum rule only holds in the case where only one of two constraints is active! Actually, the sum rule holds in many more cases but it is not always simple to find out if it really holds for some particular case.

So, constraint qualifications are indeed similar to rules that guarantee that a sum rule for subgradients holds.

Geometrically speaking, both are meant to guarantee that if one “looks at the constraints individually” one can still see what is going on at points of optimality. It may well be that the sum of individual subgradients is too small to get any points with $0\in\partial f(x) + \partial I_{K_1}(x) + \partial I_{K_2}(x)$ but still there are solutions to the optimization problem!

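As a very simple illustration take the constraints $x_2\leq 0$ and $x_2\geq x_1^2$ in two dimensions. The first constraint says “be in the lower half-plane” while the second says “be above the parabola $x_2 = x_1^2$”. Now take the point $(0,0)$ which is on the boundary for both sets $K_1 = \{x : x_2\leq 0\}$ and $K_2 = \{x : x_2\geq x_1^2\}$. It’s simple to see (geometrically and algebraically) that $\partial I_{K_1}(0,0) = \{(0,\lambda) : \lambda\geq 0\}$ and $\partial I_{K_2}(0,0) = \{(0,-\lambda) : \lambda\geq 0\}$, so treating the constraints individually gives $\partial I_{K_1}(0,0) + \partial I_{K_2}(0,0) = \{0\}\times\mathbb{R}$. But the full story is that $K_1\cap K_2 = \{(0,0)\}$, thus $I_{K_1} + I_{K_2} = I_{\{(0,0)\}}$ and consequently, the subgradient $\partial(I_{K_1} + I_{K_2})(0,0) = \mathbb{R}^2$ is much bigger.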

In my Analysis class today I defined the trigonometric functions $\sin$ and $\cos$ by means of the complex exponential. As usual I noted that for real $x$ we have $|e^{ix}| = 1$, i.e. $e^{ix}$ lies on the complex unit circle. Then I drew the following picture:

This was meant to show that the real part and the imaginary part of $e^{ix}$ are what is known as $\cos(x)$ and $\sin(x)$, respectively.

After the lecture a student came to me and noted that we could have started with $a^{ix}$ for some base $a>0$, noted that $|a^{ix}| = 1$ and done the same thing. The question is: Does this work out? My initial reaction was: Yeah, that works, but you’ll get a different $\pi$…

But then I wondered if this would lead to something useful. At least for the logarithm one does a similar thing. We define $a^x$ for $a>0$ and real $x$ as $a^x = \exp(x\ln(a))$, note that (for $a\neq 1$) this gives a bijection between $\mathbb{R}$ and $(0,\infty)$ and define the inverse function as $\log_a$.

So, nothing stops us from defining

$$\cos_a(x) = \mathrm{Re}(a^{ix}),\qquad \sin_a(x) = \mathrm{Im}(a^{ix}).$$

Many identities are still valid, e.g.

or

For the derivative one has to be a bit more careful as it holds

Coming back to “you’ll get a different $\pi$”: In the next lecture I am going to define $\pi$ by saying that $\pi/2$ is the smallest positive root of the function $\cos$. Naturally this leads to a definition of “$\pi$ in base $a$” as follows:

Definition 1 $\pi_a/2$ is the smallest positive root of $\cos_a$.

How is this related to the area of the unit circle (which is another definition for $\pi$)?

The usual analysis proof goes by calculating the area of a quarter of the unit circle by the integral $\int_0^1\sqrt{1-x^2}\,dx$.

Doing this in base goes by substituting :

Thus, the area of the unit circle is now …

Oh, and by the way, you’ll get the nice identity

(and hence, the area of the unit circle is indeed )…
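The base-$a$ construction is easy to check numerically. The following is only a small sketch: the names `cos_a` and `half_pi_a` are my own, the root is found by a crude scan plus bisection, and the comparison value $\pi/\ln(a)$ uses the fact that $a^{ix} = e^{ix\ln(a)}$.

```python
import math


def cos_a(x: float, a: float) -> float:
    """Real part of a**(i*x) for a base a > 0, a != 1."""
    return (complex(a) ** (1j * x)).real


def sin_a(x: float, a: float) -> float:
    """Imaginary part of a**(i*x) for a base a > 0, a != 1."""
    return (complex(a) ** (1j * x)).imag


def half_pi_a(a: float, step: float = 1e-3, tol: float = 1e-12) -> float:
    """Smallest positive root of cos_a (assumes step is below the root spacing)."""
    lo = 0.0
    # cos_a(0) = 1 > 0, so walk right until the sign changes.
    while cos_a(lo + step, a) > 0.0:
        lo += step
    hi = lo + step
    # Bisect on [lo, hi], keeping cos_a(lo) > 0 >= cos_a(hi).
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cos_a(mid, a) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


base = 2.0
pi_in_base = 2.0 * half_pi_a(base)
print(pi_in_base, math.pi / math.log(base))
```

For `base = math.e` the same routine should reproduce the usual $\pi$ up to the bisection tolerance.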

In another blogpost I wrote about convexity from an abstract point of view. Recall that convex functions $f:X\to Y$ can be defined as soon as we have a real linear structure on $X$ and an order on $Y$, as this allows us to formulate the basic requirement for a convex function, namely that for all $x,y\in X$ and $0\leq\lambda\leq 1$ it holds that

$$f(\lambda x + (1-\lambda)y)\leq \lambda f(x) + (1-\lambda)f(y).$$

One amazing thing about convexity is that it implies some regularity for the function. Indeed, you’ll find something on the net if you search for “convexity implies continuity”. But wait. How can that be? We have a mapping from a vector space $X$ to some ordered space $Y$ (which I will always assume to be $\overline{\mathbb{R}}$ here, i.e. the extended real line) and we did not specify any topology on $X$ (while the extended real line carries its usual order topology). Indeed, one can equip a vector space with a lot of different topologies, so how can it be that some property like convexity, which is expressed in purely algebraic terms, implies something like continuity, which is a topological property? The answer is that it is not really true that “convexity implies continuity”. The correct statement is a bit more subtle:

A convex function is Lipschitz continuous at any point where it is locally bounded.

Ok, here we have something more: We need boundedness of $f$, but this is still related to $Y$ and not related to $X$. But there is this little word “locally” and this is the point where some topology on $X$ comes into play. Let’s assume that we have even a metric on $X$ so that we can talk about balls. Then, the statement reads as:

A convex function $f$ is Lipschitz continuous at a point $x$ if there exist a $\delta>0$ and a $C$ such that $|f(y)|\leq C$ for $y\in B_\delta(x)$.

Put differently: The continuity of a convex function depends on the boundedness of on neighborhoods. Consequently, if we change the topology, we change the set of neighborhoods and hence, a fixed convex function may have different continuity behavior in different topologies. This does indeed happen. Consider the following extreme example: Let and

This function is convex but, for the norm-topology, not continuous at any point. Also, it is not locally bounded at any point. However, if we change the topology such that each point is its own neighborhood (that is, we take the discrete metric), then we get local boundedness and also continuity of the function.

The Douglas-Rachford method is a method to solve a monotone inclusion $0\in A(x) + B(x)$ with two maximally monotone operators $A,B$ defined on a Hilbert space $X$. The method uses the resolvents $(I+A)^{-1}$ and $(I+B)^{-1}$ and produces two sequences of iterates

This is again a monotone inclusion, but now on . We introduce the positive definite operator

and perform the iteration

(This is basically the same as applying the proximal point method to the preconditioned inclusion

Writing out the iteration gives

Now, applying the Moreau identity for monotone operators ($(I+A)^{-1} + (I+A^{-1})^{-1} = I$), gives

substituting finally gives Douglas-Rachford:

(besides the stepsize which we would get by starting with the equivalent inclusion in the first place).

Probably the shortest derivation of Douglas-Rachford I have seen. Oh, and also the (weak) convergence proof comes for free: It’s a proximal point iteration and you just use the result by Rockafellar from “Monotone operators and the proximal point algorithm”, SIAM J. Control and Optimization 14(5), 1976.
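For readers who want to see the method run, here is a minimal sketch of Douglas-Rachford in one common textbook form. It is not taken from the derivation above: the variable names and the order of the two resolvent steps are my own choice and may differ from the formulas in this post. I use proximal operators, i.e. resolvents of subgradients, for the toy problem of minimizing $\tfrac12(x-3)^2 + |x|$, whose solution is $x=2$.

```python
import numpy as np


def soft_threshold(v, lam):
    """Proximal operator of lam*|.| (resolvent of its subgradient)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)


def douglas_rachford(resolvent_A, resolvent_B, z0, iterations=200):
    """Textbook Douglas-Rachford with unit stepsize.

    resolvent_A and resolvent_B are callables for (I+A)^{-1} and (I+B)^{-1}.
    Returns the 'shadow' iterate resolvent_A(z).
    """
    z = np.asarray(z0, dtype=float)
    for _ in range(iterations):
        x = resolvent_A(z)
        y = resolvent_B(2.0 * x - z)
        z = z + y - x
    return resolvent_A(z)


# Toy problem (an assumption for illustration): A = gradient of 0.5*(x-3)^2,
# B = subgradient of |x|.
resolvent_A = lambda v: (v + 3.0) / 2.0
resolvent_B = lambda v: soft_threshold(v, 1.0)

x_star = douglas_rachford(resolvent_A, resolvent_B, np.zeros(1))
print(x_star)  # should be close to [2.]
```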

Currently I am at the SIAM Imaging conference in Hong Kong. It’s a great conference with great people at a great place. I am pretty sure that this will be the only post from here, since the conference is quite intense. I just wanted to report on two ideas that have become clear to me here. Both are pretty easy and probably already widely known, but anyway:

1. Non-convex + convex objective

There are a lot of talks that deal with optimization problems of the form

$$\min_x F(x) + G(x).$$

Especially, people try to leverage as much structure of the functionals $F$ and $G$ as possible. Frequently, there arises a need to deal with non-convex parts of the objective, and indeed, there are several approaches around that deal in one way or another with non-convexity of $F$ or even $F+G$. Usually, in the presence of an $F$ that is not convex, it is helpful if $G$ has favorable properties, e.g. that $F+G$ still is bounded from below, coercive or even convex again. A particularly helpful property is strong convexity of $G$ (i.e. $G$ stays convex even if you subtract $\tfrac{\epsilon}{2}\|\cdot\|^2$ from it for some $\epsilon>0$). Here comes the simple idea: If you already allow $F$ to be non-convex, but only have a $G$ that is merely convex, but not strongly so, you can modify your objective to

$$\bigl(F(x) - \tfrac{\epsilon}{2}\|x\|^2\bigr) + \bigl(G(x) + \tfrac{\epsilon}{2}\|x\|^2\bigr)$$

for some $\epsilon>0$. This will give you strong convexity of $G + \tfrac{\epsilon}{2}\|\cdot\|^2$ and an $F - \tfrac{\epsilon}{2}\|\cdot\|^2$ that is (often) theoretically no worse than it used to be. It occurred to me that this is an idea that Kristian Bredies told me already almost ten years ago and which we made into a paper (together with Peter Maaß) in 2005 which got somehow delayed and was published no earlier than 2009.

2. Convex-concave saddle point problems

If your problem has the form

$$\min_x F(x) + G(Kx)$$

with some linear operator $K$ and both $F$ and $G$ are convex, it has turned out that it is tremendously helpful for the solution to consider the corresponding saddle point formulation: I.e. using the convex conjugate $G^*$ of $G$, you write

$$\min_x\max_y\ F(x) + \langle Kx,y\rangle - G^*(y).$$

A class of algorithms that looks like the Arrow-Hurwicz method at first glance has been sparked by the method proposed by Chambolle and Pock. This method allows $F$ and $G$ to be merely convex (no smoothness or strong convexity needed) and only needs the proximal operators for both $F$ and $G^*$. I also worked on algorithms for slightly more general problems, involving a reformulation of the saddle point problem as a monotone inclusion, with Tom Pock in the paper “An accelerated forward-backward algorithm for monotone inclusions”, and I also should mention this nice approach by Bredies and Sun who consider another reformulation of the monotone inclusion. However, in the spirit of the first point, one should take advantage of all the available structure in the problem, e.g. smoothness of one of the terms. Some algorithms can exploit smoothness of either $F$ or $G^*$ and only need convexity of the other term. An idea that has been used for some time already to tackle the case that $F$, say, is a sum of a smooth part and a non-smooth part (and $G^*$ is not smooth) is to dualize the non-smooth part of $F$: Say we have $F = F_1 + F_2$ with smooth $F_1$, then you could write

$$\min_x\max_{y,z}\ F_1(x) + \langle x,z\rangle + \langle Kx,y\rangle - F_2^*(z) - G^*(y)$$

and you are back in business, if your method allows for sums of convex functions in the dual. The trick got the sticky name “dual transportation trick” in a talk by Marc Teboulle here and probably that will help me not to forget it from now on…
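To make the saddle point approach a bit more concrete, here is a minimal sketch of the basic Chambolle-Pock (primal-dual hybrid gradient) iteration for $\min_x F(x) + G(Kx)$. The function and argument names are my own, the stepsizes are assumed to satisfy $\tau\sigma\|K\|^2 < 1$, and the toy problem ($F(x) = \tfrac12(x-3)^2$, $G = |\cdot|$, $K = I$, so that the solution is $x = 2$) is just for illustration.

```python
import numpy as np


def chambolle_pock(K, prox_tau_F, prox_sigma_Gstar, tau, sigma, x0, y0,
                   iterations=500):
    """Basic primal-dual iteration with extrapolation parameter 1.

    prox_tau_F:       proximal operator of tau*F
    prox_sigma_Gstar: proximal operator of sigma*G^* (G^* = convex conjugate)
    """
    x = np.asarray(x0, dtype=float).copy()
    y = np.asarray(y0, dtype=float).copy()
    x_bar = x.copy()
    for _ in range(iterations):
        y = prox_sigma_Gstar(y + sigma * (K @ x_bar))   # dual ascent step
        x_new = prox_tau_F(x - tau * (K.T @ y))         # primal descent step
        x_bar = 2.0 * x_new - x                         # extrapolation
        x = x_new
    return x, y


# Toy data (assumed for illustration only).
K = np.eye(1)
c = 3.0
tau = sigma = 0.9                                       # tau*sigma*||K||^2 = 0.81 < 1
prox_tau_F = lambda v: (v + tau * c) / (1.0 + tau)      # F(x) = 0.5*(x - c)^2
prox_sigma_Gstar = lambda v: np.clip(v, -1.0, 1.0)      # G^* = indicator of [-1, 1]

x_cp, y_cp = chambolle_pock(K, prox_tau_F, prox_sigma_Gstar, tau, sigma,
                            np.zeros(1), np.zeros(1))
print(x_cp, y_cp)  # should be close to [2.] and [1.]
```

Note that only the two proximal operators and products with $K$ and $K^T$ are needed, which is exactly the point made above.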

As is clear from the titles, both papers treat a similar method. The first paper contains all the theory and the second one has a few particularly interesting applications.

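In the first paper we propose to view several known algorithms such as the linearized Bregman method, the Kaczmarz method or the Landweber method from a different angle from which they all are special cases of another algorithm. To start with, consider a linear system

$$Ax = b$$

with $A\in\mathbb{R}^{m\times n}$ and $b\in\mathbb{R}^m$. A simple and fairly old method to solve it is the Landweber iteration

$$x^{k+1} = x^k - t_k A^T(Ax^k - b).$$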

Obviously, this is nothing else than a gradient descent for the functional $\tfrac12\|Ax-b\|^2$ and indeed converges to a minimizer of this functional (i.e. a least squares solution) if the stepsizes $t_k$ fulfill $\epsilon\leq t_k\leq 2/\|A\|^2 - \epsilon$ for some $\epsilon>0$. If one initializes the method with $x^0 = 0$ it converges to the least squares solution with minimal norm, i.e. to $A^\dagger b$ (with the pseudo-inverse $A^\dagger$).
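As a quick sanity check of the statement about the minimal norm solution, here is a small sketch of the Landweber iteration, written as plain gradient descent for $\tfrac12\|Ax-b\|^2$ with the constant stepsize $1/\|A\|^2$. The stepsize, the function name and the toy data are my own assumptions.

```python
import numpy as np


def landweber(A, b, iterations=500, step=None):
    """Gradient descent for 0.5*||Ax - b||^2, started at x = 0."""
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / (largest singular value)^2
    x = np.zeros(A.shape[1])
    for _ in range(iterations):
        x = x - step * (A.T @ (A @ x - b))
    return x


# Small underdetermined toy system (assumed for illustration).
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])

x_lw = landweber(A, b)
print(x_lw, np.linalg.pinv(A) @ b)   # the two vectors should agree
```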

A totally different method is even older: The Kaczmarz method. Denoting by $a_i^T$ the $i$-th row of $A$ and by $b_i$ the $i$-th entry of $b$ the method reads as

$$x^{k+1} = x^k - \frac{\langle a_{r(k)},x^k\rangle - b_{r(k)}}{\|a_{r(k)}\|^2}\,a_{r(k)}$$

where $r(k) = (k \bmod m) + 1$ or any other “control sequence” that picks up every index infinitely often. This method also has a simple interpretation: Each equation $\langle a_i,x\rangle = b_i$ describes a hyperplane in $\mathbb{R}^n$. The method does nothing else than projecting the iterates orthogonally onto the hyperplanes in an iterative manner. In the case that the system has a solution, the method converges to one, and if it is initialized with $x^0 = 0$ we have again convergence to the minimum norm solution $A^\dagger b$.
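The row-by-row projections can be sketched in a few lines. This is only an illustration with a cyclic control sequence; the function name and the toy system are my own assumptions.

```python
import numpy as np


def kaczmarz(A, b, sweeps=200):
    """Cyclic Kaczmarz: project onto the hyperplanes <a_i, x> = b_i in turn."""
    m, n = A.shape
    x = np.zeros(n)                      # start at 0 to aim for the minimum norm solution
    for k in range(sweeps * m):
        i = k % m                        # cyclic control sequence
        a = A[i]
        x = x - (a @ x - b[i]) / (a @ a) * a
    return x


# Small consistent toy system (assumed for illustration).
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])

x_kz = kaczmarz(A, b)
print(x_kz, np.linalg.pinv(A) @ b)       # the two vectors should agree
```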

There is yet another method that solves $Ax=b$ (but now it’s a bit more recent): The iteration produces two sequences of iterates

$$z^{k+1} = z^k - t_k A^T(Ax^k - b),\qquad x^{k+1} = S_\lambda(z^{k+1})$$

for some $\lambda>0$, the soft-thresholding function $S_\lambda(x) = \max(|x|-\lambda,0)\,\mathrm{sign}(x)$ and some stepsize $t_k$. For reasons I will not detail here, this is called the linearized Bregman method. It also converges to a solution of the system. The method is remarkably similar to, but different from, the Landweber iteration (if the soft-thresholding function weren’t there, both would be the same). It converges to the solution of $Ax=b$ that has the minimum value for the functional $\lambda\|x\|_1 + \tfrac12\|x\|_2^2$. Since this solution is close, and for $\lambda$ large enough identical, to the minimum $\|\cdot\|_1$ solution, the linearized Bregman method is a method for sparse reconstruction and is applied in compressed sensing.
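Here is a minimal sketch of the linearized Bregman method in the standard two-sequence form ($z$-update followed by soft-thresholding). The constant stepsize $1/\|A\|^2$, the names and the toy system are my own assumptions; the toy system is chosen such that the minimizer of $\lambda\|x\|_1 + \tfrac12\|x\|_2^2$ subject to $Ax=b$ can be checked by hand to be $(0,1,1,0)$ for $\lambda = 1$.

```python
import numpy as np


def soft_threshold(v, lam):
    """Componentwise soft-thresholding S_lam."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)


def linearized_bregman(A, b, lam, iterations=500, step=None):
    """z^{k+1} = z^k - t A^T (A x^k - b),  x^{k+1} = S_lam(z^{k+1})."""
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2) ** 2   # assumed constant stepsize
    z = np.zeros(A.shape[1])
    x = np.zeros(A.shape[1])
    for _ in range(iterations):
        z = z - step * (A.T @ (A @ x - b))
        x = soft_threshold(z, lam)
    return x


# Toy system with a sparse solution (assumed for illustration).
A = np.array([[1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 1.0]])
b = np.array([2.0, 2.0])

x_lb = linearized_bregman(A, b, lam=1.0)
print(x_lb)   # should be close to [0, 1, 1, 0]
```

Compare this with the minimum norm solution $A^\dagger b = (0.4, 0.8, 0.8, 0.4)$, which is not sparse at all.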

Now we put all three methods in a joint framework, and this is the framework of split feasibility problems (SFP). An SFP is a special case of a convex feasibility problem where one wants to find a point in the intersection of multiple simple convex sets. In an SFP one has two different kinds of convex constraints (which I will call “simple” and “difficult” in the following):

Constraints that just demand that $x\in C_i$ for some convex sets $C_i$. I call these constraints “simple” because we assume that the projection onto each $C_i$ is simple to obtain.

Constraints that demand $A_ix\in Q_i$ for some matrices $A_i$ and simple convex sets $Q_i$. Although we assume that projections onto the $Q_i$ are easy, these constraints are “difficult” because of the presence of the matrices $A_i$.

If there were only simple constraints, a very basic method to solve the problem is the method of alternating projections, also known as POCS (projection onto convex sets): Simply project onto all the sets $C_i$ in an iterative manner. For difficult constraints, one can do the following: Construct a hyperplane that separates the current iterate from the set defined by the constraint and project onto the hyperplane. Since projections onto hyperplanes are simple and since the hyperplane separates, we move closer to the constraint set and this is a reasonable step to take. One such separating hyperplane is given as follows: For the current iterate $x^k$ and a difficult constraint $Ax\in Q$ compute $w^k = Ax^k - P_Q(Ax^k)$ (with the orthogonal projection $P_Q$) and define

$$H_k = \{x : \langle A^Tw^k,x\rangle = \langle w^k,P_Q(Ax^k)\rangle\}.$$

Now we can already unite the Landweber iteration and the Kaczmarz method as follows: Consider the system $Ax=b$ as a split feasibility problem in two different ways:

Treat $Ax=b$ as one single difficult constraint (i.e. set $Q = \{b\}$). Some calculations show that the above proposed method leads to the Landweber iteration (with a special stepsize).

Treat $Ax=b$ as $m$ simple constraints $\langle a_i,x\rangle = b_i$. Again, some calculations show that this gives the Kaczmarz method.

Of course, one could also work “block-wise” and consider groups of equations as difficult constraints to obtain “block-Kaczmarz methods”.
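The step for a single difficult constraint $Ax\in Q$ can be sketched as follows. I use one particular separating hyperplane here, namely $\{u : \langle A^Tw,u\rangle = \langle w,P_Q(Ax)\rangle\}$ with $w = Ax - P_Q(Ax)$, which separates by the usual projection inequality; this choice and all names are my own assumptions and need not coincide with the exact formula in the paper. For $Q = \{b\}$ the step reduces to a Landweber step with stepsize $\|w\|^2/\|A^Tw\|^2$.

```python
import numpy as np


def sfp_hyperplane_step(x, A, project_Q):
    """Project x orthogonally onto a hyperplane separating it from {u : Au in Q}."""
    Ax = A @ x
    w = Ax - project_Q(Ax)               # violation of the constraint Ax in Q
    g = A.T @ w                          # normal vector of the hyperplane
    gg = g @ g
    if gg == 0.0:                        # constraint satisfied, or no usable direction
        return x
    # <g, x> - <w, P_Q(Ax)> equals ||w||^2, hence this projection formula:
    return x - (w @ w) / gg * g


# Treat Ax = b as one difficult constraint with Q = {b} (toy data, assumed).
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])
project_onto_b = lambda v: b

x_sfp = np.zeros(3)
for _ in range(200):
    x_sfp = sfp_hyperplane_step(x_sfp, A, project_onto_b)
print(x_sfp, np.linalg.pinv(A) @ b)      # the two vectors should agree
```

Replacing `project_onto_b` by the projection onto any other simple convex set (a box, a ball, …) gives the corresponding step for that constraint without further changes.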

Now comes the last twist: By adapting the notion of “projection” one gets more methods. Particularly interesting is the notion of Bregman projections which comes from Bregman distances. I will not go into detail here, but Bregman distances are associated to convex functionals $f$ and by replacing “projection onto $C_i$ or hyperplanes” by the respective Bregman projections, one gets another method for split feasibility problems. The two things I found remarkable:

The Bregman projection onto hyperplanes is pretty simple. To project some $x$ onto the hyperplane $\{u : \langle a,u\rangle = \beta\}$, one needs a subgradient $z\in\partial f(x)$ (in fact an “admissible one” but for that detail see the paper) and then performs

$$x^+ = \nabla f^*(z - ta)$$

($f^*$ is the convex dual of $f$) with some appropriate stepsize $t$ (which is the solution of a one-dimensional convex minimization problem). Moreover, $z - ta$ is a new admissible subgradient at $x^+$.

If one has a problem with a constraint $Ax=b$ (formulated as an SFP in one way or another) the method converges to the minimum-$f$ solution of the equation if $f$ is strongly convex.

Note that strong convexity of $f$ implies differentiability of $f^*$ and Lipschitz continuity of $\nabla f^*$ and hence, the Bregman projection can indeed be carried out.

Now one already sees how this relates to the linearized Bregman method: Setting $f(x) = \lambda\|x\|_1 + \tfrac12\|x\|_2^2$, a little calculation shows that

$$\nabla f^*(z) = S_\lambda(z).$$

Hence, using the formulation with a “single difficult constraint” leads to the linearized Bregman method with a specific stepsize. It turns out that this stepsize is a pretty good one, but one can also show that a constant stepsize works as long as it is positive and smaller than $2/\|A\|^2$.

In the paper we present several examples of how one can use the framework. One strength of this approach, as I see it, is that one can add convex constraints to a given problem without getting into any trouble with the algorithmic framework.

The second paper extends a remark that we make in the first one: If one applies the framework of the linearized Bregman method to the case in which one considers the system $Ax=b$ as $m$ simple (hyperplane-)constraints one obtains a sparse Kaczmarz solver. Indeed one can use the simple iteration

$$z^{k+1} = z^k - \frac{\langle a_{r(k)},x^k\rangle - b_{r(k)}}{\|a_{r(k)}\|^2}\,a_{r(k)},\qquad x^{k+1} = S_\lambda(z^{k+1})$$

and will converge to the same sparse solution as the linearized Bregman method.
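A sparse Kaczmarz solver of this kind fits in a few lines. The sketch below assumes the common form of the iteration (a Kaczmarz step on an auxiliary variable $z$, followed by soft-thresholding) with a cyclic control sequence; the names and the toy system are again my own.

```python
import numpy as np


def soft_threshold(v, lam):
    """Componentwise soft-thresholding S_lam."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)


def sparse_kaczmarz(A, b, lam, sweeps=200):
    """Kaczmarz step on z using one row at a time, then x = S_lam(z)."""
    m, n = A.shape
    z = np.zeros(n)
    x = np.zeros(n)
    for k in range(sweeps * m):
        i = k % m                        # cyclic control sequence
        a = A[i]
        z = z - (a @ x - b[i]) / (a @ a) * a
        x = soft_threshold(z, lam)
    return x


# Toy system with a sparse solution (assumed for illustration).
A = np.array([[1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 1.0]])
b = np.array([2.0, 2.0])

x_sk = sparse_kaczmarz(A, b, lam=1.0)
print(x_sk)   # should be close to [0, 1, 1, 0]
```

Since each step touches only one row of $A$ and one entry of $b$, new equations can simply be appended to the cycle as they arrive.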

This method has a nice application to “online compressed sensing”: We illustrate this in the paper with an example from radio interferometry. There, large arrays of radio telescopes collect radio emissions from the sky. Each pair of telescopes leads to a single measurement of the Fourier transform of the quantity of interest. Hence, for $n$ telescopes, each measurement gives $n(n-1)/2$ samples in the Fourier domain. In our example we used data from the Very Large Array telescope which has 27 telescopes leading to 351 Fourier samples. That’s not much, if one wants a picture of the emission with several ten thousands of pixels. But the good thing is that the Earth rotates (that’s good for several reasons): When the Earth rotates relative to the sky, the sampling pattern also rotates. Hence, one waits a small amount of time and makes another measurement. Commonly, this is done until the Earth has made half a rotation, i.e. one complete measurement takes 12 hours. With the “online compressed sensing” framework we proposed, one can start reconstructing the image as soon as the first measurements have arrived. Interestingly, one observes the following behavior: If one monitors the residual of the equation, it goes down during iterations and jumps up when new measurements arrive. But from some point on, the residual stays small! This says that the new measurements do not contradict the previous ones and, more interestingly, this happened precisely when the reconstruction error dropped such that “exact reconstruction” in the sense of compressed sensing had happened. In the example of radio interferometry, this happened after 2.5 hours!

It’s out again! Our department has a vacant position for optimization to fill! This time we are seeking a professor (W2) working in discrete optimization.

The official job advertisement has been sent to various newsletters and digests and you can find it for example here or here. In addition to that information let me give some more information about the math department here. Basically, I copied the following information from this previous advertisement:

The math department here is a medium sized department. It covers quite a broad range of mathematics:

Numerical Linear Algebra (Fassbender, Bollhöfer)

PDEs (Sonar, Hempel)

Modelling (Langemann)

Stochastics (Kreiss, Lindner, Leucht)

Applied Analysis/Mathematical Physics (Bach, myself)

Algebra and Discrete Mathematics (Eick, Löwen, Opolka)

and, of course, Optimization (Zimmermann, tba). In most cases I find some expert around for all my questions that are a bit outside my field. All groups are active and working together smoothly. The department is located in the Carl-Friedrich Gauss Faculty which is also the home of the departments for Computer Science, Business Administration and Social Sciences. At least in Computer Science and Business Administration there are some mathematically oriented groups, e.g.

and there are several groups with some mathematical background and interesting fields of applications (computer graphics, robotics,…). Moreover, the TU has a lot of engineering institutes with strong background in mathematics and cool applications.
In addition to a lively and interesting research environment, the university treats its staff well (as far as I can see) and administrative burdens or failures do not do too much harm (in fact less than at other places, I’ve heard)!

Full disclosure: I am the head of the hiring committee this time. All questions you may have about the position can be sent to me.

The deadline for application is 30.04.2014. The deadline is sharp and only electronic applications (addressed to fk1@tu-bs.de) will be considered. Please send a single pdf-file and make sure that all text in the document is searchable.