Non-convex optimization is now ubiquitous in machine learning. While previously, the focus was on convex relaxation methods, now the emphasis is on being able to solve non-convex problems directly.

It is not possible to find the global optimum of every non-convex problem due to NP-hardness barrier. An alternate approach is: when can it be solved efficiently (preferably in low order polynomial time). Recent theoretical work has established that many non-convex problems can be solved near-optimally, but simple iterative algorithms, e.g. gradient descent with random restarts. The conditions for success turn out be mild and natural for many learning problems.

Some examples of guaranteed non-convex methods:

Many latent variable models can be learnt by decomposing higher order moment tensors (which is a non-convex problem). We are guaranteed to learn a consistent model, when the hidden or latent representation is non-degenerate: the effect of one hidden variable cannot be expressed as a linear combination of effects of others. See our paper. Also see an excellent blog post by Rong Ge.

Non-negative matrix factorization becomes tractable under a separability condition, see here. This condition may not always hold, but in many problems such as learning PCFGs in NLP, we can always add more features till we have a separable problem. See here.

Non-convex problems tend to work better in practice, but until now theory was only available for convex relaxation methods. This is changing fast: many non-convex methods are proven to succeed in the regimes where convex relaxation succeeds. For instance, for the problem of robust PCA, where the goal is to find a low rank + sparse decomposition of the matrix, we show that a natural non-convex method, which is more efficient than the convex method, have the same regimes of success. See here.

For non-convex problems, the main drawback is the need for good initialization. This is where practitioners tend to have good intuitions and employ domain specific knowledge to engineer these initializations. This is also now being tackled in theory. For instance, for sparse coding problem, we can initialize by finding solution to an overlapping clustering problem. See here and here.

The above works are relying on gradient descent, or other local methods, with specific initializations. An alternative method is based on smoothing, where the function is progressively transformed into a coarse-grained objective through local smoothing. With sufficient amount of smoothing, it becomes convex, and its global optimum is used as initialization to a finer-grained objective which is non-convex. Recent works analyze when such smoothing methods succeed for non-convex optimization.

Avoiding saddle points: Saddle points are critical points, where the gradient vanishes, but it is not a local optimum, meaning there exist directions where the objective value improves. Saddle points slow down stochastic gradient descent (SGD), especially as the number of variables grows. This is a very challenging problem, in fact, it has been argued that saddle points are much more of an issue than local optima; seehere. Recent works have made progress on how to escape saddle points efficiently in high dimensions. In fact, if the SGD is noisy enough, this paper shows that it can escape non-degenerate saddle points, which can be determined by second order derivatives. A more challenging case involves saddle points which require higher order derivatives to escape, and we analyze them in a recent work here.

Discrete optimization: So far, I have discussed continuous optimization. There is also interesting work in the analysis of discrete optimization, especially in the context of probabilistic inference. For instance, Stefano Ermon has been analyzing the effect of random projections on information projections; see here. Gibbs sampling is another popular technique for probabilistic inference, but it is challenging to establish a bound on the mixing time. A recent work by Christopher Re's group is trying to provide new analysis tools for Gibbs sampling. Previously, I have worked on learning Markov graph structure of probabilistic models, and show that simple and efficient methods succeed in the regime of "high temperature." You can read more about it here.

Sanjeev Arora, Moritz Hardt and Nisheeth Vishnoi are maintaining a blog on non-convex optimization: Off the convex path

This questionoriginally appeared on Quora. - the knowledge sharing network where compelling questions are answered by people with unique insights. You can follow Quora on Twitter, Facebook, and Google+. More questions:​