Erling's blog

Thursday, August 24, 2017

Since I spend a lot of time on solving sparse linear equation systems then I am also a user of sparse matrix reordering methods. My claim to fame is that I have implemented approximate minimum degree myself and it is used in MOSEK.

Below I summarize some interesting link to graph partitioning software:

Tuesday, April 12, 2016

It is very common to use a BLAS library to perform linear algebra operations such as dense matrix times dense matrix multiplication which can be performed using the dgemm function. The advantage of BLAS is

it is well a defined standard.

and hardware vendors such as Intel supplies a tuned version.

Now at MOSEK my employeer we use the Intel MKL library that includes a BLAS implementation. It really helps us deliver good floating point performance. Indeed we use a sequential version of Intel MKL but call it from potentially many threads using Clik plus. This works well due to the well designed BLAS interface. However, there is one rotten apple in the basket and that is error handling.

Here I will summarize why the error handling in the BLAS standard is awful from my perspective.

First of all why can errors occur when you do the dgemm operation if we assume the dimensions of the matrices are correct and ignoring issues with NANs and the like. Well, in order to obtain good performance the dgemm function may allocate additional memory to store smallish matrices that fit into the cache. I.e. the library use a blocked version to improve the performance.

Oh wait that means it can run of memory and then what? The BLAS standard error handling is to print a message to stderr or something along that line.

Recall that dgemm is embedded deep inside MOSEK which might be embedded deep inside a third party program. This implies an error message printed to stderr does not make sense to the user. Also the user would NOT like us to terminate the application with a fatal error. Rather we want to know that an out of space situation happened and terminate gracefully. Or doing something to lower the space requirement. E.g. use a fewer threads.

What is the solution to this problem? The only solution offered is to replace a function named xerbla that gets called when an error happens. The idea is that the function can set a global flag indicating an error happened. This might be a reasonable solution if the program is single threaded. Now instead assume you use a single threaded dgemm (from say Intel MKL) but call it from many threads. Then first of all you have to introduce a lock (a mutex) around the global error flag leading to performance issues. Next it is hard to figure out which of all the dgemm calls that failed. Hence, you have to fail them all. What pain.

Why is the error handling so primitive in BLAS libraries. I think the reasons are:

BLAS is an old Fortran based standard.

For many years BLAS routine would not allocate storage. Hence, dgemm would never fail unless the dimensions where wrong.

BLAS is proposed by academics which does not care so much about error handling. I mean if you run out of memory you just buy a bigger supercomputer and rerun your computations.

If the BLAS had been invented today it would have been designed in C most likely and then all functions would have returned an error code. I know dealing with error codes is a pain too but that would have made error reporting much easier for those who wanted to do it properly.

Wednesday, February 18, 2015

I found the talk: Plain Threads are the GOTO of todays computing by Hartmut Kaiser very interesting because I have been working on improving the multithreaded code in MOSEK recently. And is also thinking how MOSEK should deal with all the cores in the CPUs in the future.
I agree with Hartmut something else than plain threads is needed.

Wednesday, November 6, 2013

First a clarification conic quadratic optimization and second order cone optimization is the same thing. I prefer the name conic quadratic optimization though.

Frequently it is asked on the internet what is the computational complexity of solving conic quadratic problems. Or the related questions what is the complexity of the algorithms implemented in MOSEK, SeDuMi or SDPT3.

Using for instance the primal-dual interior point algorithm the problem can be solved to \(\varepsilon\) accuracy in \(O(\sqrt{d} \ln(\varepsilon^{-1}))\) interior-point iterations, where \(\varepsilon\) is the accepted duality gap. The most famous variant having that iteration complexity is based on Nesterov and Todds beautiful work on symmetric cones.

Each iteration requires solution of a linear system with the coefficient matrix
\[ \label{neweq}
\left [ \begin{array}{cc}
H & A^T \\
A & 0 \\
\end{array}
\right ] \mbox{ (*)}
\]
This is the most expensive operation and that can be done in \(O(n^3)\) complexity using Gaussian elimination so we end at the complexity \(O(n^{3.5}\ln(\varepsilon^{-1}))\).

That is the theoretical result. In practice the algorithms usually works much better because they normally finish in something like 10 to 100 iterations and rarely employs more than 200 iterations. In fact if the algorithm requires more than 200 iterations then typically numerical issues prevent the software from solving the problem.

Finally, typically conic quadratic problem is sparse and that implies the linear system mentioned above can be solved must faster when the sparsity is exploited. Figuring our to solve the linear equation system (*) in the lowest complexity when exploiting sparsity is NP hard and therefore optimization only employs various heuristics such minimum degree order that helps cutting the iteration complexity. If you want to know more then read my Mathematical Programming publication mentioned below. One important fact is that it is impossible to predict the iteration complexity without knowing the problem structure and then doing a complicated analysis of that. I.e. the iteration complexity is not a simple function of the number constraints and variables unless A is completely dense.

To summarize primal-dual interior-point algorithms solve a conic quadratic problem in less 200 times the cost of solving the linear equation system (*) in practice.

So can the best proven polynomial complexity bound be proven for software like MOSEK. In general the answer is no because the software employ an bunch of tricks that speed up the practical performance but unfortunately they destroy the theoretical complexity proof. In fact, it is commonly accepted that if the algorithm is implemented strictly as theory suggest then it will be hopelessly slow.

I have spend of lot time on implementing interior-point methods as documented by the Mathematical Programming publication and my view on the practical implementations are that are they very close to theory.