Wednesday, May 11, 2011

Playing with Linear Programming on PyPy

Fancy hi-level interfaces often come with a high runtime overhead
making them slow. Here is an experiment with building such an
interface using constructions that PyPy should be good at
optimizing. The idea is to allow the JIT in PyPy to remove the
overhead introduced by using a fancy high-level python interface
on top of a low-level C interface. The application considered is
Linear
programming. It is a tool used to solve linear optimization
problems. It can for example be used to find the nonnegative values
x, y and z that gives the maximum value of

without violating the constraints

There exists general purpose solvers for these kind of problems that
are very fast and can literally handle millions of variables. To use
them however the problem has to be transformed into some specific
matrix form, and the coefficients of all the matrices
has to be passed to the solver using some API. This transformation is
a tedious and error prone step that forces you to work with matrix
indexes instead of readable variable names. Also it makes maintaining
an implementation hard since any modification has to be transformed
too.

The example above comes from the manual of
the glpk library. That
manual continues by describing how to convert this problem into the
standard form of glpk (which involves introducing three new variables)
and then gives the c-code needed to call the
library. Relating that c-code to the problem above without the
intermediate explanation of the manual is not easy. A common
solution here is to build a hi-level interface that allows a more
natural way of defining the matrices and/or allow the equations to be
entered symbolically. Unfortunately, such interfaces often become
slow. For the benchmark below for example,
cvxopt
requires 20 minutes to setup a problem that takes 9.43 seconds to solve
(this seems a bit extreme, am I doing something wrong?).

The high-level interface I constructed on top of the
glpk library is
pplp and it allows
the equations to be entered symbolically. The above problem can be
solved using

To benchmark the API I used it to solve a
minimum-cost
flow problem with 154072 nodes and 390334 arcs. The C library
needs 9.43 s to solve this and the pplp interface adds another 5.89
s under PyPy and 28.17 s under CPython. A large amount of time is
still spend setting up the problem, but it's a significant
improvement over the 20 minutes required on CPython by
cvxopt. It is
probably not designed to be fast on this kind of benchmark. I have
not been able to get cvxopt to work under PyPy. The benchmark used is
available here

Fancy hi-level interfaces often come with a high runtime overhead
making them slow. Here is an experiment with building such an
interface using constructions that PyPy should be good at
optimizing. The idea is to allow the JIT in PyPy to remove the
overhead introduced by using a fancy high-level python interface
on top of a low-level C interface. The application considered is
Linear
programming. It is a tool used to solve linear optimization
problems. It can for example be used to find the nonnegative values
x, y and z that gives the maximum value of

without violating the constraints

There exists general purpose solvers for these kind of problems that
are very fast and can literally handle millions of variables. To use
them however the problem has to be transformed into some specific
matrix form, and the coefficients of all the matrices
has to be passed to the solver using some API. This transformation is
a tedious and error prone step that forces you to work with matrix
indexes instead of readable variable names. Also it makes maintaining
an implementation hard since any modification has to be transformed
too.

The example above comes from the manual of
the glpk library. That
manual continues by describing how to convert this problem into the
standard form of glpk (which involves introducing three new variables)
and then gives the c-code needed to call the
library. Relating that c-code to the problem above without the
intermediate explanation of the manual is not easy. A common
solution here is to build a hi-level interface that allows a more
natural way of defining the matrices and/or allow the equations to be
entered symbolically. Unfortunately, such interfaces often become
slow. For the benchmark below for example,
cvxopt
requires 20 minutes to setup a problem that takes 9.43 seconds to solve
(this seems a bit extreme, am I doing something wrong?).

The high-level interface I constructed on top of the
glpk library is
pplp and it allows
the equations to be entered symbolically. The above problem can be
solved using

To benchmark the API I used it to solve a
minimum-cost
flow problem with 154072 nodes and 390334 arcs. The C library
needs 9.43 s to solve this and the pplp interface adds another 5.89
s under PyPy and 28.17 s under CPython. A large amount of time is
still spend setting up the problem, but it's a significant
improvement over the 20 minutes required on CPython by
cvxopt. It is
probably not designed to be fast on this kind of benchmark. I have
not been able to get cvxopt to work under PyPy. The benchmark used is
available here

Winston: It is indeed. What cvxopt spends 20 min on I don't know. One guess would be that it is passing the ~2 million coefficients involved to C one by one, possible with a bit of error checking for each of them. As for the 6 s used by pplp, it needs to convert the equations into the matrices glpk wants. That means shuffling the coefficients around a bit and some bookkeeping to keep track of which goes where.

Are you distinguishing between the time it takes to setup the optimization problem and the time it takes to actually solve it?

GLPK is a simplex solver written in C, and CVXOPT is an interior point solver written in Python/C and is not particularly optimized for sparse problem. Nevertheless, you should check the you actually formulate a large sparse problem in CVXOPT, and not a dense one.