
I always thought of ML (Machine Learning) as a somewhat smallish but highly fashionable subset of Statistics, but now it seems ML is hyped so much that it completely obscures Statistics.

Is ML really that big?

ML has a little bit more of a CS/IT background (traditional statistics has more of a math background, with perhaps most applications in the social sciences?). Maybe CS/IT people are much better at public relations. Also, many companies nowadays see IT as core business, so maybe there is more funding available for doing significant ML applications.

MS has a poor track record with respect to analytics: Solver Foundation was abandoned in a bad way, leaving some clients hung out to dry.

MS already has a popular statistics package (it is called Excel). (Joking!)

MS is a little bit late: IBM purchased SPSS in 2009.

R is very successfully riding this hype cycle. Not bad for a somewhat funky and older language (rooted in S).

Can we solve this as a MIP?

The quadratic expressions form a problem. Although Cplex can solve non-convex QPs, it does not handle non-convex quadratic constraints (I believe SCIP does). We could use some piecewise linear approximation, but it seems to make sense to first try a linear version with a different distance function: the Manhattan distance. Here the distance between two points is the sum of the distances in the x and y directions. The formulation can look like the sketch below.
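A sketch in GAMS (my reconstruction, not the exact model from the post): the binaries pick one of the four sign combinations of the two absolute values, and M = 2 is a valid big-M constant on the unit square.

set i 'points' /p1*p75/;
alias (i,j);
set lt(i,j) 'compare each pair once';
lt(i,j) = yes$(ord(i) > ord(j));

scalar M 'big-M constant; M=2 is valid on the unit square' /2/;

positive variable x(i), y(i);
x.up(i) = 1;  y.up(i) = 1;
variable mindist 'minimum Manhattan distance';
binary variable bx(i,j), by(i,j) 'select the sign of each absolute value';

equation d1(i,j), d2(i,j), d3(i,j), d4(i,j);
* mindist <= |x(i)-x(j)| + |y(i)-y(j)|: the binaries relax all but
* one of the four sign combinations
d1(lt(i,j)).. mindist =l=  (x(i)-x(j)) + (y(i)-y(j)) + M*(bx(i,j) + by(i,j));
d2(lt(i,j)).. mindist =l=  (x(i)-x(j)) - (y(i)-y(j)) + M*(bx(i,j) + 1 - by(i,j));
d3(lt(i,j)).. mindist =l= -(x(i)-x(j)) + (y(i)-y(j)) + M*(1 - bx(i,j) + by(i,j));
d4(lt(i,j)).. mindist =l= -(x(i)-x(j)) - (y(i)-y(j)) + M*(1 - bx(i,j) + 1 - by(i,j));

model manhattan /all/;
solve manhattan using mip maximizing mindist;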

This model turns out to be not easy to solve: even after 2 hours of compute time the resulting locations are not very good.

Multi-start Algorithm

A simple multi-start algorithm just wraps a loop around this: draw a random starting point, solve, and keep the best solution found. Indeed we can find slightly better solutions this way.
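A minimal sketch of such a loop in GAMS, assuming the smooth NLP formulation (model m2, with variables x, y and mindist) from the maximin post further below:

* sketch: sequential multi-start around the NLP
set r 'restarts' /r1*r50/;
parameter bestdist 'best objective found so far' /0/;
parameter xbest(i) 'best x locations', ybest(i) 'best y locations';

loop(r,
   x.l(i) = uniform(0,1);
   y.l(i) = uniform(0,1);
   solve m2 using nlp maximizing mindist;
   if (mindist.l > bestdist,
      bestdist = mindist.l;
      xbest(i) = x.l(i);
      ybest(i) = y.l(i);
   );
);
display bestdist;

The results: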

                                                 obj (= squared min distance)   min distance   time (s)
local solve (Conopt)                             0.01694                        0.13016          1
multi-start sequential (N=50, Conopt)            0.01744                        0.13208         93
multi-start sequential (N=50, Snopt)             0.01744                        0.13205         39
multi-start parallel (N=500, 4 threads, Snopt)   0.01748                        0.13219        113

As the problems are independent of each other, it is easy to solve them in parallel. Here we used a scheme where we solve four problems in parallel: as soon as a problem is done, a new one is started, so we keep all four threads busy (no starvation). The code managing all this is written in GAMS using a few loops; being able to write little algorithms in a modeling language can be an important feature. Using the parallel algorithm we can solve 500 problems on 4 worker threads in under 2 minutes. A sketch of the basic grid pattern is shown below.
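Continuing the sketch above, the submit-and-collect pattern with GAMS's grid facility could look as follows. This is a minimal sketch: it submits all jobs at once and then polls for results, while the scheme described above additionally limits the number of solves in flight to four, which needs some extra bookkeeping on top of this pattern.

* sketch: submit all jobs asynchronously, then collect results
set rr 'restarts' /r1*r500/;
parameter h(rr) 'grid job handles';

* solvelink = 3: asynchronous grid solve
m2.solvelink = 3;
loop(rr,
   x.l(i) = uniform(0,1);
   y.l(i) = uniform(0,1);
   solve m2 using nlp maximizing mindist;
   h(rr) = m2.handle;
);

repeat
   loop(rr$h(rr),
*     handleStatus = 2 means the job is done
      if (handleStatus(h(rr)) = 2,
         m2.handle = h(rr);
         execute_loadhandle m2;
         bestdist = max(bestdist, mindist.l);
         display$handleDelete(h(rr)) 'trouble deleting handle';
         h(rr) = 0;
      );
   );
   display$sleep(card(h)*0.1) 'waiting for jobs to finish';
until card(h) = 0;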

Local Solver

Interestingly, the heuristic solver "Localsolver" likes the first formulation, model m1, better: it is more compact but non-differentiable, and this solver does not care about the objective function not being smooth.

Indeed the generated code is very low level. It certainly does not look optimal; let's say there is room for improvement. I suspect that smarter code would only increase the number of iterations per second (the search space is related to the number of these float variables, which stays the same). Update: see the comment below, where some experiments show no performance degradation for this code.

Tuesday, January 20, 2015

Another quadratic problem that has some interesting features; this is an old favorite of mine. The problem is stated as follows: a patriarch has n sons and wants to build each of them a house on his estate such that the minimum distance between them is maximized. We assume here the parent's plot is simply [0,1]x[0,1] and high fertility led to n=75 brothers.

When we place the points randomly, we get a small minimum distance: 0.0088. The two orange points almost overlap.

When forming the objective function we can do two immediate tricks:

We don’t need to compare all points i,j with each other twice. We can limit the distance calculations to the lower triangular part, i.e. where i > j.

We can drop the sqrt in the distance calculation, as the square root is a monotonic function.

The default GAMS all-zero starting point is very bad: all gradients are zero. E.g., Conopt will just stay at that point: it does not move an inch. It is better to use the above random configuration as the initial point; from the results we see this is actually not a bad starting point.

There are two different objectives we can use. The first is a nice, compact single objective function that is unfortunately non-differentiable. The second formulation leads to a set of upper limits on mindist (we are maximizing mindist, so we automatically find the correct one; note that mindist represents the squared minimum distance). The set lt(i,j) has 0.5(n² − n) = 2775 elements, so we have 2775 inequalities in the second case.
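Both forms could look as follows in GAMS (my reconstruction, not necessarily the exact code behind the post):

set i 'brothers' /b1*b75/;
alias (i,j);
set lt(i,j) 'lower triangular part: each pair once';
lt(i,j) = yes$(ord(i) > ord(j));

positive variable x(i) 'x locations', y(i) 'y locations';
x.up(i) = 1;  y.up(i) = 1;
variable mindist 'squared minimum distance';

equation defmin1 'm1: compact but non-differentiable (DNLP)';
equation defmin2(i,j) 'm2: one smooth inequality per pair (NLP)';
defmin1..          mindist =e= smin(lt(i,j), sqr(x(i)-x(j)) + sqr(y(i)-y(j)));
defmin2(lt(i,j)).. mindist =l= sqr(x(i)-x(j)) + sqr(y(i)-y(j));

model m1 /defmin1/;
model m2 /defmin2/;

* random starting point (the all-zero default point has zero gradients)
x.l(i) = uniform(0,1);
y.l(i) = uniform(0,1);
solve m2 using nlp maximizing mindist;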

Typically, local NLP solvers like the second form better, e.g. with Conopt:

The second objective is much better although the model is much bigger, and we get a minimum distance of 0.1302. As you can see here, problem size is not all that matters when solving optimization problems. The brothers are now located as follows:

The orange points indicate where the minimum distance is attained. This starts to look like the grid-like pattern we would expect. This also explains why we see an increase in the use of nonlinear programming in packing applications.

Next: towards global solutions.

Update: 3d version

We can generalize to 3D as sketched below. The resulting plots are somewhat difficult to interpret.
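A sketch, extending the model above with a third coordinate (everything else stays the same):

positive variable z(i) 'z locations';
z.up(i) = 1;
equation defmin3(i,j);
defmin3(lt(i,j)).. mindist =l= sqr(x(i)-x(j)) + sqr(y(i)-y(j)) + sqr(z(i)-z(j));
model m3 /defmin3/;
z.l(i) = uniform(0,1);
solve m3 using nlp maximizing mindist;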

Wednesday, January 14, 2015

GAMS does not have a complements keyword. Instead it uses a simple notation that matches variables to equations in the model statement:

MODEL TRNSP / PROFIT.X, SUPPLY.W, DEMAND.P/ ;
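For context, a minimal sketch of what such a matched model could look like: a transport-style complementarity model where each equation is paired with its complementary variable. This is my illustrative reconstruction, not necessarily the exact model behind the statement above.

set i 'supply nodes', j 'demand nodes';
parameter s(i) 'supply', d(j) 'demand', c(i,j) 'unit transport cost';

positive variable x(i,j) 'shipments', w(i) 'supply price', p(j) 'demand price';
equation profit(i,j) 'zero-profit condition';
equation supply(i) 'supply limit';
equation demand(j) 'demand requirement';

profit(i,j)..  w(i) + c(i,j) - p(j) =g= 0;
supply(i)..    s(i) - sum(j, x(i,j)) =g= 0;
demand(j)..    sum(i, x(i,j)) - d(j) =g= 0;

model trnsp / profit.x, supply.w, demand.p /;
solve trnsp using mcp;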

Secondly, it is mentioned that GAMS has a presolve facility. Amazingly, it does not: it leaves presolving to the solver. For LP and MIP problems this may not be so bad, although a built-in presolver has some advantages: smaller problems and solution files to exchange with the solver, and most likely better error messages, because more context is available (the importance of this point is often underestimated, IMHO).

For NLP/MINLP models a built-in presolver can be more important: many nonlinear solvers do not have one. In general, a nonlinear presolver is difficult to implement inside the solver, because the solver does not know the algebraic form (the math) of the nonlinear functions it is executing (of course it can still work on the linear part of the model). The most popular NLP solver with GAMS is CONOPT, which actually has very good nonlinear presolving facilities (tailored for use with GAMS) and provides excellent diagnostics. But models to be solved with, say, MINOS or SNOPT can benefit hugely from a presolver: I have seen models solved much more easily by AMPL with these solvers thanks to AMPL's presolving capabilities; it really can make a huge difference.

Having said this, careful modeling can lead to models where there is little room for the presolver to make a large difference. I try to implement my models this way: I always feel guilty of not doing a good job when the presolver is taking out large chunks of the model. Actually, cleaning up a model to get rid of parts that would be presolved away is often a good exercise.

Sometimes small presolve-like operations need to be performed by GAMS. Often this has to do with the fact that GAMS has an objective variable while solvers have an objective function. For an LP/MIP we would not care (just add an objective row with a single nonzero element at the location of the objective variable). But in the case of a QP it is important to substitute out the objective variable (for NLPs this can also be important: linearly constrained problems may be much easier to solve). Even these simple transformation steps are done only in a somewhat half-baked fashion, as can be seen in the LP file for a small QP problem fed into Cplex.
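To make the substitution concrete, consider a toy QP: min z with z =e= sqr(x) + sqr(y) and x + y >= 1. A fully substituted objective would look roughly like the following fragment in Cplex LP format (this is a made-up illustration, not the actual file produced by GAMS for the post's model):

\ toy QP with the objective variable z substituted out
Minimize
 obj: [ 2 x ^2 + 2 y ^2 ] / 2
Subject To
 e1: x + y >= 1
End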

Friday, January 9, 2015

The following model tries to assign jobs to ‘bins’ (bins can be interpreted as workstations) in such a way that the remaining space in the bins is as large as possible (i.e. as little fragmentation as possible so that we have capacity left for large jobs).

Essentially we place a high value on large remaining capacities z_k. We consider here the case β=0 (if β>0 we also prefer jobs to be executed early: remaining capacity further in the future is more valuable). This is a non-convex problem, so not so easy to solve. This is also the model discussed here.
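A sketch of the core model in GAMS (my notation; I omit the β term and the time/span structure, so jobs here simply consume capacity of a single bin):

set j 'jobs', k 'bins';
parameter len(j) 'job size', cap(k) 'bin capacity';

binary variable assign(j,k) 'job j is placed in bin k';
positive variable z(k) 'remaining capacity of bin k';
variable totval 'objective';

equation eassign(j) 'each job goes into exactly one bin';
equation ecap(k) 'remaining capacity of each bin';
equation eobj 'value large remaining capacities more than linearly';
eassign(j)..  sum(k, assign(j,k)) =e= 1;
ecap(k)..     z(k) =e= cap(k) - sum(j, len(j)*assign(j,k));
eobj..        totval =e= sum(k, sqr(z(k)));

model frag /all/;
solve frag using miqcp maximizing totval;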

The choice of a quadratic objective is somewhat arbitrary: it is a simple function with the required property that larger values of z_k are valued more than proportionally (i.e., more than linearly). In practice I often handle this with some piecewise linear formulation. In addition, in practical cases it may be useful to give bins (workstations) without any jobs additional value (no set-up or clean-up in this case, so an unused bin may be preferred).

With Cplex’s global MIQP solver (see http://www.orsj.or.jp/ramp/2014/paper/4-3.pdf) we now have at least three ways to solve this problem with Cplex: global MIQP, MIP with SOS2 interpolation, and MIP with binaries exploiting the integer data for capacities and job spans. Using an expanded (larger) version of the dataset provided in the paper, with 27 bins and 15 jobs to allocate, we try a few formulations.

The global MIQP formulation uses the above model directly; we solve to optimality with 4 threads. Cplex's global algorithm is selected with the option SolutionTarget=3.
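In GAMS this could be done via a solver option file (a sketch, reusing the model name frag from the sketch above):

* select Cplex's global MIQP algorithm via an option file
$onecho > cplex.opt
solutiontarget 3
threads 4
$offecho
frag.optfile = 1;
solve frag using miqcp maximizing totval;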

We can add a small value for β to try to reduce some symmetry in the model. We are still aiming for large remaining capacities, but we add a little bit of noise to distinguish similar solutions. The purpose is not really to add a ‘schedule as early as possible’ objective but rather just to increase the performance of the branch-and-bound algorithm. In the experiment we use β=0.01. Note: this may not always give us the same optimal solution.

We can use a piecewise linear formulation with SOS2 variables. In this case we used 4 segments. Obviously this is an approximation, so we may arrive at a different optimal solution. However, in this case just 4 segments is good enough to find the global optimum.
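A sketch of the SOS2 interpolation (my notation, continuing the sketch above; 4 segments means 5 breakpoints per bin):

* replace sqr(z(k)) by a piecewise linear interpolation
set p 'breakpoints' /p1*p5/;
parameter zb(k,p) 'breakpoint values, spread evenly over [0,cap(k)]';
zb(k,p) = cap(k)*(ord(p)-1)/(card(p)-1);

sos2 variable lam(k,p) 'interpolation weights';
positive variable v(k) 'approximation of sqr(z(k))';

equation edefz(k), edefv(k), econv(k), eobjpw;
edefz(k)..  z(k) =e= sum(p, lam(k,p)*zb(k,p));
edefv(k)..  v(k) =e= sum(p, lam(k,p)*sqr(zb(k,p)));
econv(k)..  sum(p, lam(k,p)) =e= 1;
eobjpw..    totval =e= sum(k, v(k));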

We can assume all capacity and job-length data is integer (as is the case for the example data sets in the paper), and use this to formulate a linear model with extra binary variables. Essentially we implement a table lookup. If all data is integer valued (and the range of the numbers is not outrageously large), this gives us the proven optimal solution of the original MIQP. Exploiting properties of the input data can be key to solving things fast.

Here are the results:

These results confirm my experience. I typically use a piecewise linear formulation with just a few segments for this; in most cases even fewer than the 4 segments used here. This often performs reasonably well.

The authors of the original paper did not explore these alternatives but quickly concluded that the global MIQP (implemented in Lingo) was too slow, and implemented a heuristic instead. I am not sure they would have done the same had they seen these results.

Update:

Instead of the somewhat more complex formulations suggested in the comments, a simple table look-up using binary variables can be implemented with three constraints having a SOS1 structure:
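A sketch of what these three constraints could look like in GAMS (names and the value range are illustrative; here f is the square function):

* table lookup y = f(x) for integer x in {0,...,n};
* exactly one delta(v) equals 1, giving each row a SOS1 structure
set v 'candidate integer values 0..n (range is illustrative)' /v0*v27/;
parameter val(v) 'integer value represented by v';
val(v) = ord(v) - 1;

binary variable delta(v);
variable x, y;

equation epick 'select exactly one value';
equation exval 'x equals the selected value';
equation eyval 'y equals f of the selected value';
epick..  sum(v, delta(v)) =e= 1;
exval..  x =e= sum(v, val(v)*delta(v));
eyval..  y =e= sum(v, sqr(val(v))*delta(v));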

This will make sure we have y = f(x) where x = 0,1,…,n. This is very close to the SOS2 formulation, so it took me just minutes to implement.

Obviously the default stopping condition (max time = 1000s) is not very good for this case: we are just sitting there for 99.9% of the time at the optimal solution, doing nothing useful (266 is indeed the global optimum). In general these heuristics have no idea how close they are to an optimal solution, and just rely on the user to provide time or iteration limits. (MIP solvers have very powerful bounding information: at any moment in time we can say what the gap is between the best solution found and the best solution possible.)

Is there nothing better we can do? There is some research on probabilistic stopping rules for these kinds of algorithms, i.e. stop as soon as we think the probability of finding a better solution is small.