I am a full-time consultant and provide services related to the design, implementation and deployment of mathematical programming, optimization and data-science applications. I also teach courses and workshops. Usually I cannot blog about projects I am doing, but there are many technical notes I’d like to share. Not least because this gives me an easy way to search and find them again myself. You can reach me at erwin@amsterdamoptimization.com.

Saturday, February 27, 2016

The optimization toolbox of Matlab does not contain an MIQP (Mixed Integer Quadratic Programming) solver. So when the documentation discusses a cardinality-constrained portfolio problem (meaning only a limited number of assets can be in the portfolio), an interesting cutting plane technique is used instead. This method solves a series of linear MIP models instead of attacking the MIQP model directly.

Let’s first formulate the problem as a proper MIQP model. Here is how the data is prepared:

Data

Quadratic Data Trick

The main deviation from the original model is that we don’t store the variance-covariance matrix \(Q\) as a full dense matrix but as an upper-triangular matrix (it includes the diagonal). We multiply the off-diagonal elements by two, so that the expression \(\sum_{i,j} x_i Q_{i,j} x_j \) yields exactly the same result for any \(x\). Basically we replace expressions like \(x_1 Q_{1,2} x_2 +x_2 Q_{2,1} x_1\) by \( 2 x_1 Q_{1,2} x_2\). The sole purpose of this trick is to have fewer non-zero elements in the quadratic form \(x^TQx\). This really can help for larger problems (it looks mainly like a GAMS issue: GAMS is slower in generating the model when the nonzero count of the Q matrix increases; it does not make much difference for the QP solver).
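To make the trick concrete, here is a small sketch in R (the data is just a random symmetric matrix; in the GAMS model the same manipulation is applied to the parameter Q):

set.seed(123)
A  <- matrix(rnorm(16), 4, 4)
Q  <- crossprod(A)                            # full symmetric matrix
Qu <- Q
Qu[lower.tri(Qu)] <- 0                        # keep only the upper triangle + diagonal
Qu[upper.tri(Qu)] <- 2 * Qu[upper.tri(Qu)]    # double the off-diagonal elements
x  <- runif(4)
c(full = drop(t(x) %*% Q %*% x),              # both give the same value of x'Qx
  triangular = drop(t(x) %*% Qu %*% x))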

The model itself looks like:

MIQP Model

Modeling trick

In the original model the variable \(K\) was not present and the min- and max-fraction constraints were formulated as:
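Schematically (a sketch; \(v_i\) stands in for whatever appears inside the summation, and \(\mathit{lo}\), \(\mathit{up}\) are placeholder bounds), the original pair of constraints looks like:

\[ \mathit{lo} \le \sum_i v_i, \qquad \sum_i v_i \le \mathit{up} \]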

This duplicates a summation. Our formulation adds a single variable (with a lower- and upper-bound) but we get rid of an equation that includes a summation. As a result we reduce the number of non-zero elements in the constraint matrix.

where \(K\) is the current iteration or cycle, and \(x^*_k\) is the optimal MIP solution from the k-th iteration. This is essentially a Taylor approximation.
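Concretely, such a cut follows the usual linearization of \(z \ge x^TQx\) (a sketch of the standard construction; the notation in the post’s image may differ slightly):

\[ z \ge {x^*_k}^T Q\, x^*_k + 2\,{x^*_k}^T Q \,(x - x^*_k) \]

Since \(x^TQx\) is convex, each such cut underestimates the true objective, so the linear MIPs provide valid lower bounds.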

Here is our version of the algorithm:

In this second model we replaced the original objective by the new linear objective and the cuts. The equation cut is indexed by a dynamic set cuts that grows during the iteration loop. We start the iteration scheme with an empty set cuts. Note that in the calculation of \(\beta\) we changed the expression from \(2{x^*_k}^T Q\) to \({x^*_k}^T (Q+Q^T)\). This is to compensate for the upper-triangularization trick we used before.
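A quick numerical check of this compensation in R (tiny made-up data):

Q  <- matrix(c(4, 1,   2,
               1, 3,   0.5,
               2, 0.5, 5), 3, 3)               # full symmetric matrix
Qu <- Q
Qu[lower.tri(Qu)] <- 0                         # the upper-triangular storage trick
Qu[upper.tri(Qu)] <- 2 * Qu[upper.tri(Qu)]
x  <- c(0.2, 0.5, 0.3)
max(abs(2 * (Q %*% x) - (Qu + t(Qu)) %*% x))   # 0: x'(Qu+Qu') equals 2 x'Q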

The funny factor \(\frac{1}{2}\) is often used in QP algorithms. I suspect this is largely for historical reasons (the background may be that the gradients are then just the \(q_{i,j}\) values). Notice in the input that the first \(m_1\) constraints are equality constraints and the next \(m_2\) constraints are \(\ge\) inequalities. Bounds need to be represented as constraints and \(\le\) constraints \(a^Tx \le b\) need to be passed on as \(-a^Tx \ge -b\). The fact that linear equations are expressed as \(A^Tx = b\) is unusual (\(Ax = b\) is far more conventional).

Here is a small example on how one can use it to solve a small portfolio problem:
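A minimal sketch of such a call, using solve.QP from R’s quadprog package (the numbers below are made up for illustration, not the portfolio data used in the post):

library(quadprog)
Q    <- matrix(c(0.010, 0.002, 0.001,
                 0.002, 0.020, 0.003,
                 0.001, 0.003, 0.030), 3, 3)   # covariance matrix (illustrative)
mu   <- c(0.05, 0.08, 0.12)                    # expected returns (illustrative)
rmin <- 0.07                                   # required portfolio return
# minimize 0.5*x'Qx  s.t.  sum(x) = 1 (equality), mu'x >= rmin, x >= 0
Amat <- cbind(rep(1, 3), mu, diag(3))          # constraints enter as columns (A'x)
bvec <- c(1, rmin, rep(0, 3))
sol  <- solve.QP(Dmat = Q, dvec = rep(0, 3), Amat = Amat, bvec = bvec, meq = 1)
sol$solution                                   # optimal portfolio weights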

Javascript

Here we try the JavaScript version of this algorithm, found here: www.numericjs.com. We throw the same data at it and use the notebook-like interface to enter the data and solve the problem:

Looks like the constraints cause some problems here. The unconstrained solution is the same. But when the constraints are added to the problem the algorithm fails miserably. At least on this problem, R seems to be a winner.

Matlab

Matlab also contains a QP algorithm called quadprog, documented here. The usage is a little bit more natural in my opinion: they don’t transpose A and they allow lower- and upper-bounds to be specified directly.

Tuesday, February 23, 2016

The number 6210001000 describes itself: there are 6 zeroes in the number, 2 ones, 1 two, and 1 six. Such a number is called a self-descriptive number. The post is talking about a Prolog solution. Of course we can use a MIP solver instead of Prolog:
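One possible MIP formulation (a sketch, not necessarily identical to the model in the image) uses binary variables \(x_{p,d}\in\{0,1\}\) indicating that position \(p\) holds digit \(d\), and variables \(v_p\) for the digit value in position \(p\), with positions and digits both running from \(0\) to \(n-1\):

\[\begin{align}
&\sum_d x_{p,d} = 1 && \forall p\\
&v_p = \sum_d d\cdot x_{p,d} && \forall p\\
&v_d = \sum_p x_{p,d} && \forall d
\end{align}\]

The last constraint states that the digit in position \(d\) equals the number of positions holding the value \(d\), which is exactly the self-descriptive property. Any feasible point will do, so a dummy objective can be used.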

The Prolog solution is extremely short but not exactly very readable or intuitive (I guess readability was not one of the design criteria for Prolog – see here for a discussion of write-only languages, with a particularly illuminating APL example). I prefer the mathematical optimization model here. A plus for the Prolog version is that it also verifies there is only a single solution for n=10.

If we just want to inspect a few of these objects, it may be more convenient to use a lazy load DB format. Such a database consists of two files: an .RDB file with data and an .RDX file with an index indicating where each object is located in the .RDB data file. The .RDB data file is slightly larger than the corresponding .Rdata file:

This is because each object is stored and compressed individually.

Loading is super fast as we don’t really load the data:

> lazyLoad("indus89")
NULL
> length(ls())
[1] 275

This will only load the index of the data. The RStudio environment shows:

Now, as soon as we do anything with the data, it will be loaded. This is the "lazy" loading concept. E.g. let’s just print alpha:

Update

As indicated in the comments below, the lazyLoad function is described as being for internal use, so using it requires a certain bravery. I am sure renaming the function to myLazyLoad does not count as a solution for this.
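For the brave, here is a small sketch of the round trip (both makeLazyLoadDB and lazyLoad are internal functions, so this may break in future R versions; the object and file names are made up):

e <- new.env()
e$alpha <- runif(1e6)
e$beta  <- matrix(rnorm(100), 10, 10)
tools:::makeLazyLoadDB(e, "mydb")    # writes mydb.rdb (data) and mydb.rdx (index)
rm(e)
lazyLoad("mydb")                     # reads only the index; objects become promises
ls()                                 # the names are there...
mean(alpha)                          # ...the data is pulled from mydb.rdb on first use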

Alternatives mentioned in the comments (of course if the venerable Bill Venables makes a comment I better pay attention):

Sunday, February 21, 2016

I came across a question about using an xor condition in a MIP model. Here is a summary of the answer.

Expressing \(z=x\cdot y\) (or \(z=x\text{ and }y\) ) where \(x,y,z \in \{0,1\}\) as a set of linear constraints is well-known:

\[\begin{align}
&z \le x \\
&z \le y \\
&z \ge x+y-1
\end{align}\]

\(\Longleftrightarrow\qquad
\begin{array}{ccc}
z & x & y\\ \hline
0 & 0 & 0\\
0 & 0 & 1\\
0 & 1 & 0\\
1 & 1 & 1
\end{array}\)

The relation \(z=x\text{ xor }y\) is slightly more complicated. The xor (exclusive-or) condition can also be written as \(z = x<>y\), i.e. \(z\) is 1 if \(x\) and \(y\) are different (and \(z\) is zero if they are the same). This we can write as the following set of linear inequalities:
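The standard set of inequalities for this (reconstructed here; it is the usual textbook linearization) is:

\[\begin{align}
&z \le x + y\\
&z \ge x - y\\
&z \ge y - x\\
&z \le 2 - x - y
\end{align}\]

It is easy to check that these four inequalities force \(z=1\) exactly when \(x\ne y\), and \(z=0\) when \(x=y\).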

Tuesday, February 16, 2016

Problem Description

I am not at all sure the physics make sense, but let’s focus on the mathematics of this model.

Formulation A

An MINLP formulation could be:

The idea is:

If \(y_i=0\) (oven is turned off) we force \(x_i=0\) (equation Off). Also we have \(loss_i=0\) as we multiplied the loss function by \(y_i\).

If \(y_i=1\) (oven is turned on), we let the solver decide on \(x_i\). In almost all cases it will choose a value > 0 as it wants to be somewhat close to \(a_i\). But if \(x_i\) would become zero something special happens: suddenly \(y_i=0\) becomes a better solution.

We don’t have to model \(x_i=0 \implies y_i=0\) explicitly. If \(x_i=0\), the solver will prefer \(y_i=0\), as that improves the objective.

This model is no longer quadratic. For small instances like this we can use a global solver without a problem.

Formulation B

If we can make the problem quadratic and convex we can solve the problem with many more solvers. So here is a formulation that does just that:

Here we have:

If \(y_i=0\) (oven is turned off) we force \(x_i=0\) (equation Off). The slack \(s_i\) is left floating. This will cause the algorithm to choose a slack such that the loss is zero.

Thursday, February 11, 2016

When exporting data sets in R’s .Rdata format, one of the things to consider is how string vectors are exported. I can now write factors in my .Rdata writer, so we can do some experiments on exporting a string column just as strings or as a factor.

Here is an illustration of how things are stored in an .Rdata file:

There is some overhead with each string: 8 bytes. This can add up. An integer vector takes much less space.
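A small experiment along these lines (illustrative; the files are written uncompressed so the sizes are easy to interpret):

s <- sample(c("seattle", "san-diego", "topeka"), 1e5, replace = TRUE)
f <- factor(s)
save(s, file = "strings.Rdata", compress = FALSE)
save(f, file = "factor.Rdata",  compress = FALSE)
file.size("strings.Rdata")   # every element stores the full string
file.size("factor.Rdata")    # integer codes plus a few unique levels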

Microsoft has acquired Revolution Analytics. Here are some early results of that.

Microsoft R Open

This is the standard R version with the addition of the Intel MKL library for high performance linear algebra (including multi-threading).

Also included is a neat way to make sure your users use the same versions of packages as you did. This will make reproducibility easier. Basically the idea is to add a checkpoint command with the snapshot date in your script that verifies that the same snapshot is used by your users. They call this the CRAN Time Machine.
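Roughly, usage looks like this (the snapshot date below is just an example):

library(checkpoint)
checkpoint("2016-02-20")   # install/use the package versions from this MRAN snapshot
# from here on, library() calls resolve against the pinned snapshot
library(data.table)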

The Microsoft site is called mran.microsoft.com (MRAN = Microsoft R Application Network). The icon seems to be a monkey with glasses:

Sunday, February 7, 2016

While building a prototype .Rdata writer we developed a gdx2r application. This tool will convert a GAMS GDX file to an .RData file. Here we do some performance testing to get a feeling of how fast this is and how large the files are.

The first step is to generate a large, dense three dimensional parameter and export this to a GDX file. We can use compression to make the generated GDX file smaller at the expense of more CPU time.

To get some baseline data we first see how our gdx2sqlite tool behaves. A SQLite database is a convenient format to get data into R or Python. The –fast option helps very little here; this option typically has a large impact on datasets with many smaller tables. The SQLite database is in general larger than a GDX file. A big advantage of using a database format is that we can issue queries in R or Python to import suitable subsets of the data. In addition these queries can do some operations (summarizing, pivoting, transforming etc.).
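From R such a query could look like this (using RSQLite; the file, table and column names are illustrative):

library(RSQLite)
con <- dbConnect(SQLite(), "data.db")
df  <- dbGetQuery(con, "select i, j, sum(value) as value
                        from p group by i, j")   # summarizing happens inside SQLite
dbDisconnect(con)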

A second widely used format is the CSV file. The GDXDUMP tool can generate this file. R has a read.csv function to import CSV files. The CSV format is limited to a single table.

Below we see that we can generate large SQLite files very fast: comparable to writing a CSV file. Typically we don’t get these speeds when doing inserts into a server-based RDBMS (such as SQL Server or Oracle). In the timings below we used the uncompressed GDX file.

Below are timings from the gdx2r tool. We notice that buffering before passing data to the next layer (either the file stream or the compressed file stream) really improves the speed. Even a small buffer of 1000 bytes does a good job (going from 0 to 1k bytes makes more of a difference than going from 1k to 10k bytes). The explanation is the following: the calls to the file stream are passed on to the Windows API WriteFile, and these calls have a large fixed cost. A similar argument holds for the compressed file stream. Because of the different building blocks (buffering vs no buffering, compression vs no compression) we have to try a number of possibilities.

In the table below, when bufsize=0, no buffering is used.

Finally it is interesting to compare the reading times to get things into R. Below we have:

Read table p from the SQLite database. This is relatively fast.

Read the CSV file. This is very slow compared to the alternatives.

Load the .RData file. This is also fast.

The last entry is using the GAMS package gdxrrw. That package can read GDX files directly (assumes an installed GAMS system). This is somewhat slower than I would expect. It should be similar to SQLite and .Rdata timings but it is a factor of 2 slower.
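For reference, reading a parameter with gdxrrw looks roughly as follows (requires a GAMS installation; the file and symbol names are illustrative):

library(gdxrrw)
igdx("")                           # locate the GAMS system directory via the PATH
p <- rgdx.param("data.gdx", "p")   # parameter p as a data frame (index columns + value)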

Saturday, February 6, 2016

Now that we understand the .RData file format (as shown here) we can start writing some code to actually generate some .Rdata files. The tool gdx2r takes a GAMS GDX file and dumps the contents into an .Rdata file. E.g. in a GAMS model we can do:

The first command exports some GAMS symbols to a GDX file. The call to gdx2r converts it to an .Rdata file.

> setwd("c:/tmp")
> load("results.rdata", verbose = T)
Loading objects:
  i
  f
  c
  x
> i
[1] "seattle"   "san-diego"
> f
[1] 90
> c
          i        j value
1   seattle new-york 0.225
2   seattle  chicago 0.153
3   seattle   topeka 0.162
4 san-diego new-york 0.225
5 san-diego  chicago 0.162
6 san-diego   topeka 0.126
> x
          i        j level marginal lower upper
1   seattle new-york    50    0.000     0   Inf
2   seattle  chicago   300    0.000     0   Inf
3   seattle   topeka     0    0.036     0   Inf
4 san-diego new-york   275    0.000     0   Inf
5 san-diego  chicago     0    0.009     0   Inf
6 san-diego   topeka   275    0.000     0   Inf
>

Using the results of the small transportation model from the GAMS model library, we show a few symbols here.

i is a one-dimensional set, so it is exported as a string vector.

f is a scalar. In R this arrives as a real vector of length one.

c is a two-dimensional parameter. This will be represented as a data frame. The i and j columns are represented as string vectors. I am thinking about changing this to a factor (maybe using an option setting). When using a factor we store a vector of unique strings (the levels) and then use an integer vector to index into that string vector. In R factors are often used to handle categorical variables.
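In R terms, a tiny illustration of what a factor stores:

cities <- c("seattle", "san-diego", "seattle")
fac    <- factor(cities)
levels(fac)     # the unique strings: "san-diego" "seattle"
unclass(fac)    # the integer codes into the levels: 2 1 2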

Finally x is a two-dimensional variable. This is also exported as a data frame.

The file results.rdata is a compressed binary file, so there is no good way to look at it directly:

For debugging I added some options to gdx2r. With the –ascii flag we can generate an ASCII version of this file. This looks like:

We can also turn off compression using the flags –unbuffered or –buffered. That generates a binary file we can at least make some sense from:

All these versions can be read by R using the load command.

As we write these files natively (without going through R) we would expect this to be fast. Next post: some timings on large data.