I am a full-time consultant providing services related to the design, implementation and deployment of mathematical programming, optimization and data-science applications. I also teach courses and workshops. Usually I cannot blog about the projects I am working on, but there are many technical notes I'd like to share, not least so that I have an easy way to search for and find them again myself. You can reach me at erwin@amsterdamoptimization.com.

Saturday, February 28, 2015

In the tool GDX2SQLITE we export a GAMS GDX file to a SQLite database file.

The --small option mimics how GAMS stores data internally: all strings (called UELs, or Unique Elements) are stored in a pool, and all sets, parameters, variables and equations use index numbers into that pool.

To see what such a pool looks like, you can do:

alias (*,pool); display pool;

The same storage mechanism is used in GDX files: the strings are stored separately.

When we convert a GDX file to a SQLite database there is an option to use the same storage structure. With SQL we can hide the details behind a VIEW: a “virtual” table that is computed on the fly from an underlying SQL SELECT statement.
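A minimal sketch of this mechanism in SQLite (via Python's sqlite3 module): the table and column names below are hypothetical and GDX2SQLITE's actual schema may differ, but the idea is the same: data tables store integer indices into a string pool, and a VIEW joins the strings back in.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# The string pool: one row per UEL (names here are made up for illustration).
cur.execute("CREATE TABLE uels (id INTEGER PRIMARY KEY, uel TEXT)")
cur.executemany("INSERT INTO uels VALUES (?,?)",
                [(1, "seattle"), (2, "san-diego"), (3, "new-york")])

# The data table stores integer indices into the pool instead of strings.
cur.execute("CREATE TABLE d_small (i INTEGER, j INTEGER, value REAL)")
cur.executemany("INSERT INTO d_small VALUES (?,?,?)",
                [(1, 3, 2.5), (2, 3, 2.5)])

# The VIEW hides the pool: queries against it see plain strings again.
cur.execute("""
CREATE VIEW d AS
SELECT ui.uel AS i, uj.uel AS j, d_small.value AS value
FROM d_small
JOIN uels ui ON d_small.i = ui.id
JOIN uels uj ON d_small.j = uj.id
""")

rows = cur.execute("SELECT i, j, value FROM d").fetchall()
print(rows)
```

A query such as SELECT * FROM d behaves as if the strings were stored directly, while the file only keeps each string once.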

Here is the mechanism:

In the picture above, only the first 6 rows of each table are shown. In reality the tables have a significant number of rows.

In practical models and data sets the number of UELs (i.e., the size of the string pool) is small, often surprisingly small, compared to the total number of rows in the data tables. The advantage of this storage scheme is that the database files are smaller.

Indeed, here are some statistics:

Artificial data often has a larger number of UELs than real world data, as is illustrated here (data.gdx is a synthetic data set while WaterCropData is from a real model).

Notes:

A single UEL pool is used for multiple data tables.

This is somewhat related to older LP solvers that used a super-sparsity storage scheme: not only were zero elements not stored, but all values were kept in a pool and the coefficient matrix pointed into this pool. I don't think this scheme is used anymore. The most recent reference I could find is http://www.sciencedirect.com/science/article/pii/0360835290900068. The original reference is: Kalan, J., "Aspects of Large-Scale In-Core Linear Programming," ACM Proceedings, 1971.

To be complete: sparsity indicates that we don't store zero elements. GAMS does not store or export zeros, so they usually do not appear in GDX files, and thus they are not stored in the resulting database file either. E.g. in the table Data, only combinations (i,v,value) with value ≠ 0 exist.

Version 0.5 of GDX2SQLITE adds better column names to the CREATE VIEW statements to make the results more compatible with the default storage scheme.

The --small option is clearly beneficial for database size. What are the impacts on the performance of creating the database and of querying it? Creating the database is probably close to linear in the total size, so I expect some savings here. I have no good answer about querying speed. Note: the column uel in the table UELS$ is a primary key.

We should be able to reproduce the “exhaustive” method from the leaps package using an MIQP formulation. Something like:

Given a value for n we want to minimize SSQ. The binary variables δ indicate whether a feature j can appear in the model (we exclude the constant term: that one remains in the model). We need to solve this model for n=1,2,3,… To be complete, the other symbols are β: the quantities we want to estimate, the data X, y, and our big-M constant.
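A sketch of such an MIQP formulation, reconstructed from the description above (the symbols follow the text: β the coefficients, δ the binary selectors, data X, y, and the big-M constant M; β₀ is the constant term that always stays in the model):

```latex
\begin{align*}
\min_{\beta,\delta}\quad & \mathit{SSQ} = \sum_i \Big( y_i - \beta_0 - \sum_j x_{i,j}\,\beta_j \Big)^2 \\
\text{subject to}\quad & -M\,\delta_j \le \beta_j \le M\,\delta_j \\
& \sum_j \delta_j \le n \\
& \delta_j \in \{0,1\}
\end{align*}
```

The big-M constraint forces β_j = 0 whenever δ_j = 0, so the cardinality constraint limits how many features enter the model.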

I would have guessed that a solver like Gurobi would do a better job than leaps. As it turns out, it was not easy to beat leaps.

To generate a random model we use a very simplistic scheme. I am not sure whether this has much in common with real-world data sets, but a similar approach was used in Gatu, Kontoghiorghes, "Branch-and-Bound Algorithms for Computing the Best-Subset Regression Models," Journal of Computational and Graphical Statistics, Volume 15, Number 1, Pages 139–156:

We use N=50 variables in total, of which we want to select n=1,2,…,10 variables.
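A minimal sketch of such a synthetic data generator. The specifics below are my assumptions (standard normal features, a small true subset with coefficients between 1 and 2, plus some noise); the actual scheme used in the post may differ.

```python
import numpy as np

rng = np.random.default_rng(123)

# N candidate features, m observations; only k features truly enter the model.
N, m, k = 50, 200, 5

X = rng.standard_normal((m, N))            # random design matrix
true = rng.choice(N, size=k, replace=False)  # the "true" subset of features
beta = np.zeros(N)
beta[true] = rng.uniform(1.0, 2.0, size=k)   # nonzero coefficients
y = X @ beta + 0.1 * rng.standard_normal(m)  # response with a little noise
```

The best-subset search should then recover (approximately) the k true features as n grows.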

The model is very sensitive to the size of M. I used M=1000; with M=1e5 we get into trouble and see messages like: Warning: max constraint violation (1.2977e-02) exceeds tolerance (possibly due to large matrix coefficient range). This QP formulation is not so easy for Gurobi to solve from a numerical standpoint.

As expected, the RSS goes down as we add variables.

Also as expected: the time to solve each sub-problem goes up as more variables enter the model. "Pick n out of N" constructs often cause performance problems.

I used a small optcr (gap) and 4 threads to make Gurobi fast enough to beat leaps (leaps takes 3 minutes, while a loop of 10 solves with Gurobi takes 2 minutes).

I often like to formulate a separate fit equation. Here this turns out to be very bad w.r.t. performance: the fit equation forms a completely dense block in the coefficient matrix, something Gurobi was not designed for (it likes sparse structures much better).

R/Leaps

The first line gets the data from the database and applies a pivot operation (from a long, skinny data frame with three columns i, v, value to a wide one with 10 columns v1, v2, …, v10). This trick is useful to remember, as it comes up a lot when reading data from a database (such data is typically organized in long, skinny tables, as opposed to a "matrix" organization). The second line calls the regsubsets function from the leaps package.
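The long-to-wide pivot itself is easy to sketch outside R as well; here is a plain-Python illustration with made-up data (the post does this in R on a data frame):

```python
# Long, skinny rows as they typically come out of a database query:
# one (i, v, value) triple per row (values here are made up).
long_rows = [
    ("i1", "v1", 1.0), ("i1", "v2", 2.0),
    ("i2", "v1", 3.0), ("i2", "v2", 4.0),
]

# Pivot long -> wide: one record per i, one field per v.
wide = {}
for i, v, value in long_rows:
    wide.setdefault(i, {})[v] = value

print(wide)
```

Each distinct value of v becomes a column in the wide layout, which is the "matrix" organization the regression routine expects.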

The sums of squares start out the same but then diverge just a little bit.

The selected variables are much the same as with Gurobi (but there are a few differences here and there). The time is actually close to that of Gurobi. This performance is much better than I expected.

It is sometimes a bit confusing to find out which ODBC drivers are installed (especially when we need to know precisely which drivers are 32-bit or 64-bit). This script, when run from a 64-bit GAMS version, will list all 32-bit and 64-bit drivers:

Saturday, February 7, 2015

The Google PageRank algorithm is a well-known algorithm to calculate web page rankings in a search engine. It is surprisingly simple (for smaller data sets). The basic update equation for the page rank estimate v looks like:

We iterate this until the v's converge. Note that in GAMS this is a parallel assignment, so we don't need to worry about v's being overwritten during the assignment (otherwise we would have to use vprev in the right-hand-side expression; see further below). To test this we first load the (sparse) matrix and calculate the degree of each node:

The actual algorithm should perform quite reasonably in GAMS, as we can exploit the sparse execution of the matrix-vector multiplication inside the loop. Note that the transition matrix M is transposed w.r.t. A. The loop terminates as soon as the update no longer changes v.
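As a standalone illustration of the same power iteration (outside GAMS and Matlab), here is a Python sketch. The arcs are the small four-node example graph (a, b, c, d) used later in this post; the damping factor of 0.85 is my assumption.

```python
import numpy as np

# Adjacency matrix of the example graph:
# a->b, a->c, a->d, b->a, b->d, c->c, d->b, d->c (0-indexed nodes a..d).
A = np.zeros((4, 4))
for i, j in [(0, 1), (0, 2), (0, 3), (1, 0), (1, 3), (2, 2), (3, 1), (3, 2)]:
    A[i, j] = 1.0

n = A.shape[0]
d = 0.85                                       # damping factor (assumed)
M = (A / A.sum(axis=1, keepdims=True)).T       # column-stochastic transition matrix

v = np.full(n, 1.0 / n)                        # initial uniform rank estimate
for it in range(1000):                         # bounded loop: always terminates
    vprev = v
    v = (1 - d) / n + d * (M @ vprev)          # the PageRank update
    if np.abs(v - vprev).sum() < 1e-10:        # convergence test on the update
        break

print(v, it)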

We can see Matlab is doing much better on the matrix-vector products especially when using a sparse matrix M. The difference between sparse and dense should increase for larger matrices: 500x500 is actually not that big and works quite well with dense linear algebra. The density of the M matrix is 1% in this example and should be smaller for larger matrices. Matlab is very fast as long as we can use the computational kernels efficiently, which is the case here.

Complete code

Here we show the complete model. As it turns out the loop looks quite similar, but some of the initialization code is really different between GAMS and Matlab. Matlab is of course much closer to real mathematical notation than GAMS. I am not sure this always helps in making code more readable. This small example actually shows this succinctly.

Here we load the data. In GAMS we index over sets consisting of strings. Typically a network graph is implemented as two sets: a one-dimensional set with nodes and a two-dimensional set with arcs. In Matlab we use matrices where we index over the integers 1..n. Here we specify a sparse matrix by providing the row and column numbers; the nonzero values are all one. I prefer the GAMS representation of the graph.

GAMS needs more lines of code because all identifiers need to be declared before they can be used. This is cumbersome for interactive use but provides extra protection for larger projects. Also GAMS is basically a batch environment (there is no console where we type GAMS commands).

GAMS is a very small language. In this model we basically only use the sum operator and the card and abs functions. Matlab has a very rich set of matrix functions (we use here sparse, ones, size, nnz, sum, find, norm). That is of course both a blessing and a curse: we can write things down very compactly but you need to know more stuff before being proficient in Matlab.

The construction of the transition matrix M is cleaner in GAMS than in Matlab. It may not be immediately obvious, but in the Matlab code the matrix D is a diagonal matrix. Of course we want to store this sparse. The use of a diagonal matrix in Matlab is a well-known trick to keep things vectorized (as opposed to using explicit loops). Explicit loops are often bad (i.e. slow) both in GAMS and in Matlab (beginners tend to overuse loops in both languages).

The major iteration loop is programmed in such a way that it always terminates (either by convergence or running out of iterations). This is a useful defensive approach. The test for convergence can be embedded in the loop statement in GAMS (note that this requires vprev to be initialized at this point). In Matlab there is no such trick so we just do the test at the most convenient place.

Finally note the difference in the display of the graph A. Arguably the GAMS output is a little bit better to understand. We can improve the Matlab output by saying display(full(A)). Of course we only should do a display of a small graph.

----  38 SET a  directed arcs

            a           b           c           d
a                     YES         YES         YES
b         YES                                 YES
c                                 YES
d                     YES         YES

>> display(A)

A =

   (2,1)        1
   (1,2)        1
   (4,2)        1
   (1,3)        1
   (3,3)        1
   (4,3)        1
   (1,4)        1
   (2,4)        1

Convergence

The convergence of the term sum(abs(v-vprev)) is interesting. We plot it here against the iteration number:

Of course, here we don’t have a black-box logistic regression tool, so we need to do a few extra steps. First we need to add an explicit constant term to the data, and expand a categorical variable into a number of dummy variables. The second part is to write down the log-likelihood function. The resulting model looks like:

Usually this is written down as an unconstrained optimization problem. In such a formulation we would repeat some linear expressions that I'd rather substitute out, largely for aesthetic reasons. This adds a linear constraint block but makes the problem somewhat less nonlinear, depending on how you count: the nonlinear code is smaller, but the number of nonlinear nonzero elements is larger. For small problems it does not really matter: NLP solvers handle both versions quite well (more details on this below). Yet another way to solve this is to write down the first-order conditions and solve those. I suspect this is how glm inside R works.
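For reference, the maximum-likelihood problem with the linear expression substituted out might be written as follows (my reconstruction; p_i is the predicted probability for observation i, and the last line is the linear constraint block mentioned in the text):

```latex
\begin{align*}
\max_{\beta}\quad & \log L = \sum_i \Big[ y_i \log p_i + (1-y_i)\log(1-p_i) \Big] \\
& p_i = \frac{1}{1+e^{-z_i}} \\
& z_i = \sum_j x_{i,j}\,\beta_j
\end{align*}
```

In the unconstrained version, z_i would simply be repeated inside the objective wherever it appears.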

Usually for NLP models like this I’d like to specify a starting point for the nonlinear variables. Here it seems that the default initial values bx(i)=0 are just fine.

Statistics

Using a little bit of extra code we can calculate some useful statistics. Just as R does. In fact we try to mimic their output. Here is the GAMS code:

We use a small CNS model (square nonlinear system solver) to invert the information matrix. The right-hand side of the equation inverter is actually an identity matrix (ones on the diagonal, zeros everywhere else). We solve a linear system of equations with a nonlinear solver here because GAMS does not have a model type for solving a linear system of equations. It could be done with an LP solver and a dummy objective, but as the matrix is small, we opt for the shortest GAMS code. The final results, when displaying the results parameter, look like:

The estimates are our betas. The standard errors are the square roots of the diagonal elements of the inverse of the information matrix. The z-values are simply the estimates divided by their standard errors. Finally, the p-values are found using the standard normal CDF. This looks just like the R summary output (apart from the stars indicating the significance level):
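These steps can be sketched numerically. The data below is illustrative only (not the post's model); the information matrix for logistic regression is X'WX with W = diag(p(1-p)), where p holds the fitted probabilities.

```python
import math
import numpy as np

# Illustrative design matrix (with constant column) and "fitted" beta.
rng = np.random.default_rng(42)
m = 200
X = np.column_stack([np.ones(m), rng.standard_normal((m, 2))])
beta = np.array([0.5, 1.0, -0.8])

# Information matrix X'WX with W = diag(p_i(1-p_i)).
p = 1.0 / (1.0 + np.exp(-X @ beta))
w = p * (1.0 - p)
info = X.T @ (w[:, None] * X)

# Standard errors: sqrt of the diagonal of the inverse information matrix.
se = np.sqrt(np.diag(np.linalg.inv(info)))
z = beta / se                                  # z-values
# Two-sided p-values: 2*(1 - Phi(|z|)) = erfc(|z|/sqrt(2)).
pval = np.array([math.erfc(abs(zi) / math.sqrt(2)) for zi in z])
```

This reproduces the estimate / std. error / z value / p value columns of the R summary table.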

It is actually quite instructive to do this exercise to understand a little bit better what R is reporting when doing a logistic regression. High-performance, convenient and well-tested black-box routines are great to use. But once in a while it may make sense to go back to basics and work on the problem directly.

Comparison between unconstrained and linearly constrained formulation

I wondered a bit about the relative performance of these two formulations. Here is a summary: