A core point in SAS’ pitch for its new MPI (Message-Passing Interface) in-memory technology seems to be that logistic regression is really important, and that shared-nothing MPP doesn’t let you parallelize it. The Mahout/Hadoop folks also seem to despair of parallelizing logistic regression.

Many “non-parallelizable” algorithms can be parallelized by finding an approximation that, for practical purposes, is just as good. Mathematically the two are not equivalent, and there may be corner cases the approximation handles less well; nevertheless, for many practical examples… nobody cares.

By analogy, I’ve always been a fan of piecewise-linear techniques, vs. more careful curve fits. By many technical measures they are coarse, inelegant, etc. Practically… they rock.
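As a concrete illustration of that trade-off (my example, not from the post): the logistic sigmoid can be approximated by a dozen linear segments, and the worst-case error is tiny compared to typical modeling noise. The knot placement here is an arbitrary choice for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Piecewise-linear approximation: a handful of knots, straight lines between them.
knots = np.linspace(-6, 6, 13)   # 13 knots -> 12 linear segments
knot_vals = sigmoid(knots)

def pw_sigmoid(z):
    # Outside the knot range, np.interp clamps to the end values (~0 and ~1),
    # which is exactly the right behavior for a sigmoid's tails.
    return np.interp(z, knots, knot_vals)

z = np.linspace(-8, 8, 1000)
max_err = np.abs(pw_sigmoid(z) - sigmoid(z)).max()
```

Coarse by any analytic standard, but the worst-case error over the whole range stays around a percent, which is often far below the noise in the data being modeled.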

Jon Bock on
April 7th, 2011 2:51 pm

Curt,

Among the different algorithms and approximations for logistic regression, some definitely can be parallelized in a shared-nothing MPP system, while others are not suited to it. In the Mahout case the algorithm used is stochastic gradient descent, which, as they mention, is inherently ill-suited to parallelization in a shared-nothing MPP system. However, other algorithms, such as batch gradient descent and Newton’s method (also referred to as Newton-Raphson), are parallelizable.
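To see why batch gradient descent parallelizes so naturally, note that the gradient is a sum over records: each node can compute a partial gradient over its own shard, and one global reduction per iteration combines them. Here is a minimal Python sketch of that structure (an illustration only, not any vendor's actual implementation; the data and function names are invented, and a local `sum` stands in for the cross-node reduction, e.g. an MPI allreduce):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def partial_gradient(X, y, w):
    """Gradient contribution of one data shard (what each node computes locally)."""
    return X.T @ (sigmoid(X @ w) - y)

def parallel_batch_gd(shards, n_features, lr=0.5, n_iters=300):
    """Each iteration: nodes compute partial gradients independently over their
    shards, then a single global sum combines them into the full-data gradient."""
    w = np.zeros(n_features)
    n_total = sum(len(y) for _, y in shards)
    for _ in range(n_iters):
        grad = sum(partial_gradient(X, y, w) for X, y in shards)  # the only cross-node step
        w -= lr * grad / n_total
    return w

# Toy data split into two shards, as in a two-node shared-nothing cluster.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = (sigmoid(X @ true_w) > rng.random(200)).astype(float)
shards = [(X[:100], y[:100]), (X[100:], y[100:])]
w = parallel_batch_gd(shards, 3)
```

The key property is that the shard-wise partial gradients sum exactly to the full-data gradient, so the parallel run computes the same iterates as a single-node run; the communication cost is one vector of length `n_features` per iteration.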

As you mention, Aster Data’s logistic regression function is parallelized—we initially released a version last year based on the batch gradient descent method and have been busy expanding the algorithms available since then. Of note is that our logistic regression implementation is designed not only for cases that fit in memory but also for cases larger than available memory.

MPI has the advantage that shared-memory communicators can be used as well as network communicators, so granularity-related efficiency problems can be better addressed.

The more interesting subject that you touched on is what we call “orthogonal parallelism”: instead of just splitting records up over the various CPUs in the shared-nothing cluster, large objects and computations that relate to a single field can be orthogonally parallelised over the cluster as well.

For example: a large database contains many EHRs that include fields holding large images, RNA sequences, or DNA sequences. Although a conventional MPP system would split the records over the cluster, each large DNA object (for example) would still live entirely on one node.
A better solution is to split the large objects themselves up, storing these fields in parallel.
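A rough sketch of the idea (my illustration, with invented helper names; this is not DeepCloud's API): split one large field value into pieces, process the pieces in parallel, and combine the partial results. A thread pool stands in for the cluster nodes here.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk(seq, n_parts):
    """Split one large field value into n_parts pieces, one per 'node'."""
    size = -(-len(seq) // n_parts)  # ceiling division
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def gc_count(piece):
    """Per-node work on one piece of the field (here: count G/C bases)."""
    return sum(piece.count(b) for b in "GC")

def orthogonal_gc_content(seq, n_parts=4):
    """GC content of one large DNA field, computed piece-wise in parallel."""
    pieces = chunk(seq, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(gc_count, pieces))
    return sum(partials) / len(seq)

seq = "ACGTGGCC" * 1000   # stand-in for one large DNA field value
gc = orthogonal_gc_content(seq)  # -> 0.75
```

Any computation over the field that decomposes into per-piece work plus a cheap combine step (counts, k-mer histograms, local alignments) fits this pattern; computations with long-range dependencies across the sequence do not decompose as cleanly.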

Additionally, orthogonally parallel computations such as the parallel hammer algorithm can be used to analyse these large DNA fields in parallel.

Version 1.1 of DeepCloud MPP will permit orthogonally parallel UDFs for this purpose.

A big part of logistic modeling can be done in parallel. In fact, it can be done using separate SAS sessions on a multi-core machine.
I used to run parallel logistic regressions in the SAS System circa 2004, by changing one or two variables and re-running PROC LOGISTIC. Essentially I was running two or three logistic regressions with one or two variables changed in order to finalize the model, estimating VIF, etc.
Fitting the parameters, estimating deviation from actuals, and scoring the model can all be parallelized, in my opinion.
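The pattern described here, fitting several candidate models that differ only in their variable lists, is embarrassingly parallel: each fit is independent, so separate SAS sessions, cluster jobs, or worker threads all work. A hedged Python sketch of the idea (plain gradient descent standing in for PROC LOGISTIC; data, subsets, and names are invented for the example):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def fit_logistic(X, y, lr=0.5, n_iters=500):
    """Plain batch gradient descent fit of one candidate model."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def log_loss(X, y, w):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    eps = 1e-12  # guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(size=300) > 0).astype(float)

# Candidate variable subsets: each fit is independent of the others,
# so they can run concurrently (here a thread pool stands in for
# separate sessions or cluster jobs).
candidates = [(0, 1), (0, 2), (0, 1, 2), (0, 1, 2, 3)]
with ThreadPoolExecutor() as pool:
    fits = list(pool.map(lambda cols: (cols, fit_logistic(X[:, list(cols)], y)), candidates))
losses = {cols: log_loss(X[:, list(cols)], y, w) for cols, w in fits}
best = min(losses, key=losses.get)
```

Because no fit depends on another's output, the speedup is limited only by the number of workers and the cost of the final comparison step.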

Marco Ullasci on
April 8th, 2011 3:14 am

Mathematics on a computer is, for all but the most trivial computations, approximation to begin with.
As long as the results stay inside the safety range required by the specific use, I really appreciate a parallel implementation that lets me scale out.