January 20, 2015

In this post I will explore how to parallelize certain types of machine learning / natural language processing code in an environment with multiple cpu cores and/or a gpu. The running example I will use is a transition-based parser, but the same techniques should apply to other similar models used for sequence labeling, chunking, etc. We will see the relative contributions of mini-batching, parallel processing, and using the gpu. The ~24x speed-up that we get means we can parse the ~1M words of Penn Treebank in 9 minutes rather than 3.5 hours. (This post uses Matlab; here is a Julia version.)

Here is the serial version of the main loop. The language is matlab, but I hope it is clear enough as pseudo-code. The specifics of the model, the parser and the features are not all that important. As a baseline, this code takes 10.9 ms/word for parsing, and most of that time is spent in "getfeatures" and "predict".
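As a rough illustration of the shape of such a loop, here is a minimal sketch under stated assumptions. It is not the parser from this post: the "sentences" are random matrices, the "parser state" is just a position plus the previous move, and the model is a hypothetical one-hidden-layer net. Only the structure matters: an outer loop over sentences, and an inner loop that alternates feature extraction and prediction.

```matlab
% Toy stand-in for a serial main loop (all names and sizes are hypothetical).
rng(0);                                    % reproducible random data
d = 50; h = 100; k = 3;                    % feature dim, hidden units, number of moves
corpus = arrayfun(@(n) randn(d, n), randi([5 30], 1, 200), 'UniformOutput', false);
W1 = randn(h, d + k);                      % hidden-layer weights
W2 = randn(k, h);                          % output-layer weights
moves = cell(size(corpus));                % predicted move sequence per sentence

for si = 1:numel(corpus)                   % outer loop: one sentence at a time
    s = corpus{si};
    n = size(s, 2);
    prev = zeros(k, 1);                    % one-hot of the previous move (the "history")
    m = zeros(1, n);
    for i = 1:n                            % one move per word in this toy setup
        f = [s(:, i); prev];               % "getfeatures": state -> column vector
        scores = W2 * max(W1 * f, 0);      % "predict": a few matrix operations
        [~, best] = max(scores);           % greedy choice of the next move
        m(i) = best;
        prev = zeros(k, 1);
        prev(best) = 1;                    % the next features depend on this answer
    end
    moves{si} = m;
end
```

The dependence of `f` on `prev` is what makes the model history based: the inner loop cannot be reordered, but the outer loop iterations are independent of one another.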

To speed up "predict", the simplest trick is to perform the matrix operations on the gpu. Many common machine learning models including the neural network, kernel perceptron, svm etc. can be applied using a few matrix operations. In my case declaring the weights of my neural net model as gpuArrays instead of regular arrays improves the speed to 6.24 ms/word without any change in the code.
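The following is a minimal sketch of that idea, assuming a hypothetical one-hidden-layer net with illustrative sizes (not the model used for the timings above). The `predict` function is written once; whether it runs on the cpu or the gpu depends only on how the weights were declared. `gpuArray` and `gather` come from the Parallel Computing Toolbox, so the gpu branch is skipped when no device is available.

```matlab
% Sketch: the same predict code runs on regular arrays or gpuArrays.
rng(0);
d = 1000; h = 2000; k = 3;                  % illustrative sizes
W1 = randn(h, d, 'single');                 % hidden-layer weights
W2 = randn(k, h, 'single');                 % output-layer weights
predict = @(A, B, f) B * max(A * f, 0);     % hypothetical one-hidden-layer net
f = randn(d, 1, 'single');                  % one feature vector

scores_cpu = predict(W1, W2, f);            % regular arrays: computed on the cpu

try                                         % gpuDeviceCount needs the toolbox
    hasGpu = gpuDeviceCount > 0;
catch
    hasGpu = false;
end

if hasGpu
    G1 = gpuArray(W1);                      % copy the weights to gpu memory once
    G2 = gpuArray(W2);
    scores_gpu = gather(predict(G1, G2, f));    % same code; gather copies the result back
    fprintf('max cpu/gpu difference: %g\n', max(abs(scores_cpu - scores_gpu)));
end
```

The two results should agree only up to floating-point rounding, since the gpu and cpu libraries may sum in a different order.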

To speed up "getfeatures" the gpu is useless: feature calculation typically consists of ad-hoc code that tries to summarize the parser state, the sentence and the model in a vector. However, we can parse multiple sentences in parallel using multiple cores. Replacing the "for" in line 2 with "parfor" and using a pool of 12 cores improves the performance to 5.03 ms/word with the gpu and 3.70 ms/word without the gpu (here the single gpu in the machine creates a bottleneck for the parallel processes).
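A minimal sketch of the parfor change, assuming a toy model with random "sentences" and hypothetical weights rather than the actual parser: the only difference from a serial loop is the keyword on the outer loop. The commented-out `parpool(12)` call is the pool-creation function in newer Matlab releases; older releases use `matlabpool` instead. Without a pool, `parfor` still runs, just serially.

```matlab
% Sketch: independent sentences parsed in parallel (toy model, hypothetical names).
rng(0);
d = 50; h = 100; k = 3;                    % feature dim, hidden units, number of moves
corpus = arrayfun(@(n) randn(d, n), randi([5 30], 1, 200), 'UniformOutput', false);
W1 = randn(h, d + k);                      % broadcast to every worker
W2 = randn(k, h);
moves = cell(size(corpus));                % sliced output: one cell per sentence

% parpool(12);                             % optionally request 12 workers explicitly

parfor si = 1:numel(corpus)                % was: for si = 1:numel(corpus)
    s = corpus{si};                        % sliced input: workers get only their sentences
    n = size(s, 2);
    prev = zeros(k, 1);                    % one-hot of the previous move
    m = zeros(1, n);
    for i = 1:n                            % the inner loop stays serial
        f = [s(:, i); prev];               % "getfeatures"
        scores = W2 * max(W1 * f, 0);      % "predict"
        [~, best] = max(scores);
        m(i) = best;
        prev = zeros(k, 1);
        prev(best) = 1;
    end
    moves{si} = m;
end
```

This works because each iteration reads only `corpus{si}` and writes only `moves{si}`; any state shared across sentences would have to be restructured first.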

A common trick for speeding up machine learning models is to use mini-batches instead of computing the answers one at a time. Consider a common operation: multiplying a weight matrix, representing support vectors or neural network weights, with a column vector, representing a single instance. If we want to perform this operation on 100 instances, we can do it one at a time in a for loop, or we can concatenate all instances into a 100-column matrix and perform a single matrix multiplication. Here are some comparisons; each variation measures the time for processing 10K instances:
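A comparison of this kind can be set up along the following lines. This is a minimal sketch with illustrative matrix sizes, covering only the cpu variants; the timings it prints depend entirely on the machine. Declaring `W` and `X` as gpuArrays should give the gpu variants, though gpu timings then need a `wait(gpuDevice)` before each `toc` because gpu calls return before the work has finished.

```matlab
% Sketch: one matrix-vector product per instance vs. one matrix-matrix product per batch.
rng(0);
d = 1000; h = 1000; n = 10000; b = 100;    % illustrative sizes; b is the batch size
W = randn(h, d, 'single');                 % weight matrix
X = randn(d, n, 'single');                 % 10K instances, one per column

tic;                                       % variant 1: one instance at a time
Y1 = zeros(h, n, 'single');
for i = 1:n
    Y1(:, i) = W * X(:, i);
end
t_single = toc;

tic;                                       % variant 2: mini-batches of b columns
Y2 = zeros(h, n, 'single');
for i = 1:b:n
    j = min(i + b - 1, n);                 % last column of this batch
    Y2(:, i:j) = W * X(:, i:j);
end
t_batch = toc;

fprintf('single instances: %.3f s, batches of %d: %.3f s\n', t_single, b, t_batch);
```

Both variants compute the same products, so `Y1` and `Y2` agree up to floating-point rounding; only the number of calls into the matrix library differs.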

This is almost a 100x speed-up going from single instances on the cpu to mini-batches on the gpu! Unfortunately it is not trivial to use mini-batches with history-based models, i.e. models where the features of the next instance depend on your answers to the previous instances. In that case it is impossible to ask for "the next 100 instances" before you start providing answers. However, the sentences are typically independent of one another, and nothing prevents us from asking for "the instances representing the initial states of the next 100 sentences" and concatenating these together in a matrix. Then we can calculate 100 answers in parallel and use them to give us the 100 next states, etc. The sentence lengths are different, and they will reach their final states at different times, but we can handle that with some bookkeeping. The following version of the code groups sentences into minibatches and processes them in parallel:
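The bookkeeping can be sketched as follows, again under stated assumptions: a toy model with random "sentences", a state that is just a position plus the previous move, and hypothetical weights. It is not the parser from this post. Each pass of the while loop collects one feature column per unfinished sentence, scores them all with a single matrix multiplication, and then advances each sentence by one move.

```matlab
% Toy stand-in for a mini-batched loop over independent sentences (hypothetical model).
rng(0);
d = 50; h = 100; k = 3; b = 100;           % feature dim, hidden units, moves, batch size
corpus = arrayfun(@(n) randn(d, n), randi([5 30], 1, 250), 'UniformOutput', false);
W1 = randn(h, d + k);
W2 = randn(k, h);
moves = cell(size(corpus));

for b1 = 1:b:numel(corpus)                 % one mini-batch of sentences at a time
    idx = b1:min(b1 + b - 1, numel(corpus));
    len = cellfun(@(s) size(s, 2), corpus(idx));    % sentence lengths in this batch
    pos = ones(size(idx));                 % current position in each sentence
    prev = zeros(k, numel(idx));           % one-hot previous move per sentence
    out = arrayfun(@(n) zeros(1, n), len, 'UniformOutput', false);
    active = find(pos <= len);             % sentences not yet in a final state
    while ~isempty(active)
        F = zeros(d + k, numel(active));   % one feature column per active sentence
        for a = 1:numel(active)            % "getfeatures" is still per sentence
            j = active(a);
            F(:, a) = [corpus{idx(j)}(:, pos(j)); prev(:, j)];
        end
        scores = W2 * max(W1 * F, 0);      % "predict" for the whole batch at once
        [~, best] = max(scores, [], 1);    % best move for each column
        for a = 1:numel(active)            % apply the moves, update the states
            j = active(a);
            out{j}(pos(j)) = best(a);
            prev(:, j) = 0;
            prev(best(a), j) = 1;
            pos(j) = pos(j) + 1;
        end
        active = find(pos <= len);         % drop sentences that have finished
    end
    moves(idx) = out;
end
```

The batch shrinks as short sentences finish, so the last few multiplications in each batch are less efficient than the first; grouping sentences of similar length would be one way to reduce that effect.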

This code runs at 2.80 ms/word with the cpu and 1.67 ms/word with the gpu. If we replace the for loop with parfor we get 1.08 ms/word with the cpu and 0.46 ms/word with the gpu.

1 comment:

1. Using parfor and gpu together quickly runs out of gpu memory as each process carves out its own gpu memory space.

2. Using parfor in a cluster environment with a networked home directory causes trouble because matlab places information about the "local" workers in ~/.matlab and gets confused if multiple machines see the same ~/.matlab. You get an error message like:

"Starting matlabpool using the 'local' profile ... [Warning: Found 2 pre-existing communicating job(s) created by matlabpool that are running. You can use 'matlabpool close force local' to remove all jobs created by matlabpool.]"

which of course is the wrong advice.

One workaround is to make ~/.matlab a symbolic link to a real local directory, e.g. /tmp/username/matlab. A better solution is not to use matlab.