There exist several drawbacks of representing the word senses with a fixed list
of definitions of a manually constructed lexical database. There is no guarantee
that they reflect the exact meaning of a target word in a given context, since they
usually contain definitions that are too general. More so, lexical databases often
include many rare senses while missing corpus/domain-specific senses. Word Sense
Induction (WSI) focuses on discriminating the usages of a polysemous word without
using a fixed list of definitions or any hand-crafted resources.

In contrast to the most common approach in WSI, which is to apply clustering
or graph partitioning on a representation of first- or second-order co-occurrences of
a word, my method obtains a probability distribution for each context suggested
by a statistical model. This distribution helps to create context embeddings using
the co-occurrence framework that represents the context with low-dimensional,
dense vectors in Euclidean space. Then, these context embeddings are clustered
by k-means clustering algorithm to discriminate usages (senses) of a word. This
method proved its usefulness in Unsupervised Part-of-Speech Induction, and supervised
tasks such as Multilingual Dependency Parsing. I examine this method
on SemEval 2010 and SemEval 2013 Word Sense Induction lexical sample tasks,
and the dataset I created using OntoNotes 5.0. This new lexical sample dataset
has high inter-annotator agreement (IAA) (>90%) and number of instances for
each word type is more than any previous lexical sample tasks (>500 instances).

The contributions in this thesis are as follows: (1) I suggest a method to attack
the Word Sense Induction problem. (2) I provide a comprehensive analysis (a)
in embedding step by comparing other popular word embeddings by transforming
each of them to context embeddings using substitute word distributions for each
context, and (b) in clustering step by comparing different clustering algorithms (kmeans,
Spectral Clustering, DBSCAN) and different clustering approaches (local
approach where instances of each word type clustered separately, and part-of-speech
based approach where instances tagged with same-part-of-speech clusters independently).

June 16, 2015

May 27, 2015

Earlier this year, editors of the journal Basic and Applied Social Psychology announced that the journal would no longer publish papers containing p-values. The latest American Psychological Association Publication Manual states that researchers should "wherever possible, base discussion and interpretation of results on point and interval estimates," i.e. not p-values. FDA has been encouraging Bayesian analysis. What is wrong with p-values?

What is a p-value? In the classical statistical procedure known as "significance testing", we have a default hypothesis, usually called the null hypothesis and denoted by H0, and we wish to determine whether or not to reject H0 based on some observations X. We choose a statistic S=f(X) (a scalar function of X) that summarizes our data. The p-value is the probability of observing a value at least as extreme as S under H0. We reject H0 if the p-value is below some specified small threshold like α=0.05 and we say something like "H0 is rejected at 0.05 significance level." This threshold or significance level (α) upper bounds the probability of false rejection, i.e. rejecting H0 when it is correct.

Example: We toss a coin 1000 times and observe 532 heads, 468 tails. Is this a fair coin? In this example the null hypothesis H0 is that the coin is fair, observation X is the sequence of heads and tails, and the statistic S is the number of heads. The p-value, or probability of S ∉ [469,531] under H0, can be calculated as:
\[ 1 - \sum_{k=469}^{531} {1000 \choose k} \left(\frac{1}{2}\right)^{1000} = 0.04629 \]
We can reject the null hypothesis at 0.05 significance level and decide the coin is biased. But should we?

Objection 1: (MacKay 2003, pp.63) What we would actually like to know is the probability of H0 given that we observed 532 heads. Unfortunately the p-value 0.04629 is not that probability (although this is a common confusion). We can't calculate a probability for H0 unless we specify some alternatives. Come to think of it, how can we reject a hypothesis if we don't look at what the alternatives are? What if the alternatives are worse? So let's specify a "biased coin" alternative (H1) which assumes that the head probability of the coin (θ) is distributed uniformly between 0 and 1 (other ways of specifying H1 are possible and do not effect the conclusion). We have:
\[ P(S=532 \mid H_0) = {1000 \choose 532} \left(\frac{1}{2}\right)^{1000} = 0.003256 \]
\[ P(S=532 \mid H_1) = \int_0^1 {1000 \choose 532} \theta^{532} (1-\theta)^{468} d\theta = 0.001 \]
So H0 makes our data 3.2 times more likely than H1! And here the p-value almost made us think the data was 1:20 in favor of the "biased" hypothesis.

Objection 2: (Berger 1982, pp.13) Well, now that we understand p-value is not the probability of H0, does it tell us anything useful? According to the definition it limits the false rejection rate, i.e. if we always use significance tests with a p-value threshold α=0.01, we can be assured of incorrectly rejecting only 1% of correct hypotheses in the long run. So does that mean when I reject a null hypothesis I am only mistaken 1% of the time? Of course not! P(reject|correct) is 1%, P(correct|reject) can be anything! Here is an example:

X=1

X=2

H0

.01

.99

H1

.01001

.98999

The table gives the probabilities the two hypotheses H0 and H1 assign to different outcomes X=1 or X=2. Say we observe X=1. We reject H0 at α=0.01 significance level. But there is very little evidence against H0, the likelihood ratio P(X|H1)/P(X|H0) is very close to 1, so the chance of being in error is about 1/2 (assuming H0 and H1 are a-priori equally likely). Thus α=0.01 is providing a very misleading and false sense of security when rejection actually occurs.

Objection 3: (Murphy 2013, pp.213) Consider two experiments. In the first one we toss a coin 1000 times and observe 474 tails. Using T=474 as our statistic the one sided p-value is P(T≤474|H0):
\[ \sum_{k=0}^{474} {1000 \choose k} \left(\frac{1}{2}\right)^{1000} = 0.05337 \]
So at a significance level of α=0.05 we do not reject the null hypothesis of an unbiased coin.

In the second experiment we toss the coin until we observe 474 tails, and it happens to take us 1000 trials. Different intention, same data. This time N=1000 is the natural test statistic and the one sided p-value is P(N≥1000|H0):
\[ \sum_{n=1000}^\infty {n-1 \choose 473} \left(\frac{1}{2}\right)^n = 0.04994 \]
Suddenly we are below the magical α=0.05 threshold and we can reject the null hypothesis. The observed data, thus the likelihoods of any hypotheses for this data have not changed. The p-value is based not just on what actually happened, but what could have happened. This is clearly absurd.

Objection 4: (Cumming 2012) If we base the fate of our hypotheses on p-values computed from experiments, at the very least we should expect the p-values (thus our critical decisions) to change very little when we replicate the experiments. Unfortunately p-values do not even give us stability, as this wonderful video "Dance of the p values" by Geoff Cumming illustrates:

In order to argue for an hypothesis H1 that some effect exists, one does it indirectly: invent a "null hypothesis" H0 that denies any such effect, then argue against H0 in a way that makes no reference to H1 at all (that is, using only probabilities conditional on H0). To see how far this procedure takes us from elementary logic, suppose we decide that the effect exists; that is, we reject H0. Surely, we must also reject probabilities conditional on H0; but then what was the logical justification for the decision? Orthodox logic saws off its own limb.
Harold Jeffreys (1939, p. 316) expressed his astonishment at such limb-sawing reasoning by looking at a different side of it: "An hypothesis that may be true is rejected because it has failed to predict observable results that have not occurred. This seems a remarkable procedure. On the face of it, the evidence might more reasonably be taken as evidence for the hypothesis, not against it."

OK, first a disclaimer: this version of KUnet.jl, my deep learning code for Julia, is a bit more than 500 lines, but it is still under 1000 lines and it supports convolution and pooling, new activation and loss functions, arrays of arbitrary dimensionality with 32 and 64 bit floats etc. See here for installation instructions.
We will use the MNIST dataset to
illustrate basic usage of
KUnet:

julia> include(Pkg.dir("KUnet/test/mnist.jl"))

This may take a bit the first time you run to download the data.

Next we tell Julia we intend to use KUnet, and some variables from
MNIST:

julia> using KUnet
julia> using MNIST: xtrn, ytrn, xtst, ytst

The MNIST variables are Float32 matrices. The x matrices have pixel
values scaled to [0.0:1.0] for a 28x28 image on each column. The y
matrices have 10 rows indicating the 10 classes with a single nonzero
entry for the correct class in each column.

If you are not happy with the default Layer constructors, you can
specify your own parameters. For example, the Mmul(64,784)
constructor fills a (64,784) weight matrix with random weights from a
Gaussian distribution with std=0.01. If we want a different
initialization, we could create a weight matrix any way we want and
pass it to the Mmul constructor instead. Note that the weight matrix
for an Mmul layer with 784 inputs and 64 outputs has size (64, 784).

julia> w1 = randn(64, 784) * 0.05
julia> l1 = Mmul(w1)

Training parameters like the learning rate (lr) can be specified at
layer construction, or using setparam! on the whole network or on
individual layers.

If you take a look at the definition of train in
net.jl,
you will see that it takes a network net, the input x, and the desired
output y, and after splitting the data into minibatches it just calls
backprop and update. Here is a simplified version:

train(net, x, y)=(backprop(net, x, y); update(net))

The backprop function calls forw which computes the network
output, and back which computes the gradients of the network
parameters with respect to the loss function:

backprop(net, x, y)=(forw(net, x); back(net, y))

The forw and back functions for a Net simply call the forw and
back functions of each layer in order, feeding the output of one to
the input of the next:

The forw function for a layer takes the layer input x, and returns
the layer output y. The back function for a layer takes the loss
gradient wrt its output dy and returns the loss gradient wrt its input
dx. If the layer has a parameter p, back also computes the loss
gradient p.diff wrt its current value p.data. You can take a look at
individual layer definitions (e.g. in
mmul.jl,
bias.jl,
relu.jl,
etc.) to see how this is done for each layer.

The final layer of the network
(XentLoss
in our case) is a subtype of LossLayer. LossLayer is a special type
of layer: its forw does nothing but record the network output y. Its
back expects the desired output z (not a gradient) and computes the
loss gradient wrt the network output dy. A LossLayer also implements
the function loss(l::LossLayer,z) which returns the actual loss
value given the desired output z. KUnet currently implements the
following LossLayers:
LogpLoss,
QuadLoss,
SoftLoss,
XentLoss.

The update function for a net calls the update function for each
of its layers, which in turn calls the update function on layer
parameters:

The update function for a parameter p is used to update its values
(p.data) given the loss gradients (p.diff). Its behavior is
controlled by the following parameters: lr, l1reg, l2reg, adagrad,
momentum, nesterov. Here is a simplified definition of update from
param.jl
(p.ada, p.mom, and p.nes are temporary arrays initialized to 0):

Note that for actual research we should not be looking at the test set
accuracy at this point. We should instead split the training set into
a training and a development portion and do all our playing around
with those. We should also run each experiment 10 times with
different random seeds and measure standard errors, etc. But, this is
just a KUnet tutorial.

It seems the training set accuracy is not that great. Maybe
increasing the learning rate may help:

This improvement is unexpected, we were already overfitting with 64
hidden units, and common wisdom is not to increase the capacity of the
network by increasing the hidden units in that situation. Maybe we
should try dropout:

OK, that's enough fiddling around. I hope this gave you enough to get
your hands dirty. We are already among the better results on the
MNIST website. I am sure you can
do better playing around with the learning rate, dropout
probabilities, momentum, adagrad, regularization, and numbers, sizes,
types of layers etc. But be careful, it could become addictive :)

May 07, 2015

I just finished implementing CUDNN.jl, a Julia wrapper for the NVIDIA cuDNN GPU accelerated deep learning library. It consists of a low level interface and a high level interface. The low level interface wraps each function from libcudnn.so in a Julia function in libcudnn.jl and each data type from cudnn.h in a Julia datatype in types.jl. These were generated semi-automatically using Clang and are documented in the cuDNN Library User Guide. For the high level interface defined in CUDNN.jl, I kept the original names from the C library and provided more convenient type signatures, return values, and keyword arguments with reasonable defaults. Next step will be integrating this with Knet.jl.
Full post...

April 24, 2015

While browsing some interesting talks on the web, I realized some of the "big questions" that cause much confusion fall into one of two categories.

Physics does not have X. Is X something real and missing in physics, or is X an illusion, a modeling tool, a result of our limited information, agent based perspective, or stupidity? Examples: consciousness, free will, causality and directionality of time.

Physics does have X which doesn't make sense. Is X something real, or a modeling tool? Examples: randomness, entropy, various sorts of quantum weirdness.

Here are some of the related links:

I was trying to find out what fellow MIT AI Lab alum Gary Drescher is up to after writing Good and Real (which covers most of the big questions mentioned above). I found this video of a talk on causality and choice he gave at the Singularity Summit in 2009. He argues convincingly that determinism and choice are not at odds with each other.

Speaking of causality, Michael Nielsen (author of Quantum Computation and Quantum Information, as well as a free online book on Neural Networks and Deep Learning) has a nice article summarizing the key points of Judea Pearl's Causality book. The epilogue of the book (well worth the read if you don't have the time or patience for the whole book) is a lecture Pearl gave some years ago where he covers the history of the confusion about causality, why physics (if we model the whole universe) does not require it, how economics and other social sciences (where lack of controlled experiments make causality detection difficult) still do not adequately model it, and how it is actually possible in some cases to derive causality from purely observational data without resorting to controlled experiments.

Another regular at the Singularity Summit is Eliezer Yudkowsky. I first discovered his posts on Less Wrong, which has a lot of good material that identifies common sources of confusion when thinking about the big questions.

"Noisy moons: This is logistic regression on noisy moons dataset from sklearn which shows the smoothing effects of momentum based techniques (which also results in over shooting and correction). The error surface is visualized as an average over the whole dataset empirically, but the trajectories show the dynamics of minibatches on noisy data. The bottom chart is an accuracy plot."

"Beale's function: Due to the large initial gradient, velocity based techniques shoot off and bounce around - adagrad almost goes unstable for the same reason. Algos that scale gradients/step sizes like adadelta and RMSProp proceed more like accelerated SGD and handle large gradients with more stability."

"Long valley: Algos without scaling based on gradient information really struggle to break symmetry here - SGD gets no where and Nesterov Accelerated Gradient / Momentum exhibits oscillations until they build up velocity in the optimization direction.
Algos that scale step size based on the gradient quickly break symmetry and begin descent."

"Saddle point: Behavior around a saddle point.
NAG/Momentum again like to explore around, almost taking a different path.
Adadelta/Adagrad/RMSProp proceed like accelerated SGD."

Parsers (as well as many other natural language processing algorithms) work by (1) extracting features for the current state, and (2) using a machine learning model to predict the best action / structure based on those features. The feature extraction code is typically messy and irregular and best performed on (possibly multiple) CPUs. The machine learning models can typically be accelerated significantly using GPUs. In this post I will use a greedy transition based parser with a neural network model and figure out how to use both the GPU and the multiple CPU cores effectively. We will take the parser speed from 55 ms/word (with a single CPU) to 0.055 ms/word (using 20 CPU cores and two K40 GPUs). At this speed we can parse the whole Penn Treebank (approx 1M words) in less than 60 seconds.

Here n is a machine learning model, f is a specification of what features to extract, and p represents the parser state. The parser works by extracting features representing the sentence and the current parser state, using the model to score possible moves, and executing the highest scoring move until no valid moves are left.

To parse a whole corpus (array of sentences), we just map gparse to each sentence. Julia can distinguish which gparse we mean by looking at the types of arguments.

The corpus, dev, is the Penn Treebank WSJ section 22 (1700 sentences, 40117 words); net is a standard feed forward neural network with 1326 input units, 20000 hidden units in a single layer, and 3 output units; feats is a specification of features to be extracted. The parsing speed is 55.944 ms/word. More than 99% of that time is spent on "predict".

In order to speed up "predict", we will use the GPU (NVIDIA Tesla K20m):

This gives us 3.704 ms/word, a 15x speed-up. However the GPU can be better utilized if we somehow manage to batch our feature vectors and compute scores for multiple instances in parallel. The problem is parsing a sentence is a serial process, you need the state resulting from the last move in order to compute the features for the next move. The solution is to parse multiple sentences in parallel (thanks to Alkan Kabakcioglu for suggesting this). Different sentences have no dependencies on each other and we can keep track of their states and predict their moves in bulk. The second version of gparse takes an additional "batchsize" argument specifying how many sentences to parse in parallel. This needs some bookkeeping (requiring 80 lines of code for gparse v2 instead of the 10 line beauty you see above), so I won't cut-and-paste it here, you can see the source code if you wish. Here are some experiments with the batched gparse:

As we increase the number of sentences processed in parallel (doing all 1700 sentences in the corpus in parallel in the last example), we get 0.550 ms/word, a 100x speedup from where we started. At this point the time spent on prediction is about a third of the time spent on feature extraction, so let's take another look at features. We will use Julia's parallel computing primitives to group the sentences to be processed on separate cores. The third version of gparse takes yet another argument specifying the number of cpu cores:

The distribute command distributes the corpus equally among ncpu workers, and localpart gives each worker its own subset. We copy the net back and forth between the CPU and the GPU because I couldn't figure out how to pass GPU pointers between different workers. Finally pmap is the parallel map which calls gparse v2 on each worker for the appropriate subset of the corpus, pmerge merges the results. This time we will run the parser on the training set (Sections 02-21, ~40k sentences, ~950k words)

February 28, 2015

Click here for a newer version (Knet7) of this tutorial. The code used in this version (KUnet) has been deprecated.

There are a number of deeplearningpackagesoutthere. However most sacrifice readability for efficiency. This has two disadvantages: (1) It is difficult for a beginner student to understand what the code is doing, which is a shame because sometimes the code can be a lot simpler than the underlying math. (2) Every other day new ideas come out for optimization, regularization, etc. If the package used already has the trick implemented, great. But if not, it is difficult for a researcher to test the new idea using impenetrable code with a steep learning curve. So I started writing KUnet.jl which currently implements backprop with basic units like relu, standard loss functions like softmax, dropout for generalization, L1-L2 regularization, and optimization using SGD, momentum, ADAGRAD, Nesterov's accelerated gradient etc. in less than 500 lines of Julia code. Its speed is competitive with the fastest GPU packages (here is a benchmark). For installation and usage information, please refer to the GitHub repo. The remainder of this post will present (a slightly cleaned up version of) the code as a beginner's neural network tutorial (modeled after Honnibal's excellent parsing example).

I hope that the code is self-explanatory. A couple of things to note: the layer has fields l.x, l.y to remember its last input and output, these will come in handy when calculating gradients. The ".+" operation is the "broadcasting" version of "+", which means if l.y is a matrix, then the vector l.b will be added to each of its columns. Finally Julia automatically returns the value of the last expression in a function, so there is no need for an explicit return statement. The version in the actual source, net.jl, is a bit uglier because (1) it tries to make each operation (except multiplying with the weight matrix) optional, and (2) it uses macros from InplaceOps.jl to avoid allocating matrices for each operation (which hopefully will become unnecessary once Julia decides on a standard syntax for in-place operations).

Once we have forw for a single layer, calculating a prediction for the whole network is literally a one-liner:

forw(n::Net, x)=(for l=n x=forw(l,x) end; x)

This is the compact function definition syntax of Julia. There is no need to give this function a different name, Julia can distinguish it from forw(l::Layer,x) using the type of the first argument. I find myself thinking of silly names for related functions a lot less in Julia. The loop "for l=n" sets the variable l to each layer of the network n in turn, and we keep feeding the output of one layer as input to the next layer. The output of the final layer is returned. The source code also provides a "predict" function which is just a wrapper around "forw(n::Net,x)", its only difference is to chop up the input into minibatches and feed them to forw one batch at a time, mainly to allow the limited GPU memory to process large datasets.

In order to train the network, we need a loss function. A loss function takes y, the network output, and dy, the desired output, and returns a loss value which we try to minimize during training. In addition, it calculates the gradient of the loss with respect to y, which tells us how small changes in the network output will effect the loss. As an example, here is the implementation of softmaxloss (albeit a bit low level) in func.jl:

By convention KUnet does not use an activation function in the last layer, so y is just the linear output of the last layer (prediction only cares about the max y). Softmax treats these y values as unnormalized log-probabilities, and all but the last few lines of the code is for normalizing them in a numerically stable manner. In fact only this single line calculates the gradient: dy[i,j] = (prob[i] - dy[i,j]) / ycols. We should note a couple of important design decisions here: (1) y and dy should have the same dimensionality, so use one-of-k encoding for desired classification outputs. (2) The loss function overwrites dy with the gradient of the loss with respect to y.

Backpropagation is the algorithm that calculates the gradient of the loss with respect to network weights, given the gradient of the loss with respect to the network output. This can be accomplished taking a backward pass through the layers after the forward pass and the loss calculation:

Remember that the loss function overwrites its second argument dy with the loss gradient. The backward pass through the network simply calls the backward pass through each layer, from the output to the input:

back(n::Net, dy) = (for i=length(n):-1:1 dy=back(n[i],dy) end)

The backward pass through each layer takes the gradient with respect to its output, calculates the gradient with respect to its weights, and returns the gradient with respect to its input (i.e. the output of the previous layer):

Note how the operations in back mirror the ones in forw. The arrays dy, l.dw, l.db and l.dx are the loss gradients with respect to l.y, l.w, l.b, and l.x. We saw l.f and l.fx as the activation and preprocessing functions in forw. Here, their three argument versions calculate gradients instead. That way we don't need to invent new names for the backward versions and run the risk of the user pairing the wrong forward and backward functions. Specifically l.f(l,l.y,dy) takes dy, the gradient wrt the layer output f(wx+b), and returns the gradient wrt the linear output wx+b. From that we compute the gradient with respect to w, b, and x. Finally the three argument version of l.fx calculates the gradient of the original input x (i.e. the output of the previous layer) from the gradient of the preprocessed x so the backpropagation can continue with the previous layer. By the end of this process, each layer has the gradients of its parameters stored in l.dw and l.db. I really think this is one case where the code is easier to understand than the math or the English explanation.

The gradients calculated by backprop, l.dw and l.db, tell us how much small changes in corresponding entries in l.w and l.b will effect the loss (for the last instance, or minibatch). Small steps in the gradient direction will increase the loss, steps in the opposite direction will decrease the loss. This suggests the following update rule:

w = w - dw

This is the basic idea behind Stochastic Gradient Descent (SGD): Go over the training set instance by instance (or minibatch by minibatch). Run the backpropagation algorithm to calculate the loss gradients. Update the weights and biases in the opposite direction of these gradients. Rinse and repeat...
Over the years, people have noted many subtle problems with this approach and suggested improvements:Step size: If the step sizes are too small, the SGD algorithm will take too long to converge. If they are too big it will overshoot the optimum and start to oscillate. So we scale the gradients with an adjustable parameter called the learning rate:

w = w - learningRate * dw

Step direction: More importantly, it turns out the gradient (or its opposite) is often NOT the direction you want to go in order to minimize loss. Let us illustrate with a simple picture:

The two axes are w1 and w2, two parameters of our network, and the contour plot represents the loss with a minimum at x. If we start at x0, the Newton direction (in red) points almost towards the minimum, whereas the gradient (in green), perpendicular to the contours, points to the right.
Unfortunately Newton's direction is expensive to compute. However, it is also probably unnecessary for several reasons: (1) Newton gives us the ideal direction for second degree objective functions, which our neural network loss almost certainly is not, (2) The loss function whose gradient backprop calculated is the loss for the last minibatch/instance only, which at best is a very noisy version of the real loss function, so we shouldn't spend too much effort getting it exactly right.
So people have come up with various approximate methods to improve the step direction. Instead of multiplying each component of the gradient with the same learning rate, these methods scale them separately using their running average (momentum, Nesterov), or RMS (Adagrad, Rmsprop). I realize this necessarily short summary barely covers what has been implemented in KUnet and doesn't do justice to the literature or cover most of the important ideas. The interested reader can start with a standard textbook on numerical optimization, and peruse the latest papers on optimization in deep learning.Minimize what? The final problem with gradient descent, other than not telling us the ideal step size or direction, is that it is not even minimizing the right objective! We want small loss on never before seen test data, not just on the training data. The truth is, a sufficiently large neural network with a good optimization algorithm can get arbitrarily low loss on any finite training data (e.g. by just memorizing the answers). And it can typically do so in many different ways (typically many different local minima for training loss in weight space exist). Some of those ways will generalize well to unseen data, some won't. And unseen data is (by definition) not seen, so how will we ever know which weight settings will do well on it? There are at least three ways people deal with this problem: (1) Bayes tells us that we should use all possible networks and weigh their answers by how well they do on training data (see Radford Neal's fbm), (2) New methods like dropout or adding distortions and noise to inputs and weights during training seem to help generalization, (3) Pressuring the optimization to stay in one corner of the weight space (e.g. L1, L2, maxnorm regularization) helps generalization.
KUnet views dropout (and other distortion methods) as a preprocessing step of each layer. The other techniques (learning rate, momentum, regularization etc.) are declared as UpdateParam's for l.w in l.pw and for l.b in l.pb for each layer:

The reader can peruse update.jl for the definitions of each of the helpers. This is one part of the code I haven't had time to "mathematize". I am hoping the helpers will be unnecessary one day and update will be expressed in easy to understand natural mathematical notation once Julia has in-place array operation syntax that is generic for GPUs.
The "train" function in net.jl is just a wrapper around "backprop" and "update". It goes through the training set once, splitting it into minibatches, and feeding each minibatch through backprop and update.

The activation function follows the linear transformation (wx+b) in a layer and is typically a non-linear element-wise function. Without an activation function, multiple layers in a neural network would be useless because the composition of several linear functions is still a linear function. As an example activation function l.f, here is the definition of the relu (rectified linear unit) from func.jl:

The two argument version handles the forward calculation, i.e. replacing negative y values with 0. The three argument version handles the backward calculation, i.e. replacing dy gradients with 0 where y was 0. Unlike some other high-level languages, for loops are very efficient in Julia. These can probably be made even faster using "@parallel for" and multiple threads, but people who need performance will want to use the GPU versions:

Being able to write GPU kernels directly in Julia is still work in progress. So for now I write a couple of lines of CUDA (the reluforw and reluback functions, which are basically the C versions of relu), compile it to a shared library (libkunet.so), and use Julia's (very convenient) ccall function to directly call them. This does not quite satisfy my "generic readable code" requirement but it is the simplest solution I found at this stage of JuliaGPU development. The nice thing is Julia automatically calls the GPU version if the y argument is a CudaArray and falls back on the pure Julia version otherwise. This allows the user to control whether calculations will be performed on the CPU or the GPU by declaring l.w and l.b as regular arrays or CudaArrays.

A preprocessing function precedes the linear transformation (wx+b) modifying the input of the layer. Preprocessing functions typically add some noise to the input to improve generalization. For example dropout can be implemented as a preprocessing function of a layer where each input is dropped (replaced by 0) with a given probability. Adding Gaussian noise, or elastic transformations of image inputs can also be implemented as preprocessing functions. Here is the (simplified) dropout implementation from func.jl:

The two argument drop is for the forward calculation, the three argument version is for the backward calculation, and the last two are the CPU and GPU versions of the common helper. This implementation uses three fields of the layer type: l.fx is set to "drop", l.xdrop is an array of random numbers between 0 and 1 that has the same dimensionality as l.x, and l.dropout is the probability of dropping an input. Every time we receive a new x, rand! fills xdrop with new random numbers and a different subset of the input is dropped. The rest of the input is rescaled. Preprocessing is applied during training, not prediction, so the actual forw implementation takes an optional flag "apply_fx" that determines whether fx is to be applied.

I wanted to write something that is concise, easy to understand, easy to extend, and reasonably efficient. There is a subtle trade-off between conciseness and extensibility: If we use a very high level language that already has a "neural_network_train" function, we can write very concise code but we lose the ability to change the training algorithm. If we use a very low level language that only provides primitive arithmetic operations, all the algorithm details are exposed and modifiable but the code is bulky and difficult to understand. For a happy medium, the code should reflect the level at which I think of the problem: e.g. I should be able to say A*B if I want to multiply two matrices, and I shouldn't constantly worry about the location and element types of my arrays. That restricts the playing field to a handful of languages. Efficiency requires being able to work with a GPU. Ideally the code should be generic, i.e. algorithms should be expressed once and work whether the data is on the GPU or the CPU memory. Julia has fledgling GPU support but it does have concise matrix arithmetic and it excels at generic operations. So I struggled a bit with trying to get generic matrix operations to work on the GPU (which is nowhere nearly complete but currently good enough to run the KUnet code), and to express each algorithm in as mathematical and concise a manner as possible (which is still work in progress, largely due to the lack of GPU generics and a standard syntax for in-place operations). But the progress so far (and the invaluable support I got from the Julia community) convinced me that this is a worthwhile endeavor.

With very little extra effort (and lines of code) one should be able to extend KUnet to implement convolutional and recurrent nets and new tricks like maxout units, maxnorm regularization, rmsprop optimization etc. You can send me suggestions for improvement (both in coding style and new functionality) using comments to this blog post, or using issues or pull requests on GitHub.

Abstract:
We introduce FDA5 for efficient parameterization,
optimization, and implementation of feature decay algorithms
(FDA), a class of instance selection algorithms that use feature
decay. FDA increase the diversity of the selected training set
by devaluing features (i.e. n-grams) that have already been
included. FDA5 decides which instances to select based on three
functions used for initializing and decaying feature values and
scaling sentence scores controlled with 5 parameters. We present
optimization techniques that allow FDA5 to adapt these functions
to in-domain and out-of-domain translation tasks for different
language pairs. In a transductive learning setting, selection of
training instances relevant to the test set can improve the
final translation quality. In machine translation experiments
performed on the 2 million sentence English-German section of
the Europarl corpus, we show that a subset of the training set
selected by FDA5 can gain up to 3.22 BLEU points compared to
a randomly selected subset of the same size, can gain up to 0.41
BLEU points compared to using all of the available training data
using only 15% of it, and can reach within 0.5 BLEU points to the
full training set result by using only 2.7% of the full training data.
FDA5 peaks at around 8M words or 15% of the full training set.
In an active learning setting, FDA5 minimizes the human effort
by identifying the most informative sentences for translation and
FDA gains up to 0.45 BLEU points using 3/5 of the available
training data compared to using all of it and 1.12 BLEU points
compared to random training set. In translation tasks involving
English and Turkish, a morphologically rich language, FDA5 can
gain up to 11.52 BLEU points compared to a randomly selected
subset of the same size, can achieve the same BLEU score using
as little as 4% of the data compared to random instance selection,
and can exceed the full dataset result by 0.78 BLEU points. FDA5
is able to reduce the time to build a statistical machine translation
system to about half with 1M words using only 3% of the space
for the phrase table and 8% of the overall space when compared
with a baseline system using all of the training data available yet
still obtain only 0.58 BLEU points difference with the baseline
system in out-of-domain translation.

January 20, 2015

In this post I will explore how to parallelize certain types of machine learning / natural language processing code in an environment with multiple cpu cores and/or a gpu. The running example I will use is a transition based parser, but the same techniques should apply to other similar models used for sequence labeling, chunking, etc. We will see the relative contributions of mini-batching, parallel processing, and using the gpu. The ~24x speed-up that we get means we can parse the ~1M words of Penn Treebank in 9 minutes rather than 3.5 hours. (This post uses Matlab, here is a Julia version).

Here is the serial version of the main loop. The language is matlab, but I hope it is clear enough as pseudo-code. The specifics of the model, the parser and the features are not all that important. As a baseline, this code takes 10.9 ms/word for parsing, and most of that time is spent in "getfeatures" and "predict".

To speed up "predict", the simplest trick is to perform the matrix operations on the gpu. Many common machine learning models including the neural network, kernel perceptron, svm etc. can be applied using a few matrix operations. In my case declaring the weights of my neural net model as gpuArrays instead of regular arrays improves the speed to 6.24 ms/word without any change in the code.

To speed up "getfeatures" the gpu is useless: feature calculation typically consists of ad-hoc code that tries to summarize the parser state, the sentence and the model in a vector. However we can parse multiple sentences in parallel using multiple cores. Replacing the "for" in line 2 with "parfor" and using a pool of 12 cores improves the performance to 5.03 ms/word with the gpu and 3.70 ms/word without the gpu (here the single gpu in the machine creates a bottleneck for the parallel processes).

A common trick for speeding up machine learning models is to use mini-batches instead of computing the answers one at a time. Consider a common operation: multiplying a weight matrix, representing support vectors or neural network weights, with a column vector, representing a single instance. If you want to perform this operation on 100 instances, we can do this one at a time in a for loop, or we can concatenate all instances into a 100 column matrix and perform a single matrix multiplication. Here are some comparisons, each variation measures the time for processing 10K instances:

This is almost a 100x speed-up going from single instances on the cpu to mini-batches on the gpu! Unfortunately it is not trivial to use mini-batches with history based models, i.e. models where the features of the next instance depend on your answers to the previous instances. In that case it is impossible to ask for "the next 100 instances" before you start providing answers. However typically the sentences are independent of one another and nothing prevents us from asking for "the instances representing the initial states of the next 100 sentences" and concatenate these together in a matrix. Then we can calculate 100 answers in parallel and use them to give us the 100 next states etc. The sentence lengths are different, and they will reach their final states at different times, but we can handle that with some bookkeeping. The following version of the code groups sentences into minibatches and processes them in parallel:

This code runs at 2.80 ms/word with the cpu and 1.67 ms/word with the gpu. If we replace the for loop with parfor we get 1.08 ms/word with the cpu and 0.46 ms/word with the gpu.