This is a hands-on tutorial on deep learning. Step by step, we'll go
about building a solution for the Facial Keypoint Detection Kaggle
challenge.
The tutorial introduces Lasagne, a new library for building
neural networks with Python and Theano. We'll use Lasagne to
implement a couple of network architectures, talk about data
augmentation, dropout, the importance of momentum, and pre-training.
Some of these methods will help us improve our results quite a bit.

You don't need to type the code and execute it yourself if you just
want to follow along. But here are the installation instructions for
those who have access to a CUDA-capable GPU and want to run the
experiments themselves.

I assume you have the CUDA toolkit, Python 2.7.x, numpy, pandas,
matplotlib, and scikit-learn installed. Lasagne is still waiting for
its first proper release, so for now we'll install it straight from
Github. To install Lasagne
and all the remaining dependencies, run this command:

(Note that for the sake of brevity, I'm not including commands to create a
virtualenv and activate it.
But you should.)

If everything worked well, you should be able to find the
src/lasagne/examples/ directory in your virtualenv and run the
MNIST example. This
is sort of the "Hello, world" of neural nets. There are ten classes,
one for each digit between 0 and 9, and the input is grayscale images
of handwritten digits of size 28x28.

cd src/lasagne/examples/
python mnist.py

This command will start printing out stuff after thirty seconds or so.
The reason it takes a while is that Lasagne uses Theano to do the
heavy lifting; Theano in turn is an "optimizing GPU-meta-programming
code generating array oriented optimizing math compiler in Python,"
and it will generate C code that needs to be compiled before training
can happen. Luckily, we have to pay the price for this overhead only
on the first run.

The training dataset for the Facial Keypoint Detection challenge
consists of 7,049 96x96 gray-scale images. For each image, we're
supposed to learn to find the correct position (the x and y coordinates)
of 15 keypoints, such as left_eye_center,
right_eye_outer_corner, mouth_center_bottom_lip, and so on.

An example of one of the faces with three keypoints marked.

An interesting twist with the dataset is that for some of the
keypoints we only have about 2,000 labels, while other keypoints have
more than 7,000 labels available for training.

Let's write some Python code that loads the data from the CSV files
provided.
We'll write a function that can load both the training and the test
data. These two datasets differ in that the test data doesn't contain
the target values; it's the goal of the challenge to predict these.
Here's our load() function:

# file kfkd.py
import os

import numpy as np
from pandas.io.parsers import read_csv
from sklearn.utils import shuffle


FTRAIN = '~/data/kaggle-facial-keypoint-detection/training.csv'
FTEST = '~/data/kaggle-facial-keypoint-detection/test.csv'


def load(test=False, cols=None):
    """Loads data from FTEST if *test* is True, otherwise from FTRAIN.
    Pass a list of *cols* if you're only interested in a subset of the
    target columns.
    """
    fname = FTEST if test else FTRAIN
    df = read_csv(os.path.expanduser(fname))  # load pandas dataframe

    # The Image column has pixel values separated by space; convert
    # the values to numpy arrays:
    df['Image'] = df['Image'].apply(lambda im: np.fromstring(im, sep=' '))

    if cols:  # get a subset of columns
        df = df[list(cols) + ['Image']]

    print(df.count())  # prints the number of values for each column
    df = df.dropna()  # drop all rows that have missing values in them

    X = np.vstack(df['Image'].values) / 255.  # scale pixel values to [0, 1]
    X = X.astype(np.float32)

    if not test:  # only FTRAIN has any target columns
        y = df[df.columns[:-1]].values
        y = (y - 48) / 48  # scale target coordinates to [-1, 1]
        X, y = shuffle(X, y, random_state=42)  # shuffle train data
        y = y.astype(np.float32)
    else:
        y = None

    return X, y


X, y = load()
print("X.shape == {}; X.min == {:.3f}; X.max == {:.3f}".format(
    X.shape, X.min(), X.max()))
print("y.shape == {}; y.min == {:.3f}; y.max == {:.3f}".format(
    y.shape, y.min(), y.max()))

It's not necessary that you go through every single detail of this
function. But let's take a look at what the script above outputs:

First it's printing a list of all columns in the CSV file along with
the number of available values for each. So while we have an
Image for all rows in the training data, we only have 2,267 values
for mouth_right_corner_x and so on.

load() returns a tuple (X, y) where y is the target matrix.
y has shape n x m with n being the number of samples in the
dataset that have all m keypoints. Dropping all rows with missing
values is what this line does:

df = df.dropna()  # drop all rows that have missing values in them

The script's output y.shape == (2140, 30) tells us that there are
only 2,140 images in the dataset that have all 30 target values
present. Initially, we'll train with these 2,140 samples only. Which
leaves us with many more input dimensions (9,216) than samples; an
indicator that overfitting might become a problem. Let's see. Of
course it's a bad idea to throw away 70% of the training data just
like that, and we'll talk about this later on.

Another feature of the load() function is that it scales the
intensity values of the image pixels to be in the interval [0, 1],
instead of 0 to 255. The target values (x and y coordinates) are
scaled to [-1, 1]; before they were between 0 and 95.
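
The net1 definition that the next few paragraphs walk through isn't
reproduced above, so here's a sketch of what it looks like using the
NeuralNet wrapper (from nolearn, if I'm reading the parameter
conventions right). Treat the exact import paths and the choice of 100
hidden units as assumptions on my part; the remaining parameters are
the ones discussed below:

from lasagne import layers
from lasagne.updates import nesterov_momentum
from nolearn.lasagne import NeuralNet

net1 = NeuralNet(
    layers=[  # three layers: one hidden layer
        ('input', layers.InputLayer),
        ('hidden', layers.DenseLayer),
        ('output', layers.DenseLayer),
        ],
    # layer parameters:
    input_shape=(None, 9216),  # 96x96 input pixels per batch
    hidden_num_units=100,  # number of units in hidden layer (an assumption)
    output_nonlinearity=None,  # output layer uses identity function
    output_num_units=30,  # 30 target values

    # optimization method:
    update=nesterov_momentum,
    update_learning_rate=0.01,
    update_momentum=0.9,

    regression=True,  # flag to indicate we're dealing with regression problem
    max_epochs=400,  # we want to train this many epochs
    verbose=1,
    )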

Here we define the input layer, the hidden layer and the
output layer. In the layers parameter, we name each layer and specify
its type and order. The parameters input_shape,
hidden_num_units, output_nonlinearity, and
output_num_units are each parameters for specific layers; they
refer to the layer by their prefix, such that input_shape defines
the shape parameter of the input layer, hidden_num_units
defines the hidden layer's num_units and so on. (It may seem a
little odd that we have to specify the parameters like this, but the
upshot is it buys us better compatibility with scikit-learn's pipeline and parameter search
features.)

We set the first dimension of input_shape to None. This
translates to a variable batch size.

We set the output_nonlinearity to None explicitly. Thus, the
output units' activations become just a linear combination of the
activations in the hidden layer.

The default nonlinearity used by DenseLayer is the rectifier,
which is simply max(0, x). It's the most popular choice of
activation function these days. By not explicitly setting
hidden_nonlinearity, we're choosing the rectifier as the
activation function of our hidden layer.

The neural net's weights are initialized from a uniform distribution
with a cleverly chosen interval. That is, Lasagne figures out this
interval for us, using "Glorot-style" initialization.
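
For reference, here's a small sketch of how that interval is typically
computed; this is the standard Glorot-uniform formula rather than code
copied from Lasagne itself:

import numpy as np

def glorot_uniform_interval(fan_in, fan_out):
    # "Glorot-style" uniform initialization draws weights from
    # [-r, r] with r = sqrt(6 / (fan_in + fan_out)).
    return np.sqrt(6.0 / (fan_in + fan_out))

# e.g. for a hidden layer with 9216 inputs and 100 units (assumed),
# r works out to roughly 0.025:
print(glorot_uniform_interval(9216, 100))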

There are a few more parameters. All parameters starting with
update parametrize the update function, or optimization method.
The update function will update the weights of our network after each
batch. We'll use the nesterov_momentum gradient descent
optimization method to do the job. There's a number of other methods
that Lasagne implements, such as adagrad and rmsprop. We
choose nesterov_momentum because it has proven to work very well
for a large number of problems.

The update_learning_rate defines how large we want the steps of
the gradient descent updates to be. We'll talk a bit more about the
learning_rate and momentum parameters later on. For now, it's
enough to just use these "sane defaults."

Comparison of a few optimization methods (animation by Alec
Radford).
The star denotes the global minimum on the error surface. Notice
that stochastic gradient descent (SGD) without momentum is the
slowest method to converge in this example. We're using Nesterov's
Accelerated Gradient Descent (NAG) throughout this tutorial.

In our definition of NeuralNet we didn't specify an objective
function to minimize. There's again a default for that; for
regression problems it's the mean squared error (MSE).

The last set of parameters declare that we're dealing with a
regression problem (as opposed to classification), that 400 is the
number of epochs we're willing to train, and that we want to print out
information during training by setting verbose=1:

    regression=True,  # flag to indicate we're dealing with regression problem
    max_epochs=400,  # we want to train this many epochs
    verbose=1,

Finally, the last two lines in our script load the data, just as
before, and then train the neural net with it:

X, y = load()
net1.fit(X, y)

Running these two lines will output a table that grows one row per
training epoch. In each row, we'll see the current loss (MSE) on the
train set and on the validation set and the ratio between the two.
NeuralNet automatically splits the data provided in X into a
training and a validation set, using 20% of the samples for
validation. (You can adjust this ratio by overriding the
eval_size=0.2 parameter.)

On a reasonably fast GPU, we're able to train for 400 epochs in under
a minute. Notice that the validation loss keeps improving until the
end. (If you let it train longer, it will improve a little more.)

Now how good is a validation loss of 0.0032? How does it compare to
the challenge's benchmark
or the other entries in the leaderboard? Remember that we divided the
target coordinates by 48 when we scaled them to be in the interval
[-1, 1]. Thus, to calculate the root-mean-square error, as that's
what's used in the challenge's leaderboard, based on our MSE loss of
0.003255, we'll take the square root and multiply by 48 again:

>>> import numpy as np
>>> np.sqrt(0.003255) * 48
2.7385251505144153

This is a reasonable proxy for what our score would be on the Kaggle
leaderboard; at the same time it's assuming that the subset of the
data that we chose to train with follows the same distribution as the
test set, which isn't really the case. My guess is that the score is
good enough to earn us a top ten place in the leaderboard at the time
of writing. Certainly not a bad start! (And for those of you that
are crying out right now because of the lack of a proper test set:
don't.)

We can see that our net overfits, but it's not that bad. In
particular, we don't see a point where the validation error gets worse
again, thus it doesn't appear that early stopping, a technique
that's commonly used to avoid overfitting, would be very useful at
this point. Notice that we didn't use any regularization whatsoever,
apart from choosing a small number of neurons in the hidden layer, a
setting that will keep overfitting somewhat in control.

What do the net's predictions look like, then? Let's pick a few
examples from the test set and check:
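
The plotting code isn't shown above; here's a sketch of what it could
look like. The plot_sample() helper defined here is also what the
image-flipping code further down assumes; the de-scaling by 48 simply
reverses the target scaling from load():

from matplotlib import pyplot


def plot_sample(x, y, axis):
    img = x.reshape(96, 96)
    axis.imshow(img, cmap='gray')
    # undo the [-1, 1] scaling of the keypoints before plotting:
    axis.scatter(y[0::2] * 48 + 48, y[1::2] * 48 + 48, marker='x', s=10)


X, _ = load(test=True)
y_pred = net1.predict(X)

fig = pyplot.figure(figsize=(6, 6))
fig.subplots_adjust(
    left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

for i in range(16):
    ax = fig.add_subplot(4, 4, i + 1, xticks=[], yticks=[])
    plot_sample(X[i], y_pred[i], ax)

pyplot.show()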

LeNet5-style convolutional
neural nets are at the heart of deep learning's recent breakthrough in
computer vision. Convolutional layers are different to fully
connected layers; they use a few tricks to reduce the number of
parameters that need to be learned, while retaining high
expressiveness. These are:

local connectivity: neurons are connected only to a subset of
neurons in the previous layer,

weight sharing: weights are shared between a subset of neurons in
the convolutional layer (these neurons form what's called a feature
map),

Units in a convolutional layer actually connect to a 2-d patch of
neurons in the previous layer, a prior that lets them exploit the 2-d
structure in the input.

When using convolutional layers in Lasagne, we have to prepare the
input data such that each sample is no longer a flat vector of 9,216
pixel intensities, but a three-dimensional matrix with shape (c, 0,
1), where c is the number of channels (colors), and 0 and 1
correspond to the x and y dimensions of the input image. In our case,
the concrete shape will be (1, 96, 96), because we're dealing with a
single (gray) color channel only.

A function load2d that wraps the previously written load and
does the necessary transformations is easily coded:
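
A minimal sketch of such a load2d() wrapper, reshaping each flat
9,216-vector into the (1, 96, 96) layout described above:

def load2d(test=False, cols=None):
    X, y = load(test=test, cols=cols)
    X = X.reshape(-1, 1, 96, 96)
    return X, y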

We'll build a convolutional neural net with three convolutional layers
and two fully connected layers. Each conv layer is followed by a 2x2
max-pooling layer. Starting with 32 filters, we double the number of
filters with every conv layer. The densely connected hidden layers
both have 500 units.

There's again no regularization in the form of weight decay or
dropout. It turns out that using very small convolutional filters,
such as our 3x3 and 2x2 filters, is again a pretty good regularizer by
itself.

Let's write down the code:

net2 = NeuralNet(
    layers=[
        ('input', layers.InputLayer),
        ('conv1', layers.Conv2DLayer),
        ('pool1', layers.MaxPool2DLayer),
        ('conv2', layers.Conv2DLayer),
        ('pool2', layers.MaxPool2DLayer),
        ('conv3', layers.Conv2DLayer),
        ('pool3', layers.MaxPool2DLayer),
        ('hidden4', layers.DenseLayer),
        ('hidden5', layers.DenseLayer),
        ('output', layers.DenseLayer),
        ],
    input_shape=(None, 1, 96, 96),
    conv1_num_filters=32, conv1_filter_size=(3, 3), pool1_ds=(2, 2),
    conv2_num_filters=64, conv2_filter_size=(2, 2), pool2_ds=(2, 2),
    conv3_num_filters=128, conv3_filter_size=(2, 2), pool3_ds=(2, 2),
    hidden4_num_units=500, hidden5_num_units=500,
    output_num_units=30, output_nonlinearity=None,

    update_learning_rate=0.01,
    update_momentum=0.9,

    regression=True,
    max_epochs=1000,
    verbose=1,
    )

X, y = load2d()  # load 2-d data
net2.fit(X, y)

# Training for 1000 epochs will take a while.  We'll pickle the
# trained model so that we can load it back later:
import cPickle as pickle
with open('net2.pickle', 'wb') as f:
    pickle.dump(net2, f, -1)

Training this neural net is much more computationally costly than the
first one we trained. It takes around 15x as long to train; those
1000 epochs take more than 20 minutes on even a powerful GPU.

However, our patience is rewarded with what's already a much better
model than the one we had before. Let's take a look at the output
when running the script. First comes the list of layers with their
output shapes. Note that the first conv layer produces 32 output
images of size (94, 94), that's one 94x94 output image per filter:

The predictions of net1 on the left compared to the predictions of
net2.

And then let's compare the learning curves of the first and the second
network:

This looks pretty good, I like the smoothness of the new error curves.
But we do notice that towards the end, the validation error of net2
flattens out much more quickly than the training error. I bet we
could improve that by using more training examples. What if we
flipped the input images horizontally; would we be able to improve
training by doubling the amount of training data this way?

An overfitting net can generally be made to perform better by using
more training data. (And if your unregularized net does not overfit,
you should probably make it larger.)

Data augmentation lets us artificially increase the number of training
examples by applying transformations, adding noise etc. That's
obviously more economical than having to go out and collect more
examples by hand. Augmentation is a very useful tool to have in your
deep learning toolbox.

We mentioned batch iterators already briefly. It is the batch
iterator's job to take a matrix of samples, and split it up into
batches, in our case of size 128. While it does the splitting, the
batch iterator can also apply transformations to the data on the fly.
So to produce those horizontal flips, we don't actually have to double
the amount of training data in the input matrix. Rather, we will just
perform the horizontal flips with 50% chance while we're iterating
over the data. This is convenient, and for some problems it allows us
to produce an infinite number of examples, without blowing up the
memory usage. Also, transformations to the input images can be done
while the GPU is busy processing a previous batch, so they come at
virtually no cost.

Flipping the images horizontally is just a matter of using slicing:

X, y = load2d()
X_flipped = X[:, :, :, ::-1]  # simple slice to flip all images

# plot two images:
fig = pyplot.figure(figsize=(6, 3))
ax = fig.add_subplot(1, 2, 1, xticks=[], yticks=[])
plot_sample(X[1], y[1], ax)
ax = fig.add_subplot(1, 2, 2, xticks=[], yticks=[])
plot_sample(X_flipped[1], y[1], ax)
pyplot.show()

Left shows the original image, right is the flipped image.

In the picture on the right, notice that the target value keypoints
aren't aligned with the image anymore. Since we're flipping the
images, we'll have to make sure we also flip the target values. To do
this, not only do we have to flip the coordinates, we'll also have to
swap target value positions; that's because the flipped
left_eye_center_x no longer points to the left eye in our flipped
image; now it corresponds to right_eye_center_x. Some points like
nose_tip_y are not affected. We'll define a tuple
flip_indices that holds the information about which columns in the
target vector need to swap places when we flip the image
horizontally. Remember the list of columns was:

Since left_eye_center_x will need to swap places with
right_eye_center_x, we write down the tuple (0, 2). Also
left_eye_center_y needs to swap places with
right_eye_center_y. Thus we write down (1, 3), and so on. In
the end, we have:
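
The full list then looks like this (it also appears as a class
attribute in the batch iterator below):

flip_indices = [
    (0, 2), (1, 3),
    (4, 8), (5, 9), (6, 10), (7, 11),
    (12, 16), (13, 17), (14, 18), (15, 19),
    (22, 24), (23, 25),
    ]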

Our batch iterator implementation will derive from the default
BatchIterator class and override the transform() method only.
Let's see what it looks like when we put it all together:

class FlipBatchIterator(BatchIterator):
    flip_indices = [
        (0, 2), (1, 3),
        (4, 8), (5, 9), (6, 10), (7, 11),
        (12, 16), (13, 17), (14, 18), (15, 19),
        (22, 24), (23, 25),
        ]

    def transform(self, Xb, yb):
        Xb, yb = super(FlipBatchIterator, self).transform(Xb, yb)

        # Flip half of the images in this batch at random:
        bs = Xb.shape[0]
        indices = np.random.choice(bs, bs / 2, replace=False)
        Xb[indices] = Xb[indices, :, :, ::-1]

        if yb is not None:
            # Horizontal flip of all x coordinates:
            yb[indices, ::2] = yb[indices, ::2] * -1

            # Swap places, e.g. left_eye_center_x -> right_eye_center_x
            for a, b in self.flip_indices:
                yb[indices, a], yb[indices, b] = (
                    yb[indices, b], yb[indices, a])

        return Xb, yb

To use this batch iterator for training, we'll pass it as the
batch_iterator_train argument to NeuralNet. Let's define
net3, a network that looks exactly the same as net2 except for
these lines at the very end:
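
Those final lines would look roughly like this; the batch size of 128
comes from the batch iterator discussion above, and the tripled epoch
count is discussed just below:

net3 = NeuralNet(
    # ... everything as in net2 up to here ...

    regression=True,
    batch_iterator_train=FlipBatchIterator(batch_size=128),  # !
    max_epochs=3000,  # !
    verbose=1,
    )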

Now we're passing our FlipBatchIterator, but we've also tripled
the number of epochs to train. While each one of our training epochs
will still look at the same number of examples as before (after all,
we haven't changed the size of X), it turns out that training
nevertheless takes quite a bit longer when we use our transforming
FlipBatchIterator. This is because what the network learns
generalizes better this time, and it's arguably harder to learn things
that generalize than to overfit.

So this will maybe take an hour to train. Let's make sure we
pickle the model at the end of training, and then we're ready to go
fetch some tea and biscuits. Or maybe do the laundry:
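
Fitting and pickling follows the same pattern as with net2; a sketch:

X, y = load2d()
net3.fit(X, y)

import cPickle as pickle
with open('net3.pickle', 'wb') as f:
    pickle.dump(net3, f, -1)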

Comparing net3's learning curves with those of net2, we notice that the error on
the validation set after 3000 epochs is indeed about 5% smaller for
the data augmented net. We can see how net2 stops learning anything
useful after 2000 or so epochs, and gets pretty noisy, while net3
continues to improve its validation error throughout, though slowly.

Still seems like a lot of work for only a small gain? We'll find out
if it was worth it in the next section.

What's annoying about our last model is that it took already an hour
to train it, and it's not exactly inspiring to have to wait for your
experiment's results for so long. In this section, we'll talk about a
combination of two tricks to fix that and make the net train much
faster again.

An intuition behind starting with a higher learning rate and
decreasing it during the course of training is this: As we start
training, we're far away from the optimum, and we want to take big
steps towards it and learn quickly. But the closer we get to the
optimum, the lighter we want to step. It's like taking the train
home, but when you enter your door you do it by foot, not by train.

Remember that in our previous model, we initialized learning rate and
momentum with a static 0.01 and 0.9 respectively. Let's change that
such that the learning rate decreases linearly with the number of
epochs, while we let the momentum increase.

NeuralNet allows us to update parameters during training using the
on_epoch_finished hook. We can pass a function to
on_epoch_finished and it'll be called whenever an epoch is
finished. However, before we can assign new values to
update_learning_rate and update_momentum on the fly, we'll
have to change these two parameters to become Theano shared
variables. Thankfully, that's pretty easy:
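
Here's a sketch of what that change looks like; the starting values of
0.03 and 0.9 and the float32 helper are my assumptions, but the idea
is simply to wrap both values in theano.shared so that a callback can
overwrite them later:

import theano


def float32(k):
    return np.cast['float32'](k)

net4 = NeuralNet(
    # ... same as before ...
    update_learning_rate=theano.shared(float32(0.03)),
    update_momentum=theano.shared(float32(0.9)),
    # ... same as before ...
    )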

The callback or list of callbacks that we pass will be called with two
arguments: nn, which is the NeuralNet instance itself, and
train_history, which is the same as nn.train_history_.

Instead of working with callback functions that use hard-coded values,
we'll use a parametrizable class with a __call__ method as our
callback. Let's call this class AdjustVariable. The
implementation is reasonably straight-forward:
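
A sketch of that class; it assumes the train_history entries carry an
'epoch' key and that the variable to adjust is reachable by name on
the net, which is how I remember the NeuralNet wrapper behaving:

class AdjustVariable(object):
    def __init__(self, name, start=0.03, stop=0.001):
        self.name = name
        self.start, self.stop = start, stop
        self.ls = None

    def __call__(self, nn, train_history):
        if self.ls is None:
            # linearly interpolate between start and stop over all epochs:
            self.ls = np.linspace(self.start, self.stop, nn.max_epochs)

        epoch = train_history[-1]['epoch']
        new_value = float32(self.ls[epoch - 1])
        getattr(nn, self.name).set_value(new_value)

With this in place, we can pass, say,
AdjustVariable('update_learning_rate', start=0.03, stop=0.0001) and
AdjustVariable('update_momentum', start=0.9, stop=0.999) in the
on_epoch_finished list (the exact start and stop values being, again,
assumptions).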

Cool, training is happening much faster now! The train error at
epochs 500 and 1000 is half of what it used to be in net2, before
our adjustments to learning rate and momentum. This time,
generalization seems to stop improving after 750 or so epochs already;
looks like there's no point in training much longer.

And again we have much faster training than with net3, and
better results. After 1000 epochs, we're better off than net3 was
after 3000 epochs. What's more, the model trained with data
augmentation is now about 10% better with regard to validation error
than the one without.

Like with any other regularization technique, dropout only makes sense
if we have a network that's overfitting, which is clearly the case for
the net5 network that we trained in the previous section. It's
important to remember to get your net to train nicely and overfit
first, then regularize.

To use dropout with Lasagne, we'll add DropoutLayer layers between
the existing layers and assign dropout probabilities to each one of
them. Here's the complete definition of our new net. I've added a
# ! comment at the end of those lines that are new compared to
net5.
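
The full definition isn't reproduced here, but the gist is a sketch
like the following; the specific dropout probabilities, rising from
0.1 near the input to 0.5 before the output, are a common choice and
my assumption rather than something fixed by the text:

net6 = NeuralNet(
    layers=[
        ('input', layers.InputLayer),
        ('conv1', layers.Conv2DLayer),
        ('pool1', layers.MaxPool2DLayer),
        ('dropout1', layers.DropoutLayer),  # !
        ('conv2', layers.Conv2DLayer),
        ('pool2', layers.MaxPool2DLayer),
        ('dropout2', layers.DropoutLayer),  # !
        ('conv3', layers.Conv2DLayer),
        ('pool3', layers.MaxPool2DLayer),
        ('dropout3', layers.DropoutLayer),  # !
        ('hidden4', layers.DenseLayer),
        ('dropout4', layers.DropoutLayer),  # !
        ('hidden5', layers.DenseLayer),
        ('output', layers.DenseLayer),
        ],
    # ... conv/pool/hidden parameters as in net5 ...
    dropout1_p=0.1,  # !
    dropout2_p=0.2,  # !
    dropout3_p=0.3,  # !
    dropout4_p=0.5,  # !
    # ... remaining parameters as in net5 ...
    )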

Also overfitting doesn't seem to be nearly as bad. Though we'll have
to be careful with those numbers: the ratio between training and
validation has a slightly different meaning now since the train error
is evaluated with dropout, whereas the validation error is evaluated
without dropout. A more comparable value for the train error is
this:
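
That is, we run the trained net over the training data once more
(dropout is not active at prediction time) and compute the MSE
ourselves; a sketch:

from sklearn.metrics import mean_squared_error

X, y = load2d()
print(mean_squared_error(net6.predict(X), y))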

In our previous model without dropout, the error on the train set was
0.000373. So not only does our dropout net perform slightly better,
it overfits much less than what we had before. That's great news,
because it means that we can expect even better performance when we
make the net larger (and more expressive). And that's what we'll try
next: we increase the number of units in the last two hidden layers
from 500 to 1000. Update these lines:
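
That is, in the net definition (I'll call the bigger net net7), the
relevant lines become:

    hidden4_num_units=1000,  # !
    hidden5_num_units=1000,  # !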

Remember those 70% of training data that we threw away in the
beginning? Turns out that's a very bad idea if we want to get a
competitive score in the Kaggle leaderboard. There's quite a bit of
variance in those 70% of data and in the challenge's test set that our
model hasn't seen yet.

So instead of training a single model, let's train a few specialists,
with each one predicting a different set of target values. We'll
train one model that only predicts left_eye_center and
right_eye_center, one only for nose_tip and so on; overall,
we'll have six models. This will allow us to use the full training
dataset, and hopefully get a more competitive score overall.

The six specialists are all going to use exactly the same network
architecture (a simple approach, not necessarily the best). Because
training is bound to take much longer now than before, let's think
about a strategy so that we don't have to wait for max_epochs to
finish, even if the validation error stopped improving much earlier.
This is called early stopping, and we'll write another
on_epoch_finished callback to take care of that. Here's the
implementation:
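
A sketch of the EarlyStopping class; the patience default of 100
epochs is an assumption, and 'valid_loss' and 'epoch' are how I recall
the train_history entries being keyed:

class EarlyStopping(object):
    def __init__(self, patience=100):
        self.patience = patience
        self.best_valid = np.inf
        self.best_valid_epoch = 0
        self.best_weights = None

    def __call__(self, nn, train_history):
        current_valid = train_history[-1]['valid_loss']
        current_epoch = train_history[-1]['epoch']
        if current_valid < self.best_valid:
            # best validation score so far: remember it and the weights
            self.best_valid = current_valid
            self.best_valid_epoch = current_epoch
            self.best_weights = [w.get_value() for w in nn.get_all_params()]
        elif self.best_valid_epoch + self.patience < current_epoch:
            # no improvement for `patience` epochs: restore and stop
            print("Early stopping.")
            print("Best valid loss was {:.6f} at epoch {}.".format(
                self.best_valid, self.best_valid_epoch))
            nn.load_weights_from(self.best_weights)
            raise StopIteration()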

You can see that there are two branches inside the __call__: the
first where the current validation score is better than what we've
previously seen, and the second where the best validation epoch was
more than self.patience epochs in the past. In the first case we
store away the weights:

self.best_weights = [w.get_value() for w in nn.get_all_params()]

In the second case, we set the weights of the network back to those
best_weights before raising StopIteration, signalling to
NeuralNet that we want to stop training.

nn.load_weights_from(self.best_weights)
raise StopIteration()

Let's update the list of on_epoch_finished handlers in our net's
definition and use EarlyStopping:
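
Something along these lines, calling the updated net net8; the
AdjustVariable start/stop values and the patience of 200 epochs are
assumptions:

net8 = NeuralNet(
    # ... as before ...
    on_epoch_finished=[
        AdjustVariable('update_learning_rate', start=0.03, stop=0.0001),
        AdjustVariable('update_momentum', start=0.9, stop=0.999),
        EarlyStopping(patience=200),  # !
        ],
    # ... as before ...
    )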

We already discussed the need for flip_indices in the Data
augmentation section. Remember from section The data that our
load() function (and thus load2d()) takes an optional list of columns to extract.
We'll make use of this feature when we fit the specialist models in a
new function fit_specialists():

from collections import OrderedDict
from sklearn.base import clone


def fit_specialists():
    specialists = OrderedDict()

    for setting in SPECIALIST_SETTINGS:
        cols = setting['columns']
        X, y = load2d(cols=cols)

        model = clone(net)
        model.output_num_units = y.shape[1]
        model.batch_iterator_train.flip_indices = setting['flip_indices']
        # set number of epochs relative to number of training examples:
        model.max_epochs = int(1e7 / y.shape[0])
        if 'kwargs' in setting:
            # an option 'kwargs' in the settings list may be used to
            # set any other parameter of the net:
            vars(model).update(setting['kwargs'])

        print("Training model for columns {} for {} epochs".format(
            cols, model.max_epochs))
        model.fit(X, y)
        specialists[cols] = model

    with open('net-specialists.pickle', 'wb') as f:
        # we persist a dictionary with all models:
        pickle.dump(specialists, f, -1)

There's nothing too spectacular happening here. Instead of training
and persisting a single model, we do it with a list of models that are
saved in a dictionary that maps columns to the trained NeuralNet
instances. Now despite our early stopping, this will still take
forever to train (though by forever I don't mean Google-forever, I mean
maybe half a day on a single GPU); I don't recommend that you actually
run this.

We could of course easily parallelize training these specialist nets
across GPUs, but maybe you don't have the luxury of access to a box
with multiple CUDA GPUs. In the next section we'll talk about another
way to cut down on training time. But let's take a look at the
results of fitting these expensive-to-train specialists first:

Learning curves for six specialist models. The solid lines
represent RMSE on the validation set, the dashed lines errors on
the train set. mean is the mean validation error of all nets
weighted by number of target values. All curves have been scaled
to have the same length on the x axis.

Lastly, this solution gives us a Kaggle leaderboard
score of 2.17 RMSE, which corresponds to the second place at the
time of writing (right behind yours truly).

In the last section of this tutorial, we'll discuss a way to make
training our specialists faster. The idea is this: instead of
initializing the weights of each specialist network at random, we'll
initialize them with the weights that were learned in net6 or
net7. Remember from our EarlyStopping implementation that
copying weights from one network to another is as simple as using the
load_weights_from() method. Let's modify the fit_specialists
function to do just that. I'm again marking the lines that changed
compared to the previous implementation with a # ! comment:

def fit_specialists(fname_pretrain=None):
    if fname_pretrain:  # !
        with open(fname_pretrain, 'rb') as f:  # !
            net_pretrain = pickle.load(f)  # !
    else:  # !
        net_pretrain = None  # !

    specialists = OrderedDict()

    for setting in SPECIALIST_SETTINGS:
        cols = setting['columns']
        X, y = load2d(cols=cols)

        model = clone(net)
        model.output_num_units = y.shape[1]
        model.batch_iterator_train.flip_indices = setting['flip_indices']
        model.max_epochs = int(4e6 / y.shape[0])
        if 'kwargs' in setting:
            # an option 'kwargs' in the settings list may be used to
            # set any other parameter of the net:
            vars(model).update(setting['kwargs'])

        if net_pretrain is not None:  # !
            # if a pretrain model was given, use it to initialize the
            # weights of our new specialist model:
            model.load_weights_from(net_pretrain)  # !

        print("Training model for columns {} for {} epochs".format(
            cols, model.max_epochs))
        model.fit(X, y)
        specialists[cols] = model

    with open('net-specialists.pickle', 'wb') as f:
        # this time we're persisting a dictionary with all models:
        pickle.dump(specialists, f, -1)

It turns out that initializing those nets not at random, but by
re-using weights from one of the networks we learned earlier has in
fact two big advantages: One is that training converges much faster;
maybe four times faster in this case. The second advantage is that it
also helps get better generalization; pre-training acts as a
regularizer. Here's the same learning curves as before, but now for
the pre-trained nets:

Learning curves for six specialist models that were pre-trained.

Finally, the score for this solution on the challenge's leaderboard is
2.13 RMSE. Again the second place, but getting closer!

There are probably a dozen ideas that you want to try out.
You can find the source code for the final solution here
to download and play around with. It also includes the bit that
generates a submission file for the Kaggle challenge. Run python
kfkd.py to find out how to use the script on the command-line.

Here's a couple of the more obvious things that you could try out at
this point: Try optimizing the parameters for the individual
specialist networks; this is something that we haven't done so far.
Observe that the six nets that we trained all have different levels of
overfitting. If they're not or hardly overfitting, like for the green
and the yellow net above, you could try to decrease the amount of
dropout. Likewise, if it's overfitting badly, like the black and
purple nets, you could try increasing the amount of dropout. In the
definition of SPECIALIST_SETTINGS we can already add some
net-specific settings; so say we wanted to add more regularization to
the second net, then we could change the second entry of the list to
look like so:
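
For illustration, assuming the second entry is the nose_tip
specialist, an entry using the 'kwargs' mechanism from
fit_specialists() to add regularization might look like this (the
column names and dropout values are just an example):

    dict(
        columns=(
            'nose_tip_x', 'nose_tip_y',
            ),
        flip_indices=(),
        kwargs=dict(dropout2_p=0.3, dropout3_p=0.4),  # !
        ),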

And there's a ton of other things that you could try to tweak. Maybe
you'll try adding another convolutional or fully connected layer? I'm
curious to hear about improvements that you're able to come up with in
the comments.

Daniel Nouri is the founder of Natural Vision, a company that builds cutting edge
machine learning solutions.

It's been a while since I started my little adventure of building an
image recognition pipeline that would reliably identify animal and
plant species from photos. I'm pretty happy with the results so far,
and these days I'm even successfully scratching my own itch
with it. Which is how I came up with the idea in the first place: I
suck at identifying birds, butterflies, and plants, but I want to
learn about them. And knowing an animal or plant's name is really the
gateway to learning more about it.

So here's a painless way of identifying these species. Say you want
to identify one bird that you see (and you're lucky to have a camera
with a big zoom with you). You take a photo of it and then send it to
@id_birds on Twitter using Twitter's
own image upload. After a minute or two, you'll get the automatic
reply, with hopefully the right species name. @id_birds can identify
yours with remarkably high accuracy among the 256 different bird
species that it knows.

There are two other bots that are also available to use for free:
@id_butterflies knows around
50 butterfly species, and @id_wildflowers around 200 wildflowers. All
three bots know species occurring natively in Europe only, at least
for now.

"But does it actually work?" I hear you say. And the answer is: it
works pretty well, especially if you send good quality photos where
the animal or plant is clearly visible.

Seeing is believing though, so to prove that it works well, let's try
and identify the ten birds that are in The Wildlife Trust's river
bird spotter sheet.

Let's start with the first bird, a grey heron. To the left we have me
addressing the Twitter bot and sending the image. To the right is
@id_birds' reply. And it's correct! Ardea cinerea is indeed the
scientific name for the grey heron:

(If you're not seeing the tweets below, I suggest you view this on my
blog.)

And already we have @id_birds' first mis-classification. It thinks
that this moorhen is a mallard (Anas platyrhynchos). Only the second
guess is correct. Thus so far we're counting one correct, one error.

And the response is correct. Notice how the second guess (Coracias
garrulus) seems a little weird. I guess it's the similarities in
color here. We don't count the second guesses here though, so it's
two correct, one error.

What about the kingfisher? That's a bird with a pretty unique
appearance among birds native to Europe. So it's probably one of the
easier birds to identify:

So all in all we have two mis-classifications out of ten. That's 8
out of 10 accuracy, which is pretty remarkable when you consider that
the differences between these 256 different birds species can
sometimes be minuscule. As an example, compare the grey wagtail
(Motacilla cinerea) from above with the yellow wagtail. According to
Wikipedia, the grey wagtail "looks similar to the yellow wagtail but
has the yellow on its underside restricted to the throat and vent."

Of course ten pictures are a tiny test set, but the score roughly
matches what I see with the much larger test set that I use for
development, so it's cool. :-)

I'll write about how the computer vision works under the hood another
time (but here's a hint: it's based on deep learning).
We call the stack that we've developed and that's running behind
@id_birds and the other bots PhotoID. So far PhotoID has proven to
have very competitive performance and be remarkably flexible to use
for different image classification problems.

Let's conclude this post with two other IDs, one from @id_butterflies,
the other from @id_wildflowers:

Since recent breakthroughs in the field of speech recognition and
computer vision, neural networks have gotten a lot of attention again.
Particularly impressive were Krizhevsky et al.'s seminal results at
the ILSVRC 2012 workshop, which showed that neural nets are able to
outperform conventional image recognition systems by a large margin;
results that shook up the entire field. [1]

Krizhevsky's winning model is a
convolutional neural network (convnet), which is a type of neural
net that exploits spatial correlations in 2-d input. Convnets can
have hundreds of thousands of neurons (activation units) and
millions of connections between them, many more than could be learned
effectively previously. This is possible because convnets share
weights between connections, and thus vastly reduce the number of
parameters that need to be learned; they essentially learn a number of
layers of convolution matrices that they
apply to their input in order to find high-level, discriminative
features.

Figure 1: Example predictions of ILSVRC 2012 winner; eight images
with their true label and the net's top five predictions
below. (source)

Many papers have since followed up on Krizhevsky's work and some were
able to improve upon the original results. But while most attention
went into the problem of using convnets to do image recognition, in
this article I will describe how I was able to successfully apply
convnets to a rather different domain, namely that of underwater
bioacoustics, where sounds of different animal species are detected
and classified.

My work on this topic began with last year's Kaggle Whale Detection
Challenge, which
asked competitors to classify two-second audio recordings, some of
which had a certain call of a specific whale on them, and others
didn't. The whale in question was the North Atlantic Right Whale
(NARW), which is a whale species that's sadly nearly extinct, with
less than 400 individuals estimated to still exist. Believing that
this could be a very interesting and meaningful way to test my freshly
acquired knowledge around convolutional neural networks, I entered the
challenge early, and was able to reach a pretty remarkable Area Under
Curve (AUC) score of roughly 97% only two days into the
competition. [2]

Figure 2: The Kaggle leaderboard after two days into the
competition.

The trick to my early success was that I framed the problem of finding
the whale sound patterns as an image recognition problem, by turning
the two-second sound clips each into spectrograms. Spectrograms are
essentially 2-d arrays with amplitude as a function of time and
frequency. This allowed me to use standard convnet architectures
quite similar to those Krizhevsky had used when working
with the CIFAR-10 image dataset. One of the few differences in
architecture stems from the fact that CIFAR-10 uses RGB images as
input, while my spectrograms have one real number value per pixel,
not unlike gray-scale images.

Figure 3: Spectrogram containing a right whale up-call.

Spurred by my success, I registered for the International Workshop on
Detection, Classification, Localization, and Density Estimation
(DCLDE) of Marine Mammals using Passive Acoustics, in St Andrews. A world-wide community of
scientists meets every two years at this workshop to discuss the
latest developments in using passive acoustics (listening for
sounds) to detect and track marine mammals.

The talk that I gave
at the DCLDE 2013 workshop was well received. In it, I elaborated on how
my method relied on little to no problem-specific human engineering,
and therefore could be easily adapted to detect and classify all sorts
of marine mammal sounds, not just right whale up-calls.

At DCLDE, the execution speed of detection algorithms was frequently
quoted as being x times faster than real-time, with x often being
a fairly low number around 1 to 10. My GPU-powered implementation
turned out to be on the faster side here: on my workstation, it
detects and classifies sounds 700x faster than real-time, which means
it runs detections on one year of audio recordings in roughly twelve
hours, using only a single NVIDIA GTX 580 graphics card.

In terms of accuracy, it was somewhat hard to get an idea of which of
the algorithms presented really worked better than others. This had
two reasons: the inconsistent use of reliable metrics such as AUC and
use of cross-validation, and a lack of standard datasets that everyone
could test their algorithms against. [3]

However, it should be mentioned that good datasets are a bit tricky to
come by in this field. The nature of hydrophone recordings is that
the signal you're listening for could be generated a few meters away,
or many kilometers, and therefore be very faint. Plus, recordings
often contain a lot of ambient noise coming from cargo ships, offshore
drilling, hydrophone cable flutter, and the like. With the effect
that often it's hard even for a human expert to tell if the particular
sound they're listening to is a vocalization of the mammal they're
looking for, or just noise. Thus, analysts will often label segments
as unsure, and two analysts will sometimes even give conflicting
labels to the same sound.

This leads to a situation where people tend to ignore noisy sounds
altogether, since if you consider them, predictions become difficult
to verify manually, and good training examples harder to collect. But
more importantly, when you ignore sounds with a bad signal-to-noise
ratio (SNR), your algorithms will have an easier time learning the
right patterns, too, and they will make fewer mistakes. As it turns
out, noise is often more of a problem for algorithms than it is for
human specialists.

The approach of ignoring sounds with a bad SNR seems fine until you're
in a situation where you've put a lot of effort into collecting
recordings, and then they turn out to be unusually noisy, and trying
to adjust your model's detection threshold yields either way too many
false positive detections, or misses too many calls.

One of the very nice people I met at DCLDE was Holger Klinck from
Oregon State University. He wanted to try out my convnet with one of
his lab's "very messy" recordings. Some material that his group at
OSU had collected at five sites near Iceland and Greenland in 2007 and
2008 had unusually high levels of noise in them, and their detection
algorithms had maybe worked less than optimally there.

Figure 4: "Locations of passive acoustic moorings near Iceland and
southern Greenland (black spots), and the number of right whale
upcalls detected per day in late 2007 at the five sites." Taken
from [4]. Note the very low number of calls detected at the CE
and SE sites.

I was rather amazed when a few weeks later I had a hard disk from OSU
in my hands containing in total many years of hydrophone recordings
from two sites near Iceland and four locations on the Scotian Shelf.
I dusted off the model that I had used for the Kaggle Whale Detection
Challenge and quite confidently started running detections on the
recordings. Which is when I was in for a surprise: the predictions my
shiny 97% model made were all really lousy! Very many obvious
non-whale noises were detected wrongly. How was it possible?

To solve this puzzle, I had to understand that the Kaggle Whale
Detection Challenge's train and test datasets had a strong selection
bias in them. The tens of thousands of examples that I had used to
train my model for the challenge were unrepresentative of whale sounds
in general and, in particular, of the many similar-sounding non-whale
sounds out there. That's because the Kaggle challenge's examples were collected
by use of a two-stage pipeline, where an automated detector would
first pick out likely candidates from the recording, only after which
a human analyst would label them with true or false. I realized that
what we were building in the Kaggle challenge was a classifier that
worked well only if it had a certain detector running in front of it
that would take care of the initial pass of detection. My neural net
had thus never seen during training anything like the sounds that it
mistook for whale calls now.

If I wanted my convnet to be usable by itself, on continuous audio
recordings, and independently of this other detector, I would have to
train it with a more balanced training set. And so I ditched most of
the training examples I had, and started out with only a few hundred,
and trained a new model with them. As was expected, training with
only few examples left me with a pretty weak model that would overfit
and make lots of obvious mistakes. But this allowed me to pick up the
worst mistakes, label them correctly, and feed them back into the
system as training examples. And then repeat that. (A process that
Olivier Grisel later told me amounts to active learning.)

Many (quite enjoyable) hours of listening to underwater sounds later,
I had collected some 2000 training examples this way, some of which
were already pretty tricky to verify. And luckily, the newly trained
model started to make pretty good predictions. When I sent my results
back to Holger, he said that, yes, the patterns I'd found were very
similar to those that his group had found for the Scotian Shelf sites!

Figure 5: Number of right whale up-call detections per hour at two
sites on the Scotian Shelf, detected by the convnet. The numbers
and seasonal pattern match with what Mellinger et al. reported in
[5].

The OSU team had used a three-stage detection process to produce their
numbers. Humans verified in phases two (broadly) and three (in more
detail) the detections that the algorithm came up with in phase one.
Whereas my detection results came straight out of the algorithm.

A case-by-case comparison still needs to happen, but the similarities
of the overall call patterns suggest that the convnet reaches
comparable performance, but without the need for human analysts to be
part of the detection pipeline, making it potentially much more
time-efficient to use in practice.

What's even more exciting is that the neural net was able to find
right whale up-calls at the problematic SE site near Iceland, where
previously no up-calls could be detected due to high noise levels.

Figure 6: NARW up-call detections per day at sites SW and SE
near Iceland, detected by the convnet. The patterns at the SW
site match roughly with what was reported in [4], while no calls
could be identified previously at the SE site (cf. Figure 4).

Another thing we're currently looking into is whether or not the
relatively small but constant number of calls that the convnet
detected during Winter season are real, or if they're false positives.
Right whales are not known to hang around so high up North during that
time of the year, so proving that would constitute significant news
for people studying the migration routes of these whales.