Have Fun with Machine Learning: A Guide for Beginners

Preface

This is a hands-on guide to machine learning for programmers with no background in
AI. Using a neural network doesn’t require a PhD, and you don’t need to be the person who
makes the next breakthrough in AI in order to use what exists today. What we have now
is already breathtaking, and highly usable. I believe that more of us need to play with
this stuff like we would any other open source technology, instead of treating it like a
research topic.

In this guide our goal will be to write a program that uses machine learning to predict, with a
high degree of certainty, whether the images in data/untrained-samples
are of dolphins or seahorses using only the images themselves, and without
having seen them before. Here are two example images we'll use:

To do that we’re going to train and use a Convolutional Neural Network (CNN).
We’re going to approach this from the point of view of a practitioner vs.
from first principles. There is so much excitement about AI right now,
but much of what’s being written feels like being taught to do
tricks on your bike by a physics professor at a chalkboard instead
of your friends in the park.

I’ve decided to write this on GitHub vs. as a blog post
because I’m sure that some of what I’ve written below is misleading, naive, or
just plain wrong. I’m still learning myself, and I’ve found the lack of solid
beginner documentation an obstacle. If you see me making a mistake or missing
important details, please send a pull request.

With all of that out of the way, let me show you how to do some tricks on your bike!

Overview

Here’s what we’re going to explore:

Set up and use existing, open source machine learning technologies, specifically Caffe and DIGITS

This guide won’t teach you how neural networks are designed, cover much theory,
or use a single mathematical expression. I don’t pretend to understand most of
what I’m going to show you. Instead, we’re going to use existing things in
interesting ways to solve a hard problem.

Q: "I know you said we won’t talk about the theory of neural networks, but I’m
feeling like I’d at least like an overview before we get going. Where should I start?"

There are literally hundreds of introductions to this, from short posts to full
online courses. Depending on how you like to learn, here are a couple of options
for a good starting point:

This fantastic blog post by J Alammar,
which introduces the concepts of neural networks using intuitive examples.

Similarly, this video introduction by Brandon Rohrer is a really good intro to
Convolutional Neural Networks like the one we'll be using.

Setup

Installing the software we'll use (Caffe and DIGITS) can be frustrating, depending on your platform
and OS version. By far the easiest way to do it is using Docker. Below we examine how to do it with Docker,
as well as how to do it natively.

The number one reason I’m using Caffe is that you don’t need to write any code to work
with it. You can do everything declaratively (Caffe uses structured text files to define the
network architecture) and using command-line tools. Also, you can use some nice front-ends for Caffe to make
training and validating your network a lot easier. We’ll be using
nVidia’s DIGITS tool below for just this purpose.

Caffe can be a bit of work to get installed. There are installation instructions
for various platforms, including some prebuilt Docker or AWS configurations.

On a Mac it can be frustrating to get working, with version issues halting
your progress at various steps in the build. It took me a couple of days
of trial and error. There are a dozen guides I followed, each with slightly
different problems. In the end I found this one to be the closest.
I’d also recommend this post,
which is quite recent and links to many of the same discussions I saw.

Getting Caffe installed is by far the hardest thing we'll do, which is pretty
neat, since you’d assume the AI aspects would be harder! Don’t give up if you have
issues; it’s worth the pain. If I were doing this again, I’d probably use an Ubuntu VM
instead of trying to do it on Mac directly. There's also a Caffe Users group, if you need answers.

Q: “Do I need powerful hardware to train a neural network? What if I don’t have
access to fancy GPUs?”

It’s true, deep neural networks require a lot of computing power and energy to
train...if you’re training them from scratch and using massive datasets.
We aren’t going to do that. The secret is to use a pretrained network that someone
else has already invested hundreds of hours of compute time training, and then to fine
tune it to your particular dataset. We’ll look at how to do this below, but suffice
it to say that everything I’m going to show you, I’m doing on a year old MacBook
Pro without a fancy GPU.

As an aside, because I have an integrated Intel graphics card vs. an nVidia GPU,
I decided to use the OpenCL Caffe branch,
and it’s worked great on my laptop.

When you’re done installing Caffe, you should have, or be able to do, all of the following:

A directory that contains your built caffe. If you did this in the standard way,
there will be a build/ dir which contains everything you need to run caffe,
the Python bindings, etc. The parent dir that contains build/ will be your
CAFFE_ROOT (we’ll need this later).

Running make test && make runtest should pass

After installing all the Python deps (doing pip install -r requirements.txt in python/),
running make pycaffe && make pytest should pass

You should also run make distribute in order to create a distributable version of caffe with all necessary headers, binaries, etc. in distribute/.

At this point, we have everything we need to train, test, and program with neural
networks. In the next section we’ll add a user-friendly, web-based front end to
Caffe called DIGITS, which will make training and testing our networks much easier.

Option 1b: Installing DIGITS Natively

nVidia’s Deep Learning GPU Training System, or DIGITS,
is a BSD-licensed Python web app for training neural networks. While it’s
possible to do everything DIGITS does in Caffe at the command-line, or with code,
using DIGITS makes it a lot easier to get started. I also found it more fun, due
to the great visualizations, real-time charts, and other graphical features.
Since you’re experimenting and trying to learn, I highly recommend beginning with DIGITS.

Because it’s just a bunch of Python scripts, it was fairly painless to get working.
The one thing you need to do is tell DIGITS where your CAFFE_ROOT is by setting
an environment variable before starting the server:

export CAFFE_ROOT=/path/to/caffe
./digits-devserver

NOTE: on Mac I had issues with the server scripts assuming my Python binary was
called python2, where I only have python2.7. You can symlink it in /usr/bin
or modify the DIGITS startup script(s) to use the proper binary on your system.

Once the server is started, you can do everything else via your web browser at http://localhost:5000, which is what I'll do below.

Option 2: Caffe and DIGITS using Docker

Install Docker, if not already installed, then run the following command
in order to pull and run a full Caffe + DIGITS container. A few things to note:

Make sure port 8080 isn't allocated by another program. If it is, change it to any other port you want.

Change /path/to/this/repository to the location of this cloned repo, and /data/repo within the container
will be bound to this directory. This is useful for accessing the images discussed below.

Now that we have our container running you can open up your web browser and open http://localhost:8080. Everything in the repository is now in the container directory /data/repo. That's it. You've now got Caffe and DIGITS working.

If you need shell access, use the following command:

docker exec -it digits /bin/bash

Training a Neural Network

Training a neural network involves a few steps:

Assemble and prepare a dataset of categorized images

Define the network’s architecture

Train and Validate this network using the prepared dataset

We’re going to do this 3 different ways, in order to show the difference
between starting from scratch and using a pretrained network, and also to show
how to work with two popular pretrained networks (AlexNet, GoogLeNet) that are
commonly used with Caffe and DIGITS.

For our training attempts, we’ll use a small dataset of Dolphins and Seahorses.
I’ve put the images I used in data/dolphins-and-seahorses.
You need at least 2 categories, but could have many more (some of the networks
we’ll use were trained on 1000+ image categories). Our goal is to be able to
give an image to our network and have it tell us whether it’s a Dolphin or a Seahorse.

Prepare the Dataset

The easiest way to begin is to divide your images into a categorized directory layout:

Here each directory is a category we want to classify, and each image within
that category dir an example we’ll use for training and validation.
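
As a rough illustration of that layout, here is a small Python sketch that walks a dataset folder and counts the images in each category directory. The function name `count_images_by_category` and the list of image extensions are my own assumptions, not anything DIGITS requires.

```python
import os

# Assumed set of image extensions; adjust for your own data.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def count_images_by_category(dataset_dir):
    """Return {category_name: image_count} for each sub-directory."""
    counts = {}
    for entry in sorted(os.listdir(dataset_dir)):
        category_dir = os.path.join(dataset_dir, entry)
        if not os.path.isdir(category_dir):
            continue  # only directories count as categories
        counts[entry] = sum(
            1
            for name in os.listdir(category_dir)
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS
        )
    return counts


if __name__ == "__main__":
    # Path taken from this guide; run from the root of the repository.
    path = "data/dolphins-and-seahorses"
    if os.path.isdir(path):
        for category, count in count_images_by_category(path).items():
            print(f"{category}: {count} images")
```

Running a check like this before creating the dataset is a quick way to spot an empty or misnamed category folder.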

Q: “Do my images have to be the same size? What about the filenames, do they matter?”

No to both. The image sizes will be normalized before we feed them into
the network. We’ll eventually want colour images of 256 x 256 pixels, but
DIGITS will crop or squash (we'll squash) our images automatically in a moment.
The filenames are irrelevant--it’s only important which category they are contained
within.
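
To make the difference between the two resize options concrete, here is a sketch in plain Python that only computes the geometry (no image library needed). I'm assuming that "crop" means cutting out the largest centred square before scaling, and that "squash" means scaling width and height independently. The function names are hypothetical and this is not DIGITS's own code.

```python
TARGET = 256  # final width and height in pixels, as used in this guide


def squash_scale(width, height, target=TARGET):
    """Scale factors (x, y) when the whole image is forced into a square.

    The two factors differ for non-square images, so the picture is
    distorted, but nothing is cut off.
    """
    return target / width, target / height


def center_crop_box(width, height):
    """(left, top, right, bottom) of the largest centred square.

    Cropping keeps the aspect ratio but discards the edges of the image.
    """
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


if __name__ == "__main__":
    w, h = 640, 480
    print("squash scale factors:", squash_scale(w, h))
    print("crop box:", center_crop_box(w, h))
```

Squashing keeps the whole animal in frame at the cost of some distortion, which is one plausible reason to prefer it for a small dataset like ours.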

We want to use these images on disk to create a New Dataset, and specifically,
a Classification Dataset.

We’ll use the defaults DIGITS gives us, and point Training Images at the path
to our data/dolphins-and-seahorses folder.
DIGITS will use the categories (dolphin and seahorse) to create a database
of squashed, 256 x 256 Training (75%) and Validation (25%) images.

Give your Dataset a name, dolphins-and-seahorses, and click Create.

This will create our dataset, which took only 4s on my laptop. In the end I
have 92 Training images (49 dolphin, 43 seahorse) in 2 categories, with 30
Validation images (16 dolphin, 14 seahorse). It’s a really small dataset, but perfect
for our experimentation and learning purposes, because it won’t take forever to train
and validate a network that uses it.
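
The 75/25 split can be sketched in a few lines of Python. This is only an illustration of the idea: I'm assuming the split is made per category and that the order is shuffled first, and `split_category` is a hypothetical helper, not part of DIGITS. The category totals (65 dolphins, 57 seahorses) are simply the training and validation counts above added together.

```python
import random


def split_category(items, train_fraction=0.75, seed=0):
    """Shuffle one category's items and split them into (train, validation)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps runs repeatable
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


if __name__ == "__main__":
    # Stand-in filenames; the real ones live in data/dolphins-and-seahorses.
    dataset = {
        "dolphin": [f"dolphin_{i}.jpg" for i in range(65)],
        "seahorse": [f"seahorse_{i}.jpg" for i in range(57)],
    }
    for category, files in dataset.items():
        train, val = split_category(files)
        print(f"{category}: {len(train)} training, {len(val)} validation")
```

With these totals the sketch gives 49/16 and 43/14, the same counts as above, though DIGITS may well arrive at them in a different way.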

You can Explore the db if you want to see the images after they have been squashed.

Training: Attempt 1, from Scratch

Back in the DIGITS Home screen, we need to create a new Classification Model:

We’ll start by training a model that uses our dolphins-and-seahorses dataset,
and the default settings DIGITS provides. For our first network, we’ll choose to
use one of the standard network architectures, AlexNet. AlexNet’s design
won a major computer vision competition called ImageNet in 2012. The competition
required classifying 1.2 million images into 1000+ categories.

We’ll train our network for 30 epochs, which means that it will learn from our
training images, adjusting the network’s weights as it goes, then test itself using
our validation images, and repeat this process 30 times.
Each time it completes a cycle we’ll get info about Accuracy (0% to 100%,
where higher is better) and what our Loss is (a measure of how wrong the network’s
predictions were, where lower is better). Ideally we want a network that is able to predict with
high accuracy, and with few errors (small loss).
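
To make those two numbers less abstract, here is a sketch of how accuracy and loss could be computed for a two-category problem. I'm assuming the loss is the average cross-entropy (the negative log of the probability given to the correct category), which is a common choice for classifiers. The function names and the sample predictions are made up for illustration.

```python
import math


def accuracy(predictions, labels):
    """Fraction of examples whose most probable category is the correct one."""
    correct = 0
    for probs, label in zip(predictions, labels):
        predicted = max(range(len(probs)), key=lambda i: probs[i])
        if predicted == label:
            correct += 1
    return correct / len(labels)


def mean_cross_entropy(predictions, labels):
    """Average of -log(probability assigned to the correct category)."""
    total = 0.0
    for probs, label in zip(predictions, labels):
        total += -math.log(probs[label])
    return total / len(labels)


if __name__ == "__main__":
    # Each row is [P(dolphin), P(seahorse)]; labels: 0 = dolphin, 1 = seahorse.
    predictions = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.7, 0.3]]
    labels = [0, 0, 1, 1]
    print(f"accuracy: {accuracy(predictions, labels):.1%}")
    print(f"loss: {mean_cross_entropy(predictions, labels):.3f}")
```

Notice that a confident wrong answer (the last row) adds far more to the loss than a hesitant right one, which is why loss can keep improving even after accuracy stops changing.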

NOTE: some people have reported hitting errors in DIGITS
doing this training run. For many, the problem related to available memory (the process
needs a lot of memory to work). If you're using Docker, you might want to try
increasing the amount of memory available to DIGITS (in Docker, preferences -> advanced -> memory).

Initially, our network’s accuracy is a bit below 50%. This makes sense, because at first it’s
just “guessing” between two categories using randomly assigned weights. Over time
it’s able to achieve 87.5% accuracy, with a loss of 0.37. The entire 30 epoch run
took me just under 6 minutes.

We can test our model using an image we upload or a URL to an image on the web.
Let’s test it on a few examples that weren’t in our training/validation dataset:

It almost seems perfect, until we try another:

Here it falls down completely, and mistakes a seahorse for a dolphin, and worse,
does so with a high degree of confidence.

The reality is that our dataset is too small to be useful for training a really good
neural network. We really need 10s or 100s of thousands of images, and with that, a
lot of computing power to process everything.

Training: Attempt 2, Fine Tuning AlexNet

How Fine Tuning works

Designing a neural network from scratch, collecting data sufficient to train
it (e.g., millions of images), and accessing GPUs for weeks to complete the
training is beyond the reach of most of us. To make it practical for smaller amounts
of data to be used, we employ a technique called Transfer Learning, or Fine Tuning.
Fine tuning takes advantage of the layout of deep neural networks, and uses
pretrained networks to do the hard work of initial object detection.

Imagine that using a neural network is like looking at something far away with a
pair of binoculars. You first put the binoculars to your eyes, and everything is
blurry. As you adjust the focus, you start to see colours, lines, shapes, and eventually
you are able to pick out the shape of a bird, then with some more adjustment you can
identify the species of bird.

In a multi-layered network, the initial layers extract features (e.g., edges), with
later layers using these features to detect shapes (e.g., a wheel, an eye), which are
then fed into final classification layers that detect items based on accumulated
characteristics from previous layers (e.g., a cat vs. a dog). A network has to be
able to go from pixels to circles to eyes to two eyes placed in a particular orientation,
and so on up to being able to finally conclude that an image depicts a cat.

What we’d like to do is to specialize an existing, pretrained network for classifying
a new set of image classes instead of the ones on which it was initially trained. Because
the network already knows how to “see” features in images, we’d like to retrain
it to “see” our particular image types. We don’t need to start from scratch with the
majority of the layers--we want to transfer the learning already done in these layers
to our new classification task. Unlike our previous attempt, which used random weights,
we’ll use the existing weights of the final network in our training. However, we’ll
throw away the final classification layer(s) and retrain the network with our image
dataset, fine tuning it to our image classes.

For this to work, we need a pretrained network that is similar enough to our own data
that the learned weights will be useful. Luckily, the networks we’ll use below were
trained on millions of natural images from ImageNet, which
is useful across a broad range of classification tasks.

This technique has been used to do interesting things, from screening for eye diseases
in medical imagery and identifying plankton species in microscopic images collected at
sea, to categorizing the artistic style of Flickr images.

Doing this perfectly, like all of machine learning, requires you to understand the
data and network architecture--you have to be careful with overfitting of the data,
might need to fix some of the layers, might need to insert new layers, etc. However,
my experience is that it “Just Works” much of the time, and it’s worth you simply doing
an experiment to see what you can achieve using our naive approach.

Uploading Pretrained Networks

In our first attempt, we used AlexNet’s architecture, but started with random
weights in the network’s layers. What we’d like to do is download and use a
version of AlexNet that has already been trained on a massive dataset.

With these .caffemodel files in hand, we can upload them into DIGITS. Go to
the Pretrained Models tab on the DIGITS home page and choose Upload Pretrained Model:

For both of these pretrained models, we can use the defaults DIGITS provides
(i.e., colour, squashed images of 256 x 256). We just need to provide the
Weights (**.caffemodel) and Model Definition (original.prototxt).
Click each of those buttons to select a file.

Fine Tuning AlexNet for Dolphins and Seahorses

Training a network using a pretrained Caffe Model is similar to starting from scratch,
though we have to make a few adjustments. First, we’ll adjust the Base Learning Rate
to 0.001 from 0.01, since we don’t need to make such large jumps (i.e., we’re fine tuning).
We’ll also use a Pretrained Network, and Customize it.

In the pretrained model’s definition (i.e., prototxt), we need to rename all
references to the final Fully Connected Layer (where the end result classifications
happen). We do this because we want the model to re-learn new categories from
our dataset vs. its original training data (i.e., we want to throw away the current
final layer). We have to rename the last fully connected layer from “fc8” to
something else, “fc9” for example. Finally, we also need to adjust the number
of categories from 1000 to 2, by changing num_output to 2.
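
The edit itself is a plain text change, so it can be sketched with ordinary string handling. The snippet below is only an illustration: the prototxt fragment is a shortened, hand-written stand-in for the real AlexNet definition, and `retarget_final_layer` is a hypothetical helper. In DIGITS you would make the same changes by hand when customizing the network.

```python
import re

# Hand-written stand-in for the end of an AlexNet-style definition.
SAMPLE_PROTOTXT = '''
layer {
  name: "fc8"
  type: "InnerProduct"
  bottom: "fc7"
  top: "fc8"
  inner_product_param {
    num_output: 1000
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "fc8"
  bottom: "label"
  top: "loss"
}
'''


def retarget_final_layer(prototxt, old_name, new_name, old_outputs, new_outputs):
    """Rename every quoted reference to a layer and change its output count."""
    renamed = prototxt.replace(f'"{old_name}"', f'"{new_name}"')
    # \b stops 1000 from matching inside a longer number such as 10000.
    return re.sub(
        rf"num_output:\s*{old_outputs}\b",
        f"num_output: {new_outputs}",
        renamed,
    )


if __name__ == "__main__":
    print(retarget_final_layer(SAMPLE_PROTOTXT, "fc8", "fc9", 1000, 2))
```

The important detail is that every reference gets renamed, including the `bottom:` lines in later layers. A leftover "fc8" would leave those layers pointing at an output that no longer exists.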

This time our accuracy starts at ~60% and climbs right away to 87.5%, then to 96%
and all the way up to 100%, with the Loss steadily decreasing. After 5 minutes we
end up with an accuracy of 100% and a loss of 0.0009.

Even with images that you think might be hard, like this one that has multiple dolphins
close together, and with their bodies mostly underwater, it does the right thing:

Training: Attempt 3, Fine Tuning GoogLeNet

Just as we fine tuned AlexNet, we can fine tune GoogLeNet as well.
Modifying the network is a bit trickier, since you have to redefine three fully
connected layers instead of just one.

To fine tune GoogLeNet for our use case, we need to once again create a
new Classification Model:

We rename all references to the three fully connected classification layers,
loss1/classifier, loss2/classifier, and loss3/classifier, and redefine
the number of categories (num_output: 2). Here are the changes we need to make
in order to rename the 3 classifier layers, as well as to change from 1000 to 2 categories:
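
In outline, the edit looks like the sketch below, applied to a tiny hand-written fragment rather than the real GoogLeNet definition. The new layer names (with a `_retrain` suffix) are my own choice, and `rename_classifiers` is a hypothetical helper; any names not already used in the network would do.

```python
import re

CLASSIFIER_LAYERS = ["loss1/classifier", "loss2/classifier", "loss3/classifier"]
SUFFIX = "_retrain"  # assumed suffix; any unused layer name works


def rename_classifiers(prototxt, layer_names=CLASSIFIER_LAYERS, num_categories=2):
    """Rename each classifier layer everywhere and set its output count."""
    for name in layer_names:
        prototxt = prototxt.replace(f'"{name}"', f'"{name}{SUFFIX}"')
    # Only the classifier layers in this fragment use 1000 outputs.
    return re.sub(r"num_output:\s*1000\b", f"num_output: {num_categories}", prototxt)


def sample_layer(name):
    """Build a shortened, hand-written classifier layer for the demo."""
    return (
        "layer {\n"
        f'  name: "{name}"\n'
        '  type: "InnerProduct"\n'
        f'  top: "{name}"\n'
        "  inner_product_param {\n"
        "    num_output: 1000\n"
        "  }\n"
        "}\n"
    )


if __name__ == "__main__":
    original = "".join(sample_layer(n) for n in CLASSIFIER_LAYERS)
    print(rename_classifiers(original))
```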

Q: "What about changes to the prototxt definitions of these networks?
We changed the fully connected layer name(s), and the number of categories.
What else could, or should be changed, and in what circumstances?"

Great question, and it's something I'm wondering, too. For example, I know that we can
"fix" certain layers
so the weights don't change. Doing other things involves understanding how the layers work,
which is beyond this guide, and also beyond its author at present!

Like we did with fine tuning AlexNet, we also reduce the learning rate by
a factor of 10, from 0.01 to 0.001.

As for the other training settings, I only have a vague understanding of them,
and it’s likely that there are improvements we can make if you know how to alter these
values when training. This is something that needs better documentation.

Because GoogLeNet has a more complicated architecture than AlexNet, fine tuning it requires
more time. On my laptop, it takes 10 minutes to retrain GoogLeNet with our dataset,
achieving 100% accuracy and a loss of 0.0070:

Just as we saw with the fine tuned version of AlexNet, our modified GoogLeNet
performs amazingly well--the best so far:

Using our Model

With our network trained and tested, it’s time to download and use it. Each of the models
we trained in DIGITS has a Download Model button, as well as a way to select different
snapshots within our training run (e.g., Epoch #30):

There’s a nice description in
the Caffe documentation about how to use the model we just built. It says:

A network is defined by its design (.prototxt), and its weights (.caffemodel). As a network is
being trained, the current state of that network's weights are stored in a .caffemodel. With both
of these we can move from the train/test phase into the production phase.

In its current state, the design of the network is not designed for deployment. Before we can
release our network as a product, we often need to alter it in a few ways:

Remove the data layer that was used for training, as for in the case of classification we are no longer providing labels for our data.

Remove any layer that is dependent upon data labels.

Set the network up to accept data.

Have the network output the result.

DIGITS has already done the work for us, separating out the different versions of our prototxt files.
The files we’ll care about when using this network are:

Python example

Let's write a program that uses our fine-tuned GoogLeNet model to classify the untrained images
we have in data/untrained-samples. I've cobbled this together based on
the examples above, as well as the caffe Python module's source,
which you should prefer to anything I'm about to say.

First, we'll need the NumPy module. In a moment we'll be using NumPy
to work with ndarrays, which Caffe uses a lot.
If you haven't used them before, as I had not, you'd do well to begin by reading this
Quickstart tutorial.

Second, we'll need to load the caffe module from our CAFFE_ROOT dir. If it's not already included
in your Python environment, you can force it to load by adding it manually. Along with it we'll
also import caffe's protobuf module:
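
A minimal sketch of that manual step is below. I'm assuming `CAFFE_ROOT` is set as an environment variable (as we did for DIGITS) and that the Python bindings live in its `python/` sub-directory, which is where a standard build puts them. The fallback path is a placeholder. The import is wrapped in `try`/`except` so the snippet still runs on a machine without Caffe.

```python
import os
import sys

# Placeholder fallback; point this at your own Caffe checkout.
caffe_root = os.environ.get("CAFFE_ROOT", "/path/to/caffe")
caffe_python_dir = os.path.join(caffe_root, "python")

# Put Caffe's Python bindings at the front of the module search path.
if caffe_python_dir not in sys.path:
    sys.path.insert(0, caffe_python_dir)

try:
    import caffe
    from caffe.proto import caffe_pb2
except ImportError:
    caffe = None
    caffe_pb2 = None
    print("caffe not found under", caffe_python_dir)
```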

The caffe.Net() constructor
takes a network file, a phase (caffe.TEST or caffe.TRAIN), as well as an optional weights filename. When
we provide a weights file, the Net will automatically load them for us. The Net has a number of
methods and attributes you can use.

We're interested in loading images of various sizes into our network for testing. As a result,
we'll need to transform them into a shape that our network can use (i.e., colour, 256x256).
Caffe provides the Transformer class
for this purpose. We'll use it to create a transformation appropriate for our images/network:

[...truncated caffe network output...]
dolphin1.jpg is a dolphin dolphin=99.968% seahorse=0.032%
dolphin2.jpg is a dolphin dolphin=99.997% seahorse=0.003%
dolphin3.jpg is a dolphin dolphin=99.943% seahorse=0.057%
seahorse1.jpg is a seahorse dolphin=0.365% seahorse=99.635%
seahorse2.jpg is a seahorse dolphin=0.000% seahorse=100.000%
seahorse3.jpg is a seahorse dolphin=0.014% seahorse=99.986%
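
Each line of that output comes from a list of per-category probabilities. As a sketch of that last step (independent of Caffe), the snippet below picks the most probable label and formats a report line in the same style. The probabilities are copied from the first line of output above; `describe_prediction` is a hypothetical helper, not part of any library.

```python
def describe_prediction(filename, probabilities, labels):
    """Format one result line from per-category probabilities (0.0 to 1.0)."""
    # Index of the highest probability decides the predicted label.
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    details = " ".join(
        f"{label}={prob * 100:.3f}%" for label, prob in zip(labels, probabilities)
    )
    return f"{filename} is a {labels[best]} {details}"


if __name__ == "__main__":
    labels = ["dolphin", "seahorse"]
    print(describe_prediction("dolphin1.jpg", [0.99968, 0.00032], labels))
```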

I'm still trying to learn all the best practices for working with models in code. I wish I had more
and better documented code examples, APIs, premade modules, etc. to show you here. To be honest,
most of the code examples I’ve found are terse, and poorly documented--Caffe’s
documentation is spotty, and assumes a lot.

It seems to me like there’s an opportunity for someone to build higher-level tools on top of the
Caffe interfaces for beginners and basic workflows like we've done here. It would be great if
there were more simple modules in high-level languages that I could point you at that “did the
right thing” with our model; someone could/should take this on, and make using Caffe
models as easy as DIGITS makes training them. I’d love to have something I could use in node.js,
for example. Ideally one shouldn’t be required to know so much about the internals of the model or Caffe.
I haven’t used it yet, but DeepDetect looks interesting on this front,
and there are likely many other tools I don’t know about.

Results

At the beginning we said that our goal was to write a program that used a neural network to
correctly classify all of the images in data/untrained-samples.
These are images of dolphins and seahorses that were never used in the training or validation
data:

Model Attempt 3: Fine Tuned GoogLeNet (1st Place)

Conclusion

It’s amazing how well our model works, and what’s possible by fine tuning a pretrained network.
Obviously our dolphin vs. seahorse example is contrived, and the dataset overly limited--we really
do want more and better data if we want our network to be robust. But since our goal was to examine
the tools and workflows of neural networks, it’s turned out to be an ideal case, especially since it
didn’t require expensive equipment or massive amounts of time.

Above all I hope that this experience helps to remove the overwhelming fear of getting started.
Deciding whether or not it’s worth investing time in learning the theories of machine learning and
neural networks is easier when you’ve been able to see it work in a small way. Now that you’ve got
a setup and a working approach, you can try doing other sorts of classifications. You might also look
at the other types of things you can do with Caffe and DIGITS, for example, finding objects within an
image, or doing segmentation.