MSc project proposal: Automatic categorisation of plant images

Imperial Computing Science MSc Group Project Proposal, November 2013

Aim

The ultimate aim is to produce a smart phone application which can
automatically identify a plant from a photo. One use case is someone
who is out on a walk and wants to know the identity of a plant
(specifically: my 2-year-old daughter keeps asking me to identify trees
but I'm useless at it and existing tree identification apps are slow
and laborious to use!). Another use case is a farmer who wants to
know the identity of a strange new weed. One of the ultimate aims is
to encourage a greater enthusiasm and curiosity for the natural
world.

Competition

There is an international plant image classification competition
called "PlantCLEF
2014" which you could enter if you feel confident about your image
classification system! They provide a large, labelled dataset of
images.

Learning algorithms

You'd be free to choose any image classification approach you want.
To help give you a feel for what the options are, let me give a very
quick intro to image classification:

There are typically at least two stages to an image classification
system: first "features" (edges, corners, blobs etc) are detected in
the image and then these features are passed to a classification
algorithm.
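To make that pipeline concrete, here's a minimal (and deliberately
crude) Python sketch: a colour histogram stands in for the feature
detector, and scikit-learn's SVM does the classification. The
images and labels variables are hypothetical stand-ins for your
dataset; real systems use far richer features than this!

```python
# A minimal two-stage pipeline: hand-crafted features -> classifier.
# `images` is a list of HxWx3 uint8 arrays and `labels` the
# corresponding species names (both hypothetical).
import numpy as np
from sklearn.svm import SVC

def colour_histogram(image, bins=8):
    """Stage 1: a crude hand-crafted 'feature': a joint RGB histogram."""
    hist, _ = np.histogramdd(image.reshape(-1, 3),
                             bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def train(images, labels):
    """Stage 2: feed the feature vectors to a classification algorithm."""
    X = np.array([colour_histogram(img) for img in images])
    clf = SVC(probability=True)  # probabilities let us report confidences
    clf.fit(X, labels)
    return clf
```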

Feature extraction

There are two main approaches to feature extraction: either
hand-craft your feature detectors or build a system which can
automatically learn which features to extract.

Automatic feature learning

The alternative to hand-building feature detectors is to build a
system which can automatically learn which features to detect.

Before considering how to do this on a computer, let's briefly
consider the best object classification system we know of: the
brain. Does our genome define separate feature detectors for speech,
faces, plants, tools, animals etc? The answer is almost certainly
"no"! What evidence is there? Firstly, the human genome only contains
about 20,000 protein-coding genes. Even if we include the
non-protein-coding regulatory DNA sequences, there is still far too
little information in the genome to specify lots of different
learning algorithms. And there are rather gory neuroscience
experiments where the optic nerve of ferrets has been re-routed
at birth to feed into the auditory cortex. After a short
while the "auditory cortex" of the brain has
magically learnt to do sight
(e.g. see Sharma
et al 2000). One interpretation of this is that the brain
automatically learns to extract features from the input data.

In the context of computer vision, automatic feature learning has
several advantages over hand-built feature detectors. First, the
system should find the most useful features to extract (whilst with
hand-built feature detectors, you are making a subjective judgement
about which features you think are best). Secondly, you
don't need to go through months or years of R&D to build each new
feature detector! Thirdly, automatic feature learning can be
done in a completely unsupervised fashion. For example, you could get
as many unlabelled plant images as possible (e.g. from flickr / google
image search) and feed these to your system and it will automatically
figure out which features to extract. How does it do this? It's
basically trying to find the most compact representation for the
data.
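To make "most compact representation" a little more concrete, here's
a toy sketch of one classic approach: an autoencoder, which learns to
reconstruct its input through a narrow hidden layer, so the hidden
units are forced to become a compact feature representation. This is
just an illustrative numpy sketch, not a serious feature learner; in
practice you'd use the tools described below.

```python
# Toy unsupervised feature learning: a tied-weight autoencoder
# trained to reconstruct its input through a narrow hidden layer.
# X is an (n_examples, n_pixels) array of image patches in [0, 1].
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden=64, learning_rate=0.1, n_epochs=100):
    n_visible = X.shape[1]
    rng = np.random.RandomState(0)
    W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
    b_hidden = np.zeros(n_hidden)
    b_visible = np.zeros(n_visible)
    for epoch in range(n_epochs):
        hidden = sigmoid(np.dot(X, W) + b_hidden)         # encode
        recon = sigmoid(np.dot(hidden, W.T) + b_visible)  # decode
        err = recon - X                                   # reconstruction error
        # Backpropagate through the (tied-weight) decoder and encoder.
        d_vis = err * recon * (1 - recon)
        d_hid = np.dot(d_vis, W) * hidden * (1 - hidden)
        grad_W = np.dot(X.T, d_hid) + np.dot(d_vis.T, hidden)
        W -= learning_rate * grad_W / len(X)
        b_hidden -= learning_rate * d_hid.mean(axis=0)
        b_visible -= learning_rate * d_vis.mean(axis=0)
    return W, b_hidden  # columns of W are the learnt feature detectors
```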

There are several ways to do automatic feature learning on a
computer and they mostly come under the banner of "deep learning"
(this is a poorly defined term but it pretty much means a large
artificial neural network with multiple hidden layers). At the time
of writing, deep learning appears to be a very effective technology
for image classification; for example, a deep convolutional neural
network won the 2012 ImageNet "Large Scale Visual Recognition
Challenge" by a wide margin (more on this below).

Open source tools for deep learning

Deep neural networks are computationally very expensive (especially
during training), hence it will almost certainly be necessary to run
the net on a desktop computer with a fast GPU. GPU programming is
notoriously tricky. But don't worry; you won't have to write
frightening GPGPU
code directly like Alex Krizhevsky did with his
cuda-convnet
(unless you want to!). Instead, a Python tool
called Theano
abstracts the implementation details away. You just write Python;
then Theano does all the hard work of running that code on a GPU, and
is surprisingly
fast. Here's
a 20 minute video intro to Theano. Building further on top of
Theano (and making your life easier still) is
the PyLearn2
library which allows you to specify deep learning networks with
a relatively minuscule amount of code.
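To give a flavour of what "you just write Python" means, here's a
tiny Theano sketch: a single logistic-regression training step (not
a deep net, but the same machinery scales up). The 100-feature input
size is an arbitrary assumption.

```python
# Define a symbolic expression and Theano compiles it (optionally
# for the GPU) into a fast callable function.
import numpy as np
import theano
import theano.tensor as T

X = T.matrix('X')            # symbolic minibatch of feature vectors
y = T.vector('y')            # symbolic 0/1 labels
w = theano.shared(np.zeros(100), name='w')  # weights (100 features assumed)

p = T.nnet.sigmoid(T.dot(X, w))             # predicted probabilities
loss = T.nnet.binary_crossentropy(p, y).mean()
grad = T.grad(loss, w)                      # symbolic differentiation!

train_step = theano.function(
    inputs=[X, y],
    outputs=loss,
    updates=[(w, w - 0.1 * grad)])          # gradient descent update
```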

Torch7 - "Torch7 is a scientific
computing framework with wide support for machine learning
algorithms. It is easy to use and provides a very efficient
implementation, thanks to an easy and fast scripting language,
LuaJIT, and an underlying C implementation."

gnumpy (which wraps CUDAMat)

PyCUDA

"Quick start"

Whilst Theano and PyLearn2 are probably a great approach if you
want a lot of control over your deep learning system, the fastest
way to get started (i.e. requiring the least amount of coding and
minimal understanding of the theory) is probably to dive straight in and use
Alex Krizhevsky's
cuda-convnet
code (or Daniel
Nouri's fork which implements dropout). This is the system
which won
the 2012
ImageNet "Large Scale Visual Recognition Challenge".

So one way to get started fairly quickly with the image
classification aspect of the project would be to:

Collect lots and lots of labelled images of plants

Throw all these images at cuda-convnet and see how well it can
do (I could be wrong but I think cuda-convnet needs labelled
training data); see the preprocessing sketch after this list

Examine the learnt network to try to figure out what worked well
and what didn't

If you still have time, maybe try implementing a suitable network
using Theano / PyLearn2 that you can pre-train with loads of
unlabelled images and then fine-tune the training with labelled images.
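As an example of the preprocessing step mentioned above, here's a
hypothetical sketch of the kind of script you'd need to turn scraped
photos into the small, fixed-size images a convolutional net expects.
The directory layout is made up for illustration; check cuda-convnet's
data-provider documentation for the exact batch format it wants.

```python
# Hypothetical preprocessing sketch: square-crop every scraped photo
# and resize it to a fixed size, keeping one directory per label.
import os
from PIL import Image

SIZE = 256  # output width and height in pixels

def centre_crop_and_resize(in_path, out_path, size=SIZE):
    img = Image.open(in_path).convert('RGB')
    w, h = img.size
    short = min(w, h)
    left = (w - short) // 2
    top = (h - short) // 2
    img = img.crop((left, top, left + short, top + short))
    img = img.resize((size, size), Image.ANTIALIAS)
    img.save(out_path)

for species in os.listdir('raw_images'):  # hypothetical layout
    out_dir = os.path.join('processed', species)
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    for fname in os.listdir(os.path.join('raw_images', species)):
        centre_crop_and_resize(
            os.path.join('raw_images', species, fname),
            os.path.join(out_dir, fname))
```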

Plant image datasets

Learning algorithms in general, and especially deep neural nets,
like having huge training datasets. Some sources of data might
include:

ImageNet is a huge
database of all sorts of images, including lots of plants and trees.

The
UK's Natural
History Museum probably have a large dataset. You could try
writing to them to ask if they'd be interested in collaborating.
They probably would, given projects like their "urban tree survey"
where they want people to identify as many trees as possible.

leafsnap
is an iOS app with very similar aims to this project. It gets an
average of 2 stars with reviews like "Unfortunately the database
is US only. Many of the wonderful trees I have around me here in
England are missing." Despite these reviews, it has been
downloaded about 1 million times, and apparently does achieve
state of the art performance. The project
was described
in this paper and they are planning to release their dataset
and code (although neither are available at the time of
writing).

Project scope

This is just a hand-wavey proposal; you certainly wouldn't
have to implement an entire smart phone application utilising cutting
edge machine learning techniques. For example, if you wanted, you
could drop the smart phone part of the project and focus on "just"
getting a desktop computer to recognise plant images. The precise
specification of the project will be defined in "Report One", due on
the 31st Jan. We can tailor the project to your group's interests.
And, of course, no one will expect you to produce a really high
performing plant recogniser in a single term! You just need to give
it your best shot (and try to have fun with it!)

Aspects of the project

If there are members of your team who want to focus on non-ML
aspects then here are some ideas for non-ML things to do on this
project, if you wanted:

Building a complete smart phone plant recogniser app will require
at least four components: the phone app; the server; the
'plant categoriser'; acquiring and pre-processing training materials.
The basic idea is that a smart phone probably doesn't have sufficient
processing power to do the image classification on the phone (at
least, not without very time-consuming optimisation of your code; and
possibly running your code on the phone's GPU) so you'll probably need
to do the image classification on a server.

Here are some brief hand-wavey ideas about each component. Just
to emphasise: the list below is just to give you a feel for the
project; you can completely ignore this list if you have better
ideas!

The phone app

Let user take several photos, then the user selects 1 or more
"good" images to upload to the plant recognition server

When the server responds, display the top 5 (?) answers
with confidences for each answer. The user can then select the
correct answer (this feedback could then be used to refine the
classification engine).

Let the user click on each answer to find out more about that plant.
This data could be sourced from Wikipedia or, perhaps, from the
Natural History Museum.

Integrate with Siri / Google Voice? e.g. the user just takes a photo
of a tree and then asks the phone "what fruit does this tree
produce?"

Could you do the whole app as an HTML5 app so that it can run on
any (modern) platform? HTML5 has
a Camera
API. If you do the smart phone app as a native app (rather
than HTML5) then maybe also consider building an HTML5 app so folks
can use the classifier though a desktop computer.

Record geographical location and date with each image. Perhaps
the image classifier will be able to use this information to refine
its classifications. Have a setting in the app to disable the recording of
geo location, just in case some users are nervous about uploading
their geo location. Could also record compass heading, tilt and
focus distance (if available) to allow the precise location of the
plant to be estimated, as distinct from the location of
the phone (see the sketch at the end of this list).

Depending on how your image classification system works, the app
might need to guide the user to photograph close-ups of the leaves,
bark, seeds etc. Or perhaps the user would first take a single
photo of the whole plant and then, if the classifier fails to get a
confident match, the system would guide the user to take close ups
to give the classification system more data to work with.

Of course, the app will need to be able to communicate with your
server and will need to fail gracefully when there are network
issues. Don't leave your users staring at a frozen screen for
ages! Just a simple progress bar can make the wait much less
frustrating for your users.

Maybe the user could save a list of favourite plants, and view
their favourite plants on a map.

Integrate with twitter. e.g. add a button to tweet "I just
found <plant name> in <location> using
<name of app>."

I'm not certain but I think that, if you're starting from
scratch, it may be easier to start developing for Android rather
than iOS (iOS is thoroughly locked down). Could be wrong though.
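On the plant-location idea above: given the phone's GPS fix, a
compass bearing and an estimated distance to the plant (perhaps from
tilt and focus distance), the geometry is just the standard
"destination point" formula. A sketch (the example coordinates are
hypothetical):

```python
# Estimate the plant's position (as opposed to the phone's), given
# the phone's GPS fix, compass bearing and an estimated distance to
# the plant. Standard great-circle "destination point" formula.
from math import asin, atan2, cos, degrees, radians, sin

EARTH_RADIUS_M = 6371000.0

def plant_position(lat, lon, bearing_deg, distance_m):
    """Return the (lat, lon) of a point `distance_m` metres from
    (lat, lon) along compass bearing `bearing_deg`."""
    lat1 = radians(lat)
    lon1 = radians(lon)
    bearing = radians(bearing_deg)
    d = distance_m / EARTH_RADIUS_M   # angular distance
    lat2 = asin(sin(lat1) * cos(d) +
                cos(lat1) * sin(d) * cos(bearing))
    lon2 = lon1 + atan2(sin(bearing) * sin(d) * cos(lat1),
                        cos(d) - sin(lat1) * sin(lat2))
    return degrees(lat2), degrees(lon2)

# e.g. a tree 10 m away, due north-east of the phone:
# plant_position(51.4988, -0.1749, 45.0, 10.0)
```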

The server

The core functionality is to receive images from smart phone / web
clients, pass these images to your
classification engine, and then send the output of the
classification engine to the correct client. It must fail
gracefully if the classification engine fails to return an answer.
The classification engine may run on a different computer (e.g. a
computer with a fast GPU).

The server will need to implement a 'job queue'; and it should probably be
able to spread the load across multiple classification engines each
running on a different machine (what
happens if your app becomes so popular that a single classification
engine can't keep up with demand?!) Maybe the server should respond
immediately to each request with an "estimated wait time" so users
can be informed if there is a long wait before the classification
engine becomes available for their job.
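Here's a minimal sketch of that job-queue idea, using Flask and a
worker thread. It's just to show the shape of the thing: in
production you'd want a proper task queue (and the classification
engine on a separate GPU machine), and classify() is a hypothetical
stand-in for your engine.

```python
# Minimal job-queue server sketch: accept an image, queue it, let a
# worker pass it to the classification engine, and serve the result.
import queue
import threading
import uuid
from flask import Flask, jsonify, request

app = Flask(__name__)
jobs = queue.Queue()
results = {}  # job_id -> classification result (use a real store later)

def classify(image_bytes):
    """Hypothetical call into the classification engine."""
    raise NotImplementedError

def worker():
    while True:
        job_id, image_bytes = jobs.get()
        try:
            results[job_id] = classify(image_bytes)
        except Exception:
            results[job_id] = 'error'  # fail gracefully
        jobs.task_done()

@app.route('/classify', methods=['POST'])
def submit():
    job_id = str(uuid.uuid4())
    jobs.put((job_id, request.files['image'].read()))
    # Respond immediately so the app can show an estimated wait time.
    return jsonify(job_id=job_id, queue_length=jobs.qsize())

@app.route('/result/<job_id>')
def result(job_id):
    return jsonify(result=results.get(job_id, 'pending'))

threading.Thread(target=worker, daemon=True).start()
```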

Use classification results to crowd-source a map of plants (the Natural History Museum might
want to integrate this into their "Urban tree survey"). Perhaps use
OpenStreetMap. You'll need
to estimate the exact location of each plant so you can attempt to
protect yourselves from counting a plant multiple times if multiple
users photograph the same plant. Perhaps allow users to annotate
plants (e.g. "this tree looks like it is diseased"). Maybe allow
users to annotate individual photos of a plant (e.g. "close up of
birds' nest found in this tree"). If you do find
multiple users taking photos of the same plant then keep all those
photos so you can then keep a photographic "history" of the
plant.

The classification engine

Train just one classifier to do whole plant and leaves and seeds and bark?
Or use separate classifiers for each segment of a plant? If you use separate classifiers, will
the user have to label each photo as "leaves", "seed" etc? Or
could you train a classifier to do automatic segmentation (e.g. see
Farabet et al 2013)?

If using deep learning techniques, would you do unsupervised
pre-training (which would allow you to use unlabelled images during
pre-training) or train the whole net in a supervised fashion (which
may be appropriate if you have enough labelled examples and should
probably be the first thing you try)?

Could you take advantage of the geographical location and season
to refine your hypotheses? If so, would this be done by feeding
this information into your image classifier (e.g. if you used neural
nets then perhaps you could have simple "date" and "geo location"
inputs which connect to the upper layers?) Or perhaps it would
make more sense to use Bayesian statistics to combine evidence from
your classifier with prior knowledge about the geographical and temporal
distribution of plants. Update your priors when new successful
classifications are made.
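The Bayesian version is only a few lines: if we assume the image is
conditionally independent of location and date given the species,
then P(species | image, where, when) is proportional to
P(image | species) × P(species | where, when). A sketch (the prior
array would come from your scraped distribution data):

```python
# Combine the classifier's output with a prior over species for the
# photo's location and month (Bayes' rule, up to a normalising
# constant). Strictly, a softmax estimates P(species | image) rather
# than P(image | species), so if your training set's class balance
# differs wildly from the prior, divide the training frequencies out.
import numpy as np

def combine(classifier_scores, prior):
    """Both arguments are 1-D arrays with one entry per species;
    `prior` comes from (hypothetical) geographical/seasonal data.
    Returns a normalised posterior over species."""
    posterior = classifier_scores * prior
    return posterior / posterior.sum()
```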

If the user provides feedback (e.g. "this is the correct answer")
then could you exploit that information?

If the user takes multiple images of the same plant then how best
to use these multiple images to come up with a good answer? Maybe
just run each image through your classifier and then return the
"majority vote"?

Maybe try to train your classifier
on phylogenetic
data so it can make sensible guesses when it doesn't know the exact
answer. This is known as
"transfer
learning". I believe the basic idea is that you pre-train a
deep neural network in an unsupervised fashion on phylogenetic data
(so it learns relationships between plants) and then you plug in a
new lower set of input layers to map from image data to these
pre-learnt representations. e.g. see section "2.4 Multitask and
Transfer Learning"
in Bengio
et al 2013.

Experiment with multiple ML techniques.

Acquiring and pre-processing training images

Scrape the plant image datasets listed above for images of plants
(if you use deep learning with unsupervised pre-training then the
images don't all have to be labelled). Lots of scope to
parallelise this scraping and run it on multiple DoC machines so
you can suck in millions of images.

Extract dates and location from the EXIF metadata in the
images to produce priors for geographical and temporal
distributions.
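For the EXIF idea, PIL can do most of the work; note that EXIF
stores GPS coordinates as degrees/minutes/seconds rationals
(numerator/denominator pairs in the PIL of 2013), and that many web
images have their EXIF stripped. A sketch:

```python
# Pull the date and GPS position out of a JPEG's EXIF metadata.
# Expect lots of Nones: many images carry no EXIF at all.
from PIL import Image
from PIL.ExifTags import TAGS, GPSTAGS

def extract_date_and_location(path):
    exif = Image.open(path)._getexif()
    if exif is None:
        return None, None
    tags = {TAGS.get(k, k): v for k, v in exif.items()}
    date = tags.get('DateTimeOriginal')  # e.g. '2013:11:02 14:31:07'
    gps = tags.get('GPSInfo')
    latlon = None
    if gps:
        gps = {GPSTAGS.get(k, k): v for k, v in gps.items()}
        latlon = (_to_degrees(gps['GPSLatitude'], gps['GPSLatitudeRef']),
                  _to_degrees(gps['GPSLongitude'], gps['GPSLongitudeRef']))
    return date, latlon

def _to_degrees(dms, ref):
    """EXIF stores coordinates as (degrees, minutes, seconds) rationals."""
    degrees = dms[0][0] / float(dms[0][1])
    minutes = dms[1][0] / float(dms[1][1])
    seconds = dms[2][0] / float(dms[2][1])
    sign = -1 if ref in ('S', 'W') else 1
    return sign * (degrees + minutes / 60.0 + seconds / 3600.0)
```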

Just to emphasise: you are certainly not expected to implement
all these ideas! It would take years to implement all these
features! I only mention them to give a feel for the potential
breadth of the project, if breadth is what you want (but, of
course, be aware that Spring term will fly past and you should
keep your group project specification as simple as possible).

Risks and benefits of this project

It is important to point out that MSc group projects aren't marked on
your ability to implement bleeding-edge computer science. For
example, the "Best MSc Computing Project" from Spring 2012 was a
lovely new website for a local doctors' surgery. If your aim is to
maximise your chance of winning "best project" whilst minimising your
effort then embarking on a project which makes use of bleeding-edge
tools and techniques carries some risk! We must also point out that
neither supervisor on this project does computer vision research as
their 'day job' (although Dr Knottenbelt did do some research on South
African flora in the 1990s!). But Jack will be using Deep Learning techniques for
his PhD in Spring; and will do all he can to help you... but you do
need to be aware that you will quickly become the department's experts
(possibly even the world's experts) in using computer vision for
recognising plants; and hence you will need to be comfortable with
being pioneers ;).

But there are real benefits of the project. For starters, you have
a real chance of building a system which can compete very favourably
in the "PlantCLEF
2014" competition; and it's very rare for MSc groups projects to
have a chance of competing in international competitions. Secondly,
"deep learning" is a topic which a lot of people are excited about
(including Google, Facebook etc) but very few people have hands-on
experience with the technique; so if you want to do research or get a
job in machine learning then a project on deep learning should look
attractive to employers and academics. Also, whilst it is true that
machine learning can become terrifyingly mathsy, it is also the case
that image classification is a very popular application of ML in
general and deep learning in particular, so there are lots of papers
and quite a lot of code that you can use (i.e. you can take existing
approaches and re-use them without having to understand the innards
really well). Also, there is great scope to
turn this project into a successful app; the demand for such an app is
demonstrated by the fact that the leafsnap app was
downloaded "almost
a million times" as of 2012. And, finally, the project should be
fun! (if you get excited by using cutting edge techniques to build a
system which has a good shot at doing object recognition almost as
well as a human)

If lots of groups are interested in this project then we may
consider running more than one group on this project (up to a
maximum of two or three groups at a real push). If everyone
wants to then you could compete directly; perhaps using different
machine learning techniques. Or you could collaborate in some way
(e.g. split the project up). Or groups could target different image
classification challenges (I think this option would be my
preference: one advantage of going this route is that groups could
help each other out more on the image classification techniques
and, of course, each group would have the potential to have a
greater impact on its target problem domain). There are lots of
other object classification tasks ripe for innovation,
e.g. classifying microscope images of blood cells as either
"healthy" or "malaria-infected" (a research group at UCLA recently
built a "gamified" solution to this problem).

Yet more details

I have been having email conversations with groups about this
project. To make sure that all groups have access to the same
information, I'll put even more info about this project here.
