Learning without Labels

November 4th, 2012

One reason for training deep networks
for extended periods without labels is that
pure backpropagation training suffers from
the dilution of gradient information with each step backward
through the network. Adding a training phase with no notion of
target labels lets the network
become familiar with the input data and
form some early abstractions, which are then
easier to hone during a pure backpropagation phase.
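
The dilution of gradient information can be seen in a toy calculation. The sketch below is not from the original post: it assumes a chain of single-unit sigmoid layers (the name `gradient_magnitudes` and the weights are purely illustrative) and tracks how the gradient shrinks as it is passed backward.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_magnitudes(weights, x):
    """Return |d output / d input-of-layer| for each layer, last layer first.

    Assumption: every layer is a single sigmoid unit, a = sigmoid(w * a_prev).
    """
    # Forward pass, remembering every activation.
    activations = [x]
    for w in weights:
        activations.append(sigmoid(w * activations[-1]))

    # Backward pass: each step multiplies by w * a * (1 - a),
    # and a * (1 - a) is never larger than 0.25.
    grad = 1.0
    magnitudes = []
    for w, a in zip(reversed(weights), reversed(activations[1:])):
        grad *= w * a * (1.0 - a)
        magnitudes.append(abs(grad))
    return magnitudes


# Eight layers, all with weight 1.0 (an arbitrary illustrative choice).
mags = gradient_magnitudes([1.0] * 8, x=0.5)
```

With these weights every backward step scales the gradient by at most 0.25, so `mags` falls roughly geometrically and the earliest layers see only a tiny fraction of the signal.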

Local minima

Another way of looking at this is to consider the sources of noise in the system.

In a visual recognition task, there’s a great deal of
redundant and/or confounding information in the input pixel data
and in the initial (random) weights for the neurons in the network.
Getting the system into a usable initial condition for
reliable backpropagation training is key to avoiding getting
caught in local minima
(where one is learning more about the noise than about the signal).
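
As a toy illustration of why the initial condition matters, here is plain gradient descent on a one-dimensional loss with two minima. The loss function, learning rate and starting points are assumptions chosen for the example. They are not taken from any real network.

```python
def loss(x):
    # A made-up loss with a shallow minimum near x = 1.13
    # and a deeper one near x = -1.30.
    return x**4 - 3 * x**2 + x


def grad(x):
    # Derivative of loss(x).
    return 4 * x**3 - 6 * x + 1


def descend(x, learning_rate=0.01, steps=2000):
    """Plain gradient descent from the starting point x."""
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


poor_start = descend(2.0)    # settles in the shallow (local) minimum
good_start = descend(-0.5)   # settles in the deeper minimum
```

Both runs use the same update rule, and only the starting point differs. In this sketch a sensible starting point stands in for what a label-free training phase is hoped to provide.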

Childhood labels

Rough progression visually:

Static vs. moving

Stuff vs. non-stuff (interesting vs. boring)

Face vs. non-face

Mummy vs. other people

Black-and-White vs. colours (which become interesting later)

Later on (skipping many stages):

Peek-a-boo

Names for parts of the head

Big vs little - comparing objects

Mine vs. not-mine

Lining things up in rows

Singular vs. plural

These developmental stages seem to be built in: each becomes ‘interesting’
just when the brain is ready to appreciate it.
Of course, it may be that only a brain prepared by the previous stage is
capable of learning the next.

Simple story

Despite parents’ best efforts, most of the labels
(or preferences about what aspect of the world is being learned)
seem to be internally generated by infants. The brain seems to have a lesson plan built in, mapping out the
right order in which to absorb different lessons.

As a simple example, “The Very Hungry Caterpillar” offers very different lessons to
children of different ages:

a book is not for chewing

looking at pages means more time with a parent

turning of pages

turning of pages when it’s the right time

poking fingers through the holes

realising that books have a ‘right way’ up

understanding that the caterpillar is the same on each page

seeing the different fruit

understanding that the caterpillar is eating

seeing that fingers come through from one page to the next

understanding that there’s a progression in stuff being eaten

realising that the words are the same each time

understanding that the caterpillar is getting bigger

being able to fill in missing words

understanding that the caterpillar sleeps after eating

…

understanding that the caterpillar becomes a butterfly (this is an enormous jump)

Label-less learning

Lots of development occurs without explicit labels. The brain has a built-in
knack for knowing when the right time has arrived to
chunk up data in ways that will be helpful later.

Building up a network’s weights gradually
(solving the bulk problem first, followed by refinements)
seems only natural. This may also ensure that the network doesn’t get ‘prematurely optimized’
and cornered in a local minimum that is detrimental to further learning progress.

An interesting avenue of inquiry is whether it’s possible
to find criteria that indicate when a given network is ready to ‘move on’
to the next stage of difficulty in learning. Can one detect when a network has started to over-learn,
and use that signal to introduce the ‘next level of difficulty’ of input data
to prevent this happening?
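
One possible shape for such a criterion is sketched below. It is only an assumption about what ‘over-learning’ might look like in practice: validation loss rising for a few epochs in a row while training loss keeps falling. The names `started_overlearning`, `Curriculum` and `patience` are hypothetical.

```python
def started_overlearning(train_losses, val_losses, patience=3):
    """True if, over the last `patience` epoch-to-epoch steps, training loss
    fell every time while validation loss rose every time."""
    if len(train_losses) != len(val_losses):
        raise ValueError("loss histories must have the same length")
    if len(val_losses) <= patience:
        return False  # not enough history yet
    recent_train = train_losses[-(patience + 1):]
    recent_val = val_losses[-(patience + 1):]
    train_falling = all(b < a for a, b in zip(recent_train, recent_train[1:]))
    val_rising = all(b > a for a, b in zip(recent_val, recent_val[1:]))
    return train_falling and val_rising


class Curriculum:
    """Tracks which difficulty stage of input data the network should see."""

    def __init__(self, num_stages, patience=3):
        self.num_stages = num_stages
        self.patience = patience
        self.stage = 0
        self.train_losses = []
        self.val_losses = []

    def record(self, train_loss, val_loss):
        """Log one epoch's losses and return the stage to use next."""
        self.train_losses.append(train_loss)
        self.val_losses.append(val_loss)
        not_last = self.stage < self.num_stages - 1
        if not_last and started_overlearning(
            self.train_losses, self.val_losses, self.patience
        ):
            self.stage += 1
            # Start a fresh history for the new stage.
            self.train_losses.clear()
            self.val_losses.clear()
        return self.stage


# Illustrative loss values, not measurements from a real run.
curriculum = Curriculum(num_stages=3, patience=2)
history = [(1.0, 1.0), (0.8, 0.9), (0.6, 0.95), (0.5, 1.0)]
stages = [curriculum.record(t, v) for t, v in history]
```

In a training loop, the value returned by `record` would select which subset of the data to draw from in the next epoch. Whether this particular signal is the right one is exactly the open question.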