Recognizing a cat

Elebenty gazillion readers sent me this story (with diverse links) over the last ten days. Thanks to all of you, who are now too numerous to name. (See links at The New York Times, Wired Science, and the BBC, among others.)

We consider the problem of building high-level, class-specific feature detectors from only unlabeled data. For example, is it possible to learn a face detector using only unlabeled images? To answer this, we train a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (the model has 1 billion connections, the dataset has 10 million 200×200 pixel images downloaded from the Internet). We train this network using model parallelism and asynchronous SGD on a cluster with 1,000 machines (16,000 cores) for three days. Contrary to what appears to be a widely-held intuition, our experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not. Control experiments show that this feature detector is robust not only to translation but also to scaling and out-of-plane rotation. We also find that the same network is sensitive to other high-level concepts such as cat faces and human bodies. Starting with these learned features, we trained our network to obtain 15.8% accuracy in recognizing 20,000 object categories from ImageNet, a leap of 70% relative improvement over the previous state-of-the-art.
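The paper's billion-connection model is obviously far beyond a blog snippet, but the core training signal it relies on — reconstruct unlabeled input while keeping the hidden features sparse — is small enough to sketch. Below is a minimal single-layer sparse autoencoder in NumPy. Everything here is an illustrative assumption of mine: the toy data, layer sizes, learning rate, and sparsity weight are invented, and the paper's pooling, local contrast normalization, and nine-layer depth are all omitted.

```python
# Minimal sketch of one sparse-autoencoder layer trained on unlabeled
# "patches" (synthetic toy data; all sizes and hyperparameters illustrative).
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid = 64, 16                       # e.g. an 8x8 patch -> 16 features
W = rng.normal(0.0, 0.1, (n_hid, n_vis))    # tied encode/decode weights
b_h = np.zeros(n_hid)                       # hidden bias
b_v = np.zeros(n_vis)                       # reconstruction bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(x, sparsity=0.01):
    """Reconstruction loss plus an L1 sparsity penalty on the features."""
    h = sigmoid(W @ x + b_h)                # encode: feature activations
    x_hat = W.T @ h + b_v                   # decode with the tied weights
    err = x_hat - x
    loss = 0.5 * err @ err + sparsity * np.abs(h).sum()
    dz = ((W @ err) + sparsity * np.sign(h)) * h * (1.0 - h)
    gW = np.outer(dz, x) + np.outer(h, err) # both uses of W contribute
    return loss, gW, dz, err                # dz -> b_h grad, err -> b_v grad

# Unlabeled toy data lying in a low-dimensional subspace, so there is
# structure for the features to discover.
basis = rng.normal(0.0, 1.0, (8, n_vis))
patches = rng.normal(0.0, 1.0, (200, 8)) @ basis * 0.2

lr, first, last = 0.02, None, None
for epoch in range(20):
    total = 0.0
    for x in patches:                       # plain, synchronous SGD
        loss, gW, gbh, gbv = loss_and_grads(x)
        W -= lr * gW
        b_h -= lr * gbh
        b_v -= lr * gbv
        total += loss
    first = total if first is None else first
    last = total
```

No labels appear anywhere in the loop — the only objective is reconstructing the input cheaply, which is the sense in which the Google network was never "told" what a cat is.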

At any rate, the tale is this: Google scientists interested in face-recognition software set up an immensely complicated artificial neural network, containing 16,000 computer processors, and fed it random images from YouTube. And what came out? Cat recognition! (Note: NOT squid recognition.) Cats, of course, are everywhere on YouTube.

The NYT notes:

Presented with 10 million digital images found in YouTube videos, what did Google’s brain do? What millions of humans do with YouTube: looked for cats.

The neural network taught itself to recognize cats [JAC: this took 3 days], which is actually no frivolous activity. This week the researchers will present the results of their work at a conference in Edinburgh, Scotland. The Google scientists and programmers will note that while it is hardly news that the Internet is full of cat videos, the simulation nevertheless surprised them. It performed far better than any previous effort by roughly doubling its accuracy in recognizing objects in a challenging list of 20,000 distinct items. . .

To find them, the Google research team, led by the Stanford University computer scientist Andrew Y. Ng and the Google fellow Jeff Dean, used an array of 16,000 processors to create a neural network with more than one billion connections. They then fed it random thumbnails of images, one each extracted from 10 million YouTube videos.
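The phrase "an array of 16,000 processors" glosses over how many workers can train one model at once: in asynchronous SGD, each worker applies gradient updates to shared parameters without waiting for the others. Here is a toy single-machine analogue of that idea — my own lock-free stand-in using threads and a tiny least-squares problem, not the distributed system the paper actually used:

```python
# Toy analogue of asynchronous SGD: several worker threads update one
# shared parameter vector without locks. Purely illustrative; the real
# system sharded a billion-connection model across 1,000 machines.
import threading
import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(1000, 3))
y = X @ true_w                          # noiseless linear data

w = np.zeros(3)                         # shared parameters, no lock

def worker(rows, lr=0.01, epochs=5):
    for _ in range(epochs):
        for i in rows:
            grad = (X[i] @ w - y[i]) * X[i]   # d/dw of 0.5*(x.w - y)^2
            w[:] = w - lr * grad              # read may be stale; that's OK

threads = [threading.Thread(target=worker, args=(range(k, 1000, 4),))
           for k in range(4)]           # 4 workers on disjoint data shards
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The updates occasionally clobber each other, but with small steps the shared vector still converges to the true weights — which is why giving up on synchronization is an acceptable price for running thousands of workers.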

The videos were selected randomly and that in itself is an interesting comment on what interests humans in the Internet age. However, the research is also striking. That is because the software-based neural network created by the researchers appeared to closely mirror theories developed by biologists that suggest individual neurons are trained inside the brain to detect significant objects. . .

“We never told it during the training, ‘This is a cat,’ ” said Dr. Dean, who originally helped Google design the software that lets it easily break programs into many tasks that can be computed simultaneously. “It basically invented the concept of a cat. We probably have other ones that are side views of cats.”

The Google brain assembled a dreamlike digital image of a cat by employing a hierarchy of memory locations to successively cull out general features after being exposed to millions of images. The scientists said, however, that it appeared they had developed a cybernetic cousin to what takes place in the brain’s visual cortex.

An image of a cat that a neural network taught itself to recognize. Photo by Jim Wilson for the New York Times

The BBC notes:

The work of the team stands at odds with many image-recognition techniques, which depend on telling a computer to look for specific features of a target object before any are presented to it.

By contrast, the Google machine knew nothing about the images it was to see. However, its 16,000 processing cores ran software that simulated the workings of a biological neural network with about one billion connections.

In a similar way nerves in brains are heavily interconnected and it is believed that “recognition” involves the triggering of a specific pathway through that thicket of connections.

Pathways for particular objects, people or other stimuli are thought to be built up as organisms learn about the world. Some neuroscientists speculate that parts of the human visual system become so specialised they recognise very specific subjects such as a person’s grandmother or their cat.

As millions of images were analysed by Google’s network of silicon nerves, some parts of it started to react to specific elements in those pictures.

After three days and 10 million images the network could spot a cat, even though it had never been told what one looked like.

Although the work at first seems useless, it isn’t. We learn to recognize objects by repeated exposure to them. And it’s always been a mystery to scientists how we’re able to form and remember images of people and friends whom we repeatedly see. (When I was younger, my father used to ask me the question, “Try to imagine a face that you’ve never seen before.” Try it—it’s not easy!) Some day face-recognition software will be everywhere: identifying you before letting you into secure facilities, helping police solve crimes, and so on.

And of course it will also give us a clue about how our brain works when recognizing faces. We can easily clue in on human faces, but not so easily on the faces of individuals from different species, which may be nearly as distinct from each other as one human is from others. And there’s an evolutionary reason for that: it was crucial for our group-living ancestors to recognize not only kin but other groupmates (who helped us before, and who was bad to us?), and to discriminate group-mates from potentially hostile members of outgroups.

I became a professor rather than a rancher because I cannot do what I have seen ranchers do as an ordinary thing. They look at a group of animals and immediately recognize each (even though bred to be identical) as an individual, and recognize the individuals the next time they see them, whenever that may be. It is sort of scary to witness.

It’s just training up the right skills. I used to look at seemingly meaningless squiggles and mutter to myself “wind, waves, rock quarry, earthquake, ooh – nuclear detonation”. A few years ago I walked into an office and looked at some graphs and went “damn, I can’t interpret these like I used to be able to anymore.”

Wow – all that processing capability and time and it still doesn’t come close to what a human would do in an instant. “15.8% accuracy” means “doesn’t work at all” in my dictionary. Oh, but “neural network”!

Also, I find it interesting that JAC refers to the arXiv as Cornell University Library. Although this is technically correct, nobody I know in the fields where the arXiv is popular (Physics, Maths, Theoretical Computer Science) refers to it as “Cornell University Library”.

I didn’t know that this result was from unsupervised learning, but to contrast it with what the cortex does I always refer to this 2006 post. It describes what it claims was the first time a neural-net model of a cortex (a) spontaneously organized symbols, and hence (albeit maybe not for the first time) (b) avoided overtraining.

“Cognitive modeling with neural networks is sometimes criticized for failing to show generalization. That is, neural networks are thought to be extremely dependent on their training (which is particularly true if they are “overtrained” on the input training set). Furthermore, they do not explicitly perform any “symbolic” processing, which some believe to be very important for abstract thinking involved in reasoning, mathematics, and even language.”

“After this training, the prefrontal layer had developed peculiar sensitivities to the output. In particular, it had developed abstract representations of feature dimensions, such that each unit in the PFC seemed to code for an entire set of stimulus dimensions, such as “shape,” or “color.” This is the first time (to my knowledge) that such abstract, symbol-like representations have been observed to self-organize within a neural network.

Furthermore, this network also showed powerful generalization ability. If the network was provided with novel stimuli after training – i.e., stimuli that had particular conjunctions of features that had not been part of the training set – it could nonetheless deal with them correctly.”
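That claimed generalization to novel conjunctions of features can be illustrated with a much humbler stand-in than the cited PFC model. Everything below is my own toy construction, not the 2006 network: a linear classifier is trained on (shape, color) stimuli whose label depends only on color, with one conjunction withheld entirely, and it still classifies the unseen conjunction correctly.

```python
# Toy illustration of generalizing to an unseen conjunction of features.
# Stimuli are (shape, color) pairs, one-hot coded; the label depends only
# on color. The pair (triangle, red) is held out of training entirely.
import itertools
import numpy as np

shapes = ["circle", "square", "triangle"]
colors = ["red", "orange", "blue", "green"]
POSITIVE = {"red", "orange"}            # label = 1 iff the color is "warm"

def encode(shape, color):
    x = np.zeros(len(shapes) + len(colors))
    x[shapes.index(shape)] = 1.0
    x[len(shapes) + colors.index(color)] = 1.0
    return x

train = [(s, c) for s, c in itertools.product(shapes, colors)
         if (s, c) != ("triangle", "red")]
X = np.array([encode(s, c) for s, c in train])
y = np.array([1.0 if c in POSITIVE else 0.0 for s, c in train])

# Plain batch gradient descent on the logistic loss.
w, b, lr = np.zeros(X.shape[1]), 0.0, 0.5
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= lr * X.T @ (p - y) / len(y)
    b -= lr * np.sum(p - y) / len(y)

# The never-seen conjunction is still classified by its color.
x_new = encode("triangle", "red")
p_new = 1.0 / (1.0 + np.exp(-(x_new @ w + b)))
```

Because the model decomposes each stimulus into reusable feature dimensions rather than memorizing whole training items, the weight it learned for "red" transfers to a shape context it never saw — a miniature version of the abstraction the quoted passage describes.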

A prediction out of that is that as animals with a cortex analog (so maybe even the mushroom bodies of arthropods) evolved, they labored under the constraint of avoiding overtraining. Hence symbol-like functionality.

One may even speculate that this symbolic functioning meant a pre-evolved affinity for language-like forms of communication.