(Part 1) Automated Vision using Convolutional Neural Networks

Or: Well-extracted means partially-recognised

With just a glance, people can extract large volumes of information about their surroundings. The human brain not only perceives the presence of various objects, but also their spatial position, any movement, and their relative significance. The tiniest of details – such as the twitch of a facial muscle – can tell a poker player whether an opponent is bluffing. But how do computers process visual information, and just how far has automated vision progressed?

“Computer Vision” approaches the task by harnessing unstructured visual data, such as images or videos, to extract structured information. One example of this approach might be to recognise certain objects, such as the human face, before localising them and “decoding” them within an image to identify facial expressions.

Humans find such tasks very easy. So why do computers find it difficult to tell the difference between a face and a tree, for example? Despite decades of research, it has only been in recent years that it has become possible to solve problems of this kind with an acceptable degree of accuracy.

It is widely known that a large proportion of the human brain is involved in processing visual information, and our visual system is honed during every waking moment of our lives by a constant stream of new visual stimuli.

A lot of data presents many problems

Image data is characterised by its large dimensions and the vast number of variable factors it contains. A digital photo from a standard smartphone has a resolution of at least 12 megapixels. This means that a single image comprises 12 million pixels. Furthermore, most images are in colour. The chromatic composition of each pixel is usually represented using three colour channels. Therefore, a single picture can contain 36 million variables.
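The arithmetic above can be checked in a few lines. The sketch below uses only the figures from the text (12 megapixels, three colour channels); the storage estimate additionally assumes one byte per channel value and no compression, which is an illustrative assumption rather than a property of any particular camera.

```python
# Back-of-the-envelope size of a single photo, using the figures from the text.
MEGAPIXELS = 12          # resolution of the example smartphone photo
COLOUR_CHANNELS = 3      # e.g. red, green and blue

pixels = MEGAPIXELS * 1_000_000
values_per_image = pixels * COLOUR_CHANNELS

# Assumption: one byte (8 bits) per channel value, stored uncompressed.
BYTES_PER_VALUE = 1
raw_megabytes = values_per_image * BYTES_PER_VALUE / 1_000_000

print(f"{pixels:,} pixels")
print(f"{values_per_image:,} values per image")
print(f"{raw_megabytes:.0f} MB uncompressed")
```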

Several pictures of just one person’s face can vary significantly from frame to frame – depending on lighting, background, orientation and the position of the camera. For a computer to deal with this information, it has to process these 36 million variables to determine the abstract output: for example, “face” or “face, width: 300 px, height: 450 px, position x: 100 px, position y: 300 px”.

Without some degree of preparation to reduce the size and variability of the data, as well as bundle and generalise information, it is practically impossible for a computer to extract this output directly from the image.

In computer vision and in many other branches of statistical learning, we approach this problem by processing a compact representation of the data. These representations include as much of the relevant information as possible, while eliminating unimportant and redundant elements. Features are identified by cross-referencing characteristics that are common to all catalogued variations of the feature in question.

Only the important things count: eyes, nose, ears

For example, in the case of a facial recognition system, the overall brightness of an image is just as irrelevant as sensor noise or compression artefacts. Certain geometric forms are common to all human faces; therefore, if we can represent an image as a combination of these shapes, then facial recognition can be reduced to the question of whether we see a nose, eyes, and ears in the image.

This process is known as feature extraction. It is generally applied to all automated recognition tasks, and is not solely limited to computer vision. The more meaningful the features being processed, the easier it is to extract the desired information from an image.
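As a minimal sketch of why a well-chosen feature can ignore overall brightness, consider the difference between neighbouring pixels instead of the pixel values themselves. The grey values and the helper name `horizontal_differences` are hypothetical and chosen purely for illustration; real systems use far richer features.

```python
def horizontal_differences(row):
    """Return the difference between each pixel and its left neighbour."""
    return [right - left for left, right in zip(row, row[1:])]

# A hypothetical row of grey values containing one dark-to-bright edge.
row = [10, 10, 10, 200, 200, 200]

# The same row under brighter light: every value is shifted by +40.
brighter_row = [value + 40 for value in row]

features = horizontal_differences(row)
features_brighter = horizontal_differences(brighter_row)

print(features)                        # [0, 0, 190, 0, 0]
print(features == features_brighter)   # True
```

The raw pixel values differ in every position, yet the extracted features are identical: the edge is preserved, while the brightness offset is discarded.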

A similar process occurs within the human brain. Higher-level processing does not rely directly on information received via stimulation of photoreceptor cells within the retina. Instead, a neural impulse cascade processes, summarises, and abstracts information at a number of levels. It begins in the retinal cells and continues all the way through to the visual cortex. For example, there are nerve cells in the primary visual cortex (known as V1) that react to edges of a particular orientation. Several such cells create stimuli within the higher levels of the visual cortex, which are associated with more complex patterns.

First define, then train

Until a few years ago, the traditional approach to computer vision was to define relevant image characteristics during the development of the system itself. A statistical learning technique – such as a decision tree – was then applied during training. The “HoG” (Histogram of Oriented Gradients) method, for example, is well suited to the recognition of people. It summarises information based upon the strength and orientation of edges within an image. A large proportion of image information is discarded, with only that which is relevant (edge strength and orientation) being used to determine whether unfamiliar image data includes a person or not.
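The core idea behind a histogram of oriented gradients can be sketched in plain Python. This is a heavily simplified illustration, not the full HoG method: it computes a single histogram for one small patch and omits the cells, overlapping blocks, normalisation and interpolation that practical implementations use. The function name `orientation_histogram`, the choice of nine bins and the 5x5 test patch are assumptions made for this example.

```python
import math

def orientation_histogram(image, bins=9):
    """Histogram of gradient orientations, weighted by gradient strength.

    `image` is a list of rows of grey values. Orientations are treated as
    unsigned (0 to 180 degrees), so opposite directions share a bin.
    """
    histogram = [0.0] * bins
    bin_width = 180.0 / bins
    height, width = len(image), len(image[0])
    # Skip the border so that every pixel has neighbours on all sides.
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = image[y][x + 1] - image[y][x - 1]   # horizontal change
            gy = image[y + 1][x] - image[y - 1][x]   # vertical change
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue  # flat region: contributes nothing
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            histogram[int(angle // bin_width) % bins] += magnitude
    return histogram

# Hypothetical 5x5 patch with a vertical edge: dark on the left, bright on the right.
patch = [[0, 0, 0, 255, 255] for _ in range(5)]
hist = orientation_histogram(patch)
print(hist)
```

For this patch, all of the gradient strength lands in the first bin, because the brightness changes only in the horizontal direction. The 25 pixel values are thus summarised by nine numbers that describe which orientations dominate, and a classifier such as a decision tree would then be trained on vectors of this kind.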

A similar approach can be taken to solving numerous conventional computer vision problems: for example, many industrial production processes are automatically monitored using “seeing software”. If a press produces a steel girder that is too thin, or if a PCB has been incorrectly soldered, the software sounds an alarm. The system then marks the item as defective, so that the problem can be addressed within the workflow. Such an approach works very well for standardised items, within consistently lit – and above all known – environments.

A few years ago, more difficult tasks such as facial recognition could also be performed quite acceptably; however, developers of vision systems always had to decide for themselves which features needed to be extracted from the image data. This was a very work-intensive and highly repetitive process: they implemented version A, tested it, improved it based on intuition, and progressed through various stages until they eventually reached version Z. Many years were spent on research into robust facial recognition systems before they achieved a satisfactory standard. Other problems could not be solved at all using traditional methods.

The ImageNet Large Scale Visual Recognition Challenge is a well-regarded competition for image classification systems. It requires automatic systems to process an enormous data set, made up of more than a million images. Systems are expected to categorise each picture according to one of 1000 pre-defined categories: for example, a rifle or hammerhead shark. The 2011 winner of the challenge achieved an accuracy rate of around 74 per cent, using an HoG-based algorithm. This equated to around one in four images being classified incorrectly. Humans can do much better!

But what are the methods available to a computer vision system, if it needs to recognise a hammerhead shark and tell it apart from a sledgehammer? You can find out in the second part of my article on convolutional neural networks.