Convolutional neural networks.

In my introductory post on neural networks, I introduced the concept of a neural network that looked something like this.

As it turns out, there are many different neural network architectures, each with its own set of benefits. The architecture is defined by the type of layers we implement and how layers are connected together. The neural network above is known as a feed-forward network (also known as a multilayer perceptron), since we simply have a series of fully-connected layers.

Today, I'll be talking about convolutional neural networks which are used heavily in image recognition applications of machine learning. Convolutional neural networks provide an advantage over feed-forward networks because they are capable of considering locality of features.

Consider the case where we'd like to build an neural network that could recognize handwritten digits.

For example, given the following 4 by 4 pixel image as input, our neural network should classify it as a "1".

Images are simply a matrix of values corresponding with the intensity of light (white for highest intensity, black for lowest intensity) at each pixel value. Grayscale images have a single value for each pixel while color images are typically represented by light intensity values for red, green, and blue at each pixel value. Thus, a 400 by 400 pixel image has the dimensions $\left[ {400 \times 400 \times 1} \right]$ for a grayscale image and $\left[ {400 \times 400 \times 3} \right]$ for a color image. When using images for machine learning models, we'll typically rescale the light intensity values to be bound between 0 and 1.

Revisiting feed-forward networks

First, let's examine what this would look like using a feed-forward network and identify any weaknesses with this approach. Foremost, we can't directly feed this image into the neural network. A feed-forward network takes a vector of inputs, so we must flatten our 2D array of pixel values into a vector.

Now, we can treat each individual pixel value as a feature and feed it into the neural network.

Unfortunately, we lose a great deal of information about the picture when we convert the 2D array of pixel values into a vector; specifically, we lose the spatial relationships within the data.

Just ask yourself, could you identify the following numbers in flattened vector form without rearranging the pixels back into a 2D array in your head?

Don't kid yourself, you can't. The spatial arrangement of features (pixels) is important because we see in a relativistic perspective. This is where convolutional neural networks shine.

Convolution layers

A convolution layer defines a window by which we examine a subset of the image, and subsequently scans the entire image looking through this window. As you'll see below, we can parameterize the window to look for specific features of an image, such as edges. This window is also sometimes called a filter, since it produces an output image which focuses solely on the regions of the image which exhibited the feature it was searching for.

Note: Windows, filters, and kernels all refer to the same thing with respect to convolutional neural networks; don't get confused if you see one term being used instead of another.

For example, the following window searches for vertical lines in the image.

We overlay this filter onto the image and linearly combine (and then activate) the pixel and filter values. The output, shown on the right, identifies regions of the image in which a vertical line was present. We could do a similar process using a filter designed to find horizontal edges to properly characterize all of the features of a "4".

We'll scan each filter across the image calculating the linear combination (and subsequent activation) at each step. In the figure below, our original image is represented in blue and the convolved image is represented in green. You can imagine each pixel in the convolved layer as a neuron which takes all of the pixel values currently in the window as inputs, linearly combined with the corresponding weights in our filter.

You can also pad the edges of your images with 0-valued pixels as to fully scan the original image and preserve its complete dimensions.Image credit

As an example, I took an image of myself sailing and applied four filters, each of which look for a certain type of edge in the photo. You can see the resulting output below.

A convolutional layer will accept a matrix of values as input and produce a series of convolved images, each of which search for a certain feature of the image. Our convolution layer will have a corresponding feature mapping for each filter we define. Notice how some feature mappings are more useful than others; for example, the "right diagonal" feature mapping, which searches for right diagonals in the image, is essentially a dark image which signifies that the feature is not present.

In practice, we don't explicitly define the filters that our convolutional layer will use; we instead parameterize the filters and let the network learn the best filters to use during training. We do, however, define how many filters we'll use at each layer.

We can stack layers of convolutions together (ie. perform convolutions on convolutions) to learn more intricate patterns within the features mapped in the previous layer. This allows our neural network to identify general patterns in early layers, then focus on the patterns within patterns in later layers.

Let's revisit the feed-forward architecture one more time and compare it with convolutional layers now that we have a better understanding of what's going on inside them.

A feed-forward network connects every pixel with each neuron in the following layer, losing any spatial information present in the image.

By contrast, a convolutional architecture looks at local regions of the image. In this case, a 2 by 2 filter with a stride of 2 is scanned across the image to output 4 neurons, each containing localized information about the image.

The four neurons can be combined to form a feature mapping of the image, where the feature mapped is dependent on the parameters of our filter. In the case of convolutional layers, spatial information can be preserved.

We can scan the image using multiple filters to generate multiple feature mappings of the image. Each feature mapping will reveal the parts of the image which express the given feature defined by the parameters of our filter.

In general, a convolution layer will transform an input into a stack of feature mappings of that input. Depending on how you set your padding, you may or may not reduce the size of your input, but as long as you use multiple filters you'll always increase the depth.

Our filter will have a defined width and height, but its depth should match that of the input. Thus for earlier examples where we looked at grayscale images, our filter had a depth of one. However, if we wanted to convolve a color image, for example, we would need a filter with parameter values for each of the three RGB layers.

Pooling layers

A pooling layer can be used to compress spatial information while preserving the full depth of our feature mapping. We'll still scan across the image using a window, but this time our goal is to compress information rather than extract certain features. Similar to convolutional layers, we'll define the window size, and stride. Padding is not commonly used for pooling layers.

Max pooling

In max pooling, we simply return the maximum value present inside the window for each scanning location.

Global average pooling

Global average pooling is an even more extreme case, where we simply return the average value of each feature mapping, effectively compressing the size of each feature mapping to a single value while preserving the depth of our input (where depth corresponds with the number of feature mappings).

Architecture of a CNN

Convolutional neural networks (also called ConvNets) are typically comprised of convolutional layers with pooling layers periodically interspersed. The convolutional layers create feature mappings which serve to explain the input in different ways, while the pooling layers compress the previous layer's feature mapping.

Essentially, we have an initial volume of information. This may be a grayscale image with a depth of 1 or a color image with a depth of 3. The convolutional layers act to increase the depth of this volume of information, while pooling layers serve to reduce the size of our volume of information. Eventually, we'll reduce this representation of our original input into a deep stack of 1 by 1 (single value) representations of the original input. At this time, we can now feed this information into a fully-connected layer and output layer.