Implementing a Convolutional Neural Network Using TensorFlow

Image recognition is currently my favorite type of machine learning. I say currently because I find language translation and NLP quite interesting.

Convolutional neural networks are, at the time of writing, the most efficient and accurate method for image recognition.

While you could use a standard fully connected deep neural network with a small dataset, it is not the most efficient method. Fully connected deep neural networks are structured in such a way that each node is connected to every node in the previous and the next layer. The first layer always consists of the input data. In the case of the MNIST dataset, which we will be using in this example, each image is 28 x 28 pixels, which is 784 pixels in total. With a fully connected neural network, each of those 784 pixels would have to be connected to every node in the next layer.
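To see how quickly this gets expensive, we can simply count the weights. The hidden layer size of 512 below is an arbitrary assumption for illustration, not something from MNIST itself:

```python
# Number of weights in a single fully connected layer on MNIST.
# The hidden layer size (512) is an assumed, arbitrary choice.
input_pixels = 28 * 28          # 784 inputs, one per pixel
hidden_nodes = 512
weights = input_pixels * hidden_nodes
print(weights)                  # 401408 weights in just one layer
```

That is over 400,000 parameters before we have even added a second layer, which is the overhead convolutions help us avoid.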

The MNIST dataset features quite small images. The dataset is widely known because of its small size, and training a model on it shouldn't be very expensive (provided the model is well designed).

What is a convolution?

A convolutional layer requires a filter. A filter is somewhat like the weights in a standard fully connected layer. A filter learns a certain feature, such as some sort of shape. TensorFlow requires a 4-dimensional tensor as a filter: (filter height, filter width, input channels, output channels). In the example I'm about to show you, I use a 5 * 5 * 1 filter. That means we will create a matrix that is 5 pixels wide, 5 pixels high and 1 channel deep.

The filter is 1 deep because in this example we use the MNIST dataset, which features gray-scale images. A standard image that you see every day is 3 channels deep. Those three layers of depth are the red layer, the green layer and the blue layer, hence RGB. Each of those layers features values between 0 and 255.
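To make that 4-dimensional shape concrete, here is the filter tensor built with NumPy. The 32 output channels are an assumption for illustration; it simply means the layer would learn 32 different features:

```python
import numpy as np

# Filter tensor layout: (filter height, filter width, input channels, output channels).
# A 5 x 5 patch, 1 input channel (gray-scale MNIST), 32 learned features (assumed).
filter_weights = np.random.randn(5, 5, 1, 32)
print(filter_weights.shape)  # (5, 5, 1, 32)
```

In TensorFlow 1.x this would typically be a trainable variable of the same shape rather than a fixed NumPy array.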

Adding a convolution in TensorFlow is as simple as adding this line (the padding argument is required; 'SAME' pads the input so the output keeps the same height and width):

tf.nn.conv2d(data, weights_conv1, strides=[1, 1, 1, 1], padding='SAME')

Strides is a one-dimensional list of length 4. Each of those values represents the stride of the sliding window for the corresponding dimension of the input tensor (batch, height, width, channels).

The terms weights and filter are interchangeable, so the weights passed as a parameter to the conv2d function are the 5 * 5 * 1 filter.

We slide the filter over the height and width of the image and compute the dot product of the filter and the input (the patch). As we do that, we create a 2-dimensional activation map as a response to the filter. Over time the network learns to recognize certain shapes and colors.
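As a sketch of that sliding dot product, here is a minimal pure-NumPy convolution (stride 1, no padding). It is only meant to show how the activation map is formed, not to replace tf.nn.conv2d:

```python
import numpy as np

def convolve2d(image, filt):
    """Slide filt over image (stride 1, no padding) and take the
    dot product of the filter and each patch."""
    ih, iw = image.shape
    fh, fw = filt.shape
    out = np.zeros((ih - fh + 1, iw - fw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y + fh, x:x + fw]
            out[y, x] = np.sum(patch * filt)  # dot product with the patch
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
filt = np.ones((3, 3))  # a trivial 3 x 3 filter for demonstration
activation_map = convolve2d(image, filt)
print(activation_map.shape)  # (2, 2): the 2-dimensional activation map
```

Each output cell is one response of the filter to one patch; stacking the responses of many filters gives the depth of the next layer.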

As we do this, we do not connect each neuron to all neurons in the next layer. Instead, we connect each neuron to only a local region of the input volume, defined by the filter size. The connections are local along the x and y axes (width and height), but the connection is always full along the z axis (depth).

Pooling

Pooling is another very important part of convolutional neural networks.

It is common practice to add a pooling layer between two convolutional layers. Its purpose is to reduce the size of the data while keeping its most prominent parts.

The most commonly used type of pooling is max pooling. With max pooling, we take a patch of data, for example 2 * 2, take the largest value in that patch and append it to the new matrix. We also need to choose a stride for the sliding window; a 2 * 2 window with a stride of 2 is the most common. That means the resulting matrix will be 75% smaller than the original.
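A minimal NumPy sketch of 2 * 2 max pooling with a stride of 2 makes the size reduction visible: a 4 * 4 input (16 values) shrinks to 2 * 2 (4 values), i.e. 75% smaller:

```python
import numpy as np

def max_pool_2x2(m):
    """2 x 2 max pooling with stride 2: keep the largest value in each patch."""
    h, w = m.shape
    out = np.zeros((h // 2, w // 2))
    for y in range(0, h, 2):
        for x in range(0, w, 2):
            out[y // 2, x // 2] = m[y:y + 2, x:x + 2].max()
    return out

m = np.array([[1., 3., 2., 4.],
              [5., 6., 7., 8.],
              [3., 2., 1., 0.],
              [1., 2., 3., 4.]])
pooled = max_pool_2x2(m)
print(pooled)
# [[6. 8.]
#  [3. 4.]]
```

Only the most prominent value of each patch survives, which is exactly the reduction described above.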

Pooling reduces computation time and helps prevent overfitting, so it is quite essential to a convolutional neural network.

As I said, there are a couple of different types of pooling, but the most commonly used one is max pooling, and you do it like this:

tf.nn.max_pool(conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

Here conv1 is the output of the convolutional layer above, ksize is the size of the pooling window for each dimension of the input (batch, height, width, channels), and strides works just like in conv2d.