Deep Learning for Letter Recognition with TensorFlow

It's been six months since my last blog post. To be honest, there are multiple drafts sitting around, but many of them took a really long time to finish. I probably should have sliced the material into multiple posts. Anyway, in this post I want to show how to achieve more than 95% accuracy with just a MacBook Air. So let's dig into the details.

This material originally comes from Deep Learning Lecture 4 at Udacity. The pickling, reformatting, accuracy, and session code are theirs, but the architecture, which is the core of deep learning, is my own. It felt refreshing because I had experienced it before (yes, Andrew Ng's Coursera Machine Learning course on neural networks).

There are two major sections when running TensorFlow. The first is describing the deep learning architecture in the form of a tensor graph. Since we only describe and compile the architecture, no computation happens at this stage. The second is the session, which tells the graph to actually run and learn from the data. We can also use the session to report accuracy on the validation set.

Okay, so this is where I play around with some parameters and the deep learning architecture, which is where the real fun begins.

In [ ]:

batch_size = 128
regularization_c = 0.12
learning_rate_c = 0.4
num_steps = 15001

On the machine learning side, there are the usual parameters you have to tune. Batch size is how many training examples the algorithm sees at each step; batching like this gives you stochastic gradient descent. A smaller batch size makes each step faster to train, but each step converges less reliably toward the features that truly predict the data, so you have to increase the number of steps to compensate. With a bigger batch size, each step takes longer to train but moves less stochastically, so you need fewer steps. I chose 128 images per step and 15,000 steps.

Then there is also a regularization parameter that lets the machine penalize features that get overmagnified (preventing overfitting). The learning rate tells the machine how fast to learn. I chose 0.4, decaying exponentially over 300 steps.
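As a rough sketch of what that decay schedule computes, here is the formula behind TensorFlow's `tf.train.exponential_decay` in plain Python. The 0.96 decay rate below is my assumption for illustration; the text only fixes the 0.4 starting rate and the 300-step horizon.

```python
def exponential_decay(base_rate, step, decay_steps, decay_rate):
    # Continuous (non-staircase) exponential decay:
    #   decayed = base_rate * decay_rate ** (step / decay_steps)
    return base_rate * decay_rate ** (step / float(decay_steps))

print(exponential_decay(0.4, 0, 300, 0.96))    # 0.4 at step 0
print(exponential_decay(0.4, 300, 300, 0.96))  # roughly 0.384 after 300 steps
```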

On the deep learning side, I also set a dropout rate of 0.5. That means if a layer in the neural network contains 100 activation units, at each step 50% of those units are selected at random to participate in forward prop and backprop. Intuitively, the layer can't rely on any particular activation unit to make a prediction. Like regularization, dropout is used to prevent overfitting.

Don't forget to disable dropout for the validation and test sets, because there you want 100% (instead of 50%) of your network's prediction power.
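A minimal sketch of the idea in plain Python. This is "inverted" dropout, the scheme `tf.nn.dropout` uses: survivors are scaled up by 1/keep_prob so nothing needs rescaling at test time.

```python
import random

def dropout(activations, keep_prob):
    # Zero each unit with probability (1 - keep_prob) and scale the
    # survivors by 1/keep_prob, keeping the expected sum unchanged.
    # For validation and test, call with keep_prob=1.0 (dropout disabled).
    return [a / keep_prob if random.random() < keep_prob else 0.0
            for a in activations]

train_acts = dropout([1.0, 2.0, 3.0, 4.0], 0.5)  # ~half zeroed, survivors doubled
test_acts  = dropout([1.0, 2.0, 3.0, 4.0], 1.0)  # unchanged: [1.0, 2.0, 3.0, 4.0]
```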

A 2D patch (kernel) strides over our images and summarizes them into a smaller number of pixels but a longer depth. This is what's called a 2D convolution. In the first convolutional layer I use a 13x13 patch with depth 16. That means the 28x28 grayscale images (1 channel, instead of RGB's 3 channels) are shrunk to 16x16 but extended to 16 depth channels. So imagine sixteen 16x16 2D matrices.
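Some quick bookkeeping on that first layer, using the sizes from the text (this just checks the arithmetic; it isn't the TensorFlow code):

```python
# First convolutional layer, sizes as described above.
image, channels = 28, 1      # 28x28 grayscale input
kernel, depth = 13, 16       # 13x13 patch, 16 output channels

out = image - kernel + 1     # valid padding, stride 1: 28 - 13 + 1 = 16
weights = kernel * kernel * channels * depth
print(out)       # 16   -> sixteen 16x16 feature maps
print(weights)   # 2704 weights in the patch (plus 16 biases)
```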

In [ ]:

pooling_size = 2

Another method, called pooling, summarizes a 2D matrix into an even smaller number of pixels. The advantage of pooling over convolutions is that the machine doesn't need additional weights to tune. Pooling, as the name suggests, forms a pool at every stride of the window and decides which pixel values get through and which don't. Max pooling picks the largest pixel value in each window. Another method is average pooling, where you average all the values in the window and pass the result to the next stage. Note that pooling doesn't increase the depth.

The stride [1, 2, 2, 1] tells TensorFlow to stride over every example, every 2 pixels in height and width, and every depth channel. This means that with every additional convolution or pooling layer, the 2D pixel dimensions get halved.
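Here's a toy illustration of 2x2 max pooling with stride 2 in plain Python, showing the mechanics rather than the TensorFlow call:

```python
def max_pool_2x2(matrix):
    # 2x2 max pooling with stride 2: each output pixel is the max of a
    # 2x2 window, so width and height are halved (and no weights are needed).
    n = len(matrix)
    return [[max(matrix[i][j], matrix[i][j + 1],
                 matrix[i + 1][j], matrix[i + 1][j + 1])
             for j in range(0, n, 2)]
            for i in range(0, n, 2)]

pooled = max_pool_2x2([[1, 2, 5, 6],
                       [3, 4, 7, 8],
                       [9, 1, 2, 3],
                       [4, 5, 6, 7]])
print(pooled)   # [[4, 8], [9, 7]] -- a 4x4 input halved to 2x2
```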

There are two kinds of padding: same padding and valid padding. With valid padding, the kernel must fit entirely inside the image: its corner starts at the corner of the image, so its center never reaches the edges. With same padding, you add zero padding around the edges of your 2D matrices, so the center of the kernel can start at the corner of the image.

The padding, kernel size, and stride together determine the size of the output matrices.
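The general rules, matching how TensorFlow sizes outputs for 'SAME' and 'VALID' padding, can be sketched as:

```python
import math

def output_size(in_size, kernel, stride, padding):
    # TensorFlow's sizing rules for one spatial dimension:
    #   SAME : ceil(in / stride)               (zeros padded at the edges)
    #   VALID: ceil((in - kernel + 1) / stride)
    if padding == "SAME":
        return math.ceil(in_size / stride)
    return math.ceil((in_size - kernel + 1) / stride)

print(output_size(28, 13, 1, "VALID"))  # 16: the first conv layer above
print(output_size(16, 2, 2, "SAME"))    # 8: each pooling layer halves the size
```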

Finally, I created convenience functions to get weights and biases: you just pass in the shape of the matrix you want and get back a tensor variable. There are four shape values for a 2D convolution: width, height, input channels, and output channels. Pooling doesn't require weights or biases. And there are only two shape values for the fully connected layers and logits: the number of input activations and the number of output activations. For the fully connected layer, I chose 64 activation units, matching the final number of depth channels.

Here are layers of my deep learning architecture:

2D Convolutions (k=13, s=1, padding=valid)

Max Pooling (k=2, s=2, padding=same)

2D Convolutions (k=2, s=2, padding=same)

Max Pooling (k=2, s=2, padding=same)

1x1 Convolutions (k=1, s=1, padding=valid)

Fully Connected

Fully Connected

Classifier

As discussed for each of the layers, the 2D pixel dimensions get halved at every additional layer.

Now let's create a session to run the tensor graph. You initialize the tensor variables before you run. Again, you can see the minibatch take turns at every step via an offset. There is also a feed_dict variable that lets the data and labels be injected at each session run. We report the training and validation accuracy as the steps progress, and the test accuracy at the final step. Here the session runs and reports the accuracy.
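The offset logic is roughly this (a sketch of the pattern from the Udacity notebook, not the full session code):

```python
def minibatch_offset(step, batch_size, num_examples):
    # Wrap around so every step gets a full batch from the training set;
    # the slice at this offset is what gets injected through feed_dict.
    return (step * batch_size) % (num_examples - batch_size)

offsets = [minibatch_offset(step, 128, 200000) for step in range(4)]
print(offsets)   # [0, 128, 256, 384]
```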

You can see that even at every 1,000 steps, the validation accuracy can go up or down. This is the result of minibatch gradient descent, which moves stochastically as it converges toward its ultimate predictive power. Let's do some back-of-the-envelope calculations.

We have 128 images per step and 15,000 steps to run. The total number of images the machine saw:

In [4]:

128*15000

Out[4]:

1920000

Since we have 200,000 images, the number of times the machine saw each image is:

In [6]:

1920000./200000

Out[6]:

9.6

So the machine saw each image almost 10 times; this is the same as about 10 full passes over the dataset. That is a relatively small number, so the machine still has room to grow just by increasing the number of steps, at the cost of longer compute time.

There are a few related topics I haven't covered here:

Arithmetic to get the required output number of pixels based on input size, kernel size, stride, and padding

Machine learning classification with one-hot encoding

Softmax and cross entropy

Non-linear functions such as ReLU (used in this blog), sigmoid, and tanh

While I would love to talk about them, I prefer to keep this blog post short; that material alone could fill another post. Here I just want to share my deep learning structure and its reported accuracy. This is a relatively simple deep learning architecture. Major production architectures have hundreds of layers to infer complex interactions in an image, rather than simple letter recognition. And there is another idea called inception, which uses various combinations of pooling and convolutions within one layer. All of these require multiple GPUs for the deep learning to train faster.

We also talked about pooling, convolutions, and fully connected layers. Hopefully this helps you understand other CNN deep learning architectures you come across in the future.

You can download this notebook and try it yourself. See if you can play with the parameters and beat the accuracy in this blog post. And if this post still feels like it's missing some details, I encourage you to take the Deep Learning course by Google at Udacity. It has all the information you need and more.