Jitter, Convolutional Neural Networks, and a Kaggle Framework

In this post, we’re going to look at the Digit Recognizer challenge from Kaggle. This challenge uses the MNIST dataset of handwritten digits. We’re going to train a Convolutional Neural Network with Keras to recognize the digits. We will employ some preprocessing steps to improve the generalization of our model.

In our analysis, we’ll use pandas, NumPy, and even a tool from scikit-learn. A side goal is to build a framework that could be used in other Kaggle competitions as well. The key idea is to make a validation set from the training set to test your ideas, rather than submitting to the leaderboard each time.

Prepare the Training, Validation, and Test Set

When we take a look at the training set, we see that the label column holds the digit class as a single integer from 0 to 9. Our eventual goal is to use a softmax layer as the output of our network, so we need to convert the labels into ten columns of binary values (one-hot encoding). Because we are going to modify the images as we train, we need a separate validation set to track the progress of the neural network, as opposed to relying on Keras’s built-in validation split. For convenience, we will use sklearn’s train_test_split function to make the split.
We will use pandas’ get_dummies to convert the label column into those binary columns.
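As a minimal sketch of this step, the snippet below one-hot encodes a label column with `pd.get_dummies` and holds out a validation set with `train_test_split`. The synthetic DataFrame standing in for the Kaggle training file, the 20% validation fraction, the scaling by 255, and the `(n, 1, 28, 28)` reshape are all assumptions made for illustration; only the two function names come from the text above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training file: a label column plus 784 pixel columns.
rng = np.random.default_rng(0)
n_samples = 200
pixels = rng.integers(0, 256, size=(n_samples, 784))
train = pd.DataFrame(pixels, columns=[f"pixel{i}" for i in range(784)])
train.insert(0, "label", np.arange(n_samples) % 10)  # every digit appears

# One-hot encode the labels: one binary column per digit.
y = pd.get_dummies(train["label"]).to_numpy(dtype="float32")

# Scale pixels to [0, 1] and reshape to (samples, channels, rows, cols).
X = train.drop(columns="label").to_numpy(dtype="float32") / 255.0
X = X.reshape(-1, 1, 28, 28)

# Hold out 20% of the training data as a validation set (assumed fraction).
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, y_train.shape, X_valid.shape, y_valid.shape)
```

Fixing `random_state` keeps the split reproducible, which matters when the validation score is the yardstick for comparing ideas.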

Preprocessing

An important piece of this post is the preprocessing we will do to the images. We are going to introduce random noise (jargon: jitter) to the image. We’re going to randomly apply the jitter in the following ways:

Deleting a column

Deleting a row

Shifting the image

Rotating the image

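We will use NumPy for the row and column deletion, and the shift and rotate functions (assumed here to be the ones from scipy.ndimage) for the other two effects. We will perform each action independently with probability \(p=.7\). The probability that an image is perturbed by at least one action is \(1-(1-.7)^4 \approx .992\). With a training set of over 32000, we expect roughly \(32000 \times .3^4 \approx 260\) images to stay the same. We will repeat this process each epoch.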
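The arithmetic above can be checked in a few lines. This is only a sketch: the round figure of 32000 images is taken from the text, and the variable names are illustrative. Note that rounding the probability to .99 before multiplying would give 320 unchanged images, whereas carrying the extra digits gives about 259.

```python
p = 0.7          # probability of applying each individual effect
n_effects = 4    # row, column, shift, rotate
n_train = 32000  # round figure for the training set size

# An image is untouched only if all four effects are skipped.
p_unchanged = (1 - p) ** n_effects
p_perturbed = 1 - p_unchanged
expected_unchanged = n_train * p_unchanged

print(f"P(perturbed) = {p_perturbed:.4f}")                     # 0.9919
print(f"Expected unchanged images = {expected_unchanged:.0f}")  # 259
```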

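    import numpy as np
    import matplotlib.pyplot as plt
    # Assumption: shift and rotate are the scipy.ndimage functions;
    # the listing itself does not show where they are imported from.
    from scipy.ndimage import rotate, shift

    def rand_jitter(temp):
        """Randomly perturb a 28x28 image; each effect fires with probability .7."""
        if np.random.random() < .7:
            temp[np.random.randint(0, 28), :] = 0  # blank out a random row
        if np.random.random() < .7:
            temp[:, np.random.randint(0, 28)] = 0  # blank out a random column
        if np.random.random() < .7:
            # shift by -3..2 pixels along each axis
            temp = shift(temp, shift=np.random.randint(-3, 3, 2))
        if np.random.random() < .7:
            # rotate by -20..19 degrees, keeping the 28x28 shape
            temp = rotate(temp, angle=np.random.randint(-20, 20), reshape=False)
        return temp

    # Copy so that we do not affect the original
    ind = np.random.randint(len(X_train))
    test_image = lambda: np.copy(X_train[ind, 0, :, :])

    # Jitter examples
    f, ax = plt.subplots(2, 2)
    for k in range(2):
        for j in range(2):
            ax[k, j].imshow(rand_jitter(test_image()))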
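To see what each of the four effects does on its own, here is a small deterministic sketch on a synthetic image with a single bright pixel. It assumes `shift` and `rotate` are the `scipy.ndimage` functions; the image, offsets, and angle are made up for illustration.

```python
import numpy as np
from scipy.ndimage import rotate, shift

# Synthetic 28x28 image with one bright pixel at row 10, column 10.
img = np.zeros((28, 28), dtype=float)
img[10, 10] = 1.0

# "Deleting" a row or a column means setting it to zero.
row_blanked = img.copy()
row_blanked[10, :] = 0
col_blanked = img.copy()
col_blanked[:, 10] = 0

# Shift the content 2 rows down and 1 column left.
shifted = shift(img, shift=(2, -1))
peak = np.unravel_index(np.argmax(shifted), shifted.shape)

# Rotate by 15 degrees; reshape=False keeps the 28x28 frame.
rotated = rotate(img, angle=15, reshape=False)

print(row_blanked.sum(), col_blanked.sum())  # both 0.0: the pixel is gone
print(peak)                                  # bright pixel is now at row 12, column 9
print(rotated.shape)                         # (28, 28)
```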

Training

In the interest of time, I will train for only two epochs. Each epoch, we copy the training set so that we can apply the jitter to the copy without touching the original array. The first epoch uses the original training set; every epoch afterwards uses a freshly perturbed copy.

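    from keras.callbacks import ModelCheckpoint

    # Callback for model saving:
    checkpointer = ModelCheckpoint(filepath="auto_save_weights.hdf5",
                                   verbose=1, save_best_only=True)

    # Parameters
    n_epochs = 2

    # Training
    for k in range(0, n_epochs):
        X_train_temp = np.copy(X_train)  # Copy so that we do not affect the originals

        # Add noise on later epochs
        if k > 0:
            for j in range(0, X_train_temp.shape[0]):
                X_train_temp[j, 0, :, :] = rand_jitter(X_train_temp[j, 0, :, :])

        # nb_epoch and show_accuracy belong to the early Keras API. In Keras 2 and
        # later, pass epochs=1 instead and request accuracy with
        # metrics=['accuracy'] in model.compile.
        model.fit(X_train_temp, y_train, nb_epoch=1, batch_size=128,
                  validation_data=(X_valid, y_valid),
                  show_accuracy=True, verbose=1,
                  callbacks=[checkpointer])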
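The copy-then-jitter pattern does not depend on Keras, so it can be pulled out into a reusable piece for other competitions. The sketch below is one possible way to do that: `jittered_epochs` and `blank_random_row` are hypothetical helpers, not part of the code above, and the all-ones array is a stand-in for real training images.

```python
import numpy as np

def jittered_epochs(X, jitter_fn, n_epochs, rng):
    """Yield (epoch, X_epoch): an untouched copy first, jittered copies afterwards."""
    for epoch in range(n_epochs):
        X_epoch = np.copy(X)  # never modify the original array
        if epoch > 0:
            for j in range(X_epoch.shape[0]):
                X_epoch[j, 0] = jitter_fn(X_epoch[j, 0], rng)
        yield epoch, X_epoch

def blank_random_row(image, rng):
    """Toy jitter used only for this demonstration: zero out one random row."""
    image[rng.integers(0, image.shape[0]), :] = 0
    return image

rng = np.random.default_rng(0)
X = np.ones((5, 1, 28, 28), dtype="float32")  # stand-in for the training images
X_before = X.copy()

epochs = list(jittered_epochs(X, blank_random_row, n_epochs=3, rng=rng))
for epoch, X_epoch in epochs:
    print(epoch, np.array_equal(X_epoch, X))
```

Inside a real training loop, each yielded `X_epoch` would be passed to a single-epoch fit call, and the jitter function would be swapped for whatever augmentation suits the data.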

Predicting on the Test Set

If we were going to submit this model to the competition, we’d want to retrain it on the full training set once we are happy with our validation score. With a less time-consuming approach, we could use cross-validation rather than a single validation set. Either way, it’s important to train the final model on the full data set.

But it will not make a difference for our demonstration, so we’ll make our predictions on the test set with the model as trained. As a sanity check, we will validate a prediction by visualizing the input image and comparing it to our prediction.
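A softmax output gives one probability per digit, so turning predictions into labels is an argmax over the columns. The sketch below uses a hand-made probability array in place of the model's output on the test set, and the `ImageId`/`Label` column names with 1-based ids are an assumption about the submission format rather than something stated above.

```python
import numpy as np
import pandas as pd

# Stand-in for the network's softmax output on four test images:
# each row sums to 1 and puts most of its mass on one digit.
true_digits = [7, 2, 1, 0]
probs = np.full((4, 10), 0.01)
for i, digit in enumerate(true_digits):
    probs[i, digit] = 0.91

# The predicted digit is the column with the highest probability.
labels = probs.argmax(axis=1)

# Assumed submission layout: 1-based image id plus predicted label.
submission = pd.DataFrame({
    "ImageId": np.arange(1, len(labels) + 1),
    "Label": labels,
})
print(submission)
```

For the visual sanity check, displaying a test image with `plt.imshow` next to its entry in `labels` is enough to confirm that the prediction and the picture agree.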