Creating a dataset for image classification

In this article we will go through the necessary steps of building a dataset for an image classification task. When you get started with deep learning, most of the ‘Hello World’ tutorials use datasets provided by the framework (MNIST, Fashion-MNIST, etc.). Usually, at that point, you don’t spend too much time worrying about data.

However, for building real-life applications, obtaining and preparing the data is a big part of the task. This post aims to present such a situation, where we have two folders containing pictures of cats and dogs respectively. Our objective is to train a neural network that, given any random picture of a cat or a dog from the internet, can distinguish between the two.

Downloading the images

The cat/dog images can be downloaded from here, but feel free to try with your own images as well.

Extracting the filenames

When working with thousands of images, it’s not possible to simply keep them all in memory as NumPy arrays. At this point, we are not interested in the actual pixels, only in the category and the image location. The image content can be read right before feeding it to the neural network.

For extracting the filenames, we are going to use Python’s glob module, which finds all the pathnames matching a specified pattern according to the rules used by the Unix shell. In this case, the function will look into the data folder for any file with the extension .jpg. By setting the parameter recursive to True, we make sure that the look-up happens in subfolders as well, as shown in the sketch below.
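A minimal sketch of this step (the /data/{category}/image_name.jpg layout is the one assumed throughout the post):

import glob

# find every .jpg under /data/, descending into the category subfolders
data_files = glob.glob('/data/**/*.jpg', recursive=True)
num_data_files = len(data_files)
print(num_data_files, 'images found')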

Adding the labels

Since all images from a category are placed in the same subfolder, it’s easy to extract the label from the filepath. Let’s build a label array, which will contain the category for each image in data_files:

data_labels = []
# label_indexes maps each category folder name to an integer label
# (the exact folder names here are an assumption, adjust them to yours)
label_indexes = {'cats': 0, 'dogs': 1}
# build the labels
for file in data_files:
    # file will be /data/{category}/image_name.jpg so we
    # extract the category from there
    label = file.split('/')[2]
    data_labels.append(label_indexes[label])
assert num_data_files == len(data_labels)

One hot encoding

For our cat/dog classifier, the numerical representation of the labels is pretty straightforward: 0 for Cats and 1 for Dogs. The neural network will output a single number between 0 and 1, telling us how confident it is that the picture is a Dog. If the value leans towards 0 (let’s say < 0.5), then we probably showed it a picture of a cat.

Output: 0.2
This means that the network is 80% sure that this is a cat.

With this binary approach, the network tells us whether a picture leans towards a cat or a dog. However, when we have more than 2 categories, things get a bit more complicated.

Let’s say, for example, that we want to train the network to tell us if the picture contains a cat, a dog or none of them (a third category).

To understand how to better represent the labels as numerical structures, let’s have a look at how a modern neural network makes predictions.

The final layer of the network usually contains a neuron for each class. The value of that neuron tells us how confident the algorithm is that the input belongs to that class (on a scale from 0 to 1). The neuron with the highest activation gives the prediction, but we would also like to know the probability for that class (how confident the neural network is that it is correct).

For this we use a softmax function. The softmax function takes all the input values (the activations in the final layer) and transforms them so that they add up to 1 while keeping the proportions between the original values. This way, we know the probability for each category.
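A minimal NumPy sketch of this transformation (the activation values are made up for illustration):

import numpy as np

def softmax(activations):
    # subtract the max for numerical stability; the result is unchanged
    exps = np.exp(activations - np.max(activations))
    return exps / exps.sum()

# final-layer activations for 3 classes: cat, dog, none of them
print(softmax(np.array([2.0, 1.0, 0.1])))
# -> approximately [0.66 0.24 0.1]; the values now add up to 1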

In examples like the one above, the predicted class can be correct for every image while the degree of confidence varies significantly.

Ideally, we would like the output for the correct class to be as close to 1 as possible and the other outputs to be near 0. A network that outputs similar values for multiple classes is not confident in its prediction.

The correct label for an image should then be a vector containing zeros for all classes except the correct one (which has the value one). This representation is called one-hot encoding.
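In code, the encoding takes only a few lines (the number of classes and their order are assumptions for the cat/dog/none example):

import numpy as np

NUM_CLASSES = 3  # cat, dog, none of them

def one_hot(label_index, num_classes=NUM_CLASSES):
    # all zeros, except a one at the position of the correct class
    encoding = np.zeros(num_classes)
    encoding[label_index] = 1.0
    return encoding

print(one_hot(1))  # dog -> [0. 1. 0.]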

Train/Test split

Most of the images we have are going to be used for training the network, but we should put aside a small percentage of them for validation purposes. In this case, we chose 15%.

import random

# TRAIN/TEST split
# The percentage of the data which will be used in the test set
TRAIN_TEST_SPLIT = 0.15

# shuffle files and labels together first; glob returns the files grouped
# by folder, so an unshuffled split would leave only one category in test
combined = list(zip(data_files, data_labels))
random.shuffle(combined)
data_files, data_labels = map(list, zip(*combined))

nr_test_data = int(num_data_files * TRAIN_TEST_SPLIT)
train_data_files = data_files[nr_test_data:]
test_data_files = data_files[:nr_test_data]
train_labels = data_labels[nr_test_data:]
test_labels = data_labels[:nr_test_data]

assert len(train_labels) + len(test_labels) == num_data_files
assert len(test_data_files) + len(train_data_files) == num_data_files

Saving/restoring the data

We are now almost done – we have a test/train dataset containing the filename and the category for each of our images.

Up to this point, there is no need to execute this process more than once (unless we add more images). So we can save our dataset to a file and restore it any time we want to train the model.
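A minimal sketch of the save/restore step using NumPy (the file name dataset_split.npz is a made-up example):

import numpy as np

# save the four arrays into a single .npz archive
np.savez('dataset_split.npz',
         train_data_files=train_data_files, train_labels=train_labels,
         test_data_files=test_data_files, test_labels=test_labels)

# ...later, restore them before training
dataset = np.load('dataset_split.npz')
train_data_files = dataset['train_data_files']
train_labels = dataset['train_labels']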

And that’s it! Every call to get_random_batch() gives us a random sample of data which we can feed directly to our model.
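The get_random_batch() helper itself is not shown in this extract; a possible implementation, assuming the train arrays built above, Pillow for reading the files and a fixed target size, could look like this:

import random
import numpy as np
from PIL import Image

IMG_SIZE = (128, 128)  # assumed target size

def get_random_batch(batch_size=32):
    # pick random indexes; the image content is only read now
    indexes = random.sample(range(len(train_data_files)), batch_size)
    images = [np.asarray(Image.open(train_data_files[i])
                         .convert('RGB').resize(IMG_SIZE))
              for i in indexes]
    labels = [train_labels[i] for i in indexes]
    return np.stack(images), np.array(labels)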

A faster alternative in Tensorflow

TensorFlow provides a Dataset API which can help you create complex data pipelines. You can see an example on my github account or live in action in this blog post about detecting sign language.
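As a rough sketch, such a pipeline built on the file/label lists from above could look like this (image size and batch parameters are arbitrary):

import tensorflow as tf

def parse_image(filename, label):
    # read and decode the image file lazily, inside the pipeline
    image = tf.io.decode_jpeg(tf.io.read_file(filename), channels=3)
    image = tf.image.resize(image, [128, 128]) / 255.0
    return image, label

dataset = (tf.data.Dataset.from_tensor_slices((train_data_files, train_labels))
           .shuffle(buffer_size=1000)
           .map(parse_image, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))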

A way faster alternative in Keras

Keras provides ImageDataGenerator in its preprocessing package, which not only takes all the preprocessing heavy lifting off your shoulders, but also provides the means to easily expand the dataset with data augmentation (rotation, shearing, horizontal/vertical flip).
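For illustration, a sketch with the augmentations mentioned above (the directory layout and all parameter values are assumptions):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# one setting per augmentation mentioned above, plus pixel rescaling
generator = ImageDataGenerator(rescale=1. / 255,
                               rotation_range=20,
                               shear_range=0.2,
                               horizontal_flip=True,
                               vertical_flip=True)

# reads the images straight from the data/{category}/ folders
train_batches = generator.flow_from_directory('data',
                                              target_size=(128, 128),
                                              batch_size=32,
                                              class_mode='binary')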

Conclusions

Preprocessing images is definitely not the most appealing part of building machine learning models, but that doesn’t make it less important. The silver lining is that you usually go through this process only once when training a model. As you saw in this article, you don’t necessarily have to rely on a framework for this process as long as you have a good template to build upon. You can find the source code here.