Detecting fingerspelling with artificial intelligence

Fig. 1 – Fingerspelling characters

In this post we will go through the steps needed to build an application that can understand fingerspelling (the sign alphabet) from your computer's web camera. As you can see in fig. 1, the differences between some letters are quite subtle (see m, n, s and t), which makes this a non-trivial task. We will use TensorFlow to build a comprehensive dataset and a neural network that can distinguish all 27 characters plus an extra character for space. Check out my YouTube demo to see the end result in action.

We will go through the following components in detail:

detecting the skin and extracting the hand from the image

creating a dataset for training our model

training a convolutional neural network to detect the characters

wrapping it up with a program that extracts our gesture, predicts the letter displayed, uses autocorrect to speed up the process and builds the message

Part 1 – Feature extraction

To help the neural network train and focus only on the relevant features of the image (position of the fingers, orientation of the arm, etc.), we need to reduce the noise as much as possible. A different background or even a different skin tone can damage our model.

Ideally, after the feature extraction process, starting from the frame captured by our web camera, we will be left with only this:

Fig. 2 – Extracting the skin (before / after)

Much of the heavy lifting in image processing is done by OpenCV, so make sure you have it installed before trying out the code. I also recommend Jupyter Notebook for getting started as fast as possible.

Let’s define our Python script first. We will start by reading frames from our web camera using OpenCV. In this article, we will focus on extracting the skin feature into a separate frame.

Region of interest

First, we need to extract the hand from the image. For this, I chose the easy way and defined a so-called region of interest. Basically, you define a convenient area where your hand will be placed and extract that region for further processing. Since your hand is usually still when spelling, this is sufficient for our use case.

Replace the <— Region of interest code here —> comment with the following code and re-run the code:
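As a rough sketch, the region of interest can be cut out of the frame with plain array slicing, since OpenCV frames are NumPy arrays. The position and size below (a 300-pixel square starting 50 pixels from the top-left corner) are assumed values for illustration, and extract_roi is a hypothetical helper name.

```python
import numpy as np

# Assumed placement of the region of interest, in pixels.
ROI_TOP, ROI_LEFT, ROI_SIZE = 50, 50, 300


def extract_roi(frame, top=ROI_TOP, left=ROI_LEFT, size=ROI_SIZE):
    """Return the square sub-image where the hand is expected to be."""
    # Rows come first in the array (y axis), columns second (x axis).
    return frame[top:top + size, left:left + size]
```

Inside the capture loop this would be used as roi = extract_roi(frame). Drawing the same square on the displayed frame (for example with cv2.rectangle) helps you keep your hand inside it.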

Let’s show some skin!

The tricky part is removing the background. To do this, we define a skin-like spectrum of colors that covers skin colors from very bright to dark. The range should be large enough to capture the entire hand (both the brighter and the more shaded spots) but small enough to exclude the background. The optimal interval is influenced by your skin tone and especially by the room lighting (daylight or artificial light), so feel free to experiment with it.

To define the color spectrum we use the HSV color format (Hue, Saturation and Value, i.e. brightness). This makes it easier to define a fixed range of skin tones.

Once the interval is defined, we remove all the pixels that are not within that range and we are left with only skin. To further reduce the noise, some morphological operations (erode and dilate) are going to be applied.

The cv2.inRange operation makes all the skin-like pixels white and all the others black. Our plan is to discard the irrelevant black pixels, but first we will try to refine the shape further.

Dilation is the operation that expands the connected sets of ones in the binary image. This is useful for filling holes and gaps. For example, the central area of our fist may look darker than the rest of the hand because of the light. We still want the whole area of the hand to be included in the image, so even though the inRange operation may lose some important pixels, we can count on dilation to recover them. After this operation, hopefully all the relevant pixels in the hand area have turned white.

The complementary operation to dilation is erosion, which shrinks the connected white pixels. This is especially useful for removing white noise around the hand that we do not want in the final image. You can find more details about this here.

You can notice in the picture above that even though the erosion and dilation help us remove the noise caused by the beige wall behind, we lost the tip of our thumb in the resulting image, which may be important in detecting the letter N. This is a continuous trade-off between leaving noise or removing relevant pixels which we have to deal with.

This strategy is far from perfect. I’ve found many suggestions online that can improve it (like identifying the face and extracting the skin tone from there, or defining a look-up table to cover more skin tones). However, perfect is not what we are looking for. It’s clear that the noise cannot be removed completely, but we can rely on our AI model to overcome this.

The power of neural networks comes from the fact that they can learn from noisy images as long as the dataset is properly prepared. Feeding the network enough data will help it learn the relevant features of the images and ignore the redundant pixels. If a human can easily recognize the letter in the picture, a well-trained, state-of-the-art neural network should be able to do it as well.

So our goal in the preparation phase is not to extract the perfect feature, but merely to help as much as we can.

Part 2 – Creating the dataset

Getting a good dataset means winning half the battle when we talk about machine learning.

Now that we know how to obtain the hand from the image, all that is required is a lot of patience to build an entire dataset.

The same strategy described in part 1 will be used for extracting the hand from the image. We will set up the Python script to capture a few frames per second and store them in files. This gives us enough time to move the hand around in the frame in order to add as much variance as possible to the images.

At 5 frames per second, collecting 1000 images for one category takes about 3 minutes and 20 seconds (1000 / 5 = 200 seconds). That still adds up to a lot if we plan to do it 28 times, but the advantage is that it has to be done only once.

After this step, our data folder containing the images is ready. The common input to a supervised machine learning algorithm (like a convolutional neural network) is a tuple of the form (data, label), where data is the actual image and label is the category it belongs to.
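The time estimate is simple arithmetic, sketched below so you can plug in your own numbers. The function name is a placeholder, and the figures are the ones mentioned in this post (1000 images per category, 5 frames per second, 28 categories).

```python
def capture_time_seconds(images_per_class, frames_per_second, num_classes=1):
    """Seconds of recording needed to collect the requested images."""
    return images_per_class * num_classes / frames_per_second


one_class = capture_time_seconds(1000, 5)        # 200.0 seconds
all_classes = capture_time_seconds(1000, 5, 28)  # 5600.0 seconds
print(one_class / 60, all_classes / 60)
```

For the whole dataset this comes to roughly an hour and a half of pure recording time, not counting the breaks between categories.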

Collecting the images

Only a few adaptations to our existing script are needed to start collecting the images. We need to store each processed image (the skin) in a folder that belongs to its class.

Since this is the least exciting part of the process, we won’t go into too much detail. I created a template on my GitHub account for capturing images from the video camera if you are interested.

Creating the dataset

We now have a folder containing all our images grouped in subfolders. The neural network input for training is a tuple formed by the image and its category. So let’s build a two-dimensional array containing the filename and the label for each image. We can then shuffle it to make sure that the data is uniformly distributed and split it into train and test datasets.

The source code for this part can be found on my GitHub as well. Feel free to use it.

Let’s start with the necessary imports and configurations:

import tensorflow as tf
from random import shuffle
import os
import numpy as np
import glob

# The directory where the images are kept
directory = 'data/'
# The file where we want to store our numpy arrays
data_file = 'fingerspelling_data.npz'
# The percent of data that will be included in the test set
TRAIN_TEST_SPLIT = 0.3

The categories can be obtained from the subfolders of the image data directory.

# Get the labels from the directory
labels = [x[1] for x in os.walk(directory)][0]
labels = sorted(labels)
num_labels = len(labels)
# build dictionary for indexes
label_indexes = {labels[i]: i for i in range(0, len(labels))}
print(label_indexes)

The output of our neural network is not going to be a single number from 0 to 27 telling us which label the image represents, but a 28-element vector containing the network's confidence that the image belongs to each class. Ideally, all values except the right one will be close to 0 and the correct index will have a value close to 1. This kind of representation is called one-hot encoding and it is commonly used in classification problems. We are going to keep our labels in the same form. This can be implemented using only Python and NumPy, but I chose the easy way out again and used a method from Keras tailored exactly for this use case.

# get all the file paths
data_files = glob.glob(directory + '**/*.jpg', recursive=True)
# shuffle the data
shuffle(data_files)
num_data_files = len(data_files)
data_labels = []
# build the labels
for file in data_files:
    label = file.split('/')[1]
    data_labels.append(label_indexes[label])
assert num_data_files == len(data_labels)
# convert the labels to one hot encoding
data_labels = np.array(data_labels)
data_labels_one_hot = tf.keras.utils.to_categorical(data_labels)
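For comparison, here is a sketch of the same one-hot encoding written with NumPy only. The one_hot function is a hypothetical helper, shown just to make clear what the encoding looks like.

```python
import numpy as np


def one_hot(indexes, num_classes):
    """Turn a list of class indexes into rows with a single 1 each."""
    indexes = np.asarray(indexes)
    encoded = np.zeros((indexes.size, num_classes), dtype=np.float32)
    # Set column indexes[i] of row i to 1.
    encoded[np.arange(indexes.size), indexes] = 1.0
    return encoded


print(one_hot([0, 2, 1], num_classes=3))
```

Each row has exactly one entry set to 1, in the column of its class, which is the form the softmax output of the network is compared against.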

Out of all the images collected, a big portion is going to be used for training the neural network. However, we need to set aside a subset for validating the accuracy. We are going to use this subset to make sure that prediction works for data that was not used for training.
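One simple way to do the split, assuming the file list has already been shuffled, is to slice it at a fixed position. The function below is a sketch with a hypothetical name; the default fraction of 0.3 mirrors the TRAIN_TEST_SPLIT value defined earlier.

```python
def split_train_test(files, labels, test_fraction=0.3):
    """Split two parallel sequences into (train, test) pairs."""
    num_test = round(len(files) * test_fraction)
    split_index = len(files) - num_test
    train = (files[:split_index], labels[:split_index])
    test = (files[split_index:], labels[split_index:])
    return train, test


files = [f"img_{i}.jpg" for i in range(10)]
labels = list(range(10))
(train_files, train_labels), (test_files, test_labels) = split_train_test(files, labels)
print(len(train_files), len(test_files))
```

Because files and labels are sliced at the same index, every image keeps its own label on both sides of the split.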

The TensorFlow data API provides powerful tools for creating data pipelines. We are going to use it to save us the burden of reading and loading the content of the files, resizing the images to our desired dimensions, splitting the data into batches and repeating the process for each epoch.
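A pipeline along these lines could be sketched as follows. The batch size and the function names are assumptions, and the 32-pixel target size is taken from the model input described in part 3. The code is written against the TensorFlow 2 API.

```python
import tensorflow as tf

IMAGE_SIZE = 32  # the network input is a 32*32 grayscale image
BATCH_SIZE = 64  # assumed value


def load_image(filename, label):
    """Read one JPEG file and return a (32, 32, 1) float image."""
    raw = tf.io.read_file(filename)
    image = tf.io.decode_jpeg(raw, channels=1)  # force grayscale
    image = tf.image.convert_image_dtype(image, tf.float32)  # scale to [0, 1]
    image = tf.image.resize(image, [IMAGE_SIZE, IMAGE_SIZE])
    return image, label


def build_dataset(filenames, one_hot_labels, batch_size=BATCH_SIZE, training=True):
    """Create a batched dataset of (image, label) pairs from file paths."""
    dataset = tf.data.Dataset.from_tensor_slices((filenames, one_hot_labels))
    if training:
        # Reshuffle the file list on every pass over the data.
        dataset = dataset.shuffle(buffer_size=len(filenames))
    dataset = dataset.map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(tf.data.AUTOTUNE)
```

When such a dataset is passed to a Keras model's fit method, it is iterated once per epoch, so an explicit repeat call is not required in this sketch.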

Finally, we are ready for the good stuff – building the neural network to detect fingerspelling.

Part 3 – Training the neural network

When it comes to image classification, it is no mystery that state-of-the-art accuracy is achieved by convolutional neural networks. These networks use a special architecture that is particularly well suited to classifying images.

Model architecture

The input is a 32*32 grayscale image. Since color is no longer relevant after the image processing step, we can simply give up on it. The architecture I propose contains 3 convolutional layers followed by a dense layer and an output softmax layer, so nothing fancy here. Since we’re using a high-level framework like Keras, we can easily take advantage of modern regularization and optimization techniques (dropout, Adam). A deeper look at these concepts is beyond the scope of this article.

There are, of course, many aspects of the model we can play around with (model architecture, size of the image, hyperparameters, etc.).
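A network matching this description could be sketched in Keras as shown below. Only the overall shape comes from the text (three convolutional layers, one dense layer, a softmax output, dropout and Adam); the filter counts, kernel sizes, pooling layers, dense width and dropout rate are assumptions made for this example.

```python
import tensorflow as tf

NUM_CLASSES = 28  # all characters plus the sign used for space


def build_model(num_classes=NUM_CLASSES):
    """Build and compile a small convolutional classifier."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32, 32, 1)),  # 32*32 grayscale input
        tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.5),  # regularization
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",  # matches one-hot labels
        metrics=["accuracy"],
    )
    return model
```

Training then comes down to a call such as model.fit(train_dataset, validation_data=test_dataset, epochs=30), where train_dataset and test_dataset stand for whatever data pipeline you have built.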

After training for 30 epochs with this dataset and model, I obtained an accuracy of over 99%. This is more than enough in my opinion, as I believe the main challenge for real-life use is the skin extraction step.

Part 4 – Word prediction and autocorrect

To be able to write entire paragraphs we need at least one more character, for space. I’ve chosen the classic ‘OK’ sign. Even though it looks pretty similar to the letter ‘a’, the AI seems to have no problem telling them apart.

For a smoother spelling experience, I found while experimenting that it is better to set a validation interval for a prediction, say 1 second. This means that, for a prediction to be added to the message, all consecutive frames processed within that 1-second interval must return the same prediction. Frames that produce uncertain predictions are therefore dropped.
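The validation interval can be sketched as a small stateful filter. The class name, the use of None for an uncertain frame and the injectable clock are choices made for this example rather than details of the original program.

```python
import time


class StablePredictionFilter:
    """Accept a character only after it has been predicted for a full interval."""

    def __init__(self, interval=1.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self._candidate = None
        self._since = None

    def update(self, prediction):
        """Feed one frame's prediction (None if uncertain).

        Returns the accepted character, or None if nothing is accepted yet.
        """
        now = self.clock()
        if prediction is None or prediction != self._candidate:
            # The streak is broken: start timing the new candidate.
            self._candidate = prediction
            self._since = now
            return None
        if now - self._since >= self.interval:
            # Restart the timer, so holding the sign repeats the letter
            # once per interval.
            self._since = now
            return prediction
        return None
```

In the main loop you would call update once per frame and append any non-None result to the message. Passing a custom clock makes the behavior easy to check without waiting for real time to pass.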

Another improvement I found useful is a spelling corrector, like the autocorrect library in Python.

pip install autocorrect

This way, small spelling mistakes are corrected after a word is entered, especially those related to double letters, which is a common problem (the same letter is added multiple times to the word if you hold your hand for more than one second).
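Below is a sketch of how the corrector could be hooked into the message. It assumes a recent version of the autocorrect package, which exposes a Speller class (older releases exposed a spell function instead); correct_last_word and make_speller are hypothetical helper names. The corrector is passed in as a plain callable, so any function that maps a word to a corrected word will do.

```python
def make_speller():
    """Build a corrector from the autocorrect package (pip install autocorrect)."""
    # Imported here so the helper below works without the package installed.
    from autocorrect import Speller
    return Speller(lang="en")


def correct_last_word(message, corrector):
    """Replace the last word of the message with its corrected form.

    Meant to be called right after a space has been appended to the message.
    """
    head, sep, last = message.rstrip(" ").rpartition(" ")
    if not last:
        return message  # nothing to correct yet
    return f"{head}{sep}{corrector(last)} "
```

A typical use would be message = correct_last_word(message, speller) each time the space sign is accepted, with speller = make_speller() created once at startup.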

Conclusions

The fact that modern artificial intelligence algorithms can easily classify images is not a surprise to anyone. The development of modern frameworks (like TensorFlow, Theano, Keras) in the past few years has opened the door for many developers to implement this kind of algorithm in their applications.

Using gestures to replace or even enhance the keyboard and the mouse is a very plausible scenario in the near future. As you can see, decent accuracy can be obtained without great effort.

There are many possible improvements regarding the app (replacing the region of interest with a hand detection mechanism, improving the skin detection feature, a better neural network model, etc). I will try to improve it and give updates on my progress.