Deep face recognition with Keras, Dlib and OpenCV

Face recognition identifies persons on face images or video frames. In a nutshell, a face recognition system extracts features from an input face image and compares them to the features of labeled faces in a database. Comparison is based on a feature similarity metric and the label of the most similar database entry is used to label the input image. If the similarity value is above a certain threshold the input image is labeled as unknown. Comparing two face images to determine if they show the same person is known as face verification.

This article uses a deep convolutional neural network (CNN) to extract features from input images. It follows the approach described in [1] with modifications inspired by the OpenFace project. Keras is used for implementing the CNN, Dlib and OpenCV for aligning faces on input images. Face recognition performance is evaluated on a small subset of the LFW dataset which you can replace with your own custom dataset e.g. with images of your family and friends if you want to further experiment with the notebook. After an overview of the CNN architecure and how the model can be trained, it is demonstrated how to:

Detect, transform, and crop faces on input images. This ensures that faces are aligned before feeding them into the CNN. This preprocessing step is very important for the performance of the neural network.

Use the CNN to extract 128-dimensional representations, or embeddings, of faces from the aligned input images. In embedding space, Euclidean distance directly corresponds to face similarity.

Compare input embedding vectors to labeled embedding vectors in a database. Here, a support vector machine (SVM) and a KNN classifier, trained on labeled embedding vectors, play the role of a database. Face recognition in this context means using these classifiers to predict the labels i.e. identities of new inputs.

CNN architecture and training

The CNN architecture used here is a variant of the inception architecture [2]. More precisely, it is a variant of the NN4 architecture described in [1] and identified as nn4.small2 model in the OpenFace project. This article uses a Keras implementation of that model whose definition was taken from the Keras-OpenFace project. The architecture details aren’t too important here, it’s only useful to know that there is a fully connected layer with 128 hidden units followed by an L2 normalization layer on top of the convolutional base. These two top layers are referred to as the embedding layer from which the 128-dimensional embedding vectors can be obtained. The complete model is defined in model.py and a graphical overview is given in model.png. A Keras version of the nn4.small2 model can be created with create_model().

frommodelimportcreate_modelnn4_small2=create_model()

Model training aims to learn an embedding of image such that the squared L2 distance between all faces of the same identity is small and the distance between a pair of faces from different identities is large. This can be achieved with a triplet loss that is minimized when the distance between an anchor image and a positive image (same identity) in embedding space is smaller than the distance between that anchor image and a negative image (different identity) by at least a margin .

means and is the number of triplets in the training set. The triplet loss in Keras is best implemented with a custom layer as the loss function doesn’t follow the usual loss(input, target) pattern. This layer calls self.add_loss to install the triplet loss:

During training, it is important to select triplets whose positive pairs and negative pairs are hard to discriminate i.e. their distance difference in embedding space should be less than margin , otherwise, the network is unable to learn a useful embedding. Therefore, each training iteration should select a new batch of triplets based on the embeddings learned in the previous iteration. Assuming that a generator returned from a triplet_generator() call can generate triplets under these constraints, the network can be trained with:

fromdataimporttriplet_generator# triplet_generator() creates a generator that continuously returns # ([a_batch, p_batch, n_batch], None) tuples where a_batch, p_batch # and n_batch are batches of anchor, positive and negative RGB images # each having a shape of (batch_size, 96, 96, 3).generator=triplet_generator()nn4_small2_train.compile(loss=None,optimizer='adam')nn4_small2_train.fit_generator(generator,epochs=10,steps_per_epoch=100)# Please note that the current implementation of the generator only generates # random image data. The main goal of this code snippet is to demonstrate # the general setup for model training. In the following, we will anyway # use a pre-trained model so we don't need a generator here that operates # on real training data. I'll maybe provide a fully functional generator# later.

The above code snippet should merely demonstrate how to setup model training. But instead of actually training a model from scratch we will now use a pre-trained model as training from scratch is very expensive and requires huge datasets to achieve good generalization performance. For example, [1] uses a dataset of 200M images consisting of about 8M identities.

The OpenFace project provides pre-trained models that were trained with the public face recognition datasets FaceScrub and CASIA-WebFace. The Keras-OpenFace project converted the weights of the pre-trained nn4.small2.v1 model to CSV files which were then converted here to a binary format that can be loaded by Keras with load_weights:

Custom dataset

To demonstrate face recognition on a custom dataset, a small subset of the LFW dataset is used. It consists of 100 face images of 10 identities. The metadata for each image (file and identity name) are loaded into memory for later processing.

Face alignment

The nn4.small2.v1 model was trained with aligned face images, therefore, the face images from the custom dataset must be aligned too. Here, we use Dlib for face detection and OpenCV for image transformation and cropping to produce aligned 96x96 RGB face images. By using the AlignDlib utility from the OpenFace project this is straightforward:

importcv2importmatplotlib.pyplotaspltimportmatplotlib.patchesaspatchesfromalignimportAlignDlib%matplotlibinlinedefload_image(path):img=cv2.imread(path,1)# OpenCV loads images with color channels# in BGR order. So we need to reverse themreturnimg[...,::-1]# Initialize the OpenFace face alignment utilityalignment=AlignDlib('models/landmarks.dat')# Load an image of Jacques Chiracjc_orig=load_image(metadata[2].image_path())# Detect face and return bounding boxbb=alignment.getLargestFaceBoundingBox(jc_orig)# Transform image using specified face landmark indices and crop image to 96x96jc_aligned=alignment.align(96,jc_orig,bb,landmarkIndices=AlignDlib.OUTER_EYES_AND_NOSE)# Show original imageplt.subplot(131)plt.imshow(jc_orig)# Show original image with bounding boxplt.subplot(132)plt.imshow(jc_orig)plt.gca().add_patch(patches.Rectangle((bb.left(),bb.top()),bb.width(),bb.height(),fill=False,color='red'))# Show aligned imageplt.subplot(133)plt.imshow(jc_aligned);

As described in the OpenFace pre-trained models section, landmark indices OUTER_EYES_AND_NOSE are required for model nn4.small2.v1. Let’s implement face detection, transformation and cropping as align_image function for later reuse.

As expected, the distance between the two images of Jacques Chirac is smaller than the distance between an image of Jacques Chirac and an image of Gerhard Schröder (0.30 < 1.12). But we still do not know what distance threshold is the best boundary for making a decision between same identity and different identity.

Distance threshold

To find the optimal value for , the face verification performance must be evaluated on a range of distance threshold values. At a given threshold, all possible embedding vector pairs are classified as either same identity or different identity and compared to the ground truth. Since we’re dealing with skewed classes (much more negative pairs than positive pairs), we use the F1 score as evaluation metric instead of accuracy.

The face verification accuracy at = 0.56 is 95.7%. This is not bad given a baseline of 89% for a classifier that always predicts different identity (there are 980 pos. pairs and 8821 neg. pairs) but since nn4.small2.v1 is a relatively small model it is still less than what can be achieved by state-of-the-art models (> 99%).

The following two histograms show the distance distributions of positive and negative pairs and the location of the decision boundary. There is a clear separation of these distributions which explains the discriminative performance of the network. One can also spot some strong outliers in the positive pairs class but these are not further analyzed here.

Face recognition

Given an estimate of the distance threshold , face recognition is now as simple as calculating the distances between an input embedding vector and all embedding vectors in a database. The input is assigned the label (i.e. identity) of the database entry with the smallest distance if it is less than or label unknown otherwise. This procedure can also scale to large databases as it can be easily parallelized. It also supports one-shot learning, as adding only a single entry of a new identity might be sufficient to recognize new examples of that identity.

A more robust approach is to label the input using the top scoring entries in the database which is essentially KNN classification with a Euclidean distance metric. Alternatively, a linear support vector machine (SVM) can be trained with the database entries and used to classify i.e. identify new inputs. For training these classifiers we use 50% of the dataset, for evaluation the other 50%.

Seems reasonable :-) Classification results should actually be checked whether (a subset of) the database entries of the predicted identity have a distance less than , otherwise one should assign an unknown label. This step is skipped here but can be easily added.

Dataset visualization

To embed the dataset into 2D space for displaying identity clusters, t-distributed Stochastic Neighbor Embedding (t-SNE) is applied to the 128-dimensional embedding vectors. Except from a few outliers, identity clusters are well separated.