All neural networks need a loss function for training. A loss function is a quantitative measure of how bad the network's predictions are when compared to ground truth labels. Given this score, a network can improve by iteratively updating its weights to minimise the loss. Some tasks use a combination of multiple loss functions, but often you'll just use one. MXNet Gluon provides a number of the most commonly used loss functions, and you'll choose certain functions depending on your network and task. Common task and loss function pairs include, for example, L2Loss for regression and SoftmaxCrossEntropyLoss for classification.

However, we may sometimes want to solve problems that require customized loss functions; this tutorial shows how we can do that in Gluon. We will implement contrastive loss, which is typically used in Siamese networks.

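Contrastive loss is a distance-based loss function. During training, pairs of images are fed into a model. If the images are similar, the loss is low when the model maps them close together and reaches 0 when their outputs coincide; if they are dissimilar, the loss is low when the model maps them far apart.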

Y is a binary label indicating similarity between training images. Contrastive loss uses the Euclidean distance D between images and is the sum of 2 terms:

- the loss for a pair of similar points
- the loss for a pair of dissimilar points

The loss function uses a margin m, which has the effect that dissimilar pairs only contribute to the loss if their distance is within the margin.
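To make the two terms and the margin concrete, here is a minimal sketch of the loss in NumPy, used here as a stand-in for Gluon's NDArray operations. It assumes the convention that a label of 0 marks a similar pair and 1 a dissimilar pair, and it uses a hypothetical default margin of 1.0; neither is fixed by the text above.

```python
import numpy as np

def contrastive_loss(output1, output2, label, margin=1.0, eps=1e-4):
    """Per-pair contrastive loss.

    Assumed convention: label 0 = similar pair, label 1 = dissimilar pair.
    output1, output2: arrays of shape (batch, features)
    label: array of shape (batch, 1)
    """
    # Squared Euclidean distance per pair; keepdims keeps the shape (batch, 1)
    dist_sq = np.sum(np.square(output1 - output2), axis=1, keepdims=True)
    # A small eps keeps the square root away from exactly 0
    dist = np.sqrt(dist_sq + eps)
    # Dissimilar pairs contribute only while their distance is below the margin
    hinge = np.clip(margin - dist, 0.0, margin)
    return 0.5 * ((1 - label) * dist_sq + label * np.square(hinge))

a = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
b = np.array([[0.0, 0.0], [0.3, 0.4], [3.0, 4.0]])
label = np.array([[0.0], [1.0], [1.0]])  # similar, dissimilar, dissimilar

losses = contrastive_loss(a, b, label)
print(losses.ravel())
```

The first pair is similar and identical, so its loss is 0. The second pair is dissimilar but only about 0.5 apart, which is inside the margin, so it is penalised. The third pair is dissimilar and already further apart than the margin, so it no longer contributes.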

A Siamese network consists of 2 identical networks that share the same weights. They are trained on pairs of images, and each network processes one image. The label defines whether the pair of images is similar or not. The Siamese network learns to differentiate between two input images.
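Weight sharing simply means that both images pass through the same parameters. The following sketch illustrates the idea in NumPy with a hypothetical single linear layer followed by a ReLU; it is not the tutorial's network, only a small illustration of the two-branch forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

# One set of weights, used by both branches (hypothetical 4 -> 2 embedding)
W = rng.normal(size=(4, 2))

def embed(x):
    # The same weights W are applied no matter which branch calls this
    return np.maximum(x @ W, 0.0)

def siamese_forward(x1, x2):
    # Each "network" processes one input of the pair
    return embed(x1), embed(x2)

x = rng.normal(size=(3, 4))
out1, out2 = siamese_forward(x, x)
print(np.array_equal(out1, out2))
```

Because the weights are shared, feeding the same input to both branches always yields identical outputs, so the distance between them is 0.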

Our network consists of 2 convolutional and max pooling layers that downsample the input image. The output is then fed through a fully connected layer with 256 hidden units and another fully connected layer with 2 units, which produces the final 2-dimensional output.

We train our network on the Omniglot dataset, which is a collection of 1623 hand-drawn characters from 50 alphabets. You can download it from here. We need to create a dataset that contains a random set of similar and dissimilar images. We use Gluon's ImageFolderDataset, where we override __getitem__ to randomly return similar and dissimilar pairs of images.
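The pair-sampling logic can be sketched independently of image loading. The function below is a hypothetical helper, not the tutorial's dataset class: it assumes a list of (filename, class_id) tuples, similar to what an image folder dataset holds, and returns two indices plus a label that is 0 for a same-class pair and 1 otherwise (an assumed convention).

```python
import random

def sample_pair(items, rng=random):
    """Return (index0, index1, label) for a randomly drawn pair.

    items: list of (filename, class_id) tuples with at least two classes
    label: 0 if both items share a class, 1 otherwise (assumed convention)
    """
    index0 = rng.randrange(len(items))
    class0 = items[index0][1]
    # Decide with equal probability whether to draw a similar or dissimilar pair
    want_same_class = rng.random() < 0.5
    if want_same_class:
        # May pick the same image twice, which is acceptable for a sketch
        candidates = [i for i, (_, c) in enumerate(items) if c == class0]
    else:
        candidates = [i for i, (_, c) in enumerate(items) if c != class0]
    index1 = rng.choice(candidates)
    label = int(class0 != items[index1][1])
    return index0, index1, label

items = [("a1.png", 0), ("a2.png", 0), ("b1.png", 1), ("b2.png", 1)]
rng = random.Random(0)
pairs = [sample_pair(items, rng) for _ in range(100)]
print(pairs[:3])
```

In an overridden __getitem__, the two indices would then be used to load and return the actual images together with the label.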

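    from mxnet import autograd

    # model, loss, trainer and train_dataloader are assumed to be defined beforehand
    for epoch in range(10):
        for i, data in enumerate(train_dataloader):
            image1, image2, label = data
            # Record the forward pass so that gradients can be computed
            with autograd.record():
                output1, output2 = model(image1, image2)
                loss_contrastive = loss(output1, output2, label)
            loss_contrastive.backward()
            # Update the weights, normalising by the batch size
            trainer.step(image1.shape[0])
        # Report the mean loss of the last batch of this epoch
        loss_mean = loss_contrastive.mean().asscalar()
        print("Epoch number {}\n Current loss {}\n".format(epoch, loss_mean))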

Verify that the last network layer uses the correct activation function. For instance, in binary classification tasks we need to apply a sigmoid to the output data. If we use this activation in the last layer and also define a loss function like Gluon's SigmoidBinaryCrossEntropyLoss, we would effectively apply the sigmoid twice and the loss would not converge as expected. If we don't define any activation function, Gluon will by default apply a linear activation.
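The effect of applying the sigmoid twice can be seen in a small NumPy sketch. The helper functions below are written out by hand for illustration and are not Gluon's implementation; the logits and labels are made-up values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_cross_entropy(prob, label):
    return -(label * np.log(prob) + (1 - label) * np.log(1 - prob))

logits = np.array([-10.0, 0.0, 10.0])  # raw network outputs
label = np.array([0.0, 1.0, 1.0])

once = sigmoid(logits)   # sigmoid applied a single time
twice = sigmoid(once)    # sigmoid in the last layer AND again inside the loss

loss_once = binary_cross_entropy(once, label)
loss_twice = binary_cross_entropy(twice, label)

print("probabilities, once: ", once)
print("probabilities, twice:", twice)
print("loss, once: ", loss_once)
print("loss, twice:", loss_twice)
```

After the second sigmoid, every probability is squeezed into a narrow band between 0.5 and roughly 0.73, so even a confident and correct prediction keeps a substantial loss. That floor is what keeps the loss from converging as expected.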

In our example, we computed the square root of squared distances between 2 images: F.sqrt(distances_squared). If images are very similar, we take the square root of a value close to 0, where the gradient of the square root becomes unbounded, which can lead to NaN values. Adding a small epsilon to distances_squared avoids this problem.
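A hand-written gradient shows where the NaN comes from. This is a minimal NumPy sketch of the chain rule for the distance, not what autograd literally executes, and the epsilon value of 1e-4 is only an example.

```python
import numpy as np

def grad_of_distance(diff, eps=0.0):
    """Gradient of sqrt(sum(diff**2) + eps) with respect to diff."""
    dist_sq = np.sum(np.square(diff))
    # Suppress NumPy's warnings so the resulting values can be inspected
    with np.errstate(divide="ignore", invalid="ignore"):
        # Chain rule: 1 / (2 * sqrt(s)) * 2 * diff, with s = dist_sq + eps
        return (1.0 / (2.0 * np.sqrt(dist_sq + eps))) * (2.0 * diff)

identical = np.zeros(2)  # difference between two identical outputs

grad_plain = grad_of_distance(identical)             # inf * 0 gives nan
grad_stable = grad_of_distance(identical, eps=1e-4)  # finite

print(grad_plain, grad_stable)
```

Without epsilon, the derivative of the square root at 0 is infinite, and multiplying it by the zero difference produces NaN. With epsilon, the same gradient is finite.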

In most cases, having the wrong tensor shape will lead to an error as soon as we compare data with labels. In some cases, however, training runs normally but does not converge. For instance, if we don't set keepdims=True in our customized loss function, the shape of the tensor changes. The example still runs fine but does not converge.

If you encounter a similar problem, then it is useful to check the tensor shape after each computation step in the loss function.
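As an illustration of such a shape check, the sketch below uses NumPy's broadcasting rules as a stand-in. The exact shapes Gluon produces may differ, but the habit of printing the shape after each step carries over. The batch size of 4 and the all-zero labels are arbitrary.

```python
import numpy as np

output1 = np.ones((4, 2))
output2 = np.zeros((4, 2))
label = np.zeros((4, 1))  # one label per pair, shape (batch, 1)

squared = np.square(output1 - output2)
with_keepdims = np.sum(squared, axis=1, keepdims=True)  # shape (4, 1)
without_keepdims = np.sum(squared, axis=1)              # shape (4,)

loss_ok = (1 - label) * with_keepdims      # (4, 1): one loss per pair
loss_bad = (1 - label) * without_keepdims  # (4, 4): labels and distances get mixed

for name, value in [("with_keepdims", with_keepdims),
                    ("without_keepdims", without_keepdims),
                    ("loss_ok", loss_ok),
                    ("loss_bad", loss_bad)]:
    print(name, value.shape)
```

No error is raised in the second case, yet every label is silently combined with every distance in the batch, which is the kind of mistake that lets training run without converging.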