We know who you are before you finish saying “Hello”

Neurodata Lab’s neural networks can identify a speaker from a single pronounced syllable. Although this article does not dwell directly on voice-based identification, the topic is closely related to it. We'll discuss neural network features, so-called d-vectors, that can be used in speech processing tasks ranging from verification to speech and emotion recognition.

The Basics

One second of sound can contain from 8,000 to 48,000 numbers, depending on the sample rate. These numbers can be seen as deviations of a microphone's (or speaker's) diaphragm from its equilibrium position. In fact, such a representation is redundant: the amplitude of the signal at the next moment depends on the previous one, which means the signal can be compressed without loss of information. There are many ways to reduce the size of a signal, and most of them are based on the physical properties of sound and of human hearing.

The community worked with hand-crafted features before neural networks proved broadly useful. The most widely used and recognized are pitch and MFCC. The former has the physical meaning of the oscillation rate of the vocal cords, which usually differs from person to person and depends on intonation. The concept of Mel-Frequency Cepstral Coefficients (MFCC) is based on the non-linearity of human sound perception, in particular of loudness and frequency: people perceive one sound as higher than another by some fixed amount, while in fact their frequencies differ by a certain factor.
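To make that non-linearity concrete, here is a small sketch of the standard Hz-to-mel conversion that mel filter banks (and hence MFCCs) are built on. The 2595·log10(1 + f/700) formula is the common textbook variant; the example frequencies are our own illustration.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale formula: perceived pitch grows non-linearly with Hz,
    # roughly linear below ~1 kHz and logarithmic above.
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

# Successive octaves of the note A: each step doubles the frequency in Hz,
# yet the width of an octave in mels shrinks relative to its width in Hz.
for f in (220.0, 440.0, 880.0, 1760.0):
    print(f, round(float(hz_to_mel(f)), 1))
```

By construction, 1000 Hz maps to approximately 1000 mel; above that, equal frequency *ratios* start to matter more than equal frequency *differences*, which is why MFCC pipelines space their filters on the mel axis rather than linearly in Hz.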

These and other manually calculated features are irreversible, in the sense that part of the signal is lost. Though this is not critical for some tasks, it would be great to have a more robust and versatile alternative.

The key to the problem is the Fourier transform, which represents an audio signal as a sum of waves with different frequencies and amplitudes. Human speech, however, is not stationary: its spectrum differs at different moments in time. This suggests looking at its time-frequency representation, the spectrogram.

To build a spectrogram, split the sound into overlapping frames about 10 milliseconds long, compute the Fourier transform of each frame, and stack the moduli of the results as columns of the spectrogram. This transform is (almost) invertible: the original audio signal can be restored from the spectrogram. Some data is lost, because the Fourier transform is complex-valued while a spectrogram is real-valued, so the iterative Griffin-Lim algorithm is used to restore the phases at least approximately. Finally, by taking the logarithm of the amplitudes, you get:
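A minimal sketch of this procedure, using SciPy's STFT on a synthetic signal. The 25 ms frame with a 10 ms hop is a common illustrative choice, not necessarily the article's exact setting:

```python
import numpy as np
from scipy.signal import stft

# Synthetic 1-second "audio" signal at 16 kHz: a mix of two sine tones.
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

# Overlapping frames: 25 ms windows advanced by a 10 ms hop.
frame = int(0.025 * sr)   # 400 samples per frame
hop = int(0.010 * sr)     # 160 samples between frame starts
freqs, times, Z = stft(signal, fs=sr, nperseg=frame, noverlap=frame - hop)

# Each column of |Z| is the modulus of one frame's Fourier transform;
# taking the logarithm gives the log-amplitude spectrogram.
log_spec = np.log(np.abs(Z) + 1e-10)
print(log_spec.shape)  # (frequency bins, time frames)
```

The phase of `Z` is discarded here, which is exactly the information Griffin-Lim later has to reconstruct iteratively.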

A spectrogram of a 5-second speech sample.

Spectrograms look much like images, so it is convenient to process them with convolutional neural networks as well.

Here's a hack from image processing: there are large databases with samples of different objects (like ImageNet). You can train a big network to recognize them and then either fine-tune it for your specific task or take the output of one of its inner fully connected layers as features. Such an architecture is believed to extract informative features from images. In our experience, though, the results are always better when the network is trained from scratch.

The idea of d-vectors (sometimes also called x-vectors) is somewhat similar to using ImageNet-pretrained networks, except that there are no comparable databases of spectrograms. Autoencoders might seem like a way out, yet they have no idea which parts of the spectrogram to pay attention to, which makes their results unsatisfactory.

We need to go deeper

Here comes the main part of the article.

There is a well-known task of voice verification, in which the system matches input speech to a person from a database. How to build such systems is a separate question: there are many parameters (e.g. the length of the speech; whether everyone reads the same text; a one-vs-one or one-vs-all setting) that can turn out critical under different conditions. But here we focus on something else.

Namely: how good will the features be if we train the network to identify the speaker? After all, everything we do here is aimed at obtaining good features.

In this case we can rely on intuition and on this article from 2015, where the authors train a network for face recognition. The trick is that they used Triplet Loss.

The idea is simple: normalize the features from the penultimate layer so that they lie on a unit hypersphere, and make points from the same class lie close to each other while points from different classes lie far apart. This can be achieved as follows: for every anchor point, pick two more points from the batch, a positive from the same class and a negative from a different one, and form the loss over these triplets:

L = Σ [ ‖f(x_a) − f(x_p)‖² − ‖f(x_a) − f(x_n)‖² + α ]₊

where x is an input image, f is the network's output after normalization, alpha is a manually set margin, and [·]₊ is the ReLU function. Qualitatively: if the distance between the anchor and the negative exceeds the distance between the anchor and the positive by at least alpha, the loss equals 0; the smaller the gap between classes, the larger the loss.
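A minimal NumPy sketch of this loss; the margin value and the toy embeddings are our own illustration:

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    # Project embeddings onto the unit hypersphere.
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """[ ||f(a)-f(p)||^2 - ||f(a)-f(n)||^2 + alpha ]_+ averaged over the batch."""
    a, p, n = map(l2_normalize, (anchor, positive, negative))
    d_pos = np.sum((a - p) ** 2, axis=-1)   # anchor-positive squared distance
    d_neg = np.sum((a - n) ** 2, axis=-1)   # anchor-negative squared distance
    return float(np.mean(np.maximum(d_pos - d_neg + alpha, 0.0)))

# Margin satisfied: the negative is much farther than the positive, loss is 0.
a = np.array([[1.0, 0.0]])
p = np.array([[0.99, 0.1]])   # same class: close to the anchor
n = np.array([[-1.0, 0.0]])   # other class: far from the anchor
print(triplet_loss(a, p, n))  # 0.0
```

Swapping the roles of the positive and negative points makes the bracket positive, so the loss pushes the offending triplet apart instead of vanishing.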

Here's what happens to the features after training with Triplet Loss.

By the way, you can be smart about forming the triplets. At some point the loss becomes small for randomly sampled triplets, so to accelerate training you can pick negatives that lie close to the anchor (hard negatives) instead of sampling them at random from other classes. However, this is hard to do for big datasets, because you have to compute pairwise distances between samples, and these change after every training iteration of the network.
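A sketch of such hard-negative selection within a batch. The function name and the toy data are ours; in practice you would run this per mini-batch rather than over the whole dataset, precisely because the distances go stale after each update:

```python
import numpy as np

def hardest_negatives(emb, labels):
    """For each embedding, return the index of the closest sample from another class."""
    # Pairwise squared Euclidean distances between all embeddings in the batch.
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    # Mask out same-class pairs (including each point itself) before minimizing.
    dist[labels[:, None] == labels[None, :]] = np.inf
    return np.argmin(dist, axis=1)

emb = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 0, 1, 1])
print(hardest_negatives(emb, labels))  # nearest other-class neighbor of each point
```

The full pairwise-distance matrix is O(batch²), which is fine per batch but exactly what becomes infeasible over an entire large dataset.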

Triplet Loss has an advantage over categorical cross-entropy, which is used for ordinary classification. A model trained with cross-entropy squeezes all points of the same class into a small domain, and information irrelevant to the classification task may be lost in the process. Since we intend to use the neural network as a feature generator rather than for verification itself, we don't want that. Triplet Loss cares more about placing different classes in different regions of the unit hypersphere than about tightly grouping each class.

Before training the feature generator on spectrograms, the last thing to determine is their size. Obviously, the bigger the time window we take, the more accurate the classification, but the more averaged the features. So it makes sense to choose the signal length so that it contains 1–3 phonemes (roughly a syllable). Half a second will do.

We took the VoxCeleb2 dataset for training. It contains several audio files a couple of minutes long for each of its 6,300 speakers, all recorded in different conditions. We used part of the files for training and the rest for validation, chose a convolutional network architecture, added Triplet Loss, and began training.

The results were impressive. After almost two weeks of training on a 1080Ti (yup, that long), the classification accuracy reached 55%. That may not seem like much, yet the top-5 accuracy is 78%. If we consider only the loudest parts, which are mostly stressed vowels, the top-5 accuracy rises to 91%. Basically, we can identify a person fairly accurately from a single word. Yet the accuracy itself is not what matters.

All of this was done for the features taken from the penultimate layer, just before classification. We tested them on our tasks, and the results were better than with conventional feature-calculation approaches. For example, using d-vectors in emotion recognition allowed us to beat the state-of-the-art solution by up to 4%. Our article about that was accepted at FICC 2019, but emotion recognition is a tale for another time.