Spoken language identification with deep convolutional networks

Recently TopCoder announced a contest
to identify the spoken language in audio recordings. I decided to test how well
deep convolutional networks would perform on this kind of data. In short, I managed to get
around 95% accuracy and finished in 10th place. This post reveals all the details.

Dataset and scoring

Each recording was in one of 176 languages. The training set consisted of 66176 mp3 files,
376 per language, from which I set aside 12320 recordings for validation
(the Python script is available on GitHub).
The test set consisted of 12320 mp3 files. All recordings had the same length (~10 sec)
and seemed to be noise-free (at least all the samples that I checked).
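A minimal sketch of such a train/validation split is below. It assumes the validation recordings are drawn evenly, 70 per language (12320 / 176 = 70); the actual script on GitHub may split differently, and the function name and arguments here are hypothetical.

```python
import random
from collections import defaultdict


def split_per_language(files, n_val_per_lang=70, seed=0):
    """Split (filename, language) pairs into train and validation lists.

    Assumption: the same number of validation files is taken from
    every language. The original split script may not do this.
    """
    by_lang = defaultdict(list)
    for name, lang in files:
        by_lang[lang].append(name)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val = [], []
    for lang in sorted(by_lang):
        names = sorted(by_lang[lang])
        rng.shuffle(names)
        val += [(n, lang) for n in names[:n_val_per_lang]]
        train += [(n, lang) for n in names[n_val_per_lang:]]
    return train, val
```

With 176 languages and 376 files each, this leaves 53856 recordings for training and 12320 for validation.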

The score was calculated as follows: for every mp3, the top 3 guesses were uploaded in a CSV file.
1000 points were given if the first guess was correct,
400 points if the second guess was correct and 160 points if the third guess was correct.
During the contest the score was calculated only on 3520 recordings from the test set.
After the contest the final score was calculated on the remaining 8800 recordings.
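The scoring rule can be sketched in a few lines of Python. The data layout (dictionaries keyed by a recording id) and the function name are assumptions made for illustration; only the point values come from the contest rules described above.

```python
def contest_score(guesses, truth):
    """Compute the contest score.

    guesses: dict mapping recording id -> list of up to 3 language guesses,
             ordered from most to least likely.
    truth:   dict mapping recording id -> correct language.
    """
    points = (1000, 400, 160)  # first, second, third guess
    total = 0
    for rec_id, true_lang in truth.items():
        for rank, lang in enumerate(guesses.get(rec_id, [])[:3]):
            if lang == true_lang:
                total += points[rank]
                break  # a recording scores at most once
    return total
```

A perfect submission on the 3520 recordings scored during the contest would therefore earn 3520 x 1000 = 3 520 000 points.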

Preprocessing

I entered the contest just 14 days before the deadline, so I didn’t have much time to investigate
audio-specific techniques. But we had a deep convolutional network developed a few months ago,
and it seemed to be a good idea to test a pure CNN on this problem.
A quick Google search revealed that the idea is not new. The earliest attempt I could find was a
paper by G. Montavon
presented at the NIPS 2009 conference. The author used a network with 3 convolutional layers trained on
spectrograms of audio recordings, and
the output of the convolutional/subsampling layers was given to a time-delay neural network.

Network architecture

I took the network architecture designed for Kaggle’s diabetic retinopathy detection contest.
It has 6 convolutional layers and 2 fully connected layers with 50% dropout.
The activation function is always ReLU. Learning rates are set to be higher for
the first convolutional layers and lower for the top convolutional layers.
The last fully connected layer has 176 neurons and is trained using a softmax loss.

It is important to note that this network does not take into account the sequential characteristics
of the audio data. Although recurrent networks perform well on speech recognition tasks
(one notable example is this paper
by A. Graves, A. Mohamed and G. Hinton, cited by 272 papers according to Google Scholar),
I didn’t have time to learn how they work.

I trained the CNN in Caffe with 32 images per batch;
its description in Caffe prototxt format is available here.

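| Nr | Type            | Batches | Channels | Width | Height | Kernel size / stride |
|----|-----------------|---------|----------|-------|--------|----------------------|
| 0  | Input           | 32      | 1        | 858   | 256    |                      |
| 1  | Conv            | 32      | 32       | 852   | 250    | 7x7 / 1              |
| 2  | ReLU            | 32      | 32       | 852   | 250    |                      |
| 3  | MaxPool         | 32      | 32       | 426   | 125    | 3x3 / 2              |
| 4  | Conv            | 32      | 64       | 422   | 121    | 5x5 / 1              |
| 5  | ReLU            | 32      | 64       | 422   | 121    |                      |
| 6  | MaxPool         | 32      | 64       | 211   | 60     | 3x3 / 2              |
| 7  | Conv            | 32      | 64       | 209   | 58     | 3x3 / 1              |
| 8  | ReLU            | 32      | 64       | 209   | 58     |                      |
| 9  | MaxPool         | 32      | 64       | 104   | 29     | 3x3 / 2              |
| 10 | Conv            | 32      | 128      | 102   | 27     | 3x3 / 1              |
| 11 | ReLU            | 32      | 128      | 102   | 27     |                      |
| 12 | MaxPool         | 32      | 128      | 51    | 13     | 3x3 / 2              |
| 13 | Conv            | 32      | 128      | 49    | 11     | 3x3 / 1              |
| 14 | ReLU            | 32      | 128      | 49    | 11     |                      |
| 15 | MaxPool         | 32      | 128      | 24    | 5      | 3x3 / 2              |
| 16 | Conv            | 32      | 256      | 22    | 3      | 3x3 / 1              |
| 17 | ReLU            | 32      | 256      | 22    | 3      |                      |
| 18 | MaxPool         | 32      | 256      | 11    | 1      | 3x3 / 2              |
| 19 | Fully connected | 20      | 1024     |       |        |                      |
| 20 | ReLU            | 20      | 1024     |       |        |                      |
| 21 | Dropout         | 20      | 1024     |       |        |                      |
| 22 | Fully connected | 20      | 1024     |       |        |                      |
| 23 | ReLU            | 20      | 1024     |       |        |                      |
| 24 | Dropout         | 20      | 1024     |       |        |                      |
| 25 | Fully connected | 20      | 176      |       |        |                      |
| 26 | Softmax Loss    | 1       | 176      |       |        |                      |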
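The width and height values in the table can be reproduced with a little arithmetic. The sketch below assumes no padding, floor rounding for convolutions and ceil rounding for pooling (which is how Caffe computes output sizes); the layer list is transcribed from the table and the helper names are made up for this example.

```python
import math


def conv_out(size, kernel, stride=1):
    # Convolution without padding: floor rounding.
    return (size - kernel) // stride + 1


def pool_out(size, kernel, stride):
    # Pooling without padding: ceil rounding, as Caffe does.
    return math.ceil((size - kernel) / stride) + 1


# (layer kind, kernel size, stride), in the order used by the network.
LAYERS = [
    ("conv", 7, 1), ("pool", 3, 2),
    ("conv", 5, 1), ("pool", 3, 2),
    ("conv", 3, 1), ("pool", 3, 2),
    ("conv", 3, 1), ("pool", 3, 2),
    ("conv", 3, 1), ("pool", 3, 2),
    ("conv", 3, 1), ("pool", 3, 2),
]


def trace_shapes(width, height, layers=LAYERS):
    """Return (kind, width, height) after every conv and pool layer."""
    shapes = []
    for kind, kernel, stride in layers:
        fn = conv_out if kind == "conv" else pool_out
        width, height = fn(width, kernel, stride), fn(height, kernel, stride)
        shapes.append((kind, width, height))
    return shapes
```

Calling `trace_shapes(858, 256)` ends with an 11x1 feature map, matching the last MaxPool row. The same stack applied to a 768-pixel-wide input ends with a 9x1 map, so the first fully connected layer would see a different number of inputs.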

Hrant suggested trying the ADADELTA solver.
It is a method that dynamically calculates the learning rate for every network parameter, and the
training process is said to be independent of the initial choice of learning rate. It was
recently implemented in Caffe.
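For readers unfamiliar with ADADELTA, here is a minimal NumPy sketch of the update rule from Zeiler's paper. The extra `lr` multiplier is an assumption about how a base learning rate can scale each step; `rho`, `eps` and the function name are illustrative defaults and are not taken from the solver used in this post.

```python
import numpy as np


def adadelta_minimize(grad_fn, x0, steps=500, rho=0.95, eps=1e-6, lr=1.0):
    """Run ADADELTA on a function given by its gradient.

    grad_fn: callable returning the gradient at x.
    lr:      assumed global multiplier on every step (1.0 = plain ADADELTA).
    """
    x = np.asarray(x0, dtype=float).copy()
    avg_sq_grad = np.zeros_like(x)  # running average of squared gradients
    avg_sq_step = np.zeros_like(x)  # running average of squared updates
    for _ in range(steps):
        g = grad_fn(x)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        # Per-parameter step: ratio of the two RMS values times the gradient.
        step = -np.sqrt(avg_sq_step + eps) / np.sqrt(avg_sq_grad + eps) * g
        avg_sq_step = rho * avg_sq_step + (1 - rho) * step * step
        x += lr * step
    return x
```

Each parameter gets its own effective step size, built from the history of its gradients and of its past updates.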

In practice, the base learning rate set in the Caffe solver did matter. At first I tried a
learning rate of 1.0, and the network didn’t learn at all. Setting the base learning rate to 0.01
helped a lot and I trained the network for 90 000 iterations (more than 50 epochs).
Then I switched to a base learning rate of 0.001 for another 60 000
iterations. The solver is available here.
I am not sure why the base learning rate mattered so much at the early stages of the training.
One possible reason could be the large learning rate coefficients on the lower convolutional layers.
Both tricks (dynamically updating the learning rates in ADADELTA and large learning rate coefficients)
aim to fight the vanishing gradient problem, and maybe their combination is not a very good idea.
This should be carefully analysed.

Training (blue) and validation (red) loss over the 150 000 iterations on the non-augmented dataset. The sudden drop of training loss corresponds to the point when the base learning rate was changed from 0.01 to 0.001. Plotted using this script.

The signs of overfitting were getting more and more visible, so I stopped at 150 000 iterations.
The softmax loss got to 0.43, which corresponded to a score of 3 180 000
(out of 3 520 000 possible). Some ensembling with other models of the same network gave
a slightly higher score (3 220 000), but it was obvious that data augmentation was needed to overcome the
overfitting problem.

Data augmentation

The biggest weakness of our team in the previous contest
was that we didn’t augment the dataset well enough. So I looked for ways to augment the
set of spectrograms. One obvious idea was to crop random intervals of, say, 9 seconds from the recordings.
Hrant suggested another idea: warping the frequency axis of the spectrogram. This process is known as
vocal tract length perturbation, and has been used for speaker normalization since at least
1998.
In 2013 N. Jaitly and G. Hinton
used this technique to augment an audio dataset. I used this formula
to linearly scale the frequency bins during spectrogram generation.

I also randomly cropped
the spectrograms to a size of 768x256. Here are the results:

Spectrogram of one of the recordings

Cropped spectrogram of the same recording with warped frequency axis
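The two augmentations can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the code used for the contest: it warps an already computed spectrogram instead of warping during spectrogram generation, it uses a purely linear warp with linear interpolation, and the warp range `max_warp=0.1` and all function names are hypothetical.

```python
import numpy as np


def warp_frequency(spec, alpha):
    """Linearly rescale the frequency axis of spec (freq_bins x time_frames).

    Output bin i is read from input position i / alpha, clipped to the
    valid range, with linear interpolation between neighbouring bins.
    """
    n_bins = spec.shape[0]
    src = np.clip(np.arange(n_bins) / alpha, 0, n_bins - 1)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, n_bins - 1)
    frac = (src - lo)[:, None]
    return (1 - frac) * spec[lo] + frac * spec[hi]


def random_crop_time(spec, width, rng):
    """Cut a random window of `width` frames along the time axis."""
    start = rng.integers(0, spec.shape[1] - width + 1)
    return spec[:, start:start + width]


def augment(spec, rng, width=768, max_warp=0.1):
    """One random augmented view: frequency warp, then time crop."""
    alpha = rng.uniform(1 - max_warp, 1 + max_warp)
    return random_crop_time(warp_frequency(spec, alpha), width, rng)
```

Calling `augment(spec, np.random.default_rng())` repeatedly on a 256x858 array yields different 256x768 views of the same recording.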

For each mp3 I created 20 random spectrograms, but trained the network on 10 of them.
It took more than 2 days to create the augmented dataset and convert it to LevelDB format (the format Caffe suggests).
But training the network proved to be even harder. For 3 days I couldn’t significantly decrease
the train loss. After removing the dropout layers the loss started to decrease, but it would have taken weeks
to reach reasonable levels. Finally, Hrant suggested reusing the weights of the
model trained on the non-augmented dataset. The problem was that, due to the cropping,
the image sizes in the two datasets were different. But it turned out that convolutional
and pooling layers in Caffe work with images of variable sizes;
only the fully connected layers couldn’t reuse the weights from the first model.
So I just renamed the FC layers
in the prototxt file and initialized
the network (convolution filters) with the weights of the first model.
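Here is a toy illustration of why renaming works, not the Caffe command itself. Caffe copies pretrained weights into layers with matching names and refuses to copy when the shapes differ; a renamed layer simply keeps its fresh initialization. The sketch mimics that rule with plain dictionaries. The layer names, the 0.01 init scale and the function name are hypothetical.

```python
import numpy as np


def init_from_pretrained(new_layers, pretrained, seed=0):
    """Initialize a new network from a pretrained one by layer name.

    new_layers: dict mapping layer name -> weight shape of the new network.
    pretrained: dict mapping layer name -> weight array of the old network.
    Returns (weights, names_of_copied_layers).
    """
    rng = np.random.default_rng(seed)
    weights, copied = {}, []
    for name, shape in new_layers.items():
        if name in pretrained:
            if pretrained[name].shape != tuple(shape):
                # Same name but different shape: copying is impossible.
                raise ValueError("shape mismatch for layer %r" % name)
            weights[name] = pretrained[name].copy()
            copied.append(name)
        else:
            # Unknown (for example renamed) layer: start from scratch.
            weights[name] = rng.normal(0.0, 0.01, size=shape)
    return weights, copied


# Old model: 858-wide input, final feature map 256 x 11 x 1 = 2816 values.
old = {"conv1": np.ones((32, 1, 7, 7)), "fc1": np.ones((1024, 2816))}
# New model: 768-wide input, assumed final map 256 x 9 x 1 = 2304 values.
new = {"conv1": (32, 1, 7, 7), "fc1_cropped": (1024, 2304)}
weights, copied = init_from_pretrained(new, old)
```

In this example only `conv1` is copied. Keeping the old name `fc1` with the new shape would raise an error, which is the situation the renaming avoids.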

This helped a lot. I used standard stochastic gradient descent (inverse decay learning rate policy)
with a base learning rate of 0.001 for 36 000 iterations (less than 2 epochs), then increased
the base learning rate to 0.01 for another 48 000 iterations (under the inverse decay policy
the rate had apparently decreased too much).
These training runs were done without any regularization techniques,
weight decay or dropout layers, and there were clear signs of overfitting. I tried to add 50%
dropout layers on the fully connected layers, but the training was extremely slow. To improve the
speed I used 30% dropout, and trained the network for 120 000 more iterations using this solver.
The softmax loss on the validation set reached 0.21, which corresponded to a score of 3 390 000.
The score was calculated by averaging softmax outputs over 20 spectrograms of each recording.
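To see how quickly the inverse decay policy mentioned above shrinks the learning rate, here is the formula Caffe uses for `lr_policy: "inv"`. The `gamma` and `power` defaults below are only illustrative (they are the values from Caffe's LeNet example), since the values used in this training are not given in the text.

```python
def inv_lr(base_lr, iteration, gamma=0.0001, power=0.75):
    """Caffe 'inv' policy: base_lr * (1 + gamma * iter) ^ (-power)."""
    return base_lr * (1.0 + gamma * iteration) ** (-power)


# Learning rate at a few points of a run that starts at 0.001.
schedule = {it: inv_lr(0.001, it) for it in (0, 12000, 24000, 36000)}
```

With these illustrative settings the rate after 36 000 iterations is roughly a third of the base rate, which shows how a run can end up with a smaller learning rate than intended.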

Ensembling

30 hours before the deadline I had several models from the same network. Even simple
ensembling (just the sum of softmax activations of different models) performed better than
any individual model. Hrant suggested using XGBoost,
which is a fast implementation of the gradient boosting
algorithm and is very popular among Kagglers. XGBoost has good documentation and
all parameters are well explained.
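The simple ensembling mentioned above (summing the softmax activations of several models and taking the three best languages) can be sketched like this. The array layout and the function name are assumptions made for the example.

```python
import numpy as np


def ensemble_top3(model_probs):
    """Sum softmax outputs of several models and return the top 3 classes.

    model_probs: list of arrays, each of shape (n_recordings, n_languages),
                 one array per model.
    Returns an int array of shape (n_recordings, 3), best guess first.
    """
    total = np.sum(np.stack(model_probs), axis=0)
    # Sort by descending summed probability and keep three columns.
    return np.argsort(-total, axis=1)[:, :3]


# Two toy models, two recordings, four languages.
m1 = np.array([[0.6, 0.3, 0.1, 0.0], [0.1, 0.2, 0.3, 0.4]])
m2 = np.array([[0.1, 0.6, 0.2, 0.1], [0.7, 0.05, 0.15, 0.1]])
top3 = ensemble_top3([m1, m2])
```

The same function also covers averaging over several augmented spectrograms of one recording, since the sum and the mean give the same ranking.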

To perform the ensembling I created a CSV file containing softmax activations
(or the average of softmax activations among 20
augmented versions of the same recording) using this script.
Then I ran XGBoost on these CSV files. The submission file (in the format requested by TopCoder)
was generated using this script.

I also tried to train a simple neural network
with one hidden layer on the same CSV files. The results were significantly better than
with XGBoost.

The best result was obtained by ensembling the following two models: snapshots of the last
network (the one with 30% dropout) after 90 000 iterations and 105 000 iterations. The final
score was 3 401 840, the 10th best result
of the contest.

What we learned from this contest

This was quite an interesting contest, although too short compared with Kaggle’s contests.

We believe it is possible to squeeze more from these models with better ensembling methods.

Other contestants report
better results based on careful mixing of the results of more traditional techniques,
including n-gram
and Gaussian Mixture Models.
We believe the combination of these techniques with the deep models will provide very
good results on this dataset.

One important issue is that the organizers of this contest do not allow
the dataset to be used outside the contest. We hope this decision will be changed eventually.