Voice Activity Detection in Noise Using Deep Learning

This example shows how to detect regions of speech in a low signal-to-noise environment using deep learning. The example uses the Speech Commands Dataset to train a Bidirectional Long Short-Term Memory (BiLSTM) network to detect voice activity.

Introduction

Voice activity detection is an essential component of many audio systems, such as automatic speech recognition and speaker recognition. Voice activity detection can be especially challenging in low signal-to-noise (SNR) situations, where speech is obstructed by noise.

This example uses long short-term memory (LSTM) networks, which are a type of recurrent neural network (RNN) well suited to learning from sequence and time-series data. An LSTM network can learn long-term dependencies between time steps of a sequence. An LSTM layer (lstmLayer) can look at the time sequence in the forward direction, while a bidirectional LSTM layer (bilstmLayer) can look at the time sequence in both forward and backward directions. This example uses a bidirectional LSTM layer.

To run the example, you must first download the data set. If you do not want to download the data set or train the network, then you can load a pretrained network by opening this example in MATLAB® and typing load("speechDetectNet.mat") at the command line.

Example Summary

The example goes through the following steps:

Training:

Create an audioDatastore that points to the audio speech files used to train the LSTM network.

Create a training signal consisting of speech segments separated by segments of silence of varying durations.

Split Data into Training and Validation

The data set folder contains text files that list which audio files should be in the validation set and which audio files should be in the test set. These predefined validation and test sets do not contain utterances of the same word by the same person, so it is better to use these predefined sets than to select a random subset of the whole data set. Use the supporting function splitData to split the datastore into training and validation sets based on the list of validation and test files.

[adsTrain,adsValidation] = splitData(ads,datafolder);

Shuffle the order of files in the datastores.

adsTrain = shuffle(adsTrain);
adsValidation = shuffle(adsValidation);

Create Speech-Plus-Silence Training Signal

Read the contents of an audio file using read. Get the sample rate from the info struct.
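For reference, this step can be sketched as follows (assuming adsTrain is the training datastore created above):

[audio,audioInfo] = read(adsTrain);
Fs = audioInfo.SampleRate;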

The signal has non-speech portions (silence, background noise, and so on) that do not contain useful speech information. This example removes silence using a simple thresholding approach identical to the one used in the Classify Gender Using LSTM Networks example.

Extract the useful portion of data. Plot the new audio signal and listen to it using the sound command.
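A sketch of this step, assuming the thresholding step produced a matrix speechIndices whose rows hold the start and end samples of each speech region (this variable name is an assumption):

audio = audio(speechIndices(1,1):speechIndices(1,2));
t = (0:numel(audio)-1)/Fs;
plot(t,audio)
xlabel("Time (s)")
sound(audio,Fs)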

Create a 1000-second training signal by combining multiple speech files from the training data set. Use HelperGetSpeechSegments to remove unwanted portions of each file. Insert a random period of silence between speech segments.

Preallocate the training signal.

duration = 1000*Fs;
audioTraining = zeros(duration,1);

Preallocate the voice activity training mask. Values of 1 in the mask correspond to samples located in areas with voice activity. Values of 0 correspond to areas with no voice activity.

maskTraining = zeros(duration,1);

Specify a maximum silence segment duration of 2 seconds.

maxSilenceSegment = 2;

Construct the training signal by calling read on the datastore in a loop.
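One way to sketch this loop is shown below. The exact interface of HelperGetSpeechSegments is an assumption here (its first output is assumed to be the trimmed speech vector); the amplitude scaling and the bounds check are illustrative choices, not the example's definitive implementation.

numSamples = 1;
while numSamples < duration
    data = read(adsTrain);
    data = data ./ max(abs(data));                 % scale amplitude
    speech = HelperGetSpeechSegments(data,Fs);     % assumed: returns trimmed speech
    endSample = numSamples + numel(speech) - 1;
    if endSample > duration
        break
    end
    audioTraining(numSamples:endSample) = speech;  % write the speech segment
    maskTraining(numSamples:endSample) = 1;        % mark voice activity in the mask
    % advance past the segment plus a random silence of up to maxSilenceSegment seconds
    numSamples = endSample + randi(maxSilenceSegment*Fs) + 1;
end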

Note that you obtained the baseline voice activity mask using the noiseless speech-plus-silence signal. Verify that using HelperGetSpeechSegments on the noise-corrupted signal does not yield good results.

[~,noisyMask] = HelperGetSpeechSegments(audioTrainingNoisy,Fs);

Visualize a 10-second portion of the noisy training signal. Plot the voice activity mask obtained by analyzing the noisy signal.
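A sketch of the visualization (variable names follow the surrounding text; the plotting details are assumptions):

range = 1:10*Fs;
figure
plot((range-1)/Fs,audioTrainingNoisy(range))
hold on
plot((range-1)/Fs,noisyMask(range),"LineWidth",2)
hold off
xlabel("Time (s)")
legend("Noisy Signal","Mask from Noisy Signal")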

Corrupt the validation signal by adding washing machine noise to the speech signal such that the signal-to-noise ratio is -10 dB. Use a different noise file for the validation signal than you did for the training signal.
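The corruption step can be sketched as follows, assuming noise holds the washing machine noise samples at sample rate Fs (read with audioread; the file name is not shown here). The scaling term sets the signal-to-noise ratio to the requested value:

SNR = -10;  % desired signal-to-noise ratio in dB
noise = noise(1:numel(audioValidation));
audioValidationNoisy = audioValidation + ...
    10^(-SNR/20) * noise * norm(audioValidation) / norm(noise);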

Display the dimensions of the features matrix. The first dimension corresponds to the number of windows the signal was broken into (it depends on the window length and the overlap length). The second dimension is the number of features used in this example.

[numWindows,numFeatures] = size(featuresTraining)

numWindows =
124999
numFeatures =
9

In classification applications, it is a good practice to normalize all features to have zero mean and unity standard deviation.

Compute the mean and standard deviation for each coefficient, and use them to normalize the data.
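A sketch of the normalization, operating column-wise on the features matrix (applying the training statistics to the validation features is an assumption, but standard practice):

M = mean(featuresTraining,1);
S = std(featuresTraining,[],1);
featuresTraining = (featuresTraining - M) ./ S;
featuresValidation = (featuresValidation - M) ./ S;  % reuse the training statistics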

Each feature corresponds to 128 samples of data (the hop length). For each hop, set the expected voice/no voice value to the mode of the baseline mask values corresponding to those 128 samples. Convert the voice/no voice mask to categorical.
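One way to sketch this step, using the 128-sample hop length stated above:

hopLength = 128;
maskMode = zeros(numWindows,1);
for index = 1:numWindows
    % mode of the baseline mask over the 128 samples spanned by this hop
    maskMode(index) = mode(maskTraining((index-1)*hopLength+1:index*hopLength));
end
maskTraining = categorical(maskMode);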

Define the LSTM Network Architecture

LSTM networks can learn long-term dependencies between time steps of sequence data. This example uses the bidirectional LSTM layer bilstmLayer to look at the sequence in both forward and backward directions.

Specify the input size to be sequences of length 9 (the number of features). Specify a hidden bidirectional LSTM layer with an output size of 200 and output a sequence. This command instructs the bidirectional LSTM layer to map the input time series into 200 features that are passed to the next layer. Then, specify a second bidirectional LSTM layer with an output size of 200 that also outputs a sequence, so that the network produces a prediction for every time step before the fully connected layer. Finally, specify two classes by including a fully connected layer of size 2, followed by a softmax layer and a classification layer.
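The architecture described above can be written as follows. The layer sizes follow the text; because the network classifies each time step as voice or no voice, both BiLSTM layers output sequences:

layers = [ ...
    sequenceInputLayer(9)
    bilstmLayer(200,"OutputMode","sequence")
    bilstmLayer(200,"OutputMode","sequence")
    fullyConnectedLayer(2)
    softmaxLayer
    classificationLayer
    ];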

Next, specify the training options for the classifier. Set MaxEpochs to 20 so that the network makes 20 passes through the training data. Set MiniBatchSize to 64 so that the network looks at 64 training signals at a time. Set Plots to "training-progress" to generate plots that show the training progress as the number of iterations increases. Set Verbose to false to disable printing the table output that corresponds to the data shown in the plot. Set Shuffle to "every-epoch" to shuffle the training sequence at the beginning of each epoch. Set LearnRateSchedule to "piecewise" to decrease the learning rate by a specified factor (0.1) every time a certain number of epochs (10) has passed. Set ValidationData to the validation predictors and targets.
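Collecting the options above gives a sketch like the following. The "adam" solver is an assumption (the text does not name a solver), as is the exact shape of the validation data:

options = trainingOptions("adam", ...
    "MaxEpochs",20, ...
    "MiniBatchSize",64, ...
    "Plots","training-progress", ...
    "Verbose",false, ...
    "Shuffle","every-epoch", ...
    "LearnRateSchedule","piecewise", ...
    "LearnRateDropFactor",0.1, ...
    "LearnRateDropPeriod",10, ...
    "ValidationData",{featuresValidation,maskValidation});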