Singing Voice Separation – a study

[This is a re-entry of the Singing Voice Separation task performed at MTG, UPF for the music information retrieval course. Here is the original link to the course blog. This work was done with my amazing classmate Gerard Erruz]

Hello World, we are Manaswi and Gerard. We will maintain this blog as documentation while we develop a Singing Voice Separation algorithm for the Music Information Retrieval course at MTG, UPF. The goal is to build an algorithm that can apply transformations to the vocals of a given song.

Goal and description

Singing voice separation (SVS) techniques aim to extract the voice contribution out of an already mixed audio file. This separation is content-based and blindly performed: no extra information is used apart from the original audio file. SVS has many direct applications, such as lyrics recognition and alignment, singer identification, karaoke track creation, and other music information retrieval tasks.

Input/output

The input to the algorithm is an audio file containing a mix that includes vocals. An optional second argument specifies the output directory.

As indicated in the MIREX archive, and for our use case, the audio file should follow these specifications:

16-bit

monaural

44.1kHz sample rate

30 seconds long

As output, our algorithm will deliver a new audio file (with the same specifications as above) containing the extracted vocal track with minimal background noise.
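As a quick sanity check on the input side, the specifications listed above can be verified with Python's standard `wave` module before a clip is handed to the separation algorithm. This is only a minimal sketch: the function name `check_clip` and the half-second duration tolerance are our own assumptions and are not part of the MIREX specification.

```python
import wave


def check_clip(path, expected_rate=44100, expected_seconds=30.0, tolerance=0.5):
    """Return a list of problems found in a WAV file; an empty list means it conforms.

    The name `check_clip` and the `tolerance` value are illustrative assumptions.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        # Sample width is reported in bytes, so 16-bit audio has width 2.
        if wav.getsampwidth() != 2:
            problems.append("expected 16-bit samples, got %d-bit" % (8 * wav.getsampwidth()))
        if wav.getnchannels() != 1:
            problems.append("expected mono, got %d channels" % wav.getnchannels())
        if wav.getframerate() != expected_rate:
            problems.append("expected %d Hz, got %d Hz" % (expected_rate, wav.getframerate()))
        duration = wav.getnframes() / wav.getframerate()
        if abs(duration - expected_seconds) > tolerance:
            problems.append("expected about %.0f s, got %.2f s" % (expected_seconds, duration))
    return problems
```

Calling `check_clip("some_clip.wav")` on a conforming file should return an empty list; otherwise each entry describes one mismatch.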

Dataset

The following dataset will be used for the unsupervised blind singing voice separation task and its evaluation:

MIR-1K – 1000 song clips with the singing voice and the accompaniment recorded on separate channels of a stereo file

Common evaluation measures

Proposed measures for singing voice separation performance can be taken from [1]. These are:

Source to Distortion Ratio (SDR)

Source to Interferences Ratio (SIR)

Sources to Artifacts Ratio (SAR)

These will be computed with the BSS Eval version 3.0 implementation [3]. MIREX also scores each system with a normalized mean of each evaluation measure (SDR, SIR, SAR) over 100 audio clips, which we can use to compare our algorithm against other submissions for Singing Voice Extraction.

We can also make a preliminary assessment of how well our separated singing voice captures the main melody by computing precision and recall (or the F-measure) against the pitch-label ground truth in our dataset.
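To make the precision/recall idea concrete, here is a rough sketch of a frame-level comparison between reference pitch labels and a pitch track estimated from the separated voice. Several things here are our own assumptions rather than a fixed protocol: the function name `pitch_prf`, the convention that 0 Hz marks an unvoiced frame, and the 50-cent tolerance for counting a frame as correct.

```python
import numpy as np


def pitch_prf(ref_hz, est_hz, tol_cents=50.0):
    """Frame-level precision, recall and F-measure for a pitch track.

    Assumptions (illustrative only): 0 Hz means "unvoiced", and a frame is a
    hit when both tracks are voiced and differ by at most `tol_cents`.
    """
    ref = np.asarray(ref_hz, dtype=float)
    est = np.asarray(est_hz, dtype=float)
    ref_voiced = ref > 0
    est_voiced = est > 0
    both = ref_voiced & est_voiced

    # Pitch distance in cents, only defined where both tracks are voiced.
    cents = np.full(ref.shape, np.inf)
    cents[both] = 1200.0 * np.abs(np.log2(est[both] / ref[both]))
    hits = np.sum(cents <= tol_cents)

    precision = hits / est_voiced.sum() if est_voiced.any() else 0.0
    recall = hits / ref_voiced.sum() if ref_voiced.any() else 0.0
    total = precision + recall
    f_measure = 2.0 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_measure
```

Precision then tells us how many of the frames we call voiced carry the right pitch, while recall tells us how many of the truly voiced frames we recovered.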

Hi all! This week we are searching for and collecting useful datasets for our task. We will be working on voice source separation in already mixed songs. The main idea is to train the system with separated sources and then evaluate its separation performance when it receives a mixed production (from the same dataset). We didn’t find any submission in previous MIREX editions. Good measures for evaluation are the Source to Distortion Ratio, Source to Interference Ratio and Source to Artifacts Ratio.

For this, we found the following datasets:

iKala

This dataset contains 252 30-second excerpts from pop songs. The recordings were made by six different singers, which increases the variability of the data. Pitch labels and lyrics with their timestamps are also provided. As a drawback, the mixes are not professionally produced.

This dataset contains 100 tracks of professionally mixed songs. Each track consists of its professional mix together with the individual sources, provided separately. Besides the advantage of professional mixing, this dataset also covers a variety of styles.

This dataset contains 70 tracks with vocals present in the mix. Each track is professionally mixed, and the dataset provides individually processed stems and raw audio for each track. It also covers multiple genres and has additional annotations for genre, F0 melody timestamps and instrument activations.

Hi! This week we searched for previous literature on SVS and the current state of the art. The Singing Voice Separation task has only been part of MIREX since 2014. Prior to 2014, singing voice separation systems could be classified into two categories:

A major limitation of such NMF-based approaches is that each source is characterized by a single stationary spectral basis (a column vector of the ICA/NMF decomposition) whose gain (activation) varies with time. As a result, such a model cannot separate non-stationary signals, whose spectral shape itself changes over time. One possible solution is to extract feature vectors (MFCC, PLP, LPC), feed them to a voiced/unvoiced classifier, and use its decisions to group all the distributed components that belong to the vocal source.
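To illustrate the "fixed spectral basis, time-varying gain" structure discussed above, here is a minimal NMF sketch using the classic multiplicative updates for the Euclidean cost. It is not the code of any of the surveyed systems; the function name `nmf` and its parameters are our own choices for illustration.

```python
import numpy as np


def nmf(V, n_components, n_iter=500, eps=1e-9, seed=0):
    """Factorize a non-negative spectrogram V (freq x frames) as V ~ W @ H.

    Each column of W is one stationary spectral basis; the matching row of H
    is its gain over time. Illustrative sketch, not a tuned implementation.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, n_components)) + eps
    H = rng.random((n_components, n_frames)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative at every step.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Because each component can only scale one fixed column of `W` up and down, a source whose spectrum changes shape over time (like a singing voice gliding between notes) ends up spread over several components, which is exactly the grouping problem mentioned above.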

GMMs are learnt for each expected source from an example set (a general source) with statistical properties similar to those of the expected source. This is motivated by the fact that audio sources generally overlap only weakly in the time-frequency domain, which allows masks to separate them. The method then uses MAP adaptation to adapt its general source models so that they better represent the sources in a given mix. The model parameters are estimated from the mix and therefore account for acoustic data that is missing because of the mixing. But since the models are built on a vocal source and a non-vocal source, the idea is limited: the songs must contain non-vocal passages of significant length to obtain an accurate adapted model, and the non-vocal source should be statistically similar throughout the song (similar to the stationary-source problem mentioned for NMF).

The labeled vocal segments are used to identify the predominant pitch. The detected pitch contours are then used for singing voice separation by grouping T-F units according to their harmonicity. A T-F unit is labeled singing-voice dominant if its local periodicity matches the predominant pitch. The singing voice is then resynthesized from the labeled T-F units, unlike the masks used in the previous two cases. [G. Hu and D. L. Wang, “Monaural speech segregation based on pitch tracking and amplitude modulation,” IEEE Trans. Neural Netw., vol. 15, no. 5, pp. 1135–1150, Sep. 2004.]

This was used in 2008 to give state-of-the-art results for main melody extraction, using a source-filter model that explicitly represents the singing voice. In this algorithm, time dependency is not taken into account and each frame is evaluated independently of the others.

Here they discuss how this source-filter model can be extended to perform source separation. They use the source-filter model to obtain the predominant pitch, as in the previous pitch inference methods, and model the power spectral density of each source. The pitch information is used to re-estimate the parameters of the sources, which are ultimately separated using Wiener filters. One limitation of this method is that accompaniment playing the predominant pitch is identified as the solo track, since no timbre coherence is assumed; the identified solo track can therefore be played by a succession of different instruments. The lack of a model for the unvoiced parts and the accompaniment makes it hard to separate the singing voice consistently. Still, this was an advance over the previous unsupervised sinusoidal models for source separation.
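For readers unfamiliar with the final Wiener filtering step, the sketch below shows the general idea: once a power spectral density (PSD) has been estimated for each source, every time-frequency bin of the mixture STFT is shared out in proportion to those PSDs. The PSD estimation itself (the source-filter model) is not shown, and the function name `wiener_separate` is our own.

```python
import numpy as np


def wiener_separate(mix_stft, psd_voice, psd_accomp, eps=1e-12):
    """Split a complex mixture STFT into voice and accompaniment estimates.

    `psd_voice` and `psd_accomp` are non-negative arrays of the same shape as
    `mix_stft`; how they are obtained is outside the scope of this sketch.
    """
    # Wiener gain: fraction of each bin's power attributed to the voice.
    gain_voice = psd_voice / (psd_voice + psd_accomp + eps)
    voice = gain_voice * mix_stft
    accomp = (1.0 - gain_voice) * mix_stft
    return voice, accomp
```

Note that the two estimates always add back up to the mixture, and that the mixture phase is reused for both sources.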

Autocorrelation Based Method – Z. Rafii and B. Pardo, “A simple music/voice separation method based on the extraction of the repeating musical structure,” in ICASSP, May 2011, pp. 221–224.

This algorithm takes advantage of the fact that music is repetitive: it estimates the period of the repeating structure and segments the signal at the period boundaries. The segments are averaged to create a repeating segment model, which is used as a binary mask to perform source separation. The repetitions are initially identified using autocorrelation, i.e. by comparing the signal with successively lagged versions of itself. Unlike previous approaches, this method doesn’t require any prior training or complex frameworks, but it is also limited in that regard. More knowledge, better pattern extraction and softer masks are required for better separation when run on the MIR-1K dataset.
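The steps described above can be sketched roughly as follows, working on a magnitude spectrogram. This is a simplified illustration under our own assumptions and not the published implementation: the function names, the use of a plain (biased) autocorrelation averaged over frequency, and the 0.5 threshold for binarizing the mask are all choices made for this example.

```python
import numpy as np


def estimate_period(mag, min_lag=1):
    """Estimate the repeating period (in frames) of a magnitude spectrogram."""
    power = mag ** 2
    n_frames = power.shape[1]
    max_lag = n_frames // 2
    beat = np.zeros(max_lag + 1)
    for lag in range(max_lag + 1):
        # Autocorrelation of every frequency row at this lag, averaged over rows.
        beat[lag] = np.mean(np.sum(power[:, : n_frames - lag] * power[:, lag:], axis=1))
    # Skip lag 0 (always the maximum) and pick the strongest remaining lag.
    return min_lag + int(np.argmax(beat[min_lag:]))


def repeating_mask(mag, period, threshold=0.5, eps=1e-12):
    """Binary mask that is True where a bin is explained by the repeating model."""
    n_freq, n_frames = mag.shape
    n_seg = n_frames // period
    # Cut the spectrogram into whole periods and average them into one model.
    segments = mag[:, : n_seg * period].reshape(n_freq, n_seg, period)
    model = segments.mean(axis=1)
    reps = int(np.ceil(n_frames / period))
    tiled = np.tile(model, (1, reps))[:, :n_frames]
    # The repeating part cannot contain more energy than the mixture itself.
    repeating = np.minimum(tiled, mag)
    soft = repeating / (mag + eps)
    return soft > threshold
```

Multiplying the mixture spectrogram by the mask gives the repeating background estimate, and multiplying by its complement (`~mask`) gives the voice estimate.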

State of the Art

MIREX (2014-16)

Since its introduction as a MIREX task in 2014, there has been a lot more activity in the field of singing voice separation. Below, we summarize the best submissions for the MIREX task by year, and give an overview of the other types of approaches.

Po-Sen Huang et al. previously showed the use of robust principal component analysis (RPCA) for singing voice separation [Huang, Po-Sen, et al. “Singing-voice separation from monaural recordings using robust principal component analysis.” Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012.]. They argue that the singing voice is sparser and varies more than the accompaniment, which can be exploited by using RPCA to solve for the underlying low-rank (accompaniment) and sparse (voice) components.

In 2014 Po-Sen Huang submitted a deep recurrent neural network (DRNN) for singing voice separation. The recurrent connections give the model a memory of previous time steps, and the network is jointly trained over all types of sources in order to learn a <source> vs <other> separation. Unlike their previous RPCA approach (the most common type of submission for this task), the deep learning approach learns soft masks regardless of the source and could achieve better voice separation (for non-stationary and overlapping mixes). They improved training through data augmentation: circularly shifting the singing voice source of a mixture yields additional, more varied mixes.

This submission achieved the best Voice Global Normalized Signal-to-Distortion Ratio (Voice GNSDR), together with a short runtime (2 hours).

They note that estimating the F0 contour and extracting the singing voice are interdependent, and build a new extraction algorithm that combines the two. The main idea is to alternate between the two tasks and let each feed the other: first, an extraction step is performed using robust principal component analysis (RPCA). From this extraction, an F0 contour is estimated using sub-harmonic summation (SHS), and the F0 is extracted according to salience conditions.
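The circular-shift augmentation mentioned above is easy to picture in code. The sketch below is our own illustration of the idea, not the authors' training pipeline; the function name `circular_shift_mixtures` and the choice of shift amounts are assumptions.

```python
import numpy as np


def circular_shift_mixtures(voice, accomp, shifts):
    """Create extra training triples by circularly shifting the voice.

    `voice` and `accomp` are 1-D sample arrays of equal length; `shifts` is an
    iterable of shift amounts in samples. Returns (mixture, voice, accomp).
    """
    examples = []
    for shift in shifts:
        # np.roll wraps samples around the end, so no audio is lost.
        shifted_voice = np.roll(voice, shift)
        examples.append((shifted_voice + accomp, shifted_voice, accomp))
    return examples
```

Each shift pairs the same vocal line with a different part of the accompaniment, so one clip yields many distinct mixtures whose ground-truth sources are still known exactly.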

2016

Pritish Chandna, Marius Miron, Jordi Janer, and Emilia Gomez, Monoaural Audio Source Separation Using Deep Convolutional Neural Networks, to appear
This submission is the first to use convolutional neural networks in source separation tasks. The proposed algorithm is not aimed specifically at singing voice separation but at more general sound separation. They use a convolutional autoencoder to estimate time-frequency soft masks. A segment of duration T is taken from the mixed signal and its STFT is computed. The magnitude spectrogram is passed through the CNN, which outputs an estimate for each separated source. Soft masks are computed from these estimates and then applied to the original mixture to obtain the final extracted magnitude spectrograms.
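The last step, turning per-source network outputs into soft masks and applying them to the mixture, might look roughly like the sketch below. The network itself is not shown, so `estimates` is just a stand-in array; the function name `apply_soft_masks` and the exact normalization are our assumptions about one common way to do this.

```python
import numpy as np


def apply_soft_masks(mix_mag, estimates, eps=1e-12):
    """Refine per-source magnitude estimates with time-frequency soft masks.

    mix_mag:   (freq, frames) magnitude spectrogram of the mixture.
    estimates: (n_sources, freq, frames) raw magnitude estimates, for example
               the outputs of a separation network (stand-in values here).
    """
    estimates = np.abs(np.asarray(estimates, dtype=float))
    # Each mask is that source's share of the total estimated magnitude.
    masks = estimates / (estimates.sum(axis=0, keepdims=True) + eps)
    # Applying the masks to the mixture makes the sources sum to the mixture.
    return masks * mix_mag[np.newaxis, :, :]
```

The benefit of masking over using the raw estimates directly is that the separated spectrograms are guaranteed to be consistent with the original mixture.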

Selected Approaches

After our literature survey, we have selected two approaches to investigate, which together cover different families of source separation methods. Here is a summary of the approaches and their online implementation sources.

1. Flexible Audio Source Separation ToolBox:

This toolbox was created to provide a general audio source separation framework based on a library of structured source models that enable the incorporation of prior knowledge about each source via user-specifiable constraints.
A. Ozerov, E. Vincent, and F. Bimbot, A general flexible framework for the handling of prior information in audio source separation, IEEE Transactions on Audio, Speech, and Language Processing, Vol. 20 (4), pp. 1118-1133 (2012).

PyFASST is a Python implementation of the above-mentioned toolbox and is available at PyFASST github

This toolbox provides several audio model classes, such as two-source models with instantaneous mixing parameters and an NMF model on the spectral parameters.

2. DeepConvSep :

This code is meant for source separation of multiple instrumental contributions, including singing voice. It is the first approach to apply Convolutional Neural Networks (CNN) to the sound separation task, and it showed good results in the MIREX 2016 Singing Voice Separation task. It estimates time-frequency soft masks using a convolutional autoencoder composed of various parametric layers (timbre layer, max pool layer, temporal layer) and a fully connected layer (ReLU).

In the singing voice separation task, the output is a set of estimated voiced and background sources. This is compared with the actual singing voice source to determine accuracy of the algorithms.

MIREX defines this using the following decomposition for each sound source

s_estimated(t) = s_target(t) + e_interf(t) + e_noise(t) + e_artif(t)

s_target is an allowed deformation of the target source s_i(t)
e_interf is an allowed deformation because of interference of the unwanted sources
e_noise is an allowed deformation of the perturbation noise (not from the sources)
e_artif corresponds to artifacts of the separation algorithm

The Source to Distortion Ratio (SDR), Source to Interferences Ratio (SIR) and Sources to Artifacts Ratio (SAR) are then calculated from the components of this decomposition.
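Given the four components of the decomposition above, the three ratios are energy ratios expressed in decibels. The sketch below follows the standard BSS Eval definitions, but it assumes the components are already available; the toolbox computes them itself through projections, which is not reproduced here. The function names are our own.

```python
import numpy as np


def _energy_ratio_db(numerator, denominator):
    """10 * log10 of the energy ratio between two signals."""
    return 10.0 * np.log10(np.sum(numerator ** 2) / np.sum(denominator ** 2))


def bss_ratios(s_target, e_interf, e_noise, e_artif):
    """Return (SDR, SIR, SAR) in dB from the decomposition of one estimate."""
    # SDR: target against every kind of error combined.
    sdr = _energy_ratio_db(s_target, e_interf + e_noise + e_artif)
    # SIR: target against leakage from the other sources only.
    sir = _energy_ratio_db(s_target, e_interf)
    # SAR: everything except artifacts against the artifacts.
    sar = _energy_ratio_db(s_target + e_interf + e_noise, e_artif)
    return sdr, sir, sar
```

Higher values are better for all three. A system can score a high SIR (little accompaniment leaking into the voice) while still having a low SAR if the separation itself introduces audible artifacts.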

Local and Global SDR can be calculated for identifying consistency of separation across a dynamic range of the mixture.

A global normalized SDR and SIR are used for comparing different algorithms at the MIREX competition.

BSS_Eval is a toolbox created for the purpose of evaluating SDR, GNSDR, SIR, etc. SiSEC proposed this MATLAB toolbox for evaluation; it can be found at this link.
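As we understand it from the literature, the normalized SDR (NSDR) of a clip is the SDR of the separated voice minus the SDR obtained by using the unprocessed mixture as the estimate, and the global value (GNSDR) is the average of the NSDR over all clips, weighted by clip length. The sketch below assumes the per-clip SDR values have already been computed; the function name `gnsdr` is our own.

```python
import numpy as np


def gnsdr(sdr_estimates, sdr_mixtures, lengths):
    """Length-weighted mean of per-clip normalized SDR values (in dB).

    sdr_estimates: SDR of each separated voice against its true voice.
    sdr_mixtures:  SDR of each raw mixture against the same true voice.
    lengths:       clip lengths (samples or seconds) used as weights.
    """
    sdr_estimates = np.asarray(sdr_estimates, dtype=float)
    sdr_mixtures = np.asarray(sdr_mixtures, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    # NSDR measures the improvement over doing nothing at all.
    nsdr = sdr_estimates - sdr_mixtures
    return float(np.sum(lengths * nsdr) / np.sum(lengths))
```

For example, `gnsdr([5.0, 7.0], [1.0, 2.0], [10, 30])` weights the improvements of 4 dB and 5 dB by the clip lengths and returns 4.75.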