How SoundCloud Recommends and Tags Music

Last Wednesday, Data Science Society and its friends delved deeper into the field of information retrieval. The topic this time was how to extract information from music, and the Faculty of Mathematics and Informatics at Sofia University continued its tradition of hosting our machine-learning-oriented events. The venue is significant to our speaker as well: Petko Nikolov graduated in Informatics from the same faculty. His interest in machine learning developed during his MSc studies in AI at the University of Edinburgh. He had a chance to apply what he learned in practice at SoundCloud and later at Leanplum and HyperScience.

Petko shared with us his experience with music information retrieval (MIR), acquired at SoundCloud. Similar algorithms are employed at Spotify, Pandora, Shazam and SoundHound. The most basic form of music retrieval extracts musical notes from the audio signal. The digital audio signal itself is a sequence of numbers sampled from the electrical voltage that represents the sound waves. Petko explained that the first step in MIR is to segment the signal into frames of equal length, typically 52 ms each. Between each pair of adjacent non-overlapping frames, an overlapping frame is inserted that covers the second half of the previous frame and the first half of the next one, so consecutive frames overlap by 50%. Each frame is then converted from information over time to information about frequencies. This is achieved with the Discrete Fourier Transform (DFT).

With the data in this convenient form, one would like to capture musical characteristics such as timbre, tempo, rhythm and acoustics at a local level. One way to approximate them is to extract statistical properties of each frame's spectrum: for example, the center of mass of the spectrum, the slope coefficient of a linear regression fitted to the spectrum, or the spectral correlation between two consecutive frames, which is useful for distinguishing slow music, such as classical, from fast music, such as rock. These features (variables) are the cornerstone of the models applied in music information retrieval.
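
To make the framing and feature steps concrete, here is a minimal Python sketch using NumPy. It is not SoundCloud's code: the function names, the 8 kHz sample rate, the Hann window and the synthetic 440 Hz test tone are all assumptions chosen for illustration. Only the 52 ms frame length and the half-frame overlap come from the talk. Of the features mentioned, it computes only the center of mass of the spectrum.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=52.0):
    """Split a 1-D signal into equal-length frames that overlap by 50%."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop = frame_len // 2  # advancing by half a frame gives the 50% overlap
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])

def spectral_centroid(frames, sample_rate):
    """Center of mass (in Hz) of each frame's magnitude spectrum."""
    window = np.hanning(frames.shape[1])  # assumed window; reduces spectral leakage
    magnitudes = np.abs(np.fft.rfft(frames * window, axis=1))  # DFT of every frame
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sample_rate)
    total = np.maximum(magnitudes.sum(axis=1), 1e-12)  # guard against silent frames
    return (magnitudes * freqs).sum(axis=1) / total

# Hypothetical input: one second of a pure 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)

frames = frame_signal(tone, sample_rate)
centroids = spectral_centroid(frames, sample_rate)
```

For a pure tone the centroid of every frame should sit close to the tone's frequency. For real music it moves from frame to frame, and that per-frame variation is what the later models consume.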

From this point on, you can apply most machine learning algorithms for classification: for example neural networks, support vector machines or random forests, the last of which is used at SoundCloud. Petko introduced a less widely known concept, the Gaussian Mixture Model (GMM). A mixture model represents the presence of subpopulations within an overall population; in our context, these are styles of music. Each subpopulation has its own probability density and, as the name suggests, in a GMM these densities are modelled as Gaussians. The model's parameters are fitted to build a representation of a track by maximizing the likelihood that its frames were generated from the model's distribution. In the MIR field the model is also known as a Universal Background Model.
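
The likelihood maximization behind a GMM is usually done with the expectation-maximization (EM) algorithm. The sketch below assumes a single hypothetical one-dimensional feature and two synthetic "styles"; real systems work with multi-dimensional feature vectors per frame. The helper `fit_gmm_1d` is an illustrative name, not a library function, and the talk did not specify how SoundCloud fits its models.

```python
import numpy as np

def fit_gmm_1d(x, n_components=2, n_iter=100):
    """Fit a 1-D Gaussian mixture with plain EM; returns (weights, means, variances)."""
    # Deterministic start: place the initial means at evenly spaced data quantiles.
    means = np.quantile(x, (np.arange(n_components) + 0.5) / n_components)
    variances = np.full(n_components, x.var())
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        diff = x[:, None] - means[None, :]
        log_pdf = -0.5 * (np.log(2 * np.pi * variances) + diff ** 2 / variances)
        log_joint = np.log(weights) + log_pdf
        log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_joint)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the responsibilities.
        counts = resp.sum(axis=0)
        weights = counts / len(x)
        means = (resp * x[:, None]).sum(axis=0) / counts
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / counts + 1e-6
    return weights, means, variances

# Hypothetical data: a feature drawn from two subpopulations centred at -3 and 4.
rng = np.random.default_rng(0)
features = np.concatenate([rng.normal(-3.0, 1.0, 500), rng.normal(4.0, 1.0, 500)])
weights, means, variances = fit_gmm_1d(features)
```

With subpopulations this well separated, the fitted means should land near the two true centres and the weights near one half each. Overlapping styles are harder, and the result then depends more on the initialization.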

Petko introduced another cutting-edge methodology: Deep Learning. It builds on neural networks but uses many more hidden layers, and the input is kept as raw as possible. Standard training with backpropagation does not work well in such architectures because the gradient vanishes quickly as it is propagated back through the layers. The first step is to derive the so-called mel-spectrum by aggregating the frequencies on a roughly logarithmic scale that corresponds to human perception of sound. Then a deep belief network is employed: a form of unsupervised learning that tries to find the values of the neurons' weights that most closely reproduce the input data (the mel-spectrum). A variant of deep belief networks is the deep autoencoder, whose output is not the music style but the mel-spectrum itself. Deep autoencoders are used for denoising in situations where the signal-to-noise ratio is too low.
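
As an illustration of the mel scale, the sketch below uses one common conversion formula, mel = 2595 · log10(1 + f / 700). The talk did not say which variant SoundCloud uses, so the formula, the function names, the 8 bands and the 0 to 8000 Hz range are all assumptions. The code shows how band edges that are evenly spaced in mel grow wider in Hz as frequency rises.

```python
import math

def hz_to_mel(freq_hz):
    """Convert Hz to mel using a common convention (an assumption, see above)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_bands, f_min, f_max):
    """Edges (in Hz) of n_bands bands that are equally wide on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / n_bands
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 1)]

# Hypothetical setup: 8 bands covering 0 to 8000 Hz.
edges = mel_band_edges(n_bands=8, f_min=0.0, f_max=8000.0)
widths = [high - low for low, high in zip(edges, edges[1:])]
```

Summing the DFT magnitudes that fall inside each band (often with overlapping triangular weights) yields one mel-spectrum value per band. Low frequencies, where hearing is most discriminating, get narrow bands, and high frequencies get wide ones.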

During the lively Q&A session after the presentation, Petko discussed how overfitting is tackled in deep learning: by keeping the weights small through regularization, by injecting random noise and by testing the models on a validation set. Our speaker also shared that feature extraction is the most computationally intensive part of music information retrieval: it takes 10 seconds per 4-minute track, and a typical database contains 100 million tracks! His colleagues at SoundCloud use C++ for feature extraction and Python for classification.
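
To put those two figures in perspective, a quick back-of-the-envelope calculation helps. The per-track time and the database size come from the talk. The number of parallel workers is a hypothetical value, and perfect parallel scaling is assumed.

```python
SECONDS_PER_TRACK = 10       # feature extraction time per track, from the talk
N_TRACKS = 100_000_000       # typical database size, from the talk
SECONDS_PER_DAY = 86_400

def extraction_days(n_workers):
    """Wall-clock days to process the whole database, assuming perfect scaling."""
    total_seconds = SECONDS_PER_TRACK * N_TRACKS
    return total_seconds / n_workers / SECONDS_PER_DAY

single_worker_days = extraction_days(1)       # roughly 11,574 days, about 31.7 years
thousand_worker_days = extraction_days(1000)  # roughly 11.6 days (hypothetical cluster)
```

On a single worker the job would take decades, which goes some way to explaining why the extraction step is written in C++ and has to be spread across many machines.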

After so much information in so little space, you certainly have a lot of questions. Some of the answers may be found in the presentation and the audio recording from the event. A link to the video will appear as soon as it is uploaded by our collaborators.