Research Projects

Tunebot

Bryan Pardo, Mark Cartwright, Jinyu Han, David Little, Arefin Huq

Funding

This work is funded by National Science Foundation Grant number IIS-0812314.

What is Tunebot?

Tunebot is a search engine that lets you find the music you're looking for
by singing a bit of it (or entering music notation). In response to your query, it returns a ranked list of songs you can play.
These songs are linked to www.amazon.com, where you can purchase the desired music.

How is this new?

Most commercial music search engines, such as Amazon's and Apple's iTunes, index their
music by metadata, such as song title, composer name or performer name. What happens if
you can't recall this information but still want to find a piece of music? If you already
have the recording and just want to know its name, you can play the recording
to a service like www.shazam.com. Without the metadata
or an example of the recording, none of these services can identify the music. With our search
engine, you can find the music you seek, as long as you can sing a bit of it. Our search engine
compares what you sing to a database of melodies and returns the melodies that best match your
example. You don't need to know the lyrics. You don't have to have a copy of the recording. You
just have to have a voice and a microphone.

iPhone Beta Release

We are also currently developing an iPhone version of Tunebot. The beta release is not yet available, but it will be soon. If you would like to become a beta tester when it's ready and have iPhone 3.0 software, submit a request to interactiveaudiolab [at] gmail.com with the subject "Tunebot iPhone Betatester Request" and include your iPhone or iPod touch unique identifier (NOT the serial number). You can find it by going to the device section in iTunes and clicking on "serial number" to switch it to "Identifier (UDID)". We will then send you instructions on how to install the application on your device. Act quickly, as space is limited! Please note that you must have iPhone 3.0 software and an iPhone or 2nd-generation iPod touch with an external microphone to test the software.

Why contribute a song?

Our search engine doesn't compare your singing to the original recording you seek, because computer systems have difficulty comparing an unaccompanied voice to a recording with multiple concurrent voices and instruments. An unaccompanied solo voice is so different from the typical commercial recording that even a really good rendition won't seem similar to a computer. The solution is to compare unaccompanied voices to other unaccompanied voices. This is where you come in. When you add a song to the Tunebot database, you provide us with an unaccompanied solo version that is linked to the appropriate recording on www.amazon.com. Then, when someone else wants to find this song, we compare their singing to your performance. The more performances people contribute, the more songs become searchable with our system.

How does the system work?

The diagram below shows the workflow of the system. A person hums a tune (1), and the search engine returns a list of songs ranked by similarity to the hummed tune (2). The user can then choose the song that is most similar to the one hummed (3). This provides feedback to the system that pairs the desired song with the sung example. These pairings of recorded queries and the correct corresponding songs are stored in a database (4). We can use those pairings with a genetic algorithm to optimize the parameters of the search engine (5), improving system performance automatically.
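
To make the matching step concrete, below is a minimal sketch of the ranking idea in Python. It is an illustration, not the deployed Tunebot matcher: melodies and queries are assumed to be pitch contours (one MIDI-style pitch value per frame), the database is a plain dictionary, and transposition invariance comes from comparing frame-to-frame pitch differences.

```python
# Minimal sketch of query-by-humming ranking (illustrative, not the
# deployed Tunebot matcher). Queries and database melodies are pitch
# contours; dynamic time warping (DTW) tolerates tempo differences, and
# differencing the pitches makes the match key-invariant.
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)  # normalize by a path-length bound

def rank_songs(query_pitches, database):
    """Return (song, distance) pairs sorted best-first."""
    q = np.diff(np.asarray(query_pitches, dtype=float))
    scored = [(name, dtw_distance(q, np.diff(np.asarray(mel, dtype=float))))
              for name, mel in database.items()]
    return sorted(scored, key=lambda pair: pair[1])
```

The user's confirmed choice (step 3) can then be stored with the recorded query (step 4) as a training pair for the parameter optimization of step 5.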

Related Papers

D. Little, D. Raffensperger, B. Pardo. A Query by Humming System that Learns from Experience. Proceedings of the 8th International Conference on Music Information Retrieval, Vienna, Austria, September 23-27, 2007 (PDF)

M. Cartwright and B. Pardo. Building a Music Search Database Using Human Computation. Proceedings of the 2012 Sound and Music Computing Conference, Copenhagen, Denmark. July 11-14, 2012. (PDF)

Karaoke Callout

Bryan Pardo, Mark Cartwright, Arefin Huq, Matthew Gilk

Funding

This work is funded by National Science Foundation Grant number IIS-0812314.

What is it?

Karaoke Callout is a karaoke game for Apple's iOS platform that allows users to "call out," or challenge, each other to a singing duel.
A person first selects a song from a growing list of music and then sings their own rendition of the song. The system gives them a score
for their performance. The person may then choose to call out a friend (who will then record his or her own rendition of the song),
share their rendition with the world, try again, or choose a new song.

How do I play?

The application is still in development, and we will begin beta testing in mid-August. If you would like to become a beta tester when
it's ready and have a device with iOS 3.0 (or higher) software, submit a request
to interactiveaudiolab@gmail.com with the subject "Karaoke Callout iOS Betatester Request" and
include your iPhone, iPad or iPod touch unique identifier (NOT the serial number). You can find it by going to the device section in iTunes and
clicking on "serial number" to switch it to "Identifier (UDID)". We will then send you instructions on how to install the application on your device.
Please note that if you are using an iPod touch, you need an external microphone to test the software.

What songs does Karaoke Callout know?

The Karaoke database contains a wide variety of songs. If you can't find what you're looking for, contribute your own.

Related Papers

B. Pardo and David A. Shamma. Teaching a Music Search Engine Through Play. In Proceedings of CHI 2007 Workshop on Vocal Interaction in Assistive Technologies and Games (CHI 2007), San Jose, CA, USA, April 29 - May 3, 2007. (PDF)

D. Shamma and B. Pardo. Karaoke Callout: using social and collaborative cell phone networking for new entertainment modalities and data collection. In Proceedings of ACM Multimedia Workshop on Audio and Music Computing for Multimedia (AMCMM 2006), Santa Barbara, CA, USA, October 23-27, 2006. (PDF)

Audio Imputation

Introduction

Audio imputation is the process of re-synthesizing the missing parts of an audio signal so that, after reconstruction, the missing information is seamlessly recovered.

An effective approach to audio imputation could benefit many important applications, such as audio enhancement in telephone systems, speech recognition in noisy environments, sound restoration of historical recordings, and improvement of corrupted audio or of source separation results.

While most previous audio imputation methods are based on non-structured, unconstrained models that do not comply well with the characteristics of audio signals, we consider a more structured, constrained model for audio imputation. This structured, constrained model allows recovery of missing
spectrogram elements with fewer artifacts and greater temporal coherence with the original signal.

Incorporate the temporal information of audio

Non-negative spectrogram factorization refers to a class of methods including non-negative matrix factorization and probabilistic latent component analysis (PLCA), which are used to factorize spectrograms. In this discussion, we will use the specific case of PLCA. However, the ideas generalize to most such methods.
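
As a concrete reference point, here is a minimal sketch of this kind of factorization, written as KL-divergence NMF (equivalent to PLCA up to normalization). The component count and iteration count are illustrative choices.

```python
# Minimal sketch of non-negative spectrogram factorization in the spirit
# of PLCA / KL-NMF: the magnitude spectrogram V (freq x time) is
# approximated as W @ H, where the columns of W form the single learned
# dictionary and H holds per-frame weights. The updates are the standard
# multiplicative rules for the KL divergence.
import numpy as np

def factorize(V, n_components=20, n_iter=100, eps=1e-9):
    n_freq, n_time = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n_freq, n_components)) + eps
    H = rng.random((n_components, n_time)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1) + eps)
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
    return W, H
```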

Audio is non-stationary as the statistics of its spectrum change over time. On the other hand, the statistics of the spectral structure are quite consistent over segments of time. There is a structure to the non-stationarity of audio which we call temporal dynamics.

Two problems with PLCA are that:

It models the audio with a single non-negative dictionary.

It does not account for non-stationarity and temporal dynamics of audio.

A large dictionary learned by PLCA.

Non-negative Hidden Markov Model (N-HMM)

Our proposed method is based on the Non-negative Hidden Markov Model (N-HMM). Compared to previous methods, which mainly make use of spectral structure, the proposed work takes both the spectral and temporal structures of audio into consideration.

N-HMM models the audio with multiple dictionaries such that each time frame is a linear combination of elements from any one dictionary.

A Markov chain models the transitions between dictionaries.

Multiple dictionaries and a Markov chain learned by N-HMM.

An illustration of the proposed audio imputation system using N-HMM (of three states/dictionaries)

Learn an N-HMM from the training data.

Given the corrupted audio, learn the posterior distribution over the states (dictionary) and weights of the spectral components.

The amplitude of the missing frequencies is modeled as a linear combination of the spectral components from the dictionary, as shown below:
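
In code, that reconstruction step might look like the following sketch. It assumes the state (dictionary) and its weights for the frame have already been inferred, as in step 2 above; the variable names and the rescaling against the intact bins are illustrative.

```python
# Minimal sketch of the reconstruction step: for one spectrogram frame,
# observed bins keep their values and missing bins are filled with the
# model's linear combination of dictionary elements, rescaled so the
# model agrees with the intact part of the frame. `W` is the dictionary
# of the state inferred for this frame and `h` its component weights.
import numpy as np

def impute_frame(frame, missing_mask, W, h, eps=1e-9):
    estimate = W @ h                       # model's full-band spectrum
    observed = ~missing_mask
    # Rescale so the model matches the observed bins' total magnitude.
    scale = frame[observed].sum() / (estimate[observed].sum() + eps)
    out = frame.copy()
    out[missing_mask] = scale * estimate[missing_mask]
    return out
```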

Incorporate high-level knowledge

The N-HMM considers the temporal dynamics of audio with a Markov chain. For speech, there is a relationship among different words governed by high-level knowledge; more specifically, speech follows syntactic and semantic rules. We can feed the Markov chain of the N-HMM with this high-level knowledge to constrain the transitions among different temporal parts of the audio.

Experiments

Speech Separation Challenge dataset

We consider 10 speakers with 500 sentences per speaker. For each speaker, we use 450 sentences for training and 50 for testing.

We considered two different conditions.

The first one is to expand the bandwidth of telephony speech signals, referred to as Con-A. The input narrowband signal has a bandwidth of 300 Hz to 3400 Hz, which simulates bandlimited telephone speech signals.

In the second condition, referred to as Con-B, we removed all the frequencies below 1000 Hz.

Audio Source Separation

Jinyu Han, Zafar Rafii, Bryan Pardo

Funding

This work is supported by National Science Foundation Grant number
IIS-0643752.

What is it?

Source separation is the process of separating a set of source signals from a set of mixture signals.
When there is no prior information about the source signals or the mixing process,
the problem is known as Blind Source Separation (BSS).

What is it good for?

BSS would be of great use in many Music Information Retrieval (MIR) tasks,
such as instrument/vocalist identification, music/voice transcription, melody extraction, pitch tracking, etc.
It would also find uses in many other applications,
such as audio post-production, audio remixing, multichannel upmixing, karaoke gaming, hearing aids, etc.

Our approaches

We have developed several methods to deal with the problem of BSS applied to music mixtures.
The approach in this project consists of comparing the left and right channels of a stereo mixture
to estimate the spatial cues of the individual sources.

Assuming that the time-frequency representations of the sources do not overlap too much,
the spatial cues of each source can be estimated
by comparing the left and right channels of a stereo mixture
and then used to partition the mixture via time-frequency masking.
While this non-overlap assumption is generally true for speech mixtures,
it rarely holds for music mixtures.
We show that BSS of music mixtures based on the estimation of the spatial cues
can be improved by taking into account information
about the harmonic structure of the sources,
by iteratively re-estimating the spatial cues,
or by using time-frequency representations based on the Constant Q Transform.

Recent work in blind source separation applied to anechoic mixtures of speech
allows for improved reconstruction of sources that rarely overlap in a time-frequency representation.
While the assumption that speech mixtures do not overlap significantly in time-frequency is reasonable,
music mixtures rarely meet this constraint, requiring new approaches.

We introduce a method that uses spatial cues from anechoic stereo music recordings
and assumptions regarding the structure of musical source signals to effectively separate mixtures of tonal music.
We use existing techniques to create partial source signal estimates from regions of the mixture
where source signals do not overlap significantly.
We use these partial signals within a new demixing framework, in which we estimate harmonic masks for each source,
allowing the determination of the number of active sources in important time-frequency frames of the mixture.
We then propose a method for distributing energy from time-frequency frames of the mixture to multiple source signals.
This allows dealing with mixtures that contain time-frequency frames
in which multiple harmonic sources are active without requiring knowledge of source characteristics.

As sources increasingly overlap in the time-frequency domain or the angle between sources decreases,
the spatial cues used in our methods become unreliable.
We also introduce a method to re-estimate the spatial cues for mixtures of harmonic sources.
The newly estimated spatial cues are fed to the system to update each source estimate and the pitch estimate of each source.
This iterative procedure is repeated until the difference between the current estimate of the spatial cues and the previous one is under a pre-set threshold.
Results on a set of three-source mixtures of musical instruments show this approach significantly improves separation performance of two existing time-frequency masking systems.

Performance results when the angle between instruments is 20 degrees are shown in the figure above.
The median value of separation performance for each method is labeled with an arrow.
Iterative spatial cue refinement improves DUET's median performance by 5.2 dB and ASE's median performance by 2.5 dB.
Furthermore, our proposed system with iterative spatial cue estimation performs as well as the system using ground-truth pitches.

The audio examples listed below are from the experiments on mixtures of horn, oboe and bass clarinet.
The initial estimates are achieved using a method based only on spatial cues, and the final estimates are achieved using our new re-estimation method.

Performance results for mixtures created using different mixing angles are shown in the figure below.
In this figure, each data point indicates an average result over 30 mixtures.
The proposed system (DUET+ITER) consistently outperformed the existing systems (excluding the system using ground-truth pitches)
for nearly all mixing angles above 18 degrees. The iterative spatial cue estimation improves DUET or ASE when the sources are close to each other
(in the figure, this is the case when the mixing angle is between 18 and 30 degrees).

The Degenerate Unmixing Estimation Technique (DUET) is a BSS method that can separate an arbitrary number of unknown sources from a single stereo mixture.
DUET builds a two-dimensional histogram from the amplitude ratio and phase difference between channels,
where each peak indicates a source with peak location corresponding to the mixing parameters associated with that source.

Provided that the time-frequency bins of the sources do not overlap too much - an assumption generally validated by speech mixtures -
DUET identifies the peaks and partitions the time-frequency representation of the mixture by assigning each bin to the source with the closest mixing parameters.
However when time-frequency bins of the sources overlap too much, as often seen in music mixtures when using the Short-Time Fourier Transform,
peaks start to fuse in the 2d histogram, so that DUET cannot perform separation effectively.
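
The histogram-building step can be sketched as follows, under the standard DUET formulation (symmetric attenuation and relative delay per time-frequency bin). STFT parameters, histogram ranges, and the energy weighting are illustrative choices.

```python
# Minimal sketch of the DUET front end: per-bin amplitude-ratio and
# phase-difference estimates from the two channels' STFTs, accumulated
# into the 2-D histogram whose peaks reveal the sources.
import numpy as np
from scipy.signal import stft

def duet_histogram(left, right, sr, n_fft=1024):
    f, _, L = stft(left, fs=sr, nperseg=n_fft)
    _, _, R = stft(right, fs=sr, nperseg=n_fft)
    keep = (np.abs(L) > 1e-12) & (np.abs(R) > 1e-12) & (f[:, None] > 0)
    ratio = R[keep] / L[keep]
    a = np.abs(ratio)
    alpha = a - 1.0 / a                          # symmetric attenuation
    omega = np.broadcast_to(2 * np.pi * f[:, None] / sr, L.shape)[keep]
    delta = -np.angle(ratio) / omega             # relative delay (samples)
    weights = np.abs(L[keep] * R[keep])          # emphasize high-energy bins
    hist, a_edges, d_edges = np.histogram2d(
        alpha, delta, bins=50, range=[[-3, 3], [-3, 3]], weights=weights)
    return hist, a_edges, d_edges
```

Each source then appears as a peak in (attenuation, delay) space, and assigning every bin to the nearest peak yields the time-frequency masks.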

We proposed to improve peak/source separation in DUET by building the 2d histogram from an alternative time-frequency representation based on the Constant Q Transform (CQT).
Unlike the Fourier Transform, the CQT has a logarithmic frequency resolution, mirroring the human auditory system
and matching the geometrically spaced frequencies of the Western music scale,
and is therefore better adapted to music mixtures.
We also proposed other contributions to enhance DUET,
including adaptive boundaries for the 2d histogram to improve peak resolving when sources are spatially too close,
and Wiener filtering to improve source reconstruction.

Experiments on mixtures of piano notes and harmonic sources showed that peak/source separation is overall improved,
especially at low octaves (under 200 Hz) and for small mixing angles (under pi/6 rad).
Experiments on mixtures of female and male speech showed that the use of CQT gives equally good results.

2d histograms of the mixture of the 3 piano notes A2, Bb2 & B2,
built using the classic DUET (left) and DUET combined with the CQT (right).
While the left histogram shows one gross peak because of the poor resolution of the Fourier Transform at low octaves,
the right histogram shows 3 clear peaks thanks to the log frequency resolution of the CQT,
which can resolve peaks for adjacent pitches equally well in low and high octaves.

2d histograms of a mixture of 4 sources (cello 1, cello 2, flute, and strings),
built using the classic DUET (left) and DUET combined with the CQT and adaptive boundaries (right).
While the left histogram shows only 3 peaks,
the right histogram shows 4 peaks, thanks to the use of the CQT
which improves frequency resolution when pitches are too low in frequency (for example here between the two cellos),
and thanks to the use of adaptive boundaries
which improves peak resolution when sources are too close to each other.

We propose an alternate technique for harmonic envelope estimation, based on Harmonic Temporal Envelope Similarity (HTES).
We learn a harmonic envelope model for each instrument from the non-overlapped harmonics of notes of the same instrument,
wherever they occur in the recording. This model is used to reconstruct the harmonic envelopes for overlapped harmonics.
This allows reconstruction of completely overlapped notes. Experiments show our algorithm performs better than an existing system
based on Common Amplitude Modulation when the harmonics of pitched instruments are strongly overlapped.

Illustration of the estimated completely overlapped harmonic envelope from a mixture. Original envelopes of the first harmonic
from nine notes played by clarinet are plotted as solid blue lines. Four notes of 398.4 Hz, 397.7 Hz, 296.6 Hz and 293.3 Hz are completely overlapped
with four bassoon notes of 132.6 Hz, 198.2 Hz, 98.2 Hz and 146.9 Hz (not shown in this figure).

Separation example of a clarinet from a 6.5-second mixture of clarinet and bassoon.

REpeating Pattern Extraction Technique (REPET)

Zafar Rafii and Bryan Pardo.

This work is supported by National Science Foundation Grant numbers
IIS-0643752
and IIS-0812314,
and by the Advanced Cognitive Science Fellowship for Interdisciplinary Research Projects of Northwestern University.

Repetition is a fundamental element in generating and perceiving structure.
In audio, mixtures are often composed of structures
where a repeating background signal is superimposed with a varying foreground signal
(e.g., a singer overlaying varying vocals on a repeating accompaniment
or a varying speech signal mixed with a repeating background noise).
On this basis, we present the REpeating Pattern Extraction Technique (REPET),
a simple approach for separating the repeating background from the non-repeating foreground in an audio mixture.
The basic idea is to find the repeating elements in the mixture,
derive the underlying repeating models, and extract the repeating background by comparing the models to the mixture.
Unlike other separation approaches, REPET does not depend on special parametrizations,
does not rely on complex frameworks, and does not require external information.
Because it is only based on repetition, it has the advantage of being simple, fast, blind,
and therefore completely and easily automatable.

The original REPET aims at identifying and extracting the repeating patterns in an audio mixture,
by estimating a period of the underlying repeating structure
and modeling a segment of the periodically repeating background.

Fig. 1 Overview of the original REPET.
Stage 1: calculation of the beat spectrum b and estimation of a repeating period p.
Stage 2: segmentation of the mixture spectrogram V and calculation of the repeating segment model S.
Stage 3: calculation of the repeating spectrogram model W and derivation of the soft time-frequency mask M.
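
A minimal sketch of the three stages follows, assuming a magnitude spectrogram as input. The period picker here (peak of the averaged autocorrelation) is a simplification of the beat-spectrum analysis of Stage 1.

```python
# Minimal sketch of the original REPET pipeline.
import numpy as np

def repet_mask(V):
    """V: magnitude spectrogram (freq x time). Returns a soft mask for
    the periodically repeating background."""
    n_freq, n_time = V.shape
    # Stage 1: beat spectrum = mean over frequency of each row's
    # autocorrelation; its strongest non-zero lag estimates the period.
    P = V ** 2
    acf = np.array([np.correlate(row, row, mode='full')[n_time - 1:]
                    for row in P]).mean(axis=0)
    period = int(np.argmax(acf[1:n_time // 2]) + 1)
    # Stage 2: median over the periodic segments gives the repeating model.
    n_seg = n_time // period
    segs = V[:, :n_seg * period].reshape(n_freq, n_seg, period)
    model_seg = np.median(segs, axis=1)
    model = np.tile(model_seg, n_seg + 1)[:, :n_time]
    # Stage 3: the model cannot exceed the mixture; normalize to a soft mask.
    model = np.minimum(model, V)
    return model / (V + 1e-12)
```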

Experiments on a data set of song clips showed
that the original REPET can be effectively applied for music/voice separation.
Experiments showed that REPET can also be combined with other methods to improve background/foreground separation;
for example, it can be used as a preprocessor to pitch detection algorithms to improve melody extraction,
or as a postprocessor to a singing voice separation algorithm to improve music/voice separation.

The original REPET can be easily extended to handle varying repeating structures,
by simply applying the method along time, on individual segments or via a sliding window.
Experiments on a data set of full-track real-world songs showed that
this method can be effectively applied for music/voice separation.
Experiments also showed that there is a trade-off for the window size in REPET:
if the window is too long, the repetitions will not be sufficiently stable;
if the window is too short, there will not be sufficient repetitions.

The original REPET works well when the repeating background is relatively stable (e.g., a verse or the chorus in a song);
however, the repeating background can also vary over time (e.g., a verse followed by the chorus in the song).
The adaptive REPET is an extension of the original REPET that can handle varying repeating structures,
by estimating the time-varying repeating periods and extracting the repeating background locally,
without the need for segmentation or windowing.

Experiments on a data set of full-track real-world songs showed that
the adaptive REPET can be effectively applied for music/voice separation.

Fig. 4 Music/voice separation using the adaptive REPET.
The mixture is a male singer (foreground) singing over a guitar and drums accompaniment (background).
The guitar has a repeating chord progression that changes around 15 seconds.
The spectrograms and the mask are shown for 5 seconds and up to 2.5 kHz.
The file is Another Dreamer - The Ones We Love
from the task of professionally produced music recordings
of the Signal Separation Evaluation Campaign (SiSEC).

The REPET methods work well when the repeating background has periodically repeating patterns (e.g., jackhammer noise);
however, the repeating patterns can also happen intermittently or without a global or local periodicity (e.g., frogs by a pond).
REPET-SIM is a generalization of REPET that can also handle non-periodically repeating structures,
by using a similarity matrix to identify the repeating elements.
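
The core of REPET-SIM can be sketched in a few lines; the number of similar frames k is an illustrative parameter.

```python
# Minimal sketch of REPET-SIM: a cosine self-similarity matrix over
# spectrogram frames identifies, for every frame, its most similar frames
# anywhere in the mixture; the element-wise median over those frames gives
# the repeating-background model, which is turned into a soft mask.
import numpy as np

def repet_sim_mask(V, k=10):
    """V: magnitude spectrogram (freq x time). Returns a soft mask for
    the (possibly non-periodically) repeating background."""
    norms = np.linalg.norm(V, axis=0) + 1e-12
    S = (V.T @ V) / np.outer(norms, norms)      # frame similarity matrix
    model = np.empty_like(V)
    for j in range(V.shape[1]):
        similar = np.argsort(S[j])[-k:]         # k most similar frames
        model[:, j] = np.median(V[:, similar], axis=1)
    model = np.minimum(model, V)                # background can't exceed mix
    return model / (V + 1e-12)                  # soft time-frequency mask
```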

Experiments on a data set of full-track real-world songs showed that REPET-SIM can be effectively applied for music/voice separation.

REPET-SIM can be easily implemented online to handle real-time computing, particularly for real-time speech enhancement.
The online REPET-SIM simply processes the time frames of the mixture one after the other
given a buffer that temporarily stores past frames.
Experiments on a data set of two-channel mixtures of one speech source and real-world background noise
showed that the online REPET-SIM can be effectively applied for real-time speech enhancement.

We compared the original REPET (with binary masking)
against the system of Durrieu et al.,
a state-of-the-art music/voice separation method that uses an unconstrained NMF model for the music component and a source/filter model for the voice component,
and includes unvoiced voice estimation.

The Beach Boys data set consists of 14 real-world full-track songs
created from live-in-the-studio recordings released by The Beach Boys,
where some of the accompaniments and vocals were made available as split stereo tracks and separated tracks
(due to copyright issues, we cannot make this internal data set available online).

We compared REPET (with binary masking) against the system of Durrieu et al., including unvoiced voice estimation.
We used the original REPET
and REPET with segmentation (REPET-SEG), with the corresponding best segmentation.
REPET-SEG was successively enhanced with high-pass filtering at 100 Hz on the voice estimates (+H),
and with the use of the repeating period that leads to the best mean SDR between music and voice estimates (+P), for every window.

The BSS Eval toolbox provides a set of measures
that are intended to quantify the quality of the separation between a source and its estimate (in dB):
Signal to Distortion Ratio (SDR), Source to Interference Ratio (SIR), and Sources to Artifacts Ratio (SAR).
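
For reference, BSS Eval decomposes each source estimate as $\hat{s} = s_{\mathrm{target}} + e_{\mathrm{interf}} + e_{\mathrm{noise}} + e_{\mathrm{artif}}$ and, in its standard formulation, defines the measures as energy ratios:

\[
\mathrm{SDR} = 10 \log_{10} \frac{\|s_{\mathrm{target}}\|^2}{\|e_{\mathrm{interf}} + e_{\mathrm{noise}} + e_{\mathrm{artif}}\|^2}, \qquad
\mathrm{SIR} = 10 \log_{10} \frac{\|s_{\mathrm{target}}\|^2}{\|e_{\mathrm{interf}}\|^2}, \qquad
\mathrm{SAR} = 10 \log_{10} \frac{\|s_{\mathrm{target}} + e_{\mathrm{interf}} + e_{\mathrm{noise}}\|^2}{\|e_{\mathrm{artif}}\|^2}.
\]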

2.1. Example 1: song that gives the best separation results for both the system of Durrieu et al. and REPET

(All measures are in dB; runtime is given as minutes'seconds''.)

Method | Music SDR | Music SIR | Music SAR | Voice SDR | Voice SIR | Voice SAR | Runtime
Durrieu | 10.5 | 20.2 | 11.0 | 5.7 | 16.9 | 6.1 | 21'24''
original REPET (no segmentation) | 4.2 | 11.9 | 5.2 | -1.3 | 0.5 | 6.4 | 0'09''
REPET-SEG (best segment = 2.5 s) | 7.9 | 12.7 | 9.9 | 2.0 | 5.2 | 6.0 | 0'35''
REPET-SEG+H | 8.9 | 15.0 | 10.3 | 3.5 | 17.6 | 3.7 | -
REPET-SEG+H+P | 9.3 | 13.0 | 11.8 | 4.1 | 12.5 | 5.0 | -

2.2. Example 2: song that gives the worst separation results for both the system of Durrieu et al. and REPET

In addition to the BSS Eval toolbox, the PEASS toolkit
provides a set of measures that were shown to be better correlated with human assessment of signal quality:
Target-related Perceptual Score (TPS), Interference-related Perceptual Score (IPS), Artifacts-related Perceptual Score (APS),
and Overall Perceptual Score (OPS), which measures the overall quality.

The MIR-1K dataset
consists of 1,000 song clips in the form of split stereo WAVE files sampled at 16 kHz,
with the background and melody components recorded on the left and right channels, respectively.
The song clips were extracted from 110 karaoke Chinese pop songs performed by amateur singers consisting of 8 females and 11 males.
The duration of the clips ranges from 4 to 13 seconds.
We then created a set of 1,000 mixtures by summing,
for each song clip, the left channel (i.e., the background) and the right channel (i.e., the melody) into a monaural mixture.

We compared different background/melody separation methods:
• a rhythm-based method that focuses on extracting the background via a rhythmic mask derived from identifying the repeating time elements in the mixture
(REPET-SIM)
• a pitch-based method that focuses on extracting the melody via a harmonic mask derived from identifying the predominant pitch contour in the mixture (Pitch)
• a parallel combination of REPET-SIM and Pitch (parallel)
• a series combination of REPET-SIM and Pitch (series)
• a method that uses an unconstrained NMF model for the background and a source-filter model for the melody (Durrieu)
• a method that uses Robust Principal Component Analysis (RPCA) to separate the background and the melody (Huang)

We derived three measures from the BSS Eval toolbox, ΔSDR, ΔSIR, and ΔSAR,
by taking the difference between, respectively, the SDR, SIR, and SAR computed using the estimated (soft) masks derived from a given method,
and the SDR, SIR, and SAR computed using the ideal (soft) masks derived from the original sources.

4.1. Example 1: clip that gives the best separation results for the parallel combination

Adaptive User Interfaces

Try the SocialEQ Demo

SocialEQ is a tool to learn the meaning of sound adjectives that relate to equalization. Once the system thinks it understands, it will give you a slider to make the sound more or less like the adjective (for example, more or less "bright"). Try it out.

What is it?

Many musicians think about sound in individualistic terms that may not have known mappings onto the controls of existing audio production tools.
For example, a violinist may want to make the recording of her violin sound "shimmery."
While she has a clear concept of what a "shimmery" sound is,
she may not know how to articulate it in terms that map "shimmery" onto the available audio tools (such as reverberation and equalization).

A typical parametric equalization plug-in. Too difficult!

What is it good for?

Imagine a computational tool that works alongside the musician to quickly learn how acoustic features map onto an audio concept,
and creates a simple controller to manipulate audio in terms of that concept.
In the case of the violin player, the tool would learn what "shimmery" means to her,
and then create a knob that would let her make a sound more or less shimmery.
This allows the creator to quickly realize a concept, bypassing the bottleneck of technical knowledge.
We therefore propose here a user-centered design approach to the development of audio production tools that automatically adapt to the user's work style,
rather than forcing the user to adapt to the tools.
The result will be new technologies that support and enhance human creativity.

The learning process: a sound is modified by a series of equalization curves or reverberation settings
and the listener rates each example as to how well it exemplifies a target audio concept (e.g. "boomy").

Our Approach

In our initial work, we focused on improving equalization and reverberation tools.
We have tested algorithms that rapidly learn a listener's desired equalization curve or reverberation settings.
First, a sound is modified by a series of equalization curves or reverberation settings.
After each modification, the listener indicates how well the current sound exemplifies a target sound descriptor (e.g. "warm" or "boomy").
After rating, a function is computed that models the user's desired equalization curve or reverberation settings.
Listeners report that sounds generated using this function capture their intended meaning of the descriptor.
Machine ratings generated by computing the similarity of a given curve or given settings to the weighting function are highly correlated with listener responses,
and asymptotic performance is reached after only ~25 listener ratings.
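
One simple way to compute such a weighting function is sketched below. This conveys the general idea rather than the published systems' exact details: each frequency band's gain is correlated with the listener's ratings, so bands whose boosts track high ratings get positive weight.

```python
# Sketch of learning a per-band equalization curve from listener ratings
# (illustrative; the published systems differ in detail).
import numpy as np

def learn_eq_curve(probe_curves, ratings):
    """probe_curves: (n_examples, n_bands) gains in dB applied per trial.
    ratings: (n_examples,) listener ratings of how well each trial
    exemplified the target descriptor. Returns per-band weights."""
    G = np.asarray(probe_curves, dtype=float)
    r = np.asarray(ratings, dtype=float)
    G_c = G - G.mean(axis=0)
    r_c = r - r.mean()
    # Pearson correlation of each band's gain with the ratings.
    denom = G_c.std(axis=0) * r_c.std() + 1e-12
    return (G_c * r_c[:, None]).mean(axis=0) / denom
```

Scaling this weight vector up or down then acts as the "more or less like the adjective" slider.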

In subsequent work, we showed that we can further expedite this process by incorporating prior knowledge of users' desired equalization curves. Using just a handful of ratings (~5), we can place the user's new concept in a user-ratings space. Once in this user-ratings space, we can use the ratings from prior user concepts to predict the equalization curve of the new concept using transfer learning. Our experiments have shown this method significantly reduces the number of user ratings needed to learn a desired equalization controller from user feedback.

Current work includes the study of alternate controller paradigms (knobs, sliders, pressure sensors, etc.),
other audio tools (e.g. compression) and graphic tools, and user feedback along multiple dimensions.

Prior user-ratings knowledge that we can use to place a new user concept in the user-ratings space and speed the EQ learning process through transfer learning.

Demo Videos

EQ Learning Max/MSP Patch:
A user is teaching the software the audio concept "tiny".
As the sound is played, different equalization curves are applied and the user rates the sound as to how well the current curve fits her/his concept.
When the confidence value reaches 100%, the user can leave the training mode and enjoy her/his personal "tiny effect" slider.

PCA EQ Max/MSP Patch:
A two-dimensional map for audio descriptors has been built for the EQ Learning tool.
Each corner represents a particular audio descriptor mapped to a learned equalization curve.
The user can easily modify the equalization curve of the current sound by simply moving a point on the 2d map.

Zafar Rafii and Bryan Pardo.
"Learning to Control a Reverberator using Subjective Perceptual Descriptors,"
10th International Society for Music Information Retrieval Conference,
Kobe, Japan, October 26-30, 2009.
[pdf]

Soundprism

What is it?

A prism can separate a ray of white light into multiple rays of different colors in real time. How about for sound? Here we present Soundprism, a computer system that separates single-channel polyphonic music audio played by harmonic sources into source signals in an online fashion. It uses a musical score to guide the separation process.

What is it good for?

There are many situations where Soundprism could be used. Imagine a classical music concert where every audience member could select their favorite personal mix (e.g., switch between enjoying the full performance and concentrating on the cello part) even though the instruments are not given individual microphones. Soundprism could also allow remixing or upmixing of existing monophonic or stereo recordings of classical music, or of live broadcasts of such music. Such a system would also be useful in an offline context, for making music-minus-one applications that let performers play along with existing music recordings.

Our Approach

Soundprism is an online score-informed source separation system. As shown in the figure above, it has two components: a score follower and a source separator.

The score follower performs polyphonic audio-score alignment in an online fashion. We use a hidden Markov process model, where each audio frame is associated with a 2-D state (score
position and tempo). After seeing an audio frame, our current observation, we want to infer its state. We use a multi-pitch observation model, which indicates how likely the current audio
frame is to contain the pitches at a hypothesized score position. The inference of the score position and tempo of the current frame is achieved by particle filtering.
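
A schematic skeleton of that inference loop is shown below. The process-model noise levels and the `observation_likelihood` callback (standing in for the multi-pitch observation model) are illustrative assumptions, not the exact published formulation.

```python
# Schematic particle filter for online score following: each particle is a
# (score position, tempo) hypothesis, advanced by the process model,
# reweighted by the observation model, and resampled every frame.
import numpy as np

def follow_score(frames, observation_likelihood, n_particles=1000,
                 frame_dt=0.01):
    rng = np.random.default_rng(0)
    pos = np.zeros(n_particles)                  # score position (beats)
    tempo = rng.uniform(0.5, 2.0, n_particles)   # beats per second
    for frame in frames:
        # Process model: advance positions; let tempo drift slightly.
        pos = pos + tempo * frame_dt
        tempo = np.abs(tempo + rng.normal(0.0, 0.05, n_particles))
        # Observation model: how well does the frame match each hypothesis?
        w = np.array([observation_likelihood(frame, p) for p in pos]) + 1e-12
        w /= w.sum()
        yield float(np.average(pos, weights=w))  # aligned score position
        # Resample particles in proportion to their weights.
        idx = rng.choice(n_particles, size=n_particles, p=w)
        pos, tempo = pos[idx], tempo[idx]
```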

The source separator performs pitch-based harmonic source separation. Score-informed pitches at the aligned score position are used to guide source separation. These pitches are first refined using our previous multi-pitch estimation algorithm [1], by maximizing the multi-pitch observation likelihood. Then, a harmonic mask in the frequency domain is built for each pitch to extract its source's magnitude spectrum. In building the mask, overlapping harmonics are identified and their energy is distributed in reverse proportion to the square of their harmonic numbers. Finally, the time domain signal of each source is reconstructed by inverse Fourier transform using the source's magnitude spectrum and the phase spectrum of the mixture.
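
The harmonic-masking step can be sketched as follows. The bin-claiming tolerance and the bookkeeping are illustrative simplifications; the key point is that bins claimed by several sources are split in reverse proportion to the square of each source's harmonic number, as described above.

```python
# Sketch of pitch-based harmonic masking for one spectrogram frame.
import numpy as np

def harmonic_masks(f0s, freqs, tol=0.03, n_harmonics=20):
    """f0s: one pitch (Hz) per source for this frame.
    freqs: center frequency (Hz) of each spectrogram bin.
    Returns (n_sources, n_bins) mask weights; each claimed bin's weights
    sum to 1 across sources."""
    claims = np.zeros((len(f0s), len(freqs)))
    for s, f0 in enumerate(f0s):
        for n in range(1, n_harmonics + 1):
            near = np.abs(freqs - n * f0) < tol * n * f0
            # Weight of this source's claim on the bin: 1 / n^2.
            claims[s, near] = np.maximum(claims[s, near], 1.0 / n ** 2)
    total = claims.sum(axis=0, keepdims=True)
    return np.where(total > 0, claims / (total + 1e-12), 0.0)
```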

Results on Synthetic Dataset

We first test Soundprism on a synthetic dataset created in [2]. It contains 20 single-line MIDI melodies made from random note sequences. Each MIDI melody has about 20 notes with a fixed tempo, but is rendered to 11 audio performances with different dynamic tempo curves. Each melody is rendered using a different instrument. We use these monophonic MIDI melodies and their audio renditions to generate polyphonic MIDI scores and corresponding audio performances, with polyphony ranging from 2 to 6. There are in total 120 polyphonic MIDI pieces with corresponding audio renditions. Alignment between MIDI and audio is obtained from the audio rendition process. Although this dataset is not musically meaningful, we use it to test Soundprism on audio mixtures with different polyphonies and tempi, and a large variety of instruments.

The figure above shows boxplot comparisons of source separation results on pieces of polyphony 2 in the dataset. The measures are signal-to-interference ratio (SIR), signal-to-artifacts ratio (SAR), and signal-to-distortion ratio (SDR), which measures both interference and artifacts. Higher values are better. Each box represents 48 data points corresponding to the 48 tracks of 24 duets. The lower and upper lines of each box show the 25th and 75th percentiles of the sample. The line in the middle of each box is the sample median. The lines extending above and below each box show the extent of the rest of the sample, excluding outliers. We compare Soundprism (1, red) with four other methods: Ideally aligned (2, green) uses ideal alignment information to separate the sources; Ganseman10 (3, blue) is a state-of-the-art offline score-informed source separation system [2]; MPES (4, cyan) performs multi-pitch estimation and streaming without using score information [1, 3]; Oracle (5, purple) shows the theoretical upper bound of time-frequency masking based separation results.

Results on Real Music Dataset

We then test Soundprism on the Bach10 dataset. It contains 10 pieces of four-part J.S. Bach chorales played by instruments. We combine the parts to obtain in total 60 duets, 40 trios and 10 quartets of polyphonic music. The dataset also provides the MIDI score and the audio-score alignment.

The figure above shows boxplot comparisons of source separation results on pieces of polyphony 2 in the dataset. Again, we compare Soundprism (1, red) with four other methods: Ideally aligned (2, green), Ganseman10 (3, blue) [2], MPES (4, cyan) [1, 3], and Oracle (5, purple). There are several interesting observations. First, the results of Soundprism and Ideally aligned are very similar on all measures. This suggests that the score following stage of Soundprism performs well on these pieces. Second, the difference between Soundprism/Ideally aligned and Oracle is not that great. This indicates that the simple separation strategy is suitable for the instruments in this dataset. Third, Soundprism obtains a significantly higher SDR and SAR than Ganseman10, but a lower SIR. This indicates that Ganseman10 performs better at removing interference from other sources, while Soundprism introduces fewer artifacts and causes less overall distortion. Finally, the performance gap between MPES and the three score-informed source separation systems is significantly reduced. This means that the multi-pitch tracking results are more reliable on real music pieces than on random note pieces; still, utilizing score information improves source separation results.

The figure above shows results for different polyphonies. We can see that Soundprism and Ideally aligned obtain very similar results at all polyphonies. This suggests that the score following stage performs well enough for the separation task on this dataset. In addition, Soundprism obtains a significantly higher SDR than Ganseman10 at all polyphonies. Furthermore, MPES degrades much faster than the three score-informed separation systems, which again indicates that score information is more helpful in pieces with higher polyphony.

Examples on RWC Music Pieces

We show source separation results on two pieces of music extracted from the RWC Music Database. These two pieces were recorded in realistic acoustic environments; reverberation and equalization effects may have been added, and the source signals are not accessible. One can hear that Soundprism significantly outperforms Ganseman10 [2], a state-of-the-art offline score-informed source separation method, and MPES [1, 3], a multi-pitch based source separation method that does not utilize score information.

Score Alignment

Bryan Pardo, Zhiyao Duan

Funding

This work was funded by National Science Foundation Grant number IIS-0643752.

What is it?

Score alignment involves finding the best alignment between an audio performance and the events in a machine-readable music score. Score alignment can be addressed offline or online. An offline algorithm can use the whole performance of a music piece. The online version (also called score following) cannot "look ahead" at future performance events when aligning the current event to the score.

What is it good for?

Many tasks in music analysis and production would be greatly aided by reliable alignment of a score to the acoustic recording. Editing of audio tracks often requires the location of a particular section or phrase in the music to be tweaked in some way (tuning, equalization, adding an audio effect). Score alignment would allow selection of the desired audio by selecting the measure or passage on the musical score. The audio would be searchable by melodic riff or hook. In entertainment, on-line score alignment allows for interactive automated musical accompaniment (Karaoke that follows you, instead of the other way around). For music education, a system able to align a score to an acoustic performance could indicate to a music student where the performance deviates from indicated score markings ("the score says to play softer here" or "you sped up when it said to slow down").

Our Approaches

1. Aligning semi-improvised music audio with its lead sheet

Existing audio-score alignment methods assume that the audio performance is faithful to a fully-notated MIDI score. For semi-improvised music (e.g. jazz), this assumption is strongly violated. In this work, we address the problem of aligning semi-improvised music audio with a lead sheet. Our approach does not require prior training on performances of the lead sheet to be aligned. We start by analyzing the problem and propose to represent the lead sheet as a MIDI file together with a structural information file. Then we propose a dynamic-programming-based system to align the chromagram representations of the audio performance and the MIDI score. Techniques are proposed to address the problems of chromagram scaling, key transposition and structural change (e.g. a performer unexpectedly repeats a section). We test our system on 3 jazz lead sheets. For each sheet we align a set of solo piano performances and a set of full-band commercial recordings with different instrumentations
and styles. Results show that our system achieves promising results on some highly improvised music.

2. A state space model for online polyphonic audio-score alignment

We present a novel online audio-score alignment approach for multi-instrument polyphonic music. This approach uses a 2-dimensional state vector to model the underlying score position and tempo of each time frame of the audio performance. The process model is defined by dynamic equations to transition between states. Two representations of the observed audio frame are proposed, resulting in two observation models: one multi-pitch-based and one chroma-based. Particle filtering is used to infer the hidden states from observations. Experiments on 150 music pieces with polyphony from one to four show the proposed approach outperforms an existing offline score alignment approach based on global string alignment. Results also show that the multi-pitch-based observation model works better than the chroma-based one.

3. An HMM model for following music with structural changes

This combination of surface-level variation (block chords vs. arpeggios) and structural variation (playing the interlude vs. skipping the interlude) presents a problem for the traditional score-matching paradigm: the score elements will be far fewer than, and perhaps significantly different from, the transcribed performance elements. Therefore, a score may not be representable as a single sequence of notes, and a (nearly) one-to-one mapping between a score sequence and a performance sequence is not possible, violating a basic assumption of existing score alignment techniques.

A Markov model describes a process that goes through a sequence of discrete states, such as notes or chords in a lead sheet. Markov models are generative. A generative model describes an underlying structure able to emit a sequence of observed events. A musical score may be represented as a (hidden) Markov model. The directed graph in the figure below shows a Markov model created from the notes in the written score above it.

We use Markov models, automatically generated from scores, whose emission probabilities can be used to represent chordal or melodic material. Directed edges (arrows) represent transitions. Transition probabilities are indicated by a value associated with each edge. Score following is typically done with a Markov model by first training the model on a set of performances, to tune the transition and emission probabilities. Then, when a new sequence of events (a performance transcription) is presented, the Viterbi algorithm is used to determine the most likely sequence of states in the model to generate the performance. This sequence of states is deemed to be the path through the model and thus the path through the score used to generate the model.
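
For reference, the decoding step can be sketched as standard Viterbi over log probabilities; the transition and emission matrices here are assumed to have been built from the score and tuned in training, as described above.

```python
# Standard Viterbi decoding over a score-derived Markov model: recover the
# most likely path through the score states given the performance events.
import numpy as np

def viterbi(log_trans, log_emit, log_init):
    """log_trans: (n_states, n_states); log_emit: (n_events, n_states);
    log_init: (n_states,). Returns the most likely state sequence."""
    n_events, n_states = log_emit.shape
    delta = log_init + log_emit[0]
    back = np.zeros((n_events, n_states), dtype=int)
    for t in range(1, n_events):
        scores = delta[:, None] + log_trans    # best way into each state
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(n_states)] + log_emit[t]
    path = [int(np.argmax(delta))]
    for t in range(n_events - 1, 0, -1):       # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```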

Related Papers

Z. Duan and B. Pardo, Aligning semi-improvised music audio with its lead sheet, in Proc. International Society for Music Information Retrieval Conference (ISMIR), Miami, Florida, USA, October 24-28, 2011. (PDF)

Z. Duan and B. Pardo, Soundprism: an online system for score-informed source separation of music audio, IEEE Journal of Selected Topics in Signal Processing, in press. (PDF)

B. Pardo and W. Birmingham, Modeling Form for On-line Following of Musical Performances, in Proceedings of the Twentieth National Conference on Artificial Intelligence, Pittsburgh, Pennsylvania, July 9-13, 2005. (PDF)

B. Pardo and W. Birmingham, Improved Score Following for Acoustic Performances, in Proceedings of the International Computer Music Conference, Gothenburg, Sweden, September 16-20, 2002. (PDF)

Music Story

David Ayman Shamma, Bryan Pardo, Kristian Hammond

What is it?

MusicStory listens to music. Like anyone else, the lyrics it hears bring to mind images that resonate with it. The difference is that its mind is not that of a human; it is instead defined by the set of links between ideas and images that is the World Wide Web.
It listens while it searches online for images linked to the words being sung and the connections that exist between those images and the song. As the images are found, it presents them to the audience, creating an 'on the fly' music video that heightens, clarifies, and exposes the connections between words, ideas and images that we often do not even notice.

How does it function?

While listening to the music, MusicStory finds and presents word/image associations. It takes the emotional experience of listening to music, amplifies it, and heightens its visceral appeal by externalizing the concrete, visual imagery intrinsic in the music. The retrieved images vary in their association: some are semantically on point and some are distant.

MusicStory uses online indexes to retrieve images that have popular relevance. This hands back pop-culture meanings of terms and images. In some cases, the image becomes more authoritative and concrete. The balance of pop-culture and authoritative associations expands the emotional experience. The flow of imagery moves with the pace of the song, providing quick transitions through fast songs and leisurely transitions through slower songs.

A MusicStory music video version of "Now Get Busy" by the Beastie Boys

How do I get it?

MusicStory was created for personal media devices and is available for download. We are currently expanding MusicStory for use in large-scale, concert installations.

Multi-pitch Estimation & Streaming

What is it?

Multi-pitch estimation & streaming is the task of estimating pitches and streaming them into pitch trajectories, one for each underlying source in a polyphonic audio recording. In Music Information Retrieval (MIR), people usually do not discriminate between "pitch" and "fundamental frequency (F0)", so the task is also called multi-F0 estimation & streaming.

What is it good for?

Multi-pitch estimation & streaming is of great interest to researchers working in music audio and speech signal processing. It is useful for many applications, including automatic music transcription, source separation, score following, content-based music search, robust speech recognition, prosody analysis, etc. The task, however, remains challenging and existing methods do not match human ability in either accuracy or flexibility.

Our Approach

We decompose the task into two subtasks: estimation and streaming. First, we estimate pitches and the number of pitches (polyphony) in each time frame; then we stream these pitch estimates into pitch trajectories (streams).

For multi-pitch estimation, we propose a maximum likelihood approach, where the power spectrum of a time frame is the observation and the F0s are the parameters to be estimated. When defining the likelihood model, the proposed method models both spectral peaks and non-peak regions (frequencies further than a musical quarter tone from all observed peaks). It is shown that the peak likelihood and the non-peak-region likelihood act as a complementary pair: the former helps find F0s that have harmonics that explain peaks, while the latter helps avoid F0s that have harmonics in non-peak regions. Parameters of these models are learned from monophonic and polyphonic training data. We propose an iterative greedy search strategy to estimate F0s one by one, to avoid the combinatorial problem of concurrent F0 estimation. We also propose a polyphony estimation method to terminate the iterative process. Finally, we propose a post-processing method to refine polyphony and F0 estimates using neighboring frames. It is shown that the refinement method eliminates many inconsistent estimation errors. Evaluations are done on ten recorded four-part J. S. Bach chorales. Results show that the proposed method achieves superior F0 estimation and polyphony estimation compared to two state-of-the-art algorithms.
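
The greedy loop itself can be sketched as follows; the `likelihood` callback stands in for the peak/non-peak-region model above, and the termination test is a simplification of the proposed polyphony estimation method.

```python
# Schematic iterative greedy F0 search: add one F0 at a time, each chosen
# to maximize the joint likelihood of the observed spectral peaks given
# all F0s selected so far; stop when the improvement is too small.

def greedy_f0_search(peaks, candidates, likelihood, max_polyphony=6,
                     min_gain=0.0):
    """peaks: observed spectral peaks for one frame.
    candidates: candidate F0s in Hz. Returns the selected F0s."""
    selected = []
    best = likelihood(peaks, selected)
    while len(selected) < max_polyphony:
        gains = [(likelihood(peaks, selected + [f0]), f0)
                 for f0 in candidates if f0 not in selected]
        if not gains:
            break
        new_best, f0 = max(gains)
        if new_best - best <= min_gain:   # polyphony termination criterion
            break
        selected.append(f0)
        best = new_best
    return selected
```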

For multi-pitch streaming, we cast the problem as a constrained clustering problem, where each cluster of pitch estimates corresponds to the pitch trajectory of a source. Instance-level constraints (must-links and cannot-links) are defined on pairs of pitches, to utilize their local time-frequency locality information. Must-links are imposed between similar pitches in adjacent frames. Cannot-links are imposed between concurrent pitches, under the monophonic source assumption. The objective function is defined as the intra-class distance between harmonic structures of pitches, to utilize their timbre consistency. This is reasonable, since humans use timbre consistency as an important cue to help discriminate and track sound sources. According to the definition of our constraints, our constrained clustering problem has a unique property: almost every pitch estimate is involved in some constraint. This makes existing constrained clustering algorithms inappropriate. In addition, the pitch estimates upon which constraints are applied may not be accurate, making their constraints inapplicable. Therefore, we propose a new constrained clustering algorithm, which minimizes the objective function while trying to satisfy as many constraints as possible. Experiments on the above-mentioned ten four-part J. S. Bach chorales and 400 two-talker and three-talker speech mixtures show that our approach produces good results.
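
The following sketch conveys the flavor of that algorithm; the greedy reassignment loop and the fixed constraint penalties are simplifications of the published method.

```python
# Schematic constrained clustering for pitch streaming: assign pitches to
# source clusters to minimize distance to the cluster's mean harmonic
# structure (timbre consistency), penalizing must-link and cannot-link
# violations rather than enforcing them strictly.
import numpy as np

def stream_pitches(features, labels, must, cannot, n_iter=50):
    """features: (n_pitches, dim) harmonic structures; labels: initial
    cluster assignment (numpy int array); must/cannot: (i, j) pairs."""
    labels = labels.copy()
    n_clusters = labels.max() + 1
    for _ in range(n_iter):
        means = np.stack([
            features[labels == c].mean(axis=0) if np.any(labels == c)
            else features.mean(axis=0)
            for c in range(n_clusters)])
        for i in range(len(features)):
            costs = np.linalg.norm(features[i] - means, axis=1)
            for (a, b) in must:       # prefer the partner's cluster
                if i in (a, b):
                    other = b if i == a else a
                    costs += np.where(np.arange(n_clusters) == labels[other],
                                      0.0, 1.0)
            for (a, b) in cannot:     # avoid the partner's cluster
                if i in (a, b):
                    other = b if i == a else a
                    costs[labels[other]] += 1.0
            labels[i] = int(np.argmin(costs))
    return labels
```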

Code:

Multi-pitch Estimation only performs the multi-pitch estimation subtask. It takes a piece of polyphonic audio as input and outputs pitch estimates in each time frame.

Multi-pitch Streaming only performs the multi-pitch streaming subtask. It takes a piece of polyphonic audio and pitch estimates in individual frames as input. It outputs a pitch trajectory for each underlying source.

Multi-pitch Estimation and Streaming is the whole system. It takes a piece of polyphonic audio as input. It outputs a pitch trajectory for each underlying source. It also outputs the pitch estimates in each time frame as intermediate results.

Sound example:

Composed by J. S. Bach, played by violin, clarinet, tenor saxophone and bassoon.

Multi-pitch estimation results

The figure above shows the multi-pitch estimation result on the four part music recording above. Colored lines are the ground truth pitches. Black dots are pitch estimates. It can be seen that most estimates are correct, while there are still some insertion, deletion and substitution errors.

The figure above shows boxplots of comparison results with Klapuri's system [1] on 10 pieces of four-part music recordings. Each box represents 330 data points and each point corresponds to 1 second of audio. The gray boxes are Klapuri06's results and the white ones are ours. The lower and upper lines of each box show 25th and 75th percentiles of the sample. Higher values are better. The line in the middle of each box is the sample median, which is also presented as the number below each box. The lines extending above and below each box show the extent of the rest of the samples, excluding outliers. "Mul-F0" measures the overall accuracy of all pitches. "Pre-F0" measures the accuracy of the first pitch found.

The figure above shows boxplot comparisons on 200 two-talker and 200 three-talker speech mixtures. We compared with two state-of-the-art multi-pitch estimation methods that were specifically designed for speech: Wu03 [3] (gray boxes) and Jin11 [4] (black boxes). Our results are in white boxes. Results show that our method is comparable to Wu03 and significantly better than Jin11 on two-talker speech mixtures. However, neither Wu03 nor Jin11 (nor any other existing method) can deal with speech mixtures of three or more talkers. Our method can.

Multi-pitch streaming results

The two figures above show the ground-truth (upper figure) and system-output (lower figure) pitch trajectories. Colored dots represent pitches in different trajectories. Note that the system-output trajectories are obtained from multi-pitch estimation results, which contain some insertion and deletion errors. Nevertheless, most correctly estimated pitches are tracked into the correct pitch trajectory.

The figure above shows multi-pitch streaming results of the proposed method given three kinds of pitch estimates as input: Klapuri06 [1] (dark gray boxes), Pertusa08 [2] (light gray boxes), and our multi-pitch estimation method (white boxes). Red lines show the average input pitch estimation accuracies, which set the upper bounds on multi-pitch streaming accuracy. This figure shows that the proposed multi-pitch streaming method can take any pitch estimates as input. The closeness between the multi-pitch streaming accuracies and the input multi-pitch estimation accuracies shows that the streaming algorithm works well.

The figure above compares multi-pitch streaming results of different algorithms on multi-talker speech mixtures. These methods are: 1) Wohlmayr11 [5]; 2) Hu12 [6]; 3) the proposed method taking our multi-pitch estimation results as input; 4) the proposed method taking Wu03 [3] as input; 5) the proposed method taking Jin11 [4] as input. The red lines show the average input pitch estimation accuracies, which set the upper bounds on pitch streaming accuracy. Results show that the proposed streaming method works with different multi-pitch estimation methods. On two-talker speech mixtures, the proposed streaming method achieves results comparable to Hu12 and significantly better than Wohlmayr11. However, neither Wohlmayr11 nor Hu12 can deal with speech mixtures of three or more talkers. The proposed streaming method, given the pitch estimates of the proposed estimation method, can.