Sound Archive

To give you a feel for the methods developed in this thesis, please
listen to this introductory example.

This example is made from a collection of sounds recorded during a camping
trip. Here, a person starts by a camp fire and then walks past a stream
to their tent. The wind starts to howl and the person gets to their
tent just in time before it rains at the end of the clip.

Remarkably,
all the sounds in this clip are synthetic (except for the sound of the
closing zip). They are produced from a single 'generative
model', which has been trained on natural sounds and has learned how to
produce natural-sounding versions. The model is particularly good at
synthesising auditory textures like fire, running water, wind and
rain.

The model works by learning the important statistics of these sounds. It
can then produce new synthetic versions, of arbitrary duration, by ensuring the
new sounds match the statistics of the original. This shows that
auditory textures are often defined statistically, a fact
first demonstrated
by Josh McDermott and Eero
Simoncelli.
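The simplest instance of this statistic-matching idea is matching the power spectrum alone. The sketch below is a classical baseline, not the model from the thesis, and the "training sound" is a made-up stand-in: a new Gaussian sound is generated by keeping the training magnitude spectrum and randomising the phases.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "training sound": amplitude-modulated noise standing in
# for a recorded texture.
n = 4096
train = rng.standard_normal(n) * (1.0 + 0.5 * np.sin(2 * np.pi * np.arange(n) / 512))

# Keep the training magnitude spectrum; randomise the phases.
mag = np.abs(np.fft.rfft(train))
phase = rng.uniform(0.0, 2 * np.pi, mag.shape)
phase[0] = phase[-1] = 0.0   # DC and Nyquist bins must stay real
synthetic = np.fft.irfft(mag * np.exp(1j * phase), n)
```

Matching only the spectrum yields Gaussian textures; the models discussed later add modulation statistics on top of this.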

Characterising the statistics of natural sounds is important. For example, when
your car's automatic speech recognition system tries to figure out what you are
saying when there is traffic noise in the background, it will often
fail. However, the performance can be
enhanced by removing the traffic noise, and this can be done by knowing the
difference between the statistics of traffic noise and those of speech.

Importantly, by generating synthetic sounds,
this work also reveals the statistics to which auditory processing
is sensitive. This is an important practical tool for understanding how hearing
operates.

Chapter 3: Probabilistic Amplitude Demodulation

Probabilistic amplitude demodulation is a new method we invented for estimating the envelope of
a signal. Below, we illustrate the method by taking a training sound,
extracting its envelope, and then generating a new sound using this envelope
and a white noise carrier. It is clear from this example that the long-time
"rhythm" of the sound remains, but the short-time frequency content is missing.

These examples were also referenced from the paper
accepted for an oral presentation at the ICA 2007 conference in
London. This paper won the "Best Student Paper" award.

Chapter 5: Probabilistic Time-Frequency Representations

Probabilistic time-frequency representations are
complementary to traditional time-frequency representations. They
were developed in
Chapter 5 of my thesis. This new type of representation is slower to estimate
than traditional methods, but once it has been estimated it is very simple to
resynthesise modified sounds from it. For example, below we illustrate how to
modify the duration of a sound, and also how to modify the pitch of the sound.
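As a toy analogue of duration modification: once a sound is held as a framed, time-frequency-like representation, its duration can be changed by resampling the frame positions and overlap-adding. The sketch below is a crude scheme with invented parameters, not the probabilistic representation from the thesis.

```python
import numpy as np

def ola_stretch(x, ratio, frame=512, hop=128):
    """Crude duration change: read frames from the original at a
    rescaled rate and overlap-add them at the nominal hop. A toy
    analogue of resynthesis from a time-frequency representation."""
    win = np.hanning(frame)
    n_out_frames = int((len(x) - frame) * ratio / hop)
    out = np.zeros(n_out_frames * hop + frame)
    norm = np.zeros_like(out)
    for i in range(n_out_frames):
        src = int(i * hop / ratio)   # read position in the original
        dst = i * hop                # write position in the output
        out[dst:dst + frame] += win * x[src:src + frame]
        norm[dst:dst + frame] += win
    return out / np.maximum(norm, 1e-8)

rng = np.random.default_rng(2)
x = rng.standard_normal(16384)
y = ola_stretch(x, ratio=1.5)   # roughly 1.5x longer
```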

Chapter 5: MPAD (and ICASSP) synthetic auditory textures

In this section the goal is to produce synthetic versions of natural sounds by
learning their statistics and generating new versions which match those statistics.
In other words, for each of the models below, the model parameters were
learned from a training sound (named on the left hand side), and then entirely new sounds were
generated using those learned parameters.

The "original" sounds are the training sounds. In some cases these are
very short. The "AR2 matched spectra" sounds comprise a sum of AR(2)
processes with parameters chosen to match the training spectra. These sounds
are therefore Gaussian noise with spectra parameterised by the AR(2) processes. The
"independent modulation" sounds are formed from a sum of independently
modulated AR(2) processes. The "Co-modulation" sounds are formed from
a sum of comodulated AR(2) processes. In this way, the models get more
complicated from left to right across the table. Similarly, the
complexity of the sounds increases down the table. So, whilst water is captured relatively well by independently modulated AR(2)
processes, fire requires co-modulated processes to capture the
crackles. Rain also requires comodulation to capture the sound of the
droplets hitting leaves, but because this sound is asymmetric through
time, the models - whose statistics are invariant under a reversal of time - cannot perfectly capture
it. Similarly, speech cannot be captured because of this, and other
higher-order statistics, which the models do not capture.
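A minimal numpy sketch of this model family, with all parameters invented for illustration: each channel is an AR(2) process; summing them gives the "AR2 matched spectra" case, giving each channel its own envelope gives the "independent modulation" case, and sharing one positive envelope across every channel gives the comodulated case.

```python
import numpy as np

rng = np.random.default_rng(3)

def ar2(n, freq, damp, rng):
    """Sample an AR(2) process whose spectrum peaks near `freq`
    (normalised, 0..0.5) with bandwidth set by `damp` (0 < damp < 1)."""
    a1 = 2 * damp * np.cos(2 * np.pi * freq)
    a2 = -damp ** 2
    x = np.zeros(n)
    e = rng.standard_normal(n)
    for t in range(2, n):
        x[t] = a1 * x[t - 1] + a2 * x[t - 2] + e[t]
    return x

n = 8192
carriers = [ar2(n, f, 0.99, rng) for f in (0.05, 0.15, 0.3)]

# "AR2 matched spectra": a plain sum -> stationary Gaussian noise.
gaussian = sum(carriers)

# "Independent modulation": each channel gets its own slow envelope.
indep_envs = [np.exp(rng.standard_normal(n // 64).repeat(64) - 1.0)
              for _ in carriers]
independent = sum(e * c for e, c in zip(indep_envs, carriers))

# "Co-modulation": one shared envelope multiplies every channel,
# producing the crackle-like coincidences plain Gaussian noise lacks.
shared_env = np.exp(rng.standard_normal(n // 64).repeat(64) - 1.0)
comodulated = sum(shared_env * c for c in carriers)
```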

Here is some more detail about some of the sounds. The first wind sound is dominated by only three patterns of
comodulation, as demonstrated by the following sounds:

Similarly, the rain sound is dominated by two components which handle the
transient sound of the water droplets hitting the leaves, with the remaining components capturing the slower aspects of the
rain sound:

Chapter 5: Chimera

These auditory chimeras combine the carriers inferred from one sound with the modulators
inferred from another, possibly synthetic, sound. They indicate the aspects of the sounds which
are captured by the two types of processes.

When using MPAD (see Chapter 5 of
my thesis) to carry out sub-band demodulation, the posterior mean over the carriers often appears to contain
modulation. That is, it appears that MPAD is not demodulating the sub-bands
fully. However, it turns out that this feature is due to the posterior mean not being typical
of the posterior distribution. Consider inferring a carrier when the associated
amplitude is very small (compared to
the observation noise). The posterior mean of the carrier reverts to the prior mean which is zero. This means
that the posterior mean of the carriers tends to contain more energy when the
amplitude is large than when it is small. However, the posterior variance of the carriers is higher
in regions of low amplitude. Therefore, a sample from the posterior distribution over
the carriers contains rather less modulation than the posterior mean. For this
reason, chimeras should be produced using samples from the posterior
distribution over the carriers,
rather than the posterior mean itself.
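This effect is easy to reproduce in a one-dimensional toy model, y = a·c + noise with a Gaussian prior on the carrier c (a deliberately simplified stand-in for MPAD's posterior; all numbers are invented). The posterior mean of the carrier shrinks towards zero wherever the amplitude is small, so its energy tracks the amplitude, whereas a posterior sample has near-constant energy.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2 = 1.0   # observation-noise variance

def carrier_posterior(y, a, sigma2):
    """Posterior over carrier c for y = a*c + noise, prior c ~ N(0, 1)."""
    var = sigma2 / (a ** 2 + sigma2)
    mean = a * y / (a ** 2 + sigma2)
    return mean, var

# Slowly varying amplitude: large in the first half, tiny in the second.
n = 2000
a = np.where(np.arange(n) < n // 2, 3.0, 0.05)
c_true = rng.standard_normal(n)
y = a * c_true + np.sqrt(sigma2) * rng.standard_normal(n)

mean, var = carrier_posterior(y, a, sigma2)
sample = mean + np.sqrt(var) * rng.standard_normal(n)

# The posterior MEAN of the carrier inherits the modulation...
mean_energy_ratio = np.mean(mean[: n // 2] ** 2) / np.mean(mean[n // 2:] ** 2)
# ...but a posterior SAMPLE has near-constant energy, like the prior.
sample_energy_ratio = np.mean(sample[: n // 2] ** 2) / np.mean(sample[n // 2:] ** 2)
```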

Chapter 5: Filling in missing data experiments

Various generative models were used to fill in missing
sections of speech.

Grouping principle 3: Common fate.
Harmonic stacks: the first half contains four sinusoids with no
modulation. The second half contains the same four components, but pairs are
modulated independently in frequency.
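A sketch of how such a stimulus could be generated; all frequencies, modulation rates and depths here are invented for illustration. Within each pair the two components share a frequency-modulation pattern (common fate), while the two pairs are modulated independently, and only in the second half.

```python
import numpy as np

rng = np.random.default_rng(5)
fs = 16000
dur = 2.0
t = np.arange(int(fs * dur)) / fs
second_half = (t >= dur / 2).astype(float)

freqs = [300.0, 600.0, 900.0, 1200.0]   # hypothetical stack frequencies
# One slow FM pattern per pair; pairs are independent of each other.
fm = [np.sin(2 * np.pi * 2.0 * t), np.sin(2 * np.pi * 3.0 * t)]
depth = 20.0   # FM depth in Hz

x = np.zeros_like(t)
for i, f in enumerate(freqs):
    # Modulation switched on only in the second half of the stimulus.
    inst_freq = f + depth * fm[i // 2] * second_half
    phase = 2 * np.pi * np.cumsum(inst_freq) / fs
    x += np.sin(phase)
```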