While writing my previous post on the forensic analysis of audio, I was reminded of a very cool audio enhancement technique. Imagine yourself at a cocktail party. It is crowded, loud, and you are talking with someone standing directly in front of you. Although you are subjected to a medley of complex and distracting sounds, you have the remarkable ability to largely tune out those noises and focus on just one source — the person in front of you. The problem of separating sounds into their individual components is known as the cocktail party problem, and there are some powerful techniques that can automatically separate a complex mixture of sounds into its individual components.

When two sounds are mixed, the resulting recording is (approximately) a sum of the two sounds. Given this simple combination, it may seem like it should be easy to separate these sounds. Separating the sounds, however, is highly underconstrained. It is the same as asking you "27 is the sum of what two numbers?" There are an infinite number of possible solutions, all equally plausible. Without any more information, there is no way to select one pair of numbers over another.

There are several basic ways in which a mixture of signals can be separated into its individual components. In the standard approach, it is assumed that several recordings, from different microphones, are available. Although this does provide more information, it does not entirely constrain the problem. By adding the assumption that the individual components are statistically independent, the components can be separated. The intuition behind this method is that each microphone records a different mixture of the signals (because of its relative proximity to the different sources). This, along with the independence assumption, is enough to recover the sources (up to their ordering and scale).
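As a rough sketch of this mixing model, here is what "each microphone records a different mixture" looks like numerically. The two sources and the mixing weights below are invented stand-ins; real recordings would replace them.

```python
import numpy as np

t = np.linspace(0, 1, 4000, endpoint=False)

# two toy "sources": a sine and a square wave (stand-ins for two voices)
s1 = np.sin(2 * np.pi * 5 * t)
s2 = np.sign(np.sin(2 * np.pi * 3 * t))
S = np.stack([s1, s2])             # shape (2, 4000): one source per row

# hypothetical mixing matrix: each row holds one microphone's weights,
# reflecting that microphone's proximity to each source
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])

X = A @ S                          # shape (2, 4000): two microphone recordings
```

Each row of `X` is a different linear combination of the same two sources, which is exactly the extra information the independence assumption exploits.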

Here, for example, are a pair of recordings of two people simultaneously counting to ten in different languages:

In this example, you can probably hear a faint residual of the second voice in each source, but the result is fairly impressive. There are many computational approaches to separating an audio recording: see FastICA for a particularly effective one.
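A minimal sketch of this kind of separation, using the FastICA implementation in scikit-learn. The synthetic sources and mixing matrix here are invented for illustration; the two columns of `X` play the role of the two microphone recordings.

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 1, 4000, endpoint=False)

# synthetic stand-ins for the two voices
s1 = np.sin(2 * np.pi * 5 * t)
s2 = np.sign(np.sin(2 * np.pi * 3 * t))
S = np.c_[s1, s2]                       # shape (n_samples, n_sources)

# hypothetical mixing matrix: each column of X is one microphone's recording
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])
X = S @ A.T

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)            # recovered sources, up to sign and scale

# each recovered component should correlate strongly with one true source
corr = np.abs(np.corrcoef(S.T, S_est.T))[:2, 2:]
```

Note that ICA cannot tell you which recovered component is which voice, nor its original loudness — hence the correlation check rather than a direct comparison.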

It is worth noting that this type of separation can be performed from only a single recording, but stronger statistical assumptions must be placed on the sources. In addition to separating a mixture of sounds into their individual components, these approaches can also be used to remove loud background noise that may be obscuring a signal of interest. The concept of separating sounds has an analog in forensic image analysis as well (see, for example, some of my earlier work on separating reflections in images).
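One common family of single-recording techniques places assumptions on the noise rather than the sources — for example, that the background noise is stationary, so its spectrum can be estimated from a signal-free stretch and subtracted out. Below is a minimal spectral-subtraction sketch in that spirit; the sample rate, the synthetic tone-plus-noise recording, and the "first half is noise-only" assumption are all invented for illustration.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 8000
rng = np.random.default_rng(1)

# synthetic recording: 0.5 s of noise alone, then a tone buried in the same noise
t = np.arange(fs) / fs
tone = np.where(t >= 0.5, np.sin(2 * np.pi * 440 * t), 0.0)
noisy = tone + 0.5 * rng.standard_normal(fs)

f, frames, Z = stft(noisy, fs=fs, nperseg=256)

# assume the first 0.5 s is noise-only; estimate its average magnitude spectrum
noise_mag = np.abs(Z[:, frames < 0.5]).mean(axis=1, keepdims=True)

# subtract the noise estimate from every frame's magnitude (floored at zero),
# and reuse the noisy phase for reconstruction
mag = np.maximum(np.abs(Z) - noise_mag, 0.0)
_, cleaned = istft(mag * np.exp(1j * np.angle(Z)), fs=fs, nperseg=256)

# the signal-free region should come out quieter than it went in
before = np.sqrt(np.mean(noisy[: fs // 2] ** 2))
after = np.sqrt(np.mean(cleaned[: fs // 2] ** 2))
```

This trades the multi-microphone independence assumption for a stationarity assumption on the noise, which is why a single recording suffices.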