The following two examples demonstrate the ARSS's ability
to reproduce a sound from its spectrogram.
In each one, the first sound icon links to the original sound re-encoded
as MP3, the image in the middle links to the full image obtained
by analysing that sound (re-encoded as PNG and possibly
slightly edited for visibility), and the last icon
links to the sound synthesised from that image
alone.

Units :
- bpo : Bands per octave. This is the frequency resolution. For
example, 24 bpo means there are 24 pixels vertically for each octave,
which implies that the distance between two adjacent pixels is half a semitone.
- pps : Pixels per second. This is the time resolution. For example,
150 pps means there are 150 pixels horizontally per second, which
implies that the distance between two adjacent pixels is 1/150th of a second.
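
As a rough illustration of these two units, the sketch below maps a pixel position to a frequency and a time. It assumes a logarithmic frequency axis whose bottom row sits at a hypothetical `min_freq`; the function names and the 27.5 Hz starting frequency are illustrative choices, not ARSS settings.

```python
def row_to_freq(row, bpo, min_freq=27.5):
    """Frequency (Hz) of a pixel row counted from the bottom of the image.

    Assumes a logarithmic axis: moving up by `bpo` rows doubles the frequency.
    `min_freq` is a hypothetical bottom frequency chosen for illustration.
    """
    return min_freq * 2.0 ** (row / bpo)


def column_to_time(column, pps):
    """Time (seconds) of a pixel column counted from the left edge."""
    return column / pps


def semitones_per_row(bpo):
    """An octave holds 12 semitones, so each row spans 12 / bpo semitones."""
    return 12.0 / bpo


if __name__ == "__main__":
    print(semitones_per_row(24))      # half a semitone per row
    print(row_to_freq(24, 24))        # one octave above min_freq
    print(column_to_time(150, 150))   # one second into the sound
```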

Caption

Original sound

Produced spectrogram

Resynthesised sound

Johann Strauss II's The Blue Danube

38 second classical music extract

Thanks to the brightness correction which brings the sensitivity floor of spectrograms from -48 dB to -96 dB, all of the instruments' harmonics are reproduced intact. The relatively high frequency resolution also plays an important part in the quality of the resynthesis.
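
The -48 dB and -96 dB figures are consistent with a simple piece of arithmetic on 8-bit pixels, sketched below. This is only an assumption about where the numbers come from: it supposes 256 grey levels and a correction that amounts to squaring the normalised pixel value at synthesis. The `exponent` parameter is hypothetical and is not taken from the ARSS itself.

```python
import math


def floor_db(levels, exponent=1.0):
    """Approximate quietest representable level, in dB relative to full scale.

    levels   -- number of distinct pixel values (256 for an 8-bit image)
    exponent -- assumed power applied to the normalised pixel value to get
                an amplitude (1.0 = no correction, 2.0 = a hypothetical
                square-root brightness correction undone at synthesis)
    """
    smallest = (1.0 / levels) ** exponent
    return 20.0 * math.log10(smallest)


if __name__ == "__main__":
    print(round(floor_db(256), 1))        # about -48 dB without correction
    print(round(floor_db(256, 2.0), 1))   # about -96 dB with the assumed correction
```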

The following examples show what kinds of sounds one can obtain by creating spectrograms.

Caption

Original spectrogram

Synthesised sound

HAL 9000 hand-drawn in Photoshop

This spectrogram has been created in
about 15 minutes in Photoshop with the brush tool by following
the lines and imitating the other features of the HAL 9000
spectrogram presented previously.

We can understand quite distinctly what the voice says,
which is almost surprising, considering how quickly and carelessly
this was done. This leads me to think that one could
easily learn how to draw every phoneme, and thus create
clear speech from scratch.

Few real-world pictures fed to the ARSS
come out as interesting sounds, but this photograph of a
DNA gel
(originally taken from
this page)
is one of them.

It is thanks to its short horizontal lines, neatly stacked
vertically on a black background, that this
picture turns into a series of short, distinct notes making
up a strangely catchy, robotic-sounding melody.

The following effects were obtained simply by resynthesising the original sound's unmodified spectrogram with different synthesis parameters.

Caption

Original sound

Produced spectrogram

Resynthesised sound

Time stretching : slowing down

Scatman John's scat slowed down 5 times

This effect is achieved simply by changing the time resolution setting for resynthesis. The frequency resolution was set to the lowest decent value to obtain the best possible time resolution, which is absolutely crucial when slowing a sound down. Note how different and more natural the result sounds compared with the same effect as achieved by Adobe Audition 1.5.
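
The relationship between the two time resolution settings and the resulting duration can be sketched as follows. The function name and the numbers in the usage lines are hypothetical; they only illustrate that reading the same image back at a lower pps makes the sound last longer.

```python
def stretched_duration(duration_s, analysis_pps, synthesis_pps):
    """Duration (seconds) of the resynthesised sound.

    The image is duration_s * analysis_pps pixels wide; reading it back at
    synthesis_pps pixels per second changes how long it lasts.
    """
    width_px = duration_s * analysis_pps
    return width_px / synthesis_pps


if __name__ == "__main__":
    # Hypothetical numbers: a 4 s extract analysed at 500 pps, then
    # resynthesised at 100 pps, lasts 5 times longer.
    print(stretched_duration(4.0, 500, 100))  # 20.0
```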

Most other time stretching algorithms, in order to speed a sound up a hundred times, would simply cut it into tiny chunks and keep one chunk out of every 100, much as image editing programs can reduce a picture's size using nearest neighbor interpolation. The ARSS instead properly filters the information so as to keep everything that could still be heard at such speeds. In this example we can make out two main components : the bubbly sound, which is the president's speech, and the short noises, which are the audience applauding.
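
The difference between the two approaches can be illustrated on a toy loudness envelope. The sketch below is not the ARSS filter, which is not described here; it uses a plain box filter (block averaging) as a stand-in to show why filtering keeps short events that keeping one chunk out of every `n` throws away.

```python
def keep_one_in(samples, n):
    """Nearest-neighbour style reduction: keep one value out of every n."""
    return samples[::n]


def average_every(samples, n):
    """Box-filter reduction: each output value averages a block of n inputs.

    A box filter is only a stand-in here; the actual ARSS filter is not
    described in the text.
    """
    return [sum(samples[i:i + n]) / len(samples[i:i + n])
            for i in range(0, len(samples), n)]


if __name__ == "__main__":
    # A toy loudness envelope with one short burst (e.g. a clap) at index 150.
    envelope = [0.0] * 300
    envelope[150] = 1.0

    print(keep_one_in(envelope, 100))    # [0.0, 0.0, 0.0]: the burst is lost
    print(average_every(envelope, 100))  # [0.0, 0.01, 0.0]: the burst survives
```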

This effect is, I believe, completely new (however, if you think you've heard of such a thing before, I'd be delighted to hear about it). For that reason it's also a bit difficult to explain, so please bear with me. While pitch shifting moves notes around but leaves the intervals between notes intact, this technique compresses or stretches the intervals between notes proportionally across the whole spectrum. This is equivalent to taking a score and multiplying the number of semitones between every pair of notes by the same factor.

So, for example, if you stretched the notes C3-D3-G3 by a factor of 2 using this technique, you might obtain the notes C3-E3-D4 or, depending on other settings, you might just as well obtain A5-C#6-B6. The important point is that the interval between any two notes is doubled, and in this particular example the sound is stretched from 4.77 octaves to 9.53 octaves. While I chose here to double the intervals for harmonic reasons, you can also choose to reduce them. It usually turns anything into eerie-sounding, dissonant music.
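
The note arithmetic in this example can be checked with a short sketch. It works on MIDI note numbers (with C4 = 60) rather than on a spectrogram, so it only illustrates the interval scaling itself. The `shift` parameter is a hypothetical overall transposition standing in for the "other settings" mentioned above.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def name_to_midi(name):
    """Convert a note name such as 'C#4' to a MIDI number (C4 = 60)."""
    pitch, octave = name[:-1], int(name[-1])
    return NOTE_NAMES.index(pitch) + 12 * (octave + 1)


def midi_to_name(number):
    """Convert a MIDI number back to a note name."""
    return NOTE_NAMES[number % 12] + str(number // 12 - 1)


def scale_intervals(notes, factor, shift=0):
    """Multiply every interval, measured from the first note, by `factor`.

    `shift` is a hypothetical overall transposition in semitones.
    """
    base = name_to_midi(notes[0])
    return [midi_to_name(base + round((name_to_midi(n) - base) * factor) + shift)
            for n in notes]


if __name__ == "__main__":
    print(scale_intervals(["C3", "D3", "G3"], 2))            # ['C3', 'E3', 'D4']
    print(scale_intervals(["C3", "D3", "G3"], 2, shift=33))  # ['A5', 'C#6', 'B6']
```

In both cases the intervals go from 2 and 7 semitones above the first note to 4 and 14 semitones; only the starting note differs.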

The following example shows how an image editing program can be used to achieve things previously impossible in sound processing.

Caption

Original sound

Original spectrograms

Edited spectrograms

Synthesised sounds

Instruments and vocals separation

A 1970s funk loop (George Duke's Reach For It) broken down into layers

The separation of each instrument was achieved mainly by painting in black (with the Photoshop brush) around the features of interest. Once again one of the main obstacles was the resolution (which made it necessary to use two different spectrograms of the same sound at different resolutions), but so was the way instruments were mixed together even in the image. The drums were the biggest source of problems, as their noisy features spread all over the spectrogram and thus mixed with everything else. However, thanks to the power of image editing techniques, we can achieve a quality of separation that traditional sound processing techniques cannot come close to.

Thanks to linear frequency scaling (--linear) and sine synthesis we can now use the ARSS to transmit or store images as sound with hardly any loss in quality, provided that we do this under ideal conditions.

There are a few things to note about this particular example. Because the final image is produced from the actual MP3, as opposed to a lossless reproduction of the synthesised sound, and because such a sound contains much more information than regular music (more than the MP3 format was designed for), the MP3 is encoded at a bitrate 4 times higher than the one usually used for music. With a lower bitrate the image would have been very noisy, and with an even lower one entire chunks of the image would have been blacked out.

It may also be of interest to note that this method of image transmission is actually as efficient as the method used for analog black and white television transmission. This means that we could theoretically transmit TV programs this way within the same bandwidth as analog television, and with the same quality, under ideal conditions. One interesting aspect of this technique is that images transmitted like this can be picked up and viewed by anyone with a spectrograph. Given the arguable universality of mathematics and time-frequency analysis, one may go as far as arguing that it would be a good way of transmitting images to potential extraterrestrial civilisations, as we may expect them to be acquainted with such analysis techniques and to use them at some point when analysing strange signals from outer space.

Back on Earth, you could try the following. Ask someone you know to call you on the phone and play this sound. Record it, analyse it (with the following parameters : 300 Hz to 3400 Hz, height 256 pixels, 10 pps, linear) and if you see anything you recognise, please send it to me (I don't have a telephone and I'd like to know how well it works).
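
To get a feel for what these parameters mean, here is a small sketch of how rows could map to frequencies on a linear axis. It assumes that the bottom row sits exactly at the minimum frequency and the top row at the maximum, with equal spacing in between; the exact placement of bands in the ARSS may differ, and the function name is illustrative.

```python
def linear_row_to_freq(row, height, min_freq, max_freq):
    """Frequency (Hz) of a row on a linear axis, row 0 being the bottom.

    Assumes the bottom row sits at min_freq and the top row at max_freq,
    with equal spacing in between.
    """
    return min_freq + row * (max_freq - min_freq) / (height - 1)


if __name__ == "__main__":
    height, pps = 256, 10
    print(linear_row_to_freq(0, height, 300, 3400))     # 300.0
    print(linear_row_to_freq(255, height, 300, 3400))   # 3400.0
    spacing = linear_row_to_freq(1, height, 300, 3400) - 300
    print(round(spacing, 2))   # Hz between adjacent rows
    print(height * pps)        # pixels carried per second of sound
```

Under these assumptions, adjacent rows are roughly 12 Hz apart and the sound carries 2560 pixels per second.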