Analyzing Sound Files

Introduction

In the last five articles we performed numerical calculations using the Numerical Python module and plotted our calculations using the DISLIN plotting package. This article concludes the series with an application for analyzing sound files on your PC. This small but interesting program will bring together many of the components we have talked about. The concise nature of the application demonstrates the power of the Python language.

Background

In the digital age, sound files are (of course) stored digitally.
Leaving the world of analog cassette tapes and phonographs behind,
computers and compact discs (CDs) store music in a sampled form.
Essentially, the sound wave is sampled or measured at discrete points in
time and recorded onto a storage medium. Digitally recorded
sounds are long sequences of numbers that can easily be manipulated
using computers.

Before there were computers or even digital recording, back in the
1920s, Harry Nyquist showed that to preserve the integrity of a sampled
sound, you must sample at a rate of at least twice the highest frequency
you wish to record. Audiophiles know that audio equipment records and
reproduces sound between 20 and 20,000 Hz (Hertz, meaning cycles per
second, named for the 19th-century physicist Heinrich Hertz). So it was
with Nyquist in mind that the music industry designed the compact disc
to record audio at 44,100 samples per second. Thus a CD can record and
reproduce sound up to 22,050 Hz, sufficient for even the pickiest
audiophile! If you are interested, there are plenty of references on the
Web. As a start, check digital-recordings.com.
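
The sampling limit can be checked numerically. The following is a
minimal sketch, assuming modern NumPy rather than the Numerical Python
module used in this series; the function name nyquist_limit and the
7000 Hz test tone are illustrative choices, not part of the original
article. It shows that a tone above half the sample rate yields exactly
the same samples as a lower tone, which is why such frequencies cannot
be recorded faithfully.

```python
import numpy as np

def nyquist_limit(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2.0

cd_limit = nyquist_limit(44100)   # half of the CD sample rate

# Aliasing: at 10,000 samples per second, a 7000 Hz tone produces the
# same samples as an inverted 3000 Hz tone, so the two cannot be told apart.
rate = 10000                      # assumed sample rate for the demo
n = np.arange(64)                 # sample indices
above = np.sin(2 * np.pi * 7000 * n / rate)
alias = -np.sin(2 * np.pi * 3000 * n / rate)
indistinguishable = np.allclose(above, alias)
```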

Wave files

Our tool will analyze files from the sound recorder available on your
computer. In Win32 operating systems, the sound recorder is found under
the Accessories part of the Start menu. (Apologies to Linux users: you
will have to adapt the steps a bit to make the tool work under Linux.) In
Win32, the output of the sound recorder is known as a wave file. These
multimedia files contain sound information as well as a header with all
the requisite information for reproduction. A Python module, aptly named
wave, has functions to deal with these file types.
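
As a rough illustration of what the header holds, the sketch below reads
the basic parameters of a wave file with the standard wave module. The
helper name wave_info and the file demo.wav are hypothetical; the demo
file is written by the example itself so that it runs without a
recording.

```python
import os
import tempfile
import wave

def wave_info(path):
    """Return the main header fields of a wave file as a dictionary."""
    with wave.open(path, "rb") as wf:
        return {
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "sample_rate_hz": wf.getframerate(),
            "frames": wf.getnframes(),
        }

# Write one second of silence so the example is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)        # mono
    wf.setsampwidth(1)        # 1 byte = 8-bit samples
    wf.setframerate(11025)    # samples per second
    # 8-bit samples are unsigned, so 128 is the silent midpoint.
    wf.writeframes(bytes([128]) * 11025)

info = wave_info(demo_path)
```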

To save a recording you must choose a sample rate, a sample width, and
either mono or stereo. For sample rate, your choices are 44100 Hz, 22050 Hz,
11025 Hz, or 8000 Hz. Higher sample rates preserve higher frequencies but
also result in larger files. For sample width, you can choose 8 or 16
bits. To make things simple, we are going to limit ourselves to working
with 8-bit samples. Finally, you can choose stereo (two channel
recording) or mono (single channel recording); for this article we will
be limited to files recorded in mono mode.
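
The trade-off between quality and file size is simple arithmetic. The
sketch below, in which bytes_per_second is a hypothetical helper name,
computes how much uncompressed sample data one second of sound occupies
(ignoring the small file header).

```python
def bytes_per_second(sample_rate_hz, sample_width_bits, channels):
    """Uncompressed sample data produced per second of recording."""
    return sample_rate_hz * (sample_width_bits // 8) * channels

# The sample rates offered by the sound recorder, at 8-bit mono.
for rate in (44100, 22050, 11025, 8000):
    print(rate, "Hz:", bytes_per_second(rate, 8, 1), "bytes per second")

cd_quality = bytes_per_second(44100, 16, 2)      # 16-bit stereo
article_setting = bytes_per_second(11025, 8, 1)  # 8-bit mono
```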

Using the sound recorder, the parameters of the sound file are set after
a recording is made. Use the 'Save As' command to select the output
filename. At the bottom of the dialog box, you will see a button for
altering the parameters of the recording. A good setting to get started
is "PCM 11.025 kHz, 8 bit, Mono." This will give you a file sampled at
11.025 kHz (11,025 Hz; k = kilo = ×1000) with 8-bit samples recorded on a
single track (mono).
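
A file saved with these settings can be turned into an array of numbers
in a few lines. This is a minimal sketch assuming modern NumPy and the
standard wave module; load_mono_8bit and tone.wav are hypothetical
names, and the 440 Hz test tone is synthesized by the example so that it
runs without a recording. Note that 8-bit wave samples are unsigned, so
silence sits at 128 rather than 0.

```python
import os
import tempfile
import wave

import numpy as np

def load_mono_8bit(path):
    """Return (sample_rate, samples) with samples scaled to about -1..1."""
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1 or wf.getsampwidth() != 1:
            raise ValueError("expected an 8-bit mono wave file")
        rate = wf.getframerate()
        raw = wf.readframes(wf.getnframes())
    data = np.frombuffer(raw, dtype=np.uint8).astype(np.float64)
    return rate, (data - 128.0) / 128.0   # remove the unsigned offset

# Synthesize a quarter second of a 440 Hz tone as a stand-in recording.
n = np.arange(11025 // 4)
tone = np.round(128 + 100 * np.sin(2 * np.pi * 440 * n / 11025))
tone_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
with wave.open(tone_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(1)
    wf.setframerate(11025)
    wf.writeframes(tone.astype(np.uint8).tobytes())

rate, samples = load_mono_8bit(tone_path)
```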

The tool

Our tool will create a sonogram, measuring the frequency content of
the signal as a function of time and plotting the results. You can use
the tool to view and characterize different sounds. With the tool you
can analyze voice, music, or other sounds and see the harmonic
components. Do you think that people's voices are different? With
this tool, you will be able to see that they are.
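
The idea behind a sonogram can be sketched in a few lines. This is not
the tool itself, only an illustration under stated assumptions: it uses
modern NumPy instead of the modules from this series, cuts the signal
into non-overlapping frames of 256 samples, and the function name
sonogram, the 8000 Hz sample rate, and the two test tones are all
hypothetical choices.

```python
import numpy as np

def sonogram(samples, frame_len=256):
    """Return an array of shape (frames, frame_len // 2 + 1) in decibels."""
    n_frames = len(samples) // frame_len
    frames = np.reshape(samples[:n_frames * frame_len], (n_frames, frame_len))
    spectrum = np.abs(np.fft.rfft(frames, axis=1))   # one spectrum per frame
    return 10 * np.log10(spectrum + 1e-12)           # offset avoids log10(0)

# Test signal: half a second at 500 Hz followed by half a second at 2000 Hz.
rate = 8000
t = np.arange(rate // 2) / rate
signal = np.concatenate([np.sin(2 * np.pi * 500 * t),
                         np.sin(2 * np.pi * 2000 * t)])

sono = sonogram(signal)
first_peak = sono[0].argmax()    # strongest bin in the first frame
last_peak = sono[-1].argmax()    # strongest bin in the last frame
```

Each row of the result is one time slice and each column one frequency
bin, so the jump from the first tone to the second shows up as the peak
moving to a higher column partway through the rows.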

Fourier transform

The fundamental algorithm we will use to perform our measurements is
known as the Fourier transform. This transform takes a sequence of time
samples (our sound file) and measures the frequency components, computing
how much energy is present in the sequence at various frequencies. The
Fourier transform is a general technique; the actual algorithm we will be
using is called the fast Fourier transform, or FFT. The FFT is a very
efficient method for calculating the transform and is designed for use on
digital computers. In its classic form, it works on sequences whose
length is a power of two (2, 4, 8, 16, 32, ...).
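
To make concrete what the transform computes, the sketch below
evaluates the discrete Fourier transform directly from its definition
and compares the result with an FFT. It assumes modern NumPy
(numpy.fft.fft) rather than the FFT module used in this series, and
naive_dft is a hypothetical helper written only for the comparison.

```python
import numpy as np

def naive_dft(x):
    """Direct O(N^2) evaluation of the discrete Fourier transform."""
    n = len(x)
    k = np.arange(n)
    # twiddle[k, m] = exp(-2*pi*i*k*m/n): one complex sinusoid per frequency
    twiddle = np.exp(-2j * np.pi * np.outer(k, k) / n)
    return twiddle @ x

rng = np.random.default_rng(0)
x = rng.standard_normal(64)            # 64 random time samples
agrees = np.allclose(naive_dft(x), np.fft.fft(x))
```

Both give the same answer; the FFT simply reaches it with far fewer
multiplications, which matters once sequences grow long.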

Let's try an example to see if we can get a handle on using the FFT module.
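
A minimal sketch of such a signal, written for modern NumPy rather than
the Numerical Python module used in this series. The names x and sig
match the text that follows; the exact expression is an assumption
reconstructed from that description.

```python
import numpy as np

x = np.arange(256.0)   # sample indices 0..255

# Two sine tones; each ratio is tone frequency / sample rate.
sig = (np.sin(2 * np.pi * (1250.0 / 10000.0) * x)
       + np.sin(2 * np.pi * (625.0 / 10000.0) * x))
```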

Now sig contains 256 samples of a signal composed of two tones
(generated by the sin function). The numbers in the inner parentheses
(1250/10000 and 625/10000) specify the frequency of each tone as a
fraction of the sample rate. Assuming a sample rate of 10,000 Hz, we
generated a 1250 Hz tone and a 625 Hz tone. Now let's plot the output of
the FFT of this signal.

>>> plot(x[0:129],10*log10(abs(real_fft(sig))))

Figure 1. Plot of the energy in our signal.

You might ask, "What happened to the other 127 samples? The output
only contains 129 outputs from the original 256 input values." Because
the FFT is designed for complex data (containing both a real and
an imaginary component), and we are using real data, the output is
symmetric. The missing 127 samples are simply mirror images (complex
conjugates) of outputs 1 through 127 in the output sequence. The
real_fft() function we are using knows this and saves effort by not
computing them. Using 10*log10(abs(...)) converts the energy
measurements of the FFT into energy on a logarithmic scale called
decibels. This is better suited for displaying signal power. The two
peaks tell us that this time sequence was composed of two tones. The
x-axis represents all frequencies from 0 to 5000 Hz divided into 128
bins, each representing energy found in a 5000/128 = 39.0625 Hz swath of
the frequency spectrum. Peak one is located at (5000/128)*16 = 625 Hz,
and peak two is located at (5000/128)*32 = 1250 Hz. These correspond to
the numbers we used to generate the tones. The values found in the
remaining bins are just numerical round-off noise. Compared to the
tones, with a power of about 20 dB, the "noise" is ~160 dB less in
power, about 1E-16 in amplitude compared to unity.
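
The bin arithmetic and the symmetry argument can both be checked
numerically. This is a minimal sketch assuming modern NumPy, where
numpy.fft.rfft plays the role of real_fft(); the variable names are
illustrative, and the signal is rebuilt from the description above so
that the example stands alone.

```python
import numpy as np

rate = 10000                       # assumed sample rate in Hz
x = np.arange(256.0)
sig = (np.sin(2 * np.pi * (1250.0 / rate) * x)
       + np.sin(2 * np.pi * (625.0 / rate) * x))

spectrum = np.abs(np.fft.rfft(sig))          # 129 values: bins 0..128
bin_width = (rate / 2) / 128                 # Hz covered by each bin

peaks = np.sort(np.argsort(spectrum)[-2:])   # indices of the two largest bins
peak_freqs = peaks * bin_width               # bin index converted to Hz
peak_db = 10 * np.log10(spectrum[peaks])     # about 21 dB for each tone

# Symmetry: for real input, bins 129..255 of the full FFT are the
# complex conjugates of bins 127..1, so they carry no new information.
full = np.fft.fft(sig)
mirrored = np.allclose(full[129:], np.conj(full[1:128][::-1]))
```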