"programming": ["audio", "low-level", "iOS"]

DFT Spectral Analysis with OpenGL

The idea for this little experiment/project of mine came about while I was exploring OpenGL through the use of Cinder, a C++ library that focuses on creative coding. Among its many features, it supports 2D and 3D graphics through OpenGL. As I was experimenting with building a particle engine, I started thinking about how audio can drive various graphical parameters and elements. We don’t see this very much, but it has some intriguing applications in video games and other interactive media.

We all know that randomness is a large part of sound and visual effects in games these days. To illustrate, let’s take an example of a magic spell; both its visual and aural element will have randomness, but will these sync up together? In other words, if a particular instance of the spell’s sound effect contains a lot of high harmonics, does the visual aspect of it respond by being brighter or more energetic? If the visual and aural parts of the spell are completely independent, there is less chance that they will behave as one cohesive unit, where the visual element is accurately reflective of the aural part. It’s a subtle thing, and not applicable in every case, but I strongly believe that these details add much to immersion, and impress upon us a more organic game world.

So now, how does DSP fit into this? Analysing a sound in the time domain (represented by the waveform) doesn’t give us any useful information as to the frequency content or the particular energy it contains across its spectrum. The DFT (Discrete Fourier Transform) is used to convert a signal into the frequency domain where this information resides. Instead of displaying amplitude (y axis) vs. time (x axis) in the time domain, the frequency domain plots the amplitude of each frequency component (y axis) vs. the frequency (x axis). This is a plot of the frequency domain in Audacity:

Frequency domain plot of a piano chord.

Spectral analysis using the DFT is nothing new and has been around for some time, but it’s due to the FFT (Fast Fourier Transform) and the dramatically increased floating-point performance of modern computers that using the DFT for analysis in real time has become practical. The DFT is calculated directly by correlating the signal with sinusoids at each frequency, which we know is very slow (on the order of N² operations for an N-point signal). The FFT uses a highly optimized algorithm (the Cooley-Tukey is the most widely used, but there are several others) to calculate the DFT of a signal. Going into detail on the math of the FFT is beyond my experience, but fortunately we don’t need to know its intricacies to use it effectively.
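To get a sense of why the direct computation is so costly, here’s a minimal sketch of a naive real-input DFT (my own illustrative code, not from the project):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Naive O(N^2) DFT of a real signal: correlate the input with a cosine
// and a sine at each analysis frequency k. For a 1024-point buffer this
// is over a million multiply-adds per call -- hence the need for the FFT.
void naiveDft(const std::vector<double>& x,
              std::vector<double>& re, std::vector<double>& im)
{
    const double PI = 3.14159265358979323846;
    const std::size_t N = x.size();
    re.assign(N, 0.0);
    im.assign(N, 0.0);
    for (std::size_t k = 0; k < N; ++k) {
        for (std::size_t n = 0; n < N; ++n) {
            const double w = 2.0 * PI * k * n / N;
            re[k] += x[n] * std::cos(w);   // cosine (real) component
            im[k] -= x[n] * std::sin(w);   // sine (imaginary) component
        }
    }
}
```

The FFT produces the same output in roughly N log N operations instead.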

The DFT works by breaking a signal down into its individual components. This is related to the Fourier Theorem, which states that any complex waveform can be constructed by the addition of sine waves — the building blocks of a signal. Each sine wave represents a single frequency component; its amplitude is the strength of that frequency. The DFT results in a complex signal with both cosine and sine wave components. This information isn’t terribly useful to us in this form. From the complex-valued signal, however, we find the magnitude of each frequency component by converting it into polar form. In rectangular form a complex number is represented as:

a + jb (DSP normally uses “j” instead of “i”), where a is the real part and b is the imaginary part

In polar form, a complex number is represented by an angle (the argument) and the magnitude (absolute value). The absolute value is what we’re interested in, and is calculated according to Pythagoras’ Theorem:

m = sqrt( a² + b² )

This value, plotted against frequency (the index of the complex number, corresponding to the index of the real-valued array), gives us the spectrum of the signal. In other words, if we take the FFT of an N-point signal, we get N complex values (pairs of real and imaginary parts) as output. Each index n in the complex signal represents a frequency along the x axis, with 0 being DC, N/2 being Nyquist, and N corresponding to the sampling rate. The magnitude of the complex number at each index gives us the amplitude of that frequency component in the signal. A final important point to make here is that we’re only interested in the frequency points up to N/2, because anything above Nyquist aliases in discrete time (the spectrum of a signal is actually mirrored around N/2). Here is a sample plot of a piano chord I recorded, using five 512-sample buffers from the audio:
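Turning an interleaved complex output (real parts at even indices, imaginary parts at odd) into a magnitude spectrum only takes a few lines; this is an illustrative sketch, not the project’s code:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Convert interleaved complex FFT output (re0, im0, re1, im1, ...)
// into a magnitude spectrum. Only bins 0..N/2 are kept, since the
// spectrum of a real signal mirrors around Nyquist.
std::vector<float> magnitudeSpectrum(const std::vector<float>& fft)
{
    const std::size_t numBins = fft.size() / 2;      // complex bin count
    std::vector<float> mag(numBins / 2 + 1);         // keep DC..Nyquist
    for (std::size_t k = 0; k < mag.size(); ++k) {
        const float re = fft[2 * k];
        const float im = fft[2 * k + 1];
        mag[k] = std::sqrt(re * re + im * im);       // Pythagoras' Theorem
    }
    return mag;
}
```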

Frequency plot of 5 512-sample buffers of a piano chord.

This plot shows us that most of the energy lies in the frequency range 0 – 0.04 (expressed as a fraction of the sampling rate; assuming a rate of 48kHz, this equates to approximately 0 – 1920 Hz). Now that we can extract the frequency information from an audio signal, I want to talk about how I use this data along with the RMS value to drive the particle engine I made.

There are a number of parameters in generating particles that I considered: the number of particles, how they’re generated/constructed, the radius, and their color. Particles are only generated during transients, and so I check a previous RMS value against the new one to determine if particles should be constructed. This method isn’t foolproof, but in my tests it’s been working quite well for my purposes. The number of particles generated is also related to the RMS of the signal — the louder it is, the more particles are made. This is evenly divided across the whole spectrum, so that at least a few particles from each frequency range are present. The radius of each particle is determined by the frequency component (the N value of the frequency plot above). The lower the frequency, the larger the particle. Finally, the color, or more specifically the brightness of the particle, is determined by the frequency amplitude. Let me go into a little more detail on each of these before presenting a little video I made demonstrating the program.

First, here is the function that gathers all the information (I’m using Portaudio for real-time playback, so all the audio processing happens in a callback function):
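In outline, the callback zeroes the FFT buffer, copies the incoming samples into it, and measures the buffer’s RMS. A minimal sketch of that preparation step (all names here are placeholders of mine; the real version runs inside the Portaudio callback):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Zero the interleaved FFT buffer, store each real sample at an even
// index (the odd, imaginary slots stay zero), and return the RMS of
// the buffer. Placeholder code, not the original callback.
float prepareFftBuffer(const float* samples, std::size_t numSamples,
                       std::vector<float>& interleaved)
{
    interleaved.assign(numSamples * 2, 0.0f);   // re, im pairs, all zeroed
    double sumSquares = 0.0;
    for (std::size_t i = 0; i < numSamples; ++i) {
        interleaved[2 * i] = samples[i];        // real part only
        sumSquares += samples[i] * samples[i];
    }
    return static_cast<float>(std::sqrt(sumSquares / numSamples));
}
```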

Since the FFT function I’m using requires interleaved data, the buffer needs to be zeroed out first, and then each real-valued sample is stored in the even-numbered indices. The RMS value is acquired using a function I wrote using SSE intrinsics for optimal speed.
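I won’t reproduce that function exactly, but an SSE-based RMS routine generally looks something like this (a sketch that assumes the buffer length is a multiple of four):

```cpp
#include <cmath>
#include <cstddef>
#include <xmmintrin.h>   // SSE intrinsics

// RMS via SSE: square and accumulate four floats per iteration, then
// fold the four partial sums. Assumes n is a multiple of 4; a scalar
// tail loop would cover the general case.
float rmsSse(const float* x, std::size_t n)
{
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 4) {
        const __m128 v = _mm_loadu_ps(x + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(v, v));
    }
    float partial[4];
    _mm_storeu_ps(partial, acc);
    const float sum = partial[0] + partial[1] + partial[2] + partial[3];
    return std::sqrt(sum / static_cast<float>(n));
}
```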

Inside the for loop, a Hamming window is applied to the signal prior to sending it to the FFT to taper it along the edges, minimizing or eliminating spectral leakage. In the application’s update method, I test for the presence of a transient in the current buffer. If one exists, particles are added.
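Generating and applying a Hamming window is only a few lines; here’s a sketch, with the window applied to the real (even-indexed) slots of the interleaved buffer:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Build a Hamming window of length N: w[n] = 0.54 - 0.46 cos(2*pi*n / (N-1)).
std::vector<float> hammingWindow(std::size_t N)
{
    const double PI = 3.14159265358979323846;
    std::vector<float> w(N);
    for (std::size_t n = 0; n < N; ++n)
        w[n] = static_cast<float>(0.54 - 0.46 * std::cos(2.0 * PI * n / (N - 1)));
    return w;
}

// Window the real samples, leaving the zeroed imaginary slots alone.
void applyWindow(std::vector<float>& interleaved, const std::vector<float>& w)
{
    for (std::size_t n = 0; n < w.size(); ++n)
        interleaved[2 * n] *= w[n];
}
```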

First I calculate a stride value that controls the speed at which we execute the following for loop (this evenly divides the number of particles across the x axis). As you can see, this is determined by the volume (calculated as RMS) of the current audio buffer. When creating each individual particle, the i/1024 argument determines the radius — as I mentioned this is based on the frequency value along the x axis. To constrain along reasonable bounds, this value is further calculated as

sqrt( 1 / (i / 1024) )
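In code, the radius mapping is simple; the volume-to-stride mapping is more of a tuning exercise, so the constants below are illustrative guesses rather than the program’s actual values:

```cpp
#include <algorithm>
#include <cmath>

// Stride over the 1024 interleaved spectrum slots: the louder the
// buffer, the smaller the stride, and so the more particles spawned.
// The 0.5f offset and the clamp are tuning guesses, not the real values.
int strideForRms(float rms)
{
    const int stride = static_cast<int>(1024.0f * (0.5f - rms));
    return std::max(stride, 2);   // advance at least one complex bin
}

// Particle radius from the spectrum index i: lower frequencies
// produce larger particles, per sqrt(1 / (i / 1024)).
float radiusForBin(int i)
{
    return std::sqrt(1.0f / (static_cast<float>(i) / 1024.0f));
}
```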

The color of each particle is first determined using HSV. The hue is randomly chosen, but the saturation and value (brightness) are based on the data.mag(i) * 18 argument — i.e. the amplitude of the frequency component. This is then converted into an RGB value to set the color of the individual particle.
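For reference, a minimal HSV-to-RGB conversion looks like this (Cinder has its own color utilities that do the equivalent; this sketch just shows what the conversion involves):

```cpp
#include <cmath>

// Minimal HSV -> RGB conversion (h in [0, 1), s and v in [0, 1]).
// The engine picks a random hue, then derives saturation and value
// (brightness) from the magnitude of the frequency bin.
void hsvToRgb(float h, float s, float v, float& r, float& g, float& b)
{
    h -= std::floor(h);                    // wrap hue into [0, 1)
    const int   i = static_cast<int>(h * 6.0f);
    const float f = h * 6.0f - i;          // position within the sextant
    const float p = v * (1.0f - s);
    const float q = v * (1.0f - f * s);
    const float t = v * (1.0f - (1.0f - f) * s);
    switch (i % 6) {
        case 0:  r = v; g = t; b = p; break;
        case 1:  r = q; g = v; b = p; break;
        case 2:  r = p; g = v; b = t; break;
        case 3:  r = p; g = q; b = v; break;
        case 4:  r = t; g = p; b = v; break;
        default: r = v; g = p; b = q; break;
    }
}
```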

One of the primary challenges of getting this working properly is mapping the output from the DFT analysis and RMS to usable values in the particle engine. That’s the reason for several of the “magic numbers” in the function above (1024 is not one of them; it’s double our sample length, which I said previously we only want from 0 – 0.5, and since the data stored is in interleaved format, the buffer is actually 1024 samples long). Quite a lot of experimenting went into fine-tuning these values to get good results, but more can be done in this area, which I am interested in exploring further. I feel that this is a good first step in this area, however, and I’ll be looking at ways to integrate this process into further projects and ideas I have brewing.

With all of that out of the way, here is the video demonstrating the result (note: my FPS took a hit due to recording, and my way-old, out-of-date video card didn’t help matters either):