Game audio is a complicated thing. To the end user, it should be mostly transparent - player does a thing, thing makes sound. Player is in a place, place has appropriate music. But in order to make sure that all these sounds fit together in a readable, aurally pleasing way, there are a few things to consider.

In this post, I’ll go through some basics on digital audio and the tools available to shape and refine that audio. In a follow-up post, I’ll talk about how to ultimately combine those elements into a cohesive mix with some audio examples. I will list any important jargon at the end for reference. This will mostly be oriented toward audio folks, but game designers, take note - a working knowledge of these concepts can be very helpful. Don’t be alarmed if you don’t get it at first, it’s weird stuff!

What is digital audio?

Digital audio is distinct from analog audio in a few key ways. Whereas analog recording is hamstrung by issues like noise, fidelity of playback equipment, decomposition of the medium, etc., digital audio’s primary limitation is that it can only capture audio so many times per second (the sample rate), and with only so much precision (bit depth). That’s a hilariously oversimplified explanation, but the important thing to understand is that you only have so many 1s and 0s to go around. If you play a sound through a digital medium that is nearly as loud as that medium’s maximum volume (the headroom), and layer another sound on top of that, there are no more 1s and 0s left in the headroom to express that extra volume, so any additional audio over the maximum is chopped off (known as clipping). This is a cardinal sin in audio production, and it’s a nearly universally reviled sound, except when used to creative effect. But rarely is that effect used on an entire mix.

Here, a sine wave is not clipped by the headroom - there’s enough to fit it all.

But here, the top of the waveform is “clipped” right off. This manifests aurally as bad, ugly distortion!
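To make clipping concrete, here’s a minimal Python sketch of what a hard clipper does to samples that exceed full scale. The function names and the 1.0 ceiling are my own illustrative choices, not from any audio library:

```python
import math

def clip(sample, ceiling=1.0):
    """Hard-clip a sample to the available headroom (illustrative only)."""
    return max(-ceiling, min(ceiling, sample))

# A sine wave that fits within full scale (1.0) survives intact...
quiet = [0.5 * math.sin(2 * math.pi * t / 100) for t in range(100)]
assert max(clip(s) for s in quiet) == max(quiet)

# ...but layer a second sound on top, and the peaks exceed 1.0
# and get chopped flat - that flat top is the distortion you hear.
loud = [s + 0.8 for s in quiet]
clipped = [clip(s) for s in loud]
print(max(loud), max(clipped))  # the overshoot above 1.0 is simply gone
```

The information above the ceiling isn’t turned down - it’s destroyed, which is why clipping can’t be fixed after the fact.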

So! We want to avoid clipping! How do we do that? First, we select a high enough sample rate and bit depth that allows us to play the audio back at the detail level we want. Arguably the most common setting is 44.1/16, or 44.1 kilohertz sample rate, 16 bit depth. This is the setting for CD audio, which was the first massively successful digital format. In general, it’s safe to consider 44.1/16 the minimum threshold of quality you should aim for. Personally, I find 48/24 to be much nicer to work with. It gives you a little more room for error, and there’s a perceivable difference in the dynamic range at 48/24, particularly in movie soundtracks or very dynamic music like classical or jazz. Whether end-users can tell the difference or not is the subject of a significant portion of internet arguments, and such minefields are out of the scope of this article.
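As a back-of-the-envelope check on what those numbers buy you, here’s a quick sketch - the function names are my own, and this ignores real-world details like file headers and compression:

```python
def levels(bit_depth):
    """Number of distinct amplitude steps a given bit depth can represent."""
    return 2 ** bit_depth

def bytes_per_second(sample_rate, bit_depth, channels=2):
    """Raw (uncompressed) data rate for a given format."""
    return sample_rate * (bit_depth // 8) * channels

print(levels(16))                    # 65536 loudness steps at 16-bit (CD)
print(levels(24))                    # 16777216 steps at 24-bit
print(bytes_per_second(44_100, 16))  # 176400 bytes/s for stereo CD audio
```

Each extra bit doubles the number of loudness steps, which is why 24-bit gives you so much more room for error than 16-bit.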

How do you fit all this audio in, then?

One easy solution is: make stuff quieter! But, if you make everything too quiet, not only are you making it harder for users to hear the audio, you’re actually not using the full breadth of the audio headroom, and you’re delivering a lower-detail signal with more audible quantization noise, because you’re not using the full resolution available to you. Rarely is this phenomenon the source of audio distortion, but since audio goes through so many processes and may ultimately need to be remastered, higher resolution is virtually always a better idea, as long as your computer can keep up. Like in every computer application, higher fidelity requires more processing power.

How do we keep our audio signal loud, but not *too* loud? One of the first-line solutions available to you for this dilemma is the compressor. A compressor does one very simple thing: it reduces, or compresses, the volume of an audio signal once it reaches a certain volume level, or threshold. You can think of it as a computerized version of someone turning the volume down when a song is too loud, and turning it up when it’s too soft. Virtually all compressors also utilize make-up gain, which is a fancy audio way of saying “louder." So when you compress an audio signal, then bring the volume up, the average volume of the signal goes up. This is one of those audio processes that still feels like magic to me. Unfortunately, like real magic, it’s hard to understand and easy to accidentally make horrible things happen. The important thing to understand is that compression sacrifices dynamic range to make an audio signal flatter - more consistent in volume, and easier to pair with other signals without exceeding your headroom.
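A toy, per-sample version of that idea might look like this. A real compressor adds attack and release smoothing (the gradual “turning the knob” part), so treat this strictly as an illustration of threshold, ratio, and make-up gain - all the numbers here are assumed for the example:

```python
def compress(sample, threshold=0.5, ratio=4.0, makeup=1.0):
    """Toy peak compressor: above the threshold, the overshoot is divided
    by `ratio`; make-up gain then raises the whole signal back up.
    Real compressors smooth this over time (attack/release)."""
    level = abs(sample)
    if level > threshold:
        level = threshold + (level - threshold) / ratio
    out = level if sample >= 0 else -level
    return out * makeup

# A loud peak (0.9) is pulled down toward the threshold...
print(compress(0.9))                  # ~0.6: the 0.4 overshoot becomes 0.1
# ...quiet material below the threshold passes through untouched...
print(compress(0.4))
# ...and make-up gain then lifts quiet and loud parts alike.
print(compress(0.9, makeup=1.4), compress(0.4, makeup=1.4))
```

Notice the trade: the peak and the quiet passage end up much closer together in volume, which is exactly the loss of dynamic range described above.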

Note the whole signal has been squished together! We’ve traded dynamics for consistency.

Help! Everything sounds all mushy now

Whoops! We compressed everything into a lifeless pancake-stew of smush-sounds! Turns out, if you compress everything too much, you squeeze all the life out of it. You’ve got another tool in your front-line arsenal - equalization, or EQ.

You’ve used EQ before. Trust me. If you’ve ever turned on Ludicrous Bass Mode on your Walkman, or used a setting on a receiver or car stereo that purported to transport you to a lush concert hall in the verdant hills of Albany, you’ve used EQ. EQ can do two things: it can boost certain frequencies and it can cut certain frequencies. How it does that, and how you operate it, depends on the type.

In this parametric EQ, the numbered points above and below the curved line either boost or cut volume, respectively.

EQ’s primary function is to “shape” sound. There are practically infinite ways to do it, but there are some boiler-plate methodologies that I subscribe to -

The high-pass and the low-pass

Also known as the low-cut and the high-cut respectively, a high-pass or low-pass filter lets the high or low frequencies “pass” through the EQ while cutting the rest. This is extremely useful. Remember how we only have enough bit depth resolution for so many 1s and 0s? Well, every sound you’ll ever record contains audio information from across the entire audio spectrum, even the stuff human ears can’t hear! Assuming you’re making audio for human people, there are some things we can strip out to free up some of that excess energy for the frequencies we can hear.
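As a sketch of the idea, here’s about the simplest possible first-order high-pass filter. Real EQs use much better-behaved filter designs, and `alpha` here is an assumed constant rather than a calibrated cutoff frequency - this is just to show the “blocks the slow stuff, passes the fast stuff” behavior:

```python
def high_pass(samples, alpha=0.95):
    """First-order high-pass sketch: passes fast changes, rejects the
    steady (DC/very low frequency) component. Not a production filter."""
    out = []
    prev_in = prev_out = 0.0
    for x in samples:
        y = alpha * (prev_out + x - prev_in)
        out.append(y)
        prev_in, prev_out = x, y
    return out

# A constant offset (0 Hz - as low as frequencies get) is exactly
# what a high-pass strips away:
dc = [1.0] * 200
filtered = high_pass(dc)
print(filtered[0], filtered[-1])  # starts near 1.0, decays toward 0.0
```

A low-pass is the mirror image: it keeps the slow-moving part of the signal and rejects the fast wiggles.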

Generally speaking, the human ear can perceive frequencies from 20 Hz to 20,000 Hz. This is impacted by age, health, all sorts of factors, but usually it’s safe to assume that frequencies under 40 Hz, while important and certainly detectable by a very nice pair of ears, can’t be reliably reproduced by the vast majority of speaker systems, and don’t do much more than muddy up the whole mix. There are absolutely exceptions to this, especially in movie theaters, live sound setups, Mad Max flamethrower trucks, etc.; don’t send me hate mail! The point is, if you’ve created an audio environment in your game that sounds muddy, oversaturated, or just noisy, eliminating the very low stuff can do wonders to free up your mix for the frequencies that are more important to the human ear.

Cutting high frequencies generally doesn’t have as much of a far-reaching impact. Excessive high frequencies can contribute to a “tinniness," a “sharpness," or just make everything more exhausting to listen to. The human ear does not like to hear very high frequencies for very long.

Again, remember that compression and EQ are MAGIC. Every sorcerer (audio engineer) implements their pixie dust differently. You will not master these overnight. I’ve been doing audio production for 20 years, and it still feels like guesswork sometimes, particularly in a very crowded mix. There are TONS more audio processors at our disposal that can solve these problems differently, but the lion’s share of audio processing is still compression and EQ.

In this post, we’ve equipped ourselves with the knowledge we need to shape sounds into a cohesive mix. In the next post, we’ll explore application with some audio examples.

Bit depth - How detailed is each sample? Bit depth is a measure of how many possible levels a sound’s loudness can be captured at. For example, a 1-bit depth (2¹) could only record two states per sample - sound, or no sound - whereas an 8-bit depth (2⁸) can express 256 states, 24-bit (2²⁴) 16.78 million, etc.

Clipping - Harsh, aurally displeasing distortion of sound caused by information loss when a digital system is overloaded with loudness.

Compressor - Audio process that reduces audio signal over a certain volume (threshold) at a certain ratio, with controls to re-amplify the signal afterwards, to produce a signal with a more consistent average volume.

Headroom - Once audio values exceed the bit depth’s maximum, any additional signal is effectively lost, or clipped. This limit is referred to as the headroom.

Hz - Hertz, or cycles per second, indicates the frequency of an audio signal.

Sample rate - How many times per second a sound is recorded or played back. Resolution is a function of time - higher resolutions take more “samples” of the audio signal per second, to allow for a more accurate reproduction.

Danny Baranowsky is a composer, musician and larger-than-life personality living in Seattle, Washington by way of Mesa, Arizona. Over the past decade, Danny has risen to the top of his field, composing the music for best-selling games Canabalt, Super Meat Boy, The Binding of Isaac, Desktop Dungeons, Crypt of the Necrodancer, and more. This year, Danny looks to expand his musical misadventures - working on solo material, game prototypes, chicken dinners, and even a live set! No task is too tall for Danny (he is 6’4”). Keep on the lookout for more music and tweets regarding the refresh rates and input latency of OLED monitors in the future.