Tuesday, November 27, 2012

Digital filters come in two flavors: IIR (or "Infinite Impulse Response") and FIR (or "Finite Impulse Response"). Those complex acronyms may confuse you, so let's shed a little light on the situation by defining both and explaining the differences.

Some people are interested in which is better. Unfortunately, as with many things, there is no easy answer to that question, other than "it depends", and sometimes what it depends on is your ears. I won't stray too deep into the field of opinions, but I will try to mention why some people claim one is better than the other and what some of the advantages and disadvantages are in different situations.

How Filters Work

When you design a filter, you start with a set of specifications. To audio engineers, this might be a bit vague, like "boost 1 kHz by 3 dB", but electrical engineers are usually trained to design filters with very specific constraints. However you start, there's usually some long set of equations and rules used to "design" the filter, depending on what type of filter you are designing and what the specific constraints are (to see one way you might design a filter, see this post on audio EQ design). Once the filter is "designed" you can actually process audio samples.

IIR Filters

Once the filter is designed, the filter itself is implemented as difference equations, like this:
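The equation itself is not reproduced here. As an illustration only, a second-order IIR filter could be written as follows, where the index i and the coefficient names b_0, b_1, b_2, a_1, a_2 are assumed notation rather than the original's:

```latex
y[i] = b_0\,x[i] + b_1\,x[i-1] + b_2\,x[i-2] - a_1\,y[i-1] - a_2\,y[i-2]
```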

In this case, y is an array storing the output, and x is an array storing the input. Note that each output is a linear function of previous inputs and outputs, as well as the current input.

In order to know the current value of y, we need to know the last value of y, and to know that, we must know still earlier values of y, and so on, all the way back until we reach our initial conditions. For this reason, this kind of filter is sometimes called a "recursive" filter. In principle, this filter can be given a finite input, and it will produce output forever. Because its impulse response is infinite, we call this filter an IIR, or "Infinite Impulse Response" filter.

(To further confuse the terminology, IIR filters are often designed with certain constraints that make them "minimum phase." While IIR filters are not all minimum phase, many people use the terms "recursive", "IIR" and "minimum phase" interchangeably.)

Digital IIR filters are often modeled after analog filters. In many ways, analog-modeled IIR filters sound like analog filters. They are very efficient, too: for audio purposes, they usually only require a few multiplies.

FIR Filters

FIR filters, on the other hand, are usually implemented with a difference equation that looks like this:
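The equation itself is not reproduced here. As an illustration only, a five-tap FIR filter with symmetrical coefficients could be written as follows; the index i, the tap count, and the coefficient names b_0, b_1, b_2 are assumptions of this sketch:

```latex
y[i] = b_0\,x[i] + b_1\,x[i-1] + b_2\,x[i-2] + b_1\,x[i-3] + b_0\,x[i-4]
```

Reading the coefficients from either end gives the same sequence (b_0, b_1, b_2, b_1, b_0), which is what "symmetrical" means here.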

In this case, we don't use previous outputs: in order to calculate the current output, we only need to know the current input and the previous n inputs. This may improve the numerical stability of the filter because roundoff errors are not accumulated inside the filter. However, generally speaking, FIR filters are much more CPU intensive for a comparable response, and have some other problems, such as high latency, and both pass-band and stop-band ripple.

If an FIR filter can be implemented using a difference equation that is symmetrical, like the one above, it has a special property called "linear phase." Linear phase filters delay all frequencies in the signal by the same amount, which is not possible with IIR filters.

Which Filter?

When deciding which filter to use, there are many things to take into account. Here are some of those things:

Some people feel that linear phase FIR filters sound more natural and have fewer "artifacts".

FIR filters are usually much more processor intensive for the same response.

FIR filters have "ripple" in both the passband and stopband, meaning the response is "jumpy". IIR filters can be designed without any ripple.

IIR filters can be easily designed to sound like analog filters.

IIR filters require careful design to ensure stability and good numerical error properties; however, that art is fairly advanced.

Saturday, September 8, 2012

I recently spent the day (yes, the entire day) compiling libjingle on OS X. I'm still running OS X 10.6.8, so that may have been part of the problem, but there are clearly some deeper issues. I thought I'd document the changes I had to make to the compilation instructions in case anyone else (like me in the future) has to go through this nightmare.

First off, the package includes compilation instructions in the README file. This file has some organizational issues (for example, the dependencies expat and srtp are not listed under the "prerequisites" section, but rather the "libjingle" section) and does not account for some bugs I found, but otherwise includes some pretty good detail. Unfortunately, all the "examples" they give are for Windows, so I imagine that's where all the development and testing is done. Still, you need to read it. This post is just an outline and only goes into detail where the README doesn't explain things.

Also, there's no longer an active mailing list to go to ask questions, which is sad because that would be a good place to bring these issues up (there are already bugs posted for most of the fixes). It also makes me think maybe libjingle is dead or on critical life-support. (The mailing list linked from the developer's page is currently nonexistent, and the link from their blog to the "google talk help center" goes to archive.org!) If you need help, your best bet is probably stackoverflow.com, which is a great place to go for help, but it's no substitute for a mailing list.

Compiling libjingle

Download and extract libjingle from the google code page. I used 0.6.14 for this.

Be sure to extract it somewhere without any weird characters in the path (including spaces) or the build will barf.

Create a makefile (below) at the top level of libjingle. This will be especially useful in case you need to run the build over and over again as you tweak things.

Install the prerequisites (see the README for more details)

Python should already be installed

To install scons, I recommend homebrew: $ brew install scons

Download swtoolkit and extract it as talk/third_party/swtoolkit

Download gtest and extract it as talk/third_party/gtest

Download expat 2.0.1 and extract it as talk/third_party/expat-2.0.1

Download srtp and extract it as talk/third_party/srtp

Apply the following fixes:

Fix talk/third_party/swtoolkit/site_scons/site_init.py as described here and here.

Thursday, August 23, 2012

In my last post, I looked at why it's usually better to do EQ (or filtering) in the time domain than the frequency domain as far as audio is concerned, but I didn't spend much time explaining how you might implement a time-domain EQ. That's what I'm going to do now.

The theory behind time-domain filters could fill a book. Instead of trying to cram you full of theory we'll just skip ahead to what you need to know to do it. I'll assume you already have some idea of what a filter is.

Audio EQ Cookbook

The Audio EQ Cookbook by Robert Bristow-Johnson is a great, albeit very terse, description of how to build basic audio EQs. These EQs can be described as second order digital filters, sometimes called "biquads" because the equation that describes them contains two quadratics. In audio, we sometimes use other kinds of filters, but second order filters are a real workhorse. First order filters don't do much: they generally just allow us to adjust the overall balance of high and low frequencies. This can be useful in "tone control" circuits, like you might find on some stereos and guitars, but not much else. Second order filters give us more control -- we can "dial in" a specific frequency, or increase or decrease frequencies above and below a certain threshold, with a fair degree of accuracy, for example. If we need even more control than a second order filter offers, we can often simply take several second order filters and place them in series to simulate the effect of a single higher order filter.

Notice I said series, though. Don't try putting these filters in parallel, because they not only alter the frequency response, but also the phase response, so when you put them in parallel you might get unexpected results. For example, if you take a so-called all-pass filter and put it in parallel with no filter, the result will not be a flat frequency response, even though you've combined the output of two signals that have the same frequency response as the original signal.

Using the Audio EQ Cookbook, we can design a peaking, high-pass, low-pass, band-pass, notch (or band-stop), or shelving filter. These are the basic filters used in audio. We can even design that crazy all-pass filter I mentioned, which actually does come in handy if you are building a phaser. (It has other uses, too, but that's for another post.)

Bell Filter

Let's design a "bell", or "peaking" filter using RBJ's cookbook. Most other filters in the cookbook are either similar to the bell or simpler, so once you understand the bell, you're golden. To start with, you will need to know the sample rate of the audio going into and coming out of your filter, and the center frequency of your filter. The center frequency, in the case of the bell filter, is the frequency that is "most affected" by your filter. You will also want to define the width of the filter, which can be done in a number of ways, usually with some variation on "Q" or "quality factor" and "bandwidth". RBJ's filters define bandwidth in octaves, and you want to be careful that you don't extend the top of the bandwidth above the Nyquist frequency (or 1/2 the sample rate), or your filter won't work. We also need to know how much of our center frequency to add in dB (if we want to remove, we just use a negative value, and for no change, we set that to 0).

Fs = Sample Rate
f0 = Center Frequency (always less than Fs/2)
BW = Bandwidth in octaves
g = gain in dB

Great! Now we are ready to begin our calculations. First, RBJ suggests calculating some intermediate values:

A = 10^(g/40)
w0 = 2*pi*f0/Fs
c = cos(w0)
s = sin(w0)
alpha = s*sinh( ln(2)/2 * BW * w0/s )

This is a great chance to use that hyperbolic sine button on your scientific calculator that, until now, has only been collecting dust. Now that we've done that, we can finally calculate the filter coefficients, which we use when actually processing data:

b0 = 1 + alpha*A
b1 = -2*c
b2 = 1 - alpha*A
a0 = 1 + alpha/A
a1 = -2*c
a2 = 1 - alpha/A

Generally speaking, we want to "normalize" these coefficients, so that a0 = 1. We can do this by dividing each coefficient by a0. Do this in advance or the electrical engineers will laugh at you:

b0 /= a0
b1 /= a0
b2 /= a0
a1 /= a0
a2 /= a0

Now, in pseudocode, here's how we process our data, one sample at a time using a "process" function that looks something like this:

You'll probably have some kind of loop that your process function goes in, since it will get called once for each audio sample.

There's actually more than one way to implement the process function given that particular set of coefficients. This implementation is called "Direct Form I" and happens to work pretty darn well most of the time. "Direct Form II" has some admirers, but those people are either suffering from graduate-school-induced trauma or actually have some very good reason for doing what they are doing that in all likelihood does not apply to you. There are of course other implementations, but DFI is a good place to start.

You may have noticed that the output of the filter, y, is stored and used as an input to future iterations. The filter is therefore "recursive". This has several implications:

The filter is fairly sensitive to errors in the recursive values and coefficients. Because of this, we need to take care of what happens with the error in our y values. In practice, on computers, we usually just need to use a high resolution floating point value (i.e., double precision) to store these (on fixed point hardware, it is often another matter).

Another issue is that you can't just blindly set the values of your coefficients, or your filter may become unstable. Fortunately, the coefficients that come out of RBJ's equations always result in stable filters, but don't go messing around. For example, you might be tempted to interpolate coefficients from one set of values to another to simulate a filter sweep. Resist this temptation or you will unleash the numerical fury of hell! The values in between may be "unstable", meaning that your output will run off to infinity. Madness, delirium, vomiting and broken speakers are often the unfortunate casualties.

On some platforms you will have to deal with something called "denormal" numbers. This is a major pain in the ass, I'm sorry to say. Basically it means our performance will be between 10 and 100 times worse than it should be because the CPU is busy calculating tiny numbers you don't care about. This is one of the rare cases where I would advocate optimizing before you measure a problem because sometimes your code moves around and it comes up and it's very hard to trace this issue. In this case, the easiest solution is probably to do something like this (imagine we are in C for a moment):
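The snippet that belongs here is not reproduced. One common approach, sketched below, is to flush tiny values to zero before they are stored back into the filter's history. The helper name flush_tiny and the 1e-30 threshold are illustrative assumptions, not necessarily what the original code used.

```c
#include <math.h>

/* Hypothetical helper: anything smaller in magnitude than 1e-30 (about
   -600 dB) is treated as silence, so the recursive state never decays
   into the denormal range. The threshold is an assumption of this sketch. */
static double flush_tiny(double v)
{
    return (fabs(v) < 1e-30) ? 0.0 : v;
}
```

In a recursive filter you would apply this to each output value just before saving it for the next iteration.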

Wednesday, August 8, 2012

In my last post, I discussed how various audio processing may be best done in the frequency or time domain. Specifically, I suggested that EQ, which is a filter that alters the frequency balance of a signal, is best done in the time domain, not the frequency domain. (See my next post if you want to learn how to implement a time-domain filter.)

If this seems counterintuitive to you, rest assured you are not alone. I've been following the "audio" and "FFT" tags (among others) on Stack Overflow and it's clear that many people attempt to implement EQs in the frequency domain, only to find that they run into a variety of problems.

Frequency Domain Filters

Let's say you want to eliminate or reduce high frequencies from your signal. This is called a "low-pass" filter, or, less commonly, a "high-cut" filter. In the frequency domain, high frequencies get "sorted" into designated "bins", where you can manipulate them or even set them to zero. This seems like an ideal way to do low-pass filtering, but let's explore the process to see why it might not work out so well.

Our first attempt at a low-pass filter, implemented with the FFT might look something like this:

loop on audio input

if enough audio is received, perform FFT, which gives us audio in the frequency domain

in frequency domain, perform manipulations we want. In the case of eliminating high frequencies, we set the bins representing high frequencies to 0.

perform inverse FFT, to get audio back in time domain

output that chunk of audio

But there are quite a few problems with that approach:

We must wait for a chunk of audio before we can even begin processing, which means that we will incur latency in our processing. The higher quality filter we want, the more audio we need to wait for. If the input buffer size does not match the FFT size, extra buffering needs to be done.

The FFT, though efficient compared to the DFT (which is the FFT without the "fast" part), performs worse than linear time, and we need to do both the FFT and its inverse, which is computationally similar. EQing with the FFT is therefore generally very inefficient compared to comparable time-domain filters.

Because our output chunk has been processed in the frequency domain independent of samples in neighboring chunks, the audio in neighboring chunks may not be continuous. One solution is to process the entire file as one chunk (which only works for offline, rather than real-time processing, and is computationally expensive). The better solution is the OLA or Overlap Add method but this involves complexity that many people miss when implementing a filter this way.

Filters implemented via FFT, as well as time-domain filters implemented via IFFT, often do not perform the way people expect. For example, many people expect that if they set all values in bins above a certain frequency to 0, then all frequencies above the given frequency will be eliminated. This is not the case. Instead, frequency responses at the bin values will be 0, but the frequency response between those values is free to fluctuate -- and it does fluctuate, often greatly. This fluctuation is called "ripple." There are techniques for reducing ripple but they are complex, and they don't eliminate ripple. Note that, in general, frequencies across the entire spectrum are subject to ripple, so even just manipulating a small frequency band may create ripple across the entire frequency spectrum.

FFT filters suffer from so-called "pre-echo", where the sounds can be heard before the main sound hits. In and of itself, this isn't really a problem, but sounds are "smeared" so badly by many designs that many in the audio world feel that these filters can affect the impact of transients and stereo imaging if not implemented and used correctly.

So it's clear that FFT filters may not be right, or if they are, they involve much more complexity than many people first realize.

As a side note, one case where it might be worth all that work is a special case of so-called FIR filters (also sometimes called "Linear phase" filters). These are used sometimes in audio production and in other cases. In audio, they are usually used only in mastering because of their high latency and computational cost, but even then, many engineers don't like them (while others swear by them). FIR filters are best implemented in the time domain, as well, until the number of "taps" in the filter becomes enormous, which it sometimes does, and it actually becomes more efficient to implement using an FFT with OLA. FIR filters suffer from many of the problems mentioned above including pre-echo, high computational cost and latency, but they do have some acoustical properties that make them desirable in some applications.

Time Domain Filters

Let's try removing high frequencies in the time domain instead. In the time domain, high frequencies are represented by the parts of the signal that change quickly, and low frequencies are represented as the parts that change slowly. One simple way to remove high frequencies, then, would be to use a moving average filter:

y(n) = { x(n) + x(n-1) + .... + x(n-M) } / (M+1)

where x(i) is your input sample at time i, and y(i) is your output sample at time i. No FFT is required for that. (This is not the best filter for removing high frequencies -- in fact we can do WAY better -- but it is my favorite way to illustrate the point. The moving average filter is not uncommon in economics, image processing and other fields partly for this reason.) Several advantages are immediately obvious, and some are not so obvious:

Each input sample can be processed one at a time to produce one output sample without having to chunk or wait for more audio. Therefore, there are also no continuity issues and minimal latency.

It is extremely efficient, with only a few multiplies, adds and memory stores/retrievals required per sample.

These filters can be designed to closely mimic analog filters.
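The moving average equation above can be sketched in C as follows. The function name moving_average is hypothetical, and treating samples before the start of the input as zero is an assumption of this sketch. A running sum keeps the cost per sample constant regardless of M.

```c
#include <stddef.h>

/* y(n) = { x(n) + x(n-1) + ... + x(n-M) } / (M+1).
   Samples before the start of the input are treated as zero
   (an assumption of this sketch). */
static void moving_average(const double *x, double *y, size_t len, size_t M)
{
    double sum = 0.0;
    for (size_t n = 0; n < len; n++) {
        sum += x[n];                       /* newest sample enters the window */
        if (n > M) sum -= x[n - M - 1];    /* oldest sample leaves the window */
        y[n] = sum / (double)(M + 1);
    }
}
```

Each output needs only the newest input and the one that just fell out of the window, so samples can be processed as they arrive.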

A major disadvantage is that it is not immediately obvious how to design a high-quality filter in the time domain. In fact, it can take some serious math to do so. It's also worth noting that many time-domain filters, like frequency domain filters, also suffer from ripple, but for many design methods, this ripple is well defined and can be limited in various ways.

In the end, the general rule is that for a given performance, you can get much better results with the time-domain than the frequency domain.

Saturday, August 4, 2012

In the last post I discussed one use for the FFT: pitch tracking. I also mentioned that there were better ways to do pitch tracking. Indeed, aside from improvements on that method, you could also use entirely different methods that don't rely on the FFT at all.

The FFT transforms data into the "frequency domain", or, if your data is broken down into chunks, the FFT transforms it into the "time-frequency domain," which we often still think of as the frequency domain. However, the most basic "domain" you can work in is usually the "time domain." In the time domain, audio is represented as a sequence of amplitude values. You may know this as "PCM" audio. This is what's usually stored in WAVs and AIFs, and when we access audio devices like soundcards, this is the most natural way to transfer data. It turns out we can also do a whole lot of processing and analysis in the time domain as well.

Process                                         | Time Domain | Frequency Domain
Filtering/EQ                                    | Yes!        | No!
Pitch Shifting                                  | Okay        | Okay
Pitch Tracking                                  | Okay        | Okay
Reverb (Simulated)                              | Yes!        | No!
Reverb (Impulse)                                | No!         | Yes!
Guitar effects (Chorus/flanger/distortion/etc)  | Yes!        | No!
SR Conversion                                   | Yes!        | No!
Compression                                     | Yes!        | No!
Panning, Mixing, etc                            | Yes!        | No!

Table 1: Recommendations for Audio Processing in the Time Domain vs. the Frequency Domain

Wow, so impulse reverb is really the only thing on that list you need an FFT for? Actually even that can be done in the time domain, it's just much more efficient in the frequency domain (so much so that it might be considered impossible in the time domain).

You might wonder how to adjust the frequency balance of a signal, which is what an EQ does, in the time domain rather than the frequency domain. Well, you can do it in the frequency domain, but you are asking for trouble. I'll talk about this in my next post.

Sunday, July 22, 2012

It's not necessarily as simple as it seems to find the pitch from an FFT. Some pre-processing is required as well as some knowledge of how the data is organized.

How to track pitch with the FFT seems to be a very commonly asked question on Stack Overflow. Many people seem to think tracking pitch is as simple as putting your data into an FFT, and looking at the result. Unfortunately, this is not the case. Simply applying an FFT to your input, even if you know what size FFT to use, is not going to give you optimal results, although it might work in some cases.

At the end of the day, the FFT is not actually the best method available for tracking or detecting the pitch of an audio signal. While it is possible to make a good pitch tracker using the FFT, doing it right requires a tremendous amount of work. The algorithm shown here works, and works pretty well, but if you need something that converges on the correct pitch really quickly, is very accurate, or tracks multiple notes simultaneously, you need something else.

Still, you can create a decent pitch tracking algorithm that's reasonably easy to understand using the FFT. It doesn't require too much work, and I've explained it and provided code, in the form of a command-line C guitar tuner app which you can get from GitHub. It compiles and runs on Mac OS X and you should be able to get it to run on other platforms without much trouble. If you want to port to other languages, that shouldn't be too hard either. It's worth noting that I specifically designed this app to be similar to the tuner described by Craig A. Lindley in Digital Audio with Java, so if you are looking for Java source code, you can check out his code (although there are differences between his code and mine).

The Big Picture

To do our pitch detection, we basically loop on the following steps:

Read enough data to fill the FFT

Low-pass the data

Apply a window to the data

Transform the data using the FFT

Find the peak value in the transformed data

Compute the peak frequency from the index of the peak value in the transformed data

This is the main processing loop for the tuner, with some stuff left out:

Audio Data

We always need to start with a sequence of numbers representing the amplitude of audio over time (sometimes called "Linear, PCM audio"). This is what we get from most uncompressed audio formats like AIFF and WAV. It's also what you get from audio APIs like ASIO, CoreAudio and ALSA. In this case, we are using PortAudio, which acts like a portable wrapper around these and other APIs. If you have a compressed format such as MP3 or OGG, you will have to convert it to uncompressed audio first.

Your data might be 16-bit integer, 8-bit integer, 32-bit floating point or any number of other formats. We'll assume you know how to get your data to floating point representation in the range from -1 to 1. PortAudio takes care of this for us when we specify these input parameters:
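The parameter listing itself is not reproduced here. For readers doing the conversion by hand rather than through PortAudio, a minimal sketch of scaling 16-bit integer samples into the -1 to 1 range might look like this. The function name int16_to_float is hypothetical, and dividing by 32768 is one common convention among several.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: scale signed 16-bit samples into [-1, 1).
   Dividing by 32768 maps the most negative sample to exactly -1. */
static void int16_to_float(const int16_t *in, float *out, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        out[i] = (float)in[i] / 32768.0f;
    }
}
```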

You'll also need to know how often your audio is sampled. For a tuner, less is more, so we'll use a sample rate of 8 kHz, which is available on most hardware. This is extremely low for most audio applications (44.1 kHz is considered standard for audio and 48 kHz is standard for video), but for a tuner, 8 kHz is plenty.

#define SAMPLE_RATE (8000)

Low-Pass Filtering

There's no hard and fast rule about low-pass filtering (or simply "low-passing") your audio data. In fact, it's not even strictly necessary, but doing so can get rid of unwanted noise and the higher frequencies that sometimes masquerade as the fundamental frequency. This is important because some instruments have component frequencies called harmonics that are more powerful than the "fundamental" frequencies, and usually we are interested in the fundamental frequencies. Filtering, therefore, can improve the reliability of the rest of the pitch tracker significantly. Without filtering, some noise might appear to be the dominant pitch, or, more likely, the dominant pitch might appear to be a harmonic of the actual fundamental frequency.

A good choice for the filter is a low-pass filter with a center frequency around or a little above the highest pitch you expect to detect. For a guitar tuner, this might be the high E string, or about 330 Hz. So that's what we'll use -- in fact, we low-pass it twice. If you are modifying the code for another purpose, you can set the center frequency to something that makes sense for your application.

If you aren't sure, or you want something less aggressive, you could try a moving average filter, which simply outputs the average of the current input and some number of previous inputs. Intuitively, we can understand that this filter reduces high frequencies because signals that change quickly get "smoothed" out.

The moving average filter won't make a huge difference, but if the low pass filter I used in my code doesn't suit you and you don't have the degree in electrical engineering required to design the right digital filter (or don't know what the right filter is), it might be better than nothing. I haven't tested the moving average filter myself.

Windowing

Generally speaking, FFTs work in chunks of data, but your input is a long or even continuous stream. To fit this round peg into this square hole, you need to break off chunks of your input, and process the chunks. However, doing so without proper treatment may prove detrimental to your results. In rough terms, the problem is that the edges get lopped off very sloppily, creating artifacts at frequencies that aren't actually present in your signal. These artifacts, called "sidelobes", cause problems for many applications. I know that some tuners are designed without special treatment, so you can skip this step, but I strongly recommend you keep reading because it's easy to deal with this problem.

To reduce the sidelobes, we premultiply each chunk of audio with another signal called a window, or window function. Two simple and popular choices for window functions are the Hamming window, and the Hann window. I put code for both in the tuner, but I used the Hann window.

For a tuning app, the windows may overlap, or there may be gaps in between them, depending on your needs and your available processing power. For example, by overlapping and performing more FFTs, and then averaging the results, you may get more accurate results more quickly, at the cost of more CPU time. I strongly recommend doing this in real apps. I did not do this in my app to make the code easier to follow, and you'll see that the values sometimes jump around and don't respond smoothly.

FFT

The FFT, or Fast Fourier Transform, is an algorithm for quickly computing the frequencies that comprise a given signal. By quickly, we mean O(N log N). This is way faster than the O(N^2) that the Fourier transform took before the "fast" algorithm was worked out, but still not linear, so you are going to have to be mindful of performance when you use it. Because the FFT is now the standard way to compute the Fourier transform, many people often use the terms interchangeably, even though this is not strictly correct.

The FFT works on a chunk of samples at a time. You don't get more or less data out of a Fourier Transform than you put into it, you just get it in another form. That means that if you put ten audio samples in you get ten data-points out. The difference is that these ten data points now represent energy at different frequencies instead of energy at different times, and since our data uses real numbers, and not complex, the FFT will contain some redundancies -- specifically, only the first half of the spectrum contains relevant data. That means that for ten samples in, we really only get five relevant data-points out.

Clearly, the more frequency resolution you need, the more time data you need to give it. However, at some point you will run into the problem of not being able to return results quickly enough, either because you are waiting for more input, or because it takes too long to process. Choosing the right size FFT is critical: too big and you consume lots of CPU and delay getting a response, too small and your results lack resolution.

How do we know how big our FFT should be? You can determine the accuracy of your FFT with this simple formula:

binSize = sampleRate/N ;

For example, with an FFT size of 8192 (most implementations of the FFT work best with powers of 2), and a sample rate of 44100, you can expect to get results that are accurate to within about 5.38 Hz. Not great for a tuner, but, hey, that's why we are sampling at 8000 Hz, which gives us an accuracy of better than 1 Hz. Still not perfect for, say, a 5 string bass, but you can always use a larger N if you need to. Keep in mind that getting enough samples to get that much accuracy takes longer than a second, so our display only updates about once a second. That's yet another reason you might want to overlap your windows.
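Worked through with the formula above, and assuming N = 8192 in both cases (which is what the figures in the text imply), the arithmetic is:

```latex
\begin{aligned}
\text{binSize} &= \frac{44100\ \text{Hz}}{8192} \approx 5.38\ \text{Hz} \\
\text{binSize} &= \frac{8000\ \text{Hz}}{8192} \approx 0.98\ \text{Hz} \\
\text{time to fill the FFT} &= \frac{8192\ \text{samples}}{8000\ \text{samples/s}} = 1.024\ \text{s}
\end{aligned}
```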

The output of the FFT is an array of N complex numbers. It is possible to use both the real and imaginary part to get very accurate frequency information, but for now we'll settle for something simpler and much easier to understand: we simply look at the magnitude. To find the magnitude of each frequency component, we use the distance formula:
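The formula is not reproduced here. Written out, with X_k as assumed notation for the complex value in bin k, the distance formula is:

```latex
|X_k| = \sqrt{\operatorname{Re}(X_k)^2 + \operatorname{Im}(X_k)^2}
```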

Now that we know the magnitude of each FFT bin, finding the frequency is simply a matter of finding the bin with the maximum magnitude. The frequency will then be the bin number times the bin size, which we computed earlier. Note that we don't actually need to compute the square root to find the maximum magnitude, so our actual code skips that step.
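A minimal sketch of that peak search in C might look like the following. This is not the tuner's actual code: the function name peak_frequency and the layout of the FFT output as separate re and im arrays are assumptions. As the text notes, it compares squared magnitudes and never takes a square root.

```c
#include <stddef.h>

/* Hypothetical helper: re[] and im[] hold the real and imaginary parts of
   an n-point FFT of real input. Returns the frequency in Hz of the
   strongest bin, or 0 if n is 0. */
static double peak_frequency(const float *re, const float *im,
                             size_t n, double sample_rate)
{
    size_t peak_bin = 0;
    double peak_power = -1.0;

    if (n == 0) return 0.0;

    /* Only the first half of the spectrum is meaningful for real input. */
    for (size_t k = 0; k < n / 2; k++) {
        /* Squared magnitude: skipping sqrt does not change which is largest. */
        double power = (double)re[k] * re[k] + (double)im[k] * im[k];
        if (power > peak_power) {
            peak_power = power;
            peak_bin = k;
        }
    }

    /* Bin number times bin size, where binSize = sampleRate / N. */
    return (double)peak_bin * sample_rate / (double)n;
}
```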

More

We do a bit more in our code, like identify the nearest semi-tone and find the difference between that semi-tone and the identified frequency, but for stuff like that we'll leave the code to speak for itself.

Friday, June 29, 2012

I recently had occasion to use the original Freeverb code by Jezar at Dreampoint. There are several variations on this, including Freeverb 3, a complex GPL library, and a bunch of packages from CCRMA, but these are bloated things, not conducive to my needs for a variety of reasons. It took some digging to find the original, and when I did it was buried in a mailing-list archive with the wrong file extension, so I thought I'd post it here to make it easier for anyone else.

Monday, April 30, 2012

Ars Technica, among others, has been talking about Apple's new "Mastered for iTunes" product campaign. They talked to some real mastering engineers and got some real information about audio compression and how carefully tweaking the master before compression might make a difference to sound quality after compression.

It's an interesting article and worth a read. Mostly, I think the conclusions are probably correct, although I think "Mastered for iTunes" fails to address the real problem of poor audio quality in most of the music we listen to today, which has absolutely nothing to do with the delivery format.

Unfortunately, they also managed to let loose some audio myths. Here are some corrections:

Confusing

Using 16 bits for each sample allows a maximum dynamic range of 96dB. (It's even possible with modern signal processing to accurately record and playback as much as 120dB of dynamic range.) Since the most dynamic modern recording doesn't have a dynamic range beyond 60dB, 16-bit audio accurately captures the full dynamic range of nearly any audio source.

This is basically correct, but it sure is confusing. If you want to learn more, you can read all the gory details about the process, called dithering, at Bob Katz's website. (I am not sure where they got 60 dB from. That's HUGE even for orchestral music. If they are citing this source, they are confusing dB dynamic range with dB absolute volume. I am also not sure where the 120dB figure comes from -- that seems like a very contrived laboratory condition.)
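For what it's worth, the 96dB figure comes from the usual rule of thumb of roughly 6 dB per bit. A quick sketch of that calculation, ignoring dither and noise shaping (the function name is just for illustration):

```python
import math


def dynamic_range_db(bits):
    """Ratio of full scale to one quantization step, in dB."""
    return 20 * math.log10(2 ** bits)


print(round(dynamic_range_db(16), 1))  # 96.3
print(round(dynamic_range_db(24), 1))  # 144.5
```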

Reality vs Theory

The maximum frequency that can be captured in a digital recording is exactly one-half of the sampling rate. This fact of digital signal processing life is brought to us by the Nyquist-Shannon sampling theorem, and is an incontrovertible mathematical truth. Audio sampled at 44.1kHz can reproduce frequencies up to 22.05kHz. Audio sampled at 96kHz can reproduce frequencies up to 48kHz. And audio sampled at 192kHz—some studios are using equipment and software capable of such high rates—can reproduce frequencies up to 96kHz.

Unfortunately, there's a big difference between "incontrovertible mathematical truth" and what can actually be implemented in hardware and software. In the real world, we need to filter out all frequencies above the so-called Nyquist limit (one half the sample rate), or we get nasty artifacts called "aliasing". And, in the real world, there is no filter that lets us keep everything below the limit and reject everything above the limit, so if we want this to work, we need a buffer between what we can hear and the Nyquist limit. That's why 44.1 kHz and not 40 kHz was chosen for CDs to reproduce up to 20 kHz audio. (Ideal filters could be designed if we relaxed certain constraints, such as one known formally as "causality", and if we had an infinite amount of data to work with.)
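To see why the filter matters, here is a small sketch of where a pure tone ends up if it is sampled with no anti-aliasing filter at all. The function name is hypothetical, and it only models ideal sampling of a single sine wave.

```python
def alias_frequency(freq_hz, sample_rate):
    """Apparent frequency of a pure tone after ideal sampling, unfiltered."""
    folded = freq_hz % sample_rate
    # Anything above the Nyquist limit folds back below it.
    return min(folded, sample_rate - folded)


print(alias_frequency(1000, 44100))    # 1000: below Nyquist, unchanged
print(alias_frequency(25000, 44100))   # 19100: folds back into the audible range
print(alias_frequency(30000, 44100))   # 14100

# The "buffer" the filter has to work with on a CD:
print(44100 / 2 - 20000)               # 2050.0 Hz between 20 kHz and Nyquist
```

Everything above the Nyquist limit lands somewhere below it, which is why it has to be removed before sampling rather than after.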

Typical Hearing

However, human ears have a typical frequency range of about 20Hz to 20kHz. This range varies from person to person—some people can hear frequencies as high as 24kHz—and the frequency response of our ears also diminishes with age. For the vast majority of listeners, a 44.1kHz sampling rate is capable of producing all frequencies that they can hear.

Haha. Sure, maybe my 9-week-old son can hear 24kHz, but I doubt it. The range of human hearing which is so often cited as 20Hz to 20kHz does vary from person to person (last time I checked, a few years ago, my hearing went up to about 17kHz), but the 20Hz to 20kHz range is anything but typical. An acoustics textbook puts this more accurately: "a person who can hear over the entire audible range of 20-20000 Hz is unusual." I would go further and say such a person is not living in the modern world, reading Ars Technica and buying pop or rock albums. Modern life and aging destroy the tiny hairs in our ears that are sensitive to those frequencies and that's all there is to it. Some people think they have better hearing because they are audiophiles. In fact, they may have superior hearing, but that has nothing to do with how well their ears work: exposure and critical listening improve our ability to hear. We exercise the appropriate parts of our brain and our hearing improves ("Golden Ears" is an example of a product designed for just that purpose).

Some people are reportedly sensitive to "supersonic" frequencies (it may give them headaches, for example). This is not the same as hearing.

"Neither audio transducers nor power amplifiers are free of distortion, and distortion tends to increase rapidly at the lowest and highest frequencies," according to Xiph Foundation founder Chris Montgomery, who created the Ogg Vorbis audio format. "If the same transducer reproduces ultrasonics along with audible content, any nonlinearity will shift some of the ultrasonic content down into the audible range as an uncontrolled spray of intermodulation distortion products covering the entire audible spectrum. Nonlinearity in a power amplifier will produce the same effect."

Chris Montgomery is surely a genius, but I don't think he should be considered the authority on analog electronics. I think many analog engineers will tell a different story: when ultrasonics are pushed through most analog equipment they are steeply attenuated. Their phase might be altered, and they may produce some IM distortion, but at a very low level. For the most part, the ultrasonics might as well not be there. On the other hand, a higher sample rate gives the benefit of allowing less stringent Nyquist filters, which reduces the amount of distortion in the DAC. I think compelling arguments could be made either way, although I'm not a proponent of 96 kHz consumer formats. Even in the studio, well designed DSP mitigates the need for high sample rates, though frequent AD/DA conversion may sound better at a high sample rate.

What Mastering is

When mastering engineers create a master file for CD reproduction, they downsample the 24/96 file submitted by the recording studio to 16/44.1. During this process, the mastering engineer typically adjusts levels, dynamic compression, and equalization to extract as much "good" audio from the source while eliminating as much "bad" audio, or noise, as possible.

...

Filtering as much useful dynamic range from 24/96 studio files into 16/44.1 CD master files is, in a nutshell, the mastering process.

This is a pretty poor representation of what mastering is, and it's sad that an article on mastering doesn't really bother to explain mastering. I've known top mastering engineers (even ones who have worked at Masterdisk) who do all their work at 16/44.1. Many still prefer to work with analog as much as possible, where the bit depth and sample rate don't mean much. Mastering engineers are all happy to deliver a wide variety of formats as the end product. Moreover, equating "bad" audio with noise, and talking about level changes, dynamics and EQ as if they have something to do with "extraction", is all wrong, and none of that has anything to do with format. Fundamentally, mastering is about balancing levels, dynamics, and frequencies of a finished mix.

Huh?

...since iTunes Plus tracks are also 16/44.1, it seems logical to use the files created for CD mastering to make the compressed AAC files sold via iTunes.

iTunes Plus tracks, if sourced from 24/96, never become 16/44.1. As you explain in the next paragraph, they go from 24/96 to float/44.1 to AAC/44.1. (They usually are played at 16/44.1, but with the volume control in between, so the effective bit depth is usually lower.)

Null Test

Shepard performed what is known as a "null test" to prove his theory that specially mastering songs for iTunes to sound more like the CD version is "BS."

About the only thing a "null test" is good for is determining if two files are identical. It's sort of the audio engineer's equivalent of the "diff" command-line tool. The Ars Technica article quotes Scott Hull arguing against the null test on artistic and perceptual grounds: "...objective tests give us some guide, but they don't account for the fact that our hearing still has an emotional element. We hear emotionally, and you can't measure that." But there are also very sound technical reasons why the null test is simply inappropriate here. When comparing perceptual coding, or even basic eq or other effects, the null test becomes useless because it is nothing more than subtracting two files sample by sample and seeing what's left. Unfortunately, one of the basic operations you can perform on audio is to shift it in time, which means that the data no longer corresponds sample by sample. Minute shifts in time are the only way to achieve eq and other frequency domain changes ("Aha," you say, "but FIR filters don't shift in time," but actually they do, they just don't do so recursively). Most other effects, including most dynamics changes and perceptual coding, make drastic changes in time as well (although it's possible to do these kinds of changes without time shifts), so comparing anything that has been changed in these ways is really apples to oranges (apples to televisions?).
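Here is a toy illustration of the time-shift problem. It is only a sketch, with a hypothetical null_test helper and a 1 kHz test tone of my own choosing: the same tone delayed by a single sample, which nobody could hear as a difference, leaves a large residual.

```python
import math


def null_test(a, b):
    """Subtract two signals sample by sample; return the peak residual."""
    return max(abs(x - y) for x, y in zip(a, b))


sample_rate = 44100
tone = [math.sin(2 * math.pi * 1000 * i / sample_rate) for i in range(1000)]
delayed = [0.0] + tone[:-1]  # the same tone, one sample late

identical_residual = null_test(tone, tone)
delayed_residual = null_test(tone, delayed)

print(identical_residual)          # 0.0: identical files null perfectly
print(round(delayed_residual, 2))  # roughly 0.14 of full scale
```

The residual from the delayed copy says nothing about how different the two signals sound. It only says they don't line up sample by sample.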

More?

Phew, that's enough for now. I think I got the big ones. Like I said, the conclusions are mostly correct, even if the above is wrong, but the whole "Mastered for iTunes" thing does seem to miss the point. (Unless the point is marketing, in which case, cheers!)

Updated 5/5/2012: fixed typo and included Scott Hull quote on null test along with some clarifications to that section.