¡digital audio extravaganza!

I called Napster a "bad brand" a few days ago, but I've got to admit that there seems to be a certain magic to it. In the past couple of days I've had a lot of friends IMing and emailing me about the various ways of turning Napster's DRM'ed WMA files into other, unprotected formats.

Well, yes, you can do that. As I noted in the original post, you can use Winamp's out_lame plugin to encode to MP3. The Napster trick making the rounds uses the Output Stacker plugin (which has since been pulled from AOL-owned Winamp's website), but the principle is the same. I haven't tried it, and I imagine Output Stacker might let you transfer ID3 information so you don't have to retag your music, but technically speaking there is very little difference from the out_lame solution.

Thing is, this is nothing new.

You might have heard of the exciting-sounding "analog hole" -- this term refers to the unfortunate fact that our ears and eyes don't work digitally, so the media companies have to allow their product to be decoded at *some* point in order for it to be viewed or listened to. No matter how many fancy software hindrances they introduce, someone can always replace their headphone plug with a line running to a recorder, or point a video camera at a video screen.

This Napster "trick" is similar. To play sounds your computer must convert audio files into what's called pulse code modulation format, or PCM. But let's back up a bit: how do you record audio digitally? Well, have a look at this page. In a nutshell: sound waves are made up of variations in air pressure. A microphone converts these slight pressure changes into an electrical signal. A digital recording is taken by measuring the strength of this signal very, very quickly -- for CD audio, it happens 44,100 times per second. Each measurement is called a sample. PCM data can be used to tell a speaker cone where it ought to be at each 44,100th of a second, recreating the original pressure waves -- voila, you've got sound. Speakers aren't digital, of course, but this is the general idea.
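To make the sampling idea concrete, here is a minimal Python sketch that "records" a pure tone the way the paragraph above describes: measure the signal 44,100 times per second and store each measurement as a 16-bit integer. The function name `sample_tone` and the 440 Hz test tone are illustrative assumptions, and a real recording would of course come from a microphone rather than a sine formula.

```python
import math

SAMPLE_RATE = 44_100                       # CD audio: measurements per second
BIT_DEPTH = 16                             # CD audio: bits per sample
MAX_AMPLITUDE = 2 ** (BIT_DEPTH - 1) - 1   # 32767, largest signed 16-bit value


def sample_tone(freq_hz, duration_s, amplitude=0.5):
    """Return signed 16-bit PCM samples for a sine tone (hypothetical helper)."""
    n_samples = round(SAMPLE_RATE * duration_s)
    samples = []
    for n in range(n_samples):
        t = n / SAMPLE_RATE                            # time of this measurement
        level = amplitude * math.sin(2 * math.pi * freq_hz * t)
        samples.append(round(level * MAX_AMPLITUDE))   # quantize to an integer
    return samples


pcm = sample_tone(440.0, 0.01)   # 10 milliseconds of a 440 Hz tone
print(len(pcm), pcm[:5])
```

Each integer in `pcm` is one sample: roughly, where the speaker cone should be at that 44,100th of a second.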

CD Audio is just PCM data with a particular wrapper of information put around it to help devices recognize and play it -- the same can be said for the WAV format and Apple's lossless AIFF standard, and you can convert from one to another without modifying the PCM data at all. So for the purposes of this discussion, CD audio is the holy grail. Other standards like SACD and DVD-Audio (not the same as the audio on your DVD movies) sample at faster rates than CD's 44.1 kHz. Some also sample at a higher resolution -- CD Audio is 16 bit, meaning a given sample represents one of 2^16 (65,536) possible air pressure levels. Increasing either the sample rate or resolution allows for a more accurate reproduction of the original sound wave, but 44.1 kHz/16-bit is pretty good, and it's the de facto standard for consumer digital audio.
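The "wrapper" point can be checked with Python's standard `wave` module. This is only a sketch: the five sample values are made up, and the file is written to memory instead of disk. It wraps raw 16-bit PCM in a WAV container and then unwraps it to show that the PCM payload is untouched.

```python
import io
import struct
import wave

# A few made-up 16-bit samples standing in for real audio.
samples = [0, 1000, -1000, 32767, -32768]
pcm_bytes = struct.pack("<%dh" % len(samples), *samples)  # little-endian int16

# Wrap the PCM data in a WAV container (in memory rather than on disk).
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_out:
    wav_out.setnchannels(1)        # mono, for simplicity
    wav_out.setsampwidth(2)        # 2 bytes = 16 bits per sample
    wav_out.setframerate(44_100)   # CD sample rate
    wav_out.writeframes(pcm_bytes)

# Unwrap it again: the PCM payload comes back byte-for-byte identical.
buffer.seek(0)
with wave.open(buffer, "rb") as wav_in:
    recovered = wav_in.readframes(wav_in.getnframes())

print(recovered == pcm_bytes)   # the wrapper changed nothing
print(2 ** BIT_DEPTH if (BIT_DEPTH := 16) else None)  # levels per 16-bit sample
```

The container only adds a header describing the channel count, sample width and sample rate; the samples themselves pass through unchanged.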

So what do so-called "lossy" encoding formats like MP3, Ogg Vorbis, AAC and WMA do? Here's where cognitive science comes into the picture. It turns out that what we perceive is only loosely connected to the sounds surrounding us. In the same way that optical illusions reveal the weird pre-processing that our brains perform before sensation becomes conscious, there are audio illusions that can show us just how imprecise our hearing is. The simplest example is probably relative loudness -- sit in a silent room for a while and even a slight noise will seem very loud. A slightly weirder example is the Shepard Tones, or barber-pole tone: through some sneaky math, a series of notes can be generated that seem to perpetually go up or down in pitch. Click here or here for an example.

Audio codec engineers can take advantage of our human frailty in various ways. For example, we're worse at detecting the location of low-frequency sounds than high frequencies, so they can throw away some low-frequency stereo information. Loud sounds at a particular frequency tend to mask the presence of quieter sounds of the same frequency, so the quieter information can be discarded as well. A lot of this is beyond me, but this is the general idea behind perceptual audio formats like MP3: discard stuff that we couldn't perceive anyway.

You can see a visual representation of this below. These are audio spectrograms of the first half of the first chorus of Jump, Little Children's "My Guitar". The y-axis is frequency; the x-axis is time; and I'm only showing the left channel of the stereo signal. A darker dot means a stronger relative intensity at a given frequency. The left spectrogram is the CD audio; the right is the same audio after running it through the LAME MP3 codec (and chopping off the extra silence added at the start by the conversion process).

Lossless CD Audio

128 kbps MP3

They look pretty similar, right? Well, they are. But what happens when we subtract the compressed spectrogram from the uncompressed one? We'll see everything that the MP3 compression process threw out. The changes are often slight, so I've used Photoshop's Auto Contrast function ("the poor man's normalization") to make it more visible to our feeble human eyes:

information discarded by the MP3 conversion process

Those bits of sound aren't all that important, but what is important is to realize that you can never get them back. Encoding is a one-way process; once discarded, that information is lost forever. Even if you know which MP3 codec you used and what bitrate, you can't regenerate it. That's why this type of encoding is called "lossy".

This is important because different codecs operate in slightly different ways. They throw away different parts of the signal, and each generation gets worse and worse. The difference is often subtle, but it's there. Here: have a listen to this. It's the intro to the Pixies' "Where Is My Mind" -- first, the CD audio version. Second, the same clip after being put through the audio compression wringer -- six different transitions between MP3, Ogg Vorbis and WMA (plus one more encoding to very high quality MP3 to save our bandwidth -- that transition applies to the entire clip and should be negligible). Notice how the second version is breathier? How it's harder to figure out where the "stop!" is coming from?

That's the problem with reencoding an already lossy file -- and that's what's happening with the Napster solution. Napster's audio comes in WMA format, and the hack reencodes it to MP3. The results will be a lot better than the example above, but you'll still inevitably lose some quality. Some songs will suffer more than others.
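Here is a toy illustration of why stacking lossy encodings hurts. Big caveat: real codecs like MP3 and WMA do not work this way. The `lossy` function below is a made-up stand-in that simply rounds each sample to a multiple of some step size, and the step sizes 5 and 7 are arbitrary. It does share two properties with real lossy codecs, though: you can't undo it, and two different "codecs" throw away different things.

```python
def lossy(samples, step):
    """Toy 'codec' (an assumption, not a real one): round to multiples of step."""
    return [step * round(s / step) for s in samples]


def total_error(a, b):
    """Sum of absolute differences between two equal-length sample lists."""
    return sum(abs(x - y) for x, y in zip(a, b))


original = [3, 7, 12, 18, 25, 31, 40]

once = lossy(original, 5)               # straight from the source to "codec A"
twice = lossy(lossy(original, 7), 5)    # through "codec B" first, then "codec A"

print(once, total_error(original, once))
print(twice, total_error(original, twice))
```

In this example the samples 3 and 7 both become 5 after one pass, so there is no way to tell them apart afterwards, and the two-pass version ends up further from the original than the one-pass version.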

So that's the reason why I'm not super-excited about this new exploit. Barring some exotic new oppression from Microsoft, you will always be able to do this with the digital audio vendor of your choice -- or hell, your favorite internet radio station. For *nix users (including Mac owners running OS X) all you have to do is record the data going through /dev/dsp. The reason the Hymn Project (for Apple's iTunes Music Store) is different -- and so much cooler -- is that it removes the copy protection without touching the audio at all, so there's no quality loss.

I'm sure Napster's tenuous allies in the music industry are going to be really pissed off about all of this, but they'll eventually have to realize that all digital music systems will inevitably suffer from this vulnerability. It'll be nice to see their license-model wet dream of universal digital serfdom blow up in their faces, but nothing really amazing has happened: just another few executives who should've listened to their nerds a bit more. Move along, nothing to see here.

UPDATE: Above, I incorrectly implied that very high quality audio formats like SACD and DVD-Audio sample at a higher rate, and that some also improve their samples' resolution from CD Audio's 16 bit standard. That's backward -- the discrete sampling rate necessary to reproduce a signal of a given bandwidth is a known quantity determined by the Nyquist-Shannon Theorem. We know the bandwidth of human hearing -- it runs from around 20 Hz to 20 kHz. So there isn't much benefit to a higher sampling rate. Better sample resolution does carry a payoff, however. So getting past the 16-bit limit is the first thing these formats do; many also jack up the sampling rate, but this isn't the first thing audio engineers would pursue.
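The sampling theorem's practical consequence is easy to sketch: a sample rate can only capture frequencies up to half its value, and anything above that "folds" down to a lower frequency. The helper name `alias_frequency` is my own, and it assumes an idealized pure tone with no anti-aliasing filter in front of the sampler.

```python
def alias_frequency(freq_hz, sample_rate_hz):
    """Apparent frequency of a pure tone after sampling (hypothetical helper)."""
    nyquist = sample_rate_hz / 2            # highest representable frequency
    folded = freq_hz % sample_rate_hz       # sampling can't see whole multiples
    return folded if folded <= nyquist else sample_rate_hz - folded


print(44_100 / 2)                          # 22050.0, just above human hearing
print(alias_frequency(20_000, 44_100))     # 20000: captured as-is
print(alias_frequency(30_000, 44_100))     # 14100: folded into the audible range
```

With a 44.1 kHz sample rate the limit sits at 22.05 kHz, which comfortably covers the 20 kHz upper bound of hearing.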

Comments

But... when you set the bitrate of an mp3 file, doesn't that essentially govern the amount of information you're throwing away? So in the wma to mp3 conversion process, isn't there some CD-audio bitrate limit that wouldn't discard anything? That is, couldn't you encode all of the information in a .wav into a comparably-large .mp3 file? I guess my question is, why does an mp3 "compression" have to discard information from a .wma file? Why isn't there some sort of 1-to-1 conversion process?

I love these posts, by the way. I'm still waiting to read about how you're going to revolutionize the industry using Fourier transforms. somehow...

Well, you're right: you could do a perfect copy of the WMA by copying it out to a WAV, or better, the lossless FLAC compression format. And a very high bitrate MP3 would doubtless do pretty well (although you'd need to ask someone who knows more about MP3 than me to answer whether a high-bitrate MP3 would necessarily throw away some information or not -- my guess is that it would).

The highest bitrate MP3 codecs usually support is 320 kbps, whereas CD-quality PCM runs about 1411 kbps. FLAC can throw out about half of that losslessly, but the theoretical limit for lossless compression is probably not much below, say, 500 kbps. So: still lossy, even though 320 kbps MP3 is generally considered as good as CD quality by all but the most pretentious twits.
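The raw bitrate of CD audio follows directly from the numbers in the post (44,100 samples per second, 16 bits per sample, two stereo channels). A quick sketch of the arithmetic, with 128 kbps picked as an example MP3 bitrate:

```python
SAMPLE_RATE = 44_100      # samples per second, per channel
BITS_PER_SAMPLE = 16
CHANNELS = 2              # stereo

pcm_bps = SAMPLE_RATE * BITS_PER_SAMPLE * CHANNELS
pcm_kbps = pcm_bps / 1000

mp3_kbps = 128            # example bitrate, chosen for illustration
bytes_per_minute = pcm_bps * 60 // 8

print(pcm_bps)                        # bits per second of raw CD audio
print(pcm_kbps / mp3_kbps)            # how many times smaller the MP3 stream is
print(bytes_per_minute / 1_000_000)   # megabytes per minute of raw PCM
```

So a 128 kbps MP3 is roughly an eleventh the size of the uncompressed audio, and raw PCM eats a bit over 10 MB per minute.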

Anyway, nobody's going to want to waste mammoth amounts of disk space in order to get a decent quality file from Napster. Or at least I don't. Realistically, if you rip it to a normal-bitrate MP3 you'll still get a decent quality file for a normal amount of disk space. But it still seems simpler to me to just go use the P2P networks and not worry about compression artifacts at all (or waiting for songs to recompress).

yeah - obviously it would lead to disk space problems... but if your intention was to burn yourself several hundred serviceable audio CDs from Napster you could probably do so. Hmm. So how does this "lossless" FLAC thing work? It would seem that reducing the number of bits of anything would inherently be an information loss.

Hey, Tom. Care to explain lossless encoding in a nutshell or at least where the discarded data comes from that can be recreated in decoding?

Also, did you see this thoughtful gift for Valentine's Day? http://www.makezine.com/blog/archive/2005/02/nothing_says_i.html
I.e., cancer in a few decades. I checked the store, but they appear to be all out, so I guess radioactive substances were a winner this year.

It would seem that reducing the number of bits of anything would inherently be an information loss.

ah -- you'd think that. but then, think of how you can zip up a text file to a fraction of its size and uncompress it to its original form losslessly. I'm not clear on *exactly* how FLAC works, but it was designed by a non-expert and my understanding is that it isn't all that complex -- from what I've read it mostly just removes silence from within the signal. PCM is designed for driving speakers, not for efficiency.
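The zip comparison is easy to try with Python's standard `zlib` module, which implements the same DEFLATE compression used in zip files. The sample text here is deliberately repetitive, so the ratio it prints is far better than you'd get on ordinary prose; the point is only that the round trip is exact.

```python
import zlib

# Deliberately repetitive sample text (an assumption made for the demo).
text = ("It would seem that reducing the number of bits of anything "
        "would inherently be an information loss. ") * 50
raw = text.encode("ascii")

packed = zlib.compress(raw)          # fewer bytes...
unpacked = zlib.decompress(packed)   # ...yet nothing is lost

print(len(raw), len(packed))
print(unpacked == raw)
```

The compressed bytes are a shorter description of the same data, not a degraded copy of it.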

Lossless compression (and how much of it you can do) goes to the heart of information theory, one of those amazing fields that was invented pretty much single-handedly by one brilliant guy (in this case a guy named Claude Shannon). I've got a great book on it that I read, enjoyed and forgot most of almost immediately, if you're interested in borrowing it.

Hey, Tom. Care to explain lossless encoding in a nutshell or at least where the discarded data comes from that can be recreated in decoding?

Alright -- a new project! I'll find out about FLAC and write something. But here's a quick theoretical scenario explaining how lossless encoding *could* work (no actual system works this way, I'm sure).

So, each byte in a text file is made of 8 bits and can represent one of 256 different values. See here for the ASCII chart. But the vast majority of a given text file will be from the sets [A-Z], [a-z] and [some miscellaneous punctuation]. For the sake of simplicity, let's say there are 63 commonly repeated characters. Now we can use just six bits per character and have room left for one special six-bit "escape" combination (000000, say) that signals the program to look at the next eight bits and use the traditional ASCII table. So we have a bunch of common six-bit characters, and some uncommon fourteen-bit characters. With this method we'd get a file at least 75% of the original's size, the exact size depending on how many "unusual" characters we encounter (the worst case, of course, would be a file 175% of the original's size).
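For the curious, here is a rough Python sketch of that hypothetical six-bit scheme. Everything specific in it is an assumption made for illustration: the exact 63-character `COMMON` alphabet (52 letters plus 11 punctuation marks I picked), the function names, and the choice to represent the output as a string of "0" and "1" characters rather than packed bytes. It also assumes every input character fits in eight bits.

```python
import string

# Assumed alphabet of 63 "common" characters: 52 letters + 11 punctuation marks.
COMMON = string.ascii_uppercase + string.ascii_lowercase + " .,!?'-;:\n\""
ESCAPE = 0  # the reserved six-bit code 000000


def encode(text):
    """Encode text as a bit string: 6 bits if common, else escape + 8 bits."""
    bits = []
    for ch in text:
        index = COMMON.find(ch)
        if index >= 0:
            bits.append(format(index + 1, "06b"))   # codes 1..63
        else:
            bits.append(format(ESCAPE, "06b") + format(ord(ch), "08b"))
    return "".join(bits)


def decode(bits):
    """Invert encode(): read 6 bits, and 8 more whenever the escape appears."""
    chars = []
    pos = 0
    while pos < len(bits):
        code = int(bits[pos:pos + 6], 2)
        pos += 6
        if code == ESCAPE:
            chars.append(chr(int(bits[pos:pos + 8], 2)))
            pos += 8
        else:
            chars.append(COMMON[code - 1])
    return "".join(chars)


message = "Hello, world! 100% lossless."
encoded = encode(message)
print(len(encoded), "bits instead of", len(message) * 8)
print(decode(encoded) == message)
```

The digits and the percent sign fall outside the assumed alphabet, so they cost fourteen bits each, while everything else costs six. The round trip still returns the exact original text.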

75% would be a terrible compression ratio for text (and the worst case would be totally unacceptable), but you get the idea of how lossless compression can work: the representation of common elements can be simplified. I believe the more common and general technique is to break a binary file into blocks and then find and record the coefficients to a polynomial function that, when who-knows-what quantizing mathematical operation is applied, recreates the source data.