Wednesday, November 11, 2009

WAVE64 vs RF64 vs CAF

Right now I am choosing a new default internal audio file format for XO Wave, and I'd like to choose a format that offers large file sizes and high resolution. I'd like to use an existing popular standard rather than inventing my own or using RAW audio. The pro audio industry is finally moving towards 64-bit file formats, and the three options supported by most pro software are Wave64, RF64 and CAF:

Wave64, aka Sony Wave64, originally developed by Sonic Foundry before 2003, is an open standard and a true 64-bit format: all 32-bit fields are replaced with 64-bit fields, and all chunks are 8-byte word aligned. Instead of the dreaded FourCC it uses GUID. Other than that, it is pretty much the same as WAV, so the spec is barely 4 pages long, although in my opinion it could stand to be a bit longer, as many aspects of WAV are so poorly devised it really wouldn't hurt for someone to put it all in one place. Some people have criticized the use of GUID on the grounds that there will never be that many chunks, but this misses the point: the point of using GUIDs is that anyone can define their own chunk without having to check with Sony or register a chunk ID. It's actually rather clever.
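To make the GUID-plus-64-bit-size idea concrete, here is a minimal sketch of packing and unpacking one chunk. It assumes a 24-byte chunk header (a 16-byte GUID followed by a little-endian 64-bit size), that the size counts the header itself, and that the GUID is stored in Microsoft's mixed-endian byte order; check those details against the spec before relying on them. `MY_CHUNK` is a made-up private chunk ID, which is exactly the point: nobody has to register it.

```python
import struct
import uuid

HEADER_SIZE = 24  # assumed layout: 16-byte GUID + 8-byte little-endian size


def align8(n: int) -> int:
    """Round n up to the next multiple of 8."""
    return (n + 7) & ~7


def pack_chunk(chunk_id: uuid.UUID, payload: bytes) -> bytes:
    """Build one chunk: GUID, 64-bit size, payload, zero padding to 8 bytes."""
    size = HEADER_SIZE + len(payload)  # assumed: size includes the header
    padding = b"\x00" * (align8(size) - size)
    return chunk_id.bytes_le + struct.pack("<Q", size) + payload + padding


def unpack_chunk(buf: bytes, offset: int = 0):
    """Return (chunk_id, payload, offset_of_next_chunk)."""
    chunk_id = uuid.UUID(bytes_le=buf[offset:offset + 16])
    (size,) = struct.unpack_from("<Q", buf, offset + 16)
    payload = buf[offset + HEADER_SIZE:offset + size]
    return chunk_id, payload, offset + align8(size)


# Hypothetical private chunk: a fresh GUID, no central registry involved.
MY_CHUNK = uuid.uuid4()
blob = pack_chunk(MY_CHUNK, b"hello")
parsed_id, parsed_payload, next_offset = unpack_chunk(blob)
```

A reader that doesn't recognize `MY_CHUNK` can still skip it, because the size field says where the next chunk starts.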

RF64 was proposed in 2005 by the EBU with full knowledge of Wave64. Although the proposal stated basic requirements that could have easily been met by a few minor extensions to Wave64, and they stated a desire to "join forces" with the developers of Wave64, they made no effort to do so other than to say they hoped they'd be involved. Moreover, the same document proposes RF64 as an alternative, incompatible 64-bit extension to the WAV format. Unlike Wave64, RF64 is not a true 64-bit format. All existing "chunks" remain 32-bit, so, for example, markers, regions and loops will no longer work past a certain number of samples. Even EBU's levl chunk will not work with RF64 because it uses a 32-bit address for pointing to the "peak-of-peaks" in the raw data. RF64 offers the much made-of promise of backwards compatibility via a "junk chunk", but, of course, this is possible with Wave64 as well, as pointed out in the Wave64 spec.
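To put a number on "a certain number of samples": assuming the legacy chunks store sample positions as unsigned 32-bit values, the furthest position a marker or loop point can address is 2^32 - 1 sample frames. The sketch below converts that into running time at a few common sample rates (the helper name is just for illustration).

```python
from datetime import timedelta

MAX_U32 = 2**32 - 1  # largest value an unsigned 32-bit field can hold


def max_addressable_time(sample_rate_hz: int) -> timedelta:
    """Longest position a 32-bit sample offset can point at."""
    return timedelta(seconds=MAX_U32 / sample_rate_hz)


for rate in (44_100, 48_000, 96_000, 192_000):
    print(f"{rate:>7} Hz: {max_addressable_time(rate)}")
```

At 48 kHz that is a little under 25 hours, and at 192 kHz only about 6 hours, which a long-form recording can plausibly reach.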

CAF, or Core Audio Format, was Apple's entry into the ring. Apple didn't want to be left out of the 64-bit game, after all, and around the same time in 2005 they released CAF. Since they are Apple, they figured people would adopt it (Logic would, if no one else), even if there were competing specs. Their approach, however, was to start from scratch, and it's pretty refreshing. Indeed, the spec addresses practical issues to ensure that important features are implemented, and it even makes that tiny little bit of extra effort required to avoid file corruption by not requiring a header rewrite to finalize a recording of unknown length. (Anyone who's ever recorded using software knows that once in a while something goes wrong and a file ends up corrupted. It's so nice that someone finally addressed this in a spec.)
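Here is a rough sketch of why that matters. As I read the spec, a CAF chunk header is a four-character type followed by a signed 64-bit big-endian size, and a size of -1 on the audio data chunk means "runs to the end of the file". The sketch leaves out the file header, the other required chunks and the data chunk's edit-count field, so it is an illustration of the idea rather than a valid CAF writer.

```python
import io
import struct

UNKNOWN_SIZE = -1  # "size not known yet; data runs to end of file"


def write_chunk_header(f, chunk_type: bytes, size: int) -> None:
    """Write a 12-byte header: 4-char type + signed 64-bit big-endian size."""
    f.write(struct.pack(">4sq", chunk_type, size))


def read_data_chunk(f, file_len: int):
    """Read a chunk, treating UNKNOWN_SIZE as 'everything up to EOF'."""
    chunk_type, size = struct.unpack(">4sq", f.read(12))
    if size == UNKNOWN_SIZE:
        size = file_len - f.tell()
    return chunk_type, f.read(size)


# Simulate a recording that is never finalized.
buf = io.BytesIO()
write_chunk_header(buf, b"data", UNKNOWN_SIZE)
for block in (b"\x01\x02", b"\x03\x04", b"\x05\x06"):
    buf.write(block)
# ...pretend the application crashed here, before any header rewrite...

buf.seek(0)
chunk_type, audio = read_data_chunk(buf, len(buf.getvalue()))
```

Everything written before the "crash" is still readable, because nothing in the header ever needed to be patched afterwards.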

The WAVE format is problematic in many, many ways. For example, in some places it uses zero-based indexing, in others it uses one-based indexing. Sometimes it uses signed integers for raw audio data, other times unsigned. That may not seem so bad on its own, but when you consider how simple the data it's trying to carry is, and add to that the fact that Microsoft had to use format extensions just to clear up ambiguous documentation (and they've still got an ambiguously documented "fact" chunk), it's really not good territory. It is a shame that both Sonic Foundry/Sony and the EBU chose WAVE as the format to extend. Moreover, it's annoying that the EBU designed their own, incompatible 64-bit extension to WAVE when a superior one already existed.

Some people think the whole "backwards compatibility" thing is a bunch of hooey because it puts an undue burden on the people writing the libraries. Erik de Castro Lopo, author of the popular LGPL'ed libsndfile, says:

Quite honestly, its stuff like this that makes me think the people who write these specs smoke crack!

If I were to follow the ... insane advice [about retaining backwards compatibility], the test suite would have to write > 4Gig files in order to write a real RF64 file instead of just a normal WAV file.

In order to avoid this insanity, libsndfile, when told to write an RF64 file does exactly as its told.

I would add that the backwards compatibility adds another point of failure in the recording process, in the same way that header rewrites are a point of failure in most current formats (except for CAF and "chunkless" formats like RAW and AU).

All that aside, RF64 is gaining some popularity and support -- probably more than Wave64. As for CAF, it's less popular, but since it's an Apple standard it's probably not going anywhere even if it's not going to be the "next big thing." It could be a fine place to work from, but as I scanned the docs, everything I looked at brought up a few issues that worried me. For example:

The CAFMarker data-type has three design flaws I noticed. One is that the frame position is a floating point number. I might be missing something here, but in a format where everything else counts frames and bytes as 64-bit integers, why are we suddenly using floats? Sure, it will represent integers exactly up to pretty big numbers since it's 64-bit, but it's still a float. I didn't use a format like this to get pretty accurate big numbers when I could get completely accurate big numbers! Internally, most apps are going to be converting 64-bit integers to 64-bit floats, which is insane. Another problem is mChannel, which is the channel (starting at 1) that the marker refers to, or zero if the marker refers to all channels. Okay, seems reasonable, except that the spec also defines a channel mapping with a 32-bit channel layout bitmask. Why not use that? Granted, you might have more than 32 channels, but that's not going to be the most common case, and you could give your users a choice. Consistency is important in APIs. Also, let's face it, the CAFMarker, if not all the basic chunks, should be versioned and extensible. Sure, all that takes a few more bits (well, not the float/integer thing), but it's really nothing compared to the sea of data in most audio files.
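To be fair about how big "pretty big" is: assuming the frame position is an IEEE 754 double, every integer up to 2^53 survives the round trip exactly, and the first casualty is 2^53 + 1. A small sketch (helper names are mine) shows where the limit sits and how long a recording would have to run to hit it.

```python
EXACT_LIMIT = 2**53  # doubles carry a 53-bit significand


def survives_double(frame: int) -> bool:
    """True if an integer frame index round-trips through a 64-bit float."""
    return int(float(frame)) == frame


def years_until_inexact(sample_rate_hz: int) -> float:
    """Recording length (in years) at which frame indices stop being exact."""
    seconds = EXACT_LIMIT / sample_rate_hz
    return seconds / (365.25 * 24 * 3600)


print(survives_double(EXACT_LIMIT))      # True
print(survives_double(EXACT_LIMIT + 1))  # False
print(round(years_until_inexact(192_000)))
```

So the precision loss is theoretical for any real recording; the complaint is about mixing integer and floating point conventions, not about markers actually drifting.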

In the SMPTE timecode types they define kCAF_SMPTE_TimeType30Drop. Now, the fact is that there's really no such thing as 30 Drop, but I can see an argument for including it out of completeness. However, the documentation describes it as "30 video frames per second, with video-frame-number counts adjusted to ensure that the timecode matches elapsed clock time," which is wrong. If you actually had 30 Drop it would run ahead of elapsed, or "wall-clock", time. "Aha!" you say, "they really mean 29 Drop, which is often just called 30 Drop because everyone knows there's no such thing as 30 Drop." But I'm afraid you are wrong, because there's another constant for that, kCAF_SMPTE_TimeType2997Drop, with pretty much the same documentation, only in this case it's correct to say that the timecode matches elapsed time. (Well, it's very close anyway.)
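The arithmetic behind that claim can be sketched as follows, assuming the usual drop-frame rule: skip two frame numbers at the start of every minute except minutes 0, 10, 20, 30, 40 and 50. Exact fractions keep rounding out of the picture; the function names are just for illustration.

```python
from fractions import Fraction


def frames_per_timecode_hour(nominal_fps: int = 30, dropped_per_minute: int = 2) -> int:
    """Real frames that pass while drop-frame timecode counts one hour."""
    labels = nominal_fps * 3600
    # Frame numbers are skipped in 54 of the 60 minutes (not in every 10th).
    dropped = dropped_per_minute * (60 - 6)
    return labels - dropped


def lead_after_one_timecode_hour(true_fps: Fraction) -> Fraction:
    """Seconds the timecode is ahead of the wall clock when it reads 01:00:00:00."""
    elapsed = frames_per_timecode_hour() / true_fps
    return 3600 - elapsed


lead_2997 = lead_after_one_timecode_hour(Fraction(30000, 1001))  # 29.97 fps
lead_30 = lead_after_one_timecode_hour(Fraction(30))             # true 30 fps

print(float(lead_2997))  # 0.0036 s per hour: "very close anyway"
print(float(lead_30))    # 3.6 s per hour: runs ahead of the wall clock
```

Drop-frame numbering only compensates for the 0.1% gap between 29.97 and 30; applied to a true 30 fps stream it has nothing to compensate for, so the timecode gains 3.6 seconds every hour.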

So CAF might be flawed, but probably no more so than WAVE and anything built on it. The reliability factor is sweet. Really. The fact that many people, especially in broadcast, seem to be wanting RF64 support is a detraction, though.

Of course, I might just be over-engineering it. The AU format has been around forever, is super simple and provides high resolution, uncompressed audio of ANY length (it's not even limited to 64-bit). Of course, it lacks metadata which might be useful for BWF-style info as well as region data, but hey, it's wicked simple.
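For a sense of just how simple, here is a sketch of an AU header as I understand it: six big-endian 32-bit words (magic ".snd", header size, data size, encoding, sample rate, channels), with a data size of 0xFFFFFFFF meaning "unknown", which is what lets the audio run to any length. The encoding code 3 for 16-bit linear PCM and the bare 24-byte header are assumptions worth checking against a real reader, since some expect a few annotation bytes after the header.

```python
import struct

AU_MAGIC = 0x2E736E64          # the bytes ".snd"
AU_UNKNOWN_SIZE = 0xFFFFFFFF   # data size not recorded; read until EOF
AU_ENCODING_PCM16 = 3          # assumed code for 16-bit linear PCM


def au_header(sample_rate: int, channels: int,
              encoding: int = AU_ENCODING_PCM16,
              data_size: int = AU_UNKNOWN_SIZE) -> bytes:
    """Build a minimal 24-byte AU header (six big-endian 32-bit words)."""
    header_size = 24
    return struct.pack(">6I", AU_MAGIC, header_size, data_size,
                       encoding, sample_rate, channels)


header = au_header(sample_rate=96_000, channels=2)
```

Write that once, stream big-endian samples after it, and there is nothing to go back and fix up when recording stops.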

An interesting side note is that by choosing an appropriately sized junk/empty chunk in the header, Wave64, RF64 and CAF can actually be converted from one to another in-place.

1 comment:

You are missing something. CAF frame position is a double because you cannot tune loops properly with an integer frame count. Middle C is not a subharmonic of 44100. One cycle of middle C is 168.562 samples at 44100.
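The commenter's figure is easy to check. Assuming equal temperament with A4 = 440 Hz, middle C sits nine semitones below A4, and one cycle at 44100 Hz comes out to a non-integer number of sample frames:

```python
A4_HZ = 440.0
SEMITONES_BELOW_A4 = 9  # middle C (C4) relative to A4, equal temperament

middle_c_hz = A4_HZ * 2 ** (-SEMITONES_BELOW_A4 / 12)
period_in_frames = 44_100 / middle_c_hz

print(round(middle_c_hz, 3))
print(round(period_in_frames, 3))
```

A loop that has to span exactly one cycle therefore cannot start and end on integer frame positions, which is the case a fractional frame position covers.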