Parsing and Writing QuickTime Files in Java

Apple's QuickTime turns 12 this year. Its very extensible file format has
contributed to this longevity, allowing QuickTime to migrate from a world of
CD-ROMs, AppleTalk, and static content to today's massively-networked,
streaming, interactive world. The format is so flexible that it was chosen as
the basis of the MPEG-4 file format. More than you might expect, the file
format's philosophy and concepts are integral to working with QuickTime
structures at runtime.

However, for the most common tasks the QuickTime APIs do much to isolate
developers from the nuts and bolts of the file format. So we'll first examine
the format with a simple pure-Java QuickTime file format parser, and then use
some QuickTime for Java code to generate several kinds of QuickTime files that
illustrate the format's flexibility.

The details of the format are readily available in the 351-page Inside
QuickTime: QuickTime File Format (PDF). Mac OS X developers will also find a
copy at /Developer/Documentation/QuickTime/qtdevdocs/PDF/QTFileFormat.pdf,
put there by the Developer Tools installer.

Mighty Atom

The heart and soul of QuickTime is the concept of the "atom." The name
should remind you of high-school chemistry, where an atom was the smallest unit
of an element that retained the properties of the element. In QuickTime, an
atom is the lowest level to which we can go and still be able to tell the difference
between, say, an edit-list and a sprite. All atoms have a size and a type.
Any other information they may contain depends on their type. This design
aids forward compatibility: it's easy to skip over an atom of an unknown
type, because its size is right there in the header.

There's a difference between "classic" atoms and newer
"QT" atoms, but the latter are backward-compatible with the former,
and both are commonly encountered in a single file. Let's focus on the
commonalities. All atoms have a header of either 8 or 16 bytes, consisting of
either two or three parts:

atom size: a 4-byte unsigned integer. If 0, the
atom continues to the end of the file.

atom type: a 4-byte value, usually interpreted as
an ASCII string like moov, though any value is valid.

Optionally, an extended size: if the atom size was
1, then this field is present and interpreted as an 8-byte unsigned
integer. This allows an atom to contain more than 4 GB of data.
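To make the header layout concrete, here is a minimal sketch of decoding one header from a java.nio.ByteBuffer. The AtomHeader class and its read method are hypothetical names invented for this illustration, not part of QuickTime for Java or the sample parser. The sketch relies on ByteBuffer's default big-endian order and masks the first int to recover the unsigned 32-bit size.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical model of an atom header: size, four-character type, and header length. */
final class AtomHeader {
    final long size;        // total atom size in bytes, header included; 0 means "to end of file"
    final String type;      // four-character code such as "moov"
    final int headerLength; // 8 normally, 16 when an extended size follows the type

    AtomHeader(long size, String type, int headerLength) {
        this.size = size;
        this.type = type;
        this.headerLength = headerLength;
    }

    /** Reads one header at the buffer's position; ByteBuffer defaults to big-endian order. */
    static AtomHeader read(ByteBuffer buf) {
        long size = buf.getInt() & 0xFFFFFFFFL;  // treat the 32 bits as unsigned
        byte[] typeBytes = new byte[4];
        buf.get(typeBytes);
        // ISO-8859-1 maps every byte to one char, so any type value survives decoding.
        String type = new String(typeBytes, StandardCharsets.ISO_8859_1);
        int headerLength = 8;
        if (size == 1) {                         // extended 64-bit size is present
            size = buf.getLong();                // negative if the top bit is set
            headerLength = 16;
        }
        return new AtomHeader(size, type, headerLength);
    }
}
```

The headerLength field tells a caller where the atom's body begins; the body then runs up to size bytes from the start of the header, or to the end of the file when size is 0.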

The sample code contains a simple example in the EmptyMovie.mov
file, which is just an untitled movie created in QuickTime Player and saved
without modification. Open it in hexdump, od, or your
favorite hex editor (I'm fond of HexEdit for the Mac). If you dump the output as characters (e.g., hexdump -cv EmptyMovie.mov), the atom
types practically jump out at you.

If we look at the byte values instead, and carefully count the sizes of the
atoms, we can see the structure of the movie. Figure 1 shows a graphic
representation. In case you're not comfortable reading hex, the file starts
with the size and type of the first atom, a moov of 0x8c (140) bytes,
which matches the file size. It contains a 0x6c-byte (108-byte)
mvhd, which has a few non-null bytes. The
moov's other child is a udta of size
0x18 (24 bytes), which itself contains a WLOC of size
0x0c (12 bytes).
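The sizes are consistent with one another: every container's size is its own 8-byte header plus the sizes of everything inside it. Worked from the innermost atom outward (the two 16-bit coordinates in WLOC and the four trailing zero bytes in udta are explained below):

```latex
\begin{aligned}
\text{WLOC} &= 8_{\text{header}} + 2_{x} + 2_{y} &&= 12 = \texttt{0x0c}\\
\text{udta} &= 8_{\text{header}} + 12_{\text{WLOC}} + 4_{\text{zeros}} &&= 24 = \texttt{0x18}\\
\text{moov} &= 8_{\text{header}} + 108_{\text{mvhd}} + 24_{\text{udta}} &&= 140 = \texttt{0x8c}
\end{aligned}
```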

Figure 1--graphic map of atoms in EmptyMovie.mov

Little things to notice:

The moov and udta atoms contain other atoms, and
don't seem to do anything besides contain atoms. This is a key trait of
QuickTime atoms: they contain either data or other atoms, never both.
That's different from other tree-structured data formats like XML, where an
element can have both attributes and child elements.

What are the four zero bytes (0x00000000) that sit inside the udta, after
the WLOC? Depending on your mood, it's a bug or a feature. Apple
says that they write an extra 32 bits of zero after the last child of a
udta atom to maintain compatibility with a bug from way back in
QuickTime 1.0.

If your first eight bytes read as 0000 8c00 6f6d 766f, then
you're running on Windows (or another little-endian PC). QuickTime data structures are defined as "big-endian,"
meaning that the most-significant byte of a multibyte value comes first. x86 PCs
use little-endian ordering, so a dump tool that shows 16-bit values in the machine's
native order displays each pair of bytes swapped. The file itself is identical on every platform.

Finally, there's no special sequence to identify the contents as QuickTime
data, like the CAFEBABE
"magic number" that begins Java class files or the ID3
sequence that typically begins an ID3-tagged
MP3 file.
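The byte-order effect is easy to reproduce in Java, whose DataInputStream and ByteBuffer read multibyte values big-endian by default. The sketch below uses a hypothetical helper, ByteOrderDemo, to render the first eight bytes of EmptyMovie.mov (size 0x0000008c, type moov) as 16-bit words in either byte order, the way a dump tool using the host's native order would, and to read the atom size both ways.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper contrasting big- and little-endian readings of the same bytes. */
final class ByteOrderDemo {
    /** Formats data as space-separated 16-bit hex words, read in the given byte order. */
    static String words16(byte[] data, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(order);
        StringBuilder sb = new StringBuilder();
        while (buf.remaining() >= 2) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", buf.getShort() & 0xFFFF));
        }
        return sb.toString();
    }

    /** Reads the first four bytes as an unsigned 32-bit atom size in the given order. */
    static long atomSize(byte[] data, ByteOrder order) {
        return ByteBuffer.wrap(data).order(order).getInt() & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        // First eight bytes of EmptyMovie.mov: size 0x0000008c, type "moov".
        byte[] first8 = {0, 0, 0, (byte) 0x8c, 'm', 'o', 'o', 'v'};
        System.out.println(words16(first8, ByteOrder.BIG_ENDIAN));     // 0000 008c 6d6f 6f76
        System.out.println(words16(first8, ByteOrder.LITTLE_ENDIAN));  // 0000 8c00 6f6d 766f
        System.out.println(atomSize(first8, ByteOrder.BIG_ENDIAN));    // 140
        System.out.println(atomSize(first8, ByteOrder.LITTLE_ENDIAN)); // 2348810240
    }
}
```

Read little-endian, the 140-byte moov would appear to be 2,348,810,240 bytes long, which is why a portable parser should either state big-endian order explicitly or rely on Java's big-endian defaults.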

What does all this say anyway? The file-format docs define the contents of
each of the "leaf" atoms, so we look there to interpret the
mvhd and WLOC atoms. Since this is a minimal movie,
there's not much to see. The mvhd is a "movie header," a
structure that defines some metadata values like creation time, preferred
volume, time scale, et cetera. These defaults are saved into the file. The next
atom is user data, udta, a container for an arbitrarily long list
of metadata atoms. This is a good place to put your own data into the movie,
in whatever format suits you, as long as you choose an unused atom type that
isn't all lowercase; all-lowercase types are reserved for Apple. Here, there is only one
piece of user data, the window location, WLOC. It contains two
16-bit unsigned ints for x and y, in this case
(0x34, 0x18), or (52, 24) in decimal.
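As an illustration of interpreting leaf atoms, the sketch below decodes the WLOC payload and a few fields of a version-0 mvhd, where "payload" means the bytes after the 8-byte atom header. It assumes the version-0 movie header layout from the file-format documentation: a 1-byte version, 3 bytes of flags, then 32-bit creation time, modification time, time scale, and duration, a 16.16 fixed-point preferred rate, and an 8.8 fixed-point preferred volume, with times counted in seconds since midnight, January 1, 1904. LeafDecoders and MovieHeader are hypothetical names, and version-1 headers (which use 64-bit times) are simply rejected.

```java
import java.nio.ByteBuffer;

/** Hypothetical decoders for two leaf atoms; payloads exclude the 8-byte atom header. */
final class LeafDecoders {
    /** Seconds between the QuickTime epoch (1904-01-01) and the Unix epoch (1970-01-01). */
    static final long MAC_TO_UNIX_EPOCH_SECONDS = 2_082_844_800L;

    /** WLOC payload: two 16-bit values, x then y, treated as unsigned as in the text. */
    static int[] decodeWloc(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload);   // big-endian by default
        int x = buf.getShort() & 0xFFFF;
        int y = buf.getShort() & 0xFFFF;
        return new int[] {x, y};
    }

    /** A few fields from a version-0 mvhd payload. */
    static final class MovieHeader {
        long creationTimeUnix;   // seconds since 1970-01-01 UTC
        long timeScale;          // time units per second
        long duration;           // in timeScale units
        double preferredRate;    // from 16.16 fixed point; 1.0 = normal speed
        double preferredVolume;  // from 8.8 fixed point; 1.0 = full volume
    }

    static MovieHeader decodeMvhd(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload);
        int version = buf.get() & 0xFF;
        if (version != 0) {
            throw new IllegalArgumentException("only version-0 mvhd is handled in this sketch");
        }
        buf.position(4);                             // skip the 3 bytes of flags
        MovieHeader h = new MovieHeader();
        h.creationTimeUnix = (buf.getInt() & 0xFFFFFFFFL) - MAC_TO_UNIX_EPOCH_SECONDS;
        buf.getInt();                                // modification time, ignored here
        h.timeScale = buf.getInt() & 0xFFFFFFFFL;
        h.duration = buf.getInt() & 0xFFFFFFFFL;
        h.preferredRate = buf.getInt() / 65536.0;
        h.preferredVolume = buf.getShort() / 256.0;
        return h;
    }
}
```

As a cross-check, the full version-0 layout (continuing with 10 reserved bytes, a 36-byte matrix, and seven more 32-bit fields) totals 100 bytes, which together with the 8-byte header matches the 0x6c (108-byte) mvhd in EmptyMovie.mov.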

Doing It the Hard Way

While QuickTime for Java generally isolates you from the grubby details of
the format, I've included a simple all-Java QuickTime file parser so we can
quickly see the structure of a movie file on any J2SE platform. Download the
accompanying source tarball and open it up. The parser source and a
pre-compiled .jar are in the atom-parse directory. An Ant build.xml file is included to help you build the code, if you're interested (run ant help to see the available targets), or you can just run it from the .jar with java
-classpath atomparse.jar com.mac.invalidname.qtatomparse.AtomParser.

The code starts with a basic ParsedAtom class, which represents
any atom found in the file. This is subclassed as
ParsedContainerAtom, containing an array of its children, and
ParsedLeafAtom, which is meant to be a parent for type-specific
subclasses that interpret particular atom types. A factory provides the parser
with the class for a given type--new classes can be added by editing its
properties file. Finally, AtomParser puts it all together,
recursively calling a parseAtoms method when it discovers a
container atom, and returning an array of children.
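The class structure just described might be sketched as follows. This is not the parser's source: the fields and constructors, the WlocAtom subclass, and the AtomFactory name are assumptions for illustration. The hypothetical factory looks up a leaf class by four-character type in a java.util.Properties table, instantiates it reflectively, and falls back to a generic ParsedLeafAtom for unknown types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Any atom found in the file: type, total size, and offset in the file. */
class ParsedAtom {
    final String type;
    final long size;
    final long offset;

    ParsedAtom(String type, long size, long offset) {
        this.type = type;
        this.size = size;
        this.offset = offset;
    }
}

/** An atom that holds only other atoms. */
class ParsedContainerAtom extends ParsedAtom {
    final List<ParsedAtom> children = new ArrayList<>();

    ParsedContainerAtom(String type, long size, long offset) {
        super(type, size, offset);
    }
}

/** An atom that holds data; parent of type-specific subclasses. */
class ParsedLeafAtom extends ParsedAtom {
    ParsedLeafAtom(String type, long size, long offset) {
        super(type, size, offset);
    }
}

/** Hypothetical type-specific subclass for the window-location atom. */
class WlocAtom extends ParsedLeafAtom {
    WlocAtom(String type, long size, long offset) {
        super(type, size, offset);
    }
}

/** Hypothetical factory: maps atom types to leaf classes named in a Properties table. */
final class AtomFactory {
    private final Properties typeToClass;

    AtomFactory(Properties typeToClass) {
        this.typeToClass = typeToClass;
    }

    ParsedLeafAtom createLeaf(String type, long size, long offset) {
        String className = typeToClass.getProperty(type);
        if (className == null) {
            return new ParsedLeafAtom(type, size, offset);  // unknown type: generic leaf
        }
        try {
            return Class.forName(className)
                    .asSubclass(ParsedLeafAtom.class)
                    .getDeclaredConstructor(String.class, long.class, long.class)
                    .newInstance(type, size, offset);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + className + " for " + type, e);
        }
    }
}
```

Adding support for a new type then means writing one subclass and one properties line, such as WLOC=com.example.WlocAtom (a hypothetical class name); a key containing a space, like url followed by a space, would need that space escaped with a backslash in a .properties file.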

The critical section of the parser reads an atom's size, type, extended size,
and data, given raf (a RandomAccessFile),
off (the current offset being read, i.e., the start of an atom), and
stopAt (where the parent atom or file ends).
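A minimal sketch of that logic, using the same raf, off, and stopAt parameters, might look like this. It is not the parser's actual source; SimpleAtomWalker, SimpleAtom, and the CONTAINERS set are hypothetical. Like the parser, it uses BigInteger to turn byte arrays into longs and keeps a fixed list of known container types. A size of 0 is taken to run to stopAt, and fewer than 8 remaining bytes (such as the four zero bytes at the end of a udta) end the loop.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical recursive atom walker; not the sample parser's code. */
final class SimpleAtomWalker {
    /** Minimal parsed node: type, total size, file offset, and children (containers only). */
    static final class SimpleAtom {
        final String type;
        final long size;
        final long offset;
        final List<SimpleAtom> children = new ArrayList<>();

        SimpleAtom(String type, long size, long offset) {
            this.type = type;
            this.size = size;
            this.offset = offset;
        }
    }

    /** Known container types; anything else is treated as a leaf and skipped over. */
    static final Set<String> CONTAINERS = new HashSet<>(Arrays.asList(
            "moov", "trak", "edts", "mdia", "minf", "dinf", "stbl", "udta"));

    /** Parses sibling atoms in [off, stopAt), recursing into known containers. */
    static List<SimpleAtom> parseAtoms(RandomAccessFile raf, long off, long stopAt) throws IOException {
        List<SimpleAtom> atoms = new ArrayList<>();
        while (off + 8 <= stopAt) {                 // need at least a full 8-byte header
            raf.seek(off);
            byte[] sizeBytes = new byte[4];
            byte[] typeBytes = new byte[4];
            raf.readFully(sizeBytes);
            raf.readFully(typeBytes);
            long size = new BigInteger(1, sizeBytes).longValue();  // unsigned 32-bit value
            String type = new String(typeBytes, StandardCharsets.ISO_8859_1);
            long headerLength = 8;
            if (size == 1) {                        // extended 64-bit size follows the type
                byte[] extBytes = new byte[8];
                raf.readFully(extBytes);
                size = new BigInteger(1, extBytes).longValue();    // wraps negative if top bit set
                headerLength = 16;
            } else if (size == 0) {                 // atom runs to the end of the range
                size = stopAt - off;
            }
            if (size < headerLength || off + size > stopAt) {
                break;                              // malformed size: stop this level
            }
            SimpleAtom atom = new SimpleAtom(type, size, off);
            if (CONTAINERS.contains(type)) {
                atom.children.addAll(parseAtoms(raf, off + headerLength, off + size));
            }
            atoms.add(atom);
            off += size;                            // advance to the next sibling
        }
        return atoms;
    }
}
```

Calling parseAtoms(raf, 0, raf.length()) on a file returns its top-level atoms with their subtrees filled in.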

A few caveats to this code. First, please excuse my abuse of the
BigInteger class to get longs from four-byte arrays,
but the alternative is a blinding amount of bit-shifting. Second, I use longs
for atom sizes to avoid sign problems: Java's 32-bit ints are signed, while the
usual QuickTime atom size is a 32-bit unsigned value. The scheme breaks down only
if you happen to encounter an atom larger than 9,223,372,036,854,775,807 bytes
(i.e., a 64-bit extended size with the top bit set). Just thought I'd mention that, in
case you just got back from the store with a 10 exabyte drive. Also,
my scheme for knowing what atoms are containers is to list known containers in
AtomParser. If I've missed one, the parser handles it fairly
gracefully, because we have the size of the atom and simply advance the offset
to the next atom (unfortunately, without parsing the children).
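If you can require Java 8 or later, the standard library offers another way around both BigInteger and hand-written bit-shifting: Integer.toUnsignedLong widens an int without sign extension, and Long.toUnsignedString prints all 64 bits as an unsigned number. The sketch below uses a hypothetical helper, UnsignedSizes, that reads sizes through java.io.DataInput (which RandomAccessFile implements, and whose readInt and readLong are big-endian) and rejects an extended size with the top bit set instead of silently producing a negative long.

```java
import java.io.DataInput;
import java.io.IOException;

/** Hypothetical helper: unsigned atom sizes without BigInteger (requires Java 8+). */
final class UnsignedSizes {
    /** Reads a 32-bit size; readInt is big-endian, and toUnsignedLong discards the sign. */
    static long readSize32(DataInput in) throws IOException {
        return Integer.toUnsignedLong(in.readInt());
    }

    /** Reads a 64-bit extended size, refusing values above Long.MAX_VALUE. */
    static long readSize64(DataInput in) throws IOException {
        long raw = in.readLong();
        if (raw < 0) {
            // Top bit set: report the true unsigned value rather than a negative size.
            throw new IOException("extended atom size " + Long.toUnsignedString(raw)
                    + " exceeds Long.MAX_VALUE");
        }
        return raw;
    }
}
```

With a RandomAccessFile positioned at an atom, readSize32(raf) replaces the four-byte array and the BigInteger conversion in one call.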

So far, so boring. Let's try a more interesting bit of content. The movie
tim-drm-ref.mov is a 45-second sound bite of Tim O'Reilly
discussing digital rights management at the recent O'Reilly Mac OS X conference. The file is a reference to a 51 MB movie of the entire
keynote panel, yet this file is a dainty 6 KB, since it consists entirely of
metadata, including the references to the original movie on the O'Reilly
web site.

This file is far more typical of what we expect to see in a movie, or more
accurately, in a moov (go ahead, say it out loud:
moo-vee). In addition to the metadata-bearing mvhd movie
header and the udta user data, there are two trak
atoms, both with a deep, yet similar, structure. This movie consists of two
"tracks," one for video and one for audio. Tracks store metadata in
the tkhd track header (analogous to the mvhd we saw
earlier), an "edits" structure that indicates what parts of the
underlying media are used by the track, and a detailed "media"
structure.
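Because both tracks share the same shape, it's handy to address atoms by path. The sketch below is a hypothetical helper, not part of the sample parser: it defines a tiny in-memory tree and collects every atom reached by following a slash-separated path of types from the root, so a two-track movie yields two matches for a path such as trak/tkhd.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical in-memory atom tree with a path lookup such as "trak/mdia". */
final class AtomPaths {
    static final class Node {
        final String type;
        final List<Node> children;

        Node(String type, Node... children) {
            this.type = type;
            this.children = Arrays.asList(children);
        }
    }

    /** Collects every atom reached by following the slash-separated types below root. */
    static List<Node> findAll(Node root, String path) {
        List<Node> current = new ArrayList<>();
        current.add(root);
        for (String step : path.split("/")) {
            List<Node> next = new ArrayList<>();
            for (Node n : current) {
                for (Node child : n.children) {
                    if (child.type.equals(step)) {   // match direct children only
                        next.add(child);
                    }
                }
            }
            current = next;
        }
        return current;
    }
}
```

Once such a tree has been built from a parser's output, findAll(moov, "trak/tkhd") would return the header of every track, in file order.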

The media structure has, again, a metadata header, a hdlr
handler atom that indicates which component should handle the media data, a
"data information" structure made up of dref data
references to say where the media data is (in this file, elsewhere on disk, on
the net, etc.), and finally, a tricky structure for locating and interpreting
media samples.
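For a reference movie like tim-drm-ref.mov, hdlr and dref are the interesting atoms. The sketch below assumes these layouts (verify them against the file-format documentation before relying on it): an hdlr body starts with 4 bytes of version and flags, followed by the four-character component type and subtype (for example mhlr with vide or soun); a dref body has 4 bytes of version and flags, a 32-bit entry count, and then entries that are themselves atoms carrying their own version and flags, where a flags value with the low bit set means the data is in the same file and a "url " entry holds a null-terminated URL. MediaAtoms is a hypothetical name; only "url " entries are decoded, and alias (alis) entries are skipped.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical decoders for hdlr and dref payloads (bytes after the 8-byte atom header). */
final class MediaAtoms {
    /** Returns {componentType, componentSubtype}, e.g. {"mhlr", "vide"}. */
    static String[] handlerTypes(byte[] hdlrPayload) {
        return new String[] {fourCC(hdlrPayload, 4), fourCC(hdlrPayload, 8)};
    }

    /** Lists URLs from external "url " entries of a dref payload. */
    static List<String> urlReferences(byte[] drefPayload) {
        ByteBuffer buf = ByteBuffer.wrap(drefPayload);
        buf.getInt();                                     // dref's own version and flags
        long count = buf.getInt() & 0xFFFFFFFFL;
        List<String> urls = new ArrayList<>();
        for (long i = 0; i < count && buf.remaining() >= 12; i++) {
            int start = buf.position();
            long size = buf.getInt() & 0xFFFFFFFFL;
            String type = fourCC(drefPayload, start + 4);
            int flags = buf.getInt(start + 8) & 0x00FFFFFF;  // low 24 bits of version/flags
            if (size < 12 || start + size > drefPayload.length) {
                break;                                    // malformed entry
            }
            if (type.equals("url ") && (flags & 1) == 0) {   // skip self-references
                int end = start + 12;
                while (end < start + size && drefPayload[end] != 0) {
                    end++;                                // find the C-string terminator
                }
                urls.add(new String(drefPayload, start + 12, end - start - 12,
                        StandardCharsets.ISO_8859_1));
            }
            buf.position((int) (start + size));           // jump to the next entry
        }
        return urls;
    }

    private static String fourCC(byte[] data, int offset) {
        return new String(data, offset, 4, StandardCharsets.ISO_8859_1);
    }
}
```

Pointed at each track's dref, a walker like this would be one way to see where a reference movie's media actually lives.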

It's too much to try to understand what all of these atoms represent right away
if you're new to QuickTime, but it might be helpful to look at Apple's Introduction
to QuickTime tutorial, specifically the section on tracks
and media, and see how the contents map fairly directly onto the structure
presented in the preceding two paragraphs. Another point of interest is
Ridgeworks' QTatomizer,
a shareware product that represents the atom structure of a QuickTime movie as
a Swing JTree.