Making Media from Scratch, Part 1

QuickTime is often described as a "media creation" API, and that
means a lot more than just the ability to edit your audio and video and export
it to an arbitrary format. This month I'd like to take the term very
literally and show you how to create your movies in Java, one frame at a time,
without depending on a pre-existing movie.

To do that, we need to take another look at the format of a QuickTime movie. In "Parsing and Writing the QuickTime File Format," we saw how structures called "atoms" represented this format. For today, let's strip away those details and look at the big picture:

A movie contains metadata (creation and modification time, current
selection, preferred volume and rate, etc.) and zero or more tracks.

A track contains metadata (creation and modification time,
playback quality), exactly one media object, and an edit list
describing which parts of the media are to be used.

A media object contains a data reference that indicates where the audio, video, or other data actually is (in the movie file, in another file, on the network, etc.); information about which QuickTime "media handler" can load, save, and play the data; and a structure called a "sample table" to represent where the sample for a given time can be found in the data.

Graphically, this can be seen as a movie where the references are all to external sources (files, URLs, and other movies), as shown in Figure 1, or a "flattened" one, in which the data is all contained within the same .mov file as the movie's structure, as shown in Figure 2. Either way, the movie is the structure that represents where the samples are, how they're arranged, and what to do with them.
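To make that containment concrete, here is a rough sketch of the hierarchy as plain Java types. It is only an illustration and not the QuickTime for Java API: the class names (MovieModel, TrackModel, MediaModel), their fields, and the sample values are all hypothetical, chosen to mirror the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the movie/track/media hierarchy described above.
// These are NOT QuickTime for Java classes; every name here is illustrative.
class MovieStructureSketch {

    // A media object: where the data lives, which handler understands it,
    // and (standing in for the sample table) the samples themselves.
    static class MediaModel {
        final String dataReference; // e.g. a file name or URL
        final String handlerType;   // e.g. "text", "video", "sound"
        final List<String> samples = new ArrayList<>();

        MediaModel(String dataReference, String handlerType) {
            this.dataReference = dataReference;
            this.handlerType = handlerType;
        }
    }

    // A track: exactly one media object plus an edit list.
    static class TrackModel {
        final MediaModel media;
        // Each edit is {mediaStartTime, duration}: which part of the media to use.
        final List<int[]> editList = new ArrayList<>();

        TrackModel(MediaModel media) {
            this.media = media;
        }
    }

    // A movie: zero or more tracks.
    static class MovieModel {
        final List<TrackModel> tracks = new ArrayList<>();
    }

    // Builds a one-track movie whose media holds two text samples.
    static MovieModel buildExample() {
        MediaModel textMedia = new MediaModel("texttrack.mov", "text");
        textMedia.samples.add("Hello");
        textMedia.samples.add("World");

        TrackModel textTrack = new TrackModel(textMedia);
        textTrack.editList.add(new int[] {0, 200}); // use the media from time 0 for 200 units

        MovieModel movie = new MovieModel();
        movie.tracks.add(textTrack);
        return movie;
    }

    public static void main(String[] args) {
        MovieModel movie = buildExample();
        System.out.println("tracks: " + movie.tracks.size());
        System.out.println("samples: " + movie.tracks.get(0).media.samples.size());
    }
}
```

Note that the samples hang off the media, not the track or the movie; the track only says which stretches of that media get used.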

Sampling Samples

By "samples," we mean what is to be seen or heard during the smallest
unit of time relevant to that kind of media. For
example, imagine a format where we have totally uncompressed video (equivalent
to, say, North American television) and uncompressed CD-quality audio. The
video, by our definition, is 30 frames per second, so there are 30 video
samples in one second. CD-quality audio is 44.1 kHz, meaning there are 44,100
samples in a second.
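The arithmetic behind those figures is easy to check. The following is a minimal sketch using the two rates from the example above; the class and method names are made up for illustration.

```java
// Sample-count arithmetic for the hypothetical uncompressed format above.
// Class and method names are illustrative only.
class SampleCounts {
    static final int VIDEO_SAMPLES_PER_SECOND = 30;     // 30 frames per second
    static final int AUDIO_SAMPLES_PER_SECOND = 44_100; // 44.1 kHz

    // Number of samples produced at the given rate over a whole number of seconds.
    static long samplesIn(int samplesPerSecond, int seconds) {
        return (long) samplesPerSecond * seconds;
    }

    // How many audio samples play while a single video frame is on screen.
    static int audioSamplesPerVideoFrame() {
        return AUDIO_SAMPLES_PER_SECOND / VIDEO_SAMPLES_PER_SECOND;
    }

    public static void main(String[] args) {
        System.out.println("video samples in one minute: "
                + samplesIn(VIDEO_SAMPLES_PER_SECOND, 60));
        System.out.println("audio samples in one minute: "
                + samplesIn(AUDIO_SAMPLES_PER_SECOND, 60));
        System.out.println("audio samples per video frame: "
                + audioSamplesPerVideoFrame());
    }
}
```

At these rates each video frame spans 1,470 audio samples, which gives a sense of how differently sized a "sample" can be from one kind of media to the next.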

QuickTime, interestingly, realizes that a player would generally like its data
to be organized with regard to time. For example, you don't want to have a
file with all of the video data first and then all the audio data, since playing
back would require jumping back and forth between the two, and the read/write
head on your hard drive would scream in agony. It's easier to mix them, so
that the video data for a certain time and the audio data for that time are in
the same place. In QuickTime's worldview, this is a process of "chunking" — the media data combines video, audio, and any other data into one stream (a long run of bytes), with "chunks" of
audio, video, and other samples grouped by time. It's up to the media object to
manage several tables, like a time-to-sample table and a sample-to-chunk table,
to allow it to find the samples at playback time.
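To give a feel for what one of those tables does, here is a simplified sketch of a time-to-sample lookup. It assumes the table is stored as runs of (sample count, sample duration) with positive durations, and it ignores chunks entirely; the class and method names are hypothetical, and this is not how you would query a real Media object through QuickTime for Java.

```java
// Simplified, hypothetical time-to-sample lookup. Each table row is a run of
// consecutive samples that share one duration: {sampleCount, sampleDuration}.
// Durations are in media time-scale units and are assumed to be positive.
class TimeToSampleSketch {

    // Returns the zero-based index of the sample covering mediaTime,
    // or -1 if mediaTime falls outside the media.
    static int sampleIndexAt(int[][] table, long mediaTime) {
        if (mediaTime < 0) {
            return -1;
        }
        long runStart = 0;        // media time at which the current run begins
        int firstSampleInRun = 0; // index of the first sample in the current run
        for (int[] run : table) {
            int count = run[0];
            int duration = run[1];
            long runLength = (long) count * duration;
            if (mediaTime < runStart + runLength) {
                // The time falls inside this run; step through it by duration.
                return firstSampleInRun + (int) ((mediaTime - runStart) / duration);
            }
            runStart += runLength;
            firstSampleInRun += count;
        }
        return -1; // past the end of the last sample
    }

    public static void main(String[] args) {
        // Three samples of 100 units each, then two samples of 50 units each.
        int[][] table = { {3, 100}, {2, 50} };
        System.out.println(sampleIndexAt(table, 250)); // inside the third sample
        System.out.println(sampleIndexAt(table, 320)); // inside the fourth sample
        System.out.println(sampleIndexAt(table, 400)); // past the end
    }
}
```

A real player would follow this with a sample-to-chunk lookup to turn the sample index into a byte offset in the data.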

Fortunately, you as a developer aren't responsible for all of that bookkeeping,
but it's good to understand how it works.

Getting back to the point, to make a movie from scratch, we need to do the
following:

Lay down samples.

Add these to a media object.

Add the media to an appropriate track.

Add the track to a movie.

You may have noticed in the diagrams above that our hypothetical movie
contains not just an audio and a video track, but also a "text track." This is exactly what it sounds like: a time-based collection of text, commonly used for providing captions to QuickTime movies. More
technically, it is a track where the media samples are ordinary text strings.
This is a good place to start with creating our own media, since it doesn't
require knowing anything about images or sounds.

The last argument is a time scale for the media. Movies, tracks, and media
all have their own time scale, which is the number of time units that pass in
one second. For a movie, this value defaults to 600, which has the advantage
of being an even multiple of many common frame rates: 30 (NTSC video), 25 (PAL
and SECAM video), and 24 (film). Dean Perry of Abstract Plane also reminds me it's an even multiple of the 60 "ticks" per second that older Macintoshes
used for timekeeping. However, you're free to use and abuse the time scales
as you see fit. I arbitrarily chose a value of 100 for my media, so my sample
durations are measured in hundredths of a second.
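If you want to convince yourself of the arithmetic, the sketch below checks that 600 divides evenly by each of those rates and converts values between time scales. It is plain Java with made-up helper names; nothing in it comes from the QuickTime for Java API.

```java
// Time-scale arithmetic: a time scale is the number of time units per second.
// All names here are hypothetical helpers, not QuickTime for Java calls.
class TimeScaleSketch {
    static final int MOVIE_TIME_SCALE = 600; // the default movie time scale

    // Time units occupied by one frame; fails if the rate does not divide evenly.
    static int unitsPerFrame(int timeScale, int framesPerSecond) {
        if (timeScale % framesPerSecond != 0) {
            throw new IllegalArgumentException(
                    timeScale + " is not an even multiple of " + framesPerSecond);
        }
        return timeScale / framesPerSecond;
    }

    // Converts a duration in time-scale units to seconds.
    static double toSeconds(long duration, int timeScale) {
        return (double) duration / timeScale;
    }

    // Re-expresses a value from one time scale in another.
    // Integer division truncates if the result is not a whole number of units.
    static long rescale(long value, int fromScale, int toScale) {
        return value * toScale / fromScale;
    }

    public static void main(String[] args) {
        for (int rate : new int[] {30, 25, 24, 60}) {
            System.out.println(rate + " per second -> "
                    + unitsPerFrame(MOVIE_TIME_SCALE, rate) + " units each");
        }
        // A duration of 100 in a media time scale of 100 is one second,
        // which is 600 units in the movie's time scale.
        System.out.println(toSeconds(100, 100) + " s = "
                + rescale(100, 100, MOVIE_TIME_SCALE) + " movie units");
    }
}
```

With a time scale of 600, a frame lasts 20 units at 30 fps, 24 units at 25 fps, and 25 units at 24 fps, so none of these rates ever needs a fractional duration.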

Next, we tell the new Media object that we intend to do some
edits:

textMedia.beginEdits();

We then get the media handler object, required in this case because it has a
method for creating new text samples:

TextMediaHandler handler = textMedia.getTextHandler();

and we create a rectangle that will be used in every sample to describe the
shape that the text is to be rendered into when played back:

We're finally ready to start adding samples. The sample application uses a
static array of Strings, getting a QuickTime-compatible
QTPointer to each one and passing that as the first argument to
the TextMediaHandler.addTextSample() method. Here's how that call
looks:

QDRect textBox: a QDRect rectangle describing the box in which the text is to be displayed.

int displayFlags: zero or more behavior flags, logically OR'd together, controlling things such as clipping or scaling the text when it is displayed over other video. These flags are defined in StdQTConstants, and the supported flags are documented for the native TextMediaAddTextSample function.

int scrollDelay: a time to delay between scrolls if the
dfScrollIn and/or dfScrollOut flags are set. Not
useful in this app, with its short samples, but potentially useful for other
purposes.

int hiliteStart: the index of the first character of text to
highlight (select), if any.

int hiliteEnd: the index of the last character of text to
highlight.

QDColor rgbHiliteColor: the color of the highlight, if
used.

int duration: the duration of this sample, expressed in the
media's time scale.

The duration is interesting for a couple of reasons. First, it's expressed
in terms of the media's time scale. In our case, the time scale is 100 and the
duration is 100, so the sample is exactly one second long. Of course, we could
have half-second samples by using a duration of 50, or any sample length that
can be expressed as a fraction of duration over time scale. Moreover, despite
the commonness of fixed frame rates in audio and video (30 fps video, 44.1 kHz
sound, etc.), QuickTime requires no such thing: each sample can be of an
arbitrary duration, different from the sample before or after it.
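To see what variable durations mean for playback, the sketch below takes a list of per-sample durations in a time scale of 100 and works out when each sample starts. The durations are invented for illustration, and the class and method names are hypothetical.

```java
// Start times for samples of differing durations. Durations are expressed in
// media time-scale units; names and values are illustrative only.
class VariableDurations {

    // Returns the start time, in seconds, of each sample.
    static double[] startTimesInSeconds(int[] durations, int timeScale) {
        double[] starts = new double[durations.length];
        long elapsed = 0; // running total in time-scale units
        for (int i = 0; i < durations.length; i++) {
            starts[i] = (double) elapsed / timeScale;
            elapsed += durations[i];
        }
        return starts;
    }

    // Total length of all samples, in seconds.
    static double totalSeconds(int[] durations, int timeScale) {
        long total = 0;
        for (int d : durations) {
            total += d;
        }
        return (double) total / timeScale;
    }

    public static void main(String[] args) {
        int timeScale = 100;
        int[] durations = {100, 50, 250}; // 1 s, 0.5 s, and 2.5 s samples
        double[] starts = startTimesInSeconds(durations, timeScale);
        for (int i = 0; i < starts.length; i++) {
            System.out.println("sample " + i + " starts at " + starts[i] + " s");
        }
        System.out.println("total: " + totalSeconds(durations, timeScale) + " s");
    }
}
```

Only the running total of durations determines where a sample falls on the timeline, so there is no frame rate to declare up front.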

Wrapping up the application, once the loop is done adding samples, we inform
the Media that we're done editing:

after which we save the file to disk as texttrack.mov, in the
current directory.

To compile and run the sample code, make sure you've worked through any
versioning or classpath issues as covered in our re-introduction
to QTJ a few months back. When you're done, the result will look something
like this (assuming you have the QT plug-in):

One of the nice things to notice is that we picked up word-wrap
automatically, without hand-coding line-breaks.