MP3 Ins and Outs

Until recently, my audioblog experiments focused mainly on interviews in which I discuss the high-tech topics that I research and write about. Over the holidays I tried something new: an NPR-style podcast that weaves music, spoken-word quotations, and narration into a story. In this case, the story is about open source audio.

The exercise got me thinking about the process of audio storytelling. The pros at NPR and elsewhere make it look easy, but of course it's not. I guess that's mainly because all storytelling is hard, at least for me. Factor that out, though, and you're left with a bunch of logistical challenges: finding quotes, blending sources, linking, and making a smooth and consistent final presentation. We take these things pretty much for granted in the text medium, but they're new territory for the majority who (like me) lack experience with audio and video.

Wearing my critical hat, I'd have to say that my first effort barely cleared the bar in terms of acceptable production values. Fortunately, most of the reviews suggested that I did clear that bar, and that listeners were willing to overlook the technical imperfections of the podcast in order to appreciate the story that it told.

One comment took me by surprise, though. Marc Canter wrote:

Jon Udell's first audioblog - congrats Jon! Now where's the playlist? The meta data? The annotation?

Come on dude - you're a nerd - you know how good this COULD be!

Your excellent description of the audiocast was elucidating - but I can't imagine you writing one up like that - for every one you make!

When I followed up with Marc, I learned that what he wished I had supported was Media RSS (specification, FAQ). Media RSS is a Yahoo!-sponsored RSS 2.0 module that's intended to complement the <enclosure> tag, supplying metadata—about the media file itself, its classifications, and its creators—that will make the content easier to discover.

Although the FAQ claims Media RSS is mainly for video enclosures and specifically aims to enhance Yahoo!'s video search engine, Marc's interest clearly was broader. He believes that all online media files are metadata-impoverished, and he wants to enrich them.

I believe the same thing. I have no objection to Media RSS, and if it gains traction I'll be happy to support it. I don't believe, however, that the lack of Media RSS, or some other media-specific XML metadata standard, is the high-order bit that gates the future success of podcasting, or vlogging, or screencasting.

We could always use more and better formal metadata than we have. Standards that define that metadata, and tools that help media creators apply it, will certainly be helpful. But one of the key points I've been trying to make in this series of columns is that if we hope to contextualize media content, formal metadata produced by content authors is not the first and best line of attack. The blogosphere itself is a living context engine that works at planetary scale. It will process media content for us, just as it processes textual content, if we can expose the right kinds of entry points into the stuff. The fuel that powers the blogosphere's engine is a 50/50 mix of linking and quotation. If we can make it easy for everyone to link into and quote from media content, we'll unleash powerful forces.

So how do we expose those entry points? What Marc Canter and I violently agree on is that there's no convention. It's not just that we lack a conventional way to represent what Marc calls the "ins and outs," that is, the start and end times of segments. That's trivial. The more fundamental problem is that we lack a conventional way to find and export these ins and outs.

Recently, for example, I blogged this 75-second clip from David Bornstein's Pop!Tech lecture posted on ITConversations. Given a start time of 23:43 and an end time of 24:58, it's easy enough to form this URL:

Doug Kaye provides a helper to make it even easier. But snagging the ins and outs from your media player in order to create that clip isn't always a walk in the park. Let's review the process in four popular players (in no particular order):

Winamp

Because Winamp uses the HTTP 1.1 Range mechanism I mentioned in MP3 Sound Bites, you can start jumping around in a web-based file immediately, without waiting for the whole thing to load. And using CTRL-J, you can jump to precise minute:second offsets. Added to this goodness is the fact that the left and right arrow keys jump forward and backward in small (five-second) increments.

What's not to like? If I had my druthers, I'd make the jump increment one second rather than five. You need that precise control to mark the boundaries of a quotation. And mouse-driven scrolling, which applies to all players, of course, lacks the necessary precision.

I'd also like to be able to set one or more pairs of start/stop markers and save that information for later use.
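
To make the Range mechanism concrete, here is a minimal Python sketch of how a client might ask a server for only the bytes around a quotation. It assumes a constant-bit-rate file, so that a time offset maps linearly to a byte offset. The 64 kbps rate, the example.com URL, and the function names are all hypothetical, and files with variable bit rates or leading tags will not line up this neatly. It illustrates the HTTP feature, not Winamp's internals.

```python
import urllib.request


def range_header(start_byte: int, end_byte: int) -> str:
    """Format an HTTP/1.1 Range header value for an inclusive byte span."""
    if start_byte < 0 or end_byte < start_byte:
        raise ValueError("need 0 <= start_byte <= end_byte")
    return f"bytes={start_byte}-{end_byte}"


def cbr_byte_offset(seconds: int, bitrate_kbps: int) -> int:
    """Estimate the byte offset of a time position.

    Assumption: constant bit rate and no leading metadata, so the
    result is only an approximation for real-world MP3 files.
    """
    return seconds * bitrate_kbps * 1000 // 8


def build_range_request(url: str, start_s: int, end_s: int,
                        bitrate_kbps: int = 64) -> urllib.request.Request:
    """Build (but do not send) a request for the bytes spanning a clip."""
    first = cbr_byte_offset(start_s, bitrate_kbps)
    last = cbr_byte_offset(end_s, bitrate_kbps) - 1  # Range ends are inclusive
    request = urllib.request.Request(url)
    request.add_header("Range", range_header(first, last))
    return request


# Hypothetical URL; 1423 and 1498 are 23:43 and 24:58 expressed in seconds.
req = build_range_request("http://example.com/talk.mp3", 1423, 1498)
print(req.get_header("Range"))
```

Passing the request to `urllib.request.urlopen` would send it. A server that honors ranges answers with status 206 (Partial Content) and only the requested bytes, which is what lets a player jump into the middle of a large file without downloading everything before that point.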

RealPlayer

Like Winamp, RealPlayer allows random access into a web-based MP3 file and enables you to jump (Real calls it seek) to exact minute:second offsets. You can also move forward and backward in one-second increments using CTRL plus the left/right arrow keys.

In version 10, we see the beginning of a segment-editing sensibility. When you add a clip to your Favorites, you have the option to set a start time (though not a corresponding end time or duration). On Windows, for the clip mentioned above, the result is this shortcut:

or, alternatively, a reference to a .RAM encapsulation of that URL by way of a ramgen service. It seems to me that if RealNetworks hosted a well-known instance of that service, it could accumulate an interesting library of media fragments.

Something else puzzles me here. If you construct this URL and load it, you'll find that RealPlayer's 23:43 isn't quite the same as Winamp's, or ITConversations', or my own clipping service's, or Audacity's. In RealPlayer, the sentence that begins like so, "Historically when we think of how social change happens," appears not at 23:43, but at 23:50. I've yet to get to the bottom of this discrepancy, which amounts to a whopping seven seconds in this example, but it's clearly problematic. QuickTime, by the way, exhibits the same behavior as RealPlayer.

Windows Media Player

Windows Media Player doesn't support HTTP 1.1 random access, so you can only hunt for ins and outs within the portion of an MP3 file that's already loaded. You can move forward and backward using the arrow keys, but not precisely: WMP seems to divide the file into 20 increments no matter what its length. And there's no jump-to/seek-to feature.

WMP does, however, take a stab at providing UI for extracting a segment. Its Media Link for E-Mail feature offers these controls: Mark In, Mark Out, and Send media link in e-mail. The lack of precise selection control makes it hard to establish the right ins and outs. There's no way to review or adjust the segment you've selected. And the resulting description is only accessible by way of e-mail. But the e-mail does include this interesting XML attachment:

This is an ASX file. The complete set of ASX elements is documented here.

When I opened the ASX attachment I'd sent myself, it did play the referenced segment, but only after loading half the 20MB file. As already mentioned, this is because WMP doesn't do HTTP 1.1 random access with MP3s.
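
For readers who haven't seen ASX, the following Python sketch generates a small ASX document that points at a segment of a media file. It is not a reproduction of the attachment WMP produced. It simply uses the ASX elements ENTRY, REF, STARTTIME, and DURATION to show the general shape of such a reference. The example.com URL and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def asx_time(total_seconds: int) -> str:
    """Render whole seconds in the hh:mm:ss.fract form used by ASX."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.0"


def build_asx(href: str, start_s: int, end_s: int) -> str:
    """Return ASX text describing the segment [start_s, end_s) of href."""
    if not 0 <= start_s < end_s:
        raise ValueError("need 0 <= start_s < end_s")
    asx = ET.Element("ASX", {"version": "3.0"})
    entry = ET.SubElement(asx, "ENTRY")
    ET.SubElement(entry, "REF", {"HREF": href})
    # The "in" point is the start time; the "out" point is implied
    # by the duration rather than stated directly.
    ET.SubElement(entry, "STARTTIME", {"VALUE": asx_time(start_s)})
    ET.SubElement(entry, "DURATION", {"VALUE": asx_time(end_s - start_s)})
    return ET.tostring(asx, encoding="unicode")


# Hypothetical URL; the times are 23:43 and 24:58 expressed in seconds.
print(build_asx("http://example.com/talk.mp3", 1423, 1498))
```

Note that a format like this carries a start and a duration rather than a start and an end, so any tool that exchanges ins and outs with it has to convert between the two.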

QuickTime

When it comes to capturing ins and outs, the standard version of QuickTime may be the least useful of the bunch. It doesn't support HTTP 1.1 random access for MP3s, and it won't let you jump to a specified offset. The arrow keys afford precise cursor control, but there's no way to establish or adjust a segment.

QuickTime Pro, by contrast, does offer in/out markers. You can adjust these precisely with arrow keys and, thanks to the ability to play only the selection, you can review and tweak a segment. However, there's no way to export the ins and outs.

Finally, as I mentioned, QuickTime's notion of 23:43 corresponds to RealPlayer's but differs from the notion held by the other players mentioned here. If someone can explain that to me and suggest what to do about it, I'd be grateful.

A modest proposal

In the final analysis, no single player has all these features:

HTTP 1.1 random access

Seek/jump to offset

Precise cursor control

Selection marking and review, with precise adjustment of segment boundaries

Export in/out times

If we want the natural contextualizing forces of the Web to work on media files, we'll need to establish this feature set as a de facto standard. And we'll need UI conventions that make the selection of time-based content as universal an idiom, across platforms and applications, as the selection of text already is.
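
As a rough illustration of the last two items on that list, here is a minimal Python sketch of a segment that can be adjusted in one-second steps and then exported. Everything in it is an assumption made for the example: the `Segment` class, the JSON field names, and the example.com URL are hypothetical, and no existing player exposes this interface.

```python
import json
from dataclasses import dataclass


def parse_offset(text: str) -> int:
    """Convert 'mm:ss' or 'hh:mm:ss' into whole seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def format_offset(seconds: int) -> str:
    """Convert whole seconds back into minute:second form."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}:{secs:02d}"


@dataclass(frozen=True)
class Segment:
    url: str
    start: int  # "in" point, in seconds
    end: int    # "out" point, in seconds

    def __post_init__(self) -> None:
        if not 0 <= self.start < self.end:
            raise ValueError("need 0 <= start < end")

    @property
    def duration(self) -> int:
        return self.end - self.start

    def nudge(self, d_start: int = 0, d_end: int = 0) -> "Segment":
        """Return a copy with either boundary moved by whole seconds."""
        return Segment(self.url, self.start + d_start, self.end + d_end)

    def export(self) -> str:
        """Serialize the ins and outs so another tool can reuse them."""
        return json.dumps({
            "url": self.url,
            "in": format_offset(self.start),
            "out": format_offset(self.end),
            "duration_seconds": self.duration,
        })


# Hypothetical URL, with the in and out times of the clip discussed earlier.
clip = Segment("http://example.com/talk.mp3",
               parse_offset("23:43"), parse_offset("24:58"))
print(clip.duration)          # 75
print(clip.nudge(d_start=-1).export())
```

The design point is that the in and out times live in a small, player-independent record. Once a player can produce such a record, turning it into a clip URL, a playlist entry, or a blog quotation is a job for other tools.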

Jon Udell is an author, information architect, software developer, and new media innovator.