Problems resulting from design of Ogg

[work in progress]

The sections and bullet points on this page are a dump of slides from a Shane Stephens FOMS 2008 talk. They reflect a number of 'problems' developers regularly bring up with Ogg, both legitimately and erroneously. Because these points are brought up regularly and often incorrectly used in support of other formats [such as Matroska and NUT] we've added them to the Wiki here along with a response/discussion of each claimed problem. For the most part, they reflect an inadequacy in existing software, an inadequacy in existing documentation, a misunderstanding of the Ogg encapsulation, or some combination of the three. Consider this Wiki page a first step toward rectifying a legitimate lack of documentation.

Seeking and Editing Problems

These mostly boil down to the lack of a reference library that handles seeking and stream tracking on behalf of the application developer. Libogg as it exists now is a very low-level library that provides only the rudimentary routines needed to produce valid stream building blocks (pages and packets). It does not provide any higher-level or automatic stream handling functionality, and as such, each application developer has had to reinvent the higher-level routines over and over again. This leads to the impression that Ogg is an overly complex and error-prone encapsulation, as 100 different applications will each have their own homegrown stream handling routines, each with their own bugs and unimplemented functions. The chaos that reigns in the Ogg application space is not an indictment of Ogg. It's due to a lack of stable, documented, high-level stream handling libraries from the beginning.

jagged edges [ie, coarse granularity]

In Ogg, a codec's frames (packets) are encapsulated into pages, singly or in groups. It is typical for an audio stream to store many packets on a single page and a video stream to store only one or possibly two packets on a page. The individual ['logical'] audio and video streams are multiplexed into a single 'physical' stream at the page level. This is done for two reasons: First, to make it easy to multiplex and demultiplex streams into new arrangements, and second to reduce the overhead of encapsulation. By grouping small packets into a larger page, the overhead of the page header is spread across the packets. A well-formed Ogg stream has a typical overhead of about 1%, regardless of the media types it encapsulates.
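As a rough check on the ~1% figure, the sketch below estimates header overhead for a single page. It assumes the Ogg page layout of a 27-byte fixed header plus one lacing byte per 255-byte segment of each packet, with at most 255 segments per page. The function names and packet sizes are illustrative only and are not part of libogg.

```python
FIXED_HEADER_BYTES = 27      # fixed part of an Ogg page header
MAX_SEGMENTS_PER_PAGE = 255  # the segment table has at most 255 entries


def lacing_count(packet_size):
    """Lacing values needed for one whole packet.

    Each value covers up to 255 bytes; a packet ends with a value below 255,
    so a packet whose size is a multiple of 255 needs one extra (zero) value.
    """
    return packet_size // 255 + 1


def page_overhead(packet_sizes):
    """Fraction of a page's bytes spent on the header (illustrative helper)."""
    segments = sum(lacing_count(size) for size in packet_sizes)
    if segments > MAX_SEGMENTS_PER_PAGE:
        raise ValueError("too many segments for a single page")
    header = FIXED_HEADER_BYTES + segments
    body = sum(packet_sizes)
    return header / (header + body)


# Assumed example sizes: 40 small audio packets, one video frame, one lone packet.
audio_page = [100] * 40
video_page = [4000]
lone_packet_page = [100]

for name, page in [("audio", audio_page), ("video", video_page), ("lone", lone_packet_page)]:
    print(f"{name}: {page_overhead(page):.2%}")
```

With these assumed sizes, both the 40-packet audio page and the single 4000-byte video page land between 1% and 2%, while a 100-byte packet alone on a page spends over 20% of its bytes on the header. That is why small packets are grouped.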

The 'jagged edges' complaint arises because default libogg1 behavior fills all pages to ~4kB regardless of stream bitrate. A given audio page might contain a full second of audio packets while a video page contains a single video frame. We would then see 20 or 30 video pages for each audio page.
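The arithmetic behind that ratio can be sketched as follows. The bitrates are assumptions chosen for illustration (32 kbit/s audio, 800 kbit/s video), and the 4096-byte page size is only an approximation of the default fill level described above.

```python
PAGE_TARGET_BYTES = 4096  # approximate default page fill size (assumption)


def seconds_per_page(bitrate_bps, page_bytes=PAGE_TARGET_BYTES):
    """Playback time covered by one page filled to page_bytes at a given bitrate."""
    return page_bytes * 8 / bitrate_bps


audio_seconds = seconds_per_page(32_000)   # assumed low-bitrate audio stream
video_seconds = seconds_per_page(800_000)  # assumed video stream
video_pages_per_audio_page = audio_seconds / video_seconds

print(f"audio page covers {audio_seconds:.3f} s")
print(f"video page covers {video_seconds:.3f} s")
print(f"video pages per audio page: {video_pages_per_audio_page:.0f}")
```

Because both streams fill pages to the same byte size, the ratio of pages is simply the ratio of bitrates.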

This arrangement is not incorrect; it is merely suboptimal when optimizing for minimal buffering and fast seeking. Packet- and sample-precision seeking takes longer and buffering overhead is higher. However, in a poorly written stream handler that makes invalid assumptions about Ogg streams for convenience, it can trigger bugs. Either the Ogg stream or Ogg itself is then blamed for allowing such 'stupid' streams. Note however that even such a 'broken' stream (it is not broken, it's simply suboptimal) can be repaginated into an optimal arrangement losslessly.

The root of the 'balancing' problem is a lack of functionality. By default, libogg always flushes pages at just over 4kB (rather than working by timestamp or some other better default). Changing this behavior requires manual intervention in stream building when it should be handled automatically by libogg. Chalk this one up to 'software flaw', not 'inadequacy in Ogg'.
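A minimal sketch of what a time-aware flush policy might look like is given below. It is a simulation, not libogg code: `paginate` and its parameters are hypothetical, packets are modelled as (bytes, milliseconds) pairs, and packets spanning pages and the 255-segment limit are ignored for brevity. With the real library, an application that wants this behavior has to force page boundaries itself, for example by calling `ogg_stream_flush()` at the chosen points.

```python
def paginate(packets, max_bytes=4096, max_ms=None):
    """Group (size_bytes, duration_ms) packets into pages.

    A page is closed when adding the next packet would exceed max_bytes or,
    if max_ms is given, would exceed max_ms of media time.
    Hypothetical helper: whole packets only, no continued packets.
    """
    pages, current, size, duration = [], [], 0, 0
    for pkt_bytes, pkt_ms in packets:
        too_big = size + pkt_bytes > max_bytes
        too_long = max_ms is not None and duration + pkt_ms > max_ms
        if current and (too_big or too_long):
            pages.append(current)
            current, size, duration = [], 0, 0
        current.append((pkt_bytes, pkt_ms))
        size += pkt_bytes
        duration += pkt_ms
    if current:
        pages.append(current)
    return pages


# Assumed input: two seconds of audio as 80-byte packets of 20 ms each.
audio_packets = [(80, 20)] * 100

by_size = paginate(audio_packets)              # size-only policy
by_time = paginate(audio_packets, max_ms=250)  # size limit plus a 250 ms cap

print(len(by_size), "pages when flushing by size only")
print(len(by_time), "pages when also capping page duration")
```

The size-only policy packs about a second of audio into each page, while the capped policy produces shorter pages that interleave more evenly with video at the cost of slightly more header overhead.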

wide variance in location of cotemporal data

This is a different way of describing the 'jagged edges' problem above. Because a suboptimally chosen interleave can have low-bitrate pages spaced far apart, the audio frames for a given point in time may be physically located well away from the time-matching video data. This point is addressed above.

impossible to reconstruct all granulepos values around holes

granulepos / timeval mapping inconsistencies

poorly sorted streams are rife

impossible to efficiently seek with noncontinuous data

Synchronization

no absolute clock (no presentation timestamps)

no way to correct for clock skew between audio/video encoding

Other Niggles

end-time ordering

except when we have non-continuous data

Ordering isn't only an issue for non-continuous data. In theory, an idiot can fit up to ~30 minutes of Speex audio (silence) in a single page (or 4 minutes of actual speech).

inefficient lacing values for video

ad-hoc granulepos retrofitting for video, CMML

seeking is hard

pages, and libogg's behaviour when creating them

What use are...

serial numbers?

packet numbers?

pages?

checksums?

Useful for audio (preventing ear damage), but could be optional for video
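For reference, the per-page checksum can be sketched in a few lines. This assumes the CRC described in the Ogg specification: generator polynomial 0x04c11db7, processed most-significant bit first, with a zero initial value and no final XOR, computed over the whole page with the checksum field zeroed. The function name is illustrative, and the bit-by-bit loop favors clarity over speed.

```python
OGG_CRC_POLY = 0x04C11DB7  # generator polynomial used for Ogg page checksums


def ogg_crc32(data):
    """Bitwise CRC-32 as assumed for Ogg pages (MSB first, init 0, no final XOR)."""
    crc = 0
    for byte in data:
        crc ^= byte << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ OGG_CRC_POLY) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
    return crc


# Illustrative payload, not a real Ogg page.
payload = bytes(range(64))
corrupted = bytearray(payload)
corrupted[10] ^= 0x01  # flip a single bit

print(hex(ogg_crc32(payload)))
print("corruption detected:", ogg_crc32(payload) != ogg_crc32(bytes(corrupted)))
```

A decoder that finds a mismatch can drop the page rather than feed damaged data to the codec, which is the 'ear damage' protection mentioned above.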

Cleaner Abstractions

We should not need to know the type of a stream if we are not decoding the stream

granulepos interpretations

headers

seeking

cutting

Skeleton goes some way towards fixing this

Libogg issues

Stupid decision for flushing pages

Makes it generally easy to build broken files.

Proposed solutions

Short-term workarounds (Ogg1-compatible)

Don't use partial packets unless absolutely necessary

If absolutely necessary, don't share the pages with other packets

Specify that pages should not contain more than X ms of data (let's say 250-500 ms)

Put Theora keyframes alone on their page??

A successor to Ogg

It should be called (Ogg2|Ogg3|Ogg++|OggNG|Ogh|Foo|Dumplings|AdvancedOgg or AOgg or Ogg+A|ggo|SOgg)

The design should be done from desired capabilities and desired properties

These capabilities and properties should come from AV experts, web-page designers, system administrators, and users

Desired Capabilities

Simple seeking

Cleanly cuttable

Robust to errors

Composable

Supports arbitrary stream types

Low bit cost

Streamable

Easy to chunk

Low decode cost

Supports multiple streams of each type

Untied We Stand

Can cotemporal data be colocated?

streams & bundles

great for cutting

OK for demultiplexing

“should” cut down on bit overhead

hugely simplifies seeking

Gimme a Hint

Can we add seeking hints to the stream?

these can be tiny and infrequent

awesome for standalone files

what do we do when streaming?

hint correction packets?

is this turtles all the way down?

Would an up-front index be better?

Rebuttal

Devil's Advocate

These problems aren't insurmountable

but we're only finding some of them now, and we've been working around others for years

Nobody will adopt another container format

Nobody cares about <insert hated feature here> anyway

Even if we have Ogg2, we'll still be stuck having to support Ogg1 and broken files