'''Ogg Skeleton''' provides structuring information for multitrack [[Ogg]] files. It is compatible with Ogg [[Theora]] and provides extra clues for synchronization and content negotiation such as language selection. Skeleton version 4.0 also provides keyframes indexes to enable optimal seeking over high-latency connections, such as the internet.

+

'''Ogg Skeleton''' provides structuring information for multitrack [[Ogg]] files. It is compatible with Ogg [[Theora]] and provides extra clues for synchronization and content negotiation such as language selection. Skeleton version 4.0 also provides keyframe indexes to enable optimal seeking over high-latency connections, such as the internet.

−

Ogg is a generic container format for time-continuous data streams, enabling interleaving of several tracks of frame-wise encoded content in a time-multiplexed manner. As an example, an Ogg physical bitstream could encapsulate several tracks of video encoded in Theora and multiple tracks of audio encoded in Speex or Vorbis or FLAC at the same time. A player that decodes such a bitstream could then, for example, play one video channel as the main video playback, alpha-blend another one on top of it (e.g. a caption track), play a main Vorbis audio together with several FLAC audio tracks simultaneously (e.g. as sound effects), and provide a choice of Speex channels (e.g. providing commentary in different languages). Such a file is generally possible to create with Ogg, it is however not possible to generically parse such a file, seek on it, understand what codecs are contained in such a file, and dynamically handle and play back such content.

+

Ogg is a generic container format, enabling interleaving of several tracks of frame-wise encoded content in a time-multiplexed manner. As an example, an Ogg physical bitstream could encapsulate several tracks of video encoded in Theora and multiple tracks of audio encoded in Speex or Vorbis or FLAC at the same time. A player that decodes such a bitstream could then, for example, play one video channel as the main video playback, alpha-blend another one on top of it (e.g. a caption track), play a main Vorbis audio together with several FLAC audio tracks simultaneously (e.g. as sound effects), and provide a choice of Speex channels (e.g. providing commentary in different languages). Such a file is generally possible to create with Ogg, it is however not possible to generically parse such a file, seek on it, understand what codecs are contained in such a file, and dynamically handle and play back such content.

Ogg does not know anything about the content it carries and leaves it to the media mapping of each codec to declare and describe itself. There is no meta information available at the Ogg level about the content tracks encapsulated within an Ogg physical bitstream. This is particularly a problem if you don't have all the decoder libraries available and just want to parse an Ogg file to find out what type of data it encapsulates (such as the "file" command under *nix to determine what file it is through magic numbers), or want to seek to a temporal offset without having to decode the data (such as on a Web server that just serves out Ogg files and parts thereof).

Ogg does not know anything about the content it carries and leaves it to the media mapping of each codec to declare and describe itself. There is no meta information available at the Ogg level about the content tracks encapsulated within an Ogg physical bitstream. This is particularly a problem if you don't have all the decoder libraries available and just want to parse an Ogg file to find out what type of data it encapsulates (such as the "file" command under *nix to determine what file it is through magic numbers), or want to seek to a temporal offset without having to decode the data (such as on a Web server that just serves out Ogg files and parts thereof).

−

Ogg Skeleton is being designed to overcome these problems. Ogg Skeleton is a logical bitstream within an Ogg stream that contains information about the other encapsulated logical bitstreams. For each logical bitstream it provides information such as its media type, and explains the way the granulepos field in Ogg pages is mapped to time.

+

Ogg Skeleton is designed to overcome these problems. Ogg Skeleton is a logical bitstream within an Ogg stream that contains information about the other encapsulated logical bitstreams. For each logical bitstream it provides information such as its media type, and explains the way the granulepos field in Ogg pages is mapped to time.

+

+

Seeking in an Ogg file is typically implemented as a bisection search for the seek target timestamp. However when seeking over a high latency connection, such as the internet, such searches can be slow. Some bitstreams, notably Theora, have keyframes, and so in order to seek to a given temporal offset in a Theora stream, you must first perform a bisection search to find the target Theora frame, determine its keyframe, and then perform another bisection search to locate that keyframe and decode forwards to the temoporal offset. This can be very slow. The Ogg Skeleton 4.0 provides an index of keyframes, and indexes periodic samples on streams without the concept of a keyframe, so that seeking over high-latency connections can simply be performed optimally with "one hop".

Ogg Skeleton is also designed to allow the creation of substreams from Ogg physical bitstreams that retain the original timing information. For example, when cutting out the segment between the 7th and the 59th second of an Ogg file, it would be nice to continue to start this cut out file with a playback time of 7 seconds and not of 0. This is of particular interest if you're streaming this file from a Web server after a query for a temporal subpart such as in http://example.com/video.ogv?t=7-59 .

Ogg Skeleton is also designed to allow the creation of substreams from Ogg physical bitstreams that retain the original timing information. For example, when cutting out the segment between the 7th and the 59th second of an Ogg file, it would be nice to continue to start this cut out file with a playback time of 7 seconds and not of 0. This is of particular interest if you're streaming this file from a Web server after a query for a temporal subpart such as in http://example.com/video.ogv?t=7-59 .

Line 52:

Line 54:

Because all the Skeleton track's index packets appear in the header pages of the Ogg segment, all the keyframe indexes are immediately available once the header packets have been read when playing the media over a network connection.

Because all the Skeleton track's index packets appear in the header pages of the Ogg segment, all the keyframe indexes are immediately available once the header packets have been read when playing the media over a network connection.

−

For every content stream in an Ogg segment, the Ogg index bitstream provides seek algorithms with an ordered table of "key points". A key point is intrinsically associated with exactly one stream, and stores the offset, o, of the last page which lies before all data required to decode the keyframe, as well as the presentation time of the keyframe t, as a fraction of seconds.

+

For every content stream in an Ogg segment, the Skeleton provides seek algorithms with an index, or ordered table of "key points". A key point is intrinsically associated with exactly one stream, and stores the offset, o, of the last page which lies before all data required to decode the keyframe, as well as the presentation time of the keyframe t, as a fraction of seconds.

−

The offset is relative from the beginning of the Ogg segment, and is exactly the first byte of the a page in the indexed stream, so if you seek to a keypoint's offset and don't find the beginning of a page there, or you find a page from another stream, you can assume that the Ogg segment has been modified since the index was constructed, and the index can be considered invalid. The time t is the keyframe's presentation time corresponding to the granulepos, and is represented as a fraction in seconds. Note that if a stream requires any preroll, this will be accounted for in the time stored in the keypoint.

+

The offset is relative from the beginning of the Ogg segment, and is exactly the first byte of a page in the indexed stream, so if you seek to a keypoint's offset and don't find the beginning of a page there, or you find a page from another stream, you can assume that the Ogg segment has been modified since the index was constructed, and the index can be considered invalid. The time t is the keyframe's presentation time corresponding to the granulepos, and is represented as a fraction in seconds. Note that if a stream requires any preroll, this will be accounted for in the time stored in the keypoint.

The Skeleton 4.0 track contains one index for each content stream in the file. To seek in an Ogg file which contains keyframe indexes, first construct the set which contains every active streams' last keypoint which has time less than or equal to the seek target time. This tells you a known point on every stream which lies before the seek target. Then from that set of key points, select the key point with the smallest byte offset. You then verify that there's a page from the keypoint's stream found at exactly that offset, and if so, you can begin decoding. You are guaranteed to pass keyframes on all streams with time less than or equal to your seek target time while decoding up to the seek target. However if you don't encounter a keyframe with the same presentation time as is stored in the keypoint, then the index is invalid (possibly the file has been changed without updating the index) and you must either fallback to a bisection search, or keep decoding if you've landed "close enough" to the seek target.

The Skeleton 4.0 track contains one index for each content stream in the file. To seek in an Ogg file which contains keyframe indexes, first construct the set which contains every active streams' last keypoint which has time less than or equal to the seek target time. This tells you a known point on every stream which lies before the seek target. Then from that set of key points, select the key point with the smallest byte offset. You then verify that there's a page from the keypoint's stream found at exactly that offset, and if so, you can begin decoding. You are guaranteed to pass keyframes on all streams with time less than or equal to your seek target time while decoding up to the seek target. However if you don't encounter a keyframe with the same presentation time as is stored in the keypoint, then the index is invalid (possibly the file has been changed without updating the index) and you must either fallback to a bisection search, or keep decoding if you've landed "close enough" to the seek target.

Line 77:

Line 79:

−

=== Ogg Skeleton version 4.1 Format Specification ===

+

=== Ogg Skeleton version 4.0 Format Specification ===

Adding the above information into an Ogg bitstream without breaking existing Ogg functionality and code requires the use of a logical bitstream for Ogg Skeleton. This logical bitstream may be ignored on decoding such that existing players can still continue to play back Ogg files that have a Skeleton bitstream. Skeleton enriches the Ogg bitstream to provide meta information about structure and content of the Ogg bitstream.

Adding the above information into an Ogg bitstream without breaking existing Ogg functionality and code requires the use of a logical bitstream for Ogg Skeleton. This logical bitstream may be ignored on decoding such that existing players can still continue to play back Ogg files that have a Skeleton bitstream. Skeleton enriches the Ogg bitstream to provide meta information about structure and content of the Ogg bitstream.

Line 131:

Line 133:

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

−

The version fields provide version information for the Skeleton track, currently being 4.1 (the number having evolved within the Annodex project).

+

The version fields provide version information for the Skeleton track, currently being 4.0 (the number having evolved within the Annodex project).

Presentation time and basetime are specified as a rational number, the denominator providing the temporal resolution at which the time is given (e.g. to specify time in milliseconds, provide a denominator of 1000).

Presentation time and basetime are specified as a rational number, the denominator providing the temporal resolution at which the time is given (e.g. to specify time in milliseconds, provide a denominator of 1000).

Line 171:

Line 173:

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

−

The mime type is provided as a message header field specified in the same way that HTTP header fields are given (e.g. "Content-Type: audio/vorbis", terminated/delimited by "\r\n"). Further meta information (such as language and screen size) are also included as message header fields. The offset to the message header fields at the beginning of a fisbone packet is included for forward compatibility - to allow further fields to be included into the packet without disrupting the message header field parsing.

+

The mime type is provided as a message header field specified in the same way that HTTP header fields are given, e.g. "Content-Type: audio/vorbis". Message header fields are terminated/delimited by "\r\n". Further meta information (such as language and screen size) are also included as message header fields. The offset to the message header fields at the beginning of a fisbone packet is included for forward compatibility - to allow further fields to be included into the packet without disrupting the message header field parsing.

The granule rate is again given as a rational number in the same way that presentation time and basetime were provided above.

The granule rate is again given as a rational number in the same way that presentation time and basetime were provided above.

−

The following message headers are compulsory in Skeleton 4.1:

+

The following message headers are compulsory in Skeleton 4.0:

−

* Content-type: mime-type of the content encoded in this stream, e.g. audio/vorbis, video/theora, etc. The mime types in use here are listed at http://wiki.xiph.org/MIME_Types_and_File_Extensions#Codec_MIME_types.

+

* Content-type: mime type of the content encoded in this stream, e.g. audio/vorbis, video/theora, etc. The mime types in use here are listed at http://wiki.xiph.org/MIME_Types_and_File_Extensions#Codec_MIME_types.

* Role: describes the function of this track. Common examples are "video/main", "audio/main", "text/caption". For a complete list of possibilities, see http://wiki.xiph.org/SkeletonHeaders#Role.

* Role: describes the function of this track. Common examples are "video/main", "audio/main", "text/caption". For a complete list of possibilities, see http://wiki.xiph.org/SkeletonHeaders#Role.

* Name: a unique free text string which can be used to directly address the track in scripting applications, such as an HTML5 viewer.

* Name: a unique free text string which can be used to directly address the track in scripting applications, such as an HTML5 viewer.

Line 181:

Line 183:

For more message headers, see [[SkeletonHeaders]].

For more message headers, see [[SkeletonHeaders]].

−

Before the Skeleton EOS page in the segment header pages come the Skeleton 4.0 keyframe index packets. There should be one index packet foreach content track in the Ogg segment, but index packets are not required for a Skeleton 4.0 track to be considered valid. Each keypoint in the index is stored in a "keypoint", which in turn stores an offset, and timestamp. In order to save space, the offsets and timestamps are stored as deltas, and then variable byte-encoded. The offset and timestamp deltas store the difference between the keypoint's offset and timestamp from the previous keypoint's offset and timestamp. So to calculate the page offset of a keypoint you must sum the offset deltas of up to and including the keypoint in the index.

+

Before the Skeleton EOS page in the segment header pages come the Skeleton 4.0 keyframe index packets. There should be one index packet foreach content track in the Ogg segment, but index packets are not required for a Skeleton 4.0 track to be considered valid. Each keyframe in the index is stored in a "keypoint", which in turn stores an offset, and timestamp. In order to save space, the offsets and timestamps are stored as deltas, and then variable byte-encoded. The offset and timestamp deltas store the difference between the keypoint's offset and timestamp from the previous keypoint's offset and timestamp. So to calculate the page offset of a keypoint you must sum the offset deltas of up to and including the keypoint in the index.

The variable byte encoded integers are encoded using 7 bits per byte to store the integer's bits, and the high bit is set in the last byte used to encode the integer. The bits and bytes are in little endian byte order. For example, the integer 7843, or 0001 1110 1010 0011 in binary, would be stored as two bytes: 0xBD 0x23, or 1011 1101 0010 0011 in binary.

The variable byte encoded integers are encoded using 7 bits per byte to store the integer's bits, and the high bit is set in the last byte used to encode the integer. The bits and bytes are in little endian byte order. For example, the integer 7843, or 0001 1110 1010 0011 in binary, would be stored as two bytes: 0xBD 0x23, or 1011 1101 0010 0011 in binary.

Line 192:

Line 194:

| Identifier 'index\0' | 0-3

| Identifier 'index\0' | 0-3

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

−

| ...continued |Serial number | 4-7

+

| ... |Serial number | 4-7

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

−

| ...continued |Number of keypoints | 8-11

+

| ... |Number of keypoints | 8-11

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

−

| ...continued | 12-15

+

| ... | 12-15

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

−

| ...continued | Timestamp denominator | 16-19

+

| ... | Timestamp denominator | 16-19

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

−

| ...continued | 20-23

+

| ... | 20-23

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

−

| ...continued | First sample time numerator | 24-27

+

| ... | First sample time numerator | 24-27

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

−

| ...continued | 28-31

+

| ... | 28-31

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

−

| ...continued | Last sample end time numerator| 32-35

+

| ... | Last sample end time numerator| 32-35

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

−

| ...continued | 36-39

+

| ... | 36-39

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

−

| ...continued |Keypoints... | 40-43

+

| ... |Keypoints... | 40-43

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Line 239:

Line 241:

* the Skeleton bos page is the very first bos page in the Ogg stream such that it can be identified straight away and decoders don't get confused about it being e.g. Ogg Vorbis without this meta information

* the Skeleton bos page is the very first bos page in the Ogg stream such that it can be identified straight away and decoders don't get confused about it being e.g. Ogg Vorbis without this meta information

* the bos pages of all the other logical bistreams come next (a requirement of Ogg)

* the bos pages of all the other logical bistreams come next (a requirement of Ogg)

Revision as of 15:11, 9 May 2010

The following is a draft!

It is at best incomplete and at worst completely broken. In any case, it is not an “official” Xiph spec or codec, so use with care.

Ogg Skeleton provides structuring information for multitrack Ogg files. It is compatible with Ogg Theora and provides extra clues for synchronization and content negotiation such as language selection. Skeleton version 4.0 also provides keyframe indexes to enable optimal seeking over high-latency connections, such as the internet.

Ogg is a generic container format, enabling interleaving of several tracks of frame-wise encoded content in a time-multiplexed manner. As an example, an Ogg physical bitstream could encapsulate several tracks of video encoded in Theora and multiple tracks of audio encoded in Speex or Vorbis or FLAC at the same time. A player that decodes such a bitstream could then, for example, play one video channel as the main video playback, alpha-blend another one on top of it (e.g. a caption track), play a main Vorbis audio together with several FLAC audio tracks simultaneously (e.g. as sound effects), and provide a choice of Speex channels (e.g. providing commentary in different languages). Such a file is generally possible to create with Ogg, it is however not possible to generically parse such a file, seek on it, understand what codecs are contained in such a file, and dynamically handle and play back such content.

Ogg does not know anything about the content it carries and leaves it to the media mapping of each codec to declare and describe itself. There is no meta information available at the Ogg level about the content tracks encapsulated within an Ogg physical bitstream. This is particularly a problem if you don't have all the decoder libraries available and just want to parse an Ogg file to find out what type of data it encapsulates (such as the "file" command under *nix to determine what file it is through magic numbers), or want to seek to a temporal offset without having to decode the data (such as on a Web server that just serves out Ogg files and parts thereof).

Ogg Skeleton is designed to overcome these problems. Ogg Skeleton is a logical bitstream within an Ogg stream that contains information about the other encapsulated logical bitstreams. For each logical bitstream it provides information such as its media type, and explains the way the granulepos field in Ogg pages is mapped to time.

Seeking in an Ogg file is typically implemented as a bisection search for the seek target timestamp. However when seeking over a high latency connection, such as the internet, such searches can be slow. Some bitstreams, notably Theora, have keyframes, and so in order to seek to a given temporal offset in a Theora stream, you must first perform a bisection search to find the target Theora frame, determine its keyframe, and then perform another bisection search to locate that keyframe and decode forwards to the temoporal offset. This can be very slow. The Ogg Skeleton 4.0 provides an index of keyframes, and indexes periodic samples on streams without the concept of a keyframe, so that seeking over high-latency connections can simply be performed optimally with "one hop".

Ogg Skeleton is also designed to allow the creation of substreams from Ogg physical bitstreams that retain the original timing information. For example, when cutting out the segment between the 7th and the 59th second of an Ogg file, it would be nice to continue to start this cut out file with a playback time of 7 seconds and not of 0. This is of particular interest if you're streaming this file from a Web server after a query for a temporal subpart such as in http://example.com/video.ogv?t=7-59 .

How to describe the logical bitstreams within an Ogg container?

The following information about a logical bitstream is of interest to contain as meta information in the Skeleton:

the serial number: it identifies a content track

the mime type: it identifies the content type

other generic name-value fields that can provide meta information such as the language of a track or the video height and width

the number of header packets: this informs a parser about the number of actual header packets in an Ogg logical bitstream

the granule rate: the granule rate represents the data rate in Hz at which content is sampled for the particular logical bitstream. Note that when using this to interpret timestamps, the granulepos of a data page must first be parsed to extract a granule value using the method described in GranulePosAndSeeking. This value can then be mapped to time by calculating "granules / granulerate".

the preroll: the number of past content packets to take into account when decoding the current Ogg page, which is necessary for seeking (vorbis has generally 2, speex 3)

the granuleshift: the number of lower bits from the granulepos field that are used to provide position information for sub-seekable units (like the keyframe shift in theora)

a basetime: it provides a mapping for granule position 0 (for all logical bitstreams) to a playback time; an example use: most content in professional analog video creation actually starts at a time of 1 hour and thus adding this additional field allows them retain this mapping on digitizing their content

a UTC time: it provides a mapping for granule position 0 (for all logical bitstreams) to a real-world clock time allowing to remember e.g. the recording or broadcast time of some content

the granulepos radix, used during complex granulepos-to-time conversions, particuarly in streams such as Dirac.

predelay: the delay of the presentation time behind the decode time. Used in discontinuous streams such as Dirac.

How to allow the creation of substreams from an Ogg physical bitstream?

When cutting out a subpart of an Ogg physical bitstream, the aim is to keep all the content pages intact (including the framing and granule positions) and just change some information in the Skeleton that allows reconstruction of the accurate time mapping. When remultiplexing such a bitstream, it is necessary to take into account all the different contained logical bitstreams. A given cut-in time maps to several different byte positions in the Ogg physical bitstream because each logical bitstream has its relevant information for that time at a different location. In addition, the resolution of each logical bitstream may not be high enough to accommodate for the given cut-in time and thus there may be some surplus information necessary to be remuxed into the new bitstream.

The following information is necessary to be added to the Skeleton to allow a correct presentation of a subpart of an Ogg bitstream:

the presentation time: this is the actual cut-in time and all logical bitstreams are meant to start presenting from this time onwards, not from the time their data starts, which may be some time before that (because this time may have mapped right into the middle of a packet, or because the logical bitstream has a preroll or a keyframe shift)

the basegranule: this represents the granule number with which this logical bitstream starts in the remuxed stream and provides for each logical bitstream the accurate start time of its data stream; this information is necessary to allow correct decoding and timing of the first data packets contained in a logcial bitstream of a remuxed Ogg stream

Keyframe indexes for faster seeking

Seeking in an Ogg file is typically implemented as a bisection search over the pages in the file. The bisection method above works fine for seeking in local files, but for seeking in files served over the Internet via HTTP, each bisection or non sequential read can trigger a new HTTP request, which can have very high latency, making seeking very slow. Seeking is further complicated by the fact that packets often span multiple
Ogg pages, and that Ogg pages from different streams can be interleaved
between spanning packets.

Each content track has a separate index, which is stored in its own packet in the Skeleton 4.0 track. The index for streams without the concept of a keyframe, such as Vorbis streams, can instead record the time position at periodic intervals, which achieves the same result. When this document refers to keyframes, it also implicitly refers to these independent periodic samples from keyframe-less streams.

Because all the Skeleton track's index packets appear in the header pages of the Ogg segment, all the keyframe indexes are immediately available once the header packets have been read when playing the media over a network connection.

For every content stream in an Ogg segment, the Skeleton provides seek algorithms with an index, or ordered table of "key points". A key point is intrinsically associated with exactly one stream, and stores the offset, o, of the last page which lies before all data required to decode the keyframe, as well as the presentation time of the keyframe t, as a fraction of seconds.

The offset is relative from the beginning of the Ogg segment, and is exactly the first byte of a page in the indexed stream, so if you seek to a keypoint's offset and don't find the beginning of a page there, or you find a page from another stream, you can assume that the Ogg segment has been modified since the index was constructed, and the index can be considered invalid. The time t is the keyframe's presentation time corresponding to the granulepos, and is represented as a fraction in seconds. Note that if a stream requires any preroll, this will be accounted for in the time stored in the keypoint.

The Skeleton 4.0 track contains one index for each content stream in the file. To seek in an Ogg file which contains keyframe indexes, first construct the set which contains every active streams' last keypoint which has time less than or equal to the seek target time. This tells you a known point on every stream which lies before the seek target. Then from that set of key points, select the key point with the smallest byte offset. You then verify that there's a page from the keypoint's stream found at exactly that offset, and if so, you can begin decoding. You are guaranteed to pass keyframes on all streams with time less than or equal to your seek target time while decoding up to the seek target. However if you don't encounter a keyframe with the same presentation time as is stored in the keypoint, then the index is invalid (possibly the file has been changed without updating the index) and you must either fallback to a bisection search, or keep decoding if you've landed "close enough" to the seek target.

Be aware that you cannot assume that any or all Ogg files will contain keyframe indexes, so when implementing Ogg seeking, you must gracefully fall-back to a bisection search or other seek algorithm when the index is not present, or when it is invalid.

The Skeleton 4.0 index packets also stores meta data about the segment in which it resides. It stores the timestamps of the first and last samples in its track. This also allows you to determine the duration of the indexed Ogg media without having to decode the start and end of the Ogg segment to calculate the difference (which is the duration). With the index packets storing the start and end times of every track, you can calculate the duration as the end time of the last active stream minus the start time of first active stream.

The Skeleton 4.0 BOS packet contains the length of the indexed segment in bytes. This is so that if the seek target is outside of the indexed range, you can immediately move to the next/previous segment and either seek using that segment's index, or narrow the bisection window if that segment has no index. You can also use the segement length to verify if the index is valid. If the contents of the segment have changed, it's highly likely that the length of the segment has changed as well. When you load the segment's header pages, you should check the length of the physical segment, and if it doesn't match the length stored in the Skeleton header packet, you know that either the index is out of date, or the file has been chained since indexing.

The Skeleton 4.0 BOS packet also contains the offset of the first non header page in the Ogg segment. This means that if you wish to delay loading of an index for whatever reason, you can skip forward to that offset, and start decoding from that offset forwards.

When using the index to seek, you must verify that the index is still correct. You can consider the index invalid if any of the following are true:

The segment doesn't end at the segment length offset stored in the Skeleton BOS packet (note that a new "link" in a "chain" can start at the end of the segment), or

after a seek to a keypoint's offset, you don't land exactly on a page boundary, or

after a seek to a keypoint's offset, you don't land on a page which belongs to that keypoint's stream.

While loading the Skeleton BOS header, you should always check the Skeleton version field to ensure your decoder correctly knows how to parse the Skeleton track.

Be aware that a keyframe index may not index all keyframes in the Ogg segment, it may only index periodic keyframes instead.

Ogg Skeleton version 4.0 Format Specification

Adding the above information into an Ogg bitstream without breaking existing Ogg functionality and code requires the use of a logical bitstream for Ogg Skeleton. This logical bitstream may be ignored on decoding such that existing players can still continue to play back Ogg files that have a Skeleton bitstream. Skeleton enriches the Ogg bitstream to provide meta information about structure and content of the Ogg bitstream.

The Skeleton logical bitstream starts with an ident header that contains information about all of the logical bitstreams and is mapped into the Skeleton bos page.
The first 8 bytes provide the magic identifier "fishead\0".
After the fishead follows a set of secondary header packets, each of which contains information about one logical bitstream. These secondary header packets are identified by an 8 byte code of "fisbone\0". The Skeleton logical bitstream has no actual content packets. Its eos page is included into the stream before any data pages of the other logical bitstreams appear and contains a packet of length 0.

The version fields provide version information for the Skeleton track, currently being 4.0 (the number having evolved within the Annodex project).
Presentation time and basetime are specified as a rational number, the denominator providing the temporal resolution at which the time is given (e.g. to specify time in milliseconds, provide a denominator of 1000).

The mime type is provided as a message header field specified in the same way that HTTP header fields are given, e.g. "Content-Type: audio/vorbis". Message header fields are terminated/delimited by "\r\n". Further meta information (such as language and screen size) are also included as message header fields. The offset to the message header fields at the beginning of a fisbone packet is included for forward compatibility - to allow further fields to be included into the packet without disrupting the message header field parsing.
The granule rate is again given as a rational number in the same way that presentation time and basetime were provided above.

Before the Skeleton EOS page in the segment header pages come the Skeleton 4.0 keyframe index packets. There should be one index packet foreach content track in the Ogg segment, but index packets are not required for a Skeleton 4.0 track to be considered valid. Each keyframe in the index is stored in a "keypoint", which in turn stores an offset, and timestamp. In order to save space, the offsets and timestamps are stored as deltas, and then variable byte-encoded. The offset and timestamp deltas store the difference between the keypoint's offset and timestamp from the previous keypoint's offset and timestamp. So to calculate the page offset of a keypoint you must sum the offset deltas of up to and including the keypoint in the index.

The variable byte encoded integers are encoded using 7 bits per byte to store the integer's bits, and the high bit is set in the last byte used to encode the integer. The bits and bytes are in little endian byte order. For example, the integer 7843, or 0001 1110 1010 0011 in binary, would be stored as two bytes: 0xBD 0x23, or 1011 1101 0010 0011 in binary.

The serialno of the stream this index applies to, as a 4 byte field. Bytes [6...9]

The number of keypoints in this index packet, 'n' as a 8 byte unsigned integer. This can be 0. Bytes [10...17].

The presentation time denominator for this stream, as an 8 byte signed integer. All timestamps, including keypoint timestamps, first and last sample timestamps are fractions of seconds over this denominator. This must not be 0. Bytes [18...25].

First-sample-time numerator: 8 byte signed integer representing the numerator for the presentation time of the first sample in the track. Bytes [26...33]

Last-sample-time numerator: 8 byte signed integer representing the end time of the last sample in the track. Bytes [34...41]

'n' key points, starting with the first keypoint at byte 42. Each keypoint contains, in the following order:

the keyframe's page's byte offset delta, as a variable byte encoded integer. This is the number of bytes that this keypoint is after the preceeding keypoint's offset, or from the start of the segment if this is the first keypoint. The keypoint's page start is therefore the sum of the byte-offset-deltas of all the keypoints which come before it.

the presentation time numerator delta, of the first key frame which starts on the page at the keypoint's offset, as a variable byte encoded integer. This is the difference from the previous keypoint's timestamp numerator. The keypoint's timestamp numerator is therefore the sum of all the timestamp numerator deltas up to and including the keypoint's. Divide the timestamp numerator sum by the timestamp denominator stored earlier in the index packet to determine the presentation time of the keyframe in seconds.

The key points are stored in increasing order by offset (and thus by presentation time as well).

The byte offsets stored in keypoints are relative to the start of the Ogg bitstream segment. So if you have a physical Ogg bitstream made up of two chained Oggs, the offsets in the second Ogg segment's bitstream's index are relative to the beginning of the second Ogg in the chain, not the first. Also note that if a physical Ogg bitstream is made up of chained Oggs, the presence of an index in one segment does not imply that there will be an index in any other segment.

The first-sample-time and last-sample-time are rational numbers, in units of seconds. If the denominator is 0 for the first-sample-time or the last-sample-time, then that value was unable to be determined at indexing time, and is unknown.

The exact number of keyframes used to construct key points in the index is up to the indexer, but to limit the index size, we recommend including at most one key point per every 64KB of data, or every 2000ms, whichever is least frequent.

Further restrictions

A further restriction on how to encapsulate Skeleton into Ogg is proposed to allow for easier parsing:

there can only be one Skeleton logical bitstream in a Ogg bitstream.

the Skeleton bos page is the very first bos page in the Ogg stream such that it can be identified straight away and decoders don't get confused about it being e.g. Ogg Vorbis without this meta information

the bos pages of all the other logical bistreams come next (a requirement of Ogg)

the secondary header pages of all logical bitstreams come next, including Skeleton's secondary header packets (the fisbone and index packets)

the Skeleton eos page end the control section of the Ogg stream before any content pages of any of the other logical bitstreams appear.