Disclaimer

This is not a Xiph codec, but I was asked to post information
about Ogg/Kate on this wiki. As such, please do not assume that Xiph has anything
to do with this, much less bears any responsibility for it.

What is Kate?

Kate is a codec for karaoke and text encapsulation for Ogg. Text and images can be
carried by a Kate stream, and animated. Most of the time, this would be multiplexed
with audio/video to carry subtitles, song lyrics (with or without karaoke data), etc,
but doesn't have to be. A possible use of a lone Kate stream would be an e-book.
Moreover, the motion feature gives Kate a powerful means to describe arbitrary curves, so
hand drawing of shapes can be achieved. This was originally meant for karaoke use, but
can be used for any purpose. Motions can be attached to various semantics, like position,
color, etc, so scrolling or fading text can be defined.

Why a new codec?

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to
the headers, one can't add them in the stream as they are sung, so another multiplexed stream
would be needed to carry them.

The three possible bases I found usable for such a codec were Writ, CMML, and OGM/SRT.

Writ is an unmaintained start at an implementation of a very basic design. I did later find an encoder/decoder in py-ogg2, but it would have been quicker to write Kate from scratch anyway.

CMML is more geared towards encapsulating metadata about an accompanying stream than being a data stream itself, and seemed complex for a simple use, though I have since revised my view on this. Besides, it seems designed for Annodex (which I haven't looked at), though it does seem relatively generic for use outside Annodex. It is now being "repurposed" as timed text, bringing it closer to what I'm doing.

OGM/SRT, which I only found when I added Kate support to MPlayer, shoehorns various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, as far as I can see (though I haven't looked at this one in detail, since I already had a working Kate implementation by that time).

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats,
but none were designed for embedding inside an Ogg container.

Overview of the Kate bitstream format

I've taken much inspiration from Vorbis and Theora here.
Headers and packets (as well as the API design) follow the design of these two codecs.

The granule encoding is not a direct time/granule correspondence; see the granule encoding section.

The EOS packet should have a granule pos greater than or equal to the end time of all events.

User code doesn't have to know the number of headers to expect; this knowledge is moved inside the library code (as opposed to Vorbis and Theora).

The format contains hooks so that additional information may be added in future revisions while keeping backward compatibility (old decoders will correctly parse, but ignore, the new information).

Format specification

The Kate bitstream format consists of a number of sequential packets.
Packets can be either header packets or data packets. All header packets
must appear before any data packet.

Header packets must appear in order. Decoding of a data packet is not
possible until all header packets have been decoded.

Each Kate packet starts with a one byte type. A type with the MSB set
(ie, between 0x80 and 0xff) indicates a header packet, while a type with
the MSB cleared (ie, between 0x00 and 0x7f) indicates a data packet.
All header packets then have the Kate magic, from byte offset 1 to byte
offset 7 ("kate\0\0\0").

Since the ID header must appear first, a Kate stream can be recognized
by comparing the first eight bytes of the first packet with the signature
string "\200kate\0\0\0".
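The classification rule and the BOS signature check above can be sketched as follows (a minimal illustration, not libkate code; the function names are made up):

```python
KATE_BOS_SIGNATURE = b"\x80kate\x00\x00\x00"  # type byte 0x80 + "kate" magic + three NULs

def is_kate_bos(packet: bytes) -> bool:
    # A Kate stream is recognized by the first eight bytes of its first packet
    return packet[:8] == KATE_BOS_SIGNATURE

def packet_class(packet: bytes) -> str:
    # MSB set (0x80-0xff) means header packet, MSB cleared (0x00-0x7f) means data
    return "header" if packet[0] & 0x80 else "data"
```
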

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80,
the id header packet) must be placed on a separate page. The corresponding Ogg
packet must be marked as beginning of stream (BOS). All subsequent header packets
must be on one or more pages. Subsequently, each data packet must be on a separate
page.

The last data packet must be the end of stream packet (packet type 0x7f).

When embedded in Ogg, the corresponding Ogg packet must be marked as end of stream (EOS).

As per the Ogg specification, granule positions must be non decreasing
within the stream. Header packets have granule position 0.

Currently existing packet types are:

headers:

0x80 ID header (BOS)

0x81 Vorbis comment header

0x82 regions list header

0x83 styles list header

0x84 curves list header

0x85 motions list header

0x86 palettes list header

0x87 bitmaps list header

0x88 font ranges and mappings header

data:

0x00 text data (including optional motions and overrides)

0x01 keepalive

0x7f end packet (EOS)

The format described here is for bitstream version 0.x.

Following is the definition of the ID header (packet type 0x80).
This works out to a 64 byte ID header. This is the header that should be
used to detect a Kate stream within an Ogg stream.

language and category are NUL-terminated ASCII strings.
Language follows RFC 3066, though it obviously will not accommodate language tags
with many subtags.

Category is currently loosely defined, and I haven't yet found a nice way to
present it in a generic way, but it is meant for automatic classifying of
various multiplexed Kate streams (eg, to recognize that some streams are
subtitles (in a set of languages), and some others are commentary (in a
possibly different set of languages), etc).

API overview

libkate offers an API very similar to that of libvorbis and libtheora, as well as
an extra higher level decoding API.

Here's an overview of the three main modules:

Decoding

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a
kate_comment structure. Then, read headers by calling kate_decode_headerin. Once
all headers have been read, a kate_state is initialized for decoding using kate_decode_init,
and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be
retrieved via kate_decode_eventout.

Encoding

Encoding is also done in a way similar to libvorbis. First initialize a kate_info
and a kate_comment structure, and fill them out as needed. kate_encode_headers will
create ogg packets from those. Then, kate_encode_text is called repeatedly for all
the text events to add. When done, calling kate_encode_finish will create an end of
stream packet.

High level decoding API

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right
thing (header/data classification, decoding, and event retrieval). Note that you
do not get access to the comments directly using this, but you do get access to the
kate_info via events.

The libkate distribution includes commented examples for each of those.

Additionally, libkate includes a layer (liboggkate) to make it easier to use when
embedded in Ogg. While the normal API uses kate_packet structures, liboggkate uses
ogg_packet structures.

The High level decoding API does not have an Ogg specific layer, but functions exist
to wrap a kate_packet around a memory buffer (such as the one ogg_packet uses, for instance).

Granule encoding

A granule position is split into two parts: a base and an offset.
The number of bits these parts occupy is variable, and each stream
may choose how many bits to dedicate to each. The kate_info structure
for a stream holds that information in the granule_shift field,
so each part may be reconstructed from a granulepos.

The kate_info structure for a stream holds a rational fraction
representing the time span of granule units for both the base and
the offset parts.

The granule rate is defined by the two fields:

kate_info::gps_numerator
kate_info::gps_denominator

The number of bits reserved for the offset is defined by the field:

kate_info::granule_shift
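As a sketch of how a timestamp could be recovered from a granule position under this scheme (assuming the base and offset parts are summed to obtain the total granule count, and that the granule rate is gps_numerator/gps_denominator granules per second; the field values in the test are invented):

```python
def granule_time(granulepos: int, granule_shift: int, num: int, den: int) -> float:
    # Split the granulepos into base (high bits) and offset (low bits)
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    # The granule rate is num/den granules per second, so divide by it
    return (base + offset) * den / num
```
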

Generic timing

Kate data packets (data packet type 0x00) include timing information (start time,
end time, and time of the earliest event still active). All these are stored as
64 bit values at the rate defined by the granule rate, so they do not suffer from the
granule_shift space limitation.

This also allows for Kate streams to be stored in other containers.

Motion

The Kate bitstream format includes motion definition, originally for karaoke purposes, but
which can be used for more general purposes, such as line based drawing, or animation of
the text (position, color, etc).

Motions are defined by means of a series of curves (static points, segments, splines (catmull-rom, bezier, and b-splines)).
A 2D point can be obtained from a motion for any timestamp during the lifetime of a text.
This can be used for moving a marker in 2D above the text for karaoke, or to use the x
coordinate to color text when the motion position passes each letter or word, etc.
Motions have attached semantics so the client code knows how to use a particular motion.
Predefined semantics include text color, text position, etc.
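As an illustration of how a 2D point can be obtained from one of the curve types at a given timestamp, here is a generic Bezier evaluation using De Casteljau's algorithm (a standalone sketch, not libkate code):

```python
def bezier_point(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] using De Casteljau's algorithm."""
    pts = list(control_points)
    while len(pts) > 1:
        # Repeatedly interpolate between consecutive control points
        pts = [((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
    return pts[0]
```
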

Since a motion can be composed of an arbitrary number of curves, each of which may have
an arbitrary number of control points, complex motions can be achieved. If the motion is
the main object of an event, it is even possible to have an empty text, and use the motion
as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could
be done this way, though this would require a lot of control points, and would not be able
to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and
the shapes are turned into b-splines and sent as a Kate motion to be displayed on the other
person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type.
While the timestamp lies within such a curve, no 2D point will be generated. This can be
used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting,
at the right time and for the right duration, a simple linear interpolation curve with only
two equal points, equal to the position the motion is supposed to pause at.
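The pause trick can be seen with plain linear interpolation: when both endpoints are equal, every timestamp within the curve yields the same point (a trivial sketch, assuming linear interpolation between two 2D control points):

```python
def lerp2d(p0, p1, t):
    # Linear interpolation between two 2D points, t in [0, 1]
    return ((1 - t) * p0[0] + t * p1[0], (1 - t) * p0[1] + t * p1[1])

# A "pause" curve: both control points equal to the position to hold
hold = (120.0, 40.0)
```
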

Kate defines a set of predefined mappings so that every decoder interprets a motion in
the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 values are
reserved for Kate, leaving 128 for application specific mappings, to avoid constraining
creative uses of that feature. Predefined mappings include frame (eg, 0-1 points are mapped
to the size of the current video frame), or region, to scale 0-1 to the current region.
This allows curves to be defined without knowing in advance the pixel size of the area
they should cover.

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are
needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be
obtained using two different motions.

There are higher level constructs, such as morphing between two styles, or predefined
karaoke effects. More are planned to be added in the future.

Trackers

Since attaching motions to text position, etc, makes it hard for the client to keep track of
everything, doing interpolation, etc, the library supplies a tracker object, which handles the
interpolation of the relevant properties.
Once initialized with a text and a set of motions, the client code can give the tracker a new
timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary, if one wants to use the motions directly, or just ignore them,
but it makes life easier, especially considering that the order in which motions are applied
does matter (to be defined formally, but the current source code is informative at this point).

The Kate file format

Though this is not a feature of the bitstream format, I have created a text file format to
describe a series of events to be turned into a Kate bitstream.
At its minimum, the following is a valid input to the encoder:

kate {

event { 00:00:05 --> 00:00:10 "This is a text" }

}

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into
the track, lasting 5 seconds to an offset of 10 seconds.

Motions, regions, styles can be declared in a definitions block to be reused by events, or can
be defined inline. Defining those in the definitions block places them in a header so they can
be reused later, saving space. However, they can also be defined in each event, so they will be
sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being
streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The
difference between the two is similar to the difference between a C source file and the resulting
object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I very much dislike
editing XML by hand, as it's really hard to read. XML is really meant for machines to
generically parse text data in a shared syntax but with possibly unknown semantics, and I
need these text representations to be easily editable.

This also implies that there could be an XML representation of a Kate stream, which would be
useful if one were to make an editor that worked on a higher level than the current all-text
representation, and it is something that might very well happen in the future, in parallel with
the current format.

Karaoke

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying
timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

kate {

simple_timed_glyph_style_morph {

from style "start_style" to style "end_style"

"Let " at 1.0

"us " at 1.2

"sing " at 1.4

"to" at 2.0

"ge" at 2.5

"ther" at 3.0

}

}

The syllables will change from a style to another as time passes. The definition of the start_style
and end_style styles is omitted for brevity.

Problems to solve

There are a few things to solve before the Kate bitstream format can be considered good
enough to be frozen:

Seeking and memory

When seeking to a particular time in a movie with subtitles, we may end up at a place where a subtitle has already been started, but not yet removed. Pure streaming doesn't have this problem, as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:

1. Each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet).

This means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earliest still active packet is less than the original seek target. This implies support code in players to do the double seek.

2. Use "reference frames", a bit like Theora does, where the granule position is split into several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".

This requires reissuing packets, which doesn't feel right (and wastes space). It also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

3. A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one. This allows a two phase seek, rather than a multiphase seek hopping back from event to event with no real way to know whether a previous event is still active (I suppose CMML has no need to know this, if its "clips" do not overlap - mine can).

Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, which I am not certain of. Well, it seems it can't do a one phase seek anyway.
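The two phase seek from the first option can be sketched over an in-memory list standing in for the stream (the packet representation here is invented for illustration):

```python
import bisect

def two_phase_seek(packets, target):
    """packets: sorted list of (granulepos, earliest_active_granulepos) pairs.

    Returns the index to resume decoding from, so that all events still
    active at `target` are seen, or None when seeking past the end."""
    granules = [g for g, _ in packets]
    # Phase 1: find the first packet at or after the target position
    i = bisect.bisect_left(granules, target)
    if i == len(packets):
        return None
    _, earliest = packets[i]
    # Phase 2: if an earlier event is still active, seek again to it
    if earliest < target:
        i = bisect.bisect_left(granules, earliest)
    return i
```
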

Additionally, it could be possible to emit simple "keepalive" packets at regular intervals
to help a seek algorithm sync up to the stream without needing to read too much data - this
helps for discontinuous streams where there could be no pages for a while if no data is
needed at that time.

Text encoding

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is
supported, for simplicity. There are no plans to support other encodings, such as UTF-16,
at the moment.

Note that strings included in the header (language, category) are not affected by that
text encoding (rather obviously for the language itself). These are plain ASCII.

The actual text in events may include simple HTML-like markup (at the moment, allowed markup
is the same as the one Pango uses, but more markup types may be defined in the future).
It is also possible to ask libkate to remove this markup if the client prefers to receive
plain text without the markup.

Language encoding

A header field defines the language (if any) used in the stream (this can be overridden in a
data packet, but this is not relevant to this point). At the moment, my test code uses
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
a language to a user selection may be simpler for user code if the language encoding is kept
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

If a stream contains more than one language, there usually is a predominant language, which
can be set as the default language for the stream. Each event can then have a language
override. If there is no predominant language, and it is not possible to split the stream
into multiple substreams, each with its own language, then it is possible to use the "mul"
language tag, as a last resort.
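A minimal sketch of matching a stream language against a user selection by primary subtag, in the spirit of the "keep it simple" approach above (the normalization shown is my assumption, not something mandated by Kate):

```python
def primary_subtag(tag: str) -> str:
    # Normalize "en_EN" / "en-EN" / "EN" down to the primary subtag "en"
    return tag.lower().replace("_", "-").split("-")[0]

def language_matches(stream_lang: str, user_lang: str) -> bool:
    # Match on the primary subtag only, ignoring secondary tags
    return primary_subtag(stream_lang) == primary_subtag(user_lang)
```
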

Bitstream format for floating point values

Floating point values are turned into a 16.16 fixed point format, then stored in a bitpacked
format: the number of zero bits at the head and tail of the values is stored once per stream,
and the remainder bits are stored for each value in the stream. This seems to yield good results
(typically a 50% reduction over raw 32 bit writes, and 70% over snprintf based storage), and
has the big advantage of being portable (eg, independent of any IEEE format).
However, this means reduced precision due to the quantization to 16.16. I may add support for
variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean less
space savings, though these are likely to be insignificant when Kate streams are interleaved with
a video.
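The quantization step can be sketched as follows (the head/tail zero bit packing is omitted; only the 16.16 conversion and its precision loss are shown):

```python
def to_fixed_16_16(x: float) -> int:
    # Quantize to 16.16 fixed point: 16 integer bits, 16 fractional bits
    return int(round(x * 65536))

def from_fixed_16_16(f: int) -> float:
    return f / 65536.0
```

The round trip is exact for values representable in 16.16 (like 1.5), while other values are off by at most half a fractional step, ie 1/131072.
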

Authoring tools

Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be written to create a Kate bitstream from various existing subtitle formats, it is not certain that a good authoring tool for a series of curves will be easy to find. That said, it's not exactly difficult to write one if you know a widget set.

Higher dimensional curves/motions

It is quite annoying to have to create two motions to control a color change, due to curves
being restricted to two dimensions. I may add support for arbitrary dimensions. It would also
help for 1D motions, like changing the time flow, where one coordinate is simply ignored at
the moment.
Alternatively, changes could be made to the Kate file format to hide the two dimensionality and
allow simpler specification of non-2 dimensional motions, but still map them to 2D in the kate
bitstream format.

Category definition

The category field in the BOS packet is a 16 byte text field (15 really, as it is zero terminated
in the bitstream itself). Its goal is to provide the reader with a short description of what kind
of information the stream contains, eg subtitles, lyrics, etc. This would be displayed to the user,
possibly to allow the user to turn some streams on and off.

Since this category is meant primarily for a machine to parse, it will be kept to ASCII. When
a player recognizes a category, it is free to replace its name with one in the user's language if
it prefers. Even in English, the "lyrics" category could be displayed by a player as "Lyrics".

Since this is a free text field rather than an enumeration, it would be good to have a list of
common predefined category names that Kate streams can use.

This is a list of proposed predefined categories, feedback/additions welcome:

subtitles - the usual movie subtitles, as text

spu-subtitles - movie subtitles in DVD style paletted images

lyrics - song lyrics

transcript - exact words of a speech

commentary - running commentary about an accompanying stream (eg, a video)

narration - narration of an accompanying stream (eg, a video)

book - a full book as text, might be a lone Kate stream (or muxed with other languages)

Please remember the 15 character limit if proposing other categories.

Text to speech

One of the goals of the Kate bitstream format is that text data can be easily parsed
by the user of the decoder, so any additional information, such as style, placement,
karaoke data, etc, can be stripped to leave only the bare text. This is in view of
allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap way of
conveying speech data, and could also allow things like e-books which can be either
read or listened to from the same bitstream.

I have seen no reference to this being used anywhere, but I see no reason why the
granule progression should be temporal rather than user controlled: a "next" button
could bump the granule position by a preset amount, simulating turning a page. This
would be close to necessary for text-to-speech, as the wall time duration of the
spoken speech is not known in advance to the Kate encoder, and can't be mapped to a
time based granule progression. All text strings triggered consecutively between the
two granule positions would then be read in order.

Possible additions

Embedded binary data

Images and font mappings can be included within a Kate stream.

Images

Though this could be misused to interfere with the ability to render as text-to-speech, Kate
can use images as well as text. The same caveat as for fonts applies with regard to data
duplication.

Complex images might however be best left to a multiplexed OggSpots or OggMNG stream, unless the
images mesh with the text (eg, graphical exclamation points, custom fonts (see next
paragraph), etc).

There is support for simple paletted bitmap images, with a variable length palette of up
to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as
many bits per pixel as can address the palette. Palettes and images are stored separately,
so can be used with one another with no fixed assignment.
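For instance, the number of bits per pixel needed to address a palette can be computed as follows (a sketch; power of two palette sizes are assumed, as stated above):

```python
def bits_per_pixel(palette_size: int) -> int:
    # Smallest number of bits that can address every palette entry
    return max(1, (palette_size - 1).bit_length())
```
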

Palettes and bitmaps are put in two separate headers for later use by reference, but can
also be placed in data packets, as with motions, etc, if they are not going to be reused.

PNG bitmaps can also be embedded in a Kate stream. These do not have associated palettes
(but the PNGs themselves may or may not be paletted). There is no support for decoding PNG
images in libkate itself, so a program will have to use libpng (or similar code) to decode
the PNG image. For instance, the libtiger rendering library uses Cairo to decode and render
PNG images in Kate streams.

This can be used to have custom fonts, so that raw text is still available if the stream
creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data,
would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a
Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting karaoke effects could be achieved by having MNG images
instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on
whether this is going too far, however.

A possible solution to the duplication issue is to have another stream in the container
stream, which would hold the shared data (eg, fonts), which the user program could load,
and which could then be used by any Kate (and other) stream. Typically, this type of stream
would be a degenerate stream with only header packets (so it is fully processed before any
other stream presents data packets that might make use of that shared data), and all payload
such as fonts being contained within the headers. Thinking about it, it has parallels with
the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the
list of styles within a header packet.

Fonts

Custom fonts are merely a set of ranges mapping unicode code points to bitmaps. As this implies,
fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client,
may not look as good as with a vector font.

A style may also refer to a font name to use (eg, "Tahoma"). These fonts may or may not be
available on the playing system, however, since the font data is not included in the stream,
just referenced by name. For this reason, it is best to keep to widely known fonts.

Reference encoder/decoder

An encoder and a decoder are included in the tools directory. The encoder reads its input from a custom
text based file format (see The Kate file format),
which is by no means meant to be part of the Kate bitstream specification itself,
from a SubRip (.srt) format file (the most common subtitle format I found, and a very basic one),
or from a lyrics (.lrc) format file.

The Kate bitstreams encoded and decoded by those tools are (supposed to be) correct for this
specification, provided their input is correct.

Next steps

Continuations

Continuations are a way to add to existing events, and are mostly meant for motions. When streaming
in real time, the motions applied to events may not be known in advance (for instance, in a
draw chat program where two programs exchange Kate streams, the drawing motions are only known as
they are drawn). Continuations will allow an event to be extended in time, and motions to be appended
to it. This is only useful for streaming, as when stored in a file, everything is already known in
advance.

A rendering library

This will allow easier integration in other packages (movie players, etc).
I have started working on an implementation using Cairo and Pango, though I'm still at the early stages.
I might add support for embedding vector fonts in a Kate stream if I were going that way; I still need to think about this.
Another point of note is that when this library is available, it would make it easier to add
capabilities such as rotation, scaling, etc, to the bitstream, since this would not cause too
much work for playing programs using the rendering library. It is expected that these additions
would stay backward compatible (eg, an old player would ignore this information but still correctly
decode the information it can work with from a newly encoded stream).

An XML representation

While I purposefully did not write Kate description files in XML due to me finding editing XML such
a chore, it would be nice to be able to losslessly convert between the more user friendly representation
and an XML document, so one can do what one does with XML documents, like transformations.

And after all, some people might prefer editing the XML version.

Matroska mapping

The codec ID is "S_KATE".

As for Theora and Vorbis, Kate headers are stored in the private data as xiph-laced packets:

Byte 0: number of packets present, minus 1 (there must be at least one packet) - let this number be NP
Bytes 1..n: lengths of the first NP packets, coded in xiph style lacing
Bytes n+1..end: the data packets themselves concatenated one after the other

Note that the length of the last packet isn't encoded, it is deduced from the sizes of the other
packets and the total size of the private data.
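A sketch of parsing such private data back into packets (assuming exactly the layout described above; this is an illustration, not a validated Matroska demuxer):

```python
def split_xiph_private(data: bytes):
    np = data[0]  # number of packets, minus one
    lengths, pos = [], 1
    for _ in range(np):
        # Xiph lacing: sum 255-valued bytes until a byte below 255 ends the length
        n = 0
        while data[pos] == 255:
            n += 255
            pos += 1
        n += data[pos]
        pos += 1
        lengths.append(n)
    packets = []
    for length in lengths:
        packets.append(data[pos:pos + length])
        pos += length
    packets.append(data[pos:])  # last packet's length is implicit
    return packets
```
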

This mapping is similar to the Vorbis and Theora mappings, with the caveat that one should not
expect a set number of headers.