An HTML-based media markup language for HTML5

This page introduces an HTML-based time-aligned (or time-synchronized) text markup for audio and video. It is particularly targeted at use with the HTML5 audio and video elements, but can also be used in stand-alone applications.

The new markup is called "Web Media Markup Language" (WMML) and has the MIME type text/wmml.

The main motivation for creating this markup is to specify a text format for captions, subtitles, karaoke, and similar time-aligned text that works by reusing CSS and HTML. It does so by creating a new file format that reuses existing HTML5 elements where appropriate. In particular, the innerHTML parser of HTML5 is reused for the main markup. Only a small number of elements are introduced that do not currently exist in HTML5.

The new elements are not an extension to HTML5 and are not planned to become one. There are hooks into HTML through the TimedTracks API in HTML5, by which the WMML elements are exposed to the Web page that hosts the media resource and the link to the WMML document. Some constructs that exist only as API objects in HTML5 are actual elements in WMML.

The aim behind this way of defining WMML is to create a format that can reuse existing HTML5 snippet parsing rather than requiring a completely new parser. A WMML parser will consist of only a small amount of new parsing code and will rely on an existing HTML5 snippet parser for the bulk of its parsing needs. The reuse of CSS also allows existing implementations of styling and positioning to be reused. This should vastly help Web browsers to implement support for WMML, particularly for the richer features.

Note: A WMML document is an XML-ish document that contains HTML elements but is not an XML-with-namespaces document. This is on purpose, to allow reuse of CSS and HTML snippet parsing without taking on the issues of XML namespaces and XSL-FO.

Introductory Examples

1. A simple example

This is an example with two 10-second-long text cues provided in the default language "en-US". We use the word "cue" as a general abstraction of the time-aligned text (or "event") that is being provided; it is more general than "caption", "subtitle", etc.

HTML5 defines a timed track API for cues and the list of cues inside a WMML document maps neatly onto the TimedTrackCueList interface.

If not specified otherwise, the default rendering region for a WMML resource that is related to a video is a CSS box with the dimensions of the video viewport, overlaid on the bottom part of the video viewport. Alternatively, the top, right, and left viewport regions are possible rendering regions, too. Further, the cues could be rendered by a Web page outside the video element, but such information is decided by the rendering Web page and not by the WMML file itself. The Web page's setting will always overrule any settings provided in the WMML file.

In this example, the cues are rendered at 10s and 20s as an overlay onto the bottom area of the video viewport.
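A document along these lines might look as follows. This is a hypothetical sketch assembled from the elements and attributes described later on this page; the exact markup of the original example is not preserved here.

```xml
<wmml lang="en-US" profile="innerHTML">
  <head>
    <!-- optional styling and metadata -->
  </head>
  <cuelist>
    <!-- two 10 sec long cues, rendered at 10s and 20s -->
    <cue start="10" end="20">This is the first cue.</cue>
    <cue start="20" end="30">This is the second cue.</cue>
  </cuelist>
</wmml>
```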

2. A formatted and positioned example

This is an example with two 10-second-long text cues provided in the default language "en-US", which are placed at the center of the top third of the video.

It is possible to define the viewport's minimum width and height by styling the <wmml> element. This helps to communicate what CSS box the document expects to be reserved for it. All the formatting specifications in the cues are made relative to that viewport. It is preferred that cues be placed relative to the viewport, so that they scale easily with the video, e.g. for fullscreen viewing.

The cue elements c1 and c2 are formatted - the first one with red color, a different font, and a background transparency of 50%; the second one contains spans that are italicised.
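Such a document might look like the following sketch. The markup is hypothetical: the use of a style attribute on <wmml>, the cue element selector, and the concrete style rules are assumptions based on the descriptions on this page.

```xml
<wmml lang="en-US" profile="innerHTML" style="min-width: 480px; min-height: 360px;">
  <head>
    <style>
      /* place cues at the center of the top third of the viewport */
      cue { top: 0; height: 33%; text-align: center; }
      /* c1: red, a different font, a 50% transparent background */
      #c1 { color: red; font-family: "Courier New", monospace;
            background-color: rgba(0, 0, 0, 0.5); }
      /* c2: italicised spans */
      #c2 span { font-style: italic; }
    </style>
  </head>
  <cuelist>
    <cue id="c1" start="10" end="20">This is the first cue.</cue>
    <cue id="c2" start="20" end="30">This cue has <span>emphasised</span> words.</cue>
  </cuelist>
</wmml>
```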

The Web page could decide to overrule the rendering target to some other location on screen. This would be provided in the style attribute of the <track> element through which the WMML resource is linked.
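In the hosting page, such an override might look like this. The markup is a hypothetical sketch following the HTML5 timed track proposal; in particular, the style attribute on <track> is an assumption of this proposal, not established HTML5.

```html
<video src="video.ogv" controls>
  <!-- links the WMML resource to the video; the page's styling wins -->
  <track src="captions.wmml" type="text/wmml" srclang="en-US"
         style="top: auto; bottom: 0;">
</video>
```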

the <head> element

<link> elements inside <head> can be used to link to external style sheets

<script> should be avoided (the need for JavaScript support in WMML is not clear yet)

<style> can be used to put styling information directly in the document
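Taken together, a typical <head> might therefore look like this illustrative sketch:

```xml
<head>
  <!-- external style sheet -->
  <link rel="stylesheet" type="text/css" href="captions.css">
  <!-- document-local styling -->
  <style>
    cue { color: white; }
  </style>
</head>
```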

the <cuelist> element

only contains a sequence of <cue> elements

is just a grouping element for the <cue> elements and doesn't support any of the attributes of the HTML body element

the <cue> element

is analogous to the HTML <div> element and supports all of the attributes and content elements of <div>, in particular all flow content (which includes <ruby>).

what actually is used inside a cue is defined by the @profile attribute of the <wmml> element

"plainText": will be parsed by ignoring all markup if any

"innerHTML": will be parsed by the HTML5 snippet parser

"JSON": will be parsed as JSON

"any": will not be parsed but just regarded as any text

<cue> elements cannot appear inside <cue> elements

it has the following additional attributes:

start (float, optional): the start time of the cue (in relation to a media resource that is externally specified in an HTML media element); if missing, start=0 is assumed

end (float, optional): the end time of the cue; if missing, it implicitly ends with the start of the next cue or at the end of the resource; thus, if time-overlapping cues are needed, specification of the end attribute is required

width/height (optional): the cue's width/height as a percentage of the viewport
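The timing defaults above can be illustrated as follows (hypothetical markup):

```xml
<cuelist>
  <!-- no start: starts at 0; no end: ends when the next cue starts -->
  <cue>Opening title.</cue>
  <!-- explicit timing, with a cue box of 50% x 20% of the viewport -->
  <cue start="10" end="25" width="50%" height="20%">A positioned cue.</cue>
  <!-- no end and no following cue: ends at the end of the resource -->
  <cue start="25">Closing credits.</cue>
</cuelist>
```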

the <t> element

a flow content element that is used inside the <cue> element for further specification of starting times of smaller elements

by default, the content inside the <t> element inherits its style from the parent <cue> element; its own styling is only activated when its time stamp is reached

it has the following attributes:

at (float): a time stamp specifying at what time the style for the element becomes active

style: the styles to be activated

Note: this could also be achieved with a span element and some CSS3, in particular the transition-delay property, but the markup would be a lot more verbose and much less readable.
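A karaoke cue using <t> might look like this hypothetical sketch: each word inherits the cue's style until its @at time stamp is reached, at which point its own style becomes active.

```xml
<cue start="10" end="14">
  <t at="10" style="color: yellow;">Sing</t>
  <t at="11" style="color: yellow;">along</t>
  <t at="12" style="color: yellow;">with</t>
  <t at="13" style="color: yellow;">me</t>
</cue>
```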

With the use of attributes, CSS selectors can be applied e.g. to all cues that belong to a certain speaker, like this: cue[class="speaker1"] { ... } .

Rendering

The WMML file's <cue> elements are not rendered into an existing HTML page; rather, a WMML file creates its own iframe-like nested browsing context. It is linked to the parent HTML page through a track element that is inserted as a child of the video element. Creation of a nested browsing context is important because a WMML file can come from a different domain than the Web page. For security reasons and for general base URI computations, a nested browsing context is therefore the better approach, with the DOM nodes of the hosting page and the DOM nodes of the WMML document in different owner documents. That way, the hosting document has the security origin of its own URL and the WMML document has the security origin of its URL.

As the browser plays the video, it must render the WMML <cue> tags in sync: when the start time of a <cue> tag is reached, the <cue> tag is made active, and when its end time is reached, it is made inactive. If no start time is given, the start is assumed to be 0; if no end time is given, the cue ends with the start of the next one or at the end of the resource.
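These timing rules can be sketched in script form. This is an illustrative model, not a browser implementation; the cue object shape and the function names are made up for this sketch.

```javascript
// Resolve the implicit start/end times of a list of cues:
// a missing start defaults to 0; a missing end defaults to the
// start of the next cue, or to the end of the media resource.
function resolveCueTimes(cues, resourceDuration) {
  const starts = cues.map(c => (c.start !== undefined ? c.start : 0));
  return cues.map((c, i) => ({
    text: c.text,
    start: starts[i],
    end: c.end !== undefined
      ? c.end
      : (i + 1 < cues.length ? starts[i + 1] : resourceDuration),
  }));
}

// A cue is active while currentTime lies inside [start, end).
function activeCues(resolvedCues, currentTime) {
  return resolvedCues.filter(c => currentTime >= c.start && currentTime < c.end);
}
```

For two cues {start: 10} and {start: 20, end: 25} in a 60-second resource, the first cue is resolved to end at 20, and at currentTime 12 only the first cue is active.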

The content of WMML cue elements is made available to the HTML page that includes the WMML file and the media resource through the timed track API in HTML. In particular, the getCueAsHTML and getCueAsSource API calls will provide a copy of the DOM subtree for the <cue>. Style information that was applied by <style> elements in the WMML document is lost, but since the main reason for using the JavaScript API is to apply your own styles, this is acceptable. The returned content may need to be sanitized in case a malicious cue contains a <script> element.

Differences to other proposed formats for use in HTML5

Other formats have been proposed to be used as baseline formats for external time-aligned text documents for HTML5 media elements. The most popular examples are SRT, WebSRT, and DFXP/TTML.

The main difference between SRT and WMML is that WMML is HTML-like and thus requires more markup. But that is offset by the ability to easily extend WMML with existing HTML and CSS features.

WebSRT tries to extend SRT with features that have been deemed necessary for a collection of use cases around captions, subtitles, and karaoke. In its current definition, it is a platform that allows for plain text, minimal markup, and arbitrary content. Thus, without adding innerHTML support, it has the drawback that it is not natively extensible to new HTML-conformant applications, such as overlays on videos with ads, or captions with images, icons, or hyperlinks in them. Further, WebSRT doesn't really support CSS, but only a small subset of it, while also making up new functionality, in particular for layout and positioning. While not as complex as XSL-FO, it has the same drawback of requiring the implementation of another layout approach. Finally, WebSRT is not an XML/HTML-based markup and thus requires the implementation of a new parsing unit in Web browsers.

TTML has tried to be an XML format that supports traditional XHTML approaches. It has CSS-like formatting instructions. However, it is sufficiently different from HTML/CSS that it is not easily possible to reuse existing HTML and CSS parsing code to interpret a TTML document. At the time of its definition, it seemed like a sensible thing to do in order to stay in sync with XHTML, with XML namespaces, and with XSL-FO, but in the modern HTML5 space, these have proven to be a hindrance to implementation in modern Web browsers.

WMML provides a solution to this situation. It is very similar to HTML and reuses CSS for formatting and styling. It tries to be as simple as possible in what it newly introduces. It references HTML and CSS for the bulk of its functionality, which makes it easily extensible, since any new functionality introduced into HTML and CSS is available to WMML, too. In addition, we've adopted the idea from WebSRT of allowing other types of content in the cues, too, with plain text, JSON, and arbitrary content available. The @profile attribute will make sure that applications that only want to support one type of content can identify such files.

Note that WMML is an improvement over a previous experiment with timed divs. WMML moves away from reusing existing HTML tags for a different purpose (<body>, <div>), and it introduces a <t> element to allow for karaoke. The latter is an optional addition.

Uptake concerns

Uptake of a new caption format is important in relation to several user groups:

web browsers,

manual authors,

authoring applications,

stand-alone players.

Generally, it is expected that applications ignore CSS and HTML elements that they do not understand rather than failing to parse a WMML document with such elements.

Web browsers should be able to implement support for WMML fairly easily, since they already have support for most of the required CSS and HTML functionalities.

For (manual) authoring of WMML documents, it is expected that authors exert restraint in the actual elements they use. One reason is that the more features one overlays on a video, the less useful the video becomes, so there is usability pressure for restraint. Also, the more elements of HTML are used in WMML documents, the less usable the WMML document becomes for players that do not support Web technologies. Over time, an increasing number of HTML elements may be supported by authoring tools and stand-alone players, and can then be used in typical WMML documents.

Since many new players are already capable of parsing HTML pages, implementation of support for WMML in stand-alone players may not be much of an issue.

As for the authoring side of WMML documents: for hand-coding, WMML is a bit more verbose than e.g. SRT. It is frequently pointed out that the XML-based caption format USF (Universal Subtitle Format), as it was defined by Matroska developers, never achieved any uptake. The reasoning is that the fansubbing community refused to author documents in such a verbose format. However, support for more than USF's basic features was never implemented in any media player or authoring application, which probably had a lot more to do with the lack of uptake than the verbosity of the format.

The situation with WMML is different, though, since it's not built completely from scratch. If all Web browsers support WMML and its advanced features, then authors will understand the usefulness of the verbosity. Also, because WMML would reuse HTML parsers, all features would be available immediately in a Web browser without having to wait for player developers to catch up. Exporting to WMML from a subtitling or captioning application also wouldn't be hard, at least for the most fundamental needs - and it would provide all the features of advanced formats, too. Finally, stand-alone players that consider implementing support for WMML will look at it in the context of also implementing support for HTML documents - something increasingly useful to media players (as exemplified by iTunes etc.). Thus, there is no additional overhead (or only minimal overhead) in implementing WMML.

Ultimately, the aim of a new Web caption format should be to enable new people to author captions. By creating a format that is so similar to HTML that it is trivial for any Web developer to author in, we can suddenly recruit all the Web developers of the world as captioners. This is a much more important aim than the comparatively easy challenge of convincing existing captioners to export their files into yet another file format.