The purpose of this page is to lay out and hammer down a specification for implementing captioning, subtitling, and timed text support for media HTML elements: both Video and Audio. This is a work in progress, and is currently being authored by [[User:Millam]]

Latest revision as of 15:17, 27 September 2014

Visual Description: A written description of the visual scenes. May or may not include spoken words. Targeted at visually impaired users.

Closed Captioning: When the captions are separate from the media stream, and can be toggled on and off.

Open Captioning: When the captions are "burned" into the video, and thus can't be toggled on and off.

External captions: When the captions are kept in a file separate from the media file. (e.g: .srt, .ass)

Included captions: When the captions are included in the video file, whether as a separate text track or as 'prerendered' captions. (Described in Caption Formats, below)

Un-styled captions: Captions that have no styling information. Just plain text.

Styled captions: Captions that have one or more bits of styling: Color, Positioning (relative to the video), 'Karaoke' color changing, animation effects, etc.

Pop-on or Paint-on captions: From the television 608/708 standards. Pop-on and Paint-on captions are positioned, but otherwise unstyled, text.

Rollup captions: Captions that 'roll up'. Typically, there's a window of 3 lines, and as captions are added to it, they are added to the bottom line, pushing older text up, or out of frame. Most typically used with news stations, soap operas, and broadcast events that are captioned live.
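
The rollup behaviour described above can be sketched in a few lines of JavaScript. This is only an illustration: `createRollupWindow` and its `maxLines` parameter are hypothetical names, not part of any existing API or of this proposal.

```javascript
// Hypothetical helper that models a roll-up caption window.
function createRollupWindow(maxLines = 3) {
  const lines = [];
  return {
    // Add a new line at the bottom; once the window is full,
    // the oldest line is pushed out of frame.
    push(text) {
      lines.push(text);
      if (lines.length > maxLines) {
        lines.shift();
      }
      return lines.slice();
    },
    // Snapshot of the visible lines, top to bottom.
    visible() {
      return lines.slice();
    },
  };
}

const rollup = createRollupWindow(3);
["Good evening.", "Our top story", "tonight:", "captions."].forEach((t) => rollup.push(t));
console.log(rollup.visible()); // only the most recent lines remain visible
```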

From the User Agent's perspective, Timed Text, Captions and Subtitles are functionally identical; the only difference is the content they describe. For the purposes of this document, I will be using "captions" and "captioning" to refer to all of the above.

Caption Formats

One of the largest barriers to adding captioning to standards in the digital age is the sheer number of formats available. The list below is a small sample.

608/708, or "Line 21" captions: Designed for Television, caption data is encoded in scanline 21. Text here is often broken up, and drawn by command.

"Prerendered" captions. Designed for DVDs, these are actually transparent video frames that are drawn on top of the video.

Subrip (.srt). Plain text files, where each separate caption consists of three or more lines: The number of the caption (optional), the start time (and optional end time), and the caption text, unstyled. Very easy to read and write by hand.

SSAV4 (.ssa or .ass). A very flexible, styled format. It is very verbose, so authors usually prefer to use caption editors to create and edit these files.
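
To make the Subrip layout described above concrete, here is a minimal parsing sketch in JavaScript. It assumes timestamps of the form HH:MM:SS,mmm separated by "-->", and treats both the caption number and the end time as optional, as described above. The function names `parseSrtTime` and `parseSrt` are hypothetical, and this is not a complete or validating parser.

```javascript
// Convert "HH:MM:SS,mmm" into seconds.
function parseSrtTime(stamp) {
  const m = /^(\d+):(\d{2}):(\d{2})[,.](\d{3})$/.exec(stamp.trim());
  if (!m) throw new Error("Bad timestamp: " + stamp);
  return Number(m[1]) * 3600 + Number(m[2]) * 60 + Number(m[3]) + Number(m[4]) / 1000;
}

// Split a Subrip document into cue objects: { start, end, text }.
function parseSrt(source) {
  const cues = [];
  const blocks = source.replace(/\r\n?/g, "\n").split(/\n{2,}/);
  for (const block of blocks) {
    if (block.trim() === "") continue;
    const lines = block.trim().split("\n");
    // The leading caption number is optional; skip it when present.
    if (/^\d+$/.test(lines[0].trim())) lines.shift();
    if (lines.length === 0) continue;
    const [startText, endText] = lines.shift().split("-->");
    cues.push({
      start: parseSrtTime(startText),
      // A missing end time is recorded as null (assumption of this sketch).
      end: endText ? parseSrtTime(endText) : null,
      text: lines.join("\n"),
    });
  }
  return cues;
}

const sample = "1\n00:00:01,500 --> 00:00:03,000\nHello\nworld\n\n2\n00:00:04,000\nNo end time";
console.log(parseSrt(sample));
```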

For HTML5, we should support all Included Captions for all containers that our media elements support (At the time of this writing, there is no decided element), and two External Caption formats:

Subrip: For simplicity, ease of creation, ease of use, and to allow authors to style their text with CSS.

SSAV4: At the opposite end, for more complex uses: Karaoke, etc.

Overview

This document aims to describe a method of adding captioning support to all HTML5 user agents for all media elements. (Currently: audio and video elements). This will require adding subtitling support to the user agent instead of to a media plugin, and will integrate it with the JavaScript engines identically across each UA.

Typical Usage

This section describes how the page author and the page viewer utilize captions.

Author

The author writes an HTML5 page that includes a media element (<video>...</video> or <audio>...</audio>) and wishes to add captioning using external captions in one or more languages. For a single language in a standalone media tag, the author will include a 'subs="..."' attribute to define the location of a single captioning file. For multiple languages, the author will use a new element, <captiontrack>, to define each caption track. (Name of <captiontrack> element to be decided.)

User

The user then visits the author's page. The User Agent loads the video, then fetches the external caption track. The User Agent then determines whether to turn captions on: either they are on by default (preferred), or the user has expressed a preference to enable captioning. If captions are on, they are rendered on top of the video. Whether captions are on or off, the User Agent should provide a method to enable or disable the caption track(s).

The Caption Track Element

This is big enough to deserve its own section. The largest problem with caption elements is adding the ability to deal with all three major types of captions: unstyled, styled, and prerendered.

Implicit and Explicit tracks

Implicit tracks are Included Captions: They are part of the media stream. It is very unlikely to know the entirety of the caption track until the entire media file has been received and parsed.

Explicit tracks are External Captions: They are fetched from a separate URI by the User Agent, and have their timings tied to the media element.

Because media elements can stream, an Implicit Track cannot be treated as if its entire contents are known in advance.

From this point on, "Caption Stream" refers to either an implicit or explicit track.
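
One way to picture a Caption Stream whose contents are not fully known is as a container that accepts captions as they arrive and only answers queries from what it has seen so far. The JavaScript sketch below is illustrative only; the `CaptionStream` class and its methods are hypothetical names, and cue times are assumed to be in seconds.

```javascript
// Hypothetical CaptionStream: cues may trickle in (implicit track)
// or be appended all at once after a fetch (explicit track).
class CaptionStream {
  constructor() {
    this.cues = []; // kept sorted by start time
  }

  // Insert a cue as it is parsed, keeping the list ordered by start time.
  append(cue) {
    let i = this.cues.length;
    while (i > 0 && this.cues[i - 1].start > cue.start) i--;
    this.cues.splice(i, 0, cue);
  }

  // Cues visible at time t, based only on what has been received so far.
  activeAt(t) {
    return this.cues.filter((c) => c.start <= t && t < c.end);
  }
}

const stream = new CaptionStream();
stream.append({ start: 5, end: 8, text: "Second" });
stream.append({ start: 1, end: 4, text: "First" });
console.log(stream.activeAt(2).map((c) => c.text));
```

Because queries never assume the list is complete, the same code path serves both implicit and explicit tracks.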

Attributes common to all caption track elements

Name: This is the name of the caption track. e.g: "Default", "Comments", "Auto-translated", "Translated by Cervantes". If null, it defaults to "default".

Language: This is the language (*NOT* the text encoding!) of the caption track. This is the language _code_ that describes which language the caption file is in. "en", "pt", "fr", etc. If null, defaults to the language of the page, if known, or en.

captiontype: "caption", "subtitle", "description", "other", or null.

Type and Language should be considered by the User Agent when deciding whether to enable or disable a track. Name is used for display to the user for user selection, and is effectively cosmetic. The (Language, Name) tuple should be unique across all tracks.

Any or all of the three may be null, because the largest use case is an author adding a single caption track to their video. This also handles the case where tracks are "implicit".

format: "styled", "unstyled", or "prerendered". If null, the format of the caption stream is used. Subrip is unstyled. SSAV4 is styled. DVD-style tracks are prerendered.

encoding: Only relevant to styled and unstyled text. Describes the character encoding of the content.
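
The defaulting rules above can be sketched as a small normalization step. This is a hedged illustration: `normalizeTrack`, `findDuplicateTracks`, and the stream-kind labels "subrip", "ssav4" and "dvd" are hypothetical names chosen for the example, not part of the proposal.

```javascript
// Hypothetical mapping from the kind of caption stream to its default format.
const FORMAT_BY_STREAM = { subrip: "unstyled", ssav4: "styled", dvd: "prerendered" };
const CAPTION_TYPES = ["caption", "subtitle", "description", "other"];

// Apply the defaults described above to a track's attributes.
function normalizeTrack(attrs, pageLanguage, streamKind) {
  return {
    name: attrs.name ?? "default",
    // Fall back to the page language if known, otherwise "en".
    language: attrs.language ?? pageLanguage ?? "en",
    captiontype: CAPTION_TYPES.includes(attrs.captiontype) ? attrs.captiontype : null,
    // A null format is taken from the caption stream itself.
    format: attrs.format ?? FORMAT_BY_STREAM[streamKind] ?? null,
    encoding: attrs.encoding ?? null,
  };
}

// Report tracks that repeat an earlier (language, name) pair.
function findDuplicateTracks(tracks) {
  const seen = new Set();
  const duplicates = [];
  for (const track of tracks) {
    const key = track.language + "\u0000" + track.name;
    if (seen.has(key)) duplicates.push(track);
    seen.add(key);
  }
  return duplicates;
}

console.log(normalizeTrack({}, null, "subrip"));
```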

Lifetime of the track elements

The track elements are created and destroyed along with the media element they are associated with. Any track elements created via JavaScript are destroyed when the media element is destroyed.

Enabling and disabling

If the User Agent displays media controls (pause, start, stop, rewind, seek, etc), they must also include a caption control - if there is at least one caption track. If there is no caption track, the User Agent can elect to not display the caption control or to display a disabled caption control.

If the User Agent does not display media controls, and caption tracks are associated with the media, then the User Agent must include a caption control in its context menu or in an easily accessible toolbar menu.

User Agents are recommended, but not required, to provide a configuration option in the browser accessibility settings for four caption settings:

In the case of multiple caption tracks, the one matching the user's language is the first choice. Type priority is "caption" > "subtitle", unless the user has chosen visual descriptions, in which case visual descriptions have top priority.
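
The selection rule above could be sketched as follows. The function name `chooseTrack` is hypothetical, and two behaviours are assumptions of this sketch rather than anything stated above: falling back to all tracks when none matches the user's language, and breaking ties by document order.

```javascript
// Pick the track to enable by default, or null if there are none.
function chooseTrack(tracks, userLanguage, prefersDescriptions = false) {
  const typeOrder = prefersDescriptions
    ? ["description", "caption", "subtitle"]
    : ["caption", "subtitle"];
  // Lower rank is better; unknown or null types rank last.
  const rank = (track) => {
    const i = typeOrder.indexOf(track.captiontype);
    return i === -1 ? typeOrder.length : i;
  };
  // Language match comes first; otherwise consider every track (assumption).
  const sameLanguage = tracks.filter((t) => t.language === userLanguage);
  const pool = sameLanguage.length > 0 ? sameLanguage : tracks;
  let best = null;
  for (const track of pool) {
    if (best === null || rank(track) < rank(best)) best = track;
  }
  return best;
}

const tracks = [
  { language: "en", name: "Subs", captiontype: "subtitle" },
  { language: "en", name: "CC", captiontype: "caption" },
  { language: "en", name: "Described", captiontype: "description" },
  { language: "fr", name: "CC", captiontype: "caption" },
];
console.log(chooseTrack(tracks, "en").name);
```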

Display

There is an implicit block-level element, <captionblock> (Name undecided, see #BlockElement) overlaying the video element, and of the same size and location. If it is a video element, the User Agent may choose to represent the block level element with varying sizes. This implicit block level element is used to place text (styled and unstyled) inside. This block level element allows the web author or the user agent to style the caption text.

Undecided: Should prerendered frames be treated as image elements within the block level element, or just be considered part of the video, and enabled/disabled?

Using

When the User Agent enables a Caption Stream, it triggers a caption_enable event. The caption_disable event corresponds to disabling it.

When enabled, the User Agent ties the Caption Stream to the video player. At the appropriate start and end times, or during seek, the User Agent will trigger a caption_add and caption_clear event. (Within the User Agent)
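
As an illustration of how caption_add and caption_clear might be derived, the sketch below compares which captions are active before and after a time change, which covers both normal playback and seeking. The function `captionEvents` and the cue shape `{ start, end, text }` are hypothetical, and a previous time of null is assumed to mean "nothing was showing yet".

```javascript
// Work out which events to fire when the playback position moves
// from previousTime to currentTime (both in seconds).
function captionEvents(cues, previousTime, currentTime) {
  const isActive = (cue, t) => t !== null && cue.start <= t && t < cue.end;
  const events = [];
  for (const cue of cues) {
    const wasActive = isActive(cue, previousTime);
    const nowActive = isActive(cue, currentTime);
    if (wasActive && !nowActive) events.push({ type: "caption_clear", cue });
    if (!wasActive && nowActive) events.push({ type: "caption_add", cue });
  }
  return events;
}

const cues = [
  { start: 0, end: 2, text: "First" },
  { start: 1, end: 3, text: "Second" },
];
console.log(captionEvents(cues, 0.5, 2.5).map((e) => e.type + ": " + e.cue.text));
```

Note that this comparison only looks at two points in time, so a caption that starts and ends entirely between them produces no events; a real implementation would need to schedule against the cue times themselves.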

caption_enable

Run when captions are enabled, whether by default, by user action, or by JavaScript.

The Caption Block

In order to allow the author to have CSS control over the style of the caption text, an implicit caption block is placed over the media element, with the same width and height. This can then be styled using: CAPTIONBLOCK { font-size: 200%; background-alpha: 80%; background-color: gray; /* etc */ }.

If the page author desires, they can place a <captionblock id="foo"></captionblock> element elsewhere, and use JavaScript to instruct the media element to use that as the block instead. (e.g: mediaElement.setCaptionBlock($('#foo'))). This allows captions to be read without overlaying the video directly.

Thoughts:

If the media element has an id, the caption block should have an id equal to the media element's id + ".captions". e.g: <video id="foo"></video> has an implicit captionblock of <captionblock id="foo.captions"></captionblock>

Should we use a new block name, or use a

instead?

Contents

The caption block will contain a series of elements. Each element represents a single caption that has a start and end time.

For example, the HTML5 representation of a snapshot might be (with two overlapping timed elements, to simulate rollup captions):