Report of the Video A11y Grant Progress 2008

By Silvia Pfeiffer

The three months during which the Mozilla Foundation supported me with a grant to analyse the status of accessibility for the HTML5 <video> and <audio> elements, particularly with a view towards Ogg support, are now over. This post provides a summary of my findings and recommendations on how to progress video a11y in Firefox, as well as a list of actual progress already made. One of the biggest achievements is that there is now a mailing list at Xiph for video a11y and that this community will continue working on these issues, if slowly.

Background: Video Accessibility study

The study took a broad view on what constitutes "accessibility" for audio and video, including and going beyond means of providing access to people with disabilities. The study analysed means of generally attaching textual information to audio and video, and of giving search engines better access to these textual representations. The requirements document is available in the Mozilla wiki.

One particular aim of the study was to recommend means for delivering accessibility features inside the Ogg container format for the open Ogg Theora & Vorbis formats. Since Ogg Theora/Vorbis has been adopted by Firefox as the baseline codec for the audio and video elements, Ogg plays a major role when delivering accessibility features into the Web browser. This also goes beyond mere Web delivery of accessibility features and will have an effect on a larger number of media applications. This is important since the creation of accessibility content for audio and video formats cannot just happen for the Web. It needs to be supported by an ecosystem of applications around the audio and video content, including in particular authoring applications, but also off-line playback applications.

Results and Recommendations of the Video Accessibility study

First of all, one has to recognise that some accessibility data is in a text format (e.g. closed captions), while other accessibility data is actually supplementary media data that accompanies the core video or audio content. Examples of such non-text accessibility data are open captions (i.e. captions that are burnt into the video data), bitmap captions (i.e. graphics files that are blended on top of the video data), or audio descriptions (i.e. descriptive spoken annotations that are aligned with pauses in the original audio track). Most non-text accessibility data actually has a textual representation: closed captions, for example, come as text that can easily be turned on or off by the video player. Also, textual audio descriptions can be rendered by a screen reader or through a braille device.

1. Text vs Non-Text accessibility data

Textual accessibility data provides a lot more flexibility to the media player than accessibility data that is itself audio, video or graphics. One key advantage is that existing text parsing applications, such as screen readers, braille devices, Web search engines, or automated translators, can already deal with text data. In fact, the whole Web has been built around the ability to access and process text data and thus provides an infrastructure that textual representations of audio and video will fit into much more easily. Non-text accessibility data creates unnecessary extra complications, e.g. the need to run OCR to get an input for automated translators, or the need to render a video inside another video for sign language.

Recommendation 1: Textual accessibility data should be preferred over audio-, video-, or graphics-based accessibility data for audio and video on the Web.

2. Dealing with Non-Text accessibility data

It needs to be understood that sign language is, for the hearing impaired, often their first language, while transcribed spoken speech is often their first foreign language. Similarly, listening to natural speech in audio descriptions is a lot more relaxing than listening to screen readers or reading braille for the vision impaired. Also, there is already a large amount of audio and video accessibility data available that is not in textual format, e.g. the bitmaps used for captions on DVDs. It would be a shame to exclude such data from being used on the Web.

Considering these circumstances, it is critical to enable the association of non-text captions, sign language and audio annotations with video.

Existing sign language video or audio descriptions usually come as part of the video - either directly part of the given audio or video track by being burnt-in (e.g. picture-in-picture sign language video, or open captions), or as a separate track. QuickTime, MPEG2 or MPEG4 are container formats that are typically used to encapsulate such extra tracks with the original audio or video file. Ogg is capable of the same multi-track encapsulation and provides synchronisation between the tracks. The Ogg skeleton headers can further provide a clear indication of the available tracks inside an Ogg file, which can be used to enable media players to offer audio and video track selection to a user.

Recommendation 2: Non-text accessibility data, such as spoken audio descriptions, should be multiplexed into the Ogg container format, where media players (including Web browsers) will be able to identify them and offer them to users for decoding. It is further recommended that speech-only accessibility tracks should be encoded using Speex, while video should be encoded using Theora. With graphics, there is currently no clearly recommendable codec in Xiph - probably the best one to use is Ogg Kate or Theora, but OggMNG or OggSpots are options, too. Note that the Xiph community may develop and recommend more appropriate codecs for time-aligned graphics in the future.

Also please note that we recommend development of a server-side dynamic content adaptation scheme that allows the browser to request - on behalf of its user - adequate accessibility tracks together with the content. This is described in more detail in section 6 below.

For most of these categories, proprietary formats are being used by the companies that support them. Only closed captions and subtitles, as well as lyrics and linguistic transcripts have widely used open text formats.

The majority of subtitles and captions that are available on the Internet right now are provided in text files that are separate from the video or audio file, mostly in SubRip .srt or SubViewer .sub files (prepared by the fansubbing community). A few now come as XML files in SMIL, 3GPP TimedText .ttxt, W3C TimedText DFXP or CMML files. Song lyrics come in the Lyrics Displayer .lrc file format. Linguistic transcripts come in the Transcriber .trs file format. There are also several JavaScript libraries that create ticker text from txt files containing sequences of div elements. Many other time-aligned text file formats exist.

A typical media player such as mplayer or vlc plays back the subtitles for a video file by allowing the user to open a subtitle file in parallel to the video. The media player then synchronises the playback of the subtitles and renders them on top of the video. QuickTime and Windows Media Player do not have this functionality, but rely on subtitle tracks being delivered inside the audio or video file.
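To make the synchronisation step concrete, here is a minimal sketch of how a player might load a SubRip file and look up the text to show at a given playback time. It assumes the common srt layout (a counter line, a "start --> end" line, then the text); the function names are made up for illustration and do not come from mplayer or vlc.

```python
import re

# Matches an srt timestamp such as 00:01:02,500 (some files use "." instead of ",").
TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp):
    """Convert an srt timestamp to seconds as a float."""
    match = TIMESTAMP.search(stamp)
    if match is None:
        raise ValueError(f"not an srt timestamp: {stamp!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(text):
    """Return a list of (start, end, text) cues; malformed blocks are skipped."""
    cues = []
    normalised = text.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalised):
        lines = block.split("\n")
        if len(lines) < 2 or "-->" not in lines[1]:
            continue
        start, end = lines[1].split("-->", 1)
        cues.append((to_seconds(start), to_seconds(end), "\n".join(lines[2:])))
    return cues


def active_cues(cues, playback_time):
    """Return the cue texts that should be on screen at playback_time (seconds)."""
    return [text for start, end, text in cues if start <= playback_time < end]


SAMPLE = (
    "1\n00:00:01,000 --> 00:00:04,000\nHello\n\n"
    "2\n00:00:05,500 --> 00:00:07,000\nSecond line\nwraps\n"
)
```

A player would call active_cues on every time update of the video and redraw the overlay whenever the result changes.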

The ability to dynamically associate a time-aligned text file with an audio or video file at the moment of playback is very powerful. It has allowed the creation of a whole community of subtitling fans, the fansubbers, which provides accessibility to almost all movies and feature films.

To provide such functionality inside a Web browser, it is necessary to be able to associate out-of-band time-aligned text files with the video.

A proposal has been made to the WHATWG as a result of this study to support a "text" sub-element of the "video" and "audio" elements that allows the specification of such an external text file. There are some details at this blog post. An example looks as follows:

How this will actually work is as yet unclear. One approach is to render the out-of-band text files as HTML straight into the DOM of the current Web page. This raises security issues. Another approach is to render them into a kind of iframe, i.e. a separate security context. Also, SVG has a "text" element that serves a similar purpose, so the specification could be aligned with that.

There are experimental implementations of this proposal in JavaScript, one for srt through Wikipedia and one for dfxp through the W3C TimedText working group. Both map the respective out-of-band file into the DOM of the current Web page.

Further analysis, experiments, and a proper specification of this new element are required.

Recommendation 3: Develop a detailed specification and experimental implementations of how to handle out-of-band time-aligned text files for the "video" and "audio" elements of HTML5. In the WHATWG group, it was suggested that a first step may be a mapping of typical file formats to HTML (and CSS), which should include at minimum sub, srt, lrc, trs and dfxp.
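As a rough illustration of what a mapping from a simple cue format to HTML could look like, the sketch below turns a list of (start, end, text) cues into div elements. The class name and the data-start / data-end attributes are purely hypothetical placeholders; they are not part of the WHATWG proposal or of any format discussed here. The cue text is escaped before insertion, which is one simple way of addressing the security concern of injecting external text into the DOM.

```python
from html import escape


def cues_to_html(cues):
    """Map (start_seconds, end_seconds, text) cues to a string of div elements.

    Attribute names are illustrative assumptions, not a specified format.
    """
    parts = []
    for start, end, text in cues:
        # Escape each line so that markup inside the subtitle file is shown
        # as text instead of being interpreted by the browser.
        body = "<br>".join(escape(line) for line in text.split("\n"))
        parts.append(
            f'<div class="cue" data-start="{start:.3f}" data-end="{end:.3f}">'
            f"{body}</div>"
        )
    return "\n".join(parts)
```

Whether markup inside subtitle files should be escaped, filtered, or passed through is exactly the kind of detail a proper specification would have to settle.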

4. Developing a comprehensive Time-aligned Text Format

Seeing the number of text categories identified above, it would make sense to have only one time-aligned text file format that can produce them all and is flexible enough to allow even further new time-aligned text ideas to be realised.

Another way to think about this is that the Web browser will only want to deal with one representation for time-aligned text. All formats should map to this representation. In this respect it is like defining a "raw" format for time-aligned text, much as PCM is raw audio and RGB is raw video. We can either invent a new format or pick an existing format as this raw text format.

Of all the formats that were analysed, DFXP is the most flexible format and is capable of representing multiple categories. It still needs to be determined whether all of the given categories can be supported by DFXP, since DFXP was developed as an exchange format for subtitles and captions in particular. In any case, DFXP is not optimal as a Web-based time-aligned text format, since it redefines a lot of HTML, SMIL and CSS constructs rather than re-using existing HTML, JavaScript and CSS. The re-definition was necessary because DFXP was developed as a generic format for time-aligned text that needs to work in any situation, including outside the Web. However, for purposes of the Web (and for many Web-capable media players), reuse of HTML, CSS and JavaScript would lead to a format for which it would be much easier to provide implementations.

An idea for such a format is being discussed in the Xiph community. It runs under the name of TDHT (Timed Divs in HTML). A simple example file looks like this:

This format tries to incorporate what we learnt from analysing existing time-aligned text requirements, while making implementation easy in a Web-friendly environment, since it is a normal HTML file with minimal changes.

One concern that has been raised with TDHT is that HTML may be too comprehensive a format for the needs of time-aligned text. Time-aligned text is predominantly text that should be stylable by the Website that is using it. HTML, however, is notoriously bad at separating data from styling, unless CSS is used exclusively. It may be better to create a format that is more similar to RSS than to HTML in its simplicity.

TDHT and DFXP are solutions for out-of-band time-aligned text. Other options for such comprehensive time-aligned text solutions need to be analysed and experimented with.

Recommendation 4: WRT DFXP: analyse whether DFXP is capable of supporting all the identified text categories by creating a collection of test files. This will help understand the capabilities and limitations of DFXP better.

Recommendation 5: WRT TDHT: create a collection of test files that show how to support the different text categories. Based on the requirements determined from creating this collection, decide upon a comprehensive format (e.g. TDHT) and implement support for it.

5. Dealing with In-line Time-aligned Text Codecs

As is the case with non-text accessibility data, time-aligned text can also be encoded into media files. The advantage is that the text representation of the video is actually part of the video file and thus this metadata doesn't get lost when the files are shared further. Also, synchronisation between the media data and the text codecs is a given.

Ogg, QuickTime, FLV, MPEG4, and 3GPP are containers that are typically used to encapsulate such extra tracks with the original audio-visual file. All of these are capable of encapsulating 3GPP TimedText, which is a subpart of DFXP.

Ogg currently supports CMML and Kate as text codecs.

Just as out-of-band time-aligned text comes in multiple formats, so do in-line text codecs. This study motivated the Xiph community to define a framework for mapping any type of text codec into Ogg through the so-called OggText mapping. A first implementation of this format exists for encapsulating srt files into Ogg.

In a Web framework (such as Firefox) it doesn't make sense to support all possible in-line text codecs. Instead, it is useful to only support one comprehensive format, or at most a simple format (like srt) and a comprehensive format (like TDHT or Kate). Then, an existing format can be transcoded to one of these two formats for in-line encoding and Web delivery. For example, DFXP could be mapped to TDHT before encapsulation into Ogg.

A mapping of TDHT into Ogg is being defined in the Xiph community as OggTDHT. It is a simple extension of OggText and provides a generic in-line time-aligned text format. One has to be aware, though, that some resources that are required to render TDHT, such as images, JavaScript files, or CSS files, would not be encapsulated into an Ogg TDHT track, but would either continue to exist out-of-band or would need to be encapsulated in separate tracks.

An alternative way of encapsulating TDHT into Ogg is to map it to Ogg Kate, which encapsulates all required resources inside an Ogg container to make it a compact format. Ogg Kate is however not Web-friendly, and a display of Ogg Kate in a Web browser involves mapping it back to HTML.

When looking at which in-line time-aligned text codecs to support in Ogg, one should also look at what existing media players (outside the Web) are able to decode. There is actually support for decoding of Kate and CMML Ogg tracks in typical open source media players like mplayer or vlc. However, apart from test content, not much content has been produced that uses these formats to provide subtitles, captions, or other time-aligned text. The most uptake that either format has achieved is as an export format for Metavid. Mostly, the solution for using subtitles with Ogg has been to use srt files and have the media players synchronise them.

Recommendation 6: Implement support for srt as a simple format in Ogg, e.g. full OggSRT support. This includes an encoding and decoding library for OggText codecs, implementation of the mapping of srt into OggText, creation of a few srt and OggSRT example files, as well as mapping of srt to HTML, and display in Firefox. Further, there should be transcoding tools for other simple formats to srt. Details of some of these steps need to be worked out.

Recommendation 7: A decision on supporting a comprehensive format in Ogg is non-obvious, even though we currently tend towards TDHT. Further discussion about TDHT, Kate and DFXP as rich formats for in-line use in Ogg and for Firefox is necessary. The best start is to create example files for all of the different formats and categories to be able to compare these rich formats and get closer to a sensible decision.

Recommendation 8: One aim must be to handle in-line text codecs in the same way as out-of-band time-aligned text files once they hit the Web browser. In analogy to recommendation 3, this will also require a mapping to HTML (and CSS) for Ogg-encapsulated text codecs.

6. Content Selection and Adaptation for Time-aligned Text Codecs

The user of an audio or video resource that has a multitude of time-aligned text tracks will want to decide which tracks to display and which not. If, for example, there are subtitles in 10 different languages, a user will only want to see the subtitles of one language at a time. It makes sense to select this language based on the preferences that the user has stored in the browser preferences, but to give the user the opportunity to choose an alternative one interactively. Also, it makes sense to allow the author of the Web page or Ogg file to specify whether to turn the subtitles on or off by default.
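The selection logic described here can be sketched in a few lines. This is only an illustration under simple assumptions: languages are plain tags such as "en" or "de-AT", an interactive choice by the user always wins, and the author's default only decides whether anything is shown before the user has chosen. The function and parameter names are invented for this example.

```python
def choose_subtitle_language(available, preferred, enabled_by_default=True,
                             user_choice=None):
    """Pick the subtitle language to display, or None for no subtitles.

    available          -- language tags of the subtitle tracks on offer
    preferred          -- the user's browser language preferences, best first
    enabled_by_default -- the page or file author's default on/off setting
    user_choice        -- a language picked interactively, if any
    """
    # An explicit interactive choice overrides every default.
    if user_choice is not None:
        return user_choice if user_choice in available else None
    if not enabled_by_default:
        return None
    for lang in preferred:
        if lang in available:
            return lang
        # Fall back from a regional variant (de-AT) to the base language (de).
        base = lang.split("-")[0]
        for candidate in available:
            if candidate.split("-")[0] == base:
                return candidate
    return None
```

For example, with tracks in "en", "de" and "fr" and browser preferences of "de-AT" followed by "en", the sketch selects "de".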

The same functionality is required for all the other text categories, for example for the above mentioned sign language and audio annotation tracks.

To make this available, the Web browser needs to have the ability to gain a full picture of the tracks available for a video or audio resource before being able to choose the ones requested through the user's preferences. The text tracks may come from the video or audio resource itself, or through "text" tracks inside the "video" element. To gain information about the text tracks available inside an Ogg file, a browser can start downloading the first few bytes of the Ogg file, which contain the header data and therefore the description of the tracks available inside it. For Ogg, the skeleton track knows which text tracks are available. However, we need to add the category description into skeleton to allow per-category content selection.
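To illustrate the idea of learning about the tracks from the first few bytes, the following sketch walks the beginning-of-stream pages at the start of an Ogg file and guesses each track's codec from the magic bytes of its first packet. It is a simplification: it does not verify page checksums, it assumes each first packet fits in one page, and it does not read the skeleton track's contents (which is where a category description would have to live). The helper fake_bos_page exists only to build sample data for experimentation.

```python
import struct

# Magic bytes that start the first packet of some Xiph codecs.
CODEC_MAGICS = {
    b"\x01vorbis": "vorbis",
    b"\x80theora": "theora",
    b"Speex   ": "speex",
    b"fishead\x00": "skeleton",
    b"CMML\x00\x00\x00\x00": "cmml",
    b"\x80kate\x00\x00\x00": "kate",
}

# Fixed part of an Ogg page header: capture pattern, version, header type,
# granule position, serial number, page sequence, checksum, segment count.
PAGE_HEADER = struct.Struct("<4sBBqIIIB")
BOS_FLAG = 0x02  # "beginning of stream" bit in the header type field


def list_tracks(data):
    """Return [(serial_number, codec_name)] for the leading BOS pages in data."""
    tracks = []
    pos = 0
    while pos + PAGE_HEADER.size <= len(data):
        magic, _version, flags, _granule, serial, _seq, _crc, nsegs = (
            PAGE_HEADER.unpack_from(data, pos)
        )
        if magic != b"OggS" or not flags & BOS_FLAG:
            break  # all BOS pages come first, so we can stop here
        table_end = pos + PAGE_HEADER.size + nsegs
        if table_end > len(data):
            break  # segment table is truncated
        body_len = sum(data[pos + PAGE_HEADER.size:table_end])
        body = data[table_end:table_end + body_len]
        codec = next(
            (name for m, name in CODEC_MAGICS.items() if body.startswith(m)),
            "unknown",
        )
        tracks.append((serial, codec))
        pos = table_end + body_len
    return tracks


def fake_bos_page(serial, first_packet):
    """Build a single-segment BOS page for testing (checksum left at zero)."""
    header = PAGE_HEADER.pack(b"OggS", 0, BOS_FLAG, 0, serial, 0, 0, 1)
    return header + bytes([len(first_packet)]) + first_packet
```

A browser would only need to fetch enough of the file to cover these initial pages before deciding which tracks to decode.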

In a general audio or video file that is not Ogg, typically downloading the first few bytes will also provide the header data and thus the description of the available tracks. However, an encapsulated text track does not generally expose the category types of the text tracks. The Xiph community has developed ROE, the XML-based Rich Open multitrack media Exposition file format. ROE is essentially a textual description of the tracks that are composed inside an Ogg file and can be used generically to describe the tracks available inside any audio or video file. Thus, a Web browser could first download the ROE file associated with an audio or video resource to gain a full picture of all the available content tracks. It can then decide which ones to display.

Taking this a step further, if there is a mechanism for the browser to tell the server which tracks it needs, then the server could multiplex together a custom resource for that request. This can potentially be done through a media fragment URI as is currently under specification with the W3C Media Fragments Working Group. Such a content adaptation approach is particularly important for deaf and blind people, who can only consume specific tracks and should not be required to use bandwidth and pay for content that they cannot actually consume.
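A very rough sketch of such a request mechanism follows. The "track" query parameter and the track names are assumptions made up for this example; the actual syntax is still under discussion in the W3C Media Fragments Working Group and may look quite different. The first function is what a browser might do, the second what a server might do before multiplexing the requested tracks.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_track_request(base_url, track_names):
    """Browser side: ask for specific tracks via a hypothetical 'track' parameter."""
    query = urlencode([("track", name) for name in track_names])
    return f"{base_url}?{query}"


def tracks_to_multiplex(request_url, tracks_on_server):
    """Server side: keep only the stored tracks that the request asked for."""
    wanted = parse_qs(urlsplit(request_url).query).get("track", [])
    return [name for name in tracks_on_server if name in wanted]
```

A blind user's browser could, for instance, request only the audio and audio description tracks and never download the video track at all.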

Recommendation 9: Develop a specification and implementation of per-category content selection in the Web browser based on user preferences (i.e. which categories to show by default and which default language to use), and based on styling attributes provided by the Web page author (e.g. display: none).

Recommendation 11: Experiment with and develop a specification for server-side content adaptation for in-line, out-of-band, and mixed time-aligned text tracks.

The Road Ahead

The detailed analysis and recommendations above may be confusing for some. So, I've put together a diagram that explains the road ahead more succinctly.

(Aside: Explanation of words used:

multiplexing = creating a single file on disk that contains multiple tracks of content temporally interleaved for transfer and decoding purposes

demuxing = de-multiplexing

srt = subrip subtitle file format

tdht = timed divs html file format

PiP = picture-in-picture)

Basically, there are four different areas of work:

Phase 1: simple time-aligned text for video a11y

Phase 2: rich time-aligned text for video a11y

Phase 3: non-text audio and video a11y data

Phase 4: usability improvements for video a11y

The most pressing and simplest to solve is Phase 1. Here, we need to determine how the text will get into the browser, be processed by it, and be displayed. It includes an implementation and will produce the first a11y support for video through subtitles and captions.

Before being able to implement anything more complex, in Phase 2 we need to further analyse the different text categories identified, in particular through a collection of example files, so that requirements for the markup language can be identified and a choice of language can be made.

Phase 3 is a bit more exotic than the text-based work and will require a scheme to identify which tracks of multitrack video and audio files are a11y tracks, so they can be dealt with appropriately in the browser. Then, display defaults need to be implemented in the browser.

As for Phase 4 - when we get to these challenges, we will have come a long way and video a11y will finally be a solved issue in browsers. I can see that taking a few years though.

Achievements

During the study, some progress was already made towards the four areas of work, in particular towards the second and third.

Here is a list of the documents that have been created as a result of the video accessibility study:

Conclusions

The aim of the study was to "deliver a well thought-out recommendation for how to support the different types of accessibility needs for audio and video, including a specification of the actual file format(s) to use. At minimum, an Ogg solution for captions and subtitles is expected, and a means towards including sign language video tracks, audio annotations, transcripts, scripts, story boards, karaoke, metadata, and semantic annotations is proposed."

This aim of the grant proposal was achieved with great success. In fact, we have gone beyond this aim and created a community at Xiph to continue addressing these issues. And we have gone far beyond a recommendation by also creating initial specifications that address each of the four identified areas of work, in particular:

how to include subtitles into Web pages with a <video> element,

how to encapsulate time-aligned text into Ogg, and

a format for the richer time-aligned text data.

As a next step, Mozilla should look at implementing srt support in the Web browser, and in parallel further analyse the richer time-aligned text categories and their needs. In collaboration with the Xiph community, srt in Ogg should also be addressed.