Abstract

This specification extends HTMLMediaElement to allow
JavaScript to generate media streams for playback.
Allowing JavaScript to generate streams facilitates a variety of use
cases like adaptive streaming and time shifting live streams.

Status of This Document

This section describes the status of this document at the time of its publication. Other
documents may supersede this document. A list of current W3C publications and the latest revision
of this technical report can be found in the W3C technical reports
index at http://www.w3.org/TR/.

Publication as an Editor's Draft does not imply endorsement by the W3C Membership.
This is a draft document and may be updated, replaced or obsoleted by other documents at
any time. It is inappropriate to cite this document as other than work in progress.

1. Introduction

This specification allows JavaScript to dynamically construct media streams for <audio> and <video>.
It defines objects that allow JavaScript to pass media segments to an HTMLMediaElement.
A buffering model is also included to describe how the user agent should act when different media segments are
appended at different times. Byte stream specifications for WebM, ISO Base Media File Format, and MPEG-2 Transport Streams are given to specify the
expected format of byte streams used with these extensions.

1.1 Goals

This specification was designed with the following goals in mind:

Allow JavaScript to construct media streams independent of how the media is fetched.

Define a splicing and buffering model that facilitates use cases like adaptive streaming, ad-insertion, time-shifting, and video editing.

1.2 Definitions

Initialization Segment

A sequence of bytes that contains all of the initialization information required to decode a sequence of media segments. This includes codec initialization data, Track ID mappings for multiplexed segments, and timestamp offsets (e.g. edit lists).

Media Segment

A sequence of bytes that contains packetized and timestamped media data for a portion of the presentation timeline. Media segments are always associated with the most recently appended initialization segment.

Decoder Buffer

A buffer that holds initialization data and coded frames that will be decoded and rendered. This buffer may not exist in actual implementations, but it is intended to represent media data that will be decoded no matter what media segments are appended to update the SourceBuffer. This distinction is important when considering appends that happen close to the current playback position. See Track Buffer to Decoder Buffer transfer for details.

Random Access Point

A position in a media segment where decoding and continuous playback can begin without relying on any previous data in the segment. For video this tends to be the location of I-frames. In the case of audio, most audio frames can be treated as a random access point. Since video tracks tend to have a sparser distribution of random access points, the locations of the video random access points are usually considered the random access points for multiplexed streams.

MediaSource object URL

These URLs are the same as what the File API specification calls a Blob URI, except that anything in the definition of that feature that refers to File and Blob objects is hereby extended to also apply to MediaSource objects.

Track ID

A Track ID is a byte stream format specific identifier that marks sections of the byte stream as being part of a specific track. The Track ID in a track description identifies which sections of a media segment belong to that track.

Track Description

A byte stream format specific structure that provides the Track ID, codec configuration, and other metadata for a single track. Each track description inside a single initialization segment must have a unique Track ID.

Coded Frame

A unit of compressed media data that has a presentation timestamp and decode timestamp. The presentation timestamp indicates when the frame should be rendered. The decode timestamp indicates when the frame needs to be decoded. If frames can be decoded out of order, then the decode timestamp must be present in the bytestream. If frames cannot be decoded out of order and a decode timestamp is not present in the bytestream, then the decode timestamp is equal to the presentation timestamp.
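
The fallback rule for decode timestamps can be sketched as follows. This is a minimal illustration, not part of the normative text: the ParsedFrame shape, the canReorder flag, and the function name are hypothetical and exist only for this example.

```typescript
// Hypothetical shape for a coded frame as parsed from a byte stream.
interface ParsedFrame {
  presentationTimestamp: number; // seconds
  decodeTimestamp?: number;      // seconds; absent when the byte stream omits it
}

// Returns the decode timestamp to use for a frame.
// canReorder is an assumed flag: true when frames may be decoded out of order.
function effectiveDecodeTimestamp(frame: ParsedFrame, canReorder: boolean): number {
  if (frame.decodeTimestamp !== undefined) {
    return frame.decodeTimestamp;
  }
  if (canReorder) {
    // The definition requires an explicit decode timestamp in this case.
    throw new Error("decode timestamp must be present when frames can be decoded out of order");
  }
  // No reordering and no explicit value: decode time equals presentation time.
  return frame.presentationTimestamp;
}
```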

Append Sequence

A series of append() calls on a SourceBuffer without any intervening abort() calls. The
media segments in an append sequence must be adjacent and monotonically increasing in time without any gaps. An
abort() call starts a new append sequence, which allows media segments to be appended in non-monotonically
increasing order.
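
A web application could check the adjacency constraint before appending. The sketch below is illustrative only: the SegmentSpan shape, the function name, and the tolerance parameter are assumptions, and this specification does not define how close two segment boundaries must be to count as adjacent.

```typescript
// Hypothetical description of the time span covered by one media segment.
interface SegmentSpan {
  start: number; // seconds
  end: number;   // seconds
}

// Returns true when each segment starts where the previous one ended.
// tolerance is an assumed allowance (in seconds) for rounding at boundaries.
function isValidAppendSequence(segments: SegmentSpan[], tolerance: number = 0): boolean {
  for (let i = 0; i < segments.length; i++) {
    if (segments[i].end <= segments[i].start) {
      return false; // a segment must cover a positive amount of time
    }
    if (i > 0) {
      const gap = segments[i].start - segments[i - 1].end;
      if (Math.abs(gap) > tolerance) {
        return false; // a gap, an overlap, or a jump backwards in time
      }
    }
  }
  return true;
}
```

Under this sketch, a sequence that fails the check would need an abort() call before the out-of-order segment is appended.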

2. Source Buffer Model

The subsections below outline the buffering model for this specification. They describe the various rules and behaviors associated with appending
data to an individual SourceBuffer. At the highest level, the web application creates SourceBuffer objects and appends sequences of
initialization segments and media segments to update their state. The media element pulls media data out of the
MediaSource object, plays it, and fires events just as it would if a normal URL were passed to the src attribute.
The web application is expected to monitor media element events to determine when it needs to append more media segments.

2.1 Appending a Media Segment over a buffered region

There are several ways that media segments can overlap segments in the SourceBuffer. Behavior for the different overlap situations is described below. If more than one overlap applies, then the start overlap must be resolved first, followed by any complete overlaps, and finally the end overlap. If a segment contains multiple tracks, then the overlap is resolved independently for each track.

2.1.1 Complete Overlap

The figure above shows how the SourceBuffer is updated when a new media segment completely overlaps a segment in the buffer. In this case, the new segment completely replaces the old segment.

2.1.2 Start Overlap

The figure above shows how the SourceBuffer is updated when the beginning of a new media segment overlaps a segment in the buffer. In this case, the new segment replaces all the old media data in the overlapping region. Since media segments are constrained to starting with random access points, this provides a seamless transition between segments.

When an audio frame in the SourceBuffer overlaps with the start of the new media segment, special behavior is required. At a minimum, implementations must support dropping the old audio frame that overlaps the start of the new segment and inserting silence for the small gap that is created. Higher quality implementations may support crossfading or crosslapping between the overlapping audio frames. No matter which strategy is implemented, no gaps may be created in the ranges reported by buffered, and playback must never stall at the overlap.
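
The effect of a start overlap on a single track can be modelled on time spans alone. The sketch below is a simplification under stated assumptions: TimeSpan and resolveStartOverlap are hypothetical names, times are in seconds, and audio frame handling (silence insertion or crossfading) is not modelled.

```typescript
// Hypothetical time span, in seconds, covered by a segment in one track.
interface TimeSpan {
  start: number;
  end: number;
}

// Models a start overlap: the incoming segment begins inside the old segment
// and runs to or past its end. The old segment is trimmed at the point where
// the incoming segment starts, and the incoming segment is kept whole.
function resolveStartOverlap(old: TimeSpan, incoming: TimeSpan): TimeSpan[] {
  const isStartOverlap =
    incoming.start > old.start && incoming.start < old.end && incoming.end >= old.end;
  if (!isStartOverlap) {
    throw new Error("spans do not form a start overlap");
  }
  return [
    { start: old.start, end: incoming.start },   // surviving part of the old segment
    { start: incoming.start, end: incoming.end } // new segment, unchanged
  ];
}
```

For example, an old segment covering 0 to 10 and a new segment covering 6 to 15 leave old data for 0 to 6 followed by new data for 6 to 15, with no gap between them.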

2.1.3 End Overlap

The figure above shows how the SourceBuffer is updated when the end of a new media segment overlaps the beginning of a segment in the buffer. In this case, the SourceBuffer tries to keep as much of the old segment as possible. The amount saved depends on where the closest random access point, in the old segment, is to the end of the new segment. In the case of audio, if the gap is smaller than the size of an audio frame, then the SourceBuffer may render silence for this gap. This gap must not be reflected in buffered. The entire new segment must be added to the SourceBuffer, but it is up to the implementation to determine how much of the old segment data is retained.

Note

An implementation may keep old segment data before the end of the new segment to avoid creating a gap if it wishes. Doing so, however, can significantly increase implementation complexity and could cause delays at the splice point.

Note

The web application can use buffered to determine how much of the old segment was preserved.
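
One simple retention strategy for an end overlap is sketched below. It is only one possibility, since the amount of old data retained is left to the implementation. The function name and parameters are hypothetical, and the sketch assumes the implementation keeps old data only from the first random access point at or after the end of the new segment.

```typescript
// Returns the time from which old segment data is retained after an end
// overlap, or null when no old data can be kept.
// oldRandomAccessPoints: timestamps (seconds) of random access points in the old segment.
// oldEnd: end time of the old segment. newEnd: end time of the new segment.
function retainedOldStart(
  oldRandomAccessPoints: number[],
  oldEnd: number,
  newEnd: number
): number | null {
  const candidates = oldRandomAccessPoints
    .filter(t => t >= newEnd && t < oldEnd) // decodable without the replaced data
    .sort((a, b) => a - b);
  return candidates.length > 0 ? candidates[0] : null;
}
```

With an old segment that ends at 20 and has random access points at 10, 14, and 18, a new segment ending at 12 leaves old data from 14 onward under this strategy, which creates a gap between 12 and 14.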

2.2 Track Buffer to Decoder Buffer transfer

The track buffer represents the media that the web application would like the media element to play. The decoder buffer contains the data that will actually get decoded and rendered. In most cases the decoder buffer will simply contain a subset of the track buffer near the current playback position. These two buffers start to diverge when media segments that overlap or are very close to the current playback position are appended. Depending on the contents of the new media segment it may not be possible to switch to the new data immediately because there isn't a random access point close enough to the current playback position. The quality of the implementation determines how much data is considered "in the decoder buffer." It should transfer data to the decoder buffer as late as possible whilst maintaining seamless playback. Some implementations may be able to instantiate multiple decoders or decode the new data significantly faster than real-time to achieve a seamless splice immediately. Other implementations may delay until the next random access point before switching to the newly appended data. Notice that this difference in behavior is only observable when appending close to the current playback position. The decoder buffer represents a media subsegment, like a group of pictures or something with similar decode dependencies, that the media element commits to playing. This commitment may be influenced by a variety of things like limited decoding resources, hardware decode buffers, a jitter buffer, or the desire to limit implementation complexity.

Here is an example to help clarify the role of the decoder buffer. Say the current playback position has a timestamp of 8 and the media element has pulled frames with timestamps 9 and 10 into the decoder buffer. The web application then appends a higher quality media segment that starts with a random access point at timestamp 9. The track buffer will get updated with the higher quality data, but the media element won't be able to switch to this higher quality data until the next random access point at timestamp 20. This is because a frame for timestamp 9 is already in the decoder buffer. The decoder buffer represents the "point of no return" for decoding. If a seek occurs, the media element may choose to use the higher quality data, since a seek might imply flushing the decoder buffer and the user expects a break in playback.
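
The example above can be expressed as a small calculation. The sketch models only the conservative strategy of waiting for the next random access point; implementations that decode faster than real-time or run multiple decoders may switch earlier. The function and parameter names are hypothetical.

```typescript
// Returns the earliest timestamp at which playback could switch to newly
// appended data, or null if the new data has no usable random access point.
function earliestSwitchTimestamp(
  decoderBufferTimestamps: number[],
  newRandomAccessPoints: number[]
): number | null {
  // Everything up to the last frame already in the decoder buffer is committed.
  const committedUntil = decoderBufferTimestamps.reduce(
    (latest, t) => Math.max(latest, t),
    -Infinity
  );
  const usable = newRandomAccessPoints
    .filter(t => t > committedUntil)
    .sort((a, b) => a - b);
  return usable.length > 0 ? usable[0] : null;
}

// Frames 9 and 10 are in the decoder buffer; the new data has random access
// points at 9 and 20, so the switch happens at 20.
const exampleSwitchPoint = earliestSwitchTimestamp([9, 10], [9, 20]);
```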

3. MediaSource Object

The MediaSource object represents a source of media data for an HTMLMediaElement. It keeps track of the readyState for this source as well as a list of SourceBuffer objects that can be used to add media data to the presentation. MediaSource objects are created by the web application and then attached to an HTMLMediaElement. The application uses the SourceBuffer objects in sourceBuffers to add media data to this source. The HTMLMediaElement fetches this media data from the MediaSource object when it is needed during playback.

enum ReadyState {
"closed",
"open",
"ended"
};

Enumeration description

closed

Indicates the source is not currently attached to a media element.

open

The source has been opened by a media element and is ready for data to be appended to the SourceBuffer objects in sourceBuffers.

ended

The source is still attached to a media element, but endOfStream() has been called.

enum EndOfStreamError {
"network",
"decode"
};

Enumeration description

network

Terminates playback and signals that a network error has occurred.

Note

If the JavaScript fetching media data encounters a network error it should use this status code to terminate playback.

decode

Terminates playback and signals that a decoding error has occurred.

Note

If the JavaScript code fetching media data has problems parsing the data it should use this status code to terminate playback.

3.2 Methods

When this method is invoked, the user agent must run the following steps:

If type is null or an empty string then throw an INVALID_ACCESS_ERR exception and abort these steps.

If type contains a MIME type that is not supported or contains a MIME type that is not supported with the types specified for the other SourceBuffer objects in sourceBuffers, then throw a NOT_SUPPORTED_ERR exception and abort these steps.

If the user agent can't handle any more SourceBuffer objects then throw a QUOTA_EXCEEDED_ERR exception and abort these steps.

This allows the duration to properly reflect the end of the appended media segments. For example, if the duration was explicitly set to 10 seconds and only media segments for 0 to 5 seconds were appended before endOfStream() was called, then the duration will get updated to 5 seconds.

Notify the media element that it now has all of the media data. Playback should continue until all the media passed in via append() has been played.
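
The duration example above can be sketched as follows. The BufferedRange shape and the function name are hypothetical, and the handling of the case where nothing has been appended is an assumption of this sketch rather than something stated here.

```typescript
// Hypothetical time range, in seconds, of appended media data.
interface BufferedRange {
  start: number;
  end: number;
}

// Returns the duration after endOfStream(): the highest end time across
// the appended media segments.
function durationAfterEndOfStream(appended: BufferedRange[], currentDuration: number): number {
  if (appended.length === 0) {
    return currentDuration; // assumption: nothing appended leaves the duration unchanged
  }
  return appended.reduce((highest, range) => Math.max(highest, range.end), 0);
}
```

With a duration explicitly set to 10 seconds and media appended only for 0 to 5 seconds, the function returns 5.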

If true is returned from this method, it only indicates that the MediaSource implementation is capable of creating SourceBuffer objects for the specified MIME type. An addSourceBuffer() call may still fail if sufficient resources are not available to support the addition of a new SourceBuffer.

Note

This method returning true implies that HTMLMediaElement.canPlayType() will return "maybe" or "probably" since it does not make sense for a MediaSource to support a type the HTMLMediaElement knows it cannot play.

3.4.3 Seeking

Run the following steps as part of the "Wait until the user agent has established whether or not the media data for the new playback position is available, and, if it is, until it has decoded enough data to play back that position" step of the seek algorithm:

3.4.4 SourceBuffer Monitoring

The following steps are periodically run during playback to make sure that all of the SourceBuffer objects in activeSourceBuffers have enough data to ensure uninterrupted playback. Appending new segments and changes to activeSourceBuffers also cause these steps to run because they affect the conditions that trigger state transitions.

Playback may resume at this point if it was previously suspended by a transition to HAVE_CURRENT_DATA.

Abort these steps.

If buffered for at least one object in activeSourceBuffers contains a TimeRange that ends at the current playback position and does not have a range covering the time immediately after the current position:

Controls the offset applied to timestamps inside subsequent media segments that are appended to this SourceBuffer. The timestampOffset is initially set to 0 which indicates that no offset is being applied.

On getting, return the initial value or the last value that was successfully set.
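
The effect of timestampOffset on appended media can be illustrated with plain numbers. The sketch assumes that the offset is added to each timestamp in the segment; the function name is hypothetical.

```typescript
// Shifts the timestamps (seconds) found inside a media segment by the
// SourceBuffer's timestampOffset. An offset of 0 leaves them unchanged.
function applyTimestampOffset(timestamps: number[], timestampOffset: number = 0): number[] {
  return timestamps.map(t => t + timestampOffset);
}
```

A segment whose internal timestamps start at 0 can therefore be placed at 30 seconds on the presentation timeline by setting timestampOffset to 30 before appending it, which is convenient for use cases such as ad-insertion.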

4.4 Algorithms

4.4.1 Segment Parser Loop

All SourceBuffer objects have an internal append state variable that keeps track of the high-level segment parsing state. It is initially set to WAITING_FOR_SEGMENT and can transition to the following states as data is appended.

4.4.2 Initialization Segment Received

Each SourceBuffer object has an internal first initialization segment flag that tracks whether the first initialization segment has been appended. This flag is set to false when the SourceBuffer is created and updated by the algorithm below.

Let highest intersection end time be the highest end time in the intersection range.

If the highest intersection end time is less than the highest end time, then update the intersection range so that the highest intersection end time equals the highest end time.

Return the intersection range.

8. Byte Stream Formats

The bytes provided through append() for a SourceBuffer form a logical byte stream. The format of this byte stream depends on the media container format in use and is defined in a byte stream format specification. Byte stream format specifications based on WebM, the ISO Base Media File Format, and MPEG-2 Transport Streams are provided below. These format specifications are intended to be the authoritative source for how data from these containers is formatted and passed to a SourceBuffer. If a MediaSource implementation claims to support any of these container formats, then it must implement the corresponding byte stream format specification described below.

This section provides general requirements for all byte stream formats:

If a track is encrypted, provide any encryption parameters necessary to decrypt the content (except the encryption key itself)

For each track, provide all information necessary to decode and render the earliest random access point in the sequence of Media Segments and all subsequent samples in the sequence (in presentation time). This includes, in particular,

Information that determines the intrinsic width and height of the video (specifically, this requires either the picture or pixel aspect ratio, together with the encoded resolution).

Information necessary to convert the video decoder output to a format suitable for display

Identify the global presentation timestamp of every sample in the sequence of Media Segments

For example, if I1 is associated with M1, M2, M3 then the above must hold for all the combinations I1+M1, I1+M2, I1+M1+M2, I1+M2+M3, etc.

Byte stream specifications must at a minimum define constraints which ensure that the above requirements hold. Additional constraints may be defined, for example to simplify implementation.

8.1 WebM Byte Streams

This section defines segment formats for implementations that choose to support WebM.

The Cluster header may contain an "unknown" size value. If it does, then the end of the cluster is reached when another Cluster header or an element header that indicates the start of a WebM initialization segment is encountered.

Block & SimpleBlock elements must be in time increasing order consistent with the WebM spec.

If the most recent WebM initialization segment describes multiple tracks, then blocks from all the tracks must be interleaved in time increasing order. At least one block from all audio and video tracks must be present.

Cues or Chapters elements may follow a Cluster element. These elements must be accepted and ignored by the user agent.
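
The ordering and interleaving rules above could be checked on already-parsed blocks as sketched below. The ClusterBlock shape and the function name are hypothetical, and the sketch assumes that "time increasing order" permits blocks from different tracks to share a timecode.

```typescript
// Hypothetical summary of a Block or SimpleBlock after parsing.
interface ClusterBlock {
  trackNumber: number;
  timecode: number; // in the segment's timecode scale
}

// Returns true when blocks appear in non-decreasing time order and every
// expected track contributes at least one block.
function checkInterleaving(blocks: ClusterBlock[], expectedTracks: number[]): boolean {
  for (let i = 1; i < blocks.length; i++) {
    if (blocks[i].timecode < blocks[i - 1].timecode) {
      return false; // a block went backwards in time
    }
  }
  return expectedTracks.every(track => blocks.some(b => b.trackNumber === track));
}
```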

8.1.3 Random Access Points

A SimpleBlock element with its Keyframe flag set signals the location of a random access point for that track. Media segments containing multiple tracks are only considered a random access point if the first SimpleBlock for each track has its Keyframe flag set. The order of the multiplexed blocks must conform to the WebM Muxer Guidelines.
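
The multi-track rule can be sketched as a check over parsed blocks. The ParsedSimpleBlock shape and the function name are hypothetical; only the Keyframe flag of the first block of each track is examined.

```typescript
// Hypothetical summary of a SimpleBlock after parsing.
interface ParsedSimpleBlock {
  trackNumber: number;
  keyframe: boolean; // value of the Keyframe flag
}

// Returns true when the first SimpleBlock of every listed track has its
// Keyframe flag set, so the media segment starts with a random access point.
function startsWithRandomAccessPoint(blocks: ParsedSimpleBlock[], trackNumbers: number[]): boolean {
  return trackNumbers.every(track => {
    const first = blocks.filter(b => b.trackNumber === track)[0];
    return first !== undefined && first.keyframe;
  });
}
```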

8.2 ISO Base Media File Format Byte Streams

8.2.1 Initialization Segments

An ISO BMFF initialization segment must contain a single Movie Box (moov). The tracks in the Movie Box must not contain any samples (i.e. the entry_count in the stts, stsc and stco boxes must be set to zero). A Movie Extends (mvex) box must be contained in the
Movie Box to indicate that Movie Fragments are to be expected.

The initialization segment may contain Edit Boxes (edts) which provide a mapping of composition times for each track to the global presentation time.

8.2.2 Media Segments

An ISO BMFF media segment must contain a single Movie Fragment Box (moof) followed by one or more Media Data Boxes (mdat).

The following rules apply to ISO BMFF media segments:

The Movie Fragment Box must contain at least one Track Fragment Box (traf).

The Movie Fragment Box must use movie-fragment relative addressing and the flag default-base-is-moof must be set; absolute byte-offsets must not be used.

External data references must not be used.

If the Movie Fragment contains multiple tracks, the duration by which each track extends should be as close to equal as practical.

Each Track Fragment Box must contain a Track Fragment Decode Time Box (tfdt)

The first sample in each Track Fragment Run Box (trun) must indicate that the sample is a random access point.

The Media Data Boxes must contain all the samples referenced by the Track Fragment Run Boxes (trun) of the Movie Fragment Box.
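
The top-level box layout required above can be checked with a short box walker. This is a sketch under stated assumptions: it relies on the standard ISO BMFF box header (a 32-bit big-endian size followed by a four-character type, with size 1 indicating a 64-bit size and size 0 indicating a box that runs to the end of the data), it inspects only top-level boxes, and all function names are hypothetical.

```typescript
interface BoxInfo {
  type: string;
  offset: number;
  size: number;
}

// Reads a 32-bit big-endian unsigned integer.
function readUint32(bytes: Uint8Array, offset: number): number {
  return bytes[offset] * 16777216 + bytes[offset + 1] * 65536 +
    bytes[offset + 2] * 256 + bytes[offset + 3];
}

// Lists complete top-level boxes; stops at the first malformed or truncated box.
function listTopLevelBoxes(bytes: Uint8Array): BoxInfo[] {
  const boxes: BoxInfo[] = [];
  let offset = 0;
  while (offset + 8 <= bytes.length) {
    let size = readUint32(bytes, offset);
    const type = String.fromCharCode(
      bytes[offset + 4], bytes[offset + 5], bytes[offset + 6], bytes[offset + 7]);
    if (size === 0) {
      size = bytes.length - offset; // box extends to the end of the data
    } else if (size === 1) {
      if (offset + 16 > bytes.length) break;
      // 64-bit size stored after the type field
      size = readUint32(bytes, offset + 8) * 4294967296 + readUint32(bytes, offset + 12);
    }
    if (size < 8 || offset + size > bytes.length) break;
    boxes.push({ type: type, offset: offset, size: size });
    offset += size;
  }
  return boxes;
}

// True when the data is a single moof followed by one or more mdat boxes.
function looksLikeMediaSegment(bytes: Uint8Array): boolean {
  const types = listTopLevelBoxes(bytes).map(b => b.type);
  return types.length >= 2 && types[0] === "moof" &&
    types.slice(1).every(t => t === "mdat");
}
```

The check does not look inside the Movie Fragment Box, so rules such as the presence of traf and tfdt boxes would need further parsing.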

All MPEG-2 TS packets must have the transport_error_indicator set to 0
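
A check for this requirement is sketched below. It assumes the standard 188-byte transport stream packet whose first byte is the sync byte 0x47 and whose second byte carries transport_error_indicator in its most significant bit; the constant and function names are hypothetical.

```typescript
const TS_PACKET_SIZE = 188; // bytes per MPEG-2 TS packet
const TS_SYNC_BYTE = 0x47;  // first byte of every packet

// Returns true when the data is a whole number of packets, each starting
// with the sync byte and with transport_error_indicator equal to 0.
function allPacketsErrorFree(data: Uint8Array): boolean {
  if (data.length % TS_PACKET_SIZE !== 0) {
    return false;
  }
  for (let offset = 0; offset < data.length; offset += TS_PACKET_SIZE) {
    if (data[offset] !== TS_SYNC_BYTE) {
      return false;
    }
    const errorIndicator = (data[offset + 1] & 0x80) !== 0;
    if (errorIndicator) {
      return false;
    }
  }
  return true;
}
```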

8.3.2 Initialization Segments

An MPEG-2 TS initialization segment must contain a single PAT and a single PMT. Other SI, such as CAT, that are invariant for all subsequent
media segments, may be present.

8.3.3 Media Segments

The following rules apply to all MPEG-2 TS media segments:

PSI that is identical to the information in the initialization segment may appear repeatedly throughout the segment.

The media segment will not rely on initialization information in another media segment.

Media Segments must contain only complete PES packets and sections.

Each PES packet must consist of one or more complete access units.

Each PES packet must have a PTS timestamp.

PCR must be present in the Segment prior to the first byte of a TS packet payload containing media data.

The presentation duration of each media component within the Media Segment should be as close to equal as practical.

8.3.4 Random Access Points

A random access point as defined in this specification corresponds to Elementary Stream Random Access Point as defined in
ISO/IEC 13818-1.

8.3.5 Timestamp Rollover & Discontinuities

Timestamp rollovers and discontinuities must be handled by the UA. The UA's MPEG-2 TS implementation must maintain an internal offset
variable, MPEG2TS_timestampOffset, to keep track of the offset that needs to be applied to timestamps
that have rolled over or are part of a discontinuity. MPEG2TS_timestampOffset is initially set to 0 when the SourceBuffer is
created. This offset must be applied to the timestamps as part of the conversion process from MPEG-2 TS packets
into coded frames for the coded frame processing algorithm. This results in the coded frame timestamps
for a packet being computed by the following equations:
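
As a rough illustration of the bookkeeping involved, and not a statement of the normative equations, the sketch below unwraps rolled-over timestamps. It assumes 33-bit PTS/DTS values on a 90 kHz clock and treats any backwards jump of more than half the counter range as a rollover, which is a heuristic of this sketch. The class and its members are hypothetical, with offsetTicks standing in for MPEG2TS_timestampOffset, and general discontinuities are not handled.

```typescript
const PTS_MODULUS = Math.pow(2, 33); // PTS/DTS values are 33-bit counters
const PTS_CLOCK_HZ = 90000;          // PTS/DTS tick at 90 kHz

class TimestampUnwrapper {
  private offsetTicks = 0;                 // accumulated rollover offset, in ticks
  private lastTicks: number | null = null; // previous raw timestamp seen

  // Converts a raw 33-bit timestamp to seconds, adding the rollover offset.
  toSeconds(ticks: number): number {
    if (this.lastTicks !== null && this.lastTicks - ticks > PTS_MODULUS / 2) {
      // Large backwards jump: assume the counter wrapped around.
      this.offsetTicks += PTS_MODULUS;
    }
    this.lastTicks = ticks;
    return (ticks + this.offsetTicks) / PTS_CLOCK_HZ;
  }
}
```

A frame stamped one second before the counter wraps and a following frame stamped 0 are then reported one second apart rather than roughly 26.5 hours apart.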