RFC 7798

RTP Payload Format for High Efficiency Video Coding (HEVC)

Internet Engineering Task Force (IETF)                        Y.-K. Wang
Request for Comments: 7798                                      Qualcomm
Category: Standards Track                                     Y. Sanchez
ISSN: 2070-1721                                               T. Schierl
                                                          Fraunhofer HHI
                                                               S. Wenger
                                                                   Vidyo
                                                        M. M. Hannuksela
                                                                   Nokia
                                                              March 2016
Abstract
This memo describes an RTP payload format for the video coding
standard ITU-T Recommendation H.265 and ISO/IEC International
Standard 23008-2, both also known as High Efficiency Video Coding
(HEVC) and developed by the Joint Collaborative Team on Video Coding
(JCT-VC). The RTP payload format allows for packetization of one or
more Network Abstraction Layer (NAL) units in each RTP packet payload
as well as fragmentation of a NAL unit into multiple RTP packets.
Furthermore, it supports transmission of an HEVC bitstream over a
single stream as well as multiple RTP streams. When multiple RTP
streams are used, a single transport or multiple transports may be
utilized. The payload format has wide applicability in
videoconferencing, Internet video streaming, and high-bitrate
entertainment-quality video, among others.
Status of This Memo
This is an Internet Standards Track document.
This document is a product of the Internet Engineering Task Force
(IETF). It represents the consensus of the IETF community. It has
received public review and has been approved for publication by the
Internet Engineering Steering Group (IESG). Further information on
Internet Standards is available in Section 2 of RFC 5741.
Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
http://www.rfc-editor.org/info/rfc7798.

7.2.1. Mapping of Payload Type Parameters to SDP
7.2.2. Usage with SDP Offer/Answer Model
7.2.3. Usage in Declarative Session Descriptions
7.2.4. Considerations for Parameter Sets
7.2.5. Dependency Signaling in Multi-Stream Mode
8. Use with Feedback Messages
8.1. Picture Loss Indication (PLI)
8.2. Slice Loss Indication (SLI)
8.3. Reference Picture Selection Indication (RPSI)
8.4. Full Intra Request (FIR)
9. Security Considerations
10. Congestion Control
11. IANA Considerations
12. References
12.1. Normative References
12.2. Informative References
Acknowledgments
Authors' Addresses

1. Introduction
The High Efficiency Video Coding specification, formally published as
both ITU-T Recommendation H.265 [HEVC] and ISO/IEC International
Standard 23008-2 [ISO23008-2], was ratified by the ITU-T in April
2013; reportedly, it provides significant coding efficiency gains
over H.264 [H.264].
This memo describes an RTP payload format for HEVC. It shares its
basic design with the RTP payload formats of [RFC6184] and [RFC6190].
With respect to design philosophy, security, congestion control, and
overall implementation complexity, it has similar properties to those
earlier payload format specifications. This is a conscious choice,
as at least RFC 6184 is widely deployed and generally known in the
relevant implementer communities. Mechanisms from RFC 6190 were
incorporated as HEVC version 1 supports temporal scalability.
In order to help the overlapping implementer community, frequently
only the differences between RFCs 6184 and 6190 and the HEVC payload
format are highlighted in non-normative, explanatory parts of this
memo. Basic familiarity with both specifications is assumed for
those parts. However, the normative parts of this memo do not
require study of RFCs 6184 or 6190.

1.1. Overview of the HEVC Codec
H.264 and HEVC share a similar hybrid video codec design. In this
memo, we provide a very brief overview of those features of HEVC that
are, in some form, addressed by the payload format specified herein.
Implementers have to read, understand, and apply the ITU-T/ISO/IEC
specifications pertaining to HEVC to arrive at interoperable, well-
performing implementations. Implementers should consider testing
their design (including the interworking between the payload format
implementation and the core video codec) using the tools provided by
ITU-T/ISO/IEC, for example, conformance bitstreams as specified in
[H.265.1]. Not doing so has historically led to systems that perform
badly and that are not secure.
Conceptually, both H.264 and HEVC include a Video Coding Layer (VCL),
which is often used to refer to the coding-tool features, and a
Network Abstraction Layer (NAL), which is often used to refer to the
systems and transport interface aspects of the codecs.
1.1.1. Coding-Tool Features
Similar to earlier hybrid-video-coding-based standards, including
H.264, the following basic video coding design is employed by HEVC.
A prediction signal is first formed by either intra- or motion-
compensated prediction, and the residual (the difference between the
original and the prediction) is then coded. The gains in coding
efficiency are achieved by redesigning and improving almost all parts
of the codec over earlier designs. In addition, HEVC includes
several tools to make the implementation on parallel architectures
easier. Below is a summary of HEVC coding-tool features.
Quad-tree block and transform structure
One of the major tools that contributes significantly to the coding
efficiency of HEVC is the use of flexible coding blocks and
transforms, which are defined in a hierarchical quad-tree manner.
Unlike H.264, where the basic coding block is a macroblock of fixed-
size 16x16, HEVC defines a Coding Tree Unit (CTU) of a maximum size
of 64x64. Each CTU can be divided into smaller units in a
hierarchical quad-tree manner and can represent smaller blocks down
to size 4x4. Similarly, the transforms used in HEVC can have
different sizes, starting from 4x4 and going up to 32x32. Utilizing
large blocks and transforms contributes to the major gain of HEVC,
especially at high resolutions.
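
The hierarchical quad-tree division of a CTU can be illustrated with a short, non-normative sketch. It assumes a hypothetical should_split callback that stands in for the split decisions carried in the bitstream; the name split_quadtree and the (x, y, size) block tuples are illustrative only and are not taken from [HEVC].

```python
from typing import Callable, List, Tuple

Block = Tuple[int, int, int]  # (x, y, size) of a square block, in samples


def split_quadtree(x: int, y: int, size: int, min_size: int,
                   should_split: Callable[[int, int, int], bool]) -> List[Block]:
    """Recursively split a square block into four equal quadrants.

    Leaves are returned in z-scan order: top-left, top-right,
    bottom-left, bottom-right at every level of the tree.
    """
    # Stop at the minimum block size or when the (hypothetical)
    # split decision says this block is a leaf.
    if size <= min_size or not should_split(x, y, size):
        return [(x, y, size)]
    half = size // 2
    leaves: List[Block] = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += split_quadtree(x + dx, y + dy, half, min_size,
                                     should_split)
    return leaves
```

For example, splitting a 64x64 CTU exactly once yields four 32x32 blocks, whereas splitting at every level down to 8x8 yields 64 leaves.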

Entropy coding
HEVC uses a single entropy-coding engine, which is based on Context
Adaptive Binary Arithmetic Coding (CABAC) [CABAC], whereas H.264 uses
two distinct entropy coding engines. CABAC in HEVC shares many
similarities with CABAC of H.264, but contains several improvements.
Those include improvements in coding efficiency and lowered
implementation complexity, especially for parallel architectures.
In-loop filtering
H.264 includes an in-loop adaptive deblocking filter, where the
blocking artifacts around the transform edges in the reconstructed
picture are smoothed to improve the picture quality and compression
efficiency. In HEVC, a similar deblocking filter is employed but
with somewhat lower complexity. In addition, pictures undergo a
subsequent filtering operation called Sample Adaptive Offset (SAO),
which is a new design element in HEVC. SAO basically adds a pixel-
level offset in an adaptive manner and usually acts as a de-ringing
filter. It is observed that SAO improves the picture quality,
especially around sharp edges, contributing substantially to visual
quality improvements of HEVC.
Motion prediction and coding
There have been a number of improvements in this area that are
summarized as follows. The first category is motion merge and
Advanced Motion Vector Prediction (AMVP) modes. The motion
information of a prediction block can be inferred from the spatially
or temporally neighboring blocks. This is similar to the DIRECT mode
in H.264 but includes new aspects to incorporate the flexible quad-
tree structure and methods to improve the parallel implementations.
In addition, the motion vector predictor can be signaled for improved
efficiency. The second category is high-precision interpolation.
The interpolation filter length is increased to 8-tap from 6-tap,
which improves the coding efficiency but also comes with increased
complexity. In addition, the interpolation filter is defined with
higher precision without any intermediate rounding operations to
further improve the coding efficiency.
Intra prediction and intra-coding
Compared to 8 intra prediction modes in H.264, HEVC supports angular
intra prediction with 33 directions. This increased flexibility
improves both objective coding efficiency and visual quality as the
edges can be better predicted and ringing artifacts around the edges
can be reduced. In addition, the reference samples are adaptively
smoothed based on the prediction direction. To avoid contouring
artifacts, a new interpolative prediction generation is included to
improve the visual quality. Furthermore, the Discrete Sine Transform
(DST) is utilized instead of the traditional Discrete Cosine
Transform (DCT) for 4x4 intra-transform blocks.
Other coding-tool features
HEVC includes some tools for lossless coding and efficient screen-
content coding, such as skipping the transform for certain blocks.
These tools are particularly useful, for example, when streaming the
user interface of a mobile device to a large display.
1.1.2. Systems and Transport Interfaces
HEVC inherited the basic systems and transport interfaces designs
from H.264. These include the NAL-unit-based syntax structure, the
hierarchical syntax and data unit structure, the Supplemental
Enhancement Information (SEI) message mechanism, and the video
buffering model based on the Hypothetical Reference Decoder (HRD).
The hierarchical syntax and data unit structure consists of sequence-
level parameter sets, multi-picture-level or picture-level parameter
sets, slice-level header parameters, and lower-level parameters. In
the following, a list of differences in these aspects compared to
H.264 is summarized.
Video parameter set
A new type of parameter set, called Video Parameter Set (VPS), was
introduced. For the first (2013) version of [HEVC], the VPS NAL unit
is required to be available prior to its activation, while the
information contained in the VPS is not necessary for operation of
the decoding process. For future HEVC extensions, such as the 3D or
scalable extensions, the VPS is expected to include information
necessary for operation of the decoding process, e.g., decoding
dependency or information for reference picture set construction of
enhancement layers. The VPS provides a "big picture" of a bitstream,
including what types of operation points are provided, the profile,
tier, and level of the operation points, and some other high-level
properties of the bitstream that can be used as the basis for session
negotiation and content selection, etc. (see Section 7.1).
Profile, tier, and level
The profile, tier, and level syntax structure that can be included in
both the VPS and Sequence Parameter Set (SPS) includes 12 bytes of
data to describe the entire bitstream (including all temporally
scalable layers, which are referred to as sub-layers in the HEVC
specification), and can optionally include more profile, tier, and
level information pertaining to individual temporally scalable
layers. The profile indicator shows the "best viewed as" profile
when the bitstream conforms to multiple profiles, similar to the
major brand concept in the ISO Base Media File Format (ISOBMFF)
[IS014496-12] [IS015444-12] and file formats derived based on
ISOBMFF, such as the 3GPP file format [3GPPFF]. The profile, tier,
and level syntax structure also includes indications such as 1)
whether the bitstream is free of frame-packed content, 2) whether the
bitstream is free of interlaced source content, and 3) whether the
bitstream is free of field pictures. When the answer is yes for both
2) and 3), the bitstream contains only frame pictures of progressive
source. Based on these indications, clients/players without support
of post-processing functionalities for the handling of frame-packed,
interlaced source content or field pictures can reject those
bitstreams that contain such pictures.
Bitstream and elementary stream
HEVC includes a definition of an elementary stream, which is new
compared to H.264. An elementary stream consists of a sequence of
one or more bitstreams. An elementary stream that consists of two or
more bitstreams has typically been formed by splicing together two or
more bitstreams (or parts thereof). When an elementary stream
contains more than one bitstream, the last NAL unit of the last
access unit of a bitstream (except the last bitstream in the
elementary stream) must be an end of bitstream NAL unit, and the
first access unit of the subsequent bitstream must be an Intra-Random
Access Point (IRAP) access unit. This IRAP access unit may be a
Clean Random Access (CRA), Broken Link Access (BLA), or Instantaneous
Decoding Refresh (IDR) access unit.
Random access support
HEVC includes signaling in the NAL unit header, through NAL unit
types, of IRAP pictures beyond IDR pictures. Three types of IRAP
pictures, namely IDR, CRA, and BLA pictures, are supported: IDR
pictures are conventionally referred to as closed group-of-pictures
(closed-GOP) random access points whereas CRA and BLA pictures are
conventionally referred to as open-GOP random access points. BLA
pictures usually originate from the splicing of two bitstreams or
parts thereof at a CRA picture, e.g., during stream switching. To
enable better systems usage of IRAP pictures, altogether six
different NAL unit types are defined to signal the properties of the
IRAP pictures,
which can be used to better match the stream access point types as
defined in the ISOBMFF [IS014496-12] [IS015444-12], which are
utilized for random access support in both 3GP-DASH [3GPDASH] and
MPEG DASH [MPEGDASH]. Pictures following an IRAP picture in decoding
order and preceding the IRAP picture in output order are referred to
as leading pictures associated with the IRAP picture. There are two
types of leading pictures: Random Access Decodable Leading (RADL)
pictures and Random Access Skipped Leading (RASL) pictures. RADL
pictures are decodable when decoding starts at the associated IRAP
picture; RASL pictures are not decodable when decoding starts at the
associated IRAP picture and are usually discarded.
HEVC provides mechanisms to enable specifying the conformance of a
bitstream wherein the originally present RASL pictures have been
discarded. Consequently, system components can discard RASL
pictures, when needed, without worrying about causing the bitstream
to become non-compliant.
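
As a non-normative illustration of RASL discarding, the following sketch removes the RASL pictures associated with the IRAP picture at which decoding starts. The numeric nal_unit_type values (RASL_N = 8, RASL_R = 9, IRAP range 16 to 23) are taken from Table 7-1 of [HEVC] and are not defined in this memo; the function name and the per-picture input list are assumptions made for brevity.

```python
from typing import List

# Assumed from Table 7-1 of [HEVC]; not defined by this memo.
RASL_TYPES = {8, 9}           # RASL_N, RASL_R
IRAP_FIRST, IRAP_LAST = 16, 23  # BLA, IDR, CRA, and reserved IRAP types


def is_irap(nal_type: int) -> bool:
    return IRAP_FIRST <= nal_type <= IRAP_LAST


def discard_rasl_after_random_access(picture_types: List[int]) -> List[int]:
    """Drop RASL pictures associated with the first IRAP picture.

    picture_types holds one nal_unit_type per picture, in decoding
    order, starting at the IRAP picture where decoding begins.
    """
    if not picture_types or not is_irap(picture_types[0]):
        raise ValueError("decoding must start at an IRAP picture")
    kept = [picture_types[0]]
    associated_with_first = True
    for nal_type in picture_types[1:]:
        if is_irap(nal_type):
            # A later IRAP picture ends the set of pictures
            # associated with the first one.
            associated_with_first = False
        if associated_with_first and nal_type in RASL_TYPES:
            continue
        kept.append(nal_type)
    return kept
```

RASL pictures associated with a later IRAP picture are kept, because their references are available once decoding has been running since the earlier access point.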
Temporal scalability support
HEVC includes an improved support of temporal scalability, by
inclusion of the signaling of TemporalId in the NAL unit header, the
restriction that pictures of a particular temporal sub-layer cannot
be used for inter prediction reference by pictures of a lower
temporal sub-layer, the sub-bitstream extraction process, and the
requirement that each sub-bitstream extraction output be a conforming
bitstream. Media-Aware Network Elements (MANEs) can utilize the
TemporalId in the NAL unit header for stream adaptation purposes
based on temporal scalability.
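
A minimal sketch of how a MANE might thin a stream using TemporalId is given below. It only inspects the low three bits of the second NAL unit header byte (nuh_temporal_id_plus1) and is a simplification: the normative sub-bitstream extraction process in [HEVC] contains further steps that are not modeled here. The function names are illustrative.

```python
from typing import Iterable, List


def temporal_id(nal_unit: bytes) -> int:
    """TemporalId = nuh_temporal_id_plus1 - 1 (low 3 bits of byte 2)."""
    return (nal_unit[1] & 0x07) - 1


def drop_higher_sublayers(nal_units: Iterable[bytes],
                          max_tid: int) -> List[bytes]:
    """Keep only NAL units whose TemporalId does not exceed max_tid."""
    return [n for n in nal_units if temporal_id(n) <= max_tid]
```

Because pictures of a lower temporal sub-layer never reference pictures of a higher one, the NAL units that remain stay decodable.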
Temporal sub-layer switching support
HEVC specifies, through NAL unit types present in the NAL unit
header, the signaling of Temporal Sub-layer Access (TSA) and Step-
wise Temporal Sub-layer Access (STSA). A TSA picture and pictures
following the TSA picture in decoding order do not use pictures prior
to the TSA picture in decoding order with TemporalId greater than or
equal to that of the TSA picture for inter prediction reference. A
TSA picture enables up-switching, at the TSA picture, to the sub-
layer containing the TSA picture or any higher sub-layer, from the
immediately lower sub-layer. An STSA picture does not use pictures
with the same TemporalId as the STSA picture for inter prediction
reference. Pictures following an STSA picture in decoding order with
the same TemporalId as the STSA picture do not use pictures prior to
the STSA picture in decoding order with the same TemporalId as the
STSA picture for inter prediction reference. An STSA picture enables
up-switching, at the STSA picture, to the sub-layer containing the
STSA picture, from the immediately lower sub-layer.
Sub-layer reference or non-reference pictures
The concept and signaling of reference/non-reference pictures in HEVC
are different from H.264. In H.264, if a picture may be used by any
other picture for inter prediction reference, it is a reference
picture; otherwise, it is a non-reference picture, and this is
signaled by two bits in the NAL unit header. In HEVC, a picture is
called a reference picture only when it is marked as "used for
reference". In addition, the concept of sub-layer reference picture
was introduced. If a picture may be used by another picture
with the same TemporalId for inter prediction reference, it is a sub-
layer reference picture; otherwise, it is a sub-layer non-reference
picture. Whether a picture is a sub-layer reference picture or sub-
layer non-reference picture is signaled through NAL unit type values.
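
The NAL unit type signaling can be sketched as follows. The rule used here (even nal_unit_type values from 0 to 14 denote sub-layer non-reference pictures, e.g., TRAIL_N = 0 and RASL_N = 8) is taken from Table 7-1 of [HEVC], not from this memo, and the function name is illustrative.

```python
def is_sublayer_non_reference(nal_type: int) -> bool:
    """True for the *_N picture types of Table 7-1 of [HEVC].

    Assumption: values 0, 2, ..., 14 are sub-layer non-reference
    pictures; all other VCL types (including every IRAP type, 16 and
    above) are sub-layer reference pictures.
    """
    return 0 <= nal_type <= 14 and nal_type % 2 == 0
```

A sender or MANE under congestion could, for instance, prefer to drop such pictures of the highest sub-layer first, since no other picture of that sub-layer depends on them.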
Extensibility
Besides the TemporalId in the NAL unit header, HEVC also includes the
signaling of a six-bit layer ID in the NAL unit header, which must be
equal to 0 for a single-layer bitstream. Extension mechanisms have
been included in the VPS, SPS, Picture Parameter Set (PPS), SEI NAL
unit, slice headers, and so on. All these extension mechanisms
enable future extensions in a backward-compatible manner, such that
bitstreams encoded according to potential future HEVC extensions can
be fed to then-legacy decoders (e.g., HEVC version 1 decoders), and
the then-legacy decoders can decode and output the base-layer
bitstream.
Bitstream extraction
HEVC includes a bitstream-extraction process as an integral part of
the overall decoding process. The bitstream extraction process is
used in the process of bitstream conformance tests, which is part of
the HRD buffering model.
Reference picture management
The reference picture management of HEVC, including reference picture
marking and removal from the Decoded Picture Buffer (DPB) as well as
Reference Picture List Construction (RPLC), differs from that of
H.264. Instead of the reference picture marking mechanism based on a
sliding window plus adaptive Memory Management Control Operation
(MMCO) described in H.264, HEVC specifies a reference picture
management and marking mechanism based on Reference Picture Set
(RPS), and the RPLC is consequently based on the RPS mechanism. An
RPS consists of a set of reference pictures associated with a
picture, consisting of all reference pictures that are prior to the
associated picture in decoding order, that may be used for inter
prediction of the associated picture or any picture following the
associated picture in decoding order. The reference picture set
consists of five lists of reference pictures: RefPicSetStCurrBefore,
RefPicSetStCurrAfter, RefPicSetStFoll, RefPicSetLtCurr, and
RefPicSetLtFoll. RefPicSetStCurrBefore, RefPicSetStCurrAfter, and
RefPicSetLtCurr contain all reference pictures that may be used in
inter prediction of the current picture and that may be used in inter
prediction of one or more of the pictures following the current
picture in decoding order. RefPicSetStFoll and RefPicSetLtFoll
consist of all reference pictures that are not used in inter
prediction of the current picture but may be used in inter prediction
of one or more of the pictures following the current picture in
decoding order. RPS provides an "intra-coded" signaling of the DPB
status, instead of an "inter-coded" signaling, mainly for improved
error resilience. The RPLC process in HEVC is based on the RPS, by
signaling an index to an RPS subset for each reference index; this
process is simpler than the RPLC process in H.264.
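
The five RPS lists and the "intra-coded" nature of the signaling can be sketched as a small data structure. This is a non-normative illustration: pictures are identified by integer picture order count values, the class and function names are hypothetical, and output-related DPB behavior is not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ReferencePictureSet:
    st_curr_before: List[int] = field(default_factory=list)
    st_curr_after: List[int] = field(default_factory=list)
    st_foll: List[int] = field(default_factory=list)
    lt_curr: List[int] = field(default_factory=list)
    lt_foll: List[int] = field(default_factory=list)

    def usable_by_current(self) -> Set[int]:
        """Pictures that may be referenced by the current picture."""
        return (set(self.st_curr_before) | set(self.st_curr_after)
                | set(self.lt_curr))

    def all_pictures(self) -> Set[int]:
        """Every picture kept as a reference, now or for later pictures."""
        return (self.usable_by_current() | set(self.st_foll)
                | set(self.lt_foll))


def still_marked_as_reference(dpb: Set[int],
                              rps: ReferencePictureSet) -> Set[int]:
    """Pictures in the DPB that remain references under this RPS."""
    return dpb & rps.all_pictures()


def missing_references(dpb: Set[int], rps: ReferencePictureSet) -> Set[int]:
    """Pictures the current picture may need but the DPB lacks."""
    return rps.usable_by_current() - dpb
```

Because each picture carries the complete set, a receiver can detect a lost reference picture by comparing the RPS against its DPB, without depending on earlier marking commands having been received.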
Ultra-low delay support
HEVC specifies a sub-picture-level HRD operation, for support of the
so-called ultra-low delay. The mechanism specifies a standard-
compliant way to enable delay reduction below a one-picture interval.
Coded Picture Buffer (CPB) and DPB parameters at the sub-picture
level may be signaled, and utilization of this information for the
derivation of CPB timing (wherein the CPB removal time corresponds to
decoding time) and DPB output timing (display time) is specified.
Decoders are allowed to operate the HRD at the conventional access-
unit level, even when the sub-picture-level HRD parameters are
present.
New SEI messages
HEVC inherits many H.264 SEI messages with changes in syntax and/or
semantics making them applicable to HEVC. Additionally, there are a
few new SEI messages reviewed briefly in the following paragraphs.
The display orientation SEI message informs the decoder of a
transformation that is recommended to be applied to the cropped
decoded picture prior to display, such that the pictures can be
properly displayed, e.g., in an upside-up manner.
The structure of pictures SEI message provides information on the NAL
unit types, picture-order count values, and prediction dependencies
of a sequence of pictures. The SEI message can be used, for example,
for concluding what impact a lost picture has on other pictures.
The decoded picture hash SEI message provides a checksum derived from
the sample values of a decoded picture. It can be used for detecting
whether a picture was correctly received and decoded.

The active parameter sets SEI message includes the IDs of the active
video parameter set and the active sequence parameter set and can be
used to activate VPSs and SPSs. In addition, the SEI message
includes the following indications: 1) An indication of whether "full
random accessibility" is supported (when supported, all parameter
sets needed for decoding of the remainder of the bitstream when
random accessing from the beginning of the current CVS by completely
discarding all access units earlier in decoding order are present in
the remaining bitstream, and all coded pictures in the remaining
bitstream can be correctly decoded); 2) An indication of whether
there is no parameter set within the current CVS that updates another
parameter set of the same type preceding in decoding order. An
update of a parameter set refers to the use of the same parameter set
ID but with some other parameters changed. If this property is true
for all CVSs in the bitstream, then all parameter sets can be sent
out-of-band before session start.
The decoding unit information SEI message provides information
regarding coded picture buffer removal delay for a decoding unit.
The message can be used in very-low-delay buffering operations.
The region refresh information SEI message can be used together with
the recovery point SEI message (present in both H.264 and HEVC) for
improved support of gradual decoding refresh. This supports random
access from inter-coded pictures, wherein complete pictures can be
correctly decoded or recovered after an indicated number of pictures
in output/display order.
1.1.3. Parallel Processing Support
The reportedly significantly higher encoding computational demand of
HEVC over H.264, in conjunction with the ever-increasing video
resolution (both spatially and temporally) required by the market,
led to the adoption of VCL coding tools specifically targeted to
allow for parallelization on the sub-picture level. That is,
parallelization occurs, at the minimum, at the granularity of an
integer number of CTUs. The targets for this type of high-level
parallelization are multicore CPUs and DSPs as well as multiprocessor
systems. In a system design, to be useful, these tools require
signaling support, which is provided in Section 7 of this memo. This
section provides a brief overview of the tools available in [HEVC].
Many of the tools incorporated in HEVC were designed keeping in mind
the potential parallel implementations in multicore/multiprocessor
architectures. Specifically, for parallelization, four picture
partition strategies, as described below, are available.

Slices are segments of the bitstream that can be reconstructed
independently from other slices within the same picture (though there
may still be interdependencies through loop filtering operations).
Slices are the only tool that can be used for parallelization that is
also available, in virtually identical form, in H.264.
Parallelization based on slices does not require much inter-processor
or inter-core communication (except for inter-processor or inter-core
data sharing for motion compensation when decoding a predictively
coded picture, which is typically much heavier than inter-processor
or inter-core data sharing due to in-picture prediction), as slices
are designed to be independently decodable. However, for the same
reason, slices can require some coding overhead. Further, slices (in
contrast to some of the other tools mentioned below) also serve as
the key mechanism for bitstream partitioning to match Maximum
Transfer Unit (MTU) size requirements, due to the in-picture
independence of slices and the fact that each regular slice is
encapsulated in its own NAL unit. In many cases, the goal of
parallelization and the goal of MTU size matching can place
contradicting demands on the slice layout in a picture. The
realization of this situation led to the development of the more
advanced tools mentioned below.
Dependent slice segments allow for fragmentation of a coded slice
into fragments at CTU boundaries without breaking any in-picture
prediction mechanisms. They are complementary to the fragmentation
mechanism described in this memo in that they need the cooperation of
the encoder. As a dependent slice segment necessarily contains an
integer number of CTUs, a decoder using multiple cores operating on
CTUs can process a dependent slice segment without communicating
parts of the slice segment's bitstream to other cores.
Fragmentation, as specified in this memo, in contrast, does not
guarantee that a fragment contains an integer number of CTUs.
In Wavefront Parallel Processing (WPP), the picture is partitioned
into rows of CTUs. Entropy decoding and prediction are allowed to
use data from CTUs in other partitions. Parallel processing is
possible through parallel decoding of CTU rows, where the start of
the decoding of a row is delayed by two CTUs, so as to ensure that
data related to a CTU above and to the right of the subject CTU is
available before the subject CTU is decoded. Using this
staggered start (which appears like a wavefront when represented
graphically), parallelization is possible with up to as many
processors/cores as the picture contains CTU rows.
Because in-picture prediction between neighboring CTU rows within a
picture is allowed, the required inter-processor/inter-core
communication to enable in-picture prediction can be substantial.
The WPP partitioning does not result in the creation of more NAL
units compared to when it is not applied; thus, WPP cannot be used
for MTU size matching, though slices can be used in combination for
that purpose.
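
The staggered start of WPP can be illustrated with a simple scheduling sketch. It assumes, purely for illustration, that every CTU takes one time step to decode and that one core is available per CTU row; the function names are hypothetical.

```python
from typing import List


def wpp_start_step(row: int, col: int) -> int:
    """Earliest step at which CTU (row, col) can be decoded.

    Each row starts two CTUs after the row above it, so the CTU above
    and to the right (row - 1, col + 1) finishes one step earlier.
    """
    return col + 2 * row


def wpp_total_steps(rows: int, cols: int) -> int:
    """Steps needed to decode a picture of rows x cols CTUs."""
    return wpp_start_step(rows - 1, cols - 1) + 1


def wpp_active_rows(step: int, rows: int, cols: int) -> List[int]:
    """Rows that have a CTU being decoded at the given step."""
    return [r for r in range(rows) if 0 <= step - 2 * r < cols]
```

Under these assumptions, a picture of 3 x 5 CTUs takes 9 steps instead of the 15 needed for sequential decoding, and all three rows are active at step 4.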
Tiles define horizontal and vertical boundaries that partition a
picture into tile columns and rows. The scan order of CTUs is
changed to be local within a tile (in the order of a CTU raster scan
of a tile), before decoding the top-left CTU of the next tile in the
order of tile raster scan of a picture. Similar to slices, tiles
break in-picture prediction dependencies (including entropy decoding
dependencies). However, they do not need to be included in
individual NAL units (same as WPP in this regard); hence, tiles
cannot be used for MTU size matching, though slices can be used in
combination for that purpose. Each tile can be processed by one
processor/core, and the inter-processor/inter-core communication
required for in-picture prediction between processing units decoding
neighboring tiles is limited to conveying the shared slice header in
cases where a slice spans more than one tile, and loop-filtering-
related sharing of reconstructed samples and metadata. In this
respect, tiles are less demanding in terms of inter-processor
communication bandwidth compared to WPP due to the in-picture
independence between two neighboring partitions.
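
The change in CTU scan order introduced by tiles can be sketched as follows. The sketch assumes that tile column widths and tile row heights are given in units of CTUs and that CTUs are identified by their raster-scan address within the picture; the function name is illustrative.

```python
from typing import List


def tile_scan_order(col_widths: List[int], row_heights: List[int]) -> List[int]:
    """Return picture raster-scan CTU addresses in tile scan order.

    Tiles are visited in raster order within the picture, and CTUs are
    visited in raster order within each tile.
    """
    pic_width = sum(col_widths)
    order: List[int] = []
    y0 = 0
    for height in row_heights:
        x0 = 0
        for width in col_widths:
            for y in range(y0, y0 + height):
                for x in range(x0, x0 + width):
                    order.append(y * pic_width + x)
            x0 += width
        y0 += height
    return order
```

For a picture of 4 x 2 CTUs split into two tile columns of width 2, the CTUs are decoded in the order 0, 1, 4, 5, 2, 3, 6, 7 rather than 0 through 7.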
1.1.4. NAL Unit Header
HEVC maintains the NAL unit concept of H.264 with modifications.
HEVC uses a two-byte NAL unit header, as shown in Figure 1. The
payload of a NAL unit refers to the NAL unit excluding the NAL unit
header.
+---------------+---------------+
|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|F|   Type    |  LayerId  | TID |
+-------------+-----------------+
Figure 1: The Structure of the HEVC NAL Unit Header
The semantics of the fields in the NAL unit header are as specified
in [HEVC] and described briefly below for convenience. In addition
to the name and size of each field, the corresponding syntax element
name in [HEVC] is also provided.
F: 1 bit
forbidden_zero_bit. Required to be zero in [HEVC]. Note that the
inclusion of this bit in the NAL unit header was to enable
transport of HEVC video over MPEG-2 transport systems (avoidance
of start code emulations) [MPEG2S]. In the context of this memo,
the value 1 may be used to indicate a syntax violation, e.g., for
a NAL unit resulting from aggregating a number of fragmented units
of a NAL unit but missing the last fragment, as described in
Section 4.4.3.
Type: 6 bits
nal_unit_type. This field specifies the NAL unit type as defined
in Table 7-1 of [HEVC]. If the most significant bit of this field
of a NAL unit is equal to 0 (i.e., the value of this field is less
than 32), the NAL unit is a VCL NAL unit. Otherwise, the NAL unit
is a non-VCL NAL unit. For a reference of all currently defined
NAL unit types and their semantics, please refer to Section 7.4.2
in [HEVC].
LayerId: 6 bits
nuh_layer_id. Required to be equal to zero in [HEVC]. It is
anticipated that in future scalable or 3D video coding extensions
of this specification, this syntax element will be used to
identify additional layers that may be present in the CVS, wherein
a layer may be, e.g., a spatial scalable layer, a quality scalable
layer, a texture view, or a depth view.
TID: 3 bits
nuh_temporal_id_plus1. This field specifies the temporal
identifier of the NAL unit plus 1. The value of TemporalId is
equal to TID minus 1. A TID value of 0 is illegal to ensure that
there is at least one bit in the NAL unit header equal to 1, so as
to enable independent considerations of start code emulations in
the NAL unit header and in the NAL unit payload data.
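
The field layout of Figure 1 can be illustrated with a short, non-normative parsing sketch. The class and function names are illustrative; only the bit widths and the rules stated above (TID of 0 is illegal, Type below 32 indicates a VCL NAL unit, TemporalId equals TID minus 1) come from the text.

```python
from typing import NamedTuple


class NalHeader(NamedTuple):
    f: int          # forbidden_zero_bit (1 bit)
    nal_type: int   # nal_unit_type (6 bits)
    layer_id: int   # nuh_layer_id (6 bits)
    tid: int        # nuh_temporal_id_plus1 (3 bits)


def parse_nal_header(data: bytes) -> NalHeader:
    """Parse the two-byte HEVC NAL unit header at the start of data."""
    if len(data) < 2:
        raise ValueError("HEVC NAL unit header is two bytes long")
    word = (data[0] << 8) | data[1]
    header = NalHeader(
        f=(word >> 15) & 0x01,
        nal_type=(word >> 9) & 0x3F,
        layer_id=(word >> 3) & 0x3F,
        tid=word & 0x07,
    )
    if header.tid == 0:
        raise ValueError("a TID value of 0 is illegal")
    return header


def is_vcl(header: NalHeader) -> bool:
    """VCL NAL units have the most significant Type bit equal to 0."""
    return header.nal_type < 32


def temporal_id(header: NalHeader) -> int:
    return header.tid - 1
```

For example, the header bytes 0x40 0x01 parse to Type 32, LayerId 0, and TID 1, i.e., a non-VCL NAL unit with TemporalId 0.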
1.2. Overview of the Payload Format
This payload format defines the following processes required for
transport of HEVC coded data over RTP [RFC3550]:
o Usage of RTP header with this payload format
o Packetization of HEVC coded NAL units into RTP packets using three
types of payload structures: a single NAL unit packet, aggregation
packet, and fragment unit
o Transmission of HEVC NAL units of the same bitstream within a
single RTP stream or multiple RTP streams (within one or more RTP
sessions), where within an RTP stream transmission of NAL units
may be either non-interleaved (i.e., the transmission order of NAL
units is the same as their decoding order) or interleaved (i.e.,
the transmission order of NAL units is different from the decoding
order)

o Media type parameters to be used with the Session Description
Protocol (SDP) [RFC4566]
o A payload header extension mechanism and data structures for
enhanced support of temporal scalability based on that extension
mechanism.
2. Conventions
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in BCP 14 [RFC2119].
In this document, the above key words will convey that interpretation
only when in ALL CAPS. Lowercase uses of these words are not to be
interpreted as carrying the significance described in RFC 2119.
This specification uses the notion of setting and clearing a bit when
bit fields are handled. Setting a bit is the same as assigning that
bit the value of 1 (On). Clearing a bit is the same as assigning
that bit the value of 0 (Off).
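The set and clear operations can be sketched as follows. This is a
minimal illustration only; the helper names set_bit and clear_bit are
hypothetical, and numbering bit positions from the least significant
bit is an assumption made for this example.

```python
def set_bit(value: int, position: int) -> int:
    """Return value with the bit at position assigned 1 (On)."""
    return value | (1 << position)


def clear_bit(value: int, position: int) -> int:
    """Return value with the bit at position assigned 0 (Off)."""
    return value & ~(1 << position)
```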
3. Definitions and Abbreviations
3.1. Definitions
This document uses the terms and definitions of [HEVC]. Section
3.1.1 lists relevant definitions from [HEVC] for convenience.
Section 3.1.2 provides definitions specific to this memo.
3.1.1. Definitions from the HEVC Specification
access unit: A set of NAL units that are associated with each other
according to a specified classification rule, that are consecutive in
decoding order, and that contain exactly one coded picture.
BLA access unit: An access unit in which the coded picture is a BLA
picture.
BLA picture: An IRAP picture for which each VCL NAL unit has
nal_unit_type equal to BLA_W_LP, BLA_W_RADL, or BLA_N_LP.
Coded Video Sequence (CVS): A sequence of access units that consists,
in decoding order, of an IRAP access unit with NoRaslOutputFlag equal
to 1, followed by zero or more access units that are not IRAP access
units with NoRaslOutputFlag equal to 1, including all subsequent
access units up to but not including any subsequent access unit that
is an IRAP access unit with NoRaslOutputFlag equal to 1.

Informative note: An IRAP access unit may be an IDR access unit, a
BLA access unit, or a CRA access unit. The value of
NoRaslOutputFlag is equal to 1 for each IDR access unit, each BLA
access unit, and each CRA access unit that is the first access
unit in the bitstream in decoding order, is the first access unit
that follows an end of sequence NAL unit in decoding order, or has
HandleCraAsBlaFlag equal to 1.
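The CVS definition above can be sketched as a grouping of access units
in decoding order. This is a simplified illustration: the AccessUnit
class and its fields are hypothetical, and the value of
NoRaslOutputFlag is assumed to have been derived already, as described
in the informative note.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AccessUnit:
    """Hypothetical access unit record used only for this sketch."""
    label: str
    is_irap: bool = False
    no_rasl_output_flag: bool = False


def split_into_cvs(access_units: List[AccessUnit]) -> List[List[AccessUnit]]:
    """Group access units (in decoding order) into coded video sequences."""
    sequences: List[List[AccessUnit]] = []
    current: Optional[List[AccessUnit]] = None
    for au in access_units:
        # A new CVS starts at an IRAP access unit with NoRaslOutputFlag = 1.
        if au.is_irap and au.no_rasl_output_flag:
            current = []
            sequences.append(current)
        # Access units before the first CVS start belong to no CVS here
        # and are skipped.
        if current is not None:
            current.append(au)
    return sequences
```

Note that a CRA access unit with NoRaslOutputFlag equal to 0 does not
start a new CVS in this sketch; it remains part of the current one.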
CRA access unit: An access unit in which the coded picture is a CRA
picture.
CRA picture: An IRAP picture for which each VCL NAL unit has
nal_unit_type equal to CRA_NUT.
IDR access unit: An access unit in which the coded picture is an IDR
picture.
IDR picture: An IRAP picture for which each VCL NAL unit has
nal_unit_type equal to IDR_W_RADL or IDR_N_LP.
IRAP access unit: An access unit in which the coded picture is an
IRAP picture.
IRAP picture: A coded picture for which each VCL NAL unit has
nal_unit_type in the range of BLA_W_LP (16) to RSV_IRAP_VCL23 (23),
inclusive.
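The BLA, IDR, CRA, and IRAP picture definitions above can be sketched
as a classification over the nal_unit_type values of a picture's VCL
NAL units. The numeric values follow the nal_unit_type table in
[HEVC]; the function names and the returned labels are hypothetical
and used only for illustration.

```python
from typing import Iterable

# nal_unit_type values for the IRAP range, 16 to 23 inclusive.
BLA_W_LP, BLA_W_RADL, BLA_N_LP = 16, 17, 18
IDR_W_RADL, IDR_N_LP = 19, 20
CRA_NUT = 21
RSV_IRAP_VCL22, RSV_IRAP_VCL23 = 22, 23

BLA_TYPES = {BLA_W_LP, BLA_W_RADL, BLA_N_LP}
IDR_TYPES = {IDR_W_RADL, IDR_N_LP}


def is_irap_type(nal_unit_type: int) -> bool:
    """True if the type lies in the range BLA_W_LP to RSV_IRAP_VCL23."""
    return BLA_W_LP <= nal_unit_type <= RSV_IRAP_VCL23


def picture_kind(vcl_nal_unit_types: Iterable[int]) -> str:
    """Classify a coded picture from the types of all its VCL NAL units."""
    types = list(vcl_nal_unit_types)
    if not types:
        raise ValueError("a coded picture has at least one VCL NAL unit")
    if all(t in BLA_TYPES for t in types):
        return "BLA"
    if all(t in IDR_TYPES for t in types):
        return "IDR"
    if all(t == CRA_NUT for t in types):
        return "CRA"
    if all(is_irap_type(t) for t in types):
        return "other IRAP"
    return "non-IRAP"
```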
layer: A set of VCL NAL units that all have a particular value of
nuh_layer_id and the associated non-VCL NAL units, or one of a set of
syntactical structures having a hierarchical relationship.
operation point: A bitstream created from another bitstream by
operation of the sub-bitstream extraction process with that other
bitstream, a target highest TemporalId, and a target-layer identifier
list as inputs.
random access: The act of starting the decoding process for a
bitstream at a point other than the beginning of the bitstream.
sub-layer: A temporal scalable layer of a temporal scalable bitstream
consisting of VCL NAL units with a particular value of the TemporalId
variable, and the associated non-VCL NAL units.
sub-layer representation: A subset of the bitstream consisting of NAL
units of a particular sub-layer and the lower sub-layers.
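A sub-layer representation can be sketched as a filter on TemporalId.
The following is a simplified illustration, not the sub-bitstream
extraction process of [HEVC]: it considers only the TID field in the
second byte of each NAL unit header, and the function names are
hypothetical.

```python
from typing import Iterable, List


def temporal_id(nal_unit: bytes) -> int:
    """TemporalId is the 3-bit TID field (low bits of byte 1) minus 1."""
    return (nal_unit[1] & 0x07) - 1


def sub_layer_representation(nal_units: Iterable[bytes],
                             highest_tid: int) -> List[bytes]:
    """Keep NAL units of the given sub-layer and all lower sub-layers."""
    return [nal for nal in nal_units if temporal_id(nal) <= highest_tid]
```

For example, with highest_tid equal to 1, NAL units with TemporalId 0
and 1 are kept and those with TemporalId 2 or higher are dropped.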
tile: A rectangular region of coding tree blocks within a particular
tile column and a particular tile row in a picture.

tile column: A rectangular region of coding tree blocks having a
height equal to the height of the picture and a width specified by
syntax elements in the picture parameter set.
tile row: A rectangular region of coding tree blocks having a height
specified by syntax elements in the picture parameter set and a width
equal to the width of the picture.
3.1.2. Definitions Specific to This Memo
dependee RTP stream: An RTP stream on which another RTP stream
depends. When Multiple RTP streams on a Single media Transport
(MRST) or Multiple RTP streams on Multiple media Transports (MRMT)
is in use, all RTP streams except the highest RTP stream are
dependee RTP streams.
highest RTP stream: The RTP stream on which no other RTP stream
depends. The RTP stream in a Single RTP stream on a Single media
Transport (SRST) is the highest RTP stream.
Media-Aware Network Element (MANE): A network element, such as a
middlebox, selective forwarding unit, or application-layer gateway
that is capable of parsing certain aspects of the RTP payload headers
or the RTP payload and reacting to their contents.
Informative note: The concept of a MANE goes beyond normal routers
or gateways in that a MANE has to be aware of the signaling (e.g.,
to learn about the payload type mappings of the media streams),
and in that it has to be trusted when working with Secure RTP
(SRTP). The advantage of using MANEs is that they allow packets
to be dropped according to the needs of the media coding. For
example, if a MANE has to drop packets due to congestion on a
certain link, it can identify and remove those packets whose
elimination produces the least adverse effect on the user
experience. After dropping packets, MANEs must rewrite RTCP
packets to match the changes to the RTP stream, as specified in
Section 7 of [RFC3550].
Media Transport: As used in the MRST, MRMT, and SRST definitions
below, Media Transport denotes the transport of packets over a
transport association identified by a 5-tuple (source address, source
port, destination address, destination port, transport protocol).
See also Section 2.1.13 of [RFC7656].
Informative note: The term "bitstream" in this document is
equivalent to the term "encoded stream" in [RFC7656].

Multiple RTP streams on a Single media Transport (MRST): Multiple
RTP streams carrying a single HEVC bitstream on a Single Transport.
See also Section 3.5 of [RFC7656].
Multiple RTP streams on Multiple media Transports (MRMT): Multiple
RTP streams carrying a single HEVC bitstream on Multiple Transports.
See also Section 3.5 of [RFC7656].
NAL unit decoding order: A NAL unit order that conforms to the
constraints on NAL unit order given in Section 7.4.2.4 in [HEVC].
NAL unit output order: A NAL unit order in which NAL units of
different access units are in the output order of the decoded
pictures corresponding to the access units, as specified in [HEVC],
and in which NAL units within an access unit are in their decoding
order.
NAL-unit-like structure: A data structure that is similar to NAL
units in the sense that it also has a NAL unit header and a payload,
with a difference that the payload does not follow the start code
emulation prevention mechanism required for the NAL unit syntax as
specified in Section 7.3.1.1 of [HEVC]. Examples of NAL-unit-like
structures defined in this memo are packet payloads of Aggregation
Packet (AP), PAyload Content Information (PACI), and Fragmentation
Unit (FU) packets.
NALU-time: The value that the RTP timestamp would have if the NAL
unit were transported in its own RTP packet.
RTP stream: See [RFC7656]. Within the scope of this memo, one RTP
stream is utilized to transport one or more temporal sub-layers.
Single RTP stream on a Single media Transport (SRST): Single RTP
stream carrying a single HEVC bitstream on a Single (Media)
Transport. See also Section 3.5 of [RFC7656].
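The SRST, MRST, and MRMT definitions can be sketched by counting the
RTP streams and the Media Transports that carry one HEVC bitstream.
This is an illustration under stated assumptions: the MediaTransport
type and the classify_transmission_mode function are hypothetical, and
identifying each RTP stream by its SSRC is a simplification made for
this example.

```python
from typing import Iterable, NamedTuple, Tuple


class MediaTransport(NamedTuple):
    """The 5-tuple that identifies a transport association."""
    src_addr: str
    src_port: int
    dst_addr: str
    dst_port: int
    protocol: str


def classify_transmission_mode(
        streams: Iterable[Tuple[int, MediaTransport]]) -> str:
    """Classify (SSRC, transport) pairs that carry a single bitstream."""
    pairs = list(streams)
    if not pairs:
        raise ValueError("at least one RTP stream is required")
    ssrcs = {ssrc for ssrc, _ in pairs}
    transports = {transport for _, transport in pairs}
    if len(ssrcs) == 1 and len(transports) == 1:
        return "SRST"   # single stream, single transport
    if len(ssrcs) > 1 and len(transports) == 1:
        return "MRST"   # multiple streams, single transport
    if len(ssrcs) > 1 and len(transports) > 1:
        return "MRMT"   # multiple streams, multiple transports
    raise ValueError("one RTP stream cannot span several transports")
```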
transmission order: The order of packets in ascending RTP sequence
number order (in modulo arithmetic). Within an aggregation packet,
the NAL unit transmission order is the same as the order of
appearance of NAL units in the packet.
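Ordering by ascending RTP sequence number in modulo arithmetic can be
sketched as follows. The sketch assumes 16-bit sequence numbers, as in
[RFC3550], and assumes that all compared packets lie within half of
the sequence number space (fewer than 32768 apart); the function names
are hypothetical.

```python
import functools
from typing import Iterable, List


def seq_less_than(a: int, b: int) -> bool:
    """True if sequence number a precedes b, allowing for wraparound."""
    return a != b and ((b - a) & 0xFFFF) < 0x8000


def in_transmission_order(sequence_numbers: Iterable[int]) -> List[int]:
    """Sort sequence numbers into ascending order modulo 2**16.

    Only meaningful when all values lie within a window smaller than
    32768, since the comparison is not transitive beyond that.
    """
    def compare(a: int, b: int) -> int:
        if a == b:
            return 0
        return -1 if seq_less_than(a, b) else 1

    return sorted(sequence_numbers, key=functools.cmp_to_key(compare))
```

For example, sequence number 65535 precedes sequence number 0, because
the counter wraps around after 65535.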