RFC 7667

RTP Topologies

3.4. Point to Multipoint Using Mesh
Shortcut name: Topo-Mesh
+---+      +---+
| A |<---->| B |
+---+      +---+
  ^         ^
   \       /
    \     /
     v   v
     +---+
     | C |
     +---+
Figure 8: Point to Multipoint Using Mesh
Based on the RTP session definition, it is clearly possible to have a
joint RTP session involving three or more endpoints over multiple
unicast transport flows, like the joint three-endpoint session
depicted above. In this case, A needs to send its RTP streams and
RTCP packets to both B and C over their respective transport flows.
As long as all endpoints do the same, everyone will have a joint view
of the RTP session.
This topology does not create any additional requirements beyond the
need to have multiple transport flows associated with a single RTP
session. Note that an endpoint may use a single local port to
receive all these transport flows (in which case the sending port, IP
address, or SSRC can be used to demultiplex), or it might have
separate local reception ports for each of the endpoints.
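As a minimal sketch of the single-port case, assuming the receiver
learns each peer's transport address from signaling, a packet can be
attributed to a peer by its source address, with the SSRC from the
fixed RTP header as a fallback. The names demux_key and known_peers
are illustrative and not taken from this document.

```python
import struct

def rtp_ssrc(packet: bytes) -> int:
    """Return the SSRC from the fixed RTP header (bytes 8..11, network order)."""
    if len(packet) < 12:
        raise ValueError("too short for an RTP header")
    return struct.unpack_from("!I", packet, 8)[0]

def demux_key(src_ip: str, src_port: int, packet: bytes, known_peers: dict) -> str:
    """Attribute a packet to a peer when all flows share one local port.

    known_peers maps (ip, port) -> peer name; this table is an assumption,
    e.g. populated from signaling.
    """
    peer = known_peers.get((src_ip, src_port))
    if peer is not None:
        return peer
    # Unknown transport address: fall back to the SSRC as the demux key.
    return f"ssrc:{rtp_ssrc(packet):08x}"

# Hypothetical packet: V=2, PT=96, seq=1, timestamp=0, SSRC=0x12345678.
pkt = struct.pack("!BBHII", 0x80, 96, 1, 0, 0x12345678)
peers = {("192.0.2.1", 5004): "B", ("192.0.2.2", 5004): "C"}
```

With separate local reception ports per endpoint, the lookup table
would instead be keyed on the local port.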

configuration in Figure 8, endpoint A has no awareness of the
conditions occurring in the session between endpoints B and C
(whereas if a single RTP session were used, it would have such
awareness).
Loop detection is also affected. With independent RTP sessions, the
SSRC/CSRC cannot be used to determine when an endpoint receives its
own media stream, or a mixed media stream including its own media
stream (a condition known as a loop). The identification of loops
and, in most cases, their avoidance, has to be achieved by other
means, for example, through signaling or the use of an RTP external
namespace binding SSRC/CSRC among any communicating RTP sessions in
the mesh.
3.5. Point to Multipoint Using the RFC 3550 Translator
This section discusses additional point-to-multipoint usages of
translators, compared to the point-to-point cases in Section 3.2.1.
3.5.1. Relay - Transport Translator
Shortcut name: Topo-PtM-Trn-Translator
This section discusses Transport Translator-only usages to enable
multipoint sessions.
           +-----+
+---+     /       \     +------------+      +---+
| A |<---/         \    |            |<---->| B |
+---+   /           \   |            |      +---+
       +  Multicast  +->| Translator |
+---+   \  Network  /   |            |      +---+
| C |<---\         /    |            |<---->| D |
+---+     \       /     +------------+      +---+
           +-----+
Figure 11: Point to Multipoint Using Multicast
Figure 11 depicts an example of a Transport Translator performing at
least IP address translation. It allows the (non-multicast-capable)
endpoints B and D to take part in an Any-Source Multicast session
involving endpoints A and C, by having the translator forward their
unicast traffic to the multicast addresses in use, and vice versa.
It must also forward B's traffic to D, and vice versa, to provide
both B and D with a complete view of the session.

+---+      +------------+      +---+
| A |<---->|            |<---->| B |
+---+      |            |      +---+
           | Translator |
+---+      |            |      +---+
| C |<---->|            |<---->| D |
+---+      +------------+      +---+
Figure 12: RTP Translator (Relay) with Only Unicast Paths
Another translator scenario is depicted in Figure 12. The translator
in this case connects multiple endpoints through unicast. This can
be implemented using a very simple Transport Translator which, in
this document, is called a relay. The relay forwards all traffic it
receives, both RTP and RTCP, to all other endpoints. In doing so, a
multicast network is emulated without relying on a multicast-capable
network infrastructure.
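The forwarding rule of such a relay can be sketched as follows. This
is only an illustration under the assumption that endpoints are
identified by simple names; a real relay would key on transport
addresses.

```python
def relay(sender, packet, endpoints):
    """Return (destination, packet) pairs for every endpoint except the sender.

    The packet (RTP or RTCP) is forwarded unchanged.
    """
    return [(dst, packet) for dst in endpoints if dst != sender]

endpoints = ["A", "B", "C", "D"]
out = relay("A", b"rtp-or-rtcp", endpoints)
```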
For RTCP feedback, this results in a similar set of considerations to
those described in the ASM RTP topology. It also puts some
additional signaling requirements onto the session establishment; for
example, a common configuration of RTP payload types is required.
Transport Translators and relays should always consider implementing
source address filtering, to prevent attackers from using the
listening ports on the translator to inject traffic. The translator
can, however, go one step further, especially if explicit SSRC
signaling is used, to prevent an endpoint from sending SSRCs other
than its own (for example, SSRCs used by other participants in the
session). This can improve the security properties of the session,
despite the use of group keys that, on a cryptographic level, allow
anyone to impersonate another in the same RTP session.
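A minimal sketch of the two checks, assuming the translator holds a
table of known source addresses and the SSRCs signaled for each of
them (the table layout and the name accept are hypothetical):

```python
def accept(src_addr, ssrc, allowed):
    """Decide whether a packet may be forwarded.

    allowed maps a known source address -> set of SSRCs signaled for it.
    """
    ssrcs = allowed.get(src_addr)
    if ssrcs is None:
        return False      # unknown source address: source address filtering
    return ssrc in ssrcs  # SSRC not signaled for this endpoint: drop

allowed = {
    ("192.0.2.1", 5004): {0x1111},
    ("192.0.2.2", 5004): {0x2222},
}
```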
A translator that doesn't change the RTP/RTCP packet content can be
operated without requiring it to have access to the security contexts
used to protect the RTP/RTCP traffic between the participants.
3.5.2. Media Translator
In the context of multipoint communications, a Media Translator does
not provide new mechanisms to establish a multipoint session. It
is more of an enabler, or facilitator, that ensures a given endpoint
or a defined subset of endpoints can participate in the session.
If endpoint B in Figure 11 were behind a limited network path, the
translator may perform media transcoding to allow the traffic
received from the other endpoints to reach B without overloading the
path. This transcoding can help the other endpoints in the multicast
part of the session, by not requiring the quality transmitted by A to
be lowered to the bitrates that B is actually capable of receiving
(and vice versa).
3.6. Point to Multipoint Using the RFC 3550 Mixer Model
Shortcut name: Topo-Mixer
A mixer is a middlebox that aggregates multiple RTP streams that are
part of a session by generating one or more new RTP streams and, in
most cases, by manipulating the media data. One common application
for a mixer is to allow a participant to receive a session with a
reduced amount of resources.
           +-----+
+---+     /       \     +-----------+      +---+
| A |<---/         \    |           |<---->| B |
+---+   /   Multi-  \   |           |      +---+
       +    cast     +->|   Mixer   |
+---+   \  Network  /   |           |      +---+
| C |<---\         /    |           |<---->| D |
+---+     \       /     +-----------+      +---+
           +-----+
Figure 13: Point to Multipoint Using the RFC 3550 Mixer Model
A mixer can be viewed as a device terminating the RTP streams
received from other endpoints in the same RTP session. Using the
media data carried in the received RTP streams, a mixer generates
derived RTP streams that are sent to the receiving endpoints.
The content that the mixer provides is the mixed aggregate of what
the mixer receives over the PtP or PtM paths, which are part of the
same Communication Session.
The mixer creates the Media Source and the source RTP stream just
like an endpoint, as it mixes the content (often in the uncompressed
domain) and then encodes and packetizes it for transmission to a
receiving endpoint. The CSRC Count (CC) and CSRC fields in the RTP
header can be used to indicate the contributors to the newly
generated RTP stream. The SSRCs of the to-be-mixed streams on the
mixer input appear as the CSRCs at the mixer output. That output
stream uses a unique SSRC that identifies the mixer's stream. The
CSRC should be forwarded between the different endpoints to allow for
loop detection and identification of sources that are part of the
Communication Session. Note that Section 7.1 of RFC 3550 requires
the SSRC space to be shared between domains for these reasons. This
also implies that any SDES information normally needs to be forwarded
across the mixer.
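The relationship between input SSRCs, the output CSRC list, and loop
detection can be sketched as below. The header is modeled as a plain
dictionary rather than a wire format, and the function names are
illustrative. The limit of 15 entries follows from the 4-bit CC field
in RFC 3550.

```python
MAX_CSRC = 15  # the CC field is 4 bits wide, so at most 15 CSRCs fit

def mixed_header_fields(mixer_ssrc, input_ssrcs):
    """SSRCs of the mixed inputs appear as CSRCs under the mixer's own SSRC."""
    csrcs = list(input_ssrcs)[:MAX_CSRC]
    return {"ssrc": mixer_ssrc, "cc": len(csrcs), "csrc": csrcs}

def is_loop(own_ssrc, header):
    """True if a received stream is, or contains, the receiver's own stream."""
    return own_ssrc == header["ssrc"] or own_ssrc in header["csrc"]

hdr = mixed_header_fields(0xAAAA, [0x0001, 0x0002])
```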
The mixer is responsible for generating RTCP packets in accordance
with its role. It is an RTP receiver and should therefore send RTCP
receiver reports for the RTP streams it receives and terminates. In
its role as an RTP sender, it should also generate RTCP sender
reports for those RTP streams it sends. As specified in Section 7.3
of RFC 3550, a mixer must not forward RTCP unaltered between the two
domains.
The mixer depicted in Figure 13 is involved in three domains that
need to be separated: the Any-Source Multicast network (including
endpoints A and C), endpoint B, and endpoint D. Assuming all four
endpoints in the conference are interested in receiving content from
all other endpoints, the mixer produces different mixed RTP streams
for B and D, as the one to B may contain content received from D, and
vice versa. However, the mixer may only need one SSRC per media type
in each domain where it is the receiving entity and transmitter of
mixed content.
In the multicast domain, a mixer still needs to provide a mixed view
of the other domains. This makes the mixer simpler to implement and
avoids any issues with advanced RTCP handling or loop detection,
which would be problematic if the mixer were providing non-symmetric
behavior. Please see Section 3.11 for more discussion on this topic.
The mixing operation, however, in each domain could potentially be
different.
A mixer is responsible for receiving RTCP feedback messages and
handling them appropriately. The definition of "appropriate" depends
on the message itself and the context. In some cases, the reception
of a codec-control message by the mixer may result in the generation
and transmission of RTCP feedback messages by the mixer to the
endpoints in the other domain(s). In other cases, a message is
handled by the mixer locally and therefore not forwarded to any other
domain.
When replacing the multicast network in Figure 13 (to the left of the
mixer) with individual unicast paths as depicted in Figure 14, the
mixer model is very similar to the one discussed in Section 3.9
below. Please see the discussion in Section 3.9 about the
differences between these two models.

+---+      +------------+      +---+
| A |<---->|            |<---->| B |
+---+      |            |      +---+
           |   Mixer    |
+---+      |            |      +---+
| C |<---->|            |<---->| D |
+---+      +------------+      +---+
Figure 14: RTP Mixer with Only Unicast Paths
We now discuss in more detail the different mixing operations that a
mixer can perform and how they can affect RTP and RTCP behavior.
3.6.1. Media-Mixing Mixer
The Media-Mixing Mixer is likely the one that most think of when they
hear the term "mixer". Its basic mode of operation is that it
receives RTP streams from several endpoints and selects the stream(s)
to be included in a media-domain mix. The selection can be through
static configuration or by dynamic, content-dependent means such as
voice activation. The mixer then creates a single outgoing RTP
stream from this mix.
The most commonly deployed Media-Mixing Mixer is probably the audio
mixer, used in voice conferencing, where the output consists of a
mixture of all the input audio signals; this needs minimal signaling
to be successfully set up. From a signal processing viewpoint, audio
mixing is relatively straightforward and commonly possible for a
reasonable number of endpoints. Assume, for example, that one wants
to mix N streams from N different endpoints. The mixer needs to
decode those N streams, typically into the sample domain, and then
produce N or N+1 mixes. Different mixes are needed so that each
endpoint gets a mix of all other sources except its own, as including
its own audio would result in an echo. When N is lower than the
number of all endpoints, one may also produce a mix of all N streams
for the group of endpoints that are currently not included in the
mix; thus, N+1 mixes. These audio
streams are then encoded again, RTP packetized, and sent out. In
many cases, audio level normalization, noise suppression, and similar
signal processing steps are also required or desirable before the
actual mixing process commences.
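The N or N+1 mixes described above can be sketched as follows,
assuming the N streams are already decoded into equal-length lists of
integer samples and that at least one stream is active. Clipping,
level normalization, and re-encoding are deliberately left out, and
the name build_mixes is illustrative.

```python
def build_mixes(active):
    """active maps endpoint -> list of decoded samples (equal lengths).

    Returns one mix per active endpoint that excludes its own audio, plus
    a full mix under the key None for endpoints not currently in the mix.
    """
    length = len(next(iter(active.values())))
    full = [sum(s[i] for s in active.values()) for i in range(length)]
    mixes = {ep: [f - x for f, x in zip(full, own)]
             for ep, own in active.items()}
    mixes[None] = full  # the "+1" mix for endpoints outside the mix
    return mixes

mixes = build_mixes({"A": [1, 2], "B": [10, 20], "C": [100, 200]})
```

Subtracting an endpoint's own samples from the full mix yields the
same result as summing all other sources, at lower cost for large N.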
In video, the term "mixing" has a different interpretation than in
audio. It is commonly used to refer to the process of spatially
combining contributed video streams, which is also known as "tiling".
The reconstructed, appropriately scaled down videos can be spatially
arranged in a set of tiles, with each tile containing the video from
an endpoint (typically showing a human participant). Tiles can be of
different sizes so that, for example, a particularly important
participant, or the loudest speaker, is shown in a larger tile than
other participants. A self-view picture can be included in the
tiling, which can be either locally produced or fed back from a
mixer-received and reconstructed video image. Such remote loopback
allows for confidence monitoring, i.e., it enables participants to
see themselves in the same quality as other participants see them.
The tiling normally operates on reconstructed video in the
sample domain. The tiled image is encoded, packetized, and sent by
the mixer to the receiving endpoints. It is possible that a
middlebox with media mixing duties contains only a single mixer of
the aforementioned type, in which case all participants necessarily
see the same tiled video, even if it is being sent over different RTP
streams. More common, however, are mixing arrangements where an
individual mixer is available for each outgoing port of the
middlebox, allowing individual compositions for each receiving
endpoint (a feature commonly referred to as personalized layout).
One problem with media mixing is that it consumes large amounts of
both media processing resources (for the decoding and mixing process
in the uncompressed domain) and encoding resources (for the encoding
of the mixed signal). Another problem is the quality degradation
created by decoding and re-encoding the media, which is the result of
the lossy nature of the most commonly used media codecs. A third
problem is the latency introduced by the media mixing, which can be
substantial and annoyingly noticeable in case of video, or in case of
audio if that mixed audio is lip-synchronized with high-latency
video. The advantage of media mixing is that it is straightforward
for the endpoints to handle the single media stream (which includes
the mixed aggregate of many sources), as they don't need to handle
multiple decodings, local mixing, and composition. In fact, mixers
were introduced in pre-RTP times so that legacy, single stream
receiving endpoints (that, in some protocol environments, actually
didn't need to be aware of the multipoint nature of the conference)
could successfully participate in what a user would recognize as a
multiparty video conference.

If the SSRCs from the endpoint to mixer paths are used as CSRCs in
another RTP session, then RTP1, RTP2, and RTP3 become one joint
session as they have a common SSRC space. At this stage, the mixer
also needs to consider which RTCP information it needs to expose in
the different paths. In the above scenario, a mixer would normally
expose nothing more than the SDES information and RTCP BYE for a CSRC
leaving the session. The main goal would be to enable the correct
binding against the application logic and other information sources.
This also enables loop detection in the RTP session.
3.6.2. Media-Switching Mixer
Media-Switching Mixers are used in limited functionality scenarios
where no, or only very limited, concurrent presentation of multiple
sources is required by the application and also in more complex
multi-stream usages with receiver mixing or tiling, including
combined with simulcast and/or scalability between source and mixer.
An RTP mixer based on media switching avoids the media decoding and
encoding operations in the mixer, as it conceptually forwards the
encoded media stream as it was being sent to the mixer. It does not
avoid, however, the decryption and re-encryption cycle as it rewrites
RTP headers. Forwarding media (in contrast to reconstructing-mixing-
encoding media) reduces the amount of computational resources needed
in the mixer and increases the media quality (both in terms of
fidelity and reduced latency).
A Media-Switching Mixer maintains a pool of SSRCs representing
conceptual or functional RTP streams that the mixer can produce.
These RTP streams are created by selecting media from one of the RTP
streams received by the mixer and forwarded to the peer using the
mixer's own SSRCs. The mixer can switch between available sources if
that is required by the concept for the source, like the currently
active speaker. Note that the mixer, in most cases, still needs to
perform a certain amount of media processing, as many media formats
do not allow a receiver to "tune into" the stream at arbitrary points
in their bitstream.
To achieve a coherent RTP stream from the mixer's SSRC, the mixer
needs to rewrite the incoming RTP packet's header. First, the SSRC
field must be set to the value of the mixer's SSRC. Second, the
sequence number must be the next in the sequence of outgoing packets
it sent. Third, the RTP timestamp value needs to be adjusted using
an offset that changes each time one switches the Media Source.
Finally, depending on the negotiation of the RTP payload type, the
value representing this particular RTP payload configuration may have
to be changed if the different endpoint-to-mixer paths have not
arrived on the same numbering for a given configuration. This also

The Media-Switching Mixer can, similarly to the Media-Mixing Mixer,
reduce the bitrate required for media transmission towards the
different peers by selecting and forwarding only a subset of RTP
streams it receives from the sending endpoints. In case the mixer
receives simulcast transmissions or a scalable encoding of the Media
Source, the mixer has more degrees of freedom to select streams or
subsets of streams to forward to a receiving endpoint, both based on
transport or endpoint restrictions as well as application logic.
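The header rewriting that a Media-Switching Mixer performs on each
forwarded packet (SSRC, sequence number, timestamp offset, and
payload type) can be sketched as below. Headers are modeled as
dictionaries, and the class name, the switch_gap value, and the
pt_map table are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SwitchedStream:
    """One outgoing RTP stream of a hypothetical Media-Switching Mixer."""
    ssrc: int                                   # the mixer's own SSRC
    switch_gap: int = 3000                      # assumed timestamp step at a switch
    pt_map: dict = field(default_factory=dict)  # (source SSRC, PT in) -> PT out
    next_seq: int = 0
    ts_offset: int = 0
    last_ts: int = 0
    source: Optional[int] = None

    def rewrite(self, src_ssrc: int, ts: int, pt: int) -> dict:
        if src_ssrc != self.source:
            # Media Source switched: pick a new offset so that outgoing
            # timestamps continue from the last one sent.
            self.ts_offset = (self.last_ts + self.switch_gap - ts) % 2**32
            self.source = src_ssrc
        self.last_ts = (ts + self.ts_offset) % 2**32
        header = {
            "ssrc": self.ssrc,                          # mixer's SSRC
            "seq": self.next_seq,                       # next outgoing number
            "ts": self.last_ts,                         # offset timestamp
            "pt": self.pt_map.get((src_ssrc, pt), pt),  # remapped if needed
        }
        self.next_seq = (self.next_seq + 1) % 2**16
        return header

s = SwitchedStream(ssrc=0xABCD, pt_map={(2, 100): 96})
a = s.rewrite(1, 10000, 96)
b = s.rewrite(1, 13000, 96)
c = s.rewrite(2, 900000, 100)   # switch to another Media Source
```

The receiver sees one continuous stream from SSRC 0xABCD even though
the underlying Media Source changed between the second and third
packet.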
To ensure that a media receiver in an endpoint can correctly decode
the media in the RTP stream after a switch, a codec that uses
temporal prediction needs to start its decoding from independent
refresh points, or points in the bitstream offering similar
functionality (like "dirty refresh points"). For some codecs, for
example, frame-based speech and audio codecs, this is easily achieved
by starting the decoding at RTP packet boundaries, as each packet
boundary provides a refresh point (assuming proper packetization on
the encoder side). For other codecs, particularly in video, refresh
points are less common in the bitstream or may not be present at all
without an explicit request to the respective encoder. The Full
Intra Request [RFC5104] RTCP codec control message has been defined
for this purpose.
In this type of mixer, one could consider fully terminating the RTP
sessions between the different endpoint and mixer paths. The same
arguments and considerations as discussed in Section 3.9 need to be
taken into consideration and apply here.
3.7. Selective Forwarding Middlebox
Another method for handling media in the RTP mixer is to "project",
or make available, all potential RTP sources (SSRCs) into a per-
endpoint, independent RTP session. The middlebox can select which of
the potential sources that are currently actively transmitting media
will be sent to each of the endpoints. This is similar to the Media-
Switching Mixer but has some important differences in RTP details.

number needs to be consecutively incremented based on the packet
actually being transmitted in each RTP session. Therefore, the RTP
sequence number offset will change each time a source is turned on in
an RTP session. The timestamp (possibly offset) stays the same.
The RTP sessions can be considered independent, so the SSRC numbers
used can also be handled independently. This simplifies
the SSRC collision detection and avoidance but requires tools such as
remapping tables between the RTP sessions. Using independent RTP
sessions is not required, as it is possible for the switching
behavior to also perform with a common SSRC space. However, in this
case, collision detection and handling becomes a different problem.
It is up to the implementation to use a single common SSRC space or
separate ones.
Using separate SSRC spaces has some implications. For example, the
RTP stream that is being sent by endpoint B to the middlebox (BV1)
may use an SSRC value of 12345678. When that RTP stream is sent to
endpoint F by the middlebox, it can use any SSRC value, e.g.,
87654321. As a result, each endpoint may have a different view of
the application usage of a particular SSRC. Any RTP-level identity
information, such as SDES items, also needs to update the SSRC
referenced, if the included SDES items are intended to be global.
Thus, the application must not use SSRCs as references to RTP streams
when communicating with other peers directly. This also affects loop
detection, which will fail to work as there is no common namespace
and identities across the different legs in the Communication Session
on the RTP level. Instead, this responsibility falls onto higher
layers.
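A remapping table for one receiver leg with a separate SSRC space
might look like the following sketch. It reuses the example SSRC
values from the text; the class and method names are hypothetical,
and the SSRC values are treated as decimal integers for simplicity.

```python
class LegRemapper:
    """Per-receiver leg state for a middlebox using separate SSRC spaces."""

    def __init__(self):
        self.ssrc_map = {}   # incoming SSRC -> SSRC projected on this leg
        self.next_seq = {}   # projected SSRC -> next outgoing sequence number

    def project(self, in_ssrc, out_ssrc):
        """Register the SSRC under which a source is projected on this leg."""
        self.ssrc_map[in_ssrc] = out_ssrc
        self.next_seq.setdefault(out_ssrc, 0)

    def forward(self, in_ssrc, ts):
        """Rewrite SSRC and sequence number; the timestamp is kept."""
        out_ssrc = self.ssrc_map[in_ssrc]
        seq = self.next_seq[out_ssrc]
        self.next_seq[out_ssrc] = (seq + 1) % 2**16
        return {"ssrc": out_ssrc, "seq": seq, "ts": ts}

leg_f = LegRemapper()
leg_f.project(12345678, 87654321)   # BV1 as seen by endpoint F
p0 = leg_f.forward(12345678, 1000)
p1 = leg_f.forward(12345678, 4000)
```

Any RTCP item that references the SSRC would need the same table
applied before being forwarded on this leg.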
The middlebox is also responsible for receiving any RTCP codec
control requests coming from an endpoint and deciding if it can act
on the request locally or needs to translate the request into the RTP
session/transport leg that contains the Media Source. Both endpoints
and the middlebox need to implement conference-related codec control
functionalities to provide a good experience. Commonly used are
Full Intra Request (FIR), to request that the Media Source provide
switching points between the sources, and Temporary Maximum Media
Bitrate Request (TMMBR), to enable the middlebox to aggregate
congestion control responses towards the Media Source so that it can
adjust its bitrate (obviously, only in case the limitation is not in
the source-to-middlebox link).
The Selective Forwarding Middlebox has been introduced in recently
developed videoconferencing systems in conjunction with, and to
capitalize on, scalable video coding as well as simulcasting. An
example of scalable video coding is Annex G of H.264, but other
codecs, including H.264 AVC and VP8, also exhibit scalability, albeit
only in the temporal dimension. In both scalable coding and
simulcast cases, the video signal is represented by a set of two or
more bitstreams, providing a corresponding number of distinct
fidelity points. The middlebox selects which parts of a scalable
bitstream (or which bitstream, in the case of simulcasting) to
forward to each of the receiving endpoints. The decision may be
driven by a number of factors, such as available bitrate, desired
layout, etc. Contrary to transcoding MCUs, Selective Forwarding
Middleboxes (SFMs) have extremely low delay and provide features that
are typically associated with high-end systems (personalized layout,
error localization) without any signal processing at the middlebox.
They are also capable of scaling
to a large number of concurrent users, and--due to their very low
delay--can also be cascaded.
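As one simplified example of such a forwarding decision, the sketch
below picks among simulcast bitstreams (or cumulative scalable layer
bitrates) using only the available bitrate towards a receiver.
Falling back to the lowest option when nothing fits is an assumption
of this sketch, not a rule from this document.

```python
def select_stream(bitrates_kbps, available_kbps):
    """Return the highest-bitrate option that fits, else the lowest one."""
    fitting = [b for b in bitrates_kbps if b <= available_kbps]
    return max(fitting) if fitting else min(bitrates_kbps)
```

A real middlebox would combine this with other factors, such as the
desired layout.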
This version of the middlebox also puts different requirements on the
endpoint when it comes to decoder instances and handling of the RTP
streams providing media. As each projected SSRC can, at any time,
provide media, the endpoint either needs to be able to handle as many
decoder instances as the middlebox received, or have efficient
switching of decoder contexts in a more limited set of actual decoder
instances to cope with the switches. The application also gets more
responsibility to update how the media provided is to be presented to
the user.
Note that this topology could potentially be seen as a Media
Translator that includes an on/off logic as part of its media
translation. The topology has the property that all SSRCs present in
the session are visible to an endpoint. It also has mixer aspects,
as the streams it provides are not simply translated versions;
instead, they have a conceptual property assigned to them and can be
turned on/off as well as fully or partially delivered. Thus,
this topology appears to be some hybrid between the translator and
mixer model.
The differences between a Selective Forwarding Middlebox and a
Media-Switching Mixer (Section 3.6.2) are minor, and they share most
properties. The above requirement of having a large number of
decoding instances, or efficient switching of decoder contexts, is
one point of difference. The other is how the identification is
performed: the mixer uses the CSRC to provide information on what is
included in a particular RTP stream that represents a particular
concept. Selective forwarding gets the source information through
the SSRC and instead uses other mechanisms to indicate the stream's
intended usage, if needed.

3.8. Point to Multipoint Using Video-Switching MCUs
Shortcut name: Topo-Video-switch-MCU
+---+      +------------+      +---+
| A |------| Multipoint |------| B |
+---+      |  Control   |      +---+
           |    Unit    |
+---+      |   (MCU)    |      +---+
| C |------|            |------| D |
+---+      +------------+      +---+
Figure 18: Point to Multipoint Using a Video-Switching MCU
This PtM topology was popular in early implementations of multipoint
videoconferencing systems due to its simplicity, and the
corresponding middlebox design has been known as a "video-switching
MCU". The more complex RTCP-terminating MCUs, discussed in the next
section, became the norm, however, when technology allowed
implementations at acceptable costs.
A video-switching MCU forwards to a participant a single media
stream, selected from the available streams. The criteria for
selection are often based on voice activity in the audio-visual
conference, but other conference management mechanisms (like
presentation mode or explicit floor control) are known to exist as
well.
The video-switching MCU may also perform media translation to modify
the content in bitrate, encoding, or resolution. However, it still
may indicate the original sender of the content through the SSRC. In
this case, the values of the CC and CSRC fields are retained.
If the MCU does not terminate RTP, the RTCP sender reports are
forwarded for the currently selected sender. All RTCP receiver reports
are freely
forwarded between the endpoints. In addition, the MCU may also
originate RTCP control traffic in order to control the session and/or
report on status from its viewpoint.
The video-switching MCU has most of the attributes of a translator.
However, its stream selection is a mixing behavior. This behavior
has some RTP and RTCP issues associated with it. The suppression of
all but one RTP stream results in most participants seeing only a
subset of the sent RTP streams at any given time, often a single RTP
stream per conference. Therefore, RTCP receiver reports only report
on these RTP streams. Consequently, the endpoints emitting RTP
streams that are not currently forwarded receive a view of the
session that indicates their RTP streams disappear somewhere en
route. This makes the use of RTCP for congestion control, or any
type of quality reporting, very problematic.
To avoid the aforementioned issues, the MCU needs to implement two
features. First, it needs to act as a mixer (see Section 3.6) and
forward the selected RTP stream under its own SSRC and with the
appropriate CSRC values. Second, the MCU needs to modify the RTCP
RRs it forwards between the domains. As a result, it is recommended
that one implement a centralized video-switching conference using a
mixer according to RFC 3550, instead of the shortcut implementation
described here.
3.9. Point to Multipoint Using RTCP-Terminating MCU
Shortcut name: Topo-RTCP-terminating-MCU
+---+      +------------+      +---+
| A |<---->| Multipoint |<---->| B |
+---+      |  Control   |      +---+
           |    Unit    |
+---+      |   (MCU)    |      +---+
| C |<---->|            |<---->| D |
+---+      +------------+      +---+
Figure 19: Point to Multipoint Using Content Modifying MCUs
In this PtM scenario, each endpoint runs an RTP point-to-point
session between itself and the MCU. This is a very commonly deployed
topology in multipoint video conferencing. The content that the MCU
provides to each participant is either:
a. a selection of the content received from the other endpoints or
b. the mixed aggregate of what the MCU receives from the other PtP
paths, which are part of the same Communication Session.
In case (a), the MCU may modify the content in terms of bitrate,
encoding format, or resolution. No explicit RTP mechanism is used to
establish the relationship between the original RTP stream of the
media being sent and the RTP stream the MCU sends. In other words,
the outgoing RTP streams typically use a different SSRC, and may well
use a different payload type (PT), even if this different PT happens
to be mapped to the same media type. This is a result of the
individually negotiated RTP session for each endpoint.
In case (b), the MCU is the Media Source and generates the Source RTP
Stream as it mixes the received content and then encodes and
packetizes it for transmission to an endpoint. According to RTP
[RFC3550], the SSRCs of the contributors are to be signaled using the
CSRC/CC mechanism. In practice, today, most deployed MCUs do not
implement this feature. Instead, the identification of the endpoints
whose content is included in the mixer's output is not indicated
through any explicit RTP mechanism. That is, most deployed MCUs set
the CC field in the RTP header to zero, thereby indicating no
available CSRC information, even if they could identify the original
sending endpoints as suggested in RTP.
The main feature that sets this topology apart from what RFC 3550
describes is the breaking of the common RTP session across the
centralized device, such as the MCU. This results in the loss of
explicit RTP-level indication of all participants. If one were using
the mechanisms available in RTP and RTCP to signal this explicitly,
the topology would follow the approach of an RTP mixer. The lack of
explicit indication has at least the following potential problems:
1. Loop detection cannot be performed on the RTP level. When
carelessly connecting two misconfigured MCUs, a loop could be
generated.
2. There is no information about active media senders available in
the RTP packet. As this information is missing, receivers cannot
use it. It also deprives the client of information related to
currently active senders in a machine-usable way, thus preventing
clients from indicating currently active speakers in user
interfaces, etc.
Note that many/most deployed MCUs (and video conferencing endpoints)
rely on signaling-layer mechanisms for the identification of the
Contributing Sources, for example, a SIP conferencing package
[RFC4575]. This alleviates, to some extent, the aforementioned
issues resulting from ignoring RTP's CSRC mechanism.
3.10. Split Component Terminal
Shortcut name: Topo-Split-Terminal
In some applications, for example, in some telepresence systems,
terminals may not be integrated into a single functional unit but
composed of more than one subunit. For example, a telepresence room
terminal employing multiple cameras and monitors may consist of
multiple video conferencing subunits, each capable of handling a
single camera and monitor. Another example would be a video
conferencing terminal in which audio is handled by one subunit, and
video by another. Each of these subunits uses its own physical
network interface (for example, an Ethernet jack) and network address.

The various (media processing) subunits need (logically and
physically) to be interconnected by control functionality, but their
media plane functionality may be split. These types of terminals are
referred to as split component terminals. Historically, the earliest
split component terminals were perhaps the independent audio and
video conference software tools used over the MBONE in the late
1990s.
An example of such a split component terminal is depicted in
Figure 20. Within split component terminal A, at least audio and
video subunits are addressed by their own network addresses. In some
of these systems, the control stack subunit may also have its own
network address.
From an RTP viewpoint, each of the subunits terminates RTP and acts
as an endpoint in the sense that each subunit includes its own,
independent RTP stack. However, as the subunits are semantically
part of the same terminal, it is appropriate that this semantic
relationship is expressed in RTCP protocol elements, namely in the
CNAME.
+---------------------+
| Endpoint A          |
| Local Area Network  |
|      +------------+ |
|   +->| Audio      |<+-RTP---\
|   |  +------------+ |        \    +------+
|   |  +------------+ |         +-->|      |
|   +->| Video      |<+-RTP-------->|  B   |
|   |  +------------+ |         +-->|      |
|   |  +------------+ |        /    +------+
|   +->| Control    |<+-SIP---/
|      +------------+ |
+---------------------+
Figure 20: Split Component Terminal
It is further sensible that the subunits share a common clock from
which RTP and RTCP clocks are derived, to facilitate synchronization
and avoid clock drift.
To indicate that audio and video Source Streams generated by
different subunits share a common clock, and can be synchronized, the
RTP streams generated from those Source Streams need to include the
same CNAME in their RTCP SDES packets. The use of a common CNAME for
RTP flows carried in different transport-layer flows is entirely
normal for RTP and RTCP senders, and fully compliant RTP endpoints,
middleboxes, and other tools should have no problem with this.
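
As a minimal sketch of how a receiver might apply this rule, the
following Python groups received streams by CNAME and ignores the
transport address they arrive from. The StreamInfo type, the
sync_groups function, and all SSRC, CNAME, and address values are
hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamInfo:
    ssrc: int         # SSRC of the received RTP stream
    cname: str        # CNAME carried in that stream's RTCP SDES packets
    source_addr: str  # transport address the stream arrives from


def sync_groups(streams):
    """Group SSRCs by CNAME, regardless of transport address."""
    groups = defaultdict(list)
    for stream in streams:
        groups[stream.cname].append(stream.ssrc)
    return dict(groups)


# Hypothetical streams: the audio and video subunits of split terminal A
# use different network addresses but the same CNAME.
streams = [
    StreamInfo(0x1111, "terminal-a@example.invalid", "192.0.2.10:5004"),
    StreamInfo(0x2222, "terminal-a@example.invalid", "192.0.2.11:5006"),
    StreamInfo(0x3333, "other@example.invalid", "192.0.2.20:5004"),
]
groups = sync_groups(streams)
```

Here the two streams from terminal A land in the same group even
though they arrive from different addresses, so a receiver could
treat them as candidates for synchronization.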

However, outside of the split component terminal scenario (and
perhaps a multihomed endpoint scenario, which is not further
discussed herein), the use of a common CNAME in RTP streams sent from
separate endpoints (as opposed to a common CNAME for RTP streams sent
on different transport-layer flows between two endpoints) is rare.
It has been reported that at least some third-party tools, such as
some network monitors, do not gracefully handle endpoints that use a
common CNAME across multiple transport-layer flows: they report an
error condition in which two separate endpoints are using the same
CNAME.
Depending on the sophistication of the support staff, such erroneous
reports can lead to support issues.
The aforementioned support issue can sometimes be avoided if each of
the subunits of a split component terminal is configured to use a
different CNAME, with the synchronization between the RTP streams
being indicated by some non-RTP signaling channel rather than using a
common CNAME sent in RTCP. This complicates the signaling,
especially in cases where there are multiple SSRCs in use with
complex synchronization requirements, as is the case in many current
telepresence systems. Unless one uses RTCP-terminating topologies
such as Topo-RTCP-terminating-MCU, sessions involving more than one
video subunit with a common CNAME are close to unavoidable.
The different RTP streams comprising a split terminal system can form
a single RTP session or they can form multiple RTP sessions,
depending on the visibility of their SSRC values in RTCP reports. If
the receiver of the RTP streams sent by the split terminal sends
reports relating to all of the RTP flows (i.e., to each SSRC) in each
RTCP report, then a single RTP session is formed. Alternatively, if
the receiver of the RTP streams sent by the split terminal does not
send cross-reports in RTCP, then the audio and video form separate
RTP sessions.
For example, in Figure 20, B will send RTCP reports to each of the
subunits of A. If the RTCP packets that B sends to the audio subunit
of A include reports on the reception quality of the video as well as
the audio, and similarly if the RTCP packets that B sends to the
video subunit of A include reports on the reception quality of the
audio as well as video, then a single RTP session is formed.
However, if the RTCP packets B sends to the audio subunit of A only
report on the received audio, and the RTCP packets B sends to the
video subunit of A only report on the received video, then there are
two separate RTP sessions.
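
The distinction drawn above can be sketched as a simple check, shown
here in Python. The function name, the dictionary layout, and the
SSRC values are hypothetical. The sketch assumes that what B reports
to each subunit is already known as a set of SSRCs. The text does not
cover the case where only some subunits receive cross-reports, and
the sketch treats that case as not forming a single session.

```python
def forms_single_session(reported_ssrcs_by_subunit, sender_ssrcs):
    """Return True if the RTCP sent to every subunit reports on all
    of the split terminal's SSRCs (that is, cross-reports are sent)."""
    expected = set(sender_ssrcs)
    return all(expected <= set(reported)
               for reported in reported_ssrcs_by_subunit.values())


AUDIO_SSRC = 0x1111  # hypothetical SSRC of A's audio subunit
VIDEO_SSRC = 0x2222  # hypothetical SSRC of A's video subunit

# B reports on both streams in the RTCP it sends to each subunit.
cross_reporting = {
    "audio": {AUDIO_SSRC, VIDEO_SSRC},
    "video": {AUDIO_SSRC, VIDEO_SSRC},
}
# B reports only on audio to the audio subunit, and only on video
# to the video subunit.
separate_reporting = {
    "audio": {AUDIO_SSRC},
    "video": {VIDEO_SSRC},
}

single = forms_single_session(cross_reporting, [AUDIO_SSRC, VIDEO_SSRC])
split = forms_single_session(separate_reporting, [AUDIO_SSRC, VIDEO_SSRC])
```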
Forming a single RTP session across the RTP streams sent by the
different subunits of a split terminal gives each subunit visibility
into the reception quality of RTP streams sent by the other subunits.
This information can help diagnose reception quality problems, but at
the cost of increased RTCP bandwidth use.
RTP streams sent by the subunits of a split terminal need to use the
same CNAME in their RTCP packets if they are to be synchronized,
irrespective of whether a single RTP session is formed or not.
3.11. Non-symmetric Mixer/Translators
Shortcut name: Topo-Asymmetric
It is theoretically possible to construct an MCU that is a mixer in
one direction and a translator in another. The main reason to
consider this would be to allow topologies similar to Figure 13,
where the mixer does not need to mix in the direction from B or D
towards the multicast domains with A and C. Instead, the RTP streams
from B and D are forwarded without changes. Avoiding this mixing
would save the media processing resources that mixing would otherwise
consume where it isn't needed. However, there would still be a need to
mix B's media towards D. Only in the direction B -> multicast domain
or D -> multicast domain would it be possible to work as a
translator. In all other directions, it would function as a mixer.
The mixer/translator would still need to process and change the RTCP
before forwarding it in the directions of B or D to the multicast
domain. One issue is that A and C do not know about the mixed-media
stream the mixer sends to either B or D. Therefore, any reports
related to these streams must be removed. Also, receiver reports
related to A's and C's RTP streams would be missing. To avoid A and
C thinking that B and D aren't receiving A and C at all, the mixer
needs to insert locally generated reports reflecting the situation
for the streams from A and C into B's and D's sender reports. In the
opposite direction, the receiver reports from A and C about B's and
D's streams also need to be aggregated into the mixer's receiver
reports sent to B and D. Since B and D only have the mixer as source
for the stream, all RTCP from A and C must be suppressed by the
mixer.
This topology is so problematic, and it is so easy to get the RTCP
processing wrong, that it is not recommended for implementation.
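
To give a sense of the bookkeeping involved, the following Python
sketches one of the rewriting steps described above: handling the
report blocks of a unicast endpoint (B or D) before forwarding them
into the multicast domain. It is an illustration only, not a
recommended design. The ReportBlock type, the function name, and all
SSRC values are hypothetical, and real report blocks also carry loss
and jitter statistics that are omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportBlock:
    reporter_ssrc: int  # SSRC of the endpoint issuing the report
    about_ssrc: int     # SSRC of the stream being reported on


def rewrite_toward_multicast(reporter_ssrc, blocks, mixer_ssrcs,
                             multicast_ssrcs):
    """Rewrite the report blocks of B or D for the multicast domain."""
    # Remove reports about the mixed streams, which A and C never see.
    kept = [b for b in blocks if b.about_ssrc not in mixer_ssrcs]
    # Insert locally generated reports about the multicast sources so
    # that A and C do not conclude that this endpoint receives nothing.
    local = [ReportBlock(reporter_ssrc, s) for s in sorted(multicast_ssrcs)]
    return kept + local


# Hypothetical SSRC assignments.
SSRC_A, SSRC_B, SSRC_C, SSRC_MIXER = 0xA, 0xB, 0xC, 0x99

# B only ever receives the mixed stream, so that is all it reports on.
from_b = [ReportBlock(SSRC_B, SSRC_MIXER)]
rewritten = rewrite_toward_multicast(
    SSRC_B, from_b, {SSRC_MIXER}, {SSRC_A, SSRC_C})
```

Even this single step needs per-direction state about which SSRCs
are visible in which domain, and the opposite direction needs its
own aggregation and suppression logic.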
3.12. Combining Topologies
Topologies can be combined and linked to each other using mixers or
translators. However, care must be taken in handling the SSRC/CSRC
space. A mixer does not forward RTCP from sources in other domains,
but instead generates its own RTCP packets for each domain it mixes
into, including the necessary SDES information for both the CSRCs and
the SSRCs. Thus, in a mixed domain, the only SSRCs seen will be the
ones present in the domain, while there can be CSRCs from all the
domains connected together with a combination of mixers and
translators. The combined SSRC and CSRC space is common over any
translator or mixer. This is important to facilitate loop detection,
something that is likely to be even more important in combined
topologies due to the mixed behavior between the domains. Any
hybrid, like the Topo-Video-switch-MCU or Topo-Asymmetric, requires
considerable thought on how RTCP is dealt with.
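
Because the SSRC and CSRC space is common across the combined
topology, an endpoint can look for its own identifiers in incoming
packets. The Python below is a minimal sketch of that check. The
function name and SSRC values are hypothetical, and the check only
flags candidates: telling a true loop apart from an SSRC collision
needs the further steps in RFC 3550, Section 8.2.

```python
def may_be_looped(own_ssrcs, packet_ssrc, packet_csrcs):
    """Return True if a received packet carries one of this endpoint's
    own SSRCs, either as its SSRC or in its CSRC list."""
    own = set(own_ssrcs)
    return packet_ssrc in own or not own.isdisjoint(packet_csrcs)


OWN = {0x1111}  # hypothetical SSRC used by this endpoint

unrelated = may_be_looped(OWN, 0x2222, [0x3333])        # other sources only
own_stream = may_be_looped(OWN, 0x1111, [])             # own stream returned
own_in_mix = may_be_looped(OWN, 0x9999, [0x3333, 0x1111])  # own media in a mix
```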