Identifying Modified Explicit Congestion Notification (ECN) Semantics
for Ultra-Low Queuing Delay (L4S)

Nokia Bell Labs, Antwerp, Belgium
koen.de_schepper@nokia.com
https://www.bell-labs.com/usr/koen.de_schepper

CableLabs, UK
ietf@bobbriscoe.net
http://bobbriscoe.net/

Transport
Transport Services (tsv)
Internet-Draft (I-D)

This specification defines the identifier to be used on IP packets
for a new network service called low latency, low loss and scalable
throughput (L4S). It is similar to the original (or 'Classic') Explicit
Congestion Notification (ECN). 'Classic' ECN marking was required to be
equivalent to a drop, both when applied in the network and when
responded to by a transport. Unlike 'Classic' ECN marking, for packets
carrying the L4S identifier, the network applies marking more
immediately and more aggressively than drop, and the transport response
to each mark is reduced and smoothed relative to that for drop. The two
changes counterbalance each other so that the throughput of an L4S flow
will be roughly the same as a 'Classic' flow under the same conditions.
However, the much more frequent control signals and the finer responses
to them result in ultra-low queuing delay without compromising link
utilization, and low delay is maintained during high load. Examples of
new active queue management (AQM) marking algorithms and examples of new
transports (whether TCP-like or real-time) are specified separately. The
new L4S identifier is the key piece that enables them to interwork and
distinguishes them from 'Classic' traffic. It gives an incremental
migration path so that existing 'Classic' TCP traffic will be no worse
off, but it can be prevented from degrading the ultra-low delay and loss
of the new scalable transports.

This specification defines the identifier to be used on IP packets
for a new network service called low latency, low loss and scalable
throughput (L4S). It is similar to the original (or 'Classic') Explicit
Congestion Notification (ECN). 'Classic' ECN
marking was required to be equivalent to a drop, both when applied in
the network and when responded to by a transport. Unlike 'Classic' ECN
marking, the network applies L4S marking more immediately and more
aggressively than drop, and the transport response to each mark is
reduced and smoothed relative to that for drop. The two changes
counterbalance each other so that the throughput of an L4S flow will be
roughly the same as a 'Classic' flow under the same conditions.
Nonetheless, the much more frequent control signals and the finer
responses to them result in ultra-low queuing delay without compromising
link utilization, and low delay is maintained during high load.

An example of a scalable transport that would enable the L4S service
is Data Centre TCP (DCTCP), which until now has been applicable solely
to controlled environments like data centres, because it is too
aggressive to co-exist with existing TCP. The DualQ Coupled AQM, which
is defined in a complementary experimental specification, is an
AQM framework that enables scalable transports like DCTCP to co-exist
with existing traffic, each getting roughly the same flow rate when they
compete under similar conditions. Note that a transport such as DCTCP is
still not safe to deploy on the Internet unless it satisfies the
requirements listed later in this specification.

The new L4S identifier is the key piece that enables L4S hosts and
L4S network nodes to interwork and distinguishes their traffic from
'Classic' traffic. It gives an incremental migration path so that
existing 'Classic' TCP traffic will be no worse off, but it can be
prevented from degrading the ultra-low delay and loss of the new
scalable transports. The performance improvement is so great that it is
motivating initial deployment of the separate parts of this system.

Latency is becoming the critical performance factor for many
(most?) applications on the public Internet, e.g. interactive Web, Web
services, voice, conversational video, interactive video, interactive
remote presence, instant messaging, online gaming, remote desktop,
cloud-based applications, and video-assisted remote control of
machinery and industrial processes. In the developed world, further
increases in access network bit-rate offer diminishing returns,
whereas latency is still a multi-faceted problem. In the last decade
or so, much has been done to reduce propagation time by placing caches
or servers closer to users. However, queuing remains a major
intermittent component of latency.

The Diffserv architecture provides Expedited Forwarding, so that low
latency traffic can jump the queue of
other traffic. However, on access links dedicated to individual sites
(homes, small enterprises or mobile devices), often all traffic at any
one time will be latency-sensitive. Then Diffserv is of little use.
Instead, we need to remove the causes of any unnecessary delay.

The bufferbloat project has shown that excessively-large buffering
('bufferbloat') has been introducing significantly more delay than the
underlying propagation time. These delays appear only
intermittently—only when a capacity-seeking (e.g. TCP) flow is
long enough for the queue to fill the buffer, making every packet in
other flows sharing the buffer sit through the queue.

Active queue management (AQM) was originally developed to solve
this problem (and others). Unlike Diffserv, which gives low latency to
some traffic at the expense of others, AQM controls latency for all traffic in a class. In general, AQMs
introduce an increasing level of discard from the buffer the longer
the queue persists above a shallow threshold. This gives sufficient
signals to capacity-seeking (aka. greedy) flows to keep the buffer
empty for its intended purpose: absorbing bursts. However,
RED and other algorithms from the 1990s
were sensitive to their configuration and hard to set correctly. So,
AQM was not widely deployed.

More recent state-of-the-art AQMs, e.g. fq_CoDel, PIE and Adaptive
RED, are easier to configure, because
they define the queuing threshold in time not bytes, so it is
invariant for different link rates. However, no matter how good the
AQM, the sawtoothing rate of TCP will either cause queuing delay to
vary or cause the link to be under-utilized. Even with a perfectly
tuned AQM, the additional queuing delay will be of the same order as
the underlying speed-of-light delay across the network. Flow-queuing
can isolate one flow from another, but it cannot isolate a TCP flow
from the delay variations it inflicts on itself, and it has other
problems - it overrides the flow rate decisions of variable rate video
applications, it does not recognise the flows within IPSec VPN tunnels
and it is relatively expensive to implement.

Latency is not our only concern: it was known when TCP was first
developed that it would not scale to high bandwidth-delay products.
Given regular broadband bit-rates over WAN distances are already beyond
the scaling range of 'Classic' TCP Reno, 'less unscalable' Cubic and
Compound variants of TCP have been
successfully deployed. However, these are now approaching their
scaling limits. Unfortunately, fully scalable TCPs such as DCTCP cause 'Classic' TCP to starve itself, which is why
they have been confined to private data centres or research testbeds
(until now).

It turns out that a TCP algorithm like DCTCP that solves the
latency problem also solves TCP's scalability problem. The finer
sawteeth have low amplitude, so they cause very little queuing delay
variation and the number of sawteeth per round trip remains invariant,
which maintains constant tight control as flow-rate scales. A
supporting paper gives the full explanation
of why the design solves both the latency and the scaling problems,
both in plain English and in more precise mathematical form. The
explanation is summarised without the maths in the L4S architecture
document.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
RFC 2119. In this document, these words will appear
with that interpretation only when in ALL CAPS. Lower case uses of
these words are not to be interpreted as carrying RFC-2119
significance.

The 'Classic' service is intended for all the behaviours that currently
co-exist with TCP Reno (e.g. TCP Cubic, Compound, SCTP, etc).

The 'L4S' service is intended for traffic from scalable TCP algorithms
such as Data Centre TCP. But it is also more general: it allows the set
of congestion controls with similar scaling properties to DCTCP to
evolve (e.g. Relentless TCP and the L4S variant of SCREAM for real-time
media).

Both Classic and L4S services can cope with a proportion of
unresponsive or less-responsive traffic as well, as long as it does not
build a queue (e.g. DNS, VoIP, game sync datagrams, etc).

'Classic ECN' refers to the original Explicit Congestion Notification
(ECN) protocol.

The new L4S identifier defined in this specification is applicable
for IPv4 and IPv6 packets (as for classic ECN). It is applicable for
the unicast, multicast and anycast forwarding modes.

The L4S identifier is an orthogonal packet classification to the
Differentiated Services Code Point (DSCP). The later discussion of
combining L4S with Diffserv explains what this means in practice.

This document is intended for experimental status, so it does not
update any standards track RFCs. Therefore it depends on a separate
standards track specification that:

*  updates the ECN proposed standard to allow experimental track RFCs
   to relax the requirement that an ECN mark must be equivalent to a
   drop, both when applied by the network, and when responded to by the
   sender;

*  changes the status of the experimental ECN nonce to historic;

*  makes consequent updates to the following additional proposed
   standard RFCs to reflect the above two bullets:

   -  ECN for RTP;

   -  the congestion control specifications of various DCCP congestion
      control identifier (CCID) profiles.

This subsection briefly records the process that led to a consensus
choice of L4S identifier, selected from all the alternatives considered.

Ideally, the identifier for packets using the Low Latency, Low Loss,
Scalable throughput (L4S) service ought to meet the following
requirements:

*  it SHOULD survive end-to-end between source and destination
   applications: across the boundary between host and network, between
   interconnected networks, and through middleboxes;

*  it SHOULD be common to IPv4 and IPv6 and transport-agnostic;

*  it SHOULD be incrementally deployable;

*  it SHOULD enable an AQM to classify packets encapsulated by outer
   IP or lower-layer headers;

*  it SHOULD consume minimal extra codepoints;

*  it SHOULD NOT lead to some packets of a transport-layer flow being
   served by a different queue from others.

Whether the identifier would be recoverable if the experiment failed
is a factor that could be taken into account. However, this has not been
made a requirement, because that would favour schemes that would be
easier to fail, rather than those more likely to succeed.

It is recognised that the chosen identifier is unlikely to satisfy
all these requirements, particularly given the limited space left in the
IP header. Therefore a compromise will be necessary, which is why all
the requirements are expressed with the word 'SHOULD' not 'MUST'. A
separate analysis discusses the pros and cons of the compromises
made in various competing identification schemes against the above
requirements.

On the basis of this analysis, "ECT(1) and CE codepoints" is the best
compromise. Therefore this scheme is defined in detail in the following
sections, while the rationale for this decision is recorded separately.

The L4S treatment is an experimental track alternative packet marking
treatment to the classic ECN treatment, which has been updated to allow
this experiment (amongst others). Like classic ECN, L4S ECN
identifies both network and host behaviour: it identifies the marking
treatment that network nodes are expected to apply to L4S packets, and
it identifies packets that have been sent from hosts that are expected
to comply with a broad type of sending behaviour.

For a packet to receive L4S treatment as it is forwarded, the sender
sets the ECN field in the IP header to the ECT(1) codepoint. The full
transport layer behaviour requirements, including feedback and
congestion response, are given below.

A network node that implements the L4S service normally classifies
arriving ECT(1) and CE packets for L4S treatment. The full network
element behaviour requirements, including classification, ECN-marking
and interaction of the L4S identifier with other identifiers and
per-hop behaviours, are also given below.

For a packet to receive L4S treatment as it is forwarded, the
sender MUST set the ECN field in the IP header (v4 or v6) to the
ECT(1) codepoint.

In general, a scalable congestion control needs feedback of the
extent of CE marking on the forward path. Due to the history of TCP
development, when ECN was added TCP reported no more than one CE mark
per round trip. Some transport protocols derived from TCP mimic this
behaviour while others report the accurate extent of CE marking. This
means that some transport protocols will need to be updated as a
prerequisite for scalable congestion control. The position for a few
well-known transport protocols is given below.

TCP:  Support for accurate ECN feedback (AccECN) by both ends is a
   prerequisite for scalable congestion control. Therefore, the
   presence of ECT(1) in the IP headers even in one direction of a
   TCP connection will imply that both ends support AccECN. However,
   the converse does not apply. So even if both ends support AccECN,
   either of the two ends can choose not to use a scalable congestion
   control, whatever the other end's choice.

SCTP:  An ECN feedback protocol such as that specified in a separate
   draft would be a prerequisite for scalable congestion control. That
   draft would update the ECN feedback protocol sketched out in
   Appendix A of the standards track specification of SCTP by adding a
   field to report the number of CE marks.

RTP:  A prerequisite for scalable congestion control is for both (all)
   ends of one media-level hop to signal ECN support using the
   ecn-capable-rtp attribute. Therefore, the presence of ECT(1) implies
   that both (all) ends of that hop support ECN. However, the converse
   does not apply, so each end of a media-level hop can independently
   choose not to use a scalable congestion control, even if both ends
   support ECN.

DCCP:  The ACK vector in DCCP is already sufficient to report the
   extent of CE marking as needed by a scalable congestion control.

As a condition for a host to send packets with the L4S identifier
(ECT(1)), it SHOULD implement a congestion control behaviour that
ensures the flow rate is inversely proportional to the proportion of
bytes in packets marked with the CE codepoint. This is termed a
scalable congestion control, because the number of control signals
(ECN marks) per round trip remains roughly constant for any flow rate.
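
The scaling property described above can be illustrated numerically.
The following is a minimal sketch under stated assumptions, not part of
any specification. It models an idealised steady-state congestion
window of c/p packets for a scalable control, where the constant c = 2
and all function names are chosen here purely for illustration. For
contrast it uses the widely used square-root approximation
1.22/sqrt(p) for a Reno-like control. It counts packets rather than
bytes for simplicity.

```python
import math

SCALABLE_C = 2.0   # assumed constant, for illustration only
RENO_C = 1.22      # constant of the common square-root approximation


def scalable_window(p: float) -> float:
    """Idealised steady-state window (packets), inversely proportional to
    the marking fraction p."""
    return SCALABLE_C / p


def reno_window(p: float) -> float:
    """Reno-like steady-state window (packets), proportional to 1/sqrt(p)."""
    return RENO_C / math.sqrt(p)


def signals_per_rtt(window: float, p: float) -> float:
    """Expected number of congestion signals (marks or drops) per round trip."""
    return window * p


if __name__ == "__main__":
    for p in (0.1, 0.01, 0.001):
        ws, wr = scalable_window(p), reno_window(p)
        print(f"p={p:<6} scalable W={ws:8.1f} signals/RTT={signals_per_rtt(ws, p):.2f}"
              f" | Reno-like W={wr:6.1f} signals/RTT={signals_per_rtt(wr, p):.3f}")
```

In this model the scalable control receives c signals per round trip
whatever its window, whereas the Reno-like control receives fewer
signals per round trip as its window grows.
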
As with all transport behaviours, a detailed specification will need
to be defined for each type of transport or application, including the
timescale over which the proportionality is averaged, and control of
burstiness. The inverse proportionality requirement above is worded as
a 'SHOULD' rather than a 'MUST' to allow reasonable flexibility when
defining these specifications.

Data Center TCP (DCTCP) is an example of a scalable congestion control.

Each sender in a session can use a scalable congestion control
independently of the congestion control used by the receiver(s) when
they send data. Therefore there might be ECT(1) packets in one
direction and ECT(0) or Not-ECT in the other.

In order to coexist safely with other Internet traffic, a scalable
congestion control MUST NOT tag its packets with the ECT(1) codepoint
unless it complies with the following bulleted requirements. The
specification of a particular scalable congestion control MUST
describe in detail how it satisfies each requirement:

*  A scalable congestion control MUST react to packet loss in a way
   that will coexist safely with a TCP Reno congestion control (see
   the appendix for rationale).

*  A scalable congestion control MUST react to ECN marking from a
   non-L4S but ECN-capable bottleneck in a way that will coexist with
   a TCP Reno congestion control (see the appendix for rationale).

*  A scalable congestion control MUST reduce or eliminate RTT bias
   over as wide a range of RTTs as possible, or at least over the
   typical range of RTTs that will interact in the intended
   deployment scenario (see the appendix for rationale).

*  A scalable congestion control MUST remain responsive to congestion
   when the RTT is significantly smaller than in the current public
   Internet (see the appendix for rationale).

*  A scalable congestion control MUST detect loss by counting in
   units of time, which is scalable, and MUST NOT count in units of
   packets (as in the 3 DupACK rule of traditional TCP), which is not
   scalable (see the appendix for rationale).

A network node that implements the L4S service MUST classify
arriving ECT(1) packets for L4S treatment and it SHOULD classify
arriving CE packets for L4S treatment as well. A possible exception to
this latter rule for some CE packets is described below.

An L4S AQM treatment follows similar codepoint transition rules to
those in RFC 3168. Specifically, the ECT(1) codepoint MUST NOT be
changed to any other codepoint than CE, and CE MUST NOT be changed to
any other codepoint. An ECT(1) packet is classified as ECN-capable
and, if congestion increases, an L4S AQM algorithm will mark the ECN
field as CE for an increasing proportion of packets, otherwise
forwarding packets unchanged as ECT(1). Necessary conditions for an
L4S marking treatment are defined below.
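
The classification and codepoint transition rules above can be
sketched as follows. This is an illustrative sketch only. The function
names are hypothetical, and the decision of when to mark (the
`congested` flag) is left abstract, because marking algorithms are
specified separately. The two-bit ECN field values are those assigned
in RFC 3168.

```python
from enum import IntEnum


class ECN(IntEnum):
    """Two-bit ECN field values as assigned in RFC 3168."""
    NOT_ECT = 0b00
    ECT_1 = 0b01
    ECT_0 = 0b10
    CE = 0b11


def classify_for_l4s(ecn: ECN) -> bool:
    """True if a node without flow awareness classifies the packet for
    L4S treatment (ECT(1) and CE)."""
    return ecn in (ECN.ECT_1, ECN.CE)


def l4s_transition_allowed(before: ECN, after: ECN) -> bool:
    """Check the transition rules quoted above for L4S-classified packets."""
    if before == ECN.ECT_1:
        return after in (ECN.ECT_1, ECN.CE)  # forwarded unchanged, or marked
    if before == ECN.CE:
        return after == ECN.CE               # CE is never changed
    raise ValueError("not an L4S-classified codepoint")


def l4s_mark(ecn: ECN, congested: bool) -> ECN:
    """Mark CE when the (hypothetical) AQM decides to signal congestion;
    otherwise forward the codepoint unchanged."""
    if congested and classify_for_l4s(ecn):
        return ECN.CE
    return ecn
```
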
Under persistent overload an L4S marking treatment SHOULD turn off ECN
marking, using drop as a congestion signal until the overload episode
has subsided, as recommended for all AQMs (Section 4.2.1 of the IETF
AQM recommendations), which follows the similar advice in RFC 3168
(Section 7).

For backward compatibility in uncontrolled environments, a network
node that implements the L4S treatment MUST also implement a classic
AQM treatment. It MUST classify arriving ECT(0) and Not-ECT packets
for treatment by the Classic AQM (see the discussion of the classifier
in the DualQ Coupled AQM specification). Classic treatment means
that the AQM will mark ECT(0) packets under the same conditions as it
would drop Not-ECT packets.

The likelihood that an AQM drops a Not-ECT Classic packet (p_C)
MUST be roughly proportional to the square of the likelihood that it
would have marked it if it had been an L4S packet (p_L). That is

   p_C ~= (p_L / k)^2

The constant of proportionality (k) does not have to be
standardised for interoperability, but a value of 2 is
RECOMMENDED.

The DualQ Coupled AQM specification specifies the essential aspects of
an L4S AQM, as well as recommending other aspects. It gives example
implementations in appendices.

The term 'likelihood' is used above to allow for marking and
dropping to be either probabilistic or deterministic. The example AQMs
in the DualQ Coupled AQM specification drop and mark
probabilistically, so the drop probability is arranged to be the
square of the marking probability. Nonetheless, an alternative AQM
that dropped and marked deterministically would be valid, as long as
the dropping frequency was proportional to the square of the marking
frequency.

Note that, contrary to RFC 3168, a Dual AQM implementing the L4S
and Classic treatments does not mark an ECT(1) packet under the same
conditions that it would have dropped a Not-ECT packet, as allowed by
the standards track update to RFC 3168. However, it does
mark an ECT(0) packet under the same conditions that it would have
dropped a Not-ECT packet.

To implement the L4S treatment, a network node does not need to
identify transport-layer flows. Nonetheless, if an implementer is
willing to identify transport-layer flows at a network node, and if
the most recent ECT packet in the same flow was ECT(0), the node MAY
classify CE packets for classic ECN
treatment. In all other cases, a network node MUST classify CE packets
for L4S treatment. Examples of such other cases are: i) if no ECT
packets have yet been identified in a flow; ii) if it is not desirable
for a network node to identify transport-layer flows; or iii) if the
most recent ECT packet in a flow was ECT(1).

If an implementer uses flow-awareness to classify CE packets, to
determine whether the flow is using ECT(0) or ECT(1) it only uses the
most recent ECT packet of a flow (this advice will need to be verified
as part of L4S experiments). This is because a sender might have to
switch from sending ECT(1) (L4S) packets to sending ECT(0) (Classic)
packets, or back again, in the middle of a transport-layer flow. Such
a switch-over is likely to be very rare, but it could be necessary if
the path bottleneck moves from a network node that supports L4S to one
that only supports Classic ECN. A host ought to be able to detect such
a change from a change in RTT variation.

In a typical case for the public Internet a network element
that implements L4S might want to classify some low-rate but
unresponsive traffic (e.g. DNS, voice, game sync packets) into the
low latency queue to mix with L4S traffic. Such non-ECN-based
packet types MUST be safe to mix with L4S traffic without harming
the low latency service.

In this case it would not be appropriate to call the queue an
L4S queue, because it is shared by L4S and non-L4S traffic.
Instead it will be called the low latency or L queue. The L queue
then offers two different treatments:

*  The L4S treatment, which is a combination of the L4S AQM treatment
   and a priority scheduling treatment;

*  The low latency treatment, which is solely the priority scheduling
   treatment, without ECN-marking by the AQM.

To identify packets for just the scheduling treatment, it would
be inappropriate to use the L4S ECT(1) identifier, because such
traffic is unresponsive to ECN marking. Therefore, a network
element that implements L4S MAY classify additional packets into
the L queue if they carry certain non-ECN identifiers. For
instance:

*  addresses of specific applications or hosts configured to be safe
   (but for example cannot set the ECN field for some temporary
   reason);

*  certain protocols that are usually lightweight (e.g. ARP, DNS);

*  specific Diffserv codepoints that indicate traffic with limited
   burstiness such as the EF (Expedited Forwarding) and Voice-Admit
   service classes or equivalent local-use DSCPs.

For clarity, non-ECN identifiers, such as the examples itemized
above, might be used by some network operators who believe they
identify non-L4S traffic that would be safe to mix with L4S
traffic. They are not alternative ways for a host to indicate that
it is sending L4S packets. Only the ECT(1) and CE ECN codepoints
indicate to a network element that a host is sending L4S packets -
specifically that the host claims its behaviour satisfies the
prerequisite transport requirements of this specification.

To extend the above example, an operator might want to exclude
some traffic from the L4S treatment for policy reasons, e.g.
security (traffic from malicious sources) or commercial (initially
the operator may wish to confine the benefits of L4S to business
customers).

In this exclusion case, the operator MUST classify on the
relevant locally-used identifiers (e.g. source addresses) before
classifying the non-matching traffic on the end-to-end L4S ECN
identifier.

The operator MUST NOT re-mark the end-to-end L4S identifier,
because its decision to exclude certain traffic from L4S treatment
is local-only. The end-to-end L4S identifier then survives for
other operators to use, or indeed, they can apply their own
policy, independently based on their own choice of locally-used
identifiers. This approach also allows any operator to remove its
locally-applied exclusions in future, e.g. if it wishes to widen
the benefit of the L4S treatment to all its customers.

L4S concerns low latency, which it can provide for all traffic
without differentiation and without affecting bandwidth allocation.
Diffserv provides for differentiation of both bandwidth and low
latency, but its control of latency depends on its control of
bandwidth. The two can be combined if a network operator wants to
control bandwidth allocation but it also wants to provide low
latency for any amount of traffic within one of these allocations of
bandwidth (rather than only providing low latency by limiting
bandwidth).

The examples above were framed in the context of splitting the
default Best Efforts Per-Hop Behaviour (PHB) into a Low Latency (L)
queue and a Classic (C) Queue. But, more generally, an operator
might choose to control bandwidth allocation through a hierarchy of
Diffserv PHBs at a node, and to split one or more of these PHBs into
a low latency and a classic variant of that PHB.

In the first case, where there are no other PHBs except the
DualQ, if a packet carries ECT(1) or CE, a network element would
classify it for the L4S treatment irrespective of its DSCP. And, if
a packet carried (say) the EF DSCP, the network element could
classify it into the L queue irrespective of its ECN codepoint.
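
The first case above, combined with the local-use exclusion and
inclusion examples given earlier, can be sketched as a simple
classifier. This is a hedged illustration, not a definitive
implementation. The function and parameter names are hypothetical,
sending excluded traffic to the Classic queue is an assumption, and
classifying EF into the L queue is an operator choice modelled here by
a flag. The DSCP occupies the upper six bits and the ECN field the
lower two bits of the IPv4 TOS or IPv6 Traffic Class octet, and the EF
codepoint is 46.

```python
from __future__ import annotations

EF_DSCP = 46              # Expedited Forwarding codepoint
ECT_1, CE = 0b01, 0b11    # ECN field values (RFC 3168)


def split_tos(tos: int) -> tuple[int, int]:
    """Split the IPv4 TOS / IPv6 Traffic Class octet into (DSCP, ECN)."""
    return tos >> 2, tos & 0b11


def classify(tos: int, src: str,
             excluded_sources: frozenset[str] = frozenset(),
             ef_into_l: bool = True) -> str:
    """Return 'L' (low latency queue) or 'C' (Classic queue).

    The packet itself is never modified, so the end-to-end L4S
    identifier is not re-marked.
    """
    dscp, ecn = split_tos(tos)
    # Local-use exclusions are checked before the end-to-end identifier.
    if src in excluded_sources:
        return "C"  # assumption: excluded traffic gets Classic treatment
    if ecn in (ECT_1, CE):
        return "L"  # L4S treatment, irrespective of DSCP
    if ef_into_l and dscp == EF_DSCP:
        return "L"  # scheduling treatment only, irrespective of ECN
    return "C"
```

For example, `classify((46 << 2) | 0b10, "192.0.2.1")` places an EF
packet carrying ECT(0) in the L queue, while the same call with
`ef_into_l=False` places it in the Classic queue.
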
However, where the DualQ is in a hierarchy of other PHBs, the
classifier would classify some traffic into other PHBs based on DSCP
before classifying between the low latency and classic queues (based on
ECT(1), CE and perhaps the EF DSCP or other identifiers as in the
above example).

A companion document describes how an
operator might use L4S to offer low latency for all L4S traffic as
well as using Diffserv for bandwidth differentiation. It identifies
two main types of approach, which can be combined: the operator
might split certain Diffserv PHBs between L4S and a corresponding
Classic service. Or it might split the L4S and/or the Classic
service into multiple Diffserv PHBs. In any of these cases, a packet
would have to be classified on its Diffserv and ECN codepoints.

In summary, there are numerous ways in which the L4S ECN
identifier (ECT(1) and CE) could be combined with other identifiers
to achieve particular objectives. The following categorization
articulates those that are valid, but it is not necessarily
exhaustive. Those tagged 'Global-use' could be set by the sending
host or a network. Those tagged 'Local-use' would only be set by a
network:

1.  Identifiers Complementing the L4S Identifier

    a.  Including More Traffic in the L Queue
        (Global-use or Local-use)

    b.  Excluding Certain Traffic from the L Queue
        (Local-use only)

2.  Identifiers to place L4S classification in a PHB Hierarchy
    (Global-use or Local-use)

    a.  PHBs Before L4S ECN Classification

    b.  PHBs After L4S ECN Classification

The DualQ Coupled AQM specification sets operational
and management requirements for experiments with DualQ Coupled AQMs.
General operational and management requirements for experiments with L4S
congestion controls are given in the preceding sections on transport
and network node behaviour, e.g. co-existence and
scaling requirements, incremental deployment arrangements. The
specification of each scalable congestion control will need to include
protocol-specific requirements for configuration and monitoring
performance during experiments. Appendix A of
provides a helpful checklist.

This specification contains no IANA considerations.

Approaches to assure the integrity of signals using the new identifier
are introduced separately.

Thanks to Richard Scheffenegger, John Leslie, David Täht,
Jonathan Morton, Gorry Fairhurst, Michael Welzl, Mikael Abrahamsson and
Andrew McGregor for the discussions that led to this specification.
Ing-jyh (Inton) Tsang was a contributor to the early drafts of this
document. The appendix listing the TCP Prague
Requirements is based on text authored by Marcelo Bagnulo Braun that was
originally an appendix to another draft. That
text was in turn based on the collective output of the attendees listed
in the minutes of a 'bar BoF' on DCTCP Evolution during IETF-94.

The authors' contributions were part-funded by the European Community
under its Seventh Framework Programme through the Reducing Internet
Transport Latency (RITE) project (ICT-317700). Bob Briscoe was also
part-funded by the Research Council of Norway through the TimeIn
project. The views expressed here are solely those of the authors.

References (titles and author affiliations):

*  "Adaptive RED: An Algorithm for Increasing the Robustness of RED's
   Active Queue Management", ACIRI.

*  "Relentless Congestion Control", PSC.

*  "'Data Centre to the Home': Ultra-Low Latency for All", Bell Labs,
   Simula Research Lab, BT.

*  "One more bit is enough".

*  "Up to Speed with Queue View", BT, Uni Karlstad.

*  "Analysis of DCTCP: Stability, Convergence, and Fairness".

*  "Notes: DCTCP evolution 'bar BoF': Tue 21 Jul 2015, 17:40, Prague",
   Simula.

*  "Scaling TCP's Congestion Window for Small Round Trip Times", BT,
   Bell Labs.

*  "Rapid Acceleration in TCP Prague", University of Oslo.

*  "Congestion Avoidance and Control".

This appendix is informative, not normative. It gives a list of
modifications to current scalable transport protocols so that they can
be deployed over the public Internet and coexist safely with existing
traffic. The list complements the normative requirements in the body
of this specification that a sender has to comply with before
it can set the L4S identifier in packets it sends into the Internet. As
well as necessary safety improvements (requirements) this appendix also
includes preferable performance improvements (optimizations).

These recommendations have become known as the TCP Prague
Requirements, because they were originally identified at an ad hoc
meeting during IETF-94 in Prague. The wording
has been generalized to apply to all scalable congestion controls, not
just TCP congestion control specifically.

DCTCP is currently the most widely used
scalable transport protocol. In its current form, DCTCP is specified to
be deployable only in controlled environments. Deploying it in the
public Internet would lead to a number of issues, both from the safety
and the performance perspective. The modifications and additional
mechanisms listed in this section will be necessary for its deployment
over the global Internet. Where an example is needed, DCTCP is used as a
base, but it is likely that most of these requirements equally apply to
other scalable transport protocols.

Description: A scalable congestion control needs to distinguish
the packets it sends from those sent by classic congestion
controls.

Motivation: It needs to be possible for a network node to
classify L4S packets without flow state into a queue that applies an
L4S ECN marking behaviour and isolates L4S packets from the queuing
delay of classic packets.

Description: A scalable transport protocol needs to provide
timely, accurate feedback about the extent of ECN marking
experienced by all packets.

Motivation: Classic congestion controls only need feedback about
the existence of a congestion episode within a round trip, not
precisely how many packets were marked with ECN or dropped.
Therefore, in 2001, when ECN feedback was added to TCP, it could not
inform the sender of more than one ECN mark per RTT. Since then,
requirements for more accurate ECN feedback in TCP have been defined,
and the AccECN draft specifies an experimental change to the TCP wire
protocol to satisfy these requirements. Most other transport protocols
already satisfy this requirement.

Description: A scalable congestion control needs to react to
packet loss in a way that will coexist safely with a TCP Reno
congestion control.

Motivation: Part of the safety conditions for deploying a
scalable congestion control on the public Internet is to make sure
that it behaves properly when it builds a queue at a network
bottleneck that has not been upgraded to support L4S. Packet loss
can have many causes, but it usually has to be conservatively
assumed that it is a sign of congestion. Therefore, on detecting
packet loss, a scalable congestion control will need to fall back to
classic congestion control behaviour. If it does not comply with
this requirement it could starve classic traffic.

A scalable congestion control can be used for different types of
transport, e.g. for real-time media or for reliable bulk transport
like TCP. Therefore, the particular classic congestion control
behaviour to fall back on will need to be part of the congestion
control specification of the relevant transport. In the particular
case of DCTCP, the current DCTCP specification states that "It is
RECOMMENDED that an implementation deal with loss episodes in the
same way as conventional TCP." For safe deployment of a scalable
transport in the public Internet, the above requirement would need
to be defined as a "MUST".

Packet loss might (rarely) occur in the case that the bottleneck
is L4S capable. In this case, the sender may receive a high number
of packets marked with the CE bit set and also experience a loss.
Current DCTCP implementations react differently to this situation.
At least one implementation reacts only to the drop signal (e.g. by
halving the CWND) and at least one other DCTCP implementation reacts
to both signals (e.g. by halving the CWND due to the drop and also
further reducing the CWND based on the proportion of marked packets).
We believe that further experimentation is needed to understand what
is the best behaviour for the public Internet, which may or may not be
one of these existing approaches.

Description: A scalable congestion control needs to react to ECN
marking from a non-L4S but ECN-capable bottleneck in a way that will
coexist with a TCP Reno congestion control.

Motivation: Similarly to the requirement in , this requirement is a safety
condition to ensure a scalable congestion control behaves properly
when it builds a queue at a network bottleneck that has not been
upgraded to support L4S. On detecting classic ECN marking (see
below), a scalable congestion control will need to fall back to
classic congestion control behaviour. If it does not comply with
this requirement it could starve classic traffic.

It would take time for endpoints to distinguish classic and L4S
ECN marking. An increase in queuing delay or in delay variation
would be a tell-tale sign, but it is not yet clear where a line
would be drawn between the two behaviours. It might be possible to
cache what was learned about the path to help subsequent attempts to
detect the type of marking.

Description: A scalable congestion control needs to reduce or
eliminate RTT bias over as wide a range of RTTs as possible, or at
least over the typical range of RTTs that will interact in the
intended deployment scenario.

Motivation: Classic TCP's throughput is known to be inversely
proportional to RTT, so one would expect flows over very low RTT
paths to nearly starve flows over larger RTTs. However, Classic TCP
has never allowed a very low RTT path to exist because it induces a
large queue. For instance, consider two paths with base RTT 1ms and
100ms. If Classic TCP induces a 100ms queue, it turns these RTTs
into 101ms and 200ms leading to a throughput ratio of about 2:1.
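
The arithmetic behind such ratios can be sketched as follows. This is only an illustration, assuming throughput is exactly inversely proportional to total RTT (base RTT plus a shared queuing delay); the function name `throughput_ratio` is hypothetical and not part of any specification.

```python
def throughput_ratio(base_rtt_short_ms, base_rtt_long_ms, queue_delay_ms):
    """Ratio of short-path to long-path throughput.

    Assumption: throughput is inversely proportional to total RTT,
    where total RTT = base RTT + queuing delay at the shared bottleneck.
    """
    rtt_short = base_rtt_short_ms + queue_delay_ms
    rtt_long = base_rtt_long_ms + queue_delay_ms
    return rtt_long / rtt_short


# Two flows with base RTTs of 1 ms and 100 ms share one bottleneck queue.
deep_queue_ratio = throughput_ratio(1, 100, queue_delay_ms=100)    # 200 / 101
shallow_queue_ratio = throughput_ratio(1, 100, queue_delay_ms=1)   # 101 / 2

print(f"100 ms queue: about {deep_queue_ratio:.1f}:1")
print(f"1 ms queue:   about {shallow_queue_ratio:.1f}:1")
```

Varying `queue_delay_ms` shows how the bias grows as the queue shrinks towards zero.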
In contrast, if a Scalable TCP induces only a 1ms queue, the RTTs
become 2ms and 101ms, leading to a throughput ratio of about 50:1.

Therefore, with very small queues, long RTT flows will
essentially starve, unless scalable congestion controls comply with
this requirement.

Description: A scalable congestion control needs to remain
responsive to congestion when RTTs are significantly smaller than in
the current public Internet.

Motivation: As currently specified, the minimum required
congestion window of TCP (and its derivatives) is set to 2 maximum
segment sizes (MSS) (see equation (4) in ).
Once the congestion window reaches this minimum, all current TCP
algorithms become unresponsive to congestion signals. No matter how
much drop or ECN marking, the congestion window no longer reduces.
Instead, TCP forces the queue to grow, overriding any AQM and
increasing queuing delay.

L4S mechanisms significantly reduce queuing delay so, over the
same path, the RTT becomes lower. Then this problem becomes
surprisingly common. This is because,
for the same link capacity, smaller RTT implies a smaller window.
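
As a rough sketch of that relation, the per-flow window needed to fill a link is the capacity multiplied by the RTT, divided among the flows and expressed in segments. The function names below are hypothetical, equal sharing between flows and zero queuing are simplifying assumptions, and the example figures mirror the residential case discussed in this section.

```python
def window_in_segments(capacity_bps, rtt_ms, n_flows, mss_bytes=1500):
    """Per-flow window (in segments) that just fills the link.

    Assumptions: n_flows share the capacity equally and no queue builds,
    so the window equals each flow's share of the bandwidth-delay product.
    """
    bdp_bits = capacity_bps * rtt_ms / 1000
    return bdp_bits / (n_flows * mss_bytes * 8)


def rtt_at_min_window_ms(capacity_bps, n_flows, min_window=2, mss_bytes=1500):
    """RTT (ms) at or below which every flow is pinned at min_window segments."""
    return n_flows * min_window * mss_bytes * 8 * 1000 / capacity_bps


# 8 Mb/s upstream link, 1500 B segments, two flows.
print(f"window at 6 ms RTT: {window_in_segments(8_000_000, 6, n_flows=2):.1f} segments")
print(f"RTT at 2-segment floor: {rtt_at_min_window_ms(8_000_000, n_flows=2):.1f} ms")
```

Raising `n_flows` or lowering the RTT pushes the computed window below the 2-segment floor, which is the regime in which the flows can no longer reduce their window.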
For instance, consider a residential setting with an upstream
broadband Internet access of 8 Mb/s, assuming a max segment size of
1500 B. Two upstream flows will each have the minimum window of 2
MSS if the RTT is 6ms or less, which is quite common when accessing
a nearby data centre. So, any more than two such parallel TCP flows
will become unresponsive and increase queuing delay.

Unless scalable congestion controls are required to comply with
this requirement from the start, they will frequently become
unresponsive, negating the low latency benefit of L4S, for
themselves and for others. One possible sub-MSS window mechanism is
described in , and other approaches
are likely to be feasible.

Description: A scalable congestion control needs to detect loss
by counting in units of time, which is scalable, rather than
counting in units of packets, which is not.

Motivation: If it is known that all L4S senders using a link obey
this rule, then link technologies that support L4S can remove the
head-of-line blocking delay they have to introduce while trying to
keep packets in tight order to avoid triggering loss detection based
on counting packets. End-systems cannot know whether a missing packet is due to loss
or reordering, except in hindsight - if it appears later. If senders
deem that loss has occurred by counting reordered packets (e.g. the
3 Duplicate ACK rule of Classic TCP), the time over which the
network has to keep packets in order scales down as packet rates
scale up over the years. In contrast, if senders allow a reordering
window in units of time before they deem there has been a loss, the
time over which the network has to keep packets in order stays
constant.

Tolerance of reordering over a small duration will allow parallel
(e.g. bonded-channel) link technologies to relax their need to
deliver packets strictly in order. Such links typically give
arriving packets a link-level sequence number and introduce delay
while buffering packets at the receiving end until they can be
delivered in the same order. For radio links, this delay usually
includes the time allowed for link-layer retransmissions.

For receivers that need their packets in order, it would seem
that relaxing network ordering would simply shift this reordering
delay from the network to the receiver. However, that is not true in
the general case because links generally do not recognize transport
layer flows and often cannot even see application layer streams
within the flows (as in SCTP, HTTP/2 or QUIC). So a link will often
be holding back packets from one flow or stream while waiting for
those from another. Relaxing strict ordering in the network will
remove this head-of-line blocking delay. {ToDo: this is being
quantified experimentally - will need to add the figures here.}

Classic TCP implementations are switching over to the time-based
approach of RACK (Recent ACKnowledgements). However, it will be many years
(decades?) before networks no longer have to allow for the presence
of traditional TCP senders still using the 3 DupACK rule. This
specification () says that
senders are not entitled to identify packets as L4S in the IP/ECN
field unless they use the time-based approach. Then networks that
identify L4S traffic separately (e.g. using ) can know for certain
that all L4S traffic is using the scalable time-based approach.

This will allow networks to remove head-of-line blocking delay
immediately, but only for L4S traffic. But Classic traffic will have
to wait for many years until incremental deployment of RACK has
become near-universal. Nonetheless, experience with RACK will
determine how much reordering tolerance networks will be able to
allow for L4S traffic.

Performance Optimization as well as Safety Improvement: The delay
benefit would be lost if any L4S sender did not follow the
time-based approach. Therefore, the time-based approach is made a
normative requirement (a necessary safety improvement). Nonetheless,
the time-based approach also enables a throughput benefit that a
flow can enjoy independently of others (a performance optimization),
explained next.

Given the requirement for a scalable congestion control to
fall back to Reno or Cubic on a loss (see ), it is important that a
scalable congestion control does not deem that a loss has occurred
too soon. If, later within the same round trip, an out-of-order
acknowledgement fills the gap, the sender would have halved its rate
spuriously (as well as retransmitting spuriously). With a RACK-like
approach, allowing longer before a loss is deemed to have occurred
maintains higher throughput in the presence of reordering {ToDo:
Quantify this statement}.

On the other hand, it is also important not to wait too long
before deeming that a gap is due to a loss (termed a long reordering
window), otherwise loss recovery would be slow.

The speed of loss recovery is much more significant for short
flows than long, therefore a good compromise would adapt the
reordering window; from a small fraction of the RTT at the start of
a flow, to a larger fraction of the RTT for flows that continue for
many round trips. This is the approach adopted by TCP RACK (Recent
ACKnowledgements) and
recommended for all L4S senders, whether using TCP or another
transport protocol.

Description: To improve performance, scalable transport protocols
ought to enable ECN at the IP layer in TCP control packets (SYN,
SYN-ACK, pure ACKs, etc.) and in retransmitted packets. The same is
true for derivatives of TCP, e.g. SCTP.

Motivation: RFC 3168 prohibits the use of ECN on these types of
TCP packet, based on a number of arguments. This means these packets
are not protected from congestion loss by ECN, which considerably
harms performance, particularly for short flows. counters each argument in
RFC 3168 in turn, showing it was over-cautious. Instead it proposes
experimental use of ECN on all types of TCP packet.

Description: It would improve performance if scalable congestion
controls did not limit their congestion window increase to the
traditional additive increase of 1 MSS per round trip during congestion avoidance. The same is true for
derivatives of TCP congestion control.

Motivation: As currently defined, DCTCP uses the traditional TCP
Reno additive increase in congestion avoidance phase. When the
available capacity suddenly increases (e.g. when another flow
finishes, or if radio capacity increases) it can take very many
round trips to take advantage of the new capacity. In the steady
state, DCTCP induces about 2 ECN marks per round trip, so it should
be possible to quickly detect when these signals have disappeared
and seek available capacity more rapidly. It will of course be
necessary to minimize the impact on other flows (classic and
scalable).

TCP Cubic was designed to solve this problem, but as flow rates
have continued to increase, the delay accelerating into available
capacity has become prohibitive. For instance, with RTT=20 ms, to
increase flow rate from 100Mb/s to 200Mb/s Cubic takes between 50
and 100 round trips. Every 8x increase in flow rate leads to 2x more
acceleration delay.

Description: Particularly when a flow starts, scalable congestion
controls need to converge (reach their steady-state share of the
capacity) at least as fast as classic TCP and preferably faster.
This does not just affect TCP Prague, but also the flow start
behaviour of any L4S congestion control derived from a Classic
transport that uses TCP slow start.

Motivation: As an example, a new DCTCP flow takes longer than
classic TCP to obtain its share of the capacity of the bottleneck
when there are already ongoing flows using the bottleneck capacity.
In a data centre environment DCTCP takes about a factor of 1.5 to 2
longer to converge due to the much higher typical level of ECN
marking that DCTCP background traffic induces, which causes new
flows to exit slow start early.
In testing for use over the public Internet the convergence time of
DCTCP relative to regular TCP is even less favourable. It is exacerbated by the typically
greater mismatch between the link rate of the sending host and
typical Internet access bottlenecks, in combination with the shallow
ECN marking threshold needed for TCP Prague. This problem is
detrimental in general, but would particularly harm the performance
of short flows relative to classic TCP.

This appendix is informative, not normative. It records the pros and
cons of various alternative ways to identify L4S packets, in order to document the
rationale for the choice of ECT(1) () as
the L4S identifier. At the end,
summarises the distinguishing features of the leading alternatives. It
is intended to supplement, not replace the detailed text.

The leading solutions all use the ECN field, sometimes in combination
with the Diffserv field. Both the ECN and Diffserv fields have the
additional advantage that they are no different in either IPv4 or IPv6.
A couple of alternatives that use other fields are mentioned at the end,
but it is quickly explained why they are not serious contenders.

Definition:

Packets with ECT(1) and conditionally packets with CE would
signify L4S semantics as an alternative to the semantics of
classic ECN, specifically:

The ECT(1) codepoint would signify that the packet was sent
by an L4S-capable sender;

Given the shortage of codepoints, both L4S and classic ECN
sides of an AQM would have to use the same CE codepoint to
indicate that a packet had experienced congestion. If a packet
that had already been marked CE in an upstream buffer arrived
at a subsequent AQM, this AQM would then have to guess whether
to classify CE packets as L4S or classic ECN. Choosing the L4S
treatment would be a safer choice, because then a few classic
packets might arrive early, rather than a few L4S packets
arriving late;

Additional information might be available if the classifier
were transport-aware. Then it could classify a CE packet for
classic ECN treatment if the most recent ECT packet in the
same flow had been marked ECT(0). However, the L4S service
ought not to need transport-layer awareness.

Cons:

The L4S service is
intended to supersede the service provided by classic ECN,
therefore using ECT(1) to identify L4S packets could ultimately
mean that the ECT(0) codepoint was 'wasted' purely to distinguish
one form of ECN from its successor;

It is not always
possible to support ECN in an AQM acting in a buffer below the IP
layer. In
such cases, the L4S service would have to drop rather than mark
frames even though they might contain an ECN-capable packet.
However, such cases would be unusual.

Having to
classify all CE packets as L4S risks some classic CE packets
arriving early, which is a form of reordering. Reordering can
cause the TCP sender to retransmit spuriously. However, one or two
packets delivered early does not cause any spurious
retransmissions because the subsequent packets continue to move
the cumulative acknowledgement boundary forwards. Anyway, the risk
of reordering would be low, because: i) it is quite unusual to
experience more than one bottleneck queue on a path; ii) even
then, reordering would only occur if there was simultaneous mixing
of classic and L4S traffic, which would be more unlikely in an
access link, which is where most bottlenecks are located; iii)
even then, spurious retransmissions would only occur if a
contiguous sequence of three or more classic CE packets from one
bottleneck arrived at the next, which should in itself happen very
rarely with a good AQM. The risk would be completely eliminated in
AQMs that were transport-aware (but they should not need to
be);

The classic ECN
RFCs and require
a sender to clear the ECN field to Not-ECT for retransmissions and
certain control packets, specifically pure ACKs, window probes and
SYNs. When L4S packets are classified by the ECN field alone,
these control packets would not be classified into an L4S queue,
and could therefore be delayed relative to the other packets in
the flow. This would not cause re-ordering (because
retransmissions are already out of order, and the control packets
carry no data). However, it would make critical control packets
more vulnerable to loss and delay. To address this problem, proposes an experiment in
which all TCP control packets and retransmissions are
ECN-capable.

Pros:

The ECN field generally works
end-to-end across the Internet. Unlike the DSCP, the setting of
the ECN field is at least forwarded unchanged by networks that do
not support ECN, and networks rarely clear it to zero;

Unlike Diffserv, ECN is
defined to always work across tunnels. However, tunnels do not
always implement ECN processing as they should do, particularly
because IPsec tunnels were defined differently for a few
years.

If all classic ECN
senders eventually evolve to use the L4S service, the ECT(0)
codepoint could be reused for some future purpose, but only once
use of ECT(0) packets had reduced to zero, or near-zero, which
might never happen.

Definition:

For packets with a defined DSCP, all codepoints of the ECN
field (except Not-ECT) would signify alternative L4S semantics to
those for classic ECN, specifically:

The L4S DSCP would signify that the packet came from an
L4S-capable sender;

ECT(0) and ECT(1) would both signify that the packet was
travelling between transport endpoints that were both
ECN-capable;

CE would signify that the packet had been marked by an AQM
implementing the L4S service.

Use of a DSCP is the only approach for alternative ECN semantics
given as an example in . However, it was
perhaps considered more for controlled environments than new
end-to-end services.

Cons:

A DSCP is obviously not
orthogonal to Diffserv. Therefore, wherever the L4S service is
applied to multiple Diffserv scheduling behaviours, it would be
necessary to replace each DSCP with a pair of DSCPs.

The
resulting increased number of DSCPs might be hard to support for
some lower layer technologies, e.g. 802.1p and MPLS both offer
only 3 bits for a maximum of 8 traffic class identifiers. Although
L4S should reduce and possibly remove the need for some DSCPs
intended for differentiated queuing delay, it will not remove the
need for Diffserv entirely, because Diffserv is also used to
allocate bandwidth, e.g. by prioritising some classes of traffic
over others when traffic exceeds available capacity.

Very few networks
honour a DSCP set by a host. Typically a network will zero
(bleach) the Diffserv field from all hosts. Sometimes networks
will attempt to identify applications by some form of packet
inspection and, based on network policy, they will set the DSCP
considered appropriate for the identified application.
Network-based application identification might use some
combination of protocol ID, port number(s), application layer
protocol headers, IP address(es), VLAN ID(s) and even packet
timing.

Very few networks
honour a DSCP received from a neighbouring network. Typically a
network will zero (bleach) the Diffserv field from all
neighbouring networks at an interconnection point. Sometimes
bilateral arrangements are made between networks, such that the
receiving network remarks some DSCPs to those it uses for roughly
equivalent services. The likelihood that a DSCP will be bleached
or ignored depends on the type of DSCP:

Local-use DSCPs: These tend to be used to
implement application-specific network policies, but a
bilateral arrangement to remark certain DSCPs is often applied
to DSCPs in the local-use range simply because it is easier
not to change all of a network's internal configurations when
a new arrangement is made with a neighbour;

Global-use DSCPs: These do not tend to be
honoured across network interconnections more than local-use
DSCPs. However, if two networks decide to honour certain of
each other's DSCPs, the reconfiguration is a little easier if
both of their globally recognised services are already
represented by the relevant global-use DSCPs. Note that today a global-use DSCP gives little
more assurance of end-to-end service than a local-use DSCP. In
future the global-use range might give more assurance of
end-to-end service than local-use, but it is unlikely that
either assurance will be high, particularly given that hosts
are included in the end-to-end path.

Diffserv codepoints are often not
propagated to the outer header when a packet is encapsulated by a
tunnel header. DSCPs are propagated to the outer of uniform mode
tunnels, but not pipe mode, and pipe mode
is fairly common.

Because this
approach uses both the Diffserv and ECN fields, an AQM will only
work at a lower layer if both can be supported. If individual
network operators wished to deploy an AQM at a lower layer, they
would usually propagate an IP Diffserv codepoint to the lower
layer, using for example IEEE 802.1p. However, the ECN capability
is harder to propagate down to lower layers because few lower
layers support it.

Pros:

If all usage of classic ECN
migrates to usage of L4S, the DSCP would become redundant, and the
ECN capability alone could eventually identify L4S packets without
the interconnection problems of Diffserv detailed above, and
without having permanently consumed more than one codepoint in the
IP header. Although the DSCP does not generally function as an
end-to-end identifier (see above), it could be used initially by
individual ISPs to introduce the L4S service for their own locally
generated traffic.

Definition:

This approach uses ECN capability alone as the L4S identifier.
It is only feasible if classic ECN is not widely deployed. The
specific definition of codepoints would be:

Any ECN codepoint other than Not-ECT would signify an
L4S-capable sender;

ECN codepoints would not be used for classic ECN, and the classic
network service would only be used for Not-ECT packets.

This approach would only be feasible if:

it was generally agreed that there was little chance of any
classic ECN deployment in any network nodes;

it was generally agreed that there was little chance of any
client devices being deployed with classic TCP-ECN on by default
(note that classic TCP-ECN is already on-by-default on many servers);

for TCP connections, developers of client OSs would all
have to agree not to encourage further deployment of classic
ECN. Specifically, at the start of a TCP connection classic
ECN could be disabled during negotiation of the ECN
capability:

an L4S-capable host would have to disable ECN if the
corresponding host did not support accurate ECN feedback,
which is a prerequisite for the L4S service;

developers of operating systems for user devices would
only enable ECN by default for TCP once the stack
implemented L4S and accurate ECN feedback, including requesting
accurate ECN feedback by default.

Cons:

The
constraints for deployment above represent a highly unlikely, but
not completely impossible, set of circumstances. If, despite the
above measures, a pair of hosts did negotiate to use classic ECN,
their packets would be classified into the same queue as L4S
traffic, and if they had to compete with a long-running L4S flow
they would get a very small capacity share;

See the same issue
with "ECT(1) and CE codepoints" ();

See the same
issue with "ECT(1) and CE codepoints" ().

Pros:

The ECT(1)
codepoint and all spare Diffserv codepoints would remain available
for future use;

As with "ECT(1) and CE codepoints"
();

As with "ECT(1) and CE
codepoints" ().

It has been suggested that a new ID in the IPv4 Protocol field or
the IPv6 Next Header field could identify L4S packets. However this
approach is ruled out by numerous problems:

A new protocol ID would need to be paired with the old one for
each transport (TCP, SCTP, UDP, etc.);

In IPv6, there can be a sequence of Next Header fields, and it
would not be obvious which one would be expected to identify a
network service like L4S;

A new protocol ID would rarely provide an end-to-end service,
because it is well-known that new protocol IDs are often blocked
by numerous types of middlebox;

The approach is not a solution for AQMs below the IP layer.

Locally, a network operator could arrange for L4S service to be
applied based on source or destination addressing, e.g. packets from
its own data centre and/or CDN hosts, packets to its business
customers, etc. It could use addressing at any layer, e.g. IP
addresses, MAC addresses, VLAN IDs, etc. Although addressing might be
a useful tactical approach for a single ISP, it would not be a
feasible approach to identify an end-to-end service like L4S. Even for
a single ISP, it would require packet classifiers in buffers to be
dependent on changing topology and address allocation decisions
elsewhere in the network. Therefore this approach is not a feasible
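solution.

The following table provides a very high level
summary of the pros and cons detailed above against the three schemes,
for six issues that set them apart.

+--------------+--------------------+---------+--------------------+
| Issue        |     DSCP + ECN     |   ECN   |    ECT(1) + CE     |
+--------------+--------------------+---------+--------------------+
|              | initial   eventual | initial | initial   eventual |
|              |                    |         |                    |
| end-to-end   |  N . .     . ? .   |  . . Y  |  . . Y     . . Y   |
| tunnels      |  . O .     . O .   |  . . ?  |  . . ?     . . Y   |
| lower layers |  N . .     . ? .   |  . O .  |  . O .     . . ?   |
| codepoints   |  N . .     . . ?   |  . . Y  |  N . .     . . ?   |
| reordering   |  . . Y     . . Y   |  . . Y  |  . O .     . . ?   |
| ctrl pkts    |  . . Y     . . Y   |  . O .  |  . O .     . . ?   |
|              |                    |         |                    |
|              |                    | Note 1  |                    |
+--------------+--------------------+---------+--------------------+

Note 1: Only feasible if classic ECN is obsolete.

The schemes are scored based on both their capabilities now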
('initial') and in the long term ('eventual'). The 'ECN' scheme shares
the 'eventual' scores of the 'ECT(1) + CE' scheme. The scores are one
of 'N, O, Y', meaning 'Poor', 'Ordinary', 'Good' respectively. The
same scores are aligned vertically to aid the eye. A score of "?" in
one of the positions means that this approach might optimistically
become this good, given sufficient effort. The table summarises the
text and is not meant to be understandable without having read the
text.

The ECT(1) codepoint of the ECN field has already been assigned once
for the ECN nonce, which has now been
categorized as historic. ECN is probably the
only remaining field in the Internet Protocol that is common to IPv4 and
IPv6 and still has potential to work end-to-end, with tunnels and with
lower layers. Therefore, ECT(1) should not be reassigned to a different
experimental use (L4S) without carefully assessing competing potential
uses. These fall into the following categories:

Receiving hosts can fool a sender into downloading faster by
suppressing feedback of ECN marks (or of losses if retransmissions are
not necessary or available otherwise).

The historic ECN nonce protocol proposed
that a TCP sender could set either of ECT(0) or ECT(1) in each packet
of a flow and remember the sequence it had set. If any packet was lost
or congestion marked, the receiver would miss that bit of the
sequence. An ECN Nonce receiver had to feed back the least significant
bit of the sum, so it could not suppress feedback of a loss or mark
without a 50-50 chance of guessing the sum incorrectly.

The ECN Nonce RFC has been reclassified as
historic, partly because other ways have been developed to protect TCP
feedback integrity that do not consume a
codepoint in the IP header. So it is highly unlikely that ECT(1) will
be needed for integrity protection in future.

Various researchers have proposed to use ECT(1) as a less severe
congestion notification than CE, particularly to enable flows to fill
available capacity more quickly after an idle period, when another
flow departs or when a flow starts, e.g. VCP,
Queue View (QV).

Before assigning ECT(1) as an identifier for L4S, we must carefully
consider whether it might be better to hold ECT(1) in reserve for
future standardisation of rapid flow acceleration, which is an
important and enduring problem.

Pre-Congestion Notification (PCN) is another scheme that assigns
alternative semantics to the ECN field. It uses ECT(1) to signify a
less severe level of pre-congestion notification than CE. However, the ECN field only takes on the PCN
semantics if packets carry a Diffserv codepoint defined to indicate
PCN marking within a controlled environment. PCN is required to be
applied solely to the outer header of a tunnel across the controlled
region in order not to interfere with any end-to-end use of the ECN
field. Therefore a PCN region on the path would not interfere with any
of the L4S service identifiers proposed in .