This document will provide guidance on
the design of methods to avoid congestion collapse and to react to
incipient congestion. The present document is for discussion
and comment by the IETF. If
published, it is intended to update or replace the Best Current Practice in
BCP 41, which currently includes "Congestion Control Principles"
provided in RFC2914.¶
The current recommendations and requirements
on this topic are
distributed across many documents in the RFC series. This document therefore
seeks to gather and consolidate these recommendations in an annexe.
Based on these specifications, and Internet engineering experience, the
document provides input to the design of new congestion control methods
in protocols.¶
This revision updates the source to a modern XML format, and is for discussion
by tsvwg.¶
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF). Note that other groups may also distribute working
documents as Internet-Drafts. The list of current Internet-Drafts is
at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."¶
Copyright (c) 2022 IETF Trust and the persons identified as the
document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with
respect to this document. Code Components extracted from this
document must include Revised BSD License text as described in
Section 4.e of the Trust Legal Provisions and are provided without
warranty as described in the Revised BSD License.¶
The IETF has specified Internet transports (e.g., TCP [I-D.ietf-tcpm-rfc793bis], UDP [RFC0768], UDP-Lite [RFC3828], SCTP
[RFC4960], and DCCP [RFC4340])
as well as protocols layered on top of these transports (e.g., RTP [RFC3550], QUIC [RFC9000], SCTP/UDP [RFC6951], DCCP/UDP [RFC6773]) and
transports that work directly over the IP network layer. These
transports are implemented in endpoints (either Internet hosts or
routers acting as endpoints), and are designed to detect and react to
network congestion. TCP was the first transport to provide this,
although the TCP specification found in RFC 793 predates the inclusion
of congestion control and does not contain any discussion of using or
managing a congestion window. RFC 9293 [I-D.ietf-tcpm-rfc793bis] seeks to address this.¶
Recommendations and requirements on this topic are distributed across
many documents in the RFC series. The appendix of this document
therefore seeks to gather and consolidate these recommendations. These
recommendations, together with Internet engineering experience, are used as a basis to provide
overall guidelines as input to the design of congestion control methods
that are implemented in Internet protocols. The focus of the present
document is upon unicast point-to-point transports; this includes
migration from using one path to another path.¶
The popularity of the Internet has led to a proliferation in the
number of TCP implementations [RFC2914]. A variety
of non-TCP transports have also been deployed. Some transport
implementations fail to use standardised congestion avoidance mechanisms
correctly because of poor implementation [RFC2525].
However, this is not the only reason for not using standard methods.
Some transports have chosen mechanisms that are not presently
standardised, or have adopted approaches to their design that differ
from present standards. Guidance is needed therefore not only for future
standardisation, but to ensure safe and appropriate evolution of
transports that have not presently been submitted for
standardisation.¶
Some recommendations [RFC5783] and requirements
in this document apply to point-to-multipoint transports (e.g.,
multicast); however, this topic extends beyond the current document's
scope. [RFC2914] provides additional guidance on
the use of multicast.¶
Internet transports can reserve capacity at routers or on the links
being used. This is sometimes used in controlled environments, but most
uses across the Internet do not rely upon prior reservation of capacity
along the path they use. In the absence of such a reservation, endpoints
are unable to determine a safe rate at which to start or continue their
transmission. The use of an Internet path therefore requires a
combination of end-to-end transport mechanisms to detect and then
respond to changes in the capacity that it discovers is available across
the network path.¶
Buffering (an increase in latency) or congestion loss (discard of a
packet) arises when the traffic arriving at a link or network exceeds
the resources available. Loss can also occur for other reasons, but it
is usually not possible for an endpoint to reliably disambiguate the
cause of packet loss (e.g., loss could be due to link corruption,
receiver overrun, etc. [RFC3819]). A network device
typically uses a drop-tail policy to drop excess IP packets when its
queue(s) becomes full. This use of buffers can also be managed using
Active Queue Management (AQM) [RFC7567], which can
be combined with Explicit Congestion Notification signalling.¶
Internet transports need to react to avoid congestion that impacts
other flows sharing a path. The Requirements for
Internet Hosts [RFC1122] formally mandates that endpoints perform
congestion control. "Because congestion control is critical to the
stable operation of the Internet, applications and other protocols that
choose to use UDP as an Internet transport must employ mechanisms to
prevent congestion collapse and to establish some degree of fairness
with concurrent traffic" [RFC2914].¶
The general recommendation in the UDP Guidelines [RFC8085] is that applications SHOULD leverage existing
congestion control techniques, such as those defined for TCP [RFC5681], TCP-Friendly Rate Control (TFRC) [RFC5348], SCTP [RFC4960], and other
IETF-defined transports. This is because there are many trade-offs and
details that can have a serious impact on the performance of congestion
control for the application they support and other traffic that seeks to
share the resources along the path over which they communicate.¶
Paths through the Internet can experience congestion (loss or delay) that is a result of excess load at a bottleneck(s) along the path.
Incipient congestion is a consequential side effect of the
statistical multiplexing of packet flows.
There will be times when packets need to be buffered or dropped at the bottleneck(s) on the path,
and flows need to react when they encounter this congestion to reduce their
contribution to the load.¶
Persistent congestion occurs when the pattern of arriving traffic results
in overconsumption of the path resources. Typically this results in
packet loss. The effects of persistent congestion might impact the flow
that induces congestion, but could also impact other flows,
e.g., starving them of resources; or further reducing the efficiency
of the path (e.g., congestion collapse).¶
The IETF has produced specifications and BCP for transports to
address congestion. TCP has evolved to solve both aspects of congestion
and also to provide efficient loss recovery. (Loss recovery is not
itself a congestion control mechanism, but the cause of loss might be
congestion, so the two become coupled.)¶
There are several reasons to think that things may have changed.
At one time, it was common that the serialisation delay of a packet at the bottleneck
formed a large proportion of the round-trip time of a path, motivating a need for
conservative loss recovery. This is not often the case for today's
higher capacity links.
This increase in link speed means that, for many users, current traffic does not
normally experience persistent congestion.¶
How do operators understand that traffic is behaving reasonably?¶
How can the IETF guide developers towards safe and efficient congestion control?¶
There are multiple ways to structure a document or documents.
One possibility is to separate the BCP guidance for avoiding persistent congestion
(e.g., starvation, congestion collapse) from the design of the protocol mechanisms
that seek to react to incipient congestion.
Such a split seems possible, in the same way that loss recovery
can be distanced from congestion reaction, but might be hard to achieve. The present version of this document covers both aspects of congestion control.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].¶
The path between endpoints (sometimes called "Internet Hosts" or
called source and destination nodes in IPv6) consists of the endpoint
protocol stack at the sender and the receiver (which together implement
the transport service), and a succession of links and network devices
(routers or middleboxes) that provide connectivity across the network.
The set of network devices forming the path is not usually fixed, and it
should generally be assumed that this set can change over arbitrary
lengths of time.¶
[RFC5783] defines congestion control as "the
feedback-based adjustment of the rate at which data is sent into the
network. Congestion control is an indispensable set of principles and
mechanisms for maintaining the stability of the Internet." [RFC5783] also provides an informational snapshot taken
by the IRTF's Internet Congestion Control Research Group (ICCRG) from
October 2008.¶
The text draws on language used in the specifications of TCP and
other IETF transports. For example, a protocol timer is generally needed
to detect persistent congestion, and this document uses the term
Retransmission Timeout (RTO) to refer to the operation of this timer.
Similarly, the document refers to a congestion window as the variable
that controls the rate of transmission by the congestion controller. The
use of these terms does not imply that endpoints need to implement
functions in the way that TCP currently does. Each new transport needs
to make its own design decisions about how to meet the recommendations
and requirements for congestion control.¶
Other terminology is directly copied from the cited RFCs.¶
Endpoints MUST perform congestion control [RFC1122] and SHOULD leverage existing congestion
control techniques [RFC8085].¶
If an application or protocol chooses not to use a
congestion-controlled transport protocol, it SHOULD control the
rate at which it sends datagrams to a destination host, in order
to fulfil the requirements of [RFC2914], as
stated in [RFC8085].¶
Transports SHOULD control the aggregate traffic they send on a
path [RFC8085]. They ought not to use
multiple congestion-controlled flows between the same endpoints to
gain a performance advantage. An endpoint can become aware of
congestion by various means (including delay variation, timeout,
ECN, and packet loss). A signal that indicates congestion on the
end-to-end network path SHOULD result in a congestion control
reaction by the transport that reduces the current rate of the
sending endpoint [RFC8087].¶
Although network devices can be configured to reduce the impact
of flow multiplexing on other flows, endpoints MUST NOT rely
solely on the presence and correct configuration of these methods,
except when constrained to operate in a controlled environment.
Transports that do not target Internet deployment need to be
constrained to only operate in a controlled environment (e.g., see
Section 3.6 of [RFC8085]) and provide
appropriate mechanisms to prevent traffic accidentally leaving the
controlled environment [RFC8084].¶
Path Change: The detection of
congestion and the resulting reduction MUST NOT solely depend upon
reception of a signal from the remote endpoint, because congestion
indications could themselves be lost under persistent
congestion.
The only way to reliably confirm that a sending
endpoint has successfully communicated with a remote endpoint is
to utilise a timer (see Section 5.2.3), usually called the RTO, to detect a
lack of response that could result from a change in the path or
the path characteristics. Congestion
controllers that are unable to react within one (or at most a few)
RTTs after receiving a congestion indication should observe the
guidance in section 3.3 of the UDP Guidelines
[RFC8085].¶
An endpoint needs to provide protection from attacks on the traffic
it generates, or attacks that seek to increase the capacity it
consumes (impacting other traffic that shares a bottleneck).¶
The following guidance is provided on protection from attack:¶
Off-Path Attack: A design MUST protect from
off-path attack to the protocol [RFC8085]
(i.e., one by an attacker that is unable to see the contents of
packets exchanged across the path). An attack on the congestion
control can lead to a Denial of Service (DoS) vulnerability for
the flow being controlled and/or other flows that share network
resources along the path.¶
On-Path Attack: A protocol can be designed to
protect from on-path attacks, but this requires more complexity
and the use of encryption/authentication mechanisms (e.g., IPsec
[RFC4301], QUIC [RFC9000]).¶
Validation of Signals: Network signalling and
control messages (e.g., ICMP [RFC0792]) MUST
be validated before they are used, to protect from malicious abuse.
This MUST at least include protection from off-path attack [RFC8085].¶
The IETF has provided guidance [RFC5033] for
considering and evaluating alternate congestion control algorithms.¶
There have been changes in the way that protocol mechanisms are deployed
in Internet endpoints.¶
On the one hand, techniques have evolved that now allow incremental deployment
and testing of new methods. This can enable more rapid development of
methods to detect and react to incipient congestion. This allows new mechanisms
to be tested to ensure that 95%, 99%, etc., of users see benefit in the networks
they use. There has been considerable progress in developing new loss
recovery and congestion responses that have been evaluated in this way.¶
On the other hand, the Internet continues to be heterogeneous: some people experience
very different network path characteristics and some people have very
different patterns of traffic. The IETF seeks to avoid congestion collapse,
and also avoid prejudicing the performance experienced when the Internet is shared.
Different approaches are needed when analysing the collateral damage resulting from using a new mechanism.
An analysis of the suitability of a new mechanism needs to consider the
impact on the outliers in performance (the last 5%, 1%, etc.), and specifically
needs to understand how changes impact other flows sharing a bottleneck.
This impact is often not visible in the performance data
collected for the new flow: it may not be obvious that a new method starves
some other application of capacity, or that its pattern of packets disrupts the
timing needed for a particular application.¶
The IRTF has described a set of metrics and related trade-offs
between metrics that can be used to compare, contrast, and evaluate
congestion control techniques [RFC5166]. [RFC5783] provides a snapshot of congestion-control
research in 2008.¶
This section summarises the principles for providing congestion
control. The section seeks to differentiate mechanisms associated with preventing persistent congestion, reacting to incipient congestion, and utilising additional path information.¶
Persistent congestion can
result in congestion collapse, which MUST be aggressively avoided
[RFC2914]. Endpoints that experience
persistent congestion and have already exponentially reduced their
congestion window to the restart window (e.g., one packet) MUST
further reduce the rate if the RTO timer continues to expire. For
example, TFRC [RFC5348] continues to reduce
its sending rate under persistent congestion to one packet per RTT,
and then exponentially backs off the time between single packet
transmissions if the congestion continues to persist
[RFC2914].¶
Transports MUST avoid inducing flow starvation to other flows
that share resources along the path they use.¶
Endpoints MUST treat a loss of all feedback (e.g., expiry of a
retransmission time out, RTO) as an indication of persistent
congestion (i.e., an indication of potential congestion
collapse).¶
When an endpoint detects persistent congestion, it MUST reduce
the maximum rate (e.g., reduce its congestion window). This
normally involves the use of protocol timers to detect a lack of
acknowledgment for transmitted data (Section 5.2.3).¶
Protocol timers (e.g., used for retransmission or to detect
persistent congestion) need to be appropriately initialised. A
transport SHOULD adapt its protocol timers to follow the measured
path Round Trip Time (RTT) (e.g., Section 3.1.1 of [RFC8085]).¶
A transport MUST employ exponential backoff each time
persistent congestion is detected [RFC1122],
until the path characteristics can again be confirmed.¶
Network devices MAY provide mechanisms to mitigate the impact
of congestion collapse by transport flows (e.g., priority
forwarding of control information, and starvation detection), and
SHOULD mitigate the impact of non-conformant and malicious flows
[RFC7567]. These mechanisms complement, but
do not replace, the endpoint congestion avoidance mechanisms.¶
Maintaining the RTO: The RTO SHOULD be set based on
recent RTT observations (including the RTT variance) [RFC8085].¶
RTO Expiry: Persistent lack of feedback (e.g.,
detected by an RTO timer, or other means) MUST be treated as an
indication of potential congestion collapse. A failure to receive
any specific response within an RTO interval could potentially be a
result of an RTT change, change of path, excessive loss, or even
congestion collapse. If there is no response within the RTO
interval, TCP collapses the congestion window to one segment [RFC5681]. Other transports MUST similarly respond
when they detect loss of feedback.
An endpoint needs to exponentially back off the RTO
interval [RFC8085] each time the RTO expires.
That is, the RTO interval MUST be set to at least the RTO * 2
[RFC6298][RFC8085].¶
Maximum RTO: A maximum value MAY be placed on the
RTO interval. This maximum limit to the RTO interval MUST NOT be
less than 60 seconds [RFC6298].¶
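The RTO expiry rules above can be illustrated with a minimal sketch. The function names (`backoff_rto`, `on_rto_expiry`) and the representation of the congestion window as a packet count are assumptions made for this example; they are not taken from any specification. The sketch doubles the RTO on each expiry, collapses the window to one packet, and rejects any configured maximum below 60 seconds.

```python
from typing import Optional, Tuple

# Any configured maximum RTO must not be lower than this (seconds).
MAX_RTO_FLOOR = 60.0


def backoff_rto(rto: float, max_rto: Optional[float] = None) -> float:
    """Return the RTO to use after the timer expires (hypothetical helper).

    The interval is doubled. If an optional maximum is configured, the
    result is capped at that maximum.
    """
    if max_rto is not None and max_rto < MAX_RTO_FLOOR:
        raise ValueError("a maximum RTO must not be less than 60 seconds")
    doubled = rto * 2.0
    return doubled if max_rto is None else min(doubled, max_rto)


def on_rto_expiry(cwnd_packets: int, rto: float,
                  max_rto: Optional[float] = None) -> Tuple[int, float]:
    """Treat expiry as potential congestion collapse.

    The window collapses to one packet (as TCP does) and the timer is
    backed off exponentially.
    """
    return 1, backoff_rto(rto, max_rto)
```

For example, repeated expiry starting from an RTO of 1 second yields 2, 4, 8, ... seconds until a configured maximum (if any) is reached.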
When a connection or flow to a new destination is established, the
endpoints have little information about the characteristics of the
network path they will use. This section describes how a flow starts
transmission over such a path to mitigate causing incipient congestion.¶
Flow Start: A new flow between two endpoints needs
to initialise a congestion controller for the path it will use. It
MUST NOT assume that capacity is available at the start of the
flow, unless it uses a mechanism to explicitly reserve capacity.
In the absence of a capacity signal, a flow might therefore start
slowly. The TCP slow-start algorithm is an accepted standard for
flow startup [RFC5681]. TCP uses the notion
of an Initial Window (IW) ([RFC3390], updated
by [RFC6928]) to define the initial volume of
data that can be sent on a path. This is not the smallest burst,
or the smallest window, but it is considered a safe starting point
for a path that is not suffering persistent congestion, and is
applicable until feedback about the path is received. The initial
sending rate (e.g., determined by the IW) needs to be viewed as
tentative until the capacity is confirmed to be available.¶
Initial RTO Interval: When a flow sends the first
packet(s), it typically has no way to know the actual RTT of the
path it will use. An initial value needs to be used to initialise
the principal retransmission timer, which will be used to detect
lack of responsiveness from the remote endpoint. In TCP, this is
the starting value of the RTO. The selection of a safe initial
value is a trade-off that has important consequences on the
overall Internet stability [RFC6928][RFC8085]. In the absence of any knowledge about
the latency of a path (including the initial value), the RTO MUST
be conservatively set to no less than 1 second. Values shorter
than 1 second can be problematic (see the appendix of [RFC6298]). (Note: Linux TCP has deployed a smaller
initial RTO value.)¶
Initial RTO Expiry: If the RTO timer expires while
awaiting completion of a connection setup, or handshake (e.g., the
three-way handshake in TCP, the ACK of a SYN segment), and the
implementation is using an RTO less than 3 seconds, the local
endpoint can resend the connection setup. [[Author note: It would
be useful to discuss how the timer is managed to protect from
multiple handshake failure]].
The RTO MUST then be re-initialized to increase it
to 3 seconds when data transmission begins (i.e., after the
handshake completes) [RFC6298][RFC8085]. This conservative increase is necessary
to avoid congestion collapse when many flows retransmit across a
shared bottleneck with restricted capacity.¶
Initial Measured RTO: Once an RTT measurement is
available (e.g., through reception of an acknowledgement), the
timeout value must be adjusted. This adjustment MUST take into
account the RTT variance. For the first sample, this variance
cannot be determined, and a local endpoint MUST therefore
initialise the variance to RTT/2 (see equation 2.2 of [RFC6298] and related text for UDP in section 3.1.1
of [RFC8085]).¶
Current State: A congestion controller MAY assume
that recently used capacity between a pair of endpoints is an
indication of future capacity available in the next RTT between
the same endpoints. It MUST react (reduce its rate) if this is not
(later) confirmed to be true. [[Author note: do we need to bound
this]].¶
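The initial RTO rules in this section (a 1 second starting value, re-initialisation to 3 seconds after a handshake timeout, and a variance of RTT/2 for the first sample) can be sketched as follows. The class name `RtoState`, its method names, and the choice to keep a 1 second lower bound after the first measurement are assumptions for illustration; the multiplier of 4 on the variance follows the TCP timer computation in RFC 6298, and clock granularity is ignored.

```python
from typing import Optional

INITIAL_RTO = 1.0          # seconds, used before any RTT is known
POST_HANDSHAKE_RTO = 3.0   # seconds, used if the handshake timer expired
K = 4                      # variance multiplier used by the TCP RTO computation


class RtoState:
    """Hypothetical holder for retransmission timer state."""

    def __init__(self, min_rto: float = 1.0) -> None:
        self.min_rto = min_rto            # assumed lower bound on the RTO
        self.srtt: Optional[float] = None
        self.rttvar: Optional[float] = None
        self.rto = INITIAL_RTO

    def handshake_timer_expired(self) -> None:
        """Raise the RTO to 3 seconds before data transmission begins."""
        if self.rto < POST_HANDSHAKE_RTO:
            self.rto = POST_HANDSHAKE_RTO

    def first_sample(self, rtt: float) -> None:
        """First measurement: the variance is unknown, so use RTT/2."""
        self.srtt = rtt
        self.rttvar = rtt / 2.0
        self.rto = max(self.min_rto, self.srtt + K * self.rttvar)
```

With a first sample of 0.5 seconds this gives an RTO of 1.5 seconds; with a very short first sample the RTO stays at the assumed 1 second floor.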
This section describes how a sender needs to regulate the maximum
volume of data in flight over the interval of the current RTT, and how
it manages transmission of the capacity that it perceives is
available, reacting to incipient congestion.¶
Transient Path: Unless managed by a resource
reservation protocol, path capacity information is transient. A
sender that does not use capacity has no understanding whether
previously used capacity remains available to use, or whether that
capacity has disappeared (e.g., a change in the path that causes a
flow to experience a smaller bottleneck, or when more traffic
emerges that consumes previously available capacity resulting in a
new bottleneck). For this reason, a transport that is limited by
the volume of data available to send MUST NOT continue to grow its
congestion window when the current congestion window is more than
twice the volume of data acknowledged in the last RTT.¶
Validating the congestion window: Standard TCP states
that a TCP sender "SHOULD set the congestion window to no more
than the Restart Window (R)" before beginning transmission, if the
sender has not sent data in an interval that exceeds the current
retransmission timeout, i.e., when an application becomes idle
[RFC5681]. An experimental specification
[RFC7661] permits TCP senders to tentatively
maintain a congestion window larger than the path supported in the
last RTT when application-limited, provided that they
appropriately and rapidly collapse the congestion window when
potential congestion is detected. This mechanism is called
Congestion Window Validation (CWV).¶
Collateral Damage: Even in the absence of
congestion, statistical multiplexing of flows can result in
transient effects for flows sharing common resources. A sender
therefore SHOULD avoid inducing excessive congestion to other
flows (collateral damage).¶
Burst Mitigation: While a congestion controller
ought to limit sending at the granularity of the current RTT, this
can be insufficient to satisfy the goals of preventing starvation
and mitigating collateral damage. This requires moderating the
burst rate of the sender to avoid significant periods where one or more
flows consume all buffer capacity at the path bottleneck, which
would otherwise prevent other flows from gaining a reasonable
share. Endpoints SHOULD provide mechanisms to regulate the bursts
of transmission that the application/protocol sends to the network
(section 3.1.6 of [RFC8085]). ACK-Clocking
[RFC5681] can help mitigate bursts for
protocols that receive continuous feedback of reception (such as
TCP). Sender pacing can mitigate this [RFC8085] (see Section 4.6 of [RFC3449]), and has been recommended for TCP in
conditions where ACK-Clocking is not effective (e.g., [RFC3742], [RFC7661]). SCTP
[RFC4960] defines a maximum burst length
(Max.Burst) with a recommended value of 4 segments to limit the
SCTP burst size.¶
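Two of the rules in this section lend themselves to a short sketch: the limit on window growth for a sender that is limited by the data it has available, and sender pacing as a way to moderate bursts. The function names and the use of bytes and seconds as units are assumptions for this example. The pacing calculation simply spreads one congestion window evenly over one RTT, which is only one of several possible pacing policies.

```python
def may_grow_cwnd(cwnd_bytes: int, acked_last_rtt_bytes: int) -> bool:
    """Return True if the congestion window may continue to grow.

    Growth stops once the window is more than twice the volume of data
    acknowledged in the last RTT.
    """
    return cwnd_bytes <= 2 * acked_last_rtt_bytes


def pacing_interval(cwnd_bytes: int, rtt_seconds: float,
                    packet_bytes: int) -> float:
    """Seconds to wait between packets so that one congestion window
    is spread evenly across one RTT instead of being sent as a burst.
    """
    if cwnd_bytes <= 0 or rtt_seconds <= 0:
        raise ValueError("cwnd and RTT must be positive")
    rate_bytes_per_second = cwnd_bytes / rtt_seconds
    return packet_bytes / rate_bytes_per_second
```

For instance, a 100000 byte window over a 0.5 second RTT with 1000 byte packets gives one packet roughly every 5 milliseconds.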
This section describes mechanisms to detect and provide
retransmission, and to protect the network in the absence of timely
feedback. These topics are important to avoid persistent congestion.¶
Loss Detection: Loss detection occurs after a sender
determines there is no delivery confirmation within an expected
period of time. This can be done by observing the time-ordering of the
reception of ACKs (as in TCP DupACK), by utilising a timer to
detect loss (e.g., a transmission timer with a period less than
the RTO [RFC8085][RFC8985]), or by a combination of using a
timer and ordering information to trigger retransmission of
data.¶
Retransmission: Retransmission of lost packets or
messages is a common reliability mechanism. When loss is detected,
the sender can choose to retransmit the lost data, ignore the
loss, or send other data (e.g., [RFC8085][RFC9002]), depending on the
reliability model provided by the transport service. Any
transmission consumes network capacity, therefore retransmissions
MUST NOT increase the network load in response to congestion loss
(which worsens that congestion) [RFC8085].
Any method that sends additional data following loss is therefore
responsible for congestion control of the retransmissions (and any
other packets sent, including FEC information) as well as the
original traffic.¶
Measuring the RTT: Once an endpoint has started
communicating with its peer, the RTT estimate MUST be adjusted by measuring
the actual path RTT. This adjustment MUST include adapting to the
measured RTT variance (see equation 2.3 of [RFC6298]).¶
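Adapting to the measured RTT and its variance can be sketched using the smoothed-RTT update rules that TCP uses (RFC 6298). The function name `update_rtt` and the `min_rto` parameter are assumptions for this example, and clock granularity is ignored. Other transports may choose different gains or a different estimator.

```python
from typing import Tuple

ALPHA = 1 / 8   # gain for the smoothed RTT
BETA = 1 / 4    # gain for the RTT variance
K = 4           # variance multiplier for the RTO


def update_rtt(srtt: float, rttvar: float, sample: float,
               min_rto: float = 1.0) -> Tuple[float, float, float]:
    """Fold a new RTT sample into (srtt, rttvar) and derive the RTO.

    The variance is updated first, using the previous smoothed RTT,
    and the smoothed RTT is updated afterwards.
    """
    rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - sample)
    srtt = (1 - ALPHA) * srtt + ALPHA * sample
    rto = max(min_rto, srtt + K * rttvar)
    return srtt, rttvar, rto
```

A sample equal to the current estimate shrinks the variance, and so the RTO, while a sample that differs from it increases both.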
The safety and responsiveness of new proposals need to be evaluated
[RFC5166]. In determining an appropriate
congestion response to incipient congestion, designs could take into consideration the size of
the packets that experience congestion [RFC4828].¶
Congestion Response: An endpoint MUST promptly
reduce the rate of transmission when it receives or detects an
indication of congestion (e.g., loss) [RFC2914].
TCP Reno established a method that relies on
multiplicative-decrease to halve the sending rate when congestion
is detected. This response to congestion indications is considered
sufficient for safe Internet operation, but other decrease factors
have also been published in the RFC Series [RFC8312].¶
ECN Response: A congestion control design should
provide the necessary mechanisms to support Explicit Congestion
Notification (ECN) [RFC3168][RFC6679], as described in section 3.1.7 of [RFC8085]. This can help determine an appropriate
congestion window when supported by routers on the path [RFC7567] to enable early indication of
incipient congestion.
An early detection of incipient congestion
allows a different reaction to an explicit congestion signal
compared to the reaction to detected packet loss [RFC8311][RFC8087]. Simple
feedback of received Congestion Experienced (CE) marks [RFC3168] relies only on an indication that
congestion has been experienced within the last RTT. This style of
response is appropriate when a flow uses ECT(0).
ABE [RFC8511] specifies such a modified reaction to ECN. Further detail about the received
CE-marking can be obtained by using more accurate receiver
feedback (e.g., [I-D.ietf-tcpm-accurate-ecn]
and extended RTP feedback). The more detailed feedback provides an
opportunity for a finer granularity of congestion response.
The L4S architecture
[I-D.ietf-tsvwg-l4s-arch] defines a reaction for
packets marked with ECT(1), building on the style of detailed
feedback provided by
[I-D.ietf-tcpm-accurate-ecn] and a modified marking
system that can provide early reaction to incipient congestion
[I-D.ietf-tsvwg-aqm-dualq-coupled].¶
[RFC8085] provides guidelines
for a sender that does not, or is unable to, adapt the congestion
window.¶
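A multiplicative-decrease response with a different factor for loss and for an ECN mark can be sketched as below. The function name, the string labels for the signal, and the minimum window of two packets are assumptions for this example. The factor 0.5 corresponds to Reno-style halving; the factor 0.8 is used here as an illustrative ABE-style reaction to ECN and is not a requirement of this document.

```python
# Illustrative decrease factors; a real design needs to justify its choice.
DECREASE_FACTOR = {
    "loss": 0.5,  # Reno-style halving on detected loss
    "ecn": 0.8,   # assumed gentler reaction to a CE mark (ABE-style)
}


def reduce_on_congestion(cwnd_packets: float, signal: str,
                         min_cwnd_packets: float = 2.0) -> float:
    """Return the reduced congestion window after a congestion signal."""
    if signal not in DECREASE_FACTOR:
        raise ValueError(f"unknown congestion signal: {signal!r}")
    reduced = cwnd_packets * DECREASE_FACTOR[signal]
    # Never go below the (assumed) minimum window, and never increase.
    return min(cwnd_packets, max(min_cwnd_packets, reduced))
```

Whatever factor is chosen, the reduction is applied promptly on the first indication of congestion.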
In the absence of persistent congestion, an endpoint MAY increase
its congestion window and hence the sending rate. An increase should
only occur when there is additional data available to send across the
path (i.e., the sender will utilise the additional capacity in the
next RTT). This helps manage incipient congestion.¶
Increasing Congestion Window: A sender MUST NOT
continue to increase its rate for more than an RTT after a
congestion indication is received. The transport SHOULD stop
increasing its congestion window as soon as it receives indication
of congestion.
While the sender is increasing the congestion
window, a sender will transmit faster than the last confirmed safe
rate. Any increase above the last confirmed rate needs to be
regarded as tentative, and the sender needs to reduce its rate below the
last confirmed safe rate when congestion is experienced (a
congestion event).¶
After detecting congestion: An endpoint MUST utilise a method that
assures the sender will keep the rate below the previously
confirmed safe rate for multiple RTT periods after an observed
congestion event. In TCP, this is performed by using a linear
increase from a slow start threshold that is re-initialised when
congestion is experienced.¶
Avoiding Overshoot: Overshoot of the congestion
window beyond the point of congestion can significantly impact
other flows sharing resources along a path. As endpoints experience more paths with a large BDP and
a wider range of potential path RTTs, variability or changes
in the path can place very significant constraints on the appropriate
dynamics for increasing the congestion window (see also burst
mitigation, Section 5.2.2).¶
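The increase behaviour described in this section can be sketched with a simple TCP-like window. The class `AimdWindow`, its default initial window, and the packet-based units are assumptions for this example. After a congestion event, the window restarts from a re-initialised slow start threshold and grows by about one packet per RTT, so the sender stays below the previously confirmed rate for multiple RTTs.

```python
class AimdWindow:
    """Hypothetical TCP-like congestion window, counted in packets."""

    def __init__(self, initial_window: float = 10.0) -> None:
        self.cwnd = initial_window
        self.ssthresh = float("inf")

    def on_ack(self, acked_packets: int = 1,
               app_limited: bool = False) -> None:
        """Grow the window only when there is more data to send."""
        if app_limited:
            return
        if self.cwnd < self.ssthresh:
            # Slow start: one packet of growth per acknowledged packet.
            self.cwnd += acked_packets
        else:
            # Linear increase: roughly one packet per RTT.
            self.cwnd += acked_packets / self.cwnd

    def on_congestion(self) -> None:
        """Stop increasing and restart from a reduced threshold."""
        self.ssthresh = max(self.cwnd / 2.0, 2.0)
        self.cwnd = self.ssthresh
```

A design that increases faster than this after a congestion event needs to show that it does not overshoot the capacity that was just found to be unavailable.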
An endpoint could cache path information that could be used to inform parameter selection for a new or on-going flow. It might also utilise signals from the network to help determine
how to regulate the traffic it sends.¶
Any information used to accelerate the growth of the congestion window MUST
be viewed as tentative until the path capacity is confirmed by
receiving a confirmation that actual traffic has been sent across
the path (i.e., the new flow needs to either use or lose the
capacity that has been tentatively offered to it). A sender MUST
reduce its rate if this capacity is not confirmed within the
current RTO interval.¶
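The tentative use of such information can be sketched as follows. The class `TentativeStart`, its fields, and the fallback to a conservative initial window are assumptions for this example. The sketch allows a larger starting window suggested by cached or signalled information, but falls back to the conservative window if no feedback confirms the capacity within one RTO interval.

```python
from dataclasses import dataclass


@dataclass
class TentativeStart:
    """Hypothetical state for a flow started with suggested capacity."""

    cwnd: float             # window suggested by cached/signalled information
    initial_window: float   # conservative window used without such information
    rto: float              # current RTO interval, in seconds
    started_at: float       # time at which transmission began, in seconds
    confirmed: bool = False

    def on_feedback(self) -> None:
        """Feedback from the path confirms the tentative capacity."""
        self.confirmed = True

    def usable_cwnd(self, now: float) -> float:
        """Return the window to use, reducing it if unconfirmed for an RTO."""
        if not self.confirmed and now - self.started_at >= self.rto:
            self.cwnd = min(self.cwnd, self.initial_window)
        return self.cwnd
```

The caller supplies the time explicitly, which keeps the sketch independent of any particular clock source.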
Utilising Cached Path Information: A congestion controller that recently
used a specific path could use additional state that lets a flow
take over the capacity that was previously consumed by another
flow (e.g., in the last RTT) which it understands is using the
same path and will no longer use the capacity it recently used. In
TCP, this mechanism was called TCP Control Block (TCB)
sharing [RFC2140], and is described in [RFC9040]. The capacity and other
information can be used to suggest a faster initial sending
rate.¶
Receiving Network Signals: Mechanisms MUST NOT solely rely on
transport messages or specific signalling messages to perform
safely (see section 5.2 of [RFC8085]
describing use of ICMP messages). They need to be designed so that
they safely operate when path characteristics change at any time.
Transport mechanisms MUST be robust to potential black-holing of any
signals (i.e., need to be robust to loss or modification of
packets, noting that this can occur even after successful first
use of a signal by a flow, as occurs when the path changes, see
Section 4.2).¶
Utilising Network Signals: A mechanism that utilises signals originating in
the network (e.g., RSVP, NSIS, Quick-Start, ECN) MUST assume that
the set of network devices on the path can change. This motivates
the use of soft-state when designing protocols that interact with
signals originating from network devices [RFC9049] (e.g., ECN). This
can include context-sensitive treatment of "soft" signals provided
to the endpoint [RFC5164].¶
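One way to picture soft-state handling of a network signal is the Python sketch below: a rate hint is honoured only while it keeps being refreshed, and the sender otherwise falls back to the rate it determined end-to-end. The class SoftStateHint, the function allowed_rate, and the timeout value are hypothetical and are not defined by any of the cited protocols.¶

```python
from typing import Optional


class SoftStateHint:
    """Hypothetical soft-state holder for a rate hint received from the network."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout              # assumed refresh interval (seconds)
        self._rate: Optional[float] = None
        self._refreshed_at = 0.0

    def refresh(self, rate: float, now: float) -> None:
        """Install or renew the hint when a signal arrives."""
        self._rate = rate
        self._refreshed_at = now

    def current(self, now: float) -> Optional[float]:
        """Return the hint, or None if it was never set or has expired."""
        if self._rate is not None and now - self._refreshed_at > self.timeout:
            self._rate = None               # signals stopped arriving: forget the hint
        return self._rate


def allowed_rate(end_to_end_rate: float, hint: SoftStateHint, now: float) -> float:
    """Use the hint while it is fresh, otherwise the end-to-end estimate."""
    hinted = hint.current(now)
    return end_to_end_rate if hinted is None else hinted


hint = SoftStateHint(timeout=2.0)
hint.refresh(rate=5_000_000.0, now=0.0)
while_fresh = allowed_rate(1_000_000.0, hint, now=1.0)
after_expiry = allowed_rate(1_000_000.0, hint, now=5.0)
```

The design choice shown is that the absence of a signal is handled safely: if the devices on the path change and the signal is black-holed, the hint simply expires.¶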
This document owes much to the insight offered by Sally Floyd, both
at the time of writing RFC2914 and through her help and review in the many
years that followed.¶
Nicholas Kuhn helped develop the first draft of these guidelines. Tom
Jones and Ana Custura reviewed the first version of this draft. The
University of Aberdeen has received funding to support this work from the
European Space Agency.¶
This document introduces no new security considerations. Each RFC
listed in this document discusses the security considerations of the
specification it contains. The security considerations for the use of
transports are provided in the references section of the cited RFCs.
Security guidance for applications using UDP is provided in the UDP
Usage Guidelines [RFC8085].¶
Section 4.3 describes general requirements
relating to the design of safe protocols and their protection from
on-path and off-path attacks.¶
Section 5.3.1 follows current best practice to
validate ICMP messages prior to use.¶
[RFC1122]
Braden, R., Ed., "Requirements for Internet Hosts - Communication Layers", STD 3, RFC 1122, DOI 10.17487/RFC1122, <https://www.rfc-editor.org/info/rfc1122>.
[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/info/rfc2119>.
[RFC3168]
Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, DOI 10.17487/RFC3168, <https://www.rfc-editor.org/info/rfc3168>.
[RFC5348]
Floyd, S., Handley, M., Padhye, J., and J. Widmer, "TCP Friendly Rate Control (TFRC): Protocol Specification", RFC 5348, DOI 10.17487/RFC5348, <https://www.rfc-editor.org/info/rfc5348>.
[RFC6298]
Paxson, V., Allman, M., Chu, J., and M. Sargent, "Computing TCP's Retransmission Timer", RFC 6298, DOI 10.17487/RFC6298, <https://www.rfc-editor.org/info/rfc6298>.
[RFC7567]
Baker, F., Ed. and G. Fairhurst, Ed., "IETF Recommendations Regarding Active Queue Management", BCP 197, RFC 7567, DOI 10.17487/RFC7567, <https://www.rfc-editor.org/info/rfc7567>.
[RFC2309]
Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, S., Wroclawski, J., and L. Zhang, "Recommendations on Queue Management and Congestion Avoidance in the Internet", RFC 2309, DOI 10.17487/RFC2309, <https://www.rfc-editor.org/info/rfc2309>.
[RFC2525]
Paxson, V., Allman, M., Dawson, S., Fenner, W., Griner, J., Heavens, I., Lahey, K., Semke, J., and B. Volz, "Known TCP Implementation Problems", RFC 2525, DOI 10.17487/RFC2525, <https://www.rfc-editor.org/info/rfc2525>.
[RFC2616]
Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616, DOI 10.17487/RFC2616, <https://www.rfc-editor.org/info/rfc2616>.
[RFC3449]
Balakrishnan, H., Padmanabhan, V., Fairhurst, G., and M. Sooriyabandara, "TCP Performance Implications of Network Path Asymmetry", BCP 69, RFC 3449, DOI 10.17487/RFC3449, <https://www.rfc-editor.org/info/rfc3449>.
[RFC3550]
Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", STD 64, RFC 3550, DOI 10.17487/RFC3550, <https://www.rfc-editor.org/info/rfc3550>.
[RFC3819]
Karn, P., Ed., Bormann, C., Fairhurst, G., Grossman, D., Ludwig, R., Mahdavi, J., Montenegro, G., Touch, J., and L. Wood, "Advice for Internet Subnetwork Designers", BCP 89, RFC 3819, DOI 10.17487/RFC3819, <https://www.rfc-editor.org/info/rfc3819>.
[RFC3828]
Larzon, L-A., Degermark, M., Pink, S., Jonsson, L-E., Ed., and G. Fairhurst, Ed., "The Lightweight User Datagram Protocol (UDP-Lite)", RFC 3828, DOI 10.17487/RFC3828, <https://www.rfc-editor.org/info/rfc3828>.
[RFC4340]
Kohler, E., Handley, M., and S. Floyd, "Datagram Congestion Control Protocol (DCCP)", RFC 4340, DOI 10.17487/RFC4340, <https://www.rfc-editor.org/info/rfc4340>.
[RFC4828]
Floyd, S. and E. Kohler, "TCP Friendly Rate Control (TFRC): The Small-Packet (SP) Variant", RFC 4828, DOI 10.17487/RFC4828, <https://www.rfc-editor.org/info/rfc4828>.
[RFC5033]
Floyd, S. and M. Allman, "Specifying New Congestion Control Algorithms", BCP 133, RFC 5033, DOI 10.17487/RFC5033, <https://www.rfc-editor.org/info/rfc5033>.
[RFC6363]
Watson, M., Begen, A., and V. Roca, "Forward Error Correction (FEC) Framework", RFC 6363, DOI 10.17487/RFC6363, <https://www.rfc-editor.org/info/rfc6363>.
[RFC6679]
Westerlund, M., Johansson, I., Perkins, C., O'Hanlon, P., and K. Carlberg, "Explicit Congestion Notification (ECN) for RTP over UDP", RFC 6679, DOI 10.17487/RFC6679, <https://www.rfc-editor.org/info/rfc6679>.
[RFC6773]
Phelan, T., Fairhurst, G., and C. Perkins, "DCCP-UDP: A Datagram Congestion Control Protocol UDP Encapsulation for NAT Traversal", RFC 6773, DOI 10.17487/RFC6773, <https://www.rfc-editor.org/info/rfc6773>.
[RFC6928]
Chu, J., Dukkipati, N., Cheng, Y., and M. Mathis, "Increasing TCP's Initial Window", RFC 6928, DOI 10.17487/RFC6928, <https://www.rfc-editor.org/info/rfc6928>.
[RFC6951]
Tuexen, M. and R. Stewart, "UDP Encapsulation of Stream Control Transmission Protocol (SCTP) Packets for End-Host to End-Host Communication", RFC 6951, DOI 10.17487/RFC6951, <https://www.rfc-editor.org/info/rfc6951>.
[RFC7661]
Fairhurst, G., Sathiaseelan, A., and R. Secchi, "Updating TCP to Support Rate-Limited Traffic", RFC 7661, DOI 10.17487/RFC7661, <https://www.rfc-editor.org/info/rfc7661>.
[RFC8087]
Fairhurst, G. and M. Welzl, "The Benefits of Using Explicit Congestion Notification (ECN)", RFC 8087, DOI 10.17487/RFC8087, <https://www.rfc-editor.org/info/rfc8087>.
[RFC8311]
Black, D., "Relaxing Restrictions on Explicit Congestion Notification (ECN) Experimentation", RFC 8311, DOI 10.17487/RFC8311, <https://www.rfc-editor.org/info/rfc8311>.
[RFC8312]
Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and R. Scheffenegger, "CUBIC for Fast Long-Distance Networks", RFC 8312, DOI 10.17487/RFC8312, <https://www.rfc-editor.org/info/rfc8312>.
[RFC8511]
Khademi, N., Welzl, M., Armitage, G., and G. Fairhurst, "TCP Alternative Backoff with ECN (ABE)", RFC 8511, DOI 10.17487/RFC8511, <https://www.rfc-editor.org/info/rfc8511>.
[RFC8985]
Cheng, Y., Cardwell, N., Dukkipati, N., and P. Jha, "The RACK-TLP Loss Detection Algorithm for TCP", RFC 8985, DOI 10.17487/RFC8985, <https://www.rfc-editor.org/info/rfc8985>.
[RFC9000]
Iyengar, J., Ed. and M. Thomson, Ed., "QUIC: A UDP-Based Multiplexed and Secure Transport", RFC 9000, DOI 10.17487/RFC9000, <https://www.rfc-editor.org/info/rfc9000>.
[RFC9002]
Iyengar, J., Ed. and I. Swett, Ed., "QUIC Loss Detection and Congestion Control", RFC 9002, DOI 10.17487/RFC9002, <https://www.rfc-editor.org/info/rfc9002>.
[RFC9049]
Dawkins, S., Ed., "Path Aware Networking: Obstacles to Deployment (A Bestiary of Roads Not Taken)", RFC 9049, DOI 10.17487/RFC9049, <https://www.rfc-editor.org/info/rfc9049>.
Network devices can be configured to isolate the queuing of packets
for different flows, or aggregates of flows, and thereby assist in
reducing the impact of flow multiplexing on other flows. This could
include methods seeking to equally distribute resources between sharing
flows, but this is explicitly not a requirement for a network device
[Flow-Rate-Fairness]. Endpoints cannot rely on the
presence and correct configuration of these methods and therefore, even
when a path is expected to support such methods, also need to employ
methods that work end-to-end.¶
Experience has shown that successful protocols developed in a
specific context or for a particular application tend to also become
used in a wider range of contexts. Therefore, IETF specifications by
default target deployment on the general Internet, or need to be defined
for use only within a controlled environment.¶
A significant pathology can arise when a poorly designed transport
creates congestion. This can result in severe service degradation or
"Internet meltdown". This phenomenon was first observed during the
early growth phase of the Internet in the mid-1980s [RFC0896][RFC0970]. It is
technically called "Congestion Collapse". [RFC2914] notes that informally, "congestion collapse
occurs when an increase in the network load results in a decrease in
the useful work done by the network."¶
Transports need to be specifically designed with measures to avoid
starving other flows of capacity (e.g., [RFC7567]). [RFC2309] also
discusses the dangers of congestion-unresponsive flows, and states
that "all UDP-based streaming applications should incorporate
effective congestion avoidance mechanisms." [RFC7567] and [RFC8085] both
reaffirm this, encouraging development of methods to prevent
starvation.¶
When a transport uses a path to send packets (i.e., a flow), this
impacts any other Internet flows (possibly from or to other endpoints)
that share the capacity of any common network device or link (i.e.,
are multiplexed) along the path. As with loss, latency can also be
incurred for other reasons [RFC3819] (e.g., Quality of
Service link scheduling, link radio resource management/bandwidth on
demand, transient outages, link retransmission, and
connection/resource setup below the IP layer).¶
When choosing an appropriate sending rate, packet loss needs to be
considered. Although losses are not always due to congestion, endpoint
congestion control needs to conservatively react to loss as a
potential signal of reduced available capacity and reduce the sending
rate. Many designs place the responsibility for rate adaptation at the
sender (source) endpoint, utilising feedback information provided by
the remote endpoint (receiver). Congestion control can also be
implemented by determining an appropriate rate limit at the receiver
and using this limit to control the maximum transport rate (e.g.,
using methods such as [RFC5348] and [RFC4828]).¶
It is normal to observe some perturbation in latency and/or loss
when a flow shares a common network bottleneck with other traffic. This
impact needs to be considered and Internet flows ought to implement
appropriate safeguards to avoid inappropriate impact on other flows
that share the resources along a path. Congestion control methods
satisfy this requirement and therefore can help avoid congestion
collapse.¶
"This raises the issue of the appropriate granularity of a "flow",
where we define a `flow' as the level of granularity appropriate for
the application of both fairness and congestion control. [RFC2309] states: "There are a few `natural' answers:
1) a TCP or UDP connection (source address/port, destination
address/port); 2) a source/destination host pair; 3) a given source
host or a given destination host. We would guess that the
source/destination host pair gives the most appropriate granularity in
many circumstances. The granularity of flows for congestion management
is, at least in part, a policy question that needs to be addressed in
the wider IETF community." [RFC2914]¶
Endpoints can send more than one flow. "The specific issue of a
browser opening multiple connections to the same destination has been
addressed by [RFC2616]. Section 8.1.4 states that
"Clients that use persistent connections SHOULD limit the number of
simultaneous connections that they maintain to a given server. A
single-user client SHOULD NOT maintain more than 2 connections with
any server or proxy." [RFC2140].¶
This suggests that there are opportunities for transport
connections between the same endpoints (from the same or differing
applications) to share some information, including their congestion
control state, if they are known to share the same path. [RFC8085] adds "An application that forks multiple
worker processes or otherwise uses multiple sockets to generate UDP
datagrams SHOULD perform congestion control over the aggregate
traffic."¶
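Congestion control over aggregate traffic can be sketched as a single shared window that every socket of an application consults before sending. The Python class AggregateController below is a hypothetical illustration, not an API from [RFC8085], and it omits how the shared window itself would be adjusted.¶

```python
class AggregateController:
    """Hypothetical shared congestion state for all flows of one application."""

    def __init__(self, cwnd: int) -> None:
        self.cwnd = cwnd          # shared window for the aggregate (bytes)
        self.in_flight = 0        # bytes sent by any flow and not yet acknowledged

    def try_send(self, size: int) -> bool:
        """Admit a datagram only if the aggregate stays within the window."""
        if self.in_flight + size > self.cwnd:
            return False
        self.in_flight += size
        return True

    def on_ack(self, size: int) -> None:
        """Release capacity when any flow's data is acknowledged."""
        self.in_flight = max(0, self.in_flight - size)


shared = AggregateController(cwnd=3000)
first = shared.try_send(1500)     # worker A sends
second = shared.try_send(1500)    # worker B sends
third = shared.try_send(1500)     # worker C is held back by the aggregate limit
```

The point of the sketch is that adding more sockets does not add more capacity: all workers draw on the same window.¶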
In the absence of persistent congestion, an endpoint is permitted
to increase its congestion window and hence the sending rate. An
increase should only occur when there is additional data available to
send across the path (i.e., the sender will utilise the additional
capacity in the next RTT).¶
TCP Reno [RFC5681] defines an algorithm, known
as the Additive-Increase/Multiplicative-Decrease (AIMD) algorithm,
together with a slow-start phase that allows a sender to exponentially increase the congestion window
each RTT from the initial window to the first detected congestion
event. This is designed to allow new flows to rapidly acquire a
suitable congestion window. Where the bandwidth delay product (BDP) is
large, it can take many RTT periods to determine a suitable share of
the path capacity. Such high BDP paths benefit from methods that more
rapidly increase the congestion window, but in compensation these need
to be designed to also react rapidly to any detected congestion (e.g.,
TCP Cubic [RFC8312]).¶
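The window dynamics described above can be approximated per RTT with the Python sketch below. It is a simplification under stated assumptions: growth is modelled once per RTT rather than per acknowledgment, the exponential phase is capped at the threshold, and the function name next_window and its parameters are illustrative rather than taken from [RFC5681].¶

```python
from typing import Tuple


def next_window(cwnd: int, ssthresh: int, mss: int, congestion: bool) -> Tuple[int, int]:
    """Return (cwnd, ssthresh) after one RTT, in bytes (simplified model)."""
    if congestion:
        # Multiplicative decrease: halve the window, but keep at least two segments.
        ssthresh = max(cwnd // 2, 2 * mss)
        return ssthresh, ssthresh
    if cwnd < ssthresh:
        # Exponential growth: roughly double each RTT until the threshold.
        return min(2 * cwnd, ssthresh), ssthresh
    # Additive increase: one segment per RTT.
    return cwnd + mss, ssthresh


cwnd, ssthresh = 10_000, 64_000
trace = []
for rtt in range(5):
    cwnd, ssthresh = next_window(cwnd, ssthresh, mss=1_000, congestion=False)
    trace.append(cwnd)
```

With these inputs the window doubles until it reaches the threshold and then grows by one segment per RTT, which shows why a large-BDP path can need many RTTs to reach a suitable share of capacity.¶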
The capacity available to a
flow could be expressed as the number of bytes in flight, the
sending rate or a limit on the number of unacknowledged segments.
When determining the capacity used, all data sent by a sender
needs to be accounted for, including any additional overhead or
data generated by the transport. A transport performing congestion
management will usually optimise performance for its application
by avoiding excessive loss or delay and maintaining a congestion
window. In steady-state this congestion window reflects a safe
limit to the sending rate that has not resulted in persistent
congestion. A congestion controller for a flow that uses packet
Forward Error Correction (FEC) encoding (e.g., [RFC6363]) needs to consider all additional
overhead introduced by packet FEC when setting and managing its
congestion window.¶
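The accounting rule above can be made concrete with a small Python sketch in which FEC repair data and transport overhead are charged against the congestion window alongside the source data. The function names and the byte counts used are assumptions for illustration only.¶

```python
def charged_bytes(source_bytes: int, repair_bytes: int, header_bytes: int) -> int:
    """Everything the sender puts on the path counts, not only application data."""
    return source_bytes + repair_bytes + header_bytes


def can_send(cwnd: int, bytes_in_flight: int,
             source_bytes: int, repair_bytes: int, header_bytes: int) -> bool:
    """Check the congestion window against the full cost of a transmission."""
    cost = charged_bytes(source_bytes, repair_bytes, header_bytes)
    return bytes_in_flight + cost <= cwnd


# The same source data fits the window without FEC, but not with 300 bytes of repair data.
without_fec = can_send(cwnd=3000, bytes_in_flight=1500,
                       source_bytes=1200, repair_bytes=0, header_bytes=40)
with_fec = can_send(cwnd=3000, bytes_in_flight=1500,
                    source_bytes=1200, repair_bytes=300, header_bytes=40)
```

A controller that ignored the repair bytes in this example would exceed its window by the amount of FEC overhead it added.¶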
One common model views the path between two
endpoints as a "pipe". New packets enter the pipe at the sending
endpoint, older ones leave the pipe at the receiving endpoint.
Congestion and other forms of loss result in "leakage" from this
pipe. Received data (leaving the network path at the remote
endpoint) is usually acknowledged to the congestion
controller.¶
The rate at which data leaves the pipe indicates the
share of the capacity that has been utilised by the flow. If, on
average (over an RTT), the sending rate equals the receiving rate,
this indicates the path capacity. This capacity can be safely used
again in the next RTT. If the average receiving rate is less than
the sending rate, then the path is queuing packets, the
RTT/path has changed, or there is packet loss.¶
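The comparison described above can be sketched in Python as follows. The function assess_pipe, its return strings, and the 5% tolerance used to absorb measurement noise are hypothetical choices for illustration; the text does not define a tolerance.¶

```python
def assess_pipe(sent_bytes: int, received_bytes: int, rtt: float,
                tolerance: float = 0.05) -> str:
    """Compare average sending and receiving rates over one RTT."""
    send_rate = sent_bytes / rtt          # bytes per second entering the pipe
    recv_rate = received_bytes / rtt      # bytes per second leaving the pipe
    if recv_rate >= send_rate * (1.0 - tolerance):
        # Rates match: this capacity can be used again in the next RTT.
        return "capacity-confirmed"
    # Receiver saw less than was sent: queuing, a path/RTT change, or loss.
    return "queuing-path-change-or-loss"


matched = assess_pipe(sent_bytes=100_000, received_bytes=100_000, rtt=0.1)
shortfall = assess_pipe(sent_bytes=100_000, received_bytes=80_000, rtt=0.1)
```

The sketch cannot tell the three causes of a shortfall apart; a real controller would need further signals (e.g., RTT samples or loss reports) to do so.¶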
Like RFC2914, this document borrows heavily from earlier
publications addressing the need for end-to-end congestion control, and
this subsection provides an overview of key topics.¶
[RFC2914] provides a general discussion of the
principles of congestion control. Section 3 discusses fairness, stating
"The equitable sharing of bandwidth among flows depends on the fact that
all flows are running compatible congestion control algorithms". Section
3.1 describes preventing congestion collapse.¶
Congestion collapse was first reported in the mid-1980s [RFC0896], and at that time was largely due to TCP
connections unnecessarily retransmitting packets that were either in
transit or had already been received at the receiver. We call the
congestion collapse that results from the unnecessary retransmission of
packets classical congestion collapse. Classical congestion collapse is
a stable condition that can result in throughput that is a small
fraction of normal [RFC0896]. Problems with
classical congestion collapse have generally been corrected by
improvements to timer and congestion control mechanisms, implemented in
modern implementations of TCP [Jacobson88]. This classical congestion
collapse was a key focus of [RFC2309].¶
A second form of congestion collapse occurs due to undelivered
packets, where Section 5 of [RFC2914] notes:
"Congestion collapse from undelivered packets arises when bandwidth is
wasted by delivering packets through the network that are dropped before
reaching their ultimate destination. This is probably the largest
unresolved danger with respect to congestion collapse in the Internet
today. Different scenarios can result in different degrees of congestion
collapse, in terms of the fraction of the congested links' bandwidth
used for productive work. The danger of congestion collapse from
undelivered packets is due primarily to the increasing deployment of
open-loop applications not using end-to-end congestion control. Even
more destructive would be best-effort applications that *increase* their
sending rate in response to an increased packet drop rate (e.g.,
automatically using an increased level of FEC (Forward Error
Correction))."¶
Section 3.3 of [RFC2914] notes: "In addition to
the prevention of congestion collapse and concerns about fairness, a
third reason for a flow to use end-to-end congestion control can be to
optimize its own performance regarding throughput, delay, and loss. In
some circumstances, for example in environments with high statistical
multiplexing, the delay and loss rate experienced by a flow are largely
independent of its own sending rate. However, in environments with lower
levels of statistical multiplexing or with per-flow scheduling, the
delay and loss rate experienced by a flow is in part a function of the
flow's own sending rate. Thus, a flow can use end-to-end congestion
control to limit the delay or loss experienced by its own packets. We
would note, however, that in an environment like the current best-effort
Internet, concerns regarding congestion collapse and fairness with
competing flows limit the range of congestion control behaviors
available to a flow."¶
The standardization of congestion control in new transports can avoid
a congestion control "arms race" among competing protocols [RFC2914]. That is, it avoids transport designs that
could compete for Internet resources in a way that significantly reduces
the ability of other flows to use the Internet.¶
Added section 1.1 with text on current BCP status with additional
alignment and updates to RFC2914 on Congestion Control Principles
(after question from M. Scharf).¶
Added text on multicast, noting that this is currently out of
scope.¶
Revised sender-based CC text after comment from C. Perkins
(Sections 3.1, 3.3 and other places).¶
Added more about FEC after comment from C. Perkins.¶
Added an explicit reference to RFC 5783 and updated this text
(after question from M. Scharf).¶
To avoid doubt, added a para about "Each new transport needs to
make its own design decisions about how to meet the recommendations
and requirements for congestion control."¶