原文链接 https://github.com/yfractal/blog/blob/master/blog/2022-02-06-tcp-congestion-control.md
TCP 需要在异构,动态的网络环境下,通过不可靠的下层服务 (IP),提供可靠(reliable)的服务,而拥塞控制是 TCP 必不可少的机制。
TCP 拥塞控制本事是一个非常困难,也非常有趣的问题。
本文主要根据 RFC 2001[1] 对 TCP 拥塞控制的状态变化进行一个简单的描述。
TCP 拥塞控制通过调整 cwnd(congestion window) 对流量进行控制,未经过确认 (ack) 的数据总和不能超过 cwnd 的大小。
TCP 拥塞控制有三个状态,在不同的状态下,会使用不同的方式调整 cwnd 的大小。
slow start 和 congestion avoidance 这两个状态的转换由 cwnd 和 ssthresh(slow start threshold) 控制。
当 cwnd > ssthresh 则由 slow start 转换为 congestion avoidance。
在 slow start 状态下,当收到新的 ack 的时候,cwnd 会成指数增长,而 congestion avoidance 状态下,则成线性增长。
在当前的状态是 slow start 或者是 congestion avoidance 的时候,如果收到 duplicate ACK, 则会将 ssthresh 设置为 window size( min{cwnd, rwnd} ) 的一半。
当收到 3 个或者更多的重复的 ACK 的时候,则进入 Fast Recovery 状态,在收到新的 ACK 的时候,则 Fast Recovery 结束,进入 Congestion Avoidance。
当进入 Fast Recovery 的时候,会将 threshold 设置为 window size( min{cwnd, rwnd} ) 的一半,重传丢失的 segment,将 cwnd 设置为 ssthresh + 3 倍的 segment size。
当 retransmit timer timeout 的时候,则会将 ssthresh 设置为 window size( min{cwnd, rwnd} ) 的一半,并将 cwnd 设置为 segment size。即从新开始 slow start。
图片截取自 MIT 6829 Computer Networking L8 [4]
TCPCongestionControl#receive 在收到 ack 或者 retransmit timer timeout 的时候会被触发。
可以理解成一种 callbck,在有收到 ack 的时候会被调用,在 timeout 事件发生的时候,也会被调用。类似于 Erlang 的 receive。
class TCPCongestionControl
# rwnd receiver's advertised window
def initialize(segment_size)
# segment size announced by the other end, or the default,
# typically 536 or 512
@segment_size = segment_size
@cwnd = segment_size
@ssthresh = 65535 # bytes
@in_fast_recovery = false
# other logic ....
end
def receive
# indicate congestion
if current_algorithm == :slow_start && new_ack?
# exponential growth
@cwnd += @segment_size
# may go_to congestion_avoidance state
# Slow start continues until TCP is halfway to where it was when congestion
# occurred (since it recorded half of the window size that caused
# the problem in step 2), and then congestion avoidance takes over.
elsif current_algorithm == :congestion_avoidance && new_ack?
# linear growth
@cwnd += segsize * segsize / @cwnd
elsif three_or_more_duplicate_ack?
# TCP does not know whether a duplicate ACK is caused by a lost segment or just a reordering of segments
# it waits for a small number and assume it is just a a reordering of segments
# 3 or more duplicate ACKs are received in a row,
# it is a strong indication that a segment has been lost
# go_to fast recovery
@in_fast_recovery = true
@ssthresh = current_window_size / 2
retransmit_missing # retransmit directly without waiting timer
@cwnd = @ssthresh + 3 * @segment_size
# This inflates the congestion window by the number of segments that have left the network and which the other end has cached (3).
elsif current_algorithm == :fast_recovery && new_ack?
@cwnd = @ssthresh # go_to congestion_avoidance
@in_fast_recovery = false
elsif (current_algorithm == :slow_start || current_algorithm == :congestion_avoidance) && duplicate_ack?
@ssthresh = current_window_size / 2 # go_to congestion avoidance
elsif current_algorithm == :fast_recovery && duplicate_ack?
@cwnd += @segment_size
# This inflates the congestion window for the additional segment that has left the network. Transmit a packet, if allowed by the new value of cwnd.
elsif timeout?
# slow_star and congestion_avoidance should come into here
# not sure when timout occurs when fast_recevery should come into here too...
@ssthresh = current_window_size / 2
@cwnd = segment_size # go_to slow start
end
maybe_transmit_packet
end
private
def current_algorithm
return :fast_recovery if @in_fast_recovery
@cwnd <= @ssthresh ? :slow_start : :congestion_avoidance
end
def current_window_size
min(@cwnd, @rwnd)
end
end
Congestion Avoidance and Control[2],对用塞控制背后的原理有详细的解释,不过比较难读(我只读懂了部分)。
Computer Networking: A Top Down Approach 和 Principles of Computer System 都对 TCP 拥塞控有所描述。更深入的可以参考 MIT 的 Computer Networking 课程 [3]。
在操作系统中,也有类似的问题。之前,在 package 超过系统负载能力的时候,CPU 都被用在了处理中断,而没有时间执行应用层的逻辑,导致没办法完整的处理一个 package。和 Congestion Avoidance and Control 有相似的思路,但是用了不同的方法 [4]。
工业上,类似的机制有:流量控制 (rate limiter),背压 (Back pressure)。Facebook 在 memcache cluster 中有使用类似的机制 [5]。
Netfix 有开源 Hystrix[6],但现在已经被 concurrency-limits[7] 所替代,而 concurrency-limits 使用类似 TCP congestion control 的机制。