From mboxrd@z Thu Jan 1 00:00:00 1970
Received: by passt.top (Postfix, from userid 1000)
    id 39BCA5A0623; Mon, 08 Dec 2025 08:20:24 +0100 (CET)
From: Stefano Brivio
To: passt-dev@passt.top
Subject: [PATCH v3 00/10] tcp: Fix throughput issues with non-local peers
Date: Mon, 8 Dec 2025 08:20:13 +0100
Message-ID: <20251208072024.3884137-1-sbrivio@redhat.com>
X-Mailer: git-send-email 2.43.0
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
CC: Max Chernoff, David Gibson
List-Id: Development discussion and patches for passt

Patch 3/10 is the most relevant fix here, as we currently advertise a
window that might be too big for what we can write to the socket,
causing retransmissions right away and occasional high latency on
short transfers to non-local peers.

Mostly as a consequence of fixing that, we now need several
improvements and small fixes, including, most notably, an adaptive
approach to pick the interval between checks for socket-side ACKs
(patch 4/10), and several tricks to reliably trigger TCP buffer size
auto-tuning as implemented by the Linux kernel (patches 6/10 and
8/10).

These changes make some existing issues more relevant; the remaining
patches fix those.

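A minimal sketch of the idea behind patches 3/10 and 7/10, for readers who want something concrete: derive the advertised window from the space still available in the sending buffer rather than from its total size, and advertise zero instead of a window smaller than one MSS. The function name, its parameters and the way the queued byte count is obtained are assumptions made for illustration only; they are not taken from the series, and patch 8/10 later relaxes the strict limit shown here.

```c
#include <stddef.h>

/*
 * Hypothetical helper (name and signature are assumptions, not from the
 * series): pick a window to advertise towards the guest.
 *
 * sndbuf: total size of the socket sending buffer, in bytes
 * queued: bytes already queued in that buffer and not yet sent
 * mss:    maximum segment size for the connection, in bytes
 */
size_t window_from_sndbuf(size_t sndbuf, size_t queued, size_t mss)
{
	size_t avail;

	/* Nothing left in the buffer: advertise a zero-sized window */
	if (queued >= sndbuf)
		return 0;

	avail = sndbuf - queued;

	/* Less than one segment would fit: use zero, not a tiny window */
	if (avail < mss)
		return 0;

	return avail;
}
```

With a 64 KiB buffer, 16 KiB queued and an MSS of 1460 bytes, this sketch would advertise 49152 bytes; with only 536 bytes free, it would advertise zero.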
With this series, I'm getting the expected (wirespeed) throughput for
transfers between peers with varying non-local RTTs: I checked
different guests bridged on the same machine (~500 us) and hosts with
increasing distance using iperf3, as well as HTTP transfers only for
some hosts I have control over (500 us and 5 ms case).

With increasing RTTs, I can finally see the throughput converging to
the available bandwidth reasonably fast:

* 500 us RTT, ~10 Gbps available:

  [ ID]    Interval         Transfer         Bitrate  Retr         Cwnd
  [  5]   0.00-1.00 sec   785 MBytes  6.57 Gbits/sec    13  1.84 MBytes
  [  5]   1.00-2.00 sec   790 MBytes  6.64 Gbits/sec     0  1.92 MBytes

* 5 ms, 1 Gbps:

  [ ID]    Interval         Transfer         Bitrate  Retr         Cwnd
  [  5]   0.00-1.00 sec  4.88 MBytes  40.9 Mbits/sec    22   240 KBytes
  [  5]   1.00-2.00 sec  46.2 MBytes   388 Mbits/sec    34   900 KBytes
  [  5]   2.00-3.00 sec   110 MBytes   923 Mbits/sec     0  1.11 MBytes

* 10 ms, 1 Gbps:

  [ ID]    Interval         Transfer         Bitrate  Retr         Cwnd
  [  5]   0.00-1.00 sec  67.9 MBytes   569 Mbits/sec     2   960 KBytes
  [  5]   1.00-2.00 sec   110 MBytes   926 Mbits/sec     0  1.05 MBytes
  [  5]   2.00-3.00 sec   111 MBytes   934 Mbits/sec     0  1.17 MBytes

* 24 ms, 1 Gbps:

  [ ID]    Interval         Transfer         Bitrate  Retr         Cwnd
  [  5]   0.00-1.00 sec  2.50 MBytes  20.9 Mbits/sec    16   240 KBytes
  [  5]   1.00-2.00 sec  1.50 MBytes  12.6 Mbits/sec     9   120 KBytes
  [  5]   2.00-3.00 sec  99.2 MBytes   832 Mbits/sec     4  2.40 MBytes
  [  5]   3.00-4.00 sec   122 MBytes  1.03 Gbits/sec     1  3.16 MBytes
  [  5]   4.00-5.00 sec   119 MBytes  1.00 Gbits/sec     0  4.16 MBytes

* 40 ms, ~600 Mbps:

  [ ID]    Interval         Transfer         Bitrate  Retr         Cwnd
  [  5]   0.00-1.00 sec  2.12 MBytes  17.8 Mbits/sec     0   600 KBytes
  [  5]   1.00-2.00 sec  3.25 MBytes  27.3 Mbits/sec    14   420 KBytes
  [  5]   2.00-3.00 sec  31.5 MBytes   264 Mbits/sec    11  1.29 MBytes
  [  5]   3.00-4.00 sec  72.5 MBytes   608 Mbits/sec     0  1.46 MBytes

* 100 ms, 1 Gbps:

  [ ID]    Interval         Transfer         Bitrate  Retr         Cwnd
  [  5]   0.00-1.00 sec  4.88 MBytes  40.9 Mbits/sec     1   840 KBytes
  [  5]   1.00-2.00 sec  1.62 MBytes  13.6 Mbits/sec     9   240 KBytes
  [  5]   2.00-3.00 sec  5.25 MBytes  44.0 Mbits/sec     5   780 KBytes
  [  5]   3.00-4.00 sec  9.75 MBytes  81.8 Mbits/sec     0  1.29 MBytes
  [  5]   4.00-5.00 sec  15.8 MBytes   132 Mbits/sec     0  1.99 MBytes
  [  5]   5.00-6.00 sec  22.9 MBytes   192 Mbits/sec     0  3.05 MBytes
  [  5]   6.00-7.00 sec   132 MBytes  1.11 Gbits/sec     0  7.62 MBytes

* 114 ms, 1 Gbps:

  [ ID]    Interval         Transfer         Bitrate  Retr         Cwnd
  [  5]   0.00-1.00 sec  1.62 MBytes  13.6 Mbits/sec     8   420 KBytes
  [  5]   1.00-2.00 sec  2.12 MBytes  17.8 Mbits/sec    15   120 KBytes
  [  5]   2.00-3.00 sec  26.0 MBytes   218 Mbits/sec    50  1.82 MBytes
  [  5]   3.00-4.00 sec   103 MBytes   865 Mbits/sec    31  5.10 MBytes
  [  5]   4.00-5.00 sec   111 MBytes   930 Mbits/sec     0  5.92 MBytes

* 153 ms, 1 Gbps:

  [ ID]    Interval         Transfer         Bitrate  Retr         Cwnd
  [  5]   0.00-1.00 sec  1.12 MBytes  9.43 Mbits/sec     2   180 KBytes
  [  5]   1.00-2.00 sec  2.12 MBytes  17.8 Mbits/sec    11   180 KBytes
  [  5]   2.00-3.00 sec  12.6 MBytes   106 Mbits/sec    40  1.29 MBytes
  [  5]   3.00-4.00 sec  44.5 MBytes   373 Mbits/sec    22  2.75 MBytes
  [  5]   4.00-5.00 sec  86.0 MBytes   721 Mbits/sec     0  6.68 MBytes
  [  5]   5.00-6.00 sec   120 MBytes  1.01 Gbits/sec   119  6.97 MBytes
  [  5]   6.00-7.00 sec   110 MBytes   927 Mbits/sec     0  6.97 MBytes

* 186 ms, 1 Gbps:

  [ ID]    Interval         Transfer         Bitrate  Retr         Cwnd
  [  5]   0.00-1.00 sec  1.12 MBytes  9.43 Mbits/sec     3   180 KBytes
  [  5]   1.00-2.00 sec   512 KBytes  4.19 Mbits/sec     4   120 KBytes
  [  5]   2.00-3.00 sec  2.12 MBytes  17.8 Mbits/sec     6   360 KBytes
  [  5]   3.00-4.00 sec  27.1 MBytes   228 Mbits/sec     6  1.11 MBytes
  [  5]   4.00-5.00 sec  38.2 MBytes   321 Mbits/sec     0  1.99 MBytes
  [  5]   5.00-6.00 sec  38.2 MBytes   321 Mbits/sec     0  2.46 MBytes
  [  5]   6.00-7.00 sec  69.2 MBytes   581 Mbits/sec    71  3.63 MBytes
  [  5]   7.00-8.00 sec   110 MBytes   919 Mbits/sec     0  5.92 MBytes

* 271 ms, 1 Gbps:

  [ ID]    Interval         Transfer         Bitrate  Retr         Cwnd
  [  5]   0.00-1.00 sec  1.12 MBytes  9.43 Mbits/sec     0   320 KBytes
  [  5]   1.00-2.00 sec   0.00 Bytes   0.00 bits/sec     0   600 KBytes
  [  5]   2.00-3.00 sec   512 KBytes  4.19 Mbits/sec     7   420 KBytes
  [  5]   3.00-4.00 sec   896 KBytes  7.34 Mbits/sec     8  60.0 KBytes
  [  5]   4.00-5.00 sec  2.62 MBytes  22.0 Mbits/sec    13   420 KBytes
  [  5]   5.00-6.00 sec  12.1 MBytes   102 Mbits/sec     7  1020 KBytes
  [  5]   6.00-7.00 sec  19.9 MBytes   167 Mbits/sec     0  1.82 MBytes
  [  5]   7.00-8.00 sec  17.9 MBytes   150 Mbits/sec    44  1.76 MBytes
  [  5]   8.00-9.00 sec  57.4 MBytes   481 Mbits/sec    30  2.70 MBytes
  [  5]  9.00-10.00 sec  88.0 MBytes   738 Mbits/sec     0  6.45 MBytes

* 292 ms, ~600 Mbps:

  [ ID]    Interval         Transfer         Bitrate  Retr         Cwnd
  [  5]   0.00-1.00 sec  1.12 MBytes  9.43 Mbits/sec     0   450 KBytes
  [  5]   1.00-2.00 sec   0.00 Bytes   0.00 bits/sec     3   180 KBytes
  [  5]   2.00-3.00 sec   640 KBytes  5.24 Mbits/sec     4   120 KBytes
  [  5]   3.00-4.00 sec   384 KBytes  3.15 Mbits/sec     2   120 KBytes
  [  5]   4.00-5.00 sec  4.50 MBytes  37.7 Mbits/sec     1   660 KBytes
  [  5]   5.00-6.00 sec  13.0 MBytes   109 Mbits/sec     3   960 KBytes
  [  5]   6.00-7.00 sec  64.5 MBytes   541 Mbits/sec     0  2.17 MBytes

* 327 ms, 1 Gbps:

  [ ID]    Interval         Transfer         Bitrate  Retr         Cwnd
  [  5]   0.00-1.00 sec  1.12 MBytes  9.43 Mbits/sec     0   600 KBytes
  [  5]   1.00-2.00 sec   0.00 Bytes   0.00 bits/sec     4   240 KBytes
  [  5]   2.00-3.00 sec   768 KBytes  6.29 Mbits/sec     4   120 KBytes
  [  5]   3.00-4.00 sec  1.62 MBytes  13.6 Mbits/sec     5   120 KBytes
  [  5]   4.00-5.00 sec  1.88 MBytes  15.7 Mbits/sec     0   480 KBytes
  [  5]   5.00-6.00 sec  17.6 MBytes   148 Mbits/sec    14  1.05 MBytes
  [  5]   6.00-7.00 sec  35.1 MBytes   295 Mbits/sec     0  2.58 MBytes
  [  5]   7.00-8.00 sec  45.2 MBytes   380 Mbits/sec     0  4.63 MBytes
  [  5]   8.00-9.00 sec  27.0 MBytes   226 Mbits/sec    96  3.93 MBytes
  [  5]  9.00-10.00 sec  85.9 MBytes   720 Mbits/sec    67  4.22 MBytes
  [  5] 10.00-11.00 sec   118 MBytes   986 Mbits/sec     0  9.67 MBytes
  [  5] 11.00-12.00 sec   124 MBytes  1.04 Gbits/sec     0  15.9 MBytes

For short transfers, we strictly stick to the available sending buffer
size to (almost) make sure we avoid local retransmissions, and
significantly decrease transfer time as a result: from 1.2 s to 60 ms
for a 5 MB HTTP transfer from a container hosted in a virtual machine
to another guest.

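As background for the figures above: to fill a path, a sender needs roughly one bandwidth-delay product of data in flight, which is why buffer auto-tuning matters more as the RTT grows. The helper below is only a worked arithmetic sketch; its name and units are assumptions, and it is not code from the series.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the series): bandwidth-delay product
 * in bytes, given a rate in bits per second and an RTT in microseconds.
 */
uint64_t bdp_bytes(uint64_t rate_bps, uint64_t rtt_us)
{
	/* Convert bits to bytes first, then scale by the RTT in seconds */
	return rate_bps / 8 * rtt_us / 1000000;
}
```

For instance, 1 Gbps at 100 ms gives 12500000 bytes (12.5 MB), while 10 Gbps at 500 us gives 625000 bytes.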
v3:
  - split 1/9 into 1/10 and 2/10 (one adding the function, one
    changing the usage factor)
  - in 1/10, rename function and clarify that the factor can be less
    than 100%
  - fix timer expiration formatting in 4/10
  - in 6/10, use tcpi_rtt directly instead of its stored approximation

v2:
  - add 1/9, factoring out a generic version of the scaling function
    we already use for tcp_get_sndbuf(), as I'm now using it in 7/9 as
    well
  - in 3/9, use 4 bits instead of 3 to represent the RTT, which is
    important as we now use RTT values for more than just the ACK
    checks
  - in 5/9, instead of just comparing the sending buffer to SNDBUF_BIG
    to decide when to acknowledge data, use an adaptive approach based
    on the bandwidth-delay product
  - in 6/9, clarify the relationship between SWS avoidance and Nagle's
    algorithm, and introduce a reference to RFC 813, Section 4
  - in 7/9, use an adaptive approach based on the product of bytes
    sent (and acknowledged) so far and RTT, instead of the previous
    approach based on bytes sent only, as it allows us to converge to
    the expected throughput much quicker with high RTT destinations

Stefano Brivio (10):
  tcp, util: Add function for scaling to linearly interpolated factor,
    use it
  tcp: Change usage factor of sending buffer in tcp_get_sndbuf() to 75%
  tcp: Limit advertised window to available, not total sending buffer
    size
  tcp: Adaptive interval based on RTT for socket-side acknowledgement
    checks
  tcp: Don't clear ACK_TO_TAP_DUE if we're advertising a zero-sized
    window
  tcp: Acknowledge everything if it looks like bulk traffic, not
    interactive
  tcp: Don't limit window to less-than-MSS values, use zero instead
  tcp: Allow exceeding the available sending buffer size in window
    advertisements
  tcp: Send a duplicate ACK also on complete sendmsg() failure
  tcp: Skip redundant ACK on partial sendmsg() failure

 README.md  |   2 +-
 tcp.c      | 170 ++++++++++++++++++++++++++++++++++++++++++-----------
 tcp_conn.h |   9 +++
 util.c     |  52 ++++++++++++++++
 util.h     |   2 +
 5 files changed, 199 insertions(+), 36 deletions(-)

-- 
2.43.0
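
The "scaling to linearly interpolated factor" mentioned for patch 1/10 can be pictured roughly as follows. This is a sketch under stated assumptions, not the function added by the series: the name, argument order and integer rounding are all hypothetical, and it assumes lo < hi. A value is left untouched while a reference quantity is at or below a lower threshold, scaled to f percent at or above an upper threshold, and interpolated linearly in between.

```c
/*
 * Hypothetical sketch (name, signature and rounding are assumptions):
 * scale x by a factor that goes linearly from 100% to f%, depending on
 * where y falls between the thresholds lo and hi. Assumes lo < hi.
 */
long scale_interpolated(long x, long y, long lo, long hi, long f)
{
	/* At or below the lower threshold: keep x as it is */
	if (y <= lo)
		return x;

	/* At or above the upper threshold: apply the full factor */
	if (y >= hi)
		return x * f / 100;

	/* In between: remove a share of (100 - f)% proportional to y */
	return x - x * (100 - f) / 100 * (y - lo) / (hi - lo);
}
```

With f set to 75, thresholds of 100 and 200, and x equal to 1000, the result is 1000 for y = 0, 875 for y = 150, and 750 for y = 300.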