From: Stefano Brivio <sbrivio@redhat.com>
To: passt-dev@passt.top
Cc: Max Chernoff <git@maxchernoff.ca>,
	David Gibson <david@gibson.dropbear.id.au>
Subject: [PATCH v2 0/9] tcp: Fix throughput issues with non-local peers
Date: Mon, 8 Dec 2025 01:22:08 +0100
Message-ID: <20251208002229.391162-1-sbrivio@redhat.com>

Patch 2/9 is the most relevant fix here, as we currently advertise a
window that might be too big for what we can write to the socket,
causing retransmissions right away and occasional high latency on
short transfers to non-local peers.
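
As a rough illustration of the idea behind patch 2/9 (this is not the
code from the patch itself), the window we advertise can be bounded by
the space still free in the socket's sending buffer rather than by its
total size. The sketch below assumes Linux, where the SIOCOUTQ ioctl
reports the data still held in the socket send queue. The helper names
sndbuf_available() and window_limit() are hypothetical, kernel
bookkeeping overhead is ignored, and the patch may well obtain these
values differently.

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/sockios.h>	/* SIOCOUTQ */

/* Hypothetical helper: space left in a sending buffer of 'sndbuf' bytes
 * when 'queued' bytes are still waiting in it, never less than zero.
 */
uint32_t sndbuf_available(uint32_t sndbuf, uint32_t queued)
{
	return queued >= sndbuf ? 0 : sndbuf - queued;
}

/* Hypothetical helper: upper bound for the window advertised to the peer
 * whose data is written to socket 's', or -1 on failure.
 */
int64_t window_limit(int s)
{
	int sndbuf, queued;
	socklen_t sl = sizeof(sndbuf);

	/* Total size of the sending buffer, as reported by the kernel */
	if (getsockopt(s, SOL_SOCKET, SO_SNDBUF, &sndbuf, &sl))
		return -1;

	/* Data still held in the send queue */
	if (ioctl(s, SIOCOUTQ, &queued))
		return -1;

	return sndbuf_available(sndbuf, queued);
}
```

With a 64 KiB buffer and 16 KiB still queued, this would cap the
advertised window at 48 KiB instead of the full 64 KiB.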
Mostly as a consequence of fixing that, we now need several
improvements and small fixes, including, most notably, an adaptive
approach to pick the interval between checks for socket-side ACKs
(patch 3/9), and several tricks to reliably trigger TCP buffer size
auto-tuning as implemented by the Linux kernel (patches 5/9 and 7/9).
These changes, in turn, make some existing issues more relevant: those
are fixed by the other patches.
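
To give a feeling for the adaptive interval of patch 3/9, here is a
minimal sketch, which is not the implementation from the patch: it
assumes that the RTT is kept in a 4-bit field as a power-of-two
multiple of a base unit, and that the interval between ACK checks is
simply derived from that stored value. RTT_UNIT_US, rtt_to_exp() and
ack_interval_us() are made-up names and values for illustration only.
The smoothed RTT itself could come, for example, from the tcpi_rtt
field of struct tcp_info.

```c
#include <stdint.h>

#define RTT_EXP_BITS	4			/* Bits used to store the RTT */
#define RTT_EXP_MAX	((1U << RTT_EXP_BITS) - 1)
#define RTT_UNIT_US	100U			/* Assumed base unit: 100 us */

/* Smallest exponent e, capped at RTT_EXP_MAX, such that
 * (RTT_UNIT_US << e) is at least 'rtt_us'
 */
uint8_t rtt_to_exp(uint32_t rtt_us)
{
	uint8_t e = 0;

	while (e < RTT_EXP_MAX && ((uint64_t)RTT_UNIT_US << e) < rtt_us)
		e++;

	return e;
}

/* Interval between ACK checks, in microseconds, for a stored exponent */
uint64_t ack_interval_us(uint8_t exp)
{
	return (uint64_t)RTT_UNIT_US << exp;
}
```

With these assumed values, a 500 us RTT is stored as exponent 3 and
checked every 800 us, and the sixteen steps cover RTTs from 100 us up
to about 3.3 s.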
With this series, I'm getting the expected (wirespeed) throughput for
transfers between peers with varying non-local RTTs: using iperf3, I
checked different guests bridged on the same machine (~500 us) and
hosts at increasing distances. I also checked HTTP transfers, but only
for some hosts I have control over (500 us and 5 ms cases).
With increasing RTTs, I can finally see the throughput converging to
the available bandwidth reasonably fast:

* 500 us RTT, ~10 Gbps available:

  [ ID] Interval           Transfer     Bitrate         Retr   Cwnd
  [  5]   0.00-1.00   sec   785 MBytes  6.57 Gbits/sec   13   1.84 MBytes
  [  5]   1.00-2.00   sec   790 MBytes  6.64 Gbits/sec    0   1.92 MBytes

* 5 ms, 1 Gbps:

  [ ID] Interval           Transfer     Bitrate         Retr   Cwnd
  [  5]   0.00-1.00   sec  4.88 MBytes  40.9 Mbits/sec   22    240 KBytes
  [  5]   1.00-2.00   sec  46.2 MBytes   388 Mbits/sec   34    900 KBytes
  [  5]   2.00-3.00   sec   110 MBytes   923 Mbits/sec    0   1.11 MBytes

* 10 ms, 1 Gbps:

  [ ID] Interval           Transfer     Bitrate         Retr   Cwnd
  [  5]   0.00-1.00   sec  67.9 MBytes   569 Mbits/sec    2    960 KBytes
  [  5]   1.00-2.00   sec   110 MBytes   926 Mbits/sec    0   1.05 MBytes
  [  5]   2.00-3.00   sec   111 MBytes   934 Mbits/sec    0   1.17 MBytes

* 24 ms, 1 Gbps:

  [ ID] Interval           Transfer     Bitrate         Retr   Cwnd
  [  5]   0.00-1.00   sec  2.50 MBytes  20.9 Mbits/sec   16    240 KBytes
  [  5]   1.00-2.00   sec  1.50 MBytes  12.6 Mbits/sec    9    120 KBytes
  [  5]   2.00-3.00   sec  99.2 MBytes   832 Mbits/sec    4   2.40 MBytes
  [  5]   3.00-4.00   sec   122 MBytes  1.03 Gbits/sec    1   3.16 MBytes
  [  5]   4.00-5.00   sec   119 MBytes  1.00 Gbits/sec    0   4.16 MBytes

* 40 ms, ~600 Mbps:

  [ ID] Interval           Transfer     Bitrate         Retr   Cwnd
  [  5]   0.00-1.00   sec  2.12 MBytes  17.8 Mbits/sec    0    600 KBytes
  [  5]   1.00-2.00   sec  3.25 MBytes  27.3 Mbits/sec   14    420 KBytes
  [  5]   2.00-3.00   sec  31.5 MBytes   264 Mbits/sec   11   1.29 MBytes
  [  5]   3.00-4.00   sec  72.5 MBytes   608 Mbits/sec    0   1.46 MBytes

* 100 ms, 1 Gbps:

  [ ID] Interval           Transfer     Bitrate         Retr   Cwnd
  [  5]   0.00-1.00   sec  4.88 MBytes  40.9 Mbits/sec    1    840 KBytes
  [  5]   1.00-2.00   sec  1.62 MBytes  13.6 Mbits/sec    9    240 KBytes
  [  5]   2.00-3.00   sec  5.25 MBytes  44.0 Mbits/sec    5    780 KBytes
  [  5]   3.00-4.00   sec  9.75 MBytes  81.8 Mbits/sec    0   1.29 MBytes
  [  5]   4.00-5.00   sec  15.8 MBytes   132 Mbits/sec    0   1.99 MBytes
  [  5]   5.00-6.00   sec  22.9 MBytes   192 Mbits/sec    0   3.05 MBytes
  [  5]   6.00-7.00   sec   132 MBytes  1.11 Gbits/sec    0   7.62 MBytes

* 114 ms, 1 Gbps:

  [ ID] Interval           Transfer     Bitrate         Retr   Cwnd
  [  5]   0.00-1.00   sec  1.62 MBytes  13.6 Mbits/sec    8    420 KBytes
  [  5]   1.00-2.00   sec  2.12 MBytes  17.8 Mbits/sec   15    120 KBytes
  [  5]   2.00-3.00   sec  26.0 MBytes   218 Mbits/sec   50   1.82 MBytes
  [  5]   3.00-4.00   sec   103 MBytes   865 Mbits/sec   31   5.10 MBytes
  [  5]   4.00-5.00   sec   111 MBytes   930 Mbits/sec    0   5.92 MBytes

* 153 ms, 1 Gbps:

  [ ID] Interval           Transfer     Bitrate         Retr   Cwnd
  [  5]   0.00-1.00   sec  1.12 MBytes  9.43 Mbits/sec    2    180 KBytes
  [  5]   1.00-2.00   sec  2.12 MBytes  17.8 Mbits/sec   11    180 KBytes
  [  5]   2.00-3.00   sec  12.6 MBytes   106 Mbits/sec   40   1.29 MBytes
  [  5]   3.00-4.00   sec  44.5 MBytes   373 Mbits/sec   22   2.75 MBytes
  [  5]   4.00-5.00   sec  86.0 MBytes   721 Mbits/sec    0   6.68 MBytes
  [  5]   5.00-6.00   sec   120 MBytes  1.01 Gbits/sec  119   6.97 MBytes
  [  5]   6.00-7.00   sec   110 MBytes   927 Mbits/sec    0   6.97 MBytes

* 186 ms, 1 Gbps:

  [ ID] Interval           Transfer     Bitrate         Retr   Cwnd
  [  5]   0.00-1.00   sec  1.12 MBytes  9.43 Mbits/sec    3    180 KBytes
  [  5]   1.00-2.00   sec   512 KBytes  4.19 Mbits/sec    4    120 KBytes
  [  5]   2.00-3.00   sec  2.12 MBytes  17.8 Mbits/sec    6    360 KBytes
  [  5]   3.00-4.00   sec  27.1 MBytes   228 Mbits/sec    6   1.11 MBytes
  [  5]   4.00-5.00   sec  38.2 MBytes   321 Mbits/sec    0   1.99 MBytes
  [  5]   5.00-6.00   sec  38.2 MBytes   321 Mbits/sec    0   2.46 MBytes
  [  5]   6.00-7.00   sec  69.2 MBytes   581 Mbits/sec   71   3.63 MBytes
  [  5]   7.00-8.00   sec   110 MBytes   919 Mbits/sec    0   5.92 MBytes

* 271 ms, 1 Gbps:

  [ ID] Interval           Transfer     Bitrate         Retr   Cwnd
  [  5]   0.00-1.00   sec  1.12 MBytes  9.43 Mbits/sec    0    320 KBytes
  [  5]   1.00-2.00   sec  0.00 Bytes   0.00 bits/sec     0    600 KBytes
  [  5]   2.00-3.00   sec   512 KBytes  4.19 Mbits/sec    7    420 KBytes
  [  5]   3.00-4.00   sec   896 KBytes  7.34 Mbits/sec    8   60.0 KBytes
  [  5]   4.00-5.00   sec  2.62 MBytes  22.0 Mbits/sec   13    420 KBytes
  [  5]   5.00-6.00   sec  12.1 MBytes   102 Mbits/sec    7   1020 KBytes
  [  5]   6.00-7.00   sec  19.9 MBytes   167 Mbits/sec    0   1.82 MBytes
  [  5]   7.00-8.00   sec  17.9 MBytes   150 Mbits/sec   44   1.76 MBytes
  [  5]   8.00-9.00   sec  57.4 MBytes   481 Mbits/sec   30   2.70 MBytes
  [  5]   9.00-10.00  sec  88.0 MBytes   738 Mbits/sec    0   6.45 MBytes

* 292 ms, ~600 Mbps:

  [ ID] Interval           Transfer     Bitrate         Retr   Cwnd
  [  5]   0.00-1.00   sec  1.12 MBytes  9.43 Mbits/sec    0    450 KBytes
  [  5]   1.00-2.00   sec  0.00 Bytes   0.00 bits/sec     3    180 KBytes
  [  5]   2.00-3.00   sec   640 KBytes  5.24 Mbits/sec    4    120 KBytes
  [  5]   3.00-4.00   sec   384 KBytes  3.15 Mbits/sec    2    120 KBytes
  [  5]   4.00-5.00   sec  4.50 MBytes  37.7 Mbits/sec    1    660 KBytes
  [  5]   5.00-6.00   sec  13.0 MBytes   109 Mbits/sec    3    960 KBytes
  [  5]   6.00-7.00   sec  64.5 MBytes   541 Mbits/sec    0   2.17 MBytes

* 327 ms, 1 Gbps:

  [ ID] Interval           Transfer     Bitrate         Retr   Cwnd
  [  5]   0.00-1.00   sec  1.12 MBytes  9.43 Mbits/sec    0    600 KBytes
  [  5]   1.00-2.00   sec  0.00 Bytes   0.00 bits/sec     4    240 KBytes
  [  5]   2.00-3.00   sec   768 KBytes  6.29 Mbits/sec    4    120 KBytes
  [  5]   3.00-4.00   sec  1.62 MBytes  13.6 Mbits/sec    5    120 KBytes
  [  5]   4.00-5.00   sec  1.88 MBytes  15.7 Mbits/sec    0    480 KBytes
  [  5]   5.00-6.00   sec  17.6 MBytes   148 Mbits/sec   14   1.05 MBytes
  [  5]   6.00-7.00   sec  35.1 MBytes   295 Mbits/sec    0   2.58 MBytes
  [  5]   7.00-8.00   sec  45.2 MBytes   380 Mbits/sec    0   4.63 MBytes
  [  5]   8.00-9.00   sec  27.0 MBytes   226 Mbits/sec   96   3.93 MBytes
  [  5]   9.00-10.00  sec  85.9 MBytes   720 Mbits/sec   67   4.22 MBytes
  [  5]  10.00-11.00  sec   118 MBytes   986 Mbits/sec    0   9.67 MBytes
  [  5]  11.00-12.00  sec   124 MBytes  1.04 Gbits/sec    0   15.9 MBytes

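
As a reference for reading these figures: the amount of data that has
to be in flight to fill a path is its bandwidth-delay product, so the
window a sender needs grows linearly with the RTT. The helper below is
only an illustrative sketch (bdp_bytes() is a hypothetical name, not a
function from this series) computing it from a bandwidth in bits per
second and an RTT in microseconds.

```c
#include <stdint.h>

/* Bandwidth-delay product in bytes, for 'bps' bits per second and an
 * RTT of 'rtt_us' microseconds
 */
uint64_t bdp_bytes(uint64_t bps, uint64_t rtt_us)
{
	return bps / 8 * rtt_us / 1000000;
}
```

For example, 1 Gbps at a 100 ms RTT gives 12.5 MB, against 62.5 kB at
500 us.
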
For short transfers, we strictly stick to the available sending buffer
size to (almost) make sure we avoid local retransmissions, and
significantly decrease transfer time as a result: from 1.2 s to 60 ms
for a 5 MB HTTP transfer from a container hosted in a virtual machine
to another guest.

v2:

- Add 1/9, factoring out a generic version of the scaling function we
  already use for tcp_get_sndbuf(), as I'm now using it in 7/9 as well
- in 3/9, use 4 bits instead of 3 to represent the RTT, which is
  important as we now use RTT values for more than just the ACK checks
- in 5/9, instead of just comparing the sending buffer to SNDBUF_BIG
  to decide when to acknowledge data, use an adaptive approach based on
  the bandwidth-delay product
- in 6/9, clarify the relationship between SWS avoidance and Nagle's
  algorithm, and introduce a reference to RFC 813, Section 4
- in 7/9, use an adaptive approach based on the product of bytes sent
  (and acknowledged) so far and RTT, instead of the previous approach
  based on bytes sent only, as it allows us to converge to the expected
  throughput much quicker with high RTT destinations
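
The kind of scaling mentioned for 1/9 can be sketched as follows. This
is not the function added by the patch: scale_linear() and its
parameters are hypothetical, and the sketch simply assumes that a value
is left untouched below a lower threshold, scaled by a given percentage
above an upper threshold, and scaled by a linearly interpolated
percentage in between.

```c
#include <stdint.h>

/* Scale 'x' by a factor going linearly from 100% at 'lo' (and below) to
 * 'pct' percent at 'hi' (and above). Assumes lo < hi and pct <= 100.
 */
uint64_t scale_linear(uint64_t x, uint64_t lo, uint64_t hi, unsigned pct)
{
	uint64_t f;	/* Factor applied to x, in percent */

	if (x <= lo)
		return x;

	if (x >= hi)
		return x * pct / 100;

	/* Between thresholds: interpolate the percentage, then apply it */
	f = 100 - (100 - pct) * (x - lo) / (hi - lo);

	return x * f / 100;
}
```

For instance, with thresholds at 2000 and 4000 and a 50% target, 3000
is scaled by 75%, to 2250.
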

Stefano Brivio (9):
  tcp, util: Add function for scaling to linearly interpolated factor,
    use it
  tcp: Limit advertised window to available, not total sending buffer
    size
  tcp: Adaptive interval based on RTT for socket-side acknowledgement
    checks
  tcp: Don't clear ACK_TO_TAP_DUE if we're advertising a zero-sized
    window
  tcp: Acknowledge everything if it looks like bulk traffic, not
    interactive
  tcp: Don't limit window to less-than-MSS values, use zero instead
  tcp: Allow exceeding the available sending buffer size in window
    advertisements
  tcp: Send a duplicate ACK also on complete sendmsg() failure
  tcp: Skip redundant ACK on partial sendmsg() failure

 README.md  |   2 +-
 tcp.c      | 168 ++++++++++++++++++++++++++++++++++++++++++-----------
 tcp_conn.h |   9 +++
 util.c     |  52 +++++++++++++++++
 util.h     |   2 +
 5 files changed, 197 insertions(+), 36 deletions(-)

--
2.43.0