public inbox for passt-dev@passt.top
From: Stefano Brivio <sbrivio@redhat.com>
To: passt-dev@passt.top
Cc: Max Chernoff <git@maxchernoff.ca>,
	David Gibson <david@gibson.dropbear.id.au>
Subject: [PATCH v3 00/10] tcp: Fix throughput issues with non-local peers
Date: Mon,  8 Dec 2025 08:20:13 +0100
Message-ID: <20251208072024.3884137-1-sbrivio@redhat.com>

Patch 3/10 is the most relevant fix here, as we currently advertise a
window that might be too big for what we can write to the socket,
causing retransmissions right away and occasional high latency on
short transfers to non-local peers.
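
As a rough illustration of the idea behind patch 3/10 (this is not the
code from the patch), the advertised window could be capped by the space
still free in the socket sending buffer, rather than by its total size.
The function name and its parameters are hypothetical; on Linux, the
inputs might come from SO_SNDBUF and the SIOCOUTQ ioctl, but that is an
assumption here:

```c
#include <stdint.h>

/* Hypothetical helper, not taken from the series: cap the window we
 * advertise by the space still available in the socket sending buffer.
 *
 * sndbuf:   total sending buffer size, in bytes
 * queued:   bytes already sitting in the sending buffer
 * wnd:      window we would otherwise advertise, in bytes
 */
uint32_t wnd_from_available(uint32_t sndbuf, uint32_t queued, uint32_t wnd)
{
	/* Never go negative if the queue exceeds the reported size */
	uint32_t avail = queued < sndbuf ? sndbuf - queued : 0;

	return wnd < avail ? wnd : avail;
}
```

With a full buffer this yields a zero window, which is the case the
total-size approach would miss.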

Mostly as a consequence of fixing that, we now need several
improvements and small fixes, including, most notably, an adaptive
approach to pick the interval between checks for socket-side ACKs
(patch 4/10), and several tricks to reliably trigger TCP buffer size
auto-tuning as implemented by the Linux kernel (patches 6/10 and 8/10).
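
A minimal sketch of what an RTT-adaptive check interval might look
like. The function name, the RTT fraction and both bounds are
hypothetical, picked only for illustration; the actual derivation in
patch 4/10 may well differ:

```c
#include <stdint.h>

#define ACK_INTERVAL_MIN_US	100	/* hypothetical lower bound */
#define ACK_INTERVAL_MAX_US	10000	/* hypothetical upper bound */

/* Pick the interval between checks for socket-side ACKs as half of the
 * measured RTT, clamped between the two bounds above.
 */
uint32_t ack_check_interval_us(uint32_t rtt_us)
{
	uint32_t interval = rtt_us / 2;

	if (interval < ACK_INTERVAL_MIN_US)
		return ACK_INTERVAL_MIN_US;

	if (interval > ACK_INTERVAL_MAX_US)
		return ACK_INTERVAL_MAX_US;

	return interval;
}
```

The point of scaling with the RTT is that a fixed interval is either
too coarse for local peers or needlessly frequent for distant ones.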

These changes make some existing issues more relevant. Those are fixed
by the remaining patches.

With this series, I'm getting the expected (wirespeed) throughput for
transfers between peers with varying non-local RTTs: using iperf3, I
checked different guests bridged on the same machine (~500 us) and
hosts at increasing distance. I also checked HTTP transfers, but only
for hosts I have control over (the 500 us and 5 ms cases).

With increasing RTTs, I can finally see the throughput converging to
the available bandwidth reasonably fast:

* 500 us RTT, ~10 Gbps available:

  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
  [  5]   0.00-1.00   sec   785 MBytes  6.57 Gbits/sec   13   1.84 MBytes
  [  5]   1.00-2.00   sec   790 MBytes  6.64 Gbits/sec    0   1.92 MBytes

* 5 ms, 1 Gbps:

  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
  [  5]   0.00-1.00   sec  4.88 MBytes  40.9 Mbits/sec   22    240 KBytes
  [  5]   1.00-2.00   sec  46.2 MBytes   388 Mbits/sec   34    900 KBytes
  [  5]   2.00-3.00   sec   110 MBytes   923 Mbits/sec    0   1.11 MBytes

* 10 ms, 1 Gbps:

  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
  [  5]   0.00-1.00   sec  67.9 MBytes   569 Mbits/sec    2    960 KBytes
  [  5]   1.00-2.00   sec   110 MBytes   926 Mbits/sec    0   1.05 MBytes
  [  5]   2.00-3.00   sec   111 MBytes   934 Mbits/sec    0   1.17 MBytes

* 24 ms, 1 Gbps:

  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
  [  5]   0.00-1.00   sec  2.50 MBytes  20.9 Mbits/sec   16    240 KBytes
  [  5]   1.00-2.00   sec  1.50 MBytes  12.6 Mbits/sec    9    120 KBytes
  [  5]   2.00-3.00   sec  99.2 MBytes   832 Mbits/sec    4   2.40 MBytes
  [  5]   3.00-4.00   sec   122 MBytes  1.03 Gbits/sec    1   3.16 MBytes
  [  5]   4.00-5.00   sec   119 MBytes  1.00 Gbits/sec    0   4.16 MBytes

* 40 ms, ~600 Mbps:

  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
  [  5]   0.00-1.00   sec  2.12 MBytes  17.8 Mbits/sec    0    600 KBytes
  [  5]   1.00-2.00   sec  3.25 MBytes  27.3 Mbits/sec   14    420 KBytes
  [  5]   2.00-3.00   sec  31.5 MBytes   264 Mbits/sec   11   1.29 MBytes
  [  5]   3.00-4.00   sec  72.5 MBytes   608 Mbits/sec    0   1.46 MBytes

* 100 ms, 1 Gbps:

  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
  [  5]   0.00-1.00   sec  4.88 MBytes  40.9 Mbits/sec    1    840 KBytes
  [  5]   1.00-2.00   sec  1.62 MBytes  13.6 Mbits/sec    9    240 KBytes
  [  5]   2.00-3.00   sec  5.25 MBytes  44.0 Mbits/sec    5    780 KBytes
  [  5]   3.00-4.00   sec  9.75 MBytes  81.8 Mbits/sec    0   1.29 MBytes
  [  5]   4.00-5.00   sec  15.8 MBytes   132 Mbits/sec    0   1.99 MBytes
  [  5]   5.00-6.00   sec  22.9 MBytes   192 Mbits/sec    0   3.05 MBytes
  [  5]   6.00-7.00   sec   132 MBytes  1.11 Gbits/sec    0   7.62 MBytes

* 114 ms, 1 Gbps:

  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
  [  5]   0.00-1.00   sec  1.62 MBytes  13.6 Mbits/sec    8    420 KBytes
  [  5]   1.00-2.00   sec  2.12 MBytes  17.8 Mbits/sec   15    120 KBytes
  [  5]   2.00-3.00   sec  26.0 MBytes   218 Mbits/sec   50   1.82 MBytes
  [  5]   3.00-4.00   sec   103 MBytes   865 Mbits/sec   31   5.10 MBytes
  [  5]   4.00-5.00   sec   111 MBytes   930 Mbits/sec    0   5.92 MBytes

* 153 ms, 1 Gbps:

  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
  [  5]   0.00-1.00   sec  1.12 MBytes  9.43 Mbits/sec    2    180 KBytes
  [  5]   1.00-2.00   sec  2.12 MBytes  17.8 Mbits/sec   11    180 KBytes
  [  5]   2.00-3.00   sec  12.6 MBytes   106 Mbits/sec   40   1.29 MBytes
  [  5]   3.00-4.00   sec  44.5 MBytes   373 Mbits/sec   22   2.75 MBytes
  [  5]   4.00-5.00   sec  86.0 MBytes   721 Mbits/sec    0   6.68 MBytes
  [  5]   5.00-6.00   sec   120 MBytes  1.01 Gbits/sec  119   6.97 MBytes
  [  5]   6.00-7.00   sec   110 MBytes   927 Mbits/sec    0   6.97 MBytes

* 186 ms, 1 Gbps:

  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
  [  5]   0.00-1.00   sec  1.12 MBytes  9.43 Mbits/sec    3    180 KBytes
  [  5]   1.00-2.00   sec   512 KBytes  4.19 Mbits/sec    4    120 KBytes
  [  5]   2.00-3.00   sec  2.12 MBytes  17.8 Mbits/sec    6    360 KBytes
  [  5]   3.00-4.00   sec  27.1 MBytes   228 Mbits/sec    6   1.11 MBytes
  [  5]   4.00-5.00   sec  38.2 MBytes   321 Mbits/sec    0   1.99 MBytes
  [  5]   5.00-6.00   sec  38.2 MBytes   321 Mbits/sec    0   2.46 MBytes
  [  5]   6.00-7.00   sec  69.2 MBytes   581 Mbits/sec   71   3.63 MBytes
  [  5]   7.00-8.00   sec   110 MBytes   919 Mbits/sec    0   5.92 MBytes

* 271 ms, 1 Gbps:

  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
  [  5]   0.00-1.00   sec  1.12 MBytes  9.43 Mbits/sec    0    320 KBytes
  [  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    0    600 KBytes
  [  5]   2.00-3.00   sec   512 KBytes  4.19 Mbits/sec    7    420 KBytes
  [  5]   3.00-4.00   sec   896 KBytes  7.34 Mbits/sec    8   60.0 KBytes
  [  5]   4.00-5.00   sec  2.62 MBytes  22.0 Mbits/sec   13    420 KBytes
  [  5]   5.00-6.00   sec  12.1 MBytes   102 Mbits/sec    7   1020 KBytes
  [  5]   6.00-7.00   sec  19.9 MBytes   167 Mbits/sec    0   1.82 MBytes
  [  5]   7.00-8.00   sec  17.9 MBytes   150 Mbits/sec   44   1.76 MBytes
  [  5]   8.00-9.00   sec  57.4 MBytes   481 Mbits/sec   30   2.70 MBytes
  [  5]   9.00-10.00  sec  88.0 MBytes   738 Mbits/sec    0   6.45 MBytes

* 292 ms, ~600 Mbps:

  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
  [  5]   0.00-1.00   sec  1.12 MBytes  9.43 Mbits/sec    0    450 KBytes
  [  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    3    180 KBytes
  [  5]   2.00-3.00   sec   640 KBytes  5.24 Mbits/sec    4    120 KBytes
  [  5]   3.00-4.00   sec   384 KBytes  3.15 Mbits/sec    2    120 KBytes
  [  5]   4.00-5.00   sec  4.50 MBytes  37.7 Mbits/sec    1    660 KBytes
  [  5]   5.00-6.00   sec  13.0 MBytes   109 Mbits/sec    3    960 KBytes
  [  5]   6.00-7.00   sec  64.5 MBytes   541 Mbits/sec    0   2.17 MBytes

* 327 ms, 1 Gbps:

  [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
  [  5]   0.00-1.00   sec  1.12 MBytes  9.43 Mbits/sec    0    600 KBytes
  [  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    4    240 KBytes
  [  5]   2.00-3.00   sec   768 KBytes  6.29 Mbits/sec    4    120 KBytes
  [  5]   3.00-4.00   sec  1.62 MBytes  13.6 Mbits/sec    5    120 KBytes
  [  5]   4.00-5.00   sec  1.88 MBytes  15.7 Mbits/sec    0    480 KBytes
  [  5]   5.00-6.00   sec  17.6 MBytes   148 Mbits/sec   14   1.05 MBytes
  [  5]   6.00-7.00   sec  35.1 MBytes   295 Mbits/sec    0   2.58 MBytes
  [  5]   7.00-8.00   sec  45.2 MBytes   380 Mbits/sec    0   4.63 MBytes
  [  5]   8.00-9.00   sec  27.0 MBytes   226 Mbits/sec   96   3.93 MBytes
  [  5]   9.00-10.00  sec  85.9 MBytes   720 Mbits/sec   67   4.22 MBytes
  [  5]  10.00-11.00  sec   118 MBytes   986 Mbits/sec    0   9.67 MBytes
  [  5]  11.00-12.00  sec   124 MBytes  1.04 Gbits/sec    0   15.9 MBytes

For short transfers, we strictly stick to the available sending buffer
size to (almost) make sure we avoid local retransmissions, and
significantly decrease transfer time as a result: from 1.2 s to 60 ms
for a 5 MB HTTP transfer from a container hosted in a virtual machine
to another guest.

v3:

- split 1/9 into 1/10 and 2/10 (one adding the function, one changing
  the usage factor)

- in 1/10, rename function and clarify that the factor can be less
  than 100%

- fix timer expiration formatting in 4/10

- in 6/10, use tcpi_rtt directly instead of its stored approximation
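
For readers who haven't looked at 1/10 yet, this is a sketch of what
"scaling to a linearly interpolated factor" could mean. The function
name and signature are hypothetical and not copied from the patch:
values up to a low threshold are left alone, values from a high
threshold on are scaled by the full factor, and the factor is
interpolated linearly in between. The factor can be below or above
100%:

```c
#include <stdint.h>

/* Hypothetical sketch: scale x by a percentage that goes linearly from
 * 100% at x <= low to factor_pct at x >= high. No overflow handling:
 * this assumes x is small enough for x * factor_pct to fit in 64 bits.
 */
uint64_t scale_interpolated(uint64_t x, uint64_t low, uint64_t high,
			    unsigned int factor_pct)
{
	int64_t diff, pct;

	if (x <= low)
		return x;

	if (x >= high)
		return x * factor_pct / 100;

	/* Here low < x < high, so high - low is never zero */
	diff = (int64_t)factor_pct - 100;
	pct = 100 + diff * (int64_t)(x - low) / (int64_t)(high - low);

	return x * (uint64_t)pct / 100;
}
```

For example, with low = 100, high = 200 and a 50% factor, the midpoint
150 is scaled by 75%.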

v2:

- add 1/9, factoring out a generic version of the scaling function we
  already use for tcp_get_sndbuf(), as I'm now using it in 7/9 as well

- in 3/9, use 4 bits instead of 3 to represent the RTT, which is
  important as we now use RTT values for more than just the ACK checks

- in 5/9, instead of just comparing the sending buffer to SNDBUF_BIG
  to decide when to acknowledge data, use an adaptive approach based on
  the bandwidth-delay product

- in 6/9, clarify the relationship between SWS avoidance and Nagle's
  algorithm, and introduce a reference to RFC 813, Section 4

- in 7/9, use an adaptive approach based on the product of bytes sent
  (and acknowledged) so far and RTT, instead of the previous approach
  based on bytes sent only, as it allows us to converge to the expected
  throughput much quicker with high RTT destinations
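
For reference, the bandwidth-delay product mentioned above is simply
the data rate multiplied by the round-trip time. The helper below is
only the generic arithmetic, with a hypothetical name; it doesn't show
how the patches estimate or use this quantity:

```c
#include <stdint.h>

/* Bandwidth-delay product in bytes, given a rate in bits per second and
 * a round-trip time in microseconds.
 */
uint64_t bdp_bytes(uint64_t rate_bps, uint64_t rtt_us)
{
	return rate_bps / 8 * rtt_us / 1000000;
}
```

For instance, a 1 Gbps path with a 100 ms RTT needs 12.5 MB in flight
to be kept full, while 5 ms at the same rate needs 625 KB.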


Stefano Brivio (10):
  tcp, util: Add function for scaling to linearly interpolated factor,
    use it
  tcp: Change usage factor of sending buffer in tcp_get_sndbuf() to 75%
  tcp: Limit advertised window to available, not total sending buffer
    size
  tcp: Adaptive interval based on RTT for socket-side acknowledgement
    checks
  tcp: Don't clear ACK_TO_TAP_DUE if we're advertising a zero-sized
    window
  tcp: Acknowledge everything if it looks like bulk traffic, not
    interactive
  tcp: Don't limit window to less-than-MSS values, use zero instead
  tcp: Allow exceeding the available sending buffer size in window
    advertisements
  tcp: Send a duplicate ACK also on complete sendmsg() failure
  tcp: Skip redundant ACK on partial sendmsg() failure

 README.md  |   2 +-
 tcp.c      | 170 ++++++++++++++++++++++++++++++++++++++++++-----------
 tcp_conn.h |   9 +++
 util.c     |  52 ++++++++++++++++
 util.h     |   2 +
 5 files changed, 199 insertions(+), 36 deletions(-)

-- 
2.43.0

