From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top, Max Chernoff <git@maxchernoff.ca>
Subject: Re: [PATCH 6/8] tcp: Allow exceeding the available sending buffer size in window advertisements
Date: Mon, 8 Dec 2025 01:20:59 +0100
Message-ID: <20251208012059.36459e27@elisabeth>
In-Reply-To: <aTJEn_K_7G9SH0mY@zatzit>
On Fri, 5 Dec 2025 13:34:07 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Thu, Dec 04, 2025 at 08:45:39AM +0100, Stefano Brivio wrote:
> > ...under two conditions:
> >
> > - the remote peer is advertising a bigger value to us, meaning that a
> > bigger sending buffer is likely to benefit throughput, AND
>
> I think this condition is redundant: if the remote peer is advertising
> less, we'll clamp new_wnd_to_tap to that value anyway.
I almost fell for this. We have a subtractive term in the expression,
so the condition isn't actually redundant.

If the remote peer is advertising a smaller window, we just take the
buffer size *minus pending bytes* as the limit, which can be smaller
than the window advertised by the peer.

If it's advertising a bigger window, we take an increased buffer size
minus pending bytes as the limit, which can be bigger than the peer's
window, so we'll use the peer's window as the limit instead.

I added an example in v2 (now 7/9).
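
As a rough illustration of the two cases above (not the example from
v2), here is a minimal sketch of the limit computation followed by the
clamp to the peer's window. The helper name wnd_limit() and its
parameters are hypothetical, the two constants are copied from the
quoted patch, and the final clamp is assumed from the description of
new_wnd_to_tap in the quoted text:

```c
#define SHORT_CONN_BYTES	(16ULL * 1024 * 1024)	/* from the patch */
#define SNDBUF_BOOST_FACTOR	150			/* %, from the patch */

/* Hypothetical helper, for illustration only: returns the window we
 * would advertise given the sending buffer size, the bytes still
 * pending in it, the window advertised by the peer, and the bytes
 * acknowledged so far on the connection.
 */
static long long wnd_limit(long long sndbuf, long long sendq,
			   long long peer_wnd,
			   unsigned long long bytes_acked)
{
	long long limit;

	if (sendq > sndbuf)			/* Due to memory pressure? */
		limit = 0;
	else if (peer_wnd > sndbuf && bytes_acked > SHORT_CONN_BYTES)
		limit = sndbuf * SNDBUF_BOOST_FACTOR / 100 - sendq;
	else
		limit = sndbuf - sendq;

	/* Assumed final step: never exceed the peer's window */
	return peer_wnd < limit ? peer_wnd : limit;
}
```

With a 100000-byte buffer, 40000 bytes pending and a peer window of
80000, this gives min(80000, 100000 - 40000) = 60000. If the check on
the peer's window were dropped, the same inputs would give
min(80000, 150000 - 40000) = 80000, so the check does change the
result.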
> > - this is not a short-lived connection, where the latency cost of
> > retransmissions would be otherwise unacceptable.
> >
> > By doing this, we can reliably trigger TCP buffer size auto-tuning (as
> > long as it's available) on bulk data transfers.
> >
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> > tcp.c | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 2220059..454df69 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -353,6 +353,13 @@ enum {
> > #define LOW_RTT_TABLE_SIZE 8
> > #define LOW_RTT_THRESHOLD 10 /* us */
> >
> > +/* Try to avoid retransmissions to improve latency on short-lived connections */
> > +#define SHORT_CONN_BYTES (16ULL * 1024 * 1024)
> > +
> > +/* Temporarily exceed available sending buffer to force TCP auto-tuning */
> > +#define SNDBUF_BOOST_FACTOR 150 /* % */
> > +#define SNDBUF_BOOST(x) ((x) * SNDBUF_BOOST_FACTOR / 100)
>
> For the short term, the fact this works empirically is enough. For
> the longer term, it would be nice to have a better understanding of
> what this "overcommit" amount is actually estimating.
>
> I think what we're looking for is an estimate of the number of bytes
> that will have left the buffer by the time the guest gets back to us. So:
> <connection throughput> * <guest-side RTT>
I don't think we want the bandwidth-delay product here (which I'm now
using earlier in the series) because the purpose here is to grow the
buffer at the beginning of a connection, if it looks like bulk traffic.

So we want to progressively exploit auto-tuning as long as we're
limited by a small buffer, but not later. At some point we want to
finally switch to the window advertised by the peer.

I tried the bandwidth-delay product anyway, but it doesn't really help
with auto-tuning. It turns out that auto-tuning is fundamentally
different at the beginning anyway.
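
For reference, the estimate suggested in the quoted text (throughput
times RTT) boils down to the arithmetic below. This is only a sketch:
bdp_bytes() is a hypothetical helper, and the units (bytes per second,
microseconds) are assumptions chosen to match how struct tcp_info
reports rates and round-trip times on Linux:

```c
#include <stdint.h>

/* Hypothetical helper: bytes expected to leave the buffer during one
 * round trip, given a rate in bytes per second and an RTT in
 * microseconds.
 */
static uint64_t bdp_bytes(uint64_t rate_bytes_per_s, uint32_t rtt_us)
{
	return rate_bytes_per_s * rtt_us / 1000000;
}
```

For example, 125000000 bytes per second (1 Gbps) with a 1 ms RTT
corresponds to 125000 bytes.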
> Alas, I don't see a way to estimate either of those from the
> information we already track - we'd need additional bookkeeping.
It's all in struct tcp_info: the field is called tcpi_delivery_rate.
There are other interesting bits there, by the way, that could be used
in a further refinement.
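
A minimal sketch of reading that field with getsockopt(), assuming
Linux and the struct tcp_info definition from <linux/tcp.h>. The helper
names are hypothetical, and this is not how tcp.c obtains its tinfo
pointer. The length check is there because an older kernel may return a
shorter structure that doesn't include the field:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>

/* Hypothetical helper: does a TCP_INFO result of the given length
 * cover tcpi_delivery_rate?
 */
static int tinfo_has_delivery_rate(socklen_t len)
{
	return len >= offsetof(struct tcp_info, tcpi_delivery_rate) +
		      sizeof(((struct tcp_info *)0)->tcpi_delivery_rate);
}

/* Hypothetical helper: fetch the delivery rate, in bytes per second,
 * for TCP socket s. Returns 0 on success, -1 on failure.
 */
static int sock_delivery_rate(int s, uint64_t *rate)
{
	struct tcp_info tinfo;
	socklen_t len = sizeof(tinfo);

	memset(&tinfo, 0, sizeof(tinfo));
	if (getsockopt(s, IPPROTO_TCP, TCP_INFO, &tinfo, &len))
		return -1;

	if (!tinfo_has_delivery_rate(len))
		return -1;

	*rate = tinfo.tcpi_delivery_rate;
	return 0;
}
```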
> > #define ACK_IF_NEEDED 0 /* See tcp_send_flag() */
> >
> > #define CONN_IS_CLOSING(conn) \
> > @@ -1137,6 +1144,9 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> >
> > if ((int)sendq > SNDBUF_GET(conn)) /* Due to memory pressure? */
> > limit = 0;
> > + else if ((int)tinfo->tcpi_snd_wnd > SNDBUF_GET(conn) &&
> > + tinfo->tcpi_bytes_acked > SHORT_CONN_BYTES)
>
> This is pretty subtle, I think it would be worth having some rationale
> in a comment, not just the commit message.
I turned the macro into a new function and added comments there, in v2.
> > + limit = SNDBUF_BOOST(SNDBUF_GET(conn)) - (int)sendq;
> > else
> > limit = SNDBUF_GET(conn) - (int)sendq;
> >
> > --
> > 2.43.0
--
Stefano
Thread overview: 25+ messages
2025-12-04 7:45 [PATCH 0/8] tcp: Fix throughput issues with non-local peers Stefano Brivio
2025-12-04 7:45 ` [PATCH 1/8] tcp: Limit advertised window to available, not total sending buffer size Stefano Brivio
2025-12-04 23:10 ` David Gibson
2025-12-04 7:45 ` [PATCH 2/8] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks Stefano Brivio
2025-12-04 23:48 ` David Gibson
2025-12-05 1:20 ` Stefano Brivio
2025-12-05 2:49 ` David Gibson
2025-12-04 7:45 ` [PATCH 3/8] tcp: Don't clear ACK_TO_TAP_DUE if we're advertising a zero-sized window Stefano Brivio
2025-12-04 23:50 ` David Gibson
2025-12-04 7:45 ` [PATCH 4/8] tcp: Acknowledge everything if sending buffer is less than SNDBUF_BIG Stefano Brivio
2025-12-05 0:08 ` David Gibson
2025-12-05 1:20 ` Stefano Brivio
2025-12-05 2:50 ` David Gibson
2025-12-08 0:19 ` Stefano Brivio
2025-12-04 7:45 ` [PATCH 5/8] tcp: Don't limit window to less-than-MSS values, use zero instead Stefano Brivio
2025-12-05 0:35 ` David Gibson
2025-12-05 1:20 ` Stefano Brivio
2025-12-05 2:53 ` David Gibson
2025-12-04 7:45 ` [PATCH 6/8] tcp: Allow exceeding the available sending buffer size in window advertisements Stefano Brivio
2025-12-05 2:34 ` David Gibson
2025-12-08 0:20 ` Stefano Brivio [this message]
2025-12-04 7:45 ` [PATCH 7/8] tcp: Send a duplicate ACK also on complete sendmsg() failure Stefano Brivio
2025-12-05 2:35 ` David Gibson
2025-12-04 7:45 ` [PATCH 8/8] tcp: Skip redundant ACK on partial " Stefano Brivio
2025-12-05 2:36 ` David Gibson