Re: [PATCH 6/8] tcp: Allow exceeding the available sending buffer size in window advertisements

public inbox for passt-dev@passt.top

From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top, Max Chernoff <git@maxchernoff.ca>
Subject: Re: [PATCH 6/8] tcp: Allow exceeding the available sending buffer size in window advertisements
Date: Mon, 8 Dec 2025 01:20:59 +0100
Message-ID: <20251208012059.36459e27@elisabeth>
In-Reply-To: <aTJEn_K_7G9SH0mY@zatzit>

On Fri, 5 Dec 2025 13:34:07 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Dec 04, 2025 at 08:45:39AM +0100, Stefano Brivio wrote:
> > ...under two conditions:
> > 
> > - the remote peer is advertising a bigger value to us, meaning that a
> >   bigger sending buffer is likely to benefit throughput, AND  
> 
> I think this condition is redundant: if the remote peer is advertising
> less, we'll clamp new_wnd_to_tap to that value anyway.

I almost fell for this. We have a subtractive term in the expression,
so the condition isn't actually redundant.

If the remote peer is advertising a smaller window, we just take the
buffer size *minus pending bytes* as the limit, which can be smaller
than the window advertised by the peer.

If it's advertising a bigger window, we take an increased buffer size
minus pending bytes as the limit, which can be bigger than the peer's
window, in which case we'll use the peer's window as the limit instead.

I added an example in v2 (now 7/9).
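
To make the two cases above concrete, here is a minimal sketch of the
arithmetic. It is not the code from the patch: window_limit() and its
parameters are hypothetical names, the 150% factor is copied from
SNDBUF_BOOST_FACTOR in the quoted diff, and the SHORT_CONN_BYTES check
is left out to keep the comparison with the peer's window visible.

```c
#include <stdint.h>

#define SNDBUF_BOOST_FACTOR	150 /* %, same value as in the quoted patch */

/* Hypothetical helper, for illustration only: window we would advertise,
 * given the sending buffer size, the bytes still pending in it, and the
 * window advertised by the remote peer (all in bytes).
 */
uint32_t window_limit(uint32_t sndbuf, uint32_t sendq, uint32_t peer_wnd)
{
	uint64_t limit;

	if (sendq > sndbuf)		/* More pending than buffer: stop */
		return 0;

	if (peer_wnd > sndbuf)		/* Peer could take more: boost */
		limit = (uint64_t)sndbuf * SNDBUF_BOOST_FACTOR / 100;
	else				/* Peer is the bottleneck: no boost */
		limit = sndbuf;

	limit -= sendq;			/* The subtractive term */

	/* Never advertise more than the peer's own window */
	return limit < peer_wnd ? (uint32_t)limit : peer_wnd;
}
```

With a 100000-byte buffer and 40000 bytes pending, a peer window of
80000 gives 60000, which is below the peer's window. Boosting
unconditionally would give 110000, clamped to 80000, so the condition
does change the result. With a peer window of 105000, the boosted value
(110000) exceeds it and the peer's window is used instead.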

> > - this is not a short-lived connection, where the latency cost of
> >   retransmissions would be otherwise unacceptable.
> > 
> > By doing this, we can reliably trigger TCP buffer size auto-tuning (as
> > long as it's available) on bulk data transfers.
> > 
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> >  tcp.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/tcp.c b/tcp.c
> > index 2220059..454df69 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -353,6 +353,13 @@ enum {
> >  #define LOW_RTT_TABLE_SIZE		8
> >  #define LOW_RTT_THRESHOLD		10 /* us */
> >  
> > +/* Try to avoid retransmissions to improve latency on short-lived connections */
> > +#define SHORT_CONN_BYTES		(16ULL * 1024 * 1024)
> > +
> > +/* Temporarily exceed available sending buffer to force TCP auto-tuning */
> > +#define SNDBUF_BOOST_FACTOR		150 /* % */
> > +#define SNDBUF_BOOST(x)			((x) * SNDBUF_BOOST_FACTOR / 100)  
> 
> For the short term, the fact this works empirically is enough.  For
> the longer term, it would be nice to have a better understanding of
> what this "overcommit" amount is actually estimating.
> 
> I think what we're looking for is an estimate of the number of bytes
> that will have left the buffer by the time the guest gets back to us.  So:
> 	<connection throughput> * <guest-side RTT>

I don't think we want the bandwidth-delay product here (which I'm now
using earlier in the series) because the purpose here is to grow the
buffer at the beginning of a connection, if it looks like bulk traffic.

So we want to progressively exploit auto-tuning as long as we're
limited by a small buffer, but not later. At some point we want to
finally switch to the window advertised by the peer.

Well, I tried with the bandwidth-delay product in any case, but it's
not really helping with auto-tuning. It turns out that auto-tuning is
fundamentally different at the beginning anyway.

> Alas, I don't see a way to estimate either of those from the
> information we already track - we'd need additional bookkeeping.

It's all in struct tcp_info: the delivery rate is tcpi_delivery_rate.
There are other interesting bits there, by the way, that could be used
in a further refinement.
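
For reference, the <connection throughput> * <RTT> estimate discussed
above could be computed along these lines. This is only a sketch of the
formula, not something from the series: bytes_per_rtt() is a
hypothetical name, the rate is assumed to be in bytes per second (the
unit of tcpi_delivery_rate), and the RTT is passed in as microseconds
without assuming where it's measured.

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only: bytes expected to leave the
 * sending buffer during one round trip.
 *
 * @rate:	Delivery rate in bytes per second, e.g. tcpi_delivery_rate
 * @rtt_us:	Round-trip time in microseconds (source left open here)
 */
uint64_t bytes_per_rtt(uint64_t rate, uint32_t rtt_us)
{
	/* Split the rate so that rate * rtt_us can't overflow 64 bits:
	 * rate = q * 1000000 + r, hence
	 * rate * rtt_us / 1000000 = q * rtt_us + r * rtt_us / 1000000
	 */
	uint64_t q = rate / 1000000;
	uint64_t r = rate % 1000000;

	return q * rtt_us + r * rtt_us / 1000000;
}
```

For example, 125000000 bytes per second (1 Gbps) with a 20 ms RTT gives
2500000 bytes.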

> >  #define ACK_IF_NEEDED	0		/* See tcp_send_flag() */
> >  
> >  #define CONN_IS_CLOSING(conn)						\
> > @@ -1137,6 +1144,9 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> >  
> >  		if ((int)sendq > SNDBUF_GET(conn)) /* Due to memory pressure? */
> >  			limit = 0;
> > +		else if ((int)tinfo->tcpi_snd_wnd > SNDBUF_GET(conn) &&
> > +			 tinfo->tcpi_bytes_acked > SHORT_CONN_BYTES)  
> 
> This is pretty subtle, I think it would be worth having some rationale
> in a comment, not just the commit message.

I turned the macro into a new function and added comments there, in v2.

> > +			limit = SNDBUF_BOOST(SNDBUF_GET(conn)) - (int)sendq;
> >  		else
> >  			limit = SNDBUF_GET(conn) - (int)sendq;
> >  
> > -- 
> > 2.43.0

-- 
Stefano


Thread overview: 25+ messages
2025-12-04  7:45 [PATCH 0/8] tcp: Fix throughput issues with non-local peers Stefano Brivio
2025-12-04  7:45 ` [PATCH 1/8] tcp: Limit advertised window to available, not total sending buffer size Stefano Brivio
2025-12-04 23:10   ` David Gibson
2025-12-04  7:45 ` [PATCH 2/8] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks Stefano Brivio
2025-12-04 23:48   ` David Gibson
2025-12-05  1:20     ` Stefano Brivio
2025-12-05  2:49       ` David Gibson
2025-12-04  7:45 ` [PATCH 3/8] tcp: Don't clear ACK_TO_TAP_DUE if we're advertising a zero-sized window Stefano Brivio
2025-12-04 23:50   ` David Gibson
2025-12-04  7:45 ` [PATCH 4/8] tcp: Acknowledge everything if sending buffer is less than SNDBUF_BIG Stefano Brivio
2025-12-05  0:08   ` David Gibson
2025-12-05  1:20     ` Stefano Brivio
2025-12-05  2:50       ` David Gibson
2025-12-08  0:19         ` Stefano Brivio
2025-12-04  7:45 ` [PATCH 5/8] tcp: Don't limit window to less-than-MSS values, use zero instead Stefano Brivio
2025-12-05  0:35   ` David Gibson
2025-12-05  1:20     ` Stefano Brivio
2025-12-05  2:53       ` David Gibson
2025-12-04  7:45 ` [PATCH 6/8] tcp: Allow exceeding the available sending buffer size in window advertisements Stefano Brivio
2025-12-05  2:34   ` David Gibson
2025-12-08  0:20     ` Stefano Brivio [this message]
2025-12-04  7:45 ` [PATCH 7/8] tcp: Send a duplicate ACK also on complete sendmsg() failure Stefano Brivio
2025-12-05  2:35   ` David Gibson
2025-12-04  7:45 ` [PATCH 8/8] tcp: Skip redundant ACK on partial " Stefano Brivio
2025-12-05  2:36   ` David Gibson
