Date: Mon, 8 Dec 2025 08:45:59 +0100
From: Stefano Brivio
To: David Gibson
CC: passt-dev@passt.top, Max Chernoff
Subject: Re: [PATCH v2 7/9] tcp: Allow exceeding the available sending buffer size in window advertisements
Message-ID: <20251208084559.2029cc8b@elisabeth>
References: <20251208002229.391162-1-sbrivio@redhat.com> <20251208002229.391162-8-sbrivio@redhat.com>
Organization: Red Hat
List-Id: Development discussion and patches for passt

On Mon, 8 Dec 2025 17:25:02 +1100
David Gibson wrote:

> On Mon, Dec 08, 2025 at 01:22:15AM +0100, Stefano Brivio wrote:
> > If the remote peer is advertising a bigger value than our current
> > sending buffer, it means that a bigger sending buffer is likely to
> > benefit throughput.
> >
> > We can get a bigger sending buffer by means of the buffer size
> > auto-tuning performed by the Linux kernel, which is triggered by
> > aggressively filling the sending buffer.
> >
> > Use an adaptive boost factor, up to 150%, depending on:
> >
> > - how much data we sent so far: we don't want to risk retransmissions
> >   for short-lived connections, as the latency cost would be
> >   unacceptable, and
> >
> > - the current RTT value, as we need a bigger buffer for higher
> >   transmission delays
> >
> > The factor we use is not quite a bandwidth-delay product, as we're
> > missing the time component of the bandwidth, which is not interesting
> > here: we are trying to make the buffer grow at the beginning of a
> > connection, progressively, as more data is sent.
> >
> > The tuning of the amount of boost factor we want to apply was done
> > somewhat empirically but it appears to yield the available throughput
> > in rather different scenarios (from ~10 Gbps bandwidth with 500 ns to
> > ~1 Gbps with 300 ms RTT) and it allows getting there rather quickly,
> > within a few seconds for the 300 ms case.
> >
> > Note that we want to apply this factor only if the window advertised
> > by the peer is bigger than the current sending buffer, as we only need
> > this for auto-tuning, and we absolutely don't want to incur
> > unnecessary retransmissions otherwise.
> >
> > The related condition in tcp_update_seqack_wnd() is not redundant as
> > there's a subtractive factor, sendq, in the calculation of the window
> > limit. If the sending buffer is smaller than the peer's advertised
> > window, the additional limit we might apply might be lower than we
> > would do otherwise.
> >
> > Assuming that the sending buffer is reported as 100k, sendq is
> > 20k, we could have these example cases:
> >
> > 1. tinfo->tcpi_snd_wnd is 120k, which is bigger than the sending
> >    buffer, so we boost its size to 150k, and we limit the window
> >    to 120k
> >
> > 2. tinfo->tcpi_snd_wnd is 90k, which is smaller than the sending
> >    buffer, so we aren't trying to trigger buffer auto-tuning and
> >    we'll stick to the existing, more conservative calculation,
> >    by limiting the window to 100 - 20 = 80k
> >
> > If we omitted the new condition, we would always use the boosted
> > value, that is, 120k, even if potentially causing unnecessary
> > retransmissions.
> >
> > Signed-off-by: Stefano Brivio
> > ---
> >  tcp.c | 38 ++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 38 insertions(+)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 3c046a5..60a9687 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -353,6 +353,13 @@ enum {
> >  #define LOW_RTT_TABLE_SIZE	8
> >  #define LOW_RTT_THRESHOLD	10 /* us */
> >
> > +/* Parameters to temporarily exceed sending buffer to force TCP auto-tuning */
> > +#define SNDBUF_BOOST_BYTES_RTT_LO	2500	/* B * s: no boost until here */
> > +/* ...examples: 5 MB sent * 500 ns RTT, 250 kB * 10 ms, 8 kB * 300 ms */
> > +#define SNDBUF_BOOST_FACTOR		150	/* % */
> > +#define SNDBUF_BOOST_BYTES_RTT_HI	6000	/* apply full boost factor */
> > +/* 12 MB sent * 500 ns RTT, 600 kB * 10 ms, 20 kB * 300 ms */
> > +
> >  /* Ratio of buffer to bandwidth * delay product implying interactive traffic */
> >  #define SNDBUF_TO_BW_DELAY_INTERACTIVE /* > */ 20 /* (i.e. < 5% of buffer) */
> >
> > @@ -1033,6 +1040,35 @@ void tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
> >  	tap_hdr_update(taph, MAX(l3len + sizeof(struct ethhdr), ETH_ZLEN));
> >  }
> >
> > +/**
> > + * tcp_sndbuf_boost() - Calculate limit of sending buffer to force auto-tuning
> > + * @conn:	Connection pointer
> > + * @tinfo:	tcp_info from kernel, must be pre-fetched
> > + *
> > + * Return: increased sending buffer to use as a limit for advertised window
> > + */
> > +static unsigned long tcp_sndbuf_boost(struct tcp_tap_conn *conn,
> > +				      struct tcp_info_linux *tinfo)
> > +{
> > +	unsigned long bytes_rtt_product;
> > +
> > +	if (!bytes_acked_cap)
> > +		return SNDBUF_GET(conn);
> > +
> > +	/* This is *not* a bandwidth-delay product, but it's somewhat related:
> > +	 * as we send more data (usually at the beginning of a connection), we
> > +	 * try to make the sending buffer progressively grow, with the RTT as a
> > +	 * factor (longer delay, bigger buffer needed).
> > +	 */
> > +	bytes_rtt_product = (long long)tinfo->tcpi_bytes_acked *
> > +			    tinfo->tcpi_rtt / 1000 / 1000;
>
> I only half follow the reasoning in the commit message, but this
> doesn't seem quite right to me. Assuming the RTT is roughly fixed, as
> you'd expect, this will always trend to infinity for long-lived
> connections - regardless of whether they're high throughput or
> interactive. So, we'll always trend towards using 150% of the send
> buffer size.

Yes, that's intended: we want to keep that 150% in the unlikely case
that we don't switch to having a buffer exceeding the peer's advertised
window.
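To make the intent a bit more concrete, the ramp I have in mind looks
roughly like the sketch below. This is only an illustration of the
behaviour I expect from scale_x_to_y_slope() at this call site, not its
actual implementation, and boost_sketch() is just a made-up name; the
SNDBUF_BOOST_* macros are the ones from the hunk above:

/* Illustrative sketch only: assumed shape of the boost ramp */
static unsigned long boost_sketch(unsigned long sndbuf,
				  unsigned long bytes_rtt_product)
{
	unsigned long pct;

	if (bytes_rtt_product <= SNDBUF_BOOST_BYTES_RTT_LO)
		return sndbuf;		/* no boost yet */

	if (bytes_rtt_product >= SNDBUF_BOOST_BYTES_RTT_HI)
		return sndbuf * SNDBUF_BOOST_FACTOR / 100;	/* full 150% */

	/* Linear ramp from 100% to SNDBUF_BOOST_FACTOR between thresholds */
	pct = 100 + (SNDBUF_BOOST_FACTOR - 100) *
		    (bytes_rtt_product - SNDBUF_BOOST_BYTES_RTT_LO) /
		    (SNDBUF_BOOST_BYTES_RTT_HI - SNDBUF_BOOST_BYTES_RTT_LO);

	return sndbuf * pct / 100;
}

That is, with a 100 kB sending buffer, the limit stays at 100 kB until
bytes_acked times RTT reaches the low threshold, then grows linearly up
to 150 kB at the high threshold, and stays there for the rest of the
connection.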
> > +	return scale_x_to_y_slope(SNDBUF_GET(conn), bytes_rtt_product,
> > +				  SNDBUF_BOOST_BYTES_RTT_LO,
> > +				  SNDBUF_BOOST_BYTES_RTT_HI,
> > +				  SNDBUF_BOOST_FACTOR);
> > +}
> > +
> >  /**
> >   * tcp_update_seqack_wnd() - Update ACK sequence and window to guest/tap
> >   * @c:		Execution context
> > @@ -1152,6 +1188,8 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> >
> >  		if ((int)sendq > SNDBUF_GET(conn))	/* Due to memory pressure? */
> >  			limit = 0;
> > +		else if ((int)tinfo->tcpi_snd_wnd > SNDBUF_GET(conn))
> > +			limit = tcp_sndbuf_boost(conn, tinfo) - (int)sendq;
>
> Now that 5/9 has pointed out to me the existence of
> tcpi_delivery_rate, would it make more sense to do a
>
>   limit += tcpi_delivery_rate * rtt;
>
> The idea being to allow the guest to send as much as the receiver can
> accommodate itself, plus as much as we can fit "in the air" between us
> and the peer.

I tried using the bandwidth-delay product (what you're suggesting to
add), but it turned out that we clearly need to skip the time component
of the bandwidth, as we really need to use "data so far" rather than
"data in flight".

-- 
Stefano