Re: [PATCH v2 5/9] tcp: Acknowledge everything if it looks like bulk traffic, not interactive

From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top, Max Chernoff <git@maxchernoff.ca>
Subject: Re: [PATCH v2 5/9] tcp: Acknowledge everything if it looks like bulk traffic, not interactive
Date: Mon, 8 Dec 2025 08:25:29 +0100
Message-ID: <20251208082529.6ce78f65@elisabeth>
In-Reply-To: <aTZoL29AKav1udzA@zatzit>

On Mon, 8 Dec 2025 16:54:55 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Mon, Dec 08, 2025 at 01:22:13AM +0100, Stefano Brivio wrote:
> > ...instead of checking if the current sending buffer is less than
> > SNDBUF_SMALL, because this isn't simply an optimisation to coalesce
> > ACK segments: we rely on having enough data at once from the sender
> > to make the buffer grow by means of TCP buffer size tuning
> > implemented in the Linux kernel.
> > 
> > This is important if we're trying to maximise throughput, but not
> > desirable for interactive traffic, where we want to be as transparent
> > as possible and avoid introducing unnecessary latency.
> > 
> > Use the tcpi_delivery_rate field reported by the Linux kernel, if
> > available, to calculate the current bandwidth-delay product: if it's
> > significantly smaller than the available sending buffer, conclude that
> > we're not bandwidth-bound and this is likely to be interactive
> > traffic, so acknowledge data only as it's acknowledged by the peer.
> > 
> > Conversely, if the bandwidth-delay product is comparable to the size
> > of the sending buffer (more than 5%), we're probably bandwidth-bound
> > or... bound to be: acknowledge everything in that case.  
> 
> Ah, nice.  This reasoning is much clearer to me than the previous
> spin.
> 
> > 
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> >  tcp.c | 45 +++++++++++++++++++++++++++++++++------------
> >  1 file changed, 33 insertions(+), 12 deletions(-)
> > 
> > diff --git a/tcp.c b/tcp.c
> > index 9bf7b8b..533c8a7 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -353,6 +353,9 @@ enum {
> >  #define LOW_RTT_TABLE_SIZE		8
> >  #define LOW_RTT_THRESHOLD		10 /* us */
> >  
> > +/* Ratio of buffer to bandwidth * delay product implying interactive traffic */
> > +#define SNDBUF_TO_BW_DELAY_INTERACTIVE	/* > */ 20 /* (i.e. < 5% of buffer) */
> > +
> >  #define ACK_IF_NEEDED	0		/* See tcp_send_flag() */
> >  
> >  #define CONN_IS_CLOSING(conn)						\
> > @@ -426,11 +429,13 @@ socklen_t tcp_info_size;
> >  	  sizeof(((struct tcp_info_linux *)NULL)->tcpi_##f_)) <= tcp_info_size)
> >  
> >  /* Kernel reports sending window in TCP_INFO (kernel commit 8f7baad7f035) */
> > -#define snd_wnd_cap	tcp_info_cap(snd_wnd)
> > +#define snd_wnd_cap		tcp_info_cap(snd_wnd)
> >  /* Kernel reports bytes acked in TCP_INFO (kernel commit 0df48c26d84) */
> > -#define bytes_acked_cap	tcp_info_cap(bytes_acked)
> > +#define bytes_acked_cap		tcp_info_cap(bytes_acked)
> >  /* Kernel reports minimum RTT in TCP_INFO (kernel commit cd9b266095f4) */
> > -#define min_rtt_cap	tcp_info_cap(min_rtt)
> > +#define min_rtt_cap		tcp_info_cap(min_rtt)
> > +/* Kernel reports delivery rate in TCP_INFO (kernel commit eb8329e0a04d) */
> > +#define delivery_rate_cap	tcp_info_cap(delivery_rate)
> >  
> >  /* sendmsg() to socket */
> >  static struct iovec	tcp_iov			[UIO_MAXIOV];
> > @@ -1048,6 +1053,7 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> >  	socklen_t sl = sizeof(*tinfo);
> >  	struct tcp_info_linux tinfo_new;
> >  	uint32_t new_wnd_to_tap = prev_wnd_to_tap;
> > +	bool ack_everything = true;
> >  	int s = conn->sock;
> >  
> >  	/* At this point we could ack all the data we've accepted for forwarding
> > @@ -1057,7 +1063,8 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> >  	 * control behaviour.
> >  	 *
> >  	 * For it to be possible and worth it we need:
> > -	 *  - The TCP_INFO Linux extension which gives us the peer acked bytes
> > +	 *  - The TCP_INFO Linux extensions which give us the peer acked bytes
> > +	 *    and the delivery rate (outbound bandwidth at receiver)
> >  	 *  - Not to be told not to (force_seq)
> >  	 *  - Not half-closed in the peer->guest direction
> >  	 *      With no data coming from the peer, we might not get events which
> > @@ -1067,19 +1074,36 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> >  	 *      Data goes from socket to socket, with nothing meaningfully "in
> >  	 *      flight".
> >  	 *  - Not a pseudo-local connection (e.g. to a VM on the same host)
> > -	 *  - Large enough send buffer
> > -	 *      In these cases, there's not enough in flight to bother.
> > +	 *      If it is, there's not enough in flight to bother.
> > +	 *  - Sending buffer significantly larger than bandwidth * delay product
> > +	 *      Meaning we're not bandwidth-bound and this is likely to be
> > +	 *      interactive traffic where we want to preserve transparent
> > +	 *      connection behaviour and latency.  
> 
> Do we actually want the sending buffer size here?  Or the amount of
> buffer that's actually in use (SIOCOUTQ)?  If we had a burst transfer
> followed by interactive traffic, the kernel could still have a large
> send buffer allocated, no?

The kernel shrinks it rather fast, and if it's not fast enough, then it
still looks like bulk traffic. I tried several metrics (including
something based on the data just sent, which approximates SIOCOUTQ),
but none of them worked as well as the current buffer size.
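
For reference, the two metrics being compared can be read from a socket
roughly as follows. This is only a sketch, not code from the patch:
struct sndbuf_state and sndbuf_query() are hypothetical names, and it
assumes that the "buffer size" corresponds to SO_SNDBUF, with SIOCOUTQ
(Linux-specific) giving the bytes still queued for sending.

```c
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/sockios.h>

/* Hypothetical container for the two candidate metrics */
struct sndbuf_state {
	int size;	/* SO_SNDBUF: sending buffer size set by the kernel */
	int queued;	/* SIOCOUTQ: bytes in the send queue, not yet acked */
};

/* Hypothetical helper: fill @st for socket @s, return 0 on success */
static int sndbuf_query(int s, struct sndbuf_state *st)
{
	socklen_t sl = sizeof(st->size);

	if (getsockopt(s, SOL_SOCKET, SO_SNDBUF, &st->size, &sl))
		return -1;

	if (ioctl(s, SIOCOUTQ, &st->queued))
		return -1;

	return 0;
}
```

After a burst, 'queued' drops as soon as the peer acknowledges data,
while 'size' only follows once the kernel shrinks the buffer.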

> > +	 *
> > +	 *      Otherwise, we probably want to maximise throughput, which needs
> > +	 *      sending buffer auto-tuning, triggered in turn by filling up the
> > +	 *      outbound socket queue.
> >  	 */
> > -	if (bytes_acked_cap && !force_seq &&
> > +	if (bytes_acked_cap && delivery_rate_cap && !force_seq &&
> >  	    !CONN_IS_CLOSING(conn) &&
> > -	    !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn) &&
> > -	    (unsigned)SNDBUF_GET(conn) >= SNDBUF_SMALL) {
> > +	    !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn)) {
> >  		if (!tinfo) {
> >  			tinfo = &tinfo_new;
> >  			if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
> >  				return 0;
> >  		}
> >  
> > +		if ((unsigned)SNDBUF_GET(conn) > (long long)RTT_GET(conn) *  
> 
> Using RTT_GET seems odd here, since we just got a more up to date and
> precise RTT estimate in tinfo.

Oops, right, fixed.
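
To spell out the arithmetic behind the check, here is a standalone
sketch. It is not the code from the patch: bdp_bytes() and
looks_interactive() are hypothetical names, and it assumes the inputs
come from TCP_INFO, with tcpi_delivery_rate in bytes per second and
tcpi_rtt in microseconds. Only the ratio of 20 is taken from
SNDBUF_TO_BW_DELAY_INTERACTIVE above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same value as SNDBUF_TO_BW_DELAY_INTERACTIVE in the quoted patch */
#define BUF_TO_BDP_INTERACTIVE	20

/* Hypothetical helper: bandwidth-delay product in bytes, from a delivery
 * rate in bytes per second and a round-trip time in microseconds. The
 * product fits in 64 bits for any realistic rate and RTT.
 */
static uint64_t bdp_bytes(uint64_t rate, uint32_t rtt_us)
{
	return rate * rtt_us / 1000000;
}

/* Hypothetical helper: true if the sending buffer is more than 20 times
 * the bandwidth-delay product, that is, the product is below 5% of the
 * buffer and we're most likely not bandwidth-bound.
 */
static bool looks_interactive(uint32_t sndbuf, uint64_t rate, uint32_t rtt_us)
{
	return (uint64_t)sndbuf > bdp_bytes(rate, rtt_us) * BUF_TO_BDP_INTERACTIVE;
}
```

For example, at 1 MB/s and 1 ms RTT the product is 1000 bytes, so a
208 KiB buffer counts as interactive. At 125 MB/s and 20 ms RTT the
product is 2.5 MB, so a 4 MiB buffer counts as bulk traffic.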

-- 
Stefano

Thread overview: 25+ messages
2025-12-08  0:22 [PATCH v2 0/9] tcp: Fix throughput issues with non-local peers Stefano Brivio
2025-12-08  0:22 ` [PATCH v2 1/9] tcp, util: Add function for scaling to linearly interpolated factor, use it Stefano Brivio
2025-12-08  5:33   ` David Gibson
2025-12-08  0:22 ` [PATCH v2 2/9] tcp: Limit advertised window to available, not total sending buffer size Stefano Brivio
2025-12-08  0:22 ` [PATCH v2 3/9] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks Stefano Brivio
2025-12-08  5:41   ` David Gibson
2025-12-08  7:22     ` Stefano Brivio
2025-12-08  8:28       ` David Gibson
2025-12-08  0:22 ` [PATCH v2 4/9] tcp: Don't clear ACK_TO_TAP_DUE if we're advertising a zero-sized window Stefano Brivio
2025-12-08  0:22 ` [PATCH v2 5/9] tcp: Acknowledge everything if it looks like bulk traffic, not interactive Stefano Brivio
2025-12-08  5:54   ` David Gibson
2025-12-08  7:25     ` Stefano Brivio [this message]
2025-12-08  8:31       ` David Gibson
2025-12-08  0:22 ` [PATCH v2 6/9] tcp: Don't limit window to less-than-MSS values, use zero instead Stefano Brivio
2025-12-08  6:43   ` David Gibson
2025-12-08  8:11     ` Stefano Brivio
2025-12-13 14:24       ` Stefano Brivio
2025-12-15  2:03         ` David Gibson
2025-12-08  0:22 ` [PATCH v2 7/9] tcp: Allow exceeding the available sending buffer size in window advertisements Stefano Brivio
2025-12-08  6:25   ` David Gibson
2025-12-08  7:45     ` Stefano Brivio
2025-12-08  0:22 ` [PATCH v2 8/9] tcp: Send a duplicate ACK also on complete sendmsg() failure Stefano Brivio
2025-12-08  0:22 ` [PATCH v2 9/9] tcp: Skip redundant ACK on partial " Stefano Brivio
2025-12-08  6:46 ` [PATCH v2 0/9] tcp: Fix throughput issues with non-local peers David Gibson
2025-12-08  8:22   ` Stefano Brivio
