Re: [PATCH v3 04/10] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks

From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top, Max Chernoff <git@maxchernoff.ca>
Subject: Re: [PATCH v3 04/10] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks
Date: Tue, 9 Dec 2025 23:49:50 +0100
Message-ID: <20251209234950.53067946@elisabeth>
In-Reply-To: <aTevTHB6XWn9qLpc@zatzit>

On Tue, 9 Dec 2025 16:10:36 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Mon, Dec 08, 2025 at 08:20:17AM +0100, Stefano Brivio wrote:
> > A fixed 10 ms ACK_INTERVAL timer value served us relatively well until
> > the previous change, because we would generally cause retransmissions
> > for non-local outbound transfers with relatively high (> 100 Mbps)
> > bandwidth and non-local but low (< 5 ms) RTT.
> > 
> > Now that retransmissions are less frequent, we don't have a proper
> > trigger to check for acknowledged bytes on the socket, and will
> > generally block the sender for a significant amount of time while
> > we could acknowledge more data, instead.
> > 
> > Store the RTT reported by the kernel using an approximation (exponent),
> > to keep flow storage size within two (typical) cachelines. Check for
> > socket updates when half of this time elapses: it should be a good
> > indication of the one-way delay we're interested in (peer to us).
> > 
> > Representable values are between 100 us and 3.2768 s, and any value
> > outside this range is clamped to these bounds. This choice appears
> > to be a good trade-off between additional overhead and throughput.
> > 
> > This mechanism partially overlaps with the "low RTT" destinations,
> > which we use to infer that a socket is connected to an endpoint to
> > the same machine (while possibly in a different namespace) if the
> > RTT is reported as 10 us or less.
> > 
> > This change doesn't, however, conflict with it: we are reading
> > TCP_INFO parameters for local connections anyway, so we can always
> > store the RTT approximation opportunistically.
> > 
> > Then, if the RTT is "low", we don't really need a timer to
> > acknowledge data as we'll always acknowledge everything to the
> > sender right away. However, we have limited space in the array where
> > we store addresses of local destination, so the low RTT property of a
> > connection might toggle frequently. Because of this, it's actually
> > helpful to always have the RTT approximation stored.
> > 
> > This could probably benefit from a future rework, though, introducing
> > a more integrated approach between these two mechanisms.
> > 
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> >  tcp.c      | 30 +++++++++++++++++++++++-------
> >  tcp_conn.h |  9 +++++++++
> >  util.c     | 14 ++++++++++++++
> >  util.h     |  1 +
> >  4 files changed, 47 insertions(+), 7 deletions(-)
> > 
> > diff --git a/tcp.c b/tcp.c
> > index 28d3304..0167121 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -202,9 +202,13 @@
> >   * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
> >   *   either side, the connection is reset
> >   *
> > - * - ACK_INTERVAL elapsed after data segment received from tap without having
> > + * - RTT / 2 elapsed after data segment received from tap without having
> >   *   sent an ACK segment, or zero-sized window advertised to tap/guest (flag
> > - *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent
> > + *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent.
> > + *
> > + *   RTT, here, is an approximation of the RTT value reported by the kernel via
> > + *   TCP_INFO, with a representable range from RTT_STORE_MIN (100 us) to
> > + *   RTT_STORE_MAX (3276.8 ms). The timeout value is clamped accordingly.
> >   *
> >   *
> >   * Summary of data flows (with ESTABLISHED event)
> > @@ -341,7 +345,6 @@ enum {
> >  #define MSS_DEFAULT			536
> >  #define WINDOW_DEFAULT			14600		/* RFC 6928 */
> >  
> > -#define ACK_INTERVAL			10		/* ms */
> >  #define RTO_INIT			1		/* s, RFC 6298 */
> >  #define RTO_INIT_AFTER_SYN_RETRIES	3		/* s, RFC 6298 */
> >  #define FIN_TIMEOUT			60
> > @@ -593,7 +596,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> >  	}
> >  
> >  	if (conn->flags & ACK_TO_TAP_DUE) {
> > -		it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > +		it.it_value.tv_sec = RTT_GET(conn) / 2 / (1000 * 1000);
> > +		it.it_value.tv_nsec = RTT_GET(conn) / 2 % (1000 * 1000) * 1000;
> >  	} else if (conn->flags & ACK_FROM_TAP_DUE) {
> >  		int exp = conn->retries, timeout = RTO_INIT;
> >  		if (!(conn->events & ESTABLISHED))
> > @@ -608,9 +612,17 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> >  		it.it_value.tv_sec = ACT_TIMEOUT;
> >  	}
> >  
> > -	flow_dbg(conn, "timer expires in %llu.%03llus",
> > -		 (unsigned long long)it.it_value.tv_sec,
> > -		 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
> > +	if (conn->flags & ACK_TO_TAP_DUE) {
> > +		flow_trace(conn, "timer expires in %llu.%03llums",
> > +			   (unsigned long)it.it_value.tv_sec * 1000 +
> > +			   (unsigned long long)it.it_value.tv_nsec %
> > +					       (1000 * 1000),
> > +			   (unsigned long long)it.it_value.tv_nsec / 1000);
> 
> This is the wrong way around.  The ms part needs to be:

Ouch, I... half swapped them. Almost. This risks haunting us for a bit.

Luckily we have discrete values there so I could make this small table
for --trace logs from the broken implementation:

#include <stdio.h>

/* Print what the broken and the fixed format show for each stored RTT (us) */
int main(void)
{
	unsigned long rtt, n, s;

	printf("before fix\t\tafter fix\tRTT\n");
	for (rtt = 100; rtt < 100 << 16; rtt <<= 1) {
		s = rtt / 2 / (1000 * 1000);		/* tv_sec */
		n = rtt / 2 % (1000 * 1000) * 1000;	/* tv_nsec */

		printf("%lu.%03lu\t\t%lu.%02lu\t\t%lu\n",
		       s * 1000 + n % (1000 * 1000), n / 1000,
		       s * 1000 + n / (1000 * 1000), (n / 10000) % 100,
		       rtt);
	}

	return 0;
}

before fix		after fix	RTT
50000.050		0.05		100
100000.100		0.10		200
200000.200		0.20		400
400000.400		0.40		800
800000.800		0.80		1600
600000.1600		1.60		3200
200000.3200		3.20		6400
400000.6400		6.40		12800
800000.12800		12.80		25600
600000.25600		25.60		51200
200000.51200		51.20		102400
400000.102400		102.40		204800
800000.204800		204.80		409600
600000.409600		409.60		819200
200000.819200		819.20		1638400
401000.638400		1638.40		3276800

> 	tv_sec * 1000 + tv_nsec / 1000000
> and the fractional (us) part:
> 	(tv_nsec / 1000) % 1000
> 
> (or if you did just want a single digit after the ., then:
> 	tv_nsec / 100000 % 10
> )

I guess I'll display two digits instead: I never had a ~100 us RTT in
my tests, so I didn't notice until now that the resulting timeout would
be displayed as 0.0ms with one fractional digit.

Thanks for the snippet, I'll post a patch soon (of course, feel free to
do so meanwhile if you already have a local change...).

-- 
Stefano
