public inbox for passt-dev@passt.top
From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top, Max Chernoff <git@maxchernoff.ca>
Subject: Re: [PATCH v2 3/9] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks
Date: Mon, 8 Dec 2025 08:22:12 +0100
Message-ID: <20251208082212.5d2abb50@elisabeth>
In-Reply-To: <aTZlAbzwPAtSai8k@zatzit>

On Mon, 8 Dec 2025 16:41:21 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Mon, Dec 08, 2025 at 01:22:11AM +0100, Stefano Brivio wrote:
> > A fixed 10 ms ACK_INTERVAL timer value served us relatively well until
> > the previous change, because we would generally cause retransmissions
> > for non-local outbound transfers with relatively high (> 100 Mbps)
> > bandwidth and non-local but low (< 5 ms) RTT.
> > 
> > Now that retransmissions are less frequent, we don't have a proper
> > trigger to check for acknowledged bytes on the socket, and will
> > generally block the sender for a significant amount of time while
> > we could acknowledge more data, instead.
> > 
> > Store the RTT reported by the kernel using an approximation (exponent),
> > to keep flow storage size within two (typical) cachelines. Check for
> > socket updates when half of this time elapses: it should be a good
> > indication of the one-way delay we're interested in (peer to us).
> > 
> > Representable values are between 100 us and 3.2768 s, and any value
> > outside this range is clamped to these bounds. This choice appears
> > to be a good trade-off between additional overhead and throughput.
> > 
> > This mechanism partially overlaps with the "low RTT" destinations,
> > which we use to infer that a socket is connected to an endpoint on
> > the same machine (while possibly in a different namespace) if the
> > RTT is reported as 10 us or less.
> > 
> > This change doesn't, however, conflict with it: we are reading
> > TCP_INFO parameters for local connections anyway, so we can always
> > store the RTT approximation opportunistically.
> > 
> > Then, if the RTT is "low", we don't really need a timer to
> > acknowledge data as we'll always acknowledge everything to the
> > sender right away. However, we have limited space in the array where
> > we store addresses of local destinations, so the low RTT property of a
> > connection might toggle frequently. Because of this, it's actually
> > helpful to always have the RTT approximation stored.
> > 
> > This could probably benefit from a future rework, though, introducing
> > a more integrated approach between these two mechanisms.
> > 
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> >  tcp.c      | 28 +++++++++++++++++++++-------
> >  tcp_conn.h |  9 +++++++++
> >  util.c     | 14 ++++++++++++++
> >  util.h     |  1 +
> >  4 files changed, 45 insertions(+), 7 deletions(-)
> > 
> > diff --git a/tcp.c b/tcp.c
> > index 951f434..8eeef4c 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -202,9 +202,13 @@
> >   * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
> >   *   either side, the connection is reset
> >   *
> > - * - ACK_INTERVAL elapsed after data segment received from tap without having
> > + * - RTT / 2 elapsed after data segment received from tap without having
> >   *   sent an ACK segment, or zero-sized window advertised to tap/guest (flag
> > - *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent
> > + *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent.
> > + *
> > + *   RTT, here, is an approximation of the RTT value reported by the kernel via
> > + *   TCP_INFO, with a representable range from RTT_STORE_MIN (100 us) to
> > + *   RTT_STORE_MAX (3276.8 ms). The timeout value is clamped accordingly.
> >   *
> >   *
> >   * Summary of data flows (with ESTABLISHED event)
> > @@ -341,7 +345,6 @@ enum {
> >  #define MSS_DEFAULT			536
> >  #define WINDOW_DEFAULT			14600		/* RFC 6928 */
> >  
> > -#define ACK_INTERVAL			10		/* ms */
> >  #define RTO_INIT			1		/* s, RFC 6298 */
> >  #define RTO_INIT_AFTER_SYN_RETRIES	3		/* s, RFC 6298 */
> >  #define FIN_TIMEOUT			60
> > @@ -593,7 +596,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> >  	}
> >  
> >  	if (conn->flags & ACK_TO_TAP_DUE) {
> > -		it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > +		it.it_value.tv_sec = RTT_GET(conn) / 2 / (1000 * 1000);
> > +		it.it_value.tv_nsec = RTT_GET(conn) / 2 % (1000 * 1000) * 1000;
> >  	} else if (conn->flags & ACK_FROM_TAP_DUE) {
> >  		int exp = conn->retries, timeout = RTO_INIT;
> >  		if (!(conn->events & ESTABLISHED))
> > @@ -608,9 +612,15 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> >  		it.it_value.tv_sec = ACT_TIMEOUT;
> >  	}
> >  
> > -	flow_dbg(conn, "timer expires in %llu.%03llus",
> > -		 (unsigned long long)it.it_value.tv_sec,
> > -		 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
> > +	if (conn->flags & ACK_TO_TAP_DUE) {
> > +		flow_trace(conn, "timer expires in %lu.%01llums",
> > +			   (unsigned long)it.it_value.tv_nsec / 1000 / 1000,
> > +			   (unsigned long long)it.it_value.tv_nsec / 1000);  
> 
> This doesn't look right - you need a % to exclude the whole
> milliseconds here for the fractional part.

Ah, oops, right, and on top of that, the timeout can be more than one
second, but I forgot to include tv_sec in the output. Fixed in v3.

> Plus, it looks like this
> is trying to compute microseconds, which would be 3 digits after the
> decimal point in ms, but the format string accommodates only one.

That was intended, I wanted to show only the first digit of
microseconds given that the smallest values are hundreds of
microseconds, but changed anyway given the possible confusion.
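
For context on why the smallest values are hundreds of microseconds: the
range from the commit message (100 us to 3.2768 s) is what you would get
from storing the RTT as an exponent e, with the stored value being
100 us << e and e fitting in four bits. The sketch below only illustrates
that idea. The names, the 4-bit width and the round-down behaviour are
assumptions, not necessarily what RTT_GET() and the actual storage in
tcp_conn.h do.

```c
#include <stdint.h>

#define RTT_STORE_MIN_US	100U	/* lower bound from the commit message */
#define RTT_EXP_MAX		15U	/* assumption: 100 us << 15 = 3.2768 s */

/* Hypothetical: encode RTT in us as exponent e, stored value is 100 us << e */
static uint8_t rtt_to_exp(uint32_t rtt_us)
{
	uint8_t e = 0;

	/* Round down to a power-of-two multiple of the minimum, clamp to bounds */
	while (e < RTT_EXP_MAX && (RTT_STORE_MIN_US << (e + 1)) <= rtt_us)
		e++;

	return e;
}

/* Hypothetical: decode exponent back to an approximate RTT in microseconds */
static uint32_t exp_to_rtt(uint8_t e)
{
	return RTT_STORE_MIN_US << e;
}

/* Split half of the stored RTT into seconds and nanoseconds for a timer */
static void ack_check_interval(uint8_t e, long *sec, long *nsec)
{
	uint32_t half_us = exp_to_rtt(e) / 2;

	*sec = half_us / (1000 * 1000);
	*nsec = half_us % (1000 * 1000) * 1000L;
}
```

Under these assumptions, the shortest interval is 50 us (half of 100 us)
and the longest is 1.6384 s, which is why the timer value can exceed one
second.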

-- 
Stefano

