From mboxrd@z Thu Jan  1 00:00:00 1970
Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au
Authentication-Results: passt.top;
	dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202510 header.b=WrhcVNu/;
	dkim-atps=neutral
Received: from mail.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3])
	by passt.top (Postfix) with ESMTPS id E7F245A0619
	for <passt-dev@passt.top>; Fri, 05 Dec 2025 03:50:23 +0100 (CET)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=gibson.dropbear.id.au; s=202510; t=1764903020;
	bh=VoO+2eOCgBMS6Zw3gaCJJwrstONJLrmWyqLhII5aUVM=;
	h=Date:From:To:Cc:Subject:References:In-Reply-To:From;
	b=WrhcVNu/dhvcKoTaj1okAsPho1ssCzZbic9gz6jY/tQwqCsr89KROlAM/h/pD/o13
	 fhycebIvlZ96rA7sQns3c7Vv4cy+mIEUi0fz8ymh6NRoz/qweB50LCocwgzPyQr/ww
	 VVdHhIKXbdb5ZLdRpYGBhgEBj1rA9J0sW50qqx8cKUG0xbhHX7fRP6pZX2N13IUxpo
	 /ZsOLxqkyBvYOd1sv2ELefUaaL4YsPCWrPRbrkek/5GISsdYqjZMHRSQYILPug4C/+
	 di8riznlj2zDuoaDWaflHBxma5MQCOzUMAVXgq3iZoY1m5EZVXhhbdNYZlAURiO60E
	 Kg3hnNAKNgjjA==
Received: by gandalf.ozlabs.org (Postfix, from userid 1007)
	id 4dMwph0tg1z4wHF; Fri, 05 Dec 2025 13:50:20 +1100 (AEDT)
Date: Fri, 5 Dec 2025 13:49:40 +1100
From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Subject: Re: [PATCH 2/8] tcp: Adaptive interval based on RTT for socket-side
 acknowledgement checks
Message-ID: <aTJIRAm3zHf3Kw5k@zatzit>
References: <20251204074542.2156548-1-sbrivio@redhat.com>
 <20251204074542.2156548-3-sbrivio@redhat.com>
 <aTIdxMFGyl10cgiV@zatzit>
 <20251205022008.70e1195c@elisabeth>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha512;
	protocol="application/pgp-signature"; boundary="BmWzP7zGlEYrKGxY"
Content-Disposition: inline
In-Reply-To: <20251205022008.70e1195c@elisabeth>
Message-ID-Hash: EM5WUNJNXJGYARKCOT4QFUAP2A2LABYW
X-Message-ID-Hash: EM5WUNJNXJGYARKCOT4QFUAP2A2LABYW
X-MailFrom: dgibson@gandalf.ozlabs.org
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header
CC: passt-dev@passt.top, Max Chernoff <git@maxchernoff.ca>
X-Mailman-Version: 3.3.8
Precedence: list
List-Id: Development discussion and patches for passt <passt-dev.passt.top>
Archived-At: <https://archives.passt.top/passt-dev/aTJIRAm3zHf3Kw5k@zatzit/>
Archived-At: <https://passt.top/hyperkitty/list/passt-dev@passt.top/message/EM5WUNJNXJGYARKCOT4QFUAP2A2LABYW/>
List-Archive: <https://archives.passt.top/passt-dev/>
List-Archive: <https://passt.top/hyperkitty/list/passt-dev@passt.top/>
List-Help: <mailto:passt-dev-request@passt.top?subject=help>
List-Owner: <mailto:passt-dev-owner@passt.top>
List-Post: <mailto:passt-dev@passt.top>
List-Subscribe: <mailto:passt-dev-join@passt.top>
List-Unsubscribe: <mailto:passt-dev-leave@passt.top>


--BmWzP7zGlEYrKGxY
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Fri, Dec 05, 2025 at 02:20:08AM +0100, Stefano Brivio wrote:
> On Fri, 5 Dec 2025 10:48:20 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>=20
> > On Thu, Dec 04, 2025 at 08:45:35AM +0100, Stefano Brivio wrote:
> > > A fixed 10 ms ACK_TIMEOUT timer value served us relatively well
> > > until =20
> >=20
> > Nit: it's called "ACK_INTERVAL" in the code.
>=20
> Oops. I'll change this.

You addressed all my other concerns below, so

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> > > the previous change, because we would generally cause retransmissions
> > > for non-local outbound transfers with relatively high (> 100 Mbps)
> > > bandwidth and non-local but low (< 5 ms) RTT.
> > >=20
> > > Now that retransmissions are less frequent, we don't have a proper
> > > trigger to check for acknowledged bytes on the socket, and will
> > > generally block the sender for a significant amount of time while
> > > we could acknowledge more data, instead.
> > >=20
> > > Store the RTT reported by the kernel using an approximation (exponent=
),
> > > to keep flow storage size within two (typical) cachelines. Check for
> > > socket updates when half of this time elapses: it should be a good
> > > indication of the one-way delay we're interested in (peer to us). =20
> >=20
> > Reasoning based on a one-way delay doesn't quite make sense to me.  We
> > can't know when anything happens at the peer, and - obviously - we can
> > only set a timer starting at an event that occurs on our side.  So, I
> > think only RTT can matter to us, not one-way delay.
>=20
> ...except that we might be scheduling the timer at any point *after* we
> sent data, so the outbound delay might be partially elapsed, and the
> one-way (receiving) delay is actually (more) relevant.
>=20
> If we had instantaneous receiving of ACK segments, we would need to
> probe much more frequently than the RTT, to monitor the actual progress
> more accurately. Note that transmission rate (including forwarding
> delays) is not constant and might be bursty.

*thinks*... oh, yes, you're right.  The key point I was missing is
that what we're primarily trying to (indirectly) observe here is the
rate at which the receiving process is consuming - i.e. something
occurring locally at the peer - rather than something triggered by our
transmission to the peer.
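
As a rough illustration of the timer side of this (a sketch only, not
the code from the patch: half_rtt_delay() and NSEC_PER_SEC are made-up
names, and the input is assumed to be the already clamped RTT
approximation in microseconds), the one-shot delay is half the RTT,
converted to nanoseconds, and has to stay below one second to be a
valid .tv_nsec value:

```c
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC	1000000000L

/* Hypothetical helper: one-shot delay of half the approximated RTT */
static struct timespec half_rtt_delay(unsigned long rtt_us)
{
	struct timespec ts = { 0 };

	/* us to ns, then halve: the result must fit in .tv_nsec */
	assert(rtt_us * 1000 / 2 < NSEC_PER_SEC);
	ts.tv_nsec = (long)(rtt_us * 1000 / 2);

	return ts;
}
```

With the bounds from the commit message, that gives 50 us for the
minimum (100 us) and 6.4 ms for the maximum (12.8 ms).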

> But yes, in general it's not much more relevant than the RTT. I could
> drop this part of the commit message.
>=20
> > That said, using half the RTT estimate still makes sense to me: we
> > only have an approximation, and halving it gives us a pretty safe
> > lower bound.
>=20
> In any case, yes.
>=20
> > > Representable values are between 100 us and 12.8 ms, and any value =
=20
> >=20
> > Nit: I think Unicode is long enough supported you can use =B5s
>=20
> I prefer to avoid in the code if possible because one might not have
> Unicode support in all the relevant environments with all the relevant
> consoles (I just finished debugging stuff on Alpine...), and at that
> point I'd rather have consistent commit messages.

Eh, ok.
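
For reference, a standalone sketch of how a 3-bit exponent covers the
100 us to 12.8 ms range mentioned above. It mirrors the RTT_SET() /
RTT_GET() macros from the patch, but ilog2_floor(), rtt_to_exp() and
exp_to_rtt() are hypothetical names, and the definition of RTT_EXP_MAX
as (1 << bits) - 1 is an assumption about what MAX_FROM_BITS() expands
to:

```c
#include <assert.h>

#define RTT_EXP_BITS	3
#define RTT_EXP_MAX	((1U << RTT_EXP_BITS) - 1)	/* assumed: 7 */
#define RTT_STORE_MIN	100U				/* us */
#define RTT_STORE_MAX	(RTT_STORE_MIN << RTT_EXP_MAX)	/* 12800 us */

/* Floor of log2(x) for x > 0, standing in for ilog2() from the patch */
static unsigned ilog2_floor(unsigned long x)
{
	unsigned n = 0;

	assert(x > 0);
	while (x >>= 1)
		n++;

	return n;
}

/* Quantise an RTT in microseconds to a 3-bit exponent, clamping it */
static unsigned rtt_to_exp(unsigned long rtt_us)
{
	unsigned long q = rtt_us / RTT_STORE_MIN;
	unsigned e = ilog2_floor(q ? q : 1);

	return e > RTT_EXP_MAX ? RTT_EXP_MAX : e;
}

/* Recover the approximated RTT in microseconds from the exponent */
static unsigned long exp_to_rtt(unsigned exp)
{
	assert(exp <= RTT_EXP_MAX);

	return (unsigned long)RTT_STORE_MIN << exp;
}
```

Within the representable range, the stored value is rounded down to
RTT_STORE_MIN times a power of two, so a reported RTT of 1000 us comes
back as 800 us. Anything below 100 us is stored as 100 us, and
anything above 12.8 ms as 12.8 ms.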

> > > outside this range is clamped to these bounds. This choice appears
> > > to be a good trade-off between additional overhead and throughput.
> > >=20
> > > This mechanism partially overlaps with the "low RTT" destinations,
> > > which we use to infer that a socket is connected to an endpoint to
> > > the same machine (while possibly in a different namespace) if the
> > > RTT is reported as 10 us or less.
> > >=20
> > > This change doesn't, however, conflict with it: we are reading
> > > TCP_INFO parameters for local connections anyway, so we can always
> > > store the RTT approximation opportunistically.
> > >=20
> > > Then, if the RTT is "low", we don't really need a timer to
> > > acknowledge data as we'll always acknowledge everything to the
> > > sender right away. However, we have limited space in the array where
> > > we store addresses of local destination, so the low RTT property of a
> > > connection might toggle frequently. Because of this, it's actually
> > > helpful to always have the RTT approximation stored.
> > >=20
> > > This could probably benefit from a future rework, though, introducing
> > > a more integrated approach between these two mechanisms. =20
> >=20
> > Right, it feels like it should be possible to combine these
> > mechanisms, but figuring out exactly how isn't trivial.  Problem for
> > another day.
> >=20
> > >=20
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > ---
> > >  tcp.c      | 29 ++++++++++++++++++++++-------
> > >  tcp_conn.h |  9 +++++++++
> > >  util.c     | 14 ++++++++++++++
> > >  util.h     |  1 +
> > >  4 files changed, 46 insertions(+), 7 deletions(-)
> > >=20
> > > diff --git a/tcp.c b/tcp.c
> > > index 863ccdb..b00b874 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -202,9 +202,13 @@
> > >   * - ACT_TIMEOUT, in the presence of any event: if no activity is de=
tected on
> > >   *   either side, the connection is reset
> > >   *
> > > - * - ACK_INTERVAL elapsed after data segment received from tap witho=
ut having
> > > + * - RTT / 2 elapsed after data segment received from tap without ha=
ving
> > >   *   sent an ACK segment, or zero-sized window advertised to tap/gue=
st (flag
> > > - *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent
> > > + *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent.
> > > + *
> > > + *   RTT, here, is an approximation of the RTT value reported by the=
 kernel via
> > > + *   TCP_INFO, with a representable range from RTT_STORE_MIN (100 us=
) to
> > > + *   RTT_STORE_MAX (12.8 ms). The timeout value is clamped according=
ly.
> > >   *
> > >   *
> > >   * Summary of data flows (with ESTABLISHED event)
> > > @@ -341,7 +345,6 @@ enum {
> > >  #define MSS_DEFAULT			536
> > >  #define WINDOW_DEFAULT			14600		/* RFC 6928 */
> > > =20
> > > -#define ACK_INTERVAL			10		/* ms */
> > >  #define RTO_INIT			1		/* s, RFC 6298 */
> > >  #define RTO_INIT_AFTER_SYN_RETRIES	3		/* s, RFC 6298 */
> > >  #define FIN_TIMEOUT			60
> > > @@ -594,7 +597,9 @@ static void tcp_timer_ctl(const struct ctx *c, st=
ruct tcp_tap_conn *conn)
> > >  	}
> > > =20
> > >  	if (conn->flags & ACK_TO_TAP_DUE) {
> > > -		it.it_value.tv_nsec =3D (long)ACK_INTERVAL * 1000 * 1000;
> > > +		it.it_value.tv_nsec =3D (long)RTT_GET(conn) * 1000 / 2;
> > > +		static_assert(RTT_STORE_MAX * 1000 / 2 < 1000 * 1000 * 1000,
> > > +			      ".tv_nsec is greater than 1000 * 1000 * 1000");
> > >  	} else if (conn->flags & ACK_FROM_TAP_DUE) {
> > >  		int exp =3D conn->retries, timeout =3D RTO_INIT;
> > >  		if (!(conn->events & ESTABLISHED))
> > > @@ -609,9 +614,15 @@ static void tcp_timer_ctl(const struct ctx *c, s=
truct tcp_tap_conn *conn)
> > >  		it.it_value.tv_sec =3D ACT_TIMEOUT;
> > >  	}
> > > =20
> > > -	flow_dbg(conn, "timer expires in %llu.%03llus",
> > > -		 (unsigned long long)it.it_value.tv_sec,
> > > -		 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
> > > +	if (conn->flags & ACK_TO_TAP_DUE) {
> > > +		flow_trace(conn, "timer expires in %lu.%01llums",
> > > +			   (unsigned long)it.it_value.tv_nsec / 1000 / 1000,
> > > +			   (unsigned long long)it.it_value.tv_nsec / 1000);
> > > +	} else {
> > > +		flow_dbg(conn, "timer expires in %llu.%03llus",
> > > +			 (unsigned long long)it.it_value.tv_sec,
> > > +			 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
> > > +	} =20
> >=20
> > One branch is flow_trace(), one is flow_dbg() which doesn't seem
> > correct.
>=20
> No, it's intended, it's actually the main reason why I'm changing this
> part.
>=20
> Now that we have more frequent timer scheduling on ACK_TO_TAP_DUE, the
> debug logs become unusable if you're trying to debug anything that's
> not related to a specific data transfer.
>=20
> > Also, basing the range indirectly on the flags, rather than
> > on the actual numbers in it.it_value seems fragile.
>=20
> Flags tell us why we're scheduling a specific timer, and it's only
> on ACK_TO_TAP_DUE that we want to have more fine-grained values.

Ah... ok, that makes sense.

> > But... this seems
> > overly complex for a trace message anyway.  Maybe just use the seconds
> > formatting, but increase the resolution to =B5s.
>=20
> I tried a number of different combinations like that, they are all
> rather inconvenient.

Fair enough.

>=20
> > > =20
> > >  	if (timerfd_settime(conn->timer, 0, &it, NULL))
> > >  		flow_perror(conn, "failed to set timer");
> > > @@ -1149,6 +1160,10 @@ int tcp_update_seqack_wnd(const struct ctx *c,=
 struct tcp_tap_conn *conn,
> > >  		conn_flag(c, conn, ACK_TO_TAP_DUE);
> > > =20
> > >  out:
> > > +	/* Opportunistically store RTT approximation on valid TCP_INFO data=
 */
> > > +	if (tinfo)
> > > +		RTT_SET(conn, tinfo->tcpi_rtt);
> > > +
> > >  	return new_wnd_to_tap       !=3D prev_wnd_to_tap ||
> > >  	       conn->seq_ack_to_tap !=3D prev_ack_to_tap;
> > >  }
> > > diff --git a/tcp_conn.h b/tcp_conn.h
> > > index e36910c..76034f6 100644
> > > --- a/tcp_conn.h
> > > +++ b/tcp_conn.h
> > > @@ -49,6 +49,15 @@ struct tcp_tap_conn {
> > >  #define MSS_SET(conn, mss)	(conn->tap_mss =3D (mss >> (16 - TCP_MSS_=
BITS)))
> > >  #define MSS_GET(conn)		(conn->tap_mss << (16 - TCP_MSS_BITS))
> > > =20
> > > +#define RTT_EXP_BITS			3
> > > +	unsigned int	rtt_exp		:RTT_EXP_BITS;
> > > +#define RTT_EXP_MAX			MAX_FROM_BITS(RTT_EXP_BITS)
> > > +#define RTT_STORE_MIN			100 /* us, minimum representable */
> > > +#define RTT_STORE_MAX			(RTT_STORE_MIN << RTT_EXP_MAX)
> > > +#define RTT_SET(conn, rtt)						\
> > > +	(conn->rtt_exp =3D MIN(RTT_EXP_MAX, ilog2(MAX(1, rtt / RTT_STORE_MI=
N))))
> > > +#define RTT_GET(conn)			(RTT_STORE_MIN << conn->rtt_exp)
> > > +
> > >  	int		sock		:FD_REF_BITS;
> > > =20
> > >  	uint8_t		events;
> > > diff --git a/util.c b/util.c
> > > index 4beb7c2..590373d 100644
> > > --- a/util.c
> > > +++ b/util.c
> > > @@ -611,6 +611,9 @@ int __daemon(int pidfile_fd, int devnull_fd)
> > >   * fls() - Find last (most significant) bit set in word
> > >   * @x:		Word
> > >   *
> > > + * Note: unlike ffs() and other implementations of fls(), notably th=
e one from
> > > + * the Linux kernel, the starting position is 0 and not 1, that is, =
fls(1) =3D 0.
> > > + *
> > >   * Return: position of most significant bit set, starting from 0, -1=
 if none
> > >   */
> > >  int fls(unsigned long x)
> > > @@ -626,6 +629,17 @@ int fls(unsigned long x)
> > >  	return y;
> > >  }
> > > =20
> > > +/**
> > > + * ilog2() - Integral part (floor) of binary logarithm (logarithm to=
 the base 2)
> > > + * @x:		Argument
> > > + *
> > > + * Return: integral part of binary logarithm of @x, -1 if undefined =
(if @x is 0)
> > > + */
> > > +int ilog2(unsigned long x)
> > > +{
> > > +	return fls(x);
> > > +}
> > > +
> > >  /**
> > >   * write_file() - Replace contents of file with a string
> > >   * @path:	File to write
> > > diff --git a/util.h b/util.h
> > > index 7bf0701..40de694 100644
> > > --- a/util.h
> > > +++ b/util.h
> > > @@ -230,6 +230,7 @@ int output_file_open(const char *path, int flags);
> > >  void pidfile_write(int fd, pid_t pid);
> > >  int __daemon(int pidfile_fd, int devnull_fd);
> > >  int fls(unsigned long x);
> > > +int ilog2(unsigned long x);
> > >  int write_file(const char *path, const char *buf);
> > >  intmax_t read_file_integer(const char *path, intmax_t fallback);
> > >  int write_all_buf(int fd, const void *buf, size_t len);
> > > --=20
> > > 2.43.0
>=20
> --=20
> Stefano
>=20

--=20
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

--BmWzP7zGlEYrKGxY
Content-Type: application/pgp-signature; name=signature.asc

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCgAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmkySEMACgkQzQJF27ox
2GcKiBAAk1Pw6zUszMDN8jwxfbIiXtMme3IcXD9BGVPIOh28YOHY6tANRwxoTcv8
JFXOKVAZ6JVj67jPoqOCWgjHBIYWnYuWRyUi6R5NC3q93GtT+KcYRdYsmyBT5IRW
N9hpDaExaWu0LRvWBeBQG/+6jY8mt0/0x//2BCylbFcreqAXMaBI00Y7bVjWVOWu
nf6b7MSsEL5tCHyrPFI8UtpmPhmUv09D0zcG3BIzAFgxFWu08t1NUny1fwNIeoUj
3vGEZIApe1dyEZk6aiDumXX2PL7dNBfvJeP1FT6m9kqLb+if/jUX9qx5GnsVe09v
APk5xPnrJKskoKY+PYp5bMfeNsFn59d5sZ8D8m9hTgDaCb6CiD8n+6dtnzCI8vmq
Ex0CH7f2MGITmwGU04lelC2pN2RyDLJQ8TtdF1wjiX4kll+BHiGXqw8O2hqj3rMI
Ro9voeuGs6GbvakhXrfaFwoosk02Ka0zVAssojqDR5flzvyhQ3l9qNvn36IK2Ofl
lTkIhMnqfL7MZsCU2173sfPBd0YsFReWPxKbrACv871AmBfxPnhuimGis+C/6vcA
yNc74oitQJWQ7yNIPu5k/DHW0fmN5WoawfDWEbWREelx5zFtOGf58f+10OnCLh1J
ZE41IDwLMIGYdlC44D94cswaREdeUeag1fuwCLWZ0SC3Y6cwGY0=
=LXhS
-----END PGP SIGNATURE-----

--BmWzP7zGlEYrKGxY--