From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Fri, 5 Dec 2025 13:49:40 +1100 From: David Gibson To: Stefano Brivio CC: passt-dev@passt.top, Max Chernoff Subject: Re: [PATCH 2/8] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks References: <20251204074542.2156548-1-sbrivio@redhat.com> <20251204074542.2156548-3-sbrivio@redhat.com> <20251205022008.70e1195c@elisabeth> In-Reply-To: <20251205022008.70e1195c@elisabeth> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="BmWzP7zGlEYrKGxY" Content-Disposition: inline List-Id: Development discussion and patches for passt 
--BmWzP7zGlEYrKGxY Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Fri, Dec 05, 2025 at 02:20:08AM +0100, Stefano Brivio wrote: > On Fri, 5 Dec 2025 10:48:20 +1100 > David Gibson wrote: >=20 > > On Thu, Dec 04, 2025 at 08:45:35AM +0100, Stefano Brivio wrote: > > > A fixed 10 ms ACK_TIMEOUT timer value served us relatively well > > > until =20 > >=20 > > Nit: it's called "ACK_INTERVAL" in the code. >=20 > Oops. I'll change this. You addressed all my other concerns below, so Reviewed-by: David Gibson > > > the previous change, because we would generally cause retransmissions > > > for non-local outbound transfers with relatively high (> 100 Mbps) > > > bandwidth and non-local but low (< 5 ms) RTT. > > >=20 > > > Now that retransmissions are less frequent, we don't have a proper > > > trigger to check for acknowledged bytes on the socket, and will > > > generally block the sender for a significant amount of time while > > > we could acknowledge more data, instead. > > >=20 > > > Store the RTT reported by the kernel using an approximation (exponent= ), > > > to keep flow storage size within two (typical) cachelines. Check for > > > socket updates when half of this time elapses: it should be a good > > > indication of the one-way delay we're interested in (peer to us). =20 > >=20 > > Reasoning based on a one-way delay doesn't quite make sense to me. We > > can't know when anything happens at the peer, and - obviously - we can > > only set a timer starting at an event that occurs on our side. 
So, I > > think only RTT can matter to us, not one-way delay. >=20 > ...except that we might be scheduling the timer at any point *after* we > sent data, so the outbound delay might be partially elapsed, and the > one-way (receiving) delay is actually (more) relevant. >=20 > If we had instantaneous receiving of ACK segments, we would need to > probe much more frequently than the RTT, to monitor the actual progress > more accurately. Note that transmission rate (including forwarding > delays) is not constant and might be bursty. *thinks*... oh, yes, you're right. The key point I was missing is that what we're primarily trying to (indirectly) observe here is the rate at which the receiving process is consuming - i.e. something occurring locally at the peer - rather than something triggered by our transmission to the peer. > But yes, in general it's not much more relevant than the RTT. I could > drop this part of the commit message. >=20 > > That said, using half the RTT estimate still makes sense to me: we > > only have an approximation, and halving it gives us a pretty safe > > lower bound. >=20 > In any case, yes. >=20 > > > Representable values are between 100 us and 12.8 ms, and any value = =20 > >=20 > > Nit: I think Unicode is long enough supported you can use =B5s >=20 > I prefer to avoid in the code if possible because one might not have > Unicode support in all the relevant environments with all the relevant > consoles (I just finished debugging stuff on Alpine...), and at that > point I'd rather have consistent commit messages. Eh, ok. > > > outside this range is clamped to these bounds. This choice appears > > > to be a good trade-off between additional overhead and throughput. > > >=20 > > > This mechanism partially overlaps with the "low RTT" destinations, > > > which we use to infer that a socket is connected to an endpoint to > > > the same machine (while possibly in a different namespace) if the > > > RTT is reported as 10 us or less. 
> > >=20 > > > This change doesn't, however, conflict with it: we are reading > > > TCP_INFO parameters for local connections anyway, so we can always > > > store the RTT approximation opportunistically. > > >=20 > > > Then, if the RTT is "low", we don't really need a timer to > > > acknowledge data as we'll always acknowledge everything to the > > > sender right away. However, we have limited space in the array where > > > we store addresses of local destination, so the low RTT property of a > > > connection might toggle frequently. Because of this, it's actually > > > helpful to always have the RTT approximation stored. > > >=20 > > > This could probably benefit from a future rework, though, introducing > > > a more integrated approach between these two mechanisms. =20 > >=20 > > Right, it feels like it should be possible to combine these > > mechanisms, but figuring out exactly how isn't trivial. Problem for > > another day. > >=20 > > >=20 > > > Signed-off-by: Stefano Brivio > > > --- > > > tcp.c | 29 ++++++++++++++++++++++------- > > > tcp_conn.h | 9 +++++++++ > > > util.c | 14 ++++++++++++++ > > > util.h | 1 + > > > 4 files changed, 46 insertions(+), 7 deletions(-) > > >=20 > > > diff --git a/tcp.c b/tcp.c > > > index 863ccdb..b00b874 100644 > > > --- a/tcp.c > > > +++ b/tcp.c > > > @@ -202,9 +202,13 @@ > > > * - ACT_TIMEOUT, in the presence of any event: if no activity is de= tected on > > > * either side, the connection is reset > > > * > > > - * - ACK_INTERVAL elapsed after data segment received from tap witho= ut having > > > + * - RTT / 2 elapsed after data segment received from tap without ha= ving > > > * sent an ACK segment, or zero-sized window advertised to tap/gue= st (flag > > > - * ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent > > > + * ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent. 
> > > + * > > > + * RTT, here, is an approximation of the RTT value reported by the= kernel via > > > + * TCP_INFO, with a representable range from RTT_STORE_MIN (100 us= ) to > > > + * RTT_STORE_MAX (12.8 ms). The timeout value is clamped according= ly. > > > * > > > * > > > * Summary of data flows (with ESTABLISHED event) > > > @@ -341,7 +345,6 @@ enum { > > > #define MSS_DEFAULT 536 > > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */ > > > =20 > > > -#define ACK_INTERVAL 10 /* ms */ > > > #define RTO_INIT 1 /* s, RFC 6298 */ > > > #define RTO_INIT_AFTER_SYN_RETRIES 3 /* s, RFC 6298 */ > > > #define FIN_TIMEOUT 60 > > > @@ -594,7 +597,9 @@ static void tcp_timer_ctl(const struct ctx *c, st= ruct tcp_tap_conn *conn) > > > } > > > =20 > > > if (conn->flags & ACK_TO_TAP_DUE) { > > > - it.it_value.tv_nsec =3D (long)ACK_INTERVAL * 1000 * 1000; > > > + it.it_value.tv_nsec =3D (long)RTT_GET(conn) * 1000 / 2; > > > + static_assert(RTT_STORE_MAX * 1000 / 2 < 1000 * 1000 * 1000, > > > + ".tv_nsec is greater than 1000 * 1000 * 1000"); > > > } else if (conn->flags & ACK_FROM_TAP_DUE) { > > > int exp =3D conn->retries, timeout =3D RTO_INIT; > > > if (!(conn->events & ESTABLISHED)) > > > @@ -609,9 +614,15 @@ static void tcp_timer_ctl(const struct ctx *c, s= truct tcp_tap_conn *conn) > > > it.it_value.tv_sec =3D ACT_TIMEOUT; > > > } > > > =20 > > > - flow_dbg(conn, "timer expires in %llu.%03llus", > > > - (unsigned long long)it.it_value.tv_sec, > > > - (unsigned long long)it.it_value.tv_nsec / 1000 / 1000); > > > + if (conn->flags & ACK_TO_TAP_DUE) { > > > + flow_trace(conn, "timer expires in %lu.%01llums", > > > + (unsigned long)it.it_value.tv_nsec / 1000 / 1000, > > > + (unsigned long long)it.it_value.tv_nsec / 1000); > > > + } else { > > > + flow_dbg(conn, "timer expires in %llu.%03llus", > > > + (unsigned long long)it.it_value.tv_sec, > > > + (unsigned long long)it.it_value.tv_nsec / 1000 / 1000); > > > + } =20 > >=20 > > One branch is flow_trace(), one is flow_dbg() which 
doesn't seem > > correct. >=20 > No, it's intended, it's actually the main reason why I'm changing this > part. >=20 > Now that we have more frequent timer scheduling on ACK_TO_TAP_DUE, the > debug logs become unusable if you're trying to debug anything that's > not related to a specific data transfer. >=20 > > Also, basing the range indirectly on the flags, rather than > > on the actual numbers in it.it_value seems fragile. >=20 > Flags tell us why we're scheduling a specific timer, and it's only > on ACK_TO_TAP_DUE that we want to have more fine-grained values. Ah... ok, that makes sense. > > But... this seems > > overly complex for a trace message anyway. Maybe just use the seconds > > formatting, but increase the resolution to =B5s. >=20 > I tried a number of different combinations like that, they are all > rather inconvenient. Fair enough. >=20 > > > =20 > > > if (timerfd_settime(conn->timer, 0, &it, NULL)) > > > flow_perror(conn, "failed to set timer"); > > > @@ -1149,6 +1160,10 @@ int tcp_update_seqack_wnd(const struct ctx *c,= struct tcp_tap_conn *conn, > > > conn_flag(c, conn, ACK_TO_TAP_DUE); > > > =20 > > > out: > > > + /* Opportunistically store RTT approximation on valid TCP_INFO data= */ > > > + if (tinfo) > > > + RTT_SET(conn, tinfo->tcpi_rtt); > > > + > > > return new_wnd_to_tap !=3D prev_wnd_to_tap || > > > conn->seq_ack_to_tap !=3D prev_ack_to_tap; > > > } > > > diff --git a/tcp_conn.h b/tcp_conn.h > > > index e36910c..76034f6 100644 > > > --- a/tcp_conn.h > > > +++ b/tcp_conn.h > > > @@ -49,6 +49,15 @@ struct tcp_tap_conn { > > > #define MSS_SET(conn, mss) (conn->tap_mss =3D (mss >> (16 - TCP_MSS_= BITS))) > > > #define MSS_GET(conn) (conn->tap_mss << (16 - TCP_MSS_BITS)) > > > =20 > > > +#define RTT_EXP_BITS 3 > > > + unsigned int rtt_exp :RTT_EXP_BITS; > > > +#define RTT_EXP_MAX MAX_FROM_BITS(RTT_EXP_BITS) > > > +#define RTT_STORE_MIN 100 /* us, minimum representable */ > > > +#define RTT_STORE_MAX (RTT_STORE_MIN << RTT_EXP_MAX) > > > +#define 
RTT_SET(conn, rtt) \ > > > + (conn->rtt_exp =3D MIN(RTT_EXP_MAX, ilog2(MAX(1, rtt / RTT_STORE_MI= N)))) > > > +#define RTT_GET(conn) (RTT_STORE_MIN << conn->rtt_exp) > > > + > > > int sock :FD_REF_BITS; > > > =20 > > > uint8_t events; > > > diff --git a/util.c b/util.c > > > index 4beb7c2..590373d 100644 > > > --- a/util.c > > > +++ b/util.c > > > @@ -611,6 +611,9 @@ int __daemon(int pidfile_fd, int devnull_fd) > > > * fls() - Find last (most significant) bit set in word > > > * @x: Word > > > * > > > + * Note: unlike ffs() and other implementations of fls(), notably th= e one from > > > + * the Linux kernel, the starting position is 0 and not 1, that is, = fls(1) =3D 0. > > > + * > > > * Return: position of most significant bit set, starting from 0, -1= if none > > > */ > > > int fls(unsigned long x) > > > @@ -626,6 +629,17 @@ int fls(unsigned long x) > > > return y; > > > } > > > =20 > > > +/** > > > + * ilog2() - Integral part (floor) of binary logarithm (logarithm to= the base 2) > > > + * @x: Argument > > > + * > > > + * Return: integral part of binary logarithm of @x, -1 if undefined = (if @x is 0) > > > + */ > > > +int ilog2(unsigned long x) > > > +{ > > > + return fls(x); > > > +} > > > + > > > /** > > > * write_file() - Replace contents of file with a string > > > * @path: File to write > > > diff --git a/util.h b/util.h > > > index 7bf0701..40de694 100644 > > > --- a/util.h > > > +++ b/util.h > > > @@ -230,6 +230,7 @@ int output_file_open(const char *path, int flags); > > > void pidfile_write(int fd, pid_t pid); > > > int __daemon(int pidfile_fd, int devnull_fd); > > > int fls(unsigned long x); > > > +int ilog2(unsigned long x); > > > int write_file(const char *path, const char *buf); > > > intmax_t read_file_integer(const char *path, intmax_t fallback); > > > int write_all_buf(int fd, const void *buf, size_t len); > > > --=20 > > > 2.43.0 >=20 > --=20 > Stefano >=20 --=20 David Gibson (he or they) | I'll have my music baroque, and my code david AT 
gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson --BmWzP7zGlEYrKGxY Content-Type: application/pgp-signature; name=signature.asc -----BEGIN PGP SIGNATURE----- iQIzBAEBCgAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmkySEMACgkQzQJF27ox 2GcKiBAAk1Pw6zUszMDN8jwxfbIiXtMme3IcXD9BGVPIOh28YOHY6tANRwxoTcv8 JFXOKVAZ6JVj67jPoqOCWgjHBIYWnYuWRyUi6R5NC3q93GtT+KcYRdYsmyBT5IRW N9hpDaExaWu0LRvWBeBQG/+6jY8mt0/0x//2BCylbFcreqAXMaBI00Y7bVjWVOWu nf6b7MSsEL5tCHyrPFI8UtpmPhmUv09D0zcG3BIzAFgxFWu08t1NUny1fwNIeoUj 3vGEZIApe1dyEZk6aiDumXX2PL7dNBfvJeP1FT6m9kqLb+if/jUX9qx5GnsVe09v APk5xPnrJKskoKY+PYp5bMfeNsFn59d5sZ8D8m9hTgDaCb6CiD8n+6dtnzCI8vmq Ex0CH7f2MGITmwGU04lelC2pN2RyDLJQ8TtdF1wjiX4kll+BHiGXqw8O2hqj3rMI Ro9voeuGs6GbvakhXrfaFwoosk02Ka0zVAssojqDR5flzvyhQ3l9qNvn36IK2Ofl lTkIhMnqfL7MZsCU2173sfPBd0YsFReWPxKbrACv871AmBfxPnhuimGis+C/6vcA yNc74oitQJWQ7yNIPu5k/DHW0fmN5WoawfDWEbWREelx5zFtOGf58f+10OnCLh1J ZE41IDwLMIGYdlC44D94cswaREdeUeag1fuwCLWZ0SC3Y6cwGY0= =LXhS -----END PGP SIGNATURE----- --BmWzP7zGlEYrKGxY--