From mboxrd@z Thu Jan 1 00:00:00 1970
Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au
Authentication-Results: passt.top;
	dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202512 header.b=d1nrZrBR;
	dkim-atps=neutral
Received: from mail.ozlabs.org (gandalf.ozlabs.org [150.107.74.76])
	by passt.top (Postfix) with ESMTPS id 38DFC5A0271
	for ; Tue, 09 Dec 2025 06:13:31 +0100 (CET)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=gibson.dropbear.id.au; s=202512; t=1765257208;
	bh=0g6BhC1M/7KVCDzm0tQKOiE+I2WAdIAdVWdHIzgYHyQ=;
	h=Date:From:To:Cc:Subject:References:In-Reply-To:From;
	b=d1nrZrBR1I4DJEFxFVEgJ375yX+2QyYTmMkwB1UZU88/ZfPxS10U8tuZsBm4H+2il
	 N552oycF0r/VZIaYAlhaN6PLVH/R9doCRjZUaUPGIGDHM4xNSgAAdtgfug3jM1L+7n
	 cqvpZAP2qXBxyM1hx84vYjoN0dE+1/iigA4+w3LwwY4gn5iqbq15pCkjv1LFtrmfyf
	 VIP7MLeT1C49GHTylSNvBTuaFUa7Qez5UJ5d8ymO/K5QGxkFR0vW6w06KO9kHgh/Xs
	 9Ow9VUdEV777HcXwwlBytYhlD4AzV/cUc4U8BzrK5RawCcMwx5jQX083LR5ry64GjH
	 nxse17Y6dqE9A==
Received: by gandalf.ozlabs.org (Postfix, from userid 1007)
	id 4dQRp04yqgz4wMS; Tue, 09 Dec 2025 16:13:28 +1100 (AEDT)
Date: Tue, 9 Dec 2025 16:10:36 +1100
From: David Gibson
To: Stefano Brivio
Subject: Re: [PATCH v3 04/10] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks
Message-ID:
References: <20251208072024.3884137-1-sbrivio@redhat.com>
	<20251208072024.3884137-5-sbrivio@redhat.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha512;
	protocol="application/pgp-signature"; boundary="oQC2Ca/bkKVx49DL"
Content-Disposition: inline
In-Reply-To: <20251208072024.3884137-5-sbrivio@redhat.com>
Message-ID-Hash: ICGJCDE7LIYZF2KWB3VNOKKMN32PTAJG
X-Message-ID-Hash: ICGJCDE7LIYZF2KWB3VNOKKMN32PTAJG
X-MailFrom: dgibson@gandalf.ozlabs.org
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia;
	implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header
CC: passt-dev@passt.top, Max Chernoff
X-Mailman-Version: 3.3.8
Precedence: list
List-Id: Development discussion and patches for passt
Archived-At:
Archived-At:
List-Archive:
List-Archive:
List-Help:
List-Owner:
List-Post:
List-Subscribe:
List-Unsubscribe:

--oQC2Ca/bkKVx49DL
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Mon, Dec 08, 2025 at 08:20:17AM +0100, Stefano Brivio wrote:
> A fixed 10 ms ACK_INTERVAL timer value served us relatively well until
> the previous change, because we would generally cause retransmissions
> for non-local outbound transfers with relatively high (> 100 Mbps)
> bandwidth and non-local but low (< 5 ms) RTT.
>
> Now that retransmissions are less frequent, we don't have a proper
> trigger to check for acknowledged bytes on the socket, and will
> generally block the sender for a significant amount of time while
> we could acknowledge more data, instead.
>
> Store the RTT reported by the kernel using an approximation (exponent),
> to keep flow storage size within two (typical) cachelines. Check for
> socket updates when half of this time elapses: it should be a good
> indication of the one-way delay we're interested in (peer to us).
>
> Representable values are between 100 us and 3.2768 s, and any value
> outside this range is clamped to these bounds. This choice appears
> to be a good trade-off between additional overhead and throughput.
>
> This mechanism partially overlaps with the "low RTT" destinations,
> which we use to infer that a socket is connected to an endpoint to
> the same machine (while possibly in a different namespace) if the
> RTT is reported as 10 us or less.
>
> This change doesn't, however, conflict with it: we are reading
> TCP_INFO parameters for local connections anyway, so we can always
> store the RTT approximation opportunistically.
>
> Then, if the RTT is "low", we don't really need a timer to
> acknowledge data as we'll always acknowledge everything to the
> sender right away. However, we have limited space in the array where
> we store addresses of local destination, so the low RTT property of a
> connection might toggle frequently. Because of this, it's actually
> helpful to always have the RTT approximation stored.
>
> This could probably benefit from a future rework, though, introducing
> a more integrated approach between these two mechanisms.
>
> Signed-off-by: Stefano Brivio
> ---
>  tcp.c      | 30 +++++++++++++++++++++++-------
>  tcp_conn.h |  9 +++++++++
>  util.c     | 14 ++++++++++++++
>  util.h     |  1 +
>  4 files changed, 47 insertions(+), 7 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 28d3304..0167121 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -202,9 +202,13 @@
>   * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
>   *   either side, the connection is reset
>   *
> - * - ACK_INTERVAL elapsed after data segment received from tap without having
> + * - RTT / 2 elapsed after data segment received from tap without having
>   *   sent an ACK segment, or zero-sized window advertised to tap/guest (flag
> - *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent
> + *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent.
> + *
> + *   RTT, here, is an approximation of the RTT value reported by the kernel via
> + *   TCP_INFO, with a representable range from RTT_STORE_MIN (100 us) to
> + *   RTT_STORE_MAX (3276.8 ms). The timeout value is clamped accordingly.
>   *
>   *
>   * Summary of data flows (with ESTABLISHED event)
> @@ -341,7 +345,6 @@ enum {
>  #define MSS_DEFAULT 536
>  #define WINDOW_DEFAULT 14600 /* RFC 6928 */
>
> -#define ACK_INTERVAL 10 /* ms */
>  #define RTO_INIT 1 /* s, RFC 6298 */
>  #define RTO_INIT_AFTER_SYN_RETRIES 3 /* s, RFC 6298 */
>  #define FIN_TIMEOUT 60
> @@ -593,7 +596,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>  	}
>
>  	if (conn->flags & ACK_TO_TAP_DUE) {
> -		it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> +		it.it_value.tv_sec = RTT_GET(conn) / 2 / (1000 * 1000);
> +		it.it_value.tv_nsec = RTT_GET(conn) / 2 % (1000 * 1000) * 1000;
>  	} else if (conn->flags & ACK_FROM_TAP_DUE) {
>  		int exp = conn->retries, timeout = RTO_INIT;
>  		if (!(conn->events & ESTABLISHED))
> @@ -608,9 +612,17 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>  		it.it_value.tv_sec = ACT_TIMEOUT;
>  	}
>
> -	flow_dbg(conn, "timer expires in %llu.%03llus",
> -		 (unsigned long long)it.it_value.tv_sec,
> -		 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
> +	if (conn->flags & ACK_TO_TAP_DUE) {
> +		flow_trace(conn, "timer expires in %llu.%03llums",
> +			   (unsigned long)it.it_value.tv_sec * 1000 +
> +			   (unsigned long long)it.it_value.tv_nsec %
> +				(1000 * 1000),
> +			   (unsigned long long)it.it_value.tv_nsec / 1000);

This is the wrong way around.
The ms part needs to be:
	tv_sec * 1000 + tv_nsec / 1000000
and the fractional (us) part:
	(tv_nsec / 1000) % 1000

(or if you did just want a single digit after the ., then:
	tv_nsec / 100000 % 10
)

> +	} else {
> +		flow_dbg(conn, "timer expires in %llu.%03llus",
> +			 (unsigned long long)it.it_value.tv_sec,
> +			 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
> +	}
>
>  	if (timerfd_settime(conn->timer, 0, &it, NULL))
>  		flow_perror(conn, "failed to set timer");
> @@ -1144,6 +1156,10 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
>  		conn_flag(c, conn, ACK_TO_TAP_DUE);
>
>  out:
> +	/* Opportunistically store RTT approximation on valid TCP_INFO data */
> +	if (tinfo)
> +		RTT_SET(conn, tinfo->tcpi_rtt);
> +
>  	return new_wnd_to_tap != prev_wnd_to_tap ||
>  		conn->seq_ack_to_tap != prev_ack_to_tap;
>  }
> diff --git a/tcp_conn.h b/tcp_conn.h
> index e36910c..9c6ff9e 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -49,6 +49,15 @@ struct tcp_tap_conn {
>  #define MSS_SET(conn, mss) (conn->tap_mss = (mss >> (16 - TCP_MSS_BITS)))
>  #define MSS_GET(conn) (conn->tap_mss << (16 - TCP_MSS_BITS))
>
> +#define RTT_EXP_BITS 4
> +	unsigned int rtt_exp :RTT_EXP_BITS;
> +#define RTT_EXP_MAX MAX_FROM_BITS(RTT_EXP_BITS)
> +#define RTT_STORE_MIN 100 /* us, minimum representable */
> +#define RTT_STORE_MAX ((long)(RTT_STORE_MIN << RTT_EXP_MAX))
> +#define RTT_SET(conn, rtt) \
> +	(conn->rtt_exp = MIN(RTT_EXP_MAX, ilog2(MAX(1, rtt / RTT_STORE_MIN))))
> +#define RTT_GET(conn) (RTT_STORE_MIN << conn->rtt_exp)
> +
>  	int sock :FD_REF_BITS;
>
>  	uint8_t events;
> diff --git a/util.c b/util.c
> index 2232a24..bfeb619 100644
> --- a/util.c
> +++ b/util.c
> @@ -614,6 +614,9 @@ int __daemon(int pidfile_fd, int devnull_fd)
>   * fls() - Find last (most significant) bit set in word
>   * @x: Word
>   *
> + * Note: unlike ffs() and other implementations of fls(), notably the one from
> + * the Linux kernel, the starting position is 0 and not 1, that is, fls(1) = 0.
> + *
>   * Return: position of most significant bit set, starting from 0, -1 if none
>   */
>  int fls(unsigned long x)
> @@ -629,6 +632,17 @@ int fls(unsigned long x)
>  	return y;
>  }
>
> +/**
> + * ilog2() - Integral part (floor) of binary logarithm (logarithm to the base 2)
> + * @x: Argument
> + *
> + * Return: integral part of binary logarithm of @x, -1 if undefined (if @x is 0)
> + */
> +int ilog2(unsigned long x)
> +{
> +	return fls(x);
> +}
> +
>  /**
>   * write_file() - Replace contents of file with a string
>   * @path: File to write
> diff --git a/util.h b/util.h
> index 744880b..f7a941f 100644
> --- a/util.h
> +++ b/util.h
> @@ -233,6 +233,7 @@ int output_file_open(const char *path, int flags);
>  void pidfile_write(int fd, pid_t pid);
>  int __daemon(int pidfile_fd, int devnull_fd);
>  int fls(unsigned long x);
> +int ilog2(unsigned long x);
>  int write_file(const char *path, const char *buf);
>  intmax_t read_file_integer(const char *path, intmax_t fallback);
>  int write_all_buf(int fd, const void *buf, size_t len);
> --
> 2.43.0
>

--
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

--oQC2Ca/bkKVx49DL
Content-Type: application/pgp-signature; name=signature.asc

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCgAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmk3r0sACgkQzQJF27ox
2GcwyBAAkQNZNLIGmGNk47WWO3Q1+sLW/u44NIo9wyhW5jxgXO19Zmf+8m3uh/p9
CgsC+T/nEEWtQtY0iawP/b98WEE70hMYTV0G68IxnSSskhQ7zgh72oyviB8StXSg
78ZQxmxdAAHB/RDQjxoH0TgYmwPz9MKTKFr/MES4EsA/bvtYElywOtMtPKYAXzGQ
2oRUKspI8Cai7x8XIEbnaeDTH0Z0xgNtgq+U733h3rf+06uHnm0vJNt/GLK71GKr
lpj26MdOYqKXsKDdwnRes2NlaYi6O4+KrkEsADeA2BAWFPW94jDH5PDCZm+b32AA
yBqpDcEuSIGacJUIhJIV4IaA2NyuaFz8oKIFxFI12cMOH21b/OUzl5yl63UPCU6Y
j4iLMZPnkUhPR4SZYOAwLfRVsYzpPNggvZx2PTXpkTW8JbPUxrkUeM8boTHb+TWz
+PUNWq4mMtm3SyplY6zEy9TGPquHi7G+4n0Ri46rEoICcGZnBD4noBYSNQFBMrE6
P2QWTY4lc4+94tkdTk0N0Yt8dzJWD0i9/4jB7/1fs7ij7zIFYX4MZOVWD9uS95jo
t0DlvZD87yiCsmSr9wh9+cwgbPfR9BYGWwAdWDMt9oWdYR7NFNDYPESe+8sWJalK
UYV7//9ZgI93T+MI+iU5sNaimdJaO27Hb0k/SWkRVTHN2LdTW/o=
=zV2r
-----END PGP SIGNATURE-----

--oQC2Ca/bkKVx49DL--