From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: passt.top; dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=CeHN8eYu; dkim-atps=neutral Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by passt.top (Postfix) with ESMTPS id 2487E5A0619 for ; Fri, 05 Dec 2025 02:20:15 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1764897613; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=IkoXvZsRT1HKuTvRaKxAQV5PPaL8jxEaBtzmVT8lLEo=; b=CeHN8eYuR2LCIy3yxErG1c4lqewJrT13D3hUmMbehk7mUxin3XTkV3LdfIW9Bbt/2iU7+L 66EbR4Tp38O/RTSIaHurNBPcDh9QPO3zVHnqegSISeAzxnCBKYVBRgdby306a4vp7z0I31 WN7mEr+d4LnJKfxdbuE7P6c36i5PUU0= Received: from mail-wm1-f72.google.com (mail-wm1-f72.google.com [209.85.128.72]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-107-xpbDCVJpNiyOFtGqbaekOA-1; Thu, 04 Dec 2025 20:20:12 -0500 X-MC-Unique: xpbDCVJpNiyOFtGqbaekOA-1 X-Mimecast-MFC-AGG-ID: xpbDCVJpNiyOFtGqbaekOA_1764897611 Received: by mail-wm1-f72.google.com with SMTP id 5b1f17b1804b1-4779981523fso11263385e9.2 for ; Thu, 04 Dec 2025 17:20:12 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764897611; x=1765502411; h=content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:from:date:x-gm-gg :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=bN8wiExVyyT6UWaHHm630D2IUlLyTOdhtQhCRD4TksY=; b=qdoKg4DLKwV7NDcYHNUX+oqV0fWx0LGKkgXrL/8azqhNb01b3QE0wJ6YQ1uglCeaOW lbvLZ3sgW6vyNriuJcYz7XhHFXFtnGyTRNauDr55W6qptOmD9L6UUQzw+gC3ErlfEM0P 5d2jAsR3PqOLRyGCGqRk2uofDhHCaZrN1Tw63om7OYE9ONYBsTE6XyltLbnnrX6znIS3 uVd9v5f8QPnjf1M1UYa1caJmGae/NbZn07vtHOPlYhTIpYmVq0SXn5k+koEf5Yu12OuT VE3tgDHcjB/EiDKfzIEDtptX+YRjKzvR8uuLPrKfwU8p5nbbeu02/D+uk29pLRbozfzM KrJQ== X-Gm-Message-State: AOJu0Yx0pqib1QfNcfpBTxJRrDUf5AFjfV5oGnaIUdsvNagQtfdwxDya KWUINQ4LrQoHE+vThYeLQmeO6lW9xCUOxixAM6zionj2ce635BLIxUDNEBVBPGMa8JuITkLcuYf 1MbPREgyA5315rf02SlJzrgvteR+d9ugwAYO3rSRSXxDq/VtI80kFlQ== X-Gm-Gg: ASbGnctIhQKe91ZahTg19etPIK8nQcWg3HIFtqErU/SCAPpvLmlbpNUFS32Q7wHQHDh 9OMjxmbn4FPNlTm0QP1lNtDy+kSrnwgSPxtDheIY+FYjnhqhT8SihRd9tWwLLXTVD1wNJ8ExFto 4YadwZkzlOMIVaiJ97MjKOb7yzG2pJ4x3bQ785009tiP9ZdtrSvvz5MCf8j9sJ3BAFWq4zoWHZn qlp7r7Oc0qUqplhvLKbnJ8dQsVsGTlt8TU8ZJL7tZM4TkkYzsf6fOdpRasacYnl0wtRcKrK1P7w NNlrVKxdqodcfm7bq4tuEIXYKvYr3vk5cJa31lax0yxUvAbPdV0LjQMeY5FPz2TQEhFnAAAp4zZ aQBexSAjIyXZIpjnd4Ki1 X-Received: by 2002:a05:600c:4e89:b0:477:b642:9dbf with SMTP id 5b1f17b1804b1-4792af48329mr84748505e9.32.1764897611180; Thu, 04 Dec 2025 17:20:11 -0800 (PST) X-Google-Smtp-Source: AGHT+IEg47WpjX9Kc/6rnDHow0Ia27nTQEVXJKQ5XnojwoWpH1R/30KZ7O4FbvhrR1Sy4NuypD1gYA== X-Received: by 2002:a05:600c:4e89:b0:477:b642:9dbf with SMTP id 5b1f17b1804b1-4792af48329mr84748355e9.32.1764897610592; Thu, 04 Dec 2025 17:20:10 -0800 (PST) Received: from maya.myfinge.rs (ifcgrfdd.trafficplex.cloud. [2a10:fc81:a806:d6a9::1]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-47930920e0dsm62110265e9.2.2025.12.04.17.20.09 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 04 Dec 2025 17:20:09 -0800 (PST) Date: Fri, 5 Dec 2025 02:20:08 +0100 From: Stefano Brivio To: David Gibson Subject: Re: [PATCH 2/8] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks Message-ID: <20251205022008.70e1195c@elisabeth> In-Reply-To: References: <20251204074542.2156548-1-sbrivio@redhat.com> <20251204074542.2156548-3-sbrivio@redhat.com> Organization: Red Hat X-Mailer: Claws Mail 4.2.0 (GTK 3.24.49; x86_64-pc-linux-gnu) MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: 6ppgoldlHqycJtKl0EyD_5AcDJJu7Kuxn8UacUD4KRU_1764897611 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Message-ID-Hash: V3M5NX7CYT3BMSDLQPL7LTCPWEPIXVYH X-Message-ID-Hash: V3M5NX7CYT3BMSDLQPL7LTCPWEPIXVYH X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, Max Chernoff X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Fri, 5 Dec 2025 10:48:20 +1100 David Gibson wrote: > On Thu, Dec 04, 2025 at 08:45:35AM +0100, Stefano Brivio wrote: > > A fixed 10 ms ACK_TIMEOUT timer value served us relatively well > > until =20 >=20 > Nit: it's called "ACK_INTERVAL" in the code. Oops. I'll change this. > > the previous change, because we would generally cause retransmissions > > for non-local outbound transfers with relatively high (> 100 Mbps) > > bandwidth and non-local but low (< 5 ms) RTT. > >=20 > > Now that retransmissions are less frequent, we don't have a proper > > trigger to check for acknowledged bytes on the socket, and will > > generally block the sender for a significant amount of time while > > we could acknowledge more data, instead. > >=20 > > Store the RTT reported by the kernel using an approximation (exponent), > > to keep flow storage size within two (typical) cachelines. Check for > > socket updates when half of this time elapses: it should be a good > > indication of the one-way delay we're interested in (peer to us). =20 >=20 > Reasoning based on a one-way delay doesn't quite make sense to me. We > can't know when anything happens at the peer, and - obviously - we can > only set a timer starting at an event that occurs on our side. So, I > think only RTT can matter to us, not one-way delay. ...except that we might be scheduling the timer at any point *after* we sent data, so the outbound delay might be partially elapsed, and the one-way (receiving) delay is actually (more) relevant. If we had instantaneous receiving of ACK segments, we would need to probe much more frequently than the RTT, to monitor the actual progress more accurately. Note that transmission rate (including forwarding delays) is not constant and might be bursty. But yes, in general it's not much more relevant than the RTT. I could drop this part of the commit message. > That said, using half the RTT estimate still makes sense to me: we > only have an approximation, and halving it gives us a pretty safe > lower bound. In any case, yes. > > Representable values are between 100 us and 12.8 ms, and any value =20 >=20 > Nit: I think Unicode is long enough supported you can use =C2=B5s I prefer to avoid in the code if possible because one might not have Unicode support in all the relevant environments with all the relevant consoles (I just finished debugging stuff on Alpine...), and at that point I'd rather have consistent commit messages. > > outside this range is clamped to these bounds. This choice appears > > to be a good trade-off between additional overhead and throughput. > >=20 > > This mechanism partially overlaps with the "low RTT" destinations, > > which we use to infer that a socket is connected to an endpoint to > > the same machine (while possibly in a different namespace) if the > > RTT is reported as 10 us or less. > >=20 > > This change doesn't, however, conflict with it: we are reading > > TCP_INFO parameters for local connections anyway, so we can always > > store the RTT approximation opportunistically. > >=20 > > Then, if the RTT is "low", we don't really need a timer to > > acknowledge data as we'll always acknowledge everything to the > > sender right away. However, we have limited space in the array where > > we store addresses of local destination, so the low RTT property of a > > connection might toggle frequently. Because of this, it's actually > > helpful to always have the RTT approximation stored. > >=20 > > This could probably benefit from a future rework, though, introducing > > a more integrated approach between these two mechanisms. =20 >=20 > Right, it feels like it should be possible to combine these > mechanisms, but figuring out exactly how isn't trivial. Problem for > another day. >=20 > >=20 > > Signed-off-by: Stefano Brivio > > --- > > tcp.c | 29 ++++++++++++++++++++++------- > > tcp_conn.h | 9 +++++++++ > > util.c | 14 ++++++++++++++ > > util.h | 1 + > > 4 files changed, 46 insertions(+), 7 deletions(-) > >=20 > > diff --git a/tcp.c b/tcp.c > > index 863ccdb..b00b874 100644 > > --- a/tcp.c > > +++ b/tcp.c > > @@ -202,9 +202,13 @@ > > * - ACT_TIMEOUT, in the presence of any event: if no activity is dete= cted on > > * either side, the connection is reset > > * > > - * - ACK_INTERVAL elapsed after data segment received from tap without= having > > + * - RTT / 2 elapsed after data segment received from tap without havi= ng > > * sent an ACK segment, or zero-sized window advertised to tap/guest= (flag > > - * ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent > > + * ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent. > > + * > > + * RTT, here, is an approximation of the RTT value reported by the k= ernel via > > + * TCP_INFO, with a representable range from RTT_STORE_MIN (100 us) = to > > + * RTT_STORE_MAX (12.8 ms). The timeout value is clamped accordingly= . > > * > > * > > * Summary of data flows (with ESTABLISHED event) > > @@ -341,7 +345,6 @@ enum { > > #define MSS_DEFAULT=09=09=09536 > > #define WINDOW_DEFAULT=09=09=0914600=09=09/* RFC 6928 */ > > =20 > > -#define ACK_INTERVAL=09=09=0910=09=09/* ms */ > > #define RTO_INIT=09=09=091=09=09/* s, RFC 6298 */ > > #define RTO_INIT_AFTER_SYN_RETRIES=093=09=09/* s, RFC 6298 */ > > #define FIN_TIMEOUT=09=09=0960 > > @@ -594,7 +597,9 @@ static void tcp_timer_ctl(const struct ctx *c, stru= ct tcp_tap_conn *conn) > > =09} > > =20 > > =09if (conn->flags & ACK_TO_TAP_DUE) { > > -=09=09it.it_value.tv_nsec =3D (long)ACK_INTERVAL * 1000 * 1000; > > +=09=09it.it_value.tv_nsec =3D (long)RTT_GET(conn) * 1000 / 2; > > +=09=09static_assert(RTT_STORE_MAX * 1000 / 2 < 1000 * 1000 * 1000, > > +=09=09=09 ".tv_nsec is greater than 1000 * 1000 * 1000"); > > =09} else if (conn->flags & ACK_FROM_TAP_DUE) { > > =09=09int exp =3D conn->retries, timeout =3D RTO_INIT; > > =09=09if (!(conn->events & ESTABLISHED)) > > @@ -609,9 +614,15 @@ static void tcp_timer_ctl(const struct ctx *c, str= uct tcp_tap_conn *conn) > > =09=09it.it_value.tv_sec =3D ACT_TIMEOUT; > > =09} > > =20 > > -=09flow_dbg(conn, "timer expires in %llu.%03llus", > > -=09=09 (unsigned long long)it.it_value.tv_sec, > > -=09=09 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000); > > +=09if (conn->flags & ACK_TO_TAP_DUE) { > > +=09=09flow_trace(conn, "timer expires in %lu.%01llums", > > +=09=09=09 (unsigned long)it.it_value.tv_nsec / 1000 / 1000, > > +=09=09=09 (unsigned long long)it.it_value.tv_nsec / 1000); > > +=09} else { > > +=09=09flow_dbg(conn, "timer expires in %llu.%03llus", > > +=09=09=09 (unsigned long long)it.it_value.tv_sec, > > +=09=09=09 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000); > > +=09} =20 >=20 > One branch is flow_trace(), one is flow_dbg() which doesn't seem > correct. No, it's intended, it's actually the main reason why I'm changing this part. Now that we have more frequent timer scheduling on ACK_TO_TAP_DUE, the debug logs become unusable if you're trying to debug anything that's not related to a specific data transfer. > Also, basing the range indirectly on the flags, rather than > on the actual numbers in it.it_value seems fragile. Flags tell us why we're scheduling a specific timer, and it's only on ACK_TO_TAP_DUE that we want to have more fine-grained values. > But... this seems > overly complex for a trace message anyway. Maybe just use the seconds > formatting, but increase the resolution to =C2=B5s. I tried a number of different combinations like that, they are all rather inconvenient. > > =20 > > =09if (timerfd_settime(conn->timer, 0, &it, NULL)) > > =09=09flow_perror(conn, "failed to set timer"); > > @@ -1149,6 +1160,10 @@ int tcp_update_seqack_wnd(const struct ctx *c, s= truct tcp_tap_conn *conn, > > =09=09conn_flag(c, conn, ACK_TO_TAP_DUE); > > =20 > > out: > > +=09/* Opportunistically store RTT approximation on valid TCP_INFO data= */ > > +=09if (tinfo) > > +=09=09RTT_SET(conn, tinfo->tcpi_rtt); > > + > > =09return new_wnd_to_tap !=3D prev_wnd_to_tap || > > =09 conn->seq_ack_to_tap !=3D prev_ack_to_tap; > > } > > diff --git a/tcp_conn.h b/tcp_conn.h > > index e36910c..76034f6 100644 > > --- a/tcp_conn.h > > +++ b/tcp_conn.h > > @@ -49,6 +49,15 @@ struct tcp_tap_conn { > > #define MSS_SET(conn, mss)=09(conn->tap_mss =3D (mss >> (16 - TCP_MSS_= BITS))) > > #define MSS_GET(conn)=09=09(conn->tap_mss << (16 - TCP_MSS_BITS)) > > =20 > > +#define RTT_EXP_BITS=09=09=093 > > +=09unsigned int=09rtt_exp=09=09:RTT_EXP_BITS; > > +#define RTT_EXP_MAX=09=09=09MAX_FROM_BITS(RTT_EXP_BITS) > > +#define RTT_STORE_MIN=09=09=09100 /* us, minimum representable */ > > +#define RTT_STORE_MAX=09=09=09(RTT_STORE_MIN << RTT_EXP_MAX) > > +#define RTT_SET(conn, rtt)=09=09=09=09=09=09\ > > +=09(conn->rtt_exp =3D MIN(RTT_EXP_MAX, ilog2(MAX(1, rtt / RTT_STORE_MI= N)))) > > +#define RTT_GET(conn)=09=09=09(RTT_STORE_MIN << conn->rtt_exp) > > + > > =09int=09=09sock=09=09:FD_REF_BITS; > > =20 > > =09uint8_t=09=09events; > > diff --git a/util.c b/util.c > > index 4beb7c2..590373d 100644 > > --- a/util.c > > +++ b/util.c > > @@ -611,6 +611,9 @@ int __daemon(int pidfile_fd, int devnull_fd) > > * fls() - Find last (most significant) bit set in word > > * @x:=09=09Word > > * > > + * Note: unlike ffs() and other implementations of fls(), notably the = one from > > + * the Linux kernel, the starting position is 0 and not 1, that is, fl= s(1) =3D 0. > > + * > > * Return: position of most significant bit set, starting from 0, -1 i= f none > > */ > > int fls(unsigned long x) > > @@ -626,6 +629,17 @@ int fls(unsigned long x) > > =09return y; > > } > > =20 > > +/** > > + * ilog2() - Integral part (floor) of binary logarithm (logarithm to t= he base 2) > > + * @x:=09=09Argument > > + * > > + * Return: integral part of binary logarithm of @x, -1 if undefined (i= f @x is 0) > > + */ > > +int ilog2(unsigned long x) > > +{ > > +=09return fls(x); > > +} > > + > > /** > > * write_file() - Replace contents of file with a string > > * @path:=09File to write > > diff --git a/util.h b/util.h > > index 7bf0701..40de694 100644 > > --- a/util.h > > +++ b/util.h > > @@ -230,6 +230,7 @@ int output_file_open(const char *path, int flags); > > void pidfile_write(int fd, pid_t pid); > > int __daemon(int pidfile_fd, int devnull_fd); > > int fls(unsigned long x); > > +int ilog2(unsigned long x); > > int write_file(const char *path, const char *buf); > > intmax_t read_file_integer(const char *path, intmax_t fallback); > > int write_all_buf(int fd, const void *buf, size_t len); > > --=20 > > 2.43.0 --=20 Stefano