On Thu, Dec 04, 2025 at 08:45:35AM +0100, Stefano Brivio wrote:
> A fixed 10 ms ACK_TIMEOUT timer value served us relatively well
> until

Nit: it's called "ACK_INTERVAL" in the code.

> the previous change, because we would generally cause retransmissions
> for non-local outbound transfers with relatively high (> 100 Mbps)
> bandwidth and non-local but low (< 5 ms) RTT.
> 
> Now that retransmissions are less frequent, we don't have a proper
> trigger to check for acknowledged bytes on the socket, and will
> generally block the sender for a significant amount of time while
> we could acknowledge more data, instead.
> 
> Store the RTT reported by the kernel using an approximation (exponent),
> to keep flow storage size within two (typical) cachelines. Check for
> socket updates when half of this time elapses: it should be a good
> indication of the one-way delay we're interested in (peer to us).

Reasoning based on a one-way delay doesn't quite make sense to me.  We
can't know when anything happens at the peer, and - obviously - we can
only set a timer starting at an event that occurs on our side.  So, I
think only RTT can matter to us, not one-way delay.

That said, using half the RTT estimate still makes sense to me: we
only have an approximation, and halving it gives us a pretty safe
lower bound.

> Representable values are between 100 us and 12.8 ms, and any value

Nit: I think Unicode is long enough supported you can use 盜

> outside this range is clamped to these bounds. This choice appears
> to be a good trade-off between additional overhead and throughput.
> 
> This mechanism partially overlaps with the "low RTT" destinations,
> which we use to infer that a socket is connected to an endpoint to
> the same machine (while possibly in a different namespace) if the
> RTT is reported as 10 us or less.
> 
> This change doesn't, however, conflict with it: we are reading
> TCP_INFO parameters for local connections anyway, so we can always
> store the RTT approximation opportunistically.
> 
> Then, if the RTT is "low", we don't really need a timer to
> acknowledge data as we'll always acknowledge everything to the
> sender right away. However, we have limited space in the array where
> we store addresses of local destination, so the low RTT property of a
> connection might toggle frequently. Because of this, it's actually
> helpful to always have the RTT approximation stored.
> 
> This could probably benefit from a future rework, though, introducing
> a more integrated approach between these two mechanisms.

Right, it feels like it should be possible to combine these
mechanisms, but figuring out exactly how isn't trivial.  Problem for
another day.

> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
>  tcp.c      | 29 ++++++++++++++++++++++-------
>  tcp_conn.h |  9 +++++++++
>  util.c     | 14 ++++++++++++++
>  util.h     |  1 +
>  4 files changed, 46 insertions(+), 7 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index 863ccdb..b00b874 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -202,9 +202,13 @@
>   * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
>   *   either side, the connection is reset
>   *
> - * - ACK_INTERVAL elapsed after data segment received from tap without having
> + * - RTT / 2 elapsed after data segment received from tap without having
>   *   sent an ACK segment, or zero-sized window advertised to tap/guest (flag
> - *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent
> + *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent.
> + *
> + *   RTT, here, is an approximation of the RTT value reported by the kernel via
> + *   TCP_INFO, with a representable range from RTT_STORE_MIN (100 us) to
> + *   RTT_STORE_MAX (12.8 ms). The timeout value is clamped accordingly.
>   *
>   *
>   * Summary of data flows (with ESTABLISHED event)
> @@ -341,7 +345,6 @@ enum {
>  #define MSS_DEFAULT			536
>  #define WINDOW_DEFAULT			14600		/* RFC 6928 */
>  
> -#define ACK_INTERVAL			10		/* ms */
>  #define RTO_INIT			1		/* s, RFC 6298 */
>  #define RTO_INIT_AFTER_SYN_RETRIES	3		/* s, RFC 6298 */
>  #define FIN_TIMEOUT			60
> @@ -594,7 +597,9 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>  	}
>  
>  	if (conn->flags & ACK_TO_TAP_DUE) {
> -		it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> +		it.it_value.tv_nsec = (long)RTT_GET(conn) * 1000 / 2;
> +		static_assert(RTT_STORE_MAX * 1000 / 2 < 1000 * 1000 * 1000,
> +			      ".tv_nsec is greater than 1000 * 1000 * 1000");
>  	} else if (conn->flags & ACK_FROM_TAP_DUE) {
>  		int exp = conn->retries, timeout = RTO_INIT;
>  		if (!(conn->events & ESTABLISHED))
> @@ -609,9 +614,15 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>  		it.it_value.tv_sec = ACT_TIMEOUT;
>  	}
>  
> -	flow_dbg(conn, "timer expires in %llu.%03llus",
> -		 (unsigned long long)it.it_value.tv_sec,
> -		 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
> +	if (conn->flags & ACK_TO_TAP_DUE) {
> +		flow_trace(conn, "timer expires in %lu.%01llums",
> +			   (unsigned long)it.it_value.tv_nsec / 1000 / 1000,
> +			   (unsigned long long)it.it_value.tv_nsec / 1000);
> +	} else {
> +		flow_dbg(conn, "timer expires in %llu.%03llus",
> +			 (unsigned long long)it.it_value.tv_sec,
> +			 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
> +	}

One branch is flow_trace(), one is flow_dbg() which doesn't seem
correct.  Also, basing the range indirectly on the flags, rather than
on the actual numbers in it.it_value seems fragile.  But... this seems
overly complex for a trace message anyway.  Maybe just use the seconds
formatting, but increase the resolution to 盜.

>  
>  	if (timerfd_settime(conn->timer, 0, &it, NULL))
>  		flow_perror(conn, "failed to set timer");
> @@ -1149,6 +1160,10 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
>  		conn_flag(c, conn, ACK_TO_TAP_DUE);
>  
>  out:
> +	/* Opportunistically store RTT approximation on valid TCP_INFO data */
> +	if (tinfo)
> +		RTT_SET(conn, tinfo->tcpi_rtt);
> +
>  	return new_wnd_to_tap       != prev_wnd_to_tap ||
>  	       conn->seq_ack_to_tap != prev_ack_to_tap;
>  }
> diff --git a/tcp_conn.h b/tcp_conn.h
> index e36910c..76034f6 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -49,6 +49,15 @@ struct tcp_tap_conn {
>  #define MSS_SET(conn, mss)	(conn->tap_mss = (mss >> (16 - TCP_MSS_BITS)))
>  #define MSS_GET(conn)		(conn->tap_mss << (16 - TCP_MSS_BITS))
>  
> +#define RTT_EXP_BITS			3
> +	unsigned int	rtt_exp		:RTT_EXP_BITS;
> +#define RTT_EXP_MAX			MAX_FROM_BITS(RTT_EXP_BITS)
> +#define RTT_STORE_MIN			100 /* us, minimum representable */
> +#define RTT_STORE_MAX			(RTT_STORE_MIN << RTT_EXP_MAX)
> +#define RTT_SET(conn, rtt)						\
> +	(conn->rtt_exp = MIN(RTT_EXP_MAX, ilog2(MAX(1, rtt / RTT_STORE_MIN))))
> +#define RTT_GET(conn)			(RTT_STORE_MIN << conn->rtt_exp)
> +
>  	int		sock		:FD_REF_BITS;
>  
>  	uint8_t		events;
> diff --git a/util.c b/util.c
> index 4beb7c2..590373d 100644
> --- a/util.c
> +++ b/util.c
> @@ -611,6 +611,9 @@ int __daemon(int pidfile_fd, int devnull_fd)
>   * fls() - Find last (most significant) bit set in word
>   * @x:		Word
>   *
> + * Note: unlike ffs() and other implementations of fls(), notably the one from
> + * the Linux kernel, the starting position is 0 and not 1, that is, fls(1) = 0.
> + *
>   * Return: position of most significant bit set, starting from 0, -1 if none
>   */
>  int fls(unsigned long x)
> @@ -626,6 +629,17 @@ int fls(unsigned long x)
>  	return y;
>  }
>  
> +/**
> + * ilog2() - Integral part (floor) of binary logarithm (logarithm to the base 2)
> + * @x:		Argument
> + *
> + * Return: integral part of binary logarithm of @x, -1 if undefined (if @x is 0)
> + */
> +int ilog2(unsigned long x)
> +{
> +	return fls(x);
> +}
> +
>  /**
>   * write_file() - Replace contents of file with a string
>   * @path:	File to write
> diff --git a/util.h b/util.h
> index 7bf0701..40de694 100644
> --- a/util.h
> +++ b/util.h
> @@ -230,6 +230,7 @@ int output_file_open(const char *path, int flags);
>  void pidfile_write(int fd, pid_t pid);
>  int __daemon(int pidfile_fd, int devnull_fd);
>  int fls(unsigned long x);
> +int ilog2(unsigned long x);
>  int write_file(const char *path, const char *buf);
>  intmax_t read_file_integer(const char *path, intmax_t fallback);
>  int write_all_buf(int fd, const void *buf, size_t len);
> -- 
> 2.43.0
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson