public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: Stefano Brivio <sbrivio@redhat.com>
To: passt-dev@passt.top
Cc: Max Chernoff <git@maxchernoff.ca>,
	David Gibson <david@gibson.dropbear.id.au>
Subject: [PATCH v3 04/10] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks
Date: Mon,  8 Dec 2025 08:20:17 +0100	[thread overview]
Message-ID: <20251208072024.3884137-5-sbrivio@redhat.com> (raw)
In-Reply-To: <20251208072024.3884137-1-sbrivio@redhat.com>

A fixed 10 ms ACK_INTERVAL timer value served us relatively well until
the previous change, because we would generally cause retransmissions
for non-local outbound transfers with relatively high (> 100 Mbps)
bandwidth and non-local but low (< 5 ms) RTT.

Now that retransmissions are less frequent, we don't have a proper
trigger to check for acknowledged bytes on the socket, and will
generally block the sender for a significant amount of time while
we could acknowledge more data, instead.

Store the RTT reported by the kernel using an approximation (exponent),
to keep flow storage size within two (typical) cachelines. Check for
socket updates when half of this time elapses: it should be a good
indication of the one-way delay we're interested in (peer to us).

Representable values are between 100 us and 3.2768 s, and any value
outside this range is clamped to these bounds. This choice appears
to be a good trade-off between additional overhead and throughput.

This mechanism partially overlaps with the "low RTT" destinations,
which we use to infer that a socket is connected to an endpoint to
the same machine (while possibly in a different namespace) if the
RTT is reported as 10 us or less.

This change doesn't, however, conflict with it: we are reading
TCP_INFO parameters for local connections anyway, so we can always
store the RTT approximation opportunistically.

Then, if the RTT is "low", we don't really need a timer to
acknowledge data as we'll always acknowledge everything to the
sender right away. However, we have limited space in the array where
we store addresses of local destination, so the low RTT property of a
connection might toggle frequently. Because of this, it's actually
helpful to always have the RTT approximation stored.

This could probably benefit from a future rework, though, introducing
a more integrated approach between these two mechanisms.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c      | 30 +++++++++++++++++++++++-------
 tcp_conn.h |  9 +++++++++
 util.c     | 14 ++++++++++++++
 util.h     |  1 +
 4 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/tcp.c b/tcp.c
index 28d3304..0167121 100644
--- a/tcp.c
+++ b/tcp.c
@@ -202,9 +202,13 @@
  * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
  *   either side, the connection is reset
  *
- * - ACK_INTERVAL elapsed after data segment received from tap without having
+ * - RTT / 2 elapsed after data segment received from tap without having
  *   sent an ACK segment, or zero-sized window advertised to tap/guest (flag
- *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent
+ *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent.
+ *
+ *   RTT, here, is an approximation of the RTT value reported by the kernel via
+ *   TCP_INFO, with a representable range from RTT_STORE_MIN (100 us) to
+ *   RTT_STORE_MAX (3276.8 ms). The timeout value is clamped accordingly.
  *
  *
  * Summary of data flows (with ESTABLISHED event)
@@ -341,7 +345,6 @@ enum {
 #define MSS_DEFAULT			536
 #define WINDOW_DEFAULT			14600		/* RFC 6928 */
 
-#define ACK_INTERVAL			10		/* ms */
 #define RTO_INIT			1		/* s, RFC 6298 */
 #define RTO_INIT_AFTER_SYN_RETRIES	3		/* s, RFC 6298 */
 #define FIN_TIMEOUT			60
@@ -593,7 +596,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 	}
 
 	if (conn->flags & ACK_TO_TAP_DUE) {
-		it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
+		it.it_value.tv_sec = RTT_GET(conn) / 2 / (1000 * 1000);
+		it.it_value.tv_nsec = RTT_GET(conn) / 2 % (1000 * 1000) * 1000;
 	} else if (conn->flags & ACK_FROM_TAP_DUE) {
 		int exp = conn->retries, timeout = RTO_INIT;
 		if (!(conn->events & ESTABLISHED))
@@ -608,9 +612,17 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 		it.it_value.tv_sec = ACT_TIMEOUT;
 	}
 
-	flow_dbg(conn, "timer expires in %llu.%03llus",
-		 (unsigned long long)it.it_value.tv_sec,
-		 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
+	if (conn->flags & ACK_TO_TAP_DUE) {
+		flow_trace(conn, "timer expires in %llu.%03llums",
+			   (unsigned long)it.it_value.tv_sec * 1000 +
+			   (unsigned long long)it.it_value.tv_nsec %
+					       (1000 * 1000),
+			   (unsigned long long)it.it_value.tv_nsec / 1000);
+	} else {
+		flow_dbg(conn, "timer expires in %llu.%03llus",
+			 (unsigned long long)it.it_value.tv_sec,
+			 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
+	}
 
 	if (timerfd_settime(conn->timer, 0, &it, NULL))
 		flow_perror(conn, "failed to set timer");
@@ -1144,6 +1156,10 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 		conn_flag(c, conn, ACK_TO_TAP_DUE);
 
 out:
+	/* Opportunistically store RTT approximation on valid TCP_INFO data */
+	if (tinfo)
+		RTT_SET(conn, tinfo->tcpi_rtt);
+
 	return new_wnd_to_tap       != prev_wnd_to_tap ||
 	       conn->seq_ack_to_tap != prev_ack_to_tap;
 }
diff --git a/tcp_conn.h b/tcp_conn.h
index e36910c..9c6ff9e 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -49,6 +49,15 @@ struct tcp_tap_conn {
 #define MSS_SET(conn, mss)	(conn->tap_mss = (mss >> (16 - TCP_MSS_BITS)))
 #define MSS_GET(conn)		(conn->tap_mss << (16 - TCP_MSS_BITS))
 
+#define RTT_EXP_BITS			4
+	unsigned int	rtt_exp		:RTT_EXP_BITS;
+#define RTT_EXP_MAX			MAX_FROM_BITS(RTT_EXP_BITS)
+#define RTT_STORE_MIN			100 /* us, minimum representable */
+#define RTT_STORE_MAX			((long)(RTT_STORE_MIN << RTT_EXP_MAX))
+#define RTT_SET(conn, rtt)						\
+	(conn->rtt_exp = MIN(RTT_EXP_MAX, ilog2(MAX(1, rtt / RTT_STORE_MIN))))
+#define RTT_GET(conn)			(RTT_STORE_MIN << conn->rtt_exp)
+
 	int		sock		:FD_REF_BITS;
 
 	uint8_t		events;
diff --git a/util.c b/util.c
index 2232a24..bfeb619 100644
--- a/util.c
+++ b/util.c
@@ -614,6 +614,9 @@ int __daemon(int pidfile_fd, int devnull_fd)
  * fls() - Find last (most significant) bit set in word
  * @x:		Word
  *
+ * Note: unlike ffs() and other implementations of fls(), notably the one from
+ * the Linux kernel, the starting position is 0 and not 1, that is, fls(1) = 0.
+ *
  * Return: position of most significant bit set, starting from 0, -1 if none
  */
 int fls(unsigned long x)
@@ -629,6 +632,17 @@ int fls(unsigned long x)
 	return y;
 }
 
+/**
+ * ilog2() - Integral part (floor) of binary logarithm (logarithm to the base 2)
+ * @x:		Argument
+ *
+ * Return: integral part of binary logarithm of @x, -1 if undefined (if @x is 0)
+ */
+int ilog2(unsigned long x)
+{
+	return fls(x);
+}
+
 /**
  * write_file() - Replace contents of file with a string
  * @path:	File to write
diff --git a/util.h b/util.h
index 744880b..f7a941f 100644
--- a/util.h
+++ b/util.h
@@ -233,6 +233,7 @@ int output_file_open(const char *path, int flags);
 void pidfile_write(int fd, pid_t pid);
 int __daemon(int pidfile_fd, int devnull_fd);
 int fls(unsigned long x);
+int ilog2(unsigned long x);
 int write_file(const char *path, const char *buf);
 intmax_t read_file_integer(const char *path, intmax_t fallback);
 int write_all_buf(int fd, const void *buf, size_t len);
-- 
2.43.0


  parent reply	other threads:[~2025-12-08  7:20 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-08  7:20 [PATCH v3 00/10] tcp: Fix throughput issues with non-local peers Stefano Brivio
2025-12-08  7:20 ` [PATCH v3 01/10] tcp, util: Add function for scaling to linearly interpolated factor, use it Stefano Brivio
2025-12-09  5:05   ` David Gibson
2025-12-08  7:20 ` [PATCH v3 02/10] tcp: Change usage factor of sending buffer in tcp_get_sndbuf() to 75% Stefano Brivio
2025-12-09  5:05   ` David Gibson
2025-12-08  7:20 ` [PATCH v3 03/10] tcp: Limit advertised window to available, not total sending buffer size Stefano Brivio
2025-12-08  7:20 ` Stefano Brivio [this message]
2025-12-09  5:10   ` [PATCH v3 04/10] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks David Gibson
2025-12-09 22:49     ` Stefano Brivio
2025-12-08  7:20 ` [PATCH v3 05/10] tcp: Don't clear ACK_TO_TAP_DUE if we're advertising a zero-sized window Stefano Brivio
2025-12-08  7:20 ` [PATCH v3 06/10] tcp: Acknowledge everything if it looks like bulk traffic, not interactive Stefano Brivio
2025-12-09  5:12   ` David Gibson
2025-12-08  7:20 ` [PATCH v3 07/10] tcp: Don't limit window to less-than-MSS values, use zero instead Stefano Brivio
2025-12-08  7:20 ` [PATCH v3 08/10] tcp: Allow exceeding the available sending buffer size in window advertisements Stefano Brivio
2025-12-08  8:14   ` Max Chernoff
2025-12-08  8:15   ` Max Chernoff
2025-12-08  8:27     ` Stefano Brivio
2025-12-08  7:20 ` [PATCH v3 09/10] tcp: Send a duplicate ACK also on complete sendmsg() failure Stefano Brivio
2025-12-08  7:20 ` [PATCH v3 10/10] tcp: Skip redundant ACK on partial " Stefano Brivio
2025-12-08  8:11 ` [PATCH v3 00/10] tcp: Fix throughput issues with non-local peers Max Chernoff
2025-12-08  8:25   ` Stefano Brivio
2025-12-08  8:51     ` Max Chernoff
2025-12-08  9:00       ` Stefano Brivio

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251208072024.3884137-5-sbrivio@redhat.com \
    --to=sbrivio@redhat.com \
    --cc=david@gibson.dropbear.id.au \
    --cc=git@maxchernoff.ca \
    --cc=passt-dev@passt.top \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).