public inbox for passt-dev@passt.top
* [PATCH 0/8] tcp: Fix throughput issues with non-local peers
@ 2025-12-04  7:45 Stefano Brivio
  2025-12-04  7:45 ` [PATCH 1/8] tcp: Limit advertised window to available, not total sending buffer size Stefano Brivio
                   ` (7 more replies)
  0 siblings, 8 replies; 25+ messages in thread
From: Stefano Brivio @ 2025-12-04  7:45 UTC (permalink / raw)
  To: passt-dev; +Cc: Max Chernoff, David Gibson

Patch 1/8 is the most relevant fix here, as we currently advertise a
window that might be too big for what we can write to the socket,
causing retransmissions right away and occasional high latency on
short transfers to non-local peers.

Mostly as a consequence of fixing that, we now need several
improvements and small fixes, including, most notably, an adaptive
approach to pick the interval between checks for socket-side ACKs
(patch 2/8), and several tricks to reliably trigger TCP buffer size
auto-tuning as implemented by the Linux kernel (patches 4/8 and 6/8).

These changes make some existing issues more prominent; the remaining
patches fix them.

With this series, I'm getting the expected (wirespeed) throughput for
transfers between peers with varying non-local RTTs: I checked
different guests bridged on the same machine (~600 us) and hosts with
increasing distance (approximately 100 to 600 km, ~4 to ~35 ms), using
iperf3 as well as HTTP transfers.

For short transfers, we strictly stick to the available sending buffer
size, which (almost) guarantees that we avoid local retransmissions
and significantly decreases transfer time as a result: from 1.2 s to
60 ms for a 5 MB HTTP transfer from a container hosted in a virtual
machine to another guest.

Stefano Brivio (8):
  tcp: Limit advertised window to available, not total sending buffer
    size
  tcp: Adaptive interval based on RTT for socket-side acknowledgement
    checks
  tcp: Don't clear ACK_TO_TAP_DUE if we're advertising a zero-sized
    window
  tcp: Acknowledge everything if sending buffer is less than SNDBUF_BIG
  tcp: Don't limit window to less-than-MSS values, use zero instead
  tcp: Allow exceeding the available sending buffer size in window
    advertisements
  tcp: Send a duplicate ACK also on complete sendmsg() failure
  tcp: Skip redundant ACK on partial sendmsg() failure

 README.md  |  2 +-
 tcp.c      | 85 ++++++++++++++++++++++++++++++++++++++++++------------
 tcp_conn.h |  9 ++++++
 util.c     | 14 +++++++++
 util.h     |  1 +
 5 files changed, 92 insertions(+), 19 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/8] tcp: Limit advertised window to available, not total sending buffer size
  2025-12-04  7:45 [PATCH 0/8] tcp: Fix throughput issues with non-local peers Stefano Brivio
@ 2025-12-04  7:45 ` Stefano Brivio
  2025-12-04 23:10   ` David Gibson
  2025-12-04  7:45 ` [PATCH 2/8] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks Stefano Brivio
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Stefano Brivio @ 2025-12-04  7:45 UTC (permalink / raw)
  To: passt-dev; +Cc: Max Chernoff, David Gibson

For non-local connections, we advertise the same window size as what
the peer in turn advertises to us, and limit it to the buffer size
reported via SO_SNDBUF.

That's not quite correct: in order to later avoid failures while
queueing data to the socket, we need to limit the window to the
available buffer size, not the total one.

Use the SIOCOUTQ ioctl and subtract the number of outbound queued
bytes from the total buffer size, then clamp to this value.
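
As an illustrative sketch (the function name and the split into a
helper are hypothetical, not the actual passt code), the limit
computation from SO_SNDBUF and SIOCOUTQ might look like:

```c
#include <stdint.h>

/* Sketch: compute the window we can safely advertise, given the total
 * sending buffer size (reported via SO_SNDBUF) and the number of bytes
 * already queued outbound (reported by the SIOCOUTQ ioctl).
 */
static int avail_wnd_limit(int sndbuf, uint32_t sendq)
{
	/* The queue can exceed SO_SNDBUF (e.g. under memory pressure):
	 * in that case, advertise nothing at all.
	 */
	if ((int)sendq > sndbuf)
		return 0;

	return sndbuf - (int)sendq;
}
```

The advertised window is then the minimum of this limit and the
peer-reported tcpi_snd_wnd, as in the hunk below.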

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 README.md |  2 +-
 tcp.c     | 18 ++++++++++++++++--
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 897ae8b..8fdc0a3 100644
--- a/README.md
+++ b/README.md
@@ -291,7 +291,7 @@ speeding up local connections, and usually requiring NAT. _pasta_:
 * ✅ all capabilities dropped, other than `CAP_NET_BIND_SERVICE` (if granted)
 * ✅ with default options, user, mount, IPC, UTS, PID namespaces are detached
 * ✅ no external dependencies (other than a standard C library)
-* ✅ restrictive seccomp profiles (33 syscalls allowed for _passt_, 43 for
+* ✅ restrictive seccomp profiles (34 syscalls allowed for _passt_, 43 for
   _pasta_ on x86_64)
 * ✅ examples of [AppArmor](/passt/tree/contrib/apparmor) and
   [SELinux](/passt/tree/contrib/selinux) profiles available
diff --git a/tcp.c b/tcp.c
index fa95f6b..863ccdb 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1031,6 +1031,8 @@ void tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
  * @tinfo:	tcp_info from kernel, can be NULL if not pre-fetched
  *
  * Return: 1 if sequence or window were updated, 0 otherwise
+ *
+ * #syscalls ioctl
  */
 int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 			  bool force_seq, struct tcp_info_linux *tinfo)
@@ -1113,9 +1115,21 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 	if ((conn->flags & LOCAL) || tcp_rtt_dst_low(conn)) {
 		new_wnd_to_tap = tinfo->tcpi_snd_wnd;
 	} else {
+		uint32_t sendq;
+		int limit;
+
+		if (ioctl(s, SIOCOUTQ, &sendq)) {
+			debug_perror("SIOCOUTQ on socket %i, assuming 0", s);
+			sendq = 0;
+		}
 		tcp_get_sndbuf(conn);
-		new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd,
-				     SNDBUF_GET(conn));
+
+		if ((int)sendq > SNDBUF_GET(conn)) /* Due to memory pressure? */
+			limit = 0;
+		else
+			limit = SNDBUF_GET(conn) - (int)sendq;
+
+		new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, limit);
 	}
 
 	new_wnd_to_tap = MIN(new_wnd_to_tap, MAX_WINDOW);
-- 
2.43.0



* [PATCH 2/8] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks
  2025-12-04  7:45 [PATCH 0/8] tcp: Fix throughput issues with non-local peers Stefano Brivio
  2025-12-04  7:45 ` [PATCH 1/8] tcp: Limit advertised window to available, not total sending buffer size Stefano Brivio
@ 2025-12-04  7:45 ` Stefano Brivio
  2025-12-04 23:48   ` David Gibson
  2025-12-04  7:45 ` [PATCH 3/8] tcp: Don't clear ACK_TO_TAP_DUE if we're advertising a zero-sized window Stefano Brivio
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Stefano Brivio @ 2025-12-04  7:45 UTC (permalink / raw)
  To: passt-dev; +Cc: Max Chernoff, David Gibson

A fixed 10 ms ACK_TIMEOUT timer value served us relatively well until
the previous change, because we would generally cause retransmissions
for non-local outbound transfers with relatively high (> 100 Mbps)
bandwidth and non-local but low (< 5 ms) RTT.

Now that retransmissions are less frequent, we don't have a proper
trigger to check for acknowledged bytes on the socket, and will
generally block the sender for a significant amount of time while
we could acknowledge more data, instead.

Store the RTT reported by the kernel using an approximation (exponent),
to keep flow storage size within two (typical) cachelines. Check for
socket updates when half of this time elapses: it should be a good
indication of the one-way delay we're interested in (peer to us).

Representable values are between 100 us and 12.8 ms, and any value
outside this range is clamped to these bounds. This choice appears
to be a good trade-off between additional overhead and throughput.

This mechanism partially overlaps with the "low RTT" destinations,
which we use to infer that a socket is connected to an endpoint on
the same machine (though possibly in a different namespace) if the
RTT is reported as 10 us or less.

This change doesn't, however, conflict with it: we are reading
TCP_INFO parameters for local connections anyway, so we can always
store the RTT approximation opportunistically.

Then, if the RTT is "low", we don't really need a timer to
acknowledge data as we'll always acknowledge everything to the
sender right away. However, we have limited space in the array where
we store addresses of local destinations, so the low RTT property of a
connection might toggle frequently. Because of this, it's actually
helpful to always have the RTT approximation stored.

This could probably benefit from a future rework, though, introducing
a more integrated approach between these two mechanisms.
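
As a sketch of the approximation (helper names here are illustrative;
the actual code uses the RTT_SET()/RTT_GET() macros and the fls()-based
ilog2() introduced by the diff below):

```c
/* Sketch of the 3-bit exponential RTT storage: values are rounded down
 * to RTT_STORE_MIN << exp, with exp in [0..7], that is, a representable
 * range of 100 us to 12.8 ms.
 */
#define RTT_EXP_MAX	7	/* MAX_FROM_BITS(3) */
#define RTT_STORE_MIN	100	/* us */

/* floor(log2(x)); returns -1 for x == 0, like the fls()-based ilog2() */
static int ilog2(unsigned long x)
{
	int y = -1;

	while (x) {
		x >>= 1;
		y++;
	}
	return y;
}

/* Encode an RTT in microseconds as a 3-bit exponent, clamping to the
 * representable range.
 */
static unsigned int rtt_encode(unsigned long rtt_us)
{
	unsigned long q = rtt_us / RTT_STORE_MIN;
	int e = ilog2(q < 1 ? 1 : q);

	return e > RTT_EXP_MAX ? RTT_EXP_MAX : (unsigned int)e;
}

/* Decode the stored exponent back to an RTT approximation, in us */
static unsigned long rtt_decode(unsigned int exp)
{
	return (unsigned long)RTT_STORE_MIN << exp;
}
```

The ACK check timer is then armed at half of the decoded value.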

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c      | 29 ++++++++++++++++++++++-------
 tcp_conn.h |  9 +++++++++
 util.c     | 14 ++++++++++++++
 util.h     |  1 +
 4 files changed, 46 insertions(+), 7 deletions(-)

diff --git a/tcp.c b/tcp.c
index 863ccdb..b00b874 100644
--- a/tcp.c
+++ b/tcp.c
@@ -202,9 +202,13 @@
  * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
  *   either side, the connection is reset
  *
- * - ACK_INTERVAL elapsed after data segment received from tap without having
+ * - RTT / 2 elapsed after data segment received from tap without having
  *   sent an ACK segment, or zero-sized window advertised to tap/guest (flag
- *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent
+ *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent.
+ *
+ *   RTT, here, is an approximation of the RTT value reported by the kernel via
+ *   TCP_INFO, with a representable range from RTT_STORE_MIN (100 us) to
+ *   RTT_STORE_MAX (12.8 ms). The timeout value is clamped accordingly.
  *
  *
  * Summary of data flows (with ESTABLISHED event)
@@ -341,7 +345,6 @@ enum {
 #define MSS_DEFAULT			536
 #define WINDOW_DEFAULT			14600		/* RFC 6928 */
 
-#define ACK_INTERVAL			10		/* ms */
 #define RTO_INIT			1		/* s, RFC 6298 */
 #define RTO_INIT_AFTER_SYN_RETRIES	3		/* s, RFC 6298 */
 #define FIN_TIMEOUT			60
@@ -594,7 +597,9 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 	}
 
 	if (conn->flags & ACK_TO_TAP_DUE) {
-		it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
+		it.it_value.tv_nsec = (long)RTT_GET(conn) * 1000 / 2;
+		static_assert(RTT_STORE_MAX * 1000 / 2 < 1000 * 1000 * 1000,
+			      ".tv_nsec is greater than 1000 * 1000 * 1000");
 	} else if (conn->flags & ACK_FROM_TAP_DUE) {
 		int exp = conn->retries, timeout = RTO_INIT;
 		if (!(conn->events & ESTABLISHED))
@@ -609,9 +614,15 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 		it.it_value.tv_sec = ACT_TIMEOUT;
 	}
 
-	flow_dbg(conn, "timer expires in %llu.%03llus",
-		 (unsigned long long)it.it_value.tv_sec,
-		 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
+	if (conn->flags & ACK_TO_TAP_DUE) {
+		flow_trace(conn, "timer expires in %lu.%01llums",
+			   (unsigned long)it.it_value.tv_nsec / 1000 / 1000,
+			   (unsigned long long)it.it_value.tv_nsec / 1000);
+	} else {
+		flow_dbg(conn, "timer expires in %llu.%03llus",
+			 (unsigned long long)it.it_value.tv_sec,
+			 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
+	}
 
 	if (timerfd_settime(conn->timer, 0, &it, NULL))
 		flow_perror(conn, "failed to set timer");
@@ -1149,6 +1160,10 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 		conn_flag(c, conn, ACK_TO_TAP_DUE);
 
 out:
+	/* Opportunistically store RTT approximation on valid TCP_INFO data */
+	if (tinfo)
+		RTT_SET(conn, tinfo->tcpi_rtt);
+
 	return new_wnd_to_tap       != prev_wnd_to_tap ||
 	       conn->seq_ack_to_tap != prev_ack_to_tap;
 }
diff --git a/tcp_conn.h b/tcp_conn.h
index e36910c..76034f6 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -49,6 +49,15 @@ struct tcp_tap_conn {
 #define MSS_SET(conn, mss)	(conn->tap_mss = (mss >> (16 - TCP_MSS_BITS)))
 #define MSS_GET(conn)		(conn->tap_mss << (16 - TCP_MSS_BITS))
 
+#define RTT_EXP_BITS			3
+	unsigned int	rtt_exp		:RTT_EXP_BITS;
+#define RTT_EXP_MAX			MAX_FROM_BITS(RTT_EXP_BITS)
+#define RTT_STORE_MIN			100 /* us, minimum representable */
+#define RTT_STORE_MAX			(RTT_STORE_MIN << RTT_EXP_MAX)
+#define RTT_SET(conn, rtt)						\
+	(conn->rtt_exp = MIN(RTT_EXP_MAX, ilog2(MAX(1, rtt / RTT_STORE_MIN))))
+#define RTT_GET(conn)			(RTT_STORE_MIN << conn->rtt_exp)
+
 	int		sock		:FD_REF_BITS;
 
 	uint8_t		events;
diff --git a/util.c b/util.c
index 4beb7c2..590373d 100644
--- a/util.c
+++ b/util.c
@@ -611,6 +611,9 @@ int __daemon(int pidfile_fd, int devnull_fd)
  * fls() - Find last (most significant) bit set in word
  * @x:		Word
  *
+ * Note: unlike ffs() and other implementations of fls(), notably the one from
+ * the Linux kernel, the starting position is 0 and not 1, that is, fls(1) = 0.
+ *
  * Return: position of most significant bit set, starting from 0, -1 if none
  */
 int fls(unsigned long x)
@@ -626,6 +629,17 @@ int fls(unsigned long x)
 	return y;
 }
 
+/**
+ * ilog2() - Integral part (floor) of binary logarithm (logarithm to the base 2)
+ * @x:		Argument
+ *
+ * Return: integral part of binary logarithm of @x, -1 if undefined (if @x is 0)
+ */
+int ilog2(unsigned long x)
+{
+	return fls(x);
+}
+
 /**
  * write_file() - Replace contents of file with a string
  * @path:	File to write
diff --git a/util.h b/util.h
index 7bf0701..40de694 100644
--- a/util.h
+++ b/util.h
@@ -230,6 +230,7 @@ int output_file_open(const char *path, int flags);
 void pidfile_write(int fd, pid_t pid);
 int __daemon(int pidfile_fd, int devnull_fd);
 int fls(unsigned long x);
+int ilog2(unsigned long x);
 int write_file(const char *path, const char *buf);
 intmax_t read_file_integer(const char *path, intmax_t fallback);
 int write_all_buf(int fd, const void *buf, size_t len);
-- 
2.43.0



* [PATCH 3/8] tcp: Don't clear ACK_TO_TAP_DUE if we're advertising a zero-sized window
  2025-12-04  7:45 [PATCH 0/8] tcp: Fix throughput issues with non-local peers Stefano Brivio
  2025-12-04  7:45 ` [PATCH 1/8] tcp: Limit advertised window to available, not total sending buffer size Stefano Brivio
  2025-12-04  7:45 ` [PATCH 2/8] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks Stefano Brivio
@ 2025-12-04  7:45 ` Stefano Brivio
  2025-12-04 23:50   ` David Gibson
  2025-12-04  7:45 ` [PATCH 4/8] tcp: Acknowledge everything if sending buffer is less than SNDBUF_BIG Stefano Brivio
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Stefano Brivio @ 2025-12-04  7:45 UTC (permalink / raw)
  To: passt-dev; +Cc: Max Chernoff, David Gibson

We correctly avoid doing that at the beginning of tcp_prepare_flags(),
but we might clear the flag later on if we actually end up sending a
"flag" segment.

Make sure we don't: otherwise, we might significantly delay window
updates after a zero-window condition, and noticeably affect
throughput as a result.

In some cases, we're forcing peers to send zero-window probes or
keep-alive segments.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tcp.c b/tcp.c
index b00b874..e4c5a5b 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1285,7 +1285,8 @@ int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
 	th->fin = !!(flags & FIN);
 
 	if (th->ack) {
-		if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap))
+		if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap) &&
+		    conn->wnd_to_tap)
 			conn_flag(c, conn, ~ACK_TO_TAP_DUE);
 		else
 			conn_flag(c, conn, ACK_TO_TAP_DUE);
-- 
2.43.0



* [PATCH 4/8] tcp: Acknowledge everything if sending buffer is less than SNDBUF_BIG
  2025-12-04  7:45 [PATCH 0/8] tcp: Fix throughput issues with non-local peers Stefano Brivio
                   ` (2 preceding siblings ...)
  2025-12-04  7:45 ` [PATCH 3/8] tcp: Don't clear ACK_TO_TAP_DUE if we're advertising a zero-sized window Stefano Brivio
@ 2025-12-04  7:45 ` Stefano Brivio
  2025-12-05  0:08   ` David Gibson
  2025-12-04  7:45 ` [PATCH 5/8] tcp: Don't limit window to less-than-MSS values, use zero instead Stefano Brivio
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Stefano Brivio @ 2025-12-04  7:45 UTC (permalink / raw)
  To: passt-dev; +Cc: Max Chernoff, David Gibson

...instead of checking if it's less than SNDBUF_SMALL, because this
isn't simply an optimisation to coalesce ACK segments: we rely on
getting enough data at once from the sender to make the buffer grow
by means of the TCP buffer size auto-tuning implemented in the Linux
kernel.

Use SNDBUF_BIG: above that, we don't need auto-tuning (even though
it might happen). SNDBUF_SMALL is too... small.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcp.c b/tcp.c
index e4c5a5b..fbf97a0 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1079,7 +1079,7 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 	if (bytes_acked_cap && !force_seq &&
 	    !CONN_IS_CLOSING(conn) &&
 	    !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn) &&
-	    (unsigned)SNDBUF_GET(conn) >= SNDBUF_SMALL) {
+	    (unsigned)SNDBUF_GET(conn) >= SNDBUF_BIG) {
 		if (!tinfo) {
 			tinfo = &tinfo_new;
 			if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
-- 
2.43.0



* [PATCH 5/8] tcp: Don't limit window to less-than-MSS values, use zero instead
  2025-12-04  7:45 [PATCH 0/8] tcp: Fix throughput issues with non-local peers Stefano Brivio
                   ` (3 preceding siblings ...)
  2025-12-04  7:45 ` [PATCH 4/8] tcp: Acknowledge everything if sending buffer is less than SNDBUF_BIG Stefano Brivio
@ 2025-12-04  7:45 ` Stefano Brivio
  2025-12-05  0:35   ` David Gibson
  2025-12-04  7:45 ` [PATCH 6/8] tcp: Allow exceeding the available sending buffer size in window advertisements Stefano Brivio
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Stefano Brivio @ 2025-12-04  7:45 UTC (permalink / raw)
  To: passt-dev; +Cc: Max Chernoff, David Gibson

If the sender uses data clumping (including Nagle's algorithm) for
Silly Window Syndrome (SWS) avoidance, advertising less than an MSS
means the sender might stop sending altogether, and window updates
after a low-window condition are just as important as they are in
a zero-window condition.

For simplicity, approximate that limit to zero, as we have an
implementation forcing window updates after zero-sized windows.
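
A minimal sketch of the approximation (the helper name is hypothetical;
the actual change adjusts 'limit' in tcp_update_seqack_wnd() as shown
in the diff below):

```c
/* Sketch: a sub-MSS window is approximated to zero, so that the
 * existing zero-window update machinery forces a window update once
 * enough space becomes available again.
 */
static int sws_avoid_limit(int limit, int mss)
{
	return limit < mss ? 0 : limit;
}
```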

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/tcp.c b/tcp.c
index fbf97a0..2220059 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1140,6 +1140,18 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 		else
 			limit = SNDBUF_GET(conn) - (int)sendq;
 
+		/* If the sender uses Nagle's algorithm to prevent Silly Window
+		 * Syndrome (SWS, RFC 813 Section 3) it's critical that, should
+		 * the window ever become less than the MSS, we advertise a new
+		 * value once it increases again to be above it.
+		 *
+		 * To this end, for simplicity, approximate a window value below
+		 * the MSS to zero, as we already have mechanisms in place to
+		 * force updates after the window becomes zero.
+		 */
+		if (limit < MSS_GET(conn))
+			limit = 0;
+
 		new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, limit);
 	}
 
-- 
2.43.0



* [PATCH 6/8] tcp: Allow exceeding the available sending buffer size in window advertisements
  2025-12-04  7:45 [PATCH 0/8] tcp: Fix throughput issues with non-local peers Stefano Brivio
                   ` (4 preceding siblings ...)
  2025-12-04  7:45 ` [PATCH 5/8] tcp: Don't limit window to less-than-MSS values, use zero instead Stefano Brivio
@ 2025-12-04  7:45 ` Stefano Brivio
  2025-12-05  2:34   ` David Gibson
  2025-12-04  7:45 ` [PATCH 7/8] tcp: Send a duplicate ACK also on complete sendmsg() failure Stefano Brivio
  2025-12-04  7:45 ` [PATCH 8/8] tcp: Skip redundant ACK on partial " Stefano Brivio
  7 siblings, 1 reply; 25+ messages in thread
From: Stefano Brivio @ 2025-12-04  7:45 UTC (permalink / raw)
  To: passt-dev; +Cc: Max Chernoff, David Gibson

...under two conditions:

- the remote peer is advertising a bigger value to us, meaning that a
  bigger sending buffer is likely to benefit throughput, AND

- this is not a short-lived connection, where the latency cost of
  retransmissions would be otherwise unacceptable.

By doing this, we can reliably trigger TCP buffer size auto-tuning (as
long as it's available) on bulk data transfers.
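
Putting the two conditions together, a sketch of the resulting limit
selection (constants match the diff below; the function name is
illustrative, not the actual code):

```c
#include <stdint.h>

#define SHORT_CONN_BYTES	(16ULL * 1024 * 1024)
#define SNDBUF_BOOST_FACTOR	150	/* % */
#define SNDBUF_BOOST(x)		((x) * SNDBUF_BOOST_FACTOR / 100)

/* Sketch: pick the sending-buffer-based limit for the advertised
 * window, exceeding the free buffer space by 50% only when the peer
 * advertises more than our whole buffer AND the connection has already
 * carried enough data not to count as short-lived.
 */
static int64_t boosted_limit(int sndbuf, uint32_t sendq,
			     uint32_t peer_wnd, uint64_t bytes_acked)
{
	if ((int)sendq > sndbuf)	/* Memory pressure? */
		return 0;

	if ((int)peer_wnd > sndbuf && bytes_acked > SHORT_CONN_BYTES)
		return (int64_t)SNDBUF_BOOST(sndbuf) - (int)sendq;

	return sndbuf - (int)sendq;
}
```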

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/tcp.c b/tcp.c
index 2220059..454df69 100644
--- a/tcp.c
+++ b/tcp.c
@@ -353,6 +353,13 @@ enum {
 #define LOW_RTT_TABLE_SIZE		8
 #define LOW_RTT_THRESHOLD		10 /* us */
 
+/* Try to avoid retransmissions to improve latency on short-lived connections */
+#define SHORT_CONN_BYTES		(16ULL * 1024 * 1024)
+
+/* Temporarily exceed available sending buffer to force TCP auto-tuning */
+#define SNDBUF_BOOST_FACTOR		150 /* % */
+#define SNDBUF_BOOST(x)			((x) * SNDBUF_BOOST_FACTOR / 100)
+
 #define ACK_IF_NEEDED	0		/* See tcp_send_flag() */
 
 #define CONN_IS_CLOSING(conn)						\
@@ -1137,6 +1144,9 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 
 		if ((int)sendq > SNDBUF_GET(conn)) /* Due to memory pressure? */
 			limit = 0;
+		else if ((int)tinfo->tcpi_snd_wnd > SNDBUF_GET(conn) &&
+			 tinfo->tcpi_bytes_acked > SHORT_CONN_BYTES)
+			limit = SNDBUF_BOOST(SNDBUF_GET(conn)) - (int)sendq;
 		else
 			limit = SNDBUF_GET(conn) - (int)sendq;
 
-- 
2.43.0



* [PATCH 7/8] tcp: Send a duplicate ACK also on complete sendmsg() failure
  2025-12-04  7:45 [PATCH 0/8] tcp: Fix throughput issues with non-local peers Stefano Brivio
                   ` (5 preceding siblings ...)
  2025-12-04  7:45 ` [PATCH 6/8] tcp: Allow exceeding the available sending buffer size in window advertisements Stefano Brivio
@ 2025-12-04  7:45 ` Stefano Brivio
  2025-12-05  2:35   ` David Gibson
  2025-12-04  7:45 ` [PATCH 8/8] tcp: Skip redundant ACK on partial " Stefano Brivio
  7 siblings, 1 reply; 25+ messages in thread
From: Stefano Brivio @ 2025-12-04  7:45 UTC (permalink / raw)
  To: passt-dev; +Cc: Max Chernoff, David Gibson

...in order to trigger a fast retransmit as soon as possible. There's
no benefit in forcing the sender to wait for a longer time than that.

We already do this on partial failures (short socket writes) but, for
historical reasons, not on complete failures. Make these two cases
consistent with each other.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcp.c b/tcp.c
index 454df69..76a9daf 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1965,7 +1965,7 @@ eintr:
 			goto eintr;
 
 		if (errno == EAGAIN || errno == EWOULDBLOCK) {
-			tcp_send_flag(c, conn, ACK_IF_NEEDED);
+			tcp_send_flag(c, conn, ACK | DUP_ACK);
 			return p->count - idx;
 
 		}
-- 
2.43.0



* [PATCH 8/8] tcp: Skip redundant ACK on partial sendmsg() failure
  2025-12-04  7:45 [PATCH 0/8] tcp: Fix throughput issues with non-local peers Stefano Brivio
                   ` (6 preceding siblings ...)
  2025-12-04  7:45 ` [PATCH 7/8] tcp: Send a duplicate ACK also on complete sendmsg() failure Stefano Brivio
@ 2025-12-04  7:45 ` Stefano Brivio
  2025-12-05  2:36   ` David Gibson
  7 siblings, 1 reply; 25+ messages in thread
From: Stefano Brivio @ 2025-12-04  7:45 UTC (permalink / raw)
  To: passt-dev; +Cc: Max Chernoff, David Gibson

...we'll send a duplicate ACK right away in this case, so this
earlier, redundant check is not just useless: it might actually be
harmful, as we would now send a triple ACK, which might cause two
retransmissions.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tcp.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/tcp.c b/tcp.c
index 76a9daf..fc986a2 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1972,13 +1972,10 @@ eintr:
 		return -1;
 	}
 
-	if (n < (int)(seq_from_tap - conn->seq_from_tap)) {
+	if (n < (int)(seq_from_tap - conn->seq_from_tap))
 		partial_send = 1;
-		conn->seq_from_tap += n;
-		tcp_send_flag(c, conn, ACK_IF_NEEDED);
-	} else {
-		conn->seq_from_tap += n;
-	}
+
+	conn->seq_from_tap += n;
 
 out:
 	if (keep != -1 || partial_send) {
-- 
2.43.0



* Re: [PATCH 1/8] tcp: Limit advertised window to available, not total sending buffer size
  2025-12-04  7:45 ` [PATCH 1/8] tcp: Limit advertised window to available, not total sending buffer size Stefano Brivio
@ 2025-12-04 23:10   ` David Gibson
  0 siblings, 0 replies; 25+ messages in thread
From: David Gibson @ 2025-12-04 23:10 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Max Chernoff


On Thu, Dec 04, 2025 at 08:45:34AM +0100, Stefano Brivio wrote:
> For non-local connections, we advertise the same window size as what
> the peer in turn advertises to us, and limit it to the buffer size
> reported via SO_SNDBUF.
> 
> That's not quite correct: in order to later avoid failures while
> queueing data to the socket, we need to limit the window to the
> available buffer size, not the total one.
> 
> Use the SIOCOUTQ ioctl and subtract the number of outbound queued
> bytes from the total buffer size, then clamp to this value.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>


-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson



* Re: [PATCH 2/8] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks
  2025-12-04  7:45 ` [PATCH 2/8] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks Stefano Brivio
@ 2025-12-04 23:48   ` David Gibson
  2025-12-05  1:20     ` Stefano Brivio
  0 siblings, 1 reply; 25+ messages in thread
From: David Gibson @ 2025-12-04 23:48 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Max Chernoff


On Thu, Dec 04, 2025 at 08:45:35AM +0100, Stefano Brivio wrote:
> A fixed 10 ms ACK_TIMEOUT timer value served us relatively well
> until

Nit: it's called "ACK_INTERVAL" in the code.

> the previous change, because we would generally cause retransmissions
> for non-local outbound transfers with relatively high (> 100 Mbps)
> bandwidth and non-local but low (< 5 ms) RTT.
> 
> Now that retransmissions are less frequent, we don't have a proper
> trigger to check for acknowledged bytes on the socket, and will
> generally block the sender for a significant amount of time while
> we could acknowledge more data, instead.
> 
> Store the RTT reported by the kernel using an approximation (exponent),
> to keep flow storage size within two (typical) cachelines. Check for
> socket updates when half of this time elapses: it should be a good
> indication of the one-way delay we're interested in (peer to us).

Reasoning based on a one-way delay doesn't quite make sense to me.  We
can't know when anything happens at the peer, and - obviously - we can
only set a timer starting at an event that occurs on our side.  So, I
think only RTT can matter to us, not one-way delay.

That said, using half the RTT estimate still makes sense to me: we
only have an approximation, and halving it gives us a pretty safe
lower bound.

> Representable values are between 100 us and 12.8 ms, and any value

Nit: I think Unicode is long enough supported you can use µs

> outside this range is clamped to these bounds. This choice appears
> to be a good trade-off between additional overhead and throughput.
> 
> This mechanism partially overlaps with the "low RTT" destinations,
> which we use to infer that a socket is connected to an endpoint to
> the same machine (while possibly in a different namespace) if the
> RTT is reported as 10 us or less.
> 
> This change doesn't, however, conflict with it: we are reading
> TCP_INFO parameters for local connections anyway, so we can always
> store the RTT approximation opportunistically.
> 
> Then, if the RTT is "low", we don't really need a timer to
> acknowledge data as we'll always acknowledge everything to the
> sender right away. However, we have limited space in the array where
> we store addresses of local destination, so the low RTT property of a
> connection might toggle frequently. Because of this, it's actually
> helpful to always have the RTT approximation stored.
> 
> This could probably benefit from a future rework, though, introducing
> a more integrated approach between these two mechanisms.

Right, it feels like it should be possible to combine these
mechanisms, but figuring out exactly how isn't trivial.  Problem for
another day.

> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
>  tcp.c      | 29 ++++++++++++++++++++++-------
>  tcp_conn.h |  9 +++++++++
>  util.c     | 14 ++++++++++++++
>  util.h     |  1 +
>  4 files changed, 46 insertions(+), 7 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index 863ccdb..b00b874 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -202,9 +202,13 @@
>   * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
>   *   either side, the connection is reset
>   *
> - * - ACK_INTERVAL elapsed after data segment received from tap without having
> + * - RTT / 2 elapsed after data segment received from tap without having
>   *   sent an ACK segment, or zero-sized window advertised to tap/guest (flag
> - *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent
> + *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent.
> + *
> + *   RTT, here, is an approximation of the RTT value reported by the kernel via
> + *   TCP_INFO, with a representable range from RTT_STORE_MIN (100 us) to
> + *   RTT_STORE_MAX (12.8 ms). The timeout value is clamped accordingly.
>   *
>   *
>   * Summary of data flows (with ESTABLISHED event)
> @@ -341,7 +345,6 @@ enum {
>  #define MSS_DEFAULT			536
>  #define WINDOW_DEFAULT			14600		/* RFC 6928 */
>  
> -#define ACK_INTERVAL			10		/* ms */
>  #define RTO_INIT			1		/* s, RFC 6298 */
>  #define RTO_INIT_AFTER_SYN_RETRIES	3		/* s, RFC 6298 */
>  #define FIN_TIMEOUT			60
> @@ -594,7 +597,9 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>  	}
>  
>  	if (conn->flags & ACK_TO_TAP_DUE) {
> -		it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> +		it.it_value.tv_nsec = (long)RTT_GET(conn) * 1000 / 2;
> +		static_assert(RTT_STORE_MAX * 1000 / 2 < 1000 * 1000 * 1000,
> +			      ".tv_nsec is greater than 1000 * 1000 * 1000");
>  	} else if (conn->flags & ACK_FROM_TAP_DUE) {
>  		int exp = conn->retries, timeout = RTO_INIT;
>  		if (!(conn->events & ESTABLISHED))
> @@ -609,9 +614,15 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>  		it.it_value.tv_sec = ACT_TIMEOUT;
>  	}
>  
> -	flow_dbg(conn, "timer expires in %llu.%03llus",
> -		 (unsigned long long)it.it_value.tv_sec,
> -		 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
> +	if (conn->flags & ACK_TO_TAP_DUE) {
> +		flow_trace(conn, "timer expires in %lu.%01llums",
> +			   (unsigned long)it.it_value.tv_nsec / 1000 / 1000,
> +			   (unsigned long long)it.it_value.tv_nsec / 1000);
> +	} else {
> +		flow_dbg(conn, "timer expires in %llu.%03llus",
> +			 (unsigned long long)it.it_value.tv_sec,
> +			 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
> +	}

One branch is flow_trace() and one is flow_dbg(), which doesn't seem
correct.  Also, basing the range indirectly on the flags, rather than
on the actual numbers in it.it_value seems fragile.  But... this seems
overly complex for a trace message anyway.  Maybe just use the seconds
formatting, but increase the resolution to µs.

>  
>  	if (timerfd_settime(conn->timer, 0, &it, NULL))
>  		flow_perror(conn, "failed to set timer");
> @@ -1149,6 +1160,10 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
>  		conn_flag(c, conn, ACK_TO_TAP_DUE);
>  
>  out:
> +	/* Opportunistically store RTT approximation on valid TCP_INFO data */
> +	if (tinfo)
> +		RTT_SET(conn, tinfo->tcpi_rtt);
> +
>  	return new_wnd_to_tap       != prev_wnd_to_tap ||
>  	       conn->seq_ack_to_tap != prev_ack_to_tap;
>  }
> diff --git a/tcp_conn.h b/tcp_conn.h
> index e36910c..76034f6 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -49,6 +49,15 @@ struct tcp_tap_conn {
>  #define MSS_SET(conn, mss)	(conn->tap_mss = (mss >> (16 - TCP_MSS_BITS)))
>  #define MSS_GET(conn)		(conn->tap_mss << (16 - TCP_MSS_BITS))
>  
> +#define RTT_EXP_BITS			3
> +	unsigned int	rtt_exp		:RTT_EXP_BITS;
> +#define RTT_EXP_MAX			MAX_FROM_BITS(RTT_EXP_BITS)
> +#define RTT_STORE_MIN			100 /* us, minimum representable */
> +#define RTT_STORE_MAX			(RTT_STORE_MIN << RTT_EXP_MAX)
> +#define RTT_SET(conn, rtt)						\
> +	(conn->rtt_exp = MIN(RTT_EXP_MAX, ilog2(MAX(1, rtt / RTT_STORE_MIN))))
> +#define RTT_GET(conn)			(RTT_STORE_MIN << conn->rtt_exp)
> +
>  	int		sock		:FD_REF_BITS;
>  
>  	uint8_t		events;
> diff --git a/util.c b/util.c
> index 4beb7c2..590373d 100644
> --- a/util.c
> +++ b/util.c
> @@ -611,6 +611,9 @@ int __daemon(int pidfile_fd, int devnull_fd)
>   * fls() - Find last (most significant) bit set in word
>   * @x:		Word
>   *
> + * Note: unlike ffs() and other implementations of fls(), notably the one from
> + * the Linux kernel, the starting position is 0 and not 1, that is, fls(1) = 0.
> + *
>   * Return: position of most significant bit set, starting from 0, -1 if none
>   */
>  int fls(unsigned long x)
> @@ -626,6 +629,17 @@ int fls(unsigned long x)
>  	return y;
>  }
>  
> +/**
> + * ilog2() - Integral part (floor) of binary logarithm (logarithm to the base 2)
> + * @x:		Argument
> + *
> + * Return: integral part of binary logarithm of @x, -1 if undefined (if @x is 0)
> + */
> +int ilog2(unsigned long x)
> +{
> +	return fls(x);
> +}
> +
>  /**
>   * write_file() - Replace contents of file with a string
>   * @path:	File to write
> diff --git a/util.h b/util.h
> index 7bf0701..40de694 100644
> --- a/util.h
> +++ b/util.h
> @@ -230,6 +230,7 @@ int output_file_open(const char *path, int flags);
>  void pidfile_write(int fd, pid_t pid);
>  int __daemon(int pidfile_fd, int devnull_fd);
>  int fls(unsigned long x);
> +int ilog2(unsigned long x);
>  int write_file(const char *path, const char *buf);
>  intmax_t read_file_integer(const char *path, intmax_t fallback);
>  int write_all_buf(int fd, const void *buf, size_t len);
> -- 
> 2.43.0
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/8] tcp: Don't clear ACK_TO_TAP_DUE if we're advertising a zero-sized window
  2025-12-04  7:45 ` [PATCH 3/8] tcp: Don't clear ACK_TO_TAP_DUE if we're advertising a zero-sized window Stefano Brivio
@ 2025-12-04 23:50   ` David Gibson
  0 siblings, 0 replies; 25+ messages in thread
From: David Gibson @ 2025-12-04 23:50 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Max Chernoff

[-- Attachment #1: Type: text/plain, Size: 1392 bytes --]

On Thu, Dec 04, 2025 at 08:45:36AM +0100, Stefano Brivio wrote:
> We correctly avoid doing that at the beginning of tcp_prepare_flags(),
> but we might clear the flag later on if we actually end up sending a
> "flag" segment.
> 
> Make sure we don't, otherwise we might significantly delay window
> updates after a zero-window condition, and noticeably affect
> throughput.
> 
> In some cases, we're forcing peers to send zero-window probes or
> keep-alive segments.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  tcp.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/tcp.c b/tcp.c
> index b00b874..e4c5a5b 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1285,7 +1285,8 @@ int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
>  	th->fin = !!(flags & FIN);
>  
>  	if (th->ack) {
> -		if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap))
> +		if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap) &&
> +		    conn->wnd_to_tap)
>  			conn_flag(c, conn, ~ACK_TO_TAP_DUE);
>  		else
>  			conn_flag(c, conn, ACK_TO_TAP_DUE);
> -- 
> 2.43.0
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson


* Re: [PATCH 4/8] tcp: Acknowledge everything if sending buffer is less than SNDBUF_BIG
  2025-12-04  7:45 ` [PATCH 4/8] tcp: Acknowledge everything if sending buffer is less than SNDBUF_BIG Stefano Brivio
@ 2025-12-05  0:08   ` David Gibson
  2025-12-05  1:20     ` Stefano Brivio
  0 siblings, 1 reply; 25+ messages in thread
From: David Gibson @ 2025-12-05  0:08 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Max Chernoff

On Thu, Dec 04, 2025 at 08:45:37AM +0100, Stefano Brivio wrote:
> ...instead of checking if it's less than SNDBUF_SMALL, because this
> isn't simply an optimisation to coalesce ACK segments: we rely on
> having enough data at once from the sender to make the buffer grow
> by means of TCP buffer size tuning implemented in the Linux kernel.
> 
> Use SNDBUF_BIG: above that, we don't need auto-tuning (even though
> it might happen). SNDBUF_SMALL is too... small.

Do you have an idea of how often sndbuf exceeds SNDBUF_BIG?  I'm
wondering if by making this change we might have largely eliminated
the first branch in practice.

> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
>  tcp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tcp.c b/tcp.c
> index e4c5a5b..fbf97a0 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1079,7 +1079,7 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
>  	if (bytes_acked_cap && !force_seq &&
>  	    !CONN_IS_CLOSING(conn) &&
>  	    !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn) &&
> -	    (unsigned)SNDBUF_GET(conn) >= SNDBUF_SMALL) {
> +	    (unsigned)SNDBUF_GET(conn) >= SNDBUF_BIG) {
>  		if (!tinfo) {
>  			tinfo = &tinfo_new;
>  			if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
> -- 
> 2.43.0
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson


* Re: [PATCH 5/8] tcp: Don't limit window to less-than-MSS values, use zero instead
  2025-12-04  7:45 ` [PATCH 5/8] tcp: Don't limit window to less-than-MSS values, use zero instead Stefano Brivio
@ 2025-12-05  0:35   ` David Gibson
  2025-12-05  1:20     ` Stefano Brivio
  0 siblings, 1 reply; 25+ messages in thread
From: David Gibson @ 2025-12-05  0:35 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Max Chernoff

On Thu, Dec 04, 2025 at 08:45:38AM +0100, Stefano Brivio wrote:
> If the sender uses data clumping (including Nagle's algorithm) for
> Silly Window Syndrome (SWS) avoidance, advertising less than a MSS
> means the sender might stop sending altogether, and window updates
> after a low window condition are just as important as they are in
> a zero-window condition.
> 
> For simplicity, approximate that limit to zero, as we have an
> implementation forcing window updates after zero-sized windows.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

The logic change looks good to me, so,

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

However, a couple of points about the description (both commit message
and comment).

 * Nagle's algorithm is certainly related, but it's not clear to me
   it's quite the same thing as the sender-side SWS avoidance
   algorithm - Nagle's exists for a different purpose, certainly.
>    RFC 813 doesn't name Nagle's algorithm anywhere, although that
>    could be because the name wasn't as established at the time.
 * Since you're referencing RFC 813 anyway, it seems relevant that
   what you're doing here is pretty similar to the receiver-side SWS
   avoidance algorithm described in section 4.

> ---
>  tcp.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/tcp.c b/tcp.c
> index fbf97a0..2220059 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1140,6 +1140,18 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
>  		else
>  			limit = SNDBUF_GET(conn) - (int)sendq;
>  
> +		/* If the sender uses Nagle's algorithm to prevent Silly Window
> +		 * Syndrome (SWS, RFC 813 Section 3) it's critical that, should
> +		 * the window ever become less than the MSS, we advertise a new
> +		 * value once it increases again to be above it.
> +		 *
> +		 * To this end, for simplicity, approximate a window value below
> +		 * the MSS to zero, as we already have mechanisms in place to
> +		 * force updates after the window becomes zero.
> +		 */
> +		if (limit < MSS_GET(conn))
> +			limit = 0;
> +
>  		new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, limit);
>  	}
>  
> -- 
> 2.43.0
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson


* Re: [PATCH 2/8] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks
  2025-12-04 23:48   ` David Gibson
@ 2025-12-05  1:20     ` Stefano Brivio
  2025-12-05  2:49       ` David Gibson
  0 siblings, 1 reply; 25+ messages in thread
From: Stefano Brivio @ 2025-12-05  1:20 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev, Max Chernoff

On Fri, 5 Dec 2025 10:48:20 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Dec 04, 2025 at 08:45:35AM +0100, Stefano Brivio wrote:
> > A fixed 10 ms ACK_TIMEOUT timer value served us relatively well
> > until  
> 
> Nit: it's called "ACK_INTERVAL" in the code.

Oops. I'll change this.

> > the previous change, because we would generally cause retransmissions
> > for non-local outbound transfers with relatively high (> 100 Mbps)
> > bandwidth and non-local but low (< 5 ms) RTT.
> > 
> > Now that retransmissions are less frequent, we don't have a proper
> > trigger to check for acknowledged bytes on the socket, and will
> > generally block the sender for a significant amount of time while
> > we could acknowledge more data, instead.
> > 
> > Store the RTT reported by the kernel using an approximation (exponent),
> > to keep flow storage size within two (typical) cachelines. Check for
> > socket updates when half of this time elapses: it should be a good
> > indication of the one-way delay we're interested in (peer to us).  
> 
> Reasoning based on a one-way delay doesn't quite make sense to me.  We
> can't know when anything happens at the peer, and - obviously - we can
> only set a timer starting at an event that occurs on our side.  So, I
> think only RTT can matter to us, not one-way delay.

...except that we might be scheduling the timer at any point *after* we
sent data, so the outbound delay might be partially elapsed, and the
one-way (receiving) delay is actually (more) relevant.

If we had instantaneous receiving of ACK segments, we would need to
probe much more frequently than the RTT, to monitor the actual progress
more accurately. Note that transmission rate (including forwarding
delays) is not constant and might be bursty.

But yes, in general it's not much more relevant than the RTT. I could
drop this part of the commit message.

> That said, using half the RTT estimate still makes sense to me: we
> only have an approximation, and halving it gives us a pretty safe
> lower bound.

In any case, yes.
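For illustration, the resulting timer arming can be reduced to something like this (the function name and plain-integer interface are mine; the clamping bounds mirror RTT_STORE_MIN/RTT_STORE_MAX from the patch):

```c
#include <assert.h>

#define RTT_STORE_MIN	100UL	/* us, minimum representable */
#define RTT_STORE_MAX	12800UL	/* us, maximum representable */

/* ACK-check interval: half the (clamped) RTT, in nanoseconds. With the
 * RTT capped at 12.8 ms, the result is at most 6.4 ms, so it always
 * fits in .tv_nsec (which must stay below one second). */
static long ack_interval_nsec(unsigned long rtt_us)
{
	if (rtt_us < RTT_STORE_MIN)
		rtt_us = RTT_STORE_MIN;
	else if (rtt_us > RTT_STORE_MAX)
		rtt_us = RTT_STORE_MAX;

	return (long)rtt_us * 1000 / 2;
}
```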

> > Representable values are between 100 us and 12.8 ms, and any value  
> 
> Nit: I think Unicode has been supported for long enough that you can use µs

I prefer to avoid it in the code if possible, because one might not have
Unicode support in all the relevant environments and consoles (I just
finished debugging stuff on Alpine...), and at that point I'd rather
have consistent commit messages.

> > outside this range is clamped to these bounds. This choice appears
> > to be a good trade-off between additional overhead and throughput.
> > 
> > This mechanism partially overlaps with the "low RTT" destinations,
> > which we use to infer that a socket is connected to an endpoint on
> > the same machine (while possibly in a different namespace) if the
> > RTT is reported as 10 us or less.
> > 
> > This change doesn't, however, conflict with it: we are reading
> > TCP_INFO parameters for local connections anyway, so we can always
> > store the RTT approximation opportunistically.
> > 
> > Then, if the RTT is "low", we don't really need a timer to
> > acknowledge data as we'll always acknowledge everything to the
> > sender right away. However, we have limited space in the array where
> > we store addresses of local destinations, so the low RTT property of a
> > connection might toggle frequently. Because of this, it's actually
> > helpful to always have the RTT approximation stored.
> > 
> > This could probably benefit from a future rework, though, introducing
> > a more integrated approach between these two mechanisms.  
> 
> Right, it feels like it should be possible to combine these
> mechanisms, but figuring out exactly how isn't trivial.  Problem for
> another day.
> 
> > 
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> >  tcp.c      | 29 ++++++++++++++++++++++-------
> >  tcp_conn.h |  9 +++++++++
> >  util.c     | 14 ++++++++++++++
> >  util.h     |  1 +
> >  4 files changed, 46 insertions(+), 7 deletions(-)
> > 
> > diff --git a/tcp.c b/tcp.c
> > index 863ccdb..b00b874 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -202,9 +202,13 @@
> >   * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
> >   *   either side, the connection is reset
> >   *
> > - * - ACK_INTERVAL elapsed after data segment received from tap without having
> > + * - RTT / 2 elapsed after data segment received from tap without having
> >   *   sent an ACK segment, or zero-sized window advertised to tap/guest (flag
> > - *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent
> > + *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent.
> > + *
> > + *   RTT, here, is an approximation of the RTT value reported by the kernel via
> > + *   TCP_INFO, with a representable range from RTT_STORE_MIN (100 us) to
> > + *   RTT_STORE_MAX (12.8 ms). The timeout value is clamped accordingly.
> >   *
> >   *
> >   * Summary of data flows (with ESTABLISHED event)
> > @@ -341,7 +345,6 @@ enum {
> >  #define MSS_DEFAULT			536
> >  #define WINDOW_DEFAULT			14600		/* RFC 6928 */
> >  
> > -#define ACK_INTERVAL			10		/* ms */
> >  #define RTO_INIT			1		/* s, RFC 6298 */
> >  #define RTO_INIT_AFTER_SYN_RETRIES	3		/* s, RFC 6298 */
> >  #define FIN_TIMEOUT			60
> > @@ -594,7 +597,9 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> >  	}
> >  
> >  	if (conn->flags & ACK_TO_TAP_DUE) {
> > -		it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > +		it.it_value.tv_nsec = (long)RTT_GET(conn) * 1000 / 2;
> > +		static_assert(RTT_STORE_MAX * 1000 / 2 < 1000 * 1000 * 1000,
> > +			      ".tv_nsec is greater than 1000 * 1000 * 1000");
> >  	} else if (conn->flags & ACK_FROM_TAP_DUE) {
> >  		int exp = conn->retries, timeout = RTO_INIT;
> >  		if (!(conn->events & ESTABLISHED))
> > @@ -609,9 +614,15 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> >  		it.it_value.tv_sec = ACT_TIMEOUT;
> >  	}
> >  
> > -	flow_dbg(conn, "timer expires in %llu.%03llus",
> > -		 (unsigned long long)it.it_value.tv_sec,
> > -		 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
> > +	if (conn->flags & ACK_TO_TAP_DUE) {
> > +		flow_trace(conn, "timer expires in %lu.%01llums",
> > +			   (unsigned long)it.it_value.tv_nsec / 1000 / 1000,
> > +			   (unsigned long long)it.it_value.tv_nsec / 1000);
> > +	} else {
> > +		flow_dbg(conn, "timer expires in %llu.%03llus",
> > +			 (unsigned long long)it.it_value.tv_sec,
> > +			 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
> > +	}  
> 
> One branch is flow_trace() and one is flow_dbg(), which doesn't seem
> correct.

No, it's intended, it's actually the main reason why I'm changing this
part.

Now that we have more frequent timer scheduling on ACK_TO_TAP_DUE, the
debug logs become unusable if you're trying to debug anything that's
not related to a specific data transfer.

> Also, basing the range indirectly on the flags, rather than
> on the actual numbers in it.it_value seems fragile.

Flags tell us why we're scheduling a specific timer, and it's only
on ACK_TO_TAP_DUE that we want to have more fine-grained values.

> But... this seems
> overly complex for a trace message anyway.  Maybe just use the seconds
> formatting, but increase the resolution to µs.

I tried a number of different combinations like that, they are all
rather inconvenient.

> >  
> >  	if (timerfd_settime(conn->timer, 0, &it, NULL))
> >  		flow_perror(conn, "failed to set timer");
> > @@ -1149,6 +1160,10 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> >  		conn_flag(c, conn, ACK_TO_TAP_DUE);
> >  
> >  out:
> > +	/* Opportunistically store RTT approximation on valid TCP_INFO data */
> > +	if (tinfo)
> > +		RTT_SET(conn, tinfo->tcpi_rtt);
> > +
> >  	return new_wnd_to_tap       != prev_wnd_to_tap ||
> >  	       conn->seq_ack_to_tap != prev_ack_to_tap;
> >  }
> > diff --git a/tcp_conn.h b/tcp_conn.h
> > index e36910c..76034f6 100644
> > --- a/tcp_conn.h
> > +++ b/tcp_conn.h
> > @@ -49,6 +49,15 @@ struct tcp_tap_conn {
> >  #define MSS_SET(conn, mss)	(conn->tap_mss = (mss >> (16 - TCP_MSS_BITS)))
> >  #define MSS_GET(conn)		(conn->tap_mss << (16 - TCP_MSS_BITS))
> >  
> > +#define RTT_EXP_BITS			3
> > +	unsigned int	rtt_exp		:RTT_EXP_BITS;
> > +#define RTT_EXP_MAX			MAX_FROM_BITS(RTT_EXP_BITS)
> > +#define RTT_STORE_MIN			100 /* us, minimum representable */
> > +#define RTT_STORE_MAX			(RTT_STORE_MIN << RTT_EXP_MAX)
> > +#define RTT_SET(conn, rtt)						\
> > +	(conn->rtt_exp = MIN(RTT_EXP_MAX, ilog2(MAX(1, rtt / RTT_STORE_MIN))))
> > +#define RTT_GET(conn)			(RTT_STORE_MIN << conn->rtt_exp)
> > +
> >  	int		sock		:FD_REF_BITS;
> >  
> >  	uint8_t		events;
> > diff --git a/util.c b/util.c
> > index 4beb7c2..590373d 100644
> > --- a/util.c
> > +++ b/util.c
> > @@ -611,6 +611,9 @@ int __daemon(int pidfile_fd, int devnull_fd)
> >   * fls() - Find last (most significant) bit set in word
> >   * @x:		Word
> >   *
> > + * Note: unlike ffs() and other implementations of fls(), notably the one from
> > + * the Linux kernel, the starting position is 0 and not 1, that is, fls(1) = 0.
> > + *
> >   * Return: position of most significant bit set, starting from 0, -1 if none
> >   */
> >  int fls(unsigned long x)
> > @@ -626,6 +629,17 @@ int fls(unsigned long x)
> >  	return y;
> >  }
> >  
> > +/**
> > + * ilog2() - Integral part (floor) of binary logarithm (logarithm to the base 2)
> > + * @x:		Argument
> > + *
> > + * Return: integral part of binary logarithm of @x, -1 if undefined (if @x is 0)
> > + */
> > +int ilog2(unsigned long x)
> > +{
> > +	return fls(x);
> > +}
> > +
> >  /**
> >   * write_file() - Replace contents of file with a string
> >   * @path:	File to write
> > diff --git a/util.h b/util.h
> > index 7bf0701..40de694 100644
> > --- a/util.h
> > +++ b/util.h
> > @@ -230,6 +230,7 @@ int output_file_open(const char *path, int flags);
> >  void pidfile_write(int fd, pid_t pid);
> >  int __daemon(int pidfile_fd, int devnull_fd);
> >  int fls(unsigned long x);
> > +int ilog2(unsigned long x);
> >  int write_file(const char *path, const char *buf);
> >  intmax_t read_file_integer(const char *path, intmax_t fallback);
> >  int write_all_buf(int fd, const void *buf, size_t len);
> > -- 
> > 2.43.0

-- 
Stefano



* Re: [PATCH 4/8] tcp: Acknowledge everything if sending buffer is less than SNDBUF_BIG
  2025-12-05  0:08   ` David Gibson
@ 2025-12-05  1:20     ` Stefano Brivio
  2025-12-05  2:50       ` David Gibson
  0 siblings, 1 reply; 25+ messages in thread
From: Stefano Brivio @ 2025-12-05  1:20 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev, Max Chernoff

On Fri, 5 Dec 2025 11:08:06 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Dec 04, 2025 at 08:45:37AM +0100, Stefano Brivio wrote:
> > ...instead of checking if it's less than SNDBUF_SMALL, because this
> > isn't simply an optimisation to coalesce ACK segments: we rely on
> > having enough data at once from the sender to make the buffer grow
> > by means of TCP buffer size tuning implemented in the Linux kernel.
> > 
> > Use SNDBUF_BIG: above that, we don't need auto-tuning (even though
> > it might happen). SNDBUF_SMALL is too... small.  
> 
> Do you have an idea of how often sndbuf exceeds SNDBUF_BIG?  I'm
> wondering if by making this change we might have largely eliminated
> the first branch in practice.

Before this series, or after 6/8 in this series, it happens quite
often. It depends on the bandwidth * delay product of course, but at 1
Gbps and 20 ms RTT we get there in a couple of seconds.

Maybe 1 MiB would make more sense for typical conditions, but I'd defer
this to a more adaptive implementation of the whole thing. I think it
should also depend on the RTT, ideally.
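For a rough sense of scale, the bandwidth-delay product behind those figures can be computed as follows (a back-of-the-envelope helper, not code from the series):

```c
#include <assert.h>

/* Bandwidth-delay product: bytes in flight at a given rate and RTT.
 * At 1 Gbps with a 20 ms RTT this gives 2.5 MB, which is why the
 * sending buffer can grow past SNDBUF_BIG within a couple of
 * seconds of a bulk transfer. */
static unsigned long long bdp_bytes(unsigned long long bits_per_s,
				    unsigned int rtt_ms)
{
	return bits_per_s / 8 * rtt_ms / 1000;
}
```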

> 
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> >  tcp.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/tcp.c b/tcp.c
> > index e4c5a5b..fbf97a0 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -1079,7 +1079,7 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> >  	if (bytes_acked_cap && !force_seq &&
> >  	    !CONN_IS_CLOSING(conn) &&
> >  	    !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn) &&
> > -	    (unsigned)SNDBUF_GET(conn) >= SNDBUF_SMALL) {
> > +	    (unsigned)SNDBUF_GET(conn) >= SNDBUF_BIG) {
> >  		if (!tinfo) {
> >  			tinfo = &tinfo_new;
> >  			if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
> > -- 
> > 2.43.0

-- 
Stefano



* Re: [PATCH 5/8] tcp: Don't limit window to less-than-MSS values, use zero instead
  2025-12-05  0:35   ` David Gibson
@ 2025-12-05  1:20     ` Stefano Brivio
  2025-12-05  2:53       ` David Gibson
  0 siblings, 1 reply; 25+ messages in thread
From: Stefano Brivio @ 2025-12-05  1:20 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev, Max Chernoff

On Fri, 5 Dec 2025 11:35:22 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Dec 04, 2025 at 08:45:38AM +0100, Stefano Brivio wrote:
> > If the sender uses data clumping (including Nagle's algorithm) for
> > Silly Window Syndrome (SWS) avoidance, advertising less than a MSS
> > means the sender might stop sending altogether, and window updates
> > after a low window condition are just as important as they are in
> > a zero-window condition.
> > 
> > For simplicity, approximate that limit to zero, as we have an
> > implementation forcing window updates after zero-sized windows.
> > 
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>  
> 
> The logic change looks good to me, so,
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> 
> However, a couple of points about the description (both commit message
> and comment).
> 
>  * Nagle's algorithm is certainly related, but it's not clear to me
>    it's quite the same thing as the sender-side SWS avoidance
>    algorithm - Nagle's exists for a different purpose, certainly.
>    RFC 813 doesn't name Nagle's algorithm anywhere, although that
>    could be because the name wasn't as established at the time.

Sure, Nagle's algorithm was published almost two years later (RFC 896).

>  * Since you're referencing RFC 813 anyway, it seems relevant that
>    what you're doing here is pretty similar to the receiver-side SWS
>    avoidance algorithm described in section 4.

The practical problem I observed comes from the "clumping" Linux does
while sending (and that's implemented as part of Nagle's algorithm).
But yes, I actually ignored Section 4 in all this; I'll mention it
explicitly.
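In other words, the receiver-side behaviour the patch adds amounts to this (a reduced sketch with illustrative names; cf. RFC 813 Section 4):

```c
#include <assert.h>

/* Receiver-side SWS avoidance as discussed above: never advertise a
 * window between 1 and MSS - 1; round it down to zero instead, so the
 * existing forced-update path for zero windows re-opens it later. */
static unsigned int wnd_to_advertise(unsigned int limit, unsigned int mss)
{
	return limit < mss ? 0 : limit;
}
```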

> > ---
> >  tcp.c | 12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> > 
> > diff --git a/tcp.c b/tcp.c
> > index fbf97a0..2220059 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -1140,6 +1140,18 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> >  		else
> >  			limit = SNDBUF_GET(conn) - (int)sendq;
> >  
> > +		/* If the sender uses Nagle's algorithm to prevent Silly Window
> > +		 * Syndrome (SWS, RFC 813 Section 3) it's critical that, should
> > +		 * the window ever become less than the MSS, we advertise a new
> > +		 * value once it increases again to be above it.
> > +		 *
> > +		 * To this end, for simplicity, approximate a window value below
> > +		 * the MSS to zero, as we already have mechanisms in place to
> > +		 * force updates after the window becomes zero.
> > +		 */
> > +		if (limit < MSS_GET(conn))
> > +			limit = 0;
> > +
> >  		new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, limit);
> >  	}
> >  
> > -- 
> > 2.43.0

-- 
Stefano



* Re: [PATCH 6/8] tcp: Allow exceeding the available sending buffer size in window advertisements
  2025-12-04  7:45 ` [PATCH 6/8] tcp: Allow exceeding the available sending buffer size in window advertisements Stefano Brivio
@ 2025-12-05  2:34   ` David Gibson
  2025-12-08  0:20     ` Stefano Brivio
  0 siblings, 1 reply; 25+ messages in thread
From: David Gibson @ 2025-12-05  2:34 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Max Chernoff

On Thu, Dec 04, 2025 at 08:45:39AM +0100, Stefano Brivio wrote:
> ...under two conditions:
> 
> - the remote peer is advertising a bigger value to us, meaning that a
>   bigger sending buffer is likely to benefit throughput, AND

I think this condition is redundant: if the remote peer is advertising
less, we'll clamp new_wnd_to_tap to that value anyway.

> - this is not a short-lived connection, where the latency cost of
>   retransmissions would be otherwise unacceptable.
> 
> By doing this, we can reliably trigger TCP buffer size auto-tuning (as
> long as it's available) on bulk data transfers.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
>  tcp.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/tcp.c b/tcp.c
> index 2220059..454df69 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -353,6 +353,13 @@ enum {
>  #define LOW_RTT_TABLE_SIZE		8
>  #define LOW_RTT_THRESHOLD		10 /* us */
>  
> +/* Try to avoid retransmissions to improve latency on short-lived connections */
> +#define SHORT_CONN_BYTES		(16ULL * 1024 * 1024)
> +
> +/* Temporarily exceed available sending buffer to force TCP auto-tuning */
> +#define SNDBUF_BOOST_FACTOR		150 /* % */
> +#define SNDBUF_BOOST(x)			((x) * SNDBUF_BOOST_FACTOR / 100)

For the short term, the fact this works empirically is enough.  For
the longer term, it would be nice to have a better understanding of
what this "overcommit" amount is actually estimating.

I think what we're looking for is an estimate of the number of bytes
that will have left the buffer by the time the guest gets back to us.  So:
	<connection throughput> * <guest-side RTT>

Alas, I don't see a way to estimate either of those from the
information we already track - we'd need additional bookkeeping.
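As a reduced model of what the patch does (the constants are copied from the diff; the helper name and plain-integer interface are mine):

```c
#include <assert.h>

/* Try to avoid retransmissions on short-lived connections */
#define SHORT_CONN_BYTES	(16ULL * 1024 * 1024)

/* Temporarily exceed available sending buffer to force auto-tuning */
#define SNDBUF_BOOST_FACTOR	150	/* % */
#define SNDBUF_BOOST(x)		((x) * SNDBUF_BOOST_FACTOR / 100)

/* Window limit derived from the sending buffer: zero under apparent
 * memory pressure, boosted once the connection has moved enough data
 * and the peer advertises more than our buffer, plain available space
 * otherwise. */
static long wnd_limit(long sndbuf, long sendq, long peer_wnd,
		      unsigned long long bytes_acked)
{
	if (sendq > sndbuf)	/* due to memory pressure? */
		return 0;

	if (peer_wnd > sndbuf && bytes_acked > SHORT_CONN_BYTES)
		return SNDBUF_BOOST(sndbuf) - sendq;

	return sndbuf - sendq;
}
```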

>  #define ACK_IF_NEEDED	0		/* See tcp_send_flag() */
>  
>  #define CONN_IS_CLOSING(conn)						\
> @@ -1137,6 +1144,9 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
>  
>  		if ((int)sendq > SNDBUF_GET(conn)) /* Due to memory pressure? */
>  			limit = 0;
> +		else if ((int)tinfo->tcpi_snd_wnd > SNDBUF_GET(conn) &&
> +			 tinfo->tcpi_bytes_acked > SHORT_CONN_BYTES)

This is pretty subtle, I think it would be worth having some rationale
in a comment, not just the commit message.

> +			limit = SNDBUF_BOOST(SNDBUF_GET(conn)) - (int)sendq;
>  		else
>  			limit = SNDBUF_GET(conn) - (int)sendq;
>  
> -- 
> 2.43.0
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson


* Re: [PATCH 7/8] tcp: Send a duplicate ACK also on complete sendmsg() failure
  2025-12-04  7:45 ` [PATCH 7/8] tcp: Send a duplicate ACK also on complete sendmsg() failure Stefano Brivio
@ 2025-12-05  2:35   ` David Gibson
  0 siblings, 0 replies; 25+ messages in thread
From: David Gibson @ 2025-12-05  2:35 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Max Chernoff

[-- Attachment #1: Type: text/plain, Size: 1143 bytes --]

On Thu, Dec 04, 2025 at 08:45:40AM +0100, Stefano Brivio wrote:
> ...in order to trigger a fast retransmit as soon as possible. There's
> no benefit in forcing the sender to wait for a longer time than that.
> 
> We already do this on partial failures (short socket writes), but for
> historical reasons not on complete failures. Make these two cases
> consistent with each other.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  tcp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tcp.c b/tcp.c
> index 454df69..76a9daf 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1965,7 +1965,7 @@ eintr:
>  			goto eintr;
>  
>  		if (errno == EAGAIN || errno == EWOULDBLOCK) {
> -			tcp_send_flag(c, conn, ACK_IF_NEEDED);
> +			tcp_send_flag(c, conn, ACK | DUP_ACK);
>  			return p->count - idx;
>  
>  		}
> -- 
> 2.43.0
> 



* Re: [PATCH 8/8] tcp: Skip redundant ACK on partial sendmsg() failure
  2025-12-04  7:45 ` [PATCH 8/8] tcp: Skip redundant ACK on partial " Stefano Brivio
@ 2025-12-05  2:36   ` David Gibson
  0 siblings, 0 replies; 25+ messages in thread
From: David Gibson @ 2025-12-05  2:36 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Max Chernoff

On Thu, Dec 04, 2025 at 08:45:41AM +0100, Stefano Brivio wrote:
> ...we'll send a duplicate ACK right away in this case, and this
> redundant, earlier check is not just useless, but it might actually
> be harmful as we'll now send a triple ACK which might cause two
> retransmissions.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  tcp.c | 9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index 76a9daf..fc986a2 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1972,13 +1972,10 @@ eintr:
>  		return -1;
>  	}
>  
> -	if (n < (int)(seq_from_tap - conn->seq_from_tap)) {
> +	if (n < (int)(seq_from_tap - conn->seq_from_tap))
>  		partial_send = 1;
> -		conn->seq_from_tap += n;
> -		tcp_send_flag(c, conn, ACK_IF_NEEDED);
> -	} else {
> -		conn->seq_from_tap += n;
> -	}
> +
> +	conn->seq_from_tap += n;
>  
>  out:
>  	if (keep != -1 || partial_send) {
> -- 
> 2.43.0
> 



* Re: [PATCH 2/8] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks
  2025-12-05  1:20     ` Stefano Brivio
@ 2025-12-05  2:49       ` David Gibson
  0 siblings, 0 replies; 25+ messages in thread
From: David Gibson @ 2025-12-05  2:49 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Max Chernoff

On Fri, Dec 05, 2025 at 02:20:08AM +0100, Stefano Brivio wrote:
> On Fri, 5 Dec 2025 10:48:20 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Dec 04, 2025 at 08:45:35AM +0100, Stefano Brivio wrote:
> > > A fixed 10 ms ACK_TIMEOUT timer value served us relatively well
> > > until  
> > 
> > Nit: it's called "ACK_INTERVAL" in the code.
> 
> Oops. I'll change this.

You addressed all my other concerns below, so

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> > > the previous change, because we would generally cause retransmissions
> > > for non-local outbound transfers with relatively high (> 100 Mbps)
> > > bandwidth and non-local but low (< 5 ms) RTT.
> > > 
> > > Now that retransmissions are less frequent, we don't have a proper
> > > trigger to check for acknowledged bytes on the socket, and will
> > > generally block the sender for a significant amount of time while
> > > we could acknowledge more data, instead.
> > > 
> > > Store the RTT reported by the kernel using an approximation (exponent),
> > > to keep flow storage size within two (typical) cachelines. Check for
> > > socket updates when half of this time elapses: it should be a good
> > > indication of the one-way delay we're interested in (peer to us).  
> > 
> > Reasoning based on a one-way delay doesn't quite make sense to me.  We
> > can't know when anything happens at the peer, and - obviously - we can
> > only set a timer starting at an event that occurs on our side.  So, I
> > think only RTT can matter to us, not one-way delay.
> 
> ...except that we might be scheduling the timer at any point *after* we
> sent data, so the outbound delay might be partially elapsed, and the
> one-way (receiving) delay is actually (more) relevant.
> 
> If we had instantaneous receiving of ACK segments, we would need to
> probe much more frequently than the RTT, to monitor the actual progress
> more accurately. Note that transmission rate (including forwarding
> delays) is not constant and might be bursty.

*thinks*... oh, yes, you're right.  The key point I was missing is
that what we're primarily trying to (indirectly) observe here is the
rate at which the receiving process is consuming - i.e. something
occurring locally at the peer - rather than something triggered by our
transmission to the peer.

> But yes, in general it's not much more relevant than the RTT. I could
> drop this part of the commit message.
> 
> > That said, using half the RTT estimate still makes sense to me: we
> > only have an approximation, and halving it gives us a pretty safe
> > lower bound.
> 
> In any case, yes.
> 
> > > Representable values are between 100 us and 12.8 ms, and any value  
> > 
> > Nit: I think Unicode is long enough supported you can use µs
> 
> I prefer to avoid in the code if possible because one might not have
> Unicode support in all the relevant environments with all the relevant
> consoles (I just finished debugging stuff on Alpine...), and at that
> point I'd rather have consistent commit messages.

Eh, ok.

> > > outside this range is clamped to these bounds. This choice appears
> > > to be a good trade-off between additional overhead and throughput.
> > > 
> > > This mechanism partially overlaps with the "low RTT" destinations,
> > > which we use to infer that a socket is connected to an endpoint on
> > > the same machine (while possibly in a different namespace) if the
> > > RTT is reported as 10 us or less.
> > > 
> > > This change doesn't, however, conflict with it: we are reading
> > > TCP_INFO parameters for local connections anyway, so we can always
> > > store the RTT approximation opportunistically.
> > > 
> > > Then, if the RTT is "low", we don't really need a timer to
> > > acknowledge data as we'll always acknowledge everything to the
> > > sender right away. However, we have limited space in the array where
> > > we store addresses of local destinations, so the low RTT property of a
> > > connection might toggle frequently. Because of this, it's actually
> > > helpful to always have the RTT approximation stored.
> > > 
> > > This could probably benefit from a future rework, though, introducing
> > > a more integrated approach between these two mechanisms.  
> > 
> > Right, it feels like it should be possible to combine these
> > mechanisms, but figuring out exactly how isn't trivial.  Problem for
> > another day.
> > 
> > > 
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > ---
> > >  tcp.c      | 29 ++++++++++++++++++++++-------
> > >  tcp_conn.h |  9 +++++++++
> > >  util.c     | 14 ++++++++++++++
> > >  util.h     |  1 +
> > >  4 files changed, 46 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/tcp.c b/tcp.c
> > > index 863ccdb..b00b874 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -202,9 +202,13 @@
> > >   * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
> > >   *   either side, the connection is reset
> > >   *
> > > - * - ACK_INTERVAL elapsed after data segment received from tap without having
> > > + * - RTT / 2 elapsed after data segment received from tap without having
> > >   *   sent an ACK segment, or zero-sized window advertised to tap/guest (flag
> > > - *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent
> > > + *   ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent.
> > > + *
> > > + *   RTT, here, is an approximation of the RTT value reported by the kernel via
> > > + *   TCP_INFO, with a representable range from RTT_STORE_MIN (100 us) to
> > > + *   RTT_STORE_MAX (12.8 ms). The timeout value is clamped accordingly.
> > >   *
> > >   *
> > >   * Summary of data flows (with ESTABLISHED event)
> > > @@ -341,7 +345,6 @@ enum {
> > >  #define MSS_DEFAULT			536
> > >  #define WINDOW_DEFAULT			14600		/* RFC 6928 */
> > >  
> > > -#define ACK_INTERVAL			10		/* ms */
> > >  #define RTO_INIT			1		/* s, RFC 6298 */
> > >  #define RTO_INIT_AFTER_SYN_RETRIES	3		/* s, RFC 6298 */
> > >  #define FIN_TIMEOUT			60
> > > @@ -594,7 +597,9 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > >  	}
> > >  
> > >  	if (conn->flags & ACK_TO_TAP_DUE) {
> > > -		it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > > +		it.it_value.tv_nsec = (long)RTT_GET(conn) * 1000 / 2;
> > > +		static_assert(RTT_STORE_MAX * 1000 / 2 < 1000 * 1000 * 1000,
> > > +			      ".tv_nsec is greater than 1000 * 1000 * 1000");
> > >  	} else if (conn->flags & ACK_FROM_TAP_DUE) {
> > >  		int exp = conn->retries, timeout = RTO_INIT;
> > >  		if (!(conn->events & ESTABLISHED))
> > > @@ -609,9 +614,15 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > >  		it.it_value.tv_sec = ACT_TIMEOUT;
> > >  	}
> > >  
> > > -	flow_dbg(conn, "timer expires in %llu.%03llus",
> > > -		 (unsigned long long)it.it_value.tv_sec,
> > > -		 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
> > > +	if (conn->flags & ACK_TO_TAP_DUE) {
> > > +		flow_trace(conn, "timer expires in %lu.%01llums",
> > > +			   (unsigned long)it.it_value.tv_nsec / 1000 / 1000,
> > > +			   (unsigned long long)it.it_value.tv_nsec / 1000);
> > > +	} else {
> > > +		flow_dbg(conn, "timer expires in %llu.%03llus",
> > > +			 (unsigned long long)it.it_value.tv_sec,
> > > +			 (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
> > > +	}  
> > 
> > One branch is flow_trace(), one is flow_dbg() which doesn't seem
> > correct.
> 
> No, it's intended, it's actually the main reason why I'm changing this
> part.
> 
> Now that we have more frequent timer scheduling on ACK_TO_TAP_DUE, the
> debug logs become unusable if you're trying to debug anything that's
> not related to a specific data transfer.
> 
> > Also, basing the range indirectly on the flags, rather than
> > on the actual numbers in it.it_value seems fragile.
> 
> Flags tell us why we're scheduling a specific timer, and it's only
> on ACK_TO_TAP_DUE that we want to have more fine-grained values.

Ah... ok, that makes sense.

> > But... this seems
> > overly complex for a trace message anyway.  Maybe just use the seconds
> > formatting, but increase the resolution to µs.
> 
> I tried a number of different combinations like that, they are all
> rather inconvenient.

Fair enough.

> 
> > >  
> > >  	if (timerfd_settime(conn->timer, 0, &it, NULL))
> > >  		flow_perror(conn, "failed to set timer");
> > > @@ -1149,6 +1160,10 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> > >  		conn_flag(c, conn, ACK_TO_TAP_DUE);
> > >  
> > >  out:
> > > +	/* Opportunistically store RTT approximation on valid TCP_INFO data */
> > > +	if (tinfo)
> > > +		RTT_SET(conn, tinfo->tcpi_rtt);
> > > +
> > >  	return new_wnd_to_tap       != prev_wnd_to_tap ||
> > >  	       conn->seq_ack_to_tap != prev_ack_to_tap;
> > >  }
> > > diff --git a/tcp_conn.h b/tcp_conn.h
> > > index e36910c..76034f6 100644
> > > --- a/tcp_conn.h
> > > +++ b/tcp_conn.h
> > > @@ -49,6 +49,15 @@ struct tcp_tap_conn {
> > >  #define MSS_SET(conn, mss)	(conn->tap_mss = (mss >> (16 - TCP_MSS_BITS)))
> > >  #define MSS_GET(conn)		(conn->tap_mss << (16 - TCP_MSS_BITS))
> > >  
> > > +#define RTT_EXP_BITS			3
> > > +	unsigned int	rtt_exp		:RTT_EXP_BITS;
> > > +#define RTT_EXP_MAX			MAX_FROM_BITS(RTT_EXP_BITS)
> > > +#define RTT_STORE_MIN			100 /* us, minimum representable */
> > > +#define RTT_STORE_MAX			(RTT_STORE_MIN << RTT_EXP_MAX)
> > > +#define RTT_SET(conn, rtt)						\
> > > +	(conn->rtt_exp = MIN(RTT_EXP_MAX, ilog2(MAX(1, rtt / RTT_STORE_MIN))))
> > > +#define RTT_GET(conn)			(RTT_STORE_MIN << conn->rtt_exp)
> > > +
> > >  	int		sock		:FD_REF_BITS;
> > >  
> > >  	uint8_t		events;
> > > diff --git a/util.c b/util.c
> > > index 4beb7c2..590373d 100644
> > > --- a/util.c
> > > +++ b/util.c
> > > @@ -611,6 +611,9 @@ int __daemon(int pidfile_fd, int devnull_fd)
> > >   * fls() - Find last (most significant) bit set in word
> > >   * @x:		Word
> > >   *
> > > + * Note: unlike ffs() and other implementations of fls(), notably the one from
> > > + * the Linux kernel, the starting position is 0 and not 1, that is, fls(1) = 0.
> > > + *
> > >   * Return: position of most significant bit set, starting from 0, -1 if none
> > >   */
> > >  int fls(unsigned long x)
> > > @@ -626,6 +629,17 @@ int fls(unsigned long x)
> > >  	return y;
> > >  }
> > >  
> > > +/**
> > > + * ilog2() - Integral part (floor) of binary logarithm (logarithm to the base 2)
> > > + * @x:		Argument
> > > + *
> > > + * Return: integral part of binary logarithm of @x, -1 if undefined (if @x is 0)
> > > + */
> > > +int ilog2(unsigned long x)
> > > +{
> > > +	return fls(x);
> > > +}
> > > +
> > >  /**
> > >   * write_file() - Replace contents of file with a string
> > >   * @path:	File to write
> > > diff --git a/util.h b/util.h
> > > index 7bf0701..40de694 100644
> > > --- a/util.h
> > > +++ b/util.h
> > > @@ -230,6 +230,7 @@ int output_file_open(const char *path, int flags);
> > >  void pidfile_write(int fd, pid_t pid);
> > >  int __daemon(int pidfile_fd, int devnull_fd);
> > >  int fls(unsigned long x);
> > > +int ilog2(unsigned long x);
> > >  int write_file(const char *path, const char *buf);
> > >  intmax_t read_file_integer(const char *path, intmax_t fallback);
> > >  int write_all_buf(int fd, const void *buf, size_t len);
> > > -- 
> > > 2.43.0
> 
> -- 
> Stefano
> 
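As an aside, the exponent encoding behind RTT_SET()/RTT_GET() quoted above can be modelled standalone (illustrative sketch mirroring the macros from the patch, with ilog2() - floor of the binary logarithm - reimplemented here for self-containment):

```c
#define RTT_EXP_BITS	3
#define RTT_EXP_MAX	7	/* MAX_FROM_BITS(RTT_EXP_BITS) */
#define RTT_STORE_MIN	100	/* us, minimum representable */

/* Floor of binary logarithm, -1 if undefined (x == 0) */
static int ilog2(unsigned long x)
{
	int y = -1;

	while (x) {
		x >>= 1;
		y++;
	}
	return y;
}

/* Encode an RTT in microseconds as a 3-bit exponent over a 100 us
 * base: representable values are 100 us << 0 .. 100 us << 7 (12.8 ms),
 * anything outside that range is clamped.
 */
static unsigned int rtt_encode(unsigned long rtt_us)
{
	unsigned long q = rtt_us / RTT_STORE_MIN;
	int e = ilog2(q > 0 ? q : 1);

	return e > RTT_EXP_MAX ? RTT_EXP_MAX : (unsigned int)e;
}

static unsigned long rtt_decode(unsigned int exp)
{
	return (unsigned long)RTT_STORE_MIN << exp;
}
```

So a 50 us RTT rounds up to 100 us, 1 ms rounds down to 800 us (floor of the logarithm), and 1 s clamps to 12.8 ms.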



* Re: [PATCH 4/8] tcp: Acknowledge everything if sending buffer is less than SNDBUF_BIG
  2025-12-05  1:20     ` Stefano Brivio
@ 2025-12-05  2:50       ` David Gibson
  2025-12-08  0:19         ` Stefano Brivio
  0 siblings, 1 reply; 25+ messages in thread
From: David Gibson @ 2025-12-05  2:50 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Max Chernoff

On Fri, Dec 05, 2025 at 02:20:12AM +0100, Stefano Brivio wrote:
> On Fri, 5 Dec 2025 11:08:06 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Dec 04, 2025 at 08:45:37AM +0100, Stefano Brivio wrote:
> > > ...instead of checking if it's less than SNDBUF_SMALL, because this
> > > isn't simply an optimisation to coalesce ACK segments: we rely on
> > > having enough data at once from the sender to make the buffer grow
> > > by means of TCP buffer size tuning implemented in the Linux kernel.
> > > 
> > > Use SNDBUF_BIG: above that, we don't need auto-tuning (even though
> > > it might happen). SNDBUF_SMALL is too... small.  
> > 
> > Do you have an idea of how often sndbuf exceeds SNDBUF_BIG?  I'm
> > wondering if by making this change we might have largely eliminated
> > the first branch in practice.
> 
> Before this series, or after 6/8 in this series, it happens quite
> often. It depends on the bandwidth * delay product of course, but at 1
> Gbps and 20 ms RTT we get there in a couple of seconds.
> 
> Maybe 1 MiB would make more sense for typical conditions, but I'd defer
> this to a more adaptive implementation of the whole thing. I think it
> should also depend on the RTT, ideally.

Ok.  Adding that context to the commit message might be useful.
Otherwise,

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> 
> > 
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > ---
> > >  tcp.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/tcp.c b/tcp.c
> > > index e4c5a5b..fbf97a0 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -1079,7 +1079,7 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> > >  	if (bytes_acked_cap && !force_seq &&
> > >  	    !CONN_IS_CLOSING(conn) &&
> > >  	    !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn) &&
> > > -	    (unsigned)SNDBUF_GET(conn) >= SNDBUF_SMALL) {
> > > +	    (unsigned)SNDBUF_GET(conn) >= SNDBUF_BIG) {
> > >  		if (!tinfo) {
> > >  			tinfo = &tinfo_new;
> > >  			if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
> > > -- 
> > > 2.43.0
> 
> -- 
> Stefano
> 



* Re: [PATCH 5/8] tcp: Don't limit window to less-than-MSS values, use zero instead
  2025-12-05  1:20     ` Stefano Brivio
@ 2025-12-05  2:53       ` David Gibson
  0 siblings, 0 replies; 25+ messages in thread
From: David Gibson @ 2025-12-05  2:53 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Max Chernoff

On Fri, Dec 05, 2025 at 02:20:16AM +0100, Stefano Brivio wrote:
> On Fri, 5 Dec 2025 11:35:22 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Dec 04, 2025 at 08:45:38AM +0100, Stefano Brivio wrote:
> > > If the sender uses data clumping (including Nagle's algorithm) for
> > > Silly Window Syndrome (SWS) avoidance, advertising less than a MSS
> > > means the sender might stop sending altogether, and window updates
> > > after a low window condition are just as important as they are in
> > > a zero-window condition.
> > > 
> > > For simplicity, approximate that limit to zero, as we have an
> > > implementation forcing window updates after zero-sized windows.
> > > 
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>  
> > 
> > The logic change looks good to me, so,
> > 
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > 
> > However, a couple of points about the description (both commit message
> > and comment).
> > 
> >  * Nagle's algorithm is certainly related, but it's not clear to me
> >    it's quite the same thing as the sender-side SWS avoidance
> >    algorithm - Nagle's exists for a different purpose, certainly.
> >    RFC 813 doesn't name Nagle's algorithm anywhere,    although that
> >    could because the name wasn't as established at the time.
> 
> Sure, Nagle's algorithm was published almost two years later (RFC 896).
> 
> >  * Since you're referencing RFC 813 anyway, it seems relevant that
> >    what you're doing here is pretty similar to the receiver-side SWS
> >    avoidance algorithm described in section 4.
> 
> The practical problem I observed comes from the "clumping" Linux does
> while sending (and that's implemented as part of Nagle's algorithm).

Ok.  I guess you could look at Nagle's algorithm as (amongst other
things) a refinement of the sender-side anti-SWS algorithm in RFC 813.

> But yes I actually ignored section 4 in all this, I'll mention it
> explicitly.
> 
> > > ---
> > >  tcp.c | 12 ++++++++++++
> > >  1 file changed, 12 insertions(+)
> > > 
> > > diff --git a/tcp.c b/tcp.c
> > > index fbf97a0..2220059 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -1140,6 +1140,18 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> > >  		else
> > >  			limit = SNDBUF_GET(conn) - (int)sendq;
> > >  
> > > +		/* If the sender uses Nagle's algorithm to prevent Silly Window
> > > +		 * Syndrome (SWS, RFC 813 Section 3) it's critical that, should
> > > +		 * the window ever become less than the MSS, we advertise a new
> > > +		 * value once it increases again to be above it.
> > > +		 *
> > > +		 * To this end, for simplicity, approximate a window value below
> > > +		 * the MSS to zero, as we already have mechanisms in place to
> > > +		 * force updates after the window becomes zero.
> > > +		 */
> > > +		if (limit < MSS_GET(conn))
> > > +			limit = 0;
> > > +
> > >  		new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, limit);
> > >  	}
> > >  
> > > -- 
> > > 2.43.0
> 
> -- 
> Stefano
> 



* Re: [PATCH 4/8] tcp: Acknowledge everything if sending buffer is less than SNDBUF_BIG
  2025-12-05  2:50       ` David Gibson
@ 2025-12-08  0:19         ` Stefano Brivio
  0 siblings, 0 replies; 25+ messages in thread
From: Stefano Brivio @ 2025-12-08  0:19 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev, Max Chernoff

On Fri, 5 Dec 2025 13:50:15 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Dec 05, 2025 at 02:20:12AM +0100, Stefano Brivio wrote:
> > On Fri, 5 Dec 2025 11:08:06 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Thu, Dec 04, 2025 at 08:45:37AM +0100, Stefano Brivio wrote:  
> > > > ...instead of checking if it's less than SNDBUF_SMALL, because this
> > > > isn't simply an optimisation to coalesce ACK segments: we rely on
> > > > having enough data at once from the sender to make the buffer grow
> > > > by means of TCP buffer size tuning implemented in the Linux kernel.
> > > > 
> > > > Use SNDBUF_BIG: above that, we don't need auto-tuning (even though
> > > > it might happen). SNDBUF_SMALL is too... small.    
> > > 
> > > Do you have an idea of how often sndbuf exceeds SNDBUF_BIG?  I'm
> > > wondering if by making this change we might have largely eliminated
> > > the first branch in practice.  
> > 
> > Before this series, or after 6/8 in this series, it happens quite
> > often. It depends on the bandwidth * delay product of course, but at 1
> > Gbps and 20 ms RTT we get there in a couple of seconds.
> > 
> > Maybe 1 MiB would make more sense for typical conditions, but I'd defer
> > this to a more adaptive implementation of the whole thing. I think it
> > should also depend on the RTT, ideally.  
> 
> Ok.  Adding that context to the commit message might be useful.

While trying to add that context I came to two realisations:

1. in the past I used the 'netem' qdisc for convoluted things only, so
   much that I forgot how simple it can be for the task at hand:

  $ ./pasta --config-net -I moon0 -- sh -c '/sbin/tc q a dev moon0 root netem delay 1282ms; ping -c1 2600::'
  PING 2600:: (2600::) 56 data bytes
  64 bytes from 2600::: icmp_seq=1 ttl=255 time=1298 ms

2. while there don't seem to be public iperf3 instances on the moon,
   there are indeed servers in Auckland and Tashkent

...hence v2, which implements the adaptive implementation I was
referring to. It's rather simplistic and can/should be improved further
but it's a big improvement on the existing situation.

-- 
Stefano



* Re: [PATCH 6/8] tcp: Allow exceeding the available sending buffer size in window advertisements
  2025-12-05  2:34   ` David Gibson
@ 2025-12-08  0:20     ` Stefano Brivio
  0 siblings, 0 replies; 25+ messages in thread
From: Stefano Brivio @ 2025-12-08  0:20 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev, Max Chernoff

On Fri, 5 Dec 2025 13:34:07 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Dec 04, 2025 at 08:45:39AM +0100, Stefano Brivio wrote:
> > ...under two conditions:
> > 
> > - the remote peer is advertising a bigger value to us, meaning that a
> >   bigger sending buffer is likely to benefit throughput, AND  
> 
> I think this condition is redundant: if the remote peer is advertising
> less, we'll clamp new_wnd_to_tap to that value anyway.

I almost fell for this. We have a subtractive term in the expression,
so it's not actually the case.

If the remote peer is advertising a smaller window, we just take buffer
size *minus pending bytes*, as limit, which can be smaller compared to
the window advertised by the peer.

If it's advertising a bigger window, we take an increased buffer size
minus pending bytes, as limit, which can be bigger than the peer's
window, so we'll use the peer's window as limit instead.

I added an example in v2 (now 7/9).
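To make the subtractive term concrete, here is a standalone sketch of the limit selection (illustrative only: SNDBUF_GET(), sendq and the peer window are folded into plain parameters, and the SHORT_CONN_BYTES condition is omitted for brevity):

```c
#define SNDBUF_BOOST_FACTOR	150 /* % */
#define SNDBUF_BOOST(x)		((x) * SNDBUF_BOOST_FACTOR / 100)

/* Sketch of the window limit selection from the patch. All values in
 * bytes: sndbuf is the sending buffer size, sendq the pending (unsent)
 * bytes, peer_wnd the window advertised by the remote peer.
 */
static int wnd_limit(int sndbuf, int sendq, int peer_wnd)
{
	int limit;

	if (sendq > sndbuf)		/* due to memory pressure? */
		limit = 0;
	else if (peer_wnd > sndbuf)	/* peer allows more: boost buffer */
		limit = SNDBUF_BOOST(sndbuf) - sendq;
	else
		limit = sndbuf - sendq;

	return peer_wnd < limit ? peer_wnd : limit;	/* MIN() */
}
```

With sndbuf = 100000 and sendq = 40000: a peer window of 80000 (smaller than the buffer) yields 60000, buffer minus pending bytes, which is below the peer's window; a peer window of 200000 yields 110000, the boosted buffer minus pending bytes, exceeding the 60000 actually available and so nudging the kernel's auto-tuning.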

> > - this is not a short-lived connection, where the latency cost of
> >   retransmissions would be otherwise unacceptable.
> > 
> > By doing this, we can reliably trigger TCP buffer size auto-tuning (as
> > long as it's available) on bulk data transfers.
> > 
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> >  tcp.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/tcp.c b/tcp.c
> > index 2220059..454df69 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -353,6 +353,13 @@ enum {
> >  #define LOW_RTT_TABLE_SIZE		8
> >  #define LOW_RTT_THRESHOLD		10 /* us */
> >  
> > +/* Try to avoid retransmissions to improve latency on short-lived connections */
> > +#define SHORT_CONN_BYTES		(16ULL * 1024 * 1024)
> > +
> > +/* Temporarily exceed available sending buffer to force TCP auto-tuning */
> > +#define SNDBUF_BOOST_FACTOR		150 /* % */
> > +#define SNDBUF_BOOST(x)			((x) * SNDBUF_BOOST_FACTOR / 100)  
> 
> For the short term, the fact this works empirically is enough.  For
> the longer term, it would be nice to have a better understanding of
> what this "overcommit" amount is actually estimating.
> 
> I think what we're looking for is an estimate of the number of bytes
> that will have left the buffer by the time the guest gets back to us.  So:
> 	<connection throughput> * <guest-side RTT>

I don't think we want the bandwidth-delay product here (which I'm now
using earlier in the series) because the purpose here is to grow the
buffer at the beginning of a connection, if it looks like bulk traffic.

So we want to progressively exploit auto-tuning as long as we're
limited by a small buffer, but not later. At some point we want to
finally switch to the window advertised by the peer.

Well, I tried with the bandwidth-delay product in any case, but it's
not really helping with auto-tuning. It turns out that auto-tuning is
fundamentally different at the beginning anyway.

> Alas, I don't see a way to estimate either of those from the
> information we already track - we'd need additional bookkeeping.

It's all in struct tcp_info: it's called tcpi_delivery_rate. There are
other interesting bits there, by the way, that could be used in a
further refinement.

> >  #define ACK_IF_NEEDED	0		/* See tcp_send_flag() */
> >  
> >  #define CONN_IS_CLOSING(conn)						\
> > @@ -1137,6 +1144,9 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> >  
> >  		if ((int)sendq > SNDBUF_GET(conn)) /* Due to memory pressure? */
> >  			limit = 0;
> > +		else if ((int)tinfo->tcpi_snd_wnd > SNDBUF_GET(conn) &&
> > +			 tinfo->tcpi_bytes_acked > SHORT_CONN_BYTES)  
> 
> This is pretty subtle, I think it would be worth having some rationale
> in a comment, not just the commit message.

I turned the macro into a new function and added comments there, in v2.

> > +			limit = SNDBUF_BOOST(SNDBUF_GET(conn)) - (int)sendq;
> >  		else
> >  			limit = SNDBUF_GET(conn) - (int)sendq;
> >  
> > -- 
> > 2.43.0

-- 
Stefano



end of thread, other threads:[~2025-12-08  0:21 UTC | newest]

Thread overview: 25+ messages
2025-12-04  7:45 [PATCH 0/8] tcp: Fix throughput issues with non-local peers Stefano Brivio
2025-12-04  7:45 ` [PATCH 1/8] tcp: Limit advertised window to available, not total sending buffer size Stefano Brivio
2025-12-04 23:10   ` David Gibson
2025-12-04  7:45 ` [PATCH 2/8] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks Stefano Brivio
2025-12-04 23:48   ` David Gibson
2025-12-05  1:20     ` Stefano Brivio
2025-12-05  2:49       ` David Gibson
2025-12-04  7:45 ` [PATCH 3/8] tcp: Don't clear ACK_TO_TAP_DUE if we're advertising a zero-sized window Stefano Brivio
2025-12-04 23:50   ` David Gibson
2025-12-04  7:45 ` [PATCH 4/8] tcp: Acknowledge everything if sending buffer is less than SNDBUF_BIG Stefano Brivio
2025-12-05  0:08   ` David Gibson
2025-12-05  1:20     ` Stefano Brivio
2025-12-05  2:50       ` David Gibson
2025-12-08  0:19         ` Stefano Brivio
2025-12-04  7:45 ` [PATCH 5/8] tcp: Don't limit window to less-than-MSS values, use zero instead Stefano Brivio
2025-12-05  0:35   ` David Gibson
2025-12-05  1:20     ` Stefano Brivio
2025-12-05  2:53       ` David Gibson
2025-12-04  7:45 ` [PATCH 6/8] tcp: Allow exceeding the available sending buffer size in window advertisements Stefano Brivio
2025-12-05  2:34   ` David Gibson
2025-12-08  0:20     ` Stefano Brivio
2025-12-04  7:45 ` [PATCH 7/8] tcp: Send a duplicate ACK also on complete sendmsg() failure Stefano Brivio
2025-12-05  2:35   ` David Gibson
2025-12-04  7:45 ` [PATCH 8/8] tcp: Skip redundant ACK on partial " Stefano Brivio
2025-12-05  2:36   ` David Gibson

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt
