* [PATCH v2 0/9] tcp: Fix throughput issues with non-local peers
@ 2025-12-08 0:22 Stefano Brivio
2025-12-08 0:22 ` [PATCH v2 1/9] tcp, util: Add function for scaling to linearly interpolated factor, use it Stefano Brivio
` (9 more replies)
0 siblings, 10 replies; 23+ messages in thread
From: Stefano Brivio @ 2025-12-08 0:22 UTC (permalink / raw)
To: passt-dev; +Cc: Max Chernoff, David Gibson
Patch 2/9 is the most relevant fix here, as we currently advertise a
window that might be too big for what we can write to the socket,
causing retransmissions right away and occasional high latency on
short transfers to non-local peers.
Mostly as a consequence of that fix, we now need several
improvements and small fixes, most notably an adaptive approach to
picking the interval between checks for socket-side ACKs (patch 3/9),
and several tricks to reliably trigger the TCP buffer size auto-tuning
implemented by the Linux kernel (patches 5/9 and 7/9).
These changes make some existing issues more visible; the remaining
patches fix those.
With this series, I'm getting the expected (wirespeed) throughput for
transfers between peers with varying non-local RTTs: I checked
different guests bridged on the same machine (~500 us) and hosts at
increasing distances using iperf3, as well as HTTP transfers, limited
to hosts I control (the 500 us and 5 ms cases).
With increasing RTTs, I can finally see the throughput converging to
the available bandwidth reasonably fast:
* 500 us RTT, ~10 Gbps available:
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 785 MBytes 6.57 Gbits/sec 13 1.84 MBytes
[ 5] 1.00-2.00 sec 790 MBytes 6.64 Gbits/sec 0 1.92 MBytes
* 5 ms, 1 Gbps:
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 4.88 MBytes 40.9 Mbits/sec 22 240 KBytes
[ 5] 1.00-2.00 sec 46.2 MBytes 388 Mbits/sec 34 900 KBytes
[ 5] 2.00-3.00 sec 110 MBytes 923 Mbits/sec 0 1.11 MBytes
* 10 ms, 1 Gbps:
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 67.9 MBytes 569 Mbits/sec 2 960 KBytes
[ 5] 1.00-2.00 sec 110 MBytes 926 Mbits/sec 0 1.05 MBytes
[ 5] 2.00-3.00 sec 111 MBytes 934 Mbits/sec 0 1.17 MBytes
* 24 ms, 1 Gbps:
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 2.50 MBytes 20.9 Mbits/sec 16 240 KBytes
[ 5] 1.00-2.00 sec 1.50 MBytes 12.6 Mbits/sec 9 120 KBytes
[ 5] 2.00-3.00 sec 99.2 MBytes 832 Mbits/sec 4 2.40 MBytes
[ 5] 3.00-4.00 sec 122 MBytes 1.03 Gbits/sec 1 3.16 MBytes
[ 5] 4.00-5.00 sec 119 MBytes 1.00 Gbits/sec 0 4.16 MBytes
* 40 ms, ~600 Mbps:
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 2.12 MBytes 17.8 Mbits/sec 0 600 KBytes
[ 5] 1.00-2.00 sec 3.25 MBytes 27.3 Mbits/sec 14 420 KBytes
[ 5] 2.00-3.00 sec 31.5 MBytes 264 Mbits/sec 11 1.29 MBytes
[ 5] 3.00-4.00 sec 72.5 MBytes 608 Mbits/sec 0 1.46 MBytes
* 100 ms, 1 Gbps:
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 4.88 MBytes 40.9 Mbits/sec 1 840 KBytes
[ 5] 1.00-2.00 sec 1.62 MBytes 13.6 Mbits/sec 9 240 KBytes
[ 5] 2.00-3.00 sec 5.25 MBytes 44.0 Mbits/sec 5 780 KBytes
[ 5] 3.00-4.00 sec 9.75 MBytes 81.8 Mbits/sec 0 1.29 MBytes
[ 5] 4.00-5.00 sec 15.8 MBytes 132 Mbits/sec 0 1.99 MBytes
[ 5] 5.00-6.00 sec 22.9 MBytes 192 Mbits/sec 0 3.05 MBytes
[ 5] 6.00-7.00 sec 132 MBytes 1.11 Gbits/sec 0 7.62 MBytes
* 114 ms, 1 Gbps:
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 1.62 MBytes 13.6 Mbits/sec 8 420 KBytes
[ 5] 1.00-2.00 sec 2.12 MBytes 17.8 Mbits/sec 15 120 KBytes
[ 5] 2.00-3.00 sec 26.0 MBytes 218 Mbits/sec 50 1.82 MBytes
[ 5] 3.00-4.00 sec 103 MBytes 865 Mbits/sec 31 5.10 MBytes
[ 5] 4.00-5.00 sec 111 MBytes 930 Mbits/sec 0 5.92 MBytes
* 153 ms, 1 Gbps:
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 1.12 MBytes 9.43 Mbits/sec 2 180 KBytes
[ 5] 1.00-2.00 sec 2.12 MBytes 17.8 Mbits/sec 11 180 KBytes
[ 5] 2.00-3.00 sec 12.6 MBytes 106 Mbits/sec 40 1.29 MBytes
[ 5] 3.00-4.00 sec 44.5 MBytes 373 Mbits/sec 22 2.75 MBytes
[ 5] 4.00-5.00 sec 86.0 MBytes 721 Mbits/sec 0 6.68 MBytes
[ 5] 5.00-6.00 sec 120 MBytes 1.01 Gbits/sec 119 6.97 MBytes
[ 5] 6.00-7.00 sec 110 MBytes 927 Mbits/sec 0 6.97 MBytes
* 186 ms, 1 Gbps:
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 1.12 MBytes 9.43 Mbits/sec 3 180 KBytes
[ 5] 1.00-2.00 sec 512 KBytes 4.19 Mbits/sec 4 120 KBytes
[ 5] 2.00-3.00 sec 2.12 MBytes 17.8 Mbits/sec 6 360 KBytes
[ 5] 3.00-4.00 sec 27.1 MBytes 228 Mbits/sec 6 1.11 MBytes
[ 5] 4.00-5.00 sec 38.2 MBytes 321 Mbits/sec 0 1.99 MBytes
[ 5] 5.00-6.00 sec 38.2 MBytes 321 Mbits/sec 0 2.46 MBytes
[ 5] 6.00-7.00 sec 69.2 MBytes 581 Mbits/sec 71 3.63 MBytes
[ 5] 7.00-8.00 sec 110 MBytes 919 Mbits/sec 0 5.92 MBytes
* 271 ms, 1 Gbps:
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 1.12 MBytes 9.43 Mbits/sec 0 320 KBytes
[ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 0 600 KBytes
[ 5] 2.00-3.00 sec 512 KBytes 4.19 Mbits/sec 7 420 KBytes
[ 5] 3.00-4.00 sec 896 KBytes 7.34 Mbits/sec 8 60.0 KBytes
[ 5] 4.00-5.00 sec 2.62 MBytes 22.0 Mbits/sec 13 420 KBytes
[ 5] 5.00-6.00 sec 12.1 MBytes 102 Mbits/sec 7 1020 KBytes
[ 5] 6.00-7.00 sec 19.9 MBytes 167 Mbits/sec 0 1.82 MBytes
[ 5] 7.00-8.00 sec 17.9 MBytes 150 Mbits/sec 44 1.76 MBytes
[ 5] 8.00-9.00 sec 57.4 MBytes 481 Mbits/sec 30 2.70 MBytes
[ 5] 9.00-10.00 sec 88.0 MBytes 738 Mbits/sec 0 6.45 MBytes
* 292 ms, ~600 Mbps:
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 1.12 MBytes 9.43 Mbits/sec 0 450 KBytes
[ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 3 180 KBytes
[ 5] 2.00-3.00 sec 640 KBytes 5.24 Mbits/sec 4 120 KBytes
[ 5] 3.00-4.00 sec 384 KBytes 3.15 Mbits/sec 2 120 KBytes
[ 5] 4.00-5.00 sec 4.50 MBytes 37.7 Mbits/sec 1 660 KBytes
[ 5] 5.00-6.00 sec 13.0 MBytes 109 Mbits/sec 3 960 KBytes
[ 5] 6.00-7.00 sec 64.5 MBytes 541 Mbits/sec 0 2.17 MBytes
* 327 ms, 1 Gbps:
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 1.12 MBytes 9.43 Mbits/sec 0 600 KBytes
[ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 4 240 KBytes
[ 5] 2.00-3.00 sec 768 KBytes 6.29 Mbits/sec 4 120 KBytes
[ 5] 3.00-4.00 sec 1.62 MBytes 13.6 Mbits/sec 5 120 KBytes
[ 5] 4.00-5.00 sec 1.88 MBytes 15.7 Mbits/sec 0 480 KBytes
[ 5] 5.00-6.00 sec 17.6 MBytes 148 Mbits/sec 14 1.05 MBytes
[ 5] 6.00-7.00 sec 35.1 MBytes 295 Mbits/sec 0 2.58 MBytes
[ 5] 7.00-8.00 sec 45.2 MBytes 380 Mbits/sec 0 4.63 MBytes
[ 5] 8.00-9.00 sec 27.0 MBytes 226 Mbits/sec 96 3.93 MBytes
[ 5] 9.00-10.00 sec 85.9 MBytes 720 Mbits/sec 67 4.22 MBytes
[ 5] 10.00-11.00 sec 118 MBytes 986 Mbits/sec 0 9.67 MBytes
[ 5] 11.00-12.00 sec 124 MBytes 1.04 Gbits/sec 0 15.9 MBytes
For short transfers, we strictly stick to the available sending buffer
size, which (almost) guarantees we avoid local retransmissions, and
significantly decreases transfer time as a result: from 1.2 s to 60 ms
for a 5 MB HTTP transfer from a container hosted in a virtual machine
to another guest.
v2:
- Add 1/9, factoring out a generic version of the scaling function we
already use for tcp_get_sndbuf(), as I'm now using it in 7/9 as well
- In 3/9, use 4 bits instead of 3 to represent the RTT, which is
important as we now use RTT values for more than just the ACK checks
- In 5/9, instead of just comparing the sending buffer to SNDBUF_BIG
to decide when to acknowledge data, use an adaptive approach based on
the bandwidth-delay product
- In 6/9, clarify the relationship between SWS avoidance and Nagle's
algorithm, and introduce a reference to RFC 813, Section 4
- In 7/9, use an adaptive approach based on the product of bytes sent
(and acknowledged) so far and the RTT, instead of the previous
approach based on bytes sent only, as it lets us converge to the
expected throughput much faster with high-RTT destinations
Stefano Brivio (9):
tcp, util: Add function for scaling to linearly interpolated factor,
use it
tcp: Limit advertised window to available, not total sending buffer
size
tcp: Adaptive interval based on RTT for socket-side acknowledgement
checks
tcp: Don't clear ACK_TO_TAP_DUE if we're advertising a zero-sized
window
tcp: Acknowledge everything if it looks like bulk traffic, not
interactive
tcp: Don't limit window to less-than-MSS values, use zero instead
tcp: Allow exceeding the available sending buffer size in window
advertisements
tcp: Send a duplicate ACK also on complete sendmsg() failure
tcp: Skip redundant ACK on partial sendmsg() failure
README.md | 2 +-
tcp.c | 168 ++++++++++++++++++++++++++++++++++++++++++-----------
tcp_conn.h | 9 +++
util.c | 52 +++++++++++++++++
util.h | 2 +
5 files changed, 197 insertions(+), 36 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH v2 1/9] tcp, util: Add function for scaling to linearly interpolated factor, use it
2025-12-08 0:22 [PATCH v2 0/9] tcp: Fix throughput issues with non-local peers Stefano Brivio
@ 2025-12-08 0:22 ` Stefano Brivio
2025-12-08 5:33 ` David Gibson
2025-12-08 0:22 ` [PATCH v2 2/9] tcp: Limit advertised window to available, not total sending buffer size Stefano Brivio
` (8 subsequent siblings)
9 siblings, 1 reply; 23+ messages in thread
From: Stefano Brivio @ 2025-12-08 0:22 UTC (permalink / raw)
To: passt-dev; +Cc: Max Chernoff, David Gibson
Right now, the only need for this kind of function comes from
tcp_get_sndbuf(), which calculates the amount of sending buffer we
want to use depending on its own size: we want to use more of it
if it's smaller, as bookkeeping overhead is usually lower and we rely
on auto-tuning there, and use less of it when it's bigger.
For this purpose, the new function is overly generic and its name is
a mouthful: here, @x and @y are the same value, that is, we want to
use more or less of the buffer depending on the size of the buffer
itself.
However, an upcoming change will need that generality, as we'll want
to scale the amount of sending buffer we use depending on another
(scaled) factor.
While at it, now that we have this new function, which makes it simple
to specify a precise usage factor, change the amount of sending buffer
we want to use at and above 4 MiB: 75% looks perfectly safe.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
tcp.c | 8 ++------
util.c | 38 ++++++++++++++++++++++++++++++++++++++
util.h | 1 +
3 files changed, 41 insertions(+), 6 deletions(-)
diff --git a/tcp.c b/tcp.c
index bb661ee..37012cc 100644
--- a/tcp.c
+++ b/tcp.c
@@ -773,7 +773,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
}
/**
- * tcp_get_sndbuf() - Get, scale SO_SNDBUF between thresholds (1 to 0.5 usage)
+ * tcp_get_sndbuf() - Get, scale SO_SNDBUF between thresholds (1 to 0.75 usage)
* @conn: Connection pointer
*/
static void tcp_get_sndbuf(struct tcp_tap_conn *conn)
@@ -788,11 +788,7 @@ static void tcp_get_sndbuf(struct tcp_tap_conn *conn)
return;
}
- v = sndbuf;
- if (v >= SNDBUF_BIG)
- v /= 2;
- else if (v > SNDBUF_SMALL)
- v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;
+ v = scale_x_to_y_slope(sndbuf, sndbuf, SNDBUF_SMALL, SNDBUF_BIG, 75);
SNDBUF_SET(conn, MIN(INT_MAX, v));
}
diff --git a/util.c b/util.c
index f32c9cb..ff0ba01 100644
--- a/util.c
+++ b/util.c
@@ -1223,3 +1223,41 @@ void fsync_pcap_and_log(void)
if (log_file != -1)
(void)fsync(log_file);
}
+
+/**
+ * scale_x_to_y_slope() - Scale @x from 100% to f% depending on @y's value
+ * @x: Value to scale
+ * @y: Value determining scaling
+ * @lo: Lower bound for @y (start of y-axis slope)
+ * @hi: Upper bound for @y (end of y-axis slope)
+ * @f: Scaling factor, percent
+ *
+ * Return: @x scaled by @f * linear interpolation of @y between @lo and @hi
+ *
+ * In pictures:
+ *
+ * f % -> ,---- * If @y < lo (for example, @y is y0), return @x
+ * /| |
+ * / | | * If @lo < @y < @hi (for example, @y is y1),
+ * / | | return @x scaled by a factor linearly
+ * (100 + f) / 2 % ->/ | | interpolated between 100% and f% depending on
+ * /| | | @y's position between @lo (100%) and @hi (f%)
+ * / | | |
+ * / | | | * If @y > @hi (for example, @y is y2), return
+ * 100 % -> -----' | | | @x * @f / 100
+ * | | | | |
+ * y0 lo y1 hi y2 Example: @f = 150, @lo = 10, @hi = 20, @y = 15,
+ * @x = 1000
+ * -> interpolated factor is 125%
+ * -> return 1250
+ */
+long scale_x_to_y_slope(long x, long y, long lo, long hi, long f)
+{
+ if (y < lo)
+ return x;
+
+ if (y > hi)
+ return x * f / 100;
+
+ return x - (x * (y - lo) / (hi - lo)) * (100 - f) / 100;
+}
diff --git a/util.h b/util.h
index 17f5ae0..ec75453 100644
--- a/util.h
+++ b/util.h
@@ -242,6 +242,7 @@ int read_remainder(int fd, const struct iovec *iov, size_t cnt, size_t skip);
void close_open_files(int argc, char **argv);
bool snprintf_check(char *str, size_t size, const char *format, ...);
void fsync_pcap_and_log(void);
+long scale_x_to_y_slope(long x, long y, long lo, long hi, long f);
/**
* af_name() - Return name of an address family
--
2.43.0
* [PATCH v2 2/9] tcp: Limit advertised window to available, not total sending buffer size
2025-12-08 0:22 [PATCH v2 0/9] tcp: Fix throughput issues with non-local peers Stefano Brivio
2025-12-08 0:22 ` [PATCH v2 1/9] tcp, util: Add function for scaling to linearly interpolated factor, use it Stefano Brivio
@ 2025-12-08 0:22 ` Stefano Brivio
2025-12-08 0:22 ` [PATCH v2 3/9] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks Stefano Brivio
` (7 subsequent siblings)
9 siblings, 0 replies; 23+ messages in thread
From: Stefano Brivio @ 2025-12-08 0:22 UTC (permalink / raw)
To: passt-dev; +Cc: Max Chernoff, David Gibson
For non-local connections, we advertise the same window size as what
the peer in turn advertises to us, and limit it to the buffer size
reported via SO_SNDBUF.
That's not quite correct: in order to later avoid failures while
queueing data to the socket, we need to limit the window to the
available buffer size, not the total one.
Use the SIOCOUTQ ioctl and subtract the number of outbound queued
bytes from the total buffer size, then clamp to this value.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
README.md | 2 +-
tcp.c | 18 ++++++++++++++++--
2 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/README.md b/README.md
index 897ae8b..8fdc0a3 100644
--- a/README.md
+++ b/README.md
@@ -291,7 +291,7 @@ speeding up local connections, and usually requiring NAT. _pasta_:
* ✅ all capabilities dropped, other than `CAP_NET_BIND_SERVICE` (if granted)
* ✅ with default options, user, mount, IPC, UTS, PID namespaces are detached
* ✅ no external dependencies (other than a standard C library)
-* ✅ restrictive seccomp profiles (33 syscalls allowed for _passt_, 43 for
+* ✅ restrictive seccomp profiles (34 syscalls allowed for _passt_, 43 for
_pasta_ on x86_64)
* ✅ examples of [AppArmor](/passt/tree/contrib/apparmor) and
[SELinux](/passt/tree/contrib/selinux) profiles available
diff --git a/tcp.c b/tcp.c
index 37012cc..951f434 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1026,6 +1026,8 @@ void tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
* @tinfo: tcp_info from kernel, can be NULL if not pre-fetched
*
* Return: 1 if sequence or window were updated, 0 otherwise
+ *
+ * #syscalls ioctl
*/
int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
bool force_seq, struct tcp_info_linux *tinfo)
@@ -1108,9 +1110,21 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
if ((conn->flags & LOCAL) || tcp_rtt_dst_low(conn)) {
new_wnd_to_tap = tinfo->tcpi_snd_wnd;
} else {
+ uint32_t sendq;
+ int limit;
+
+ if (ioctl(s, SIOCOUTQ, &sendq)) {
+ debug_perror("SIOCOUTQ on socket %i, assuming 0", s);
+ sendq = 0;
+ }
tcp_get_sndbuf(conn);
- new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd,
- SNDBUF_GET(conn));
+
+ if ((int)sendq > SNDBUF_GET(conn)) /* Due to memory pressure? */
+ limit = 0;
+ else
+ limit = SNDBUF_GET(conn) - (int)sendq;
+
+ new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, limit);
}
new_wnd_to_tap = MIN(new_wnd_to_tap, MAX_WINDOW);
--
2.43.0
* [PATCH v2 3/9] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks
2025-12-08 0:22 [PATCH v2 0/9] tcp: Fix throughput issues with non-local peers Stefano Brivio
2025-12-08 0:22 ` [PATCH v2 1/9] tcp, util: Add function for scaling to linearly interpolated factor, use it Stefano Brivio
2025-12-08 0:22 ` [PATCH v2 2/9] tcp: Limit advertised window to available, not total sending buffer size Stefano Brivio
@ 2025-12-08 0:22 ` Stefano Brivio
2025-12-08 5:41 ` David Gibson
2025-12-08 0:22 ` [PATCH v2 4/9] tcp: Don't clear ACK_TO_TAP_DUE if we're advertising a zero-sized window Stefano Brivio
` (6 subsequent siblings)
9 siblings, 1 reply; 23+ messages in thread
From: Stefano Brivio @ 2025-12-08 0:22 UTC (permalink / raw)
To: passt-dev; +Cc: Max Chernoff, David Gibson
A fixed 10 ms ACK_INTERVAL timer value served us relatively well until
the previous change, because we would generally cause retransmissions
for non-local outbound transfers with relatively high (> 100 Mbps)
bandwidth and non-local but low (< 5 ms) RTT.
Now that retransmissions are less frequent, we don't have a proper
trigger to check for acknowledged bytes on the socket, and will
generally block the sender for a significant amount of time while
we could acknowledge more data, instead.
Store the RTT reported by the kernel using an approximation (exponent),
to keep flow storage size within two (typical) cachelines. Check for
socket updates when half of this time elapses: it should be a good
indication of the one-way delay we're interested in (peer to us).
Representable values are between 100 us and 3.2768 s, and any value
outside this range is clamped to these bounds. This choice appears
to be a good trade-off between additional overhead and throughput.
This mechanism partially overlaps with the tracking of "low RTT"
destinations, which we use to infer that a socket is connected to an
endpoint on the same machine (possibly in a different namespace) if
the RTT is reported as 10 us or less.
This change doesn't, however, conflict with it: we are reading
TCP_INFO parameters for local connections anyway, so we can always
store the RTT approximation opportunistically.
Then, if the RTT is "low", we don't really need a timer to
acknowledge data, as we always acknowledge everything to the sender
right away. However, we have limited space in the array where we store
the addresses of local destinations, so the low-RTT property of a
connection might toggle frequently. Because of this, it's actually
helpful to always have the RTT approximation stored.
This could probably benefit from a future rework, though, introducing
a more integrated approach between these two mechanisms.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
tcp.c | 28 +++++++++++++++++++++-------
tcp_conn.h | 9 +++++++++
util.c | 14 ++++++++++++++
util.h | 1 +
4 files changed, 45 insertions(+), 7 deletions(-)
diff --git a/tcp.c b/tcp.c
index 951f434..8eeef4c 100644
--- a/tcp.c
+++ b/tcp.c
@@ -202,9 +202,13 @@
* - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
* either side, the connection is reset
*
- * - ACK_INTERVAL elapsed after data segment received from tap without having
+ * - RTT / 2 elapsed after data segment received from tap without having
* sent an ACK segment, or zero-sized window advertised to tap/guest (flag
- * ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent
+ * ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent.
+ *
+ * RTT, here, is an approximation of the RTT value reported by the kernel via
+ * TCP_INFO, with a representable range from RTT_STORE_MIN (100 us) to
+ * RTT_STORE_MAX (3276.8 ms). The timeout value is clamped accordingly.
*
*
* Summary of data flows (with ESTABLISHED event)
@@ -341,7 +345,6 @@ enum {
#define MSS_DEFAULT 536
#define WINDOW_DEFAULT 14600 /* RFC 6928 */
-#define ACK_INTERVAL 10 /* ms */
#define RTO_INIT 1 /* s, RFC 6298 */
#define RTO_INIT_AFTER_SYN_RETRIES 3 /* s, RFC 6298 */
#define FIN_TIMEOUT 60
@@ -593,7 +596,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
}
if (conn->flags & ACK_TO_TAP_DUE) {
- it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
+ it.it_value.tv_sec = RTT_GET(conn) / 2 / (1000 * 1000);
+ it.it_value.tv_nsec = RTT_GET(conn) / 2 % (1000 * 1000) * 1000;
} else if (conn->flags & ACK_FROM_TAP_DUE) {
int exp = conn->retries, timeout = RTO_INIT;
if (!(conn->events & ESTABLISHED))
@@ -608,9 +612,15 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
it.it_value.tv_sec = ACT_TIMEOUT;
}
- flow_dbg(conn, "timer expires in %llu.%03llus",
- (unsigned long long)it.it_value.tv_sec,
- (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
+ if (conn->flags & ACK_TO_TAP_DUE) {
+ flow_trace(conn, "timer expires in %lu.%01llums",
+ (unsigned long)it.it_value.tv_nsec / 1000 / 1000,
+ (unsigned long long)it.it_value.tv_nsec / 1000);
+ } else {
+ flow_dbg(conn, "timer expires in %llu.%03llus",
+ (unsigned long long)it.it_value.tv_sec,
+ (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
+ }
if (timerfd_settime(conn->timer, 0, &it, NULL))
flow_perror(conn, "failed to set timer");
@@ -1144,6 +1154,10 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
conn_flag(c, conn, ACK_TO_TAP_DUE);
out:
+ /* Opportunistically store RTT approximation on valid TCP_INFO data */
+ if (tinfo)
+ RTT_SET(conn, tinfo->tcpi_rtt);
+
return new_wnd_to_tap != prev_wnd_to_tap ||
conn->seq_ack_to_tap != prev_ack_to_tap;
}
diff --git a/tcp_conn.h b/tcp_conn.h
index e36910c..9c6ff9e 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -49,6 +49,15 @@ struct tcp_tap_conn {
#define MSS_SET(conn, mss) (conn->tap_mss = (mss >> (16 - TCP_MSS_BITS)))
#define MSS_GET(conn) (conn->tap_mss << (16 - TCP_MSS_BITS))
+#define RTT_EXP_BITS 4
+ unsigned int rtt_exp :RTT_EXP_BITS;
+#define RTT_EXP_MAX MAX_FROM_BITS(RTT_EXP_BITS)
+#define RTT_STORE_MIN 100 /* us, minimum representable */
+#define RTT_STORE_MAX ((long)(RTT_STORE_MIN << RTT_EXP_MAX))
+#define RTT_SET(conn, rtt) \
+ (conn->rtt_exp = MIN(RTT_EXP_MAX, ilog2(MAX(1, rtt / RTT_STORE_MIN))))
+#define RTT_GET(conn) (RTT_STORE_MIN << conn->rtt_exp)
+
int sock :FD_REF_BITS;
uint8_t events;
diff --git a/util.c b/util.c
index ff0ba01..e78e10d 100644
--- a/util.c
+++ b/util.c
@@ -614,6 +614,9 @@ int __daemon(int pidfile_fd, int devnull_fd)
* fls() - Find last (most significant) bit set in word
* @x: Word
*
+ * Note: unlike ffs() and other implementations of fls(), notably the one from
+ * the Linux kernel, the starting position is 0 and not 1, that is, fls(1) = 0.
+ *
* Return: position of most significant bit set, starting from 0, -1 if none
*/
int fls(unsigned long x)
@@ -629,6 +632,17 @@ int fls(unsigned long x)
return y;
}
+/**
+ * ilog2() - Integral part (floor) of binary logarithm (logarithm to the base 2)
+ * @x: Argument
+ *
+ * Return: integral part of binary logarithm of @x, -1 if undefined (if @x is 0)
+ */
+int ilog2(unsigned long x)
+{
+ return fls(x);
+}
+
/**
* write_file() - Replace contents of file with a string
* @path: File to write
diff --git a/util.h b/util.h
index ec75453..5c205a5 100644
--- a/util.h
+++ b/util.h
@@ -233,6 +233,7 @@ int output_file_open(const char *path, int flags);
void pidfile_write(int fd, pid_t pid);
int __daemon(int pidfile_fd, int devnull_fd);
int fls(unsigned long x);
+int ilog2(unsigned long x);
int write_file(const char *path, const char *buf);
intmax_t read_file_integer(const char *path, intmax_t fallback);
int write_all_buf(int fd, const void *buf, size_t len);
--
2.43.0
* [PATCH v2 4/9] tcp: Don't clear ACK_TO_TAP_DUE if we're advertising a zero-sized window
2025-12-08 0:22 [PATCH v2 0/9] tcp: Fix throughput issues with non-local peers Stefano Brivio
` (2 preceding siblings ...)
2025-12-08 0:22 ` [PATCH v2 3/9] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks Stefano Brivio
@ 2025-12-08 0:22 ` Stefano Brivio
2025-12-08 0:22 ` [PATCH v2 5/9] tcp: Acknowledge everything if it looks like bulk traffic, not interactive Stefano Brivio
` (5 subsequent siblings)
9 siblings, 0 replies; 23+ messages in thread
From: Stefano Brivio @ 2025-12-08 0:22 UTC (permalink / raw)
To: passt-dev; +Cc: Max Chernoff, David Gibson
We correctly avoid doing that at the beginning of tcp_prepare_flags(),
but we might clear the flag later on if we actually end up sending a
"flag" segment.
Make sure we don't: otherwise, we might significantly delay window
updates after a zero-window condition, and hurt throughput as a
result. In some cases, we would be forcing peers to send zero-window
probes or keep-alive segments.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tcp.c b/tcp.c
index 8eeef4c..9bf7b8b 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1279,7 +1279,8 @@ int tcp_prepare_flags(const struct ctx *c, struct tcp_tap_conn *conn,
th->fin = !!(flags & FIN);
if (th->ack) {
- if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap))
+ if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap) &&
+ conn->wnd_to_tap)
conn_flag(c, conn, ~ACK_TO_TAP_DUE);
else
conn_flag(c, conn, ACK_TO_TAP_DUE);
--
2.43.0
* [PATCH v2 5/9] tcp: Acknowledge everything if it looks like bulk traffic, not interactive
2025-12-08 0:22 [PATCH v2 0/9] tcp: Fix throughput issues with non-local peers Stefano Brivio
` (3 preceding siblings ...)
2025-12-08 0:22 ` [PATCH v2 4/9] tcp: Don't clear ACK_TO_TAP_DUE if we're advertising a zero-sized window Stefano Brivio
@ 2025-12-08 0:22 ` Stefano Brivio
2025-12-08 5:54 ` David Gibson
2025-12-08 0:22 ` [PATCH v2 6/9] tcp: Don't limit window to less-than-MSS values, use zero instead Stefano Brivio
` (4 subsequent siblings)
9 siblings, 1 reply; 23+ messages in thread
From: Stefano Brivio @ 2025-12-08 0:22 UTC (permalink / raw)
To: passt-dev; +Cc: Max Chernoff, David Gibson
...instead of checking if the current sending buffer is less than
SNDBUF_SMALL, because this isn't simply an optimisation to coalesce
ACK segments: we rely on having enough data at once from the sender
to make the buffer grow by means of TCP buffer size tuning
implemented in the Linux kernel.
This is important if we're trying to maximise throughput, but not
desirable for interactive traffic, where we want to be as transparent
as possible and avoid introducing unnecessary latency.
Use the tcpi_delivery_rate field reported by the Linux kernel, if
available, to calculate the current bandwidth-delay product: if it's
significantly smaller than the available sending buffer, conclude that
we're not bandwidth-bound and this is likely to be interactive
traffic, so acknowledge data only as it's acknowledged by the peer.
Conversely, if the bandwidth-delay product is comparable to the size
of the sending buffer (more than 5%), we're probably bandwidth-bound
or... bound to be: acknowledge everything in that case.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
tcp.c | 45 +++++++++++++++++++++++++++++++++------------
1 file changed, 33 insertions(+), 12 deletions(-)
diff --git a/tcp.c b/tcp.c
index 9bf7b8b..533c8a7 100644
--- a/tcp.c
+++ b/tcp.c
@@ -353,6 +353,9 @@ enum {
#define LOW_RTT_TABLE_SIZE 8
#define LOW_RTT_THRESHOLD 10 /* us */
+/* Ratio of buffer to bandwidth * delay product implying interactive traffic */
+#define SNDBUF_TO_BW_DELAY_INTERACTIVE /* > */ 20 /* (i.e. < 5% of buffer) */
+
#define ACK_IF_NEEDED 0 /* See tcp_send_flag() */
#define CONN_IS_CLOSING(conn) \
@@ -426,11 +429,13 @@ socklen_t tcp_info_size;
sizeof(((struct tcp_info_linux *)NULL)->tcpi_##f_)) <= tcp_info_size)
/* Kernel reports sending window in TCP_INFO (kernel commit 8f7baad7f035) */
-#define snd_wnd_cap tcp_info_cap(snd_wnd)
+#define snd_wnd_cap tcp_info_cap(snd_wnd)
/* Kernel reports bytes acked in TCP_INFO (kernel commit 0df48c26d84) */
-#define bytes_acked_cap tcp_info_cap(bytes_acked)
+#define bytes_acked_cap tcp_info_cap(bytes_acked)
/* Kernel reports minimum RTT in TCP_INFO (kernel commit cd9b266095f4) */
-#define min_rtt_cap tcp_info_cap(min_rtt)
+#define min_rtt_cap tcp_info_cap(min_rtt)
+/* Kernel reports delivery rate in TCP_INFO (kernel commit eb8329e0a04d) */
+#define delivery_rate_cap tcp_info_cap(delivery_rate)
/* sendmsg() to socket */
static struct iovec tcp_iov [UIO_MAXIOV];
@@ -1048,6 +1053,7 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
socklen_t sl = sizeof(*tinfo);
struct tcp_info_linux tinfo_new;
uint32_t new_wnd_to_tap = prev_wnd_to_tap;
+ bool ack_everything = true;
int s = conn->sock;
/* At this point we could ack all the data we've accepted for forwarding
@@ -1057,7 +1063,8 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
* control behaviour.
*
* For it to be possible and worth it we need:
- * - The TCP_INFO Linux extension which gives us the peer acked bytes
+ * - The TCP_INFO Linux extensions which give us the peer acked bytes
+ * and the delivery rate (outbound bandwidth at receiver)
* - Not to be told not to (force_seq)
* - Not half-closed in the peer->guest direction
* With no data coming from the peer, we might not get events which
@@ -1067,19 +1074,36 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
* Data goes from socket to socket, with nothing meaningfully "in
* flight".
* - Not a pseudo-local connection (e.g. to a VM on the same host)
- * - Large enough send buffer
- * In these cases, there's not enough in flight to bother.
+ * If it is, there's not enough in flight to bother.
+ * - Sending buffer significantly larger than bandwidth * delay product
+ * Meaning we're not bandwidth-bound and this is likely to be
+ * interactive traffic where we want to preserve transparent
+ * connection behaviour and latency.
+ *
+ * Otherwise, we probably want to maximise throughput, which needs
+ * sending buffer auto-tuning, triggered in turn by filling up the
+ * outbound socket queue.
*/
- if (bytes_acked_cap && !force_seq &&
+ if (bytes_acked_cap && delivery_rate_cap && !force_seq &&
!CONN_IS_CLOSING(conn) &&
- !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn) &&
- (unsigned)SNDBUF_GET(conn) >= SNDBUF_SMALL) {
+ !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn)) {
if (!tinfo) {
tinfo = &tinfo_new;
if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
return 0;
}
+ if ((unsigned)SNDBUF_GET(conn) > (long long)RTT_GET(conn) *
+ tinfo->tcpi_delivery_rate /
+ 1000 / 1000 *
+ SNDBUF_TO_BW_DELAY_INTERACTIVE)
+ ack_everything = false;
+ }
+
+ if (ack_everything) {
+ /* Fall back to acknowledging everything we got */
+ conn->seq_ack_to_tap = conn->seq_from_tap;
+ } else {
/* This trips a cppcheck bug in some versions, including
* cppcheck 2.18.3.
* https://sourceforge.net/p/cppcheck/discussion/general/thread/fecde59085/
@@ -1087,9 +1111,6 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
/* cppcheck-suppress [uninitvar,unmatchedSuppression] */
conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked +
conn->seq_init_from_tap;
- } else {
- /* Fall back to acknowledging everything we got */
- conn->seq_ack_to_tap = conn->seq_from_tap;
}
/* It's occasionally possible for us to go from using the fallback above
--
2.43.0
* [PATCH v2 6/9] tcp: Don't limit window to less-than-MSS values, use zero instead
2025-12-08 0:22 [PATCH v2 0/9] tcp: Fix throughput issues with non-local peers Stefano Brivio
` (4 preceding siblings ...)
2025-12-08 0:22 ` [PATCH v2 5/9] tcp: Acknowledge everything if it looks like bulk traffic, not interactive Stefano Brivio
@ 2025-12-08 0:22 ` Stefano Brivio
2025-12-08 6:43 ` David Gibson
2025-12-08 0:22 ` [PATCH v2 7/9] tcp: Allow exceeding the available sending buffer size in window advertisements Stefano Brivio
` (3 subsequent siblings)
9 siblings, 1 reply; 23+ messages in thread
From: Stefano Brivio @ 2025-12-08 0:22 UTC (permalink / raw)
To: passt-dev; +Cc: Max Chernoff, David Gibson
If the sender uses data clumping (including Nagle's algorithm) for
Silly Window Syndrome (SWS) avoidance, advertising less than an MSS
means the sender might stop sending altogether, and window updates
after a low-window condition are just as important as they are in
a zero-window condition.
For simplicity, approximate that limit to zero, as we have an
implementation forcing window updates after zero-sized windows.
This matches the suggestion from RFC 813, section 4.
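Outside the patch context, the clamping rule can be sketched as a tiny
standalone helper (the MSS value in the example below is illustrative,
not taken from a connection):

```c
#include <assert.h>

/* Sketch of the SWS-avoidance clamp from this patch: a window limit
 * below the MSS is approximated to zero, so that the existing
 * zero-window update mechanism kicks in once the window grows again.
 */
static int clamp_window_limit(int limit, int mss)
{
	return limit < mss ? 0 : limit;
}
```

With an MSS of 1460, a limit of 1000 bytes is advertised as zero, while
2000 bytes passes through unchanged.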
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/tcp.c b/tcp.c
index 533c8a7..3c046a5 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1155,6 +1155,23 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
else
limit = SNDBUF_GET(conn) - (int)sendq;
+ /* If the sender uses mechanisms to prevent Silly Window
+ * Syndrome (SWS, described in RFC 813 Section 3) it's critical
+ * that, should the window ever become less than the MSS, we
+ * advertise a new value once it increases again to be above it.
+ *
+ * The mechanism to avoid SWS in the kernel is, implicitly,
+ * implemented by Nagle's algorithm (which was proposed after
+ * RFC 813).
+ *
+ * To this end, for simplicity, approximate a window value below
+ * the MSS to zero, as we already have mechanisms in place to
+ * force updates after the window becomes zero. This matches the
+ * suggestion from RFC 813, Section 4.
+ */
+ if (limit < MSS_GET(conn))
+ limit = 0;
+
new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, limit);
}
--
2.43.0
* [PATCH v2 7/9] tcp: Allow exceeding the available sending buffer size in window advertisements
2025-12-08 0:22 [PATCH v2 0/9] tcp: Fix throughput issues with non-local peers Stefano Brivio
` (5 preceding siblings ...)
2025-12-08 0:22 ` [PATCH v2 6/9] tcp: Don't limit window to less-than-MSS values, use zero instead Stefano Brivio
@ 2025-12-08 0:22 ` Stefano Brivio
2025-12-08 6:25 ` David Gibson
2025-12-08 0:22 ` [PATCH v2 8/9] tcp: Send a duplicate ACK also on complete sendmsg() failure Stefano Brivio
` (2 subsequent siblings)
9 siblings, 1 reply; 23+ messages in thread
From: Stefano Brivio @ 2025-12-08 0:22 UTC (permalink / raw)
To: passt-dev; +Cc: Max Chernoff, David Gibson
If the remote peer is advertising a bigger value than our current
sending buffer, it means that a bigger sending buffer is likely to
benefit throughput.
We can get a bigger sending buffer by means of the buffer size
auto-tuning performed by the Linux kernel, which is triggered by
aggressively filling the sending buffer.
Use an adaptive boost factor, up to 150%, depending on:
- how much data we sent so far: we don't want to risk retransmissions
for short-lived connections, as the latency cost would be
unacceptable, and
- the current RTT value, as we need a bigger buffer for higher
transmission delays
The factor we use is not quite a bandwidth-delay product, as we're
missing the time component of the bandwidth, which is not interesting
here: we are trying to make the buffer grow at the beginning of a
connection, progressively, as more data is sent.
The tuning of the boost factor we want to apply was done somewhat
empirically, but it appears to yield the available throughput in
rather different scenarios (from ~10 Gbps bandwidth with 500 ns RTT
to ~1 Gbps with 300 ms RTT), and it allows getting there rather
quickly, within a few seconds in the 300 ms case.
Note that we want to apply this factor only if the window advertised
by the peer is bigger than the current sending buffer, as we only need
this for auto-tuning, and we absolutely don't want to incur
unnecessary retransmissions otherwise.
The related condition in tcp_update_seqack_wnd() is not redundant, as
there's a subtractive factor, sendq, in the calculation of the window
limit: if the sending buffer is smaller than the peer's advertised
window, the limit we would apply might be lower than the one we would
use otherwise.
Assuming that the sending buffer is reported as 100k and sendq is
20k, we could have these example cases:
1. tinfo->tcpi_snd_wnd is 120k, which is bigger than the sending
buffer, so we boost its size to 150k, and we limit the window
to 120k
2. tinfo->tcpi_snd_wnd is 90k, which is smaller than the sending
buffer, so we aren't trying to trigger buffer auto-tuning and
we'll stick to the existing, more conservative calculation,
by limiting the window to 100 - 20 = 80k
If we omitted the new condition, we would always use the boosted
value, that is, 120k, even if potentially causing unnecessary
retransmissions.
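Under the assumption that the boost factor is fully applied (150%),
the two example cases above can be checked numerically; this is a
simplified model of the resulting window-limit calculation, with
tcp_sndbuf_boost() reduced to a constant factor:

```c
#include <assert.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Simplified model of the window limit in tcp_update_seqack_wnd()
 * after this patch, assuming the full 150% boost. The real code
 * derives the factor from bytes acked and RTT via tcp_sndbuf_boost().
 */
static int window_limit(int snd_wnd, int sndbuf, int sendq)
{
	int limit;

	if (sendq > sndbuf)		/* Due to memory pressure? */
		limit = 0;
	else if (snd_wnd > sndbuf)	/* Try to trigger auto-tuning */
		limit = sndbuf * 150 / 100 - sendq;
	else				/* Conservative calculation */
		limit = sndbuf - sendq;

	return MIN(snd_wnd, limit);
}
```

Case 1: window_limit(120000, 100000, 20000) boosts the buffer to 150k
and advertises min(120k, 130k) = 120k. Case 2: window_limit(90000,
100000, 20000) keeps the conservative limit: min(90k, 80k) = 80k.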
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
tcp.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)
diff --git a/tcp.c b/tcp.c
index 3c046a5..60a9687 100644
--- a/tcp.c
+++ b/tcp.c
@@ -353,6 +353,13 @@ enum {
#define LOW_RTT_TABLE_SIZE 8
#define LOW_RTT_THRESHOLD 10 /* us */
+/* Parameters to temporarily exceed sending buffer to force TCP auto-tuning */
+#define SNDBUF_BOOST_BYTES_RTT_LO 2500 /* B * s: no boost until here */
+/* ...examples: 5 MB sent * 500 ns RTT, 250 kB * 10 ms, 8 kB * 300 ms */
+#define SNDBUF_BOOST_FACTOR 150 /* % */
+#define SNDBUF_BOOST_BYTES_RTT_HI 6000 /* apply full boost factor */
+/* 12 MB sent * 500 ns RTT, 600 kB * 10 ms, 20 kB * 300 ms */
+
/* Ratio of buffer to bandwidth * delay product implying interactive traffic */
#define SNDBUF_TO_BW_DELAY_INTERACTIVE /* > */ 20 /* (i.e. < 5% of buffer) */
@@ -1033,6 +1040,35 @@ void tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
tap_hdr_update(taph, MAX(l3len + sizeof(struct ethhdr), ETH_ZLEN));
}
+/**
+ * tcp_sndbuf_boost() - Calculate limit of sending buffer to force auto-tuning
+ * @conn: Connection pointer
+ * @tinfo: tcp_info from kernel, must be pre-fetched
+ *
+ * Return: increased sending buffer to use as a limit for advertised window
+ */
+static unsigned long tcp_sndbuf_boost(struct tcp_tap_conn *conn,
+ struct tcp_info_linux *tinfo)
+{
+ unsigned long bytes_rtt_product;
+
+ if (!bytes_acked_cap)
+ return SNDBUF_GET(conn);
+
+ /* This is *not* a bandwidth-delay product, but it's somewhat related:
+ * as we send more data (usually at the beginning of a connection), we
+ * try to make the sending buffer progressively grow, with the RTT as a
+ * factor (longer delay, bigger buffer needed).
+ */
+ bytes_rtt_product = (long long)tinfo->tcpi_bytes_acked *
+ tinfo->tcpi_rtt / 1000 / 1000;
+
+ return scale_x_to_y_slope(SNDBUF_GET(conn), bytes_rtt_product,
+ SNDBUF_BOOST_BYTES_RTT_LO,
+ SNDBUF_BOOST_BYTES_RTT_HI,
+ SNDBUF_BOOST_FACTOR);
+}
+
/**
* tcp_update_seqack_wnd() - Update ACK sequence and window to guest/tap
* @c: Execution context
@@ -1152,6 +1188,8 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
if ((int)sendq > SNDBUF_GET(conn)) /* Due to memory pressure? */
limit = 0;
+ else if ((int)tinfo->tcpi_snd_wnd > SNDBUF_GET(conn))
+ limit = tcp_sndbuf_boost(conn, tinfo) - (int)sendq;
else
limit = SNDBUF_GET(conn) - (int)sendq;
--
2.43.0
* [PATCH v2 8/9] tcp: Send a duplicate ACK also on complete sendmsg() failure
2025-12-08 0:22 [PATCH v2 0/9] tcp: Fix throughput issues with non-local peers Stefano Brivio
` (6 preceding siblings ...)
2025-12-08 0:22 ` [PATCH v2 7/9] tcp: Allow exceeding the available sending buffer size in window advertisements Stefano Brivio
@ 2025-12-08 0:22 ` Stefano Brivio
2025-12-08 0:22 ` [PATCH v2 9/9] tcp: Skip redundant ACK on partial " Stefano Brivio
2025-12-08 6:46 ` [PATCH v2 0/9] tcp: Fix throughput issues with non-local peers David Gibson
9 siblings, 0 replies; 23+ messages in thread
From: Stefano Brivio @ 2025-12-08 0:22 UTC (permalink / raw)
To: passt-dev; +Cc: Max Chernoff, David Gibson
...in order to trigger a fast retransmit as soon as possible. There's
no benefit in forcing the sender to wait any longer than that.
We already do this on partial failures (short socket writes), but, for
historical reasons, not on complete failures. Make these two cases
consistent with each other.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tcp.c b/tcp.c
index 60a9687..fa22e9b 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2017,7 +2017,7 @@ eintr:
goto eintr;
if (errno == EAGAIN || errno == EWOULDBLOCK) {
- tcp_send_flag(c, conn, ACK_IF_NEEDED);
+ tcp_send_flag(c, conn, ACK | DUP_ACK);
return p->count - idx;
}
--
2.43.0
* [PATCH v2 9/9] tcp: Skip redundant ACK on partial sendmsg() failure
2025-12-08 0:22 [PATCH v2 0/9] tcp: Fix throughput issues with non-local peers Stefano Brivio
` (7 preceding siblings ...)
2025-12-08 0:22 ` [PATCH v2 8/9] tcp: Send a duplicate ACK also on complete sendmsg() failure Stefano Brivio
@ 2025-12-08 0:22 ` Stefano Brivio
2025-12-08 6:46 ` [PATCH v2 0/9] tcp: Fix throughput issues with non-local peers David Gibson
9 siblings, 0 replies; 23+ messages in thread
From: Stefano Brivio @ 2025-12-08 0:22 UTC (permalink / raw)
To: passt-dev; +Cc: Max Chernoff, David Gibson
...we'll send a duplicate ACK right away in this case, so this
earlier, redundant check is not just useless: it might actually be
harmful, as we would now send a triple ACK, which might cause two
retransmissions.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp.c | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/tcp.c b/tcp.c
index fa22e9b..9252495 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2024,13 +2024,10 @@ eintr:
return -1;
}
- if (n < (int)(seq_from_tap - conn->seq_from_tap)) {
+ if (n < (int)(seq_from_tap - conn->seq_from_tap))
partial_send = 1;
- conn->seq_from_tap += n;
- tcp_send_flag(c, conn, ACK_IF_NEEDED);
- } else {
- conn->seq_from_tap += n;
- }
+
+ conn->seq_from_tap += n;
out:
if (keep != -1 || partial_send) {
--
2.43.0
* Re: [PATCH v2 1/9] tcp, util: Add function for scaling to linearly interpolated factor, use it
2025-12-08 0:22 ` [PATCH v2 1/9] tcp, util: Add function for scaling to linearly interpolated factor, use it Stefano Brivio
@ 2025-12-08 5:33 ` David Gibson
0 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2025-12-08 5:33 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Max Chernoff
On Mon, Dec 08, 2025 at 01:22:09AM +0100, Stefano Brivio wrote:
> Right now, the only need for this kind of function comes from
> tcp_get_sndbuf(), which calculates the amount of sending buffer we
> want to use depending on its own size: we want to use more of it
> if it's smaller, as bookkeeping overhead is usually lower and we rely
> on auto-tuning there, and use less of it when it's bigger.
>
> For this purpose, the new function is overly generic and its name is
> a mouthful: @x is the same as @y, that is, we want to use more or
> less of the buffer depending on the size of the buffer itself.
>
> However, an upcoming change will need that generality, as we'll want
> to scale the amount of sending buffer we use depending on another
> (scaled) factor.
>
> While at it, now that we have this new function, which makes it simple
> to specify a precise usage factor, change the amount of sending buffer
> we want to use at and above 4 MiB: 75% looks perfectly safe.
>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
In that it looks correct. Several suggestions for improvements in
clarity below, either as followups, or in case a respin is needed
anyway.
> ---
> tcp.c | 8 ++------
> util.c | 38 ++++++++++++++++++++++++++++++++++++++
> util.h | 1 +
> 3 files changed, 41 insertions(+), 6 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index bb661ee..37012cc 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -773,7 +773,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
> }
>
> /**
> - * tcp_get_sndbuf() - Get, scale SO_SNDBUF between thresholds (1 to 0.5 usage)
> + * tcp_get_sndbuf() - Get, scale SO_SNDBUF between thresholds (1 to 0.75 usage)
I'd slightly prefer the change to 0.75 be in a separate patch, just so
it's easier to tell that the change to the helper function itself
doesn't change behaviour here.
> * @conn: Connection pointer
> */
> static void tcp_get_sndbuf(struct tcp_tap_conn *conn)
> @@ -788,11 +788,7 @@ static void tcp_get_sndbuf(struct tcp_tap_conn *conn)
> return;
> }
>
> - v = sndbuf;
> - if (v >= SNDBUF_BIG)
> - v /= 2;
> - else if (v > SNDBUF_SMALL)
> - v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;
> + v = scale_x_to_y_slope(sndbuf, sndbuf, SNDBUF_SMALL, SNDBUF_BIG, 75);
>
> SNDBUF_SET(conn, MIN(INT_MAX, v));
> }
> diff --git a/util.c b/util.c
> index f32c9cb..ff0ba01 100644
> --- a/util.c
> +++ b/util.c
> @@ -1223,3 +1223,41 @@ void fsync_pcap_and_log(void)
> if (log_file != -1)
> (void)fsync(log_file);
> }
> +
> +/**
> + * scale_x_to_y_slope() - Scale @x from 100% to f% depending on @y's value
Would "clamped_scale" work as a more descriptive name?
> + * @x: Value to scale
> + * @y: Value determining scaling
> + * @lo: Lower bound for @y (start of y-axis slope)
> + * @hi: Upper bound for @y (end of y-axis slope)
> + * @f: Scaling factor, percent
Maybe worth clarifying that this can be less than or more than 100% -
description below uses >100%, but the usage above is <100%.
> + *
> + * Return: @x scaled by @f * linear interpolation of @y between @lo and @hi
> + *
> + * In pictures:
> + *
> + * f % -> ,---- * If @y < lo (for example, @y is y0), return @x
> + * /| |
> + * / | | * If @lo < @y < @hi (for example, @y is y1),
> + * / | | return @x scaled by a factor linearly
> + * (100 + f) / 2 % ->/ | | interpolated between 100% and f% depending on
> + * /| | | @y's position between @lo (100%) and @hi (f%)
> + * / | | |
> + * / | | | * If @y > @hi (for example, @y is y2), return
> + * 100 % -> -----' | | | @x * @f / 100
> + * | | | | |
> + * y0 lo y1 hi y2 Example: @f = 150, @lo = 10, @hi = 20, @y = 15,
> + * @x = 1000
> + * -> interpolated factor is 125%
> + * -> return 1250
> + */
> +long scale_x_to_y_slope(long x, long y, long lo, long hi, long f)
> +{
> + if (y < lo)
> + return x;
> +
> + if (y > hi)
> + return x * f / 100;
> +
> + return x - (x * (y - lo) / (hi - lo)) * (100 - f) / 100;
There's a subtle tradeoff here. Dividing by (hi - lo) before
multiplying by the factor loses some precision in the final result.
On the other hand, doing all the multiplies first would increase the risk of
an overflow.
Possible different way of organising this that _might_ be slightly
easier to describe: rather than including a scaling factor, instead
give upper and lower bounds of the output, so something like:
long clamped_scale(long a, long b, long s, long sa, long sb)
=> returns a value between @a and @b, matching where @s lies between @sa
and @sb.
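A sketch of that alternative signature (hypothetical, just to
illustrate the suggestion, not code from the series):

```c
#include <assert.h>

/* Hypothetical clamped_scale() as suggested above: return a value
 * between @a and @b, matching where @s lies between @sa and @sb,
 * clamped at both ends.
 */
static long clamped_scale(long a, long b, long s, long sa, long sb)
{
	if (s <= sa)
		return a;
	if (s >= sb)
		return b;

	return a + (b - a) * (s - sa) / (sb - sa);
}
```

The example from the patch, @x = 1000, @f = 150, @lo = 10, @hi = 20,
@y = 15, becomes clamped_scale(1000, 1500, 15, 10, 20) and yields the
same 1250.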
> +}
> diff --git a/util.h b/util.h
> index 17f5ae0..ec75453 100644
> --- a/util.h
> +++ b/util.h
> @@ -242,6 +242,7 @@ int read_remainder(int fd, const struct iovec *iov, size_t cnt, size_t skip);
> void close_open_files(int argc, char **argv);
> bool snprintf_check(char *str, size_t size, const char *format, ...);
> void fsync_pcap_and_log(void);
> +long scale_x_to_y_slope(long x, long y, long lo, long hi, long f);
>
> /**
> * af_name() - Return name of an address family
> --
> 2.43.0
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
* Re: [PATCH v2 3/9] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks
2025-12-08 0:22 ` [PATCH v2 3/9] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks Stefano Brivio
@ 2025-12-08 5:41 ` David Gibson
2025-12-08 7:22 ` Stefano Brivio
0 siblings, 1 reply; 23+ messages in thread
From: David Gibson @ 2025-12-08 5:41 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Max Chernoff
On Mon, Dec 08, 2025 at 01:22:11AM +0100, Stefano Brivio wrote:
> A fixed 10 ms ACK_INTERVAL timer value served us relatively well until
> the previous change, because we would generally cause retransmissions
> for non-local outbound transfers with relatively high (> 100 Mbps)
> bandwidth and non-local but low (< 5 ms) RTT.
>
> Now that retransmissions are less frequent, we don't have a proper
> trigger to check for acknowledged bytes on the socket, and will
> generally block the sender for a significant amount of time while
> we could acknowledge more data, instead.
>
> Store the RTT reported by the kernel using an approximation (exponent),
> to keep flow storage size within two (typical) cachelines. Check for
> socket updates when half of this time elapses: it should be a good
> indication of the one-way delay we're interested in (peer to us).
>
> Representable values are between 100 us and 3.2768 s, and any value
> outside this range is clamped to these bounds. This choice appears
> to be a good trade-off between additional overhead and throughput.
>
> This mechanism partially overlaps with the "low RTT" destinations,
> which we use to infer that a socket is connected to an endpoint to
> the same machine (while possibly in a different namespace) if the
> RTT is reported as 10 us or less.
>
> This change doesn't, however, conflict with it: we are reading
> TCP_INFO parameters for local connections anyway, so we can always
> store the RTT approximation opportunistically.
>
> Then, if the RTT is "low", we don't really need a timer to
> acknowledge data as we'll always acknowledge everything to the
> sender right away. However, we have limited space in the array where
> we store addresses of local destination, so the low RTT property of a
> connection might toggle frequently. Because of this, it's actually
> helpful to always have the RTT approximation stored.
>
> This could probably benefit from a future rework, though, introducing
> a more integrated approach between these two mechanisms.
>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> tcp.c | 28 +++++++++++++++++++++-------
> tcp_conn.h | 9 +++++++++
> util.c | 14 ++++++++++++++
> util.h | 1 +
> 4 files changed, 45 insertions(+), 7 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 951f434..8eeef4c 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -202,9 +202,13 @@
> * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
> * either side, the connection is reset
> *
> - * - ACK_INTERVAL elapsed after data segment received from tap without having
> + * - RTT / 2 elapsed after data segment received from tap without having
> * sent an ACK segment, or zero-sized window advertised to tap/guest (flag
> - * ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent
> + * ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent.
> + *
> + * RTT, here, is an approximation of the RTT value reported by the kernel via
> + * TCP_INFO, with a representable range from RTT_STORE_MIN (100 us) to
> + * RTT_STORE_MAX (3276.8 ms). The timeout value is clamped accordingly.
> *
> *
> * Summary of data flows (with ESTABLISHED event)
> @@ -341,7 +345,6 @@ enum {
> #define MSS_DEFAULT 536
> #define WINDOW_DEFAULT 14600 /* RFC 6928 */
>
> -#define ACK_INTERVAL 10 /* ms */
> #define RTO_INIT 1 /* s, RFC 6298 */
> #define RTO_INIT_AFTER_SYN_RETRIES 3 /* s, RFC 6298 */
> #define FIN_TIMEOUT 60
> @@ -593,7 +596,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> }
>
> if (conn->flags & ACK_TO_TAP_DUE) {
> - it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> + it.it_value.tv_sec = RTT_GET(conn) / 2 / (1000 * 1000);
> + it.it_value.tv_nsec = RTT_GET(conn) / 2 % (1000 * 1000) * 1000;
> } else if (conn->flags & ACK_FROM_TAP_DUE) {
> int exp = conn->retries, timeout = RTO_INIT;
> if (!(conn->events & ESTABLISHED))
> @@ -608,9 +612,15 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> it.it_value.tv_sec = ACT_TIMEOUT;
> }
>
> - flow_dbg(conn, "timer expires in %llu.%03llus",
> - (unsigned long long)it.it_value.tv_sec,
> - (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
> + if (conn->flags & ACK_TO_TAP_DUE) {
> + flow_trace(conn, "timer expires in %lu.%01llums",
> + (unsigned long)it.it_value.tv_nsec / 1000 / 1000,
> + (unsigned long long)it.it_value.tv_nsec / 1000);
This doesn't look right - you need a % to exclude the whole
milliseconds here for the fractional part. Plus, it looks like this
is trying to compute microseconds, which would be 3 digits after the
. in ms, but the format string accommodates only one.
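One possible corrected form, printing whole and fractional
milliseconds from the nanosecond value (a sketch outside the
flow_trace() context; the function name here is made up):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format a timer value given in nanoseconds as "<ms>.<us>ms", with a
 * modulo to exclude the whole milliseconds from the fractional part,
 * and three digits for the microseconds, as noted above.
 */
static int format_timer_ms(char *buf, size_t size, long long nsec)
{
	return snprintf(buf, size, "timer expires in %llu.%03llums",
			(unsigned long long)nsec / 1000 / 1000,
			(unsigned long long)nsec / 1000 % 1000);
}
```

For example, 12345678 ns formats as "timer expires in 12.345ms".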
> + } else {
> + flow_dbg(conn, "timer expires in %llu.%03llus",
> + (unsigned long long)it.it_value.tv_sec,
> + (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
> + }
>
> if (timerfd_settime(conn->timer, 0, &it, NULL))
> flow_perror(conn, "failed to set timer");
> @@ -1144,6 +1154,10 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> conn_flag(c, conn, ACK_TO_TAP_DUE);
>
> out:
> + /* Opportunistically store RTT approximation on valid TCP_INFO data */
> + if (tinfo)
> + RTT_SET(conn, tinfo->tcpi_rtt);
> +
> return new_wnd_to_tap != prev_wnd_to_tap ||
> conn->seq_ack_to_tap != prev_ack_to_tap;
> }
> diff --git a/tcp_conn.h b/tcp_conn.h
> index e36910c..9c6ff9e 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -49,6 +49,15 @@ struct tcp_tap_conn {
> #define MSS_SET(conn, mss) (conn->tap_mss = (mss >> (16 - TCP_MSS_BITS)))
> #define MSS_GET(conn) (conn->tap_mss << (16 - TCP_MSS_BITS))
>
> +#define RTT_EXP_BITS 4
> + unsigned int rtt_exp :RTT_EXP_BITS;
> +#define RTT_EXP_MAX MAX_FROM_BITS(RTT_EXP_BITS)
> +#define RTT_STORE_MIN 100 /* us, minimum representable */
> +#define RTT_STORE_MAX ((long)(RTT_STORE_MIN << RTT_EXP_MAX))
> +#define RTT_SET(conn, rtt) \
> + (conn->rtt_exp = MIN(RTT_EXP_MAX, ilog2(MAX(1, rtt / RTT_STORE_MIN))))
> +#define RTT_GET(conn) (RTT_STORE_MIN << conn->rtt_exp)
> +
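As a standalone sketch, the exponent encoding in these macros behaves
as follows (ilog2() reimplemented locally to match the floor-of-log2
helper added by this patch):

```c
#include <assert.h>

#define RTT_EXP_MAX 15		/* MAX_FROM_BITS(4) */
#define RTT_STORE_MIN 100	/* us */
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Floor of binary logarithm, -1 if undefined (@x is 0) */
static int ilog2(unsigned long x)
{
	int y = -1;

	while (x) {
		x >>= 1;
		y++;
	}
	return y;
}

/* Encode an RTT in microseconds to a 4-bit exponent, and back */
static unsigned rtt_encode(unsigned long rtt_us)
{
	return MIN(RTT_EXP_MAX, ilog2(MAX(1, rtt_us / RTT_STORE_MIN)));
}

static unsigned long rtt_decode(unsigned exp)
{
	return (unsigned long)RTT_STORE_MIN << exp;
}
```

A 10 ms RTT (10000 us) encodes to exponent 6 and decodes to 6.4 ms;
anything at or below 100 us decodes to 100 us, and very large values
are clamped to 3.2768 s, matching the representable range stated in
the commit message.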
> int sock :FD_REF_BITS;
>
> uint8_t events;
> diff --git a/util.c b/util.c
> index ff0ba01..e78e10d 100644
> --- a/util.c
> +++ b/util.c
> @@ -614,6 +614,9 @@ int __daemon(int pidfile_fd, int devnull_fd)
> * fls() - Find last (most significant) bit set in word
> * @x: Word
> *
> + * Note: unlike ffs() and other implementations of fls(), notably the one from
> + * the Linux kernel, the starting position is 0 and not 1, that is, fls(1) = 0.
> + *
> * Return: position of most significant bit set, starting from 0, -1 if none
> */
> int fls(unsigned long x)
> @@ -629,6 +632,17 @@ int fls(unsigned long x)
> return y;
> }
>
> +/**
> + * ilog2() - Integral part (floor) of binary logarithm (logarithm to the base 2)
> + * @x: Argument
> + *
> + * Return: integral part of binary logarithm of @x, -1 if undefined (if @x is 0)
> + */
> +int ilog2(unsigned long x)
> +{
> + return fls(x);
> +}
> +
> /**
> * write_file() - Replace contents of file with a string
> * @path: File to write
> diff --git a/util.h b/util.h
> index ec75453..5c205a5 100644
> --- a/util.h
> +++ b/util.h
> @@ -233,6 +233,7 @@ int output_file_open(const char *path, int flags);
> void pidfile_write(int fd, pid_t pid);
> int __daemon(int pidfile_fd, int devnull_fd);
> int fls(unsigned long x);
> +int ilog2(unsigned long x);
> int write_file(const char *path, const char *buf);
> intmax_t read_file_integer(const char *path, intmax_t fallback);
> int write_all_buf(int fd, const void *buf, size_t len);
> --
> 2.43.0
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
* Re: [PATCH v2 5/9] tcp: Acknowledge everything if it looks like bulk traffic, not interactive
2025-12-08 0:22 ` [PATCH v2 5/9] tcp: Acknowledge everything if it looks like bulk traffic, not interactive Stefano Brivio
@ 2025-12-08 5:54 ` David Gibson
2025-12-08 7:25 ` Stefano Brivio
0 siblings, 1 reply; 23+ messages in thread
From: David Gibson @ 2025-12-08 5:54 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Max Chernoff
On Mon, Dec 08, 2025 at 01:22:13AM +0100, Stefano Brivio wrote:
> ...instead of checking if the current sending buffer is less than
> SNDBUF_SMALL, because this isn't simply an optimisation to coalesce
> ACK segments: we rely on having enough data at once from the sender
> to make the buffer grow by means of TCP buffer size tuning
> implemented in the Linux kernel.
>
> This is important if we're trying to maximise throughput, but not
> desirable for interactive traffic, where we want to be transparent as
> possible and avoid introducing unnecessary latency.
>
> Use the tcpi_delivery_rate field reported by the Linux kernel, if
> available, to calculate the current bandwidth-delay product: if it's
> significantly smaller than the available sending buffer, conclude that
> we're not bandwidth-bound and this is likely to be interactive
> traffic, so acknowledge data only as it's acknowledged by the peer.
>
> Conversely, if the bandwidth-delay product is comparable to the size
> of the sending buffer (more than 5%), we're probably bandwidth-bound
> or... bound to be: acknowledge everything in that case.
Ah, nice. This reasoning is much clearer to me than the previous
spin.
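Numerically, the heuristic can be sketched like this (the buffer and
rate values in the examples are made up, not from the patch):

```c
#include <assert.h>
#include <stdbool.h>

#define SNDBUF_TO_BW_DELAY_INTERACTIVE 20

/* Sketch of the check added below: the connection looks interactive
 * if the sending buffer is more than 20 times the bandwidth-delay
 * product, i.e. less than 5% of the buffer would be in flight.
 * @rtt_us is in microseconds, @delivery_rate in bytes per second.
 */
static bool looks_interactive(unsigned sndbuf, long long rtt_us,
			      long long delivery_rate)
{
	long long bw_delay = rtt_us * delivery_rate / 1000 / 1000;

	return sndbuf > bw_delay * SNDBUF_TO_BW_DELAY_INTERACTIVE;
}
```

With a 256 KiB buffer, a 10 ms RTT at a 10 MB/s delivery rate gives a
100 kB product: bandwidth-bound, acknowledge everything. The same
buffer and RTT at 1 kB/s (say, a shell session) give a 10-byte
product: interactive, acknowledge only as the peer does.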
>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> tcp.c | 45 +++++++++++++++++++++++++++++++++------------
> 1 file changed, 33 insertions(+), 12 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 9bf7b8b..533c8a7 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -353,6 +353,9 @@ enum {
> #define LOW_RTT_TABLE_SIZE 8
> #define LOW_RTT_THRESHOLD 10 /* us */
>
> +/* Ratio of buffer to bandwidth * delay product implying interactive traffic */
> +#define SNDBUF_TO_BW_DELAY_INTERACTIVE /* > */ 20 /* (i.e. < 5% of buffer) */
> +
> #define ACK_IF_NEEDED 0 /* See tcp_send_flag() */
>
> #define CONN_IS_CLOSING(conn) \
> @@ -426,11 +429,13 @@ socklen_t tcp_info_size;
> sizeof(((struct tcp_info_linux *)NULL)->tcpi_##f_)) <= tcp_info_size)
>
> /* Kernel reports sending window in TCP_INFO (kernel commit 8f7baad7f035) */
> -#define snd_wnd_cap tcp_info_cap(snd_wnd)
> +#define snd_wnd_cap tcp_info_cap(snd_wnd)
> /* Kernel reports bytes acked in TCP_INFO (kernel commit 0df48c26d84) */
> -#define bytes_acked_cap tcp_info_cap(bytes_acked)
> +#define bytes_acked_cap tcp_info_cap(bytes_acked)
> /* Kernel reports minimum RTT in TCP_INFO (kernel commit cd9b266095f4) */
> -#define min_rtt_cap tcp_info_cap(min_rtt)
> +#define min_rtt_cap tcp_info_cap(min_rtt)
> +/* Kernel reports delivery rate in TCP_INFO (kernel commit eb8329e0a04d) */
> +#define delivery_rate_cap tcp_info_cap(delivery_rate)
>
> /* sendmsg() to socket */
> static struct iovec tcp_iov [UIO_MAXIOV];
> @@ -1048,6 +1053,7 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> socklen_t sl = sizeof(*tinfo);
> struct tcp_info_linux tinfo_new;
> uint32_t new_wnd_to_tap = prev_wnd_to_tap;
> + bool ack_everything = true;
> int s = conn->sock;
>
> /* At this point we could ack all the data we've accepted for forwarding
> @@ -1057,7 +1063,8 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> * control behaviour.
> *
> * For it to be possible and worth it we need:
> - * - The TCP_INFO Linux extension which gives us the peer acked bytes
> + * - The TCP_INFO Linux extensions which give us the peer acked bytes
> + * and the delivery rate (outbound bandwidth at receiver)
> * - Not to be told not to (force_seq)
> * - Not half-closed in the peer->guest direction
> * With no data coming from the peer, we might not get events which
> @@ -1067,19 +1074,36 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> * Data goes from socket to socket, with nothing meaningfully "in
> * flight".
> * - Not a pseudo-local connection (e.g. to a VM on the same host)
> - * - Large enough send buffer
> - * In these cases, there's not enough in flight to bother.
> + * If it is, there's not enough in flight to bother.
> + * - Sending buffer significantly larger than bandwidth * delay product
> + * Meaning we're not bandwidth-bound and this is likely to be
> + * interactive traffic where we want to preserve transparent
> + * connection behaviour and latency.
Do we actually want the sending buffer size here? Or the amount of
buffer that's actually in use (SIOCOUTQ)? If we had a burst transfer
followed by interactive traffic, the kernel could still have a large
send buffer allocated, no?
> + *
> + * Otherwise, we probably want to maximise throughput, which needs
> + * sending buffer auto-tuning, triggered in turn by filling up the
> + * outbound socket queue.
> */
> - if (bytes_acked_cap && !force_seq &&
> + if (bytes_acked_cap && delivery_rate_cap && !force_seq &&
> !CONN_IS_CLOSING(conn) &&
> - !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn) &&
> - (unsigned)SNDBUF_GET(conn) >= SNDBUF_SMALL) {
> + !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn)) {
> if (!tinfo) {
> tinfo = &tinfo_new;
> if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
> return 0;
> }
>
> + if ((unsigned)SNDBUF_GET(conn) > (long long)RTT_GET(conn) *
Using RTT_GET seems odd here, since we just got a more up-to-date and
precise RTT estimate in tinfo.
> + tinfo->tcpi_delivery_rate /
> + 1000 / 1000 *
> + SNDBUF_TO_BW_DELAY_INTERACTIVE)
> + ack_everything = false;
> + }
> +
> + if (ack_everything) {
> + /* Fall back to acknowledging everything we got */
> + conn->seq_ack_to_tap = conn->seq_from_tap;
> + } else {
> /* This trips a cppcheck bug in some versions, including
> * cppcheck 2.18.3.
> * https://sourceforge.net/p/cppcheck/discussion/general/thread/fecde59085/
> @@ -1087,9 +1111,6 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> /* cppcheck-suppress [uninitvar,unmatchedSuppression] */
> conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked +
> conn->seq_init_from_tap;
> - } else {
> - /* Fall back to acknowledging everything we got */
> - conn->seq_ack_to_tap = conn->seq_from_tap;
> }
>
> /* It's occasionally possible for us to go from using the fallback above
> --
> 2.43.0
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
* Re: [PATCH v2 7/9] tcp: Allow exceeding the available sending buffer size in window advertisements
2025-12-08 0:22 ` [PATCH v2 7/9] tcp: Allow exceeding the available sending buffer size in window advertisements Stefano Brivio
@ 2025-12-08 6:25 ` David Gibson
2025-12-08 7:45 ` Stefano Brivio
0 siblings, 1 reply; 23+ messages in thread
From: David Gibson @ 2025-12-08 6:25 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Max Chernoff
[-- Attachment #1: Type: text/plain, Size: 6174 bytes --]
On Mon, Dec 08, 2025 at 01:22:15AM +0100, Stefano Brivio wrote:
> If the remote peer is advertising a bigger value than our current
> sending buffer, it means that a bigger sending buffer is likely to
> benefit throughput.
>
> We can get a bigger sending buffer by means of the buffer size
> auto-tuning performed by the Linux kernel, which is triggered by
> aggressively filling the sending buffer.
>
> Use an adaptive boost factor, up to 150%, depending on:
>
> - how much data we sent so far: we don't want to risk retransmissions
> for short-lived connections, as the latency cost would be
> unacceptable, and
>
> - the current RTT value, as we need a bigger buffer for higher
> transmission delays
>
> The factor we use is not quite a bandwidth-delay product, as we're
> missing the time component of the bandwidth, which is not interesting
> here: we are trying to make the buffer grow at the beginning of a
> connection, progressively, as more data is sent.
>
> The tuning of the amount of boost factor we want to apply was done
> somewhat empirically but it appears to yield the available throughput
> in rather different scenarios (from ~ 10 Gbps bandwidth with 500 us to
> ~ 1 Gbps with 300 ms RTT) and it allows getting there rather quickly,
> within a few seconds for the 300 ms case.
>
> Note that we want to apply this factor only if the window advertised
> by the peer is bigger than the current sending buffer, as we only need
> this for auto-tuning, and we absolutely don't want to incur
> unnecessary retransmissions otherwise.
>
> The related condition in tcp_update_seqack_wnd() is not redundant as
> there's a subtractive factor, sendq, in the calculation of the window
> limit. If the sending buffer is smaller than the peer's advertised
> window, the additional limit we might apply could be lower than the
> one we would use otherwise.
>
> Assuming that the sending buffer is reported as 100k, sendq is
> 20k, we could have these example cases:
>
> 1. tinfo->tcpi_snd_wnd is 120k, which is bigger than the sending
> buffer, so we boost its size to 150k, and we limit the window
> to 120k
>
> 2. tinfo->tcpi_snd_wnd is 90k, which is smaller than the sending
> buffer, so we aren't trying to trigger buffer auto-tuning and
> we'll stick to the existing, more conservative calculation,
> by limiting the window to 100 - 20 = 80k
>
> If we omitted the new condition, we would always use the boosted
> value, that is, 120k, even if potentially causing unnecessary
> retransmissions.
>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> tcp.c | 38 ++++++++++++++++++++++++++++++++++++++
> 1 file changed, 38 insertions(+)
>
> diff --git a/tcp.c b/tcp.c
> index 3c046a5..60a9687 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -353,6 +353,13 @@ enum {
> #define LOW_RTT_TABLE_SIZE 8
> #define LOW_RTT_THRESHOLD 10 /* us */
>
> +/* Parameters to temporarily exceed sending buffer to force TCP auto-tuning */
> +#define SNDBUF_BOOST_BYTES_RTT_LO 2500 /* B * s: no boost until here */
> +/* ...examples: 5 MB sent * 500 us RTT, 250 kB * 10 ms, 8 kB * 300 ms */
> +#define SNDBUF_BOOST_FACTOR 150 /* % */
> +#define SNDBUF_BOOST_BYTES_RTT_HI 6000 /* apply full boost factor */
> +/* 12 MB sent * 500 us RTT, 600 kB * 10 ms, 20 kB * 300 ms */
> +
> /* Ratio of buffer to bandwidth * delay product implying interactive traffic */
> #define SNDBUF_TO_BW_DELAY_INTERACTIVE /* > */ 20 /* (i.e. < 5% of buffer) */
>
> @@ -1033,6 +1040,35 @@ void tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
> tap_hdr_update(taph, MAX(l3len + sizeof(struct ethhdr), ETH_ZLEN));
> }
>
> +/**
> + * tcp_sndbuf_boost() - Calculate limit of sending buffer to force auto-tuning
> + * @conn: Connection pointer
> + * @tinfo: tcp_info from kernel, must be pre-fetched
> + *
> + * Return: increased sending buffer to use as a limit for advertised window
> + */
> +static unsigned long tcp_sndbuf_boost(struct tcp_tap_conn *conn,
> + struct tcp_info_linux *tinfo)
> +{
> + unsigned long bytes_rtt_product;
> +
> + if (!bytes_acked_cap)
> + return SNDBUF_GET(conn);
> +
> + /* This is *not* a bandwidth-delay product, but it's somewhat related:
> + * as we send more data (usually at the beginning of a connection), we
> + * try to make the sending buffer progressively grow, with the RTT as a
> + * factor (longer delay, bigger buffer needed).
> + */
> + bytes_rtt_product = (long long)tinfo->tcpi_bytes_acked *
> + tinfo->tcpi_rtt / 1000 / 1000;
I only half follow the reasoning in the commit message, but this
doesn't seem quite right to me. Assuming the RTT is roughly fixed, as
you'd expect, this will always trend to infinity for long-lived
connections - regardless of whether they're high throughput or
interactive. So, we'll always trend towards using 150% of the send
buffer size.
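The trend being described can be checked in isolation. Below is a sketch reconstructing the interpolation from the constants above; the `scale_x_to_y_slope()` body is my reading of the helper introduced in patch 1/9, not the actual util.c code:

```c
#include <assert.h>

#define SNDBUF_BOOST_BYTES_RTT_LO	2500	/* B * s: no boost until here */
#define SNDBUF_BOOST_BYTES_RTT_HI	6000	/* apply full boost factor */
#define SNDBUF_BOOST_FACTOR		150	/* % */

/* Reconstruction (not the util.c code): scale 'x' by a factor ramping
 * linearly from 100% when y == lo up to 'factor'% when y >= hi.
 */
static unsigned long scale_x_to_y_slope(unsigned long x, unsigned long y,
					unsigned long lo, unsigned long hi,
					unsigned long factor)
{
	unsigned long pct;

	if (y <= lo)
		return x;
	if (y >= hi)
		return x * factor / 100;

	pct = 100 + (factor - 100) * (y - lo) / (hi - lo);
	return x * pct / 100;
}

/* bytes_acked * rtt grows without bound on a long-lived connection, so
 * the result trends to the full 150% boost, as noted in the review.
 */
static unsigned long boost(unsigned long sndbuf,
			   unsigned long long bytes_acked,
			   unsigned long rtt_us)
{
	unsigned long product = bytes_acked * rtt_us / 1000 / 1000;

	return scale_x_to_y_slope(sndbuf, product,
				  SNDBUF_BOOST_BYTES_RTT_LO,
				  SNDBUF_BOOST_BYTES_RTT_HI,
				  SNDBUF_BOOST_FACTOR);
}
```

For a 100 kB buffer and 500 us RTT: no boost until 5 MB acked, full 150 kB from 12 MB acked onwards, and the boost never decays afterwards.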
> + return scale_x_to_y_slope(SNDBUF_GET(conn), bytes_rtt_product,
> + SNDBUF_BOOST_BYTES_RTT_LO,
> + SNDBUF_BOOST_BYTES_RTT_HI,
> + SNDBUF_BOOST_FACTOR);
> +}
> +
> /**
> * tcp_update_seqack_wnd() - Update ACK sequence and window to guest/tap
> * @c: Execution context
> @@ -1152,6 +1188,8 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
>
> if ((int)sendq > SNDBUF_GET(conn)) /* Due to memory pressure? */
> limit = 0;
> + else if ((int)tinfo->tcpi_snd_wnd > SNDBUF_GET(conn))
> + limit = tcp_sndbuf_boost(conn, tinfo) - (int)sendq;
Now that 5/9 has pointed out to me the existence of
tcpi_delivery_rate, would it make more sense to do a
limit += tcpi_delivery_rate * rtt;
The idea being to allow the guest to send as much as the receiver can
accommodate itself, plus as much as we can fit "in the air" between us
and the peer.
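As a rough illustration of that suggestion (illustrative names and units only: TCP_INFO reports tcpi_delivery_rate in bytes per second and tcpi_rtt in microseconds):

```c
#include <assert.h>

/* Sketch of the suggested alternative limit: what the peer's receive
 * window can hold minus what's already queued, plus one bandwidth-delay
 * product "in the air". Not the patch's code; purely illustrative.
 */
static long long limit_with_bdp(long long snd_wnd, long long sendq,
				long long delivery_rate_Bps,
				long long rtt_us)
{
	long long bdp = delivery_rate_Bps * rtt_us / 1000000;

	return snd_wnd - sendq + bdp;
}
```

For example, a 120 kB peer window with 20 kB queued, at 1 Gbps (125 MB/s) delivery rate and 10 ms RTT, would allow 100 kB plus a 1.25 MB in-flight allowance.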
> else
> limit = SNDBUF_GET(conn) - (int)sendq;
>
> --
> 2.43.0
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 6/9] tcp: Don't limit window to less-than-MSS values, use zero instead
2025-12-08 0:22 ` [PATCH v2 6/9] tcp: Don't limit window to less-than-MSS values, use zero instead Stefano Brivio
@ 2025-12-08 6:43 ` David Gibson
2025-12-08 8:11 ` Stefano Brivio
0 siblings, 1 reply; 23+ messages in thread
From: David Gibson @ 2025-12-08 6:43 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Max Chernoff
[-- Attachment #1: Type: text/plain, Size: 2900 bytes --]
On Mon, Dec 08, 2025 at 01:22:14AM +0100, Stefano Brivio wrote:
> If the sender uses data clumping (including Nagle's algorithm) for
> Silly Window Syndrome (SWS) avoidance, advertising less than a MSS
> means the sender might stop sending altogether, and window updates
> after a low window condition are just as important as they are in
> a zero-window condition.
>
> For simplicity, approximate that limit to zero, as we have an
> implementation forcing window updates after zero-sized windows.
> This matches the suggestion from RFC 813, section 4.
>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Looking at this again, I'm worried it might allow a pathological
case here: unlikely to hit, but very bad if it did.
Suppose we have:
1. A receiver that wants to consume its input in fixed largish
(~64kiB) records
2. The receiver has locked its SO_RCVBUF to that record length, or
only slightly more
3. The receive buf is near full - but not quite a full record's
worth
The receiver doesn't consume anything, because it doesn't have a full
record. Its rcvbuf is near full, so its kernel advertises only a
small window. We approximate that to zero, so the sender can't send
anything. So, the record never gets completed and we stall
completely.
The solution would be to "time out" this limitation of the advertised
window, not sure how complicated that would be in practice.
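A numeric sketch of this scenario (all values hypothetical): with SO_RCVBUF locked to a bit more than one 64 kiB record and the record almost complete, the less-than-MSS window is approximated to zero and neither side can make progress:

```c
#include <assert.h>

/* Hypothetical illustration of the stall described above: the kernel
 * could advertise rcvbuf - buffered, but patch 6/9 approximates any
 * window below the MSS to zero, while the receiver won't consume
 * anything until a full record has arrived.
 */
static int stalled(int rcvbuf, int buffered, int record, int mss)
{
	int wnd = rcvbuf - buffered;	/* what could be advertised */

	if (wnd < mss)
		wnd = 0;		/* patch 6/9's approximation */

	/* zero window and incomplete record: nobody can move */
	return wnd == 0 && buffered < record;
}
```

With a 65 kiB rcvbuf, 65200 bytes buffered and a 65536-byte record, the free space (1360 bytes) is below a 1460-byte MSS, so the window goes to zero and the transfer deadlocks; with only 60000 bytes buffered it does not.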
> ---
> tcp.c | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/tcp.c b/tcp.c
> index 533c8a7..3c046a5 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1155,6 +1155,23 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> else
> limit = SNDBUF_GET(conn) - (int)sendq;
>
> + /* If the sender uses mechanisms to prevent Silly Window
> + * Syndrome (SWS, described in RFC 813 Section 3) it's critical
> + * that, should the window ever become less than the MSS, we
> + * advertise a new value once it increases again to be above it.
> + *
> + * The mechanism to avoid SWS in the kernel is, implicitly,
> + * implemented by Nagle's algorithm (which was proposed after
> + * RFC 813).
> + *
> + * To this end, for simplicity, approximate a window value below
> + * the MSS to zero, as we already have mechanisms in place to
> + * force updates after the window becomes zero. This matches the
> + * suggestion from RFC 813, Section 4.
> + */
> + if (limit < MSS_GET(conn))
> + limit = 0;
> +
> new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, limit);
> }
>
> --
> 2.43.0
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 0/9] tcp: Fix throughput issues with non-local peers
2025-12-08 0:22 [PATCH v2 0/9] tcp: Fix throughput issues with non-local peers Stefano Brivio
` (8 preceding siblings ...)
2025-12-08 0:22 ` [PATCH v2 9/9] tcp: Skip redundant ACK on partial " Stefano Brivio
@ 2025-12-08 6:46 ` David Gibson
2025-12-08 8:22 ` Stefano Brivio
9 siblings, 1 reply; 23+ messages in thread
From: David Gibson @ 2025-12-08 6:46 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Max Chernoff
[-- Attachment #1: Type: text/plain, Size: 1180 bytes --]
On Mon, Dec 08, 2025 at 01:22:08AM +0100, Stefano Brivio wrote:
> Patch 2/9 is the most relevant fix here, as we currently advertise a
> window that might be too big for what we can write to the socket,
> causing retransmissions right away and occasional high latency on
> short transfers to non-local peers.
>
> Mostly as a consequence of fixing that, we now need several
> improvements and small fixes, including, most notably, an adaptive
> approach to pick the interval between checks for socket-side ACKs
> (patch 3/9), and several tricks to reliably trigger TCP buffer size
> auto-tuning as implemented by the Linux kernel (patches 5/9 and 7/9).
>
> These changes make some existing issues more relevant, fixed by the
> other patches.
I've made a number of comments through the series. I think they do
want some consideration. But given this works to address a real
problem empirically, I'm fine for this to be merged as is, with more
polishing later.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 3/9] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks
2025-12-08 5:41 ` David Gibson
@ 2025-12-08 7:22 ` Stefano Brivio
2025-12-08 8:28 ` David Gibson
0 siblings, 1 reply; 23+ messages in thread
From: Stefano Brivio @ 2025-12-08 7:22 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Max Chernoff
On Mon, 8 Dec 2025 16:41:21 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Mon, Dec 08, 2025 at 01:22:11AM +0100, Stefano Brivio wrote:
> > A fixed 10 ms ACK_INTERVAL timer value served us relatively well until
> > the previous change, because we would generally cause retransmissions
> > for non-local outbound transfers with relatively high (> 100 Mbps)
> > bandwidth and non-local but low (< 5 ms) RTT.
> >
> > Now that retransmissions are less frequent, we don't have a proper
> > trigger to check for acknowledged bytes on the socket, and will
> > generally block the sender for a significant amount of time while
> > we could acknowledge more data, instead.
> >
> > Store the RTT reported by the kernel using an approximation (exponent),
> > to keep flow storage size within two (typical) cachelines. Check for
> > socket updates when half of this time elapses: it should be a good
> > indication of the one-way delay we're interested in (peer to us).
> >
> > Representable values are between 100 us and 3.2768 s, and any value
> > outside this range is clamped to these bounds. This choice appears
> > to be a good trade-off between additional overhead and throughput.
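As an illustration of the quoted range, an exponent-based encoding over a 100 us base with 16 steps reaches exactly 100 us << 15 = 3.2768 s (a sketch: RTT_STORE_MIN/RTT_STORE_MAX and the actual macros in tcp_conn.h may well differ):

```c
#include <assert.h>

/* Hypothetical sketch of storing an RTT approximation as a power-of-two
 * exponent over a 100 us base, so it fits in a few bits of flow
 * storage. Representable range: 100 us (exp 0) to 3.2768 s (exp 15),
 * with out-of-range values clamped, as the commit message describes.
 */
#define RTT_BASE_US	100

static unsigned int rtt_encode(unsigned long rtt_us)
{
	unsigned int exp = 0;

	if (rtt_us < RTT_BASE_US)
		rtt_us = RTT_BASE_US;			/* clamp low */

	while (exp < 15 && (RTT_BASE_US << (exp + 1)) <= rtt_us)
		exp++;					/* clamp high */

	return exp;
}

static unsigned long rtt_decode(unsigned int exp)
{
	return (unsigned long)RTT_BASE_US << exp;	/* us */
}
```

Decoding always rounds the stored RTT down to the nearest power-of-two multiple of 100 us, which is fine for a timer interval of RTT / 2.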
> >
> > This mechanism partially overlaps with the "low RTT" destinations,
> > which we use to infer that a socket is connected to an endpoint on
> > the same machine (while possibly in a different namespace) if the
> > RTT is reported as 10 us or less.
> >
> > This change doesn't, however, conflict with it: we are reading
> > TCP_INFO parameters for local connections anyway, so we can always
> > store the RTT approximation opportunistically.
> >
> > Then, if the RTT is "low", we don't really need a timer to
> > acknowledge data as we'll always acknowledge everything to the
> > sender right away. However, we have limited space in the array where
> > we store addresses of local destination, so the low RTT property of a
> > connection might toggle frequently. Because of this, it's actually
> > helpful to always have the RTT approximation stored.
> >
> > This could probably benefit from a future rework, though, introducing
> > a more integrated approach between these two mechanisms.
> >
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> > tcp.c | 28 +++++++++++++++++++++-------
> > tcp_conn.h | 9 +++++++++
> > util.c | 14 ++++++++++++++
> > util.h | 1 +
> > 4 files changed, 45 insertions(+), 7 deletions(-)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 951f434..8eeef4c 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -202,9 +202,13 @@
> > * - ACT_TIMEOUT, in the presence of any event: if no activity is detected on
> > * either side, the connection is reset
> > *
> > - * - ACK_INTERVAL elapsed after data segment received from tap without having
> > + * - RTT / 2 elapsed after data segment received from tap without having
> > * sent an ACK segment, or zero-sized window advertised to tap/guest (flag
> > - * ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent
> > + * ACK_TO_TAP_DUE): forcibly check if an ACK segment can be sent.
> > + *
> > + * RTT, here, is an approximation of the RTT value reported by the kernel via
> > + * TCP_INFO, with a representable range from RTT_STORE_MIN (100 us) to
> > + * RTT_STORE_MAX (3276.8 ms). The timeout value is clamped accordingly.
> > *
> > *
> > * Summary of data flows (with ESTABLISHED event)
> > @@ -341,7 +345,6 @@ enum {
> > #define MSS_DEFAULT 536
> > #define WINDOW_DEFAULT 14600 /* RFC 6928 */
> >
> > -#define ACK_INTERVAL 10 /* ms */
> > #define RTO_INIT 1 /* s, RFC 6298 */
> > #define RTO_INIT_AFTER_SYN_RETRIES 3 /* s, RFC 6298 */
> > #define FIN_TIMEOUT 60
> > @@ -593,7 +596,8 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > }
> >
> > if (conn->flags & ACK_TO_TAP_DUE) {
> > - it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000;
> > + it.it_value.tv_sec = RTT_GET(conn) / 2 / (1000 * 1000);
> > + it.it_value.tv_nsec = RTT_GET(conn) / 2 % (1000 * 1000) * 1000;
> > } else if (conn->flags & ACK_FROM_TAP_DUE) {
> > int exp = conn->retries, timeout = RTO_INIT;
> > if (!(conn->events & ESTABLISHED))
> > @@ -608,9 +612,15 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> > it.it_value.tv_sec = ACT_TIMEOUT;
> > }
> >
> > - flow_dbg(conn, "timer expires in %llu.%03llus",
> > - (unsigned long long)it.it_value.tv_sec,
> > - (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
> > + if (conn->flags & ACK_TO_TAP_DUE) {
> > + flow_trace(conn, "timer expires in %lu.%01llums",
> > + (unsigned long)it.it_value.tv_nsec / 1000 / 1000,
> > + (unsigned long long)it.it_value.tv_nsec / 1000);
>
> This doesn't look right - you need a % to exclude the whole
> milliseconds here for the fractional part.
Ah, oops, right, and on top of that this can be more than one second,
but I forgot to account for that. Fixed in v3.
> Plus, it looks like this
> is trying to compute microseconds, which would be 3 digits after the
> . in ms, but the format string accommodates only one.
That was intended, I wanted to show only the first digit of
microseconds given that the smallest values are hundreds of
microseconds, but changed anyway given the possible confusion.
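A possible corrected formatting, splitting the timer into whole and fractional milliseconds with the missing modulo, and with the seconds contribution included (a sketch of the idea, not the actual v3 code):

```c
#include <stdio.h>
#include <string.h>
#include <stddef.h>
#include <assert.h>

/* Sketch of a fixed trace format for the ACK timer: fold tv_sec into
 * the total, then print whole milliseconds and three digits of
 * microseconds via modulo. Illustrative only; v3 may differ.
 */
static void fmt_timer_ms(char *buf, size_t size,
			 long long tv_sec, long long tv_nsec)
{
	long long total_us = tv_sec * 1000000 + tv_nsec / 1000;

	snprintf(buf, size, "%lld.%03lldms",
		 total_us / 1000, total_us % 1000);
}
```

So a timer of 1 s + 638.4 ms formats as "1638.400ms", and 250 us as "0.250ms", instead of dropping the seconds or repeating the whole milliseconds in the fractional part.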
--
Stefano
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 5/9] tcp: Acknowledge everything if it looks like bulk traffic, not interactive
2025-12-08 5:54 ` David Gibson
@ 2025-12-08 7:25 ` Stefano Brivio
2025-12-08 8:31 ` David Gibson
0 siblings, 1 reply; 23+ messages in thread
From: Stefano Brivio @ 2025-12-08 7:25 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Max Chernoff
On Mon, 8 Dec 2025 16:54:55 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Mon, Dec 08, 2025 at 01:22:13AM +0100, Stefano Brivio wrote:
> > ...instead of checking if the current sending buffer is less than
> > SNDBUF_SMALL, because this isn't simply an optimisation to coalesce
> > ACK segments: we rely on having enough data at once from the sender
> > to make the buffer grow by means of TCP buffer size tuning
> > implemented in the Linux kernel.
> >
> > This is important if we're trying to maximise throughput, but not
> > desirable for interactive traffic, where we want to be as transparent
> > as possible and avoid introducing unnecessary latency.
> >
> > Use the tcpi_delivery_rate field reported by the Linux kernel, if
> > available, to calculate the current bandwidth-delay product: if it's
> > significantly smaller than the available sending buffer, conclude that
> > we're not bandwidth-bound and this is likely to be interactive
> > traffic, so acknowledge data only as it's acknowledged by the peer.
> >
> > Conversely, if the bandwidth-delay product is comparable to the size
> > of the sending buffer (more than 5%), we're probably bandwidth-bound
> > or... bound to be: acknowledge everything in that case.
>
> Ah, nice. This reasoning is much clearer to me than the previous
> spin.
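The heuristic quoted above boils down to a comparison like this (illustrative names; TCP_INFO reports tcpi_delivery_rate in bytes per second and tcpi_rtt in microseconds):

```c
#include <assert.h>

/* Ratio of buffer to bandwidth * delay product implying interactive
 * traffic, as in the patch: bdp below 5% of the sending buffer.
 */
#define SNDBUF_TO_BW_DELAY_INTERACTIVE	20

/* Sketch of the bulk-vs-interactive classification described above.
 * Names are illustrative; the actual condition lives in
 * tcp_update_seqack_wnd().
 */
static int looks_interactive(unsigned long sndbuf,
			     unsigned long long delivery_rate_Bps,
			     unsigned long rtt_us)
{
	unsigned long long bdp = delivery_rate_Bps * rtt_us / 1000000;

	/* interactive if the buffer dwarfs what can be in flight */
	return bdp * SNDBUF_TO_BW_DELAY_INTERACTIVE < sndbuf;
}
```

With a typical ~208 kiB sending buffer, a 1 kB/s flow at 10 ms RTT (bdp 10 B) classifies as interactive, while a 1 Gbps flow at the same RTT (bdp 1.25 MB) classifies as bulk.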
>
> >
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> > tcp.c | 45 +++++++++++++++++++++++++++++++++------------
> > 1 file changed, 33 insertions(+), 12 deletions(-)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 9bf7b8b..533c8a7 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -353,6 +353,9 @@ enum {
> > #define LOW_RTT_TABLE_SIZE 8
> > #define LOW_RTT_THRESHOLD 10 /* us */
> >
> > +/* Ratio of buffer to bandwidth * delay product implying interactive traffic */
> > +#define SNDBUF_TO_BW_DELAY_INTERACTIVE /* > */ 20 /* (i.e. < 5% of buffer) */
> > +
> > #define ACK_IF_NEEDED 0 /* See tcp_send_flag() */
> >
> > #define CONN_IS_CLOSING(conn) \
> > @@ -426,11 +429,13 @@ socklen_t tcp_info_size;
> > sizeof(((struct tcp_info_linux *)NULL)->tcpi_##f_)) <= tcp_info_size)
> >
> > /* Kernel reports sending window in TCP_INFO (kernel commit 8f7baad7f035) */
> > -#define snd_wnd_cap tcp_info_cap(snd_wnd)
> > +#define snd_wnd_cap tcp_info_cap(snd_wnd)
> > /* Kernel reports bytes acked in TCP_INFO (kernel commit 0df48c26d84) */
> > -#define bytes_acked_cap tcp_info_cap(bytes_acked)
> > +#define bytes_acked_cap tcp_info_cap(bytes_acked)
> > /* Kernel reports minimum RTT in TCP_INFO (kernel commit cd9b266095f4) */
> > -#define min_rtt_cap tcp_info_cap(min_rtt)
> > +#define min_rtt_cap tcp_info_cap(min_rtt)
> > +/* Kernel reports delivery rate in TCP_INFO (kernel commit eb8329e0a04d) */
> > +#define delivery_rate_cap tcp_info_cap(delivery_rate)
> >
> > /* sendmsg() to socket */
> > static struct iovec tcp_iov [UIO_MAXIOV];
> > @@ -1048,6 +1053,7 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> > socklen_t sl = sizeof(*tinfo);
> > struct tcp_info_linux tinfo_new;
> > uint32_t new_wnd_to_tap = prev_wnd_to_tap;
> > + bool ack_everything = true;
> > int s = conn->sock;
> >
> > /* At this point we could ack all the data we've accepted for forwarding
> > @@ -1057,7 +1063,8 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> > * control behaviour.
> > *
> > * For it to be possible and worth it we need:
> > - * - The TCP_INFO Linux extension which gives us the peer acked bytes
> > + * - The TCP_INFO Linux extensions which give us the peer acked bytes
> > + * and the delivery rate (outbound bandwidth at receiver)
> > * - Not to be told not to (force_seq)
> > * - Not half-closed in the peer->guest direction
> > * With no data coming from the peer, we might not get events which
> > @@ -1067,19 +1074,36 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> > * Data goes from socket to socket, with nothing meaningfully "in
> > * flight".
> > * - Not a pseudo-local connection (e.g. to a VM on the same host)
> > - * - Large enough send buffer
> > - * In these cases, there's not enough in flight to bother.
> > + * If it is, there's not enough in flight to bother.
> > + * - Sending buffer significantly larger than bandwidth * delay product
> > + * Meaning we're not bandwidth-bound and this is likely to be
> > + * interactive traffic where we want to preserve transparent
> > + * connection behaviour and latency.
>
> Do we actually want the sending buffer size here? Or the amount of
> buffer that's actually in use (SIOCOUTQ)? If we had a burst transfer
> followed by interactive traffic, the kernel could still have a large
> send buffer allocated, no?
The kernel shrinks it rather fast, and if it's not fast enough, then it
still looks like bulk traffic. I tried several metrics (including
something based on the data just sent, which approximates SIOCOUTQ),
but they are not as good as the current buffer size.
> > + *
> > + * Otherwise, we probably want to maximise throughput, which needs
> > + * sending buffer auto-tuning, triggered in turn by filling up the
> > + * outbound socket queue.
> > */
> > - if (bytes_acked_cap && !force_seq &&
> > + if (bytes_acked_cap && delivery_rate_cap && !force_seq &&
> > !CONN_IS_CLOSING(conn) &&
> > - !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn) &&
> > - (unsigned)SNDBUF_GET(conn) >= SNDBUF_SMALL) {
> > + !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn)) {
> > if (!tinfo) {
> > tinfo = &tinfo_new;
> > if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
> > return 0;
> > }
> >
> > + if ((unsigned)SNDBUF_GET(conn) > (long long)RTT_GET(conn) *
>
> Using RTT_GET seems odd here, since we just got a more up to date and
> precise RTT estimate in tinfo.
Oops, right, fixed.
--
Stefano
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 7/9] tcp: Allow exceeding the available sending buffer size in window advertisements
2025-12-08 6:25 ` David Gibson
@ 2025-12-08 7:45 ` Stefano Brivio
0 siblings, 0 replies; 23+ messages in thread
From: Stefano Brivio @ 2025-12-08 7:45 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Max Chernoff
On Mon, 8 Dec 2025 17:25:02 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Mon, Dec 08, 2025 at 01:22:15AM +0100, Stefano Brivio wrote:
> > If the remote peer is advertising a bigger value than our current
> > sending buffer, it means that a bigger sending buffer is likely to
> > benefit throughput.
> >
> > We can get a bigger sending buffer by means of the buffer size
> > auto-tuning performed by the Linux kernel, which is triggered by
> > aggressively filling the sending buffer.
> >
> > Use an adaptive boost factor, up to 150%, depending on:
> >
> > - how much data we sent so far: we don't want to risk retransmissions
> > for short-lived connections, as the latency cost would be
> > unacceptable, and
> >
> > - the current RTT value, as we need a bigger buffer for higher
> > transmission delays
> >
> > The factor we use is not quite a bandwidth-delay product, as we're
> > missing the time component of the bandwidth, which is not interesting
> > here: we are trying to make the buffer grow at the beginning of a
> > connection, progressively, as more data is sent.
> >
> > The tuning of the amount of boost factor we want to apply was done
> > somewhat empirically but it appears to yield the available throughput
> > in rather different scenarios (from ~ 10 Gbps bandwidth with 500 us to
> > ~ 1 Gbps with 300 ms RTT) and it allows getting there rather quickly,
> > within a few seconds for the 300 ms case.
> >
> > Note that we want to apply this factor only if the window advertised
> > by the peer is bigger than the current sending buffer, as we only need
> > this for auto-tuning, and we absolutely don't want to incur
> > unnecessary retransmissions otherwise.
> >
> > The related condition in tcp_update_seqack_wnd() is not redundant as
> > there's a subtractive factor, sendq, in the calculation of the window
> > limit. If the sending buffer is smaller than the peer's advertised
> > window, the additional limit we might apply could be lower than the
> > one we would use otherwise.
> >
> > Assuming that the sending buffer is reported as 100k, sendq is
> > 20k, we could have these example cases:
> >
> > 1. tinfo->tcpi_snd_wnd is 120k, which is bigger than the sending
> > buffer, so we boost its size to 150k, and we limit the window
> > to 120k
> >
> > 2. tinfo->tcpi_snd_wnd is 90k, which is smaller than the sending
> > buffer, so we aren't trying to trigger buffer auto-tuning and
> > we'll stick to the existing, more conservative calculation,
> > by limiting the window to 100 - 20 = 80k
> >
> > If we omitted the new condition, we would always use the boosted
> > value, that is, 120k, even if potentially causing unnecessary
> > retransmissions.
> >
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> > tcp.c | 38 ++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 38 insertions(+)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 3c046a5..60a9687 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -353,6 +353,13 @@ enum {
> > #define LOW_RTT_TABLE_SIZE 8
> > #define LOW_RTT_THRESHOLD 10 /* us */
> >
> > +/* Parameters to temporarily exceed sending buffer to force TCP auto-tuning */
> > +#define SNDBUF_BOOST_BYTES_RTT_LO 2500 /* B * s: no boost until here */
> > +/* ...examples: 5 MB sent * 500 us RTT, 250 kB * 10 ms, 8 kB * 300 ms */
> > +#define SNDBUF_BOOST_FACTOR 150 /* % */
> > +#define SNDBUF_BOOST_BYTES_RTT_HI 6000 /* apply full boost factor */
> > +/* 12 MB sent * 500 us RTT, 600 kB * 10 ms, 20 kB * 300 ms */
> > +
> > /* Ratio of buffer to bandwidth * delay product implying interactive traffic */
> > #define SNDBUF_TO_BW_DELAY_INTERACTIVE /* > */ 20 /* (i.e. < 5% of buffer) */
> >
> > @@ -1033,6 +1040,35 @@ void tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
> > tap_hdr_update(taph, MAX(l3len + sizeof(struct ethhdr), ETH_ZLEN));
> > }
> >
> > +/**
> > + * tcp_sndbuf_boost() - Calculate limit of sending buffer to force auto-tuning
> > + * @conn: Connection pointer
> > + * @tinfo: tcp_info from kernel, must be pre-fetched
> > + *
> > + * Return: increased sending buffer to use as a limit for advertised window
> > + */
> > +static unsigned long tcp_sndbuf_boost(struct tcp_tap_conn *conn,
> > + struct tcp_info_linux *tinfo)
> > +{
> > + unsigned long bytes_rtt_product;
> > +
> > + if (!bytes_acked_cap)
> > + return SNDBUF_GET(conn);
> > +
> > + /* This is *not* a bandwidth-delay product, but it's somewhat related:
> > + * as we send more data (usually at the beginning of a connection), we
> > + * try to make the sending buffer progressively grow, with the RTT as a
> > + * factor (longer delay, bigger buffer needed).
> > + */
> > + bytes_rtt_product = (long long)tinfo->tcpi_bytes_acked *
> > + tinfo->tcpi_rtt / 1000 / 1000;
>
> I only half follow the reasoning in the commit message, but this
> doesn't seem quite right to me. Assuming the RTT is roughly fixed, as
> you'd expect, this will always trend to infinity for long-lived
> connections - regardless of whether they're high throughput or
> interactive. So, we'll always trend towards using 150% of the send
> buffer size.
Yes, that's intended: we want to keep that 150% in the unlikely case
that we don't switch to having a buffer exceeding the peer's advertised
window.
> > + return scale_x_to_y_slope(SNDBUF_GET(conn), bytes_rtt_product,
> > + SNDBUF_BOOST_BYTES_RTT_LO,
> > + SNDBUF_BOOST_BYTES_RTT_HI,
> > + SNDBUF_BOOST_FACTOR);
> > +}
> > +
> > /**
> > * tcp_update_seqack_wnd() - Update ACK sequence and window to guest/tap
> > * @c: Execution context
> > @@ -1152,6 +1188,8 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> >
> > if ((int)sendq > SNDBUF_GET(conn)) /* Due to memory pressure? */
> > limit = 0;
> > + else if ((int)tinfo->tcpi_snd_wnd > SNDBUF_GET(conn))
> > + limit = tcp_sndbuf_boost(conn, tinfo) - (int)sendq;
>
> Now that 5/9 has pointed out to me the existence of
> tcpi_delivery_rate, would it make more sense to do a
> limit += tcpi_delivery_rate * rtt;
>
> The idea being to allow the guest to send as much as the receiver can
> accommodate itself, plus as much as we can fit "in the air" between us
> and the peer.
I tried using the bandwidth-delay product (what you're suggesting to
add), but it turned out that we clearly need to skip the time component
of the bandwidth as we really need to use "data so far" rather than
"data in flight".
--
Stefano
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 6/9] tcp: Don't limit window to less-than-MSS values, use zero instead
2025-12-08 6:43 ` David Gibson
@ 2025-12-08 8:11 ` Stefano Brivio
0 siblings, 0 replies; 23+ messages in thread
From: Stefano Brivio @ 2025-12-08 8:11 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Max Chernoff
On Mon, 8 Dec 2025 17:43:14 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Mon, Dec 08, 2025 at 01:22:14AM +0100, Stefano Brivio wrote:
> > If the sender uses data clumping (including Nagle's algorithm) for
> > Silly Window Syndrome (SWS) avoidance, advertising less than a MSS
> > means the sender might stop sending altogether, and window updates
> > after a low window condition are just as important as they are in
> > a zero-window condition.
> >
> > For simplicity, approximate that limit to zero, as we have an
> > implementation forcing window updates after zero-sized windows.
> > This matches the suggestion from RFC 813, section 4.
> >
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>
> Looking at this again, I'm worried it might allow a pathological
> case here: unlikely to hit, but very bad if it did.
>
> Suppose we have:
> 1. A receiver that wants to consume its input in fixed largish
> (~64kiB) records
> 2. The receiver has locked its SO_RCVBUF to that record length, or
> only slightly more
> 3. The receive buf is near full - but not quite a full record's
> worth
>
> The receiver doesn't consume anything, because it doesn't have a full
> record. Its rcvbuf is near full, so its kernel advertises only a
> small window. We approximate that to zero, so the sender can't send
> anything. So, the record never gets completed and we stall
> completely.
I don't think it can be a problem because the receiver shouldn't
advertise less than a MSS in that case anyway, but I need to look up
normative references for this.
--
Stefano
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v2 0/9] tcp: Fix throughput issues with non-local peers
2025-12-08 6:46 ` [PATCH v2 0/9] tcp: Fix throughput issues with non-local peers David Gibson
@ 2025-12-08 8:22 ` Stefano Brivio
0 siblings, 0 replies; 23+ messages in thread
From: Stefano Brivio @ 2025-12-08 8:22 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Max Chernoff
On Mon, 8 Dec 2025 17:46:43 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Mon, Dec 08, 2025 at 01:22:08AM +0100, Stefano Brivio wrote:
> > Patch 2/9 is the most relevant fix here, as we currently advertise a
> > window that might be too big for what we can write to the socket,
> > causing retransmissions right away and occasional high latency on
> > short transfers to non-local peers.
> >
> > Mostly as a consequence of fixing that, we now need several
> > improvements and small fixes, including, most notably, an adaptive
> > approach to pick the interval between checks for socket-side ACKs
> > (patch 3/9), and several tricks to reliably trigger TCP buffer size
> > auto-tuning as implemented by the Linux kernel (patches 5/9 and 7/9).
> >
> > These changes make some existing issues more relevant, fixed by the
> > other patches.
>
> I've made a number of comments through the series. I think they do
> want some consideration. But given this works to address a real
> problem empirically, I'm fine for this to be merged as is, with more
> polishing later.
I just posted v3 addressing a few items that actually look wrong /
problematic to me.
Note that I forgot to change the "scaling" function name later in the
series (as posted); I'm fixing it up on merge.
--
Stefano
* Re: [PATCH v2 3/9] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks
2025-12-08 7:22 ` Stefano Brivio
@ 2025-12-08 8:28 ` David Gibson
0 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2025-12-08 8:28 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Max Chernoff
On Mon, Dec 08, 2025 at 08:22:12AM +0100, Stefano Brivio wrote:
> On Mon, 8 Dec 2025 16:41:21 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Mon, Dec 08, 2025 at 01:22:11AM +0100, Stefano Brivio wrote:
[snip]
> > > - flow_dbg(conn, "timer expires in %llu.%03llus",
> > > - (unsigned long long)it.it_value.tv_sec,
> > > - (unsigned long long)it.it_value.tv_nsec / 1000 / 1000);
> > > + if (conn->flags & ACK_TO_TAP_DUE) {
> > > + flow_trace(conn, "timer expires in %lu.%01llums",
> > > + (unsigned long)it.it_value.tv_nsec / 1000 / 1000,
> > > + (unsigned long long)it.it_value.tv_nsec / 1000);
> >
> > This doesn't look right - you need a % to exclude the whole
> > milliseconds here for the fractional part.
>
> Ah, oops, right, and on top of that this can be more than one second,
> but I forgot to include the seconds part. Fixed in v3.
>
> > Plus, it looks like this
> > is trying to compute microseconds, which would be 3 digits after the
> > . in ms, but the format string accommodates only one.
>
> That was intended, I wanted to show only the first digit of
> microseconds given that the smallest values are hundreds of
> microseconds, but changed anyway given the possible confusion.
One digit is fine, but then you need tv_nsec / 100000, rather than
nsec / 1000.
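Putting the two corrections together (whole milliseconds including the
seconds part, plus one fractional digit from tv_nsec % 1000000 /
100000), the intended computation could look like the sketch below.
This is illustrative only, reconstructed from the discussion above,
not the v3 patch:

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Format a timer expiry as milliseconds with one fractional digit:
 * whole milliseconds from both tv_sec and tv_nsec, and the first digit
 * of the sub-millisecond remainder. Hypothetical helper, not the
 * actual passt code.
 */
static int fmt_expiry(char *buf, size_t size, const struct timespec *ts)
{
	unsigned long long ms = (unsigned long long)ts->tv_sec * 1000 +
				(unsigned long long)ts->tv_nsec / 1000000;
	unsigned long long frac = (unsigned long long)ts->tv_nsec
				  % 1000000 / 100000;

	return snprintf(buf, size, "timer expires in %llu.%01llums",
			ms, frac);
}
```

With this, a 500 us expiry prints as "0.5ms", and 1.2345678 s prints
as "1234.5ms", so values both below one millisecond and above one
second come out right.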
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
* Re: [PATCH v2 5/9] tcp: Acknowledge everything if it looks like bulk traffic, not interactive
2025-12-08 7:25 ` Stefano Brivio
@ 2025-12-08 8:31 ` David Gibson
0 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2025-12-08 8:31 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Max Chernoff
On Mon, Dec 08, 2025 at 08:25:29AM +0100, Stefano Brivio wrote:
> On Mon, 8 Dec 2025 16:54:55 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Mon, Dec 08, 2025 at 01:22:13AM +0100, Stefano Brivio wrote:
> > > ...instead of checking if the current sending buffer is less than
> > > SNDBUF_SMALL, because this isn't simply an optimisation to coalesce
> > > ACK segments: we rely on having enough data at once from the sender
> > > to make the buffer grow by means of TCP buffer size tuning
> > > implemented in the Linux kernel.
> > >
> > > This is important if we're trying to maximise throughput, but not
> > > desirable for interactive traffic, where we want to be transparent as
> > > possible and avoid introducing unnecessary latency.
> > >
> > > Use the tcpi_delivery_rate field reported by the Linux kernel, if
> > > available, to calculate the current bandwidth-delay product: if it's
> > > significantly smaller than the available sending buffer, conclude that
> > > we're not bandwidth-bound and this is likely to be interactive
> > > traffic, so acknowledge data only as it's acknowledged by the peer.
> > >
> > > Conversely, if the bandwidth-delay product is comparable to the size
> > > of the sending buffer (more than 5%), we're probably bandwidth-bound
> > > or... bound to be: acknowledge everything in that case.
> >
> > Ah, nice. This reasoning is much clearer to me than the previous
> > spin.
> >
> > >
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > ---
> > > tcp.c | 45 +++++++++++++++++++++++++++++++++------------
> > > 1 file changed, 33 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/tcp.c b/tcp.c
> > > index 9bf7b8b..533c8a7 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -353,6 +353,9 @@ enum {
> > > #define LOW_RTT_TABLE_SIZE 8
> > > #define LOW_RTT_THRESHOLD 10 /* us */
> > >
> > > +/* Ratio of buffer to bandwidth * delay product implying interactive traffic */
> > > +#define SNDBUF_TO_BW_DELAY_INTERACTIVE /* > */ 20 /* (i.e. < 5% of buffer) */
> > > +
> > > #define ACK_IF_NEEDED 0 /* See tcp_send_flag() */
> > >
> > > #define CONN_IS_CLOSING(conn) \
> > > @@ -426,11 +429,13 @@ socklen_t tcp_info_size;
> > > sizeof(((struct tcp_info_linux *)NULL)->tcpi_##f_)) <= tcp_info_size)
> > >
> > > /* Kernel reports sending window in TCP_INFO (kernel commit 8f7baad7f035) */
> > > -#define snd_wnd_cap tcp_info_cap(snd_wnd)
> > > +#define snd_wnd_cap tcp_info_cap(snd_wnd)
> > > /* Kernel reports bytes acked in TCP_INFO (kernel commit 0df48c26d84) */
> > > -#define bytes_acked_cap tcp_info_cap(bytes_acked)
> > > +#define bytes_acked_cap tcp_info_cap(bytes_acked)
> > > /* Kernel reports minimum RTT in TCP_INFO (kernel commit cd9b266095f4) */
> > > -#define min_rtt_cap tcp_info_cap(min_rtt)
> > > +#define min_rtt_cap tcp_info_cap(min_rtt)
> > > +/* Kernel reports delivery rate in TCP_INFO (kernel commit eb8329e0a04d) */
> > > +#define delivery_rate_cap tcp_info_cap(delivery_rate)
> > >
> > > /* sendmsg() to socket */
> > > static struct iovec tcp_iov [UIO_MAXIOV];
> > > @@ -1048,6 +1053,7 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> > > socklen_t sl = sizeof(*tinfo);
> > > struct tcp_info_linux tinfo_new;
> > > uint32_t new_wnd_to_tap = prev_wnd_to_tap;
> > > + bool ack_everything = true;
> > > int s = conn->sock;
> > >
> > > /* At this point we could ack all the data we've accepted for forwarding
> > > @@ -1057,7 +1063,8 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> > > * control behaviour.
> > > *
> > > * For it to be possible and worth it we need:
> > > - * - The TCP_INFO Linux extension which gives us the peer acked bytes
> > > + * - The TCP_INFO Linux extensions which give us the peer acked bytes
> > > + * and the delivery rate (outbound bandwidth at receiver)
> > > * - Not to be told not to (force_seq)
> > > * - Not half-closed in the peer->guest direction
> > > * With no data coming from the peer, we might not get events which
> > > @@ -1067,19 +1074,36 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> > > * Data goes from socket to socket, with nothing meaningfully "in
> > > * flight".
> > > * - Not a pseudo-local connection (e.g. to a VM on the same host)
> > > - * - Large enough send buffer
> > > - * In these cases, there's not enough in flight to bother.
> > > + * If it is, there's not enough in flight to bother.
> > > + * - Sending buffer significantly larger than bandwidth * delay product
> > > + * Meaning we're not bandwidth-bound and this is likely to be
> > > + * interactive traffic where we want to preserve transparent
> > > + * connection behaviour and latency.
> >
> > Do we actually want the sending buffer size here? Or the amount of
> > buffer that's actually in use (SIOCOUTQ)? If we had a burst transfer
> > followed by interactive traffic, the kernel could still have a large
> > send buffer allocated, no?
>
> The kernel shrinks it rather fast, and if it's not fast enough, then it
> still looks like bulk traffic. I tried several metrics (including
> something based on the data just sent, which approximates SIOCOUTQ);
> they are not as good as the current buffer size.
Ok. Thinking about this, I guess the kernel has had quite some time
to tweak its heuristics here, so making indirect use of that
experience would be a good idea.
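For reference, the heuristic from patch 5/9 boils down to a comparison
along these lines. This is a sketch from the commit message, assuming
tcpi_delivery_rate is reported in bytes per second and using the
minimum RTT in microseconds as the delay term; the helper name is
hypothetical and this is not the actual passt code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Ratio of sending buffer to bandwidth-delay product above which we
 * assume interactive traffic (i.e. BDP below 5% of the buffer),
 * matching SNDBUF_TO_BW_DELAY_INTERACTIVE from patch 5/9.
 */
#define SNDBUF_TO_BW_DELAY_INTERACTIVE	20

/* Sketch of the patch 5/9 heuristic: compute the bandwidth-delay
 * product from the delivery rate (bytes per second) and an RTT in
 * microseconds, then compare it against the sending buffer size.
 * Illustration only.
 */
static bool looks_interactive(uint64_t delivery_rate, uint32_t rtt_us,
			      uint64_t sndbuf)
{
	uint64_t bdp = delivery_rate * rtt_us / 1000000;

	return sndbuf > bdp * SNDBUF_TO_BW_DELAY_INTERACTIVE;
}
```

Using the buffer size the kernel currently grants, rather than
SIOCOUTQ, indirectly reuses the kernel's own buffer auto-tuning
heuristics, per the discussion above.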
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
Thread overview: 23+ messages
2025-12-08 0:22 [PATCH v2 0/9] tcp: Fix throughput issues with non-local peers Stefano Brivio
2025-12-08 0:22 ` [PATCH v2 1/9] tcp, util: Add function for scaling to linearly interpolated factor, use it Stefano Brivio
2025-12-08 5:33 ` David Gibson
2025-12-08 0:22 ` [PATCH v2 2/9] tcp: Limit advertised window to available, not total sending buffer size Stefano Brivio
2025-12-08 0:22 ` [PATCH v2 3/9] tcp: Adaptive interval based on RTT for socket-side acknowledgement checks Stefano Brivio
2025-12-08 5:41 ` David Gibson
2025-12-08 7:22 ` Stefano Brivio
2025-12-08 8:28 ` David Gibson
2025-12-08 0:22 ` [PATCH v2 4/9] tcp: Don't clear ACK_TO_TAP_DUE if we're advertising a zero-sized window Stefano Brivio
2025-12-08 0:22 ` [PATCH v2 5/9] tcp: Acknowledge everything if it looks like bulk traffic, not interactive Stefano Brivio
2025-12-08 5:54 ` David Gibson
2025-12-08 7:25 ` Stefano Brivio
2025-12-08 8:31 ` David Gibson
2025-12-08 0:22 ` [PATCH v2 6/9] tcp: Don't limit window to less-than-MSS values, use zero instead Stefano Brivio
2025-12-08 6:43 ` David Gibson
2025-12-08 8:11 ` Stefano Brivio
2025-12-08 0:22 ` [PATCH v2 7/9] tcp: Allow exceeding the available sending buffer size in window advertisements Stefano Brivio
2025-12-08 6:25 ` David Gibson
2025-12-08 7:45 ` Stefano Brivio
2025-12-08 0:22 ` [PATCH v2 8/9] tcp: Send a duplicate ACK also on complete sendmsg() failure Stefano Brivio
2025-12-08 0:22 ` [PATCH v2 9/9] tcp: Skip redundant ACK on partial " Stefano Brivio
2025-12-08 6:46 ` [PATCH v2 0/9] tcp: Fix throughput issues with non-local peers David Gibson
2025-12-08 8:22 ` Stefano Brivio
Code repositories for project(s) associated with this public inbox
https://passt.top/passt