* [PATCH] tcp: Use SO_MEMINFO for accurate send buffer overhead accounting
From: Jon Maloy @ 2026-04-22 2:23 UTC (permalink / raw)
To: sbrivio, david, jmaloy, passt-dev
The TCP window advertised to the guest/container must balance two
competing needs: large enough to trigger kernel socket buffer
auto-tuning, but not so large that sendmsg() partially fails, causing
retransmissions.
The current approach uses the difference (SNDBUF_GET() - SIOCOUTQ), but
these values in fact represent different units: SO_SNDBUF
includes the buffer overhead (sk_buff head, alignment, skb_shared_info),
while SIOCOUTQ only returns the actual payload bytes. The clamped_scale
value of 75% is a rough approximation of this overhead, but it is
inaccurate: too generous for large buffers, causing retransmissions at
higher RTTs, and too conservative for small ones, hence inhibiting
auto-tuning.
We now introduce the use of SO_MEMINFO to obtain SK_MEMINFO_SNDBUF and
SK_MEMINFO_WMEM_QUEUED from the kernel. Those are both presented in
the kernel's own accounting units, i.e. including the per-skb overhead,
and match exactly what the kernel's own sk_stream_memory_free()
function is using.
When we combine the above with the payload bytes indicated by SIOCOUTQ,
the observed overhead ratio self-calibrates to whatever gso_segs, cache
line size, and sk_buff layout the kernel may use, and is even
architecture agnostic.
When data is queued and the overhead ratio is observable
(wmem_queued > sendq), the available payload window is calculated as:
(sk_sndbuf - wmem_queued) * sendq / wmem_queued
When the ratio cannot be observed, e.g. because the queue is empty or
we are in a transient state, we fall back to 75% of remaining buffer
capacity, like before.
If SO_MEMINFO is unavailable, we fall back to the pre-existing
SNDBUF_GET() - SIOCOUTQ calculation.
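For illustration, the selection logic described above can be sketched as
follows (simplified, hypothetical helper names; the actual change is in
the diff below):

```c
#include <stdint.h>

/* Sketch of the window selection described above: sb is
 * SK_MEMINFO_SNDBUF and wq is SK_MEMINFO_WMEM_QUEUED, both in the
 * kernel's accounting units; sendq is the SIOCOUTQ payload byte count.
 */
static uint32_t payload_window(uint32_t sb, uint32_t wq, uint32_t sendq)
{
	if (wq > sb)			/* queued beyond buffer: no room */
		return 0;

	if (!sendq || wq <= sendq)	/* overhead ratio not observable */
		return (sb - wq) * 3 / 4;

	/* Scale remaining capacity by the observed payload/total ratio */
	return (uint64_t)(sb - wq) * sendq / wq;
}
```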
Link: https://bugs.passt.top/show_bug.cgi?id=138
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
---
tcp.c | 33 ++++++++++++++++++++++++++-------
util.c | 1 +
2 files changed, 27 insertions(+), 7 deletions(-)
diff --git a/tcp.c b/tcp.c
index 43b8fdb..3b47a3b 100644
--- a/tcp.c
+++ b/tcp.c
@@ -295,6 +295,7 @@
#include <arpa/inet.h>
#include <linux/sockios.h>
+#include <linux/sock_diag.h>
#include "checksum.h"
#include "util.h"
@@ -1128,19 +1129,37 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
new_wnd_to_tap = tinfo->tcpi_snd_wnd;
} else {
unsigned rtt_ms_ceiling = DIV_ROUND_UP(tinfo->tcpi_rtt, 1000);
+ uint32_t mem[SK_MEMINFO_VARS];
+ socklen_t mem_sl;
uint32_t sendq;
- int limit;
+ uint32_t sndbuf;
+ uint32_t limit;
if (ioctl(s, SIOCOUTQ, &sendq)) {
debug_perror("SIOCOUTQ on socket %i, assuming 0", s);
sendq = 0;
}
tcp_get_sndbuf(conn);
+ sndbuf = SNDBUF_GET(conn);
- if ((int)sendq > SNDBUF_GET(conn)) /* Due to memory pressure? */
- limit = 0;
- else
- limit = SNDBUF_GET(conn) - (int)sendq;
+ mem_sl = sizeof(mem);
+ if (getsockopt(s, SOL_SOCKET, SO_MEMINFO, &mem, &mem_sl)) {
+ if (sendq > sndbuf)
+ limit = 0;
+ else
+ limit = sndbuf - sendq;
+ } else {
+ uint32_t sb = mem[SK_MEMINFO_SNDBUF];
+ uint32_t wq = mem[SK_MEMINFO_WMEM_QUEUED];
+
+ if (wq > sb)
+ limit = 0;
+ else if (!sendq || wq <= sendq)
+ limit = (sb - wq) * 3 / 4;
+ else
+ limit = (uint64_t)(sb - wq) *
+ sendq / wq;
+ }
/* If the sender uses mechanisms to prevent Silly Window
* Syndrome (SWS, described in RFC 813 Section 3) it's critical
@@ -1168,11 +1187,11 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
* but we won't send enough to fill one because we're stuck
* with pending data in the outbound queue
*/
- if (limit < MSS_GET(conn) && sendq &&
+ if (limit < (unsigned int)MSS_GET(conn) && sendq &&
tinfo->tcpi_last_data_sent < rtt_ms_ceiling * 10)
limit = 0;
- new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, limit);
+ new_wnd_to_tap = MIN(tinfo->tcpi_snd_wnd, limit);
}
new_wnd_to_tap = MIN(new_wnd_to_tap, MAX_WINDOW);
diff --git a/util.c b/util.c
index 73c9d51..036fac1 100644
--- a/util.c
+++ b/util.c
@@ -1137,3 +1137,4 @@ long clamped_scale(long x, long y, long lo, long hi, long f)
return x - (x * (y - lo) / (hi - lo)) * (100 - f) / 100;
}
+
--
2.52.0
* Re: [PATCH] tcp: Use SO_MEMINFO for accurate send buffer overhead accounting
From: Stefano Brivio @ 2026-04-22 11:51 UTC (permalink / raw)
To: Jon Maloy; +Cc: david, passt-dev
On Tue, 21 Apr 2026 22:23:42 -0400
Jon Maloy <jmaloy@redhat.com> wrote:
> The TCP window advertised to the guest/container must balance two
> competing needs: large enough to trigger kernel socket buffer
> auto-tuning, but not so large that sendmsg() partially fails, causing
> retransmissions.
>
> The current approach uses the difference (SNDBUF_GET() - SIOCOUTQ), but
> these values in fact represent different units: SO_SNDBUF
> includes the buffer overhead (sk_buff head, alignment, skb_shared_info),
> while SIOCOUTQ only returns the actual payload bytes.
They're not really different units, because SNDBUF_GET() already
returns a scaled value trying to take (very roughly) the overhead into
account.
> The clamped_scale
> value of 75% is a rough approximation of this overhead, but it is
> inaccurate: too generous for large buffers, causing retransmissions at
> higher RTTs, and too conservative for small ones, hence inhibiting
> auto-tuning.
It actually works the other way around (we use 100% for small buffers
and gradually go towards 75% for large buffers) and auto-tuning works
pretty well with it.
Example before your patch with an iperf3 test with 15 ms RTT:
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 125 MBytes 1.05 Gbits/sec 0 9.55 MBytes
[ 5] 1.00-2.00 sec 110 MBytes 922 Mbits/sec 0 9.55 MBytes
[ 5] 2.00-3.00 sec 111 MBytes 934 Mbits/sec 0 9.55 MBytes
[ 5] 3.00-4.00 sec 104 MBytes 877 Mbits/sec 0 9.55 MBytes
[ 5] 4.00-5.00 sec 110 MBytes 927 Mbits/sec 0 9.55 MBytes
[ 5] 5.00-6.00 sec 111 MBytes 928 Mbits/sec 0 9.55 MBytes
[ 5] 6.00-7.00 sec 112 MBytes 944 Mbits/sec 0 9.55 MBytes
[ 5] 7.00-8.00 sec 110 MBytes 919 Mbits/sec 0 9.55 MBytes
[ 5] 8.00-9.00 sec 112 MBytes 942 Mbits/sec 0 9.55 MBytes
[ 5] 9.00-10.00 sec 110 MBytes 918 Mbits/sec 0 9.55 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 1.10 GBytes 941 Mbits/sec 0 sender
[ 5] 0.00-10.02 sec 1.07 GBytes 918 Mbits/sec receiver
and after your patch:
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 2.50 MBytes 21.0 Mbits/sec 0 320 KBytes
[ 5] 1.00-2.00 sec 1.25 MBytes 10.5 Mbits/sec 0 320 KBytes
[ 5] 2.00-3.00 sec 1.75 MBytes 14.7 Mbits/sec 0 320 KBytes
[ 5] 3.00-4.00 sec 1.38 MBytes 11.5 Mbits/sec 0 320 KBytes
[ 5] 4.00-5.00 sec 1.25 MBytes 10.5 Mbits/sec 0 320 KBytes
[ 5] 5.00-6.00 sec 1.75 MBytes 14.7 Mbits/sec 0 320 KBytes
[ 5] 6.00-7.00 sec 1.75 MBytes 14.7 Mbits/sec 0 320 KBytes
[ 5] 7.00-8.00 sec 1.38 MBytes 11.5 Mbits/sec 0 320 KBytes
[ 5] 8.00-9.00 sec 1.75 MBytes 14.7 Mbits/sec 0 320 KBytes
[ 5] 9.00-10.00 sec 1.25 MBytes 10.5 Mbits/sec 0 320 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 16.0 MBytes 13.4 Mbits/sec 0 sender
[ 5] 0.00-10.02 sec 15.1 MBytes 12.7 Mbits/sec receiver
It's similar in a test with 285 ms RTT. Before your patch:
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 1.12 MBytes 9.43 Mbits/sec 0 320 KBytes
[ 5] 1.00-2.00 sec 2.00 MBytes 16.8 Mbits/sec 0 1.17 MBytes
[ 5] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 11 660 KBytes
[ 5] 3.00-4.00 sec 3.12 MBytes 26.2 Mbits/sec 11 540 KBytes
[ 5] 4.00-5.00 sec 31.5 MBytes 264 Mbits/sec 0 1.93 MBytes
[ 5] 5.00-6.00 sec 83.9 MBytes 704 Mbits/sec 0 4.10 MBytes
[ 5] 6.00-7.00 sec 112 MBytes 941 Mbits/sec 0 7.38 MBytes
[ 5] 7.00-8.00 sec 126 MBytes 1.06 Gbits/sec 0 11.9 MBytes
[ 5] 8.00-9.00 sec 114 MBytes 952 Mbits/sec 0 11.9 MBytes
[ 5] 9.00-10.00 sec 110 MBytes 925 Mbits/sec 0 11.9 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 584 MBytes 490 Mbits/sec 22 sender
[ 5] 0.00-10.31 sec 548 MBytes 445 Mbits/sec receiver
and after your patch:
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 2.50 MBytes 21.0 Mbits/sec 0 320 KBytes
[ 5] 1.00-2.00 sec 1.25 MBytes 10.5 Mbits/sec 0 320 KBytes
[ 5] 2.00-3.00 sec 1.75 MBytes 14.7 Mbits/sec 0 320 KBytes
[ 5] 3.00-4.00 sec 1.38 MBytes 11.5 Mbits/sec 0 320 KBytes
[ 5] 4.00-5.00 sec 1.25 MBytes 10.5 Mbits/sec 0 320 KBytes
[ 5] 5.00-6.00 sec 1.75 MBytes 14.7 Mbits/sec 0 320 KBytes
[ 5] 6.00-7.00 sec 1.75 MBytes 14.7 Mbits/sec 0 320 KBytes
[ 5] 7.00-8.00 sec 1.38 MBytes 11.5 Mbits/sec 0 320 KBytes
[ 5] 8.00-9.00 sec 1.75 MBytes 14.7 Mbits/sec 0 320 KBytes
[ 5] 9.00-10.00 sec 1.25 MBytes 10.5 Mbits/sec 0 320 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 16.0 MBytes 13.4 Mbits/sec 0 sender
[ 5] 0.00-10.02 sec 15.1 MBytes 12.7 Mbits/sec receiver
> We now introduce the use of SO_MEMINFO to obtain SK_MEMINFO_SNDBUF and
> SK_MEMINFO_WMEM_QUEUED from the kernel. Those are both presented in
> the kernel's own accounting units, i.e. including the per-skb overhead,
> and match exactly what the kernel's own sk_stream_memory_free()
> function is using.
Using SK_MEMINFO_WMEM_QUEUED might be helpful, because that actually
tells us the memory used for pending outgoing segments *as currently
stored*, but SK_MEMINFO_SNDBUF returns the same value as we get via
SO_SNDBUF socket option (in the kernel, that's sk->sk_sndbuf), so we're
still missing the information of how much overhead we'll have for data
*we haven't written yet*.
That is, the approach we were considering so far was something like:
- divide the value returned via SIOCOUTQ into MSS-sized segments,
calculate and add overhead for each of those.
This shouldn't be needed if we use SK_MEMINFO_WMEM_QUEUED (while I
wanted to avoid SO_MEMINFO because it's a rather large copy_to_user(),
maybe it's actually fine)
- divide the remaining space into MSS-sized segments, calculate
overhead for each of them... and this is the tricky part that you're
approximating here.
> When we combine the above with the payload bytes indicated by SIOCOUTQ,
> the observed overhead ratio self-calibrates to whatever gso_segs, cache
> line size, and sk_buff layout the kernel may use, and is even
> architecture agnostic.
I hope we can get something like this to work, but this patch, as it is,
applies a linear factor to a non-linear overhead, which *I think* is
what results in the underestimation of the available buffer size that's
visible from tests.
> When data is queued and the overhead ratio is observable
> (wmem_queued > sendq), the available payload window is calculated as:
> (sk_sndbuf - wmem_queued) * sendq / wmem_queued
I still think we should try a bit harder to accurately reverse the
calculation done by the kernel including gso_segs.
If we can't, a variation I would try on this patch is to consider
segments as discrete quantities, because that should be slightly more
accurate.
Example: this patch right now would calculate, say:
( 200000 - 87500 (50 segments) ) * 73000 (those 50 segments) / 87500
and give us an 83.428% payload factor which we apply flat over those
112500 bytes remaining, giving us 93857 bytes of window.
Instead, I think we could do this:
* 87500 bytes queued for 50 segments (73000 payload bytes) -> 290 bytes overhead per segment
* 112500 bytes left: 64 segments
* advertise 93440 bytes
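That discrete-segment variant could be sketched roughly like this
(hypothetical helper, not the patch as posted; numbers from the example
above, with an MSS of 1460):

```c
#include <stdint.h>

/* Discrete-segment variant: derive the per-segment overhead from what
 * is already queued, then advertise whole segments' worth of payload.
 * sb = sk_sndbuf, wq = wmem_queued, sendq = SIOCOUTQ payload bytes.
 */
static uint32_t window_discrete(uint32_t sb, uint32_t wq,
				uint32_t sendq, uint32_t mss)
{
	uint32_t nsegs, overhead, nfree;

	if (wq >= sb || !sendq || wq <= sendq)
		return 0;		 /* fall back to the flat factor */

	nsegs = (sendq + mss - 1) / mss; /* 73000 bytes -> 50 segments */
	overhead = (wq - sendq) / nsegs; /* 14500 / 50 = 290 per segment */
	nfree = (sb - wq) / (mss + overhead); /* 112500 / 1750 = 64 */

	return nfree * mss;		 /* 64 * 1460 = 93440 bytes */
}
```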
> When the ratio cannot be observed, e.g. because the queue is empty or
> we are in a transient state, we fall back to 75% of remaining buffer
> capacity, like before.
That's not generally the case. We use between 75% and 100%.
> If SO_MEMINFO is unavailable, we fall back to the pre-existing
> SNDBUF_GET() - SIOCOUTQ calculation.
>
> Link: https://bugs.passt.top/show_bug.cgi?id=138
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> ---
> tcp.c | 33 ++++++++++++++++++++++++++-------
> util.c | 1 +
> 2 files changed, 27 insertions(+), 7 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 43b8fdb..3b47a3b 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -295,6 +295,7 @@
> #include <arpa/inet.h>
>
> #include <linux/sockios.h>
> +#include <linux/sock_diag.h>
>
> #include "checksum.h"
> #include "util.h"
> @@ -1128,19 +1129,37 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> new_wnd_to_tap = tinfo->tcpi_snd_wnd;
> } else {
> unsigned rtt_ms_ceiling = DIV_ROUND_UP(tinfo->tcpi_rtt, 1000);
> + uint32_t mem[SK_MEMINFO_VARS];
> + socklen_t mem_sl;
> uint32_t sendq;
> - int limit;
> + uint32_t sndbuf;
> + uint32_t limit;
>
> if (ioctl(s, SIOCOUTQ, &sendq)) {
> debug_perror("SIOCOUTQ on socket %i, assuming 0", s);
> sendq = 0;
> }
> tcp_get_sndbuf(conn);
> + sndbuf = SNDBUF_GET(conn);
>
> - if ((int)sendq > SNDBUF_GET(conn)) /* Due to memory pressure? */
> - limit = 0;
> - else
> - limit = SNDBUF_GET(conn) - (int)sendq;
> + mem_sl = sizeof(mem);
> + if (getsockopt(s, SOL_SOCKET, SO_MEMINFO, &mem, &mem_sl)) {
If we are already fetching this, we don't need to fetch SO_SNDBUF (same
as SK_MEMINFO_SNDBUF).
> + if (sendq > sndbuf)
> + limit = 0;
> + else
> + limit = sndbuf - sendq;
> + } else {
> + uint32_t sb = mem[SK_MEMINFO_SNDBUF];
> + uint32_t wq = mem[SK_MEMINFO_WMEM_QUEUED];
> +
> + if (wq > sb)
> + limit = 0;
> + else if (!sendq || wq <= sendq)
> + limit = (sb - wq) * 3 / 4;
Note that SNDBUF_GET() is already scaled.
Maybe, actually, that's the problem? Let me try to fix that and see
what happens...
> + else
> + limit = (uint64_t)(sb - wq) *
> + sendq / wq;
> + }
>
> /* If the sender uses mechanisms to prevent Silly Window
> * Syndrome (SWS, described in RFC 813 Section 3) it's critical
> @@ -1168,11 +1187,11 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> * but we won't send enough to fill one because we're stuck
> * with pending data in the outbound queue
> */
> - if (limit < MSS_GET(conn) && sendq &&
> + if (limit < (unsigned int)MSS_GET(conn) && sendq &&
> tinfo->tcpi_last_data_sent < rtt_ms_ceiling * 10)
> limit = 0;
>
> - new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, limit);
> + new_wnd_to_tap = MIN(tinfo->tcpi_snd_wnd, limit);
> }
>
> new_wnd_to_tap = MIN(new_wnd_to_tap, MAX_WINDOW);
> diff --git a/util.c b/util.c
> index 73c9d51..036fac1 100644
> --- a/util.c
> +++ b/util.c
> @@ -1137,3 +1137,4 @@ long clamped_scale(long x, long y, long lo, long hi, long f)
>
> return x - (x * (y - lo) / (hi - lo)) * (100 - f) / 100;
> }
> +
Nit: three unrelated changes.
--
Stefano