[PATCH] tcp: Use SO_MEMINFO for accurate send buffer overhead accounting

From: Jon Maloy <jmaloy@redhat.com>
To: sbrivio@redhat.com, david@gibson.dropbear.id.au,
	jmaloy@redhat.com, passt-dev@passt.top
Subject: [PATCH] tcp: Use SO_MEMINFO for accurate send buffer overhead accounting
Date: Tue, 21 Apr 2026 22:23:42 -0400
Message-ID: <20260422022342.72046-1-jmaloy@redhat.com>

The TCP window advertised to the guest/container must balance two
competing needs: large enough to trigger kernel socket buffer
auto-tuning, but not so large that sendmsg() partially fails, causing
retransmissions.

The current approach uses the difference (SNDBUF_GET() - SIOCOUTQ), but
these two values are not in the same units: SO_SNDBUF includes the
per-skb buffer overhead (sk_buff head, alignment, skb_shared_info),
while SIOCOUTQ returns only the actual payload bytes. The clamped_scale
value of 75% is a rough approximation of this overhead, but it is
inaccurate: too generous for large buffers, causing retransmissions at
higher RTTs, and too conservative for small ones, which inhibits
auto-tuning.

Use SO_MEMINFO to obtain SK_MEMINFO_SNDBUF and SK_MEMINFO_WMEM_QUEUED
from the kernel instead. Both are expressed in the kernel's own
accounting units, i.e. including the per-skb overhead, and match
exactly what the kernel's own sk_stream_memory_free() function uses.

When we combine the above with the payload bytes reported by SIOCOUTQ,
the observed overhead ratio self-calibrates to whatever gso_segs, cache
line size, and sk_buff layout the kernel uses, independently of the
architecture.

When data is queued and the overhead ratio is observable
(wmem_queued > sendq), the available payload window is calculated as:
  (sk_sndbuf - wmem_queued) * sendq / wmem_queued

When the ratio cannot be observed, e.g. because the queue is empty or
we are in a transient state, we fall back to 75% of the remaining
buffer capacity, as before.

If SO_MEMINFO is unavailable, we fall back to the pre-existing
SNDBUF_GET() - SIOCOUTQ calculation.
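
The calculation above, together with the 75% fallback, can be sketched
as a pure function. This is an illustration under stated assumptions,
not the patch code: payload_window() and its parameter names are
hypothetical, the 64-bit intermediate in the 75% branch is a choice of
this sketch, and the SNDBUF_GET() - SIOCOUTQ fallback for a missing
SO_MEMINFO is left out.

```c
#include <stdint.h>

/* Hypothetical helper: estimate how many payload bytes still fit in
 * the send buffer.
 *
 * sndbuf:      SK_MEMINFO_SNDBUF, kernel accounting units
 * wmem_queued: SK_MEMINFO_WMEM_QUEUED, kernel accounting units
 * sendq:       SIOCOUTQ, payload bytes
 */
uint32_t payload_window(uint32_t sndbuf, uint32_t wmem_queued,
			uint32_t sendq)
{
	uint32_t room;

	if (wmem_queued > sndbuf)	/* Buffer already over-committed */
		return 0;

	room = sndbuf - wmem_queued;

	/* Ratio not observable: assume 75% of the room is payload */
	if (!sendq || wmem_queued <= sendq)
		return (uint32_t)((uint64_t)room * 3 / 4);

	/* Scale the room by the observed payload-to-memory ratio. Here
	 * sendq < wmem_queued, so the result is less than room.
	 */
	return (uint32_t)((uint64_t)room * sendq / wmem_queued);
}
```

For example, with sndbuf = 1000000, wmem_queued = 400000 and
sendq = 300000, the observed payload ratio is 3/4 and the result is
600000 * 300000 / 400000 = 450000 bytes.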

Link: https://bugs.passt.top/show_bug.cgi?id=138
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
---
 tcp.c  | 33 ++++++++++++++++++++++++++-------
 util.c |  1 +
 2 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/tcp.c b/tcp.c
index 43b8fdb..3b47a3b 100644
--- a/tcp.c
+++ b/tcp.c
@@ -295,6 +295,7 @@
 #include <arpa/inet.h>

 #include <linux/sockios.h>
+#include <linux/sock_diag.h>

 #include "checksum.h"
 #include "util.h"
@@ -1128,19 +1129,37 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 		new_wnd_to_tap = tinfo->tcpi_snd_wnd;
 	} else {
 		unsigned rtt_ms_ceiling = DIV_ROUND_UP(tinfo->tcpi_rtt, 1000);
+		uint32_t mem[SK_MEMINFO_VARS];
+		socklen_t mem_sl;
 		uint32_t sendq;
-		int limit;
+		uint32_t sndbuf;
+		uint32_t limit;

 		if (ioctl(s, SIOCOUTQ, &sendq)) {
 			debug_perror("SIOCOUTQ on socket %i, assuming 0", s);
 			sendq = 0;
 		}
 		tcp_get_sndbuf(conn);
+		sndbuf = SNDBUF_GET(conn);

-		if ((int)sendq > SNDBUF_GET(conn)) /* Due to memory pressure? */
-			limit = 0;
-		else
-			limit = SNDBUF_GET(conn) - (int)sendq;
+		mem_sl = sizeof(mem);
+		if (getsockopt(s, SOL_SOCKET, SO_MEMINFO, &mem, &mem_sl)) {
+			if (sendq > sndbuf)
+				limit = 0;
+			else
+				limit = sndbuf - sendq;
+		} else {
+			uint32_t sb = mem[SK_MEMINFO_SNDBUF];
+			uint32_t wq = mem[SK_MEMINFO_WMEM_QUEUED];
+
+			if (wq > sb)
+				limit = 0;
+			else if (!sendq || wq <= sendq)
+				limit = (sb - wq) * 3 / 4;
+			else
+				limit = (uint64_t)(sb - wq) *
+					sendq / wq;
+		}

 		/* If the sender uses mechanisms to prevent Silly Window
 		 * Syndrome (SWS, described in RFC 813 Section 3) it's critical
@@ -1168,11 +1187,11 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 		 *   but we won't send enough to fill one because we're stuck
 		 *   with pending data in the outbound queue
 		 */
-		if (limit < MSS_GET(conn) && sendq &&
+		if (limit < (unsigned int)MSS_GET(conn) && sendq &&
 		    tinfo->tcpi_last_data_sent < rtt_ms_ceiling * 10)
 			limit = 0;

-		new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, limit);
+		new_wnd_to_tap = MIN(tinfo->tcpi_snd_wnd, limit);
 	}

 	new_wnd_to_tap = MIN(new_wnd_to_tap, MAX_WINDOW);
diff --git a/util.c b/util.c
index 73c9d51..036fac1 100644
--- a/util.c
+++ b/util.c
@@ -1137,3 +1137,4 @@ long clamped_scale(long x, long y, long lo, long hi, long f)

 	return x - (x * (y - lo) / (hi - lo)) * (100 - f) / 100;
 }
+
-- 
2.52.0

Thread overview: 2+ messages
2026-04-22  2:23 Jon Maloy [this message]
2026-04-22 11:51 ` Stefano Brivio
