From: Stefano Brivio <sbrivio@redhat.com>
To: Jon Maloy <jmaloy@redhat.com>
Cc: david@gibson.dropbear.id.au, passt-dev@passt.top
Subject: Re: [PATCH] tcp: Use SO_MEMINFO for accurate send buffer overhead accounting
Date: Wed, 22 Apr 2026 13:51:32 +0200 (CEST)
Message-ID: <20260422135131.5806f34f@elisabeth>
In-Reply-To: <20260422022342.72046-1-jmaloy@redhat.com>
On Tue, 21 Apr 2026 22:23:42 -0400
Jon Maloy <jmaloy@redhat.com> wrote:
> The TCP window advertised to the guest/container must balance two
> competing needs: large enough to trigger kernel socket buffer
> auto-tuning, but not so large that sendmsg() partially fails, causing
> retransmissions.
>
> The current approach uses the difference (SNDBUF_GET() - SIOCOUTQ), but
> these values are in reality representing different units: SO_SNDBUF
> includes the buffer overhead (sk_buff head, alignment, skb_shared_info),
> while SIOCOUTQ only returns the actual payload bytes.
They're not really different units, because SNDBUF_GET() already
returns a scaled value trying to take (very roughly) the overhead into
account.
> The clamped_scale
> value of 75% is a rough approximation of this overhead, but it is
> inaccurate: too generous for large buffers, causing retransmissions at
> higher RTTs, and too conservative for small ones, hence inhibiting
> auto-tuning.
It actually works the other way around: we use 100% for small buffers
and gradually move towards 75% for large buffers. Auto-tuning works
pretty well with it.
Example before your patch with an iperf3 test with 15 ms RTT:
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   125 MBytes  1.05 Gbits/sec    0   9.55 MBytes
[  5]   1.00-2.00   sec   110 MBytes   922 Mbits/sec    0   9.55 MBytes
[  5]   2.00-3.00   sec   111 MBytes   934 Mbits/sec    0   9.55 MBytes
[  5]   3.00-4.00   sec   104 MBytes   877 Mbits/sec    0   9.55 MBytes
[  5]   4.00-5.00   sec   110 MBytes   927 Mbits/sec    0   9.55 MBytes
[  5]   5.00-6.00   sec   111 MBytes   928 Mbits/sec    0   9.55 MBytes
[  5]   6.00-7.00   sec   112 MBytes   944 Mbits/sec    0   9.55 MBytes
[  5]   7.00-8.00   sec   110 MBytes   919 Mbits/sec    0   9.55 MBytes
[  5]   8.00-9.00   sec   112 MBytes   942 Mbits/sec    0   9.55 MBytes
[  5]   9.00-10.00  sec   110 MBytes   918 Mbits/sec    0   9.55 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.10 GBytes   941 Mbits/sec    0             sender
[  5]   0.00-10.02  sec  1.07 GBytes   918 Mbits/sec                  receiver
and after your patch:
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.50 MBytes  21.0 Mbits/sec    0    320 KBytes
[  5]   1.00-2.00   sec  1.25 MBytes  10.5 Mbits/sec    0    320 KBytes
[  5]   2.00-3.00   sec  1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[  5]   3.00-4.00   sec  1.38 MBytes  11.5 Mbits/sec    0    320 KBytes
[  5]   4.00-5.00   sec  1.25 MBytes  10.5 Mbits/sec    0    320 KBytes
[  5]   5.00-6.00   sec  1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[  5]   6.00-7.00   sec  1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[  5]   7.00-8.00   sec  1.38 MBytes  11.5 Mbits/sec    0    320 KBytes
[  5]   8.00-9.00   sec  1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[  5]   9.00-10.00  sec  1.25 MBytes  10.5 Mbits/sec    0    320 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  16.0 MBytes  13.4 Mbits/sec    0             sender
[  5]   0.00-10.02  sec  15.1 MBytes  12.7 Mbits/sec                  receiver
It's similar in a test with 285 ms RTT. Before your patch:
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.12 MBytes  9.43 Mbits/sec    0    320 KBytes
[  5]   1.00-2.00   sec  2.00 MBytes  16.8 Mbits/sec    0   1.17 MBytes
[  5]   2.00-3.00   sec  0.00 Bytes   0.00 bits/sec    11    660 KBytes
[  5]   3.00-4.00   sec  3.12 MBytes  26.2 Mbits/sec   11    540 KBytes
[  5]   4.00-5.00   sec  31.5 MBytes   264 Mbits/sec    0   1.93 MBytes
[  5]   5.00-6.00   sec  83.9 MBytes   704 Mbits/sec    0   4.10 MBytes
[  5]   6.00-7.00   sec   112 MBytes   941 Mbits/sec    0   7.38 MBytes
[  5]   7.00-8.00   sec   126 MBytes  1.06 Gbits/sec    0   11.9 MBytes
[  5]   8.00-9.00   sec   114 MBytes   952 Mbits/sec    0   11.9 MBytes
[  5]   9.00-10.00  sec   110 MBytes   925 Mbits/sec    0   11.9 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   584 MBytes   490 Mbits/sec   22             sender
[  5]   0.00-10.31  sec   548 MBytes   445 Mbits/sec                  receiver
and after your patch:
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  2.50 MBytes  21.0 Mbits/sec    0    320 KBytes
[  5]   1.00-2.00   sec  1.25 MBytes  10.5 Mbits/sec    0    320 KBytes
[  5]   2.00-3.00   sec  1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[  5]   3.00-4.00   sec  1.38 MBytes  11.5 Mbits/sec    0    320 KBytes
[  5]   4.00-5.00   sec  1.25 MBytes  10.5 Mbits/sec    0    320 KBytes
[  5]   5.00-6.00   sec  1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[  5]   6.00-7.00   sec  1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[  5]   7.00-8.00   sec  1.38 MBytes  11.5 Mbits/sec    0    320 KBytes
[  5]   8.00-9.00   sec  1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[  5]   9.00-10.00  sec  1.25 MBytes  10.5 Mbits/sec    0    320 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  16.0 MBytes  13.4 Mbits/sec    0             sender
[  5]   0.00-10.02  sec  15.1 MBytes  12.7 Mbits/sec                  receiver
> We now introduce the use of SO_MEMINFO to obtain SK_MEMINFO_SNDBUF and
> SK_MEMINFO_WMEM_QUEUED from the kernel. Those are both presented in
> the kernel's own accounting units, i.e. including the per-skb overhead,
> and match exactly what the kernel's own sk_stream_memory_free()
> function is using.
Using SK_MEMINFO_WMEM_QUEUED might be helpful, because that actually
tells us the memory used for pending outgoing segments *as currently
stored*, but SK_MEMINFO_SNDBUF returns the same value as we get via the
SO_SNDBUF socket option (in the kernel, that's sk->sk_sndbuf), so we're
still missing information about how much overhead we'll have for data
*we haven't written yet*.
That is, the approach we were considering so far was something like:
- divide the value returned via SIOCOUTQ into MSS-sized segments, then
  calculate and add overhead for each of those.
  This shouldn't be needed if we use SK_MEMINFO_WMEM_QUEUED (while I
  wanted to avoid SO_MEMINFO because it implies a rather large
  copy_to_user(), maybe it's actually fine).
- divide the remaining space into MSS-sized segments and calculate
  overhead for each of them... and this is the tricky part that you're
  approximating here.
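For the first step, a minimal sketch of what "calculate and add
overhead for each segment" could look like (the helper name and
OVERHEAD_GUESS value are hypothetical; the real overhead depends on
gso_segs, cache line size and sk_buff layout, which is exactly the
tricky part):

```c
#include <stdint.h>

/* Hypothetical sketch: split a payload byte count into MSS-sized
 * segments and add an estimated per-skb overhead for each.  The
 * per-segment overhead is a guess here, not a value from the kernel.
 */
#define OVERHEAD_GUESS 290	/* bytes per segment, illustrative only */

static uint32_t bytes_with_overhead(uint32_t payload, uint32_t mss)
{
	uint32_t segs = (payload + mss - 1) / mss;	/* round up */

	return payload + segs * OVERHEAD_GUESS;
}
```

With sendq = 73000 and MSS = 1460, this gives 50 segments and 87500
bytes, matching the accounted figure in the example further down.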
> When we combine the above with the payload bytes indicated by SIOCOUTQ,
> the observed overhead ratio self-calibrates to whatever gso_segs, cache
> line size, and sk_buff layout the kernel may use, and is even
> architecture agnostic.
I hope we can get something like this to work, but this patch, as it is,
applies a linear factor to a non-linear overhead, which *I think* is
what results in the underestimation of the available buffer size that's
visible from tests.
> When data is queued and the overhead ratio is observable
> (wmem_queued > sendq), the available payload window is calculated as:
> (sk_sndbuf - wmem_queued) * sendq / wmem_queued
I still think we should try a bit harder to accurately reverse the
calculation done by the kernel including gso_segs.
If we can't, a variation I would try on this patch is to consider
segments as discrete quantities, because that should be slightly more
accurate.
Example: this patch right now would calculate, say:

  ( 200000 - 87500 (50 segments) ) * 73000 (those same 50 segments) / 87500

and give us an 83.428% payload factor, which we apply flat over those
112500 bytes remaining, giving us 93857 bytes of window.
Instead, I think we could do this:
* 87500 bytes queued for 50 segments -> 1750 bytes per segment, that
  is, 290 bytes of overhead per segment
* 112500 bytes left: 64 segments of 1750 bytes each
* advertise 64 * 1460 = 93440 bytes
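A sketch of the two calculations side by side, with the numbers from
this example (function names are mine, not passt's):

```c
#include <stdint.h>

/* Flat-ratio estimate, as in the patch: scale the remaining accounted
 * space by the payload/accounted ratio observed on the queued data.
 */
static uint32_t window_flat(uint32_t sndbuf, uint32_t wmem_queued,
			    uint32_t sendq)
{
	if (!wmem_queued || wmem_queued >= sndbuf)
		return 0;

	return (uint64_t)(sndbuf - wmem_queued) * sendq / wmem_queued;
}

/* Discrete variant: derive a per-segment accounted size from the queued
 * data, then count how many whole segments fit in the remaining space.
 */
static uint32_t window_discrete(uint32_t sndbuf, uint32_t wmem_queued,
				uint32_t sendq, uint32_t mss)
{
	uint32_t queued_segs, seg_size, free_segs;

	if (!sendq || wmem_queued <= sendq || wmem_queued >= sndbuf)
		return 0;	/* ratio not observable */

	queued_segs = (sendq + mss - 1) / mss;		/* 73000 -> 50 */
	seg_size = wmem_queued / queued_segs;		/* 87500 / 50 = 1750 */
	free_segs = (sndbuf - wmem_queued) / seg_size;	/* 112500 / 1750 = 64 */

	return free_segs * mss;				/* 64 * 1460 = 93440 */
}
```

window_flat(200000, 87500, 73000) gives 93857 and
window_discrete(200000, 87500, 73000, 1460) gives 93440, matching the
figures above.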
> When the ratio cannot be observed, e.g. because the queue is empty or
> we are in a transient state, we fall back to 75% of remaining buffer
> capacity, like before.
That's not generally the case. We use between 75% and 100%.
> If SO_MEMINFO is unavailable, we fall back to the pre-existing
> SNDBUF_GET() - SIOCOUTQ calculation.
>
> Link: https://bugs.passt.top/show_bug.cgi?id=138
> Signed-off-by: Jon Maloy <jmaloy@redhat.com>
> ---
> tcp.c | 33 ++++++++++++++++++++++++++-------
> util.c | 1 +
> 2 files changed, 27 insertions(+), 7 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 43b8fdb..3b47a3b 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -295,6 +295,7 @@
> #include <arpa/inet.h>
>
> #include <linux/sockios.h>
> +#include <linux/sock_diag.h>
>
> #include "checksum.h"
> #include "util.h"
> @@ -1128,19 +1129,37 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> new_wnd_to_tap = tinfo->tcpi_snd_wnd;
> } else {
> unsigned rtt_ms_ceiling = DIV_ROUND_UP(tinfo->tcpi_rtt, 1000);
> + uint32_t mem[SK_MEMINFO_VARS];
> + socklen_t mem_sl;
> uint32_t sendq;
> - int limit;
> + uint32_t sndbuf;
> + uint32_t limit;
>
> if (ioctl(s, SIOCOUTQ, &sendq)) {
> debug_perror("SIOCOUTQ on socket %i, assuming 0", s);
> sendq = 0;
> }
> tcp_get_sndbuf(conn);
> + sndbuf = SNDBUF_GET(conn);
>
> - if ((int)sendq > SNDBUF_GET(conn)) /* Due to memory pressure? */
> - limit = 0;
> - else
> - limit = SNDBUF_GET(conn) - (int)sendq;
> + mem_sl = sizeof(mem);
> + if (getsockopt(s, SOL_SOCKET, SO_MEMINFO, &mem, &mem_sl)) {
If we are already fetching this, we don't need to fetch SO_SNDBUF (same
as SK_MEMINFO_SNDBUF).
> + if (sendq > sndbuf)
> + limit = 0;
> + else
> + limit = sndbuf - sendq;
> + } else {
> + uint32_t sb = mem[SK_MEMINFO_SNDBUF];
> + uint32_t wq = mem[SK_MEMINFO_WMEM_QUEUED];
> +
> + if (wq > sb)
> + limit = 0;
> + else if (!sendq || wq <= sendq)
> + limit = (sb - wq) * 3 / 4;
Note that SNDBUF_GET() is already scaled.
Maybe, actually, that's the problem? Let me try to fix that and see
what happens...
> + else
> + limit = (uint64_t)(sb - wq) *
> + sendq / wq;
> + }
>
> /* If the sender uses mechanisms to prevent Silly Window
> * Syndrome (SWS, described in RFC 813 Section 3) it's critical
> @@ -1168,11 +1187,11 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> * but we won't send enough to fill one because we're stuck
> * with pending data in the outbound queue
> */
> - if (limit < MSS_GET(conn) && sendq &&
> + if (limit < (unsigned int)MSS_GET(conn) && sendq &&
> tinfo->tcpi_last_data_sent < rtt_ms_ceiling * 10)
> limit = 0;
>
> - new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, limit);
> + new_wnd_to_tap = MIN(tinfo->tcpi_snd_wnd, limit);
> }
>
> new_wnd_to_tap = MIN(new_wnd_to_tap, MAX_WINDOW);
> diff --git a/util.c b/util.c
> index 73c9d51..036fac1 100644
> --- a/util.c
> +++ b/util.c
> @@ -1137,3 +1137,4 @@ long clamped_scale(long x, long y, long lo, long hi, long f)
>
> return x - (x * (y - lo) / (hi - lo)) * (100 - f) / 100;
> }
> +
Nit: three unrelated changes.
--
Stefano