From: Stefano Brivio <sbrivio@redhat.com>
To: Jon Maloy
Cc: david@gibson.dropbear.id.au, passt-dev@passt.top
Subject: Re: [PATCH] tcp: Use SO_MEMINFO for accurate send buffer overhead accounting
Date: Wed, 22 Apr 2026 13:51:32 +0200
Message-ID: <20260422135131.5806f34f@elisabeth>
In-Reply-To: <20260422022342.72046-1-jmaloy@redhat.com>
References: <20260422022342.72046-1-jmaloy@redhat.com>
Organization: Red Hat
List-Id: Development discussion and patches for passt

On Tue, 21 Apr 2026 22:23:42 -0400
Jon Maloy wrote:

> The TCP window advertised to the guest/container must balance two
> competing needs: large enough to trigger kernel socket buffer
> auto-tuning, but not so large that sendmsg() partially fails, causing
> retransmissions.
>
> The current approach uses the difference (SNDBUF_GET() - SIOCOUTQ), but
> these values are in reality representing different units: SO_SNDBUF
> includes the buffer overhead (sk_buff head, alignment, skb_shared_info),
> while SIOCOUTQ only returns the actual payload bytes.

They're not really different units, because SNDBUF_GET() already
returns a scaled value trying to take (very roughly) the overhead into
account.

> The clamped_scale
> value of 75% is a rough approximation of this overhead, but it is
> inaccurate: too generous for large buffers, causing retransmissions at
> higher RTTs, and too conservative for small ones, hence inhibiting
> auto-tuning.

It actually works the other way around (we use 100% for small buffers
and gradually go towards 75% for large buffers), and auto-tuning works
pretty well with it.
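As a rough illustration of that kind of scaling, assuming the scaled
value is derived via the clamped_scale() helper quoted at the end of
the patch; the SNDBUF_SMALL and SNDBUF_BIG thresholds below are
made-up placeholders, not the actual parameters behind SNDBUF_GET():

/* Sketch only: scale a raw SO_SNDBUF value so that small buffers are
 * used at ~100% and large ones are gradually reduced towards 75%,
 * following the clamped_scale() formula from util.c. Thresholds are
 * hypothetical placeholders; assumes 64-bit long.
 */
#define SNDBUF_SMALL    (128 * 1024)            /* assumed lower threshold */
#define SNDBUF_BIG      (4 * 1024 * 1024)       /* assumed upper threshold */

static long scaled_sndbuf(long sndbuf)
{
        if (sndbuf <= SNDBUF_SMALL)
                return sndbuf;                  /* 100% for small buffers */
        if (sndbuf >= SNDBUF_BIG)
                return sndbuf * 3 / 4;          /* 75% for large buffers */

        /* clamped_scale(sndbuf, sndbuf, SNDBUF_SMALL, SNDBUF_BIG, 75) */
        return sndbuf - (sndbuf * (sndbuf - SNDBUF_SMALL) /
                         (SNDBUF_BIG - SNDBUF_SMALL)) * (100 - 75) / 100;
}

With those placeholder thresholds, a 256 KiB buffer would still be
advertised at roughly 99% of its size, while anything at or above
4 MiB would be cut to 75%.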
Example before your patch with an iperf3 test with 15 ms RTT:

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[ 5]   0.00-1.00   sec    125 MBytes  1.05 Gbits/sec    0   9.55 MBytes
[ 5]   1.00-2.00   sec    110 MBytes   922 Mbits/sec    0   9.55 MBytes
[ 5]   2.00-3.00   sec    111 MBytes   934 Mbits/sec    0   9.55 MBytes
[ 5]   3.00-4.00   sec    104 MBytes   877 Mbits/sec    0   9.55 MBytes
[ 5]   4.00-5.00   sec    110 MBytes   927 Mbits/sec    0   9.55 MBytes
[ 5]   5.00-6.00   sec    111 MBytes   928 Mbits/sec    0   9.55 MBytes
[ 5]   6.00-7.00   sec    112 MBytes   944 Mbits/sec    0   9.55 MBytes
[ 5]   7.00-8.00   sec    110 MBytes   919 Mbits/sec    0   9.55 MBytes
[ 5]   8.00-9.00   sec    112 MBytes   942 Mbits/sec    0   9.55 MBytes
[ 5]   9.00-10.00  sec    110 MBytes   918 Mbits/sec    0   9.55 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[ 5]   0.00-10.00  sec   1.10 GBytes   941 Mbits/sec    0            sender
[ 5]   0.00-10.02  sec   1.07 GBytes   918 Mbits/sec                 receiver

and after your patch:

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[ 5]   0.00-1.00   sec   2.50 MBytes  21.0 Mbits/sec    0    320 KBytes
[ 5]   1.00-2.00   sec   1.25 MBytes  10.5 Mbits/sec    0    320 KBytes
[ 5]   2.00-3.00   sec   1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[ 5]   3.00-4.00   sec   1.38 MBytes  11.5 Mbits/sec    0    320 KBytes
[ 5]   4.00-5.00   sec   1.25 MBytes  10.5 Mbits/sec    0    320 KBytes
[ 5]   5.00-6.00   sec   1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[ 5]   6.00-7.00   sec   1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[ 5]   7.00-8.00   sec   1.38 MBytes  11.5 Mbits/sec    0    320 KBytes
[ 5]   8.00-9.00   sec   1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[ 5]   9.00-10.00  sec   1.25 MBytes  10.5 Mbits/sec    0    320 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[ 5]   0.00-10.00  sec   16.0 MBytes  13.4 Mbits/sec    0            sender
[ 5]   0.00-10.02  sec   15.1 MBytes  12.7 Mbits/sec                 receiver

It's similar in a test with 285 ms RTT. Before your patch:

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[ 5]   0.00-1.00   sec   1.12 MBytes  9.43 Mbits/sec    0    320 KBytes
[ 5]   1.00-2.00   sec   2.00 MBytes  16.8 Mbits/sec    0   1.17 MBytes
[ 5]   2.00-3.00   sec   0.00 Bytes   0.00 bits/sec    11    660 KBytes
[ 5]   3.00-4.00   sec   3.12 MBytes  26.2 Mbits/sec   11    540 KBytes
[ 5]   4.00-5.00   sec   31.5 MBytes   264 Mbits/sec    0   1.93 MBytes
[ 5]   5.00-6.00   sec   83.9 MBytes   704 Mbits/sec    0   4.10 MBytes
[ 5]   6.00-7.00   sec    112 MBytes   941 Mbits/sec    0   7.38 MBytes
[ 5]   7.00-8.00   sec    126 MBytes  1.06 Gbits/sec    0   11.9 MBytes
[ 5]   8.00-9.00   sec    114 MBytes   952 Mbits/sec    0   11.9 MBytes
[ 5]   9.00-10.00  sec    110 MBytes   925 Mbits/sec    0   11.9 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[ 5]   0.00-10.00  sec    584 MBytes   490 Mbits/sec   22            sender
[ 5]   0.00-10.31  sec    548 MBytes   445 Mbits/sec                 receiver

and after your patch:

[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[ 5]   0.00-1.00   sec   2.50 MBytes  21.0 Mbits/sec    0    320 KBytes
[ 5]   1.00-2.00   sec   1.25 MBytes  10.5 Mbits/sec    0    320 KBytes
[ 5]   2.00-3.00   sec   1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[ 5]   3.00-4.00   sec   1.38 MBytes  11.5 Mbits/sec    0    320 KBytes
[ 5]   4.00-5.00   sec   1.25 MBytes  10.5 Mbits/sec    0    320 KBytes
[ 5]   5.00-6.00   sec   1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[ 5]   6.00-7.00   sec   1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[ 5]   7.00-8.00   sec   1.38 MBytes  11.5 Mbits/sec    0    320 KBytes
[ 5]   8.00-9.00   sec   1.75 MBytes  14.7 Mbits/sec    0    320 KBytes
[ 5]   9.00-10.00  sec   1.25 MBytes  10.5 Mbits/sec    0    320 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[ 5]   0.00-10.00  sec   16.0 MBytes  13.4 Mbits/sec    0            sender
[ 5]   0.00-10.02  sec   15.1 MBytes  12.7 Mbits/sec                 receiver

> We now introduce the use of SO_MEMINFO to obtain SK_MEMINFO_SNDBUF and
> SK_MEMINFO_WMEM_QUEUED from the kernel. Those are both presented in
> the kernel's own accounting units, i.e. including the per-skb overhead,
> and match exactly what the kernel's own sk_stream_memory_free()
> function is using.

Using SK_MEMINFO_WMEM_QUEUED might be helpful, because that actually
tells us the memory used for pending outgoing segments *as currently
stored*, but SK_MEMINFO_SNDBUF returns the same value as we get via the
SO_SNDBUF socket option (in the kernel, that's sk->sk_sndbuf), so we're
still missing the information of how much overhead we'll have for data
*we haven't written yet*.

That is, the approach we were considering so far was something like:

- divide the value returned via SIOCOUTQ into MSS-sized segments,
  calculate and add overhead for each of those. This shouldn't be
  needed if we use SK_MEMINFO_WMEM_QUEUED (while I wanted to avoid
  SO_MEMINFO because it's a rather large copy_to_user(), maybe it's
  actually fine)

- divide the remaining space into MSS-sized segments, calculate
  overhead for each of them... and this is the tricky part that you're
  approximating here.

> When we combine the above with the payload bytes indicated by SIOCOUTQ,
> the observed overhead ratio self-calibrates to whatever gso_segs, cache
> line size, and sk_buff layout the kernel may use, and is even
> architecture agnostic.

I hope we can get something like this to work, but this patch, as it
is, applies a linear factor to a non-linear overhead, which *I think*
is what results in the underestimation of the available buffer size
that's visible from tests.

> When data is queued and the overhead ratio is observable
> (wmem_queued > sendq), the available payload window is calculated as:
> (sk_sndbuf - wmem_queued) * sendq / wmem_queued

I still think we should try a bit harder to accurately reverse the
calculation done by the kernel, including gso_segs. If we can't, a
variation I would try on this patch is to consider segments as discrete
quantities, because that should be slightly more accurate.

Example: this patch right now would calculate, say:

  ( 200000 - 87500 (50 segments) ) * 73000 (those 50 segments) / 87500

and give us an 83.428% payload factor which we apply flat over those
112500 bytes remaining, giving us 93857 bytes of window.

Instead, I think we could do this:

* 87500 bytes queued for 50 segments of 1460 payload bytes each
  -> 290 bytes of overhead per segment
* 112500 bytes left: 64 segments of 1750 bytes fit
* advertise 64 * 1460 = 93440 bytes

A rough sketch of this is below, after the rest of the quoted commit
message.

> When the ratio cannot be observed, e.g. because the queue is empty or
> we are in a transient state, we fall back to 75% of remaining buffer
> capacity, like before.

That's not generally the case. We use between 75% and 100%.

> If SO_MEMINFO is unavailable, we fall back to the pre-existing
> SNDBUF_GET() - SIOCOUTQ calculation.
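Something like the following rough, untested sketch is what I mean by
treating segments as discrete quantities above (function and variable
names are made up, not taken from the patch; DIV_ROUND_UP() as in
util.h):

/* Sketch of the discrete-segment idea: derive the per-segment
 * overhead from what is already queued, then advertise only as many
 * whole MSS-sized segments as still fit into the remaining buffer.
 * sb = SK_MEMINFO_SNDBUF, wq = SK_MEMINFO_WMEM_QUEUED,
 * sendq = SIOCOUTQ payload bytes, mss = MSS_GET(conn).
 */
static uint32_t wnd_from_segments(uint32_t sb, uint32_t wq,
                                  uint32_t sendq, uint32_t mss)
{
        uint32_t nsegs, overhead, seg_size, fit;

        if (wq >= sb)                           /* buffer already full */
                return 0;

        if (!sendq || wq <= sendq)              /* ratio not observable */
                return (sb - wq) * 3 / 4;       /* fall back like the patch */

        nsegs = DIV_ROUND_UP(sendq, mss);       /* queued payload, in segments */
        overhead = (wq - sendq) / nsegs;        /* observed overhead per segment */
        seg_size = mss + overhead;              /* kernel bytes per full segment */

        fit = (sb - wq) / seg_size;             /* whole segments that still fit */
        return fit * mss;                       /* advertise payload bytes only */
}

With the numbers from the example above (sb = 200000, wq = 87500,
sendq = 73000, mss = 1460), this gives 64 * 1460 = 93440 bytes.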
>
> Link: https://bugs.passt.top/show_bug.cgi?id=138
> Signed-off-by: Jon Maloy
> ---
>  tcp.c  | 33 ++++++++++++++++++++++++++-------
>  util.c |  1 +
>  2 files changed, 27 insertions(+), 7 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 43b8fdb..3b47a3b 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -295,6 +295,7 @@
>  #include <...>
>
>  #include <...>
> +#include <linux/sock_diag.h>
>
>  #include "checksum.h"
>  #include "util.h"
> @@ -1128,19 +1129,37 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
>  		new_wnd_to_tap = tinfo->tcpi_snd_wnd;
>  	} else {
>  		unsigned rtt_ms_ceiling = DIV_ROUND_UP(tinfo->tcpi_rtt, 1000);
> +		uint32_t mem[SK_MEMINFO_VARS];
> +		socklen_t mem_sl;
>  		uint32_t sendq;
> -		int limit;
> +		uint32_t sndbuf;
> +		uint32_t limit;
>
>  		if (ioctl(s, SIOCOUTQ, &sendq)) {
>  			debug_perror("SIOCOUTQ on socket %i, assuming 0", s);
>  			sendq = 0;
>  		}
>  		tcp_get_sndbuf(conn);
> +		sndbuf = SNDBUF_GET(conn);
>
> -		if ((int)sendq > SNDBUF_GET(conn)) /* Due to memory pressure? */
> -			limit = 0;
> -		else
> -			limit = SNDBUF_GET(conn) - (int)sendq;
> +		mem_sl = sizeof(mem);
> +		if (getsockopt(s, SOL_SOCKET, SO_MEMINFO, &mem, &mem_sl)) {

If we are already fetching this, we don't need to fetch SO_SNDBUF (same
as SK_MEMINFO_SNDBUF).

> +			if (sendq > sndbuf)
> +				limit = 0;
> +			else
> +				limit = sndbuf - sendq;
> +		} else {
> +			uint32_t sb = mem[SK_MEMINFO_SNDBUF];
> +			uint32_t wq = mem[SK_MEMINFO_WMEM_QUEUED];
> +
> +			if (wq > sb)
> +				limit = 0;
> +			else if (!sendq || wq <= sendq)
> +				limit = (sb - wq) * 3 / 4;

Note that SNDBUF_GET() is already scaled. Maybe, actually, that's the
problem? Let me try to fix that and see what happens...

> +			else
> +				limit = (uint64_t)(sb - wq) *
> +					sendq / wq;
> +		}
>
>  		/* If the sender uses mechanisms to prevent Silly Window
>  		 * Syndrome (SWS, described in RFC 813 Section 3) it's critical
> @@ -1168,11 +1187,11 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
>  		 * but we won't send enough to fill one because we're stuck
>  		 * with pending data in the outbound queue
>  		 */
> -		if (limit < MSS_GET(conn) && sendq &&
> +		if (limit < (unsigned int)MSS_GET(conn) && sendq &&
>  		    tinfo->tcpi_last_data_sent < rtt_ms_ceiling * 10)
>  			limit = 0;
>
> -		new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, limit);
> +		new_wnd_to_tap = MIN(tinfo->tcpi_snd_wnd, limit);
>  	}
>
>  	new_wnd_to_tap = MIN(new_wnd_to_tap, MAX_WINDOW);
> diff --git a/util.c b/util.c
> index 73c9d51..036fac1 100644
> --- a/util.c
> +++ b/util.c
> @@ -1137,3 +1137,4 @@ long clamped_scale(long x, long y, long lo, long hi, long f)
>
>  	return x - (x * (y - lo) / (hi - lo)) * (100 - f) / 100;
>  }
> +

Nit: three unrelated changes.

-- 
Stefano