From: Stefano Brivio
To: Jon Maloy
Cc: david@gibson.dropbear.id.au, passt-dev@passt.top, Yumei Huang
Subject: Re: [PATCH v3] tcp: Use SO_MEMINFO for accurate send buffer overhead accounting
Message-ID: <20260507114842.4f4c85b6@elisabeth>
In-Reply-To: <20260425195818.572409-1-jmaloy@redhat.com>
References: <20260425195818.572409-1-jmaloy@redhat.com>
Organization: Red Hat
Date: Thu, 07 May 2026 11:48:44 +0200 (CEST)

On Sat, 25 Apr 2026 15:58:18 -0400
Jon Maloy wrote:

> The TCP window advertised to the guest/container must balance two
> competing needs: large enough to trigger kernel socket buffer
> auto-tuning, but not so large that sendmsg() partially fails, causing
> retransmissions.
>
> The current approach uses the difference (SNDBUF_GET() - SIOCOUTQ), but
> SNDBUF_GET() returns a scaled value that only roughly accounts for
> per-skb overhead. The clamped_scale approximation doesn't accurately
> track the actual per-segment overhead, which can lead to both excessive
> retransmissions and reduced throughput.
>
> We now introduce the use of SO_MEMINFO to obtain SK_MEMINFO_SNDBUF and
> SK_MEMINFO_WMEM_QUEUED from the kernel. The latter is presented in the
> kernel's own accounting units, i.e. including the sk_buff overhead,
> and matches exactly what the kernel's own sk_stream_memory_free()
> function is using.
>
> When data is queued and the overhead ratio is observable, we calculate
> the per-segment overhead as (wmem_queued - sendq) / num_segments, then
> determine how many additional segments should fit in the remaining
> buffer space, considering the calculated per-mss overhead. This approach
> treats segments as discrete quantities, and produces a more accurate
> estimate of available buffer space than a linear scaling factor does.
>
> When the ratio cannot be observed, e.g. because the queue is empty or
> we are in a transient state, we fall back to the existing clamped_scale
> calculation (scaling between 100% and 75% of buffer capacity).
>
> When SO_MEMINFO succeeds, we also use SK_MEMINFO_SNDBUF directly to
> set SNDBUF, avoiding a separate SO_SNDBUF getsockopt() call.
>
> If SO_MEMINFO is unavailable, we fall back to the pre-existing
> SNDBUF_GET() - SIOCOUTQ calculation.
>
> Link: https://bugs.passt.top/show_bug.cgi?id=138
> Link: https://github.com/containers/podman/issues/28219
> Signed-off-by: Jon Maloy

I finally tested this in a low (but not negligible) RTT setup (~200 to
~500 µs) and it looks extremely reliable there as well.

I asked the reporter of https://github.com/containers/podman/issues/28219
to also test this, but I think we can start merging this meanwhile.

Applied, with an additional tag:

Analysed-by: Yumei Huang

as the analysis / tests behind this approach partially came from Yumei.

> ---
>
> v2: Updated according to feedback from Stefano.
Segment-based discrete
> overhead calculation instead of linear ratio.
>
> v3: Addressed Stefano's v2 feedback:
> - Extracted window calculation into tcp_wnd_from_sndbuf()
> - Use wmem_queued instead of SIOCOUTQ for fallback and SWS check
> ---
>  tcp.c      | 137 ++++++++++++++++++++++++++++++++++-------------------
>  tcp_conn.h |   2 +-
>  2 files changed, 89 insertions(+), 50 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 43b8fdb..61160cf 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -295,6 +295,7 @@
>  #include
>  
>  #include
> +#include
>  
>  #include "checksum.h"
>  #include "util.h"
> @@ -1017,6 +1018,90 @@ size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
>  	return MAX(l3len + sizeof(struct ethhdr), ETH_ZLEN);
>  }
>  
> +/**
> + * tcp_wnd_from_sndbuf() - Calculate window from available send buffer space
> + * @s:		Socket file descriptor
> + * @conn:	Connection pointer
> + * @tinfo:	tcp_info from kernel
> + *
> + * Return: window value to advertise, not scaled
> + */
> +static uint32_t tcp_wnd_from_sndbuf(int s, struct tcp_tap_conn *conn,
> +				    const struct tcp_info_linux *tinfo)
> +{
> +	uint32_t rtt_ms_ceiling = DIV_ROUND_UP(tinfo->tcpi_rtt, 1000);
> +	uint32_t mem[SK_MEMINFO_VARS];
> +	socklen_t mem_sl = sizeof(mem);
> +	int mss = MSS_GET(conn);
> +	uint32_t limit, sendq;
> +
> +	if (ioctl(s, SIOCOUTQ, &sendq)) {
> +		debug_perror("SIOCOUTQ on socket %i, assuming 0", s);
> +		sendq = 0;
> +	}
> +
> +	if (getsockopt(s, SOL_SOCKET, SO_MEMINFO, &mem, &mem_sl)) {
> +		tcp_get_sndbuf(conn);
> +
> +		if (sendq > SNDBUF_GET(conn)) /* Due to memory pressure?
 */
> +			limit = 0;
> +		else
> +			limit = SNDBUF_GET(conn) - sendq;
> +	} else {
> +		uint32_t sndbuf = mem[SK_MEMINFO_SNDBUF];
> +		uint32_t wmemq = mem[SK_MEMINFO_WMEM_QUEUED];
> +		uint32_t scaled = clamped_scale(sndbuf, sndbuf, SNDBUF_SMALL,
> +						SNDBUF_BIG, 75);
> +
> +		SNDBUF_SET(conn, MIN(INT_MAX, scaled));
> +
> +		if (wmemq > sndbuf) {
> +			limit = 0;
> +		} else if (!sendq || !mss || wmemq <= sendq) {
> +			limit = SNDBUF_GET(conn) - wmemq;
> +		} else {
> +			uint32_t used_segs = MAX(sendq / mss, 1);
> +			uint32_t overhead = (wmemq - sendq) / used_segs;
> +			uint32_t remaining = sndbuf - wmemq;
> +			uint32_t avail_segs = remaining / (mss + overhead);
> +
> +			limit = avail_segs * mss;
> +		}
> +	}
> +
> +	/* If the sender uses mechanisms to prevent Silly Window
> +	 * Syndrome (SWS, described in RFC 813 Section 3) it's critical
> +	 * that, should the window ever become less than the MSS, we
> +	 * advertise a new value once it increases again to be above it.
> +	 *
> +	 * The mechanism to avoid SWS in the kernel is, implicitly,
> +	 * implemented by Nagle's algorithm (which was proposed after
> +	 * RFC 813).
> +	 *
> +	 * To this end, for simplicity, approximate a window value below
> +	 * the MSS to zero, as we already have mechanisms in place to
> +	 * force updates after the window becomes zero. This matches the
> +	 * suggestion from RFC 813, Section 4.
> +	 *
> +	 * But don't do this if, either:
> +	 *
> +	 * - there's nothing in the outbound queue: the size of the
> +	 *   sending buffer is limiting us, and it won't increase if we
> +	 *   don't send data, so there's no point in waiting, or
> +	 *
> +	 * - we haven't sent data in a while (somewhat arbitrarily, ten
> +	 *   times the RTT), as that might indicate that the receiver
> +	 *   will only process data in batches that are large enough,
> +	 *   but we won't send enough to fill one because we're stuck
> +	 *   with pending data in the outbound queue
> +	 */
> +	if (limit < (uint32_t)MSS_GET(conn) && sendq &&
> +	    tinfo->tcpi_last_data_sent < rtt_ms_ceiling * 10)
> +		limit = 0;
> +
> +	return MIN(tinfo->tcpi_snd_wnd, limit);
> +}
> +
>  /**
>   * tcp_update_seqack_wnd() - Update ACK sequence and window to guest/tap
>   * @c:		Execution context
> @@ -1124,56 +1209,10 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
>  		}
>  	}
>  
> -	if ((conn->flags & LOCAL) || tcp_rtt_dst_low(conn)) {
> +	if ((conn->flags & LOCAL) || tcp_rtt_dst_low(conn))
>  		new_wnd_to_tap = tinfo->tcpi_snd_wnd;
> -	} else {
> -		unsigned rtt_ms_ceiling = DIV_ROUND_UP(tinfo->tcpi_rtt, 1000);
> -		uint32_t sendq;
> -		int limit;
> -
> -		if (ioctl(s, SIOCOUTQ, &sendq)) {
> -			debug_perror("SIOCOUTQ on socket %i, assuming 0", s);
> -			sendq = 0;
> -		}
> -		tcp_get_sndbuf(conn);
> -
> -		if ((int)sendq > SNDBUF_GET(conn)) /* Due to memory pressure? */
> -			limit = 0;
> -		else
> -			limit = SNDBUF_GET(conn) - (int)sendq;
> -
> -		/* If the sender uses mechanisms to prevent Silly Window
> -		 * Syndrome (SWS, described in RFC 813 Section 3) it's critical
> -		 * that, should the window ever become less than the MSS, we
> -		 * advertise a new value once it increases again to be above it.
> -		 *
> -		 * The mechanism to avoid SWS in the kernel is, implicitly,
> -		 * implemented by Nagle's algorithm (which was proposed after
> -		 * RFC 813).
> -		 *
> -		 * To this end, for simplicity, approximate a window value below
> -		 * the MSS to zero, as we already have mechanisms in place to
> -		 * force updates after the window becomes zero. This matches the
> -		 * suggestion from RFC 813, Section 4.
> -		 *
> -		 * But don't do this if, either:
> -		 *
> -		 * - there's nothing in the outbound queue: the size of the
> -		 *   sending buffer is limiting us, and it won't increase if we
> -		 *   don't send data, so there's no point in waiting, or
> -		 *
> -		 * - we haven't sent data in a while (somewhat arbitrarily, ten
> -		 *   times the RTT), as that might indicate that the receiver
> -		 *   will only process data in batches that are large enough,
> -		 *   but we won't send enough to fill one because we're stuck
> -		 *   with pending data in the outbound queue
> -		 */
> -		if (limit < MSS_GET(conn) && sendq &&
> -		    tinfo->tcpi_last_data_sent < rtt_ms_ceiling * 10)
> -			limit = 0;
> -
> -		new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, limit);
> -	}
> +	else
> +		new_wnd_to_tap = tcp_wnd_from_sndbuf(s, conn, tinfo);
>  
>  	new_wnd_to_tap = MIN(new_wnd_to_tap, MAX_WINDOW);
>  	if (!(conn->events & ESTABLISHED))
> diff --git a/tcp_conn.h b/tcp_conn.h
> index 6985426..9f5bee0 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -98,7 +98,7 @@ struct tcp_tap_conn {
>  #define SNDBUF_BITS		24
>  	unsigned int	sndbuf		:SNDBUF_BITS;
>  #define SNDBUF_SET(conn, bytes)	(conn->sndbuf = ((bytes) >> (32 - SNDBUF_BITS)))
> -#define SNDBUF_GET(conn)	(conn->sndbuf << (32 - SNDBUF_BITS))
> +#define SNDBUF_GET(conn)	((uint32_t)(conn->sndbuf << (32 - SNDBUF_BITS)))
>  
>  	uint8_t		seq_dup_ack_approx;
>  

-- 
Stefano