From: Stefano Brivio
To: Jon Maloy
Cc: david@gibson.dropbear.id.au, passt-dev@passt.top, Yumei Huang
Subject: Re: [PATCH v3] tcp: Use SO_MEMINFO for accurate send buffer overhead accounting
Message-ID: <20260507114842.4f4c85b6@elisabeth>
In-Reply-To: <20260425195818.572409-1-jmaloy@redhat.com>
References: <20260425195818.572409-1-jmaloy@redhat.com>
Organization: Red Hat
Date: Thu, 07 May 2026 11:48:44 +0200 (CEST)

On Sat, 25 Apr 2026 15:58:18 -0400
Jon Maloy wrote:

> The TCP window advertised to the guest/container must balance two
> competing needs: large enough to trigger kernel socket buffer
> auto-tuning, but not so large that sendmsg() partially fails, causing
> retransmissions.
>
> The current approach uses the difference (SNDBUF_GET() - SIOCOUTQ), but
> SNDBUF_GET() returns a scaled value that only roughly accounts for
> per-skb overhead. The clamped_scale approximation doesn't accurately
> track the actual per-segment overhead, which can lead to both excessive
> retransmissions and reduced throughput.
>
> We now introduce the use of SO_MEMINFO to obtain SK_MEMINFO_SNDBUF and
> SK_MEMINFO_WMEM_QUEUED from the kernel. The latter is presented in the
> kernel's own accounting units, i.e. including the sk_buff overhead,
> and matches exactly what the kernel's own sk_stream_memory_free()
> function is using.
>
> When data is queued and the overhead ratio is observable, we calculate
> the per-segment overhead as (wmem_queued - sendq) / num_segments, then
> determine how many additional segments should fit in the remaining
> buffer space, considering the calculated per-mss overhead. This approach
> treats segments as discrete quantities, and produces a more accurate
> estimate of available buffer space than a linear scaling factor does.
>
> When the ratio cannot be observed, e.g. because the queue is empty or
> we are in a transient state, we fall back to the existing clamped_scale
> calculation (scaling between 100% and 75% of buffer capacity).
>
> When SO_MEMINFO succeeds, we also use SK_MEMINFO_SNDBUF directly to
> set SNDBUF, avoiding a separate SO_SNDBUF getsockopt() call.
>
> If SO_MEMINFO is unavailable, we fall back to the pre-existing
> SNDBUF_GET() - SIOCOUTQ calculation.
>
> Link: https://bugs.passt.top/show_bug.cgi?id=138
> Link: https://github.com/containers/podman/issues/28219
> Signed-off-by: Jon Maloy

I finally tested this in a low (but not negligible) RTT setup (~200 to
~500 µs) and it looks extremely reliable there as well.

I asked the reporter of https://github.com/containers/podman/issues/28219
to also test this, but I think we can start merging this meanwhile.

Applied, with an additional tag:

Analysed-by: Yumei Huang

as the analysis / tests behind this approach partially came from Yumei.

> ---
>
> v2: Updated according to feedback from Stefano.
Segment-based discrete
> overhead calculation instead of linear ratio.
>
> v3: Addressed Stefano's v2 feedback:
> - Extracted window calculation into tcp_wnd_from_sndbuf()
> - Use wmem_queued instead of SIOCOUTQ for fallback and SWS check
> ---
>  tcp.c      | 137 ++++++++++++++++++++++++++++++++++-------------------
>  tcp_conn.h |   2 +-
>  2 files changed, 89 insertions(+), 50 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 43b8fdb..61160cf 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -295,6 +295,7 @@
>  #include
>  
>  #include
> +#include
>  
>  #include "checksum.h"
>  #include "util.h"
> @@ -1017,6 +1018,90 @@ size_t tcp_fill_headers(const struct ctx *c, struct tcp_tap_conn *conn,
>  	return MAX(l3len + sizeof(struct ethhdr), ETH_ZLEN);
>  }
>  
> +/**
> + * tcp_wnd_from_sndbuf() - Calculate window from available send buffer space
> + * @s:		Socket file descriptor
> + * @conn:	Connection pointer
> + * @tinfo:	tcp_info from kernel
> + *
> + * Return: window value to advertise, not scaled
> + */
> +static uint32_t tcp_wnd_from_sndbuf(int s, struct tcp_tap_conn *conn,
> +				    const struct tcp_info_linux *tinfo)
> +{
> +	uint32_t rtt_ms_ceiling = DIV_ROUND_UP(tinfo->tcpi_rtt, 1000);
> +	uint32_t mem[SK_MEMINFO_VARS];
> +	socklen_t mem_sl = sizeof(mem);
> +	int mss = MSS_GET(conn);
> +	uint32_t limit, sendq;
> +
> +	if (ioctl(s, SIOCOUTQ, &sendq)) {
> +		debug_perror("SIOCOUTQ on socket %i, assuming 0", s);
> +		sendq = 0;
> +	}
> +
> +	if (getsockopt(s, SOL_SOCKET, SO_MEMINFO, &mem, &mem_sl)) {
> +		tcp_get_sndbuf(conn);
> +
> +		if (sendq > SNDBUF_GET(conn)) /* Due to memory pressure?
 */
> +			limit = 0;
> +		else
> +			limit = SNDBUF_GET(conn) - sendq;
> +	} else {
> +		uint32_t sndbuf = mem[SK_MEMINFO_SNDBUF];
> +		uint32_t wmemq = mem[SK_MEMINFO_WMEM_QUEUED];
> +		uint32_t scaled = clamped_scale(sndbuf, sndbuf, SNDBUF_SMALL,
> +						SNDBUF_BIG, 75);
> +
> +		SNDBUF_SET(conn, MIN(INT_MAX, scaled));
> +
> +		if (wmemq > sndbuf) {
> +			limit = 0;
> +		} else if (!sendq || !mss || wmemq <= sendq) {
> +			limit = SNDBUF_GET(conn) - wmemq;
> +		} else {
> +			uint32_t used_segs = MAX(sendq / mss, 1);
> +			uint32_t overhead = (wmemq - sendq) / used_segs;
> +			uint32_t remaining = sndbuf - wmemq;
> +			uint32_t avail_segs = remaining / (mss + overhead);
> +
> +			limit = avail_segs * mss;
> +		}
> +	}
> +
> +	/* If the sender uses mechanisms to prevent Silly Window
> +	 * Syndrome (SWS, described in RFC 813 Section 3) it's critical
> +	 * that, should the window ever become less than the MSS, we
> +	 * advertise a new value once it increases again to be above it.
> +	 *
> +	 * The mechanism to avoid SWS in the kernel is, implicitly,
> +	 * implemented by Nagle's algorithm (which was proposed after
> +	 * RFC 813).
> +	 *
> +	 * To this end, for simplicity, approximate a window value below
> +	 * the MSS to zero, as we already have mechanisms in place to
> +	 * force updates after the window becomes zero. This matches the
> +	 * suggestion from RFC 813, Section 4.
> +	 *
> +	 * But don't do this if, either:
> +	 *
> +	 * - there's nothing in the outbound queue: the size of the
> +	 *   sending buffer is limiting us, and it won't increase if we
> +	 *   don't send data, so there's no point in waiting, or
> +	 *
> +	 * - we haven't sent data in a while (somewhat arbitrarily, ten
> +	 *   times the RTT), as that might indicate that the receiver
> +	 *   will only process data in batches that are large enough,
> +	 *   but we won't send enough to fill one because we're stuck
> +	 *   with pending data in the outbound queue
> +	 */
> +	if (limit < (uint32_t)MSS_GET(conn) && sendq &&
> +	    tinfo->tcpi_last_data_sent < rtt_ms_ceiling * 10)
> +		limit = 0;
> +
> +	return MIN(tinfo->tcpi_snd_wnd, limit);
> +}
> +
>  /**
>   * tcp_update_seqack_wnd() - Update ACK sequence and window to guest/tap
>   * @c:		Execution context
> @@ -1124,56 +1209,10 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
>  		}
>  	}
>  
> -	if ((conn->flags & LOCAL) || tcp_rtt_dst_low(conn)) {
> +	if ((conn->flags & LOCAL) || tcp_rtt_dst_low(conn))
>  		new_wnd_to_tap = tinfo->tcpi_snd_wnd;
> -	} else {
> -		unsigned rtt_ms_ceiling = DIV_ROUND_UP(tinfo->tcpi_rtt, 1000);
> -		uint32_t sendq;
> -		int limit;
> -
> -		if (ioctl(s, SIOCOUTQ, &sendq)) {
> -			debug_perror("SIOCOUTQ on socket %i, assuming 0", s);
> -			sendq = 0;
> -		}
> -		tcp_get_sndbuf(conn);
> -
> -		if ((int)sendq > SNDBUF_GET(conn)) /* Due to memory pressure? */
> -			limit = 0;
> -		else
> -			limit = SNDBUF_GET(conn) - (int)sendq;
> -
> -		/* If the sender uses mechanisms to prevent Silly Window
> -		 * Syndrome (SWS, described in RFC 813 Section 3) it's critical
> -		 * that, should the window ever become less than the MSS, we
> -		 * advertise a new value once it increases again to be above it.
> -		 *
> -		 * The mechanism to avoid SWS in the kernel is, implicitly,
> -		 * implemented by Nagle's algorithm (which was proposed after
> -		 * RFC 813).
> -		 *
> -		 * To this end, for simplicity, approximate a window value below
> -		 * the MSS to zero, as we already have mechanisms in place to
> -		 * force updates after the window becomes zero. This matches the
> -		 * suggestion from RFC 813, Section 4.
> -		 *
> -		 * But don't do this if, either:
> -		 *
> -		 * - there's nothing in the outbound queue: the size of the
> -		 *   sending buffer is limiting us, and it won't increase if we
> -		 *   don't send data, so there's no point in waiting, or
> -		 *
> -		 * - we haven't sent data in a while (somewhat arbitrarily, ten
> -		 *   times the RTT), as that might indicate that the receiver
> -		 *   will only process data in batches that are large enough,
> -		 *   but we won't send enough to fill one because we're stuck
> -		 *   with pending data in the outbound queue
> -		 */
> -		if (limit < MSS_GET(conn) && sendq &&
> -		    tinfo->tcpi_last_data_sent < rtt_ms_ceiling * 10)
> -			limit = 0;
> -
> -		new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, limit);
> -	}
> +	else
> +		new_wnd_to_tap = tcp_wnd_from_sndbuf(s, conn, tinfo);
>  
>  	new_wnd_to_tap = MIN(new_wnd_to_tap, MAX_WINDOW);
>  	if (!(conn->events & ESTABLISHED))
> diff --git a/tcp_conn.h b/tcp_conn.h
> index 6985426..9f5bee0 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -98,7 +98,7 @@ struct tcp_tap_conn {
>  #define SNDBUF_BITS		24
>  	unsigned int	sndbuf		:SNDBUF_BITS;
>  #define SNDBUF_SET(conn, bytes)	(conn->sndbuf = ((bytes) >> (32 - SNDBUF_BITS)))
> -#define SNDBUF_GET(conn)	(conn->sndbuf << (32 - SNDBUF_BITS))
> +#define SNDBUF_GET(conn)	((uint32_t)(conn->sndbuf << (32 - SNDBUF_BITS)))
>  
>  	uint8_t		seq_dup_ack_approx;
>  

-- 
Stefano