From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 15 Dec 2025 13:02:07 +1100
From: David Gibson
To: Stefano Brivio
Cc: passt-dev@passt.top, Max Chernoff, Tyler Cloud
Subject: Re: [PATCH] tcp: Use less-than-MSS window on no queued data, or no data sent recently
References: <20251213142540.1319527-1-sbrivio@redhat.com>
In-Reply-To: <20251213142540.1319527-1-sbrivio@redhat.com>
List-Id: Development discussion and patches for passt

On Sat, Dec 13, 2025 at 03:25:40PM +0100, Stefano Brivio wrote:
> We limit the advertised window to guests and containers to the
> available length of the sending buffer, and if it's less than the MSS,
> since commit cf1925fb7b77 ("tcp: Don't limit window to less-than-MSS
> values, use zero instead"), we approximate that limit to zero.
>
> This way, we'll trigger a window update as soon as we realise that we
> can advertise a larger value, just like we do in all other cases where
> we advertise a zero-sized window.
>
> By doing that, we don't wait for the peer to send us data before we
> update the window. This matters because the guest or container might
> be trying to aggregate more data and won't send us anything at all if
> the advertised window is too small.
>
> However, this might be problematic in two situations:
>
> 1. one, reported by Tyler, where the remote (receiving) peer
>    advertises a window that's smaller than what we usually get and
>    very close to the MSS, causing the kernel to give us a starting
>    size of the buffer that's less than the MSS we advertise to the
>    guest or container.
>
>    If this happens, we'll never advertise a non-zero window after
>    the handshake, and the container or guest will never send us any
>    data at all.
>
>    With a simple 'curl https://cloudflare.com/', we get, with default
>    TCP memory parameters, a 65535-byte window from the peer, and 46080
>    bytes of initial sending buffer from the kernel. But we advertised
>    a 65480-byte MSS, and we'll never actually receive the client
>    request.
>
>    This seems to be specific to Cloudflare for some reason, probably
>    deriving from a particular tuning of TCP parameters on their
>    servers.
>
> 2. another one, hypothesised by David, where the peer might only be
>    willing to process (and acknowledge) data in batches.
>
>    We might have queued outbound data which is, at the same time, not
>    enough to fill one of these batches and be acknowledged and removed
>    from the sending queue, but enough to make our available buffer
>    smaller than the MSS, and the connection will hang.
>
> Take care of both cases by:
>
> a. not approximating the sending buffer to zero if we have no outbound
>    queued data at all, because in that case we don't expect the
>    available buffer to increase if we don't send any data, so there's
>    no point in waiting for it to grow larger than the MSS.
>
>    This fixes problem 1. above.
>
> b. also using the full sending buffer size if we haven't sent data to
>    the socket for a while (reported by tcpi_last_data_sent). This part
>    was already suggested by David in:
>
>      https://archives.passt.top/passt-dev/aTZzgtcKWLb28zrf@zatzit/
>
>    and I'm now picking ten times the RTT as a somewhat arbitrary
>    threshold.
>
>    This is meant to take care of potential problem 2. above, but it
>    also happens to fix 1.
>
> Reported-by: Tyler Cloud
> Link: https://bugs.passt.top/show_bug.cgi?id=183
> Suggested-by: David Gibson
> Signed-off-by: Stefano Brivio

Reviewed-by: David Gibson

> ---
>  tcp.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/tcp.c b/tcp.c
> index 81bc114..b179e39 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1211,8 +1211,21 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
>  	 * the MSS to zero, as we already have mechanisms in place to
>  	 * force updates after the window becomes zero. This matches the
>  	 * suggestion from RFC 813, Section 4.
> +	 *
> +	 * But don't do this if, either:
> +	 *
> +	 * - there's nothing in the outbound queue: the size of the
> +	 *   sending buffer is limiting us, and it won't increase if we
> +	 *   don't send data, so there's no point in waiting, or
> +	 *
> +	 * - we haven't sent data in a while (somewhat arbitrarily, ten
> +	 *   times the RTT), as that might indicate that the receiver
> +	 *   will only process data in batches that are large enough,
> +	 *   but we won't send enough to fill one because we're stuck
> +	 *   with pending data in the outbound queue
>  	 */
> -	if (limit < MSS_GET(conn))
> +	if (limit < MSS_GET(conn) && sendq &&
> +	    tinfo->tcpi_last_data_sent < tinfo->tcpi_rtt / 1000 * 10)
>  		limit = 0;
>  
>  	new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, limit);
> -- 
> 2.43.0
>

-- 
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
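
For illustration only, and not part of the patch: a minimal standalone
sketch of the clamping condition from the hunk above, so the two
exceptions can be checked against the numbers in the commit message.
The names clamp_limit() and clamp_limit_selfcheck() and the flat
parameters are hypothetical; the real code takes these values from the
connection and from struct tcp_info. The sketch assumes tcpi_rtt is in
microseconds and tcpi_last_data_sent in milliseconds, as Linux reports
them.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the condition in the hunk above.
 *
 * limit:             available sending buffer, in bytes
 * mss:               MSS advertised to the guest or container, in bytes
 * sendq:             bytes currently queued outbound on the socket
 * last_data_sent_ms: time since data was last sent (assumed milliseconds)
 * rtt_us:            smoothed RTT (assumed microseconds)
 *
 * Returns 0 if a less-than-MSS limit should be approximated to zero,
 * otherwise returns limit unchanged.
 */
static int clamp_limit(int limit, int mss, int sendq,
		       uint32_t last_data_sent_ms, uint32_t rtt_us)
{
	/* Threshold: ten times the RTT, converted to milliseconds */
	uint32_t threshold_ms = rtt_us / 1000 * 10;

	if (limit < mss && sendq && last_data_sent_ms < threshold_ms)
		return 0;

	return limit;
}

/* Usage example, with the buffer and MSS sizes from the commit message */
static void clamp_limit_selfcheck(void)
{
	/* Problem 1: nothing queued, so the small buffer is advertised */
	assert(clamp_limit(46080, 65480, 0, 0, 20000) == 46080);

	/* Data queued and sent 5 ms ago, RTT 20 ms: approximate to zero */
	assert(clamp_limit(46080, 65480, 1000, 5, 20000) == 0);

	/* Problem 2: data queued, but idle for 500 ms (>= 200 ms) */
	assert(clamp_limit(46080, 65480, 1000, 500, 20000) == 46080);

	/* Limit at or above the MSS is never touched */
	assert(clamp_limit(70000, 65480, 1000, 5, 20000) == 70000);
}
```

One property of the integer arithmetic in this sketch: with an RTT
below 1000 microseconds the threshold evaluates to zero, so the limit
is never approximated to zero in that case.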