From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from gandalf.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id D97405A026D for ; Fri, 10 Nov 2023 01:23:27 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=201602; t=1699575804; bh=QlNf0HdqwUe2k23jQpCLr9c87ud48Ecn6ExRI6FWlFk=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=QD/fTOlRtjRxL1P4K/XJA4PRkb1cuA7liJ692/dEtsTiqjlcWYUEliBQxWjr7aTRZ 8D4anJhz+eH33LRbE1hPr4Bhy7+tif+eXnrm6mAhNFhgvn+geB/uUGaDTuDMufQa1d efEnyrkPO4uRtUCmBKk0cyuDuyKRmsw0l56p6ZZk= Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4SRKL4048Zz4x5p; Fri, 10 Nov 2023 11:23:24 +1100 (AEDT) Date: Fri, 10 Nov 2023 11:23:03 +1100 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH 2/2] tcp: Don't use TCP_WINDOW_CLAMP Message-ID: References: <20231109095400.507679-1-david@gibson.dropbear.id.au> <20231109095400.507679-3-david@gibson.dropbear.id.au> <20231109154836.2d866c69@elisabeth> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="PVT4ajWZeYP24sVN" Content-Disposition: inline In-Reply-To: <20231109154836.2d866c69@elisabeth> Message-ID-Hash: NL3L4VJNBYF5FIQRMB5ZFOPYMU64JV5A X-Message-ID-Hash: NL3L4VJNBYF5FIQRMB5ZFOPYMU64JV5A X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --PVT4ajWZeYP24sVN Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Nov 09, 2023 at 03:51:36PM +0100, Stefano Brivio wrote: > On Thu, 9 Nov 2023 20:54:00 +1100 > David Gibson wrote: >=20 > > On the L2 tap side, we see TCP headers and know the TCP window that the > > ultimate receiver is advertising. In order to avoid unnecessary buffer= ing > > within passt/pasta (or by the kernel on passt/pasta's behalf) we attempt > > to advertise that window back to the original sock-side sender using > > TCP_WINDOW_CLAMP. > >=20 > > However, TCP_WINDOW_CLAMP just doesn't work like this. Prior to kernel > > commit 3aa7857fe1d7 ("tcp: enable mid stream window clamp"), it simply > > had no effect on established sockets. After that commit, it does affect > > established sockets but doesn't behave the way we need: > > * It appears to be designed only to shrink the window, not to allow i= t to > > re-expand. > > * More importantly, that commit has a serious bug where if the > > setsockopt() is made when the existing kernel advertised window for= the > > socket happens to be zero, it will now become locked at zero, stopp= ing > > any further data from being received on the socket. > >=20 > > Since this has never worked as intended, simply remove it. It might be > > possible to re-implement the intended behaviour by manipulating SO_RCVB= UF, > > so we leave a comment to that effect. > >=20 > > This kernel bug is the underlying cause of both the linked passt bug and > > the linked podman bug. We attempted to fix this before with passt comm= it > > d3192f67 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag"). > > However while that commit masked the bug for some cases, it didn't real= ly > > address the problem. > >=20 > > Fixes: d3192f67 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED = flag") > > Link: https://github.com/containers/podman/issues/20170 > > Link: https://bugs.passt.top/show_bug.cgi?id=3D74 > >=20 > > Signed-off-by: David Gibson > > --- > > tcp.c | 65 ++++++++---------------------------------------------- > > tcp_conn.h | 7 +++--- > > 2 files changed, 12 insertions(+), 60 deletions(-) > >=20 > > diff --git a/tcp.c b/tcp.c > > index 13e82ca..cfcd40a 100644 > > --- a/tcp.c > > +++ b/tcp.c > > @@ -46,8 +46,8 @@ > > * - avoiding port and address translations whenever possible > > * - mirroring TCP dynamics by observation of socket parameters (TCP= _INFO > > * socket option) and TCP headers of packets coming from the tap i= nterface, > > - * reapplying those parameters in both flow directions (including = TCP_MSS, > > - * TCP_WINDOW_CLAMP socket options) > > + * reapplying those parameters in both flow directions (including = TCP_MSS > > + * socket option) > > * - simplicity: only a small subset of TCP logic is implemented here = and > > * delegated as much as possible to the TCP implementations of guest= and host > > * kernel. This is achieved by: > > @@ -236,12 +236,10 @@ > > * - update @seq_ack_from_tap from ack_seq in header > > * - on two duplicated ACKs, reset @seq_to_tap to @seq_ack_from_ta= p, and > > * resend with steps listed above > > - * - set TCP_WINDOW_CLAMP from TCP header from tap > > * > > * - from tap/guest to socket: > > * - on packet from tap/guest: > > * - set @ts_tap_act > > - * - set TCP_WINDOW_CLAMP from TCP header from tap > > * - check seq from header against @seq_from_tap, if data is missi= ng, send > > * two ACKs with number @seq_ack_to_tap, discard packet > > * - otherwise queue data to socket, set @seq_from_tap to seq from= header > > @@ -420,7 +418,7 @@ static const char *tcp_state_str[] __attribute((__u= nused__)) =3D { > > }; > > =20 > > static const char *tcp_flag_str[] __attribute((__unused__)) =3D { > > - "STALLED", "LOCAL", "WND_CLAMPED", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE", > > + "STALLED", "LOCAL", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE", > > "ACK_FROM_TAP_DUE", > > }; > > =20 > > @@ -1790,58 +1788,16 @@ static void tcp_get_tap_ws(struct tcp_tap_conn = *conn, > > =20 > > /** > > * tcp_tap_window_update() - Process an updated window from tap side > > - * @c: Execution context > > * @conn: Connection pointer > > * @window: Window value, host order, unscaled > > */ > > -static void tcp_tap_window_update(const struct ctx *c, > > - struct tcp_tap_conn *conn, unsigned wnd) > > +static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned = wnd) > > { > > - uint32_t prev_scaled =3D conn->wnd_from_tap << conn->ws_from_tap; > > - int s =3D conn->sock; > > - > > wnd =3D MIN(MAX_WINDOW, wnd << conn->ws_from_tap); > > conn->wnd_from_tap =3D MIN(wnd >> conn->ws_from_tap, USHRT_MAX); > > =20 > > - /* TODO: With (at least) Linux kernel versions 6.1 to 6.5, if we end = up > > - * with a zero-sized window on a TCP socket, dropping data (once > > - * acknowledged by the guest) with recv() and MSG_TRUNC doesn't appear > > - * to be enough to make the kernel advertise a non-zero window to the > > - * sender. Forcing a TCP_WINDOW_CLAMP setting, even with the existing > > - * value, fixes this. > > - * > > - * The STALLED flag on a connection is a sufficient indication that we > > - * might have a zero-sized window on the socket, because it's set if = we > > - * exhausted the tap-side window, or if everything we receive from a > > - * socket is already in flight to the guest. > > - * > > - * So, if STALLED is set, and we received a window value from the tap, > > - * force a TCP_WINDOW_CLAMP setsockopt(). This should be investigated > > - * further and fixed in the kernel instead, if confirmed. > > - */ > > - if (!(conn->flags & STALLED) && conn->flags & WND_CLAMPED) { > > - if (prev_scaled =3D=3D wnd) > > - return; > > - > > - /* Discard +/- 1% updates to spare some syscalls. */ > > - /* TODO: cppcheck, starting from commit b4d455df487c ("Fix > > - * 11349: FP negativeIndex for clamped array index (#4627)"), > > - * reports wnd > prev_scaled as always being true, see also: > > - * > > - * https://github.com/danmar/cppcheck/pull/4627 > > - * > > - * drop this suppression once that's resolved. > > - */ > > - /* cppcheck-suppress [knownConditionTrueFalse, unmatchedSuppression]= */ > > - if ((wnd > prev_scaled && wnd * 99 / 100 < prev_scaled) || > > - (wnd < prev_scaled && wnd * 101 / 100 > prev_scaled)) > > - return; > > - } > > - > > - if (setsockopt(s, SOL_TCP, TCP_WINDOW_CLAMP, &wnd, sizeof(wnd))) > > - trace("TCP: failed to set TCP_WINDOW_CLAMP on socket %i", s); > > - > > - conn_flag(c, conn, WND_CLAMPED); > > + /* FIXME: reflect the tap-side receiver's window back to the sock-side > > + * sender by adjusting SO_RCVBUF? */ >=20 > If we adjust SO_RCVBUF, we lose the chance to get bigger receive > buffers by means of auto-tuning, which is especially problematic with > default values for rmem_max (in turn, the case which is giving us > issues here). Yeah, I forgot about that :/. > I'd rather fix TCP_CLAMP_WINDOW in the kernel and then re-enable this > in passt (I guess they'll be no obvious way to probe this, so we'll > have to rely on the kernel version). I kind of hate checking kernel versions, because it means we can't make use of fixes backported to distro kernels for example. Now that we understand the cause of the problem, I can see a way to probe for it, but it's pretty ugly: 1. Create a pair of connected local TCP sockets A and B 2. Set SO_SNDBUF on A and SO_RCVBUF on B as small as we're allowed 3. Write into A until EAGAIN; since we didn't read B, we can now be confident the window is zero 4. Set TCP_CLAMP_WINDOW on B to a fairly large 5. Read and discard from B until EAGAIN, we can now be confident the buffer is empty 6. Write to A: if we get EGAIN it looks like we're hitting this problem There might be some edge cases where step 6 seems to succeed because it goes into A's send buffer, but never reaches B's rcv buffer. We can potentially disambiguate by attempting reads on B. But.. yeah.. it's not simple. I wonder if we do fix this in the kernel whether there's anything we can do to publish the fact that this is now fixed. Could we tweak the behaviour on getsockopt(TCP_WINDOW_CLAMP) to advertise it maybe? > Anyway, this comment is not so fundamental, so I would apply this > series like it is -- the rest looks good to me. --=20 David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --PVT4ajWZeYP24sVN Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmVNd9MACgkQzQJF27ox 2Geq5A/9G9Meqb5lhg6tBsqmtTB0YjWl1lg9g9ZGY1WgIpaS8IA42kzVCpi6QLdb R8+Bzz+k4dFiIi+U2kKq/1sATE7zhP7HQWYZAwPD/+ygK1LavbNmZJH0cYpo0Ill jXNIvy+ZWLSORuMdLiSKV4dhA11NPH+ZqI8WHBOa+VuUTOkTpH8a1atKywjCw/xd ldsZEIcYCa7jquX+NnMXmfTwAkpY0dKzA5kf58vvNvasMBR4OfRmEVJnQbl1S+Dl JXST4Ym0n3TEN6XVmTg+jkqvUKwrx1hKyyONAALp+fi5+NbTKaQtEtvD7RwlEpLW L2e5OTizPztiTyQt9CSXdq43kPZqmTSTC4uqcP7Y2eRqSQeFVsh9C1NuUktn0E/T D5HebZLZ79+deX7Ia6joEHJk2JOjJuMud3KuriL0xZKYgas1/x7sKy7xC1TMa0po PEE+bahAl4dUQqECRbuwYepEA7afWmlvnDEY7aTNtPcs9eJqq4WKNFqvmgMEGGJ4 lN1zyREGehN6lSH7/9LUWX2r4/i1k/v9oVXv0hT7yl0kkblGdy2jpnEcVuXLOgyf iP0fxLBfZtW3ILIdhCE+of5pJfvPRmLFPkPrXS+aT+rG5QQmvNxjaTvpRx5d2xFi zCge6gV6E4aEwkPUAd+4Aes6u1uAVYhR24cEAV/zLNjja0vUmyA= =9JSO -----END PGP SIGNATURE----- --PVT4ajWZeYP24sVN--