From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au Authentication-Results: passt.top; dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202508 header.b=wRGigWmO; dkim-atps=neutral Received: from mail.ozlabs.org (gandalf.ozlabs.org [150.107.74.76]) by passt.top (Postfix) with ESMTPS id 514F25A0271 for ; Wed, 10 Sep 2025 04:29:57 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202508; t=1757471394; bh=mUHSoYA/wnUvam5hrbwJMpjBUkcRovR9NWMaPe3JLCw=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=wRGigWmOHYXVX7gVbhJYu5lxdG1t33806bqXAEDflPH5TFIsPx50Spe/5ecyV59jO I7PgBTCKlx/hCk4ihMymaroB8f7h4VBueoSAZnHIsAcA56XWq0wIPJAOe26aB24C8b PKnxkr5romwUfwWi0JnLvqLG0/7ZV8uDRCLWlO6oHjpIiXVeu9UKo2gJa77J2UHjB7 8xmgvYGL8gUBgFqmf6tbBo7PY8stNOpDAmAI3Jku4/gcCjbqWwXgHQepg91JFdEQPF qnKHvx+Mx6qM1ZRn46x4yEm1YjGQg3qgb/OUL055tlOpRLwD482Z0Ua8mTgAhwOu4S sUr82qoULkuHw== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4cM4Qp57TBz4wC2; Wed, 10 Sep 2025 12:29:54 +1000 (AEST) Date: Wed, 10 Sep 2025 12:29:48 +1000 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH v4 8/8] tcp: Don't send FIN segment to guest yet if we have pending unacknowledged data Message-ID: References: <20250909181655.2990223-1-sbrivio@redhat.com> <20250909181655.2990223-9-sbrivio@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="NgeS7yTvrbRE4zNg" Content-Disposition: inline In-Reply-To: <20250909181655.2990223-9-sbrivio@redhat.com> Message-ID-Hash: EUTK3UQHAXY6QC23B6ZV73OTLQKPFHC6 X-Message-ID-Hash: EUTK3UQHAXY6QC23B6ZV73OTLQKPFHC6 X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, Jon Maloy , Paul Holzinger X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --NgeS7yTvrbRE4zNg Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Tue, Sep 09, 2025 at 08:16:55PM +0200, Stefano Brivio wrote: > For some reason, tcp_vu_data_from_sock() already takes care of this, > but the non-vhost-user version ignores this possibility and just sends > out a FIN segment whenever we infer we received one host-side, > regardless of the fact that we might have unacknowledged data still to > send. >=20 > Somewhat surprisingly, this didn't cause any issue to be reported yet, > until 6.17-rc1 and 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") > came around, leading to the following report from Paul, who hit this > running Podman tests: >=20 > 439 0.033032 169.254.1.2 =E2=86=92 192.168.122.100 65540 TCP 56602 = =E2=86=92 5789 [PSH, ACK] Seq=3D10336361 Ack=3D1 Win=3D65536 Len=3D65480 > 440 0.033055 169.254.1.2 =E2=86=92 192.168.122.100 30324 TCP [TCP Wi= ndow Full] 56602 =E2=86=92 5789 [PSH, ACK] Seq=3D10401841 Ack=3D1 Win=3D655= 36 Len=3D30264 >=20 > we're sending data to the container, up to the edge of the window >=20 > 441 0.033059 192.168.122.100 =E2=86=92 169.254.1.2 60 TCP 5789 =E2= =86=92 56602 [ACK] Seq=3D1 Ack=3D10401841 Win=3D83968 Len=3D0 >=20 > and the container acknowledges it >=20 > 442 0.033091 169.254.1.2 =E2=86=92 192.168.122.100 53716 TCP 56602 = =E2=86=92 5789 [PSH, ACK] Seq=3D10432105 Ack=3D1 Win=3D65536 Len=3D53656 >=20 > we send more data, all we possibly can, in window >=20 > 443 0.033126 192.168.122.100 =E2=86=92 169.254.1.2 60 TCP [TCP ZeroW= indow] 5789 =E2=86=92 56602 [ACK] Seq=3D1 Ack=3D10432105 Win=3D0 Len=3D0 >=20 > and the container shrinks the window due to the issue introduced > by kernel commit e2142825c120 ("net: tcp: send zero-window ACK when no > memory"). With a previous patch from this series, we rewind the > sequence, meaning that we assign conn->seq_to_tap from > conn->seq_ack_from_tap, so that we'll retransmit this segment, by > reading again from the socket, and increasing conn->seq_to_tap > once more. >=20 > However: >=20 > 444 0.033144 169.254.1.2 =E2=86=92 192.168.122.100 60 TCP 56602 =E2= =86=92 5789 [FIN, PSH, ACK] Seq=3D10485761 Ack=3D1 Win=3D65536 Len=3D0 >=20 > we eventually get a zero-length read from the socket and we miss the > fact that conn->seq_to_tap !=3D conn->seq_ack_from_tap, so we send a > FIN flag with the most recent sequence. The kernel insists: >=20 > 445 0.033147 192.168.122.100 =E2=86=92 169.254.1.2 60 TCP [TCP ZeroW= indow] 5789 =E2=86=92 56602 [ACK] Seq=3D1 Ack=3D10432105 Win=3D0 Len=3D0 >=20 > with its buggy zero-window update, but: >=20 > 446 0.033152 192.168.122.100 =E2=86=92 169.254.1.2 60 TCP [TCP Windo= w Update] 5789 =E2=86=92 56602 [ACK] Seq=3D1 Ack=3D10432105 Win=3D69120 Len= =3D0 > 447 0.033202 192.168.122.100 =E2=86=92 169.254.1.2 60 TCP [TCP Windo= w Update] 5789 =E2=86=92 56602 [ACK] Seq=3D1 Ack=3D10432105 Win=3D142848 Le= n=3D0 >=20 > we don't reset the TAP_FIN_SENT flag anymore, and don't resend > the FIN segment (nor data), as we already rewound the sequence > earlier. >=20 > To solve this, hold off the FIN segment until we get a zero-length > read from the socket *and* we know that there's no unacknowledged > pending data, also without vhost-user, in tcp_buf_data_from_sock(). >=20 > Reported-by: Paul Holzinger > Signed-off-by: Stefano Brivio Reviewed-by: David Gibson At least, I think this makes sense. As always, I find the semantics of the STALLED flag difficult to wrap my head around. > --- > tcp_buf.c | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) >=20 > diff --git a/tcp_buf.c b/tcp_buf.c > index 4ebb013..49bddbe 100644 > --- a/tcp_buf.c > +++ b/tcp_buf.c > @@ -363,7 +363,10 @@ int tcp_buf_data_from_sock(const struct ctx *c, stru= ct tcp_tap_conn *conn) > } > =20 > if (!len) { > - if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) =3D=3D SOCK_FIN_RC= VD) { > + if (already_sent) { > + conn_flag(c, conn, STALLED); > + } else if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) =3D=3D > + SOCK_FIN_RCVD) { > int ret =3D tcp_buf_send_flag(c, conn, FIN | ACK); > if (ret) { > tcp_rst(c, conn); > --=20 > 2.43.0 >=20 --=20 David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson --NgeS7yTvrbRE4zNg Content-Type: application/pgp-signature; name=signature.asc -----BEGIN PGP SIGNATURE----- iQIzBAEBCgAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmjA4psACgkQzQJF27ox 2GcKpQ/9Gnck0vctZnxSBcgmeeuZpDax9KeLvUJAdRojDVwKk/ttXxJbjwXBH/pJ 41EZZR6mtEhFvMDLT4spUeoB+xIKtsEx8DeAp/O7o6yGIZKX/IxuCAtiI1+qyv6G m3NREUYChcOZnbPEgUYZe2N6Fdn2QCyDCBE3THYDF0ls4os2037FzAsMb1l+4P2+ IMvnh3oi9lP3r820/VMgzhGBF95qxyD36H8XEIatLpdTljNFNsWkBHCv4NDxvi+K 2Pok+aYjdmw6R1yFzEZ2xBQjcR3f6grbY/7BWoIBizjhHFEVDzSOe2QLACCLaFvx uso4lFkTJRAGSEykoQgREshsMY67s0Fb7e1gpZ55zC3/PfqzaFAgt2odGvCxTWp2 Ze3DjINuSXLKexn0TtkAO7wnQnaMTK6yJDMRiXcmlewqM5w/dL5jF0CB/ykQqzwZ Z3+4PMn25iesRqPhPZZ4F/DwrCjs5/7edhfXLdV5bK+oaGErkinjCIR/9y44GXzU ukR04ZdJ+HbVMrcLRskXZ9nHzfFLvdQs1yQVo1N4f3gSIHOmenPgS/V9Cfn26qU7 WiMJttXt8CYe0tQjWwhOZYdFC8cmBF106ytTgCPqFDCYFefSsWShSFcO6s0wORv8 EqNBymp3bn4V61wDorP9sbED+lT6yBhBgCkf6h1XrZZj8+AjOZ8= =3KH1 -----END PGP SIGNATURE----- --NgeS7yTvrbRE4zNg--