From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from gandalf.ozlabs.org (gandalf.ozlabs.org [150.107.74.76]) by passt.top (Postfix) with ESMTPS id 8EAC05A026D for ; Mon, 25 Sep 2023 06:09:59 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=201602; t=1695614993; bh=RfP+iD8ek5EzTAp6vyACYYQvt4GV4rwSftv9USrUi5g=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=nMKcll/Qu8EkJSmUcig4ILhWLtGaWQsV6TM+3zBoshFt8yXtCqzlHXhoCGxbOxH0u 4OUxeVuCAX3Mmdiu6OOWWnNqN6Yul4zTuu9dNsjbh1rJLirsjQj1uMMGobhw4cloE7 TFHbY1AjuRht+V/cJXWBOJ/vMv2BAojHyeGMTyVo= Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4Rv8Xd6Q6jz4xKl; Mon, 25 Sep 2023 14:09:53 +1000 (AEST) Date: Mon, 25 Sep 2023 14:09:41 +1000 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH RFT 3/5] tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag Message-ID: References: <20230922220610.58767-1-sbrivio@redhat.com> <20230922220610.58767-4-sbrivio@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="4JAQlPs5nCZam07I" Content-Disposition: inline In-Reply-To: <20230922220610.58767-4-sbrivio@redhat.com> Message-ID-Hash: VORBMFHWPFESLFSWWPYTX7HGDSAMCUKG X-Message-ID-Hash: VORBMFHWPFESLFSWWPYTX7HGDSAMCUKG X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: Matej Hrica , passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --4JAQlPs5nCZam07I Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sat, Sep 23, 2023 at 12:06:08AM +0200, Stefano Brivio wrote: > It looks like we need it as workaround for this situation, readily > reproducible at least with a 6.5 Linux kernel, with default rmem_max > and wmem_max values: >=20 > - an iperf3 client on the host sends about 160 KiB, typically > segmented into five frames by passt. We read this data using > MSG_PEEK >=20 > - the iperf3 server on the guest starts receiving >=20 > - meanwhile, the host kernel advertised a zero-sized window to the > receiver, as expected >=20 > - eventually, the guest acknowledges all the data sent so far, and > we drop it from the buffer, courtesy of tcp_sock_consume(), using > recv() with MSG_TRUNC >=20 > - the client, however, doesn't get an updated window value, and > even keepalive packets are answered with zero-window segments, > until the connection is closed >=20 > It looks like dropping data from a socket using MSG_TRUNC doesn't > cause a recalculation of the window, which would be expected as a > result of any receiving operation that invalidates data on a buffer > (that is, not with MSG_PEEK). >=20 > Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to > the previous value we clamped to, forces a recalculation of the > window which is advertised to the guest. >=20 > I couldn't quite confirm this issue by following all the possible > code paths in the kernel, yet. If confirmed, this should be fixed in > the kernel, but meanwhile this workaround looks robust to me (and it > will be needed for backward compatibility anyway). So, I tested this, and things got a bit complicated. First, I reproduced the "read side" problem by setting net.core.rmem_max to 256kiB while setting net.core.wmem_max to 16MiB. The "160kiB" stall happened almost every time. Applying this patch appears to fix it completely, getting GiB/s throughput consistently. So, yah. Then I tried reproducing it differently: by setting both net.core.rmem_max and net.core.wmem_max to 16MiB, but setting SO_RCVBUF to 128kiB explicitly in tcp_sock_set_bufsize() (which actually results in a 256kiB buffer, because of the kernel's weird interpretation). With the SO_RCVBUF clamp and without this patch, I don't get the 160kiB stall consistently any more. What I *do* get is nearly every time - but not *every* time - is slow transfers, ~40Mbps vs. ~12Gbps. Sometimes it stalls after several seconds. The stall is slightly different from the 160kiB stall though: the 160kiB stall seems 0 bytes transferred on both sides. With the RCVBUF stall I get a trickle of bytes (620 bytes/s) on the receiver/guest side, with mostly 0 bytes per interval on the sender but occasionally an interval with several hundred KB. That is it seems like there's a buffer somewhere that's very slowly draining into the receiver, then getting topped up in an instant once it gets low enough. When I have both this patch and the RCVBUF clamp, I don't seem to be able to reproduce the trickle-stall anymore, but I still get the slow transfer speeds most, but not every time. Sometimes, but only rarely, I do seem to still get a complete stall (0 bytes on both sides). So it looks like there are two different things going on here. It looks like this patch fixes at least something, but there might be some more things to investigate afterwards. On that basis: Tested-by: David Gibson Reviewed-by: David Gibson >=20 > Reported-by: Matej Hrica > Link: https://bugs.passt.top/show_bug.cgi?id=3D74 > Analysed-by: David Gibson > Signed-off-by: Stefano Brivio > --- > tcp.c | 22 +++++++++++++++++++--- > 1 file changed, 19 insertions(+), 3 deletions(-) >=20 > diff --git a/tcp.c b/tcp.c > index 5528e05..4606f17 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -1780,7 +1780,23 @@ static void tcp_clamp_window(const struct ctx *c, = struct tcp_tap_conn *conn, > wnd <<=3D conn->ws_from_tap; > wnd =3D MIN(MAX_WINDOW, wnd); > =20 > - if (conn->flags & WND_CLAMPED) { > + /* TODO: With (at least) Linux kernel versions 6.1 to 6.5, if we end up > + * with a zero-sized window on a TCP socket, dropping data (once > + * acknowledged by the guest) with recv() and MSG_TRUNC doesn't appear > + * to be enough to make the kernel advertise a non-zero window to the > + * receiver. Forcing a TCP_WINDOW_CLAMP setting, even with the existing > + * value, fixes this. > + * > + * The STALLED flag on a connection is a sufficient indication that we > + * might have a zero-sized window on the socket, because it's set if we > + * exhausted the tap-side window, or if everything we receive from a > + * socket is already in flight to the guest. > + * > + * So, if STALLED is set, and we received a window value from the tap, > + * force a TCP_WINDOW_CLAMP setsockopt(). This should be investigated > + * further and fixed in the kernel instead, if confirmed. > + */ > + if (!(conn->flags & STALLED) && conn->flags & WND_CLAMPED) { > if (prev_scaled =3D=3D wnd) > return; > =20 > @@ -2409,12 +2425,12 @@ static int tcp_data_from_tap(struct ctx *c, struc= t tcp_tap_conn *conn, > i =3D keep - 1; > } > =20 > - tcp_clamp_window(c, conn, max_ack_seq_wnd); > - > /* On socket flush failure, pretend there was no ACK, try again later */ > if (ack && !tcp_sock_consume(conn, max_ack_seq)) > tcp_update_seqack_from_tap(c, conn, max_ack_seq); > =20 > + tcp_clamp_window(c, conn, max_ack_seq_wnd); > + > if (retr) { > trace("TCP: fast re-transmit, ACK: %u, previous sequence: %u", > max_ack_seq, conn->seq_to_tap); --=20 David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --4JAQlPs5nCZam07I Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmURB/4ACgkQzQJF27ox 2GcVUg//cXEiMb27uCGPceCpq2nFLEytbDNohBhaJb21DHFAepDe3Xl8B7KxB1fE 1WERLt7rL7mw4T3tr5a0zUr/K8e9+aH1Q4bJ7rNCmn28AL3aKdbVVo3KblScNYfc 6ax6vM1gixCSIar0DvEBnhUKD2pH6rBG+I1bQoVrZQo8Z/NxIkdr+1ds3om6O/hT sbkk0NxsO5V5nA6pVmfhGN1ApOU6B5UT5TjNTjpzp1WWUvUPy/c6wdHB9wA6ZZLc oyf2CuqKgiuWaHfC7kkJV5sf/tYV34wwkFnr6ucPkLvt573Gn8gz96V8hnPQKdXC GSky/wBoeR9cET8yq75VxU1CsFAQFjxhrXoRkNBIAg7yFZTJzftZTeNRfDJhI8e2 9LIoDBMxOVj0vEME6KaI/KNC4ehblV9Vbg6Of/vDvq3weeWANJoO4DhPMqYJYn0L LhDtuUZ3aayLoU1OsJnezqWdhFT+Vd5Xtf1b7ds0jQqe6a5OK3JEemQCyrE6CvxL B/njJs1yaIOZ+OQD7rjnvuNnhJCC9Hj2bQc0zR+tOPMFKInbudRELIaGC8jfWlJw F6x4AND2JDpYn4wO7YexcS08fhzA4cAs6+uGf+xIshniOHebScRCl8hhENGMPYZz I8RpuAGIEL8ySM7msQH1OEMVtdPlfoLjxqTxlvDZnX+akL4tuZ8= =CcJR -----END PGP SIGNATURE----- --4JAQlPs5nCZam07I--