From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au Authentication-Results: passt.top; dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202410 header.b=CwSgEYS4; dkim-atps=neutral Received: from mail.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id 5704C5A004E for ; Fri, 25 Oct 2024 02:28:05 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202410; t=1729816079; bh=4mTkK+OHD1/tovnGanRUvnb6JAhoB/4xkcxZG4uRpVU=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=CwSgEYS4mK+qBak3WcX+HIL/2jkeVSsqI/b+rcoMOXYNs2CLKT68/WHE2IPdj7vbu UjkV0HlIIXlatLrs9eyOfW2uvRqyCETl4hvd8rQR/OsVj3AezyTJvUOOrh3BZ6qdCc Qv8sWIWPCZMjylJ0SxGdrQaylnMgvjj23tmwQsq58khWnTxnqiuhxvY5KGlyB/R2II MyimZKT5BObX3jmfH1P717C30xsHEmsNqU4TJUnLQIOt6UJKzAa61oNogWVRHb+nZ1 dFWT9uosnXoAe+LRNpLglZ6/MdAShEkYStWI38mPG3okzBdOdcJPY46WjbRjUtBGjc 6eIUk/opwWWKQ== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4XZNsq6RTZz4wbR; Fri, 25 Oct 2024 11:27:59 +1100 (AEDT) Date: Fri, 25 Oct 2024 11:27:55 +1100 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH] tcp_splice: splice() all we have to the writing side, not what we just read Message-ID: References: <20241024141110.2951221-1-sbrivio@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="WBvY8Yq3l8/h4gms" Content-Disposition: inline In-Reply-To: <20241024141110.2951221-1-sbrivio@redhat.com> Message-ID-Hash: ZS524RJIJBDWYE7I4DW7M6A7AFQYQ6JH X-Message-ID-Hash: ZS524RJIJBDWYE7I4DW7M6A7AFQYQ6JH X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, Ed Santiago , Paul Holzinger X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --WBvY8Yq3l8/h4gms Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Oct 24, 2024 at 04:11:10PM +0200, Stefano Brivio wrote: > In tcp_splice_sock_handler(), we try to calculate how much we can move > from the pipe to the writing socket: if we just read some bytes, we'll > use that amount, but if we haven't, we just try to empty the pipe. >=20 > However, if we just read something, that doesn't mean that that's all > the data we have on the pipe, as it's obvious from this sequence, where: >=20 > pasta: epoll event on connected spliced TCP socket 54 (events: 0x000000= 01) > Flow 0 (TCP connection (spliced)): 98304 from read-side call > Flow 0 (TCP connection (spliced)): 33615 from write-side call (passed 9= 8304) > Flow 0 (TCP connection (spliced)): -1 from read-side call > Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 5242= 88) > Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580 > Flow 0 (TCP connection (spliced)): OUT_WAIT_0 >=20 > we first pile up 98304 - 33615 =3D 64689 pending bytes, that we read but > couldn't write, as the receiver buffer is full, and we set the > corresponding OUT_WAIT flag. Then: >=20 > pasta: epoll event on connected spliced TCP socket 54 (events: 0x000000= 01) > Flow 0 (TCP connection (spliced)): 32768 from read-side call > Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 3276= 8) > Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580 >=20 > we splice() 32768 more bytes from our receiving side to the pipe. At > some point: >=20 > pasta: epoll event on connected spliced TCP socket 49 (events: 0x000000= 04) > Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:489 > Flow 0 (TCP connection (spliced)): ~OUT_WAIT_0 > Flow 0 (TCP connection (spliced)): 1320 from read-side call > Flow 0 (TCP connection (spliced)): 1320 from write-side call (passed 13= 20) >=20 > the receiver is signalling to us that it's ready for more data > (EPOLLOUT). We reset the OUT_WAIT flag, read 1320 more bytes from > our receiving socket into the pipe, and that's what we write to the > receiver, forgetting about the pending 97457 bytes we had, which the > receiver might never get (not the same 97547 bytes: we'll actually > send 1320 of those). >=20 > This condition is rather hard to reproduce, and it was observed with > Podman pulling container images via HTTPS. In the traces above, the > client is side 0 (the initiating peer), and the server is sending the > whole data. >=20 > Instead of splicing from pipe to socket the amount of data we just > read, we need to splice all the pending data we piled up until that > point. We could do that using 'read' and 'written' counters, but > there's actually no need, as the kernel also keeps track of how much > data is available on the pipe. >=20 > So, to make this simple and more robust, just give the whole pipe size > as length to splice(). The kernel knows what to do with it. >=20 > Later in the function, we used 'to_write' for an optimisation meant > to reduce wakeups which retries right away to splice() in both > directions if we couldn't write to the receiver the whole amount of > pending data. Calculate a 'pending' value instead, only if we reach > that point. >=20 > Now that we check for the actual amount of pending data in that > optimisation, we need to make sure we don't compare a zero or negative > 'written' value: if we met that, it means that the receiver signalled > end-of-file, an error, or to try again later. In those three cases, > the optimisation doesn't make any sense, so skip it. >=20 > Reported-by: Ed Santiago > Reported-by: Paul Holzinger > Analysed-by: Paul Holzinger > Link: https://github.com/containers/podman/issues/24219 > Signed-off-by: Stefano Brivio Reviewed-by: David Gibson > --- > tcp_splice.c | 16 ++++++---------- > 1 file changed, 6 insertions(+), 10 deletions(-) >=20 > diff --git a/tcp_splice.c b/tcp_splice.c > index 9f5cc27..f112cfe 100644 > --- a/tcp_splice.c > +++ b/tcp_splice.c > @@ -503,7 +503,7 @@ swap: > lowat_act_flag =3D RCVLOWAT_ACT(fromsidei); > =20 > while (1) { > - ssize_t readlen, to_write =3D 0, written; > + ssize_t readlen, written, pending; > int more =3D 0; > =20 > retry: > @@ -518,14 +518,11 @@ retry: > =20 > if (errno !=3D EAGAIN) > goto close; > - > - to_write =3D c->tcp.pipe_size; > } else if (!readlen) { > eof =3D 1; > - to_write =3D c->tcp.pipe_size; > } else { > never_read =3D 0; > - to_write +=3D readlen; > + > if (readlen >=3D (long)c->tcp.pipe_size * 90 / 100) > more =3D SPLICE_F_MORE; > =20 > @@ -535,10 +532,10 @@ retry: > =20 > eintr: > written =3D splice(conn->pipe[fromsidei][0], NULL, > - conn->s[!fromsidei], NULL, to_write, > + conn->s[!fromsidei], NULL, c->tcp.pipe_size, > SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK); > flow_trace(conn, "%zi from write-side call (passed %zi)", > - written, to_write); > + written, c->tcp.pipe_size); > =20 > /* Most common case: skip updating counters. */ > if (readlen > 0 && readlen =3D=3D written) { > @@ -584,10 +581,9 @@ eintr: > if (never_read && written =3D=3D (long)(c->tcp.pipe_size)) > goto retry; > =20 > - if (!never_read && written < to_write) { > - to_write -=3D written; > + pending =3D conn->read[fromsidei] - conn->written[fromsidei]; > + if (!never_read && written > 0 && written < pending) > goto retry; > - } > =20 > if (eof) > break; --=20 David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson --WBvY8Yq3l8/h4gms Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmca5goACgkQzQJF27ox 2GcHFw/9EJKKE/d2EBwMJArHaZSClkWoc/59JsU8c0Qa/nnqDkyztjW+9Kt0chKm NWTAuSXwHmXGxFaWM7fMAfsc5E938g0JPooRmss8CRZRuoBnnBJDsR/2Z5tZXO2m g9C/C65j8H8mobh6taL7cRRI8ShZ2Gr2ZopNGulcbiGLqO9ywlYRbCpbNecmKE2/ pt+JvEJm1nMuTiSCROhAj1ioBNMR9Zz9pIQqx84B6f5kjs8p24AsrT0ZI6AXKBzQ /l79s6YBzGW1u1NoSUCi/AZWR6ElXo33XT22Omvc18kWFMYpiw2et1SuoA068N4r MqMCwc1iDYonaF01VxdmSI7Zcsh2bjPI2+GhN21eX5hJk7D8+12WnJgeba0z+SHU A97rAwLs19yPhjS9iXQ15TmPkM/1RD60MEPMKnSi+oouX9liDjDA+UiHYBjrJhIp KekmCV8/zA6LLDEHp+2lnSUifSGuXETD/k8kgDT6L4d+bF8KDl6b1HJ4imxbcwL8 KciKIARr9QVS8R/4HSajEkyD+A5dwBLEGMzWqVP3UK1DzfTb0TtyRuXAzufCJWdF M+5qwhnnMCX4UZRmBGYv2wnbum4acGoSk1OedXnTY/8GXOAe0tN1hY9vuNsD3MBe XMsA50Juhs/BVVeDEbUYw1vY+t77MSpavEh6ia2k2wRVPQ/egXw= =eiKY -----END PGP SIGNATURE----- --WBvY8Yq3l8/h4gms--