From: David Gibson
To: Stefano Brivio
Cc: passt-dev@passt.top, Paul Holzinger, Anshu Kumari
Date: Mon, 15 Jun 2026 12:04:00 +1000
Subject: Re: [PATCH 3/6] tcp_splice: Clean up flow control path for splice forwarding
References: <20260520130851.436931-1-david@gibson.dropbear.id.au> <20260520130851.436931-4-david@gibson.dropbear.id.au> <20260520222851.19e5f430@elisabeth> <20260612181841.40e698e7@elisabeth> <20260612185517.2349a94c@elisabeth>
In-Reply-To: <20260612185517.2349a94c@elisabeth>
List-Id: Development discussion and patches for passt

On Fri, Jun 12, 2026 at 06:55:18PM +0200, Stefano Brivio wrote:
> On Fri, 12 Jun 2026 18:18:41 +0200
> Stefano Brivio wrote:
>
> > On Thu, 21 May 2026 10:50:24 +1000
> > David Gibson wrote:
> >
> > > On Wed, May 20, 2026 at 10:28:52PM +0200, Stefano Brivio wrote:
> > > > Ah, yes, it looks better now. Three remarks:
> > > >
> > > > On Wed, 20 May 2026 23:08:48 +1000
> > > > David Gibson wrote:
> > > >
> > > > > Splice forwarding can be blocked either waiting for data from one side
> > > > > or waiting for space on the other.  For that reason,
> > > > > tcp_splice_sock_handler() on either socket can forward data in either or
> > > > > both directions, depending on whether we have EPOLLIN, EPOLLOUT or both
> > > > > events.
> > > > >
> > > > > The flow control for this is quite hard to follow though, since we forward
> > > > > in one direction, then sometimes loop back with a goto to do it in the
> > > > > other direction.  Simplify this by adding a tcp_splice_forward() function
> > > > > with the logic to forward in one direction and calling it either once or
> > > > > twice from tcp_splice_sock_handler().
> > > > >
> > > > > Signed-off-by: David Gibson
> > > > > ---
> > > > >  tcp_splice.c | 137 ++++++++++++++++++++++++++-------------------------
> > > > >  1 file changed, 71 insertions(+), 66 deletions(-)
> > > > >
> > > > > diff --git a/tcp_splice.c b/tcp_splice.c
> > > > > index 34ffea73..18e8b303 100644
> > > > > --- a/tcp_splice.c
> > > > > +++ b/tcp_splice.c
> > > > > @@ -474,67 +474,20 @@ void tcp_splice_conn_from_sock(const struct ctx *c, union flow *flow, int s0)
> > > > >  }
> > > > >
> > > > >  /**
> > > > > - * tcp_splice_sock_handler() - Handler for socket mapped to spliced connection
> > > > > + * tcp_splice_forward() - Forward data in one direction using splice()
> > > > >   * @c:		Execution context
> > > > > - * @ref:	epoll reference
> > > > > - * @events:	epoll events bitmap
> > > > > + * @conn:	Connection to forward data for
> > > > > + * @fromsidei:	Side to forward data from
> > > > >   *
> > > > >   * #syscalls:pasta splice
> > > > >   */
> > > > > -void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
> > > > > -			     uint32_t events)
> > > > > +static int tcp_splice_forward(struct ctx *c, struct
> > > > > +			      tcp_splice_conn *conn, unsigned fromsidei)
> > > >
> > > > I think the struct
> > > > argument should all be on the same line.
> > >
> > > Oops, definitely.  Forgot to document the return value too.
> > >
> > > > >  {
> > > > > -	struct tcp_splice_conn *conn = conn_at_sidx(ref.flowside);
> > > > > -	unsigned evsidei = ref.flowside.sidei, fromsidei;
> > > > > -	uint8_t lowat_set_flag, lowat_act_flag;
> > > > > -	int eof, never_read;
> > > > > -
> > > > > -	assert(conn->f.type == FLOW_TCP_SPLICE);
> > > > > -
> > > > > -	if (conn->events == SPLICE_CLOSED)
> > > > > -		return;
> > > > > -
> > > > > -	if (events & EPOLLERR) {
> > > > > -		int err, rc;
> > > > > -		socklen_t sl = sizeof(err);
> > > > > -
> > > > > -		rc = getsockopt(ref.fd, SOL_SOCKET, SO_ERROR, &err, &sl);
> > > > > -		if (rc)
> > > > > -			flow_perror(conn, "Error retrieving SO_ERROR");
> > > > > -		else
> > > > > -			flow_dbg(conn, "Error event on %s socket: %s",
> > > > > -				 pif_name(conn->f.pif[evsidei]),
> > > > > -				 strerror_(err));
> > > > > -		goto reset;
> > > > > -	}
> > > > > -
> > > > > -	if (conn->events == SPLICE_CONNECT) {
> > > > > -		if (!(events & EPOLLOUT)) {
> > > > > -			flow_err(conn, "Unexpected events 0x%x during connect",
> > > > > -				 events);
> > > > > -			goto reset;
> > > > > -		}
> > > > > -		if (tcp_splice_connect_finish(c, conn))
> > > > > -			goto reset;
> > > > > -	}
> > > > > -
> > > > > -	if (events & EPOLLOUT) {
> > > > > -		fromsidei = !evsidei;
> > > > > -		conn_event(conn, ~OUT_WAIT(evsidei));
> > > > > -	} else {
> > > > > -		fromsidei = evsidei;
> > > > > -	}
> > > > > -
> > > > > -	if (events & EPOLLRDHUP)
> > > > > -		/* For side 0 this is fake, but implied */
> > > > > -		conn_event(conn, FIN_RCVD(evsidei));
> > > > > -
> > > > > -swap:
> > > > > -	eof = 0;
> > > > > -	never_read = 1;
> > > > > -
> > > > > -	lowat_set_flag = RCVLOWAT_SET(fromsidei);
> > > > > -	lowat_act_flag = RCVLOWAT_ACT(fromsidei);
> > > > > +	uint8_t lowat_set_flag = RCVLOWAT_SET(fromsidei);
> > > > > +	uint8_t lowat_act_flag = RCVLOWAT_ACT(fromsidei);
> > > > > +	int never_read = 1;
> > > > > +	int eof = 0;
> > > > >
> > > > >  	while (1) {
> > > > >  		ssize_t readlen, written, pending;
> > > > > @@ -551,7 +504,7 @@ retry:
> > > > >  		if (readlen < 0 && errno != EAGAIN) {
> > > > >  			flow_perror(conn, "Splicing from %s socket",
> > > > >  				    pif_name(conn->f.pif[fromsidei]));
> > > > > -			goto reset;
> > > > > +			return -1;
> > > > >  		}
> > > > >
> > > > >  		flow_trace(conn, "%zi from read-side call", readlen);
> > > > > @@ -578,7 +531,7 @@ retry:
> > > > >  		if (written < 0 && errno != EAGAIN) {
> > > > >  			flow_perror(conn, "Splicing to %s socket",
> > > > >  				    pif_name(conn->f.pif[!fromsidei]));
> > > > > -			goto reset;
> > > > > +			return -1;
> > > > >  		}
> > > > >
> > > > >  		flow_trace(conn, "%zi from write-side call (passed %zi)",
> > > > > @@ -639,24 +592,76 @@ retry:
> > > > >  				if (shutdown(conn->s[!sidei], SHUT_WR) < 0) {
> > > > >  					flow_perror(conn, "shutdown() on %s",
> > > > >  						    pif_name(conn->f.pif[!sidei]));
> > > > > -					goto reset;
> > > > > +					return -1;
> > > > >  				}
> > > > >  				conn_event(conn, FIN_SENT(!sidei));
> > > > >  			}
> > > > >  		}
> > > > >  	}
> > > > >
> > > > > -	if (CONN_HAS(conn, FIN_SENT(0) | FIN_SENT(1))) {
> > > > > -		/* Clean close, no reset */
> > > > > -		conn_flag(conn, CLOSING);
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * tcp_splice_sock_handler() - Handler for socket mapped to spliced connection
> > > > > + * @c:		Execution context
> > > > > + * @ref:	epoll reference
> > > > > + * @events:	epoll events bitmap
> > > > > + */
> > > > > +void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
> > > > > +			     uint32_t events)
> > > > > +{
> > > > > +	struct tcp_splice_conn *conn = conn_at_sidx(ref.flowside);
> > > > > +	unsigned evsidei = ref.flowside.sidei;
> > > > > +
> > > > > +	assert(conn->f.type == FLOW_TCP_SPLICE);
> > > > > +
> > > > > +	if (conn->events == SPLICE_CLOSED)
> > > > >  		return;
> > > > > +
> > > > > +	if (events & EPOLLERR) {
> > > > > +		int err, rc;
> > > > > +		socklen_t sl = sizeof(err);
> > > > > +
> > > > > +		rc = getsockopt(ref.fd, SOL_SOCKET, SO_ERROR, &err, &sl);
> > > > > +		if (rc)
> > > > > +			flow_perror(conn, "Error retrieving SO_ERROR");
> > > > > +		else
> > > > > +			flow_dbg(conn, "Error event on %s socket: %s",
> > > > > +				 pif_name(conn->f.pif[evsidei]),
> > > > > +				 strerror_(err));
> > > > > +		goto reset;
> > > > > +	}
> > > > > +
> > > > > +	if (conn->events == SPLICE_CONNECT) {
> > > > > +		if (!(events & EPOLLOUT)) {
> > > > > +			flow_err(conn, "Unexpected events 0x%x during connect",
> > > > > +				 events);
> > > > > +			goto reset;
> > > > > +		}
> > > > > +		if (tcp_splice_connect_finish(c, conn))
> > > > > +			goto reset;
> > > > > +	}
> > > > > +
> > > > > +	if (events & EPOLLRDHUP)
> > > > > +		/* For side 0 this is fake, but implied */
> > > > > +		conn_event(conn, FIN_RCVD(evsidei));
> > > >
> > > > I saw this all goes away in 5/6, so it wouldn't be relevant. But in
> > > > case we decide to drop 5/6, here are my remarks on the this.
> > > >
> > > > EPOLLRDHUP is now handled before checking the other direction of the
> > > > connection in case of EPOLLOUT.
> > >
> > > I'm pretty sure that hasn't changed.  In the old code EPOLLRDHUP
> > > handling was before we did any of the actual data handling for EPOLLIN
> > > or EPOLLOUT.
> >
> > Well, kind of, in the sense that it's true we did that before any data
> > handling, but we had two checks:
> >
> > 	if (conn->events == SPLICE_CLOSED)
> > 		return;
> >
> > 	[...]
> >
> > 	if (conn->events == SPLICE_CONNECT) {
> > 		if (!(events & EPOLLOUT)) {
> > 			[...]
> > 			goto reset;
> > 		}
> > 		if (tcp_splice_connect_finish(c, conn))
> > 			goto reset;
> > 	}
> >
> > based on conn->events _before_ setting FIN_RCVD(evsidei).
> >
> > Now, that should never be relevant for SPLICE_CLOSED. I'm not sure about
> > SPLICE_CONNECT, what if we get EPOLLRDHUP right away as we are
> > re-establishing a connection? I need to look into that, but I wasn't
> > able to see any difference in behaviour so far.
>
> Ah, never mind, those checks are all now part of
> tcp_splice_sock_handler() anyway. So, right, there should be no functional
> change in the end.
>
> > What I really missed here is:
> >
> > > > I think it actually makes more sense this way because we update flags
> > > > with everything we know until that point, and it shouldn't have a
> > > > functional effect (the check at the end of the new tcp_splice_forward()
> > > > is on FIN_RCVD(fromsidei)), but I'm raising that in case the change
> > > > wasn't intended.
> > > >
> > > > > +
> > > > > +	if (events & EPOLLOUT) {
> > > > > +		if (tcp_splice_forward(c, conn, !evsidei))
> > > > > +			goto reset;
> > > > > +		conn_event(conn, ~OUT_WAIT(evsidei));
> >
> > 		^^^
> >
> > this swap, which caused https://bugs.passt.top/show_bug.cgi?id=207.
> >
> > Earlier, we had:
> >
> > 	if (events & EPOLLOUT) {
> > 		fromsidei = !evsidei;
> > 		conn_event(conn, ~OUT_WAIT(evsidei));
> > 	} ...
> >
> > and then the rest of what's now tcp_splice_forward().
> >
> > If we clear OUT_WAIT *later*, even if tcp_splice_forward() decides to keep
> > it after processing an EPOLLOUT event, we'll miss events.
> >
> > It turns out that 4ccb2eebaa02 ("tcp_splice: Simplify / correct OUT_WAIT
> > flag handling") fixes this. I'm still checking whether the fix is complete
> > though.
>
> I'm fairly convinced it is, because with that commit we don't care
> anymore whether EPOLLOUT was set to begin with and just set OUT_WAIT as
> obviously needed.
>
> Reading that again just reminded me that I actually spotted this but I
> saw it was going away in 5/6 of this revision of the series... and then
> I forgot to write that down. :(
>
> Somewhat funnily, that commit message says:
>
> --
> We clear the OUT_WAIT flag when we complete forwarding on an EPOLLOUT
> event, but that's not quite right.  Even though it's called on an EPOLLOUT,
> tcp_splice_forward() could, in principle empty the pipe, but also read
> enough new data from the other side to fill it again.  That would set
> OUT_WAIT internally, but it would be cleared after returning meaning
> we could miss a necessary wakeup.
> --
>
> ...well, yes, but that problem didn't exist before. :)

Right.  It looks like the issue was that I made the series in a couple
of sessions / batches.  I spotted the problem in the second batch,
failing to realise that I'd introduced it in the first batch, rather
than it existing before that.

Well, that's embarrassing, but at least it's fixed now.  I do think
the final way of calculating OUT_WAIT is easier to reason about and
more clearly correct.

> Anyway, tagging a new release now.

Ta.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
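
For readers following the thread, the ordering problem discussed above can be
modelled with a toy sketch.  This is not passt code: struct toy_conn,
TOY_OUT_WAIT(), toy_forward() and both toy_epollout_*() helpers are
hypothetical stand-ins, and the pipe_refilled field is just an assumed
shorthand for "forwarding left data stuck in the pipe, so an EPOLLOUT wakeup
is still needed".  It only illustrates why clearing the flag after forwarding
can lose a flag that forwarding itself set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag macro, loosely modelled on OUT_WAIT(sidei) */
#define TOY_OUT_WAIT(sidei)	(1u << (sidei))

/* Toy connection state, not passt's struct tcp_splice_conn */
struct toy_conn {
	uint8_t flags;		/* Bitmap of TOY_OUT_WAIT() flags */
	bool pipe_refilled;	/* Assumed: data remains queued after forwarding */
};

/* Toy forwarding step: if data is still queued for the destination side,
 * record that we need to wait for EPOLLOUT on that side
 */
static int toy_forward(struct toy_conn *conn, unsigned fromsidei)
{
	if (conn->pipe_refilled)
		conn->flags |= (uint8_t)TOY_OUT_WAIT(!fromsidei);

	return 0;
}

/* Ordering before the refactoring: clear the flag, then forward */
static uint8_t toy_epollout_clear_first(struct toy_conn *conn, unsigned evsidei)
{
	conn->flags &= (uint8_t)~TOY_OUT_WAIT(evsidei);
	toy_forward(conn, !evsidei);

	return conn->flags;
}

/* Ordering in the quoted hunk: forward, then clear the flag, which
 * discards whatever toy_forward() just decided
 */
static uint8_t toy_epollout_clear_last(struct toy_conn *conn, unsigned evsidei)
{
	toy_forward(conn, !evsidei);
	conn->flags &= (uint8_t)~TOY_OUT_WAIT(evsidei);

	return conn->flags;
}
```

With pipe_refilled set, the clear-first variant leaves TOY_OUT_WAIT(evsidei)
set on return, while the clear-last variant returns with it cleared, which is
the missed-wakeup situation described in the quoted commit message.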