Date: Fri, 22 May 2026 11:29:59 +1000
From: David Gibson
To: Stefano Brivio
CC: passt-dev@passt.top, Paul Holzinger
Subject: Re: [PATCH 5/6] tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling
References: <20260520130851.436931-1-david@gibson.dropbear.id.au>
 <20260520130851.436931-6-david@gibson.dropbear.id.au>
 <20260520223003.37ceb0f8@elisabeth>
 <20260521074030.0e15b36e@elisabeth>
 <20260521091512.1ede0a84@elisabeth>
 <20260521171811.5dd65c57@elisabeth>
In-Reply-To: <20260521171811.5dd65c57@elisabeth>
List-Id: Development discussion and patches for passt
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha512;
 protocol="application/pgp-signature"; boundary="ei349aWirfoDatak"
Content-Disposition: inline

--ei349aWirfoDatak
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline

On Thu, May 21, 2026 at 05:18:16PM +0200, Stefano Brivio wrote:
> On Thu, 21 May 2026 23:51:04 +1000
> David Gibson wrote:
>
> > On Thu, May 21, 2026 at 09:15:13AM +0200, Stefano Brivio wrote:
> > > On Thu, 21 May 2026 16:56:43 +1000
> > > David Gibson wrote:
> > >
> > > > On Thu, May 21, 2026 at 07:40:31AM +0200, Stefano Brivio wrote:
> > > > > On Thu, 21 May 2026 12:03:33 +1000
> > > > > David Gibson wrote:
> > > > >
> > > > > > On Wed, May 20, 2026 at 10:30:04PM +0200, Stefano Brivio wrote:
> > > > > > > On Wed, 20 May 2026 23:08:50 +1000
> > > > > > > David Gibson wrote:
> > > > > > >
> > > > > > > > There are two ways we can tell one of our sockets has received a FIN. We
> > > > > > > > can either see an EPOLLRDHUP epoll event, or we can get a zero-length read
> > > > > > > > (EOF) on the socket. We currently use both, in a mildly confusing way:
> > > > > > > > we only set the FIN_RCVD() flag based on the EPOLLRDHUP event, but then
> > > > > > > > some other close out logic is based on seeing an EOF.
> > > > > > > >
> > > > > > > > Simplify this by setting the flag based on only the EOF. To make sure we
> > > > > > > > don't miss an event if we get an EPOLLRDHUP with no data, we trigger the
> > > > > > > > forwarding path for EPOLLRDHUP as well as EPOLLIN.
> > > > > > > >
> > > > > > > > Signed-off-by: David Gibson
> > > > > > > > ---
> > > > > > > >  tcp_splice.c | 14 +++++---------
> > > > > > > >  1 file changed, 5 insertions(+), 9 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/tcp_splice.c b/tcp_splice.c
> > > > > > > > index 8fbd490f..b45f0060 100644
> > > > > > > > --- a/tcp_splice.c
> > > > > > > > +++ b/tcp_splice.c
> > > > > > > > @@ -487,7 +487,6 @@ static int tcp_splice_forward(struct ctx *c, struct
> > > > > > > >  	uint8_t lowat_set_flag = RCVLOWAT_SET(fromsidei);
> > > > > > > >  	uint8_t lowat_act_flag = RCVLOWAT_ACT(fromsidei);
> > > > > > > >  	int never_read = 1;
> > > > > > > > -	int eof = 0;
> > > > > > > >
> > > > > > > >  	while (1) {
> > > > > > > >  		ssize_t readlen, written;
> > > > > > > > @@ -510,7 +509,7 @@ retry:
> > > > > > > >  		flow_trace(conn, "%zi from read-side call", readlen);
> > > > > > > >
> > > > > > > >  		if (!readlen) {
> > > > > > > > -			eof = 1;
> > > > > > > > +			conn_event(conn, FIN_RCVD(fromsidei));
> > > > > > >
> > > > > > > I'm not sure if I really found a concrete issue with this, but it looks
> > > > > > > a bit scary, because it changes the semantics of FIN_RCVD, which used to
> > > > > > > mean that we infer we received a FIN, regardless of whether we're done
> > > > > > > processing all data from that half of the connection.
> > > > > > >
> > > > > > > Now FIN_RCVD is only set if we actually processed all the data and we
> > > > > > > hit the end of file.
> > > > > >
> > > > > > True. But the only place that tested FIN_RCVD was at the end of
> > > > > > tcp_splice_forward(), conditional on 'eof' anyway.
> > > > > > In a sense, this
> > > > > > was the cause of bug202 - we had FIN_RCVD set, but we didn't process
> > > > > > it and shutdown() on the other side, because we didn't have eof.
> > > > >
> > > > > That sounds like a good motivation to clean this up, just two concerns
> > > > > below:
> > > > >
> > > > > > > The (potential) issue I see here is that we get EPOLLRDHUP, splice()
> > > > > > > returns -1 with EAGAIN in errno because we had no room in the pipe,
> > > > > > > and it would have returned 0 instead.
> > > > > > >
> > > > > > > Will we ever get our zero-sized "read" later? If not, we might have
> > > > > > > missed EPOLLRDHUP *and* the end of file. I'm not entirely sure we have
> > > > > > > guarantees in that sense from splice().
> > > > > >
> > > > > > It's not really about guarantees from splice. I'm pretty sure this is
> > > > > > ok, reasoning as follows.
> > > > > >
> > > > > > Consider all the exit points from the loop body:
> > > > > >  - Each return is a return -1, so we kill the connection anyway. They
> > > > > >    don't matter
> > > > > >  - Each continue, goto retry and the end of the body will do the read
> > > > > >    side splice() again, so get another chance to see the EOF
> > > > > >  - That leaves just the breaks
> > > > > >
> > > > > > Consider each break (there are three, since patch 2 of this series)
> > > > > > 	if (written < 0) {
> > > > > > 		if (!conn->pending[fromsidei])
> > > > > > 			break;
> > > > > >
> > > > > > (1) The pipe is empty and the write-splice returned EAGAIN, so it
> > > > > > didn't remove data from the pipe.
> > > > >
> > > > > You're assuming that !conn->pending[fromsidei] means that the pipe is
> > > > > empty. From what we see of it, it is.
> > > >
> > > > It does mean the pipe is empty. Everything we put in, we've taken
> > > > out. There cannot be anything in there.
> > > >
> > > > > What the kernel can do with it, though, is different.
> > > > > It might return
> > > > > EAGAIN even if we think we should have space, because it's resizing it
> > > > > under memory pressure or anything like that. Or it delays freeing up
> > > > > space or accounting for whatever reason.
> > > >
> > > > Theoretically, I suppose. But !pending doesn't just mean the pipe is
> > > > not full, it means it's completely empty. Not being able to
> > > > put any bytes at all into an empty pipe would be *very* surprising.
> > > > So much so that if it happened in practice, I suspect we wouldn't be
> > > > safe not having epoll events on the pipe ends, so that we can be
> > > > notified when it deigns to accept some data.
> > >
> > > We can get 512-byte pipes, I actually saw that happening in practice
> > > with either:
> >
> > Sure.. so? We can still put some bytes into it if it's empty.
>
> The difference between empty and full is pretty small in that case.

Small, but not nothing. The empty pipe can still accept *some* bytes -
otherwise the pipe is useless. As long as it accepts any bytes at all,
the reasoning above works.

> > > - people setting low values for ulimits
> > >
> > > - the user (or just pasta itself) having a lot of pipes open
> > >
> > > and if I recall correctly that's where I saw the case of a supposedly
> > > empty pipe giving us EAGAIN. That was years ago though and I didn't
> > > specifically fix that.
> >
> > I mean.. that sounds like a kernel bug.
>
> fcntl(2) says, for F_SETPIPE_SZ:
>
>     Note that because of the way the pages of the pipe buffer are
>     employed when data is written to the pipe, the number of bytes that
>     can be written may be less than the nominal size, depending on
>     the size of the writes.
>
> ...which I kind of understand really.

Ok, but still if an empty pipe can't accept *any* bytes, it's useless,
which would make that a kernel bug.
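
The claim that an empty pipe accepts at least some bytes is easy to
probe in isolation. The following is only a minimal sketch, assuming
Linux: empty_pipe_accepts_byte() is a hypothetical helper, not part of
passt, and the fallback value for F_SETPIPE_SZ is the Linux constant,
used only if the libc headers don't expose it. It checks a freshly
created, non-blocking pipe, so it says nothing about pipes under memory
pressure or per-user pipe limits.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef F_SETPIPE_SZ
#define F_SETPIPE_SZ 1031	/* Linux value, in case libc hides it */
#endif

/* Hypothetical helper, not from passt: returns 1 if a freshly created
 * (hence empty) non-blocking pipe accepts a single byte, 0 if the
 * write fails with EAGAIN, and -1 on any other error.
 */
int empty_pipe_accepts_byte(int requested_size)
{
	int pipefd[2];
	char byte = 'x';
	ssize_t n;
	int ret;

	if (pipe(pipefd) < 0)
		return -1;

	/* Non-blocking write end, so a full pipe reports EAGAIN */
	if (fcntl(pipefd[1], F_SETFL, O_NONBLOCK) < 0) {
		close(pipefd[0]);
		close(pipefd[1]);
		return -1;
	}

	/* Best effort: the kernel may round the size up or refuse it */
	(void)fcntl(pipefd[1], F_SETPIPE_SZ, requested_size);

	n = write(pipefd[1], &byte, 1);
	if (n == 1)
		ret = 1;
	else if (n < 0 && errno == EAGAIN)
		ret = 0;
	else
		ret = -1;

	close(pipefd[0]);
	close(pipefd[1]);
	return ret;
}
```

A return of 0 here would be the "empty pipe gives EAGAIN" case
discussed above. Reproducing the small-pipe reports would additionally
need many open pipes or low limits, which this sketch does not attempt.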

> > If we do have to handle that
> > case we'll need epoll events on the pipe ends, since none of the
> > socket events we monitor will trigger when the pipe becomes writable.
>
> Well, EPOLLOUT should do it.

But EPOLLOUT on the pipe end itself, not just on the write side socket
as we have now.

> > > We currently probe the size based on the value we can have for 32 pipes
> > > (TCP_SPLICE_PIPE_POOL_SIZE). By making that 4096 or so you should get
> > > rather small pipes.
> > >
> > > Things might already be broken with them, I haven't checked the
> > > behaviour in a long while. I think 512 bytes was the lower bound I hit.
> > >
> > > > > So it would be nice to make this part robust to that. I thought setting
> > > > > FIN_RCVD on EPOLLRDHUP was a good way to achieve that.
> > > > >
> > > > > > Therefore, the pipe must have been
> > > > > > empty before the write-splice. Which means the read-splice can't have
> > > > > > blocked on a full pipe.
> > > > > > 		conn_event(conn, OUT_WAIT(!fromsidei));
> > > > > > 		break;
> > > > > > 	}
> > > > > >
> > > > > > (2) The pipe is non-empty and the write-splice returned EAGAIN, so it
> > > > > > must have blocked on the output socket. We've set OUT_WAIT(), so
> > > > > > we'll get an EPOLLOUT at some point which will cause us to read-splice
> > > > > > again, meaning we get another chance to see the EOF.
> > > > >
> > > > > ...later. But what if we don't get a zero-sized read *at all*? I'm not
> > > > > sure if splice() guarantees we do get one if we reach end-of-file.
> > > >
> > > > > That's something valid and very well established for read() and recv(),
> > > > > but splice() is a bit weird. The documentation says:
> > > > >
> > > > >   A return value of 0 means end of input.
> > > > >
> > > > > but I wouldn't assume we'll *always* get at least one in case of EOF.
> > > >
> > > > What else could we plausibly get?
> > >
> > > -1 with EBADF, probably with EPOLLERR, because something timed out?
> >
> > EBADF makes no sense, the fds are still valid, even if they're at EOF.
>
> I was thinking we hit EOF, don't notice right away, but after seconds /
> minutes and the socket is already closed.

The only place we close the socket is in the flow close path, at which
point we've already decided it's over so the whole question is moot.

> > > But I guess you're right, as long as we're not in the EPOLLERR category
> > > of things, we should consistently get 0, even if we read multiple
> > > times.
> > >
> > > I had in mind some kernel behaviour where we get 0 once, and then -1
> > > (EAGAIN?) because... go figure. But no, it can't happen.
> >
> > I think the logic should be ok as long as we see a 0 once, even if we
> > get EAGAINs after that.
> >
> > Another way to look at this - if we had to monitor EPOLLRDHUP to get
> > this right, splice() would be unusable from blocking / synchronous
> > code, which I don't think is the case.
>
> Right, yes, I'm fairly convinced at this point.

Ok :).

> > > > > > [...]
> > > > > > 	if (conn->events & FIN_RCVD(fromsidei))
> > > > > > 		break;
> > > > > > (3) By the new semantics of FIN_RCVD, we *have* seen the EOF.
> > > > > >
> > > > > > > The existing implementation distinguishes between end-of-file we hit in
> > > > > > > a given iteration, and EPOLLRDHUP we might have seen at any time.
> > > > > > > That was actually intended.
> > > > > >
> > > > > > It might be intended, but I can't see that we did anything with that
> > > > > > information.
> > > > >
> > > > > We always set FIN_RCVD on it. You're right, if we only checked that on
> > > > > 'eof', that didn't solve much, but that wasn't necessarily intended. My
> > > > > original intention was to make setting of FIN_RCVD (or whatever it was
> > > > > originally) robust.
> > > >
> > > > Ok, well.
> > > > I've spotted other changes to make in the vicinity that I
> > > > think will make some of this easier to reason about anyway. So I'll
> > > > consider your points as I rework this and other patches.
> > > >
> > > > > > That said the conditions on which we exit / retry this loop are pretty
> > > > > > darn confusing. I'll see if I can improve them.
>
> --
> Stefano
>

--
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

--ei349aWirfoDatak
Content-Type: application/pgp-signature; name=signature.asc

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCgAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmoPsZYACgkQzQJF27ox
2GdsNxAAkl4pWZECPsIvtI1t2hsIsFVDxxYHTmz2aHwQT0BkFF2laFLsS4IRRjrA
eSlFsk6ImT3TXAg3vnl79ZhL9CV6eU71uYUhLI5dImKM7dFwGrdbNRBgSMzNxgBA
hpL8KbjENDYqlUNwevjf/6dNltHN9x2el0DqB20KjoFgM/yFM5T8v29QdId8HS5f
usjTWktRjLRo5txQOQCxzpCuIvIakjsBE/habXd26s+P4ibQnEaxxKXTdzcb2C7D
wKXwQCxvYbrFjzUvQVkVQwV0OXslr9wNbXDktsnrvBy4l9eimxI3cpiJANvGWUX9
4MiZ2hIZfyih+vh7qVtEwOxdj11v6sJprTgAUuAlg4ZoB3chpyxi0QDcmkpYxcjF
8MdMIO9093J5j1CR/J1s633uAlYW9Lh0LPfvYyjaQXMWuTxD/9YFoq3XiXP8QRrI
b28q94WvtxCFF4Fh6MawahOt9WoLO34WVBOH4iu3lfh4LDpD5WN4zCSQImhvJjGs
PGYNkcBQAG4Se0d1xWR2Qn24ohIawdt4jOhh8OjKwd8O7bst5RbtiNXABeQ17qaR
TUnuO6hcTyPy8Y9Nl03x98iSsaR5nhM4W9k3QXE9HH3dxppVrv3/OIsUmzOTNQHS
wgbQkwhpfVPTRfj0L+ts6hkLDXBYQIOXxD+0A0ngqsoAU7LgOxc=
=fsaE
-----END PGP SIGNATURE-----

--ei349aWirfoDatak--