From mboxrd@z Thu Jan  1 00:00:00 1970
Authentication-Results: passt.top; dmarc=pass (p=quarantine dis=none) header.from=redhat.com
Authentication-Results: passt.top;
	dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=cAzQ4LH4;
	dkim-atps=neutral
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	by passt.top (Postfix) with ESMTPS id 98A105A0265
	for <passt-dev@passt.top>; Thu, 21 May 2026 17:18:31 +0200 (CEST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1779376710;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=22/maI+6JgDVRFfc2xK7CgAKiZrFxx272Y4ifYXotTs=;
	b=cAzQ4LH4X3t4hiK29ZQnydfynprfKTJkHYjePyl8LGI07Bxrrc0WFSrO42ZhpQRAjN+vpU
	py0RlGAKBUIbAg99iPdmyydguSt54bEq14GAqk7JNA8VFm/hk5yuq3XQZ90U+UY3jXGROn
	H7yq1JBjaOEyVtU+4lCtVgs8RMgblsQ=
Received: from mail-wr1-f71.google.com (mail-wr1-f71.google.com
 [209.85.221.71]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id
 us-mta-68-8r42mRHcOue1c8nvTXvsZA-1; Thu, 21 May 2026 11:18:20 -0400
X-MC-Unique: 8r42mRHcOue1c8nvTXvsZA-1
X-Mimecast-MFC-AGG-ID: 8r42mRHcOue1c8nvTXvsZA_1779376699
Received: by mail-wr1-f71.google.com with SMTP id ffacd0b85a97d-45aeac88af4so5270737f8f.3
        for <passt-dev@passt.top>; Thu, 21 May 2026 08:18:20 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20251104; t=1779376699; x=1779981499;
        h=date:content-transfer-encoding:mime-version:organization:references
         :in-reply-to:message-id:subject:cc:to:from:x-gm-gg
         :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
        bh=O8gxFWRQ0vepJW7b1z6FJZ2u6WY7xsXM331p51Xl2Ac=;
        b=igIsVrSeMnx8aqZ9Ipt4FHe4itxVHaM2gfieJeh1mDxiGaskEONYG85Kshg+QyaA3c
         3xgi79lXk0hLc+LZlDqCy/eAnl9uZDgFSeJxhl2WYwsAxbjcANsIUZUDszw3dIdcwqma
         IM8kIjc+AgoKjOlZaarlZ2YOvmGsf6gWA5QbHpHE/z9wMRZwKa6sOfXM6VNMTMaKe1az
         PYg7RIFEhkIyD2ifnGF7Va0ydAB2KMUL4VPSA8alfo1x/SWwiOGUmg9aMXWOpd/vk4D4
         IEJyfHpinL0PgnGPOetE90g1Y4/nYa6xqT+zj4186SQk5VPhbBg7vjvfmwVhWqivaSuQ
         Wqsg==
X-Gm-Message-State: AOJu0YwY3rCXSNJq4DIvrWdYqU70mSwTac75SlbmpMGwwa4Q+SBI5WgB
	7jmNuJZusLi+gePxdDoxwk32PWza6jcL95ZZWsgwt1nmm4aF4C258atBExbrRHCKHSdgpipbsIR
	Ygt29lmhz0Q7irkpthLzgS5+XZn2lzniqiGK5nNjbN6Vzk/yrN0lc6A==
X-Gm-Gg: Acq92OF3jqSngZOKfrKjXJm1rhzzKVQcyeqHiJUOMO+FmSwRjilh5vJlZBWQSGwdF4K
	/lwPPpAMv7hMJqhvnIMWipdSAHZwCDr6RjTgXE8ETwj1WlRx4zfKcL188m8GWtSsxr0f0U3ukyq
	H6tr2v3Gn9O/4KF3WUO8wVZsYO2RKK06uAouZrEiQig+YlGGtsb5VB+JXQHQZhTOydKHVgCe1n7
	pZi4b955sLq5ObJXLz/PrA7HnErrgWOKodsfve1QXRnVCUucGAgjy9gUHvEC6dKbhnHzYy8bq/i
	HihyYgapo+CyeuvYjC1e7g4lV5/PJmPVMGzujmBryzrNzKGbkythKAQ9fPtu3axogLw16eicdcA
	Rzam2g2TPxcqmHWjewWQTn6NOrHO1pzX+
X-Received: by 2002:a05:600c:4d82:b0:48f:e230:8caa with SMTP id 5b1f17b1804b1-490360ceaf7mr29986455e9.30.1779376698694;
        Thu, 21 May 2026 08:18:18 -0700 (PDT)
X-Received: by 2002:a05:600c:4d82:b0:48f:e230:8caa with SMTP id 5b1f17b1804b1-490360ceaf7mr29986175e9.30.1779376698029;
        Thu, 21 May 2026 08:18:18 -0700 (PDT)
Received: from maya.myfinge.rs (ifcgrfdd.trafficplex.cloud. [2a10:fc81:a806:d6a9::1])
        by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-49033d44d7dsm80033225e9.5.2026.05.21.08.18.17
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Thu, 21 May 2026 08:18:17 -0700 (PDT)
From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Subject: Re: [PATCH 5/6] tcp_splice: Simplify EPOLLRDHUP / eof / FIN
 handling
Message-ID: <20260521171811.5dd65c57@elisabeth>
In-Reply-To: <ag8NyD1H3ys2VhO6@zatzit>
References: <20260520130851.436931-1-david@gibson.dropbear.id.au>
	<20260520130851.436931-6-david@gibson.dropbear.id.au>
	<20260520223003.37ceb0f8@elisabeth>
	<ag5n9R-HaoQT4Yey@zatzit>
	<20260521074030.0e15b36e@elisabeth>
	<ag6sq2Wd6R_oxhXp@zatzit>
	<20260521091512.1ede0a84@elisabeth>
	<ag8NyD1H3ys2VhO6@zatzit>
Organization: Red Hat
X-Mailer: Claws Mail 4.2.0 (GTK 3.24.49; x86_64-pc-linux-gnu)
MIME-Version: 1.0
Date: Thu, 21 May 2026 17:18:16 +0200 (CEST)
X-Mimecast-Spam-Score: 0
X-Mimecast-MFC-PROC-ID: nEJjDtr0WrhQCZJ4iPwkeGohDrARo9IhXeTfhtWmTSM_1779376699
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Message-ID-Hash: PSOYNHEPKNAOTAQUWYFPXMVZRXDFVQEX
X-Message-ID-Hash: PSOYNHEPKNAOTAQUWYFPXMVZRXDFVQEX
X-MailFrom: sbrivio@redhat.com
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header
CC: passt-dev@passt.top, Paul Holzinger <pholzing@redhat.com>
X-Mailman-Version: 3.3.8
Precedence: list
List-Id: Development discussion and patches for passt <passt-dev.passt.top>
Archived-At: <https://archives.passt.top/passt-dev/20260521171811.5dd65c57@elisabeth/>
Archived-At: <https://passt.top/hyperkitty/list/passt-dev@passt.top/message/PSOYNHEPKNAOTAQUWYFPXMVZRXDFVQEX/>
List-Archive: <https://archives.passt.top/passt-dev/>
List-Archive: <https://passt.top/hyperkitty/list/passt-dev@passt.top/>
List-Help: <mailto:passt-dev-request@passt.top?subject=help>
List-Owner: <mailto:passt-dev-owner@passt.top>
List-Post: <mailto:passt-dev@passt.top>
List-Subscribe: <mailto:passt-dev-join@passt.top>
List-Unsubscribe: <mailto:passt-dev-leave@passt.top>

On Thu, 21 May 2026 23:51:04 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, May 21, 2026 at 09:15:13AM +0200, Stefano Brivio wrote:
> > On Thu, 21 May 2026 16:56:43 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >  =20
> > > On Thu, May 21, 2026 at 07:40:31AM +0200, Stefano Brivio wrote: =20
> > > > On Thu, 21 May 2026 12:03:33 +1000
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >    =20
> > > > > On Wed, May 20, 2026 at 10:30:04PM +0200, Stefano Brivio wrote:  =
 =20
> > > > > > On Wed, 20 May 2026 23:08:50 +1000
> > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > >      =20
> > > > > > > There are two ways we can tell one of our sockets has receive=
d a FIN.  We
> > > > > > > can either see an EPOLLRDHUP epoll event, or we can get a zer=
o-length read
> > > > > > > (EOF) on the socket.  We currently use both, in a mildly conf=
using way:
> > > > > > > we only set the FIN_RCVD() flag based on the EPOLLRDHUP event=
, but then
> > > > > > > some other close out logic is based on seeing an EOF.
> > > > > > >=20
> > > > > > > Simplify this by setting the flag based on only the EOF.  To =
make sure we
> > > > > > > don't miss an event if we get an EPOLLRDHUP with no data, we =
trigger the
> > > > > > > forwarding path for EPOLLRDHUP as well as EPOLLIN.
> > > > > > >=20
> > > > > > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > > > > > ---
> > > > > > >  tcp_splice.c | 14 +++++---------
> > > > > > >  1 file changed, 5 insertions(+), 9 deletions(-)
> > > > > > >=20
> > > > > > > diff --git a/tcp_splice.c b/tcp_splice.c
> > > > > > > index 8fbd490f..b45f0060 100644
> > > > > > > --- a/tcp_splice.c
> > > > > > > +++ b/tcp_splice.c
> > > > > > > @@ -487,7 +487,6 @@ static int tcp_splice_forward(struct ctx =
*c, struct
> > > > > > >  =09uint8_t lowat_set_flag =3D RCVLOWAT_SET(fromsidei);
> > > > > > >  =09uint8_t lowat_act_flag =3D RCVLOWAT_ACT(fromsidei);
> > > > > > >  =09int never_read =3D 1;
> > > > > > > -=09int eof =3D 0;
> > > > > > > =20
> > > > > > >  =09while (1) {
> > > > > > >  =09=09ssize_t readlen, written;
> > > > > > > @@ -510,7 +509,7 @@ retry:
> > > > > > >  =09=09flow_trace(conn, "%zi from read-side call", readlen);
> > > > > > > =20
> > > > > > >  =09=09if (!readlen) {
> > > > > > > -=09=09=09eof =3D 1;
> > > > > > > +=09=09=09conn_event(conn, FIN_RCVD(fromsidei));     =20
> > > > > >=20
> > > > > > I'm not sure if I really found a concrete issue with this, but =
it looks
> > > > > > a bit scary, because it changes the semantics of FIN_RCVD, whic=
h used to
> > > > > > mean that we infer we received a FIN, regardless of whether we'=
re done
> > > > > > processing all data from that half of the connection.
> > > > > >=20
> > > > > > Now FIN_RCVD is only set if we actually processed all the data =
and we
> > > > > > hit the end of file.     =20
> > > > >=20
> > > > > True.  But the only place that tested FIN_RCVD was at the end of
> > > > > tcp_splice_forward(), conditional on 'eof' anyway.  In a sense, t=
his
> > > > > was the cause of bug202 - we had FIN_RCVD set, but we didn't proc=
ess
> > > > > it and shutdown() on the other side, because we didn't have eof. =
  =20
> > > >=20
> > > > That sounds like a good motivation to clean this up, just two conce=
rns
> > > > below:
> > > >    =20
> > > > > > The (potential) issue I see here is that we get EPOLLRDHUP, spl=
ice()
> > > > > > returns -1 with EAGAIN in errno because we had no room in the p=
ipe,
> > > > > > and it would have returned 0 instead.
> > > > > >=20
> > > > > > Will we ever get our zero-sized "read" later? If not, we might =
have
> > > > > > missed EPOLLRDHUP *and* the end of file. I'm not entirely sure =
we have
> > > > > > guarantees in that sense from splice().     =20
> > > > >=20
> > > > > It's not really about guarantees from splice.  I'm pretty sure th=
is is
> > > > > ok, reasoning as follows.
> > > > >=20
> > > > > Consider all the exit points from the loop body:
> > > > >  - Each return is a return -1, so we kill the connection anyway. =
 They
> > > > >    don't matter
> > > > >  - Each continue, goto retry and the end of the body will do the =
read
> > > > >    side splice() again, so get another chance to see the EOF
> > > > >  - That leaves just the breaks
> > > > >=20
> > > > > Consider each break (there are three, since patch 2 of this serie=
s)
> > > > > =09=09if (written < 0) {
> > > > > =09=09=09if (!conn->pending[fromsidei])
> > > > > =09=09=09=09break;
> > > > >=20
> > > > > (1) The pipe is empty and the write-splice returned EAGAIN, so it
> > > > > didn't remove data from the pipe.   =20
> > > >=20
> > > > You're assuming that !conn->pending[fromsidei] means that the pipe =
is
> > > > empty. From what we see of it, it is.   =20
> > >=20
> > > It does mean the pipe is empty.  Everything we put in, we've taken
> > > out.  There cannot be anything in there.
> > >  =20
> > > > What the kernel can do with it, though, is different. It might retu=
rn
> > > > EAGAIN even if we think we should have space, because it's resizing=
 it
> > > > under memory pressure or anything like that. Or it delays freeing u=
p
> > > > space or accounting for whatever reason.   =20
> > >=20
> > > Theoretically, I suppose.  But !pending doesn't just mean the pipe is
> > > not full it means it's completely completely empty.  Not being able t=
o
> > > put any bytes at all into an empty pipe would be *very* surprising.
> > > So much so that if it happened in practice, I suspect we wouldn't be
> > > safe not having epoll events on the pipe ends, so that we can be
> > > notified when it deigns to accept some data. =20
> >=20
> > We can get 512-byte pipes, I actually saw that happening in practice
> > with either: =20
>=20
> Sure.. so?  We can still put some bytes into it if it's empty.

The difference between empty and full is pretty small in that case.

> > - people setting low values for ulimits
> >=20
> > - the user (or just pasta itself) having a lot of pipes open
> >=20
> > and if I recall correctly that's where I saw the case of a supposedly
> > empty pipe giving us EAGAIN. That was years ago though and I didn't
> > specifically fix that. =20
>=20
> I mean.. that sounds like a kernel bug.

fcntl(2) says, for F_SETPIPE_SZ:

              Note that because of the way the pages of the pipe buffer are=
 em=E2=80=90
              ployed when data is written to the pipe, the number of bytes =
that
              can be written may be less than the nominal  size,  depending=
  on
              the size of the writes.

...which I kind of understand really.

> If we do have to handle that
> case we'll need epoll events on the pipe ends, since none of the
> socket events we monitor will trigger when the pipe becomes writable.

Well, EPOLLOUT should do it.

> > We currently probe the size based on the value we can have for 32 pipes
> > (TCP_SPLICE_PIPE_POOL_SIZE). By making that 4096 or so you should get
> > rather small pipes.
> >=20
> > Things might already be broken with them, I haven't checked the
> > behaviour in a long while. I think 512 bytes was the lower bound I hit.
> >  =20
> > > > So it would be nice to make this part robust to that. I thought set=
ting
> > > > FIN_RCVD on EPOLLRDHUP was a good way to achieve that.
> > > >    =20
> > > > >  Therefore, the pipe must have been
> > > > > empty before the write-splice.  Which means the read-splice can't=
 have
> > > > > blocked on a full pipe.
> > > > > =09=09=09conn_event(conn, OUT_WAIT(!fromsidei));
> > > > > =09=09=09break;
> > > > > =09=09}
> > > > >=20
> > > > > (2) The pipe is non-empty and the write-splice returned EAGAIN, s=
o it
> > > > > must have blocked on the output socket.  We've set OUT_WAIT(), so
> > > > > we'll get an EPOLLOUT at some point which will cause us to read-s=
plice
> > > > > again, meaning we get another chance to see the EOF.   =20
> > > >=20
> > > > ...later. But what if we don't get a zero-sized read *at all*? I'm =
not
> > > > sure if splice() guarantees we do get one if we reach end-of-file. =
  =20
> > >  =20
> > > > That's something valid and very well established for read() and rec=
v(),
> > > > but splice() is a bit weird. The documentation says:
> > > >=20
> > > >   A return value of 0 means end of input.
> > > >=20
> > > > but I wouldn't assume we'll *always* get at least one in case of EO=
F.   =20
> > >=20
> > > What else could we plausibly get? =20
> >=20
> > -1 with EBADF, probably with EPOLLERR, because something timed out? =20
>=20
> EBADF makes no sense, the fds are still valid, even if they're at EOF.

I was thinking we hit EOF, don't notice right away, but after seconds /
minutes and the socket is already closed.

> > But I guess you're right, as long as we're not in the EPOLLERR category
> > of things, we should consistently get 0, even if we read multiple
> > times.
> >=20
> > I had in mind some kernel behaviour where we get 0 once, and then -1
> > (EAGAIN?) because... go figure. But no, it can't happen. =20
>=20
> I think the logic should be ok as long as we see a 0 once, even if we
> get EAGAINs after that.
>=20
> Another way to look at this - if we had to monitor EPOLLRDHUP to get
> this right, splice() would be unusable from blocking / synchronous
> code, which I don't think is the case.

Right, yes, I'm fairly convinced at this point.

> > > > > =09=09[...]
> > > > > =09=09if (conn->events & FIN_RCVD(fromsidei))
> > > > > =09=09=09break;
> > > > > (3) By the new semantics of FIN_RCVD, we *have* seen the EOF.
> > > > >    =20
> > > > > > The existing implementation distinguishes between end-of-file w=
e hit in
> > > > > > a given iteration, and EPOLLRDHUP we might have seen at any tim=
e.
> > > > > > That was actually intended.     =20
> > > > >=20
> > > > > It might be intended, but I can't see that we did anything with t=
hat
> > > > > information.   =20
> > > >=20
> > > > We always set FIN_RCVD on it. You're right, if we only checked that=
 on
> > > > 'eof', that didn't solve much, but that wasn't necessarily intended=
. My
> > > > original intention was to make setting of FIN_RCVD (or whatever it =
was
> > > > originally) robust.   =20
> > >=20
> > > Ok, well.  I've spotted other changes to make in the vicinity that I
> > > think will make some of this easier to reason about anyway.  So I'll
> > > consider your points as I rework this and other patches.
> > >  =20
> > > > > That said the conditions on which we exit / retry this loop are p=
retty
> > > > > darn confusing.  I'll see if I can improve them.   =20

--=20
Stefano