From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top, Paul Holzinger <pholzing@redhat.com>
Subject: Re: [PATCH 5/6] tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling
Date: Thu, 21 May 2026 09:15:13 +0200 (CEST) [thread overview]
Message-ID: <20260521091512.1ede0a84@elisabeth> (raw)
In-Reply-To: <ag6sq2Wd6R_oxhXp@zatzit>
On Thu, 21 May 2026 16:56:43 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Thu, May 21, 2026 at 07:40:31AM +0200, Stefano Brivio wrote:
> > On Thu, 21 May 2026 12:03:33 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > > On Wed, May 20, 2026 at 10:30:04PM +0200, Stefano Brivio wrote:
> > > > On Wed, 20 May 2026 23:08:50 +1000
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >
> > > > > There are two ways we can tell one of our sockets has received a FIN. We
> > > > > can either see an EPOLLRDHUP epoll event, or we can get a zero-length read
> > > > > (EOF) on the socket. We currently use both, in a mildly confusing way:
> > > > > we only set the FIN_RCVD() flag based on the EPOLLRDHUP event, but then
> > > > > some other close out logic is based on seeing an EOF.
> > > > >
> > > > > Simplify this by setting the flag based on only the EOF. To make sure we
> > > > > don't miss an event if we get an EPOLLRDHUP with no data, we trigger the
> > > > > forwarding path for EPOLLRDHUP as well as EPOLLIN.
> > > > >
> > > > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > > > ---
> > > > > tcp_splice.c | 14 +++++---------
> > > > > 1 file changed, 5 insertions(+), 9 deletions(-)
> > > > >
> > > > > diff --git a/tcp_splice.c b/tcp_splice.c
> > > > > index 8fbd490f..b45f0060 100644
> > > > > --- a/tcp_splice.c
> > > > > +++ b/tcp_splice.c
> > > > > @@ -487,7 +487,6 @@ static int tcp_splice_forward(struct ctx *c, struct
> > > > > uint8_t lowat_set_flag = RCVLOWAT_SET(fromsidei);
> > > > > uint8_t lowat_act_flag = RCVLOWAT_ACT(fromsidei);
> > > > > int never_read = 1;
> > > > > - int eof = 0;
> > > > >
> > > > > while (1) {
> > > > > ssize_t readlen, written;
> > > > > @@ -510,7 +509,7 @@ retry:
> > > > > flow_trace(conn, "%zi from read-side call", readlen);
> > > > >
> > > > > if (!readlen) {
> > > > > - eof = 1;
> > > > > + conn_event(conn, FIN_RCVD(fromsidei));
> > > >
> > > > I'm not sure if I really found a concrete issue with this, but it looks
> > > > a bit scary, because it changes the semantics of FIN_RCVD, which used to
> > > > mean that we infer we received a FIN, regardless of whether we're done
> > > > processing all data from that half of the connection.
> > > >
> > > > Now FIN_RCVD is only set if we actually processed all the data and we
> > > > hit the end of file.
> > >
> > > True. But the only place that tested FIN_RCVD was at the end of
> > > tcp_splice_forward(), conditional on 'eof' anyway. In a sense, this
> > > was the cause of bug202 - we had FIN_RCVD set, but we didn't process
> > > it and shutdown() on the other side, because we didn't have eof.
> >
> > That sounds like a good motivation to clean this up, just two concerns
> > below:
> >
> > > > The (potential) issue I see here is that we get EPOLLRDHUP, splice()
> > > > returns -1 with EAGAIN in errno because we had no room in the pipe,
> > > > and it would have returned 0 instead.
> > > >
> > > > Will we ever get our zero-sized "read" later? If not, we might have
> > > > missed EPOLLRDHUP *and* the end of file. I'm not entirely sure we have
> > > > guarantees in that sense from splice().
> > >
> > > It's not really about guarantees from splice. I'm pretty sure this is
> > > ok, reasoning as follows.
> > >
> > > Consider all the exit points from the loop body:
> > > - Each return is a return -1, so we kill the connection anyway. They
> > > don't matter
> > > - Each continue, goto retry and the end of the body will do the read
> > > side splice() again, so get another chance to see the EOF
> > > - That leaves just the breaks
> > >
> > > Consider each break (there are three, since patch 2 of this series)
> > > if (written < 0) {
> > > if (!conn->pending[fromsidei])
> > > break;
> > >
> > > (1) The pipe is empty and the write-splice returned EAGAIN, so it
> > > didn't remove data from the pipe.
> >
> > You're assuming that !conn->pending[fromsidei] means that the pipe is
> > empty. From what we see of it, it is.
>
> It does mean the pipe is empty. Everything we put in, we've taken
> out. There cannot be anything in there.
>
> > What the kernel can do with it, though, is different. It might return
> > EAGAIN even if we think we should have space, because it's resizing it
> > under memory pressure or anything like that. Or it delays freeing up
> > space or accounting for whatever reason.
>
> Theoretically, I suppose. But !pending doesn't just mean the pipe is
> not full it means it's completely completely empty. Not being able to
> put any bytes at all into an empty pipe would be *very* surprising.
> So much so that if it happened in practice, I suspect we wouldn't be
> safe not having epoll events on the pipe ends, so that we can be
> notified when it deigns to accept some data.
We can get 512-byte pipes, I actually saw that happening in practice
with either:
- people setting low values for ulimits
- the user (or just pasta itself) having a lot of pipes open
and if I recall correctly that's where I saw the case of a supposedly
empty pipe giving us EAGAIN. That was years ago though and I didn't
specifically fix that.
We currently probe the size based on the value we can have for 32 pipes
(TCP_SPLICE_PIPE_POOL_SIZE). By making that 4096 or so you should get
rather small pipes.
Things might already be broken with them, I haven't checked the
behaviour in a long while. I think 512 bytes was the lower bound I hit.
> > So it would be nice to make this part robust to that. I thought setting
> > FIN_RCVD on EPOLLRDHUP was a good way to achieve that.
> >
> > > Therefore, the pipe must have been
> > > empty before the write-splice. Which means the read-splice can't have
> > > blocked on a full pipe.
> > > conn_event(conn, OUT_WAIT(!fromsidei));
> > > break;
> > > }
> > >
> > > (2) The pipe is non-empty and the write-splice returned EAGAIN, so it
> > > must have blocked on the output socket. We've set OUT_WAIT(), so
> > > we'll get an EPOLLOUT at some point which will cause us to read-splice
> > > again, meaning we get another chance to see the EOF.
> >
> > ...later. But what if we don't get a zero-sized read *at all*? I'm not
> > sure if splice() guarantees we do get one if we reach end-of-file.
>
> > That's something valid and very well established for read() and recv(),
> > but splice() is a bit weird. The documentation says:
> >
> > A return value of 0 means end of input.
> >
> > but I wouldn't assume we'll *always* get at least one in case of EOF.
>
> What else could we plausibly get?
-1 with EBADF, probably with EPOLLERR, because something timed out?
But I guess you're right, as long as we're not in the EPOLLERR category
of things, we should consistently get 0, even if we read multiple
times.
I had in mind some kernel behaviour where we get 0 once, and then -1
(EAGAIN?) because... go figure. But no, it can't happen.
> > > [...]
> > > if (conn->events & FIN_RCVD(fromsidei))
> > > break;
> > > (3) By the new semantics of FIN_RCVD, we *have* seen the EOF.
> > >
> > > > The existing implementation distinguishes between end-of-file we hit in
> > > > a given iteration, and EPOLLRDHUP we might have seen at any time.
> > > > That was actually intended.
> > >
> > > It might be intended, but I can't see that we did anything with that
> > > information.
> >
> > We always set FIN_RCVD on it. You're right, if we only checked that on
> > 'eof', that didn't solve much, but that wasn't necessarily intended. My
> > original intention was to make setting of FIN_RCVD (or whatever it was
> > originally) robust.
>
> Ok, well. I've spotted other changes to make in the vicinity that I
> think will make some of this easier to reason about anyway. So I'll
> consider your points as I rework this and other patches.
>
> > > That said the conditions on which we exit / retry this loop are pretty
> > > darn confusing. I'll see if I can improve them.
> >
> > --
> > Stefano
--
Stefano
next prev parent reply other threads:[~2026-05-21 7:15 UTC|newest]
Thread overview: 27+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-20 13:08 [PATCH 0/6] Fix race condition while closing spliced connections David Gibson
2026-05-20 13:08 ` [PATCH 1/6] tcp_splice: Improve error reporting David Gibson
2026-05-20 14:31 ` Stefano Brivio
2026-05-21 0:43 ` David Gibson
2026-05-21 5:08 ` Stefano Brivio
2026-05-20 13:08 ` [PATCH 2/6] tcp_splice: Avoid missing EOF recognition while forwarding David Gibson
2026-05-20 20:28 ` Stefano Brivio
2026-05-21 0:46 ` David Gibson
2026-05-20 13:08 ` [PATCH 3/6] tcp_splice: Clean up flow control path for splice forwarding David Gibson
2026-05-20 20:28 ` Stefano Brivio
2026-05-21 0:50 ` David Gibson
2026-05-20 13:08 ` [PATCH 4/6] tcp_splice: Simplify tracking of read/written bytes David Gibson
2026-05-20 20:29 ` Stefano Brivio
2026-05-21 0:54 ` David Gibson
2026-05-20 13:08 ` [PATCH 5/6] tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling David Gibson
2026-05-20 20:30 ` Stefano Brivio
2026-05-21 2:03 ` David Gibson
2026-05-21 5:40 ` Stefano Brivio
2026-05-21 6:56 ` David Gibson
2026-05-21 7:15 ` Stefano Brivio [this message]
2026-05-21 13:51 ` David Gibson
2026-05-21 15:18 ` Stefano Brivio
2026-05-22 1:29 ` David Gibson
2026-05-20 13:08 ` [PATCH 6/6] tcp_splice: Simplify shutdown(2) handling David Gibson
2026-05-20 20:30 ` Stefano Brivio
2026-05-21 2:11 ` David Gibson
2026-05-21 5:40 ` Stefano Brivio
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260521091512.1ede0a84@elisabeth \
--to=sbrivio@redhat.com \
--cc=david@gibson.dropbear.id.au \
--cc=passt-dev@passt.top \
--cc=pholzing@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).