From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top
Subject: Re: [PATCH 6/8] tcp_splice: Simplify / correct OUT_WAIT flag handling
Date: Thu, 4 Jun 2026 15:14:08 +1000 [thread overview]
Message-ID: <aiEJoLrWEw4MuBuK@zatzit> (raw)
In-Reply-To: <20260604064134.478745ed@elisabeth>
[-- Attachment #1: Type: text/plain, Size: 5005 bytes --]
On Thu, Jun 04, 2026 at 06:41:37AM +0200, Stefano Brivio wrote:
> On Thu, 28 May 2026 15:02:11 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > We set the OUT_WAIT flag if we stop forwarding due to EAGAIN, but there's
> > still data in the pipe. That ensures we wake up when the output socket has
> > room to drain the pipe into.
> >
> > We clear the OUT_WAIT flag when we complete forwarding on an EPOLLOUT
> > event, but that's not quite right. Even though it's called on an EPOLLOUT,
> > tcp_splice_forward() could, in principle empty the pipe, but also read
> > enough new data from the other side to fill it again. That would set
> > OUT_WAIT internally, but it would be cleared after returning meaning
> > we could miss a necessary wakeup.
>
> The current logic in tcp_splice_sock_handler():
>
> if (events & EPOLLOUT) {
> if (tcp_splice_forward(c, conn, !evsidei, now))
> goto reset;
> conn_event(conn, ~OUT_WAIT(evsidei));
> }
>
> if (events & EPOLLIN) {
> if (tcp_splice_forward(c, conn, evsidei, now))
> goto reset;
> }
>
> would prevent the case you described, because if we read new data from
> the other side filling the pipe, we'll hit (events & EPOLLIN) and set
> OUT_WAIT again if needed.
Nope. The (events & EPOLLIN) is an event on the same socket,
forwarding in the opposite direction. The pipe would be refilled by
data on the _other_ socket forwarding in the same direction.
Now, _usually_ you'd then get an EPOLLIN on that other socket and that
would trigger the wake up. But, this is actually a rare case where we
might "miss" an event because we're using level not edge trigger
(rather than the other way around).
Consider just one direction of flow from socket A to socket B
1. epoll_wait() returns (just) an EPOLLOUT on socket B, nothing has
arrived yet on socket A, so no EPOLLIN there.
2. Data arrives on socket A.
3. We reach tcp_splice_forward(), it empties the pipe, but refills it
with the data that arrived in step (2). It happens that this also
consumes all the data that arrived in (2) - we got exactly one
pipe's worth of data.
4. We return from tcp_splice_forward() and clear OUT_WAIT.
5. We return to the epoll_wait(), but because we already read the
data from socket A, and we're using level triggered events, we
don't get an EPOLLIN
6. Space becomes available on socket B, but we don't get an EPOLLOUT,
because OUT_WAIT is clear
...and we're stuck.
Unlikely, but possible
> But there's a case this should actually fix, even though I've never
> seen it happening in practice: what if we *don't* read new data from
> the other side, and we can't empty the pipe in one EPOLLOUT shot anyway?
>
> I hadn't considered that before but if the receiver is slow enough
> that's probably possible.
True, that's probably more likely than the scenario above, actually.
>
> > The condition on whether we need write side wakeups is actually fairly
> > simple: we need them if and only if we return to the main loop with data
> > in the pipe. Maintain that in a single place - right after we exit the
> > forwarding loop in tcp_splice_forward().
> >
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > tcp_splice.c | 16 +++++++++-------
> > 1 file changed, 9 insertions(+), 7 deletions(-)
> >
> > diff --git a/tcp_splice.c b/tcp_splice.c
> > index 42902684..5f412584 100644
> > --- a/tcp_splice.c
> > +++ b/tcp_splice.c
> > @@ -531,19 +531,22 @@ static int tcp_splice_forward(struct ctx *c,
> > conn->pending[fromsidei] += readlen > 0 ? readlen : 0;
> > conn->pending[fromsidei] -= written > 0 ? written : 0;
> >
> > - if (written < 0) {
> > - if (!conn->pending[fromsidei])
> > - break;
> > -
> > - conn_event(conn, OUT_WAIT(!fromsidei));
> > + if (written < 0)
> > break;
> > - }
> >
> > if (conn->events & FIN_RCVD(fromsidei) &&
> > !conn->pending[fromsidei])
> > break;
> > }
> >
> > + /* We need write-side wakeups if and only if we have data in the pipe to
> > + * drain.
> > + */
> > + if (conn->pending[fromsidei])
> > + conn_event(conn, OUT_WAIT(!fromsidei));
> > + else
> > + conn_event(conn, ~OUT_WAIT(!fromsidei));
> > +
> > if ((conn->events & FIN_RCVD(fromsidei)) &&
> > !(conn->events & FIN_SENT(!fromsidei)) &&
> > !conn->pending[fromsidei]) {
> > @@ -606,7 +609,6 @@ void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
> > if (events & EPOLLOUT) {
> > if (tcp_splice_forward(c, conn, !evsidei, now))
> > goto reset;
> > - conn_event(conn, ~OUT_WAIT(evsidei));
> > }
> >
> > if (events & (EPOLLIN | EPOLLRDHUP)) {
>
> --
> Stefano
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
next prev parent reply other threads:[~2026-06-04 5:26 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-28 5:02 [PATCH 0/8] splice() forwarding cleanups David Gibson
2026-05-28 5:02 ` [PATCH 1/8] tcp_splice: Remove never-invoked SO_RCVLOWAT logic David Gibson
2026-05-28 5:02 ` [PATCH 2/8] tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling David Gibson
2026-05-28 5:02 ` [PATCH 3/8] tcp_splice: Improve EOF exit condition for the loop David Gibson
2026-05-28 5:02 ` [PATCH 4/8] tcp_splice: Remove goto from forwarding loop David Gibson
2026-05-28 5:02 ` [PATCH 5/8] tcp_splice: Simplify shutdown(2) handling David Gibson
2026-05-28 5:02 ` [PATCH 6/8] tcp_splice: Simplify / correct OUT_WAIT flag handling David Gibson
2026-06-04 4:41 ` Stefano Brivio
2026-06-04 5:14 ` David Gibson [this message]
2026-05-28 5:02 ` [PATCH 7/8] tcp_splice: Remove questionable "optimisation" of pending bytes tracking David Gibson
2026-06-04 4:41 ` Stefano Brivio
2026-06-04 5:14 ` David Gibson
2026-05-28 5:02 ` [PATCH 8/8] tcp_splice: Exit forwarding earlier when stalled read side David Gibson
2026-06-04 4:41 ` Stefano Brivio
2026-06-04 5:26 ` David Gibson
2026-06-04 5:44 ` Stefano Brivio
2026-06-04 7:08 ` David Gibson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aiEJoLrWEw4MuBuK@zatzit \
--to=david@gibson.dropbear.id.au \
--cc=passt-dev@passt.top \
--cc=sbrivio@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).