From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top, Paul Holzinger <pholzing@redhat.com>
Subject: Re: [PATCH 4/6] tcp_splice: Simplify tracking of read/written bytes
Date: Thu, 21 May 2026 10:54:47 +1000 [thread overview]
Message-ID: <ag5X141b8T8IbilX@zatzit> (raw)
In-Reply-To: <20260520222911.6d12ff70@elisabeth>
[-- Attachment #1: Type: text/plain, Size: 5947 bytes --]
On Wed, May 20, 2026 at 10:29:12PM +0200, Stefano Brivio wrote:
> On Wed, 20 May 2026 23:08:49 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > For each each direction of each spliced connection, we keep track of how
> > many bytes we've read from one socket and written to the other. However,
> > we never actually care about the absolute values of these, only the
> > difference between them, which represents how much data is currently "in
> > flight" in the splicing pipe.
> >
> > Simplify the handling by having a single variable tracking the number of
> > bytes in the pipe.
>
> For me it actually looks slightly more complicated to think about it
> this way, I added explicit 'read' and 'written' after being bitten by
> some issue I introduced with a previous 'pending' concept, but I have
> to admit it slightly simplifies the overflow topic.
>
> > As a bonus, the new scheme makes it clearer that we don't need to worry
> > about overflows: pending can never become larger than the maximum pipe
> > bufffer size, well within 32-bits.
> >
> > I _think_ the old scheme was safe in the case of overflow - again under
> > the assumption that read/written can never be further apart than the pipe
> > buffer size. However, it's much harder to reason about this case. It's
> > certainly plausible that an overflow could occur - sending 4GiB through
> > a local socket is entirely achievable.
>
> For me it looked pretty simple: you can overflow 32 bits (at 100 Gbps,
> but without hitting the "optimised" case, it would take about five
> minutes), but all the operations between the two counters are between
> two uint32_t, so they would happen in uint32_t, hence modulo 32 bits,
> similar to TCP sequences.
Plus it's only equality comparisons, so we don't need SEQ_GT or the
like. Yeah, that's the reasoning, but to me that's still a lot more
than "can't exceed pipe size".
> Anyway, overall, I think it's an improvement over the original. One nit
> here:
>
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > tcp_conn.h | 6 ++----
> > tcp_splice.c | 18 +++++++++---------
> > 2 files changed, 11 insertions(+), 13 deletions(-)
> >
> > diff --git a/tcp_conn.h b/tcp_conn.h
> > index 9f5bee03..c8381aa7 100644
> > --- a/tcp_conn.h
> > +++ b/tcp_conn.h
> > @@ -206,8 +206,7 @@ struct tcp_tap_transfer_ext {
> > * @f: Generic flow information
> > * @s: File descriptor for sockets
> > * @pipe: File descriptors for pipes
> > - * @read: Bytes read (not fully written to other side in one shot)
> > - * @written: Bytes written (not fully written from one other side read)
> > + * @pending: Bytes currently in each pipe
> > * @events: Events observed/actions performed on connection
> > * @flags: Connection flags (attributes, not events)
> > */
> > @@ -218,8 +217,7 @@ struct tcp_splice_conn {
> > int s[SIDES];
> > int pipe[SIDES][2];
> >
> > - uint32_t read[SIDES];
> > - uint32_t written[SIDES];
> > + uint32_t pending[SIDES];
> >
> > uint8_t events;
> > #define SPLICE_CLOSED 0
> > diff --git a/tcp_splice.c b/tcp_splice.c
> > index 18e8b303..8fbd490f 100644
> > --- a/tcp_splice.c
> > +++ b/tcp_splice.c
> > @@ -292,7 +292,7 @@ bool tcp_splice_flow_defer(struct tcp_splice_conn *conn)
> > conn->s[sidei] = -1;
> > }
> >
> > - conn->read[sidei] = conn->written[sidei] = 0;
> > + conn->pending[sidei] = 0;
> > }
> >
> > conn->events = SPLICE_CLOSED;
> > @@ -490,7 +490,7 @@ static int tcp_splice_forward(struct ctx *c, struct
> > int eof = 0;
> >
> > while (1) {
> > - ssize_t readlen, written, pending;
> > + ssize_t readlen, written;
> > int more = 0;
> >
> > retry:
> > @@ -537,7 +537,7 @@ retry:
> > flow_trace(conn, "%zi from write-side call (passed %zi)",
> > written, c->tcp.pipe_size);
> >
> > - /* Most common case: skip updating counters. */
> > + /* Most common case: skip updating pending. */
>
> "pending" isn't a noun (even though the variable name is, but it's
> not quite obvious that you're referring to it). I think that:
>
> /* Most common case: skip updating count of pending bytes */
>
> would be slightly clearer (and also omit the '.' because it's not a
> complete sentence, as we usually do on single-line comments, similarly
> to most occurrences in the kernel).
Done.
>
> > if (readlen > 0 && readlen == written) {
> > if (readlen >= (long)c->tcp.pipe_size * 10 / 100)
> > continue;
> > @@ -561,11 +561,11 @@ retry:
> > continue;
> > }
> >
> > - conn->read[fromsidei] += readlen > 0 ? readlen : 0;
> > - conn->written[fromsidei] += written > 0 ? written : 0;
> > + conn->pending[fromsidei] += readlen > 0 ? readlen : 0;
> > + conn->pending[fromsidei] -= written > 0 ? written : 0;
> >
> > if (written < 0) {
> > - if (conn->read[fromsidei] == conn->written[fromsidei])
> > + if (!conn->pending[fromsidei])
> > break;
> >
> > conn_event(conn, OUT_WAIT(!fromsidei));
> > @@ -575,15 +575,15 @@ retry:
> > if (never_read && written == (long)(c->tcp.pipe_size))
> > goto retry;
> >
> > - pending = conn->read[fromsidei] - conn->written[fromsidei];
> > - if (!never_read && written > 0 && written < pending)
> > + if (!never_read && written > 0 &&
> > + written < conn->pending[fromsidei])
> > goto retry;
> >
> > if (eof)
> > break;
> > }
> >
> > - if (conn->read[fromsidei] == conn->written[fromsidei] && eof) {
> > + if (!conn->pending[fromsidei] && eof) {
> > unsigned sidei;
> >
> > flow_foreach_sidei(sidei) {
>
> --
> Stefano
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
next prev parent reply other threads:[~2026-05-21 2:29 UTC|newest]
Thread overview: 27+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-20 13:08 [PATCH 0/6] Fix race condition while closing spliced connections David Gibson
2026-05-20 13:08 ` [PATCH 1/6] tcp_splice: Improve error reporting David Gibson
2026-05-20 14:31 ` Stefano Brivio
2026-05-21 0:43 ` David Gibson
2026-05-21 5:08 ` Stefano Brivio
2026-05-20 13:08 ` [PATCH 2/6] tcp_splice: Avoid missing EOF recognition while forwarding David Gibson
2026-05-20 20:28 ` Stefano Brivio
2026-05-21 0:46 ` David Gibson
2026-05-20 13:08 ` [PATCH 3/6] tcp_splice: Clean up flow control path for splice forwarding David Gibson
2026-05-20 20:28 ` Stefano Brivio
2026-05-21 0:50 ` David Gibson
2026-05-20 13:08 ` [PATCH 4/6] tcp_splice: Simplify tracking of read/written bytes David Gibson
2026-05-20 20:29 ` Stefano Brivio
2026-05-21 0:54 ` David Gibson [this message]
2026-05-20 13:08 ` [PATCH 5/6] tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling David Gibson
2026-05-20 20:30 ` Stefano Brivio
2026-05-21 2:03 ` David Gibson
2026-05-21 5:40 ` Stefano Brivio
2026-05-21 6:56 ` David Gibson
2026-05-21 7:15 ` Stefano Brivio
2026-05-21 13:51 ` David Gibson
2026-05-21 15:18 ` Stefano Brivio
2026-05-22 1:29 ` David Gibson
2026-05-20 13:08 ` [PATCH 6/6] tcp_splice: Simplify shutdown(2) handling David Gibson
2026-05-20 20:30 ` Stefano Brivio
2026-05-21 2:11 ` David Gibson
2026-05-21 5:40 ` Stefano Brivio
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ag5X141b8T8IbilX@zatzit \
--to=david@gibson.dropbear.id.au \
--cc=passt-dev@passt.top \
--cc=pholzing@redhat.com \
--cc=sbrivio@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).