public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top, Paul Holzinger <pholzing@redhat.com>
Subject: Re: [PATCH 4/6] tcp_splice: Simplify tracking of read/written bytes
Date: Thu, 21 May 2026 10:54:47 +1000	[thread overview]
Message-ID: <ag5X141b8T8IbilX@zatzit> (raw)
In-Reply-To: <20260520222911.6d12ff70@elisabeth>

[-- Attachment #1: Type: text/plain, Size: 5947 bytes --]

On Wed, May 20, 2026 at 10:29:12PM +0200, Stefano Brivio wrote:
> On Wed, 20 May 2026 23:08:49 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > For each each direction of each spliced connection, we keep track of how
> > many bytes we've read from one socket and written to the other.  However,
> > we never actually care about the absolute values of these, only the
> > difference between them, which represents how much data is currently "in
> > flight" in the splicing pipe.
> > 
> > Simplify the handling by having a single variable tracking the number of
> > bytes in the pipe.
> 
> For me it actually looks slightly more complicated to think about it
> this way, I added explicit 'read' and 'written' after being bitten by
> some issue I introduced with a previous 'pending' concept, but I have
> to admit it slightly simplifies the overflow topic.
> 
> > As a bonus, the new scheme makes it clearer that we don't need to worry
> > about overflows: pending can never become larger than the maximum pipe
> > bufffer size, well within 32-bits.
> > 
> > I _think_ the old scheme was safe in the case of overflow - again under
> > the assumption that read/written can never be further apart than the pipe
> > buffer size.  However, it's much harder to reason about this case.  It's
> > certainly plausible that an overflow could occur - sending 4GiB through
> > a local socket is entirely achievable.
> 
> For me it looked pretty simple: you can overflow 32 bits (at 100 Gbps,
> but without hitting the "optimised" case, it would take about five
> minutes), but all the operations between the two counters are between
> two uint32_t, so they would happen in uint32_t, hence modulo 32 bits,
> similar to TCP sequences.

Plus it's only equality comparisons, so we don't need SEQ_GT or the
like.  Yeah, that's the reasoning, but to me that's still a lot more
than "can't exceed pipe size".

> Anyway, overall, I think it's an improvement over the original. One nit
> here:
> 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  tcp_conn.h   |  6 ++----
> >  tcp_splice.c | 18 +++++++++---------
> >  2 files changed, 11 insertions(+), 13 deletions(-)
> > 
> > diff --git a/tcp_conn.h b/tcp_conn.h
> > index 9f5bee03..c8381aa7 100644
> > --- a/tcp_conn.h
> > +++ b/tcp_conn.h
> > @@ -206,8 +206,7 @@ struct tcp_tap_transfer_ext {
> >   * @f:			Generic flow information
> >   * @s:			File descriptor for sockets
> >   * @pipe:		File descriptors for pipes
> > - * @read:		Bytes read (not fully written to other side in one shot)
> > - * @written:		Bytes written (not fully written from one other side read)
> > + * @pending:		Bytes currently in each pipe
> >   * @events:		Events observed/actions performed on connection
> >   * @flags:		Connection flags (attributes, not events)
> >   */
> > @@ -218,8 +217,7 @@ struct tcp_splice_conn {
> >  	int s[SIDES];
> >  	int pipe[SIDES][2];
> >  
> > -	uint32_t read[SIDES];
> > -	uint32_t written[SIDES];
> > +	uint32_t pending[SIDES];
> >  
> >  	uint8_t events;
> >  #define SPLICE_CLOSED			0
> > diff --git a/tcp_splice.c b/tcp_splice.c
> > index 18e8b303..8fbd490f 100644
> > --- a/tcp_splice.c
> > +++ b/tcp_splice.c
> > @@ -292,7 +292,7 @@ bool tcp_splice_flow_defer(struct tcp_splice_conn *conn)
> >  			conn->s[sidei] = -1;
> >  		}
> >  
> > -		conn->read[sidei] = conn->written[sidei] = 0;
> > +		conn->pending[sidei] = 0;
> >  	}
> >  
> >  	conn->events = SPLICE_CLOSED;
> > @@ -490,7 +490,7 @@ static int tcp_splice_forward(struct ctx *c, struct
> >  	int eof = 0;
> >  
> >  	while (1) {
> > -		ssize_t readlen, written, pending;
> > +		ssize_t readlen, written;
> >  		int more = 0;
> >  
> >  retry:
> > @@ -537,7 +537,7 @@ retry:
> >  		flow_trace(conn, "%zi from write-side call (passed %zi)",
> >  			   written, c->tcp.pipe_size);
> >  
> > -		/* Most common case: skip updating counters. */
> > +		/* Most common case: skip updating pending. */
> 
> "pending" isn't a noun (even though the variable name is, but it's
> not quite obvious that you're referring to it). I think that:
> 
> 		/* Most common case: skip updating count of pending bytes */
> 
> would be slightly clearer (and also omit the '.' because it's not a
> complete sentence, as we usually do on single-line comments, similarly
> to most occurrences in the kernel).

Done.

> 
> >  		if (readlen > 0 && readlen == written) {
> >  			if (readlen >= (long)c->tcp.pipe_size * 10 / 100)
> >  				continue;
> > @@ -561,11 +561,11 @@ retry:
> >  			continue;
> >  		}
> >  
> > -		conn->read[fromsidei]    += readlen > 0 ? readlen : 0;
> > -		conn->written[fromsidei] += written > 0 ? written : 0;
> > +		conn->pending[fromsidei] += readlen > 0 ? readlen : 0;
> > +		conn->pending[fromsidei] -= written > 0 ? written : 0;
> >  
> >  		if (written < 0) {
> > -			if (conn->read[fromsidei] == conn->written[fromsidei])
> > +			if (!conn->pending[fromsidei])
> >  				break;
> >  
> >  			conn_event(conn, OUT_WAIT(!fromsidei));
> > @@ -575,15 +575,15 @@ retry:
> >  		if (never_read && written == (long)(c->tcp.pipe_size))
> >  			goto retry;
> >  
> > -		pending = conn->read[fromsidei] - conn->written[fromsidei];
> > -		if (!never_read && written > 0 && written < pending)
> > +		if (!never_read && written > 0 &&
> > +		    written < conn->pending[fromsidei])
> >  			goto retry;
> >  
> >  		if (eof)
> >  			break;
> >  	}
> >  
> > -	if (conn->read[fromsidei] == conn->written[fromsidei] && eof) {
> > +	if (!conn->pending[fromsidei] && eof) {
> >  		unsigned sidei;
> >  
> >  		flow_foreach_sidei(sidei) {
> 
> -- 
> Stefano
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  reply	other threads:[~2026-05-21  2:29 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-20 13:08 [PATCH 0/6] Fix race condition while closing spliced connections David Gibson
2026-05-20 13:08 ` [PATCH 1/6] tcp_splice: Improve error reporting David Gibson
2026-05-20 14:31   ` Stefano Brivio
2026-05-21  0:43     ` David Gibson
2026-05-21  5:08       ` Stefano Brivio
2026-05-20 13:08 ` [PATCH 2/6] tcp_splice: Avoid missing EOF recognition while forwarding David Gibson
2026-05-20 20:28   ` Stefano Brivio
2026-05-21  0:46     ` David Gibson
2026-05-20 13:08 ` [PATCH 3/6] tcp_splice: Clean up flow control path for splice forwarding David Gibson
2026-05-20 20:28   ` Stefano Brivio
2026-05-21  0:50     ` David Gibson
2026-05-20 13:08 ` [PATCH 4/6] tcp_splice: Simplify tracking of read/written bytes David Gibson
2026-05-20 20:29   ` Stefano Brivio
2026-05-21  0:54     ` David Gibson [this message]
2026-05-20 13:08 ` [PATCH 5/6] tcp_splice: Simplify EPOLLRDHUP / eof / FIN handling David Gibson
2026-05-20 20:30   ` Stefano Brivio
2026-05-21  2:03     ` David Gibson
2026-05-21  5:40       ` Stefano Brivio
2026-05-21  6:56         ` David Gibson
2026-05-21  7:15           ` Stefano Brivio
2026-05-21 13:51             ` David Gibson
2026-05-21 15:18               ` Stefano Brivio
2026-05-22  1:29                 ` David Gibson
2026-05-20 13:08 ` [PATCH 6/6] tcp_splice: Simplify shutdown(2) handling David Gibson
2026-05-20 20:30   ` Stefano Brivio
2026-05-21  2:11     ` David Gibson
2026-05-21  5:40       ` Stefano Brivio

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ag5X141b8T8IbilX@zatzit \
    --to=david@gibson.dropbear.id.au \
    --cc=passt-dev@passt.top \
    --cc=pholzing@redhat.com \
    --cc=sbrivio@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).