On Wed, May 20, 2026 at 10:29:12PM +0200, Stefano Brivio wrote: > On Wed, 20 May 2026 23:08:49 +1000 > David Gibson wrote: > > > For each each direction of each spliced connection, we keep track of how > > many bytes we've read from one socket and written to the other. However, > > we never actually care about the absolute values of these, only the > > difference between them, which represents how much data is currently "in > > flight" in the splicing pipe. > > > > Simplify the handling by having a single variable tracking the number of > > bytes in the pipe. > > For me it actually looks slightly more complicated to think about it > this way, I added explicit 'read' and 'written' after being bitten by > some issue I introduced with a previous 'pending' concept, but I have > to admit it slightly simplifies the overflow topic. > > > As a bonus, the new scheme makes it clearer that we don't need to worry > > about overflows: pending can never become larger than the maximum pipe > > bufffer size, well within 32-bits. > > > > I _think_ the old scheme was safe in the case of overflow - again under > > the assumption that read/written can never be further apart than the pipe > > buffer size. However, it's much harder to reason about this case. It's > > certainly plausible that an overflow could occur - sending 4GiB through > > a local socket is entirely achievable. > > For me it looked pretty simple: you can overflow 32 bits (at 100 Gbps, > but without hitting the "optimised" case, it would take about five > minutes), but all the operations between the two counters are between > two uint32_t, so they would happen in uint32_t, hence modulo 32 bits, > similar to TCP sequences. Plus it's only equality comparisons, so we don't need SEQ_GT or the like. Yeah, that's the reasoning, but to me that's still a lot more than "can't exceed pipe size". > Anyway, overall, I think it's an improvement over the original. One nit > here: > > > Signed-off-by: David Gibson > > --- > > tcp_conn.h | 6 ++---- > > tcp_splice.c | 18 +++++++++--------- > > 2 files changed, 11 insertions(+), 13 deletions(-) > > > > diff --git a/tcp_conn.h b/tcp_conn.h > > index 9f5bee03..c8381aa7 100644 > > --- a/tcp_conn.h > > +++ b/tcp_conn.h > > @@ -206,8 +206,7 @@ struct tcp_tap_transfer_ext { > > * @f: Generic flow information > > * @s: File descriptor for sockets > > * @pipe: File descriptors for pipes > > - * @read: Bytes read (not fully written to other side in one shot) > > - * @written: Bytes written (not fully written from one other side read) > > + * @pending: Bytes currently in each pipe > > * @events: Events observed/actions performed on connection > > * @flags: Connection flags (attributes, not events) > > */ > > @@ -218,8 +217,7 @@ struct tcp_splice_conn { > > int s[SIDES]; > > int pipe[SIDES][2]; > > > > - uint32_t read[SIDES]; > > - uint32_t written[SIDES]; > > + uint32_t pending[SIDES]; > > > > uint8_t events; > > #define SPLICE_CLOSED 0 > > diff --git a/tcp_splice.c b/tcp_splice.c > > index 18e8b303..8fbd490f 100644 > > --- a/tcp_splice.c > > +++ b/tcp_splice.c > > @@ -292,7 +292,7 @@ bool tcp_splice_flow_defer(struct tcp_splice_conn *conn) > > conn->s[sidei] = -1; > > } > > > > - conn->read[sidei] = conn->written[sidei] = 0; > > + conn->pending[sidei] = 0; > > } > > > > conn->events = SPLICE_CLOSED; > > @@ -490,7 +490,7 @@ static int tcp_splice_forward(struct ctx *c, struct > > int eof = 0; > > > > while (1) { > > - ssize_t readlen, written, pending; > > + ssize_t readlen, written; > > int more = 0; > > > > retry: > > @@ -537,7 +537,7 @@ retry: > > flow_trace(conn, "%zi from write-side call (passed %zi)", > > written, c->tcp.pipe_size); > > > > - /* Most common case: skip updating counters. */ > > + /* Most common case: skip updating pending. */ > > "pending" isn't a noun (even though the variable name is, but it's > not quite obvious that you're referring to it). I think that: > > /* Most common case: skip updating count of pending bytes */ > > would be slightly clearer (and also omit the '.' because it's not a > complete sentence, as we usually do on single-line comments, similarly > to most occurrences in the kernel). Done. > > > if (readlen > 0 && readlen == written) { > > if (readlen >= (long)c->tcp.pipe_size * 10 / 100) > > continue; > > @@ -561,11 +561,11 @@ retry: > > continue; > > } > > > > - conn->read[fromsidei] += readlen > 0 ? readlen : 0; > > - conn->written[fromsidei] += written > 0 ? written : 0; > > + conn->pending[fromsidei] += readlen > 0 ? readlen : 0; > > + conn->pending[fromsidei] -= written > 0 ? written : 0; > > > > if (written < 0) { > > - if (conn->read[fromsidei] == conn->written[fromsidei]) > > + if (!conn->pending[fromsidei]) > > break; > > > > conn_event(conn, OUT_WAIT(!fromsidei)); > > @@ -575,15 +575,15 @@ retry: > > if (never_read && written == (long)(c->tcp.pipe_size)) > > goto retry; > > > > - pending = conn->read[fromsidei] - conn->written[fromsidei]; > > - if (!never_read && written > 0 && written < pending) > > + if (!never_read && written > 0 && > > + written < conn->pending[fromsidei]) > > goto retry; > > > > if (eof) > > break; > > } > > > > - if (conn->read[fromsidei] == conn->written[fromsidei] && eof) { > > + if (!conn->pending[fromsidei] && eof) { > > unsigned sidei; > > > > flow_foreach_sidei(sidei) { > > -- > Stefano > -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson