From: Stefano Brivio
To: David Gibson
CC: passt-dev@passt.top, Paul Holzinger
Subject: Re: [PATCH 4/6] tcp_splice: Simplify tracking of read/written bytes
Date: Wed, 20 May 2026 22:29:12 +0200 (CEST)
Message-ID: <20260520222911.6d12ff70@elisabeth>
In-Reply-To: <20260520130851.436931-5-david@gibson.dropbear.id.au>
References: <20260520130851.436931-1-david@gibson.dropbear.id.au>
	<20260520130851.436931-5-david@gibson.dropbear.id.au>
Organization: Red Hat
List-Id: Development discussion and patches for passt

On Wed, 20 May 2026 23:08:49 +1000
David Gibson wrote:

> For each direction of each spliced connection, we keep track of how
> many bytes we've read from one socket and written to the other.  However,
> we never actually care about the absolute values of these, only the
> difference between them, which represents how much data is currently "in
> flight" in the splicing pipe.
>
> Simplify the handling by having a single variable tracking the number of
> bytes in the pipe.

For me it actually looks slightly more complicated to think about it
this way: I added explicit 'read' and 'written' after being bitten by
some issue I introduced with a previous 'pending' concept. But I have to
admit it slightly simplifies the overflow topic.

> As a bonus, the new scheme makes it clearer that we don't need to worry
> about overflows: pending can never become larger than the maximum pipe
> buffer size, well within 32 bits.
>
> I _think_ the old scheme was safe in the case of overflow - again under
> the assumption that read/written can never be further apart than the pipe
> buffer size.  However, it's much harder to reason about this case.  It's
> certainly plausible that an overflow could occur - sending 4GiB through
> a local socket is entirely achievable.

For me it looked pretty simple: you can overflow 32 bits (at 100 Gbps,
but without hitting the "optimised" case, it would take about five
minutes), but all the operations between the two counters are between
two uint32_t, so they would happen in uint32_t, hence modulo 2^32,
similar to TCP sequence numbers.

Anyway, overall, I think it's an improvement over the original.
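
To make the modulo argument concrete, here is a minimal sketch. The
struct and helper names are hypothetical and not taken from passt; the
only assumption carried over is that both counters are uint32_t, as in
the old struct tcp_splice_conn. It shows that the difference stays
correct even after 'read' has wrapped and 'written' hasn't:

```c
#include <stdint.h>

/* Hypothetical stand-in for the old pair of free-running counters */
struct counters {
	uint32_t read;		/* Bytes read into the pipe, wraps at 2^32 */
	uint32_t written;	/* Bytes written out of the pipe, wraps too */
};

/* Bytes in flight: unsigned subtraction, result taken modulo 2^32 */
static uint32_t in_flight(const struct counters *c)
{
	return (uint32_t)(c->read - c->written);
}
```

Starting both counters at UINT32_MAX - 100, then reading 300 bytes and
writing 50, leaves 'read' numerically smaller than 'written', yet
in_flight() still returns 250, which is the same value a single
'pending' counter would hold after += 300 and -= 50.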

One nit here:

> Signed-off-by: David Gibson
> ---
>  tcp_conn.h   |  6 ++----
>  tcp_splice.c | 18 +++++++++---------
>  2 files changed, 11 insertions(+), 13 deletions(-)
>
> diff --git a/tcp_conn.h b/tcp_conn.h
> index 9f5bee03..c8381aa7 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -206,8 +206,7 @@ struct tcp_tap_transfer_ext {
>   * @f:			Generic flow information
>   * @s:			File descriptor for sockets
>   * @pipe:		File descriptors for pipes
> - * @read:		Bytes read (not fully written to other side in one shot)
> - * @written:		Bytes written (not fully written from one other side read)
> + * @pending:		Bytes currently in each pipe
>   * @events:		Events observed/actions performed on connection
>   * @flags:		Connection flags (attributes, not events)
>   */
> @@ -218,8 +217,7 @@ struct tcp_splice_conn {
>  	int s[SIDES];
>  	int pipe[SIDES][2];
>
> -	uint32_t read[SIDES];
> -	uint32_t written[SIDES];
> +	uint32_t pending[SIDES];
>
>  	uint8_t events;
>  #define SPLICE_CLOSED			0
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 18e8b303..8fbd490f 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -292,7 +292,7 @@ bool tcp_splice_flow_defer(struct tcp_splice_conn *conn)
>  			conn->s[sidei] = -1;
>  		}
>
> -		conn->read[sidei] = conn->written[sidei] = 0;
> +		conn->pending[sidei] = 0;
>  	}
>
>  	conn->events = SPLICE_CLOSED;
> @@ -490,7 +490,7 @@ static int tcp_splice_forward(struct ctx *c, struct
>  	int eof = 0;
>
>  	while (1) {
> -		ssize_t readlen, written, pending;
> +		ssize_t readlen, written;
>  		int more = 0;
>
>  retry:
> @@ -537,7 +537,7 @@ retry:
>  		flow_trace(conn, "%zi from write-side call (passed %zi)",
>  			   written, c->tcp.pipe_size);
>
> -		/* Most common case: skip updating counters. */
> +		/* Most common case: skip updating pending. */

"pending" isn't a noun (even though the variable name is, but it's not
quite obvious that you're referring to it). I think that:

		/* Most common case: skip updating count of pending bytes */

would be slightly clearer (and also omit the '.' because it's not a
complete sentence, as we usually do on single-line comments, similarly
to most occurrences in the kernel).

>  		if (readlen > 0 && readlen == written) {
>  			if (readlen >= (long)c->tcp.pipe_size * 10 / 100)
>  				continue;
> @@ -561,11 +561,11 @@ retry:
>  			continue;
>  		}
>
> -		conn->read[fromsidei]    += readlen > 0 ? readlen : 0;
> -		conn->written[fromsidei] += written > 0 ? written : 0;
> +		conn->pending[fromsidei] += readlen > 0 ? readlen : 0;
> +		conn->pending[fromsidei] -= written > 0 ? written : 0;
>
>  		if (written < 0) {
> -			if (conn->read[fromsidei] == conn->written[fromsidei])
> +			if (!conn->pending[fromsidei])
>  				break;
>
>  			conn_event(conn, OUT_WAIT(!fromsidei));
> @@ -575,15 +575,15 @@ retry:
>  		if (never_read && written == (long)(c->tcp.pipe_size))
>  			goto retry;
>
> -		pending = conn->read[fromsidei] - conn->written[fromsidei];
> -		if (!never_read && written > 0 && written < pending)
> +		if (!never_read && written > 0 &&
> +		    written < conn->pending[fromsidei])
>  			goto retry;
>
>  		if (eof)
>  			break;
>  	}
>
> -	if (conn->read[fromsidei] == conn->written[fromsidei] && eof) {
> +	if (!conn->pending[fromsidei] && eof) {
>  		unsigned sidei;
>
>  		flow_foreach_sidei(sidei) {

-- 
Stefano