From: David Gibson
To: Stefano Brivio
Cc: passt-dev@passt.top
Date: Wed, 8 Oct 2025 12:21:35 +1100
Subject: Re: [PATCH 1/1] tcp: Clarify logic calculating how much guest data to ack
References: <20251003063051.1127873-1-david@gibson.dropbear.id.au>
 <20251003063051.1127873-2-david@gibson.dropbear.id.au>
 <20251008004212.25d0d0dc@elisabeth>
In-Reply-To: <20251008004212.25d0d0dc@elisabeth>
List-Id: Development discussion and patches for passt

On Wed, Oct 08, 2025 at 12:42:12AM +0200, Stefano Brivio wrote:
> On Fri,  3 Oct 2025 16:30:51 +1000
> David Gibson wrote:
>
> > This is fairly complex, because we have a method we prefer but we need to
> > fall back to a simpler one in a bunch of cases.  Slightly reorganise the
> > code to make the flow clearer, and add a large comment giving the
> > rationale.
>
> I think this is a strict improvement on the original and I was about to
> apply it regardless of my pending series with TCP fixes (it looks
> completely independent to me) and a few nits I had, but then I noticed
> one bit that might be substantially misleading, at the end.
>
> So here come all my comments:
>
> > Signed-off-by: David Gibson
> > ---
> >  tcp.c | 68 ++++++++++++++++++++++++++++++++++++-----------------------
> >  1 file changed, 42 insertions(+), 26 deletions(-)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 7da41797..85eb2c32 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -1014,35 +1014,51 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> >  	uint32_t new_wnd_to_tap = prev_wnd_to_tap;
> >  	int s = conn->sock;
> >  
> > -	if (!bytes_acked_cap) {
> > -		conn->seq_ack_to_tap = conn->seq_from_tap;
> > -		if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap))
> > -			conn->seq_ack_to_tap = prev_ack_to_tap;
> > -	} else {
> > -		if ((unsigned)SNDBUF_GET(conn) < SNDBUF_SMALL ||
> > -		    tcp_rtt_dst_low(conn) || CONN_IS_CLOSING(conn) ||
> > -		    (conn->flags & LOCAL) || force_seq) {
> > -			conn->seq_ack_to_tap = conn->seq_from_tap;
> > -		} else if (conn->seq_ack_to_tap != conn->seq_from_tap) {
> > -			if (!tinfo) {
> > -				tinfo = &tinfo_new;
> > -				if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
> > -					return 0;
> > -			}
> > -
> > -			/* This trips a cppcheck bug in some versions, including
> > -			 * cppcheck 2.18.3.
> > -			 * https://sourceforge.net/p/cppcheck/discussion/general/thread/fecde59085/
> > -			 */
> > -			/* cppcheck-suppress [uninitvar,unmatchedSuppression] */
> > -			conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked +
> > -			                       conn->seq_init_from_tap;
> > -
> > -			if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap))
> > -				conn->seq_ack_to_tap = prev_ack_to_tap;
> > +	/* At this point we could ack all the data we've accepted for forwarding
> > +	 * (seq_from_tap).  When possible, however, we want to only ack what the
> > +	 * peer has acked.  This makes it appear to the guest more like a direct
> > +	 * connection to the peer, and may improve flow control behaviour.
>
> For consistency, as we don't use "ack" as a verb anywhere else, maybe
> spell it out as "acknowledge" / "acknowledged".

We don't in comments, but there is TAP_FIN_ACKED and bytes_acked_cap
(copied from the kernel's tcpi_bytes_acked).  Adjusted, nonetheless.

> > +	 *
> > +	 * For it to be possible and worth it we need:
> > +	 *  - The TCP_INFO Linux extension which gives us the peer acked bytes
> > +	 *  - Not to be told not to (force_seq)
> > +	 *  - Not half-closed in the peer->guest direction
> > +	 *      With no data coming from the peer, we won't get further events
> > +	 *      which would prompt us to recheck bytes_acked.  We could poll on
> > +	 *      a timer, but that's more trouble than it's worth.
>
> Strictly speaking, we could (and usually do) get further events
> prompting us to check bytes_acked, in the form of segments from the
> guest, but perhaps we can just leave this detail out for brevity,
> unless you want to try and factor that in.

Good point.  I was thinking about the fact that we don't get events
for the fact that bytes_acked has changed of itself, but the comment
doesn't make that clear.  I've tweaked the wording.

> > +	 *  - Not a host local connection
>
> The tcp_rtt_dst_low() is a trick to consider "local" also anything (VMs)
> that's connected to us via veth.
>
> It's not local from a network segment perspective, but it's local to
> the machine, and the same consideration applies (somewhat surprisingly,
> for veth). Same here, I guess we could leave this out for brevity.

I've adjusted the wording in a way I think is better, but it will want
a re-review.

>
> > +	 *      Data goes directly from socket to socket in this case, with
> > +	 *      nothing meaningful "in flight".
> > +	 *  - Large enough send buffer
> > +	 *      If this is small, there's not enough in flight to bother.
> > +	 */
> > +	if (bytes_acked_cap && !force_seq &&
> > +	    !CONN_IS_CLOSING(conn) &&
> > +	    !(conn->flags & LOCAL) && !tcp_rtt_dst_low(conn) &&
> > +	    (unsigned)SNDBUF_GET(conn) >= SNDBUF_SMALL) {
> > +		if (!tinfo) {
> > +			tinfo = &tinfo_new;
> > +			if (getsockopt(s, SOL_TCP, TCP_INFO, tinfo, &sl))
> > +				return 0;
> >  		}
> > +
> > +		/* This trips a cppcheck bug in some versions, including
> > +		 * cppcheck 2.18.3.
> > +		 * https://sourceforge.net/p/cppcheck/discussion/general/thread/fecde59085/
> > +		 */
> > +		/* cppcheck-suppress [uninitvar,unmatchedSuppression] */
> > +		conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked +
> > +			conn->seq_init_from_tap;
>
> Maybe fix the indentation while at it?
>
> 		conn->seq_ack_to_tap = tinfo->tcpi_bytes_acked +
> 				       conn->seq_init_from_tap;

Ah, sure.  That detail of our style isn't known by my editor, so I
missed it.

> > +	} else {
> > +		/* Fall back to acking everything we have */
>
> Maybe specifically refer to what we got so far,
>
> 	/* Fall back to acknowledging everything we got */
>
> ?

Uh, sure, done.

> > +		conn->seq_ack_to_tap = conn->seq_from_tap;
> >  	}
> >  
> > +	/* If the guest is retransmitting, don't let our ACKed sequence go
> > +	 * backwards */
>
> This is the misleading part I realised about, after I mentioned it in:
>
>   https://archives.passt.top/passt-dev/20251007003219.3f286b1d@elisabeth/
>
> ...the reason why we risk rewinding the acknowledged sequence isn't
> that the guest is retransmitting, because in that case we wouldn't have
> advanced conn->seq_to_tap to begin with.

Right, I misunderstood the situation in which this logic is needed.

> The reason is that one of those conditions for using bytes_acked you
> listed above happened to be false, and now it becomes true again.

Aha!

> The only practical one I can think of is the array used by
> tcp_rtt_dst_low() getting full at some point, but later we re-insert the
> peer we're talking to in the table.

Comment updated.
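
The rewind scenario described above can be reproduced in isolation.  What
follows is a minimal sketch under stated assumptions, not passt's actual
code: struct ack_state, seq_lt(), update_ack() and rewind_scenario() are
hypothetical names, the use_bytes_acked flag stands in for all of the
conditions in the patch comment being true at once, and seq_lt() assumes a
signed 32-bit difference, which may differ from how SEQ_LT() is defined
in passt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for SEQ_LT(): true if a precedes b in 32-bit
 * sequence space, allowing for wraparound (assumes |a - b| < 2^31).
 */
static bool seq_lt(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Hypothetical, simplified model of the per-connection sequence state */
struct ack_state {
	uint32_t seq_init_from_tap;	/* initial sequence from the guest */
	uint32_t seq_from_tap;		/* data accepted for forwarding */
	uint32_t seq_ack_to_tap;	/* what we acknowledge to the guest */
};

/* Pick the sequence to acknowledge, then clamp it unconditionally so that
 * it never moves backwards.  use_bytes_acked is an assumption standing in
 * for every condition listed in the patch comment.
 */
static void update_ack(struct ack_state *s, bool use_bytes_acked,
		       uint64_t bytes_acked)
{
	uint32_t prev = s->seq_ack_to_tap;

	if (use_bytes_acked)
		s->seq_ack_to_tap = (uint32_t)bytes_acked +
				    s->seq_init_from_tap;
	else
		s->seq_ack_to_tap = s->seq_from_tap;

	if (seq_lt(s->seq_ack_to_tap, prev))
		s->seq_ack_to_tap = prev;
}

/* Walk through the scenario, return the final acknowledged sequence */
static uint32_t rewind_scenario(void)
{
	struct ack_state s = { .seq_init_from_tap = 1000,
			       .seq_from_tap = 1000,
			       .seq_ack_to_tap = 1000 };

	/* Guest sent 500 bytes, peer acknowledged 200: preferred method */
	s.seq_from_tap = 1500;
	update_ack(&s, true, 200);
	assert(s.seq_ack_to_tap == 1200);

	/* One condition turns false: fall back, acknowledge everything */
	update_ack(&s, false, 200);
	assert(s.seq_ack_to_tap == 1500);

	/* Condition true again, peer only at 300 bytes: without the clamp
	 * the acknowledged sequence would rewind to 1300.
	 */
	update_ack(&s, true, 300);
	return s.seq_ack_to_tap;
}
```

In this model the third update would compute 1300 from the peer's
acknowledged bytes, and the clamp keeps the value at 1500.  No
retransmission from the guest is involved at any step, which matches the
explanation given in the review.
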

> By the way, for consistency:
>
> 	/* Multi-line
> 	 * comment
> 	 */

Done.

> > +	if (SEQ_LT(conn->seq_ack_to_tap, prev_ack_to_tap))
> > +		conn->seq_ack_to_tap = prev_ack_to_tap;
>
> The reason behind the current code structure is to skip this if we
> didn't touch conn->seq_ack_to_tap at all, but the compiler will probably
> figure this out by itself, and even if it doesn't, I guess it's more
> efficient to do this unconditionally anyway.

That was my thinking.  It should all be nearly-free in-register
integer arithmetic, plus a single probably-L1-hot store.  This way
makes for fewer conditionals to parse, primarily for the human reader
though also for the CPU.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson