From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 5 Mar 2026 15:19:40 +1100
From: David Gibson
To: Stefano Brivio
Subject: Re: Pesto protocol proposals
References: <20260305021952.17963c3f@elisabeth>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha512;
	protocol="application/pgp-signature"; boundary="gGUeNAn6gw1QcHbB"
Content-Disposition: inline
In-Reply-To: <20260305021952.17963c3f@elisabeth>
CC: passt-dev@passt.top
X-Mailman-Version: 3.3.8
Precedence: list
List-Id: Development discussion and patches for passt

--gGUeNAn6gw1QcHbB
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Mar 05, 2026 at 02:19:53AM +0100, Stefano Brivio wrote:
> On Wed, 4 Mar 2026 15:28:30 +1100
> David Gibson wrote:
>
> > Most of today and yesterday I've spent thinking about the dynamic
> > update model and protocol. I certainly don't have all the details
> > pinned down, let alone any implementation, but I have come to some
> > conclusions.
> >
> > # Shadow forward table
> >
> > On further consideration, I think this is a bad idea. To avoid peer
> > visible disruption, we don't want to destroy and recreate listening
> > sockets
>
> (Side note: if it's just *listening* sockets, is this actually that
> bad?)

Well, it's obviously much less bad than interrupting existing
connections. It does mean a peer attempting to connect at the wrong
moment might get an ECONNREFUSED which is, as far as it knows, a
permanent error.

> > that are associated with a forward rule that's not being altered.
>
> After reading the rest of your proposal, as long as:
>
> > Doing that with a shadow table would mean we'd need to essentially
> > diff the two tables as we switch. That seems moderately complex,
>
> ...this is the only downside (I can't think of others though), and I
> don't think it's *that* complex as I mentioned, it would be a O(n^2)
> step that can be probably optimised (via sorting) to O(n * log(m)) with
> n new rules and m old rules, cycling on new rules and creating listening
> sockets (we need this part anyway) unless we find (marking it
> somewhere temporarily) a matching one...

I wasn't particularly concerned about the computational cost.
It was more that I couldn't quickly see a clear approach with
unambiguous semantics. But, I think I came up with one now, see later.

> > and
> > kind of silly when then client almost certainly have created the
> > shadow table using specific adds/removes from the original table.
>
> ...even though this is true conceptually, at least at a first glance
> (why would I send 11 rules to add a single rule to a table of 10?), I
> think the other details of the implementation, and conceptual matters
> (such as rollback and two-step activation) make this apparent silliness
> much less relevant, and I'm more and more convinced that a shadow table
> is actually the simplest, most robust, least bug-prone approach.
>
> Especially:
>
> > # Rule states / active bit
> >
> > I think we *do* still want two stage activation of new rules:
>
> ...this part, which led to a huge number of bugs over the years in nft
> / nftables updates, which also use separate insert / activate / commit
> / deactivate / delete operations.

Huh, interesting. I wasn't aware of that, and it's pretty persuasive.

> It's extremely complicated to grasp and implement properly, and you end
> up with a lot of quasi-diffing anyway (to check for duplicates in
> ranges, for example).
>
> It makes much more sense in nftables because you can have hundreds of
> megabytes of data stored in tables, but any usage that was ever
> mentioned for passt in the past ~5 years would seem to imply at most
> hundreds of kilobytes per table.
>
> Shifting complexity to the client is also a relevant topic for me, as we
> decided to have a binary client to avoid anything complicated (parsing)
> in the server. A shadow table allows us to shift even more complexity
> to the client, which is important for security.

I definitely agree in principle - what I wasn't convinced about was
that the overall balance actually favoured the client, because of my
concern over the complexity of that "diff"ing.
But

> I haven't finished drafting a proposal based on this idea, but I plan to
> do it within one day or so.

Actually, you convinced me already, so I can do that.

> It won't be as detailed, because I don't think it's realistic to come
> up with all the details before writing any of the code (what's the
> point if you then have to throw away 70% of it?) but I hope it will be
> complete enough to provide a comparison.
>
> By the way, at least at a first approximation, closing and reopening
> listening sockets will mostly do the trick for anything our users
> (mostly via Podman) will ever reasonably want, so I have half a mind of
> keeping it like that in a first proposal, but indeed we should make
> sure there's a way around it, which is what is is taking me a bit more
> time to demonstrate.

With some more thought I saw a way of doing the "diff" that looks
pretty straightforward and reasonable. Moreover it's less churn of the
existing code, and works nicely with close-and-reopen as an interim
step. It even provides socket continuity for arbitrarily overlapping
ranges in the old and new tables.

For close and re-open, we can implement COMMIT as:
    1. fwd_listen_close() on old table
    2. fwd_listen_sync() on new table

I think we can get socket continuity by swapping the order of those
steps and extending fwd_sync_one() to do:
    for each port:
        if :
            nothing to do
        else if :
            steal socket for new table
        else:
            open/bind/listen new socket

The "steal" would mark the fd as -1 in the old table so
fwd_listen_close() won't get rid of it.

I think the check for a matching socket in the old table will be
moderately expensive, O(n), but not so much as to be a problem in
practice.

> > [...]
> >
> > # Suggested client workflow
> >
> > I suggest the client should:
> >
> > 1. Parse all rule modifications
> > 2. INSERT all new rules
> >     -> On error, DELETE them again
> > 3. DEACTIVATE all removed rules
> >     -> Should only fail if the client has done something wrong
> > 4. ACTIVATE all new rules
> >     -> On error (rule conflict):
> >         DEACTIVATE rules we already ACTIVATEd
> >         ACTIVATE rules we already DEACTIVATEd
> >         DELETE rules we INSERTed
> > 5. Check for bind errors (see details later)
> >     If there are failures we can't tolerate:
> >         DEACTIVATE rules we already ACTIVATEd
> >         ACTIVATE rules we already DEACTIVATEd
> >         DELETE rules we INSERTed
> > 6. DELETE rules we DEACTIVATEd
> >     -> Should only fail if the client has done something wrong
> >
> > DEACTIVATE comes before ACTIVATE to avoid spurious conflicts between
> > new rules and rules we're deleting.
> >
> > I think that gets us closeish to "as atomic as we can be", at least
> > from the perspective of peers. The main case it doesn't catch is that
> > we don't detect rule conflicts until after we might have removed some
> > rules. Is that good enough?
>
> I think it is absolutely fine as an outcome, but the complexity of error
> handling in this case is a bit worrying. This is exactly the kind of
> thing (and we discussed it already a couple of times) that made and
> makes me think that a shadow table is a better approach instead.

I'll work on a more concrete proposal based on the shadow table
approach. There are still some wrinkles to figure out in how to report
bind() errors with this scheme.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

--gGUeNAn6gw1QcHbB--
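
For illustration only, here is a rough sketch of the "sync new table first, then close old table" ordering with socket stealing described in the message. Everything in it is hypothetical: the `*_sketch` names, the tiny port-indexed table, and the fake descriptors are stand-ins, not passt's actual structures or the real fwd_listen_close() / fwd_listen_sync() / fwd_sync_one() signatures. The two branch conditions are one guess at what "nothing to do" and "steal" would test. Indexing by port also makes the lookup in the old table O(1), whereas a per-rule table would need the O(n) search mentioned above, and real matching would presumably also compare addresses and interfaces.

```c
#include <assert.h>

#define NUM_PORTS_SKETCH 16	/* tiny port space, for illustration only */

/* Hypothetical table: one slot per port, fd is -1 when no socket is open */
struct fwd_table_sketch {
	int want[NUM_PORTS_SKETCH];	/* non-zero if a rule covers this port */
	int fd[NUM_PORTS_SKETCH];	/* listening socket, or -1 */
};

static int next_fake_fd = 100;	/* fake descriptors instead of real sockets */
static int closed_count;	/* how many listeners were closed */

/* Stand-in for socket() + bind() + listen() */
static int open_listener_sketch(int port)
{
	(void)port;
	return next_fake_fd++;
}

/* Stand-in for close() */
static void close_listener_sketch(int fd)
{
	(void)fd;
	closed_count++;
}

static void table_init_sketch(struct fwd_table_sketch *t)
{
	int p;

	for (p = 0; p < NUM_PORTS_SKETCH; p++) {
		t->want[p] = 0;
		t->fd[p] = -1;
	}
}

/* First step: populate the next table, reusing old sockets where possible */
static void sync_sketch(struct fwd_table_sketch *next,
			struct fwd_table_sketch *prev)
{
	int p;

	for (p = 0; p < NUM_PORTS_SKETCH; p++) {
		if (!next->want[p] || next->fd[p] >= 0)
			continue;	/* not wanted or already open: nothing to do */

		if (prev->fd[p] >= 0) {
			next->fd[p] = prev->fd[p];	/* steal the socket */
			prev->fd[p] = -1;	/* hide it from the close step */
		} else {
			next->fd[p] = open_listener_sketch(p);
		}
	}
}

/* Second step: close whatever was not stolen from the old table */
static void close_all_sketch(struct fwd_table_sketch *t)
{
	int p;

	for (p = 0; p < NUM_PORTS_SKETCH; p++) {
		if (t->fd[p] >= 0) {
			close_listener_sketch(t->fd[p]);
			t->fd[p] = -1;
		}
	}
}

/* COMMIT with the order swapped: sync first, close second */
static void commit_sketch(struct fwd_table_sketch *next,
			  struct fwd_table_sketch *prev)
{
	assert(next != prev);
	sync_sketch(next, prev);
	close_all_sketch(prev);
}
```

With an old table covering ports 1 and 2 and a new table covering ports 2 and 3, this keeps the descriptor for port 2, opens one for port 3, and closes only the one for port 1, so a peer connecting to port 2 during the switch would not see the listener disappear.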