From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 6 Mar 2026 23:23:50 +1100
From: David Gibson
To: Stefano Brivio
CC: passt-dev@passt.top
Subject: Re: Pesto protocol proposals
References: <20260305021952.17963c3f@elisabeth> <20260306101827.38124251@elisabeth>
In-Reply-To: <20260306101827.38124251@elisabeth>
List-Id: Development discussion and patches for passt

On Fri, Mar 06, 2026 at 10:18:27AM +0100, Stefano Brivio wrote:
> On Thu, 5 Mar 2026 15:19:40 +1100
> David Gibson wrote:
>
> > On Thu, Mar 05, 2026 at 02:19:53AM +0100, Stefano Brivio wrote:
> > > On Wed, 4 Mar 2026 15:28:30 +1100
> > > David Gibson wrote:
> > >
> > > > Most of today and yesterday I've spent thinking about the dynamic
> > > > update model and protocol. I certainly don't have all the details
> > > > pinned down, let alone any implementation, but I have come to some
> > > > conclusions.
> > > >
> > > > # Shadow forward table
> > > >
> > > > On further consideration, I think this is a bad idea. To avoid peer
> > > > visible disruption, we don't want to destroy and recreate listening
> > > > sockets
> > >
> > > (Side note: if it's just *listening* sockets, is this actually that
> > > bad?)
> >
> > Well, it's obviously much less bad than interrupting existing
> > connections. It does mean a peer attempting to connect at the wrong
> > moment might get an ECONNREFUSED, as far as it knows, a permanent
> > error.
>
> Right. Now, I'm not sure if it helps simplifying the plan from your new
> proposal even further but... consider this: *for the moment being* (as
> Podman will most likely be the only user of this feature for presumably
> a couple of months), it would simply mean that when Podman adds a
> container to an existing custom network, there are a couple of
> milliseconds during which new connections to existing containers are
> not accepted.
>
> Surely something that needs to be fixed, but not an outrageous issue if
> you ask me.
> On the other hand, maybe it's structural enough that we
> want to get it right in the first place. Of course you know better about
> this.

Yeah, as discussed in my revised proposal I think that's a good
interim step.

> > > > that are associated with a forward rule that's not being altered.
> > >
> > > After reading the rest of your proposal, as long as:
> > >
> > > > Doing that with a shadow table would mean we'd need to essentially
> > > > diff the two tables as we switch. That seems moderately complex,
> > >
> > > ...this is the only downside (I can't think of others though), and I
> > > don't think it's *that* complex as I mentioned, it would be a O(n^2)
> > > step that can be probably optimised (via sorting) to O(n * log(m)) with
> > > n new rules and m old rules, cycling on new rules and creating listening
> > > sockets (we need this part anyway) unless we find (marking it
> > > somewhere temporarily) a matching one...
> >
> > I wasn't particularly concerned about the computational cost. It was
> > more that I couldn't quickly see a clear approach with unambiguous
> > semantics. But, I think I came up with one now, see later.
>
> Ah, sorry, I assumed it was a combination of the two, that is, I
> thought it would be sort of straightforward to do it (at least
> initially) as O(n^2) worst case but you were considering it
> unsustainable. On the other hand we have 256 rules...

Right. As long as the maximum number of rules remains that order of
magnitude, I think O(n^2) is acceptable for this fairly rare
operation.

> > > > and
> > > > kind of silly when the client will almost certainly have created the
> > > > shadow table using specific adds/removes from the original table.
> > >
> > > ...even though this is true conceptually, at least at a first glance
> > > (why would I send 11 rules to add a single rule to a table of 10?), I
> > > think the other details of the implementation, and conceptual matters
> > > (such as rollback and two-step activation) make this apparent silliness
> > > much less relevant, and I'm more and more convinced that a shadow table
> > > is actually the simplest, most robust, least bug-prone approach.
> > >
> > > Especially:
> > >
> > > > # Rule states / active bit
> > > >
> > > > I think we *do* still want two stage activation of new rules:
> > >
> > > ...this part, which led to a huge number of bugs over the years in nft
> > > / nftables updates, which also use separate insert / activate / commit
> > > / deactivate / delete operations.
> >
> > Huh, interesting. I wasn't aware of that, and it's pretty persuasive.
> >
> > > It's extremely complicated to grasp and implement properly, and you end
> > > up with a lot of quasi-diffing anyway (to check for duplicates in
> > > ranges, for example).
> > >
> > > It makes much more sense in nftables because you can have hundreds of
> > > megabytes of data stored in tables, but any usage that was ever
> > > mentioned for passt in the past ~5 years would seem to imply at most
> > > hundreds of kilobytes per table.
> > >
> > > Shifting complexity to the client is also a relevant topic for me, as we
> > > decided to have a binary client to avoid anything complicated (parsing)
> > > in the server. A shadow table allows us to shift even more complexity
> > > to the client, which is important for security.
> >
> > I definitely agree in principle - what I wasn't convinced about was
> > that the overall balance actually favoured the client, because of my
> > concern over the complexity of that "diff"ing.
> > But
> >
> > > I haven't finished drafting a proposal based on this idea, but I plan to
> > > do it within one day or so.
> >
> > Actually, you convinced me already, so I can do that.
> >
> > > It won't be as detailed, because I don't think it's realistic to come
> > > up with all the details before writing any of the code (what's the
> > > point if you then have to throw away 70% of it?) but I hope it will be
> > > complete enough to provide a comparison.
> > >
> > > By the way, at least at a first approximation, closing and reopening
> > > listening sockets will mostly do the trick for anything our users
> > > (mostly via Podman) will ever reasonably want, so I have half a mind to
> > > keep it like that in a first proposal, but indeed we should make
> > > sure there's a way around it, which is what is taking me a bit more
> > > time to demonstrate.
> >
> > With some more thought I saw a way of doing the "diff" that looks
> > pretty straightforward and reasonable. Moreover it's less churn of
> > the existing code, and works nicely with close-and-reopen as an
> > interim step. It even provides socket continuity for arbitrarily
> > overlapping ranges in the old and new tables.
>
> Oh, great! I was stuck pretty much at this point:
>
> > For close and re-open, we can implement COMMIT as:
> >     1. fwd_listen_close() on old table
> >     2. fwd_listen_sync() on new table
>
> ...trying to figure out how interleaved (table vs. single socket) these
> steps would be. In my mind I actually thought we would just call
> fwd_listen_sync() which would make the diff itself and close left-over
> sockets as needed but:

Eh, that's basically just a question of what we name functions. My
point is that the above will work with the existing implementation of
fwd_listen_sync().

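
A rough sketch of the close-and-reopen COMMIT quoted above, purely for
illustration. The struct layout and the stub_*() / table_*() /
commit_close_reopen() names are assumptions made up for this example,
not the real fwd_listen_close() / fwd_listen_sync() signatures, and
sockets are faked with counters so that the snippet stands alone:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RULES 256	/* order of magnitude mentioned in this thread */

/* Hypothetical, simplified rule: one port, at most one listening socket */
struct rule {
	unsigned short port;
	int fd;			/* -1 if no socket is open for this rule */
};

struct table {
	struct rule rules[MAX_RULES];
	size_t count;
};

/* Stand-ins for socket()/bind()/listen() and close(): they only count */
static int next_fd = 3, opened, closed;

static int stub_listen(unsigned short port)
{
	(void)port;
	opened++;
	return next_fd++;
}

static void stub_close(int fd)
{
	(void)fd;
	closed++;
}

/* Step 1: close every listening socket the table still owns */
static void table_listen_close(struct table *t)
{
	for (size_t i = 0; i < t->count; i++) {
		if (t->rules[i].fd >= 0) {
			stub_close(t->rules[i].fd);
			t->rules[i].fd = -1;
		}
	}
}

/* Step 2: open a socket for every rule that doesn't have one yet */
static void table_listen_sync(struct table *t)
{
	for (size_t i = 0; i < t->count; i++) {
		if (t->rules[i].fd < 0)
			t->rules[i].fd = stub_listen(t->rules[i].port);
	}
}

/* COMMIT, close-and-reopen flavour: all old sockets go away first */
static void commit_close_reopen(struct table *old, struct table *shadow)
{
	assert(shadow->count <= MAX_RULES);
	table_listen_close(old);
	table_listen_sync(shadow);
}
```

With an old table forwarding ports 80 and 443 and a shadow table that
only adds 8080, this closes two sockets and opens three: the unchanged
ports go through a brief window with no listener, which is the
ECONNREFUSED case discussed earlier in the thread.
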
> > I think we can get socket continuity by swapping the order of those
> > steps and extending fwd_sync_one() to do:
> >     for each port:
> >         if :
> >             nothing to do
> >         else if :
> >             steal socket for new table
> >         else:
> >             open/bind/listen new socket
> >
> > The "steal" would mark the fd as -1 in the old table so
> > fwd_listen_close() won't get rid of it.
>
> ...this should be more practical I guess.
>
> > I think the check for a matching socket in the old table will be
> > moderately expensive O(n), but not so much as to be a problem in
> > practice.
>
> And again we could sort them eventually, which should make things
> O(log(n)) on average (still O(n^2) worst case I guess).
>
> > > > [...]
> > > >
> > > > # Suggested client workflow
> > > >
> > > > I suggest the client should:
> > > >
> > > > 1. Parse all rule modifications
> > > > 2. INSERT all new rules
> > > >     -> On error, DELETE them again
> > > > 3. DEACTIVATE all removed rules
> > > >     -> Should only fail if the client has done something wrong
> > > > 4. ACTIVATE all new rules
> > > >     -> On error (rule conflict):
> > > >         DEACTIVATE rules we already ACTIVATEd
> > > >         ACTIVATE rules we already DEACTIVATEd
> > > >         DELETE rules we INSERTed
> > > > 5. Check for bind errors (see details later)
> > > >     If there are failures we can't tolerate:
> > > >         DEACTIVATE rules we already ACTIVATEd
> > > >         ACTIVATE rules we already DEACTIVATEd
> > > >         DELETE rules we INSERTed
> > > > 6. DELETE rules we DEACTIVATEd
> > > >     -> Should only fail if the client has done something wrong
> > > >
> > > > DEACTIVATE comes before ACTIVATE to avoid spurious conflicts between
> > > > new rules and rules we're deleting.
> > > >
> > > > I think that gets us closeish to "as atomic as we can be", at least
> > > > from the perspective of peers.
> > > > The main case it doesn't catch is that
> > > > we don't detect rule conflicts until after we might have removed some
> > > > rules. Is that good enough?
> > >
> > > I think it is absolutely fine as an outcome, but the complexity of error
> > > handling in this case is a bit worrying. This is exactly the kind of
> > > thing (and we discussed it already a couple of times) that made and
> > > makes me think that a shadow table is a better approach instead.
> >
> > I'll work on a more concrete proposal based on the shadow table
> > approach. There are still some wrinkles with how to report bind()
> > errors with this scheme to figure out.
>
> I was thinking that with this scheme we would just report success or
> failure without any further detail (except for warnings / error
> messages we might print, but not part of the protocol), at least at the
> beginning.
>
> I'll comment on your new proposal in more detail though.
>
> -- 
> Stefano
>

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson