From mboxrd@z Thu Jan  1 00:00:00 1970
Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au
Authentication-Results: passt.top;
	dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202602 header.b=dQjE551a;
	dkim-atps=neutral
Received: from mail.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3])
	by passt.top (Postfix) with ESMTPS id 7D7405A0265
	for <passt-dev@passt.top>; Thu, 05 Mar 2026 05:19:51 +0100 (CET)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=gibson.dropbear.id.au; s=202602; t=1772684387;
	bh=bEMdvRSKpW1umMvQVJS9dWYpCZb0CaLWc58WiLh49WA=;
	h=Date:From:To:Cc:Subject:References:In-Reply-To:From;
	b=dQjE551a4RguAy8uYLAblfRRXwb6UWN8co0WWN1h42uSHwnWrmlB7dLz/E96WuD8P
	 jIOV9koIqG/xl+fFz1t39EJMKfGDC6gDDmdoK+aaCUMDIjHAIsHItQdWxt5wBkZbV0
	 XbOAZVKPca60goCSSUXYSZ663ZunwsRrPGjRMZhSlThlqzJaLjfiFw5fRrZ+R/Nvsz
	 v3aBO6hzIFqxuSzQXGDaY6WpxTHZv17QE5O7FpigrHMZ8GaNvTDK5uAvqw84cES6iy
	 v550O6URE2JAUKNMcBRVHJT+5D8PD0CuLgyCnHMZsI71kIE/CIDK7Ianr7BAfj//PW
	 359yOaSTHaS3g==
Received: by gandalf.ozlabs.org (Postfix, from userid 1007)
	id 4fRGXM0MM7z4wCJ; Thu, 05 Mar 2026 15:19:47 +1100 (AEDT)
Date: Thu, 5 Mar 2026 15:19:40 +1100
From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Subject: Re: Pesto protocol proposals
Message-ID: <aakEXBLxxkV2YDLE@zatzit>
References: <aae07j0fhcXOFeab@zatzit>
 <20260305021952.17963c3f@elisabeth>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha512;
	protocol="application/pgp-signature"; boundary="gGUeNAn6gw1QcHbB"
Content-Disposition: inline
In-Reply-To: <20260305021952.17963c3f@elisabeth>
Message-ID-Hash: VXT5LVTSVZFNVJOII5HNA6WAVBWDNKET
X-Message-ID-Hash: VXT5LVTSVZFNVJOII5HNA6WAVBWDNKET
X-MailFrom: dgibson@gandalf.ozlabs.org
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header
CC: passt-dev@passt.top
X-Mailman-Version: 3.3.8
Precedence: list
List-Id: Development discussion and patches for passt <passt-dev.passt.top>
Archived-At: <https://archives.passt.top/passt-dev/aakEXBLxxkV2YDLE@zatzit/>
Archived-At: <https://passt.top/hyperkitty/list/passt-dev@passt.top/message/VXT5LVTSVZFNVJOII5HNA6WAVBWDNKET/>
List-Archive: <https://archives.passt.top/passt-dev/>
List-Archive: <https://passt.top/hyperkitty/list/passt-dev@passt.top/>
List-Help: <mailto:passt-dev-request@passt.top?subject=help>
List-Owner: <mailto:passt-dev-owner@passt.top>
List-Post: <mailto:passt-dev@passt.top>
List-Subscribe: <mailto:passt-dev-join@passt.top>
List-Unsubscribe: <mailto:passt-dev-leave@passt.top>


--gGUeNAn6gw1QcHbB
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Mar 05, 2026 at 02:19:53AM +0100, Stefano Brivio wrote:
> On Wed, 4 Mar 2026 15:28:30 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>=20
> > Most of today and yesterday I've spent thinking about the dynamic
> > update model and protocol.  I certainly don't have all the details
> > pinned down, let alone any implementation, but I have come to some
> > conclusions.
> >=20
> > # Shadow forward table
> >=20
> > On further consideration, I think this is a bad idea.  To avoid peer
> > visible disruption, we don't want to destroy and recreate listening
> > sockets
>=20
> (Side note: if it's just *listening* sockets, is this actually that
> bad?)

Well, it's obviously much less bad than interrupting existing
connections.  It does mean a peer attempting to connect at the wrong
moment might get an ECONNREFUSED which, as far as it knows, is a
permanent error.

> > that are associated with a forward rule that's not being altered.
>=20
> After reading the rest of your proposal, as long as:
>=20
> > Doing that with a shadow table would mean we'd need to essentially
> > diff the two tables as we switch.  That seems moderately complex,
>=20
> ...this is the only downside (I can't think of others though), and I
> don't think it's *that* complex as I mentioned, it would be a O(n^2)
> step that can be probably optimised (via sorting) to O(n * log(m)) with
> n new rules and m old rules, cycling on new rules and creating listening
> sockets (we need this part anyway) unless we find (marking it
> somewhere temporarily) a matching one...

I wasn't particularly concerned about the computational cost.  It was
more that I couldn't quickly see a clear approach with unambiguous
semantics.  But I think I've come up with one now, see below.

> > and
> > kind of silly when then client almost certainly have created the
> > shadow table using specific adds/removes from the original table.
>=20
> ...even though this is true conceptually, at least at a first glance
> (why would I send 11 rules to add a single rule to a table of 10?), I
> think the other details of the implementation, and conceptual matters
> (such as rollback and two-step activation) make this apparent silliness
> much less relevant, and I'm more and more convinced that a shadow table
> is actually the simplest, most robust, least bug-prone approach.
>=20
> Especially:
>=20
> > # Rule states / active bit
> >=20
> > I think we *do* still want two stage activation of new rules:
>=20
> ...this part, which led to a huge number of bugs over the years in nft
> / nftables updates, which also use separate insert / activate / commit
> / deactivate / delete operations.

Huh, interesting.  I wasn't aware of that, and it's pretty persuasive.

> It's extremely complicated to grasp and implement properly, and you end
> up with a lot of quasi-diffing anyway (to check for duplicates in
> ranges, for example).
>=20
> It makes much more sense in nftables because you can have hundreds of
> megabytes of data stored in tables, but any usage that was ever
> mentioned for passt in the past ~5 years would seem to imply at most
> hundreds of kilobytes per table.
>=20
> Shifting complexity to the client is also a relevant topic for me, as we
> decided to have a binary client to avoid anything complicated (parsing)
> in the server. A shadow table allows us to shift even more complexity
> to the client, which is important for security.

I definitely agree in principle - what I wasn't convinced about was
that the overall balance actually favoured the client, because of my
concern over the complexity of that "diff"ing.  But=20

> I haven't finished drafting a proposal based on this idea, but I plan to
> do it within one day or so.

Actually, you convinced me already, so I can do that.

> It won't be as detailed, because I don't think it's realistic to come
> up with all the details before writing any of the code (what's the
> point if you then have to throw away 70% of it?) but I hope it will be
> complete enough to provide a comparison.
>=20
> By the way, at least at a first approximation, closing and reopening
> listening sockets will mostly do the trick for anything our users
> (mostly via Podman) will ever reasonably want, so I have half a mind of
> keeping it like that in a first proposal, but indeed we should make
> sure there's a way around it, which is what is is taking me a bit more
> time to demonstrate.

With some more thought I saw a way of doing the "diff" that looks
pretty straightforward and reasonable.  Moreover it's less churn of
the existing code, and works nicely with close-and-reopen as an
interim step.  It even provides socket continuity for arbitrarily
overlapping ranges in the old and new tables.

For close and re-open, we can implement COMMIT as:
	1. fwd_listen_close() on old table
	2. fwd_listen_sync() on new table

I think we can get socket continuity by swapping the order of those
steps and extending fwd_sync_one() to do:
	for each port:
	    if <already opened>:
	        nothing to do
<new>	    else if <matching open socket in old table>:
<new>	        steal socket for new table
	    else:
	        open/bind/listen new socket

The "steal" would mark the fd as -1 in the old table so
fwd_listen_close() won't get rid of it.

I think the check for a matching socket in the old table will be
moderately expensive, O(n), but not so much as to be a problem in
practice.
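
To make the swapped order a bit more concrete, here's a rough,
untested-against-passt sketch of the idea.  Everything in it is a
simplified stand-in, not the real code: struct port_entry, struct
fwd_table, table_sync(), table_close() and table_commit() are
hypothetical names, the tables are flat arrays of (port, fd) pairs,
and fake_listen() just hands out made-up fd numbers instead of doing
socket()/bind()/listen().  A real match would presumably also need to
compare address, protocol and interface, not just the port.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PORTS 8

/* Hypothetical, simplified stand-in for one listening port of a rule */
struct port_entry {
	unsigned short port;
	int fd;			/* -1 if no socket is open for this port */
};

/* Hypothetical flat table; the real forwarding table is more involved */
struct fwd_table {
	struct port_entry e[MAX_PORTS];
	size_t count;
};

static int next_fake_fd = 100;	/* fake fd allocator, for illustration */
static int closed_count;	/* how many sockets table_close() "closed" */

/* Stand-in for socket()/bind()/listen(): just returns a fresh number */
static int fake_listen(unsigned short port)
{
	(void)port;
	return next_fake_fd++;
}

/* Linear search for an open socket on the same port: the O(n) check */
static struct port_entry *find_open(struct fwd_table *t, unsigned short port)
{
	for (size_t i = 0; i < t->count; i++)
		if (t->e[i].port == port && t->e[i].fd >= 0)
			return &t->e[i];
	return NULL;
}

/* Open whatever nt needs, stealing matching sockets from ot if given */
static void table_sync(struct fwd_table *nt, struct fwd_table *ot)
{
	assert(nt->count <= MAX_PORTS);

	for (size_t i = 0; i < nt->count; i++) {
		struct port_entry *p = &nt->e[i], *o;

		if (p->fd >= 0)
			continue;		/* already opened */

		if (ot && (o = find_open(ot, p->port))) {
			p->fd = o->fd;		/* steal for new table */
			o->fd = -1;		/* so close skips it */
		} else {
			p->fd = fake_listen(p->port);
		}
	}
}

/* Close every socket still owned by the table */
static void table_close(struct fwd_table *t)
{
	for (size_t i = 0; i < t->count; i++) {
		if (t->e[i].fd >= 0) {
			closed_count++;		/* real code would close() */
			t->e[i].fd = -1;
		}
	}
}

/* COMMIT with socket continuity: sync the new table first, then close */
static void table_commit(struct fwd_table *ot, struct fwd_table *nt)
{
	table_sync(nt, ot);
	table_close(ot);
}
```

With an old table listening on ports 80 and 443 and a new table
wanting 443 and 8080, the 443 socket is carried over untouched, only
the port 80 socket gets closed, and only 8080 needs a fresh socket.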

> > [...]
> >
> > # Suggested client workflow
> >=20
> > I suggest the client should:
> >=20
> >    1. Parse all rule modifications
> >    2. INSERT all new rules
> >       -> On error, DELETE them again =20
> >    3. DEACTIVATE all removed rules
> >       -> Should only fail if the client has done something wrong =20
> >    4. ACTIVATE all new rules
> >       -> On error (rule conflict): =20
> >          DEACTIVATE rules we already ACTIVATEd
> > 	 ACTIVATE rules we already DEACTIVATEd
> > 	 DELETE rules we INSERTed
> >    5. Check for bind errors (see details later)
> >       If there are failures we can't tolerate:
> >          DEACTIVATE rules we already ACTIVATEd
> > 	 ACTIVATE rules we already DEACTIVATEd
> > 	 DELETE rules we INSERTed
> >    6. DELETE rules we DEACTIVATEd
> >       -> Should only fail if the client has done something wrong =20
> >=20
> > DEACTIVATE comes before ACTIVATE to avoid spurious conflicts between
> > new rules and rules we're deleting.
> >=20
> > I think that gets us closeish to "as atomic as we can be", at least
> > from the perspective of peers.  The main case it doesn't catch is that
> > we don't detect rule conflicts until after we might have removed some
> > rules.  Is that good enough?
>=20
> I think it is absolutely fine as an outcome, but the complexity of error
> handling in this case is a bit worrying. This is exactly the kind of
> thing (and we discussed it already a couple of times) that made and
> makes me think that a shadow table is a better approach instead.

I'll work on a more concrete proposal based on the shadow table
approach.  There are still some wrinkles to figure out in how to
report bind() errors with this scheme.

--=20
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

--gGUeNAn6gw1QcHbB
Content-Type: application/pgp-signature; name=signature.asc

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCgAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmmpBFsACgkQzQJF27ox
2GcTdg//Vp4OlhjUpKkbcZ916rEkXLwX1T0gz7YuI/7mOUOQ4PqPWeISQjvkIr+f
wzXfpaa0w2sxY1gC23QPKKtqA2sszf+Ezx3tSOKSm17bfgcD4ynnJdpc90x7e79V
WZPw9ii+/fNEoWl9Hoc8d4yM0shfUFxKbMzAYzOefNHzz09BPa6kjoympNdb5a7y
2CRKBpWiHdfIy3Dm7AeQODq1RRYqzCOBRkXl3ij80xcIQWEACkzPEQCaKvcXvFJU
Br2PPVMTGwGRfrMF2a6/A+4AnzgPF0suD+saAqLjJBLBxc+xhv30JF4RWEVvZzm5
phNP71GbRF3KnRRpbC4cbOIKutTXh6pSelJjqVAsfwHMHE/PaseG9NRZC0MOmD3U
TtlRMgh9uykSNNycys6YTU04lHnpqdoNM+50Dm0XHCvpYNd99RsDQn0ViR8tZh2X
IbTOENv8fvsrBmOeYUnn4NaGfjenYIReEjYdF+yZBJ9fTnGP57DZn9i0tYL8xG19
lATitd8i7AO4hio8CNkep4Y5zydY0WFMr9uKoKRpXvkzNuAwEYPe2UipJJMVUyxd
wQlec/FueHTIlUZmK5ioQAHB/R6E44EjrDiYhh1MpPbqlH9DHX1KPtD9yZPF+9Xn
5zgbkHdL8GkkrSgaGNUczia/rV9gKRite1AB2hTf1ItGMAob2X4=
=jmCG
-----END PGP SIGNATURE-----

--gGUeNAn6gw1QcHbB--