From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top
Subject: Re: Pesto protocol proposals
Date: Thu, 5 Mar 2026 15:19:40 +1100
Message-ID: <aakEXBLxxkV2YDLE@zatzit>
In-Reply-To: <20260305021952.17963c3f@elisabeth>
On Thu, Mar 05, 2026 at 02:19:53AM +0100, Stefano Brivio wrote:
> On Wed, 4 Mar 2026 15:28:30 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > Most of today and yesterday I've spent thinking about the dynamic
> > update model and protocol. I certainly don't have all the details
> > pinned down, let alone any implementation, but I have come to some
> > conclusions.
> >
> > # Shadow forward table
> >
> > On further consideration, I think this is a bad idea. To avoid peer
> > visible disruption, we don't want to destroy and recreate listening
> > sockets
>
> (Side note: if it's just *listening* sockets, is this actually that
> bad?)
Well, it's obviously much less bad than interrupting existing
connections. It does mean a peer attempting to connect at the wrong
moment might get an ECONNREFUSED which is, as far as it knows, a
permanent error.
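
To make that failure mode concrete, here is a small Python sketch (not
passt code; open_listener() and try_connect() are made-up helpers) of
what a peer on loopback sees if it connects while a listening socket is
closed and not yet reopened. It assumes a Linux-like TCP stack.

```python
import errno
import socket

def open_listener(port=0):
    """Open a TCP listening socket on loopback; port 0 picks a free port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("127.0.0.1", port))
    s.listen(1)
    return s

def try_connect(port):
    """Return None if connect() succeeds, otherwise the errno it failed with."""
    c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        c.connect(("127.0.0.1", port))
        return None
    except OSError as e:
        return e.errno
    finally:
        c.close()

listener = open_listener()
port = listener.getsockname()[1]

before = try_connect(port)      # listener is up
listener.close()                # "close" half of close-and-reopen
during = try_connect(port)      # peer connects in the gap
listener = open_listener(port)  # "reopen" half
after = try_connect(port)       # listener is up again
listener.close()

print(before, errno.errorcode.get(during), after)
```

The peer in the middle has no way to tell this apart from a service
that simply isn't there.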
> > that are associated with a forward rule that's not being altered.
>
> After reading the rest of your proposal, as long as:
>
> > Doing that with a shadow table would mean we'd need to essentially
> > diff the two tables as we switch. That seems moderately complex,
>
> ...this is the only downside (I can't think of others though), and I
> don't think it's *that* complex as I mentioned, it would be a O(n^2)
> step that can be probably optimised (via sorting) to O(n * log(m)) with
> n new rules and m old rules, cycling on new rules and creating listening
> sockets (we need this part anyway) unless we find (marking it
> somewhere temporarily) a matching one...
I wasn't particularly concerned about the computational cost. It was
more that I couldn't quickly see a clear approach with unambiguous
semantics. But I think I've come up with one now; see below.
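
For what it's worth, the sort-based matching mentioned in the quoted
text could look roughly like the Python sketch below. The rule
representation (a plain tuple) and the match_rules() name are
assumptions for illustration only. It only pairs up rules that are
exactly identical, so it says nothing about partially overlapping
ranges, and it assumes neither table contains duplicate rules.

```python
from bisect import bisect_left

def match_rules(old, new):
    """Split rules into (kept, added, removed) by exact match.

    Sorting the m old rules costs O(m log m); each of the n new rules
    is then looked up by binary search in O(log m).
    """
    old_sorted = sorted(old)
    used = [False] * len(old_sorted)   # temporary "already matched" marks
    kept, added = [], []
    for rule in new:
        i = bisect_left(old_sorted, rule)
        if i < len(old_sorted) and old_sorted[i] == rule and not used[i]:
            used[i] = True
            kept.append(rule)          # sockets could be carried over
        else:
            added.append(rule)         # needs new listening sockets
    removed = [r for r, u in zip(old_sorted, used) if not u]
    return kept, added, removed

# Hypothetical rules as (protocol, first port, last port)
old_table = [("tcp", 80, 80), ("tcp", 8000, 8010), ("udp", 53, 53)]
new_table = [("udp", 53, 53), ("tcp", 443, 443), ("tcp", 8000, 8010)]
kept, added, removed = match_rules(old_table, new_table)
print(kept, added, removed)
```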
> > and
> > kind of silly when then client almost certainly have created the
> > shadow table using specific adds/removes from the original table.
>
> ...even though this is true conceptually, at least at a first glance
> (why would I send 11 rules to add a single rule to a table of 10?), I
> think the other details of the implementation, and conceptual matters
> (such as rollback and two-step activation) make this apparent silliness
> much less relevant, and I'm more and more convinced that a shadow table
> is actually the simplest, most robust, least bug-prone approach.
>
> Especially:
>
> > # Rule states / active bit
> >
> > I think we *do* still want two stage activation of new rules:
>
> ...this part, which led to a huge number of bugs over the years in nft
> / nftables updates, which also use separate insert / activate / commit
> / deactivate / delete operations.
Huh, interesting. I wasn't aware of that, and it's pretty persuasive.
> It's extremely complicated to grasp and implement properly, and you end
> up with a lot of quasi-diffing anyway (to check for duplicates in
> ranges, for example).
>
> It makes much more sense in nftables because you can have hundreds of
> megabytes of data stored in tables, but any usage that was ever
> mentioned for passt in the past ~5 years would seem to imply at most
> hundreds of kilobytes per table.
>
> Shifting complexity to the client is also a relevant topic for me, as we
> decided to have a binary client to avoid anything complicated (parsing)
> in the server. A shadow table allows us to shift even more complexity
> to the client, which is important for security.
I definitely agree in principle - what I wasn't convinced about was
that the overall balance actually favoured the client, because of my
concern over the complexity of that "diff"ing. But see below.
> I haven't finished drafting a proposal based on this idea, but I plan to
> do it within one day or so.
Actually, you convinced me already, so I can do that.
> It won't be as detailed, because I don't think it's realistic to come
> up with all the details before writing any of the code (what's the
> point if you then have to throw away 70% of it?) but I hope it will be
> complete enough to provide a comparison.
>
> By the way, at least at a first approximation, closing and reopening
> listening sockets will mostly do the trick for anything our users
> (mostly via Podman) will ever reasonably want, so I have half a mind of
> keeping it like that in a first proposal, but indeed we should make
> sure there's a way around it, which is what is taking me a bit more
> time to demonstrate.
With some more thought I saw a way of doing the "diff" that looks
pretty straightforward and reasonable. Moreover it's less churn of
the existing code, and works nicely with close-and-reopen as an
interim step. It even provides socket continuity for arbitrarily
overlapping ranges in the old and new tables.
For close and re-open, we can implement COMMIT as:

    1. fwd_listen_close() on old table
    2. fwd_listen_sync() on new table

I think we can get socket continuity by swapping the order of those
steps and extending fwd_sync_one() to do:

    for each port:
        if <already opened>:
            nothing to do
<new>   else if <matching open socket in old table>:
<new>       steal socket for new table
        else:
            open/bind/listen new socket

The "steal" would mark the fd as -1 in the old table so
fwd_listen_close() won't get rid of it.
I think the check for a matching socket in the old table will be
moderately expensive (O(n)), but not so much as to be a problem in
practice.
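
A toy Python model of that COMMIT ordering, to check the idea hangs
together. This is only a sketch: Rule, open_listen(), find_open(),
listen_sync(), listen_close() and commit() are hypothetical stand-ins
for the real C code, "sockets" are just integers from a counter, and
matching is on protocol and port only, whereas a real match would also
have to consider things like the bound address.

```python
from dataclasses import dataclass, field
from itertools import count

_next_fd = count(100)   # fake fd allocator
closed = []             # fake fds that were "closed", in order

def open_listen(proto, port):
    """Stand-in for open/bind/listen: hand out a fresh fake fd."""
    return next(_next_fd)

@dataclass
class Rule:
    proto: str
    first: int
    last: int
    fds: dict = field(default_factory=dict)   # port -> fd, -1 if not open

def find_open(table, proto, port):
    """Linear scan of a table for a rule holding an open socket for this port."""
    for rule in table:
        if rule.proto == proto and rule.fds.get(port, -1) >= 0:
            return rule
    return None

def listen_sync(new, old):
    for rule in new:
        for port in range(rule.first, rule.last + 1):
            if rule.fds.get(port, -1) >= 0:
                continue                            # already opened
            donor = find_open(old, rule.proto, port)
            if donor is not None:
                rule.fds[port] = donor.fds[port]    # steal socket
                donor.fds[port] = -1                # old table forgets it
            else:
                rule.fds[port] = open_listen(rule.proto, port)

def listen_close(table):
    for rule in table:
        for port, fd in rule.fds.items():
            if fd >= 0:
                closed.append(fd)
                rule.fds[port] = -1

def commit(old, new):
    listen_sync(new, old)   # sync first, so sockets can be stolen...
    listen_close(old)       # ...then close whatever the old table still holds
    return new

old = [Rule("tcp", 8000, 8003)]
listen_sync(old, [])                 # initial table, nothing to steal from
before = dict(old[0].fds)
new = [Rule("tcp", 8002, 8005)]      # overlaps the old range on 8002-8003
commit(old, new)
print(before, new[0].fds, closed)
```

Ports 8002 and 8003 keep the same fd across the switch, 8004 and 8005
get new sockets, and only the sockets for 8000 and 8001 are closed.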
> > [...]
> >
> > # Suggested client workflow
> >
> > I suggest the client should:
> >
> > 1. Parse all rule modifications
> > 2. INSERT all new rules
> > -> On error, DELETE them again
> > 3. DEACTIVATE all removed rules
> > -> Should only fail if the client has done something wrong
> > 4. ACTIVATE all new rules
> > -> On error (rule conflict):
> > DEACTIVATE rules we already ACTIVATEd
> > ACTIVATE rules we already DEACTIVATEd
> > DELETE rules we INSERTed
> > 5. Check for bind errors (see details later)
> > If there are failures we can't tolerate:
> > DEACTIVATE rules we already ACTIVATEd
> > ACTIVATE rules we already DEACTIVATEd
> > DELETE rules we INSERTed
> > 6. DELETE rules we DEACTIVATEd
> > -> Should only fail if the client has done something wrong
> >
> > DEACTIVATE comes before ACTIVATE to avoid spurious conflicts between
> > new rules and rules we're deleting.
> >
> > I think that gets us closeish to "as atomic as we can be", at least
> > from the perspective of peers. The main case it doesn't catch is that
> > we don't detect rule conflicts until after we might have removed some
> > rules. Is that good enough?
>
> I think it is absolutely fine as an outcome, but the complexity of error
> handling in this case is a bit worrying. This is exactly the kind of
> thing (and we discussed it already a couple of times) that made and
> makes me think that a shadow table is a better approach instead.
I'll work on a more concrete proposal based on the shadow table
approach. There are still some wrinkles to figure out in how to report
bind() errors with this scheme.
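
For comparison, here is roughly what the multi-step client workflow
quoted above looks like once the rollback paths are written down, as a
Python sketch against a toy in-memory server. ToyServer, Conflict and
apply_changes() are invented for illustration, rules are hypothetical
(protocol, first port, last port) tuples, and step 5 (checking for bind
errors) is left out entirely.

```python
class Conflict(Exception):
    pass

class ToyServer:
    """Hypothetical stand-in for the server's rule table."""
    def __init__(self):
        self.rules = {}   # rule -> active flag

    def insert(self, rule):
        if rule in self.rules:
            raise Conflict(f"duplicate rule {rule}")
        self.rules[rule] = False

    def delete(self, rule):
        del self.rules[rule]

    def activate(self, rule):
        proto, first, last = rule
        for (p, f, l), active in self.rules.items():
            if active and p == proto and f <= last and first <= l:
                raise Conflict(f"{rule} overlaps {(p, f, l)}")
        self.rules[rule] = True

    def deactivate(self, rule):
        self.rules[rule] = False

def apply_changes(srv, add, remove):
    """Steps 2-4 and 6 of the workflow; returns True if committed."""
    inserted, deactivated, activated = [], [], []

    def rollback():
        for r in activated:
            srv.deactivate(r)    # DEACTIVATE rules we already ACTIVATEd
        for r in deactivated:
            srv.activate(r)      # ACTIVATE rules we already DEACTIVATEd
        for r in inserted:
            srv.delete(r)        # DELETE rules we INSERTed

    try:
        for r in add:            # 2. INSERT all new rules
            srv.insert(r)
            inserted.append(r)
        for r in remove:         # 3. DEACTIVATE all removed rules
            srv.deactivate(r)
            deactivated.append(r)
        for r in add:            # 4. ACTIVATE all new rules
            srv.activate(r)
            activated.append(r)
    except Conflict:
        rollback()
        return False
    for r in remove:             # 6. DELETE rules we DEACTIVATEd
        srv.delete(r)
    return True

srv = ToyServer()
apply_changes(srv, [("tcp", 80, 80), ("tcp", 8000, 8010)], [])
ok = apply_changes(srv,
                   add=[("tcp", 443, 443), ("tcp", 8005, 8020)],
                   remove=[("tcp", 80, 80)])
print(ok, srv.rules)
```

Even in this stripped-down form, the client has to track three lists
just to be able to undo a failed update, and the old rule is briefly
inactive before the conflict is noticed.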
--
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson