From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top
Subject: Re: Pesto protocol proposals
Date: Fri, 6 Mar 2026 23:23:50 +1100	[thread overview]
Message-ID: <aarHVinVY1vOaszm@zatzit> (raw)
In-Reply-To: <20260306101827.38124251@elisabeth>

On Fri, Mar 06, 2026 at 10:18:27AM +0100, Stefano Brivio wrote:
> On Thu, 5 Mar 2026 15:19:40 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Mar 05, 2026 at 02:19:53AM +0100, Stefano Brivio wrote:
> > > On Wed, 4 Mar 2026 15:28:30 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > Most of today and yesterday I've spent thinking about the dynamic
> > > > update model and protocol.  I certainly don't have all the details
> > > > pinned down, let alone any implementation, but I have come to some
> > > > conclusions.
> > > > 
> > > > # Shadow forward table
> > > > 
> > > > On further consideration, I think this is a bad idea.  To avoid
> > > > peer-visible disruption, we don't want to destroy and recreate
> > > > listening sockets  
> > > 
> > > (Side note: if it's just *listening* sockets, is this actually that
> > > bad?)  
> > 
> > Well, it's obviously much less bad than interrupting existing
> > connections.  It does mean a peer attempting to connect at the wrong
> > moment might get an ECONNREFUSED, which, as far as it knows, is a
> > permanent error.
> 
> Right. Now, I'm not sure whether it helps simplify the plan from your
> new proposal even further, but... consider this: *for the time being*
> (as Podman will most likely be the only user of this feature for
> presumably a couple of months), it would simply mean that when Podman
> adds a container to an existing custom network, there are a couple of
> milliseconds during which new connections to existing containers are
> not accepted.
> 
> Certainly something that needs to be fixed, but not an outrageous
> issue if you ask me. On the other hand, maybe it's structural enough
> that we want to get it right in the first place. Of course you know
> better about this.

Yeah, as discussed in my revised proposal I think that's a good
interim step.

> > > > that are associated with a forward rule that's not being altered.  
> > > 
> > > After reading the rest of your proposal, as long as:
> > >   
> > > > Doing that with a shadow table would mean we'd need to essentially
> > > > diff the two tables as we switch.  That seems moderately complex,  
> > > 
> > > ...this is the only downside (I can't think of others, though), and I
> > > don't think it's *that* complex: as I mentioned, it would be an O(n^2)
> > > step that can probably be optimised (via sorting) to O(n * log(m)),
> > > with n new rules and m old rules, cycling over the new rules and
> > > creating listening sockets (we need this part anyway) unless we find
> > > a matching one (marking it somewhere temporarily)...  
> > 
> > I wasn't particularly concerned about the computational cost.  It was
> > more that I couldn't quickly see a clear approach with unambiguous
> > semantics.  But I think I've come up with one now; see below.
> 
> Ah, sorry, I assumed it was a combination of the two, that is, I
> thought it would be sort of straightforward to do it (at least
> initially) as O(n^2) worst case but you were considering it
> unsustainable. On the other hand we have 256 rules...

Right.  As long as the maximum number of rules remains of that order
of magnitude, I think O(n^2) is acceptable for this fairly rare
operation.

> > > > and
> > > > kind of silly when the client almost certainly has created the
> > > > shadow table using specific adds/removes from the original table.  
> > > 
> > > ...even though this is true conceptually, at least at first glance
> > > (why would I send 11 rules to add a single rule to a table of 10?), I
> > > think the other implementation details, and conceptual matters such
> > > as rollback and two-step activation, make this apparent silliness
> > > much less relevant. I'm more and more convinced that a shadow table
> > > is actually the simplest, most robust, least bug-prone approach.
> > > 
> > > Especially:
> > >   
> > > > # Rule states / active bit
> > > > 
> > > > I think we *do* still want two stage activation of new rules:  
> > > 
> > > ...this part, which led to a huge number of bugs over the years in nft
> > > / nftables updates, which also use separate insert / activate / commit
> > > / deactivate / delete operations.  
> > 
> > Huh, interesting.  I wasn't aware of that, and it's pretty persuasive.
> > 
> > > It's extremely complicated to grasp and implement properly, and you end
> > > up with a lot of quasi-diffing anyway (to check for duplicates in
> > > ranges, for example).
> > > 
> > > It makes much more sense in nftables because you can have hundreds of
> > > megabytes of data stored in tables, but any usage that was ever
> > > mentioned for passt in the past ~5 years would seem to imply at most
> > > hundreds of kilobytes per table.
> > > 
> > > Shifting complexity to the client is also a relevant topic for me, as we
> > > decided to have a binary client to avoid anything complicated (parsing)
> > > in the server. A shadow table allows us to shift even more complexity
> > > to the client, which is important for security.  
> > 
> > I definitely agree in principle - what I wasn't convinced about was
> > that the overall balance actually favoured the client, because of my
> > concern over the complexity of that "diff"ing.  But 
> > 
> > > I haven't finished drafting a proposal based on this idea, but I plan to
> > > do it within one day or so.  
> > 
> > Actually, you convinced me already, so I can do that.
> > 
> > > It won't be as detailed, because I don't think it's realistic to come
> > > up with all the details before writing any of the code (what's the
> > > point if you then have to throw away 70% of it?) but I hope it will be
> > > complete enough to provide a comparison.
> > > 
> > > By the way, at least as a first approximation, closing and reopening
> > > listening sockets will mostly do the trick for anything our users
> > > (mostly via Podman) will ever reasonably want, so I have half a mind
> > > to keep it like that in a first proposal. But indeed we should make
> > > sure there's a way around it, which is what is taking me a bit more
> > > time to demonstrate.  
> > 
> > With some more thought I saw a way of doing the "diff" that looks
> > pretty straightforward and reasonable.  Moreover, it involves less
> > churn of the existing code, and works nicely with close-and-reopen as
> > an interim step.  It even provides socket continuity for arbitrarily
> > overlapping ranges in the old and new tables.
> 
> Oh, great! I was stuck pretty much at this point:
> 
> > For close and re-open, we can implement COMMIT as:
> > 	1. fwd_listen_close() on old table
> > 	2. fwd_listen_sync() on new table
> 
> ...trying to figure out how interleaved (table-wide vs. per-socket)
> these steps would be. In my mind I actually thought we would just call
> fwd_listen_sync(), which would do the diff itself and close left-over
> sockets as needed, but:

Eh, that's basically just a question of how we name the functions.  My
point is that the above will work with the existing implementation of
fwd_listen_sync().

> > I think we can get socket continuity by swapping the order of those
> > steps and extending fwd_sync_one() to do:
> > 	for each port:
> > 	    if <already opened>:
> > 	        nothing to do
> > <new>	    else if <matching open socket in old table>:
> > <new>	        steal socket for new table
> >             else:
> > 	        open/bind/listen new socket
> > 
> > The "steal" would mark the fd as -1 in the old table so
> > fwd_listen_close() won't get rid of it.
> 
> ...this should be more practical I guess.
> 
> > I think the check for a matching socket in the old table will be
> > moderately expensive O(n), but not so much as to be a problem in
> > practice.
> 
> And again we could sort them eventually, which should make each
> lookup O(log(n)) on average (still O(n^2) worst case I guess).
> 
> > > > [...]
> > > >
> > > > # Suggested client workflow
> > > > 
> > > > I suggest the client should:
> > > > 
> > > >    1. Parse all rule modifications
> > > >    2. INSERT all new rules
> > > >       -> On error, DELETE them again
> > > >    3. DEACTIVATE all removed rules
> > > >       -> Should only fail if the client has done something wrong
> > > >    4. ACTIVATE all new rules
> > > >       -> On error (rule conflict):
> > > >          DEACTIVATE rules we already ACTIVATEd
> > > >          ACTIVATE rules we already DEACTIVATEd
> > > >          DELETE rules we INSERTed
> > > >    5. Check for bind errors (see details later)
> > > >       If there are failures we can't tolerate:
> > > >          DEACTIVATE rules we already ACTIVATEd
> > > >          ACTIVATE rules we already DEACTIVATEd
> > > >          DELETE rules we INSERTed
> > > >    6. DELETE rules we DEACTIVATEd
> > > >       -> Should only fail if the client has done something wrong
> > > > 
> > > > DEACTIVATE comes before ACTIVATE to avoid spurious conflicts between
> > > > new rules and rules we're deleting.
> > > > 
> > > > I think that gets us close-ish to "as atomic as we can be", at least
> > > > from the perspective of peers.  The main case it doesn't catch is that
> > > > we don't detect rule conflicts until after we might have removed some
> > > > rules.  Is that good enough?  
> > > 
> > > I think it is absolutely fine as an outcome, but the complexity of
> > > error handling in this case is a bit worrying. This is exactly the
> > > kind of thing (we've discussed it a couple of times already) that
> > > made, and still makes, me think that a shadow table is a better
> > > approach instead.  
> > 
> > I'll work on a more concrete proposal based on the shadow table
> > approach.  There are still some wrinkles with how to report bind()
> > errors with this scheme to figure out.
> 
> I was thinking that with this scheme we would just report success or
> failure without any further detail (except for warnings / error
> messages we might print, which are not part of the protocol), at
> least at the beginning.
> 
> I'll comment on your new proposal in more detail though.
> 
> -- 
> Stefano
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

