Pesto protocol proposals

public inbox for passt-dev@passt.top

* Pesto protocol proposals
@ 2026-03-04  4:28 David Gibson
  2026-03-05  1:19 ` Stefano Brivio
  0 siblings, 1 reply; 5+ messages in thread
From: David Gibson @ 2026-03-04  4:28 UTC
  To: passt-dev

Most of today and yesterday I've spent thinking about the dynamic
update model and protocol.  I certainly don't have all the details
pinned down, let alone any implementation, but I have come to some
conclusions.

# Shadow forward table

On further consideration, I think this is a bad idea.  To avoid peer
visible disruption, we don't want to destroy and recreate listening
sockets that are associated with a forward rule that's not being
altered.

Doing that with a shadow table would mean we'd need to essentially
diff the two tables as we switch.  That seems moderately complex, and
kind of silly when the client will almost certainly have created the
shadow table using specific adds/removes from the original table.

# Rule states / active bit

I think we *do* still want two stage activation of new rules: if the
first stage fails we're guaranteed we can roll back with no peer
visible consequences.  The second stage (the actual bind()s and
listen()s) doesn't have that property, but that's unavoidable.

To implement that, I think each rule should have an "active" bit.  Or,
at least an active bit - it's possible that might not be enough, but
we could extend it to a state field, loosely analogous to the state
field in flow table entries.  But let's assume just the active bit,
until/unless a case shows up where it's insufficient.

Entries are always inserted in inactive state.  Entries must be moved
to inactive state before deletion.  fwd_listen_sync() would ignore
inactive entries.  Turning the active flag on triggers the actual
bind() / listen() calls, but requires no rearrangement of the fwd
table or socket array.  Turning it off close()s the associated
sockets, but again requires no rearrangement of the data structures.
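
A minimal sketch of the active bit idea, in C.  Everything here is
hypothetical and simplified (one port and one socket per rule, a fake
sketch_open() standing in for the real bind()/listen() calls); none of
the names come from the passt sources.  Rejecting ACTIVATE on an
already active rule is an assumption, the text above doesn't specify it.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_RULES 8	/* hypothetical table size */

/* Hypothetical, heavily simplified rule: one port, one listening socket */
struct sketch_rule {
	int port;
	int fd;		/* -1 while no listening socket is open */
	bool active;
};

struct sketch_table {
	struct sketch_rule rules[SKETCH_MAX_RULES];
	int count;
};

/* Stand-in for socket() + bind() + listen(): returns a fake fd */
int sketch_open(int port)
{
	return 100 + port;
}

/* Stand-in for close() */
void sketch_close(int fd)
{
	(void)fd;
}

/* Turn the active bit on: opens the socket, table layout is untouched */
int sketch_activate(struct sketch_table *t, int idx)
{
	if (idx < 0 || idx >= t->count || t->rules[idx].active)
		return -1;

	assert(t->rules[idx].fd == -1);	/* inactive rules hold no socket */
	t->rules[idx].fd = sketch_open(t->rules[idx].port);
	t->rules[idx].active = true;
	return 0;
}

/* Turn the active bit off: closes the socket, table layout is untouched */
int sketch_deactivate(struct sketch_table *t, int idx)
{
	if (idx < 0 || idx >= t->count || !t->rules[idx].active)
		return -1;

	sketch_close(t->rules[idx].fd);
	t->rules[idx].fd = -1;
	t->rules[idx].active = false;
	return 0;
}
```

Neither function moves entries around, which is the point: only the
flag and the fd change.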

# Tentative client operations

## INSERT
  Parameters: rule specification + rule index
  Returns: error status

Inserts the new rule (inactive) at the given index (moving later
rules, if necessary).

Fails with no effect, for a bad index or if there's no room in the
table or socket array.  Does *not* check for conflicts with other
rules.

NOTE: moving rules could mean thousands of epoll_ctl() calls to
      adjust rule indices.  We don't expect those to fail, but if they
      did, what do we do?

## DELETE
  Parameters: rule index
  Returns: error status

Deletes the given inactive rule (moving later rules, if necessary).

Fails with no effect if it's a bad index, or the given rule is active.

NOTE: As for INSERT
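
As a rough illustration of INSERT and DELETE on a packed table, here is
a sketch using memmove() to shift later rules.  The struct layout, the
table size and the sketch_* names are assumptions made for the example,
and the epoll_ctl() updates mentioned in the NOTE are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_MAX_RULES 4	/* hypothetical, deliberately tiny */

struct sketch_rule {
	int port;
	bool active;
};

struct sketch_table {
	struct sketch_rule rules[SKETCH_MAX_RULES];
	int count;
};

/* INSERT: new rule goes in inactive; fails with no effect on bad input */
int sketch_insert(struct sketch_table *t, int idx, int port)
{
	assert(t->count >= 0 && t->count <= SKETCH_MAX_RULES);

	if (idx < 0 || idx > t->count || t->count >= SKETCH_MAX_RULES)
		return -1;

	/* Shift rules at idx and later up by one slot */
	memmove(&t->rules[idx + 1], &t->rules[idx],
		(size_t)(t->count - idx) * sizeof(t->rules[0]));
	t->rules[idx] = (struct sketch_rule){ .port = port, .active = false };
	t->count++;
	return 0;
}

/* DELETE: only inactive rules may be removed */
int sketch_delete(struct sketch_table *t, int idx)
{
	if (idx < 0 || idx >= t->count || t->rules[idx].active)
		return -1;

	/* Shift rules after idx down by one slot */
	memmove(&t->rules[idx], &t->rules[idx + 1],
		(size_t)(t->count - idx - 1) * sizeof(t->rules[0]));
	t->count--;
	return 0;
}
```

All checks happen before anything is modified, so a failing call leaves
the table exactly as it was.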

## ACTIVATE
  Parameters: rule index
  Returns: error status

Enables the rule, bind()ing all the necessary listening sockets.

Fails with no effect if the rule conflicts with another active rule.

Completes, even if some bind()s fail (see later for handling of this).

## DEACTIVATE
  Parameters: rule index
  Returns: error status

Disables the rule, close()ing any listening sockets.

Fails with no effect for bad index or an already inactive rule.

## STATUS
  Parameters: rule index
  Returns: active/inactive bit + possible metadata

Indicates whether the rule is currently active.  Could also give
limited metadata about the rule (see below for possible use in bind()
error reporting).

## READ
  Parameters: rule index
  Returns: rule specification, or error code

Reads out the rule spec and returns it.

Fails for bad index.

To dump the whole table, the client can READ each slot starting from
0, until it gets an error.
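
The dump loop could look roughly like this on the client side.
sketch_read() is a hypothetical stand-in for one READ exchange over the
control socket; here it just reads from a small in-memory array so the
example is self-contained.

```c
#include <assert.h>

/* Hypothetical rule specification, reduced to a single port */
struct sketch_spec {
	int port;
};

/* Stand-in for the backend's table */
static const struct sketch_spec sketch_backend[] = { { 22 }, { 80 }, { 443 } };
#define SKETCH_BACKEND_COUNT 3

/* Stand-in for a READ request: 0 on success, -1 for a bad index */
int sketch_read(int idx, struct sketch_spec *out)
{
	if (idx < 0 || idx >= SKETCH_BACKEND_COUNT)
		return -1;

	*out = sketch_backend[idx];
	return 0;
}

/* READ slots from 0 until the first error; returns number of rules read */
int sketch_dump(struct sketch_spec *out, int max)
{
	int n = 0;

	assert(max >= 0);
	while (n < max && sketch_read(n, &out[n]) == 0)
		n++;

	return n;
}
```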

# Suggested client workflow

I suggest the client should:

   1. Parse all rule modifications
   2. INSERT all new rules
      -> On error, DELETE them again
   3. DEACTIVATE all removed rules
      -> Should only fail if the client has done something wrong
   4. ACTIVATE all new rules
      -> On error (rule conflict):
         DEACTIVATE rules we already ACTIVATEd
         ACTIVATE rules we already DEACTIVATEd
         DELETE rules we INSERTed
   5. Check for bind errors (see details later)
      If there are failures we can't tolerate:
         DEACTIVATE rules we already ACTIVATEd
         ACTIVATE rules we already DEACTIVATEd
         DELETE rules we INSERTed
   6. DELETE rules we DEACTIVATEd
      -> Should only fail if the client has done something wrong

DEACTIVATE comes before ACTIVATE to avoid spurious conflicts between
new rules and rules we're deleting.

I think that gets us closeish to "as atomic as we can be", at least
from the perspective of peers.  The main case it doesn't catch is that
we don't detect rule conflicts until after we might have removed some
rules.  Is that good enough?
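
To make the rollback paths in that workflow concrete, here is a sketch
of steps 2, 3, 4 and 6 with client-side rollback.  It is simplified on
purpose: the backend is a fixed array of slots (so indices don't shift
on INSERT/DELETE, unlike the proposal), a "conflict" just means two
active rules on the same port, and step 5 (bind errors) is omitted.
All sk_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_SLOTS 8	/* hypothetical table size */

struct sk_rule { bool present, active; int port; };
struct sk_rule sk_tab[SK_SLOTS];	/* stand-in for the backend table */

struct sk_new { int idx, port; };	/* a rule the client wants to add */

static bool sk_valid(int idx)
{
	return idx >= 0 && idx < SK_SLOTS && sk_tab[idx].present;
}

int sk_insert(int idx, int port)
{
	if (idx < 0 || idx >= SK_SLOTS || sk_tab[idx].present)
		return -1;
	sk_tab[idx] = (struct sk_rule){ true, false, port };
	return 0;
}

int sk_delete(int idx)
{
	if (!sk_valid(idx) || sk_tab[idx].active)
		return -1;
	sk_tab[idx].present = false;
	return 0;
}

int sk_activate(int idx)
{
	int i;

	if (!sk_valid(idx))
		return -1;
	for (i = 0; i < SK_SLOTS; i++)	/* conflict with another active rule? */
		if (i != idx && sk_tab[i].present && sk_tab[i].active &&
		    sk_tab[i].port == sk_tab[idx].port)
			return -1;
	sk_tab[idx].active = true;
	return 0;
}

int sk_deactivate(int idx)
{
	if (!sk_valid(idx) || !sk_tab[idx].active)
		return -1;
	sk_tab[idx].active = false;
	return 0;
}

/* Apply one update: add[] are new rules, del[] are indices to remove */
int sk_update(const struct sk_new *add, int n_add, const int *del, int n_del)
{
	int ins, deact = 0, act = 0, i, ret;

	for (ins = 0; ins < n_add; ins++)		/* step 2 */
		if (sk_insert(add[ins].idx, add[ins].port))
			goto rollback;
	for (deact = 0; deact < n_del; deact++)		/* step 3 */
		if (sk_deactivate(del[deact]))
			goto rollback;
	for (act = 0; act < n_add; act++)		/* step 4 */
		if (sk_activate(add[act].idx))
			goto rollback;
	for (i = 0; i < n_del; i++)			/* step 6 */
		sk_delete(del[i]);
	return 0;

rollback:
	for (i = 0; i < act; i++)	/* DEACTIVATE what we ACTIVATEd */
		sk_deactivate(add[i].idx);
	for (i = 0; i < deact; i++) {	/* ACTIVATE what we DEACTIVATEd */
		ret = sk_activate(del[i]);
		assert(ret == 0);	/* these rules were active before */
		(void)ret;
	}
	for (i = 0; i < ins; i++)	/* DELETE what we INSERTed */
		sk_delete(add[i].idx);
	return -1;
}
```

The counters ins, deact and act record how far each phase got, so the
rollback only undoes operations that actually succeeded.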

# Bind error handling

Note, in the below I'm considering pasta/passt's command line handling
and conf path as a client rather than part of the backend.

The distinction between weak and non-weak entries is a bit clunky.
How many failures is too many is kind of a question for the client, not
the backend.  So I'm suggesting we remove that concept from the backend.
ACTIVATE completes even if some or all binds fail.  However, we keep a
count of how many sockets we got a bind() or listen() failure for in
the rule, and it can be retrieved with STATUS.  That lets the client
make a decision as to whether to live with it or roll back as best it
can.  A client could also potentially poll later to see if some
failures resolved themselves (we now reattempt bind()s on every
fwd_listen_sync()).

NOTE: the meaning of that count is pretty straightforward with regular
rules, but with SCAN rules, we'd have to be more careful.
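
A sketch of how the failure count and the client-side decision might
fit together for a regular (non-SCAN) rule.  sk_try_bind() is a
hypothetical stand-in that pretends ports below 1024 can't be bound;
the struct, the threshold parameter and the function names are all
assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical rule covering a port range, with a per-rule failure count */
struct sk_range_rule {
	int first, last;
	bool active;
	unsigned failures;	/* what STATUS would report */
};

/* Stand-in for bind() + listen(): ports below 1024 "fail" */
int sk_try_bind(int port)
{
	return port < 1024 ? -1 : 0;
}

/* ACTIVATE completes even if some or all binds fail; it only counts them */
void sk_activate_range(struct sk_range_rule *r)
{
	int port;

	assert(r->first <= r->last);
	r->failures = 0;
	for (port = r->first; port <= r->last; port++)
		if (sk_try_bind(port) < 0)
			r->failures++;
	r->active = true;
}

/* Client-side policy: how many failures is too many is the client's call */
bool sk_client_accepts(const struct sk_range_rule *r, unsigned max_failures)
{
	return r->active && r->failures <= max_failures;
}
```

A strict client would pass 0 as the threshold and roll back on any
failure; a lenient one could pass the size of the range.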

# Concurrent updates

I suggest we prevent concurrent updates by only allowing one client to
connect to the control socket at a time.

# Possible tweaks

Not sure if these are improvements or not, but they're options to
consider.

## Rule conflicts

Currently fwd_rule_add() checks for rules with conflicts and rejects
them.  We can't really report that at INSERT, because we could get a
bogus conflict with a rule we intend to DEACTIVATE/DELETE.  But
reporting at ACTIVATE is also a bit clunky.

This could potentially be sidestepped by removing the notion of rule
conflicts entirely.  Instead overlapping rules are simply allowed, and
the first rule to match a flow wins.
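
For illustration, "first rule to match wins" amounts to a linear scan
in table order.  The rule layout (a plain port range) and the names are
hypothetical.

```c
#include <assert.h>

/* Hypothetical rule: matches an inclusive port range */
struct sk_fwd {
	int first, last;
};

/* Return the index of the first rule matching port, or -1 if none does */
int sk_lookup(const struct sk_fwd *rules, int n, int port)
{
	int i;

	assert(n >= 0);
	for (i = 0; i < n; i++)
		if (port >= rules[i].first && port <= rules[i].last)
			return i;

	return -1;
}
```

With rules { 80-80, 1-1023, 8000-9000 }, port 80 is matched by both of
the first two rules and resolves to the first, with no conflict check
needed at insertion time.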

## Rollback in backend

The proposal above has rollback essentially handled by the client.  We
could instead do it in the backend.
  - Instead of a single active bit, each rule has an "active now" and
    "active future" bit
  - On client connect, all active future bits are set equal to active
    now bits
  - INSERT adds a rule with active now false and active future true
     - If we conflict check, we check only against active future rules
  - DELETE clears the active future bit
  - ACTIVATE/DEACTIVATE no longer exist
  - ROLLBACK deletes all !active now rules and sets active future bits
    to active now bits again
  - COMMIT does the bind()s and close()s and on success, sets active
    now bits to active future bits.  On failure... it's fairly
    complex, we'd need to think about it
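
A sketch of the "active now" / "active future" bookkeeping, assuming
COMMIT always succeeds (the failure case is exactly the open question
above, so it isn't modelled).  The slot-based table and all names are
hypothetical, and the actual bind()/close() calls are elided.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_SLOTS 4	/* hypothetical table size */

struct sk_rule {
	bool present;	/* slot in use */
	bool now;	/* "active now" bit */
	bool future;	/* "active future" bit */
	int port;
};

struct sk_rule sk_rules[SK_SLOTS];

/* On client connect: future bits start out equal to now bits */
void sk_connect(void)
{
	int i;

	for (i = 0; i < SK_SLOTS; i++)
		sk_rules[i].future = sk_rules[i].now;
}

/* INSERT: active now false, active future true; returns slot or -1 */
int sk_insert(int port)
{
	int i;

	for (i = 0; i < SK_SLOTS; i++) {
		if (!sk_rules[i].present) {
			sk_rules[i] = (struct sk_rule){ true, false, true, port };
			return i;
		}
	}
	return -1;
}

/* DELETE: only clears the future bit */
int sk_delete(int slot)
{
	if (slot < 0 || slot >= SK_SLOTS || !sk_rules[slot].present)
		return -1;
	sk_rules[slot].future = false;
	return 0;
}

/* ROLLBACK: drop rules that aren't active now, reset future to now */
void sk_rollback(void)
{
	int i;

	for (i = 0; i < SK_SLOTS; i++) {
		if (!sk_rules[i].now)
			sk_rules[i].present = false;
		sk_rules[i].future = sk_rules[i].now;
	}
}

/* COMMIT: bind()s and close()s would happen here; assumed to succeed */
void sk_commit(void)
{
	int i;

	for (i = 0; i < SK_SLOTS; i++) {
		assert(sk_rules[i].present || !sk_rules[i].now);
		if (sk_rules[i].present)
			sk_rules[i].now = sk_rules[i].future;
	}
}
```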

## Persistent rule IDs

Proposal above uses raw indices in the table to identify rules, which
means INSERT and DELETE change the numbers of other rules.  That in
turn requires a bunch of epoll_ctl()s to update existing sockets.

Here's one way we could avoid some of that with a persistent rule ID:
  - Each rule has an ID (say a u32), supplied by the client at INSERT
  - Rules still apply in ID order, so the order matters, but not the
    exact values
  - INSERT to an existing rule ID is not permitted - you must DELETE
    first

Internally we still store the table packed, but sorted by ID.  We
could look up rule by ID either with a binary search, or maybe a radix
lookup table.  If we reduced the ID to a u16 (or so) we could
potentially use a single level lookup table.  I suspect binary search
might be faster than a lookup table anyway, because of dcache impact.

So, we still need to memmove() things about for INSERT, and maybe update
the lookup table, but that's relatively easy.  epoll data holds the
persistent ID, so that doesn't need to be altered.
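
A sketch of the packed, ID-sorted table with binary search lookup.  The
u32 ID follows the suggestion above; the rule contents, table size and
function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SK_MAX 8	/* hypothetical table size */

struct sk_rule {
	uint32_t id;	/* persistent ID supplied by the client */
	int port;
};

struct sk_table {
	struct sk_rule r[SK_MAX];	/* packed, sorted by id */
	int count;
};

/* Binary search: index of the first rule with id >= the given id */
int sk_lower_bound(const struct sk_table *t, uint32_t id)
{
	int lo = 0, hi = t->count;

	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;

		if (t->r[mid].id < id)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Look up a rule by ID: returns its current index, or -1 */
int sk_find(const struct sk_table *t, uint32_t id)
{
	int i = sk_lower_bound(t, id);

	return (i < t->count && t->r[i].id == id) ? i : -1;
}

/* INSERT: fails if the table is full or the ID is already in use */
int sk_insert(struct sk_table *t, uint32_t id, int port)
{
	int i = sk_lower_bound(t, id);

	assert(t->count >= 0 && t->count <= SK_MAX);
	if (t->count >= SK_MAX || (i < t->count && t->r[i].id == id))
		return -1;

	memmove(&t->r[i + 1], &t->r[i],
		(size_t)(t->count - i) * sizeof(t->r[0]));
	t->r[i] = (struct sk_rule){ .id = id, .port = port };
	t->count++;
	return 0;
}
```

The index returned by sk_find() changes as rules come and go, but the
ID stored alongside each socket does not, which is what avoids the
epoll_ctl() updates.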

Clients (including the internal conf path) could choose to leave gaps
in the IDs to leave space for future inserts.  And/or certain ranges
could be reserved by convention for different purposes.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson


* Re: Pesto protocol proposals
  2026-03-04  4:28 Pesto protocol proposals David Gibson
@ 2026-03-05  1:19 ` Stefano Brivio
  2026-03-05  4:19   ` David Gibson
  0 siblings, 1 reply; 5+ messages in thread
From: Stefano Brivio @ 2026-03-05  1:19 UTC
  To: David Gibson; +Cc: passt-dev

On Wed, 4 Mar 2026 15:28:30 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> Most of today and yesterday I've spent thinking about the dynamic
> update model and protocol.  I certainly don't have all the details
> pinned down, let alone any implementation, but I have come to some
> conclusions.
> 
> # Shadow forward table
> 
> On further consideration, I think this is a bad idea.  To avoid peer
> visible disruption, we don't want to destroy and recreate listening
> sockets

(Side note: if it's just *listening* sockets, is this actually that
bad?)

> that are associated with a forward rule that's not being altered.

After reading the rest of your proposal, as long as:

> Doing that with a shadow table would mean we'd need to essentially
> diff the two tables as we switch.  That seems moderately complex,

...this is the only downside (I can't think of others though), and, as
I mentioned, I don't think it's *that* complex: it would be an O(n^2)
step that can probably be optimised (via sorting) to O(n * log(m)) with
n new rules and m old rules, cycling on new rules and creating listening
sockets (we need this part anyway) unless we find (marking it
somewhere temporarily) a matching one...

> and
> kind of silly when then client almost certainly have created the
> shadow table using specific adds/removes from the original table.

...even though this is true conceptually, at least at a first glance
(why would I send 11 rules to add a single rule to a table of 10?), I
think the other details of the implementation, and conceptual matters
(such as rollback and two-step activation) make this apparent silliness
much less relevant, and I'm more and more convinced that a shadow table
is actually the simplest, most robust, least bug-prone approach.

Especially:

> # Rule states / active bit
> 
> I think we *do* still want two stage activation of new rules:

...this part, which led to a huge number of bugs over the years in nft
/ nftables updates, which also use separate insert / activate / commit
/ deactivate / delete operations.

It's extremely complicated to grasp and implement properly, and you end
up with a lot of quasi-diffing anyway (to check for duplicates in
ranges, for example).

It makes much more sense in nftables because you can have hundreds of
megabytes of data stored in tables, but any usage that was ever
mentioned for passt in the past ~5 years would seem to imply at most
hundreds of kilobytes per table.

Shifting complexity to the client is also a relevant topic for me, as we
decided to have a binary client to avoid anything complicated (parsing)
in the server. A shadow table allows us to shift even more complexity
to the client, which is important for security.

I haven't finished drafting a proposal based on this idea, but I plan to
do it within one day or so.

It won't be as detailed, because I don't think it's realistic to come
up with all the details before writing any of the code (what's the
point if you then have to throw away 70% of it?) but I hope it will be
complete enough to provide a comparison.

By the way, at least at a first approximation, closing and reopening
listening sockets will mostly do the trick for anything our users
(mostly via Podman) will ever reasonably want, so I have half a mind to
keep it like that in a first proposal, but indeed we should make
sure there's a way around it, which is what is taking me a bit more
time to demonstrate.

> [...]
>
> # Suggested client workflow
> 
> I suggest the client should:
> 
>    1. Parse all rule modifications
>    2. INSERT all new rules
>       -> On error, DELETE them again  
>    3. DEACTIVATE all removed rules
>       -> Should only fail if the client has done something wrong  
>    4. ACTIVATE all new rules
>       -> On error (rule conflict):  
>          DEACTIVATE rules we already ACTIVATEd
> 	 ACTIVATE rules we already DEACTIVATEd
> 	 DELETE rules we INSERTed
>    5. Check for bind errors (see details later)
>       If there are failures we can't tolerate:
>          DEACTIVATE rules we already ACTIVATEd
> 	 ACTIVATE rules we already DEACTIVATEd
> 	 DELETE rules we INSERTed
>    6. DELETE rules we DEACTIVATEd
>       -> Should only fail if the client has done something wrong  
> 
> DEACTIVATE comes before ACTIVATE to avoid spurious conflicts between
> new rules and rules we're deleting.
> 
> I think that gets us closeish to "as atomic as we can be", at least
> from the perspective of peers.  The main case it doesn't catch is that
> we don't detect rule conflicts until after we might have removed some
> rules.  Is that good enough?

I think it is absolutely fine as an outcome, but the complexity of error
handling in this case is a bit worrying. This is exactly the kind of
thing (and we discussed it already a couple of times) that made and
makes me think that a shadow table is a better approach instead.

> [...]

-- 
Stefano

* Re: Pesto protocol proposals
  2026-03-05  1:19 ` Stefano Brivio
@ 2026-03-05  4:19   ` David Gibson
  2026-03-06  9:18     ` Stefano Brivio
  0 siblings, 1 reply; 5+ messages in thread
From: David Gibson @ 2026-03-05  4:19 UTC
  To: Stefano Brivio; +Cc: passt-dev

On Thu, Mar 05, 2026 at 02:19:53AM +0100, Stefano Brivio wrote:
> On Wed, 4 Mar 2026 15:28:30 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > Most of today and yesterday I've spent thinking about the dynamic
> > update model and protocol.  I certainly don't have all the details
> > pinned down, let alone any implementation, but I have come to some
> > conclusions.
> > 
> > # Shadow forward table
> > 
> > On further consideration, I think this is a bad idea.  To avoid peer
> > visible disruption, we don't want to destroy and recreate listening
> > sockets
> 
> (Side note: if it's just *listening* sockets, is this actually that
> bad?)

Well, it's obviously much less bad than interrupting existing
connections.  It does mean a peer attempting to connect at the wrong
moment might get an ECONNREFUSED which, as far as it knows, is a
permanent error.

> > that are associated with a forward rule that's not being altered.
> 
> After reading the rest of your proposal, as long as:
> 
> > Doing that with a shadow table would mean we'd need to essentially
> > diff the two tables as we switch.  That seems moderately complex,
> 
> ...this is the only downside (I can't think of others though), and I
> don't think it's *that* complex as I mentioned, it would be a O(n^2)
> step that can be probably optimised (via sorting) to O(n * log(m)) with
> n new rules and m old rules, cycling on new rules and creating listening
> sockets (we need this part anyway) unless we find (marking it
> somewhere temporarily) a matching one...

I wasn't particularly concerned about the computational cost.  It was
more that I couldn't quickly see a clear approach with unambiguous
semantics.  But, I think I came up with one now, see later.

> > and
> > kind of silly when then client almost certainly have created the
> > shadow table using specific adds/removes from the original table.
> 
> ...even though this is true conceptually, at least at a first glance
> (why would I send 11 rules to add a single rule to a table of 10?), I
> think the other details of the implementation, and conceptual matters
> (such as rollback and two-step activation) make this apparent silliness
> much less relevant, and I'm more and more convinced that a shadow table
> is actually the simplest, most robust, least bug-prone approach.
> 
> Especially:
> 
> > # Rule states / active bit
> > 
> > I think we *do* still want two stage activation of new rules:
> 
> ...this part, which led to a huge number of bugs over the years in nft
> / nftables updates, which also use separate insert / activate / commit
> / deactivate / delete operations.

Huh, interesting.  I wasn't aware of that, and it's pretty persuasive.

> It's extremely complicated to grasp and implement properly, and you end
> up with a lot of quasi-diffing anyway (to check for duplicates in
> ranges, for example).
> 
> It makes much more sense in nftables because you can have hundreds of
> megabytes of data stored in tables, but any usage that was ever
> mentioned for passt in the past ~5 years would seem to imply at most
> hundreds of kilobytes per table.
> 
> Shifting complexity to the client is also a relevant topic for me, as we
> decided to have a binary client to avoid anything complicated (parsing)
> in the server. A shadow table allows us to shift even more complexity
> to the client, which is important for security.

I definitely agree in principle - what I wasn't convinced about was
that the overall balance actually favoured the client, because of my
concern over the complexity of that "diff"ing.  But 

> I haven't finished drafting a proposal based on this idea, but I plan to
> do it within one day or so.

Actually, you convinced me already, so I can do that.

> It won't be as detailed, because I don't think it's realistic to come
> up with all the details before writing any of the code (what's the
> point if you then have to throw away 70% of it?) but I hope it will be
> complete enough to provide a comparison.
> 
> By the way, at least at a first approximation, closing and reopening
> listening sockets will mostly do the trick for anything our users
> (mostly via Podman) will ever reasonably want, so I have half a mind of
> keeping it like that in a first proposal, but indeed we should make
> sure there's a way around it, which is what is is taking me a bit more
> time to demonstrate.

With some more thought I saw a way of doing the "diff" that looks
pretty straightforward and reasonable.  Moreover it's less churn of
the existing code, and works nicely with close-and-reopen as an
interim step.  It even provides socket continuity for arbitrarily
overlapping ranges in the old and new tables.

For close and re-open, we can implement COMMIT as:
	1. fwd_listen_close() on old table
	2. fwd_listen_sync() on new table

I think we can get socket continuity by swapping the order of those
steps and extending fwd_sync_one() to do:
	for each port:
	    if <already opened>:
	        nothing to do
<new>	    else if <matching open socket in old table>:
<new>	        steal socket for new table
	    else:
	        open/bind/listen new socket

The "steal" would mark the fd as -1 in the old table so
fwd_listen_close() won't get rid of it.

I think the check for a matching socket in the old table will be
moderately expensive O(n), but not so much as to be a problem in
practice.
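
A rough C sketch of the steal-then-close idea.  This is not the real
fwd_sync_one() or fwd_listen_close(): the tables are reduced to one fd
per port in a tiny hypothetical port space, so "matching socket in the
old table" becomes a direct array lookup rather than the O(n) search
mentioned above, and sk_open()/sk_close() are fake stand-ins for the
socket calls.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PORTS 16	/* hypothetical, tiny port space */

/* Hypothetical table: which ports should listen, and their fds */
struct sk_table {
	unsigned char want[SK_PORTS];	/* nonzero if a rule covers the port */
	int fd[SK_PORTS];		/* listening fd, or -1 */
};

int sk_next_fd = 100;	/* fake fd allocator */
int sk_closed;		/* counts sk_close() calls, for illustration */

/* Stand-in for socket() + bind() + listen() */
int sk_open(int port)
{
	(void)port;
	return sk_next_fd++;
}

/* Stand-in for close() */
void sk_close(int fd)
{
	(void)fd;
	sk_closed++;
}

/* Open what's missing in next, stealing matching sockets from prev */
void sk_sync(struct sk_table *next, struct sk_table *prev)
{
	int port;

	for (port = 0; port < SK_PORTS; port++) {
		if (!next->want[port] || next->fd[port] >= 0)
			continue;	/* not needed, or already opened */

		if (prev && prev->fd[port] >= 0) {
			/* Steal: mark as -1 in prev so it isn't closed later */
			next->fd[port] = prev->fd[port];
			prev->fd[port] = -1;
		} else {
			next->fd[port] = sk_open(port);
		}
		assert(next->fd[port] >= 0);
	}
}

/* Close every socket still owned by the table */
void sk_close_all(struct sk_table *t)
{
	int port;

	for (port = 0; port < SK_PORTS; port++) {
		if (t->fd[port] >= 0) {
			sk_close(t->fd[port]);
			t->fd[port] = -1;
		}
	}
}

/* COMMIT with socket continuity: sync the new table first, then close */
void sk_commit(struct sk_table *next, struct sk_table *prev)
{
	sk_sync(next, prev);
	sk_close_all(prev);
}
```

Swapping the two calls in sk_commit() gives the plain close and re-open
behaviour, since nothing would be left in the old table to steal.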

> > [...]
> >
> > # Suggested client workflow
> > 
> > I suggest the client should:
> > 
> >    1. Parse all rule modifications
> >    2. INSERT all new rules
> >       -> On error, DELETE them again  
> >    3. DEACTIVATE all removed rules
> >       -> Should only fail if the client has done something wrong  
> >    4. ACTIVATE all new rules
> >       -> On error (rule conflict):  
> >          DEACTIVATE rules we already ACTIVATEd
> > 	 ACTIVATE rules we already DEACTIVATEd
> > 	 DELETE rules we INSERTed
> >    5. Check for bind errors (see details later)
> >       If there are failures we can't tolerate:
> >          DEACTIVATE rules we already ACTIVATEd
> > 	 ACTIVATE rules we already DEACTIVATEd
> > 	 DELETE rules we INSERTed
> >    6. DELETE rules we DEACTIVATEd
> >       -> Should only fail if the client has done something wrong  
> > 
> > DEACTIVATE comes before ACTIVATE to avoid spurious conflicts between
> > new rules and rules we're deleting.
> > 
> > I think that gets us closeish to "as atomic as we can be", at least
> > from the perspective of peers.  The main case it doesn't catch is that
> > we don't detect rule conflicts until after we might have removed some
> > rules.  Is that good enough?
> 
> I think it is absolutely fine as an outcome, but the complexity of error
> handling in this case is a bit worrying. This is exactly the kind of
> thing (and we discussed it already a couple of times) that made and
> makes me think that a shadow table is a better approach instead.

I'll work on a more concrete proposal based on the shadow table
approach.  There are still some wrinkles with how to report bind()
errors with this scheme to figure out.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson


* Re: Pesto protocol proposals
  2026-03-05  4:19   ` David Gibson
@ 2026-03-06  9:18     ` Stefano Brivio
  2026-03-06 12:23       ` David Gibson
  0 siblings, 1 reply; 5+ messages in thread
From: Stefano Brivio @ 2026-03-06  9:18 UTC
  To: David Gibson; +Cc: passt-dev

On Thu, 5 Mar 2026 15:19:40 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Mar 05, 2026 at 02:19:53AM +0100, Stefano Brivio wrote:
> > On Wed, 4 Mar 2026 15:28:30 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > Most of today and yesterday I've spent thinking about the dynamic
> > > update model and protocol.  I certainly don't have all the details
> > > pinned down, let alone any implementation, but I have come to some
> > > conclusions.
> > > 
> > > # Shadow forward table
> > > 
> > > On further consideration, I think this is a bad idea.  To avoid peer
> > > visible disruption, we don't want to destroy and recreate listening
> > > sockets  
> > 
> > (Side note: if it's just *listening* sockets, is this actually that
> > bad?)  
> 
> Well, it's obviously much less bad that interrupting existing
> connections.  It does mean a peer attempting to connect at the wrong
> moment might get an ECONNREFUSED, as far as it knows, a permanent
> error.

Right. Now, I'm not sure if it helps simplify the plan from your new
proposal even further but... consider this: *for the time being* (as
Podman will most likely be the only user of this feature for presumably
a couple of months), it would simply mean that when Podman adds a
container to an existing custom network, there are a couple of
milliseconds during which new connections to existing containers are
not accepted.

Surely something that needs to be fixed, but not an outrageous issue if
you ask me. On the other hand, maybe it's structural enough that we
want to get it right in the first place. Of course you know better about
this.

> > > that are associated with a forward rule that's not being altered.  
> > 
> > After reading the rest of your proposal, as long as:
> >   
> > > Doing that with a shadow table would mean we'd need to essentially
> > > diff the two tables as we switch.  That seems moderately complex,  
> > 
> > ...this is the only downside (I can't think of others though), and I
> > don't think it's *that* complex as I mentioned, it would be a O(n^2)
> > step that can be probably optimised (via sorting) to O(n * log(m)) with
> > n new rules and m old rules, cycling on new rules and creating listening
> > sockets (we need this part anyway) unless we find (marking it
> > somewhere temporarily) a matching one...  
> 
> I wasn't particularly concerned about the computational cost.  It was
> more that I couldn't quickly see a clear approach with unambiguous
> semantics.  But, I think I came up with one now, see later.

Ah, sorry, I assumed it was a combination of the two, that is, I
thought it would be sort of straightforward to do it (at least
initially) as O(n^2) worst case but you were considering it
unsustainable. On the other hand we have 256 rules...

> > > and
> > > kind of silly when then client almost certainly have created the
> > > shadow table using specific adds/removes from the original table.  
> > 
> > ...even though this is true conceptually, at least at a first glance
> > (why would I send 11 rules to add a single rule to a table of 10?), I
> > think the other details of the implementation, and conceptual matters
> > (such as rollback and two-step activation) make this apparent silliness
> > much less relevant, and I'm more and more convinced that a shadow table
> > is actually the simplest, most robust, least bug-prone approach.
> > 
> > Especially:
> >   
> > > # Rule states / active bit
> > > 
> > > I think we *do* still want two stage activation of new rules:  
> > 
> > ...this part, which led to a huge number of bugs over the years in nft
> > / nftables updates, which also use separate insert / activate / commit
> > / deactivate / delete operations.  
> 
> Huh, interesting.  I wasn't aware of that, and it's pretty persuasive.
> 
> > It's extremely complicated to grasp and implement properly, and you end
> > up with a lot of quasi-diffing anyway (to check for duplicates in
> > ranges, for example).
> > 
> > It makes much more sense in nftables because you can have hundreds of
> > megabytes of data stored in tables, but any usage that was ever
> > mentioned for passt in the past ~5 years would seem to imply at most
> > hundreds of kilobytes per table.
> > 
> > Shifting complexity to the client is also a relevant topic for me, as we
> > decided to have a binary client to avoid anything complicated (parsing)
> > in the server. A shadow table allows us to shift even more complexity
> > to the client, which is important for security.  
> 
> I definitely agree in principle - what I wasn't convinced about was
> that the overall balance actually favoured the client, because of my
> concern over the complexity of that "diff"ing.  But 
> 
> > I haven't finished drafting a proposal based on this idea, but I plan to
> > do it within one day or so.  
> 
> Actually, you convinced me already, so I can do that.
> 
> > It won't be as detailed, because I don't think it's realistic to come
> > up with all the details before writing any of the code (what's the
> > point if you then have to throw away 70% of it?) but I hope it will be
> > complete enough to provide a comparison.
> > 
> > By the way, at least at a first approximation, closing and reopening
> > listening sockets will mostly do the trick for anything our users
> > (mostly via Podman) will ever reasonably want, so I have half a mind of
> > keeping it like that in a first proposal, but indeed we should make
> > sure there's a way around it, which is what is is taking me a bit more
> > time to demonstrate.  
> 
> With some more thought I saw a way of doing the "diff" that looks
> pretty straightforward and reasonable.  Moreover it's less churn of
> the existing code, and works nicely with close-and-reopen as an
> interim step.  It even provides socket continuity for arbitrarily
> overlapping ranges in the old and new tables.

Oh, great! I was stuck pretty much at this point:

> For close and re-open, we can implement COMMIT as:
> 	1. fwd_listen_close() on old table
> 	2. fwd_listen_sync() on new table

...trying to figure out how interleaved (table vs. single socket) these
steps would be. In my mind I actually thought we would just call
fwd_listen_sync() which would make the diff itself and close left-over
sockets as needed but:

> I think we can get socket continuity if by swapping the order of those
> steps and extending fwd_sync_one() to do:
> 	for each port:
> 	    if <already opened>:
> 	        nothing to do
> <new>	    else if <matching open socket in old table>:
> <new>	        steal socket for new table
>             else:
> 	        open/bind/listen new socket
> 
> The "steal" would mark the fd as -1 in the old table so
> fwd_listen_close() won't get rid of it.

...this should be more practical I guess.

> I think the check for a matching socket in the old table will be
> moderately expensive O(n), but not so much as to be a problem in
> practice.

And again we could sort them eventually, which should make things
O(log(n)) on average (still O(n^2) worst case I guess).

> > > [...]
> > >
> > > # Suggested client workflow
> > > 
> > > I suggest the client should:
> > > 
> > >    1. Parse all rule modifications
> > >    2. INSERT all new rules  
> > >       -> On error, DELETE them again    
> > >    3. DEACTIVATE all removed rules  
> > >       -> Should only fail if the client has done something wrong    
> > >    4. ACTIVATE all new rules  
> > >       -> On error (rule conflict):    
> > >          DEACTIVATE rules we already ACTIVATEd
> > > 	 ACTIVATE rules we already DEACTIVATEd
> > > 	 DELETE rules we INSERTed
> > >    5. Check for bind errors (see details later)
> > >       If there are failures we can't tolerate:
> > >          DEACTIVATE rules we already ACTIVATEd
> > > 	 ACTIVATE rules we already DEACTIVATEd
> > > 	 DELETE rules we INSERTed
> > >    6. DELETE rules we DEACTIVATEd  
> > >       -> Should only fail if the client has done something wrong    
> > > 
> > > DEACTIVATE comes before ACTIVATE to avoid spurious conflicts between
> > > new rules and rules we're deleting.
> > > 
> > > I think that gets us closeish to "as atomic as we can be", at least
> > > from the perspective of peers.  The main case it doesn't catch is that
> > > we don't detect rule conflicts until after we might have removed some
> > > rules.  Is that good enough?  
> > 
> > I think it is absolutely fine as an outcome, but the complexity of error
> > handling in this case is a bit worrying. This is exactly the kind of
> > thing (and we discussed it already a couple of times) that made and
> > makes me think that a shadow table is a better approach instead.  
> 
> I'll work on a more concrete proposal based on the shadow table
> approach.  There are still some wrinkles with how to report bind()
> errors with this scheme to figure out.

I was thinking that with this scheme we would just report success or
failure without any further detail (except for warnings / error
messages we might print, but not part of the protocol), at least at the
beginning.

I'll comment on your new proposal in more detail though.

-- 
Stefano


* Re: Pesto protocol proposals
  2026-03-06  9:18     ` Stefano Brivio
@ 2026-03-06 12:23       ` David Gibson
  0 siblings, 0 replies; 5+ messages in thread
From: David Gibson @ 2026-03-06 12:23 UTC
  To: Stefano Brivio; +Cc: passt-dev

On Fri, Mar 06, 2026 at 10:18:27AM +0100, Stefano Brivio wrote:
> On Thu, 5 Mar 2026 15:19:40 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Mar 05, 2026 at 02:19:53AM +0100, Stefano Brivio wrote:
> > > On Wed, 4 Mar 2026 15:28:30 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > Most of today and yesterday I've spent thinking about the dynamic
> > > > update model and protocol.  I certainly don't have all the details
> > > > pinned down, let alone any implementation, but I have come to some
> > > > conclusions.
> > > > 
> > > > # Shadow forward table
> > > > 
> > > > On further consideration, I think this is a bad idea.  To avoid peer
> > > > visible disruption, we don't want to destroy and recreate listening
> > > > sockets  
> > > 
> > > (Side note: if it's just *listening* sockets, is this actually that
> > > bad?)  
> > 
> > Well, it's obviously much less bad than interrupting existing
> > connections.  It does mean a peer attempting to connect at the wrong
> > moment might get an ECONNREFUSED, as far as it knows, a permanent
> > error.
> 
> Right. Now, I'm not sure if it helps simplifying the plan from your new
> proposal even further but... consider this: *for the moment being* (as
> Podman will most likely be the only user of this feature for presumably
> a couple of months), it would simply mean that when Podman adds a
> container to an existing custom network, there are a couple of
> milliseconds during which new connections to existing containers are
> not accepted.
> 
> Surely something that needs to be fixed, but not an outrageous issue if
> you ask me. On the other hand, maybe it's structural enough that we
> want to get it right in the first place. Of course you know better about
> this.

Yeah, as discussed in my revised proposal I think that's a good
interim step.

> > > > that are associated with a forward rule that's not being altered.  
> > > 
> > > After reading the rest of your proposal, as long as:
> > >   
> > > > Doing that with a shadow table would mean we'd need to essentially
> > > > diff the two tables as we switch.  That seems moderately complex,  
> > > 
> > > ...this is the only downside (I can't think of others though), and I
> > > don't think it's *that* complex as I mentioned, it would be a O(n^2)
> > > step that can be probably optimised (via sorting) to O(n * log(m)) with
> > > n new rules and m old rules, cycling on new rules and creating listening
> > > sockets (we need this part anyway) unless we find (marking it
> > > somewhere temporarily) a matching one...  
> > 
> > I wasn't particularly concerned about the computational cost.  It was
> > more that I couldn't quickly see a clear approach with unambiguous
> > semantics.  But, I think I came up with one now, see later.
> 
> Ah, sorry, I assumed it was a combination of the two, that is, I
> thought it would be sort of straightforward to do it (at least
> initially) as O(n^2) worst case but you were considering it
> unsustainable. On the other hand we have 256 rules...

Right.  As long as the maximum number of rules remains that order of
magnitude, I think O(n^2) is acceptable for this fairly rare
operation.

> > > > and
> > > > kind of silly when the client will almost certainly have created the
> > > > shadow table using specific adds/removes from the original table.  
> > > 
> > > ...even though this is true conceptually, at least at a first glance
> > > (why would I send 11 rules to add a single rule to a table of 10?), I
> > > think the other details of the implementation, and conceptual matters
> > > (such as rollback and two-step activation) make this apparent silliness
> > > much less relevant, and I'm more and more convinced that a shadow table
> > > is actually the simplest, most robust, least bug-prone approach.
> > > 
> > > Especially:
> > >   
> > > > # Rule states / active bit
> > > > 
> > > > I think we *do* still want two stage activation of new rules:  
> > > 
> > > ...this part, which led to a huge number of bugs over the years in nft
> > > / nftables updates, which also use separate insert / activate / commit
> > > / deactivate / delete operations.  
> > 
> > Huh, interesting.  I wasn't aware of that, and it's pretty persuasive.
> > 
> > > It's extremely complicated to grasp and implement properly, and you end
> > > up with a lot of quasi-diffing anyway (to check for duplicates in
> > > ranges, for example).
> > > 
> > > It makes much more sense in nftables because you can have hundreds of
> > > megabytes of data stored in tables, but any usage that was ever
> > > mentioned for passt in the past ~5 years would seem to imply at most
> > > hundreds of kilobytes per table.
> > > 
> > > Shifting complexity to the client is also a relevant topic for me, as we
> > > decided to have a binary client to avoid anything complicated (parsing)
> > > in the server. A shadow table allows us to shift even more complexity
> > > to the client, which is important for security.  
> > 
> > I definitely agree in principle - what I wasn't convinced about was
> > that the overall balance actually favoured the client, because of my
> > concern over the complexity of that "diff"ing.  But 
> > 
> > > I haven't finished drafting a proposal based on this idea, but I plan to
> > > do it within one day or so.  
> > 
> > Actually, you convinced me already, so I can do that.
> > 
> > > It won't be as detailed, because I don't think it's realistic to come
> > > up with all the details before writing any of the code (what's the
> > > point if you then have to throw away 70% of it?) but I hope it will be
> > > complete enough to provide a comparison.
> > > 
> > > By the way, at least at a first approximation, closing and reopening
> > > listening sockets will mostly do the trick for anything our users
> > > (mostly via Podman) will ever reasonably want, so I have half a mind of
> > > keeping it like that in a first proposal, but indeed we should make
> > > sure there's a way around it, which is what is taking me a bit more
> > > time to demonstrate.  
> > 
> > With some more thought I saw a way of doing the "diff" that looks
> > pretty straightforward and reasonable.  Moreover it's less churn of
> > the existing code, and works nicely with close-and-reopen as an
> > interim step.  It even provides socket continuity for arbitrarily
> > overlapping ranges in the old and new tables.
> 
> Oh, great! I was stuck pretty much at this point:
> 
> > For close and re-open, we can implement COMMIT as:
> > 	1. fwd_listen_close() on old table
> > 	2. fwd_listen_sync() on new table
> 
> ...trying to figure out how interleaved (table vs. single socket) these
> steps would be. In my mind I actually thought we would just call
> fwd_listen_sync() which would make the diff itself and close left-over
> sockets as needed but:

Eh, that's basically just a question of what we name functions.  My
point is that the above will work with the existing implementation of
fwd_listen_sync().

> > I think we can get socket continuity by swapping the order of those
> > steps and extending fwd_sync_one() to do:
> > 	for each port:
> > 	    if <already opened>:
> > 	        nothing to do
> > <new>	    else if <matching open socket in old table>:
> > <new>	        steal socket for new table
> > 	    else:
> > 	        open/bind/listen new socket
> > 
> > The "steal" would mark the fd as -1 in the old table so
> > fwd_listen_close() won't get rid of it.
> 
> ...this should be more practical I guess.
> 
> > I think the check for a matching socket in the old table will be
> > moderately expensive O(n), but not so much as to be a problem in
> > practice.
> 
> And again we could sort them eventually, which should make things
> O(log(n)) on average (still O(n^2) worst case I guess).
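
To make the "steal" step concrete, here is a minimal sketch of the
swapped-order COMMIT described above.  It is not the passt
implementation: struct rule, struct table, rule_init(), fake_listen(),
steal(), sync_one(), listen_close() and commit() are all hypothetical
names, sockets are faked with a counter instead of real
socket()/bind()/listen() calls, and a "matching" socket is assumed to
mean "same port" only (a real match would presumably also compare
address, protocol and interface).

```c
#include <assert.h>

#define MAX_RULES	4	/* hypothetical, kept tiny for illustration */
#define MAX_PORTS	8	/* hypothetical cap on ports per rule */

/* Hypothetical, simplified rule: one listening fd per port in the range */
struct rule {
	unsigned first, last;	/* inclusive port range */
	int fd[MAX_PORTS];	/* -1: no socket open for this port */
};

struct table {
	int count;
	struct rule rules[MAX_RULES];
};

static int next_fd = 100;	/* fake fd allocator, stands in for socket() */
static int n_opened, n_closed;	/* counters so the effect can be checked */

static void rule_init(struct rule *r, unsigned first, unsigned last)
{
	r->first = first;
	r->last = last;
	for (int i = 0; i < MAX_PORTS; i++)
		r->fd[i] = -1;
}

/* Stand-in for open/bind/listen: no real socket is created */
static int fake_listen(unsigned port)
{
	(void)port;
	n_opened++;
	return next_fd++;
}

/* Take over an open socket for @port from @old, or return -1 if none */
static int steal(struct table *old, unsigned port)
{
	for (int i = 0; i < old->count; i++) {
		struct rule *r = &old->rules[i];
		int fd;

		if (port < r->first || port > r->last)
			continue;
		fd = r->fd[port - r->first];
		if (fd < 0)
			continue;
		r->fd[port - r->first] = -1;	/* old table must not close it */
		return fd;
	}
	return -1;
}

static void sync_one(struct rule *r, struct table *old)
{
	assert(r->last - r->first < MAX_PORTS);

	for (unsigned port = r->first; port <= r->last; port++) {
		int *fd = &r->fd[port - r->first];

		if (*fd >= 0)			/* already opened */
			continue;
		*fd = steal(old, port);		/* matching socket in old table? */
		if (*fd < 0)
			*fd = fake_listen(port);	/* open new socket */
	}
}

static void listen_close(struct table *t)
{
	for (int i = 0; i < t->count; i++) {
		for (int j = 0; j < MAX_PORTS; j++) {
			if (t->rules[i].fd[j] < 0)
				continue;
			n_closed++;		/* stand-in for close() */
			t->rules[i].fd[j] = -1;
		}
	}
}

/* COMMIT with the steps swapped: sync the new table first, then close */
static void commit(struct table *old, struct table *new_tbl)
{
	for (int i = 0; i < new_tbl->count; i++)
		sync_one(&new_tbl->rules[i], old);
	listen_close(old);
}
```

With an old rule covering ports 8000-8002 and a new rule covering
8001-8004, this keeps the sockets for 8001 and 8002, opens new ones for
8003 and 8004, and closes only 8000.  Calling listen_close() before the
sync loop and dropping steal() gives the interim close-and-reopen
behaviour instead.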
> 
> > > > [...]
> > > >
> > > > # Suggested client workflow
> > > > 
> > > > I suggest the client should:
> > > > 
> > > >    1. Parse all rule modifications
> > > >    2. INSERT all new rules  
> > > >       -> On error, DELETE them again    
> > > >    3. DEACTIVATE all removed rules  
> > > >       -> Should only fail if the client has done something wrong    
> > > >    4. ACTIVATE all new rules  
> > > >       -> On error (rule conflict):    
> > > >          DEACTIVATE rules we already ACTIVATEd
> > > > 	 ACTIVATE rules we already DEACTIVATEd
> > > > 	 DELETE rules we INSERTed
> > > >    5. Check for bind errors (see details later)
> > > >       If there are failures we can't tolerate:
> > > >          DEACTIVATE rules we already ACTIVATEd
> > > > 	 ACTIVATE rules we already DEACTIVATEd
> > > > 	 DELETE rules we INSERTed
> > > >    6. DELETE rules we DEACTIVATEd  
> > > >       -> Should only fail if the client has done something wrong    
> > > > 
> > > > DEACTIVATE comes before ACTIVATE to avoid spurious conflicts between
> > > > new rules and rules we're deleting.
> > > > 
> > > > I think that gets us closeish to "as atomic as we can be", at least
> > > > from the perspective of peers.  The main case it doesn't catch is that
> > > > we don't detect rule conflicts until after we might have removed some
> > > > rules.  Is that good enough?  
> > > 
> > > I think it is absolutely fine as an outcome, but the complexity of error
> > > handling in this case is a bit worrying. This is exactly the kind of
> > > thing (and we discussed it already a couple of times) that made and
> > > makes me think that a shadow table is a better approach instead.  
> > 
> > I'll work on a more concrete proposal based on the shadow table
> > approach.  There are still some wrinkles with how to report bind()
> > errors with this scheme to figure out.
> 
> I was thinking that with this scheme we would just report success or
> failure without any further detail (except for warnings / error
> messages we might print, but not part of the protocol), at least at the
> beginning.
> 
> I'll comment on your new proposal in more detail though.
> 
> -- 
> Stefano
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

Thread overview: 5+ messages
2026-03-04  4:28 Pesto protocol proposals David Gibson
2026-03-05  1:19 ` Stefano Brivio
2026-03-05  4:19   ` David Gibson
2026-03-06  9:18     ` Stefano Brivio
2026-03-06 12:23       ` David Gibson
