* Pesto Protocol Proposals, imProved
From: David Gibson @ 2026-03-06 1:08 UTC (permalink / raw)
To: passt-dev
Stefano convinced me that my earlier proposal for the dynamic update
protocol was unnecessarily complex. Plus, I saw a much better way of
handling socket continuity in the context of a "whole table"
replacement. So here's an entirely revised protocol suggestion.
# Outline
I suggest that each connection to the control socket handles a single
transaction.
1. Server hello
- Server sends magic number, version
- Possibly feature flags / limits (e.g. max number of rules
allowed)
2. Client hello
- Client sends magic number
- Do we need anything else?
3. Server lists pifs
- Server sends the number of pifs, their indices and names
4. Server lists rules
- Server sends the list of rules, one pif at a time
5. Client gives new rules
- Client sends the new list of rules, one pif at a time
- Server loads them into the shadow table, and validates
(no socket operations)
6. Server acknowledges
- Either reports an error and disconnects, or acks and waits for
the client
7. Client signals apply
- Server swaps shadow and active tables, and syncs sockets
with new active table
8. Server gives error summary
- Server reports bind/listen/whatever errors
9a. Client signals commit
- Shadow table (now the old table) discarded
or
9b. Client signals rollback
- Shadow and active tables swapped back, sockets synced
- Discard shadow table (now the "new" table again)
- New bind error report?
10. Server closes control connection
# Client disconnects
A client disconnect before step (7) is straightforward: discard the
shadow table, nothing has changed.
A client disconnect between (7) and (9) triggers a rollback, same as (9b).
# Error reporting
Error reporting at step (6) is fairly straightforward: we can send an
error code and/or an error message.
Error reporting at (8) is trickier. As a first cut, we could just
report "yes" or "no" - taking into account the FWD_WEAK flag. But the
client might be able to make better decisions, or at least give better
messages to the user, if we report more detailed information.
Exactly how detailed is an open question: number of bind failures?
number of failures per rule? specific ports which failed?
# Interim steps
I propose these steps toward implementing this:
i. Merge TCP and UDP rule tables. The protocol above assumes a
single rule table per-pif, which I think is an easier model to
understand and more extensible for future protocol support.
ii. Read-only client. Implement steps (1) to (4). Client can query
and list the current rules, but not change them.
iii. Rule updates. Implement remaining protocol steps, but with a
"close and re-open" approach on the server, so unaltered
listening sockets might briefly disappear.
iv. Socket continuity. Have the socket sync "steal" sockets from the
old table in preference to re-opening them.
If you have any time to work on (ii) while I work on (i), those should
be parallelizable.
# Concurrent updates
Server guarantees that a single transaction as above is atomic in the
sense that nothing else is allowed to change the rules between (4) and
(9). The easiest way to do that initially is probably to only allow a
single client connection at a time. If there's a reason to, we could
alter that so that concurrent connections are allowed, but if another
client changed anything after step (4), then we give an error on the
next op (or maybe just close the control socket from the server side).
# Tweaks / variants
- I'm not sure that step (2) is necessary
- I'm not certain that step (7) is necessary, although I do kind of
prefer the client getting a chance to see a "so far, so good"
before any socket operations happen.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
* Re: Pesto Protocol Proposals, imProved
From: Stefano Brivio @ 2026-03-06 10:58 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev
On Fri, 6 Mar 2026 12:08:07 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> Stefano convinced me that my earlier proposal for the dynamic update
> protocol was unnecessarily complex. Plus, I saw a much better way of
> handling socket continuity in the context of a "whole table"
> replacement. So here's an entirely revised protocol suggestion.
>
> # Outline
>
> I suggest that each connection to the control socket handles a single
> transaction.
>
> 1. Server hello
> - Server sends magic number, version
> - Possibly feature flags / limits (e.g. max number of rules
> allowed)
Feature flags and limits could be fixed depending on the version, for
simplicity.
If pifs are unexpected (somebody trying to forward ports to a container
and touching passt instead) we should find out as part of 3. I can't
think of other substantial types of mismatches.
> 2. Client hello
> - Client sends magic number
> - Do we need anything else?
As long as we have a version reported by the server, we should be fine.
We'll just increase it if we need something else.
Do we want a client version too?
> 3. Server lists pifs
> - Server sends the number of pifs, their indices and names
Up to here, I guess we can skip all this for an initial
Podman-side-complete implementation.
> 4. Server lists rules
> - Server sends the list of rules, one pif at a time
Could this be a fixed-size blob with up to, say, 16 pifs?
We'll need to generalise pifs at some point. I'm not sure if it makes
things simpler. I would defer this to the implementation.
> 5. Client gives new rules
> - Client sends the new list of rules, one pif at a time
> - Server loads them into the shadow table, and validates
> (no socket operations)
Is it one shadow table per pif or one with everything?
If it's one per pif, do we want to have the whole exchange prepended by
"load table for pif x" or "store table for pif y" commands?
I would suggest not, at the moment, as it looks slightly complicated,
but eventually in a later version we could switch to that.
> 6. Server acknowledges
> - Either reports an error and disconnects, or acks waiting for
> client
> 7. Client signals apply
> - Server swaps shadow and active tables, and syncs sockets
> with new active table
> 8. Server gives error summary
> - Server reports bind/listen/whatever errors
> 9a. Client signals commit
> - Shadow table (now the old table) discarded
> or
> 9b. Client signals rollback
> - Shadow and active tables swapped back, syncs sockets
> - Discard shadow table (now the "new" table again)
> - New bind error report?
Do we need these as five separate steps?
Couldn't the server simply apply or try to apply as soon as the client
is done, and acknowledge or return error once everything is done?
What about this instead:
5. Client sends new rules (blob of known size)
6. Server receives, loads into shadow table, swaps tables and
syncs socket, with rollback to old table on error.
8. Server sends error / success summary (single byte, at
least in this version)
> 10. Server closes control connection
...if we keep my 8. above, it would be more logical that the client
closes the connection.
> # Client disconnects
>
> A client disconnect before step (7) is straightforward: discard the
> shadow table, nothing has changed.
>
> A client disconnect between (7) and (9) triggers a rollback, same as (9b).
In my modified version, a client disconnect during 5. would trigger
discarding of the shadow table that's being filled (kind of no-op
really).
A disconnect after that, on the other hand, doesn't affect the
following steps, but the server won't report error or success.
> # Error reporting
>
> Error reporting at step (6) is fairly straightforward: we can send an
> error code and/or an error message.
>
> Error reporting at (8) is trickier. As a first cut, we could just
> report "yes" or "no" - taking into account the FWD_WEAK flag. But the
> client might be able to make better decisions or at least better
> messages to the user if we report more detailed information.
> Exactly how detailed is an open question: number of bind failures?
> number of failures per rule? specific ports which failed?
For the moment I would report a single byte.
Later, we could probably send back the list of rules with a success /
error status for each one of them. Think of just sending the
same type of fixed-size table back and forth.
> # Interim steps
>
> I propose these steps toward implementing this:
>
> i. Merge TCP and UDP rule tables. The protocol above assumes a
> single rule table per-pif, which I think is an easier model to
> understand and more extensible for future protocol support.
> ii. Read-only client. Implement steps (1) to (4). Client can query
> and list the current rules, but not change them.
> iii. Rule updates. Implement remaining protocol steps, but with a
> "close and re-open" approach on the server, so unaltered
> listening sockets might briefly disappear
> iv. Socket continuity. Have the socket sync "steal" sockets from the
> old table in preference to re-opening them.
>
> If you have any time to work on (ii) while I work on (i), those should
> be parallelizable.
Yes, I'll start adapting the existing draft as soon as possible. I
think ii. could go in parallel with all the other steps, I can just
call some stubs meanwhile.
> # Concurrent updates
>
> Server guarantees that a single transaction as above is atomic in the
> sense that nothing else is allowed to change the rules between (4) and
> (9). The easiest way to do that initially is probably to only allow a
> single client connection at a time.
I would call this a feature...
> If there's a reason to, we could
> alter that so that concurrent connections are allowed, but if another
> client changed anything after step (4), then we give an error on the
> next op (or maybe just close the control socket from the server side).
...even if we go for my modified version.
> # Tweaks / variants
>
> - I'm not sure that step (2) is necessary
I would skip it. The only reason why we might want it is to send a
client version, but we can also implement sending of a client version
only starting from a newer server version.
> - I'm not certain that step (7) is necessary, although I do kind of
> prefer the client getting a chance to see a "so far, so good"
> before any socket operations happen.
I think it's quite unrealistic that we'll ever manage to build some
sensible logic to decide what to do depending on partial failures.
If so far is not good, the server should just abort, and the user will
have to fix mistakes and try again.
--
Stefano
* Re: Pesto Protocol Proposals, imProved
From: David Gibson @ 2026-03-06 12:54 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
On Fri, Mar 06, 2026 at 11:58:14AM +0100, Stefano Brivio wrote:
> On Fri, 6 Mar 2026 12:08:07 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > Stefano convinced me that my earlier proposal for the dynamic update
> > protocol was unnecessarily complex. Plus, I saw a much better way of
> > handling socket continuity in the context of a "whole table"
> > replacement. So here's an entirely revised protocol suggestion.
> >
> > # Outline
> >
> > I suggest that each connection to the control socket handles a single
> > transaction.
> >
> > 1. Server hello
> > - Server sends magic number, version
> > - Possibly feature flags / limits (e.g. max number of rules
> > allowed)
>
> Feature flags and limits could be fixed depending on the version, for
> simplicity.
We could, but it's a tradeoff between simplicity right now, and
simplicity of later alterations. Fwiw, I was thinking of feature
flags for things that already fit into this general scheme, but might
not be immediately implemented. For example, allowing rules for
protocols other than TCP & UDP.
> If pifs are unexpected (somebody trying to forward ports to a container
> and touching passt instead) we should find out as part of 3. I can't
> think of other substantial types of mismatches.
If we extend the table to control the target side address (which we
probably want, at least once we have a table for tap packets) we might
have some options for different mapping modes (many to one, identity).
If we don't implement all those immediately, feature flags could be
useful.
> > 2. Client hello
> > - Client sends magic number
> > - Do we need anything else?
>
> As long as we have a version reported by the server, we should be fine.
> We'll just increase it if we need something else.
The model I had in mind was that the server just advertises a version,
and it's up to the client to match it. Again, reducing server
complexity at the cost of client complexity. I was thinking the
server could advertise a min_compat_version, like with migration, for
cases where the new version is strictly backwards compatible with the
old one (e.g. identical layout, but some things that weren't permitted
before now are).
> Do we want a client version too?
I don't think we need one. I think a client magic number is a good
idea though. We have a binary protocol that will largely be made up
of IP addresses and u16s, where all (or at least most) possible bit
patterns are valid. A client magic number removes the risk that some
random unrelated program could connect to the socket, send some
garbage to passt, and have it do anything other than just reject the
transaction with an error.
> > 3. Server lists pifs
> > - Server sends the number of pifs, their indices and names
>
> Up to here, I guess we can skip all this for an initial
> Podman-side-complete implementation.
Not really sure what you mean.
> > 4. Server lists rules
> > - Server sends the list of rules, one pif at a time
>
> Could this be a fixed-size blob with up to, say, 16 pifs?
It could, but I don't think it's a great idea. Allowing a variable
number isn't that hard to implement, and decouples the protocol from
the specific internal values of PIF_NUM_TYPES and MAX_FWD_RULES.
I guess we could do it for a PoC, but I don't think we should release
something with a fixed blob.
> We'll need to generalise pifs at some point. I'm not sure if it makes
> things simpler. I would defer this to the implementation.
>
> > 5. Client gives new rules
> > - Client sends the new list of rules, one pif at a time
> > - Server loads them into the shadow table, and validates
> > (no socket operations)
>
> Is it one shadow table per pif or one with everything?
Conceptually, one shadow table globally. Internally I guess it would
be one per pif, but I was intending that the whole shebang be switched
as a unit.
> If it's one per pif, do we want to have the whole exchange prepended by
> "load table for pif x" or "store table for pif y" commands?
>
> I would suggest not, at the moment, as it looks slightly complicated,
> but eventually in a later version we could switch to that.
We could, but I don't really see a reason we'd want to.
> > 6. Server acknowledges
> > - Either reports an error and disconnects, or acks waiting for
> > client
> > 7. Client signals apply
> > - Server swaps shadow and active tables, and syncs sockets
> > with new active table
> > 8. Server gives error summary
> > - Server reports bind/listen/whatever errors
> > 9a. Client signals commit
> > - Shadow table (now the old table) discarded
> > or
> > 9b. Client signals rollback
> > - Shadow and active tables swapped back, syncs sockets
> > - Discard shadow table (now the "new" table again)
> > - New bind error report?
>
> Do we need these as five separate steps?
>
> Couldn't the server simply apply or try to apply as soon as the client
> is done, and acknowledge or return error once everything is done?
>
> What about this instead:
>
> 5. Client sends new rules (blob of known size)
>
> 6. Server receives, loads into shadow table, swaps tables and
> syncs socket, with rollback to old table on error.
>
> 8. Server sends error / success summary (single byte, at
> least in this version)
>
> > 10. Server closes control connection
Uh.. yeah, I think that works. I still think a client magic number is
worthwhile, but that could be considered part of (5). In this case I
do think the error code needs at least 3 values: success, "safe" fail
(before we attempted any socket operations), socket operation fail.
We might also want one if something went wrong during the rollback
(e.g. we removed a mapping, and some other program bound the port
before we rolled back).
> ...if we keep my 8. above, it would be more logical that the client
> closes the connection.
I think having the server close the connection will simplify state
management inside passt a little: we don't have to handle events
for closing our side once the client has closed, or deal with the
client incorrectly sending garbage after the transaction. Once we've
sent the error code, we're signalling unambiguously: the transaction
is over now; if the client wants to do something else, it must start
again.
> > # Client disconnects
> >
> > A client disconnect before step (7) is straightforward: discard the
> > shadow table, nothing has changed.
> >
> > A client disconnect between (7) and (9) triggers a rollback, same as (9b).
>
> In my modified version, a client disconnect during 5. would trigger
> discarding of the shadow table that's being filled (kind of no-op
> really).
Right.
> A disconnect after that doesn't affect the following steps instead, but
> the server won't report error or success.
That's one option. We could also force a rollback in this case
(essentially, roll back if we get an error sending the status back to
the client). I have no strong preference.
> > # Error reporting
> >
> > Error reporting at step (6) is fairly straightforward: we can send an
> > error code and/or an error message.
> >
> > Error reporting at (8) is trickier. As a first cut, we could just
> > report "yes" or "no" - taking into account the FWD_WEAK flag. But the
> > client might be able to make better decisions or at least better
> > messages to the user if we report more detailed information.
> > Exactly how detailed is an open question: number of bind failures?
> > number of failures per rule? specific ports which failed?
>
> For the moment I would report a single byte.
>
> Later, we could probably send back the list of rules with a success /
> error type version for each one of them. Think of just sending the
> same type of fixed-size table back and forth.
Right. We could put an error count / status field in each rule and
consider it part of the table.
> > # Interim steps
> >
> > I propose these steps toward implementing this:
> >
> > i. Merge TCP and UDP rule tables. The protocol above assumes a
> > single rule table per-pif, which I think is an easier model to
> > understand and more extensible for future protocol support.
> > ii. Read-only client. Implement steps (1) to (4). Client can query
> > and list the current rules, but not change them.
> > iii. Rule updates. Implement remaining protocol steps, but with a
> > "close and re-open" approach on the server, so unaltered
> > listening sockets might briefly disappear
> > iv. Socket continuity. Have the socket sync "steal" sockets from the
> > old table in preference to re-opening them.
> >
> > If you have any time to work on (ii) while I work on (i), those should
> > be parallelizable.
>
> Yes, I'll start adapting the existing draft as soon as possible. I
> think ii. could go in parallel with all the other steps, I can just
> call some stubs meanwhile.
Ok, great.
> > # Concurrent updates
> >
> > Server guarantees that a single transaction as above is atomic in the
> > sense that nothing else is allowed to change the rules between (4) and
> > (9). The easiest way to do that initially is probably to only allow a
> > single client connection at a time.
>
> I would call this a feature...
I tend to agree.
> > If there's a reason to, we could
> > alter that so that concurrent connections are allowed, but if another
> > client changed anything after step (4), then we give an error on the
> > next op (or maybe just close the control socket from the server side).
>
> ...even if we go for my modified version.
Yes.
> > # Tweaks / variants
> >
> > - I'm not sure that step (2) is necessary
>
> I would skip it. The only reason why we might want it is to send a
> client version, but we can also implement sending of a client version
> only starting from a newer server version.
Again, I think the magic number is important, but we could fold that
into a later step.
>
> > - I'm not certain that step (7) is necessary, although I do kind of
> > prefer the client getting a chance to see a "so far, so good"
> > before any socket operations happen.
>
> I think it's quite unrealistic that we'll ever manage to build some
> sensible logic to decide what to do depending on partial failures.
Eh, maybe. An error count per rule would be enough to move the
implementation of FWD_WEAK to the client.
> If so far is not good, the server should just abort, and the user will
> have to fix mistakes and try again.
Yeah, that makes sense.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
* Re: Pesto Protocol Proposals, imProved
From: Stefano Brivio @ 2026-03-06 13:18 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev
On Fri, 6 Mar 2026 23:54:49 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Fri, Mar 06, 2026 at 11:58:14AM +0100, Stefano Brivio wrote:
> > On Fri, 6 Mar 2026 12:08:07 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > [...]
> >
> > > 3. Server lists pifs
> > > - Server sends the number of pifs, their indices and names
> >
> > Up to here, I guess we can skip all this for an initial
> > Podman-side-complete implementation.
>
> Not really sure what you mean.
I've seen later that you indicate (1) to (4) as part of step ii. of the
implementation, but I was actually thinking that we could skip 1., 2.,
and 3., and still have something we can use to settle on a semi-proven
command-line for Podman's usage (they should start integrating this as
soon as possible, even if it's not working).
On the other hand, 3. is probably trivial and probably needed, 1. not
strictly needed but trivial, and 2. we agreed we'll skip for the
moment... so fine, let me keep those as part of step ii. It should take
minutes.
--
Stefano