From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson
Subject: Re: Pesto protocol proposals
Message-ID: <20260306101827.38124251@elisabeth>
References: <20260305021952.17963c3f@elisabeth>
Organization: Red Hat
Date: Fri, 06 Mar 2026 10:18:27 +0100 (CET)
CC: passt-dev@passt.top
List-Id: Development discussion and patches for passt

On Thu, 5 Mar 2026 15:19:40 +1100
David Gibson wrote:

> On Thu, Mar 05, 2026 at 02:19:53AM +0100, Stefano Brivio wrote:
> > On Wed, 4 Mar 2026 15:28:30 +1100
> > David Gibson wrote:
> > 
> > > Most of today and yesterday I've spent thinking about the dynamic
> > > update model and protocol. I certainly don't have all the details
> > > pinned down, let alone any implementation, but I have come to some
> > > conclusions.
> > >
> > > # Shadow forward table
> > >
> > > On further consideration, I think this is a bad idea.
> > > To avoid peer visible disruption, we don't want to destroy and
> > > recreate listening sockets
> > 
> > (Side note: if it's just *listening* sockets, is this actually that
> > bad?)
> 
> Well, it's obviously much less bad than interrupting existing
> connections. It does mean a peer attempting to connect at the wrong
> moment might get an ECONNREFUSED which, as far as it knows, is a
> permanent error.

Right. Now, I'm not sure if it helps simplify the plan from your new
proposal even further, but... consider this: *for the time being* (as
Podman will most likely be the only user of this feature for presumably
a couple of months), it would simply mean that when Podman adds a
container to an existing custom network, there are a couple of
milliseconds during which new connections to existing containers are
not accepted. Surely something that needs to be fixed, but not an
outrageous issue if you ask me.

On the other hand, maybe it's structural enough that we want to get it
right in the first place. Of course you know better about this.

> > > that are associated with a forward rule that's not being altered.
> > 
> > After reading the rest of your proposal, as long as:
> > 
> > > Doing that with a shadow table would mean we'd need to essentially
> > > diff the two tables as we switch. That seems moderately complex,
> > 
> > ...this is the only downside (I can't think of others though), and I
> > don't think it's *that* complex: as I mentioned, it would be an
> > O(n^2) step that can probably be optimised (via sorting) to
> > O(n * log(m)), with n new rules and m old rules, cycling on new
> > rules and creating listening sockets (we need this part anyway)
> > unless we find (marking it somewhere temporarily) a matching one...
> 
> I wasn't particularly concerned about the computational cost. It was
> more that I couldn't quickly see a clear approach with unambiguous
> semantics. But, I think I came up with one now, see later.
Ah, sorry, I assumed it was a combination of the two, that is, I
thought it would be sort of straightforward to do it (at least
initially) as O(n^2) worst case, but you were considering it
unsustainable. On the other hand, we have 256 rules...

> > > and
> > > kind of silly when the client almost certainly has created the
> > > shadow table using specific adds/removes from the original table.
> > 
> > ...even though this is true conceptually, at least at a first
> > glance (why would I send 11 rules to add a single rule to a table
> > of 10?), I think the other details of the implementation, and
> > conceptual matters (such as rollback and two-step activation), make
> > this apparent silliness much less relevant, and I'm more and more
> > convinced that a shadow table is actually the simplest, most
> > robust, least bug-prone approach.
> > 
> > Especially:
> > 
> > > # Rule states / active bit
> > >
> > > I think we *do* still want two stage activation of new rules:
> > 
> > ...this part, which led to a huge number of bugs over the years in
> > nft / nftables updates, which also use separate insert / activate /
> > commit / deactivate / delete operations.
> 
> Huh, interesting. I wasn't aware of that, and it's pretty persuasive.
> 
> > It's extremely complicated to grasp and implement properly, and you
> > end up with a lot of quasi-diffing anyway (to check for duplicates
> > in ranges, for example).
> > 
> > It makes much more sense in nftables because you can have hundreds
> > of megabytes of data stored in tables, but any usage that was ever
> > mentioned for passt in the past ~5 years would seem to imply at
> > most hundreds of kilobytes per table.
> > 
> > Shifting complexity to the client is also a relevant topic for me,
> > as we decided to have a binary client to avoid anything complicated
> > (parsing) in the server. A shadow table allows us to shift even
> > more complexity to the client, which is important for security.
> I definitely agree in principle - what I wasn't convinced about was
> that the overall balance actually favoured the client, because of my
> concern over the complexity of that "diff"ing. But
> 
> > I haven't finished drafting a proposal based on this idea, but I
> > plan to do it within one day or so.
> 
> Actually, you convinced me already, so I can do that.
> 
> > It won't be as detailed, because I don't think it's realistic to
> > come up with all the details before writing any of the code (what's
> > the point if you then have to throw away 70% of it?) but I hope it
> > will be complete enough to provide a comparison.
> > 
> > By the way, at least at a first approximation, closing and
> > reopening listening sockets will mostly do the trick for anything
> > our users (mostly via Podman) will ever reasonably want, so I have
> > half a mind to keep it like that in a first proposal, but indeed we
> > should make sure there's a way around it, which is what is taking
> > me a bit more time to demonstrate.
> 
> With some more thought I saw a way of doing the "diff" that looks
> pretty straightforward and reasonable. Moreover, it's less churn of
> the existing code, and works nicely with close-and-reopen as an
> interim step. It even provides socket continuity for arbitrarily
> overlapping ranges in the old and new tables.

Oh, great! I was stuck pretty much at this point:

> For close and re-open, we can implement COMMIT as:
> 1. fwd_listen_close() on old table
> 2. fwd_listen_sync() on new table

...trying to figure out how interleaved (table vs. single socket)
these steps would be.
In my mind, I actually thought we would just call fwd_listen_sync(),
which would make the diff itself and close left-over sockets as
needed, but:

> I think we can get socket continuity by swapping the order of those
> steps and extending fwd_sync_one() to do:
> 
>     for each port:
>         if <...>:
>             nothing to do
>         else if <...>:
>             steal socket for new table
>         else:
>             open/bind/listen new socket
> 
> The "steal" would mark the fd as -1 in the old table so
> fwd_listen_close() won't get rid of it.

...this should be more practical, I guess.

> I think the check for a matching socket in the old table will be
> moderately expensive O(n), but not so much as to be a problem in
> practice.

And again, we could sort them eventually, which should make things
O(log(n)) on average (still O(n^2) worst case, I guess).

> > > [...]
> > >
> > > # Suggested client workflow
> > >
> > > I suggest the client should:
> > >
> > > 1. Parse all rule modifications
> > > 2. INSERT all new rules
> > >    -> On error, DELETE them again
> > > 3. DEACTIVATE all removed rules
> > >    -> Should only fail if the client has done something wrong
> > > 4. ACTIVATE all new rules
> > >    -> On error (rule conflict):
> > >       DEACTIVATE rules we already ACTIVATEd
> > >       ACTIVATE rules we already DEACTIVATEd
> > >       DELETE rules we INSERTed
> > > 5. Check for bind errors (see details later)
> > >    If there are failures we can't tolerate:
> > >       DEACTIVATE rules we already ACTIVATEd
> > >       ACTIVATE rules we already DEACTIVATEd
> > >       DELETE rules we INSERTed
> > > 6. DELETE rules we DEACTIVATEd
> > >    -> Should only fail if the client has done something wrong
> > >
> > > DEACTIVATE comes before ACTIVATE to avoid spurious conflicts
> > > between new rules and rules we're deleting.
> > >
> > > I think that gets us closeish to "as atomic as we can be", at
> > > least from the perspective of peers. The main case it doesn't
> > > catch is that we don't detect rule conflicts until after we might
> > > have removed some rules. Is that good enough?
> > I think it is absolutely fine as an outcome, but the complexity of
> > error handling in this case is a bit worrying. This is exactly the
> > kind of thing (and we discussed it already a couple of times) that
> > made and makes me think that a shadow table is a better approach
> > instead.
> 
> I'll work on a more concrete proposal based on the shadow table
> approach. There are still some wrinkles to figure out with how to
> report bind() errors with this scheme.

I was thinking that, with this scheme, we would just report success or
failure without any further detail (except for warnings / error
messages we might print, but not as part of the protocol), at least at
the beginning.

I'll comment on your new proposal in more detail, though.

-- 
Stefano