From mboxrd@z Thu Jan  1 00:00:00 1970
Authentication-Results: passt.top; dmarc=pass (p=quarantine dis=none) header.from=redhat.com
Authentication-Results: passt.top;
	dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=ZN1PBxK3;
	dkim-atps=neutral
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	by passt.top (Postfix) with ESMTPS id 0D5595A0274
	for <passt-dev@passt.top>; Thu, 05 Mar 2026 02:19:57 +0100 (CET)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1772673596;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=ENOcR+IYU8ELBfbdbthUnpt6I/60JjdTPU2hhpIVYd0=;
	b=ZN1PBxK3yLngMzefjnWdAHIbWuBMyd5me9mINaVUWkvsNyG17JfNFZRD37K/OuH5ddWMfh
	p6itm4RUv9fueUA+2e6iZgrNXlbesBX8B/tShcesew4SCFKOJyg6W/Zd7getIhx5Rm8Ewq
	H/Q6uESLVZ1RBofAGriwYbmdMnsiI8s=
Received: from mail-wr1-f70.google.com (mail-wr1-f70.google.com
 [209.85.221.70]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id
 us-mta-625-BYWmJjy8MpeAEtL2GwixKw-1; Wed, 04 Mar 2026 20:19:55 -0500
X-MC-Unique: BYWmJjy8MpeAEtL2GwixKw-1
X-Mimecast-MFC-AGG-ID: BYWmJjy8MpeAEtL2GwixKw_1772673595
Received: by mail-wr1-f70.google.com with SMTP id ffacd0b85a97d-439b9116e2eso2427611f8f.2
        for <passt-dev@passt.top>; Wed, 04 Mar 2026 17:19:55 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1772673594; x=1773278394;
        h=date:content-transfer-encoding:mime-version:organization:references
         :in-reply-to:message-id:subject:cc:to:from:x-gm-gg
         :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
        bh=ENOcR+IYU8ELBfbdbthUnpt6I/60JjdTPU2hhpIVYd0=;
        b=XzrFKcIEcZvX2M/32OJClpl2mlVEEyVVVmvyXTVm4M2Fq9AJV5JOM8W5BTUVe5Pycv
         9y+fFuLJNG/Bjzt45l/ElAkdBy7IxAUjRyo1irDzcSsm659sWnUgzhBMTVMz2trAv8O2
         tnh5pzkHIqMkN3tnNCTM+1CLKxzNQtEV+UrNZk2giyPeuBLb2XWxxuvCkbOjPgdml8HY
         G0GpkdMvgwE7Xu3P77EY8xlPcWs90QqgFArmjy6U8IJ00dNe67C0Zfil0dsgqqEXLghz
         ClVdQFkUQP1GtPD6hEz6ALZ7v6p1DvmbJOFZd/z3UG6958AUMTTEB8LaeHE6ICdwfC+N
         ZXBg==
X-Gm-Message-State: AOJu0YwRm2z+XDYg6Qp3mURV4k/YQexfUL3YfyT/fGjGxW7KKW2/ergS
	EprXuWp72WnzwpmeKuHgeuTzHlp1HbUKmAHxk0kSIXpRxdXvPd1VDOw7xPN31sGzZTT1+dQFkdd
	/IPpt40czBLw1al+Q78uaEDC4s2Iucc2PAATXnGVSAFLFioZCVAwdlA==
X-Gm-Gg: ATEYQzxQG0W9v81YFQGsSu2tsIGlBm2cWPEZIQDYHNxo38VXQbCHq5J20t3VTuAZ+eW
	n022/Pu9wvUW2WUilVSKa4zgucG9B8vn92gDDBQPnmfcsxx3fUBVApgbMKd8WikUy5M2Rkbzcyo
	FfnzzIAqXRH/eyRaBoMFdbtwBjJ7zSUyVyPBcTide7YC9QetvNO254HKIxFe69wXbfdblGsG3DE
	DJPVr8ZRVvhWhwwrG1NdI9H9afVOPhy5OVTixnKKIkpTzxGcGvl1uDY4NJ2DAvQGv6yyMdnNYZi
	d4zmHeu9hQgG5Fq1YNXgjA6Adux3OjdSHVykYHWMNeIsReqSb2blXzMwd4Emqq3ijp2BjMqejMa
	5zx3hiOLNKA63rSApEJV7lv/dxSaowdCd
X-Received: by 2002:a05:600c:a087:b0:483:badb:618b with SMTP id 5b1f17b1804b1-4851988467amr65134905e9.24.1772673594459;
        Wed, 04 Mar 2026 17:19:54 -0800 (PST)
X-Received: by 2002:a05:600c:a087:b0:483:badb:618b with SMTP id 5b1f17b1804b1-4851988467amr65134575e9.24.1772673593875;
        Wed, 04 Mar 2026 17:19:53 -0800 (PST)
Received: from maya.myfinge.rs (ifcgrfdd.trafficplex.cloud. [2a10:fc81:a806:d6a9::1])
        by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-4851fade9fdsm10152095e9.4.2026.03.04.17.19.53
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Wed, 04 Mar 2026 17:19:53 -0800 (PST)
From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Subject: Re: Pesto protocol proposals
Message-ID: <20260305021952.17963c3f@elisabeth>
In-Reply-To: <aae07j0fhcXOFeab@zatzit>
References: <aae07j0fhcXOFeab@zatzit>
Organization: Red Hat
X-Mailer: Claws Mail 4.2.0 (GTK 3.24.49; x86_64-pc-linux-gnu)
MIME-Version: 1.0
Date: Thu, 05 Mar 2026 02:19:53 +0100 (CET)
X-Mimecast-Spam-Score: 0
X-Mimecast-MFC-PROC-ID: bg6vkwzWw4PmrNYIbPPPnu4ntUnX0F-If5ZXETExMOo_1772673595
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Message-ID-Hash: RGLC7DIPOIMRWLKBF7XIKBZR22ISYCA6
X-Message-ID-Hash: RGLC7DIPOIMRWLKBF7XIKBZR22ISYCA6
X-MailFrom: sbrivio@redhat.com
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header
CC: passt-dev@passt.top
X-Mailman-Version: 3.3.8
Precedence: list
List-Id: Development discussion and patches for passt <passt-dev.passt.top>
Archived-At: <https://archives.passt.top/passt-dev/20260305021952.17963c3f@elisabeth/>
Archived-At: <https://passt.top/hyperkitty/list/passt-dev@passt.top/message/RGLC7DIPOIMRWLKBF7XIKBZR22ISYCA6/>
List-Archive: <https://archives.passt.top/passt-dev/>
List-Archive: <https://passt.top/hyperkitty/list/passt-dev@passt.top/>
List-Help: <mailto:passt-dev-request@passt.top?subject=help>
List-Owner: <mailto:passt-dev-owner@passt.top>
List-Post: <mailto:passt-dev@passt.top>
List-Subscribe: <mailto:passt-dev-join@passt.top>
List-Unsubscribe: <mailto:passt-dev-leave@passt.top>

On Wed, 4 Mar 2026 15:28:30 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> Most of today and yesterday I've spent thinking about the dynamic
> update model and protocol.  I certainly don't have all the details
> pinned down, let alone any implementation, but I have come to some
> conclusions.
> 
> # Shadow forward table
> 
> On further consideration, I think this is a bad idea.  To avoid peer
> visible disruption, we don't want to destroy and recreate listening
> sockets

(Side note: if it's just *listening* sockets, is this actually that
bad?)

> that are associated with a forward rule that's not being altered.

After reading the rest of your proposal, as long as:

> Doing that with a shadow table would mean we'd need to essentially
> diff the two tables as we switch.  That seems moderately complex,

...this is the only downside (I can't think of others though), and I
don't think it's *that* complex. As I mentioned, it would be an
O(n * m) step that can probably be optimised (via sorting) to
O(n * log(m)), with n new rules and m old rules, cycling over new rules
and creating listening sockets (we need this part anyway) unless we
find a matching old one (marking it somewhere temporarily)...
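
To make that step concrete, here is a minimal sketch of what I have in
mind. Everything in it is an assumption for illustration only: struct
rule is a simplified stand-in (not the actual forwarding table entry),
and table_switch(), fake_open() and fake_close() are hypothetical names,
with the callbacks standing in for real socket handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

enum { PROTO_TCP = 6, PROTO_UDP = 17 };

/* Hypothetical, simplified rule: not the real table entry */
struct rule {
	uint8_t proto;
	uint16_t port;
	int fd;		/* Listening socket, -1 if not open */
	int keep;	/* Temporary mark: socket taken over by new table */
};

static int rule_cmp(const void *a, const void *b)
{
	const struct rule *x = a, *y = b;

	if (x->proto != y->proto)
		return x->proto < y->proto ? -1 : 1;
	if (x->port != y->port)
		return x->port < y->port ? -1 : 1;
	return 0;
}

/* Reuse sockets of unchanged rules, open new ones, close stale ones.
 * Returns the number of sockets carried over from @old to @new.
 */
static size_t table_switch(struct rule *old, size_t m,
			   struct rule *new, size_t n,
			   int (*open_sock)(const struct rule *),
			   void (*close_sock)(int))
{
	size_t i, reused = 0;

	assert(open_sock && close_sock);

	qsort(old, m, sizeof(*old), rule_cmp);	/* O(m * log(m)), once */
	for (i = 0; i < m; i++)
		old[i].keep = 0;

	for (i = 0; i < n; i++) {	/* n lookups, O(log(m)) each */
		struct rule *match = bsearch(&new[i], old, m, sizeof(*old),
					     rule_cmp);

		if (match && !match->keep && match->fd >= 0) {
			match->keep = 1;	/* Same rule: keep socket */
			new[i].fd = match->fd;
			reused++;
		} else {
			new[i].fd = open_sock(&new[i]);
		}
	}

	for (i = 0; i < m; i++)		/* Old rules nobody took over */
		if (!old[i].keep && old[i].fd >= 0)
			close_sock(old[i].fd);

	return reused;
}

/* Stubs counting calls, standing in for socket(), bind(), close() */
int n_opened, n_closed;

static int fake_open(const struct rule *r)
{
	(void)r;
	return 100 + n_opened++;
}

static void fake_close(int fd)
{
	(void)fd;
	n_closed++;
}

/* Old table: TCP 80, TCP 443, UDP 53. New table: TCP 80, TCP 8080, UDP 53 */
size_t demo_switch(void)
{
	struct rule old[] = {
		{ PROTO_TCP, 443, 4, 0 }, { PROTO_UDP, 53, 5, 0 },
		{ PROTO_TCP, 80, 3, 0 },
	};
	struct rule new[] = {
		{ PROTO_TCP, 80, -1, 0 }, { PROTO_TCP, 8080, -1, 0 },
		{ PROTO_UDP, 53, -1, 0 },
	};

	n_opened = n_closed = 0;
	return table_switch(old, 3, new, 3, fake_open, fake_close);
}
```

In demo_switch(), the sockets for TCP port 80 and UDP port 53 are
carried over untouched, one socket is opened for TCP port 8080, and the
one for TCP port 443 is closed. Matching on more fields (addresses,
interfaces, port ranges) would only change rule_cmp().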

> and
> kind of silly when then client almost certainly have created the
> shadow table using specific adds/removes from the original table.

...even though this is true conceptually, at least at first glance
(why would I send 11 rules to add a single rule to a table of 10?), I
think the other details of the implementation, and conceptual matters
(such as rollback and two-step activation) make this apparent silliness
much less relevant, and I'm more and more convinced that a shadow table
is actually the simplest, most robust, least bug-prone approach.

Especially:

> # Rule states / active bit
> 
> I think we *do* still want two stage activation of new rules:

...this part, which led to a huge number of bugs over the years in nft
/ nftables updates, which also use separate insert / activate / commit
/ deactivate / delete operations.

It's extremely complicated to grasp and implement properly, and you end
up with a lot of quasi-diffing anyway (to check for duplicates in
ranges, for example).

It makes much more sense in nftables because you can have hundreds of
megabytes of data stored in tables, but any usage that was ever
mentioned for passt in the past ~5 years would seem to imply at most
hundreds of kilobytes per table.

Shifting complexity to the client is also a relevant topic for me, as we
decided to have a binary client to avoid anything complicated (parsing)
in the server. A shadow table allows us to shift even more complexity
to the client, which is important for security.

I haven't finished drafting a proposal based on this idea, but I plan to
do it within one day or so.

It won't be as detailed, because I don't think it's realistic to come
up with all the details before writing any of the code (what's the
point if you then have to throw away 70% of it?) but I hope it will be
complete enough to provide a comparison.

By the way, at least as a first approximation, closing and reopening
listening sockets will mostly do the trick for anything our users
(mostly via Podman) will ever reasonably want, so I have half a mind to
keep it like that in a first proposal, but indeed we should make
sure there's a way around it, which is what is taking me a bit more
time to demonstrate.

> [...]
>
> # Suggested client workflow
> 
> I suggest the client should:
> 
>    1. Parse all rule modifications
>    2. INSERT all new rules
>       -> On error, DELETE them again  
>    3. DEACTIVATE all removed rules
>       -> Should only fail if the client has done something wrong  
>    4. ACTIVATE all new rules
>       -> On error (rule conflict):  
>          DEACTIVATE rules we already ACTIVATEd
> 	 ACTIVATE rules we already DEACTIVATEd
> 	 DELETE rules we INSERTed
>    5. Check for bind errors (see details later)
>       If there are failures we can't tolerate:
>          DEACTIVATE rules we already ACTIVATEd
> 	 ACTIVATE rules we already DEACTIVATEd
> 	 DELETE rules we INSERTed
>    6. DELETE rules we DEACTIVATEd
>       -> Should only fail if the client has done something wrong  
> 
> DEACTIVATE comes before ACTIVATE to avoid spurious conflicts between
> new rules and rules we're deleting.
> 
> I think that gets us closeish to "as atomic as we can be", at least
> from the perspective of peers.  The main case it doesn't catch is that
> we don't detect rule conflicts until after we might have removed some
> rules.  Is that good enough?

I think it is absolutely fine as an outcome, but the complexity of error
handling in this case is a bit worrying. This is exactly the kind of
thing (and we discussed it already a couple of times) that made and
makes me think that a shadow table is a better approach.
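
For comparison, this is roughly how I picture error handling with a
shadow table. It's a sketch under assumptions, not a proposal for the
actual interface: struct fwd, fwd_shadow() and fwd_commit() are
hypothetical names, the rule is reduced to protocol and port, and the
only validation shown is a duplicate check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_RULES	8

/* Hypothetical, simplified rule and table */
struct rule {
	uint8_t proto;
	uint16_t port;
};

struct table {
	struct rule r[MAX_RULES];
	size_t count;
};

struct fwd {
	struct table t[2];
	int active;	/* Index of live table, the other one is the shadow */
};

/* Reject tables with too many rules or duplicate proto/port pairs */
static int table_valid(const struct table *t)
{
	size_t i, j;

	if (t->count > MAX_RULES)
		return 0;

	for (i = 0; i < t->count; i++)
		for (j = i + 1; j < t->count; j++)
			if (t->r[i].proto == t->r[j].proto &&
			    t->r[i].port == t->r[j].port)
				return 0;

	return 1;
}

struct table *fwd_shadow(struct fwd *f)
{
	return &f->t[!f->active];
}

/* Single failure path: on error, the live table was never touched */
int fwd_commit(struct fwd *f)
{
	assert(f->active == 0 || f->active == 1);

	if (!table_valid(fwd_shadow(f)))
		return -1;

	f->active = !f->active;
	return 0;
}

/* Add one rule to a table of one: returns live rule count after commit */
int demo_good_update(void)
{
	struct table *shadow;
	struct fwd f;

	memset(&f, 0, sizeof(f));
	f.t[0].r[0] = (struct rule){ 6, 80 };
	f.t[0].count = 1;

	shadow = fwd_shadow(&f);
	*shadow = f.t[f.active];	/* Start from a copy of the live table */
	shadow->r[shadow->count++] = (struct rule){ 6, 443 };

	if (fwd_commit(&f))
		return -1;

	return (int)f.t[f.active].count;
}

/* Conflicting shadow table: returns live rule count after rejection */
int demo_bad_update(void)
{
	struct table *shadow;
	struct fwd f;

	memset(&f, 0, sizeof(f));
	f.t[0].r[0] = (struct rule){ 6, 80 };
	f.t[0].count = 1;

	shadow = fwd_shadow(&f);
	shadow->r[0] = (struct rule){ 6, 80 };
	shadow->r[1] = (struct rule){ 6, 80 };	/* Duplicate */
	shadow->count = 2;

	if (fwd_commit(&f) != -1)
		return -1;

	return (int)f.t[f.active].count;
}
```

The point is that there is nothing to undo in the failure case: a
rejected shadow table is simply discarded, instead of replaying
ACTIVATE, DEACTIVATE and DELETE in reverse. Bind errors would need
handling on top of this, which the sketch leaves out.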

> [...]

-- 
Stefano