From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au Authentication-Results: passt.top; dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202512 header.b=iEFFkrlO; dkim-atps=neutral Received: from mail.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id 87B7B5A0652 for ; Wed, 04 Feb 2026 13:05:39 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202512; t=1770206737; bh=GMzHUP5ZZFkr/ZHKmUE/f/+LXkNwH3FqKzAfJu9HbjY=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=iEFFkrlODdBIab7T8MiBOZQEQaeIIxipATZ2Dkfu26ILdNqhqOjnHXoV5SjIpiGso hIkRBvErOn0Bw7p2+Zyy539m+5MUqG+DMteV/QFQa4phLNPcgTtZU0cRW1IHCCaDr2 Fs5wVctbB8qShjUC8tfAcxD/UaAZ+Hd+2rhn2fHeui04UgsfS69eWmk/vdPnHqlTXT 5SRJYA7JdWJQl7FJ9yyC6JmViqdrTN4sGWYdynbsk9k1zKUosZBBaDR/XYFsx+uY/m BjD/PeT/TVlkFUTP0SB8Ciz2BMWRpjbZia41qoH1RCppxrqGharZFHa7NBrgeVKbqA nSqY5lh0rTd+Q== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4f5fFF2VZvz4w91; Wed, 04 Feb 2026 23:05:37 +1100 (AEDT) Date: Wed, 4 Feb 2026 21:57:03 +1000 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH 1/1] migrate: Use forward table information to close() listening sockets Message-ID: References: <20260130055811.2408284-1-david@gibson.dropbear.id.au> <20260130055811.2408284-2-david@gibson.dropbear.id.au> <20260131104727.2fbdfaff@elisabeth> <20260202231159.44c251bd@elisabeth> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="uFxq4N1vPErXgZGU" Content-Disposition: inline In-Reply-To: <20260202231159.44c251bd@elisabeth> Message-ID-Hash: LQSUTSRCJMIGLK2NDMDAUHNT55TDPE2R X-Message-ID-Hash: LQSUTSRCJMIGLK2NDMDAUHNT55TDPE2R X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --uFxq4N1vPErXgZGU Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Mon, Feb 02, 2026 at 11:12:08PM +0100, Stefano Brivio wrote: > On Mon, 2 Feb 2026 10:24:14 +1000 > David Gibson wrote: >=20 > > On Sat, Jan 31, 2026 at 10:47:28AM +0100, Stefano Brivio wrote: > > > On Fri, 30 Jan 2026 16:58:11 +1100 > > > David Gibson wrote: > > > =20 > > > > On incoming migrations we need to bind() reconstructed sockets to t= heir > > > > correct local address. We can't do this if the origin passt instan= ce is > > > > in the same namespace and still has those addresses bound. Arguabl= y that's > > > > a bug in bind()s operation during repair mode, but for now we have = to work > > > > around it. > > > >=20 > > > > So, to allow local-to-local migrations we close() sockets on the ou= tgoing > > > > side as we process them. In addition to closing the connected sock= et we > > > > also have to close the associated listen()ing socket, because that = can also > > > > cause an address conflict. > > > >=20 > > > > To do that, we introduced the listening_sock field in the connection > > > > state, because we had no other way to find the right listening sock= ets. > > > > Now that we have the forwarding table, we have a complete list of > > > > listening sockets elsewhere. We can use that instead, to close all > > > > listening sockets on outbound migration, rather than just the ones = that > > > > might conflict. > > > >=20 > > > > This is cleaner and, importantly, saves a valuable 32-bits in the f= low > > > > state structure. It does mean that there is a longer window where = a peer > > > > attempting to connect during migration might get a Connection Refus= ed. > > > > I think this is an acceptable trade-off for now: arguably we should= not > > > > allow local-to-local migrations in any case, since the socket close= s make > > > > it impossible to safely roll back migration as per the qemu model. > > > >=20 > > > > Signed-off-by: David Gibson > > > > --- > > > > flow.c | 12 ++++++++++++ > > > > fwd.c | 21 +++++++++++++++++++++ > > > > fwd.h | 1 + > > > > tcp.c | 9 --------- > > > > tcp_conn.h | 3 --- > > > > 5 files changed, 34 insertions(+), 12 deletions(-) > > > >=20 > > > > diff --git a/flow.c b/flow.c > > > > index fd4d5f38..5207143d 100644 > > > > --- a/flow.c > > > > +++ b/flow.c > > > > @@ -1023,6 +1023,9 @@ static int flow_migrate_source_rollback(struc= t ctx *c, unsigned bound, int ret) > > > > =20 > > > > debug("...roll back migration"); > > > > =20 > > > > + if (fwd_listen_sync(c, &c->tcp.fwd_in, PIF_HOST, IPPROTO_TCP) < 0) > > > > + die("Failed to re-establish listening sockets"); > > > > + > > > > foreach_established_tcp_flow(flow) { > > > > if (FLOW_IDX(flow) >=3D bound) > > > > break; > > > > @@ -1147,6 +1150,15 @@ int flow_migrate_source(struct ctx *c, const= struct migrate_stage *stage, =20 > > >=20 > > > Nit: the comment to this function currently says "Send data (flow > > > table) for flow, close listening". I fixed that up (dropped ", close = listening"). =20 > >=20 > > Good point, thanks. > >=20 > > > > return flow_migrate_source_rollback(c, FLOW_MAX, rc); > > > > } > > > > =20 > > > > + /* HACK: A local to local migrate will fail if the origin passt h= as the > > > > + * listening sockets still open when the destination passt tries = to bind > > > > + * them. This does mean there's a window where we lost our liste= n()s, > > > > + * even if the migration is rolled back later. The only way to r= eally > > > > + * fix that is to not allow local to local migration, which argua= bly we > > > > + * should (use namespaces for testing instead). */ =20 > > >=20 > > > Actually, we already use namespaces in the current tests, =20 > >=20 > > Oh, nice. > >=20 > > > but we didn't > > > (always) do that during development, and it might be convenient in > > > general to have the possibility to test *a part* of the implementation > > > using the same namespace as long as it's reasonably cheap (it seems to > > > be). =20 > >=20 > > Depends what cost you're talking about. It's cheap in terms of > > computational complexity, and code compexity. It means, however, that > > we can't necessarily roll back failed migrations - i.e. resume on the > > origin system. That isn't really correct for the qemu migration > > model, which is why I think allowing local migrations probably isn't > > the best idea, at least by default. >=20 > Well it's not entirely true that we can't roll back failed migrations. >=20 > Depending on the stage we're at, we might close connections, but other > than that we can always continue, somehow. Losing some connections > looks relatively cheap as well. I guess that's true. I'm not sure we currently handle this even as well as is possible within the constraints. > But sure, it's not ideal. Maybe yes, we shouldn't have that as default, > add a new option and update QEMU's documentation (see below) with > it. There's two layers to it. Dropping connections on what's otherwise a no-op migration failure is ugly. Opening the possibility that we might not be able to rebind ports we were already listening on is worse. > > > That's just a part because anyway bind() and connect() will conflict, > > > if we're in the same namespace, which is a kernel issue you already > > > noted: =20 > >=20 > > Well, it's a kernel issue that the bound listen()ing sockets conflict > > with the half-constructed flow sockets. Having the listening sockets > > of the origin passt conflict with the listening sockets of the > > destination passt is pretty much expected, and would still be an > > impediment to local migration. >=20 > Ah, right, we don't set listening sockets to repair mode... should we, > by the way? Uh.. I don't think so? I'm not really sure how repair mode operates for listening sockets, or if it should even allow that. The relevant state of a listening socket is pretty much limited to its bound address, so I don't think we need any additional mechanism to extract state. > With the fix for freezing incoming TCP queues I'm working > on, as a side effect, that would also mean that the kernel will ignore > SYN segments altogether, which is desirable I think. Maybe it already > happens for some other reason...? >=20 > But anyway, without forwarded ports you don't have listening sockets. True. > > > https://pad.passt.top/p/TcpRepairTodo#L3 > > > Repair mode sockets should not have address conflicts with non-repa= ir > > > sockets (both bind() and connect()) > > >=20 > > > but even that part is convenient to have, I think, =20 > >=20 > > I'm not really sure what you mean by that. >=20 > That we can still *partially* test local migration, without forwarded > ports. Ah, ok. > We can check the interface, including passt-repair and all the > vhost-user bits without actual connections, in a simpler way compared > to namespaces. >=20 > See: >=20 > https://www.qemu.org/docs/master/system/devices/net.html#example-of-mig= ration-of-a-guest-on-the-same-host >=20 > and: >=20 > https://lore.kernel.org/qemu-devel/20260131032700.12f27487@elisabeth/ >=20 > it won't work for any practical case, but one can still test a big > chunk of functionality that way. Good point. --=20 David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson --uFxq4N1vPErXgZGU Content-Type: application/pgp-signature; name=signature.asc -----BEGIN PGP SIGNATURE----- iQIzBAEBCgAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmmDM+gACgkQzQJF27ox 2GevTRAAkSSvF7Pr8F5WBIeyOLO7hwN6ok9gVIYIb2mNVQM6luacWJAA3fJo/2Qf YLFGC+LL0aktbnu8M8WFhFqYxEewycABtoFXfsy93UZxgc8SO1Nw4WRB2RWjQ9Pj 3gznb097hqYSIaC5whAo4zC1YmvCmkJ/ynW2YP/P+lYA85H8pW3uhoqUhYPfQbe3 5vjVwcRoSLxhSH+r89vBWnZzAUwP2rYkC9fdTb1fh5IiqmW3CZmj3I21qmHx8mLn lewzMnDwYAx3U1Q/tDv2Z8ThcKqVeAngP1vfML1QRExxXHfd1HnOf4F2wDTYGWNr VtODTtcxa86y/0lvsZpEngM9ebbVBvCD9Isfw094UKoSt8+ahKgts+IsFwfM5ES+ uNASrhkf84GdZGYUXwRxlqiBCZQ7b9TNEV5E0LqZ2VvMew8IDP7NlEY5iawOkUIR h+mu50z3TkXvU4I3qjWh/+kcRcIhmXlBXbLWmMXMIgt6h95LcAymQKBRL3a/hb+a 6RhvR1qBuKoXMgS3YheFItAWn5T2uTtI4PeZM8T0CG0W2JK8Kl8mksQgTzRoXUkJ QTdvxB506TK8R/5L84VRsDHcGTIkos18/hbWyHDeT8gPyelggSHxyQZx6U2U71SD B9Uz4T2IGk49Vs5rfeYxGKl/+oNq0A0PRQNTGglTwpjhjzsD1oY= =byo7 -----END PGP SIGNATURE----- --uFxq4N1vPErXgZGU--