Date: Tue, 10 Feb 2026 08:53:22 +1100
From: David Gibson
To: Stefano Brivio
Cc: passt-dev@passt.top
Subject: Re: [PATCH 1/1] migrate: Use forward table information to close() listening sockets
References: <20260130055811.2408284-1-david@gibson.dropbear.id.au>
 <20260130055811.2408284-2-david@gibson.dropbear.id.au>
 <20260131104727.2fbdfaff@elisabeth>
 <20260202231159.44c251bd@elisabeth>
 <20260205011752.65ba41c6@elisabeth>
In-Reply-To: <20260205011752.65ba41c6@elisabeth>

On Thu, Feb 05, 2026 at 01:17:52AM +0100, Stefano Brivio wrote:
> On Wed, 4 Feb 2026 21:57:03 +1000
> David Gibson wrote:
>
> > On Mon, Feb 02, 2026 at 11:12:08PM +0100, Stefano Brivio wrote:
> > > On Mon, 2 Feb 2026 10:24:14 +1000
> > > David Gibson wrote:
> > >
> > > > On Sat, Jan 31, 2026 at 10:47:28AM +0100, Stefano Brivio wrote:
> > > > > On Fri, 30 Jan 2026 16:58:11 +1100
> > > > > David Gibson wrote:
> > > > >
> > > > > > On incoming migrations we need to bind() reconstructed sockets
> > > > > > to their correct local address.  We can't do this if the origin
> > > > > > passt instance is in the same namespace and still has those
> > > > > > addresses bound.  Arguably that's a bug in bind()'s operation
> > > > > > during repair mode, but for now we have to work around it.
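The conflict described here is easy to reproduce outside passt with
nothing but the plain sockets API.  A minimal sketch (error handling
mostly omitted; TCP_REPAIR needs CAP_NET_ADMIN):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
	struct sockaddr_in a = { .sin_family = AF_INET,
				 .sin_port = htons(8080) };
	int one = 1;

	/* "origin": a normal listening socket still holding the address */
	int l = socket(AF_INET, SOCK_STREAM, 0);
	bind(l, (struct sockaddr *)&a, sizeof(a));
	listen(l, 1);

	/* "destination": reconstructing a migrated connection */
	int s = socket(AF_INET, SOCK_STREAM, 0);
	setsockopt(s, SOL_TCP, TCP_REPAIR, &one, sizeof(one));

	/* fails with EADDRINUSE, even though this socket is in repair
	 * mode and will never listen() */
	if (bind(s, (struct sockaddr *)&a, sizeof(a)) < 0)
		perror("bind");

	return 0;
}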
> > > > > > So, to allow local-to-local migrations we close() sockets on
> > > > > > the outgoing side as we process them.  In addition to closing
> > > > > > the connected socket we also have to close the associated
> > > > > > listen()ing socket, because that can also cause an address
> > > > > > conflict.
> > > > > >
> > > > > > To do that, we introduced the listening_sock field in the
> > > > > > connection state, because we had no other way to find the right
> > > > > > listening sockets.  Now that we have the forwarding table, we
> > > > > > have a complete list of listening sockets elsewhere.  We can use
> > > > > > that instead, to close all listening sockets on outbound
> > > > > > migration, rather than just the ones that might conflict.
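The fwd.c half implementing that walk isn't quoted below, so purely as
an illustration of the idea - every name and structure here other than
fwd_listen_sync() is invented for the example, not taken from passt:

#include <stdint.h>
#include <unistd.h>

/* Close every listening socket recorded for a forwarded port.
 * Hypothetical sketch only; see fwd_listen_sync() in the actual
 * patch for the real logic. */
static void close_all_listening(int *listen_fd, const uint8_t *port_map)
{
	unsigned port;

	for (port = 0; port < 65536; port++) {
		if (!(port_map[port / 8] & (1 << (port % 8))))
			continue;	/* port not forwarded */

		if (listen_fd[port] >= 0) {
			close(listen_fd[port]);
			listen_fd[port] = -1;
		}
	}
}

Presumably the same table walk, run in the opposite direction, is what
lets fwd_listen_sync() re-establish the listeners on rollback in the
flow.c hunk below.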
> > > > > > This is cleaner and, importantly, saves a valuable 32 bits in
> > > > > > the flow state structure.  It does mean that there is a longer
> > > > > > window where a peer attempting to connect during migration might
> > > > > > get a Connection Refused.  I think this is an acceptable
> > > > > > trade-off for now: arguably we should not allow local-to-local
> > > > > > migrations in any case, since the socket closes make it
> > > > > > impossible to safely roll back migration as per the qemu model.
> > > > > >
> > > > > > Signed-off-by: David Gibson
> > > > > > ---
> > > > > >  flow.c     | 12 ++++++++++++
> > > > > >  fwd.c      | 21 +++++++++++++++++++++
> > > > > >  fwd.h      |  1 +
> > > > > >  tcp.c      |  9 ---------
> > > > > >  tcp_conn.h |  3 ---
> > > > > >  5 files changed, 34 insertions(+), 12 deletions(-)
> > > > > >
> > > > > > diff --git a/flow.c b/flow.c
> > > > > > index fd4d5f38..5207143d 100644
> > > > > > --- a/flow.c
> > > > > > +++ b/flow.c
> > > > > > @@ -1023,6 +1023,9 @@ static int flow_migrate_source_rollback(struct ctx *c, unsigned bound, int ret)
> > > > > >
> > > > > >  	debug("...roll back migration");
> > > > > >
> > > > > > +	if (fwd_listen_sync(c, &c->tcp.fwd_in, PIF_HOST, IPPROTO_TCP) < 0)
> > > > > > +		die("Failed to re-establish listening sockets");
> > > > > > +
> > > > > >  	foreach_established_tcp_flow(flow) {
> > > > > >  		if (FLOW_IDX(flow) >= bound)
> > > > > >  			break;
> > > > > > @@ -1147,6 +1150,15 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
> > > > >
> > > > > Nit: the comment to this function currently says "Send data (flow
> > > > > table) for flow, close listening".  I fixed that up (dropped
> > > > > ", close listening").
> > > >
> > > > Good point, thanks.
> > > >
> > > > > > 		return flow_migrate_source_rollback(c, FLOW_MAX, rc);
> > > > > > 	}
> > > > > >
> > > > > > +	/* HACK: A local to local migrate will fail if the origin passt has the
> > > > > > +	 * listening sockets still open when the destination passt tries to bind
> > > > > > +	 * them.  This does mean there's a window where we lost our listen()s,
> > > > > > +	 * even if the migration is rolled back later.  The only way to really
> > > > > > +	 * fix that is to not allow local to local migration, which arguably we
> > > > > > +	 * should (use namespaces for testing instead). */
> > > > >
> > > > > Actually, we already use namespaces in the current tests,
> > > >
> > > > Oh, nice.
> > > >
> > > > > but we didn't (always) do that during development, and it might be
> > > > > convenient in general to have the possibility to test *a part* of
> > > > > the implementation using the same namespace, as long as it's
> > > > > reasonably cheap (it seems to be).
> > > >
> > > > Depends what cost you're talking about.  It's cheap in terms of
> > > > computational complexity and code complexity.  It means, however,
> > > > that we can't necessarily roll back failed migrations - i.e. resume
> > > > on the origin system.  That isn't really correct for the qemu
> > > > migration model, which is why I think allowing local migrations
> > > > probably isn't the best idea, at least by default.
> > >
> > > Well, it's not entirely true that we can't roll back failed
> > > migrations.
> > >
> > > Depending on the stage we're at, we might close connections, but
> > > other than that we can always continue, somehow.  Losing some
> > > connections looks relatively cheap as well.
> >
> > I guess that's true.  I'm not sure we currently handle this even as
> > well as is possible within the constraints.
>
> I actually checked a while ago, nothing unexpected happens.
>
> > > But sure, it's not ideal.  Maybe, yes, we shouldn't have that as the
> > > default: add a new option and update QEMU's documentation (see below)
> > > with it.
> >
> > There are two layers to it.  Dropping connections on what's otherwise
> > a no-op migration failure is ugly.  Opening the possibility that we
> > might not be able to rebind ports we were already listening on is
> > worse.
>
> But that would only happen if somebody starts an unrelated process
> binding the same ports, right?  I'm not sure if it's something we
> really need to take care of.

Maybe.  Actually, arguably more important than the possibility of
losing the port is that there may be an interval where the virtual
server will reject connections, making the migration not really
seamless from the point of view of peers (which is, after all, the
point of live migration).

> > > > > That's just a part of it, because bind() and connect() will
> > > > > conflict anyway if we're in the same namespace, which is a kernel
> > > > > issue you already noted:
> > > >
> > > > Well, it's a kernel issue that the bound listen()ing sockets
> > > > conflict with the half-constructed flow sockets.  Having the
> > > > listening sockets of the origin passt conflict with the listening
> > > > sockets of the destination passt is pretty much expected, and would
> > > > still be an impediment to local migration.
> > >
> > > Ah, right, we don't set listening sockets to repair mode... should
> > > we, by the way?
> >
> > Uh.. I don't think so?  I'm not really sure how repair mode operates
> > for listening sockets, or if it should even allow that.  The relevant
> > state of a listening socket is pretty much limited to its bound
> > address, so I don't think we need any additional mechanism to extract
> > state.
>
> I wasn't suggesting that for any functional purpose.  Repair mode has
> no meaning there.
>
> But it would have the advantage, if we fix/change this in the kernel:
>
> https://pad.passt.top/p/TcpRepairTodo#L3
> 	Repair mode sockets should not have address conflicts with
> 	non-repair sockets (both bind() and connect())
>
> that bind() calls wouldn't conflict with bound sockets in repair mode,
> and let us test a bit more of local migration.

I guess.  The key difference with listening sockets is that in a sense
they really do conflict, or at least really will.

The problem with address conflicts between listening sockets and
connected sockets is that the conflict is reported when we've bound,
but not yet connected, the socket we're reconstructing.  The conflict
will go away once the socket state is fully restored, but we can't get
there without hitting an error, because we must bind() before we
connect().
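Schematically, the sequence on the reconstructing side looks something
like this (a sketch of the ordering problem only, not passt's actual
code; real restoration involves much more state):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int reconstruct(const struct sockaddr_in *local,
		const struct sockaddr_in *peer)
{
	int s = socket(AF_INET, SOCK_STREAM, 0);
	int on = 1, off = 0;

	setsockopt(s, SOL_TCP, TCP_REPAIR, &on, sizeof(on));

	/* the conflict is reported here, on a half-built socket... */
	if (bind(s, (const struct sockaddr *)local, sizeof(*local)) < 0)
		return -1;		/* EADDRINUSE */

	/* ...but in repair mode connect() only records the peer, with no
	 * handshake, and a fully connected socket no longer conflicts
	 * with a listener on the same local address */
	connect(s, (const struct sockaddr *)peer, sizeof(*peer));

	/* (restore sequence numbers, queues, window and options here) */

	/* leaving repair mode arms the socket */
	setsockopt(s, SOL_TCP, TCP_REPAIR, &off, sizeof(off));
	return s;
}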
That's the specific problem I was looking to address by not caring
about conflicts between repair mode and non-repair-mode sockets.

> On the other hand, I'm not sure what should happen once you bring a
> socket, which was originally bound to a given port, out of repair
> mode, once there's a new socket bound to it.

Right.  This is potentially an issue with connect()ed sockets, too.

I think we have to fail the exit from repair mode if there's an
address conflict.  That introduces a new, somewhat awkward case, but I
can't see how else to address the ephemeral address conflict.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson