From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au Authentication-Results: passt.top; dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202506 header.b=he7LZi5B; dkim-atps=neutral Received: from mail.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id B69425A0282 for ; Wed, 23 Jul 2025 02:27:50 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202506; t=1753230302; bh=LoumJcLFLXqQr+rftxROur5XLJ+y2Nkh/lr1uH1qky8=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=he7LZi5B/pf3E6F2PaWQ4tLFkGpY1ewmRhagdvZsurbaxzkNWXO48LNan/gu0TmOD 70dHrD2GsvzZiNbxcTHH4lhufxiOoUSTpVr4zu6pcczB4T10jgrBIL4xKjQE6pKB+b 6unqXIvFrwGvGDPlO5fOw5xZjDBpIYWLGYZh2uod2H2WIDiaQyjOlmBabydQVijyj2 yNXs2cO2DN36S/0pOYrw/d5b92tc0Eem1HN/pVl+dmFAjm25eMXSSn4efruEOLEBlE CqalL/PdLURB/TMn2LyHtJF0OAEHHruC0M0MNydmHTcjf1q4NEKoMblTqNfWk9B2m4 YW16VdhgmUhNA== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4bmvzL5Xt2z4x3p; Wed, 23 Jul 2025 10:25:02 +1000 (AEST) Date: Wed, 23 Jul 2025 10:27:30 +1000 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH] treewide: By default, don't quit source after migration, keep sockets open Message-ID: References: <20250721221258.2874863-1-sbrivio@redhat.com> <20250722231249.22e09340@elisabeth> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="MaE6mi66UbGRyhLx" Content-Disposition: inline In-Reply-To: <20250722231249.22e09340@elisabeth> Message-ID-Hash: PCGSBOBQ5P7MELECQJP3BNY4N25MJLQ4 X-Message-ID-Hash: PCGSBOBQ5P7MELECQJP3BNY4N25MJLQ4 X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, Nir Dothan X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --MaE6mi66UbGRyhLx Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Tue, Jul 22, 2025 at 11:12:49PM +0200, Stefano Brivio wrote: > On Tue, 22 Jul 2025 10:33:00 +1000 > David Gibson wrote: >=20 > > On Tue, Jul 22, 2025 at 12:12:58AM +0200, Stefano Brivio wrote: > > > We are hitting an issue in the KubeVirt integration where some data is > > > still sent to the source instance even after migration is complete. As > > > we exit, the kernel closes our sockets and resets connections. The > > > resulting RST segments are sent to peers, effectively terminating > > > connections that were meanwhile migrated. > > >=20 > > > At the moment, this is not done intentionally, but in the future > > > KubeVirt might enable OVN-Kubernetes features where source and > > > destination nodes are explicitly getting mirrored traffic for a while, > > > in order to decrease migration downtime. > > >=20 > > > By default, don't quit after migration is completed on the source: the > > > previous behaviour can be enabled with the new --migrate-exit > > > option. =20 > >=20 > > Might be worth explicitly mentioning that the reason this helps is > > that having the sockets still around in repair mode prevents the > > source node from responding to any traffic for them. >=20 > Added as separate paragraph, below. Thanks. [snip] > > > diff --git a/conf.c b/conf.c > > > index 6c747aa..5e69014 100644 > > > --- a/conf.c > > > +++ b/conf.c > > > @@ -864,6 +864,12 @@ static void usage(const char *name, FILE *f, int= status) > > > FPRINTF(f, > > > " --repair-path PATH path for passt-repair(1)\n" > > > " default: append '.repair' to UNIX domain path\n"); > > > + FPRINTF(f, > > > + " --migrate-exit source quits after migration\n" > > > + " default: source keeps running after migration\n"); > > > + FPRINTF(f, > > > + " --migrate-no-linger close sockets on migration\n" > > > + " default: keep sockets open, ignore data events\n"); =20 > >=20 > > I'm a bit uncomfortable adding this as regular, supported options. > > I'm hoping we can eliminate them in the not too distant future. In > > the medium term we can do that by improving our test code to simulate > > different hosts with namespaces. Longer term we can allow local to > > local migration without these options by fixing the kernel: I don't > > believe bind() should fail if the address conflict is with a socket in > > repair mode. >=20 > I see the point about --migrate-no-linger, but --migrate-exit looks > like something that depends on the migration workflow, not necessarily > something to cover up issues in our tests or in the kernel. In any > case... It does depend on the workflow, but I think this needs to be managed by qemu (or whatever upper level component is managing the overall migration). --migrate-exit doesn't make us exit once the migration is complete - it makes us exit once *our part of* the migration is complete, the migration could still fail after that. So in the happy path, yes, we'll exit, but it should be because qemu or libvirt or whatever is satisfied the migration is finished and kills us off > > Further, as user accessible options these are kinda weird - what they > > do is defined in terms of some pretty obscure internals, not in terms > > of what the upshot is. So, I think it would be better to introduce > > these as deprecated / debug-only / undocumented options. >=20 > ...I marked both as deprecated in v2, to signal to any hypothetical > (I'm fairly sure not real) user that they shouldn't assume it's safe to > rely on them. >=20 > That's the only other category of options we have so far, we don't have > undocumented ones, and I think it's a feature. Especially as we use > them in the tests. Works for me. > By the way, I would argue that testing is an important type of usage, > and changes in the test setup itself or in the kernel are non-trivial > effort, and it's rather uncertain whether we'll ever manage to commit > to that, so I don't quite feel that marking them as deprecated is > entirely warranted, but I see the point about possibly dropping that > migration model at some point in the future. >=20 > Regardless, the fact we *need* them at the moment should prove that > they are at least convenient, even if just for testing. Fair enough. [snip] > > > diff --git a/passt.c b/passt.c > > > index 388d10f..f4c108f 100644 > > > --- a/passt.c > > > +++ b/passt.c > > > @@ -308,6 +308,9 @@ loop: > > > c.mode =3D=3D MODE_PASTA ? "pasta" : "passt", > > > EPOLL_TYPE_STR(ref.type), ref.fd, eventmask); > > > =20 > > > + if (c.ignore_data_events && ref.type <=3D EPOLL_TYPE_DATA_MAX) > > > + continue; > > > + =20 > >=20 > > This worries me slightly. I think it will be fine if passt is killed > > not long after the migration. However, if passt lingers, we're > > ignoring, but not clearing events. For those events which are > > level-triggered this could make us chew CPU on a tight epoll_wait() > > loop. >=20 > Ah, yes, I had that concern as well, but I conveniently forgot about it > at some point. >=20 > > I think a preferable approach would be to EPOLL_CTL_DEL each socket at > > the point we would close() them with --migrate-no-linger. I'd be fine > > with that as a follow up improvement, though. >=20 > I was pondering about adding this on top of the ignore_data_events > trick, but, actually, the whole thing as of this patch is somewhat > bogus because >=20 > - we're ignoring events on TCP sockets (intentional), >=20 > - we're ignoring events on the tap device (who cares, migration is only > supported with vhost-user) >=20 > - but *not* ignoring events on the vhost-user kick descriptor > (oversight). >=20 > On a second thought, it doesn't look safe to ignore events on the kick > descriptor, and in any case, with this change, we don't want to prevent > the guest to send out further packets. It's not expected anyway. >=20 > So I just replaced the whole thing with EPOLL_CTL_DEL (epoll_del()) as > we go through the sockets. It's simpler and arguably safer. Yes, that's what I had in mind as well (I thought I put that in the mail, but it looks like I didn't). Just one additional concern that I don't think need hold up merge: do we also need to epoll_del() our listening sockets? --=20 David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson --MaE6mi66UbGRyhLx Content-Type: application/pgp-signature; name=signature.asc -----BEGIN PGP SIGNATURE----- iQIzBAEBCgAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmiALHEACgkQzQJF27ox 2GfNuw//b7KXzls45r9S96ivahBHbkorwi7UReEgep3q2DTxOXjmz2Wf8msPw5VQ D5X+Ja17eb9dP8T1WMIduzulAj8DZ5AJAlNoZbYpyPdEYGfXH65Taau22XTEy44G Uo0Nn5USM5O5VexWIVpriC45eox58hjqUVdzqXzZj7Urmu5HuZlJm36lYJ2aGKtj 4mNc9Py3BGaC+bAGesn2wV8qq99XCPlKJHCeZWxH8s89oPfeLy0B+vIRDpoLUf/w lhfGdHHIIhkjOAreXcYkpxv4Yxh0RuF+hwxExA/snOzs3QSSipArpFvMExKvRLpn dcHc4n4zL1btuh9B/IyoRPeb11iQcblLl9kQshhl0juzUb31xckQ1RRqwRo0I/GH seuy8/AbUWpi/AcD/bH2smZ2TJqYq7rbPAwH1WaBP7D7ZwDro/rrwys0JAPdNWye DSsOf9bZfQefyr6eTBywG4b6KQ1uEvaiyRvqcOBARrYHdzxQHHYrgmo2HSMY58q3 1+EVtbfykkX7DaBJlHKZK/pl9cwNqPnCrdOzSIlDCA20ySvS7MoMGOZtVrIqPpCt EnH+fO8jW2RczkL1pp1/DGsRY8qgR4nNj21ycuN8fLqQzAzY0812mrPbvXU9DSky yehfqG1v4nNqDvcV9NVrgTH7a4LNlV4xcDxeQo5pYgzsHN2+Y3Y= =gdhY -----END PGP SIGNATURE----- --MaE6mi66UbGRyhLx--