Date: Thu, 27 Feb 2025 12:43:41 +1100
From: David Gibson
To: Stefano Brivio
Cc: passt-dev@passt.top
Subject: Re: [PATCH v2 0/2] More graceful handling of migration without passt-repair
In-Reply-To: <20250226122412.33009f77@elisabeth>
References: <20250225055132.3677190-1-david@gibson.dropbear.id.au>
 <20250225184316.407247f4@elisabeth>
 <20250226090948.3d1fff91@elisabeth>
 <20250226122412.33009f77@elisabeth>

On Wed, Feb 26, 2025 at 12:24:12PM +0100, Stefano Brivio wrote:
> On Wed, 26 Feb 2025 19:51:11 +1100
> David Gibson wrote:
> 
> > On Wed, Feb 26, 2025 at 09:09:48AM +0100, Stefano Brivio wrote:
> > > On Wed, 26 Feb 2025 11:27:32 +1100
> > > David Gibson wrote:
> > > 
> > > > On Tue, Feb 25, 2025 at 06:43:16PM +0100, Stefano Brivio wrote:
> > > > > On Tue, 25 Feb 2025 16:51:30 +1100
> > > > > David Gibson wrote:
> > > > > 
> > > > > > From Red Hat internal testing we've had some reports that if
> > > > > > attempting to migrate without passt-repair, the failure mode is uglier
> > > > > > than we'd like. The migration fails, which is somewhat expected, but
> > > > > > we don't correctly roll things back on the source, so it breaks
> > > > > > network there as well.
> > > > > > 
> > > > > > Handle this more gracefully, allowing the migration to proceed in this
> > > > > > case but allowing TCP connections to break.
> > > > > > 
> > > > > > I've now tested this reasonably:
> > > > > > * I get a clean migration if there are no active flows
> > > > > > * Migration completes, although connections are broken if
> > > > > >   passt-repair isn't connected
> > > > > > * Basic test suite (minus perf)
> > > > > > 
> > > > > > I didn't manage to test with libvirt yet, but I'm pretty convinced the
> > > > > > behaviour should be better than it was.
> > > > > 
> > > > > I did, and it is. The series looks good to me and I would apply it as
> > > > > it is, but I'm waiting a bit longer in case you want to try out some
> > > > > variations based on my tests as well. Here's what I did.
> > > > 
> > > > [snip]
> > > > 
> > > > Thanks for the detailed instructions. More complex than I might have
> > > > liked, but oh well.
> > > > 
> > > > >   $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session
> > > > >   Migration: [97.59 %]error: End of file while reading data: : Input/output error
> > > > > 
> > > > > ...despite --verbose the error doesn't tell much (perhaps I need
> > > > > LIBVIRT_DEBUG=1 instead?), but passt terminates at this point. With
> > > > > this series (I just used 'make install' from the local build), migration
> > > > > succeeds instead:
> > > > > 
> > > > >   $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session
> > > > >   Migration: [100.00 %]
> > > > > 
> > > > > Now, on the target, I still have to figure out how to tell libvirt
> > > > > to start QEMU and prepare for the migration (equivalent of '-incoming'
> > > > > as we use in our tests), instead of just starting a new instance like
> > > > > it does. Otherwise, I have no chance to start passt-repair there.
> > > > > Perhaps it has something to do with persistent mode described here:
> > > > 
> > > > Ah. So I'm pretty sure virsh migrate will automatically start qemu
> > > > with --incoming on the target.
> > > 
> > > ("-incoming"), yes, see src/qemu/qemu_migration.c,
> > > qemuMigrationDstPrepare().
> > > 
> > > > IIUC the problem here is more about
> > > > timing: we want it to start it early, so that we have a chance to
> > > > start passt-repair and let it connect before the migration actually
> > > > happens.
> > > 
> > > For the timing itself, we could actually wait for passt-repair to be
> > > there, with a timeout (say, 100ms).
> > 
> > I guess. That still requires some way for KubeVirt (or whatever) to
> > know at least roughly when it needs to launch passt-repair, and I'm
> > not sure if that's something we can currently get from libvirt.
> 
> KubeVirt sets up the target pod, and that's when it should be done (if
> we have an inotify mechanism or similar). I can't point to an exact code
> path yet but there's something like that.

Right, but that approach does require inotify and starting passt-repair
before passt, which might be fine, but I have the concern noted below.

To avoid that we'd need notification after passt & qemu are started on
the target, but before the migration is actually initiated, which I
don't think libvirt provides.

> > > We could also modify passt-repair to set up an inotify watcher if the
> > > socket isn't there yet.
> > 
> > Maybe, yes.  This kind of breaks our "passt starts first, passt-repair
> > connects to it" model though, and I wonder if we need to revisit the
> > security implications of that.
> 
> I don't think it actually breaks that model for security purposes,
> because the guest doesn't have any chance to cause a connection to
> passt-repair. The guest is still suspended (or missing) at that point.

I wasn't thinking of threat models coming from the guest, but an attack
from an unrelated process impersonating passt in order to access
passt-repair's superpowers.
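For reference, the watcher itself wouldn't need to be anything fancier
than the usual directory watch with a short timeout. A rough, untested
sketch (helper name and calling convention made up for illustration):

/* Untested sketch: wait up to timeout_ms for the socket to appear in
 * dir, using inotify on the parent directory. Illustration only, not
 * the actual passt-repair code.
 */
#include <errno.h>
#include <limits.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/inotify.h>
#include <sys/stat.h>
#include <unistd.h>

static int wait_for_socket(const char *dir, const char *name, int timeout_ms)
{
	char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
	char path[PATH_MAX];
	struct pollfd pfd;
	struct stat sb;
	int fd;

	fd = inotify_init1(IN_NONBLOCK | IN_CLOEXEC);
	if (fd < 0)
		return -errno;

	if (inotify_add_watch(fd, dir, IN_CREATE) < 0) {
		int err = -errno;

		close(fd);
		return err;
	}

	/* Check only after arming the watch, to close the startup race */
	snprintf(path, sizeof(path), "%s/%s", dir, name);
	if (!stat(path, &sb)) {
		close(fd);
		return 0;
	}

	pfd.fd = fd;
	pfd.events = POLLIN;
	while (poll(&pfd, 1, timeout_ms) > 0) {	/* total wait is approximate */
		ssize_t n = read(fd, buf, sizeof(buf)), off = 0;

		while (off < n) {
			const struct inotify_event *ev = (void *)(buf + off);

			if (ev->len && !strcmp(ev->name, name)) {
				close(fd);
				return 0;
			}
			off += sizeof(*ev) + ev->len;
		}
	}

	close(fd);
	return -ETIMEDOUT;
}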
> > > > Crud... I didn't think of this before. I don't know that there's any
> > > > sensible way to do this without having libvirt managing passt-repair
> > > > as well.
> > > 
> > > But we can't really use it as we're assuming that passt-repair will run
> > > with capabilities virtqemud doesn't want/need.
> > 
> > Oh. True.
> > 
> > > > I mean it's not impossible there's some option to do this,
> > > > but I doubt there's been any reason before for something outside of
> > > > libvirt to control the timing of the target qemu's creation. I think
> > > > we need to ask libvirt people about this.
> > > 
> > > I'm looking into it (and perhaps virtiofsd had similar troubles?).
> > 
> > I'm guessing libvirt already knows how to start virtiofsd - just as it
> > already knows how to start passt, just not passt-repair.
> > 
> > > > > https://libvirt.org/migration.html#configuration-file-handling
> > > > 
> > > > Yeah.. I don't think this is relevant.
> > > > 
> > > > > and --listen-address, but I'm not quite sure yet.
> > > > > 
> > > > > That is, I could only test different failures (early one on source, or
> > > > > later one on target) with this, not a complete successful migration.
> > > > > 
> > > > > > There are more fragile cases that I'm looking to fix, particularly the
> > > > > > die()s in flow_migrate_source_rollback() and elsewhere, however I ran
> > > > > > into various complications that I didn't manage to sort out today.
> > > > > > I'll continue looking at those tomorrow. I'm now pretty confident
> > > > > > that those additional fixes won't entirely supersede the changes in
> > > > > > this series, so it should be fine to apply these on their own.
> > > > > 
> > > > > By the way, I think the somewhat less fragile/more obvious case where
> > > > > we fail clumsily is when the target doesn't have the same address as
> > > > > the source (among other possible addresses). In that case, we fail (and
> > > > > terminate) with a rather awkward:
> > > > 
> > > > Ah, yes, that is a higher priority fragile case.
> > > > 
> > > > >   93.7217: ERROR: Failed to bind socket for migrated flow: Cannot assign requested address
> > > > >   93.7218: ERROR: Flow 0 (TCP connection): Can't set up socket: (null), drop
> > > > >   93.7331: ERROR: Selecting TCP_SEND_QUEUE, socket 1: Socket operation on non-socket
> > > > >   93.7333: ERROR: Unexpected reply from TCP_REPAIR helper: -100
> > > > > 
> > > > > that's because, oops, I only took care of socket() failures in
> > > > > tcp_flow_repair_socket(), but not bind() failures (!). Sorry.
> > > > 
> > > > No, you check for errors on both.
> > > 
> > > Well, "check", yes, but I'm not even setting an error code. I haven't
> > > tried your 3/3 yet but look at "(null)" resulting from:
> > > 
> > >   flow_err(flow, "Can't set up socket: %s, drop", strerror_(rc));
> > > 
> > > ...rc is 0.
> > 
> > -1, not 0, otherwise we wouldn't enter that if clause at all.
> 
> Ah, oops, right.
> 
> > But,
> > still, out of bounds for strerror(). I did spot that bug -
> > tcp_flow_repair_socket() is directly passing on the return code from
> > bind(), whereas it should be returning -errno.
> > 
> > So, two bugs actually: 1) in the existing code we should return -errno,
> > not rc, if bind() fails, 2) in my 3/3 it should be calling strerror()
> > on -rc, not rc.
> 
> Right, yes, 1) is what I meant.

I'll fix both of these for the next spin.
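Spelled out, the two fixes amount to something like this - an
illustrative sketch with made-up names, not the actual passt functions:

/* Sketch only: the point is returning -errno from the bind() failure
 * path, and feeding strerror() a positive errno value (-rc) at the
 * reporting site.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Fix 1: propagate -errno, not bind()'s raw -1 return value */
static int repair_bind(int s, const struct sockaddr *addr, socklen_t len)
{
	if (bind(s, addr, len))
		return -errno;

	return 0;
}

/* Fix 2: rc is negative on failure, so strerror() wants -rc */
static void report_failure(int rc)
{
	if (rc < 0)
		fprintf(stderr, "Can't set up socket: %s, drop\n", strerror(-rc));
}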
> > > > The problem is that in
> > > > tcp_flow_migrate_target() we cancel the flow allocation and carry on -
> > > > but the source will still send information for this flow, putting us
> > > > out of sync with the stream.
> > > 
> > > That, too, yes.
> > > 
> > > > > Once that's fixed, flow_migrate_target() should also take care of
> > > > > decreasing 'count' accordingly. I just had a glimpse but didn't
> > > > > really try to sketch a fix.
> > > > 
> > > > Adjusting count won't do the job. Instead we'd need to keep the flow
> > > > around, but marked as "dead" somehow, so that we read but discard the
> > > > incoming information for it. The MIGRATING state I added in one of my
> > > > drafts was supposed to help with this sort of thing. But that's quite
> > > > a complex change.
> > > 
> > > I think it's great that you could (practically) solve it with three
> > > lines...
> > 
> > Yeah, I sent that email at the beginning of my day, by the end I'd
> > come up with the simpler approach.
> > 
> > > > Hrm... at least in the near term, I think it might actually be easier
> > > > to set IP_FREEBIND when we create sockets for in-migrating flows.
> > > > That way we can process them normally, they just won't do much without
> > > > the address set. It has the additional advantage that it should work
> > > > if the higher layers only move the IP just after the migration,
> > > > instead of in advance.
> > > 
> > > Perhaps we want it anyway, but I wonder:
> > 
> > Right, I'm no longer considering this as a short term solution, since
> > checking for fd < 0 I think works better for the immediate problems.
> > 
> > > what happens if we turn repair
> > > mode off and we bound to a non-local address? I suppose we won't send
> > > out anything, but I'm not sure. If we send out the first keep-alive
> > > segment with a wrong address, we probably ruined the connection.
> > 
> > That's a good point. More specifically, I think IP_FREEBIND is
> > generally used for listen()ing sockets, I'm guessing you'll get an
> > error if you try to connect() a socket that's bound to a non-local
> > address. It's possible TCP_REPAIR would defer that until repair mode
> > is switched off, which wouldn't make a lot of difference to us. It's
> > also possible there could be a bug in repair mode that would let you
> > construct a non-locally bound, connected socket that way. I'm not
> > entirely sure what the consequences would be. I guess that might
> > already be possible in a different way: what happens if you have a
> > connect()ed socket, then the admin removes the address to which it is
> > bound?
> 
> I'm fairly sure that the outcome of testing this will surprise us one
> way or another, so probably we should start from testing it.

Yeah, I tend to agree.
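For reference, on the socket creation side the IP_FREEBIND idea would
be roughly the following (untested sketch, made-up name, IPv4 only for
brevity):

/* Sketch: create the socket for an in-migrating flow with IP_FREEBIND
 * set, so bind() succeeds even if the address isn't (yet) configured
 * on the target. Not the actual passt code.
 */
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static int repair_socket_freebind(const struct sockaddr_in *addr)
{
	int one = 1;
	int s;

	s = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP);
	if (s < 0)
		return -1;

	/* Allow binding a non-local address; IPv6 has IPV6_FREEBIND too */
	if (setsockopt(s, IPPROTO_IP, IP_FREEBIND, &one, sizeof(one)) < 0) {
		close(s);
		return -1;
	}

	if (bind(s, (const struct sockaddr *)addr, sizeof(*addr)) < 0) {
		close(s);
		return -1;
	}

	return s;
}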
> We could even think of deferring switching repair mode off until the
> right address is there, by the way. That would make a difference to
> us.

Do you mean by blocking? Or by returning to normal operation with the
flow flagged somehow to be "woken up" by a netlink monitor?
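If the latter, by "netlink monitor" I mean nothing more than the usual
RTM_NEWADDR subscription - roughly the untested sketch below; a real
version would of course match the specific address we're waiting for:

/* Sketch: subscribe to address notifications and return when any
 * RTM_NEWADDR arrives. Illustration only.
 */
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <sys/socket.h>
#include <unistd.h>

static int wait_for_newaddr(void)
{
	struct sockaddr_nl sa = {
		.nl_family = AF_NETLINK,
		.nl_groups = RTMGRP_IPV4_IFADDR | RTMGRP_IPV6_IFADDR,
	};
	char buf[8192];
	int s;

	s = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_ROUTE);
	if (s < 0)
		return -1;

	if (bind(s, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
		close(s);
		return -1;
	}

	for (;;) {
		ssize_t n = recv(s, buf, sizeof(buf), 0);
		struct nlmsghdr *nh;

		if (n < 0)
			break;

		for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, n);
		     nh = NLMSG_NEXT(nh, n)) {
			if (nh->nlmsg_type == RTM_NEWADDR) {
				close(s);
				return 0;	/* an address appeared: resume */
			}
		}
	}

	close(s);
	return -1;
}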
> > > Once I find a solution for the target libvirt/passt-repair thing (and
> > > the remaining SELinux issues), I'll try to have a look at this too. I
> > > haven't yet tried a migration with a mismatching address on the target
> > > and passt-repair available.
> > 
> > Right, I was trying to set up a test case for this today. I made some
> > progress but didn't really get it working. I was using qemu directly
> > with scripts to put the two ends into different net namespaces, rather
> > than libvirt on separate L1 VMs. Working out how to get the two
> > namespaces connected in a way I could do the migration, while still
> > being separate enough, was doing my head in a bit.
> 
> I didn't actually get to the point of having a truly working migration.
> I just considered a clean attempt at resuming an existing connection
> (with keep-alive and subsequent RST from the underlying passt instance)
> as success.
> 
> So, yes, of course, it would be great to have a full simulation in a
> compact form of what's going on with KubeVirt and OVN.
> 
> Probably bridging the L1 VMs would be a quick solution. It needs root

Well.. for some value of quick. That's basically what I've been
working towards, except using namespaces for the L1s instead of VMs.
Figuring out exactly how to set up the bridge so that we have
addressing that allows the migration between the qemus to take place,
and also has the properties we need for passt to work and test the
things we want, is kind of fiddly.

> (at least for the setup), which means they become L2 (at least in my
> world): L1 would still be connected by passt, the two L2 instances are
> bridged, and source and target guests would be L3.

For the tests here we don't even necessarily need the L1 connected to
the outside world: we can put the peer for the migrated connection(s)
inside the L1.

> > In doing that, I also spotted another wrinkle. I don't think this is
> > one we can reasonably fix - but we should be aware, since someone will
> > probably try it at some point:
> 
> Yeah, I tried it, you might see remnants of that in the setup_migrate()
> stuff (I copied it from the "two_guests" test). I originally wanted to
> have two namespaces and two instances of pasta (just like
> "two_guests"), but soon realised the issue you describe below.

Right. It took me an embarrassing amount of time to figure out what
was going wrong, alas.

> > migration is not going to work if the
> > two hosts have their own connectivity provided by (separate instances
> > of) passt or pasta (or slirp for that matter). The migrating VM can
> > have its TCP stream reconstructed perfectly, so the right L2 packets
> > come out of the host, but the host's own passt/pasta instance won't
> > know about the flows and so will just drop/reject the packets.
> > 
> > To make that work we'd basically have to migrate state for every
> > "ancestor" passt/pasta until we hit a common namespace. That seems
> > pretty infeasible to me, since the pieces that know about the
> > migration probably don't own those layers of the network.
> 
> If there's any potential interest around it, we could abstract things a
> bit more than we did until now and have some kind of o... orch...
> coordination of data/migration flows, with some kind of external tool
> identifying and connecting to several instances of passt and moving
> flows between them. It sounds a bit like OVN, but with the notable
> difference that we don't need to implement an overlay network.

Right. I mean this is basically the responsibility of whatever's
managing the host nodes' network. For a virtual bridge or router it's a
relatively simple matter of changing where the guest address is
routed. If the host nodes' network involves passt or pasta it's...
harder.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson