Date: Wed, 5 Feb 2025 22:39:25 +1100
From: David Gibson
To: Hanna Czenczek
Cc: Stefano Brivio, German Maglione, passt-dev@passt.top, Laurent Vivier
Subject: Re: [PATCH v5 5/6] vhost_user: Make source quit after reporting migration state
References: <20250205003904.2797491-1-sbrivio@redhat.com> <20250205003904.2797491-6-sbrivio@redhat.com> <20250205064708.1af3ef28@elisabeth> <9aa66ede-51f5-46b5-ac15-4dbe93ce8903@redhat.com>
In-Reply-To: <9aa66ede-51f5-46b5-ac15-4dbe93ce8903@redhat.com>

On Wed, Feb 05, 2025 at 09:58:11AM +0100, Hanna Czenczek wrote:
> On 05.02.25 06:47, Stefano Brivio wrote:
> > To: German whom I feel okay to To: now that
> > https://github.com/kubevirt/kubevirt/pull/13756 is merged, and Hanna who
> > knows one thing or two about vhost-user based migration.
> >
> > This is Stefano from your neighbouring mailing list, passt-dev. David
> > is wondering:
> >
> > On Wed, 5 Feb 2025 13:09:42 +1100
> > David Gibson wrote:
> >
> > > On Wed, Feb 05, 2025 at 01:39:03AM +0100, Stefano Brivio wrote:
> > > > On migration, the source process asks passt-helper to set TCP sockets
> > > > in repair mode, dumps the information we need to migrate connections,
> > > > and closes them.
> > > >
> > > > At this point, we can't pass them back to passt-helper using
> > > > SCM_RIGHTS, because they are closed, from that perspective, and
> > > > sendmsg() will give us EBADF. But if we don't clear repair mode, the
> > > > port they are bound to will not be available for binding in the
> > > > target.
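(For context: "repair mode" here is the Linux TCP_REPAIR socket option.
A minimal sketch of the dump step described above — illustrative only,
not passt's actual code, and assuming CAP_NET_ADMIN on Linux >= 3.5:)

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef TCP_RECV_QUEUE		/* otherwise from <linux/tcp.h> */
#define TCP_RECV_QUEUE	1
#define TCP_SEND_QUEUE	2
#endif

static int tcp_dump_seqs(int s, uint32_t *snd_seq, uint32_t *rcv_seq)
{
	socklen_t len = sizeof(uint32_t);
	int on = 1, q;

	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on)))
		return -1;	/* EPERM without CAP_NET_ADMIN */

	q = TCP_SEND_QUEUE;	/* select a queue, then read its sequence */
	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q)) ||
	    getsockopt(s, IPPROTO_TCP, TCP_QUEUE_SEQ, snd_seq, &len))
		return -1;

	q = TCP_RECV_QUEUE;
	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q)) ||
	    getsockopt(s, IPPROTO_TCP, TCP_QUEUE_SEQ, rcv_seq, &len))
		return -1;

	/* Closing in repair mode sends no FIN/RST; per the commit message
	 * above, the port is then still not free for binding in the target. */
	return close(s);
}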
> > > >
> > > > Terminate once we're done with the migration and we reported the
> > > > state. This is equivalent to clearing repair mode on the sockets we
> > > > just closed.
> > >
> > > As we've discussed, quitting still makes sense, but the description
> > > above is not really accurate. Perhaps,
> > >
> > > ===
> > > Once we've passed the migration's "point of no return", there's no way
> > > to resume the guest on the source side, because we no longer own the
> > > connections. There's not really anything we can do except exit.
> > > ===
> > >
> > > Except.. thinking about it, I'm not sure that's technically true.
> > > After migration, the source qemu enters a kind of limbo state. I
> > > suppose for the case of to-disk migration (savevm) the guest can
> > > actually be resumed. Which for us is not really compatible with
> > > completing at least a local migration properly. Not really sure what
> > > to do about that.
>
> I wouldn't call it a limbo state, AFAIU the source side just continues
> to be the VM that it was, just paused.  So in case of migration failure
> on the destination (which is where virtiofsd migration errors are
> reported), you can (and should, I believe?) continue the source VM.
> But even in case of success, you could have it continue, and then you'd
> have cloned the VM (which I think has always been a bit of a problem
> for networking specifically).

Right, I kind of realised this later in my thinking.

> > > I think it's also technically possible to use monitor commands to boot
> > > up essentially an entirely new guest instance in the original qemu,
> > > in which case for us it would make sense to basically reset ourselves
> > > (flush the flow table).
> > >
> > > Hrm.. we really need to know the sequence of events in a bit more
> > > detail to get this right (not that this stops improving the guts of
> > > the logic in the meantime).
> > >
> > > I'm asking around to see if I can find who did the migration stuff
> > > for virtiofsd, so we can compare notes.
> >
> > ...what we should be doing in the source passt at different stages of
> > moving our TCP connections (or failing to move them) over to the target,
> > which might be inspired by what you're doing with your... filesystem
> > things in virtiofsd.
>
> As far as I understand, with passt, there's ownership that needs to be…
> passet. (ha ha)  Which leads to that "point of no return".  We don't
> have such a thing in virtiofsd, because ownership can be and is shared
> between source and destination instance once the migration state has
> started to be generated (and we know that the guest is stopped during
> vhost-user migration).

Right, exactly.  Because we're a network device with active
connections, those connections can't be continued by both the source
and destination guests.  Therefore, there has to be some point at
which ownership is irrevocably transferred.

This shows up most obviously for a local to local migration, because
on the target we fail to connect() if the source hasn't yet closed its
corresponding sockets.

I guess with a more normal L2 network device the problem is still
there, but appears at a different level: if you resume both guests
that'll somewhat work, but one or both will shortly get network errors
(TCP RSTs, most likely) as the packets they're trying to send on the
same connection conflict.
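(A matching sketch of the restore side — same headers and caveats as the
dump sketch above, with the addresses and sequence numbers coming from
the migrated state.  In repair mode, bind() plus connect() install the
connection's 4-tuple without sending anything on the wire, and this is
exactly the step that fails locally while the source still holds the
port:)

static int tcp_restore(const struct sockaddr_in *self,
		       const struct sockaddr_in *peer,
		       uint32_t snd_seq, uint32_t rcv_seq)
{
	int s = socket(AF_INET, SOCK_STREAM, 0);
	int on = 1, q;

	if (s < 0)
		return -1;
	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on)))
		goto err;

	q = TCP_SEND_QUEUE;	/* sequences must be set before connect() */
	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q)) ||
	    setsockopt(s, IPPROTO_TCP, TCP_QUEUE_SEQ, &snd_seq, sizeof(snd_seq)))
		goto err;
	q = TCP_RECV_QUEUE;
	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q)) ||
	    setsockopt(s, IPPROTO_TCP, TCP_QUEUE_SEQ, &rcv_seq, sizeof(rcv_seq)))
		goto err;

	/* No SYN handshake in repair mode; fails (e.g. EADDRINUSE) while
	 * the source side still owns the same local port */
	if (bind(s, (const struct sockaddr *)self, sizeof(*self)) ||
	    connect(s, (const struct sockaddr *)peer, sizeof(*peer)))
		goto err;

	on = 0;			/* leave repair mode: socket is ESTABLISHED */
	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on)))
		goto err;
	return s;
err:
	close(s);
	return -1;
}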
> So how it works in virtiofsd is:
> - Depending on how it's done, collecting our state can take too much
> time for VM downtime.  So once we see F_LOG_ALL set, we begin
> collecting it.  At this point, the guest is still running, so we need
> to be able to change the collected state when the guest does something
> that'd affect it.

Oh!  That's a huge insight, thanks.  We also (might) have trouble with
having too much state to transfer (large socket buffers, for example).
I was thinking there was nothing we could do about it, because we
didn't know about the migration until the guest was already stopped -
but I'd forgotten that we get the F_LOG_ALL notification.  That might
make a number of improvements possible.
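(A hypothetical sketch of catching that notification — VHOST_F_LOG_ALL
is bit 26 of the vhost feature set; the hook and the migrate_prepare()
helper below are made-up names for illustration, not existing passt
code, and assume the surrounding vhost_user.c context:)

#define VHOST_F_LOG_ALL 26	/* from the vhost/vhost-user spec */

/* Hypothetical hook in the VHOST_USER_SET_FEATURES handler: the
 * front-end enabling dirty logging is the earliest hint that a
 * migration is starting, while the guest is still running. */
static void vu_features_hint(struct vu_dev *vdev, uint64_t features)
{
	if (features & (1ULL << VHOST_F_LOG_ALL)) {
		/* start collecting connection state early, and keep it
		 * updated as connections change under the running guest */
		migrate_prepare(vdev->context);	/* hypothetical helper */
	}
}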
> - Once we receive SET_DEVICE_STATE, we serialize our state.*
> - Errors are returned through CHECK_DEVICE_STATE.
> - The source instance remains open and operational.  If the (source)
> guest is unpaused, it can continue to use the filesystem.
>
> The destination instance meanwhile receives that state (while the
> guest is stopped), deserializes and applies it, and is then ready to
> begin operation whenever requests start coming in from the guest.
>
> There is no point of no return for us, because the source instance
> remains operational throughout; nothing is moved, just shared.  Until
> execution continues on the destination (which I think you can only
> detect by seeing the virtqueue being started?), anything can still
> report an error and thus fail migration, requiring continuing on the
> source.
>
> Besides waiting for which instance will eventually continue execution,
> the only thing that comes to mind that might be helpful is the
> return-path feature; with it, the source instance will at least know
> whether the destination was successful in migrating or not.  It isn't
> reflected in vhost-user (yet), though.

Right, that's kind of what we'd be looking for, but it's kind of
worse, because ideally we'd want a two-way handshake.  The target says
"I've got all the information" to the source, then the source drops
the connections and says to the target "I've dropped all the
connections", and only then does the target re-establish the
connections.

> * Note that in case of snapshotting (migration to file) with the
> background-snapshot migration flag set, F_LOG_ALL won't be set.  So we
> must also prepare for receiving SET_DEVICE_STATE without F_LOG_ALL
> set, in which case we must then first collect our state before we can
> serialize it.

Oh, ouch.  For snapshotting, we'd be expecting to resume on the
source, even if it's after a long delay.  On the target there's really
no hope of maintaining active connections.  Maybe we should not
transfer TCP connections if F_LOG_ALL isn't set.

> Hanna
>
> > We're taking for granted that as long as we have a chance to detect
> > failure (e.g. we can't dump sequence numbers from a TCP socket in the
> > source), we should use that to abort the whole thing.
> >
> > Once we're past that point, we have several options. And actually,
> > that's regardless of failure: because we also have the question of what
> > to do if we see that nothing went wrong.
> >
> > We can exit in the source, for example (this is what the patch
> > implements): wait for VHOST_USER_CHECK_DEVICE_STATE, report that,
> > and quit.
> >
> > Or we can just clear up all our connections and resume (start from a
> > blank state). Or do that, only if we didn't smell failure.
> >
> > Would you have some pointers, general guidelines, ideas? I know that
> > the topic is a bit broad, but I'm hopeful that you have a lot of
> > clear answers for us. :) Thanks.
> >
> > > > Signed-off-by: Stefano Brivio
> > > > ---
> > > >  vhost_user.c | 11 +++++++++++
> > > >  1 file changed, 11 insertions(+)
> > > >
> > > > diff --git a/vhost_user.c b/vhost_user.c
> > > > index b107d0f..70773d6 100644
> > > > --- a/vhost_user.c
> > > > +++ b/vhost_user.c
> > > > @@ -997,6 +997,8 @@ static bool vu_send_rarp_exec(struct vu_dev *vdev,
> > > >  	return false;
> > > >  }
> > > >
> > > > +static bool quit_on_device_state = false;
> > > > +
> > > >  /**
> > > >   * vu_set_device_state_fd_exec() - Set the device state migration channel
> > > >   * @vdev:	vhost-user device
> > > > @@ -1024,6 +1026,9 @@ static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
> > > >  	migrate_request(vdev->context, msg->fds[0],
> > > >  			direction == VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD);
> > > >
> > > > +	if (direction == VHOST_USER_TRANSFER_STATE_DIRECTION_SAVE)
> > > > +		quit_on_device_state = true;
> > > > +
> > > >  	/* We don't provide a new fd for the data transfer */
> > > >  	vmsg_set_reply_u64(msg, VHOST_USER_VRING_NOFD_MASK);
> > > >
> > > > @@ -1201,4 +1206,10 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
> > > >  	if (reply_requested)
> > > >  		vu_send_reply(fd, &msg);
> > > > +
> > > > +	if (quit_on_device_state &&
> > > > +	    msg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE) {
> > > > +		info("Migration complete, exiting");
> > > > +		exit(EXIT_SUCCESS);
> > > > +	}
> > > >  }

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson