Date: Wed, 5 Feb 2025 22:39:25 +1100
From: David Gibson
To: Hanna Czenczek
Cc: Stefano Brivio, German Maglione, passt-dev@passt.top, Laurent Vivier
Subject: Re: [PATCH v5 5/6] vhost_user: Make source quit after reporting migration state
References: <20250205003904.2797491-1-sbrivio@redhat.com> <20250205003904.2797491-6-sbrivio@redhat.com> <20250205064708.1af3ef28@elisabeth> <9aa66ede-51f5-46b5-ac15-4dbe93ce8903@redhat.com>
In-Reply-To: <9aa66ede-51f5-46b5-ac15-4dbe93ce8903@redhat.com>

On Wed, Feb 05, 2025 at 09:58:11AM +0100, Hanna Czenczek wrote:
> On 05.02.25 06:47, Stefano Brivio wrote:
> > To: German whom I feel okay to To: now that
> > https://github.com/kubevirt/kubevirt/pull/13756 is merged, and Hanna who
> > knows one thing or two about vhost-user based migration.
> >
> > This is Stefano from your neighbouring mailing list, passt-dev. David
> > is wondering:
> >
> > On Wed, 5 Feb 2025 13:09:42 +1100
> > David Gibson wrote:
> >
> > > On Wed, Feb 05, 2025 at 01:39:03AM +0100, Stefano Brivio wrote:
> > > > On migration, the source process asks passt-helper to set TCP sockets
> > > > in repair mode, dumps the information we need to migrate connections,
> > > > and closes them.
> > > >
> > > > At this point, we can't pass them back to passt-helper using
> > > > SCM_RIGHTS, because they are closed, from that perspective, and
> > > > sendmsg() will give us EBADF. But if we don't clear repair mode, the
> > > > port they are bound to will not be available for binding in the
> > > > target.
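(For context: "repair mode" here is the Linux TCP_REPAIR socket option.
A minimal sketch of the dump step described above — illustrative only,
not passt's actual code, and assuming CAP_NET_ADMIN on Linux >= 3.5:)

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef TCP_RECV_QUEUE		/* otherwise from <linux/tcp.h> */
#define TCP_RECV_QUEUE	1
#define TCP_SEND_QUEUE	2
#endif

static int tcp_dump_seqs(int s, uint32_t *snd_seq, uint32_t *rcv_seq)
{
	socklen_t len = sizeof(uint32_t);
	int on = 1, q;

	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on)))
		return -1;	/* EPERM without CAP_NET_ADMIN */

	q = TCP_SEND_QUEUE;	/* select a queue, then read its sequence */
	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q)) ||
	    getsockopt(s, IPPROTO_TCP, TCP_QUEUE_SEQ, snd_seq, &len))
		return -1;

	q = TCP_RECV_QUEUE;
	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q)) ||
	    getsockopt(s, IPPROTO_TCP, TCP_QUEUE_SEQ, rcv_seq, &len))
		return -1;

	/* Closing in repair mode sends no FIN/RST; per the commit message
	 * above, the port is then still not free for binding in the target. */
	return close(s);
}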
> > > >
> > > > Terminate once we're done with the migration and we reported the
> > > > state. This is equivalent to clearing repair mode on the sockets we
> > > > just closed.
> > >
> > > As we've discussed, quitting still makes sense, but the description
> > > above is not really accurate. Perhaps,
> > >
> > > ===
> > > Once we've passed the migration's "point of no return", there's no way
> > > to resume the guest on the source side, because we no longer own the
> > > connections. There's not really anything we can do except exit.
> > > ===
> > >
> > > Except.. thinking about it, I'm not sure that's technically true.
> > > After migration, the source qemu enters a kind of limbo state. I
> > > suppose for the case of to-disk migration (savevm) the guest can
> > > actually be resumed. Which for us is not really compatible with
> > > completing at least a local migration properly. Not really sure what
> > > to do about that.
>
> I wouldn't call it a limbo state, AFAIU the source side just continues
> to be the VM that it was, just paused.  So in case of migration failure
> on the destination (which is where virtiofsd migration errors are
> reported), you can (and should, I believe?) continue the source VM.
> But even in case of success, you could have it continue, and then you'd
> have cloned the VM (which I think has always been a bit of a problem
> for networking specifically).

Right, I kind of realised this later in my thinking.

> > > I think it's also technically possible to use monitor commands to boot
> > > up essentially an entirely new guest instance in the original qemu,
> > > in which case for us it would make sense to basically reset ourselves
> > > (flush the flow table).
> > >
> > > Hrm.. we really need to know the sequence of events in a bit more
> > > detail to get this right (not that this stops improving the guts of
> > > the logic in the meantime).
> > >
> > > I'm asking around to see if I can find who did the migration stuff
> > > for virtiofsd, so we can compare notes.
> >
> > ...what we should be doing in the source passt at different stages of
> > moving our TCP connections (or failing to move them) over to the target,
> > which might be inspired by what you're doing with your... filesystem
> > things in virtiofsd.
>
> As far as I understand, with passt, there's ownership that needs to be…
> passet. (ha ha)  Which leads to that "point of no return".  We don't
> have such a thing in virtiofsd, because ownership can be and is shared
> between source and destination instance once the migration state has
> started to be generated (and we know that the guest is stopped during
> vhost-user migration).

Right, exactly.  Because we're a network device with active
connections, those connections can't be continued by both the source
and destination guests.  Therefore, there has to be some point at
which ownership is irrevocably transferred.

This shows up most obviously for a local to local migration, because
on the target we fail to connect() if the source hasn't yet closed its
corresponding sockets.

I guess with a more normal L2 network device the problem is still
there, but appears at a different level: if you resume both guests
that'll somewhat work, but one or both will shortly get network errors
(TCP RSTs, most likely) as the packets they're trying to send on the
same connection conflict.
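(A matching sketch of the restore side — same headers and caveats as the
dump sketch above, with the addresses and sequence numbers coming from
the migrated state.  In repair mode, bind() plus connect() install the
connection's 4-tuple without sending anything on the wire, and this is
exactly the step that fails locally while the source still holds the
port:)

static int tcp_restore(const struct sockaddr_in *self,
		       const struct sockaddr_in *peer,
		       uint32_t snd_seq, uint32_t rcv_seq)
{
	int s = socket(AF_INET, SOCK_STREAM, 0);
	int on = 1, q;

	if (s < 0)
		return -1;
	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on)))
		goto err;

	q = TCP_SEND_QUEUE;	/* sequences must be set before connect() */
	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q)) ||
	    setsockopt(s, IPPROTO_TCP, TCP_QUEUE_SEQ, &snd_seq, sizeof(snd_seq)))
		goto err;
	q = TCP_RECV_QUEUE;
	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q)) ||
	    setsockopt(s, IPPROTO_TCP, TCP_QUEUE_SEQ, &rcv_seq, sizeof(rcv_seq)))
		goto err;

	/* No SYN handshake in repair mode; fails (e.g. EADDRINUSE) while
	 * the source side still owns the same local port */
	if (bind(s, (const struct sockaddr *)self, sizeof(*self)) ||
	    connect(s, (const struct sockaddr *)peer, sizeof(*peer)))
		goto err;

	on = 0;			/* leave repair mode: socket is ESTABLISHED */
	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on)))
		goto err;
	return s;
err:
	close(s);
	return -1;
}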
> So how it works in virtiofsd is:
> - Depending on how it's done, collecting our state can take too much
> time for VM downtime.  So once we see F_LOG_ALL set, we begin
> collecting it.  At this point, the guest is still running, so we need
> to be able to change the collected state when the guest does something
> that'd affect it.

Oh!  That's a huge insight, thanks.  We also (might) have trouble with
having too much state to transfer (large socket buffers, for example).
I was thinking there was nothing we could do about it, because we
didn't know about the migration until the guest was already stopped -
but I'd forgotten that we get the F_LOG_ALL notification.  That might
make a number of improvements possible.
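(A hypothetical sketch of catching that notification — VHOST_F_LOG_ALL
is bit 26 of the vhost feature set; the hook and the migrate_prepare()
helper below are made-up names for illustration, not existing passt
code, and assume the surrounding vhost_user.c context:)

#define VHOST_F_LOG_ALL 26	/* from the vhost/vhost-user spec */

/* Hypothetical hook in the VHOST_USER_SET_FEATURES handler: the
 * front-end enabling dirty logging is the earliest hint that a
 * migration is starting, while the guest is still running. */
static void vu_features_hint(struct vu_dev *vdev, uint64_t features)
{
	if (features & (1ULL << VHOST_F_LOG_ALL)) {
		/* start collecting connection state early, and keep it
		 * updated as connections change under the running guest */
		migrate_prepare(vdev->context);	/* hypothetical helper */
	}
}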
> - Once we receive SET_DEVICE_STATE, we serialize our state.*
> - Errors are returned through CHECK_DEVICE_STATE.
> - The source instance remains open and operational.  If the (source)
> guest is unpaused, it can continue to use the filesystem.
>
> The destination instance meanwhile receives that state (while the
> guest is stopped), deserializes and applies it, and is then ready to
> begin operation whenever requests start coming in from the guest.
>
> There is no point of no return for us, because the source instance
> remains operational throughout; nothing is moved, just shared.  Until
> execution continues on the destination (which I think you can only
> detect by seeing the virtqueue being started?), anything can still
> report an error and thus fail migration, requiring continuing on the
> source.
>
> Besides waiting for which instance will eventually continue execution,
> the only thing that comes to mind that might be helpful is the
> return-path feature; with it, the source instance will at least know
> whether the destination was successful in migrating or not.  It isn't
> reflected in vhost-user (yet), though.

Right, that's kind of what we'd be looking for, but it's kind of
worse, because ideally we'd want a two-way handshake.  The target says
"I've got all the information" to the source, then the source drops
the connections and says to the target "I've dropped all the
connections", and only then does the target re-establish the
connections.

> * Note that in case of snapshotting (migration to file) with the
> background-snapshot migration flag set, F_LOG_ALL won't be set.  So we
> must also prepare for receiving SET_DEVICE_STATE without F_LOG_ALL
> set, in which case we must then first collect our state before we can
> serialize it.

Oh, ouch.  For snapshotting, we'd be expecting to resume on the
source, even if it's after a long delay.  On the target there's really
no hope of maintaining active connections.  Maybe we should not
transfer TCP connections if F_LOG_ALL isn't set.

> Hanna
>
> > We're taking for granted that as long as we have a chance to detect
> > failure (e.g. we can't dump sequence numbers from a TCP socket in the
> > source), we should use that to abort the whole thing.
> >
> > Once we're past that point, we have several options. And actually,
> > that's regardless of failure: because we also have the question of what
> > to do if we see that nothing went wrong.
> >
> > We can exit in the source, for example (this is what the patch
> > implements): wait for VHOST_USER_CHECK_DEVICE_STATE, report that,
> > and quit.
> >
> > Or we can just clear up all our connections and resume (start from a
> > blank state). Or do that, only if we didn't smell failure.
> >
> > Would you have some pointers, general guidelines, ideas? I know that
> > the topic is a bit broad, but I'm hopeful that you have a lot of
> > clear answers for us. :) Thanks.
> >
> > > > Signed-off-by: Stefano Brivio
> > > > ---
> > > >  vhost_user.c | 11 +++++++++++
> > > >  1 file changed, 11 insertions(+)
> > > >
> > > > diff --git a/vhost_user.c b/vhost_user.c
> > > > index b107d0f..70773d6 100644
> > > > --- a/vhost_user.c
> > > > +++ b/vhost_user.c
> > > > @@ -997,6 +997,8 @@ static bool vu_send_rarp_exec(struct vu_dev *vdev,
> > > >  	return false;
> > > >  }
> > > >
> > > > +static bool quit_on_device_state = false;
> > > > +
> > > >  /**
> > > >   * vu_set_device_state_fd_exec() - Set the device state migration channel
> > > >   * @vdev:	vhost-user device
> > > > @@ -1024,6 +1026,9 @@ static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
> > > >  	migrate_request(vdev->context, msg->fds[0],
> > > >  			direction == VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD);
> > > >
> > > > +	if (direction == VHOST_USER_TRANSFER_STATE_DIRECTION_SAVE)
> > > > +		quit_on_device_state = true;
> > > > +
> > > >  	/* We don't provide a new fd for the data transfer */
> > > >  	vmsg_set_reply_u64(msg, VHOST_USER_VRING_NOFD_MASK);
> > > >
> > > > @@ -1201,4 +1206,10 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
> > > >  	if (reply_requested)
> > > >  		vu_send_reply(fd, &msg);
> > > > +
> > > > +	if (quit_on_device_state &&
> > > > +	    msg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE) {
> > > > +		info("Migration complete, exiting");
> > > > +		exit(EXIT_SUCCESS);
> > > > +	}
> > > >  }

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson