On Wed, Feb 05, 2025 at 09:58:11AM +0100, Hanna Czenczek wrote:
> On 05.02.25 06:47, Stefano Brivio wrote:
> > To: German whom I feel okay to To: now that
> > https://github.com/kubevirt/kubevirt/pull/13756 is merged, and Hanna
> > who knows a thing or two about vhost-user based migration.
> >
> > This is Stefano from your neighbouring mailing list, passt-dev. David
> > is wondering:
> >
> > On Wed, 5 Feb 2025 13:09:42 +1100
> > David Gibson wrote:
> >
> > > On Wed, Feb 05, 2025 at 01:39:03AM +0100, Stefano Brivio wrote:
> > > > On migration, the source process asks passt-helper to set TCP
> > > > sockets in repair mode, dumps the information we need to migrate
> > > > connections, and closes them.
> > > >
> > > > At this point, we can't pass them back to passt-helper using
> > > > SCM_RIGHTS, because they are closed, from that perspective, and
> > > > sendmsg() will give us EBADF. But if we don't clear repair mode,
> > > > the port they are bound to will not be available for binding in
> > > > the target.
> > > >
> > > > Terminate once we're done with the migration and we reported the
> > > > state. This is equivalent to clearing repair mode on the sockets
> > > > we just closed.
> > >
> > > As we've discussed, quitting still makes sense, but the description
> > > above is not really accurate. Perhaps,
> > >
> > > ===
> > > Once we've passed the migration's "point of no return", there's no
> > > way to resume the guest on the source side, because we no longer
> > > own the connections. There's not really anything we can do except
> > > exit.
> > > ===
> > >
> > > Except.. thinking about it, I'm not sure that's technically true.
> > > After migration, the source qemu enters a kind of limbo state. I
> > > suppose for the case of to-disk migration (savevm) the guest can
> > > actually be resumed. Which for us is not really compatible with
> > > completing at least a local migration properly. Not really sure
> > > what to do about that.
>
> I wouldn’t call it a limbo state, AFAIU the source side just continues
> to be the VM that it was, just paused.  So in case of migration
> failure on the destination (which is where virtiofsd migration errors
> are reported), you can (and should, I believe?) continue the source
> VM.  But even in case of success, you could have it continue, and then
> you’d have cloned the VM (which I think has always been a bit of a
> problem for networking specifically).

Right, I kind of realised this later in my thinking.

> > > I think it's also technically possible to use monitor commands to
> > > boot up essentially an entirely new guest instance in the original
> > > qemu, in which case for us it would make sense to basically reset
> > > ourselves (flush the flow table).
> > >
> > > Hrm.. we really need to know the sequence of events in a bit more
> > > detail to get this right (not that this stops us improving the guts
> > > of the logic in the meantime).
> > >
> > > I'm asking around to see if I can find who did the migration stuff
> > > for virtiofsd, so we can compare notes.
> >
> > ...what we should be doing in the source passt at different stages
> > of moving our TCP connections (or failing to move them) over to the
> > target, which might be inspired by what you're doing with your...
> > filesystem things in virtiofsd.
>
> As far as I understand, with passt, there’s ownership that needs to
> be… passet. (ha ha)  Which leads to that “point of no return”.
> We don’t have such a thing in virtiofsd, because ownership can be and
> is shared between source and destination instance once the migration
> state has started to be generated (and we know that the guest is
> stopped during vhost-user migration).

Right, exactly.  Because we're a network device with active
connections, those connections can't be continued by both the source
and destination guests.  Therefore, there has to be some point at
which ownership is irrevocably transferred.

This shows up most obviously for a local to local migration, because
on the target we fail to connect() if the source hasn't yet closed its
corresponding sockets.  I guess with a more normal L2 network device
the problem is still there, but appears at a different level: if you
resume both guests that'll somewhat work, but one or both will shortly
get network errors (TCP RSTs, most likely) as the packets they're
trying to send on the same connection conflict.

> So how it works in virtiofsd is:
> - Depending on how it’s done, collecting our state can take too much
>   time for VM downtime.  So once we see F_LOG_ALL set, we begin
>   collecting it.  At this point, the guest is still running, so we
>   need to be able to change the collected state when the guest does
>   something that’d affect it.

Oh!  That's a huge insight, thanks.  We also (might) have trouble with
having too much state to transfer (large socket buffers, for example).
I was thinking there was nothing we could do about it, because we
didn't know about the migration until the guest was already stopped -
but I'd forgotten that we get the F_LOG_ALL notification.  That might
make a number of improvements possible.

> - Once we receive SET_DEVICE_STATE, we serialize our state.*
> - Errors are returned through CHECK_DEVICE_STATE.
> - The source instance remains open and operational.  If the (source)
>   guest is unpaused, it can continue to use the filesystem.
>
> The destination instance meanwhile receives that state (while the
> guest is stopped), deserializes and applies it, and is then ready to
> begin operation whenever requests start coming in from the guest.
>
> There is no point of no return for us, because the source instance
> remains operational throughout; nothing is moved, just shared.  Until
> execution continues on the destination (which I think you can only
> detect by seeing the virtqueue being started?), anything can still
> report an error and thus fail migration, requiring continuing on the
> source.
>
> Besides waiting for which instance will eventually continue execution,
> the only thing that comes to mind that might be helpful is the
> return-path feature; with it, the source instance will at least know
> whether the destination was successful in migrating or not.  It isn’t
> reflected in vhost-user (yet), though.

Right, that's kind of what we'd be looking for, but our case is a bit
worse, because ideally we'd want a two-way handshake: the target says
"I've got all the information" to the source, then the source drops
the connections and says to the target "I've dropped all the
connections", and only then does the target re-establish the
connections.

> * Note that in case of snapshotting (migration to file) with the
>   background-snapshot migration flag set, F_LOG_ALL won’t be set.  So
>   we must also prepare for receiving SET_DEVICE_STATE without
>   F_LOG_ALL set, in which case we must then first collect our state
>   before we can serialize it.

Oh, ouch.  For snapshotting, we'd be expecting to resume on the
source, even if it's after a long delay.  On the target there's really
no hope of maintaining active connections.  Maybe we should not
transfer TCP connections if F_LOG_ALL isn't set.
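
To make that concrete, the sort of logic I have in mind on our side is
roughly the following.  This is only a sketch: the names
(track_log_all(), should_transfer_tcp_flows(), log_all_seen) are made
up for illustration and aren't anything passt actually has today.

/* Sketch only: illustrative names, not passt's actual internals. */
#include <stdbool.h>
#include <stdint.h>

#define VHOST_F_LOG_ALL		26	/* feature bit, from the vhost spec */

static bool log_all_seen;	/* has the front-end enabled dirty logging? */

/* From the SET_FEATURES handler: a live migration turns on
 * VHOST_F_LOG_ALL while the guest is still running, which is also the
 * point where we could start pre-collecting per-flow state so that the
 * stop-copy phase stays short.
 */
static void track_log_all(uint64_t features)
{
	if (features & (1ULL << VHOST_F_LOG_ALL))
		log_all_seen = true;
}

/* When the device state transfer is requested (SET_DEVICE_STATE): if we
 * never saw F_LOG_ALL, this is presumably a background snapshot, the
 * source will resume (maybe much later), so migrating live TCP
 * connections makes no sense.
 */
static bool should_transfer_tcp_flows(void)
{
	return log_all_seen;
}

That would also give us the hook to start collecting flow state while
the guest is still running, along the lines of what you describe above,
instead of doing everything during the stopped phase.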

> Hanna
>
> > We're taking for granted that as long as we have a chance to detect
> > failure (e.g. we can't dump sequence numbers from a TCP socket in
> > the source), we should use that to abort the whole thing.
> >
> > Once we're past that point, we have several options. And actually,
> > that's regardless of failure: because we also have the question of
> > what to do if we see that nothing went wrong.
> >
> > We can exit in the source, for example (this is what this patch
> > implements): wait for VHOST_USER_CHECK_DEVICE_STATE, report that,
> > and quit.
> >
> > Or we can just clear up all our connections and resume (start from a
> > blank state). Or do that, only if we didn't smell failure.
> >
> > Would you have some pointers, general guidelines, ideas? I know that
> > the topic is a bit broad, but I'm hopeful that you have a lot of
> > clear answers for us. :) Thanks.
> >
> > > > Signed-off-by: Stefano Brivio 
> > > > ---
> > > >  vhost_user.c | 11 +++++++++++
> > > >  1 file changed, 11 insertions(+)
> > > >
> > > > diff --git a/vhost_user.c b/vhost_user.c
> > > > index b107d0f..70773d6 100644
> > > > --- a/vhost_user.c
> > > > +++ b/vhost_user.c
> > > > @@ -997,6 +997,8 @@ static bool vu_send_rarp_exec(struct vu_dev *vdev,
> > > >  	return false;
> > > >  }
> > > >
> > > > +static bool quit_on_device_state = false;
> > > > +
> > > >  /**
> > > >   * vu_set_device_state_fd_exec() - Set the device state migration channel
> > > >   * @vdev:	vhost-user device
> > > > @@ -1024,6 +1026,9 @@ static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
> > > >  	migrate_request(vdev->context, msg->fds[0],
> > > >  			direction == VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD);
> > > >
> > > > +	if (direction == VHOST_USER_TRANSFER_STATE_DIRECTION_SAVE)
> > > > +		quit_on_device_state = true;
> > > > +
> > > >  	/* We don't provide a new fd for the data transfer */
> > > >  	vmsg_set_reply_u64(msg, VHOST_USER_VRING_NOFD_MASK);
> > > >
> > > > @@ -1201,4 +1206,10 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
> > > >
> > > >  	if (reply_requested)
> > > >  		vu_send_reply(fd, &msg);
> > > > +
> > > > +	if (quit_on_device_state &&
> > > > +	    msg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE) {
> > > > +		info("Migration complete, exiting");
> > > > +		exit(EXIT_SUCCESS);
> > > > +	}
> > > >  }
>

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson