On Mon, Feb 03, 2025 at 07:09:32AM +0100, Stefano Brivio wrote:
> On Mon, 3 Feb 2025 12:55:47 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Fri, Jan 31, 2025 at 08:39:50PM +0100, Stefano Brivio wrote:
> > > On migration, the source process asks passt-helper to set TCP sockets
> > > in repair mode, dumps the information we need to migrate connections,
> > > and closes them.
> > > 
> > > At this point, we can't pass them back to passt-helper using
> > > SCM_RIGHTS, because they are closed, from that perspective, and
> > > sendmsg() will give us EBADF. But if we don't clear repair mode, the
> > > port they are bound to will not be available for binding in the
> > > target.
> > > 
> > > Terminate once we're done with the migration and we reported the
> > > state. This is equivalent to clearing repair mode on the sockets we
> > > just closed.  
> > 
> > As noted on the passt-repair patch, I think this is based on a
> > misinterpreation of the situation.  I think the problem is that the
> > sockets aren't closed in passt-repair, so the additional handle copy
> > is keeping the underlying socket open.  This appears to work, because
> > it is causing passt-repair to also terminate.
> 
> Right, exactly that.
> 
> > That said, we probably want to terminate on the source side after a
> > succesful migrate anyway.  At the very least we need to close() all
> > our sockets, and delete the corresponding flows, because we don't own
> > them any more.  Quitting is probably the simplest way to do that.
> 
> I'm not sure if there's an established behaviour for helpers supporting
> state migration.

By "helper" do you mean passt as a device helper to qemu, or
passt-repair as a helper to passt.  For the latter I wouldn't expect
so - it's only a weirdness of our situation that we need passt-repair
at all.  If the former, I'm not really sure what you're after.

> We could probably close sockets, delete flows, and keep things up and
> running for the rest (restart from a clean situation), but at that
> point we already the guest networking is already broken in a number of
> ways. So, yeah, maybe let's keep this instead.

So, I realised it's a bit more complicated than that.  We need to
identify exactly where the "point of no return" is.  I'll discuss in
our call tonight.

> > Which also makes me realise, on a *failed* outbound migration, we _do_
> > need to turn repair mode off on everything again.  Is that implemented
> > yet?
> 
> No, sorry, that's the "/* TODO: rollback */" comments in
> flow_migrate_source_pre(), flow.c. But other than that, it should be
> pretty much implemented, at a migrate.c level.

Right, I realised that a bit after I wrote this.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson