On Mon, Feb 03, 2025 at 07:09:32AM +0100, Stefano Brivio wrote: > On Mon, 3 Feb 2025 12:55:47 +1100 > David Gibson wrote: > > > On Fri, Jan 31, 2025 at 08:39:50PM +0100, Stefano Brivio wrote: > > > On migration, the source process asks passt-helper to set TCP sockets > > > in repair mode, dumps the information we need to migrate connections, > > > and closes them. > > > > > > At this point, we can't pass them back to passt-helper using > > > SCM_RIGHTS, because they are closed, from that perspective, and > > > sendmsg() will give us EBADF. But if we don't clear repair mode, the > > > port they are bound to will not be available for binding in the > > > target. > > > > > > Terminate once we're done with the migration and we reported the > > > state. This is equivalent to clearing repair mode on the sockets we > > > just closed. > > > > As noted on the passt-repair patch, I think this is based on a > > misinterpreation of the situation. I think the problem is that the > > sockets aren't closed in passt-repair, so the additional handle copy > > is keeping the underlying socket open. This appears to work, because > > it is causing passt-repair to also terminate. > > Right, exactly that. > > > That said, we probably want to terminate on the source side after a > > succesful migrate anyway. At the very least we need to close() all > > our sockets, and delete the corresponding flows, because we don't own > > them any more. Quitting is probably the simplest way to do that. > > I'm not sure if there's an established behaviour for helpers supporting > state migration. By "helper" do you mean passt as a device helper to qemu, or passt-repair as a helper to passt. For the latter I wouldn't expect so - it's only a weirdness of our situation that we need passt-repair at all. If the former, I'm not really sure what you're after. > We could probably close sockets, delete flows, and keep things up and > running for the rest (restart from a clean situation), but at that > point we already the guest networking is already broken in a number of > ways. So, yeah, maybe let's keep this instead. So, I realised it's a bit more complicated than that. We need to identify exactly where the "point of no return" is. I'll discuss in our call tonight. > > Which also makes me realise, on a *failed* outbound migration, we _do_ > > need to turn repair mode off on everything again. Is that implemented > > yet? > > No, sorry, that's the "/* TODO: rollback */" comments in > flow_migrate_source_pre(), flow.c. But other than that, it should be > pretty much implemented, at a migrate.c level. Right, I realised that a bit after I wrote this. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson