On Wed, Jul 23, 2025 at 11:17:08AM +0200, Stefano Brivio wrote:
> On Wed, 23 Jul 2025 10:27:30 +1000
> David Gibson wrote:
> > On Tue, Jul 22, 2025 at 11:12:49PM +0200, Stefano Brivio wrote:

[snip]

> > > > I think a preferable approach would be to EPOLL_CTL_DEL each socket at
> > > > the point we would close() them with --migrate-no-linger. I'd be fine
> > > > with that as a follow-up improvement, though.
> > >
> > > I was pondering adding this on top of the ignore_data_events trick,
> > > but, actually, the whole thing as of this patch is somewhat bogus
> > > because:
> > >
> > > - we're ignoring events on TCP sockets (intentional),
> > >
> > > - we're ignoring events on the tap device (who cares, migration is only
> > >   supported with vhost-user),
> > >
> > > - but we're *not* ignoring events on the vhost-user kick descriptor
> > >   (oversight).
> > >
> > > On second thought, it doesn't look safe to ignore events on the kick
> > > descriptor, and in any case, with this change, we don't want to prevent
> > > the guest from sending out further packets. It's not expected anyway.
> > >
> > > So I just replaced the whole thing with EPOLL_CTL_DEL (epoll_del()) as
> > > we go through the sockets. It's simpler and arguably safer.
> >
> > Yes, that's what I had in mind as well (I thought I put that in the
> > mail, but it looks like I didn't). Just one additional concern that I
> > don't think needs to hold up the merge: do we also need to epoll_del()
> > our listening sockets?
>
> Right, I had that thought as well, and this was somewhat covered by the
> first version because we'd ignore events on those. But that would still
> come with the risk of epoll_wait() loops.

Exactly.
> From the perspective of a simple implementation / fix mostly intended
> for KubeVirt, such as this one, we know that the guest is suspended at
> that point, so we'll send a SYN to it, nothing comes back, we'll
> eventually time out in 10 seconds and try to reset the connection (in
> the unlikely case we're still running *and* getting traffic at that
> point).

Hrm. Maybe. It wouldn't surprise me if there are still some weird edge
cases in here, but, yes, I suspect they're not a huge deal.

> And I guess that's fine, because if we're still running and getting
> traffic after that timeout, something must have gone wrong, so sending
> a RST segment doesn't look that bad to me. If the connection was
> meanwhile established with the target node, I think our RST segment
> won't actually reset the connection, because the sequence numbers won't
> match.
>
> But surely it's not elegant, and I think we should eventually have an
> explicit implementation of the whole thing, perhaps with a new socket
> state ("MIGRATED"?) and going through the whole list of listening
> sockets and... closing them, I suppose?

Right.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson