On Tue, Jul 22, 2025 at 11:12:49PM +0200, Stefano Brivio wrote: > On Tue, 22 Jul 2025 10:33:00 +1000 > David Gibson wrote: > > > On Tue, Jul 22, 2025 at 12:12:58AM +0200, Stefano Brivio wrote: > > > We are hitting an issue in the KubeVirt integration where some data is > > > still sent to the source instance even after migration is complete. As > > > we exit, the kernel closes our sockets and resets connections. The > > > resulting RST segments are sent to peers, effectively terminating > > > connections that were meanwhile migrated. > > > > > > At the moment, this is not done intentionally, but in the future > > > KubeVirt might enable OVN-Kubernetes features where source and > > > destination nodes are explicitly getting mirrored traffic for a while, > > > in order to decrease migration downtime. > > > > > > By default, don't quit after migration is completed on the source: the > > > previous behaviour can be enabled with the new --migrate-exit > > > option. > > > > Might be worth explicitly mentioning that the reason this helps is > > that having the sockets still around in repair mode prevents the > > source node from responding to any traffic for them. > > Added as separate paragraph, below. Thanks. [snip] > > > diff --git a/conf.c b/conf.c > > > index 6c747aa..5e69014 100644 > > > --- a/conf.c > > > +++ b/conf.c > > > @@ -864,6 +864,12 @@ static void usage(const char *name, FILE *f, int status) > > > FPRINTF(f, > > > " --repair-path PATH path for passt-repair(1)\n" > > > " default: append '.repair' to UNIX domain path\n"); > > > + FPRINTF(f, > > > + " --migrate-exit source quits after migration\n" > > > + " default: source keeps running after migration\n"); > > > + FPRINTF(f, > > > + " --migrate-no-linger close sockets on migration\n" > > > + " default: keep sockets open, ignore data events\n"); > > > > I'm a bit uncomfortable adding this as regular, supported options. > > I'm hoping we can eliminate them in the not too distant future. In > > the medium term we can do that by improving our test code to simulate > > different hosts with namespaces. Longer term we can allow local to > > local migration without these options by fixing the kernel: I don't > > believe bind() should fail if the address conflict is with a socket in > > repair mode. > > I see the point about --migrate-no-linger, but --migrate-exit looks > like something that depends on the migration workflow, not necessarily > something to cover up issues in our tests or in the kernel. In any > case... It does depend on the workflow, but I think this needs to be managed by qemu (or whatever upper level component is managing the overall migration). --migrate-exit doesn't make us exit once the migration is complete - it makes us exit once *our part of* the migration is complete, the migration could still fail after that. So in the happy path, yes, we'll exit, but it should be because qemu or libvirt or whatever is satisfied the migration is finished and kills us off > > Further, as user accessible options these are kinda weird - what they > > do is defined in terms of some pretty obscure internals, not in terms > > of what the upshot is. So, I think it would be better to introduce > > these as deprecated / debug-only / undocumented options. > > ...I marked both as deprecated in v2, to signal to any hypothetical > (I'm fairly sure not real) user that they shouldn't assume it's safe to > rely on them. > > That's the only other category of options we have so far, we don't have > undocumented ones, and I think it's a feature. Especially as we use > them in the tests. Works for me. > By the way, I would argue that testing is an important type of usage, > and changes in the test setup itself or in the kernel are non-trivial > effort, and it's rather uncertain whether we'll ever manage to commit > to that, so I don't quite feel that marking them as deprecated is > entirely warranted, but I see the point about possibly dropping that > migration model at some point in the future. > > Regardless, the fact we *need* them at the moment should prove that > they are at least convenient, even if just for testing. Fair enough. [snip] > > > diff --git a/passt.c b/passt.c > > > index 388d10f..f4c108f 100644 > > > --- a/passt.c > > > +++ b/passt.c > > > @@ -308,6 +308,9 @@ loop: > > > c.mode == MODE_PASTA ? "pasta" : "passt", > > > EPOLL_TYPE_STR(ref.type), ref.fd, eventmask); > > > > > > + if (c.ignore_data_events && ref.type <= EPOLL_TYPE_DATA_MAX) > > > + continue; > > > + > > > > This worries me slightly. I think it will be fine if passt is killed > > not long after the migration. However, if passt lingers, we're > > ignoring, but not clearing events. For those events which are > > level-triggered this could make us chew CPU on a tight epoll_wait() > > loop. > > Ah, yes, I had that concern as well, but I conveniently forgot about it > at some point. > > > I think a preferable approach would be to EPOLL_CTL_DEL each socket at > > the point we would close() them with --migrate-no-linger. I'd be fine > > with that as a follow up improvement, though. > > I was pondering about adding this on top of the ignore_data_events > trick, but, actually, the whole thing as of this patch is somewhat > bogus because > > - we're ignoring events on TCP sockets (intentional), > > - we're ignoring events on the tap device (who cares, migration is only > supported with vhost-user) > > - but *not* ignoring events on the vhost-user kick descriptor > (oversight). > > On a second thought, it doesn't look safe to ignore events on the kick > descriptor, and in any case, with this change, we don't want to prevent > the guest to send out further packets. It's not expected anyway. > > So I just replaced the whole thing with EPOLL_CTL_DEL (epoll_del()) as > we go through the sockets. It's simpler and arguably safer. Yes, that's what I had in mind as well (I thought I put that in the mail, but it looks like I didn't). Just one additional concern that I don't think need hold up merge: do we also need to epoll_del() our listening sockets? -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson