From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top, Nir Dothan <ndothan@redhat.com>
Subject: Re: [PATCH] treewide: By default, don't quit source after migration, keep sockets open
Date: Wed, 23 Jul 2025 11:17:08 +0200 [thread overview]
Message-ID: <20250723111708.4d78111e@elisabeth> (raw)
In-Reply-To: <aIAscoA5yMLtBbTf@zatzit>
On Wed, 23 Jul 2025 10:27:30 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Tue, Jul 22, 2025 at 11:12:49PM +0200, Stefano Brivio wrote:
> > On Tue, 22 Jul 2025 10:33:00 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > > On Tue, Jul 22, 2025 at 12:12:58AM +0200, Stefano Brivio wrote:
> > > > We are hitting an issue in the KubeVirt integration where some data is
> > > > still sent to the source instance even after migration is complete. As
> > > > we exit, the kernel closes our sockets and resets connections. The
> > > > resulting RST segments are sent to peers, effectively terminating
> > > > connections that were meanwhile migrated.
> > > >
> > > > At the moment, this is not done intentionally, but in the future
> > > > KubeVirt might enable OVN-Kubernetes features where source and
> > > > destination nodes are explicitly getting mirrored traffic for a while,
> > > > in order to decrease migration downtime.
> > > >
> > > > By default, don't quit after migration is completed on the source: the
> > > > previous behaviour can be enabled with the new --migrate-exit
> > > > option.
> > >
> > > Might be worth explicitly mentioning that the reason this helps is
> > > that having the sockets still around in repair mode prevents the
> > > source node from responding to any traffic for them.
> >
> > Added as separate paragraph, below.
>
> Thanks.
>
> [snip]
> > > > diff --git a/conf.c b/conf.c
> > > > index 6c747aa..5e69014 100644
> > > > --- a/conf.c
> > > > +++ b/conf.c
> > > > @@ -864,6 +864,12 @@ static void usage(const char *name, FILE *f, int status)
> > > > FPRINTF(f,
> > > > " --repair-path PATH path for passt-repair(1)\n"
> > > > " default: append '.repair' to UNIX domain path\n");
> > > > + FPRINTF(f,
> > > > + " --migrate-exit source quits after migration\n"
> > > > + " default: source keeps running after migration\n");
> > > > + FPRINTF(f,
> > > > + " --migrate-no-linger close sockets on migration\n"
> > > > + " default: keep sockets open, ignore data events\n");
> > >
> > > I'm a bit uncomfortable adding this as regular, supported options.
> > > I'm hoping we can eliminate them in the not too distant future. In
> > > the medium term we can do that by improving our test code to simulate
> > > different hosts with namespaces. Longer term we can allow local to
> > > local migration without these options by fixing the kernel: I don't
> > > believe bind() should fail if the address conflict is with a socket in
> > > repair mode.
> >
> > I see the point about --migrate-no-linger, but --migrate-exit looks
> > like something that depends on the migration workflow, not necessarily
> > something to cover up issues in our tests or in the kernel. In any
> > case...
>
> It does depend on the workflow, but I think this needs to be managed
> by qemu (or whatever upper level component is managing the overall
> migration). --migrate-exit doesn't make us exit once the migration is
> complete - it makes us exit once *our part of* the migration is
> complete, the migration could still fail after that. So in the happy
> path, yes, we'll exit, but it should be because qemu or libvirt or
> whatever is satisfied the migration is finished and kills us off
>
> > > Further, as user accessible options these are kinda weird - what they
> > > do is defined in terms of some pretty obscure internals, not in terms
> > > of what the upshot is. So, I think it would be better to introduce
> > > these as deprecated / debug-only / undocumented options.
> >
> > ...I marked both as deprecated in v2, to signal to any hypothetical
> > (I'm fairly sure not real) user that they shouldn't assume it's safe to
> > rely on them.
> >
> > That's the only other category of options we have so far, we don't have
> > undocumented ones, and I think it's a feature. Especially as we use
> > them in the tests.
>
> Works for me.
>
> > By the way, I would argue that testing is an important type of usage,
> > and changes in the test setup itself or in the kernel are non-trivial
> > effort, and it's rather uncertain whether we'll ever manage to commit
> > to that, so I don't quite feel that marking them as deprecated is
> > entirely warranted, but I see the point about possibly dropping that
> > migration model at some point in the future.
> >
> > Regardless, the fact we *need* them at the moment should prove that
> > they are at least convenient, even if just for testing.
>
> Fair enough.
>
> [snip]
> > > > diff --git a/passt.c b/passt.c
> > > > index 388d10f..f4c108f 100644
> > > > --- a/passt.c
> > > > +++ b/passt.c
> > > > @@ -308,6 +308,9 @@ loop:
> > > > c.mode == MODE_PASTA ? "pasta" : "passt",
> > > > EPOLL_TYPE_STR(ref.type), ref.fd, eventmask);
> > > >
> > > > + if (c.ignore_data_events && ref.type <= EPOLL_TYPE_DATA_MAX)
> > > > + continue;
> > > > +
> > >
> > > This worries me slightly. I think it will be fine if passt is killed
> > > not long after the migration. However, if passt lingers, we're
> > > ignoring, but not clearing events. For those events which are
> > > level-triggered this could make us chew CPU on a tight epoll_wait()
> > > loop.
> >
> > Ah, yes, I had that concern as well, but I conveniently forgot about it
> > at some point.
> >
> > > I think a preferable approach would be to EPOLL_CTL_DEL each socket at
> > > the point we would close() them with --migrate-no-linger. I'd be fine
> > > with that as a follow up improvement, though.
> >
> > I was pondering about adding this on top of the ignore_data_events
> > trick, but, actually, the whole thing as of this patch is somewhat
> > bogus because
> >
> > - we're ignoring events on TCP sockets (intentional),
> >
> > - we're ignoring events on the tap device (who cares, migration is only
> > supported with vhost-user)
> >
> > - but *not* ignoring events on the vhost-user kick descriptor
> > (oversight).
> >
> > On a second thought, it doesn't look safe to ignore events on the kick
> > descriptor, and in any case, with this change, we don't want to prevent
> > the guest to send out further packets. It's not expected anyway.
> >
> > So I just replaced the whole thing with EPOLL_CTL_DEL (epoll_del()) as
> > we go through the sockets. It's simpler and arguably safer.
>
> Yes, that's what I had in mind as well (I thought I put that in the
> mail, but it looks like I didn't). Just one additional concern that I
> don't think need hold up merge: do we also need to epoll_del() our
> listening sockets?
Right, I had that thought as well, and this was somewhat covered by the
first version because we'd ignore events on those. But that would still
come with the risk of epoll_wait() loops.
In the perspective of a simple implementation / fix mostly intended for
KubeVirt such as this one, we know that the guest is suspended at that
point, so we'll send a SYN to it, nothing comes back, we'll eventually
time out in 10 seconds and try to reset the connection (in the unlikely
case we're still running *and* getting traffic at that point).
And I guess that's fine because if we're still running and getting
traffic after that timeout something must have gone wrong, so sending a
RST segment doesn't look that bad to me. If the connection was meanwhile
established with the target node, I think our RST segment won't actually
reset the connection because sequence numbers won't match.
But surely it's not elegant and I think we should eventually have an
explicit implementation of the whole thing, perhaps with a new socket
state ("MIGRATED"?) and going through the whole list of listening
sockets and... closing them, I suppose?
--
Stefano
next prev parent reply other threads:[~2025-07-23 9:17 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-07-21 22:12 [PATCH] treewide: By default, don't quit source after migration, keep sockets open Stefano Brivio
2025-07-22 0:33 ` David Gibson
2025-07-22 21:12 ` Stefano Brivio
2025-07-23 0:27 ` David Gibson
2025-07-23 9:17 ` Stefano Brivio [this message]
2025-07-24 1:48 ` David Gibson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250723111708.4d78111e@elisabeth \
--to=sbrivio@redhat.com \
--cc=david@gibson.dropbear.id.au \
--cc=ndothan@redhat.com \
--cc=passt-dev@passt.top \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).