From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top, Nir Dothan <ndothan@redhat.com>
Subject: Re: [PATCH v3] treewide: By default, don't quit source after migration, keep sockets open
Date: Fri, 25 Jul 2025 16:50:23 +1000
Message-ID: <aIMpL2jG6OcENO2-@zatzit>
In-Reply-To: <20250725071058.0842f7a2@elisabeth>
On Fri, Jul 25, 2025 at 07:10:58AM +0200, Stefano Brivio wrote:
> On Fri, 25 Jul 2025 14:04:17 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Thu, Jul 24, 2025 at 07:28:58PM +0200, Stefano Brivio wrote:
> > > We are hitting an issue in the KubeVirt integration where some data is
> > > still sent to the source instance even after migration is complete. As
> > > we exit, the kernel closes our sockets and resets connections. The
> > > resulting RST segments are sent to peers, effectively terminating
> > > connections that were meanwhile migrated.
> > >
> > > At the moment, this is not done intentionally, but in the future
> > > KubeVirt might enable OVN-Kubernetes features where source and
> > > destination nodes are explicitly getting mirrored traffic for a while,
> > > in order to decrease migration downtime.
> > >
> > > By default, don't quit after migration is completed on the source: the
> > > previous behaviour can be enabled with the new, but deprecated,
> > > --migrate-exit option. After migration (as source), the -1 / --one-off
> > > option has no effect.
> > >
> > > Also, by default, keep migrated TCP sockets open (in repair mode) as
> > > long as we're running, and ignore events on any epoll descriptor
> > > representing data channels. The previous behaviour can be enabled with
> > > the new, equally deprecated, --migrate-no-linger option.
> > >
> > > By keeping sockets open, and not exiting, we prevent the kernel
> > > running on the source node from sending out RST segments if further
> > > data reaches us.
> > >
> > > Reported-by: Nir Dothan <ndothan@redhat.com>
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > ---
> > > v2:
> > > - assorted changes in commit message
> > > - context variable ignore_linger becomes ignore_no_linger
> > > - new options are deprecated
> > > - don't ignore events on some descriptors, drop them from epoll
> > >
> > > v3:
> > > - Nir reported occasional failures (connections being reset)
> > > with both v1 and v2, because, in KubeVirt's usage, we quit as
> > > QEMU exits. Disable --one-off after migration as source, and
> > > document this exception
> >
> > This seems like an awful, awful hack.
>
> Well, of course, it is, and long term it should be fixed in
> either KubeVirt or libvirt (even though I'm not sure how, see below)
> instead.

But this hack means that even when it's fixed, we'll still have this
wildly counterintuitive behaviour that every future user will have to
work around. There's no sensible internal reason for out-migration to
affect lifetime; it's a workaround for problems that are quite
specific to this stack of layers above.

> > We're abandoning consistent
> > semantics on a wild guess as to what the layers above us need.
>
> No, not really, we tested this and tested the alternative.

With just one use case. Creating semantics to work with exactly how
something is used now, without thought to whether they make sense in
general, is the definition of fragile software.

> > Specifically, --once-off used to mean that the layer above us didn't
>
> --one-off
>
> > need to manage passt's lifetime; it was tied to qemu's. Now it still
> > needs to manually manage passt's lifetime, so what's the point. So,
> > if it needs passt to outlive qemu it should actually manage that and
> > not use --once-off.
>
> The main point is that it does *not* manually manage passt's lifetime
> if there's no migration (which is the general case for libvirt and all
> other users).

That's exactly my point. With this hack it's neither one model nor
the other, so you have to be aware of both.

> We don't have any other user with an implementation of the migration
> workflow anyway (libvirt itself doesn't do that, yet). It's otherwise
> unusable for KubeVirt. So I'd say let's fix it for the only user we
> have.

Please not at the expense of forcing every future user to deal with
this suckage.

> > Requiring passt to outlive qemu already seems pretty dubious to me:
> > having the source still connected when passt was quitting is one thing
> > - indeed it's arguably hard to avoid. Having it still connected when
> > *qemu* quits is much less defensible.
>
> The fundamental problem here is that there's an issue in KubeVirt
> (and working around it is the whole point of this patch) which implies
> that packets are sent to the source pod *for a while* after migration.
>
> We found out that the guest is generally suspended during that while,
> but sometimes it might even have already exited. The pod remains,
> though, as long as it's needed. That's the only certainty we have.

Keeping the pod around is fine. What needs to change is that the
guest's IP(s) need to be removed from the source host before qemu
(and therefore passt) is terminated. The pod must have at least one
other IP, or it would be impossible to perform the migration in the
first place.

This essentially matches the situation for bridged networking: with
the source guest suspended, the source host will no longer respond to
the guest IP.

> So, do we want to drop --one-off from the libvirt integration, and have
> libvirt manage passt's lifecycle entirely (note that all users outside
> KubeVirt don't use migration, so we would make the general case vastly
> more complicated for the sake of correctness for a single usage...)?

Hmm... if I understand correctly, the network swizzling is handled by
KubeVirt, not libvirt. I'm hoping that means there's a suitable point
at which it can remove the IP without having to alter libvirt.

> Well, we can try to do that. Except that libvirt doesn't know either
> for how long this traffic will reach the source pod (that's a KubeVirt
> concept). So it should implement the same hack: let it outlive QEMU on
> migration... as long as we have that issue in KubeVirt.
>
> But I asked KubeVirt people, and it turns out that it's extremely
> complicated to fix this in KubeVirt. So, actually, I don't see another
> way to fix this in the short term. And without KubeVirt using this we
> could also drop the whole feature...
>
--
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson