public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top, Nir Dothan <ndothan@redhat.com>
Subject: Re: [PATCH] treewide: By default, don't quit source after migration, keep sockets open
Date: Wed, 23 Jul 2025 11:17:08 +0200	[thread overview]
Message-ID: <20250723111708.4d78111e@elisabeth> (raw)
In-Reply-To: <aIAscoA5yMLtBbTf@zatzit>

On Wed, 23 Jul 2025 10:27:30 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, Jul 22, 2025 at 11:12:49PM +0200, Stefano Brivio wrote:
> > On Tue, 22 Jul 2025 10:33:00 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Tue, Jul 22, 2025 at 12:12:58AM +0200, Stefano Brivio wrote:  
> > > > We are hitting an issue in the KubeVirt integration where some data is
> > > > still sent to the source instance even after migration is complete. As
> > > > we exit, the kernel closes our sockets and resets connections. The
> > > > resulting RST segments are sent to peers, effectively terminating
> > > > connections that were meanwhile migrated.
> > > > 
> > > > At the moment, this is not done intentionally, but in the future
> > > > KubeVirt might enable OVN-Kubernetes features where source and
> > > > destination nodes are explicitly getting mirrored traffic for a while,
> > > > in order to decrease migration downtime.
> > > > 
> > > > By default, don't quit after migration is completed on the source: the
> > > > previous behaviour can be enabled with the new --migrate-exit
> > > > option.    
> > > 
> > > Might be worth explicitly mentioning that the reason this helps is
> > > that having the sockets still around in repair mode prevents the
> > > source node from responding to any traffic for them.  
> > 
> > Added as separate paragraph, below.  
> 
> Thanks.
> 
> [snip]
> > > > diff --git a/conf.c b/conf.c
> > > > index 6c747aa..5e69014 100644
> > > > --- a/conf.c
> > > > +++ b/conf.c
> > > > @@ -864,6 +864,12 @@ static void usage(const char *name, FILE *f, int status)
> > > >  		FPRINTF(f,
> > > >  			"  --repair-path PATH	path for passt-repair(1)\n"
> > > >  			"    default: append '.repair' to UNIX domain path\n");
> > > > +		FPRINTF(f,
> > > > +			"  --migrate-exit	source quits after migration\n"
> > > > +			"    default: source keeps running after migration\n");
> > > > +		FPRINTF(f,
> > > > +			"  --migrate-no-linger	close sockets on migration\n"
> > > > +			"    default: keep sockets open, ignore data events\n");    
> > > 
> > > I'm a bit uncomfortable adding this as regular, supported options.
> > > I'm hoping we can eliminate them in the not too distant future.  In
> > > the medium term we can do that by improving our test code to simulate
> > > different hosts with namespaces.  Longer term we can allow local to
> > > local migration without these options by fixing the kernel: I don't
> > > believe bind() should fail if the address conflict is with a socket in
> > > repair mode.  
> > 
> > I see the point about --migrate-no-linger, but --migrate-exit looks
> > like something that depends on the migration workflow, not necessarily
> > something to cover up issues in our tests or in the kernel. In any
> > case...  
> 
> It does depend on the workflow, but I think this needs to be managed
> by qemu (or whatever upper level component is managing the overall
> migration).  --migrate-exit doesn't make us exit once the migration is
> complete - it makes us exit once *our part of* the migration is
> complete, the migration could still fail after that.  So in the happy
> path, yes, we'll exit, but it should be because qemu or libvirt or
> whatever is satisfied the migration is finished and kills us off
> 
> > > Further, as user accessible options these are kinda weird - what they
> > > do is defined in terms of some pretty obscure internals, not in terms
> > > of what the upshot is.  So, I think it would be better to introduce
> > > these as deprecated / debug-only / undocumented options.  
> > 
> > ...I marked both as deprecated in v2, to signal to any hypothetical
> > (I'm fairly sure not real) user that they shouldn't assume it's safe to
> > rely on them.
> > 
> > That's the only other category of options we have so far, we don't have
> > undocumented ones, and I think it's a feature. Especially as we use
> > them in the tests.  
> 
> Works for me.
> 
> > By the way, I would argue that testing is an important type of usage,
> > and changes in the test setup itself or in the kernel are non-trivial
> > effort, and it's rather uncertain whether we'll ever manage to commit
> > to that, so I don't quite feel that marking them as deprecated is
> > entirely warranted, but I see the point about possibly dropping that
> > migration model at some point in the future.
> > 
> > Regardless, the fact we *need* them at the moment should prove that
> > they are at least convenient, even if just for testing.  
> 
> Fair enough.
> 
> [snip]
> > > > diff --git a/passt.c b/passt.c
> > > > index 388d10f..f4c108f 100644
> > > > --- a/passt.c
> > > > +++ b/passt.c
> > > > @@ -308,6 +308,9 @@ loop:
> > > >  		      c.mode == MODE_PASTA ? "pasta" : "passt",
> > > >  		      EPOLL_TYPE_STR(ref.type), ref.fd, eventmask);
> > > >  
> > > > +		if (c.ignore_data_events && ref.type <= EPOLL_TYPE_DATA_MAX)
> > > > +			continue;
> > > > +    
> > > 
> > > This worries me slightly.  I think it will be fine if passt is killed
> > > not long after the migration.  However, if passt lingers, we're
> > > ignoring, but not clearing events.  For those events which are
> > > level-triggered this could make us chew CPU on a tight epoll_wait()
> > > loop.  
> > 
> > Ah, yes, I had that concern as well, but I conveniently forgot about it
> > at some point.
> >   
> > > I think a preferable approach would be to EPOLL_CTL_DEL each socket at
> > > the point we would close() them with --migrate-no-linger.  I'd be fine
> > > with that as a follow up improvement, though.  
> > 
> > I was pondering about adding this on top of the ignore_data_events
> > trick, but, actually, the whole thing as of this patch is somewhat
> > bogus because
> > 
> > - we're ignoring events on TCP sockets (intentional),
> > 
> > - we're ignoring events on the tap device (who cares, migration is only
> >   supported with vhost-user)
> > 
> > - but *not* ignoring events on the vhost-user kick descriptor
> >   (oversight).
> > 
> > On a second thought, it doesn't look safe to ignore events on the kick
> > descriptor, and in any case, with this change, we don't want to prevent
> > the guest to send out further packets. It's not expected anyway.
> > 
> > So I just replaced the whole thing with EPOLL_CTL_DEL (epoll_del()) as
> > we go through the sockets. It's simpler and arguably safer.  
> 
> Yes, that's what I had in mind as well (I thought I put that in the
> mail, but it looks like I didn't).  Just one additional concern that I
> don't think need hold up merge: do we also need to epoll_del() our
> listening sockets?

Right, I had that thought as well, and this was somewhat covered by the
first version because we'd ignore events on those. But that would still
come with the risk of epoll_wait() loops.

In the perspective of a simple implementation / fix mostly intended for
KubeVirt such as this one, we know that the guest is suspended at that
point, so we'll send a SYN to it, nothing comes back, we'll eventually
time out in 10 seconds and try to reset the connection (in the unlikely
case we're still running *and* getting traffic at that point).

And I guess that's fine because if we're still running and getting
traffic after that timeout something must have gone wrong, so sending a
RST segment doesn't look that bad to me. If the connection was meanwhile
established with the target node, I think our RST segment won't actually
reset the connection because sequence numbers won't match.

But surely it's not elegant and I think we should eventually have an
explicit implementation of the whole thing, perhaps with a new socket
state ("MIGRATED"?) and going through the whole list of listening
sockets and... closing them, I suppose?

-- 
Stefano


  reply	other threads:[~2025-07-23  9:17 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-21 22:12 [PATCH] treewide: By default, don't quit source after migration, keep sockets open Stefano Brivio
2025-07-22  0:33 ` David Gibson
2025-07-22 21:12   ` Stefano Brivio
2025-07-23  0:27     ` David Gibson
2025-07-23  9:17       ` Stefano Brivio [this message]
2025-07-24  1:48         ` David Gibson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250723111708.4d78111e@elisabeth \
    --to=sbrivio@redhat.com \
    --cc=david@gibson.dropbear.id.au \
    --cc=ndothan@redhat.com \
    --cc=passt-dev@passt.top \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).