Re: [PATCH] treewide: By default, don't quit source after migration, keep sockets open

public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed

From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top, Nir Dothan <ndothan@redhat.com>
Subject: Re: [PATCH] treewide: By default, don't quit source after migration, keep sockets open
Date: Tue, 22 Jul 2025 23:12:49 +0200	[thread overview]
Message-ID: <20250722231249.22e09340@elisabeth> (raw)
In-Reply-To: <aH7cPD2PeeegqbRk@zatzit>

On Tue, 22 Jul 2025 10:33:00 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, Jul 22, 2025 at 12:12:58AM +0200, Stefano Brivio wrote:
> > We are hitting an issue in the KubeVirt integration where some data is
> > still sent to the source instance even after migration is complete. As
> > we exit, the kernel closes our sockets and resets connections. The
> > resulting RST segments are sent to peers, effectively terminating
> > connections that were meanwhile migrated.
> > 
> > At the moment, this is not done intentionally, but in the future
> > KubeVirt might enable OVN-Kubernetes features where source and
> > destination nodes are explicitly getting mirrored traffic for a while,
> > in order to decrease migration downtime.
> > 
> > By default, don't quit after migration is completed on the source: the
> > previous behaviour can be enabled with the new --migrate-exit
> > option.  
> 
> Might be worth explicitly mentioning that the reason this helps is
> that having the sockets still around in repair mode prevents the
> source node from responding to any traffic for them.

Added as separate paragraph, below.

> > Also, by default, keep migrated TCP sockets open (in repair mode) as
> > long as we're running, and ignore events on any epoll descriptor
> > representing data channels. The previous (and, strictly speaking,
> > correct)  
> 
> I'm not necessarily convinced the previous behaviour is more correct.
> As we've discussed the new behaviour is more in keeping with the
> normal qemu migration model, and with further work could let us allow
> a resume on the source node even if there's a failure late in
> migration (as long as it's before the target resumes, of course).
> 
> As with --migrate-exit the new behaviour is also more robust against
> the exact timing of network switchover.  I think that's generally
> desirable, even if it's arguably not strictly necessary.

Right, parenthesis dropped. Well, the previous behaviour had the
advantage of uncovering the issue in KubeVirt (at least we know that
overlap might happen), but that doesn't make it more correct, per se.

> > behaviour can be enabled with the new --migrate-no-linger
> > option.
> > 
> > Reported-by: Nir Dothan <ndothan@redhat.com>
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>  
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> 
> ... in the sense that I think it does accomplish what it sets out to
> do.  There are several things that I think would improve it, see
> details below.
> 
> > ---
> >  conf.c         | 23 +++++++++++++++++++++++
> >  epoll_type.h   | 12 ++++++++----
> >  flow.c         |  2 +-
> >  passt.1        | 14 ++++++++++++++
> >  passt.c        |  6 +++++-
> >  passt.h        |  6 ++++++
> >  tcp.c          |  7 +++++--
> >  tcp_conn.h     |  3 ++-
> >  test/lib/setup |  4 ++--
> >  vhost_user.c   | 10 ++++++++--
> >  10 files changed, 74 insertions(+), 13 deletions(-)
> > 
> > diff --git a/conf.c b/conf.c
> > index 6c747aa..5e69014 100644
> > --- a/conf.c
> > +++ b/conf.c
> > @@ -864,6 +864,12 @@ static void usage(const char *name, FILE *f, int status)
> >  		FPRINTF(f,
> >  			"  --repair-path PATH	path for passt-repair(1)\n"
> >  			"    default: append '.repair' to UNIX domain path\n");
> > +		FPRINTF(f,
> > +			"  --migrate-exit	source quits after migration\n"
> > +			"    default: source keeps running after migration\n");
> > +		FPRINTF(f,
> > +			"  --migrate-no-linger	close sockets on migration\n"
> > +			"    default: keep sockets open, ignore data events\n");  
> 
> I'm a bit uncomfortable adding this as regular, supported options.
> I'm hoping we can eliminate them in the not too distant future.  In
> the medium term we can do that by improving our test code to simulate
> different hosts with namespaces.  Longer term we can allow local to
> local migration without these options by fixing the kernel: I don't
> believe bind() should fail if the address conflict is with a socket in
> repair mode.

I see the point about --migrate-no-linger, but --migrate-exit looks
like something that depends on the migration workflow, not necessarily
something to cover up issues in our tests or in the kernel. In any
case...

> Further, as user accessible options these are kinda weird - what they
> do is defined in terms of some pretty obscure internals, not in terms
> of what the upshot is.  So, I think it would be better to introduce
> these as deprecated / debug-only / undocumented options.

...I marked both as deprecated in v2, to signal to any hypothetical
(I'm fairly sure not real) user that they shouldn't assume it's safe to
rely on them.

That's the only other category of options we have so far, we don't have
undocumented ones, and I think it's a feature. Especially as we use
them in the tests.

By the way, I would argue that testing is an important type of usage,
and changes in the test setup itself or in the kernel are non-trivial
effort, and it's rather uncertain whether we'll ever manage to commit
to that, so I don't quite feel that marking them as deprecated is
entirely warranted, but I see the point about possibly dropping that
migration model at some point in the future.

Regardless, the fact we *need* them at the moment should prove that
they are at least convenient, even if just for testing.

> >  	}
> >  
> >  	FPRINTF(f,
> > @@ -1470,6 +1476,8 @@ void conf(struct ctx *c, int argc, char **argv)
> >  		{"socket-path",	required_argument,	NULL,		's' },
> >  		{"fqdn",	required_argument,	NULL,		27 },
> >  		{"repair-path",	required_argument,	NULL,		28 },
> > +		{"migrate-exit", no_argument,		NULL,		29 },
> > +		{"migrate-no-linger", no_argument,	NULL,		30 },
> >  		{ 0 },
> >  	};
> >  	const char *optstring = "+dqfel:hs:F:I:p:P:m:a:n:M:g:i:o:D:S:H:461t:u:T:U:";
> > @@ -1495,6 +1503,9 @@ void conf(struct ctx *c, int argc, char **argv)
> >  		fwd_default = FWD_AUTO;
> >  	}
> >  
> > +	if (c->mode == MODE_VU)  
> 
> No reason for the conditional.  It's irrelevant but harmless in other cases.

Dropped the whole thing in v2.

> > +		c->migrate_linger = true;  
> 
> Nicer to avoid the explicit initialisation by inverting the sense of
> this variable - plus I think it's a little nicer to have 'true' be the
> abnormal case.

Right, I wrote this before coming up with the option name, changed in
v2.

> 
> >  	if (tap_l2_max_len(c) - ETH_HLEN < max_mtu)
> >  		max_mtu = tap_l2_max_len(c) - ETH_HLEN;
> >  	c->mtu = ROUND_DOWN(max_mtu, sizeof(uint32_t));
> > @@ -1686,6 +1697,18 @@ void conf(struct ctx *c, int argc, char **argv)
> >  					   optarg))
> >  				die("Invalid passt-repair path: %s", optarg);
> >  
> > +			break;
> > +		case 29:
> > +			if (c->mode != MODE_VU)
> > +				die("--migrate-exit is for vhost-user mode only");
> > +			c->migrate_exit = true;
> > +
> > +			break;
> > +		case 30:
> > +			if (c->mode != MODE_VU)
> > +				die("--migrate-no-linger is for vhost-user mode only");
> > +			c->migrate_linger = false;
> > +
> >  			break;
> >  		case 'd':
> >  			c->debug = 1;
> > diff --git a/epoll_type.h b/epoll_type.h
> > index 12ac59b..f2991b6 100644
> > --- a/epoll_type.h
> > +++ b/epoll_type.h
> > @@ -12,6 +12,7 @@
> >  enum epoll_type {
> >  	/* Special value to indicate an invalid type */
> >  	EPOLL_TYPE_NONE = 0,
> > +
> >  	/* Connected TCP sockets */
> >  	EPOLL_TYPE_TCP,
> >  	/* Connected TCP sockets (spliced) */
> > @@ -26,16 +27,19 @@ enum epoll_type {
> >  	EPOLL_TYPE_UDP,
> >  	/* ICMP/ICMPv6 ping sockets */
> >  	EPOLL_TYPE_PING,
> > -	/* inotify fd watching for end of netns (pasta) */
> > -	EPOLL_TYPE_NSQUIT_INOTIFY,
> > -	/* timer fd watching for end of netns, fallback for inotify (pasta) */
> > -	EPOLL_TYPE_NSQUIT_TIMER,
> >  	/* tuntap character device */
> >  	EPOLL_TYPE_TAP_PASTA,
> >  	/* socket connected to qemu  */
> >  	EPOLL_TYPE_TAP_PASST,
> >  	/* socket listening for qemu socket connections */
> >  	EPOLL_TYPE_TAP_LISTEN,
> > +	/* End of event types involving data transfers or connections */
> > +	EPOLL_TYPE_DATA_MAX = EPOLL_TYPE_TAP_LISTEN,
> > +
> > +	/* inotify fd watching for end of netns (pasta) */
> > +	EPOLL_TYPE_NSQUIT_INOTIFY,
> > +	/* timer fd watching for end of netns, fallback for inotify (pasta) */
> > +	EPOLL_TYPE_NSQUIT_TIMER,
> >  	/* vhost-user command socket */
> >  	EPOLL_TYPE_VHOST_CMD,
> >  	/* vhost-user kick event socket */
> > diff --git a/flow.c b/flow.c
> > index 00885f6..feefda3 100644
> > --- a/flow.c
> > +++ b/flow.c
> > @@ -1091,7 +1091,7 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
> >  	 * as EIO).
> >  	 */
> >  	foreach_established_tcp_flow(flow) {
> > -		rc = tcp_flow_migrate_source_ext(fd, &flow->tcp);
> > +		rc = tcp_flow_migrate_source_ext(c, fd, &flow->tcp);
> >  		if (rc) {
> >  			flow_err(flow, "Can't send extended data: %s",
> >  				 strerror_(-rc));
> > diff --git a/passt.1 b/passt.1
> > index 60066c2..b85aaa0 100644
> > --- a/passt.1
> > +++ b/passt.1
> > @@ -439,6 +439,20 @@ Default, for \-\-vhost-user mode only, is to append \fI.repair\fR to the path
> >  chosen for the hypervisor UNIX domain socket. No socket is created if not in
> >  \-\-vhost-user mode.
> >  
> > +.TP
> > +.BR \-\-migrate-exit
> > +Exit after a completed migration as source. By default, \fBpasst\fR keeps
> > +running and the migrated guest can continue using its connection, or a new guest
> > +can connect.
> > +
> > +.TP
> > +.BR \-\-migrate-no-linger
> > +Close TCP sockets on the source instance once migration completes.
> > +
> > +By default, sockets are kept open, and events on data sockets are ignored, so
> > +that any further message reaching sockets after the source migrated is silently
> > +ignored, to avoid connection resets in case data is received after migration.
> > +
> >  .TP
> >  .BR \-F ", " \-\-fd " " \fIFD
> >  Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened
> > diff --git a/passt.c b/passt.c
> > index 388d10f..f4c108f 100644
> > --- a/passt.c
> > +++ b/passt.c
> > @@ -308,6 +308,9 @@ loop:
> >  		      c.mode == MODE_PASTA ? "pasta" : "passt",
> >  		      EPOLL_TYPE_STR(ref.type), ref.fd, eventmask);
> >  
> > +		if (c.ignore_data_events && ref.type <= EPOLL_TYPE_DATA_MAX)
> > +			continue;
> > +  
> 
> This worries me slightly.  I think it will be fine if passt is killed
> not long after the migration.  However, if passt lingers, we're
> ignoring, but not clearing events.  For those events which are
> level-triggered this could make us chew CPU on a tight epoll_wait()
> loop.

Ah, yes, I had that concern as well, but I conveniently forgot about it
at some point.

> I think a preferable approach would be to EPOLL_CTL_DEL each socket at
> the point we would close() them with --migrate-no-linger.  I'd be fine
> with that as a follow up improvement, though.

I was pondering about adding this on top of the ignore_data_events
trick, but, actually, the whole thing as of this patch is somewhat
bogus because

- we're ignoring events on TCP sockets (intentional),

- we're ignoring events on the tap device (who cares, migration is only
  supported with vhost-user)

- but *not* ignoring events on the vhost-user kick descriptor
  (oversight).

On a second thought, it doesn't look safe to ignore events on the kick
descriptor, and in any case, with this change, we don't want to prevent
the guest to send out further packets. It's not expected anyway.

So I just replaced the whole thing with EPOLL_CTL_DEL (epoll_del()) as
we go through the sockets. It's simpler and arguably safer.

> >  		switch (ref.type) {
> >  		case EPOLL_TYPE_TAP_PASTA:
> >  			tap_handler_pasta(&c, eventmask, &now);
> > @@ -363,7 +366,8 @@ loop:
> >  		}
> >  	}
> >  
> > -	post_handler(&c, &now);
> > +	if (!c.ignore_data_events)
> > +		post_handler(&c, &now);
> >  
> >  	migrate_handler(&c);
> >  
> > diff --git a/passt.h b/passt.h
> > index 8693794..636b3c3 100644
> > --- a/passt.h
> > +++ b/passt.h
> > @@ -241,6 +241,9 @@ struct ip6_ctx {
> >   * @device_state_fd:	Device state migration channel
> >   * @device_state_result: Device state migration result
> >   * @migrate_target:	Are we the target, on the next migration request?
> > + * @migrate_linger:	Keep sockets open after migration, ignore data events
> > + * @migrate_exit:	Exit (on source) once migration is complete
> > + * @ignore_data_events:	Ignore data events (for migration, source instance)
> >   */
> >  struct ctx {
> >  	enum passt_modes mode;
> > @@ -318,6 +321,9 @@ struct ctx {
> >  	int device_state_fd;
> >  	int device_state_result;
> >  	bool migrate_target;
> > +	bool migrate_linger;
> > +	bool migrate_exit;
> > +	bool ignore_data_events;
> >  };
> >  
> >  void proto_update_l2_buf(const unsigned char *eth_d,
> > diff --git a/tcp.c b/tcp.c
> > index 2b88466..1f7a6ab 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -3286,12 +3286,14 @@ int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn)
> >  
> >  /**
> >   * tcp_flow_migrate_source_ext() - Dump queues, close sockets, send final data
> > + * @c:		Execution context
> >   * @fd:		Descriptor for state migration
> >   * @conn:	Pointer to the TCP connection structure
> >   *
> >   * Return: 0 on success, negative (not -EIO) on failure, -EIO on sending failure
> >   */
> > -int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn)
> > +int tcp_flow_migrate_source_ext(struct ctx *c,
> > +				int fd, const struct tcp_tap_conn *conn)
> >  {
> >  	uint32_t peek_offset = conn->seq_to_tap - conn->seq_ack_from_tap;
> >  	struct tcp_tap_transfer_ext *t = &migrate_ext[FLOW_IDX(conn)];
> > @@ -3336,7 +3338,8 @@ int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn)
> >  	if ((rc = tcp_flow_dump_seq(conn, &t->seq_rcv)))
> >  		goto fail;
> >  
> > -	close(s);
> > +	if (!c->migrate_linger)
> > +		close(s);
> >  
> >  	/* Adjustments unrelated to FIN segments: sequence numbers we dumped are
> >  	 * based on the end of the queues.
> > diff --git a/tcp_conn.h b/tcp_conn.h
> > index 35d813d..d49ae88 100644
> > --- a/tcp_conn.h
> > +++ b/tcp_conn.h
> > @@ -236,7 +236,8 @@ int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn);
> >  int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn);
> >  
> >  int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn);
> > -int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn);
> > +int tcp_flow_migrate_source_ext(struct ctx *c, int fd,
> > +				const struct tcp_tap_conn *conn);
> >  
> >  int tcp_flow_migrate_target(struct ctx *c, int fd);
> >  int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd);
> > diff --git a/test/lib/setup b/test/lib/setup
> > index 575bc21..5994598 100755
> > --- a/test/lib/setup
> > +++ b/test/lib/setup
> > @@ -350,7 +350,7 @@ setup_migrate() {
> >  
> >  	sleep 1
> >  
> > -	__opts="--vhost-user"
> > +	__opts="--vhost-user --migrate-exit --migrate-no-linger"
> >  	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_1.pcap"
> >  	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
> >  	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
> > @@ -360,7 +360,7 @@ setup_migrate() {
> >  
> >  	context_run_bg passt_repair_1 "./passt-repair ${STATESETUP}/passt_1.socket.repair"
> >  
> > -	__opts="--vhost-user"
> > +	__opts="--vhost-user --migrate-exit --migrate-no-linger"
> >  	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_2.pcap"
> >  	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
> >  	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
> > diff --git a/vhost_user.c b/vhost_user.c
> > index e7fb049..7fd27dc 100644
> > --- a/vhost_user.c
> > +++ b/vhost_user.c
> > @@ -1207,7 +1207,13 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
> >  	if (vmsg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE &&
> >  	    vdev->context->device_state_result == 0 &&
> >  	    !vdev->context->migrate_target) {
> > -		info("Migration complete, exiting");
> > -		_exit(EXIT_SUCCESS);
> > +		if (vdev->context->migrate_exit) {
> > +			info("Migration complete, exiting");
> > +			_exit(EXIT_SUCCESS);
> > +		}
> > +
> > +		info("Migration complete");
> > +		if (vdev->context->migrate_linger)
> > +			vdev->context->ignore_data_events = true;
> >  	}
> >  }  
>

next prev parent reply	other threads:[~2025-07-22 21:12 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-21 22:12 Stefano Brivio
2025-07-22  0:33 ` David Gibson
2025-07-22 21:12   ` Stefano Brivio [this message]
2025-07-23  0:27     ` David Gibson
2025-07-23  9:17       ` Stefano Brivio
2025-07-24  1:48         ` David Gibson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250722231249.22e09340@elisabeth \
    --to=sbrivio@redhat.com \
    --cc=david@gibson.dropbear.id.au \
    --cc=ndothan@redhat.com \
    --cc=passt-dev@passt.top \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).