public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
* [PATCH] treewide: By default, don't quit source after migration, keep sockets open
@ 2025-07-21 22:12 Stefano Brivio
  2025-07-22  0:33 ` David Gibson
  0 siblings, 1 reply; 6+ messages in thread
From: Stefano Brivio @ 2025-07-21 22:12 UTC (permalink / raw)
  To: passt-dev; +Cc: David Gibson, Nir Dothan

We are hitting an issue in the KubeVirt integration where some data is
still sent to the source instance even after migration is complete. As
we exit, the kernel closes our sockets and resets connections. The
resulting RST segments are sent to peers, effectively terminating
connections that were meanwhile migrated.

At the moment, this is not done intentionally, but in the future
KubeVirt might enable OVN-Kubernetes features where source and
destination nodes are explicitly getting mirrored traffic for a while,
in order to decrease migration downtime.

By default, don't quit after migration is completed on the source: the
previous behaviour can be enabled with the new --migrate-exit option.

Also, by default, keep migrated TCP sockets open (in repair mode) as
long as we're running, and ignore events on any epoll descriptor
representing data channels. The previous (and, strictly speaking,
correct) behaviour can be enabled with the new --migrate-no-linger
option.

Reported-by: Nir Dothan <ndothan@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 conf.c         | 23 +++++++++++++++++++++++
 epoll_type.h   | 12 ++++++++----
 flow.c         |  2 +-
 passt.1        | 14 ++++++++++++++
 passt.c        |  6 +++++-
 passt.h        |  6 ++++++
 tcp.c          |  7 +++++--
 tcp_conn.h     |  3 ++-
 test/lib/setup |  4 ++--
 vhost_user.c   | 10 ++++++++--
 10 files changed, 74 insertions(+), 13 deletions(-)

diff --git a/conf.c b/conf.c
index 6c747aa..5e69014 100644
--- a/conf.c
+++ b/conf.c
@@ -864,6 +864,12 @@ static void usage(const char *name, FILE *f, int status)
 		FPRINTF(f,
 			"  --repair-path PATH	path for passt-repair(1)\n"
 			"    default: append '.repair' to UNIX domain path\n");
+		FPRINTF(f,
+			"  --migrate-exit	source quits after migration\n"
+			"    default: source keeps running after migration\n");
+		FPRINTF(f,
+			"  --migrate-no-linger	close sockets on migration\n"
+			"    default: keep sockets open, ignore data events\n");
 	}
 
 	FPRINTF(f,
@@ -1470,6 +1476,8 @@ void conf(struct ctx *c, int argc, char **argv)
 		{"socket-path",	required_argument,	NULL,		's' },
 		{"fqdn",	required_argument,	NULL,		27 },
 		{"repair-path",	required_argument,	NULL,		28 },
+		{"migrate-exit", no_argument,		NULL,		29 },
+		{"migrate-no-linger", no_argument,	NULL,		30 },
 		{ 0 },
 	};
 	const char *optstring = "+dqfel:hs:F:I:p:P:m:a:n:M:g:i:o:D:S:H:461t:u:T:U:";
@@ -1495,6 +1503,9 @@ void conf(struct ctx *c, int argc, char **argv)
 		fwd_default = FWD_AUTO;
 	}
 
+	if (c->mode == MODE_VU)
+		c->migrate_linger = true;
+
 	if (tap_l2_max_len(c) - ETH_HLEN < max_mtu)
 		max_mtu = tap_l2_max_len(c) - ETH_HLEN;
 	c->mtu = ROUND_DOWN(max_mtu, sizeof(uint32_t));
@@ -1686,6 +1697,18 @@ void conf(struct ctx *c, int argc, char **argv)
 					   optarg))
 				die("Invalid passt-repair path: %s", optarg);
 
+			break;
+		case 29:
+			if (c->mode != MODE_VU)
+				die("--migrate-exit is for vhost-user mode only");
+			c->migrate_exit = true;
+
+			break;
+		case 30:
+			if (c->mode != MODE_VU)
+				die("--migrate-no-linger is for vhost-user mode only");
+			c->migrate_linger = false;
+
 			break;
 		case 'd':
 			c->debug = 1;
diff --git a/epoll_type.h b/epoll_type.h
index 12ac59b..f2991b6 100644
--- a/epoll_type.h
+++ b/epoll_type.h
@@ -12,6 +12,7 @@
 enum epoll_type {
 	/* Special value to indicate an invalid type */
 	EPOLL_TYPE_NONE = 0,
+
 	/* Connected TCP sockets */
 	EPOLL_TYPE_TCP,
 	/* Connected TCP sockets (spliced) */
@@ -26,16 +27,19 @@ enum epoll_type {
 	EPOLL_TYPE_UDP,
 	/* ICMP/ICMPv6 ping sockets */
 	EPOLL_TYPE_PING,
-	/* inotify fd watching for end of netns (pasta) */
-	EPOLL_TYPE_NSQUIT_INOTIFY,
-	/* timer fd watching for end of netns, fallback for inotify (pasta) */
-	EPOLL_TYPE_NSQUIT_TIMER,
 	/* tuntap character device */
 	EPOLL_TYPE_TAP_PASTA,
 	/* socket connected to qemu  */
 	EPOLL_TYPE_TAP_PASST,
 	/* socket listening for qemu socket connections */
 	EPOLL_TYPE_TAP_LISTEN,
+	/* End of event types involving data transfers or connections */
+	EPOLL_TYPE_DATA_MAX = EPOLL_TYPE_TAP_LISTEN,
+
+	/* inotify fd watching for end of netns (pasta) */
+	EPOLL_TYPE_NSQUIT_INOTIFY,
+	/* timer fd watching for end of netns, fallback for inotify (pasta) */
+	EPOLL_TYPE_NSQUIT_TIMER,
 	/* vhost-user command socket */
 	EPOLL_TYPE_VHOST_CMD,
 	/* vhost-user kick event socket */
diff --git a/flow.c b/flow.c
index 00885f6..feefda3 100644
--- a/flow.c
+++ b/flow.c
@@ -1091,7 +1091,7 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
 	 * as EIO).
 	 */
 	foreach_established_tcp_flow(flow) {
-		rc = tcp_flow_migrate_source_ext(fd, &flow->tcp);
+		rc = tcp_flow_migrate_source_ext(c, fd, &flow->tcp);
 		if (rc) {
 			flow_err(flow, "Can't send extended data: %s",
 				 strerror_(-rc));
diff --git a/passt.1 b/passt.1
index 60066c2..b85aaa0 100644
--- a/passt.1
+++ b/passt.1
@@ -439,6 +439,20 @@ Default, for \-\-vhost-user mode only, is to append \fI.repair\fR to the path
 chosen for the hypervisor UNIX domain socket. No socket is created if not in
 \-\-vhost-user mode.
 
+.TP
+.BR \-\-migrate-exit
+Exit after a completed migration as source. By default, \fBpasst\fR keeps
+running and the migrated guest can continue using its connection, or a new guest
+can connect.
+
+.TP
+.BR \-\-migrate-no-linger
+Close TCP sockets on the source instance once migration completes.
+
+By default, sockets are kept open, and events on data sockets are ignored, so
+that any further message reaching sockets after the source migrated is silently
+ignored, to avoid connection resets in case data is received after migration.
+
 .TP
 .BR \-F ", " \-\-fd " " \fIFD
 Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened
diff --git a/passt.c b/passt.c
index 388d10f..f4c108f 100644
--- a/passt.c
+++ b/passt.c
@@ -308,6 +308,9 @@ loop:
 		      c.mode == MODE_PASTA ? "pasta" : "passt",
 		      EPOLL_TYPE_STR(ref.type), ref.fd, eventmask);
 
+		if (c.ignore_data_events && ref.type <= EPOLL_TYPE_DATA_MAX)
+			continue;
+
 		switch (ref.type) {
 		case EPOLL_TYPE_TAP_PASTA:
 			tap_handler_pasta(&c, eventmask, &now);
@@ -363,7 +366,8 @@ loop:
 		}
 	}
 
-	post_handler(&c, &now);
+	if (!c.ignore_data_events)
+		post_handler(&c, &now);
 
 	migrate_handler(&c);
 
diff --git a/passt.h b/passt.h
index 8693794..636b3c3 100644
--- a/passt.h
+++ b/passt.h
@@ -241,6 +241,9 @@ struct ip6_ctx {
  * @device_state_fd:	Device state migration channel
  * @device_state_result: Device state migration result
  * @migrate_target:	Are we the target, on the next migration request?
+ * @migrate_linger:	Keep sockets open after migration, ignore data events
+ * @migrate_exit:	Exit (on source) once migration is complete
+ * @ignore_data_events:	Ignore data events (for migration, source instance)
  */
 struct ctx {
 	enum passt_modes mode;
@@ -318,6 +321,9 @@ struct ctx {
 	int device_state_fd;
 	int device_state_result;
 	bool migrate_target;
+	bool migrate_linger;
+	bool migrate_exit;
+	bool ignore_data_events;
 };
 
 void proto_update_l2_buf(const unsigned char *eth_d,
diff --git a/tcp.c b/tcp.c
index 2b88466..1f7a6ab 100644
--- a/tcp.c
+++ b/tcp.c
@@ -3286,12 +3286,14 @@ int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn)
 
 /**
  * tcp_flow_migrate_source_ext() - Dump queues, close sockets, send final data
+ * @c:		Execution context
  * @fd:		Descriptor for state migration
  * @conn:	Pointer to the TCP connection structure
  *
  * Return: 0 on success, negative (not -EIO) on failure, -EIO on sending failure
  */
-int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn)
+int tcp_flow_migrate_source_ext(struct ctx *c,
+				int fd, const struct tcp_tap_conn *conn)
 {
 	uint32_t peek_offset = conn->seq_to_tap - conn->seq_ack_from_tap;
 	struct tcp_tap_transfer_ext *t = &migrate_ext[FLOW_IDX(conn)];
@@ -3336,7 +3338,8 @@ int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn)
 	if ((rc = tcp_flow_dump_seq(conn, &t->seq_rcv)))
 		goto fail;
 
-	close(s);
+	if (!c->migrate_linger)
+		close(s);
 
 	/* Adjustments unrelated to FIN segments: sequence numbers we dumped are
 	 * based on the end of the queues.
diff --git a/tcp_conn.h b/tcp_conn.h
index 35d813d..d49ae88 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -236,7 +236,8 @@ int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn);
 int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn);
 
 int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn);
-int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn);
+int tcp_flow_migrate_source_ext(struct ctx *c, int fd,
+				const struct tcp_tap_conn *conn);
 
 int tcp_flow_migrate_target(struct ctx *c, int fd);
 int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd);
diff --git a/test/lib/setup b/test/lib/setup
index 575bc21..5994598 100755
--- a/test/lib/setup
+++ b/test/lib/setup
@@ -350,7 +350,7 @@ setup_migrate() {
 
 	sleep 1
 
-	__opts="--vhost-user"
+	__opts="--vhost-user --migrate-exit --migrate-no-linger"
 	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_1.pcap"
 	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
 	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
@@ -360,7 +360,7 @@ setup_migrate() {
 
 	context_run_bg passt_repair_1 "./passt-repair ${STATESETUP}/passt_1.socket.repair"
 
-	__opts="--vhost-user"
+	__opts="--vhost-user --migrate-exit --migrate-no-linger"
 	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_2.pcap"
 	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
 	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
diff --git a/vhost_user.c b/vhost_user.c
index e7fb049..7fd27dc 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -1207,7 +1207,13 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
 	if (vmsg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE &&
 	    vdev->context->device_state_result == 0 &&
 	    !vdev->context->migrate_target) {
-		info("Migration complete, exiting");
-		_exit(EXIT_SUCCESS);
+		if (vdev->context->migrate_exit) {
+			info("Migration complete, exiting");
+			_exit(EXIT_SUCCESS);
+		}
+
+		info("Migration complete");
+		if (vdev->context->migrate_linger)
+			vdev->context->ignore_data_events = true;
 	}
 }
-- 
@@ -1207,7 +1207,13 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
 	if (vmsg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE &&
 	    vdev->context->device_state_result == 0 &&
 	    !vdev->context->migrate_target) {
-		info("Migration complete, exiting");
-		_exit(EXIT_SUCCESS);
+		if (vdev->context->migrate_exit) {
+			info("Migration complete, exiting");
+			_exit(EXIT_SUCCESS);
+		}
+
+		info("Migration complete");
+		if (vdev->context->migrate_linger)
+			vdev->context->ignore_data_events = true;
 	}
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] treewide: By default, don't quit source after migration, keep sockets open
  2025-07-21 22:12 [PATCH] treewide: By default, don't quit source after migration, keep sockets open Stefano Brivio
@ 2025-07-22  0:33 ` David Gibson
  2025-07-22 21:12   ` Stefano Brivio
  0 siblings, 1 reply; 6+ messages in thread
From: David Gibson @ 2025-07-22  0:33 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Nir Dothan

[-- Attachment #1: Type: text/plain, Size: 13807 bytes --]

On Tue, Jul 22, 2025 at 12:12:58AM +0200, Stefano Brivio wrote:
> We are hitting an issue in the KubeVirt integration where some data is
> still sent to the source instance even after migration is complete. As
> we exit, the kernel closes our sockets and resets connections. The
> resulting RST segments are sent to peers, effectively terminating
> connections that were meanwhile migrated.
> 
> At the moment, this is not done intentionally, but in the future
> KubeVirt might enable OVN-Kubernetes features where source and
> destination nodes are explicitly getting mirrored traffic for a while,
> in order to decrease migration downtime.
> 
> By default, don't quit after migration is completed on the source: the
> previous behaviour can be enabled with the new --migrate-exit
> option.

Might be worth explicitly mentioning that the reason this helps is
that having the sockets still around in repair mode prevents the
source node from responding to any traffic for them.

> 
> Also, by default, keep migrated TCP sockets open (in repair mode) as
> long as we're running, and ignore events on any epoll descriptor
> representing data channels. The previous (and, strictly speaking,
> correct)

I'm not necessarily convinced the previous behaviour is more correct.
As we've discussed the new behaviour is more in keeping with the
normal qemu migration model, and with further work could let us allow
a resume on the source node even if there's a failure late in
migration (as long as it's before the target resumes, of course).

As with --migrate-exit the new behaviour is also more robust against
the exact timing of network switchover.  I think that's generally
desirable, even if it's arguably not strictly necessary.

> behaviour can be enabled with the new --migrate-no-linger
> option.
> 
> Reported-by: Nir Dothan <ndothan@redhat.com>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

... in the sense that I think it does accomplish what it sets out to
do.  There are several things that I think would improve it, see
details below.

> ---
>  conf.c         | 23 +++++++++++++++++++++++
>  epoll_type.h   | 12 ++++++++----
>  flow.c         |  2 +-
>  passt.1        | 14 ++++++++++++++
>  passt.c        |  6 +++++-
>  passt.h        |  6 ++++++
>  tcp.c          |  7 +++++--
>  tcp_conn.h     |  3 ++-
>  test/lib/setup |  4 ++--
>  vhost_user.c   | 10 ++++++++--
>  10 files changed, 74 insertions(+), 13 deletions(-)
> 
> diff --git a/conf.c b/conf.c
> index 6c747aa..5e69014 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -864,6 +864,12 @@ static void usage(const char *name, FILE *f, int status)
>  		FPRINTF(f,
>  			"  --repair-path PATH	path for passt-repair(1)\n"
>  			"    default: append '.repair' to UNIX domain path\n");
> +		FPRINTF(f,
> +			"  --migrate-exit	source quits after migration\n"
> +			"    default: source keeps running after migration\n");
> +		FPRINTF(f,
> +			"  --migrate-no-linger	close sockets on migration\n"
> +			"    default: keep sockets open, ignore data events\n");

I'm a bit uncomfortable adding this as regular, supported options.
I'm hoping we can eliminate them in the not too distant future.  In
the medium term we can do that by improving our test code to simulate
different hosts with namespaces.  Longer term we can allow local to
local migration without these options by fixing the kernel: I don't
believe bind() should fail if the address conflict is with a socket in
repair mode.

Further, as user accessible options these are kinda weird - what they
do is defined in terms of some pretty obscure internals, not in terms
of what the upshot is.  So, I think it would be better to introduce
these as deprecated / debug-only / undocumented options.

>  	}
>  
>  	FPRINTF(f,
> @@ -1470,6 +1476,8 @@ void conf(struct ctx *c, int argc, char **argv)
>  		{"socket-path",	required_argument,	NULL,		's' },
>  		{"fqdn",	required_argument,	NULL,		27 },
>  		{"repair-path",	required_argument,	NULL,		28 },
> +		{"migrate-exit", no_argument,		NULL,		29 },
> +		{"migrate-no-linger", no_argument,	NULL,		30 },
>  		{ 0 },
>  	};
>  	const char *optstring = "+dqfel:hs:F:I:p:P:m:a:n:M:g:i:o:D:S:H:461t:u:T:U:";
> @@ -1495,6 +1503,9 @@ void conf(struct ctx *c, int argc, char **argv)
>  		fwd_default = FWD_AUTO;
>  	}
>  
> +	if (c->mode == MODE_VU)

No reason for the conditional.  It's irrelevant but harmless in other cases.

> +		c->migrate_linger = true;

Nicer to avoid the explicit initialisation by inverting the sense of
this variable - plus I think it's a little nicer to have 'true' be the
abnormal case.

>  	if (tap_l2_max_len(c) - ETH_HLEN < max_mtu)
>  		max_mtu = tap_l2_max_len(c) - ETH_HLEN;
>  	c->mtu = ROUND_DOWN(max_mtu, sizeof(uint32_t));
> @@ -1686,6 +1697,18 @@ void conf(struct ctx *c, int argc, char **argv)
>  					   optarg))
>  				die("Invalid passt-repair path: %s", optarg);
>  
> +			break;
> +		case 29:
> +			if (c->mode != MODE_VU)
> +				die("--migrate-exit is for vhost-user mode only");
> +			c->migrate_exit = true;
> +
> +			break;
> +		case 30:
> +			if (c->mode != MODE_VU)
> +				die("--migrate-no-linger is for vhost-user mode only");
> +			c->migrate_linger = false;
> +
>  			break;
>  		case 'd':
>  			c->debug = 1;
> diff --git a/epoll_type.h b/epoll_type.h
> index 12ac59b..f2991b6 100644
> --- a/epoll_type.h
> +++ b/epoll_type.h
> @@ -12,6 +12,7 @@
>  enum epoll_type {
>  	/* Special value to indicate an invalid type */
>  	EPOLL_TYPE_NONE = 0,
> +
>  	/* Connected TCP sockets */
>  	EPOLL_TYPE_TCP,
>  	/* Connected TCP sockets (spliced) */
> @@ -26,16 +27,19 @@ enum epoll_type {
>  	EPOLL_TYPE_UDP,
>  	/* ICMP/ICMPv6 ping sockets */
>  	EPOLL_TYPE_PING,
> -	/* inotify fd watching for end of netns (pasta) */
> -	EPOLL_TYPE_NSQUIT_INOTIFY,
> -	/* timer fd watching for end of netns, fallback for inotify (pasta) */
> -	EPOLL_TYPE_NSQUIT_TIMER,
>  	/* tuntap character device */
>  	EPOLL_TYPE_TAP_PASTA,
>  	/* socket connected to qemu  */
>  	EPOLL_TYPE_TAP_PASST,
>  	/* socket listening for qemu socket connections */
>  	EPOLL_TYPE_TAP_LISTEN,
> +	/* End of event types involving data transfers or connections */
> +	EPOLL_TYPE_DATA_MAX = EPOLL_TYPE_TAP_LISTEN,
> +
> +	/* inotify fd watching for end of netns (pasta) */
> +	EPOLL_TYPE_NSQUIT_INOTIFY,
> +	/* timer fd watching for end of netns, fallback for inotify (pasta) */
> +	EPOLL_TYPE_NSQUIT_TIMER,
>  	/* vhost-user command socket */
>  	EPOLL_TYPE_VHOST_CMD,
>  	/* vhost-user kick event socket */
> diff --git a/flow.c b/flow.c
> index 00885f6..feefda3 100644
> --- a/flow.c
> +++ b/flow.c
> @@ -1091,7 +1091,7 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
>  	 * as EIO).
>  	 */
>  	foreach_established_tcp_flow(flow) {
> -		rc = tcp_flow_migrate_source_ext(fd, &flow->tcp);
> +		rc = tcp_flow_migrate_source_ext(c, fd, &flow->tcp);
>  		if (rc) {
>  			flow_err(flow, "Can't send extended data: %s",
>  				 strerror_(-rc));
> diff --git a/passt.1 b/passt.1
> index 60066c2..b85aaa0 100644
> --- a/passt.1
> +++ b/passt.1
> @@ -439,6 +439,20 @@ Default, for \-\-vhost-user mode only, is to append \fI.repair\fR to the path
>  chosen for the hypervisor UNIX domain socket. No socket is created if not in
>  \-\-vhost-user mode.
>  
> +.TP
> +.BR \-\-migrate-exit
> +Exit after a completed migration as source. By default, \fBpasst\fR keeps
> +running and the migrated guest can continue using its connection, or a new guest
> +can connect.
> +
> +.TP
> +.BR \-\-migrate-no-linger
> +Close TCP sockets on the source instance once migration completes.
> +
> +By default, sockets are kept open, and events on data sockets are ignored, so
> +that any further message reaching sockets after the source migrated is silently
> +ignored, to avoid connection resets in case data is received after migration.
> +
>  .TP
>  .BR \-F ", " \-\-fd " " \fIFD
>  Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened
> diff --git a/passt.c b/passt.c
> index 388d10f..f4c108f 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -308,6 +308,9 @@ loop:
>  		      c.mode == MODE_PASTA ? "pasta" : "passt",
>  		      EPOLL_TYPE_STR(ref.type), ref.fd, eventmask);
>  
> +		if (c.ignore_data_events && ref.type <= EPOLL_TYPE_DATA_MAX)
> +			continue;
> +

This worries me slightly.  I think it will be fine if passt is killed
not long after the migration.  However, if passt lingers, we're
ignoring, but not clearing events.  For those events which are
level-triggered this could make us chew CPU on a tight epoll_wait()
loop.

I think a preferable approach would be to EPOLL_CTL_DEL each socket at
the point we would close() them with --migrate-no-linger.  I'd be fine
with that as a follow up improvement, though.

>  		switch (ref.type) {
>  		case EPOLL_TYPE_TAP_PASTA:
>  			tap_handler_pasta(&c, eventmask, &now);
> @@ -363,7 +366,8 @@ loop:
>  		}
>  	}
>  
> -	post_handler(&c, &now);
> +	if (!c.ignore_data_events)
> +		post_handler(&c, &now);
>  
>  	migrate_handler(&c);
>  
> diff --git a/passt.h b/passt.h
> index 8693794..636b3c3 100644
> --- a/passt.h
> +++ b/passt.h
> @@ -241,6 +241,9 @@ struct ip6_ctx {
>   * @device_state_fd:	Device state migration channel
>   * @device_state_result: Device state migration result
>   * @migrate_target:	Are we the target, on the next migration request?
> + * @migrate_linger:	Keep sockets open after migration, ignore data events
> + * @migrate_exit:	Exit (on source) once migration is complete
> + * @ignore_data_events:	Ignore data events (for migration, source instance)
>   */
>  struct ctx {
>  	enum passt_modes mode;
> @@ -318,6 +321,9 @@ struct ctx {
>  	int device_state_fd;
>  	int device_state_result;
>  	bool migrate_target;
> +	bool migrate_linger;
> +	bool migrate_exit;
> +	bool ignore_data_events;
>  };
>  
>  void proto_update_l2_buf(const unsigned char *eth_d,
> diff --git a/tcp.c b/tcp.c
> index 2b88466..1f7a6ab 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -3286,12 +3286,14 @@ int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn)
>  
>  /**
>   * tcp_flow_migrate_source_ext() - Dump queues, close sockets, send final data
> + * @c:		Execution context
>   * @fd:		Descriptor for state migration
>   * @conn:	Pointer to the TCP connection structure
>   *
>   * Return: 0 on success, negative (not -EIO) on failure, -EIO on sending failure
>   */
> -int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn)
> +int tcp_flow_migrate_source_ext(struct ctx *c,
> +				int fd, const struct tcp_tap_conn *conn)
>  {
>  	uint32_t peek_offset = conn->seq_to_tap - conn->seq_ack_from_tap;
>  	struct tcp_tap_transfer_ext *t = &migrate_ext[FLOW_IDX(conn)];
> @@ -3336,7 +3338,8 @@ int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn)
>  	if ((rc = tcp_flow_dump_seq(conn, &t->seq_rcv)))
>  		goto fail;
>  
> -	close(s);
> +	if (!c->migrate_linger)
> +		close(s);
>  
>  	/* Adjustments unrelated to FIN segments: sequence numbers we dumped are
>  	 * based on the end of the queues.
> diff --git a/tcp_conn.h b/tcp_conn.h
> index 35d813d..d49ae88 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -236,7 +236,8 @@ int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn);
>  int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn);
>  
>  int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn);
> -int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn);
> +int tcp_flow_migrate_source_ext(struct ctx *c, int fd,
> +				const struct tcp_tap_conn *conn);
>  
>  int tcp_flow_migrate_target(struct ctx *c, int fd);
>  int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd);
> diff --git a/test/lib/setup b/test/lib/setup
> index 575bc21..5994598 100755
> --- a/test/lib/setup
> +++ b/test/lib/setup
> @@ -350,7 +350,7 @@ setup_migrate() {
>  
>  	sleep 1
>  
> -	__opts="--vhost-user"
> +	__opts="--vhost-user --migrate-exit --migrate-no-linger"
>  	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_1.pcap"
>  	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
>  	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
> @@ -360,7 +360,7 @@ setup_migrate() {
>  
>  	context_run_bg passt_repair_1 "./passt-repair ${STATESETUP}/passt_1.socket.repair"
>  
> -	__opts="--vhost-user"
> +	__opts="--vhost-user --migrate-exit --migrate-no-linger"
>  	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_2.pcap"
>  	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
>  	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
> diff --git a/vhost_user.c b/vhost_user.c
> index e7fb049..7fd27dc 100644
> --- a/vhost_user.c
> +++ b/vhost_user.c
> @@ -1207,7 +1207,13 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
>  	if (vmsg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE &&
>  	    vdev->context->device_state_result == 0 &&
>  	    !vdev->context->migrate_target) {
> -		info("Migration complete, exiting");
> -		_exit(EXIT_SUCCESS);
> +		if (vdev->context->migrate_exit) {
> +			info("Migration complete, exiting");
> +			_exit(EXIT_SUCCESS);
> +		}
> +
> +		info("Migration complete");
> +		if (vdev->context->migrate_linger)
> +			vdev->context->ignore_data_events = true;
>  	}
>  }

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] treewide: By default, don't quit source after migration, keep sockets open
  2025-07-22  0:33 ` David Gibson
@ 2025-07-22 21:12   ` Stefano Brivio
  2025-07-23  0:27     ` David Gibson
  0 siblings, 1 reply; 6+ messages in thread
From: Stefano Brivio @ 2025-07-22 21:12 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev, Nir Dothan

On Tue, 22 Jul 2025 10:33:00 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, Jul 22, 2025 at 12:12:58AM +0200, Stefano Brivio wrote:
> > We are hitting an issue in the KubeVirt integration where some data is
> > still sent to the source instance even after migration is complete. As
> > we exit, the kernel closes our sockets and resets connections. The
> > resulting RST segments are sent to peers, effectively terminating
> > connections that were meanwhile migrated.
> > 
> > At the moment, this is not done intentionally, but in the future
> > KubeVirt might enable OVN-Kubernetes features where source and
> > destination nodes are explicitly getting mirrored traffic for a while,
> > in order to decrease migration downtime.
> > 
> > By default, don't quit after migration is completed on the source: the
> > previous behaviour can be enabled with the new --migrate-exit
> > option.  
> 
> Might be worth explicitly mentioning that the reason this helps is
> that having the sockets still around in repair mode prevents the
> source node from responding to any traffic for them.

Added as separate paragraph, below.

> > Also, by default, keep migrated TCP sockets open (in repair mode) as
> > long as we're running, and ignore events on any epoll descriptor
> > representing data channels. The previous (and, strictly speaking,
> > correct)  
> 
> I'm not necessarily convinced the previous behaviour is more correct.
> As we've discussed the new behaviour is more in keeping with the
> normal qemu migration model, and with further work could let us allow
> a resume on the source node even if there's a failure late in
> migration (as long as it's before the target resumes, of course).
> 
> As with --migrate-exit the new behaviour is also more robust against
> the exact timing of network switchover.  I think that's generally
> desirable, even if it's arguably not strictly necessary.

Right, parenthesis dropped. Well, the previous behaviour had the
advantage of uncovering the issue in KubeVirt (at least we know that
overlap might happen), but that doesn't make it more correct, per se.

> > behaviour can be enabled with the new --migrate-no-linger
> > option.
> > 
> > Reported-by: Nir Dothan <ndothan@redhat.com>
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>  
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> 
> ... in the sense that I think it does accomplish what it sets out to
> do.  There are several things that I think would improve it, see
> details below.
> 
> > ---
> >  conf.c         | 23 +++++++++++++++++++++++
> >  epoll_type.h   | 12 ++++++++----
> >  flow.c         |  2 +-
> >  passt.1        | 14 ++++++++++++++
> >  passt.c        |  6 +++++-
> >  passt.h        |  6 ++++++
> >  tcp.c          |  7 +++++--
> >  tcp_conn.h     |  3 ++-
> >  test/lib/setup |  4 ++--
> >  vhost_user.c   | 10 ++++++++--
> >  10 files changed, 74 insertions(+), 13 deletions(-)
> > 
> > diff --git a/conf.c b/conf.c
> > index 6c747aa..5e69014 100644
> > --- a/conf.c
> > +++ b/conf.c
> > @@ -864,6 +864,12 @@ static void usage(const char *name, FILE *f, int status)
> >  		FPRINTF(f,
> >  			"  --repair-path PATH	path for passt-repair(1)\n"
> >  			"    default: append '.repair' to UNIX domain path\n");
> > +		FPRINTF(f,
> > +			"  --migrate-exit	source quits after migration\n"
> > +			"    default: source keeps running after migration\n");
> > +		FPRINTF(f,
> > +			"  --migrate-no-linger	close sockets on migration\n"
> > +			"    default: keep sockets open, ignore data events\n");  
> 
> I'm a bit uncomfortable adding this as regular, supported options.
> I'm hoping we can eliminate them in the not too distant future.  In
> the medium term we can do that by improving our test code to simulate
> different hosts with namespaces.  Longer term we can allow local to
> local migration without these options by fixing the kernel: I don't
> believe bind() should fail if the address conflict is with a socket in
> repair mode.

I see the point about --migrate-no-linger, but --migrate-exit looks
like something that depends on the migration workflow, not necessarily
something to cover up issues in our tests or in the kernel. In any
case...

> Further, as user accessible options these are kinda weird - what they
> do is defined in terms of some pretty obscure internals, not in terms
> of what the upshot is.  So, I think it would be better to introduce
> these as deprecated / debug-only / undocumented options.

...I marked both as deprecated in v2, to signal to any hypothetical
(I'm fairly sure not real) user that they shouldn't assume it's safe to
rely on them.

That's the only other category of options we have so far, we don't have
undocumented ones, and I think it's a feature. Especially as we use
them in the tests.

By the way, I would argue that testing is an important type of usage,
and changes in the test setup itself or in the kernel are non-trivial
effort, and it's rather uncertain whether we'll ever manage to commit
to that, so I don't quite feel that marking them as deprecated is
entirely warranted, but I see the point about possibly dropping that
migration model at some point in the future.

Regardless, the fact we *need* them at the moment should prove that
they are at least convenient, even if just for testing.

> >  	}
> >  
> >  	FPRINTF(f,
> > @@ -1470,6 +1476,8 @@ void conf(struct ctx *c, int argc, char **argv)
> >  		{"socket-path",	required_argument,	NULL,		's' },
> >  		{"fqdn",	required_argument,	NULL,		27 },
> >  		{"repair-path",	required_argument,	NULL,		28 },
> > +		{"migrate-exit", no_argument,		NULL,		29 },
> > +		{"migrate-no-linger", no_argument,	NULL,		30 },
> >  		{ 0 },
> >  	};
> >  	const char *optstring = "+dqfel:hs:F:I:p:P:m:a:n:M:g:i:o:D:S:H:461t:u:T:U:";
> > @@ -1495,6 +1503,9 @@ void conf(struct ctx *c, int argc, char **argv)
> >  		fwd_default = FWD_AUTO;
> >  	}
> >  
> > +	if (c->mode == MODE_VU)  
> 
> No reason for the conditional.  It's irrelevant but harmless in other cases.

Dropped the whole thing in v2.

> > +		c->migrate_linger = true;  
> 
> Nicer to avoid the explicit initialisation by inverting the sense of
> this variable - plus I think it's a little nicer to have 'true' be the
> abnormal case.

Right, I wrote this before coming up with the option name, changed in
v2.

> 
> >  	if (tap_l2_max_len(c) - ETH_HLEN < max_mtu)
> >  		max_mtu = tap_l2_max_len(c) - ETH_HLEN;
> >  	c->mtu = ROUND_DOWN(max_mtu, sizeof(uint32_t));
> > @@ -1686,6 +1697,18 @@ void conf(struct ctx *c, int argc, char **argv)
> >  					   optarg))
> >  				die("Invalid passt-repair path: %s", optarg);
> >  
> > +			break;
> > +		case 29:
> > +			if (c->mode != MODE_VU)
> > +				die("--migrate-exit is for vhost-user mode only");
> > +			c->migrate_exit = true;
> > +
> > +			break;
> > +		case 30:
> > +			if (c->mode != MODE_VU)
> > +				die("--migrate-no-linger is for vhost-user mode only");
> > +			c->migrate_linger = false;
> > +
> >  			break;
> >  		case 'd':
> >  			c->debug = 1;
> > diff --git a/epoll_type.h b/epoll_type.h
> > index 12ac59b..f2991b6 100644
> > --- a/epoll_type.h
> > +++ b/epoll_type.h
> > @@ -12,6 +12,7 @@
> >  enum epoll_type {
> >  	/* Special value to indicate an invalid type */
> >  	EPOLL_TYPE_NONE = 0,
> > +
> >  	/* Connected TCP sockets */
> >  	EPOLL_TYPE_TCP,
> >  	/* Connected TCP sockets (spliced) */
> > @@ -26,16 +27,19 @@ enum epoll_type {
> >  	EPOLL_TYPE_UDP,
> >  	/* ICMP/ICMPv6 ping sockets */
> >  	EPOLL_TYPE_PING,
> > -	/* inotify fd watching for end of netns (pasta) */
> > -	EPOLL_TYPE_NSQUIT_INOTIFY,
> > -	/* timer fd watching for end of netns, fallback for inotify (pasta) */
> > -	EPOLL_TYPE_NSQUIT_TIMER,
> >  	/* tuntap character device */
> >  	EPOLL_TYPE_TAP_PASTA,
> >  	/* socket connected to qemu  */
> >  	EPOLL_TYPE_TAP_PASST,
> >  	/* socket listening for qemu socket connections */
> >  	EPOLL_TYPE_TAP_LISTEN,
> > +	/* End of event types involving data transfers or connections */
> > +	EPOLL_TYPE_DATA_MAX = EPOLL_TYPE_TAP_LISTEN,
> > +
> > +	/* inotify fd watching for end of netns (pasta) */
> > +	EPOLL_TYPE_NSQUIT_INOTIFY,
> > +	/* timer fd watching for end of netns, fallback for inotify (pasta) */
> > +	EPOLL_TYPE_NSQUIT_TIMER,
> >  	/* vhost-user command socket */
> >  	EPOLL_TYPE_VHOST_CMD,
> >  	/* vhost-user kick event socket */
> > diff --git a/flow.c b/flow.c
> > index 00885f6..feefda3 100644
> > --- a/flow.c
> > +++ b/flow.c
> > @@ -1091,7 +1091,7 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
> >  	 * as EIO).
> >  	 */
> >  	foreach_established_tcp_flow(flow) {
> > -		rc = tcp_flow_migrate_source_ext(fd, &flow->tcp);
> > +		rc = tcp_flow_migrate_source_ext(c, fd, &flow->tcp);
> >  		if (rc) {
> >  			flow_err(flow, "Can't send extended data: %s",
> >  				 strerror_(-rc));
> > diff --git a/passt.1 b/passt.1
> > index 60066c2..b85aaa0 100644
> > --- a/passt.1
> > +++ b/passt.1
> > @@ -439,6 +439,20 @@ Default, for \-\-vhost-user mode only, is to append \fI.repair\fR to the path
> >  chosen for the hypervisor UNIX domain socket. No socket is created if not in
> >  \-\-vhost-user mode.
> >  
> > +.TP
> > +.BR \-\-migrate-exit
> > +Exit after a completed migration as source. By default, \fBpasst\fR keeps
> > +running and the migrated guest can continue using its connection, or a new guest
> > +can connect.
> > +
> > +.TP
> > +.BR \-\-migrate-no-linger
> > +Close TCP sockets on the source instance once migration completes.
> > +
> > +By default, sockets are kept open, and events on data sockets are ignored, so
> > +that any further message reaching sockets after the source migrated is silently
> > +ignored, to avoid connection resets in case data is received after migration.
> > +
> >  .TP
> >  .BR \-F ", " \-\-fd " " \fIFD
> >  Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened
> > diff --git a/passt.c b/passt.c
> > index 388d10f..f4c108f 100644
> > --- a/passt.c
> > +++ b/passt.c
> > @@ -308,6 +308,9 @@ loop:
> >  		      c.mode == MODE_PASTA ? "pasta" : "passt",
> >  		      EPOLL_TYPE_STR(ref.type), ref.fd, eventmask);
> >  
> > +		if (c.ignore_data_events && ref.type <= EPOLL_TYPE_DATA_MAX)
> > +			continue;
> > +  
> 
> This worries me slightly.  I think it will be fine if passt is killed
> not long after the migration.  However, if passt lingers, we're
> ignoring, but not clearing events.  For those events which are
> level-triggered this could make us chew CPU on a tight epoll_wait()
> loop.

Ah, yes, I had that concern as well, but I conveniently forgot about it
at some point.

> I think a preferable approach would be to EPOLL_CTL_DEL each socket at
> the point we would close() them with --migrate-no-linger.  I'd be fine
> with that as a follow up improvement, though.

I was pondering about adding this on top of the ignore_data_events
trick, but, actually, the whole thing as of this patch is somewhat
bogus because

- we're ignoring events on TCP sockets (intentional),

- we're ignoring events on the tap device (who cares, migration is only
  supported with vhost-user)

- but *not* ignoring events on the vhost-user kick descriptor
  (oversight).

On a second thought, it doesn't look safe to ignore events on the kick
descriptor, and in any case, with this change, we don't want to prevent
the guest to send out further packets. It's not expected anyway.

So I just replaced the whole thing with EPOLL_CTL_DEL (epoll_del()) as
we go through the sockets. It's simpler and arguably safer.

> >  		switch (ref.type) {
> >  		case EPOLL_TYPE_TAP_PASTA:
> >  			tap_handler_pasta(&c, eventmask, &now);
> > @@ -363,7 +366,8 @@ loop:
> >  		}
> >  	}
> >  
> > -	post_handler(&c, &now);
> > +	if (!c.ignore_data_events)
> > +		post_handler(&c, &now);
> >  
> >  	migrate_handler(&c);
> >  
> > diff --git a/passt.h b/passt.h
> > index 8693794..636b3c3 100644
> > --- a/passt.h
> > +++ b/passt.h
> > @@ -241,6 +241,9 @@ struct ip6_ctx {
> >   * @device_state_fd:	Device state migration channel
> >   * @device_state_result: Device state migration result
> >   * @migrate_target:	Are we the target, on the next migration request?
> > + * @migrate_linger:	Keep sockets open after migration, ignore data events
> > + * @migrate_exit:	Exit (on source) once migration is complete
> > + * @ignore_data_events:	Ignore data events (for migration, source instance)
> >   */
> >  struct ctx {
> >  	enum passt_modes mode;
> > @@ -318,6 +321,9 @@ struct ctx {
> >  	int device_state_fd;
> >  	int device_state_result;
> >  	bool migrate_target;
> > +	bool migrate_linger;
> > +	bool migrate_exit;
> > +	bool ignore_data_events;
> >  };
> >  
> >  void proto_update_l2_buf(const unsigned char *eth_d,
> > diff --git a/tcp.c b/tcp.c
> > index 2b88466..1f7a6ab 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -3286,12 +3286,14 @@ int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn)
> >  
> >  /**
> >   * tcp_flow_migrate_source_ext() - Dump queues, close sockets, send final data
> > + * @c:		Execution context
> >   * @fd:		Descriptor for state migration
> >   * @conn:	Pointer to the TCP connection structure
> >   *
> >   * Return: 0 on success, negative (not -EIO) on failure, -EIO on sending failure
> >   */
> > -int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn)
> > +int tcp_flow_migrate_source_ext(struct ctx *c,
> > +				int fd, const struct tcp_tap_conn *conn)
> >  {
> >  	uint32_t peek_offset = conn->seq_to_tap - conn->seq_ack_from_tap;
> >  	struct tcp_tap_transfer_ext *t = &migrate_ext[FLOW_IDX(conn)];
> > @@ -3336,7 +3338,8 @@ int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn)
> >  	if ((rc = tcp_flow_dump_seq(conn, &t->seq_rcv)))
> >  		goto fail;
> >  
> > -	close(s);
> > +	if (!c->migrate_linger)
> > +		close(s);
> >  
> >  	/* Adjustments unrelated to FIN segments: sequence numbers we dumped are
> >  	 * based on the end of the queues.
> > diff --git a/tcp_conn.h b/tcp_conn.h
> > index 35d813d..d49ae88 100644
> > --- a/tcp_conn.h
> > +++ b/tcp_conn.h
> > @@ -236,7 +236,8 @@ int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn);
> >  int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn);
> >  
> >  int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn);
> > -int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn);
> > +int tcp_flow_migrate_source_ext(struct ctx *c, int fd,
> > +				const struct tcp_tap_conn *conn);
> >  
> >  int tcp_flow_migrate_target(struct ctx *c, int fd);
> >  int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd);
> > diff --git a/test/lib/setup b/test/lib/setup
> > index 575bc21..5994598 100755
> > --- a/test/lib/setup
> > +++ b/test/lib/setup
> > @@ -350,7 +350,7 @@ setup_migrate() {
> >  
> >  	sleep 1
> >  
> > -	__opts="--vhost-user"
> > +	__opts="--vhost-user --migrate-exit --migrate-no-linger"
> >  	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_1.pcap"
> >  	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
> >  	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
> > @@ -360,7 +360,7 @@ setup_migrate() {
> >  
> >  	context_run_bg passt_repair_1 "./passt-repair ${STATESETUP}/passt_1.socket.repair"
> >  
> > -	__opts="--vhost-user"
> > +	__opts="--vhost-user --migrate-exit --migrate-no-linger"
> >  	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_2.pcap"
> >  	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
> >  	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
> > diff --git a/vhost_user.c b/vhost_user.c
> > index e7fb049..7fd27dc 100644
> > --- a/vhost_user.c
> > +++ b/vhost_user.c
> > @@ -1207,7 +1207,13 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
> >  	if (vmsg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE &&
> >  	    vdev->context->device_state_result == 0 &&
> >  	    !vdev->context->migrate_target) {
> > -		info("Migration complete, exiting");
> > -		_exit(EXIT_SUCCESS);
> > +		if (vdev->context->migrate_exit) {
> > +			info("Migration complete, exiting");
> > +			_exit(EXIT_SUCCESS);
> > +		}
> > +
> > +		info("Migration complete");
> > +		if (vdev->context->migrate_linger)
> > +			vdev->context->ignore_data_events = true;
> >  	}
> >  }  
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] treewide: By default, don't quit source after migration, keep sockets open
  2025-07-22 21:12   ` Stefano Brivio
@ 2025-07-23  0:27     ` David Gibson
  2025-07-23  9:17       ` Stefano Brivio
  0 siblings, 1 reply; 6+ messages in thread
From: David Gibson @ 2025-07-23  0:27 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Nir Dothan

[-- Attachment #1: Type: text/plain, Size: 6540 bytes --]

On Tue, Jul 22, 2025 at 11:12:49PM +0200, Stefano Brivio wrote:
> On Tue, 22 Jul 2025 10:33:00 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Tue, Jul 22, 2025 at 12:12:58AM +0200, Stefano Brivio wrote:
> > > We are hitting an issue in the KubeVirt integration where some data is
> > > still sent to the source instance even after migration is complete. As
> > > we exit, the kernel closes our sockets and resets connections. The
> > > resulting RST segments are sent to peers, effectively terminating
> > > connections that were meanwhile migrated.
> > > 
> > > At the moment, this is not done intentionally, but in the future
> > > KubeVirt might enable OVN-Kubernetes features where source and
> > > destination nodes are explicitly getting mirrored traffic for a while,
> > > in order to decrease migration downtime.
> > > 
> > > By default, don't quit after migration is completed on the source: the
> > > previous behaviour can be enabled with the new --migrate-exit
> > > option.  
> > 
> > Might be worth explicitly mentioning that the reason this helps is
> > that having the sockets still around in repair mode prevents the
> > source node from responding to any traffic for them.
> 
> Added as separate paragraph, below.

Thanks.

[snip]
> > > diff --git a/conf.c b/conf.c
> > > index 6c747aa..5e69014 100644
> > > --- a/conf.c
> > > +++ b/conf.c
> > > @@ -864,6 +864,12 @@ static void usage(const char *name, FILE *f, int status)
> > >  		FPRINTF(f,
> > >  			"  --repair-path PATH	path for passt-repair(1)\n"
> > >  			"    default: append '.repair' to UNIX domain path\n");
> > > +		FPRINTF(f,
> > > +			"  --migrate-exit	source quits after migration\n"
> > > +			"    default: source keeps running after migration\n");
> > > +		FPRINTF(f,
> > > +			"  --migrate-no-linger	close sockets on migration\n"
> > > +			"    default: keep sockets open, ignore data events\n");  
> > 
> > I'm a bit uncomfortable adding this as regular, supported options.
> > I'm hoping we can eliminate them in the not too distant future.  In
> > the medium term we can do that by improving our test code to simulate
> > different hosts with namespaces.  Longer term we can allow local to
> > local migration without these options by fixing the kernel: I don't
> > believe bind() should fail if the address conflict is with a socket in
> > repair mode.
> 
> I see the point about --migrate-no-linger, but --migrate-exit looks
> like something that depends on the migration workflow, not necessarily
> something to cover up issues in our tests or in the kernel. In any
> case...

It does depend on the workflow, but I think this needs to be managed
by qemu (or whatever upper level component is managing the overall
migration).  --migrate-exit doesn't make us exit once the migration is
complete - it makes us exit once *our part of* the migration is
complete, the migration could still fail after that.  So in the happy
path, yes, we'll exit, but it should be because qemu or libvirt or
whatever is satisfied the migration is finished and kills us off

> > Further, as user accessible options these are kinda weird - what they
> > do is defined in terms of some pretty obscure internals, not in terms
> > of what the upshot is.  So, I think it would be better to introduce
> > these as deprecated / debug-only / undocumented options.
> 
> ...I marked both as deprecated in v2, to signal to any hypothetical
> (I'm fairly sure not real) user that they shouldn't assume it's safe to
> rely on them.
> 
> That's the only other category of options we have so far, we don't have
> undocumented ones, and I think it's a feature. Especially as we use
> them in the tests.

Works for me.

> By the way, I would argue that testing is an important type of usage,
> and changes in the test setup itself or in the kernel are non-trivial
> effort, and it's rather uncertain whether we'll ever manage to commit
> to that, so I don't quite feel that marking them as deprecated is
> entirely warranted, but I see the point about possibly dropping that
> migration model at some point in the future.
> 
> Regardless, the fact we *need* them at the moment should prove that
> they are at least convenient, even if just for testing.

Fair enough.

[snip]
> > > diff --git a/passt.c b/passt.c
> > > index 388d10f..f4c108f 100644
> > > --- a/passt.c
> > > +++ b/passt.c
> > > @@ -308,6 +308,9 @@ loop:
> > >  		      c.mode == MODE_PASTA ? "pasta" : "passt",
> > >  		      EPOLL_TYPE_STR(ref.type), ref.fd, eventmask);
> > >  
> > > +		if (c.ignore_data_events && ref.type <= EPOLL_TYPE_DATA_MAX)
> > > +			continue;
> > > +  
> > 
> > This worries me slightly.  I think it will be fine if passt is killed
> > not long after the migration.  However, if passt lingers, we're
> > ignoring, but not clearing events.  For those events which are
> > level-triggered this could make us chew CPU on a tight epoll_wait()
> > loop.
> 
> Ah, yes, I had that concern as well, but I conveniently forgot about it
> at some point.
> 
> > I think a preferable approach would be to EPOLL_CTL_DEL each socket at
> > the point we would close() them with --migrate-no-linger.  I'd be fine
> > with that as a follow up improvement, though.
> 
> I was pondering about adding this on top of the ignore_data_events
> trick, but, actually, the whole thing as of this patch is somewhat
> bogus because
> 
> - we're ignoring events on TCP sockets (intentional),
> 
> - we're ignoring events on the tap device (who cares, migration is only
>   supported with vhost-user)
> 
> - but *not* ignoring events on the vhost-user kick descriptor
>   (oversight).
> 
> On a second thought, it doesn't look safe to ignore events on the kick
> descriptor, and in any case, with this change, we don't want to prevent
> the guest to send out further packets. It's not expected anyway.
> 
> So I just replaced the whole thing with EPOLL_CTL_DEL (epoll_del()) as
> we go through the sockets. It's simpler and arguably safer.

Yes, that's what I had in mind as well (I thought I put that in the
mail, but it looks like I didn't).  Just one additional concern that I
don't think need hold up merge: do we also need to epoll_del() our
listening sockets?

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] treewide: By default, don't quit source after migration, keep sockets open
  2025-07-23  0:27     ` David Gibson
@ 2025-07-23  9:17       ` Stefano Brivio
  2025-07-24  1:48         ` David Gibson
  0 siblings, 1 reply; 6+ messages in thread
From: Stefano Brivio @ 2025-07-23  9:17 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev, Nir Dothan

On Wed, 23 Jul 2025 10:27:30 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, Jul 22, 2025 at 11:12:49PM +0200, Stefano Brivio wrote:
> > On Tue, 22 Jul 2025 10:33:00 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Tue, Jul 22, 2025 at 12:12:58AM +0200, Stefano Brivio wrote:  
> > > > We are hitting an issue in the KubeVirt integration where some data is
> > > > still sent to the source instance even after migration is complete. As
> > > > we exit, the kernel closes our sockets and resets connections. The
> > > > resulting RST segments are sent to peers, effectively terminating
> > > > connections that were meanwhile migrated.
> > > > 
> > > > At the moment, this is not done intentionally, but in the future
> > > > KubeVirt might enable OVN-Kubernetes features where source and
> > > > destination nodes are explicitly getting mirrored traffic for a while,
> > > > in order to decrease migration downtime.
> > > > 
> > > > By default, don't quit after migration is completed on the source: the
> > > > previous behaviour can be enabled with the new --migrate-exit
> > > > option.    
> > > 
> > > Might be worth explicitly mentioning that the reason this helps is
> > > that having the sockets still around in repair mode prevents the
> > > source node from responding to any traffic for them.  
> > 
> > Added as separate paragraph, below.  
> 
> Thanks.
> 
> [snip]
> > > > diff --git a/conf.c b/conf.c
> > > > index 6c747aa..5e69014 100644
> > > > --- a/conf.c
> > > > +++ b/conf.c
> > > > @@ -864,6 +864,12 @@ static void usage(const char *name, FILE *f, int status)
> > > >  		FPRINTF(f,
> > > >  			"  --repair-path PATH	path for passt-repair(1)\n"
> > > >  			"    default: append '.repair' to UNIX domain path\n");
> > > > +		FPRINTF(f,
> > > > +			"  --migrate-exit	source quits after migration\n"
> > > > +			"    default: source keeps running after migration\n");
> > > > +		FPRINTF(f,
> > > > +			"  --migrate-no-linger	close sockets on migration\n"
> > > > +			"    default: keep sockets open, ignore data events\n");    
> > > 
> > > I'm a bit uncomfortable adding this as regular, supported options.
> > > I'm hoping we can eliminate them in the not too distant future.  In
> > > the medium term we can do that by improving our test code to simulate
> > > different hosts with namespaces.  Longer term we can allow local to
> > > local migration without these options by fixing the kernel: I don't
> > > believe bind() should fail if the address conflict is with a socket in
> > > repair mode.  
> > 
> > I see the point about --migrate-no-linger, but --migrate-exit looks
> > like something that depends on the migration workflow, not necessarily
> > something to cover up issues in our tests or in the kernel. In any
> > case...  
> 
> It does depend on the workflow, but I think this needs to be managed
> by qemu (or whatever upper level component is managing the overall
> migration).  --migrate-exit doesn't make us exit once the migration is
> complete - it makes us exit once *our part of* the migration is
> complete, the migration could still fail after that.  So in the happy
> path, yes, we'll exit, but it should be because qemu or libvirt or
> whatever is satisfied the migration is finished and kills us off
> 
> > > Further, as user accessible options these are kinda weird - what they
> > > do is defined in terms of some pretty obscure internals, not in terms
> > > of what the upshot is.  So, I think it would be better to introduce
> > > these as deprecated / debug-only / undocumented options.  
> > 
> > ...I marked both as deprecated in v2, to signal to any hypothetical
> > (I'm fairly sure not real) user that they shouldn't assume it's safe to
> > rely on them.
> > 
> > That's the only other category of options we have so far, we don't have
> > undocumented ones, and I think it's a feature. Especially as we use
> > them in the tests.  
> 
> Works for me.
> 
> > By the way, I would argue that testing is an important type of usage,
> > and changes in the test setup itself or in the kernel are non-trivial
> > effort, and it's rather uncertain whether we'll ever manage to commit
> > to that, so I don't quite feel that marking them as deprecated is
> > entirely warranted, but I see the point about possibly dropping that
> > migration model at some point in the future.
> > 
> > Regardless, the fact we *need* them at the moment should prove that
> > they are at least convenient, even if just for testing.  
> 
> Fair enough.
> 
> [snip]
> > > > diff --git a/passt.c b/passt.c
> > > > index 388d10f..f4c108f 100644
> > > > --- a/passt.c
> > > > +++ b/passt.c
> > > > @@ -308,6 +308,9 @@ loop:
> > > >  		      c.mode == MODE_PASTA ? "pasta" : "passt",
> > > >  		      EPOLL_TYPE_STR(ref.type), ref.fd, eventmask);
> > > >  
> > > > +		if (c.ignore_data_events && ref.type <= EPOLL_TYPE_DATA_MAX)
> > > > +			continue;
> > > > +    
> > > 
> > > This worries me slightly.  I think it will be fine if passt is killed
> > > not long after the migration.  However, if passt lingers, we're
> > > ignoring, but not clearing events.  For those events which are
> > > level-triggered this could make us chew CPU on a tight epoll_wait()
> > > loop.  
> > 
> > Ah, yes, I had that concern as well, but I conveniently forgot about it
> > at some point.
> >   
> > > I think a preferable approach would be to EPOLL_CTL_DEL each socket at
> > > the point we would close() them with --migrate-no-linger.  I'd be fine
> > > with that as a follow up improvement, though.  
> > 
> > I was pondering about adding this on top of the ignore_data_events
> > trick, but, actually, the whole thing as of this patch is somewhat
> > bogus because
> > 
> > - we're ignoring events on TCP sockets (intentional),
> > 
> > - we're ignoring events on the tap device (who cares, migration is only
> >   supported with vhost-user)
> > 
> > - but *not* ignoring events on the vhost-user kick descriptor
> >   (oversight).
> > 
> > On a second thought, it doesn't look safe to ignore events on the kick
> > descriptor, and in any case, with this change, we don't want to prevent
> > the guest to send out further packets. It's not expected anyway.
> > 
> > So I just replaced the whole thing with EPOLL_CTL_DEL (epoll_del()) as
> > we go through the sockets. It's simpler and arguably safer.  
> 
> Yes, that's what I had in mind as well (I thought I put that in the
> mail, but it looks like I didn't).  Just one additional concern that I
> don't think need hold up merge: do we also need to epoll_del() our
> listening sockets?

Right, I had that thought as well, and this was somewhat covered by the
first version because we'd ignore events on those. But that would still
come with the risk of epoll_wait() loops.

In the perspective of a simple implementation / fix mostly intended for
KubeVirt such as this one, we know that the guest is suspended at that
point, so we'll send a SYN to it, nothing comes back, we'll eventually
time out in 10 seconds and try to reset the connection (in the unlikely
case we're still running *and* getting traffic at that point).

And I guess that's fine because if we're still running and getting
traffic after that timeout something must have gone wrong, so sending a
RST segment doesn't look that bad to me. If the connection was meanwhile
established with the target node, I think our RST segment won't actually
reset the connection because sequence numbers won't match.

But surely it's not elegant and I think we should eventually have an
explicit implementation of the whole thing, perhaps with a new socket
state ("MIGRATED"?) and going through the whole list of listening
sockets and... closing them, I suppose?

-- 
Stefano


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] treewide: By default, don't quit source after migration, keep sockets open
  2025-07-23  9:17       ` Stefano Brivio
@ 2025-07-24  1:48         ` David Gibson
  0 siblings, 0 replies; 6+ messages in thread
From: David Gibson @ 2025-07-24  1:48 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Nir Dothan

[-- Attachment #1: Type: text/plain, Size: 3101 bytes --]

On Wed, Jul 23, 2025 at 11:17:08AM +0200, Stefano Brivio wrote:
> On Wed, 23 Jul 2025 10:27:30 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> > On Tue, Jul 22, 2025 at 11:12:49PM +0200, Stefano Brivio wrote:
[snip]
> > > > I think a preferable approach would be to EPOLL_CTL_DEL each socket at
> > > > the point we would close() them with --migrate-no-linger.  I'd be fine
> > > > with that as a follow up improvement, though.  
> > > 
> > > I was pondering about adding this on top of the ignore_data_events
> > > trick, but, actually, the whole thing as of this patch is somewhat
> > > bogus because
> > > 
> > > - we're ignoring events on TCP sockets (intentional),
> > > 
> > > - we're ignoring events on the tap device (who cares, migration is only
> > >   supported with vhost-user)
> > > 
> > > - but *not* ignoring events on the vhost-user kick descriptor
> > >   (oversight).
> > > 
> > > On a second thought, it doesn't look safe to ignore events on the kick
> > > descriptor, and in any case, with this change, we don't want to prevent
> > > the guest to send out further packets. It's not expected anyway.
> > > 
> > > So I just replaced the whole thing with EPOLL_CTL_DEL (epoll_del()) as
> > > we go through the sockets. It's simpler and arguably safer.  
> > 
> > Yes, that's what I had in mind as well (I thought I put that in the
> > mail, but it looks like I didn't).  Just one additional concern that I
> > don't think need hold up merge: do we also need to epoll_del() our
> > listening sockets?
> 
> Right, I had that thought as well, and this was somewhat covered by the
> first version because we'd ignore events on those. But that would still
> come with the risk of epoll_wait() loops.

Exactly.

> In the perspective of a simple implementation / fix mostly intended for
> KubeVirt such as this one, we know that the guest is suspended at that
> point, so we'll send a SYN to it, nothing comes back, we'll eventually
> time out in 10 seconds and try to reset the connection (in the unlikely
> case we're still running *and* getting traffic at that point).

Hrm.  Maybe.  Wouldn't surprise me if there are still some weird edge
cases in her, but, yes, I suspect they're not a huge deal.

> And I guess that's fine because if we're still running and getting
> traffic after that timeout something must have gone wrong, so sending a
> RST segment doesn't look that bad to me. If the connection was meanwhile
> established with the target node, I think our RST segment won't actually
> reset the connection because sequence numbers won't match.
> 
> But surely it's not elegant and I think we should eventually have an
> explicit implementation of the whole thing, perhaps with a new socket
> state ("MIGRATED"?) and going through the whole list of listening
> sockets and... closing them, I suppose?

Right.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-07-24  1:50 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-21 22:12 [PATCH] treewide: By default, don't quit source after migration, keep sockets open Stefano Brivio
2025-07-22  0:33 ` David Gibson
2025-07-22 21:12   ` Stefano Brivio
2025-07-23  0:27     ` David Gibson
2025-07-23  9:17       ` Stefano Brivio
2025-07-24  1:48         ` David Gibson

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).