[PATCH v3] treewide: By default, don't quit source after migration, keep sockets open

public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed

* [PATCH v3] treewide: By default, don't quit source after migration, keep sockets open
@ 2025-07-24 17:28 Stefano Brivio
  2025-07-25  4:04 ` David Gibson
  0 siblings, 1 reply; 5+ messages in thread
From: Stefano Brivio @ 2025-07-24 17:28 UTC (permalink / raw)
  To: passt-dev; +Cc: David Gibson, Nir Dothan

We are hitting an issue in the KubeVirt integration where some data is
still sent to the source instance even after migration is complete. As
we exit, the kernel closes our sockets and resets connections. The
resulting RST segments are sent to peers, effectively terminating
connections that were meanwhile migrated.

At the moment, this is not done intentionally, but in the future
KubeVirt might enable OVN-Kubernetes features where source and
destination nodes are explicitly getting mirrored traffic for a while,
in order to decrease migration downtime.

By default, don't quit after migration is completed on the source: the
previous behaviour can be enabled with the new, but deprecated,
--migrate-exit option. After migration (as source), the -1 / --one-off
option has no effect.

Also, by default, keep migrated TCP sockets open (in repair mode) as
long as we're running, and ignore events on any epoll descriptor
representing data channels. The previous behaviour can be enabled with
the new, equally deprecated, --migrate-no-linger option.

By keeping sockets open, and not exiting, we prevent the kernel
running on the source node to send out RST segments if further data
reaches us.

Reported-by: Nir Dothan <ndothan@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
v2:
  - assorted changes in commit message
  - context variable ignore_linger becomes ignore_no_linger
  - new options are deprecated
  - don't ignore events on some descriptors, drop them from epoll

v3:
  - Nir reported occasional failures (connections being reset)
    with both v1 and v2, because, in KubeVirt's usage, we quit as
    QEMU exits. Disable --one-off after migration as source, and
    document this exception

 conf.c         | 22 ++++++++++++++++++++++
 flow.c         |  2 +-
 passt.1        | 29 +++++++++++++++++++++++++++++
 passt.h        |  4 ++++
 tcp.c          |  9 +++++++--
 tcp_conn.h     |  3 ++-
 test/lib/setup |  4 ++--
 vhost_user.c   |  9 +++++++--
 8 files changed, 74 insertions(+), 8 deletions(-)

diff --git a/conf.c b/conf.c
index 6c747aa..f47f48e 100644
--- a/conf.c
+++ b/conf.c
@@ -864,6 +864,14 @@ static void usage(const char *name, FILE *f, int status)
 		FPRINTF(f,
 			"  --repair-path PATH	path for passt-repair(1)\n"
 			"    default: append '.repair' to UNIX domain path\n");
+		FPRINTF(f,
+			"  --migrate-exit	DEPRECATED:\n"
+			"			source quits after migration\n"
+			"    default: source keeps running after migration\n");
+		FPRINTF(f,
+			"  --migrate-no-linger	DEPRECATED:\n"
+			"			close sockets on migration\n"
+			"    default: keep sockets open, ignore events\n");
 	}
 
 	FPRINTF(f,
@@ -1470,6 +1478,8 @@ void conf(struct ctx *c, int argc, char **argv)
 		{"socket-path",	required_argument,	NULL,		's' },
 		{"fqdn",	required_argument,	NULL,		27 },
 		{"repair-path",	required_argument,	NULL,		28 },
+		{"migrate-exit", no_argument,		NULL,		29 },
+		{"migrate-no-linger", no_argument,	NULL,		30 },
 		{ 0 },
 	};
 	const char *optstring = "+dqfel:hs:F:I:p:P:m:a:n:M:g:i:o:D:S:H:461t:u:T:U:";
@@ -1686,6 +1696,18 @@ void conf(struct ctx *c, int argc, char **argv)
 					   optarg))
 				die("Invalid passt-repair path: %s", optarg);
 
+			break;
+		case 29:
+			if (c->mode != MODE_VU)
+				die("--migrate-exit is for vhost-user mode only");
+			c->migrate_exit = true;
+
+			break;
+		case 30:
+			if (c->mode != MODE_VU)
+				die("--migrate-no-linger is for vhost-user mode only");
+			c->migrate_no_linger = true;
+
 			break;
 		case 'd':
 			c->debug = 1;
diff --git a/flow.c b/flow.c
index 00885f6..feefda3 100644
--- a/flow.c
+++ b/flow.c
@@ -1091,7 +1091,7 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
 	 * as EIO).
 	 */
 	foreach_established_tcp_flow(flow) {
-		rc = tcp_flow_migrate_source_ext(fd, &flow->tcp);
+		rc = tcp_flow_migrate_source_ext(c, fd, &flow->tcp);
 		if (rc) {
 			flow_err(flow, "Can't send extended data: %s",
 				 strerror_(-rc));
diff --git a/passt.1 b/passt.1
index 60066c2..cef98b2 100644
--- a/passt.1
+++ b/passt.1
@@ -439,6 +439,30 @@ Default, for \-\-vhost-user mode only, is to append \fI.repair\fR to the path
 chosen for the hypervisor UNIX domain socket. No socket is created if not in
 \-\-vhost-user mode.
 
+.TP
+.BR \-\-migrate-exit (DEPRECATED)
+Exit after a completed migration as source. By default, \fBpasst\fR keeps
+running and the migrated guest can continue using its connection, or a new guest
+can connect.
+
+Note that this configuration option is \fBdeprecated\fR and will be removed in a
+future version. It is not expected to be of any use, and it simply reflects a
+legacy behaviour. If you have any use for this, refer to \fBREPORTING BUGS\fR
+below.
+
+.TP
+.BR \-\-migrate-no-linger (DEPRECATED)
+Close TCP sockets on the source instance once migration completes.
+
+By default, sockets are kept open, and events on data sockets are ignored, so
+that any further message reaching sockets after the source migrated is silently
+ignored, to avoid connection resets in case data is received after migration.
+
+Note that this configuration option is \fBdeprecated\fR and will be removed in a
+future version. It is not expected to be of any use, and it simply reflects a
+legacy behaviour. If you have any use for this, refer to \fBREPORTING BUGS\fR
+below.
+
 .TP
 .BR \-F ", " \-\-fd " " \fIFD
 Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened
@@ -454,6 +478,11 @@ is closed.
 Quit after handling a single client connection, that is, once the client closes
 the socket, or once we get a socket error.
 
+\fBNote\fR: this option has no effect after \fBpasst\fR completes a migration as
+source, because, in that case, exiting would close sockets for active
+connections, which would in turn cause connection resets if any further data is
+received. See also the description of \fI\-\-migrate-no-linger\fR.
+
 .TP
 .BR \-t ", " \-\-tcp-ports " " \fIspec
 Configure TCP port forwarding to guest. \fIspec\fR can be one of:
diff --git a/passt.h b/passt.h
index 8693794..4cfd6eb 100644
--- a/passt.h
+++ b/passt.h
@@ -241,6 +241,8 @@ struct ip6_ctx {
  * @device_state_fd:	Device state migration channel
  * @device_state_result: Device state migration result
  * @migrate_target:	Are we the target, on the next migration request?
+ * @migrate_no_linger:	Close sockets as we migrate them
+ * @migrate_exit:	Exit (on source) once migration is complete
  */
 struct ctx {
 	enum passt_modes mode;
@@ -318,6 +320,8 @@ struct ctx {
 	int device_state_fd;
 	int device_state_result;
 	bool migrate_target;
+	bool migrate_no_linger;
+	bool migrate_exit;
 };
 
 void proto_update_l2_buf(const unsigned char *eth_d,
diff --git a/tcp.c b/tcp.c
index 2b88466..b10ad8a 100644
--- a/tcp.c
+++ b/tcp.c
@@ -3286,12 +3286,14 @@ int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn)
 
 /**
  * tcp_flow_migrate_source_ext() - Dump queues, close sockets, send final data
+ * @c:		Execution context
  * @fd:		Descriptor for state migration
  * @conn:	Pointer to the TCP connection structure
  *
  * Return: 0 on success, negative (not -EIO) on failure, -EIO on sending failure
  */
-int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn)
+int tcp_flow_migrate_source_ext(struct ctx *c,
+				int fd, const struct tcp_tap_conn *conn)
 {
 	uint32_t peek_offset = conn->seq_to_tap - conn->seq_ack_from_tap;
 	struct tcp_tap_transfer_ext *t = &migrate_ext[FLOW_IDX(conn)];
@@ -3336,7 +3338,10 @@ int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn)
 	if ((rc = tcp_flow_dump_seq(conn, &t->seq_rcv)))
 		goto fail;
 
-	close(s);
+	if (c->migrate_no_linger)
+		close(s);
+	else
+		epoll_del(c, s);
 
 	/* Adjustments unrelated to FIN segments: sequence numbers we dumped are
 	 * based on the end of the queues.
diff --git a/tcp_conn.h b/tcp_conn.h
index 35d813d..d49ae88 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -236,7 +236,8 @@ int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn);
 int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn);
 
 int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn);
-int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn);
+int tcp_flow_migrate_source_ext(struct ctx *c, int fd,
+				const struct tcp_tap_conn *conn);
 
 int tcp_flow_migrate_target(struct ctx *c, int fd);
 int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd);
diff --git a/test/lib/setup b/test/lib/setup
index 575bc21..5994598 100755
--- a/test/lib/setup
+++ b/test/lib/setup
@@ -350,7 +350,7 @@ setup_migrate() {
 
 	sleep 1
 
-	__opts="--vhost-user"
+	__opts="--vhost-user --migrate-exit --migrate-no-linger"
 	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_1.pcap"
 	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
 	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
@@ -360,7 +360,7 @@ setup_migrate() {
 
 	context_run_bg passt_repair_1 "./passt-repair ${STATESETUP}/passt_1.socket.repair"
 
-	__opts="--vhost-user"
+	__opts="--vhost-user --migrate-exit --migrate-no-linger"
 	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_2.pcap"
 	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
 	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
diff --git a/vhost_user.c b/vhost_user.c
index e7fb049..c1522d5 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -1207,7 +1207,12 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
 	if (vmsg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE &&
 	    vdev->context->device_state_result == 0 &&
 	    !vdev->context->migrate_target) {
-		info("Migration complete, exiting");
-		_exit(EXIT_SUCCESS);
+		if (vdev->context->migrate_exit) {
+			info("Migration complete, exiting");
+			_exit(EXIT_SUCCESS);
+		}
+
+		info("Migration complete");
+		vdev->context->one_off = false;
 	}
 }
-- 
@@ -1207,7 +1207,12 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
 	if (vmsg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE &&
 	    vdev->context->device_state_result == 0 &&
 	    !vdev->context->migrate_target) {
-		info("Migration complete, exiting");
-		_exit(EXIT_SUCCESS);
+		if (vdev->context->migrate_exit) {
+			info("Migration complete, exiting");
+			_exit(EXIT_SUCCESS);
+		}
+
+		info("Migration complete");
+		vdev->context->one_off = false;
 	}
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH v3] treewide: By default, don't quit source after migration, keep sockets open
  2025-07-24 17:28 [PATCH v3] treewide: By default, don't quit source after migration, keep sockets open Stefano Brivio
@ 2025-07-25  4:04 ` David Gibson
  2025-07-25  5:10   ` Stefano Brivio
  0 siblings, 1 reply; 5+ messages in thread
From: David Gibson @ 2025-07-25  4:04 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Nir Dothan

[-- Attachment #1: Type: text/plain, Size: 2791 bytes --]

On Thu, Jul 24, 2025 at 07:28:58PM +0200, Stefano Brivio wrote:
> We are hitting an issue in the KubeVirt integration where some data is
> still sent to the source instance even after migration is complete. As
> we exit, the kernel closes our sockets and resets connections. The
> resulting RST segments are sent to peers, effectively terminating
> connections that were meanwhile migrated.
> 
> At the moment, this is not done intentionally, but in the future
> KubeVirt might enable OVN-Kubernetes features where source and
> destination nodes are explicitly getting mirrored traffic for a while,
> in order to decrease migration downtime.
> 
> By default, don't quit after migration is completed on the source: the
> previous behaviour can be enabled with the new, but deprecated,
> --migrate-exit option. After migration (as source), the -1 / --one-off
> option has no effect.
> 
> Also, by default, keep migrated TCP sockets open (in repair mode) as
> long as we're running, and ignore events on any epoll descriptor
> representing data channels. The previous behaviour can be enabled with
> the new, equally deprecated, --migrate-no-linger option.
> 
> By keeping sockets open, and not exiting, we prevent the kernel
> running on the source node to send out RST segments if further data
> reaches us.
> 
> Reported-by: Nir Dothan <ndothan@redhat.com>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> v2:
>   - assorted changes in commit message
>   - context variable ignore_linger becomes ignore_no_linger
>   - new options are deprecated
>   - don't ignore events on some descriptors, drop them from epoll
> 
> v3:
>   - Nir reported occasional failures (connections being reset)
>     with both v1 and v2, because, in KubeVirt's usage, we quit as
>     QEMU exits. Disable --one-off after migration as source, and
>     document this exception

This seems like an awful, awful hack.  We're abandoning consistent
semantics on a wild guess as to what the layers above us need.
Specifically, --once-off used to mean that the layer above us didn't
need to manage passt's lifetime; it was tied to qemu's.  Now it still
needs to manually manage passt's lifetime, so what's the point.  So,
if it needs passt to outlive qemu it should actually manage that and
not use --once-off.

Requring passt to outlive qemu already seems pretty dubious to me:
having the source still connected when passt was quitting is one thing
- indeed it's arguably hard to avoid.  Having it still connected when
*qemu* quits is much less defensible.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v3] treewide: By default, don't quit source after migration, keep sockets open
  2025-07-25  4:04 ` David Gibson
@ 2025-07-25  5:10   ` Stefano Brivio
  2025-07-25  6:50     ` David Gibson
  0 siblings, 1 reply; 5+ messages in thread
From: Stefano Brivio @ 2025-07-25  5:10 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev, Nir Dothan

On Fri, 25 Jul 2025 14:04:17 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Jul 24, 2025 at 07:28:58PM +0200, Stefano Brivio wrote:
> > We are hitting an issue in the KubeVirt integration where some data is
> > still sent to the source instance even after migration is complete. As
> > we exit, the kernel closes our sockets and resets connections. The
> > resulting RST segments are sent to peers, effectively terminating
> > connections that were meanwhile migrated.
> > 
> > At the moment, this is not done intentionally, but in the future
> > KubeVirt might enable OVN-Kubernetes features where source and
> > destination nodes are explicitly getting mirrored traffic for a while,
> > in order to decrease migration downtime.
> > 
> > By default, don't quit after migration is completed on the source: the
> > previous behaviour can be enabled with the new, but deprecated,
> > --migrate-exit option. After migration (as source), the -1 / --one-off
> > option has no effect.
> > 
> > Also, by default, keep migrated TCP sockets open (in repair mode) as
> > long as we're running, and ignore events on any epoll descriptor
> > representing data channels. The previous behaviour can be enabled with
> > the new, equally deprecated, --migrate-no-linger option.
> > 
> > By keeping sockets open, and not exiting, we prevent the kernel
> > running on the source node to send out RST segments if further data
> > reaches us.
> > 
> > Reported-by: Nir Dothan <ndothan@redhat.com>
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> > v2:
> >   - assorted changes in commit message
> >   - context variable ignore_linger becomes ignore_no_linger
> >   - new options are deprecated
> >   - don't ignore events on some descriptors, drop them from epoll
> > 
> > v3:
> >   - Nir reported occasional failures (connections being reset)
> >     with both v1 and v2, because, in KubeVirt's usage, we quit as
> >     QEMU exits. Disable --one-off after migration as source, and
> >     document this exception  
> 
> This seems like an awful, awful hack.

Well, of course, it is, and long term it should be fixed in
either KubeVirt or libvirt (even though I'm not sure how, see below)
instead.

> We're abandoning consistent
> semantics on a wild guess as to what the layers above us need.

No, not really, we tested this and tested the alternative.

> Specifically, --once-off used to mean that the layer above us didn't

--one-off

> need to manage passt's lifetime; it was tied to qemu's.  Now it still
> needs to manually manage passt's lifetime, so what's the point.  So,
> if it needs passt to outlive qemu it should actually manage that and
> not use --once-off.

The main point is that it does *not* manually manage passt's lifetime
if there's no migration (which is the general case for libvirt and all
other users).

We don't have any other user with an implementation of the migration
workflow anyway (libvirt itself doesn't do that, yet). It's otherwise
unusable for KubeVirt. So I'd say let's fix it for the only user we
have.

> Requring passt to outlive qemu already seems pretty dubious to me:
> having the source still connected when passt was quitting is one thing
> - indeed it's arguably hard to avoid.  Having it still connected when
> *qemu* quits is much less defensible.

The fundamental problem here is that there's an issue in KubeVirt
(and working around it is the whole point of this patch) which implies
that packets are sent to the source pod *for a while* after migration.

We found out that the guest is generally suspended during that while,
but sometimes it might even have already exited. The pod remains,
though, as long as it's needed. That's the only certainty we have.

So, do we want to drop --one-off from the libvirt integration, and have
libvirt manage passt's lifecycle entirely (note that all users outside
KubeVirt don't use migration, so we would make the general case vastly
more complicated for the sake of correctness for a single usage...)?

Well, we can try to do that. Except that libvirt doesn't know either
for how long this traffic will reach the source pod (that's a KubeVirt
concept). So it should implement the same hack: let it outlive QEMU on
migration... as long as we have that issue in KubeVirt.

But I asked KubeVirt people, and it turns out that it's extremely
complicated to fix this in KubeVirt. So, actually, I don't see another
way to fix this in the short term. And without KubeVirt using this we
could also drop the whole feature...

-- 
Stefano

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v3] treewide: By default, don't quit source after migration, keep sockets open
  2025-07-25  5:10   ` Stefano Brivio
@ 2025-07-25  6:50     ` David Gibson
  2025-07-25  8:21       ` Stefano Brivio
  0 siblings, 1 reply; 5+ messages in thread
From: David Gibson @ 2025-07-25  6:50 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Nir Dothan

[-- Attachment #1: Type: text/plain, Size: 6443 bytes --]

On Fri, Jul 25, 2025 at 07:10:58AM +0200, Stefano Brivio wrote:
> On Fri, 25 Jul 2025 14:04:17 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Jul 24, 2025 at 07:28:58PM +0200, Stefano Brivio wrote:
> > > We are hitting an issue in the KubeVirt integration where some data is
> > > still sent to the source instance even after migration is complete. As
> > > we exit, the kernel closes our sockets and resets connections. The
> > > resulting RST segments are sent to peers, effectively terminating
> > > connections that were meanwhile migrated.
> > > 
> > > At the moment, this is not done intentionally, but in the future
> > > KubeVirt might enable OVN-Kubernetes features where source and
> > > destination nodes are explicitly getting mirrored traffic for a while,
> > > in order to decrease migration downtime.
> > > 
> > > By default, don't quit after migration is completed on the source: the
> > > previous behaviour can be enabled with the new, but deprecated,
> > > --migrate-exit option. After migration (as source), the -1 / --one-off
> > > option has no effect.
> > > 
> > > Also, by default, keep migrated TCP sockets open (in repair mode) as
> > > long as we're running, and ignore events on any epoll descriptor
> > > representing data channels. The previous behaviour can be enabled with
> > > the new, equally deprecated, --migrate-no-linger option.
> > > 
> > > By keeping sockets open, and not exiting, we prevent the kernel
> > > running on the source node to send out RST segments if further data
> > > reaches us.
> > > 
> > > Reported-by: Nir Dothan <ndothan@redhat.com>
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > ---
> > > v2:
> > >   - assorted changes in commit message
> > >   - context variable ignore_linger becomes ignore_no_linger
> > >   - new options are deprecated
> > >   - don't ignore events on some descriptors, drop them from epoll
> > > 
> > > v3:
> > >   - Nir reported occasional failures (connections being reset)
> > >     with both v1 and v2, because, in KubeVirt's usage, we quit as
> > >     QEMU exits. Disable --one-off after migration as source, and
> > >     document this exception  
> > 
> > This seems like an awful, awful hack.
> 
> Well, of course, it is, and long term it should be fixed in
> either KubeVirt or libvirt (even though I'm not sure how, see below)
> instead.

But this hack means that even when it's fixed we'll still have this
wildly counterintuitive behaviour that every future user will have to
work around.  There's no sensible internal reason for out-migration to
affect lifetime, it's a workaround for problems that are quite
specific to this stack of layers above.

> > We're abandoning consistent
> > semantics on a wild guess as to what the layers above us need.
> 
> No, not really, we tested this and tested the alternative.

With just one use case.  Creating semantics to work with exactly how
something is used now, without thought to whether they make sense in
general is the definition of fragile software.

> > Specifically, --once-off used to mean that the layer above us didn't
> 
> --one-off
> 
> > need to manage passt's lifetime; it was tied to qemu's.  Now it still
> > needs to manually manage passt's lifetime, so what's the point.  So,
> > if it needs passt to outlive qemu it should actually manage that and
> > not use --once-off.
> 
> The main point is that it does *not* manually manage passt's lifetime
> if there's no migration (which is the general case for libvirt and all
> other users).

That's exactly my point.  With this hack it's neither one model nor
the other so you have to be aware of both.

> We don't have any other user with an implementation of the migration
> workflow anyway (libvirt itself doesn't do that, yet). It's otherwise
> unusable for KubeVirt. So I'd say let's fix it for the only user we
> have.

Please not at the expense of forcing every future user to deal with
this suckage.

> > Requring passt to outlive qemu already seems pretty dubious to me:
> > having the source still connected when passt was quitting is one thing
> > - indeed it's arguably hard to avoid.  Having it still connected when
> > *qemu* quits is much less defensible.
> 
> The fundamental problem here is that there's an issue in KubeVirt
> (and working around it is the whole point of this patch) which implies
> that packets are sent to the source pod *for a while* after migration.
> 
> We found out that the guest is generally suspended during that while,
> but sometimes it might even have already exited. The pod remains,
> though, as long as it's needed. That's the only certainty we have.

Keeping the pod around is fine.  What needs to change is that the
guest's IP(s) needs to be removed from the source host before qemu
(and therefore passt) is terminated.  The pod must have at least one
other IP, or it would be impossible to perform the migration in the
first place.

This essentially matches the situation for bridged networking: with
the source guest suspended the source host will no longer respond to
the guest IP

> So, do we want to drop --one-off from the libvirt integration, and have
> libvirt manage passt's lifecycle entirely (note that all users outside
> KubeVirt don't use migration, so we would make the general case vastly
> more complicated for the sake of correctness for a single usage...)?

Hmm.. if I understand correctly the network swizzling is handled by
KubeVirt, not libvirt.  I'm hoping that means there's a suitable point
at which it can remove the IP without having to alter libvirt.

> Well, we can try to do that. Except that libvirt doesn't know either
> for how long this traffic will reach the source pod (that's a KubeVirt
> concept). So it should implement the same hack: let it outlive QEMU on
> migration... as long as we have that issue in KubeVirt.
> 
> But I asked KubeVirt people, and it turns out that it's extremely
> complicated to fix this in KubeVirt. So, actually, I don't see another
> way to fix this in the short term. And without KubeVirt using this we
> could also drop the whole feature...
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v3] treewide: By default, don't quit source after migration, keep sockets open
  2025-07-25  6:50     ` David Gibson
@ 2025-07-25  8:21       ` Stefano Brivio
  0 siblings, 0 replies; 5+ messages in thread
From: Stefano Brivio @ 2025-07-25  8:21 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev, Nir Dothan

On Fri, 25 Jul 2025 16:50:23 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Jul 25, 2025 at 07:10:58AM +0200, Stefano Brivio wrote:
> > On Fri, 25 Jul 2025 14:04:17 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Thu, Jul 24, 2025 at 07:28:58PM +0200, Stefano Brivio wrote:  
> > > > We are hitting an issue in the KubeVirt integration where some data is
> > > > still sent to the source instance even after migration is complete. As
> > > > we exit, the kernel closes our sockets and resets connections. The
> > > > resulting RST segments are sent to peers, effectively terminating
> > > > connections that were meanwhile migrated.
> > > > 
> > > > At the moment, this is not done intentionally, but in the future
> > > > KubeVirt might enable OVN-Kubernetes features where source and
> > > > destination nodes are explicitly getting mirrored traffic for a while,
> > > > in order to decrease migration downtime.
> > > > 
> > > > By default, don't quit after migration is completed on the source: the
> > > > previous behaviour can be enabled with the new, but deprecated,
> > > > --migrate-exit option. After migration (as source), the -1 / --one-off
> > > > option has no effect.
> > > > 
> > > > Also, by default, keep migrated TCP sockets open (in repair mode) as
> > > > long as we're running, and ignore events on any epoll descriptor
> > > > representing data channels. The previous behaviour can be enabled with
> > > > the new, equally deprecated, --migrate-no-linger option.
> > > > 
> > > > By keeping sockets open, and not exiting, we prevent the kernel
> > > > running on the source node to send out RST segments if further data
> > > > reaches us.
> > > > 
> > > > Reported-by: Nir Dothan <ndothan@redhat.com>
> > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > > ---
> > > > v2:
> > > >   - assorted changes in commit message
> > > >   - context variable ignore_linger becomes ignore_no_linger
> > > >   - new options are deprecated
> > > >   - don't ignore events on some descriptors, drop them from epoll
> > > > 
> > > > v3:
> > > >   - Nir reported occasional failures (connections being reset)
> > > >     with both v1 and v2, because, in KubeVirt's usage, we quit as
> > > >     QEMU exits. Disable --one-off after migration as source, and
> > > >     document this exception    
> > > 
> > > This seems like an awful, awful hack.  
> > 
> > Well, of course, it is, and long term it should be fixed in
> > either KubeVirt or libvirt (even though I'm not sure how, see below)
> > instead.  
> 
> But this hack means that even when it's fixed we'll still have this
> wildly counterintuitive behaviour that every future user will have to
> work around.

No, why? We can change that as well. We changed semantics of options
in the past and I don't see an issue doing that as long as we
coordinate things to a reasonable extent (like we do with Podman and
rootlesskit and with distributions and LSMs...).

This is just to get things working properly in KubeVirt 1.6 as far as
I'm concerned. Otherwise they can as well drop the whole feature (at
least, that would be my recommendation).

> There's no sensible internal reason for out-migration to
> affect lifetime, it's a workaround for problems that are quite
> specific to this stack of layers above.
> 
> > > We're abandoning consistent
> > > semantics on a wild guess as to what the layers above us need.  
> > 
> > No, not really, we tested this and tested the alternative.  
> 
> With just one use case.

...better than zero?

> Creating semantics to work with exactly how
> something is used now, without thought to whether they make sense in
> general is the definition of fragile software.

...better than useless?

> > > Specifically, --once-off used to mean that the layer above us didn't  
> > 
> > --one-off
> >   
> > > need to manage passt's lifetime; it was tied to qemu's.  Now it still
> > > needs to manually manage passt's lifetime, so what's the point.  So,
> > > if it needs passt to outlive qemu it should actually manage that and
> > > not use --once-off.  
> > 
> > The main point is that it does *not* manually manage passt's lifetime
> > if there's no migration (which is the general case for libvirt and all
> > other users).  
> 
> That's exactly my point.  With this hack it's neither one model nor
> the other so you have to be aware of both.

Current users except for KubeVirt use --one-off with that model, and we
surely want and need to keep that.

Now it turns out that there's an issue with KubeVirt and that (obvious)
model, so here's a workaround for the only documented user of the
migration feature, because it *currently* *needs* the other (obviously
wrong) model.

> > We don't have any other user with an implementation of the migration
> > workflow anyway (libvirt itself doesn't do that, yet). It's otherwise
> > unusable for KubeVirt. So I'd say let's fix it for the only user we
> > have.  
> 
> Please not at the expense of forcing every future user to deal with
> this suckage.

That's not the case. We can (and really should) fix this in passt
later. We need anyway to rework a fair amount of code here because, for
example, as you mentioned, listening sockets are still there.

> > > Requring passt to outlive qemu already seems pretty dubious to me:
> > > having the source still connected when passt was quitting is one thing
> > > - indeed it's arguably hard to avoid.  Having it still connected when
> > > *qemu* quits is much less defensible.  
> > 
> > The fundamental problem here is that there's an issue in KubeVirt
> > (and working around it is the whole point of this patch) which implies
> > that packets are sent to the source pod *for a while* after migration.
> > 
> > We found out that the guest is generally suspended during that while,
> > but sometimes it might even have already exited. The pod remains,
> > though, as long as it's needed. That's the only certainty we have.  
> 
> Keeping the pod around is fine.  What needs to change is that the
> guest's IP(s) needs to be removed from the source host before qemu
> (and therefore passt) is terminated.  The pod must have at least one
> other IP, or it would be impossible to perform the migration in the
> first place.

Maybe, yes. I'm not sure if it's doable.

> This essentially matches the situation for bridged networking: with
> the source guest suspended the source host will no longer respond to
> the guest IP
> 
> > So, do we want to drop --one-off from the libvirt integration, and have
> > libvirt manage passt's lifecycle entirely (note that all users outside
> > KubeVirt don't use migration, so we would make the general case vastly
> > more complicated for the sake of correctness for a single usage...)?  
> 
> Hmm.. if I understand correctly the network swizzling is handled by
> KubeVirt, not libvirt.

That's OVN-Kubernetes in KubeVirt's case.

> I'm hoping that means there's a suitable point
> at which it can remove the IP without having to alter libvirt.

I hope so too, eventually. Or we could make sure that QEMU is alive as
long as needed, this is probably easier to ensure from virt-launcher.

I haven't looked at the details yet, but in passt it's one line and we
can drop it later as needed, in KubeVirt it's probably much more
complicated than that.

> > Well, we can try to do that. Except that libvirt doesn't know either
> > for how long this traffic will reach the source pod (that's a KubeVirt
> > concept). So it should implement the same hack: let it outlive QEMU on
> > migration... as long as we have that issue in KubeVirt.
> > 
> > But I asked KubeVirt people, and it turns out that it's extremely
> > complicated to fix this in KubeVirt. So, actually, I don't see another
> > way to fix this in the short term. And without KubeVirt using this we
> > could also drop the whole feature...

-- 
Stefano


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-07-25  8:21 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-24 17:28 [PATCH v3] treewide: By default, don't quit source after migration, keep sockets open Stefano Brivio
2025-07-25  4:04 ` David Gibson
2025-07-25  5:10   ` Stefano Brivio
2025-07-25  6:50     ` David Gibson
2025-07-25  8:21       ` Stefano Brivio

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).