From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top, Nir Dothan <ndothan@redhat.com>
Subject: Re: [PATCH v2] treewide: By default, don't quit source after migration, keep sockets open
Date: Wed, 23 Jul 2025 10:22:35 +1000 [thread overview]
Message-ID: <aIArS3IqbB8pYIzr@zatzit> (raw)
In-Reply-To: <20250722211244.3129523-1-sbrivio@redhat.com>
[-- Attachment #1: Type: text/plain, Size: 10001 bytes --]
On Tue, Jul 22, 2025 at 11:12:44PM +0200, Stefano Brivio wrote:
> We are hitting an issue in the KubeVirt integration where some data is
> still sent to the source instance even after migration is complete. As
> we exit, the kernel closes our sockets and resets connections. The
> resulting RST segments are sent to peers, effectively terminating
> connections that were meanwhile migrated.
>
> At the moment, this is not done intentionally, but in the future
> KubeVirt might enable OVN-Kubernetes features where source and
> destination nodes are explicitly getting mirrored traffic for a while,
> in order to decrease migration downtime.
>
> By default, don't quit after migration is completed on the source: the
> previous behaviour can be enabled with the new, but deprecated,
> --migrate-exit option.
>
> Also, by default, keep migrated TCP sockets open (in repair mode) as
> long as we're running, and ignore events on any epoll descriptor
> representing data channels. The previous behaviour can be enabled with
> the new, equally deprecated, --migrate-no-linger option.
>
> By keeping sockets open, and not exiting, we prevent the kernel
> running on the source node to send out RST segments if further data
> reaches us.
>
> Reported-by: Nir Dothan <ndothan@redhat.com>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> v2:
> - assorted changes in commit message
> - context variable ignore_linger becomes ignore_no_linger
> - new options are deprecated
> - don't ignore events on some descriptors, drop them from epoll
>
> conf.c | 22 ++++++++++++++++++++++
> flow.c | 2 +-
> passt.1 | 24 ++++++++++++++++++++++++
> passt.h | 4 ++++
> tcp.c | 9 +++++++--
> tcp_conn.h | 3 ++-
> test/lib/setup | 4 ++--
> vhost_user.c | 8 ++++++--
> 8 files changed, 68 insertions(+), 8 deletions(-)
>
> diff --git a/conf.c b/conf.c
> index 6c747aa..f47f48e 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -864,6 +864,14 @@ static void usage(const char *name, FILE *f, int status)
> FPRINTF(f,
> " --repair-path PATH path for passt-repair(1)\n"
> " default: append '.repair' to UNIX domain path\n");
> + FPRINTF(f,
> + " --migrate-exit DEPRECATED:\n"
> + " source quits after migration\n"
> + " default: source keeps running after migration\n");
> + FPRINTF(f,
> + " --migrate-no-linger DEPRECATED:\n"
> + " close sockets on migration\n"
> + " default: keep sockets open, ignore events\n");
> }
>
> FPRINTF(f,
> @@ -1470,6 +1478,8 @@ void conf(struct ctx *c, int argc, char **argv)
> {"socket-path", required_argument, NULL, 's' },
> {"fqdn", required_argument, NULL, 27 },
> {"repair-path", required_argument, NULL, 28 },
> + {"migrate-exit", no_argument, NULL, 29 },
> + {"migrate-no-linger", no_argument, NULL, 30 },
> { 0 },
> };
> const char *optstring = "+dqfel:hs:F:I:p:P:m:a:n:M:g:i:o:D:S:H:461t:u:T:U:";
> @@ -1686,6 +1696,18 @@ void conf(struct ctx *c, int argc, char **argv)
> optarg))
> die("Invalid passt-repair path: %s", optarg);
>
> + break;
> + case 29:
> + if (c->mode != MODE_VU)
> + die("--migrate-exit is for vhost-user mode only");
> + c->migrate_exit = true;
> +
> + break;
> + case 30:
> + if (c->mode != MODE_VU)
> + die("--migrate-no-linger is for vhost-user mode only");
> + c->migrate_no_linger = true;
> +
> break;
> case 'd':
> c->debug = 1;
> diff --git a/flow.c b/flow.c
> index 00885f6..feefda3 100644
> --- a/flow.c
> +++ b/flow.c
> @@ -1091,7 +1091,7 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
> * as EIO).
> */
> foreach_established_tcp_flow(flow) {
> - rc = tcp_flow_migrate_source_ext(fd, &flow->tcp);
> + rc = tcp_flow_migrate_source_ext(c, fd, &flow->tcp);
> if (rc) {
> flow_err(flow, "Can't send extended data: %s",
> strerror_(-rc));
> diff --git a/passt.1 b/passt.1
> index 60066c2..94363b7 100644
> --- a/passt.1
> +++ b/passt.1
> @@ -439,6 +439,30 @@ Default, for \-\-vhost-user mode only, is to append \fI.repair\fR to the path
> chosen for the hypervisor UNIX domain socket. No socket is created if not in
> \-\-vhost-user mode.
>
> +.TP
> +.BR \-\-migrate-exit (DEPRECATED)
> +Exit after a completed migration as source. By default, \fBpasst\fR keeps
> +running and the migrated guest can continue using its connection, or a new guest
> +can connect.
> +
> +Note that this configuration option is \fBdeprecated\fR and will be removed in a
> +future version. It is not expected to be of any use, and it simply reflects a
> +legacy behaviour. If you have any use for this, refer to \fBREPORTING BUGS\fR
> +below.
> +
> +.TP
> +.BR \-\-migrate-no-linger (DEPRECATED)
> +Close TCP sockets on the source instance once migration completes.
> +
> +By default, sockets are kept open, and events on data sockets are ignored, so
> +that any further message reaching sockets after the source migrated is silently
> +ignored, to avoid connection resets in case data is received after migration.
> +
> +Note that this configuration option is \fBdeprecated\fR and will be removed in a
> +future version. It is not expected to be of any use, and it simply reflects a
> +legacy behaviour. If you have any use for this, refer to \fBREPORTING BUGS\fR
> +below.
> +
> .TP
> .BR \-F ", " \-\-fd " " \fIFD
> Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened
> diff --git a/passt.h b/passt.h
> index 8693794..4cfd6eb 100644
> --- a/passt.h
> +++ b/passt.h
> @@ -241,6 +241,8 @@ struct ip6_ctx {
> * @device_state_fd: Device state migration channel
> * @device_state_result: Device state migration result
> * @migrate_target: Are we the target, on the next migration request?
> + * @migrate_no_linger: Close sockets as we migrate them
> + * @migrate_exit: Exit (on source) once migration is complete
> */
> struct ctx {
> enum passt_modes mode;
> @@ -318,6 +320,8 @@ struct ctx {
> int device_state_fd;
> int device_state_result;
> bool migrate_target;
> + bool migrate_no_linger;
> + bool migrate_exit;
> };
>
> void proto_update_l2_buf(const unsigned char *eth_d,
> diff --git a/tcp.c b/tcp.c
> index 2b88466..b10ad8a 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -3286,12 +3286,14 @@ int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn)
>
> /**
> * tcp_flow_migrate_source_ext() - Dump queues, close sockets, send final data
> + * @c: Execution context
> * @fd: Descriptor for state migration
> * @conn: Pointer to the TCP connection structure
> *
> * Return: 0 on success, negative (not -EIO) on failure, -EIO on sending failure
> */
> -int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn)
> +int tcp_flow_migrate_source_ext(struct ctx *c,
> + int fd, const struct tcp_tap_conn *conn)
> {
> uint32_t peek_offset = conn->seq_to_tap - conn->seq_ack_from_tap;
> struct tcp_tap_transfer_ext *t = &migrate_ext[FLOW_IDX(conn)];
> @@ -3336,7 +3338,10 @@ int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn)
> if ((rc = tcp_flow_dump_seq(conn, &t->seq_rcv)))
> goto fail;
>
> - close(s);
> + if (c->migrate_no_linger)
> + close(s);
> + else
> + epoll_del(c, s);
>
> /* Adjustments unrelated to FIN segments: sequence numbers we dumped are
> * based on the end of the queues.
> diff --git a/tcp_conn.h b/tcp_conn.h
> index 35d813d..d49ae88 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -236,7 +236,8 @@ int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn);
> int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn);
>
> int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn);
> -int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn);
> +int tcp_flow_migrate_source_ext(struct ctx *c, int fd,
> + const struct tcp_tap_conn *conn);
>
> int tcp_flow_migrate_target(struct ctx *c, int fd);
> int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd);
> diff --git a/test/lib/setup b/test/lib/setup
> index 575bc21..5994598 100755
> --- a/test/lib/setup
> +++ b/test/lib/setup
> @@ -350,7 +350,7 @@ setup_migrate() {
>
> sleep 1
>
> - __opts="--vhost-user"
> + __opts="--vhost-user --migrate-exit --migrate-no-linger"
> [ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_1.pcap"
> [ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
> [ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
> @@ -360,7 +360,7 @@ setup_migrate() {
>
> context_run_bg passt_repair_1 "./passt-repair ${STATESETUP}/passt_1.socket.repair"
>
> - __opts="--vhost-user"
> + __opts="--vhost-user --migrate-exit --migrate-no-linger"
> [ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_2.pcap"
> [ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
> [ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
> diff --git a/vhost_user.c b/vhost_user.c
> index e7fb049..4bfa941 100644
> --- a/vhost_user.c
> +++ b/vhost_user.c
> @@ -1207,7 +1207,11 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
> if (vmsg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE &&
> vdev->context->device_state_result == 0 &&
> !vdev->context->migrate_target) {
> - info("Migration complete, exiting");
> - _exit(EXIT_SUCCESS);
> + if (vdev->context->migrate_exit) {
> + info("Migration complete, exiting");
> + _exit(EXIT_SUCCESS);
> + }
> +
> + info("Migration complete");
> }
> }
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
prev parent reply other threads:[~2025-07-23 0:22 UTC|newest]
Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-07-22 21:12 [PATCH v2] treewide: By default, don't quit source after migration, keep sockets open Stefano Brivio
2025-07-23 0:22 ` David Gibson [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aIArS3IqbB8pYIzr@zatzit \
--to=david@gibson.dropbear.id.au \
--cc=ndothan@redhat.com \
--cc=passt-dev@passt.top \
--cc=sbrivio@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).