From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au Authentication-Results: passt.top; dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202506 header.b=18ciRchf; dkim-atps=neutral Received: from mail.ozlabs.org (gandalf.ozlabs.org [150.107.74.76]) by passt.top (Postfix) with ESMTPS id B38E15A0280 for ; Tue, 22 Jul 2025 02:33:11 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202506; t=1753144231; bh=NaKInDGD9Ls0sJuTztkVRzgEBj7qevM7QYyYXFvrJjo=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=18ciRchfaGbnYEJzOAjeOmtkO76vZ8SwnLFnIB683uxBEd29xbNpG84rjUvJ7URBk MM0ETfc66vGz9pa2uK3XKq1WUHvfzQymFtnSw9Kr65iBh70P8VEw8EcbiVd84stG0n XngBe90k1YqqSo8+gKg6CMHysL+JD3sMv9oU8EGJK4/rBsPcLsJvIQCorbBQB5uozI DGxUBr/KRQVIYY7kz3UQkuVQJ8DBXxwS3AhuF15XVhPEuVlMx7Y56cWksoFgi7v4m+ ecGNcHt9PKLjFZEV9vEmuCtEPIymkLpFgijO72VRzFIy98jKctN2bax9J9JiUCMnFW 7fSmvvCSyjK6A== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4bmJ8742tSz4wvb; Tue, 22 Jul 2025 10:30:31 +1000 (AEST) Date: Tue, 22 Jul 2025 10:33:00 +1000 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH] treewide: By default, don't quit source after migration, keep sockets open Message-ID: References: <20250721221258.2874863-1-sbrivio@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="2wPL1/Qbhn7iXdBc" Content-Disposition: inline In-Reply-To: <20250721221258.2874863-1-sbrivio@redhat.com> Message-ID-Hash: MMHKV7AXVUEVOGY5HLOTWFZXNT5JDT2L X-Message-ID-Hash: MMHKV7AXVUEVOGY5HLOTWFZXNT5JDT2L X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, Nir Dothan X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --2wPL1/Qbhn7iXdBc Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Tue, Jul 22, 2025 at 12:12:58AM +0200, Stefano Brivio wrote: > We are hitting an issue in the KubeVirt integration where some data is > still sent to the source instance even after migration is complete. As > we exit, the kernel closes our sockets and resets connections. The > resulting RST segments are sent to peers, effectively terminating > connections that were meanwhile migrated. >=20 > At the moment, this is not done intentionally, but in the future > KubeVirt might enable OVN-Kubernetes features where source and > destination nodes are explicitly getting mirrored traffic for a while, > in order to decrease migration downtime. >=20 > By default, don't quit after migration is completed on the source: the > previous behaviour can be enabled with the new --migrate-exit > option. Might be worth explicitly mentioning that the reason this helps is that having the sockets still around in repair mode prevents the source node from responding to any traffic for them. >=20 > Also, by default, keep migrated TCP sockets open (in repair mode) as > long as we're running, and ignore events on any epoll descriptor > representing data channels. The previous (and, strictly speaking, > correct) I'm not necessarily convinced the previous behaviour is more correct. As we've discussed the new behaviour is more in keeping with the normal qemu migration model, and with further work could let us allow a resume on the source node even if there's a failure late in migration (as long as it's before the target resumes, of course). As with --migrate-exit the new behaviour is also more robust against the exact timing of network switchover. I think that's generally desirable, even if it's arguably not strictly necessary. > behaviour can be enabled with the new --migrate-no-linger > option. >=20 > Reported-by: Nir Dothan > Signed-off-by: Stefano Brivio Reviewed-by: David Gibson =2E.. in the sense that I think it does accomplish what it sets out to do. There are several things that I think would improve it, see details below. > --- > conf.c | 23 +++++++++++++++++++++++ > epoll_type.h | 12 ++++++++---- > flow.c | 2 +- > passt.1 | 14 ++++++++++++++ > passt.c | 6 +++++- > passt.h | 6 ++++++ > tcp.c | 7 +++++-- > tcp_conn.h | 3 ++- > test/lib/setup | 4 ++-- > vhost_user.c | 10 ++++++++-- > 10 files changed, 74 insertions(+), 13 deletions(-) >=20 > diff --git a/conf.c b/conf.c > index 6c747aa..5e69014 100644 > --- a/conf.c > +++ b/conf.c > @@ -864,6 +864,12 @@ static void usage(const char *name, FILE *f, int sta= tus) > FPRINTF(f, > " --repair-path PATH path for passt-repair(1)\n" > " default: append '.repair' to UNIX domain path\n"); > + FPRINTF(f, > + " --migrate-exit source quits after migration\n" > + " default: source keeps running after migration\n"); > + FPRINTF(f, > + " --migrate-no-linger close sockets on migration\n" > + " default: keep sockets open, ignore data events\n"); I'm a bit uncomfortable adding this as regular, supported options. I'm hoping we can eliminate them in the not too distant future. In the medium term we can do that by improving our test code to simulate different hosts with namespaces. Longer term we can allow local to local migration without these options by fixing the kernel: I don't believe bind() should fail if the address conflict is with a socket in repair mode. Further, as user accessible options these are kinda weird - what they do is defined in terms of some pretty obscure internals, not in terms of what the upshot is. So, I think it would be better to introduce these as deprecated / debug-only / undocumented options. > } > =20 > FPRINTF(f, > @@ -1470,6 +1476,8 @@ void conf(struct ctx *c, int argc, char **argv) > {"socket-path", required_argument, NULL, 's' }, > {"fqdn", required_argument, NULL, 27 }, > {"repair-path", required_argument, NULL, 28 }, > + {"migrate-exit", no_argument, NULL, 29 }, > + {"migrate-no-linger", no_argument, NULL, 30 }, > { 0 }, > }; > const char *optstring =3D "+dqfel:hs:F:I:p:P:m:a:n:M:g:i:o:D:S:H:461t:u= :T:U:"; > @@ -1495,6 +1503,9 @@ void conf(struct ctx *c, int argc, char **argv) > fwd_default =3D FWD_AUTO; > } > =20 > + if (c->mode =3D=3D MODE_VU) No reason for the conditional. It's irrelevant but harmless in other cases. > + c->migrate_linger =3D true; Nicer to avoid the explicit initialisation by inverting the sense of this variable - plus I think it's a little nicer to have 'true' be the abnormal case. > if (tap_l2_max_len(c) - ETH_HLEN < max_mtu) > max_mtu =3D tap_l2_max_len(c) - ETH_HLEN; > c->mtu =3D ROUND_DOWN(max_mtu, sizeof(uint32_t)); > @@ -1686,6 +1697,18 @@ void conf(struct ctx *c, int argc, char **argv) > optarg)) > die("Invalid passt-repair path: %s", optarg); > =20 > + break; > + case 29: > + if (c->mode !=3D MODE_VU) > + die("--migrate-exit is for vhost-user mode only"); > + c->migrate_exit =3D true; > + > + break; > + case 30: > + if (c->mode !=3D MODE_VU) > + die("--migrate-no-linger is for vhost-user mode only"); > + c->migrate_linger =3D false; > + > break; > case 'd': > c->debug =3D 1; > diff --git a/epoll_type.h b/epoll_type.h > index 12ac59b..f2991b6 100644 > --- a/epoll_type.h > +++ b/epoll_type.h > @@ -12,6 +12,7 @@ > enum epoll_type { > /* Special value to indicate an invalid type */ > EPOLL_TYPE_NONE =3D 0, > + > /* Connected TCP sockets */ > EPOLL_TYPE_TCP, > /* Connected TCP sockets (spliced) */ > @@ -26,16 +27,19 @@ enum epoll_type { > EPOLL_TYPE_UDP, > /* ICMP/ICMPv6 ping sockets */ > EPOLL_TYPE_PING, > - /* inotify fd watching for end of netns (pasta) */ > - EPOLL_TYPE_NSQUIT_INOTIFY, > - /* timer fd watching for end of netns, fallback for inotify (pasta) */ > - EPOLL_TYPE_NSQUIT_TIMER, > /* tuntap character device */ > EPOLL_TYPE_TAP_PASTA, > /* socket connected to qemu */ > EPOLL_TYPE_TAP_PASST, > /* socket listening for qemu socket connections */ > EPOLL_TYPE_TAP_LISTEN, > + /* End of event types involving data transfers or connections */ > + EPOLL_TYPE_DATA_MAX =3D EPOLL_TYPE_TAP_LISTEN, > + > + /* inotify fd watching for end of netns (pasta) */ > + EPOLL_TYPE_NSQUIT_INOTIFY, > + /* timer fd watching for end of netns, fallback for inotify (pasta) */ > + EPOLL_TYPE_NSQUIT_TIMER, > /* vhost-user command socket */ > EPOLL_TYPE_VHOST_CMD, > /* vhost-user kick event socket */ > diff --git a/flow.c b/flow.c > index 00885f6..feefda3 100644 > --- a/flow.c > +++ b/flow.c > @@ -1091,7 +1091,7 @@ int flow_migrate_source(struct ctx *c, const struct= migrate_stage *stage, > * as EIO). > */ > foreach_established_tcp_flow(flow) { > - rc =3D tcp_flow_migrate_source_ext(fd, &flow->tcp); > + rc =3D tcp_flow_migrate_source_ext(c, fd, &flow->tcp); > if (rc) { > flow_err(flow, "Can't send extended data: %s", > strerror_(-rc)); > diff --git a/passt.1 b/passt.1 > index 60066c2..b85aaa0 100644 > --- a/passt.1 > +++ b/passt.1 > @@ -439,6 +439,20 @@ Default, for \-\-vhost-user mode only, is to append = \fI.repair\fR to the path > chosen for the hypervisor UNIX domain socket. No socket is created if no= t in > \-\-vhost-user mode. > =20 > +.TP > +.BR \-\-migrate-exit > +Exit after a completed migration as source. By default, \fBpasst\fR keeps > +running and the migrated guest can continue using its connection, or a n= ew guest > +can connect. > + > +.TP > +.BR \-\-migrate-no-linger > +Close TCP sockets on the source instance once migration completes. > + > +By default, sockets are kept open, and events on data sockets are ignore= d, so > +that any further message reaching sockets after the source migrated is s= ilently > +ignored, to avoid connection resets in case data is received after migra= tion. > + > .TP > .BR \-F ", " \-\-fd " " \fIFD > Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket i= s opened > diff --git a/passt.c b/passt.c > index 388d10f..f4c108f 100644 > --- a/passt.c > +++ b/passt.c > @@ -308,6 +308,9 @@ loop: > c.mode =3D=3D MODE_PASTA ? "pasta" : "passt", > EPOLL_TYPE_STR(ref.type), ref.fd, eventmask); > =20 > + if (c.ignore_data_events && ref.type <=3D EPOLL_TYPE_DATA_MAX) > + continue; > + This worries me slightly. I think it will be fine if passt is killed not long after the migration. However, if passt lingers, we're ignoring, but not clearing events. For those events which are level-triggered this could make us chew CPU on a tight epoll_wait() loop. I think a preferable approach would be to EPOLL_CTL_DEL each socket at the point we would close() them with --migrate-no-linger. I'd be fine with that as a follow up improvement, though. > switch (ref.type) { > case EPOLL_TYPE_TAP_PASTA: > tap_handler_pasta(&c, eventmask, &now); > @@ -363,7 +366,8 @@ loop: > } > } > =20 > - post_handler(&c, &now); > + if (!c.ignore_data_events) > + post_handler(&c, &now); > =20 > migrate_handler(&c); > =20 > diff --git a/passt.h b/passt.h > index 8693794..636b3c3 100644 > --- a/passt.h > +++ b/passt.h > @@ -241,6 +241,9 @@ struct ip6_ctx { > * @device_state_fd: Device state migration channel > * @device_state_result: Device state migration result > * @migrate_target: Are we the target, on the next migration request? > + * @migrate_linger: Keep sockets open after migration, ignore data events > + * @migrate_exit: Exit (on source) once migration is complete > + * @ignore_data_events: Ignore data events (for migration, source instan= ce) > */ > struct ctx { > enum passt_modes mode; > @@ -318,6 +321,9 @@ struct ctx { > int device_state_fd; > int device_state_result; > bool migrate_target; > + bool migrate_linger; > + bool migrate_exit; > + bool ignore_data_events; > }; > =20 > void proto_update_l2_buf(const unsigned char *eth_d, > diff --git a/tcp.c b/tcp.c > index 2b88466..1f7a6ab 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -3286,12 +3286,14 @@ int tcp_flow_migrate_source(int fd, struct tcp_ta= p_conn *conn) > =20 > /** > * tcp_flow_migrate_source_ext() - Dump queues, close sockets, send fina= l data > + * @c: Execution context > * @fd: Descriptor for state migration > * @conn: Pointer to the TCP connection structure > * > * Return: 0 on success, negative (not -EIO) on failure, -EIO on sending= failure > */ > -int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn) > +int tcp_flow_migrate_source_ext(struct ctx *c, > + int fd, const struct tcp_tap_conn *conn) > { > uint32_t peek_offset =3D conn->seq_to_tap - conn->seq_ack_from_tap; > struct tcp_tap_transfer_ext *t =3D &migrate_ext[FLOW_IDX(conn)]; > @@ -3336,7 +3338,8 @@ int tcp_flow_migrate_source_ext(int fd, const struc= t tcp_tap_conn *conn) > if ((rc =3D tcp_flow_dump_seq(conn, &t->seq_rcv))) > goto fail; > =20 > - close(s); > + if (!c->migrate_linger) > + close(s); > =20 > /* Adjustments unrelated to FIN segments: sequence numbers we dumped are > * based on the end of the queues. > diff --git a/tcp_conn.h b/tcp_conn.h > index 35d813d..d49ae88 100644 > --- a/tcp_conn.h > +++ b/tcp_conn.h > @@ -236,7 +236,8 @@ int tcp_flow_repair_on(struct ctx *c, const struct tc= p_tap_conn *conn); > int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn); > =20 > int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn); > -int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn); > +int tcp_flow_migrate_source_ext(struct ctx *c, int fd, > + const struct tcp_tap_conn *conn); > =20 > int tcp_flow_migrate_target(struct ctx *c, int fd); > int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn= , int fd); > diff --git a/test/lib/setup b/test/lib/setup > index 575bc21..5994598 100755 > --- a/test/lib/setup > +++ b/test/lib/setup > @@ -350,7 +350,7 @@ setup_migrate() { > =20 > sleep 1 > =20 > - __opts=3D"--vhost-user" > + __opts=3D"--vhost-user --migrate-exit --migrate-no-linger" > [ ${PCAP} -eq 1 ] && __opts=3D"${__opts} -p ${LOGDIR}/passt_1.pcap" > [ ${DEBUG} -eq 1 ] && __opts=3D"${__opts} -d" > [ ${TRACE} -eq 1 ] && __opts=3D"${__opts} --trace" > @@ -360,7 +360,7 @@ setup_migrate() { > =20 > context_run_bg passt_repair_1 "./passt-repair ${STATESETUP}/passt_1.soc= ket.repair" > =20 > - __opts=3D"--vhost-user" > + __opts=3D"--vhost-user --migrate-exit --migrate-no-linger" > [ ${PCAP} -eq 1 ] && __opts=3D"${__opts} -p ${LOGDIR}/passt_2.pcap" > [ ${DEBUG} -eq 1 ] && __opts=3D"${__opts} -d" > [ ${TRACE} -eq 1 ] && __opts=3D"${__opts} --trace" > diff --git a/vhost_user.c b/vhost_user.c > index e7fb049..7fd27dc 100644 > --- a/vhost_user.c > +++ b/vhost_user.c > @@ -1207,7 +1207,13 @@ void vu_control_handler(struct vu_dev *vdev, int f= d, uint32_t events) > if (vmsg.hdr.request =3D=3D VHOST_USER_CHECK_DEVICE_STATE && > vdev->context->device_state_result =3D=3D 0 && > !vdev->context->migrate_target) { > - info("Migration complete, exiting"); > - _exit(EXIT_SUCCESS); > + if (vdev->context->migrate_exit) { > + info("Migration complete, exiting"); > + _exit(EXIT_SUCCESS); > + } > + > + info("Migration complete"); > + if (vdev->context->migrate_linger) > + vdev->context->ignore_data_events =3D true; > } > } --=20 David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson --2wPL1/Qbhn7iXdBc Content-Type: application/pgp-signature; name=signature.asc -----BEGIN PGP SIGNATURE----- iQIzBAEBCgAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmh+3DYACgkQzQJF27ox 2GcaCQ//eUdgKKc76pLzPyttQjwbeAAJ7MjJnDwr5R2Gd6LTJ7PC2BBa1gWlbdPf M2mCmBglO3tq7cawXkIzTh5boxMFH23RfUrQdOHBgIhBJQudl0xMv1HrgG/hiGbR EmkfqgXyqn8/Yd1Rlem7BZ/9Sbanh8GsJoNm8cRn6wNR3KN7UgEt57ts6X51vzsM yGNjboIeKvfFLSYWRJopO2PpBmX+qnHzUUVFAafjDgUhQwG6A5R4YPWv/FWCFlOc KzR2ucPGpzNPRCCxvonk9PJ2VPnpX8CGbeMvlt4PhrWKkkzu5KCbQIq64bwlVLIF 8i6MxO+HQQoiWFCkTa3uX4OBzfFQWgu+yA5Hu7pgHXW0BN/2d/3elCX4UHVS9rS/ mvr5syuUYoZ9PgcgzFPjHS6uSK/6aWQcOhy3r116QoMAZe8XZHvl539WtfDGBg+0 HcL1zCDkWrmtIoBt7kJ+alKSIR6WuNV4RWfEaJHLen7kJ7KmytUHkVMkkqpA2rpq /eDrg1m07blANiZFvnrPAF3z30XCDgya9jM1W7gfoMQBhn1JsRAJE2cnNW5jMkRZ rr/3xQ1sRdvLIjNalT4MvoiDVn0xMnZHAJT1flMxFLk7KsRNf4JXO4OS3nAvzkzh mvkMjd8KZaUKRe43PiC2fbrodR6j19E6ltnWo6LY2e3/4qtR85E= =URqA -----END PGP SIGNATURE----- --2wPL1/Qbhn7iXdBc--