From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: passt.top; dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=WEMJgxob; dkim-atps=neutral Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by passt.top (Postfix) with ESMTPS id C8D6B5A0282 for ; Tue, 22 Jul 2025 23:12:56 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1753218775; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=PQdiaFkSfwrydcZvglxc6ExWZYiA0XlCR5DZQU2PYIQ=; b=WEMJgxoboUJvwMQVZQehMjnhtoMN8jGleZnmveH5or/knR8r2AYiBK/hd38cVeV13B2TwR oHNdunLaQjXIMMQeeE3rpVQw/950p+ILFyMyDQIbo0zwX++aZBXsl1UeAqi7BCJL62UDGw SeHxsV8HFrZ6kXDoXsa4JQiUGn3U4Dg= Received: from mail-wr1-f69.google.com (mail-wr1-f69.google.com [209.85.221.69]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-595-QWtwTUlGN2mFGlIE-oP0-Q-1; Tue, 22 Jul 2025 17:12:54 -0400 X-MC-Unique: QWtwTUlGN2mFGlIE-oP0-Q-1 X-Mimecast-MFC-AGG-ID: QWtwTUlGN2mFGlIE-oP0-Q_1753218773 Received: by mail-wr1-f69.google.com with SMTP id ffacd0b85a97d-3a4f7ebfd00so2504102f8f.2 for ; Tue, 22 Jul 2025 14:12:53 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1753218773; x=1753823573; h=content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:from:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=PQdiaFkSfwrydcZvglxc6ExWZYiA0XlCR5DZQU2PYIQ=; b=aSbt57EuLSXqpNiav3Bwtaf13A1XU9YytUGmiu+4KnGMczHRU4LwW5iAnz4oA4S3mT HwawWBSWKvy8upCZuQ3+5pDX+I9Wlema39Tytg+fOWWyiq3bcFkqJiMF0ImRiIpo0FIn w3XYWLUM2vJHVh0A4THxr/NodsozoSaGxqQAG5T/KemJud//Kbydok5seCQQhoKcGakb XkY3ETbcRE0m+GUXXL6daJmAUIqb5+LWtpSV2UKvDWJGhVhtsy5aFJZUfoBV1snFJgn1 tx8+1D103yJevqfl2g6/KVPn2XULLbSTYjh5iEo5q9fFjFgNJSBiGm67CY5/uZhyYYkg BCIQ== X-Gm-Message-State: AOJu0YxhqUpVKjtZTIh0FLIeFDbpqtlMqzkRU5aKDP6pmRK/V81tEoYV zxO+mKLz8j/8v3Ecsc3Klc8TFL3TD1W5vUqGONAg7dSHgCC08ThAwNdeK692spPMUuiYQ5g8Sq3 8ubb+N2AWH4FWdw4/aIYsphgiy3XZXdVK7fPtsMb0a2SVbqYWdoXxVg== X-Gm-Gg: ASbGnctccrRAo+mtqODpFfwy8RAeh3nDBSIrVlg3eKC3bB+ctl92PmsEcVu/r+RV35J awtC+w2Y4ZobHc+zXuLNqKK7yf/iloqu5hLL9VAZAz2F8iRWOVTI1ZK3rGqMi3MC51chQiuQH31 L+U0kkvnLsO1vfN86LJtJePuFe5PbK002f1qkvB9GL27NJ1J7TETG19ciTvkpM90qvTgQ39Xchg Fm5DObLm2sgNmlABMUBkAa/ivSp3o/OIeEgNWHsmCQdaqB8JnwAsjazW42YbCDAsUt9YdIW7cKT ewIR4ARwYl0MtBn9VEs5dKlZd03hXrqJPmzO/XtfT3KY7WY8OQQ= X-Received: by 2002:a05:6000:2dc3:b0:3a4:e387:c0bb with SMTP id ffacd0b85a97d-3b768f2e430mr428519f8f.59.1753218772655; Tue, 22 Jul 2025 14:12:52 -0700 (PDT) X-Google-Smtp-Source: AGHT+IGYV5omZpZk2iHa5WrcRT36m09b3ueQCAgebY9xqNM+kqrtOFuE1tAbHiibVjIM+oHiHAgT6A== X-Received: by 2002:a05:6000:2dc3:b0:3a4:e387:c0bb with SMTP id ffacd0b85a97d-3b768f2e430mr428502f8f.59.1753218772044; Tue, 22 Jul 2025 14:12:52 -0700 (PDT) Received: from maya.myfinge.rs (ifcgrfdd.trafficplex.cloud. [2a10:fc81:a806:d6a9::1]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-45869182fa5sm1662995e9.2.2025.07.22.14.12.51 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 22 Jul 2025 14:12:51 -0700 (PDT) Date: Tue, 22 Jul 2025 23:12:49 +0200 From: Stefano Brivio To: David Gibson Subject: Re: [PATCH] treewide: By default, don't quit source after migration, keep sockets open Message-ID: <20250722231249.22e09340@elisabeth> In-Reply-To: References: <20250721221258.2874863-1-sbrivio@redhat.com> Organization: Red Hat X-Mailer: Claws Mail 4.2.0 (GTK 3.24.49; x86_64-pc-linux-gnu) MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: 4IVb2ggzs1-ScAs6d1NNnIawLD0huy0kEtVqIoRC9Kw_1753218773 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Message-ID-Hash: P7JBIFO5DVRMG72WUQ2FAQE3XVISRJ2Z X-Message-ID-Hash: P7JBIFO5DVRMG72WUQ2FAQE3XVISRJ2Z X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, Nir Dothan X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Tue, 22 Jul 2025 10:33:00 +1000 David Gibson wrote: > On Tue, Jul 22, 2025 at 12:12:58AM +0200, Stefano Brivio wrote: > > We are hitting an issue in the KubeVirt integration where some data is > > still sent to the source instance even after migration is complete. As > > we exit, the kernel closes our sockets and resets connections. The > > resulting RST segments are sent to peers, effectively terminating > > connections that were meanwhile migrated. > > > > At the moment, this is not done intentionally, but in the future > > KubeVirt might enable OVN-Kubernetes features where source and > > destination nodes are explicitly getting mirrored traffic for a while, > > in order to decrease migration downtime. > > > > By default, don't quit after migration is completed on the source: the > > previous behaviour can be enabled with the new --migrate-exit > > option. > > Might be worth explicitly mentioning that the reason this helps is > that having the sockets still around in repair mode prevents the > source node from responding to any traffic for them. Added as separate paragraph, below. > > Also, by default, keep migrated TCP sockets open (in repair mode) as > > long as we're running, and ignore events on any epoll descriptor > > representing data channels. The previous (and, strictly speaking, > > correct) > > I'm not necessarily convinced the previous behaviour is more correct. > As we've discussed the new behaviour is more in keeping with the > normal qemu migration model, and with further work could let us allow > a resume on the source node even if there's a failure late in > migration (as long as it's before the target resumes, of course). > > As with --migrate-exit the new behaviour is also more robust against > the exact timing of network switchover. I think that's generally > desirable, even if it's arguably not strictly necessary. Right, parenthesis dropped. Well, the previous behaviour had the advantage of uncovering the issue in KubeVirt (at least we know that overlap might happen), but that doesn't make it more correct, per se. > > behaviour can be enabled with the new --migrate-no-linger > > option. > > > > Reported-by: Nir Dothan > > Signed-off-by: Stefano Brivio > > Reviewed-by: David Gibson > > ... in the sense that I think it does accomplish what it sets out to > do. There are several things that I think would improve it, see > details below. > > > --- > > conf.c | 23 +++++++++++++++++++++++ > > epoll_type.h | 12 ++++++++---- > > flow.c | 2 +- > > passt.1 | 14 ++++++++++++++ > > passt.c | 6 +++++- > > passt.h | 6 ++++++ > > tcp.c | 7 +++++-- > > tcp_conn.h | 3 ++- > > test/lib/setup | 4 ++-- > > vhost_user.c | 10 ++++++++-- > > 10 files changed, 74 insertions(+), 13 deletions(-) > > > > diff --git a/conf.c b/conf.c > > index 6c747aa..5e69014 100644 > > --- a/conf.c > > +++ b/conf.c > > @@ -864,6 +864,12 @@ static void usage(const char *name, FILE *f, int status) > > FPRINTF(f, > > " --repair-path PATH path for passt-repair(1)\n" > > " default: append '.repair' to UNIX domain path\n"); > > + FPRINTF(f, > > + " --migrate-exit source quits after migration\n" > > + " default: source keeps running after migration\n"); > > + FPRINTF(f, > > + " --migrate-no-linger close sockets on migration\n" > > + " default: keep sockets open, ignore data events\n"); > > I'm a bit uncomfortable adding this as regular, supported options. > I'm hoping we can eliminate them in the not too distant future. In > the medium term we can do that by improving our test code to simulate > different hosts with namespaces. Longer term we can allow local to > local migration without these options by fixing the kernel: I don't > believe bind() should fail if the address conflict is with a socket in > repair mode. I see the point about --migrate-no-linger, but --migrate-exit looks like something that depends on the migration workflow, not necessarily something to cover up issues in our tests or in the kernel. In any case... > Further, as user accessible options these are kinda weird - what they > do is defined in terms of some pretty obscure internals, not in terms > of what the upshot is. So, I think it would be better to introduce > these as deprecated / debug-only / undocumented options. ...I marked both as deprecated in v2, to signal to any hypothetical (I'm fairly sure not real) user that they shouldn't assume it's safe to rely on them. That's the only other category of options we have so far, we don't have undocumented ones, and I think it's a feature. Especially as we use them in the tests. By the way, I would argue that testing is an important type of usage, and changes in the test setup itself or in the kernel are non-trivial effort, and it's rather uncertain whether we'll ever manage to commit to that, so I don't quite feel that marking them as deprecated is entirely warranted, but I see the point about possibly dropping that migration model at some point in the future. Regardless, the fact we *need* them at the moment should prove that they are at least convenient, even if just for testing. > > } > > > > FPRINTF(f, > > @@ -1470,6 +1476,8 @@ void conf(struct ctx *c, int argc, char **argv) > > {"socket-path", required_argument, NULL, 's' }, > > {"fqdn", required_argument, NULL, 27 }, > > {"repair-path", required_argument, NULL, 28 }, > > + {"migrate-exit", no_argument, NULL, 29 }, > > + {"migrate-no-linger", no_argument, NULL, 30 }, > > { 0 }, > > }; > > const char *optstring = "+dqfel:hs:F:I:p:P:m:a:n:M:g:i:o:D:S:H:461t:u:T:U:"; > > @@ -1495,6 +1503,9 @@ void conf(struct ctx *c, int argc, char **argv) > > fwd_default = FWD_AUTO; > > } > > > > + if (c->mode == MODE_VU) > > No reason for the conditional. It's irrelevant but harmless in other cases. Dropped the whole thing in v2. > > + c->migrate_linger = true; > > Nicer to avoid the explicit initialisation by inverting the sense of > this variable - plus I think it's a little nicer to have 'true' be the > abnormal case. Right, I wrote this before coming up with the option name, changed in v2. > > > if (tap_l2_max_len(c) - ETH_HLEN < max_mtu) > > max_mtu = tap_l2_max_len(c) - ETH_HLEN; > > c->mtu = ROUND_DOWN(max_mtu, sizeof(uint32_t)); > > @@ -1686,6 +1697,18 @@ void conf(struct ctx *c, int argc, char **argv) > > optarg)) > > die("Invalid passt-repair path: %s", optarg); > > > > + break; > > + case 29: > > + if (c->mode != MODE_VU) > > + die("--migrate-exit is for vhost-user mode only"); > > + c->migrate_exit = true; > > + > > + break; > > + case 30: > > + if (c->mode != MODE_VU) > > + die("--migrate-no-linger is for vhost-user mode only"); > > + c->migrate_linger = false; > > + > > break; > > case 'd': > > c->debug = 1; > > diff --git a/epoll_type.h b/epoll_type.h > > index 12ac59b..f2991b6 100644 > > --- a/epoll_type.h > > +++ b/epoll_type.h > > @@ -12,6 +12,7 @@ > > enum epoll_type { > > /* Special value to indicate an invalid type */ > > EPOLL_TYPE_NONE = 0, > > + > > /* Connected TCP sockets */ > > EPOLL_TYPE_TCP, > > /* Connected TCP sockets (spliced) */ > > @@ -26,16 +27,19 @@ enum epoll_type { > > EPOLL_TYPE_UDP, > > /* ICMP/ICMPv6 ping sockets */ > > EPOLL_TYPE_PING, > > - /* inotify fd watching for end of netns (pasta) */ > > - EPOLL_TYPE_NSQUIT_INOTIFY, > > - /* timer fd watching for end of netns, fallback for inotify (pasta) */ > > - EPOLL_TYPE_NSQUIT_TIMER, > > /* tuntap character device */ > > EPOLL_TYPE_TAP_PASTA, > > /* socket connected to qemu */ > > EPOLL_TYPE_TAP_PASST, > > /* socket listening for qemu socket connections */ > > EPOLL_TYPE_TAP_LISTEN, > > + /* End of event types involving data transfers or connections */ > > + EPOLL_TYPE_DATA_MAX = EPOLL_TYPE_TAP_LISTEN, > > + > > + /* inotify fd watching for end of netns (pasta) */ > > + EPOLL_TYPE_NSQUIT_INOTIFY, > > + /* timer fd watching for end of netns, fallback for inotify (pasta) */ > > + EPOLL_TYPE_NSQUIT_TIMER, > > /* vhost-user command socket */ > > EPOLL_TYPE_VHOST_CMD, > > /* vhost-user kick event socket */ > > diff --git a/flow.c b/flow.c > > index 00885f6..feefda3 100644 > > --- a/flow.c > > +++ b/flow.c > > @@ -1091,7 +1091,7 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage, > > * as EIO). > > */ > > foreach_established_tcp_flow(flow) { > > - rc = tcp_flow_migrate_source_ext(fd, &flow->tcp); > > + rc = tcp_flow_migrate_source_ext(c, fd, &flow->tcp); > > if (rc) { > > flow_err(flow, "Can't send extended data: %s", > > strerror_(-rc)); > > diff --git a/passt.1 b/passt.1 > > index 60066c2..b85aaa0 100644 > > --- a/passt.1 > > +++ b/passt.1 > > @@ -439,6 +439,20 @@ Default, for \-\-vhost-user mode only, is to append \fI.repair\fR to the path > > chosen for the hypervisor UNIX domain socket. No socket is created if not in > > \-\-vhost-user mode. > > > > +.TP > > +.BR \-\-migrate-exit > > +Exit after a completed migration as source. By default, \fBpasst\fR keeps > > +running and the migrated guest can continue using its connection, or a new guest > > +can connect. > > + > > +.TP > > +.BR \-\-migrate-no-linger > > +Close TCP sockets on the source instance once migration completes. > > + > > +By default, sockets are kept open, and events on data sockets are ignored, so > > +that any further message reaching sockets after the source migrated is silently > > +ignored, to avoid connection resets in case data is received after migration. > > + > > .TP > > .BR \-F ", " \-\-fd " " \fIFD > > Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened > > diff --git a/passt.c b/passt.c > > index 388d10f..f4c108f 100644 > > --- a/passt.c > > +++ b/passt.c > > @@ -308,6 +308,9 @@ loop: > > c.mode == MODE_PASTA ? "pasta" : "passt", > > EPOLL_TYPE_STR(ref.type), ref.fd, eventmask); > > > > + if (c.ignore_data_events && ref.type <= EPOLL_TYPE_DATA_MAX) > > + continue; > > + > > This worries me slightly. I think it will be fine if passt is killed > not long after the migration. However, if passt lingers, we're > ignoring, but not clearing events. For those events which are > level-triggered this could make us chew CPU on a tight epoll_wait() > loop. Ah, yes, I had that concern as well, but I conveniently forgot about it at some point. > I think a preferable approach would be to EPOLL_CTL_DEL each socket at > the point we would close() them with --migrate-no-linger. I'd be fine > with that as a follow up improvement, though. I was pondering about adding this on top of the ignore_data_events trick, but, actually, the whole thing as of this patch is somewhat bogus because - we're ignoring events on TCP sockets (intentional), - we're ignoring events on the tap device (who cares, migration is only supported with vhost-user) - but *not* ignoring events on the vhost-user kick descriptor (oversight). On a second thought, it doesn't look safe to ignore events on the kick descriptor, and in any case, with this change, we don't want to prevent the guest to send out further packets. It's not expected anyway. So I just replaced the whole thing with EPOLL_CTL_DEL (epoll_del()) as we go through the sockets. It's simpler and arguably safer. > > switch (ref.type) { > > case EPOLL_TYPE_TAP_PASTA: > > tap_handler_pasta(&c, eventmask, &now); > > @@ -363,7 +366,8 @@ loop: > > } > > } > > > > - post_handler(&c, &now); > > + if (!c.ignore_data_events) > > + post_handler(&c, &now); > > > > migrate_handler(&c); > > > > diff --git a/passt.h b/passt.h > > index 8693794..636b3c3 100644 > > --- a/passt.h > > +++ b/passt.h > > @@ -241,6 +241,9 @@ struct ip6_ctx { > > * @device_state_fd: Device state migration channel > > * @device_state_result: Device state migration result > > * @migrate_target: Are we the target, on the next migration request? > > + * @migrate_linger: Keep sockets open after migration, ignore data events > > + * @migrate_exit: Exit (on source) once migration is complete > > + * @ignore_data_events: Ignore data events (for migration, source instance) > > */ > > struct ctx { > > enum passt_modes mode; > > @@ -318,6 +321,9 @@ struct ctx { > > int device_state_fd; > > int device_state_result; > > bool migrate_target; > > + bool migrate_linger; > > + bool migrate_exit; > > + bool ignore_data_events; > > }; > > > > void proto_update_l2_buf(const unsigned char *eth_d, > > diff --git a/tcp.c b/tcp.c > > index 2b88466..1f7a6ab 100644 > > --- a/tcp.c > > +++ b/tcp.c > > @@ -3286,12 +3286,14 @@ int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn) > > > > /** > > * tcp_flow_migrate_source_ext() - Dump queues, close sockets, send final data > > + * @c: Execution context > > * @fd: Descriptor for state migration > > * @conn: Pointer to the TCP connection structure > > * > > * Return: 0 on success, negative (not -EIO) on failure, -EIO on sending failure > > */ > > -int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn) > > +int tcp_flow_migrate_source_ext(struct ctx *c, > > + int fd, const struct tcp_tap_conn *conn) > > { > > uint32_t peek_offset = conn->seq_to_tap - conn->seq_ack_from_tap; > > struct tcp_tap_transfer_ext *t = &migrate_ext[FLOW_IDX(conn)]; > > @@ -3336,7 +3338,8 @@ int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn) > > if ((rc = tcp_flow_dump_seq(conn, &t->seq_rcv))) > > goto fail; > > > > - close(s); > > + if (!c->migrate_linger) > > + close(s); > > > > /* Adjustments unrelated to FIN segments: sequence numbers we dumped are > > * based on the end of the queues. > > diff --git a/tcp_conn.h b/tcp_conn.h > > index 35d813d..d49ae88 100644 > > --- a/tcp_conn.h > > +++ b/tcp_conn.h > > @@ -236,7 +236,8 @@ int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn); > > int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn); > > > > int tcp_flow_migrate_source(int fd, struct tcp_tap_conn *conn); > > -int tcp_flow_migrate_source_ext(int fd, const struct tcp_tap_conn *conn); > > +int tcp_flow_migrate_source_ext(struct ctx *c, int fd, > > + const struct tcp_tap_conn *conn); > > > > int tcp_flow_migrate_target(struct ctx *c, int fd); > > int tcp_flow_migrate_target_ext(struct ctx *c, struct tcp_tap_conn *conn, int fd); > > diff --git a/test/lib/setup b/test/lib/setup > > index 575bc21..5994598 100755 > > --- a/test/lib/setup > > +++ b/test/lib/setup > > @@ -350,7 +350,7 @@ setup_migrate() { > > > > sleep 1 > > > > - __opts="--vhost-user" > > + __opts="--vhost-user --migrate-exit --migrate-no-linger" > > [ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_1.pcap" > > [ ${DEBUG} -eq 1 ] && __opts="${__opts} -d" > > [ ${TRACE} -eq 1 ] && __opts="${__opts} --trace" > > @@ -360,7 +360,7 @@ setup_migrate() { > > > > context_run_bg passt_repair_1 "./passt-repair ${STATESETUP}/passt_1.socket.repair" > > > > - __opts="--vhost-user" > > + __opts="--vhost-user --migrate-exit --migrate-no-linger" > > [ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_2.pcap" > > [ ${DEBUG} -eq 1 ] && __opts="${__opts} -d" > > [ ${TRACE} -eq 1 ] && __opts="${__opts} --trace" > > diff --git a/vhost_user.c b/vhost_user.c > > index e7fb049..7fd27dc 100644 > > --- a/vhost_user.c > > +++ b/vhost_user.c > > @@ -1207,7 +1207,13 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events) > > if (vmsg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE && > > vdev->context->device_state_result == 0 && > > !vdev->context->migrate_target) { > > - info("Migration complete, exiting"); > > - _exit(EXIT_SUCCESS); > > + if (vdev->context->migrate_exit) { > > + info("Migration complete, exiting"); > > + _exit(EXIT_SUCCESS); > > + } > > + > > + info("Migration complete"); > > + if (vdev->context->migrate_linger) > > + vdev->context->ignore_data_events = true; > > } > > } >