From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=pass (p=quarantine dis=none) header.from=redhat.com Authentication-Results: passt.top; dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=O52xk4tc; dkim-atps=neutral Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by passt.top (Postfix) with ESMTPS id E68D35A0280 for ; Wed, 23 Jul 2025 11:17:15 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1753262234; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=VbIpHpO5O0SxHMjSwgPmY6+I3oYASEIIdE3dUL+GMoE=; b=O52xk4tcLi246GG6AoCXrxTUDAfUeyD4vqfGl4LlHyI5VKWB/9u6J96z2XOPKM+RUqjZqA NbXNdlPn11Zp+kPPoUIAyld3mTiHite1Sx1JNVC1LHxF11LrFBrxUJ8iZzrG3WPOu8apBX 6HDYnZ5owL7fmXSQOxJUNu8Aq41QJ30= Received: from mail-wm1-f70.google.com (mail-wm1-f70.google.com [209.85.128.70]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-298-nx42BCjFMB-Gy7vXxR5OIw-1; Wed, 23 Jul 2025 05:17:13 -0400 X-MC-Unique: nx42BCjFMB-Gy7vXxR5OIw-1 X-Mimecast-MFC-AGG-ID: nx42BCjFMB-Gy7vXxR5OIw_1753262232 Received: by mail-wm1-f70.google.com with SMTP id 5b1f17b1804b1-45867ac308dso4025145e9.2 for ; Wed, 23 Jul 2025 02:17:12 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1753262232; x=1753867032; h=content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:from:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=VbIpHpO5O0SxHMjSwgPmY6+I3oYASEIIdE3dUL+GMoE=; b=lkYvZQLhHOyCyQCiHSaBC03DCqY3z4GrIqHQ8wqeW3ofKjuOYNRRshUh2HyIbfolNi hdoVuUJEITcSOGiL+gK7eK9SYfiiI5AXu/jOol+3FUFxUPrO7M1PdZbJItQgb44PoQsD pQRruoL76wGGhHqtx/ObjtWFj3e2aNelV8AdZpAHUns05rD+B1VXAKUTsb8ABExuF0lT oN1SxboTpy46+kf35X2v4SsTtUTzPG8oAld2TSntd4lZoXc2/NJZj1ypEtEn/TBeXv7W pk3CKsfSnSMdAZgmwozvSJfD6kd37AJ+YRG9HaIWvFFASMjhnG8TfabTqm5p4hyFrGVk vFIA== X-Gm-Message-State: AOJu0Yy5ZRE1Wg4CoPrNMuGRDgecgvOS6cJQtTb+qWf7/vyIytc2fbao WUA6wn1O9eW4uvRoDDaW/fvvTnEByubh2OLvgjhANCQEAXYBN+rXkK7lBeEx4Uw2+MQqM+XDPdz GprPPeoCCjCACPZ2mG4dUnc5JEqjqv0GWm2bAf37Ahjdubr1g05aUOA== X-Gm-Gg: ASbGncuEyV6t47QH1U3tgZaVJsrsj4z2mvkojggbPe4H3rmFGdxirVLbOCJWbMgh6Ax KJGlSEOmM4TwogTTYbqlGKWCedbXN9MvD9oumNOsmufirnfo2jwYdij5jhzwBBdMLHcPsR2Krwf Q6GQjbU7z7ezFhP+RtRszrWSyJ3zTtKj4/V/eXbyyTP+xlECNcP8XiZiEOe2Zp+jLcbzgFBThvp w84R4F86I0NjQ4f/wQrXtgvHfZGkVn6KRofLCBP+wQUmpo8VrOwdHn7TR2sxQoHWL4aXuEXMg/J 55OUr3Ixz2v4Gzk+dWufGDW3bn+QyHzlxKNS4mttILg0dgfE43o= X-Received: by 2002:a05:600c:3f0f:b0:456:28f4:a576 with SMTP id 5b1f17b1804b1-45868d304e4mr17154355e9.27.1753262231724; Wed, 23 Jul 2025 02:17:11 -0700 (PDT) X-Google-Smtp-Source: AGHT+IHdTd8HBxH009LbWOd9jaEoJRQ4WYkeHbvL/wQyDv6jGvJryT8XCIAA+/Qzp0jfAU2VFqzgDg== X-Received: by 2002:a05:600c:3f0f:b0:456:28f4:a576 with SMTP id 5b1f17b1804b1-45868d304e4mr17153975e9.27.1753262231079; Wed, 23 Jul 2025 02:17:11 -0700 (PDT) Received: from maya.myfinge.rs (ifcgrfdd.trafficplex.cloud. [2a10:fc81:a806:d6a9::1]) by smtp.gmail.com with ESMTPSA id 5b1f17b1804b1-458691abc62sm16850905e9.26.2025.07.23.02.17.10 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 23 Jul 2025 02:17:10 -0700 (PDT) Date: Wed, 23 Jul 2025 11:17:08 +0200 From: Stefano Brivio To: David Gibson Subject: Re: [PATCH] treewide: By default, don't quit source after migration, keep sockets open Message-ID: <20250723111708.4d78111e@elisabeth> In-Reply-To: References: <20250721221258.2874863-1-sbrivio@redhat.com> <20250722231249.22e09340@elisabeth> Organization: Red Hat X-Mailer: Claws Mail 4.2.0 (GTK 3.24.49; x86_64-pc-linux-gnu) MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: v1ePYMp2nEf5TZgmm2Hvxmz_ItLhRyoHv3DZ3oRkf48_1753262232 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Message-ID-Hash: 5IFFNVNOAVZJVP5IKLYJ34ZM7LRXGGEY X-Message-ID-Hash: 5IFFNVNOAVZJVP5IKLYJ34ZM7LRXGGEY X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, Nir Dothan X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Wed, 23 Jul 2025 10:27:30 +1000 David Gibson wrote: > On Tue, Jul 22, 2025 at 11:12:49PM +0200, Stefano Brivio wrote: > > On Tue, 22 Jul 2025 10:33:00 +1000 > > David Gibson wrote: > > > > > On Tue, Jul 22, 2025 at 12:12:58AM +0200, Stefano Brivio wrote: > > > > We are hitting an issue in the KubeVirt integration where some data is > > > > still sent to the source instance even after migration is complete. As > > > > we exit, the kernel closes our sockets and resets connections. The > > > > resulting RST segments are sent to peers, effectively terminating > > > > connections that were meanwhile migrated. > > > > > > > > At the moment, this is not done intentionally, but in the future > > > > KubeVirt might enable OVN-Kubernetes features where source and > > > > destination nodes are explicitly getting mirrored traffic for a while, > > > > in order to decrease migration downtime. > > > > > > > > By default, don't quit after migration is completed on the source: the > > > > previous behaviour can be enabled with the new --migrate-exit > > > > option. > > > > > > Might be worth explicitly mentioning that the reason this helps is > > > that having the sockets still around in repair mode prevents the > > > source node from responding to any traffic for them. > > > > Added as separate paragraph, below. > > Thanks. > > [snip] > > > > diff --git a/conf.c b/conf.c > > > > index 6c747aa..5e69014 100644 > > > > --- a/conf.c > > > > +++ b/conf.c > > > > @@ -864,6 +864,12 @@ static void usage(const char *name, FILE *f, int status) > > > > FPRINTF(f, > > > > " --repair-path PATH path for passt-repair(1)\n" > > > > " default: append '.repair' to UNIX domain path\n"); > > > > + FPRINTF(f, > > > > + " --migrate-exit source quits after migration\n" > > > > + " default: source keeps running after migration\n"); > > > > + FPRINTF(f, > > > > + " --migrate-no-linger close sockets on migration\n" > > > > + " default: keep sockets open, ignore data events\n"); > > > > > > I'm a bit uncomfortable adding this as regular, supported options. > > > I'm hoping we can eliminate them in the not too distant future. In > > > the medium term we can do that by improving our test code to simulate > > > different hosts with namespaces. Longer term we can allow local to > > > local migration without these options by fixing the kernel: I don't > > > believe bind() should fail if the address conflict is with a socket in > > > repair mode. > > > > I see the point about --migrate-no-linger, but --migrate-exit looks > > like something that depends on the migration workflow, not necessarily > > something to cover up issues in our tests or in the kernel. In any > > case... > > It does depend on the workflow, but I think this needs to be managed > by qemu (or whatever upper level component is managing the overall > migration). --migrate-exit doesn't make us exit once the migration is > complete - it makes us exit once *our part of* the migration is > complete, the migration could still fail after that. So in the happy > path, yes, we'll exit, but it should be because qemu or libvirt or > whatever is satisfied the migration is finished and kills us off > > > > Further, as user accessible options these are kinda weird - what they > > > do is defined in terms of some pretty obscure internals, not in terms > > > of what the upshot is. So, I think it would be better to introduce > > > these as deprecated / debug-only / undocumented options. > > > > ...I marked both as deprecated in v2, to signal to any hypothetical > > (I'm fairly sure not real) user that they shouldn't assume it's safe to > > rely on them. > > > > That's the only other category of options we have so far, we don't have > > undocumented ones, and I think it's a feature. Especially as we use > > them in the tests. > > Works for me. > > > By the way, I would argue that testing is an important type of usage, > > and changes in the test setup itself or in the kernel are non-trivial > > effort, and it's rather uncertain whether we'll ever manage to commit > > to that, so I don't quite feel that marking them as deprecated is > > entirely warranted, but I see the point about possibly dropping that > > migration model at some point in the future. > > > > Regardless, the fact we *need* them at the moment should prove that > > they are at least convenient, even if just for testing. > > Fair enough. > > [snip] > > > > diff --git a/passt.c b/passt.c > > > > index 388d10f..f4c108f 100644 > > > > --- a/passt.c > > > > +++ b/passt.c > > > > @@ -308,6 +308,9 @@ loop: > > > > c.mode == MODE_PASTA ? "pasta" : "passt", > > > > EPOLL_TYPE_STR(ref.type), ref.fd, eventmask); > > > > > > > > + if (c.ignore_data_events && ref.type <= EPOLL_TYPE_DATA_MAX) > > > > + continue; > > > > + > > > > > > This worries me slightly. I think it will be fine if passt is killed > > > not long after the migration. However, if passt lingers, we're > > > ignoring, but not clearing events. For those events which are > > > level-triggered this could make us chew CPU on a tight epoll_wait() > > > loop. > > > > Ah, yes, I had that concern as well, but I conveniently forgot about it > > at some point. > > > > > I think a preferable approach would be to EPOLL_CTL_DEL each socket at > > > the point we would close() them with --migrate-no-linger. I'd be fine > > > with that as a follow up improvement, though. > > > > I was pondering about adding this on top of the ignore_data_events > > trick, but, actually, the whole thing as of this patch is somewhat > > bogus because > > > > - we're ignoring events on TCP sockets (intentional), > > > > - we're ignoring events on the tap device (who cares, migration is only > > supported with vhost-user) > > > > - but *not* ignoring events on the vhost-user kick descriptor > > (oversight). > > > > On a second thought, it doesn't look safe to ignore events on the kick > > descriptor, and in any case, with this change, we don't want to prevent > > the guest to send out further packets. It's not expected anyway. > > > > So I just replaced the whole thing with EPOLL_CTL_DEL (epoll_del()) as > > we go through the sockets. It's simpler and arguably safer. > > Yes, that's what I had in mind as well (I thought I put that in the > mail, but it looks like I didn't). Just one additional concern that I > don't think need hold up merge: do we also need to epoll_del() our > listening sockets? Right, I had that thought as well, and this was somewhat covered by the first version because we'd ignore events on those. But that would still come with the risk of epoll_wait() loops. In the perspective of a simple implementation / fix mostly intended for KubeVirt such as this one, we know that the guest is suspended at that point, so we'll send a SYN to it, nothing comes back, we'll eventually time out in 10 seconds and try to reset the connection (in the unlikely case we're still running *and* getting traffic at that point). And I guess that's fine because if we're still running and getting traffic after that timeout something must have gone wrong, so sending a RST segment doesn't look that bad to me. If the connection was meanwhile established with the target node, I think our RST segment won't actually reset the connection because sequence numbers won't match. But surely it's not elegant and I think we should eventually have an explicit implementation of the whole thing, perhaps with a new socket state ("MIGRATED"?) and going through the whole list of listening sockets and... closing them, I suppose? -- Stefano