From mboxrd@z Thu Jan  1 00:00:00 1970
Authentication-Results: passt.top; dmarc=pass (p=none dis=none) header.from=redhat.com
Authentication-Results: passt.top;
	dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=Egbg40VD;
	dkim-atps=neutral
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124])
	by passt.top (Postfix) with ESMTPS id A9B585A0272
	for <passt-dev@passt.top>; Wed, 26 Feb 2025 12:24:21 +0100 (CET)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1740569060;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=bPwyS428T3F850Vxzz8sEsKYU46ouaY9xu4griS9Ulw=;
	b=Egbg40VDa4G+KpnWSyIw5AxydIUa1mznicHVw82xjejr2+wGXUerkJZOrPyPwf5Amw7uck
	lOJCZGHmobFJ086wKaomfzyOOSHfZbNUgSLNHhpghtY3wGMd0d6lJJw52yC0/N2HExj3nc
	XN+d188Rh0exEUMWq0menCNcR0i2zUU=
Received: from mail-wr1-f72.google.com (mail-wr1-f72.google.com
 [209.85.221.72]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id
 us-mta-511-9PJywQxUOueuhdkKxJeBDQ-1; Wed, 26 Feb 2025 06:24:18 -0500
X-MC-Unique: 9PJywQxUOueuhdkKxJeBDQ-1
X-Mimecast-MFC-AGG-ID: 9PJywQxUOueuhdkKxJeBDQ_1740569057
Received: by mail-wr1-f72.google.com with SMTP id ffacd0b85a97d-38f394f6d84so5098766f8f.1
        for <passt-dev@passt.top>; Wed, 26 Feb 2025 03:24:17 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1740569056; x=1741173856;
        h=content-transfer-encoding:mime-version:organization:references
         :in-reply-to:message-id:subject:cc:to:from:date:x-gm-message-state
         :from:to:cc:subject:date:message-id:reply-to;
        bh=bPwyS428T3F850Vxzz8sEsKYU46ouaY9xu4griS9Ulw=;
        b=s4Z9q4wgLUPNYzrY23G5Uz9hPcKrjo55c9PFK9CJpPmID62CZDDD66qZBokRZpdS5D
         GdHb87/ZF7gicCtNjftmPOFB1zu45hqVphfs6gDVvzs3PgS14GuK/H6wL5fvJnJsQX3a
         A1IJrnut+Edk9bNH5SHTaLjw9i1AzDnUchATiNcrYP+XoaocalxStJCpv7ERpXQIm1E4
         GCQhXCmFZBKT/poFWJBjyJ+pEgKFI2MC87C2ABZ6+EUX+v+HbIo7xdrJAtAnr2smqSHc
         nsE5ulEGciwdN1zFxFk+TcmTeFdhrm5qCb0zQ+DifQxJvoMEhlIC+C3daxYEQNMjib3b
         liPA==
X-Gm-Message-State: AOJu0YzYqUoLFTDOqc5i/X14/iCYpK/fL9BDE6CY98qJpLqX40Ym0MT1
	0k3apH+X49QZVvxgI11aYnfLSGoyzDc8ArceW3r990gcGaQra8nyKnGmeW660bx2O/r0llKl6dN
	hJX8G2zyWfRbuycp6DDxpeYxfVGSNnRTJJG01mWdeZEa1oRwUX6qCnPcipQ==
X-Gm-Gg: ASbGncv5trtDI4sjas3C/l1Lk+mjGm15zvFYAkcj1aIiq8MI2maPJArofz6CVPuwYpe
	b63ltenOqFRDBMv0U+2Vo8TVE6gsbMnWbJGaVRVR3vJor0wtVA0+m9bOJMv7IRbq1Op4mF1UQ5y
	tc2AQX6xEZU6K4ptSbXiKvCf2ARs/FlEGTSx9eyNeyufL5QfmnxVe+E0QVwx46RdVYo74/0r5G0
	1i5XrJPGhCQJpeT4eGOv0GynFh19r/FYj15RgYKim7EkXMar1E+xvmsMAqE3U+PwtVDjt3C668F
	parZMiCD8Q8+XTqWcfWyWrS56m8=
X-Received: by 2002:a5d:4585:0:b0:38f:287a:43d9 with SMTP id ffacd0b85a97d-38f70858bfdmr11571087f8f.52.1740569055996;
        Wed, 26 Feb 2025 03:24:15 -0800 (PST)
X-Google-Smtp-Source: AGHT+IEzl9vtqfAQHk5Vgz0WvKmI8CGw4L3eKwAc6ZjVytv9WOaCMO6NQ2EEC04hGcB5hhNyQqOYHg==
X-Received: by 2002:a5d:4585:0:b0:38f:287a:43d9 with SMTP id ffacd0b85a97d-38f70858bfdmr11571059f8f.52.1740569055365;
        Wed, 26 Feb 2025 03:24:15 -0800 (PST)
Received: from maya.myfinge.rs (ifcgrfdd.trafficplex.cloud. [2a10:fc81:a806:d6a9::1])
        by smtp.gmail.com with ESMTPSA id ffacd0b85a97d-390df3616c5sm218791f8f.4.2025.02.26.03.24.14
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Wed, 26 Feb 2025 03:24:14 -0800 (PST)
Date: Wed, 26 Feb 2025 12:24:12 +0100
From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Subject: Re: [PATCH v2 0/2] More graceful handling of migration without
 passt-repair
Message-ID: <20250226122412.33009f77@elisabeth>
In-Reply-To: <Z77V_6xrLXlkmuDx@zatzit>
References: <20250225055132.3677190-1-david@gibson.dropbear.id.au>
	<20250225184316.407247f4@elisabeth>
	<Z75f9IDhnLS7UmDW@zatzit>
	<20250226090948.3d1fff91@elisabeth>
	<Z77V_6xrLXlkmuDx@zatzit>
Organization: Red Hat
X-Mailer: Claws Mail 4.2.0 (GTK 3.24.41; x86_64-pc-linux-gnu)
MIME-Version: 1.0
X-Mimecast-Spam-Score: 0
X-Mimecast-MFC-PROC-ID: z-jgew5CRm2JxfjOJOs29Dnuq3ZajrgQO8tl44BLGFo_1740569057
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Message-ID-Hash: GDU4KVDQSUNNOVMTEVUCOC5V6AOQLFXI
X-Message-ID-Hash: GDU4KVDQSUNNOVMTEVUCOC5V6AOQLFXI
X-MailFrom: sbrivio@redhat.com
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header
CC: passt-dev@passt.top
X-Mailman-Version: 3.3.8
Precedence: list
List-Id: Development discussion and patches for passt <passt-dev.passt.top>
Archived-At: <https://archives.passt.top/passt-dev/20250226122412.33009f77@elisabeth/>
Archived-At: <https://passt.top/hyperkitty/list/passt-dev@passt.top/message/GDU4KVDQSUNNOVMTEVUCOC5V6AOQLFXI/>
List-Archive: <https://archives.passt.top/passt-dev/>
List-Archive: <https://passt.top/hyperkitty/list/passt-dev@passt.top/>
List-Help: <mailto:passt-dev-request@passt.top?subject=help>
List-Owner: <mailto:passt-dev-owner@passt.top>
List-Post: <mailto:passt-dev@passt.top>
List-Subscribe: <mailto:passt-dev-join@passt.top>
List-Unsubscribe: <mailto:passt-dev-leave@passt.top>

On Wed, 26 Feb 2025 19:51:11 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Feb 26, 2025 at 09:09:48AM +0100, Stefano Brivio wrote:
> > On Wed, 26 Feb 2025 11:27:32 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Tue, Feb 25, 2025 at 06:43:16PM +0100, Stefano Brivio wrote:  
> > > > On Tue, 25 Feb 2025 16:51:30 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >     
> > > > > From Red Hat internal testing we've had some reports that if
> > > > > attempting to migrate without passt-repair, the failure mode is uglier
> > > > > than we'd like.  The migration fails, which is somewhat expected, but
> > > > > we don't correctly roll things back on the source, so it breaks
> > > > > network there as well.
> > > > > 
> > > > > Handle this more gracefully, allowing the migration to proceed in this
> > > > > case, but allow TCP connections to break.
> > > > > 
> > > > > I've now tested this reasonably:
> > > > >  * I get a clean migration if there are no active flows
> > > > >  * Migration completes, although connections are broken if
> > > > >    passt-repair isn't connected
> > > > >  * Basic test suite (minus perf)
> > > > > 
> > > > > I didn't manage to test with libvirt yet, but I'm pretty convinced the
> > > > > behaviour should be better than it was.    
> > > > 
> > > > I did, and it is. The series looks good to me and I would apply it as
> > > > it is, but I'm waiting a bit longer in case you want to try out some
> > > > variations based on my tests as well. Here's what I did.    
> > > 
> > > [snip]
> > > 
> > > Thanks for the detailed instructions.  More complex than I might have
> > > liked, but oh well.
> > >   
> > > >   $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session
> > > >   Migration: [97.59 %]error: End of file while reading data: : Input/output error
> > > > 
> > > > ...despite --verbose the error doesn't tell much (perhaps I need
> > > > LIBVIRT_DEBUG=1 instead?), but passt terminates at this point. With
> > > > this series (I just used 'make install' from the local build), migration
> > > > succeeds instead:
> > > > 
> > > >   $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session
> > > >   Migration: [100.00 %]
> > > > 
> > > > Now, on the target, I still have to figure out how to tell libvirt
> > > > to start QEMU and prepare for the migration (equivalent of '-incoming'
> > > > as we use in our tests), instead of just starting a new instance like
> > > > it does. Otherwise, I have no chance to start passt-repair there.
> > > > Perhaps it has something to do with persistent mode described here:    
> > > 
> > > Ah.  So I'm pretty sure virsh migrate will automatically start qemu
> > > with --incoming on the target.  
> > 
> > ("-incoming"), yes, see src/qemu/qemu_migration.c,
> > qemuMigrationDstPrepare().
> >   
> > > IIUC the problem here is more about
> > > timing: we want it to start it early, so that we have a chance to
> > > start passt-repair and let it connect before the migration actually
> > > happens.  
> > 
> > For the timing itself, we could actually wait for passt-repair to be
> > there, with a timeout (say, 100ms).  
> 
> I guess.  That still requires some way for KubeVirt (or whatever) to
> know at least roughly when it needs to launch passt-repair, and I'm
> not sure if that's something we can currently get from libvirt.

KubeVirt sets up the target pod, and that's when it should be done (if
we have an inotify mechanism or similar). I can't point to an exact code
path yet but there's something like that.

> > We could also modify passt-repair to set up an inotify watcher if the
> > socket isn't there yet.  
> 
> Maybe, yes.  This kind of breaks our "passt starts first, passt-repair
> connects to it" model though, and I wonder if we need to revisit the
> security implications of that.

I don't think it actually breaks that model for security purposes,
because the guest doesn't have a chance to cause a connection to
passt-repair anyway. The guest is still suspended (or missing) at that
point.

> > > Crud... I didn't think of this before.  I don't know that there's any
> > > sensible way to do this without having libvirt managing passt-repair
> > > as well.  
> > 
> > But we can't really use it as we're assuming that passt-repair will run
> > with capabilities virtqemud doesn't want/need.  
> 
> Oh.  True.
> 
> > > I mean it's not impossible there's some option to do this,
> > > but I doubt there's been any reason before for something outside of
> > > libvirt to control the timing of the target qemu's creation.  I think
> > > we need to ask libvirt people about this.  
> > 
> > I'm looking into it (and perhaps virtiofsd had similar troubles?).  
> 
> I'm guessing libvirt already knows how to start virtiofsd - just as it
> already knows how to start passt, just not passt-repair.
> 
> > > >   https://libvirt.org/migration.html#configuration-file-handling    
> > > 
> > > Yeah.. I don't think this is relevant.
> > >   
> > > > and --listen-address, but I'm not quite sure yet.
> > > > 
> > > > That is, I could only test different failures (early one on source, or
> > > > later one on target) with this, not a complete successful migration.
> > > >     
> > > > > There are more fragile cases that I'm looking to fix, particularly the
> > > > > die()s in flow_migrate_source_rollback() and elsewhere, however I ran
> > > > > into various complications that I didn't manage to sort out today.
> > > > > I'll continue looking at those tomorrow.  I'm now pretty confident
> > > > > that those additional fixes won't entirely supersede the changes in
> > > > > this series, so it should be fine to apply these on their own.    
> > > > 
> > > > By the way, I think the somewhat less fragile/more obvious case where
> > > > we fail clumsily is when the target doesn't have the same address as
> > > > the source (among other possible addresses). In that case, we fail (and
> > > > terminate) with a rather awkward:    
> > > 
> > > Ah, yes, that is a higher priority fragile case.
> > >   
> > > >   93.7217: ERROR:   Failed to bind socket for migrated flow: Cannot assign requested address
> > > >   93.7218: ERROR:   Flow 0 (TCP connection): Can't set up socket: (null), drop
> > > >   93.7331: ERROR:   Selecting TCP_SEND_QUEUE, socket 1: Socket operation on non-socket
> > > >   93.7333: ERROR:   Unexpected reply from TCP_REPAIR helper: -100
> > > > 
> > > > that's because, oops, I only took care of socket() failures in
> > > > tcp_flow_repair_socket(), but not bind() failures (!). Sorry.    
> > > 
> > > No, you check for errors on both.  
> > 
> > Well, "check", yes, but I'm not even setting an error code. I haven't
> > tried your 3/3 yet but look at "(null)" resulting from:
> > 
> >  		flow_err(flow, "Can't set up socket: %s, drop", strerror_(rc));
> > 
> > ...rc is 0.  
> 
> -1, not 0, otherwise we wouldn't enter that if clause at all.

Ah, oops, right.

> But,
> still, out of bounds for strerror().  I did spot that bug -
> tcp_flow_repair_socket() is directly passing on the return code from
> bind(), whereas it should be returning -errno.
> 
> So, two bugs actually: 1) in the existing code we should return -errno
> not rc if bind() fails, 2) in my 3/3 it should be calling strerror()
> on -rc, not rc.

Right, yes, 1) is what I meant.
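
Just to spell out the convention we're talking about, a minimal sketch
follows. This is not the actual tcp_flow_repair_socket(): bound_socket()
is a made-up helper, and it uses plain strerror() conventions rather
than our strerror_() wrapper. The point is that the helper returns
-errno instead of the raw return code from bind().

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP socket and bind it to the IPv4
 * address 'ip' (any port). Returns the socket on success, negative
 * errno on failure, never the bare -1 from socket() or bind().
 */
static int bound_socket(const char *ip)
{
	struct sockaddr_in a = { .sin_family = AF_INET };
	int s, rc;

	assert(ip);

	if (inet_pton(AF_INET, ip, &a.sin_addr) != 1)
		return -EINVAL;

	if ((s = socket(AF_INET, SOCK_STREAM, 0)) < 0)
		return -errno;

	if (bind(s, (const struct sockaddr *)&a, sizeof(a)) < 0) {
		rc = -errno;	/* save it: close() might change errno */
		close(s);
		return rc;
	}

	return s;
}
```

The caller then checks for rc < 0 and reports strerror(-rc), so a
non-local address gives "Cannot assign requested address" instead of
"(null)".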

> > > The problem is that in
> > > tcp_flow_migrate_target() we cancel the flow allocation and carry on -
> > > but the source will still send information for this flow, putting us
> > > out of sync with the stream.  
> > 
> > That, too, yes.
> >   
> > > > Once that's fixed, flow_migrate_target() should also take care of
> > > > decreasing 'count' accordingly. I just had a glimpse but didn't
> > > > really try to sketch a fix.    
> > > 
> > > Adjusting count won't do the job.  Instead we'd need to keep the flow
> > > around, but marked as "dead" somehow, so that we read but discard the
> > > incoming information for it.  The MIGRATING state I added in one of my
> > > drafts was supposed to help with this sort of thing.  But that's quite
> > > a complex change.  
> > 
> > I think it's great that you could (practically) solve it with three
> > lines...  
> 
> Yeah, I sent that email at the beginning of my day, by the end I'd
> come up with the simpler approach.
> 
> > > Hrm... at least in the near term, I think it might actually be easier
> > > to set IP_FREEBIND when we create sockets for in-migrating flows.
> > > That way we can process them normally, they just won't do much without
> > > the address set.  It has the additional advantage that it should work
> > > if the higher layers only move the IP just after the migration,
> > > instead of in advance.  
> > 
> > Perhaps we want it anyway, but I wonder:  
> 
> Right, I'm no longer considering this as a short term solution, since
> checking for fd < 0 I think works better for the immediate problems.
> 
> > what happens if we turn repair
> > mode off and we bound to a non-local address? I suppose we won't send
> > out anything, but I'm not sure. If we send out the first keep-alive
> > segment with a wrong address, we probably ruined the connection.  
> 
> That's a good point.  More specifically, I think IP_FREEBIND is
> generally used for listen()ing sockets, I'm guessing you'll get an
> error if you try to connect() a socket that's bound to a non-local
> address.  It's possible TCP_REPAIR would defer that until repair mode
> is switched off, which wouldn't make a lot of difference to us.  It's
> also possible there could be a bug in repair mode that would let you
> construct a non-locally bound, connected socket that way.  I'm not
> entirely sure what the consequences would be.  I guess that might
> already be possible in a different way: what happens if you have a
> connect()ed socket, then the admin removes the address to which it is
> bound?

I'm fairly sure that the outcome of testing this will surprise us in
one way or another, so we should probably start by testing it.

We could even think of deferring switching repair mode off until the
right address is there, by the way. That would make a difference to us.
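
As a starting point for that kind of test, here's a minimal sketch of
just the IP_FREEBIND part, with no repair mode involved.
freebind_socket() is a hypothetical helper, not something we have in
the tree. It only shows that bind() to an address we don't have is
accepted once the option is set. What connect() or switching repair
mode off do afterwards is exactly what we would need to find out.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP socket, set IP_FREEBIND, and bind
 * it to the IPv4 address 'ip' (any port), whether or not that address
 * is currently configured locally.
 * Returns the socket on success, negative errno on failure.
 */
static int freebind_socket(const char *ip)
{
	struct sockaddr_in a = { .sin_family = AF_INET };
	int s, rc, one = 1;

	assert(ip);

	if (inet_pton(AF_INET, ip, &a.sin_addr) != 1)
		return -EINVAL;

	if ((s = socket(AF_INET, SOCK_STREAM, 0)) < 0)
		return -errno;

	if (setsockopt(s, IPPROTO_IP, IP_FREEBIND, &one, sizeof(one)) < 0 ||
	    bind(s, (const struct sockaddr *)&a, sizeof(a)) < 0) {
		rc = -errno;	/* save it: close() might change errno */
		close(s);
		return rc;
	}

	return s;
}
```

For example, freebind_socket("192.0.2.1") should return a valid socket
on a Linux host that doesn't have that (documentation) address
configured, whereas the same bind() without IP_FREEBIND fails with
EADDRNOTAVAIL.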

> > Once I find a solution for the target libvirt/passt-repair thing (and
> > the remaining SELinux issues), I'll try to have a look at this too. I
> > haven't yet tried a migration with a mismatching address on the target
> > and passt-repair available.  
> 
> Right, I was trying to set up a test case for this today.  I made some
> progress but didn't really get it working.  I was using qemu directly
> with scripts to put the two ends into different net namespaces, rather
> than libvirt on separate L1 VMs.  Working out how to get the two
> namespaces connected in a way I could do the migration, while still
> being separate enough was doing my head in a bit.

I didn't actually get to the point of having a truly working migration.
I just considered a clean attempt at resuming an existing connection
(with keep-alive and subsequent RST from the underlying passt instance)
as success.

So, yes, of course, it would be great to have a full simulation in a
compact form of what's going on with KubeVirt and OVN.

Probably bridging the L1 VMs would be a quick solution. It needs root
(at least for the setup), which means they become L2 (at least in my
world): L1 would still be connected by passt, the two L2 instances are
bridged, and source and target guests would be L3.

> In doing that, I also spotted another wrinkle.  I don't think this is
> one we can reasonably fix - but we should be aware, since someone will
> probably try it at some point:

Yeah, I tried it, you might see remnants of that in the setup_migrate()
stuff (I copied it from the "two_guests" test). I originally wanted to
have two namespaces and two instances of pasta (just like
"two_guests"), but soon realised the issue you describe below.

> migration is not going to work if the
> two hosts have their own connectivity provided by (separate instances
> of) passt or pasta (or slirp for that matter).  The migrating VM can
> have its TCP stream reconstructed perfectly, so the right L2 packets
> come out of the host, but the host's own passt/pasta instance won't
> know about the flows and so will just drop/reject the packets.
> 
> To make that work we'd basically have to migrate state for every
> "ancestor" passt/pasta until we hit a common namespace.  That seems
> pretty infeasible to me, since the pieces that know about the
> migration probably don't own those layers of the network.

If there's any potential interest around it, we could abstract things a
bit more than we did until now and have some kind of o... orch...
coordination of data/migration flows, with some kind of external tool
identifying and connecting to several instances of passt and moving
flows between them. It sounds a bit like OVN, but with the notable
difference that we don't need to implement an overlay network.

-- 
Stefano