* [PATCH v2 0/2] More graceful handling of migration without passt-repair
  2025-02-25  5:51 David Gibson

From: David Gibson @ 2025-02-25 5:51 UTC
To: Stefano Brivio, passt-dev; +Cc: David Gibson

From Red Hat internal testing we've had some reports that if
attempting to migrate without passt-repair, the failure mode is uglier
than we'd like.  The migration fails, which is somewhat expected, but
we don't correctly roll things back on the source, so it breaks
networking there as well.

Handle this more gracefully, allowing the migration to proceed in this
case, but allowing TCP connections to break.

I've now tested this reasonably:
 * I get a clean migration if there are no active flows
 * Migration completes, although connections are broken if
   passt-repair isn't connected
 * Basic test suite (minus perf)

I didn't manage to test with libvirt yet, but I'm pretty convinced the
behaviour should be better than it was.

There are more fragile cases that I'm looking to fix, particularly the
die()s in flow_migrate_source_rollback() and elsewhere; however, I ran
into various complications that I didn't manage to sort out today.
I'll continue looking at those tomorrow.  I'm now pretty confident
that those additional fixes won't entirely supersede the changes in
this series, so it should be fine to apply these on their own.

David Gibson (2):
  migrate, flow: Trivially succeed if migrating with no flows
  migrate, flow: Don't attempt to migrate TCP flows without passt-repair

 flow.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

-- 
2.48.1
* [PATCH v2 1/2] migrate, flow: Trivially succeed if migrating with no flows
  2025-02-25  5:51 David Gibson

From: David Gibson @ 2025-02-25 5:51 UTC
To: Stefano Brivio, passt-dev; +Cc: David Gibson

We could get a migration request when we have no active flows; or at
least none that we need or are able to migrate.  In this case after
sending or receiving the number of flows we continue to step through
various lists.  In the target case, this could include communication
with passt-repair.  If passt-repair wasn't started that could cause
further errors, but of course they shouldn't matter if we have nothing
to repair.

Make it more obvious that there's nothing to do and avoid such errors
by short-circuiting flow_migrate_{source,target}() if there are no
migratable flows.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/flow.c b/flow.c
index bb5dcc3c..6cf96c26 100644
--- a/flow.c
+++ b/flow.c
@@ -999,6 +999,9 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
 
 	debug("Sending %u flows", ntohl(count));
 
+	if (!count)
+		return 0;
+
 	/* Dump and send information that can be stored in the flow table.
 	 *
 	 * Limited rollback options here: if we fail to transfer any data (that
@@ -1070,6 +1073,9 @@ int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
 	count = ntohl(count);
 	debug("Receiving %u flows", count);
 
+	if (!count)
+		return 0;
+
 	if ((rc = flow_migrate_repair_all(c, true)))
 		return -rc;
 
-- 
2.48.1
* [PATCH v2 2/2] migrate, flow: Don't attempt to migrate TCP flows without passt-repair
  2025-02-25  5:51 David Gibson

From: David Gibson @ 2025-02-25 5:51 UTC
To: Stefano Brivio, passt-dev; +Cc: David Gibson

Migrating TCP flows requires passt-repair in order to use TCP_REPAIR.
If passt-repair is not started, our failure mode is pretty ugly though:
we'll attempt the migration, hitting various problems when we can't
enter repair mode.  In some cases we may not roll back these changes
properly, meaning we break network connections on the source.

Our general approach is not to completely block migration if there are
problems, but simply to break any flows we can't migrate.  So, if we
have no connection from passt-repair carry on with the migration, but
don't attempt to migrate any TCP connections.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/flow.c b/flow.c
index 6cf96c26..749c4984 100644
--- a/flow.c
+++ b/flow.c
@@ -923,6 +923,10 @@ static int flow_migrate_repair_all(struct ctx *c, bool enable)
 	union flow *flow;
 	int rc;
 
+	/* If we don't have a repair helper, there's nothing we can do */
+	if (c->fd_repair < 0)
+		return 0;
+
 	foreach_established_tcp_flow(flow) {
 		if (enable)
 			rc = tcp_flow_repair_on(c, &flow->tcp);
@@ -987,8 +991,11 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
 	(void)c;
 	(void)stage;
 
-	foreach_established_tcp_flow(flow)
-		count++;
+	/* If we don't have a repair helper, we can't migrate TCP flows */
+	if (c->fd_repair >= 0) {
+		foreach_established_tcp_flow(flow)
+			count++;
+	}
 
 	count = htonl(count);
 	if (write_all_buf(fd, &count, sizeof(count))) {
-- 
2.48.1
* Re: [PATCH v2 0/2] More graceful handling of migration without passt-repair 2025-02-25 5:51 [PATCH v2 0/2] More graceful handling of migration without passt-repair David Gibson 2025-02-25 5:51 ` [PATCH v2 1/2] migrate, flow: Trivially succeed if migrating with no flows David Gibson 2025-02-25 5:51 ` [PATCH v2 2/2] migrate, flow: Don't attempt to migrate TCP flows without passt-repair David Gibson @ 2025-02-25 17:43 ` Stefano Brivio 2025-02-26 0:27 ` David Gibson 2 siblings, 1 reply; 10+ messages in thread From: Stefano Brivio @ 2025-02-25 17:43 UTC (permalink / raw) To: David Gibson; +Cc: passt-dev On Tue, 25 Feb 2025 16:51:30 +1100 David Gibson <david@gibson.dropbear.id.au> wrote: > From Red Hat internal testing we've had some reports that if > attempting to migrate without passt-repair, the failure mode is uglier > than we'd like. The migration fails, which is somewhat expected, but > we don't correctly roll things back on the source, so it breaks > network there as well. > > Handle this more gracefully allowing the migration to proceed in this > case, but allow TCP connections to break > > I've now tested this reasonably: > * I get a clean migration if there are now active flows > * Migration completes, although connections are broken if > passt-repair isn't connected > * Basic test suite (minus perf) > > I didn't manage to test with libvirt yet, but I'm pretty convinced the > behaviour should be better than it was. I did, and it is. The series looks good to me and I would apply it as it is, but I'm waiting a bit longer in case you want to try out some variations based on my tests as well. Here's what I did. L0 is Debian testing, L1 are two similar (virt-clone'd) instances of RHEL 9.5 (with passt-0^20250217.ga1e48a0-1.el9.x86_64 or local build with this series, qemu-kvm-9.1.0-14.el9.x86_64, libvirt-10.10.0-7.el9.x86_64), and L2 is Alpine 3.21-ish. The two L1 instances (hosting the source and target guests), of course, don't need to be run under libvirt, but they do in my case. They are connected by passt, so that they share the same address internally, but I'm forwarding different SSH ports to them. Relevant libvirt XML snippets for L1 instances: <interface type='user'> <mac address='52:54:00:8a:9e:c2'/> <portForward proto='tcp'> <range start='1295' to='22'/> </portForward> <model type='virtio'/> <backend type='passt'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> </interface> and: <interface type='user'> <mac address='52:54:00:b8:99:8c'/> <portForward proto='tcp'> <range start='11951' to='22'/> </portForward> <model type='virtio'/> <backend type='passt'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> </interface> ...I didn't switch those to vhost-user mode yet. 
I prepared the L2 guest on L1 with: $ wget https://dl-cdn.alpinelinux.org/alpine/v3.21/releases/cloud/nocloud_alpine-3.21.2-x86_64-bios-tiny-r0.qcow2 $ virt-customize -a nocloud_alpine-3.21.2-x86_64-bios-tiny-r0.qcow2 --root-password password:root $ virt-install -d --name alpine --memory 1024 --noreboot --osinfo alpinelinux3.20 --network backend.type=passt,portForward0.proto=tcp,portForward0.range0.start=40922,portForward0.range0.to=2222 --import --disk nocloud_alpine-3.21.2-x86_64-bios-tiny-r0.qcow2 And made sure I can connect via SSH to the second (target node) L1 with: $ ssh-copy-id -f -p 11951 $GATEWAY There are some known SELinux issues at this point that I'm still working on (similar for AppArmor), so I *temporarily* set it to permissive mode with 'setenforce 0', on L1. Some were not known, though, and it's taking me longer than expected. Now I can start passt-repair (or not) on the source L1 (node): # passt-repair /run/user/1001/libvirt/qemu/run/passt/8-alpine-net0.socket.repair and open a TCP connection in the source L2 guest ('virsh console alpine', then login as root/root): # apk add inetutils-telnet # telnet passt.top 80 and finally ask libvirt to migrate the guest. Note that I need "--unsafe" because I didn't care about migrating storage (it's good enough to have the guest memory for this test). Without this series, migration fails on the source: $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session Migration: [97.59 %]error: End of file while reading data: : Input/output error ...despite --verbose the error doesn't tell much (perhaps I need LIBVIRT_DEBUG=1 instead?), but passt terminates at this point. With this series (I just used 'make install' from the local build), migration succeeds instead: $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session Migration: [100.00 %] Now, on the target, I still have to figure out how to tell libvirt to start QEMU and prepare for the migration (equivalent of '-incoming' as we use in our tests), instead of just starting a new instance like it does. Otherwise, I have no chance to start passt-repair there. Perhaps it has something to do with persistent mode described here: https://libvirt.org/migration.html#configuration-file-handling and --listen-address, but I'm not quite sure yet. That is, I could only test different failures (early one on source, or later one on target) with this, not a complete successful migration. > There are more fragile cases that I'm looking to fix, particularly the > die()s in flow_migrate_source_rollback() and elsewhere, however I ran > into various complications that I didn't manage to sort out today. > I'll continue looking at those tomorrow. I'm now pretty confident > that those additional fixes won't entirely supersede the changes in > this series, so it should be fine to apply these on their own. By the way, I think the somewhat less fragile/more obvious case where we fail clumsily is when the target doesn't have the same address as the source (among other possible addresses). 
In that case, we fail (and terminate) with a rather awkward: 93.7217: ERROR: Failed to bind socket for migrated flow: Cannot assign requested address 93.7218: ERROR: Flow 0 (TCP connection): Can't set up socket: (null), drop 93.7331: ERROR: Selecting TCP_SEND_QUEUE, socket 1: Socket operation on non-socket 93.7333: ERROR: Unexpected reply from TCP_REPAIR helper: -100 that's because, oops, I only took care of socket() failures in tcp_flow_repair_socket(), but not bind() failures (!). Sorry. Once that's fixed, flow_migrate_target() should also take care of decreasing 'count' accordingly. I just had a glimpse but didn't really try to sketch a fix. -- Stefano ^ permalink raw reply [flat|nested] 10+ messages in thread
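For reference, a minimal sketch of the missing piece — not the actual
tcp_flow_repair_socket(), just the general shape with illustrative names —
where a failed bind() is turned into a negative errno, so that callers get a
meaningful error string instead of an out-of-range value:

#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sketch only: create and bind the socket for a migrated flow, returning
 * a negative errno on failure instead of bind()'s raw return value.
 */
static int repair_sock_sketch(sa_family_t af, const struct sockaddr *addr,
			      socklen_t len)
{
	int s = socket(af, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP);

	if (s < 0)
		return -errno;

	if (bind(s, addr, len) < 0) {
		int ret = -errno;	/* save errno before close() clobbers it */

		close(s);
		return ret;
	}

	return s;
}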
* Re: [PATCH v2 0/2] More graceful handling of migration without passt-repair 2025-02-25 17:43 ` [PATCH v2 0/2] More graceful handling of migration " Stefano Brivio @ 2025-02-26 0:27 ` David Gibson 2025-02-26 8:09 ` Stefano Brivio 0 siblings, 1 reply; 10+ messages in thread From: David Gibson @ 2025-02-26 0:27 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 5730 bytes --] On Tue, Feb 25, 2025 at 06:43:16PM +0100, Stefano Brivio wrote: > On Tue, 25 Feb 2025 16:51:30 +1100 > David Gibson <david@gibson.dropbear.id.au> wrote: > > > From Red Hat internal testing we've had some reports that if > > attempting to migrate without passt-repair, the failure mode is uglier > > than we'd like. The migration fails, which is somewhat expected, but > > we don't correctly roll things back on the source, so it breaks > > network there as well. > > > > Handle this more gracefully allowing the migration to proceed in this > > case, but allow TCP connections to break > > > > I've now tested this reasonably: > > * I get a clean migration if there are now active flows > > * Migration completes, although connections are broken if > > passt-repair isn't connected > > * Basic test suite (minus perf) > > > > I didn't manage to test with libvirt yet, but I'm pretty convinced the > > behaviour should be better than it was. > > I did, and it is. The series looks good to me and I would apply it as > it is, but I'm waiting a bit longer in case you want to try out some > variations based on my tests as well. Here's what I did. [snip] Thanks for the detailed instructions. More complex than I might have liked, but oh well. > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > Migration: [97.59 %]error: End of file while reading data: : Input/output error > > ...despite --verbose the error doesn't tell much (perhaps I need > LIBVIRT_DEBUG=1 instead?), but passt terminates at this point. With > this series (I just used 'make install' from the local build), migration > succeeds instead: > > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > Migration: [100.00 %] > > Now, on the target, I still have to figure out how to tell libvirt > to start QEMU and prepare for the migration (equivalent of '-incoming' > as we use in our tests), instead of just starting a new instance like > it does. Otherwise, I have no chance to start passt-repair there. > Perhaps it has something to do with persistent mode described here: Ah. So I'm pretty sure virsh migrate will automatically start qemu with --incoming on the target. IIUC the problem here is more about timing: we want it to start it early, so that we have a chance to start passt-repair and let it connect before the migration actually happens. Crud... I didn't think of this before. I don't know that there's any sensible way to do this without having libvirt managing passt-repair as well. I mean it's not impossible there's some option to do this, but I doubt there's been any reason before for something outside of libvirt to control the timing of the target qemu's creation. I think we need to ask libvirt people about this. > https://libvirt.org/migration.html#configuration-file-handling Yeah.. I don't think this is relevant. > and --listen-address, but I'm not quite sure yet. > > That is, I could only test different failures (early one on source, or > later one on target) with this, not a complete successful migration. 
> > > There are more fragile cases that I'm looking to fix, particularly the > > die()s in flow_migrate_source_rollback() and elsewhere, however I ran > > into various complications that I didn't manage to sort out today. > > I'll continue looking at those tomorrow. I'm now pretty confident > > that those additional fixes won't entirely supersede the changes in > > this series, so it should be fine to apply these on their own. > > By the way, I think the somewhat less fragile/more obvious case where > we fail clumsily is when the target doesn't have the same address as > the source (among other possible addresses). In that case, we fail (and > terminate) with a rather awkward: Ah, yes, that is a higher priority fragile case. > 93.7217: ERROR: Failed to bind socket for migrated flow: Cannot assign requested address > 93.7218: ERROR: Flow 0 (TCP connection): Can't set up socket: (null), drop > 93.7331: ERROR: Selecting TCP_SEND_QUEUE, socket 1: Socket operation on non-socket > 93.7333: ERROR: Unexpected reply from TCP_REPAIR helper: -100 > > that's because, oops, I only took care of socket() failures in > tcp_flow_repair_socket(), but not bind() failures (!). Sorry. No, you check for errors on both. The problem is that in tcp_flow_migrate_target() we cancel the flow allocation and carry on - but the source will still send information for this flow, putting us out of sync with the stream. > Once that's fixed, flow_migrate_target() should also take care of > decreasing 'count' accordingly. I just had a glimpse but didn't > really try to sketch a fix. Adjusting count won't do the job. Instead we'd need to keep the flow around, but marked as "dead" somehow, so that we read but discard the incoming information for it. The MIGRATING state I added in one of my drafts was supposed to help with this sort of thing. But that's quite a complex change. Hrm... at least in the near term, I think it might actually be easier to set IP_FREEBIND when we create sockets for in-migrating flows. That way we can process them normally, they just won't do much without the address set. It has the additional advantage that it should work if the higher layers only move the IP just after the migration, instead of in advance. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread
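As a rough illustration of the IP_FREEBIND idea above — a minimal sketch, not
the actual passt code — the socket created for an in-migrating flow would just
get the option set before it is bound:

#include <netinet/in.h>
#include <sys/socket.h>

/* Allow binding to an address the host doesn't (yet) have configured.
 * For IPv6 sockets, IPV6_FREEBIND (Linux >= 4.15) is the equivalent.
 */
static int sock_set_freebind(int s)
{
	int one = 1;

	return setsockopt(s, IPPROTO_IP, IP_FREEBIND, &one, sizeof(one));
}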
* Re: [PATCH v2 0/2] More graceful handling of migration without passt-repair 2025-02-26 0:27 ` David Gibson @ 2025-02-26 8:09 ` Stefano Brivio 2025-02-26 8:51 ` David Gibson 0 siblings, 1 reply; 10+ messages in thread From: Stefano Brivio @ 2025-02-26 8:09 UTC (permalink / raw) To: David Gibson; +Cc: passt-dev On Wed, 26 Feb 2025 11:27:32 +1100 David Gibson <david@gibson.dropbear.id.au> wrote: > On Tue, Feb 25, 2025 at 06:43:16PM +0100, Stefano Brivio wrote: > > On Tue, 25 Feb 2025 16:51:30 +1100 > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > From Red Hat internal testing we've had some reports that if > > > attempting to migrate without passt-repair, the failure mode is uglier > > > than we'd like. The migration fails, which is somewhat expected, but > > > we don't correctly roll things back on the source, so it breaks > > > network there as well. > > > > > > Handle this more gracefully allowing the migration to proceed in this > > > case, but allow TCP connections to break > > > > > > I've now tested this reasonably: > > > * I get a clean migration if there are now active flows > > > * Migration completes, although connections are broken if > > > passt-repair isn't connected > > > * Basic test suite (minus perf) > > > > > > I didn't manage to test with libvirt yet, but I'm pretty convinced the > > > behaviour should be better than it was. > > > > I did, and it is. The series looks good to me and I would apply it as > > it is, but I'm waiting a bit longer in case you want to try out some > > variations based on my tests as well. Here's what I did. > > [snip] > > Thanks for the detailed instructions. More complex than I might have > liked, but oh well. > > > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > > Migration: [97.59 %]error: End of file while reading data: : Input/output error > > > > ...despite --verbose the error doesn't tell much (perhaps I need > > LIBVIRT_DEBUG=1 instead?), but passt terminates at this point. With > > this series (I just used 'make install' from the local build), migration > > succeeds instead: > > > > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > > Migration: [100.00 %] > > > > Now, on the target, I still have to figure out how to tell libvirt > > to start QEMU and prepare for the migration (equivalent of '-incoming' > > as we use in our tests), instead of just starting a new instance like > > it does. Otherwise, I have no chance to start passt-repair there. > > Perhaps it has something to do with persistent mode described here: > > Ah. So I'm pretty sure virsh migrate will automatically start qemu > with --incoming on the target. ("-incoming"), yes, see src/qemu/qemu_migration.c, qemuMigrationDstPrepare(). > IIUC the problem here is more about > timing: we want it to start it early, so that we have a chance to > start passt-repair and let it connect before the migration actually > happens. For the timing itself, we could actually wait for passt-repair to be there, with a timeout (say, 100ms). We could also modify passt-repair to set up an inotify watcher if the socket isn't there yet. > Crud... I didn't think of this before. I don't know that there's any > sensible way to do this without having libvirt managing passt-repair > as well. But we can't really use it as we're assuming that passt-repair will run with capabilities virtqemud doesn't want/need. 
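To make the inotify idea above concrete — a minimal sketch of what
passt-repair could do if the socket isn't there yet; the function name is made
up and error handling is rudimentary:

#include <sys/inotify.h>
#include <limits.h>
#include <poll.h>
#include <unistd.h>

/* Watch the directory for the repair socket to appear, waiting at most
 * timeout_ms.  The caller should re-check for the socket right after the
 * watch is added, to close the race with a creation that just happened.
 */
static int wait_for_repair_socket(const char *dir, int timeout_ms)
{
	char buf[sizeof(struct inotify_event) + NAME_MAX + 1];
	struct pollfd pfd = { .events = POLLIN };
	int ifd = inotify_init1(IN_CLOEXEC);

	if (ifd < 0)
		return -1;

	if (inotify_add_watch(ifd, dir, IN_CREATE) < 0) {
		close(ifd);
		return -1;
	}
	pfd.fd = ifd;

	if (poll(&pfd, 1, timeout_ms) <= 0) {
		close(ifd);
		return -1;		/* timed out, or poll() failed */
	}

	(void)read(ifd, buf, sizeof(buf));	/* drain one event batch */
	close(ifd);
	return 0;	/* something was created; caller checks the name */
}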
> I mean it's not impossible there's some option to do this, > but I doubt there's been any reason before for something outside of > libvirt to control the timing of the target qemu's creation. I think > we need to ask libvirt people about this. I'm looking into it (and perhaps virtiofsd had similar troubles?). > > https://libvirt.org/migration.html#configuration-file-handling > > Yeah.. I don't think this is relevant. > > > and --listen-address, but I'm not quite sure yet. > > > > That is, I could only test different failures (early one on source, or > > later one on target) with this, not a complete successful migration. > > > > > There are more fragile cases that I'm looking to fix, particularly the > > > die()s in flow_migrate_source_rollback() and elsewhere, however I ran > > > into various complications that I didn't manage to sort out today. > > > I'll continue looking at those tomorrow. I'm now pretty confident > > > that those additional fixes won't entirely supersede the changes in > > > this series, so it should be fine to apply these on their own. > > > > By the way, I think the somewhat less fragile/more obvious case where > > we fail clumsily is when the target doesn't have the same address as > > the source (among other possible addresses). In that case, we fail (and > > terminate) with a rather awkward: > > Ah, yes, that is a higher priority fragile case. > > > 93.7217: ERROR: Failed to bind socket for migrated flow: Cannot assign requested address > > 93.7218: ERROR: Flow 0 (TCP connection): Can't set up socket: (null), drop > > 93.7331: ERROR: Selecting TCP_SEND_QUEUE, socket 1: Socket operation on non-socket > > 93.7333: ERROR: Unexpected reply from TCP_REPAIR helper: -100 > > > > that's because, oops, I only took care of socket() failures in > > tcp_flow_repair_socket(), but not bind() failures (!). Sorry. > > No, you check for errors on both. Well, "check", yes, but I'm not even setting an error code. I haven't tried your 3/3 yet but look at "(null)" resulting from: flow_err(flow, "Can't set up socket: %s, drop", strerror_(rc)); ...rc is 0. > The problem is that in > tcp_flow_migrate_target() we cancel the flow allocation and carry on - > but the source will still send information for this flow, putting us > out of sync with the stream. That, too, yes. > > Once that's fixed, flow_migrate_target() should also take care of > > decreasing 'count' accordingly. I just had a glimpse but didn't > > really try to sketch a fix. > > Adjusting count won't do the job. Instead we'd need to keep the flow > around, but marked as "dead" somehow, so that we read but discard the > incoming information for it. The MIGRATING state I added in one of my > drafts was supposed to help with this sort of thing. But that's quite > a complex change. I think it's great that you could (practically) solve it with three lines... > Hrm... at least in the near term, I think it might actually be easier > to set IP_FREEBIND when we create sockets for in-migrating flows. > That way we can process them normally, they just won't do much without > the address set. It has the additional advantage that it should work > if the higher layers only move the IP just after the migration, > instead of in advance. Perhaps we want it anyway, but I wonder: what happens if we turn repair mode off and we bound to a non-local address? I suppose we won't send out anything, but I'm not sure. If we send out the first keep-alive segment with a wrong address, we probably ruined the connection. 
Once I find a solution for the target libvirt/passt-repair thing (and the remaining SELinux issues), I'll try to have a look at this too. I haven't tried yet a migration with a mismatching address on the target and passt-repair available. -- Stefano ^ permalink raw reply [flat|nested] 10+ messages in thread
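On the passt side, the "wait with a timeout" variant mentioned above could be
as small as this sketch — assuming the repair socket is the usual listening
UNIX-domain socket that passt-repair connects to; the function name and the
idea of calling it at the start of migration are assumptions, not the actual
implementation:

#include <poll.h>
#include <sys/socket.h>

/* Give a late-starting passt-repair a short grace period (e.g. 100ms)
 * to connect, then either accept it or proceed without repair mode.
 */
static int repair_wait_helper(int listen_fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = listen_fd, .events = POLLIN };

	if (poll(&pfd, 1, timeout_ms) <= 0)
		return -1;	/* no helper: migrate, but TCP flows will break */

	return accept(listen_fd, NULL, NULL);
}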
* Re: [PATCH v2 0/2] More graceful handling of migration without passt-repair 2025-02-26 8:09 ` Stefano Brivio @ 2025-02-26 8:51 ` David Gibson 2025-02-26 11:24 ` Stefano Brivio 0 siblings, 1 reply; 10+ messages in thread From: David Gibson @ 2025-02-26 8:51 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 10868 bytes --] On Wed, Feb 26, 2025 at 09:09:48AM +0100, Stefano Brivio wrote: > On Wed, 26 Feb 2025 11:27:32 +1100 > David Gibson <david@gibson.dropbear.id.au> wrote: > > > On Tue, Feb 25, 2025 at 06:43:16PM +0100, Stefano Brivio wrote: > > > On Tue, 25 Feb 2025 16:51:30 +1100 > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > From Red Hat internal testing we've had some reports that if > > > > attempting to migrate without passt-repair, the failure mode is uglier > > > > than we'd like. The migration fails, which is somewhat expected, but > > > > we don't correctly roll things back on the source, so it breaks > > > > network there as well. > > > > > > > > Handle this more gracefully allowing the migration to proceed in this > > > > case, but allow TCP connections to break > > > > > > > > I've now tested this reasonably: > > > > * I get a clean migration if there are now active flows > > > > * Migration completes, although connections are broken if > > > > passt-repair isn't connected > > > > * Basic test suite (minus perf) > > > > > > > > I didn't manage to test with libvirt yet, but I'm pretty convinced the > > > > behaviour should be better than it was. > > > > > > I did, and it is. The series looks good to me and I would apply it as > > > it is, but I'm waiting a bit longer in case you want to try out some > > > variations based on my tests as well. Here's what I did. > > > > [snip] > > > > Thanks for the detailed instructions. More complex than I might have > > liked, but oh well. > > > > > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > > > Migration: [97.59 %]error: End of file while reading data: : Input/output error > > > > > > ...despite --verbose the error doesn't tell much (perhaps I need > > > LIBVIRT_DEBUG=1 instead?), but passt terminates at this point. With > > > this series (I just used 'make install' from the local build), migration > > > succeeds instead: > > > > > > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > > > Migration: [100.00 %] > > > > > > Now, on the target, I still have to figure out how to tell libvirt > > > to start QEMU and prepare for the migration (equivalent of '-incoming' > > > as we use in our tests), instead of just starting a new instance like > > > it does. Otherwise, I have no chance to start passt-repair there. > > > Perhaps it has something to do with persistent mode described here: > > > > Ah. So I'm pretty sure virsh migrate will automatically start qemu > > with --incoming on the target. > > ("-incoming"), yes, see src/qemu/qemu_migration.c, > qemuMigrationDstPrepare(). > > > IIUC the problem here is more about > > timing: we want it to start it early, so that we have a chance to > > start passt-repair and let it connect before the migration actually > > happens. > > For the timing itself, we could actually wait for passt-repair to be > there, with a timeout (say, 100ms). I guess. 
That still requires some way for KubeVirt (or whatever) to know at least roughly when it needs to launch passt-repair, and I'm not sure if that's something we can currently get from libvirt. > We could also modify passt-repair to set up an inotify watcher if the > socket isn't there yet. Maybe, yes. This kind of breaks our "passt starts first, passt-repair connects to it" model though, and I wonder if we need to revisit the security implications of that. > > Crud... I didn't think of this before. I don't know that there's any > > sensible way to do this without having libvirt managing passt-repair > > as well. > > But we can't really use it as we're assuming that passt-repair will run > with capabilities virtqemud doesn't want/need. Oh. True. > > I mean it's not impossible there's some option to do this, > > but I doubt there's been any reason before for something outside of > > libvirt to control the timing of the target qemu's creation. I think > > we need to ask libvirt people about this. > > I'm looking into it (and perhaps virtiofsd had similar troubles?). I'm guessing libvirt already knows how to start virtiofsd - just as it already knows how to start passt, just not passt-repair. > > > https://libvirt.org/migration.html#configuration-file-handling > > > > Yeah.. I don't think this is relevant. > > > > > and --listen-address, but I'm not quite sure yet. > > > > > > That is, I could only test different failures (early one on source, or > > > later one on target) with this, not a complete successful migration. > > > > > > > There are more fragile cases that I'm looking to fix, particularly the > > > > die()s in flow_migrate_source_rollback() and elsewhere, however I ran > > > > into various complications that I didn't manage to sort out today. > > > > I'll continue looking at those tomorrow. I'm now pretty confident > > > > that those additional fixes won't entirely supersede the changes in > > > > this series, so it should be fine to apply these on their own. > > > > > > By the way, I think the somewhat less fragile/more obvious case where > > > we fail clumsily is when the target doesn't have the same address as > > > the source (among other possible addresses). In that case, we fail (and > > > terminate) with a rather awkward: > > > > Ah, yes, that is a higher priority fragile case. > > > > > 93.7217: ERROR: Failed to bind socket for migrated flow: Cannot assign requested address > > > 93.7218: ERROR: Flow 0 (TCP connection): Can't set up socket: (null), drop > > > 93.7331: ERROR: Selecting TCP_SEND_QUEUE, socket 1: Socket operation on non-socket > > > 93.7333: ERROR: Unexpected reply from TCP_REPAIR helper: -100 > > > > > > that's because, oops, I only took care of socket() failures in > > > tcp_flow_repair_socket(), but not bind() failures (!). Sorry. > > > > No, you check for errors on both. > > Well, "check", yes, but I'm not even setting an error code. I haven't > tried your 3/3 yet but look at "(null)" resulting from: > > flow_err(flow, "Can't set up socket: %s, drop", strerror_(rc)); > > ...rc is 0. -1, not 0, otherwise we wouldn't enter that if clause at all. But, still, out of bounds for strerror(). I did spot that bug - tcp_flow_repair_socket() is directly passing on the return code from bind(), whereas it should be returning -errno. So, two bugs actually: 1) in the existing code we should return -errno not rc if bind() fails, 2) in my 3/3 it should be calling strerror() on -rc, not rc. 
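As a tiny, self-contained sketch of fix 2 (using plain strerror() here rather
than passt's strerror_() wrapper): once the helper returns -errno, the value
the caller has to print is -rc, not rc:

#include <stdio.h>
#include <string.h>
#include <errno.h>

int main(void)
{
	int rc = -EADDRNOTAVAIL;	/* what the helper returns after fix 1 */

	/* Fix 2: negate rc to get a positive errno value for strerror() */
	printf("Can't set up socket: %s, drop\n", strerror(-rc));
	return 0;
}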
> > The problem is that in > > tcp_flow_migrate_target() we cancel the flow allocation and carry on - > > but the source will still send information for this flow, putting us > > out of sync with the stream. > > That, too, yes. > > > > Once that's fixed, flow_migrate_target() should also take care of > > > decreasing 'count' accordingly. I just had a glimpse but didn't > > > really try to sketch a fix. > > > > Adjusting count won't do the job. Instead we'd need to keep the flow > > around, but marked as "dead" somehow, so that we read but discard the > > incoming information for it. The MIGRATING state I added in one of my > > drafts was supposed to help with this sort of thing. But that's quite > > a complex change. > > I think it's great that you could (practically) solve it with three > lines... Yeah, I sent that email at the beginning of my day, by the end I'd come up with the simpler approach. > > Hrm... at least in the near term, I think it might actually be easier > > to set IP_FREEBIND when we create sockets for in-migrating flows. > > That way we can process them normally, they just won't do much without > > the address set. It has the additional advantage that it should work > > if the higher layers only move the IP just after the migration, > > instead of in advance. > > Perhaps we want it anyway, but I wonder: Right, I'm no longer considering this as a short term solution, since checking for fd < 0 I think works better for the immediate problems. > what happens if we turn repair > mode off and we bound to a non-local address? I suppose we won't send > out anything, but I'm not sure. If we send out the first keep-alive > segment with a wrong address, we probably ruined the connection. That's a good point. More specifically, I think IP_FREEBIND is generally used for listen()ing sockets, I'm guessing you'll get an error if you try to connect() a socket that's bound to a non-local address. It's possible TCP_REPAIR would defer that until repair mode is switched off, which wouldn't make a lot of difference to us. It's also possible there could be bug in repair mode that would let you construct a non-locally bound, connected socket that way. I'm not entirely sure what the consequences would be. I guess that might already be possible in a different way: what happens if you have a connect()ed socket, then the admin removes the address to which it is bound? > Once I find a solution for the target libvirt/passt-repair thing (and > the remaining SELinux issues), I'll try to have a look at this too. I > haven't tried yet a migration with a mismatching address on the target > and passt-repair available. Right, I was trying to set up a test case for this today. I made some progress but didn't really get it working. I was using qemu directly with scripts to put the two ends into different net namespaces, rather than libvirt on separate L1 VMs. Working out how to get the two namespaces connected in a way I could do the migration, while still being separate enough was doing my head in a bit. In doing that, I also spotted another wrinkle. I don't think this is one we can reasonably fix - but we should be aware, since someone will probably try it at some point: migration is not going to work if the two hosts have their own connectivity provided by (separate instances of) passt or pasta (or slirp for that matter). 
The migrating VM can have its TCP stream reconstructed perfectly, so the right L2 packets come out of the host, but the host's own passt/pasta instance won't know about the flows and so will just drop/reject the packets. To make that work we'd basically have to migrate state for every "ancestor" passt/pasta until we hit a common namespace. That seems pretty infeasible to me, since the pieces that know about the migration probably don't own those layers of the network. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2 0/2] More graceful handling of migration without passt-repair 2025-02-26 8:51 ` David Gibson @ 2025-02-26 11:24 ` Stefano Brivio 2025-02-27 1:43 ` David Gibson 0 siblings, 1 reply; 10+ messages in thread From: Stefano Brivio @ 2025-02-26 11:24 UTC (permalink / raw) To: David Gibson; +Cc: passt-dev On Wed, 26 Feb 2025 19:51:11 +1100 David Gibson <david@gibson.dropbear.id.au> wrote: > On Wed, Feb 26, 2025 at 09:09:48AM +0100, Stefano Brivio wrote: > > On Wed, 26 Feb 2025 11:27:32 +1100 > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > On Tue, Feb 25, 2025 at 06:43:16PM +0100, Stefano Brivio wrote: > > > > On Tue, 25 Feb 2025 16:51:30 +1100 > > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > > > From Red Hat internal testing we've had some reports that if > > > > > attempting to migrate without passt-repair, the failure mode is uglier > > > > > than we'd like. The migration fails, which is somewhat expected, but > > > > > we don't correctly roll things back on the source, so it breaks > > > > > network there as well. > > > > > > > > > > Handle this more gracefully allowing the migration to proceed in this > > > > > case, but allow TCP connections to break > > > > > > > > > > I've now tested this reasonably: > > > > > * I get a clean migration if there are now active flows > > > > > * Migration completes, although connections are broken if > > > > > passt-repair isn't connected > > > > > * Basic test suite (minus perf) > > > > > > > > > > I didn't manage to test with libvirt yet, but I'm pretty convinced the > > > > > behaviour should be better than it was. > > > > > > > > I did, and it is. The series looks good to me and I would apply it as > > > > it is, but I'm waiting a bit longer in case you want to try out some > > > > variations based on my tests as well. Here's what I did. > > > > > > [snip] > > > > > > Thanks for the detailed instructions. More complex than I might have > > > liked, but oh well. > > > > > > > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > > > > Migration: [97.59 %]error: End of file while reading data: : Input/output error > > > > > > > > ...despite --verbose the error doesn't tell much (perhaps I need > > > > LIBVIRT_DEBUG=1 instead?), but passt terminates at this point. With > > > > this series (I just used 'make install' from the local build), migration > > > > succeeds instead: > > > > > > > > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > > > > Migration: [100.00 %] > > > > > > > > Now, on the target, I still have to figure out how to tell libvirt > > > > to start QEMU and prepare for the migration (equivalent of '-incoming' > > > > as we use in our tests), instead of just starting a new instance like > > > > it does. Otherwise, I have no chance to start passt-repair there. > > > > Perhaps it has something to do with persistent mode described here: > > > > > > Ah. So I'm pretty sure virsh migrate will automatically start qemu > > > with --incoming on the target. > > > > ("-incoming"), yes, see src/qemu/qemu_migration.c, > > qemuMigrationDstPrepare(). > > > > > IIUC the problem here is more about > > > timing: we want it to start it early, so that we have a chance to > > > start passt-repair and let it connect before the migration actually > > > happens. > > > > For the timing itself, we could actually wait for passt-repair to be > > there, with a timeout (say, 100ms). > > I guess. 
That still requires some way for KubeVirt (or whatever) to > know at least roughly when it needs to launch passt-repair, and I'm > not sure if that's something we can currently get from libvirt. KubeVirt sets up the target pod, and that's when it should be done (if we have an inotify mechanism or similar). I can't point to an exact code path yet but there's something like that. > > We could also modify passt-repair to set up an inotify watcher if the > > socket isn't there yet. > > Maybe, yes. This kind of breaks our "passt starts first, passt-repair > connects to it" model though, and I wonder if we need to revisit the > security implications of that. I don't think it actually breaks that model for security purposes, because the guest doesn't have anyway a chance to cause a connection to passt-repair. The guest is still suspended (or missing) at that point. > > > Crud... I didn't think of this before. I don't know that there's any > > > sensible way to do this without having libvirt managing passt-repair > > > as well. > > > > But we can't really use it as we're assuming that passt-repair will run > > with capabilities virtqemud doesn't want/need. > > Oh. True. > > > > I mean it's not impossible there's some option to do this, > > > but I doubt there's been any reason before for something outside of > > > libvirt to control the timing of the target qemu's creation. I think > > > we need to ask libvirt people about this. > > > > I'm looking into it (and perhaps virtiofsd had similar troubles?). > > I'm guessing libvirt already knows how to start virtiofsd - just as it > already knows how to start passt, just not passt-repair. > > > > > https://libvirt.org/migration.html#configuration-file-handling > > > > > > Yeah.. I don't think this is relevant. > > > > > > > and --listen-address, but I'm not quite sure yet. > > > > > > > > That is, I could only test different failures (early one on source, or > > > > later one on target) with this, not a complete successful migration. > > > > > > > > > There are more fragile cases that I'm looking to fix, particularly the > > > > > die()s in flow_migrate_source_rollback() and elsewhere, however I ran > > > > > into various complications that I didn't manage to sort out today. > > > > > I'll continue looking at those tomorrow. I'm now pretty confident > > > > > that those additional fixes won't entirely supersede the changes in > > > > > this series, so it should be fine to apply these on their own. > > > > > > > > By the way, I think the somewhat less fragile/more obvious case where > > > > we fail clumsily is when the target doesn't have the same address as > > > > the source (among other possible addresses). In that case, we fail (and > > > > terminate) with a rather awkward: > > > > > > Ah, yes, that is a higher priority fragile case. > > > > > > > 93.7217: ERROR: Failed to bind socket for migrated flow: Cannot assign requested address > > > > 93.7218: ERROR: Flow 0 (TCP connection): Can't set up socket: (null), drop > > > > 93.7331: ERROR: Selecting TCP_SEND_QUEUE, socket 1: Socket operation on non-socket > > > > 93.7333: ERROR: Unexpected reply from TCP_REPAIR helper: -100 > > > > > > > > that's because, oops, I only took care of socket() failures in > > > > tcp_flow_repair_socket(), but not bind() failures (!). Sorry. > > > > > > No, you check for errors on both. > > > > Well, "check", yes, but I'm not even setting an error code. 
I haven't > > tried your 3/3 yet but look at "(null)" resulting from: > > > > flow_err(flow, "Can't set up socket: %s, drop", strerror_(rc)); > > > > ...rc is 0. > > -1, not 0, otherwise we wouldn't enter that if clause at all. Ah, oops, right. > But, > still, out of bounds for strerror(). I did spot that bug - > tcp_flow_repair_socket() is directly passing on the return code from > bind(), whereas it should be returning -errno. > > So, two bugs actually: 1) in the existing code we should return -errno > not rc if bind() fails, 2) in my 3/3 it should be calling strerror() > on -rc, not rc. Right, yes, 1) is what I meant. > > > The problem is that in > > > tcp_flow_migrate_target() we cancel the flow allocation and carry on - > > > but the source will still send information for this flow, putting us > > > out of sync with the stream. > > > > That, too, yes. > > > > > > Once that's fixed, flow_migrate_target() should also take care of > > > > decreasing 'count' accordingly. I just had a glimpse but didn't > > > > really try to sketch a fix. > > > > > > Adjusting count won't do the job. Instead we'd need to keep the flow > > > around, but marked as "dead" somehow, so that we read but discard the > > > incoming information for it. The MIGRATING state I added in one of my > > > drafts was supposed to help with this sort of thing. But that's quite > > > a complex change. > > > > I think it's great that you could (practically) solve it with three > > lines... > > Yeah, I sent that email at the beginning of my day, by the end I'd > come up with the simpler approach. > > > > Hrm... at least in the near term, I think it might actually be easier > > > to set IP_FREEBIND when we create sockets for in-migrating flows. > > > That way we can process them normally, they just won't do much without > > > the address set. It has the additional advantage that it should work > > > if the higher layers only move the IP just after the migration, > > > instead of in advance. > > > > Perhaps we want it anyway, but I wonder: > > Right, I'm no longer considering this as a short term solution, since > checking for fd < 0 I think works better for the immediate problems. > > > what happens if we turn repair > > mode off and we bound to a non-local address? I suppose we won't send > > out anything, but I'm not sure. If we send out the first keep-alive > > segment with a wrong address, we probably ruined the connection. > > That's a good point. More specifically, I think IP_FREEBIND is > generally used for listen()ing sockets, I'm guessing you'll get an > error if you try to connect() a socket that's bound to a non-local > address. It's possible TCP_REPAIR would defer that until repair mode > is switched off, which wouldn't make a lot of difference to us. It's > also possible there could be bug in repair mode that would let you > construct a non-locally bound, connected socket that way. I'm not > entirely sure what the consequences would be. I guess that might > already be possible in a different way: what happens if you have a > connect()ed socket, then the admin removes the address to which it is > bound? I'm fairly sure that the outcome of testing this will surprise us in a way or another, so probably we should start from testing it. We could even think of deferring switching repair mode off until the right address is there, by the way. That would make a difference to us. > > Once I find a solution for the target libvirt/passt-repair thing (and > > the remaining SELinux issues), I'll try to have a look at this too. 
I > > haven't tried yet a migration with a mismatching address on the target > > and passt-repair available. > > Right, I was trying to set up a test case for this today. I made some > progress but didn't really get it working. I was using qemu directly > with scripts to put the two ends into different net namespaces, rather > than libvirt on separate L1 VMs. Working out how to get the two > namespaces connected in a way I could do the migration, while still > being separate enough was doing my head in a bit. I didn't actually get to the point of having a truly working migration. I just considered a clean attempt at resuming an existing connection (with keep-alive and subsequent RST from underlying passt instance) as success. So, yes, of course, it would be great to have a full simulation in a compact form of what's going on with KubeVirt and OVN. Probably bridging the L1 VMs would be a quick solution. It needs root (at least for the setup), which means they become L2 (at least in my world): L1 would still be connected by passt, the two L2 instances are bridged, and source and target guests would be L3. > In doing that, I also spotted another wrinkle. I don't think this is > one we can reasonably fix - but we should be aware, since someone will > probably try it at some point: Yeah, I tried it, you might see remnants of that in the setup_migrate() stuff (I copied it from the "two_guests" test). I originally wanted to have two namespaces and two instances of pasta (just like "two_guests"), but soon realised the issue you describe below. > migration is not going to work if the > two hosts have their own connectivity provided by (separate instances > of) passt or pasta (or slirp for that matter). The migrating VM can > have its TCP stream reconstructed perfectly, so the right L2 packets > come out of the host, but the host's own passt/pasta instance won't > know about the flows and so will just drop/reject the packets. > > To make that work we'd basically have to migrate state for every > "ancestor" passt/pasta until we hit a common namespace. That seems > pretty infeasible to me, since the pieces that know about the > migration probably don't own those layers of the network. If there's any potential interest around it, we could abstract things a bit more than we did until now and have some kind of o... orch... coordination of data/migration flows, with some kind of external tool identifying and connecting to several instances of passt and moving flows between them. It sounds a bit like OVN, but with the notable difference that we don't need to implement a overlay network. -- Stefano ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2 0/2] More graceful handling of migration without passt-repair 2025-02-26 11:24 ` Stefano Brivio @ 2025-02-27 1:43 ` David Gibson 2025-02-27 4:32 ` Stefano Brivio 0 siblings, 1 reply; 10+ messages in thread From: David Gibson @ 2025-02-27 1:43 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 15725 bytes --] On Wed, Feb 26, 2025 at 12:24:12PM +0100, Stefano Brivio wrote: > On Wed, 26 Feb 2025 19:51:11 +1100 > David Gibson <david@gibson.dropbear.id.au> wrote: > > > On Wed, Feb 26, 2025 at 09:09:48AM +0100, Stefano Brivio wrote: > > > On Wed, 26 Feb 2025 11:27:32 +1100 > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > On Tue, Feb 25, 2025 at 06:43:16PM +0100, Stefano Brivio wrote: > > > > > On Tue, 25 Feb 2025 16:51:30 +1100 > > > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > > > > > From Red Hat internal testing we've had some reports that if > > > > > > attempting to migrate without passt-repair, the failure mode is uglier > > > > > > than we'd like. The migration fails, which is somewhat expected, but > > > > > > we don't correctly roll things back on the source, so it breaks > > > > > > network there as well. > > > > > > > > > > > > Handle this more gracefully allowing the migration to proceed in this > > > > > > case, but allow TCP connections to break > > > > > > > > > > > > I've now tested this reasonably: > > > > > > * I get a clean migration if there are now active flows > > > > > > * Migration completes, although connections are broken if > > > > > > passt-repair isn't connected > > > > > > * Basic test suite (minus perf) > > > > > > > > > > > > I didn't manage to test with libvirt yet, but I'm pretty convinced the > > > > > > behaviour should be better than it was. > > > > > > > > > > I did, and it is. The series looks good to me and I would apply it as > > > > > it is, but I'm waiting a bit longer in case you want to try out some > > > > > variations based on my tests as well. Here's what I did. > > > > > > > > [snip] > > > > > > > > Thanks for the detailed instructions. More complex than I might have > > > > liked, but oh well. > > > > > > > > > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > > > > > Migration: [97.59 %]error: End of file while reading data: : Input/output error > > > > > > > > > > ...despite --verbose the error doesn't tell much (perhaps I need > > > > > LIBVIRT_DEBUG=1 instead?), but passt terminates at this point. With > > > > > this series (I just used 'make install' from the local build), migration > > > > > succeeds instead: > > > > > > > > > > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > > > > > Migration: [100.00 %] > > > > > > > > > > Now, on the target, I still have to figure out how to tell libvirt > > > > > to start QEMU and prepare for the migration (equivalent of '-incoming' > > > > > as we use in our tests), instead of just starting a new instance like > > > > > it does. Otherwise, I have no chance to start passt-repair there. > > > > > Perhaps it has something to do with persistent mode described here: > > > > > > > > Ah. So I'm pretty sure virsh migrate will automatically start qemu > > > > with --incoming on the target. > > > > > > ("-incoming"), yes, see src/qemu/qemu_migration.c, > > > qemuMigrationDstPrepare(). 
> > > > > > > IIUC the problem here is more about > > > > timing: we want it to start it early, so that we have a chance to > > > > start passt-repair and let it connect before the migration actually > > > > happens. > > > > > > For the timing itself, we could actually wait for passt-repair to be > > > there, with a timeout (say, 100ms). > > > > I guess. That still requires some way for KubeVirt (or whatever) to > > know at least roughly when it needs to launch passt-repair, and I'm > > not sure if that's something we can currently get from libvirt. > > KubeVirt sets up the target pod, and that's when it should be done (if > we have an inotify mechanism or similar). I can't point to an exact code > path yet but there's something like that. Right, but that approach does require inotify and starting passt-repair before passt, which might be fine, but I have the concern noted below. To avoid that we'd need notification after passt & qemu are started on the target, but before the migration is actually initiated which I don't think libvirt provides. > > > We could also modify passt-repair to set up an inotify watcher if the > > > socket isn't there yet. > > > > Maybe, yes. This kind of breaks our "passt starts first, passt-repair > > connects to it" model though, and I wonder if we need to revisit the > > security implications of that. > > I don't think it actually breaks that model for security purposes, > because the guest doesn't have anyway a chance to cause a connection to > passt-repair. The guest is still suspended (or missing) at that point. I wasn't thinking of threat models coming from the guest, but an attack from an unrelated process impersonating passt in order to access passt-repair's superpowers. > > > > Crud... I didn't think of this before. I don't know that there's any > > > > sensible way to do this without having libvirt managing passt-repair > > > > as well. > > > > > > But we can't really use it as we're assuming that passt-repair will run > > > with capabilities virtqemud doesn't want/need. > > > > Oh. True. > > > > > > I mean it's not impossible there's some option to do this, > > > > but I doubt there's been any reason before for something outside of > > > > libvirt to control the timing of the target qemu's creation. I think > > > > we need to ask libvirt people about this. > > > > > > I'm looking into it (and perhaps virtiofsd had similar troubles?). > > > > I'm guessing libvirt already knows how to start virtiofsd - just as it > > already knows how to start passt, just not passt-repair. > > > > > > > https://libvirt.org/migration.html#configuration-file-handling > > > > > > > > Yeah.. I don't think this is relevant. > > > > > > > > > and --listen-address, but I'm not quite sure yet. > > > > > > > > > > That is, I could only test different failures (early one on source, or > > > > > later one on target) with this, not a complete successful migration. > > > > > > > > > > > There are more fragile cases that I'm looking to fix, particularly the > > > > > > die()s in flow_migrate_source_rollback() and elsewhere, however I ran > > > > > > into various complications that I didn't manage to sort out today. > > > > > > I'll continue looking at those tomorrow. I'm now pretty confident > > > > > > that those additional fixes won't entirely supersede the changes in > > > > > > this series, so it should be fine to apply these on their own. 
> > > > > > > > > > By the way, I think the somewhat less fragile/more obvious case where > > > > > we fail clumsily is when the target doesn't have the same address as > > > > > the source (among other possible addresses). In that case, we fail (and > > > > > terminate) with a rather awkward: > > > > > > > > Ah, yes, that is a higher priority fragile case. > > > > > > > > > 93.7217: ERROR: Failed to bind socket for migrated flow: Cannot assign requested address > > > > > 93.7218: ERROR: Flow 0 (TCP connection): Can't set up socket: (null), drop > > > > > 93.7331: ERROR: Selecting TCP_SEND_QUEUE, socket 1: Socket operation on non-socket > > > > > 93.7333: ERROR: Unexpected reply from TCP_REPAIR helper: -100 > > > > > > > > > > that's because, oops, I only took care of socket() failures in > > > > > tcp_flow_repair_socket(), but not bind() failures (!). Sorry. > > > > > > > > No, you check for errors on both. > > > > > > Well, "check", yes, but I'm not even setting an error code. I haven't > > > tried your 3/3 yet but look at "(null)" resulting from: > > > > > > flow_err(flow, "Can't set up socket: %s, drop", strerror_(rc)); > > > > > > ...rc is 0. > > > > -1, not 0, otherwise we wouldn't enter that if clause at all. > > Ah, oops, right. > > > But, > > still, out of bounds for strerror(). I did spot that bug - > > tcp_flow_repair_socket() is directly passing on the return code from > > bind(), whereas it should be returning -errno. > > > > So, two bugs actually: 1) in the existing code we should return -errno > > not rc if bind() fails, 2) in my 3/3 it should be calling strerror() > > on -rc, not rc. > > Right, yes, 1) is what I meant. I'll fix both of these for the next spin. > > > > The problem is that in > > > > tcp_flow_migrate_target() we cancel the flow allocation and carry on - > > > > but the source will still send information for this flow, putting us > > > > out of sync with the stream. > > > > > > That, too, yes. > > > > > > > > Once that's fixed, flow_migrate_target() should also take care of > > > > > decreasing 'count' accordingly. I just had a glimpse but didn't > > > > > really try to sketch a fix. > > > > > > > > Adjusting count won't do the job. Instead we'd need to keep the flow > > > > around, but marked as "dead" somehow, so that we read but discard the > > > > incoming information for it. The MIGRATING state I added in one of my > > > > drafts was supposed to help with this sort of thing. But that's quite > > > > a complex change. > > > > > > I think it's great that you could (practically) solve it with three > > > lines... > > > > Yeah, I sent that email at the beginning of my day, by the end I'd > > come up with the simpler approach. > > > > > > Hrm... at least in the near term, I think it might actually be easier > > > > to set IP_FREEBIND when we create sockets for in-migrating flows. > > > > That way we can process them normally, they just won't do much without > > > > the address set. It has the additional advantage that it should work > > > > if the higher layers only move the IP just after the migration, > > > > instead of in advance. > > > > > > Perhaps we want it anyway, but I wonder: > > > > Right, I'm no longer considering this as a short term solution, since > > checking for fd < 0 I think works better for the immediate problems. > > > > > what happens if we turn repair > > > mode off and we bound to a non-local address? I suppose we won't send > > > out anything, but I'm not sure. 
If we send out the first keep-alive > > > segment with a wrong address, we probably ruined the connection. > > > > That's a good point. More specifically, I think IP_FREEBIND is > > generally used for listen()ing sockets, I'm guessing you'll get an > > error if you try to connect() a socket that's bound to a non-local > > address. It's possible TCP_REPAIR would defer that until repair mode > > is switched off, which wouldn't make a lot of difference to us. It's > > also possible there could be bug in repair mode that would let you > > construct a non-locally bound, connected socket that way. I'm not > > entirely sure what the consequences would be. I guess that might > > already be possible in a different way: what happens if you have a > > connect()ed socket, then the admin removes the address to which it is > > bound? > > I'm fairly sure that the outcome of testing this will surprise us in a > way or another, so probably we should start from testing it. Yeah, I tend to agree. > We could even think of deferring switching repair mode off until the > right address is there, by the way. That would make a difference to > us. Do you mean by blocking? Or by returning to normal operation with the flow flagged somehow to be "woken up" by a netlink monitor? > > > Once I find a solution for the target libvirt/passt-repair thing (and > > > the remaining SELinux issues), I'll try to have a look at this too. I > > > haven't tried yet a migration with a mismatching address on the target > > > and passt-repair available. > > > > Right, I was trying to set up a test case for this today. I made some > > progress but didn't really get it working. I was using qemu directly > > with scripts to put the two ends into different net namespaces, rather > > than libvirt on separate L1 VMs. Working out how to get the two > > namespaces connected in a way I could do the migration, while still > > being separate enough was doing my head in a bit. > > I didn't actually get to the point of having a truly working migration. > I just considered a clean attempt at resuming an existing connection > (with keep-alive and subsequent RST from underlying passt instance) as > success. > > So, yes, of course, it would be great to have a full simulation in a > compact form of what's going on with KubeVirt and OVN. > > Probably bridging the L1 VMs would be a quick solution. It needs root Well.. for some value of quick. That's basically what I've been working towards, except using namespaces for the L1s instead of VMs. Figuring out exactly how to set up the bridge so that we have addressing that allows the migration between the qemus to take place, and also has the properties we need for passt to work and test the things we want is kind of fiddly. > (at least for the setup), which means they become L2 (at least in my > world): L1 would still be connected by passt, the two L2 instances are > bridged, and source and target guests would be L3. For the tests here we don't even necessarily need the L1 connected to the outside world: we can put the peer for the migrated connection(s) inside the L1 > > In doing that, I also spotted another wrinkle. I don't think this is > > one we can reasonably fix - but we should be aware, since someone will > > probably try it at some point: > > Yeah, I tried it, you might see remnants of that in the setup_migrate() > stuff (I copied it from the "two_guests" test). 
I originally wanted to > have two namespaces and two instances of pasta (just like > "two_guests"), but soon realised the issue you describe below. Right. It took me an embarrassing amount of time to figure out what was going wrong, alas. > > migration is not going to work if the > > two hosts have their own connectivity provided by (separate instances > > of) passt or pasta (or slirp for that matter). The migrating VM can > > have its TCP stream reconstructed perfectly, so the right L2 packets > > come out of the host, but the host's own passt/pasta instance won't > > know about the flows and so will just drop/reject the packets. > > > > To make that work we'd basically have to migrate state for every > > "ancestor" passt/pasta until we hit a common namespace. That seems > > pretty infeasible to me, since the pieces that know about the > > migration probably don't own those layers of the network. > > If there's any potential interest around it, we could abstract things a > bit more than we did until now and have some kind of o... orch... > coordination of data/migration flows, with some kind of external tool > identifying and connecting to several instances of passt and moving > flows between them. It sounds a bit like OVN, but with the notable > difference that we don't need to implement an overlay network. Right. I mean this is basically the responsibility of whatever's managing host nodes' network. For a virtual bridge or router it's a relatively simple matter of changing where the guest address is routed. If the host nodes' network involves passt or pasta it's... harder. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread
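For reference, a minimal sketch of the error-path change discussed above: return -errno when bind() fails, so the caller can pass a meaningful code to strerror(). The function and variable names are simplified stand-ins, not passt's actual tcp_flow_repair_socket() internals, and the IP_FREEBIND line only illustrates the idea floated (and set aside as a short-term fix) in this thread:

#include <errno.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Sketch only: simplified socket setup for an in-migrating flow */
static int repair_sock_sketch(const struct sockaddr *addr, socklen_t len)
{
	int s = socket(addr->sa_family, SOCK_STREAM | SOCK_CLOEXEC,
		       IPPROTO_TCP);

	if (s < 0)
		return -errno;

	/* The IP_FREEBIND idea from the thread (IPv4 case) would go here,
	 * allowing a bind to an address that isn't configured locally yet:
	 *
	 *	setsockopt(s, IPPROTO_IP, IP_FREEBIND, &(int){ 1 }, sizeof(int));
	 */

	if (bind(s, addr, len) < 0) {
		int rc = -errno;	/* save errno before close() clobbers it */

		close(s);
		return rc;		/* -errno, not bind()'s bare -1 */
	}

	return s;
}

On the caller side, the message quoted earlier would then pass the negated return value to strerror_(), that is strerror_(-rc) rather than strerror_(rc), matching the two fixes listed above for the next spin.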
* Re: [PATCH v2 0/2] More graceful handling of migration without passt-repair 2025-02-27 1:43 ` David Gibson @ 2025-02-27 4:32 ` Stefano Brivio 0 siblings, 0 replies; 10+ messages in thread From: Stefano Brivio @ 2025-02-27 4:32 UTC (permalink / raw) To: David Gibson; +Cc: passt-dev On Thu, 27 Feb 2025 12:43:41 +1100 David Gibson <david@gibson.dropbear.id.au> wrote: > On Wed, Feb 26, 2025 at 12:24:12PM +0100, Stefano Brivio wrote: > > On Wed, 26 Feb 2025 19:51:11 +1100 > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > On Wed, Feb 26, 2025 at 09:09:48AM +0100, Stefano Brivio wrote: > > > > On Wed, 26 Feb 2025 11:27:32 +1100 > > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > > > On Tue, Feb 25, 2025 at 06:43:16PM +0100, Stefano Brivio wrote: > > > > > > On Tue, 25 Feb 2025 16:51:30 +1100 > > > > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > > > > > > > From Red Hat internal testing we've had some reports that if > > > > > > > attempting to migrate without passt-repair, the failure mode is uglier > > > > > > > than we'd like. The migration fails, which is somewhat expected, but > > > > > > > we don't correctly roll things back on the source, so it breaks > > > > > > > network there as well. > > > > > > > > > > > > > > Handle this more gracefully allowing the migration to proceed in this > > > > > > > case, but allow TCP connections to break > > > > > > > > > > > > > > I've now tested this reasonably: > > > > > > > * I get a clean migration if there are now active flows > > > > > > > * Migration completes, although connections are broken if > > > > > > > passt-repair isn't connected > > > > > > > * Basic test suite (minus perf) > > > > > > > > > > > > > > I didn't manage to test with libvirt yet, but I'm pretty convinced the > > > > > > > behaviour should be better than it was. > > > > > > > > > > > > I did, and it is. The series looks good to me and I would apply it as > > > > > > it is, but I'm waiting a bit longer in case you want to try out some > > > > > > variations based on my tests as well. Here's what I did. > > > > > > > > > > [snip] > > > > > > > > > > Thanks for the detailed instructions. More complex than I might have > > > > > liked, but oh well. > > > > > > > > > > > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > > > > > > Migration: [97.59 %]error: End of file while reading data: : Input/output error > > > > > > > > > > > > ...despite --verbose the error doesn't tell much (perhaps I need > > > > > > LIBVIRT_DEBUG=1 instead?), but passt terminates at this point. With > > > > > > this series (I just used 'make install' from the local build), migration > > > > > > succeeds instead: > > > > > > > > > > > > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > > > > > > Migration: [100.00 %] > > > > > > > > > > > > Now, on the target, I still have to figure out how to tell libvirt > > > > > > to start QEMU and prepare for the migration (equivalent of '-incoming' > > > > > > as we use in our tests), instead of just starting a new instance like > > > > > > it does. Otherwise, I have no chance to start passt-repair there. > > > > > > Perhaps it has something to do with persistent mode described here: > > > > > > > > > > Ah. So I'm pretty sure virsh migrate will automatically start qemu > > > > > with --incoming on the target. 
> > > > > > > > ("-incoming"), yes, see src/qemu/qemu_migration.c, > > > > qemuMigrationDstPrepare(). > > > > > > > > > IIUC the problem here is more about > > > > > timing: we want it to start it early, so that we have a chance to > > > > > start passt-repair and let it connect before the migration actually > > > > > happens. > > > > > > > > For the timing itself, we could actually wait for passt-repair to be > > > > there, with a timeout (say, 100ms). > > > > > > I guess. That still requires some way for KubeVirt (or whatever) to > > > know at least roughly when it needs to launch passt-repair, and I'm > > > not sure if that's something we can currently get from libvirt. > > > > KubeVirt sets up the target pod, and that's when it should be done (if > > we have an inotify mechanism or similar). I can't point to an exact code > > path yet but there's something like that. > > Right, but that approach does require inotify and starting > passt-repair before passt, which might be fine, but I have the concern > noted below. To avoid that we'd need notification after passt & qemu > are started on the target, but before the migration is actually > initiated which I don't think libvirt provides. > > > > > We could also modify passt-repair to set up an inotify watcher if the > > > > socket isn't there yet. > > > > > > Maybe, yes. This kind of breaks our "passt starts first, passt-repair > > > connects to it" model though, and I wonder if we need to revisit the > > > security implications of that. > > > > I don't think it actually breaks that model for security purposes, > > because the guest doesn't have anyway a chance to cause a connection to > > passt-repair. The guest is still suspended (or missing) at that point. > > I wasn't thinking of threat models coming from the guest, but an > attack from an unrelated process impersonating passt in order to > access passt-repair's superpowers. Then an inotify watch shouldn't substantially change things. The attacker could create the socket earlier and obtain the same outcome. > [...] > > > We could even think of deferring switching repair mode off until the > > right address is there, by the way. That would make a difference to > > us. > > Do you mean by blocking? Or by returning to normal operation with the > flow flagged somehow to be "woken up" by a netlink monitor? The latter. I don't think we should block connectivity (with new addresses) meanwhile. -- Stefano ^ permalink raw reply [flat|nested] 10+ messages in thread
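A rough sketch of the inotify idea mentioned in this exchange: passt-repair waiting briefly (say, 100ms) for the socket to appear instead of failing straight away. The helper name, the directory/name split and the single-event handling are illustrative assumptions, not passt-repair's actual code:

#include <limits.h>
#include <poll.h>
#include <string.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Sketch only: wait up to timeout_ms for 'name' to be created in 'dir'.
 * A real version would also re-check for the socket after adding the
 * watch (it may already exist), handle multiple queued events per read,
 * and track an overall deadline instead of re-arming the full timeout.
 */
static int wait_for_socket(const char *dir, const char *name, int timeout_ms)
{
	char buf[sizeof(struct inotify_event) + NAME_MAX + 1]
	     __attribute__ ((aligned(__alignof__(struct inotify_event))));
	struct pollfd pfd = { .events = POLLIN };
	int fd = inotify_init1(IN_CLOEXEC);

	if (fd < 0)
		return -1;

	if (inotify_add_watch(fd, dir, IN_CREATE) < 0) {
		close(fd);
		return -1;
	}

	pfd.fd = fd;
	while (poll(&pfd, 1, timeout_ms) > 0) {
		const struct inotify_event *ev = (const void *)buf;

		if (read(fd, buf, sizeof(buf)) <= 0)
			continue;

		if (ev->len && !strcmp(ev->name, name)) {
			close(fd);
			return 0;	/* socket showed up, go connect */
		}
	}

	close(fd);
	return -1;	/* timed out or error */
}

Whether the watch is set up by passt-repair itself or the wait happens on the passt side before migration starts is exactly the timing question left open above; the sketch only illustrates the mechanism.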
end of thread, other threads:[~2025-02-27 4:32 UTC | newest] Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2025-02-25 5:51 [PATCH v2 0/2] More graceful handling of migration without passt-repair David Gibson 2025-02-25 5:51 ` [PATCH v2 1/2] migrate, flow: Trivially succeed if migrating with no flows David Gibson 2025-02-25 5:51 ` [PATCH v2 2/2] migrate, flow: Don't attempt to migrate TCP flows without passt-repair David Gibson 2025-02-25 17:43 ` [PATCH v2 0/2] More graceful handling of migration " Stefano Brivio 2025-02-26 0:27 ` David Gibson 2025-02-26 8:09 ` Stefano Brivio 2025-02-26 8:51 ` David Gibson 2025-02-26 11:24 ` Stefano Brivio 2025-02-27 1:43 ` David Gibson 2025-02-27 4:32 ` Stefano Brivio