* [PATCH v2 0/2] More graceful handling of migration without passt-repair
  2025-02-25  5:51 David Gibson

From: David Gibson @ 2025-02-25 5:51 UTC
To: Stefano Brivio, passt-dev; +Cc: David Gibson

From Red Hat internal testing we've had some reports that if
attempting to migrate without passt-repair, the failure mode is uglier
than we'd like.  The migration fails, which is somewhat expected, but
we don't correctly roll things back on the source, so it breaks
networking there as well.

Handle this more gracefully, allowing the migration to proceed in this
case, but allowing TCP connections to break.

I've now tested this reasonably:
 * I get a clean migration if there are no active flows
 * Migration completes, although connections are broken if
   passt-repair isn't connected
 * Basic test suite (minus perf)

I didn't manage to test with libvirt yet, but I'm pretty convinced the
behaviour should be better than it was.

There are more fragile cases that I'm looking to fix, particularly the
die()s in flow_migrate_source_rollback() and elsewhere; however, I ran
into various complications that I didn't manage to sort out today.
I'll continue looking at those tomorrow.  I'm now pretty confident
that those additional fixes won't entirely supersede the changes in
this series, so it should be fine to apply these on their own.

David Gibson (2):
  migrate, flow: Trivially succeed if migrating with no flows
  migrate, flow: Don't attempt to migrate TCP flows without passt-repair

 flow.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

-- 
2.48.1
* [PATCH v2 1/2] migrate, flow: Trivially succeed if migrating with no flows
  2025-02-25  5:51 David Gibson

From: David Gibson @ 2025-02-25 5:51 UTC
To: Stefano Brivio, passt-dev; +Cc: David Gibson

We could get a migration request when we have no active flows; or at
least none that we need or are able to migrate.  In this case after
sending or receiving the number of flows we continue to step through
various lists.  In the target case, this could include communication
with passt-repair.  If passt-repair wasn't started that could cause
further errors, but of course they shouldn't matter if we have nothing
to repair.

Make it more obvious that there's nothing to do and avoid such errors
by short-circuiting flow_migrate_{source,target}() if there are no
migratable flows.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/flow.c b/flow.c
index bb5dcc3c..6cf96c26 100644
--- a/flow.c
+++ b/flow.c
@@ -999,6 +999,9 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
 
 	debug("Sending %u flows", ntohl(count));
 
+	if (!count)
+		return 0;
+
 	/* Dump and send information that can be stored in the flow table.
 	 *
 	 * Limited rollback options here: if we fail to transfer any data (that
@@ -1070,6 +1073,9 @@ int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
 	count = ntohl(count);
 	debug("Receiving %u flows", count);
 
+	if (!count)
+		return 0;
+
 	if ((rc = flow_migrate_repair_all(c, true)))
 		return -rc;
 
-- 
2.48.1
* [PATCH v2 2/2] migrate, flow: Don't attempt to migrate TCP flows without passt-repair
  2025-02-25  5:51 David Gibson

From: David Gibson @ 2025-02-25 5:51 UTC
To: Stefano Brivio, passt-dev; +Cc: David Gibson

Migrating TCP flows requires passt-repair in order to use TCP_REPAIR.
If passt-repair is not started, our failure mode is pretty ugly though:
we'll attempt the migration, hitting various problems when we can't
enter repair mode.  In some cases we may not roll back these changes
properly, meaning we break network connections on the source.

Our general approach is not to completely block migration if there are
problems, but simply to break any flows we can't migrate.  So, if we
have no connection from passt-repair carry on with the migration, but
don't attempt to migrate any TCP connections.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/flow.c b/flow.c
index 6cf96c26..749c4984 100644
--- a/flow.c
+++ b/flow.c
@@ -923,6 +923,10 @@ static int flow_migrate_repair_all(struct ctx *c, bool enable)
 	union flow *flow;
 	int rc;
 
+	/* If we don't have a repair helper, there's nothing we can do */
+	if (c->fd_repair < 0)
+		return 0;
+
 	foreach_established_tcp_flow(flow) {
 		if (enable)
 			rc = tcp_flow_repair_on(c, &flow->tcp);
@@ -987,8 +991,11 @@ int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
 	(void)c;
 	(void)stage;
 
-	foreach_established_tcp_flow(flow)
-		count++;
+	/* If we don't have a repair helper, we can't migrate TCP flows */
+	if (c->fd_repair >= 0) {
+		foreach_established_tcp_flow(flow)
+			count++;
+	}
 
 	count = htonl(count);
 	if (write_all_buf(fd, &count, sizeof(count))) {
-- 
2.48.1
* Re: [PATCH v2 0/2] More graceful handling of migration without passt-repair 2025-02-25 5:51 [PATCH v2 0/2] More graceful handling of migration without passt-repair David Gibson 2025-02-25 5:51 ` [PATCH v2 1/2] migrate, flow: Trivially succeed if migrating with no flows David Gibson 2025-02-25 5:51 ` [PATCH v2 2/2] migrate, flow: Don't attempt to migrate TCP flows without passt-repair David Gibson @ 2025-02-25 17:43 ` Stefano Brivio 2025-02-26 0:27 ` David Gibson 2 siblings, 1 reply; 10+ messages in thread From: Stefano Brivio @ 2025-02-25 17:43 UTC (permalink / raw) To: David Gibson; +Cc: passt-dev On Tue, 25 Feb 2025 16:51:30 +1100 David Gibson <david@gibson.dropbear.id.au> wrote: > From Red Hat internal testing we've had some reports that if > attempting to migrate without passt-repair, the failure mode is uglier > than we'd like. The migration fails, which is somewhat expected, but > we don't correctly roll things back on the source, so it breaks > network there as well. > > Handle this more gracefully allowing the migration to proceed in this > case, but allow TCP connections to break > > I've now tested this reasonably: > * I get a clean migration if there are now active flows > * Migration completes, although connections are broken if > passt-repair isn't connected > * Basic test suite (minus perf) > > I didn't manage to test with libvirt yet, but I'm pretty convinced the > behaviour should be better than it was. I did, and it is. The series looks good to me and I would apply it as it is, but I'm waiting a bit longer in case you want to try out some variations based on my tests as well. Here's what I did. L0 is Debian testing, L1 are two similar (virt-clone'd) instances of RHEL 9.5 (with passt-0^20250217.ga1e48a0-1.el9.x86_64 or local build with this series, qemu-kvm-9.1.0-14.el9.x86_64, libvirt-10.10.0-7.el9.x86_64), and L2 is Alpine 3.21-ish. The two L1 instances (hosting the source and target guests), of course, don't need to be run under libvirt, but they do in my case. They are connected by passt, so that they share the same address internally, but I'm forwarding different SSH ports to them. Relevant libvirt XML snippets for L1 instances: <interface type='user'> <mac address='52:54:00:8a:9e:c2'/> <portForward proto='tcp'> <range start='1295' to='22'/> </portForward> <model type='virtio'/> <backend type='passt'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> </interface> and: <interface type='user'> <mac address='52:54:00:b8:99:8c'/> <portForward proto='tcp'> <range start='11951' to='22'/> </portForward> <model type='virtio'/> <backend type='passt'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> </interface> ...I didn't switch those to vhost-user mode yet. 
I prepared the L2 guest on L1 with: $ wget https://dl-cdn.alpinelinux.org/alpine/v3.21/releases/cloud/nocloud_alpine-3.21.2-x86_64-bios-tiny-r0.qcow2 $ virt-customize -a nocloud_alpine-3.21.2-x86_64-bios-tiny-r0.qcow2 --root-password password:root $ virt-install -d --name alpine --memory 1024 --noreboot --osinfo alpinelinux3.20 --network backend.type=passt,portForward0.proto=tcp,portForward0.range0.start=40922,portForward0.range0.to=2222 --import --disk nocloud_alpine-3.21.2-x86_64-bios-tiny-r0.qcow2 And made sure I can connect via SSH to the second (target node) L1 with: $ ssh-copy-id -f -p 11951 $GATEWAY There are some known SELinux issues at this point that I'm still working on (similar for AppArmor), so I *temporarily* set it to permissive mode with 'setenforce 0', on L1. Some were not known, though, and it's taking me longer than expected. Now I can start passt-repair (or not) on the source L1 (node): # passt-repair /run/user/1001/libvirt/qemu/run/passt/8-alpine-net0.socket.repair and open a TCP connection in the source L2 guest ('virsh console alpine', then login as root/root): # apk add inetutils-telnet # telnet passt.top 80 and finally ask libvirt to migrate the guest. Note that I need "--unsafe" because I didn't care about migrating storage (it's good enough to have the guest memory for this test). Without this series, migration fails on the source: $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session Migration: [97.59 %]error: End of file while reading data: : Input/output error ...despite --verbose the error doesn't tell much (perhaps I need LIBVIRT_DEBUG=1 instead?), but passt terminates at this point. With this series (I just used 'make install' from the local build), migration succeeds instead: $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session Migration: [100.00 %] Now, on the target, I still have to figure out how to tell libvirt to start QEMU and prepare for the migration (equivalent of '-incoming' as we use in our tests), instead of just starting a new instance like it does. Otherwise, I have no chance to start passt-repair there. Perhaps it has something to do with persistent mode described here: https://libvirt.org/migration.html#configuration-file-handling and --listen-address, but I'm not quite sure yet. That is, I could only test different failures (early one on source, or later one on target) with this, not a complete successful migration. > There are more fragile cases that I'm looking to fix, particularly the > die()s in flow_migrate_source_rollback() and elsewhere, however I ran > into various complications that I didn't manage to sort out today. > I'll continue looking at those tomorrow. I'm now pretty confident > that those additional fixes won't entirely supersede the changes in > this series, so it should be fine to apply these on their own. By the way, I think the somewhat less fragile/more obvious case where we fail clumsily is when the target doesn't have the same address as the source (among other possible addresses). 
In that case, we fail (and terminate) with a rather awkward: 93.7217: ERROR: Failed to bind socket for migrated flow: Cannot assign requested address 93.7218: ERROR: Flow 0 (TCP connection): Can't set up socket: (null), drop 93.7331: ERROR: Selecting TCP_SEND_QUEUE, socket 1: Socket operation on non-socket 93.7333: ERROR: Unexpected reply from TCP_REPAIR helper: -100 that's because, oops, I only took care of socket() failures in tcp_flow_repair_socket(), but not bind() failures (!). Sorry. Once that's fixed, flow_migrate_target() should also take care of decreasing 'count' accordingly. I just had a glimpse but didn't really try to sketch a fix. -- Stefano ^ permalink raw reply [flat|nested] 10+ messages in thread
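For reference, a minimal sketch of the missing piece — not the actual
tcp_flow_repair_socket(), just the general shape with illustrative names —
where a failed bind() is turned into a negative errno, so that callers get a
meaningful error string instead of an out-of-range value:

#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sketch only: create and bind the socket for a migrated flow, returning
 * a negative errno on failure instead of bind()'s raw return value.
 */
static int repair_sock_sketch(sa_family_t af, const struct sockaddr *addr,
			      socklen_t len)
{
	int s = socket(af, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP);

	if (s < 0)
		return -errno;

	if (bind(s, addr, len) < 0) {
		int ret = -errno;	/* save errno before close() clobbers it */

		close(s);
		return ret;
	}

	return s;
}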
* Re: [PATCH v2 0/2] More graceful handling of migration without passt-repair 2025-02-25 17:43 ` [PATCH v2 0/2] More graceful handling of migration " Stefano Brivio @ 2025-02-26 0:27 ` David Gibson 2025-02-26 8:09 ` Stefano Brivio 0 siblings, 1 reply; 10+ messages in thread From: David Gibson @ 2025-02-26 0:27 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 5730 bytes --] On Tue, Feb 25, 2025 at 06:43:16PM +0100, Stefano Brivio wrote: > On Tue, 25 Feb 2025 16:51:30 +1100 > David Gibson <david@gibson.dropbear.id.au> wrote: > > > From Red Hat internal testing we've had some reports that if > > attempting to migrate without passt-repair, the failure mode is uglier > > than we'd like. The migration fails, which is somewhat expected, but > > we don't correctly roll things back on the source, so it breaks > > network there as well. > > > > Handle this more gracefully allowing the migration to proceed in this > > case, but allow TCP connections to break > > > > I've now tested this reasonably: > > * I get a clean migration if there are now active flows > > * Migration completes, although connections are broken if > > passt-repair isn't connected > > * Basic test suite (minus perf) > > > > I didn't manage to test with libvirt yet, but I'm pretty convinced the > > behaviour should be better than it was. > > I did, and it is. The series looks good to me and I would apply it as > it is, but I'm waiting a bit longer in case you want to try out some > variations based on my tests as well. Here's what I did. [snip] Thanks for the detailed instructions. More complex than I might have liked, but oh well. > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > Migration: [97.59 %]error: End of file while reading data: : Input/output error > > ...despite --verbose the error doesn't tell much (perhaps I need > LIBVIRT_DEBUG=1 instead?), but passt terminates at this point. With > this series (I just used 'make install' from the local build), migration > succeeds instead: > > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > Migration: [100.00 %] > > Now, on the target, I still have to figure out how to tell libvirt > to start QEMU and prepare for the migration (equivalent of '-incoming' > as we use in our tests), instead of just starting a new instance like > it does. Otherwise, I have no chance to start passt-repair there. > Perhaps it has something to do with persistent mode described here: Ah. So I'm pretty sure virsh migrate will automatically start qemu with --incoming on the target. IIUC the problem here is more about timing: we want it to start it early, so that we have a chance to start passt-repair and let it connect before the migration actually happens. Crud... I didn't think of this before. I don't know that there's any sensible way to do this without having libvirt managing passt-repair as well. I mean it's not impossible there's some option to do this, but I doubt there's been any reason before for something outside of libvirt to control the timing of the target qemu's creation. I think we need to ask libvirt people about this. > https://libvirt.org/migration.html#configuration-file-handling Yeah.. I don't think this is relevant. > and --listen-address, but I'm not quite sure yet. > > That is, I could only test different failures (early one on source, or > later one on target) with this, not a complete successful migration. 
> > > There are more fragile cases that I'm looking to fix, particularly the > > die()s in flow_migrate_source_rollback() and elsewhere, however I ran > > into various complications that I didn't manage to sort out today. > > I'll continue looking at those tomorrow. I'm now pretty confident > > that those additional fixes won't entirely supersede the changes in > > this series, so it should be fine to apply these on their own. > > By the way, I think the somewhat less fragile/more obvious case where > we fail clumsily is when the target doesn't have the same address as > the source (among other possible addresses). In that case, we fail (and > terminate) with a rather awkward: Ah, yes, that is a higher priority fragile case. > 93.7217: ERROR: Failed to bind socket for migrated flow: Cannot assign requested address > 93.7218: ERROR: Flow 0 (TCP connection): Can't set up socket: (null), drop > 93.7331: ERROR: Selecting TCP_SEND_QUEUE, socket 1: Socket operation on non-socket > 93.7333: ERROR: Unexpected reply from TCP_REPAIR helper: -100 > > that's because, oops, I only took care of socket() failures in > tcp_flow_repair_socket(), but not bind() failures (!). Sorry. No, you check for errors on both. The problem is that in tcp_flow_migrate_target() we cancel the flow allocation and carry on - but the source will still send information for this flow, putting us out of sync with the stream. > Once that's fixed, flow_migrate_target() should also take care of > decreasing 'count' accordingly. I just had a glimpse but didn't > really try to sketch a fix. Adjusting count won't do the job. Instead we'd need to keep the flow around, but marked as "dead" somehow, so that we read but discard the incoming information for it. The MIGRATING state I added in one of my drafts was supposed to help with this sort of thing. But that's quite a complex change. Hrm... at least in the near term, I think it might actually be easier to set IP_FREEBIND when we create sockets for in-migrating flows. That way we can process them normally, they just won't do much without the address set. It has the additional advantage that it should work if the higher layers only move the IP just after the migration, instead of in advance. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread
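As a rough illustration of the IP_FREEBIND idea above — a minimal sketch, not
the actual passt code — the socket created for an in-migrating flow would just
get the option set before it is bound:

#include <netinet/in.h>
#include <sys/socket.h>

/* Allow binding to an address the host doesn't (yet) have configured.
 * For IPv6 sockets, IPV6_FREEBIND (Linux >= 4.15) is the equivalent.
 */
static int sock_set_freebind(int s)
{
	int one = 1;

	return setsockopt(s, IPPROTO_IP, IP_FREEBIND, &one, sizeof(one));
}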
* Re: [PATCH v2 0/2] More graceful handling of migration without passt-repair 2025-02-26 0:27 ` David Gibson @ 2025-02-26 8:09 ` Stefano Brivio 2025-02-26 8:51 ` David Gibson 0 siblings, 1 reply; 10+ messages in thread From: Stefano Brivio @ 2025-02-26 8:09 UTC (permalink / raw) To: David Gibson; +Cc: passt-dev On Wed, 26 Feb 2025 11:27:32 +1100 David Gibson <david@gibson.dropbear.id.au> wrote: > On Tue, Feb 25, 2025 at 06:43:16PM +0100, Stefano Brivio wrote: > > On Tue, 25 Feb 2025 16:51:30 +1100 > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > From Red Hat internal testing we've had some reports that if > > > attempting to migrate without passt-repair, the failure mode is uglier > > > than we'd like. The migration fails, which is somewhat expected, but > > > we don't correctly roll things back on the source, so it breaks > > > network there as well. > > > > > > Handle this more gracefully allowing the migration to proceed in this > > > case, but allow TCP connections to break > > > > > > I've now tested this reasonably: > > > * I get a clean migration if there are now active flows > > > * Migration completes, although connections are broken if > > > passt-repair isn't connected > > > * Basic test suite (minus perf) > > > > > > I didn't manage to test with libvirt yet, but I'm pretty convinced the > > > behaviour should be better than it was. > > > > I did, and it is. The series looks good to me and I would apply it as > > it is, but I'm waiting a bit longer in case you want to try out some > > variations based on my tests as well. Here's what I did. > > [snip] > > Thanks for the detailed instructions. More complex than I might have > liked, but oh well. > > > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > > Migration: [97.59 %]error: End of file while reading data: : Input/output error > > > > ...despite --verbose the error doesn't tell much (perhaps I need > > LIBVIRT_DEBUG=1 instead?), but passt terminates at this point. With > > this series (I just used 'make install' from the local build), migration > > succeeds instead: > > > > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > > Migration: [100.00 %] > > > > Now, on the target, I still have to figure out how to tell libvirt > > to start QEMU and prepare for the migration (equivalent of '-incoming' > > as we use in our tests), instead of just starting a new instance like > > it does. Otherwise, I have no chance to start passt-repair there. > > Perhaps it has something to do with persistent mode described here: > > Ah. So I'm pretty sure virsh migrate will automatically start qemu > with --incoming on the target. ("-incoming"), yes, see src/qemu/qemu_migration.c, qemuMigrationDstPrepare(). > IIUC the problem here is more about > timing: we want it to start it early, so that we have a chance to > start passt-repair and let it connect before the migration actually > happens. For the timing itself, we could actually wait for passt-repair to be there, with a timeout (say, 100ms). We could also modify passt-repair to set up an inotify watcher if the socket isn't there yet. > Crud... I didn't think of this before. I don't know that there's any > sensible way to do this without having libvirt managing passt-repair > as well. But we can't really use it as we're assuming that passt-repair will run with capabilities virtqemud doesn't want/need. 
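To make the inotify idea above concrete — a minimal sketch of what
passt-repair could do if the socket isn't there yet; the function name is made
up and error handling is rudimentary:

#include <sys/inotify.h>
#include <limits.h>
#include <poll.h>
#include <unistd.h>

/* Watch the directory for the repair socket to appear, waiting at most
 * timeout_ms.  The caller should re-check for the socket right after the
 * watch is added, to close the race with a creation that just happened.
 */
static int wait_for_repair_socket(const char *dir, int timeout_ms)
{
	char buf[sizeof(struct inotify_event) + NAME_MAX + 1];
	struct pollfd pfd = { .events = POLLIN };
	int ifd = inotify_init1(IN_CLOEXEC);

	if (ifd < 0)
		return -1;

	if (inotify_add_watch(ifd, dir, IN_CREATE) < 0) {
		close(ifd);
		return -1;
	}
	pfd.fd = ifd;

	if (poll(&pfd, 1, timeout_ms) <= 0) {
		close(ifd);
		return -1;		/* timed out, or poll() failed */
	}

	(void)read(ifd, buf, sizeof(buf));	/* drain one event batch */
	close(ifd);
	return 0;	/* something was created; caller checks the name */
}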
> I mean it's not impossible there's some option to do this, > but I doubt there's been any reason before for something outside of > libvirt to control the timing of the target qemu's creation. I think > we need to ask libvirt people about this. I'm looking into it (and perhaps virtiofsd had similar troubles?). > > https://libvirt.org/migration.html#configuration-file-handling > > Yeah.. I don't think this is relevant. > > > and --listen-address, but I'm not quite sure yet. > > > > That is, I could only test different failures (early one on source, or > > later one on target) with this, not a complete successful migration. > > > > > There are more fragile cases that I'm looking to fix, particularly the > > > die()s in flow_migrate_source_rollback() and elsewhere, however I ran > > > into various complications that I didn't manage to sort out today. > > > I'll continue looking at those tomorrow. I'm now pretty confident > > > that those additional fixes won't entirely supersede the changes in > > > this series, so it should be fine to apply these on their own. > > > > By the way, I think the somewhat less fragile/more obvious case where > > we fail clumsily is when the target doesn't have the same address as > > the source (among other possible addresses). In that case, we fail (and > > terminate) with a rather awkward: > > Ah, yes, that is a higher priority fragile case. > > > 93.7217: ERROR: Failed to bind socket for migrated flow: Cannot assign requested address > > 93.7218: ERROR: Flow 0 (TCP connection): Can't set up socket: (null), drop > > 93.7331: ERROR: Selecting TCP_SEND_QUEUE, socket 1: Socket operation on non-socket > > 93.7333: ERROR: Unexpected reply from TCP_REPAIR helper: -100 > > > > that's because, oops, I only took care of socket() failures in > > tcp_flow_repair_socket(), but not bind() failures (!). Sorry. > > No, you check for errors on both. Well, "check", yes, but I'm not even setting an error code. I haven't tried your 3/3 yet but look at "(null)" resulting from: flow_err(flow, "Can't set up socket: %s, drop", strerror_(rc)); ...rc is 0. > The problem is that in > tcp_flow_migrate_target() we cancel the flow allocation and carry on - > but the source will still send information for this flow, putting us > out of sync with the stream. That, too, yes. > > Once that's fixed, flow_migrate_target() should also take care of > > decreasing 'count' accordingly. I just had a glimpse but didn't > > really try to sketch a fix. > > Adjusting count won't do the job. Instead we'd need to keep the flow > around, but marked as "dead" somehow, so that we read but discard the > incoming information for it. The MIGRATING state I added in one of my > drafts was supposed to help with this sort of thing. But that's quite > a complex change. I think it's great that you could (practically) solve it with three lines... > Hrm... at least in the near term, I think it might actually be easier > to set IP_FREEBIND when we create sockets for in-migrating flows. > That way we can process them normally, they just won't do much without > the address set. It has the additional advantage that it should work > if the higher layers only move the IP just after the migration, > instead of in advance. Perhaps we want it anyway, but I wonder: what happens if we turn repair mode off and we bound to a non-local address? I suppose we won't send out anything, but I'm not sure. If we send out the first keep-alive segment with a wrong address, we probably ruined the connection. 
Once I find a solution for the target libvirt/passt-repair thing (and the remaining SELinux issues), I'll try to have a look at this too. I haven't tried yet a migration with a mismatching address on the target and passt-repair available. -- Stefano ^ permalink raw reply [flat|nested] 10+ messages in thread
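On the passt side, the "wait with a timeout" variant mentioned above could be
as small as this sketch — assuming the repair socket is the usual listening
UNIX-domain socket that passt-repair connects to; the function name and the
idea of calling it at the start of migration are assumptions, not the actual
implementation:

#include <poll.h>
#include <sys/socket.h>

/* Give a late-starting passt-repair a short grace period (e.g. 100ms)
 * to connect, then either accept it or proceed without repair mode.
 */
static int repair_wait_helper(int listen_fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = listen_fd, .events = POLLIN };

	if (poll(&pfd, 1, timeout_ms) <= 0)
		return -1;	/* no helper: migrate, but TCP flows will break */

	return accept(listen_fd, NULL, NULL);
}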
* Re: [PATCH v2 0/2] More graceful handling of migration without passt-repair 2025-02-26 8:09 ` Stefano Brivio @ 2025-02-26 8:51 ` David Gibson 2025-02-26 11:24 ` Stefano Brivio 0 siblings, 1 reply; 10+ messages in thread From: David Gibson @ 2025-02-26 8:51 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 10868 bytes --] On Wed, Feb 26, 2025 at 09:09:48AM +0100, Stefano Brivio wrote: > On Wed, 26 Feb 2025 11:27:32 +1100 > David Gibson <david@gibson.dropbear.id.au> wrote: > > > On Tue, Feb 25, 2025 at 06:43:16PM +0100, Stefano Brivio wrote: > > > On Tue, 25 Feb 2025 16:51:30 +1100 > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > From Red Hat internal testing we've had some reports that if > > > > attempting to migrate without passt-repair, the failure mode is uglier > > > > than we'd like. The migration fails, which is somewhat expected, but > > > > we don't correctly roll things back on the source, so it breaks > > > > network there as well. > > > > > > > > Handle this more gracefully allowing the migration to proceed in this > > > > case, but allow TCP connections to break > > > > > > > > I've now tested this reasonably: > > > > * I get a clean migration if there are now active flows > > > > * Migration completes, although connections are broken if > > > > passt-repair isn't connected > > > > * Basic test suite (minus perf) > > > > > > > > I didn't manage to test with libvirt yet, but I'm pretty convinced the > > > > behaviour should be better than it was. > > > > > > I did, and it is. The series looks good to me and I would apply it as > > > it is, but I'm waiting a bit longer in case you want to try out some > > > variations based on my tests as well. Here's what I did. > > > > [snip] > > > > Thanks for the detailed instructions. More complex than I might have > > liked, but oh well. > > > > > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > > > Migration: [97.59 %]error: End of file while reading data: : Input/output error > > > > > > ...despite --verbose the error doesn't tell much (perhaps I need > > > LIBVIRT_DEBUG=1 instead?), but passt terminates at this point. With > > > this series (I just used 'make install' from the local build), migration > > > succeeds instead: > > > > > > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > > > Migration: [100.00 %] > > > > > > Now, on the target, I still have to figure out how to tell libvirt > > > to start QEMU and prepare for the migration (equivalent of '-incoming' > > > as we use in our tests), instead of just starting a new instance like > > > it does. Otherwise, I have no chance to start passt-repair there. > > > Perhaps it has something to do with persistent mode described here: > > > > Ah. So I'm pretty sure virsh migrate will automatically start qemu > > with --incoming on the target. > > ("-incoming"), yes, see src/qemu/qemu_migration.c, > qemuMigrationDstPrepare(). > > > IIUC the problem here is more about > > timing: we want it to start it early, so that we have a chance to > > start passt-repair and let it connect before the migration actually > > happens. > > For the timing itself, we could actually wait for passt-repair to be > there, with a timeout (say, 100ms). I guess. 
That still requires some way for KubeVirt (or whatever) to know at least roughly when it needs to launch passt-repair, and I'm not sure if that's something we can currently get from libvirt. > We could also modify passt-repair to set up an inotify watcher if the > socket isn't there yet. Maybe, yes. This kind of breaks our "passt starts first, passt-repair connects to it" model though, and I wonder if we need to revisit the security implications of that. > > Crud... I didn't think of this before. I don't know that there's any > > sensible way to do this without having libvirt managing passt-repair > > as well. > > But we can't really use it as we're assuming that passt-repair will run > with capabilities virtqemud doesn't want/need. Oh. True. > > I mean it's not impossible there's some option to do this, > > but I doubt there's been any reason before for something outside of > > libvirt to control the timing of the target qemu's creation. I think > > we need to ask libvirt people about this. > > I'm looking into it (and perhaps virtiofsd had similar troubles?). I'm guessing libvirt already knows how to start virtiofsd - just as it already knows how to start passt, just not passt-repair. > > > https://libvirt.org/migration.html#configuration-file-handling > > > > Yeah.. I don't think this is relevant. > > > > > and --listen-address, but I'm not quite sure yet. > > > > > > That is, I could only test different failures (early one on source, or > > > later one on target) with this, not a complete successful migration. > > > > > > > There are more fragile cases that I'm looking to fix, particularly the > > > > die()s in flow_migrate_source_rollback() and elsewhere, however I ran > > > > into various complications that I didn't manage to sort out today. > > > > I'll continue looking at those tomorrow. I'm now pretty confident > > > > that those additional fixes won't entirely supersede the changes in > > > > this series, so it should be fine to apply these on their own. > > > > > > By the way, I think the somewhat less fragile/more obvious case where > > > we fail clumsily is when the target doesn't have the same address as > > > the source (among other possible addresses). In that case, we fail (and > > > terminate) with a rather awkward: > > > > Ah, yes, that is a higher priority fragile case. > > > > > 93.7217: ERROR: Failed to bind socket for migrated flow: Cannot assign requested address > > > 93.7218: ERROR: Flow 0 (TCP connection): Can't set up socket: (null), drop > > > 93.7331: ERROR: Selecting TCP_SEND_QUEUE, socket 1: Socket operation on non-socket > > > 93.7333: ERROR: Unexpected reply from TCP_REPAIR helper: -100 > > > > > > that's because, oops, I only took care of socket() failures in > > > tcp_flow_repair_socket(), but not bind() failures (!). Sorry. > > > > No, you check for errors on both. > > Well, "check", yes, but I'm not even setting an error code. I haven't > tried your 3/3 yet but look at "(null)" resulting from: > > flow_err(flow, "Can't set up socket: %s, drop", strerror_(rc)); > > ...rc is 0. -1, not 0, otherwise we wouldn't enter that if clause at all. But, still, out of bounds for strerror(). I did spot that bug - tcp_flow_repair_socket() is directly passing on the return code from bind(), whereas it should be returning -errno. So, two bugs actually: 1) in the existing code we should return -errno not rc if bind() fails, 2) in my 3/3 it should be calling strerror() on -rc, not rc. 
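As a tiny, self-contained sketch of fix 2 (using plain strerror() here rather
than passt's strerror_() wrapper): once the helper returns -errno, the value
the caller has to print is -rc, not rc:

#include <stdio.h>
#include <string.h>
#include <errno.h>

int main(void)
{
	int rc = -EADDRNOTAVAIL;	/* what the helper returns after fix 1 */

	/* Fix 2: negate rc to get a positive errno value for strerror() */
	printf("Can't set up socket: %s, drop\n", strerror(-rc));
	return 0;
}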
> > The problem is that in > > tcp_flow_migrate_target() we cancel the flow allocation and carry on - > > but the source will still send information for this flow, putting us > > out of sync with the stream. > > That, too, yes. > > > > Once that's fixed, flow_migrate_target() should also take care of > > > decreasing 'count' accordingly. I just had a glimpse but didn't > > > really try to sketch a fix. > > > > Adjusting count won't do the job. Instead we'd need to keep the flow > > around, but marked as "dead" somehow, so that we read but discard the > > incoming information for it. The MIGRATING state I added in one of my > > drafts was supposed to help with this sort of thing. But that's quite > > a complex change. > > I think it's great that you could (practically) solve it with three > lines... Yeah, I sent that email at the beginning of my day, by the end I'd come up with the simpler approach. > > Hrm... at least in the near term, I think it might actually be easier > > to set IP_FREEBIND when we create sockets for in-migrating flows. > > That way we can process them normally, they just won't do much without > > the address set. It has the additional advantage that it should work > > if the higher layers only move the IP just after the migration, > > instead of in advance. > > Perhaps we want it anyway, but I wonder: Right, I'm no longer considering this as a short term solution, since checking for fd < 0 I think works better for the immediate problems. > what happens if we turn repair > mode off and we bound to a non-local address? I suppose we won't send > out anything, but I'm not sure. If we send out the first keep-alive > segment with a wrong address, we probably ruined the connection. That's a good point. More specifically, I think IP_FREEBIND is generally used for listen()ing sockets, I'm guessing you'll get an error if you try to connect() a socket that's bound to a non-local address. It's possible TCP_REPAIR would defer that until repair mode is switched off, which wouldn't make a lot of difference to us. It's also possible there could be bug in repair mode that would let you construct a non-locally bound, connected socket that way. I'm not entirely sure what the consequences would be. I guess that might already be possible in a different way: what happens if you have a connect()ed socket, then the admin removes the address to which it is bound? > Once I find a solution for the target libvirt/passt-repair thing (and > the remaining SELinux issues), I'll try to have a look at this too. I > haven't tried yet a migration with a mismatching address on the target > and passt-repair available. Right, I was trying to set up a test case for this today. I made some progress but didn't really get it working. I was using qemu directly with scripts to put the two ends into different net namespaces, rather than libvirt on separate L1 VMs. Working out how to get the two namespaces connected in a way I could do the migration, while still being separate enough was doing my head in a bit. In doing that, I also spotted another wrinkle. I don't think this is one we can reasonably fix - but we should be aware, since someone will probably try it at some point: migration is not going to work if the two hosts have their own connectivity provided by (separate instances of) passt or pasta (or slirp for that matter). 
The migrating VM can have its TCP stream reconstructed perfectly, so the right L2 packets come out of the host, but the host's own passt/pasta instance won't know about the flows and so will just drop/reject the packets. To make that work we'd basically have to migrate state for every "ancestor" passt/pasta until we hit a common namespace. That seems pretty infeasible to me, since the pieces that know about the migration probably don't own those layers of the network. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2 0/2] More graceful handling of migration without passt-repair 2025-02-26 8:51 ` David Gibson @ 2025-02-26 11:24 ` Stefano Brivio 2025-02-27 1:43 ` David Gibson 0 siblings, 1 reply; 10+ messages in thread From: Stefano Brivio @ 2025-02-26 11:24 UTC (permalink / raw) To: David Gibson; +Cc: passt-dev On Wed, 26 Feb 2025 19:51:11 +1100 David Gibson <david@gibson.dropbear.id.au> wrote: > On Wed, Feb 26, 2025 at 09:09:48AM +0100, Stefano Brivio wrote: > > On Wed, 26 Feb 2025 11:27:32 +1100 > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > On Tue, Feb 25, 2025 at 06:43:16PM +0100, Stefano Brivio wrote: > > > > On Tue, 25 Feb 2025 16:51:30 +1100 > > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > > > From Red Hat internal testing we've had some reports that if > > > > > attempting to migrate without passt-repair, the failure mode is uglier > > > > > than we'd like. The migration fails, which is somewhat expected, but > > > > > we don't correctly roll things back on the source, so it breaks > > > > > network there as well. > > > > > > > > > > Handle this more gracefully allowing the migration to proceed in this > > > > > case, but allow TCP connections to break > > > > > > > > > > I've now tested this reasonably: > > > > > * I get a clean migration if there are now active flows > > > > > * Migration completes, although connections are broken if > > > > > passt-repair isn't connected > > > > > * Basic test suite (minus perf) > > > > > > > > > > I didn't manage to test with libvirt yet, but I'm pretty convinced the > > > > > behaviour should be better than it was. > > > > > > > > I did, and it is. The series looks good to me and I would apply it as > > > > it is, but I'm waiting a bit longer in case you want to try out some > > > > variations based on my tests as well. Here's what I did. > > > > > > [snip] > > > > > > Thanks for the detailed instructions. More complex than I might have > > > liked, but oh well. > > > > > > > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > > > > Migration: [97.59 %]error: End of file while reading data: : Input/output error > > > > > > > > ...despite --verbose the error doesn't tell much (perhaps I need > > > > LIBVIRT_DEBUG=1 instead?), but passt terminates at this point. With > > > > this series (I just used 'make install' from the local build), migration > > > > succeeds instead: > > > > > > > > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > > > > Migration: [100.00 %] > > > > > > > > Now, on the target, I still have to figure out how to tell libvirt > > > > to start QEMU and prepare for the migration (equivalent of '-incoming' > > > > as we use in our tests), instead of just starting a new instance like > > > > it does. Otherwise, I have no chance to start passt-repair there. > > > > Perhaps it has something to do with persistent mode described here: > > > > > > Ah. So I'm pretty sure virsh migrate will automatically start qemu > > > with --incoming on the target. > > > > ("-incoming"), yes, see src/qemu/qemu_migration.c, > > qemuMigrationDstPrepare(). > > > > > IIUC the problem here is more about > > > timing: we want it to start it early, so that we have a chance to > > > start passt-repair and let it connect before the migration actually > > > happens. > > > > For the timing itself, we could actually wait for passt-repair to be > > there, with a timeout (say, 100ms). > > I guess. 
That still requires some way for KubeVirt (or whatever) to > know at least roughly when it needs to launch passt-repair, and I'm > not sure if that's something we can currently get from libvirt. KubeVirt sets up the target pod, and that's when it should be done (if we have an inotify mechanism or similar). I can't point to an exact code path yet but there's something like that. > > We could also modify passt-repair to set up an inotify watcher if the > > socket isn't there yet. > > Maybe, yes. This kind of breaks our "passt starts first, passt-repair > connects to it" model though, and I wonder if we need to revisit the > security implications of that. I don't think it actually breaks that model for security purposes, because the guest doesn't have anyway a chance to cause a connection to passt-repair. The guest is still suspended (or missing) at that point. > > > Crud... I didn't think of this before. I don't know that there's any > > > sensible way to do this without having libvirt managing passt-repair > > > as well. > > > > But we can't really use it as we're assuming that passt-repair will run > > with capabilities virtqemud doesn't want/need. > > Oh. True. > > > > I mean it's not impossible there's some option to do this, > > > but I doubt there's been any reason before for something outside of > > > libvirt to control the timing of the target qemu's creation. I think > > > we need to ask libvirt people about this. > > > > I'm looking into it (and perhaps virtiofsd had similar troubles?). > > I'm guessing libvirt already knows how to start virtiofsd - just as it > already knows how to start passt, just not passt-repair. > > > > > https://libvirt.org/migration.html#configuration-file-handling > > > > > > Yeah.. I don't think this is relevant. > > > > > > > and --listen-address, but I'm not quite sure yet. > > > > > > > > That is, I could only test different failures (early one on source, or > > > > later one on target) with this, not a complete successful migration. > > > > > > > > > There are more fragile cases that I'm looking to fix, particularly the > > > > > die()s in flow_migrate_source_rollback() and elsewhere, however I ran > > > > > into various complications that I didn't manage to sort out today. > > > > > I'll continue looking at those tomorrow. I'm now pretty confident > > > > > that those additional fixes won't entirely supersede the changes in > > > > > this series, so it should be fine to apply these on their own. > > > > > > > > By the way, I think the somewhat less fragile/more obvious case where > > > > we fail clumsily is when the target doesn't have the same address as > > > > the source (among other possible addresses). In that case, we fail (and > > > > terminate) with a rather awkward: > > > > > > Ah, yes, that is a higher priority fragile case. > > > > > > > 93.7217: ERROR: Failed to bind socket for migrated flow: Cannot assign requested address > > > > 93.7218: ERROR: Flow 0 (TCP connection): Can't set up socket: (null), drop > > > > 93.7331: ERROR: Selecting TCP_SEND_QUEUE, socket 1: Socket operation on non-socket > > > > 93.7333: ERROR: Unexpected reply from TCP_REPAIR helper: -100 > > > > > > > > that's because, oops, I only took care of socket() failures in > > > > tcp_flow_repair_socket(), but not bind() failures (!). Sorry. > > > > > > No, you check for errors on both. > > > > Well, "check", yes, but I'm not even setting an error code. 
I haven't > > tried your 3/3 yet but look at "(null)" resulting from: > > > > flow_err(flow, "Can't set up socket: %s, drop", strerror_(rc)); > > > > ...rc is 0. > > -1, not 0, otherwise we wouldn't enter that if clause at all. Ah, oops, right. > But, > still, out of bounds for strerror(). I did spot that bug - > tcp_flow_repair_socket() is directly passing on the return code from > bind(), whereas it should be returning -errno. > > So, two bugs actually: 1) in the existing code we should return -errno > not rc if bind() fails, 2) in my 3/3 it should be calling strerror() > on -rc, not rc. Right, yes, 1) is what I meant. > > > The problem is that in > > > tcp_flow_migrate_target() we cancel the flow allocation and carry on - > > > but the source will still send information for this flow, putting us > > > out of sync with the stream. > > > > That, too, yes. > > > > > > Once that's fixed, flow_migrate_target() should also take care of > > > > decreasing 'count' accordingly. I just had a glimpse but didn't > > > > really try to sketch a fix. > > > > > > Adjusting count won't do the job. Instead we'd need to keep the flow > > > around, but marked as "dead" somehow, so that we read but discard the > > > incoming information for it. The MIGRATING state I added in one of my > > > drafts was supposed to help with this sort of thing. But that's quite > > > a complex change. > > > > I think it's great that you could (practically) solve it with three > > lines... > > Yeah, I sent that email at the beginning of my day, by the end I'd > come up with the simpler approach. > > > > Hrm... at least in the near term, I think it might actually be easier > > > to set IP_FREEBIND when we create sockets for in-migrating flows. > > > That way we can process them normally, they just won't do much without > > > the address set. It has the additional advantage that it should work > > > if the higher layers only move the IP just after the migration, > > > instead of in advance. > > > > Perhaps we want it anyway, but I wonder: > > Right, I'm no longer considering this as a short term solution, since > checking for fd < 0 I think works better for the immediate problems. > > > what happens if we turn repair > > mode off and we bound to a non-local address? I suppose we won't send > > out anything, but I'm not sure. If we send out the first keep-alive > > segment with a wrong address, we probably ruined the connection. > > That's a good point. More specifically, I think IP_FREEBIND is > generally used for listen()ing sockets, I'm guessing you'll get an > error if you try to connect() a socket that's bound to a non-local > address. It's possible TCP_REPAIR would defer that until repair mode > is switched off, which wouldn't make a lot of difference to us. It's > also possible there could be bug in repair mode that would let you > construct a non-locally bound, connected socket that way. I'm not > entirely sure what the consequences would be. I guess that might > already be possible in a different way: what happens if you have a > connect()ed socket, then the admin removes the address to which it is > bound? I'm fairly sure that the outcome of testing this will surprise us in a way or another, so probably we should start from testing it. We could even think of deferring switching repair mode off until the right address is there, by the way. That would make a difference to us. > > Once I find a solution for the target libvirt/passt-repair thing (and > > the remaining SELinux issues), I'll try to have a look at this too. 
I > > haven't tried yet a migration with a mismatching address on the target > > and passt-repair available. > > Right, I was trying to set up a test case for this today. I made some > progress but didn't really get it working. I was using qemu directly > with scripts to put the two ends into different net namespaces, rather > than libvirt on separate L1 VMs. Working out how to get the two > namespaces connected in a way I could do the migration, while still > being separate enough was doing my head in a bit. I didn't actually get to the point of having a truly working migration. I just considered a clean attempt at resuming an existing connection (with keep-alive and subsequent RST from underlying passt instance) as success. So, yes, of course, it would be great to have a full simulation in a compact form of what's going on with KubeVirt and OVN. Probably bridging the L1 VMs would be a quick solution. It needs root (at least for the setup), which means they become L2 (at least in my world): L1 would still be connected by passt, the two L2 instances are bridged, and source and target guests would be L3. > In doing that, I also spotted another wrinkle. I don't think this is > one we can reasonably fix - but we should be aware, since someone will > probably try it at some point: Yeah, I tried it, you might see remnants of that in the setup_migrate() stuff (I copied it from the "two_guests" test). I originally wanted to have two namespaces and two instances of pasta (just like "two_guests"), but soon realised the issue you describe below. > migration is not going to work if the > two hosts have their own connectivity provided by (separate instances > of) passt or pasta (or slirp for that matter). The migrating VM can > have its TCP stream reconstructed perfectly, so the right L2 packets > come out of the host, but the host's own passt/pasta instance won't > know about the flows and so will just drop/reject the packets. > > To make that work we'd basically have to migrate state for every > "ancestor" passt/pasta until we hit a common namespace. That seems > pretty infeasible to me, since the pieces that know about the > migration probably don't own those layers of the network. If there's any potential interest around it, we could abstract things a bit more than we did until now and have some kind of o... orch... coordination of data/migration flows, with some kind of external tool identifying and connecting to several instances of passt and moving flows between them. It sounds a bit like OVN, but with the notable difference that we don't need to implement a overlay network. -- Stefano ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v2 0/2] More graceful handling of migration without passt-repair 2025-02-26 11:24 ` Stefano Brivio @ 2025-02-27 1:43 ` David Gibson 2025-02-27 4:32 ` Stefano Brivio 0 siblings, 1 reply; 10+ messages in thread From: David Gibson @ 2025-02-27 1:43 UTC (permalink / raw) To: Stefano Brivio; +Cc: passt-dev [-- Attachment #1: Type: text/plain, Size: 15725 bytes --] On Wed, Feb 26, 2025 at 12:24:12PM +0100, Stefano Brivio wrote: > On Wed, 26 Feb 2025 19:51:11 +1100 > David Gibson <david@gibson.dropbear.id.au> wrote: > > > On Wed, Feb 26, 2025 at 09:09:48AM +0100, Stefano Brivio wrote: > > > On Wed, 26 Feb 2025 11:27:32 +1100 > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > On Tue, Feb 25, 2025 at 06:43:16PM +0100, Stefano Brivio wrote: > > > > > On Tue, 25 Feb 2025 16:51:30 +1100 > > > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > > > > > From Red Hat internal testing we've had some reports that if > > > > > > attempting to migrate without passt-repair, the failure mode is uglier > > > > > > than we'd like. The migration fails, which is somewhat expected, but > > > > > > we don't correctly roll things back on the source, so it breaks > > > > > > network there as well. > > > > > > > > > > > > Handle this more gracefully allowing the migration to proceed in this > > > > > > case, but allow TCP connections to break > > > > > > > > > > > > I've now tested this reasonably: > > > > > > * I get a clean migration if there are now active flows > > > > > > * Migration completes, although connections are broken if > > > > > > passt-repair isn't connected > > > > > > * Basic test suite (minus perf) > > > > > > > > > > > > I didn't manage to test with libvirt yet, but I'm pretty convinced the > > > > > > behaviour should be better than it was. > > > > > > > > > > I did, and it is. The series looks good to me and I would apply it as > > > > > it is, but I'm waiting a bit longer in case you want to try out some > > > > > variations based on my tests as well. Here's what I did. > > > > > > > > [snip] > > > > > > > > Thanks for the detailed instructions. More complex than I might have > > > > liked, but oh well. > > > > > > > > > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > > > > > Migration: [97.59 %]error: End of file while reading data: : Input/output error > > > > > > > > > > ...despite --verbose the error doesn't tell much (perhaps I need > > > > > LIBVIRT_DEBUG=1 instead?), but passt terminates at this point. With > > > > > this series (I just used 'make install' from the local build), migration > > > > > succeeds instead: > > > > > > > > > > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > > > > > Migration: [100.00 %] > > > > > > > > > > Now, on the target, I still have to figure out how to tell libvirt > > > > > to start QEMU and prepare for the migration (equivalent of '-incoming' > > > > > as we use in our tests), instead of just starting a new instance like > > > > > it does. Otherwise, I have no chance to start passt-repair there. > > > > > Perhaps it has something to do with persistent mode described here: > > > > > > > > Ah. So I'm pretty sure virsh migrate will automatically start qemu > > > > with --incoming on the target. > > > > > > ("-incoming"), yes, see src/qemu/qemu_migration.c, > > > qemuMigrationDstPrepare(). 
> > > > > > > IIUC the problem here is more about > > > > timing: we want it to start it early, so that we have a chance to > > > > start passt-repair and let it connect before the migration actually > > > > happens. > > > > > > For the timing itself, we could actually wait for passt-repair to be > > > there, with a timeout (say, 100ms). > > > > I guess. That still requires some way for KubeVirt (or whatever) to > > know at least roughly when it needs to launch passt-repair, and I'm > > not sure if that's something we can currently get from libvirt. > > KubeVirt sets up the target pod, and that's when it should be done (if > we have an inotify mechanism or similar). I can't point to an exact code > path yet but there's something like that. Right, but that approach does require inotify and starting passt-repair before passt, which might be fine, but I have the concern noted below. To avoid that we'd need notification after passt & qemu are started on the target, but before the migration is actually initiated which I don't think libvirt provides. > > > We could also modify passt-repair to set up an inotify watcher if the > > > socket isn't there yet. > > > > Maybe, yes. This kind of breaks our "passt starts first, passt-repair > > connects to it" model though, and I wonder if we need to revisit the > > security implications of that. > > I don't think it actually breaks that model for security purposes, > because the guest doesn't have anyway a chance to cause a connection to > passt-repair. The guest is still suspended (or missing) at that point. I wasn't thinking of threat models coming from the guest, but an attack from an unrelated process impersonating passt in order to access passt-repair's superpowers. > > > > Crud... I didn't think of this before. I don't know that there's any > > > > sensible way to do this without having libvirt managing passt-repair > > > > as well. > > > > > > But we can't really use it as we're assuming that passt-repair will run > > > with capabilities virtqemud doesn't want/need. > > > > Oh. True. > > > > > > I mean it's not impossible there's some option to do this, > > > > but I doubt there's been any reason before for something outside of > > > > libvirt to control the timing of the target qemu's creation. I think > > > > we need to ask libvirt people about this. > > > > > > I'm looking into it (and perhaps virtiofsd had similar troubles?). > > > > I'm guessing libvirt already knows how to start virtiofsd - just as it > > already knows how to start passt, just not passt-repair. > > > > > > > https://libvirt.org/migration.html#configuration-file-handling > > > > > > > > Yeah.. I don't think this is relevant. > > > > > > > > > and --listen-address, but I'm not quite sure yet. > > > > > > > > > > That is, I could only test different failures (early one on source, or > > > > > later one on target) with this, not a complete successful migration. > > > > > > > > > > > There are more fragile cases that I'm looking to fix, particularly the > > > > > > die()s in flow_migrate_source_rollback() and elsewhere, however I ran > > > > > > into various complications that I didn't manage to sort out today. > > > > > > I'll continue looking at those tomorrow. I'm now pretty confident > > > > > > that those additional fixes won't entirely supersede the changes in > > > > > > this series, so it should be fine to apply these on their own. 
> > > > > > > > > > By the way, I think the somewhat less fragile/more obvious case where > > > > > we fail clumsily is when the target doesn't have the same address as > > > > > the source (among other possible addresses). In that case, we fail (and > > > > > terminate) with a rather awkward: > > > > > > > > Ah, yes, that is a higher priority fragile case. > > > > > > > > > 93.7217: ERROR: Failed to bind socket for migrated flow: Cannot assign requested address > > > > > 93.7218: ERROR: Flow 0 (TCP connection): Can't set up socket: (null), drop > > > > > 93.7331: ERROR: Selecting TCP_SEND_QUEUE, socket 1: Socket operation on non-socket > > > > > 93.7333: ERROR: Unexpected reply from TCP_REPAIR helper: -100 > > > > > > > > > > that's because, oops, I only took care of socket() failures in > > > > > tcp_flow_repair_socket(), but not bind() failures (!). Sorry. > > > > > > > > No, you check for errors on both. > > > > > > Well, "check", yes, but I'm not even setting an error code. I haven't > > > tried your 3/3 yet but look at "(null)" resulting from: > > > > > > flow_err(flow, "Can't set up socket: %s, drop", strerror_(rc)); > > > > > > ...rc is 0. > > > > -1, not 0, otherwise we wouldn't enter that if clause at all. > > Ah, oops, right. > > > But, > > still, out of bounds for strerror(). I did spot that bug - > > tcp_flow_repair_socket() is directly passing on the return code from > > bind(), whereas it should be returning -errno. > > > > So, two bugs actually: 1) in the existing code we should return -errno > > not rc if bind() fails, 2) in my 3/3 it should be calling strerror() > > on -rc, not rc. > > Right, yes, 1) is what I meant. I'll fix both of these for the next spin. > > > > The problem is that in > > > > tcp_flow_migrate_target() we cancel the flow allocation and carry on - > > > > but the source will still send information for this flow, putting us > > > > out of sync with the stream. > > > > > > That, too, yes. > > > > > > > > Once that's fixed, flow_migrate_target() should also take care of > > > > > decreasing 'count' accordingly. I just had a glimpse but didn't > > > > > really try to sketch a fix. > > > > > > > > Adjusting count won't do the job. Instead we'd need to keep the flow > > > > around, but marked as "dead" somehow, so that we read but discard the > > > > incoming information for it. The MIGRATING state I added in one of my > > > > drafts was supposed to help with this sort of thing. But that's quite > > > > a complex change. > > > > > > I think it's great that you could (practically) solve it with three > > > lines... > > > > Yeah, I sent that email at the beginning of my day, by the end I'd > > come up with the simpler approach. > > > > > > Hrm... at least in the near term, I think it might actually be easier > > > > to set IP_FREEBIND when we create sockets for in-migrating flows. > > > > That way we can process them normally, they just won't do much without > > > > the address set. It has the additional advantage that it should work > > > > if the higher layers only move the IP just after the migration, > > > > instead of in advance. > > > > > > Perhaps we want it anyway, but I wonder: > > > > Right, I'm no longer considering this as a short term solution, since > > checking for fd < 0 I think works better for the immediate problems. > > > > > what happens if we turn repair > > > mode off and we bound to a non-local address? I suppose we won't send > > > out anything, but I'm not sure. 
If we send out the first keep-alive > > > segment with a wrong address, we probably ruined the connection. > > > > That's a good point. More specifically, I think IP_FREEBIND is > > generally used for listen()ing sockets, I'm guessing you'll get an > > error if you try to connect() a socket that's bound to a non-local > > address. It's possible TCP_REPAIR would defer that until repair mode > > is switched off, which wouldn't make a lot of difference to us. It's > > also possible there could be bug in repair mode that would let you > > construct a non-locally bound, connected socket that way. I'm not > > entirely sure what the consequences would be. I guess that might > > already be possible in a different way: what happens if you have a > > connect()ed socket, then the admin removes the address to which it is > > bound? > > I'm fairly sure that the outcome of testing this will surprise us in a > way or another, so probably we should start from testing it. Yeah, I tend to agree. > We could even think of deferring switching repair mode off until the > right address is there, by the way. That would make a difference to > us. Do you mean by blocking? Or by returning to normal operation with the flow flagged somehow to be "woken up" by a netlink monitor? > > > Once I find a solution for the target libvirt/passt-repair thing (and > > > the remaining SELinux issues), I'll try to have a look at this too. I > > > haven't tried yet a migration with a mismatching address on the target > > > and passt-repair available. > > > > Right, I was trying to set up a test case for this today. I made some > > progress but didn't really get it working. I was using qemu directly > > with scripts to put the two ends into different net namespaces, rather > > than libvirt on separate L1 VMs. Working out how to get the two > > namespaces connected in a way I could do the migration, while still > > being separate enough was doing my head in a bit. > > I didn't actually get to the point of having a truly working migration. > I just considered a clean attempt at resuming an existing connection > (with keep-alive and subsequent RST from underlying passt instance) as > success. > > So, yes, of course, it would be great to have a full simulation in a > compact form of what's going on with KubeVirt and OVN. > > Probably bridging the L1 VMs would be a quick solution. It needs root Well.. for some value of quick. That's basically what I've been working towards, except using namespaces for the L1s instead of VMs. Figuring out exactly how to set up the bridge so that we have addressing that allows the migration between the qemus to take place, and also has the properties we need for passt to work and test the things we want is kind of fiddly. > (at least for the setup), which means they become L2 (at least in my > world): L1 would still be connected by passt, the two L2 instances are > bridged, and source and target guests would be L3. For the tests here we don't even necessarily need the L1 connected to the outside world: we can put the peer for the migrated connection(s) inside the L1 > > In doing that, I also spotted another wrinkle. I don't think this is > > one we can reasonably fix - but we should be aware, since someone will > > probably try it at some point: > > Yeah, I tried it, you might see remnants of that in the setup_migrate() > stuff (I copied it from the "two_guests" test). 
I originally wanted to > have two namespaces and two instances of pasta (just like > "two_guests"), but soon realised the issue you describe below. Right. It took me an embarrassing amount of time to figure out what was going wrong, alas. > > migration is not going to work if the > > two hosts have their own connectivity provided by (separate instances > > of) passt or pasta (or slirp for that matter). The migrating VM can > > have its TCP stream reconstructed perfectly, so the right L2 packets > > come out of the host, but the host's own passt/pasta instance won't > > know about the flows and so will just drop/reject the packets. > > > > To make that work we'd basically have to migrate state for every > > "ancestor" passt/pasta until we hit a common namespace. That seems > > pretty infeasible to me, since the pieces that know about the > > migration probably don't own those layers of the network. > > If there's any potential interest around it, we could abstract things a > bit more than we did until now and have some kind of o... orch... > coordination of data/migration flows, with some kind of external tool > identifying and connecting to several instances of passt and moving > flows between them. It sounds a bit like OVN, but with the notable > difference that we don't need to implement an overlay network. Right. I mean this is basically the responsibility of whatever's managing host nodes' network. For a virtual bridge or router it's a relatively simple matter of changing where the guest address is routed. If the host nodes' network involves passt or pasta it's... harder. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 10+ messages in thread
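For reference, a minimal sketch of the error-path change discussed above: return -errno when bind() fails, so the caller can pass a meaningful code to strerror(). The function and variable names are simplified stand-ins, not passt's actual tcp_flow_repair_socket() internals, and the IP_FREEBIND line only illustrates the idea floated (and set aside as a short-term fix) in this thread:

#include <errno.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Sketch only: simplified socket setup for an in-migrating flow */
static int repair_sock_sketch(const struct sockaddr *addr, socklen_t len)
{
	int s = socket(addr->sa_family, SOCK_STREAM | SOCK_CLOEXEC,
		       IPPROTO_TCP);

	if (s < 0)
		return -errno;

	/* The IP_FREEBIND idea from the thread (IPv4 case) would go here,
	 * allowing a bind to an address that isn't configured locally yet:
	 *
	 *	setsockopt(s, IPPROTO_IP, IP_FREEBIND, &(int){ 1 }, sizeof(int));
	 */

	if (bind(s, addr, len) < 0) {
		int rc = -errno;	/* save errno before close() clobbers it */

		close(s);
		return rc;		/* -errno, not bind()'s bare -1 */
	}

	return s;
}

On the caller side, the message quoted earlier would then pass the negated return value to strerror_(), that is strerror_(-rc) rather than strerror_(rc), matching the two fixes listed above for the next spin.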
* Re: [PATCH v2 0/2] More graceful handling of migration without passt-repair 2025-02-27 1:43 ` David Gibson @ 2025-02-27 4:32 ` Stefano Brivio 0 siblings, 0 replies; 10+ messages in thread From: Stefano Brivio @ 2025-02-27 4:32 UTC (permalink / raw) To: David Gibson; +Cc: passt-dev On Thu, 27 Feb 2025 12:43:41 +1100 David Gibson <david@gibson.dropbear.id.au> wrote: > On Wed, Feb 26, 2025 at 12:24:12PM +0100, Stefano Brivio wrote: > > On Wed, 26 Feb 2025 19:51:11 +1100 > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > On Wed, Feb 26, 2025 at 09:09:48AM +0100, Stefano Brivio wrote: > > > > On Wed, 26 Feb 2025 11:27:32 +1100 > > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > > > On Tue, Feb 25, 2025 at 06:43:16PM +0100, Stefano Brivio wrote: > > > > > > On Tue, 25 Feb 2025 16:51:30 +1100 > > > > > > David Gibson <david@gibson.dropbear.id.au> wrote: > > > > > > > > > > > > > From Red Hat internal testing we've had some reports that if > > > > > > > attempting to migrate without passt-repair, the failure mode is uglier > > > > > > > than we'd like. The migration fails, which is somewhat expected, but > > > > > > > we don't correctly roll things back on the source, so it breaks > > > > > > > network there as well. > > > > > > > > > > > > > > Handle this more gracefully allowing the migration to proceed in this > > > > > > > case, but allow TCP connections to break > > > > > > > > > > > > > > I've now tested this reasonably: > > > > > > > * I get a clean migration if there are now active flows > > > > > > > * Migration completes, although connections are broken if > > > > > > > passt-repair isn't connected > > > > > > > * Basic test suite (minus perf) > > > > > > > > > > > > > > I didn't manage to test with libvirt yet, but I'm pretty convinced the > > > > > > > behaviour should be better than it was. > > > > > > > > > > > > I did, and it is. The series looks good to me and I would apply it as > > > > > > it is, but I'm waiting a bit longer in case you want to try out some > > > > > > variations based on my tests as well. Here's what I did. > > > > > > > > > > [snip] > > > > > > > > > > Thanks for the detailed instructions. More complex than I might have > > > > > liked, but oh well. > > > > > > > > > > > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > > > > > > Migration: [97.59 %]error: End of file while reading data: : Input/output error > > > > > > > > > > > > ...despite --verbose the error doesn't tell much (perhaps I need > > > > > > LIBVIRT_DEBUG=1 instead?), but passt terminates at this point. With > > > > > > this series (I just used 'make install' from the local build), migration > > > > > > succeeds instead: > > > > > > > > > > > > $ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session > > > > > > Migration: [100.00 %] > > > > > > > > > > > > Now, on the target, I still have to figure out how to tell libvirt > > > > > > to start QEMU and prepare for the migration (equivalent of '-incoming' > > > > > > as we use in our tests), instead of just starting a new instance like > > > > > > it does. Otherwise, I have no chance to start passt-repair there. > > > > > > Perhaps it has something to do with persistent mode described here: > > > > > > > > > > Ah. So I'm pretty sure virsh migrate will automatically start qemu > > > > > with --incoming on the target. 
> > > > > > > > ("-incoming"), yes, see src/qemu/qemu_migration.c, > > > > qemuMigrationDstPrepare(). > > > > > > > > > IIUC the problem here is more about > > > > > timing: we want it to start it early, so that we have a chance to > > > > > start passt-repair and let it connect before the migration actually > > > > > happens. > > > > > > > > For the timing itself, we could actually wait for passt-repair to be > > > > there, with a timeout (say, 100ms). > > > > > > I guess. That still requires some way for KubeVirt (or whatever) to > > > know at least roughly when it needs to launch passt-repair, and I'm > > > not sure if that's something we can currently get from libvirt. > > > > KubeVirt sets up the target pod, and that's when it should be done (if > > we have an inotify mechanism or similar). I can't point to an exact code > > path yet but there's something like that. > > Right, but that approach does require inotify and starting > passt-repair before passt, which might be fine, but I have the concern > noted below. To avoid that we'd need notification after passt & qemu > are started on the target, but before the migration is actually > initiated which I don't think libvirt provides. > > > > > We could also modify passt-repair to set up an inotify watcher if the > > > > socket isn't there yet. > > > > > > Maybe, yes. This kind of breaks our "passt starts first, passt-repair > > > connects to it" model though, and I wonder if we need to revisit the > > > security implications of that. > > > > I don't think it actually breaks that model for security purposes, > > because the guest doesn't have anyway a chance to cause a connection to > > passt-repair. The guest is still suspended (or missing) at that point. > > I wasn't thinking of threat models coming from the guest, but an > attack from an unrelated process impersonating passt in order to > access passt-repair's superpowers. Then an inotify watch shouldn't substantially change things. The attacker could create the socket earlier and obtain the same outcome. > [...] > > > We could even think of deferring switching repair mode off until the > > right address is there, by the way. That would make a difference to > > us. > > Do you mean by blocking? Or by returning to normal operation with the > flow flagged somehow to be "woken up" by a netlink monitor? The latter. I don't think we should block connectivity (with new addresses) meanwhile. -- Stefano ^ permalink raw reply [flat|nested] 10+ messages in thread
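A rough sketch of the inotify idea mentioned in this exchange: passt-repair waiting briefly (say, 100ms) for the socket to appear instead of failing straight away. The helper name, the directory/name split and the single-event handling are illustrative assumptions, not passt-repair's actual code:

#include <limits.h>
#include <poll.h>
#include <string.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Sketch only: wait up to timeout_ms for 'name' to be created in 'dir'.
 * A real version would also re-check for the socket after adding the
 * watch (it may already exist), handle multiple queued events per read,
 * and track an overall deadline instead of re-arming the full timeout.
 */
static int wait_for_socket(const char *dir, const char *name, int timeout_ms)
{
	char buf[sizeof(struct inotify_event) + NAME_MAX + 1]
	     __attribute__ ((aligned(__alignof__(struct inotify_event))));
	struct pollfd pfd = { .events = POLLIN };
	int fd = inotify_init1(IN_CLOEXEC);

	if (fd < 0)
		return -1;

	if (inotify_add_watch(fd, dir, IN_CREATE) < 0) {
		close(fd);
		return -1;
	}

	pfd.fd = fd;
	while (poll(&pfd, 1, timeout_ms) > 0) {
		const struct inotify_event *ev = (const void *)buf;

		if (read(fd, buf, sizeof(buf)) <= 0)
			continue;

		if (ev->len && !strcmp(ev->name, name)) {
			close(fd);
			return 0;	/* socket showed up, go connect */
		}
	}

	close(fd);
	return -1;	/* timed out or error */
}

Whether the watch is set up by passt-repair itself or the wait happens on the passt side before migration starts is exactly the timing question left open above; the sketch only illustrates the mechanism.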
end of thread, other threads:[~2025-02-27 4:32 UTC | newest] Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2025-02-25 5:51 [PATCH v2 0/2] More graceful handling of migration without passt-repair David Gibson 2025-02-25 5:51 ` [PATCH v2 1/2] migrate, flow: Trivially succeed if migrating with no flows David Gibson 2025-02-25 5:51 ` [PATCH v2 2/2] migrate, flow: Don't attempt to migrate TCP flows without passt-repair David Gibson 2025-02-25 17:43 ` [PATCH v2 0/2] More graceful handling of migration " Stefano Brivio 2025-02-26 0:27 ` David Gibson 2025-02-26 8:09 ` Stefano Brivio 2025-02-26 8:51 ` David Gibson 2025-02-26 11:24 ` Stefano Brivio 2025-02-27 1:43 ` David Gibson 2025-02-27 4:32 ` Stefano Brivio