From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top
Subject: Re: [PATCH v2 0/2] More graceful handling of migration without passt-repair
Date: Tue, 25 Feb 2025 18:43:16 +0100
Message-ID: <20250225184316.407247f4@elisabeth>
In-Reply-To: <20250225055132.3677190-1-david@gibson.dropbear.id.au>
On Tue, 25 Feb 2025 16:51:30 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> From Red Hat internal testing we've had some reports that if
> attempting to migrate without passt-repair, the failure mode is uglier
> than we'd like. The migration fails, which is somewhat expected, but
> we don't correctly roll things back on the source, so it breaks
> network there as well.
>
> Handle this more gracefully by allowing the migration to proceed in this
> case, but allow TCP connections to break.
>
> I've now tested this reasonably:
> * I get a clean migration if there are no active flows
> * Migration completes, although connections are broken if
>   passt-repair isn't connected
> * Basic test suite (minus perf)
>
> I didn't manage to test with libvirt yet, but I'm pretty convinced the
> behaviour should be better than it was.
I did, and it is. The series looks good to me and I would apply it as
it is, but I'm waiting a bit longer in case you want to try out some
variations based on my tests as well. Here's what I did.
L0 is Debian testing, L1 are two similar (virt-clone'd) instances of
RHEL 9.5 (with passt-0^20250217.ga1e48a0-1.el9.x86_64 or local build with
this series, qemu-kvm-9.1.0-14.el9.x86_64, libvirt-10.10.0-7.el9.x86_64),
and L2 is Alpine 3.21-ish.
The two L1 instances (hosting the source and target guests), of course,
don't need to be run under libvirt, but they do in my case. They are
connected by passt, so that they share the same address internally, but
I'm forwarding different SSH ports to them.
Relevant libvirt XML snippets for L1 instances:
    <interface type='user'>
      <mac address='52:54:00:8a:9e:c2'/>
      <portForward proto='tcp'>
        <range start='1295' to='22'/>
      </portForward>
      <model type='virtio'/>
      <backend type='passt'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
and:
    <interface type='user'>
      <mac address='52:54:00:b8:99:8c'/>
      <portForward proto='tcp'>
        <range start='11951' to='22'/>
      </portForward>
      <model type='virtio'/>
      <backend type='passt'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
...I didn't switch those to vhost-user mode yet.
I prepared the L2 guest on L1 with:
$ wget https://dl-cdn.alpinelinux.org/alpine/v3.21/releases/cloud/nocloud_alpine-3.21.2-x86_64-bios-tiny-r0.qcow2
$ virt-customize -a nocloud_alpine-3.21.2-x86_64-bios-tiny-r0.qcow2 --root-password password:root
$ virt-install -d --name alpine --memory 1024 --noreboot --osinfo alpinelinux3.20 --network backend.type=passt,portForward0.proto=tcp,portForward0.range0.start=40922,portForward0.range0.to=2222 --import --disk nocloud_alpine-3.21.2-x86_64-bios-tiny-r0.qcow2
And made sure I can connect via SSH to the second (target node) L1 with:
$ ssh-copy-id -f -p 11951 $GATEWAY
There are some known SELinux issues at this point that I'm still
working on (similar for AppArmor), so I *temporarily* set SELinux to
permissive mode with 'setenforce 0', on L1. Some of the issues were not
known, though, and it's taking me longer than expected.
Now I can start passt-repair (or not) on the source L1 (node):
# passt-repair /run/user/1001/libvirt/qemu/run/passt/8-alpine-net0.socket.repair
and open a TCP connection in the source L2 guest ('virsh console alpine',
then login as root/root):
# apk add inetutils-telnet
# telnet passt.top 80
and finally ask libvirt to migrate the guest. Note that I need
"--unsafe" because I didn't care about migrating storage (it's good
enough to have the guest memory for this test).
Without this series, migration fails on the source:
$ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session
Migration: [97.59 %]error: End of file while reading data: : Input/output error
...despite --verbose, the error doesn't say much (perhaps I need
LIBVIRT_DEBUG=1 instead?), but passt terminates at this point. With
this series (I just used 'make install' from the local build), migration
succeeds instead:
$ virsh migrate --verbose --p2p --live --unsafe alpine --tunneled qemu+ssh://88.198.0.161:10951/session
Migration: [100.00 %]
Now, on the target, I still have to figure out how to tell libvirt
to start QEMU and prepare for the migration (equivalent of '-incoming'
as we use in our tests), instead of just starting a new instance like
it does. Otherwise, I have no chance to start passt-repair there.
Perhaps it has something to do with persistent mode described here:
https://libvirt.org/migration.html#configuration-file-handling
and --listen-address, but I'm not quite sure yet.
That is, I could only test different failures (early one on source, or
later one on target) with this, not a complete successful migration.
> There are more fragile cases that I'm looking to fix, particularly the
> die()s in flow_migrate_source_rollback() and elsewhere, however I ran
> into various complications that I didn't manage to sort out today.
> I'll continue looking at those tomorrow. I'm now pretty confident
> that those additional fixes won't entirely supersede the changes in
> this series, so it should be fine to apply these on their own.
By the way, I think the somewhat less fragile/more obvious case where
we fail clumsily is when the target doesn't have the same address as
the source (among other possible addresses). In that case, we fail (and
terminate) with a rather awkward:
93.7217: ERROR: Failed to bind socket for migrated flow: Cannot assign requested address
93.7218: ERROR: Flow 0 (TCP connection): Can't set up socket: (null), drop
93.7331: ERROR: Selecting TCP_SEND_QUEUE, socket 1: Socket operation on non-socket
93.7333: ERROR: Unexpected reply from TCP_REPAIR helper: -100
that's because, oops, I only took care of socket() failures in
tcp_flow_repair_socket(), but not bind() failures (!). Sorry.
Once that's fixed, flow_migrate_target() should also take care of
decreasing 'count' accordingly. I only had a quick look and didn't
really try to sketch a fix.
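
For illustration only, the general pattern described above (handle bind()
failures as well as socket() failures, drop just the affected flow, and
keep the flow count in sync) could look roughly like the following. This
is a minimal sketch in Python rather than passt's C, it is not passt's
actual code, and bind_flow_socket() and restore_flows() are hypothetical
names that don't correspond to tcp_flow_repair_socket() or
flow_migrate_target():

```python
import socket


def bind_flow_socket(addr, port):
    """Hypothetical helper: create a TCP socket and bind it to addr:port.

    Raises OSError if either socket() or bind() fails. On a bind() failure
    the socket is closed first, so the descriptor is not leaked.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # may raise OSError
    try:
        sock.bind((addr, port))
    except OSError:
        sock.close()
        raise
    return sock


def restore_flows(flows):
    """Hypothetical target-side loop over (addr, port) pairs.

    Returns the sockets that could be set up, and the adjusted flow count.
    """
    socks = []
    count = len(flows)
    for addr, port in flows:
        try:
            socks.append(bind_flow_socket(addr, port))
        except OSError as e:
            # Drop only this flow, and decrease the count to match
            count -= 1
            print(f"Flow {addr}:{port}: can't set up socket: {e.strerror}, drop")
    return socks, count
```

With this shape, binding to an address that isn't configured on the host
raises OSError with errno EADDRNOTAVAIL ("Cannot assign requested
address" on Linux), the flow is dropped with a single message, and the
remaining flows are still processed instead of the whole process
terminating.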
--
Stefano