From: Stefano Brivio
To: passt-dev@passt.top
CC: Laurent Vivier, David Gibson
Subject: [PATCH v3 00/20] Draft, incomplete series introducing state migration
Date: Fri, 31 Jan 2025 20:39:33 +0100
Message-ID: <20250131193953.3034031-1-sbrivio@redhat.com>
List-Id: Development discussion and patches for passt

...and finally connections survive migration from source to target, at
least the ones originating from the (source) guest. I didn't try the
other way around, small tweaks might be needed.

Tested as follows, roughly as instructed by Laurent:

Source:

  $ ./passt --vhost-user
  $ qemu-system-x86_64 -machine accel=kvm -cpu host -kernel ... \
      -initrd mbuto.img -nographic -serial mon:stdio -nodefaults \
      -append "console=ttyS0" \
      -chardev socket,id=chr0,path=/tmp/passt_1.socket \
      -netdev vhost-user,id=netdev0,chardev=chr0 \
      -device virtio-net,netdev=netdev0 \
      -object memory-backend-memfd,id=memfd0,share=on,size=$((2 * 1024 * 1024 * 1024)) \
      -numa node,memdev=memfd0 -m 2G
  # ./passt-repair /tmp/passt_1.socket.repair

Target (same host):

  $ ./passt --vhost-user
  $ qemu-system-x86_64 -machine accel=kvm -cpu host -kernel ... \
      -initrd mbuto.img -nographic -serial mon:stdio -nodefaults \
      -append "console=ttyS0" \
      -chardev socket,id=chr0,path=/tmp/passt_2.socket \
      -netdev vhost-user,id=netdev0,chardev=chr0 \
      -device virtio-net,netdev=netdev0 \
      -object memory-backend-memfd,id=memfd0,share=on,size=$((2 * 1024 * 1024 * 1024)) \
      -numa node,memdev=memfd0 -m 2G \
      -incoming tcp:0:4444
  # ./passt-repair /tmp/passt_2.socket.repair

Test server:

  $ nc -l 9091

Once the guest boots:

  # ip link set dev eth0 up
  # dhclient eth0
  # socat STDIN TCP:$DEFAULT_GW:9091
  abcd
  ^a-c
  migrate tcp:0:4444

Then continue typing in the target guest:

  efgh

The purpose of this is mostly to show the complete flow, but it needs a
number of reworks.

What's missing (leaving aside pending packet queues for a moment, those
are not strictly needed):

1. tests based on the two_guests layout/setup. Even with reverse-search
   in the shell, this is getting quite hard on wrists. I guess we can
   start QEMU with -monitor unix:mon.sock,server,nowait and send the
   'migrate' command via socat STDIN UNIX-CONNECT:mon.sock

2. dump and transfer of *socket-side* MSS and window scale (I used
   hardcoded values): this needs more storage, so it needs to be
   transferred outside the flow table

3. dump, transfer and restore of TCP_REPAIR_WINDOW parameters (not
   strictly needed, but easy to add once we have appropriate storage)

4. perhaps some small bits of implementation for socket-originated
   connections (I tested only guest-originated ones so far)

5. UDP and ICMP flows (ping already happens to "survive" nicely, by the
   way)

6. man page for passt-repair, and man page changes for everything

7. packaging and Linux Security Module changes for passt-repair

8. error handling here and there, and repair rollback/migration abort

9. setting original receive/send buffer sizes and socket options
   (TCP_NODELAY)

What clearly needs changes:

a. we can't dump more stuff to the flow table, because we would exceed
   128 bytes. We need to copy everything from tcp_tap_conn except for:

   - state in flow_common
   - in_epoll
   - sock
   - timer

   and on top of this we need:

   - values for TCPOPT_WINDOW and TCPOPT_MAXSEG
   - struct tcp_repair_window

   Somewhat unexpectedly, this is actually bigger than a flow table
   entry. In any case, we need to implement a stream/per-entry
   migration right away.

b. at this point, I guess we can throw the header away, and just keep a
   magic (0xB1BB1D1B0BB1D1B0 has a missing 0 at the end but, well,
   https://en.wikipedia.org/wiki/Bibbidi-Bobbidi-Boo is the Magic Song:
   can we keep it?) and a version number. For the rest, let's go with
   big/network endianness, I'd say, and 64-bit time_t.

c. the declarative data thing is very convenient but we need to fetch
   stuff from struct ctx, as shown by the hash_secret example. What's
   very convenient about this approach is the iovec / writev() /
   readv() idea. I'm not sure if we can maintain that convenience,
   though.

Patches that could be applied regardless of this series to make it more
manageable:

  1/20 tcp: Always pass NULL event with EPOLL_CTL_DEL
  2/20 util: Rename and make global vu_remove_watch()
  6/20 util: Add read_remainder() and read_all_buf()
  8/20 Introduce passt-repair
  16/20 vhost_user: Turn vhost-user message reports to trace()
  17/20 vhost_user: Make source quit after reporting migration state
  18/20 tcp: Get our socket port using getsockname() when connecting from guest
  19/20 tcp: Add HOSTSIDE(x), HOSTFLOW(x) macros

Patches that we can throw away with the changes outlined above:

  3/20 icmp, udp: Pad time_t timestamp to 64-bit to ease state migration
  4/20 flow, flow_table: Pad flow table entries to 128 bytes, hash entries to 32 bits
  15/20 flow, flow_table: Export declaration of hash table

David Gibson (6):
  tcp: Always pass NULL event with EPOLL_CTL_DEL
  util: Rename and make global vu_remove_watch()
  migrate: vu_migrate_{source,target}() aren't actually vu speciic
  migrate: Move repair_sock_init() to vu_init()
  migrate: Make more handling common rather than vhost-user specific
  migrate: Don't handle the migration channel through epoll

Stefano Brivio (14):
  icmp, udp: Pad time_t timestamp to 64-bit to ease state migration
  flow, flow_table: Pad flow table entries to 128 bytes, hash entries
    to 32 bits
  flow_table: Use size in extern declaration for flowtab
  util: Add read_remainder() and read_all_buf()
  Introduce facilities for guest migration on top of vhost-user
    infrastructure
  Introduce passt-repair
  Add interfaces and configuration bits for passt-repair
  flow, tcp: Basic pre-migration source handler to dump sequence
    numbers
  flow, flow_table: Export declaration of hash table
  vhost_user: Turn vhost-user message reports to trace()
  vhost_user: Make source quit after reporting migration state
  tcp: Get our socket port using getsockname() when connecting from
    guest
  tcp: Add HOSTSIDE(x), HOSTFLOW(x) macros
  Implement target side of migration

 .gitignore     |   1 +
 Makefile       |  24 +--
 conf.c         |  44 +++++-
 epoll_type.h   |   6 +-
 flow.c         |  97 +++++++++++-
 flow.h         |  20 ++-
 flow_table.h   |  22 ++-
 icmp.c         |   2 +-
 icmp_flow.h    |   6 +-
 migrate.c      | 408 +++++++++++++++++++++++++++++++++++++++++++++++++
 migrate.h      |  84 ++++++++++
 passt-repair.c | 117 ++++++++++++++
 passt.1        |  11 ++
 passt.c        |  17 ++-
 passt.h        |  17 +++
 repair.c       | 193 +++++++++++++++++++++++
 repair.h       |  16 ++
 tap.c          |  64 +-------
 tcp.c          | 198 +++++++++++++++++++++++-
 tcp_conn.h     |   7 +
 tcp_internal.h |  10 +-
 tcp_splice.c   |   4 +-
 udp_flow.c     |   2 +-
 udp_flow.h     |   6 +-
 util.c         | 155 +++++++++++++++++++
 util.h         |   4 +
 vhost_user.c   |  94 +++---------
 virtio.h       |   4 -
 vu_common.c    |  62 +++-----
 vu_common.h    |   2 +-
 30 files changed, 1469 insertions(+), 228 deletions(-)
 create mode 100644 migrate.c
 create mode 100644 migrate.h
 create mode 100644 passt-repair.c
 create mode 100644 repair.c
 create mode 100644 repair.h

-- 
2.43.0
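
To make point b. above a bit more concrete, here is a minimal sketch of
what a "magic plus version" stream header in network byte order might
look like. This is not code from the series: the names
(MIGRATE_VERSION, MIGRATE_HDR_LEN, migrate_hdr_encode(),
migrate_hdr_decode()), the 32-bit width of the version field and the
version value 1 are all assumptions made for illustration. Only the
magic value is taken from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Magic value quoted in the cover letter; everything else is hypothetical */
#define MIGRATE_MAGIC	0xB1BB1D1B0BB1D1B0ULL
#define MIGRATE_VERSION	1U	/* assumed value, for illustration */
#define MIGRATE_HDR_LEN	12	/* 8 bytes of magic + 4 bytes of version */

/* Store a 64-bit value in network (big-endian) byte order */
void put_be64(uint8_t *p, uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* Load a 64-bit value stored in network (big-endian) byte order */
uint64_t get_be64(const uint8_t *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];

	return v;
}

/* Store a 32-bit value in network (big-endian) byte order */
void put_be32(uint8_t *p, uint32_t v)
{
	int i;

	for (i = 0; i < 4; i++)
		p[i] = (uint8_t)(v >> (24 - 8 * i));
}

/* Load a 32-bit value stored in network (big-endian) byte order */
uint32_t get_be32(const uint8_t *p)
{
	uint32_t v = 0;
	int i;

	for (i = 0; i < 4; i++)
		v = (v << 8) | p[i];

	return v;
}

/* Fill buf (MIGRATE_HDR_LEN bytes) with magic and version */
void migrate_hdr_encode(uint8_t *buf, uint32_t version)
{
	assert(buf);

	put_be64(buf, MIGRATE_MAGIC);
	put_be32(buf + 8, version);
}

/* Return 0 and set *version if the magic matches, -1 otherwise */
int migrate_hdr_decode(const uint8_t *buf, uint32_t *version)
{
	assert(buf && version);

	if (get_be64(buf) != MIGRATE_MAGIC)
		return -1;

	*version = get_be32(buf + 8);
	return 0;
}
```

In this sketch, the source would call migrate_hdr_encode() on a
MIGRATE_HDR_LEN-byte buffer and write it before any per-entry data, and
the target would read the same number of bytes and refuse to continue
if migrate_hdr_decode() returns -1 or reports a version it doesn't
know. Explicit byte shifts are used so that the encoding doesn't depend
on host endianness.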