public inbox for passt-dev@passt.top
From: Stefano Brivio <sbrivio@redhat.com>
To: passt-dev@passt.top
Cc: David Gibson <david@gibson.dropbear.id.au>
Subject: [PATCH v4 2/8] Introduce facilities for guest migration on top of vhost-user infrastructure
Date: Tue,  4 Feb 2025 01:47:39 +0100
Message-ID: <20250204004745.97854-3-sbrivio@redhat.com>
In-Reply-To: <20250204004745.97854-1-sbrivio@redhat.com>

Add two sets (source or target) of three functions each for passt in
vhost-user mode, triggered by activity on the file descriptor passed
via VHOST_USER_PROTOCOL_F_DEVICE_STATE:

- migrate_source_pre() and migrate_target_pre() are called to prepare
  for migration, before data is transferred

- migrate_source() sends, and migrate_target() receives migration data

- migrate_source_post() and migrate_target_post() are responsible for
  any post-migration task

Callbacks are added to these functions with arrays of function
pointers in migrate.c. Migration handlers are versioned.

Versioned descriptions of data sections will be added to the
data_versions array, which points to versioned iovec arrays. Version
1 is currently empty and will be filled in in subsequent patches.

The source announces the data version to be used and informs the peer
about endianness, and the size of void *, time_t, flow entries and
flow hash table entries.
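
As a rough illustration of the announcement described above, the following
is a minimal sketch of filling a fixed-size header on the source. The names
sketch_header and sketch_fill_header() are hypothetical, and the flow-related
sizes are left out because they depend on passt's internal types.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical, reduced header: padded to a fixed size so that the target
 * can always read the same number of bytes, whatever the version.
 */
union sketch_header {
	struct {
		uint64_t magic;		/* Host order, reveals endianness */
		uint32_t version;	/* Network order */
		uint32_t voidp_size;	/* Network order */
		uint32_t time_t_size;	/* Network order */
	};
	uint8_t pad[4096];
};

/* Fill @h with the magic, the data @version and the local type sizes */
static void sketch_fill_header(union sketch_header *h, uint32_t version)
{
	memset(h, 0, sizeof(*h));
	h->magic = 0xB1BB1D1B0BB1D1B0ULL;
	h->version = htonl(version);
	h->voidp_size = htonl(sizeof(void *));
	h->time_t_size = htonl(sizeof(time_t));
}
```

The magic is deliberately left in host order: that is what lets the peer tell
whether the two sides differ in endianness.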

The target checks if the version of the source is still supported. If
it's not, it aborts the migration.
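
The check on the target could look roughly like the sketch below, which
assumes the version has already been converted to host order. All sketch_*
names are hypothetical, and the byte-swapped magic is computed rather than
hardcoded.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAGIC	0xB1BB1D1B0BB1D1B0ULL

/* Hypothetical helper: reverse the byte order of a 64-bit value */
static uint64_t sketch_bswap64(uint64_t x)
{
	uint64_t r = 0;
	int i;

	for (i = 0; i < 8; i++) {
		r = (r << 8) | (x & 0xff);
		x >>= 8;
	}

	return r;
}

/* Hypothetical list of data versions this target understands */
static const uint32_t sketch_supported[] = { 1 };

/* Return 0 and set @bswap if magic and version are known, ENOTSUP otherwise */
static int sketch_check_header(uint64_t magic, uint32_t version, bool *bswap)
{
	size_t i;

	if (magic == SKETCH_MAGIC)
		*bswap = false;
	else if (magic == sketch_bswap64(SKETCH_MAGIC))
		*bswap = true;	/* Source has opposite endianness */
	else
		return ENOTSUP;

	for (i = 0; i < sizeof(sketch_supported) / sizeof(sketch_supported[0]);
	     i++) {
		if (sketch_supported[i] == version)
			return 0;
	}

	return ENOTSUP;	/* Unknown version: abort the migration */
}
```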

From David: vu_migrate() moves to migrate.c now.

On top of this, from David:
--
migrate: Handle sending header section from data sections

Currently the global state header is included in the list of data sections
to send for migration.  This makes for an asymmetry between the source and
target sides: for the source, the header is sent after the 'pre' handlers
along with all the rest of the data.  For the target side, the header must
be read first (before the 'pre' handlers), so that we know the version
which determines what all the rest of the data will be.

Change this so that the header is handled explicitly on both the source and
target side.  This will make some future changes simpler as well.
--
migrate: Make migration handlers simpler and more flexible

Currently the structure of the migration callbacks is quite rigid.  There
are separate tables for pre and post handlers, for source and target.
Those only prepare or fix up with no reading or writing of the stream.

The actual reading and writing is done with an iovec of buffers to
transfer.  That can't handle variably sized structures, nor can it send
state that is only obtained during migration rather than tracked during
normal operation.

Replace both the handlers and the sections with an ordered list of
migration "stages".  Each stage has a callback for both source and target
side doing whatever is necessary - these can be NULL, for example for
preparation steps that have no equivalent on the other side.  Things that
are just buffers to be transferred are handled with a macro generating a
suitable stage entry.
--
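
To illustrate the idea of ordered stages with optional callbacks, here is a
minimal, self-contained sketch. It uses an in-memory buffer as a stand-in
for the migration channel, and every sketch_* name is hypothetical rather
than taken from migrate.c.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory stand-in for the migration channel */
struct sketch_chan {
	unsigned char buf[64];
	size_t pos;
};

struct sketch_stage;

typedef int (*sketch_cb_t)(const struct sketch_stage *stage,
			   struct sketch_chan *ch);

struct sketch_stage {
	const char *name;
	sketch_cb_t source;	/* NULL: nothing to do on the source */
	sketch_cb_t target;	/* NULL: nothing to do on the target */
	void *base;		/* Buffer, for plain data stages */
	size_t len;
};

/* Copy the stage buffer into the channel */
static int sketch_send(const struct sketch_stage *stage, struct sketch_chan *ch)
{
	if (ch->pos + stage->len > sizeof(ch->buf))
		return -1;
	memcpy(ch->buf + ch->pos, stage->base, stage->len);
	ch->pos += stage->len;
	return 0;
}

/* Copy from the channel into the stage buffer */
static int sketch_recv(const struct sketch_stage *stage, struct sketch_chan *ch)
{
	if (ch->pos + stage->len > sizeof(ch->buf))
		return -1;
	memcpy(stage->base, ch->buf + ch->pos, stage->len);
	ch->pos += stage->len;
	return 0;
}

/* Stage entry for something that is just a buffer to be transferred */
#define SKETCH_DATA_STAGE(v)						\
	{ .name = #v, .source = sketch_send, .target = sketch_recv,	\
	  .base = &(v), .len = sizeof(v) }

static int sketch_counter;	/* Example state to migrate */
static int sketch_prepared;

/* Source-only preparation step, no equivalent on the target */
static int sketch_prepare(const struct sketch_stage *stage,
			  struct sketch_chan *ch)
{
	(void)stage;
	(void)ch;
	sketch_prepared = 1;
	return 0;
}

static const struct sketch_stage sketch_stages[] = {
	{ .name = "pre", .source = sketch_prepare },
	SKETCH_DATA_STAGE(sketch_counter),
};

/* Run all stages in order for one side, skipping NULL callbacks */
static int sketch_run(struct sketch_chan *ch, int as_source)
{
	size_t i;

	for (i = 0; i < sizeof(sketch_stages) / sizeof(sketch_stages[0]); i++) {
		const struct sketch_stage *s = &sketch_stages[i];
		sketch_cb_t cb = as_source ? s->source : s->target;
		int rc;

		if (!cb)
			continue;
		if ((rc = cb(s, ch)))
			return rc;
	}

	return 0;
}
```

Both sides walk the same table, so the order of data on the channel is
defined in one place only.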

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 Makefile    |  12 +--
 migrate.c   | 250 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 migrate.h   |  94 ++++++++++++++++++++
 passt.c     |   2 +-
 vu_common.c |  58 +++++-------
 vu_common.h |   2 +-
 6 files changed, 372 insertions(+), 46 deletions(-)
 create mode 100644 migrate.c
 create mode 100644 migrate.h

diff --git a/Makefile b/Makefile
index 6ab8d24..f9e8a3c 100644
--- a/Makefile
+++ b/Makefile
@@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
 	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
-	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
-	tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
+	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
+	tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
 	vhost_user.c virtio.c vu_common.c
 QRAP_SRCS = qrap.c
 PASST_REPAIR_SRCS = passt-repair.c
@@ -49,10 +49,10 @@ MANPAGES = passt.1 pasta.1 qrap.1 passt-repair.1
 
 PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
 	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
-	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
-	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
-	tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
-	virtio.h vu_common.h
+	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
+	pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
+	tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
+	vhost_user.h virtio.h vu_common.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
diff --git a/migrate.c b/migrate.c
new file mode 100644
index 0000000..a4b8a1d
--- /dev/null
+++ b/migrate.c
@@ -0,0 +1,250 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* PASST - Plug A Simple Socket Transport
+ *  for qemu/UNIX domain socket mode
+ *
+ * PASTA - Pack A Subtle Tap Abstraction
+ *  for network namespace/tap device mode
+ *
+ * migrate.c - Migration sections, layout, and routines
+ *
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#include <errno.h>
+#include <sys/uio.h>
+
+#include "util.h"
+#include "ip.h"
+#include "passt.h"
+#include "inany.h"
+#include "flow.h"
+#include "flow_table.h"
+
+#include "migrate.h"
+
+/* Current version of migration data */
+#define MIGRATE_VERSION		1
+
+/* Magic as we see it and as seen with reverse endianness */
+#define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
+#define MIGRATE_MAGIC_SWAPPED	0xB0D1B10B1B1DBBB1
+
+/* Migration header to send from source */
+static union migrate_header header = {
+	/* Immutable part of header structure: keep these two sections at the
+	 * beginning, because they are enough to identify a version regardless
+	 * of metadata.
+	 */
+	.magic		= MIGRATE_MAGIC,
+	.version	= htonl_constant(MIGRATE_VERSION),
+	/* End of immutable part of header structure */
+
+	.time_t_size	= htonl_constant(sizeof(time_t)),
+	.flow_size	= htonl_constant(sizeof(union flow)),
+	.flow_sidx_size	= htonl_constant(sizeof(struct flow_sidx)),
+	.voidp_size	= htonl_constant(sizeof(void *)),
+};
+
+/**
+ * migrate_send_block() - Migration stage handler to send verbatim data
+ * @c:		Execution context
+ * @m:		Migration metadata
+ * @stage:	Migration stage
+ * @fd:		Migration fd
+ *
+ * Sends the buffer in @stage->iov over the migration channel.
+ */
+static int migrate_send_block(struct ctx *c, struct migrate_meta *m,
+			      const struct migrate_stage *stage, int fd)
+{
+	(void)c;
+	(void)m;
+
+	if (write_remainder(fd, &stage->iov, 1, 0) < 0)
+		return errno;
+
+	return 0;
+}
+
+/**
+ * migrate_recv_block() - Migration stage handler to receive verbatim data
+ * @c:		Execution context
+ * @m:		Migration metadata
+ * @stage:	Migration stage
+ * @fd:		Migration fd
+ *
+ * Reads the buffer in @stage->iov from the migration channel.
+ *
+ * #syscalls:vu readv
+ */
+static int migrate_recv_block(struct ctx *c, struct migrate_meta *m,
+			      const struct migrate_stage *stage, int fd)
+{
+	(void)c;
+	(void)m;
+
+	if (read_remainder(fd, &stage->iov, 1, 0) < 0)
+		return errno;
+
+	return 0;
+}
+
+#define DATA_STAGE(v) \
+	{					\
+		.name = #v,			\
+		.source = migrate_send_block,	\
+		.target = migrate_recv_block,	\
+		.iov = { &(v), sizeof(v) },	\
+	}
+
+/* Stages for version 1 */
+static const struct migrate_stage stages_v1[] = {
+	{
+		.name = "flow pre",
+		.target = NULL,
+	},
+	DATA_STAGE(flow_first_free),
+	DATA_STAGE(flowtab),
+	DATA_STAGE(flow_hashtab),
+	{
+		.name = "flow post",
+		.source = NULL,
+	},
+};
+
+/* Set of data versions */
+static const struct migrate_version versions[] = {
+	{
+		1,	stages_v1,	ARRAY_SIZE(stages_v1),
+	},
+};
+
+/**
+ * migrate_source() - Migration as source, send state to hypervisor
+ * @c:		Execution context
+ * @fd:		File descriptor for state transfer
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+int migrate_source(struct ctx *c, int fd)
+{
+	struct migrate_meta m;
+	unsigned i;
+	int rc;
+
+	rc = write_all_buf(fd, &header, sizeof(header));
+	if (rc) {
+		err("Failed to send migration header: %s, abort", strerror_(rc));
+		return rc;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(stages_v1); i++) {
+		const struct migrate_stage *stage = &stages_v1[i];
+
+		if (!stage->source)
+			continue;
+
+		debug("Source side migration: %s", stage->name);
+
+		if ((rc = stage->source(c, &m, stage, fd))) {
+			err("Source migration failed stage %s: %s, abort",
+			    stage->name, strerror_(rc));
+			return rc;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * migrate_target_read_header() - Set metadata in target from source header
+ * @fd:		Descriptor for state transfer
+ * @m:		Migration metadata, filled on return
+ *
+ * Return: 0 on success, error code on failure
+ */
+static int migrate_target_read_header(int fd, struct migrate_meta *m)
+{
+	union migrate_header h;
+	unsigned i;
+
+	if (read_all_buf(fd, &h, sizeof(h)))
+		return errno;
+
+	debug("Source magic: 0x%016" PRIx64 ", sizeof(void *): %u, version: %u",
+	      h.magic, ntohl(h.voidp_size), ntohl(h.version));
+
+	m->version = ntohl(h.version);
+	m->v = NULL;
+	for (i = 0; i < ARRAY_SIZE(versions); i++) {
+		if (versions[i].version == m->version)
+			m->v = &versions[i];
+	}
+	if (!m->v)
+		return ENOTSUP;
+
+	if (h.magic == MIGRATE_MAGIC)
+		m->bswap = false;
+	else if (h.magic == MIGRATE_MAGIC_SWAPPED)
+		m->bswap = true;
+	else
+		return ENOTSUP;
+
+	if (ntohl(h.voidp_size) == 4)
+		m->source_64b = false;
+	else if (ntohl(h.voidp_size) == 8)
+		m->source_64b = true;
+	else
+		return ENOTSUP;
+
+	if (ntohl(h.time_t_size) == 4)
+		m->time_64b = false;
+	else if (ntohl(h.time_t_size) == 8)
+		m->time_64b = true;
+	else
+		return ENOTSUP;
+
+	m->flow_size = ntohl(h.flow_size);
+	m->flow_sidx_size = ntohl(h.flow_sidx_size);
+
+	return 0;
+}
+
+/**
+ * migrate_target() - Migration as target, receive state from hypervisor
+ * @c:		Execution context
+ * @fd:		File descriptor for state transfer
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+int migrate_target(struct ctx *c, int fd)
+{
+	struct migrate_meta m;
+	unsigned i;
+	int rc;
+
+	rc = migrate_target_read_header(fd, &m);
+	if (rc) {
+		err("Migration header check failed: %s, abort", strerror_(rc));
+		return rc;
+	}
+
+	for (i = 0; i < m.v->nstages; i++) {
+		const struct migrate_stage *stage = &m.v->stages[i];
+
+		if (!stage->target)
+			continue;
+
+		debug("Target side migration: %s", stage->name);
+
+		if ((rc = stage->target(c, &m, stage, fd))) {
+			err("Target migration failed stage %s: %s, abort",
+			    stage->name, strerror_(rc));
+			return rc;
+		}
+	}
+
+	return 0;
+}
diff --git a/migrate.h b/migrate.h
new file mode 100644
index 0000000..793d5e5
--- /dev/null
+++ b/migrate.h
@@ -0,0 +1,94 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#ifndef MIGRATE_H
+#define MIGRATE_H
+
+struct migrate_version;
+
+/**
+ * struct migrate_meta - Migration metadata
+ * @version:		Chosen migration data version, host order
+ * @v:			Migration version information
+ * @bswap:		Source has opposite endianness
+ * @source_64b:		Source uses 64-bit void *
+ * @time_64b:		Source uses 64-bit time_t
+ * @flow_size:		Size of union flow in source
+ * @flow_sidx_size:	Size of struct flow_sidx in source
+ */
+struct migrate_meta {
+	uint32_t version;
+	const struct migrate_version *v;
+	bool bswap;
+	bool source_64b;
+	bool time_64b;
+	size_t flow_size;
+	size_t flow_sidx_size;
+};
+
+/**
+ * union migrate_header - Migration header from source
+ * @magic:		0xB1BB1D1B0BB1D1B0, host order
+ * @version:		Source sends highest known, target aborts if unsupported
+ * @voidp_size:		sizeof(void *), network order
+ * @time_t_size:	sizeof(time_t), network order
+ * @flow_size:		sizeof(union flow), network order
+ * @flow_sidx_size:	sizeof(struct flow_sidx), network order
+ * @unused:		Padding, fixes the header size at 4096 bytes
+ */
+union migrate_header {
+	struct {
+		uint64_t magic;
+		uint32_t version;
+		uint32_t voidp_size;
+		uint32_t time_t_size;
+		uint32_t flow_size;
+		uint32_t flow_sidx_size;
+	};
+	uint8_t unused[4096];
+} __attribute__((packed));
+
+struct migrate_stage;
+
+/**
+ * migrate_cb_t - Callback function to implement one migration stage
+ */
+typedef int (*migrate_cb_t)(struct ctx *c, struct migrate_meta *m,
+			    const struct migrate_stage *stage, int fd);
+
+/**
+ * struct migrate_stage - Callbacks and parameters for one stage of migration
+ * @name:	Stage name (for debugging)
+ * @source:	Callback to implement this stage on the source
+ * @target:	Callback to implement this stage on the target
+ * @iov:	Buffer to transfer, used by stages that send or receive
+ *		verbatim data
+ */
+struct migrate_stage {
+	const char *name;
+	migrate_cb_t source;
+	migrate_cb_t target;
+	/* FIXME: rollback callbacks? */
+	union {
+		struct iovec iov;
+	};
+};
+
+/**
+ * struct migrate_version - Stages for a particular protocol version
+ * @version:	Version number this implements
+ * @stages:	Ordered array of stages
+ * @nstages:	Length of @stages
+ */
+struct migrate_version {
+	uint32_t version;
+	const struct migrate_stage *stages;
+	unsigned nstages;
+};
+
+int migrate_source(struct ctx *c, int fd);
+int migrate_target(struct ctx *c, int fd);
+
+#endif /* MIGRATE_H */
diff --git a/passt.c b/passt.c
index b1c8ab6..184d4e5 100644
--- a/passt.c
+++ b/passt.c
@@ -358,7 +358,7 @@ loop:
 			vu_kick_cb(c.vdev, ref, &now);
 			break;
 		case EPOLL_TYPE_VHOST_MIGRATION:
-			vu_migrate(c.vdev, eventmask);
+			vu_migrate(&c, eventmask);
 			break;
 		default:
 			/* Can't happen */
diff --git a/vu_common.c b/vu_common.c
index ab04d31..3d41824 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -5,6 +5,7 @@
  * common_vu.c - vhost-user common UDP and TCP functions
  */
 
+#include <errno.h>
 #include <unistd.h>
 #include <sys/uio.h>
 #include <sys/eventfd.h>
@@ -17,6 +18,7 @@
 #include "vhost_user.h"
 #include "pcap.h"
 #include "vu_common.h"
+#include "migrate.h"
 
 #define VU_MAX_TX_BUFFER_NB	2
 
@@ -305,48 +307,28 @@ err:
 }
 
 /**
- * vu_migrate() - Send/receive passt insternal state to/from QEMU
- * @vdev:	vhost-user device
+ * vu_migrate() - Send/receive passt internal state to/from QEMU
+ * @c:		Execution context
  * @events:	epoll events
  */
-void vu_migrate(struct vu_dev *vdev, uint32_t events)
+void vu_migrate(struct ctx *c, uint32_t events)
 {
-	int ret;
+	struct vu_dev *vdev = c->vdev;
+	int rc = EIO;
 
-	/* TODO: collect/set passt internal state
-	 * and use vdev->device_state_fd to send/receive it
-	 */
 	debug("vu_migrate fd %d events %x", vdev->device_state_fd, events);
-	if (events & EPOLLOUT) {
-		debug("Saving backend state");
-
-		/* send some stuff */
-		ret = write(vdev->device_state_fd, "PASST", 6);
-		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
-		vdev->device_state_result = ret == -1 ? -1 : 0;
-		/* Closing the file descriptor signals the end of transfer */
-		epoll_del(vdev->context, vdev->device_state_fd);
-		close(vdev->device_state_fd);
-		vdev->device_state_fd = -1;
-	} else if (events & EPOLLIN) {
-		char buf[6];
-
-		debug("Loading backend state");
-		/* read some stuff */
-		ret = read(vdev->device_state_fd, buf, sizeof(buf));
-		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
-		if (ret != sizeof(buf)) {
-			vdev->device_state_result = -1;
-		} else {
-			ret = strncmp(buf, "PASST", sizeof(buf));
-			vdev->device_state_result = ret == 0 ? 0 : -1;
-		}
-	} else if (events & EPOLLHUP) {
-		debug("Closing migration channel");
 
-		/* The end of file signals the end of the transfer. */
-		epoll_del(vdev->context, vdev->device_state_fd);
-		close(vdev->device_state_fd);
-		vdev->device_state_fd = -1;
-	}
+	if (events & EPOLLOUT)
+		rc = migrate_source(c, vdev->device_state_fd);
+	else if (events & EPOLLIN)
+		rc = migrate_target(c, vdev->device_state_fd);
+
+	/* EPOLLHUP without EPOLLIN/EPOLLOUT, or EPOLLERR? Migration failed */
+
+	vdev->device_state_result = rc;
+
+	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, vdev->device_state_fd, NULL);
+	debug("Closing migration channel");
+	close(vdev->device_state_fd);
+	vdev->device_state_fd = -1;
 }
diff --git a/vu_common.h b/vu_common.h
index d56c021..69c4006 100644
--- a/vu_common.h
+++ b/vu_common.h
@@ -57,5 +57,5 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
 void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
 		const struct timespec *now);
 int vu_send_single(const struct ctx *c, const void *buf, size_t size);
-void vu_migrate(struct vu_dev *vdev, uint32_t events);
+void vu_migrate(struct ctx *c, uint32_t events);
 #endif /* VU_COMMON_H */
-- 
2.43.0

