[PATCH v13 0/6] State migration, kind of draft again

public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed

* [PATCH v13 0/6] State migration, kind of draft again
@ 2025-02-09 22:19 Stefano Brivio
  2025-02-09 22:20 ` [PATCH v13 1/6] migrate: Skeleton of live migration logic Stefano Brivio
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Stefano Brivio @ 2025-02-09 22:19 UTC (permalink / raw)
  To: passt-dev; +Cc: David Gibson

Some concerns came to my mind during the weekend and now I tried
a bit quickly to fix them. A couple of functions became horrible
as a result.

More tests: IPv6 too, iperf3 inbound and outbound.

From another read of:
  https://github.com/checkpoint-restore/criu/blob/criu-dev/soccr/soccr.c
I noticed that the way I was dumping and restoring queues
was almost entirely bogus. It should be fixed now. Handling of
FIN might be needed, though: we might have an off-by-one when we
restore queues, I guess.

Riddle of this version: enabling the flow_migrate_source_early()
callback (shrinking the window early on) breaks things if I run
the source without strace: the source (I think? sends a reset
when we close the sockets after we switch repair mode on the
second time (we flip that twice).

So it's commented out for the moment. By itself, it works, and
effectively limits what the peer sends during migration.

There must be some other race or issue in passt-repair or in the
matching interface, but I couldn't figure it out.

David Gibson (1):
  migrate: Migrate guest observed addresses

Stefano Brivio (5):
  migrate: Skeleton of live migration logic
  Add interfaces and configuration bits for passt-repair
  vhost_user: Make source quit after reporting migration state
  migrate: Migrate TCP flows
  test: Add migration tests

 Makefile                   |  14 +-
 conf.c                     |  44 ++-
 epoll_type.h               |   6 +-
 flow.c                     | 248 ++++++++++++
 flow.h                     |   8 +
 migrate.c                  | 309 +++++++++++++++
 migrate.h                  |  54 +++
 passt.1                    |  11 +
 passt.c                    |  21 +-
 passt.h                    |  15 +
 repair.c                   | 211 ++++++++++
 repair.h                   |  16 +
 tap.c                      |  65 +--
 tcp.c                      | 789 +++++++++++++++++++++++++++++++++++++
 tcp_conn.h                 |  95 +++++
 test/lib/layout            |  55 ++-
 test/lib/setup             | 134 +++++++
 test/lib/test              |  48 +++
 test/migrate/basic         |  59 +++
 test/migrate/bidirectional |  64 +++
 test/migrate/iperf3_in4    |  50 +++
 test/migrate/iperf3_in6    |  58 +++
 test/migrate/iperf3_out4   |  50 +++
 test/migrate/iperf3_out6   |  58 +++
 test/run                   |  10 +
 util.c                     |  62 +++
 util.h                     |  30 ++
 vhost_user.c               |  68 +---
 virtio.h                   |   4 -
 vu_common.c                |  49 +--
 vu_common.h                |   2 +-
 31 files changed, 2524 insertions(+), 183 deletions(-)
 create mode 100644 migrate.c
 create mode 100644 migrate.h
 create mode 100644 repair.c
 create mode 100644 repair.h
 create mode 100644 test/migrate/basic
 create mode 100644 test/migrate/bidirectional
 create mode 100644 test/migrate/iperf3_in4
 create mode 100644 test/migrate/iperf3_in6
 create mode 100644 test/migrate/iperf3_out4
 create mode 100644 test/migrate/iperf3_out6

-- 
2.43.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v13 1/6] migrate: Skeleton of live migration logic
  2025-02-09 22:19 [PATCH v13 0/6] State migration, kind of draft again Stefano Brivio
@ 2025-02-09 22:20 ` Stefano Brivio
  2025-02-10  2:26   ` David Gibson
  2025-02-09 22:20 ` [PATCH v13 2/6] migrate: Migrate guest observed addresses Stefano Brivio
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 14+ messages in thread
From: Stefano Brivio @ 2025-02-09 22:20 UTC (permalink / raw)
  To: passt-dev; +Cc: David Gibson

Introduce facilities for guest migration on top of vhost-user
infrastructure.  Add migration facilities based on top of the current
vhost-user infrastructure, moving vu_migrate() and related functions
to migrate.c.

Versioned migration stages define function pointers to be called on
source or target, or data sections that need to be transferred.

The migration header consists of a magic number, a version number for the
encoding, and a "compat_version" which represents the oldest version which
is compatible with the current one.  We don't use it yet, but that allows
for the future possibility of backwards compatible protocol extensions.

Co-authored-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 Makefile     |  12 +--
 epoll_type.h |   2 -
 migrate.c    | 214 +++++++++++++++++++++++++++++++++++++++++++++++++++
 migrate.h    |  54 +++++++++++++
 passt.c      |   8 +-
 passt.h      |   8 ++
 util.h       |  29 +++++++
 vhost_user.c |  60 +++------------
 virtio.h     |   4 -
 vu_common.c  |  49 +-----------
 vu_common.h  |   2 +-
 11 files changed, 327 insertions(+), 115 deletions(-)
 create mode 100644 migrate.c
 create mode 100644 migrate.h

diff --git a/Makefile b/Makefile
index d3d4b78..be89b07 100644
--- a/Makefile
+++ b/Makefile
@@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
 	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
-	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
-	tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
+	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
+	tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
 	vhost_user.c virtio.c vu_common.c
 QRAP_SRCS = qrap.c
 PASST_REPAIR_SRCS = passt-repair.c
@@ -49,10 +49,10 @@ MANPAGES = passt.1 pasta.1 qrap.1 passt-repair.1
 
 PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
 	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
-	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
-	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
-	tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
-	virtio.h vu_common.h
+	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
+	pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
+	tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
+	vhost_user.h virtio.h vu_common.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
diff --git a/epoll_type.h b/epoll_type.h
index fd9eac3..f3ef415 100644
--- a/epoll_type.h
+++ b/epoll_type.h
@@ -40,8 +40,6 @@ enum epoll_type {
 	EPOLL_TYPE_VHOST_CMD,
 	/* vhost-user kick event socket */
 	EPOLL_TYPE_VHOST_KICK,
-	/* vhost-user migration socket */
-	EPOLL_TYPE_VHOST_MIGRATION,
 
 	EPOLL_NUM_TYPES,
 };
diff --git a/migrate.c b/migrate.c
new file mode 100644
index 0000000..aeac872
--- /dev/null
+++ b/migrate.c
@@ -0,0 +1,214 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* PASST - Plug A Simple Socket Transport
+ *  for qemu/UNIX domain socket mode
+ *
+ * PASTA - Pack A Subtle Tap Abstraction
+ *  for network namespace/tap device mode
+ *
+ * migrate.c - Migration sections, layout, and routines
+ *
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#include <errno.h>
+#include <sys/uio.h>
+
+#include "util.h"
+#include "ip.h"
+#include "passt.h"
+#include "inany.h"
+#include "flow.h"
+#include "flow_table.h"
+
+#include "migrate.h"
+
+/* Magic identifier for migration data */
+#define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
+
+/* Stages for version 1 */
+static const struct migrate_stage stages_v1[] = {
+	{ 0 },
+};
+
+/* Supported encoding versions, from latest (most preferred) to oldest */
+static const struct migrate_version versions[] = {
+	{ 1,	stages_v1, },
+	{ 0 },
+};
+
+/* Current encoding version */
+#define CURRENT_VERSION		(&versions[0])
+
+/**
+ * migrate_source() - Migration as source, send state to hypervisor
+ * @c:		Execution context
+ * @fd:		File descriptor for state transfer
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+static int migrate_source(struct ctx *c, int fd)
+{
+	const struct migrate_version *v = CURRENT_VERSION;
+	const struct migrate_header header = {
+		.magic		= htonll_constant(MIGRATE_MAGIC),
+		.version	= htonl(v->id),
+		.compat_version	= htonl(v->id),
+	};
+	const struct migrate_stage *s;
+	int ret;
+
+	if (write_all_buf(fd, &header, sizeof(header))) {
+		ret = errno;
+		err("Can't send migration header: %s, abort", strerror_(ret));
+		return ret;
+	}
+
+	for (s = v->s; s->name; s++) {
+		if (!s->source)
+			continue;
+
+		debug("Source side migration stage: %s", s->name);
+
+		if ((ret = s->source(c, s, fd))) {
+			err("Source migration stage: %s: %s, abort", s->name,
+			    strerror_(ret));
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * migrate_target_read_header() - Read header in target
+ * @fd:		Descriptor for state transfer
+ *
+ * Return: version structure on success, NULL on failure with errno set
+ */
+static const struct migrate_version *migrate_target_read_header(int fd)
+{
+	const struct migrate_version *v;
+	struct migrate_header h;
+	uint32_t id, compat_id;
+
+	if (read_all_buf(fd, &h, sizeof(h)))
+		return NULL;
+
+	id = ntohl(h.version);
+	compat_id = ntohl(h.compat_version);
+
+	debug("Source magic: 0x%016" PRIx64 ", version: %u, compat: %u",
+	      ntohll(h.magic), id, compat_id);
+
+	if (ntohll(h.magic) != MIGRATE_MAGIC || !id || !compat_id) {
+		err("Invalid incoming device state");
+		errno = EINVAL;
+		return NULL;
+	}
+
+	for (v = versions; v->id; v++)
+		if (v->id <= id && v->id >= compat_id)
+			return v;
+
+	errno = ENOTSUP;
+	err("Unsupported device state version: %u", id);
+	return NULL;
+}
+
+/**
+ * migrate_target() - Migration as target, receive state from hypervisor
+ * @c:		Execution context
+ * @fd:		File descriptor for state transfer
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+static int migrate_target(struct ctx *c, int fd)
+{
+	const struct migrate_version *v;
+	const struct migrate_stage *s;
+	int ret;
+
+	if (!(v = migrate_target_read_header(fd)))
+		return errno;
+
+	for (s = v->s; s->name; s++) {
+		if (!s->target)
+			continue;
+
+		debug("Target side migration stage: %s", s->name);
+
+		if ((ret = s->target(c, s, fd))) {
+			err("Target migration stage: %s: %s, abort", s->name,
+			    strerror_(ret));
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * migrate_init() - Set up things necessary for migration
+ * @c:		Execution context
+ */
+void migrate_init(struct ctx *c)
+{
+	c->device_state_result = -1;
+}
+
+/**
+ * migrate_close() - Close migration channel
+ * @c:		Execution context
+ */
+void migrate_close(struct ctx *c)
+{
+	if (c->device_state_fd != -1) {
+		debug("Closing migration channel, fd: %d", c->device_state_fd);
+		close(c->device_state_fd);
+		c->device_state_fd = -1;
+		c->device_state_result = -1;
+	}
+}
+
+/**
+ * migrate_request() - Request a migration of device state
+ * @c:		Execution context
+ * @fd:		fd to transfer state
+ * @target:	Are we the target of the migration?
+ */
+void migrate_request(struct ctx *c, int fd, bool target)
+{
+	debug("Migration requested, fd: %d (was %d)", fd, c->device_state_fd);
+
+	if (c->device_state_fd != -1)
+		migrate_close(c);
+
+	c->device_state_fd = fd;
+	c->migrate_target = target;
+}
+
+/**
+ * migrate_handler() - Send/receive passt internal state to/from hypervisor
+ * @c:		Execution context
+ */
+void migrate_handler(struct ctx *c)
+{
+	int rc;
+
+	if (c->device_state_fd < 0)
+		return;
+
+	debug("Handling migration request from fd: %d, target: %d",
+	      c->device_state_fd, c->migrate_target);
+
+	if (c->migrate_target)
+		rc = migrate_target(c, c->device_state_fd);
+	else
+		rc = migrate_source(c, c->device_state_fd);
+
+	migrate_close(c);
+
+	c->device_state_result = rc;
+}
diff --git a/migrate.h b/migrate.h
new file mode 100644
index 0000000..d299779
--- /dev/null
+++ b/migrate.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#ifndef MIGRATE_H
+#define MIGRATE_H
+
+/**
+ * struct migrate_header - Migration header from source
+ * @magic:		0xB1BB1D1B0BB1D1B0, network order
+ * @version:		Highest known, target aborts if too old, network order
+ * @compat_version:	Lowest version compatible with @version, target aborts
+ *			if too new, network order
+ */
+struct migrate_header {
+	uint64_t magic;
+	uint32_t version;
+	uint32_t compat_version;
+} __attribute__((packed));
+
+/**
+ * struct migrate_stage - Callbacks and parameters for one stage of migration
+ * @name:	Stage name (for debugging)
+ * @source:	Callback to implement this stage on the source
+ * @target:	Callback to implement this stage on the target
+ * @iov:	Optional data section to transfer
+ */
+struct migrate_stage {
+	const char *name;
+	int (*source)(struct ctx *c, const struct migrate_stage *stage, int fd);
+	int (*target)(struct ctx *c, const struct migrate_stage *stage, int fd);
+
+	/* Add here separate rollback callbacks if needed */
+
+	struct iovec iov;
+};
+
+/**
+ * struct migrate_version - Stages for a particular protocol version
+ * @id:		Version number, host order
+ * @s:		Ordered array of stages, NULL-terminated
+ */
+struct migrate_version {
+	uint32_t id;
+	const struct migrate_stage *s;
+};
+
+void migrate_init(struct ctx *c);
+void migrate_close(struct ctx *c);
+void migrate_request(struct ctx *c, int fd, bool target);
+void migrate_handler(struct ctx *c);
+
+#endif /* MIGRATE_H */
diff --git a/passt.c b/passt.c
index 53fdd38..935a69f 100644
--- a/passt.c
+++ b/passt.c
@@ -51,6 +51,7 @@
 #include "tcp_splice.h"
 #include "ndp.h"
 #include "vu_common.h"
+#include "migrate.h"
 
 #define EPOLL_EVENTS		8
 
@@ -75,7 +76,6 @@ char *epoll_type_str[] = {
 	[EPOLL_TYPE_TAP_LISTEN]		= "listening qemu socket",
 	[EPOLL_TYPE_VHOST_CMD]		= "vhost-user command socket",
 	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
-	[EPOLL_TYPE_VHOST_MIGRATION]	= "vhost-user migration socket",
 };
 static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
 	      "epoll_type_str[] doesn't match enum epoll_type");
@@ -202,6 +202,7 @@ int main(int argc, char **argv)
 	isolate_initial(argc, argv);
 
 	c.pasta_netns_fd = c.fd_tap = c.pidfile_fd = -1;
+	c.device_state_fd = -1;
 
 	sigemptyset(&sa.sa_mask);
 	sa.sa_flags = 0;
@@ -357,9 +358,6 @@ loop:
 		case EPOLL_TYPE_VHOST_KICK:
 			vu_kick_cb(c.vdev, ref, &now);
 			break;
-		case EPOLL_TYPE_VHOST_MIGRATION:
-			vu_migrate(c.vdev, eventmask);
-			break;
 		default:
 			/* Can't happen */
 			ASSERT(0);
@@ -368,5 +366,7 @@ loop:
 
 	post_handler(&c, &now);
 
+	migrate_handler(&c);
+
 	goto loop;
 }
diff --git a/passt.h b/passt.h
index 0dd4efa..e73a5ac 100644
--- a/passt.h
+++ b/passt.h
@@ -235,6 +235,9 @@ struct ip6_ctx {
  * @low_wmem:		Low probed net.core.wmem_max
  * @low_rmem:		Low probed net.core.rmem_max
  * @vdev:		vhost-user device
+ * @device_state_fd:	Device state migration channel
+ * @device_state_result: Device state migration result
+ * @migrate_target:	Are we the target, on the next migration request?
  */
 struct ctx {
 	enum passt_modes mode;
@@ -300,6 +303,11 @@ struct ctx {
 	int low_rmem;
 
 	struct vu_dev *vdev;
+
+	/* Migration */
+	int device_state_fd;
+	int device_state_result;
+	bool migrate_target;
 };
 
 void proto_update_l2_buf(const unsigned char *eth_d,
diff --git a/util.h b/util.h
index 23b165c..255eb26 100644
--- a/util.h
+++ b/util.h
@@ -122,14 +122,43 @@
 	 (((x) & 0x0000ff00) <<  8) | (((x) & 0x000000ff) << 24))
 #endif
 
+#ifndef __bswap_constant_32
+#define __bswap_constant_32(x)						\
+	((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >>  8) |	\
+	 (((x) & 0x0000ff00) <<  8) | (((x) & 0x000000ff) << 24))
+#endif
+
+#ifndef __bswap_constant_64
+#define __bswap_constant_64(x) \
+	((((x) & 0xff00000000000000ULL) >> 56) |			\
+	 (((x) & 0x00ff000000000000ULL) >> 40) |			\
+	 (((x) & 0x0000ff0000000000ULL) >> 24) |			\
+	 (((x) & 0x000000ff00000000ULL) >> 8)  |			\
+	 (((x) & 0x00000000ff000000ULL) << 8)  |			\
+	 (((x) & 0x0000000000ff0000ULL) << 24) |			\
+	 (((x) & 0x000000000000ff00ULL) << 40) |			\
+	 (((x) & 0x00000000000000ffULL) << 56))
+#endif
+
 #if __BYTE_ORDER == __BIG_ENDIAN
 #define	htons_constant(x)	(x)
 #define	htonl_constant(x)	(x)
+#define htonll_constant(x)	(x)
+#define	ntohs_constant(x)	(x)
+#define	ntohl_constant(x)	(x)
+#define ntohll_constant(x)	(x)
 #else
 #define	htons_constant(x)	(__bswap_constant_16(x))
 #define	htonl_constant(x)	(__bswap_constant_32(x))
+#define	htonll_constant(x)	(__bswap_constant_64(x))
+#define	ntohs_constant(x)	(__bswap_constant_16(x))
+#define	ntohl_constant(x)	(__bswap_constant_32(x))
+#define	ntohll_constant(x)	(__bswap_constant_64(x))
 #endif
 
+#define ntohll(x)		(be64toh((x)))
+#define htonll(x)		(htobe64((x)))
+
 /**
  * ntohl_unaligned() - Read 32-bit BE value from a possibly unaligned address
  * @p:		Pointer to the BE value in memory
diff --git a/vhost_user.c b/vhost_user.c
index 159f0b3..256c8ab 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -44,6 +44,7 @@
 #include "tap.h"
 #include "vhost_user.h"
 #include "pcap.h"
+#include "migrate.h"
 
 /* vhost-user version we are compatible with */
 #define VHOST_USER_VERSION 1
@@ -997,36 +998,6 @@ static bool vu_send_rarp_exec(struct vu_dev *vdev,
 	return false;
 }
 
-/**
- * vu_set_migration_watch() - Add the migration file descriptor to epoll
- * @vdev:	vhost-user device
- * @fd:		File descriptor to add
- * @direction:	Direction of the migration (save or load backend state)
- */
-static void vu_set_migration_watch(const struct vu_dev *vdev, int fd,
-				   uint32_t direction)
-{
-	union epoll_ref ref = {
-		.type = EPOLL_TYPE_VHOST_MIGRATION,
-		.fd = fd,
-	 };
-	struct epoll_event ev = { 0 };
-
-	ev.data.u64 = ref.u64;
-	switch (direction) {
-	case VHOST_USER_TRANSFER_STATE_DIRECTION_SAVE:
-		ev.events = EPOLLOUT;
-		break;
-	case VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD:
-		ev.events = EPOLLIN;
-		break;
-	default:
-		ASSERT(0);
-	}
-
-	epoll_ctl(vdev->context->epollfd, EPOLL_CTL_ADD, ref.fd, &ev);
-}
-
 /**
  * vu_set_device_state_fd_exec() - Set the device state migration channel
  * @vdev:	vhost-user device
@@ -1051,16 +1022,8 @@ static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
 	    direction != VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD)
 		die("Invalide device_state_fd direction: %d", direction);
 
-	if (vdev->device_state_fd != -1) {
-		epoll_del(vdev->context, vdev->device_state_fd);
-		close(vdev->device_state_fd);
-	}
-
-	vdev->device_state_fd = msg->fds[0];
-	vdev->device_state_result = -1;
-	vu_set_migration_watch(vdev, vdev->device_state_fd, direction);
-
-	debug("Got device_state_fd: %d", vdev->device_state_fd);
+	migrate_request(vdev->context, msg->fds[0],
+			direction == VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD);
 
 	/* We don't provide a new fd for the data transfer */
 	vmsg_set_reply_u64(msg, VHOST_USER_VRING_NOFD_MASK);
@@ -1075,12 +1038,11 @@ static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
  *
  * Return: True as the reply contains the migration result
  */
+/* cppcheck-suppress constParameterCallback */
 static bool vu_check_device_state_exec(struct vu_dev *vdev,
 				       struct vhost_user_msg *msg)
 {
-	(void)vdev;
-
-	vmsg_set_reply_u64(msg, vdev->device_state_result);
+	vmsg_set_reply_u64(msg, vdev->context->device_state_result);
 
 	return true;
 }
@@ -1106,8 +1068,8 @@ void vu_init(struct ctx *c)
 	}
 	c->vdev->log_table = NULL;
 	c->vdev->log_call_fd = -1;
-	c->vdev->device_state_fd = -1;
-	c->vdev->device_state_result = -1;
+
+	migrate_init(c);
 }
 
 
@@ -1157,12 +1119,8 @@ void vu_cleanup(struct vu_dev *vdev)
 
 	vu_close_log(vdev);
 
-	if (vdev->device_state_fd != -1) {
-		epoll_del(vdev->context, vdev->device_state_fd);
-		close(vdev->device_state_fd);
-		vdev->device_state_fd = -1;
-		vdev->device_state_result = -1;
-	}
+	/* If we lose the VU dev, we also lose our migration channel */
+	migrate_close(vdev->context);
 }
 
 /**
diff --git a/virtio.h b/virtio.h
index 7bef2d2..0a59441 100644
--- a/virtio.h
+++ b/virtio.h
@@ -106,8 +106,6 @@ struct vu_dev_region {
  * @log_call_fd:		Eventfd to report logging update
  * @log_size:			Size of the logging memory region
  * @log_table:			Base of the logging memory region
- * @device_state_fd:		Device state migration channel
- * @device_state_result:	Device state migration result
  */
 struct vu_dev {
 	struct ctx *context;
@@ -119,8 +117,6 @@ struct vu_dev {
 	int log_call_fd;
 	uint64_t log_size;
 	uint8_t *log_table;
-	int device_state_fd;
-	int device_state_result;
 };
 
 /**
diff --git a/vu_common.c b/vu_common.c
index ab04d31..48826b1 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -5,6 +5,7 @@
  * common_vu.c - vhost-user common UDP and TCP functions
  */
 
+#include <errno.h>
 #include <unistd.h>
 #include <sys/uio.h>
 #include <sys/eventfd.h>
@@ -17,6 +18,7 @@
 #include "vhost_user.h"
 #include "pcap.h"
 #include "vu_common.h"
+#include "migrate.h"
 
 #define VU_MAX_TX_BUFFER_NB	2
 
@@ -303,50 +305,3 @@ err:
 
 	return -1;
 }
-
-/**
- * vu_migrate() - Send/receive passt insternal state to/from QEMU
- * @vdev:	vhost-user device
- * @events:	epoll events
- */
-void vu_migrate(struct vu_dev *vdev, uint32_t events)
-{
-	int ret;
-
-	/* TODO: collect/set passt internal state
-	 * and use vdev->device_state_fd to send/receive it
-	 */
-	debug("vu_migrate fd %d events %x", vdev->device_state_fd, events);
-	if (events & EPOLLOUT) {
-		debug("Saving backend state");
-
-		/* send some stuff */
-		ret = write(vdev->device_state_fd, "PASST", 6);
-		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
-		vdev->device_state_result = ret == -1 ? -1 : 0;
-		/* Closing the file descriptor signals the end of transfer */
-		epoll_del(vdev->context, vdev->device_state_fd);
-		close(vdev->device_state_fd);
-		vdev->device_state_fd = -1;
-	} else if (events & EPOLLIN) {
-		char buf[6];
-
-		debug("Loading backend state");
-		/* read some stuff */
-		ret = read(vdev->device_state_fd, buf, sizeof(buf));
-		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
-		if (ret != sizeof(buf)) {
-			vdev->device_state_result = -1;
-		} else {
-			ret = strncmp(buf, "PASST", sizeof(buf));
-			vdev->device_state_result = ret == 0 ? 0 : -1;
-		}
-	} else if (events & EPOLLHUP) {
-		debug("Closing migration channel");
-
-		/* The end of file signals the end of the transfer. */
-		epoll_del(vdev->context, vdev->device_state_fd);
-		close(vdev->device_state_fd);
-		vdev->device_state_fd = -1;
-	}
-}
diff --git a/vu_common.h b/vu_common.h
index d56c021..f538f23 100644
--- a/vu_common.h
+++ b/vu_common.h
@@ -57,5 +57,5 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
 void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
 		const struct timespec *now);
 int vu_send_single(const struct ctx *c, const void *buf, size_t size);
-void vu_migrate(struct vu_dev *vdev, uint32_t events);
+
 #endif /* VU_COMMON_H */
-- 
2.43.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v13 2/6] migrate: Migrate guest observed addresses
  2025-02-09 22:19 [PATCH v13 0/6] State migration, kind of draft again Stefano Brivio
  2025-02-09 22:20 ` [PATCH v13 1/6] migrate: Skeleton of live migration logic Stefano Brivio
@ 2025-02-09 22:20 ` Stefano Brivio
  2025-02-09 22:20 ` [PATCH v13 3/6] Add interfaces and configuration bits for passt-repair Stefano Brivio
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Stefano Brivio @ 2025-02-09 22:20 UTC (permalink / raw)
  To: passt-dev; +Cc: David Gibson

From: David Gibson <david@gibson.dropbear.id.au>

Most of the information in struct ctx doesn't need to be migrated.
Either it's strictly back end information which is allowed to differ
between the two ends, or it must already be configured identically on
the two ends.

There are a few exceptions though.  In particular passt learns several
addresses of the guest by observing what it sends out.  If we lose
this information across migration we might get away with it, but if
there are active flows we might misdirect some packets before
re-learning the guest address.

Avoid this by migrating the guest's observed addresses.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Coding style stuff, comments, etc.]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 migrate.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/migrate.c b/migrate.c
index aeac872..72a6d40 100644
--- a/migrate.c
+++ b/migrate.c
@@ -27,8 +27,81 @@
 /* Magic identifier for migration data */
 #define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
 
+/**
+ * struct migrate_seen_addrs_v1 - Migratable guest addresses for v1 state stream
+ * @addr6:	Observed guest IPv6 address
+ * @addr6_ll:	Observed guest IPv6 link-local address
+ * @addr4:	Observed guest IPv4 address
+ * @mac:	Observed guest MAC address
+ */
+struct migrate_seen_addrs_v1 {
+	struct in6_addr addr6;
+	struct in6_addr addr6_ll;
+	struct in_addr addr4;
+	unsigned char mac[ETH_ALEN];
+} __attribute__((packed));
+
+/**
+ * seen_addrs_source_v1() - Copy and send guest observed addresses from source
+ * @c:		Execution context
+ * @stage:	Migration stage, unused
+ * @fd:		File descriptor for state transfer
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+/* cppcheck-suppress [constParameterCallback, unmatchedSuppression] */
+static int seen_addrs_source_v1(struct ctx *c,
+				const struct migrate_stage *stage, int fd)
+{
+	struct migrate_seen_addrs_v1 addrs = {
+		.addr6 = c->ip6.addr_seen,
+		.addr6_ll = c->ip6.addr_ll_seen,
+		.addr4 = c->ip4.addr_seen,
+	};
+
+	(void)stage;
+
+	memcpy(addrs.mac, c->guest_mac, sizeof(addrs.mac));
+
+	if (write_all_buf(fd, &addrs, sizeof(addrs)))
+		return errno;
+
+	return 0;
+}
+
+/**
+ * seen_addrs_target_v1() - Receive and use guest observed addresses on target
+ * @c:		Execution context
+ * @stage:	Migration stage, unused
+ * @fd:		File descriptor for state transfer
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+static int seen_addrs_target_v1(struct ctx *c,
+				const struct migrate_stage *stage, int fd)
+{
+	struct migrate_seen_addrs_v1 addrs;
+
+	(void)stage;
+
+	if (read_all_buf(fd, &addrs, sizeof(addrs)))
+		return errno;
+
+	c->ip6.addr_seen = addrs.addr6;
+	c->ip6.addr_ll_seen = addrs.addr6_ll;
+	c->ip4.addr_seen = addrs.addr4;
+	memcpy(c->guest_mac, addrs.mac, sizeof(c->guest_mac));
+
+	return 0;
+}
+
 /* Stages for version 1 */
 static const struct migrate_stage stages_v1[] = {
+	{
+		.name = "observed addresses",
+		.source = seen_addrs_source_v1,
+		.target = seen_addrs_target_v1,
+	},
 	{ 0 },
 };
 
-- 
2.43.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v13 3/6] Add interfaces and configuration bits for passt-repair
  2025-02-09 22:19 [PATCH v13 0/6] State migration, kind of draft again Stefano Brivio
  2025-02-09 22:20 ` [PATCH v13 1/6] migrate: Skeleton of live migration logic Stefano Brivio
  2025-02-09 22:20 ` [PATCH v13 2/6] migrate: Migrate guest observed addresses Stefano Brivio
@ 2025-02-09 22:20 ` Stefano Brivio
  2025-02-10  2:59   ` David Gibson
  2025-02-09 22:20 ` [PATCH v13 4/6] vhost_user: Make source quit after reporting migration state Stefano Brivio
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 14+ messages in thread
From: Stefano Brivio @ 2025-02-09 22:20 UTC (permalink / raw)
  To: passt-dev; +Cc: David Gibson

In vhost-user mode, by default, create a second UNIX domain socket
accepting connections from passt-repair, with the usual listener
socket.

When we need to set or clear TCP_REPAIR on sockets, we'll send them
via SCM_RIGHTS to passt-repair, who sets the socket option values we
ask for.

To that end, introduce batched functions to request TCP_REPAIR
settings on sockets, so that we don't have to send a single message
for each socket, on migration. When needed, repair_flush() will
send the message and check for the reply.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 Makefile     |  12 +--
 conf.c       |  44 ++++++++++-
 epoll_type.h |   4 +
 migrate.c    |   5 +-
 passt.1      |  11 +++
 passt.c      |   9 +++
 passt.h      |   7 ++
 repair.c     | 212 +++++++++++++++++++++++++++++++++++++++++++++++++++
 repair.h     |  16 ++++
 tap.c        |  65 +---------------
 util.c       |  62 +++++++++++++++
 util.h       |   1 +
 12 files changed, 375 insertions(+), 73 deletions(-)
 create mode 100644 repair.c
 create mode 100644 repair.h

diff --git a/Makefile b/Makefile
index be89b07..d4e1096 100644
--- a/Makefile
+++ b/Makefile
@@ -38,9 +38,9 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
 	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
-	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
-	tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
-	vhost_user.c virtio.c vu_common.c
+	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c \
+	repair.c tap.c tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c \
+	udp_vu.c util.c vhost_user.c virtio.c vu_common.c
 QRAP_SRCS = qrap.c
 PASST_REPAIR_SRCS = passt-repair.c
 SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
@@ -50,9 +50,9 @@ MANPAGES = passt.1 pasta.1 qrap.1 passt-repair.1
 PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
 	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
 	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
-	pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
-	tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
-	vhost_user.h virtio.h vu_common.h
+	pcap.h pif.h repair.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h \
+	tcp_internal.h tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h \
+	udp_vu.h util.h vhost_user.h virtio.h vu_common.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
diff --git a/conf.c b/conf.c
index 142dc94..7a5ff8b 100644
--- a/conf.c
+++ b/conf.c
@@ -820,6 +820,9 @@ static void usage(const char *name, FILE *f, int status)
 			"    UNIX domain socket is provided by -s option\n"
 			"  --print-capabilities	print back-end capabilities in JSON format,\n"
 			"    only meaningful for vhost-user mode\n");
+		FPRINTF(f,
+			"  --repair-path PATH	path for passt-repair(1)\n"
+			"    default: append '.repair' to UNIX domain path\n");
 	}
 
 	FPRINTF(f,
@@ -1243,8 +1246,25 @@ static void conf_nat(const char *arg, struct in_addr *addr4,
  */
 static void conf_open_files(struct ctx *c)
 {
-	if (c->mode != MODE_PASTA && c->fd_tap == -1)
-		c->fd_tap_listen = tap_sock_unix_open(c->sock_path);
+	if (c->mode != MODE_PASTA && c->fd_tap == -1) {
+		c->fd_tap_listen = sock_unix(c->sock_path);
+
+		if (c->mode == MODE_VU && strcmp(c->repair_path, "none")) {
+			if (!*c->repair_path &&
+			    snprintf_check(c->repair_path,
+					   sizeof(c->repair_path), "%s.repair",
+					   c->sock_path)) {
+				warn("passt-repair path %s not usable",
+				     c->repair_path);
+				c->fd_repair_listen = -1;
+			} else {
+				c->fd_repair_listen = sock_unix(c->repair_path);
+			}
+		} else {
+			c->fd_repair_listen = -1;
+		}
+		c->fd_repair = -1;
+	}
 
 	if (*c->pidfile) {
 		c->pidfile_fd = output_file_open(c->pidfile, O_WRONLY);
@@ -1357,9 +1377,12 @@ void conf(struct ctx *c, int argc, char **argv)
 		{"host-lo-to-ns-lo", no_argument, 	NULL,		23 },
 		{"dns-host",	required_argument,	NULL,		24 },
 		{"vhost-user",	no_argument,		NULL,		25 },
+
 		/* vhost-user backend program convention */
 		{"print-capabilities", no_argument,	NULL,		26 },
 		{"socket-path",	required_argument,	NULL,		's' },
+
+		{"repair-path",	required_argument,	NULL,		27 },
 		{ 0 },
 	};
 	const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
@@ -1751,6 +1774,9 @@ void conf(struct ctx *c, int argc, char **argv)
 		case 'D':
 			/* Handle these later, once addresses are configured */
 			break;
+		case 27:
+			/* Handle this once we checked --vhost-user */
+			break;
 		case 'h':
 			usage(argv[0], stdout, EXIT_SUCCESS);
 			break;
@@ -1827,8 +1853,8 @@ void conf(struct ctx *c, int argc, char **argv)
 	if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.guest_gw))
 		c->no_dhcp = 1;
 
-	/* Inbound port options & DNS can be parsed now (after IPv4/IPv6
-	 * settings)
+	/* Inbound port options, DNS, and --repair-path can be parsed now, after
+	 * IPv4/IPv6 settings and --vhost-user.
 	 */
 	fwd_probe_ephemeral();
 	udp_portmap_clear();
@@ -1874,6 +1900,16 @@ void conf(struct ctx *c, int argc, char **argv)
 			}
 
 			die("Cannot use DNS address %s", optarg);
+		} else if (name == 27) {
+			if (c->mode != MODE_VU && strcmp(optarg, "none"))
+				die("--repair-path is for vhost-user mode only");
+
+			if (snprintf_check(c->repair_path,
+					   sizeof(c->repair_path), "%s",
+					   optarg))
+				die("Invalid passt-repair path: %s", optarg);
+
+			break;
 		}
 	} while (name != -1);
 
diff --git a/epoll_type.h b/epoll_type.h
index f3ef415..7f2a121 100644
--- a/epoll_type.h
+++ b/epoll_type.h
@@ -40,6 +40,10 @@ enum epoll_type {
 	EPOLL_TYPE_VHOST_CMD,
 	/* vhost-user kick event socket */
 	EPOLL_TYPE_VHOST_KICK,
+	/* TCP_REPAIR helper listening socket */
+	EPOLL_TYPE_REPAIR_LISTEN,
+	/* TCP_REPAIR helper socket */
+	EPOLL_TYPE_REPAIR,
 
 	EPOLL_NUM_TYPES,
 };
diff --git a/migrate.c b/migrate.c
index 72a6d40..1c59016 100644
--- a/migrate.c
+++ b/migrate.c
@@ -23,6 +23,7 @@
 #include "flow_table.h"
 
 #include "migrate.h"
+#include "repair.h"
 
 /* Magic identifier for migration data */
 #define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
@@ -232,7 +233,7 @@ void migrate_init(struct ctx *c)
 }
 
 /**
- * migrate_close() - Close migration channel
+ * migrate_close() - Close migration channel and connection to passt-repair
  * @c:		Execution context
  */
 void migrate_close(struct ctx *c)
@@ -243,6 +244,8 @@ void migrate_close(struct ctx *c)
 		c->device_state_fd = -1;
 		c->device_state_result = -1;
 	}
+
+	repair_close(c);
 }
 
 /**
diff --git a/passt.1 b/passt.1
index 29cc3ed..c81d539 100644
--- a/passt.1
+++ b/passt.1
@@ -418,6 +418,17 @@ Enable vhost-user. The vhost-user command socket is provided by \fB--socket\fR.
 .BR \-\-print-capabilities
 Print back-end capabilities in JSON format, only meaningful for vhost-user mode.
 
+.TP
+.BR \-\-repair-path " " \fIpath
+Path for UNIX domain socket used by the \fBpasst-repair\fR(1) helper to connect
+to \fBpasst\fR in order to set or clear the TCP_REPAIR option on sockets, during
+migration. \fB--repair-path none\fR disables this interface (if you need to
+specify a socket path called "none" you can prefix the path by \fI./\fR).
+
+Default, for \-\-vhost-user mode only, is to append \fI.repair\fR to the path
+chosen for the hypervisor UNIX domain socket. No socket is created if not in
+\-\-vhost-user mode.
+
 .TP
 .BR \-F ", " \-\-fd " " \fIFD
 Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened
diff --git a/passt.c b/passt.c
index 935a69f..6f9fb4d 100644
--- a/passt.c
+++ b/passt.c
@@ -52,6 +52,7 @@
 #include "ndp.h"
 #include "vu_common.h"
 #include "migrate.h"
+#include "repair.h"
 
 #define EPOLL_EVENTS		8
 
@@ -76,6 +77,8 @@ char *epoll_type_str[] = {
 	[EPOLL_TYPE_TAP_LISTEN]		= "listening qemu socket",
 	[EPOLL_TYPE_VHOST_CMD]		= "vhost-user command socket",
 	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
+	[EPOLL_TYPE_REPAIR_LISTEN]	= "TCP_REPAIR helper listening socket",
+	[EPOLL_TYPE_REPAIR]		= "TCP_REPAIR helper socket",
 };
 static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
 	      "epoll_type_str[] doesn't match enum epoll_type");
@@ -358,6 +361,12 @@ loop:
 		case EPOLL_TYPE_VHOST_KICK:
 			vu_kick_cb(c.vdev, ref, &now);
 			break;
+		case EPOLL_TYPE_REPAIR_LISTEN:
+			repair_listen_handler(&c, eventmask);
+			break;
+		case EPOLL_TYPE_REPAIR:
+			repair_handler(&c, eventmask);
+			break;
 		default:
 			/* Can't happen */
 			ASSERT(0);
diff --git a/passt.h b/passt.h
index e73a5ac..c392be0 100644
--- a/passt.h
+++ b/passt.h
@@ -20,6 +20,7 @@ union epoll_ref;
 #include "siphash.h"
 #include "ip.h"
 #include "inany.h"
+#include "migrate.h"
 #include "flow.h"
 #include "icmp.h"
 #include "fwd.h"
@@ -193,6 +194,7 @@ struct ip6_ctx {
  * @foreground:		Run in foreground, don't log to stderr by default
  * @nofile:		Maximum number of open files (ulimit -n)
  * @sock_path:		Path for UNIX domain socket
+ * @repair_path:	TCP_REPAIR helper path, can be "none", empty for default
  * @pcap:		Path for packet capture file
  * @pidfile:		Path to PID file, empty string if not configured
  * @pidfile_fd:		File descriptor for PID file, -1 if none
@@ -203,6 +205,8 @@ struct ip6_ctx {
  * @epollfd:		File descriptor for epoll instance
  * @fd_tap_listen:	File descriptor for listening AF_UNIX socket, if any
  * @fd_tap:		AF_UNIX socket, tuntap device, or pre-opened socket
+ * @fd_repair_listen:	File descriptor for listening TCP_REPAIR socket, if any
+ * @fd_repair:		Connected AF_UNIX socket for TCP_REPAIR helper
  * @our_tap_mac:	Pasta/passt's MAC on the tap link
  * @guest_mac:		MAC address of guest or namespace, seen or configured
  * @hash_secret:	128-bit secret for siphash functions
@@ -247,6 +251,7 @@ struct ctx {
 	int foreground;
 	int nofile;
 	char sock_path[UNIX_PATH_MAX];
+	char repair_path[UNIX_PATH_MAX];
 	char pcap[PATH_MAX];
 
 	char pidfile[PATH_MAX];
@@ -263,6 +268,8 @@ struct ctx {
 	int epollfd;
 	int fd_tap_listen;
 	int fd_tap;
+	int fd_repair_listen;
+	int fd_repair;
 	unsigned char our_tap_mac[ETH_ALEN];
 	unsigned char guest_mac[ETH_ALEN];
 	uint64_t hash_secret[2];
diff --git a/repair.c b/repair.c
new file mode 100644
index 0000000..784b994
--- /dev/null
+++ b/repair.c
@@ -0,0 +1,212 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* PASST - Plug A Simple Socket Transport
+ *  for qemu/UNIX domain socket mode
+ *
+ * PASTA - Pack A Subtle Tap Abstraction
+ *  for network namespace/tap device mode
+ *
+ * repair.c - Interface (server) for passt-repair, set/clear TCP_REPAIR
+ *
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#include <errno.h>
+#include <sys/uio.h>
+
+#include "util.h"
+#include "ip.h"
+#include "passt.h"
+#include "inany.h"
+#include "flow.h"
+#include "flow_table.h"
+
+#include "repair.h"
+
+#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
+
+/* Pending file descriptors for next repair_flush() call, or command change */
+static int repair_fds[SCM_MAX_FD];
+
+/* Pending command: flush pending file descriptors if it changes */
+static int repair_cmd;
+
+/* Number of pending file descriptors set in @repair_fds */
+static int repair_nfds;
+
+/**
+ * repair_sock_init() - Start listening for connections on helper socket
+ * @c:		Execution context
+ */
+void repair_sock_init(const struct ctx *c)
+{
+	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR_LISTEN };
+	struct epoll_event ev = { 0 };
+
+	if (c->fd_repair_listen == -1)
+		return;
+
+	if (listen(c->fd_repair_listen, 0)) {
+		err_perror("listen() on repair helper socket, won't migrate");
+		return;
+	}
+
+	ref.fd = c->fd_repair_listen;
+	ev.events = EPOLLIN | EPOLLHUP | EPOLLET;
+	ev.data.u64 = ref.u64;
+	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair_listen, &ev))
+		err_perror("repair helper socket epoll_ctl(), won't migrate");
+}
+
+/**
+ * repair_listen_handler() - Handle events on TCP_REPAIR helper listening socket
+ * @c:		Execution context
+ * @events:	epoll events
+ */
+void repair_listen_handler(struct ctx *c, uint32_t events)
+{
+	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR };
+	struct epoll_event ev = { 0 };
+	struct ucred ucred;
+	socklen_t len;
+
+	if (events != EPOLLIN) {
+		debug("Spurious event 0x%04x on TCP_REPAIR helper socket",
+		      events);
+		return;
+	}
+
+	len = sizeof(ucred);
+
+	/* Another client is already connected: accept and close right away. */
+	if (c->fd_repair != -1) {
+		int discard = accept4(c->fd_repair_listen, NULL, NULL,
+				      SOCK_NONBLOCK);
+
+		if (discard == -1)
+			return;
+
+		if (!getsockopt(discard, SOL_SOCKET, SO_PEERCRED, &ucred, &len))
+			info("Discarding TCP_REPAIR helper, PID %i", ucred.pid);
+
+		close(discard);
+		return;
+	}
+
+	if ((c->fd_repair = accept4(c->fd_repair_listen, NULL, NULL, 0)) < 0) {
+		debug_perror("accept4() on TCP_REPAIR helper listening socket");
+		return;
+	}
+
+	if (!getsockopt(c->fd_repair, SOL_SOCKET, SO_PEERCRED, &ucred, &len))
+		info("Accepted TCP_REPAIR helper, PID %i", ucred.pid);
+
+	ref.fd = c->fd_repair;
+	ev.events = EPOLLHUP | EPOLLET;
+	ev.data.u64 = ref.u64;
+	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair, &ev)) {
+		debug_perror("epoll_ctl() on TCP_REPAIR helper socket");
+		close(c->fd_repair);
+		c->fd_repair = -1;
+	}
+}
+
+/**
+ * repair_close() - Close connection to TCP_REPAIR helper
+ * @c:		Execution context
+ */
+void repair_close(struct ctx *c)
+{
+	debug("Closing TCP_REPAIR helper socket");
+
+	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_repair, NULL);
+	close(c->fd_repair);
+	c->fd_repair = -1;
+}
+
+/**
+ * repair_handler() - Handle EPOLLHUP and EPOLLERR on TCP_REPAIR helper socket
+ * @c:		Execution context
+ * @events:	epoll events
+ */
+void repair_handler(struct ctx *c, uint32_t events)
+{
+	(void)events;
+
+	repair_close(c);
+}
+
+/**
+ * repair_flush() - Flush current set of sockets to helper, with current command
+ * @c:		Execution context
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+int repair_flush(struct ctx *c)
+{
+	struct iovec iov = { &((int8_t){ repair_cmd }), sizeof(int8_t) };
+	char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
+	     __attribute__ ((aligned(__alignof__(struct cmsghdr))));
+	struct cmsghdr *cmsg;
+	struct msghdr msg;
+
+	if (!repair_nfds)
+		return 0;
+
+	msg = (struct msghdr){ NULL, 0, &iov, 1,
+			       buf, CMSG_SPACE(sizeof(int) * repair_nfds), 0 };
+	cmsg = CMSG_FIRSTHDR(&msg);
+
+	cmsg->cmsg_level = SOL_SOCKET;
+	cmsg->cmsg_type = SCM_RIGHTS;
+	cmsg->cmsg_len = CMSG_LEN(sizeof(int) * repair_nfds);
+	memcpy(CMSG_DATA(cmsg), repair_fds, sizeof(int) * repair_nfds);
+
+	repair_nfds = 0;
+
+	if (sendmsg(c->fd_repair, &msg, 0) < 0) {
+		int ret = -errno;
+		err_perror("Failed to send sockets to TCP_REPAIR helper");
+		repair_close(c);
+		return ret;
+	}
+
+	if (recv(c->fd_repair, &((int8_t){ 0 }), 1, 0) < 0) {
+		int ret = -errno;
+		err_perror("Failed to receive reply from TCP_REPAIR helper");
+		repair_close(c);
+		return ret;
+	}
+
+	return 0;
+}
+
+/**
+ * repair_set() - Add socket to TCP_REPAIR set with given command
+ * @c:		Execution context
+ * @s:		Socket to add
+ * @cmd:	TCP_REPAIR_ON, TCP_REPAIR_OFF, or TCP_REPAIR_OFF_NO_WP
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+/* cppcheck-suppress unusedFunction */
+int repair_set(struct ctx *c, int s, int cmd)
+{
+	int rc;
+
+	if (repair_nfds && repair_cmd != cmd) {
+		if ((rc = repair_flush(c)))
+			return rc;
+	}
+
+	repair_cmd = cmd;
+	repair_fds[repair_nfds++] = s;
+
+	if (repair_nfds >= SCM_MAX_FD) {
+		if ((rc = repair_flush(c)))
+			return rc;
+	}
+
+	return 0;
+}
diff --git a/repair.h b/repair.h
new file mode 100644
index 0000000..de279d6
--- /dev/null
+++ b/repair.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#ifndef REPAIR_H
+#define REPAIR_H
+
+void repair_sock_init(const struct ctx *c);
+void repair_listen_handler(struct ctx *c, uint32_t events);
+void repair_handler(struct ctx *c, uint32_t events);
+void repair_close(struct ctx *c);
+int repair_flush(struct ctx *c);
+int repair_set(struct ctx *c, int s, int cmd);
+
+#endif /* REPAIR_H */
diff --git a/tap.c b/tap.c
index 8c92d23..d0673e5 100644
--- a/tap.c
+++ b/tap.c
@@ -56,6 +56,7 @@
 #include "netlink.h"
 #include "pasta.h"
 #include "packet.h"
+#include "repair.h"
 #include "tap.h"
 #include "log.h"
 #include "vhost_user.h"
@@ -1151,68 +1152,6 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
 		tap_pasta_input(c, now);
 }
 
-/**
- * tap_sock_unix_open() - Create and bind AF_UNIX socket
- * @sock_path:	Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
- *
- * Return: socket descriptor on success, won't return on failure
- */
-int tap_sock_unix_open(char *sock_path)
-{
-	int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
-	struct sockaddr_un addr = {
-		.sun_family = AF_UNIX,
-	};
-	int i;
-
-	if (fd < 0)
-		die_perror("Failed to open UNIX domain socket");
-
-	for (i = 1; i < UNIX_SOCK_MAX; i++) {
-		char *path = addr.sun_path;
-		int ex, ret;
-
-		if (*sock_path)
-			memcpy(path, sock_path, UNIX_PATH_MAX);
-		else if (snprintf_check(path, UNIX_PATH_MAX - 1,
-					UNIX_SOCK_PATH, i))
-			die_perror("Can't build UNIX domain socket path");
-
-		ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
-			    0);
-		if (ex < 0)
-			die_perror("Failed to check for UNIX domain conflicts");
-
-		ret = connect(ex, (const struct sockaddr *)&addr, sizeof(addr));
-		if (!ret || (errno != ENOENT && errno != ECONNREFUSED &&
-			     errno != EACCES)) {
-			if (*sock_path)
-				die("Socket path %s already in use", path);
-
-			close(ex);
-			continue;
-		}
-		close(ex);
-
-		unlink(path);
-		ret = bind(fd, (const struct sockaddr *)&addr, sizeof(addr));
-		if (*sock_path && ret)
-			die_perror("Failed to bind UNIX domain socket");
-
-		if (!ret)
-			break;
-	}
-
-	if (i == UNIX_SOCK_MAX)
-		die_perror("Failed to bind UNIX domain socket");
-
-	info("UNIX domain socket bound at %s", addr.sun_path);
-	if (!*sock_path)
-		memcpy(sock_path, addr.sun_path, UNIX_PATH_MAX);
-
-	return fd;
-}
-
 /**
  * tap_backend_show_hints() - Give help information to start QEMU
  * @c:		Execution context
@@ -1423,6 +1362,8 @@ void tap_backend_init(struct ctx *c)
 		tap_sock_tun_init(c);
 		break;
 	case MODE_VU:
+		repair_sock_init(c);
+		/* fall through */
 	case MODE_PASST:
 		tap_sock_unix_init(c);
 
diff --git a/util.c b/util.c
index 4d51e04..c3c5480 100644
--- a/util.c
+++ b/util.c
@@ -178,6 +178,68 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
 	return fd;
 }
 
+/**
+ * sock_unix() - Create and bind AF_UNIX socket
+ * @sock_path:	Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
+ *
+ * Return: socket descriptor on success, won't return on failure
+ */
+int sock_unix(char *sock_path)
+{
+	int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
+	struct sockaddr_un addr = {
+		.sun_family = AF_UNIX,
+	};
+	int i;
+
+	if (fd < 0)
+		die_perror("Failed to open UNIX domain socket");
+
+	for (i = 1; i < UNIX_SOCK_MAX; i++) {
+		char *path = addr.sun_path;
+		int ex, ret;
+
+		if (*sock_path)
+			memcpy(path, sock_path, UNIX_PATH_MAX);
+		else if (snprintf_check(path, UNIX_PATH_MAX - 1,
+					UNIX_SOCK_PATH, i))
+			die_perror("Can't build UNIX domain socket path");
+
+		ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
+			    0);
+		if (ex < 0)
+			die_perror("Failed to check for UNIX domain conflicts");
+
+		ret = connect(ex, (const struct sockaddr *)&addr, sizeof(addr));
+		if (!ret || (errno != ENOENT && errno != ECONNREFUSED &&
+			     errno != EACCES)) {
+			if (*sock_path)
+				die("Socket path %s already in use", path);
+
+			close(ex);
+			continue;
+		}
+		close(ex);
+
+		unlink(path);
+		ret = bind(fd, (const struct sockaddr *)&addr, sizeof(addr));
+		if (*sock_path && ret)
+			die_perror("Failed to bind UNIX domain socket");
+
+		if (!ret)
+			break;
+	}
+
+	if (i == UNIX_SOCK_MAX)
+		die_perror("Failed to bind UNIX domain socket");
+
+	info("UNIX domain socket bound at %s", addr.sun_path);
+	if (!*sock_path)
+		memcpy(sock_path, addr.sun_path, UNIX_PATH_MAX);
+
+	return fd;
+}
+
 /**
  * sock_probe_mem() - Check if setting high SO_SNDBUF and SO_RCVBUF is allowed
  * @c:		Execution context
diff --git a/util.h b/util.h
index 255eb26..3dacb4d 100644
--- a/util.h
+++ b/util.h
@@ -214,6 +214,7 @@ struct ctx;
 int sock_l4_sa(const struct ctx *c, enum epoll_type type,
 	       const void *sa, socklen_t sl,
 	       const char *ifname, bool v6only, uint32_t data);
+int sock_unix(char *sock_path);
 void sock_probe_mem(struct ctx *c);
 long timespec_diff_ms(const struct timespec *a, const struct timespec *b);
 int64_t timespec_diff_us(const struct timespec *a, const struct timespec *b);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v13 4/6] vhost_user: Make source quit after reporting migration state
  2025-02-09 22:19 [PATCH v13 0/6] State migration, kind of draft again Stefano Brivio
                   ` (2 preceding siblings ...)
  2025-02-09 22:20 ` [PATCH v13 3/6] Add interfaces and configuration bits for passt-repair Stefano Brivio
@ 2025-02-09 22:20 ` Stefano Brivio
  2025-02-10  3:43   ` David Gibson
  2025-02-09 22:20 ` [PATCH v13 5/6] migrate: Migrate TCP flows Stefano Brivio
  2025-02-09 22:20 ` [PATCH v13 6/6] test: Add migration tests Stefano Brivio
  5 siblings, 1 reply; 14+ messages in thread
From: Stefano Brivio @ 2025-02-09 22:20 UTC (permalink / raw)
  To: passt-dev; +Cc: David Gibson

This will close all the sockets we currently have open in repair mode,
and completes our migration tasks as source. If the hypervisor wants
to have us back at this point, somebody needs to restart us.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 vhost_user.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/vhost_user.c b/vhost_user.c
index 256c8ab..9115fb5 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -998,6 +998,9 @@ static bool vu_send_rarp_exec(struct vu_dev *vdev,
 	return false;
 }
 
+/* If set, quit when we get a VHOST_USER_CHECK_DEVICE_STATE, after replying */
+static bool quit_on_device_state = false;
+
 /**
  * vu_set_device_state_fd_exec() - Set the device state migration channel
  * @vdev:	vhost-user device
@@ -1025,6 +1028,9 @@ static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
 	migrate_request(vdev->context, msg->fds[0],
 			direction == VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD);
 
+	if (direction == VHOST_USER_TRANSFER_STATE_DIRECTION_SAVE)
+		quit_on_device_state = true;
+
 	/* We don't provide a new fd for the data transfer */
 	vmsg_set_reply_u64(msg, VHOST_USER_VRING_NOFD_MASK);
 
@@ -1203,4 +1209,10 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
 
 	if (reply_requested)
 		vu_send_reply(fd, &msg);
+
+	if (quit_on_device_state &&
+	    msg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE) {
+		info("Migration complete, exiting");
+		_exit(EXIT_SUCCESS);
+	}
 }
-- 
2.43.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v13 5/6] migrate: Migrate TCP flows
  2025-02-09 22:19 [PATCH v13 0/6] State migration, kind of draft again Stefano Brivio
                   ` (3 preceding siblings ...)
  2025-02-09 22:20 ` [PATCH v13 4/6] vhost_user: Make source quit after reporting migration state Stefano Brivio
@ 2025-02-09 22:20 ` Stefano Brivio
  2025-02-10  6:05   ` David Gibson
  2025-02-09 22:20 ` [PATCH v13 6/6] test: Add migration tests Stefano Brivio
  5 siblings, 1 reply; 14+ messages in thread
From: Stefano Brivio @ 2025-02-09 22:20 UTC (permalink / raw)
  To: passt-dev; +Cc: David Gibson

This implements flow preparation on the source, transfer of data with
a format roughly inspired by struct tcp_tap_conn, and flow insertion
on the target, with all the appropriate window options, window
scaling, MSS, etc.

The target side is rather convoluted because we first need to create
sockets and switch them to repair mode, before we can apply options
that are *not* stored in the flow table. However, we don't want to
request repair mode for sockets one by one. So we need to do this in
several steps.

[dwg: Assorted cleanups]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 flow.c     | 248 +++++++++++++++++
 flow.h     |   8 +
 migrate.c  |  19 ++
 passt.c    |   6 +-
 repair.c   |   1 -
 tcp.c      | 789 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 tcp_conn.h |  95 +++++++
 7 files changed, 1162 insertions(+), 4 deletions(-)

diff --git a/flow.c b/flow.c
index a6fe6d1..51f8c62 100644
--- a/flow.c
+++ b/flow.c
@@ -19,6 +19,7 @@
 #include "inany.h"
 #include "flow.h"
 #include "flow_table.h"
+#include "repair.h"
 
 const char *flow_state_str[] = {
 	[FLOW_STATE_FREE]	= "FREE",
@@ -52,6 +53,26 @@ const uint8_t flow_proto[] = {
 static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
 	      "flow_proto[] doesn't match enum flow_type");
 
+#define foreach_flow(i, flow, bound)					\
+	for ((i) = 0, (flow) = &flowtab[(i)];				\
+	     (i) < (bound);						\
+	     (i)++, (flow) = &flowtab[(i)])				\
+		if ((flow)->f.state == FLOW_STATE_FREE)			\
+			(i) += (flow)->free.n - 1;			\
+		else
+
+#define foreach_active_flow(i, flow, bound)				\
+	foreach_flow((i), (flow), (bound))				\
+		if ((flow)->f.state != FLOW_STATE_ACTIVE)		\
+			continue;					\
+		else
+
+#define foreach_tcp_flow(i, flow, bound)				\
+	foreach_active_flow((i), (flow), (bound))			\
+		if ((flow)->f.type != FLOW_TCP)				\
+			continue;					\
+		else
+
 /* Global Flow Table */
 
 /**
@@ -874,6 +895,233 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 	*last_next = FLOW_MAX;
 }
 
+/**
+ * flow_migrate_source_rollback() - Disable repair mode, return failure
+ * @c:		Execution context
+ * @max_flow:	Maximum index of affected flows
+ * @ret:	Negative error code
+ *
+ * Return: @ret
+ */
+static int flow_migrate_source_rollback(struct ctx *c, unsigned max_flow,
+					int ret)
+{
+	union flow *flow;
+	unsigned i;
+
+	debug("...roll back migration");
+
+	foreach_tcp_flow(i, flow, max_flow)
+		tcp_flow_repair_off(c, &flow->tcp);
+
+	repair_flush(c);
+
+	return ret;
+}
+
+/**
+ * flow_migrate_repair_all() - Turn repair mode on or off for all flows
+ * @c:		Execution context
+ * @enable:	Switch repair mode on if set, off otherwise
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+static int flow_migrate_repair_all(struct ctx *c, bool enable)
+{
+	union flow *flow;
+	unsigned i;
+	int rc;
+
+	foreach_tcp_flow(i, flow, FLOW_MAX) {
+		if (enable)
+			rc = tcp_flow_repair_on(c, &flow->tcp);
+		else
+			rc = tcp_flow_repair_off(c, &flow->tcp);
+
+		if (rc) {
+			debug("Can't %s repair mode: %s",
+			      enable ? "enable" : "disable", strerror_(-rc));
+			return flow_migrate_source_rollback(c, i, rc);
+		}
+	}
+
+	if ((rc = repair_flush(c))) {
+		debug("Can't %s repair mode: %s",
+		      enable ? "enable" : "disable", strerror_(-rc));
+		return flow_migrate_source_rollback(c, i, rc);
+	}
+
+	return 0;
+}
+
+/**
+ * flow_migrate_source_early() - Early tasks: shrink (RFC 7323 2.2) TCP windows
+ * @c:		Execution context
+ * @stage:	Migration stage information, unused
+ * @fd:		Migration file descriptor, unused
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+int flow_migrate_source_early(struct ctx *c, const struct migrate_stage *stage,
+			      int fd)
+{
+	union flow *flow;
+	unsigned i;
+	int rc;
+
+	(void)stage;
+	(void)fd;
+
+	/* We need repair mode to dump and set (some) window parameters */
+	if ((rc = flow_migrate_repair_all(c, true)))
+		return -rc;
+
+	foreach_tcp_flow(i, flow, FLOW_MAX) {
+		if ((rc = tcp_flow_migrate_shrink_window(i, &flow->tcp))) {
+			err("Shrinking window, flow %u: %s", i, strerror_(-rc));
+			return flow_migrate_source_rollback(c, i, -rc);
+		}
+	}
+
+	/* Now send window updates. We'll flip repair mode back on in a bit */
+	if ((rc = flow_migrate_repair_all(c, false)))
+		return -rc;
+
+	return 0;
+}
+
+/**
+ * flow_migrate_source_pre() - Prepare flows for migration: enable repair mode
+ * @c:		Execution context
+ * @stage:	Migration stage information (unused)
+ * @fd:		Migration file descriptor (unused)
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+int flow_migrate_source_pre(struct ctx *c, const struct migrate_stage *stage,
+			    int fd)
+{
+	int rc;
+
+	(void)stage;
+	(void)fd;
+
+	if ((rc = flow_migrate_repair_all(c, true)))
+		return -rc;
+
+	return 0;
+}
+
+/**
+ * flow_migrate_source() - Dump all the remaining information and send data
+ * @c:		Execution context (unused)
+ * @stage:	Migration stage information (unused)
+ * @fd:		Migration file descriptor
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
+			int fd)
+{
+	uint32_t count = 0;
+	union flow *flow;
+	unsigned i;
+	int rc;
+
+	(void)c;
+	(void)stage;
+
+	foreach_tcp_flow(i, flow, FLOW_MAX)
+		count++;
+
+	count = htonl(count);
+	if ((rc = write_all_buf(fd, &count, sizeof(count)))) {
+		rc = errno;
+		err_perror("Can't send flow count (%u)", ntohl(count));
+		return flow_migrate_source_rollback(c, FLOW_MAX, rc);
+	}
+
+	debug("Sending %u flows", ntohl(count));
+
+	/* Dump and send information that can be stored in the flow table */
+	foreach_tcp_flow(i, flow, FLOW_MAX) {
+		if ((rc = tcp_flow_migrate_source(fd, &flow->tcp))) {
+			err("Can't send data, flow %u: %s", i, strerror_(-rc));
+			return flow_migrate_source_rollback(c, FLOW_MAX, -rc);
+		}
+	}
+
+	/* And then "extended" data (including window data we saved previously):
+	 * the target needs to set repair mode on sockets before it can set
+	 * this stuff, but it needs sockets (and flows) for that.
+	 *
+	 * This also closes sockets so that the target can start connecting
+	 * theirs: you can't sendmsg() to queues (using the socket) if the
+	 * socket is not connected (EPIPE), not even in repair mode. And the
+	 * target needs to restore queues now because we're sending the data.
+	 *
+	 * So, no rollback here, just try as hard as we can.
+	 */
+	foreach_tcp_flow(i, flow, FLOW_MAX) {
+		if ((rc = tcp_flow_migrate_source_ext(fd, i, &flow->tcp)))
+			err("Extended data for flow %u: %s", i, strerror_(-rc));
+	}
+
+	return 0;
+}
+
+/**
+ * flow_migrate_target() - Receive flows and insert in flow table
+ * @c:		Execution context
+ * @stage:	Migration stage information (unused)
+ * @fd:		Migration file descriptor
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
+			int fd)
+{
+	uint32_t count;
+	unsigned i;
+	int rc;
+
+	(void)stage;
+
+	if (read_all_buf(fd, &count, sizeof(count)))
+		return errno;
+
+	count = ntohl(count);
+	debug("Receiving %u flows", count);
+
+	if ((rc = flow_migrate_repair_all(c, true)))
+		return -rc;
+
+	repair_flush(c);
+
+	/* TODO: flow header with type, instead? */
+	for (i = 0; i < count; i++) {
+		rc = tcp_flow_migrate_target(c, fd);
+		if (rc) {
+			debug("Bad target data for flow %u: %s, abort",
+			      i, strerror_(-rc));
+			return -rc;
+		}
+	}
+
+	repair_flush(c);
+
+	for (i = 0; i < count; i++) {
+		rc = tcp_flow_migrate_target_ext(c, flowtab + i, fd);
+		if (rc) {
+			debug("Bad target extended data for flow %u: %s, abort",
+			      i, strerror_(-rc));
+			return -rc;
+		}
+	}
+
+	return 0;
+}
+
 /**
  * flow_init() - Initialise flow related data structures
  */
diff --git a/flow.h b/flow.h
index 24ba3ef..675726e 100644
--- a/flow.h
+++ b/flow.h
@@ -249,6 +249,14 @@ union flow;
 
 void flow_init(void);
 void flow_defer_handler(const struct ctx *c, const struct timespec *now);
+int flow_migrate_source_early(struct ctx *c, const struct migrate_stage *stage,
+			      int fd);
+int flow_migrate_source_pre(struct ctx *c, const struct migrate_stage *stage,
+			    int fd);
+int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
+			int fd);
+int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
+			int fd);
 
 void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 	__attribute__((format(printf, 3, 4)));
diff --git a/migrate.c b/migrate.c
index 1c59016..c5c6663 100644
--- a/migrate.c
+++ b/migrate.c
@@ -98,11 +98,30 @@ static int seen_addrs_target_v1(struct ctx *c,
 
 /* Stages for version 1 */
 static const struct migrate_stage stages_v1[] = {
+	/* FIXME: With this step, close() in tcp_flow_migrate_source_ext()
+	 * *sometimes* closes the connection for real.
+	 */
+/*	{
+		.name = "shrink TCP windows",
+		.source = flow_migrate_source_early,
+		.target = NULL,
+	},
+*/
 	{
 		.name = "observed addresses",
 		.source = seen_addrs_source_v1,
 		.target = seen_addrs_target_v1,
 	},
+	{
+		.name = "prepare flows",
+		.source = flow_migrate_source_pre,
+		.target = NULL,
+	},
+	{
+		.name = "transfer flows",
+		.source = flow_migrate_source,
+		.target = flow_migrate_target,
+	},
 	{ 0 },
 };
 
diff --git a/passt.c b/passt.c
index 6f9fb4d..68d1a28 100644
--- a/passt.c
+++ b/passt.c
@@ -223,9 +223,6 @@ int main(int argc, char **argv)
 		if (sigaction(SIGCHLD, &sa, NULL))
 			die_perror("Couldn't install signal handlers");
 
-		if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
-			die_perror("Couldn't set disposition for SIGPIPE");
-
 		c.mode = MODE_PASTA;
 	} else if (strstr(name, "passt")) {
 		c.mode = MODE_PASST;
@@ -233,6 +230,9 @@ int main(int argc, char **argv)
 		_exit(EXIT_FAILURE);
 	}
 
+	if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
+		die_perror("Couldn't set disposition for SIGPIPE");
+
 	madvise(pkt_buf, TAP_BUF_BYTES, MADV_HUGEPAGE);
 
 	c.epollfd = epoll_create1(EPOLL_CLOEXEC);
diff --git a/repair.c b/repair.c
index 784b994..da85edb 100644
--- a/repair.c
+++ b/repair.c
@@ -190,7 +190,6 @@ int repair_flush(struct ctx *c)
  *
  * Return: 0 on success, negative error code on failure
  */
-/* cppcheck-suppress unusedFunction */
 int repair_set(struct ctx *c, int s, int cmd)
 {
 	int rc;
diff --git a/tcp.c b/tcp.c
index af6bd95..78db64f 100644
--- a/tcp.c
+++ b/tcp.c
@@ -280,6 +280,7 @@
 #include <stddef.h>
 #include <string.h>
 #include <sys/epoll.h>
+#include <sys/ioctl.h>
 #include <sys/socket.h>
 #include <sys/timerfd.h>
 #include <sys/types.h>
@@ -287,6 +288,8 @@
 #include <time.h>
 #include <arpa/inet.h>
 
+#include <linux/sockios.h>
+
 #include "checksum.h"
 #include "util.h"
 #include "iov.h"
@@ -299,6 +302,7 @@
 #include "log.h"
 #include "inany.h"
 #include "flow.h"
+#include "repair.h"
 #include "linux_dep.h"
 
 #include "flow_table.h"
@@ -326,6 +330,19 @@
 	 ((conn)->events & (SOCK_FIN_RCVD | TAP_FIN_RCVD)))
 #define CONN_HAS(conn, set)	(((conn)->events & (set)) == (set))
 
+/* Buffers to migrate pending data from send and receive queues. No, they don't
+ * use memory if we don't use them. And we're going away after this, so splurge.
+ */
+#define TCP_MIGRATE_SND_QUEUE_MAX	(64 << 20)
+#define TCP_MIGRATE_RCV_QUEUE_MAX	(64 << 20)
+uint8_t tcp_migrate_snd_queue		[TCP_MIGRATE_SND_QUEUE_MAX];
+uint8_t tcp_migrate_rcv_queue		[TCP_MIGRATE_RCV_QUEUE_MAX];
+
+#define TCP_MIGRATE_RESTORE_CHUNK_MIN	1024 /* Try smaller when above this */
+
+/* "Extended" data (not stored in the flow table) for TCP flow migration */
+static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
+
 static const char *tcp_event_str[] __attribute((__unused__)) = {
 	"SOCK_ACCEPTED", "TAP_SYN_RCVD", "ESTABLISHED", "TAP_SYN_ACK_SENT",
 
@@ -2645,3 +2662,775 @@ void tcp_timer(struct ctx *c, const struct timespec *now)
 	if (c->mode == MODE_PASTA)
 		tcp_splice_refill(c);
 }
+
+/**
+ * tcp_flow_repair_on() - Enable repair mode for a single TCP flow
+ * @c:		Execution context
+ * @conn:	Pointer to the TCP connection structure
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn)
+{
+	int rc = 0;
+
+	if ((rc = repair_set(c, conn->sock, TCP_REPAIR_ON)))
+		err("Failed to set TCP_REPAIR");
+
+	return rc;
+}
+
+/**
+ * tcp_flow_repair_off() - Clear repair mode for a single TCP flow
+ * @c:		Execution context
+ * @conn:	Pointer to the TCP connection structure
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn)
+{
+	int rc = 0;
+
+	if ((rc = repair_set(c, conn->sock, TCP_REPAIR_OFF)))
+		err("Failed to clear TCP_REPAIR");
+
+	return rc;
+}
+
+/**
+ * tcp_flow_repair_queues() - Read or write socket queues, or send unsent data
+ * @s:			Socket
+ * @sndbuf:		Send queue buffer read or written/sent depending on @set
+ * @sndlen:		Length of send queue buffer to set, network order
+ * @notsentlen:		Length of not sent data, non-zero to actually _send_
+ * @rcvbuf:		Receive queue buffer, read or written depending on @set
+ * @rcvlen:		Length of receive queue buffer to set, network order
+ * @set:		Set or send (unsent data only) if true, dump if false
+ *
+ * Return: 0 on success, negative error code on failure
+ *
+ * #syscalls:vu ioctl
+ */
+static int tcp_flow_repair_queues(int s,
+				  uint8_t *sndbuf,
+				  uint32_t *sndlen, uint32_t *notsentlen,
+				  uint8_t *rcvbuf, uint32_t *rcvlen,
+				  bool set)
+{
+	ssize_t rc;
+	int v;
+
+	if (set && rcvbuf) { /* FIXME: can't check notsentlen, rework this */
+		v = TCP_SEND_QUEUE;
+		if (setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &v, sizeof(v))) {
+			rc = -errno;
+			err_perror("Selecting TCP_SEND_QUEUE on socket %i", s);
+			return rc;
+		}
+	}
+
+	if (set) {
+		size_t chunk;
+		uint8_t *p;
+
+		*sndlen = ntohl(*sndlen);
+		debug("Writing socket %i send queue: %u bytes", s, *sndlen);
+		p = sndbuf;
+		chunk = *sndlen;
+		while (*sndlen > 0) {
+			rc = send(s, p, MIN(*sndlen, chunk), 0);
+
+			if (rc < 0) {
+				if ((errno == ENOBUFS || errno == ENOMEM) &&
+				    chunk >= TCP_MIGRATE_RESTORE_CHUNK_MIN) {
+					chunk /= 2;
+					continue;
+				}
+
+				rc = -errno;
+				err_perror("Can't write socket %i send queue",
+					   s);
+				return rc;
+			}
+
+			*sndlen -= rc;
+			p += rc;
+		}
+
+		*notsentlen = ntohl(*notsentlen);
+		debug("Sending socket %i unsent queue: %u bytes", s,
+		      *notsentlen);
+		p = sndbuf;
+		chunk = *notsentlen;
+		while (*notsentlen > 0) {
+			rc = send(s, p, MIN(*notsentlen, chunk), 0);
+
+			if (rc < 0) {
+				if ((errno == ENOBUFS || errno == ENOMEM) &&
+				    chunk >= TCP_MIGRATE_RESTORE_CHUNK_MIN) {
+					chunk /= 2;
+					continue;
+				}
+
+				rc = -errno;
+				err_perror("Can't send socket %i unsent queue",
+					   s);
+				return rc;
+			}
+
+			*notsentlen -= rc;
+			p += rc;
+		}
+	} else {
+		rc = ioctl(s, SIOCOUTQ, sndlen);
+		if (rc < 0) {
+			rc = -errno;
+			err_perror("Getting send queue size for socket %i", s);
+			return rc;
+		}
+
+		rc = ioctl(s, SIOCOUTQNSD, notsentlen);
+		if (rc < 0) {
+			rc = -errno;
+			err_perror("Getting not sent count for socket %i", s);
+			return rc;
+		}
+
+		/* TODO: Skip "FIN" byte in queue if present */
+
+		rc = recv(s, sndbuf, MIN(*sndlen, TCP_MIGRATE_SND_QUEUE_MAX),
+			  MSG_PEEK);
+		if (rc < 0 && errno != EAGAIN) { /* EAGAIN means empty */
+			rc = -errno;
+			err_perror("Can't read send queue for socket %i", s);
+			return rc;
+		}
+
+		rc = MAX(0, rc);
+		debug("Read socket %i send queue: %zi (%u not sent)", s, rc,
+		      *notsentlen);
+
+		*sndlen = htonl(rc);
+		*notsentlen = htonl(*notsentlen);
+	}
+
+	if (!rcvbuf)
+		return 0;
+
+	v = TCP_RECV_QUEUE;
+	if (setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &v, sizeof(v))) {
+		rc = -errno;
+		err_perror("Selecting TCP_RECV_QUEUE for socket %i", s);
+		return rc;
+	}
+
+	if (set) {
+		uint8_t *p;
+		size_t chunk;
+
+		*rcvlen = ntohl(*rcvlen);
+		debug("Writing socket %i receive queue: %u bytes", s, *rcvlen);
+		p = rcvbuf;
+		chunk = *rcvlen;
+		while (*rcvlen > 0) {
+			rc = send(s, p, MIN(*rcvlen, chunk), 0);
+			if (rc < 0) {
+				if ((errno == ENOBUFS || errno == ENOMEM) &&
+				    chunk >= TCP_MIGRATE_RESTORE_CHUNK_MIN) {
+					chunk /= 2;
+					continue;
+				}
+
+				rc = -errno;
+				err_perror("Can't send socket %i receive queue",
+					   s);
+				return rc;
+			}
+
+			*rcvlen -= rc;
+			p += rc;
+		}
+	} else {
+		if (ioctl(s, SIOCINQ, rcvlen) < 0) {
+			rc = -errno;
+			err_perror("Get receive queue size for socket %i", s);
+			return rc;
+		}
+
+		/* TODO: Skip "FIN" byte in queue if present */
+
+		rc = recv(s, rcvbuf, MIN(*rcvlen, TCP_MIGRATE_RCV_QUEUE_MAX),
+					 MSG_PEEK);
+		if (rc < 0 && errno != EAGAIN) { /* EAGAIN means empty */
+			rc = -errno;
+			err_perror("Can't read receive queue for socket %i", s);
+			return rc;
+		}
+
+		rc = MAX(0, rc);
+		*rcvlen = htonl(rc);
+		debug("Read socket %i receive queue: %zi bytes", s, rc);
+	}
+
+	return 0;
+}
+
+/**
+ * tcp_flow_repair_seq() - Dump or set sequences
+ * @s:		Socket
+ * @snd_seq:	Send sequence, set on return if @set == false, network order
+ * @rcv_seq:	Receive sequence, set on return if @set == false, network order
+ * @set:	Set if true, dump if false
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+static int tcp_flow_repair_seq(int s, uint32_t *snd_seq, uint32_t *rcv_seq,
+			       bool set)
+{
+	socklen_t vlen = sizeof(uint32_t);
+	ssize_t rc;
+	int v;
+
+	v = TCP_SEND_QUEUE;
+	if (setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &v, sizeof(v))) {
+		rc = -errno;
+		err_perror("Selecting TCP_SEND_QUEUE on socket %i", s);
+		return rc;
+	}
+
+	if (set) {
+		*snd_seq = ntohl(*snd_seq);
+		if (setsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, snd_seq, vlen)) {
+			rc = -errno;
+			err_perror("Setting send sequence for socket %i", s);
+			return rc;
+		}
+		debug("Set send sequence for socket %i to %u", s, *snd_seq);
+	} else {
+		if (getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, snd_seq, &vlen)) {
+			rc = -errno;
+			err_perror("Dumping send sequence for socket %i", s);
+			return rc;
+		}
+		debug("Dumped send sequence for socket %i: %u", s, *snd_seq);
+		*snd_seq = htonl(*snd_seq);
+	}
+
+	v = TCP_RECV_QUEUE;
+	if (setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &v, sizeof(v))) {
+		rc = -errno;
+		err_perror("Selecting TCP_RECV_QUEUE for socket %i", s);
+		return rc;
+	}
+
+	if (set) {
+		*rcv_seq = ntohl(*rcv_seq);
+		if (setsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, rcv_seq, vlen)) {
+			rc = -errno;
+			err_perror("Setting receive sequence %u for socket %i",
+				   *rcv_seq, s);
+			return rc;
+		}
+		debug("Set receive sequence for socket %i to %u", s, *rcv_seq);
+	} else {
+		if (getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, rcv_seq, &vlen)) {
+			rc = -errno;
+			err_perror("Dumping receive sequence for socket %i", s);
+			return rc;
+		}
+		debug("Dumped receive sequence for socket %i: %u", s, *rcv_seq);
+		*rcv_seq = htonl(*rcv_seq);
+	}
+
+	return 0;
+}
+
+/**
+ * tcp_flow_repair_opt() - Dump or set repair "options" (MSS and window scale)
+ * @s:			Socket
+ * @snd_wscale:		Window scaling factor, send, network order
+ * @rcv_wscale:		Window scaling factor, receive, network order
+ * @mss:		Maximum Segment Size, socket side, network order
+ * @set:		Set if true, dump if false
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+int tcp_flow_repair_opt(int s, uint8_t *snd_wscale, uint8_t *rcv_wscale,
+			uint32_t *mss, bool set)
+{
+	struct tcp_info_linux tinfo;
+	struct tcp_repair_opt opts[2];
+	socklen_t sl;
+	int rc;
+
+	opts[0].opt_code = TCPOPT_WINDOW;
+	opts[1].opt_code = TCPOPT_MAXSEG;
+
+	if (set) {
+		*mss = ntohl(*mss);
+		debug("Setting repair options for socket %i:", s);
+		opts[0].opt_val = *snd_wscale + (*rcv_wscale << 16);
+		opts[1].opt_val = *mss;
+		debug("  window scale send %u, receive %u, MSS: %u",
+		      *snd_wscale, *rcv_wscale, *mss);
+
+		sl = sizeof(opts);
+		if (setsockopt(s, SOL_TCP, TCP_REPAIR_OPTIONS, opts, sl)) {
+			rc = -errno;
+			err_perror("Setting repair options for socket %i", s);
+			return rc;
+		}
+
+		sl = sizeof(*mss);
+		if (setsockopt(s, SOL_TCP, TCP_MAXSEG, mss, sl))
+			debug_perror("Setting MSS for socket %i", s);
+	} else {
+		sl = sizeof(tinfo);
+		if (getsockopt(s, SOL_TCP, TCP_INFO, &tinfo, &sl)) {
+			rc = -errno;
+			err_perror("Querying TCP_INFO for socket %i", s);
+			return rc;
+		}
+
+		*snd_wscale = tinfo.tcpi_snd_wscale;
+		*rcv_wscale = tinfo.tcpi_rcv_wscale;
+
+		/* TCP_INFO MSS is just the current value: ask explicitly */
+		sl = sizeof(*mss);
+		if (getsockopt(s, SOL_TCP, TCP_MAXSEG, mss, &sl)) {
+			rc = -errno;
+			err_perror("Getting MSS for socket %i", s);
+			return rc;
+		}
+		*mss = htonl(*mss);
+
+		debug("Got repair options for socket %i:", s);
+		debug("  window scale send %u, receive %u, MSS: %u",
+		      *snd_wscale, *rcv_wscale, ntohl(*mss));
+	}
+
+	return 0;
+}
+
+/**
+ * tcp_flow_repair_wnd() - Dump or set window parameters
+ * @snd_wl1:		Next sequence used in window probe (next sequence - 1)
+ * @snd_wnd:		Socket-side sending window, network order
+ * @max_window:		Window clamp, network order
+ * @rcv_wnd:		Socket-side receive window, network order
+ * @rcv_wup:		rcv_nxt on last window update sent, network order
+ * @wnd_out:		If NULL, set parameters, if given, get and return all
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+static int tcp_flow_repair_wnd(int s, uint32_t *snd_wl1, uint32_t *snd_wnd,
+			       uint32_t *max_window, uint32_t *rcv_wnd,
+			       uint32_t *rcv_wup,
+			       struct tcp_repair_window *wnd_out)
+{
+	struct tcp_repair_window wnd_copy, *wnd = wnd_out ? wnd_out : &wnd_copy;
+	socklen_t sl = sizeof(*wnd);
+	int rc;
+
+	if (!wnd_out) {
+		wnd->snd_wl1	= ntohl(*snd_wl1);
+		wnd->snd_wnd	= ntohl(*snd_wnd);
+		wnd->max_window	= ntohl(*max_window);
+		wnd->rcv_wnd	= ntohl(*rcv_wnd);
+		wnd->rcv_wup	= ntohl(*rcv_wup);
+
+		if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR_WINDOW, wnd, sl)) {
+			rc = -errno;
+			err_perror("Setting window repair data, socket %i", s);
+			return rc;
+		}
+	} else {
+		if (getsockopt(s, IPPROTO_TCP, TCP_REPAIR_WINDOW, wnd, &sl)) {
+			rc = -errno;
+			err_perror("Getting window repair data, socket %i", s);
+			return rc;
+		}
+
+		*snd_wl1	= htonl(wnd->snd_wl1);
+		*snd_wnd	= htonl(wnd->snd_wnd);
+		*max_window	= htonl(wnd->max_window);
+		*rcv_wnd	= htonl(wnd->rcv_wnd);
+		*rcv_wup	= htonl(wnd->rcv_wup);
+	}
+
+	return 0;
+}
+
+/**
+ * tcp_flow_migrate_shrink_window() - Dump window data, decrease socket window
+ * @fidx:	Flow index
+ * @conn:	Pointer to the TCP connection structure
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+int tcp_flow_migrate_shrink_window(int fidx, const struct tcp_tap_conn *conn)
+{
+	struct tcp_tap_transfer_ext *t = &migrate_ext[fidx];
+	struct tcp_repair_window wnd;
+	socklen_t sl = sizeof(wnd);
+	int s = conn->sock;
+
+	if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &((int){ 0 }), sizeof(int)))
+		debug("TCP: failed to set SO_RCVBUF to minimum value");
+
+	/* Dump window data as it is for the target, before touching stuff */
+	tcp_flow_repair_wnd(s, &t->sock_snd_wl1, &t->sock_snd_wnd,
+			    &t->sock_max_window, &t->sock_rcv_wnd,
+			    &t->sock_rcv_wup, &wnd);
+
+	wnd.rcv_wnd = 0;
+
+	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR_WINDOW, &wnd, sl))
+		debug_perror("Setting window repair data, socket %i", s);
+
+	return 0;
+}
+
+/**
+ * tcp_flow_migrate_source() - Send data (flow table part) for a single flow
+ * @c:		Execution context
+ * @fd:		Descriptor for state migration
+ * @conn:	Pointer to the TCP connection structure
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+int tcp_flow_migrate_source(int fd, const struct tcp_tap_conn *conn)
+{
+	struct tcp_tap_transfer t = {
+		.retrans		= conn->retrans,
+		.ws_from_tap		= conn->ws_from_tap,
+		.ws_to_tap		= conn->ws_to_tap,
+		.events			= conn->events,
+
+		.tap_mss		= htonl(MSS_GET(conn)),
+
+		.sndbuf			= htonl(conn->sndbuf),
+
+		.flags			= conn->flags,
+		.seq_dup_ack_approx	= conn->seq_dup_ack_approx,
+
+		.wnd_from_tap		= htons(conn->wnd_from_tap),
+		.wnd_to_tap		= htons(conn->wnd_to_tap),
+
+		.seq_to_tap		= htonl(conn->seq_to_tap),
+		.seq_ack_from_tap	= htonl(conn->seq_ack_from_tap),
+		.seq_from_tap		= htonl(conn->seq_from_tap),
+		.seq_ack_to_tap		= htonl(conn->seq_ack_to_tap),
+		.seq_init_from_tap	= htonl(conn->seq_init_from_tap),
+	};
+	int rc;
+
+	memcpy(&t.pif, conn->f.pif, sizeof(t.pif));
+	memcpy(&t.side, conn->f.side, sizeof(t.side));
+
+	if (write_all_buf(fd, &t, sizeof(t))) {
+		rc = -errno;
+		err_perror("Failed to write migration data for socket %i",
+			   conn->sock);
+		return rc;
+	}
+
+	return 0;
+}
+
+/**
+ * tcp_flow_migrate_source_ext() - Dump queues, close sockets, send final data
+ * @fd:		Descriptor for state migration
+ * @fidx:	Flow index
+ * @conn:	Pointer to the TCP connection structure
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+int tcp_flow_migrate_source_ext(int fd, int fidx,
+				const struct tcp_tap_conn *conn)
+{
+	struct tcp_tap_transfer_ext *t = &migrate_ext[fidx];
+	struct tcp_repair_window wnd;
+	int s = conn->sock;
+	int rc;
+
+	/* FIXME: Reenable dump in tcp_flow_migrate_shrink_window() */
+	tcp_flow_repair_wnd(s, &t->sock_snd_wl1, &t->sock_snd_wnd,
+			    &t->sock_max_window, &t->sock_rcv_wnd,
+			    &t->sock_rcv_wup, &wnd);
+
+	rc = tcp_flow_repair_seq(s, &t->sock_seq_snd, &t->sock_seq_rcv, false);
+	if (rc) {
+		err("Failed to get sequences on source for socket %i: %s",
+		    s, strerror_(-rc));
+		return rc;
+	}
+
+	tcp_flow_repair_opt(s, &t->snd_wscale, &t->rcv_wscale, &t->sock_mss,
+			    false);
+
+	/* Dump receive queue as late as possible */
+	rc = tcp_flow_repair_queues(s, tcp_migrate_snd_queue,
+				    &t->sndlen, &t->notsentlen,
+				    tcp_migrate_rcv_queue, &t->rcvlen, false);
+	if (rc) {
+		err("Failed to dump queues on source for socket %i: %s",
+		    s, strerror_(-rc));
+		return rc;
+	}
+
+	if (ntohl(t->sndlen) > TCP_MIGRATE_SND_QUEUE_MAX ||
+	    ntohl(t->notsentlen) > ntohl(t->sndlen) ||
+	    ntohl(t->rcvlen) > TCP_MIGRATE_RCV_QUEUE_MAX) {
+		err("Bad data queues length, socket %i, send: %u, receive: %u",
+		    s, ntohl(t->sndlen), ntohl(t->rcvlen));
+		return -EINVAL;
+	}
+
+	/* FIXME: it's either this or flow_migrate_source_early(), why? */
+	close(s);
+
+	if (write_all_buf(fd, t, sizeof(*t))) {
+		rc = -errno;
+		err_perror("Failed to write extended data for socket %i", s);
+		return rc;
+	}
+
+	if (write_all_buf(fd, tcp_migrate_snd_queue, ntohl(t->sndlen))) {
+		rc = -errno;
+		err_perror("Failed to write send queue data for socket %i", s);
+		return rc;
+	}
+
+	if (write_all_buf(fd, tcp_migrate_rcv_queue, ntohl(t->rcvlen))) {
+		rc = -errno;
+		err_perror("Failed to write receive queue data for socket %i",
+			   s);
+		return rc;
+	}
+
+	return 0;
+}
+
+/**
+ * tcp_flow_repair_socket() - Open and bind socket, request repair mode
+ * @c:		Execution context
+ * @conn:	Pointer to the TCP connection structure
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+int tcp_flow_repair_socket(struct ctx *c, struct tcp_tap_conn *conn)
+{
+	sa_family_t af = CONN_V4(conn) ? AF_INET : AF_INET6;
+	const struct flowside *sockside = HOSTFLOW(conn);
+	union sockaddr_inany a;
+	socklen_t sl;
+	int s, rc;
+
+	pif_sockaddr(c, &a, &sl, PIF_HOST, &sockside->oaddr, sockside->oport);
+
+	if ((conn->sock = socket(af, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
+				 IPPROTO_TCP)) < 0) {
+		rc = -errno;
+		err_perror("Failed to create socket for migrated flow");
+		return rc;
+	}
+	s = conn->sock;
+
+	setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &(int){ 1 }, sizeof(int));
+
+	tcp_sock_set_bufsize(c, s);
+	tcp_sock_set_nodelay(s);
+
+	if (bind(s, &a.sa, sizeof(a)) < 0) {
+		rc = -errno;
+		err_perror("Failed to bind socket %i for migrated flow", s);
+		close(s);
+		conn->sock = -1;
+		return rc;
+	}
+
+	rc = tcp_flow_repair_on(c, conn);
+	if (rc) {
+		close(s);
+		conn->sock = -1;
+		return rc;
+	}
+
+	return 0;
+}
+
+/**
+ * tcp_flow_repair_connect() - Connect socket in repair mode, then turn it off
+ * @c:		Execution context
+ * @conn:	Pointer to the TCP connection structure
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+static int tcp_flow_repair_connect(const struct ctx *c,
+				   struct tcp_tap_conn *conn)
+{
+	const struct flowside *tgt = HOSTFLOW(conn);
+	int rc;
+
+	rc = flowside_connect(c, conn->sock, PIF_HOST, tgt);
+	if (rc) {
+		rc = -errno;
+		err_perror("Failed to connect migrated socket %i", conn->sock);
+		return rc;
+	}
+
+	conn->in_epoll = 0;
+	conn->timer = -1;
+	if ((rc = tcp_epoll_ctl(c, conn))) {
+		debug("Failed to subscribe to epoll for migrated socket %i: %s",
+		      conn->sock, strerror_(-rc));
+	}
+
+	return 0;
+}
+
+/**
+ * tcp_flow_migrate_target() - Receive data (flow table part) for flow, insert
+ * @c:		Execution context
+ * @fd:		Descriptor for state migration
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+int tcp_flow_migrate_target(struct ctx *c, int fd)
+{
+	struct tcp_tap_transfer t;
+	struct tcp_tap_conn *conn;
+	union flow *flow;
+	int rc;
+
+	if (!(flow = flow_alloc())) {
+		err("Flow table full on migration target");
+		return -ENOMEM;
+	}
+
+	if (read_all_buf(fd, &t, sizeof(t))) {
+		err_perror("Failed to receive migration data");
+		return -errno;
+	}
+
+	flow->f.state = FLOW_STATE_TGT;
+	memcpy(&flow->f.pif, &t.pif, sizeof(flow->f.pif));
+	memcpy(&flow->f.side, &t.side, sizeof(flow->f.side));
+	conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
+
+	conn->retrans			= t.retrans;
+	conn->ws_from_tap		= t.ws_from_tap;
+	conn->ws_to_tap			= t.ws_to_tap;
+	conn->events			= t.events;
+
+	conn->sndbuf			= htonl(t.sndbuf);
+
+	conn->flags			= t.flags;
+	conn->seq_dup_ack_approx	= t.seq_dup_ack_approx;
+
+	MSS_SET(conn,			  ntohl(t.tap_mss));
+
+	conn->wnd_from_tap		= ntohs(t.wnd_from_tap);
+	conn->wnd_to_tap		= ntohs(t.wnd_to_tap);
+
+	conn->seq_to_tap		= ntohl(t.seq_to_tap);
+	conn->seq_ack_from_tap		= ntohl(t.seq_ack_from_tap);
+	conn->seq_from_tap		= ntohl(t.seq_from_tap);
+	conn->seq_ack_to_tap		= ntohl(t.seq_ack_to_tap);
+	conn->seq_init_from_tap		= ntohl(t.seq_init_from_tap);
+
+	if ((rc = tcp_flow_repair_socket(c, conn)))
+		return rc;
+
+	flow_hash_insert(c, TAP_SIDX(conn));
+	FLOW_ACTIVATE(conn);
+
+	return 0;
+}
+
+/**
+ * tcp_flow_migrate_target_ext() - Receive extended data for flow, set, connect
+ * @c:		Execution context
+ * @flow:	Existing flow for this connection data
+ * @fd:		Descriptor for state migration
+ *
+ * Return: 0 on success, negative code on failure, but 0 on connection reset
+ */
+int tcp_flow_migrate_target_ext(struct ctx *c, union flow *flow, int fd)
+{
+	struct tcp_tap_conn *conn = &flow->tcp;
+	struct tcp_tap_transfer_ext t;
+	uint32_t peek_offset, len;
+	int s = conn->sock, rc;
+
+	if (read_all_buf(fd, &t, sizeof(t))) {
+		rc = -errno;
+		err_perror("Failed to read extended data for socket %i", s);
+		return rc;
+	}
+
+	if (ntohl(t.sndlen) > TCP_MIGRATE_SND_QUEUE_MAX ||
+	    ntohl(t.notsentlen) > ntohl(t.sndlen) ||
+	    ntohl(t.rcvlen) > TCP_MIGRATE_RCV_QUEUE_MAX) {
+		err("Bad data queues length, socket %i, send: %u, receive: %u",
+		    s, ntohl(t.sndlen), ntohl(t.rcvlen));
+		return -EINVAL;
+	}
+
+	if (read_all_buf(fd, tcp_migrate_snd_queue, ntohl(t.sndlen))) {
+		rc = -errno;
+		err_perror("Failed to read send queue data for socket %i", s);
+		return rc;
+	}
+
+	if (read_all_buf(fd, tcp_migrate_rcv_queue, ntohl(t.rcvlen))) {
+		rc = -errno;
+		err_perror("Failed to read receive queue data for socket %i",
+			   s);
+		return rc;
+	}
+
+	if ((rc = tcp_flow_repair_seq(s, &t.sock_seq_snd, &t.sock_seq_rcv,
+				      true)))
+		return rc;
+
+	if ((rc = tcp_flow_repair_connect(c, conn)))
+		return rc;
+
+	if ((rc = tcp_flow_repair_opt(s, &t.snd_wscale, &t.rcv_wscale,
+				      &t.sock_mss, true)))
+		return rc;
+
+	if ((rc = tcp_flow_repair_queues(s,
+					 tcp_migrate_snd_queue, &t.sndlen,
+					 &((uint32_t){ 0 }), /* Sent only */
+					 tcp_migrate_rcv_queue, &t.rcvlen,
+					 true)))
+		debug_perror("Error while repairing queues for socket %i", s);
+
+	if ((rc = tcp_flow_repair_wnd(s, &t.sock_snd_wl1, &t.sock_snd_wnd,
+				      &t.sock_max_window, &t.sock_rcv_wnd,
+				      &t.sock_rcv_wup, NULL)))
+		debug_perror("Error while repairing window for socket %i", s);
+
+	tcp_flow_repair_off(c, conn);
+	repair_flush(c);	/* FIXME: batch this? */
+
+	/* Now the unsent part of send queue */
+	len = htonl(ntohl(t.sndlen) - ntohl(t.notsentlen));
+	if ((rc = tcp_flow_repair_queues(s,
+					 tcp_migrate_snd_queue + ntohl(len),
+					 &((uint32_t){ 0 }), &len,
+					 NULL, &((uint32_t){ 0 }),
+					 true)))
+		debug_perror("Sending unsent data for socket %i", s);
+
+	peek_offset = conn->seq_to_tap - conn->seq_ack_from_tap;
+	if (tcp_set_peek_offset(conn->sock, peek_offset))
+		tcp_rst(c, conn);
+
+	tcp_send_flag(c, conn, ACK);
+
+	return 0;
+}
diff --git a/tcp_conn.h b/tcp_conn.h
index d342680..b64e857 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -96,6 +96,89 @@ struct tcp_tap_conn {
 	uint32_t	seq_init_from_tap;
 };
 
+/**
+ * struct tcp_tap_transfer - Migrated TCP data, flow table part, network order
+ * @pif:		Interfaces for each side of the flow
+ * @side:		Addresses and ports for each side of the flow
+ * @retrans:		Number of retransmissions occurred due to ACK_TIMEOUT
+ * @ws_from_tap:	Window scaling factor advertised from tap/guest
+ * @ws_to_tap:		Window scaling factor advertised to tap/guest
+ * @events:		Connection events, implying connection states
+ * @tap_mss:		MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
+ * @sndbuf:		Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS
+ * @flags:		Connection flags representing internal attributes
+ * @seq_dup_ack_approx:	Last duplicate ACK number sent to tap
+ * @wnd_from_tap:	Last window size from tap, unscaled (as received)
+ * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
+ * @seq_to_tap:		Next sequence for packets to tap
+ * @seq_ack_from_tap:	Last ACK number received from tap
+ * @seq_from_tap:	Next sequence for packets from tap (not actually sent)
+ * @seq_ack_to_tap:	Last ACK number sent to tap
+ * @seq_init_from_tap:	Initial sequence number from tap
+*/
+struct tcp_tap_transfer {
+	uint8_t		pif[SIDES];
+	struct flowside	side[SIDES];
+
+	uint8_t		retrans;
+	uint8_t		ws_from_tap;
+	uint8_t		ws_to_tap;
+	uint8_t		events;
+
+	uint32_t	tap_mss;
+
+	uint32_t	sndbuf;
+
+	uint8_t		flags;
+	uint8_t		seq_dup_ack_approx;
+
+	uint16_t	wnd_from_tap;
+	uint16_t	wnd_to_tap;
+
+	uint32_t	seq_to_tap;
+	uint32_t	seq_ack_from_tap;
+	uint32_t	seq_from_tap;
+	uint32_t	seq_ack_to_tap;
+	uint32_t	seq_init_from_tap;
+} __attribute__((packed, aligned(__alignof__(uint32_t))));
+
+/**
+ * struct tcp_tap_transfer_ext - Migrated TCP data, outside flow, network order
+ * @sock_seq_snd:	Socket-side send sequence
+ * @sock_seq_rcv:	Socket-side receive sequence
+ * @sndlen:		Length of pending send queue (unacknowledged / not sent)
+ * @notsentlen:		Part of pending send queue that wasn't sent out yet
+ * @rcvlen:		Length of pending receive queue
+ * @sock_mss:		Socket-side MSS
+ * @sock_snd_wl1:	Next sequence used in window probe (next sequence - 1)
+ * @sock_snd_wnd:	Socket-side sending window
+ * @sock_max_window:	Window clamp
+ * @sock_rcv_wnd:	Socket-side receive window
+ * @sock_rcv_wup:	rcv_nxt on last window update sent
+ * @snd_wscale:		Window scaling factor, send
+ * @snd_wscale:		Window scaling factor, receive
+ */
+struct tcp_tap_transfer_ext {
+	uint32_t	sock_seq_snd;
+	uint32_t	sock_seq_rcv;
+
+	uint32_t	sndlen;
+	uint32_t	notsentlen;
+	uint32_t	rcvlen;
+
+	uint32_t	sock_mss;
+
+	/* We can't just use struct tcp_repair_window: we need network order */
+	uint32_t	sock_snd_wl1;
+	uint32_t	sock_snd_wnd;
+	uint32_t	sock_max_window;
+	uint32_t	sock_rcv_wnd;
+	uint32_t	sock_rcv_wup;
+
+	uint8_t		snd_wscale;
+	uint8_t		rcv_wscale;
+} __attribute__((packed, aligned(__alignof__(uint32_t))));
+
 /**
  * struct tcp_splice_conn - Descriptor for a spliced TCP connection
  * @f:			Generic flow information
@@ -140,6 +223,18 @@ extern int init_sock_pool4	[TCP_SOCK_POOL_SIZE];
 extern int init_sock_pool6	[TCP_SOCK_POOL_SIZE];
 
 bool tcp_flow_defer(const struct tcp_tap_conn *conn);
+
+int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn);
+int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn);
+
+int tcp_flow_migrate_shrink_window(int fidx, const struct tcp_tap_conn *conn);
+int tcp_flow_migrate_source(int fd, const struct tcp_tap_conn *conn);
+int tcp_flow_migrate_source_ext(int fd, int fidx,
+				const struct tcp_tap_conn *conn);
+
+int tcp_flow_migrate_target(struct ctx *c, int fd);
+int tcp_flow_migrate_target_ext(struct ctx *c, union flow *flow, int fd);
+
 bool tcp_splice_flow_defer(struct tcp_splice_conn *conn);
 void tcp_splice_timer(const struct ctx *c, struct tcp_splice_conn *conn);
 int tcp_conn_pool_sock(int pool[]);
-- 
2.43.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v13 6/6] test: Add migration tests
  2025-02-09 22:19 [PATCH v13 0/6] State migration, kind of draft again Stefano Brivio
                   ` (4 preceding siblings ...)
  2025-02-09 22:20 ` [PATCH v13 5/6] migrate: Migrate TCP flows Stefano Brivio
@ 2025-02-09 22:20 ` Stefano Brivio
  5 siblings, 0 replies; 14+ messages in thread
From: Stefano Brivio @ 2025-02-09 22:20 UTC (permalink / raw)
  To: passt-dev; +Cc: David Gibson

PCAP=1 ./run migrate/basic is oddly satisfying.

PCAP=1 ./run migrate/bidirectional and migrate/iperf3_out4 are even
better.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 test/lib/layout            |  55 ++++++++++++++-
 test/lib/setup             | 134 +++++++++++++++++++++++++++++++++++++
 test/lib/test              |  48 +++++++++++++
 test/migrate/basic         |  59 ++++++++++++++++
 test/migrate/bidirectional |  64 ++++++++++++++++++
 test/migrate/iperf3_in4    |  50 ++++++++++++++
 test/migrate/iperf3_in6    |  58 ++++++++++++++++
 test/migrate/iperf3_out4   |  50 ++++++++++++++
 test/migrate/iperf3_out6   |  58 ++++++++++++++++
 test/run                   |  10 +++
 10 files changed, 585 insertions(+), 1 deletion(-)
 create mode 100644 test/migrate/basic
 create mode 100644 test/migrate/bidirectional
 create mode 100644 test/migrate/iperf3_in4
 create mode 100644 test/migrate/iperf3_in6
 create mode 100644 test/migrate/iperf3_out4
 create mode 100644 test/migrate/iperf3_out6

diff --git a/test/lib/layout b/test/lib/layout
index 4d03572..fddcdc4 100644
--- a/test/lib/layout
+++ b/test/lib/layout
@@ -134,6 +134,54 @@ layout_two_guests() {
 
 	get_info_cols
 
+	pane_watch_contexts ${PANE_GUEST_1} "guest #1 in namespace #1" qemu_1 guest_1
+	pane_watch_contexts ${PANE_GUEST_2} "guest #2 in namespace #1" qemu_2 guest_2
+
+	tmux send-keys -l -t ${PANE_INFO} 'while cat '"$STATEBASE/log_pipe"'; do :; done'
+	tmux send-keys -t ${PANE_INFO} -N 100 C-m
+	tmux select-pane -t ${PANE_INFO} -T "test log"
+
+	pane_watch_contexts ${PANE_HOST} host host
+	pane_watch_contexts ${PANE_PASST_1} "passt #1 in namespace #1" pasta_1 passt_1
+	pane_watch_contexts ${PANE_PASST_2} "passt #2 in namespace #1" pasta_1 passt_2
+
+	info_layout "two guests, two passt instances, in namespaces"
+
+	sleep 1
+}
+
+# layout_migrate() - Two guest panes, two passt panes, two passt-repair panes,
+#		     plus host and log
+layout_migrate() {
+	sleep 1
+
+	tmux kill-pane -a -t 0
+	cmd_write 0 clear
+
+	tmux split-window -v -t passt_test
+	tmux split-window -h -l '33%'
+	tmux split-window -h -t passt_test:1.1
+
+	tmux split-window -h -l '35%' -t passt_test:1.0
+	tmux split-window -v -t passt_test:1.0
+
+	tmux split-window -v -t passt_test:1.4
+	tmux split-window -v -t passt_test:1.6
+
+	tmux split-window -v -t passt_test:1.3
+
+	PANE_GUEST_1=0
+	PANE_GUEST_2=1
+	PANE_INFO=2
+	PANE_MON=3
+	PANE_HOST=4
+	PANE_PASST_REPAIR_1=5
+	PANE_PASST_1=6
+	PANE_PASST_REPAIR_2=7
+	PANE_PASST_2=8
+
+	get_info_cols
+
 	pane_watch_contexts ${PANE_GUEST_1} "guest #1 in namespace #1" qemu_1 guest_1
 	pane_watch_contexts ${PANE_GUEST_2} "guest #2 in namespace #2" qemu_2 guest_2
 
@@ -141,11 +189,16 @@ layout_two_guests() {
 	tmux send-keys -t ${PANE_INFO} -N 100 C-m
 	tmux select-pane -t ${PANE_INFO} -T "test log"
 
+	pane_watch_contexts ${PANE_MON} "QEMU monitor" mon mon
+
 	pane_watch_contexts ${PANE_HOST} host host
+	pane_watch_contexts ${PANE_PASST_REPAIR_1} "passt-repair #1 in namespace #1" repair_1 passt_repair_1
 	pane_watch_contexts ${PANE_PASST_1} "passt #1 in namespace #1" pasta_1 passt_1
+
+	pane_watch_contexts ${PANE_PASST_REPAIR_2} "passt-repair #2 in namespace #2" repair_2 passt_repair_2
 	pane_watch_contexts ${PANE_PASST_2} "passt #2 in namespace #2" pasta_2 passt_2
 
-	info_layout "two guests, two passt instances, in namespaces"
+	info_layout "two guests, two passt + passt-repair instances, in namespaces"
 
 	sleep 1
 }
diff --git a/test/lib/setup b/test/lib/setup
index 580825f..589540e 100755
--- a/test/lib/setup
+++ b/test/lib/setup
@@ -305,6 +305,117 @@ setup_two_guests() {
 	context_setup_guest guest_2 ${GUEST_2_CID}
 }
 
+# setup_migrate() - Set up two namespace, run qemu, passt/passt-repair in both
+setup_migrate() {
+	context_setup_host host
+	context_setup_host mon
+	context_setup_host pasta_1
+	context_setup_host pasta_2
+
+	layout_migrate
+
+	# Ports:
+	#
+	#         guest #1  |  guest #2 |   ns #1   |    host
+	#         --------- |-----------|-----------|------------
+	#  10001  as server |           | to guest  |  to ns #1
+	#  10002            |           | as server |  to ns #1
+	#  10003            |           |  to init  |  as server
+	#  10004            | as server | to guest  |  to ns #1
+
+	__opts=
+	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/pasta_1.pcap"
+	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
+	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
+
+	__map_host4=192.0.2.1
+	__map_host6=2001:db8:9a55::1
+	__map_ns4=192.0.2.2
+	__map_ns6=2001:db8:9a55::2
+
+	# Option 1: send stuff via spliced path in pasta
+	# context_run_bg pasta_1 "./pasta ${__opts} --trace -l /tmp/pasta1.log -P ${STATESETUP}/pasta_1.pid -t 10001,10002 -T 10003 -u 10001,10002 -U 10003 --config-net ${NSTOOL} hold ${STATESETUP}/ns1.hold"
+	# Option 2: send stuff via tap (--map-guest-addr) instead (useful to see capture of full migration)
+	context_run_bg pasta_1 "./pasta ${__opts} --trace -l /tmp/pasta1.log -P ${STATESETUP}/pasta_1.pid -t 10001,10002,10004 -T 10003 -u 10001,10002,10004 -U 10003 --map-guest-addr ${__map_host4} --map-guest-addr ${__map_host6} --config-net ${NSTOOL} hold ${STATESETUP}/ns1.hold"
+	context_setup_nstool passt_1 ${STATESETUP}/ns1.hold
+	context_setup_nstool passt_repair_1 ${STATESETUP}/ns1.hold
+
+	context_setup_nstool passt_2 ${STATESETUP}/ns1.hold
+	context_setup_nstool passt_repair_2 ${STATESETUP}/ns1.hold
+
+	context_setup_nstool qemu_1 ${STATESETUP}/ns1.hold
+	context_setup_nstool qemu_2 ${STATESETUP}/ns1.hold
+
+	__ifname="$(context_run qemu_1 "ip -j link show | jq -rM '.[] | select(.link_type == \"ether\").ifname'")"
+
+	sleep 1
+
+	__opts="--vhost-user"
+	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_1.pcap"
+	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
+	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
+
+	context_run_bg passt_1 "./passt -s ${STATESETUP}/passt_1.socket -P ${STATESETUP}/passt_1.pid -f ${__opts} -t 10001 -u 10001"
+	wait_for [ -f "${STATESETUP}/passt_1.pid" ]
+
+	context_run_bg passt_repair_1 "./passt-repair ${STATESETUP}/passt_1.socket.repair"
+
+	__opts="--vhost-user"
+	[ ${PCAP} -eq 1 ] && __opts="${__opts} -p ${LOGDIR}/passt_2.pcap"
+	[ ${DEBUG} -eq 1 ] && __opts="${__opts} -d"
+	[ ${TRACE} -eq 1 ] && __opts="${__opts} --trace"
+
+	context_run_bg passt_2 "./passt -s ${STATESETUP}/passt_2.socket -P ${STATESETUP}/passt_2.pid -f ${__opts} -t 10004 -u 10004"
+	wait_for [ -f "${STATESETUP}/passt_2.pid" ]
+
+	context_run_bg passt_repair_2 "./passt-repair ${STATESETUP}/passt_2.socket.repair"
+
+	__vmem="512M"	# Keep migration fast
+	__qemu_netdev1="					       \
+		-chardev socket,id=c,path=${STATESETUP}/passt_1.socket \
+		-netdev vhost-user,id=v,chardev=c		       \
+		-device virtio-net,netdev=v			       \
+		-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
+		-numa node,memdev=m"
+	__qemu_netdev2="					       \
+		-chardev socket,id=c,path=${STATESETUP}/passt_2.socket \
+		-netdev vhost-user,id=v,chardev=c		       \
+		-device virtio-net,netdev=v			       \
+		-object memory-backend-memfd,id=m,share=on,size=${__vmem} \
+		-numa node,memdev=m"
+
+	GUEST_1_CID=94557
+	context_run_bg qemu_1 'qemu-system-'"${QEMU_ARCH}"		     \
+		' -M accel=kvm:tcg'                                          \
+		' -m '${__vmem}' -cpu host -smp '${VCPUS}		     \
+		' -kernel '"${KERNEL}"					     \
+		' -initrd '${INITRAMFS}' -nographic -serial stdio'	     \
+		' -nodefaults'						     \
+		' -append "console=ttyS0 mitigations=off apparmor=0" '	     \
+		" ${__qemu_netdev1}"					     \
+		" -pidfile ${STATESETUP}/qemu_1.pid"			     \
+		" -device vhost-vsock-pci,guest-cid=$GUEST_1_CID"	     \
+		" -monitor unix:${STATESETUP}/qemu_1_mon.sock,server,nowait"
+
+	GUEST_2_CID=94558
+	context_run_bg qemu_2 'qemu-system-'"${QEMU_ARCH}"		     \
+		' -M accel=kvm:tcg'                                          \
+		' -m '${__vmem}' -cpu host -smp '${VCPUS}		     \
+		' -kernel '"${KERNEL}"					     \
+		' -initrd '${INITRAMFS}' -nographic -serial stdio'	     \
+		' -nodefaults'						     \
+		' -append "console=ttyS0 mitigations=off apparmor=0" '	     \
+		" ${__qemu_netdev2}"					     \
+		" -pidfile ${STATESETUP}/qemu_2.pid"			     \
+		" -device vhost-vsock-pci,guest-cid=$GUEST_2_CID"	     \
+		" -monitor unix:${STATESETUP}/qemu_2_mon.sock,server,nowait" \
+		" -incoming tcp:0:20005"
+
+	context_setup_guest guest_1 ${GUEST_1_CID}
+	# Only available after migration:
+	( context_setup_guest guest_2 ${GUEST_2_CID} & )
+}
+
 # teardown_context_watch() - Remove contexts and stop panes watching them
 # $1:	Pane number watching
 # $@:	Context names
@@ -384,6 +495,29 @@ teardown_two_guests() {
 	teardown_context_watch ${PANE_PASST_2} pasta_2 passt_2
 }
 
+# teardown_migrate() - Exit namespaces, kill qemu processes, passt and pasta
+teardown_migrate() {
+	${NSTOOL} exec ${STATESETUP}/ns1.hold -- kill $(cat "${STATESETUP}/qemu_1.pid")
+	${NSTOOL} exec ${STATESETUP}/ns1.hold -- kill $(cat "${STATESETUP}/qemu_2.pid")
+	context_wait qemu_1
+	context_wait qemu_2
+
+	${NSTOOL} exec ${STATESETUP}/ns1.hold -- kill $(cat "${STATESETUP}/passt_2.pid")
+	context_wait passt_1
+	context_wait passt_2
+	${NSTOOL} stop "${STATESETUP}/ns1.hold"
+	context_wait pasta_1
+
+	rm -f "${STATESETUP}/passt_[12].pid" "${STATESETUP}/pasta_[12].pid"
+
+	teardown_context_watch ${PANE_HOST} host
+
+	teardown_context_watch ${PANE_GUEST_1} qemu_1 guest_1
+	teardown_context_watch ${PANE_GUEST_2} qemu_2 guest_2
+	teardown_context_watch ${PANE_PASST_1} pasta_1 passt_1
+	teardown_context_watch ${PANE_PASST_2} pasta_1 passt_2
+}
+
 # teardown_demo_passt() - Exit namespace, kill qemu, passt and pasta
 teardown_demo_passt() {
 	tmux send-keys -t ${PANE_GUEST} "C-c"
diff --git a/test/lib/test b/test/lib/test
index e6726be..758250a 100755
--- a/test/lib/test
+++ b/test/lib/test
@@ -68,6 +68,45 @@ test_iperf3() {
 	TEST_ONE_subs="$(list_add_pair "${TEST_ONE_subs}" "__${__var}__" "${__bw}" )"
 }
 
+# test_iperf3m() - Ugly helper for iperf3 directive, guest migration variant
+# $1:	Variable name: to put the measure bandwidth into
+# $2:	Initial source/client context
+# $3:	Second source/client context the guest is moving to
+# $4:	Destination name or address for client
+# $5:	Port number, ${i} is translated to process index
+# $6:	Run time, in seconds
+# $7:	Client options
+test_iperf3m() {
+	__var="${1}"; shift
+	__cctx="${1}"; shift
+	__cctx2="${1}"; shift
+	__dest="${1}"; shift
+	__port="${1}"; shift
+	__time="${1}"; shift
+
+	pane_or_context_run "${__cctx}" 'rm -f c.json'
+
+        # A 1s wait for connection on what's basically a local link
+        # indicates something is pretty wrong
+        __timeout=1000
+	pane_or_context_run_bg "${__cctx}" 				\
+		 'iperf3 -J -c '${__dest}' -p '${__port}		\
+		 '	 --connect-timeout '${__timeout}		\
+		 '	 -t'${__time}' -i0 '"${@}"' > c.json'		\
+
+	__jval=".end.sum_received.bits_per_second"
+
+	sleep $((${__time} + 3))
+
+	pane_or_context_output "${__cctx2}"				\
+		 'cat c.json'
+
+	__bw=$(pane_or_context_output "${__cctx2}"			\
+		 'cat c.json | jq -rMs "map('${__jval}') | add"')
+
+	TEST_ONE_subs="$(list_add_pair "${TEST_ONE_subs}" "__${__var}__" "${__bw}" )"
+}
+
 test_one_line() {
 	__line="${1}"
 
@@ -177,6 +216,12 @@ test_one_line() {
 	"guest2w")
 		pane_or_context_wait guest_2 || TEST_ONE_nok=1
 		;;
+	"mon")
+		pane_or_context_run mon "${__arg}" || TEST_ONE_nok=1
+		;;
+	"monb")
+		pane_or_context_run_bg mon "${__arg}"
+		;;
 	"ns")
 		pane_or_context_run ns "${__arg}" || TEST_ONE_nok=1
 		;;
@@ -292,6 +337,9 @@ test_one_line() {
 	"iperf3")
 		test_iperf3 ${__arg}
 		;;
+	"iperf3m")
+		test_iperf3m ${__arg}
+		;;
 	"set")
 		TEST_ONE_subs="$(list_add_pair "${TEST_ONE_subs}" "__${__arg%% *}__" "${__arg#* }")"
 		;;
diff --git a/test/migrate/basic b/test/migrate/basic
new file mode 100644
index 0000000..d830136
--- /dev/null
+++ b/test/migrate/basic
@@ -0,0 +1,59 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# test/migrate/basic - Check basic migration functionality
+#
+# Copyright (c) 2025 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+g1tools	ip jq dhclient socat cat
+htools	ip jq
+
+set	MAP_HOST4 192.0.2.1
+set	MAP_HOST6 2001:db8:9a55::1
+set	MAP_NS4 192.0.2.2
+set	MAP_NS6 2001:db8:9a55::2
+
+test	Interface name
+g1out	IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
+hout	HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+hout	HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+check	[ -n "__IFNAME1__" ]
+
+test	DHCP: address
+guest1	ip link set dev __IFNAME1__ up
+guest1	/sbin/dhclient -4 __IFNAME1__
+g1out	ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
+hout	HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
+check	[ "__ADDR1__" = "__HOST_ADDR__" ]
+
+test	DHCPv6: address
+# Link is up now, wait for DAD to complete
+guest1	while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
+guest1	/sbin/dhclient -6 __IFNAME1__
+# Wait for DAD to complete on the DHCP address
+guest1	while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
+g1out	ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
+hout	HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
+check	[ "__ADDR1_6__" = "__HOST_ADDR6__" ]
+
+test	TCP/IPv4: guest1/guest2 > host
+g1out	GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
+hostb	socat -u TCP4-LISTEN:10006 OPEN:msg,create,trunc
+sleep	1
+# Option 1: via spliced path in pasta, namespace to host
+# guest1b	{ printf "Hello from guest 1"; sleep 10; printf " and from guest 2\n"; } | socat -u STDIN TCP4:__GW1__:10003
+# Option 2: via --map-guest-addr (tap) in pasta, namespace to host
+guest1b	{ printf "Hello from guest 1"; sleep 3; printf " and from guest 2\n"; } | socat -u STDIN TCP4:__MAP_HOST4__:10006
+sleep	1
+
+mon	echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
+
+hostw
+hout	MSG cat msg
+check	[ "__MSG__" = "Hello from guest 1 and from guest 2" ]
diff --git a/test/migrate/bidirectional b/test/migrate/bidirectional
new file mode 100644
index 0000000..923f7d0
--- /dev/null
+++ b/test/migrate/bidirectional
@@ -0,0 +1,64 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# test/migrate/bidirectional - Check migration with messages in both directions
+#
+# Copyright (c) 2025 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+g1tools	ip jq dhclient socat cat
+htools	ip jq
+
+set	MAP_HOST4 192.0.2.1
+set	MAP_HOST6 2001:db8:9a55::1
+set	MAP_NS4 192.0.2.2
+set	MAP_NS6 2001:db8:9a55::2
+
+test	Interface name
+g1out	IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
+hout	HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+hout	HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+check	[ -n "__IFNAME1__" ]
+
+test	DHCP: address
+guest1	ip link set dev __IFNAME1__ up
+guest1	/sbin/dhclient -4 __IFNAME1__
+g1out	ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
+hout	HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
+check	[ "__ADDR1__" = "__HOST_ADDR__" ]
+
+test	TCP/IPv4: guest1/guest2 > host, host > guest1/guest2
+g1out	GW1 ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway'
+
+hostb	socat -u TCP4-LISTEN:10006 OPEN:msg,create,trunc
+guest1b	socat -u TCP4-LISTEN:10001 OPEN:msg,create,trunc
+sleep	1
+
+guest1b	socat -u UNIX-RECV:proxy.sock,null-eof TCP4:__MAP_HOST4__:10006
+hostb	socat -u UNIX-RECV:__STATESETUP__/proxy.sock,null-eof TCP4:__ADDR1__:10001
+sleep	1
+guest1	printf "Hello from guest 1" | socat -u STDIN UNIX:proxy.sock
+host	printf "Dear guest 1," | socat -u STDIN UNIX:__STATESETUP__/proxy.sock
+sleep	1
+
+mon	echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
+
+sleep	1
+guest2	printf " and from guest 2" | socat -u STDIN UNIX:proxy.sock,shut-null
+host	printf " you are now guest 2" | socat -u STDIN UNIX:__STATESETUP__/proxy.sock,shut-null
+
+hostw
+# FIXME: guest2w doesn't work here because shell jobs are (also) from guest #1,
+# use sleep 1 for the moment
+sleep	1
+
+hout	MSG cat msg
+check	[ "__MSG__" = "Hello from guest 1 and from guest 2" ]
+
+g2out	MSG cat msg
+check	[ "__MSG__" = "Dear guest 1, you are now guest 2" ]
diff --git a/test/migrate/iperf3_in4 b/test/migrate/iperf3_in4
new file mode 100644
index 0000000..fe735fb
--- /dev/null
+++ b/test/migrate/iperf3_in4
@@ -0,0 +1,50 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# test/migrate/iperf3_out4 - Migration behaviour under outbound IPv4 flood
+#
+# Copyright (c) 2025 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+g1tools	ip jq dhclient socat cat
+htools	ip jq
+
+set	MAP_HOST4 192.0.2.1
+set	MAP_HOST6 2001:db8:9a55::1
+set	MAP_NS4 192.0.2.2
+set	MAP_NS6 2001:db8:9a55::2
+
+guest1	/sbin/sysctl -w net.core.rmem_max=33554432
+guest1	/sbin/sysctl -w net.core.wmem_max=33554432
+
+set	THREADS 1
+set	TIME 3
+set	OMIT 0.1
+set	OPTS -Z -P __THREADS__ -O__OMIT__ -N -R
+
+test	Interface name
+g1out	IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
+hout	HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+check	[ -n "__IFNAME1__" ]
+
+test	DHCP: address
+guest1	ip link set dev __IFNAME1__ up
+guest1	/sbin/dhclient -4 __IFNAME1__
+g1out	ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
+hout	HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
+check	[ "__ADDR1__" = "__HOST_ADDR__" ]
+
+test	TCP/IPv4 guest to host throughput during migration
+
+monb	sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
+
+iperf3s	host 10006
+iperf3m	BW guest_1 guest_2 __MAP_HOST4__ 10006 __TIME__ __OPTS__ -l 1M -w 4M
+bw	__BW__ 2 3
+
+iperf3k	host
diff --git a/test/migrate/iperf3_in6 b/test/migrate/iperf3_in6
new file mode 100644
index 0000000..1c932d3
--- /dev/null
+++ b/test/migrate/iperf3_in6
@@ -0,0 +1,58 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# test/migrate/iperf3_out6 - Migration behaviour under outbound IPv6 flood
+#
+# Copyright (c) 2025 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+g1tools	ip jq dhclient socat cat
+htools	ip jq
+
+set	MAP_HOST4 192.0.2.1
+set	MAP_HOST6 2001:db8:9a55::1
+set	MAP_NS4 192.0.2.2
+set	MAP_NS6 2001:db8:9a55::2
+
+set	THREADS 1
+set	TIME 3
+set	OMIT 0.1
+set	OPTS -Z -P __THREADS__ -O__OMIT__ -N -R
+
+test	Interface name
+g1out	IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
+hout	HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+hout	HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+check	[ -n "__IFNAME1__" ]
+
+test	DHCP: address
+guest1	ip link set dev __IFNAME1__ up
+guest1	/sbin/dhclient -4 __IFNAME1__
+g1out	ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
+hout	HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
+check	[ "__ADDR1__" = "__HOST_ADDR__" ]
+
+test	DHCPv6: address
+# Link is up now, wait for DAD to complete
+guest1	while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
+guest1	/sbin/dhclient -6 __IFNAME1__
+# Wait for DAD to complete on the DHCP address
+guest1	while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
+g1out	ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
+hout	HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
+check	[ "__ADDR1_6__" = "__HOST_ADDR6__" ]
+
+test	TCP/IPv6 guest to host throughput during migration
+
+monb	sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
+
+iperf3s	host 10006
+iperf3m	BW guest_1 guest_2 __MAP_HOST6__ 10006 __TIME__ __OPTS__ -l 1M
+bw	__BW__ 2 3
+
+iperf3k	host
diff --git a/test/migrate/iperf3_out4 b/test/migrate/iperf3_out4
new file mode 100644
index 0000000..5b2093b
--- /dev/null
+++ b/test/migrate/iperf3_out4
@@ -0,0 +1,50 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# test/migrate/iperf3_out4 - Migration behaviour under outbound IPv4 flood
+#
+# Copyright (c) 2025 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+g1tools	ip jq dhclient socat cat
+htools	ip jq
+
+set	MAP_HOST4 192.0.2.1
+set	MAP_HOST6 2001:db8:9a55::1
+set	MAP_NS4 192.0.2.2
+set	MAP_NS6 2001:db8:9a55::2
+
+guest1	/sbin/sysctl -w net.core.rmem_max=33554432
+guest1	/sbin/sysctl -w net.core.wmem_max=33554432
+
+set	THREADS 2
+set	TIME 3
+set	OMIT 0.1
+set	OPTS -Z -P __THREADS__ -O__OMIT__ -N
+
+test	Interface name
+g1out	IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
+hout	HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+check	[ -n "__IFNAME1__" ]
+
+test	DHCP: address
+guest1	ip link set dev __IFNAME1__ up
+guest1	/sbin/dhclient -4 __IFNAME1__
+g1out	ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
+hout	HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
+check	[ "__ADDR1__" = "__HOST_ADDR__" ]
+
+test	TCP/IPv4 guest to host throughput during migration
+
+monb	sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
+
+iperf3s	host 10006
+iperf3m	BW guest_1 guest_2 __MAP_HOST4__ 10006 __TIME__ __OPTS__ -l 1M
+bw	__BW__ 2 3
+
+iperf3k	host
diff --git a/test/migrate/iperf3_out6 b/test/migrate/iperf3_out6
new file mode 100644
index 0000000..af1c4a0
--- /dev/null
+++ b/test/migrate/iperf3_out6
@@ -0,0 +1,58 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# test/migrate/iperf3_out6 - Migration behaviour under outbound IPv6 flood
+#
+# Copyright (c) 2025 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+g1tools	ip jq dhclient socat cat
+htools	ip jq
+
+set	MAP_HOST4 192.0.2.1
+set	MAP_HOST6 2001:db8:9a55::1
+set	MAP_NS4 192.0.2.2
+set	MAP_NS6 2001:db8:9a55::2
+
+set	THREADS 1
+set	TIME 3
+set	OMIT 0.1
+set	OPTS -Z -P __THREADS__ -O__OMIT__ -N
+
+test	Interface name
+g1out	IFNAME1 ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
+hout	HOST_IFNAME ip -j -4 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+hout	HOST_IFNAME6 ip -j -6 route show|jq -rM '[.[] | select(.dst == "default").dev] | .[0]'
+check	[ -n "__IFNAME1__" ]
+
+test	DHCP: address
+guest1	ip link set dev __IFNAME1__ up
+guest1	/sbin/dhclient -4 __IFNAME1__
+g1out	ADDR1 ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME1__").addr_info[0].local'
+hout	HOST_ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__HOST_IFNAME__").addr_info[0].local'
+check	[ "__ADDR1__" = "__HOST_ADDR__" ]
+
+test	DHCPv6: address
+# Link is up now, wait for DAD to complete
+guest1	while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
+guest1	/sbin/dhclient -6 __IFNAME1__
+# Wait for DAD to complete on the DHCP address
+guest1	while ip -j -6 addr show tentative | jq -e '.[].addr_info'; do sleep 0.1; done
+g1out	ADDR1_6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__IFNAME1__").addr_info[] | select(.prefixlen == 128).local] | .[0]'
+hout	HOST_ADDR6 ip -j -6 addr show|jq -rM '[.[] | select(.ifname == "__HOST_IFNAME6__").addr_info[] | select(.scope == "global" and .deprecated != true).local] | .[0]'
+check	[ "__ADDR1_6__" = "__HOST_ADDR6__" ]
+
+test	TCP/IPv6 guest to host throughput during migration
+
+monb	sleep 1; echo "migrate tcp:0:20005" | socat -u STDIN UNIX:__STATESETUP__/qemu_1_mon.sock
+
+iperf3s	host 10006
+iperf3m	BW guest_1 guest_2 __MAP_HOST6__ 10006 __TIME__ __OPTS__ -l 1M
+bw	__BW__ 2 3
+
+iperf3k	host
diff --git a/test/run b/test/run
index f188d8e..639f2ab 100755
--- a/test/run
+++ b/test/run
@@ -130,6 +130,16 @@ run() {
 	test two_guests_vu/basic
 	teardown two_guests
 
+	setup migrate
+	test migrate/basic
+	teardown migrate
+	setup migrate
+	test migrate/bidirectional
+	teardown migrate
+	setup migrate
+	test migrate/iperf3_out4
+	teardown migrate
+
 	VALGRIND=0
 	VHOST_USER=0
 	setup passt_in_ns
-- 
2.43.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v13 1/6] migrate: Skeleton of live migration logic
  2025-02-09 22:20 ` [PATCH v13 1/6] migrate: Skeleton of live migration logic Stefano Brivio
@ 2025-02-10  2:26   ` David Gibson
  0 siblings, 0 replies; 14+ messages in thread
From: David Gibson @ 2025-02-10  2:26 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 1949 bytes --]

On Sun, Feb 09, 2025 at 11:20:00PM +0100, Stefano Brivio wrote:
> Introduce facilities for guest migration on top of vhost-user
> infrastructure.  Add migration facilities based on top of the current
> vhost-user infrastructure, moving vu_migrate() and related functions
> to migrate.c.
> 
> Versioned migration stages define function pointers to be called on
> source or target, or data sections that need to be transferred.
> 
> The migration header consists of a magic number, a version number for the
> encoding, and a "compat_version" which represents the oldest version which
> is compatible with the current one.  We don't use it yet, but that allows
> for the future possibility of backwards compatible protocol extensions.
> 
> Co-authored-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

with the exception of one nit, which I hope to send a fixup patch for today:

[snip]

> +/**
> + * struct migrate_stage - Callbacks and parameters for one stage of migration
> + * @name:	Stage name (for debugging)
> + * @source:	Callback to implement this stage on the source
> + * @target:	Callback to implement this stage on the target
> + * @iov:	Optional data section to transfer
> + */
> +struct migrate_stage {
> +	const char *name;
> +	int (*source)(struct ctx *c, const struct migrate_stage *stage, int fd);
> +	int (*target)(struct ctx *c, const struct migrate_stage *stage, int fd);
> +
> +	/* Add here separate rollback callbacks if needed */
> +
> +	struct iovec iov;

We're no longer using this @iov field, and it can be removed.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v13 3/6] Add interfaces and configuration bits for passt-repair
  2025-02-09 22:20 ` [PATCH v13 3/6] Add interfaces and configuration bits for passt-repair Stefano Brivio
@ 2025-02-10  2:59   ` David Gibson
  2025-02-10 15:54     ` Stefano Brivio
  0 siblings, 1 reply; 14+ messages in thread
From: David Gibson @ 2025-02-10  2:59 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 24083 bytes --]

On Sun, Feb 09, 2025 at 11:20:02PM +0100, Stefano Brivio wrote:
> In vhost-user mode, by default, create a second UNIX domain socket
> accepting connections from passt-repair, with the usual listener
> socket.
> 
> When we need to set or clear TCP_REPAIR on sockets, we'll send them
> via SCM_RIGHTS to passt-repair, who sets the socket option values we
> ask for.
> 
> To that end, introduce batched functions to request TCP_REPAIR
> settings on sockets, so that we don't have to send a single message
> for each socket, on migration. When needed, repair_flush() will
> send the message and check for the reply.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
>  Makefile     |  12 +--
>  conf.c       |  44 ++++++++++-
>  epoll_type.h |   4 +
>  migrate.c    |   5 +-
>  passt.1      |  11 +++
>  passt.c      |   9 +++
>  passt.h      |   7 ++
>  repair.c     | 212 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  repair.h     |  16 ++++
>  tap.c        |  65 +---------------
>  util.c       |  62 +++++++++++++++
>  util.h       |   1 +
>  12 files changed, 375 insertions(+), 73 deletions(-)
>  create mode 100644 repair.c
>  create mode 100644 repair.h
> 
> diff --git a/Makefile b/Makefile
> index be89b07..d4e1096 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -38,9 +38,9 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>  
>  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
>  	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
> -	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
> -	tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> -	vhost_user.c virtio.c vu_common.c
> +	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c \
> +	repair.c tap.c tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c \
> +	udp_vu.c util.c vhost_user.c virtio.c vu_common.c
>  QRAP_SRCS = qrap.c
>  PASST_REPAIR_SRCS = passt-repair.c
>  SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
> @@ -50,9 +50,9 @@ MANPAGES = passt.1 pasta.1 qrap.1 passt-repair.1
>  PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
>  	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
>  	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
> -	pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
> -	tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
> -	vhost_user.h virtio.h vu_common.h
> +	pcap.h pif.h repair.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h \
> +	tcp_internal.h tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h \
> +	udp_vu.h util.h vhost_user.h virtio.h vu_common.h
>  HEADERS = $(PASST_HEADERS) seccomp.h
>  
>  C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
> diff --git a/conf.c b/conf.c
> index 142dc94..7a5ff8b 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -820,6 +820,9 @@ static void usage(const char *name, FILE *f, int status)
>  			"    UNIX domain socket is provided by -s option\n"
>  			"  --print-capabilities	print back-end capabilities in JSON format,\n"
>  			"    only meaningful for vhost-user mode\n");
> +		FPRINTF(f,
> +			"  --repair-path PATH	path for passt-repair(1)\n"
> +			"    default: append '.repair' to UNIX domain path\n");
>  	}
>  
>  	FPRINTF(f,
> @@ -1243,8 +1246,25 @@ static void conf_nat(const char *arg, struct in_addr *addr4,
>   */
>  static void conf_open_files(struct ctx *c)
>  {
> -	if (c->mode != MODE_PASTA && c->fd_tap == -1)
> -		c->fd_tap_listen = tap_sock_unix_open(c->sock_path);
> +	if (c->mode != MODE_PASTA && c->fd_tap == -1) {
> +		c->fd_tap_listen = sock_unix(c->sock_path);
> +
> +		if (c->mode == MODE_VU && strcmp(c->repair_path, "none")) {
> +			if (!*c->repair_path &&
> +			    snprintf_check(c->repair_path,
> +					   sizeof(c->repair_path), "%s.repair",
> +					   c->sock_path)) {
> +				warn("passt-repair path %s not usable",
> +				     c->repair_path);
> +				c->fd_repair_listen = -1;
> +			} else {
> +				c->fd_repair_listen = sock_unix(c->repair_path);
> +			}
> +		} else {
> +			c->fd_repair_listen = -1;
> +		}
> +		c->fd_repair = -1;
> +	}
>  
>  	if (*c->pidfile) {
>  		c->pidfile_fd = output_file_open(c->pidfile, O_WRONLY);
> @@ -1357,9 +1377,12 @@ void conf(struct ctx *c, int argc, char **argv)
>  		{"host-lo-to-ns-lo", no_argument, 	NULL,		23 },
>  		{"dns-host",	required_argument,	NULL,		24 },
>  		{"vhost-user",	no_argument,		NULL,		25 },
> +
>  		/* vhost-user backend program convention */
>  		{"print-capabilities", no_argument,	NULL,		26 },
>  		{"socket-path",	required_argument,	NULL,		's' },
> +
> +		{"repair-path",	required_argument,	NULL,		27 },
>  		{ 0 },
>  	};
>  	const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
> @@ -1751,6 +1774,9 @@ void conf(struct ctx *c, int argc, char **argv)
>  		case 'D':
>  			/* Handle these later, once addresses are configured */
>  			break;
> +		case 27:
> +			/* Handle this once we checked --vhost-user */
> +			break;
>  		case 'h':
>  			usage(argv[0], stdout, EXIT_SUCCESS);
>  			break;
> @@ -1827,8 +1853,8 @@ void conf(struct ctx *c, int argc, char **argv)
>  	if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.guest_gw))
>  		c->no_dhcp = 1;
>  
> -	/* Inbound port options & DNS can be parsed now (after IPv4/IPv6
> -	 * settings)
> +	/* Inbound port options, DNS, and --repair-path can be parsed now, after
> +	 * IPv4/IPv6 settings and --vhost-user.
>  	 */
>  	fwd_probe_ephemeral();
>  	udp_portmap_clear();
> @@ -1874,6 +1900,16 @@ void conf(struct ctx *c, int argc, char **argv)
>  			}
>  
>  			die("Cannot use DNS address %s", optarg);
> +		} else if (name == 27) {
> +			if (c->mode != MODE_VU && strcmp(optarg, "none"))
> +				die("--repair-path is for vhost-user mode only");
> +
> +			if (snprintf_check(c->repair_path,
> +					   sizeof(c->repair_path), "%s",
> +					   optarg))
> +				die("Invalid passt-repair path: %s", optarg);
> +
> +			break;
>  		}
>  	} while (name != -1);
>  
> diff --git a/epoll_type.h b/epoll_type.h
> index f3ef415..7f2a121 100644
> --- a/epoll_type.h
> +++ b/epoll_type.h
> @@ -40,6 +40,10 @@ enum epoll_type {
>  	EPOLL_TYPE_VHOST_CMD,
>  	/* vhost-user kick event socket */
>  	EPOLL_TYPE_VHOST_KICK,
> +	/* TCP_REPAIR helper listening socket */
> +	EPOLL_TYPE_REPAIR_LISTEN,
> +	/* TCP_REPAIR helper socket */
> +	EPOLL_TYPE_REPAIR,
>  
>  	EPOLL_NUM_TYPES,
>  };
> diff --git a/migrate.c b/migrate.c
> index 72a6d40..1c59016 100644
> --- a/migrate.c
> +++ b/migrate.c
> @@ -23,6 +23,7 @@
>  #include "flow_table.h"
>  
>  #include "migrate.h"
> +#include "repair.h"
>  
>  /* Magic identifier for migration data */
>  #define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
> @@ -232,7 +233,7 @@ void migrate_init(struct ctx *c)
>  }
>  
>  /**
> - * migrate_close() - Close migration channel
> + * migrate_close() - Close migration channel and connection to passt-repair
>   * @c:		Execution context
>   */
>  void migrate_close(struct ctx *c)
> @@ -243,6 +244,8 @@ void migrate_close(struct ctx *c)
>  		c->device_state_fd = -1;
>  		c->device_state_result = -1;
>  	}
> +
> +	repair_close(c);

I don't think we want this.  At the moment, rollback / failed
migrations aren't really handled properly anyway.  But this pretty
much explicitly prevents a second attempt at a failed migration.
I'll send a fixup.

>  }
>  
>  /**
> diff --git a/passt.1 b/passt.1
> index 29cc3ed..c81d539 100644
> --- a/passt.1
> +++ b/passt.1
> @@ -418,6 +418,17 @@ Enable vhost-user. The vhost-user command socket is provided by \fB--socket\fR.
>  .BR \-\-print-capabilities
>  Print back-end capabilities in JSON format, only meaningful for vhost-user mode.
>  
> +.TP
> +.BR \-\-repair-path " " \fIpath
> +Path for UNIX domain socket used by the \fBpasst-repair\fR(1) helper to connect
> +to \fBpasst\fR in order to set or clear the TCP_REPAIR option on sockets, during
> +migration. \fB--repair-path none\fR disables this interface (if you need to
> +specify a socket path called "none" you can prefix the path by \fI./\fR).
> +
> +Default, for \-\-vhost-user mode only, is to append \fI.repair\fR to the path
> +chosen for the hypervisor UNIX domain socket. No socket is created if not in
> +\-\-vhost-user mode.
> +
>  .TP
>  .BR \-F ", " \-\-fd " " \fIFD
>  Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened
> diff --git a/passt.c b/passt.c
> index 935a69f..6f9fb4d 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -52,6 +52,7 @@
>  #include "ndp.h"
>  #include "vu_common.h"
>  #include "migrate.h"
> +#include "repair.h"
>  
>  #define EPOLL_EVENTS		8
>  
> @@ -76,6 +77,8 @@ char *epoll_type_str[] = {
>  	[EPOLL_TYPE_TAP_LISTEN]		= "listening qemu socket",
>  	[EPOLL_TYPE_VHOST_CMD]		= "vhost-user command socket",
>  	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
> +	[EPOLL_TYPE_REPAIR_LISTEN]	= "TCP_REPAIR helper listening socket",
> +	[EPOLL_TYPE_REPAIR]		= "TCP_REPAIR helper socket",
>  };
>  static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
>  	      "epoll_type_str[] doesn't match enum epoll_type");
> @@ -358,6 +361,12 @@ loop:
>  		case EPOLL_TYPE_VHOST_KICK:
>  			vu_kick_cb(c.vdev, ref, &now);
>  			break;
> +		case EPOLL_TYPE_REPAIR_LISTEN:
> +			repair_listen_handler(&c, eventmask);
> +			break;
> +		case EPOLL_TYPE_REPAIR:
> +			repair_handler(&c, eventmask);
> +			break;
>  		default:
>  			/* Can't happen */
>  			ASSERT(0);
> diff --git a/passt.h b/passt.h
> index e73a5ac..c392be0 100644
> --- a/passt.h
> +++ b/passt.h
> @@ -20,6 +20,7 @@ union epoll_ref;
>  #include "siphash.h"
>  #include "ip.h"
>  #include "inany.h"
> +#include "migrate.h"
>  #include "flow.h"
>  #include "icmp.h"
>  #include "fwd.h"
> @@ -193,6 +194,7 @@ struct ip6_ctx {
>   * @foreground:		Run in foreground, don't log to stderr by default
>   * @nofile:		Maximum number of open files (ulimit -n)
>   * @sock_path:		Path for UNIX domain socket
> + * @repair_path:	TCP_REPAIR helper path, can be "none", empty for default
>   * @pcap:		Path for packet capture file
>   * @pidfile:		Path to PID file, empty string if not configured
>   * @pidfile_fd:		File descriptor for PID file, -1 if none
> @@ -203,6 +205,8 @@ struct ip6_ctx {
>   * @epollfd:		File descriptor for epoll instance
>   * @fd_tap_listen:	File descriptor for listening AF_UNIX socket, if any
>   * @fd_tap:		AF_UNIX socket, tuntap device, or pre-opened socket
> + * @fd_repair_listen:	File descriptor for listening TCP_REPAIR socket, if any
> + * @fd_repair:		Connected AF_UNIX socket for TCP_REPAIR helper
>   * @our_tap_mac:	Pasta/passt's MAC on the tap link
>   * @guest_mac:		MAC address of guest or namespace, seen or configured
>   * @hash_secret:	128-bit secret for siphash functions
> @@ -247,6 +251,7 @@ struct ctx {
>  	int foreground;
>  	int nofile;
>  	char sock_path[UNIX_PATH_MAX];
> +	char repair_path[UNIX_PATH_MAX];
>  	char pcap[PATH_MAX];
>  
>  	char pidfile[PATH_MAX];
> @@ -263,6 +268,8 @@ struct ctx {
>  	int epollfd;
>  	int fd_tap_listen;
>  	int fd_tap;
> +	int fd_repair_listen;
> +	int fd_repair;
>  	unsigned char our_tap_mac[ETH_ALEN];
>  	unsigned char guest_mac[ETH_ALEN];
>  	uint64_t hash_secret[2];
> diff --git a/repair.c b/repair.c
> new file mode 100644
> index 0000000..784b994
> --- /dev/null
> +++ b/repair.c
> @@ -0,0 +1,212 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +/* PASST - Plug A Simple Socket Transport
> + *  for qemu/UNIX domain socket mode
> + *
> + * PASTA - Pack A Subtle Tap Abstraction
> + *  for network namespace/tap device mode
> + *
> + * repair.c - Interface (server) for passt-repair, set/clear TCP_REPAIR
> + *
> + * Copyright (c) 2025 Red Hat GmbH
> + * Author: Stefano Brivio <sbrivio@redhat.com>
> + */
> +
> +#include <errno.h>
> +#include <sys/uio.h>
> +
> +#include "util.h"
> +#include "ip.h"
> +#include "passt.h"
> +#include "inany.h"
> +#include "flow.h"
> +#include "flow_table.h"
> +
> +#include "repair.h"
> +
> +#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
> +
> +/* Pending file descriptors for next repair_flush() call, or command change */
> +static int repair_fds[SCM_MAX_FD];
> +
> +/* Pending command: flush pending file descriptors if it changes */
> +static int repair_cmd;

This should be typed as int8_t (see below for more details).

> +
> +/* Number of pending file descriptors set in @repair_fds */
> +static int repair_nfds;
> +
> +/**
> + * repair_sock_init() - Start listening for connections on helper socket
> + * @c:		Execution context
> + */
> +void repair_sock_init(const struct ctx *c)
> +{
> +	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR_LISTEN };
> +	struct epoll_event ev = { 0 };
> +
> +	if (c->fd_repair_listen == -1)
> +		return;
> +
> +	if (listen(c->fd_repair_listen, 0)) {
> +		err_perror("listen() on repair helper socket, won't migrate");
> +		return;
> +	}
> +
> +	ref.fd = c->fd_repair_listen;
> +	ev.events = EPOLLIN | EPOLLHUP | EPOLLET;
> +	ev.data.u64 = ref.u64;
> +	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair_listen, &ev))
> +		err_perror("repair helper socket epoll_ctl(), won't migrate");
> +}
> +
> +/**
> + * repair_listen_handler() - Handle events on TCP_REPAIR helper listening socket
> + * @c:		Execution context
> + * @events:	epoll events
> + */
> +void repair_listen_handler(struct ctx *c, uint32_t events)
> +{
> +	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR };
> +	struct epoll_event ev = { 0 };
> +	struct ucred ucred;
> +	socklen_t len;
> +
> +	if (events != EPOLLIN) {
> +		debug("Spurious event 0x%04x on TCP_REPAIR helper socket",
> +		      events);
> +		return;
> +	}
> +
> +	len = sizeof(ucred);
> +
> +	/* Another client is already connected: accept and close right away. */
> +	if (c->fd_repair != -1) {
> +		int discard = accept4(c->fd_repair_listen, NULL, NULL,
> +				      SOCK_NONBLOCK);
> +
> +		if (discard == -1)
> +			return;
> +
> +		if (!getsockopt(discard, SOL_SOCKET, SO_PEERCRED, &ucred, &len))
> +			info("Discarding TCP_REPAIR helper, PID %i", ucred.pid);
> +
> +		close(discard);
> +		return;
> +	}
> +
> +	if ((c->fd_repair = accept4(c->fd_repair_listen, NULL, NULL, 0)) < 0) {
> +		debug_perror("accept4() on TCP_REPAIR helper listening socket");
> +		return;
> +	}
> +
> +	if (!getsockopt(c->fd_repair, SOL_SOCKET, SO_PEERCRED, &ucred, &len))
> +		info("Accepted TCP_REPAIR helper, PID %i", ucred.pid);
> +
> +	ref.fd = c->fd_repair;
> +	ev.events = EPOLLHUP | EPOLLET;
> +	ev.data.u64 = ref.u64;
> +	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair, &ev)) {
> +		debug_perror("epoll_ctl() on TCP_REPAIR helper socket");
> +		close(c->fd_repair);
> +		c->fd_repair = -1;
> +	}
> +}
> +
> +/**
> + * repair_close() - Close connection to TCP_REPAIR helper
> + * @c:		Execution context
> + */
> +void repair_close(struct ctx *c)
> +{
> +	debug("Closing TCP_REPAIR helper socket");
> +
> +	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_repair, NULL);
> +	close(c->fd_repair);
> +	c->fd_repair = -1;
> +}
> +
> +/**
> + * repair_handler() - Handle EPOLLHUP and EPOLLERR on TCP_REPAIR helper socket
> + * @c:		Execution context
> + * @events:	epoll events
> + */
> +void repair_handler(struct ctx *c, uint32_t events)
> +{
> +	(void)events;
> +
> +	repair_close(c);
> +}
> +
> +/**
> + * repair_flush() - Flush current set of sockets to helper, with current command
> + * @c:		Execution context
> + *
> + * Return: 0 on success, negative error code on failure
> + */
> +int repair_flush(struct ctx *c)
> +{
> +	struct iovec iov = { &((int8_t){ repair_cmd }), sizeof(int8_t) };

This will only be correct for little-endian machines.  Better to
correctly type the repair_cmd variable.

> +	char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
> +	     __attribute__ ((aligned(__alignof__(struct cmsghdr))));
> +	struct cmsghdr *cmsg;
> +	struct msghdr msg;
> +
> +	if (!repair_nfds)
> +		return 0;
> +
> +	msg = (struct msghdr){ NULL, 0, &iov, 1,
> +			       buf, CMSG_SPACE(sizeof(int) * repair_nfds), 0 };
> +	cmsg = CMSG_FIRSTHDR(&msg);
> +
> +	cmsg->cmsg_level = SOL_SOCKET;
> +	cmsg->cmsg_type = SCM_RIGHTS;
> +	cmsg->cmsg_len = CMSG_LEN(sizeof(int) * repair_nfds);
> +	memcpy(CMSG_DATA(cmsg), repair_fds, sizeof(int) * repair_nfds);
> +
> +	repair_nfds = 0;
> +
> +	if (sendmsg(c->fd_repair, &msg, 0) < 0) {
> +		int ret = -errno;
> +		err_perror("Failed to send sockets to TCP_REPAIR helper");
> +		repair_close(c);
> +		return ret;
> +	}
> +
> +	if (recv(c->fd_repair, &((int8_t){ 0 }), 1, 0) < 0) {

I guess it works, but passing an address to an implicitly constructed
variable to recv() makes me nervous.  Besides we could error check a
bit better here, I'll try to send another fixup.

> +		int ret = -errno;
> +		err_perror("Failed to receive reply from TCP_REPAIR helper");
> +		repair_close(c);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * repair_set() - Add socket to TCP_REPAIR set with given command
> + * @c:		Execution context
> + * @s:		Socket to add
> + * @cmd:	TCP_REPAIR_ON, TCP_REPAIR_OFF, or TCP_REPAIR_OFF_NO_WP
> + *
> + * Return: 0 on success, negative error code on failure
> + */
> +/* cppcheck-suppress unusedFunction */
> +int repair_set(struct ctx *c, int s, int cmd)
> +{
> +	int rc;
> +
> +	if (repair_nfds && repair_cmd != cmd) {
> +		if ((rc = repair_flush(c)))
> +			return rc;
> +	}
> +
> +	repair_cmd = cmd;
> +	repair_fds[repair_nfds++] = s;
> +
> +	if (repair_nfds >= SCM_MAX_FD) {
> +		if ((rc = repair_flush(c)))
> +			return rc;
> +	}
> +
> +	return 0;
> +}
> diff --git a/repair.h b/repair.h
> new file mode 100644
> index 0000000..de279d6
> --- /dev/null
> +++ b/repair.h
> @@ -0,0 +1,16 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later
> + * Copyright (c) 2025 Red Hat GmbH
> + * Author: Stefano Brivio <sbrivio@redhat.com>
> + */
> +
> +#ifndef REPAIR_H
> +#define REPAIR_H
> +
> +void repair_sock_init(const struct ctx *c);
> +void repair_listen_handler(struct ctx *c, uint32_t events);
> +void repair_handler(struct ctx *c, uint32_t events);
> +void repair_close(struct ctx *c);
> +int repair_flush(struct ctx *c);
> +int repair_set(struct ctx *c, int s, int cmd);
> +
> +#endif /* REPAIR_H */
> diff --git a/tap.c b/tap.c
> index 8c92d23..d0673e5 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -56,6 +56,7 @@
>  #include "netlink.h"
>  #include "pasta.h"
>  #include "packet.h"
> +#include "repair.h"
>  #include "tap.h"
>  #include "log.h"
>  #include "vhost_user.h"
> @@ -1151,68 +1152,6 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
>  		tap_pasta_input(c, now);
>  }
>  
> -/**
> - * tap_sock_unix_open() - Create and bind AF_UNIX socket
> - * @sock_path:	Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
> - *
> - * Return: socket descriptor on success, won't return on failure
> - */
> -int tap_sock_unix_open(char *sock_path)
> -{
> -	int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
> -	struct sockaddr_un addr = {
> -		.sun_family = AF_UNIX,
> -	};
> -	int i;
> -
> -	if (fd < 0)
> -		die_perror("Failed to open UNIX domain socket");
> -
> -	for (i = 1; i < UNIX_SOCK_MAX; i++) {
> -		char *path = addr.sun_path;
> -		int ex, ret;
> -
> -		if (*sock_path)
> -			memcpy(path, sock_path, UNIX_PATH_MAX);
> -		else if (snprintf_check(path, UNIX_PATH_MAX - 1,
> -					UNIX_SOCK_PATH, i))
> -			die_perror("Can't build UNIX domain socket path");
> -
> -		ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
> -			    0);
> -		if (ex < 0)
> -			die_perror("Failed to check for UNIX domain conflicts");
> -
> -		ret = connect(ex, (const struct sockaddr *)&addr, sizeof(addr));
> -		if (!ret || (errno != ENOENT && errno != ECONNREFUSED &&
> -			     errno != EACCES)) {
> -			if (*sock_path)
> -				die("Socket path %s already in use", path);
> -
> -			close(ex);
> -			continue;
> -		}
> -		close(ex);
> -
> -		unlink(path);
> -		ret = bind(fd, (const struct sockaddr *)&addr, sizeof(addr));
> -		if (*sock_path && ret)
> -			die_perror("Failed to bind UNIX domain socket");
> -
> -		if (!ret)
> -			break;
> -	}
> -
> -	if (i == UNIX_SOCK_MAX)
> -		die_perror("Failed to bind UNIX domain socket");
> -
> -	info("UNIX domain socket bound at %s", addr.sun_path);
> -	if (!*sock_path)
> -		memcpy(sock_path, addr.sun_path, UNIX_PATH_MAX);
> -
> -	return fd;
> -}
> -
>  /**
>   * tap_backend_show_hints() - Give help information to start QEMU
>   * @c:		Execution context
> @@ -1423,6 +1362,8 @@ void tap_backend_init(struct ctx *c)
>  		tap_sock_tun_init(c);
>  		break;
>  	case MODE_VU:
> +		repair_sock_init(c);
> +		/* fall through */
>  	case MODE_PASST:
>  		tap_sock_unix_init(c);
>  
> diff --git a/util.c b/util.c
> index 4d51e04..c3c5480 100644
> --- a/util.c
> +++ b/util.c
> @@ -178,6 +178,68 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
>  	return fd;
>  }
>  
> +/**
> + * sock_unix() - Create and bind AF_UNIX socket
> + * @sock_path:	Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
> + *
> + * Return: socket descriptor on success, won't return on failure
> + */
> +int sock_unix(char *sock_path)
> +{
> +	int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
> +	struct sockaddr_un addr = {
> +		.sun_family = AF_UNIX,
> +	};
> +	int i;
> +
> +	if (fd < 0)
> +		die_perror("Failed to open UNIX domain socket");
> +
> +	for (i = 1; i < UNIX_SOCK_MAX; i++) {
> +		char *path = addr.sun_path;
> +		int ex, ret;
> +
> +		if (*sock_path)
> +			memcpy(path, sock_path, UNIX_PATH_MAX);
> +		else if (snprintf_check(path, UNIX_PATH_MAX - 1,
> +					UNIX_SOCK_PATH, i))
> +			die_perror("Can't build UNIX domain socket path");
> +
> +		ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
> +			    0);
> +		if (ex < 0)
> +			die_perror("Failed to check for UNIX domain conflicts");
> +
> +		ret = connect(ex, (const struct sockaddr *)&addr, sizeof(addr));
> +		if (!ret || (errno != ENOENT && errno != ECONNREFUSED &&
> +			     errno != EACCES)) {
> +			if (*sock_path)
> +				die("Socket path %s already in use", path);
> +
> +			close(ex);
> +			continue;
> +		}
> +		close(ex);
> +
> +		unlink(path);
> +		ret = bind(fd, (const struct sockaddr *)&addr, sizeof(addr));
> +		if (*sock_path && ret)
> +			die_perror("Failed to bind UNIX domain socket");
> +
> +		if (!ret)
> +			break;
> +	}
> +
> +	if (i == UNIX_SOCK_MAX)
> +		die_perror("Failed to bind UNIX domain socket");
> +
> +	info("UNIX domain socket bound at %s", addr.sun_path);
> +	if (!*sock_path)
> +		memcpy(sock_path, addr.sun_path, UNIX_PATH_MAX);
> +
> +	return fd;
> +}
> +
>  /**
>   * sock_probe_mem() - Check if setting high SO_SNDBUF and SO_RCVBUF is allowed
>   * @c:		Execution context
> diff --git a/util.h b/util.h
> index 255eb26..3dacb4d 100644
> --- a/util.h
> +++ b/util.h
> @@ -214,6 +214,7 @@ struct ctx;
>  int sock_l4_sa(const struct ctx *c, enum epoll_type type,
>  	       const void *sa, socklen_t sl,
>  	       const char *ifname, bool v6only, uint32_t data);
> +int sock_unix(char *sock_path);
>  void sock_probe_mem(struct ctx *c);
>  long timespec_diff_ms(const struct timespec *a, const struct timespec *b);
>  int64_t timespec_diff_us(const struct timespec *a, const struct timespec *b);

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v13 4/6] vhost_user: Make source quit after reporting migration state
  2025-02-09 22:20 ` [PATCH v13 4/6] vhost_user: Make source quit after reporting migration state Stefano Brivio
@ 2025-02-10  3:43   ` David Gibson
  0 siblings, 0 replies; 14+ messages in thread
From: David Gibson @ 2025-02-10  3:43 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 2112 bytes --]

On Sun, Feb 09, 2025 at 11:20:03PM +0100, Stefano Brivio wrote:
> This will close all the sockets we currently have open in repair mode,
> and completes our migration tasks as source. If the hypervisor wants
> to have us back at this point, somebody needs to restart us.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
>  vhost_user.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/vhost_user.c b/vhost_user.c
> index 256c8ab..9115fb5 100644
> --- a/vhost_user.c
> +++ b/vhost_user.c
> @@ -998,6 +998,9 @@ static bool vu_send_rarp_exec(struct vu_dev *vdev,
>  	return false;
>  }
>  
> +/* If set, quit when we get a VHOST_USER_CHECK_DEVICE_STATE, after replying */
> +static bool quit_on_device_state = false;

We don't actually need this global, because we have c->migrate_target.

> +
>  /**
>   * vu_set_device_state_fd_exec() - Set the device state migration channel
>   * @vdev:	vhost-user device
> @@ -1025,6 +1028,9 @@ static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
>  	migrate_request(vdev->context, msg->fds[0],
>  			direction == VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD);
>  
> +	if (direction == VHOST_USER_TRANSFER_STATE_DIRECTION_SAVE)
> +		quit_on_device_state = true;
> +
>  	/* We don't provide a new fd for the data transfer */
>  	vmsg_set_reply_u64(msg, VHOST_USER_VRING_NOFD_MASK);
>  
> @@ -1203,4 +1209,10 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
>  
>  	if (reply_requested)
>  		vu_send_reply(fd, &msg);
> +
> +	if (quit_on_device_state &&
> +	    msg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE) {

We probably also only want to quit when the migration is successful.
We can't determine that in all cases, but we can at least check
device_state_result.  Patch coming.

> +		info("Migration complete, exiting");
> +		_exit(EXIT_SUCCESS);
> +	}
>  }

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v13 5/6] migrate: Migrate TCP flows
  2025-02-09 22:20 ` [PATCH v13 5/6] migrate: Migrate TCP flows Stefano Brivio
@ 2025-02-10  6:05   ` David Gibson
  2025-02-10  9:51     ` Stefano Brivio
  0 siblings, 1 reply; 14+ messages in thread
From: David Gibson @ 2025-02-10  6:05 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 42218 bytes --]

On Sun, Feb 09, 2025 at 11:20:04PM +0100, Stefano Brivio wrote:
> This implements flow preparation on the source, transfer of data with
> a format roughly inspired by struct tcp_tap_conn, and flow insertion
> on the target, with all the appropriate window options, window
> scaling, MSS, etc.
> 
> The target side is rather convoluted because we first need to create
> sockets and switch them to repair mode, before we can apply options
> that are *not* stored in the flow table. However, we don't want to
> request repair mode for sockets one by one. So we need to do this in
> several steps.

Various comments below, I'll try to make patches for as many of them
as I can.

> 
> [dwg: Assorted cleanups]
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
>  flow.c     | 248 +++++++++++++++++
>  flow.h     |   8 +
>  migrate.c  |  19 ++
>  passt.c    |   6 +-
>  repair.c   |   1 -
>  tcp.c      | 789 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tcp_conn.h |  95 +++++++
>  7 files changed, 1162 insertions(+), 4 deletions(-)
> 
> diff --git a/flow.c b/flow.c
> index a6fe6d1..51f8c62 100644
> --- a/flow.c
> +++ b/flow.c
> @@ -19,6 +19,7 @@
>  #include "inany.h"
>  #include "flow.h"
>  #include "flow_table.h"
> +#include "repair.h"
>  
>  const char *flow_state_str[] = {
>  	[FLOW_STATE_FREE]	= "FREE",
> @@ -52,6 +53,26 @@ const uint8_t flow_proto[] = {
>  static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
>  	      "flow_proto[] doesn't match enum flow_type");
>  
> +#define foreach_flow(i, flow, bound)					\
> +	for ((i) = 0, (flow) = &flowtab[(i)];				\
> +	     (i) < (bound);						\
> +	     (i)++, (flow) = &flowtab[(i)])				\
> +		if ((flow)->f.state == FLOW_STATE_FREE)			\
> +			(i) += (flow)->free.n - 1;			\
> +		else
> +
> +#define foreach_active_flow(i, flow, bound)				\
> +	foreach_flow((i), (flow), (bound))				\
> +		if ((flow)->f.state != FLOW_STATE_ACTIVE)		\
> +			continue;					\
> +		else
> +
> +#define foreach_tcp_flow(i, flow, bound)				\
> +	foreach_active_flow((i), (flow), (bound))			\
> +		if ((flow)->f.type != FLOW_TCP)				\
> +			continue;					\
> +		else
> +

Oh, nice.

>  /* Global Flow Table */
>  
>  /**
> @@ -874,6 +895,233 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
>  	*last_next = FLOW_MAX;
>  }
>  
> +/**
> + * flow_migrate_source_rollback() - Disable repair mode, return failure
> + * @c:		Execution context
> + * @max_flow:	Maximum index of affected flows
> + * @ret:	Negative error code
> + *
> + * Return: @ret
> + */
> +static int flow_migrate_source_rollback(struct ctx *c, unsigned max_flow,
> +					int ret)
> +{
> +	union flow *flow;
> +	unsigned i;
> +
> +	debug("...roll back migration");
> +
> +	foreach_tcp_flow(i, flow, max_flow)
> +		tcp_flow_repair_off(c, &flow->tcp);
> +
> +	repair_flush(c);

I think this should die() on failures.  If we get here, it could well
mean we've already had a failure enabling repair mode, so an error
disabling is more plausible than usual.  I think die()ing is
preferable to carrying on, since resuming normal operation with some
of our sockets in repair mode is almost certain to result in really
weird, hard to debug behaviour.

> +
> +	return ret;
> +}
> +
> +/**
> + * flow_migrate_repair_all() - Turn repair mode on or off for all flows
> + * @c:		Execution context
> + * @enable:	Switch repair mode on if set, off otherwise
> + *
> + * Return: 0 on success, negative error code on failure
> + */
> +static int flow_migrate_repair_all(struct ctx *c, bool enable)
> +{
> +	union flow *flow;
> +	unsigned i;
> +	int rc;
> +
> +	foreach_tcp_flow(i, flow, FLOW_MAX) {
> +		if (enable)
> +			rc = tcp_flow_repair_on(c, &flow->tcp);
> +		else
> +			rc = tcp_flow_repair_off(c, &flow->tcp);
> +
> +		if (rc) {
> +			debug("Can't %s repair mode: %s",
> +			      enable ? "enable" : "disable", strerror_(-rc));
> +			return flow_migrate_source_rollback(c, i, rc);
> +		}
> +	}
> +
> +	if ((rc = repair_flush(c))) {
> +		debug("Can't %s repair mode: %s",
> +		      enable ? "enable" : "disable", strerror_(-rc));
> +		return flow_migrate_source_rollback(c, i, rc);
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * flow_migrate_source_early() - Early tasks: shrink (RFC 7323 2.2) TCP windows
> + * @c:		Execution context
> + * @stage:	Migration stage information, unused
> + * @fd:		Migration file descriptor, unused
> + *
> + * Return: 0 on success, positive error code on failure
> + */
> +int flow_migrate_source_early(struct ctx *c, const struct migrate_stage *stage,
> +			      int fd)
> +{
> +	union flow *flow;
> +	unsigned i;
> +	int rc;
> +
> +	(void)stage;
> +	(void)fd;
> +
> +	/* We need repair mode to dump and set (some) window parameters */
> +	if ((rc = flow_migrate_repair_all(c, true)))
> +		return -rc;
> +
> +	foreach_tcp_flow(i, flow, FLOW_MAX) {
> +		if ((rc = tcp_flow_migrate_shrink_window(i, &flow->tcp))) {
> +			err("Shrinking window, flow %u: %s", i, strerror_(-rc));
> +			return flow_migrate_source_rollback(c, i, -rc);
> +		}
> +	}
> +
> +	/* Now send window updates. We'll flip repair mode back on in a bit */
> +	if ((rc = flow_migrate_repair_all(c, false)))
> +		return -rc;
> +
> +	return 0;
> +}
> +
> +/**
> + * flow_migrate_source_pre() - Prepare flows for migration: enable repair mode
> + * @c:		Execution context
> + * @stage:	Migration stage information (unused)
> + * @fd:		Migration file descriptor (unused)
> + *
> + * Return: 0 on success, positive error code on failure
> + */
> +int flow_migrate_source_pre(struct ctx *c, const struct migrate_stage *stage,
> +			    int fd)
> +{
> +	int rc;
> +
> +	(void)stage;
> +	(void)fd;
> +
> +	if ((rc = flow_migrate_repair_all(c, true)))
> +		return -rc;
> +
> +	return 0;
> +}
> +
> +/**
> + * flow_migrate_source() - Dump all the remaining information and send data
> + * @c:		Execution context (unused)
> + * @stage:	Migration stage information (unused)
> + * @fd:		Migration file descriptor
> + *
> + * Return: 0 on success, positive error code on failure
> + */
> +int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
> +			int fd)
> +{
> +	uint32_t count = 0;
> +	union flow *flow;
> +	unsigned i;
> +	int rc;
> +
> +	(void)c;
> +	(void)stage;
> +
> +	foreach_tcp_flow(i, flow, FLOW_MAX)
> +		count++;
> +
> +	count = htonl(count);
> +	if ((rc = write_all_buf(fd, &count, sizeof(count)))) {
> +		rc = errno;
> +		err_perror("Can't send flow count (%u)", ntohl(count));
> +		return flow_migrate_source_rollback(c, FLOW_MAX, rc);
> +	}
> +
> +	debug("Sending %u flows", ntohl(count));
> +
> +	/* Dump and send information that can be stored in the flow table */
> +	foreach_tcp_flow(i, flow, FLOW_MAX) {
> +		if ((rc = tcp_flow_migrate_source(fd, &flow->tcp))) {
> +			err("Can't send data, flow %u: %s", i, strerror_(-rc));
> +			return flow_migrate_source_rollback(c, FLOW_MAX, -rc);
> +		}
> +	}
> +
> +	/* And then "extended" data (including window data we saved previously):
> +	 * the target needs to set repair mode on sockets before it can set
> +	 * this stuff, but it needs sockets (and flows) for that.
> +	 *
> +	 * This also closes sockets so that the target can start connecting
> +	 * theirs: you can't sendmsg() to queues (using the socket) if the
> +	 * socket is not connected (EPIPE), not even in repair mode. And the
> +	 * target needs to restore queues now because we're sending the data.
> +	 *
> +	 * So, no rollback here, just try as hard as we can.
> +	 */
> +	foreach_tcp_flow(i, flow, FLOW_MAX) {
> +		if ((rc = tcp_flow_migrate_source_ext(fd, i, &flow->tcp)))
> +			err("Extended data for flow %u: %s", i, strerror_(-rc));
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * flow_migrate_target() - Receive flows and insert in flow table
> + * @c:		Execution context
> + * @stage:	Migration stage information (unused)
> + * @fd:		Migration file descriptor
> + *
> + * Return: 0 on success, positive error code on failure
> + */
> +int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
> +			int fd)
> +{
> +	uint32_t count;
> +	unsigned i;
> +	int rc;
> +
> +	(void)stage;
> +
> +	if (read_all_buf(fd, &count, sizeof(count)))
> +		return errno;
> +
> +	count = ntohl(count);
> +	debug("Receiving %u flows", count);
> +
> +	if ((rc = flow_migrate_repair_all(c, true)))
> +		return -rc;
> +
> +	repair_flush(c);

Unnecessary, flow_migrate_repair_all() already handles this.

> +
> +	/* TODO: flow header with type, instead? */
> +	for (i = 0; i < count; i++) {
> +		rc = tcp_flow_migrate_target(c, fd);
> +		if (rc) {
> +			debug("Bad target data for flow %u: %s, abort",
> +			      i, strerror_(-rc));
> +			return -rc;
> +		}
> +	}
> +
> +	repair_flush(c);
> +
> +	for (i = 0; i < count; i++) {
> +		rc = tcp_flow_migrate_target_ext(c, flowtab + i, fd);
> +		if (rc) {
> +			debug("Bad target extended data for flow %u: %s, abort",
> +			      i, strerror_(-rc));
> +			return -rc;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
>  /**
>   * flow_init() - Initialise flow related data structures
>   */
> diff --git a/flow.h b/flow.h
> index 24ba3ef..675726e 100644
> --- a/flow.h
> +++ b/flow.h
> @@ -249,6 +249,14 @@ union flow;
>  
>  void flow_init(void);
>  void flow_defer_handler(const struct ctx *c, const struct timespec *now);
> +int flow_migrate_source_early(struct ctx *c, const struct migrate_stage *stage,
> +			      int fd);
> +int flow_migrate_source_pre(struct ctx *c, const struct migrate_stage *stage,
> +			    int fd);
> +int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
> +			int fd);
> +int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
> +			int fd);
>  
>  void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
>  	__attribute__((format(printf, 3, 4)));
> diff --git a/migrate.c b/migrate.c
> index 1c59016..c5c6663 100644
> --- a/migrate.c
> +++ b/migrate.c
> @@ -98,11 +98,30 @@ static int seen_addrs_target_v1(struct ctx *c,
>  
>  /* Stages for version 1 */
>  static const struct migrate_stage stages_v1[] = {
> +	/* FIXME: With this step, close() in tcp_flow_migrate_source_ext()
> +	 * *sometimes* closes the connection for real.
> +	 */
> +/*	{
> +		.name = "shrink TCP windows",
> +		.source = flow_migrate_source_early,
> +		.target = NULL,
> +	},
> +*/

Given we're not sure if this will help, and it adds some
complications, probably makes sense to split this into a separate
patch.

>  	{
>  		.name = "observed addresses",
>  		.source = seen_addrs_source_v1,
>  		.target = seen_addrs_target_v1,
>  	},
> +	{
> +		.name = "prepare flows",
> +		.source = flow_migrate_source_pre,
> +		.target = NULL,
> +	},
> +	{
> +		.name = "transfer flows",
> +		.source = flow_migrate_source,
> +		.target = flow_migrate_target,
> +	},
>  	{ 0 },
>  };
>  
> diff --git a/passt.c b/passt.c
> index 6f9fb4d..68d1a28 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -223,9 +223,6 @@ int main(int argc, char **argv)
>  		if (sigaction(SIGCHLD, &sa, NULL))
>  			die_perror("Couldn't install signal handlers");
>  
> -		if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
> -			die_perror("Couldn't set disposition for SIGPIPE");
> -
>  		c.mode = MODE_PASTA;
>  	} else if (strstr(name, "passt")) {
>  		c.mode = MODE_PASST;
> @@ -233,6 +230,9 @@ int main(int argc, char **argv)
>  		_exit(EXIT_FAILURE);
>  	}
>  
> +	if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
> +		die_perror("Couldn't set disposition for SIGPIPE");
> +
>  	madvise(pkt_buf, TAP_BUF_BYTES, MADV_HUGEPAGE);
>  
>  	c.epollfd = epoll_create1(EPOLL_CLOEXEC);
> diff --git a/repair.c b/repair.c
> index 784b994..da85edb 100644
> --- a/repair.c
> +++ b/repair.c
> @@ -190,7 +190,6 @@ int repair_flush(struct ctx *c)
>   *
>   * Return: 0 on success, negative error code on failure
>   */
> -/* cppcheck-suppress unusedFunction */
>  int repair_set(struct ctx *c, int s, int cmd)
>  {
>  	int rc;
> diff --git a/tcp.c b/tcp.c
> index af6bd95..78db64f 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -280,6 +280,7 @@
>  #include <stddef.h>
>  #include <string.h>
>  #include <sys/epoll.h>
> +#include <sys/ioctl.h>
>  #include <sys/socket.h>
>  #include <sys/timerfd.h>
>  #include <sys/types.h>
> @@ -287,6 +288,8 @@
>  #include <time.h>
>  #include <arpa/inet.h>
>  
> +#include <linux/sockios.h>
> +
>  #include "checksum.h"
>  #include "util.h"
>  #include "iov.h"
> @@ -299,6 +302,7 @@
>  #include "log.h"
>  #include "inany.h"
>  #include "flow.h"
> +#include "repair.h"
>  #include "linux_dep.h"
>  
>  #include "flow_table.h"
> @@ -326,6 +330,19 @@
>  	 ((conn)->events & (SOCK_FIN_RCVD | TAP_FIN_RCVD)))
>  #define CONN_HAS(conn, set)	(((conn)->events & (set)) == (set))
>  
> +/* Buffers to migrate pending data from send and receive queues. No, they don't
> + * use memory if we don't use them. And we're going away after this, so splurge.
> + */
> +#define TCP_MIGRATE_SND_QUEUE_MAX	(64 << 20)
> +#define TCP_MIGRATE_RCV_QUEUE_MAX	(64 << 20)
> +uint8_t tcp_migrate_snd_queue		[TCP_MIGRATE_SND_QUEUE_MAX];
> +uint8_t tcp_migrate_rcv_queue		[TCP_MIGRATE_RCV_QUEUE_MAX];
> +
> +#define TCP_MIGRATE_RESTORE_CHUNK_MIN	1024 /* Try smaller when above this */
> +
> +/* "Extended" data (not stored in the flow table) for TCP flow migration */
> +static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
> +
>  static const char *tcp_event_str[] __attribute((__unused__)) = {
>  	"SOCK_ACCEPTED", "TAP_SYN_RCVD", "ESTABLISHED", "TAP_SYN_ACK_SENT",
>  
> @@ -2645,3 +2662,775 @@ void tcp_timer(struct ctx *c, const struct timespec *now)
>  	if (c->mode == MODE_PASTA)
>  		tcp_splice_refill(c);
>  }
> +
> +/**
> + * tcp_flow_repair_on() - Enable repair mode for a single TCP flow
> + * @c:		Execution context
> + * @conn:	Pointer to the TCP connection structure
> + *
> + * Return: 0 on success, negative error code on failure
> + */
> +int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn)
> +{
> +	int rc = 0;
> +
> +	if ((rc = repair_set(c, conn->sock, TCP_REPAIR_ON)))
> +		err("Failed to set TCP_REPAIR");
> +
> +	return rc;
> +}
> +
> +/**
> + * tcp_flow_repair_off() - Clear repair mode for a single TCP flow
> + * @c:		Execution context
> + * @conn:	Pointer to the TCP connection structure
> + *
> + * Return: 0 on success, negative error code on failure
> + */
> +int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn)
> +{
> +	int rc = 0;
> +
> +	if ((rc = repair_set(c, conn->sock, TCP_REPAIR_OFF)))
> +		err("Failed to clear TCP_REPAIR");
> +
> +	return rc;
> +}
> +
> +/**
> + * tcp_flow_repair_queues() - Read or write socket queues, or send unsent data
> + * @s:			Socket
> + * @sndbuf:		Send queue buffer read or written/sent depending on @set
> + * @sndlen:		Length of send queue buffer to set, network order
> + * @notsentlen:		Length of not sent data, non-zero to actually _send_
> + * @rcvbuf:		Receive queue buffer, read or written depending on @set
> + * @rcvlen:		Length of receive queue buffer to set, network order
> + * @set:		Set or send (unsent data only) if true, dump if false
> + *
> + * Return: 0 on success, negative error code on failure
> + *
> + * #syscalls:vu ioctl
> + */
> +static int tcp_flow_repair_queues(int s,
> +				  uint8_t *sndbuf,
> +				  uint32_t *sndlen, uint32_t *notsentlen,
> +				  uint8_t *rcvbuf, uint32_t *rcvlen,
> +				  bool set)
> +{
> +	ssize_t rc;
> +	int v;
> +
> +	if (set && rcvbuf) { /* FIXME: can't check notsentlen, rework this */

Not really clear why this is its own block, rather than part of the if
(set) below.

> +		v = TCP_SEND_QUEUE;
> +		if (setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &v, sizeof(v))) {
> +			rc = -errno;
> +			err_perror("Selecting TCP_SEND_QUEUE on socket %i", s);
> +			return rc;
> +		}
> +	}
> +
> +	if (set) {
> +		size_t chunk;
> +		uint8_t *p;
> +
> +		*sndlen = ntohl(*sndlen);
> +		debug("Writing socket %i send queue: %u bytes", s, *sndlen);
> +		p = sndbuf;
> +		chunk = *sndlen;
> +		while (*sndlen > 0) {
> +			rc = send(s, p, MIN(*sndlen, chunk), 0);
> +
> +			if (rc < 0) {
> +				if ((errno == ENOBUFS || errno == ENOMEM) &&
> +				    chunk >= TCP_MIGRATE_RESTORE_CHUNK_MIN) {
> +					chunk /= 2;
> +					continue;
> +				}
> +
> +				rc = -errno;
> +				err_perror("Can't write socket %i send queue",
> +					   s);
> +				return rc;
> +			}
> +
> +			*sndlen -= rc;
> +			p += rc;
> +		}
> +
> +		*notsentlen = ntohl(*notsentlen);
> +		debug("Sending socket %i unsent queue: %u bytes", s,
> +		      *notsentlen);
> +		p = sndbuf;
> +		chunk = *notsentlen;
> +		while (*notsentlen > 0) {
> +			rc = send(s, p, MIN(*notsentlen, chunk), 0);
> +
> +			if (rc < 0) {
> +				if ((errno == ENOBUFS || errno == ENOMEM) &&
> +				    chunk >= TCP_MIGRATE_RESTORE_CHUNK_MIN) {
> +					chunk /= 2;
> +					continue;
> +				}
> +
> +				rc = -errno;
> +				err_perror("Can't send socket %i unsent queue",
> +					   s);
> +				return rc;
> +			}
> +
> +			*notsentlen -= rc;
> +			p += rc;
> +		}
> +	} else {
> +		rc = ioctl(s, SIOCOUTQ, sndlen);
> +		if (rc < 0) {
> +			rc = -errno;
> +			err_perror("Getting send queue size for socket %i", s);
> +			return rc;
> +		}
> +
> +		rc = ioctl(s, SIOCOUTQNSD, notsentlen);
> +		if (rc < 0) {
> +			rc = -errno;
> +			err_perror("Getting not sent count for socket %i", s);
> +			return rc;
> +		}
> +
> +		/* TODO: Skip "FIN" byte in queue if present */
> +
> +		rc = recv(s, sndbuf, MIN(*sndlen, TCP_MIGRATE_SND_QUEUE_MAX),
> +			  MSG_PEEK);
> +		if (rc < 0 && errno != EAGAIN) { /* EAGAIN means empty */
> +			rc = -errno;
> +			err_perror("Can't read send queue for socket %i", s);
> +			return rc;
> +		}
> +
> +		rc = MAX(0, rc);
> +		debug("Read socket %i send queue: %zi (%u not sent)", s, rc,
> +		      *notsentlen);
> +
> +		*sndlen = htonl(rc);
> +		*notsentlen = htonl(*notsentlen);
> +	}
> +
> +	if (!rcvbuf)
> +		return 0;
> +
> +	v = TCP_RECV_QUEUE;
> +	if (setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &v, sizeof(v))) {
> +		rc = -errno;
> +		err_perror("Selecting TCP_RECV_QUEUE for socket %i", s);
> +		return rc;
> +	}
> +
> +	if (set) {
> +		uint8_t *p;
> +		size_t chunk;
> +
> +		*rcvlen = ntohl(*rcvlen);
> +		debug("Writing socket %i receive queue: %u bytes", s, *rcvlen);
> +		p = rcvbuf;
> +		chunk = *rcvlen;
> +		while (*rcvlen > 0) {
> +			rc = send(s, p, MIN(*rcvlen, chunk), 0);
> +			if (rc < 0) {
> +				if ((errno == ENOBUFS || errno == ENOMEM) &&
> +				    chunk >= TCP_MIGRATE_RESTORE_CHUNK_MIN) {
> +					chunk /= 2;
> +					continue;
> +				}
> +
> +				rc = -errno;
> +				err_perror("Can't send socket %i receive queue",
> +					   s);
> +				return rc;
> +			}
> +
> +			*rcvlen -= rc;
> +			p += rc;
> +		}
> +	} else {
> +		if (ioctl(s, SIOCINQ, rcvlen) < 0) {
> +			rc = -errno;
> +			err_perror("Get receive queue size for socket %i", s);
> +			return rc;
> +		}
> +
> +		/* TODO: Skip "FIN" byte in queue if present */
> +
> +		rc = recv(s, rcvbuf, MIN(*rcvlen, TCP_MIGRATE_RCV_QUEUE_MAX),
> +					 MSG_PEEK);
> +		if (rc < 0 && errno != EAGAIN) { /* EAGAIN means empty */
> +			rc = -errno;
> +			err_perror("Can't read receive queue for socket %i", s);
> +			return rc;
> +		}
> +
> +		rc = MAX(0, rc);
> +		*rcvlen = htonl(rc);
> +		debug("Read socket %i receive queue: %zi bytes", s, rc);
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * tcp_flow_repair_seq() - Dump or set sequences
> + * @s:		Socket
> + * @snd_seq:	Send sequence, set on return if @set == false, network order
> + * @rcv_seq:	Receive sequence, set on return if @set == false, network order
> + * @set:	Set if true, dump if false
> + *
> + * Return: 0 on success, negative error code on failure
> + */
> +static int tcp_flow_repair_seq(int s, uint32_t *snd_seq, uint32_t *rcv_seq,
> +			       bool set)
> +{
> +	socklen_t vlen = sizeof(uint32_t);
> +	ssize_t rc;
> +	int v;
> +
> +	v = TCP_SEND_QUEUE;
> +	if (setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &v, sizeof(v))) {
> +		rc = -errno;
> +		err_perror("Selecting TCP_SEND_QUEUE on socket %i", s);
> +		return rc;
> +	}
> +
> +	if (set) {
> +		*snd_seq = ntohl(*snd_seq);
> +		if (setsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, snd_seq, vlen)) {
> +			rc = -errno;
> +			err_perror("Setting send sequence for socket %i", s);
> +			return rc;
> +		}
> +		debug("Set send sequence for socket %i to %u", s, *snd_seq);
> +	} else {
> +		if (getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, snd_seq, &vlen)) {
> +			rc = -errno;
> +			err_perror("Dumping send sequence for socket %i", s);
> +			return rc;
> +		}
> +		debug("Dumped send sequence for socket %i: %u", s, *snd_seq);
> +		*snd_seq = htonl(*snd_seq);
> +	}
> +
> +	v = TCP_RECV_QUEUE;
> +	if (setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &v, sizeof(v))) {
> +		rc = -errno;
> +		err_perror("Selecting TCP_RECV_QUEUE for socket %i", s);
> +		return rc;
> +	}
> +
> +	if (set) {
> +		*rcv_seq = ntohl(*rcv_seq);
> +		if (setsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, rcv_seq, vlen)) {
> +			rc = -errno;
> +			err_perror("Setting receive sequence %u for socket %i",
> +				   *rcv_seq, s);
> +			return rc;
> +		}
> +		debug("Set receive sequence for socket %i to %u", s, *rcv_seq);
> +	} else {
> +		if (getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, rcv_seq, &vlen)) {
> +			rc = -errno;
> +			err_perror("Dumping receive sequence for socket %i", s);
> +			return rc;
> +		}
> +		debug("Dumped receive sequence for socket %i: %u", s, *rcv_seq);
> +		*rcv_seq = htonl(*rcv_seq);
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * tcp_flow_repair_opt() - Dump or set repair "options" (MSS and window scale)
> + * @s:			Socket
> + * @snd_wscale:		Window scaling factor, send, network order
> + * @rcv_wscale:		Window scaling factor, receive, network order
> + * @mss:		Maximum Segment Size, socket side, network order
> + * @set:		Set if true, dump if false
> + *
> + * Return: 0 on success, negative error code on failure
> + */
> +int tcp_flow_repair_opt(int s, uint8_t *snd_wscale, uint8_t *rcv_wscale,
> +			uint32_t *mss, bool set)
> +{
> +	struct tcp_info_linux tinfo;
> +	struct tcp_repair_opt opts[2];
> +	socklen_t sl;
> +	int rc;
> +
> +	opts[0].opt_code = TCPOPT_WINDOW;
> +	opts[1].opt_code = TCPOPT_MAXSEG;
> +
> +	if (set) {
> +		*mss = ntohl(*mss);
> +		debug("Setting repair options for socket %i:", s);
> +		opts[0].opt_val = *snd_wscale + (*rcv_wscale << 16);
> +		opts[1].opt_val = *mss;
> +		debug("  window scale send %u, receive %u, MSS: %u",
> +		      *snd_wscale, *rcv_wscale, *mss);
> +
> +		sl = sizeof(opts);
> +		if (setsockopt(s, SOL_TCP, TCP_REPAIR_OPTIONS, opts, sl)) {
> +			rc = -errno;
> +			err_perror("Setting repair options for socket %i", s);
> +			return rc;
> +		}
> +
> +		sl = sizeof(*mss);
> +		if (setsockopt(s, SOL_TCP, TCP_MAXSEG, mss, sl))
> +			debug_perror("Setting MSS for socket %i", s);
> +	} else {
> +		sl = sizeof(tinfo);
> +		if (getsockopt(s, SOL_TCP, TCP_INFO, &tinfo, &sl)) {
> +			rc = -errno;
> +			err_perror("Querying TCP_INFO for socket %i", s);
> +			return rc;
> +		}
> +
> +		*snd_wscale = tinfo.tcpi_snd_wscale;
> +		*rcv_wscale = tinfo.tcpi_rcv_wscale;
> +
> +		/* TCP_INFO MSS is just the current value: ask explicitly */
> +		sl = sizeof(*mss);
> +		if (getsockopt(s, SOL_TCP, TCP_MAXSEG, mss, &sl)) {
> +			rc = -errno;
> +			err_perror("Getting MSS for socket %i", s);
> +			return rc;
> +		}
> +		*mss = htonl(*mss);
> +
> +		debug("Got repair options for socket %i:", s);
> +		debug("  window scale send %u, receive %u, MSS: %u",
> +		      *snd_wscale, *rcv_wscale, ntohl(*mss));
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * tcp_flow_repair_wnd() - Dump or set window parameters
> + * @snd_wl1:		Next sequence used in window probe (next sequence - 1)
> + * @snd_wnd:		Socket-side sending window, network order
> + * @max_window:		Window clamp, network order
> + * @rcv_wnd:		Socket-side receive window, network order
> + * @rcv_wup:		rcv_nxt on last window update sent, network order
> + * @wnd_out:		If NULL, set parameters, if given, get and return all
> + *
> + * Return: 0 on success, negative error code on failure
> + */
> +static int tcp_flow_repair_wnd(int s, uint32_t *snd_wl1, uint32_t *snd_wnd,
> +			       uint32_t *max_window, uint32_t *rcv_wnd,
> +			       uint32_t *rcv_wup,
> +			       struct tcp_repair_window *wnd_out)
> +{
> +	struct tcp_repair_window wnd_copy, *wnd = wnd_out ? wnd_out : &wnd_copy;
> +	socklen_t sl = sizeof(*wnd);
> +	int rc;
> +
> +	if (!wnd_out) {
> +		wnd->snd_wl1	= ntohl(*snd_wl1);
> +		wnd->snd_wnd	= ntohl(*snd_wnd);
> +		wnd->max_window	= ntohl(*max_window);
> +		wnd->rcv_wnd	= ntohl(*rcv_wnd);
> +		wnd->rcv_wup	= ntohl(*rcv_wup);
> +
> +		if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR_WINDOW, wnd, sl)) {
> +			rc = -errno;
> +			err_perror("Setting window repair data, socket %i", s);
> +			return rc;
> +		}
> +	} else {
> +		if (getsockopt(s, IPPROTO_TCP, TCP_REPAIR_WINDOW, wnd, &sl)) {
> +			rc = -errno;
> +			err_perror("Getting window repair data, socket %i", s);
> +			return rc;
> +		}
> +
> +		*snd_wl1	= htonl(wnd->snd_wl1);
> +		*snd_wnd	= htonl(wnd->snd_wnd);
> +		*max_window	= htonl(wnd->max_window);
> +		*rcv_wnd	= htonl(wnd->rcv_wnd);
> +		*rcv_wup	= htonl(wnd->rcv_wup);
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * tcp_flow_migrate_shrink_window() - Dump window data, decrease socket window
> + * @fidx:	Flow index
> + * @conn:	Pointer to the TCP connection structure
> + *
> + * Return: 0 on success, negative error code on failure
> + */
> +int tcp_flow_migrate_shrink_window(int fidx, const struct tcp_tap_conn *conn)
> +{
> +	struct tcp_tap_transfer_ext *t = &migrate_ext[fidx];
> +	struct tcp_repair_window wnd;
> +	socklen_t sl = sizeof(wnd);
> +	int s = conn->sock;
> +
> +	if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &((int){ 0 }), sizeof(int)))
> +		debug("TCP: failed to set SO_RCVBUF to minimum value");
> +
> +	/* Dump window data as it is for the target, before touching stuff */
> +	tcp_flow_repair_wnd(s, &t->sock_snd_wl1, &t->sock_snd_wnd,
> +			    &t->sock_max_window, &t->sock_rcv_wnd,
> +			    &t->sock_rcv_wup, &wnd);
> +
> +	wnd.rcv_wnd = 0;
> +
> +	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR_WINDOW, &wnd, sl))
> +		debug_perror("Setting window repair data, socket %i", s);
> +
> +	return 0;
> +}
> +
> +/**
> + * tcp_flow_migrate_source() - Send data (flow table part) for a single flow
> + * @c:		Execution context
> + * @fd:		Descriptor for state migration
> + * @conn:	Pointer to the TCP connection structure
> + *
> + * Return: 0 on success, negative error code on failure
> + */
> +int tcp_flow_migrate_source(int fd, const struct tcp_tap_conn *conn)
> +{
> +	struct tcp_tap_transfer t = {
> +		.retrans		= conn->retrans,
> +		.ws_from_tap		= conn->ws_from_tap,
> +		.ws_to_tap		= conn->ws_to_tap,
> +		.events			= conn->events,
> +
> +		.tap_mss		= htonl(MSS_GET(conn)),
> +
> +		.sndbuf			= htonl(conn->sndbuf),
> +
> +		.flags			= conn->flags,
> +		.seq_dup_ack_approx	= conn->seq_dup_ack_approx,
> +
> +		.wnd_from_tap		= htons(conn->wnd_from_tap),
> +		.wnd_to_tap		= htons(conn->wnd_to_tap),
> +
> +		.seq_to_tap		= htonl(conn->seq_to_tap),
> +		.seq_ack_from_tap	= htonl(conn->seq_ack_from_tap),
> +		.seq_from_tap		= htonl(conn->seq_from_tap),
> +		.seq_ack_to_tap		= htonl(conn->seq_ack_to_tap),
> +		.seq_init_from_tap	= htonl(conn->seq_init_from_tap),
> +	};
> +	int rc;
> +
> +	memcpy(&t.pif, conn->f.pif, sizeof(t.pif));
> +	memcpy(&t.side, conn->f.side, sizeof(t.side));
> +
> +	if (write_all_buf(fd, &t, sizeof(t))) {
> +		rc = -errno;
> +		err_perror("Failed to write migration data for socket %i",
> +			   conn->sock);
> +		return rc;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * tcp_flow_migrate_source_ext() - Dump queues, close sockets, send final data
> + * @fd:		Descriptor for state migration
> + * @fidx:	Flow index
> + * @conn:	Pointer to the TCP connection structure
> + *
> + * Return: 0 on success, negative error code on failure
> + */
> +int tcp_flow_migrate_source_ext(int fd, int fidx,
> +				const struct tcp_tap_conn *conn)
> +{
> +	struct tcp_tap_transfer_ext *t = &migrate_ext[fidx];
> +	struct tcp_repair_window wnd;
> +	int s = conn->sock;
> +	int rc;
> +
> +	/* FIXME: Reenable dump in tcp_flow_migrate_shrink_window() */
> +	tcp_flow_repair_wnd(s, &t->sock_snd_wl1, &t->sock_snd_wnd,
> +			    &t->sock_max_window, &t->sock_rcv_wnd,
> +			    &t->sock_rcv_wup, &wnd);
> +
> +	rc = tcp_flow_repair_seq(s, &t->sock_seq_snd, &t->sock_seq_rcv, false);
> +	if (rc) {
> +		err("Failed to get sequences on source for socket %i: %s",
> +		    s, strerror_(-rc));
> +		return rc;
> +	}
> +
> +	tcp_flow_repair_opt(s, &t->snd_wscale, &t->rcv_wscale, &t->sock_mss,
> +			    false);
> +
> +	/* Dump receive queue as late as possible */

Hmm.. why?

> +	rc = tcp_flow_repair_queues(s, tcp_migrate_snd_queue,
> +				    &t->sndlen, &t->notsentlen,
> +				    tcp_migrate_rcv_queue, &t->rcvlen, false);
> +	if (rc) {
> +		err("Failed to dump queues on source for socket %i: %s",
> +		    s, strerror_(-rc));
> +		return rc;
> +	}
> +
> +	if (ntohl(t->sndlen) > TCP_MIGRATE_SND_QUEUE_MAX ||
> +	    ntohl(t->notsentlen) > ntohl(t->sndlen) ||
> +	    ntohl(t->rcvlen) > TCP_MIGRATE_RCV_QUEUE_MAX) {
> +		err("Bad data queues length, socket %i, send: %u, receive: %u",
> +		    s, ntohl(t->sndlen), ntohl(t->rcvlen));
> +		return -EINVAL;
> +	}
> +
> +	/* FIXME: it's either this or flow_migrate_source_early(), why? */
> +	close(s);
> +
> +	if (write_all_buf(fd, t, sizeof(*t))) {
> +		rc = -errno;
> +		err_perror("Failed to write extended data for socket %i", s);
> +		return rc;
> +	}
> +
> +	if (write_all_buf(fd, tcp_migrate_snd_queue, ntohl(t->sndlen))) {
> +		rc = -errno;
> +		err_perror("Failed to write send queue data for socket %i", s);
> +		return rc;
> +	}
> +
> +	if (write_all_buf(fd, tcp_migrate_rcv_queue, ntohl(t->rcvlen))) {
> +		rc = -errno;
> +		err_perror("Failed to write receive queue data for socket %i",
> +			   s);
> +		return rc;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * tcp_flow_repair_socket() - Open and bind socket, request repair mode
> + * @c:		Execution context
> + * @conn:	Pointer to the TCP connection structure
> + *
> + * Return: 0 on success, negative error code on failure
> + */
> +int tcp_flow_repair_socket(struct ctx *c, struct tcp_tap_conn *conn)
> +{
> +	sa_family_t af = CONN_V4(conn) ? AF_INET : AF_INET6;
> +	const struct flowside *sockside = HOSTFLOW(conn);
> +	union sockaddr_inany a;
> +	socklen_t sl;
> +	int s, rc;
> +
> +	pif_sockaddr(c, &a, &sl, PIF_HOST, &sockside->oaddr, sockside->oport);
> +
> +	if ((conn->sock = socket(af, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
> +				 IPPROTO_TCP)) < 0) {
> +		rc = -errno;
> +		err_perror("Failed to create socket for migrated flow");
> +		return rc;
> +	}
> +	s = conn->sock;
> +
> +	setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &(int){ 1 }, sizeof(int));
> +
> +	tcp_sock_set_bufsize(c, s);
> +	tcp_sock_set_nodelay(s);
> +
> +	if (bind(s, &a.sa, sizeof(a)) < 0) {
> +		rc = -errno;
> +		err_perror("Failed to bind socket %i for migrated flow", s);
> +		close(s);
> +		conn->sock = -1;
> +		return rc;
> +	}
> +
> +	rc = tcp_flow_repair_on(c, conn);
> +	if (rc) {
> +		close(s);
> +		conn->sock = -1;
> +		return rc;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * tcp_flow_repair_connect() - Connect socket in repair mode, then turn it off
> + * @c:		Execution context
> + * @conn:	Pointer to the TCP connection structure
> + *
> + * Return: 0 on success, negative error code on failure
> + */
> +static int tcp_flow_repair_connect(const struct ctx *c,
> +				   struct tcp_tap_conn *conn)
> +{
> +	const struct flowside *tgt = HOSTFLOW(conn);
> +	int rc;
> +
> +	rc = flowside_connect(c, conn->sock, PIF_HOST, tgt);
> +	if (rc) {
> +		rc = -errno;
> +		err_perror("Failed to connect migrated socket %i", conn->sock);
> +		return rc;
> +	}
> +
> +	conn->in_epoll = 0;
> +	conn->timer = -1;
> +	if ((rc = tcp_epoll_ctl(c, conn))) {
> +		debug("Failed to subscribe to epoll for migrated socket %i: %s",
> +		      conn->sock, strerror_(-rc));
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * tcp_flow_migrate_target() - Receive data (flow table part) for flow, insert
> + * @c:		Execution context
> + * @fd:		Descriptor for state migration
> + *
> + * Return: 0 on success, negative error code on failure
> + */
> +int tcp_flow_migrate_target(struct ctx *c, int fd)
> +{
> +	struct tcp_tap_transfer t;
> +	struct tcp_tap_conn *conn;
> +	union flow *flow;
> +	int rc;
> +
> +	if (!(flow = flow_alloc())) {
> +		err("Flow table full on migration target");
> +		return -ENOMEM;
> +	}
> +
> +	if (read_all_buf(fd, &t, sizeof(t))) {
> +		err_perror("Failed to receive migration data");
> +		return -errno;
> +	}
> +
> +	flow->f.state = FLOW_STATE_TGT;
> +	memcpy(&flow->f.pif, &t.pif, sizeof(flow->f.pif));
> +	memcpy(&flow->f.side, &t.side, sizeof(flow->f.side));
> +	conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
> +
> +	conn->retrans			= t.retrans;
> +	conn->ws_from_tap		= t.ws_from_tap;
> +	conn->ws_to_tap			= t.ws_to_tap;
> +	conn->events			= t.events;
> +
> +	conn->sndbuf			= htonl(t.sndbuf);
> +
> +	conn->flags			= t.flags;
> +	conn->seq_dup_ack_approx	= t.seq_dup_ack_approx;
> +
> +	MSS_SET(conn,			  ntohl(t.tap_mss));
> +
> +	conn->wnd_from_tap		= ntohs(t.wnd_from_tap);
> +	conn->wnd_to_tap		= ntohs(t.wnd_to_tap);
> +
> +	conn->seq_to_tap		= ntohl(t.seq_to_tap);
> +	conn->seq_ack_from_tap		= ntohl(t.seq_ack_from_tap);
> +	conn->seq_from_tap		= ntohl(t.seq_from_tap);
> +	conn->seq_ack_to_tap		= ntohl(t.seq_ack_to_tap);
> +	conn->seq_init_from_tap		= ntohl(t.seq_init_from_tap);
> +
> +	if ((rc = tcp_flow_repair_socket(c, conn)))
> +		return rc;
> +
> +	flow_hash_insert(c, TAP_SIDX(conn));
> +	FLOW_ACTIVATE(conn);
> +
> +	return 0;
> +}
> +
> +/**
> + * tcp_flow_migrate_target_ext() - Receive extended data for flow, set, connect
> + * @c:		Execution context
> + * @flow:	Existing flow for this connection data
> + * @fd:		Descriptor for state migration
> + *
> + * Return: 0 on success, negative code on failure, but 0 on connection reset
> + */
> +int tcp_flow_migrate_target_ext(struct ctx *c, union flow *flow, int fd)
> +{
> +	struct tcp_tap_conn *conn = &flow->tcp;
> +	struct tcp_tap_transfer_ext t;
> +	uint32_t peek_offset, len;
> +	int s = conn->sock, rc;
> +
> +	if (read_all_buf(fd, &t, sizeof(t))) {
> +		rc = -errno;
> +		err_perror("Failed to read extended data for socket %i", s);
> +		return rc;
> +	}
> +
> +	if (ntohl(t.sndlen) > TCP_MIGRATE_SND_QUEUE_MAX ||
> +	    ntohl(t.notsentlen) > ntohl(t.sndlen) ||
> +	    ntohl(t.rcvlen) > TCP_MIGRATE_RCV_QUEUE_MAX) {
> +		err("Bad data queues length, socket %i, send: %u, receive: %u",
> +		    s, ntohl(t.sndlen), ntohl(t.rcvlen));
> +		return -EINVAL;
> +	}
> +
> +	if (read_all_buf(fd, tcp_migrate_snd_queue, ntohl(t.sndlen))) {
> +		rc = -errno;
> +		err_perror("Failed to read send queue data for socket %i", s);
> +		return rc;
> +	}
> +
> +	if (read_all_buf(fd, tcp_migrate_rcv_queue, ntohl(t.rcvlen))) {
> +		rc = -errno;
> +		err_perror("Failed to read receive queue data for socket %i",
> +			   s);
> +		return rc;
> +	}
> +
> +	if ((rc = tcp_flow_repair_seq(s, &t.sock_seq_snd, &t.sock_seq_rcv,
> +				      true)))
> +		return rc;
> +
> +	if ((rc = tcp_flow_repair_connect(c, conn)))
> +		return rc;
> +
> +	if ((rc = tcp_flow_repair_opt(s, &t.snd_wscale, &t.rcv_wscale,
> +				      &t.sock_mss, true)))
> +		return rc;
> +
> +	if ((rc = tcp_flow_repair_queues(s,
> +					 tcp_migrate_snd_queue, &t.sndlen,
> +					 &((uint32_t){ 0 }), /* Sent only */
> +					 tcp_migrate_rcv_queue, &t.rcvlen,
> +					 true)))
> +		debug_perror("Error while repairing queues for socket %i", s);
> +
> +	if ((rc = tcp_flow_repair_wnd(s, &t.sock_snd_wl1, &t.sock_snd_wnd,
> +				      &t.sock_max_window, &t.sock_rcv_wnd,
> +				      &t.sock_rcv_wup, NULL)))
> +		debug_perror("Error while repairing window for socket %i", s);
> +
> +	tcp_flow_repair_off(c, conn);
> +	repair_flush(c);	/* FIXME: batch this? */
> +
> +	/* Now the unsent part of send queue */
> +	len = htonl(ntohl(t.sndlen) - ntohl(t.notsentlen));
> +	if ((rc = tcp_flow_repair_queues(s,
> +					 tcp_migrate_snd_queue + ntohl(len),
> +					 &((uint32_t){ 0 }), &len,
> +					 NULL, &((uint32_t){ 0 }),
> +					 true)))
> +		debug_perror("Sending unsent data for socket %i", s);
> +
> +	peek_offset = conn->seq_to_tap - conn->seq_ack_from_tap;
> +	if (tcp_set_peek_offset(conn->sock, peek_offset))
> +		tcp_rst(c, conn);
> +
> +	tcp_send_flag(c, conn, ACK);
> +
> +	return 0;
> +}
> diff --git a/tcp_conn.h b/tcp_conn.h
> index d342680..b64e857 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -96,6 +96,89 @@ struct tcp_tap_conn {
>  	uint32_t	seq_init_from_tap;
>  };
>  
> +/**
> + * struct tcp_tap_transfer - Migrated TCP data, flow table part, network order
> + * @pif:		Interfaces for each side of the flow
> + * @side:		Addresses and ports for each side of the flow
> + * @retrans:		Number of retransmissions occurred due to ACK_TIMEOUT
> + * @ws_from_tap:	Window scaling factor advertised from tap/guest
> + * @ws_to_tap:		Window scaling factor advertised to tap/guest
> + * @events:		Connection events, implying connection states
> + * @tap_mss:		MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
> + * @sndbuf:		Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS
> + * @flags:		Connection flags representing internal attributes
> + * @seq_dup_ack_approx:	Last duplicate ACK number sent to tap
> + * @wnd_from_tap:	Last window size from tap, unscaled (as received)
> + * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
> + * @seq_to_tap:		Next sequence for packets to tap
> + * @seq_ack_from_tap:	Last ACK number received from tap
> + * @seq_from_tap:	Next sequence for packets from tap (not actually sent)
> + * @seq_ack_to_tap:	Last ACK number sent to tap
> + * @seq_init_from_tap:	Initial sequence number from tap
> +*/
> +struct tcp_tap_transfer {
> +	uint8_t		pif[SIDES];
> +	struct flowside	side[SIDES];
> +
> +	uint8_t		retrans;
> +	uint8_t		ws_from_tap;
> +	uint8_t		ws_to_tap;
> +	uint8_t		events;
> +
> +	uint32_t	tap_mss;
> +
> +	uint32_t	sndbuf;
> +
> +	uint8_t		flags;
> +	uint8_t		seq_dup_ack_approx;
> +
> +	uint16_t	wnd_from_tap;
> +	uint16_t	wnd_to_tap;
> +
> +	uint32_t	seq_to_tap;
> +	uint32_t	seq_ack_from_tap;
> +	uint32_t	seq_from_tap;
> +	uint32_t	seq_ack_to_tap;
> +	uint32_t	seq_init_from_tap;
> +} __attribute__((packed, aligned(__alignof__(uint32_t))));
> +
> +/**
> + * struct tcp_tap_transfer_ext - Migrated TCP data, outside flow, network order
> + * @sock_seq_snd:	Socket-side send sequence
> + * @sock_seq_rcv:	Socket-side receive sequence
> + * @sndlen:		Length of pending send queue (unacknowledged / not sent)
> + * @notsentlen:		Part of pending send queue that wasn't sent out yet
> + * @rcvlen:		Length of pending receive queue
> + * @sock_mss:		Socket-side MSS
> + * @sock_snd_wl1:	Next sequence used in window probe (next sequence - 1)
> + * @sock_snd_wnd:	Socket-side sending window
> + * @sock_max_window:	Window clamp
> + * @sock_rcv_wnd:	Socket-side receive window
> + * @sock_rcv_wup:	rcv_nxt on last window update sent
> + * @snd_wscale:		Window scaling factor, send
> + * @snd_wscale:		Window scaling factor, receive
> + */
> +struct tcp_tap_transfer_ext {
> +	uint32_t	sock_seq_snd;
> +	uint32_t	sock_seq_rcv;
> +
> +	uint32_t	sndlen;
> +	uint32_t	notsentlen;
> +	uint32_t	rcvlen;
> +
> +	uint32_t	sock_mss;
> +
> +	/* We can't just use struct tcp_repair_window: we need network order */
> +	uint32_t	sock_snd_wl1;
> +	uint32_t	sock_snd_wnd;
> +	uint32_t	sock_max_window;
> +	uint32_t	sock_rcv_wnd;
> +	uint32_t	sock_rcv_wup;
> +
> +	uint8_t		snd_wscale;
> +	uint8_t		rcv_wscale;
> +} __attribute__((packed, aligned(__alignof__(uint32_t))));
> +
>  /**
>   * struct tcp_splice_conn - Descriptor for a spliced TCP connection
>   * @f:			Generic flow information
> @@ -140,6 +223,18 @@ extern int init_sock_pool4	[TCP_SOCK_POOL_SIZE];
>  extern int init_sock_pool6	[TCP_SOCK_POOL_SIZE];
>  
>  bool tcp_flow_defer(const struct tcp_tap_conn *conn);
> +
> +int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn);
> +int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn);
> +
> +int tcp_flow_migrate_shrink_window(int fidx, const struct tcp_tap_conn *conn);
> +int tcp_flow_migrate_source(int fd, const struct tcp_tap_conn *conn);
> +int tcp_flow_migrate_source_ext(int fd, int fidx,
> +				const struct tcp_tap_conn *conn);
> +
> +int tcp_flow_migrate_target(struct ctx *c, int fd);
> +int tcp_flow_migrate_target_ext(struct ctx *c, union flow *flow, int fd);
> +
>  bool tcp_splice_flow_defer(struct tcp_splice_conn *conn);
>  void tcp_splice_timer(const struct ctx *c, struct tcp_splice_conn *conn);
>  int tcp_conn_pool_sock(int pool[]);

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v13 5/6] migrate: Migrate TCP flows
  2025-02-10  6:05   ` David Gibson
@ 2025-02-10  9:51     ` Stefano Brivio
  2025-02-10 15:54       ` Stefano Brivio
  0 siblings, 1 reply; 14+ messages in thread
From: Stefano Brivio @ 2025-02-10  9:51 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Mon, 10 Feb 2025 17:05:08 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Sun, Feb 09, 2025 at 11:20:04PM +0100, Stefano Brivio wrote:
> >
> > +/**
> > + * flow_migrate_source_rollback() - Disable repair mode, return failure
> > + * @c:		Execution context
> > + * @max_flow:	Maximum index of affected flows
> > + * @ret:	Negative error code
> > + *
> > + * Return: @ret
> > + */
> > +static int flow_migrate_source_rollback(struct ctx *c, unsigned max_flow,
> > +					int ret)
> > +{
> > +	union flow *flow;
> > +	unsigned i;
> > +
> > +	debug("...roll back migration");
> > +
> > +	foreach_tcp_flow(i, flow, max_flow)
> > +		tcp_flow_repair_off(c, &flow->tcp);
> > +
> > +	repair_flush(c);  
> 
> I think this should die() on failures.  If we get here, it could well
> mean we've already had a failure enabling repair mode, so an error
> disabling is more plausible than usual.  I think die()ing is
> preferable to carrying on, since resuming normal operation with some
> of our sockets in repair mode is almost certain to result in really
> weird, hard to debug behaviour.

It makes sense in general, except for the case where we were unable to
set repair mode for any of the sockets, say, passt-repair isn't there
or we can't use it for any reason.

We should probably tell this function how to handle failures, from
callers.

> > +
> > +	return ret;
> > +}
> > +
> > +/**
> > + * flow_migrate_repair_all() - Turn repair mode on or off for all flows
> > + * @c:		Execution context
> > + * @enable:	Switch repair mode on if set, off otherwise
> > + *
> > + * Return: 0 on success, negative error code on failure
> > + */
> > +static int flow_migrate_repair_all(struct ctx *c, bool enable)
> > +{
> > +	union flow *flow;
> > +	unsigned i;
> > +	int rc;
> > +
> > +	foreach_tcp_flow(i, flow, FLOW_MAX) {
> > +		if (enable)
> > +			rc = tcp_flow_repair_on(c, &flow->tcp);
> > +		else
> > +			rc = tcp_flow_repair_off(c, &flow->tcp);
> > +
> > +		if (rc) {
> > +			debug("Can't %s repair mode: %s",
> > +			      enable ? "enable" : "disable", strerror_(-rc));
> > +			return flow_migrate_source_rollback(c, i, rc);
> > +		}
> > +	}
> > +
> > +	if ((rc = repair_flush(c))) {
> > +		debug("Can't %s repair mode: %s",
> > +		      enable ? "enable" : "disable", strerror_(-rc));
> > +		return flow_migrate_source_rollback(c, i, rc);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * flow_migrate_source_early() - Early tasks: shrink (RFC 7323 2.2) TCP windows
> > + * @c:		Execution context
> > + * @stage:	Migration stage information, unused
> > + * @fd:		Migration file descriptor, unused
> > + *
> > + * Return: 0 on success, positive error code on failure
> > + */
> > +int flow_migrate_source_early(struct ctx *c, const struct migrate_stage *stage,
> > +			      int fd)
> > +{
> > +	union flow *flow;
> > +	unsigned i;
> > +	int rc;
> > +
> > +	(void)stage;
> > +	(void)fd;
> > +
> > +	/* We need repair mode to dump and set (some) window parameters */
> > +	if ((rc = flow_migrate_repair_all(c, true)))
> > +		return -rc;
> > +
> > +	foreach_tcp_flow(i, flow, FLOW_MAX) {
> > +		if ((rc = tcp_flow_migrate_shrink_window(i, &flow->tcp))) {
> > +			err("Shrinking window, flow %u: %s", i, strerror_(-rc));
> > +			return flow_migrate_source_rollback(c, i, -rc);
> > +		}
> > +	}
> > +
> > +	/* Now send window updates. We'll flip repair mode back on in a bit */
> > +	if ((rc = flow_migrate_repair_all(c, false)))
> > +		return -rc;
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * flow_migrate_source_pre() - Prepare flows for migration: enable repair mode
> > + * @c:		Execution context
> > + * @stage:	Migration stage information (unused)
> > + * @fd:		Migration file descriptor (unused)
> > + *
> > + * Return: 0 on success, positive error code on failure
> > + */
> > +int flow_migrate_source_pre(struct ctx *c, const struct migrate_stage *stage,
> > +			    int fd)
> > +{
> > +	int rc;
> > +
> > +	(void)stage;
> > +	(void)fd;
> > +
> > +	if ((rc = flow_migrate_repair_all(c, true)))
> > +		return -rc;
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * flow_migrate_source() - Dump all the remaining information and send data
> > + * @c:		Execution context (unused)
> > + * @stage:	Migration stage information (unused)
> > + * @fd:		Migration file descriptor
> > + *
> > + * Return: 0 on success, positive error code on failure
> > + */
> > +int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
> > +			int fd)
> > +{
> > +	uint32_t count = 0;
> > +	union flow *flow;
> > +	unsigned i;
> > +	int rc;
> > +
> > +	(void)c;
> > +	(void)stage;
> > +
> > +	foreach_tcp_flow(i, flow, FLOW_MAX)
> > +		count++;
> > +
> > +	count = htonl(count);
> > +	if ((rc = write_all_buf(fd, &count, sizeof(count)))) {
> > +		rc = errno;
> > +		err_perror("Can't send flow count (%u)", ntohl(count));
> > +		return flow_migrate_source_rollback(c, FLOW_MAX, rc);
> > +	}
> > +
> > +	debug("Sending %u flows", ntohl(count));
> > +
> > +	/* Dump and send information that can be stored in the flow table */
> > +	foreach_tcp_flow(i, flow, FLOW_MAX) {
> > +		if ((rc = tcp_flow_migrate_source(fd, &flow->tcp))) {
> > +			err("Can't send data, flow %u: %s", i, strerror_(-rc));
> > +			return flow_migrate_source_rollback(c, FLOW_MAX, -rc);
> > +		}
> > +	}
> > +
> > +	/* And then "extended" data (including window data we saved previously):
> > +	 * the target needs to set repair mode on sockets before it can set
> > +	 * this stuff, but it needs sockets (and flows) for that.
> > +	 *
> > +	 * This also closes sockets so that the target can start connecting
> > +	 * theirs: you can't sendmsg() to queues (using the socket) if the
> > +	 * socket is not connected (EPIPE), not even in repair mode. And the
> > +	 * target needs to restore queues now because we're sending the data.
> > +	 *
> > +	 * So, no rollback here, just try as hard as we can.
> > +	 */
> > +	foreach_tcp_flow(i, flow, FLOW_MAX) {
> > +		if ((rc = tcp_flow_migrate_source_ext(fd, i, &flow->tcp)))
> > +			err("Extended data for flow %u: %s", i, strerror_(-rc));
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * flow_migrate_target() - Receive flows and insert in flow table
> > + * @c:		Execution context
> > + * @stage:	Migration stage information (unused)
> > + * @fd:		Migration file descriptor
> > + *
> > + * Return: 0 on success, positive error code on failure
> > + */
> > +int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
> > +			int fd)
> > +{
> > +	uint32_t count;
> > +	unsigned i;
> > +	int rc;
> > +
> > +	(void)stage;
> > +
> > +	if (read_all_buf(fd, &count, sizeof(count)))
> > +		return errno;
> > +
> > +	count = ntohl(count);
> > +	debug("Receiving %u flows", count);
> > +
> > +	if ((rc = flow_migrate_repair_all(c, true)))
> > +		return -rc;
> > +
> > +	repair_flush(c);  
> 
> Unnecessary, flow_migrate_repair_all() already handles this.
> 
> > +
> > +	/* TODO: flow header with type, instead? */
> > +	for (i = 0; i < count; i++) {
> > +		rc = tcp_flow_migrate_target(c, fd);
> > +		if (rc) {
> > +			debug("Bad target data for flow %u: %s, abort",
> > +			      i, strerror_(-rc));
> > +			return -rc;
> > +		}
> > +	}
> > +
> > +	repair_flush(c);
> > +
> > +	for (i = 0; i < count; i++) {
> > +		rc = tcp_flow_migrate_target_ext(c, flowtab + i, fd);
> > +		if (rc) {
> > +			debug("Bad target extended data for flow %u: %s, abort",
> > +			      i, strerror_(-rc));
> > +			return -rc;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >  /**
> >   * flow_init() - Initialise flow related data structures
> >   */
> > diff --git a/flow.h b/flow.h
> > index 24ba3ef..675726e 100644
> > --- a/flow.h
> > +++ b/flow.h
> > @@ -249,6 +249,14 @@ union flow;
> >  
> >  void flow_init(void);
> >  void flow_defer_handler(const struct ctx *c, const struct timespec *now);
> > +int flow_migrate_source_early(struct ctx *c, const struct migrate_stage *stage,
> > +			      int fd);
> > +int flow_migrate_source_pre(struct ctx *c, const struct migrate_stage *stage,
> > +			    int fd);
> > +int flow_migrate_source(struct ctx *c, const struct migrate_stage *stage,
> > +			int fd);
> > +int flow_migrate_target(struct ctx *c, const struct migrate_stage *stage,
> > +			int fd);
> >  
> >  void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> >  	__attribute__((format(printf, 3, 4)));
> > diff --git a/migrate.c b/migrate.c
> > index 1c59016..c5c6663 100644
> > --- a/migrate.c
> > +++ b/migrate.c
> > @@ -98,11 +98,30 @@ static int seen_addrs_target_v1(struct ctx *c,
> >  
> >  /* Stages for version 1 */
> >  static const struct migrate_stage stages_v1[] = {
> > +	/* FIXME: With this step, close() in tcp_flow_migrate_source_ext()
> > +	 * *sometimes* closes the connection for real.
> > +	 */
> > +/*	{
> > +		.name = "shrink TCP windows",
> > +		.source = flow_migrate_source_early,
> > +		.target = NULL,
> > +	},
> > +*/  
> 
> Given we're not sure if this will help, and it adds some
> complications, probably makes sense to split this into a separate
> patch.

I'd rather not because, due to this, the code for the case *without it*
is a bit different anyway, and we avoid some code churn. I would also
like to merge it unused to make our lives a bit easier the day we retry
to work on it.

> >  	{
> >  		.name = "observed addresses",
> >  		.source = seen_addrs_source_v1,
> >  		.target = seen_addrs_target_v1,
> >  	},
> > +	{
> > +		.name = "prepare flows",
> > +		.source = flow_migrate_source_pre,
> > +		.target = NULL,
> > +	},
> > +	{
> > +		.name = "transfer flows",
> > +		.source = flow_migrate_source,
> > +		.target = flow_migrate_target,
> > +	},
> >  	{ 0 },
> >  };
> >  
> > diff --git a/passt.c b/passt.c
> > index 6f9fb4d..68d1a28 100644
> > --- a/passt.c
> > +++ b/passt.c
> > @@ -223,9 +223,6 @@ int main(int argc, char **argv)
> >  		if (sigaction(SIGCHLD, &sa, NULL))
> >  			die_perror("Couldn't install signal handlers");
> >  
> > -		if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
> > -			die_perror("Couldn't set disposition for SIGPIPE");
> > -
> >  		c.mode = MODE_PASTA;
> >  	} else if (strstr(name, "passt")) {
> >  		c.mode = MODE_PASST;
> > @@ -233,6 +230,9 @@ int main(int argc, char **argv)
> >  		_exit(EXIT_FAILURE);
> >  	}
> >  
> > +	if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
> > +		die_perror("Couldn't set disposition for SIGPIPE");
> > +
> >  	madvise(pkt_buf, TAP_BUF_BYTES, MADV_HUGEPAGE);
> >  
> >  	c.epollfd = epoll_create1(EPOLL_CLOEXEC);
> > diff --git a/repair.c b/repair.c
> > index 784b994..da85edb 100644
> > --- a/repair.c
> > +++ b/repair.c
> > @@ -190,7 +190,6 @@ int repair_flush(struct ctx *c)
> >   *
> >   * Return: 0 on success, negative error code on failure
> >   */
> > -/* cppcheck-suppress unusedFunction */
> >  int repair_set(struct ctx *c, int s, int cmd)
> >  {
> >  	int rc;
> > diff --git a/tcp.c b/tcp.c
> > index af6bd95..78db64f 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -280,6 +280,7 @@
> >  #include <stddef.h>
> >  #include <string.h>
> >  #include <sys/epoll.h>
> > +#include <sys/ioctl.h>
> >  #include <sys/socket.h>
> >  #include <sys/timerfd.h>
> >  #include <sys/types.h>
> > @@ -287,6 +288,8 @@
> >  #include <time.h>
> >  #include <arpa/inet.h>
> >  
> > +#include <linux/sockios.h>
> > +
> >  #include "checksum.h"
> >  #include "util.h"
> >  #include "iov.h"
> > @@ -299,6 +302,7 @@
> >  #include "log.h"
> >  #include "inany.h"
> >  #include "flow.h"
> > +#include "repair.h"
> >  #include "linux_dep.h"
> >  
> >  #include "flow_table.h"
> > @@ -326,6 +330,19 @@
> >  	 ((conn)->events & (SOCK_FIN_RCVD | TAP_FIN_RCVD)))
> >  #define CONN_HAS(conn, set)	(((conn)->events & (set)) == (set))
> >  
> > +/* Buffers to migrate pending data from send and receive queues. No, they don't
> > + * use memory if we don't use them. And we're going away after this, so splurge.
> > + */
> > +#define TCP_MIGRATE_SND_QUEUE_MAX	(64 << 20)
> > +#define TCP_MIGRATE_RCV_QUEUE_MAX	(64 << 20)
> > +uint8_t tcp_migrate_snd_queue		[TCP_MIGRATE_SND_QUEUE_MAX];
> > +uint8_t tcp_migrate_rcv_queue		[TCP_MIGRATE_RCV_QUEUE_MAX];
> > +
> > +#define TCP_MIGRATE_RESTORE_CHUNK_MIN	1024 /* Try smaller when above this */
> > +
> > +/* "Extended" data (not stored in the flow table) for TCP flow migration */
> > +static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
> > +
> >  static const char *tcp_event_str[] __attribute((__unused__)) = {
> >  	"SOCK_ACCEPTED", "TAP_SYN_RCVD", "ESTABLISHED", "TAP_SYN_ACK_SENT",
> >  
> > @@ -2645,3 +2662,775 @@ void tcp_timer(struct ctx *c, const struct timespec *now)
> >  	if (c->mode == MODE_PASTA)
> >  		tcp_splice_refill(c);
> >  }
> > +
> > +/**
> > + * tcp_flow_repair_on() - Enable repair mode for a single TCP flow
> > + * @c:		Execution context
> > + * @conn:	Pointer to the TCP connection structure
> > + *
> > + * Return: 0 on success, negative error code on failure
> > + */
> > +int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn)
> > +{
> > +	int rc = 0;
> > +
> > +	if ((rc = repair_set(c, conn->sock, TCP_REPAIR_ON)))
> > +		err("Failed to set TCP_REPAIR");
> > +
> > +	return rc;
> > +}
> > +
> > +/**
> > + * tcp_flow_repair_off() - Clear repair mode for a single TCP flow
> > + * @c:		Execution context
> > + * @conn:	Pointer to the TCP connection structure
> > + *
> > + * Return: 0 on success, negative error code on failure
> > + */
> > +int tcp_flow_repair_off(struct ctx *c, const struct tcp_tap_conn *conn)
> > +{
> > +	int rc = 0;
> > +
> > +	if ((rc = repair_set(c, conn->sock, TCP_REPAIR_OFF)))
> > +		err("Failed to clear TCP_REPAIR");
> > +
> > +	return rc;
> > +}
> > +
> > +/**
> > + * tcp_flow_repair_queues() - Read or write socket queues, or send unsent data
> > + * @s:			Socket
> > + * @sndbuf:		Send queue buffer read or written/sent depending on @set
> > + * @sndlen:		Length of send queue buffer to set, network order
> > + * @notsentlen:		Length of not sent data, non-zero to actually _send_
> > + * @rcvbuf:		Receive queue buffer, read or written depending on @set
> > + * @rcvlen:		Length of receive queue buffer to set, network order
> > + * @set:		Set or send (unsent data only) if true, dump if false
> > + *
> > + * Return: 0 on success, negative error code on failure
> > + *
> > + * #syscalls:vu ioctl
> > + */
> > +static int tcp_flow_repair_queues(int s,
> > +				  uint8_t *sndbuf,
> > +				  uint32_t *sndlen, uint32_t *notsentlen,
> > +				  uint8_t *rcvbuf, uint32_t *rcvlen,
> > +				  bool set)
> > +{
> > +	ssize_t rc;
> > +	int v;
> > +
> > +	if (set && rcvbuf) { /* FIXME: can't check notsentlen, rework this */  
> 
> Not really clear why this is its own block, rather than part of the if
> (set) below.

Because in the general case we need to select the send queue, for both
set (set/write send queue) and !set (get/dump send queue), but not in
the case we write to the "real" socket (send send queue, don't write
it).

That happens to be the case where I pass rcvbuf as NULL, and that's how
I grossly identified that subcase. This function needs a complete
rewrite, it got out of hand.

> > +		v = TCP_SEND_QUEUE;
> > +		if (setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &v, sizeof(v))) {
> > +			rc = -errno;
> > +			err_perror("Selecting TCP_SEND_QUEUE on socket %i", s);
> > +			return rc;
> > +		}
> > +	}
> > +
> > +	if (set) {
> >
> > [...]
> >
> > +/**
> > + * tcp_flow_migrate_source_ext() - Dump queues, close sockets, send final data
> > + * @fd:		Descriptor for state migration
> > + * @fidx:	Flow index
> > + * @conn:	Pointer to the TCP connection structure
> > + *
> > + * Return: 0 on success, negative error code on failure
> > + */
> > +int tcp_flow_migrate_source_ext(int fd, int fidx,
> > +				const struct tcp_tap_conn *conn)
> > +{
> > +	struct tcp_tap_transfer_ext *t = &migrate_ext[fidx];
> > +	struct tcp_repair_window wnd;
> > +	int s = conn->sock;
> > +	int rc;
> > +
> > +	/* FIXME: Reenable dump in tcp_flow_migrate_shrink_window() */
> > +	tcp_flow_repair_wnd(s, &t->sock_snd_wl1, &t->sock_snd_wnd,
> > +			    &t->sock_max_window, &t->sock_rcv_wnd,
> > +			    &t->sock_rcv_wup, &wnd);
> > +
> > +	rc = tcp_flow_repair_seq(s, &t->sock_seq_snd, &t->sock_seq_rcv, false);
> > +	if (rc) {
> > +		err("Failed to get sequences on source for socket %i: %s",
> > +		    s, strerror_(-rc));
> > +		return rc;
> > +	}
> > +
> > +	tcp_flow_repair_opt(s, &t->snd_wscale, &t->rcv_wscale, &t->sock_mss,
> > +			    false);
> > +
> > +	/* Dump receive queue as late as possible */  
> 
> Hmm.. why?

I'm not entirely sure if it makes sense, but my thought was: if we
spend time doing other things *after* dumping the receive queue, then
the part we risk missing of it (data that comes to the receive queue
after we dump it) is bigger.

If we dump (and transfer) it last, we should decrease the risk of
missing data.

-- 
Stefano


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v13 3/6] Add interfaces and configuration bits for passt-repair
  2025-02-10  2:59   ` David Gibson
@ 2025-02-10 15:54     ` Stefano Brivio
  0 siblings, 0 replies; 14+ messages in thread
From: Stefano Brivio @ 2025-02-10 15:54 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Mon, 10 Feb 2025 13:59:08 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Sun, Feb 09, 2025 at 11:20:02PM +0100, Stefano Brivio wrote:
>
> > @@ -243,6 +244,8 @@ void migrate_close(struct ctx *c)
> >  		c->device_state_fd = -1;
> >  		c->device_state_result = -1;
> >  	}
> > +
> > +	repair_close(c);  
> 
> I don't think we want this.  At the moment, rollback / failed
> migrations aren't really handled properly anyway.  But this pretty
> much explicitly prevents a second attempt at a failed migration.
> I'll send a fixup.

Well, after a failed migration, the user needs to retrigger a migration
anyway, and I don't think we want to encourage users to keep
passt-repair around the whole time, as it needs CAP_NET_ADMIN and
connects to passt.

It should really be started as late as possible, so that it has minimal
exposure to the guest.

And if starting passt-repair is a part of starting the migration
process, then we should definitely terminate it. Besides, we have no
other safe way to clean it up.

By the way, I thought the fixup (8/14 of v14) is what broke running
tests in a sequence because passt-repair doesn't terminate anymore, and
the teardown function wouldn't take care of it... but no, it was from
excess shell quoting (I'll include this fix in the next patch for tests):

-	rm -f "${STATESETUP}/passt_[12].pid" "${STATESETUP}/pasta_[12].pid"
+	rm -f "${STATESETUP}/passt_1.pid" "${STATESETUP}/passt_2.pid"
+	rm -f "${STATESETUP}/pasta_1.pid" "${STATESETUP}/pasta_2.pid"

-- 
Stefano




> >  }
> >  
> >  /**
> > diff --git a/passt.1 b/passt.1
> > index 29cc3ed..c81d539 100644
> > --- a/passt.1
> > +++ b/passt.1
> > @@ -418,6 +418,17 @@ Enable vhost-user. The vhost-user command socket is provided by \fB--socket\fR.
> >  .BR \-\-print-capabilities
> >  Print back-end capabilities in JSON format, only meaningful for vhost-user mode.
> >  
> > +.TP
> > +.BR \-\-repair-path " " \fIpath
> > +Path for UNIX domain socket used by the \fBpasst-repair\fR(1) helper to connect
> > +to \fBpasst\fR in order to set or clear the TCP_REPAIR option on sockets, during
> > +migration. \fB--repair-path none\fR disables this interface (if you need to
> > +specify a socket path called "none" you can prefix the path by \fI./\fR).
> > +
> > +Default, for \-\-vhost-user mode only, is to append \fI.repair\fR to the path
> > +chosen for the hypervisor UNIX domain socket. No socket is created if not in
> > +\-\-vhost-user mode.
> > +
> >  .TP
> >  .BR \-F ", " \-\-fd " " \fIFD
> >  Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened
> > diff --git a/passt.c b/passt.c
> > index 935a69f..6f9fb4d 100644
> > --- a/passt.c
> > +++ b/passt.c
> > @@ -52,6 +52,7 @@
> >  #include "ndp.h"
> >  #include "vu_common.h"
> >  #include "migrate.h"
> > +#include "repair.h"
> >  
> >  #define EPOLL_EVENTS		8
> >  
> > @@ -76,6 +77,8 @@ char *epoll_type_str[] = {
> >  	[EPOLL_TYPE_TAP_LISTEN]		= "listening qemu socket",
> >  	[EPOLL_TYPE_VHOST_CMD]		= "vhost-user command socket",
> >  	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
> > +	[EPOLL_TYPE_REPAIR_LISTEN]	= "TCP_REPAIR helper listening socket",
> > +	[EPOLL_TYPE_REPAIR]		= "TCP_REPAIR helper socket",
> >  };
> >  static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
> >  	      "epoll_type_str[] doesn't match enum epoll_type");
> > @@ -358,6 +361,12 @@ loop:
> >  		case EPOLL_TYPE_VHOST_KICK:
> >  			vu_kick_cb(c.vdev, ref, &now);
> >  			break;
> > +		case EPOLL_TYPE_REPAIR_LISTEN:
> > +			repair_listen_handler(&c, eventmask);
> > +			break;
> > +		case EPOLL_TYPE_REPAIR:
> > +			repair_handler(&c, eventmask);
> > +			break;
> >  		default:
> >  			/* Can't happen */
> >  			ASSERT(0);
> > diff --git a/passt.h b/passt.h
> > index e73a5ac..c392be0 100644
> > --- a/passt.h
> > +++ b/passt.h
> > @@ -20,6 +20,7 @@ union epoll_ref;
> >  #include "siphash.h"
> >  #include "ip.h"
> >  #include "inany.h"
> > +#include "migrate.h"
> >  #include "flow.h"
> >  #include "icmp.h"
> >  #include "fwd.h"
> > @@ -193,6 +194,7 @@ struct ip6_ctx {
> >   * @foreground:		Run in foreground, don't log to stderr by default
> >   * @nofile:		Maximum number of open files (ulimit -n)
> >   * @sock_path:		Path for UNIX domain socket
> > + * @repair_path:	TCP_REPAIR helper path, can be "none", empty for default
> >   * @pcap:		Path for packet capture file
> >   * @pidfile:		Path to PID file, empty string if not configured
> >   * @pidfile_fd:		File descriptor for PID file, -1 if none
> > @@ -203,6 +205,8 @@ struct ip6_ctx {
> >   * @epollfd:		File descriptor for epoll instance
> >   * @fd_tap_listen:	File descriptor for listening AF_UNIX socket, if any
> >   * @fd_tap:		AF_UNIX socket, tuntap device, or pre-opened socket
> > + * @fd_repair_listen:	File descriptor for listening TCP_REPAIR socket, if any
> > + * @fd_repair:		Connected AF_UNIX socket for TCP_REPAIR helper
> >   * @our_tap_mac:	Pasta/passt's MAC on the tap link
> >   * @guest_mac:		MAC address of guest or namespace, seen or configured
> >   * @hash_secret:	128-bit secret for siphash functions
> > @@ -247,6 +251,7 @@ struct ctx {
> >  	int foreground;
> >  	int nofile;
> >  	char sock_path[UNIX_PATH_MAX];
> > +	char repair_path[UNIX_PATH_MAX];
> >  	char pcap[PATH_MAX];
> >  
> >  	char pidfile[PATH_MAX];
> > @@ -263,6 +268,8 @@ struct ctx {
> >  	int epollfd;
> >  	int fd_tap_listen;
> >  	int fd_tap;
> > +	int fd_repair_listen;
> > +	int fd_repair;
> >  	unsigned char our_tap_mac[ETH_ALEN];
> >  	unsigned char guest_mac[ETH_ALEN];
> >  	uint64_t hash_secret[2];
> > diff --git a/repair.c b/repair.c
> > new file mode 100644
> > index 0000000..784b994
> > --- /dev/null
> > +++ b/repair.c
> > @@ -0,0 +1,212 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +
> > +/* PASST - Plug A Simple Socket Transport
> > + *  for qemu/UNIX domain socket mode
> > + *
> > + * PASTA - Pack A Subtle Tap Abstraction
> > + *  for network namespace/tap device mode
> > + *
> > + * repair.c - Interface (server) for passt-repair, set/clear TCP_REPAIR
> > + *
> > + * Copyright (c) 2025 Red Hat GmbH
> > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > + */
> > +
> > +#include <errno.h>
> > +#include <sys/uio.h>
> > +
> > +#include "util.h"
> > +#include "ip.h"
> > +#include "passt.h"
> > +#include "inany.h"
> > +#include "flow.h"
> > +#include "flow_table.h"
> > +
> > +#include "repair.h"
> > +
> > +#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
> > +
> > +/* Pending file descriptors for next repair_flush() call, or command change */
> > +static int repair_fds[SCM_MAX_FD];
> > +
> > +/* Pending command: flush pending file descriptors if it changes */
> > +static int repair_cmd;  
> 
> This should be typed as int8_t (see below for more details).
> 
> > +
> > +/* Number of pending file descriptors set in @repair_fds */
> > +static int repair_nfds;
> > +
> > +/**
> > + * repair_sock_init() - Start listening for connections on helper socket
> > + * @c:		Execution context
> > + */
> > +void repair_sock_init(const struct ctx *c)
> > +{
> > +	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR_LISTEN };
> > +	struct epoll_event ev = { 0 };
> > +
> > +	if (c->fd_repair_listen == -1)
> > +		return;
> > +
> > +	if (listen(c->fd_repair_listen, 0)) {
> > +		err_perror("listen() on repair helper socket, won't migrate");
> > +		return;
> > +	}
> > +
> > +	ref.fd = c->fd_repair_listen;
> > +	ev.events = EPOLLIN | EPOLLHUP | EPOLLET;
> > +	ev.data.u64 = ref.u64;
> > +	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair_listen, &ev))
> > +		err_perror("repair helper socket epoll_ctl(), won't migrate");
> > +}
> > +
> > +/**
> > + * repair_listen_handler() - Handle events on TCP_REPAIR helper listening socket
> > + * @c:		Execution context
> > + * @events:	epoll events
> > + */
> > +void repair_listen_handler(struct ctx *c, uint32_t events)
> > +{
> > +	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR };
> > +	struct epoll_event ev = { 0 };
> > +	struct ucred ucred;
> > +	socklen_t len;
> > +
> > +	if (events != EPOLLIN) {
> > +		debug("Spurious event 0x%04x on TCP_REPAIR helper socket",
> > +		      events);
> > +		return;
> > +	}
> > +
> > +	len = sizeof(ucred);
> > +
> > +	/* Another client is already connected: accept and close right away. */
> > +	if (c->fd_repair != -1) {
> > +		int discard = accept4(c->fd_repair_listen, NULL, NULL,
> > +				      SOCK_NONBLOCK);
> > +
> > +		if (discard == -1)
> > +			return;
> > +
> > +		if (!getsockopt(discard, SOL_SOCKET, SO_PEERCRED, &ucred, &len))
> > +			info("Discarding TCP_REPAIR helper, PID %i", ucred.pid);
> > +
> > +		close(discard);
> > +		return;
> > +	}
> > +
> > +	if ((c->fd_repair = accept4(c->fd_repair_listen, NULL, NULL, 0)) < 0) {
> > +		debug_perror("accept4() on TCP_REPAIR helper listening socket");
> > +		return;
> > +	}
> > +
> > +	if (!getsockopt(c->fd_repair, SOL_SOCKET, SO_PEERCRED, &ucred, &len))
> > +		info("Accepted TCP_REPAIR helper, PID %i", ucred.pid);
> > +
> > +	ref.fd = c->fd_repair;
> > +	ev.events = EPOLLHUP | EPOLLET;
> > +	ev.data.u64 = ref.u64;
> > +	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair, &ev)) {
> > +		debug_perror("epoll_ctl() on TCP_REPAIR helper socket");
> > +		close(c->fd_repair);
> > +		c->fd_repair = -1;
> > +	}
> > +}
> > +
> > +/**
> > + * repair_close() - Close connection to TCP_REPAIR helper
> > + * @c:		Execution context
> > + */
> > +void repair_close(struct ctx *c)
> > +{
> > +	debug("Closing TCP_REPAIR helper socket");
> > +
> > +	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_repair, NULL);
> > +	close(c->fd_repair);
> > +	c->fd_repair = -1;
> > +}
> > +
> > +/**
> > + * repair_handler() - Handle EPOLLHUP and EPOLLERR on TCP_REPAIR helper socket
> > + * @c:		Execution context
> > + * @events:	epoll events
> > + */
> > +void repair_handler(struct ctx *c, uint32_t events)
> > +{
> > +	(void)events;
> > +
> > +	repair_close(c);
> > +}
> > +
> > +/**
> > + * repair_flush() - Flush current set of sockets to helper, with current command
> > + * @c:		Execution context
> > + *
> > + * Return: 0 on success, negative error code on failure
> > + */
> > +int repair_flush(struct ctx *c)
> > +{
> > +	struct iovec iov = { &((int8_t){ repair_cmd }), sizeof(int8_t) };  
> 
> This will only be correct for little-endian machines.  Better to
> correctly type the repair_cmd variable.
> 
> > +	char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
> > +	     __attribute__ ((aligned(__alignof__(struct cmsghdr))));
> > +	struct cmsghdr *cmsg;
> > +	struct msghdr msg;
> > +
> > +	if (!repair_nfds)
> > +		return 0;
> > +
> > +	msg = (struct msghdr){ NULL, 0, &iov, 1,
> > +			       buf, CMSG_SPACE(sizeof(int) * repair_nfds), 0 };
> > +	cmsg = CMSG_FIRSTHDR(&msg);
> > +
> > +	cmsg->cmsg_level = SOL_SOCKET;
> > +	cmsg->cmsg_type = SCM_RIGHTS;
> > +	cmsg->cmsg_len = CMSG_LEN(sizeof(int) * repair_nfds);
> > +	memcpy(CMSG_DATA(cmsg), repair_fds, sizeof(int) * repair_nfds);
> > +
> > +	repair_nfds = 0;
> > +
> > +	if (sendmsg(c->fd_repair, &msg, 0) < 0) {
> > +		int ret = -errno;
> > +		err_perror("Failed to send sockets to TCP_REPAIR helper");
> > +		repair_close(c);
> > +		return ret;
> > +	}
> > +
> > +	if (recv(c->fd_repair, &((int8_t){ 0 }), 1, 0) < 0) {  
> 
> I guess it works, but passing an address to an implicitly constructed
> variable to recv() makes me nervous.  Besides we could error check a
> bit better here, I'll try to send another fixup.
> 
> > +		int ret = -errno;
> > +		err_perror("Failed to receive reply from TCP_REPAIR helper");
> > +		repair_close(c);
> > +		return ret;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * repair_set() - Add socket to TCP_REPAIR set with given command
> > + * @c:		Execution context
> > + * @s:		Socket to add
> > + * @cmd:	TCP_REPAIR_ON, TCP_REPAIR_OFF, or TCP_REPAIR_OFF_NO_WP
> > + *
> > + * Return: 0 on success, negative error code on failure
> > + */
> > +/* cppcheck-suppress unusedFunction */
> > +int repair_set(struct ctx *c, int s, int cmd)
> > +{
> > +	int rc;
> > +
> > +	if (repair_nfds && repair_cmd != cmd) {
> > +		if ((rc = repair_flush(c)))
> > +			return rc;
> > +	}
> > +
> > +	repair_cmd = cmd;
> > +	repair_fds[repair_nfds++] = s;
> > +
> > +	if (repair_nfds >= SCM_MAX_FD) {
> > +		if ((rc = repair_flush(c)))
> > +			return rc;
> > +	}
> > +
> > +	return 0;
> > +}
> > diff --git a/repair.h b/repair.h
> > new file mode 100644
> > index 0000000..de279d6
> > --- /dev/null
> > +++ b/repair.h
> > @@ -0,0 +1,16 @@
> > +/* SPDX-License-Identifier: GPL-2.0-or-later
> > + * Copyright (c) 2025 Red Hat GmbH
> > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > + */
> > +
> > +#ifndef REPAIR_H
> > +#define REPAIR_H
> > +
> > +void repair_sock_init(const struct ctx *c);
> > +void repair_listen_handler(struct ctx *c, uint32_t events);
> > +void repair_handler(struct ctx *c, uint32_t events);
> > +void repair_close(struct ctx *c);
> > +int repair_flush(struct ctx *c);
> > +int repair_set(struct ctx *c, int s, int cmd);
> > +
> > +#endif /* REPAIR_H */
> > diff --git a/tap.c b/tap.c
> > index 8c92d23..d0673e5 100644
> > --- a/tap.c
> > +++ b/tap.c
> > @@ -56,6 +56,7 @@
> >  #include "netlink.h"
> >  #include "pasta.h"
> >  #include "packet.h"
> > +#include "repair.h"
> >  #include "tap.h"
> >  #include "log.h"
> >  #include "vhost_user.h"
> > @@ -1151,68 +1152,6 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
> >  		tap_pasta_input(c, now);
> >  }
> >  
> > -/**
> > - * tap_sock_unix_open() - Create and bind AF_UNIX socket
> > - * @sock_path:	Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
> > - *
> > - * Return: socket descriptor on success, won't return on failure
> > - */
> > -int tap_sock_unix_open(char *sock_path)
> > -{
> > -	int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
> > -	struct sockaddr_un addr = {
> > -		.sun_family = AF_UNIX,
> > -	};
> > -	int i;
> > -
> > -	if (fd < 0)
> > -		die_perror("Failed to open UNIX domain socket");
> > -
> > -	for (i = 1; i < UNIX_SOCK_MAX; i++) {
> > -		char *path = addr.sun_path;
> > -		int ex, ret;
> > -
> > -		if (*sock_path)
> > -			memcpy(path, sock_path, UNIX_PATH_MAX);
> > -		else if (snprintf_check(path, UNIX_PATH_MAX - 1,
> > -					UNIX_SOCK_PATH, i))
> > -			die_perror("Can't build UNIX domain socket path");
> > -
> > -		ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
> > -			    0);
> > -		if (ex < 0)
> > -			die_perror("Failed to check for UNIX domain conflicts");
> > -
> > -		ret = connect(ex, (const struct sockaddr *)&addr, sizeof(addr));
> > -		if (!ret || (errno != ENOENT && errno != ECONNREFUSED &&
> > -			     errno != EACCES)) {
> > -			if (*sock_path)
> > -				die("Socket path %s already in use", path);
> > -
> > -			close(ex);
> > -			continue;
> > -		}
> > -		close(ex);
> > -
> > -		unlink(path);
> > -		ret = bind(fd, (const struct sockaddr *)&addr, sizeof(addr));
> > -		if (*sock_path && ret)
> > -			die_perror("Failed to bind UNIX domain socket");
> > -
> > -		if (!ret)
> > -			break;
> > -	}
> > -
> > -	if (i == UNIX_SOCK_MAX)
> > -		die_perror("Failed to bind UNIX domain socket");
> > -
> > -	info("UNIX domain socket bound at %s", addr.sun_path);
> > -	if (!*sock_path)
> > -		memcpy(sock_path, addr.sun_path, UNIX_PATH_MAX);
> > -
> > -	return fd;
> > -}
> > -
> >  /**
> >   * tap_backend_show_hints() - Give help information to start QEMU
> >   * @c:		Execution context
> > @@ -1423,6 +1362,8 @@ void tap_backend_init(struct ctx *c)
> >  		tap_sock_tun_init(c);
> >  		break;
> >  	case MODE_VU:
> > +		repair_sock_init(c);
> > +		/* fall through */
> >  	case MODE_PASST:
> >  		tap_sock_unix_init(c);
> >  
> > diff --git a/util.c b/util.c
> > index 4d51e04..c3c5480 100644
> > --- a/util.c
> > +++ b/util.c
> > @@ -178,6 +178,68 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
> >  	return fd;
> >  }
> >  
> > +/**
> > + * sock_unix() - Create and bind AF_UNIX socket
> > + * @sock_path:	Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
> > + *
> > + * Return: socket descriptor on success, won't return on failure
> > + */
> > +int sock_unix(char *sock_path)
> > +{
> > +	int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
> > +	struct sockaddr_un addr = {
> > +		.sun_family = AF_UNIX,
> > +	};
> > +	int i;
> > +
> > +	if (fd < 0)
> > +		die_perror("Failed to open UNIX domain socket");
> > +
> > +	for (i = 1; i < UNIX_SOCK_MAX; i++) {
> > +		char *path = addr.sun_path;
> > +		int ex, ret;
> > +
> > +		if (*sock_path)
> > +			memcpy(path, sock_path, UNIX_PATH_MAX);
> > +		else if (snprintf_check(path, UNIX_PATH_MAX - 1,
> > +					UNIX_SOCK_PATH, i))
> > +			die_perror("Can't build UNIX domain socket path");
> > +
> > +		ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
> > +			    0);
> > +		if (ex < 0)
> > +			die_perror("Failed to check for UNIX domain conflicts");
> > +
> > +		ret = connect(ex, (const struct sockaddr *)&addr, sizeof(addr));
> > +		if (!ret || (errno != ENOENT && errno != ECONNREFUSED &&
> > +			     errno != EACCES)) {
> > +			if (*sock_path)
> > +				die("Socket path %s already in use", path);
> > +
> > +			close(ex);
> > +			continue;
> > +		}
> > +		close(ex);
> > +
> > +		unlink(path);
> > +		ret = bind(fd, (const struct sockaddr *)&addr, sizeof(addr));
> > +		if (*sock_path && ret)
> > +			die_perror("Failed to bind UNIX domain socket");
> > +
> > +		if (!ret)
> > +			break;
> > +	}
> > +
> > +	if (i == UNIX_SOCK_MAX)
> > +		die_perror("Failed to bind UNIX domain socket");
> > +
> > +	info("UNIX domain socket bound at %s", addr.sun_path);
> > +	if (!*sock_path)
> > +		memcpy(sock_path, addr.sun_path, UNIX_PATH_MAX);
> > +
> > +	return fd;
> > +}
> > +
> >  /**
> >   * sock_probe_mem() - Check if setting high SO_SNDBUF and SO_RCVBUF is allowed
> >   * @c:		Execution context
> > diff --git a/util.h b/util.h
> > index 255eb26..3dacb4d 100644
> > --- a/util.h
> > +++ b/util.h
> > @@ -214,6 +214,7 @@ struct ctx;
> >  int sock_l4_sa(const struct ctx *c, enum epoll_type type,
> >  	       const void *sa, socklen_t sl,
> >  	       const char *ifname, bool v6only, uint32_t data);
> > +int sock_unix(char *sock_path);
> >  void sock_probe_mem(struct ctx *c);
> >  long timespec_diff_ms(const struct timespec *a, const struct timespec *b);
> >  int64_t timespec_diff_us(const struct timespec *a, const struct timespec *b);  
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v13 5/6] migrate: Migrate TCP flows
  2025-02-10  9:51     ` Stefano Brivio
@ 2025-02-10 15:54       ` Stefano Brivio
  0 siblings, 0 replies; 14+ messages in thread
From: Stefano Brivio @ 2025-02-10 15:54 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Mon, 10 Feb 2025 10:51:31 +0100
Stefano Brivio <sbrivio@redhat.com> wrote:

> On Mon, 10 Feb 2025 17:05:08 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Sun, Feb 09, 2025 at 11:20:04PM +0100, Stefano Brivio wrote:  
> > >
> > > +++ b/migrate.c
> > > @@ -98,11 +98,30 @@ static int seen_addrs_target_v1(struct ctx *c,
> > >  
> > >  /* Stages for version 1 */
> > >  static const struct migrate_stage stages_v1[] = {
> > > +	/* FIXME: With this step, close() in tcp_flow_migrate_source_ext()
> > > +	 * *sometimes* closes the connection for real.
> > > +	 */
> > > +/*	{
> > > +		.name = "shrink TCP windows",
> > > +		.source = flow_migrate_source_early,
> > > +		.target = NULL,
> > > +	},
> > > +*/    
> > 
> > Given we're not sure if this will help, and it adds some
> > complications, probably makes sense to split this into a separate
> > patch.  
> 
> I'd rather not because, due to this, the code for the case *without it*
> is a bit different anyway, and we avoid some code churn. I would also
> like to merge it unused to make our lives a bit easier the day we retry
> to work on it.

Oh, I missed the fact you already split this out in v14... fine then.
There are some stray FIXMEs in 11/14, tcp_flow_migrate_source_ext():

+	/* FIXME: Reenable dump in tcp_flow_migrate_shrink_window() */

+	/* FIXME: it's either this or flow_migrate_source_early(), why? */

but I'll just drop them in the next version.

-- 
Stefano


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2025-02-10 15:54 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-09 22:19 [PATCH v13 0/6] State migration, kind of draft again Stefano Brivio
2025-02-09 22:20 ` [PATCH v13 1/6] migrate: Skeleton of live migration logic Stefano Brivio
2025-02-10  2:26   ` David Gibson
2025-02-09 22:20 ` [PATCH v13 2/6] migrate: Migrate guest observed addresses Stefano Brivio
2025-02-09 22:20 ` [PATCH v13 3/6] Add interfaces and configuration bits for passt-repair Stefano Brivio
2025-02-10  2:59   ` David Gibson
2025-02-10 15:54     ` Stefano Brivio
2025-02-09 22:20 ` [PATCH v13 4/6] vhost_user: Make source quit after reporting migration state Stefano Brivio
2025-02-10  3:43   ` David Gibson
2025-02-09 22:20 ` [PATCH v13 5/6] migrate: Migrate TCP flows Stefano Brivio
2025-02-10  6:05   ` David Gibson
2025-02-10  9:51     ` Stefano Brivio
2025-02-10 15:54       ` Stefano Brivio
2025-02-09 22:20 ` [PATCH v13 6/6] test: Add migration tests Stefano Brivio

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).