[PATCH v2 0/8] Draft, incomplete series introducing state migration

public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed

* [PATCH v2 0/8] Draft, incomplete series introducing state migration
@ 2025-01-28 23:39 Stefano Brivio
  2025-01-28 23:39 ` [PATCH v2 1/8] icmp, udp: Pad time_t timestamp to 64-bit to ease " Stefano Brivio
                   ` (7 more replies)
  0 siblings, 8 replies; 20+ messages in thread
From: Stefano Brivio @ 2025-01-28 23:39 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier, David Gibson

A bit more complete than the previous iteration: this adds fixes,
interfaces for the TCP_REPAIR helper, and the first (working) basic
source pre-migration handler, dumping (socket-side) sequence numbers
into the flow table.

v2:

- drop 3/7, (7-bit hole in tcp_splice_conn), not really relevant

- handle EOF (ENODATA) in read_remainder() and read_all_buf()

- fix definition of reverse-endianness magic, and version match
  in migrate_target_read_header()

- pass context to the handlers, not an arbitrary data pointer,
  as that's kind of useless

- change passt-repair protocol to single-byte command, exit on
  EOF instead of special command

- add interface and infrastructure for passt-repair

- add basic pre-migration source handler dumping sequence numbers
  once sockets are switched to repair mode by helper

Stefano Brivio (8):
  icmp, udp: Pad time_t timestamp to 64-bit to ease state migration
  flow, flow_table: Pad flow table entries to 128 bytes, hash entries to
    32 bits
  flow_table: Use size in extern declaration for flowtab
  util: Add read_remainder() and read_all_buf()
  Introduce facilities for guest migration on top of vhost-user
    infrastructure
  Introduce passt-repair
  Add interfaces and configuration bits for passt-repair
  flow, tcp: Basic pre-migration source handler to dump sequence numbers

 Makefile       |  24 +++--
 conf.c         |  46 ++++++++-
 epoll_type.h   |   4 +
 flow.c         |  43 ++++++++
 flow.h         |  19 ++--
 flow_table.h   |  15 ++-
 icmp_flow.h    |   6 +-
 migrate.c      | 265 +++++++++++++++++++++++++++++++++++++++++++++++++
 migrate.h      |  88 ++++++++++++++++
 passt-repair.c | 117 ++++++++++++++++++++++
 passt.1        |  11 ++
 passt.c        |  11 +-
 passt.h        |   7 ++
 repair.c       | 192 +++++++++++++++++++++++++++++++++++
 repair.h       |  16 +++
 tap.c          |  65 +-----------
 tcp.c          |  56 +++++++++++
 tcp_conn.h     |   5 +
 udp_flow.h     |   6 +-
 util.c         | 142 ++++++++++++++++++++++++++
 util.h         |   3 +
 vu_common.c    | 124 +++++++++++++++--------
 vu_common.h    |   2 +-
 23 files changed, 1136 insertions(+), 131 deletions(-)
 create mode 100644 migrate.c
 create mode 100644 migrate.h
 create mode 100644 passt-repair.c
 create mode 100644 repair.c
 create mode 100644 repair.h

-- 
2.43.0


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2 1/8] icmp, udp: Pad time_t timestamp to 64-bit to ease state migration
  2025-01-28 23:39 [PATCH v2 0/8] Draft, incomplete series introducing state migration Stefano Brivio
@ 2025-01-28 23:39 ` Stefano Brivio
  2025-01-29  1:34   ` David Gibson
  2025-01-28 23:39 ` [PATCH v2 2/8] flow, flow_table: Pad flow table entries to 128 bytes, hash entries to 32 bits Stefano Brivio
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Stefano Brivio @ 2025-01-28 23:39 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier, David Gibson

That's the only field in flows with different storage sizes depending
on the architecture: it's usually 4-byte wide on 32-bit architectures,
except for arc and x32 where it's 8 bytes, and 8-byte wide on 64-bit
machines.

By keeping flow entries the same size across architectures, we avoid
having to expand or shrink table entries upon migration.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 icmp_flow.h | 6 +++++-
 udp_flow.h  | 6 +++++-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/icmp_flow.h b/icmp_flow.h
index fb93801..da7e255 100644
--- a/icmp_flow.h
+++ b/icmp_flow.h
@@ -13,6 +13,7 @@
  * @seq:	Last sequence number sent to tap, host order, -1: not sent yet
  * @sock:	"ping" socket
  * @ts:		Last associated activity from tap, seconds
+ * @ts_storage:	Pad @ts to 64-bit storage to keep state migration sane
  */
 struct icmp_ping_flow {
 	/* Must be first element */
@@ -20,7 +21,10 @@ struct icmp_ping_flow {
 
 	int seq;
 	int sock;
-	time_t ts;
+	union {
+		time_t ts;
+		uint64_t ts_storage;
+	};
 };
 
 bool icmp_ping_timer(const struct ctx *c, const struct icmp_ping_flow *pingf,
diff --git a/udp_flow.h b/udp_flow.h
index 9a1b059..9cb79a0 100644
--- a/udp_flow.h
+++ b/udp_flow.h
@@ -12,6 +12,7 @@
  * @f:		Generic flow information
  * @closed:	Flow is already closed
  * @ts:		Activity timestamp
+ * @ts_storage:	Pad @ts to 64-bit storage to keep state migration sane
  * @s:		Socket fd (or -1) for each side of the flow
  */
 struct udp_flow {
@@ -19,7 +20,10 @@ struct udp_flow {
 	struct flow_common f;
 
 	bool closed :1;
-	time_t ts;
+	union {
+		time_t ts;
+		uint64_t ts_storage;
+	};
 	int s[SIDES];
 };
 
-- 
@@ -12,6 +12,7 @@
  * @f:		Generic flow information
  * @closed:	Flow is already closed
  * @ts:		Activity timestamp
+ * @ts_storage:	Pad @ts to 64-bit storage to keep state migration sane
  * @s:		Socket fd (or -1) for each side of the flow
  */
 struct udp_flow {
@@ -19,7 +20,10 @@ struct udp_flow {
 	struct flow_common f;
 
 	bool closed :1;
-	time_t ts;
+	union {
+		time_t ts;
+		uint64_t ts_storage;
+	};
 	int s[SIDES];
 };
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 2/8] flow, flow_table: Pad flow table entries to 128 bytes, hash entries to 32 bits
  2025-01-28 23:39 [PATCH v2 0/8] Draft, incomplete series introducing state migration Stefano Brivio
  2025-01-28 23:39 ` [PATCH v2 1/8] icmp, udp: Pad time_t timestamp to 64-bit to ease " Stefano Brivio
@ 2025-01-28 23:39 ` Stefano Brivio
  2025-01-29  1:35   ` David Gibson
  2025-01-28 23:39 ` [PATCH v2 3/8] flow_table: Use size in extern declaration for flowtab Stefano Brivio
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Stefano Brivio @ 2025-01-28 23:39 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier, David Gibson

...to keep migration sane. Right now, the biggest struct in union flow
is struct tcp_splice_conn with 120 bytes on x86_64, which should also
have the biggest storage and alignment requirements of any
architecture we might run on.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 flow.h       | 18 ++++++++++++------
 flow_table.h | 13 ++++++++++---
 2 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/flow.h b/flow.h
index 24ba3ef..8eb5964 100644
--- a/flow.h
+++ b/flow.h
@@ -202,15 +202,21 @@ struct flow_common {
 
 /**
  * struct flow_sidx - ID for one side of a specific flow
- * @sidei:	Index of side referenced (0 or 1)
- * @flowi:	Index of flow referenced
+ * @sidei:		Index of side referenced (0 or 1)
+ * @flowi:		Index of flow referenced
+ * @flow_sidx_storage:	Pad to 32 bits
  */
 typedef struct flow_sidx {
-	unsigned	sidei :1;
-	unsigned	flowi :FLOW_INDEX_BITS;
+	union {
+		struct {
+			unsigned	sidei :1;
+			unsigned	flowi :FLOW_INDEX_BITS;
+		};
+		uint32_t flow_sidx_storage;
+	};
 } flow_sidx_t;
-static_assert(sizeof(flow_sidx_t) <= sizeof(uint32_t),
-	      "flow_sidx_t must fit within 32 bits");
+static_assert(sizeof(flow_sidx_t) == sizeof(uint32_t),
+	      "flow_sidx_t must be 32-bit wide");
 
 #define FLOW_SIDX_NONE ((flow_sidx_t){ .flowi = FLOW_MAX })
 
diff --git a/flow_table.h b/flow_table.h
index f15db53..007f4dd 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -26,9 +26,13 @@ struct flow_free_cluster {
 
 /**
  * union flow - Descriptor for a logical packet flow (e.g. connection)
- * @f:		Fields common between all variants
- * @tcp:	Fields for non-spliced TCP connections
- * @tcp_splice:	Fields for spliced TCP connections
+ * @f:			Fields common between all variants
+ * @free:		Entry in a cluster of free entries
+ * @tcp:		Fields for non-spliced TCP connections
+ * @tcp_splice:		Fields for spliced TCP connections
+ * @ping:		Tracking for ping flows
+ * @udp:		Tracking for UDP flows
+ * @flow_storage:	Pad flow entries to 128 bytes to ease state migration
 */
 union flow {
 	struct flow_common f;
@@ -37,8 +41,11 @@ union flow {
 	struct tcp_splice_conn tcp_splice;
 	struct icmp_ping_flow ping;
 	struct udp_flow udp;
+	char flow_storage[128];
 };
 
+static_assert(sizeof(union flow) == 128, "union flow should be 128-byte wide");
+
 /* Global Flow Table */
 extern unsigned flow_first_free;
 extern union flow flowtab[];
-- 
@@ -26,9 +26,13 @@ struct flow_free_cluster {
 
 /**
  * union flow - Descriptor for a logical packet flow (e.g. connection)
- * @f:		Fields common between all variants
- * @tcp:	Fields for non-spliced TCP connections
- * @tcp_splice:	Fields for spliced TCP connections
+ * @f:			Fields common between all variants
+ * @free:		Entry in a cluster of free entries
+ * @tcp:		Fields for non-spliced TCP connections
+ * @tcp_splice:		Fields for spliced TCP connections
+ * @ping:		Tracking for ping flows
+ * @udp:		Tracking for UDP flows
+ * @flow_storage:	Pad flow entries to 128 bytes to ease state migration
 */
 union flow {
 	struct flow_common f;
@@ -37,8 +41,11 @@ union flow {
 	struct tcp_splice_conn tcp_splice;
 	struct icmp_ping_flow ping;
 	struct udp_flow udp;
+	char flow_storage[128];
 };
 
+static_assert(sizeof(union flow) == 128, "union flow should be 128-byte wide");
+
 /* Global Flow Table */
 extern unsigned flow_first_free;
 extern union flow flowtab[];
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 3/8] flow_table: Use size in extern declaration for flowtab
  2025-01-28 23:39 [PATCH v2 0/8] Draft, incomplete series introducing state migration Stefano Brivio
  2025-01-28 23:39 ` [PATCH v2 1/8] icmp, udp: Pad time_t timestamp to 64-bit to ease " Stefano Brivio
  2025-01-28 23:39 ` [PATCH v2 2/8] flow, flow_table: Pad flow table entries to 128 bytes, hash entries to 32 bits Stefano Brivio
@ 2025-01-28 23:39 ` Stefano Brivio
  2025-01-28 23:39 ` [PATCH v2 4/8] util: Add read_remainder() and read_all_buf() Stefano Brivio
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Stefano Brivio @ 2025-01-28 23:39 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier, David Gibson

...so that we can use sizeof() on it.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 flow_table.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/flow_table.h b/flow_table.h
index 007f4dd..a85cab5 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -48,7 +48,7 @@ static_assert(sizeof(union flow) == 128, "union flow should be 128-byte wide");
 
 /* Global Flow Table */
 extern unsigned flow_first_free;
-extern union flow flowtab[];
+extern union flow flowtab[FLOW_MAX];
 
 /**
  * flow_foreach_sidei() - 'for' type macro to step through each side of flow
-- 
@@ -48,7 +48,7 @@ static_assert(sizeof(union flow) == 128, "union flow should be 128-byte wide");
 
 /* Global Flow Table */
 extern unsigned flow_first_free;
-extern union flow flowtab[];
+extern union flow flowtab[FLOW_MAX];
 
 /**
  * flow_foreach_sidei() - 'for' type macro to step through each side of flow
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 4/8] util: Add read_remainder() and read_all_buf()
  2025-01-28 23:39 [PATCH v2 0/8] Draft, incomplete series introducing state migration Stefano Brivio
                   ` (2 preceding siblings ...)
  2025-01-28 23:39 ` [PATCH v2 3/8] flow_table: Use size in extern declaration for flowtab Stefano Brivio
@ 2025-01-28 23:39 ` Stefano Brivio
  2025-01-29  1:37   ` David Gibson
  2025-01-28 23:39 ` [PATCH v2 5/8] Introduce facilities for guest migration on top of vhost-user infrastructure Stefano Brivio
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Stefano Brivio @ 2025-01-28 23:39 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier, David Gibson

These are symmetric to write_remainder() and write_all_buf() and
almost a copy and paste of them, with the most notable differences
being reversed reads/writes and a couple of better-safe-than-sorry
asserts to keep Coverity happy.

I'll use them in the next patch. At least for the moment, they're
going to be used for vhost-user mode only, so I'm not unconditionally
enabling readv() in the seccomp profile: the caller has to ensure it's
there.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 util.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 util.h |  2 ++
 2 files changed, 82 insertions(+)

diff --git a/util.c b/util.c
index 11973c4..36857d4 100644
--- a/util.c
+++ b/util.c
@@ -606,6 +606,86 @@ int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip)
 	return 0;
 }
 
+/**
+ * read_all_buf() - Fill a whole buffer from a file descriptor
+ * @fd:		File descriptor
+ * @buf:	Pointer to base of buffer
+ * @len:	Length of buffer
+ *
+ * Return: 0 on success, -1 on error (with errno set)
+ *
+ * #syscalls read
+ */
+int read_all_buf(int fd, void *buf, size_t len)
+{
+	size_t left = len;
+	char *p = buf;
+
+	while (left) {
+		ssize_t rc;
+
+		ASSERT(left <= len);
+
+		do
+			rc = read(fd, p, left);
+		while ((rc < 0) && errno == EINTR);
+
+		if (rc < 0)
+			return -1;
+
+		if (rc == 0) {
+			errno = ENODATA;
+			return -1;
+		}
+
+		p += rc;
+		left -= rc;
+	}
+	return 0;
+}
+
+/**
+ * read_remainder() - Read the tail of an IO vector from a file descriptor
+ * @fd:		File descriptor
+ * @iov:	IO vector
+ * @cnt:	Number of entries in @iov
+ * @skip:	Number of bytes of the vector to skip reading
+ *
+ * Return: 0 on success, -1 on error (with errno set)
+ *
+ * Note: mode-specific seccomp profiles need to enable readv() to use this.
+ */
+int read_remainder(int fd, struct iovec *iov, size_t cnt, size_t skip)
+{
+	size_t i = 0, offset;
+
+	while ((i += iov_skip_bytes(iov + i, cnt - i, skip, &offset)) < cnt) {
+		ssize_t rc;
+
+		if (offset) {
+			ASSERT(offset < iov[i].iov_len);
+			/* Read the remainder of the partially read buffer */
+			if (read_all_buf(fd, (char *)iov[i].iov_base + offset,
+					 iov[i].iov_len - offset) < 0)
+				return -1;
+			i++;
+		}
+
+		/* Fill as many of the remaining buffers as we can */
+		rc = readv(fd, &iov[i], cnt - i);
+		if (rc < 0)
+			return -1;
+
+		if (rc == 0) {
+			errno = ENODATA;
+			return -1;
+		}
+
+		skip = rc;
+	}
+	return 0;
+}
+
 /** sockaddr_ntop() - Convert a socket address to text format
  * @sa:		Socket address
  * @dst:	output buffer, minimum SOCKADDR_STRLEN bytes
diff --git a/util.h b/util.h
index d02333d..73a7a33 100644
--- a/util.h
+++ b/util.h
@@ -203,6 +203,8 @@ int fls(unsigned long x);
 int write_file(const char *path, const char *buf);
 int write_all_buf(int fd, const void *buf, size_t len);
 int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip);
+int read_all_buf(int fd, void *buf, size_t len);
+int read_remainder(int fd, struct iovec *iov, size_t cnt, size_t skip);
 void close_open_files(int argc, char **argv);
 bool snprintf_check(char *str, size_t size, const char *format, ...);
 
-- 
@@ -203,6 +203,8 @@ int fls(unsigned long x);
 int write_file(const char *path, const char *buf);
 int write_all_buf(int fd, const void *buf, size_t len);
 int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip);
+int read_all_buf(int fd, void *buf, size_t len);
+int read_remainder(int fd, struct iovec *iov, size_t cnt, size_t skip);
 void close_open_files(int argc, char **argv);
 bool snprintf_check(char *str, size_t size, const char *format, ...);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 5/8] Introduce facilities for guest migration on top of vhost-user infrastructure
  2025-01-28 23:39 [PATCH v2 0/8] Draft, incomplete series introducing state migration Stefano Brivio
                   ` (3 preceding siblings ...)
  2025-01-28 23:39 ` [PATCH v2 4/8] util: Add read_remainder() and read_all_buf() Stefano Brivio
@ 2025-01-28 23:39 ` Stefano Brivio
  2025-01-29  5:41   ` David Gibson
  2025-01-28 23:39 ` [PATCH v2 6/8] Introduce passt-repair Stefano Brivio
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Stefano Brivio @ 2025-01-28 23:39 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier, David Gibson

Add two sets (source or target) of three functions each for passt in
vhost-user mode, triggered by activity on the file descriptor passed
via VHOST_USER_PROTOCOL_F_DEVICE_STATE:

- migrate_source_pre() and migrate_target_pre() are called to prepare
  for migration, before data is transferred

- migrate_source() sends, and migrate_target() receives migration data

- migrate_source_post() and migrate_target_post() are responsible for
  any post-migration task

Callbacks are added to these functions with arrays of function
pointers in migrate.c. Migration handlers are versioned.

Versioned descriptions of data sections will be added to the
data_versions array, which points to versioned iovec arrays. Version
1 is currently empty and will be filled in in subsequent patches.

The source announces the data version to be used and informs the peer
about endianness, and the size of void *, time_t, flow entries and
flow hash table entries.

The target checks if the version of the source is still supported. If
it's not, it aborts the migration.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 Makefile    |  12 +--
 migrate.c   | 264 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 migrate.h   |  88 ++++++++++++++++++
 passt.c     |   2 +-
 vu_common.c | 124 ++++++++++++++++--------
 vu_common.h |   2 +-
 6 files changed, 443 insertions(+), 49 deletions(-)
 create mode 100644 migrate.c
 create mode 100644 migrate.h

diff --git a/Makefile b/Makefile
index 464eef1..1383875 100644
--- a/Makefile
+++ b/Makefile
@@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
 	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
-	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
-	tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
+	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
+	tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
 	vhost_user.c virtio.c vu_common.c
 QRAP_SRCS = qrap.c
 SRCS = $(PASST_SRCS) $(QRAP_SRCS)
@@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1
 
 PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
 	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
-	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
-	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
-	tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
-	virtio.h vu_common.h
+	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
+	pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
+	tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
+	vhost_user.h virtio.h vu_common.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
diff --git a/migrate.c b/migrate.c
new file mode 100644
index 0000000..b8b79e0
--- /dev/null
+++ b/migrate.c
@@ -0,0 +1,264 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* PASST - Plug A Simple Socket Transport
+ *  for qemu/UNIX domain socket mode
+ *
+ * PASTA - Pack A Subtle Tap Abstraction
+ *  for network namespace/tap device mode
+ *
+ * migrate.c - Migration sections, layout, and routines
+ *
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#include <errno.h>
+#include <sys/uio.h>
+
+#include "util.h"
+#include "ip.h"
+#include "passt.h"
+#include "inany.h"
+#include "flow.h"
+#include "flow_table.h"
+
+#include "migrate.h"
+
+/* Current version of migration data */
+#define MIGRATE_VERSION		1
+
+/* Magic as we see it and as seen with reverse endianness */
+#define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
+#define MIGRATE_MAGIC_SWAPPED	0xB0D1B1B01B1DBBB1
+
+/* Migration header to send from source */
+static union migrate_header header = {
+	.magic		= MIGRATE_MAGIC,
+	.version	= htonl_constant(MIGRATE_VERSION),
+	.time_t_size	= htonl_constant(sizeof(time_t)),
+	.flow_size	= htonl_constant(sizeof(union flow)),
+	.flow_sidx_size	= htonl_constant(sizeof(struct flow_sidx)),
+	.voidp_size	= htonl_constant(sizeof(void *)),
+};
+
+/* Data sections for version 1 */
+static struct iovec sections_v1[] = {
+	{ &header,	sizeof(header) },
+};
+
+/* Set of data versions */
+static struct migrate_data data_versions[] = {
+	{
+		1,	sections_v1,
+	},
+	{ 0 },
+};
+
+/* Handlers to call in source before sending data */
+struct migrate_handler handlers_source_pre[] = {
+	{ 0 },
+};
+
+/* Handlers to call in source after sending data */
+struct migrate_handler handlers_source_post[] = {
+	{ 0 },
+};
+
+/* Handlers to call in target before receiving data with version 1 */
+struct migrate_handler handlers_target_pre_v1[] = {
+	{ 0 },
+};
+
+/* Handlers to call in target after receiving data with version 1 */
+struct migrate_handler handlers_target_post_v1[] = {
+	{ 0 },
+};
+
+/* Versioned sets of migration handlers */
+struct migrate_target_handlers target_handlers[] = {
+	{
+		1,
+		handlers_target_pre_v1,
+		handlers_target_post_v1,
+	},
+	{ 0 },
+};
+
+/**
+ * migrate_source_pre() - Pre-migration tasks as source
+ * @c:		Execution context
+ * @m:		Migration metadata
+ *
+ * Return: 0 on success, error code on failure
+ */
+int migrate_source_pre(struct ctx *c, struct migrate_meta *m)
+{
+	struct migrate_handler *h;
+
+	for (h = handlers_source_pre; h->fn; h++) {
+		int rc;
+
+		if ((rc = h->fn(c, m)))
+			return rc;
+	}
+
+	return 0;
+}
+
+/**
+ * migrate_source() - Perform migration as source: send state to hypervisor
+ * @fd:		Descriptor for state transfer
+ * @m:		Migration metadata
+ *
+ * Return: 0 on success, error code on failure
+ */
+int migrate_source(int fd, const struct migrate_meta *m)
+{
+	static struct migrate_data *d;
+	int count, rc;
+
+	(void)m;
+
+	for (d = data_versions; d->v != MIGRATE_VERSION; d++);
+
+	for (count = 0; d->sections[count].iov_len; count++);
+
+	debug("Writing %u migration sections", count - 1 /* minus header */);
+	rc = write_remainder(fd, d->sections, count, 0);
+	if (rc < 0)
+		return errno;
+
+	return 0;
+}
+
+/**
+ * migrate_source_post() - Post-migration tasks as source
+ * @c:		Execution context
+ * @m:		Migration metadata
+ *
+ * Return: 0 on success, error code on failure
+ */
+void migrate_source_post(struct ctx *c, struct migrate_meta *m)
+{
+	struct migrate_handler *h;
+
+	for (h = handlers_source_post; h->fn; h++)
+		h->fn(c, m);
+}
+
+/**
+ * migrate_target_read_header() - Set metadata in target from source header
+ * @fd:		Descriptor for state transfer
+ * @m:		Migration metadata, filled on return
+ *
+ * Return: 0 on success, error code on failure
+ */
+int migrate_target_read_header(int fd, struct migrate_meta *m)
+{
+	static struct migrate_data *d;
+	union migrate_header h;
+
+	if (read_all_buf(fd, &h, sizeof(h)))
+		return errno;
+
+	debug("Source magic: 0x%016" PRIx64 ", sizeof(void *): %u, version: %u",
+	      h.magic, ntohl(h.voidp_size), ntohl(h.version));
+
+	for (d = data_versions; d->v != ntohl(h.version) && d->v; d++);
+	if (!d->v)
+		return ENOTSUP;
+	m->v = d->v;
+
+	if (h.magic == MIGRATE_MAGIC)
+		m->bswap = false;
+	else if (h.magic == MIGRATE_MAGIC_SWAPPED)
+		m->bswap = true;
+	else
+		return ENOTSUP;
+
+	if (ntohl(h.voidp_size) == 4)
+		m->source_64b = false;
+	else if (ntohl(h.voidp_size) == 8)
+		m->source_64b = true;
+	else
+		return ENOTSUP;
+
+	if (ntohl(h.time_t_size) == 4)
+		m->time_64b = false;
+	else if (ntohl(h.time_t_size) == 8)
+		m->time_64b = true;
+	else
+		return ENOTSUP;
+
+	m->flow_size = ntohl(h.flow_size);
+	m->flow_sidx_size = ntohl(h.flow_sidx_size);
+
+	return 0;
+}
+
+/**
+ * migrate_target_pre() - Pre-migration tasks as target
+ * @c:		Execution context
+ * @m:		Migration metadata
+ *
+ * Return: 0 on success, error code on failure
+ */
+int migrate_target_pre(struct ctx *c, struct migrate_meta *m)
+{
+	struct migrate_target_handlers *th;
+	struct migrate_handler *h;
+
+	for (th = target_handlers; th->v != m->v && th->v; th++);
+
+	for (h = th->pre; h->fn; h++) {
+		int rc;
+
+		if ((rc = h->fn(c, m)))
+			return rc;
+	}
+
+	return 0;
+}
+
+/**
+ * migrate_target() - Perform migration as target: receive state from hypervisor
+ * @fd:		Descriptor for state transfer
+ * @m:		Migration metadata
+ *
+ * Return: 0 on success, error code on failure
+ *
+ * #syscalls:vu readv
+ */
+int migrate_target(int fd, const struct migrate_meta *m)
+{
+	static struct migrate_data *d;
+	unsigned cnt;
+	int rc;
+
+	for (d = data_versions; d->v != m->v && d->v; d++);
+
+	for (cnt = 0; d->sections[cnt + 1 /* skip header */].iov_len; cnt++);
+
+	debug("Reading %u migration sections", cnt);
+	rc = read_remainder(fd, d->sections + 1, cnt, 0);
+	if (rc < 0)
+		return errno;
+
+	return 0;
+}
+
+/**
+ * migrate_target_post() - Post-migration tasks as target
+ * @c:		Execution context
+ * @m:		Migration metadata
+ */
+void migrate_target_post(struct ctx *c, struct migrate_meta *m)
+{
+	struct migrate_target_handlers *th;
+	struct migrate_handler *h;
+
+	for (th = target_handlers; th->v != m->v && th->v; th++);
+
+	for (h = th->post; h->fn; h++)
+		h->fn(c, m);
+}
diff --git a/migrate.h b/migrate.h
new file mode 100644
index 0000000..f9635ac
--- /dev/null
+++ b/migrate.h
@@ -0,0 +1,88 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+ 
+#ifndef MIGRATE_H
+#define MIGRATE_H
+
+/**
+ * struct migrate_meta - Migration metadata
+ * @v:			Chosen migration data version, host order
+ * @bswap:		Source has opposite endianness
+ * @peer_64b:		Source uses 64-bit void *
+ * @time_64b:		Source uses 64-bit time_t
+ * @flow_size:		Size of union flow in source
+ * @flow_sidx_size:	Size of struct flow_sidx in source
+ */
+struct migrate_meta {
+	uint32_t v;
+	bool bswap;
+	bool source_64b;
+	bool time_64b;
+	size_t flow_size;
+	size_t flow_sidx_size;
+};
+
+/**
+ * union migrate_header - Migration header from source
+ * @magic:		0xB1BB1D1B0BB1D1B0, host order
+ * @version:		Source sends highest known, target aborts if unsupported
+ * @voidp_size:		sizeof(void *), network order
+ * @time_t_size:	sizeof(time_t), network order
+ * @flow_size:		sizeof(union flow), network order
+ * @flow_sidx_size:	sizeof(struct flow_sidx_t), network order
+ * @unused:		Go figure
+ */
+union migrate_header {
+	struct {
+		uint64_t magic;
+		uint32_t version;
+		uint32_t voidp_size;
+		uint32_t time_t_size;
+		uint32_t flow_size;
+		uint32_t flow_sidx_size;
+	};
+	uint8_t unused[65536];
+};
+
+/**
+ * struct migrate_data - Data sections for given source version
+ * @v:			Source version this applies to, host order
+ * @sections:		Array of data sections, NULL-terminated
+ */
+struct migrate_data {
+	uint32_t v;
+	struct iovec *sections;
+};
+
+/**
+ * struct migrate_handler - Function to handle a specific data section
+ * @fn:			Function pointer taking pointer to context and metadata
+ */
+struct migrate_handler {
+	int (*fn)(struct ctx *c, struct migrate_meta *m);
+};
+
+/**
+ * struct migrate_target_handlers - Versioned sets of migration target handlers
+ * @v:			Source version this applies to, host order
+ * @pre:		Set of functions to execute in target before data copy
+ * @post:		Set of functions to execute in target after data copy
+ */
+struct migrate_target_handlers {
+	uint32_t v;
+	struct migrate_handler *pre;
+	struct migrate_handler *post;
+};
+
+int migrate_source_pre(struct ctx *c, struct migrate_meta *m);
+int migrate_source(int fd, const struct migrate_meta *m);
+void migrate_source_post(struct ctx *c, struct migrate_meta *m);
+
+int migrate_target_read_header(int fd, struct migrate_meta *m);
+int migrate_target_pre(struct ctx *c, struct migrate_meta *m);
+int migrate_target(int fd, const struct migrate_meta *m);
+void migrate_target_post(struct ctx *c, struct migrate_meta *m);
+
+#endif /* MIGRATE_H */
diff --git a/passt.c b/passt.c
index b1c8ab6..184d4e5 100644
--- a/passt.c
+++ b/passt.c
@@ -358,7 +358,7 @@ loop:
 			vu_kick_cb(c.vdev, ref, &now);
 			break;
 		case EPOLL_TYPE_VHOST_MIGRATION:
-			vu_migrate(c.vdev, eventmask);
+			vu_migrate(&c, eventmask);
 			break;
 		default:
 			/* Can't happen */
diff --git a/vu_common.c b/vu_common.c
index f43d8ac..6c346c8 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -5,6 +5,7 @@
  * common_vu.c - vhost-user common UDP and TCP functions
  */
 
+#include <errno.h>
 #include <unistd.h>
 #include <sys/uio.h>
 #include <sys/eventfd.h>
@@ -17,6 +18,7 @@
 #include "vhost_user.h"
 #include "pcap.h"
 #include "vu_common.h"
+#include "migrate.h"
 
 #define VU_MAX_TX_BUFFER_NB	2
 
@@ -305,50 +307,90 @@ err:
 }
 
 /**
- * vu_migrate() - Send/receive passt insternal state to/from QEMU
- * @vdev:	vhost-user device
+ * vu_migrate_source() - Migration as source, send state to hypervisor
+ * @c:		Execution context
+ * @fd:		File descriptor for state transfer
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+static int vu_migrate_source(struct ctx *c, int fd)
+{
+	struct migrate_meta m;
+	int rc;
+
+	if ((rc = migrate_source_pre(c, &m))) {
+		err("Source pre-migration failed: %s, abort", strerror_(rc));
+		return rc;
+	}
+
+	debug("Saving backend state");
+
+	rc = migrate_source(fd, &m);
+	if (rc)
+		err("Source migration failed: %s", strerror_(rc));
+	else
+		migrate_source_post(c, &m);
+
+	return rc;
+}
+
+/**
+ * vu_migrate_target() - Migration as target, receive state from hypervisor
+ * @c:		Execution context
+ * @fd:		File descriptor for state transfer
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+static int vu_migrate_target(struct ctx *c, int fd)
+{
+	struct migrate_meta m;
+	int rc;
+
+	rc = migrate_target_read_header(fd, &m);
+	if (rc) {
+		err("Migration header check failed: %s, abort", strerror_(rc));
+		return rc;
+	}
+
+	if ((rc = migrate_target_pre(c, &m))) {
+		err("Target pre-migration failed: %s, abort", strerror_(rc));
+		return rc;
+	}
+
+	debug("Loading backend state");
+
+	rc = migrate_target(fd, &m);
+	if (rc)
+		err("Target migration failed: %s", strerror_(rc));
+	else
+		migrate_target_post(c, &m);
+
+	return rc;
+}
+
+/**
+ * vu_migrate() - Send/receive passt internal state to/from QEMU
+ * @c:		Execution context
  * @events:	epoll events
  */
-void vu_migrate(struct vu_dev *vdev, uint32_t events)
+void vu_migrate(struct ctx *c, uint32_t events)
 {
-	int ret;
+	struct vu_dev *vdev = c->vdev;
+	int rc = EIO;
 
-	/* TODO: collect/set passt internal state
-	 * and use vdev->device_state_fd to send/receive it
-	 */
 	debug("vu_migrate fd %d events %x", vdev->device_state_fd, events);
-	if (events & EPOLLOUT) {
-		debug("Saving backend state");
-
-		/* send some stuff */
-		ret = write(vdev->device_state_fd, "PASST", 6);
-		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
-		vdev->device_state_result = ret == -1 ? -1 : 0;
-		/* Closing the file descriptor signals the end of transfer */
-		epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
-			  vdev->device_state_fd, NULL);
-		close(vdev->device_state_fd);
-		vdev->device_state_fd = -1;
-	} else if (events & EPOLLIN) {
-		char buf[6];
-
-		debug("Loading backend state");
-		/* read some stuff */
-		ret = read(vdev->device_state_fd, buf, sizeof(buf));
-		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
-		if (ret != sizeof(buf)) {
-			vdev->device_state_result = -1;
-		} else {
-			ret = strncmp(buf, "PASST", sizeof(buf));
-			vdev->device_state_result = ret == 0 ? 0 : -1;
-		}
-	} else if (events & EPOLLHUP) {
-		debug("Closing migration channel");
-
-		/* The end of file signals the end of the transfer. */
-		epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
-			  vdev->device_state_fd, NULL);
-		close(vdev->device_state_fd);
-		vdev->device_state_fd = -1;
-	}
+
+	if (events & EPOLLOUT)
+		rc = vu_migrate_source(c, vdev->device_state_fd);
+	else if (events & EPOLLIN)
+		rc = vu_migrate_target(c, vdev->device_state_fd);
+
+	/* EPOLLHUP without EPOLLIN/EPOLLOUT, or EPOLLERR? Migration failed */
+
+	vdev->device_state_result = rc;
+
+	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, vdev->device_state_fd, NULL);
+	debug("Closing migration channel");
+	close(vdev->device_state_fd);
+	vdev->device_state_fd = -1;
 }
diff --git a/vu_common.h b/vu_common.h
index d56c021..69c4006 100644
--- a/vu_common.h
+++ b/vu_common.h
@@ -57,5 +57,5 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
 void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
 		const struct timespec *now);
 int vu_send_single(const struct ctx *c, const void *buf, size_t size);
-void vu_migrate(struct vu_dev *vdev, uint32_t events);
+void vu_migrate(struct ctx *c, uint32_t events);
 #endif /* VU_COMMON_H */
-- 
@@ -57,5 +57,5 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
 void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
 		const struct timespec *now);
 int vu_send_single(const struct ctx *c, const void *buf, size_t size);
-void vu_migrate(struct vu_dev *vdev, uint32_t events);
+void vu_migrate(struct ctx *c, uint32_t events);
 #endif /* VU_COMMON_H */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 6/8] Introduce passt-repair
  2025-01-28 23:39 [PATCH v2 0/8] Draft, incomplete series introducing state migration Stefano Brivio
                   ` (4 preceding siblings ...)
  2025-01-28 23:39 ` [PATCH v2 5/8] Introduce facilities for guest migration on top of vhost-user infrastructure Stefano Brivio
@ 2025-01-28 23:39 ` Stefano Brivio
  2025-01-28 23:39 ` [PATCH v2 7/8] Add interfaces and configuration bits for passt-repair Stefano Brivio
  2025-01-28 23:39 ` [PATCH v2 8/8] flow, tcp: Basic pre-migration source handler to dump sequence numbers Stefano Brivio
  7 siblings, 0 replies; 20+ messages in thread
From: Stefano Brivio @ 2025-01-28 23:39 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier, David Gibson

A privileged helper to set/clear TCP_REPAIR on sockets on behalf of
passt. Not used yet.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 Makefile       |  10 +++--
 passt-repair.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 124 insertions(+), 3 deletions(-)
 create mode 100644 passt-repair.c

diff --git a/Makefile b/Makefile
index 1383875..1b71cb0 100644
--- a/Makefile
+++ b/Makefile
@@ -42,7 +42,8 @@ PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
 	tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
 	vhost_user.c virtio.c vu_common.c
 QRAP_SRCS = qrap.c
-SRCS = $(PASST_SRCS) $(QRAP_SRCS)
+PASST_REPAIR_SRCS = passt-repair.c
+SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
 
 MANPAGES = passt.1 pasta.1 qrap.1
 
@@ -72,9 +73,9 @@ mandir		?= $(datarootdir)/man
 man1dir		?= $(mandir)/man1
 
 ifeq ($(TARGET_ARCH),x86_64)
-BIN := passt passt.avx2 pasta pasta.avx2 qrap
+BIN := passt passt.avx2 pasta pasta.avx2 qrap passt-repair
 else
-BIN := passt pasta qrap
+BIN := passt pasta qrap passt-repair
 endif
 
 all: $(BIN) $(MANPAGES) docs
@@ -101,6 +102,9 @@ pasta.avx2 pasta.1 pasta: pasta%: passt%
 qrap: $(QRAP_SRCS) passt.h
 	$(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS)
 
+passt-repair: $(PASST_REPAIR_SRCS)
+	$(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS)
+
 valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction	\
 			    rt_sigreturn getpid gettid kill clock_gettime mmap \
 			    mmap2 munmap open unlink gettimeofday futex statx \
diff --git a/passt-repair.c b/passt-repair.c
new file mode 100644
index 0000000..988a52c
--- /dev/null
+++ b/passt-repair.c
@@ -0,0 +1,117 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* PASST - Plug A Simple Socket Transport
+ *  for qemu/UNIX domain socket mode
+ *
+ * PASTA - Pack A Subtle Tap Abstraction
+ *  for network namespace/tap device mode
+ *
+ * passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets
+ *
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ *
+ * Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along
+ * with byte commands mapping to TCP_REPAIR values, and switch repair mode on or
+ * off. Reply by echoing the command. Exit on EOF.
+ */
+
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <limits.h>
+#include <unistd.h>
+#include <netdb.h>
+
+#include <netinet/tcp.h>
+
+#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
+
+int main(int argc, char **argv)
+{
+	char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
+	     __attribute__ ((aligned(__alignof__(struct cmsghdr))));
+	struct sockaddr_un a = { AF_UNIX, "" };
+	int fds[SCM_MAX_FD], s, ret, i, n;
+	int8_t cmd = INT8_MAX;
+	struct cmsghdr *cmsg;
+	struct msghdr msg;
+	struct iovec iov;
+
+	iov = (struct iovec){ &cmd, sizeof(cmd) };
+	msg = (struct msghdr){ NULL, 0, &iov, 1, buf, sizeof(buf), 0 };
+	cmsg = CMSG_FIRSTHDR(&msg);
+
+	if (argc != 2) {
+		fprintf(stderr, "Usage: %s PATH\n", argv[0]);
+		return -1;
+	}
+
+	ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", argv[1]);
+	if (ret <= 0 || ret >= (int)sizeof(a.sun_path)) {
+		fprintf(stderr, "Invalid socket path: %s\n", argv[1]);
+		return -1;
+	}
+
+	if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {
+		perror("Failed to create AF_UNIX socket");
+		return -1;
+	}
+
+	if (connect(s, (struct sockaddr *)&a, sizeof(a))) {
+		fprintf(stderr, "Failed to connect to %s: %s\n", argv[1],
+			strerror(errno));
+		return -1;
+	}
+
+loop:
+	ret = recvmsg(s, &msg, 0);
+	if (ret < 0) {
+		perror("Failed to receive message");
+		return -1;
+	}
+
+	if (!ret)	/* Done */
+		return 0;
+
+	if (!cmsg ||
+	    cmsg->cmsg_len < CMSG_LEN(sizeof(int)) ||
+	    cmsg->cmsg_len > CMSG_LEN(sizeof(int) * SCM_MAX_FD) ||
+	    cmsg->cmsg_type != SCM_RIGHTS)
+		return -1;
+
+	n = cmsg->cmsg_len / CMSG_LEN(sizeof(int));
+	memcpy(fds, CMSG_DATA(cmsg), sizeof(int) * n);
+
+	if (cmd != TCP_REPAIR_ON && cmd != TCP_REPAIR_OFF &&
+	    cmd != TCP_REPAIR_OFF_NO_WP) {
+		fprintf(stderr, "Unsupported command 0x%04x\n", cmd);
+		return -1;
+	}
+
+	for (i = 0; i < n; i++) {
+		int o = cmd;
+
+		if (setsockopt(fds[i], SOL_TCP, TCP_REPAIR, &o, sizeof(o))) {
+			fprintf(stderr,
+				"Setting TCP_REPAIR to %i on socket %i: %s", o,
+				fds[i], strerror(errno));
+			return -1;
+		}
+
+		/* Confirm setting by echoing the command back */
+		if (send(s, &cmd, sizeof(cmd), 0) < 0) {
+			fprintf(stderr, "Reply to command %i: %s\n",
+				o, strerror(errno));
+			return -1;
+		}
+	}
+
+	goto loop;
+
+	return 0;
+}
-- 
@@ -0,0 +1,117 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* PASST - Plug A Simple Socket Transport
+ *  for qemu/UNIX domain socket mode
+ *
+ * PASTA - Pack A Subtle Tap Abstraction
+ *  for network namespace/tap device mode
+ *
+ * passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets
+ *
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ *
+ * Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along
+ * with byte commands mapping to TCP_REPAIR values, and switch repair mode on or
+ * off. Reply by echoing the command. Exit on EOF.
+ */
+
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <limits.h>
+#include <unistd.h>
+#include <netdb.h>
+
+#include <netinet/tcp.h>
+
+#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
+
+int main(int argc, char **argv)
+{
+	char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
+	     __attribute__ ((aligned(__alignof__(struct cmsghdr))));
+	struct sockaddr_un a = { AF_UNIX, "" };
+	int fds[SCM_MAX_FD], s, ret, i, n;
+	int8_t cmd = INT8_MAX;
+	struct cmsghdr *cmsg;
+	struct msghdr msg;
+	struct iovec iov;
+
+	iov = (struct iovec){ &cmd, sizeof(cmd) };
+	msg = (struct msghdr){ NULL, 0, &iov, 1, buf, sizeof(buf), 0 };
+	cmsg = CMSG_FIRSTHDR(&msg);
+
+	if (argc != 2) {
+		fprintf(stderr, "Usage: %s PATH\n", argv[0]);
+		return -1;
+	}
+
+	ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", argv[1]);
+	if (ret <= 0 || ret >= (int)sizeof(a.sun_path)) {
+		fprintf(stderr, "Invalid socket path: %s\n", argv[1]);
+		return -1;
+	}
+
+	if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {
+		perror("Failed to create AF_UNIX socket");
+		return -1;
+	}
+
+	if (connect(s, (struct sockaddr *)&a, sizeof(a))) {
+		fprintf(stderr, "Failed to connect to %s: %s\n", argv[1],
+			strerror(errno));
+		return -1;
+	}
+
+loop:
+	ret = recvmsg(s, &msg, 0);
+	if (ret < 0) {
+		perror("Failed to receive message");
+		return -1;
+	}
+
+	if (!ret)	/* Done */
+		return 0;
+
+	if (!cmsg ||
+	    cmsg->cmsg_len < CMSG_LEN(sizeof(int)) ||
+	    cmsg->cmsg_len > CMSG_LEN(sizeof(int) * SCM_MAX_FD) ||
+	    cmsg->cmsg_type != SCM_RIGHTS)
+		return -1;
+
+	n = cmsg->cmsg_len / CMSG_LEN(sizeof(int));
+	memcpy(fds, CMSG_DATA(cmsg), sizeof(int) * n);
+
+	if (cmd != TCP_REPAIR_ON && cmd != TCP_REPAIR_OFF &&
+	    cmd != TCP_REPAIR_OFF_NO_WP) {
+		fprintf(stderr, "Unsupported command 0x%04x\n", cmd);
+		return -1;
+	}
+
+	for (i = 0; i < n; i++) {
+		int o = cmd;
+
+		if (setsockopt(fds[i], SOL_TCP, TCP_REPAIR, &o, sizeof(o))) {
+			fprintf(stderr,
+				"Setting TCP_REPAIR to %i on socket %i: %s", o,
+				fds[i], strerror(errno));
+			return -1;
+		}
+
+		/* Confirm setting by echoing the command back */
+		if (send(s, &cmd, sizeof(cmd), 0) < 0) {
+			fprintf(stderr, "Reply to command %i: %s\n",
+				o, strerror(errno));
+			return -1;
+		}
+	}
+
+	goto loop;
+
+	return 0;
+}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 7/8] Add interfaces and configuration bits for passt-repair
  2025-01-28 23:39 [PATCH v2 0/8] Draft, incomplete series introducing state migration Stefano Brivio
                   ` (5 preceding siblings ...)
  2025-01-28 23:39 ` [PATCH v2 6/8] Introduce passt-repair Stefano Brivio
@ 2025-01-28 23:39 ` Stefano Brivio
  2025-01-29  6:09   ` David Gibson
  2025-01-28 23:39 ` [PATCH v2 8/8] flow, tcp: Basic pre-migration source handler to dump sequence numbers Stefano Brivio
  7 siblings, 1 reply; 20+ messages in thread
From: Stefano Brivio @ 2025-01-28 23:39 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier, David Gibson

In vhost-user mode, by default, create a second UNIX domain socket
accepting connections from passt-repair, with the usual listener
socket.

When we need to set or clear TCP_REPAIR on sockets, we'll send them
via SCM_RIGHTS to passt-repair, who sets the socket option values we
ask for.

To that end, introduce batched functions to request TCP_REPAIR
settings on sockets, so that we don't have to send a single message
for each socket, on migration. When needed, repair_flush() will
send the message and check for the reply.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 Makefile     |  12 ++--
 conf.c       |  46 ++++++++++--
 epoll_type.h |   4 ++
 passt.1      |  11 +++
 passt.c      |   9 +++
 passt.h      |   7 ++
 repair.c     | 192 +++++++++++++++++++++++++++++++++++++++++++++++++++
 repair.h     |  16 +++++
 tap.c        |  65 +----------------
 util.c       |  62 +++++++++++++++++
 util.h       |   1 +
 11 files changed, 353 insertions(+), 72 deletions(-)
 create mode 100644 repair.c
 create mode 100644 repair.h

diff --git a/Makefile b/Makefile
index 1b71cb0..f67a20b 100644
--- a/Makefile
+++ b/Makefile
@@ -38,9 +38,9 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
 	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
-	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
-	tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
-	vhost_user.c virtio.c vu_common.c
+	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c \
+	repair.c tap.c tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c \
+	udp_vu.c util.c vhost_user.c virtio.c vu_common.c
 QRAP_SRCS = qrap.c
 PASST_REPAIR_SRCS = passt-repair.c
 SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
@@ -50,9 +50,9 @@ MANPAGES = passt.1 pasta.1 qrap.1
 PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
 	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
 	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
-	pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
-	tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
-	vhost_user.h virtio.h vu_common.h
+	pcap.h pif.h repair.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h \
+	tcp_internal.h tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h \
+	udp_vu.h util.h vhost_user.h virtio.h vu_common.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
diff --git a/conf.c b/conf.c
index df2b016..85dec44 100644
--- a/conf.c
+++ b/conf.c
@@ -816,6 +816,9 @@ static void usage(const char *name, FILE *f, int status)
 			"    UNIX domain socket is provided by -s option\n"
 			"  --print-capabilities	print back-end capabilities in JSON format,\n"
 			"    only meaningful for vhost-user mode\n");
+		FPRINTF(f,
+			"  --repair-path PATH	path for passt-repair(1)\n"
+			"    default: append '.repair' to UNIX domain path\n");
 	}
 
 	FPRINTF(f,
@@ -1240,8 +1243,30 @@ static void conf_nat(const char *arg, struct in_addr *addr4,
  */
 static void conf_open_files(struct ctx *c)
 {
-	if (c->mode != MODE_PASTA && c->fd_tap == -1)
-		c->fd_tap_listen = tap_sock_unix_open(c->sock_path);
+	if (c->mode != MODE_PASTA && c->fd_tap == -1) {
+		c->fd_tap_listen = sock_unix(c->sock_path);
+
+		if (c->mode == MODE_VU && strcmp(c->repair_path, "none")) {
+			if (!strncmp(c->repair_path, "./", 2)) {
+				memmove(c->repair_path, c->repair_path + 2,
+					sizeof(c->repair_path) - 2);
+			}
+
+			if (!*c->repair_path &&
+			    snprintf_check(c->repair_path,
+					   sizeof(c->repair_path), "%s.repair",
+					   c->sock_path)) {
+				warn("passt-repair path %s not usable",
+				     c->repair_path);
+				c->fd_repair_listen = -1;
+			} else {
+				c->fd_repair_listen = sock_unix(c->repair_path);
+			}
+		} else {
+			c->fd_repair_listen = -1;
+		}
+		c->fd_repair = -1;
+	}
 
 	if (*c->pidfile) {
 		c->pidfile_fd = output_file_open(c->pidfile, O_WRONLY);
@@ -1354,9 +1379,12 @@ void conf(struct ctx *c, int argc, char **argv)
 		{"host-lo-to-ns-lo", no_argument, 	NULL,		23 },
 		{"dns-host",	required_argument,	NULL,		24 },
 		{"vhost-user",	no_argument,		NULL,		25 },
+
 		/* vhost-user backend program convention */
 		{"print-capabilities", no_argument,	NULL,		26 },
 		{"socket-path",	required_argument,	NULL,		's' },
+
+		{"repair-path",	required_argument,	NULL,		27 },
 		{ 0 },
 	};
 	const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
@@ -1824,8 +1852,8 @@ void conf(struct ctx *c, int argc, char **argv)
 	if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.guest_gw))
 		c->no_dhcp = 1;
 
-	/* Inbound port options & DNS can be parsed now (after IPv4/IPv6
-	 * settings)
+	/* Inbound port options, DNS, and --repair-path can be parsed now, after
+	 * IPv4/IPv6 settings and --vhost-user.
 	 */
 	fwd_probe_ephemeral();
 	udp_portmap_clear();
@@ -1871,6 +1899,16 @@ void conf(struct ctx *c, int argc, char **argv)
 			}
 
 			die("Cannot use DNS address %s", optarg);
+		} else if (name == 27) {
+			if (c->mode != MODE_VU && strcmp(optarg, "none"))
+				die("--repair-path is for vhost-user mode only");
+
+			if (snprintf_check(c->repair_path,
+					   sizeof(c->repair_path), "%s",
+					   optarg))
+				die("Invalid passt-repair path: %s", optarg);
+
+			break;
 		}
 	} while (name != -1);
 
diff --git a/epoll_type.h b/epoll_type.h
index fd9eac3..706238a 100644
--- a/epoll_type.h
+++ b/epoll_type.h
@@ -42,6 +42,10 @@ enum epoll_type {
 	EPOLL_TYPE_VHOST_KICK,
 	/* vhost-user migration socket */
 	EPOLL_TYPE_VHOST_MIGRATION,
+	/* TCP_REPAIR helper listening socket */
+	EPOLL_TYPE_REPAIR_LISTEN,
+	/* TCP_REPAIR helper socket */
+	EPOLL_TYPE_REPAIR,
 
 	EPOLL_NUM_TYPES,
 };
diff --git a/passt.1 b/passt.1
index d9cd33e..63a3a01 100644
--- a/passt.1
+++ b/passt.1
@@ -418,6 +418,17 @@ Enable vhost-user. The vhost-user command socket is provided by \fB--socket\fR.
 .BR \-\-print-capabilities
 Print back-end capabilities in JSON format, only meaningful for vhost-user mode.
 
+.TP
+.BR \-\-repair-path " " \fIpath
+Path for UNIX domain socket used by the \fBpasst-repair\fR(1) helper to connect
+to \fBpasst\fR in order to set or clear the TCP_REPAIR option on sockets, during
+migration. \fB--repair-path none\fR disables this interface (if you need to
+specify a socket path called "none" you can prefix the path by \fI./\fR).
+
+Default, for \-\-vhost-user mode only, is to append \fI.repair\fR to the path
+chosen for the hypervisor UNIX domain socket. No socket is created if not in
+\-\-vhost-user mode.
+
 .TP
 .BR \-F ", " \-\-fd " " \fIFD
 Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened
diff --git a/passt.c b/passt.c
index 184d4e5..1fa2ddd 100644
--- a/passt.c
+++ b/passt.c
@@ -51,6 +51,7 @@
 #include "tcp_splice.h"
 #include "ndp.h"
 #include "vu_common.h"
+#include "repair.h"
 
 #define EPOLL_EVENTS		8
 
@@ -76,6 +77,8 @@ char *epoll_type_str[] = {
 	[EPOLL_TYPE_VHOST_CMD]		= "vhost-user command socket",
 	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
 	[EPOLL_TYPE_VHOST_MIGRATION]	= "vhost-user migration socket",
+	[EPOLL_TYPE_REPAIR_LISTEN]	= "TCP_REPAIR helper listening socket",
+	[EPOLL_TYPE_REPAIR]		= "TCP_REPAIR helper socket",
 };
 static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
 	      "epoll_type_str[] doesn't match enum epoll_type");
@@ -360,6 +363,12 @@ loop:
 		case EPOLL_TYPE_VHOST_MIGRATION:
 			vu_migrate(&c, eventmask);
 			break;
+		case EPOLL_TYPE_REPAIR_LISTEN:
+			repair_listen_handler(&c, eventmask);
+			break;
+		case EPOLL_TYPE_REPAIR:
+			repair_handler(&c, eventmask);
+			break;
 		default:
 			/* Can't happen */
 			ASSERT(0);
diff --git a/passt.h b/passt.h
index 0dd4efa..85b0a10 100644
--- a/passt.h
+++ b/passt.h
@@ -20,6 +20,7 @@ union epoll_ref;
 #include "siphash.h"
 #include "ip.h"
 #include "inany.h"
+#include "migrate.h"
 #include "flow.h"
 #include "icmp.h"
 #include "fwd.h"
@@ -193,6 +194,7 @@ struct ip6_ctx {
  * @foreground:		Run in foreground, don't log to stderr by default
  * @nofile:		Maximum number of open files (ulimit -n)
  * @sock_path:		Path for UNIX domain socket
+ * @repair_path:	TCP_REPAIR helper path, can be "none", empty for default
  * @pcap:		Path for packet capture file
  * @pidfile:		Path to PID file, empty string if not configured
  * @pidfile_fd:		File descriptor for PID file, -1 if none
@@ -203,6 +205,8 @@ struct ip6_ctx {
  * @epollfd:		File descriptor for epoll instance
  * @fd_tap_listen:	File descriptor for listening AF_UNIX socket, if any
  * @fd_tap:		AF_UNIX socket, tuntap device, or pre-opened socket
+ * @fd_repair_listen:	File descriptor for listening TCP_REPAIR socket, if any
+ * @fd_repair:		Connected AF_UNIX socket for TCP_REPAIR helper
  * @our_tap_mac:	Pasta/passt's MAC on the tap link
  * @guest_mac:		MAC address of guest or namespace, seen or configured
  * @hash_secret:	128-bit secret for siphash functions
@@ -244,6 +248,7 @@ struct ctx {
 	int foreground;
 	int nofile;
 	char sock_path[UNIX_PATH_MAX];
+	char repair_path[UNIX_PATH_MAX];
 	char pcap[PATH_MAX];
 
 	char pidfile[PATH_MAX];
@@ -260,6 +265,8 @@ struct ctx {
 	int epollfd;
 	int fd_tap_listen;
 	int fd_tap;
+	int fd_repair_listen;
+	int fd_repair;
 	unsigned char our_tap_mac[ETH_ALEN];
 	unsigned char guest_mac[ETH_ALEN];
 	uint64_t hash_secret[2];
diff --git a/repair.c b/repair.c
new file mode 100644
index 0000000..24966f5
--- /dev/null
+++ b/repair.c
@@ -0,0 +1,192 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* PASST - Plug A Simple Socket Transport
+ *  for qemu/UNIX domain socket mode
+ *
+ * PASTA - Pack A Subtle Tap Abstraction
+ *  for network namespace/tap device mode
+ *
+ * repair.c - Interface (server) for passt-repair, set/clear TCP_REPAIR
+ *
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#include <errno.h>
+#include <sys/uio.h>
+
+#include "util.h"
+#include "ip.h"
+#include "passt.h"
+#include "inany.h"
+#include "flow.h"
+#include "flow_table.h"
+
+#include "repair.h"
+
+#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
+
+static int fds[SCM_MAX_FD];
+static int current_cmd;
+static int nfds;
+
+/**
+ * repair_sock_init() - Start listening for connections on helper socket
+ * @c:		Execution context
+ */
+void repair_sock_init(const struct ctx *c)
+{
+	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR_LISTEN };
+	struct epoll_event ev = { 0 };
+
+	listen(c->fd_repair_listen, 0);
+
+	ref.fd = c->fd_repair_listen;
+	ev.events = EPOLLIN | EPOLLHUP | EPOLLET;
+	ev.data.u64 = ref.u64;
+	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair_listen, &ev);
+}
+
+/**
+ * repair_listen_handler() - Handle events on TCP_REPAIR helper listening socket
+ * @c:		Execution context
+ * @events:	epoll events
+ */
+void repair_listen_handler(struct ctx *c, uint32_t events)
+{
+	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR };
+	struct epoll_event ev = { 0 };
+	struct ucred ucred;
+	socklen_t len;
+
+	if (events != EPOLLIN) {
+		debug("Spurious event 0x%04x on TCP_REPAIR helper socket",
+		      events);
+		return;
+	}
+
+	len = sizeof(ucred);
+
+	/* Another client is already connected: accept and close right away. */
+	if (c->fd_repair != -1) {
+		int discard = accept4(c->fd_repair_listen, NULL, NULL,
+				      SOCK_NONBLOCK);
+
+		if (discard == -1)
+			return;
+
+		if (!getsockopt(discard, SOL_SOCKET, SO_PEERCRED, &ucred, &len))
+			info("Discarding TCP_REPAIR helper, PID %i", ucred.pid);
+
+		close(discard);
+		return;
+	}
+
+	c->fd_repair = accept4(c->fd_repair_listen, NULL, NULL, 0);
+
+	if (!getsockopt(c->fd_repair, SOL_SOCKET, SO_PEERCRED, &ucred, &len))
+		info("Accepted TCP_REPAIR helper, PID %i", ucred.pid);
+
+	ref.fd = c->fd_repair;
+	ev.events = EPOLLHUP | EPOLLET;
+	ev.data.u64 = ref.u64;
+	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair, &ev);
+}
+
+/**
+ * repair_close() - Close connection to TCP_REPAIR helper
+ * @c:		Execution context
+ */
+void repair_close(struct ctx *c)
+{
+	debug("Closing TCP_REPAIR helper socket");
+
+	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_repair, NULL);
+	close(c->fd_repair);
+	c->fd_repair = -1;
+}
+
+/**
+ * repair_handler() - Handle EPOLLHUP and EPOLLERR on TCP_REPAIR helper socket
+ * @c:		Execution context
+ * @events:	epoll events
+ */
+void repair_handler(struct ctx *c, uint32_t events)
+{
+	(void)events;
+
+	repair_close(c);
+}
+
+/**
+ * repair_flush() - Flush current set of sockets to helper, with current command
+ * @c:		Execution context
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+int repair_flush(struct ctx *c)
+{
+	struct iovec iov = { &((int8_t){ current_cmd }), sizeof(int8_t) };
+	char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
+	     __attribute__ ((aligned(__alignof__(struct cmsghdr))));
+	struct cmsghdr *cmsg;
+	struct msghdr msg;
+	int ret = 0;
+
+	if (!nfds)
+		return 0;
+
+	msg = (struct msghdr){ NULL, 0, &iov, 1,
+			       buf, CMSG_SPACE(sizeof(int) * nfds), 0 };
+	cmsg = CMSG_FIRSTHDR(&msg);
+
+	cmsg->cmsg_level = SOL_SOCKET;
+	cmsg->cmsg_type = SCM_RIGHTS;
+	cmsg->cmsg_len = CMSG_LEN(sizeof(int) * nfds);
+	memcpy(CMSG_DATA(cmsg), fds, sizeof(int) * nfds);
+
+	nfds = 0;
+
+	if (sendmsg(c->fd_repair, &msg, 0) < 0) {
+		ret = -errno;
+		err_perror("Failed to send sockets to TCP_REPAIR helper");
+		repair_close(c);
+	}
+
+	if (recv(c->fd_repair, &((int8_t){ 0 }), 1, 0) < 0) {
+		ret = -errno;
+		err_perror("Failed to receive reply from TCP_REPAIR helper");
+		repair_close(c);
+	}
+
+	return ret;
+}
+
+/**
+ * repair_flush() - Add socket to TCP_REPAIR set with given command
+ * @c:		Execution context
+ * @s:		Socket to add
+ * @cmd:	TCP_REPAIR_ON, TCP_REPAIR_OFF, or TCP_REPAIR_OFF_NO_WP
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+/* cppcheck-suppress unusedFunction */
+int repair_set(struct ctx *c, int s, int cmd)
+{
+	int rc;
+
+	if (nfds && current_cmd != cmd) {
+		if ((rc = repair_flush(c)))
+			return rc;
+	}
+
+	current_cmd = cmd;
+	fds[nfds++] = s;
+
+	if (nfds >= SCM_MAX_FD) {
+		if ((rc = repair_flush(c)))
+			return rc;
+	}
+
+	return 0;
+}
diff --git a/repair.h b/repair.h
new file mode 100644
index 0000000..693c515
--- /dev/null
+++ b/repair.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+ 
+#ifndef REPAIR_H
+#define REPAIR_H
+
+void repair_sock_init(const struct ctx *c);
+void repair_listen_handler(struct ctx *c, uint32_t events);
+void repair_handler(struct ctx *c, uint32_t events);
+void repair_close(struct ctx *c);
+int repair_flush(struct ctx *c);
+int repair_set(struct ctx *c, int s, int cmd);
+
+#endif /* REPAIR_H */
diff --git a/tap.c b/tap.c
index cd32a90..0e60eb4 100644
--- a/tap.c
+++ b/tap.c
@@ -56,6 +56,7 @@
 #include "netlink.h"
 #include "pasta.h"
 #include "packet.h"
+#include "repair.h"
 #include "tap.h"
 #include "log.h"
 #include "vhost_user.h"
@@ -1151,68 +1152,6 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
 		tap_pasta_input(c, now);
 }
 
-/**
- * tap_sock_unix_open() - Create and bind AF_UNIX socket
- * @sock_path:	Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
- *
- * Return: socket descriptor on success, won't return on failure
- */
-int tap_sock_unix_open(char *sock_path)
-{
-	int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
-	struct sockaddr_un addr = {
-		.sun_family = AF_UNIX,
-	};
-	int i;
-
-	if (fd < 0)
-		die_perror("Failed to open UNIX domain socket");
-
-	for (i = 1; i < UNIX_SOCK_MAX; i++) {
-		char *path = addr.sun_path;
-		int ex, ret;
-
-		if (*sock_path)
-			memcpy(path, sock_path, UNIX_PATH_MAX);
-		else if (snprintf_check(path, UNIX_PATH_MAX - 1,
-					UNIX_SOCK_PATH, i))
-			die_perror("Can't build UNIX domain socket path");
-
-		ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
-			    0);
-		if (ex < 0)
-			die_perror("Failed to check for UNIX domain conflicts");
-
-		ret = connect(ex, (const struct sockaddr *)&addr, sizeof(addr));
-		if (!ret || (errno != ENOENT && errno != ECONNREFUSED &&
-			     errno != EACCES)) {
-			if (*sock_path)
-				die("Socket path %s already in use", path);
-
-			close(ex);
-			continue;
-		}
-		close(ex);
-
-		unlink(path);
-		ret = bind(fd, (const struct sockaddr *)&addr, sizeof(addr));
-		if (*sock_path && ret)
-			die_perror("Failed to bind UNIX domain socket");
-
-		if (!ret)
-			break;
-	}
-
-	if (i == UNIX_SOCK_MAX)
-		die_perror("Failed to bind UNIX domain socket");
-
-	info("UNIX domain socket bound at %s", addr.sun_path);
-	if (!*sock_path)
-		memcpy(sock_path, addr.sun_path, UNIX_PATH_MAX);
-
-	return fd;
-}
-
 /**
  * tap_backend_show_hints() - Give help information to start QEMU
  * @c:		Execution context
@@ -1423,6 +1362,8 @@ void tap_backend_init(struct ctx *c)
 		tap_sock_tun_init(c);
 		break;
 	case MODE_VU:
+		repair_sock_init(c);
+		/* fall through */
 	case MODE_PASST:
 		tap_sock_unix_init(c);
 
diff --git a/util.c b/util.c
index 36857d4..e98da74 100644
--- a/util.c
+++ b/util.c
@@ -178,6 +178,68 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
 	return fd;
 }
 
+/**
+ * sock_unix() - Create and bind AF_UNIX socket
+ * @sock_path:	Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
+ *
+ * Return: socket descriptor on success, won't return on failure
+ */
+int sock_unix(char *sock_path)
+{
+	int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
+	struct sockaddr_un addr = {
+		.sun_family = AF_UNIX,
+	};
+	int i;
+
+	if (fd < 0)
+		die_perror("Failed to open UNIX domain socket");
+
+	for (i = 1; i < UNIX_SOCK_MAX; i++) {
+		char *path = addr.sun_path;
+		int ex, ret;
+
+		if (*sock_path)
+			memcpy(path, sock_path, UNIX_PATH_MAX);
+		else if (snprintf_check(path, UNIX_PATH_MAX - 1,
+					UNIX_SOCK_PATH, i))
+			die_perror("Can't build UNIX domain socket path");
+
+		ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
+			    0);
+		if (ex < 0)
+			die_perror("Failed to check for UNIX domain conflicts");
+
+		ret = connect(ex, (const struct sockaddr *)&addr, sizeof(addr));
+		if (!ret || (errno != ENOENT && errno != ECONNREFUSED &&
+			     errno != EACCES)) {
+			if (*sock_path)
+				die("Socket path %s already in use", path);
+
+			close(ex);
+			continue;
+		}
+		close(ex);
+
+		unlink(path);
+		ret = bind(fd, (const struct sockaddr *)&addr, sizeof(addr));
+		if (*sock_path && ret)
+			die_perror("Failed to bind UNIX domain socket");
+
+		if (!ret)
+			break;
+	}
+
+	if (i == UNIX_SOCK_MAX)
+		die_perror("Failed to bind UNIX domain socket");
+
+	info("UNIX domain socket bound at %s", addr.sun_path);
+	if (!*sock_path)
+		memcpy(sock_path, addr.sun_path, UNIX_PATH_MAX);
+
+	return fd;
+}
+
 /**
  * sock_probe_mem() - Check if setting high SO_SNDBUF and SO_RCVBUF is allowed
  * @c:		Execution context
diff --git a/util.h b/util.h
index 73a7a33..d2fd2fe 100644
--- a/util.h
+++ b/util.h
@@ -185,6 +185,7 @@ struct ctx;
 int sock_l4_sa(const struct ctx *c, enum epoll_type type,
 	       const void *sa, socklen_t sl,
 	       const char *ifname, bool v6only, uint32_t data);
+int sock_unix(char *sock_path);
 void sock_probe_mem(struct ctx *c);
 long timespec_diff_ms(const struct timespec *a, const struct timespec *b);
 int64_t timespec_diff_us(const struct timespec *a, const struct timespec *b);
-- 
@@ -185,6 +185,7 @@ struct ctx;
 int sock_l4_sa(const struct ctx *c, enum epoll_type type,
 	       const void *sa, socklen_t sl,
 	       const char *ifname, bool v6only, uint32_t data);
+int sock_unix(char *sock_path);
 void sock_probe_mem(struct ctx *c);
 long timespec_diff_ms(const struct timespec *a, const struct timespec *b);
 int64_t timespec_diff_us(const struct timespec *a, const struct timespec *b);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v2 8/8] flow, tcp: Basic pre-migration source handler to dump sequence numbers
  2025-01-28 23:39 [PATCH v2 0/8] Draft, incomplete series introducing state migration Stefano Brivio
                   ` (6 preceding siblings ...)
  2025-01-28 23:39 ` [PATCH v2 7/8] Add interfaces and configuration bits for passt-repair Stefano Brivio
@ 2025-01-28 23:39 ` Stefano Brivio
  2025-01-29  6:15   ` David Gibson
  7 siblings, 1 reply; 20+ messages in thread
From: Stefano Brivio @ 2025-01-28 23:39 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier, David Gibson

Very much draft quality, but it works. Ask passt-repair to switch
TCP sockets to repair mode and dump their current sequence numbers to
the flow table, which will be transferred and used by the target in
the next step.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 flow.c     | 43 +++++++++++++++++++++++++++++++++++++++++
 flow.h     |  1 +
 migrate.c  |  1 +
 tcp.c      | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tcp_conn.h |  5 +++++
 5 files changed, 106 insertions(+)

diff --git a/flow.c b/flow.c
index ee1221b..e7148b2 100644
--- a/flow.c
+++ b/flow.c
@@ -19,6 +19,7 @@
 #include "inany.h"
 #include "flow.h"
 #include "flow_table.h"
+#include "repair.h"
 
 const char *flow_state_str[] = {
 	[FLOW_STATE_FREE]	= "FREE",
@@ -874,6 +875,48 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 	*last_next = FLOW_MAX;
 }
 
+/**
+ * flow_migrate_source_pre() - Prepare all source flows for migration
+ * @c:		Execution context
+ * @m:		Migration metadata
+ *
+ * Return: 0 on success
+ */
+int flow_migrate_source_pre(struct ctx *c, struct migrate_meta *m)
+{
+	unsigned i;
+	int rc;
+
+	(void)m;
+
+	for (i = 0; i < FLOW_MAX; i++) {	/* TODO: iterator with skip */
+		union flow *flow = &flowtab[i];
+
+		if (flow->f.state == FLOW_STATE_FREE)
+			i += flow->free.n - 1;
+		else if (flow->f.state == FLOW_STATE_ACTIVE &&
+			 flow->f.type == FLOW_TCP)
+			rc = tcp_flow_repair_on(c, &flow->tcp);
+
+		if (rc)
+			return rc;		/* TODO: rollback */
+	}
+
+	repair_flush(c);			/* TODO: move to TCP logic */
+
+	for (i = 0; i < FLOW_MAX; i++) {	/* TODO: iterator with skip */
+		union flow *flow = &flowtab[i];
+
+		if (flow->f.state == FLOW_STATE_FREE)
+			i += flow->free.n - 1;
+		else if (flow->f.state == FLOW_STATE_ACTIVE &&
+			 flow->f.type == FLOW_TCP)
+			tcp_flow_dump_seq(c, &flow->tcp);
+	}
+
+	return 0;
+}
+
 /**
  * flow_init() - Initialise flow related data structures
  */
diff --git a/flow.h b/flow.h
index 8eb5964..ff390a6 100644
--- a/flow.h
+++ b/flow.h
@@ -255,6 +255,7 @@ union flow;
 
 void flow_init(void);
 void flow_defer_handler(const struct ctx *c, const struct timespec *now);
+int flow_migrate_source_pre(struct ctx *c, struct migrate_meta *m);
 
 void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 	__attribute__((format(printf, 3, 4)));
diff --git a/migrate.c b/migrate.c
index b8b79e0..6707c02 100644
--- a/migrate.c
+++ b/migrate.c
@@ -56,6 +56,7 @@ static struct migrate_data data_versions[] = {
 
 /* Handlers to call in source before sending data */
 struct migrate_handler handlers_source_pre[] = {
+	{ flow_migrate_source_pre },
 	{ 0 },
 };
 
diff --git a/tcp.c b/tcp.c
index c89f323..3a3038b 100644
--- a/tcp.c
+++ b/tcp.c
@@ -299,6 +299,7 @@
 #include "log.h"
 #include "inany.h"
 #include "flow.h"
+#include "repair.h"
 #include "linux_dep.h"
 
 #include "flow_table.h"
@@ -868,6 +869,61 @@ void tcp_defer_handler(struct ctx *c)
 	tcp_payload_flush(c);
 }
 
+/**
+ * tcp_flow_repair_on() - Enable repair mode for a single TCP flow
+ * @c:		Execution context
+ * @conn:	Pointer to the TCP connection structure
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn)
+{
+	int rc = 0;
+
+	if ((rc = repair_set(c, conn->sock, TCP_REPAIR_ON)))
+		err("Failed to set TCP_REPAIR for socket %i", conn->sock);
+
+	return rc;
+}
+
+/**
+ * tcp_flow_dump_seq() - Dump sequences for send and receive queues
+ * @c:		Execution context
+ * @conn:	Pointer to the TCP connection structure
+ *
+ * Return: 0 on success, negative error code on failure
+ */
+int tcp_flow_dump_seq(struct ctx *c, struct tcp_tap_conn *conn)
+{
+	int v, s = conn->sock;
+	socklen_t vlen;
+
+	(void)c;
+
+	vlen = sizeof(v);
+
+	v = TCP_SEND_QUEUE;
+	/* TODO: proper error management and prints */
+	if (setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &v, vlen))
+		return -errno;
+
+	if (getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &conn->sock_seq_snd, &vlen))
+		return -errno;
+
+	debug("Send queue sequence %u for socket %i", conn->sock_seq_snd, s);
+
+	v = TCP_RECV_QUEUE;
+	if (setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &v, vlen))
+		return -errno;
+
+	if (getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &conn->sock_seq_rcv, &vlen))
+		return -errno;
+
+	debug("Receive queue sequence %u for socket %i", conn->sock_seq_rcv, s);
+
+	return 0;
+}
+
 /**
  * tcp_fill_header() - Fill the TCP header fields for a given TCP segment.
  *
diff --git a/tcp_conn.h b/tcp_conn.h
index d342680..0c3e197 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -94,6 +94,9 @@ struct tcp_tap_conn {
 	uint32_t	seq_from_tap;
 	uint32_t	seq_ack_to_tap;
 	uint32_t	seq_init_from_tap;
+
+	uint32_t	sock_seq_snd;
+	uint32_t	sock_seq_rcv;
 };
 
 /**
@@ -140,6 +143,8 @@ extern int init_sock_pool4	[TCP_SOCK_POOL_SIZE];
 extern int init_sock_pool6	[TCP_SOCK_POOL_SIZE];
 
 bool tcp_flow_defer(const struct tcp_tap_conn *conn);
+int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn);
+int tcp_flow_dump_seq(struct ctx *c, struct tcp_tap_conn *conn);
 bool tcp_splice_flow_defer(struct tcp_splice_conn *conn);
 void tcp_splice_timer(const struct ctx *c, struct tcp_splice_conn *conn);
 int tcp_conn_pool_sock(int pool[]);
-- 
@@ -94,6 +94,9 @@ struct tcp_tap_conn {
 	uint32_t	seq_from_tap;
 	uint32_t	seq_ack_to_tap;
 	uint32_t	seq_init_from_tap;
+
+	uint32_t	sock_seq_snd;
+	uint32_t	sock_seq_rcv;
 };
 
 /**
@@ -140,6 +143,8 @@ extern int init_sock_pool4	[TCP_SOCK_POOL_SIZE];
 extern int init_sock_pool6	[TCP_SOCK_POOL_SIZE];
 
 bool tcp_flow_defer(const struct tcp_tap_conn *conn);
+int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn);
+int tcp_flow_dump_seq(struct ctx *c, struct tcp_tap_conn *conn);
 bool tcp_splice_flow_defer(struct tcp_splice_conn *conn);
 void tcp_splice_timer(const struct ctx *c, struct tcp_splice_conn *conn);
 int tcp_conn_pool_sock(int pool[]);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/8] icmp, udp: Pad time_t timestamp to 64-bit to ease state migration
  2025-01-28 23:39 ` [PATCH v2 1/8] icmp, udp: Pad time_t timestamp to 64-bit to ease " Stefano Brivio
@ 2025-01-29  1:34   ` David Gibson
  0 siblings, 0 replies; 20+ messages in thread
From: David Gibson @ 2025-01-29  1:34 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier

[-- Attachment #1: Type: text/plain, Size: 2160 bytes --]

On Wed, Jan 29, 2025 at 12:39:33AM +0100, Stefano Brivio wrote:
> That's the only field in flows with different storage sizes depending
> on the architecture: it's usually 4-byte wide on 32-bit architectures,
> except for arc and x32 where it's 8 bytes, and 8-byte wide on 64-bit
> machines.
> 
> By keeping flow entries the same size across architectures, we avoid
> having to expand or shrink table entries upon migration.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Ugly, but necessary for this v1 approach.

> ---
>  icmp_flow.h | 6 +++++-
>  udp_flow.h  | 6 +++++-
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/icmp_flow.h b/icmp_flow.h
> index fb93801..da7e255 100644
> --- a/icmp_flow.h
> +++ b/icmp_flow.h
> @@ -13,6 +13,7 @@
>   * @seq:	Last sequence number sent to tap, host order, -1: not sent yet
>   * @sock:	"ping" socket
>   * @ts:		Last associated activity from tap, seconds
> + * @ts_storage:	Pad @ts to 64-bit storage to keep state migration sane
>   */
>  struct icmp_ping_flow {
>  	/* Must be first element */
> @@ -20,7 +21,10 @@ struct icmp_ping_flow {
>  
>  	int seq;
>  	int sock;
> -	time_t ts;
> +	union {
> +		time_t ts;
> +		uint64_t ts_storage;
> +	};
>  };
>  
>  bool icmp_ping_timer(const struct ctx *c, const struct icmp_ping_flow *pingf,
> diff --git a/udp_flow.h b/udp_flow.h
> index 9a1b059..9cb79a0 100644
> --- a/udp_flow.h
> +++ b/udp_flow.h
> @@ -12,6 +12,7 @@
>   * @f:		Generic flow information
>   * @closed:	Flow is already closed
>   * @ts:		Activity timestamp
> + * @ts_storage:	Pad @ts to 64-bit storage to keep state migration sane
>   * @s:		Socket fd (or -1) for each side of the flow
>   */
>  struct udp_flow {
> @@ -19,7 +20,10 @@ struct udp_flow {
>  	struct flow_common f;
>  
>  	bool closed :1;
> -	time_t ts;
> +	union {
> +		time_t ts;
> +		uint64_t ts_storage;
> +	};
>  	int s[SIDES];
>  };
>  

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/8] flow, flow_table: Pad flow table entries to 128 bytes, hash entries to 32 bits
  2025-01-28 23:39 ` [PATCH v2 2/8] flow, flow_table: Pad flow table entries to 128 bytes, hash entries to 32 bits Stefano Brivio
@ 2025-01-29  1:35   ` David Gibson
  0 siblings, 0 replies; 20+ messages in thread
From: David Gibson @ 2025-01-29  1:35 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier

[-- Attachment #1: Type: text/plain, Size: 2936 bytes --]

On Wed, Jan 29, 2025 at 12:39:34AM +0100, Stefano Brivio wrote:
> ...to keep migration sane. Right now, the biggest struct in union flow
> is struct tcp_splice_conn with 120 bytes on x86_64, which should also
> have the biggest storage and alignment requirements of any
> architecture we might run on.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Again, ugly, but necessary for the v1 approach.

> ---
>  flow.h       | 18 ++++++++++++------
>  flow_table.h | 13 ++++++++++---
>  2 files changed, 22 insertions(+), 9 deletions(-)
> 
> diff --git a/flow.h b/flow.h
> index 24ba3ef..8eb5964 100644
> --- a/flow.h
> +++ b/flow.h
> @@ -202,15 +202,21 @@ struct flow_common {
>  
>  /**
>   * struct flow_sidx - ID for one side of a specific flow
> - * @sidei:	Index of side referenced (0 or 1)
> - * @flowi:	Index of flow referenced
> + * @sidei:		Index of side referenced (0 or 1)
> + * @flowi:		Index of flow referenced
> + * @flow_sidx_storage:	Pad to 32 bits
>   */
>  typedef struct flow_sidx {
> -	unsigned	sidei :1;
> -	unsigned	flowi :FLOW_INDEX_BITS;
> +	union {
> +		struct {
> +			unsigned	sidei :1;
> +			unsigned	flowi :FLOW_INDEX_BITS;
> +		};
> +		uint32_t flow_sidx_storage;
> +	};
>  } flow_sidx_t;
> -static_assert(sizeof(flow_sidx_t) <= sizeof(uint32_t),
> -	      "flow_sidx_t must fit within 32 bits");
> +static_assert(sizeof(flow_sidx_t) == sizeof(uint32_t),
> +	      "flow_sidx_t must be 32-bit wide");
>  
>  #define FLOW_SIDX_NONE ((flow_sidx_t){ .flowi = FLOW_MAX })
>  
> diff --git a/flow_table.h b/flow_table.h
> index f15db53..007f4dd 100644
> --- a/flow_table.h
> +++ b/flow_table.h
> @@ -26,9 +26,13 @@ struct flow_free_cluster {
>  
>  /**
>   * union flow - Descriptor for a logical packet flow (e.g. connection)
> - * @f:		Fields common between all variants
> - * @tcp:	Fields for non-spliced TCP connections
> - * @tcp_splice:	Fields for spliced TCP connections
> + * @f:			Fields common between all variants
> + * @free:		Entry in a cluster of free entries
> + * @tcp:		Fields for non-spliced TCP connections
> + * @tcp_splice:		Fields for spliced TCP connections
> + * @ping:		Tracking for ping flows
> + * @udp:		Tracking for UDP flows
> + * @flow_storage:	Pad flow entries to 128 bytes to ease state migration
>  */
>  union flow {
>  	struct flow_common f;
> @@ -37,8 +41,11 @@ union flow {
>  	struct tcp_splice_conn tcp_splice;
>  	struct icmp_ping_flow ping;
>  	struct udp_flow udp;
> +	char flow_storage[128];
>  };
>  
> +static_assert(sizeof(union flow) == 128, "union flow should be 128-byte wide");
> +
>  /* Global Flow Table */
>  extern unsigned flow_first_free;
>  extern union flow flowtab[];

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 4/8] util: Add read_remainder() and read_all_buf()
  2025-01-28 23:39 ` [PATCH v2 4/8] util: Add read_remainder() and read_all_buf() Stefano Brivio
@ 2025-01-29  1:37   ` David Gibson
  0 siblings, 0 replies; 20+ messages in thread
From: David Gibson @ 2025-01-29  1:37 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier

[-- Attachment #1: Type: text/plain, Size: 4121 bytes --]

On Wed, Jan 29, 2025 at 12:39:36AM +0100, Stefano Brivio wrote:
> These are symmetric to write_remainder() and write_all_buf() and
> almost a copy and paste of them, with the most notable differences
> being reversed reads/writes and a couple of better-safe-than-sorry
> asserts to keep Coverity happy.
> 
> I'll use them in the next patch. At least for the moment, they're
> going to be used for vhost-user mode only, so I'm not unconditionally
> enabling readv() in the seccomp profile: the caller has to ensure it's
> there.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

I don't think I've ever encountered ENODATA before, so it seems a good
choice.  Other options would be

a) 0 for success, -ve for read error, 1 for EOF
b) -ve for read error, 0 for EOF, # of bytes for success (redundant,
   but harmless)

but this is fine.

> ---
>  util.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  util.h |  2 ++
>  2 files changed, 82 insertions(+)
> 
> diff --git a/util.c b/util.c
> index 11973c4..36857d4 100644
> --- a/util.c
> +++ b/util.c
> @@ -606,6 +606,86 @@ int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip)
>  	return 0;
>  }
>  
> +/**
> + * read_all_buf() - Fill a whole buffer from a file descriptor
> + * @fd:		File descriptor
> + * @buf:	Pointer to base of buffer
> + * @len:	Length of buffer
> + *
> + * Return: 0 on success, -1 on error (with errno set)
> + *
> + * #syscalls read
> + */
> +int read_all_buf(int fd, void *buf, size_t len)
> +{
> +	size_t left = len;
> +	char *p = buf;
> +
> +	while (left) {
> +		ssize_t rc;
> +
> +		ASSERT(left <= len);
> +
> +		do
> +			rc = read(fd, p, left);
> +		while ((rc < 0) && errno == EINTR);
> +
> +		if (rc < 0)
> +			return -1;
> +
> +		if (rc == 0) {
> +			errno = ENODATA;
> +			return -1;
> +		}
> +
> +		p += rc;
> +		left -= rc;
> +	}
> +	return 0;
> +}
> +
> +/**
> + * read_remainder() - Read the tail of an IO vector from a file descriptor
> + * @fd:		File descriptor
> + * @iov:	IO vector
> + * @cnt:	Number of entries in @iov
> + * @skip:	Number of bytes of the vector to skip reading
> + *
> + * Return: 0 on success, -1 on error (with errno set)
> + *
> + * Note: mode-specific seccomp profiles need to enable readv() to use this.
> + */
> +int read_remainder(int fd, struct iovec *iov, size_t cnt, size_t skip)
> +{
> +	size_t i = 0, offset;
> +
> +	while ((i += iov_skip_bytes(iov + i, cnt - i, skip, &offset)) < cnt) {
> +		ssize_t rc;
> +
> +		if (offset) {
> +			ASSERT(offset < iov[i].iov_len);
> +			/* Read the remainder of the partially read buffer */
> +			if (read_all_buf(fd, (char *)iov[i].iov_base + offset,
> +					 iov[i].iov_len - offset) < 0)
> +				return -1;
> +			i++;
> +		}
> +
> +		/* Fill as many of the remaining buffers as we can */
> +		rc = readv(fd, &iov[i], cnt - i);
> +		if (rc < 0)
> +			return -1;
> +
> +		if (rc == 0) {
> +			errno = ENODATA;
> +			return -1;
> +		}
> +
> +		skip = rc;
> +	}
> +	return 0;
> +}
> +
>  /** sockaddr_ntop() - Convert a socket address to text format
>   * @sa:		Socket address
>   * @dst:	output buffer, minimum SOCKADDR_STRLEN bytes
> diff --git a/util.h b/util.h
> index d02333d..73a7a33 100644
> --- a/util.h
> +++ b/util.h
> @@ -203,6 +203,8 @@ int fls(unsigned long x);
>  int write_file(const char *path, const char *buf);
>  int write_all_buf(int fd, const void *buf, size_t len);
>  int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip);
> +int read_all_buf(int fd, void *buf, size_t len);
> +int read_remainder(int fd, struct iovec *iov, size_t cnt, size_t skip);
>  void close_open_files(int argc, char **argv);
>  bool snprintf_check(char *str, size_t size, const char *format, ...);
>  

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 5/8] Introduce facilities for guest migration on top of vhost-user infrastructure
  2025-01-28 23:39 ` [PATCH v2 5/8] Introduce facilities for guest migration on top of vhost-user infrastructure Stefano Brivio
@ 2025-01-29  5:41   ` David Gibson
  2025-01-29  8:46     ` Stefano Brivio
  0 siblings, 1 reply; 20+ messages in thread
From: David Gibson @ 2025-01-29  5:41 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier

[-- Attachment #1: Type: text/plain, Size: 20507 bytes --]

On Wed, Jan 29, 2025 at 12:39:37AM +0100, Stefano Brivio wrote:
> Add two sets (source or target) of three functions each for passt in
> vhost-user mode, triggered by activity on the file descriptor passed
> via VHOST_USER_PROTOCOL_F_DEVICE_STATE:
> 
> - migrate_source_pre() and migrate_target_pre() are called to prepare
>   for migration, before data is transferred
> 
> - migrate_source() sends, and migrate_target() receives migration data
> 
> - migrate_source_post() and migrate_target_post() are responsible for
>   any post-migration task
> 
> Callbacks are added to these functions with arrays of function
> pointers in migrate.c. Migration handlers are versioned.
> 
> Versioned descriptions of data sections will be added to the
> data_versions array, which points to versioned iovec arrays. Version
> 1 is currently empty and will be filled in in subsequent patches.
> 
> The source announces the data version to be used and informs the peer
> about endianness, and the size of void *, time_t, flow entries and
> flow hash table entries.
> 
> The target checks if the version of the source is still supported. If
> it's not, it aborts the migration.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>

Tomorrow, I'm planning to try implementing a bunch of the suggestions
I have below.

> ---
>  Makefile    |  12 +--
>  migrate.c   | 264 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  migrate.h   |  88 ++++++++++++++++++
>  passt.c     |   2 +-
>  vu_common.c | 124 ++++++++++++++++--------
>  vu_common.h |   2 +-
>  6 files changed, 443 insertions(+), 49 deletions(-)
>  create mode 100644 migrate.c
>  create mode 100644 migrate.h
> 
> diff --git a/Makefile b/Makefile
> index 464eef1..1383875 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>  
>  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
>  	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
> -	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
> -	tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> +	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
> +	tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
>  	vhost_user.c virtio.c vu_common.c
>  QRAP_SRCS = qrap.c
>  SRCS = $(PASST_SRCS) $(QRAP_SRCS)
> @@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1
>  
>  PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
>  	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
> -	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
> -	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
> -	tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
> -	virtio.h vu_common.h
> +	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
> +	pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
> +	tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
> +	vhost_user.h virtio.h vu_common.h
>  HEADERS = $(PASST_HEADERS) seccomp.h
>  
>  C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
> diff --git a/migrate.c b/migrate.c
> new file mode 100644
> index 0000000..b8b79e0
> --- /dev/null
> +++ b/migrate.c
> @@ -0,0 +1,264 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +/* PASST - Plug A Simple Socket Transport
> + *  for qemu/UNIX domain socket mode
> + *
> + * PASTA - Pack A Subtle Tap Abstraction
> + *  for network namespace/tap device mode
> + *
> + * migrate.c - Migration sections, layout, and routines
> + *
> + * Copyright (c) 2025 Red Hat GmbH
> + * Author: Stefano Brivio <sbrivio@redhat.com>
> + */
> +
> +#include <errno.h>
> +#include <sys/uio.h>
> +
> +#include "util.h"
> +#include "ip.h"
> +#include "passt.h"
> +#include "inany.h"
> +#include "flow.h"
> +#include "flow_table.h"
> +
> +#include "migrate.h"
> +
> +/* Current version of migration data */
> +#define MIGRATE_VERSION		1
> +
> +/* Magic as we see it and as seen with reverse endianness */
> +#define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
> +#define MIGRATE_MAGIC_SWAPPED	0xB0D1B1B01B1DBBB1
> +
> +/* Migration header to send from source */
> +static union migrate_header header = {
> +	.magic		= MIGRATE_MAGIC,
> +	.version	= htonl_constant(MIGRATE_VERSION),
> +	.time_t_size	= htonl_constant(sizeof(time_t)),
> +	.flow_size	= htonl_constant(sizeof(union flow)),
> +	.flow_sidx_size	= htonl_constant(sizeof(struct flow_sidx)),
> +	.voidp_size	= htonl_constant(sizeof(void *)),
> +};
> +
> +/* Data sections for version 1 */
> +static struct iovec sections_v1[] = {
> +	{ &header,	sizeof(header) },

This format assumes that everything we send is in buffers with
statically known size and address.  I'm pretty sure we'll outgrow that
quickly.  For one thing, I realised we want to transfer addr*_seen, or
we won't know how to direct an incoming connection after migration, if
that occurs before the guest sends anything outgoing.

> +};
> +
> +/* Set of data versions */
> +static struct migrate_data data_versions[] = {
> +	{
> +		1,	sections_v1,
> +	},

I realise it's a bit less "declarative", but maybe just a "receive"
function for each version which open codes sending each section would
be more flexible, and not that much more verbose.

> +	{ 0 },
> +};
> +
> +/* Handlers to call in source before sending data */
> +struct migrate_handler handlers_source_pre[] = {
> +	{ 0 },
> +};
> +
> +/* Handlers to call in source after sending data */
> +struct migrate_handler handlers_source_post[] = {
> +	{ 0 },
> +};

I also wonder if these tables of handlers are overkill, rather than
just having a function that calls all the things in the right order.

> +
> +/* Handlers to call in target before receiving data with version 1 */
> +struct migrate_handler handlers_target_pre_v1[] = {
> +	{ 0 },
> +};
> +
> +/* Handlers to call in target after receiving data with version 1 */
> +struct migrate_handler handlers_target_post_v1[] = {
> +	{ 0 },
> +};
> +
> +/* Versioned sets of migration handlers */
> +struct migrate_target_handlers target_handlers[] = {
> +	{
> +		1,
> +		handlers_target_pre_v1,
> +		handlers_target_post_v1,
> +	},
> +	{ 0 },
> +};
> +
> +/**
> + * migrate_source_pre() - Pre-migration tasks as source
> + * @c:		Execution context
> + * @m:		Migration metadata
> + *
> + * Return: 0 on success, error code on failure
> + */
> +int migrate_source_pre(struct ctx *c, struct migrate_meta *m)
> +{
> +	struct migrate_handler *h;
> +
> +	for (h = handlers_source_pre; h->fn; h++) {
> +		int rc;
> +
> +		if ((rc = h->fn(c, m)))
> +			return rc;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * migrate_source() - Perform migration as source: send state to hypervisor
> + * @fd:		Descriptor for state transfer
> + * @m:		Migration metadata
> + *
> + * Return: 0 on success, error code on failure
> + */
> +int migrate_source(int fd, const struct migrate_meta *m)
> +{
> +	static struct migrate_data *d;
> +	int count, rc;
> +
> +	(void)m;
> +
> +	for (d = data_versions; d->v != MIGRATE_VERSION; d++);
> +
> +	for (count = 0; d->sections[count].iov_len; count++);
> +
> +	debug("Writing %u migration sections", count - 1 /* minus header */);
> +	rc = write_remainder(fd, d->sections, count, 0);
> +	if (rc < 0)
> +		return errno;
> +
> +	return 0;
> +}
> +
> +/**
> + * migrate_source_post() - Post-migration tasks as source
> + * @c:		Execution context
> + * @m:		Migration metadata
> + *
> + * Return: 0 on success, error code on failure
> + */
> +void migrate_source_post(struct ctx *c, struct migrate_meta *m)
> +{
> +	struct migrate_handler *h;
> +
> +	for (h = handlers_source_post; h->fn; h++)
> +		h->fn(c, m);
> +}
> +
> +/**
> + * migrate_target_read_header() - Set metadata in target from source header
> + * @fd:		Descriptor for state transfer
> + * @m:		Migration metadata, filled on return
> + *
> + * Return: 0 on success, error code on failure
> + */
> +int migrate_target_read_header(int fd, struct migrate_meta *m)
> +{
> +	static struct migrate_data *d;
> +	union migrate_header h;
> +
> +	if (read_all_buf(fd, &h, sizeof(h)))
> +		return errno;
> +
> +	debug("Source magic: 0x%016" PRIx64 ", sizeof(void *): %u, version: %u",
> +	      h.magic, ntohl(h.voidp_size), ntohl(h.version));
> +
> +	for (d = data_versions; d->v != ntohl(h.version) && d->v; d++);
> +	if (!d->v)
> +		return ENOTSUP;
> +	m->v = d->v;
> +
> +	if (h.magic == MIGRATE_MAGIC)
> +		m->bswap = false;
> +	else if (h.magic == MIGRATE_MAGIC_SWAPPED)
> +		m->bswap = true;
> +	else
> +		return ENOTSUP;
> +
> +	if (ntohl(h.voidp_size) == 4)
> +		m->source_64b = false;
> +	else if (ntohl(h.voidp_size) == 8)
> +		m->source_64b = true;
> +	else
> +		return ENOTSUP;
> +
> +	if (ntohl(h.time_t_size) == 4)
> +		m->time_64b = false;
> +	else if (ntohl(h.time_t_size) == 8)
> +		m->time_64b = true;
> +	else
> +		return ENOTSUP;
> +
> +	m->flow_size = ntohl(h.flow_size);
> +	m->flow_sidx_size = ntohl(h.flow_sidx_size);
> +
> +	return 0;
> +}
> +
> +/**
> + * migrate_target_pre() - Pre-migration tasks as target
> + * @c:		Execution context
> + * @m:		Migration metadata
> + *
> + * Return: 0 on success, error code on failure
> + */
> +int migrate_target_pre(struct ctx *c, struct migrate_meta *m)
> +{
> +	struct migrate_target_handlers *th;
> +	struct migrate_handler *h;
> +
> +	for (th = target_handlers; th->v != m->v && th->v; th++);
> +
> +	for (h = th->pre; h->fn; h++) {
> +		int rc;
> +
> +		if ((rc = h->fn(c, m)))
> +			return rc;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * migrate_target() - Perform migration as target: receive state from hypervisor
> + * @fd:		Descriptor for state transfer
> + * @m:		Migration metadata
> + *
> + * Return: 0 on success, error code on failure
> + *
> + * #syscalls:vu readv
> + */
> +int migrate_target(int fd, const struct migrate_meta *m)
> +{
> +	static struct migrate_data *d;
> +	unsigned cnt;
> +	int rc;
> +
> +	for (d = data_versions; d->v != m->v && d->v; d++);
> +
> +	for (cnt = 0; d->sections[cnt + 1 /* skip header */].iov_len; cnt++);
> +
> +	debug("Reading %u migration sections", cnt);
> +	rc = read_remainder(fd, d->sections + 1, cnt, 0);
> +	if (rc < 0)
> +		return errno;
> +
> +	return 0;
> +}
> +
> +/**
> + * migrate_target_post() - Post-migration tasks as target
> + * @c:		Execution context
> + * @m:		Migration metadata
> + */
> +void migrate_target_post(struct ctx *c, struct migrate_meta *m)
> +{
> +	struct migrate_target_handlers *th;
> +	struct migrate_handler *h;
> +
> +	for (th = target_handlers; th->v != m->v && th->v; th++);
> +
> +	for (h = th->post; h->fn; h++)
> +		h->fn(c, m);
> +}
> diff --git a/migrate.h b/migrate.h
> new file mode 100644
> index 0000000..f9635ac
> --- /dev/null
> +++ b/migrate.h
> @@ -0,0 +1,88 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later
> + * Copyright (c) 2025 Red Hat GmbH
> + * Author: Stefano Brivio <sbrivio@redhat.com>
> + */
> + 
> +#ifndef MIGRATE_H
> +#define MIGRATE_H
> +
> +/**
> + * struct migrate_meta - Migration metadata
> + * @v:			Chosen migration data version, host order
> + * @bswap:		Source has opposite endianness
> + * @peer_64b:		Source uses 64-bit void *
> + * @time_64b:		Source uses 64-bit time_t
> + * @flow_size:		Size of union flow in source
> + * @flow_sidx_size:	Size of struct flow_sidx in source
> + */
> +struct migrate_meta {
> +	uint32_t v;
> +	bool bswap;
> +	bool source_64b;
> +	bool time_64b;
> +	size_t flow_size;
> +	size_t flow_sidx_size;
> +};
> +
> +/**
> + * union migrate_header - Migration header from source
> + * @magic:		0xB1BB1D1B0BB1D1B0, host order
> + * @version:		Source sends highest known, target aborts if unsupported
> + * @voidp_size:		sizeof(void *), network order
> + * @time_t_size:	sizeof(time_t), network order
> + * @flow_size:		sizeof(union flow), network order
> + * @flow_sidx_size:	sizeof(struct flow_sidx_t), network order
> + * @unused:		Go figure
> + */
> +union migrate_header {
> +	struct {
> +		uint64_t magic;
> +		uint32_t version;
> +		uint32_t voidp_size;
> +		uint32_t time_t_size;
> +		uint32_t flow_size;
> +		uint32_t flow_sidx_size;
> +	};
> +	uint8_t unused[65536];
> +};
> +
> +/**
> + * struct migrate_data - Data sections for given source version
> + * @v:			Source version this applies to, host order
> + * @sections:		Array of data sections, NULL-terminated
> + */
> +struct migrate_data {
> +	uint32_t v;
> +	struct iovec *sections;
> +};
> +
> +/**
> + * struct migrate_handler - Function to handle a specific data section
> + * @fn:			Function pointer taking pointer to context and metadata
> + */
> +struct migrate_handler {
> +	int (*fn)(struct ctx *c, struct migrate_meta *m);
> +};
> +
> +/**
> + * struct migrate_target_handlers - Versioned sets of migration target handlers
> + * @v:			Source version this applies to, host order
> + * @pre:		Set of functions to execute in target before data copy
> + * @post:		Set of functions to execute in target after data copy
> + */
> +struct migrate_target_handlers {
> +	uint32_t v;
> +	struct migrate_handler *pre;
> +	struct migrate_handler *post;
> +};
> +
> +int migrate_source_pre(struct ctx *c, struct migrate_meta *m);
> +int migrate_source(int fd, const struct migrate_meta *m);
> +void migrate_source_post(struct ctx *c, struct migrate_meta *m);
> +
> +int migrate_target_read_header(int fd, struct migrate_meta *m);
> +int migrate_target_pre(struct ctx *c, struct migrate_meta *m);
> +int migrate_target(int fd, const struct migrate_meta *m);
> +void migrate_target_post(struct ctx *c, struct migrate_meta *m);
> +
> +#endif /* MIGRATE_H */
> diff --git a/passt.c b/passt.c
> index b1c8ab6..184d4e5 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -358,7 +358,7 @@ loop:
>  			vu_kick_cb(c.vdev, ref, &now);
>  			break;
>  		case EPOLL_TYPE_VHOST_MIGRATION:
> -			vu_migrate(c.vdev, eventmask);
> +			vu_migrate(&c, eventmask);

Not really a comment on this patch, but do we actually need/want to
put the migration fd into the epoll loop?  IIUC, everything we're
doing on the migration fd is synchronous and blocking, so we could run
the whole migration from the VHOST_USER_SET_DEVICE_STATE callback.
Or, maybe less confusing, flag the migration to run before we next
call epoll_wait(): having the migration run strictly in-between epoll
cycles seems like it might make some things easier to analyse.

>  			break;
>  		default:
>  			/* Can't happen */
> diff --git a/vu_common.c b/vu_common.c
> index f43d8ac..6c346c8 100644
> --- a/vu_common.c
> +++ b/vu_common.c
> @@ -5,6 +5,7 @@
>   * common_vu.c - vhost-user common UDP and TCP functions
>   */
>  
> +#include <errno.h>
>  #include <unistd.h>
>  #include <sys/uio.h>
>  #include <sys/eventfd.h>
> @@ -17,6 +18,7 @@
>  #include "vhost_user.h"
>  #include "pcap.h"
>  #include "vu_common.h"
> +#include "migrate.h"
>  
>  #define VU_MAX_TX_BUFFER_NB	2
>  
> @@ -305,50 +307,90 @@ err:
>  }
>  
>  /**
> - * vu_migrate() - Send/receive passt insternal state to/from QEMU
> - * @vdev:	vhost-user device
> + * vu_migrate_source() - Migration as source, send state to hypervisor
> + * @c:		Execution context
> + * @fd:		File descriptor for state transfer
> + *
> + * Return: 0 on success, positive error code on failure
> + */
> +static int vu_migrate_source(struct ctx *c, int fd)
> +{
> +	struct migrate_meta m;
> +	int rc;
> +
> +	if ((rc = migrate_source_pre(c, &m))) {
> +		err("Source pre-migration failed: %s, abort", strerror_(rc));
> +		return rc;
> +	}
> +
> +	debug("Saving backend state");
> +
> +	rc = migrate_source(fd, &m);
> +	if (rc)
> +		err("Source migration failed: %s", strerror_(rc));
> +	else
> +		migrate_source_post(c, &m);
> +
> +	return rc;
> +}
> +
> +/**
> + * vu_migrate_target() - Migration as target, receive state from hypervisor
> + * @c:		Execution context
> + * @fd:		File descriptor for state transfer
> + *
> + * Return: 0 on success, positive error code on failure
> + */
> +static int vu_migrate_target(struct ctx *c, int fd)
> +{
> +	struct migrate_meta m;
> +	int rc;
> +
> +	rc = migrate_target_read_header(fd, &m);
> +	if (rc) {
> +		err("Migration header check failed: %s, abort", strerror_(rc));
> +		return rc;
> +	}
> +
> +	if ((rc = migrate_target_pre(c, &m))) {
> +		err("Target pre-migration failed: %s, abort", strerror_(rc));
> +		return rc;
> +	}
> +
> +	debug("Loading backend state");
> +
> +	rc = migrate_target(fd, &m);
> +	if (rc)
> +		err("Target migration failed: %s", strerror_(rc));
> +	else
> +		migrate_target_post(c, &m);
> +
> +	return rc;
> +}
> +
> +/**
> + * vu_migrate() - Send/receive passt internal state to/from QEMU
> + * @c:		Execution context
>   * @events:	epoll events
>   */
> -void vu_migrate(struct vu_dev *vdev, uint32_t events)
> +void vu_migrate(struct ctx *c, uint32_t events)
>  {
> -	int ret;
> +	struct vu_dev *vdev = c->vdev;
> +	int rc = EIO;
>  
> -	/* TODO: collect/set passt internal state
> -	 * and use vdev->device_state_fd to send/receive it
> -	 */
>  	debug("vu_migrate fd %d events %x", vdev->device_state_fd, events);
> -	if (events & EPOLLOUT) {
> -		debug("Saving backend state");
> -
> -		/* send some stuff */
> -		ret = write(vdev->device_state_fd, "PASST", 6);
> -		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
> -		vdev->device_state_result = ret == -1 ? -1 : 0;
> -		/* Closing the file descriptor signals the end of transfer */
> -		epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
> -			  vdev->device_state_fd, NULL);
> -		close(vdev->device_state_fd);
> -		vdev->device_state_fd = -1;
> -	} else if (events & EPOLLIN) {
> -		char buf[6];
> -
> -		debug("Loading backend state");
> -		/* read some stuff */
> -		ret = read(vdev->device_state_fd, buf, sizeof(buf));
> -		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
> -		if (ret != sizeof(buf)) {
> -			vdev->device_state_result = -1;
> -		} else {
> -			ret = strncmp(buf, "PASST", sizeof(buf));
> -			vdev->device_state_result = ret == 0 ? 0 : -1;
> -		}
> -	} else if (events & EPOLLHUP) {
> -		debug("Closing migration channel");
> -
> -		/* The end of file signals the end of the transfer. */
> -		epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
> -			  vdev->device_state_fd, NULL);
> -		close(vdev->device_state_fd);
> -		vdev->device_state_fd = -1;
> -	}
> +
> +	if (events & EPOLLOUT)
> +		rc = vu_migrate_source(c, vdev->device_state_fd);
> +	else if (events & EPOLLIN)
> +		rc = vu_migrate_target(c, vdev->device_state_fd);
> +
> +	/* EPOLLHUP without EPOLLIN/EPOLLOUT, or EPOLLERR? Migration failed */
> +
> +	vdev->device_state_result = rc;

Again, not really a comment on this patch, but I think it would be
safer to set device_state_result to something non-zero (probably
EINPROGRESS) as soon as we get the VHOST_USER_SET_DEVICE_STATE_FD.
That way if somehow VHOST_USER_CHECK_DEVICE_STATE were called before
we actually do anything about the migration we'd correctly report that
it hasn't happened.

> +
> +	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, vdev->device_state_fd, NULL);
> +	debug("Closing migration channel");
> +	close(vdev->device_state_fd);
> +	vdev->device_state_fd = -1;
>  }
> diff --git a/vu_common.h b/vu_common.h
> index d56c021..69c4006 100644
> --- a/vu_common.h
> +++ b/vu_common.h
> @@ -57,5 +57,5 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
>  void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
>  		const struct timespec *now);
>  int vu_send_single(const struct ctx *c, const void *buf, size_t size);
> -void vu_migrate(struct vu_dev *vdev, uint32_t events);
> +void vu_migrate(struct ctx *c, uint32_t events);
>  #endif /* VU_COMMON_H */

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 7/8] Add interfaces and configuration bits for passt-repair
  2025-01-28 23:39 ` [PATCH v2 7/8] Add interfaces and configuration bits for passt-repair Stefano Brivio
@ 2025-01-29  6:09   ` David Gibson
  2025-01-29  8:46     ` Stefano Brivio
  0 siblings, 1 reply; 20+ messages in thread
From: David Gibson @ 2025-01-29  6:09 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier

[-- Attachment #1: Type: text/plain, Size: 22790 bytes --]

On Wed, Jan 29, 2025 at 12:39:39AM +0100, Stefano Brivio wrote:
> In vhost-user mode, by default, create a second UNIX domain socket
> accepting connections from passt-repair, with the usual listener
> socket.
> 
> When we need to set or clear TCP_REPAIR on sockets, we'll send them
> via SCM_RIGHTS to passt-repair, who sets the socket option values we
> ask for.
> 
> To that end, introduce batched functions to request TCP_REPAIR
> settings on sockets, so that we don't have to send a single message
> for each socket, on migration. When needed, repair_flush() will
> send the message and check for the reply.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
>  Makefile     |  12 ++--
>  conf.c       |  46 ++++++++++--
>  epoll_type.h |   4 ++
>  passt.1      |  11 +++
>  passt.c      |   9 +++
>  passt.h      |   7 ++
>  repair.c     | 192 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  repair.h     |  16 +++++
>  tap.c        |  65 +----------------
>  util.c       |  62 +++++++++++++++++
>  util.h       |   1 +
>  11 files changed, 353 insertions(+), 72 deletions(-)
>  create mode 100644 repair.c
>  create mode 100644 repair.h
> 
> diff --git a/Makefile b/Makefile
> index 1b71cb0..f67a20b 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -38,9 +38,9 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>  
>  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
>  	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
> -	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
> -	tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> -	vhost_user.c virtio.c vu_common.c
> +	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c \
> +	repair.c tap.c tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c \
> +	udp_vu.c util.c vhost_user.c virtio.c vu_common.c
>  QRAP_SRCS = qrap.c
>  PASST_REPAIR_SRCS = passt-repair.c
>  SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
> @@ -50,9 +50,9 @@ MANPAGES = passt.1 pasta.1 qrap.1
>  PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
>  	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
>  	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
> -	pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
> -	tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
> -	vhost_user.h virtio.h vu_common.h
> +	pcap.h pif.h repair.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h \
> +	tcp_internal.h tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h \
> +	udp_vu.h util.h vhost_user.h virtio.h vu_common.h
>  HEADERS = $(PASST_HEADERS) seccomp.h
>  
>  C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
> diff --git a/conf.c b/conf.c
> index df2b016..85dec44 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -816,6 +816,9 @@ static void usage(const char *name, FILE *f, int status)
>  			"    UNIX domain socket is provided by -s option\n"
>  			"  --print-capabilities	print back-end capabilities in JSON format,\n"
>  			"    only meaningful for vhost-user mode\n");
> +		FPRINTF(f,
> +			"  --repair-path PATH	path for passt-repair(1)\n"

Nit: as a privileged helper, should it be passt-repair(8)?

> +			"    default: append '.repair' to UNIX domain path\n");
>  	}
>  
>  	FPRINTF(f,
> @@ -1240,8 +1243,30 @@ static void conf_nat(const char *arg, struct in_addr *addr4,
>   */
>  static void conf_open_files(struct ctx *c)
>  {
> -	if (c->mode != MODE_PASTA && c->fd_tap == -1)
> -		c->fd_tap_listen = tap_sock_unix_open(c->sock_path);
> +	if (c->mode != MODE_PASTA && c->fd_tap == -1) {
> +		c->fd_tap_listen = sock_unix(c->sock_path);
> +
> +		if (c->mode == MODE_VU && strcmp(c->repair_path, "none")) {
> +			if (!strncmp(c->repair_path, "./", 2)) {
> +				memmove(c->repair_path, c->repair_path + 2,
> +					sizeof(c->repair_path) - 2);
> +			}

Do you need this? Shouldn't "./whatever" be usable as-is?

> +
> +			if (!*c->repair_path &&
> +			    snprintf_check(c->repair_path,
> +					   sizeof(c->repair_path), "%s.repair",
> +					   c->sock_path)) {
> +				warn("passt-repair path %s not usable",
> +				     c->repair_path);

I'd prefer a die() here - I think omitting a possibly expected
feature, with just a warning that could easily be lost in the logs is
not a good idea.

> +				c->fd_repair_listen = -1;
> +			} else {
> +				c->fd_repair_listen = sock_unix(c->repair_path);
> +			}
> +		} else {
> +			c->fd_repair_listen = -1;
> +		}
> +		c->fd_repair = -1;
> +	}
>  
>  	if (*c->pidfile) {
>  		c->pidfile_fd = output_file_open(c->pidfile, O_WRONLY);
> @@ -1354,9 +1379,12 @@ void conf(struct ctx *c, int argc, char **argv)
>  		{"host-lo-to-ns-lo", no_argument, 	NULL,		23 },
>  		{"dns-host",	required_argument,	NULL,		24 },
>  		{"vhost-user",	no_argument,		NULL,		25 },
> +
>  		/* vhost-user backend program convention */
>  		{"print-capabilities", no_argument,	NULL,		26 },
>  		{"socket-path",	required_argument,	NULL,		's' },
> +
> +		{"repair-path",	required_argument,	NULL,		27 },
>  		{ 0 },
>  	};
>  	const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
> @@ -1824,8 +1852,8 @@ void conf(struct ctx *c, int argc, char **argv)
>  	if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.guest_gw))
>  		c->no_dhcp = 1;
>  
> -	/* Inbound port options & DNS can be parsed now (after IPv4/IPv6
> -	 * settings)
> +	/* Inbound port options, DNS, and --repair-path can be parsed now, after
> +	 * IPv4/IPv6 settings and --vhost-user.
>  	 */
>  	fwd_probe_ephemeral();
>  	udp_portmap_clear();
> @@ -1871,6 +1899,16 @@ void conf(struct ctx *c, int argc, char **argv)
>  			}
>  
>  			die("Cannot use DNS address %s", optarg);
> +		} else if (name == 27) {
> +			if (c->mode != MODE_VU && strcmp(optarg, "none"))
> +				die("--repair-path is for vhost-user mode only");
> +
> +			if (snprintf_check(c->repair_path,
> +					   sizeof(c->repair_path), "%s",
> +					   optarg))
> +				die("Invalid passt-repair path: %s", optarg);
> +
> +			break;
>  		}
>  	} while (name != -1);
>  
> diff --git a/epoll_type.h b/epoll_type.h
> index fd9eac3..706238a 100644
> --- a/epoll_type.h
> +++ b/epoll_type.h
> @@ -42,6 +42,10 @@ enum epoll_type {
>  	EPOLL_TYPE_VHOST_KICK,
>  	/* vhost-user migration socket */
>  	EPOLL_TYPE_VHOST_MIGRATION,
> +	/* TCP_REPAIR helper listening socket */
> +	EPOLL_TYPE_REPAIR_LISTEN,
> +	/* TCP_REPAIR helper socket */
> +	EPOLL_TYPE_REPAIR,
>  
>  	EPOLL_NUM_TYPES,
>  };
> diff --git a/passt.1 b/passt.1
> index d9cd33e..63a3a01 100644
> --- a/passt.1
> +++ b/passt.1
> @@ -418,6 +418,17 @@ Enable vhost-user. The vhost-user command socket is provided by \fB--socket\fR.
>  .BR \-\-print-capabilities
>  Print back-end capabilities in JSON format, only meaningful for vhost-user mode.
>  
> +.TP
> +.BR \-\-repair-path " " \fIpath
> +Path for UNIX domain socket used by the \fBpasst-repair\fR(1) helper to connect

passt-repair(8)?

> +to \fBpasst\fR in order to set or clear the TCP_REPAIR option on sockets, during
> +migration. \fB--repair-path none\fR disables this interface (if you need to
> +specify a socket path called "none" you can prefix the path by \fI./\fR).
> +
> +Default, for \-\-vhost-user mode only, is to append \fI.repair\fR to the path
> +chosen for the hypervisor UNIX domain socket. No socket is created if not in
> +\-\-vhost-user mode.
> +
>  .TP
>  .BR \-F ", " \-\-fd " " \fIFD
>  Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened
> diff --git a/passt.c b/passt.c
> index 184d4e5..1fa2ddd 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -51,6 +51,7 @@
>  #include "tcp_splice.h"
>  #include "ndp.h"
>  #include "vu_common.h"
> +#include "repair.h"
>  
>  #define EPOLL_EVENTS		8
>  
> @@ -76,6 +77,8 @@ char *epoll_type_str[] = {
>  	[EPOLL_TYPE_VHOST_CMD]		= "vhost-user command socket",
>  	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
>  	[EPOLL_TYPE_VHOST_MIGRATION]	= "vhost-user migration socket",
> +	[EPOLL_TYPE_REPAIR_LISTEN]	= "TCP_REPAIR helper listening socket",
> +	[EPOLL_TYPE_REPAIR]		= "TCP_REPAIR helper socket",
>  };
>  static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
>  	      "epoll_type_str[] doesn't match enum epoll_type");
> @@ -360,6 +363,12 @@ loop:
>  		case EPOLL_TYPE_VHOST_MIGRATION:
>  			vu_migrate(&c, eventmask);
>  			break;
> +		case EPOLL_TYPE_REPAIR_LISTEN:
> +			repair_listen_handler(&c, eventmask);
> +			break;
> +		case EPOLL_TYPE_REPAIR:
> +			repair_handler(&c, eventmask);
> +			break;
>  		default:
>  			/* Can't happen */
>  			ASSERT(0);
> diff --git a/passt.h b/passt.h
> index 0dd4efa..85b0a10 100644
> --- a/passt.h
> +++ b/passt.h
> @@ -20,6 +20,7 @@ union epoll_ref;
>  #include "siphash.h"
>  #include "ip.h"
>  #include "inany.h"
> +#include "migrate.h"
>  #include "flow.h"
>  #include "icmp.h"
>  #include "fwd.h"
> @@ -193,6 +194,7 @@ struct ip6_ctx {
>   * @foreground:		Run in foreground, don't log to stderr by default
>   * @nofile:		Maximum number of open files (ulimit -n)
>   * @sock_path:		Path for UNIX domain socket
> + * @repair_path:	TCP_REPAIR helper path, can be "none", empty for default
>   * @pcap:		Path for packet capture file
>   * @pidfile:		Path to PID file, empty string if not configured
>   * @pidfile_fd:		File descriptor for PID file, -1 if none
> @@ -203,6 +205,8 @@ struct ip6_ctx {
>   * @epollfd:		File descriptor for epoll instance
>   * @fd_tap_listen:	File descriptor for listening AF_UNIX socket, if any
>   * @fd_tap:		AF_UNIX socket, tuntap device, or pre-opened socket
> + * @fd_repair_listen:	File descriptor for listening TCP_REPAIR socket, if any
> + * @fd_repair:		Connected AF_UNIX socket for TCP_REPAIR helper
>   * @our_tap_mac:	Pasta/passt's MAC on the tap link
>   * @guest_mac:		MAC address of guest or namespace, seen or configured
>   * @hash_secret:	128-bit secret for siphash functions
> @@ -244,6 +248,7 @@ struct ctx {
>  	int foreground;
>  	int nofile;
>  	char sock_path[UNIX_PATH_MAX];
> +	char repair_path[UNIX_PATH_MAX];
>  	char pcap[PATH_MAX];
>  
>  	char pidfile[PATH_MAX];
> @@ -260,6 +265,8 @@ struct ctx {
>  	int epollfd;
>  	int fd_tap_listen;
>  	int fd_tap;
> +	int fd_repair_listen;
> +	int fd_repair;
>  	unsigned char our_tap_mac[ETH_ALEN];
>  	unsigned char guest_mac[ETH_ALEN];
>  	uint64_t hash_secret[2];
> diff --git a/repair.c b/repair.c
> new file mode 100644
> index 0000000..24966f5
> --- /dev/null
> +++ b/repair.c
> @@ -0,0 +1,192 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +/* PASST - Plug A Simple Socket Transport
> + *  for qemu/UNIX domain socket mode
> + *
> + * PASTA - Pack A Subtle Tap Abstraction
> + *  for network namespace/tap device mode
> + *
> + * repair.c - Interface (server) for passt-repair, set/clear TCP_REPAIR
> + *
> + * Copyright (c) 2025 Red Hat GmbH
> + * Author: Stefano Brivio <sbrivio@redhat.com>
> + */
> +
> +#include <errno.h>
> +#include <sys/uio.h>
> +
> +#include "util.h"
> +#include "ip.h"
> +#include "passt.h"
> +#include "inany.h"
> +#include "flow.h"
> +#include "flow_table.h"
> +
> +#include "repair.h"
> +
> +#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
> +
> +static int fds[SCM_MAX_FD];

Even 'static', I'd prefer a longer name for a global variable.

> +static int current_cmd;

Is there any particular rationale behind these being globals, whereas
fd_repair is in struct ctx?  AFAICT they're basically equally global
in practice.

> +static int nfds;
> +
> +/**
> + * repair_sock_init() - Start listening for connections on helper socket
> + * @c:		Execution context
> + */
> +void repair_sock_init(const struct ctx *c)
> +{
> +	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR_LISTEN };
> +	struct epoll_event ev = { 0 };
> +
> +	listen(c->fd_repair_listen, 0);
> +
> +	ref.fd = c->fd_repair_listen;
> +	ev.events = EPOLLIN | EPOLLHUP | EPOLLET;
> +	ev.data.u64 = ref.u64;
> +	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair_listen, &ev);
> +}
> +
> +/**
> + * repair_listen_handler() - Handle events on TCP_REPAIR helper listening socket
> + * @c:		Execution context
> + * @events:	epoll events
> + */
> +void repair_listen_handler(struct ctx *c, uint32_t events)
> +{
> +	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR };
> +	struct epoll_event ev = { 0 };
> +	struct ucred ucred;
> +	socklen_t len;
> +
> +	if (events != EPOLLIN) {
> +		debug("Spurious event 0x%04x on TCP_REPAIR helper socket",
> +		      events);
> +		return;
> +	}
> +
> +	len = sizeof(ucred);
> +
> +	/* Another client is already connected: accept and close right away. */

For the repair socket, last-connection-wins would make more sense to
me than first-connection-wins.  While hacking/debugging seems it might
be useful to fix something in the passt-repair, re-run it and have it
displace the stale version for existing passt instances.

> +	if (c->fd_repair != -1) {
> +		int discard = accept4(c->fd_repair_listen, NULL, NULL,
> +				      SOCK_NONBLOCK);
> +
> +		if (discard == -1)
> +			return;
> +
> +		if (!getsockopt(discard, SOL_SOCKET, SO_PEERCRED, &ucred, &len))
> +			info("Discarding TCP_REPAIR helper, PID %i", ucred.pid);
> +
> +		close(discard);
> +		return;
> +	}
> +
> +	c->fd_repair = accept4(c->fd_repair_listen, NULL, NULL, 0);
> +
> +	if (!getsockopt(c->fd_repair, SOL_SOCKET, SO_PEERCRED, &ucred, &len))
> +		info("Accepted TCP_REPAIR helper, PID %i", ucred.pid);
> +
> +	ref.fd = c->fd_repair;
> +	ev.events = EPOLLHUP | EPOLLET;
> +	ev.data.u64 = ref.u64;
> +	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair, &ev);
> +}
> +
> +/**
> + * repair_close() - Close connection to TCP_REPAIR helper
> + * @c:		Execution context
> + */
> +void repair_close(struct ctx *c)
> +{
> +	debug("Closing TCP_REPAIR helper socket");
> +
> +	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_repair, NULL);
> +	close(c->fd_repair);
> +	c->fd_repair = -1;
> +}
> +
> +/**
> + * repair_handler() - Handle EPOLLHUP and EPOLLERR on TCP_REPAIR helper socket
> + * @c:		Execution context
> + * @events:	epoll events
> + */
> +void repair_handler(struct ctx *c, uint32_t events)
> +{
> +	(void)events;
> +
> +	repair_close(c);
> +}
> +
> +/**
> + * repair_flush() - Flush current set of sockets to helper, with current command
> + * @c:		Execution context
> + *
> + * Return: 0 on success, negative error code on failure
> + */
> +int repair_flush(struct ctx *c)
> +{
> +	struct iovec iov = { &((int8_t){ current_cmd }), sizeof(int8_t) };
> +	char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
> +	     __attribute__ ((aligned(__alignof__(struct cmsghdr))));
> +	struct cmsghdr *cmsg;
> +	struct msghdr msg;
> +	int ret = 0;
> +
> +	if (!nfds)
> +		return 0;
> +
> +	msg = (struct msghdr){ NULL, 0, &iov, 1,
> +			       buf, CMSG_SPACE(sizeof(int) * nfds), 0 };
> +	cmsg = CMSG_FIRSTHDR(&msg);
> +
> +	cmsg->cmsg_level = SOL_SOCKET;
> +	cmsg->cmsg_type = SCM_RIGHTS;
> +	cmsg->cmsg_len = CMSG_LEN(sizeof(int) * nfds);
> +	memcpy(CMSG_DATA(cmsg), fds, sizeof(int) * nfds);
> +
> +	nfds = 0;
> +
> +	if (sendmsg(c->fd_repair, &msg, 0) < 0) {
> +		ret = -errno;

This error code won't be reported to the caller: you'll continue on to
the recv() below, which will return EBADF, clobbering ret.

> +		err_perror("Failed to send sockets to TCP_REPAIR helper");
> +		repair_close(c);
> +	}
> +
> +	if (recv(c->fd_repair, &((int8_t){ 0 }), 1, 0) < 0) {
> +		ret = -errno;
> +		err_perror("Failed to receive reply from TCP_REPAIR helper");
> +		repair_close(c);
> +	}
> +
> +	return ret;
> +}
> +
> +/**
> + * repair_flush() - Add socket to TCP_REPAIR set with given command
> + * @c:		Execution context
> + * @s:		Socket to add
> + * @cmd:	TCP_REPAIR_ON, TCP_REPAIR_OFF, or TCP_REPAIR_OFF_NO_WP
> + *
> + * Return: 0 on success, negative error code on failure
> + */
> +/* cppcheck-suppress unusedFunction */
> +int repair_set(struct ctx *c, int s, int cmd)
> +{
> +	int rc;
> +
> +	if (nfds && current_cmd != cmd) {
> +		if ((rc = repair_flush(c)))
> +			return rc;
> +	}
> +
> +	current_cmd = cmd;
> +	fds[nfds++] = s;
> +
> +	if (nfds >= SCM_MAX_FD) {
> +		if ((rc = repair_flush(c)))
> +			return rc;
> +	}
> +
> +	return 0;
> +}
> diff --git a/repair.h b/repair.h
> new file mode 100644
> index 0000000..693c515
> --- /dev/null
> +++ b/repair.h
> @@ -0,0 +1,16 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later
> + * Copyright (c) 2025 Red Hat GmbH
> + * Author: Stefano Brivio <sbrivio@redhat.com>
> + */
> + 
> +#ifndef REPAIR_H
> +#define REPAIR_H
> +
> +void repair_sock_init(const struct ctx *c);
> +void repair_listen_handler(struct ctx *c, uint32_t events);
> +void repair_handler(struct ctx *c, uint32_t events);
> +void repair_close(struct ctx *c);
> +int repair_flush(struct ctx *c);
> +int repair_set(struct ctx *c, int s, int cmd);
> +
> +#endif /* REPAIR_H */
> diff --git a/tap.c b/tap.c
> index cd32a90..0e60eb4 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -56,6 +56,7 @@
>  #include "netlink.h"
>  #include "pasta.h"
>  #include "packet.h"
> +#include "repair.h"
>  #include "tap.h"
>  #include "log.h"
>  #include "vhost_user.h"
> @@ -1151,68 +1152,6 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
>  		tap_pasta_input(c, now);
>  }
>  
> -/**
> - * tap_sock_unix_open() - Create and bind AF_UNIX socket
> - * @sock_path:	Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
> - *
> - * Return: socket descriptor on success, won't return on failure
> - */
> -int tap_sock_unix_open(char *sock_path)
> -{
> -	int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
> -	struct sockaddr_un addr = {
> -		.sun_family = AF_UNIX,
> -	};
> -	int i;
> -
> -	if (fd < 0)
> -		die_perror("Failed to open UNIX domain socket");
> -
> -	for (i = 1; i < UNIX_SOCK_MAX; i++) {
> -		char *path = addr.sun_path;
> -		int ex, ret;
> -
> -		if (*sock_path)
> -			memcpy(path, sock_path, UNIX_PATH_MAX);
> -		else if (snprintf_check(path, UNIX_PATH_MAX - 1,
> -					UNIX_SOCK_PATH, i))
> -			die_perror("Can't build UNIX domain socket path");
> -
> -		ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
> -			    0);
> -		if (ex < 0)
> -			die_perror("Failed to check for UNIX domain conflicts");
> -
> -		ret = connect(ex, (const struct sockaddr *)&addr, sizeof(addr));
> -		if (!ret || (errno != ENOENT && errno != ECONNREFUSED &&
> -			     errno != EACCES)) {
> -			if (*sock_path)
> -				die("Socket path %s already in use", path);
> -
> -			close(ex);
> -			continue;
> -		}
> -		close(ex);
> -
> -		unlink(path);
> -		ret = bind(fd, (const struct sockaddr *)&addr, sizeof(addr));
> -		if (*sock_path && ret)
> -			die_perror("Failed to bind UNIX domain socket");
> -
> -		if (!ret)
> -			break;
> -	}
> -
> -	if (i == UNIX_SOCK_MAX)
> -		die_perror("Failed to bind UNIX domain socket");
> -
> -	info("UNIX domain socket bound at %s", addr.sun_path);
> -	if (!*sock_path)
> -		memcpy(sock_path, addr.sun_path, UNIX_PATH_MAX);
> -
> -	return fd;
> -}
> -
>  /**
>   * tap_backend_show_hints() - Give help information to start QEMU
>   * @c:		Execution context
> @@ -1423,6 +1362,8 @@ void tap_backend_init(struct ctx *c)
>  		tap_sock_tun_init(c);
>  		break;
>  	case MODE_VU:
> +		repair_sock_init(c);
> +		/* fall through */
>  	case MODE_PASST:
>  		tap_sock_unix_init(c);
>  
> diff --git a/util.c b/util.c
> index 36857d4..e98da74 100644
> --- a/util.c
> +++ b/util.c
> @@ -178,6 +178,68 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
>  	return fd;
>  }
>  
> +/**
> + * sock_unix() - Create and bind AF_UNIX socket
> + * @sock_path:	Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
> + *
> + * Return: socket descriptor on success, won't return on failure
> + */

I like making tap_sock_unix_open() more general.  Could be split into
a separate patch.

> +int sock_unix(char *sock_path)
> +{
> +	int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
> +	struct sockaddr_un addr = {
> +		.sun_family = AF_UNIX,
> +	};
> +	int i;
> +
> +	if (fd < 0)
> +		die_perror("Failed to open UNIX domain socket");
> +
> +	for (i = 1; i < UNIX_SOCK_MAX; i++) {
> +		char *path = addr.sun_path;
> +		int ex, ret;
> +
> +		if (*sock_path)
> +			memcpy(path, sock_path, UNIX_PATH_MAX);
> +		else if (snprintf_check(path, UNIX_PATH_MAX - 1,
> +					UNIX_SOCK_PATH, i))
> +			die_perror("Can't build UNIX domain socket path");
> +
> +		ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
> +			    0);
> +		if (ex < 0)
> +			die_perror("Failed to check for UNIX domain conflicts");
> +
> +		ret = connect(ex, (const struct sockaddr *)&addr, sizeof(addr));
> +		if (!ret || (errno != ENOENT && errno != ECONNREFUSED &&
> +			     errno != EACCES)) {
> +			if (*sock_path)
> +				die("Socket path %s already in use", path);
> +
> +			close(ex);
> +			continue;
> +		}
> +		close(ex);
> +
> +		unlink(path);
> +		ret = bind(fd, (const struct sockaddr *)&addr, sizeof(addr));
> +		if (*sock_path && ret)
> +			die_perror("Failed to bind UNIX domain socket");
> +
> +		if (!ret)
> +			break;
> +	}
> +
> +	if (i == UNIX_SOCK_MAX)
> +		die_perror("Failed to bind UNIX domain socket");
> +
> +	info("UNIX domain socket bound at %s", addr.sun_path);
> +	if (!*sock_path)
> +		memcpy(sock_path, addr.sun_path, UNIX_PATH_MAX);
> +
> +	return fd;
> +}
> +
>  /**
>   * sock_probe_mem() - Check if setting high SO_SNDBUF and SO_RCVBUF is allowed
>   * @c:		Execution context
> diff --git a/util.h b/util.h
> index 73a7a33..d2fd2fe 100644
> --- a/util.h
> +++ b/util.h
> @@ -185,6 +185,7 @@ struct ctx;
>  int sock_l4_sa(const struct ctx *c, enum epoll_type type,
>  	       const void *sa, socklen_t sl,
>  	       const char *ifname, bool v6only, uint32_t data);
> +int sock_unix(char *sock_path);
>  void sock_probe_mem(struct ctx *c);
>  long timespec_diff_ms(const struct timespec *a, const struct timespec *b);
>  int64_t timespec_diff_us(const struct timespec *a, const struct timespec *b);

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 8/8] flow, tcp: Basic pre-migration source handler to dump sequence numbers
  2025-01-28 23:39 ` [PATCH v2 8/8] flow, tcp: Basic pre-migration source handler to dump sequence numbers Stefano Brivio
@ 2025-01-29  6:15   ` David Gibson
  2025-01-29  8:46     ` Stefano Brivio
  0 siblings, 1 reply; 20+ messages in thread
From: David Gibson @ 2025-01-29  6:15 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier

[-- Attachment #1: Type: text/plain, Size: 6486 bytes --]

On Wed, Jan 29, 2025 at 12:39:40AM +0100, Stefano Brivio wrote:
> Very much draft quality, but it works. Ask passt-repair to switch
> TCP sockets to repair mode and dump their current sequence numbers to
> the flow table, which will be transferred and used by the target in
> the next step.
> 
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
>  flow.c     | 43 +++++++++++++++++++++++++++++++++++++++++
>  flow.h     |  1 +
>  migrate.c  |  1 +
>  tcp.c      | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tcp_conn.h |  5 +++++
>  5 files changed, 106 insertions(+)
> 
> diff --git a/flow.c b/flow.c
> index ee1221b..e7148b2 100644
> --- a/flow.c
> +++ b/flow.c
> @@ -19,6 +19,7 @@
>  #include "inany.h"
>  #include "flow.h"
>  #include "flow_table.h"
> +#include "repair.h"
>  
>  const char *flow_state_str[] = {
>  	[FLOW_STATE_FREE]	= "FREE",
> @@ -874,6 +875,48 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
>  	*last_next = FLOW_MAX;
>  }
>  
> +/**
> + * flow_migrate_source_pre() - Prepare all source flows for migration
> + * @c:		Execution context
> + * @m:		Migration metadata
> + *
> + * Return: 0 on success
> + */
> +int flow_migrate_source_pre(struct ctx *c, struct migrate_meta *m)
> +{
> +	unsigned i;
> +	int rc;
> +
> +	(void)m;
> +
> +	for (i = 0; i < FLOW_MAX; i++) {	/* TODO: iterator with skip */
> +		union flow *flow = &flowtab[i];
> +
> +		if (flow->f.state == FLOW_STATE_FREE)
> +			i += flow->free.n - 1;
> +		else if (flow->f.state == FLOW_STATE_ACTIVE &&

We should probably just abort any flows that are in pre-ACTIVE state
at migration time.  Wait... IIRC flows have to be in ACTIVE state (or
already cancelled) once we get to the next epoll cycle.  So we can
possibly just assert that state is either ACTIVE or FREE.

> +			 flow->f.type == FLOW_TCP)
> +			rc = tcp_flow_repair_on(c, &flow->tcp);
> +
> +		if (rc)
> +			return rc;		/* TODO: rollback */
> +	}
> +
> +	repair_flush(c);			/* TODO: move to TCP logic */
> +
> +	for (i = 0; i < FLOW_MAX; i++) {	/* TODO: iterator with skip */
> +		union flow *flow = &flowtab[i];
> +
> +		if (flow->f.state == FLOW_STATE_FREE)
> +			i += flow->free.n - 1;
> +		else if (flow->f.state == FLOW_STATE_ACTIVE &&
> +			 flow->f.type == FLOW_TCP)
> +			tcp_flow_dump_seq(c, &flow->tcp);
> +	}
> +
> +	return 0;
> +}
> +
>  /**
>   * flow_init() - Initialise flow related data structures
>   */
> diff --git a/flow.h b/flow.h
> index 8eb5964..ff390a6 100644
> --- a/flow.h
> +++ b/flow.h
> @@ -255,6 +255,7 @@ union flow;
>  
>  void flow_init(void);
>  void flow_defer_handler(const struct ctx *c, const struct timespec *now);
> +int flow_migrate_source_pre(struct ctx *c, struct migrate_meta *m);
>  
>  void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
>  	__attribute__((format(printf, 3, 4)));
> diff --git a/migrate.c b/migrate.c
> index b8b79e0..6707c02 100644
> --- a/migrate.c
> +++ b/migrate.c
> @@ -56,6 +56,7 @@ static struct migrate_data data_versions[] = {
>  
>  /* Handlers to call in source before sending data */
>  struct migrate_handler handlers_source_pre[] = {
> +	{ flow_migrate_source_pre },
>  	{ 0 },
>  };
>  
> diff --git a/tcp.c b/tcp.c
> index c89f323..3a3038b 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -299,6 +299,7 @@
>  #include "log.h"
>  #include "inany.h"
>  #include "flow.h"
> +#include "repair.h"
>  #include "linux_dep.h"
>  
>  #include "flow_table.h"
> @@ -868,6 +869,61 @@ void tcp_defer_handler(struct ctx *c)
>  	tcp_payload_flush(c);
>  }
>  
> +/**
> + * tcp_flow_repair_on() - Enable repair mode for a single TCP flow
> + * @c:		Execution context
> + * @conn:	Pointer to the TCP connection structure
> + *
> + * Return: 0 on success, negative error code on failure
> + */
> +int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn)
> +{
> +	int rc = 0;
> +
> +	if ((rc = repair_set(c, conn->sock, TCP_REPAIR_ON)))
> +		err("Failed to set TCP_REPAIR for socket %i", conn->sock);

Well.. except that the error could just as easily have been on a
previous socket that wasn't flushed yet.

> +
> +	return rc;
> +}
> +
> +/**
> + * tcp_flow_dump_seq() - Dump sequences for send and receive queues
> + * @c:		Execution context
> + * @conn:	Pointer to the TCP connection structure
> + *
> + * Return: 0 on success, negative error code on failure
> + */
> +int tcp_flow_dump_seq(struct ctx *c, struct tcp_tap_conn *conn)
> +{
> +	int v, s = conn->sock;
> +	socklen_t vlen;
> +
> +	(void)c;
> +
> +	vlen = sizeof(v);
> +
> +	v = TCP_SEND_QUEUE;
> +	/* TODO: proper error management and prints */
> +	if (setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &v, vlen))
> +		return -errno;
> +
> +	if (getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &conn->sock_seq_snd, &vlen))
> +		return -errno;
> +
> +	debug("Send queue sequence %u for socket %i", conn->sock_seq_snd, s);
> +
> +	v = TCP_RECV_QUEUE;
> +	if (setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &v, vlen))
> +		return -errno;
> +
> +	if (getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &conn->sock_seq_rcv, &vlen))
> +		return -errno;
> +
> +	debug("Receive queue sequence %u for socket %i", conn->sock_seq_rcv, s);
> +
> +	return 0;
> +}
> +
>  /**
>   * tcp_fill_header() - Fill the TCP header fields for a given TCP segment.
>   *
> diff --git a/tcp_conn.h b/tcp_conn.h
> index d342680..0c3e197 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -94,6 +94,9 @@ struct tcp_tap_conn {
>  	uint32_t	seq_from_tap;
>  	uint32_t	seq_ack_to_tap;
>  	uint32_t	seq_init_from_tap;
> +
> +	uint32_t	sock_seq_snd;
> +	uint32_t	sock_seq_rcv;
>  };
>  
>  /**
> @@ -140,6 +143,8 @@ extern int init_sock_pool4	[TCP_SOCK_POOL_SIZE];
>  extern int init_sock_pool6	[TCP_SOCK_POOL_SIZE];
>  
>  bool tcp_flow_defer(const struct tcp_tap_conn *conn);
> +int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn);
> +int tcp_flow_dump_seq(struct ctx *c, struct tcp_tap_conn *conn);
>  bool tcp_splice_flow_defer(struct tcp_splice_conn *conn);
>  void tcp_splice_timer(const struct ctx *c, struct tcp_splice_conn *conn);
>  int tcp_conn_pool_sock(int pool[]);

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 5/8] Introduce facilities for guest migration on top of vhost-user infrastructure
  2025-01-29  5:41   ` David Gibson
@ 2025-01-29  8:46     ` Stefano Brivio
  2025-01-30  1:28       ` David Gibson
  0 siblings, 1 reply; 20+ messages in thread
From: Stefano Brivio @ 2025-01-29  8:46 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev, Laurent Vivier

On Wed, 29 Jan 2025 16:41:01 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Jan 29, 2025 at 12:39:37AM +0100, Stefano Brivio wrote:
> > Add two sets (source or target) of three functions each for passt in
> > vhost-user mode, triggered by activity on the file descriptor passed
> > via VHOST_USER_PROTOCOL_F_DEVICE_STATE:
> > 
> > - migrate_source_pre() and migrate_target_pre() are called to prepare
> >   for migration, before data is transferred
> > 
> > - migrate_source() sends, and migrate_target() receives migration data
> > 
> > - migrate_source_post() and migrate_target_post() are responsible for
> >   any post-migration task
> > 
> > Callbacks are added to these functions with arrays of function
> > pointers in migrate.c. Migration handlers are versioned.
> > 
> > Versioned descriptions of data sections will be added to the
> > data_versions array, which points to versioned iovec arrays. Version
> > 1 is currently empty and will be filled in in subsequent patches.
> > 
> > The source announces the data version to be used and informs the peer
> > about endianness, and the size of void *, time_t, flow entries and
> > flow hash table entries.
> > 
> > The target checks if the version of the source is still supported. If
> > it's not, it aborts the migration.
> > 
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>  
> 
> Tomorrow, I'm planning to try implementing a bunch of the suggestions
> I have below.
> 
> > ---
> >  Makefile    |  12 +--
> >  migrate.c   | 264 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  migrate.h   |  88 ++++++++++++++++++
> >  passt.c     |   2 +-
> >  vu_common.c | 124 ++++++++++++++++--------
> >  vu_common.h |   2 +-
> >  6 files changed, 443 insertions(+), 49 deletions(-)
> >  create mode 100644 migrate.c
> >  create mode 100644 migrate.h
> > 
> > diff --git a/Makefile b/Makefile
> > index 464eef1..1383875 100644
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
> >  
> >  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> >  	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
> > -	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
> > -	tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > +	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
> > +	tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> >  	vhost_user.c virtio.c vu_common.c
> >  QRAP_SRCS = qrap.c
> >  SRCS = $(PASST_SRCS) $(QRAP_SRCS)
> > @@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1
> >  
> >  PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
> >  	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
> > -	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
> > -	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
> > -	tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
> > -	virtio.h vu_common.h
> > +	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
> > +	pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
> > +	tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
> > +	vhost_user.h virtio.h vu_common.h
> >  HEADERS = $(PASST_HEADERS) seccomp.h
> >  
> >  C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
> > diff --git a/migrate.c b/migrate.c
> > new file mode 100644
> > index 0000000..b8b79e0
> > --- /dev/null
> > +++ b/migrate.c
> > @@ -0,0 +1,264 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +
> > +/* PASST - Plug A Simple Socket Transport
> > + *  for qemu/UNIX domain socket mode
> > + *
> > + * PASTA - Pack A Subtle Tap Abstraction
> > + *  for network namespace/tap device mode
> > + *
> > + * migrate.c - Migration sections, layout, and routines
> > + *
> > + * Copyright (c) 2025 Red Hat GmbH
> > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > + */
> > +
> > +#include <errno.h>
> > +#include <sys/uio.h>
> > +
> > +#include "util.h"
> > +#include "ip.h"
> > +#include "passt.h"
> > +#include "inany.h"
> > +#include "flow.h"
> > +#include "flow_table.h"
> > +
> > +#include "migrate.h"
> > +
> > +/* Current version of migration data */
> > +#define MIGRATE_VERSION		1
> > +
> > +/* Magic as we see it and as seen with reverse endianness */
> > +#define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
> > +#define MIGRATE_MAGIC_SWAPPED	0xB0D1B1B01B1DBBB1
> > +
> > +/* Migration header to send from source */
> > +static union migrate_header header = {
> > +	.magic		= MIGRATE_MAGIC,
> > +	.version	= htonl_constant(MIGRATE_VERSION),
> > +	.time_t_size	= htonl_constant(sizeof(time_t)),
> > +	.flow_size	= htonl_constant(sizeof(union flow)),
> > +	.flow_sidx_size	= htonl_constant(sizeof(struct flow_sidx)),
> > +	.voidp_size	= htonl_constant(sizeof(void *)),
> > +};
> > +
> > +/* Data sections for version 1 */
> > +static struct iovec sections_v1[] = {
> > +	{ &header,	sizeof(header) },  
> 
> This format assumes that everything we send is in buffers with
> statically known size and address.  I'm pretty sure we'll outgrow that
> quickly.  For one thing, I realised we want to transfer addr*_seen, or
> we won't know how to direct an incoming connection after migration, if
> that occurs before the guest sends anything outgoing.

Yes, I'm transferring those too (local changes not completely
working yet) as separate sections. The layout is just fine for those.

What we don't cover with this layout is arrays, and I guess that should
be added, because if we want to transfer a minimal representation of
single flows instead of the whole flow table, we need something like
that.

I think it could be something on top of iovecs, that is, a simple
structure with iovec pointers and a count, or a pointer to a count.

> > +};
> > +
> > +/* Set of data versions */
> > +static struct migrate_data data_versions[] = {
> > +	{
> > +		1,	sections_v1,
> > +	},  
> 
> I realise it's a bit less "declarative", but maybe just a "receive"
> function for each version which open codes sending each section would
> be more flexible, and not that much more verbose.

It's 18 lines of migrate_source() and 17 lines of migrate_target()
right now, so it's not really *that* short, and if it can be
declarative (I'm finding this quite convenient as I'm *using* it), I
think it should be. What else would you like to add?

Besides, it can always be changed later as a use case appears.

> > +	{ 0 },
> > +};
> > +
> > +/* Handlers to call in source before sending data */
> > +struct migrate_handler handlers_source_pre[] = {
> > +	{ 0 },
> > +};
> > +
> > +/* Handlers to call in source after sending data */
> > +struct migrate_handler handlers_source_post[] = {
> > +	{ 0 },
> > +};  
> 
> I also wonder if these tables of handlers are overkill, rather than
> just having a function that calls all the things in the right order.

That risks turning into a long series of:

	ret = handler(c);
	if (ret)
		return ret;

...

but anyway, let me finish writing the code using this first, then we'll
know, instead of speculating.

> > +
> > +/* Handlers to call in target before receiving data with version 1 */
> > +struct migrate_handler handlers_target_pre_v1[] = {
> > +	{ 0 },
> > +};
> > +
> > +/* Handlers to call in target after receiving data with version 1 */
> > +struct migrate_handler handlers_target_post_v1[] = {
> > +	{ 0 },
> > +};
> > +
> > +/* Versioned sets of migration handlers */
> > +struct migrate_target_handlers target_handlers[] = {
> > +	{
> > +		1,
> > +		handlers_target_pre_v1,
> > +		handlers_target_post_v1,
> > +	},
> > +	{ 0 },
> > +};
> > +
> > +/**
> > + * migrate_source_pre() - Pre-migration tasks as source
> > + * @c:		Execution context
> > + * @m:		Migration metadata
> > + *
> > + * Return: 0 on success, error code on failure
> > + */
> > +int migrate_source_pre(struct ctx *c, struct migrate_meta *m)
> > +{
> > +	struct migrate_handler *h;
> > +
> > +	for (h = handlers_source_pre; h->fn; h++) {
> > +		int rc;
> > +
> > +		if ((rc = h->fn(c, m)))
> > +			return rc;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * migrate_source() - Perform migration as source: send state to hypervisor
> > + * @fd:		Descriptor for state transfer
> > + * @m:		Migration metadata
> > + *
> > + * Return: 0 on success, error code on failure
> > + */
> > +int migrate_source(int fd, const struct migrate_meta *m)
> > +{
> > +	static struct migrate_data *d;
> > +	int count, rc;
> > +
> > +	(void)m;
> > +
> > +	for (d = data_versions; d->v != MIGRATE_VERSION; d++);
> > +
> > +	for (count = 0; d->sections[count].iov_len; count++);
> > +
> > +	debug("Writing %u migration sections", count - 1 /* minus header */);
> > +	rc = write_remainder(fd, d->sections, count, 0);
> > +	if (rc < 0)
> > +		return errno;
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * migrate_source_post() - Post-migration tasks as source
> > + * @c:		Execution context
> > + * @m:		Migration metadata
> > + *
> > + * Return: 0 on success, error code on failure
> > + */
> > +void migrate_source_post(struct ctx *c, struct migrate_meta *m)
> > +{
> > +	struct migrate_handler *h;
> > +
> > +	for (h = handlers_source_post; h->fn; h++)
> > +		h->fn(c, m);
> > +}
> > +
> > +/**
> > + * migrate_target_read_header() - Set metadata in target from source header
> > + * @fd:		Descriptor for state transfer
> > + * @m:		Migration metadata, filled on return
> > + *
> > + * Return: 0 on success, error code on failure
> > + */
> > +int migrate_target_read_header(int fd, struct migrate_meta *m)
> > +{
> > +	static struct migrate_data *d;
> > +	union migrate_header h;
> > +
> > +	if (read_all_buf(fd, &h, sizeof(h)))
> > +		return errno;
> > +
> > +	debug("Source magic: 0x%016" PRIx64 ", sizeof(void *): %u, version: %u",
> > +	      h.magic, ntohl(h.voidp_size), ntohl(h.version));
> > +
> > +	for (d = data_versions; d->v != ntohl(h.version) && d->v; d++);
> > +	if (!d->v)
> > +		return ENOTSUP;
> > +	m->v = d->v;
> > +
> > +	if (h.magic == MIGRATE_MAGIC)
> > +		m->bswap = false;
> > +	else if (h.magic == MIGRATE_MAGIC_SWAPPED)
> > +		m->bswap = true;
> > +	else
> > +		return ENOTSUP;
> > +
> > +	if (ntohl(h.voidp_size) == 4)
> > +		m->source_64b = false;
> > +	else if (ntohl(h.voidp_size) == 8)
> > +		m->source_64b = true;
> > +	else
> > +		return ENOTSUP;
> > +
> > +	if (ntohl(h.time_t_size) == 4)
> > +		m->time_64b = false;
> > +	else if (ntohl(h.time_t_size) == 8)
> > +		m->time_64b = true;
> > +	else
> > +		return ENOTSUP;
> > +
> > +	m->flow_size = ntohl(h.flow_size);
> > +	m->flow_sidx_size = ntohl(h.flow_sidx_size);
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * migrate_target_pre() - Pre-migration tasks as target
> > + * @c:		Execution context
> > + * @m:		Migration metadata
> > + *
> > + * Return: 0 on success, error code on failure
> > + */
> > +int migrate_target_pre(struct ctx *c, struct migrate_meta *m)
> > +{
> > +	struct migrate_target_handlers *th;
> > +	struct migrate_handler *h;
> > +
> > +	for (th = target_handlers; th->v != m->v && th->v; th++);
> > +
> > +	for (h = th->pre; h->fn; h++) {
> > +		int rc;
> > +
> > +		if ((rc = h->fn(c, m)))
> > +			return rc;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * migrate_target() - Perform migration as target: receive state from hypervisor
> > + * @fd:		Descriptor for state transfer
> > + * @m:		Migration metadata
> > + *
> > + * Return: 0 on success, error code on failure
> > + *
> > + * #syscalls:vu readv
> > + */
> > +int migrate_target(int fd, const struct migrate_meta *m)
> > +{
> > +	static struct migrate_data *d;
> > +	unsigned cnt;
> > +	int rc;
> > +
> > +	for (d = data_versions; d->v != m->v && d->v; d++);
> > +
> > +	for (cnt = 0; d->sections[cnt + 1 /* skip header */].iov_len; cnt++);
> > +
> > +	debug("Reading %u migration sections", cnt);
> > +	rc = read_remainder(fd, d->sections + 1, cnt, 0);
> > +	if (rc < 0)
> > +		return errno;
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * migrate_target_post() - Post-migration tasks as target
> > + * @c:		Execution context
> > + * @m:		Migration metadata
> > + */
> > +void migrate_target_post(struct ctx *c, struct migrate_meta *m)
> > +{
> > +	struct migrate_target_handlers *th;
> > +	struct migrate_handler *h;
> > +
> > +	for (th = target_handlers; th->v != m->v && th->v; th++);
> > +
> > +	for (h = th->post; h->fn; h++)
> > +		h->fn(c, m);
> > +}
> > diff --git a/migrate.h b/migrate.h
> > new file mode 100644
> > index 0000000..f9635ac
> > --- /dev/null
> > +++ b/migrate.h
> > @@ -0,0 +1,88 @@
> > +/* SPDX-License-Identifier: GPL-2.0-or-later
> > + * Copyright (c) 2025 Red Hat GmbH
> > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > + */
> > + 
> > +#ifndef MIGRATE_H
> > +#define MIGRATE_H
> > +
> > +/**
> > + * struct migrate_meta - Migration metadata
> > + * @v:			Chosen migration data version, host order
> > + * @bswap:		Source has opposite endianness
> > + * @peer_64b:		Source uses 64-bit void *
> > + * @time_64b:		Source uses 64-bit time_t
> > + * @flow_size:		Size of union flow in source
> > + * @flow_sidx_size:	Size of struct flow_sidx in source
> > + */
> > +struct migrate_meta {
> > +	uint32_t v;
> > +	bool bswap;
> > +	bool source_64b;
> > +	bool time_64b;
> > +	size_t flow_size;
> > +	size_t flow_sidx_size;
> > +};
> > +
> > +/**
> > + * union migrate_header - Migration header from source
> > + * @magic:		0xB1BB1D1B0BB1D1B0, host order
> > + * @version:		Source sends highest known, target aborts if unsupported
> > + * @voidp_size:		sizeof(void *), network order
> > + * @time_t_size:	sizeof(time_t), network order
> > + * @flow_size:		sizeof(union flow), network order
> > + * @flow_sidx_size:	sizeof(struct flow_sidx_t), network order
> > + * @unused:		Go figure
> > + */
> > +union migrate_header {
> > +	struct {
> > +		uint64_t magic;
> > +		uint32_t version;
> > +		uint32_t voidp_size;
> > +		uint32_t time_t_size;
> > +		uint32_t flow_size;
> > +		uint32_t flow_sidx_size;
> > +	};
> > +	uint8_t unused[65536];
> > +};
> > +
> > +/**
> > + * struct migrate_data - Data sections for given source version
> > + * @v:			Source version this applies to, host order
> > + * @sections:		Array of data sections, NULL-terminated
> > + */
> > +struct migrate_data {
> > +	uint32_t v;
> > +	struct iovec *sections;
> > +};
> > +
> > +/**
> > + * struct migrate_handler - Function to handle a specific data section
> > + * @fn:			Function pointer taking pointer to context and metadata
> > + */
> > +struct migrate_handler {
> > +	int (*fn)(struct ctx *c, struct migrate_meta *m);
> > +};
> > +
> > +/**
> > + * struct migrate_target_handlers - Versioned sets of migration target handlers
> > + * @v:			Source version this applies to, host order
> > + * @pre:		Set of functions to execute in target before data copy
> > + * @post:		Set of functions to execute in target after data copy
> > + */
> > +struct migrate_target_handlers {
> > +	uint32_t v;
> > +	struct migrate_handler *pre;
> > +	struct migrate_handler *post;
> > +};
> > +
> > +int migrate_source_pre(struct ctx *c, struct migrate_meta *m);
> > +int migrate_source(int fd, const struct migrate_meta *m);
> > +void migrate_source_post(struct ctx *c, struct migrate_meta *m);
> > +
> > +int migrate_target_read_header(int fd, struct migrate_meta *m);
> > +int migrate_target_pre(struct ctx *c, struct migrate_meta *m);
> > +int migrate_target(int fd, const struct migrate_meta *m);
> > +void migrate_target_post(struct ctx *c, struct migrate_meta *m);
> > +
> > +#endif /* MIGRATE_H */
> > diff --git a/passt.c b/passt.c
> > index b1c8ab6..184d4e5 100644
> > --- a/passt.c
> > +++ b/passt.c
> > @@ -358,7 +358,7 @@ loop:
> >  			vu_kick_cb(c.vdev, ref, &now);
> >  			break;
> >  		case EPOLL_TYPE_VHOST_MIGRATION:
> > -			vu_migrate(c.vdev, eventmask);
> > +			vu_migrate(&c, eventmask);  
> 
> Not really a comment on this patch, but do we actually need/want to
> put the migration fd into the epoll loop?  IIUC, everything we're
> doing on the migration fd is synchronous and blocking, so we could run
> the whole migration from the VHOST_USER_SET_DEVICE_STATE callback.

The epoll event bits are practical to have here, though.

> Or, maybe less confusing, flag the migration to run before we next
> call epoll_wait(): having the migration run strictly in-between epoll
> cycles seems like it might make some things easier to analyse.

Well, we don't really *need* to have the descriptor in the epoll list,
but it's more consistent. Should we really introduce a different
mechanism just because we don't strictly need that?

What's the actual problem with the current implementation?

> >  			break;
> >  		default:
> >  			/* Can't happen */
> > diff --git a/vu_common.c b/vu_common.c
> > index f43d8ac..6c346c8 100644
> > --- a/vu_common.c
> > +++ b/vu_common.c
> > @@ -5,6 +5,7 @@
> >   * common_vu.c - vhost-user common UDP and TCP functions
> >   */
> >  
> > +#include <errno.h>
> >  #include <unistd.h>
> >  #include <sys/uio.h>
> >  #include <sys/eventfd.h>
> > @@ -17,6 +18,7 @@
> >  #include "vhost_user.h"
> >  #include "pcap.h"
> >  #include "vu_common.h"
> > +#include "migrate.h"
> >  
> >  #define VU_MAX_TX_BUFFER_NB	2
> >  
> > @@ -305,50 +307,90 @@ err:
> >  }
> >  
> >  /**
> > - * vu_migrate() - Send/receive passt insternal state to/from QEMU
> > - * @vdev:	vhost-user device
> > + * vu_migrate_source() - Migration as source, send state to hypervisor
> > + * @c:		Execution context
> > + * @fd:		File descriptor for state transfer
> > + *
> > + * Return: 0 on success, positive error code on failure
> > + */
> > +static int vu_migrate_source(struct ctx *c, int fd)
> > +{
> > +	struct migrate_meta m;
> > +	int rc;
> > +
> > +	if ((rc = migrate_source_pre(c, &m))) {
> > +		err("Source pre-migration failed: %s, abort", strerror_(rc));
> > +		return rc;
> > +	}
> > +
> > +	debug("Saving backend state");
> > +
> > +	rc = migrate_source(fd, &m);
> > +	if (rc)
> > +		err("Source migration failed: %s", strerror_(rc));
> > +	else
> > +		migrate_source_post(c, &m);
> > +
> > +	return rc;
> > +}
> > +
> > +/**
> > + * vu_migrate_target() - Migration as target, receive state from hypervisor
> > + * @c:		Execution context
> > + * @fd:		File descriptor for state transfer
> > + *
> > + * Return: 0 on success, positive error code on failure
> > + */
> > +static int vu_migrate_target(struct ctx *c, int fd)
> > +{
> > +	struct migrate_meta m;
> > +	int rc;
> > +
> > +	rc = migrate_target_read_header(fd, &m);
> > +	if (rc) {
> > +		err("Migration header check failed: %s, abort", strerror_(rc));
> > +		return rc;
> > +	}
> > +
> > +	if ((rc = migrate_target_pre(c, &m))) {
> > +		err("Target pre-migration failed: %s, abort", strerror_(rc));
> > +		return rc;
> > +	}
> > +
> > +	debug("Loading backend state");
> > +
> > +	rc = migrate_target(fd, &m);
> > +	if (rc)
> > +		err("Target migration failed: %s", strerror_(rc));
> > +	else
> > +		migrate_target_post(c, &m);
> > +
> > +	return rc;
> > +}
> > +
> > +/**
> > + * vu_migrate() - Send/receive passt internal state to/from QEMU
> > + * @c:		Execution context
> >   * @events:	epoll events
> >   */
> > -void vu_migrate(struct vu_dev *vdev, uint32_t events)
> > +void vu_migrate(struct ctx *c, uint32_t events)
> >  {
> > -	int ret;
> > +	struct vu_dev *vdev = c->vdev;
> > +	int rc = EIO;
> >  
> > -	/* TODO: collect/set passt internal state
> > -	 * and use vdev->device_state_fd to send/receive it
> > -	 */
> >  	debug("vu_migrate fd %d events %x", vdev->device_state_fd, events);
> > -	if (events & EPOLLOUT) {
> > -		debug("Saving backend state");
> > -
> > -		/* send some stuff */
> > -		ret = write(vdev->device_state_fd, "PASST", 6);
> > -		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
> > -		vdev->device_state_result = ret == -1 ? -1 : 0;
> > -		/* Closing the file descriptor signals the end of transfer */
> > -		epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
> > -			  vdev->device_state_fd, NULL);
> > -		close(vdev->device_state_fd);
> > -		vdev->device_state_fd = -1;
> > -	} else if (events & EPOLLIN) {
> > -		char buf[6];
> > -
> > -		debug("Loading backend state");
> > -		/* read some stuff */
> > -		ret = read(vdev->device_state_fd, buf, sizeof(buf));
> > -		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
> > -		if (ret != sizeof(buf)) {
> > -			vdev->device_state_result = -1;
> > -		} else {
> > -			ret = strncmp(buf, "PASST", sizeof(buf));
> > -			vdev->device_state_result = ret == 0 ? 0 : -1;
> > -		}
> > -	} else if (events & EPOLLHUP) {
> > -		debug("Closing migration channel");
> > -
> > -		/* The end of file signals the end of the transfer. */
> > -		epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
> > -			  vdev->device_state_fd, NULL);
> > -		close(vdev->device_state_fd);
> > -		vdev->device_state_fd = -1;
> > -	}
> > +
> > +	if (events & EPOLLOUT)
> > +		rc = vu_migrate_source(c, vdev->device_state_fd);
> > +	else if (events & EPOLLIN)
> > +		rc = vu_migrate_target(c, vdev->device_state_fd);
> > +
> > +	/* EPOLLHUP without EPOLLIN/EPOLLOUT, or EPOLLERR? Migration failed */
> > +
> > +	vdev->device_state_result = rc;  
> 
> Again, not really a comment on this patch, but I think it would be
> safer to set device_state_result to something non-zero (probably
> EINPROGRESS) as soon as we get the VHOST_USER_SET_DEVICE_STATE_FD.
> That way if somehow VHOST_USER_CHECK_DEVICE_STATE were called before
> we actually do anything about the migration we'd correctly report that
> it hasn't happened.

I see your point, but I'm not sure if it's correct to report
EINPROGRESS while the migration is not actually in progress.

That is, if we get VHOST_USER_SET_DEVICE_STATE_FD, and then the
hypervisor decides to just send VHOST_USER_CHECK_DEVICE_STATE right
away, the migration isn't in progress. This is especially relevant as
the vhost-user documentation says that:

	"any non-zero value is an error"

Further, it also states, about VHOST_USER_CHECK_DEVICE_STATE, that:

	The back-end responds once it knows whether the transfer and
	processing was successful or not.

so, strictly speaking, we should probably *not* respond to that before
we're done here. On the other hand, we risk triggering issues/hangs in
the hypervisor just for the sake of being correct, so I would rather
report 0 as it would be (not actually happening) done now.

-- 
Stefano


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 7/8] Add interfaces and configuration bits for passt-repair
  2025-01-29  6:09   ` David Gibson
@ 2025-01-29  8:46     ` Stefano Brivio
  2025-01-30  1:33       ` David Gibson
  0 siblings, 1 reply; 20+ messages in thread
From: Stefano Brivio @ 2025-01-29  8:46 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev, Laurent Vivier

On Wed, 29 Jan 2025 17:09:07 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Jan 29, 2025 at 12:39:39AM +0100, Stefano Brivio wrote:
> > In vhost-user mode, by default, create a second UNIX domain socket
> > accepting connections from passt-repair, with the usual listener
> > socket.
> > 
> > When we need to set or clear TCP_REPAIR on sockets, we'll send them
> > via SCM_RIGHTS to passt-repair, who sets the socket option values we
> > ask for.
> > 
> > To that end, introduce batched functions to request TCP_REPAIR
> > settings on sockets, so that we don't have to send a single message
> > for each socket, on migration. When needed, repair_flush() will
> > send the message and check for the reply.
> > 
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> >  Makefile     |  12 ++--
> >  conf.c       |  46 ++++++++++--
> >  epoll_type.h |   4 ++
> >  passt.1      |  11 +++
> >  passt.c      |   9 +++
> >  passt.h      |   7 ++
> >  repair.c     | 192 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  repair.h     |  16 +++++
> >  tap.c        |  65 +----------------
> >  util.c       |  62 +++++++++++++++++
> >  util.h       |   1 +
> >  11 files changed, 353 insertions(+), 72 deletions(-)
> >  create mode 100644 repair.c
> >  create mode 100644 repair.h
> > 
> > diff --git a/Makefile b/Makefile
> > index 1b71cb0..f67a20b 100644
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -38,9 +38,9 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
> >  
> >  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> >  	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
> > -	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
> > -	tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > -	vhost_user.c virtio.c vu_common.c
> > +	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c \
> > +	repair.c tap.c tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c \
> > +	udp_vu.c util.c vhost_user.c virtio.c vu_common.c
> >  QRAP_SRCS = qrap.c
> >  PASST_REPAIR_SRCS = passt-repair.c
> >  SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
> > @@ -50,9 +50,9 @@ MANPAGES = passt.1 pasta.1 qrap.1
> >  PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
> >  	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
> >  	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
> > -	pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
> > -	tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
> > -	vhost_user.h virtio.h vu_common.h
> > +	pcap.h pif.h repair.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h \
> > +	tcp_internal.h tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h \
> > +	udp_vu.h util.h vhost_user.h virtio.h vu_common.h
> >  HEADERS = $(PASST_HEADERS) seccomp.h
> >  
> >  C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
> > diff --git a/conf.c b/conf.c
> > index df2b016..85dec44 100644
> > --- a/conf.c
> > +++ b/conf.c
> > @@ -816,6 +816,9 @@ static void usage(const char *name, FILE *f, int status)
> >  			"    UNIX domain socket is provided by -s option\n"
> >  			"  --print-capabilities	print back-end capabilities in JSON format,\n"
> >  			"    only meaningful for vhost-user mode\n");
> > +		FPRINTF(f,
> > +			"  --repair-path PATH	path for passt-repair(1)\n"  
> 
> Nit: as a privileged helper, should it be passt-repair(8)?

So, I spent a couple of minutes on this as I wrote this, and I spent a
bit longer now: the most authoritative definition I can find of section
8 is from man-pages(7):

       8 System management commands
              Commands like mount(8), many of which only root can
              execute.

It's not really a system management command, and the idea is to run it
with CAP_NET_ADMIN, but not necessarily as root. So I would rather keep
it in section 1, unless there's some other conflicting definition I'm
not aware of.

There's also the topic of where it should be installed (/sbin,
/usr/sbin/, /bin, /usr/bin). I'd pick /usr/bin, because /sbin doesn't
really mean much nowadays, and it's anyway fitting with the FHS 3.0:

  Utilities used for system administration (and other root-only
  commands) are stored in /sbin, /usr/ sbin, and /usr/local/sbin.
  
  /sbin contains binaries essential for booting, restoring, recovering,
  and/or repairing the system in addition to the binaries in /bin.
  18 Programs executed after /usr is known to be mounted (when there
  are no problems) are generally placed into /usr/sbin.

...and I don't think this helper would qualify for /sbin or /usr/sbin.

> 
> > +			"    default: append '.repair' to UNIX domain path\n");
> >  	}
> >  
> >  	FPRINTF(f,
> > @@ -1240,8 +1243,30 @@ static void conf_nat(const char *arg, struct in_addr *addr4,
> >   */
> >  static void conf_open_files(struct ctx *c)
> >  {
> > -	if (c->mode != MODE_PASTA && c->fd_tap == -1)
> > -		c->fd_tap_listen = tap_sock_unix_open(c->sock_path);
> > +	if (c->mode != MODE_PASTA && c->fd_tap == -1) {
> > +		c->fd_tap_listen = sock_unix(c->sock_path);
> > +
> > +		if (c->mode == MODE_VU && strcmp(c->repair_path, "none")) {
> > +			if (!strncmp(c->repair_path, "./", 2)) {
> > +				memmove(c->repair_path, c->repair_path + 2,
> > +					sizeof(c->repair_path) - 2);
> > +			}  
> 
> Do you need this? Shouldn't "./whatever" be usable as-is?

Ah, yes, I didn't know. I explicitly added this for the '--repair-path
./none' case, but it's not actually needed. I'll drop this.

> > +
> > +			if (!*c->repair_path &&
> > +			    snprintf_check(c->repair_path,
> > +					   sizeof(c->repair_path), "%s.repair",
> > +					   c->sock_path)) {
> > +				warn("passt-repair path %s not usable",
> > +				     c->repair_path);  
> 
> I'd prefer a die() here - I think omitting a possibly expected
> feature, with just a warning that could easily be lost in the logs is
> not a good idea.

I was thinking about that, but should we really risk *not* starting
because with ".repair" the path is now too long, for a feature that,
realistically, most users won't actually use?

If we're started by any kind of framework (which is where we run the
risk of the warning being ignored), then there should be explicit
checks about the path and the usability of passt-repair (or equivalent).

> > +				c->fd_repair_listen = -1;
> > +			} else {
> > +				c->fd_repair_listen = sock_unix(c->repair_path);
> > +			}
> > +		} else {
> > +			c->fd_repair_listen = -1;
> > +		}
> > +		c->fd_repair = -1;
> > +	}
> >  
> >  	if (*c->pidfile) {
> >  		c->pidfile_fd = output_file_open(c->pidfile, O_WRONLY);
> > @@ -1354,9 +1379,12 @@ void conf(struct ctx *c, int argc, char **argv)
> >  		{"host-lo-to-ns-lo", no_argument, 	NULL,		23 },
> >  		{"dns-host",	required_argument,	NULL,		24 },
> >  		{"vhost-user",	no_argument,		NULL,		25 },
> > +
> >  		/* vhost-user backend program convention */
> >  		{"print-capabilities", no_argument,	NULL,		26 },
> >  		{"socket-path",	required_argument,	NULL,		's' },
> > +
> > +		{"repair-path",	required_argument,	NULL,		27 },
> >  		{ 0 },
> >  	};
> >  	const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
> > @@ -1824,8 +1852,8 @@ void conf(struct ctx *c, int argc, char **argv)
> >  	if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.guest_gw))
> >  		c->no_dhcp = 1;
> >  
> > -	/* Inbound port options & DNS can be parsed now (after IPv4/IPv6
> > -	 * settings)
> > +	/* Inbound port options, DNS, and --repair-path can be parsed now, after
> > +	 * IPv4/IPv6 settings and --vhost-user.
> >  	 */
> >  	fwd_probe_ephemeral();
> >  	udp_portmap_clear();
> > @@ -1871,6 +1899,16 @@ void conf(struct ctx *c, int argc, char **argv)
> >  			}
> >  
> >  			die("Cannot use DNS address %s", optarg);
> > +		} else if (name == 27) {
> > +			if (c->mode != MODE_VU && strcmp(optarg, "none"))
> > +				die("--repair-path is for vhost-user mode only");
> > +
> > +			if (snprintf_check(c->repair_path,
> > +					   sizeof(c->repair_path), "%s",
> > +					   optarg))
> > +				die("Invalid passt-repair path: %s", optarg);
> > +
> > +			break;
> >  		}
> >  	} while (name != -1);
> >  
> > diff --git a/epoll_type.h b/epoll_type.h
> > index fd9eac3..706238a 100644
> > --- a/epoll_type.h
> > +++ b/epoll_type.h
> > @@ -42,6 +42,10 @@ enum epoll_type {
> >  	EPOLL_TYPE_VHOST_KICK,
> >  	/* vhost-user migration socket */
> >  	EPOLL_TYPE_VHOST_MIGRATION,
> > +	/* TCP_REPAIR helper listening socket */
> > +	EPOLL_TYPE_REPAIR_LISTEN,
> > +	/* TCP_REPAIR helper socket */
> > +	EPOLL_TYPE_REPAIR,
> >  
> >  	EPOLL_NUM_TYPES,
> >  };
> > diff --git a/passt.1 b/passt.1
> > index d9cd33e..63a3a01 100644
> > --- a/passt.1
> > +++ b/passt.1
> > @@ -418,6 +418,17 @@ Enable vhost-user. The vhost-user command socket is provided by \fB--socket\fR.
> >  .BR \-\-print-capabilities
> >  Print back-end capabilities in JSON format, only meaningful for vhost-user mode.
> >  
> > +.TP
> > +.BR \-\-repair-path " " \fIpath
> > +Path for UNIX domain socket used by the \fBpasst-repair\fR(1) helper to connect  
> 
> passt-repair(8)?

See above.

> > +to \fBpasst\fR in order to set or clear the TCP_REPAIR option on sockets, during
> > +migration. \fB--repair-path none\fR disables this interface (if you need to
> > +specify a socket path called "none" you can prefix the path by \fI./\fR).
> > +
> > +Default, for \-\-vhost-user mode only, is to append \fI.repair\fR to the path
> > +chosen for the hypervisor UNIX domain socket. No socket is created if not in
> > +\-\-vhost-user mode.
> > +
> >  .TP
> >  .BR \-F ", " \-\-fd " " \fIFD
> >  Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened
> > diff --git a/passt.c b/passt.c
> > index 184d4e5..1fa2ddd 100644
> > --- a/passt.c
> > +++ b/passt.c
> > @@ -51,6 +51,7 @@
> >  #include "tcp_splice.h"
> >  #include "ndp.h"
> >  #include "vu_common.h"
> > +#include "repair.h"
> >  
> >  #define EPOLL_EVENTS		8
> >  
> > @@ -76,6 +77,8 @@ char *epoll_type_str[] = {
> >  	[EPOLL_TYPE_VHOST_CMD]		= "vhost-user command socket",
> >  	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
> >  	[EPOLL_TYPE_VHOST_MIGRATION]	= "vhost-user migration socket",
> > +	[EPOLL_TYPE_REPAIR_LISTEN]	= "TCP_REPAIR helper listening socket",
> > +	[EPOLL_TYPE_REPAIR]		= "TCP_REPAIR helper socket",
> >  };
> >  static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
> >  	      "epoll_type_str[] doesn't match enum epoll_type");
> > @@ -360,6 +363,12 @@ loop:
> >  		case EPOLL_TYPE_VHOST_MIGRATION:
> >  			vu_migrate(&c, eventmask);
> >  			break;
> > +		case EPOLL_TYPE_REPAIR_LISTEN:
> > +			repair_listen_handler(&c, eventmask);
> > +			break;
> > +		case EPOLL_TYPE_REPAIR:
> > +			repair_handler(&c, eventmask);
> > +			break;
> >  		default:
> >  			/* Can't happen */
> >  			ASSERT(0);
> > diff --git a/passt.h b/passt.h
> > index 0dd4efa..85b0a10 100644
> > --- a/passt.h
> > +++ b/passt.h
> > @@ -20,6 +20,7 @@ union epoll_ref;
> >  #include "siphash.h"
> >  #include "ip.h"
> >  #include "inany.h"
> > +#include "migrate.h"
> >  #include "flow.h"
> >  #include "icmp.h"
> >  #include "fwd.h"
> > @@ -193,6 +194,7 @@ struct ip6_ctx {
> >   * @foreground:		Run in foreground, don't log to stderr by default
> >   * @nofile:		Maximum number of open files (ulimit -n)
> >   * @sock_path:		Path for UNIX domain socket
> > + * @repair_path:	TCP_REPAIR helper path, can be "none", empty for default
> >   * @pcap:		Path for packet capture file
> >   * @pidfile:		Path to PID file, empty string if not configured
> >   * @pidfile_fd:		File descriptor for PID file, -1 if none
> > @@ -203,6 +205,8 @@ struct ip6_ctx {
> >   * @epollfd:		File descriptor for epoll instance
> >   * @fd_tap_listen:	File descriptor for listening AF_UNIX socket, if any
> >   * @fd_tap:		AF_UNIX socket, tuntap device, or pre-opened socket
> > + * @fd_repair_listen:	File descriptor for listening TCP_REPAIR socket, if any
> > + * @fd_repair:		Connected AF_UNIX socket for TCP_REPAIR helper
> >   * @our_tap_mac:	Pasta/passt's MAC on the tap link
> >   * @guest_mac:		MAC address of guest or namespace, seen or configured
> >   * @hash_secret:	128-bit secret for siphash functions
> > @@ -244,6 +248,7 @@ struct ctx {
> >  	int foreground;
> >  	int nofile;
> >  	char sock_path[UNIX_PATH_MAX];
> > +	char repair_path[UNIX_PATH_MAX];
> >  	char pcap[PATH_MAX];
> >  
> >  	char pidfile[PATH_MAX];
> > @@ -260,6 +265,8 @@ struct ctx {
> >  	int epollfd;
> >  	int fd_tap_listen;
> >  	int fd_tap;
> > +	int fd_repair_listen;
> > +	int fd_repair;
> >  	unsigned char our_tap_mac[ETH_ALEN];
> >  	unsigned char guest_mac[ETH_ALEN];
> >  	uint64_t hash_secret[2];
> > diff --git a/repair.c b/repair.c
> > new file mode 100644
> > index 0000000..24966f5
> > --- /dev/null
> > +++ b/repair.c
> > @@ -0,0 +1,192 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +
> > +/* PASST - Plug A Simple Socket Transport
> > + *  for qemu/UNIX domain socket mode
> > + *
> > + * PASTA - Pack A Subtle Tap Abstraction
> > + *  for network namespace/tap device mode
> > + *
> > + * repair.c - Interface (server) for passt-repair, set/clear TCP_REPAIR
> > + *
> > + * Copyright (c) 2025 Red Hat GmbH
> > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > + */
> > +
> > +#include <errno.h>
> > +#include <sys/uio.h>
> > +
> > +#include "util.h"
> > +#include "ip.h"
> > +#include "passt.h"
> > +#include "inany.h"
> > +#include "flow.h"
> > +#include "flow_table.h"
> > +
> > +#include "repair.h"
> > +
> > +#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
> > +
> > +static int fds[SCM_MAX_FD];  
> 
> Even 'static', I'd prefer a longer name for a global variable.

A longer name for this also means a longer name for nfds below, which
is not necessarily practical, but I'll check and try to change these to
repair_fds[] / repair_nfds if doable.

> > +static int current_cmd;  
> 
> Is there any particular rationale behind these being globals, whereas
> fd_repair is in struct ctx?  AFAICT they're basically equally global
> in practice.

Yes: c->fd_repair_listen needs to be in struct ctx, and I want to keep
this consistent with c->fd_tap_listen / c->fd_tap.

Besides, the variables declared here are really hacks representing
state for this compilation unit only.

> > +static int nfds;
> > +
> > +/**
> > + * repair_sock_init() - Start listening for connections on helper socket
> > + * @c:		Execution context
> > + */
> > +void repair_sock_init(const struct ctx *c)
> > +{
> > +	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR_LISTEN };
> > +	struct epoll_event ev = { 0 };
> > +
> > +	listen(c->fd_repair_listen, 0);
> > +
> > +	ref.fd = c->fd_repair_listen;
> > +	ev.events = EPOLLIN | EPOLLHUP | EPOLLET;
> > +	ev.data.u64 = ref.u64;
> > +	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair_listen, &ev);
> > +}
> > +
> > +/**
> > + * repair_listen_handler() - Handle events on TCP_REPAIR helper listening socket
> > + * @c:		Execution context
> > + * @events:	epoll events
> > + */
> > +void repair_listen_handler(struct ctx *c, uint32_t events)
> > +{
> > +	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR };
> > +	struct epoll_event ev = { 0 };
> > +	struct ucred ucred;
> > +	socklen_t len;
> > +
> > +	if (events != EPOLLIN) {
> > +		debug("Spurious event 0x%04x on TCP_REPAIR helper socket",
> > +		      events);
> > +		return;
> > +	}
> > +
> > +	len = sizeof(ucred);
> > +
> > +	/* Another client is already connected: accept and close right away. */  
> 
> For the repair socket, last-connection-wins would make more sense to
> me than first-connection-wins.  While hacking/debugging seems it might
> be useful to fix something in the passt-repair, re-run it and have it
> displace the stale version for existing passt instances.

In the whole debugging I've done so far I'm actually using passt-repair
(otherwise I don't even start it), so it already terminates once it's
done.

I think it's more secure to keep it like this, because it's more robust
against races.

Let's say we have an issue in KubeVirt's virt-handler so that the
migration starts a bit before the helper is started, and somebody
manages to use this time window to connect another helper, this attempt
will actually go unnoticed.

If passt-repair fails to connect, it will be very obvious.

> > +	if (c->fd_repair != -1) {
> > +		int discard = accept4(c->fd_repair_listen, NULL, NULL,
> > +				      SOCK_NONBLOCK);
> > +
> > +		if (discard == -1)
> > +			return;
> > +
> > +		if (!getsockopt(discard, SOL_SOCKET, SO_PEERCRED, &ucred, &len))
> > +			info("Discarding TCP_REPAIR helper, PID %i", ucred.pid);
> > +
> > +		close(discard);
> > +		return;
> > +	}
> > +
> > +	c->fd_repair = accept4(c->fd_repair_listen, NULL, NULL, 0);
> > +
> > +	if (!getsockopt(c->fd_repair, SOL_SOCKET, SO_PEERCRED, &ucred, &len))
> > +		info("Accepted TCP_REPAIR helper, PID %i", ucred.pid);
> > +
> > +	ref.fd = c->fd_repair;
> > +	ev.events = EPOLLHUP | EPOLLET;
> > +	ev.data.u64 = ref.u64;
> > +	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair, &ev);
> > +}
> > +
> > +/**
> > + * repair_close() - Close connection to TCP_REPAIR helper
> > + * @c:		Execution context
> > + */
> > +void repair_close(struct ctx *c)
> > +{
> > +	debug("Closing TCP_REPAIR helper socket");
> > +
> > +	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_repair, NULL);
> > +	close(c->fd_repair);
> > +	c->fd_repair = -1;
> > +}
> > +
> > +/**
> > + * repair_handler() - Handle EPOLLHUP and EPOLLERR on TCP_REPAIR helper socket
> > + * @c:		Execution context
> > + * @events:	epoll events
> > + */
> > +void repair_handler(struct ctx *c, uint32_t events)
> > +{
> > +	(void)events;
> > +
> > +	repair_close(c);
> > +}
> > +
> > +/**
> > + * repair_flush() - Flush current set of sockets to helper, with current command
> > + * @c:		Execution context
> > + *
> > + * Return: 0 on success, negative error code on failure
> > + */
> > +int repair_flush(struct ctx *c)
> > +{
> > +	struct iovec iov = { &((int8_t){ current_cmd }), sizeof(int8_t) };
> > +	char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
> > +	     __attribute__ ((aligned(__alignof__(struct cmsghdr))));
> > +	struct cmsghdr *cmsg;
> > +	struct msghdr msg;
> > +	int ret = 0;
> > +
> > +	if (!nfds)
> > +		return 0;
> > +
> > +	msg = (struct msghdr){ NULL, 0, &iov, 1,
> > +			       buf, CMSG_SPACE(sizeof(int) * nfds), 0 };
> > +	cmsg = CMSG_FIRSTHDR(&msg);
> > +
> > +	cmsg->cmsg_level = SOL_SOCKET;
> > +	cmsg->cmsg_type = SCM_RIGHTS;
> > +	cmsg->cmsg_len = CMSG_LEN(sizeof(int) * nfds);
> > +	memcpy(CMSG_DATA(cmsg), fds, sizeof(int) * nfds);
> > +
> > +	nfds = 0;
> > +
> > +	if (sendmsg(c->fd_repair, &msg, 0) < 0) {
> > +		ret = -errno;  
> 
> This error code won't be reported to the caller: you'll continue on to
> the recv() below, which will return EBADF, clobbering ret.

Oops, right, I'll return early in this case (or equivalent).

> > +		err_perror("Failed to send sockets to TCP_REPAIR helper");
> > +		repair_close(c);
> > +	}
> > +
> > +	if (recv(c->fd_repair, &((int8_t){ 0 }), 1, 0) < 0) {
> > +		ret = -errno;
> > +		err_perror("Failed to receive reply from TCP_REPAIR helper");
> > +		repair_close(c);
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +/**
> > + * repair_flush() - Add socket to TCP_REPAIR set with given command
> > + * @c:		Execution context
> > + * @s:		Socket to add
> > + * @cmd:	TCP_REPAIR_ON, TCP_REPAIR_OFF, or TCP_REPAIR_OFF_NO_WP
> > + *
> > + * Return: 0 on success, negative error code on failure
> > + */
> > +/* cppcheck-suppress unusedFunction */
> > +int repair_set(struct ctx *c, int s, int cmd)
> > +{
> > +	int rc;
> > +
> > +	if (nfds && current_cmd != cmd) {
> > +		if ((rc = repair_flush(c)))
> > +			return rc;
> > +	}
> > +
> > +	current_cmd = cmd;
> > +	fds[nfds++] = s;
> > +
> > +	if (nfds >= SCM_MAX_FD) {
> > +		if ((rc = repair_flush(c)))
> > +			return rc;
> > +	}
> > +
> > +	return 0;
> > +}
> > diff --git a/repair.h b/repair.h
> > new file mode 100644
> > index 0000000..693c515
> > --- /dev/null
> > +++ b/repair.h
> > @@ -0,0 +1,16 @@
> > +/* SPDX-License-Identifier: GPL-2.0-or-later
> > + * Copyright (c) 2025 Red Hat GmbH
> > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > + */
> > + 
> > +#ifndef REPAIR_H
> > +#define REPAIR_H
> > +
> > +void repair_sock_init(const struct ctx *c);
> > +void repair_listen_handler(struct ctx *c, uint32_t events);
> > +void repair_handler(struct ctx *c, uint32_t events);
> > +void repair_close(struct ctx *c);
> > +int repair_flush(struct ctx *c);
> > +int repair_set(struct ctx *c, int s, int cmd);
> > +
> > +#endif /* REPAIR_H */
> > diff --git a/tap.c b/tap.c
> > index cd32a90..0e60eb4 100644
> > --- a/tap.c
> > +++ b/tap.c
> > @@ -56,6 +56,7 @@
> >  #include "netlink.h"
> >  #include "pasta.h"
> >  #include "packet.h"
> > +#include "repair.h"
> >  #include "tap.h"
> >  #include "log.h"
> >  #include "vhost_user.h"
> > @@ -1151,68 +1152,6 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
> >  		tap_pasta_input(c, now);
> >  }
> >  
> > -/**
> > - * tap_sock_unix_open() - Create and bind AF_UNIX socket
> > - * @sock_path:	Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
> > - *
> > - * Return: socket descriptor on success, won't return on failure
> > - */
> > -int tap_sock_unix_open(char *sock_path)
> > -{
> > -	int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
> > -	struct sockaddr_un addr = {
> > -		.sun_family = AF_UNIX,
> > -	};
> > -	int i;
> > -
> > -	if (fd < 0)
> > -		die_perror("Failed to open UNIX domain socket");
> > -
> > -	for (i = 1; i < UNIX_SOCK_MAX; i++) {
> > -		char *path = addr.sun_path;
> > -		int ex, ret;
> > -
> > -		if (*sock_path)
> > -			memcpy(path, sock_path, UNIX_PATH_MAX);
> > -		else if (snprintf_check(path, UNIX_PATH_MAX - 1,
> > -					UNIX_SOCK_PATH, i))
> > -			die_perror("Can't build UNIX domain socket path");
> > -
> > -		ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
> > -			    0);
> > -		if (ex < 0)
> > -			die_perror("Failed to check for UNIX domain conflicts");
> > -
> > -		ret = connect(ex, (const struct sockaddr *)&addr, sizeof(addr));
> > -		if (!ret || (errno != ENOENT && errno != ECONNREFUSED &&
> > -			     errno != EACCES)) {
> > -			if (*sock_path)
> > -				die("Socket path %s already in use", path);
> > -
> > -			close(ex);
> > -			continue;
> > -		}
> > -		close(ex);
> > -
> > -		unlink(path);
> > -		ret = bind(fd, (const struct sockaddr *)&addr, sizeof(addr));
> > -		if (*sock_path && ret)
> > -			die_perror("Failed to bind UNIX domain socket");
> > -
> > -		if (!ret)
> > -			break;
> > -	}
> > -
> > -	if (i == UNIX_SOCK_MAX)
> > -		die_perror("Failed to bind UNIX domain socket");
> > -
> > -	info("UNIX domain socket bound at %s", addr.sun_path);
> > -	if (!*sock_path)
> > -		memcpy(sock_path, addr.sun_path, UNIX_PATH_MAX);
> > -
> > -	return fd;
> > -}
> > -
> >  /**
> >   * tap_backend_show_hints() - Give help information to start QEMU
> >   * @c:		Execution context
> > @@ -1423,6 +1362,8 @@ void tap_backend_init(struct ctx *c)
> >  		tap_sock_tun_init(c);
> >  		break;
> >  	case MODE_VU:
> > +		repair_sock_init(c);
> > +		/* fall through */
> >  	case MODE_PASST:
> >  		tap_sock_unix_init(c);
> >  
> > diff --git a/util.c b/util.c
> > index 36857d4..e98da74 100644
> > --- a/util.c
> > +++ b/util.c
> > @@ -178,6 +178,68 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
> >  	return fd;
> >  }
> >  
> > +/**
> > + * sock_unix() - Create and bind AF_UNIX socket
> > + * @sock_path:	Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
> > + *
> > + * Return: socket descriptor on success, won't return on failure
> > + */  
> 
> I like making tap_sock_unix_open() more general.  Could be split into
> a separate patch.

I need it here though, and keeping it here saves lines in the commit
message...

-- 
Stefano


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 8/8] flow, tcp: Basic pre-migration source handler to dump sequence numbers
  2025-01-29  6:15   ` David Gibson
@ 2025-01-29  8:46     ` Stefano Brivio
  0 siblings, 0 replies; 20+ messages in thread
From: Stefano Brivio @ 2025-01-29  8:46 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev, Laurent Vivier

On Wed, 29 Jan 2025 17:15:41 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Jan 29, 2025 at 12:39:40AM +0100, Stefano Brivio wrote:
> > Very much draft quality, but it works. Ask passt-repair to switch
> > TCP sockets to repair mode and dump their current sequence numbers to
> > the flow table, which will be transferred and used by the target in
> > the next step.
> > 
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> >  flow.c     | 43 +++++++++++++++++++++++++++++++++++++++++
> >  flow.h     |  1 +
> >  migrate.c  |  1 +
> >  tcp.c      | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  tcp_conn.h |  5 +++++
> >  5 files changed, 106 insertions(+)
> > 
> > diff --git a/flow.c b/flow.c
> > index ee1221b..e7148b2 100644
> > --- a/flow.c
> > +++ b/flow.c
> > @@ -19,6 +19,7 @@
> >  #include "inany.h"
> >  #include "flow.h"
> >  #include "flow_table.h"
> > +#include "repair.h"
> >  
> >  const char *flow_state_str[] = {
> >  	[FLOW_STATE_FREE]	= "FREE",
> > @@ -874,6 +875,48 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
> >  	*last_next = FLOW_MAX;
> >  }
> >  
> > +/**
> > + * flow_migrate_source_pre() - Prepare all source flows for migration
> > + * @c:		Execution context
> > + * @m:		Migration metadata
> > + *
> > + * Return: 0 on success
> > + */
> > +int flow_migrate_source_pre(struct ctx *c, struct migrate_meta *m)
> > +{
> > +	unsigned i;
> > +	int rc;
> > +
> > +	(void)m;
> > +
> > +	for (i = 0; i < FLOW_MAX; i++) {	/* TODO: iterator with skip */
> > +		union flow *flow = &flowtab[i];
> > +
> > +		if (flow->f.state == FLOW_STATE_FREE)
> > +			i += flow->free.n - 1;
> > +		else if (flow->f.state == FLOW_STATE_ACTIVE &&  
> 
> We should probably just abort any flows that are in pre-ACTIVE state
> at migration time.  Wait... IIRC flows have to be in ACTIVE state (or
> already cancelled) once we get to the next epoll cycle.  So we can
> possibly just assert that state is either ACTIVE or FREE.

"IIRC" doesn't quite fit with "just assert". Let's not make things
crash here, it's really not the right point to get issues reported. I
can add a print.

> > +			 flow->f.type == FLOW_TCP)
> > +			rc = tcp_flow_repair_on(c, &flow->tcp);
> > +
> > +		if (rc)
> > +			return rc;		/* TODO: rollback */
> > +	}
> > +
> > +	repair_flush(c);			/* TODO: move to TCP logic */
> > +
> > +	for (i = 0; i < FLOW_MAX; i++) {	/* TODO: iterator with skip */
> > +		union flow *flow = &flowtab[i];
> > +
> > +		if (flow->f.state == FLOW_STATE_FREE)
> > +			i += flow->free.n - 1;
> > +		else if (flow->f.state == FLOW_STATE_ACTIVE &&
> > +			 flow->f.type == FLOW_TCP)
> > +			tcp_flow_dump_seq(c, &flow->tcp);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >  /**
> >   * flow_init() - Initialise flow related data structures
> >   */
> > diff --git a/flow.h b/flow.h
> > index 8eb5964..ff390a6 100644
> > --- a/flow.h
> > +++ b/flow.h
> > @@ -255,6 +255,7 @@ union flow;
> >  
> >  void flow_init(void);
> >  void flow_defer_handler(const struct ctx *c, const struct timespec *now);
> > +int flow_migrate_source_pre(struct ctx *c, struct migrate_meta *m);
> >  
> >  void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> >  	__attribute__((format(printf, 3, 4)));
> > diff --git a/migrate.c b/migrate.c
> > index b8b79e0..6707c02 100644
> > --- a/migrate.c
> > +++ b/migrate.c
> > @@ -56,6 +56,7 @@ static struct migrate_data data_versions[] = {
> >  
> >  /* Handlers to call in source before sending data */
> >  struct migrate_handler handlers_source_pre[] = {
> > +	{ flow_migrate_source_pre },
> >  	{ 0 },
> >  };
> >  
> > diff --git a/tcp.c b/tcp.c
> > index c89f323..3a3038b 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -299,6 +299,7 @@
> >  #include "log.h"
> >  #include "inany.h"
> >  #include "flow.h"
> > +#include "repair.h"
> >  #include "linux_dep.h"
> >  
> >  #include "flow_table.h"
> > @@ -868,6 +869,61 @@ void tcp_defer_handler(struct ctx *c)
> >  	tcp_payload_flush(c);
> >  }
> >  
> > +/**
> > + * tcp_flow_repair_on() - Enable repair mode for a single TCP flow
> > + * @c:		Execution context
> > + * @conn:	Pointer to the TCP connection structure
> > + *
> > + * Return: 0 on success, negative error code on failure
> > + */
> > +int tcp_flow_repair_on(struct ctx *c, const struct tcp_tap_conn *conn)
> > +{
> > +	int rc = 0;
> > +
> > +	if ((rc = repair_set(c, conn->sock, TCP_REPAIR_ON)))
> > +		err("Failed to set TCP_REPAIR for socket %i", conn->sock);  
> 
> Well.. except that the error could just as easily have been on a
> previous socket that wasn't flushed yet.

Right, I should probably just drop the socket number from here.

-- 
Stefano


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 5/8] Introduce facilities for guest migration on top of vhost-user infrastructure
  2025-01-29  8:46     ` Stefano Brivio
@ 2025-01-30  1:28       ` David Gibson
  0 siblings, 0 replies; 20+ messages in thread
From: David Gibson @ 2025-01-30  1:28 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier

[-- Attachment #1: Type: text/plain, Size: 27656 bytes --]

On Wed, Jan 29, 2025 at 09:46:06AM +0100, Stefano Brivio wrote:
> On Wed, 29 Jan 2025 16:41:01 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Wed, Jan 29, 2025 at 12:39:37AM +0100, Stefano Brivio wrote:
> > > Add two sets (source or target) of three functions each for passt in
> > > vhost-user mode, triggered by activity on the file descriptor passed
> > > via VHOST_USER_PROTOCOL_F_DEVICE_STATE:
> > > 
> > > - migrate_source_pre() and migrate_target_pre() are called to prepare
> > >   for migration, before data is transferred
> > > 
> > > - migrate_source() sends, and migrate_target() receives migration data
> > > 
> > > - migrate_source_post() and migrate_target_post() are responsible for
> > >   any post-migration task
> > > 
> > > Callbacks are added to these functions with arrays of function
> > > pointers in migrate.c. Migration handlers are versioned.
> > > 
> > > Versioned descriptions of data sections will be added to the
> > > data_versions array, which points to versioned iovec arrays. Version
> > > 1 is currently empty and will be filled in in subsequent patches.
> > > 
> > > The source announces the data version to be used and informs the peer
> > > about endianness, and the size of void *, time_t, flow entries and
> > > flow hash table entries.
> > > 
> > > The target checks if the version of the source is still supported. If
> > > it's not, it aborts the migration.
> > > 
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>  
> > 
> > Tomorrow, I'm planning to try implementing a bunch of the suggestions
> > I have below.
> > 
> > > ---
> > >  Makefile    |  12 +--
> > >  migrate.c   | 264 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  migrate.h   |  88 ++++++++++++++++++
> > >  passt.c     |   2 +-
> > >  vu_common.c | 124 ++++++++++++++++--------
> > >  vu_common.h |   2 +-
> > >  6 files changed, 443 insertions(+), 49 deletions(-)
> > >  create mode 100644 migrate.c
> > >  create mode 100644 migrate.h
> > > 
> > > diff --git a/Makefile b/Makefile
> > > index 464eef1..1383875 100644
> > > --- a/Makefile
> > > +++ b/Makefile
> > > @@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
> > >  
> > >  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> > >  	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
> > > -	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
> > > -	tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > > +	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
> > > +	tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > >  	vhost_user.c virtio.c vu_common.c
> > >  QRAP_SRCS = qrap.c
> > >  SRCS = $(PASST_SRCS) $(QRAP_SRCS)
> > > @@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1
> > >  
> > >  PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
> > >  	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
> > > -	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
> > > -	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
> > > -	tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
> > > -	virtio.h vu_common.h
> > > +	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
> > > +	pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
> > > +	tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
> > > +	vhost_user.h virtio.h vu_common.h
> > >  HEADERS = $(PASST_HEADERS) seccomp.h
> > >  
> > >  C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
> > > diff --git a/migrate.c b/migrate.c
> > > new file mode 100644
> > > index 0000000..b8b79e0
> > > --- /dev/null
> > > +++ b/migrate.c
> > > @@ -0,0 +1,264 @@
> > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > +
> > > +/* PASST - Plug A Simple Socket Transport
> > > + *  for qemu/UNIX domain socket mode
> > > + *
> > > + * PASTA - Pack A Subtle Tap Abstraction
> > > + *  for network namespace/tap device mode
> > > + *
> > > + * migrate.c - Migration sections, layout, and routines
> > > + *
> > > + * Copyright (c) 2025 Red Hat GmbH
> > > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > > + */
> > > +
> > > +#include <errno.h>
> > > +#include <sys/uio.h>
> > > +
> > > +#include "util.h"
> > > +#include "ip.h"
> > > +#include "passt.h"
> > > +#include "inany.h"
> > > +#include "flow.h"
> > > +#include "flow_table.h"
> > > +
> > > +#include "migrate.h"
> > > +
> > > +/* Current version of migration data */
> > > +#define MIGRATE_VERSION		1
> > > +
> > > +/* Magic as we see it and as seen with reverse endianness */
> > > +#define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
> > > +#define MIGRATE_MAGIC_SWAPPED	0xB0D1B1B01B1DBBB1
> > > +
> > > +/* Migration header to send from source */
> > > +static union migrate_header header = {
> > > +	.magic		= MIGRATE_MAGIC,
> > > +	.version	= htonl_constant(MIGRATE_VERSION),
> > > +	.time_t_size	= htonl_constant(sizeof(time_t)),
> > > +	.flow_size	= htonl_constant(sizeof(union flow)),
> > > +	.flow_sidx_size	= htonl_constant(sizeof(struct flow_sidx)),
> > > +	.voidp_size	= htonl_constant(sizeof(void *)),
> > > +};
> > > +
> > > +/* Data sections for version 1 */
> > > +static struct iovec sections_v1[] = {
> > > +	{ &header,	sizeof(header) },  
> > 
> > This format assumes that everything we send is in buffers with
> > statically known size and address.  I'm pretty sure we'll outgrow that
> > quickly.  For one thing, I realised we want to transfer addr*_seen, or
> > we won't know how to direct an incoming connection after migration, if
> > that occurs before the guest sends anything outgoing.
> 
> Yes, I'm transferring those too (local changes not completely
> working yet) as separate sections. The layout is just fine for those.

Oh, I guess the context does actually live in a global, we just
pretend it doesn't most of the time.

> What we don't cover with this layout is arrays, and I guess that should
> be added, because if we want to transfer a minimal representation of
> single flows instead of the whole flow table, we need something like
> that.
> 
> I think it could be something on top of iovecs, that is, a simple
> structure with iovec pointers and a count, or a pointer to a count.
> 
> > > +};
> > > +
> > > +/* Set of data versions */
> > > +static struct migrate_data data_versions[] = {
> > > +	{
> > > +		1,	sections_v1,
> > > +	},  
> > 
> > I realise it's a bit less "declarative", but maybe just a "receive"
> > function for each version which open codes sending each section would
> > be more flexible, and not that much more verbose.
> 
> It's 18 lines of migrate_source() and 17 lines of migrate_target()
> right now, so it's not really *that* short, and if it can be
> declarative (I'm finding this quite convenient as I'm *using* it), I
> think it should be. What else would you like to add?
> 
> Besides, it can always be changed later as a use case appears.
> 
> > > +	{ 0 },
> > > +};
> > > +
> > > +/* Handlers to call in source before sending data */
> > > +struct migrate_handler handlers_source_pre[] = {
> > > +	{ 0 },
> > > +};
> > > +
> > > +/* Handlers to call in source after sending data */
> > > +struct migrate_handler handlers_source_post[] = {
> > > +	{ 0 },
> > > +};  
> > 
> > I also wonder if these tables of handlers are overkill, rather than
> > just having a function that calls all the things in the right order.
> 
> That risks turning into a long series of:
> 
> 	ret = handler(c);
> 	if (ret)
> 		return ret;
> 
> ...
> 
> but anyway, let me finish writing the code using this first, then we'll
> know, instead of speculating.

Ah, yeah, bulky C error handling, that's a good point.

> > > +
> > > +/* Handlers to call in target before receiving data with version 1 */
> > > +struct migrate_handler handlers_target_pre_v1[] = {
> > > +	{ 0 },
> > > +};
> > > +
> > > +/* Handlers to call in target after receiving data with version 1 */
> > > +struct migrate_handler handlers_target_post_v1[] = {
> > > +	{ 0 },
> > > +};
> > > +
> > > +/* Versioned sets of migration handlers */
> > > +struct migrate_target_handlers target_handlers[] = {
> > > +	{
> > > +		1,
> > > +		handlers_target_pre_v1,
> > > +		handlers_target_post_v1,
> > > +	},
> > > +	{ 0 },
> > > +};
> > > +
> > > +/**
> > > + * migrate_source_pre() - Pre-migration tasks as source
> > > + * @c:		Execution context
> > > + * @m:		Migration metadata
> > > + *
> > > + * Return: 0 on success, error code on failure
> > > + */
> > > +int migrate_source_pre(struct ctx *c, struct migrate_meta *m)
> > > +{
> > > +	struct migrate_handler *h;
> > > +
> > > +	for (h = handlers_source_pre; h->fn; h++) {
> > > +		int rc;
> > > +
> > > +		if ((rc = h->fn(c, m)))
> > > +			return rc;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * migrate_source() - Perform migration as source: send state to hypervisor
> > > + * @fd:		Descriptor for state transfer
> > > + * @m:		Migration metadata
> > > + *
> > > + * Return: 0 on success, error code on failure
> > > + */
> > > +int migrate_source(int fd, const struct migrate_meta *m)
> > > +{
> > > +	static struct migrate_data *d;
> > > +	int count, rc;
> > > +
> > > +	(void)m;
> > > +
> > > +	for (d = data_versions; d->v != MIGRATE_VERSION; d++);
> > > +
> > > +	for (count = 0; d->sections[count].iov_len; count++);
> > > +
> > > +	debug("Writing %u migration sections", count - 1 /* minus header */);
> > > +	rc = write_remainder(fd, d->sections, count, 0);
> > > +	if (rc < 0)
> > > +		return errno;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * migrate_source_post() - Post-migration tasks as source
> > > + * @c:		Execution context
> > > + * @m:		Migration metadata
> > > + *
> > > + * Return: 0 on success, error code on failure
> > > + */
> > > +void migrate_source_post(struct ctx *c, struct migrate_meta *m)
> > > +{
> > > +	struct migrate_handler *h;
> > > +
> > > +	for (h = handlers_source_post; h->fn; h++)
> > > +		h->fn(c, m);
> > > +}
> > > +
> > > +/**
> > > + * migrate_target_read_header() - Set metadata in target from source header
> > > + * @fd:		Descriptor for state transfer
> > > + * @m:		Migration metadata, filled on return
> > > + *
> > > + * Return: 0 on success, error code on failure
> > > + */
> > > +int migrate_target_read_header(int fd, struct migrate_meta *m)
> > > +{
> > > +	static struct migrate_data *d;
> > > +	union migrate_header h;
> > > +
> > > +	if (read_all_buf(fd, &h, sizeof(h)))
> > > +		return errno;
> > > +
> > > +	debug("Source magic: 0x%016" PRIx64 ", sizeof(void *): %u, version: %u",
> > > +	      h.magic, ntohl(h.voidp_size), ntohl(h.version));
> > > +
> > > +	for (d = data_versions; d->v != ntohl(h.version) && d->v; d++);
> > > +	if (!d->v)
> > > +		return ENOTSUP;
> > > +	m->v = d->v;
> > > +
> > > +	if (h.magic == MIGRATE_MAGIC)
> > > +		m->bswap = false;
> > > +	else if (h.magic == MIGRATE_MAGIC_SWAPPED)
> > > +		m->bswap = true;
> > > +	else
> > > +		return ENOTSUP;
> > > +
> > > +	if (ntohl(h.voidp_size) == 4)
> > > +		m->source_64b = false;
> > > +	else if (ntohl(h.voidp_size) == 8)
> > > +		m->source_64b = true;
> > > +	else
> > > +		return ENOTSUP;
> > > +
> > > +	if (ntohl(h.time_t_size) == 4)
> > > +		m->time_64b = false;
> > > +	else if (ntohl(h.time_t_size) == 8)
> > > +		m->time_64b = true;
> > > +	else
> > > +		return ENOTSUP;
> > > +
> > > +	m->flow_size = ntohl(h.flow_size);
> > > +	m->flow_sidx_size = ntohl(h.flow_sidx_size);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * migrate_target_pre() - Pre-migration tasks as target
> > > + * @c:		Execution context
> > > + * @m:		Migration metadata
> > > + *
> > > + * Return: 0 on success, error code on failure
> > > + */
> > > +int migrate_target_pre(struct ctx *c, struct migrate_meta *m)
> > > +{
> > > +	struct migrate_target_handlers *th;
> > > +	struct migrate_handler *h;
> > > +
> > > +	for (th = target_handlers; th->v != m->v && th->v; th++);
> > > +
> > > +	for (h = th->pre; h->fn; h++) {
> > > +		int rc;
> > > +
> > > +		if ((rc = h->fn(c, m)))
> > > +			return rc;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * migrate_target() - Perform migration as target: receive state from hypervisor
> > > + * @fd:		Descriptor for state transfer
> > > + * @m:		Migration metadata
> > > + *
> > > + * Return: 0 on success, error code on failure
> > > + *
> > > + * #syscalls:vu readv
> > > + */
> > > +int migrate_target(int fd, const struct migrate_meta *m)
> > > +{
> > > +	static struct migrate_data *d;
> > > +	unsigned cnt;
> > > +	int rc;
> > > +
> > > +	for (d = data_versions; d->v != m->v && d->v; d++);
> > > +
> > > +	for (cnt = 0; d->sections[cnt + 1 /* skip header */].iov_len; cnt++);
> > > +
> > > +	debug("Reading %u migration sections", cnt);
> > > +	rc = read_remainder(fd, d->sections + 1, cnt, 0);
> > > +	if (rc < 0)
> > > +		return errno;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * migrate_target_post() - Post-migration tasks as target
> > > + * @c:		Execution context
> > > + * @m:		Migration metadata
> > > + */
> > > +void migrate_target_post(struct ctx *c, struct migrate_meta *m)
> > > +{
> > > +	struct migrate_target_handlers *th;
> > > +	struct migrate_handler *h;
> > > +
> > > +	for (th = target_handlers; th->v != m->v && th->v; th++);
> > > +
> > > +	for (h = th->post; h->fn; h++)
> > > +		h->fn(c, m);
> > > +}
> > > diff --git a/migrate.h b/migrate.h
> > > new file mode 100644
> > > index 0000000..f9635ac
> > > --- /dev/null
> > > +++ b/migrate.h
> > > @@ -0,0 +1,88 @@
> > > +/* SPDX-License-Identifier: GPL-2.0-or-later
> > > + * Copyright (c) 2025 Red Hat GmbH
> > > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > > + */
> > > + 
> > > +#ifndef MIGRATE_H
> > > +#define MIGRATE_H
> > > +
> > > +/**
> > > + * struct migrate_meta - Migration metadata
> > > + * @v:			Chosen migration data version, host order
> > > + * @bswap:		Source has opposite endianness
> > > + * @peer_64b:		Source uses 64-bit void *
> > > + * @time_64b:		Source uses 64-bit time_t
> > > + * @flow_size:		Size of union flow in source
> > > + * @flow_sidx_size:	Size of struct flow_sidx in source
> > > + */
> > > +struct migrate_meta {
> > > +	uint32_t v;
> > > +	bool bswap;
> > > +	bool source_64b;
> > > +	bool time_64b;
> > > +	size_t flow_size;
> > > +	size_t flow_sidx_size;
> > > +};
> > > +
> > > +/**
> > > + * union migrate_header - Migration header from source
> > > + * @magic:		0xB1BB1D1B0BB1D1B0, host order
> > > + * @version:		Source sends highest known, target aborts if unsupported
> > > + * @voidp_size:		sizeof(void *), network order
> > > + * @time_t_size:	sizeof(time_t), network order
> > > + * @flow_size:		sizeof(union flow), network order
> > > + * @flow_sidx_size:	sizeof(struct flow_sidx_t), network order
> > > + * @unused:		Go figure
> > > + */
> > > +union migrate_header {
> > > +	struct {
> > > +		uint64_t magic;
> > > +		uint32_t version;
> > > +		uint32_t voidp_size;
> > > +		uint32_t time_t_size;
> > > +		uint32_t flow_size;
> > > +		uint32_t flow_sidx_size;
> > > +	};
> > > +	uint8_t unused[65536];
> > > +};
> > > +
> > > +/**
> > > + * struct migrate_data - Data sections for given source version
> > > + * @v:			Source version this applies to, host order
> > > + * @sections:		Array of data sections, NULL-terminated
> > > + */
> > > +struct migrate_data {
> > > +	uint32_t v;
> > > +	struct iovec *sections;
> > > +};
> > > +
> > > +/**
> > > + * struct migrate_handler - Function to handle a specific data section
> > > + * @fn:			Function pointer taking pointer to context and metadata
> > > + */
> > > +struct migrate_handler {
> > > +	int (*fn)(struct ctx *c, struct migrate_meta *m);
> > > +};
> > > +
> > > +/**
> > > + * struct migrate_target_handlers - Versioned sets of migration target handlers
> > > + * @v:			Source version this applies to, host order
> > > + * @pre:		Set of functions to execute in target before data copy
> > > + * @post:		Set of functions to execute in target after data copy
> > > + */
> > > +struct migrate_target_handlers {
> > > +	uint32_t v;
> > > +	struct migrate_handler *pre;
> > > +	struct migrate_handler *post;
> > > +};
> > > +
> > > +int migrate_source_pre(struct ctx *c, struct migrate_meta *m);
> > > +int migrate_source(int fd, const struct migrate_meta *m);
> > > +void migrate_source_post(struct ctx *c, struct migrate_meta *m);
> > > +
> > > +int migrate_target_read_header(int fd, struct migrate_meta *m);
> > > +int migrate_target_pre(struct ctx *c, struct migrate_meta *m);
> > > +int migrate_target(int fd, const struct migrate_meta *m);
> > > +void migrate_target_post(struct ctx *c, struct migrate_meta *m);
> > > +
> > > +#endif /* MIGRATE_H */
> > > diff --git a/passt.c b/passt.c
> > > index b1c8ab6..184d4e5 100644
> > > --- a/passt.c
> > > +++ b/passt.c
> > > @@ -358,7 +358,7 @@ loop:
> > >  			vu_kick_cb(c.vdev, ref, &now);
> > >  			break;
> > >  		case EPOLL_TYPE_VHOST_MIGRATION:
> > > -			vu_migrate(c.vdev, eventmask);
> > > +			vu_migrate(&c, eventmask);  
> > 
> > Not really a comment on this patch, but do we actually need/want to
> > put the migration fd into the epoll loop?  IIUC, everything we're
> > doing on the migration fd is synchronous and blocking, so we could run
> > the whole migration from the VHOST_USER_SET_DEVICE_STATE callback.
> 
> The epoll event bits are practical to have here, though.

No, not really; the direction is explicitly given in the
VU_USER_SET_DEVICE_STATE_FD request.  Indeed, I'd say that's the
correct source for this information: in practice we're getting a
single-direction socket for the fd, but I don't know that the protocol
guarantees that.  AFAICT, in theory a front-end saving state to disk
could just hand over a handle to a regular file, meaning the epoll
bits tell us nothing.

> > Or, maybe less confusing, flag the migration to run before we next
> > call epoll_wait(): having the migration run strictly in-between epoll
> > cycles seems like it might make some things easier to analyse.
> 
> Well, we don't really *need* to have the descriptor in the epoll list,
> but it's more consistent. Should we really introduce a different
> mechanism just because we don't strictly need that?
> 
> What's the actual problem with the current implementation?

It suggests that events on the device state fd can be interspersed
with other events: that's kind of the whole point of epoll.  But
that's not true, we want to freeze the world as much as possible for
the migration.

More specifically the device state epoll event could occur in between,
say, an EPOLLIN for an external TCP socket, and the deferred handling
for that TCP flow.  I don't know if that could cause trouble, but it
seems like a case we're better off eliminating: we're always going to
have less variable state between epoll cycles than within an epoll
cycle.

[Not important right now, but that made me think of a potential
 complication if we ever try to implement migration over -net stream:
 at the moment reads & writes to the qemu socket are essentially
 independent of each other.  But if qemu were to make a migrate
 request over that, we'd need some sort of reply.  We can't really
 ensure that that reply is the first thing qemu sees after sending the
 request, because there might have been unrelated frames in flight
 already.  That means qemu would have to deal with those incoming
 frames *while the guest is already stopped* which sounds rather
 awkward.]

> > >  			break;
> > >  		default:
> > >  			/* Can't happen */
> > > diff --git a/vu_common.c b/vu_common.c
> > > index f43d8ac..6c346c8 100644
> > > --- a/vu_common.c
> > > +++ b/vu_common.c
> > > @@ -5,6 +5,7 @@
> > >   * common_vu.c - vhost-user common UDP and TCP functions
> > >   */
> > >  
> > > +#include <errno.h>
> > >  #include <unistd.h>
> > >  #include <sys/uio.h>
> > >  #include <sys/eventfd.h>
> > > @@ -17,6 +18,7 @@
> > >  #include "vhost_user.h"
> > >  #include "pcap.h"
> > >  #include "vu_common.h"
> > > +#include "migrate.h"
> > >  
> > >  #define VU_MAX_TX_BUFFER_NB	2
> > >  
> > > @@ -305,50 +307,90 @@ err:
> > >  }
> > >  
> > >  /**
> > > - * vu_migrate() - Send/receive passt insternal state to/from QEMU
> > > - * @vdev:	vhost-user device
> > > + * vu_migrate_source() - Migration as source, send state to hypervisor
> > > + * @c:		Execution context
> > > + * @fd:		File descriptor for state transfer
> > > + *
> > > + * Return: 0 on success, positive error code on failure
> > > + */
> > > +static int vu_migrate_source(struct ctx *c, int fd)
> > > +{
> > > +	struct migrate_meta m;
> > > +	int rc;
> > > +
> > > +	if ((rc = migrate_source_pre(c, &m))) {
> > > +		err("Source pre-migration failed: %s, abort", strerror_(rc));
> > > +		return rc;
> > > +	}
> > > +
> > > +	debug("Saving backend state");
> > > +
> > > +	rc = migrate_source(fd, &m);
> > > +	if (rc)
> > > +		err("Source migration failed: %s", strerror_(rc));
> > > +	else
> > > +		migrate_source_post(c, &m);
> > > +
> > > +	return rc;
> > > +}
> > > +
> > > +/**
> > > + * vu_migrate_target() - Migration as target, receive state from hypervisor
> > > + * @c:		Execution context
> > > + * @fd:		File descriptor for state transfer
> > > + *
> > > + * Return: 0 on success, positive error code on failure
> > > + */
> > > +static int vu_migrate_target(struct ctx *c, int fd)
> > > +{
> > > +	struct migrate_meta m;
> > > +	int rc;
> > > +
> > > +	rc = migrate_target_read_header(fd, &m);
> > > +	if (rc) {
> > > +		err("Migration header check failed: %s, abort", strerror_(rc));
> > > +		return rc;
> > > +	}
> > > +
> > > +	if ((rc = migrate_target_pre(c, &m))) {
> > > +		err("Target pre-migration failed: %s, abort", strerror_(rc));
> > > +		return rc;
> > > +	}
> > > +
> > > +	debug("Loading backend state");
> > > +
> > > +	rc = migrate_target(fd, &m);
> > > +	if (rc)
> > > +		err("Target migration failed: %s", strerror_(rc));
> > > +	else
> > > +		migrate_target_post(c, &m);
> > > +
> > > +	return rc;
> > > +}
> > > +
> > > +/**
> > > + * vu_migrate() - Send/receive passt internal state to/from QEMU
> > > + * @c:		Execution context
> > >   * @events:	epoll events
> > >   */
> > > -void vu_migrate(struct vu_dev *vdev, uint32_t events)
> > > +void vu_migrate(struct ctx *c, uint32_t events)
> > >  {
> > > -	int ret;
> > > +	struct vu_dev *vdev = c->vdev;
> > > +	int rc = EIO;
> > >  
> > > -	/* TODO: collect/set passt internal state
> > > -	 * and use vdev->device_state_fd to send/receive it
> > > -	 */
> > >  	debug("vu_migrate fd %d events %x", vdev->device_state_fd, events);
> > > -	if (events & EPOLLOUT) {
> > > -		debug("Saving backend state");
> > > -
> > > -		/* send some stuff */
> > > -		ret = write(vdev->device_state_fd, "PASST", 6);
> > > -		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
> > > -		vdev->device_state_result = ret == -1 ? -1 : 0;
> > > -		/* Closing the file descriptor signals the end of transfer */
> > > -		epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
> > > -			  vdev->device_state_fd, NULL);
> > > -		close(vdev->device_state_fd);
> > > -		vdev->device_state_fd = -1;
> > > -	} else if (events & EPOLLIN) {
> > > -		char buf[6];
> > > -
> > > -		debug("Loading backend state");
> > > -		/* read some stuff */
> > > -		ret = read(vdev->device_state_fd, buf, sizeof(buf));
> > > -		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
> > > -		if (ret != sizeof(buf)) {
> > > -			vdev->device_state_result = -1;
> > > -		} else {
> > > -			ret = strncmp(buf, "PASST", sizeof(buf));
> > > -			vdev->device_state_result = ret == 0 ? 0 : -1;
> > > -		}
> > > -	} else if (events & EPOLLHUP) {
> > > -		debug("Closing migration channel");
> > > -
> > > -		/* The end of file signals the end of the transfer. */
> > > -		epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
> > > -			  vdev->device_state_fd, NULL);
> > > -		close(vdev->device_state_fd);
> > > -		vdev->device_state_fd = -1;
> > > -	}
> > > +
> > > +	if (events & EPOLLOUT)
> > > +		rc = vu_migrate_source(c, vdev->device_state_fd);
> > > +	else if (events & EPOLLIN)
> > > +		rc = vu_migrate_target(c, vdev->device_state_fd);
> > > +
> > > +	/* EPOLLHUP without EPOLLIN/EPOLLOUT, or EPOLLERR? Migration failed */
> > > +
> > > +	vdev->device_state_result = rc;  
> > 
> > Again, not really a comment on this patch, but I think it would be
> > safer to set device_state_result to something non-zero (probably
> > EINPROGRESS) as soon as we get the VHOST_USER_SET_DEVICE_STATE_FD.
> > That way if somehow VHOST_USER_CHECK_DEVICE_STATE were called before
> > we actually do anything about the migration we'd correctly report that
> > it hasn't happened.
> 
> I see your point, but I'm not sure if it's correct to report
> EINPROGRESS while the migration is not actually in progress.

I mean.. it's in progress in the sense that the request has been
issued but we haven't done it yet.

> That is, if we get VHOST_USER_SET_DEVICE_STATE_FD, and then the
> hypervisor decides to just send VHOST_USER_CHECK_DEVICE_STATE right
> away, the migration isn't in progress. This is especially relevant as
> the vhost-user documentation says that:
> 
> 	"any non-zero value is an error"
> 
> Further, it also states, about VHOST_USER_CHECK_DEVICE_STATE, that:
> 
> 	The back-end responds once it knows whether the transfer and
> 	processing was successful or not.
>
> so, strictly speaking, we should probably *not* respond to that before
> we're done here. On the other hand, we risk triggering issues/hangs in
> the hypervisor just for the sake of being correct, so I would rather
> report 0 as it would be (not actually happening) done now.

*thinks about the mechanics of delaying a response to a vhost-user
request*
I guess we can just ignore it, return to the epoll loop and assume the
event will trigger on the next loop as well.  Looks like we're not
using EPOLLET on the VU socket, so that should work.

Does enything define how VHOST_USER_CHECK_DEVICE_STATE should respond
if no migration is requested?  I guess it should be success, so that
if for some reason it were called twice after a successful migration
it would return success both times?

I still think this works well with the idea of putting migration
between epoll cycles rather than within.  We add a pending_migration
flag which is either NONE, SOURCE, TARGET.
VHOST_USER_SET_DEVICE_STATE_FD sets it based on the direction flag.
After post_handler(), we check it and if non-NONE, we synchronously
complete the migration, then clear.  In VHOST_USER_CHECK_DEVICE_STATE
if it's non-NONE we do nothing and return to the epoll loop.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 7/8] Add interfaces and configuration bits for passt-repair
  2025-01-29  8:46     ` Stefano Brivio
@ 2025-01-30  1:33       ` David Gibson
  0 siblings, 0 replies; 20+ messages in thread
From: David Gibson @ 2025-01-30  1:33 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier

[-- Attachment #1: Type: text/plain, Size: 26584 bytes --]

On Wed, Jan 29, 2025 at 09:46:10AM +0100, Stefano Brivio wrote:
> On Wed, 29 Jan 2025 17:09:07 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Wed, Jan 29, 2025 at 12:39:39AM +0100, Stefano Brivio wrote:
> > > In vhost-user mode, by default, create a second UNIX domain socket
> > > accepting connections from passt-repair, with the usual listener
> > > socket.
> > > 
> > > When we need to set or clear TCP_REPAIR on sockets, we'll send them
> > > via SCM_RIGHTS to passt-repair, who sets the socket option values we
> > > ask for.
> > > 
> > > To that end, introduce batched functions to request TCP_REPAIR
> > > settings on sockets, so that we don't have to send a single message
> > > for each socket, on migration. When needed, repair_flush() will
> > > send the message and check for the reply.
> > > 
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > ---
> > >  Makefile     |  12 ++--
> > >  conf.c       |  46 ++++++++++--
> > >  epoll_type.h |   4 ++
> > >  passt.1      |  11 +++
> > >  passt.c      |   9 +++
> > >  passt.h      |   7 ++
> > >  repair.c     | 192 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  repair.h     |  16 +++++
> > >  tap.c        |  65 +----------------
> > >  util.c       |  62 +++++++++++++++++
> > >  util.h       |   1 +
> > >  11 files changed, 353 insertions(+), 72 deletions(-)
> > >  create mode 100644 repair.c
> > >  create mode 100644 repair.h
> > > 
> > > diff --git a/Makefile b/Makefile
> > > index 1b71cb0..f67a20b 100644
> > > --- a/Makefile
> > > +++ b/Makefile
> > > @@ -38,9 +38,9 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
> > >  
> > >  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> > >  	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
> > > -	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
> > > -	tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > > -	vhost_user.c virtio.c vu_common.c
> > > +	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c \
> > > +	repair.c tap.c tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c \
> > > +	udp_vu.c util.c vhost_user.c virtio.c vu_common.c
> > >  QRAP_SRCS = qrap.c
> > >  PASST_REPAIR_SRCS = passt-repair.c
> > >  SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
> > > @@ -50,9 +50,9 @@ MANPAGES = passt.1 pasta.1 qrap.1
> > >  PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
> > >  	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
> > >  	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
> > > -	pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
> > > -	tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
> > > -	vhost_user.h virtio.h vu_common.h
> > > +	pcap.h pif.h repair.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h \
> > > +	tcp_internal.h tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h \
> > > +	udp_vu.h util.h vhost_user.h virtio.h vu_common.h
> > >  HEADERS = $(PASST_HEADERS) seccomp.h
> > >  
> > >  C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
> > > diff --git a/conf.c b/conf.c
> > > index df2b016..85dec44 100644
> > > --- a/conf.c
> > > +++ b/conf.c
> > > @@ -816,6 +816,9 @@ static void usage(const char *name, FILE *f, int status)
> > >  			"    UNIX domain socket is provided by -s option\n"
> > >  			"  --print-capabilities	print back-end capabilities in JSON format,\n"
> > >  			"    only meaningful for vhost-user mode\n");
> > > +		FPRINTF(f,
> > > +			"  --repair-path PATH	path for passt-repair(1)\n"  
> > 
> > Nit: as a privileged helper, should it be passt-repair(8)?
> 
> So, I spent a couple of minutes on this as I wrote this, and I spent a
> bit longer now: the most authoritative definition I can find of section
> 8 is from man-pages(7):
> 
>        8 System management commands
>               Commands like mount(8), many of which only root can
>               execute.
> 
> It's not really a system management command, and the idea is to run it
> with CAP_NET_ADMIN, but not necessarily as root. So I would rather keep
> it in section 1, unless there's some other conflicting definition I'm
> not aware of.

Yeah oh, on that basis (1) makes more sense than (8).

> There's also the topic of where it should be installed (/sbin,
> /usr/sbin/, /bin, /usr/bin). I'd pick /usr/bin, because /sbin doesn't
> really mean much nowadays, and it's anyway fitting with the FHS 3.0:
> 
>   Utilities used for system administration (and other root-only
>   commands) are stored in /sbin, /usr/ sbin, and /usr/local/sbin.
>   
>   /sbin contains binaries essential for booting, restoring, recovering,
>   and/or repairing the system in addition to the binaries in /bin.
>   18 Programs executed after /usr is known to be mounted (when there
>   are no problems) are generally placed into /usr/sbin.
> 
> ...and I don't think this helper would qualify for /sbin or /usr/sbin.
> 
> > 
> > > +			"    default: append '.repair' to UNIX domain path\n");
> > >  	}
> > >  
> > >  	FPRINTF(f,
> > > @@ -1240,8 +1243,30 @@ static void conf_nat(const char *arg, struct in_addr *addr4,
> > >   */
> > >  static void conf_open_files(struct ctx *c)
> > >  {
> > > -	if (c->mode != MODE_PASTA && c->fd_tap == -1)
> > > -		c->fd_tap_listen = tap_sock_unix_open(c->sock_path);
> > > +	if (c->mode != MODE_PASTA && c->fd_tap == -1) {
> > > +		c->fd_tap_listen = sock_unix(c->sock_path);
> > > +
> > > +		if (c->mode == MODE_VU && strcmp(c->repair_path, "none")) {
> > > +			if (!strncmp(c->repair_path, "./", 2)) {
> > > +				memmove(c->repair_path, c->repair_path + 2,
> > > +					sizeof(c->repair_path) - 2);
> > > +			}  
> > 
> > Do you need this? Shouldn't "./whatever" be usable as-is?
> 
> Ah, yes, I didn't know. I explicitly added this for the '--repair-path
> ./none' case, but it's not actually needed. I'll drop this.
> 
> > > +
> > > +			if (!*c->repair_path &&
> > > +			    snprintf_check(c->repair_path,
> > > +					   sizeof(c->repair_path), "%s.repair",
> > > +					   c->sock_path)) {
> > > +				warn("passt-repair path %s not usable",
> > > +				     c->repair_path);  
> > 
> > I'd prefer a die() here - I think omitting a possibly expected
> > feature, with just a warning that could easily be lost in the logs is
> > not a good idea.
> 
> I was thinking about that, but should we really risk *not* starting
> because with ".repair" the path is now too long, for a feature that,
> realistically, most users won't actually use?
> 
> If we're started by any kind of framework (which is where we run the
> risk of the warning being ignored), then there should be explicit
> checks about the path and the usability of passt-repair (or equivalent).

Hm, yeah, I guess.  Honestly I'm not sure which option will cause us
less trouble.

> > > +				c->fd_repair_listen = -1;
> > > +			} else {
> > > +				c->fd_repair_listen = sock_unix(c->repair_path);
> > > +			}
> > > +		} else {
> > > +			c->fd_repair_listen = -1;
> > > +		}
> > > +		c->fd_repair = -1;
> > > +	}
> > >  
> > >  	if (*c->pidfile) {
> > >  		c->pidfile_fd = output_file_open(c->pidfile, O_WRONLY);
> > > @@ -1354,9 +1379,12 @@ void conf(struct ctx *c, int argc, char **argv)
> > >  		{"host-lo-to-ns-lo", no_argument, 	NULL,		23 },
> > >  		{"dns-host",	required_argument,	NULL,		24 },
> > >  		{"vhost-user",	no_argument,		NULL,		25 },
> > > +
> > >  		/* vhost-user backend program convention */
> > >  		{"print-capabilities", no_argument,	NULL,		26 },
> > >  		{"socket-path",	required_argument,	NULL,		's' },
> > > +
> > > +		{"repair-path",	required_argument,	NULL,		27 },
> > >  		{ 0 },
> > >  	};
> > >  	const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
> > > @@ -1824,8 +1852,8 @@ void conf(struct ctx *c, int argc, char **argv)
> > >  	if (c->ifi4 && IN4_IS_ADDR_UNSPECIFIED(&c->ip4.guest_gw))
> > >  		c->no_dhcp = 1;
> > >  
> > > -	/* Inbound port options & DNS can be parsed now (after IPv4/IPv6
> > > -	 * settings)
> > > +	/* Inbound port options, DNS, and --repair-path can be parsed now, after
> > > +	 * IPv4/IPv6 settings and --vhost-user.
> > >  	 */
> > >  	fwd_probe_ephemeral();
> > >  	udp_portmap_clear();
> > > @@ -1871,6 +1899,16 @@ void conf(struct ctx *c, int argc, char **argv)
> > >  			}
> > >  
> > >  			die("Cannot use DNS address %s", optarg);
> > > +		} else if (name == 27) {
> > > +			if (c->mode != MODE_VU && strcmp(optarg, "none"))
> > > +				die("--repair-path is for vhost-user mode only");
> > > +
> > > +			if (snprintf_check(c->repair_path,
> > > +					   sizeof(c->repair_path), "%s",
> > > +					   optarg))
> > > +				die("Invalid passt-repair path: %s", optarg);
> > > +
> > > +			break;
> > >  		}
> > >  	} while (name != -1);
> > >  
> > > diff --git a/epoll_type.h b/epoll_type.h
> > > index fd9eac3..706238a 100644
> > > --- a/epoll_type.h
> > > +++ b/epoll_type.h
> > > @@ -42,6 +42,10 @@ enum epoll_type {
> > >  	EPOLL_TYPE_VHOST_KICK,
> > >  	/* vhost-user migration socket */
> > >  	EPOLL_TYPE_VHOST_MIGRATION,
> > > +	/* TCP_REPAIR helper listening socket */
> > > +	EPOLL_TYPE_REPAIR_LISTEN,
> > > +	/* TCP_REPAIR helper socket */
> > > +	EPOLL_TYPE_REPAIR,
> > >  
> > >  	EPOLL_NUM_TYPES,
> > >  };
> > > diff --git a/passt.1 b/passt.1
> > > index d9cd33e..63a3a01 100644
> > > --- a/passt.1
> > > +++ b/passt.1
> > > @@ -418,6 +418,17 @@ Enable vhost-user. The vhost-user command socket is provided by \fB--socket\fR.
> > >  .BR \-\-print-capabilities
> > >  Print back-end capabilities in JSON format, only meaningful for vhost-user mode.
> > >  
> > > +.TP
> > > +.BR \-\-repair-path " " \fIpath
> > > +Path for UNIX domain socket used by the \fBpasst-repair\fR(1) helper to connect  
> > 
> > passt-repair(8)?
> 
> See above.
> 
> > > +to \fBpasst\fR in order to set or clear the TCP_REPAIR option on sockets, during
> > > +migration. \fB--repair-path none\fR disables this interface (if you need to
> > > +specify a socket path called "none" you can prefix the path by \fI./\fR).
> > > +
> > > +Default, for \-\-vhost-user mode only, is to append \fI.repair\fR to the path
> > > +chosen for the hypervisor UNIX domain socket. No socket is created if not in
> > > +\-\-vhost-user mode.
> > > +
> > >  .TP
> > >  .BR \-F ", " \-\-fd " " \fIFD
> > >  Pass a pre-opened, connected socket to \fBpasst\fR. Usually the socket is opened
> > > diff --git a/passt.c b/passt.c
> > > index 184d4e5..1fa2ddd 100644
> > > --- a/passt.c
> > > +++ b/passt.c
> > > @@ -51,6 +51,7 @@
> > >  #include "tcp_splice.h"
> > >  #include "ndp.h"
> > >  #include "vu_common.h"
> > > +#include "repair.h"
> > >  
> > >  #define EPOLL_EVENTS		8
> > >  
> > > @@ -76,6 +77,8 @@ char *epoll_type_str[] = {
> > >  	[EPOLL_TYPE_VHOST_CMD]		= "vhost-user command socket",
> > >  	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
> > >  	[EPOLL_TYPE_VHOST_MIGRATION]	= "vhost-user migration socket",
> > > +	[EPOLL_TYPE_REPAIR_LISTEN]	= "TCP_REPAIR helper listening socket",
> > > +	[EPOLL_TYPE_REPAIR]		= "TCP_REPAIR helper socket",
> > >  };
> > >  static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
> > >  	      "epoll_type_str[] doesn't match enum epoll_type");
> > > @@ -360,6 +363,12 @@ loop:
> > >  		case EPOLL_TYPE_VHOST_MIGRATION:
> > >  			vu_migrate(&c, eventmask);
> > >  			break;
> > > +		case EPOLL_TYPE_REPAIR_LISTEN:
> > > +			repair_listen_handler(&c, eventmask);
> > > +			break;
> > > +		case EPOLL_TYPE_REPAIR:
> > > +			repair_handler(&c, eventmask);
> > > +			break;
> > >  		default:
> > >  			/* Can't happen */
> > >  			ASSERT(0);
> > > diff --git a/passt.h b/passt.h
> > > index 0dd4efa..85b0a10 100644
> > > --- a/passt.h
> > > +++ b/passt.h
> > > @@ -20,6 +20,7 @@ union epoll_ref;
> > >  #include "siphash.h"
> > >  #include "ip.h"
> > >  #include "inany.h"
> > > +#include "migrate.h"
> > >  #include "flow.h"
> > >  #include "icmp.h"
> > >  #include "fwd.h"
> > > @@ -193,6 +194,7 @@ struct ip6_ctx {
> > >   * @foreground:		Run in foreground, don't log to stderr by default
> > >   * @nofile:		Maximum number of open files (ulimit -n)
> > >   * @sock_path:		Path for UNIX domain socket
> > > + * @repair_path:	TCP_REPAIR helper path, can be "none", empty for default
> > >   * @pcap:		Path for packet capture file
> > >   * @pidfile:		Path to PID file, empty string if not configured
> > >   * @pidfile_fd:		File descriptor for PID file, -1 if none
> > > @@ -203,6 +205,8 @@ struct ip6_ctx {
> > >   * @epollfd:		File descriptor for epoll instance
> > >   * @fd_tap_listen:	File descriptor for listening AF_UNIX socket, if any
> > >   * @fd_tap:		AF_UNIX socket, tuntap device, or pre-opened socket
> > > + * @fd_repair_listen:	File descriptor for listening TCP_REPAIR socket, if any
> > > + * @fd_repair:		Connected AF_UNIX socket for TCP_REPAIR helper
> > >   * @our_tap_mac:	Pasta/passt's MAC on the tap link
> > >   * @guest_mac:		MAC address of guest or namespace, seen or configured
> > >   * @hash_secret:	128-bit secret for siphash functions
> > > @@ -244,6 +248,7 @@ struct ctx {
> > >  	int foreground;
> > >  	int nofile;
> > >  	char sock_path[UNIX_PATH_MAX];
> > > +	char repair_path[UNIX_PATH_MAX];
> > >  	char pcap[PATH_MAX];
> > >  
> > >  	char pidfile[PATH_MAX];
> > > @@ -260,6 +265,8 @@ struct ctx {
> > >  	int epollfd;
> > >  	int fd_tap_listen;
> > >  	int fd_tap;
> > > +	int fd_repair_listen;
> > > +	int fd_repair;
> > >  	unsigned char our_tap_mac[ETH_ALEN];
> > >  	unsigned char guest_mac[ETH_ALEN];
> > >  	uint64_t hash_secret[2];
> > > diff --git a/repair.c b/repair.c
> > > new file mode 100644
> > > index 0000000..24966f5
> > > --- /dev/null
> > > +++ b/repair.c
> > > @@ -0,0 +1,192 @@
> > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > +
> > > +/* PASST - Plug A Simple Socket Transport
> > > + *  for qemu/UNIX domain socket mode
> > > + *
> > > + * PASTA - Pack A Subtle Tap Abstraction
> > > + *  for network namespace/tap device mode
> > > + *
> > > + * repair.c - Interface (server) for passt-repair, set/clear TCP_REPAIR
> > > + *
> > > + * Copyright (c) 2025 Red Hat GmbH
> > > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > > + */
> > > +
> > > +#include <errno.h>
> > > +#include <sys/uio.h>
> > > +
> > > +#include "util.h"
> > > +#include "ip.h"
> > > +#include "passt.h"
> > > +#include "inany.h"
> > > +#include "flow.h"
> > > +#include "flow_table.h"
> > > +
> > > +#include "repair.h"
> > > +
> > > +#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
> > > +
> > > +static int fds[SCM_MAX_FD];  
> > 
> > Even 'static', I'd prefer a longer name for a global variable.
> 
> A longer name for this also means a longer name for nfds below, which
> is not necessarily practical, but I'll check and try to change these to
> repair_fds[] / repair_nfds if doable.
> 
> > > +static int current_cmd;  
> > 
> > Is there any particular rationale behind these being globals, whereas
> > fd_repair is in struct ctx?  AFAICT they're basically equally global
> > in practice.
> 
> Yes: c->fd_repair_listen needs to be in struct ctx, and I want to keep
> this consistent with c->fd_tap_listen / c->fd_tap.
> 
> Besides, the variables declared here are really hacks representing
> state for this compilation unit only.

Oh, right.  Essentially these (static) globals are less global than
ctx, which is truly world global.  Ok, that makes sense.

> > > +static int nfds;
> > > +
> > > +/**
> > > + * repair_sock_init() - Start listening for connections on helper socket
> > > + * @c:		Execution context
> > > + */
> > > +void repair_sock_init(const struct ctx *c)
> > > +{
> > > +	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR_LISTEN };
> > > +	struct epoll_event ev = { 0 };
> > > +
> > > +	listen(c->fd_repair_listen, 0);
> > > +
> > > +	ref.fd = c->fd_repair_listen;
> > > +	ev.events = EPOLLIN | EPOLLHUP | EPOLLET;
> > > +	ev.data.u64 = ref.u64;
> > > +	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair_listen, &ev);
> > > +}
> > > +
> > > +/**
> > > + * repair_listen_handler() - Handle events on TCP_REPAIR helper listening socket
> > > + * @c:		Execution context
> > > + * @events:	epoll events
> > > + */
> > > +void repair_listen_handler(struct ctx *c, uint32_t events)
> > > +{
> > > +	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR };
> > > +	struct epoll_event ev = { 0 };
> > > +	struct ucred ucred;
> > > +	socklen_t len;
> > > +
> > > +	if (events != EPOLLIN) {
> > > +		debug("Spurious event 0x%04x on TCP_REPAIR helper socket",
> > > +		      events);
> > > +		return;
> > > +	}
> > > +
> > > +	len = sizeof(ucred);
> > > +
> > > +	/* Another client is already connected: accept and close right away. */  
> > 
> > For the repair socket, last-connection-wins would make more sense to
> > me than first-connection-wins.  While hacking/debugging seems it might
> > be useful to fix something in the passt-repair, re-run it and have it
> > displace the stale version for existing passt instances.
> 
> In the whole debugging I've done so far I'm actually using passt-repair
> (otherwise I don't even start it), so it already terminates once it's
> done.
> 
> I think it's more secure to keep it like this, because it's more robust
> against races.
> 
> Let's say we have an issue in KubeVirt's virt-handler so that the
> migration starts a bit before the helper is started, and somebody
> manages to use this time window to connect another helper, this attempt
> will actually go unnoticed.
> 
> If passt-repair fails to connect, it will be very obvious.

Ok, you convinced me.

> > > +	if (c->fd_repair != -1) {
> > > +		int discard = accept4(c->fd_repair_listen, NULL, NULL,
> > > +				      SOCK_NONBLOCK);
> > > +
> > > +		if (discard == -1)
> > > +			return;
> > > +
> > > +		if (!getsockopt(discard, SOL_SOCKET, SO_PEERCRED, &ucred, &len))
> > > +			info("Discarding TCP_REPAIR helper, PID %i", ucred.pid);
> > > +
> > > +		close(discard);
> > > +		return;
> > > +	}
> > > +
> > > +	c->fd_repair = accept4(c->fd_repair_listen, NULL, NULL, 0);
> > > +
> > > +	if (!getsockopt(c->fd_repair, SOL_SOCKET, SO_PEERCRED, &ucred, &len))
> > > +		info("Accepted TCP_REPAIR helper, PID %i", ucred.pid);
> > > +
> > > +	ref.fd = c->fd_repair;
> > > +	ev.events = EPOLLHUP | EPOLLET;
> > > +	ev.data.u64 = ref.u64;
> > > +	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair, &ev);
> > > +}
> > > +
> > > +/**
> > > + * repair_close() - Close connection to TCP_REPAIR helper
> > > + * @c:		Execution context
> > > + */
> > > +void repair_close(struct ctx *c)
> > > +{
> > > +	debug("Closing TCP_REPAIR helper socket");
> > > +
> > > +	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_repair, NULL);
> > > +	close(c->fd_repair);
> > > +	c->fd_repair = -1;
> > > +}
> > > +
> > > +/**
> > > + * repair_handler() - Handle EPOLLHUP and EPOLLERR on TCP_REPAIR helper socket
> > > + * @c:		Execution context
> > > + * @events:	epoll events
> > > + */
> > > +void repair_handler(struct ctx *c, uint32_t events)
> > > +{
> > > +	(void)events;
> > > +
> > > +	repair_close(c);
> > > +}
> > > +
> > > +/**
> > > + * repair_flush() - Flush current set of sockets to helper, with current command
> > > + * @c:		Execution context
> > > + *
> > > + * Return: 0 on success, negative error code on failure
> > > + */
> > > +int repair_flush(struct ctx *c)
> > > +{
> > > +	struct iovec iov = { &((int8_t){ current_cmd }), sizeof(int8_t) };
> > > +	char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
> > > +	     __attribute__ ((aligned(__alignof__(struct cmsghdr))));
> > > +	struct cmsghdr *cmsg;
> > > +	struct msghdr msg;
> > > +	int ret = 0;
> > > +
> > > +	if (!nfds)
> > > +		return 0;
> > > +
> > > +	msg = (struct msghdr){ NULL, 0, &iov, 1,
> > > +			       buf, CMSG_SPACE(sizeof(int) * nfds), 0 };
> > > +	cmsg = CMSG_FIRSTHDR(&msg);
> > > +
> > > +	cmsg->cmsg_level = SOL_SOCKET;
> > > +	cmsg->cmsg_type = SCM_RIGHTS;
> > > +	cmsg->cmsg_len = CMSG_LEN(sizeof(int) * nfds);
> > > +	memcpy(CMSG_DATA(cmsg), fds, sizeof(int) * nfds);
> > > +
> > > +	nfds = 0;
> > > +
> > > +	if (sendmsg(c->fd_repair, &msg, 0) < 0) {
> > > +		ret = -errno;  
> > 
> > This error code won't be reported to the caller: you'll continue on to
> > the recv() below, which will return EBADF, clobbering ret.
> 
> Oops, right, I'll return early in this case (or equivalent).
> 
> > > +		err_perror("Failed to send sockets to TCP_REPAIR helper");
> > > +		repair_close(c);
> > > +	}
> > > +
> > > +	if (recv(c->fd_repair, &((int8_t){ 0 }), 1, 0) < 0) {
> > > +		ret = -errno;
> > > +		err_perror("Failed to receive reply from TCP_REPAIR helper");
> > > +		repair_close(c);
> > > +	}
> > > +
> > > +	return ret;
> > > +}
> > > +
> > > +/**
> > > + * repair_flush() - Add socket to TCP_REPAIR set with given command
> > > + * @c:		Execution context
> > > + * @s:		Socket to add
> > > + * @cmd:	TCP_REPAIR_ON, TCP_REPAIR_OFF, or TCP_REPAIR_OFF_NO_WP
> > > + *
> > > + * Return: 0 on success, negative error code on failure
> > > + */
> > > +/* cppcheck-suppress unusedFunction */
> > > +int repair_set(struct ctx *c, int s, int cmd)
> > > +{
> > > +	int rc;
> > > +
> > > +	if (nfds && current_cmd != cmd) {
> > > +		if ((rc = repair_flush(c)))
> > > +			return rc;
> > > +	}
> > > +
> > > +	current_cmd = cmd;
> > > +	fds[nfds++] = s;
> > > +
> > > +	if (nfds >= SCM_MAX_FD) {
> > > +		if ((rc = repair_flush(c)))
> > > +			return rc;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > diff --git a/repair.h b/repair.h
> > > new file mode 100644
> > > index 0000000..693c515
> > > --- /dev/null
> > > +++ b/repair.h
> > > @@ -0,0 +1,16 @@
> > > +/* SPDX-License-Identifier: GPL-2.0-or-later
> > > + * Copyright (c) 2025 Red Hat GmbH
> > > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > > + */
> > > + 
> > > +#ifndef REPAIR_H
> > > +#define REPAIR_H
> > > +
> > > +void repair_sock_init(const struct ctx *c);
> > > +void repair_listen_handler(struct ctx *c, uint32_t events);
> > > +void repair_handler(struct ctx *c, uint32_t events);
> > > +void repair_close(struct ctx *c);
> > > +int repair_flush(struct ctx *c);
> > > +int repair_set(struct ctx *c, int s, int cmd);
> > > +
> > > +#endif /* REPAIR_H */
> > > diff --git a/tap.c b/tap.c
> > > index cd32a90..0e60eb4 100644
> > > --- a/tap.c
> > > +++ b/tap.c
> > > @@ -56,6 +56,7 @@
> > >  #include "netlink.h"
> > >  #include "pasta.h"
> > >  #include "packet.h"
> > > +#include "repair.h"
> > >  #include "tap.h"
> > >  #include "log.h"
> > >  #include "vhost_user.h"
> > > @@ -1151,68 +1152,6 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
> > >  		tap_pasta_input(c, now);
> > >  }
> > >  
> > > -/**
> > > - * tap_sock_unix_open() - Create and bind AF_UNIX socket
> > > - * @sock_path:	Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
> > > - *
> > > - * Return: socket descriptor on success, won't return on failure
> > > - */
> > > -int tap_sock_unix_open(char *sock_path)
> > > -{
> > > -	int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
> > > -	struct sockaddr_un addr = {
> > > -		.sun_family = AF_UNIX,
> > > -	};
> > > -	int i;
> > > -
> > > -	if (fd < 0)
> > > -		die_perror("Failed to open UNIX domain socket");
> > > -
> > > -	for (i = 1; i < UNIX_SOCK_MAX; i++) {
> > > -		char *path = addr.sun_path;
> > > -		int ex, ret;
> > > -
> > > -		if (*sock_path)
> > > -			memcpy(path, sock_path, UNIX_PATH_MAX);
> > > -		else if (snprintf_check(path, UNIX_PATH_MAX - 1,
> > > -					UNIX_SOCK_PATH, i))
> > > -			die_perror("Can't build UNIX domain socket path");
> > > -
> > > -		ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC,
> > > -			    0);
> > > -		if (ex < 0)
> > > -			die_perror("Failed to check for UNIX domain conflicts");
> > > -
> > > -		ret = connect(ex, (const struct sockaddr *)&addr, sizeof(addr));
> > > -		if (!ret || (errno != ENOENT && errno != ECONNREFUSED &&
> > > -			     errno != EACCES)) {
> > > -			if (*sock_path)
> > > -				die("Socket path %s already in use", path);
> > > -
> > > -			close(ex);
> > > -			continue;
> > > -		}
> > > -		close(ex);
> > > -
> > > -		unlink(path);
> > > -		ret = bind(fd, (const struct sockaddr *)&addr, sizeof(addr));
> > > -		if (*sock_path && ret)
> > > -			die_perror("Failed to bind UNIX domain socket");
> > > -
> > > -		if (!ret)
> > > -			break;
> > > -	}
> > > -
> > > -	if (i == UNIX_SOCK_MAX)
> > > -		die_perror("Failed to bind UNIX domain socket");
> > > -
> > > -	info("UNIX domain socket bound at %s", addr.sun_path);
> > > -	if (!*sock_path)
> > > -		memcpy(sock_path, addr.sun_path, UNIX_PATH_MAX);
> > > -
> > > -	return fd;
> > > -}
> > > -
> > >  /**
> > >   * tap_backend_show_hints() - Give help information to start QEMU
> > >   * @c:		Execution context
> > > @@ -1423,6 +1362,8 @@ void tap_backend_init(struct ctx *c)
> > >  		tap_sock_tun_init(c);
> > >  		break;
> > >  	case MODE_VU:
> > > +		repair_sock_init(c);
> > > +		/* fall through */
> > >  	case MODE_PASST:
> > >  		tap_sock_unix_init(c);
> > >  
> > > diff --git a/util.c b/util.c
> > > index 36857d4..e98da74 100644
> > > --- a/util.c
> > > +++ b/util.c
> > > @@ -178,6 +178,68 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
> > >  	return fd;
> > >  }
> > >  
> > > +/**
> > > + * sock_unix() - Create and bind AF_UNIX socket
> > > + * @sock_path:	Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
> > > + *
> > > + * Return: socket descriptor on success, won't return on failure
> > > + */  
> > 
> > I like making tap_sock_unix_open() more general.  Could be split into
> > a separate patch.
> 
> I need it here though, and keeping it here saves lines in the commit
> message...

Oh very well.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2025-01-30  1:33 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-28 23:39 [PATCH v2 0/8] Draft, incomplete series introducing state migration Stefano Brivio
2025-01-28 23:39 ` [PATCH v2 1/8] icmp, udp: Pad time_t timestamp to 64-bit to ease " Stefano Brivio
2025-01-29  1:34   ` David Gibson
2025-01-28 23:39 ` [PATCH v2 2/8] flow, flow_table: Pad flow table entries to 128 bytes, hash entries to 32 bits Stefano Brivio
2025-01-29  1:35   ` David Gibson
2025-01-28 23:39 ` [PATCH v2 3/8] flow_table: Use size in extern declaration for flowtab Stefano Brivio
2025-01-28 23:39 ` [PATCH v2 4/8] util: Add read_remainder() and read_all_buf() Stefano Brivio
2025-01-29  1:37   ` David Gibson
2025-01-28 23:39 ` [PATCH v2 5/8] Introduce facilities for guest migration on top of vhost-user infrastructure Stefano Brivio
2025-01-29  5:41   ` David Gibson
2025-01-29  8:46     ` Stefano Brivio
2025-01-30  1:28       ` David Gibson
2025-01-28 23:39 ` [PATCH v2 6/8] Introduce passt-repair Stefano Brivio
2025-01-28 23:39 ` [PATCH v2 7/8] Add interfaces and configuration bits for passt-repair Stefano Brivio
2025-01-29  6:09   ` David Gibson
2025-01-29  8:46     ` Stefano Brivio
2025-01-30  1:33       ` David Gibson
2025-01-28 23:39 ` [PATCH v2 8/8] flow, tcp: Basic pre-migration source handler to dump sequence numbers Stefano Brivio
2025-01-29  6:15   ` David Gibson
2025-01-29  8:46     ` Stefano Brivio

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).