* [PATCH 0/7] Draft, incomplete series introducing state migration
@ 2025-01-27 23:15 Stefano Brivio
2025-01-27 23:15 ` [PATCH 1/7] icmp, udp: Pad time_t timestamp to 64-bit to ease " Stefano Brivio
` (6 more replies)
0 siblings, 7 replies; 41+ messages in thread
From: Stefano Brivio @ 2025-01-27 23:15 UTC
To: passt-dev; +Cc: Laurent Vivier, David Gibson
This is obviously incomplete. I have code on top of this,
not really working yet, with a loop on transferred flows,
and an implementation matching passt-repair, requesting to
enable/disable the TCP_REPAIR option as needed, as well as
setting/receiving sequences.
I'm sending this for early review/rework/rewrite/whatever.
What's here should all be tested and working.
Adding:
{ &flow_first_free, sizeof(flow_first_free) },
{ flowtab, sizeof(flowtab) },
to data version 1 in 6/7 will properly transfer those sections.
Declaring functions and assigning pointers such as:
{ flow_migrate_source_pre, NULL },
{ flow_migrate_source_post, NULL },
{ flow_migrate_target_post_v1, NULL },
also executes them. The passt-repair helper in 7/7 is (lightly)
tested against a stand-alone source/target implementation which
I'll share in a bit.
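As a reading aid for the two kinds of tables mentioned above, here is a minimal, stand-alone sketch of how a terminated array of data sections and a terminated array of handlers can be walked. All the ex_* names are hypothetical and only mirror the shape of the tables in 6/7; this is not the code from the series.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical stand-ins for migration metadata and handler entries */
struct ex_meta { int calls; };

struct ex_handler {
	int (*fn)(struct ex_meta *m, void *data);
	void *data;
};

static int ex_pre(struct ex_meta *m, void *data)
{
	(void)data;
	m->calls++;
	return 0;
}

static int ex_post(struct ex_meta *m, void *data)
{
	(void)data;
	m->calls++;
	return 0;
}

/* Hypothetical state to transfer */
static unsigned ex_first_free;
static char ex_table[256];

/* Data sections: an entry with zero length terminates the array */
static struct iovec ex_sections[] = {
	{ &ex_first_free, sizeof(ex_first_free) },
	{ ex_table, sizeof(ex_table) },
	{ 0 },
};

/* Handlers: an entry with a NULL function pointer terminates the array */
static struct ex_handler ex_handlers[] = {
	{ ex_pre, NULL },
	{ ex_post, NULL },
	{ 0 },
};

/* Count sections up to, and not including, the terminator */
static size_t ex_count_sections(const struct iovec *iov)
{
	size_t n;

	for (n = 0; iov[n].iov_len; n++)
		;
	return n;
}

/* Call handlers in order, stopping at the first non-zero return value */
static int ex_run_handlers(const struct ex_handler *h, struct ex_meta *m)
{
	for (; h->fn; h++) {
		int rc = h->fn(m, h->data);

		if (rc)
			return rc;
	}
	return 0;
}
```

Both walks rely on the terminating entry being present, so any table extended this way needs to keep its final zeroed element.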
Stefano Brivio (7):
icmp, udp: Pad time_t timestamp to 64-bit to ease state migration
flow, flow_table: Pad flow table entries to 128 bytes, hash entries to
32 bits
tcp_conn: Avoid 7-bit hole in struct tcp_splice_conn
flow_table: Use size in extern declaration for flowtab
util: Add read_remainder() and read_all_buf()
Introduce facilities for guest migration on top of vhost-user
infrastructure
Introduce passt-repair
Makefile | 22 +++--
flow.h | 18 ++--
flow_table.h | 15 ++-
icmp_flow.h | 6 +-
migrate.c | 259 +++++++++++++++++++++++++++++++++++++++++++++++++
migrate.h | 90 +++++++++++++++++
passt-repair.c | 111 +++++++++++++++++++++
passt.c | 2 +-
tcp_conn.h | 2 +-
udp_flow.h | 6 +-
util.c | 70 +++++++++++++
util.h | 2 +
vu_common.c | 122 +++++++++++++++--------
vu_common.h | 2 +-
14 files changed, 662 insertions(+), 65 deletions(-)
create mode 100644 migrate.c
create mode 100644 migrate.h
create mode 100644 passt-repair.c
--
2.43.0
* [PATCH 1/7] icmp, udp: Pad time_t timestamp to 64-bit to ease state migration
2025-01-27 23:15 [PATCH 0/7] Draft, incomplete series introducing state migration Stefano Brivio
@ 2025-01-27 23:15 ` Stefano Brivio
2025-01-28 0:49 ` David Gibson
2025-01-27 23:15 ` [PATCH 2/7] flow, flow_table: Pad flow table entries to 128 bytes, hash entries to 32 bits Stefano Brivio
` (5 subsequent siblings)
6 siblings, 1 reply; 41+ messages in thread
From: Stefano Brivio @ 2025-01-27 23:15 UTC
To: passt-dev; +Cc: Laurent Vivier, David Gibson
That's the only field in flows with different storage sizes depending
on the architecture: it's usually 4-byte wide on 32-bit architectures,
except for arc and x32 where it's 8 bytes, and 8-byte wide on 64-bit
machines.
By keeping flow entries the same size across architectures, we avoid
having to expand or shrink table entries upon migration.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
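For illustration only, a minimal sketch of the padding technique described above, using a hypothetical struct example_flow rather than the real flow types. The anonymous union keeps the existing time_t member usable while forcing 64 bits of storage regardless of the width of time_t. The helper that clears the storage before writing is an assumption about how one might avoid transferring stale padding bytes; it is not part of this patch.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical flow type, not the real struct udp_flow */
struct example_flow {
	int s[2];
	union {
		time_t ts;		/* Used by the code as before */
		uint64_t ts_storage;	/* Forces 8 bytes of storage */
	};
};

/* Holds whether time_t is 4 or 8 bytes wide on this architecture */
static_assert(sizeof(((struct example_flow *)0)->ts_storage) == 8,
	      "timestamp storage must be 64-bit");

/* example_flow_touch() - Record activity, clearing padding bytes first */
static void example_flow_touch(struct example_flow *f, time_t now)
{
	f->ts_storage = 0;	/* No stale padding if time_t is 32-bit */
	f->ts = now;
}
```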
icmp_flow.h | 6 +++++-
udp_flow.h | 6 +++++-
2 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/icmp_flow.h b/icmp_flow.h
index fb93801..da7e255 100644
--- a/icmp_flow.h
+++ b/icmp_flow.h
@@ -13,6 +13,7 @@
* @seq: Last sequence number sent to tap, host order, -1: not sent yet
* @sock: "ping" socket
* @ts: Last associated activity from tap, seconds
+ * @ts_storage: Pad @ts to 64-bit storage to keep state migration sane
*/
struct icmp_ping_flow {
/* Must be first element */
@@ -20,7 +21,10 @@ struct icmp_ping_flow {
int seq;
int sock;
- time_t ts;
+ union {
+ time_t ts;
+ uint64_t ts_storage;
+ };
};
bool icmp_ping_timer(const struct ctx *c, const struct icmp_ping_flow *pingf,
diff --git a/udp_flow.h b/udp_flow.h
index 9a1b059..9cb79a0 100644
--- a/udp_flow.h
+++ b/udp_flow.h
@@ -12,6 +12,7 @@
* @f: Generic flow information
* @closed: Flow is already closed
* @ts: Activity timestamp
+ * @ts_storage: Pad @ts to 64-bit storage to keep state migration sane
* @s: Socket fd (or -1) for each side of the flow
*/
struct udp_flow {
@@ -19,7 +20,10 @@ struct udp_flow {
struct flow_common f;
bool closed :1;
- time_t ts;
+ union {
+ time_t ts;
+ uint64_t ts_storage;
+ };
int s[SIDES];
};
--
2.43.0
* [PATCH 2/7] flow, flow_table: Pad flow table entries to 128 bytes, hash entries to 32 bits
2025-01-27 23:15 [PATCH 0/7] Draft, incomplete series introducing state migration Stefano Brivio
2025-01-27 23:15 ` [PATCH 1/7] icmp, udp: Pad time_t timestamp to 64-bit to ease " Stefano Brivio
@ 2025-01-27 23:15 ` Stefano Brivio
2025-01-28 0:50 ` David Gibson
2025-01-27 23:15 ` [PATCH 3/7] tcp_conn: Avoid 7-bit hole in struct tcp_splice_conn Stefano Brivio
` (4 subsequent siblings)
6 siblings, 1 reply; 41+ messages in thread
From: Stefano Brivio @ 2025-01-27 23:15 UTC
To: passt-dev; +Cc: Laurent Vivier, David Gibson
...to keep migration sane. Right now, the biggest struct in union flow
is struct tcp_splice_conn with 120 bytes on x86_64, which should also
have the biggest storage and alignment requirements of any
architecture we might run on.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
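A minimal sketch of the two padding patterns used in this patch, with hypothetical ex_* types. EX_FLOW_INDEX_BITS is set to 17 purely as an assumed value for the example, and the variant structs are placeholders: the point is only that a storage member inside the union fixes the size, and static_assert() makes the build fail if a variant outgrows it.

```c
#include <assert.h>
#include <stdint.h>

#define EX_FLOW_INDEX_BITS	17	/* Assumed value, for illustration only */

/* Bit-fields overlaid with a 32-bit storage member */
typedef struct ex_sidx {
	union {
		struct {
			unsigned sidei :1;
			unsigned flowi :EX_FLOW_INDEX_BITS;
		};
		uint32_t storage;	/* Pads the bit-fields to 32 bits */
	};
} ex_sidx_t;

static_assert(sizeof(ex_sidx_t) == sizeof(uint32_t),
	      "ex_sidx_t must be 32-bit wide");

/* Placeholder variants of different sizes */
struct ex_small { int a; };
struct ex_big   { char b[120]; };

/* Every entry takes 128 bytes, whichever variant is in use */
union ex_flow {
	struct ex_small small;
	struct ex_big big;
	char storage[128];
};

static_assert(sizeof(union ex_flow) == 128,
	      "union ex_flow should be 128-byte wide");
```

If a variant ever grew beyond 128 bytes, the second assertion would fail at build time instead of silently changing the size of table entries.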
flow.h | 18 ++++++++++++------
flow_table.h | 13 ++++++++++---
2 files changed, 22 insertions(+), 9 deletions(-)
diff --git a/flow.h b/flow.h
index 24ba3ef..8eb5964 100644
--- a/flow.h
+++ b/flow.h
@@ -202,15 +202,21 @@ struct flow_common {
/**
* struct flow_sidx - ID for one side of a specific flow
- * @sidei: Index of side referenced (0 or 1)
- * @flowi: Index of flow referenced
+ * @sidei: Index of side referenced (0 or 1)
+ * @flowi: Index of flow referenced
+ * @flow_sidx_storage: Pad to 32 bits
*/
typedef struct flow_sidx {
- unsigned sidei :1;
- unsigned flowi :FLOW_INDEX_BITS;
+ union {
+ struct {
+ unsigned sidei :1;
+ unsigned flowi :FLOW_INDEX_BITS;
+ };
+ uint32_t flow_sidx_storage;
+ };
} flow_sidx_t;
-static_assert(sizeof(flow_sidx_t) <= sizeof(uint32_t),
- "flow_sidx_t must fit within 32 bits");
+static_assert(sizeof(flow_sidx_t) == sizeof(uint32_t),
+ "flow_sidx_t must be 32-bit wide");
#define FLOW_SIDX_NONE ((flow_sidx_t){ .flowi = FLOW_MAX })
diff --git a/flow_table.h b/flow_table.h
index f15db53..007f4dd 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -26,9 +26,13 @@ struct flow_free_cluster {
/**
* union flow - Descriptor for a logical packet flow (e.g. connection)
- * @f: Fields common between all variants
- * @tcp: Fields for non-spliced TCP connections
- * @tcp_splice: Fields for spliced TCP connections
+ * @f: Fields common between all variants
+ * @free: Entry in a cluster of free entries
+ * @tcp: Fields for non-spliced TCP connections
+ * @tcp_splice: Fields for spliced TCP connections
+ * @ping: Tracking for ping flows
+ * @udp: Tracking for UDP flows
+ * @flow_storage: Pad flow entries to 128 bytes to ease state migration
*/
union flow {
struct flow_common f;
@@ -37,8 +41,11 @@ union flow {
struct tcp_splice_conn tcp_splice;
struct icmp_ping_flow ping;
struct udp_flow udp;
+ char flow_storage[128];
};
+static_assert(sizeof(union flow) == 128, "union flow should be 128-byte wide");
+
/* Global Flow Table */
extern unsigned flow_first_free;
extern union flow flowtab[];
--
2.43.0
* [PATCH 3/7] tcp_conn: Avoid 7-bit hole in struct tcp_splice_conn
2025-01-27 23:15 [PATCH 0/7] Draft, incomplete series introducing state migration Stefano Brivio
2025-01-27 23:15 ` [PATCH 1/7] icmp, udp: Pad time_t timestamp to 64-bit to ease " Stefano Brivio
2025-01-27 23:15 ` [PATCH 2/7] flow, flow_table: Pad flow table entries to 128 bytes, hash entries to 32 bits Stefano Brivio
@ 2025-01-27 23:15 ` Stefano Brivio
2025-01-28 0:53 ` David Gibson
2025-01-27 23:15 ` [PATCH 4/7] flow_table: Use size in extern declaration for flowtab Stefano Brivio
` (3 subsequent siblings)
6 siblings, 1 reply; 41+ messages in thread
From: Stefano Brivio @ 2025-01-27 23:15 UTC
To: passt-dev; +Cc: Laurent Vivier, David Gibson
Moving in_epoll out of the common flow data created a 7-bit hole in
struct tcp_splice_conn: repack by shrinking @flags by one (otherwise
unused) bit.
Fixes: b60fa33eeafb ("tcp: Move in_epoll flag out of common connection structure")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
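To illustrate the repacking, a small sketch with two hypothetical layouts (not the real struct tcp_splice_conn). It assumes a compiler that accepts uint8_t and bool bit-fields, as GCC and Clang do. The exact layout of bit-fields is implementation-defined, so the sketch only shows the intent: giving up one bit of the flags lets the single-bit member share their byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout with a hole */
struct ex_holey {
	uint8_t events;
	uint8_t flags;		/* Takes a full byte */
	bool in_epoll :1;	/* Starts a new byte, 7 bits left unused */
};

/* Hypothetical repacked layout */
struct ex_packed {
	uint8_t events;
	uint8_t flags :7;	/* One bit given up... */
	bool in_epoll :1;	/* ...so that this can share the same byte */
};
```

This only works as long as no flag value needs the eighth bit.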
tcp_conn.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tcp_conn.h b/tcp_conn.h
index d342680..3d06e2c 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -125,7 +125,7 @@ struct tcp_splice_conn {
#define FIN_RCVD(sidei_) ((sidei_) ? BIT(5) : BIT(4))
#define FIN_SENT(sidei_) ((sidei_) ? BIT(7) : BIT(6))
- uint8_t flags;
+ uint8_t flags :7;
#define RCVLOWAT_SET(sidei_) ((sidei_) ? BIT(1) : BIT(0))
#define RCVLOWAT_ACT(sidei_) ((sidei_) ? BIT(3) : BIT(2))
#define CLOSING BIT(4)
--
2.43.0
* [PATCH 4/7] flow_table: Use size in extern declaration for flowtab
2025-01-27 23:15 [PATCH 0/7] Draft, incomplete series introducing state migration Stefano Brivio
` (2 preceding siblings ...)
2025-01-27 23:15 ` [PATCH 3/7] tcp_conn: Avoid 7-bit hole in struct tcp_splice_conn Stefano Brivio
@ 2025-01-27 23:15 ` Stefano Brivio
2025-01-27 23:15 ` [PATCH 5/7] util: Add read_remainder() and read_all_buf() Stefano Brivio
` (2 subsequent siblings)
6 siblings, 0 replies; 41+ messages in thread
From: Stefano Brivio @ 2025-01-27 23:15 UTC
To: passt-dev; +Cc: Laurent Vivier, David Gibson
...so that we can use sizeof() on it.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
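A short sketch of why the bound matters, with hypothetical ex_* names and an assumed, small EX_FLOW_MAX. With "extern union ex_flow ex_flowtab[];" the array type is incomplete where only the declaration is visible, and sizeof() on it does not compile. With the bound in the declaration, sizeof() works in any file that includes the header.

```c
#include <assert.h>
#include <stddef.h>

#define EX_FLOW_MAX	8	/* Assumed, small value for illustration */

union ex_flow { char storage[128]; };

/* What a header would declare: with the bound, the type is complete */
extern union ex_flow ex_flowtab[EX_FLOW_MAX];

/* Usable wherever only the declaration above is visible */
static const size_t ex_flowtab_bytes = sizeof(ex_flowtab);

/* What exactly one .c file would define */
union ex_flow ex_flowtab[EX_FLOW_MAX];
```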
flow_table.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/flow_table.h b/flow_table.h
index 007f4dd..a85cab5 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -48,7 +48,7 @@ static_assert(sizeof(union flow) == 128, "union flow should be 128-byte wide");
/* Global Flow Table */
extern unsigned flow_first_free;
-extern union flow flowtab[];
+extern union flow flowtab[FLOW_MAX];
/**
* flow_foreach_sidei() - 'for' type macro to step through each side of flow
--
2.43.0
* [PATCH 5/7] util: Add read_remainder() and read_all_buf()
2025-01-27 23:15 [PATCH 0/7] Draft, incomplete series introducing state migration Stefano Brivio
` (3 preceding siblings ...)
2025-01-27 23:15 ` [PATCH 4/7] flow_table: Use size in extern declaration for flowtab Stefano Brivio
@ 2025-01-27 23:15 ` Stefano Brivio
2025-01-28 0:59 ` David Gibson
2025-01-27 23:15 ` [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure Stefano Brivio
2025-01-27 23:15 ` [PATCH 7/7] Introduce passt-repair Stefano Brivio
6 siblings, 1 reply; 41+ messages in thread
From: Stefano Brivio @ 2025-01-27 23:15 UTC
To: passt-dev; +Cc: Laurent Vivier, David Gibson
These are symmetric to write_remainder() and write_all_buf() and
almost a copy and paste of them, with the most notable differences
being reversed reads/writes and a couple of better-safe-than-sorry
asserts to keep Coverity happy.
I'll use them in the next patch. At least for the moment, they're
going to be used for vhost-user mode only, so I'm not unconditionally
enabling readv() in the seccomp profile: the caller has to ensure it's
there.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
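For comparison, a stand-alone sketch of a read-until-full loop, under a hypothetical name, ex_read_full(). Note that read() returns 0 at end-of-file, and the loop in read_all_buf() as posted does not treat that case specially. The variant below reports a premature end-of-file as an error. Both that behaviour and the choice of ENODATA as error code are assumptions made for the example, not something this patch does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* ex_read_full() - Hypothetical helper: fill the whole buffer, or fail
 * Return: 0 on success, -1 on error or premature end-of-file (errno set)
 */
static int ex_read_full(int fd, void *buf, size_t len)
{
	char *p = buf;

	while (len) {
		ssize_t rc;

		/* Retry if interrupted by a signal */
		do
			rc = read(fd, p, len);
		while (rc < 0 && errno == EINTR);

		if (rc < 0)
			return -1;

		if (rc == 0) {		/* EOF before @len bytes arrived */
			errno = ENODATA;	/* Arbitrary choice */
			return -1;
		}

		p += rc;
		len -= rc;
	}
	return 0;
}
```

A pipe is enough to exercise both paths: write a few bytes and read them back, then close the write end and check that a further read fails.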
util.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
util.h | 2 ++
2 files changed, 72 insertions(+)
diff --git a/util.c b/util.c
index 11973c4..085937b 100644
--- a/util.c
+++ b/util.c
@@ -606,6 +606,76 @@ int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip)
return 0;
}
+/**
+ * read_all_buf() - Fill a whole buffer from a file descriptor
+ * @fd: File descriptor
+ * @buf: Pointer to base of buffer
+ * @len: Length of buffer
+ *
+ * Return: 0 on success, -1 on error (with errno set)
+ *
+ * #syscalls read
+ */
+int read_all_buf(int fd, void *buf, size_t len)
+{
+ size_t left = len;
+ char *p = buf;
+
+ while (left) {
+ ssize_t rc;
+
+ ASSERT(left <= len);
+
+ do
+ rc = read(fd, p, left);
+ while ((rc < 0) && errno == EINTR);
+
+ if (rc < 0)
+ return -1;
+
+ p += rc;
+ left -= rc;
+ }
+ return 0;
+}
+
+/**
+ * read_remainder() - Read the tail of an IO vector from a file descriptor
+ * @fd: File descriptor
+ * @iov: IO vector
+ * @cnt: Number of entries in @iov
+ * @skip: Number of bytes of the vector to skip reading
+ *
+ * Return: 0 on success, -1 on error (with errno set)
+ *
+ * Note: mode-specific seccomp profiles need to enable readv() to use this.
+ */
+int read_remainder(int fd, struct iovec *iov, size_t cnt, size_t skip)
+{
+ size_t i = 0, offset;
+
+ while ((i += iov_skip_bytes(iov + i, cnt - i, skip, &offset)) < cnt) {
+ ssize_t rc;
+
+ if (offset) {
+ ASSERT(offset < iov[i].iov_len);
+ /* Read the remainder of the partially read buffer */
+ if (read_all_buf(fd, (char *)iov[i].iov_base + offset,
+ iov[i].iov_len - offset) < 0)
+ return -1;
+ i++;
+ }
+
+ /* Fill as many of the remaining buffers as we can */
+ rc = readv(fd, &iov[i], cnt - i);
+ if (rc < 0)
+ return -1;
+
+ skip = rc;
+ }
+ return 0;
+}
+
/** sockaddr_ntop() - Convert a socket address to text format
* @sa: Socket address
* @dst: output buffer, minimum SOCKADDR_STRLEN bytes
diff --git a/util.h b/util.h
index d02333d..73a7a33 100644
--- a/util.h
+++ b/util.h
@@ -203,6 +203,8 @@ int fls(unsigned long x);
int write_file(const char *path, const char *buf);
int write_all_buf(int fd, const void *buf, size_t len);
int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip);
+int read_all_buf(int fd, void *buf, size_t len);
+int read_remainder(int fd, struct iovec *iov, size_t cnt, size_t skip);
void close_open_files(int argc, char **argv);
bool snprintf_check(char *str, size_t size, const char *format, ...);
--
2.43.0
* [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure
2025-01-27 23:15 [PATCH 0/7] Draft, incomplete series introducing state migration Stefano Brivio
` (4 preceding siblings ...)
2025-01-27 23:15 ` [PATCH 5/7] util: Add read_remainder() and read_all_buf() Stefano Brivio
@ 2025-01-27 23:15 ` Stefano Brivio
2025-01-28 1:40 ` David Gibson
2025-01-27 23:15 ` [PATCH 7/7] Introduce passt-repair Stefano Brivio
6 siblings, 1 reply; 41+ messages in thread
From: Stefano Brivio @ 2025-01-27 23:15 UTC
To: passt-dev; +Cc: Laurent Vivier, David Gibson
Add two sets (source or target) of three functions each for passt in
vhost-user mode, triggered by activity on the file descriptor passed
via VHOST_USER_PROTOCOL_F_DEVICE_STATE:
- migrate_source_pre() and migrate_target_pre() are called to prepare
for migration, before data is transferred
- migrate_source() sends, and migrate_target() receives migration data
- migrate_source_post() and migrate_target_post() are responsible for
any post-migration task
Callbacks are added to these functions with arrays of function
pointers in migrate.c. Migration handlers are versioned.
Versioned descriptions of data sections will be added to the
data_versions array, which points to versioned iovec arrays. Version
1 is currently empty and will be filled in in subsequent patches.
The source announces the data version to be used and informs the peer
about endianness, and the size of void *, time_t, flow entries and
flow hash table entries.
The target checks if the version of the source is still supported. If
it's not, it aborts the migration.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
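As an illustration of the endianness check on the header, a self-contained sketch with hypothetical ex_* helpers. The two constants are the magic values from migrate.c below. The magic is sent in the source's host byte order, so a target seeing it byte-swapped can conclude that the source has the opposite endianness. The portable byte-swap helper is only there to show that the second constant is the first one with bytes reversed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EX_MAGIC		0xB1BB1D1B0BB1D1B0ULL
#define EX_MAGIC_SWAPPED	0xB0D1B10B1B1DBBB1ULL

/* ex_bswap64() - Reverse the byte order of a 64-bit value */
static uint64_t ex_bswap64(uint64_t x)
{
	uint64_t r = 0;
	int i;

	for (i = 0; i < 8; i++) {
		r = (r << 8) | (x & 0xff);
		x >>= 8;
	}
	return r;
}

/* ex_check_magic() - Detect the byte order of the source from the magic
 * Return: 0 with @bswap set, -1 if the magic is not recognised
 */
static int ex_check_magic(uint64_t magic, bool *bswap)
{
	if (magic == EX_MAGIC)
		*bswap = false;
	else if (magic == EX_MAGIC_SWAPPED)
		*bswap = true;
	else
		return -1;

	return 0;
}
```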
Makefile | 12 +--
migrate.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++
migrate.h | 90 ++++++++++++++++++
passt.c | 2 +-
vu_common.c | 122 ++++++++++++++++---------
vu_common.h | 2 +-
6 files changed, 438 insertions(+), 49 deletions(-)
create mode 100644 migrate.c
create mode 100644 migrate.h
diff --git a/Makefile b/Makefile
index 464eef1..1383875 100644
--- a/Makefile
+++ b/Makefile
@@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
- ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
- tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
+ ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
+ tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
vhost_user.c virtio.c vu_common.c
QRAP_SRCS = qrap.c
SRCS = $(PASST_SRCS) $(QRAP_SRCS)
@@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1
PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
- lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
- siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
- tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
- virtio.h vu_common.h
+ lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
+ pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
+ tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
+ vhost_user.h virtio.h vu_common.h
HEADERS = $(PASST_HEADERS) seccomp.h
C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
diff --git a/migrate.c b/migrate.c
new file mode 100644
index 0000000..bee9653
--- /dev/null
+++ b/migrate.c
@@ -0,0 +1,259 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* PASST - Plug A Simple Socket Transport
+ * for qemu/UNIX domain socket mode
+ *
+ * PASTA - Pack A Subtle Tap Abstraction
+ * for network namespace/tap device mode
+ *
+ * migrate.c - Migration sections, layout, and routines
+ *
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#include <errno.h>
+#include <sys/uio.h>
+
+#include "util.h"
+#include "ip.h"
+#include "passt.h"
+#include "inany.h"
+#include "flow.h"
+#include "flow_table.h"
+
+#include "migrate.h"
+
+/* Current version of migration data */
+#define MIGRATE_VERSION 1
+
+/* Magic as we see it and as seen with reverse endianness */
+#define MIGRATE_MAGIC 0xB1BB1D1B0BB1D1B0
+#define MIGRATE_MAGIC_SWAPPED 0xB0D1B10B1B1DBBB1
+
+/* Migration header to send from source */
+static union migrate_header header = {
+ .magic = MIGRATE_MAGIC,
+ .version = htonl_constant(MIGRATE_VERSION),
+ .time_t_size = htonl_constant(sizeof(time_t)),
+ .flow_size = htonl_constant(sizeof(union flow)),
+ .flow_sidx_size = htonl_constant(sizeof(struct flow_sidx)),
+ .voidp_size = htonl_constant(sizeof(void *)),
+};
+
+/* Data sections for version 1 */
+static struct iovec sections_v1[] = {
+ { &header, sizeof(header) }, { 0 },
+};
+
+/* Set of data versions */
+static struct migrate_data data_versions[] = {
+ {
+ 1, sections_v1,
+ },
+ { 0 },
+};
+
+/* Handlers to call in source before sending data */
+struct migrate_handler handlers_source_pre[] = {
+ { 0 },
+};
+
+/* Handlers to call in source after sending data */
+struct migrate_handler handlers_source_post[] = {
+ { 0 },
+};
+
+/* Handlers to call in target before receiving data with version 1 */
+struct migrate_handler handlers_target_pre_v1[] = {
+ { 0 },
+};
+
+/* Handlers to call in target after receiving data with version 1 */
+struct migrate_handler handlers_target_post_v1[] = {
+ { 0 },
+};
+
+/* Versioned sets of migration handlers */
+struct migrate_target_handlers target_handlers[] = {
+ {
+ 1,
+ handlers_target_pre_v1,
+ handlers_target_post_v1,
+ },
+ { 0 },
+};
+
+/**
+ * migrate_source_pre() - Pre-migration tasks as source
+ * @m: Migration metadata
+ *
+ * Return: 0 on success, error code on failure
+ */
+int migrate_source_pre(struct migrate_meta *m)
+{
+ struct migrate_handler *h;
+
+ for (h = handlers_source_pre; h->fn; h++) {
+ int rc;
+
+ if ((rc = h->fn(m, h->data)))
+ return rc;
+ }
+
+ return 0;
+}
+
+/**
+ * migrate_source() - Perform migration as source: send state to hypervisor
+ * @fd: Descriptor for state transfer
+ * @m: Migration metadata
+ *
+ * Return: 0 on success, error code on failure
+ */
+int migrate_source(int fd, const struct migrate_meta *m)
+{
+ static struct migrate_data *d;
+ unsigned count;
+ int rc;
+
+ for (d = data_versions; d->v != MIGRATE_VERSION; d++);
+
+ for (count = 0; d->sections[count].iov_len; count++);
+
+ debug("Writing %u migration sections", count - 1 /* minus header */);
+ rc = write_remainder(fd, d->sections, count, 0);
+ if (rc < 0)
+ return errno;
+
+ return 0;
+}
+
+/**
+ * migrate_source_post() - Post-migration tasks as source
+ * @m: Migration metadata
+ *
+ * Return values from handlers are ignored
+ */
+void migrate_source_post(struct migrate_meta *m)
+{
+ struct migrate_handler *h;
+
+ for (h = handlers_source_post; h->fn; h++)
+ h->fn(m, h->data);
+}
+
+/**
+ * migrate_target_read_header() - Set metadata in target from source header
+ * @fd: Descriptor for state transfer
+ * @m: Migration metadata, filled on return
+ *
+ * Return: 0 on success, error code on failure
+ */
+int migrate_target_read_header(int fd, struct migrate_meta *m)
+{
+ static struct migrate_data *d;
+ union migrate_header h;
+
+ if (read_all_buf(fd, &h, sizeof(h)))
+ return errno;
+
+ debug("Source magic: 0x%016" PRIx64 ", sizeof(void *): %u, version: %u",
+ h.magic, ntohl(h.voidp_size), ntohl(h.version));
+
+ for (d = data_versions; d->v != ntohl(h.version) && d->v; d++);
+ if (!d->v)
+ return ENOTSUP;
+ m->v = d->v;
+
+ if (h.magic == MIGRATE_MAGIC)
+ m->bswap = false;
+ else if (h.magic == MIGRATE_MAGIC_SWAPPED)
+ m->bswap = true;
+ else
+ return ENOTSUP;
+
+ if (ntohl(h.voidp_size) == 4)
+ m->source_64b = false;
+ else if (ntohl(h.voidp_size) == 8)
+ m->source_64b = true;
+ else
+ return ENOTSUP;
+
+ if (ntohl(h.time_t_size) == 4)
+ m->time_64b = false;
+ else if (ntohl(h.time_t_size) == 8)
+ m->time_64b = true;
+ else
+ return ENOTSUP;
+
+ m->flow_size = ntohl(h.flow_size);
+ m->flow_sidx_size = ntohl(h.flow_sidx_size);
+
+ return 0;
+}
+
+/**
+ * migrate_target_pre() - Pre-migration tasks as target
+ * @m: Migration metadata
+ *
+ * Return: 0 on success, error code on failure
+ */
+int migrate_target_pre(struct migrate_meta *m)
+{
+ struct migrate_target_handlers *th;
+ struct migrate_handler *h;
+
+ for (th = target_handlers; th->v != m->v && th->v; th++);
+
+ for (h = th->pre; h->fn; h++) {
+ int rc;
+
+ if ((rc = h->fn(m, h->data)))
+ return rc;
+ }
+
+ return 0;
+}
+
+/**
+ * migrate_target() - Perform migration as target: receive state from hypervisor
+ * @fd: Descriptor for state transfer
+ * @m: Migration metadata
+ *
+ * Return: 0 on success, error code on failure
+ *
+ * #syscalls:vu readv
+ */
+int migrate_target(int fd, const struct migrate_meta *m)
+{
+ static struct migrate_data *d;
+ unsigned cnt;
+ int rc;
+
+ for (d = data_versions; d->v != m->v && d->v; d++);
+
+ for (cnt = 0; d->sections[cnt + 1 /* skip header */].iov_len; cnt++);
+
+ debug("Reading %u migration sections", cnt);
+ rc = read_remainder(fd, d->sections + 1, cnt, 0);
+ if (rc < 0)
+ return errno;
+
+ return 0;
+}
+
+/**
+ * migrate_target_post() - Post-migration tasks as target
+ * @m: Migration metadata
+ */
+void migrate_target_post(struct migrate_meta *m)
+{
+ struct migrate_target_handlers *th;
+ struct migrate_handler *h;
+
+ for (th = target_handlers; th->v != m->v && th->v; th++);
+
+ for (h = th->post; h->fn; h++)
+ h->fn(m, h->data);
+}
diff --git a/migrate.h b/migrate.h
new file mode 100644
index 0000000..5582f75
--- /dev/null
+++ b/migrate.h
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#ifndef MIGRATE_H
+#define MIGRATE_H
+
+/**
+ * struct migrate_meta - Migration metadata
+ * @v: Chosen migration data version, host order
+ * @bswap: Source has opposite endianness
+ * @source_64b: Source uses 64-bit void *
+ * @time_64b: Source uses 64-bit time_t
+ * @flow_size: Size of union flow in source
+ * @flow_sidx_size: Size of struct flow_sidx in source
+ */
+struct migrate_meta {
+ uint32_t v;
+ bool bswap;
+ bool source_64b;
+ bool time_64b;
+ size_t flow_size;
+ size_t flow_sidx_size;
+};
+
+/**
+ * union migrate_header - Migration header from source
+ * @magic: 0xB1BB1D1B0BB1D1B0, host order
+ * @version: Source sends highest known, target aborts if unsupported
+ * @voidp_size: sizeof(void *), network order
+ * @time_t_size: sizeof(time_t), network order
+ * @flow_size: sizeof(union flow), network order
+ * @flow_sidx_size: sizeof(struct flow_sidx), network order
+ * @unused: Go figure
+ */
+union migrate_header {
+ struct {
+ uint64_t magic;
+ uint32_t version;
+ uint32_t voidp_size;
+ uint32_t time_t_size;
+ uint32_t flow_size;
+ uint32_t flow_sidx_size;
+ };
+ uint8_t unused[65536];
+};
+
+/**
+ * struct migrate_data - Data sections for given source version
+ * @v: Source version this applies to, host order
+ * @sections: Array of data sections, NULL-terminated
+ */
+struct migrate_data {
+ uint32_t v;
+ struct iovec *sections;
+};
+
+/**
+ * struct migrate_handler - Function to handle a specific data section
+ * @fn: Function pointer taking pointer to data section
+ * @data: Associated data section
+ */
+struct migrate_handler {
+ int (*fn)(struct migrate_meta *m, void *data);
+ void *data;
+};
+
+/**
+ * struct migrate_target_handlers - Versioned sets of migration target handlers
+ * @v: Source version this applies to, host order
+ * @pre: Set of functions to execute in target before data copy
+ * @post: Set of functions to execute in target after data copy
+ */
+struct migrate_target_handlers {
+ uint32_t v;
+ struct migrate_handler *pre;
+ struct migrate_handler *post;
+};
+
+int migrate_source_pre(struct migrate_meta *m);
+int migrate_source(int fd, const struct migrate_meta *m);
+void migrate_source_post(struct migrate_meta *m);
+
+int migrate_target_read_header(int fd, struct migrate_meta *m);
+int migrate_target_pre(struct migrate_meta *m);
+int migrate_target(int fd, const struct migrate_meta *m);
+void migrate_target_post(struct migrate_meta *m);
+
+#endif /* MIGRATE_H */
diff --git a/passt.c b/passt.c
index b1c8ab6..184d4e5 100644
--- a/passt.c
+++ b/passt.c
@@ -358,7 +358,7 @@ loop:
vu_kick_cb(c.vdev, ref, &now);
break;
case EPOLL_TYPE_VHOST_MIGRATION:
- vu_migrate(c.vdev, eventmask);
+ vu_migrate(&c, eventmask);
break;
default:
/* Can't happen */
diff --git a/vu_common.c b/vu_common.c
index f43d8ac..0c67bd0 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -5,6 +5,7 @@
* common_vu.c - vhost-user common UDP and TCP functions
*/
+#include <errno.h>
#include <unistd.h>
#include <sys/uio.h>
#include <sys/eventfd.h>
@@ -17,6 +18,7 @@
#include "vhost_user.h"
#include "pcap.h"
#include "vu_common.h"
+#include "migrate.h"
#define VU_MAX_TX_BUFFER_NB 2
@@ -305,50 +307,88 @@ err:
}
/**
- * vu_migrate() - Send/receive passt insternal state to/from QEMU
- * @vdev: vhost-user device
+ * vu_migrate_source() - Migration as source, send state to hypervisor
+ * @fd: File descriptor for state transfer
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+static int vu_migrate_source(int fd)
+{
+ struct migrate_meta m;
+ int rc;
+
+ if ((rc = migrate_source_pre(&m))) {
+ err("Source pre-migration failed: %s, abort", strerror_(rc));
+ return rc;
+ }
+
+ debug("Saving backend state");
+
+ rc = migrate_source(fd, &m);
+ if (rc)
+ err("Source migration failed: %s", strerror_(rc));
+ else
+ migrate_source_post(&m);
+
+ return rc;
+}
+
+/**
+ * vu_migrate_target() - Migration as target, receive state from hypervisor
+ * @fd: File descriptor for state transfer
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+static int vu_migrate_target(int fd)
+{
+ struct migrate_meta m;
+ int rc;
+
+ rc = migrate_target_read_header(fd, &m);
+ if (rc) {
+ err("Migration header check failed: %s, abort", strerror_(rc));
+ return rc;
+ }
+
+ if ((rc = migrate_target_pre(&m))) {
+ err("Target pre-migration failed: %s, abort", strerror_(rc));
+ return rc;
+ }
+
+ debug("Loading backend state");
+
+ rc = migrate_target(fd, &m);
+ if (rc)
+ err("Target migration failed: %s", strerror_(rc));
+ else
+ migrate_target_post(&m);
+
+ return rc;
+}
+
+/**
+ * vu_migrate() - Send/receive passt internal state to/from QEMU
+ * @c: Execution context
* @events: epoll events
*/
-void vu_migrate(struct vu_dev *vdev, uint32_t events)
+void vu_migrate(struct ctx *c, uint32_t events)
{
- int ret;
+ struct vu_dev *vdev = c->vdev;
+ int rc = EIO;
- /* TODO: collect/set passt internal state
- * and use vdev->device_state_fd to send/receive it
- */
debug("vu_migrate fd %d events %x", vdev->device_state_fd, events);
- if (events & EPOLLOUT) {
- debug("Saving backend state");
-
- /* send some stuff */
- ret = write(vdev->device_state_fd, "PASST", 6);
- /* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
- vdev->device_state_result = ret == -1 ? -1 : 0;
- /* Closing the file descriptor signals the end of transfer */
- epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
- vdev->device_state_fd, NULL);
- close(vdev->device_state_fd);
- vdev->device_state_fd = -1;
- } else if (events & EPOLLIN) {
- char buf[6];
-
- debug("Loading backend state");
- /* read some stuff */
- ret = read(vdev->device_state_fd, buf, sizeof(buf));
- /* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
- if (ret != sizeof(buf)) {
- vdev->device_state_result = -1;
- } else {
- ret = strncmp(buf, "PASST", sizeof(buf));
- vdev->device_state_result = ret == 0 ? 0 : -1;
- }
- } else if (events & EPOLLHUP) {
- debug("Closing migration channel");
-
- /* The end of file signals the end of the transfer. */
- epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
- vdev->device_state_fd, NULL);
- close(vdev->device_state_fd);
- vdev->device_state_fd = -1;
- }
+
+ if (events & EPOLLOUT)
+ rc = vu_migrate_source(vdev->device_state_fd);
+ else if (events & EPOLLIN)
+ rc = vu_migrate_target(vdev->device_state_fd);
+
+ /* EPOLLHUP without EPOLLIN/EPOLLOUT, or EPOLLERR? Migration failed */
+
+ vdev->device_state_result = rc;
+
+ epoll_ctl(c->epollfd, EPOLL_CTL_DEL, vdev->device_state_fd, NULL);
+ debug("Closing migration channel");
+ close(vdev->device_state_fd);
+ vdev->device_state_fd = -1;
}
diff --git a/vu_common.h b/vu_common.h
index d56c021..69c4006 100644
--- a/vu_common.h
+++ b/vu_common.h
@@ -57,5 +57,5 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
const struct timespec *now);
int vu_send_single(const struct ctx *c, const void *buf, size_t size);
-void vu_migrate(struct vu_dev *vdev, uint32_t events);
+void vu_migrate(struct ctx *c, uint32_t events);
#endif /* VU_COMMON_H */
--
2.43.0
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 7/7] Introduce passt-repair
2025-01-27 23:15 [PATCH 0/7] Draft, incomplete series introducing state migration Stefano Brivio
` (5 preceding siblings ...)
2025-01-27 23:15 ` [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure Stefano Brivio
@ 2025-01-27 23:15 ` Stefano Brivio
2025-01-27 23:31 ` Stefano Brivio
2025-01-28 1:51 ` David Gibson
6 siblings, 2 replies; 41+ messages in thread
From: Stefano Brivio @ 2025-01-27 23:15 UTC (permalink / raw)
To: passt-dev; +Cc: Laurent Vivier, David Gibson
A privileged helper to set/clear TCP_REPAIR on sockets on behalf of
passt. Not used yet.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
Makefile | 10 +++--
passt-repair.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 118 insertions(+), 3 deletions(-)
create mode 100644 passt-repair.c
diff --git a/Makefile b/Makefile
index 1383875..1b71cb0 100644
--- a/Makefile
+++ b/Makefile
@@ -42,7 +42,8 @@ PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
vhost_user.c virtio.c vu_common.c
QRAP_SRCS = qrap.c
-SRCS = $(PASST_SRCS) $(QRAP_SRCS)
+PASST_REPAIR_SRCS = passt-repair.c
+SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
MANPAGES = passt.1 pasta.1 qrap.1
@@ -72,9 +73,9 @@ mandir ?= $(datarootdir)/man
man1dir ?= $(mandir)/man1
ifeq ($(TARGET_ARCH),x86_64)
-BIN := passt passt.avx2 pasta pasta.avx2 qrap
+BIN := passt passt.avx2 pasta pasta.avx2 qrap passt-repair
else
-BIN := passt pasta qrap
+BIN := passt pasta qrap passt-repair
endif
all: $(BIN) $(MANPAGES) docs
@@ -101,6 +102,9 @@ pasta.avx2 pasta.1 pasta: pasta%: passt%
qrap: $(QRAP_SRCS) passt.h
$(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS)
+passt-repair: $(PASST_REPAIR_SRCS)
+ $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS)
+
valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \
rt_sigreturn getpid gettid kill clock_gettime mmap \
mmap2 munmap open unlink gettimeofday futex statx \
diff --git a/passt-repair.c b/passt-repair.c
new file mode 100644
index 0000000..e9b9609
--- /dev/null
+++ b/passt-repair.c
@@ -0,0 +1,111 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* PASST - Plug A Simple Socket Transport
+ * for qemu/UNIX domain socket mode
+ *
+ * passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets
+ *
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ *
+ * Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along
+ * with commands mapping to TCP_REPAIR values, and switch repair mode on or
+ * off. Reply by echoing the command. Exit if the command is INT_MAX.
+ */
+
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <limits.h>
+#include <unistd.h>
+#include <netdb.h>
+
+#include <netinet/tcp.h>
+
+#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
+
+int main(int argc, char **argv)
+{
+ char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
+ __attribute__ ((aligned(__alignof__(struct cmsghdr))));
+ struct sockaddr_un a = { AF_UNIX, "" };
+ int cmd, fds[SCM_MAX_FD], s, ret, i;
+ struct cmsghdr *cmsg;
+ struct msghdr msg;
+ struct iovec iov;
+
+ iov = (struct iovec){ &cmd, sizeof(cmd) };
+ msg = (struct msghdr){ NULL, 0, &iov, 1, buf, sizeof(buf), 0 };
+ cmsg = CMSG_FIRSTHDR(&msg);
+
+ if (argc != 2) {
+ fprintf(stderr, "Usage: %s PATH\n", argv[0]);
+ return -1;
+ }
+
+ ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", argv[1]);
+ if (ret <= 0 || ret >= (int)sizeof(a.sun_path)) {
+ fprintf(stderr, "Invalid socket path: %s\n", argv[1]);
+ return -1;
+ }
+
+ if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {
+ perror("Failed to create AF_UNIX socket");
+ return -1;
+ }
+
+ if (connect(s, (struct sockaddr *)&a, sizeof(a))) {
+ fprintf(stderr, "Failed to connect to %s: %s\n", argv[1],
+ strerror(errno));
+ return -1;
+ }
+
+ while (1) {
+ int n;
+
+ if (recvmsg(s, &msg, 0) < 0) {
+ perror("Failed to receive message");
+ return -1;
+ }
+
+ if (!cmsg ||
+ cmsg->cmsg_len < CMSG_LEN(sizeof(int)) ||
+ cmsg->cmsg_len > CMSG_LEN(sizeof(int) * SCM_MAX_FD) ||
+ cmsg->cmsg_type != SCM_RIGHTS)
+ return -1;
+
+ n = cmsg->cmsg_len / CMSG_LEN(sizeof(int));
+ memcpy(fds, CMSG_DATA(cmsg), sizeof(int) * n);
+
+ switch (cmd) {
+ case INT_MAX:
+ return 0;
+ case TCP_REPAIR_ON:
+ case TCP_REPAIR_OFF:
+ case TCP_REPAIR_OFF_NO_WP:
+ for (i = 0; i < n; i++) {
+ if (setsockopt(fds[i], SOL_TCP, TCP_REPAIR,
+ &cmd, sizeof(int))) {
+ perror("Setting TCP_REPAIR");
+ return -1;
+ }
+ }
+
+ /* Confirm setting by echoing the command back */
+ if (send(s, &cmd, sizeof(int), 0) < 0) {
+ fprintf(stderr, "Reply to command %i: %s\n",
+ cmd, strerror(errno));
+ return -1;
+ }
+
+ break;
+ default:
+ fprintf(stderr, "Unsupported command 0x%04x\n", cmd);
+ return -1;
+ }
+ }
+}
--
2.43.0
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH 7/7] Introduce passt-repair
2025-01-27 23:15 ` [PATCH 7/7] Introduce passt-repair Stefano Brivio
@ 2025-01-27 23:31 ` Stefano Brivio
2025-01-28 1:51 ` David Gibson
1 sibling, 0 replies; 41+ messages in thread
From: Stefano Brivio @ 2025-01-27 23:31 UTC (permalink / raw)
To: passt-dev; +Cc: Laurent Vivier, David Gibson
[-- Attachment #1: Type: text/plain, Size: 540 bytes --]
On Tue, 28 Jan 2025 00:15:32 +0100
Stefano Brivio <sbrivio@redhat.com> wrote:
> A privileged helper to set/clear TCP_REPAIR on sockets on behalf of
> passt. Not used yet.
...but tested against source.c / target.c, attached.
---
$ nc -l 9996
---
$ ./source 127.0.0.1 9996 9898 /tmp/repair.sock
sending sequence: 3244673313
receiving sequence: 2250449386
---
# ./passt-repair /tmp/repair.sock
---
$ strace ./target 127.0.0.1 9996 9898 /tmp/repair.sock 3244673313 2250449386
---
# ./passt-repair /tmp/repair.sock
---
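As a side note on the message format used here: the helper gets its
descriptors as SCM_RIGHTS ancillary data, and cmsg_len covers the
control message header plus the payload. A minimal sketch of
recovering the descriptor count follows; scm_rights_count() is a
hypothetical name, not something from the series. Dividing cmsg_len by
CMSG_LEN(sizeof(int)) instead would count the header once per
descriptor and give 1 for a message carrying two descriptors.

```c
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical helper, not part of the series: number of descriptors carried
 * by an SCM_RIGHTS control message, or -1 if the header doesn't describe a
 * whole number of them.
 */
int scm_rights_count(const struct cmsghdr *cmsg)
{
	size_t payload;

	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len < CMSG_LEN(sizeof(int)))
		return -1;

	/* cmsg_len is header, that is, CMSG_LEN(0), plus payload bytes */
	payload = cmsg->cmsg_len - CMSG_LEN(0);
	if (payload % sizeof(int))
		return -1;

	return (int)(payload / sizeof(int));
}
```

The caller would still need to cap the result at the size of its own
descriptor array before copying from CMSG_DATA().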
--
Stefano
[-- Attachment #2: source.c --]
[-- Type: text/x-c++src, Size: 2368 bytes --]
#include <arpa/inet.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <unistd.h>
#include <netdb.h>
#include <netinet/tcp.h>
int main(int argc, char **argv)
{
struct sockaddr_in a = { AF_INET, htons(atoi(argv[3])), { 0 }, { 0 } };
struct addrinfo hints = { 0, AF_UNSPEC, SOCK_STREAM, 0, 0,
NULL, NULL, NULL };
struct sockaddr_un a_helper = { AF_UNIX, { 0 } };
int cmd, seq, s, s_helper;
struct iovec iov = { &cmd, sizeof(cmd) };
char buf[CMSG_SPACE(sizeof(int))];
struct msghdr msg = { NULL, 0, &iov, 1, buf, sizeof(buf), 0 };
struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
socklen_t seqlen = sizeof(int);
struct addrinfo *r;
(void)argc;
if (argc != 5) {
fprintf(stderr, "%s DST_ADDR DST_PORT SRC_PORT HELPER_PATH\n",
argv[0]);
return -1;
}
strcpy(a_helper.sun_path, argv[4]);
getaddrinfo(argv[1], argv[2], &hints, &r);
/* Connect socket to server and send some data */
s = socket(r->ai_family, SOCK_STREAM, IPPROTO_TCP);
setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &((int){ 1 }), sizeof(int));
bind(s, (struct sockaddr *)&a, sizeof(a));
connect(s, r->ai_addr, r->ai_addrlen);
send(s, "before migration\n", sizeof("before migration\n"), 0);
/* Wait for helper */
s_helper = socket(AF_UNIX, SOCK_STREAM, 0);
unlink(a_helper.sun_path);
bind(s_helper, (struct sockaddr *)&a_helper, sizeof(a_helper));
listen(s_helper, 1);
s_helper = accept(s_helper, NULL, NULL);
/* Set up message for helper, with socket */
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_RIGHTS;
cmsg->cmsg_len = CMSG_LEN(sizeof(int));
memcpy(CMSG_DATA(cmsg), &s, sizeof(s));
/* Send command to helper: turn repair mode on, wait for reply */
cmd = TCP_REPAIR_ON;
sendmsg(s_helper, &msg, 0);
recv(s_helper, &((int){ 0 }), 1, 0);
/* Terminate helper */
cmd = INT_MAX;
sendmsg(s_helper, &msg, 0);
/* Get sending sequence */
seq = TCP_SEND_QUEUE;
setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &seq, sizeof(seq));
getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, &seqlen);
fprintf(stdout, "sending sequence: %u\n", seq);
/* Get receiving sequence */
seq = TCP_RECV_QUEUE;
setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &seq, sizeof(seq));
getsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, &seqlen);
fprintf(stdout, "receiving sequence: %u\n", seq);
}
[-- Attachment #3: target.c --]
[-- Type: text/x-c++src, Size: 2550 bytes --]
#include <arpa/inet.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <unistd.h>
#include <netdb.h>
#include <netinet/tcp.h>
int main(int argc, char **argv)
{
struct sockaddr_in a = { AF_INET, htons(atoi(argv[3])), { 0 }, { 0 } };
struct addrinfo hints = { 0, AF_UNSPEC, SOCK_STREAM, 0, 0,
NULL, NULL, NULL };
struct sockaddr_un a_helper = { AF_UNIX, { 0 } };
int cmd, s, s_helper, seq;
struct iovec iov = { &cmd, sizeof(cmd) };
char buf[CMSG_SPACE(sizeof(int))];
struct msghdr msg = { NULL, 0, &iov, 1, buf, sizeof(buf), 0 };
struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
struct addrinfo *r;
(void)argc;
strcpy(a_helper.sun_path, argv[4]);
getaddrinfo(argv[1], argv[2], &hints, &r);
if (argc != 7) {
fprintf(stderr,
"%s DST_ADDR DST_PORT SRC_PORT HELPER_PATH SSEQ RSEQ\n",
argv[0]);
return -1;
}
/* Prepare socket, bind to source port */
s = socket(r->ai_family, SOCK_STREAM, IPPROTO_TCP);
setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &((int){ 1 }), sizeof(int));
bind(s, (struct sockaddr *)&a, sizeof(a));
/* Wait for helper */
s_helper = socket(AF_UNIX, SOCK_STREAM, 0);
unlink(a_helper.sun_path);
bind(s_helper, (struct sockaddr *)&a_helper, sizeof(a_helper));
listen(s_helper, 1);
s_helper = accept(s_helper, NULL, NULL);
/* Set up message for helper, with socket */
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_RIGHTS;
cmsg->cmsg_len = CMSG_LEN(sizeof(int));
memcpy(CMSG_DATA(cmsg), &s, sizeof(s));
/* Send command to helper: turn repair mode on, wait for reply */
cmd = TCP_REPAIR_ON;
sendmsg(s_helper, &msg, 0);
recv(s_helper, &((int){ 0 }), 1, 0);
/* Set sending sequence */
seq = TCP_SEND_QUEUE;
setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &seq, sizeof(seq));
seq = atoi(argv[5]);
setsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, sizeof(seq));
/* Set receiving sequence */
seq = TCP_RECV_QUEUE;
setsockopt(s, SOL_TCP, TCP_REPAIR_QUEUE, &seq, sizeof(seq));
seq = atoi(argv[6]);
setsockopt(s, SOL_TCP, TCP_QUEUE_SEQ, &seq, sizeof(seq));
/* Connect setting kernel state only, without actual SYN / handshake */
connect(s, r->ai_addr, r->ai_addrlen);
/* Send command to helper: turn repair mode off, wait for reply */
cmd = TCP_REPAIR_OFF;
sendmsg(s_helper, &msg, 0);
recv(s_helper, &((int){ 0 }), 1, 0);
/* Terminate helper */
cmd = INT_MAX;
sendmsg(s_helper, &msg, 0);
/* Send some more data */
send(s, "after migration\n", sizeof("after migration\n"), 0);
}
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 1/7] icmp, udp: Pad time_t timestamp to 64-bit to ease state migration
2025-01-27 23:15 ` [PATCH 1/7] icmp, udp: Pad time_t timestamp to 64-bit to ease " Stefano Brivio
@ 2025-01-28 0:49 ` David Gibson
2025-01-28 6:48 ` Stefano Brivio
0 siblings, 1 reply; 41+ messages in thread
From: David Gibson @ 2025-01-28 0:49 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier
[-- Attachment #1: Type: text/plain, Size: 2322 bytes --]
On Tue, Jan 28, 2025 at 12:15:26AM +0100, Stefano Brivio wrote:
> That's the only field in flows with different storage sizes depending
> on the architecture: it's usually 4-byte wide on 32-bit architectures,
> except for arc and x32 where it's 8 bytes, and 8-byte wide on 64-bit
> machines.
As discussed on the call, I think there are broader problems with
transferring timestamps than just the structure size. So I'm hoping
we can work out how to not transfer them at all and avoid this change.
> By keeping flow entries the same size across architectures, we avoid
> having to expand or shrink table entries upon migration.
>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> icmp_flow.h | 6 +++++-
> udp_flow.h | 6 +++++-
> 2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/icmp_flow.h b/icmp_flow.h
> index fb93801..da7e255 100644
> --- a/icmp_flow.h
> +++ b/icmp_flow.h
> @@ -13,6 +13,7 @@
> * @seq: Last sequence number sent to tap, host order, -1: not sent yet
> * @sock: "ping" socket
> * @ts: Last associated activity from tap, seconds
> + * @ts_storage: Pad @ts to 64-bit storage to keep state migration sane
> */
> struct icmp_ping_flow {
> /* Must be first element */
> @@ -20,7 +21,10 @@ struct icmp_ping_flow {
>
> int seq;
> int sock;
> - time_t ts;
> + union {
> + time_t ts;
> + uint64_t ts_storage;
> + };
> };
>
> bool icmp_ping_timer(const struct ctx *c, const struct icmp_ping_flow *pingf,
> diff --git a/udp_flow.h b/udp_flow.h
> index 9a1b059..9cb79a0 100644
> --- a/udp_flow.h
> +++ b/udp_flow.h
> @@ -12,6 +12,7 @@
> * @f: Generic flow information
> * @closed: Flow is already closed
> * @ts: Activity timestamp
> + * @ts_storage: Pad @ts to 64-bit storage to keep state migration sane
> * @s: Socket fd (or -1) for each side of the flow
> */
> struct udp_flow {
> @@ -19,7 +20,10 @@ struct udp_flow {
> struct flow_common f;
>
> bool closed :1;
> - time_t ts;
> + union {
> + time_t ts;
> + uint64_t ts_storage;
> + };
> int s[SIDES];
> };
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 2/7] flow, flow_table: Pad flow table entries to 128 bytes, hash entries to 32 bits
2025-01-27 23:15 ` [PATCH 2/7] flow, flow_table: Pad flow table entries to 128 bytes, hash entries to 32 bits Stefano Brivio
@ 2025-01-28 0:50 ` David Gibson
0 siblings, 0 replies; 41+ messages in thread
From: David Gibson @ 2025-01-28 0:50 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier
[-- Attachment #1: Type: text/plain, Size: 3107 bytes --]
On Tue, Jan 28, 2025 at 12:15:27AM +0100, Stefano Brivio wrote:
> ...to keep migration sane. Right now, the biggest struct in union flow
> is struct tcp_splice_conn with 120 bytes on x86_64, which should also
> have the biggest storage and alignment requirements of any
> architecture we might run on.
Necessary for the current "copy the entire table as a blob" approach.
As I've noted, I think that will be fragile, but we can revisit this
change when/if we figure out a different way to handle the table as a
whole.
>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> flow.h | 18 ++++++++++++------
> flow_table.h | 13 ++++++++++---
> 2 files changed, 22 insertions(+), 9 deletions(-)
>
> diff --git a/flow.h b/flow.h
> index 24ba3ef..8eb5964 100644
> --- a/flow.h
> +++ b/flow.h
> @@ -202,15 +202,21 @@ struct flow_common {
>
> /**
> * struct flow_sidx - ID for one side of a specific flow
> - * @sidei: Index of side referenced (0 or 1)
> - * @flowi: Index of flow referenced
> + * @sidei: Index of side referenced (0 or 1)
> + * @flowi: Index of flow referenced
> + * @flow_sidx_storage: Pad to 32 bits
> */
> typedef struct flow_sidx {
> - unsigned sidei :1;
> - unsigned flowi :FLOW_INDEX_BITS;
> + union {
> + struct {
> + unsigned sidei :1;
> + unsigned flowi :FLOW_INDEX_BITS;
> + };
> + uint32_t flow_sidx_storage;
> + };
> } flow_sidx_t;
> -static_assert(sizeof(flow_sidx_t) <= sizeof(uint32_t),
> - "flow_sidx_t must fit within 32 bits");
> +static_assert(sizeof(flow_sidx_t) == sizeof(uint32_t),
> + "flow_sidx_t must be 32-bit wide");
>
> #define FLOW_SIDX_NONE ((flow_sidx_t){ .flowi = FLOW_MAX })
>
> diff --git a/flow_table.h b/flow_table.h
> index f15db53..007f4dd 100644
> --- a/flow_table.h
> +++ b/flow_table.h
> @@ -26,9 +26,13 @@ struct flow_free_cluster {
>
> /**
> * union flow - Descriptor for a logical packet flow (e.g. connection)
> - * @f: Fields common between all variants
> - * @tcp: Fields for non-spliced TCP connections
> - * @tcp_splice: Fields for spliced TCP connections
> + * @f: Fields common between all variants
> + * @free: Entry in a cluster of free entries
> + * @tcp: Fields for non-spliced TCP connections
> + * @tcp_splice: Fields for spliced TCP connections
> + * @ping: Tracking for ping flows
> + * @udp: Tracking for UDP flows
> + * @flow_storage: Pad flow entries to 128 bytes to ease state migration
> */
> union flow {
> struct flow_common f;
> @@ -37,8 +41,11 @@ union flow {
> struct tcp_splice_conn tcp_splice;
> struct icmp_ping_flow ping;
> struct udp_flow udp;
> + char flow_storage[128];
> };
>
> +static_assert(sizeof(union flow) == 128, "union flow should be 128-byte wide");
> +
> /* Global Flow Table */
> extern unsigned flow_first_free;
> extern union flow flowtab[];
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 3/7] tcp_conn: Avoid 7-bit hole in struct tcp_splice_conn
2025-01-27 23:15 ` [PATCH 3/7] tcp_conn: Avoid 7-bit hole in struct tcp_splice_conn Stefano Brivio
@ 2025-01-28 0:53 ` David Gibson
2025-01-28 6:48 ` Stefano Brivio
0 siblings, 1 reply; 41+ messages in thread
From: David Gibson @ 2025-01-28 0:53 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier
[-- Attachment #1: Type: text/plain, Size: 1236 bytes --]
On Tue, Jan 28, 2025 at 12:15:28AM +0100, Stefano Brivio wrote:
> Moving in_epoll out of the common flow data created a 7-bit hole in
> struct tcp_splice_conn: repack by shrinking @flags by one (otherwise
> unused) bit.
Is this actually necessary for the migration stuff? Or just a cleanup
you spotted along the way?
>
> Fixes: b60fa33eeafb ("tcp: Move in_epoll flag out of common connection structure")
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> tcp_conn.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tcp_conn.h b/tcp_conn.h
> index d342680..3d06e2c 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -125,7 +125,7 @@ struct tcp_splice_conn {
> #define FIN_RCVD(sidei_) ((sidei_) ? BIT(5) : BIT(4))
> #define FIN_SENT(sidei_) ((sidei_) ? BIT(7) : BIT(6))
>
> - uint8_t flags;
> + uint8_t flags :7;
> #define RCVLOWAT_SET(sidei_) ((sidei_) ? BIT(1) : BIT(0))
> #define RCVLOWAT_ACT(sidei_) ((sidei_) ? BIT(3) : BIT(2))
> #define CLOSING BIT(4)
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 5/7] util: Add read_remainder() and read_all_buf()
2025-01-27 23:15 ` [PATCH 5/7] util: Add read_remainder() and read_all_buf() Stefano Brivio
@ 2025-01-28 0:59 ` David Gibson
2025-01-28 6:48 ` Stefano Brivio
0 siblings, 1 reply; 41+ messages in thread
From: David Gibson @ 2025-01-28 0:59 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier
[-- Attachment #1: Type: text/plain, Size: 3854 bytes --]
On Tue, Jan 28, 2025 at 12:15:30AM +0100, Stefano Brivio wrote:
> These are symmetric to write_remainder() and write_all_buf() and
> almost a copy and paste of them, with the most notable differences
> being reversed reads/writes and a couple of better-safe-than-sorry
> asserts to keep Coverity happy.
So, there's one thing that needs to be not quite symmetric for the
read() version: we need to handle EOF. At present, I believe these
will enter an infinite loop on EOF, which is not a graceful failure
mode.
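To make the EOF case concrete: read() returning 0 leaves the remaining
length unchanged, so the loop never terminates. A rough sketch of one
way to handle it, under my own assumptions: the function name
read_all_buf_eof() and the choice of ENODATA as the error code are
placeholders, not something the patch defines.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch only: same loop as read_all_buf(), but a short stream is an error.
 * The name and the ENODATA error code are assumptions for illustration.
 */
int read_all_buf_eof(int fd, void *buf, size_t len)
{
	size_t left = len;
	char *p = buf;

	while (left) {
		ssize_t rc;

		assert(left <= len);

		do
			rc = read(fd, p, left);
		while (rc < 0 && errno == EINTR);

		if (rc < 0)
			return -1;

		if (rc == 0) {
			/* EOF before the buffer is full: fail, don't spin */
			errno = ENODATA;
			return -1;
		}

		p += rc;
		left -= rc;
	}
	return 0;
}
```

read_remainder() would need an equivalent check on the return value of
readv().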
> I'll use them in the next patch. At least for the moment, they're
> going to be used for vhost-user mode only, so I'm not unconditionally
> enabling readv() in the seccomp profile: the caller has to ensure it's
> there.
>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> util.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> util.h | 2 ++
> 2 files changed, 72 insertions(+)
>
> diff --git a/util.c b/util.c
> index 11973c4..085937b 100644
> --- a/util.c
> +++ b/util.c
> @@ -606,6 +606,76 @@ int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip)
> return 0;
> }
>
> +/**
> + * read_all_buf() - Fill a whole buffer from a file descriptor
> + * @fd: File descriptor
> + * @buf: Pointer to base of buffer
> + * @len: Length of buffer
> + *
> + * Return: 0 on success, -1 on error (with errno set)
> + *
> + * #syscalls read
> + */
> +int read_all_buf(int fd, void *buf, size_t len)
> +{
> + size_t left = len;
> + char *p = buf;
> +
> + while (left) {
> + ssize_t rc;
> +
> + ASSERT(left <= len);
> +
> + do
> + rc = read(fd, p, left);
> + while ((rc < 0) && errno == EINTR);
> +
> + if (rc < 0)
> + return -1;
> +
> + p += rc;
> + left -= rc;
> + }
> + return 0;
> +}
> +
> +/**
> + * read_remainder() - Read the tail of an IO vector from a file descriptor
> + * @fd: File descriptor
> + * @iov: IO vector
> + * @cnt: Number of entries in @iov
> + * @skip: Number of bytes of the vector to skip reading
> + *
> + * Return: 0 on success, -1 on error (with errno set)
> + *
> + * Note: mode-specific seccomp profiles need to enable readv() to use this.
> + */
> +int read_remainder(int fd, struct iovec *iov, size_t cnt, size_t skip)
> +{
> + size_t i = 0, offset;
> +
> + while ((i += iov_skip_bytes(iov + i, cnt - i, skip, &offset)) < cnt) {
> + ssize_t rc;
> +
> + if (offset) {
> + ASSERT(offset < iov[i].iov_len);
> + /* Read the remainder of the partially read buffer */
> + if (read_all_buf(fd, (char *)iov[i].iov_base + offset,
> + iov[i].iov_len - offset) < 0)
> + return -1;
> + i++;
> + }
> +
> + /* Fill as many of the remaining buffers as we can */
> + rc = readv(fd, &iov[i], cnt - i);
> + if (rc < 0)
> + return -1;
> +
> + skip = rc;
> + }
> + return 0;
> +}
> +
> /** sockaddr_ntop() - Convert a socket address to text format
> * @sa: Socket address
> * @dst: output buffer, minimum SOCKADDR_STRLEN bytes
> diff --git a/util.h b/util.h
> index d02333d..73a7a33 100644
> --- a/util.h
> +++ b/util.h
> @@ -203,6 +203,8 @@ int fls(unsigned long x);
> int write_file(const char *path, const char *buf);
> int write_all_buf(int fd, const void *buf, size_t len);
> int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip);
> +int read_all_buf(int fd, void *buf, size_t len);
> +int read_remainder(int fd, struct iovec *iov, size_t cnt, size_t skip);
> void close_open_files(int argc, char **argv);
> bool snprintf_check(char *str, size_t size, const char *format, ...);
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure
2025-01-27 23:15 ` [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure Stefano Brivio
@ 2025-01-28 1:40 ` David Gibson
2025-01-28 6:50 ` Stefano Brivio
0 siblings, 1 reply; 41+ messages in thread
From: David Gibson @ 2025-01-28 1:40 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier
[-- Attachment #1: Type: text/plain, Size: 19745 bytes --]
On Tue, Jan 28, 2025 at 12:15:31AM +0100, Stefano Brivio wrote:
> Add two sets (source or target) of three functions each for passt in
> vhost-user mode, triggered by activity on the file descriptor passed
> via VHOST_USER_PROTOCOL_F_DEVICE_STATE:
>
> - migrate_source_pre() and migrate_target_pre() are called to prepare
> for migration, before data is transferred
>
> - migrate_source() sends, and migrate_target() receives migration data
>
> - migrate_source_post() and migrate_target_post() are responsible for
> any post-migration task
>
> Callbacks are added to these functions with arrays of function
> pointers in migrate.c. Migration handlers are versioned.
>
> Versioned descriptions of data sections will be added to the
> data_versions array, which points to versioned iovec arrays. Version
> 1 is currently empty and will be filled in in subsequent patches.
>
> The source announces the data version to be used and informs the peer
> about endianness, and the size of void *, time_t, flow entries and
> flow hash table entries.
>
> The target checks if the version of the source is still supported. If
> it's not, it aborts the migration.
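For the sake of discussion, the version check described above boils
down to a lookup in a zero-terminated table. A minimal sketch, with
made-up names (struct data_version, version_lookup(), header_v1); the
actual struct migrate_data is defined in migrate.h and may well look
different:

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical layout, for illustration only */
struct data_version {
	uint32_t v;		/* Version number, 0 terminates the table */
	struct iovec *sections;	/* Data sections for this version */
};

static char header_v1[8];	/* Stand-in for the real migration header */
static struct iovec sections_v1[] = {
	{ header_v1, sizeof(header_v1) },
};

static const struct data_version versions[] = {
	{ 1, sections_v1 },
	{ 0, NULL },
};

/* Return the table entry for version @v, NULL if it's not supported */
const struct data_version *version_lookup(uint32_t v)
{
	const struct data_version *d;

	for (d = versions; d->v; d++) {
		if (d->v == v)
			return d;
	}

	return NULL;
}
```

A target that gets NULL back here would abort the migration before
reading any data section.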
>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> Makefile | 12 +--
> migrate.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> migrate.h | 90 ++++++++++++++++++
> passt.c | 2 +-
> vu_common.c | 122 ++++++++++++++++---------
> vu_common.h | 2 +-
> 6 files changed, 438 insertions(+), 49 deletions(-)
> create mode 100644 migrate.c
> create mode 100644 migrate.h
>
> diff --git a/Makefile b/Makefile
> index 464eef1..1383875 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>
> PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
> - ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
> - tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> + ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
> + tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> vhost_user.c virtio.c vu_common.c
> QRAP_SRCS = qrap.c
> SRCS = $(PASST_SRCS) $(QRAP_SRCS)
> @@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1
>
> PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
> flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
> - lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
> - siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
> - tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
> - virtio.h vu_common.h
> + lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
> + pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
> + tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
> + vhost_user.h virtio.h vu_common.h
> HEADERS = $(PASST_HEADERS) seccomp.h
>
> C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
> diff --git a/migrate.c b/migrate.c
> new file mode 100644
> index 0000000..bee9653
> --- /dev/null
> +++ b/migrate.c
> @@ -0,0 +1,259 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +/* PASST - Plug A Simple Socket Transport
> + * for qemu/UNIX domain socket mode
> + *
> + * PASTA - Pack A Subtle Tap Abstraction
> + * for network namespace/tap device mode
> + *
> + * migrate.c - Migration sections, layout, and routines
> + *
> + * Copyright (c) 2025 Red Hat GmbH
> + * Author: Stefano Brivio <sbrivio@redhat.com>
> + */
> +
> +#include <errno.h>
> +#include <sys/uio.h>
> +
> +#include "util.h"
> +#include "ip.h"
> +#include "passt.h"
> +#include "inany.h"
> +#include "flow.h"
> +#include "flow_table.h"
> +
> +#include "migrate.h"
> +
> +/* Current version of migration data */
> +#define MIGRATE_VERSION 1
> +
> +/* Magic as we see it and as seen with reverse endianness */
> +#define MIGRATE_MAGIC 0xB1BB1D1B0BB1D1B0
> +#define MIGRATE_MAGIC_SWAPPED 0xB0D1B10B1B1DBBB1
As noted, I'm hoping we can get rid of "either endian" migration. But
if this stays, we should define it using __bswap_constant_32() to
avoid embarrassing mistakes.
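To illustrate deriving the swapped value rather than typing it by hand, here is a minimal sketch. BSWAP64_CONSTANT() is a hypothetical macro written out here, not an existing passt or libc helper, so it doesn't depend on what a given libc provides. Note that the magic is 64 bits wide, so a 64-bit swap is the one that applies.

```c
#include <stdint.h>

/* Hypothetical helper (not in passt): byte-swap a 64-bit constant, usable in
 * static initialisers and static assertions.
 */
#define BSWAP64_CONSTANT(x)						\
	((((uint64_t)(x) & 0x00000000000000ffULL) << 56) |		\
	 (((uint64_t)(x) & 0x000000000000ff00ULL) << 40) |		\
	 (((uint64_t)(x) & 0x0000000000ff0000ULL) << 24) |		\
	 (((uint64_t)(x) & 0x00000000ff000000ULL) <<  8) |		\
	 (((uint64_t)(x) & 0x000000ff00000000ULL) >>  8) |		\
	 (((uint64_t)(x) & 0x0000ff0000000000ULL) >> 24) |		\
	 (((uint64_t)(x) & 0x00ff000000000000ULL) >> 40) |		\
	 (((uint64_t)(x) & 0xff00000000000000ULL) >> 56))

/* Magic as in the patch, swapped value derived instead of hand-written */
#define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0ULL
#define MIGRATE_MAGIC_SWAPPED	BSWAP64_CONSTANT(MIGRATE_MAGIC)

/* The derived value matches the constant currently written in the patch */
_Static_assert(MIGRATE_MAGIC_SWAPPED == 0xB0D1B10B1B1DBBB1ULL,
	       "swapped magic doesn't match");
```

With something like this, a typo in one of the two constants becomes a build failure instead of a migration failure between hosts of different endianness.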
> +
> +/* Migration header to send from source */
> +static union migrate_header header = {
> + .magic = MIGRATE_MAGIC,
> + .version = htonl_constant(MIGRATE_VERSION),
> + .time_t_size = htonl_constant(sizeof(time_t)),
> + .flow_size = htonl_constant(sizeof(union flow)),
> + .flow_sidx_size = htonl_constant(sizeof(struct flow_sidx)),
> + .voidp_size = htonl_constant(sizeof(void *)),
> +};
> +
> +/* Data sections for version 1 */
> +static struct iovec sections_v1[] = {
> + { &header, sizeof(header) },
> +};
> +
> +/* Set of data versions */
> +static struct migrate_data data_versions[] = {
> + {
> + 1, sections_v1,
> + },
> + { 0 },
> +};
> +
> +/* Handlers to call in source before sending data */
> +struct migrate_handler handlers_source_pre[] = {
> + { 0 },
> +};
> +
> +/* Handlers to call in source after sending data */
> +struct migrate_handler handlers_source_post[] = {
> + { 0 },
> +};
> +
> +/* Handlers to call in target before receiving data with version 1 */
> +struct migrate_handler handlers_target_pre_v1[] = {
> + { 0 },
> +};
> +
> +/* Handlers to call in target after receiving data with version 1 */
> +struct migrate_handler handlers_target_post_v1[] = {
> + { 0 },
> +};
> +
> +/* Versioned sets of migration handlers */
> +struct migrate_target_handlers target_handlers[] = {
> + {
> + 1,
> + handlers_target_pre_v1,
> + handlers_target_post_v1,
> + },
> + { 0 },
> +};
> +
> +/**
> + * migrate_source_pre() - Pre-migration tasks as source
> + * @m: Migration metadata
> + *
> + * Return: 0 on success, error code on failure
> + */
> +int migrate_source_pre(struct migrate_meta *m)
> +{
> + struct migrate_handler *h;
> +
> + for (h = handlers_source_pre; h->fn; h++) {
> + int rc;
> +
> + if ((rc = h->fn(m, h->data)))
> + return rc;
> + }
> +
> + return 0;
> +}
> +
> +/**
> + * migrate_source() - Perform migration as source: send state to hypervisor
> + * @fd: Descriptor for state transfer
> + * @m: Migration metadata
> + *
> + * Return: 0 on success, error code on failure
> + */
> +int migrate_source(int fd, const struct migrate_meta *m)
> +{
> + static struct migrate_data *d;
> + unsigned count;
> + int rc;
> +
> + for (d = data_versions; d->v != MIGRATE_VERSION; d++);
Should ASSERT() if we don't find the version within the array.
> + for (count = 0; d->sections[count].iov_len; count++);
> +
> + debug("Writing %u migration sections", count - 1 /* minus header */);
> + rc = write_remainder(fd, d->sections, count, 0);
> + if (rc < 0)
> + return errno;
> +
> + return 0;
> +}
> +
> +/**
> + * migrate_source_post() - Post-migration tasks as source
> + * @m: Migration metadata
> + *
> + * Return: 0 on success, error code on failure
> + */
> +void migrate_source_post(struct migrate_meta *m)
> +{
> + struct migrate_handler *h;
> +
> + for (h = handlers_source_post; h->fn; h++)
> + h->fn(m, h->data);
Is there actually anything we might need to do on the source after a
successful migration, other than exit?
> +}
> +
> +/**
> + * migrate_target_read_header() - Set metadata in target from source header
> + * @fd: Descriptor for state transfer
> + * @m: Migration metadata, filled on return
> + *
> + * Return: 0 on success, error code on failure
We nearly always use negative error codes. Why not here?
> + */
> +int migrate_target_read_header(int fd, struct migrate_meta *m)
> +{
> + static struct migrate_data *d;
> + union migrate_header h;
> +
> + if (read_all_buf(fd, &h, sizeof(h)))
> + return errno;
> +
> + debug("Source magic: 0x%016" PRIx64 ", sizeof(void *): %u, version: %u",
> + h.magic, ntohl(h.voidp_size), ntohl(h.version));
> +
> + for (d = data_versions; d->v != ntohl(h.version); d++);
> + if (!d->v)
> + return ENOTSUP;
This check comes too late. The loop doesn't stop at the zero terminator,
so you've already overrun the data_versions table if the version wasn't
in there. Easier to use an ARRAY_SIZE() limit in the loop, I think.
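A rough sketch of what a bounded lookup could look like. The struct here is a simplified stand-in for struct migrate_data, migrate_data_find() is a hypothetical name, and ARRAY_SIZE() is redefined locally only to keep the example self-contained.

```c
#include <stddef.h>
#include <stdint.h>

/* Local definition for the example only */
#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

/* Simplified stand-in for struct migrate_data from the patch */
struct migrate_data {
	uint32_t v;
	const void *sections;
};

/* No zero terminator needed: the loop below is bounded by the array size */
static const struct migrate_data data_versions[] = {
	{ 1, NULL },
};

/**
 * migrate_data_find() - Look up data description by version (hypothetical)
 * @v:		Data version, host order
 *
 * Return: pointer to matching entry, NULL if the version is not supported
 */
static const struct migrate_data *migrate_data_find(uint32_t v)
{
	size_t i;

	for (i = 0; i < ARRAY_SIZE(data_versions); i++) {
		if (data_versions[i].v == v)
			return &data_versions[i];
	}

	return NULL;
}
```

The caller can then return an error on NULL in the target, and the same helper would cover the source side, where not finding MIGRATE_VERSION is a programming error.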
> + m->v = d->v;
> +
> + if (h.magic == MIGRATE_MAGIC)
> + m->bswap = false;
> + else if (h.magic == MIGRATE_MAGIC_SWAPPED)
> + m->bswap = true;
> + else
> + return ENOTSUP;
> +
> + if (ntohl(h.voidp_size) == 4)
> + m->source_64b = false;
> + else if (ntohl(h.voidp_size) == 8)
> + m->source_64b = true;
> + else
> + return ENOTSUP;
> +
> + if (ntohl(h.time_t_size) == 4)
> + m->time_64b = false;
> + else if (ntohl(h.time_t_size) == 8)
> + m->time_64b = true;
> + else
> + return ENOTSUP;
> +
> + m->flow_size = ntohl(h.flow_size);
> + m->flow_sidx_size = ntohl(h.flow_sidx_size);
> +
> + return 0;
> +}
> +
> +/**
> + * migrate_target_pre() - Pre-migration tasks as target
> + * @m: Migration metadata
> + *
> + * Return: 0 on success, error code on failure
> + */
> +int migrate_target_pre(struct migrate_meta *m)
> +{
> + struct migrate_target_handlers *th;
> + struct migrate_handler *h;
> +
> + for (th = target_handlers; th->v != m->v && th->v; th++);
> +
> + for (h = th->pre; h->fn; h++) {
> + int rc;
> +
> + if ((rc = h->fn(m, h->data)))
> + return rc;
> + }
> +
> + return 0;
> +}
> +
> +/**
> + * migrate_target() - Perform migration as target: receive state from hypervisor
> + * @fd: Descriptor for state transfer
> + * @m: Migration metadata
> + *
> + * Return: 0 on success, error code on failure
> + *
> + * #syscalls:vu readv
> + */
> +int migrate_target(int fd, const struct migrate_meta *m)
> +{
> + static struct migrate_data *d;
> + unsigned cnt;
> + int rc;
> +
> + for (d = data_versions; d->v != m->v && d->v; d++);
> +
> + for (cnt = 0; d->sections[cnt + 1 /* skip header */].iov_len; cnt++);
> +
> + debug("Reading %u migration sections", cnt);
> + rc = read_remainder(fd, d->sections + 1, cnt, 0);
> + if (rc < 0)
> + return errno;
> +
> + return 0;
> +}
> +
> +/**
> + * migrate_target_post() - Post-migration tasks as target
> + * @m: Migration metadata
> + */
> +void migrate_target_post(struct migrate_meta *m)
> +{
> + struct migrate_target_handlers *th;
> + struct migrate_handler *h;
> +
> + for (th = target_handlers; th->v != m->v && th->v; th++);
> +
> + for (h = th->post; h->fn; h++)
> + h->fn(m, h->data);
> +}
> diff --git a/migrate.h b/migrate.h
> new file mode 100644
> index 0000000..5582f75
> --- /dev/null
> +++ b/migrate.h
> @@ -0,0 +1,90 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later
> + * Copyright (c) 2025 Red Hat GmbH
> + * Author: Stefano Brivio <sbrivio@redhat.com>
> + */
> +
> +#ifndef MIGRATE_H
> +#define MIGRATE_H
> +
> +/**
> + * struct migrate_meta - Migration metadata
> + * @v: Chosen migration data version, host order
> + * @bswap: Source has opposite endianness
> + * @peer_64b: Source uses 64-bit void *
> + * @time_64b: Source uses 64-bit time_t
> + * @flow_size: Size of union flow in source
> + * @flow_sidx_size: Size of struct flow_sidx in source
> + */
> +struct migrate_meta {
> + uint32_t v;
> + bool bswap;
> + bool source_64b;
> + bool time_64b;
> + size_t flow_size;
> + size_t flow_sidx_size;
> +};
> +
> +/**
> + * union migrate_header - Migration header from source
> + * @magic: 0xB1BB1D1B0BB1D1B0, host order
> + * @version: Source sends highest known, target aborts if unsupported
> + * @voidp_size: sizeof(void *), network order
> + * @time_t_size: sizeof(time_t), network order
> + * @flow_size: sizeof(union flow), network order
> + * @flow_sidx_size: sizeof(struct flow_sidx_t), network order
> + * @unused: Go figure
> + */
> +union migrate_header {
> + struct {
> + uint64_t magic;
> + uint32_t version;
> + uint32_t voidp_size;
> + uint32_t time_t_size;
> + uint32_t flow_size;
> + uint32_t flow_sidx_size;
> + };
> + uint8_t unused[65536];
So, having looked at this, I no longer think padding the header to 64 KiB
is a good idea. The structure means we're basically stuck always
having that chunky header. Instead, I think the header should be
absolutely minimal: basically magic and version only. v1 (and maybe
others) can add a "metadata" or whatever section for any additional
information like this that they need.
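Something along these lines, as a sketch only. The names and the exact split are hypothetical, and whether to use the packed attribute or explicit padding is an open choice.

```c
#include <stdint.h>

/* Hypothetical minimal header, common to all data versions */
struct migrate_header {
	uint64_t magic;
	uint32_t version;	/* network order */
} __attribute__((packed));

/* Hypothetical metadata section, sent only for version 1: the size fields
 * currently carried by the header in the patch.
 */
struct migrate_meta_v1 {
	uint32_t voidp_size;		/* network order */
	uint32_t time_t_size;		/* network order */
	uint32_t flow_size;		/* network order */
	uint32_t flow_sidx_size;	/* network order */
} __attribute__((packed));

/* Fixed-width members only, so the wire sizes can be checked at build time */
_Static_assert(sizeof(struct migrate_header) == 12, "unexpected header size");
_Static_assert(sizeof(struct migrate_meta_v1) == 16, "unexpected v1 size");
```

That way a later version that doesn't need the size fields can simply not send them.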
> +};
> +
> +/**
> + * struct migrate_data - Data sections for given source version
> + * @v: Source version this applies to, host order
> + * @sections: Array of data sections, NULL-terminated
> + */
> +struct migrate_data {
> + uint32_t v;
> + struct iovec *sections;
> +};
> +
> +/**
> + * struct migrate_handler - Function to handle a specific data section
> + * @fn: Function pointer taking pointer to data section
> + * @data: Associated data section
> + */
> +struct migrate_handler {
> + int (*fn)(struct migrate_meta *m, void *data);
> + void *data;
> +};
> +
> +/**
> + * struct migrate_target_handlers - Versioned sets of migration target handlers
> + * @v: Source version this applies to, host order
> + * @pre: Set of functions to execute in target before data copy
> + * @post: Set of functions to execute in target after data copy
> + */
> +struct migrate_target_handlers {
> + uint32_t v;
> + struct migrate_handler *pre;
> + struct migrate_handler *post;
> +};
> +
> +int migrate_source_pre(struct migrate_meta *m);
> +int migrate_source(int fd, const struct migrate_meta *m);
> +void migrate_source_post(struct migrate_meta *m);
> +
> +int migrate_target_read_header(int fd, struct migrate_meta *m);
> +int migrate_target_pre(struct migrate_meta *m);
> +int migrate_target(int fd, const struct migrate_meta *m);
> +void migrate_target_post(struct migrate_meta *m);
> +
> +#endif /* MIGRATE_H */
> diff --git a/passt.c b/passt.c
> index b1c8ab6..184d4e5 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -358,7 +358,7 @@ loop:
> vu_kick_cb(c.vdev, ref, &now);
> break;
> case EPOLL_TYPE_VHOST_MIGRATION:
> - vu_migrate(c.vdev, eventmask);
> + vu_migrate(&c, eventmask);
> break;
> default:
> /* Can't happen */
> diff --git a/vu_common.c b/vu_common.c
> index f43d8ac..0c67bd0 100644
> --- a/vu_common.c
> +++ b/vu_common.c
> @@ -5,6 +5,7 @@
> * common_vu.c - vhost-user common UDP and TCP functions
> */
>
> +#include <errno.h>
> #include <unistd.h>
> #include <sys/uio.h>
> #include <sys/eventfd.h>
> @@ -17,6 +18,7 @@
> #include "vhost_user.h"
> #include "pcap.h"
> #include "vu_common.h"
> +#include "migrate.h"
>
> #define VU_MAX_TX_BUFFER_NB 2
>
> @@ -305,50 +307,88 @@ err:
> }
>
> /**
> - * vu_migrate() - Send/receive passt insternal state to/from QEMU
> - * @vdev: vhost-user device
> + * vu_migrate_source() - Migration as source, send state to hypervisor
> + * @fd: File descriptor for state transfer
> + *
> + * Return: 0 on success, positive error code on failure
> + */
> +static int vu_migrate_source(int fd)
> +{
> + struct migrate_meta m;
> + int rc;
> +
> + if ((rc = migrate_source_pre(&m))) {
> + err("Source pre-migration failed: %s, abort", strerror_(rc));
> + return rc;
> + }
> +
> + debug("Saving backend state");
> +
> + rc = migrate_source(fd, &m);
> + if (rc)
> + err("Source migration failed: %s", strerror_(rc));
> + else
> + migrate_source_post(&m);
> +
> + return rc;
After a successful source migration shouldn't we exit, or at least
quiesce ourselves so we don't accidentally mess with anything the
target is now doing?
> +}
> +
> +/**
> + * vu_migrate_target() - Migration as target, receive state from hypervisor
> + * @fd: File descriptor for state transfer
> + *
> + * Return: 0 on success, positive error code on failure
> + */
> +static int vu_migrate_target(int fd)
> +{
> + struct migrate_meta m;
> + int rc;
> +
> + rc = migrate_target_read_header(fd, &m);
> + if (rc) {
> + err("Migration header check failed: %s, abort", strerror_(rc));
> + return rc;
> + }
> +
> + if ((rc = migrate_target_pre(&m))) {
> + err("Target pre-migration failed: %s, abort", strerror_(rc));
> + return rc;
> + }
> +
> + debug("Loading backend state");
> +
> + rc = migrate_target(fd, &m);
> + if (rc)
> + err("Target migration failed: %s", strerror_(rc));
> + else
> + migrate_target_post(&m);
> +
> + return rc;
> +}
> +
> +/**
> + * vu_migrate() - Send/receive passt internal state to/from QEMU
> + * @c: Execution context
> * @events: epoll events
> */
> -void vu_migrate(struct vu_dev *vdev, uint32_t events)
> +void vu_migrate(struct ctx *c, uint32_t events)
> {
> - int ret;
> + struct vu_dev *vdev = c->vdev;
> + int rc = EIO;
>
> - /* TODO: collect/set passt internal state
> - * and use vdev->device_state_fd to send/receive it
> - */
> debug("vu_migrate fd %d events %x", vdev->device_state_fd, events);
> - if (events & EPOLLOUT) {
> - debug("Saving backend state");
> -
> - /* send some stuff */
> - ret = write(vdev->device_state_fd, "PASST", 6);
> - /* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
> - vdev->device_state_result = ret == -1 ? -1 : 0;
> - /* Closing the file descriptor signals the end of transfer */
> - epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
> - vdev->device_state_fd, NULL);
> - close(vdev->device_state_fd);
> - vdev->device_state_fd = -1;
> - } else if (events & EPOLLIN) {
> - char buf[6];
> -
> - debug("Loading backend state");
> - /* read some stuff */
> - ret = read(vdev->device_state_fd, buf, sizeof(buf));
> - /* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
> - if (ret != sizeof(buf)) {
> - vdev->device_state_result = -1;
> - } else {
> - ret = strncmp(buf, "PASST", sizeof(buf));
> - vdev->device_state_result = ret == 0 ? 0 : -1;
> - }
> - } else if (events & EPOLLHUP) {
> - debug("Closing migration channel");
> -
> - /* The end of file signals the end of the transfer. */
> - epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
> - vdev->device_state_fd, NULL);
> - close(vdev->device_state_fd);
> - vdev->device_state_fd = -1;
> - }
> +
> + if (events & EPOLLOUT)
> + rc = vu_migrate_source(vdev->device_state_fd);
> + else if (events & EPOLLIN)
> + rc = vu_migrate_target(vdev->device_state_fd);
> +
> + /* EPOLLHUP without EPOLLIN/EPOLLOUT, or EPOLLERR? Migration failed */
> +
> + vdev->device_state_result = rc;
> +
> + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, vdev->device_state_fd, NULL);
> + debug("Closing migration channel");
> + close(vdev->device_state_fd);
> + vdev->device_state_fd = -1;
> }
> diff --git a/vu_common.h b/vu_common.h
> index d56c021..69c4006 100644
> --- a/vu_common.h
> +++ b/vu_common.h
> @@ -57,5 +57,5 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
> void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
> const struct timespec *now);
> int vu_send_single(const struct ctx *c, const void *buf, size_t size);
> -void vu_migrate(struct vu_dev *vdev, uint32_t events);
> +void vu_migrate(struct ctx *c, uint32_t events);
> #endif /* VU_COMMON_H */
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
* Re: [PATCH 7/7] Introduce passt-repair
2025-01-27 23:15 ` [PATCH 7/7] Introduce passt-repair Stefano Brivio
2025-01-27 23:31 ` Stefano Brivio
@ 2025-01-28 1:51 ` David Gibson
2025-01-28 6:51 ` Stefano Brivio
1 sibling, 1 reply; 41+ messages in thread
From: David Gibson @ 2025-01-28 1:51 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier
On Tue, Jan 28, 2025 at 12:15:32AM +0100, Stefano Brivio wrote:
> A privileged helper to set/clear TCP_REPAIR on sockets on behalf of
> passt. Not used yet.
>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> Makefile | 10 +++--
> passt-repair.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 118 insertions(+), 3 deletions(-)
> create mode 100644 passt-repair.c
>
> diff --git a/Makefile b/Makefile
> index 1383875..1b71cb0 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -42,7 +42,8 @@ PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> vhost_user.c virtio.c vu_common.c
> QRAP_SRCS = qrap.c
> -SRCS = $(PASST_SRCS) $(QRAP_SRCS)
> +PASST_REPAIR_SRCS = passt-repair.c
> +SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
>
> MANPAGES = passt.1 pasta.1 qrap.1
>
> @@ -72,9 +73,9 @@ mandir ?= $(datarootdir)/man
> man1dir ?= $(mandir)/man1
>
> ifeq ($(TARGET_ARCH),x86_64)
> -BIN := passt passt.avx2 pasta pasta.avx2 qrap
> +BIN := passt passt.avx2 pasta pasta.avx2 qrap passt-repair
> else
> -BIN := passt pasta qrap
> +BIN := passt pasta qrap passt-repair
> endif
>
> all: $(BIN) $(MANPAGES) docs
> @@ -101,6 +102,9 @@ pasta.avx2 pasta.1 pasta: pasta%: passt%
> qrap: $(QRAP_SRCS) passt.h
> $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS)
>
> +passt-repair: $(PASST_REPAIR_SRCS)
> + $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS)
> +
> valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \
> rt_sigreturn getpid gettid kill clock_gettime mmap \
> mmap2 munmap open unlink gettimeofday futex statx \
> diff --git a/passt-repair.c b/passt-repair.c
> new file mode 100644
> index 0000000..e9b9609
> --- /dev/null
> +++ b/passt-repair.c
> @@ -0,0 +1,111 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +/* PASST - Plug A Simple Socket Transport
> + * for qemu/UNIX domain socket mode
> + *
> + * passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets
> + *
> + * Copyright (c) 2025 Red Hat GmbH
> + * Author: Stefano Brivio <sbrivio@redhat.com>
> + *
> + * Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along
> + * with commands mapping to TCP_REPAIR values, and switch repair mode on or
> + * off. Reply by echoing the command. Exit if the command is INT_MAX.
> + */
> +
> +#include <sys/types.h>
> +#include <sys/socket.h>
> +#include <sys/un.h>
> +#include <errno.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <limits.h>
> +#include <unistd.h>
> +#include <netdb.h>
> +
> +#include <netinet/tcp.h>
> +
> +#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
> +
> +int main(int argc, char **argv)
> +{
> + char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
> + __attribute__ ((aligned(__alignof__(struct cmsghdr))));
> + struct sockaddr_un a = { AF_UNIX, "" };
> + int cmd, fds[SCM_MAX_FD], s, ret, i;
> + struct cmsghdr *cmsg;
> + struct msghdr msg;
> + struct iovec iov;
> +
> + iov = (struct iovec){ &cmd, sizeof(cmd) };
I mean, local to local, it's *probably* fine, but still a network
protocol not defined in terms of explicit width fields makes me
nervous. I'd prefer to see the cmd being a packed structure with
fixed width elements.
I also think we should do some sort of basic magic / version exchange.
I don't see any reason we'd need to extend the protocol, but I'd
rather have the option if we have to. Plus checking a magic number
should make things less damaging and more debuggable if you were to
point the repair helper at an entirely unrelated unix socket instead
of passt's repair socket.
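As a sketch of what I mean, with every name and value below made up for illustration (REPAIR_MAGIC in particular is an arbitrary placeholder):

```c
#include <arpa/inet.h>
#include <errno.h>
#include <stdint.h>

#define REPAIR_MAGIC	0x50525052U	/* hypothetical placeholder value */
#define REPAIR_VERSION	1		/* hypothetical */

/* Hypothetical greeting, sent once by passt after the helper connects */
struct repair_hello {
	uint32_t magic;		/* network order */
	uint32_t version;	/* network order */
} __attribute__((packed));

/* Hypothetical command, sent along with the SCM_RIGHTS descriptors */
struct repair_cmd {
	int8_t cmd;		/* one of the TCP_REPAIR_* values */
} __attribute__((packed));

/**
 * repair_hello_check() - Validate greeting before accepting any command
 * @h:		Greeting as received from the socket
 *
 * Return: 0 if acceptable, negative error code otherwise
 */
static int repair_hello_check(const struct repair_hello *h)
{
	if (ntohl(h->magic) != REPAIR_MAGIC)
		return -EPROTO;		/* probably not passt's socket */

	if (ntohl(h->version) != REPAIR_VERSION)
		return -ENOTSUP;

	return 0;
}
```

The helper would bail out before touching any socket option if the check fails.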
> + msg = (struct msghdr){ NULL, 0, &iov, 1, buf, sizeof(buf), 0 };
> + cmsg = CMSG_FIRSTHDR(&msg);
> +
> + if (argc != 2) {
> + fprintf(stderr, "Usage: %s PATH\n", argv[0]);
> + return -1;
> + }
> +
> + ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", argv[1]);
> + if (ret <= 0 || ret >= (int)sizeof(a.sun_path)) {
> + fprintf(stderr, "Invalid socket path: %s\n", argv[1]);
> + return -1;
> + }
> +
> + if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {
Hmm.. would a datagram socket better suit our needs here?
> + perror("Failed to create AF_UNIX socket");
> + return -1;
> + }
> +
> + if (connect(s, (struct sockaddr *)&a, sizeof(a))) {
> + fprintf(stderr, "Failed to connect to %s: %s\n", argv[1],
> + strerror(errno));
> + return -1;
> + }
> +
> + while (1) {
> + int n;
> +
> + if (recvmsg(s, &msg, 0) < 0) {
> + perror("Failed to receive message");
> + return -1;
> + }
> +
> + if (!cmsg ||
> + cmsg->cmsg_len < CMSG_LEN(sizeof(int)) ||
> + cmsg->cmsg_len > CMSG_LEN(sizeof(int) * SCM_MAX_FD) ||
> + cmsg->cmsg_type != SCM_RIGHTS)
> + return -1;
> +
> + n = cmsg->cmsg_len / CMSG_LEN(sizeof(int));
> + memcpy(fds, CMSG_DATA(cmsg), sizeof(int) * n);
> +
> + switch (cmd) {
> + case INT_MAX:
> + return 0;
> + case TCP_REPAIR_ON:
> + case TCP_REPAIR_OFF:
> + case TCP_REPAIR_OFF_NO_WP:
> + for (i = 0; i < n; i++) {
> + if (setsockopt(fds[i], SOL_TCP, TCP_REPAIR,
> + &cmd, sizeof(int))) {
> + perror("Setting TCP_REPAIR");
> + return -1;
We probably want this to report errors back to passt, rather than just
dying in this case. That way if for some weird reason one socket
can't be placed in repair mode, we can still migrate all the other
connections.
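Roughly like this, as a sketch: repair_set() is a hypothetical helper that keeps going on failure and records one errno value per socket, which could then be sent back in the reply instead of the plain echo. How the reply is encoded is left open here.

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_REPAIR
#define TCP_REPAIR 19	/* from Linux UAPI, in case libc headers lack it */
#endif

/**
 * repair_set() - Set TCP_REPAIR mode on a batch of sockets (hypothetical)
 * @fds:	Socket descriptors received via SCM_RIGHTS
 * @n:		Number of descriptors
 * @cmd:	TCP_REPAIR value to set
 * @errs:	Array of @n entries, set to 0 or errno for each socket
 *
 * Return: number of sockets for which setsockopt() failed
 */
static int repair_set(const int *fds, int n, int cmd, int *errs)
{
	int i, failed = 0;

	for (i = 0; i < n; i++) {
		errs[i] = 0;

		/* IPPROTO_TCP has the same value as SOL_TCP used in the patch */
		if (setsockopt(fds[i], IPPROTO_TCP, TCP_REPAIR,
			       &cmd, sizeof(cmd))) {
			errs[i] = errno;	/* remember, don't bail out */
			failed++;
		}
	}

	return failed;
}
```

passt could then skip just the connections whose sockets failed, and migrate the rest.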
> + }
> + }
> +
> + /* Confirm setting by echoing the command back */
> + if (send(s, &cmd, sizeof(int), 0) < 0) {
> + fprintf(stderr, "Reply to command %i: %s\n",
> + cmd, strerror(errno));
> + return -1;
> + }
> +
> + break;
> + default:
> + fprintf(stderr, "Unsupported command 0x%04x\n", cmd);
> + return -1;
> + }
> + }
> +}
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
* Re: [PATCH 1/7] icmp, udp: Pad time_t timestamp to 64-bit to ease state migration
2025-01-28 0:49 ` David Gibson
@ 2025-01-28 6:48 ` Stefano Brivio
0 siblings, 0 replies; 41+ messages in thread
From: Stefano Brivio @ 2025-01-28 6:48 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Laurent Vivier
On Tue, 28 Jan 2025 11:49:16 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Tue, Jan 28, 2025 at 12:15:26AM +0100, Stefano Brivio wrote:
> > That's the only field in flows with different storage sizes depending
> > on the architecture: it's usually 4-byte wide on 32-bit architectures,
> > except for arc and x32 where it's 8 bytes, and 8-byte wide on 64-bit
> > machines.
>
> As discussed on the call, I think there are broader problems with
> transferring timestamps than just the structure size. So I'm hoping
> we can work out how to not transfer them at all and avoid this change.
This change is not related to the fact that we ignore or use them. It's
about making the flow entries the same size, which we need, at least
with this implementation.
--
Stefano
* Re: [PATCH 3/7] tcp_conn: Avoid 7-bit hole in struct tcp_splice_conn
2025-01-28 0:53 ` David Gibson
@ 2025-01-28 6:48 ` Stefano Brivio
2025-01-29 1:02 ` David Gibson
0 siblings, 1 reply; 41+ messages in thread
From: Stefano Brivio @ 2025-01-28 6:48 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Laurent Vivier
On Tue, 28 Jan 2025 11:53:09 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Tue, Jan 28, 2025 at 12:15:28AM +0100, Stefano Brivio wrote:
> > Moving in_epoll out of the common flow data created a 7-bit hole in
> > struct tcp_splice_conn: repack by shrinking @flags by one (otherwise
> > unused) bit.
>
> Is this actually necessary for the migration stuff? Or just a cleanup
> you spotted along the way?
I thought it was helpful to keep the same size on 32-bit, but it looks
like it's not actually needed.
Let me drop it from this series as it's just noise and I'm trying to
keep this slim. If we are all happy with it I can apply it. If not I'll
forget about it.
--
Stefano
* Re: [PATCH 5/7] util: Add read_remainder() and read_all_buf()
2025-01-28 0:59 ` David Gibson
@ 2025-01-28 6:48 ` Stefano Brivio
2025-01-29 1:03 ` David Gibson
0 siblings, 1 reply; 41+ messages in thread
From: Stefano Brivio @ 2025-01-28 6:48 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Laurent Vivier
On Tue, 28 Jan 2025 11:59:28 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Tue, Jan 28, 2025 at 12:15:30AM +0100, Stefano Brivio wrote:
> > These are symmetric to write_remainder() and write_all_buf() and
> > almost a copy and paste of them, with the most notable differences
> > being reversed reads/writes and a couple of better-safe-than-sorry
> > asserts to keep Coverity happy.
>
> So, there's one thing that needs to be not quite symmetric for the
> read() version: we need to handle EOF. At present, I believe these
> will enter an infinite loop on EOF, which is not a graceful failure
> mode.
It doesn't happen in our current usage where we close the socket once
we're done, but sure, if we use it for something else, boom. Let me add
a rc == 0 case (which gets EIO or EINVAL, I'm not sure yet).
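For reference, a sketch of what that case might look like. This is not the code from 5/7: read_all_buf_eof() is a stand-alone illustration that assumes the same convention as write_all_buf() (0 on success, -1 with errno set on failure) and arbitrarily picks EIO for EOF.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/**
 * read_all_buf_eof() - Fill a whole buffer from a descriptor (sketch only)
 * @fd:		File descriptor
 * @buf:	Buffer to fill
 * @len:	Number of bytes to read
 *
 * Return: 0 on success, -1 on error with errno set, EIO on premature EOF
 */
static int read_all_buf_eof(int fd, void *buf, size_t len)
{
	char *p = buf;
	size_t left = len;

	while (left) {
		ssize_t rc = read(fd, p, left);

		if (rc < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}

		if (rc == 0) {
			/* EOF before @len bytes: fail instead of looping */
			errno = EIO;
			return -1;
		}

		p += rc;
		left -= rc;
	}

	return 0;
}
```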
Or feel free to re-post this if you have clearer ideas how to fix this
up (but only if tested).
--
Stefano
* Re: [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure
2025-01-28 1:40 ` David Gibson
@ 2025-01-28 6:50 ` Stefano Brivio
2025-01-29 1:16 ` David Gibson
0 siblings, 1 reply; 41+ messages in thread
From: Stefano Brivio @ 2025-01-28 6:50 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Laurent Vivier
On Tue, 28 Jan 2025 12:40:12 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Tue, Jan 28, 2025 at 12:15:31AM +0100, Stefano Brivio wrote:
> > Add two sets (source or target) of three functions each for passt in
> > vhost-user mode, triggered by activity on the file descriptor passed
> > via VHOST_USER_PROTOCOL_F_DEVICE_STATE:
> >
> > - migrate_source_pre() and migrate_target_pre() are called to prepare
> > for migration, before data is transferred
> >
> > - migrate_source() sends, and migrate_target() receives migration data
> >
> > - migrate_source_post() and migrate_target_post() are responsible for
> > any post-migration task
> >
> > Callbacks are added to these functions with arrays of function
> > pointers in migrate.c. Migration handlers are versioned.
> >
> > Versioned descriptions of data sections will be added to the
> > data_versions array, which points to versioned iovec arrays. Version
> > 1 is currently empty and will be filled in in subsequent patches.
> >
> > The source announces the data version to be used and informs the peer
> > about endianness, and the size of void *, time_t, flow entries and
> > flow hash table entries.
> >
> > The target checks if the version of the source is still supported. If
> > it's not, it aborts the migration.
> >
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> > Makefile | 12 +--
> > migrate.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > migrate.h | 90 ++++++++++++++++++
> > passt.c | 2 +-
> > vu_common.c | 122 ++++++++++++++++---------
> > vu_common.h | 2 +-
> > 6 files changed, 438 insertions(+), 49 deletions(-)
> > create mode 100644 migrate.c
> > create mode 100644 migrate.h
> >
> > diff --git a/Makefile b/Makefile
> > index 464eef1..1383875 100644
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
> >
> > PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> > icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
> > - ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
> > - tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > + ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
> > + tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > vhost_user.c virtio.c vu_common.c
> > QRAP_SRCS = qrap.c
> > SRCS = $(PASST_SRCS) $(QRAP_SRCS)
> > @@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1
> >
> > PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
> > flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
> > - lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
> > - siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
> > - tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
> > - virtio.h vu_common.h
> > + lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
> > + pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
> > + tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
> > + vhost_user.h virtio.h vu_common.h
> > HEADERS = $(PASST_HEADERS) seccomp.h
> >
> > C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
> > diff --git a/migrate.c b/migrate.c
> > new file mode 100644
> > index 0000000..bee9653
> > --- /dev/null
> > +++ b/migrate.c
> > @@ -0,0 +1,259 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +
> > +/* PASST - Plug A Simple Socket Transport
> > + * for qemu/UNIX domain socket mode
> > + *
> > + * PASTA - Pack A Subtle Tap Abstraction
> > + * for network namespace/tap device mode
> > + *
> > + * migrate.c - Migration sections, layout, and routines
> > + *
> > + * Copyright (c) 2025 Red Hat GmbH
> > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > + */
> > +
> > +#include <errno.h>
> > +#include <sys/uio.h>
> > +
> > +#include "util.h"
> > +#include "ip.h"
> > +#include "passt.h"
> > +#include "inany.h"
> > +#include "flow.h"
> > +#include "flow_table.h"
> > +
> > +#include "migrate.h"
> > +
> > +/* Current version of migration data */
> > +#define MIGRATE_VERSION 1
> > +
> > +/* Magic as we see it and as seen with reverse endianness */
> > +#define MIGRATE_MAGIC 0xB1BB1D1B0BB1D1B0
> > +#define MIGRATE_MAGIC_SWAPPED 0xB0D1B10B1B1DBBB1
>
> As noted, I'm hoping we can get rid of "either endian" migration. But
> if this stays, we should define it using __bswap_constant_32() to
> avoid embarrassing mistakes.
Those always give me issues on musl, so I'd rather test things on
big-endian and realise it's actually 0xB0D1B1B01B1DBBB1 (0x0b bitswap).
Feel free to post a different proposal if tested.
> > +
> > +/* Migration header to send from source */
> > +static union migrate_header header = {
> > + .magic = MIGRATE_MAGIC,
> > + .version = htonl_constant(MIGRATE_VERSION),
> > + .time_t_size = htonl_constant(sizeof(time_t)),
> > + .flow_size = htonl_constant(sizeof(union flow)),
> > + .flow_sidx_size = htonl_constant(sizeof(struct flow_sidx)),
> > + .voidp_size = htonl_constant(sizeof(void *)),
> > +};
> > +
> > +/* Data sections for version 1 */
> > +static struct iovec sections_v1[] = {
> > + { &header, sizeof(header) },
> > +};
> > +
> > +/* Set of data versions */
> > +static struct migrate_data data_versions[] = {
> > + {
> > + 1, sections_v1,
> > + },
> > + { 0 },
> > +};
> > +
> > +/* Handlers to call in source before sending data */
> > +struct migrate_handler handlers_source_pre[] = {
> > + { 0 },
> > +};
> > +
> > +/* Handlers to call in source after sending data */
> > +struct migrate_handler handlers_source_post[] = {
> > + { 0 },
> > +};
> > +
> > +/* Handlers to call in target before receiving data with version 1 */
> > +struct migrate_handler handlers_target_pre_v1[] = {
> > + { 0 },
> > +};
> > +
> > +/* Handlers to call in target after receiving data with version 1 */
> > +struct migrate_handler handlers_target_post_v1[] = {
> > + { 0 },
> > +};
> > +
> > +/* Versioned sets of migration handlers */
> > +struct migrate_target_handlers target_handlers[] = {
> > + {
> > + 1,
> > + handlers_target_pre_v1,
> > + handlers_target_post_v1,
> > + },
> > + { 0 },
> > +};
> > +
> > +/**
> > + * migrate_source_pre() - Pre-migration tasks as source
> > + * @m: Migration metadata
> > + *
> > + * Return: 0 on success, error code on failure
> > + */
> > +int migrate_source_pre(struct migrate_meta *m)
> > +{
> > + struct migrate_handler *h;
> > +
> > + for (h = handlers_source_pre; h->fn; h++) {
> > + int rc;
> > +
> > + if ((rc = h->fn(m, h->data)))
> > + return rc;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +/**
> > + * migrate_source() - Perform migration as source: send state to hypervisor
> > + * @fd: Descriptor for state transfer
> > + * @m: Migration metadata
> > + *
> > + * Return: 0 on success, error code on failure
> > + */
> > +int migrate_source(int fd, const struct migrate_meta *m)
> > +{
> > + static struct migrate_data *d;
> > + unsigned count;
> > + int rc;
> > +
> > + for (d = data_versions; d->v != MIGRATE_VERSION; d++);
>
> Should ASSERT() if we don't find the version within the array.
This looks a bit unnecessary, MIGRATE_VERSION is defined just above...
it's just a readability killer to me.
> > + for (count = 0; d->sections[count].iov_len; count++);
> > +
> > + debug("Writing %u migration sections", count - 1 /* minus header */);
> > + rc = write_remainder(fd, d->sections, count, 0);
> > + if (rc < 0)
> > + return errno;
> > +
> > + return 0;
> > +}
> > +
> > +/**
> > + * migrate_source_post() - Post-migration tasks as source
> > + * @m: Migration metadata
> > + *
> > + * Return: 0 on success, error code on failure
> > + */
> > +void migrate_source_post(struct migrate_meta *m)
> > +{
> > + struct migrate_handler *h;
> > +
> > + for (h = handlers_source_post; h->fn; h++)
> > + h->fn(m, h->data);
>
> Is there actually anything we might need to do on the source after a
> successful migration, other than exit?
We might want to log a couple of things, which would warrant these
handlers.
But let's say we need to do something *similar* to "updating the
network" such as the RARP announcement that QEMU is requesting (this is
intended for OVN-Kubernetes, so go figure), or that we need a
workaround for a kernel issue with implicit close() with TCP_REPAIR
on... I would leave this in for completeness.
> > +}
> > +
> > +/**
> > + * migrate_target_read_header() - Set metadata in target from source header
> > + * @fd: Descriptor for state transfer
> > + * @m: Migration metadata, filled on return
> > + *
> > + * Return: 0 on success, error code on failure
>
> We nearly always use negative error codes. Why not here?
Because the reply to VHOST_USER_SET_DEVICE_STATE_FD is unsigned:
https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#front-end-message-types
and I want to keep this consistent/untranslated.
> > + */
> > +int migrate_target_read_header(int fd, struct migrate_meta *m)
> > +{
> > + static struct migrate_data *d;
> > + union migrate_header h;
> > +
> > + if (read_all_buf(fd, &h, sizeof(h)))
> > + return errno;
> > +
> > + debug("Source magic: 0x%016" PRIx64 ", sizeof(void *): %u, version: %u",
> > + h.magic, ntohl(h.voidp_size), ntohl(h.version));
> > +
> > + for (d = data_versions; d->v != ntohl(h.version); d++);
> > + if (!d->v)
> > + return ENOTSUP;
>
> This is too late. The loop doesn't check it, so you've already
> overrun the data_versions table if the version wasn't in there.
Ah, yes, I forgot the '&& d->v' part (see migrate_target()).
> Easier to use an ARRAY_SIZE() limit in the loop, I think.
I'd rather keep that as a one-liner, and NULL-terminate the arrays.
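As a rough sketch of the one-liner with the '&& d->v' part in place, on a simplified table: struct version_entry, versions[] and version_lookup() are hypothetical stand-ins, not names from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct migrate_data: only the version field */
struct version_entry {
	uint32_t v;
};

/* Table terminated by an entry with v == 0, as in the patch */
static struct version_entry versions[] = {
	{ 1 },
	{ 2 },
	{ 0 },
};

/* Return the entry matching @v, or NULL if the table doesn't list it */
struct version_entry *version_lookup(uint32_t v)
{
	struct version_entry *d;

	/* Stop on a match or on the terminator, whichever comes first */
	for (d = versions; d->v != v && d->v; d++);

	/* Landing on the terminator means the version is unknown */
	return d->v ? d : NULL;
}
```

With the terminator check in the loop condition, an unknown version stops at the last entry instead of running past the end of the table.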
> > + m->v = d->v;
> > +
> > + if (h.magic == MIGRATE_MAGIC)
> > + m->bswap = false;
> > + else if (h.magic == MIGRATE_MAGIC_SWAPPED)
> > + m->bswap = true;
> > + else
> > + return ENOTSUP;
> > +
> > + if (ntohl(h.voidp_size) == 4)
> > + m->source_64b = false;
> > + else if (ntohl(h.voidp_size) == 8)
> > + m->source_64b = true;
> > + else
> > + return ENOTSUP;
> > +
> > + if (ntohl(h.time_t_size) == 4)
> > + m->time_64b = false;
> > + else if (ntohl(h.time_t_size) == 8)
> > + m->time_64b = true;
> > + else
> > + return ENOTSUP;
> > +
> > + m->flow_size = ntohl(h.flow_size);
> > + m->flow_sidx_size = ntohl(h.flow_sidx_size);
> > +
> > + return 0;
> > +}
> > +
> > +/**
> > + * migrate_target_pre() - Pre-migration tasks as target
> > + * @m: Migration metadata
> > + *
> > + * Return: 0 on success, error code on failure
> > + */
> > +int migrate_target_pre(struct migrate_meta *m)
> > +{
> > + struct migrate_target_handlers *th;
> > + struct migrate_handler *h;
> > +
> > + for (th = target_handlers; th->v != m->v && th->v; th++);
> > +
> > + for (h = th->pre; h->fn; h++) {
> > + int rc;
> > +
> > + if ((rc = h->fn(m, h->data)))
> > + return rc;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +/**
> > + * migrate_target() - Perform migration as target: receive state from hypervisor
> > + * @fd: Descriptor for state transfer
> > + * @m: Migration metadata
> > + *
> > + * Return: 0 on success, error code on failure
> > + *
> > + * #syscalls:vu readv
> > + */
> > +int migrate_target(int fd, const struct migrate_meta *m)
> > +{
> > + static struct migrate_data *d;
> > + unsigned cnt;
> > + int rc;
> > +
> > + for (d = data_versions; d->v != m->v && d->v; d++);
> > +
> > + for (cnt = 0; d->sections[cnt + 1 /* skip header */].iov_len; cnt++);
> > +
> > + debug("Reading %u migration sections", cnt);
> > + rc = read_remainder(fd, d->sections + 1, cnt, 0);
> > + if (rc < 0)
> > + return errno;
> > +
> > + return 0;
> > +}
> > +
> > +/**
> > + * migrate_target_post() - Post-migration tasks as target
> > + * @m: Migration metadata
> > + */
> > +void migrate_target_post(struct migrate_meta *m)
> > +{
> > + struct migrate_target_handlers *th;
> > + struct migrate_handler *h;
> > +
> > + for (th = target_handlers; th->v != m->v && th->v; th++);
> > +
> > + for (h = th->post; h->fn; h++)
> > + h->fn(m, h->data);
> > +}
> > diff --git a/migrate.h b/migrate.h
> > new file mode 100644
> > index 0000000..5582f75
> > --- /dev/null
> > +++ b/migrate.h
> > @@ -0,0 +1,90 @@
> > +/* SPDX-License-Identifier: GPL-2.0-or-later
> > + * Copyright (c) 2025 Red Hat GmbH
> > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > + */
> > +
> > +#ifndef MIGRATE_H
> > +#define MIGRATE_H
> > +
> > +/**
> > + * struct migrate_meta - Migration metadata
> > + * @v: Chosen migration data version, host order
> > + * @bswap: Source has opposite endianness
> > + * @source_64b: Source uses 64-bit void *
> > + * @time_64b: Source uses 64-bit time_t
> > + * @flow_size: Size of union flow in source
> > + * @flow_sidx_size: Size of struct flow_sidx in source
> > + */
> > +struct migrate_meta {
> > + uint32_t v;
> > + bool bswap;
> > + bool source_64b;
> > + bool time_64b;
> > + size_t flow_size;
> > + size_t flow_sidx_size;
> > +};
> > +
> > +/**
> > + * union migrate_header - Migration header from source
> > + * @magic: 0xB1BB1D1B0BB1D1B0, host order
> > + * @version: Source sends highest known, target aborts if unsupported
> > + * @voidp_size: sizeof(void *), network order
> > + * @time_t_size: sizeof(time_t), network order
> > + * @flow_size: sizeof(union flow), network order
> > + * @flow_sidx_size: sizeof(struct flow_sidx), network order
> > + * @unused: Go figure
> > + */
> > +union migrate_header {
> > + struct {
> > + uint64_t magic;
> > + uint32_t version;
> > + uint32_t voidp_size;
> > + uint32_t time_t_size;
> > + uint32_t flow_size;
> > + uint32_t flow_sidx_size;
> > + };
> > + uint8_t unused[65536];
>
> So, having looked at this, I no longer think padding the header to 64kiB
> is a good idea. The structure means we're basically stuck always
> having that chunky header. Instead, I think the header should be
> absolutely minimal: basically magic and version only. v1 (and maybe
> others) can add a "metadata" or whatever section for additional
> information like this they need.
The header is processed by the target in a separate, preliminary step,
though.
That's why I added metadata right in the header: if the target needs to
abort the migration because, say, the size of a flow entry is too big
to handle for a particular version, then we should know that before
migrate_target_pre().
As long as we check the version first, we can always shrink the header
later on. But having 64 KiB reserved looks more robust because it's a
safe place to add this kind of metadata.
Note that 64 KiB is typically transferred in a single read/write
from/to the vhost-user back-end.
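To make the ordering concrete, here is a minimal sketch of a target checking the version first and only then the metadata carried in the same fixed-size header. It assumes a reduced header layout: union header_sketch, SKETCH_MAX_FLOW_SIZE (set to 128 here purely as an example) and header_check() are hypothetical, not what the patch defines.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical limit: largest flow entry this target accepts for version 1 */
#define SKETCH_MAX_FLOW_SIZE	128

/* Reduced version of union migrate_header, still padded to 64 KiB */
union header_sketch {
	struct {
		uint64_t magic;
		uint32_t version;	/* network order */
		uint32_t flow_size;	/* network order */
	};
	uint8_t unused[65536];
};

_Static_assert(sizeof(union header_sketch) == 65536,
	       "header must stay 64 KiB regardless of the fields in use");

/* Return 0 if we can handle this source, positive error code otherwise */
int header_check(const union header_sketch *h)
{
	/* Version first: the other fields only have a meaning once we know it */
	if (ntohl(h->version) != 1)
		return ENOTSUP;

	/* Abort before any pre-migration handler runs if entries don't fit */
	if (ntohl(h->flow_size) > SKETCH_MAX_FLOW_SIZE)
		return ENOTSUP;

	return 0;
}
```

Since the check only needs the header, it can run before migrate_target_pre(), which is the point of keeping the metadata there.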
> > +};
> > +
> > +/**
> > + * struct migrate_data - Data sections for given source version
> > + * @v: Source version this applies to, host order
> > + * @sections: Array of data sections, NULL-terminated
> > + */
> > +struct migrate_data {
> > + uint32_t v;
> > + struct iovec *sections;
> > +};
> > +
> > +/**
> > + * struct migrate_handler - Function to handle a specific data section
> > + * @fn: Function pointer taking pointer to data section
> > + * @data: Associated data section
> > + */
> > +struct migrate_handler {
> > + int (*fn)(struct migrate_meta *m, void *data);
> > + void *data;
> > +};
> > +
> > +/**
> > + * struct migrate_target_handlers - Versioned sets of migration target handlers
> > + * @v: Source version this applies to, host order
> > + * @pre: Set of functions to execute in target before data copy
> > + * @post: Set of functions to execute in target after data copy
> > + */
> > +struct migrate_target_handlers {
> > + uint32_t v;
> > + struct migrate_handler *pre;
> > + struct migrate_handler *post;
> > +};
> > +
> > +int migrate_source_pre(struct migrate_meta *m);
> > +int migrate_source(int fd, const struct migrate_meta *m);
> > +void migrate_source_post(struct migrate_meta *m);
> > +
> > +int migrate_target_read_header(int fd, struct migrate_meta *m);
> > +int migrate_target_pre(struct migrate_meta *m);
> > +int migrate_target(int fd, const struct migrate_meta *m);
> > +void migrate_target_post(struct migrate_meta *m);
> > +
> > +#endif /* MIGRATE_H */
> > diff --git a/passt.c b/passt.c
> > index b1c8ab6..184d4e5 100644
> > --- a/passt.c
> > +++ b/passt.c
> > @@ -358,7 +358,7 @@ loop:
> > vu_kick_cb(c.vdev, ref, &now);
> > break;
> > case EPOLL_TYPE_VHOST_MIGRATION:
> > - vu_migrate(c.vdev, eventmask);
> > + vu_migrate(&c, eventmask);
> > break;
> > default:
> > /* Can't happen */
> > diff --git a/vu_common.c b/vu_common.c
> > index f43d8ac..0c67bd0 100644
> > --- a/vu_common.c
> > +++ b/vu_common.c
> > @@ -5,6 +5,7 @@
> > * common_vu.c - vhost-user common UDP and TCP functions
> > */
> >
> > +#include <errno.h>
> > #include <unistd.h>
> > #include <sys/uio.h>
> > #include <sys/eventfd.h>
> > @@ -17,6 +18,7 @@
> > #include "vhost_user.h"
> > #include "pcap.h"
> > #include "vu_common.h"
> > +#include "migrate.h"
> >
> > #define VU_MAX_TX_BUFFER_NB 2
> >
> > @@ -305,50 +307,88 @@ err:
> > }
> >
> > /**
> > - * vu_migrate() - Send/receive passt insternal state to/from QEMU
> > - * @vdev: vhost-user device
> > + * vu_migrate_source() - Migration as source, send state to hypervisor
> > + * @fd: File descriptor for state transfer
> > + *
> > + * Return: 0 on success, positive error code on failure
> > + */
> > +static int vu_migrate_source(int fd)
> > +{
> > + struct migrate_meta m;
> > + int rc;
> > +
> > + if ((rc = migrate_source_pre(&m))) {
> > + err("Source pre-migration failed: %s, abort", strerror_(rc));
> > + return rc;
> > + }
> > +
> > + debug("Saving backend state");
> > +
> > + rc = migrate_source(fd, &m);
> > + if (rc)
> > + err("Source migration failed: %s", strerror_(rc));
> > + else
> > + migrate_source_post(&m);
> > +
> > + return rc;
>
> After a successful source migration shouldn't we exit, or at least
> quiesce ourselves so we don't accidentally mess with anything the
> target is now doing?
Maybe, yes. Pending TCP connections should be safe because with
TCP_REPAIR they're already quiesced, but we don't close listening
sockets (yet).
Perhaps a reasonable approach for the moment would be to declare a
single migrate_source_post handler logging an info() message and exiting.
> > +}
> > +
> > +/**
> > + * vu_migrate_target() - Migration as target, receive state from hypervisor
> > + * @fd: File descriptor for state transfer
> > + *
> > + * Return: 0 on success, positive error code on failure
> > + */
> > +static int vu_migrate_target(int fd)
> > +{
> > + struct migrate_meta m;
> > + int rc;
> > +
> > + rc = migrate_target_read_header(fd, &m);
> > + if (rc) {
> > + err("Migration header check failed: %s, abort", strerror_(rc));
> > + return rc;
> > + }
> > +
> > + if ((rc = migrate_target_pre(&m))) {
> > + err("Target pre-migration failed: %s, abort", strerror_(rc));
> > + return rc;
> > + }
> > +
> > + debug("Loading backend state");
> > +
> > + rc = migrate_target(fd, &m);
> > + if (rc)
> > + err("Target migration failed: %s", strerror_(rc));
> > + else
> > + migrate_target_post(&m);
> > +
> > + return rc;
> > +}
> > +
> > +/**
> > + * vu_migrate() - Send/receive passt internal state to/from QEMU
> > + * @c: Execution context
> > * @events: epoll events
> > */
> > -void vu_migrate(struct vu_dev *vdev, uint32_t events)
> > +void vu_migrate(struct ctx *c, uint32_t events)
> > {
> > - int ret;
> > + struct vu_dev *vdev = c->vdev;
> > + int rc = EIO;
> >
> > - /* TODO: collect/set passt internal state
> > - * and use vdev->device_state_fd to send/receive it
> > - */
> > debug("vu_migrate fd %d events %x", vdev->device_state_fd, events);
> > - if (events & EPOLLOUT) {
> > - debug("Saving backend state");
> > -
> > - /* send some stuff */
> > - ret = write(vdev->device_state_fd, "PASST", 6);
> > - /* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
> > - vdev->device_state_result = ret == -1 ? -1 : 0;
> > - /* Closing the file descriptor signals the end of transfer */
> > - epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
> > - vdev->device_state_fd, NULL);
> > - close(vdev->device_state_fd);
> > - vdev->device_state_fd = -1;
> > - } else if (events & EPOLLIN) {
> > - char buf[6];
> > -
> > - debug("Loading backend state");
> > - /* read some stuff */
> > - ret = read(vdev->device_state_fd, buf, sizeof(buf));
> > - /* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
> > - if (ret != sizeof(buf)) {
> > - vdev->device_state_result = -1;
> > - } else {
> > - ret = strncmp(buf, "PASST", sizeof(buf));
> > - vdev->device_state_result = ret == 0 ? 0 : -1;
> > - }
> > - } else if (events & EPOLLHUP) {
> > - debug("Closing migration channel");
> > -
> > - /* The end of file signals the end of the transfer. */
> > - epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
> > - vdev->device_state_fd, NULL);
> > - close(vdev->device_state_fd);
> > - vdev->device_state_fd = -1;
> > - }
> > +
> > + if (events & EPOLLOUT)
> > + rc = vu_migrate_source(vdev->device_state_fd);
> > + else if (events & EPOLLIN)
> > + rc = vu_migrate_target(vdev->device_state_fd);
> > +
> > + /* EPOLLHUP without EPOLLIN/EPOLLOUT, or EPOLLERR? Migration failed */
> > +
> > + vdev->device_state_result = rc;
> > +
> > + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, vdev->device_state_fd, NULL);
> > + debug("Closing migration channel");
> > + close(vdev->device_state_fd);
> > + vdev->device_state_fd = -1;
> > }
> > diff --git a/vu_common.h b/vu_common.h
> > index d56c021..69c4006 100644
> > --- a/vu_common.h
> > +++ b/vu_common.h
> > @@ -57,5 +57,5 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
> > void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
> > const struct timespec *now);
> > int vu_send_single(const struct ctx *c, const void *buf, size_t size);
> > -void vu_migrate(struct vu_dev *vdev, uint32_t events);
> > +void vu_migrate(struct ctx *c, uint32_t events);
> > #endif /* VU_COMMON_H */
>
--
Stefano
* Re: [PATCH 7/7] Introduce passt-repair
2025-01-28 1:51 ` David Gibson
@ 2025-01-28 6:51 ` Stefano Brivio
2025-01-29 1:29 ` David Gibson
0 siblings, 1 reply; 41+ messages in thread
From: Stefano Brivio @ 2025-01-28 6:51 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Laurent Vivier
On Tue, 28 Jan 2025 12:51:59 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Tue, Jan 28, 2025 at 12:15:32AM +0100, Stefano Brivio wrote:
> > A privileged helper to set/clear TCP_REPAIR on sockets on behalf of
> > passt. Not used yet.
> >
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> > Makefile | 10 +++--
> > passt-repair.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 118 insertions(+), 3 deletions(-)
> > create mode 100644 passt-repair.c
> >
> > diff --git a/Makefile b/Makefile
> > index 1383875..1b71cb0 100644
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -42,7 +42,8 @@ PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> > tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > vhost_user.c virtio.c vu_common.c
> > QRAP_SRCS = qrap.c
> > -SRCS = $(PASST_SRCS) $(QRAP_SRCS)
> > +PASST_REPAIR_SRCS = passt-repair.c
> > +SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
> >
> > MANPAGES = passt.1 pasta.1 qrap.1
> >
> > @@ -72,9 +73,9 @@ mandir ?= $(datarootdir)/man
> > man1dir ?= $(mandir)/man1
> >
> > ifeq ($(TARGET_ARCH),x86_64)
> > -BIN := passt passt.avx2 pasta pasta.avx2 qrap
> > +BIN := passt passt.avx2 pasta pasta.avx2 qrap passt-repair
> > else
> > -BIN := passt pasta qrap
> > +BIN := passt pasta qrap passt-repair
> > endif
> >
> > all: $(BIN) $(MANPAGES) docs
> > @@ -101,6 +102,9 @@ pasta.avx2 pasta.1 pasta: pasta%: passt%
> > qrap: $(QRAP_SRCS) passt.h
> > $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS)
> >
> > +passt-repair: $(PASST_REPAIR_SRCS)
> > + $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS)
> > +
> > valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \
> > rt_sigreturn getpid gettid kill clock_gettime mmap \
> > mmap2 munmap open unlink gettimeofday futex statx \
> > diff --git a/passt-repair.c b/passt-repair.c
> > new file mode 100644
> > index 0000000..e9b9609
> > --- /dev/null
> > +++ b/passt-repair.c
> > @@ -0,0 +1,111 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +
> > +/* PASST - Plug A Simple Socket Transport
> > + * for qemu/UNIX domain socket mode
> > + *
> > + * passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets
> > + *
> > + * Copyright (c) 2025 Red Hat GmbH
> > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > + *
> > + * Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along
> > + * with commands mapping to TCP_REPAIR values, and switch repair mode on or
> > + * off. Reply by echoing the command. Exit if the command is INT_MAX.
> > + */
> > +
> > +#include <sys/types.h>
> > +#include <sys/socket.h>
> > +#include <sys/un.h>
> > +#include <errno.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <limits.h>
> > +#include <unistd.h>
> > +#include <netdb.h>
> > +
> > +#include <netinet/tcp.h>
> > +
> > +#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
> > +
> > +int main(int argc, char **argv)
> > +{
> > + char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
> > + __attribute__ ((aligned(__alignof__(struct cmsghdr))));
> > + struct sockaddr_un a = { AF_UNIX, "" };
> > + int cmd, fds[SCM_MAX_FD], s, ret, i;
> > + struct cmsghdr *cmsg;
> > + struct msghdr msg;
> > + struct iovec iov;
> > +
> > + iov = (struct iovec){ &cmd, sizeof(cmd) };
>
> I mean, local to local, it's *probably* fine, but still a network
> protocol not defined in terms of explicit width fields makes me
> nervous. I'd prefer to see the cmd being a packed structure with
> fixed width elements.
It actually is, because:
struct {
int cmd;
};
is a packed structure with fixed width elements. Any architecture we
build for (at least the ones I'm aware of) has a 32-bit int. We can
make it uint32_t if it makes you feel better.
> I also think we should do some sort of basic magic / version exchange.
> I don't see any reason we'd need to extend the protocol, but I'd
> rather have the option if we have to.
passt-repair will be packaged and distributed together with passt,
though. Versions must match. And latency here might matter more than in
the rest of the migration process.
> Plus checking a magic number
> should make things less damaging and more debuggable if you were to
> point the repair helper at an entirely unrelated unix socket instead
> of passt's repair socket.
Maybe, yes, even though I don't really see good chances for that
mistake to happen. Feel free to post a proposal, of course.
> > + msg = (struct msghdr){ NULL, 0, &iov, 1, buf, sizeof(buf), 0 };
> > + cmsg = CMSG_FIRSTHDR(&msg);
> > +
> > + if (argc != 2) {
> > + fprintf(stderr, "Usage: %s PATH\n", argv[0]);
> > + return -1;
> > + }
> > +
> > + ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", argv[1]);
> > + if (ret <= 0 || ret >= (int)sizeof(a.sun_path)) {
> > + fprintf(stderr, "Invalid socket path: %s\n", argv[1]);
> > + return -1;
> > + }
> > +
> > + if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {
>
> Hmm.. would a datagram socket better suit our needs here?
We need a connection though, so that passt knows when the helper is
ready to get messages. It could be done with a synchronisation datagram
but it looks more complicated to handle.
By the way, with a connection, we could probably just close() the
socket here instead of having a "quit" command.
If you're referring to the fact we don't keep message boundaries, so we
would in theory need to add short read handling to the recvmsg() below:
I'd rather switch cmd to a single byte instead. You can't transfer less
than that.
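A rough sketch of what the single-byte variant could look like, assuming one command byte per message with the descriptors attached as SCM_RIGHTS ancillary data. cmd_send() and cmd_recv() are hypothetical names, and this is not what the patch implements.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */

/* Hypothetical sender side: one command byte plus a single descriptor */
int cmd_send(int s, int8_t cmd, int fd)
{
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces suitable alignment */
	} u;
	struct iovec iov = { .iov_base = &cmd, .iov_len = sizeof(cmd) };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
			      .msg_control = u.buf,
			      .msg_controllen = sizeof(u.buf) };
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));

	return sendmsg(s, &msg, 0) == (ssize_t)sizeof(cmd) ? 0 : -1;
}

/* Hypothetical receiver side: return number of descriptors, -1 on failure.
 * @fds needs room for SCM_MAX_FD entries. A one-byte payload can't be split
 * by a short read: we get the byte, EOF, or an error.
 */
int cmd_recv(int s, int8_t *cmd, int *fds)
{
	union {
		char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)];
		struct cmsghdr align;
	} u;
	struct iovec iov = { .iov_base = cmd, .iov_len = sizeof(*cmd) };
	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
			      .msg_control = u.buf,
			      .msg_controllen = sizeof(u.buf) };
	struct cmsghdr *cmsg;
	size_t n;

	if (recvmsg(s, &msg, 0) != (ssize_t)sizeof(*cmd))
		return -1;	/* error, or peer closed the connection */

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len < CMSG_LEN(sizeof(int)))
		return -1;

	/* Payload length is cmsg_len minus the control message header */
	n = (cmsg->cmsg_len - CMSG_LEN(0)) / sizeof(int);
	if (n > SCM_MAX_FD)
		return -1;

	memcpy(fds, CMSG_DATA(cmsg), sizeof(int) * n);
	return (int)n;
}
```

With this shape, a closed connection shows up as a failed receive, so no separate "quit" command is needed.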
> > + perror("Failed to create AF_UNIX socket");
> > + return -1;
> > + }
> > +
> > + if (connect(s, (struct sockaddr *)&a, sizeof(a))) {
> > + fprintf(stderr, "Failed to connect to %s: %s\n", argv[1],
> > + strerror(errno));
> > + return -1;
> > + }
> > +
> > + while (1) {
> > + int n;
> > +
> > + if (recvmsg(s, &msg, 0) < 0) {
> > + perror("Failed to receive message");
> > + return -1;
> > + }
> > +
> > + if (!cmsg ||
> > + cmsg->cmsg_len < CMSG_LEN(sizeof(int)) ||
> > + cmsg->cmsg_len > CMSG_LEN(sizeof(int) * SCM_MAX_FD) ||
> > + cmsg->cmsg_type != SCM_RIGHTS)
> > + return -1;
> > +
> > + n = (cmsg->cmsg_len - CMSG_LEN(0)) / sizeof(int);
> > + memcpy(fds, CMSG_DATA(cmsg), sizeof(int) * n);
> > +
> > + switch (cmd) {
> > + case INT_MAX:
> > + return 0;
> > + case TCP_REPAIR_ON:
> > + case TCP_REPAIR_OFF:
> > + case TCP_REPAIR_OFF_NO_WP:
> > + for (i = 0; i < n; i++) {
> > + if (setsockopt(fds[i], SOL_TCP, TCP_REPAIR,
> > + &cmd, sizeof(int))) {
> > + perror("Setting TCP_REPAIR");
> > + return -1;
>
> We probably want this to report errors back to passt, rather than just
> dying in this case. That way if for some weird reason one socket
> can't be placed in repair mode, we can still migrate all the other
> connections.
We implicitly report the error in the sense that we close the
connection and passt will abort the migration. If you look at the
handling of TCP_REPAIR in do_tcp_setsockopt(), you'll see that it
either always fails (EPERM), or always succeeds.
I mean, it's straightforward to implement, and we can just reply with a
different command. But it's probably more meaningful and fitting to
abort altogether.
Besides, if we have to report exactly on which socket we failed, we
won't be able to switch to a single-byte command protocol.
> > + }
> > + }
> > +
> > + /* Confirm setting by echoing the command back */
> > + if (send(s, &cmd, sizeof(int), 0) < 0) {
> > + fprintf(stderr, "Reply to command %i: %s\n",
> > + cmd, strerror(errno));
> > + return -1;
> > + }
> > +
> > + break;
> > + default:
> > + fprintf(stderr, "Unsupported command 0x%04x\n", cmd);
> > + return -1;
> > + }
> > + }
> > +}
--
Stefano
* Re: [PATCH 3/7] tcp_conn: Avoid 7-bit hole in struct tcp_splice_conn
2025-01-28 6:48 ` Stefano Brivio
@ 2025-01-29 1:02 ` David Gibson
2025-01-29 7:33 ` Stefano Brivio
0 siblings, 1 reply; 41+ messages in thread
From: David Gibson @ 2025-01-29 1:02 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier
On Tue, Jan 28, 2025 at 07:48:33AM +0100, Stefano Brivio wrote:
> On Tue, 28 Jan 2025 11:53:09 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Tue, Jan 28, 2025 at 12:15:28AM +0100, Stefano Brivio wrote:
> > > Moving in_epoll out of the common flow data created a 7-bit hole in
> > > struct tcp_splice_conn: repack by shrinking @flags by one (otherwise
> > > unused) bit.
> >
> > Is this actually necessary for the migration stuff? Or just a cleanup
> > you spotted along the way?
>
> I thought it was helpful to keep the same size on 32-bit, but it looks
> like it's not actually needed.
>
> Let me drop it from this series as it's just noise and I'm trying to
> keep this slim. If we are all happy with it I can apply it. If not I'll
> forget about it.
Eh, I don't care that much either way.
Note, btw, that bit-field packing is another way source and
destination could potentially have mismatching data structures. IIUC
bit field packing is described by the ABI and doesn't necessarily
match the byte endianness.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
* Re: [PATCH 5/7] util: Add read_remainder() and read_all_buf()
2025-01-28 6:48 ` Stefano Brivio
@ 2025-01-29 1:03 ` David Gibson
2025-01-29 7:33 ` Stefano Brivio
0 siblings, 1 reply; 41+ messages in thread
From: David Gibson @ 2025-01-29 1:03 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier
[-- Attachment #1: Type: text/plain, Size: 1389 bytes --]
On Tue, Jan 28, 2025 at 07:48:49AM +0100, Stefano Brivio wrote:
> On Tue, 28 Jan 2025 11:59:28 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Tue, Jan 28, 2025 at 12:15:30AM +0100, Stefano Brivio wrote:
> > > These are symmetric to write_remainder() and write_all_buf() and
> > > almost a copy and paste of them, with the most notable differences
> > > being reversed reads/writes and a couple of better-safe-than-sorry
> > > asserts to keep Coverity happy.
> >
> > So, there's one thing that needs to be not quite symmetric for the
> > read() version: we need to handle EOF. At present, I believe these
> > will enter an infinite loop on EOF, which is not a graceful failure
> > mode.
>
> It doesn't happen in our current usage where we close the socket once
> we're done,
I don't see how what we do with the socket is relevant. Couldn't we
hit this case if qemu unexpectedly closed the socket or died?
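To illustrate the failure mode, here is a minimal sketch of a read loop that treats EOF as an error instead of retrying forever. read_all_buf_eof() is a hypothetical name, and EIO as the error for a premature EOF is an assumption, not something settled in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical variant of read_all_buf(): read exactly @len bytes from @fd
 * into @buf. Return 0 on success, -1 with errno set on error or early EOF.
 */
int read_all_buf_eof(int fd, void *buf, size_t len)
{
	size_t left = len;
	char *p = buf;

	while (left) {
		ssize_t rc = read(fd, p, left);

		if (rc < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, try again */
			return -1;
		}

		if (!rc) {
			/* Peer closed before sending everything: without
			 * this check, read() keeps returning 0 and the
			 * loop never ends. EIO is an assumption.
			 */
			errno = EIO;
			return -1;
		}

		p += rc;
		left -= rc;
	}

	return 0;
}
```

A pipe with the write end closed after a partial write is enough to exercise the EOF path.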
> but sure, if we use it for something else, boom. Let me add
> a rc == 0 case (which gets EIO or EINVAL, I'm not sure yet).
>
> Or feel free to re-post this if you have clearer ideas how to fix this
> up (but only if tested).
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
* Re: [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure
2025-01-28 6:50 ` Stefano Brivio
@ 2025-01-29 1:16 ` David Gibson
2025-01-29 7:33 ` Stefano Brivio
0 siblings, 1 reply; 41+ messages in thread
From: David Gibson @ 2025-01-29 1:16 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier
On Tue, Jan 28, 2025 at 07:50:01AM +0100, Stefano Brivio wrote:
> On Tue, 28 Jan 2025 12:40:12 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Tue, Jan 28, 2025 at 12:15:31AM +0100, Stefano Brivio wrote:
> > > Add two sets (source or target) of three functions each for passt in
> > > vhost-user mode, triggered by activity on the file descriptor passed
> > > via VHOST_USER_PROTOCOL_F_DEVICE_STATE:
> > >
> > > - migrate_source_pre() and migrate_target_pre() are called to prepare
> > > for migration, before data is transferred
> > >
> > > - migrate_source() sends, and migrate_target() receives migration data
> > >
> > > - migrate_source_post() and migrate_target_post() are responsible for
> > > any post-migration task
> > >
> > > Callbacks are added to these functions with arrays of function
> > > pointers in migrate.c. Migration handlers are versioned.
> > >
> > > Versioned descriptions of data sections will be added to the
> > > data_versions array, which points to versioned iovec arrays. Version
> > > 1 is currently empty and will be filled in in subsequent patches.
> > >
> > > The source announces the data version to be used and informs the peer
> > > about endianness, and the size of void *, time_t, flow entries and
> > > flow hash table entries.
> > >
> > > The target checks if the version of the source is still supported. If
> > > it's not, it aborts the migration.
> > >
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > ---
> > > Makefile | 12 +--
> > > migrate.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > migrate.h | 90 ++++++++++++++++++
> > > passt.c | 2 +-
> > > vu_common.c | 122 ++++++++++++++++---------
> > > vu_common.h | 2 +-
> > > 6 files changed, 438 insertions(+), 49 deletions(-)
> > > create mode 100644 migrate.c
> > > create mode 100644 migrate.h
> > >
> > > diff --git a/Makefile b/Makefile
> > > index 464eef1..1383875 100644
> > > --- a/Makefile
> > > +++ b/Makefile
> > > @@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
> > >
> > > PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> > > icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
> > > - ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
> > > - tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > > + ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
> > > + tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > > vhost_user.c virtio.c vu_common.c
> > > QRAP_SRCS = qrap.c
> > > SRCS = $(PASST_SRCS) $(QRAP_SRCS)
> > > @@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1
> > >
> > > PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
> > > flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
> > > - lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
> > > - siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
> > > - tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
> > > - virtio.h vu_common.h
> > > + lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
> > > + pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
> > > + tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
> > > + vhost_user.h virtio.h vu_common.h
> > > HEADERS = $(PASST_HEADERS) seccomp.h
> > >
> > > C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
> > > diff --git a/migrate.c b/migrate.c
> > > new file mode 100644
> > > index 0000000..bee9653
> > > --- /dev/null
> > > +++ b/migrate.c
> > > @@ -0,0 +1,259 @@
> > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > +
> > > +/* PASST - Plug A Simple Socket Transport
> > > + * for qemu/UNIX domain socket mode
> > > + *
> > > + * PASTA - Pack A Subtle Tap Abstraction
> > > + * for network namespace/tap device mode
> > > + *
> > > + * migrate.c - Migration sections, layout, and routines
> > > + *
> > > + * Copyright (c) 2025 Red Hat GmbH
> > > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > > + */
> > > +
> > > +#include <errno.h>
> > > +#include <sys/uio.h>
> > > +
> > > +#include "util.h"
> > > +#include "ip.h"
> > > +#include "passt.h"
> > > +#include "inany.h"
> > > +#include "flow.h"
> > > +#include "flow_table.h"
> > > +
> > > +#include "migrate.h"
> > > +
> > > +/* Current version of migration data */
> > > +#define MIGRATE_VERSION 1
> > > +
> > > +/* Magic as we see it and as seen with reverse endianness */
> > > +#define MIGRATE_MAGIC 0xB1BB1D1B0BB1D1B0
> > > +#define MIGRATE_MAGIC_SWAPPED 0xB0D1B10B1B1DBBB1
> >
> > As noted, I'm hoping we can get rid of "either endian" migration. But
> > if this stays, we should define it using __bswap_constant_64() to
> > avoid embarrassing mistakes.
>
> Those always give me issues on musl,
What sort of issues? We're already using them, and have fallback
versions defined in util.h
> so I'd rather test things on
> big-endian and realise it's actually 0xB0D1B1B01B1DBBB1 (0x0b bitswap).
>
> Feel free to post a different proposal if tested.
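For illustration only, here is a minimal sketch of deriving the swapped
magic from the primary one at compile time. The magic is 64 bits wide,
so the swap has to be 64-bit as well. BSWAP64_CONSTANT() is a
hypothetical, locally defined macro (it is neither the glibc one nor the
util.h fallback), and the static assertion simply cross-checks the
result against the literal already in the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical local macro: reverse the bytes of a 64-bit constant */
#define BSWAP64_CONSTANT(x)					\
	((((x) & 0xff00000000000000ULL) >> 56) |		\
	 (((x) & 0x00ff000000000000ULL) >> 40) |		\
	 (((x) & 0x0000ff0000000000ULL) >> 24) |		\
	 (((x) & 0x000000ff00000000ULL) >>  8) |		\
	 (((x) & 0x00000000ff000000ULL) <<  8) |		\
	 (((x) & 0x0000000000ff0000ULL) << 24) |		\
	 (((x) & 0x000000000000ff00ULL) << 40) |		\
	 (((x) & 0x00000000000000ffULL) << 56))

#define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0ULL
#define MIGRATE_MAGIC_SWAPPED	BSWAP64_CONSTANT(MIGRATE_MAGIC)

/* Build fails if the derived value and the hand-written one disagree */
static_assert(MIGRATE_MAGIC_SWAPPED == 0xB0D1B10B1B1DBBB1ULL,
	      "swapped magic does not match the literal from the patch");

/* Runtime accessor, only here so that the macro can be exercised */
uint64_t migrate_magic_swapped(void)
{
	return MIGRATE_MAGIC_SWAPPED;
}
```

With something like this, a single-nibble slip in the second constant
shows up at build time rather than on a big-endian test machine.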
>
> > > +
> > > +/* Migration header to send from source */
> > > +static union migrate_header header = {
> > > + .magic = MIGRATE_MAGIC,
> > > + .version = htonl_constant(MIGRATE_VERSION),
> > > + .time_t_size = htonl_constant(sizeof(time_t)),
> > > + .flow_size = htonl_constant(sizeof(union flow)),
> > > + .flow_sidx_size = htonl_constant(sizeof(struct flow_sidx)),
> > > + .voidp_size = htonl_constant(sizeof(void *)),
> > > +};
> > > +
> > > +/* Data sections for version 1 */
> > > +static struct iovec sections_v1[] = {
> > > + { &header, sizeof(header) },
> > > +};
> > > +
> > > +/* Set of data versions */
> > > +static struct migrate_data data_versions[] = {
> > > + {
> > > + 1, sections_v1,
> > > + },
> > > + { 0 },
> > > +};
> > > +
> > > +/* Handlers to call in source before sending data */
> > > +struct migrate_handler handlers_source_pre[] = {
> > > + { 0 },
> > > +};
> > > +
> > > +/* Handlers to call in source after sending data */
> > > +struct migrate_handler handlers_source_post[] = {
> > > + { 0 },
> > > +};
> > > +
> > > +/* Handlers to call in target before receiving data with version 1 */
> > > +struct migrate_handler handlers_target_pre_v1[] = {
> > > + { 0 },
> > > +};
> > > +
> > > +/* Handlers to call in target after receiving data with version 1 */
> > > +struct migrate_handler handlers_target_post_v1[] = {
> > > + { 0 },
> > > +};
> > > +
> > > +/* Versioned sets of migration handlers */
> > > +struct migrate_target_handlers target_handlers[] = {
> > > + {
> > > + 1,
> > > + handlers_target_pre_v1,
> > > + handlers_target_post_v1,
> > > + },
> > > + { 0 },
> > > +};
> > > +
> > > +/**
> > > + * migrate_source_pre() - Pre-migration tasks as source
> > > + * @m: Migration metadata
> > > + *
> > > + * Return: 0 on success, error code on failure
> > > + */
> > > +int migrate_source_pre(struct migrate_meta *m)
> > > +{
> > > + struct migrate_handler *h;
> > > +
> > > + for (h = handlers_source_pre; h->fn; h++) {
> > > + int rc;
> > > +
> > > + if ((rc = h->fn(m, h->data)))
> > > + return rc;
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/**
> > > + * migrate_source() - Perform migration as source: send state to hypervisor
> > > + * @fd: Descriptor for state transfer
> > > + * @m: Migration metadata
> > > + *
> > > + * Return: 0 on success, error code on failure
> > > + */
> > > +int migrate_source(int fd, const struct migrate_meta *m)
> > > +{
> > > + static struct migrate_data *d;
> > > + unsigned count;
> > > + int rc;
> > > +
> > > + for (d = data_versions; d->v != MIGRATE_VERSION; d++);
> >
> > Should ASSERT() if we don't find the version within the array.
>
> This looks a bit unnecessary, MIGRATE_VERSION is defined just above...
> it's just a readability killer to me.
>
> > > + for (count = 0; d->sections[count].iov_len; count++);
> > > +
> > > + debug("Writing %u migration sections", count - 1 /* minus header */);
> > > + rc = write_remainder(fd, d->sections, count, 0);
> > > + if (rc < 0)
> > > + return errno;
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/**
> > > + * migrate_source_post() - Post-migration tasks as source
> > > + * @m: Migration metadata
> > > + *
> > > + * Return: 0 on success, error code on failure
> > > + */
> > > +void migrate_source_post(struct migrate_meta *m)
> > > +{
> > > + struct migrate_handler *h;
> > > +
> > > + for (h = handlers_source_post; h->fn; h++)
> > > + h->fn(m, h->data);
> >
> > Is there actually anything we might need to do on the source after a
> > successful migration, other than exit?
>
> We might want to log a couple of things, which would warrant these
> handlers.
>
> But let's say we need to do something *similar* to "updating the
> network" such as the RARP announcement that QEMU is requesting (this is
IIUC, that's on the target end, not the source end...
> intended for OVN-Kubernetes, so go figure), or that we need a
> workaround for a kernel issue with implicit close() with TCP_REPAIR
> on... I would leave this in for completeness.
...but sure, point taken.
> > > +}
> > > +
> > > +/**
> > > + * migrate_target_read_header() - Set metadata in target from source header
> > > + * @fd: Descriptor for state transfer
> > > + * @m: Migration metadata, filled on return
> > > + *
> > > + * Return: 0 on success, error code on failure
> >
> > We nearly always use negative error codes. Why not here?
>
> Because the reply to VHOST_USER_SET_DEVICE_STATE_FD is unsigned:
>
> https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#front-end-message-types
>
> and I want to keep this consistent/untranslated.
Ok.
> > > + */
> > > +int migrate_target_read_header(int fd, struct migrate_meta *m)
> > > +{
> > > + static struct migrate_data *d;
> > > + union migrate_header h;
> > > +
> > > + if (read_all_buf(fd, &h, sizeof(h)))
> > > + return errno;
> > > +
> > > + debug("Source magic: 0x%016" PRIx64 ", sizeof(void *): %u, version: %u",
> > > + h.magic, ntohl(h.voidp_size), ntohl(h.version));
> > > +
> > > + for (d = data_versions; d->v != ntohl(h.version); d++);
> > > + if (!d->v)
> > > + return ENOTSUP;
> >
> > This is too late. The loop doesn't check it, so you've already
> > overrun the data_versions table if the version wasn't in there.
>
> Ah, yes, I forgot the '&& d->v' part (see migrate_target()).
>
> > Easier to use an ARRAY_SIZE() limit in the loop, I think.
>
> I'd rather keep that as a one-liner, and NULL-terminate the arrays.
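As a rough sketch of how the one-liner and the bound can coexist: the
loop below stops at the zero-version terminator before comparing
further, so an unknown version can't walk off the table. The struct is a
simplified stand-in for the one in migrate.h, and migrate_data_find() is
a hypothetical helper name, not something from the series:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct migrate_data from the patch */
struct migrate_data {
	uint32_t v;		/* version, 0 terminates the table */
	const void *sections;	/* placeholder for the iovec array */
};

static const int dummy_sections = 0;	/* stands in for sections_v1 */

/* Zero-terminated table, as in the patch */
static struct migrate_data data_versions[] = {
	{ 1, &dummy_sections },
	{ 0, NULL },
};

/* Return the entry for version @v, or NULL if the table doesn't have it */
struct migrate_data *migrate_data_find(uint32_t v)
{
	struct migrate_data *d;

	/* Check the terminator first: never step past the last entry */
	for (d = data_versions; d->v && d->v != v; d++);

	return d->v ? d : NULL;
}
```

A caller on the target side could then return ENOTSUP whenever the
helper yields NULL, and the source side could assert on it.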
>
> > > + m->v = d->v;
> > > +
> > > + if (h.magic == MIGRATE_MAGIC)
> > > + m->bswap = false;
> > > + else if (h.magic == MIGRATE_MAGIC_SWAPPED)
> > > + m->bswap = true;
> > > + else
> > > + return ENOTSUP;
> > > +
> > > + if (ntohl(h.voidp_size) == 4)
> > > + m->source_64b = false;
> > > + else if (ntohl(h.voidp_size) == 8)
> > > + m->source_64b = true;
> > > + else
> > > + return ENOTSUP;
> > > +
> > > + if (ntohl(h.time_t_size) == 4)
> > > + m->time_64b = false;
> > > + else if (ntohl(h.time_t_size) == 8)
> > > + m->time_64b = true;
> > > + else
> > > + return ENOTSUP;
> > > +
> > > + m->flow_size = ntohl(h.flow_size);
> > > + m->flow_sidx_size = ntohl(h.flow_sidx_size);
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/**
> > > + * migrate_target_pre() - Pre-migration tasks as target
> > > + * @m: Migration metadata
> > > + *
> > > + * Return: 0 on success, error code on failure
> > > + */
> > > +int migrate_target_pre(struct migrate_meta *m)
> > > +{
> > > + struct migrate_target_handlers *th;
> > > + struct migrate_handler *h;
> > > +
> > > + for (th = target_handlers; th->v != m->v && th->v; th++);
> > > +
> > > + for (h = th->pre; h->fn; h++) {
> > > + int rc;
> > > +
> > > + if ((rc = h->fn(m, h->data)))
> > > + return rc;
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/**
> > > + * migrate_target() - Perform migration as target: receive state from hypervisor
> > > + * @fd: Descriptor for state transfer
> > > + * @m: Migration metadata
> > > + *
> > > + * Return: 0 on success, error code on failure
> > > + *
> > > + * #syscalls:vu readv
> > > + */
> > > +int migrate_target(int fd, const struct migrate_meta *m)
> > > +{
> > > + static struct migrate_data *d;
> > > + unsigned cnt;
> > > + int rc;
> > > +
> > > + for (d = data_versions; d->v != m->v && d->v; d++);
> > > +
> > > + for (cnt = 0; d->sections[cnt + 1 /* skip header */].iov_len; cnt++);
> > > +
> > > + debug("Reading %u migration sections", cnt);
> > > + rc = read_remainder(fd, d->sections + 1, cnt, 0);
> > > + if (rc < 0)
> > > + return errno;
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/**
> > > + * migrate_target_post() - Post-migration tasks as target
> > > + * @m: Migration metadata
> > > + */
> > > +void migrate_target_post(struct migrate_meta *m)
> > > +{
> > > + struct migrate_target_handlers *th;
> > > + struct migrate_handler *h;
> > > +
> > > + for (th = target_handlers; th->v != m->v && th->v; th++);
> > > +
> > > + for (h = th->post; h->fn; h++)
> > > + h->fn(m, h->data);
> > > +}
> > > diff --git a/migrate.h b/migrate.h
> > > new file mode 100644
> > > index 0000000..5582f75
> > > --- /dev/null
> > > +++ b/migrate.h
> > > @@ -0,0 +1,90 @@
> > > +/* SPDX-License-Identifier: GPL-2.0-or-later
> > > + * Copyright (c) 2025 Red Hat GmbH
> > > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > > + */
> > > +
> > > +#ifndef MIGRATE_H
> > > +#define MIGRATE_H
> > > +
> > > +/**
> > > + * struct migrate_meta - Migration metadata
> > > + * @v: Chosen migration data version, host order
> > > + * @bswap: Source has opposite endianness
> > > + * @source_64b: Source uses 64-bit void *
> > > + * @time_64b: Source uses 64-bit time_t
> > > + * @flow_size: Size of union flow in source
> > > + * @flow_sidx_size: Size of struct flow_sidx in source
> > > + */
> > > +struct migrate_meta {
> > > + uint32_t v;
> > > + bool bswap;
> > > + bool source_64b;
> > > + bool time_64b;
> > > + size_t flow_size;
> > > + size_t flow_sidx_size;
> > > +};
> > > +
> > > +/**
> > > + * union migrate_header - Migration header from source
> > > + * @magic: 0xB1BB1D1B0BB1D1B0, host order
> > > + * @version: Source sends highest known, target aborts if unsupported
> > > + * @voidp_size: sizeof(void *), network order
> > > + * @time_t_size: sizeof(time_t), network order
> > > + * @flow_size: sizeof(union flow), network order
> > > + * @flow_sidx_size: sizeof(struct flow_sidx_t), network order
> > > + * @unused: Go figure
> > > + */
> > > +union migrate_header {
> > > + struct {
> > > + uint64_t magic;
> > > + uint32_t version;
> > > + uint32_t voidp_size;
> > > + uint32_t time_t_size;
> > > + uint32_t flow_size;
> > > + uint32_t flow_sidx_size;
> > > + };
> > > + uint8_t unused[65536];
> >
> > So, having looked at this, I no longer think padding the header to 64kiB
> > is a good idea. The structure means we're basically stuck always
> > having that chunky header. Instead, I think the header should be
> > absolutely minimal: basically magic and version only. v1 (and maybe
> > others) can add a "metadata" or whatever section for additional
> > information like this they need.
>
> The header is processed by the target in a separate, preliminary step,
> though.
>
> That's why I added metadata right in the header: if the target needs to
> abort the migration because, say, the size of a flow entry is too big
> to handle for a particular version, then we should know that before
> migrate_target_pre().
Ah, yes, I missed that, we'd need a more complex design to do
additional transfers and checks before making the target_pre
callbacks.
> As long as we check the version first, we can always shrink the header
> later on.
*thinks*.. I guess so, though it's kind of awkward; a future version
would have to read the "header of the header", check the version, then
if it's the old one, read the remainder of the 64kiB block.
I still think we should clearly separate the part that we're
committing to being in every future version (which I think should just
be magic and version), from the stuff that's just v1.
> But having 64 KiB reserved looks more robust because it's a
> safe place to add this kind of metadata.
>
> Note that 64 KiB is typically transferred in a single read/write
> from/to the vhost-user back-end.
Ok, but it also has to go over the qemu migration channel, which will
often be a physical link, not a super-fast local/virtual one, and may
be bandwidth capped as well. I'm not actually certain if 64kiB is
likely to be a problem there, but it *is* large compared to the state
blobs of most qemu devices (usually only a few hundred bytes).
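To make the split concrete, a minimal sketch under my own assumptions:
the names migrate_header_min and migrate_meta_v1 are hypothetical, and
the explicit reserved field is just one way to keep the layout free of
implicit padding. The first structure is the part every version would
commit to, the second is what only v1 sends right after it:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Committed for every future version: magic and version only */
struct migrate_header_min {
	uint64_t magic;		/* host order, reveals source endianness */
	uint32_t version;	/* network order */
	uint32_t reserved;	/* explicit padding, no hidden holes */
};

/* Version 1 only: sizes the target checks before its pre handlers run */
struct migrate_meta_v1 {
	uint32_t voidp_size;
	uint32_t time_t_size;
	uint32_t flow_size;
	uint32_t flow_sidx_size;
};

static_assert(sizeof(struct migrate_header_min) == 16, "unexpected padding");
static_assert(sizeof(struct migrate_meta_v1) == 16, "unexpected padding");

/* Bytes the target reads before deciding whether to go on */
size_t migrate_v1_preamble_size(void)
{
	return sizeof(struct migrate_header_min) +
	       sizeof(struct migrate_meta_v1);
}
```

The target would still read both parts before migrate_target_pre(), so
the early abort is preserved, but the preamble is 32 bytes instead of
64 KiB.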
> > > +};
> > > +
> > > +/**
> > > + * struct migrate_data - Data sections for given source version
> > > + * @v: Source version this applies to, host order
> > > + * @sections: Array of data sections, NULL-terminated
> > > + */
> > > +struct migrate_data {
> > > + uint32_t v;
> > > + struct iovec *sections;
> > > +};
> > > +
> > > +/**
> > > + * struct migrate_handler - Function to handle a specific data section
> > > + * @fn: Function pointer taking pointer to data section
> > > + * @data: Associated data section
> > > + */
> > > +struct migrate_handler {
> > > + int (*fn)(struct migrate_meta *m, void *data);
> > > + void *data;
> > > +};
> > > +
> > > +/**
> > > + * struct migrate_target_handlers - Versioned sets of migration target handlers
> > > + * @v: Source version this applies to, host order
> > > + * @pre: Set of functions to execute in target before data copy
> > > + * @post: Set of functions to execute in target after data copy
> > > + */
> > > +struct migrate_target_handlers {
> > > + uint32_t v;
> > > + struct migrate_handler *pre;
> > > + struct migrate_handler *post;
> > > +};
> > > +
> > > +int migrate_source_pre(struct migrate_meta *m);
> > > +int migrate_source(int fd, const struct migrate_meta *m);
> > > +void migrate_source_post(struct migrate_meta *m);
> > > +
> > > +int migrate_target_read_header(int fd, struct migrate_meta *m);
> > > +int migrate_target_pre(struct migrate_meta *m);
> > > +int migrate_target(int fd, const struct migrate_meta *m);
> > > +void migrate_target_post(struct migrate_meta *m);
> > > +
> > > +#endif /* MIGRATE_H */
> > > diff --git a/passt.c b/passt.c
> > > index b1c8ab6..184d4e5 100644
> > > --- a/passt.c
> > > +++ b/passt.c
> > > @@ -358,7 +358,7 @@ loop:
> > > vu_kick_cb(c.vdev, ref, &now);
> > > break;
> > > case EPOLL_TYPE_VHOST_MIGRATION:
> > > - vu_migrate(c.vdev, eventmask);
> > > + vu_migrate(&c, eventmask);
> > > break;
> > > default:
> > > /* Can't happen */
> > > diff --git a/vu_common.c b/vu_common.c
> > > index f43d8ac..0c67bd0 100644
> > > --- a/vu_common.c
> > > +++ b/vu_common.c
> > > @@ -5,6 +5,7 @@
> > > * common_vu.c - vhost-user common UDP and TCP functions
> > > */
> > >
> > > +#include <errno.h>
> > > #include <unistd.h>
> > > #include <sys/uio.h>
> > > #include <sys/eventfd.h>
> > > @@ -17,6 +18,7 @@
> > > #include "vhost_user.h"
> > > #include "pcap.h"
> > > #include "vu_common.h"
> > > +#include "migrate.h"
> > >
> > > #define VU_MAX_TX_BUFFER_NB 2
> > >
> > > @@ -305,50 +307,88 @@ err:
> > > }
> > >
> > > /**
> > > - * vu_migrate() - Send/receive passt insternal state to/from QEMU
> > > - * @vdev: vhost-user device
> > > + * vu_migrate_source() - Migration as source, send state to hypervisor
> > > + * @fd: File descriptor for state transfer
> > > + *
> > > + * Return: 0 on success, positive error code on failure
> > > + */
> > > +static int vu_migrate_source(int fd)
> > > +{
> > > + struct migrate_meta m;
> > > + int rc;
> > > +
> > > + if ((rc = migrate_source_pre(&m))) {
> > > + err("Source pre-migration failed: %s, abort", strerror_(rc));
> > > + return rc;
> > > + }
> > > +
> > > + debug("Saving backend state");
> > > +
> > > + rc = migrate_source(fd, &m);
> > > + if (rc)
> > > + err("Source migration failed: %s", strerror_(rc));
> > > + else
> > > + migrate_source_post(&m);
> > > +
> > > + return rc;
> >
> > After a successful source migration shouldn't we exit, or at least
> > quiesce ourselves so we don't accidentally mess with anything the
> > target is now doing?
>
> Maybe, yes. Pending TCP connections should be safe because with
> TCP_REPAIR they're already quiesced, but we don't close listening
> sockets (yet).
>
> Perhaps a reasonable approach for the moment would be to declare a
> single migrate_source_post handler logging an info() message and
> exiting.
Seems sensible for now.
> > > +}
> > > +
> > > +/**
> > > + * vu_migrate_target() - Migration as target, receive state from hypervisor
> > > + * @fd: File descriptor for state transfer
> > > + *
> > > + * Return: 0 on success, positive error code on failure
> > > + */
> > > +static int vu_migrate_target(int fd)
> > > +{
> > > + struct migrate_meta m;
> > > + int rc;
> > > +
> > > + rc = migrate_target_read_header(fd, &m);
> > > + if (rc) {
> > > + err("Migration header check failed: %s, abort", strerror_(rc));
> > > + return rc;
> > > + }
> > > +
> > > + if ((rc = migrate_target_pre(&m))) {
> > > + err("Target pre-migration failed: %s, abort", strerror_(rc));
> > > + return rc;
> > > + }
> > > +
> > > + debug("Loading backend state");
> > > +
> > > + rc = migrate_target(fd, &m);
> > > + if (rc)
> > > + err("Target migration failed: %s", strerror_(rc));
> > > + else
> > > + migrate_target_post(&m);
> > > +
> > > + return rc;
> > > +}
> > > +
> > > +/**
> > > + * vu_migrate() - Send/receive passt internal state to/from QEMU
> > > + * @c: Execution context
> > > * @events: epoll events
> > > */
> > > -void vu_migrate(struct vu_dev *vdev, uint32_t events)
> > > +void vu_migrate(struct ctx *c, uint32_t events)
> > > {
> > > - int ret;
> > > + struct vu_dev *vdev = c->vdev;
> > > + int rc = EIO;
> > >
> > > - /* TODO: collect/set passt internal state
> > > - * and use vdev->device_state_fd to send/receive it
> > > - */
> > > debug("vu_migrate fd %d events %x", vdev->device_state_fd, events);
> > > - if (events & EPOLLOUT) {
> > > - debug("Saving backend state");
> > > -
> > > - /* send some stuff */
> > > - ret = write(vdev->device_state_fd, "PASST", 6);
> > > - /* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
> > > - vdev->device_state_result = ret == -1 ? -1 : 0;
> > > - /* Closing the file descriptor signals the end of transfer */
> > > - epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
> > > - vdev->device_state_fd, NULL);
> > > - close(vdev->device_state_fd);
> > > - vdev->device_state_fd = -1;
> > > - } else if (events & EPOLLIN) {
> > > - char buf[6];
> > > -
> > > - debug("Loading backend state");
> > > - /* read some stuff */
> > > - ret = read(vdev->device_state_fd, buf, sizeof(buf));
> > > - /* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
> > > - if (ret != sizeof(buf)) {
> > > - vdev->device_state_result = -1;
> > > - } else {
> > > - ret = strncmp(buf, "PASST", sizeof(buf));
> > > - vdev->device_state_result = ret == 0 ? 0 : -1;
> > > - }
> > > - } else if (events & EPOLLHUP) {
> > > - debug("Closing migration channel");
> > > -
> > > - /* The end of file signals the end of the transfer. */
> > > - epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
> > > - vdev->device_state_fd, NULL);
> > > - close(vdev->device_state_fd);
> > > - vdev->device_state_fd = -1;
> > > - }
> > > +
> > > + if (events & EPOLLOUT)
> > > + rc = vu_migrate_source(vdev->device_state_fd);
> > > + else if (events & EPOLLIN)
> > > + rc = vu_migrate_target(vdev->device_state_fd);
> > > +
> > > + /* EPOLLHUP without EPOLLIN/EPOLLOUT, or EPOLLERR? Migration failed */
> > > +
> > > + vdev->device_state_result = rc;
> > > +
> > > + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, vdev->device_state_fd, NULL);
> > > + debug("Closing migration channel");
> > > + close(vdev->device_state_fd);
> > > + vdev->device_state_fd = -1;
> > > }
> > > diff --git a/vu_common.h b/vu_common.h
> > > index d56c021..69c4006 100644
> > > --- a/vu_common.h
> > > +++ b/vu_common.h
> > > @@ -57,5 +57,5 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
> > > void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
> > > const struct timespec *now);
> > > int vu_send_single(const struct ctx *c, const void *buf, size_t size);
> > > -void vu_migrate(struct vu_dev *vdev, uint32_t events);
> > > +void vu_migrate(struct ctx *c, uint32_t events);
> > > #endif /* VU_COMMON_H */
> >
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
* Re: [PATCH 7/7] Introduce passt-repair
2025-01-28 6:51 ` Stefano Brivio
@ 2025-01-29 1:29 ` David Gibson
2025-01-29 7:04 ` Stefano Brivio
0 siblings, 1 reply; 41+ messages in thread
From: David Gibson @ 2025-01-29 1:29 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier
On Tue, Jan 28, 2025 at 07:51:31AM +0100, Stefano Brivio wrote:
> On Tue, 28 Jan 2025 12:51:59 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Tue, Jan 28, 2025 at 12:15:32AM +0100, Stefano Brivio wrote:
> > > A privileged helper to set/clear TCP_REPAIR on sockets on behalf of
> > > passt. Not used yet.
> > >
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > ---
> > > Makefile | 10 +++--
> > > passt-repair.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++
> > > 2 files changed, 118 insertions(+), 3 deletions(-)
> > > create mode 100644 passt-repair.c
> > >
> > > diff --git a/Makefile b/Makefile
> > > index 1383875..1b71cb0 100644
> > > --- a/Makefile
> > > +++ b/Makefile
> > > @@ -42,7 +42,8 @@ PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> > > tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > > vhost_user.c virtio.c vu_common.c
> > > QRAP_SRCS = qrap.c
> > > -SRCS = $(PASST_SRCS) $(QRAP_SRCS)
> > > +PASST_REPAIR_SRCS = passt-repair.c
> > > +SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
> > >
> > > MANPAGES = passt.1 pasta.1 qrap.1
> > >
> > > @@ -72,9 +73,9 @@ mandir ?= $(datarootdir)/man
> > > man1dir ?= $(mandir)/man1
> > >
> > > ifeq ($(TARGET_ARCH),x86_64)
> > > -BIN := passt passt.avx2 pasta pasta.avx2 qrap
> > > +BIN := passt passt.avx2 pasta pasta.avx2 qrap passt-repair
> > > else
> > > -BIN := passt pasta qrap
> > > +BIN := passt pasta qrap passt-repair
> > > endif
> > >
> > > all: $(BIN) $(MANPAGES) docs
> > > @@ -101,6 +102,9 @@ pasta.avx2 pasta.1 pasta: pasta%: passt%
> > > qrap: $(QRAP_SRCS) passt.h
> > > $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS)
> > >
> > > +passt-repair: $(PASST_REPAIR_SRCS)
> > > + $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS)
> > > +
> > > valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \
> > > rt_sigreturn getpid gettid kill clock_gettime mmap \
> > > mmap2 munmap open unlink gettimeofday futex statx \
> > > diff --git a/passt-repair.c b/passt-repair.c
> > > new file mode 100644
> > > index 0000000..e9b9609
> > > --- /dev/null
> > > +++ b/passt-repair.c
> > > @@ -0,0 +1,111 @@
> > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > +
> > > +/* PASST - Plug A Simple Socket Transport
> > > + * for qemu/UNIX domain socket mode
> > > + *
> > > + * passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets
> > > + *
> > > + * Copyright (c) 2025 Red Hat GmbH
> > > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > > + *
> > > + * Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along
> > > + * with commands mapping to TCP_REPAIR values, and switch repair mode on or
> > > + * off. Reply by echoing the command. Exit if the command is INT_MAX.
> > > + */
> > > +
> > > +#include <sys/types.h>
> > > +#include <sys/socket.h>
> > > +#include <sys/un.h>
> > > +#include <errno.h>
> > > +#include <stdio.h>
> > > +#include <stdlib.h>
> > > +#include <string.h>
> > > +#include <limits.h>
> > > +#include <unistd.h>
> > > +#include <netdb.h>
> > > +
> > > +#include <netinet/tcp.h>
> > > +
> > > +#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
> > > +
> > > +int main(int argc, char **argv)
> > > +{
> > > + char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
> > > + __attribute__ ((aligned(__alignof__(struct cmsghdr))));
> > > + struct sockaddr_un a = { AF_UNIX, "" };
> > > + int cmd, fds[SCM_MAX_FD], s, ret, i;
> > > + struct cmsghdr *cmsg;
> > > + struct msghdr msg;
> > > + struct iovec iov;
> > > +
> > > + iov = (struct iovec){ &cmd, sizeof(cmd) };
> >
> > I mean, local to local, it's *probably* fine, but still a network
> > protocol not defined in terms of explicit width fields makes me
> > nervous. I'd prefer to see the cmd being a packed structure with
> > fixed width elements.
>
> It actually is, because:
>
> struct {
> int cmd;
> };
>
> is a packed structure with fixed width elements. Any architecture we
> build for (at least the ones I'm aware of) has a 32-bit int. We can
> make it uint32_t if it makes you feel better.
Sorry, I should have said "*explicitly* fixed width fields". So, yes,
uint32_t would make me feel better :)
> > I also think we should do some sort of basic magic / version exchange.
> > I don't see any reason we'd need to extend the protocol, but I'd
> > rather have the option if we have to.
>
> passt-repair will be packaged and distributed together with passt,
> though. Versions must match.
But nothing enforces that. AIUI KubeVirt will be running passt-repair
in a different context. Which means it may well be deployed by a
different path than the passt binary, which means that, however we
distribute it, it's quite plausible that a downstream screwup could
mismatch the versions. We should endeavour to have a reasonably
graceful failure mode for that.
> And latency here might matter more than in
> the rest of the migration process.
> > Plus checking a magic number
> > should make things less damaging and more debuggable if you were to
> > point the repair helper at an entirely unrelated unix socket instead
> > of passt's repair socket.
>
> Maybe, yes, even though I don't really see good chances for that
> mistake to happen. Feel free to post a proposal, of course.
I disagree on the good chances for a mistake thing. In GSS I saw
plenty of occasions where things that shouldn't be mismatched were due
to some packaging or user screwup. And that's before even considering
the way that KubeVirt deploys its various pieces seems to provide a
number of opportunities to mess this up.
So, I'll see what I can come up with. I'm fine with requiring
matching versions if it's actually checked. Maybe a magic derived
from our git hash, or even our build-id.
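Purely as a sketch of the kind of check I have in mind, not a proposal
yet: a fixed-width greeting sent once after connect(), validated before
any command is accepted. REPAIR_MAGIC, REPAIR_VERSION, struct
repair_hello and repair_hello_check() are all hypothetical names and
values; the version could equally be derived from a git hash or
build-id at build time:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define REPAIR_MAGIC	0x50525052U	/* arbitrary, hypothetical value */
#define REPAIR_VERSION	1U		/* hypothetical, could be build-derived */

/* Greeting exchanged once per connection, explicitly fixed width */
struct repair_hello {
	uint32_t magic;
	uint32_t version;
};

/* Return 0 if @buf holds a matching greeting, -1 otherwise */
int repair_hello_check(const void *buf, size_t len)
{
	struct repair_hello h;

	if (len != sizeof(h))
		return -1;

	memcpy(&h, buf, sizeof(h));	/* buffer might not be aligned */

	if (h.magic != REPAIR_MAGIC || h.version != REPAIR_VERSION)
		return -1;

	return 0;
}
```

On a mismatch either side would print a message and close the socket,
which gives a debuggable failure if the helper is pointed at an
unrelated UNIX domain socket or a different build of passt.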
> > > + msg = (struct msghdr){ NULL, 0, &iov, 1, buf, sizeof(buf), 0 };
> > > + cmsg = CMSG_FIRSTHDR(&msg);
> > > +
> > > + if (argc != 2) {
> > > + fprintf(stderr, "Usage: %s PATH\n", argv[0]);
> > > + return -1;
> > > + }
> > > +
> > > + ret = snprintf(a.sun_path, sizeof(a.sun_path), "%s", argv[1]);
> > > + if (ret <= 0 || ret >= (int)sizeof(a.sun_path)) {
> > > + fprintf(stderr, "Invalid socket path: %s\n", argv[1]);
> > > + return -1;
> > > + }
> > > +
> > > + if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {
> >
> > Hmm.. would a datagram socket better suit our needs here?
>
> We need a connection though, so that passt knows when the helper is
> ready to get messages. It could be done with a synchronisation datagram
> but it looks more complicated to handle.
Good point.
> By the way, with a connection, we could probably just close() the
> socket here instead of having a "quit" command.
True.
> If you're referring to the fact we don't keep message boundaries, so we
> would in theory need to add short read handling to the recvmsg() below:
> I'd rather switch cmd to a single byte instead. You can't transfer less
> than that.
I was thinking that preserving message boundaries might allow
extending the command format more easily, but you've convinced me it's
not worth the trouble.
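For reference, a minimal sketch of what a single-byte command with
SCM_RIGHTS descriptors could look like on both ends. repair_send() and
repair_recv() are hypothetical helpers, not code from the series. Note
that the descriptor count is computed as payload length divided by
sizeof(int); as far as I can tell, dividing cmsg_len by
CMSG_LEN(sizeof(int)) as the patch does undercounts once more than one
descriptor is passed:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define SCM_MAX_FD 253	/* From Linux include/net/scm.h, not in UAPI */

/* Send one command byte along with @n descriptors; 0 on success, -1 on error */
int repair_send(int s, int8_t cmd, const int *fds, int n)
{
	char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
		__attribute__ ((aligned(__alignof__(struct cmsghdr))));
	struct iovec iov = { &cmd, sizeof(cmd) };
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	if (n < 1 || n > SCM_MAX_FD)
		return -1;

	memset(buf, 0, sizeof(buf));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = buf;
	msg.msg_controllen = CMSG_SPACE(sizeof(int) * n);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int) * n);
	memcpy(CMSG_DATA(cmsg), fds, sizeof(int) * n);

	return sendmsg(s, &msg, 0) == (ssize_t)sizeof(cmd) ? 0 : -1;
}

/* Receive one command byte plus descriptors; return their count, -1 on error */
int repair_recv(int s, int8_t *cmd, int *fds)
{
	char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
		__attribute__ ((aligned(__alignof__(struct cmsghdr))));
	struct iovec iov = { cmd, sizeof(*cmd) };
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int n;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = buf;
	msg.msg_controllen = sizeof(buf);

	/* A one-byte payload can't be read short: it's there, or it isn't */
	if (recvmsg(s, &msg, 0) != (ssize_t)sizeof(*cmd))
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len < CMSG_LEN(sizeof(int)) ||
	    cmsg->cmsg_len > CMSG_LEN(sizeof(int) * SCM_MAX_FD))
		return -1;

	/* Payload is cmsg_len minus the header length */
	n = (cmsg->cmsg_len - CMSG_LEN(0)) / sizeof(int);
	memcpy(fds, CMSG_DATA(cmsg), sizeof(int) * n);

	return n;
}
```

With a stream socket and a one-byte command, closing the connection can
double as the "quit" signal: repair_recv() then returns -1 on end of
file and the helper exits.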
> > > + perror("Failed to create AF_UNIX socket");
> > > + return -1;
> > > + }
> > > +
> > > + if (connect(s, (struct sockaddr *)&a, sizeof(a))) {
> > > + fprintf(stderr, "Failed to connect to %s: %s\n", argv[1],
> > > + strerror(errno));
> > > + return -1;
> > > + }
> > > +
> > > + while (1) {
> > > + int n;
> > > +
> > > + if (recvmsg(s, &msg, 0) < 0) {
> > > + perror("Failed to receive message");
> > > + return -1;
> > > + }
> > > +
> > > + if (!cmsg ||
> > > + cmsg->cmsg_len < CMSG_LEN(sizeof(int)) ||
> > > + cmsg->cmsg_len > CMSG_LEN(sizeof(int) * SCM_MAX_FD) ||
> > > + cmsg->cmsg_type != SCM_RIGHTS)
> > > + return -1;
> > > +
> > > + n = cmsg->cmsg_len / CMSG_LEN(sizeof(int));
> > > + memcpy(fds, CMSG_DATA(cmsg), sizeof(int) * n);
> > > +
> > > + switch (cmd) {
> > > + case INT_MAX:
> > > + return 0;
> > > + case TCP_REPAIR_ON:
> > > + case TCP_REPAIR_OFF:
> > > + case TCP_REPAIR_OFF_NO_WP:
> > > + for (i = 0; i < n; i++) {
> > > + if (setsockopt(fds[i], SOL_TCP, TCP_REPAIR,
> > > + &cmd, sizeof(int))) {
> > > + perror("Setting TCP_REPAIR");
> > > + return -1;
> >
> > We probably want this to report errors back to passt, rather than just
> > dying in this case. That way if for some weird reason one socket
> > can't be placed in repair mode, we can still migrate all the other
> > connections.
>
> We implicitly report the error in the sense that we close the
> connection and passt will abort the migration. If you look at the
> handling of TCP_REPAIR in do_tcp_setsockopt(), you'll see that it
> either always fails (EPERM), or always succeeds.
Ah, right. That's probably good enough for now.
> I mean, it's straightforward to implement, and we can just reply with a
> different command. But it's probably more meaningful and fitting to
> abort altogether.
Right, best effort maintenance of connections can be a later feature,
if anyone wants it.
> Besides, if we have to report exactly on which socket we failed, we
> won't be able to switch to a single-byte command protocol.
>
> > > + }
> > > + }
> > > +
> > > + /* Confirm setting by echoing the command back */
> > > + if (send(s, &cmd, sizeof(int), 0) < 0) {
> > > + fprintf(stderr, "Reply to command %i: %s\n",
> > > + cmd, strerror(errno));
> > > + return -1;
> > > + }
> > > +
> > > + break;
> > > + default:
> > > + fprintf(stderr, "Unsupported command 0x%04x\n", cmd);
> > > + return -1;
> > > + }
> > > + }
> > > +}
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
* Re: [PATCH 7/7] Introduce passt-repair
2025-01-29 1:29 ` David Gibson
@ 2025-01-29 7:04 ` Stefano Brivio
2025-01-30 0:53 ` David Gibson
0 siblings, 1 reply; 41+ messages in thread
From: Stefano Brivio @ 2025-01-29 7:04 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Laurent Vivier
On Wed, 29 Jan 2025 12:29:27 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Tue, Jan 28, 2025 at 07:51:31AM +0100, Stefano Brivio wrote:
> > On Tue, 28 Jan 2025 12:51:59 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > > On Tue, Jan 28, 2025 at 12:15:32AM +0100, Stefano Brivio wrote:
> > > > A privileged helper to set/clear TCP_REPAIR on sockets on behalf of
> > > > passt. Not used yet.
> > > >
> > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > > ---
> > > > Makefile | 10 +++--
> > > > passt-repair.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++
> > > > 2 files changed, 118 insertions(+), 3 deletions(-)
> > > > create mode 100644 passt-repair.c
> > > >
> > > > diff --git a/Makefile b/Makefile
> > > > index 1383875..1b71cb0 100644
> > > > --- a/Makefile
> > > > +++ b/Makefile
> > > > @@ -42,7 +42,8 @@ PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> > > > tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > > > vhost_user.c virtio.c vu_common.c
> > > > QRAP_SRCS = qrap.c
> > > > -SRCS = $(PASST_SRCS) $(QRAP_SRCS)
> > > > +PASST_REPAIR_SRCS = passt-repair.c
> > > > +SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
> > > >
> > > > MANPAGES = passt.1 pasta.1 qrap.1
> > > >
> > > > @@ -72,9 +73,9 @@ mandir ?= $(datarootdir)/man
> > > > man1dir ?= $(mandir)/man1
> > > >
> > > > ifeq ($(TARGET_ARCH),x86_64)
> > > > -BIN := passt passt.avx2 pasta pasta.avx2 qrap
> > > > +BIN := passt passt.avx2 pasta pasta.avx2 qrap passt-repair
> > > > else
> > > > -BIN := passt pasta qrap
> > > > +BIN := passt pasta qrap passt-repair
> > > > endif
> > > >
> > > > all: $(BIN) $(MANPAGES) docs
> > > > @@ -101,6 +102,9 @@ pasta.avx2 pasta.1 pasta: pasta%: passt%
> > > > qrap: $(QRAP_SRCS) passt.h
> > > > $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS)
> > > >
> > > > +passt-repair: $(PASST_REPAIR_SRCS)
> > > > + $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS)
> > > > +
> > > > valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \
> > > > rt_sigreturn getpid gettid kill clock_gettime mmap \
> > > > mmap2 munmap open unlink gettimeofday futex statx \
> > > > diff --git a/passt-repair.c b/passt-repair.c
> > > > new file mode 100644
> > > > index 0000000..e9b9609
> > > > --- /dev/null
> > > > +++ b/passt-repair.c
> > > > @@ -0,0 +1,111 @@
> > > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > > +
> > > > +/* PASST - Plug A Simple Socket Transport
> > > > + * for qemu/UNIX domain socket mode
> > > > + *
> > > > + * passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets
> > > > + *
> > > > + * Copyright (c) 2025 Red Hat GmbH
> > > > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > > > + *
> > > > + * Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along
> > > > + * with commands mapping to TCP_REPAIR values, and switch repair mode on or
> > > > + * off. Reply by echoing the command. Exit if the command is INT_MAX.
> > > > + */
> > > > +
> > > > +#include <sys/types.h>
> > > > +#include <sys/socket.h>
> > > > +#include <sys/un.h>
> > > > +#include <errno.h>
> > > > +#include <stdio.h>
> > > > +#include <stdlib.h>
> > > > +#include <string.h>
> > > > +#include <limits.h>
> > > > +#include <unistd.h>
> > > > +#include <netdb.h>
> > > > +
> > > > +#include <netinet/tcp.h>
> > > > +
> > > > +#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
> > > > +
> > > > +int main(int argc, char **argv)
> > > > +{
> > > > + char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
> > > > + __attribute__ ((aligned(__alignof__(struct cmsghdr))));
> > > > + struct sockaddr_un a = { AF_UNIX, "" };
> > > > + int cmd, fds[SCM_MAX_FD], s, ret, i;
> > > > + struct cmsghdr *cmsg;
> > > > + struct msghdr msg;
> > > > + struct iovec iov;
> > > > +
> > > > + iov = (struct iovec){ &cmd, sizeof(cmd) };
> > >
> > > I mean, local to local, it's *probably* fine, but still a network
> > > protocol not defined in terms of explicit width fields makes me
> > > nervous. I'd prefer to see the cmd being a packed structure with
> > > fixed width elements.
> >
> > It actually is, because:
> >
> > struct {
> > int cmd;
> > };
> >
> > is a packed structure with fixed width elements. Any architecture we
> > build for (at least the ones I'm aware of) has a 32-bit int. We can
> > make it uint32_t if it makes you feel better.
>
> Sorry, I should have said "*explicitly* fixed width fields". So, yes,
> uint32_t would make me feel better :)
Changed to int8_t anyway meanwhile. We don't need all those bits.
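As a rough sketch of what a single-byte command exchange could look like on
a UNIX domain socket (cmd_send() and cmd_echo() are names made up for this
example, they are not functions from the series):

```c
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helpers, for illustration only */

/* Send one command byte: return 0 on success, -1 on failure */
int cmd_send(int s, int8_t cmd)
{
	return send(s, &cmd, sizeof(cmd), 0) == (ssize_t)sizeof(cmd) ? 0 : -1;
}

/* Receive one command byte into *cmd and echo it back as confirmation:
 * return 0 on success, -1 on failure or if the peer closed the socket
 */
int cmd_echo(int s, int8_t *cmd)
{
	if (recv(s, cmd, sizeof(*cmd), 0) != (ssize_t)sizeof(*cmd))
		return -1;

	return send(s, cmd, sizeof(*cmd), 0) == (ssize_t)sizeof(*cmd) ? 0 : -1;
}
```

With a one-byte command there is no byte order or width to agree on, which
is the point of explicitly fixed-width fields.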
> > > I also think we should do some sort of basic magic / version exchange.
> > > I don't see any reason we'd need to extend the protocol, but I'd
> > > rather have the option if we have to.
> >
> > passt-repair will be packaged and distributed together with passt,
> > though. Versions must match.
>
> But nothing enforces that.
Distribution packages. If I run claws-mail with the wrong version of,
say, libpixman, it won't start. If you don't use them, you're on your
own.
> AIUI KubeVirt will be running passt-repair
> in a different context. Which means it may well be deployed by a
> different path than the passt binary
No, that's not the way it works. It needs to match, in the sense that
1. it's a KubeVirt requirement to have compatible packages between
distribution and the "base container image" and 2. this would most
likely be sourced from the "base container image" anyway.
I maintain the packages for four distributions, plus AppArmor and
SELinux policies upstream and downstream, and I take care of updating
the package in KubeVirt as well, so I guess I have a vague idea of
what's convenient, enforced, burdensome, and so on.
> which means however we
> distribute it's quite plausible that a downstream screwup could
> mismatch the versions. We should endeavour to have a reasonably
> graceful failure mode for that.
Regardless of this, I think that *this one* is an interface (I wouldn't
even call it a protocol) that needs to be set in stone, except for
hypothetical (and highly unlikely) UAPI additions, which we'll be able to
accommodate easily anyway.
It's a single socket option with three possible values (for 13 years
now), of which we plan to use two. If we want this interface to do
anything else, it should be another interface.
So there's really no problem with this.
Besides, the helper runs with CAP_NET_ADMIN (even though CAP_NET_RAW
should ideally suffice), so it needs to be extremely simple and
auditable.
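To give an idea of how small the core of such a helper can stay, here is a
minimal sketch, not the code from the patch. The REPAIR_* names and the
repair_cmd_valid() / repair_set() functions are made up for the example; the
numeric values are the ones from Linux UAPI (linux/tcp.h). Which of the three
values end up being accepted is not settled here, so the sketch simply
accepts all of them:

```c
#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

#ifndef TCP_REPAIR
#define TCP_REPAIR		19	/* From Linux UAPI (linux/tcp.h) */
#endif

/* The three TCP_REPAIR values from Linux UAPI, under local, made-up names */
enum {
	REPAIR_OFF_NO_WP	= -1,
	REPAIR_OFF		= 0,
	REPAIR_ON		= 1,
};

/* Return 1 if cmd maps to a known TCP_REPAIR value, 0 otherwise */
int repair_cmd_valid(int8_t cmd)
{
	return cmd == REPAIR_ON || cmd == REPAIR_OFF ||
	       cmd == REPAIR_OFF_NO_WP;
}

/* Apply cmd to socket s: return 0 on success, -1 on failure with errno set
 * by setsockopt(), or to EINVAL for an unknown command
 */
int repair_set(int s, int8_t cmd)
{
	int v = cmd;

	if (!repair_cmd_valid(cmd)) {
		errno = EINVAL;
		return -1;
	}

	return setsockopt(s, IPPROTO_TCP, TCP_REPAIR, &v, sizeof(v));
}
```

Without the needed capability, setsockopt() is expected to fail with EPERM,
so this only does something useful in the privileged helper.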
> > And latency here might matter more than in
> > the rest of the migration process.
>
> > > Plus checking a magic number
> > > should make things less damaging and more debuggable if you were to
> > > point the repair helper at an entirely unrelated unix socket instead
> > > of passt's repair socket.
> >
> > Maybe, yes, even though I don't really see good chances for that
> > mistake to happen. Feel free to post a proposal, of course.
>
> I disagree on the good chances for a mistake thing. In GSS I saw
> plenty of occasions where things that shouldn't be mismatched were due
> to some packaging or user screwup. And that's before even considering
> the way that KubeVirt deploys its various pieces seems to provide a
> number of opportunities to mess this up.
>
> So, I'll see what I can come up with. I'm fine with requiring
> matching versions if it's actually checked. Maybe a magic derived
> from our git hash, or even our build-id.
Both would make things significantly less usable, because they would
make different but compatible builds incompatible, and different
implementations rather inconvenient.
For example, it might be a practical solution to have a Go
implementation of this in KubeVirt's virt-handler itself, but if it
needs to extract information or strings from the binary, that becomes
impractical.
--
Stefano
* Re: [PATCH 3/7] tcp_conn: Avoid 7-bit hole in struct tcp_splice_conn
2025-01-29 1:02 ` David Gibson
@ 2025-01-29 7:33 ` Stefano Brivio
2025-01-30 0:44 ` David Gibson
0 siblings, 1 reply; 41+ messages in thread
From: Stefano Brivio @ 2025-01-29 7:33 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Laurent Vivier
On Wed, 29 Jan 2025 12:02:09 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Tue, Jan 28, 2025 at 07:48:33AM +0100, Stefano Brivio wrote:
> > On Tue, 28 Jan 2025 11:53:09 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > > On Tue, Jan 28, 2025 at 12:15:28AM +0100, Stefano Brivio wrote:
> > > > Moving in_epoll out of the common flow data created a 7-bit hole in
> > > > struct tcp_splice_conn: repack by shrinking @flags by one (otherwise
> > > > unused) bit.
> > >
> > > Is this actually necessary for the migration stuff? Or just a cleanup
> > > you spotted along the way?
> >
> > I thought it was helpful to keep the same size on 32-bit, but it looks
> > like it's not actually needed.
> >
> > Let me drop it from this series as it's just noise and I'm trying to
> > keep this slim. If we are all happy with it I can apply it. If not I'll
> > forget about it.
>
> Eh, I don't care that much either way.
>
> Note, btw, that bit-field packing is another way source and
> destination could potentially have mismatching data structures. IIUC
> bit field packing is described by the ABI and doesn't necessarily
> match the byte endianness.
Right, that's actually the reason that brought me to this change: I was
comparing stuff between x86_64 and armv6l. On the other hand, this part
of the specific ABI is generally considered stable so I can rely on it.
--
Stefano
* Re: [PATCH 5/7] util: Add read_remainder() and read_all_buf()
2025-01-29 1:03 ` David Gibson
@ 2025-01-29 7:33 ` Stefano Brivio
2025-01-30 0:44 ` David Gibson
0 siblings, 1 reply; 41+ messages in thread
From: Stefano Brivio @ 2025-01-29 7:33 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Laurent Vivier
On Wed, 29 Jan 2025 12:03:30 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Tue, Jan 28, 2025 at 07:48:49AM +0100, Stefano Brivio wrote:
> > On Tue, 28 Jan 2025 11:59:28 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > > On Tue, Jan 28, 2025 at 12:15:30AM +0100, Stefano Brivio wrote:
> > > > These are symmetric to write_remainder() and write_all_buf() and
> > > > almost a copy and paste of them, with the most notable differences
> > > > being reversed reads/writes and a couple of better-safe-than-sorry
> > > > asserts to keep Coverity happy.
> > >
> > > So, there's one thing that needs to be not quite symmetric for the
> > > read() version: we need to handle EOF. At present, I believe these
> > > will enter an infinite loop on EOF, which is not a graceful failure
> > > mode.
> >
> > It doesn't happen in our current usage where we close the socket once
> > we're done,
>
> I don't see how what we do with the socket is relevant. Couldn't we
> hit this case if qemu unexpectedly closed the socket or died?
Yes, sure. I just mentioned that it's not the intended usage, and
rather an error case we need to handle.
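For reference, a minimal sketch of a read loop that treats EOF as an error
instead of spinning on it. read_exact() is a hypothetical stand-in written
for this example, not the read_all_buf() from the patch, and reporting EOF
as ENODATA is an assumption:

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-in for read_all_buf(), for illustration only: read
 * exactly len bytes from fd. Return 0 on success, -1 with errno set on
 * failure, including EOF before len bytes were received.
 */
int read_exact(int fd, void *buf, size_t len)
{
	char *p = buf;

	while (len) {
		ssize_t rc = read(fd, p, len);

		if (rc < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}

		if (!rc) {
			/* Peer closed: read() keeps returning 0, so bail out */
			errno = ENODATA;
			return -1;
		}

		p += rc;
		len -= rc;
	}

	return 0;
}
```

The important part is the explicit check for a zero return: without it, the
loop never makes progress once the other end is gone.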
--
Stefano
* Re: [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure
2025-01-29 1:16 ` David Gibson
@ 2025-01-29 7:33 ` Stefano Brivio
2025-01-30 0:48 ` David Gibson
0 siblings, 1 reply; 41+ messages in thread
From: Stefano Brivio @ 2025-01-29 7:33 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Laurent Vivier
On Wed, 29 Jan 2025 12:16:58 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Tue, Jan 28, 2025 at 07:50:01AM +0100, Stefano Brivio wrote:
> > On Tue, 28 Jan 2025 12:40:12 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > > On Tue, Jan 28, 2025 at 12:15:31AM +0100, Stefano Brivio wrote:
> > > > Add two sets (source or target) of three functions each for passt in
> > > > vhost-user mode, triggered by activity on the file descriptor passed
> > > > via VHOST_USER_PROTOCOL_F_DEVICE_STATE:
> > > >
> > > > - migrate_source_pre() and migrate_target_pre() are called to prepare
> > > > for migration, before data is transferred
> > > >
> > > > - migrate_source() sends, and migrate_target() receives migration data
> > > >
> > > > - migrate_source_post() and migrate_target_post() are responsible for
> > > > any post-migration task
> > > >
> > > > Callbacks are added to these functions with arrays of function
> > > > pointers in migrate.c. Migration handlers are versioned.
> > > >
> > > > Versioned descriptions of data sections will be added to the
> > > > data_versions array, which points to versioned iovec arrays. Version
> > > > 1 is currently empty and will be filled in in subsequent patches.
> > > >
> > > > The source announces the data version to be used and informs the peer
> > > > about endianness, and the size of void *, time_t, flow entries and
> > > > flow hash table entries.
> > > >
> > > > The target checks if the version of the source is still supported. If
> > > > it's not, it aborts the migration.
> > > >
> > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > > ---
> > > > Makefile | 12 +--
> > > > migrate.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > migrate.h | 90 ++++++++++++++++++
> > > > passt.c | 2 +-
> > > > vu_common.c | 122 ++++++++++++++++---------
> > > > vu_common.h | 2 +-
> > > > 6 files changed, 438 insertions(+), 49 deletions(-)
> > > > create mode 100644 migrate.c
> > > > create mode 100644 migrate.h
> > > >
> > > > diff --git a/Makefile b/Makefile
> > > > index 464eef1..1383875 100644
> > > > --- a/Makefile
> > > > +++ b/Makefile
> > > > @@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
> > > >
> > > > PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> > > > icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
> > > > - ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
> > > > - tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > > > + ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
> > > > + tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > > > vhost_user.c virtio.c vu_common.c
> > > > QRAP_SRCS = qrap.c
> > > > SRCS = $(PASST_SRCS) $(QRAP_SRCS)
> > > > @@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1
> > > >
> > > > PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
> > > > flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
> > > > - lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
> > > > - siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
> > > > - tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
> > > > - virtio.h vu_common.h
> > > > + lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
> > > > + pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
> > > > + tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
> > > > + vhost_user.h virtio.h vu_common.h
> > > > HEADERS = $(PASST_HEADERS) seccomp.h
> > > >
> > > > C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
> > > > diff --git a/migrate.c b/migrate.c
> > > > new file mode 100644
> > > > index 0000000..bee9653
> > > > --- /dev/null
> > > > +++ b/migrate.c
> > > > @@ -0,0 +1,259 @@
> > > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > > +
> > > > +/* PASST - Plug A Simple Socket Transport
> > > > + * for qemu/UNIX domain socket mode
> > > > + *
> > > > + * PASTA - Pack A Subtle Tap Abstraction
> > > > + * for network namespace/tap device mode
> > > > + *
> > > > + * migrate.c - Migration sections, layout, and routines
> > > > + *
> > > > + * Copyright (c) 2025 Red Hat GmbH
> > > > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > > > + */
> > > > +
> > > > +#include <errno.h>
> > > > +#include <sys/uio.h>
> > > > +
> > > > +#include "util.h"
> > > > +#include "ip.h"
> > > > +#include "passt.h"
> > > > +#include "inany.h"
> > > > +#include "flow.h"
> > > > +#include "flow_table.h"
> > > > +
> > > > +#include "migrate.h"
> > > > +
> > > > +/* Current version of migration data */
> > > > +#define MIGRATE_VERSION 1
> > > > +
> > > > +/* Magic as we see it and as seen with reverse endianness */
> > > > +#define MIGRATE_MAGIC 0xB1BB1D1B0BB1D1B0
> > > > +#define MIGRATE_MAGIC_SWAPPED 0xB0D1B10B1B1DBBB1
> > >
> > > As noted, I'm hoping we can get rid of "either endian" migration. But
> > > if this stays, we should define it using __bswap_constant_32() to
> > > avoid embarrassing mistakes.
> >
> > Those always give me issues on musl,
>
> What sort of issues? We're already using them, and have fallback
> versions defined in util.h
The very issues that brought me to introduce those fallback versions,
so I'm instinctively reluctant to use them.
Actually, I think it's even clearer to have this spelt out (I always
need to stop for a moment and think: what happens when I cross the
32-bit boundary?).
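One way to keep both constants spelt out and still catch a mistyped byte is
to check one against the other. A minimal sketch, where swap64() is a
hypothetical helper written for this example (not one of the fallbacks in
util.h):

```c
#include <stdint.h>

#define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
#define MIGRATE_MAGIC_SWAPPED	0xB0D1B10B1B1DBBB1

/* Hypothetical helper: reverse the order of the eight bytes of x */
uint64_t swap64(uint64_t x)
{
	uint64_t r = 0;
	int i;

	for (i = 0; i < 8; i++) {
		r = (r << 8) | (x & 0xff);	/* Lowest byte of x moves up */
		x >>= 8;
	}

	return r;
}
```

Comparing swap64(MIGRATE_MAGIC) with MIGRATE_MAGIC_SWAPPED in a unit test
would flag a value that is off by a byte or a nibble.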
> > so I'd rather test things on
> > big-endian and realise it's actually 0xB0D1B1B01B1DBBB1 (0x0b bitswap).
> >
> > Feel free to post a different proposal if tested.
> >
> > > > +
> > > > +/* Migration header to send from source */
> > > > +static union migrate_header header = {
> > > > + .magic = MIGRATE_MAGIC,
> > > > + .version = htonl_constant(MIGRATE_VERSION),
> > > > + .time_t_size = htonl_constant(sizeof(time_t)),
> > > > + .flow_size = htonl_constant(sizeof(union flow)),
> > > > + .flow_sidx_size = htonl_constant(sizeof(struct flow_sidx)),
> > > > + .voidp_size = htonl_constant(sizeof(void *)),
> > > > +};
> > > > +
> > > > +/* Data sections for version 1 */
> > > > +static struct iovec sections_v1[] = {
> > > > + { &header, sizeof(header) },
> > > > +};
> > > > +
> > > > +/* Set of data versions */
> > > > +static struct migrate_data data_versions[] = {
> > > > + {
> > > > + 1, sections_v1,
> > > > + },
> > > > + { 0 },
> > > > +};
> > > > +
> > > > +/* Handlers to call in source before sending data */
> > > > +struct migrate_handler handlers_source_pre[] = {
> > > > + { 0 },
> > > > +};
> > > > +
> > > > +/* Handlers to call in source after sending data */
> > > > +struct migrate_handler handlers_source_post[] = {
> > > > + { 0 },
> > > > +};
> > > > +
> > > > +/* Handlers to call in target before receiving data with version 1 */
> > > > +struct migrate_handler handlers_target_pre_v1[] = {
> > > > + { 0 },
> > > > +};
> > > > +
> > > > +/* Handlers to call in target after receiving data with version 1 */
> > > > +struct migrate_handler handlers_target_post_v1[] = {
> > > > + { 0 },
> > > > +};
> > > > +
> > > > +/* Versioned sets of migration handlers */
> > > > +struct migrate_target_handlers target_handlers[] = {
> > > > + {
> > > > + 1,
> > > > + handlers_target_pre_v1,
> > > > + handlers_target_post_v1,
> > > > + },
> > > > + { 0 },
> > > > +};
> > > > +
> > > > +/**
> > > > + * migrate_source_pre() - Pre-migration tasks as source
> > > > + * @m: Migration metadata
> > > > + *
> > > > + * Return: 0 on success, error code on failure
> > > > + */
> > > > +int migrate_source_pre(struct migrate_meta *m)
> > > > +{
> > > > + struct migrate_handler *h;
> > > > +
> > > > + for (h = handlers_source_pre; h->fn; h++) {
> > > > + int rc;
> > > > +
> > > > + if ((rc = h->fn(m, h->data)))
> > > > + return rc;
> > > > + }
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * migrate_source() - Perform migration as source: send state to hypervisor
> > > > + * @fd: Descriptor for state transfer
> > > > + * @m: Migration metadata
> > > > + *
> > > > + * Return: 0 on success, error code on failure
> > > > + */
> > > > +int migrate_source(int fd, const struct migrate_meta *m)
> > > > +{
> > > > + static struct migrate_data *d;
> > > > + unsigned count;
> > > > + int rc;
> > > > +
> > > > + for (d = data_versions; d->v != MIGRATE_VERSION; d++);
> > >
> > > Should ASSERT() if we don't find the version within the array.
> >
> > This looks a bit unnecessary, MIGRATE_VERSION is defined just above...
> > it's just a readability killer to me.
> >
> > > > + for (count = 0; d->sections[count].iov_len; count++);
> > > > +
> > > > + debug("Writing %u migration sections", count - 1 /* minus header */);
> > > > + rc = write_remainder(fd, d->sections, count, 0);
> > > > + if (rc < 0)
> > > > + return errno;
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * migrate_source_post() - Post-migration tasks as source
> > > > + * @m: Migration metadata
> > > > + *
> > > > + * Return: 0 on success, error code on failure
> > > > + */
> > > > +void migrate_source_post(struct migrate_meta *m)
> > > > +{
> > > > + struct migrate_handler *h;
> > > > +
> > > > + for (h = handlers_source_post; h->fn; h++)
> > > > + h->fn(m, h->data);
> > >
> > > Is there actually anything we might need to do on the source after a
> > > successful migration, other than exit?
> >
> > We might want to log a couple of things, which would warrant these
> > handlers.
> >
> > But let's say we need to do something *similar* to "updating the
> > network" such as the RARP announcement that QEMU is requesting (this is
>
> IIUC, that's on the target end, not the source end...
The RARP announcement yes, but something *similar* to it, not
necessarily.
> > intended for OVN-Kubernetes, so go figure), or that we need a
> > workaround for a kernel issue with implicit close() with TCP_REPAIR
> > on... I would leave this in for completeness.
>
> ...but sure, point taken.
>
> > > > +}
> > > > +
> > > > +/**
> > > > + * migrate_target_read_header() - Set metadata in target from source header
> > > > + * @fd: Descriptor for state transfer
> > > > + * @m: Migration metadata, filled on return
> > > > + *
> > > > + * Return: 0 on success, error code on failure
> > >
> > > We nearly always use negative error codes. Why not here?
> >
> > Because the reply to VHOST_USER_SET_DEVICE_STATE_FD is unsigned:
> >
> > https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#front-end-message-types
> >
> > and I want to keep this consistent/untranslated.
>
> Ok.
>
> > > > + */
> > > > +int migrate_target_read_header(int fd, struct migrate_meta *m)
> > > > +{
> > > > + static struct migrate_data *d;
> > > > + union migrate_header h;
> > > > +
> > > > + if (read_all_buf(fd, &h, sizeof(h)))
> > > > + return errno;
> > > > +
> > > > + debug("Source magic: 0x%016" PRIx64 ", sizeof(void *): %u, version: %u",
> > > > + h.magic, ntohl(h.voidp_size), ntohl(h.version));
> > > > +
> > > > + for (d = data_versions; d->v != ntohl(h.version); d++);
> > > > + if (!d->v)
> > > > + return ENOTSUP;
> > >
> > > This is too late. The loop doesn't check it, so you've already
> > > overrun the data_versions table if the version wasn't in there.
> >
> > Ah, yes, I forgot the '&& d->v' part (see migrate_target()).
> >
> > > Easier to use an ARRAY_SIZE() limit in the loop, I think.
> >
> > I'd rather keep that as a one-liner, and NULL-terminate the arrays.
> >
> > > > + m->v = d->v;
> > > > +
> > > > + if (h.magic == MIGRATE_MAGIC)
> > > > + m->bswap = false;
> > > > + else if (h.magic == MIGRATE_MAGIC_SWAPPED)
> > > > + m->bswap = true;
> > > > + else
> > > > + return ENOTSUP;
> > > > +
> > > > + if (ntohl(h.voidp_size) == 4)
> > > > + m->source_64b = false;
> > > > + else if (ntohl(h.voidp_size) == 8)
> > > > + m->source_64b = true;
> > > > + else
> > > > + return ENOTSUP;
> > > > +
> > > > + if (ntohl(h.time_t_size) == 4)
> > > > + m->time_64b = false;
> > > > + else if (ntohl(h.time_t_size) == 8)
> > > > + m->time_64b = true;
> > > > + else
> > > > + return ENOTSUP;
> > > > +
> > > > + m->flow_size = ntohl(h.flow_size);
> > > > + m->flow_sidx_size = ntohl(h.flow_sidx_size);
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * migrate_target_pre() - Pre-migration tasks as target
> > > > + * @m: Migration metadata
> > > > + *
> > > > + * Return: 0 on success, error code on failure
> > > > + */
> > > > +int migrate_target_pre(struct migrate_meta *m)
> > > > +{
> > > > + struct migrate_target_handlers *th;
> > > > + struct migrate_handler *h;
> > > > +
> > > > + for (th = target_handlers; th->v != m->v && th->v; th++);
> > > > +
> > > > + for (h = th->pre; h->fn; h++) {
> > > > + int rc;
> > > > +
> > > > + if ((rc = h->fn(m, h->data)))
> > > > + return rc;
> > > > + }
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * migrate_target() - Perform migration as target: receive state from hypervisor
> > > > + * @fd: Descriptor for state transfer
> > > > + * @m: Migration metadata
> > > > + *
> > > > + * Return: 0 on success, error code on failure
> > > > + *
> > > > + * #syscalls:vu readv
> > > > + */
> > > > +int migrate_target(int fd, const struct migrate_meta *m)
> > > > +{
> > > > + static struct migrate_data *d;
> > > > + unsigned cnt;
> > > > + int rc;
> > > > +
> > > > + for (d = data_versions; d->v != m->v && d->v; d++);
> > > > +
> > > > + for (cnt = 0; d->sections[cnt + 1 /* skip header */].iov_len; cnt++);
> > > > +
> > > > + debug("Reading %u migration sections", cnt);
> > > > + rc = read_remainder(fd, d->sections + 1, cnt, 0);
> > > > + if (rc < 0)
> > > > + return errno;
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * migrate_target_post() - Post-migration tasks as target
> > > > + * @m: Migration metadata
> > > > + */
> > > > +void migrate_target_post(struct migrate_meta *m)
> > > > +{
> > > > + struct migrate_target_handlers *th;
> > > > + struct migrate_handler *h;
> > > > +
> > > > + for (th = target_handlers; th->v != m->v && th->v; th++);
> > > > +
> > > > + for (h = th->post; h->fn; h++)
> > > > + h->fn(m, h->data);
> > > > +}
> > > > diff --git a/migrate.h b/migrate.h
> > > > new file mode 100644
> > > > index 0000000..5582f75
> > > > --- /dev/null
> > > > +++ b/migrate.h
> > > > @@ -0,0 +1,90 @@
> > > > +/* SPDX-License-Identifier: GPL-2.0-or-later
> > > > + * Copyright (c) 2025 Red Hat GmbH
> > > > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > > > + */
> > > > +
> > > > +#ifndef MIGRATE_H
> > > > +#define MIGRATE_H
> > > > +
> > > > +/**
> > > > + * struct migrate_meta - Migration metadata
> > > > + * @v: Chosen migration data version, host order
> > > > + * @bswap: Source has opposite endianness
> > > > + * @source_64b: Source uses 64-bit void *
> > > > + * @time_64b: Source uses 64-bit time_t
> > > > + * @flow_size: Size of union flow in source
> > > > + * @flow_sidx_size: Size of struct flow_sidx in source
> > > > + */
> > > > +struct migrate_meta {
> > > > + uint32_t v;
> > > > + bool bswap;
> > > > + bool source_64b;
> > > > + bool time_64b;
> > > > + size_t flow_size;
> > > > + size_t flow_sidx_size;
> > > > +};
> > > > +
> > > > +/**
> > > > + * union migrate_header - Migration header from source
> > > > + * @magic: 0xB1BB1D1B0BB1D1B0, host order
> > > > + * @version: Source sends highest known, target aborts if unsupported
> > > > + * @voidp_size: sizeof(void *), network order
> > > > + * @time_t_size: sizeof(time_t), network order
> > > > + * @flow_size: sizeof(union flow), network order
> > > > + * @flow_sidx_size: sizeof(struct flow_sidx_t), network order
> > > > + * @unused: Go figure
> > > > + */
> > > > +union migrate_header {
> > > > + struct {
> > > > + uint64_t magic;
> > > > + uint32_t version;
> > > > + uint32_t voidp_size;
> > > > + uint32_t time_t_size;
> > > > + uint32_t flow_size;
> > > > + uint32_t flow_sidx_size;
> > > > + };
> > > > + uint8_t unused[65536];
> > >
> > > So, having looked at this, I no longer think padding the header to 64kiB
> > > is a good idea. The structure means we're basically stuck always
> > > having that chunky header. Instead, I think the header should be
> > > absolutely minimal: basically magic and version only. v1 (and maybe
> > > others) can add a "metadata" or whatever section for additional
> > > information like this they need.
> >
> > The header is processed by the target in a separate, preliminary step,
> > though.
> >
> > That's why I added metadata right in the header: if the target needs to
> > abort the migration because, say, the size of a flow entry is too big
> > to handle for a particular version, then we should know that before
> > migrate_target_pre().
>
> Ah, yes, I missed that, we'd need a more complex design to do
> additional transfers and checks before making the target_pre
> callbacks.
>
> > As long as we check the version first, we can always shrink the header
> > later on.
>
> *thinks*.. I guess so, though it's kind of awkward; a future version
> would have to read the "header of the header", check the version, then
> if it's the old one, read the remainder of the 64kiB block.
>
> I still think we should clearly separate the part that we're
> committing to being in every future version (which I think should just
> be magic and version), from the stuff that's just v1.
Sure, I can add a comment.
> > But having 64 KiB reserved looks more robust because it's a
> > safe place to add this kind of metadata.
> >
> > Note that 64 KiB is typically transferred in a single read/write
> > from/to the vhost-user back-end.
>
> Ok, but it also has to go over the qemu migration channel, which will
> often be a physical link, not a super-fast local/virtual one, and may
> be bandwidth capped as well. I'm not actually certain if 64kiB is
> likely to be a problem there, but it *is* large compared to the state
> blobs of most qemu devices (usually only a few hundred bytes).
Even if we transfer just what we need of a flow, it's still something
well in excess of 50 bytes each. 100k flows would be 5 megs.
Well, anyway, let's cut this down to 4k, which should be enough, so
that it's not a topic anymore.
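Something along these lines, as a sketch only: the field names are the ones
from the patch, MIGRATE_HEADER_SIZE is a name made up for this example, and
the comments mark the split between the part common to all versions and the
part specific to version 1:

```c
#include <stddef.h>
#include <stdint.h>

#define MIGRATE_HEADER_SIZE	4096	/* Hypothetical name, 4 KiB */

union migrate_header {
	struct {
		/* Part meant to stay the same in every future version */
		uint64_t magic;
		uint32_t version;

		/* Part specific to version 1 */
		uint32_t voidp_size;
		uint32_t time_t_size;
		uint32_t flow_size;
		uint32_t flow_sidx_size;
	};
	uint8_t unused[MIGRATE_HEADER_SIZE];
};

/* Build-time checks: total size, and position of the common part */
_Static_assert(sizeof(union migrate_header) == MIGRATE_HEADER_SIZE,
	       "migration header must be exactly 4 KiB");
_Static_assert(offsetof(union migrate_header, magic) == 0,
	       "magic must come first");
_Static_assert(offsetof(union migrate_header, version) == 8,
	       "version must directly follow magic");
```

The static assertions keep the on-wire size and the position of magic and
version from changing by accident when fields are added later.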
--
Stefano
* Re: [PATCH 3/7] tcp_conn: Avoid 7-bit hole in struct tcp_splice_conn
2025-01-29 7:33 ` Stefano Brivio
@ 2025-01-30 0:44 ` David Gibson
2025-01-30 4:55 ` Stefano Brivio
0 siblings, 1 reply; 41+ messages in thread
From: David Gibson @ 2025-01-30 0:44 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier
On Wed, Jan 29, 2025 at 08:33:40AM +0100, Stefano Brivio wrote:
> On Wed, 29 Jan 2025 12:02:09 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Tue, Jan 28, 2025 at 07:48:33AM +0100, Stefano Brivio wrote:
> > > On Tue, 28 Jan 2025 11:53:09 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >
> > > > On Tue, Jan 28, 2025 at 12:15:28AM +0100, Stefano Brivio wrote:
> > > > > Moving in_epoll out of the common flow data created a 7-bit hole in
> > > > > struct tcp_splice_conn: repack by shrinking @flags by one (otherwise
> > > > > unused) bit.
> > > >
> > > > Is this actually necessary for the migration stuff? Or just a cleanup
> > > > you spotted along the way?
> > >
> > > I thought it was helpful to keep the same size on 32-bit, but it looks
> > > like it's not actually needed.
> > >
> > > Let me drop it from this series as it's just noise and I'm trying to
> > > keep this slim. If we are all happy with it I can apply it. If not I'll
> > > forget about it.
> >
> > Eh, I don't care that much either way.
> >
> > Note, btw, that bit-field packing is another way source and
> > destination could potentially have mismatching data structures. IIUC
> > bit field packing is described by the ABI and doesn't necessarily
> > match the byte endianness.
>
> Right, that's actually the reason that brought me to this change: I was
> comparing stuff between x86_64 and armv6l. On the other hand, this part
> of the specific ABI is generally considered stable so I can rely on it.
Uhh.. a specific ABI is stable, yes, but IIUC the whole point of these
endian, word size etc. checks is that you're not counting on it being
an identical ABI at each end. I'm saying the bit field packing is
another way the ABIs at each end could differ, which is not currently
accounted for.
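To illustrate the concern with a hypothetical example (struct conn_bits and
the 7 + 1 bit split are assumptions made for this sketch, not the actual
layout of struct tcp_splice_conn): the placement of bit-fields in memory is
up to the ABI, whereas an explicit pack/unpack step with shifts and masks
gives the same byte on any source and target.

```c
#include <stdint.h>

/* Hypothetical in-memory representation: the ABI decides where these
 * bits land within the storage unit
 */
struct conn_bits {
	unsigned flags		:7;
	unsigned in_epoll	:1;
};

/* Explicit layout for transfer: bits 0-6 are flags, bit 7 is in_epoll */
uint8_t conn_bits_pack(const struct conn_bits *c)
{
	return (uint8_t)((c->flags & 0x7f) | (c->in_epoll << 7));
}

/* Reverse of conn_bits_pack() */
void conn_bits_unpack(struct conn_bits *c, uint8_t wire)
{
	c->flags = wire & 0x7f;
	c->in_epoll = wire >> 7;
}
```

Copying the struct as raw bytes skips this step, which is why it depends on
both ends agreeing on bit-field packing.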
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
* Re: [PATCH 5/7] util: Add read_remainder() and read_all_buf()
2025-01-29 7:33 ` Stefano Brivio
@ 2025-01-30 0:44 ` David Gibson
0 siblings, 0 replies; 41+ messages in thread
From: David Gibson @ 2025-01-30 0:44 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier
On Wed, Jan 29, 2025 at 08:33:47AM +0100, Stefano Brivio wrote:
> On Wed, 29 Jan 2025 12:03:30 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Tue, Jan 28, 2025 at 07:48:49AM +0100, Stefano Brivio wrote:
> > > On Tue, 28 Jan 2025 11:59:28 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >
> > > > On Tue, Jan 28, 2025 at 12:15:30AM +0100, Stefano Brivio wrote:
> > > > > These are symmetric to write_remainder() and write_all_buf() and
> > > > > almost a copy and paste of them, with the most notable differences
> > > > > being reversed reads/writes and a couple of better-safe-than-sorry
> > > > > asserts to keep Coverity happy.
> > > >
> > > > So, there's one thing that needs to be not quite symmetric for the
> > > > read() version: we need to handle EOF. At present, I believe these
> > > > will enter an infinite loop on EOF, which is not a graceful failure
> > > > mode.
> > >
> > > It doesn't happen in our current usage where we close the socket once
> > > we're done,
> >
> > I don't see how what we do with the socket is relevant. Couldn't we
> > hit this case if qemu unexpectedly closed the socket or died?
>
> Yes, sure. I just mentioned that it's not the intended usage, and
> rather an error case we need to handle.
Oh, sure, no argument there.
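
For illustration, the EOF case can be handled by treating a zero return from
read() as a failure instead of retrying. This is a sketch under assumptions:
read_all_buf_sketch() is a hypothetical name, not the function from 5/7, and
reporting premature EOF as ENODATA is just one possible choice.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Sketch only: read exactly 'len' bytes from 'fd' into 'buf'.
 * Returns 0 on success, -1 on error or premature EOF, with errno set.
 */
static int read_all_buf_sketch(int fd, void *buf, size_t len)
{
	char *p = buf;
	size_t left = len;

	while (left) {
		ssize_t rc = read(fd, p, left);

		if (rc < 0) {
			if (errno == EINTR)
				continue;	/* Interrupted, try again */
			return -1;
		}

		if (rc == 0) {	/* Peer closed or died: don't loop forever */
			errno = ENODATA;
			return -1;
		}

		p += rc;
		left -= rc;
	}

	return 0;
}
```

With this, qemu closing the socket in the middle of a transfer turns into an
error return the caller can act on.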
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure
2025-01-29 7:33 ` Stefano Brivio
@ 2025-01-30 0:48 ` David Gibson
2025-01-30 4:55 ` Stefano Brivio
0 siblings, 1 reply; 41+ messages in thread
From: David Gibson @ 2025-01-30 0:48 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier
On Wed, Jan 29, 2025 at 08:33:50AM +0100, Stefano Brivio wrote:
> On Wed, 29 Jan 2025 12:16:58 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Tue, Jan 28, 2025 at 07:50:01AM +0100, Stefano Brivio wrote:
> > > On Tue, 28 Jan 2025 12:40:12 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >
> > > > On Tue, Jan 28, 2025 at 12:15:31AM +0100, Stefano Brivio wrote:
> > > > > Add two sets (source or target) of three functions each for passt in
> > > > > vhost-user mode, triggered by activity on the file descriptor passed
> > > > > via VHOST_USER_PROTOCOL_F_DEVICE_STATE:
> > > > >
> > > > > - migrate_source_pre() and migrate_target_pre() are called to prepare
> > > > > for migration, before data is transferred
> > > > >
> > > > > - migrate_source() sends, and migrate_target() receives migration data
> > > > >
> > > > > - migrate_source_post() and migrate_target_post() are responsible for
> > > > > any post-migration task
> > > > >
> > > > > Callbacks are added to these functions with arrays of function
> > > > > pointers in migrate.c. Migration handlers are versioned.
> > > > >
> > > > > Versioned descriptions of data sections will be added to the
> > > > > data_versions array, which points to versioned iovec arrays. Version
> > > > > 1 is currently empty and will be filled in in subsequent patches.
> > > > >
> > > > > The source announces the data version to be used and informs the peer
> > > > > about endianness, and the size of void *, time_t, flow entries and
> > > > > flow hash table entries.
> > > > >
> > > > > The target checks if the version of the source is still supported. If
> > > > > it's not, it aborts the migration.
> > > > >
> > > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > > > ---
> > > > > Makefile | 12 +--
> > > > > migrate.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > migrate.h | 90 ++++++++++++++++++
> > > > > passt.c | 2 +-
> > > > > vu_common.c | 122 ++++++++++++++++---------
> > > > > vu_common.h | 2 +-
> > > > > 6 files changed, 438 insertions(+), 49 deletions(-)
> > > > > create mode 100644 migrate.c
> > > > > create mode 100644 migrate.h
> > > > >
> > > > > diff --git a/Makefile b/Makefile
> > > > > index 464eef1..1383875 100644
> > > > > --- a/Makefile
> > > > > +++ b/Makefile
> > > > > @@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
> > > > >
> > > > > PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> > > > > icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
> > > > > - ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
> > > > > - tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > > > > + ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
> > > > > + tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > > > > vhost_user.c virtio.c vu_common.c
> > > > > QRAP_SRCS = qrap.c
> > > > > SRCS = $(PASST_SRCS) $(QRAP_SRCS)
> > > > > @@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1
> > > > >
> > > > > PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
> > > > > flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
> > > > > - lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
> > > > > - siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
> > > > > - tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
> > > > > - virtio.h vu_common.h
> > > > > + lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
> > > > > + pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
> > > > > + tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
> > > > > + vhost_user.h virtio.h vu_common.h
> > > > > HEADERS = $(PASST_HEADERS) seccomp.h
> > > > >
> > > > > C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
> > > > > diff --git a/migrate.c b/migrate.c
> > > > > new file mode 100644
> > > > > index 0000000..bee9653
> > > > > --- /dev/null
> > > > > +++ b/migrate.c
> > > > > @@ -0,0 +1,259 @@
> > > > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > > > +
> > > > > +/* PASST - Plug A Simple Socket Transport
> > > > > + * for qemu/UNIX domain socket mode
> > > > > + *
> > > > > + * PASTA - Pack A Subtle Tap Abstraction
> > > > > + * for network namespace/tap device mode
> > > > > + *
> > > > > + * migrate.c - Migration sections, layout, and routines
> > > > > + *
> > > > > + * Copyright (c) 2025 Red Hat GmbH
> > > > > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > > > > + */
> > > > > +
> > > > > +#include <errno.h>
> > > > > +#include <sys/uio.h>
> > > > > +
> > > > > +#include "util.h"
> > > > > +#include "ip.h"
> > > > > +#include "passt.h"
> > > > > +#include "inany.h"
> > > > > +#include "flow.h"
> > > > > +#include "flow_table.h"
> > > > > +
> > > > > +#include "migrate.h"
> > > > > +
> > > > > +/* Current version of migration data */
> > > > > +#define MIGRATE_VERSION 1
> > > > > +
> > > > > +/* Magic as we see it and as seen with reverse endianness */
> > > > > +#define MIGRATE_MAGIC 0xB1BB1D1B0BB1D1B0
> > > > > +#define MIGRATE_MAGIC_SWAPPED 0xB0D1B10B1B1DBBB1
> > > >
> > > > As noted, I'm hoping we can get rid of "either endian" migration. But
> > > > if this stays, we should define it using __bswap_constant_32() to
> > > > avoid embarrassing mistakes.
> > >
> > > Those always give me issues on musl,
> >
> > What sort of issues? We're already using them, and have fallback
> > versions defined in util.h
>
> The very issues that brought me to introduce those fallback versions,
> so I'm instinctively reluctant to use them.
>
> Actually, I think it's even clearer to have this spelt out (I always
> need to stop for a moment and think: what happens when I cross the
> 32-bit boundary?).
Oh, yes, we'd need to add a __bswap_constant_64() for this.
[snip]
> > > > > +/**
> > > > > + * union migrate_header - Migration header from source
> > > > > + * @magic: 0xB1BB1D1B0BB1D1B0, host order
> > > > > + * @version: Source sends highest known, target aborts if unsupported
> > > > > + * @voidp_size: sizeof(void *), network order
> > > > > + * @time_t_size: sizeof(time_t), network order
> > > > > + * @flow_size: sizeof(union flow), network order
> > > > > + * @flow_sidx_size: sizeof(struct flow_sidx_t), network order
> > > > > + * @unused: Go figure
> > > > > + */
> > > > > +union migrate_header {
> > > > > + struct {
> > > > > + uint64_t magic;
> > > > > + uint32_t version;
> > > > > + uint32_t voidp_size;
> > > > > + uint32_t time_t_size;
> > > > > + uint32_t flow_size;
> > > > > + uint32_t flow_sidx_size;
> > > > > + };
> > > > > + uint8_t unused[65536];
> > > >
> > > > So, having looked at this, I no longer think padding the header to 64kiB
> > > > is a good idea. The structure means we're basically stuck always
> > > > having that chunky header. Instead, I think the header should be
> > > > absolutely minimal: basically magic and version only. v1 (and maybe
> > > > others) can add a "metadata" or whatever section for additional
> > > > information like this they need.
> > >
> > > The header is processed by the target in a separate, preliminary step,
> > > though.
> > >
> > > That's why I added metadata right in the header: if the target needs to
> > > abort the migration because, say, the size of a flow entry is too big
> > > to handle for a particular version, then we should know that before
> > > migrate_target_pre().
> >
> > Ah, yes, I missed that, we'd need a more complex design to do
> > additional transfers and checks before making the target_pre
> > callbacks.
> >
> > > As long as we check the version first, we can always shrink the header
> > > later on.
> >
> > *thinks*.. I guess so, though it's kind of awkward; a future version
> > would have to read the "header of the header", check the version, then
> > if it's the old one, read the remainder of the 64kiB block.
> >
> > I still think we should clearly separate the part that we're
> > committing to being in every future version (which I think should just
> > be magic and version), from the stuff that's just v1.
>
> Sure, I can add a comment.
>
> > > But having 64 KiB reserved looks more robust because it's a
> > > safe place to add this kind of metadata.
> > >
> > > Note that 64 KiB is typically transferred in a single read/write
> > > from/to the vhost-user back-end.
> >
> > Ok, but it also has to go over the qemu migration channel, which will
> > often be a physical link, not a super-fast local/virtual one, and may
> > be bandwidth capped as well. I'm not actually certain if 64kiB is
> > likely to be a problem there, but it *is* large compared to the state
> > blobs of most qemu devices (usually only a few hundred bytes).
>
> Even if we transfer just what we need of a flow, it's still something
> well in excess of 50 bytes each. 100k flows would be 5 megs.
Just transferring the in-use flows would be higher priority than being
selective about what we send within each flow. It's both easier to do
and a bigger win in most cases. That would dramatically reduce the
size sent here.
> Well, anyway, let's cut this down to 4k, which should be enough, so
> that it's not a topic anymore.
I still think it's ugly, but whatever.
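
For reference, the separation discussed above (a fixed part with magic and
version only, version-specific metadata afterwards) could look roughly like
this. It is a sketch under assumptions: struct migrate_hdr_fixed and
migrate_hdr_check() are hypothetical names, and sending the version in
network order is an assumption too, as the patch doesn't specify it.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
#define MIGRATE_MAGIC_SWAPPED	0xB0D1B10B1B1DBBB1

/* Hypothetical fixed part, the only bytes every version commits to */
struct migrate_hdr_fixed {
	uint64_t magic;		/* Host order of the source */
	uint32_t version;	/* Network order (assumption) */
} __attribute__((packed));

/* Check the fixed part at the start of 'buf'. Returns the version in host
 * order, or -1 if the buffer is too short or the magic doesn't match.
 * On success, '*swapped' tells whether the source has reverse endianness.
 */
static int64_t migrate_hdr_check(const void *buf, size_t len, int *swapped)
{
	struct migrate_hdr_fixed h;

	if (len < sizeof(h))
		return -1;

	memcpy(&h, buf, sizeof(h));

	if (h.magic == MIGRATE_MAGIC)
		*swapped = 0;
	else if (h.magic == MIGRATE_MAGIC_SWAPPED)
		*swapped = 1;
	else
		return -1;

	return (int64_t)ntohl(h.version);
}
```

The target would call this first, and only then read whatever metadata the
announced version defines (sizes of void *, time_t, flow entries, and so on
for v1), still before the target_pre callbacks.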
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 7/7] Introduce passt-repair
2025-01-29 7:04 ` Stefano Brivio
@ 2025-01-30 0:53 ` David Gibson
2025-01-30 4:55 ` Stefano Brivio
0 siblings, 1 reply; 41+ messages in thread
From: David Gibson @ 2025-01-30 0:53 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier
On Wed, Jan 29, 2025 at 08:04:28AM +0100, Stefano Brivio wrote:
> On Wed, 29 Jan 2025 12:29:27 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Tue, Jan 28, 2025 at 07:51:31AM +0100, Stefano Brivio wrote:
> > > On Tue, 28 Jan 2025 12:51:59 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >
> > > > On Tue, Jan 28, 2025 at 12:15:32AM +0100, Stefano Brivio wrote:
> > > > > A privileged helper to set/clear TCP_REPAIR on sockets on behalf of
> > > > > passt. Not used yet.
> > > > >
> > > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > > > ---
> > > > > Makefile | 10 +++--
> > > > > passt-repair.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > 2 files changed, 118 insertions(+), 3 deletions(-)
> > > > > create mode 100644 passt-repair.c
> > > > >
> > > > > diff --git a/Makefile b/Makefile
> > > > > index 1383875..1b71cb0 100644
> > > > > --- a/Makefile
> > > > > +++ b/Makefile
> > > > > @@ -42,7 +42,8 @@ PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> > > > > tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > > > > vhost_user.c virtio.c vu_common.c
> > > > > QRAP_SRCS = qrap.c
> > > > > -SRCS = $(PASST_SRCS) $(QRAP_SRCS)
> > > > > +PASST_REPAIR_SRCS = passt-repair.c
> > > > > +SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
> > > > >
> > > > > MANPAGES = passt.1 pasta.1 qrap.1
> > > > >
> > > > > @@ -72,9 +73,9 @@ mandir ?= $(datarootdir)/man
> > > > > man1dir ?= $(mandir)/man1
> > > > >
> > > > > ifeq ($(TARGET_ARCH),x86_64)
> > > > > -BIN := passt passt.avx2 pasta pasta.avx2 qrap
> > > > > +BIN := passt passt.avx2 pasta pasta.avx2 qrap passt-repair
> > > > > else
> > > > > -BIN := passt pasta qrap
> > > > > +BIN := passt pasta qrap passt-repair
> > > > > endif
> > > > >
> > > > > all: $(BIN) $(MANPAGES) docs
> > > > > @@ -101,6 +102,9 @@ pasta.avx2 pasta.1 pasta: pasta%: passt%
> > > > > qrap: $(QRAP_SRCS) passt.h
> > > > > $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS)
> > > > >
> > > > > +passt-repair: $(PASST_REPAIR_SRCS)
> > > > > + $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS)
> > > > > +
> > > > > valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \
> > > > > rt_sigreturn getpid gettid kill clock_gettime mmap \
> > > > > mmap2 munmap open unlink gettimeofday futex statx \
> > > > > diff --git a/passt-repair.c b/passt-repair.c
> > > > > new file mode 100644
> > > > > index 0000000..e9b9609
> > > > > --- /dev/null
> > > > > +++ b/passt-repair.c
> > > > > @@ -0,0 +1,111 @@
> > > > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > > > +
> > > > > +/* PASST - Plug A Simple Socket Transport
> > > > > + * for qemu/UNIX domain socket mode
> > > > > + *
> > > > > + * passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets
> > > > > + *
> > > > > + * Copyright (c) 2025 Red Hat GmbH
> > > > > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > > > > + *
> > > > > + * Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along
> > > > > + * with commands mapping to TCP_REPAIR values, and switch repair mode on or
> > > > > + * off. Reply by echoing the command. Exit if the command is INT_MAX.
> > > > > + */
> > > > > +
> > > > > +#include <sys/types.h>
> > > > > +#include <sys/socket.h>
> > > > > +#include <sys/un.h>
> > > > > +#include <errno.h>
> > > > > +#include <stdio.h>
> > > > > +#include <stdlib.h>
> > > > > +#include <string.h>
> > > > > +#include <limits.h>
> > > > > +#include <unistd.h>
> > > > > +#include <netdb.h>
> > > > > +
> > > > > +#include <netinet/tcp.h>
> > > > > +
> > > > > +#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
> > > > > +
> > > > > +int main(int argc, char **argv)
> > > > > +{
> > > > > + char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
> > > > > + __attribute__ ((aligned(__alignof__(struct cmsghdr))));
> > > > > + struct sockaddr_un a = { AF_UNIX, "" };
> > > > > + int cmd, fds[SCM_MAX_FD], s, ret, i;
> > > > > + struct cmsghdr *cmsg;
> > > > > + struct msghdr msg;
> > > > > + struct iovec iov;
> > > > > +
> > > > > + iov = (struct iovec){ &cmd, sizeof(cmd) };
> > > >
> > > > I mean, local to local, it's *probably* fine, but still a network
> > > > protocol not defined in terms of explicit width fields makes me
> > > > nervous. I'd prefer to see the cmd being a packed structure with
> > > > fixed width elements.
> > >
> > > It actually is, because:
> > >
> > > struct {
> > > int cmd;
> > > };
> > >
> > > is a packed structure with fixed width elements. Any architecture we
> > > build for (at least the ones I'm aware of) has a 32-bit int. We can
> > > make it uint32_t if it makes you feel better.
> >
> > Sorry, I should have said "*explicitly* fixed width fields". So, yes,
> > uint32_t would make me feel better :)
>
> Changed to int8_t anyway meanwhile. We don't need all those bits.
Works for me.
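
To make the exchange concrete, here is a sketch of passing one socket along
with a single-byte command over SCM_RIGHTS. repair_send() and repair_recv()
are hypothetical names for illustration, not the code from 7/7, and only one
file descriptor per message is handled.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send socket 'fd' and a one-byte command over UNIX domain socket 's' */
static int repair_send(int s, int fd, int8_t cmd)
{
	char buf[CMSG_SPACE(sizeof(int))]
		__attribute__ ((aligned(__alignof__(struct cmsghdr))));
	struct iovec iov = { &cmd, sizeof(cmd) };
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(buf, 0, sizeof(buf));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = buf;
	msg.msg_controllen = sizeof(buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));

	return sendmsg(s, &msg, 0) == (ssize_t)sizeof(cmd) ? 0 : -1;
}

/* Receive one socket and its command: returns the new fd, -1 on error */
static int repair_recv(int s, int8_t *cmd)
{
	char buf[CMSG_SPACE(sizeof(int))]
		__attribute__ ((aligned(__alignof__(struct cmsghdr))));
	struct iovec iov = { cmd, sizeof(*cmd) };
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = buf;
	msg.msg_controllen = sizeof(buf);

	if (recvmsg(s, &msg, 0) != (ssize_t)sizeof(*cmd))
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));

	return fd;
}
```

With a single byte on the wire, neither the width of int nor byte order can
differ between the two ends.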
> > > > I also think we should do some sort of basic magic / version exchange.
> > > > I don't see any reason we'd need to extend the protocol, but I'd
> > > > rather have the option if we have to.
> > >
> > > passt-repair will be packaged and distributed together with passt,
> > > though. Versions must match.
> >
> > But nothing enforces that.
>
> Distribution packages. If I run claws-mail with the wrong version of,
> say, libpixman, it won't start. If you don't use them, you're on your
> own.
But shared libraries *do* have versioning checks: there are defined
compatibility semantics for sonames, and there can be symbol versions
as well.
> > AIUI KubeVirt will be running passt-repair
> > in a different context. Which means it may well be deployed by a
> > different path than the passt binary
>
> No, that's not the way it works. It needs to match, in the sense that
> 1. it's a KubeVirt requirement to have compatible packages between
> distribution and the "base container image" and 2. this would most
> likely be sourced from the "base container image" anyway.
>
> I maintain the packages for four distributions, plus AppArmor and
> SELinux policies upstream and downstream, and I take care of updating
> the package in KubeVirt as well, so I guess I have a vague idea of
> what's convenient, enforced, burdensome, and so on.
>
> > which means however we
> > distribute it's quite plausible that a downstream screwup could
> > mismatch the versions. We should endeavour to have a reasonably
> > graceful failure mode for that.
>
> Regardless of this, I think that *this one* is an interface (I wouldn't
> even call it a protocol) that needs to be set in stone, except for
> hypothetical (and highly unlikely) UAPI additions which we'll be anyway
> able to accommodate for easily.
Ok, I can buy that, but it's a contradictory position to "versions
must match".
> It's a single socket option with three possible values (for 13 years
> now), of which we plan to use two. If we want this interface to do
> anything else, it should be another interface.
>
> So there's really no problem with this.
>
> Besides, the helper runs with CAP_NET_ADMIN (even though CAP_NET_RAW
> should ideally suffice), so it needs to be extremely simple and
> auditable.
Sending and checking a magic number is not a lot of complexity, even
in something on this scale.
>
> > > And latency here might matter more than in
> > > the rest of the migration process.
> >
> > > > Plus checking a magic number
> > > > should make things less damaging and more debuggable if you were to
> > > > point the repair helper at an entirely unrelated unix socket instead
> > > > of passt's repair socket.
> > >
> > > Maybe, yes, even though I don't really see good chances for that
> > > mistake to happen. Feel free to post a proposal, of course.
> >
> > I disagree on the good chances for a mistake thing. In GSS I saw
> > plenty of occasions where things that shouldn't be mismatched were due
> > to some packaging or user screwup. And that's before even considering
> > the way that KubeVirt deploys its various pieces seems to provide a
> > number of opportunities to mess this up.
> >
> > So, I'll see what I can come up with. I'm fine with requiring
> > matching versions if it's actually checked. Maybe a magic derived
> > from our git hash, or even our build-id.
>
> Both would make things significantly less usable, because they would
> make different but compatible builds incompatible, and different
> implementations rather inconvenient.
Ok, so you're definitely now saying versions *don't* need to match.
> For example, it might be a practical solution to have a Go
> implementation of this in KubeVirt's virt-handler itself, but if it
> needs to extract information or strings from the binary, that becomes
> impractical.
Ok... could we at least add just a magic number, then? If we ever do
need a new protocol we can change it, otherwise the protocol is immutable
for now.
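
A magic-only check along these lines would be a handful of lines on each
side. This is a sketch under assumptions: REPAIR_MAGIC and both function
names are hypothetical, and having the helper send the magic once, right
after connect(), is one possible arrangement, not something agreed on here.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <sys/socket.h>

/* Hypothetical value, for illustration only */
#define REPAIR_MAGIC	0x50525031	/* "PRP1" */

/* Helper side: announce ourselves right after connect() */
static int repair_hello_send(int s)
{
	uint32_t magic = htonl(REPAIR_MAGIC);

	if (send(s, &magic, sizeof(magic), 0) != (ssize_t)sizeof(magic))
		return -1;

	return 0;
}

/* passt side: don't pass any socket unless the peer sent the magic */
static int repair_hello_check(int s)
{
	uint32_t magic;

	if (recv(s, &magic, sizeof(magic), MSG_WAITALL) !=
	    (ssize_t)sizeof(magic))
		return -1;

	return ntohl(magic) == REPAIR_MAGIC ? 0 : -1;
}
```

Pointing the helper at an unrelated UNIX domain socket would then fail at the
first exchange, before any file descriptor is passed.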
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 3/7] tcp_conn: Avoid 7-bit hole in struct tcp_splice_conn
2025-01-30 0:44 ` David Gibson
@ 2025-01-30 4:55 ` Stefano Brivio
2025-01-30 7:27 ` David Gibson
0 siblings, 1 reply; 41+ messages in thread
From: Stefano Brivio @ 2025-01-30 4:55 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Laurent Vivier
On Thu, 30 Jan 2025 11:44:19 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Wed, Jan 29, 2025 at 08:33:40AM +0100, Stefano Brivio wrote:
> > On Wed, 29 Jan 2025 12:02:09 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > > On Tue, Jan 28, 2025 at 07:48:33AM +0100, Stefano Brivio wrote:
> > > > On Tue, 28 Jan 2025 11:53:09 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >
> > > > > On Tue, Jan 28, 2025 at 12:15:28AM +0100, Stefano Brivio wrote:
> > > > > > Moving in_epoll out of the common flow data created a 7-bit hole in
> > > > > > struct tcp_splice_conn: repack by shrinking @flags by one (otherwise
> > > > > > unused) bit.
> > > > >
> > > > > Is this actually necessary for the migration stuff? Or just a cleanup
> > > > > you spotted along the way?
> > > >
> > > > I thought it was helpful to keep the same size on 32-bit, but it looks
> > > > like it's not actually needed.
> > > >
> > > > Let me drop it from this series as it's just noise and I'm trying to
> > > > keep this slim. If we are all happy with it I can apply it. If not I'll
> > > > forget about it.
> > >
> > > Eh, I don't care that much either way.
> > >
> > > Note, btw, that bit-field packing is another way source and
> > > destination could potentially have mismatching data structures. IIUC
> > > bit field packing is described by the ABI and doesn't necessarily
> > > match the byte endianness.
> >
> > Right, that's actually the reason that brought me to this change: I was
> > comparing stuff between x86_64 and armv6l. On the other hand, this part
> > of the specific ABI is generally considered stable so I can rely on it.
>
> Uhh.. a specific ABI is stable, yes, but IIUC the whole point of these
> endian, word size etc. checks is that you're not counting on it being
> an identical ABI at each end.
Of course. I'm just saying that I can *rely on ABIs*. Not on them being
the same.
> I'm saying the bit field packing is another way the ABIs at each end
> could differ
It does.
> which is not currently accounted for.
That's because I have two hands, but obviously if I'm comparing ABIs...
--
Stefano
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure
2025-01-30 0:48 ` David Gibson
@ 2025-01-30 4:55 ` Stefano Brivio
2025-01-30 7:38 ` David Gibson
0 siblings, 1 reply; 41+ messages in thread
From: Stefano Brivio @ 2025-01-30 4:55 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Laurent Vivier
On Thu, 30 Jan 2025 11:48:19 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Wed, Jan 29, 2025 at 08:33:50AM +0100, Stefano Brivio wrote:
> > On Wed, 29 Jan 2025 12:16:58 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > > On Tue, Jan 28, 2025 at 07:50:01AM +0100, Stefano Brivio wrote:
> > > > On Tue, 28 Jan 2025 12:40:12 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >
> > > > > On Tue, Jan 28, 2025 at 12:15:31AM +0100, Stefano Brivio wrote:
> > > > > > Add two sets (source or target) of three functions each for passt in
> > > > > > vhost-user mode, triggered by activity on the file descriptor passed
> > > > > > via VHOST_USER_PROTOCOL_F_DEVICE_STATE:
> > > > > >
> > > > > > - migrate_source_pre() and migrate_target_pre() are called to prepare
> > > > > > for migration, before data is transferred
> > > > > >
> > > > > > - migrate_source() sends, and migrate_target() receives migration data
> > > > > >
> > > > > > - migrate_source_post() and migrate_target_post() are responsible for
> > > > > > any post-migration task
> > > > > >
> > > > > > Callbacks are added to these functions with arrays of function
> > > > > > pointers in migrate.c. Migration handlers are versioned.
> > > > > >
> > > > > > Versioned descriptions of data sections will be added to the
> > > > > > data_versions array, which points to versioned iovec arrays. Version
> > > > > > 1 is currently empty and will be filled in in subsequent patches.
> > > > > >
> > > > > > The source announces the data version to be used and informs the peer
> > > > > > about endianness, and the size of void *, time_t, flow entries and
> > > > > > flow hash table entries.
> > > > > >
> > > > > > The target checks if the version of the source is still supported. If
> > > > > > it's not, it aborts the migration.
> > > > > >
> > > > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > > > > ---
> > > > > > Makefile | 12 +--
> > > > > > migrate.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > migrate.h | 90 ++++++++++++++++++
> > > > > > passt.c | 2 +-
> > > > > > vu_common.c | 122 ++++++++++++++++---------
> > > > > > vu_common.h | 2 +-
> > > > > > 6 files changed, 438 insertions(+), 49 deletions(-)
> > > > > > create mode 100644 migrate.c
> > > > > > create mode 100644 migrate.h
> > > > > >
> > > > > > diff --git a/Makefile b/Makefile
> > > > > > index 464eef1..1383875 100644
> > > > > > --- a/Makefile
> > > > > > +++ b/Makefile
> > > > > > @@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
> > > > > >
> > > > > > PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> > > > > > icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
> > > > > > - ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
> > > > > > - tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > > > > > + ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
> > > > > > + tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > > > > > vhost_user.c virtio.c vu_common.c
> > > > > > QRAP_SRCS = qrap.c
> > > > > > SRCS = $(PASST_SRCS) $(QRAP_SRCS)
> > > > > > @@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1
> > > > > >
> > > > > > PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
> > > > > > flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
> > > > > > - lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
> > > > > > - siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
> > > > > > - tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
> > > > > > - virtio.h vu_common.h
> > > > > > + lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
> > > > > > + pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
> > > > > > + tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
> > > > > > + vhost_user.h virtio.h vu_common.h
> > > > > > HEADERS = $(PASST_HEADERS) seccomp.h
> > > > > >
> > > > > > C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
> > > > > > diff --git a/migrate.c b/migrate.c
> > > > > > new file mode 100644
> > > > > > index 0000000..bee9653
> > > > > > --- /dev/null
> > > > > > +++ b/migrate.c
> > > > > > @@ -0,0 +1,259 @@
> > > > > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > > > > +
> > > > > > +/* PASST - Plug A Simple Socket Transport
> > > > > > + * for qemu/UNIX domain socket mode
> > > > > > + *
> > > > > > + * PASTA - Pack A Subtle Tap Abstraction
> > > > > > + * for network namespace/tap device mode
> > > > > > + *
> > > > > > + * migrate.c - Migration sections, layout, and routines
> > > > > > + *
> > > > > > + * Copyright (c) 2025 Red Hat GmbH
> > > > > > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > > > > > + */
> > > > > > +
> > > > > > +#include <errno.h>
> > > > > > +#include <sys/uio.h>
> > > > > > +
> > > > > > +#include "util.h"
> > > > > > +#include "ip.h"
> > > > > > +#include "passt.h"
> > > > > > +#include "inany.h"
> > > > > > +#include "flow.h"
> > > > > > +#include "flow_table.h"
> > > > > > +
> > > > > > +#include "migrate.h"
> > > > > > +
> > > > > > +/* Current version of migration data */
> > > > > > +#define MIGRATE_VERSION 1
> > > > > > +
> > > > > > +/* Magic as we see it and as seen with reverse endianness */
> > > > > > +#define MIGRATE_MAGIC 0xB1BB1D1B0BB1D1B0
> > > > > > +#define MIGRATE_MAGIC_SWAPPED 0xB0D1B10B1B1DBBB1
> > > > >
> > > > > As noted, I'm hoping we can get rid of "either endian" migration. But
> > > > > if this stays, we should define it using __bswap_constant_32() to
> > > > > avoid embarrassing mistakes.
> > > >
> > > > Those always give me issues on musl,
> > >
> > > What sort of issues? We're already using them, and have fallback
> > > versions defined in util.h
> >
> > The very issues that brought me to introduce those fallback versions,
> > so I'm instinctively reluctant to use them.
> >
> > Actually, I think it's even clearer to have this spelt out (I always
> > need to stop for a moment and think: what happens when I cross the
> > 32-bit boundary?).
>
> Oh, yes, we'd need to add a __bswap_constant_64() for this.
...which doesn't exist on musl. On current Alpine Edge:
util.h:131:34: error: implicit declaration of function '__bswap_constant_64' [-Wimplicit-function-declaration]
131 | #define htonll_constant(x) (__bswap_constant_64(x))
| ^~~~~~~~~~~~~~~~~~~
...so rather than adding an implementation for this single usage, which
makes it actually less clear to me, I would keep it like it is.
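
If both constants stay spelt out, a compile-time check could still catch a
typo in the swapped one. A minimal sketch, assuming C11 static_assert is
available: BSWAP64_CONST() is a hypothetical local macro, not a libc one, so
it doesn't depend on __bswap_constant_64() existing.

```c
#include <assert.h>
#include <stdint.h>

#define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
#define MIGRATE_MAGIC_SWAPPED	0xB0D1B10B1B1DBBB1

/* Hypothetical local helper: byte swap of a 64-bit constant */
#define BSWAP64_CONST(x)						\
	((((uint64_t)(x) & 0x00000000000000ffULL) << 56) |		\
	 (((uint64_t)(x) & 0x000000000000ff00ULL) << 40) |		\
	 (((uint64_t)(x) & 0x0000000000ff0000ULL) << 24) |		\
	 (((uint64_t)(x) & 0x00000000ff000000ULL) <<  8) |		\
	 (((uint64_t)(x) & 0x000000ff00000000ULL) >>  8) |		\
	 (((uint64_t)(x) & 0x0000ff0000000000ULL) >> 24) |		\
	 (((uint64_t)(x) & 0x00ff000000000000ULL) >> 40) |		\
	 (((uint64_t)(x) & 0xff00000000000000ULL) >> 56))

/* Fails the build if the hand-written swapped magic goes out of sync */
static_assert(BSWAP64_CONST(MIGRATE_MAGIC) == MIGRATE_MAGIC_SWAPPED,
	      "MIGRATE_MAGIC_SWAPPED doesn't match MIGRATE_MAGIC");
```

The two values in the patch do match: reversing the bytes B1 BB 1D 1B 0B B1
D1 B0 gives B0 D1 B1 0B 1B 1D BB B1.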
> [snip]
> > > > > > +/**
> > > > > > + * union migrate_header - Migration header from source
> > > > > > + * @magic: 0xB1BB1D1B0BB1D1B0, host order
> > > > > > + * @version: Source sends highest known, target aborts if unsupported
> > > > > > + * @voidp_size: sizeof(void *), network order
> > > > > > + * @time_t_size: sizeof(time_t), network order
> > > > > > + * @flow_size: sizeof(union flow), network order
> > > > > > + * @flow_sidx_size: sizeof(struct flow_sidx_t), network order
> > > > > > + * @unused: Go figure
> > > > > > + */
> > > > > > +union migrate_header {
> > > > > > + struct {
> > > > > > + uint64_t magic;
> > > > > > + uint32_t version;
> > > > > > + uint32_t voidp_size;
> > > > > > + uint32_t time_t_size;
> > > > > > + uint32_t flow_size;
> > > > > > + uint32_t flow_sidx_size;
> > > > > > + };
> > > > > > + uint8_t unused[65536];
> > > > >
> > > > > So, having looked at this, I no longer think padding the header to 64kiB
> > > > > is a good idea. The structure means we're basically stuck always
> > > > > having that chunky header. Instead, I think the header should be
> > > > > absolutely minimal: basically magic and version only. v1 (and maybe
> > > > > others) can add a "metadata" or whatever section for additional
> > > > > information like this they need.
> > > >
> > > > The header is processed by the target in a separate, preliminary step,
> > > > though.
> > > >
> > > > That's why I added metadata right in the header: if the target needs to
> > > > abort the migration because, say, the size of a flow entry is too big
> > > > to handle for a particular version, then we should know that before
> > > > migrate_target_pre().
> > >
> > > Ah, yes, I missed that, we'd need a more complex design to do
> > > additional transfers and checks before making the target_pre
> > > callbacks.
> > >
> > > > As long as we check the version first, we can always shrink the header
> > > > later on.
> > >
> > > *thinks*.. I guess so, though it's kind of awkward; a future version
> > > would have to read the "header of the header", check the version, then
> > > if it's the old one, read the remainder of the 64kiB block.
> > >
> > > I still think we should clearly separate the part that we're
> > > committing to being in every future version (which I think should just
> > > be magic and version), from the stuff that's just v1.
> >
> > Sure, I can add a comment.
> >
> > > > But having 64 KiB reserved looks more robust because it's a
> > > > safe place to add this kind of metadata.
> > > >
> > > > Note that 64 KiB is typically transferred in a single read/write
> > > > from/to the vhost-user back-end.
> > >
> > > Ok, but it also has to go over the qemu migration channel, which will
> > > often be a physical link, not a super-fast local/virtual one, and may
> > > be bandwidth capped as well. I'm not actually certain if 64kiB is
> > > likely to be a problem there, but it *is* large compared to the state
> > > blobs of most qemu devices (usually only a few hundred bytes).
> >
> > Even if we transfer just what we need of a flow, it's still something
> > well in excess of 50 bytes each. 100k flows would be 5 megs.
>
> Just transferring the in-use flows would be higher priority than being
> selective about what we send within each flow.
Well, of course, I meant that we would only transfer used flows at that
point, because it's not about transferring the flow table as a whole,
with none of the advantages and disadvantages of it.
But still one can have 128k flows at the moment.
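
The figures in this exchange can be checked with a few lines of arithmetic.
This is only a sketch: the 128-byte entry size comes from the padding in
patch 2/7, "128k" is assumed to mean 131072 entries, 50 bytes is the rough
per-flow lower bound quoted above, and all names are made up for the
example.

```c
#include <stddef.h>

#define FLOW_ENTRY_PADDED	128		/* bytes per entry, per patch 2/7 */
#define FLOW_ENTRIES_MAX	(128 * 1024)	/* "128k flows", assumed 131072 */
#define FLOW_ENTRY_SELECTIVE	50		/* bytes, rough lower bound quoted */

/* Bytes on the wire for a number of flows of a given per-flow size */
size_t flows_wire_size(size_t flows, size_t per_flow)
{
	return flows * per_flow;
}
```

With these numbers, flows_wire_size(100000, FLOW_ENTRY_SELECTIVE) is
5000000 bytes (the "5 megs" above), and
flows_wire_size(FLOW_ENTRIES_MAX, FLOW_ENTRY_PADDED) is 16 MiB for the whole
padded table.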
> It's both easier to do
> and a bigger win in most cases. That would dramatically reduce the
> size sent here.
Yep, feel free.
> > Well, anyway, let's cut this down to 4k, which should be enough, so
> > that it's not a topic anymore.
>
> I still think it's ugly, but whatever.
Same here.
--
Stefano
* Re: [PATCH 7/7] Introduce passt-repair
2025-01-30 0:53 ` David Gibson
@ 2025-01-30 4:55 ` Stefano Brivio
2025-01-30 7:43 ` David Gibson
0 siblings, 1 reply; 41+ messages in thread
From: Stefano Brivio @ 2025-01-30 4:55 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Laurent Vivier
On Thu, 30 Jan 2025 11:53:08 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Wed, Jan 29, 2025 at 08:04:28AM +0100, Stefano Brivio wrote:
> > On Wed, 29 Jan 2025 12:29:27 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >
> > > On Tue, Jan 28, 2025 at 07:51:31AM +0100, Stefano Brivio wrote:
> > > > On Tue, 28 Jan 2025 12:51:59 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >
> > > > > On Tue, Jan 28, 2025 at 12:15:32AM +0100, Stefano Brivio wrote:
> > > > > > A privileged helper to set/clear TCP_REPAIR on sockets on behalf of
> > > > > > passt. Not used yet.
> > > > > >
> > > > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > > > > ---
> > > > > > Makefile | 10 +++--
> > > > > > passt-repair.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > 2 files changed, 118 insertions(+), 3 deletions(-)
> > > > > > create mode 100644 passt-repair.c
> > > > > >
> > > > > > diff --git a/Makefile b/Makefile
> > > > > > index 1383875..1b71cb0 100644
> > > > > > --- a/Makefile
> > > > > > +++ b/Makefile
> > > > > > @@ -42,7 +42,8 @@ PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> > > > > > tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > > > > > vhost_user.c virtio.c vu_common.c
> > > > > > QRAP_SRCS = qrap.c
> > > > > > -SRCS = $(PASST_SRCS) $(QRAP_SRCS)
> > > > > > +PASST_REPAIR_SRCS = passt-repair.c
> > > > > > +SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
> > > > > >
> > > > > > MANPAGES = passt.1 pasta.1 qrap.1
> > > > > >
> > > > > > @@ -72,9 +73,9 @@ mandir ?= $(datarootdir)/man
> > > > > > man1dir ?= $(mandir)/man1
> > > > > >
> > > > > > ifeq ($(TARGET_ARCH),x86_64)
> > > > > > -BIN := passt passt.avx2 pasta pasta.avx2 qrap
> > > > > > +BIN := passt passt.avx2 pasta pasta.avx2 qrap passt-repair
> > > > > > else
> > > > > > -BIN := passt pasta qrap
> > > > > > +BIN := passt pasta qrap passt-repair
> > > > > > endif
> > > > > >
> > > > > > all: $(BIN) $(MANPAGES) docs
> > > > > > @@ -101,6 +102,9 @@ pasta.avx2 pasta.1 pasta: pasta%: passt%
> > > > > > qrap: $(QRAP_SRCS) passt.h
> > > > > > $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS)
> > > > > >
> > > > > > +passt-repair: $(PASST_REPAIR_SRCS)
> > > > > > + $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS)
> > > > > > +
> > > > > > valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \
> > > > > > rt_sigreturn getpid gettid kill clock_gettime mmap \
> > > > > > mmap2 munmap open unlink gettimeofday futex statx \
> > > > > > diff --git a/passt-repair.c b/passt-repair.c
> > > > > > new file mode 100644
> > > > > > index 0000000..e9b9609
> > > > > > --- /dev/null
> > > > > > +++ b/passt-repair.c
> > > > > > @@ -0,0 +1,111 @@
> > > > > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > > > > +
> > > > > > +/* PASST - Plug A Simple Socket Transport
> > > > > > + * for qemu/UNIX domain socket mode
> > > > > > + *
> > > > > > + * passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets
> > > > > > + *
> > > > > > + * Copyright (c) 2025 Red Hat GmbH
> > > > > > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > > > > > + *
> > > > > > + * Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along
> > > > > > + * with commands mapping to TCP_REPAIR values, and switch repair mode on or
> > > > > > + * off. Reply by echoing the command. Exit if the command is INT_MAX.
> > > > > > + */
> > > > > > +
> > > > > > +#include <sys/types.h>
> > > > > > +#include <sys/socket.h>
> > > > > > +#include <sys/un.h>
> > > > > > +#include <errno.h>
> > > > > > +#include <stdio.h>
> > > > > > +#include <stdlib.h>
> > > > > > +#include <string.h>
> > > > > > +#include <limits.h>
> > > > > > +#include <unistd.h>
> > > > > > +#include <netdb.h>
> > > > > > +
> > > > > > +#include <netinet/tcp.h>
> > > > > > +
> > > > > > +#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
> > > > > > +
> > > > > > +int main(int argc, char **argv)
> > > > > > +{
> > > > > > + char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
> > > > > > + __attribute__ ((aligned(__alignof__(struct cmsghdr))));
> > > > > > + struct sockaddr_un a = { AF_UNIX, "" };
> > > > > > + int cmd, fds[SCM_MAX_FD], s, ret, i;
> > > > > > + struct cmsghdr *cmsg;
> > > > > > + struct msghdr msg;
> > > > > > + struct iovec iov;
> > > > > > +
> > > > > > + iov = (struct iovec){ &cmd, sizeof(cmd) };
> > > > >
> > > > > I mean, local to local, it's *probably* fine, but still a network
> > > > > protocol not defined in terms of explicit width fields makes me
> > > > > nervous. I'd prefer to see the cmd being a packed structure with
> > > > > fixed width elements.
> > > >
> > > > It actually is, because:
> > > >
> > > > struct {
> > > > int cmd;
> > > > };
> > > >
> > > > is a packed structure with fixed width elements. Any architecture we
> > > > build for (at least the ones I'm aware of) has a 32-bit int. We can
> > > > make it uint32_t if it makes you feel better.
> > >
> > > Sorry, I should have said "*explicitly* fixed width fields". So, yes,
> > > uint32_t would make me feel better :)
> >
> > Changed to int8_t anyway meanwhile. We don't need all those bits.
>
> Works for me.
>
> > > > > I also think we should do some sort of basic magic / version exchange.
> > > > > I don't see any reason we'd need to extend the protocol, but I'd
> > > > > rather have the option if we have to.
> > > >
> > > > passt-repair will be packaged and distributed together with passt,
> > > > though. Versions must match.
> > >
> > > But nothing enforces that.
> >
> > Distribution packages. If I run claws-mail with the wrong version of,
> > say, libpixman, it won't start. If you don't use them, you're on your
> > own.
>
> But shared libraries *do* have versioning checks: there are defined
> compatibility semantics for sonames, and there can be symbol versions
> as well.
>
> > > AIUI KubeVirt will be running passt-repair
> > > in a different context. Which means it may well be deployed by a
> > > different path than the passt binary
> >
> > No, that's not the way it works. It needs to match, in the sense that
> > 1. it's a KubeVirt requirement to have compatible packages between
> > distribution and the "base container image" and 2. this would most
> > likely be sourced from the "base container image" anyway.
> >
> > I maintain the packages for four distributions, plus AppArmor and
> > SELinux policies upstream and downstream, and I take care of updating
> > the package in KubeVirt as well, so I guess I have a vague idea of
> > what's convenient, enforced, burdensome, and so on.
> >
> > > which means however we
> > > distribute it's quite plausible that a downstream screwup could
> > > mismatch the versions. We should endeavour to have a reasonably
> > > graceful failure mode for that.
> >
> > Regardless of this, I think that *this one* is an interface (I wouldn't
> > even call it a protocol) that needs to be set in stone, except for
> > hypothetical (and highly unlikely) UAPI additions which we'll be anyway
> > able to accommodate for easily.
>
> Ok, I can buy that, but it's a contradictory position to "versions
> must match".
Note: "Regardless of this". It's *another* consideration *on top of
that*.
1. Versions (builds) match.
2. And even if they didn't, it wouldn't be a problem, because this
interface will not change.
> > It's a single socket option with three possible values (for 13 years
> > now), of which we plan to use two. If we want this interface to do
> > anything else, it should be another interface.
> >
> > So there's really no problem with this.
> >
> > Besides, the helper runs with CAP_NET_ADMIN (even though CAP_NET_RAW
> > should ideally suffice), so it needs to be extremely simple and
> > auditable.
>
> Sending and checking a magic number is not a lot of complexity, even
> in something on this scale.
If you want to have multiple bytes (because I'm forecasting that you
won't be happy with 255 values), it's substantial complexity in
comparison to the current implementation.
> > > > And latency here might matter more than in
> > > > the rest of the migration process.
> > >
> > > > > Plus checking a magic number
> > > > > should make things less damaging and more debuggable if you were to
> > > > > point the repair helper at an entirely unrelated unix socket instead
> > > > > of passt's repair socket.
> > > >
> > > > Maybe, yes, even though I don't really see good chances for that
> > > > mistake to happen. Feel free to post a proposal, of course.
> > >
> > > I disagree on the good chances for a mistake thing. In GSS I saw
> > > plenty of occasions where things that shouldn't be mismatched were due
> > > to some packaging or user screwup. And that's before even considering
> > > the way that KubeVirt deploys its various pieces seems to provide a
> > > number of opportunities to mess this up.
> > >
> > > So, I'll see what I can come up with. I'm fine with requiring
> > > matching versions if it's actually checked. Maybe a magic derived
> > > from our git hash, or even our build-id.
> >
> > Both would make things significantly less usable, because they would
> > make different but compatible builds incompatible, and different
> > implementations rather inconvenient.
>
> Ok, so you're definitely now saying versions *don't* need to match.
They don't need to, no. They will match, but they don't need to.
> > For example, it might be a practical solution to have a Go
> > implementation of this in KubeVirt's virt-handler itself, but if it
> > needs to extract information or strings from the binary, that becomes
> > impractical.
>
> Ok... could we at least add just a magic number then? If we do ever
> need a new protocol we can change it, otherwise the protocol is
> immutable for now.
Adding a non-byte magic number implies handling short reads and short
writes, which I think is absolutely unnecessary. Feel free to propose
an implementation, as usual.
If you are happy with a single byte magic number, then I suppose that,
given that we're just using three values, we could encode that
information using a combination of the remaining bits, which has the
advantage of not needing any specific implementation until it's
actually needed (never, I suppose), because passt-repair already
terminates on an unknown command value.
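
One possible reading of the single-byte idea above, as a sketch only: the
upper six bits carry a fixed pattern and the lower two carry the command.
The bit layout, the 0xA4 pattern and all names here are hypothetical, and
nothing like this is in the posted passt-repair.c. The three repair values
are the ones accepted by the TCP_REPAIR socket option on Linux.

```c
#include <stdint.h>

/* Values accepted by the TCP_REPAIR socket option (Linux UAPI) */
#define REPAIR_OFF		0
#define REPAIR_ON		1
#define REPAIR_OFF_NO_WP	(-1)

/* Hypothetical wire layout: upper six bits fixed, lower two carry the code */
#define REPAIR_MAGIC		0xA4
#define REPAIR_MAGIC_MASK	0xFC
#define REPAIR_CODE_MASK	0x03

/* Encode one of the three values above: -1, 0, 1 map to codes 2, 0, 1 */
uint8_t repair_cmd_encode(int val)
{
	uint8_t code = (val == REPAIR_OFF_NO_WP) ? 2 : (val == REPAIR_ON);

	return REPAIR_MAGIC | code;
}

/* Return 0 and set *val on a valid byte, -1 on bad magic or unused code 3 */
int repair_cmd_decode(uint8_t byte, int *val)
{
	uint8_t code = byte & REPAIR_CODE_MASK;

	if ((byte & REPAIR_MAGIC_MASK) != REPAIR_MAGIC || code == 3)
		return -1;

	*val = (code == 2) ? REPAIR_OFF_NO_WP : code;
	return 0;
}
```

A helper that rejects any byte failing repair_cmd_decode() keeps the
"terminate on unknown command" behaviour, and a single byte still can't be
split by a short read or write.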
--
Stefano
* Re: [PATCH 3/7] tcp_conn: Avoid 7-bit hole in struct tcp_splice_conn
2025-01-30 4:55 ` Stefano Brivio
@ 2025-01-30 7:27 ` David Gibson
0 siblings, 0 replies; 41+ messages in thread
From: David Gibson @ 2025-01-30 7:27 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier
On Thu, Jan 30, 2025 at 05:55:00AM +0100, Stefano Brivio wrote:
> On Thu, 30 Jan 2025 11:44:19 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Wed, Jan 29, 2025 at 08:33:40AM +0100, Stefano Brivio wrote:
> > > On Wed, 29 Jan 2025 12:02:09 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >
> > > > On Tue, Jan 28, 2025 at 07:48:33AM +0100, Stefano Brivio wrote:
> > > > > On Tue, 28 Jan 2025 11:53:09 +1100
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >
> > > > > > On Tue, Jan 28, 2025 at 12:15:28AM +0100, Stefano Brivio wrote:
> > > > > > > Moving in_epoll out of the common flow data created a 7-bit hole in
> > > > > > > struct tcp_splice_conn: repack by shrinking @flags by one (otherwise
> > > > > > > unused) bit.
> > > > > >
> > > > > > Is this actually necessary for the migration stuff? Or just a cleanup
> > > > > > you spotted along the way?
> > > > >
> > > > > I thought it was helpful to keep the same size on 32-bit, but it looks
> > > > > like it's not actually needed.
> > > > >
> > > > > Let me drop it from this series as it's just noise and I'm trying to
> > > > > keep this slim. If we are all happy with it I can apply it. If not I'll
> > > > > forget about it.
> > > >
> > > > Eh, I don't care that much either way.
> > > >
> > > > Note, btw, that bit-field packing is another way source and
> > > > destination could potentially have mismatching data structures. IIUC
> > > > bit field packing is described by the ABI and doesn't necessarily
> > > > match the byte endianness.
> > >
> > > Right, that's actually the reason that brought me to this change: I was
> > > comparing stuff between x86_64 and armv6l. On the other hand, this part
> > > of the specific ABI is generally considered stable so I can rely on it.
> >
> > Uhh.. a specific ABI is stable, yes, but IIUC the whole point of these
> > endian, word size etc. checks is that you're not counting on it being
> > an identical ABI at each end.
>
> Of course. I'm just saying that I can *rely on ABIs*. Not on them being
> the same.
>
> > I'm saying the bit field packing is another way the ABIs at each end
> > could differ
>
> It does.
>
> > which is not currently accounted for.
>
> That's because I have two hands, but obviously if I'm comparing ABIs...
Ok, fair enough.
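
To illustrate the bit-field point, here is a minimal sketch of serialising
such fields with explicit shifts, so that the wire layout is fixed by the
code rather than by either ABI. The structure and function names are
hypothetical and only loosely modelled on the @flags / in_epoll case
discussed above.

```c
#include <stdint.h>

/* Hypothetical structure: how the compiler places these bits in memory is
 * defined by each ABI, so the raw bytes are not a portable wire format.
 */
struct example_conn {
	unsigned int in_epoll	:1;
	unsigned int flags	:7;
};

/* Wire format fixed by this code: bit 0 is in_epoll, bits 1-7 are flags */
uint8_t example_conn_pack(const struct example_conn *c)
{
	return (uint8_t)(c->in_epoll | (c->flags << 1));
}

/* Inverse of example_conn_pack(), independent of the local bit-field ABI */
void example_conn_unpack(struct example_conn *c, uint8_t wire)
{
	c->in_epoll = wire & 1;
	c->flags = wire >> 1;
}
```

The cost is one pack/unpack pair per structure that contains bit-fields, in
exchange for not having to compare bit-field layouts between source and
target.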
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
* Re: [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure
2025-01-30 4:55 ` Stefano Brivio
@ 2025-01-30 7:38 ` David Gibson
2025-01-30 8:32 ` Stefano Brivio
0 siblings, 1 reply; 41+ messages in thread
From: David Gibson @ 2025-01-30 7:38 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier
On Thu, Jan 30, 2025 at 05:55:22AM +0100, Stefano Brivio wrote:
> On Thu, 30 Jan 2025 11:48:19 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Wed, Jan 29, 2025 at 08:33:50AM +0100, Stefano Brivio wrote:
> > > On Wed, 29 Jan 2025 12:16:58 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >
> > > > On Tue, Jan 28, 2025 at 07:50:01AM +0100, Stefano Brivio wrote:
> > > > > On Tue, 28 Jan 2025 12:40:12 +1100
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >
> > > > > > On Tue, Jan 28, 2025 at 12:15:31AM +0100, Stefano Brivio wrote:
> > > > > > > Add two sets (source or target) of three functions each for passt in
> > > > > > > vhost-user mode, triggered by activity on the file descriptor passed
> > > > > > > via VHOST_USER_PROTOCOL_F_DEVICE_STATE:
> > > > > > >
> > > > > > > - migrate_source_pre() and migrate_target_pre() are called to prepare
> > > > > > > for migration, before data is transferred
> > > > > > >
> > > > > > > - migrate_source() sends, and migrate_target() receives migration data
> > > > > > >
> > > > > > > - migrate_source_post() and migrate_target_post() are responsible for
> > > > > > > any post-migration task
> > > > > > >
> > > > > > > Callbacks are added to these functions with arrays of function
> > > > > > > pointers in migrate.c. Migration handlers are versioned.
> > > > > > >
> > > > > > > Versioned descriptions of data sections will be added to the
> > > > > > > data_versions array, which points to versioned iovec arrays. Version
> > > > > > > 1 is currently empty and will be filled in in subsequent patches.
> > > > > > >
> > > > > > > The source announces the data version to be used and informs the peer
> > > > > > > about endianness, and the size of void *, time_t, flow entries and
> > > > > > > flow hash table entries.
> > > > > > >
> > > > > > > The target checks if the version of the source is still supported. If
> > > > > > > it's not, it aborts the migration.
> > > > > > >
> > > > > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > > > > > ---
> > > > > > > Makefile | 12 +--
> > > > > > > migrate.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > migrate.h | 90 ++++++++++++++++++
> > > > > > > passt.c | 2 +-
> > > > > > > vu_common.c | 122 ++++++++++++++++---------
> > > > > > > vu_common.h | 2 +-
> > > > > > > 6 files changed, 438 insertions(+), 49 deletions(-)
> > > > > > > create mode 100644 migrate.c
> > > > > > > create mode 100644 migrate.h
> > > > > > >
> > > > > > > diff --git a/Makefile b/Makefile
> > > > > > > index 464eef1..1383875 100644
> > > > > > > --- a/Makefile
> > > > > > > +++ b/Makefile
> > > > > > > @@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
> > > > > > >
> > > > > > > PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> > > > > > > icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
> > > > > > > - ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
> > > > > > > - tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > > > > > > + ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
> > > > > > > + tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > > > > > > vhost_user.c virtio.c vu_common.c
> > > > > > > QRAP_SRCS = qrap.c
> > > > > > > SRCS = $(PASST_SRCS) $(QRAP_SRCS)
> > > > > > > @@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1
> > > > > > >
> > > > > > > PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
> > > > > > > flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
> > > > > > > - lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
> > > > > > > - siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
> > > > > > > - tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
> > > > > > > - virtio.h vu_common.h
> > > > > > > + lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
> > > > > > > + pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
> > > > > > > + tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
> > > > > > > + vhost_user.h virtio.h vu_common.h
> > > > > > > HEADERS = $(PASST_HEADERS) seccomp.h
> > > > > > >
> > > > > > > C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
> > > > > > > diff --git a/migrate.c b/migrate.c
> > > > > > > new file mode 100644
> > > > > > > index 0000000..bee9653
> > > > > > > --- /dev/null
> > > > > > > +++ b/migrate.c
> > > > > > > @@ -0,0 +1,259 @@
> > > > > > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > > > > > +
> > > > > > > +/* PASST - Plug A Simple Socket Transport
> > > > > > > + * for qemu/UNIX domain socket mode
> > > > > > > + *
> > > > > > > + * PASTA - Pack A Subtle Tap Abstraction
> > > > > > > + * for network namespace/tap device mode
> > > > > > > + *
> > > > > > > + * migrate.c - Migration sections, layout, and routines
> > > > > > > + *
> > > > > > > + * Copyright (c) 2025 Red Hat GmbH
> > > > > > > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > > > > > > + */
> > > > > > > +
> > > > > > > +#include <errno.h>
> > > > > > > +#include <sys/uio.h>
> > > > > > > +
> > > > > > > +#include "util.h"
> > > > > > > +#include "ip.h"
> > > > > > > +#include "passt.h"
> > > > > > > +#include "inany.h"
> > > > > > > +#include "flow.h"
> > > > > > > +#include "flow_table.h"
> > > > > > > +
> > > > > > > +#include "migrate.h"
> > > > > > > +
> > > > > > > +/* Current version of migration data */
> > > > > > > +#define MIGRATE_VERSION 1
> > > > > > > +
> > > > > > > +/* Magic as we see it and as seen with reverse endianness */
> > > > > > > +#define MIGRATE_MAGIC 0xB1BB1D1B0BB1D1B0
> > > > > > > +#define MIGRATE_MAGIC_SWAPPED 0xB0D1B10B1B1DBBB1
> > > > > >
> > > > > > As noted, I'm hoping we can get rid of "either endian" migration. But
> > > > > > if this stays, we should define it using __bswap_constant_32() to
> > > > > > avoid embarrassing mistakes.
> > > > >
> > > > > Those always give me issues on musl,
> > > >
> > > > What sort of issues? We're already using them, and have fallback
> > > > versions defined in util.h
> > >
> > > The very issues that brought me to introduce those fallback versions,
> > > so I'm instinctively reluctant to use them.
> > >
> > > Actually, I think it's even clearer to have this spelt out (I always
> > > need to stop for a moment and think: what happens when I cross the
> > > 32-bit boundary?).
> >
> > Oh, yes, we'd need to add a __bswap_constant_64() for this.
>
> ...which doesn't exist on musl. On current Alpine Edge:
>
> util.h:131:34: error: implicit declaration of function '__bswap_constant_64' [-Wimplicit-function-declaration]
> 131 | #define htonll_constant(x) (__bswap_constant_64(x))
> | ^~~~~~~~~~~~~~~~~~~
>
> ...so rather than adding an implementation for this single usage, which
> makes it actually less clear to me, I would keep it like it is.
Very well.
> > [snip]
> > > > > > > +/**
> > > > > > > + * union migrate_header - Migration header from source
> > > > > > > + * @magic: 0xB1BB1D1B0BB1D1B0, host order
> > > > > > > + * @version: Source sends highest known, target aborts if unsupported
> > > > > > > + * @voidp_size: sizeof(void *), network order
> > > > > > > + * @time_t_size: sizeof(time_t), network order
> > > > > > > + * @flow_size: sizeof(union flow), network order
> > > > > > > + * @flow_sidx_size: sizeof(struct flow_sidx_t), network order
> > > > > > > + * @unused: Go figure
> > > > > > > + */
> > > > > > > +union migrate_header {
> > > > > > > + struct {
> > > > > > > + uint64_t magic;
> > > > > > > + uint32_t version;
> > > > > > > + uint32_t voidp_size;
> > > > > > > + uint32_t time_t_size;
> > > > > > > + uint32_t flow_size;
> > > > > > > + uint32_t flow_sidx_size;
> > > > > > > + };
> > > > > > > + uint8_t unused[65536];
> > > > > >
> > > > > > So, having looked at this, I no longer think padding the header to 64kiB
> > > > > > is a good idea. The structure means we're basically stuck always
> > > > > > having that chunky header. Instead, I think the header should be
> > > > > > absolutely minimal: basically magic and version only. v1 (and maybe
> > > > > > others) can add a "metadata" or whatever section for additional
> > > > > > information like this they need.
> > > > >
> > > > > The header is processed by the target in a separate, preliminary step,
> > > > > though.
> > > > >
> > > > > That's why I added metadata right in the header: if the target needs to
> > > > > abort the migration because, say, the size of a flow entry is too big
> > > > > to handle for a particular version, then we should know that before
> > > > > migrate_target_pre().
> > > >
> > > > Ah, yes, I missed that, we'd need a more complex design to do
> > > > additional transfers and checks before making the target_pre
> > > > callbacks.
> > > >
> > > > > As long as we check the version first, we can always shrink the header
> > > > > later on.
> > > >
> > > > *thinks*.. I guess so, though it's kind of awkward; a future version
> > > > would have to read the "header of the header", check the version, then
> > > > if it's the old one, read the remainder of the 64kiB block.
> > > >
> > > > I still think we should clearly separate the part that we're
> > > > committing to being in every future version (which I think should just
> > > > be magic and version), from the stuff that's just v1.
> > >
> > > Sure, I can add a comment.
> > >
> > > > > But having 64 KiB reserved looks more robust because it's a
> > > > > safe place to add this kind of metadata.
> > > > >
> > > > > Note that 64 KiB is typically transferred in a single read/write
> > > > > from/to the vhost-user back-end.
> > > >
> > > > Ok, but it also has to go over the qemu migration channel, which will
> > > > often be a physical link, not a super-fast local/virtual one, and may
> > > > be bandwidth capped as well. I'm not actually certain if 64kiB is
> > > > likely to be a problem there, but it *is* large compared to the state
> > > > blobs of most qemu devices (usually only a few hundred bytes).
> > >
> > > Even if we transfer just what we need of a flow, it's still something
> > > well in excess of 50 bytes each. 100k flows would be 5 megs.
> >
> > Just transferring the in-use flows would be higher priority than being
> > selective about what we send within each flow.
>
> Well, of course, I meant that we would only transfer used flows at that
> point, because it's not about transferring the flow table as a whole,
> with none of the advantages and disadvantages of it.
>
> But still one can have 128k flows at the moment.
Right, but in the present draft you pay that cost whether or not
you're actually using the flows. Unfortunately a busy server with
heaps of active connections is exactly the case that's likely to be
most sensitive to additional downtime, but there's not really any
getting around that. A machine with a lot of state will need either
high downtime or high migration bandwidth.
But, I'm really hoping we can move relatively quickly to a model where
a guest with only a handful of connections _doesn't_ have to pay that
128k flow cost - and can consequently migrate ok even with quite
constrained migration bandwidth. In that scenario the size of the
header could become significant.
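
For comparison with the 4 KiB or 64 KiB blocks discussed above, a header
limited to magic and version takes 12 bytes on the wire. The sketch below
shows how a target might check it. It assumes the version is sent in network
order (the draft's comment doesn't state its byte order), and
migrate_header_check() is a hypothetical name, not a function from
migrate.c.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MIGRATE_VERSION		1
#define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0ULL
#define MIGRATE_MAGIC_SWAPPED	0xB0D1B10B1B1DBBB1ULL

/* Hypothetical check of a 12-byte header: 8 bytes of magic in the source's
 * host order, then 4 bytes of version, assumed here to be in network order.
 *
 * Return: version on success with *swapped set to 1 if the source has the
 * opposite endianness, -1 if the header is short, unknown or unsupported.
 */
int migrate_header_check(const unsigned char *buf, size_t len, int *swapped)
{
	uint64_t magic;
	uint32_t version;

	if (len < sizeof(magic) + sizeof(version))
		return -1;

	memcpy(&magic, buf, sizeof(magic));
	memcpy(&version, buf + sizeof(magic), sizeof(version));

	if (magic == MIGRATE_MAGIC)
		*swapped = 0;
	else if (magic == MIGRATE_MAGIC_SWAPPED)
		*swapped = 1;
	else
		return -1;

	version = ntohl(version);
	if (version < 1 || version > MIGRATE_VERSION)
		return -1;

	return (int)version;
}
```

Anything version-specific, such as the size fields in the draft, would then
follow as data that only a matching version needs to parse.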
> > It's both easier to do
> > and a bigger win in most cases. That would dramatically reduce the
> > size sent here.
>
> Yep, feel free.
It's on my queue for the next few days.
> > > Well, anyway, let's cut this down to 4k, which should be enough, so
> > > that it's not a topic anymore.
> >
> > I still think it's ugly, but whatever.
>
> Same here.
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
* Re: [PATCH 7/7] Introduce passt-repair
2025-01-30 4:55 ` Stefano Brivio
@ 2025-01-30 7:43 ` David Gibson
2025-01-30 7:56 ` Stefano Brivio
0 siblings, 1 reply; 41+ messages in thread
From: David Gibson @ 2025-01-30 7:43 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier
On Thu, Jan 30, 2025 at 05:55:43AM +0100, Stefano Brivio wrote:
> On Thu, 30 Jan 2025 11:53:08 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Wed, Jan 29, 2025 at 08:04:28AM +0100, Stefano Brivio wrote:
> > > On Wed, 29 Jan 2025 12:29:27 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >
> > > > On Tue, Jan 28, 2025 at 07:51:31AM +0100, Stefano Brivio wrote:
> > > > > On Tue, 28 Jan 2025 12:51:59 +1100
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >
> > > > > > On Tue, Jan 28, 2025 at 12:15:32AM +0100, Stefano Brivio wrote:
> > > > > > > A privileged helper to set/clear TCP_REPAIR on sockets on behalf of
> > > > > > > passt. Not used yet.
> > > > > > >
> > > > > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > > > > > ---
> > > > > > > Makefile | 10 +++--
> > > > > > > passt-repair.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > 2 files changed, 118 insertions(+), 3 deletions(-)
> > > > > > > create mode 100644 passt-repair.c
> > > > > > >
> > > > > > > diff --git a/Makefile b/Makefile
> > > > > > > index 1383875..1b71cb0 100644
> > > > > > > --- a/Makefile
> > > > > > > +++ b/Makefile
> > > > > > > @@ -42,7 +42,8 @@ PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> > > > > > > tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > > > > > > vhost_user.c virtio.c vu_common.c
> > > > > > > QRAP_SRCS = qrap.c
> > > > > > > -SRCS = $(PASST_SRCS) $(QRAP_SRCS)
> > > > > > > +PASST_REPAIR_SRCS = passt-repair.c
> > > > > > > +SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)
> > > > > > >
> > > > > > > MANPAGES = passt.1 pasta.1 qrap.1
> > > > > > >
> > > > > > > @@ -72,9 +73,9 @@ mandir ?= $(datarootdir)/man
> > > > > > > man1dir ?= $(mandir)/man1
> > > > > > >
> > > > > > > ifeq ($(TARGET_ARCH),x86_64)
> > > > > > > -BIN := passt passt.avx2 pasta pasta.avx2 qrap
> > > > > > > +BIN := passt passt.avx2 pasta pasta.avx2 qrap passt-repair
> > > > > > > else
> > > > > > > -BIN := passt pasta qrap
> > > > > > > +BIN := passt pasta qrap passt-repair
> > > > > > > endif
> > > > > > >
> > > > > > > all: $(BIN) $(MANPAGES) docs
> > > > > > > @@ -101,6 +102,9 @@ pasta.avx2 pasta.1 pasta: pasta%: passt%
> > > > > > > qrap: $(QRAP_SRCS) passt.h
> > > > > > > $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) -DARCH=\"$(TARGET_ARCH)\" $(QRAP_SRCS) -o qrap $(LDFLAGS)
> > > > > > >
> > > > > > > +passt-repair: $(PASST_REPAIR_SRCS)
> > > > > > > + $(CC) $(FLAGS) $(CFLAGS) $(CPPFLAGS) $(PASST_REPAIR_SRCS) -o passt-repair $(LDFLAGS)
> > > > > > > +
> > > > > > > valgrind: EXTRA_SYSCALLS += rt_sigprocmask rt_sigtimedwait rt_sigaction \
> > > > > > > rt_sigreturn getpid gettid kill clock_gettime mmap \
> > > > > > > mmap2 munmap open unlink gettimeofday futex statx \
> > > > > > > diff --git a/passt-repair.c b/passt-repair.c
> > > > > > > new file mode 100644
> > > > > > > index 0000000..e9b9609
> > > > > > > --- /dev/null
> > > > > > > +++ b/passt-repair.c
> > > > > > > @@ -0,0 +1,111 @@
> > > > > > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > > > > > +
> > > > > > > +/* PASST - Plug A Simple Socket Transport
> > > > > > > + * for qemu/UNIX domain socket mode
> > > > > > > + *
> > > > > > > + * passt-repair.c - Privileged helper to set/clear TCP_REPAIR on sockets
> > > > > > > + *
> > > > > > > + * Copyright (c) 2025 Red Hat GmbH
> > > > > > > + * Author: Stefano Brivio <sbrivio@redhat.com>
> > > > > > > + *
> > > > > > > + * Connect to passt via UNIX domain socket, receive sockets via SCM_RIGHTS along
> > > > > > > + * with commands mapping to TCP_REPAIR values, and switch repair mode on or
> > > > > > > + * off. Reply by echoing the command. Exit if the command is INT_MAX.
> > > > > > > + */
> > > > > > > +
> > > > > > > +#include <sys/types.h>
> > > > > > > +#include <sys/socket.h>
> > > > > > > +#include <sys/un.h>
> > > > > > > +#include <errno.h>
> > > > > > > +#include <stdio.h>
> > > > > > > +#include <stdlib.h>
> > > > > > > +#include <string.h>
> > > > > > > +#include <limits.h>
> > > > > > > +#include <unistd.h>
> > > > > > > +#include <netdb.h>
> > > > > > > +
> > > > > > > +#include <netinet/tcp.h>
> > > > > > > +
> > > > > > > +#define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not in UAPI */
> > > > > > > +
> > > > > > > +int main(int argc, char **argv)
> > > > > > > +{
> > > > > > > + char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
> > > > > > > + __attribute__ ((aligned(__alignof__(struct cmsghdr))));
> > > > > > > + struct sockaddr_un a = { AF_UNIX, "" };
> > > > > > > + int cmd, fds[SCM_MAX_FD], s, ret, i;
> > > > > > > + struct cmsghdr *cmsg;
> > > > > > > + struct msghdr msg;
> > > > > > > + struct iovec iov;
> > > > > > > +
> > > > > > > + iov = (struct iovec){ &cmd, sizeof(cmd) };
> > > > > >
> > > > > > I mean, local to local, it's *probably* fine, but still a network
> > > > > > protocol not defined in terms of explicit width fields makes me
> > > > > > nervous. I'd prefer to see the cmd being a packed structure with
> > > > > > fixed width elements.
> > > > >
> > > > > It actually is, because:
> > > > >
> > > > > struct {
> > > > > int cmd;
> > > > > };
> > > > >
> > > > > is a packed structure with fixed width elements. Any architecture we
> > > > > build for (at least the ones I'm aware of) has a 32-bit int. We can
> > > > > make it uint32_t if it makes you feel better.
> > > >
> > > > Sorry, I should have said "*explicitly* fixed width fields". So, yes,
> > > > uint32_t would make me feel better :)
> > >
> > > Changed to int8_t anyway meanwhile. We don't need all those bits.
> >
> > Works for me.
> >
> > > > > > I also think we should do some sort of basic magic / version exchange.
> > > > > > I don't see any reason we'd need to extend the protocol, but I'd
> > > > > > rather have the option if we have to.
> > > > >
> > > > > passt-repair will be packaged and distributed together with passt,
> > > > > though. Versions must match.
> > > >
> > > > But nothing enforces that.
> > >
> > > Distribution packages. If I run claws-mail with the wrong version of,
> > > say, libpixman, it won't start. If you don't use them, you're on your
> > > own.
> >
> > But shared libraries *do* have versioning checks: there are defined
> > compatibility semantics for sonames, and there can be symbol versions
> > as well.
> >
> > > > AIUI KubeVirt will be running passt-repair
> > > > in a different context. Which means it may well be deployed by a
> > > > different path than the passt binary
> > >
> > > No, that's not the way it works. It needs to match, in the sense that
> > > 1. it's a KubeVirt requirement to have compatible packages between
> > > distribution and the "base container image" and 2. this would most
> > > likely be sourced from the "base container image" anyway.
> > >
> > > I maintain the packages for four distributions, plus AppArmor and
> > > SELinux policies upstream and downstream, and I take care of updating
> > > the package in KubeVirt as well, so I guess I have a vague idea of
> > > what's convenient, enforced, burdensome, and so on.
> > >
> > > > which means however we
> > > > distribute it's quite plausible that a downstream screwup could
> > > > mismatch the versions. We should endeavour to have a reasonably
> > > > graceful failure mode for that.
> > >
> > > Regardless of this, I think that *this one* is an interface (I wouldn't
> > > even call it a protocol) that needs to be set in stone, except for
> > > hypothetical (and highly unlikely) UAPI additions which we'll be anyway
> > > able to accommodate for easily.
> >
> > Ok, I can buy that, but it's a contradictory position to "versions
> > must match".
>
> Note: "Regardless of this". It's *another* consideration *on top of
> that*.
>
> 1. Versions (builds) match.
>
> 2. And even if they didn't, it wouldn't be a problem, because this
> interface will not change.
Hm, ok. I'm way less convinced on both of those points. Which means
I'd like to have a clear policy on whether we require versions to
match or not. Which we prioritise affects design choices.
> > > It's a single socket option with three possible values (for 13 years
> > > now), of which we plan to use two. If we want this interface to do
> > > anything else, it should be another interface.
> > >
> > > So there's really no problem with this.
> > >
> > > Besides, the helper runs with CAP_NET_ADMIN (even though CAP_NET_RAW
> > > should ideally suffice), so it needs to be extremely simple and
> > > auditable.
> >
> > Sending and checking a magic number is not a lot of complexity, even
> > in something on this scale.
>
> If you want to have multiple bytes (because I'm forecasting that you
> won't be happy with 255 values), it's substantial complexity in
> comparison to the current implementation.
> > > > > And latency here might matter more than in
> > > > > the rest of the migration process.
> > > >
> > > > > > Plus checking a magic number
> > > > > > should make things less damaging and more debuggable if you were to
> > > > > > point the repair helper at an entirely unrelated unix socket instead
> > > > > > of passt's repair socket.
> > > > >
> > > > > Maybe, yes, even though I don't really see good chances for that
> > > > > mistake to happen. Feel free to post a proposal, of course.
> > > >
> > > > I disagree on the good chances for a mistake thing. In GSS I saw
> > > > plenty of occasions where things that shouldn't be mismatched were due
> > > > to some packaging or user screwup. And that's before even considering
> > > > the way that KubeVirt deploys its various pieces seems to provide a
> > > > number of opportunities to mess this up.
> > > >
> > > > So, I'll see what I can come up with. I'm fine with requiring
> > > > matching versions if it's actually checked. Maybe a magic derived
> > > > from our git hash, or even our build-id.
> > >
> > > Both would make things significantly less usable, because they would
> > > make different but compatible builds incompatible, and different
> > > implementations rather inconvenient.
> >
> > Ok, so you're definitely now saying versions *don't* need to match.
>
> They don't need to, no. They will match, but they don't need to.
>
> > > For example, it might be a practical solution to have a Go
> > > implementation of this in KubeVirt's virt-handler itself, but if it
> > > needs to extract information or strings from the binary, that becomes
> > > impractical.
> >
> > Ok... could we at least add just a magic number then? If we do ever
> > need a new protocol we can change it, otherwise the protocol is immutable
> > for now.
>
> Adding a non-byte magic number implies handling short reads and short
> writes, which I think is absolutely unnecessary. Feel free to propose
> an implementation, as usual.
Ok, it's on my list.
> If you are happy with a single byte magic number, then I suppose that,
> given that we're just using three values, we could encode that
> information using a combination of the remaining bits, which has the
> advantage of not needing any specific implementation until it's
> actually needed (never, I suppose), because passt-repair already
> terminates on an unknown command value.
Yeah, as you predicted, I'm not really happy with a 1 byte magic
number.
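
For reference, a minimal sketch of what a multi-byte magic exchange would
involve, including the short read and short write handling mentioned above.
This is not code from the series: REPAIR_MAGIC, read_exact(), write_exact(),
magic_send() and magic_check() are hypothetical names, and the magic value is
arbitrary.

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>
#include <arpa/inet.h>

#define REPAIR_MAGIC 0x5052504dU	/* Hypothetical value, not from the series */

/* Read exactly len bytes, retrying on short reads and EINTR.
 * Returns 0 on success, -1 on error or if the peer closes early.
 */
int read_exact(int fd, void *buf, size_t len)
{
	char *p = buf;

	while (len) {
		ssize_t n = read(fd, p, len);

		if (n < 0 && errno == EINTR)
			continue;
		if (n <= 0)
			return -1;

		p += n;
		len -= n;
	}

	return 0;
}

/* Write exactly len bytes, retrying on short writes and EINTR */
int write_exact(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len) {
		ssize_t n = write(fd, p, len);

		if (n < 0 && errno == EINTR)
			continue;
		if (n <= 0)
			return -1;

		p += n;
		len -= n;
	}

	return 0;
}

/* Send the magic in network byte order */
int magic_send(int fd)
{
	uint32_t m = htonl(REPAIR_MAGIC);

	return write_exact(fd, &m, sizeof(m));
}

/* Return 0 if the peer sent the expected magic, -1 otherwise */
int magic_check(int fd)
{
	uint32_t m;

	if (read_exact(fd, &m, sizeof(m)))
		return -1;

	return ntohl(m) == REPAIR_MAGIC ? 0 : -1;
}
```

The two loops are the extra complexity under discussion: a single-byte
command needs neither of them, because a one-byte read or write can't be
split.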
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
* Re: [PATCH 7/7] Introduce passt-repair
2025-01-30 7:43 ` David Gibson
@ 2025-01-30 7:56 ` Stefano Brivio
0 siblings, 0 replies; 41+ messages in thread
From: Stefano Brivio @ 2025-01-30 7:56 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Laurent Vivier
On Thu, 30 Jan 2025 18:43:34 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> Hm, ok. I'm way less convinced on both of those points. Which means
> I'd like to have a clear policy on whether we require versions to
> match or not. Which we prioritise affects design choices.
No, in this case, we don't require versions to match. The protocol is
well-defined and won't change. Any change to it will require a
different interface.
The protocol is one byte that can be TCP_REPAIR_ON, TCP_REPAIR_OFF, or
TCP_REPAIR_OFF_NO_WP, plus one to SCM_MAX_FD sockets as an SCM_RIGHTS
ancillary message, sent by the server.
The client replies with the same byte (and no ancillary message) to
signal success, and closes the connection on failure.
The server closes the connection on error or completion.
This is obviously enough for a privileged helper whose only function is
to set the TCP_REPAIR socket option to TCP_REPAIR_ON, TCP_REPAIR_OFF, or
TCP_REPAIR_OFF_NO_WP, on a given set of sockets.
As a result, I think that any added complexity is plain wrong.
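
To make the exchange above concrete, here is a rough sketch of both ends,
assuming a connected UNIX domain stream socket. It is not the code from 7/7:
repair_send(), repair_recv() and repair_apply() are hypothetical names, and
error handling is reduced to returning -1.

```c
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

#define SCM_MAX_FD 253	/* From Linux kernel (include/net/scm.h), not in UAPI */

/* Server (passt) side: send one command byte with n sockets as SCM_RIGHTS */
int repair_send(int conn, int8_t cmd, const int *fds, int n)
{
	char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
		__attribute__ ((aligned(__alignof__(struct cmsghdr))));
	struct iovec iov = { &cmd, sizeof(cmd) };
	struct cmsghdr *cmsg;
	struct msghdr msg;

	if (n < 1 || n > SCM_MAX_FD)
		return -1;

	memset(&msg, 0, sizeof(msg));
	memset(buf, 0, sizeof(buf));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = buf;
	msg.msg_controllen = CMSG_SPACE(sizeof(int) * n);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int) * n);
	memcpy(CMSG_DATA(cmsg), fds, sizeof(int) * n);

	return sendmsg(conn, &msg, 0) == 1 ? 0 : -1;
}

/* Helper side: receive the command byte and the sockets sent with it.
 * fds must have room for SCM_MAX_FD entries. Returns the number of
 * sockets received, or -1 on error.
 */
int repair_recv(int conn, int8_t *cmd, int *fds)
{
	char buf[CMSG_SPACE(sizeof(int) * SCM_MAX_FD)]
		__attribute__ ((aligned(__alignof__(struct cmsghdr))));
	struct iovec iov = { cmd, sizeof(*cmd) };
	struct cmsghdr *cmsg;
	struct msghdr msg;
	size_t len;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = buf;
	msg.msg_controllen = sizeof(buf);

	if (recvmsg(conn, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS)
		return -1;

	len = cmsg->cmsg_len - CMSG_LEN(0);
	if (!len || len % sizeof(int) || len > sizeof(int) * SCM_MAX_FD)
		return -1;

	memcpy(fds, CMSG_DATA(cmsg), len);
	return (int)(len / sizeof(int));
}

/* Helper side: set TCP_REPAIR to cmd on all sockets, then echo the command.
 * Needs CAP_NET_ADMIN. On failure, the caller is expected to close conn.
 */
int repair_apply(int conn, int8_t cmd, const int *fds, int n)
{
	int op = cmd, i;

	for (i = 0; i < n; i++) {
		if (setsockopt(fds[i], IPPROTO_TCP, TCP_REPAIR,
			       &op, sizeof(op)))
			return -1;
	}

	return send(conn, &cmd, sizeof(cmd), 0) == 1 ? 0 : -1;
}
```

The helper would loop on repair_recv() followed by repair_apply(), and close
the connection as soon as either one fails.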
--
Stefano
* Re: [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure
2025-01-30 7:38 ` David Gibson
@ 2025-01-30 8:32 ` Stefano Brivio
2025-01-30 8:54 ` David Gibson
0 siblings, 1 reply; 41+ messages in thread
From: Stefano Brivio @ 2025-01-30 8:32 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev, Laurent Vivier
On Thu, 30 Jan 2025 18:38:22 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> Right, but in the present draft you pay that cost whether or not
> you're actually using the flows. Unfortunately a busy server with
> heaps of active connections is exactly the case that's likely to be
> most sensitive to additional downtime, but there's not really any
> getting around that. A machine with a lot of state will need either
> high downtime or high migration bandwidth.
It's... sixteen megabytes. A KubeVirt node is only allowed to perform up
to _four_ migrations in parallel, and that's our main use case at the
moment. "High downtime" is kind of relative.
> But, I'm really hoping we can move relatively quickly to a model where
> a guest with only a handful of connections _doesn't_ have to pay that
> 128k flow cost - and can consequently migrate ok even with quite
> constrained migration bandwidth. In that scenario the size of the
> header could become significant.
I think the biggest cost of the full flow table transfer is rather code
that's a bit quicker to write (I just managed to properly set sequences
on the target, connections don't quite "flow" yet) but relatively high
maintenance (as you mentioned, we need to be careful about every single
field) and easy to break.
I would like to quickly complete the whole flow first, because I think
we can inform design and implementation decisions much better at that
point, and we can be sure it's feasible, but I'm not particularly keen
to merge this patch like it is, if we can switch it relatively swiftly
to an implementation where we model a smaller fixed-endian structure
with just the stuff we need.
And again, to be a bit more sure of which stuff we need in it, the full
flow is useful to have implemented.
Actually the biggest complications I see in switching to that approach,
from the current point, are that we need to, I guess:
1. model arrays (not really complicated by itself)
2. have a temporary structure where we store flows instead of using the
flow table directly (meaning that the "data model" needs to logically
decouple source and destination of the copy)
3. batch stuff to some extent. We'll call socket() and connect() once
for each socket anyway, obviously, but sending one message to the
TCP_REPAIR helper for each socket looks like a rather substantial
and avoidable overhead
> > > It's both easier to do
> > > and a bigger win in most cases. That would dramatically reduce the
> > > size sent here.
> >
> > Yep, feel free.
>
> It's on my queue for the next few days.
To me this part actually looks like the biggest priority after/while
getting the whole thing to work, because we can start right with a 'v1'
which looks more sustainable.
And I would just get stuff working on x86_64 in that case, without even
implementing conversions and endianness switches etc.
--
Stefano
* Re: [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure
2025-01-30 8:32 ` Stefano Brivio
@ 2025-01-30 8:54 ` David Gibson
0 siblings, 0 replies; 41+ messages in thread
From: David Gibson @ 2025-01-30 8:54 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev, Laurent Vivier
On Thu, Jan 30, 2025 at 09:32:36AM +0100, Stefano Brivio wrote:
> On Thu, 30 Jan 2025 18:38:22 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > Right, but in the present draft you pay that cost whether or not
> > you're actually using the flows. Unfortunately a busy server with
> > heaps of active connections is exactly the case that's likely to be
> > most sensitive to additional downtime, but there's not really any
> > getting around that. A machine with a lot of state will need either
> > high downtime or high migration bandwidth.
>
> It's... sixteen megabytes. A KubeVirt node is only allowed to perform up
> to _four_ migrations in parallel, and that's our main use case at the
> moment. "High downtime" is kind of relative.
Certainly. But I believe it's typical to aim for downtimes in the
~100ms range.
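
As a rough sanity check of those numbers, assuming "128k" means 131072 flow
entries of 128 bytes each (the entry size from 2/7), and ignoring the hash
table and any protocol overhead. The function names and the link speeds are
made up for the example.

```c
#include <stdint.h>

/* Size in bytes of a table of n fixed-size entries */
uint64_t table_bytes(uint64_t n, uint64_t entry_size)
{
	return n * entry_size;
}

/* Time in milliseconds, rounded down, to move 'bytes' over a link
 * carrying 'bps' bits per second
 */
uint64_t transfer_ms(uint64_t bytes, uint64_t bps)
{
	return bytes * 8 * 1000 / bps;
}
```

With these assumptions, table_bytes(131072, 128) is 16777216 bytes, that is,
16 MiB. transfer_ms() for that size gives 134 ms at 1 Gbit/s and 13 ms at
10 Gbit/s, so the full table alone is already in the range of a ~100ms
budget on the slower link.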
> > But, I'm really hoping we can move relatively quickly to a model where
> > a guest with only a handful of connections _doesn't_ have to pay that
> > 128k flow cost - and can consequently migrate ok even with quite
> > constrained migration bandwidth. In that scenario the size of the
> > header could become significant.
>
> I think the biggest cost of the full flow table transfer is rather code
> that's a bit quicker to write (I just managed to properly set sequences
> on the target, connections don't quite "flow" yet) but relatively high
> maintenance (as you mentioned, we need to be careful about every single
> field) and easy to break.
Right. And with this draft we can't even change the size of the flow
table without breaking migration. That seems like a thing we might
well want to change.
> I would like to quickly complete the whole flow first, because I think
> we can inform design and implementation decisions much better at that
> point, and we can be sure it's feasible,
That's fair.
> but I'm not particularly keen
> to merge this patch like it is, if we can switch it relatively swiftly
> to an implementation where we model a smaller fixed-endian structure
> with just the stuff we need.
So, there are kind of two parts to this:
1) Only transferring active flow entries, and not transferring the
hash table
I think this is pretty easy. It could be done with or without
preserving flow indices. Preserving makes for debug log continuity
between the ends, but not preserving lets us change the size of the
flow table without breaking migration.
2) Only transferring the necessary pieces of each entry, and using a
fixed representation of each piece
This is harder. Not *super* hard, I think, but definitely trickier
than (1)
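
A minimal sketch of what (1) and (2) could look like together, written as
plain functions. Everything here is hypothetical: struct demo_flow is not the
layout of passt's flow table, and the fields are picked only to show a count
followed by active entries, each field in network byte order. This variant
preserves flow indices.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Hypothetical in-memory entry, NOT the layout of passt's flow table */
struct demo_flow {
	int active;
	uint32_t seq_to_tap;
	uint32_t seq_from_tap;
	uint16_t port;
};

#define DEMO_WIRE_LEN 14	/* index (4) + two sequences (4 + 4) + port (2) */

static void put32(uint8_t **p, uint32_t v)
{
	v = htonl(v);
	memcpy(*p, &v, sizeof(v));
	*p += sizeof(v);
}

static void put16(uint8_t **p, uint16_t v)
{
	v = htons(v);
	memcpy(*p, &v, sizeof(v));
	*p += sizeof(v);
}

static uint32_t get32(const uint8_t **p)
{
	uint32_t v;

	memcpy(&v, *p, sizeof(v));
	*p += sizeof(v);
	return ntohl(v);
}

static uint16_t get16(const uint8_t **p)
{
	uint16_t v;

	memcpy(&v, *p, sizeof(v));
	*p += sizeof(v);
	return ntohs(v);
}

/* Write a 32-bit count, then every active entry with its index.
 * Returns the bytes written, or 0 if buf is too small.
 */
size_t demo_encode(const struct demo_flow *tab, uint32_t n,
		   uint8_t *buf, size_t len)
{
	uint32_t i, count = 0;
	uint8_t *p = buf;

	for (i = 0; i < n; i++)
		count += !!tab[i].active;

	if (len < 4 + (size_t)count * DEMO_WIRE_LEN)
		return 0;

	put32(&p, count);
	for (i = 0; i < n; i++) {
		if (!tab[i].active)
			continue;
		put32(&p, i);
		put32(&p, tab[i].seq_to_tap);
		put32(&p, tab[i].seq_from_tap);
		put16(&p, tab[i].port);
	}

	return (size_t)(p - buf);
}

/* Restore entries one by one into a table of n slots.
 * Returns the number of entries restored, or -1 on malformed input.
 */
int demo_decode(struct demo_flow *tab, uint32_t n,
		const uint8_t *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t count, i;

	if (len < 4)
		return -1;

	count = get32(&p);
	if ((len - 4) / DEMO_WIRE_LEN < count)
		return -1;

	for (i = 0; i < count; i++) {
		uint32_t idx = get32(&p);

		if (idx >= n)
			return -1;

		tab[idx].active = 1;
		tab[idx].seq_to_tap = get32(&p);
		tab[idx].seq_from_tap = get32(&p);
		tab[idx].port = get16(&p);
	}

	return (int)count;
}
```

Two active entries take 4 + 2 * 14 = 32 bytes on the wire here, regardless of
the table size. Dropping the index from each record would be the
non-preserving variant: the target would then pick free slots on its own.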
> And again, to be a bit more sure of which stuff we need in it, the full
> flow is useful to have implemented.
>
> Actually the biggest complications I see in switching to that approach,
> from the current point, are that we need to, I guess:
>
> 1. model arrays (not really complicated by itself)
So here, I actually think this is simpler if we don't attempt to have
a declarative approach to defining the protocol, but just write
functions to implement it.
> 2. have a temporary structure where we store flows instead of using the
> flow table directly (meaning that the "data model" needs to logically
> decouple source and destination of the copy)
Right.. I'd really prefer to "stream" in the entries one by one,
rather than having a big staging area. That's even harder to do
declaratively, but I think the other advantages are worth it.
> 3. batch stuff to some extent. We'll call socket() and connect() once
> for each socket anyway, obviously, but sending one message to the
> TCP_REPAIR helper for each socket looks like a rather substantial
> and avoidable overhead
I don't think this actually has a lot of bearing on the protocol. I'd
envisage migrate_target() decodes all the information into the
target's flow table, then migrate_target_post() steps through all the
flows re-establishing the connections. Since we've already parsed the
protocol at that point, we can make multiple passes: one to gather
batches and set TCP_REPAIR, another through each entry to set the
values, and a final one to clear TCP_REPAIR in batches.
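
The batching passes could be driven by something as simple as the sketch
below, which assumes the sockets were already gathered into an array.
for_each_batch(), batch_fn and count_fds() are hypothetical names. A real
callback would send the batch to the repair helper and wait for its reply.

```c
#include <stddef.h>

#define SCM_MAX_FD 253	/* From Linux kernel (include/net/scm.h), not in UAPI */

/* Handles one batch of at most SCM_MAX_FD sockets, returns 0 on success */
typedef int (*batch_fn)(const int *fds, int n, void *arg);

/* Split nfds sockets into batches of at most SCM_MAX_FD and call fn on each.
 * Returns the number of batches handled, or -1 as soon as fn fails.
 */
int for_each_batch(const int *fds, size_t nfds, batch_fn fn, void *arg)
{
	int batches = 0;

	while (nfds) {
		int n = nfds > SCM_MAX_FD ? SCM_MAX_FD : (int)nfds;

		if (fn(fds, n, arg))
			return -1;

		fds += n;
		nfds -= n;
		batches++;
	}

	return batches;
}

/* Example callback: only adds up the sockets it sees, into a size_t */
int count_fds(const int *fds, int n, void *arg)
{
	(void)fds;
	*(size_t *)arg += n;
	return 0;
}
```

With 600 sockets this makes three calls (253, 253 and 94 sockets) instead of
600 separate messages to the helper, once for the pass setting TCP_REPAIR and
once for the pass clearing it.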
> > > > It's both easier to do
> > > > and a bigger win in most cases. That would dramatically reduce the
> > > > size sent here.
> > >
> > > Yep, feel free.
> >
> > It's on my queue for the next few days.
>
> To me this part actually looks like the biggest priority after/while
> getting the whole thing to work, because we can start right with a 'v1'
> which looks more sustainable.
>
> And I would just get stuff working on x86_64 in that case, without even
> implementing conversions and endianness switches etc.
Right. Given the number of options here, I think it would be safest
to go in expecting to go through a few throwaway protocol versions
before reaching one we're happy enough to support long term.
To ease that process, I'm wondering if we should add a default-off
command line option to enable migration. For now, enabling it would
print some sort of "migration is experimental!" warning. Once we have
a stream format we're ok with, we can flip it to on-by-default, but we
don't maintain receive compatibility for the experimental versions
leading up to that.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
end of thread, other threads:[~2025-01-30 8:54 UTC | newest]
Thread overview: 41+ messages
2025-01-27 23:15 [PATCH 0/7] Draft, incomplete series introducing state migration Stefano Brivio
2025-01-27 23:15 ` [PATCH 1/7] icmp, udp: Pad time_t timestamp to 64-bit to ease " Stefano Brivio
2025-01-28 0:49 ` David Gibson
2025-01-28 6:48 ` Stefano Brivio
2025-01-27 23:15 ` [PATCH 2/7] flow, flow_table: Pad flow table entries to 128 bytes, hash entries to 32 bits Stefano Brivio
2025-01-28 0:50 ` David Gibson
2025-01-27 23:15 ` [PATCH 3/7] tcp_conn: Avoid 7-bit hole in struct tcp_splice_conn Stefano Brivio
2025-01-28 0:53 ` David Gibson
2025-01-28 6:48 ` Stefano Brivio
2025-01-29 1:02 ` David Gibson
2025-01-29 7:33 ` Stefano Brivio
2025-01-30 0:44 ` David Gibson
2025-01-30 4:55 ` Stefano Brivio
2025-01-30 7:27 ` David Gibson
2025-01-27 23:15 ` [PATCH 4/7] flow_table: Use size in extern declaration for flowtab Stefano Brivio
2025-01-27 23:15 ` [PATCH 5/7] util: Add read_remainder() and read_all_buf() Stefano Brivio
2025-01-28 0:59 ` David Gibson
2025-01-28 6:48 ` Stefano Brivio
2025-01-29 1:03 ` David Gibson
2025-01-29 7:33 ` Stefano Brivio
2025-01-30 0:44 ` David Gibson
2025-01-27 23:15 ` [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure Stefano Brivio
2025-01-28 1:40 ` David Gibson
2025-01-28 6:50 ` Stefano Brivio
2025-01-29 1:16 ` David Gibson
2025-01-29 7:33 ` Stefano Brivio
2025-01-30 0:48 ` David Gibson
2025-01-30 4:55 ` Stefano Brivio
2025-01-30 7:38 ` David Gibson
2025-01-30 8:32 ` Stefano Brivio
2025-01-30 8:54 ` David Gibson
2025-01-27 23:15 ` [PATCH 7/7] Introduce passt-repair Stefano Brivio
2025-01-27 23:31 ` Stefano Brivio
2025-01-28 1:51 ` David Gibson
2025-01-28 6:51 ` Stefano Brivio
2025-01-29 1:29 ` David Gibson
2025-01-29 7:04 ` Stefano Brivio
2025-01-30 0:53 ` David Gibson
2025-01-30 4:55 ` Stefano Brivio
2025-01-30 7:43 ` David Gibson
2025-01-30 7:56 ` Stefano Brivio
Code repositories for project(s) associated with this public inbox
https://passt.top/passt