Date: Wed, 29 Jan 2025 08:33:50 +0100
From: Stefano Brivio
To: David Gibson
CC: passt-dev@passt.top, Laurent Vivier
Subject: Re: [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure
Message-ID: <20250129083350.220a7ab0@elisabeth>
References: <20250127231532.672363-1-sbrivio@redhat.com> <20250127231532.672363-7-sbrivio@redhat.com> <20250128075001.3557d398@elisabeth>
Organization: Red Hat
List-Id: Development discussion and patches for passt

On Wed, 29 Jan 2025 12:16:58 +1100
David Gibson wrote:

> On Tue, Jan 28, 2025 at 07:50:01AM +0100, Stefano Brivio wrote:
> > On Tue, 28 Jan 2025 12:40:12 +1100
> > David Gibson wrote:
> > 
> > > On Tue, Jan 28, 2025 at 12:15:31AM +0100, Stefano Brivio wrote:
> > > > Add two sets (source or target) of three functions each for passt in
> > > > vhost-user mode, triggered by activity on the file descriptor passed
> > > > via
VHOST_USER_PROTOCOL_F_DEVICE_STATE: > > > > > > > > - migrate_source_pre() and migrate_target_pre() are called to prepare > > > > for migration, before data is transferred > > > > > > > > - migrate_source() sends, and migrate_target() receives migration data > > > > > > > > - migrate_source_post() and migrate_target_post() are responsible for > > > > any post-migration task > > > > > > > > Callbacks are added to these functions with arrays of function > > > > pointers in migrate.c. Migration handlers are versioned. > > > > > > > > Versioned descriptions of data sections will be added to the > > > > data_versions array, which points to versioned iovec arrays. Version > > > > 1 is currently empty and will be filled in in subsequent patches. > > > > > > > > The source announces the data version to be used and informs the peer > > > > about endianness, and the size of void *, time_t, flow entries and > > > > flow hash table entries. > > > > > > > > The target checks if the version of the source is still supported. If > > > > it's not, it aborts the migration. 
> > > > > > > > Signed-off-by: Stefano Brivio > > > > --- > > > > Makefile | 12 +-- > > > > migrate.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++ > > > > migrate.h | 90 ++++++++++++++++++ > > > > passt.c | 2 +- > > > > vu_common.c | 122 ++++++++++++++++--------- > > > > vu_common.h | 2 +- > > > > 6 files changed, 438 insertions(+), 49 deletions(-) > > > > create mode 100644 migrate.c > > > > create mode 100644 migrate.h > > > > > > > > diff --git a/Makefile b/Makefile > > > > index 464eef1..1383875 100644 > > > > --- a/Makefile > > > > +++ b/Makefile > > > > @@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS) > > > > > > > > PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \ > > > > icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \ > > > > - ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \ > > > > - tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ > > > > + ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \ > > > > + tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \ > > > > vhost_user.c virtio.c vu_common.c > > > > QRAP_SRCS = qrap.c > > > > SRCS = $(PASST_SRCS) $(QRAP_SRCS) > > > > @@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1 > > > > > > > > PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \ > > > > flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \ > > > > - lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \ > > > > - siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \ > > > > - tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \ > > > > - virtio.h vu_common.h > > > > + lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \ > > > > + pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \ > > > > + tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h 
util.h \ > > > > + vhost_user.h virtio.h vu_common.h > > > > HEADERS = $(PASST_HEADERS) seccomp.h > > > > > > > > C := \#include \nint main(){int a=getrandom(0, 0, 0);} > > > > diff --git a/migrate.c b/migrate.c > > > > new file mode 100644 > > > > index 0000000..bee9653 > > > > --- /dev/null > > > > +++ b/migrate.c > > > > @@ -0,0 +1,259 @@ > > > > +// SPDX-License-Identifier: GPL-2.0-or-later > > > > + > > > > +/* PASST - Plug A Simple Socket Transport > > > > + * for qemu/UNIX domain socket mode > > > > + * > > > > + * PASTA - Pack A Subtle Tap Abstraction > > > > + * for network namespace/tap device mode > > > > + * > > > > + * migrate.c - Migration sections, layout, and routines > > > > + * > > > > + * Copyright (c) 2025 Red Hat GmbH > > > > + * Author: Stefano Brivio > > > > + */ > > > > + > > > > +#include > > > > +#include > > > > + > > > > +#include "util.h" > > > > +#include "ip.h" > > > > +#include "passt.h" > > > > +#include "inany.h" > > > > +#include "flow.h" > > > > +#include "flow_table.h" > > > > + > > > > +#include "migrate.h" > > > > + > > > > +/* Current version of migration data */ > > > > +#define MIGRATE_VERSION 1 > > > > + > > > > +/* Magic as we see it and as seen with reverse endianness */ > > > > +#define MIGRATE_MAGIC 0xB1BB1D1B0BB1D1B0 > > > > +#define MIGRATE_MAGIC_SWAPPED 0xB0D1B10B1B1DBBB1 > > > > > > As noted, I'm hoping we can get rid of "either endian" migration. But > > > if this stays, we should define it using __bswap_constant_32() to > > > avoid embarrassing mistakes. > > > > Those always give me issues on musl, > > What sort of issues? We're already using them, and have fallback > versions defined in util.h The very issues that brought me to introduce those fallback versions, so I'm instinctively reluctant to use them. Actually, I think it's even clearer to have this spelt out (I always need to stop for a moment and think: what happens when I cross the 32-bit boundary?). 
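To make the "spelt out" constant easier to check by eye, here is a minimal sketch of what reversing the eight bytes of the magic gives. swap64() and magic_selfcheck() are hypothetical helpers written for this example only; they are not part of the patch and not the fallback macros from util.h.

```c
#include <assert.h>
#include <stdint.h>

/* Values as defined in migrate.c in this patch */
#define MIGRATE_MAGIC		UINT64_C(0xB1BB1D1B0BB1D1B0)
#define MIGRATE_MAGIC_SWAPPED	UINT64_C(0xB0D1B10B1B1DBBB1)

/* Hypothetical helper: reverse the order of the eight bytes of x */
uint64_t swap64(uint64_t x)
{
	uint64_t r = 0;
	int i;

	for (i = 0; i < 8; i++) {
		r = (r << 8) | (x & 0xff);	/* lowest byte moves up */
		x >>= 8;
	}

	return r;
}

/* Hypothetical self-check: the two constants are byte-swaps of each other */
void magic_selfcheck(void)
{
	assert(swap64(MIGRATE_MAGIC) == MIGRATE_MAGIC_SWAPPED);
	assert(swap64(MIGRATE_MAGIC_SWAPPED) == MIGRATE_MAGIC);
}
```

The loop works on the whole 64-bit value at once, so nothing special happens at the 32-bit boundary: byte 0 simply ends up as byte 7, byte 1 as byte 6, and so on.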
> > so I'd rather test things on > > big-endian and realise it's actually 0xB0D1B1B01B1DBBB1 (0x0b bitswap). > > > > Feel free to post a different proposal if tested. > > > > > > + > > > > +/* Migration header to send from source */ > > > > +static union migrate_header header = { > > > > + .magic = MIGRATE_MAGIC, > > > > + .version = htonl_constant(MIGRATE_VERSION), > > > > + .time_t_size = htonl_constant(sizeof(time_t)), > > > > + .flow_size = htonl_constant(sizeof(union flow)), > > > > + .flow_sidx_size = htonl_constant(sizeof(struct flow_sidx)), > > > > + .voidp_size = htonl_constant(sizeof(void *)), > > > > +}; > > > > + > > > > +/* Data sections for version 1 */ > > > > +static struct iovec sections_v1[] = { > > > > + { &header, sizeof(header) }, > > > > +}; > > > > + > > > > +/* Set of data versions */ > > > > +static struct migrate_data data_versions[] = { > > > > + { > > > > + 1, sections_v1, > > > > + }, > > > > + { 0 }, > > > > +}; > > > > + > > > > +/* Handlers to call in source before sending data */ > > > > +struct migrate_handler handlers_source_pre[] = { > > > > + { 0 }, > > > > +}; > > > > + > > > > +/* Handlers to call in source after sending data */ > > > > +struct migrate_handler handlers_source_post[] = { > > > > + { 0 }, > > > > +}; > > > > + > > > > +/* Handlers to call in target before receiving data with version 1 */ > > > > +struct migrate_handler handlers_target_pre_v1[] = { > > > > + { 0 }, > > > > +}; > > > > + > > > > +/* Handlers to call in target after receiving data with version 1 */ > > > > +struct migrate_handler handlers_target_post_v1[] = { > > > > + { 0 }, > > > > +}; > > > > + > > > > +/* Versioned sets of migration handlers */ > > > > +struct migrate_target_handlers target_handlers[] = { > > > > + { > > > > + 1, > > > > + handlers_target_pre_v1, > > > > + handlers_target_post_v1, > > > > + }, > > > > + { 0 }, > > > > +}; > > > > + > > > > +/** > > > > + * migrate_source_pre() - Pre-migration tasks as source > > > > + * @m: 
Migration metadata > > > > + * > > > > + * Return: 0 on success, error code on failure > > > > + */ > > > > +int migrate_source_pre(struct migrate_meta *m) > > > > +{ > > > > + struct migrate_handler *h; > > > > + > > > > + for (h = handlers_source_pre; h->fn; h++) { > > > > + int rc; > > > > + > > > > + if ((rc = h->fn(m, h->data))) > > > > + return rc; > > > > + } > > > > + > > > > + return 0; > > > > +} > > > > + > > > > +/** > > > > + * migrate_source() - Perform migration as source: send state to hypervisor > > > > + * @fd: Descriptor for state transfer > > > > + * @m: Migration metadata > > > > + * > > > > + * Return: 0 on success, error code on failure > > > > + */ > > > > +int migrate_source(int fd, const struct migrate_meta *m) > > > > +{ > > > > + static struct migrate_data *d; > > > > + unsigned count; > > > > + int rc; > > > > + > > > > + for (d = data_versions; d->v != MIGRATE_VERSION; d++); > > > > > > Should ASSERT() if we don't find the version within the array. > > > > This looks a bit unnecessary, MIGRATE_VERSION is defined just above... > > it's just a readability killer to me. > > > > > > + for (count = 0; d->sections[count].iov_len; count++); > > > > + > > > > + debug("Writing %u migration sections", count - 1 /* minus header */); > > > > + rc = write_remainder(fd, d->sections, count, 0); > > > > + if (rc < 0) > > > > + return errno; > > > > + > > > > + return 0; > > > > +} > > > > + > > > > +/** > > > > + * migrate_source_post() - Post-migration tasks as source > > > > + * @m: Migration metadata > > > > + * > > > > + * Return: 0 on success, error code on failure > > > > + */ > > > > +void migrate_source_post(struct migrate_meta *m) > > > > +{ > > > > + struct migrate_handler *h; > > > > + > > > > + for (h = handlers_source_post; h->fn; h++) > > > > + h->fn(m, h->data); > > > > > > Is there actually anything we might need to do on the source after a > > > successful migration, other than exit? 
> > > > We might want to log a couple of things, which would warrant these > > handlers. > > > > But let's say we need to do something *similar* to "updating the > > network" such as the RARP announcement that QEMU is requesting (this is > > IIUC, that's on the target end, not the source end... The RARP announcement yes, but something *similar* to it, not necessarily. > > intended for OVN-Kubernetes, so go figure), or that we need a > > workaround for a kernel issue with implicit close() with TCP_REPAIR > > on... I would leave this in for completeness. > > ...but sure, point taken. > > > > > +} > > > > + > > > > +/** > > > > + * migrate_target_read_header() - Set metadata in target from source header > > > > + * @fd: Descriptor for state transfer > > > > + * @m: Migration metadata, filled on return > > > > + * > > > > + * Return: 0 on success, error code on failure > > > > > > We nearly always use negative error codes. Why not here? > > > > Because the reply to VHOST_USER_SET_DEVICE_STATE_FD is unsigned: > > > > https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#front-end-message-types > > > > and I want to keep this consistent/untranslated. > > Ok. > > > > > + */ > > > > +int migrate_target_read_header(int fd, struct migrate_meta *m) > > > > +{ > > > > + static struct migrate_data *d; > > > > + union migrate_header h; > > > > + > > > > + if (read_all_buf(fd, &h, sizeof(h))) > > > > + return errno; > > > > + > > > > + debug("Source magic: 0x%016" PRIx64 ", sizeof(void *): %u, version: %u", > > > > + h.magic, ntohl(h.voidp_size), ntohl(h.version)); > > > > + > > > > + for (d = data_versions; d->v != ntohl(h.version); d++); > > > > + if (!d->v) > > > > + return ENOTSUP; > > > > > > This is too late. The loop doesn't check it, so you've already > > > overrun the data_versions table if the version wasn't in there. > > > > Ah, yes, I forgot the '&& d->v' part (see migrate_target()). > > > > > Easier to use an ARRAY_SIZE() limit in the loop, I think. 
> > > > I'd rather keep that as a one-liner, and NULL-terminate the arrays. > > > > > > + m->v = d->v; > > > > + > > > > + if (h.magic == MIGRATE_MAGIC) > > > > + m->bswap = false; > > > > + else if (h.magic == MIGRATE_MAGIC_SWAPPED) > > > > + m->bswap = true; > > > > + else > > > > + return ENOTSUP; > > > > + > > > > + if (ntohl(h.voidp_size) == 4) > > > > + m->source_64b = false; > > > > + else if (ntohl(h.voidp_size) == 8) > > > > + m->source_64b = true; > > > > + else > > > > + return ENOTSUP; > > > > + > > > > + if (ntohl(h.time_t_size) == 4) > > > > + m->time_64b = false; > > > > + else if (ntohl(h.time_t_size) == 8) > > > > + m->time_64b = true; > > > > + else > > > > + return ENOTSUP; > > > > + > > > > + m->flow_size = ntohl(h.flow_size); > > > > + m->flow_sidx_size = ntohl(h.flow_sidx_size); > > > > + > > > > + return 0; > > > > +} > > > > + > > > > +/** > > > > + * migrate_target_pre() - Pre-migration tasks as target > > > > + * @m: Migration metadata > > > > + * > > > > + * Return: 0 on success, error code on failure > > > > + */ > > > > +int migrate_target_pre(struct migrate_meta *m) > > > > +{ > > > > + struct migrate_target_handlers *th; > > > > + struct migrate_handler *h; > > > > + > > > > + for (th = target_handlers; th->v != m->v && th->v; th++); > > > > + > > > > + for (h = th->pre; h->fn; h++) { > > > > + int rc; > > > > + > > > > + if ((rc = h->fn(m, h->data))) > > > > + return rc; > > > > + } > > > > + > > > > + return 0; > > > > +} > > > > + > > > > +/** > > > > + * migrate_target() - Perform migration as target: receive state from hypervisor > > > > + * @fd: Descriptor for state transfer > > > > + * @m: Migration metadata > > > > + * > > > > + * Return: 0 on success, error code on failure > > > > + * > > > > + * #syscalls:vu readv > > > > + */ > > > > +int migrate_target(int fd, const struct migrate_meta *m) > > > > +{ > > > > + static struct migrate_data *d; > > > > + unsigned cnt; > > > > + int rc; > > > > + > > > > + for (d = 
data_versions; d->v != m->v && d->v; d++); > > > > + > > > > + for (cnt = 0; d->sections[cnt + 1 /* skip header */].iov_len; cnt++); > > > > + > > > > + debug("Reading %u migration sections", cnt); > > > > + rc = read_remainder(fd, d->sections + 1, cnt, 0); > > > > + if (rc < 0) > > > > + return errno; > > > > + > > > > + return 0; > > > > +} > > > > + > > > > +/** > > > > + * migrate_target_post() - Post-migration tasks as target > > > > + * @m: Migration metadata > > > > + */ > > > > +void migrate_target_post(struct migrate_meta *m) > > > > +{ > > > > + struct migrate_target_handlers *th; > > > > + struct migrate_handler *h; > > > > + > > > > + for (th = target_handlers; th->v != m->v && th->v; th++); > > > > + > > > > + for (h = th->post; h->fn; h++) > > > > + h->fn(m, h->data); > > > > +} > > > > diff --git a/migrate.h b/migrate.h > > > > new file mode 100644 > > > > index 0000000..5582f75 > > > > --- /dev/null > > > > +++ b/migrate.h > > > > @@ -0,0 +1,90 @@ > > > > +/* SPDX-License-Identifier: GPL-2.0-or-later > > > > + * Copyright (c) 2025 Red Hat GmbH > > > > + * Author: Stefano Brivio > > > > + */ > > > > + > > > > +#ifndef MIGRATE_H > > > > +#define MIGRATE_H > > > > + > > > > +/** > > > > + * struct migrate_meta - Migration metadata > > > > + * @v: Chosen migration data version, host order > > > > + * @bswap: Source has opposite endianness > > > > + * @peer_64b: Source uses 64-bit void * > > > > + * @time_64b: Source uses 64-bit time_t > > > > + * @flow_size: Size of union flow in source > > > > + * @flow_sidx_size: Size of struct flow_sidx in source > > > > + */ > > > > +struct migrate_meta { > > > > + uint32_t v; > > > > + bool bswap; > > > > + bool source_64b; > > > > + bool time_64b; > > > > + size_t flow_size; > > > > + size_t flow_sidx_size; > > > > +}; > > > > + > > > > +/** > > > > + * union migrate_header - Migration header from source > > > > + * @magic: 0xB1BB1D1B0BB1D1B0, host order > > > > + * @version: Source sends highest known, target 
aborts if unsupported > > > > + * @voidp_size: sizeof(void *), network order > > > > + * @time_t_size: sizeof(time_t), network order > > > > + * @flow_size: sizeof(union flow), network order > > > > + * @flow_sidx_size: sizeof(struct flow_sidx_t), network order > > > > + * @unused: Go figure > > > > + */ > > > > +union migrate_header { > > > > + struct { > > > > + uint64_t magic; > > > > + uint32_t version; > > > > + uint32_t voidp_size; > > > > + uint32_t time_t_size; > > > > + uint32_t flow_size; > > > > + uint32_t flow_sidx_size; > > > > + }; > > > > + uint8_t unused[65536]; > > > > > > So, having looked at this, I no longer think padding the header to 64kiB > > > is a good idea. The structure means we're basically stuck always > > > having that chunky header. Instead, I think the header should be > > > absolutely minimal: basically magic and version only. v1 (and maybe > > > others) can add a "metadata" or whatever section for additional > > > information like this they need. > > > > The header is processed by the target in a separate, preliminary step, > > though. > > > > That's why I added metadata right in the header: if the target needs to > > abort the migration because, say, the size of a flow entry is too big > > to handle for a particular version, then we should know that before > > migrate_target_pre(). > > Ah, yes, I missed that, we'd need a more complex design to do > additional transfers and checks before making the target_pre > callbacks. > > > As long as we check the version first, we can always shrink the header > > later on. > > *thinks*.. I guess so, though it's kind of awkward; a future version > would have to read the "header of the header", check the version, then > if it's the old one, read the remainder of the 64kiB block. > > I still think we should clearly separate the part that we're > committing to being in every future version (which I think should just > be magic and version), from the stuff that's just v1. Sure, I can add a comment. 
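As a rough illustration of that split, this is how a target could check only the part we commit to (magic and version) before deciding how much more header to read. It is a sketch under assumptions: header_check_fixed(), HDR_FIXED_SIZE and HDR_V1_SIZE are made-up names, the offsets follow the layout of union migrate_header from this patch (magic at 0, version at 8), and errors are positive errno-style codes as elsewhere in migrate.c.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MIGRATE_MAGIC		UINT64_C(0xB1BB1D1B0BB1D1B0)
#define MIGRATE_MAGIC_SWAPPED	UINT64_C(0xB0D1B10B1B1DBBB1)

#define HDR_FIXED_SIZE	12	/* hypothetical: magic + version only */
#define HDR_V1_SIZE	65536	/* padded header size used in this patch */

/* Hypothetical helper: parse the fixed part of the header from buf.
 * On success, return 0 and set *version, *bswap, and *rest (header bytes
 * still to be read for that version). Otherwise return a positive error.
 */
int header_check_fixed(const uint8_t *buf, size_t len,
		       uint32_t *version, int *bswap, size_t *rest)
{
	uint64_t magic;

	assert(buf && version && bswap && rest);

	if (len < HDR_FIXED_SIZE)
		return EINVAL;

	/* Magic is sent in the byte order of the source */
	memcpy(&magic, buf, sizeof(magic));
	if (magic == MIGRATE_MAGIC)
		*bswap = 0;
	else if (magic == MIGRATE_MAGIC_SWAPPED)
		*bswap = 1;
	else
		return ENOTSUP;

	/* Version is sent in network order: decode it byte by byte */
	*version = (uint32_t)buf[8] << 24 | (uint32_t)buf[9] << 16 |
		   (uint32_t)buf[10] << 8 | (uint32_t)buf[11];

	if (*version != 1)	/* only v1 exists so far */
		return ENOTSUP;

	*rest = HDR_V1_SIZE - HDR_FIXED_SIZE;
	return 0;
}
```

A later version would only need to add its own case for *rest, while the first twelve bytes stay the same for everybody.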
> > But having 64 KiB reserved looks more robust because it's a
> > safe place to add this kind of metadata.
> > 
> > Note that 64 KiB is typically transferred in a single read/write
> > from/to the vhost-user back-end.
> 
> Ok, but it also has to go over the qemu migration channel, which will
> often be a physical link, not a super-fast local/virtual one, and may
> be bandwidth capped as well. I'm not actually certain if 64kiB is
> likely to be a problem there, but it *is* large compared to the state
> blobs of most qemu devices (usually only a few hundred bytes).

Even if we transfer just what we need of a flow, it's still something
well in excess of 50 bytes each. 100k flows would be 5 megs.

Well, anyway, let's cut this down to 4k, which should be enough, so
that it's not a topic anymore.

-- 
Stefano