Date: Thu, 30 Jan 2025 12:28:42 +1100
From: David Gibson
To: Stefano Brivio
Cc: passt-dev@passt.top, Laurent Vivier
Subject: Re: [PATCH v2 5/8] Introduce facilities for guest migration on top of vhost-user infrastructure
References: <20250128233940.1235855-1-sbrivio@redhat.com>
 <20250128233940.1235855-6-sbrivio@redhat.com>
 <20250129094606.133a28e3@elisabeth>
In-Reply-To: <20250129094606.133a28e3@elisabeth>

On Wed, Jan 29, 2025 at 09:46:06AM +0100, Stefano Brivio wrote:
> On Wed, 29 Jan 2025 16:41:01 +1100
> David Gibson wrote:
> 
> > On Wed, Jan 29, 2025 at 12:39:37AM +0100, Stefano Brivio wrote:
> > > Add two sets (source or target) of three functions each for passt in
> > > vhost-user mode, triggered by activity on the file descriptor passed
> > > via VHOST_USER_PROTOCOL_F_DEVICE_STATE:
> > >
> > > - migrate_source_pre() and migrate_target_pre() are called to prepare
> > >   for migration, before data is transferred
> > >
> > > - migrate_source() sends, and migrate_target() receives migration data
> > >
> > > - migrate_source_post() and migrate_target_post() are responsible for
> > >   any post-migration task
> > >
> > > Callbacks are added to these functions with arrays of function
> > > pointers in migrate.c. Migration handlers are versioned.
> > >
> > > Versioned descriptions of data sections will be added to the
> > > data_versions array, which points to versioned iovec arrays.
> > > Version 1 is currently empty and will be filled in in subsequent
> > > patches.
> > >
> > > The source announces the data version to be used and informs the peer
> > > about endianness, and the size of void *, time_t, flow entries and
> > > flow hash table entries.
> > >
> > > The target checks if the version of the source is still supported. If
> > > it's not, it aborts the migration.
> > >
> > > Signed-off-by: Stefano Brivio
> > 
> > Tomorrow, I'm planning to try implementing a bunch of the suggestions
> > I have below.
> > 
> > > ---
> > >  Makefile    |  12 +--
> > >  migrate.c   | 264 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  migrate.h   |  88 ++++++++++++++++++
> > >  passt.c     |   2 +-
> > >  vu_common.c | 124 ++++++++++++++++--------
> > >  vu_common.h |   2 +-
> > >  6 files changed, 443 insertions(+), 49 deletions(-)
> > >  create mode 100644 migrate.c
> > >  create mode 100644 migrate.h
> > >
> > > diff --git a/Makefile b/Makefile
> > > index 464eef1..1383875 100644
> > > --- a/Makefile
> > > +++ b/Makefile
> > > @@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
> > >  
> > >  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> > >  	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
> > > -	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
> > > -	tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > > +	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
> > > +	tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > >  	vhost_user.c virtio.c vu_common.c
> > >  QRAP_SRCS = qrap.c
> > >  SRCS = $(PASST_SRCS) $(QRAP_SRCS)
> > > @@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1
> > >  
> > >  PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
> > >  	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
> > > -	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
> > > -	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
> > > -	tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
> > > -	virtio.h vu_common.h
> > > +	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
> > > +	pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
> > > +	tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
> > > +	vhost_user.h virtio.h vu_common.h
> > >  HEADERS = $(PASST_HEADERS) seccomp.h
> > >  
> > >  C := \#include <sys/random.h>\nint main(){int a=getrandom(0, 0, 0);}
> > > diff --git a/migrate.c b/migrate.c
> > > new file mode 100644
> > > index 0000000..b8b79e0
> > > --- /dev/null
> > > +++ b/migrate.c
> > > @@ -0,0 +1,264 @@
> > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > +
> > > +/* PASST - Plug A Simple Socket Transport
> > > + *  for qemu/UNIX domain socket mode
> > > + *
> > > + * PASTA - Pack A Subtle Tap Abstraction
> > > + *  for network namespace/tap device mode
> > > + *
> > > + * migrate.c - Migration sections, layout, and routines
> > > + *
> > > + * Copyright (c) 2025 Red Hat GmbH
> > > + * Author: Stefano Brivio
> > > + */
> > > +
> > > +#include
> > > +#include
> > > +
> > > +#include "util.h"
> > > +#include "ip.h"
> > > +#include "passt.h"
> > > +#include "inany.h"
> > > +#include "flow.h"
> > > +#include "flow_table.h"
> > > +
> > > +#include "migrate.h"
> > > +
> > > +/* Current version of migration data */
> > > +#define MIGRATE_VERSION	1
> > > +
> > > +/* Magic as we see it and as seen with reverse endianness */
> > > +#define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
> > > +#define MIGRATE_MAGIC_SWAPPED	0xB0D1B1B01B1DBBB1
> > > +
> > > +/* Migration header to send from source */
> > > +static union migrate_header header = {
> > > +	.magic		= MIGRATE_MAGIC,
> > > +	.version	= htonl_constant(MIGRATE_VERSION),
> > > +	.time_t_size	= htonl_constant(sizeof(time_t)),
> > > +	.flow_size	= htonl_constant(sizeof(union flow)),
> > > +	.flow_sidx_size	= htonl_constant(sizeof(struct flow_sidx)),
> > > +	.voidp_size	= htonl_constant(sizeof(void *)),
> > > +};
> > > +
> > > +/* Data sections for version 1 */
> > > +static struct iovec sections_v1[] = {
> > > +	{ &header, sizeof(header) },
> > 
> > This format assumes that everything we send is in buffers with
> > statically known size and address. I'm pretty sure we'll outgrow that
> > quickly. For one thing, I realised we want to transfer addr*_seen, or
> > we won't know how to direct an incoming connection after migration, if
> > that occurs before the guest sends anything outgoing.
> 
> Yes, I'm transferring those too (local changes not completely
> working yet) as separate sections. The layout is just fine for those.

Oh, I guess the context does actually live in a global, we just pretend
it doesn't most of the time.

> What we don't cover with this layout is arrays, and I guess that should
> be added, because if we want to transfer a minimal representation of
> single flows instead of the whole flow table, we need something like
> that.
> 
> I think it could be something on top of iovecs, that is, a simple
> structure with iovec pointers and a count, or a pointer to a count.
> 
> > > +};
> > > +
> > > +/* Set of data versions */
> > > +static struct migrate_data data_versions[] = {
> > > +	{
> > > +		1, sections_v1,
> > > +	},
> > 
> > I realise it's a bit less "declarative", but maybe just a "receive"
> > function for each version which open codes sending each section would
> > be more flexible, and not that much more verbose.
> 
> It's 18 lines of migrate_source() and 17 lines of migrate_target()
> right now, so it's not really *that* short, and if it can be
> declarative (I'm finding this quite convenient as I'm *using* it), I
> think it should be. What else would you like to add?
> 
> Besides, it can always be changed later as a use case appears.
> 
> > > +	{ 0 },
> > > +};
> > > +
> > > +/* Handlers to call in source before sending data */
> > > +struct migrate_handler handlers_source_pre[] = {
> > > +	{ 0 },
> > > +};
> > > +
> > > +/* Handlers to call in source after sending data */
> > > +struct migrate_handler handlers_source_post[] = {
> > > +	{ 0 },
> > > +};
> > 
> > I also wonder if these tables of handlers are overkill, rather than
> > just having a function that calls all the things in the right order.
> 
> That risks turning into a long series of:
> 
> 	ret = handler(c);
> 	if (ret)
> 		return ret;
> 
> 	...
> 
> but anyway, let me finish writing the code using this first, then we'll
> know, instead of speculating.

Ah, yeah, bulky C error handling, that's a good point.
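
For the record, the open-coded alternative I was picturing is roughly
the following (just a sketch, with invented handler names, not a
proposal against this patch):

	static int migrate_source_pre(struct ctx *c, struct migrate_meta *m)
	{
		int rc;

		/* one explicit call per subsystem, in a fixed order */
		if ((rc = flow_migrate_source_pre(c, m)))	/* hypothetical */
			return rc;
		if ((rc = tcp_migrate_source_pre(c, m)))	/* hypothetical */
			return rc;

		return 0;
	}

...which, as you say, repeats the rc/return dance for every call, so no
strong objection to keeping the tables.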
> > > +
> > > +/* Handlers to call in target before receiving data with version 1 */
> > > +struct migrate_handler handlers_target_pre_v1[] = {
> > > +	{ 0 },
> > > +};
> > > +
> > > +/* Handlers to call in target after receiving data with version 1 */
> > > +struct migrate_handler handlers_target_post_v1[] = {
> > > +	{ 0 },
> > > +};
> > > +
> > > +/* Versioned sets of migration handlers */
> > > +struct migrate_target_handlers target_handlers[] = {
> > > +	{
> > > +		1,
> > > +		handlers_target_pre_v1,
> > > +		handlers_target_post_v1,
> > > +	},
> > > +	{ 0 },
> > > +};
> > > +
> > > +/**
> > > + * migrate_source_pre() - Pre-migration tasks as source
> > > + * @c: Execution context
> > > + * @m: Migration metadata
> > > + *
> > > + * Return: 0 on success, error code on failure
> > > + */
> > > +int migrate_source_pre(struct ctx *c, struct migrate_meta *m)
> > > +{
> > > +	struct migrate_handler *h;
> > > +
> > > +	for (h = handlers_source_pre; h->fn; h++) {
> > > +		int rc;
> > > +
> > > +		if ((rc = h->fn(c, m)))
> > > +			return rc;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * migrate_source() - Perform migration as source: send state to hypervisor
> > > + * @fd: Descriptor for state transfer
> > > + * @m: Migration metadata
> > > + *
> > > + * Return: 0 on success, error code on failure
> > > + */
> > > +int migrate_source(int fd, const struct migrate_meta *m)
> > > +{
> > > +	static struct migrate_data *d;
> > > +	int count, rc;
> > > +
> > > +	(void)m;
> > > +
> > > +	for (d = data_versions; d->v != MIGRATE_VERSION; d++);
> > > +
> > > +	for (count = 0; d->sections[count].iov_len; count++);
> > > +
> > > +	debug("Writing %u migration sections", count - 1 /* minus header */);
> > > +	rc = write_remainder(fd, d->sections, count, 0);
> > > +	if (rc < 0)
> > > +		return errno;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * migrate_source_post() - Post-migration tasks as source
> > > + * @c: Execution context
> > > + * @m: Migration metadata
> > > + *
> > > + * Return: 0 on success, error code on failure
> > > + */
> > > +void migrate_source_post(struct ctx *c, struct migrate_meta *m)
> > > +{
> > > +	struct migrate_handler *h;
> > > +
> > > +	for (h = handlers_source_post; h->fn; h++)
> > > +		h->fn(c, m);
> > > +}
> > > +
> > > +/**
> > > + * migrate_target_read_header() - Set metadata in target from source header
> > > + * @fd: Descriptor for state transfer
> > > + * @m: Migration metadata, filled on return
> > > + *
> > > + * Return: 0 on success, error code on failure
> > > + */
> > > +int migrate_target_read_header(int fd, struct migrate_meta *m)
> > > +{
> > > +	static struct migrate_data *d;
> > > +	union migrate_header h;
> > > +
> > > +	if (read_all_buf(fd, &h, sizeof(h)))
> > > +		return errno;
> > > +
> > > +	debug("Source magic: 0x%016" PRIx64 ", sizeof(void *): %u, version: %u",
> > > +	      h.magic, ntohl(h.voidp_size), ntohl(h.version));
> > > +
> > > +	for (d = data_versions; d->v != ntohl(h.version) && d->v; d++);
> > > +	if (!d->v)
> > > +		return ENOTSUP;
> > > +	m->v = d->v;
> > > +
> > > +	if (h.magic == MIGRATE_MAGIC)
> > > +		m->bswap = false;
> > > +	else if (h.magic == MIGRATE_MAGIC_SWAPPED)
> > > +		m->bswap = true;
> > > +	else
> > > +		return ENOTSUP;
> > > +
> > > +	if (ntohl(h.voidp_size) == 4)
> > > +		m->source_64b = false;
> > > +	else if (ntohl(h.voidp_size) == 8)
> > > +		m->source_64b = true;
> > > +	else
> > > +		return ENOTSUP;
> > > +
> > > +	if (ntohl(h.time_t_size) == 4)
> > > +		m->time_64b = false;
> > > +	else if (ntohl(h.time_t_size) == 8)
> > > +		m->time_64b = true;
> > > +	else
> > > +		return ENOTSUP;
> > > +
> > > +	m->flow_size = ntohl(h.flow_size);
> > > +	m->flow_sidx_size = ntohl(h.flow_sidx_size);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * migrate_target_pre() - Pre-migration tasks as target
> > > + * @c: Execution context
> > > + * @m: Migration metadata
> > > + *
> > > + * Return: 0 on success, error code on failure
> > > + */
> > > +int migrate_target_pre(struct ctx *c, struct migrate_meta *m)
> > > +{
> > > +	struct migrate_target_handlers *th;
> > > +	struct migrate_handler *h;
> > > +
> > > +	for (th = target_handlers; th->v != m->v && th->v; th++);
> > > +
> > > +	for (h = th->pre; h->fn; h++) {
> > > +		int rc;
> > > +
> > > +		if ((rc = h->fn(c, m)))
> > > +			return rc;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * migrate_target() - Perform migration as target: receive state from hypervisor
> > > + * @fd: Descriptor for state transfer
> > > + * @m: Migration metadata
> > > + *
> > > + * Return: 0 on success, error code on failure
> > > + *
> > > + * #syscalls:vu readv
> > > + */
> > > +int migrate_target(int fd, const struct migrate_meta *m)
> > > +{
> > > +	static struct migrate_data *d;
> > > +	unsigned cnt;
> > > +	int rc;
> > > +
> > > +	for (d = data_versions; d->v != m->v && d->v; d++);
> > > +
> > > +	for (cnt = 0; d->sections[cnt + 1 /* skip header */].iov_len; cnt++);
> > > +
> > > +	debug("Reading %u migration sections", cnt);
> > > +	rc = read_remainder(fd, d->sections + 1, cnt, 0);
> > > +	if (rc < 0)
> > > +		return errno;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * migrate_target_post() - Post-migration tasks as target
> > > + * @c: Execution context
> > > + * @m: Migration metadata
> > > + */
> > > +void migrate_target_post(struct ctx *c, struct migrate_meta *m)
> > > +{
> > > +	struct migrate_target_handlers *th;
> > > +	struct migrate_handler *h;
> > > +
> > > +	for (th = target_handlers; th->v != m->v && th->v; th++);
> > > +
> > > +	for (h = th->post; h->fn; h++)
> > > +		h->fn(c, m);
> > > +}
> > > diff --git a/migrate.h b/migrate.h
> > > new file mode 100644
> > > index 0000000..f9635ac
> > > --- /dev/null
> > > +++ b/migrate.h
> > > @@ -0,0 +1,88 @@
> > > +/* SPDX-License-Identifier: GPL-2.0-or-later
> > > + * Copyright (c) 2025 Red Hat GmbH
> > > + * Author: Stefano Brivio
> > > + */
> > > +
> > > +#ifndef MIGRATE_H
> > > +#define MIGRATE_H
> > > +
> > > +/**
> > > + * struct migrate_meta - Migration metadata
> > > + * @v: Chosen migration data version, host order
> > > + * @bswap: Source has opposite endianness
> > > + * @peer_64b: Source uses 64-bit void *
> > > + * @time_64b: Source uses 64-bit time_t
> > > + * @flow_size: Size of union flow in source
> > > + * @flow_sidx_size: Size of struct flow_sidx in source
> > > + */
> > > +struct migrate_meta {
> > > +	uint32_t v;
> > > +	bool bswap;
> > > +	bool source_64b;
> > > +	bool time_64b;
> > > +	size_t flow_size;
> > > +	size_t flow_sidx_size;
> > > +};
> > > +
> > > +/**
> > > + * union migrate_header - Migration header from source
> > > + * @magic: 0xB1BB1D1B0BB1D1B0, host order
> > > + * @version: Source sends highest known, target aborts if unsupported
> > > + * @voidp_size: sizeof(void *), network order
> > > + * @time_t_size: sizeof(time_t), network order
> > > + * @flow_size: sizeof(union flow), network order
> > > + * @flow_sidx_size: sizeof(struct flow_sidx_t), network order
> > > + * @unused: Go figure
> > > + */
> > > +union migrate_header {
> > > +	struct {
> > > +		uint64_t magic;
> > > +		uint32_t version;
> > > +		uint32_t voidp_size;
> > > +		uint32_t time_t_size;
> > > +		uint32_t flow_size;
> > > +		uint32_t flow_sidx_size;
> > > +	};
> > > +	uint8_t unused[65536];
> > > +};
> > > +
> > > +/**
> > > + * struct migrate_data - Data sections for given source version
> > > + * @v: Source version this applies to, host order
> > > + * @sections: Array of data sections, NULL-terminated
> > > + */
> > > +struct migrate_data {
> > > +	uint32_t v;
> > > +	struct iovec *sections;
> > > +};
> > > +
> > > +/**
> > > + * struct migrate_handler - Function to handle a specific data section
> > > + * @fn: Function pointer taking pointer to context and metadata
> > > + */
> > > +struct migrate_handler {
> > > +	int (*fn)(struct ctx *c, struct migrate_meta *m);
> > > +};
> > > +
> > > +/**
> > > + * struct migrate_target_handlers - Versioned sets of migration target handlers
> > > + * @v: Source version this applies to, host order
> > > + * @pre: Set of functions to execute in target before data copy
> > > + * @post: Set of functions to execute in target after data copy
> > > + */
> > > +struct migrate_target_handlers {
> > > +	uint32_t v;
> > > +	struct migrate_handler *pre;
> > > +	struct migrate_handler *post;
> > > +};
> > > +
> > > +int migrate_source_pre(struct ctx *c, struct migrate_meta *m);
> > > +int migrate_source(int fd, const struct migrate_meta *m);
> > > +void migrate_source_post(struct ctx *c, struct migrate_meta *m);
> > > +
> > > +int migrate_target_read_header(int fd, struct migrate_meta *m);
> > > +int migrate_target_pre(struct ctx *c, struct migrate_meta *m);
> > > +int migrate_target(int fd, const struct migrate_meta *m);
> > > +void migrate_target_post(struct ctx *c, struct migrate_meta *m);
> > > +
> > > +#endif /* MIGRATE_H */
> > > diff --git a/passt.c b/passt.c
> > > index b1c8ab6..184d4e5 100644
> > > --- a/passt.c
> > > +++ b/passt.c
> > > @@ -358,7 +358,7 @@ loop:
> > >  			vu_kick_cb(c.vdev, ref, &now);
> > >  			break;
> > >  		case EPOLL_TYPE_VHOST_MIGRATION:
> > > -			vu_migrate(c.vdev, eventmask);
> > > +			vu_migrate(&c, eventmask);
> > 
> > Not really a comment on this patch, but do we actually need/want to
> > put the migration fd into the epoll loop?  IIUC, everything we're
> > doing on the migration fd is synchronous and blocking, so we could run
> > the whole migration from the VHOST_USER_SET_DEVICE_STATE callback.
> 
> The epoll event bits are practical to have here, though.

No, not really; the direction is explicitly given in the
VHOST_USER_SET_DEVICE_STATE_FD request.  Indeed, I'd say that's the
correct source for this information: in practice we're getting a
single-direction socket for the fd, but I don't know that the protocol
guarantees that.  AFAICT, in theory a front-end saving state to disk
could just hand over a handle to a regular file, meaning the epoll bits
tell us nothing.

> > Or, maybe less confusing, flag the migration to run before we next
> > call epoll_wait(): having the migration run strictly in-between epoll
> > cycles seems like it might make some things easier to analyse.
> 
> Well, we don't really *need* to have the descriptor in the epoll list,
> but it's more consistent. Should we really introduce a different
> mechanism just because we don't strictly need that?
> 
> What's the actual problem with the current implementation?

It suggests that events on the device state fd can be interspersed with
other events: that's kind of the whole point of epoll.  But that's not
true: we want to freeze the world as much as possible for the
migration.  More specifically, the device state epoll event could occur
in between, say, an EPOLLIN for an external TCP socket, and the
deferred handling for that TCP flow.  I don't know if that could cause
trouble, but it seems like a case we're better off eliminating: we're
always going to have less variable state between epoll cycles than
within an epoll cycle.

[Not important right now, but that made me think of a potential
complication if we ever try to implement migration over -net stream: at
the moment reads & writes to the qemu socket are essentially
independent of each other.  But if qemu were to make a migrate request
over that, we'd need some sort of reply.  We can't really ensure that
that reply is the first thing qemu sees after sending the request,
because there might have been unrelated frames in flight already.  That
means qemu would have to deal with those incoming frames *while the
guest is already stopped*, which sounds rather awkward.]

> > >  			break;
> > >  		default:
> > >  			/* Can't happen */
> > > diff --git a/vu_common.c b/vu_common.c
> > > index f43d8ac..6c346c8 100644
> > > --- a/vu_common.c
> > > +++ b/vu_common.c
> > > @@ -5,6 +5,7 @@
> > >   * common_vu.c - vhost-user common UDP and TCP functions
> > >   */
> > >  
> > > +#include
> > >  #include
> > >  #include
> > >  #include
> > > @@ -17,6 +18,7 @@
> > >  #include "vhost_user.h"
> > >  #include "pcap.h"
> > >  #include "vu_common.h"
> > > +#include "migrate.h"
> > >  
> > >  #define VU_MAX_TX_BUFFER_NB	2
> > >  
> > > @@ -305,50 +307,90 @@ err:
> > >  }
> > >  
> > >  /**
> > > - * vu_migrate() - Send/receive passt insternal state to/from QEMU
> > > - * @vdev: vhost-user device
> > > + * vu_migrate_source() - Migration as source, send state to hypervisor
> > > + * @c: Execution context
> > > + * @fd: File descriptor for state transfer
> > > + *
> > > + * Return: 0 on success, positive error code on failure
> > > + */
> > > +static int vu_migrate_source(struct ctx *c, int fd)
> > > +{
> > > +	struct migrate_meta m;
> > > +	int rc;
> > > +
> > > +	if ((rc = migrate_source_pre(c, &m))) {
> > > +		err("Source pre-migration failed: %s, abort", strerror_(rc));
> > > +		return rc;
> > > +	}
> > > +
> > > +	debug("Saving backend state");
> > > +
> > > +	rc = migrate_source(fd, &m);
> > > +	if (rc)
> > > +		err("Source migration failed: %s", strerror_(rc));
> > > +	else
> > > +		migrate_source_post(c, &m);
> > > +
> > > +	return rc;
> > > +}
> > > +
> > > +/**
> > > + * vu_migrate_target() - Migration as target, receive state from hypervisor
> > > + * @c: Execution context
> > > + * @fd: File descriptor for state transfer
> > > + *
> > > + * Return: 0 on success, positive error code on failure
> > > + */
> > > +static int vu_migrate_target(struct ctx *c, int fd)
> > > +{
> > > +	struct migrate_meta m;
> > > +	int rc;
> > > +
> > > +	rc = migrate_target_read_header(fd, &m);
> > > +	if (rc) {
> > > +		err("Migration header check failed: %s, abort", strerror_(rc));
> > > +		return rc;
> > > +	}
> > > +
> > > +	if ((rc = migrate_target_pre(c, &m))) {
> > > +		err("Target pre-migration failed: %s, abort", strerror_(rc));
> > > +		return rc;
> > > +	}
> > > +
> > > +	debug("Loading backend state");
> > > +
> > > +	rc = migrate_target(fd, &m);
> > > +	if (rc)
> > > +		err("Target migration failed: %s", strerror_(rc));
> > > +	else
> > > +		migrate_target_post(c, &m);
> > > +
> > > +	return rc;
> > > +}
> > > +
> > > +/**
> > > + * vu_migrate() - Send/receive passt internal state to/from QEMU
> > > + * @c: Execution context
> > >   * @events: epoll events
> > >   */
> > > -void vu_migrate(struct vu_dev *vdev, uint32_t events)
> > > +void vu_migrate(struct ctx *c, uint32_t events)
> > >  {
> > > -	int ret;
> > > +	struct vu_dev *vdev = c->vdev;
> > > +	int rc = EIO;
> > >  
> > > -	/* TODO: collect/set passt internal state
> > > -	 * and use vdev->device_state_fd to send/receive it
> > > -	 */
> > >  	debug("vu_migrate fd %d events %x", vdev->device_state_fd, events);
> > > -	if (events & EPOLLOUT) {
> > > -		debug("Saving backend state");
> > > -
> > > -		/* send some stuff */
> > > -		ret = write(vdev->device_state_fd, "PASST", 6);
> > > -		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
> > > -		vdev->device_state_result = ret == -1 ? -1 : 0;
> > > -		/* Closing the file descriptor signals the end of transfer */
> > > -		epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
> > > -			  vdev->device_state_fd, NULL);
> > > -		close(vdev->device_state_fd);
> > > -		vdev->device_state_fd = -1;
> > > -	} else if (events & EPOLLIN) {
> > > -		char buf[6];
> > > -
> > > -		debug("Loading backend state");
> > > -		/* read some stuff */
> > > -		ret = read(vdev->device_state_fd, buf, sizeof(buf));
> > > -		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
> > > -		if (ret != sizeof(buf)) {
> > > -			vdev->device_state_result = -1;
> > > -		} else {
> > > -			ret = strncmp(buf, "PASST", sizeof(buf));
> > > -			vdev->device_state_result = ret == 0 ? 0 : -1;
> > > -		}
> > > -	} else if (events & EPOLLHUP) {
> > > -		debug("Closing migration channel");
> > > -
> > > -		/* The end of file signals the end of the transfer. */
> > > -		epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL,
> > > -			  vdev->device_state_fd, NULL);
> > > -		close(vdev->device_state_fd);
> > > -		vdev->device_state_fd = -1;
> > > -	}
> > > +
> > > +	if (events & EPOLLOUT)
> > > +		rc = vu_migrate_source(c, vdev->device_state_fd);
> > > +	else if (events & EPOLLIN)
> > > +		rc = vu_migrate_target(c, vdev->device_state_fd);
> > > +
> > > +	/* EPOLLHUP without EPOLLIN/EPOLLOUT, or EPOLLERR? Migration failed */
> > > +
> > > +	vdev->device_state_result = rc;
> > 
> > Again, not really a comment on this patch, but I think it would be
> > safer to set device_state_result to something non-zero (probably
> > EINPROGRESS) as soon as we get the VHOST_USER_SET_DEVICE_STATE_FD.
> > That way if somehow VHOST_USER_CHECK_DEVICE_STATE were called before
> > we actually do anything about the migration we'd correctly report that
> > it hasn't happened.
> 
> I see your point, but I'm not sure if it's correct to report
> EINPROGRESS while the migration is not actually in progress.

I mean.. it's in progress in the sense that the request has been issued
but we haven't done it yet.

> That is, if we get VHOST_USER_SET_DEVICE_STATE_FD, and then the
> hypervisor decides to just send VHOST_USER_CHECK_DEVICE_STATE right
> away, the migration isn't in progress. This is especially relevant as
> the vhost-user documentation says that:
> 
> 	"any non-zero value is an error"
> 
> Further, it also states, about VHOST_USER_CHECK_DEVICE_STATE, that:
> 
> 	The back-end responds once it knows whether the transfer and
> 	processing was successful or not.
> 
> so, strictly speaking, we should probably *not* respond to that before
> we're done here.
> On the other hand, we risk triggering issues/hangs in the hypervisor
> just for the sake of being correct, so I would rather report 0 as it
> would be (not actually happening) done now.

*thinks about the mechanics of delaying a response to a vhost-user
request*

I guess we can just ignore it, return to the epoll loop and assume the
event will trigger on the next loop as well.  Looks like we're not
using EPOLLET on the VU socket, so that should work.

Does anything define how VHOST_USER_CHECK_DEVICE_STATE should respond
if no migration is requested?  I guess it should be success, so that if
for some reason it were called twice after a successful migration it
would return success both times?

I still think this works well with the idea of putting migration
between epoll cycles rather than within.  We add a pending_migration
flag which is either NONE, SOURCE or TARGET.
VHOST_USER_SET_DEVICE_STATE_FD sets it based on the direction flag.
After post_handler(), we check it and, if it's non-NONE, we
synchronously complete the migration, then clear it.  In
VHOST_USER_CHECK_DEVICE_STATE, if it's non-NONE we do nothing and
return to the epoll loop.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
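
P.S. To put the pending_migration idea above in concrete terms, this is
roughly the shape I have in mind (field and enum names invented, and
vu_migrate() taking the requested direction instead of epoll events, so
purely a sketch, not against any actual tree):

	enum migrate_request {
		MIGRATE_NONE = 0,
		MIGRATE_SOURCE,
		MIGRATE_TARGET,
	};

	/* The VHOST_USER_SET_DEVICE_STATE_FD handler only records the
	 * request, from the direction flag, instead of transferring
	 * anything itself.
	 */

	/* In the passt.c main loop, after the deferred handlers for this
	 * epoll cycle and before the next epoll_wait():
	 */
	if (c.migrate_request != MIGRATE_NONE) {
		vu_migrate(&c, c.migrate_request); /* blocking, runs to completion */
		c.migrate_request = MIGRATE_NONE;
	}

That way the reply to VHOST_USER_CHECK_DEVICE_STATE only ever describes
a migration that has fully completed (or never started), not one that
is half-done.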