From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 30 Jan 2025 18:38:22 +1100
From: David Gibson
To: Stefano Brivio
CC: passt-dev@passt.top, Laurent Vivier
Subject: Re: [PATCH 6/7] Introduce facilities for guest migration on top of
 vhost-user infrastructure
References: <20250127231532.672363-1-sbrivio@redhat.com>
 <20250127231532.672363-7-sbrivio@redhat.com>
 <20250128075001.3557d398@elisabeth>
 <20250129083350.220a7ab0@elisabeth>
 <20250130055522.39acb265@elisabeth>
In-Reply-To: <20250130055522.39acb265@elisabeth>
List-Id: Development discussion and patches for passt
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256;
 protocol="application/pgp-signature"; boundary="s+E6DAfL8hAMt2Hs"
Content-Disposition: inline

--s+E6DAfL8hAMt2Hs
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Thu, Jan 30, 2025 at 05:55:22AM +0100, Stefano Brivio wrote:
> On Thu, 30 Jan 2025 11:48:19 +1100
> David Gibson wrote:
>
> > On Wed, Jan 29, 2025 at 08:33:50AM +0100, Stefano Brivio wrote:
> > > On Wed, 29 Jan 2025 12:16:58 +1100
> > > David Gibson wrote:
> > >
> > > > On Tue, Jan 28, 2025 at 07:50:01AM +0100, Stefano Brivio wrote:
> > > > > On Tue, 28 Jan 2025 12:40:12 +1100
> > > > > David Gibson wrote:
> > > > >
> > > > > > On Tue, Jan 28, 2025 at 12:15:31AM +0100, Stefano Brivio wrote:
> > > > > > > Add two sets (source or target) of three functions each for passt in
> > > > > > > vhost-user mode, triggered by activity on the file descriptor passed
> > > > > > > via VHOST_USER_PROTOCOL_F_DEVICE_STATE:
> > > > > > >
> > > > > > > - migrate_source_pre() and migrate_target_pre() are called to prepare
> > > > > > >   for migration, before data is transferred
> > > > > > >
> > > > > > > - migrate_source() sends, and migrate_target() receives migration data
> > > > > > >
> > > > > > > - migrate_source_post() and migrate_target_post() are responsible for
> > > > > > >   any post-migration task
> > > > > > >
> > > > > > > Callbacks are added to these functions with arrays of function
> > > > > > > pointers in migrate.c. Migration handlers are versioned.
> > > > > > >
> > > > > > > Versioned descriptions of data sections will be added to the
> > > > > > > data_versions array, which points to versioned iovec arrays. Version
> > > > > > > 1 is currently empty and will be filled in in subsequent patches.
> > > > > > >
> > > > > > > The source announces the data version to be used and informs the peer
> > > > > > > about endianness, and the size of void *, time_t, flow entries and
> > > > > > > flow hash table entries.
> > > > > > >
> > > > > > > The target checks if the version of the source is still supported. If
> > > > > > > it's not, it aborts the migration.
> > > > > > >
> > > > > > > Signed-off-by: Stefano Brivio
> > > > > > > ---
> > > > > > >  Makefile    |  12 +--
> > > > > > >  migrate.c   | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > > >  migrate.h   |  90 ++++++++++++++++++
> > > > > > >  passt.c     |   2 +-
> > > > > > >  vu_common.c | 122 ++++++++++++++++---------
> > > > > > >  vu_common.h |   2 +-
> > > > > > >  6 files changed, 438 insertions(+), 49 deletions(-)
> > > > > > >  create mode 100644 migrate.c
> > > > > > >  create mode 100644 migrate.h
> > > > > > >
> > > > > > > diff --git a/Makefile b/Makefile
> > > > > > > index 464eef1..1383875 100644
> > > > > > > --- a/Makefile
> > > > > > > +++ b/Makefile
> > > > > > > @@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
> > > > > > >
> > > > > > >  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
> > > > > > >  	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
> > > > > > > -	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
> > > > > > > -	tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > > > > > > +	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
> > > > > > > +	tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> > > > > > >  	vhost_user.c virtio.c vu_common.c
> > > > > > >  QRAP_SRCS = qrap.c
> > > > > > >  SRCS = $(PASST_SRCS) $(QRAP_SRCS)
> > > > > > > @@ -48,10 +48,10 @@ MANPAGES = passt.1 pasta.1 qrap.1
> > > > > > >
> > > > > > >  PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
> > > > > > >  	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
> > > > > > > -	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
> > > > > > > -	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
> > > > > > > -	tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
> > > > > > > -	virtio.h vu_common.h
> > > > > > > +	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
> > > > > > > +	pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
> > > > > > > +	tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
> > > > > > > +	vhost_user.h virtio.h vu_common.h
> > > > > > >  HEADERS = $(PASST_HEADERS) seccomp.h
> > > > > > >
> > > > > > >  C := \#include \nint main(){int a=getrandom(0, 0, 0);}
> > > > > > > diff --git a/migrate.c b/migrate.c
> > > > > > > new file mode 100644
> > > > > > > index 0000000..bee9653
> > > > > > > --- /dev/null
> > > > > > > +++ b/migrate.c
> > > > > > > @@ -0,0 +1,259 @@
> > > > > > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > > > > > +
> > > > > > > +/* PASST - Plug A Simple Socket Transport
> > > > > > > + *  for qemu/UNIX domain socket mode
> > > > > > > + *
> > > > > > > + * PASTA - Pack A Subtle Tap Abstraction
> > > > > > > + *  for network namespace/tap device mode
> > > > > > > + *
> > > > > > > + * migrate.c - Migration sections, layout, and routines
> > > > > > > + *
> > > > > > > + * Copyright (c) 2025 Red Hat GmbH
> > > > > > > + * Author: Stefano Brivio
> > > > > > > + */
> > > > > > > +
> > > > > > > +#include
> > > > > > > +#include
> > > > > > > +
> > > > > > > +#include "util.h"
> > > > > > > +#include "ip.h"
> > > > > > > +#include "passt.h"
> > > > > > > +#include "inany.h"
> > > > > > > +#include "flow.h"
> > > > > > > +#include "flow_table.h"
> > > > > > > +
> > > > > > > +#include "migrate.h"
> > > > > > > +
> > > > > > > +/* Current version of migration data */
> > > > > > > +#define MIGRATE_VERSION        1
> > > > > > > +
> > > > > > > +/* Magic as we see it and as seen with reverse endianness */
> > > > > > > +#define MIGRATE_MAGIC          0xB1BB1D1B0BB1D1B0
> > > > > > > +#define MIGRATE_MAGIC_SWAPPED  0xB0D1B10B1B1DBBB1
> > > > > >
> > > > > > As noted, I'm hoping we can get rid of "either endian" migration. But
> > > > > > if this stays, we should define it using __bswap_constant_32() to
> > > > > > avoid embarrassing mistakes.
> > > > >
> > > > > Those always give me issues on musl,
> > > >
> > > > What sort of issues? We're already using them, and have fallback
> > > > versions defined in util.h
> > >
> > > The very issues that brought me to introduce those fallback versions,
> > > so I'm instinctively reluctant to use them.
> > >
> > > Actually, I think it's even clearer to have this spelt out (I always
> > > need to stop for a moment and think: what happens when I cross the
> > > 32-bit boundary?).
> >
> > Oh, yes, we'd need to add a __bswap_constant_64() for this.
>
> ...which doesn't exist on musl. On current Alpine Edge:
>
> util.h:131:34: error: implicit declaration of function '__bswap_constant_64' [-Wimplicit-function-declaration]
>   131 | #define htonll_constant(x) (__bswap_constant_64(x))
>       |                             ^~~~~~~~~~~~~~~~~~~
>
> ...so rather than adding an implementation for this single usage, which
> makes it actually less clear to me, I would keep it like it is.

Very well.
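
If the literal stays, one way to guard against a mistyped constant is a
compile-time check. This is only a minimal sketch, not something from the
patch: BSWAP_CONST_64() is a hypothetical local macro (it is not provided
by glibc, musl or util.h), and only the two MIGRATE_MAGIC values are taken
from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Values quoted from the patch under review */
#define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
#define MIGRATE_MAGIC_SWAPPED	0xB0D1B10B1B1DBBB1

/* Hypothetical helper, assumed for illustration only: reverse the bytes
 * of a 64-bit constant with shifts and masks, so that the result is
 * still an integer constant expression.
 */
#define BSWAP_CONST_64(x)						\
	((((uint64_t)(x) & 0x00000000000000ffULL) << 56) |		\
	 (((uint64_t)(x) & 0x000000000000ff00ULL) << 40) |		\
	 (((uint64_t)(x) & 0x0000000000ff0000ULL) << 24) |		\
	 (((uint64_t)(x) & 0x00000000ff000000ULL) <<  8) |		\
	 (((uint64_t)(x) & 0x000000ff00000000ULL) >>  8) |		\
	 (((uint64_t)(x) & 0x0000ff0000000000ULL) >> 24) |		\
	 (((uint64_t)(x) & 0x00ff000000000000ULL) >> 40) |		\
	 (((uint64_t)(x) & 0xff00000000000000ULL) >> 56))

/* The build fails if the hand-written constant is not the byte swap */
static_assert(BSWAP_CONST_64(MIGRATE_MAGIC) == MIGRATE_MAGIC_SWAPPED,
	      "MIGRATE_MAGIC_SWAPPED must be the byte swap of MIGRATE_MAGIC");
```

With something like this, the spelt-out literal stays readable, and a typo
in either constant would break the build rather than a migration.
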
> > [snip]
> > > > > > > +/**
> > > > > > > + * union migrate_header - Migration header from source
> > > > > > > + * @magic:          0xB1BB1D1B0BB1D1B0, host order
> > > > > > > + * @version:        Source sends highest known, target aborts if unsupported
> > > > > > > + * @voidp_size:     sizeof(void *), network order
> > > > > > > + * @time_t_size:    sizeof(time_t), network order
> > > > > > > + * @flow_size:      sizeof(union flow), network order
> > > > > > > + * @flow_sidx_size: sizeof(struct flow_sidx_t), network order
> > > > > > > + * @unused:         Go figure
> > > > > > > + */
> > > > > > > +union migrate_header {
> > > > > > > +	struct {
> > > > > > > +		uint64_t magic;
> > > > > > > +		uint32_t version;
> > > > > > > +		uint32_t voidp_size;
> > > > > > > +		uint32_t time_t_size;
> > > > > > > +		uint32_t flow_size;
> > > > > > > +		uint32_t flow_sidx_size;
> > > > > > > +	};
> > > > > > > +	uint8_t unused[65536];
> > > > > >
> > > > > > So, having looked at this, I no longer think padding the header to 64kiB
> > > > > > is a good idea. The structure means we're basically stuck always
> > > > > > having that chunky header. Instead, I think the header should be
> > > > > > absolutely minimal: basically magic and version only. v1 (and maybe
> > > > > > others) can add a "metadata" or whatever section for additional
> > > > > > information like this they need.
> > > > >
> > > > > The header is processed by the target in a separate, preliminary step,
> > > > > though.
> > > > >
> > > > > That's why I added metadata right in the header: if the target needs to
> > > > > abort the migration because, say, the size of a flow entry is too big
> > > > > to handle for a particular version, then we should know that before
> > > > > migrate_target_pre().
> > > >
> > > > Ah, yes, I missed that, we'd need a more complex design to do
> > > > additional transfers and checks before making the target_pre
> > > > callbacks.
> > > >
> > > > > As long as we check the version first, we can always shrink the header
> > > > > later on.
> > > >
> > > > *thinks*.. I guess so, though it's kind of awkward; a future version
> > > > would have to read the "header of the header", check the version, then
> > > > if it's the old one, read the remainder of the 64kiB block.
> > > >
> > > > I still think we should clearly separate the part that we're
> > > > committing to being in every future version (which I think should just
> > > > be magic and version), from the stuff that's just v1.
> > >
> > > Sure, I can add a comment.
> > >
> > > > > But having 64 KiB reserved looks more robust because it's a
> > > > > safe place to add this kind of metadata.
> > > > >
> > > > > Note that 64 KiB is typically transferred in a single read/write
> > > > > from/to the vhost-user back-end.
> > > >
> > > > Ok, but it also has to go over the qemu migration channel, which will
> > > > often be a physical link, not a super-fast local/virtual one, and may
> > > > be bandwidth capped as well. I'm not actually certain if 64kiB is
> > > > likely to be a problem there, but it *is* large compared to the state
> > > > blobs of most qemu devices (usually only a few hundred bytes).
> > >
> > > Even if we transfer just what we need of a flow, it's still something
> > > well in excess of 50 bytes each. 100k flows would be 5 megs.
> >
> > Just transferring the in-use flows would be higher priority than being
> > selective about what we send within each flow.
>
> Well, of course, I meant that we would only transfer used flows at that
> point, because it's not about transferring the flow table as a whole,
> with none of the advantages and disadvantages of it.
>
> But still one can have 128k flows at the moment.

Right, but in the present draft you pay that cost whether or not
you're actually using the flows.
Unfortunately a busy server with heaps of active connections is
exactly the case that's likely to be most sensitive to additional
downtime, but there's not really any getting around that. A machine
with a lot of state will need either high downtime or high migration
bandwidth.

But, I'm really hoping we can move relatively quickly to a model where
a guest with only a handful of connections _doesn't_ have to pay that
128k flow cost - and can consequently migrate ok even with quite
constrained migration bandwidth. In that scenario the size of the
header could become significant.

> > It's both easier to do
> > and a bigger win in most cases. That would dramatically reduce the
> > size sent here.
>
> Yep, feel free.

It's on my queue for the next few days.

> > > Well, anyway, let's cut this down to 4k, which should be enough, so
> > > that it's not a topic anymore.
> >
> > I still think it's ugly, but whatever.
>
> Same here.
>

-- 
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson

--s+E6DAfL8hAMt2Hs--