Date: Thu, 30 Jan 2025 19:54:17 +1100
From: David Gibson
To: Stefano Brivio
Subject: Re: [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure
References: <20250127231532.672363-1-sbrivio@redhat.com>
 <20250127231532.672363-7-sbrivio@redhat.com>
 <20250128075001.3557d398@elisabeth>
 <20250129083350.220a7ab0@elisabeth>
 <20250130055522.39acb265@elisabeth>
 <20250130093236.117c3fd0@elisabeth>
In-Reply-To: <20250130093236.117c3fd0@elisabeth>
CC: passt-dev@passt.top, Laurent Vivier
List-Id: Development discussion and patches for passt

On Thu, Jan 30, 2025 at 09:32:36AM +0100, Stefano Brivio wrote:
> On Thu, 30 Jan 2025 18:38:22 +1100
> David Gibson wrote:
>
> > Right, but in the present draft you pay that cost whether or not
> > you're actually using the flows. Unfortunately a busy server with
> > heaps of active connections is exactly the case that's likely to be
> > most sensitive to additional downtime, but there's not really any
> > getting around that. A machine with a lot of state will need either
> > high downtime or high migration bandwidth.
>
> It's... sixteen megabytes. A KubeVirt node is only allowed to perform up
> to _four_ migrations in parallel, and that's our main use case at the
> moment. "High downtime" is kind of relative.

Certainly. But I believe it's typical to aim for downtimes in the
~100ms range.

> > But, I'm really hoping we can move relatively quickly to a model where
> > a guest with only a handful of connections _doesn't_ have to pay that
> > 128k flow cost - and can consequently migrate ok even with quite
> > constrained migration bandwidth. In that scenario the size of the
> > header could become significant.
>
> I think the biggest cost of the full flow table transfer is rather code
> that's a bit quicker to write (I just managed to properly set sequences
> on the target, connections don't quite "flow" yet) but relatively high
> maintenance (as you mentioned, we need to be careful about every single
> field) and easy to break.

Right. And with this draft we can't even change the size of the flow
table without breaking migration. That seems like a thing we might
well want to change.

> I would like to quickly complete the whole flow first, because I think
> we can inform design and implementation decisions much better at that
> point, and we can be sure it's feasible,

That's fair.

> but I'm not particularly keen
> to merge this patch like it is, if we can switch it relatively swiftly
> to an implementation where we model a smaller fixed-endian structure
> with just the stuff we need.

So, there are kind of two parts to this:

1) Only transferring active flow entries, and not transferring the
   hash table

I think this is pretty easy. It could be done with or without
preserving flow indices. Preserving makes for debug log continuity
between the ends, but not preserving lets us change the size of the
flow table without breaking migration.

2) Only transferring the necessary pieces of each entry, and using a
   fixed representation of each piece

This is harder. Not *super* hard, I think, but definitely trickier
than (1).

> And again, to be a bit more sure of which stuff we need in it, the full
> flow is useful to have implemented.
>
> Actually the biggest complications I see in switching to that approach,
> from the current point, are that we need to, I guess:
>
> 1. model arrays (not really complicated by itself)

So here, I actually think this is simpler if we don't attempt to have
a declarative approach to defining the protocol, but just write
functions to implement it.

> 2. have a temporary structure where we store flows instead of using the
>    flow table directly (meaning that the "data model" needs to logically
>    decouple source and destination of the copy)

Right. I'd really prefer to "stream" in the entries one by one,
rather than having a big staging area. That's even harder to do
declaratively, but I think the other advantages are worth it.

> 3. batch stuff to some extent. We'll call socket() and connect() once
>    for each socket anyway, obviously, but sending one message to the
>    TCP_REPAIR helper for each socket looks like a rather substantial
>    and avoidable overhead

I don't think this actually has a lot of bearing on the protocol. I'd
envisage migrate_target() decodes all the information into the
target's flow table, then migrate_target_post() steps through all the
flows re-establishing the connections. Since we've already parsed the
protocol at that point, we can make multiple passes: one to gather
batches and set TCP_REPAIR, another through each entry to set the
values, and a final one to clear TCP_REPAIR in batches.

> > > > It's both easier to do
> > > > and a bigger win in most cases. That would dramatically reduce the
> > > > size sent here.
> > >
> > > Yep, feel free.
> >
> > It's on my queue for the next few days.
>
> To me this part actually looks like the biggest priority after/while
> getting the whole thing to work, because we can start right with a 'v1'
> which looks more sustainable.
>
> And I would just get stuff working on x86_64 in that case, without even
> implementing conversions and endianness switches etc.

Right. Given the number of options here, I think it would be safest
to go in expecting to go through a few throwaway protocol versions
before reaching one we're happy enough to support long term.

To ease that process, I'm wondering if we should add a default-off
command line option to enable migration. For now, enabling it would
print some sort of "migration is experimental!" warning.
Once we have a stream format we're ok with, we can flip it to
on-by-default, but we don't maintain receive compatibility for the
experimental versions leading up to that.

--
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
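
To illustrate parts (1) and (2) above, here is a minimal sketch in C.
It is not the passt code: struct demo_flow, its fields, the wire layout
and the demo_* function names are all hypothetical, and it works on a
memory buffer rather than the migration channel. It sends only active
entries, each in a fixed big-endian layout, and decodes them one at a
time straight into the target table without preserving indices, so the
two ends may use different table sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified flow entry: NOT passt's real flow layout */
struct demo_flow {
	int      active;	/* only active entries are transferred */
	uint32_t seq;		/* e.g. a TCP sequence number */
	uint16_t sport;
	uint16_t dport;
};

/* Assumed wire format: seq (4 bytes) + sport (2) + dport (2), big-endian */
#define DEMO_WIRE_ENTRY_SIZE 8

static void put_be16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v >> 8);
	p[1] = (uint8_t)(v & 0xff);
}

static void put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)((v >> 16) & 0xff);
	p[2] = (uint8_t)((v >> 8) & 0xff);
	p[3] = (uint8_t)(v & 0xff);
}

static uint16_t get_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Write a 4-byte entry count, then each active entry in wire format.
 * Returns the number of bytes written, or 0 if the buffer is too small.
 */
size_t demo_encode(const struct demo_flow *tab, size_t n,
		   uint8_t *buf, size_t len)
{
	uint32_t count = 0;
	size_t i, off = 4;

	assert(tab && buf);
	if (len < 4)
		return 0;

	for (i = 0; i < n; i++) {
		if (!tab[i].active)
			continue;	/* inactive entries are never sent */
		if (len - off < DEMO_WIRE_ENTRY_SIZE)
			return 0;
		put_be32(buf + off, tab[i].seq);
		put_be16(buf + off + 4, tab[i].sport);
		put_be16(buf + off + 6, tab[i].dport);
		off += DEMO_WIRE_ENTRY_SIZE;
		count++;
	}
	put_be32(buf, count);
	return off;
}

/* Decode entries one by one into the target table, packing them from
 * index 0. Returns the number of flows, or -1 on malformed input or if
 * the target table is too small.
 */
int demo_decode(const uint8_t *buf, size_t len,
		struct demo_flow *tab, size_t n)
{
	uint32_t count, i;
	size_t off = 4;

	assert(tab && buf);
	if (len < 4)
		return -1;
	count = get_be32(buf);
	if (count > n || (len - 4) / DEMO_WIRE_ENTRY_SIZE < count)
		return -1;

	for (i = 0; i < count; i++, off += DEMO_WIRE_ENTRY_SIZE) {
		tab[i].active = 1;
		tab[i].seq = get_be32(buf + off);
		tab[i].sport = get_be16(buf + off + 4);
		tab[i].dport = get_be16(buf + off + 6);
	}
	return (int)count;
}
```

With a layout like this, the transfer size scales with the number of
active flows rather than with the table size. Preserving flow indices
instead would mean adding an index field to each wire entry.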