Date: Mon, 3 Feb 2025 11:47:55 +1100
From: David Gibson
To: Stefano Brivio
Cc: passt-dev@passt.top, Laurent Vivier
Subject: Re: [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure
In-Reply-To: <20250131100931.714531ee@elisabeth>

On Fri, Jan 31, 2025 at 10:09:31AM +0100, Stefano Brivio wrote:
> On Fri, 31 Jan 2025 17:32:05 +1100
> David Gibson wrote:
> 
> > On Fri, Jan 31, 2025 at 06:46:21AM +0100, Stefano Brivio wrote:
> > > On Thu, 30 Jan 2025 19:54:17 +1100
> > > David Gibson wrote:
> > > 
> > > > On Thu, Jan 30, 2025 at 09:32:36AM +0100, Stefano Brivio wrote:
> > > > > On Thu, 30 Jan 2025 18:38:22 +1100
> > > > > David Gibson wrote:
> > > > > 
> > > > > > Right, but in the present draft you pay that cost whether or not
> > > > > > you're actually using the flows. Unfortunately a busy server with
> > > > > > heaps of active connections is exactly the case that's likely to be
> > > > > > most sensitive to additional downtime, but there's not really any
> > > > > > getting around that. A machine with a lot of state will need either
> > > > > > high downtime or high migration bandwidth.
> > > > > 
> > > > > It's... sixteen megabytes. A KubeVirt node is only allowed to perform up
> > > > > to _four_ migrations in parallel, and that's our main use case at the
> > > > > moment. "High downtime" is kind of relative.
> > > > 
> > > > Certainly. But I believe it's typical to aim for downtimes in the
> > > > ~100ms range.
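For scale, a back-of-the-envelope figure (the 1Gbps link is an assumed
number, purely for illustration): sixteen megabytes is on the order of
130 megabits, so with a dedicated 1Gbps of migration bandwidth the flow
table alone costs ~130ms of transfer time, already past a ~100ms
downtime target.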
> > > > > > But, I'm really hoping we can move relatively quickly to a model where
> > > > > > a guest with only a handful of connections _doesn't_ have to pay that
> > > > > > 128k flow cost - and can consequently migrate ok even with quite
> > > > > > constrained migration bandwidth. In that scenario the size of the
> > > > > > header could become significant.
> > > > > 
> > > > > I think the biggest cost of the full flow table transfer is rather code
> > > > > that's a bit quicker to write (I just managed to properly set sequences
> > > > > on the target, connections don't quite "flow" yet) but relatively high
> > > > > maintenance (as you mentioned, we need to be careful about every single
> > > > > field) and easy to break.
> > > > 
> > > > Right. And with this draft we can't even change the size of the flow
> > > > table without breaking migration. That seems like a thing we might
> > > > well want to change.
> > > 
> > > Why? The size of the flow table hasn't changed since it was added. I
> > 
> > Which wasn't _that_ long ago. It just seems like a really obvious
> > constant to tune to me, and one which it would be surprising if it
> > broke migration.
> > 
> > > don't see a reason to improve this if we don't want to transfer the
> > > flow table anyway.
> > 
> > I don't follow. Do you mean not transferring the hash table? This is
> > not relevant to that, I'm talking about the size of the base flow
> > table, not the hash table. Or do you mean not transferring the flow
> > table as a whole, but rather entry by entry? In that case I'm seeing
> > it as exactly the mechanism to improve this.
> 
> I mean transferring the flow table entry by entry.
> 
> I think we need to jump to that directly, because of something else I
> just found: if we don't transfer one struct tcp_repair_window, the
> connection works at a basic level but we don't really comply with RFC
> 9293 anymore, I think, because we might shrink the window.
> 
> And that struct is 20 bytes. We just reached 124 bytes with the
> socket-side sequence numbers, so now we would exceed 128 bytes.

Ok. This is next on my list.
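For reference, a minimal sketch of what transferring that window state
involves (an illustration only, not the actual migration code: error
handling is omitted, and the socket is assumed to already be in repair
mode on both sides):

#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>	/* TCP_REPAIR_WINDOW, struct tcp_repair_window */

/* Source side: snapshot the five 32-bit window fields (20 bytes) */
static int tcp_window_get(int s, struct tcp_repair_window *wnd)
{
	socklen_t len = sizeof(*wnd);

	return getsockopt(s, IPPROTO_TCP, TCP_REPAIR_WINDOW, wnd, &len);
}

/* Target side: restore snd_wl1, snd_wnd, max_window, rcv_wnd and
 * rcv_wup, so the advertised window can't shrink across the migration
 */
static int tcp_window_set(int s, const struct tcp_repair_window *wnd)
{
	return setsockopt(s, IPPROTO_TCP, TCP_REPAIR_WINDOW,
			  wnd, sizeof(*wnd));
}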
> > > > > I would like to quickly complete the whole flow first, because I think
> > > > > we can inform design and implementation decisions much better at that
> > > > > point, and we can be sure it's feasible,
> > > > 
> > > > That's fair.
> > > > 
> > > > > but I'm not particularly keen
> > > > > to merge this patch like it is, if we can switch it relatively swiftly
> > > > > to an implementation where we model a smaller fixed-endian structure
> > > > > with just the stuff we need.
> > > > 
> > > > So, there are kind of two parts to this:
> > > > 
> > > > 1) Only transferring active flow entries, and not transferring the
> > > > hash table
> > > > 
> > > > I think this is pretty easy. It could be done with or without
> > > > preserving flow indices. Preserving makes for debug log continuity
> > > > between the ends, but not preserving lets us change the size of the
> > > > flow table without breaking migration.
> > > 
> > > I would just add prints on migration showing how old flow indices map
> > > to new ones.
> > 
> > That's possible, although it would mean transferring the old indices,
> > which is not otherwise strictly necessary. What we could do easily is
> > a debug log similar to the "new flow" logs but for "immigrated flow".
> 
> Transferring them if we have a dedicated structure shouldn't be that
> bad: we don't need to store them.

True.

> > > > 2) Only transferring the necessary pieces of each entry, and using a
> > > > fixed representation of each piece
> > > > 
> > > > This is harder. Not *super* hard, I think, but definitely trickier
> > > > than (1)
> > > > 
> > > > > And again, to be a bit more sure of which stuff we need in it, the full
> > > > > flow is useful to have implemented.
> > > > > 
> > > > > Actually the biggest complications I see in switching to that approach,
> > > > > from the current point, are that we need to, I guess:
> > > > > 
> > > > > 1. model arrays (not really complicated by itself)
> > > > 
> > > > So here, I actually think this is simpler if we don't attempt to have
> > > > a declarative approach to defining the protocol, but just write
> > > > functions to implement it.
> > > > 
> > > > > 2. have a temporary structure where we store flows instead of using the
> > > > > flow table directly (meaning that the "data model" needs to logically
> > > > > decouple source and destination of the copy)
> > > > 
> > > > Right.. I'd really prefer to "stream" in the entries one by one,
> > > > rather than having a big staging area. That's even harder to do
> > > > declaratively, but I think the other advantages are worth it.
> > > > 
> > > > > 3. batch stuff to some extent. We'll call socket() and connect() once
> > > > > for each socket anyway, obviously, but sending one message to the
> > > > > TCP_REPAIR helper for each socket looks like a rather substantial
> > > > > and avoidable overhead
> > > > 
> > > > I don't think this actually has a lot of bearing on the protocol. I'd
> > > > envisage migrate_target() decodes all the information into the
> > > > target's flow table, then migrate_target_post() steps through all the
> > > > flows re-establishing the connections. Since we've already parsed the
> > > > protocol at that point, we can make multiple passes: one to gather
> > > > batches and set TCP_REPAIR, another through each entry to set the
> > > > values, and a final one to clear TCP_REPAIR in batches.
> > > 
> > > Ah, right, I didn't think of using the target flow table directly. That
> > > has the advantage that the current code I'm writing to reactivate flows
> > > from the flow table can be recycled as it is.
> > 
> > Possibly - it might need to do some slightly different things:
> > regenerating some fields from redundant data maybe, and/or re-hashing
> > the entries. But certainly the structure should be similar, yes.
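As a sketch of that multi-pass shape (the helper names and the socket
array here are invented for illustration, and in practice the
setsockopt() calls would be batched messages to the privileged
TCP_REPAIR helper rather than direct calls):

#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>	/* TCP_REPAIR */

static int tcp_repair_mode(int s, int on)
{
	return setsockopt(s, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on));
}

/* migrate_target_post() shape: the stream is already decoded into the
 * target flow table, so we can walk the flows multiple times
 */
static void flows_reactivate(const int *socks, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)		/* pass 1: set TCP_REPAIR */
		tcp_repair_mode(socks[i], 1);

	for (i = 0; i < n; i++) {
		/* pass 2: restore sequences, window parameters and
		 * queues for each flow while in repair mode
		 */
	}

	for (i = 0; i < n; i++)		/* pass 3: clear TCP_REPAIR */
		tcp_repair_mode(socks[i], 0);
}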
> > > > > > > > It's both easier to do
> > > > > > > > and a bigger win in most cases. That would dramatically reduce the
> > > > > > > > size sent here.
> > > > > > > 
> > > > > > > Yep, feel free.
> > > > > > 
> > > > > > It's on my queue for the next few days.
> > > > > 
> > > > > To me this part actually looks like the biggest priority after/while
> > > > > getting the whole thing to work, because we can start right with a 'v1'
> > > > > which looks more sustainable.
> > > > > 
> > > > > And I would just get stuff working on x86_64 in that case, without even
> > > > > implementing conversions and endianness switches etc.
> > > > 
> > > > Right. Given the number of options here, I think it would be safest
> > > > to go in expecting to go through a few throwaway protocol versions
> > > > before reaching one we're happy enough to support long term.
> > > > 
> > > > To ease that process, I'm wondering if we should add a default-off
> > > > command line option to enable migration. For now, enabling it would
> > > > print some sort of "migration is experimental!" warning. Once we have
> > > > a stream format we're ok with, we can flip it to on-by-default, but we
> > > > don't maintain receive compatibility for the experimental versions
> > > > leading up to that.
> > > 
> > > It looks like unnecessary code churn to me. It doesn't need to be
> > > merged if it's work in progress. You can also push stuff to a temporary
> > > branch if needed.
> > 
> > Eh, just thought merging might save us some rebase work against any
> > other pressing changes we need.
> > 
> > > It can also be merged and not documented for a while, as long as it
> > > doesn't break existing functionality.
> > 
> > I'd be a bit cautious about this. AIUI, right now if you attempt to
> > migrate, qemu will simply fail it because we don't respond to the
> > migration commands. Having enough merged that qemu won't outright
> > fail the migration, but it won't reliably work seems like a bad idea.
> 
> Not really: we transfer "PASST" at the moment, and the migration
> succeeds.

Oh... right. We kind of already made that mistake.

> You can start 'ping' in the source and see it continue
> flawlessly in the target. Sure, TCP connections break altogether
> instead of the okayish migration implemented by v3 I'm posting in a
> bit...

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson