From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top, Laurent Vivier <lvivier@redhat.com>
Subject: Re: [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure
Date: Mon, 3 Feb 2025 11:47:55 +1100 [thread overview]
Message-ID: <Z6ASO-Q7NwtrpyVj@zatzit> (raw)
In-Reply-To: <20250131100931.714531ee@elisabeth>
On Fri, Jan 31, 2025 at 10:09:31AM +0100, Stefano Brivio wrote:
> On Fri, 31 Jan 2025 17:32:05 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Fri, Jan 31, 2025 at 06:46:21AM +0100, Stefano Brivio wrote:
> > > On Thu, 30 Jan 2025 19:54:17 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >
> > > > On Thu, Jan 30, 2025 at 09:32:36AM +0100, Stefano Brivio wrote:
> > > > > On Thu, 30 Jan 2025 18:38:22 +1100
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >
> > > > > > Right, but in the present draft you pay that cost whether or not
> > > > > > you're actually using the flows. Unfortunately a busy server with
> > > > > > heaps of active connections is exactly the case that's likely to be
> > > > > > most sensitive to additional downtime, but there's not really any
> > > > > > getting around that. A machine with a lot of state will need either
> > > > > > high downtime or high migration bandwidth.
> > > > >
> > > > > It's... sixteen megabytes. A KubeVirt node is only allowed to perform up
> > > > > to _four_ migrations in parallel, and that's our main use case at the
> > > > > moment. "High downtime" is kind of relative.
> > > >
> > > > Certainly. But I believe it's typical to aim for downtimes in the
> > > > ~100ms range.
> > > >
> > > > > > But, I'm really hoping we can move relatively quickly to a model where
> > > > > > a guest with only a handful of connections _doesn't_ have to pay that
> > > > > > 128k flow cost - and can consequently migrate ok even with quite
> > > > > > constrained migration bandwidth. In that scenario the size of the
> > > > > > header could become significant.
> > > > >
> > > > > I think the biggest cost of the full flow table transfer is not the
> > > > > size but the code: it's a bit quicker to write (I just managed to
> > > > > properly set sequences on the target; connections don't quite "flow"
> > > > > yet) but relatively high-maintenance (as you mentioned, we need to be
> > > > > careful about every single field) and easy to break.
> > > >
> > > > Right. And with this draft we can't even change the size of the flow
> > > > table without breaking migration. That seems like a thing we might
> > > > well want to change.
> > >
> > > Why? The size of the flow table hasn't changed since it was added. I
> >
> > Which wasn't _that_ long ago. It just seems like a really obvious
> > constant to tune to me, and one where it would be surprising if it
> > broke migration.
> >
> > > don't see a reason to improve this if we don't want to transfer the
> > > flow table anyway.
> >
> > I don't follow. Do you mean not transferring the hash table? This is
> > not relevant to that, I'm talking about the size of the base flow
> > table, not the hash table. Or do you mean not transferring the flow
> > table as a whole, but rather entry by entry? In that case I'm seeing
> > it as exactly the mechanism to improve this.
>
> I mean transferring the flow table entry by entry.
>
> I think we need to jump to that directly, because of something else I
> just found: if we don't transfer one struct tcp_repair_window, the
> connection works at a basic level but we don't really comply with RFC
> 9293 anymore, I think, because we might shrink the window.
>
> And that struct is 20 bytes. We just reached 124 bytes with the
> socket-side sequence numbers, so now we would exceed 128 bytes.
Ok. This is next on my list.
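
Roughly, I'd expect dumping and restoring it to look something like this
(sketch only, error handling omitted; the function names are just
illustrative, and the socket has to be in repair mode on both sides):

/* Sketch only, not passt code: dump/restore the window state while the
 * socket is in repair mode.  struct tcp_repair_window comes from the
 * kernel headers.
 */
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>

static int tcp_flow_dump_wnd(int s, struct tcp_repair_window *w)
{
	socklen_t sl = sizeof(*w);

	return getsockopt(s, IPPROTO_TCP, TCP_REPAIR_WINDOW, w, &sl);
}

static int tcp_flow_repair_wnd(int s, const struct tcp_repair_window *w)
{
	return setsockopt(s, IPPROTO_TCP, TCP_REPAIR_WINDOW, w, sizeof(*w));
}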
> > > > > I would like to quickly complete the whole flow first, because I think
> > > > > we can inform design and implementation decisions much better at that
> > > > > point, and we can be sure it's feasible,
> > > >
> > > > That's fair.
> > > >
> > > > > but I'm not particularly keen
> > > > > to merge this patch like it is, if we can switch it relatively swiftly
> > > > > to an implementation where we model a smaller fixed-endian structure
> > > > > with just the stuff we need.
> > > >
> > > > So, there are kind of two parts to this:
> > > >
> > > > 1) Only transferring active flow entries, and not transferring the
> > > > hash table
> > > >
> > > > I think this is pretty easy. It could be done with or without
> > > > preserving flow indices. Preserving makes for debug log continuity
> > > > between the ends, but not preserving lets us change the size of the
> > > > flow table without breaking migration.
> > >
> > > I would just add prints on migration showing how old flow indices map
> > > to new ones.
> >
> > That's possible, although it would mean transferring the old indices,
> > which is not otherwise strictly necessary. What we could do easily is
> > a debug log similar to the "new flow" logs but for "immigrated flow".
>
> Transferring them if we have a dedicated structure shouldn't be that
> bad: we don't need to store them.
True.
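
To make that concrete, a small fixed-endian per-entry header along these
lines could carry the source index purely for logging, without the target
ever having to store it (field names and layout are hypothetical, not a
format proposal):

/* Hypothetical per-flow header for an entry-by-entry stream; all fields
 * in network order, src_idx used only for log continuity on the target.
 */
#include <stdint.h>

struct flow_mig_hdr {
	uint32_t src_idx;	/* flow index on the source */
	uint8_t  type;		/* flow type (TCP, ping, ...) */
	uint8_t  pad[3];
	uint16_t len;		/* length of the type-specific payload */
} __attribute__((packed));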
> > > > 2) Only transferring the necessary pieces of each entry, and using a
> > > > fixed representation of each piece
> > > >
> > > > This is harder. Not *super* hard, I think, but definitely trickier
> > > > than (1)
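
Just to illustrate the shape of (2), a fixed-endian TCP payload might look
something like the struct below; which fields actually need to be there is
exactly what we still have to work out, so treat every field as a
placeholder:

/* Placeholder layout, multi-byte fields in network order.  Which fields
 * are actually needed is still to be decided.
 */
#include <stdint.h>

struct tcp_flow_mig {
	uint8_t  faddr[16];	/* remote address (IPv4-mapped if needed) */
	uint8_t  eaddr[16];	/* guest-facing address */
	uint16_t fport;		/* remote port */
	uint16_t eport;		/* guest-facing port */
	uint32_t seq_to_tap;	/* sequence numbers */
	uint32_t seq_from_tap;
	uint32_t seq_sock;	/* socket-side sequence */
	uint16_t mss;
	uint8_t  ws_to_tap;	/* window scaling factors */
	uint8_t  ws_from_tap;
} __attribute__((packed));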
> > > >
> > > > > And again, to be a bit more sure of which stuff we need in it, the full
> > > > > flow is useful to have implemented.
> > > > >
> > > > > Actually the biggest complications I see in switching to that approach,
> > > > > from the current point, are that we need to, I guess:
> > > > >
> > > > > 1. model arrays (not really complicated by itself)
> > > >
> > > > So here, I actually think this is simpler if we don't attempt to have
> > > > a declarative approach to defining the protocol, but just write
> > > > functions to implement it.
> > > >
> > > > > 2. have a temporary structure where we store flows instead of using the
> > > > > flow table directly (meaning that the "data model" needs to logically
> > > > > decouple source and destination of the copy)
> > > >
> > > > Right.. I'd really prefer to "stream" in the entries one by one,
> > > > rather than having a big staging area. That's even harder to do
> > > > declaratively, but I think the other advantages are worth it.
> > > >
> > > > > 3. batch stuff to some extent. We'll call socket() and connect() once
> > > > > for each socket anyway, obviously, but sending one message to the
> > > > > TCP_REPAIR helper for each socket looks like a rather substantial
> > > > > and avoidable overhead
> > > >
> > > > I don't think this actually has a lot of bearing on the protocol. I'd
> > > > envisage migrate_target() decodes all the information into the
> > > > target's flow table, then migrate_target_post() steps through all the
> > > > flows re-establishing the connections. Since we've already parsed the
> > > > protocol at that point, we can make multiple passes: one to gather
> > > > batches and set TCP_REPAIR, another through each entry to set the
> > > > values, and a final one to clear TCP_REPAIR in batches.
> > >
> > > Ah, right, I didn't think of using the target flow table directly. That
> > > has the advantage that the current code I'm writing to reactivate flows
> > > from the flow table can be recycled as it is.
> >
> > Possibly - it might need to do some slightly different things:
> > regenerating some fields from redundant data maybe, and/or re-hashing
> > the entries. But certainly the structure should be similar, yes.
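
For the multi-pass idea above, I'm picturing something with this shape
(stand-in types and helpers, not passt's actual ones; the direct
setsockopt() here stands in for whatever batched message ends up going to
the repair helper):

/* Sketch only: stand-in types and helpers.  The direct setsockopt()
 * stands in for a batched message to the repair helper, which is what
 * would actually toggle TCP_REPAIR for us.
 */
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>

struct mig_flow {
	int s;			/* connected socket for this flow */
	/* ...sequences, window, etc., decoded by migrate_target()... */
};

static int repair_set(int s, int on)
{
	return setsockopt(s, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on));
}

static void flow_restore(const struct mig_flow *f)
{
	/* TCP_QUEUE_SEQ, TCP_REPAIR_WINDOW, requeue unacknowledged data... */
	(void)f;
}

static void migrate_target_post(struct mig_flow *flows, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)		/* pass 1: enter repair mode */
		repair_set(flows[i].s, 1);

	for (i = 0; i < n; i++)		/* pass 2: restore per-flow state */
		flow_restore(&flows[i]);

	for (i = 0; i < n; i++)		/* pass 3: leave repair mode */
		repair_set(flows[i].s, 0);
}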
> >
> > > > > > > > It's both easier to do
> > > > > > > > and a bigger win in most cases. That would dramatically reduce the
> > > > > > > > size sent here.
> > > > > > >
> > > > > > > Yep, feel free.
> > > > > >
> > > > > > It's on my queue for the next few days.
> > > > >
> > > > > To me this part actually looks like the biggest priority after/while
> > > > > getting the whole thing to work, because we can start right with a 'v1'
> > > > > which looks more sustainable.
> > > > >
> > > > > And I would just get stuff working on x86_64 in that case, without even
> > > > > implementing conversions and endianness switches etc.
> > > >
> > > > Right. Given the number of options here, I think it would be safest
> > > > to go in expecting to go through a few throwaway protocol versions
> > > > before reaching one we're happy enough to support long term.
> > > >
> > > > To ease that process, I'm wondering if we should add a default-off
> > > > command line option to enable migration. For now, enabling it would
> > > > print some sort of "migration is experimental!" warning. Once we have
> > > > a stream format we're ok with, we can flip it to on-by-default, but we
> > > > don't maintain receive compatibility for the experimental versions
> > > > leading up to that.
> > >
> > > It looks like unnecessary code churn to me. It doesn't need to be
> > > merged if it's work in progress. You can also push stuff to a temporary
> > > branch if needed.
> >
> > Eh, just thought merging might save us some rebase work against any
> > other pressing changes we need.
> >
> > > It can also be merged and not documented for a while, as long as it
> > > doesn't break existing functionality.
> >
> > I'd be a bit cautious about this. AIUI, right now if you attempt to
> > migrate, qemu will simply fail it because we don't respond to the
> > migration commands. Having enough merged that qemu won't outright
> > fail the migration, but where migration still doesn't reliably work,
> > seems like a bad idea.
>
> Not really: we transfer "PASST" at the moment, and the migration
> succeeds.
Oh... right. We kind of already made that mistake.
> You can start 'ping' in the source and see it continue
> flawlessly in the target. Sure, TCP connections break altogether
> instead of the okayish migration implemented by v3 I'm posting in a
> bit...
>
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
Thread overview: 53+ messages
2025-01-27 23:15 [PATCH 0/7] Draft, incomplete series introducing state migration Stefano Brivio
2025-01-27 23:15 ` [PATCH 1/7] icmp, udp: Pad time_t timestamp to 64-bit to ease " Stefano Brivio
2025-01-28 0:49 ` David Gibson
2025-01-28 6:48 ` Stefano Brivio
2025-01-27 23:15 ` [PATCH 2/7] flow, flow_table: Pad flow table entries to 128 bytes, hash entries to 32 bits Stefano Brivio
2025-01-28 0:50 ` David Gibson
2025-01-27 23:15 ` [PATCH 3/7] tcp_conn: Avoid 7-bit hole in struct tcp_splice_conn Stefano Brivio
2025-01-28 0:53 ` David Gibson
2025-01-28 6:48 ` Stefano Brivio
2025-01-29 1:02 ` David Gibson
2025-01-29 7:33 ` Stefano Brivio
2025-01-30 0:44 ` David Gibson
2025-01-30 4:55 ` Stefano Brivio
2025-01-30 7:27 ` David Gibson
2025-01-27 23:15 ` [PATCH 4/7] flow_table: Use size in extern declaration for flowtab Stefano Brivio
2025-01-27 23:15 ` [PATCH 5/7] util: Add read_remainder() and read_all_buf() Stefano Brivio
2025-01-28 0:59 ` David Gibson
2025-01-28 6:48 ` Stefano Brivio
2025-01-29 1:03 ` David Gibson
2025-01-29 7:33 ` Stefano Brivio
2025-01-30 0:44 ` David Gibson
2025-01-27 23:15 ` [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure Stefano Brivio
2025-01-28 1:40 ` David Gibson
2025-01-28 6:50 ` Stefano Brivio
2025-01-29 1:16 ` David Gibson
2025-01-29 7:33 ` Stefano Brivio
2025-01-30 0:48 ` David Gibson
2025-01-30 4:55 ` Stefano Brivio
2025-01-30 7:38 ` David Gibson
2025-01-30 8:32 ` Stefano Brivio
2025-01-30 8:54 ` David Gibson
2025-01-31 5:46 ` Stefano Brivio
2025-01-31 6:32 ` David Gibson
2025-01-31 9:09 ` Stefano Brivio
2025-02-03 0:47 ` David Gibson [this message]
2025-01-31 5:36 ` Stefano Brivio
2025-01-31 6:14 ` David Gibson
2025-01-31 9:09 ` Stefano Brivio
2025-02-03 0:46 ` David Gibson
2025-02-03 6:09 ` Stefano Brivio
2025-02-03 9:06 ` David Gibson
2025-02-03 9:45 ` Stefano Brivio
2025-02-03 9:57 ` David Gibson
2025-01-27 23:15 ` [PATCH 7/7] Introduce passt-repair Stefano Brivio
2025-01-27 23:31 ` Stefano Brivio
2025-01-28 1:51 ` David Gibson
2025-01-28 6:51 ` Stefano Brivio
2025-01-29 1:29 ` David Gibson
2025-01-29 7:04 ` Stefano Brivio
2025-01-30 0:53 ` David Gibson
2025-01-30 4:55 ` Stefano Brivio
2025-01-30 7:43 ` David Gibson
2025-01-30 7:56 ` Stefano Brivio