From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 31 Jan 2025 10:09:31 +0100
From: Stefano Brivio
To: David Gibson
Cc: passt-dev@passt.top, Laurent Vivier
Subject: Re: [PATCH 6/7] Introduce facilities for guest migration on top of vhost-user infrastructure
Message-ID: <20250131100931.714531ee@elisabeth>
References: <20250128075001.3557d398@elisabeth>
 <20250129083350.220a7ab0@elisabeth>
 <20250130055522.39acb265@elisabeth>
 <20250130093236.117c3fd0@elisabeth>
 <20250131064621.6cddad7a@elisabeth>
Organization: Red Hat
List-Id: Development discussion and patches for passt

On Fri, 31 Jan 2025 17:32:05 +1100
David Gibson wrote:

> On Fri, Jan 31, 2025 at 06:46:21AM +0100, Stefano Brivio wrote:
> > On Thu, 30 Jan 2025 19:54:17 +1100
> > David Gibson wrote:
> >
> > > On Thu, Jan 30, 2025 at 09:32:36AM +0100, Stefano Brivio wrote:
> > > > On Thu, 30 Jan 2025 18:38:22 +1100
> > > > David Gibson wrote:
> > > >
> > > > > Right, but in the present draft you pay that cost whether or not
> > > > > you're actually using the flows. Unfortunately a busy server with
> > > > > heaps of active connections is exactly the case that's likely to be
> > > > > most sensitive to additional downtime, but there's not really any
> > > > > getting around that. A machine with a lot of state will need either
> > > > > high downtime or high migration bandwidth.
> > > >
> > > > It's... sixteen megabytes. A KubeVirt node is only allowed to perform up
> > > > to _four_ migrations in parallel, and that's our main use case at the
> > > > moment. "High downtime" is kind of relative.
> > >
> > > Certainly. But I believe it's typical to aim for downtimes in the
> > > ~100ms range.
> > >
> > > > > But, I'm really hoping we can move relatively quickly to a model where
> > > > > a guest with only a handful of connections _doesn't_ have to pay that
> > > > > 128k flow cost - and can consequently migrate ok even with quite
> > > > > constrained migration bandwidth. In that scenario the size of the
> > > > > header could become significant.
> > > >
> > > > I think the biggest cost of the full flow table transfer is rather code
> > > > that's a bit quicker to write (I just managed to properly set sequences
> > > > on the target, connections don't quite "flow" yet) but relatively high
> > > > maintenance (as you mentioned, we need to be careful about every single
> > > > field) and easy to break.
> > >
> > > Right. And with this draft we can't even change the size of the flow
> > > table without breaking migration. That seems like a thing we might
> > > well want to change.
> >
> > Why? The size of the flow table hasn't changed since it was added. I
>
> Which wasn't _that_ long ago. It just seems like a really obvious
> constant to tune to me, and one which it would be surprising if it
> broke migration.
>
> > don't see a reason to improve this if we don't want to transfer the
> > flow table anyway.
>
> I don't follow. Do you mean not transferring the hash table? This is
> not relevant to that, I'm talking about the size of the base flow
> table, not the hash table. Or do you mean not transferring the flow
> table as a whole, but rather entry by entry? In that case I'm seeing
> it as exactly the mechanism to improve this.

I mean transferring the flow table entry by entry. I think we need to
jump to that directly, because of something else I just found: if we
don't transfer one struct tcp_repair_window, the connection works at a
basic level but we don't really comply with RFC 9293 anymore, I think,
because we might shrink the window. And that struct is 20 bytes. We
just reached 124 bytes with the socket-side sequence numbers, so now we
would exceed 128 bytes.
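Just to show what I mean by that, roughly (an untested sketch, not the
actual patch, function names made up here; the socket is assumed to be
in repair mode already):

/* Sketch: save and restore window parameters of a TCP socket that is
 * already in repair mode. Error handling omitted.
 */
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>

static int tcp_window_get(int s, struct tcp_repair_window *w)
{
	socklen_t len = sizeof(*w);

	/* snd_wl1, snd_wnd, max_window, rcv_wnd, rcv_wup: the 20 bytes */
	return getsockopt(s, IPPROTO_TCP, TCP_REPAIR_WINDOW, w, &len);
}

static int tcp_window_set(int s, const struct tcp_repair_window *w)
{
	return setsockopt(s, IPPROTO_TCP, TCP_REPAIR_WINDOW, w, sizeof(*w));
}

On the target, setting this before clearing TCP_REPAIR should be what
keeps us from ever shrinking the window the source already advertised.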
> > > > I would like to quickly complete the whole flow first, because I think
> > > > we can inform design and implementation decisions much better at that
> > > > point, and we can be sure it's feasible,
> > >
> > > That's fair.
> > >
> > > > but I'm not particularly keen
> > > > to merge this patch like it is, if we can switch it relatively swiftly
> > > > to an implementation where we model a smaller fixed-endian structure
> > > > with just the stuff we need.
> > >
> > > So, there are kind of two parts to this:
> > >
> > > 1) Only transferring active flow entries, and not transferring the
> > > hash table
> > >
> > > I think this is pretty easy. It could be done with or without
> > > preserving flow indices. Preserving makes for debug log continuity
> > > between the ends, but not preserving lets us change the size of the
> > > flow table without breaking migration.
> >
> > I would just add prints on migration showing how old flow indices map
> > to new ones.
>
> That's possible, although it would mean transferring the old indices,
> which is not otherwise strictly necessary. What we could do easily is
> a debug log similar to the "new flow" logs but for "immigrated flow".

Transferring them if we have a dedicated structure shouldn't be that
bad: we don't need to store them.

> > > 2) Only transferring the necessary pieces of each entry, and using a
> > > fixed representation of each piece
> > >
> > > This is harder. Not *super* hard, I think, but definitely trickier
> > > than (1)
> >
> > > > And again, to be a bit more sure of which stuff we need in it, the full
> > > > flow is useful to have implemented.
> > > >
> > > > Actually the biggest complications I see in switching to that approach,
> > > > from the current point, are that we need to, I guess:
> > > >
> > > > 1. model arrays (not really complicated by itself)
> > >
> > > So here, I actually think this is simpler if we don't attempt to have
> > > a declarative approach to defining the protocol, but just write
> > > functions to implement it.
> >
> > > > 2. have a temporary structure where we store flows instead of using the
> > > > flow table directly (meaning that the "data model" needs to logically
> > > > decouple source and destination of the copy)
> > >
> > > Right.. I'd really prefer to "stream" in the entries one by one,
> > > rather than having a big staging area. That's even harder to do
> > > declaratively, but I think the other advantages are worth it.
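To make 2) a bit more concrete, the kind of per-entry record I have in
mind looks more or less like this (an illustration of the shape only:
names and field list made up on the spot, not a proposal for the final
layout):

/* Illustration: a possible fixed-layout record for one TCP flow,
 * streamed entry by entry after a plain count, instead of copying the
 * flow table wholesale. Multi-byte fields in network order.
 */
#include <stdint.h>

struct tcp_flow_mig {
	uint32_t idx;		/* old flow index, only for log continuity */

	uint8_t  pif[2];	/* interface for each side */
	uint8_t  addr[2][16];	/* addresses, IPv4-mapped for IPv4 flows */
	uint16_t port[2];	/* ports for each side */

	uint32_t seq_snd;	/* send queue sequence (TCP_QUEUE_SEQ) */
	uint32_t seq_rcv;	/* receive queue sequence (TCP_QUEUE_SEQ) */
	uint16_t mss;		/* restored via TCP_REPAIR_OPTIONS */
	uint8_t  ws_snd;	/* window scaling shifts, likewise */
	uint8_t  ws_rcv;

	/* mirrors struct tcp_repair_window */
	uint32_t snd_wl1, snd_wnd, max_window, rcv_wnd, rcv_wup;
} __attribute__((packed));

The source would send a count followed by one record per active flow,
so the stream doesn't depend on the size of the flow table anymore, and
the old index is only there for the "immigrated flow" prints.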
> > > > 3. batch stuff to some extent. We'll call socket() and connect() once
> > > > for each socket anyway, obviously, but sending one message to the
> > > > TCP_REPAIR helper for each socket looks like a rather substantial
> > > > and avoidable overhead
> > >
> > > I don't think this actually has a lot of bearing on the protocol. I'd
> > > envisage migrate_target() decodes all the information into the
> > > target's flow table, then migrate_target_post() steps through all the
> > > flows re-establishing the connections. Since we've already parsed the
> > > protocol at that point, we can make multiple passes: one to gather
> > > batches and set TCP_REPAIR, another through each entry to set the
> > > values, and a final one to clear TCP_REPAIR in batches.
> >
> > Ah, right, I didn't think of using the target flow table directly. That
> > has the advantage that the current code I'm writing to reactivate flows
> > from the flow table can be recycled as it is.
>
> Possibly - it might need to do some slightly different things:
> regenerating some fields from redundant data maybe, and/or re-hashing
> the entries. But certainly the structure should be similar, yes.
>
> > > > > > > It's both easier to do
> > > > > > > and a bigger win in most cases. That would dramatically reduce the
> > > > > > > size sent here.
> > > > > >
> > > > > > Yep, feel free.
> > > > >
> > > > > It's on my queue for the next few days.
> > > >
> > > > To me this part actually looks like the biggest priority after/while
> > > > getting the whole thing to work, because we can start right with a 'v1'
> > > > which looks more sustainable.
> > > >
> > > > And I would just get stuff working on x86_64 in that case, without even
> > > > implementing conversions and endianness switches etc.
> > >
> > > Right. Given the number of options here, I think it would be safest
> > > to go in expecting to go through a few throwaway protocol versions
> > > before reaching one we're happy enough to support long term.
> > >
> > > To ease that process, I'm wondering if we should add a default-off
> > > command line option to enable migration. For now, enabling it would
> > > print some sort of "migration is experimental!" warning. Once we have
> > > a stream format we're ok with, we can flip it to on-by-default, but we
> > > don't maintain receive compatibility for the experimental versions
> > > leading up to that.
> >
> > It looks like unnecessary code churn to me. It doesn't need to be
> > merged if it's work in progress. You can also push stuff to a temporary
> > branch if needed.
>
> Eh, just thought merging might save us some rebase work against any
> other pressing changes we need.
>
> > It can also be merged and not documented for a while, as long as it
> > doesn't break existing functionality.
>
> I'd be a bit cautious about this. AIUI, right now if you attempt to
> migrate, qemu will simply fail it because we don't respond to the
> migration commands. Having enough merged that qemu won't outright
> fail the migration, but it won't reliably work seems like a bad idea.

Not really: we transfer "PASST" at the moment, and the migration
succeeds. You can start 'ping' in the source and see it continue
flawlessly in the target. Sure, TCP connections break altogether
instead of the okayish migration implemented by v3 I'm posting in a
bit...

-- 
Stefano