From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au Authentication-Results: passt.top; dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202502 header.b=kM3p3bNj; dkim-atps=neutral Received: from mail.ozlabs.org (gandalf.ozlabs.org [150.107.74.76]) by passt.top (Postfix) with ESMTPS id 9AC6F5A0626 for ; Fri, 07 Mar 2025 09:35:54 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202502; t=1741336543; bh=T4EMQbeGbGVwS8Q4miJgS1sBLmTZOqPWoEjcFVzaadw=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=kM3p3bNjeLMCYSk3YImAlugYU+tLat0DA6ydEUmef9Ln0fQ2KUFGKvOy1T1YKfVPe ejsFmh4x3W8rfQkTPMVzhu4V4Uoh1HiTZ8ojT79U+F0l3LicaTe/lg8lDqmLcFnqgN sWd7CDS+sjBnjEJuLMkJawUQ/QNYv3Gzoot+MDP6C+ZXsc5sDjYrI7mt02H9GjK2rk HWvXY9qkvfBOnAYKNiveU9IZDerdEP6DJLvgK9Ox+7rDMZRDkCeDsKbOCKrKc4NwsQ 5iKUFA4XSiXDfdYHnCmxwJibaucD6IxVwp0oKzeTn5t4gGaMsYFa0lJ5zWwueXt9Z6 84bI/aiNjmFdQ== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4Z8KPC4yQDz4wcw; Fri, 7 Mar 2025 19:35:43 +1100 (AEDT) Date: Fri, 7 Mar 2025 19:34:06 +1100 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH] flow, repair: Wait for a short while for passt-repair to connect Message-ID: References: <20250306190051.3924372-1-sbrivio@redhat.com> <20250307071445.7c26b2d1@elisabeth> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="Yh8UbXoOOFTASIXD" Content-Disposition: inline In-Reply-To: <20250307071445.7c26b2d1@elisabeth> Message-ID-Hash: IJEEV7OMDDZUOCJFDENKO7DZRKA6MKET X-Message-ID-Hash: IJEEV7OMDDZUOCJFDENKO7DZRKA6MKET X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --Yh8UbXoOOFTASIXD Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Fri, Mar 07, 2025 at 07:14:45AM +0100, Stefano Brivio wrote: > On Fri, 7 Mar 2025 13:59:08 +1100 > David Gibson wrote: >=20 > > On Thu, Mar 06, 2025 at 08:00:51PM +0100, Stefano Brivio wrote: > > > ...and time out after that. This will be needed because of an upcoming > > > change to passt-repair enabling it to start before passt is started, > > > on both source and target, by means of an inotify watch. > > >=20 > > > Once the inotify watch triggers, passt-repair will connect right away, > > > but we have no guarantees that the connection completes before we > > > start the migration process, so wait for it (for a reasonable amount > > > of time). > > >=20 > > > Signed-off-by: Stefano Brivio =20 > >=20 > > Hmm. So I'm pretty confident this patch does what it sets out to do. > > I'm increasingly dubious about this direction. > >=20 > > > --- > > > I'm still working on the passt-repair change mentioned above. It's > > > turning out to be a bit trickier than expected because to support > > > typical libvirt migration usage, on the target, we also need to be > > > able to wait for creation of a directory and, separately, for a socket > > > file being added inside it. =20 > >=20 > > Oof. This significantly amplifies my concerns about the security > > model: one of the most obvious ways to to bound who could potentially > > convince passt-repair to connect is to make sure the directory exists > > before hand: if you don't have permissions to create the socket in > > that directory, passt-repair won't connect. > >=20 > > If we're also waiting for directories to appear, it makes this even > > harder to analyse in terms of whether someone who's not supposed to > > could convince passt-repair to connect to them. >=20 > I don't think your concern is particularly founded because in a general > case we have LSM policies preventing other processes from creating that > file in that directory, and in KubeVirt's case we simply have no other > processes that can access the pod mount namespace and do that. I mean, LSM properties should generally be a additional layer of protection, things should be basically ok with the vanilla Unix permission model. > But that part of the change is not strictly necessary, surely not for a > proper libvirt integration (libvirt knows the path) and also not for > the integration for KubeVirt (KubeVirt can build the path). >=20 > So, to dodge the obstructionism here, I'll happily drop that part. It > also saves me time. >=20 > > Given the complexity you're hitting here, I'm really starting to think > > maybe we should reverse our model: have passt connect to passt-repair. > > The security model is easier to understand: permission to the socket > > and its directory is permission to use TCP_REPAIR. There's no awkward > > waiting on either side - I think it might actually be less total code > > chyurn. >=20 > Of course it's a bit less code, but it turns a potential concern into > the actually expected situation: you have something permanently around > that allows any process (because you have no practical way to restrict A (semi-)permanentaly running daemon was not what I had in mind. What Istead I was thinking that you start passt-repair shortly before the migration (as we already planned), it does exactly one accept(), then terminates once its single client is done. > it, not in distributions and not in KubeVirt) to set/clear repair mode > on sockets. Unless you start introducing authentication, that is. Well, the plan was the authentication was via the Unix permissions on the socket. > In the current model it's still a one-off operation, even with an > additional wait. You start passt-repair once to do its job, and once it > does the job, it terminates. Right, also what I was intending... but of course, I realised a few hours later that I'd forgotten that our filesystem isolation means we can't connect() to a Unix socket in the wider world at migration time. There are ways to work around that, of course, but I have yet to spot one that seems better than the status quo. So, Reviewed-by: David Gibson I guess. > > Judging by the early queries from the Kubevirt guys it seems > > to be the more obvious model people would expect. >=20 > Note that you talked to people taking care of (almost exclusively) > networking there. Knowing myself a fair bit of code, architecture, and > history of security weaknesses and discussions around them, I'm pretty > sure you can't do the opposite. >=20 > The long-term solutions are anyway having libvirt provide a phase or > blocking event where this can be done, and, at least in KubeVirt's > case, eventually handle this with seitan. Right. Probably not really valuable while it still needs CAP_NET_ADMIN, but if we do manage to change that to CAP_NET_RAW, it would make sense for passt to only use the helper if it can't do the TCP_REPAIR itself. That makes setup simpler for environments that don't have counterproductive "security" policies. > > On the other hand, > > configuring the socket path is still an issue, and being able to tell > > passt where to connect to through libvirt might become more important > > than it is now. > >=20 > > > However, this part looks solid and it should do what we need. I > > > tested it by starting passt-repair after sleep(1) in migrate/* tests, > > > changing this timeout to a minute, on either source, target, or both. > > >=20 > > > flow.c | 20 ++++++++++++++++++++ > > > repair.c | 31 +++++++++++++++++++++++++++++++ > > > repair.h | 1 + > > > 3 files changed, 52 insertions(+) > > >=20 > > > diff --git a/flow.c b/flow.c > > > index 749c498..5e64b79 100644 > > > --- a/flow.c > > > +++ b/flow.c > > > @@ -911,6 +911,21 @@ static int flow_migrate_source_rollback(struct c= tx *c, unsigned bound, int ret) > > > return ret; > > > } > > > =20 > > > +/** > > > + * flow_migrate_need_repair() - Do we need to set repair mode for an= y flow? > > > + * > > > + * Return: true if repair mode is needed, false otherwise > > > + */ > > > +static bool flow_migrate_need_repair(void) > > > +{ > > > + union flow *flow; > > > + > > > + foreach_established_tcp_flow(flow) > > > + return true; > > > + > > > + return false; > > > +} > > > + > > > /** > > > * flow_migrate_repair_all() - Turn repair mode on or off for all fl= ows > > > * @c: Execution context > > > @@ -966,6 +981,9 @@ int flow_migrate_source_pre(struct ctx *c, const = struct migrate_stage *stage, > > > (void)stage; > > > (void)fd; > > > =20 > > > + if (flow_migrate_need_repair()) > > > + repair_wait(c); > > > + > > > if ((rc =3D flow_migrate_repair_all(c, true))) > > > return -rc; > > > =20 > > > @@ -1083,6 +1101,8 @@ int flow_migrate_target(struct ctx *c, const st= ruct migrate_stage *stage, > > > if (!count) > > > return 0; > > > =20 > > > + repair_wait(c); > > > + > > > if ((rc =3D flow_migrate_repair_all(c, true))) > > > return -rc; > > > =20 > > > diff --git a/repair.c b/repair.c > > > index 3ee089f..ce5a4a4 100644 > > > --- a/repair.c > > > +++ b/repair.c > > > @@ -27,6 +27,11 @@ > > > =20 > > > #define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not= in UAPI */ > > > =20 > > > +/* Wait for a while for TCP_REPAIR helper to connect if it's not the= re yet */ > > > +#define REPAIR_ACCEPT_TIMEOUT_MS 100 =20 > >=20 > > 100ms is a pretty long time in the context of a stopped guest. I > > guess is is a worst case. >=20 > It's actually not a case I can reproduce at all, as expected: > passt-repair connects quickly enough once the socket is available. Good to know. It does mean we're adding 100ms of downtime for the fallback case where there is no passt-repair, of course. > > We could possibly mitigate it by doing the >=20 > Mitigate what exactly? The expected mode of operation is with > passt-repair available. The failure handling is a fallback. I can turn > it into 1 ms, it just seems to increase risk for no good reason. > > wait when memory change logging goes on (so migration is starting, bu > > the guest is not get stopped). However at that point we can't rely on > > flow_migrate_need_repair(), because we might have no flows, but one is > > opened before the guest is stopped. >=20 > Right, that complicates things quite a lot. >=20 > > > +#define REPAIR_ACCEPT_TIMEOUT_US (REPAIR_ACCEPT_TIMEOUT_MS * 1000) > > > +static_assert(REPAIR_ACCEPT_TIMEOUT_US < 1000 * 1000); =20 > >=20 > > In this context the static_assert() seems like a pretty arbitrary > > bound. To make the reason for it clearer, I'd suggest moving it to > > where we construct the timeval. >=20 > Oh, I didn't think of that. Sure, I'll move it. >=20 > > > + > > > /* Pending file descriptors for next repair_flush() call, or command= change */ > > > static int repair_fds[SCM_MAX_FD]; > > > =20 > > > @@ -138,6 +143,32 @@ void repair_handler(struct ctx *c, uint32_t even= ts) > > > repair_close(c); > > > } > > > =20 > > > +/** > > > + * repair_wait() - Wait (with timeout) for TCP_REPAIR helper to conn= ect > > > + * @c: Execution context > > > + */ > > > +void repair_wait(struct ctx *c) > > > +{ > > > + struct timeval tv =3D { .tv_sec =3D 0, > > > + .tv_usec =3D (long)(REPAIR_ACCEPT_TIMEOUT_US) }; > > > + > > > + if (c->fd_repair >=3D 0 || c->fd_repair_listen =3D=3D -1) > > > + return; > > > + > > > + if (setsockopt(c->fd_repair_listen, SOL_SOCKET, SO_RCVTIMEO, > > > + &tv, sizeof(tv))) { > > > + err_perror("Set timeout on TCP_REPAIR listening socket"); > > > + return; > > > + } > > > + > > > + repair_listen_handler(c, EPOLLIN); > > > + > > > + tv.tv_usec =3D 0; > > > + if (setsockopt(c->fd_repair_listen, SOL_SOCKET, SO_RCVTIMEO, > > > + &tv, sizeof(tv))) > > > + err_perror("Clear timeout on TCP_REPAIR listening socket"); > > > +} > > > + > > > /** > > > * repair_flush() - Flush current set of sockets to helper, with cur= rent command > > > * @c: Execution context > > > diff --git a/repair.h b/repair.h > > > index de279d6..1d37922 100644 > > > --- a/repair.h > > > +++ b/repair.h > > > @@ -10,6 +10,7 @@ void repair_sock_init(const struct ctx *c); > > > void repair_listen_handler(struct ctx *c, uint32_t events); > > > void repair_handler(struct ctx *c, uint32_t events); > > > void repair_close(struct ctx *c); > > > +void repair_wait(struct ctx *c); > > > int repair_flush(struct ctx *c); > > > int repair_set(struct ctx *c, int s, int cmd); >=20 --=20 David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson --Yh8UbXoOOFTASIXD Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmfKr3EACgkQzQJF27ox 2GcDsg/9Hgwg5xNeOq7EW6dzjXUVFOQZWa5OK9Whk7K0o5TE+demmN988U6H0l6l nVTG+zwVYWgm3nUN2BV3E25vFikFXUfpkhSZ5iEUuishZj0juqx8a0TQ8E122/hR FDbzM47CgYZe8PzP2BWPwmeGEJiRl4tyNVkxzCgHb3o+vcR29vAGIW9BZT34vkEN SepLGN4v6pXvsXPiyWc/czZNphN4qpiYam2XCocV920ciGxmwSUp6Mx0xVUgBnN6 Yqoeh2jv6MTtoXPizLazuIy9Dcc7Dk3AJQD01LiaNmsIX582Fqcmv+L1F/IbogBd ozJcUxnIWGrU3tpAtZZg04dXuRpHtgqu68O0lJX4JOIR/WvBRfcRKCepMkygxEIb P/CeNsnlM6mp4mAx2Sn3vaUOo+JHfI1d4bHmqWlnXnb9T6P7ZfNGMKWgdKpvjn0I i2ap8G0OqB41Va6VOW8yGxRZPaMVifoSirjAkSv0Q+z0mCSw6oQzd8O0CjsqmqoC vhwxdYeI3vxdLLMvci7ath+hCF9jVenZag8iGjjGRndkR4GVYU2JEizOvyPcTN1F I0v4jq014qn8hjNY4r1cBPndF4zi+Hm0gh32zo8mcp0bya2FBqDWrTiT2n9ZS4SB YFDMk+BBvry0DFGQRaww2QAOhPNotFc/+sWnARg79vaYT/2OIG0= =PZSb -----END PGP SIGNATURE----- --Yh8UbXoOOFTASIXD--