From mboxrd@z Thu Jan  1 00:00:00 1970
Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au
Authentication-Results: passt.top;
	dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202502 header.b=kM3p3bNj;
	dkim-atps=neutral
Received: from mail.ozlabs.org (gandalf.ozlabs.org [150.107.74.76])
	by passt.top (Postfix) with ESMTPS id 9AC6F5A0626
	for <passt-dev@passt.top>; Fri, 07 Mar 2025 09:35:54 +0100 (CET)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=gibson.dropbear.id.au; s=202502; t=1741336543;
	bh=T4EMQbeGbGVwS8Q4miJgS1sBLmTZOqPWoEjcFVzaadw=;
	h=Date:From:To:Cc:Subject:References:In-Reply-To:From;
	b=kM3p3bNjeLMCYSk3YImAlugYU+tLat0DA6ydEUmef9Ln0fQ2KUFGKvOy1T1YKfVPe
	 ejsFmh4x3W8rfQkTPMVzhu4V4Uoh1HiTZ8ojT79U+F0l3LicaTe/lg8lDqmLcFnqgN
	 sWd7CDS+sjBnjEJuLMkJawUQ/QNYv3Gzoot+MDP6C+ZXsc5sDjYrI7mt02H9GjK2rk
	 HWvXY9qkvfBOnAYKNiveU9IZDerdEP6DJLvgK9Ox+7rDMZRDkCeDsKbOCKrKc4NwsQ
	 5iKUFA4XSiXDfdYHnCmxwJibaucD6IxVwp0oKzeTn5t4gGaMsYFa0lJ5zWwueXt9Z6
	 84bI/aiNjmFdQ==
Received: by gandalf.ozlabs.org (Postfix, from userid 1007)
	id 4Z8KPC4yQDz4wcw; Fri,  7 Mar 2025 19:35:43 +1100 (AEDT)
Date: Fri, 7 Mar 2025 19:34:06 +1100
From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Subject: Re: [PATCH] flow, repair: Wait for a short while for passt-repair to
 connect
Message-ID: <Z8qvftYzxJog05Ef@zatzit>
References: <20250306190051.3924372-1-sbrivio@redhat.com>
 <Z8pg_KEeH_EqeMGa@zatzit>
 <20250307071445.7c26b2d1@elisabeth>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256;
	protocol="application/pgp-signature"; boundary="Yh8UbXoOOFTASIXD"
Content-Disposition: inline
In-Reply-To: <20250307071445.7c26b2d1@elisabeth>
Message-ID-Hash: IJEEV7OMDDZUOCJFDENKO7DZRKA6MKET
X-Message-ID-Hash: IJEEV7OMDDZUOCJFDENKO7DZRKA6MKET
X-MailFrom: dgibson@gandalf.ozlabs.org
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header
CC: passt-dev@passt.top
X-Mailman-Version: 3.3.8
Precedence: list
List-Id: Development discussion and patches for passt <passt-dev.passt.top>
Archived-At: <https://archives.passt.top/passt-dev/Z8qvftYzxJog05Ef@zatzit/>
Archived-At: <https://passt.top/hyperkitty/list/passt-dev@passt.top/message/IJEEV7OMDDZUOCJFDENKO7DZRKA6MKET/>
List-Archive: <https://archives.passt.top/passt-dev/>
List-Archive: <https://passt.top/hyperkitty/list/passt-dev@passt.top/>
List-Help: <mailto:passt-dev-request@passt.top?subject=help>
List-Owner: <mailto:passt-dev-owner@passt.top>
List-Post: <mailto:passt-dev@passt.top>
List-Subscribe: <mailto:passt-dev-join@passt.top>
List-Unsubscribe: <mailto:passt-dev-leave@passt.top>


--Yh8UbXoOOFTASIXD
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Fri, Mar 07, 2025 at 07:14:45AM +0100, Stefano Brivio wrote:
> On Fri, 7 Mar 2025 13:59:08 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>=20
> > On Thu, Mar 06, 2025 at 08:00:51PM +0100, Stefano Brivio wrote:
> > > ...and time out after that. This will be needed because of an upcoming
> > > change to passt-repair enabling it to start before passt is started,
> > > on both source and target, by means of an inotify watch.
> > >=20
> > > Once the inotify watch triggers, passt-repair will connect right away,
> > > but we have no guarantees that the connection completes before we
> > > start the migration process, so wait for it (for a reasonable amount
> > > of time).
> > >=20
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com> =20
> >=20
> > Hmm.  So I'm pretty confident this patch does what it sets out to do.
> > I'm increasingly dubious about this direction.
> >=20
> > > ---
> > > I'm still working on the passt-repair change mentioned above. It's
> > > turning out to be a bit trickier than expected because to support
> > > typical libvirt migration usage, on the target, we also need to be
> > > able to wait for creation of a directory and, separately, for a socket
> > > file being added inside it. =20
> >=20
> > Oof.  This significantly amplifies my concerns about the security
> > model: one of the most obvious ways to to bound who could potentially
> > convince passt-repair to connect is to make sure the directory exists
> > before hand: if you don't have permissions to create the socket in
> > that directory, passt-repair won't connect.
> >=20
> > If we're also waiting for directories to appear, it makes this even
> > harder to analyse in terms of whether someone who's not supposed to
> > could convince passt-repair to connect to them.
>=20
> I don't think your concern is particularly founded because in a general
> case we have LSM policies preventing other processes from creating that
> file in that directory, and in KubeVirt's case we simply have no other
> processes that can access the pod mount namespace and do that.

I mean, LSM properties should generally be a additional layer of
protection, things should be basically ok with the vanilla Unix
permission model.

> But that part of the change is not strictly necessary, surely not for a
> proper libvirt integration (libvirt knows the path) and also not for
> the integration for KubeVirt (KubeVirt can build the path).
>=20
> So, to dodge the obstructionism here, I'll happily drop that part. It
> also saves me time.
>=20
> > Given the complexity you're hitting here, I'm really starting to think
> > maybe we should reverse our model: have passt connect to passt-repair.
> > The security model is easier to understand: permission to the socket
> > and its directory is permission to use TCP_REPAIR.  There's no awkward
> > waiting on either side - I think it might actually be less total code
> > chyurn.
>=20
> Of course it's a bit less code, but it turns a potential concern into
> the actually expected situation: you have something permanently around
> that allows any process (because you have no practical way to restrict

A (semi-)permanentaly running daemon was not what I had in mind.  What
Istead I was thinking that you start passt-repair shortly before the
migration (as we already planned), it does exactly one accept(), then
terminates once its single client is done.

> it, not in distributions and not in KubeVirt) to set/clear repair mode
> on sockets. Unless you start introducing authentication, that is.

Well, the plan was the authentication was via the Unix permissions on
the socket.

> In the current model it's still a one-off operation, even with an
> additional wait. You start passt-repair once to do its job, and once it
> does the job, it terminates.

Right, also what I was intending... but of course, I realised a few
hours later that I'd forgotten that our filesystem isolation means we
can't connect() to a Unix socket in the wider world at migration time.
There are ways to work around that, of course, but I have yet to spot
one that seems better than the status quo.  So,

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

I guess.


> > Judging by the early queries from the Kubevirt guys it seems
> > to be the more obvious model people would expect.
>=20
> Note that you talked to people taking care of (almost exclusively)
> networking there. Knowing myself a fair bit of code, architecture, and
> history of security weaknesses and discussions around them, I'm pretty
> sure you can't do the opposite.
>=20
> The long-term solutions are anyway having libvirt provide a phase or
> blocking event where this can be done, and, at least in KubeVirt's
> case, eventually handle this with seitan.

Right.  Probably not really valuable while it still needs
CAP_NET_ADMIN, but if we do manage to change that to CAP_NET_RAW, it
would make sense for passt to only use the helper if it can't do the
TCP_REPAIR itself.  That makes setup simpler for environments that
don't have counterproductive "security" policies.

> > On the other hand,
> > configuring the socket path is still an issue, and being able to tell
> > passt where to connect to through libvirt might become more important
> > than it is now.
> >=20
> > > However, this part looks solid and it should do what we need. I
> > > tested it by starting passt-repair after sleep(1) in migrate/* tests,
> > > changing this timeout to a minute, on either source, target, or both.
> > >=20
> > >  flow.c   | 20 ++++++++++++++++++++
> > >  repair.c | 31 +++++++++++++++++++++++++++++++
> > >  repair.h |  1 +
> > >  3 files changed, 52 insertions(+)
> > >=20
> > > diff --git a/flow.c b/flow.c
> > > index 749c498..5e64b79 100644
> > > --- a/flow.c
> > > +++ b/flow.c
> > > @@ -911,6 +911,21 @@ static int flow_migrate_source_rollback(struct c=
tx *c, unsigned bound, int ret)
> > >  	return ret;
> > >  }
> > > =20
> > > +/**
> > > + * flow_migrate_need_repair() - Do we need to set repair mode for an=
y flow?
> > > + *
> > > + * Return: true if repair mode is needed, false otherwise
> > > + */
> > > +static bool flow_migrate_need_repair(void)
> > > +{
> > > +	union flow *flow;
> > > +
> > > +	foreach_established_tcp_flow(flow)
> > > +		return true;
> > > +
> > > +	return false;
> > > +}
> > > +
> > >  /**
> > >   * flow_migrate_repair_all() - Turn repair mode on or off for all fl=
ows
> > >   * @c:		Execution context
> > > @@ -966,6 +981,9 @@ int flow_migrate_source_pre(struct ctx *c, const =
struct migrate_stage *stage,
> > >  	(void)stage;
> > >  	(void)fd;
> > > =20
> > > +	if (flow_migrate_need_repair())
> > > +		repair_wait(c);
> > > +
> > >  	if ((rc =3D flow_migrate_repair_all(c, true)))
> > >  		return -rc;
> > > =20
> > > @@ -1083,6 +1101,8 @@ int flow_migrate_target(struct ctx *c, const st=
ruct migrate_stage *stage,
> > >  	if (!count)
> > >  		return 0;
> > > =20
> > > +	repair_wait(c);
> > > +
> > >  	if ((rc =3D flow_migrate_repair_all(c, true)))
> > >  		return -rc;
> > > =20
> > > diff --git a/repair.c b/repair.c
> > > index 3ee089f..ce5a4a4 100644
> > > --- a/repair.c
> > > +++ b/repair.c
> > > @@ -27,6 +27,11 @@
> > > =20
> > >  #define SCM_MAX_FD 253 /* From Linux kernel (include/net/scm.h), not=
 in UAPI */
> > > =20
> > > +/* Wait for a while for TCP_REPAIR helper to connect if it's not the=
re yet */
> > > +#define REPAIR_ACCEPT_TIMEOUT_MS	100 =20
> >=20
> > 100ms is a pretty long time in the context of a stopped guest.  I
> > guess is is a worst case.
>=20
> It's actually not a case I can reproduce at all, as expected:
> passt-repair connects quickly enough once the socket is available.

Good to know.  It does mean we're adding 100ms of downtime for the
fallback case where there is no passt-repair, of course.

> > We could possibly mitigate it by doing the
>=20
> Mitigate what exactly? The expected mode of operation is with
> passt-repair available. The failure handling is a fallback. I can turn
> it into 1 ms, it just seems to increase risk for no good reason.

> > wait when memory change logging goes on (so migration is starting, bu
> > the guest is not get stopped).  However at that point we can't rely on
> > flow_migrate_need_repair(), because we might have no flows, but one is
> > opened before the guest is stopped.
>=20
> Right, that complicates things quite a lot.
>=20
> > > +#define REPAIR_ACCEPT_TIMEOUT_US	(REPAIR_ACCEPT_TIMEOUT_MS * 1000)
> > > +static_assert(REPAIR_ACCEPT_TIMEOUT_US < 1000 * 1000); =20
> >=20
> > In this context the static_assert() seems like a pretty arbitrary
> > bound.  To make the reason for it clearer, I'd suggest moving it to
> > where we construct the timeval.
>=20
> Oh, I didn't think of that. Sure, I'll move it.
>=20
> > > +
> > >  /* Pending file descriptors for next repair_flush() call, or command=
 change */
> > >  static int repair_fds[SCM_MAX_FD];
> > > =20
> > > @@ -138,6 +143,32 @@ void repair_handler(struct ctx *c, uint32_t even=
ts)
> > >  	repair_close(c);
> > >  }
> > > =20
> > > +/**
> > > + * repair_wait() - Wait (with timeout) for TCP_REPAIR helper to conn=
ect
> > > + * @c:		Execution context
> > > + */
> > > +void repair_wait(struct ctx *c)
> > > +{
> > > +	struct timeval tv =3D { .tv_sec =3D 0,
> > > +			      .tv_usec =3D (long)(REPAIR_ACCEPT_TIMEOUT_US) };
> > > +
> > > +	if (c->fd_repair >=3D 0 || c->fd_repair_listen =3D=3D -1)
> > > +		return;
> > > +
> > > +	if (setsockopt(c->fd_repair_listen, SOL_SOCKET, SO_RCVTIMEO,
> > > +		       &tv, sizeof(tv))) {
> > > +		err_perror("Set timeout on TCP_REPAIR listening socket");
> > > +		return;
> > > +	}
> > > +
> > > +	repair_listen_handler(c, EPOLLIN);
> > > +
> > > +	tv.tv_usec =3D 0;
> > > +	if (setsockopt(c->fd_repair_listen, SOL_SOCKET, SO_RCVTIMEO,
> > > +		       &tv, sizeof(tv)))
> > > +		err_perror("Clear timeout on TCP_REPAIR listening socket");
> > > +}
> > > +
> > >  /**
> > >   * repair_flush() - Flush current set of sockets to helper, with cur=
rent command
> > >   * @c:		Execution context
> > > diff --git a/repair.h b/repair.h
> > > index de279d6..1d37922 100644
> > > --- a/repair.h
> > > +++ b/repair.h
> > > @@ -10,6 +10,7 @@ void repair_sock_init(const struct ctx *c);
> > >  void repair_listen_handler(struct ctx *c, uint32_t events);
> > >  void repair_handler(struct ctx *c, uint32_t events);
> > >  void repair_close(struct ctx *c);
> > > +void repair_wait(struct ctx *c);
> > >  int repair_flush(struct ctx *c);
> > >  int repair_set(struct ctx *c, int s, int cmd);
>=20

--=20
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

--Yh8UbXoOOFTASIXD
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmfKr3EACgkQzQJF27ox
2GcDsg/9Hgwg5xNeOq7EW6dzjXUVFOQZWa5OK9Whk7K0o5TE+demmN988U6H0l6l
nVTG+zwVYWgm3nUN2BV3E25vFikFXUfpkhSZ5iEUuishZj0juqx8a0TQ8E122/hR
FDbzM47CgYZe8PzP2BWPwmeGEJiRl4tyNVkxzCgHb3o+vcR29vAGIW9BZT34vkEN
SepLGN4v6pXvsXPiyWc/czZNphN4qpiYam2XCocV920ciGxmwSUp6Mx0xVUgBnN6
Yqoeh2jv6MTtoXPizLazuIy9Dcc7Dk3AJQD01LiaNmsIX582Fqcmv+L1F/IbogBd
ozJcUxnIWGrU3tpAtZZg04dXuRpHtgqu68O0lJX4JOIR/WvBRfcRKCepMkygxEIb
P/CeNsnlM6mp4mAx2Sn3vaUOo+JHfI1d4bHmqWlnXnb9T6P7ZfNGMKWgdKpvjn0I
i2ap8G0OqB41Va6VOW8yGxRZPaMVifoSirjAkSv0Q+z0mCSw6oQzd8O0CjsqmqoC
vhwxdYeI3vxdLLMvci7ath+hCF9jVenZag8iGjjGRndkR4GVYU2JEizOvyPcTN1F
I0v4jq014qn8hjNY4r1cBPndF4zi+Hm0gh32zo8mcp0bya2FBqDWrTiT2n9ZS4SB
YFDMk+BBvry0DFGQRaww2QAOhPNotFc/+sWnARg79vaYT/2OIG0=
=PZSb
-----END PGP SIGNATURE-----

--Yh8UbXoOOFTASIXD--