From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from gandalf.ozlabs.org (gandalf.ozlabs.org [150.107.74.76]) by passt.top (Postfix) with ESMTPS id 4E5C35A0275 for ; Mon, 19 Feb 2024 09:19:05 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202312; t=1708330743; bh=uqUVZeXlAjGIgWJL/bVy20i9PPj29QB4cDqkrqqaqN4=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=a74M813LceWx4DzFbdtmBra2Wxwi+f7wbW+jPsfTEknsgRhj5IGLiIJHPu4Zx4Qt2 BHkwg6osgt4g1sT3VzmoOKGOtHKqHgJwTliTdwPc9rc04P8wduSqsOsjAV8n+GgSbD jhx0ns6SOQte4YraKkynOWb9c/V9b+tm7AN37rcNQOrrit7FDLeKQt1Pz4Bv7zYaqf XOtstw03ClM6WAgYrnxn+TTfb3qc25pNKp07RKkoJhFIzTIs2AbFHfdc8aFodkJFNV W1Dao6GcgJF8mrZqHcCNp+sl29eR2NOMDD9NV8aEQpXGO3gvLwiskHB7VUM/+lnFXO Edf3gZ6iLFUBA== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4Tdb6H0qT3z4x1P; Mon, 19 Feb 2024 19:19:03 +1100 (AEDT) Date: Mon, 19 Feb 2024 19:18:47 +1100 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH] pasta: Don't try to watch namespaces in procfs with inotify, use timer instead Message-ID: References: <20240217133457.2331118-1-sbrivio@redhat.com> <20240219090505.60f01ab7@elisabeth> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="FWr0xPmWluvA7Kmy" Content-Disposition: inline In-Reply-To: <20240219090505.60f01ab7@elisabeth> Message-ID-Hash: NTZDX75XMSAYU23WENMPITNBDPOZATMR X-Message-ID-Hash: NTZDX75XMSAYU23WENMPITNBDPOZATMR X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, Paul Holzinger X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --FWr0xPmWluvA7Kmy Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Mon, Feb 19, 2024 at 09:05:05AM +0100, Stefano Brivio wrote: > On Mon, 19 Feb 2024 13:35:48 +1100 > David Gibson wrote: >=20 > > On Sat, Feb 17, 2024 at 02:34:57PM +0100, Stefano Brivio wrote: > > > We watch network namespace entries to detect when we should quit > > > (unless --no-netns-quit is passed), and these might stored in a tmpfs > > > typically mounted at /run/user/UID or /var/run/user/UID, or found in > > > procfs at /proc/PID/ns/. > > >=20 > > > Currently, we try to use inotify for any possible location of those > > > entries, but inotify, of course, doesn't work on pseudo-filesystems > > > (see inotify(7)). > > >=20 > > > The man page reflects this: the description of --no-netns-quit > > > implies that we won't quit anyway if the namespace is not "bound to > > > the filesystem". > > >=20 > > > Well, we won't quit, but, since commit 9e0dbc894813 ("More > > > deterministic detection of whether argument is a PID, PATH or NAME"), > > > we try. And, indeed, this is harmless, as the caveat from that > > > commit message states. > > >=20 > > > Now, it turns out that Buildah, a tool to create container images, > > > sharing its codebase with Podman, passes a procfs entry to pasta, and > > > expects pasta to exit once the network namespace is not needed > > > anymore, that is, once the original Buildah process terminates. > > >=20 > > > Get this to work by using the timer fallback mechanism if the > > > namespace name is passed as a path belonging to a pseudo-filesystem. > > > This is expected to be procfs, but I covered sysfs and devpts > > > pseudo-filesystems as well, because nothing actually prevents > > > creating this kind of directory structure and links there. > > >=20 > > > Note that statfs(), according to some versions of man pages, was > > > apparently "deprecated" by the LSB. My reasoning for using it is > > > essentially this: > > > https://lore.kernel.org/linux-man/f54kudgblgk643u32tb6at4cd3kkzha6h= slahv24szs4raroaz@ogivjbfdaqtb/t/#u > > >=20 > > > ...that is, there was no such thing as an LSB deprecation, and > > > anyway there's no other way to get the filesystem type. =20 > >=20 > > Huh, weird. > >=20 > > > Also note that, while it might sound more robust to detect the > > > filesystem type using fstatfs() on the file descriptor > > > (c->pasta_netns_fd), the reported filesystem type for it is nsfs, no > > > matter what path was given to pasta. If we use the parent directory, > > > we'll typically have either tmpfs or procfs reported. =20 > >=20 > > We could, however, move the opening of the parent directory earlier > > and use fstatfs() on that instead of statfs() on the path. >=20 > Done in v3. >=20 > > > The timer, however, still uses the file descriptor of the parent > > > directory later, as it has no access to the filesystem, and that > > > avoids possible races if the PID is recycled: if the original process > > > terminates, the handle we have on /proc/PID/ns still refers to it, > > > not to any other process with the same PID. =20 > >=20 > > True, but it's not obvious to me how that's relevant to this patch. > >=20 > > > We could have used pidfd_open() to get a handle on the parent > > > process. But it's not guaranteed that the parent process is actually > > > the one associated to the network namespace we operate on, and if we > > > get a PID file descriptor for a PID (parent or not) or path that was > > > given on our command line, this inherently causes a race condition as > > > that PID might have been recycled by the time we call pidfd_open(). = =20 > >=20 > > Again, not really sure of the relevance of this to the patch at hand. >=20 > Right, sorry, in both paragraphs I took the context of the Podman issue > for granted. Changed in v3. >=20 > > > Even assuming the process we want to watch is the parent process, and > > > a race-free usage of pidfd_open() would have been possible, I'm not > > > very enthusiastic about enabling yet another system call in the > > > seccomp profile just for this, while openat() is needed anyway. > > >=20 > > > Update the man page to reflect that, even if the target network > > > namespace is passed as a procfs path or a PID, we'll now quit when > > > the procfs entry is gone. > > >=20 > > > Reported-by: Paul Holzinger > > > Link: https://github.com/containers/podman/pull/21563#issuecomment-19= 48200214 > > > Signed-off-by: Stefano Brivio > > > --- > > > passt.1 | 8 ++++++-- > > > pasta.c | 15 +++++++++++++-- > > > 2 files changed, 19 insertions(+), 4 deletions(-) > > >=20 > > > diff --git a/passt.1 b/passt.1 > > > index dc2d719..de6e3bf 100644 > > > --- a/passt.1 > > > +++ b/passt.1 > > > @@ -550,8 +550,12 @@ without \-\-userns. > > > =20 > > > .TP > > > .BR \-\-no-netns-quit > > > -If the target network namespace is bound to the filesystem (that is,= if PATH or > > > -NAME are given as target), do not exit once the network namespace is= deleted. > > > +If the target network namespace is bound to the filesystem, do not e= xit once > > > +that path is deleted. > > > + > > > +If the target network namespace is represented by a procfs entry, do= not exit > > > +once that entry is removed from procfs (representing the fact that a= process > > > +with the given PID terminated). =20 > >=20 > > Can't you now simplify this all to "Do not exit when the target > > network namespace ends"? >=20 > That was my first attempt, but as I was writing something of that sort, > I realised it's actually false: as long as pasta is using the network > namespace, the namespace exists and networking is still up and running. >=20 > One could delete the entry in tmpfs, or the original process could > terminate, join the network namespace after that, and keep using it. Ah, true. I wish there were a more succinct way to express this, but I can't think of it either. > So I guess it's better to clarify what are our conditions for > terminating (and, most likely, terminating the target namespace as a > consequence). >=20 > > > .TP > > > .BR \-\-config-net > > > diff --git a/pasta.c b/pasta.c > > > index 01d1511..4110917 100644 > > > --- a/pasta.c > > > +++ b/pasta.c > > > @@ -33,6 +33,7 @@ > > > #include > > > #include > > > #include > > > +#include > > > #include > > > #include > > > #include > > > @@ -41,6 +42,7 @@ > > > #include > > > #include > > > #include > > > +#include > > > =20 > > > #include "util.h" > > > #include "passt.h" > > > @@ -390,12 +392,18 @@ void pasta_netns_quit_init(const struct ctx *c) > > > union epoll_ref ref =3D { .type =3D EPOLL_TYPE_NSQUIT_INOTIFY }; > > > struct epoll_event ev =3D { .events =3D EPOLLIN }; > > > int flags =3D O_NONBLOCK | O_CLOEXEC; > > > - int fd; > > > + struct statfs s =3D { 0 }; =20 > >=20 > > I don't think you need to initialise this, since you should only be > > reading it in the case that statfs() succeeds. >=20 > There's no functional need, but I don't want to expose our stack memory > to the kernel. That seems... pointlessly paranoid to me. The syscall just takes a pointer, which implies that the kernel can access our memory anyway. > Even though the kernel is generally able to control the process, I feel > like it's nice to avoid it, and I think it's consistent with most of > our system call usage. Hrm, not generally in places I've been working at any rate. I generally dislike initialization's that aren't essential, because it eliminates the possibility of compiler or checker warnings about uninitialised memory if we later mess up the logic below. > In a number of cases, valgrind might complain about uninitialised > bytes, too. If it's a false positive, we can work around that the same way we do for the returned socket address from accept() et al. If it's not a false positive, then that's a good thing. > > > + bool try_inotify =3D true; > > > + int fd =3D -1; > > > =20 > > > if (c->mode !=3D MODE_PASTA || c->no_netns_quit || !*c->netns_base) > > > return; > > > =20 > > > - if ((fd =3D inotify_init1(flags)) < 0) > > > + if (statfs(c->netns_dir, &s) || s.f_type =3D=3D DEVPTS_SUPER_MA= GIC || > > > + s.f_type =3D=3D PROC_SUPER_MAGIC || s.f_type =3D=3D SYSFS_MAGIC) > > > + try_inotify =3D false; > > > + > > > + if (try_inotify && (fd =3D inotify_init1(flags)) < 0) > > > warn("inotify_init1(): %s, use a timer", strerror(errno)); > > > =20 > > > if (fd >=3D 0 && inotify_add_watch(fd, c->netns_dir, IN_DELETE) < 0= ) { > > > @@ -463,6 +471,9 @@ void pasta_netns_quit_timer_handler(struct ctx *c= , union epoll_ref ref) > > > =20 > > > fd =3D openat(ref.nsdir_fd, c->netns_base, O_PATH | O_CLOEXEC); > > > if (fd < 0) { > > > + if (errno =3D=3D EACCES) /* Expected for existing procfs entry */ > > > + return; =20 > >=20 > > This seems like an unrelated fix. >=20 > It's actually related because, before this patch, procfs entries aren't > handled by the timer, and with tmpfs entries you don't get EACCES. Ah, ok. > > > + > > > info("Namespace %s is gone, exiting", c->netns_base); > > > exit(EXIT_SUCCESS); > > > } =20 >=20 --=20 David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --FWr0xPmWluvA7Kmy Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmXTDuYACgkQzQJF27ox 2Gcnww/8Cbjxm+DZo5/BJMusz0CbEOiH9wuOijwxX/W54J8tPoeFMk7PldqzJa5Z r9nQLroxhF+sGrbh1viqrB1OUTduX++l022C8F6TZBSUfn+MJAoFu1Sk5lUP0EeB NCC9nbKZ3dN47GXf01Z/GTF7dz2M7ei79b0b1gGnBG6eSpNDcR2jlFL/BZytsFTZ Kcox0IEO2X8/gcniTn9YfBAD3yEzUADelbD2HVroiEmtUwL0GwxDpVSyHduq7YA4 Du9zEWxFi3KKy75AumCStWendfR4zgj3Hw3Etnx0JhmmjYcnDMOp0FYa8xuIrVgA SddGmQTIdkP+sRkUOAdPiUhyi1djt/1v8tSdu4jJD57W2DgABX4Qp3/bKrVszoUt SKNNWNy95ZN/PxHktccI8fZdZf6etYRE9CAifSWT1ZGWTUCWLYtdR05JYoV54phH x1fiDQ8ndZnKILa/r2YMD+/oFU9SBF5JjjNi7ZXLMP+U5GFhGtBRvFS+t+w+f9r+ lyGxsE34wh7HanQx1HJP8d/+uFNPR2SHV6BdMw9oxrCJ9+C6fhleBdDKC1T2OjZf 2Bkyr4U9s4tgNWUJaEVQihUe1XNL/QD6eq5C5Dm+pP92IP9+Ie/gX0jAtp+I5tQL /DSgWwqFxUXtC4smdlPc6i2UL1+sX6JgCc+266mR7ZGMd05txew= =y6O2 -----END PGP SIGNATURE----- --FWr0xPmWluvA7Kmy--