From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from gandalf.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id DF0095A0279 for ; Mon, 19 Feb 2024 03:38:10 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202312; t=1708310287; bh=bZ/qXIOMiqm0bxlBjRBPcebHuNOfP+qB2ApxPV4PMaM=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=TT9swkQZzJXvEdyynZttkzlp5XbSjMOMLpQrYkex9NEHnCI4qbeESWZVpkCh8h3fQ UnAsNYFaTHt0sjFIJt629SN86tHEJzcWAuiC5i+Kre1QZiY8kPDiAgbVoxyTuyActe RF//mIj2/DHbYRosWRn1IJ/XzXPQbkONcHvnD5uWwrDafPyTFsvurm5gAaUj/d0+2N qLkGY1sY7x3G7qeChGHYInAqjpG4s8ZhKLOBa57hERuqJmNPG95GIHbwXNogFb4igE aruUrg25UoiakxhFvK30IIoH8pkhH3YKVA+axumcfpybu4Y6G20p0EPw8kN8V1xxQN 8REGybz6Wr59w== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4TdRXv5Q4Mz4wcJ; Mon, 19 Feb 2024 13:38:07 +1100 (AEDT) Date: Mon, 19 Feb 2024 13:38:01 +1100 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH v2] pasta: Don't try to watch namespaces in procfs with inotify, use timer instead Message-ID: References: <20240217141859.2358370-1-sbrivio@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="5067lOkY9MGjcshX" Content-Disposition: inline In-Reply-To: <20240217141859.2358370-1-sbrivio@redhat.com> Message-ID-Hash: RVGDAHKZUSN2AXIPPYMHBLQRFZT2NNES X-Message-ID-Hash: RVGDAHKZUSN2AXIPPYMHBLQRFZT2NNES X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, Paul Holzinger X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --5067lOkY9MGjcshX Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sat, Feb 17, 2024 at 03:18:59PM +0100, Stefano Brivio wrote: > We watch network namespace entries to detect when we should quit > (unless --no-netns-quit is passed), and these might stored in a tmpfs > typically mounted at /run/user/UID or /var/run/user/UID, or found in > procfs at /proc/PID/ns/. >=20 > Currently, we try to use inotify for any possible location of those > entries, but inotify, of course, doesn't work on pseudo-filesystems > (see inotify(7)). >=20 > The man page reflects this: the description of --no-netns-quit > implies that we won't quit anyway if the namespace is not "bound to > the filesystem". >=20 > Well, we won't quit, but, since commit 9e0dbc894813 ("More > deterministic detection of whether argument is a PID, PATH or NAME"), > we try. And, indeed, this is harmless, as the caveat from that > commit message states. >=20 > Now, it turns out that Buildah, a tool to create container images, > sharing its codebase with Podman, passes a procfs entry to pasta, and > expects pasta to exit once the network namespace is not needed > anymore, that is, once the original Buildah process terminates. >=20 > Get this to work by using the timer fallback mechanism if the > namespace name is passed as a path belonging to a pseudo-filesystem. > This is expected to be procfs, but I covered sysfs and devpts > pseudo-filesystems as well, because nothing actually prevents > creating this kind of directory structure and links there. >=20 > Note that statfs(), according to some versions of man pages, was > apparently "deprecated" by the LSB. My reasoning for using it is > essentially this: > https://lore.kernel.org/linux-man/f54kudgblgk643u32tb6at4cd3kkzha6hslah= v24szs4raroaz@ogivjbfdaqtb/t/#u >=20 > ...that is, there was no such thing as an LSB deprecation, and > anyway there's no other way to get the filesystem type. >=20 > Also note that, while it might sound more robust to detect the > filesystem type using fstatfs() on the file descriptor > (c->pasta_netns_fd), the reported filesystem type for it is nsfs, no > matter what path was given to pasta. If we use the parent directory, > we'll typically have either tmpfs or procfs reported. >=20 > The timer, however, still uses the file descriptor of the parent > directory later, as it has no access to the filesystem, and that > avoids possible races if the PID is recycled: if the original process > terminates, the handle we have on /proc/PID/ns still refers to it, > not to any other process with the same PID. >=20 > We could have used pidfd_open() to get a handle on the parent > process. But it's not guaranteed that the parent process is actually > the one associated to the network namespace we operate on, and if we > get a PID file descriptor for a PID (parent or not) or path that was > given on our command line, this inherently causes a race condition as > that PID might have been recycled by the time we call pidfd_open(). >=20 > Even assuming the process we want to watch is the parent process, and > a race-free usage of pidfd_open() would have been possible, I'm not > very enthusiastic about enabling yet another system call in the > seccomp profile just for this, while openat() is needed anyway. >=20 > Update the man page to reflect that, even if the target network > namespace is passed as a procfs path or a PID, we'll now quit when > the procfs entry is gone. Oops, didn't spot this before replying to v1. Looks like the comments there still apply, though. > Reported-by: Paul Holzinger > Link: https://github.com/containers/podman/pull/21563#issuecomment-194820= 0214 > Signed-off-by: Stefano Brivio > --- > v2: Coverity Scan isn't happy if we "check" (kind of) c->netns_dir > with statfs() before opening it in a non-atomic way. Just to make > things clear, false positive or not: open it, check it, close it > if it wasn't needed: we don't rely on the check. >=20 > passt.1 | 8 ++++++-- > pasta.c | 24 +++++++++++++++++++----- > 2 files changed, 25 insertions(+), 7 deletions(-) >=20 > diff --git a/passt.1 b/passt.1 > index dc2d719..de6e3bf 100644 > --- a/passt.1 > +++ b/passt.1 > @@ -550,8 +550,12 @@ without \-\-userns. > =20 > .TP > .BR \-\-no-netns-quit > -If the target network namespace is bound to the filesystem (that is, if = PATH or > -NAME are given as target), do not exit once the network namespace is del= eted. > +If the target network namespace is bound to the filesystem, do not exit = once > +that path is deleted. > + > +If the target network namespace is represented by a procfs entry, do not= exit > +once that entry is removed from procfs (representing the fact that a pro= cess > +with the given PID terminated). > =20 > .TP > .BR \-\-config-net > diff --git a/pasta.c b/pasta.c > index 01d1511..465fe1a 100644 > --- a/pasta.c > +++ b/pasta.c > @@ -33,6 +33,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -41,6 +42,7 @@ > #include > #include > #include > +#include > =20 > #include "util.h" > #include "passt.h" > @@ -390,12 +392,21 @@ void pasta_netns_quit_init(const struct ctx *c) > union epoll_ref ref =3D { .type =3D EPOLL_TYPE_NSQUIT_INOTIFY }; > struct epoll_event ev =3D { .events =3D EPOLLIN }; > int flags =3D O_NONBLOCK | O_CLOEXEC; > - int fd; > + struct statfs s =3D { 0 }; > + bool try_inotify =3D true; > + int fd =3D -1, dir_fd; > =20 > if (c->mode !=3D MODE_PASTA || c->no_netns_quit || !*c->netns_base) > return; > =20 > - if ((fd =3D inotify_init1(flags)) < 0) > + if ((dir_fd =3D open(c->netns_dir, O_CLOEXEC | O_RDONLY)) < 0) > + die("netns dir open: %s, exiting", strerror(errno)); > + > + if (statfs(c->netns_dir, &s) || s.f_type =3D=3D DEVPTS_SUPER_MAGIC = || > + s.f_type =3D=3D PROC_SUPER_MAGIC || s.f_type =3D=3D SYSFS_MAGIC) > + try_inotify =3D false; Since you're already opening netns_dir, it seems prudent to use fstatfs() instead of statfs() here. > + if (try_inotify && (fd =3D inotify_init1(flags)) < 0) > warn("inotify_init1(): %s, use a timer", strerror(errno)); > =20 > if (fd >=3D 0 && inotify_add_watch(fd, c->netns_dir, IN_DELETE) < 0) { > @@ -409,11 +420,11 @@ void pasta_netns_quit_init(const struct ctx *c) > if ((fd =3D pasta_netns_quit_timer()) < 0) > die("Failed to set up fallback netns timer, exiting"); > =20 > - ref.nsdir_fd =3D open(c->netns_dir, O_CLOEXEC | O_RDONLY); > - if (ref.nsdir_fd < 0) > - die("netns dir open: %s, exiting", strerror(errno)); > + ref.nsdir_fd =3D dir_fd; > =20 > ref.type =3D EPOLL_TYPE_NSQUIT_TIMER; > + } else { > + close(dir_fd); > } > =20 > if (fd > FD_REF_MAX) > @@ -463,6 +474,9 @@ void pasta_netns_quit_timer_handler(struct ctx *c, un= ion epoll_ref ref) > =20 > fd =3D openat(ref.nsdir_fd, c->netns_base, O_PATH | O_CLOEXEC); > if (fd < 0) { > + if (errno =3D=3D EACCES) /* Expected for existing procfs entry */ > + return; > + > info("Namespace %s is gone, exiting", c->netns_base); > exit(EXIT_SUCCESS); > } --=20 David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --5067lOkY9MGjcshX Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmXSvwgACgkQzQJF27ox 2Ge/eRAAqnlB49YWvWrdNXEuGRKYUHsdCA8i3CV8SexRMO3WksvH5UGW8NqOVQc3 xRPo5vJUJ67wLi4Fzu/Ks+OlVwbNovnsognGnHUkxJ7+q9oJCEbNTk05fTgCvusV iWW9P1IQOKp5rx+KWo9UbWSnBUdPxYM/AcUX2z6xiu/6526UBV7lTkRdxarzpeHm frzvNc57zOYRM6n6QOLWJt2h+p8M9VDdfaRkZ/mxDSwUUHc3hlf2O5ePPAXIyz52 jhHbVVJCf/xjtfUq1CpfEldqY3sAiNcwGnFN80uyviNATjExQJ6hDF55kr40v7eX NuB4mikABHhr2clLenhlo5M693u455wsfcXXTgGp8LoZ7ybs62t1SqGPulH+vTRZ /vLzTA3+7xptlveOd+W1sCz+JRNMifXTd9a2/MhD0xWz7J2EJEm7RbCZJi1FC2RR E+OpVsbt0a1wn0e54TKg6QuTx/PPdUlotflTLARxJbImseOmnTveD1qPU2WdZLfP e5fKL488+5OHzmgMFlsCt0a5VDSDOnNau2g+zNNXr2XBTU3d/9BOelmGUF4C8iEc j3EaHv+lRmUVy444oVPLrSBEHb6IJO51mpCuMqWPhQ3vB4/568DhKh77QveyJ3iQ z0+4aIPRToBhY9hVHQ3raA/CedQ0HIFcny7o1FS7m1bVvRwsECA= =Nz0K -----END PGP SIGNATURE----- --5067lOkY9MGjcshX--