On Sat, Feb 17, 2024 at 02:34:57PM +0100, Stefano Brivio wrote: > We watch network namespace entries to detect when we should quit > (unless --no-netns-quit is passed), and these might stored in a tmpfs > typically mounted at /run/user/UID or /var/run/user/UID, or found in > procfs at /proc/PID/ns/. > > Currently, we try to use inotify for any possible location of those > entries, but inotify, of course, doesn't work on pseudo-filesystems > (see inotify(7)). > > The man page reflects this: the description of --no-netns-quit > implies that we won't quit anyway if the namespace is not "bound to > the filesystem". > > Well, we won't quit, but, since commit 9e0dbc894813 ("More > deterministic detection of whether argument is a PID, PATH or NAME"), > we try. And, indeed, this is harmless, as the caveat from that > commit message states. > > Now, it turns out that Buildah, a tool to create container images, > sharing its codebase with Podman, passes a procfs entry to pasta, and > expects pasta to exit once the network namespace is not needed > anymore, that is, once the original Buildah process terminates. > > Get this to work by using the timer fallback mechanism if the > namespace name is passed as a path belonging to a pseudo-filesystem. > This is expected to be procfs, but I covered sysfs and devpts > pseudo-filesystems as well, because nothing actually prevents > creating this kind of directory structure and links there. > > Note that statfs(), according to some versions of man pages, was > apparently "deprecated" by the LSB. My reasoning for using it is > essentially this: > https://lore.kernel.org/linux-man/f54kudgblgk643u32tb6at4cd3kkzha6hslahv24szs4raroaz@ogivjbfdaqtb/t/#u > > ...that is, there was no such thing as an LSB deprecation, and > anyway there's no other way to get the filesystem type. Huh, weird. > Also note that, while it might sound more robust to detect the > filesystem type using fstatfs() on the file descriptor > (c->pasta_netns_fd), the reported filesystem type for it is nsfs, no > matter what path was given to pasta. If we use the parent directory, > we'll typically have either tmpfs or procfs reported. We could, however, move the opening of the parent directory earlier and use fstatfs() on that instead of statfs() on the path. > The timer, however, still uses the file descriptor of the parent > directory later, as it has no access to the filesystem, and that > avoids possible races if the PID is recycled: if the original process > terminates, the handle we have on /proc/PID/ns still refers to it, > not to any other process with the same PID. True, but it's not obvious to me how that's relevant to this patch. > We could have used pidfd_open() to get a handle on the parent > process. But it's not guaranteed that the parent process is actually > the one associated to the network namespace we operate on, and if we > get a PID file descriptor for a PID (parent or not) or path that was > given on our command line, this inherently causes a race condition as > that PID might have been recycled by the time we call pidfd_open(). Again, not really sure of the relevance of this to the patch at hand. > Even assuming the process we want to watch is the parent process, and > a race-free usage of pidfd_open() would have been possible, I'm not > very enthusiastic about enabling yet another system call in the > seccomp profile just for this, while openat() is needed anyway. > > Update the man page to reflect that, even if the target network > namespace is passed as a procfs path or a PID, we'll now quit when > the procfs entry is gone. > > Reported-by: Paul Holzinger > Link: https://github.com/containers/podman/pull/21563#issuecomment-1948200214 > Signed-off-by: Stefano Brivio > --- > passt.1 | 8 ++++++-- > pasta.c | 15 +++++++++++++-- > 2 files changed, 19 insertions(+), 4 deletions(-) > > diff --git a/passt.1 b/passt.1 > index dc2d719..de6e3bf 100644 > --- a/passt.1 > +++ b/passt.1 > @@ -550,8 +550,12 @@ without \-\-userns. > > .TP > .BR \-\-no-netns-quit > -If the target network namespace is bound to the filesystem (that is, if PATH or > -NAME are given as target), do not exit once the network namespace is deleted. > +If the target network namespace is bound to the filesystem, do not exit once > +that path is deleted. > + > +If the target network namespace is represented by a procfs entry, do not exit > +once that entry is removed from procfs (representing the fact that a process > +with the given PID terminated). Can't you now simplify this all to "Do not exit when the target network namespace ends"? > .TP > .BR \-\-config-net > diff --git a/pasta.c b/pasta.c > index 01d1511..4110917 100644 > --- a/pasta.c > +++ b/pasta.c > @@ -33,6 +33,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -41,6 +42,7 @@ > #include > #include > #include > +#include > > #include "util.h" > #include "passt.h" > @@ -390,12 +392,18 @@ void pasta_netns_quit_init(const struct ctx *c) > union epoll_ref ref = { .type = EPOLL_TYPE_NSQUIT_INOTIFY }; > struct epoll_event ev = { .events = EPOLLIN }; > int flags = O_NONBLOCK | O_CLOEXEC; > - int fd; > + struct statfs s = { 0 }; I don't think you need to initialise this, since you should only be reading it in the case that statfs() succeeds. > + bool try_inotify = true; > + int fd = -1; > > if (c->mode != MODE_PASTA || c->no_netns_quit || !*c->netns_base) > return; > > - if ((fd = inotify_init1(flags)) < 0) > + if (statfs(c->netns_dir, &s) || s.f_type == DEVPTS_SUPER_MAGIC || > + s.f_type == PROC_SUPER_MAGIC || s.f_type == SYSFS_MAGIC) > + try_inotify = false; > + > + if (try_inotify && (fd = inotify_init1(flags)) < 0) > warn("inotify_init1(): %s, use a timer", strerror(errno)); > > if (fd >= 0 && inotify_add_watch(fd, c->netns_dir, IN_DELETE) < 0) { > @@ -463,6 +471,9 @@ void pasta_netns_quit_timer_handler(struct ctx *c, union epoll_ref ref) > > fd = openat(ref.nsdir_fd, c->netns_base, O_PATH | O_CLOEXEC); > if (fd < 0) { > + if (errno == EACCES) /* Expected for existing procfs entry */ > + return; This seems like an unrelated fix. > + > info("Namespace %s is gone, exiting", c->netns_base); > exit(EXIT_SUCCESS); > } -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson