On Mon, Feb 19, 2024 at 09:05:05AM +0100, Stefano Brivio wrote:
> On Mon, 19 Feb 2024 13:35:48 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Sat, Feb 17, 2024 at 02:34:57PM +0100, Stefano Brivio wrote:
> > > We watch network namespace entries to detect when we should quit
> > > (unless --no-netns-quit is passed), and these might stored in a tmpfs
> > > typically mounted at /run/user/UID or /var/run/user/UID, or found in
> > > procfs at /proc/PID/ns/.
> > > 
> > > Currently, we try to use inotify for any possible location of those
> > > entries, but inotify, of course, doesn't work on pseudo-filesystems
> > > (see inotify(7)).
> > > 
> > > The man page reflects this: the description of --no-netns-quit
> > > implies that we won't quit anyway if the namespace is not "bound to
> > > the filesystem".
> > > 
> > > Well, we won't quit, but, since commit 9e0dbc894813 ("More
> > > deterministic detection of whether argument is a PID, PATH or NAME"),
> > > we try. And, indeed, this is harmless, as the caveat from that
> > > commit message states.
> > > 
> > > Now, it turns out that Buildah, a tool to create container images,
> > > sharing its codebase with Podman, passes a procfs entry to pasta, and
> > > expects pasta to exit once the network namespace is not needed
> > > anymore, that is, once the original Buildah process terminates.
> > > 
> > > Get this to work by using the timer fallback mechanism if the
> > > namespace name is passed as a path belonging to a pseudo-filesystem.
> > > This is expected to be procfs, but I covered sysfs and devpts
> > > pseudo-filesystems as well, because nothing actually prevents
> > > creating this kind of directory structure and links there.
> > > 
> > > Note that statfs(), according to some versions of man pages, was
> > > apparently "deprecated" by the LSB. My reasoning for using it is
> > > essentially this:
> > >   https://lore.kernel.org/linux-man/f54kudgblgk643u32tb6at4cd3kkzha6hslahv24szs4raroaz@ogivjbfdaqtb/t/#u
> > > 
> > > ...that is, there was no such thing as an LSB deprecation, and
> > > anyway there's no other way to get the filesystem type.  
> > 
> > Huh, weird.
> > 
> > > Also note that, while it might sound more robust to detect the
> > > filesystem type using fstatfs() on the file descriptor
> > > (c->pasta_netns_fd), the reported filesystem type for it is nsfs, no
> > > matter what path was given to pasta. If we use the parent directory,
> > > we'll typically have either tmpfs or procfs reported.  
> > 
> > We could, however, move the opening of the parent directory earlier
> > and use fstatfs() on that instead of statfs() on the path.
> 
> Done in v3.
> 
> > > The timer, however, still uses the file descriptor of the parent
> > > directory later, as it has no access to the filesystem, and that
> > > avoids possible races if the PID is recycled: if the original process
> > > terminates, the handle we have on /proc/PID/ns still refers to it,
> > > not to any other process with the same PID.  
> > 
> > True, but it's not obvious to me how that's relevant to this patch.
> > 
> > > We could have used pidfd_open() to get a handle on the parent
> > > process. But it's not guaranteed that the parent process is actually
> > > the one associated to the network namespace we operate on, and if we
> > > get a PID file descriptor for a PID (parent or not) or path that was
> > > given on our command line, this inherently causes a race condition as
> > > that PID might have been recycled by the time we call pidfd_open().  
> > 
> > Again, not really sure of the relevance of this to the patch at hand.
> 
> Right, sorry, in both paragraphs I took the context of the Podman issue
> for granted. Changed in v3.
> 
> > > Even assuming the process we want to watch is the parent process, and
> > > a race-free usage of pidfd_open() would have been possible, I'm not
> > > very enthusiastic about enabling yet another system call in the
> > > seccomp profile just for this, while openat() is needed anyway.
> > > 
> > > Update the man page to reflect that, even if the target network
> > > namespace is passed as a procfs path or a PID, we'll now quit when
> > > the procfs entry is gone.
> > > 
> > > Reported-by: Paul Holzinger <pholzing@redhat.com>
> > > Link: https://github.com/containers/podman/pull/21563#issuecomment-1948200214
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > ---
> > >  passt.1 |  8 ++++++--
> > >  pasta.c | 15 +++++++++++++--
> > >  2 files changed, 19 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/passt.1 b/passt.1
> > > index dc2d719..de6e3bf 100644
> > > --- a/passt.1
> > > +++ b/passt.1
> > > @@ -550,8 +550,12 @@ without \-\-userns.
> > >  
> > >  .TP
> > >  .BR \-\-no-netns-quit
> > > -If the target network namespace is bound to the filesystem (that is, if PATH or
> > > -NAME are given as target), do not exit once the network namespace is deleted.
> > > +If the target network namespace is bound to the filesystem, do not exit once
> > > +that path is deleted.
> > > +
> > > +If the target network namespace is represented by a procfs entry, do not exit
> > > +once that entry is removed from procfs (representing the fact that a process
> > > +with the given PID terminated).  
> > 
> > Can't you now simplify this all to "Do not exit when the target
> > network namespace ends"?
> 
> That was my first attempt, but as I was writing something of that sort,
> I realised it's actually false: as long as pasta is using the network
> namespace, the namespace exists and networking is still up and running.
> 
> One could delete the entry in tmpfs, or the original process could
> terminate, join the network namespace after that, and keep using it.

Ah, true.  I wish there were a more succinct way to express this, but
I can't think of it either.

> So I guess it's better to clarify what are our conditions for
> terminating (and, most likely, terminating the target namespace as a
> consequence).
> 
> > >  .TP
> > >  .BR \-\-config-net
> > > diff --git a/pasta.c b/pasta.c
> > > index 01d1511..4110917 100644
> > > --- a/pasta.c
> > > +++ b/pasta.c
> > > @@ -33,6 +33,7 @@
> > >  #include <sys/timerfd.h>
> > >  #include <sys/types.h>
> > >  #include <sys/stat.h>
> > > +#include <sys/statfs.h>
> > >  #include <fcntl.h>
> > >  #include <sys/wait.h>
> > >  #include <signal.h>
> > > @@ -41,6 +42,7 @@
> > >  #include <netinet/in.h>
> > >  #include <net/ethernet.h>
> > >  #include <sys/syscall.h>
> > > +#include <linux/magic.h>
> > >  
> > >  #include "util.h"
> > >  #include "passt.h"
> > > @@ -390,12 +392,18 @@ void pasta_netns_quit_init(const struct ctx *c)
> > >  	union epoll_ref ref = { .type = EPOLL_TYPE_NSQUIT_INOTIFY };
> > >  	struct epoll_event ev = { .events = EPOLLIN };
> > >  	int flags = O_NONBLOCK | O_CLOEXEC;
> > > -	int fd;
> > > +	struct statfs s = { 0 };  
> > 
> > I don't think you need to initialise this, since you should only be
> > reading it in the case that statfs() succeeds.
> 
> There's no functional need, but I don't want to expose our stack memory
> to the kernel.

That seems... pointlessly paranoid to me.  The syscall just takes a
pointer, which implies that the kernel can access our memory anyway.

> Even though the kernel is generally able to control the process, I feel
> like it's nice to avoid it, and I think it's consistent with most of
> our system call usage.

Hrm, not generally in places I've been working at any rate.  I
generally dislike initialization's that aren't essential, because it
eliminates the possibility of compiler or checker warnings about
uninitialised memory if we later mess up the logic below.

> In a number of cases, valgrind might complain about uninitialised
> bytes, too.

If it's a false positive, we can work around that the same way we do
for the returned socket address from accept() et al.  If it's not a
false positive, then that's a good thing.

> > > +	bool try_inotify = true;
> > > +	int fd = -1;
> > >  
> > >  	if (c->mode != MODE_PASTA || c->no_netns_quit || !*c->netns_base)
> > >  		return;
> > >  
> > > -	if ((fd = inotify_init1(flags)) < 0)
> > > +	if (statfs(c->netns_dir, &s)     || s.f_type == DEVPTS_SUPER_MAGIC ||
> > > +	    s.f_type == PROC_SUPER_MAGIC || s.f_type == SYSFS_MAGIC)
> > > +		try_inotify = false;
> > > +
> > > +	if (try_inotify && (fd = inotify_init1(flags)) < 0)
> > >  		warn("inotify_init1(): %s, use a timer", strerror(errno));
> > >  
> > >  	if (fd >= 0 && inotify_add_watch(fd, c->netns_dir, IN_DELETE) < 0) {
> > > @@ -463,6 +471,9 @@ void pasta_netns_quit_timer_handler(struct ctx *c, union epoll_ref ref)
> > >  
> > >  	fd = openat(ref.nsdir_fd, c->netns_base, O_PATH | O_CLOEXEC);
> > >  	if (fd < 0) {
> > > +		if (errno == EACCES)	/* Expected for existing procfs entry */
> > > +			return;  
> > 
> > This seems like an unrelated fix.
> 
> It's actually related because, before this patch, procfs entries aren't
> handled by the timer, and with tmpfs entries you don't get EACCES.

Ah, ok.

> > > +
> > >  		info("Namespace %s is gone, exiting", c->netns_base);
> > >  		exit(EXIT_SUCCESS);
> > >  	}  
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson