From: Stefano Brivio <sbrivio@redhat.com>
To: Paul Holzinger <pholzing@redhat.com>
Cc: passt-dev@passt.top, David Gibson <david@gibson.dropbear.id.au>
Subject: Re: [PATCH v3] pasta: Don't try to watch namespaces in procfs with inotify, use timer instead
Date: Mon, 19 Feb 2024 14:06:36 +0100 [thread overview]
Message-ID: <20240219140636.0d6219e0@elisabeth> (raw)
In-Reply-To: <b5bdc669-6a9f-491c-abb9-cc85f21ecbc7@redhat.com>
On Mon, 19 Feb 2024 13:40:59 +0100
Paul Holzinger <pholzing@redhat.com> wrote:
> Hi Stefano, thanks for the patch.
>
> On 19/02/2024 09:05, Stefano Brivio wrote:
> > We watch network namespace entries to detect when we should quit
> > (unless --no-netns-quit is passed), and these might stored in a tmpfs
> > typically mounted at /run/user/UID or /var/run/user/UID, or found in
> > procfs at /proc/PID/ns/.
> >
> > Currently, we try to use inotify for any possible location of those
> > entries, but inotify, of course, doesn't work on pseudo-filesystems
> > (see inotify(7)).
> >
> > The man page reflects this: the description of --no-netns-quit
> > implies that we won't quit anyway if the namespace is not "bound to
> > the filesystem".
> >
> > Well, we won't quit, but, since commit 9e0dbc894813 ("More
> > deterministic detection of whether argument is a PID, PATH or NAME"),
> > we try. And, indeed, this is harmless, as the caveat from that
> > commit message states.
> >
> > Now, it turns out that Buildah, a tool to create container images,
> > sharing its codebase with Podman, passes a procfs entry to pasta, and
> > expects pasta to exit once the network namespace is not needed
> > anymore, that is, once the original Buildah process terminates.
> Note we do not want to track buildah, we pass down the pid of the
> container process so the tree looks like this:
>
> buildah
> \_ container process
> \_ pasta
>
> We pass down the pid of the container process to pasta and it needs to
> exit with it. One buildah can spawn multiple pasta processes at once.
> Each RUN instruction in a Dockerfile will setup a new container and thus
> a new netns.
Oh, I didn't know that, thanks for clarifying, I'll fix this in v4.
> > Get this to work by using the timer fallback mechanism if the
> > namespace name is passed as a path belonging to a pseudo-filesystem.
> > This is expected to be procfs, but I covered sysfs and devpts
> > pseudo-filesystems as well, because nothing actually prevents
> > creating this kind of directory structure and links there.
> >
> > Note that fstatfs(), according to some versions of man pages, was
> > apparently "deprecated" by the LSB. My reasoning for using it is
> > essentially this:
> > https://lore.kernel.org/linux-man/f54kudgblgk643u32tb6at4cd3kkzha6hslahv24szs4raroaz@ogivjbfdaqtb/t/#u
> >
> > ...that is, there was no such thing as an LSB deprecation, and
> > anyway there's no other way to get the filesystem type.
> >
> > Also note that, while it might sound more obvious to detect the
> > filesystem type using fstatfs() on the file descriptor itself
> > (c->pasta_netns_fd), the reported filesystem type for it is nsfs, no
> > matter what path was given to pasta. If we use the parent directory,
> > we'll typically have either tmpfs or procfs reported.
> >
> > If the target naemsapce is given as a PID, or as a PID-based procfs
> typo namespace
Fixed.
> > entry, we don't risk races if this PID is recycled: our handle on
> > /proc/PID/ns will always refer to the original namespace associated
> > with that PID, and we don't re-open this entry from procfs to check
> > it.
> >
> > Instead of directly monitoring the target namespace, we could have
> > tried to monitor a process with a given PID, using pidfd_open() to
> > get a handle on it, to decide when to terminate.
> >
> > But it's not guaranteed that the parent process is actually the one
> > associated to the network namespace we operate on, and if we get a
> > PID file descriptor for a PID (parent or not) or path that was given
> > on our command line, this inherently causes a race condition as that
> > PID might have been recycled by the time we call pidfd_open().
> Well I mean the race is always there regardless of pidfd_open, already
> when you open /proc/$PID/ns/net the race exists as it may no longer be
> the correct pid if the process exited (and was reaped) before it can
> open the path.
For a PID referring to a third process, you're right, I'll change this
paragraph.
But if the PID is the one of the parent, we don't have this issue
because conf() is called before __daemon() in passt.c. If the parent
terminates before we get a reference to /proc/$PID/ns/net, pasta will
also terminate.
> I think if we really want a race free interface then it
> would make the most sense to have the parent pass down a pidfd to pasta,
> this allows pasta to poll the fd to see the exit and also to call setns
> on that fd.
...and if we want a pidfd referring to another process that's not the
parent, right, that's the only way: the parent could first get a pidfd
referring to a child process, and then pass that to pasta. I guess we
should implement this at some point.
> >
> > Even assuming the process we want to watch is the parent process, and
> > a race-free usage of pidfd_open() would have been possible, I'm not
> > very enthusiastic about enabling yet another system call in the
> > seccomp profile just for this, while openat() is needed anyway.
> If the parent opens the pidfd then this would not be an issue.
Right, we could then just add it to the epoll set -- I'll adjust this
paragraph as well.
Does the patch otherwise work for you? I guess we would need a new
release if we want to check this with Buildah, right?
--
Stefano
next prev parent reply other threads:[~2024-02-19 13:07 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-19 8:05 [PATCH v3] pasta: Don't try to watch namespaces in procfs with inotify, use timer instead Stefano Brivio
2024-02-19 8:26 ` David Gibson
2024-02-19 12:34 ` Stefano Brivio
2024-02-19 12:40 ` Paul Holzinger
2024-02-19 13:06 ` Stefano Brivio [this message]
2024-02-19 13:40 ` Paul Holzinger
2024-02-19 15:18 ` Stefano Brivio
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240219140636.0d6219e0@elisabeth \
--to=sbrivio@redhat.com \
--cc=david@gibson.dropbear.id.au \
--cc=passt-dev@passt.top \
--cc=pholzing@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).