Date: Mon, 19 Feb 2024 13:34:34 +0100
From: Stefano Brivio
To: David Gibson
CC: passt-dev@passt.top, Paul Holzinger
Subject: Re: [PATCH v3] pasta: Don't try to watch namespaces in procfs with inotify, use timer instead
Message-ID: <20240219133405.6a295633@elisabeth>
References: <20240219080533.3584215-1-sbrivio@redhat.com>
Organization: Red Hat
List-Id: Development discussion and patches for passt

On Mon, 19 Feb 2024 19:26:15 +1100
David Gibson wrote:

> On Mon, Feb 19, 2024 at 09:05:33AM +0100, Stefano Brivio wrote:
> > We watch network namespace entries to detect when we should quit
> > (unless --no-netns-quit is passed), and these might be stored in a tmpfs
> > typically mounted at /run/user/UID or /var/run/user/UID, or found in
> > procfs at /proc/PID/ns/.
> >
> > Currently, we try to use inotify for any possible location of those
> > entries, but inotify, of course, doesn't work on pseudo-filesystems
> > (see inotify(7)).
> >
> > The man page reflects this: the description of --no-netns-quit
> > implies that we won't quit anyway if the namespace is not "bound to
> > the filesystem".
> >
> > Well, we won't quit, but, since commit 9e0dbc894813 ("More
> > deterministic detection of whether argument is a PID, PATH or NAME"),
> > we try. And, indeed, this is harmless, as the caveat from that
> > commit message states.
> >
> > Now, it turns out that Buildah, a tool to create container images,
> > sharing its codebase with Podman, passes a procfs entry to pasta, and
> > expects pasta to exit once the network namespace is not needed
> > anymore, that is, once the original Buildah process terminates.
> >
> > Get this to work by using the timer fallback mechanism if the
> > namespace name is passed as a path belonging to a pseudo-filesystem.
> > This is expected to be procfs, but I covered sysfs and devpts
> > pseudo-filesystems as well, because nothing actually prevents
> > creating this kind of directory structure and links there.
> >
> > Note that fstatfs(), according to some versions of man pages, was
> > apparently "deprecated" by the LSB. My reasoning for using it is
> > essentially this:
> >   https://lore.kernel.org/linux-man/f54kudgblgk643u32tb6at4cd3kkzha6hslahv24szs4raroaz@ogivjbfdaqtb/t/#u
> >
> > ...that is, there was no such thing as an LSB deprecation, and
> > anyway there's no other way to get the filesystem type.
> >
> > Also note that, while it might sound more obvious to detect the
> > filesystem type using fstatfs() on the file descriptor itself
> > (c->pasta_netns_fd), the reported filesystem type for it is nsfs, no
> > matter what path was given to pasta. If we use the parent directory,
> > we'll typically have either tmpfs or procfs reported.
> >
> > If the target namespace is given as a PID, or as a PID-based procfs
> > entry, we don't risk races if this PID is recycled: our handle on
> > /proc/PID/ns will always refer to the original namespace associated
> > with that PID, and we don't re-open this entry from procfs to check
> > it.
> >
> > Instead of directly monitoring the target namespace, we could have
> > tried to monitor a process with a given PID, using pidfd_open() to
> > get a handle on it, to decide when to terminate.
> >
> > But it's not guaranteed that the parent process is actually the one
> > associated to the network namespace we operate on, and if we get a
> > PID file descriptor for a PID (parent or not) or path that was given
> > on our command line, this inherently causes a race condition as that
> > PID might have been recycled by the time we call pidfd_open().
> >
> > Even assuming the process we want to watch is the parent process, and
> > a race-free usage of pidfd_open() would have been possible, I'm not
> > very enthusiastic about enabling yet another system call in the
> > seccomp profile just for this, while openat() is needed anyway.
> >
> > Update the man page to reflect that, even if the target network
> > namespace is passed as a procfs path or a PID, we'll now quit when
> > the procfs entry is gone.
> >
> > Reported-by: Paul Holzinger
> > Link: https://github.com/containers/podman/pull/21563#issuecomment-1948200214
> > Signed-off-by: Stefano Brivio
> > ---
> > v3: Given that we now open c->netns_dir before checking the
> >     filesystem type, we could as well pass this file descriptor to
> >     fstatfs() to do the check, instead of statfs() on the path.
> >
> >     Fix a couple of paragraphs in the commit message.
> >
> > v2: Coverity Scan isn't happy if we "check" (kind of) c->netns_dir
> >     with statfs() before opening it in a non-atomic way. Just to make
> >     things clear, false positive or not: open it, check it, close it
> >     if it wasn't needed: we don't rely on the check.
> >
> >  passt.1 |  8 ++++++--
> >  pasta.c | 24 +++++++++++++++++++-----
> >  2 files changed, 25 insertions(+), 7 deletions(-)
> >
> > diff --git a/passt.1 b/passt.1
> > index dc2d719..de6e3bf 100644
> > --- a/passt.1
> > +++ b/passt.1
> > @@ -550,8 +550,12 @@ without \-\-userns.
> >
> >  .TP
> >  .BR \-\-no-netns-quit
> > -If the target network namespace is bound to the filesystem (that is, if PATH or
> > -NAME are given as target), do not exit once the network namespace is deleted.
> > +If the target network namespace is bound to the filesystem, do not exit once
> > +that path is deleted.
> > +
> > +If the target network namespace is represented by a procfs entry, do not exit
> > +once that entry is removed from procfs (representing the fact that a process
> > +with the given PID terminated).
>
> I realised part of the reason this seems so awkward to me is that
> we're describing our normal behaviour w.r.t. netns lifetime, in the
> context of an option that disables that. So, maybe rephrase something like::
>
>     --no-netns-quit
>         Don't exit based on the state of the network namespace.
>
>         Usually we exit if... .
>
> >
> >  .TP
> >  .BR \-\-config-net
> > diff --git a/pasta.c b/pasta.c
> > index 01d1511..61feaa9 100644
> > --- a/pasta.c
> > +++ b/pasta.c
> > @@ -33,6 +33,7 @@
> >  #include
> >  #include
> >  #include
> > +#include
> >  #include
> >  #include
> >  #include
> > @@ -41,6 +42,7 @@
> >  #include
> >  #include
> >  #include
> > +#include
> >
> >  #include "util.h"
> >  #include "passt.h"
> > @@ -390,12 +392,21 @@ void pasta_netns_quit_init(const struct ctx *c)
> >  	union epoll_ref ref = { .type = EPOLL_TYPE_NSQUIT_INOTIFY };
> >  	struct epoll_event ev = { .events = EPOLLIN };
> >  	int flags = O_NONBLOCK | O_CLOEXEC;
> > -	int fd;
> > +	struct statfs s = { 0 };
>
> I still don't like this initialisation, but I can live with it. Also,
> it's slightly shorter than the next line.

Shorter than "bool try_inotify = true;"? They're both 24 characters...?

-- 
Stefano
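
For reference, the filesystem-type check described in the quoted commit message (fstatfs() on the parent directory, with procfs, sysfs and devpts treated as pseudo-filesystems where inotify can't be used) could be sketched roughly as below. This is an illustration only, not the code from the patch: the helper name dir_on_pseudo_fs() is hypothetical, and it opens the directory by path purely to keep the example self-contained.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>
#include <sys/vfs.h>
#include <linux/magic.h>

/* Hypothetical helper, not part of the patch: return true if the directory
 * at 'path' lives on procfs, sysfs or devpts, false otherwise or on error.
 */
static bool dir_on_pseudo_fs(const char *path)
{
	struct statfs s = { 0 };
	bool pseudo = false;
	int fd;

	/* Open first, then check the descriptor, so that the check applies
	 * to the same object we hold, not to whatever the path names later.
	 */
	fd = open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
	if (fd < 0)
		return false;

	if (!fstatfs(fd, &s))
		pseudo = s.f_type == PROC_SUPER_MAGIC ||
			 s.f_type == SYSFS_MAGIC ||
			 s.f_type == DEVPTS_SUPER_MAGIC;

	close(fd);
	return pseudo;
}
```

A caller would use the result to pick between the inotify watch and the timer fallback, for instance by skipping inotify whenever dir_on_pseudo_fs() returns true for the directory containing the namespace entry.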