From mboxrd@z Thu Jan 1 00:00:00 1970
Received: by passt.top (Postfix, from userid 1000)
	id 0A6B35A0281; Mon, 22 May 2023 10:52:05 +0200 (CEST)
From: Stefano Brivio
To: passt-dev@passt.top
Subject: [PATCH v2 0/3] Fix pasta-in-pasta operation (and similar)
Date: Mon, 22 May 2023 10:52:02 +0200
Message-Id: <20230522085205.2803560-1-sbrivio@redhat.com>
X-Mailer: git-send-email 2.39.2
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Message-ID-Hash: QXIXR6V6S4OTBPTS46HSLUDHGFLBP6KJ
X-Message-ID-Hash: QXIXR6V6S4OTBPTS46HSLUDHGFLBP6KJ
X-MailFrom: sbrivio@passt.top
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency;
	loop; banned-address; member-moderation; nonmember-moderation;
	administrivia; implicit-dest; max-recipients; max-size;
	news-moderation; no-subject; digests; suspicious-header
CC: David Gibson
X-Mailman-Version: 3.3.8
Precedence: list
List-Id: Development discussion and patches for passt

When pasta spawns a command (operation without pre-existing
namespace), it calls clone(2) with CLONE_NEWPID to detach the PID
namespace where this command runs, but it needs to mount /proc (in a
separate mount namespace), otherwise its contents are not consistent
with the new PID namespace.

If /proc contents are not consistent, pasta will fail to run in a
user and network namespace created by another pasta instance.

An alternative would be to drop CLONE_NEWPID altogether: pasta itself
is not a container engine, and it's not meant to provide general
isolation features other than for networking aspects. This would also
make testing and debugging a bit easier, as the PIDs of processes
descending from pasta would be the same outside the detached
namespace.
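
As an aside on the nested case: from inside, a user namespace created
by another instance can usually be told apart from the init one by
looking at /proc/self/uid_map, because the init namespace shows a
single identity mapping covering the whole 32-bit UID range. A minimal
sketch of such a check follows. uid_map_is_init() is a hypothetical
name made up for illustration, it takes the file contents as a string
instead of opening the procfs entry, and it is a heuristic, not
necessarily what the helper in this series does:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper, for illustration only: return true if @map, the
 * contents of a uid_map file, is a single identity mapping of the full
 * 32-bit UID range, which is what the init user namespace shows.
 */
bool uid_map_is_init(const char *map)
{
	unsigned long long inside, outside, count;
	int end = 0;

	assert(map);

	/* One line, three fields: first ID inside, first ID outside, length */
	if (sscanf(map, "%llu %llu %llu %n",
		   &inside, &outside, &count, &end) != 3)
		return false;

	/* Anything left after the first line means more than one mapping */
	if (map[end] != '\0')
		return false;

	return inside == 0 && outside == 0 && count == 4294967295ULL;
}
```

A caller would read /proc/self/uid_map into a buffer and pass it in. A
namespace that maps UID 0 inside to a single unprivileged UID outside
shows a different line, so the check returns false there.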

However, also for testing and debugging usage itself, we would lose
two advantages: the inner environment looks more observable (from
inside) with CLONE_NEWPID, and we don't need to explicitly clean up
the environment as pasta terminates: see the ugliness of
pasta_ns_cleanup() before commit 0515adceaa8f ("passt, pasta:
Namespace-based sandboxing, defer seccomp policy application"). It
wasn't very robust either.

Now that this part works, note that writing to the uid_map procfs
entry, with 0 as domain for the map, requires (since Linux 5.12)
CAP_SETFCAP in the parent process. We need this mapping to keep the
behaviour consistent with what happens when we run directly from the
init namespace, and to set the ping_group_range sysctl. Keep
CAP_SETFCAP if we're running with UID 0 from a non-init user
namespace.

With this series, pasta finally runs in itself. I checked basic
connectivity inside a dozen recursively nested instances.

v2: Fix size of buffer and comparison in 1/3 for ns_is_init(),
    address comment from David

Stefano Brivio (3):
  util, conf: Add and use ns_is_init() helper
  pasta: Detach mount namespace, (re)mount procfs before spawning command
  isolation: Initially Keep CAP_SETFCAP if running as UID 0 in non-init

 conf.c      | 16 +---------------
 isolation.c | 17 ++++++++++++++---
 pasta.c     |  7 ++++++-
 util.c      | 25 +++++++++++++++++++++++++
 util.h      |  2 ++
 5 files changed, 48 insertions(+), 19 deletions(-)

-- 
2.39.2