From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by passt.top (Postfix) with ESMTP id 2ABAA5A0275 for ; Mon, 19 Feb 2024 14:07:17 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1708348035; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=qdqtkcPuXfQFN+tvsTLtQaOFDT4867QB3s/RcmtGKXc=; b=PWBE8DiAA++L0trtTcxN+T5RG6yT9ntFOoiAYIWPe+ERtod26NN6rhPLmWtNJpX5UGmAK6 oosMhBRtCl+3uzVlNQNEGDEeiZd1FjGVo/IkLV8Bh6btD+z9NlrtrLF2iyDweUIKMvoWi8 OnqhhenHUdlQd1X64+PrHCc4KH7CTIY= Received: from mail-ed1-f71.google.com (mail-ed1-f71.google.com [209.85.208.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-499-43IAiui5OT6kvntW8B2c4w-1; Mon, 19 Feb 2024 08:07:14 -0500 X-MC-Unique: 43IAiui5OT6kvntW8B2c4w-1 Received: by mail-ed1-f71.google.com with SMTP id 4fb4d7f45d1cf-564902d757bso396710a12.1 for ; Mon, 19 Feb 2024 05:07:13 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1708348031; x=1708952831; h=content-transfer-encoding:mime-version:organization:references :in-reply-to:message-id:subject:cc:to:from:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=qdqtkcPuXfQFN+tvsTLtQaOFDT4867QB3s/RcmtGKXc=; b=IreBJyIQKYzpoL+ZVGp3o1gwll0GuqPQ4ZYsUyEJy89sZXas297rpJBETf2rBANQhX U/0ZIpCWdYy7ockxUAtBeCSC9T8UKbRAhZOOjzptVmLqJzGDCkJV4wdwuIFxYcWGh5M6 fS+VI9FyhA50sMCtLjuQ12sHrJGO7R75BpW9yxvWlD/blQNU/D+c630D3YYjHXRQXxVF dAUMBXE7GMIyWmkC/+uxk3y310qxd4Yvu8hEjjQIBcrnAQ+yvuyvW3zg/Ka0WTkuHNcW JzV2yrJiIAGl5zoNiSiIKY0bR3Lqq+aIncRz2CzL6sOISAqY4xg23GllfOY3jmLUCTHG x8Uw== X-Gm-Message-State: AOJu0Yyx8tf3q2K8HyH7fMAYnX4gH/rWG8eI3HbMKfJWcVm2wTaVdKis /mDWIAd175xeKCI6djjx68tTFxe5VYu1vWN8avNjTtbKk+SBS3HI/UKV1e21uTcmDkkiAK1bZgT fEATt72wswjWIBm2e9Rz03KVI37IHGoWbowkjihTkPwmCdf8OyWOJD0XUJrAI X-Received: by 2002:aa7:d74c:0:b0:564:20f5:6926 with SMTP id a12-20020aa7d74c000000b0056420f56926mr4135295eds.14.1708348031366; Mon, 19 Feb 2024 05:07:11 -0800 (PST) X-Google-Smtp-Source: AGHT+IF/QHCsNKd6uCJXtxFL69wCDULfwsYU0rQshAgTzAop7mHypJJUOOSnqg9Pdy2UrH+4PZQE7A== X-Received: by 2002:aa7:d74c:0:b0:564:20f5:6926 with SMTP id a12-20020aa7d74c000000b0056420f56926mr4135280eds.14.1708348031005; Mon, 19 Feb 2024 05:07:11 -0800 (PST) Received: from maya.cloud.tilaa.com (maya.cloud.tilaa.com. [164.138.29.33]) by smtp.gmail.com with ESMTPSA id fi17-20020a056402551100b005612025465asm2675372edb.74.2024.02.19.05.07.09 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 19 Feb 2024 05:07:09 -0800 (PST) Date: Mon, 19 Feb 2024 14:06:36 +0100 From: Stefano Brivio To: Paul Holzinger Subject: Re: [PATCH v3] pasta: Don't try to watch namespaces in procfs with inotify, use timer instead Message-ID: <20240219140636.0d6219e0@elisabeth> In-Reply-To: References: <20240219080533.3584215-1-sbrivio@redhat.com> Organization: Red Hat X-Mailer: Claws Mail 4.2.0 (GTK 3.24.36; x86_64-pc-linux-gnu) MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Message-ID-Hash: VDLUHGXEPMGWAEWKUM7QYZGV3BZBGY3L X-Message-ID-Hash: VDLUHGXEPMGWAEWKUM7QYZGV3BZBGY3L X-MailFrom: sbrivio@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, David Gibson X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: On Mon, 19 Feb 2024 13:40:59 +0100 Paul Holzinger wrote: > Hi Stefano, thanks for the patch. > > On 19/02/2024 09:05, Stefano Brivio wrote: > > We watch network namespace entries to detect when we should quit > > (unless --no-netns-quit is passed), and these might stored in a tmpfs > > typically mounted at /run/user/UID or /var/run/user/UID, or found in > > procfs at /proc/PID/ns/. > > > > Currently, we try to use inotify for any possible location of those > > entries, but inotify, of course, doesn't work on pseudo-filesystems > > (see inotify(7)). > > > > The man page reflects this: the description of --no-netns-quit > > implies that we won't quit anyway if the namespace is not "bound to > > the filesystem". > > > > Well, we won't quit, but, since commit 9e0dbc894813 ("More > > deterministic detection of whether argument is a PID, PATH or NAME"), > > we try. And, indeed, this is harmless, as the caveat from that > > commit message states. > > > > Now, it turns out that Buildah, a tool to create container images, > > sharing its codebase with Podman, passes a procfs entry to pasta, and > > expects pasta to exit once the network namespace is not needed > > anymore, that is, once the original Buildah process terminates. > Note we do not want to track buildah, we pass down the pid of the > container process so the tree looks like this: > > buildah > \_ container process > \_ pasta > > We pass down the pid of the container process to pasta and it needs to > exit with it. One buildah can spawn multiple pasta processes at once. > Each RUN instruction in a Dockerfile will setup a new container and thus > a new netns. Oh, I didn't know that, thanks for clarifying, I'll fix this in v4. > > Get this to work by using the timer fallback mechanism if the > > namespace name is passed as a path belonging to a pseudo-filesystem. > > This is expected to be procfs, but I covered sysfs and devpts > > pseudo-filesystems as well, because nothing actually prevents > > creating this kind of directory structure and links there. > > > > Note that fstatfs(), according to some versions of man pages, was > > apparently "deprecated" by the LSB. My reasoning for using it is > > essentially this: > > https://lore.kernel.org/linux-man/f54kudgblgk643u32tb6at4cd3kkzha6hslahv24szs4raroaz@ogivjbfdaqtb/t/#u > > > > ...that is, there was no such thing as an LSB deprecation, and > > anyway there's no other way to get the filesystem type. > > > > Also note that, while it might sound more obvious to detect the > > filesystem type using fstatfs() on the file descriptor itself > > (c->pasta_netns_fd), the reported filesystem type for it is nsfs, no > > matter what path was given to pasta. If we use the parent directory, > > we'll typically have either tmpfs or procfs reported. > > > > If the target naemsapce is given as a PID, or as a PID-based procfs > typo namespace Fixed. > > entry, we don't risk races if this PID is recycled: our handle on > > /proc/PID/ns will always refer to the original namespace associated > > with that PID, and we don't re-open this entry from procfs to check > > it. > > > > Instead of directly monitoring the target namespace, we could have > > tried to monitor a process with a given PID, using pidfd_open() to > > get a handle on it, to decide when to terminate. > > > > But it's not guaranteed that the parent process is actually the one > > associated to the network namespace we operate on, and if we get a > > PID file descriptor for a PID (parent or not) or path that was given > > on our command line, this inherently causes a race condition as that > > PID might have been recycled by the time we call pidfd_open(). > Well I mean the race is always there regardless of pidfd_open, already > when you open /proc/$PID/ns/net the race exists as it may no longer be > the correct pid if the process exited (and was reaped) before it can > open the path. For a PID referring to a third process, you're right, I'll change this paragraph. But if the PID is the one of the parent, we don't have this issue because conf() is called before __daemon() in passt.c. If the parent terminates before we get a reference to /proc/$PID/ns/net, pasta will also terminate. > I think if we really want a race free interface then it > would make the most sense to have the parent pass down a pidfd to pasta, > this allows pasta to poll the fd to see the exit and also to call setns > on that fd. ...and if we want a pidfd referring to another process that's not the parent, right, that's the only way: the parent could first get a pidfd referring to a child process, and then pass that to pasta. I guess we should implement this at some point. > > > > Even assuming the process we want to watch is the parent process, and > > a race-free usage of pidfd_open() would have been possible, I'm not > > very enthusiastic about enabling yet another system call in the > > seccomp profile just for this, while openat() is needed anyway. > If the parent opens the pidfd then this would not be an issue. Right, we could then just add it to the epoll set -- I'll adjust this paragraph as well. Does the patch otherwise work for you? I guess we would need a new release if we want to check this with Buildah, right? -- Stefano