From: Stefano Brivio
To: David Gibson
CC: passt-dev@passt.top
Date: Wed, 12 Oct 2022 07:54:32 +0200
Subject: Re: Alas for CAP_NET_BIND_SERVICE
Message-ID: <20221012075432.09e33625@elisabeth>

Hi David,

On Wed, 12 Oct 2022 13:55:02 +1100
David Gibson wrote:

> Hi Stefano,
>
> I've looked deeper into why giving passt/pasta CAP_NET_BIND_SERVICE
> isn't working, and I'm afraid I have bad news.

Thanks for the investigation.

> We lose CAP_NET_BIND_SERVICE in the initial namespace as soon as we
> unshare() or setns() into the isolated namespace, and this appears to
> be intended behaviour.  From user_namespaces(7), in the Capabilities
> section:
>
>     The child process created by clone(2) with the CLONE_NEWUSER flag
>     starts out with a complete set of capabilities in the new user
>     namespace.  Likewise, a process that creates a new user namespace
>     using unshare(2) or joins an existing user namespace using
>     setns(2) gains a full set of capabilities in that namespace.  ***On
>     the other hand, that process has no capabilities in the parent (in
>     the case of clone(2)) or previous (in the case of unshare(2) and
>     setns(2)) user namespace, even if the new namespace is created or
>     joined by the root user (i.e., a process with user ID 0 in the
>     root namespace).***
>
> Emphasis (***) mine.  Basically, despite the way it's phrased in many
> places, processes don't have an independent set of capabilities in
> each userns, they only have a set of capabilities in their current
> userns.
> Any capabilities in other namespaces are implied in a pretty
> much all or nothing way - if the process's UID (the real, init ns one)
> owns the userns (or one of its ancestors), it gets all caps, otherwise
> none.  cap_capable() has the specific logic in the kernel.

Right, I missed this. For a moment, I wondered about ambient
capabilities, but those would only have an effect on an execve(), not
on a clone(), I guess.

> So, using CAP_NET_BIND_SERVICE isn't compatible with isolating
> ourselves in our own userns.  At the very least "auto" inbound
> forwarding of low ports is pretty much off the cards.
>
> For forwarding of specific low ports, we could delay our entry into
> the new userns until we've set up the listening sockets, although it
> does mean rolling back some of the simplification we gained from the
> new-style userns handling.

If I understand correctly, the biggest hurdle would be:

1. we detach namespaces

2. only then can we finalise any missing bit of addressing and routing
   configuration (relevant for pasta)

3. we bind ports as we parse configuration options, but we need
   addressing to be fully configured for this

Referring to your latest patchset (which I'm still reviewing), I guess
that implies a further split of isolate_user() (it's great to have a
name for that, finally!), right?

> Or, we could abandon CAP_NET_BIND_SERVICE, and recommend the
> net.ipv4.ip_unprivileged_port_start sysctl as the only way to handle
> low ports in passt.  I do see a fair bit of logic in that approach:
> passt has no meaningful way to limit what users do with the low ports
> it allows them (indirectly) to bind to, giving passt
> CAP_NET_BIND_SERVICE is pretty much equivalent to giving any process
> which can invoke passt CAP_NET_BIND_SERVICE.

I also see the general point, even though, if file capabilities are
used, I guess the equivalence doesn't really hold. And perhaps we
should at least recommend that as a preferred way.

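To make the sysctl option quoted above a bit more concrete, here is a
minimal C sketch of how a program could check whether a given port
falls below net.ipv4.ip_unprivileged_port_start, that is, whether
binding it would need CAP_NET_BIND_SERVICE at all. The helper names
unpriv_port_start() and port_needs_cap() are hypothetical, made up for
this illustration, and are not part of passt. The sketch also assumes
the sysctl is read from the network namespace the socket will be bound
in, as the value is per network namespace.

```c
#include <stdio.h>

/* Hypothetical helper, not part of passt: read the first unprivileged
 * port from a sysctl file, normally
 * /proc/sys/net/ipv4/ip_unprivileged_port_start.
 * Returns the value, or -1 if the file can't be opened or parsed.
 */
static int unpriv_port_start(const char *path)
{
	FILE *f = fopen(path, "r");
	int start;

	if (!f)
		return -1;

	if (fscanf(f, "%d", &start) != 1)
		start = -1;

	fclose(f);
	return start;
}

/* Hypothetical helper: return 1 if binding 'port' needs
 * CAP_NET_BIND_SERVICE given the sysctl value 'start', 0 otherwise.
 * Port 0 means "let the kernel pick a port" and is never privileged.
 */
static int port_needs_cap(int port, int start)
{
	return port > 0 && port < start;
}
```

A caller would first check that unpriv_port_start() didn't return -1,
then pass the result to port_needs_cap(). With the usual default of
1024, port 80 needs the capability and port 8080 doesn't. If the
administrator lowers the sysctl to 80, port 80 no longer needs it.
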
What still perplexes me is: somebody gives passt CAP_NET_BIND_SERVICE,
and due to something that's slightly more than an implementation
detail, it won't be able to bind to low ports, which is the very reason
for that capability. That sounds highly counterintuitive.

-- 
Stefano