Date: Wed, 12 Oct 2022 20:31:20 +1100
From: David Gibson
To: Stefano Brivio
CC: passt-dev@passt.top
Subject: Re: Alas for CAP_NET_BIND_SERVICE
In-Reply-To: <20221012075432.09e33625@elisabeth>
List-Id: Development discussion and patches for passt

On Wed, Oct 12, 2022 at 07:54:32AM +0200, Stefano Brivio wrote:
> Hi David,
>
> On Wed, 12 Oct 2022 13:55:02 +1100
> David Gibson wrote:
>
> > Hi Stefano,
> >
> > I've looked deeper into why giving passt/pasta CAP_NET_BIND_SERVICE
> > isn't working, and I'm afraid I have bad news.
>
> Thanks for the investigation.
>
> > We lose CAP_NET_BIND_SERVICE in the initial namespace as soon as we
> > unshare() or setns() into the isolated namespace, and this appears to
> > be intended behaviour.  From user_namespaces(7), in the Capabilities section:
> >
> >     The child process created by clone(2) with the CLONE_NEWUSER flag
> >     starts out with a complete set of capabilities in the new user
> >     namespace.  Likewise, a process that creates a new user namespace
> >     using unshare(2) or joins an existing user namespace using
> >     setns(2) gains a full set of capabilities in that namespace.  ***On
> >     the other hand, that process has no capabilities in the parent (in
> >     the case of clone(2)) or previous (in the case of unshare(2) and
> >     setns(2)) user namespace, even if the new namespace is created or
> >     joined by the root user (i.e., a process with user ID 0 in the
> >     root namespace).***
> >
> > Emphasis (***) mine.  Basically, despite the way it's phrased in many
> > places, processes don't have an independent set of capabilities in
> > each userns, they only have a set of capabilities in their current
> > userns.  Any capabilities in other namespaces are implied in a pretty
> > much all or nothing way - if the process's UID (the real, init ns one)
> > owns the userns (or one of its ancestors), it gets all caps, otherwise
> > none.  cap_capable() has the specific logic in the kernel.
>
> Right, I missed this.
>
> For a moment, I wondered about ambient capabilities, but those would
> only have an effect on an execve(), not on a clone(), I guess.

Well, yes, but it doesn't really make any difference in any case.  All
ambient caps can do is be another way to get things into the permitted
set.  If that happens before the unshare() then we still lose them on
unshare().
If it happens after the unshare(), then it's just giving us caps
within the namespace, which isn't what we need.

> > So, using CAP_NET_BIND_SERVICE isn't compatible with isolating
> > ourselves in our own userns.  At the very least "auto" inbound
> > forwarding of low ports is pretty much off the cards.
> >
> > For forwarding of specific low ports, we could delay our entry into
> > the new userns until we've set up the listening sockets, although it
> > does mean rolling back some of the simplification we gained from the
> > new-style userns handling.
>
> If I understand correctly, the biggest hurdle would be:
>
> 1. we detach namespaces
>
> 2. only then we can finalise any missing bit of addressing and routing
>    configuration (relevant for pasta)
>
> 3. we bind ports as we parse configuration options, but we need
>    addressing to be fully configured for this
>
> Referring to your latest patchset (which I'm still reviewing), I guess
> that implies a further split of isolate_user() (it's great to have a
> name for that, finally!), right?

Uh.. something like that, I haven't looked at the details.  As we did
before my userns cleanup, we'd probably need to repeatedly enter the
userns as well as the netns to operate upon it, staying in the initial
userns, with our initial caps until sandbox()/isolate_prefork() or
thereabouts.

> > Or, we could abandon CAP_NET_BIND_SERVICE, and recommend the
> > net.ipv4.ip_unprivileged_port_start sysctl as the only way to handle
> > low ports in passt.  I do see a fair bit of logic in that approach:
> > passt has no meaningful way to limit what users do with the low ports
> > it allows them (indirectly) to bind to, giving passt
> > CAP_NET_BIND_SERVICE is pretty much equivalent to giving any process
> > which can invoke passt CAP_NET_BIND_SERVICE.
>
> I also see the general point, even though if file capabilities are
> used, I guess the equivalence doesn't really hold.

Uh.. I don't follow.
It's exactly file capabilities which make this equivalence.  If the
passt binary has cap_net_bind_service=ep, you can, as an unprivileged
user, take any server, stick it in a namespace and use pasta to
effectively bind it to a low port in the init namespace.  You can do
the same thing with passt, though it's fiddlier (you'd need a shim to
translate qemu socket protocol before plugging it into the server).

> And perhaps we
> should at least recommend that as a preferred way.
>
> What still perplexes me is: somebody gives passt CAP_NET_BIND_SERVICE,
> and due to something that's slightly more than an implementation detail,
> it won't be able to bind to low ports, which is the very reason for that
> capability.  That sounds highly counterintuitive.

I guess it is in the sense that the reason for this wasn't obvious to
either of us initially.  However it makes sense to me now that I've
looked at it.

We use a userns for two reasons: 1) to control a netns and 2) to
isolate ourselves for security.  We use the same path and the same
userns for both, but they're logically different reasons.

If (1) was the only reason for the userns we could handle this pretty
easily: we'd only enter the userns transiently when we need to
manipulate it, just like we do with the netns.  That way the main
thread would retain CAP_NET_BIND_SERVICE in the original ns.

For (2), we're specifically choosing to isolate ourselves: that is to
give up privilege from our original position.  It's not surprising
there's some degree of granularity to how we can do that, and the deal
is that we can't give up our membership in the original userns without
also giving up our enhanced capabilities in that userns.

I don't think giving up (2) is a price worth paying for this.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
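
The "all or nothing" rule discussed in this message can be illustrated with a small toy model. The sketch below is only an approximation written for illustration: it loosely follows the shape of the loop in the kernel's cap_capable(), but the UserNS and Cred classes, the capable() helper, the ALL_CAPS placeholder and the UID 1000 scenario are all assumptions made up for this example, not kernel or passt code.

```python
from dataclasses import dataclass
from typing import Optional

CAP_NET_BIND_SERVICE = 10          # numeric value from <linux/capability.h>
ALL_CAPS = frozenset(range(64))    # toy placeholder for "every capability"


@dataclass(eq=False)
class UserNS:
    """Toy stand-in for a user namespace (hypothetical, not a kernel type)."""
    owner: int                          # UID that created the namespace
    parent: Optional["UserNS"] = None   # None for the initial namespace

    @property
    def level(self) -> int:
        # Nesting depth: 0 for the initial namespace.
        return 0 if self.parent is None else self.parent.level + 1


@dataclass
class Cred:
    """Toy credentials: ONE effective set, valid in ONE current userns."""
    euid: int
    user_ns: UserNS
    effective: frozenset


def capable(cred: Cred, target: UserNS, cap: int) -> bool:
    """Approximate the decision: does cred hold cap with respect to target?"""
    ns = target
    while True:
        # In our own userns, the effective set decides.
        if ns is cred.user_ns:
            return cap in cred.effective
        # Target is at or above our level, but is not our userns: no caps.
        if ns.level <= cred.user_ns.level:
            return False
        # Owner of a direct child userns gets every capability in it.
        if ns.parent is cred.user_ns and ns.owner == cred.euid:
            return True
        ns = ns.parent


init_ns = UserNS(owner=0)

# Before isolation: UID 1000, binary granted cap_net_bind_service=ep.
before = Cred(euid=1000, user_ns=init_ns,
              effective=frozenset({CAP_NET_BIND_SERVICE}))

# After unshare(CLONE_NEWUSER): full caps, but only in the new userns.
isolated_ns = UserNS(owner=1000, parent=init_ns)
after = Cred(euid=1000, user_ns=isolated_ns, effective=ALL_CAPS)

print(capable(before, init_ns, CAP_NET_BIND_SERVICE))     # True
print(capable(after, init_ns, CAP_NET_BIND_SERVICE))      # False
print(capable(after, isolated_ns, CAP_NET_BIND_SERVICE))  # True
```

In this model, the second result is the problem described above: once the process's current userns is the isolated one, nothing it holds there counts towards a check against the initial namespace, which is where a bind() to a low port on the host would be checked.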