From: David Gibson
To: Stefano Brivio
CC: passt-dev@passt.top
Date: Fri, 14 Oct 2022 13:54:28 +1100
Subject: Re: Alas for CAP_NET_BIND_SERVICE
References: <20221012075432.09e33625@elisabeth> <20221012124707.70755587@elisabeth> <20221013065426.618e88b5@elisabeth>
In-Reply-To: <20221013065426.618e88b5@elisabeth>
List-Id: Development discussion and patches for passt

On Thu, Oct 13, 2022 at 06:54:26AM +0200,
Stefano Brivio wrote:
> On Thu, 13 Oct 2022 11:34:04 +1100
> David Gibson wrote:
>
> > On Wed, Oct 12, 2022 at 12:47:07PM +0200, Stefano Brivio wrote:
> > > On Wed, 12 Oct 2022 20:31:20 +1100
> > > David Gibson wrote:
> > >
> > > > On Wed, Oct 12, 2022 at 07:54:32AM +0200, Stefano Brivio wrote:
> > > > > Hi David,
> > > > >
> > > > > On Wed, 12 Oct 2022 13:55:02 +1100
> > > > > David Gibson wrote:
> > > > >
> > > > > > Hi Stefano,
> > > > > >
> > > > > > I've looked deeper into why giving passt/pasta CAP_NET_BIND_SERVICE
> > > > > > isn't working, and I'm afraid I have bad news.
> > > > >
> > > > > Thanks for the investigation.
> > > > >
> > > > > > We lose CAP_NET_BIND_SERVICE in the initial namespace as soon as we
> > > > > > unshare() or setns() into the isolated namespace, and this appears to
> > > > > > be intended behaviour. From user_namespaces(7), in the Capabilities section:
> > > > > >
> > > > > >     The child process created by clone(2) with the CLONE_NEWUSER flag
> > > > > >     starts out with a complete set of capabilities in the new user
> > > > > >     namespace. Likewise, a process that creates a new user namespace
> > > > > >     using unshare(2) or joins an existing user namespace using
> > > > > >     setns(2) gains a full set of capabilities in that namespace. ***On
> > > > > >     the other hand, that process has no capabilities in the parent (in
> > > > > >     the case of clone(2)) or previous (in the case of unshare(2) and
> > > > > >     setns(2)) user namespace, even if the new namespace is created or
> > > > > >     joined by the root user (i.e., a process with user ID 0 in the
> > > > > >     root namespace).***
> > > > > >
> > > > > > Emphasis (***) mine. Basically, despite the way it's phrased in many
> > > > > > places, processes don't have an independent set of capabilities in
> > > > > > each userns, they only have a set of capabilities in their current
> > > > > > userns.
> > > > > > Any capabilities in other namespaces are implied in a pretty
> > > > > > much all or nothing way - if the process's UID (the real, init ns one)
> > > > > > owns the userns (or one of its ancestors), it gets all caps, otherwise
> > > > > > none. cap_capable() has the specific logic in the kernel.
> > > > >
> > > > > Right, I missed this.
> > > > >
> > > > > For a moment, I wondered about ambient capabilities, but those would
> > > > > only have an effect on an execve(), not on a clone(), I guess.
> > > >
> > > > Well, yes, but it doesn't really make any difference in any case. All
> > > > ambient caps can do is be another way to get things into the permitted
> > > > set. If that happens before the unshare() then we still lose them on
> > > > unshare(). If it happens after the unshare(), then it's just giving
> > > > us caps within the namespace, which isn't what we need.
> > > >
> > > > > > So, using CAP_NET_BIND_SERVICE isn't compatible with isolating
> > > > > > ourselves in our own userns. At the very least "auto" inbound
> > > > > > forwarding of low ports is pretty much off the cards.
> > > > > >
> > > > > > For forwarding of specific low ports, we could delay our entry into
> > > > > > the new userns until we've set up the listening sockets, although it
> > > > > > does mean rolling back some of the simplification we gained from the
> > > > > > new-style userns handling.
> > > > >
> > > > > If I understand correctly, the biggest hurdle would be:
> > > > >
> > > > > 1. we detach namespaces
> > > > >
> > > > > 2. only then we can finalise any missing bit of addressing and routing
> > > > >    configuration (relevant for pasta)
> > > > >
> > > > > 3. we bind ports as we parse configuration options, but we need
> > > > >    addressing to be fully configured for this
> > > > >
> > > > > Referring to your latest patchset (which I'm still reviewing), I guess
> > > > > that implies a further split of isolate_user() (it's great to have a
> > > > > name for that, finally!), right?
> > > >
> > > > Uh.. something like that, I haven't looked at the details. As we did
> > > > before my userns cleanup, we'd probably need to repeatedly enter the
> > > > userns as well as the netns to operate upon it, staying in the initial
> > > > userns, with our initial caps until sandbox()/isolate_prefork() or
> > > > thereabouts.
> > > >
> > > > > > Or, we could abandon CAP_NET_BIND_SERVICE, and recommend the
> > > > > > net.ipv4.ip_unprivileged_port_start sysctl as the only way to handle
> > > > > > low ports in passt. I do see a fair bit of logic in that approach:
> > > > > > passt has no meaningful way to limit what users do with the low ports
> > > > > > it allows them (indirectly) to bind to, giving passt
> > > > > > CAP_NET_BIND_SERVICE is pretty much equivalent to giving any process
> > > > > > which can invoke passt CAP_NET_BIND_SERVICE.
> > > > >
> > > > > I also see the general point, even though if file capabilities are
> > > > > used, I guess the equivalence doesn't really hold.
> > > >
> > > > Uh.. I don't follow. It's exactly file capabilities which make this
> > > > equivalence. If the passt binary has cap_net_bind_service=ep, you
> > > > can, as an unprivileged user, take any server, stick it in a namespace
> > > > and use pasta to effectively bind it to a low port in the init
> > > > namespace.
> > >
> > > I actually meant with passt but... even for pasta, this depends on the
> > > decision of whether we drop capabilities for the spawned process. If we
> > > decide we don't, one day, then it's not equivalent.
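
(Aside, for reference: a minimal sketch of how the sysctl option mentioned
above could be checked from a script. The procfs path is the standard
location of net.ipv4.ip_unprivileged_port_start; the helper names and the
fallback to 1024 are assumptions made for illustration, not anything that
exists in passt.)

```python
from pathlib import Path

# Standard procfs location of the sysctl; the helper names are hypothetical.
SYSCTL_PATH = Path("/proc/sys/net/ipv4/ip_unprivileged_port_start")
DEFAULT_START = 1024  # usual kernel default when the sysctl is untouched


def unprivileged_port_start(path: Path = SYSCTL_PATH) -> int:
    """Return the lowest port an unprivileged process may bind in this netns."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # Not Linux, or procfs unavailable: assume the usual default.
        return DEFAULT_START


def needs_privilege(port: int, start: int) -> bool:
    """True if binding `port` needs CAP_NET_BIND_SERVICE for threshold `start`.

    Port 0 asks the kernel for an ephemeral port and never needs privilege.
    """
    return 0 < port < start
```

The value is per network namespace, so it has to be read (and set) in the
netns where the bind() actually happens.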
> >
> > No, from a security perspective it pretty much is still equivalent.
> > You can start your own namespace where you have full capabilities, run
> > the server in there, then use pasta to translate your
> > cap_net_bind_service within to cap_net_bind_service on the host. Or
> > just run the server on a high port and tell pasta to connect a low
> > port to it.
>
> Ah, sorry, now I understand what you mean here, and...
>
> > > It would be equivalent if we just inherited capabilities from the
> > > parent as opposed to file capabilities -- that's what I meant.
> > >
> > > I think it's a bit early to decide to drop those, though. Right now
> > > pasta isn't really used as a stand-alone tool (even though I
> > > actually do that, I find it very convenient also for totally unrelated
> > > purposes).
> > >
> > > Should we see some use cases, then we could make a more informed
> > > decision.
> > >
> > > > You can do the same thing with passt, though it's fiddlier
> > > > (you'd need a shim to translate qemu socket protocol before plugging
> > > > it into the server).
> > >
> > > Oh, you mean running pasta plus a shim plus qemu? Because with passt I
> > > don't understand how you'd pass that kind of stuff over AF_UNIX...
> >
> > No qemu necessary. Make your bogus server, but instead of directly
> > listen()ing on a low port, have it connect to a Unix socket and wait
> > for SYN packets to a low port in qemu protocol. Then use passt to
> > turn your Unix socket into a real listen()ing socket on the host.
>
> ...here. But the environment I had in mind was a rather controlled one,
> with KSM policies that would normally prevent you from even having your
> bogus server.
>
> Well, that would be the case for KubeVirt at least: three binaries and
> not much margin to play tricks.

Ok, but even then using the file capability rather than the sysctl
only makes a difference if the attacker:
  * CAN escape confinement enough to make socket calls in the netns
    where we would be setting the sysctl
  * CAN'T escape confinement enough to exec() passt

Which seems like it would be a very narrow category to me.

> Unless, of course, you manage to do arbitrary code execution in qemu,
> which... would be actually one of the few cases where we would prefer
> to have CAP_NET_BIND_SERVICE granted to passt instead of allowing
> everybody to bind low ports.

In that situation, I think it's still equivalent unless qemu were
prevented from (re-)exec()ing passt... but the obvious mechanism to do
that (put it in a different mount namespace from qemu) is fairly
easily extended to working with the sysctl as well (put qemu in a
separate netns so it can't access any network except via the passt
socket it's been given).

> Still, it depends: you would have to reuse that qemu process, because
> it's the only one that's connected to passt (which only applies with
> minimally defined KSM policies, sure).
>
> In that environment, you couldn't easily turn libvirtd into a DNS
> resolver.
>
> In a general case, I see your point, but in specific cases it actually
> depends on what the environment allows.
>
> > > > > And perhaps we
> > > > > should at least recommend that as a preferred way.
> > > > >
> > > > > What still perplexes me is: somebody gives passt CAP_NET_BIND_SERVICE,
> > > > > and due to something that's slightly more than an implementation detail,
> > > > > it won't be able to bind to low ports, which is the very reason for that
> > > > > capability. That sounds highly counterintuitive.
> > > >
> > > > I guess it is in the sense that the reason for this wasn't obvious to
> > > > either of us initially. However it makes sense to me now that I've
> > > > looked at it.
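
(Aside: part of what makes this confusing to debug is that the capability
masks in /proc/<pid>/status look perfectly healthy after entering the new
userns. A minimal sketch of decoding them follows; the bit number 10 for
CAP_NET_BIND_SERVICE comes from <linux/capability.h>, while the helper
names are hypothetical. As far as I understand it, the mask is always
relative to the process's current userns, so a set bit says nothing about
the initial namespace.)

```python
CAP_NET_BIND_SERVICE = 10  # bit number from <linux/capability.h>


def has_cap(cap_mask_hex: str, cap: int = CAP_NET_BIND_SERVICE) -> bool:
    """Test one capability bit in a hex mask as printed in /proc/<pid>/status."""
    return bool((int(cap_mask_hex, 16) >> cap) & 1)


def effective_caps(status_text: str) -> str:
    """Extract the CapEff mask from the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split()[1]
    raise ValueError("no CapEff line found")
```

Feeding it the contents of /proc/self/status before and after the
unshare() would show the bit set both times, even though only the first
one lets us bind a low port on the host.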
> > >
> > > No, no, in the sense that it makes sense to you and now to me as well,
> > > as you explained it to me. And yet I find it hard to imagine that it
> > > would naturally make sense to users, in these terms:
> > >
> > > - we offer a program that provides network connectivity to qemu
> > >
> > > - it also includes port forwarding functionality: it binds to
> > >   configured ports and maps them to the guest
> > >
> > > - it can't bind to any port: it doesn't run as root, and Linux prevents
> > >   non-root processes from binding to ports lower than 1024, which is a
> > >   well-known fact -- at least by default (lesser known fact)
> > >
> > > - somewhat in between on the scale of general knowledge, lies
> > >   CAP_NET_BIND_SERVICE: it allows non-root processes to bind to low
> > >   ports
> > >
> > > ...but not passt. For very valid reasons, indeed, but those will need
> > > to be explained over and over again.
> >
> > Yeah, I guess so.
> >
> > > > We use a userns for two reasons: 1) to control a netns
> > > > and 2) to isolate ourselves for security. We use the same path and
> > > > the same userns for both, but they're logically different reasons.
> > > >
> > > > If (1) was the only reason for the userns we could handle this pretty
> > > > easily: we'd only enter the userns transiently when we need to
> > > > manipulate it, just like we do with the netns. That way the main
> > > > thread would retain CAP_NET_BIND_SERVICE in the original ns.
> > > >
> > > > For (2), we're specifically choosing to isolate ourselves: that is to
> > > > give up privilege from our original position. It's not surprising
> > > > there's some degree of granularity to how we can do that, and the deal
> > > > is that we can't give up our membership in the original userns without
> > > > also giving up our enhanced capabilities in that userns.
> > > >
> > > > I don't think giving up (2) is a price worth paying for this.
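
(Aside: to make the deal in (2) concrete, here is a toy model of the check
as I read it in cap_capable(). It's a sketch, not kernel code: the UserNS
and Cred classes, the capable() helper, the string capability names and
the ALL_CAPS stand-in are all made up for illustration.)

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class UserNS:
    owner_uid: int                     # UID that created the namespace
    parent: Optional["UserNS"] = None  # None for the initial namespace

    @property
    def level(self) -> int:
        return 0 if self.parent is None else self.parent.level + 1


@dataclass
class Cred:
    euid: int
    user_ns: UserNS
    effective: frozenset = frozenset()


def capable(cred: Cred, target: UserNS, cap: str) -> bool:
    """Does `cred` hold `cap` with respect to the `target` user namespace?"""
    ns = target
    while True:
        if ns is cred.user_ns:
            # Same namespace: only here does the effective set matter.
            return cap in cred.effective
        if ns.level <= cred.user_ns.level:
            # Target is not a descendant of our namespace: nothing at all.
            return False
        if ns.parent is cred.user_ns and ns.owner_uid == cred.euid:
            # We own a child namespace: every capability in it.
            return True
        ns = ns.parent


# Stand-in for "a full set of capabilities" in a freshly entered userns.
ALL_CAPS = frozenset({"CAP_NET_BIND_SERVICE", "CAP_NET_ADMIN", "CAP_SYS_ADMIN"})

init_ns = UserNS(owner_uid=0)
child_ns = UserNS(owner_uid=1000, parent=init_ns)

# Process started with the file capability, still in the initial userns.
before = Cred(1000, init_ns, frozenset({"CAP_NET_BIND_SERVICE"}))
# The same process after unshare(CLONE_NEWUSER).
after = Cred(1000, child_ns, ALL_CAPS)
```

In this model `before` can bind low ports in the initial namespace and
still manage `child_ns` (case 1, transient entry), whereas `after` has
everything inside `child_ns` and nothing in `init_ns`, which is exactly
the trade-off for case 2.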
> > >
> > > Absolutely, I agree, I wouldn't either.
> > >
> > > However, we could give users the choice without compromising (2) at
> > > all, by binding to low ports early (without automatic detection, sure).
> > > And, somewhat importantly, by not handling any data from them.
> > >
> > > We could even defer the listen() calls if there's any value in doing so
> > > (is there some? I can't think of anything).
> > >
> > > Actually, I'm thinking of an easier way to break the circular
> > > dependency between isolation steps and port configuration I outlined
> > > earlier, without undoing your cleanups at all.
> > >
> > > We currently need to process port configuration in a second step for
> > > two reasons:
> > >
> > > - we might bind ports in the detached namespace (-T, -U)
> > >
> > > - one of IPv4 and IPv6 support could be administratively disabled
> > >   (operationally, who cares, we'll just fail to bind if that's the
> > >   case)
> > >
> > > ...but for init/host facing ports (now: "inbound"), we don't care about
> > > the detached namespace, and we could simply call conf_ports() for -t
> > > and -u in a second step after the main loop. Sure, if we continue like
> > > this, we'll end up with O(n²) option handling, but right now it
> > > doesn't look that bad to me.
> >
> > Ah, yes, that could work. Of course, this does mean moving some
> > relatively complex code out of at least one layer of isolation, which
> > carries some risks.
>
> Some of it, yes, but to me it doesn't look much worse than conf() in
> comparison. I still have to think about downsides of having listen()
> calls there, though. On the other hand, that would mean we can also get
> rid of listen() later with seccomp, for passt only.
>
> Anyway, I drafted it... but this happens. I dropped pasta symlinks for
> simplicity:
>

-- 
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you.
NOT _the_ _other_ | _way_ _around_!
http://www.ozlabs.org/~dgibson