From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124])
 by passt.top (Postfix) with ESMTP id 73C315A005E
 for ; Thu, 13 Oct 2022 06:54:33 +0200 (CEST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
 s=mimecast20190719; t=1665636872;
 h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
 content-transfer-encoding:content-transfer-encoding:
 in-reply-to:in-reply-to:references:references;
 bh=AAnzhkelsC/hyyEfuvBBDJgMNhrU8cpxy7kq6gTcyK4=;
 b=N/pnrw3P/a0mgFAF7zf6NSXrmoVxWao/9ylwGvBOUYZG4BTkIewBIpCL3LM+Sqa6huZ68M
 gkeLQAewrhlsLnOO1Av6mZB6coion5ZLpc9DSjxvRS4ayDYlHtHF45Vgved33AE46oBUzZ
 hKJqtBJ5w84KoebGmpNj3oKRGnHWOi4=
Received: from mimecast-mx02.redhat.com (mx3-rdu2.redhat.com [66.187.233.73])
 by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384)
 id us-mta-605-aAOeejRZO56pkIFnbJ98cQ-1; Thu, 13 Oct 2022 00:54:31 -0400
X-MC-Unique: aAOeejRZO56pkIFnbJ98cQ-1
Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.rdu2.redhat.com [10.11.54.7])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mimecast-mx02.redhat.com (Postfix) with ESMTPS id DC9DD380450C;
 Thu, 13 Oct 2022 04:54:30 +0000 (UTC)
Received: from maya.cloud.tilaa.com (ovpn-208-3.brq.redhat.com [10.40.208.3])
 by smtp.corp.redhat.com (Postfix) with ESMTPS id 564711400C3B;
 Thu, 13 Oct 2022 04:54:29 +0000 (UTC)
Date: Thu, 13 Oct 2022 06:54:26 +0200
From: Stefano Brivio
To: David Gibson
Subject: Re: Alas for CAP_NET_BIND_SERVICE
Message-ID: <20221013065426.618e88b5@elisabeth>
In-Reply-To:
References: <20221012075432.09e33625@elisabeth> <20221012124707.70755587@elisabeth>
Organization: Red Hat
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 3.1 on 10.11.54.7
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain;
 charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Message-ID-Hash: 6DTO32JS76QAA2KU6CVRCU2PNTJ67NDF
X-Message-ID-Hash: 6DTO32JS76QAA2KU6CVRCU2PNTJ67NDF
X-MailFrom: sbrivio@redhat.com
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop;
 banned-address; member-moderation; nonmember-moderation; administrivia;
 implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests;
 suspicious-header
CC: passt-dev@passt.top
X-Mailman-Version: 3.3.3
Precedence: list
List-Id: Development discussion and patches for passt
Archived-At: <>
Archived-At:
List-Archive: <>
List-Archive:
List-Help:
List-Owner:
List-Post:
List-Subscribe:
List-Unsubscribe:

On Thu, 13 Oct 2022 11:34:04 +1100
David Gibson wrote:

> On Wed, Oct 12, 2022 at 12:47:07PM +0200, Stefano Brivio wrote:
> > On Wed, 12 Oct 2022 20:31:20 +1100
> > David Gibson wrote:
> >
> > > On Wed, Oct 12, 2022 at 07:54:32AM +0200, Stefano Brivio wrote:
> > > > Hi David,
> > > >
> > > > On Wed, 12 Oct 2022 13:55:02 +1100
> > > > David Gibson wrote:
> > > >
> > > > > Hi Stefano,
> > > > >
> > > > > I've looked deeper into why giving passt/pasta CAP_NET_BIND_SERVICE
> > > > > isn't working, and I'm afraid I have bad news.
> > > >
> > > > Thanks for the investigation.
> > > >
> > > > > We lose CAP_NET_BIND_SERVICE in the initial namespace as soon as we
> > > > > unshare() or setns() into the isolated namespace, and this appears to
> > > > > be intended behaviour. From user_namespaces(7), in the Capabilities section:
> > > > >
> > > > >    The child process created by clone(2) with the CLONE_NEWUSER flag
> > > > >    starts out with a complete set of capabilities in the new user
> > > > >    namespace. Likewise, a process that creates a new user namespace
> > > > >    using unshare(2) or joins an existing user namespace using
> > > > >    setns(2) gains a full set of capabilities in that namespace.
> > > > >    ***On the other hand, that process has no capabilities in the parent (in
> > > > >    the case of clone(2)) or previous (in the case of unshare(2) and
> > > > >    setns(2)) user namespace, even if the new namespace is created or
> > > > >    joined by the root user (i.e., a process with user ID 0 in the
> > > > >    root namespace).***
> > > > >
> > > > > Emphasis (***) mine. Basically, despite the way it's phrased in many
> > > > > places, processes don't have an independent set of capabilities in
> > > > > each userns, they only have a set of capabilities in their current
> > > > > userns. Any capabilities in other namespaces are implied in a pretty
> > > > > much all or nothing way - if the process's UID (the real, init ns one)
> > > > > owns the userns (or one of its ancestors), it gets all caps, otherwise
> > > > > none. cap_capable() has the specific logic in the kernel.
> > > >
> > > > Right, I missed this.
> > > >
> > > > For a moment, I wondered about ambient capabilities, but those would
> > > > only have an effect on an execve(), not on a clone(), I guess.
> > >
> > > Well, yes, but it doesn't really make any difference in any case. All
> > > ambient caps can do is be another way to get things into the permitted
> > > set. If that happens before the unshare() then we still lose them on
> > > unshare(). If it happens after the unshare(), then it's just giving
> > > us caps within the namespace, which isn't what we need.
> > >
> > > > > So, using CAP_NET_BIND_SERVICE isn't compatible with isolating
> > > > > ourselves in our own userns. At the very least "auto" inbound
> > > > > forwarding of low ports is pretty much off the cards.
> > > > >
> > > > > For forwarding of specific low ports, we could delay our entry into
> > > > > the new userns until we've set up the listening sockets, although it
> > > > > does mean rolling back some of the simplification we gained from the
> > > > > new-style userns handling.
> > > >
> > > > If I understand correctly, the biggest hurdle would be:
> > > >
> > > > 1. we detach namespaces
> > > >
> > > > 2. only then we can finalise any missing bit of addressing and routing
> > > >    configuration (relevant for pasta)
> > > >
> > > > 3. we bind ports as we parse configuration options, but we need
> > > >    addressing to be fully configured for this
> > > >
> > > > Referring to your latest patchset (which I'm still reviewing), I guess
> > > > that implies a further split of isolate_user() (it's great to have a
> > > > name for that, finally!), right?
> > >
> > > Uh.. something like that, I haven't looked at the details. As we did
> > > before my userns cleanup, we'd probably need to repeatedly enter the
> > > userns as well as the netns to operate upon it, staying in the initial
> > > userns, with our initial caps until sandbox()/isolate_prefork() or
> > > thereabouts.
> > >
> > > > > Or, we could abandon CAP_NET_BIND_SERVICE, and recommend the
> > > > > net.ipv4.ip_unprivileged_port_start sysctl as the only way to handle
> > > > > low ports in passt. I do see a fair bit of logic in that approach:
> > > > > passt has no meaningful way to limit what users do with the low ports
> > > > > it allows them (indirectly) to bind to, giving passt
> > > > > CAP_NET_BIND_SERVICE is pretty much equivalent to giving any process
> > > > > which can invoke passt CAP_NET_BIND_SERVICE.
> > > >
> > > > I also see the general point, even though if file capabilities are
> > > > used, I guess the equivalence doesn't really hold.
> > >
> > > Uh.. I don't follow.
> > > It's exactly file capabilities which make this
> > > equivalence. If the passt binary has cap_net_bind_service=ep, you
> > > can, as an unprivileged user, take any server, stick it in a namespace
> > > and use pasta to effectively bind it to a low port in the init
> > > namespace.
> >
> > I actually meant with passt but... even for pasta, this depends on the
> > decision of whether we drop capabilities for the spawned process. If we
> > decide we don't, one day, then it's not equivalent.
>
> No, from a security perspective it pretty much is still equivalent.
> You can start your own namespace where you have full capabilities, run
> the server in there, then use pasta to translate your
> cap_net_bind_service within to cap_net_bind_service on the host. Or
> just run the server on a high port and tell pasta to connect a low
> port to it.

Ah, sorry, now I understand what you mean here, and...

> > It would be equivalent if we just inherited capabilities from the
> > parent as opposed to file capabilities -- that's what I meant.
> >
> > I think it's a bit early to decide to drop those, though. Right now
> > pasta isn't really used as a stand-alone tool (even though I
> > actually do that, I find it very convenient also for totally unrelated
> > purposes).
> >
> > Should we see some use cases, then we could make a more informed
> > decision.
> >
> > > You can do the same thing with passt, though it's fiddlier
> > > (you'd need a shim to translate qemu socket protocol before plugging
> > > it into the server).
> >
> > Oh, you mean running pasta plus a shim plus qemu? Because with passt I
> > don't understand how you'd pass that kind of stuff over AF_UNIX...
>
> No qemu necessary. Make your bogus server, but instead of directly
> listen()ing on a low port, have it connect to a Unix socket and wait
> for SYN packets to a low port in qemu protocol. Then use passt to
> turn your Unix socket into a real listen()ing socket on the host.

...here. But the environment I had in mind was a rather controlled one,
with LSM policies that would normally prevent you from even having your
bogus server.

Well, that would be the case for KubeVirt at least: three binaries and
not much margin to play tricks. Unless, of course, you manage to do
arbitrary code execution in qemu, which... would actually be one of the
few cases where we would prefer to have CAP_NET_BIND_SERVICE granted to
passt instead of allowing everybody to bind low ports.

Still, it depends: you would have to reuse that qemu process, because
it's the only one that's connected to passt (which only applies with
minimally defined LSM policies, sure). In that environment, you couldn't
easily turn libvirtd into a DNS resolver.

In a general case, I see your point, but in specific cases it actually
depends on what the environment allows.

> > > > And perhaps we
> > > > should at least recommend that as a preferred way.
> > > >
> > > > What still perplexes me is: somebody gives passt CAP_NET_BIND_SERVICE,
> > > > and due to something that's slightly more than an implementation detail,
> > > > it won't be able to bind to low ports, which is the very reason for that
> > > > capability. That sounds highly counterintuitive.
> > >
> > > I guess it is in the sense that the reason for this wasn't obvious to
> > > either of us initially. However it makes sense to me now that I've
> > > looked at it.
> >
> > No, no, in the sense that it makes sense to you and now to me as well,
> > as you explained it to me.
> > And yet I find it hard to imagine that it
> > would naturally make sense to users, in these terms:
> >
> > - we offer a program that provides network connectivity to qemu
> >
> > - it also includes port forwarding functionality: it binds to
> >   configured ports and maps them to the guest
> >
> > - it can't bind to any port: it doesn't run as root, and Linux prevents
> >   non-root processes from binding to ports lower than 1024, which is a
> >   well-known fact -- at least by default (lesser known fact)
> >
> > - somewhat in between on the scale of general knowledge, lies
> >   CAP_NET_BIND_SERVICE: it allows non-root processes to bind to low
> >   ports
> >
> > ...but not passt. For very valid reasons, indeed, but those will need
> > to be explained over and over again.
>
> Yeah, I guess so.
>
> > > We use a userns for two reasons: 1) to control a netns
> > > and 2) to isolate ourselves for security. We use the same path and
> > > the same userns for both, but they're logically different reasons.
> > >
> > > If (1) was the only reason for the userns we could handle this pretty
> > > easily: we'd only enter the userns transiently when we need to
> > > manipulate it, just like we do with the netns. That way the main
> > > thread would retain CAP_NET_BIND_SERVICE in the original ns.
> > >
> > > For (2), we're specifically choosing to isolate ourselves: that is to
> > > give up privilege from our original position. It's not surprising
> > > there's some degree of granularity to how we can do that, and the deal
> > > is that we can't give up our membership in the original userns without
> > > also giving up our enhanced capabilities in that userns.
> > >
> > > I don't think giving up (2) is a price worth paying for this.
> >
> > Absolutely, I agree, I wouldn't either.
> >
> > However, we could give users the choice without compromising (2) at
> > all, by binding to low ports early (without automatic detection, sure).
> > And, somewhat importantly, by not handling any data from them.
> >
> > We could even defer the listen() calls if there's any value in doing so
> > (is there some? I can't think of anything).
> >
> > Actually, I'm thinking of an easier way to break the circular
> > dependency between isolation steps and port configuration I outlined
> > earlier, without undoing your cleanups at all.
> >
> > We currently need to process port configuration in a second step for
> > two reasons:
> >
> > - we might bind ports in the detached namespace (-T, -U)
> >
> > - one between IPv4 and IPv6 support could be administratively disabled
> >   (operationally, who cares, we'll just fail to bind if that's the
> >   case)
> >
> > ...but for init/host facing ports (now: "inbound"), we don't care about
> > the detached namespace, and we could simply call conf_ports() for -t
> > and -u in a second step after the main loop. Sure, if we continue like
> > this, we'll end up with O(n²) option handling, but right now it
> > doesn't look that bad to me.
>
> Ah, yes, that could work. Of course, this does mean moving some
> relatively complex code out of at least one layer of isolation, which
> carries some risks.

Some of it, yes, but to me it doesn't look much worse than conf() in
comparison. I still have to think about downsides of having listen()
calls there, though. On the other hand, that would mean we can also get
rid of listen() later with seccomp, for passt only.

Anyway, I drafted it... but this happens.
I dropped pasta symlinks for simplicity:

--
# setcap 'cap_net_bind_service=+ep' /home/sbrivio/passt/pasta.avx2
# getcap /home/sbrivio/passt/pasta.avx2
/home/sbrivio/passt/pasta.avx2 cap_net_bind_service=ep
--
$ strace -ebind,capget,capset,readlink -f ./pasta.avx2 -f -t 81
1763943 readlink("/proc/self/exe", "/home/sbrivio/passt/pasta.avx2", 4095) = 30
capget({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=0, inheritable=0}) = 0
capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=0, inheritable=0}) = 0
bind(5, {sa_family=AF_INET, sin_port=htons(81), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)
bind(5, {sa_family=AF_INET, sin_port=htons(81), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EACCES (Permission denied)
bind(5, {sa_family=AF_INET6, sin6_port=htons(81), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, 28) = -1 EACCES (Permission denied)
bind(5, {sa_family=AF_INET6, sin6_port=htons(81), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = -1 EACCES (Permission denied)
--

no fancy filesystem attributes, just a very unassuming ext4. I must be
missing something very obvious.

-- 
Stefano
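
P.S. As a cross-check on the capget() result above (effective=0, permitted=0), here is a minimal sketch in Python, not part of passt, that reads a process's capability masks from /proc/<pid>/status and tests the CAP_NET_BIND_SERVICE bit. The function names are made up for illustration; the only things assumed are the CapPrm/CapEff lines of the status file and bit number 10 for CAP_NET_BIND_SERVICE from <linux/capability.h>. Note that these masks are relative to the user namespace the process currently lives in, so they only say something about binding low ports on the host before any unshare() or setns().

```python
import re

CAP_NET_BIND_SERVICE = 10  # bit number, as defined in <linux/capability.h>


def parse_cap_sets(status_text):
    """Extract the Cap* hexadecimal masks from /proc/<pid>/status text.

    Returns a dict such as {'CapPrm': 0x400, 'CapEff': 0x400, ...}.
    """
    sets = {}
    for line in status_text.splitlines():
        m = re.match(r"^(Cap\w+):\s*([0-9a-fA-F]+)$", line.strip())
        if m:
            sets[m.group(1)] = int(m.group(2), 16)
    return sets


def has_cap(mask, cap=CAP_NET_BIND_SERVICE):
    """True if capability bit number 'cap' is set in the integer mask."""
    return bool((mask >> cap) & 1)


def holds_net_bind_service(pid="self"):
    """Hypothetical helper: is the capability both permitted and effective?

    The answer is relative to the process's current user namespace.
    """
    with open(f"/proc/{pid}/status") as f:
        sets = parse_cap_sets(f.read())
    return has_cap(sets.get("CapPrm", 0)) and has_cap(sets.get("CapEff", 0))
```

Calling holds_net_bind_service() with the PID of a running pasta (or with no argument, for the calling process) gives a second opinion that doesn't depend on how the binary was started.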