public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top
Subject: Re: Alas for CAP_NET_BIND_SERVICE
Date: Fri, 14 Oct 2022 13:54:28 +1100	[thread overview]
Message-ID: <Y0jPZBvaBFOP/qPj@yekko> (raw)
In-Reply-To: <20221013065426.618e88b5@elisabeth>

[-- Attachment #1: Type: text/plain, Size: 13726 bytes --]

On Thu, Oct 13, 2022 at 06:54:26AM +0200, Stefano Brivio wrote:
> On Thu, 13 Oct 2022 11:34:04 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Wed, Oct 12, 2022 at 12:47:07PM +0200, Stefano Brivio wrote:
> > > On Wed, 12 Oct 2022 20:31:20 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > On Wed, Oct 12, 2022 at 07:54:32AM +0200, Stefano Brivio wrote:  
> > > > > Hi David,
> > > > > 
> > > > > On Wed, 12 Oct 2022 13:55:02 +1100
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >     
> > > > > > Hi Stefano,
> > > > > > 
> > > > > > I've looked deeper into why giving passt/pasta CAP_NET_BIND_SERVICE
> > > > > > isn't working, and I'm afraid I have bad news.    
> > > > > 
> > > > > Thanks for the investigation.
> > > > >     
> > > > > > We lose CAP_NET_BIND_SERVICE in the initial namespace as soon as we
> > > > > > unshare() or setns() into the isolated namespace, and this appears to
> > > > > > be intended behaviour.  From user_namespaces(7), in the Capabilities section:
> > > > > > 
> > > > > >     The child process created by clone(2) with the CLONE_NEWUSER flag
> > > > > >     starts out with a complete set of capabilities in the new user
> > > > > >     namespace.  Likewise, a process that creates a new user namespace
> > > > > >     using unshare(2) or joins an existing user namespace using
> > > > > >     setns(2) gains a full set of capabilities in that namespace.  ***On
> > > > > >     the other hand, that process has no capabilities in the parent (in
> > > > > >     the case of clone(2)) or previous (in the case of unshare(2) and
> > > > > >     setns(2)) user namespace, even if the new namespace is created or
> > > > > >     joined by the root user (i.e., a process with user ID 0 in the
> > > > > >     root namespace).***
> > > > > > 
> > > > > > Emphasis (***) mine.  Basically, despite the way it's phrased in many
> > > > > > places, processes don't have an independent set of capabilities in
> > > > > > each userns, they only have a set of capabilities in their current
> > > > > > userns.  Any capabilities in other namespaces are implied in a pretty
> > > > > > much all or nothing way - if the process's UID (the real, init ns one)
> > > > > > owns the userns (or one of its ancestors), it gets all caps, otherwise
> > > > > > none.  cap_capable() has the specific logic in the kernel.    
> > > > > 
> > > > > Right, I missed this.
> > > > > 
> > > > > For a moment, I wondered about ambient capabilities, but those would
> > > > > only have an effect on an execve(), not on a clone(), I guess.    
> > > > 
> > > > Well, yes, but it doesn't really make any difference in any case.  All
> > > > ambient caps can do is be another way to get things into the permitted
> > > > set.  If that happens before the unshare() then we still lose them on
> > > > unshare().  If it happens after the unshare(), then it's just giving
> > > > us caps within the namespace, which isn't what we need.
> > > >   
> > > > > > So, using CAP_NET_BIND_SERVICE isn't compatible with isolating
> > > > > > ourselves in our own userns.  At the very least "auto" inbound
> > > > > > forwarding of low ports is pretty much off the cards.
> > > > > > 
> > > > > > For forwarding of specific low ports, we could delay our entry into
> > > > > > the new userns until we've set up the listening sockets, although it
> > > > > > does mean rolling back some of the simplification we gained from the
> > > > > > new-style userns handling.    
> > > > > 
> > > > > If I understand correctly, the biggest hurdle would be:
> > > > > 
> > > > > 1. we detach namespaces
> > > > > 
> > > > > 2. only then we can finalise any missing bit of addressing and routing
> > > > >    configuration (relevant for pasta)
> > > > > 
> > > > > 3. we bind ports as we parse configuration options, but we need
> > > > >    addressing to be fully configured for this
> > > > > 
> > > > > Referring to your latest patchset (which I'm still reviewing), I guess
> > > > > that implies a further split of isolate_user() (it's great to have a
> > > > > name for that, finally!), right?    
> > > > 
> > > > Uh.. something like that, I haven't looked at the details.  As we did
> > > > before my userns cleanup, we'd probably need to repeatedly enter the
> > > > userns as well as the netns to operate upon it, staying in the initial
> > > > userns, with our initial caps until sandbox()/isolate_prefork() or
> > > > thereabouts.
> > > >   
> > > > > > Or, we could abandon CAP_NET_BIND_SERVICE, and recommend the
> > > > > > net.ipv4.ip_unprivileged_port_start sysctl as the only way to handle
> > > > > > low ports in passt.  I do see a fair bit of logic in that approach:
> > > > > > passt has no meaningful way to limit what users do with the low ports
> > > > > > it allows them (indirectly) to bind to, giving passt
> > > > > > CAP_NET_BIND_SERVICE is pretty much equivalent to giving any process
> > > > > > which can invoke passt CAP_NET_BIND_SERVICE.    
> > > > > 
> > > > > I also see the general point, even though if file capabilities are
> > > > > used, I guess the equivalence doesn't really hold.    
> > > > 
> > > > Uh.. I don't follow.  It's exactly file capabilities which make this
> > > > equivalence.  If the passt binary has cap_net_bind_service=ep, you
> > > > can, as an unprivileged user, take any server, stick it in a namespace
> > > > and use pasta to effectively bind it to a low port in the init
> > > > namespace.  
> > > 
> > > I actually meant with passt but... even for pasta, this depends on the
> > > decision of whether we drop capabilities for the spawned process. If we
> > > decide we don't, one day, then it's not equivalent.  
> > 
> > No, from a security perspective it pretty much is still equivalent.
> > You can start your own namespace where you have full capabilities, run
> > the server in there, then use pasta to translate your
> > cap_net_bind_service within to cap_net_bind_service on the host.  Or
> > just run the server on a high port and tell pasta to connect a low
> > port to it.
> 
> Ah, sorry, now I understand what you mean here, and...
> 
> > > It would be equivalent if we just inherited capabilities from the
> > > parent as opposed to file capabilities -- that's what I meant.
> > > 
> > > I think it's a bit early to decide to drop those, though. Right now
> > > pasta isn't really used as a stand-alone tool (even though I
> > > actually do that, I find it very convenient also for totally unrelated
> > > purposes).
> > > 
> > > Should we see some use cases, then we could make a more informed
> > > decision.
> > >   
> > > > You can do the same thing with passt, though it's fiddlier
> > > > (you'd need a shim to translate qemu socket protocol before plugging
> > > > it into the server).  
> > > 
> > > Oh, you mean running pasta plus a shim plus qemu? Because with passt I
> > > don't understand how you'd pass that kind of stuff over AF_UNIX...  
> > 
> > No qemu necessary.  Make your bogus server, but instead of directly
> > listen()ing on a low port, have it connect to a Unix socket and wait
> > for SYN packets to a low port in qemu protocol.  Then use passt to
> > turn your Unix socket into a real listen()ing socket on the host.
> 
> ...here. But the environment I had in mind was a rather controller one,
> with KSM policies that would normally prevent you from even having your
> bogus server.
> 
> Well, that would be the case for KubeVirt at least: three binaries and
> not much margin to play tricks.

Ok, but even then using the file capability rather than the sysctl
only makes a difference if the attacker:
 * CAN escape confinement enough to make socket calls in the netns
   where we would be setting the sysctl
 * CAN'T escape confinment enough to exec() passt

Which seems like it would be a very narrow category to me.

> Unless, of course, you manage to do arbitrary code execution in qemu,
> which... would be actually one of the few cases where we would prefer
> to have CAP_NET_BIND_SERVICE granted to passt instead of allowing
> everybody to bind low ports.

In that situation, I think it's still equivalent unless qemu was be
prevented from (re) exec()ing passt... but the obvious mechanism to do
that (put it in a different mount namespace from qemu) is fairly
easily extended to working with the sysctl as well (put qemu in a
separate netns so it can't access any network except via the passt
socket it's been fiven).

> Still, it depends: you would have to reuse that qemu process, because
> it's the only one that's connected to passt (which only applies with
> minimally defined KSM policies, sure).
> 
> In that environment, you couldn't easily turn libvirtd into a DNS
> resolver.
> 
> In a general case, I see your point, but in specific cases it actually
> depends on what the environment allows.
> 
> > > > > And perhaps we
> > > > > should at least recommend that as a preferred way.
> > > > > 
> > > > > What still perplexes me is: somebody gives passt CAP_NET_BIND_SERVICE,
> > > > > and due to something that's slightly more than an implementation detail,
> > > > > it won't be able to bind to low ports, which is the very reason for that
> > > > > capability. That sounds highly counterintuitive.    
> > > > 
> > > > I guess it is in the sense that the reason for this wasn't obvious to
> > > > either of us initially.  However it makes sense to me now that I've
> > > > looked at it.  
> > > 
> > > No, no, in the sense that it makes sense to you and now to me as well,
> > > as you explained it to me. And yet it I find it hard to imagine that it
> > > would naturally make sense to users, in these terms:
> > > 
> > > - we offer a program that provides network connectivity to qemu
> > > 
> > > - it also includes port forwarding functionality: it binds to
> > >   configured ports and maps them to the guest
> > > 
> > > - it can't bind to any port: it doesn't run as root, and Linux prevents
> > >   non-root processes from binding to ports lower than 1024, which is a
> > >   well-known fact -- at least by default (lesser known fact)
> > > 
> > > - somewhat in between on the scale of general knowledge, lies
> > >   CAP_NET_BIND_SERVICE: it allows non-root processes to bind to low
> > >   ports
> > > 
> > > ...but not passt. For very valid reasons, indeed, but those will need
> > > to be explained over and over again.  
> > 
> > Yeah, I guess so.
> > 
> > > > We use a userns for two reasons: 1) to control a netns
> > > > and 2) to isolate ourselves for security.  We use the same path and
> > > > the same userns for both, but they're logically different reasons.
> > > > 
> > > > If (1) was the only reason for the userns we could handle this pretty
> > > > easily: we'd only enter the userns transiently when we need to
> > > > manipulate it, just like we do with the netns.  That way the main
> > > > thread would retain CAP_NET_BIND_SERVICE in the original ns.
> > > > 
> > > > For (2), we're specifically choosing to isolate ourselves: that is to
> > > > give up privilege from our original position.  It's not surprising
> > > > there's some degree of granularity to how we can do that, and the deal
> > > > is that we can't give up our membership in the original userns without
> > > > also giving up our enhanced capabilities in that userns.
> > > > 
> > > > I don't think giving up (2) is a price worth paying for this.  
> > > 
> > > Absolutely, I agree, I wouldn't either.
> > > 
> > > However, he could give users the choice without compromising (2) at
> > > all, by binding to low ports early (without automatic detection, sure).
> > > And, somewhat importantly, by not handling any data from them.
> > > 
> > > We could even defer the listen() calls if there's any value in doing so
> > > (is there some? I can't think of anything).
> > > 
> > > Actually, I'm thinking of an easier way to break the circular
> > > dependency between isolation steps and port configuration I outlined
> > > earlier, without undoing your cleanups at all.
> > > 
> > > We currently need to process port configuration in a second step for
> > > two reasons:
> > > 
> > > - we might bind ports in the detached namespace (-T, -U)
> > > 
> > > - one between IPv4 and IPv6 support could be administratively disabled
> > >   (operationally, who cares, we'll just fail to bind if that's the
> > >   case)
> > > 
> > > ...but for init/host facing ports (now: "inbound"), we don't care about
> > > the detached namespace, and we could simply call conf_ports() for -t
> > > and -u in a second step after the main loop. Sure, if we continue like
> > > this, we'll end up with O(n²) option handling, but right now it
> > > doesn't look that bad to me.  
> > 
> > Ah, yes, that could work.  Of course, this does mean moving some
> > relatively complex code out of at least one layer of isolation, which
> > carries some risks.
> 
> Some of it, yes, but to me it doesn't look much worse than conf() in
> comparison. I still have to think about downsides of having listen()
> calls there, though. On the other hand, that would mean we can also get
> rid of listen() later with seccomp, for passt only.
> 
> Anyway, I drafted it... but this happens. I dropped pasta symlinks for
> simplicity:
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

  parent reply	other threads:[~2022-10-14  2:56 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-12  2:55 Alas for CAP_NET_BIND_SERVICE David Gibson
2022-10-12  5:54 ` Stefano Brivio
2022-10-12  9:31   ` David Gibson
2022-10-12 10:47     ` Stefano Brivio
2022-10-13  0:34       ` David Gibson
2022-10-13  4:54         ` Stefano Brivio
2022-10-13  5:15           ` Stefano Brivio
2022-10-14  2:54           ` David Gibson [this message]
2022-10-16  9:46             ` Stefano Brivio
2022-10-17  3:20               ` David Gibson
2022-10-13 10:50       ` Stefano Brivio
2022-10-14  2:56         ` David Gibson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y0jPZBvaBFOP/qPj@yekko \
    --to=david@gibson.dropbear.id.au \
    --cc=passt-dev@passt.top \
    --cc=sbrivio@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).