public inbox for passt-dev@passt.top
From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top
Subject: Re: Alas for CAP_NET_BIND_SERVICE
Date: Sun, 16 Oct 2022 11:46:46 +0200
Message-ID: <20221016114646.6733393a@elisabeth>
In-Reply-To: <Y0jPZBvaBFOP/qPj@yekko>

On Fri, 14 Oct 2022 13:54:28 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Oct 13, 2022 at 06:54:26AM +0200, Stefano Brivio wrote:
> > On Thu, 13 Oct 2022 11:34:04 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Wed, Oct 12, 2022 at 12:47:07PM +0200, Stefano Brivio wrote:  
> > > > On Wed, 12 Oct 2022 20:31:20 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >     
> > > > > On Wed, Oct 12, 2022 at 07:54:32AM +0200, Stefano Brivio wrote:    
> > > > > > Hi David,
> > > > > > 
> > > > > > On Wed, 12 Oct 2022 13:55:02 +1100
> > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > >       
> > > > > > > Hi Stefano,
> > > > > > > 
> > > > > > > I've looked deeper into why giving passt/pasta CAP_NET_BIND_SERVICE
> > > > > > > isn't working, and I'm afraid I have bad news.      
> > > > > > 
> > > > > > Thanks for the investigation.
> > > > > >       
> > > > > > > We lose CAP_NET_BIND_SERVICE in the initial namespace as soon as we
> > > > > > > unshare() or setns() into the isolated namespace, and this appears to
> > > > > > > be intended behaviour.  From user_namespaces(7), in the Capabilities section:
> > > > > > > 
> > > > > > >     The child process created by clone(2) with the CLONE_NEWUSER flag
> > > > > > >     starts out with a complete set of capabilities in the new user
> > > > > > >     namespace.  Likewise, a process that creates a new user namespace
> > > > > > >     using unshare(2) or joins an existing user namespace using
> > > > > > >     setns(2) gains a full set of capabilities in that namespace.  ***On
> > > > > > >     the other hand, that process has no capabilities in the parent (in
> > > > > > >     the case of clone(2)) or previous (in the case of unshare(2) and
> > > > > > >     setns(2)) user namespace, even if the new namespace is created or
> > > > > > >     joined by the root user (i.e., a process with user ID 0 in the
> > > > > > >     root namespace).***
> > > > > > > 
> > > > > > > Emphasis (***) mine.  Basically, despite the way it's phrased in many
> > > > > > > places, processes don't have an independent set of capabilities in
> > > > > > > each userns, they only have a set of capabilities in their current
> > > > > > > userns.  Any capabilities in other namespaces are implied in a pretty
> > > > > > > much all or nothing way - if the process's UID (the real, init ns one)
> > > > > > > owns the userns (or one of its ancestors), it gets all caps, otherwise
> > > > > > > none.  cap_capable() has the specific logic in the kernel.      
> > > > > > 
> > > > > > Right, I missed this.
> > > > > > 
> > > > > > For a moment, I wondered about ambient capabilities, but those would
> > > > > > only have an effect on an execve(), not on a clone(), I guess.      
> > > > > 
> > > > > Well, yes, but it doesn't really make any difference in any case.  All
> > > > > ambient caps can do is be another way to get things into the permitted
> > > > > set.  If that happens before the unshare() then we still lose them on
> > > > > unshare().  If it happens after the unshare(), then it's just giving
> > > > > us caps within the namespace, which isn't what we need.
> > > > >     
> > > > > > > So, using CAP_NET_BIND_SERVICE isn't compatible with isolating
> > > > > > > ourselves in our own userns.  At the very least "auto" inbound
> > > > > > > forwarding of low ports is pretty much off the cards.
> > > > > > > 
> > > > > > > For forwarding of specific low ports, we could delay our entry into
> > > > > > > the new userns until we've set up the listening sockets, although it
> > > > > > > does mean rolling back some of the simplification we gained from the
> > > > > > > new-style userns handling.      
> > > > > > 
> > > > > > If I understand correctly, the biggest hurdle would be:
> > > > > > 
> > > > > > 1. we detach namespaces
> > > > > > 
> > > > > > 2. only then we can finalise any missing bit of addressing and routing
> > > > > >    configuration (relevant for pasta)
> > > > > > 
> > > > > > 3. we bind ports as we parse configuration options, but we need
> > > > > >    addressing to be fully configured for this
> > > > > > 
> > > > > > Referring to your latest patchset (which I'm still reviewing), I guess
> > > > > > that implies a further split of isolate_user() (it's great to have a
> > > > > > name for that, finally!), right?      
> > > > > 
> > > > > Uh.. something like that, I haven't looked at the details.  As we did
> > > > > before my userns cleanup, we'd probably need to repeatedly enter the
> > > > > userns as well as the netns to operate upon it, staying in the initial
> > > > > userns, with our initial caps until sandbox()/isolate_prefork() or
> > > > > thereabouts.
> > > > >     
> > > > > > > Or, we could abandon CAP_NET_BIND_SERVICE, and recommend the
> > > > > > > net.ipv4.ip_unprivileged_port_start sysctl as the only way to handle
> > > > > > > low ports in passt.  I do see a fair bit of logic in that approach:
> > > > > > > passt has no meaningful way to limit what users do with the low ports
> > > > > > > it allows them (indirectly) to bind to, giving passt
> > > > > > > CAP_NET_BIND_SERVICE is pretty much equivalent to giving any process
> > > > > > > which can invoke passt CAP_NET_BIND_SERVICE.      
> > > > > > 
> > > > > > I also see the general point, even though if file capabilities are
> > > > > > used, I guess the equivalence doesn't really hold.      
> > > > > 
> > > > > Uh.. I don't follow.  It's exactly file capabilities which make this
> > > > > equivalence.  If the passt binary has cap_net_bind_service=ep, you
> > > > > can, as an unprivileged user, take any server, stick it in a namespace
> > > > > and use pasta to effectively bind it to a low port in the init
> > > > > namespace.    
> > > > 
> > > > I actually meant with passt but... even for pasta, this depends on the
> > > > decision of whether we drop capabilities for the spawned process. If we
> > > > decide we don't, one day, then it's not equivalent.    
> > > 
> > > No, from a security perspective it pretty much is still equivalent.
> > > You can start your own namespace where you have full capabilities, run
> > > the server in there, then use pasta to translate your
> > > cap_net_bind_service within to cap_net_bind_service on the host.  Or
> > > just run the server on a high port and tell pasta to connect a low
> > > port to it.  
> > 
> > Ah, sorry, now I understand what you mean here, and...
> >   
> > > > It would be equivalent if we just inherited capabilities from the
> > > > parent as opposed to file capabilities -- that's what I meant.
> > > > 
> > > > I think it's a bit early to decide to drop those, though. Right now
> > > > pasta isn't really used as a stand-alone tool (even though I
> > > > actually do that, I find it very convenient also for totally unrelated
> > > > purposes).
> > > > 
> > > > Should we see some use cases, then we could make a more informed
> > > > decision.
> > > >     
> > > > > You can do the same thing with passt, though it's fiddlier
> > > > > (you'd need a shim to translate qemu socket protocol before plugging
> > > > > it into the server).    
> > > > 
> > > > Oh, you mean running pasta plus a shim plus qemu? Because with passt I
> > > > don't understand how you'd pass that kind of stuff over AF_UNIX...    
> > > 
> > > No qemu necessary.  Make your bogus server, but instead of directly
> > > listen()ing on a low port, have it connect to a Unix socket and wait
> > > for SYN packets to a low port in qemu protocol.  Then use passt to
> > > turn your Unix socket into a real listen()ing socket on the host.  
> > 
> > ...here. But the environment I had in mind was a rather controlled one,
> > with LSM policies that would normally prevent you from even having your
> > bogus server.
> > 
> > Well, that would be the case for KubeVirt at least: three binaries and
> > not much margin to play tricks.  
> 
> Ok, but even then using the file capability rather than the sysctl
> only makes a difference if the attacker:
>  * CAN escape confinement enough to make socket calls in the netns
>    where we would be setting the sysctl
>  * CAN'T escape confinement enough to exec() passt

Hmm, I'm thinking about another aspect. Right now we don't drop the
capability after binding ports, but it's not effective in the parent
namespace anyway, because of what you mentioned, which implies that we
can only bind the configured ports.
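
To make the rule you described concrete, here is a rough Python model of
the cap_capable() walk as I understand it. It's a simplified sketch, not
the kernel code: UserNs, Cred and capable() are names made up for
illustration, and capabilities are plain strings.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class UserNs:
    """Hypothetical stand-in for a user namespace."""
    name: str
    owner_uid: int                      # UID (as seen in init ns) that created it
    parent: Optional["UserNs"] = None   # None only for the initial namespace

    @property
    def level(self) -> int:
        # Nesting depth: 0 for the initial userns, parent's level + 1 otherwise
        return 0 if self.parent is None else self.parent.level + 1


@dataclass(frozen=True)
class Cred:
    """Hypothetical stand-in for process credentials."""
    euid: int
    user_ns: UserNs
    effective: FrozenSet[str]           # caps held in the *current* userns only


def capable(cred: Cred, target: UserNs, cap: str) -> bool:
    """Simplified model: does cred hold cap with respect to target?"""
    ns = target
    while True:
        # Same namespace: only the effective set matters
        if ns is cred.user_ns:
            return cap in cred.effective
        # Target is not a descendant of our namespace: no capabilities at all
        if ns.level <= cred.user_ns.level:
            return False
        # We own a direct child namespace: all capabilities over it
        if ns.parent is cred.user_ns and ns.owner_uid == cred.euid:
            return True
        # Otherwise, check whether we have the capability over the parent
        ns = ns.parent


# Assumed scenario: UID 1000 starts with the file capability in the initial
# userns, then unshares into its own userns and gains a full set there.
init_ns = UserNs("init", owner_uid=0)
before = Cred(1000, init_ns, frozenset({"CAP_NET_BIND_SERVICE"}))
child_ns = UserNs("child", owner_uid=1000, parent=init_ns)
after = Cred(1000, child_ns,
             frozenset({"CAP_NET_BIND_SERVICE", "CAP_NET_ADMIN", "CAP_SYS_ADMIN"}))
```

In this model, capable(after, init_ns, "CAP_NET_BIND_SERVICE") is false:
once we are in the child userns, nothing is left against the initial
one, which is why dropping the capability there changes nothing.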

There might be a relevant difference between binding port 25, a less
usable 53 or 67, or a more innocent 443. In practice, if somebody uses
the sysctl, they might very well be setting it to 0 instead.
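
As a small illustration of what the sysctl changes, here is a minimal
Python sketch of the check, assuming the usual semantics: ports below
net.ipv4.ip_unprivileged_port_start need CAP_NET_BIND_SERVICE, the
default is 1024, and port 0 (kernel-assigned) never needs it. The
function names are made up for this example and are not part of passt.

```python
from pathlib import Path

# procfs path of the sysctl (per network namespace)
SYSCTL_PATH = Path("/proc/sys/net/ipv4/ip_unprivileged_port_start")


def unprivileged_port_start(path: Path = SYSCTL_PATH, default: int = 1024) -> int:
    """Read the sysctl, falling back to the assumed default if unavailable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return default


def needs_bind_capability(port: int, start: int) -> bool:
    """True if binding this port would require CAP_NET_BIND_SERVICE."""
    # Port 0 asks the kernel for an ephemeral port and is never privileged
    return 0 < port < start


if __name__ == "__main__":
    start = unprivileged_port_start()
    for port in (25, 53, 67, 443, 8080):
        state = "privileged" if needs_bind_capability(port, start) else "unprivileged"
        print(f"port {port}: {state} (threshold {start})")
```

Setting the threshold to 0 makes every port unprivileged, whereas a value
such as 443 would still keep 25, 53 and 67 behind the capability.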

By the way, I just realised, after these changes we should double check
the AppArmor and SELinux profiles we ship as examples.

I don't think it's urgent, because in the worst case they should be too
restrictive rather than the opposite -- see the current AppArmor
"capability" directive and the SELinux "allow passt_t self:capability"
enforcement.

-- 
Stefano
