Alas for CAP_NET_BIND

public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed

* Alas for CAP_NET_BIND_SERVICE
@ 2022-10-12  2:55 David Gibson
  2022-10-12  5:54 ` Stefano Brivio
  0 siblings, 1 reply; 12+ messages in thread
From: David Gibson @ 2022-10-12  2:55 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 2556 bytes --]

Hi Stefano,

I've looked deeper into why giving passt/pasta CAP_NET_BIND_SERVICE
isn't working, and I'm afraid I have bad news.

We lose CAP_NET_BIND_SERVICE in the initial namespace as soon as we
unshare() or setns() into the isolated namespace, and this appears to
be intended behaviour.  From user_namespaces(7), in the Capabilities section:

    The child process created by clone(2) with the CLONE_NEWUSER flag
    starts out with a complete set of capabilities in the new user
    namespace.  Likewise, a process that creates a new user namespace
    using unshare(2) or joins an existing user namespace using
    setns(2) gains a full set of capabilities in that namespace.  ***On
    the other hand, that process has no capabilities in the parent (in
    the case of clone(2)) or previous (in the case of unshare(2) and
    setns(2)) user namespace, even if the new namespace is created or
    joined by the root user (i.e., a process with user ID 0 in the
    root namespace).***

Emphasis (***) mine.  Basically, despite the way it's phrased in many
places, processes don't have an independent set of capabilities in
each userns, they only have a set of capabilities in their current
userns.  Any capabilities in other namespaces are implied in a pretty
much all or nothing way - if the process's UID (the real, init ns one)
owns the userns (or one of its ancestors), it gets all caps, otherwise
none.  cap_capable() has the specific logic in the kernel.

So, using CAP_NET_BIND_SERVICE isn't compatible with isolating
ourselves in our own userns.  At the very least "auto" inbound
forwarding of low ports is pretty much off the cards.

For forwarding of specific low ports, we could delay our entry into
the new userns until we've set up the listening sockets, although it
does mean rolling back some of the simplification we gained from the
new-style userns handling.

Or, we could abandon CAP_NET_BIND_SERVICE, and recommend the
net.ipv4.ip_unprivileged_port_start sysctl as the only way to handle
low ports in passt.  I do see a fair bit of logic in that approach:
passt has no meaningful way to limit what users do with the low ports
it allows them (indirectly) to bind to, giving passt
CAP_NET_BIND_SERVICE is pretty much equivalent to giving any process
which can invoke passt CAP_NET_BIND_SERVICE.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Alas for CAP_NET_BIND_SERVICE
  2022-10-12  2:55 Alas for CAP_NET_BIND_SERVICE David Gibson
@ 2022-10-12  5:54 ` Stefano Brivio
  2022-10-12  9:31   ` David Gibson
  0 siblings, 1 reply; 12+ messages in thread
From: Stefano Brivio @ 2022-10-12  5:54 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

Hi David,

On Wed, 12 Oct 2022 13:55:02 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> Hi Stefano,
> 
> I've looked deeper into why giving passt/pasta CAP_NET_BIND_SERVICE
> isn't working, and I'm afraid I have bad news.

Thanks for the investigation.

> We lose CAP_NET_BIND_SERVICE in the initial namespace as soon as we
> unshare() or setns() into the isolated namespace, and this appears to
> be intended behaviour.  From user_namespaces(7), in the Capabilities section:
> 
>     The child process created by clone(2) with the CLONE_NEWUSER flag
>     starts out with a complete set of capabilities in the new user
>     namespace.  Likewise, a process that creates a new user namespace
>     using unshare(2) or joins an existing user namespace using
>     setns(2) gains a full set of capabilities in that namespace.  ***On
>     the other hand, that process has no capabilities in the parent (in
>     the case of clone(2)) or previous (in the case of unshare(2) and
>     setns(2)) user namespace, even if the new namespace is created or
>     joined by the root user (i.e., a process with user ID 0 in the
>     root namespace).***
> 
> Emphasis (***) mine.  Basically, despite the way it's phrased in many
> places, processes don't have an independent set of capabilities in
> each userns, they only have a set of capabilities in their current
> userns.  Any capabilities in other namespaces are implied in a pretty
> much all or nothing way - if the process's UID (the real, init ns one)
> owns the userns (or one of its ancestors), it gets all caps, otherwise
> none.  cap_capable() has the specific logic in the kernel.

Right, I missed this.

For a moment, I wondered about ambient capabilities, but those would
only have an effect on an execve(), not on a clone(), I guess.

> So, using CAP_NET_BIND_SERVICE isn't compatible with isolating
> ourselves in our own userns.  At the very least "auto" inbound
> forwarding of low ports is pretty much off the cards.
> 
> For forwarding of specific low ports, we could delay our entry into
> the new userns until we've set up the listening sockets, although it
> does mean rolling back some of the simplification we gained from the
> new-style userns handling.

If I understand correctly, the biggest hurdle would be:

1. we detach namespaces

2. only then we can finalise any missing bit of addressing and routing
   configuration (relevant for pasta)

3. we bind ports as we parse configuration options, but we need
   addressing to be fully configured for this

Referring to your latest patchset (which I'm still reviewing), I guess
that implies a further split of isolate_user() (it's great to have a
name for that, finally!), right?

> Or, we could abandon CAP_NET_BIND_SERVICE, and recommend the
> net.ipv4.ip_unprivileged_port_start sysctl as the only way to handle
> low ports in passt.  I do see a fair bit of logic in that approach:
> passt has no meaningful way to limit what users do with the low ports
> it allows them (indirectly) to bind to, giving passt
> CAP_NET_BIND_SERVICE is pretty much equivalent to giving any process
> which can invoke passt CAP_NET_BIND_SERVICE.

I also see the general point, even though if file capabilities are
used, I guess the equivalence doesn't really hold. And perhaps we
should at least recommend that as a preferred way.

What still perplexes me is: somebody gives passt CAP_NET_BIND_SERVICE,
and due to something that's slightly more than an implementation detail,
it won't be able to bind to low ports, which is the very reason for that
capability. That sounds highly counterintuitive.

-- 
Stefano


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Alas for CAP_NET_BIND_SERVICE
  2022-10-12  5:54 ` Stefano Brivio
@ 2022-10-12  9:31   ` David Gibson
  2022-10-12 10:47     ` Stefano Brivio
  0 siblings, 1 reply; 12+ messages in thread
From: David Gibson @ 2022-10-12  9:31 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 6215 bytes --]

On Wed, Oct 12, 2022 at 07:54:32AM +0200, Stefano Brivio wrote:
> Hi David,
> 
> On Wed, 12 Oct 2022 13:55:02 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > Hi Stefano,
> > 
> > I've looked deeper into why giving passt/pasta CAP_NET_BIND_SERVICE
> > isn't working, and I'm afraid I have bad news.
> 
> Thanks for the investigation.
> 
> > We lose CAP_NET_BIND_SERVICE in the initial namespace as soon as we
> > unshare() or setns() into the isolated namespace, and this appears to
> > be intended behaviour.  From user_namespaces(7), in the Capabilities section:
> > 
> >     The child process created by clone(2) with the CLONE_NEWUSER flag
> >     starts out with a complete set of capabilities in the new user
> >     namespace.  Likewise, a process that creates a new user namespace
> >     using unshare(2) or joins an existing user namespace using
> >     setns(2) gains a full set of capabilities in that namespace.  ***On
> >     the other hand, that process has no capabilities in the parent (in
> >     the case of clone(2)) or previous (in the case of unshare(2) and
> >     setns(2)) user namespace, even if the new namespace is created or
> >     joined by the root user (i.e., a process with user ID 0 in the
> >     root namespace).***
> > 
> > Emphasis (***) mine.  Basically, despite the way it's phrased in many
> > places, processes don't have an independent set of capabilities in
> > each userns, they only have a set of capabilities in their current
> > userns.  Any capabilities in other namespaces are implied in a pretty
> > much all or nothing way - if the process's UID (the real, init ns one)
> > owns the userns (or one of its ancestors), it gets all caps, otherwise
> > none.  cap_capable() has the specific logic in the kernel.
> 
> Right, I missed this.
> 
> For a moment, I wondered about ambient capabilities, but those would
> only have an effect on an execve(), not on a clone(), I guess.

Well, yes, but it doesn't really make any difference in any case.  All
ambient caps can do is be another way to get things into the permitted
set.  If that happens before the unshare() then we still lose them on
unshare().  If it happens after the unshare(), then it's just giving
us caps within the namespace, which isn't what we need.

> > So, using CAP_NET_BIND_SERVICE isn't compatible with isolating
> > ourselves in our own userns.  At the very least "auto" inbound
> > forwarding of low ports is pretty much off the cards.
> > 
> > For forwarding of specific low ports, we could delay our entry into
> > the new userns until we've set up the listening sockets, although it
> > does mean rolling back some of the simplification we gained from the
> > new-style userns handling.
> 
> If I understand correctly, the biggest hurdle would be:
> 
> 1. we detach namespaces
> 
> 2. only then we can finalise any missing bit of addressing and routing
>    configuration (relevant for pasta)
> 
> 3. we bind ports as we parse configuration options, but we need
>    addressing to be fully configured for this
> 
> Referring to your latest patchset (which I'm still reviewing), I guess
> that implies a further split of isolate_user() (it's great to have a
> name for that, finally!), right?

Uh.. something like that, I haven't looked at the details.  As we did
before my userns cleanup, we'd probably need to repeatedly enter the
userns as well as the netns to operate upon it, staying in the initial
userns, with our initial caps until sandbox()/isolate_prefork() or
thereabouts.

> > Or, we could abandon CAP_NET_BIND_SERVICE, and recommend the
> > net.ipv4.ip_unprivileged_port_start sysctl as the only way to handle
> > low ports in passt.  I do see a fair bit of logic in that approach:
> > passt has no meaningful way to limit what users do with the low ports
> > it allows them (indirectly) to bind to, giving passt
> > CAP_NET_BIND_SERVICE is pretty much equivalent to giving any process
> > which can invoke passt CAP_NET_BIND_SERVICE.
> 
> I also see the general point, even though if file capabilities are
> used, I guess the equivalence doesn't really hold.

Uh.. I don't follow.  It's exactly file capabilities which make this
equivalence.  If the passt binary has cap_net_bind_service=ep, you
can, as an unprivileged user, take any server, stick it in a namespace
and use pasta to effectively bind it to a low port in the init
namespace.  You can do the same thing with passt, though it's fiddlier
(you'd need a shim to translate qemu socket protocol before plugging
it into the server).

> And perhaps we
> should at least recommend that as a preferred way.
> 
> What still perplexes me is: somebody gives passt CAP_NET_BIND_SERVICE,
> and due to something that's slightly more than an implementation detail,
> it won't be able to bind to low ports, which is the very reason for that
> capability. That sounds highly counterintuitive.

I guess it is in the sense that the reason for this wasn't obvious to
either of us initially.  However it makes sense to me now that I've
looked at it.  We use a userns for two reasons: 1) to control a netns
and 2) to isolate ourselves for security.  We use the same path and
the same userns for both, but they're logically different reasons.

If (1) was the only reason for the userns we could handle this pretty
easily: we'd only enter the userns transiently when we need to
manipulate it, just like we do with the netns.  That way the main
thread would retain CAP_NET_BIND_SERVICE in the original ns.

For (2), we're specifically choosing to isolate ourselves: that is to
give up privilege from our original position.  It's not surprising
there's some degree of granularity to how we can do that, and the deal
is that we can't give up our membership in the original userns without
also giving up our enhanced capabilities in that userns.

I don't think giving up (2) is a price worth paying for this.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Alas for CAP_NET_BIND_SERVICE
  2022-10-12  9:31   ` David Gibson
@ 2022-10-12 10:47     ` Stefano Brivio
  2022-10-13  0:34       ` David Gibson
  2022-10-13 10:50       ` Stefano Brivio
  0 siblings, 2 replies; 12+ messages in thread
From: Stefano Brivio @ 2022-10-12 10:47 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Wed, 12 Oct 2022 20:31:20 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Oct 12, 2022 at 07:54:32AM +0200, Stefano Brivio wrote:
> > Hi David,
> > 
> > On Wed, 12 Oct 2022 13:55:02 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > Hi Stefano,
> > > 
> > > I've looked deeper into why giving passt/pasta CAP_NET_BIND_SERVICE
> > > isn't working, and I'm afraid I have bad news.  
> > 
> > Thanks for the investigation.
> >   
> > > We lose CAP_NET_BIND_SERVICE in the initial namespace as soon as we
> > > unshare() or setns() into the isolated namespace, and this appears to
> > > be intended behaviour.  From user_namespaces(7), in the Capabilities section:
> > > 
> > >     The child process created by clone(2) with the CLONE_NEWUSER flag
> > >     starts out with a complete set of capabilities in the new user
> > >     namespace.  Likewise, a process that creates a new user namespace
> > >     using unshare(2) or joins an existing user namespace using
> > >     setns(2) gains a full set of capabilities in that namespace.  ***On
> > >     the other hand, that process has no capabilities in the parent (in
> > >     the case of clone(2)) or previous (in the case of unshare(2) and
> > >     setns(2)) user namespace, even if the new namespace is created or
> > >     joined by the root user (i.e., a process with user ID 0 in the
> > >     root namespace).***
> > > 
> > > Emphasis (***) mine.  Basically, despite the way it's phrased in many
> > > places, processes don't have an independent set of capabilities in
> > > each userns, they only have a set of capabilities in their current
> > > userns.  Any capabilities in other namespaces are implied in a pretty
> > > much all or nothing way - if the process's UID (the real, init ns one)
> > > owns the userns (or one of its ancestors), it gets all caps, otherwise
> > > none.  cap_capable() has the specific logic in the kernel.  
> > 
> > Right, I missed this.
> > 
> > For a moment, I wondered about ambient capabilities, but those would
> > only have an effect on an execve(), not on a clone(), I guess.  
> 
> Well, yes, but it doesn't really make any difference in any case.  All
> ambient caps can do is be another way to get things into the permitted
> set.  If that happens before the unshare() then we still lose them on
> unshare().  If it happens after the unshare(), then it's just giving
> us caps within the namespace, which isn't what we need.
> 
> > > So, using CAP_NET_BIND_SERVICE isn't compatible with isolating
> > > ourselves in our own userns.  At the very least "auto" inbound
> > > forwarding of low ports is pretty much off the cards.
> > > 
> > > For forwarding of specific low ports, we could delay our entry into
> > > the new userns until we've set up the listening sockets, although it
> > > does mean rolling back some of the simplification we gained from the
> > > new-style userns handling.  
> > 
> > If I understand correctly, the biggest hurdle would be:
> > 
> > 1. we detach namespaces
> > 
> > 2. only then we can finalise any missing bit of addressing and routing
> >    configuration (relevant for pasta)
> > 
> > 3. we bind ports as we parse configuration options, but we need
> >    addressing to be fully configured for this
> > 
> > Referring to your latest patchset (which I'm still reviewing), I guess
> > that implies a further split of isolate_user() (it's great to have a
> > name for that, finally!), right?  
> 
> Uh.. something like that, I haven't looked at the details.  As we did
> before my userns cleanup, we'd probably need to repeatedly enter the
> userns as well as the netns to operate upon it, staying in the initial
> userns, with our initial caps until sandbox()/isolate_prefork() or
> thereabouts.
> 
> > > Or, we could abandon CAP_NET_BIND_SERVICE, and recommend the
> > > net.ipv4.ip_unprivileged_port_start sysctl as the only way to handle
> > > low ports in passt.  I do see a fair bit of logic in that approach:
> > > passt has no meaningful way to limit what users do with the low ports
> > > it allows them (indirectly) to bind to, giving passt
> > > CAP_NET_BIND_SERVICE is pretty much equivalent to giving any process
> > > which can invoke passt CAP_NET_BIND_SERVICE.  
> > 
> > I also see the general point, even though if file capabilities are
> > used, I guess the equivalence doesn't really hold.  
> 
> Uh.. I don't follow.  It's exactly file capabilities which make this
> equivalence.  If the passt binary has cap_net_bind_service=ep, you
> can, as an unprivileged user, take any server, stick it in a namespace
> and use pasta to effectively bind it to a low port in the init
> namespace.

I actually meant with passt but... even for pasta, this depends on the
decision of whether we drop capabilities for the spawned process. If we
decide we don't, one day, then it's not equivalent.

It would be equivalent if we just inherited capabilities from the
parent as opposed to file capabilities -- that's what I meant.

I think it's a bit early to decide to drop those, though. Right now
pasta isn't really used as a stand-alone tool (even though I
actually do that, I find it very convenient also for totally unrelated
purposes).

Should we see some use cases, then we could make a more informed
decision.

> You can do the same thing with passt, though it's fiddlier
> (you'd need a shim to translate qemu socket protocol before plugging
> it into the server).

Oh, you mean running pasta plus a shim plus qemu? Because with passt I
don't understand how you'd pass that kind of stuff over AF_UNIX...

> > And perhaps we
> > should at least recommend that as a preferred way.
> > 
> > What still perplexes me is: somebody gives passt CAP_NET_BIND_SERVICE,
> > and due to something that's slightly more than an implementation detail,
> > it won't be able to bind to low ports, which is the very reason for that
> > capability. That sounds highly counterintuitive.  
> 
> I guess it is in the sense that the reason for this wasn't obvious to
> either of us initially.  However it makes sense to me now that I've
> looked at it.

No, no, in the sense that it makes sense to you and now to me as well,
as you explained it to me. And yet it I find it hard to imagine that it
would naturally make sense to users, in these terms:

- we offer a program that provides network connectivity to qemu

- it also includes port forwarding functionality: it binds to
  configured ports and maps them to the guest

- it can't bind to any port: it doesn't run as root, and Linux prevents
  non-root processes from binding to ports lower than 1024, which is a
  well-known fact -- at least by default (lesser known fact)

- somewhat in between on the scale of general knowledge, lies
  CAP_NET_BIND_SERVICE: it allows non-root processes to bind to low
  ports

...but not passt. For very valid reasons, indeed, but those will need
to be explained over and over again.

> We use a userns for two reasons: 1) to control a netns
> and 2) to isolate ourselves for security.  We use the same path and
> the same userns for both, but they're logically different reasons.
> 
> If (1) was the only reason for the userns we could handle this pretty
> easily: we'd only enter the userns transiently when we need to
> manipulate it, just like we do with the netns.  That way the main
> thread would retain CAP_NET_BIND_SERVICE in the original ns.
> 
> For (2), we're specifically choosing to isolate ourselves: that is to
> give up privilege from our original position.  It's not surprising
> there's some degree of granularity to how we can do that, and the deal
> is that we can't give up our membership in the original userns without
> also giving up our enhanced capabilities in that userns.
> 
> I don't think giving up (2) is a price worth paying for this.

Absolutely, I agree, I wouldn't either.

However, he could give users the choice without compromising (2) at
all, by binding to low ports early (without automatic detection, sure).
And, somewhat importantly, by not handling any data from them.

We could even defer the listen() calls if there's any value in doing so
(is there some? I can't think of anything).

Actually, I'm thinking of an easier way to break the circular
dependency between isolation steps and port configuration I outlined
earlier, without undoing your cleanups at all.

We currently need to process port configuration in a second step for
two reasons:

- we might bind ports in the detached namespace (-T, -U)

- one between IPv4 and IPv6 support could be administratively disabled
  (operationally, who cares, we'll just fail to bind if that's the
  case)

...but for init/host facing ports (now: "inbound"), we don't care about
the detached namespace, and we could simply call conf_ports() for -t
and -u in a second step after the main loop. Sure, if we continue like
this, we'll end up with O(n²) option handling, but right now it
doesn't look that bad to me.

I would give it a shot after I'm done reviewing your patchset (it
should also look clearer after that) and re-spinning my recent ones,
unless you see something wrong with it.

-- 
Stefano


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Alas for CAP_NET_BIND_SERVICE
  2022-10-12 10:47     ` Stefano Brivio
@ 2022-10-13  0:34       ` David Gibson
  2022-10-13  4:54         ` Stefano Brivio
  2022-10-13 10:50       ` Stefano Brivio
  1 sibling, 1 reply; 12+ messages in thread
From: David Gibson @ 2022-10-13  0:34 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 10788 bytes --]

On Wed, Oct 12, 2022 at 12:47:07PM +0200, Stefano Brivio wrote:
> On Wed, 12 Oct 2022 20:31:20 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Wed, Oct 12, 2022 at 07:54:32AM +0200, Stefano Brivio wrote:
> > > Hi David,
> > > 
> > > On Wed, 12 Oct 2022 13:55:02 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > Hi Stefano,
> > > > 
> > > > I've looked deeper into why giving passt/pasta CAP_NET_BIND_SERVICE
> > > > isn't working, and I'm afraid I have bad news.  
> > > 
> > > Thanks for the investigation.
> > >   
> > > > We lose CAP_NET_BIND_SERVICE in the initial namespace as soon as we
> > > > unshare() or setns() into the isolated namespace, and this appears to
> > > > be intended behaviour.  From user_namespaces(7), in the Capabilities section:
> > > > 
> > > >     The child process created by clone(2) with the CLONE_NEWUSER flag
> > > >     starts out with a complete set of capabilities in the new user
> > > >     namespace.  Likewise, a process that creates a new user namespace
> > > >     using unshare(2) or joins an existing user namespace using
> > > >     setns(2) gains a full set of capabilities in that namespace.  ***On
> > > >     the other hand, that process has no capabilities in the parent (in
> > > >     the case of clone(2)) or previous (in the case of unshare(2) and
> > > >     setns(2)) user namespace, even if the new namespace is created or
> > > >     joined by the root user (i.e., a process with user ID 0 in the
> > > >     root namespace).***
> > > > 
> > > > Emphasis (***) mine.  Basically, despite the way it's phrased in many
> > > > places, processes don't have an independent set of capabilities in
> > > > each userns, they only have a set of capabilities in their current
> > > > userns.  Any capabilities in other namespaces are implied in a pretty
> > > > much all or nothing way - if the process's UID (the real, init ns one)
> > > > owns the userns (or one of its ancestors), it gets all caps, otherwise
> > > > none.  cap_capable() has the specific logic in the kernel.  
> > > 
> > > Right, I missed this.
> > > 
> > > For a moment, I wondered about ambient capabilities, but those would
> > > only have an effect on an execve(), not on a clone(), I guess.  
> > 
> > Well, yes, but it doesn't really make any difference in any case.  All
> > ambient caps can do is be another way to get things into the permitted
> > set.  If that happens before the unshare() then we still lose them on
> > unshare().  If it happens after the unshare(), then it's just giving
> > us caps within the namespace, which isn't what we need.
> > 
> > > > So, using CAP_NET_BIND_SERVICE isn't compatible with isolating
> > > > ourselves in our own userns.  At the very least "auto" inbound
> > > > forwarding of low ports is pretty much off the cards.
> > > > 
> > > > For forwarding of specific low ports, we could delay our entry into
> > > > the new userns until we've set up the listening sockets, although it
> > > > does mean rolling back some of the simplification we gained from the
> > > > new-style userns handling.  
> > > 
> > > If I understand correctly, the biggest hurdle would be:
> > > 
> > > 1. we detach namespaces
> > > 
> > > 2. only then we can finalise any missing bit of addressing and routing
> > >    configuration (relevant for pasta)
> > > 
> > > 3. we bind ports as we parse configuration options, but we need
> > >    addressing to be fully configured for this
> > > 
> > > Referring to your latest patchset (which I'm still reviewing), I guess
> > > that implies a further split of isolate_user() (it's great to have a
> > > name for that, finally!), right?  
> > 
> > Uh.. something like that, I haven't looked at the details.  As we did
> > before my userns cleanup, we'd probably need to repeatedly enter the
> > userns as well as the netns to operate upon it, staying in the initial
> > userns, with our initial caps until sandbox()/isolate_prefork() or
> > thereabouts.
> > 
> > > > Or, we could abandon CAP_NET_BIND_SERVICE, and recommend the
> > > > net.ipv4.ip_unprivileged_port_start sysctl as the only way to handle
> > > > low ports in passt.  I do see a fair bit of logic in that approach:
> > > > passt has no meaningful way to limit what users do with the low ports
> > > > it allows them (indirectly) to bind to, giving passt
> > > > CAP_NET_BIND_SERVICE is pretty much equivalent to giving any process
> > > > which can invoke passt CAP_NET_BIND_SERVICE.  
> > > 
> > > I also see the general point, even though if file capabilities are
> > > used, I guess the equivalence doesn't really hold.  
> > 
> > Uh.. I don't follow.  It's exactly file capabilities which make this
> > equivalence.  If the passt binary has cap_net_bind_service=ep, you
> > can, as an unprivileged user, take any server, stick it in a namespace
> > and use pasta to effectively bind it to a low port in the init
> > namespace.
> 
> I actually meant with passt but... even for pasta, this depends on the
> decision of whether we drop capabilities for the spawned process. If we
> decide we don't, one day, then it's not equivalent.

No, from a security perspective it pretty much is still equivalent.
You can start your own namespace where you have full capabilities, run
the server in there, then use pasta to translate your
cap_net_bind_service within to cap_net_bind_service on the host.  Or
just run the server on a high port and tell pasta to connect a low
port to it.

> It would be equivalent if we just inherited capabilities from the
> parent as opposed to file capabilities -- that's what I meant.
> 
> I think it's a bit early to decide to drop those, though. Right now
> pasta isn't really used as a stand-alone tool (even though I
> actually do that, I find it very convenient also for totally unrelated
> purposes).
> 
> Should we see some use cases, then we could make a more informed
> decision.
> 
> > You can do the same thing with passt, though it's fiddlier
> > (you'd need a shim to translate qemu socket protocol before plugging
> > it into the server).
> 
> Oh, you mean running pasta plus a shim plus qemu? Because with passt I
> don't understand how you'd pass that kind of stuff over AF_UNIX...

No qemu necessary.  Make your bogus server, but instead of directly
listen()ing on a low port, have it connect to a Unix socket and wait
for SYN packets to a low port in qemu protocol.  Then use passt to
turn your Unix socket into a real listen()ing socket on the host.

> > > And perhaps we
> > > should at least recommend that as a preferred way.
> > > 
> > > What still perplexes me is: somebody gives passt CAP_NET_BIND_SERVICE,
> > > and due to something that's slightly more than an implementation detail,
> > > it won't be able to bind to low ports, which is the very reason for that
> > > capability. That sounds highly counterintuitive.  
> > 
> > I guess it is in the sense that the reason for this wasn't obvious to
> > either of us initially.  However it makes sense to me now that I've
> > looked at it.
> 
> No, no, in the sense that it makes sense to you and now to me as well,
> as you explained it to me. And yet it I find it hard to imagine that it
> would naturally make sense to users, in these terms:
> 
> - we offer a program that provides network connectivity to qemu
> 
> - it also includes port forwarding functionality: it binds to
>   configured ports and maps them to the guest
> 
> - it can't bind to any port: it doesn't run as root, and Linux prevents
>   non-root processes from binding to ports lower than 1024, which is a
>   well-known fact -- at least by default (lesser known fact)
> 
> - somewhat in between on the scale of general knowledge, lies
>   CAP_NET_BIND_SERVICE: it allows non-root processes to bind to low
>   ports
> 
> ...but not passt. For very valid reasons, indeed, but those will need
> to be explained over and over again.

Yeah, I guess so.

> > We use a userns for two reasons: 1) to control a netns
> > and 2) to isolate ourselves for security.  We use the same path and
> > the same userns for both, but they're logically different reasons.
> > 
> > If (1) was the only reason for the userns we could handle this pretty
> > easily: we'd only enter the userns transiently when we need to
> > manipulate it, just like we do with the netns.  That way the main
> > thread would retain CAP_NET_BIND_SERVICE in the original ns.
> > 
> > For (2), we're specifically choosing to isolate ourselves: that is to
> > give up privilege from our original position.  It's not surprising
> > there's some degree of granularity to how we can do that, and the deal
> > is that we can't give up our membership in the original userns without
> > also giving up our enhanced capabilities in that userns.
> > 
> > I don't think giving up (2) is a price worth paying for this.
> 
> Absolutely, I agree, I wouldn't either.
> 
> However, he could give users the choice without compromising (2) at
> all, by binding to low ports early (without automatic detection, sure).
> And, somewhat importantly, by not handling any data from them.
> 
> We could even defer the listen() calls if there's any value in doing so
> (is there some? I can't think of anything).
> 
> Actually, I'm thinking of an easier way to break the circular
> dependency between isolation steps and port configuration I outlined
> earlier, without undoing your cleanups at all.
> 
> We currently need to process port configuration in a second step for
> two reasons:
> 
> - we might bind ports in the detached namespace (-T, -U)
> 
> - one between IPv4 and IPv6 support could be administratively disabled
>   (operationally, who cares, we'll just fail to bind if that's the
>   case)
> 
> ...but for init/host facing ports (now: "inbound"), we don't care about
> the detached namespace, and we could simply call conf_ports() for -t
> and -u in a second step after the main loop. Sure, if we continue like
> this, we'll end up with O(n²) option handling, but right now it
> doesn't look that bad to me.

Ah, yes, that could work.  Of course, this does mean moving some
relatively complex code out of at least one layer of isolation, which
carries some risks.

> I would give it a shot after I'm done reviewing your patchset (it
> should also look clearer after that) and re-spinning my recent ones,
> unless you see something wrong with it.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Alas for CAP_NET_BIND_SERVICE
  2022-10-13  0:34       ` David Gibson
@ 2022-10-13  4:54         ` Stefano Brivio
  2022-10-13  5:15           ` Stefano Brivio
  2022-10-14  2:54           ` David Gibson
  0 siblings, 2 replies; 12+ messages in thread
From: Stefano Brivio @ 2022-10-13  4:54 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Thu, 13 Oct 2022 11:34:04 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Oct 12, 2022 at 12:47:07PM +0200, Stefano Brivio wrote:
> > On Wed, 12 Oct 2022 20:31:20 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Wed, Oct 12, 2022 at 07:54:32AM +0200, Stefano Brivio wrote:  
> > > > Hi David,
> > > > 
> > > > On Wed, 12 Oct 2022 13:55:02 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >     
> > > > > Hi Stefano,
> > > > > 
> > > > > I've looked deeper into why giving passt/pasta CAP_NET_BIND_SERVICE
> > > > > isn't working, and I'm afraid I have bad news.    
> > > > 
> > > > Thanks for the investigation.
> > > >     
> > > > > We lose CAP_NET_BIND_SERVICE in the initial namespace as soon as we
> > > > > unshare() or setns() into the isolated namespace, and this appears to
> > > > > be intended behaviour.  From user_namespaces(7), in the Capabilities section:
> > > > > 
> > > > >     The child process created by clone(2) with the CLONE_NEWUSER flag
> > > > >     starts out with a complete set of capabilities in the new user
> > > > >     namespace.  Likewise, a process that creates a new user namespace
> > > > >     using unshare(2) or joins an existing user namespace using
> > > > >     setns(2) gains a full set of capabilities in that namespace.  ***On
> > > > >     the other hand, that process has no capabilities in the parent (in
> > > > >     the case of clone(2)) or previous (in the case of unshare(2) and
> > > > >     setns(2)) user namespace, even if the new namespace is created or
> > > > >     joined by the root user (i.e., a process with user ID 0 in the
> > > > >     root namespace).***
> > > > > 
> > > > > Emphasis (***) mine.  Basically, despite the way it's phrased in many
> > > > > places, processes don't have an independent set of capabilities in
> > > > > each userns, they only have a set of capabilities in their current
> > > > > userns.  Any capabilities in other namespaces are implied in a pretty
> > > > > much all or nothing way - if the process's UID (the real, init ns one)
> > > > > owns the userns (or one of its ancestors), it gets all caps, otherwise
> > > > > none.  cap_capable() has the specific logic in the kernel.    
> > > > 
> > > > Right, I missed this.
> > > > 
> > > > For a moment, I wondered about ambient capabilities, but those would
> > > > only have an effect on an execve(), not on a clone(), I guess.    
> > > 
> > > Well, yes, but it doesn't really make any difference in any case.  All
> > > ambient caps can do is be another way to get things into the permitted
> > > set.  If that happens before the unshare() then we still lose them on
> > > unshare().  If it happens after the unshare(), then it's just giving
> > > us caps within the namespace, which isn't what we need.
> > >   
> > > > > So, using CAP_NET_BIND_SERVICE isn't compatible with isolating
> > > > > ourselves in our own userns.  At the very least "auto" inbound
> > > > > forwarding of low ports is pretty much off the cards.
> > > > > 
> > > > > For forwarding of specific low ports, we could delay our entry into
> > > > > the new userns until we've set up the listening sockets, although it
> > > > > does mean rolling back some of the simplification we gained from the
> > > > > new-style userns handling.    
> > > > 
> > > > If I understand correctly, the biggest hurdle would be:
> > > > 
> > > > 1. we detach namespaces
> > > > 
> > > > 2. only then we can finalise any missing bit of addressing and routing
> > > >    configuration (relevant for pasta)
> > > > 
> > > > 3. we bind ports as we parse configuration options, but we need
> > > >    addressing to be fully configured for this
> > > > 
> > > > Referring to your latest patchset (which I'm still reviewing), I guess
> > > > that implies a further split of isolate_user() (it's great to have a
> > > > name for that, finally!), right?    
> > > 
> > > Uh.. something like that, I haven't looked at the details.  As we did
> > > before my userns cleanup, we'd probably need to repeatedly enter the
> > > userns as well as the netns to operate upon it, staying in the initial
> > > userns, with our initial caps until sandbox()/isolate_prefork() or
> > > thereabouts.
> > >   
> > > > > Or, we could abandon CAP_NET_BIND_SERVICE, and recommend the
> > > > > net.ipv4.ip_unprivileged_port_start sysctl as the only way to handle
> > > > > low ports in passt.  I do see a fair bit of logic in that approach:
> > > > > passt has no meaningful way to limit what users do with the low ports
> > > > > it allows them (indirectly) to bind to, giving passt
> > > > > CAP_NET_BIND_SERVICE is pretty much equivalent to giving any process
> > > > > which can invoke passt CAP_NET_BIND_SERVICE.    
> > > > 
> > > > I also see the general point, even though if file capabilities are
> > > > used, I guess the equivalence doesn't really hold.    
> > > 
> > > Uh.. I don't follow.  It's exactly file capabilities which make this
> > > equivalence.  If the passt binary has cap_net_bind_service=ep, you
> > > can, as an unprivileged user, take any server, stick it in a namespace
> > > and use pasta to effectively bind it to a low port in the init
> > > namespace.  
> > 
> > I actually meant with passt but... even for pasta, this depends on the
> > decision of whether we drop capabilities for the spawned process. If we
> > decide we don't, one day, then it's not equivalent.  
> 
> No, from a security perspective it pretty much is still equivalent.
> You can start your own namespace where you have full capabilities, run
> the server in there, then use pasta to translate your
> cap_net_bind_service within to cap_net_bind_service on the host.  Or
> just run the server on a high port and tell pasta to connect a low
> port to it.

Ah, sorry, now I understand what you mean here, and...

> > It would be equivalent if we just inherited capabilities from the
> > parent as opposed to file capabilities -- that's what I meant.
> > 
> > I think it's a bit early to decide to drop those, though. Right now
> > pasta isn't really used as a stand-alone tool (even though I
> > actually do that, I find it very convenient also for totally unrelated
> > purposes).
> > 
> > Should we see some use cases, then we could make a more informed
> > decision.
> >   
> > > You can do the same thing with passt, though it's fiddlier
> > > (you'd need a shim to translate qemu socket protocol before plugging
> > > it into the server).  
> > 
> > Oh, you mean running pasta plus a shim plus qemu? Because with passt I
> > don't understand how you'd pass that kind of stuff over AF_UNIX...  
> 
> No qemu necessary.  Make your bogus server, but instead of directly
> listen()ing on a low port, have it connect to a Unix socket and wait
> for SYN packets to a low port in qemu protocol.  Then use passt to
> turn your Unix socket into a real listen()ing socket on the host.

...here. But the environment I had in mind was a rather controller one,
with KSM policies that would normally prevent you from even having your
bogus server.

Well, that would be the case for KubeVirt at least: three binaries and
not much margin to play tricks.

Unless, of course, you manage to do arbitrary code execution in qemu,
which... would be actually one of the few cases where we would prefer
to have CAP_NET_BIND_SERVICE granted to passt instead of allowing
everybody to bind low ports.

Still, it depends: you would have to reuse that qemu process, because
it's the only one that's connected to passt (which only applies with
minimally defined KSM policies, sure).

In that environment, you couldn't easily turn libvirtd into a DNS
resolver.

In a general case, I see your point, but in specific cases it actually
depends on what the environment allows.

> > > > And perhaps we
> > > > should at least recommend that as a preferred way.
> > > > 
> > > > What still perplexes me is: somebody gives passt CAP_NET_BIND_SERVICE,
> > > > and due to something that's slightly more than an implementation detail,
> > > > it won't be able to bind to low ports, which is the very reason for that
> > > > capability. That sounds highly counterintuitive.    
> > > 
> > > I guess it is in the sense that the reason for this wasn't obvious to
> > > either of us initially.  However it makes sense to me now that I've
> > > looked at it.  
> > 
> > No, no, in the sense that it makes sense to you and now to me as well,
> > as you explained it to me. And yet it I find it hard to imagine that it
> > would naturally make sense to users, in these terms:
> > 
> > - we offer a program that provides network connectivity to qemu
> > 
> > - it also includes port forwarding functionality: it binds to
> >   configured ports and maps them to the guest
> > 
> > - it can't bind to any port: it doesn't run as root, and Linux prevents
> >   non-root processes from binding to ports lower than 1024, which is a
> >   well-known fact -- at least by default (lesser known fact)
> > 
> > - somewhat in between on the scale of general knowledge, lies
> >   CAP_NET_BIND_SERVICE: it allows non-root processes to bind to low
> >   ports
> > 
> > ...but not passt. For very valid reasons, indeed, but those will need
> > to be explained over and over again.  
> 
> Yeah, I guess so.
> 
> > > We use a userns for two reasons: 1) to control a netns
> > > and 2) to isolate ourselves for security.  We use the same path and
> > > the same userns for both, but they're logically different reasons.
> > > 
> > > If (1) was the only reason for the userns we could handle this pretty
> > > easily: we'd only enter the userns transiently when we need to
> > > manipulate it, just like we do with the netns.  That way the main
> > > thread would retain CAP_NET_BIND_SERVICE in the original ns.
> > > 
> > > For (2), we're specifically choosing to isolate ourselves: that is to
> > > give up privilege from our original position.  It's not surprising
> > > there's some degree of granularity to how we can do that, and the deal
> > > is that we can't give up our membership in the original userns without
> > > also giving up our enhanced capabilities in that userns.
> > > 
> > > I don't think giving up (2) is a price worth paying for this.  
> > 
> > Absolutely, I agree, I wouldn't either.
> > 
> > However, he could give users the choice without compromising (2) at
> > all, by binding to low ports early (without automatic detection, sure).
> > And, somewhat importantly, by not handling any data from them.
> > 
> > We could even defer the listen() calls if there's any value in doing so
> > (is there some? I can't think of anything).
> > 
> > Actually, I'm thinking of an easier way to break the circular
> > dependency between isolation steps and port configuration I outlined
> > earlier, without undoing your cleanups at all.
> > 
> > We currently need to process port configuration in a second step for
> > two reasons:
> > 
> > - we might bind ports in the detached namespace (-T, -U)
> > 
> > - one between IPv4 and IPv6 support could be administratively disabled
> >   (operationally, who cares, we'll just fail to bind if that's the
> >   case)
> > 
> > ...but for init/host facing ports (now: "inbound"), we don't care about
> > the detached namespace, and we could simply call conf_ports() for -t
> > and -u in a second step after the main loop. Sure, if we continue like
> > this, we'll end up with O(n²) option handling, but right now it
> > doesn't look that bad to me.  
> 
> Ah, yes, that could work.  Of course, this does mean moving some
> relatively complex code out of at least one layer of isolation, which
> carries some risks.

Some of it, yes, but to me it doesn't look much worse than conf() in
comparison. I still have to think about downsides of having listen()
calls there, though. On the other hand, that would mean we can also get
rid of listen() later with seccomp, for passt only.

Anyway, I drafted it... but this happens. I dropped pasta symlinks for
simplicity:

--

# setcap 'cap_net_bind_service=+ep' /home/sbrivio/passt/pasta.avx2
# getcap /home/sbrivio/passt/pasta.avx2
/home/sbrivio/passt/pasta.avx2 cap_net_bind_service=ep

--

$ strace -ebind,capget,capset,readlink -f ./pasta.avx2 -f -t 81 1763943
readlink("/proc/self/exe", "/home/sbrivio/passt/pasta.avx2", 4095) = 30
capget({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=0, inheritable=0}) = 0
capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=0, inheritable=0}) = 0
bind(5, {sa_family=AF_INET, sin_port=htons(81), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)
bind(5, {sa_family=AF_INET, sin_port=htons(81), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EACCES (Permission denied)
bind(5, {sa_family=AF_INET6, sin6_port=htons(81), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, 28) = -1 EACCES (Permission denied)
bind(5, {sa_family=AF_INET6, sin6_port=htons(81), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = -1 EACCES (Permission denied)

--

no fancy filesystem attributes, just a very unassuming ext4. I must be
missing something very obvious.

-- 
Stefano


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Alas for CAP_NET_BIND_SERVICE
  2022-10-13  4:54         ` Stefano Brivio
@ 2022-10-13  5:15           ` Stefano Brivio
  2022-10-14  2:54           ` David Gibson
  1 sibling, 0 replies; 12+ messages in thread
From: Stefano Brivio @ 2022-10-13  5:15 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Thu, 13 Oct 2022 06:54:26 +0200
Stefano Brivio <sbrivio@redhat.com> wrote:

> Anyway, I drafted it... but this happens. I dropped pasta symlinks for
> simplicity:
> 
> --
> 
> # setcap 'cap_net_bind_service=+ep' /home/sbrivio/passt/pasta.avx2
> # getcap /home/sbrivio/passt/pasta.avx2
> /home/sbrivio/passt/pasta.avx2 cap_net_bind_service=ep
> 
> --
> 
> $ strace -ebind,capget,capset,readlink -f ./pasta.avx2 -f -t 81 1763943
> readlink("/proc/self/exe", "/home/sbrivio/passt/pasta.avx2", 4095) = 30
> capget({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=0, inheritable=0}) = 0
> capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=0, inheritable=0}) = 0
> bind(5, {sa_family=AF_INET, sin_port=htons(81), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EACCES (Permission denied)
> bind(5, {sa_family=AF_INET, sin_port=htons(81), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EACCES (Permission denied)
> bind(5, {sa_family=AF_INET6, sin6_port=htons(81), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::", &sin6_addr), sin6_scope_id=0}, 28) = -1 EACCES (Permission denied)
> bind(5, {sa_family=AF_INET6, sin6_port=htons(81), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = -1 EACCES (Permission denied)
> 
> --
> 
> no fancy filesystem attributes, just a very unassuming ext4. I must be
> missing something very obvious.

...oops, never mind, it's strace, of course -- that's even mentioned in
man 7 capabilities.

-- 
Stefano


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Alas for CAP_NET_BIND_SERVICE
  2022-10-12 10:47     ` Stefano Brivio
  2022-10-13  0:34       ` David Gibson
@ 2022-10-13 10:50       ` Stefano Brivio
  2022-10-14  2:56         ` David Gibson
  1 sibling, 1 reply; 12+ messages in thread
From: Stefano Brivio @ 2022-10-13 10:50 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Wed, 12 Oct 2022 12:47:07 +0200
Stefano Brivio <sbrivio@redhat.com> wrote:

> [...]
>
> We currently need to process port configuration in a second step for
> two reasons:
> 
> - we might bind ports in the detached namespace (-T, -U)
> 
> - one between IPv4 and IPv6 support could be administratively disabled
>   (operationally, who cares, we'll just fail to bind if that's the
>   case)
> 
> ...but for init/host facing ports (now: "inbound"), we don't care about
> the detached namespace, and we could simply call conf_ports() for -t
> and -u in a second step after the main loop. Sure, if we continue like
> this, we'll end up with O(n²) option handling, but right now it
> doesn't look that bad to me.
> 
> I would give it a shot after I'm done reviewing your patchset (it
> should also look clearer after that) and re-spinning my recent ones,
> unless you see something wrong with it.

So, I have a draft (minus man page changes), a bit more involved than I
wanted but still not adding much.

It's based on your patchset, so I can't really test it with Podman
because of that new issue I'm facing with filesystem-bound namespaces,
but it passes our tests, and it works with low ports too.

I would try to figure out that other issue before posting it properly,
here it comes anyway:

diff --git a/conf.c b/conf.c
index a22ef48..e1f42f1 100644
--- a/conf.c
+++ b/conf.c
@@ -1530,6 +1530,11 @@ void conf(struct ctx *c, int argc, char **argv)
 		}
 	} while (name != -1);
 
+	if (v4_only && v6_only) {
+		err("Options ipv4-only and ipv6-only are mutually exclusive");
+		usage(argv[0]);
+	}
+
 	ret = conf_ugid(runas, &uid, &gid);
 	if (ret)
 		usage(argv[0]);
@@ -1539,6 +1544,30 @@ void conf(struct ctx *c, int argc, char **argv)
 			     logfile, logsize);
 	}
 
+	nl_sock_init(c, false);
+	if (!v6_only)
+		c->ifi4 = conf_ip4(ifi, &c->ip4, c->mac);
+	if (!v4_only)
+		c->ifi6 = conf_ip6(ifi, &c->ip6, c->mac);
+	if (!c->ifi4 && !c->ifi6) {
+		err("External interface not usable");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Inbound port options can be parsed now (after IPv4/IPv6 settings) */
+	optind = 1;
+	do {
+		struct port_fwd *fwd;
+
+		name = getopt_long(argc, argv, optstring, options, NULL);
+
+		if ((name == 't' && (fwd = &c->tcp.fwd_in)) ||
+		    (name == 'u' && (fwd = &c->udp.fwd_in.f))) {
+			if (!optarg || conf_ports(c, name, optarg, fwd))
+				usage(argv[0]);
+		}
+	} while (name != -1);
+
 	if (c->mode == MODE_PASTA) {
 		if (conf_pasta_ns(&netns_only, userns, netns,
 				  optind, argc, argv) < 0)
@@ -1561,50 +1590,20 @@ void conf(struct ctx *c, int argc, char **argv)
 		}
 	}
 
-	if (nl_sock_init(c)) {
-		err("Failed to get netlink socket");
-		exit(EXIT_FAILURE);
-	}
-
-	if (v4_only && v6_only) {
-		err("Options ipv4-only and ipv6-only are mutually exclusive");
-		usage(argv[0]);
-	}
-	if (!v6_only)
-		c->ifi4 = conf_ip4(ifi, &c->ip4, c->mac);
-	if (!v4_only)
-		c->ifi6 = conf_ip6(ifi, &c->ip6, c->mac);
-	if (!c->ifi4 && !c->ifi6) {
-		err("External interface not usable");
-		exit(EXIT_FAILURE);
-	}
+	if (c->mode == MODE_PASTA)
+		nl_sock_init(c, true);
 
-	/* Now we can process port configuration options */
+	/* ...and outbound port options now that namespaces are set up. */
 	optind = 1;
 	do {
-		struct port_fwd *fwd = NULL;
+		struct port_fwd *fwd;
 
 		name = getopt_long(argc, argv, optstring, options, NULL);
-		switch (name) {
-		case 't':
-		case 'u':
-		case 'T':
-		case 'U':
-			if (name == 't')
-				fwd = &c->tcp.fwd_in;
-			else if (name == 'T')
-				fwd = &c->tcp.fwd_out;
-			else if (name == 'u')
-				fwd = &c->udp.fwd_in.f;
-			else if (name == 'U')
-				fwd = &c->udp.fwd_out.f;
 
+		if ((name == 'T' && (fwd = &c->tcp.fwd_out)) ||
+		    (name == 'U' && (fwd = &c->udp.fwd_out.f))) {
 			if (!optarg || conf_ports(c, name, optarg, fwd))
 				usage(argv[0]);
-
-			break;
-		default:
-			break;
 		}
 	} while (name != -1);
 
diff --git a/netlink.c b/netlink.c
index 6e5a96b..3ee4d42 100644
--- a/netlink.c
+++ b/netlink.c
@@ -19,6 +19,7 @@
 #include <sys/types.h>
 #include <limits.h>
 #include <stdlib.h>
+#include <stdbool.h>
 #include <stdint.h>
 #include <unistd.h>
 #include <arpa/inet.h>
@@ -71,25 +72,28 @@ ns:
 }
 
 /**
- * nl_sock_init() - Call nl_sock_init_do() and check for failures
+ * nl_sock_init() - Call nl_sock_init_do(), won't return on failure
  * @c:		Execution context
- *
- * Return: -EIO if sockets couldn't be set up, 0 otherwise
+ * @ns:		Get socket in namespace, not in init
  */
-int nl_sock_init(const struct ctx *c)
+void nl_sock_init(const struct ctx *c, bool ns)
 {
-	if (c->mode == MODE_PASTA) {
+	if (c->mode == MODE_PASTA && ns) {
 		NS_CALL(nl_sock_init_do, c);
 		if (nl_sock_ns == -1)
-			return -EIO;
+			goto fail;
 	} else {
 		nl_sock_init_do(NULL);
 	}
 
 	if (nl_sock == -1)
-		return -EIO;
+		goto fail;
 
-	return 0;
+	return;
+
+fail:
+	err("Failed to get netlink socket");
+	exit(EXIT_FAILURE);
 }
 
 /**
diff --git a/netlink.h b/netlink.h
index 5ce5037..3c991cc 100644
--- a/netlink.h
+++ b/netlink.h
@@ -6,7 +6,7 @@
 #ifndef NETLINK_H
 #define NETLINK_H
 
-int nl_sock_init(const struct ctx *c);
+void nl_sock_init(const struct ctx *c, bool ns);
 unsigned int nl_get_ext_if(sa_family_t af);
 void nl_route(int ns, unsigned int ifi, sa_family_t af, void *gw);
 void nl_addr(int ns, unsigned int ifi, sa_family_t af,
diff --git a/tap.c b/tap.c
index 77e513c..8b6d9bc 100644
--- a/tap.c
+++ b/tap.c
@@ -30,6 +30,7 @@
 #include <sys/stat.h>
 #include <fcntl.h>
 #include <sys/uio.h>
+#include <stdbool.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <netinet/ip.h>

-- 
Stefano


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Alas for CAP_NET_BIND_SERVICE
  2022-10-13  4:54         ` Stefano Brivio
  2022-10-13  5:15           ` Stefano Brivio
@ 2022-10-14  2:54           ` David Gibson
  2022-10-16  9:46             ` Stefano Brivio
  1 sibling, 1 reply; 12+ messages in thread
From: David Gibson @ 2022-10-14  2:54 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 13726 bytes --]

On Thu, Oct 13, 2022 at 06:54:26AM +0200, Stefano Brivio wrote:
> On Thu, 13 Oct 2022 11:34:04 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Wed, Oct 12, 2022 at 12:47:07PM +0200, Stefano Brivio wrote:
> > > On Wed, 12 Oct 2022 20:31:20 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > On Wed, Oct 12, 2022 at 07:54:32AM +0200, Stefano Brivio wrote:  
> > > > > Hi David,
> > > > > 
> > > > > On Wed, 12 Oct 2022 13:55:02 +1100
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >     
> > > > > > Hi Stefano,
> > > > > > 
> > > > > > I've looked deeper into why giving passt/pasta CAP_NET_BIND_SERVICE
> > > > > > isn't working, and I'm afraid I have bad news.    
> > > > > 
> > > > > Thanks for the investigation.
> > > > >     
> > > > > > We lose CAP_NET_BIND_SERVICE in the initial namespace as soon as we
> > > > > > unshare() or setns() into the isolated namespace, and this appears to
> > > > > > be intended behaviour.  From user_namespaces(7), in the Capabilities section:
> > > > > > 
> > > > > >     The child process created by clone(2) with the CLONE_NEWUSER flag
> > > > > >     starts out with a complete set of capabilities in the new user
> > > > > >     namespace.  Likewise, a process that creates a new user namespace
> > > > > >     using unshare(2) or joins an existing user namespace using
> > > > > >     setns(2) gains a full set of capabilities in that namespace.  ***On
> > > > > >     the other hand, that process has no capabilities in the parent (in
> > > > > >     the case of clone(2)) or previous (in the case of unshare(2) and
> > > > > >     setns(2)) user namespace, even if the new namespace is created or
> > > > > >     joined by the root user (i.e., a process with user ID 0 in the
> > > > > >     root namespace).***
> > > > > > 
> > > > > > Emphasis (***) mine.  Basically, despite the way it's phrased in many
> > > > > > places, processes don't have an independent set of capabilities in
> > > > > > each userns, they only have a set of capabilities in their current
> > > > > > userns.  Any capabilities in other namespaces are implied in a pretty
> > > > > > much all or nothing way - if the process's UID (the real, init ns one)
> > > > > > owns the userns (or one of its ancestors), it gets all caps, otherwise
> > > > > > none.  cap_capable() has the specific logic in the kernel.    
> > > > > 
> > > > > Right, I missed this.
> > > > > 
> > > > > For a moment, I wondered about ambient capabilities, but those would
> > > > > only have an effect on an execve(), not on a clone(), I guess.    
> > > > 
> > > > Well, yes, but it doesn't really make any difference in any case.  All
> > > > ambient caps can do is be another way to get things into the permitted
> > > > set.  If that happens before the unshare() then we still lose them on
> > > > unshare().  If it happens after the unshare(), then it's just giving
> > > > us caps within the namespace, which isn't what we need.
> > > >   
> > > > > > So, using CAP_NET_BIND_SERVICE isn't compatible with isolating
> > > > > > ourselves in our own userns.  At the very least "auto" inbound
> > > > > > forwarding of low ports is pretty much off the cards.
> > > > > > 
> > > > > > For forwarding of specific low ports, we could delay our entry into
> > > > > > the new userns until we've set up the listening sockets, although it
> > > > > > does mean rolling back some of the simplification we gained from the
> > > > > > new-style userns handling.    
> > > > > 
> > > > > If I understand correctly, the biggest hurdle would be:
> > > > > 
> > > > > 1. we detach namespaces
> > > > > 
> > > > > 2. only then we can finalise any missing bit of addressing and routing
> > > > >    configuration (relevant for pasta)
> > > > > 
> > > > > 3. we bind ports as we parse configuration options, but we need
> > > > >    addressing to be fully configured for this
> > > > > 
> > > > > Referring to your latest patchset (which I'm still reviewing), I guess
> > > > > that implies a further split of isolate_user() (it's great to have a
> > > > > name for that, finally!), right?    
> > > > 
> > > > Uh.. something like that, I haven't looked at the details.  As we did
> > > > before my userns cleanup, we'd probably need to repeatedly enter the
> > > > userns as well as the netns to operate upon it, staying in the initial
> > > > userns, with our initial caps until sandbox()/isolate_prefork() or
> > > > thereabouts.
> > > >   
> > > > > > Or, we could abandon CAP_NET_BIND_SERVICE, and recommend the
> > > > > > net.ipv4.ip_unprivileged_port_start sysctl as the only way to handle
> > > > > > low ports in passt.  I do see a fair bit of logic in that approach:
> > > > > > passt has no meaningful way to limit what users do with the low ports
> > > > > > it allows them (indirectly) to bind to, giving passt
> > > > > > CAP_NET_BIND_SERVICE is pretty much equivalent to giving any process
> > > > > > which can invoke passt CAP_NET_BIND_SERVICE.    
> > > > > 
> > > > > I also see the general point, even though if file capabilities are
> > > > > used, I guess the equivalence doesn't really hold.    
> > > > 
> > > > Uh.. I don't follow.  It's exactly file capabilities which make this
> > > > equivalence.  If the passt binary has cap_net_bind_service=ep, you
> > > > can, as an unprivileged user, take any server, stick it in a namespace
> > > > and use pasta to effectively bind it to a low port in the init
> > > > namespace.  
> > > 
> > > I actually meant with passt but... even for pasta, this depends on the
> > > decision of whether we drop capabilities for the spawned process. If we
> > > decide we don't, one day, then it's not equivalent.  
> > 
> > No, from a security perspective it pretty much is still equivalent.
> > You can start your own namespace where you have full capabilities, run
> > the server in there, then use pasta to translate your
> > cap_net_bind_service within to cap_net_bind_service on the host.  Or
> > just run the server on a high port and tell pasta to connect a low
> > port to it.
> 
> Ah, sorry, now I understand what you mean here, and...
> 
> > > It would be equivalent if we just inherited capabilities from the
> > > parent as opposed to file capabilities -- that's what I meant.
> > > 
> > > I think it's a bit early to decide to drop those, though. Right now
> > > pasta isn't really used as a stand-alone tool (even though I
> > > actually do that, I find it very convenient also for totally unrelated
> > > purposes).
> > > 
> > > Should we see some use cases, then we could make a more informed
> > > decision.
> > >   
> > > > You can do the same thing with passt, though it's fiddlier
> > > > (you'd need a shim to translate qemu socket protocol before plugging
> > > > it into the server).  
> > > 
> > > Oh, you mean running pasta plus a shim plus qemu? Because with passt I
> > > don't understand how you'd pass that kind of stuff over AF_UNIX...  
> > 
> > No qemu necessary.  Make your bogus server, but instead of directly
> > listen()ing on a low port, have it connect to a Unix socket and wait
> > for SYN packets to a low port in qemu protocol.  Then use passt to
> > turn your Unix socket into a real listen()ing socket on the host.
> 
> ...here. But the environment I had in mind was a rather controller one,
> with KSM policies that would normally prevent you from even having your
> bogus server.
> 
> Well, that would be the case for KubeVirt at least: three binaries and
> not much margin to play tricks.

Ok, but even then using the file capability rather than the sysctl
only makes a difference if the attacker:
 * CAN escape confinement enough to make socket calls in the netns
   where we would be setting the sysctl
 * CAN'T escape confinment enough to exec() passt

Which seems like it would be a very narrow category to me.

> Unless, of course, you manage to do arbitrary code execution in qemu,
> which... would be actually one of the few cases where we would prefer
> to have CAP_NET_BIND_SERVICE granted to passt instead of allowing
> everybody to bind low ports.

In that situation, I think it's still equivalent unless qemu was be
prevented from (re) exec()ing passt... but the obvious mechanism to do
that (put it in a different mount namespace from qemu) is fairly
easily extended to working with the sysctl as well (put qemu in a
separate netns so it can't access any network except via the passt
socket it's been fiven).

> Still, it depends: you would have to reuse that qemu process, because
> it's the only one that's connected to passt (which only applies with
> minimally defined KSM policies, sure).
> 
> In that environment, you couldn't easily turn libvirtd into a DNS
> resolver.
> 
> In a general case, I see your point, but in specific cases it actually
> depends on what the environment allows.
> 
> > > > > And perhaps we
> > > > > should at least recommend that as a preferred way.
> > > > > 
> > > > > What still perplexes me is: somebody gives passt CAP_NET_BIND_SERVICE,
> > > > > and due to something that's slightly more than an implementation detail,
> > > > > it won't be able to bind to low ports, which is the very reason for that
> > > > > capability. That sounds highly counterintuitive.    
> > > > 
> > > > I guess it is in the sense that the reason for this wasn't obvious to
> > > > either of us initially.  However it makes sense to me now that I've
> > > > looked at it.  
> > > 
> > > No, no, in the sense that it makes sense to you and now to me as well,
> > > as you explained it to me. And yet it I find it hard to imagine that it
> > > would naturally make sense to users, in these terms:
> > > 
> > > - we offer a program that provides network connectivity to qemu
> > > 
> > > - it also includes port forwarding functionality: it binds to
> > >   configured ports and maps them to the guest
> > > 
> > > - it can't bind to any port: it doesn't run as root, and Linux prevents
> > >   non-root processes from binding to ports lower than 1024, which is a
> > >   well-known fact -- at least by default (lesser known fact)
> > > 
> > > - somewhat in between on the scale of general knowledge, lies
> > >   CAP_NET_BIND_SERVICE: it allows non-root processes to bind to low
> > >   ports
> > > 
> > > ...but not passt. For very valid reasons, indeed, but those will need
> > > to be explained over and over again.  
> > 
> > Yeah, I guess so.
> > 
> > > > We use a userns for two reasons: 1) to control a netns
> > > > and 2) to isolate ourselves for security.  We use the same path and
> > > > the same userns for both, but they're logically different reasons.
> > > > 
> > > > If (1) was the only reason for the userns we could handle this pretty
> > > > easily: we'd only enter the userns transiently when we need to
> > > > manipulate it, just like we do with the netns.  That way the main
> > > > thread would retain CAP_NET_BIND_SERVICE in the original ns.
> > > > 
> > > > For (2), we're specifically choosing to isolate ourselves: that is to
> > > > give up privilege from our original position.  It's not surprising
> > > > there's some degree of granularity to how we can do that, and the deal
> > > > is that we can't give up our membership in the original userns without
> > > > also giving up our enhanced capabilities in that userns.
> > > > 
> > > > I don't think giving up (2) is a price worth paying for this.  
> > > 
> > > Absolutely, I agree, I wouldn't either.
> > > 
> > > However, he could give users the choice without compromising (2) at
> > > all, by binding to low ports early (without automatic detection, sure).
> > > And, somewhat importantly, by not handling any data from them.
> > > 
> > > We could even defer the listen() calls if there's any value in doing so
> > > (is there some? I can't think of anything).
> > > 
> > > Actually, I'm thinking of an easier way to break the circular
> > > dependency between isolation steps and port configuration I outlined
> > > earlier, without undoing your cleanups at all.
> > > 
> > > We currently need to process port configuration in a second step for
> > > two reasons:
> > > 
> > > - we might bind ports in the detached namespace (-T, -U)
> > > 
> > > - one between IPv4 and IPv6 support could be administratively disabled
> > >   (operationally, who cares, we'll just fail to bind if that's the
> > >   case)
> > > 
> > > ...but for init/host facing ports (now: "inbound"), we don't care about
> > > the detached namespace, and we could simply call conf_ports() for -t
> > > and -u in a second step after the main loop. Sure, if we continue like
> > > this, we'll end up with O(n²) option handling, but right now it
> > > doesn't look that bad to me.  
> > 
> > Ah, yes, that could work.  Of course, this does mean moving some
> > relatively complex code out of at least one layer of isolation, which
> > carries some risks.
> 
> Some of it, yes, but to me it doesn't look much worse than conf() in
> comparison. I still have to think about downsides of having listen()
> calls there, though. On the other hand, that would mean we can also get
> rid of listen() later with seccomp, for passt only.
> 
> Anyway, I drafted it... but this happens. I dropped pasta symlinks for
> simplicity:
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Alas for CAP_NET_BIND_SERVICE
  2022-10-13 10:50       ` Stefano Brivio
@ 2022-10-14  2:56         ` David Gibson
  0 siblings, 0 replies; 12+ messages in thread
From: David Gibson @ 2022-10-14  2:56 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 6503 bytes --]

On Thu, Oct 13, 2022 at 12:50:17PM +0200, Stefano Brivio wrote:
> On Wed, 12 Oct 2022 12:47:07 +0200
> Stefano Brivio <sbrivio@redhat.com> wrote:
> 
> > [...]
> >
> > We currently need to process port configuration in a second step for
> > two reasons:
> > 
> > - we might bind ports in the detached namespace (-T, -U)
> > 
> > - one between IPv4 and IPv6 support could be administratively disabled
> >   (operationally, who cares, we'll just fail to bind if that's the
> >   case)
> > 
> > ...but for init/host facing ports (now: "inbound"), we don't care about
> > the detached namespace, and we could simply call conf_ports() for -t
> > and -u in a second step after the main loop. Sure, if we continue like
> > this, we'll end up with O(n²) option handling, but right now it
> > doesn't look that bad to me.
> > 
> > I would give it a shot after I'm done reviewing your patchset (it
> > should also look clearer after that) and re-spinning my recent ones,
> > unless you see something wrong with it.
> 
> So, I have a draft (minus man page changes), a bit more involved than I
> wanted but still not adding much.
> 
> It's based on your patchset, so I can't really test it with Podman
> because of that new issue I'm facing with filesystem-bound namespaces,
> but it passes our tests, and it works with low ports too.
> 
> I would try to figure out that other issue before posting it properly,
> here it comes anyway:

Looks reasonable.

> diff --git a/conf.c b/conf.c
> index a22ef48..e1f42f1 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -1530,6 +1530,11 @@ void conf(struct ctx *c, int argc, char **argv)
>  		}
>  	} while (name != -1);
>  
> +	if (v4_only && v6_only) {
> +		err("Options ipv4-only and ipv6-only are mutually exclusive");
> +		usage(argv[0]);
> +	}
> +
>  	ret = conf_ugid(runas, &uid, &gid);
>  	if (ret)
>  		usage(argv[0]);
> @@ -1539,6 +1544,30 @@ void conf(struct ctx *c, int argc, char **argv)
>  			     logfile, logsize);
>  	}
>  
> +	nl_sock_init(c, false);
> +	if (!v6_only)
> +		c->ifi4 = conf_ip4(ifi, &c->ip4, c->mac);
> +	if (!v4_only)
> +		c->ifi6 = conf_ip6(ifi, &c->ip6, c->mac);
> +	if (!c->ifi4 && !c->ifi6) {
> +		err("External interface not usable");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	/* Inbound port options can be parsed now (after IPv4/IPv6 settings) */
> +	optind = 1;
> +	do {
> +		struct port_fwd *fwd;
> +
> +		name = getopt_long(argc, argv, optstring, options, NULL);
> +
> +		if ((name == 't' && (fwd = &c->tcp.fwd_in)) ||
> +		    (name == 'u' && (fwd = &c->udp.fwd_in.f))) {
> +			if (!optarg || conf_ports(c, name, optarg, fwd))
> +				usage(argv[0]);
> +		}
> +	} while (name != -1);
> +
>  	if (c->mode == MODE_PASTA) {
>  		if (conf_pasta_ns(&netns_only, userns, netns,
>  				  optind, argc, argv) < 0)
> @@ -1561,50 +1590,20 @@ void conf(struct ctx *c, int argc, char **argv)
>  		}
>  	}
>  
> -	if (nl_sock_init(c)) {
> -		err("Failed to get netlink socket");
> -		exit(EXIT_FAILURE);
> -	}
> -
> -	if (v4_only && v6_only) {
> -		err("Options ipv4-only and ipv6-only are mutually exclusive");
> -		usage(argv[0]);
> -	}
> -	if (!v6_only)
> -		c->ifi4 = conf_ip4(ifi, &c->ip4, c->mac);
> -	if (!v4_only)
> -		c->ifi6 = conf_ip6(ifi, &c->ip6, c->mac);
> -	if (!c->ifi4 && !c->ifi6) {
> -		err("External interface not usable");
> -		exit(EXIT_FAILURE);
> -	}
> +	if (c->mode == MODE_PASTA)
> +		nl_sock_init(c, true);
>  
> -	/* Now we can process port configuration options */
> +	/* ...and outbound port options now that namespaces are set up. */
>  	optind = 1;
>  	do {
> -		struct port_fwd *fwd = NULL;
> +		struct port_fwd *fwd;
>  
>  		name = getopt_long(argc, argv, optstring, options, NULL);
> -		switch (name) {
> -		case 't':
> -		case 'u':
> -		case 'T':
> -		case 'U':
> -			if (name == 't')
> -				fwd = &c->tcp.fwd_in;
> -			else if (name == 'T')
> -				fwd = &c->tcp.fwd_out;
> -			else if (name == 'u')
> -				fwd = &c->udp.fwd_in.f;
> -			else if (name == 'U')
> -				fwd = &c->udp.fwd_out.f;
>  
> +		if ((name == 'T' && (fwd = &c->tcp.fwd_out)) ||
> +		    (name == 'U' && (fwd = &c->udp.fwd_out.f))) {
>  			if (!optarg || conf_ports(c, name, optarg, fwd))
>  				usage(argv[0]);
> -
> -			break;
> -		default:
> -			break;
>  		}
>  	} while (name != -1);
>  
> diff --git a/netlink.c b/netlink.c
> index 6e5a96b..3ee4d42 100644
> --- a/netlink.c
> +++ b/netlink.c
> @@ -19,6 +19,7 @@
>  #include <sys/types.h>
>  #include <limits.h>
>  #include <stdlib.h>
> +#include <stdbool.h>
>  #include <stdint.h>
>  #include <unistd.h>
>  #include <arpa/inet.h>
> @@ -71,25 +72,28 @@ ns:
>  }
>  
>  /**
> - * nl_sock_init() - Call nl_sock_init_do() and check for failures
> + * nl_sock_init() - Call nl_sock_init_do(), won't return on failure
>   * @c:		Execution context
> - *
> - * Return: -EIO if sockets couldn't be set up, 0 otherwise
> + * @ns:		Get socket in namespace, not in init
>   */
> -int nl_sock_init(const struct ctx *c)
> +void nl_sock_init(const struct ctx *c, bool ns)
>  {
> -	if (c->mode == MODE_PASTA) {
> +	if (c->mode == MODE_PASTA && ns) {
>  		NS_CALL(nl_sock_init_do, c);
>  		if (nl_sock_ns == -1)
> -			return -EIO;
> +			goto fail;
>  	} else {
>  		nl_sock_init_do(NULL);
>  	}
>  
>  	if (nl_sock == -1)
> -		return -EIO;
> +		goto fail;
>  
> -	return 0;
> +	return;
> +
> +fail:
> +	err("Failed to get netlink socket");
> +	exit(EXIT_FAILURE);
>  }
>  
>  /**
> diff --git a/netlink.h b/netlink.h
> index 5ce5037..3c991cc 100644
> --- a/netlink.h
> +++ b/netlink.h
> @@ -6,7 +6,7 @@
>  #ifndef NETLINK_H
>  #define NETLINK_H
>  
> -int nl_sock_init(const struct ctx *c);
> +void nl_sock_init(const struct ctx *c, bool ns);
>  unsigned int nl_get_ext_if(sa_family_t af);
>  void nl_route(int ns, unsigned int ifi, sa_family_t af, void *gw);
>  void nl_addr(int ns, unsigned int ifi, sa_family_t af,
> diff --git a/tap.c b/tap.c
> index 77e513c..8b6d9bc 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -30,6 +30,7 @@
>  #include <sys/stat.h>
>  #include <fcntl.h>
>  #include <sys/uio.h>
> +#include <stdbool.h>
>  #include <stdlib.h>
>  #include <unistd.h>
>  #include <netinet/ip.h>
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Alas for CAP_NET_BIND_SERVICE
  2022-10-14  2:54           ` David Gibson
@ 2022-10-16  9:46             ` Stefano Brivio
  2022-10-17  3:20               ` David Gibson
  0 siblings, 1 reply; 12+ messages in thread
From: Stefano Brivio @ 2022-10-16  9:46 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 14 Oct 2022 13:54:28 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Oct 13, 2022 at 06:54:26AM +0200, Stefano Brivio wrote:
> > On Thu, 13 Oct 2022 11:34:04 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Wed, Oct 12, 2022 at 12:47:07PM +0200, Stefano Brivio wrote:  
> > > > On Wed, 12 Oct 2022 20:31:20 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >     
> > > > > On Wed, Oct 12, 2022 at 07:54:32AM +0200, Stefano Brivio wrote:    
> > > > > > Hi David,
> > > > > > 
> > > > > > On Wed, 12 Oct 2022 13:55:02 +1100
> > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > >       
> > > > > > > Hi Stefano,
> > > > > > > 
> > > > > > > I've looked deeper into why giving passt/pasta CAP_NET_BIND_SERVICE
> > > > > > > isn't working, and I'm afraid I have bad news.      
> > > > > > 
> > > > > > Thanks for the investigation.
> > > > > >       
> > > > > > > We lose CAP_NET_BIND_SERVICE in the initial namespace as soon as we
> > > > > > > unshare() or setns() into the isolated namespace, and this appears to
> > > > > > > be intended behaviour.  From user_namespaces(7), in the Capabilities section:
> > > > > > > 
> > > > > > >     The child process created by clone(2) with the CLONE_NEWUSER flag
> > > > > > >     starts out with a complete set of capabilities in the new user
> > > > > > >     namespace.  Likewise, a process that creates a new user namespace
> > > > > > >     using unshare(2) or joins an existing user namespace using
> > > > > > >     setns(2) gains a full set of capabilities in that namespace.  ***On
> > > > > > >     the other hand, that process has no capabilities in the parent (in
> > > > > > >     the case of clone(2)) or previous (in the case of unshare(2) and
> > > > > > >     setns(2)) user namespace, even if the new namespace is created or
> > > > > > >     joined by the root user (i.e., a process with user ID 0 in the
> > > > > > >     root namespace).***
> > > > > > > 
> > > > > > > Emphasis (***) mine.  Basically, despite the way it's phrased in many
> > > > > > > places, processes don't have an independent set of capabilities in
> > > > > > > each userns, they only have a set of capabilities in their current
> > > > > > > userns.  Any capabilities in other namespaces are implied in a pretty
> > > > > > > much all or nothing way - if the process's UID (the real, init ns one)
> > > > > > > owns the userns (or one of its ancestors), it gets all caps, otherwise
> > > > > > > none.  cap_capable() has the specific logic in the kernel.      
> > > > > > 
> > > > > > Right, I missed this.
> > > > > > 
> > > > > > For a moment, I wondered about ambient capabilities, but those would
> > > > > > only have an effect on an execve(), not on a clone(), I guess.      
> > > > > 
> > > > > Well, yes, but it doesn't really make any difference in any case.  All
> > > > > ambient caps can do is be another way to get things into the permitted
> > > > > set.  If that happens before the unshare() then we still lose them on
> > > > > unshare().  If it happens after the unshare(), then it's just giving
> > > > > us caps within the namespace, which isn't what we need.
> > > > >     
> > > > > > > So, using CAP_NET_BIND_SERVICE isn't compatible with isolating
> > > > > > > ourselves in our own userns.  At the very least "auto" inbound
> > > > > > > forwarding of low ports is pretty much off the cards.
> > > > > > > 
> > > > > > > For forwarding of specific low ports, we could delay our entry into
> > > > > > > the new userns until we've set up the listening sockets, although it
> > > > > > > does mean rolling back some of the simplification we gained from the
> > > > > > > new-style userns handling.      
> > > > > > 
> > > > > > If I understand correctly, the biggest hurdle would be:
> > > > > > 
> > > > > > 1. we detach namespaces
> > > > > > 
> > > > > > 2. only then we can finalise any missing bit of addressing and routing
> > > > > >    configuration (relevant for pasta)
> > > > > > 
> > > > > > 3. we bind ports as we parse configuration options, but we need
> > > > > >    addressing to be fully configured for this
> > > > > > 
> > > > > > Referring to your latest patchset (which I'm still reviewing), I guess
> > > > > > that implies a further split of isolate_user() (it's great to have a
> > > > > > name for that, finally!), right?      
> > > > > 
> > > > > Uh.. something like that, I haven't looked at the details.  As we did
> > > > > before my userns cleanup, we'd probably need to repeatedly enter the
> > > > > userns as well as the netns to operate upon it, staying in the initial
> > > > > userns, with our initial caps until sandbox()/isolate_prefork() or
> > > > > thereabouts.
> > > > >     
> > > > > > > Or, we could abandon CAP_NET_BIND_SERVICE, and recommend the
> > > > > > > net.ipv4.ip_unprivileged_port_start sysctl as the only way to handle
> > > > > > > low ports in passt.  I do see a fair bit of logic in that approach:
> > > > > > > passt has no meaningful way to limit what users do with the low ports
> > > > > > > it allows them (indirectly) to bind to, giving passt
> > > > > > > CAP_NET_BIND_SERVICE is pretty much equivalent to giving any process
> > > > > > > which can invoke passt CAP_NET_BIND_SERVICE.      
> > > > > > 
> > > > > > I also see the general point, even though if file capabilities are
> > > > > > used, I guess the equivalence doesn't really hold.      
> > > > > 
> > > > > Uh.. I don't follow.  It's exactly file capabilities which make this
> > > > > equivalence.  If the passt binary has cap_net_bind_service=ep, you
> > > > > can, as an unprivileged user, take any server, stick it in a namespace
> > > > > and use pasta to effectively bind it to a low port in the init
> > > > > namespace.    
> > > > 
> > > > I actually meant with passt but... even for pasta, this depends on the
> > > > decision of whether we drop capabilities for the spawned process. If we
> > > > decide we don't, one day, then it's not equivalent.    
> > > 
> > > No, from a security perspective it pretty much is still equivalent.
> > > You can start your own namespace where you have full capabilities, run
> > > the server in there, then use pasta to translate your
> > > cap_net_bind_service within to cap_net_bind_service on the host.  Or
> > > just run the server on a high port and tell pasta to connect a low
> > > port to it.  
> > 
> > Ah, sorry, now I understand what you mean here, and...
> >   
> > > > It would be equivalent if we just inherited capabilities from the
> > > > parent as opposed to file capabilities -- that's what I meant.
> > > > 
> > > > I think it's a bit early to decide to drop those, though. Right now
> > > > pasta isn't really used as a stand-alone tool (even though I
> > > > actually do that, I find it very convenient also for totally unrelated
> > > > purposes).
> > > > 
> > > > Should we see some use cases, then we could make a more informed
> > > > decision.
> > > >     
> > > > > You can do the same thing with passt, though it's fiddlier
> > > > > (you'd need a shim to translate qemu socket protocol before plugging
> > > > > it into the server).    
> > > > 
> > > > Oh, you mean running pasta plus a shim plus qemu? Because with passt I
> > > > don't understand how you'd pass that kind of stuff over AF_UNIX...    
> > > 
> > > No qemu necessary.  Make your bogus server, but instead of directly
> > > listen()ing on a low port, have it connect to a Unix socket and wait
> > > for SYN packets to a low port in qemu protocol.  Then use passt to
> > > turn your Unix socket into a real listen()ing socket on the host.  
> > 
> > ...here. But the environment I had in mind was a rather controller one,
> > with KSM policies that would normally prevent you from even having your
> > bogus server.
> > 
> > Well, that would be the case for KubeVirt at least: three binaries and
> > not much margin to play tricks.  
> 
> Ok, but even then using the file capability rather than the sysctl
> only makes a difference if the attacker:
>  * CAN escape confinement enough to make socket calls in the netns
>    where we would be setting the sysctl
>  * CAN'T escape confinment enough to exec() passt

Hmm, I'm thinking about another fact. Now we don't drop the capability
after binding ports, but that's anyway not effective in the parent
namespace because of what you mentioned, which implies that we can just
bind configured ports.

There might be a relevant difference between binding a port 25, a less
usable 53 or 67, or a more innocent 443. In practice, if somebody uses
the sysctl, they might very well be setting it to 0, instead.

By the way, I just realised, after these changes we should double check
the AppArmor and SELinux profiles we ship as examples.

I don't think it's urgent, because in the worst case they should be too
restrictive rather than the opposite -- see the current AppArmor
"capability" directive and the SELinux "allow passt_t self:capability"
enforcement.

-- 
Stefano


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Alas for CAP_NET_BIND_SERVICE
  2022-10-16  9:46             ` Stefano Brivio
@ 2022-10-17  3:20               ` David Gibson
  0 siblings, 0 replies; 12+ messages in thread
From: David Gibson @ 2022-10-17  3:20 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 3451 bytes --]

On Sun, Oct 16, 2022 at 11:46:46AM +0200, Stefano Brivio wrote:
> On Fri, 14 Oct 2022 13:54:28 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Oct 13, 2022 at 06:54:26AM +0200, Stefano Brivio wrote:
> > > On Thu, 13 Oct 2022 11:34:04 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > On Wed, Oct 12, 2022 at 12:47:07PM +0200, Stefano Brivio wrote:  
> > > > > On Wed, 12 Oct 2022 20:31:20 +1100
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
[snip]
> > > > > It would be equivalent if we just inherited capabilities from the
> > > > > parent as opposed to file capabilities -- that's what I meant.
> > > > > 
> > > > > I think it's a bit early to decide to drop those, though. Right now
> > > > > pasta isn't really used as a stand-alone tool (even though I
> > > > > actually do that, I find it very convenient also for totally unrelated
> > > > > purposes).
> > > > > 
> > > > > Should we see some use cases, then we could make a more informed
> > > > > decision.
> > > > >     
> > > > > > You can do the same thing with passt, though it's fiddlier
> > > > > > (you'd need a shim to translate qemu socket protocol before plugging
> > > > > > it into the server).    
> > > > > 
> > > > > Oh, you mean running pasta plus a shim plus qemu? Because with passt I
> > > > > don't understand how you'd pass that kind of stuff over AF_UNIX...    
> > > > 
> > > > No qemu necessary.  Make your bogus server, but instead of directly
> > > > listen()ing on a low port, have it connect to a Unix socket and wait
> > > > for SYN packets to a low port in qemu protocol.  Then use passt to
> > > > turn your Unix socket into a real listen()ing socket on the host.  
> > > 
> > > ...here. But the environment I had in mind was a rather controller one,
> > > with KSM policies that would normally prevent you from even having your
> > > bogus server.
> > > 
> > > Well, that would be the case for KubeVirt at least: three binaries and
> > > not much margin to play tricks.  
> > 
> > Ok, but even then using the file capability rather than the sysctl
> > only makes a difference if the attacker:
> >  * CAN escape confinement enough to make socket calls in the netns
> >    where we would be setting the sysctl
> >  * CAN'T escape confinment enough to exec() passt
> 
> Hmm, I'm thinking about another fact. Now we don't drop the capability
> after binding ports, but that's anyway not effective in the parent
> namespace because of what you mentioned, which implies that we can just
> bind configured ports.

Sorry, I don't follow what you're saying here.

> There might be a relevant difference between binding a port 25, a less
> usable 53 or 67, or a more innocent 443. In practice, if somebody uses
> the sysctl, they might very well be setting it to 0, instead.
> 
> By the way, I just realised, after these changes we should double check
> the AppArmor and SELinux profiles we ship as examples.
> 
> I don't think it's urgent, because in the worst case they should be too
> restrictive rather than the opposite -- see the current AppArmor
> "capability" directive and the SELinux "allow passt_t self:capability"
> enforcement.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-10-17  4:22 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-12  2:55 Alas for CAP_NET_BIND_SERVICE David Gibson
2022-10-12  5:54 ` Stefano Brivio
2022-10-12  9:31   ` David Gibson
2022-10-12 10:47     ` Stefano Brivio
2022-10-13  0:34       ` David Gibson
2022-10-13  4:54         ` Stefano Brivio
2022-10-13  5:15           ` Stefano Brivio
2022-10-14  2:54           ` David Gibson
2022-10-16  9:46             ` Stefano Brivio
2022-10-17  3:20               ` David Gibson
2022-10-13 10:50       ` Stefano Brivio
2022-10-14  2:56         ` David Gibson

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).