On Thu, Oct 13, 2022 at 06:37:05PM +0200, Stefano Brivio wrote:
> On Thu, 13 Oct 2022 15:08:02 +0200
> Stefano Brivio wrote:
>
> > On Thu, 13 Oct 2022 06:01:19 +0200
> > Stefano Brivio wrote:
> >
> > > On Tue, 11 Oct 2022 16:40:15 +1100
> > > David Gibson wrote:
> > >
> > > > @@ -251,7 +275,19 @@ int isolate_prefork(struct ctx *c)
> > > >  		return -errno;
> > > >  	}
> > > >
> > > > -	drop_caps();	/* Relative to the new user namespace this time. */
> > > > +	/* Drop capabilites in our new userns */
> > > > +	if (c->mode == MODE_PASTA) {
> > > > +		/* Keep CAP_SYS_ADMIN, so that we can setns() to the
> > > > +		 * netns when we need to act upon it
> > > > +		 */
> > > > +		ns_caps |= 1UL << CAP_SYS_ADMIN;
> > > > +		/* Keep CAP_NET_BIND_SERVICE, so we can splice
> > > > +		 * outbound connections to low port numbers
> > > > +		 */
> > > > +		ns_caps |= 1UL << CAP_NET_BIND_SERVICE;
> > > > +	}
> > > > +
> > > > +	drop_caps_ep_except(ns_caps);
> > >
> > > Hmm, I didn't really look into this yet, but there seems to be an issue
> > > with filesystem-bound network namespaces now. Running something like:
> > >
> > >   pasta --config-net --netns /run/user/1000/netns/netns-6466ff4b-1efc-2b58-685b-cbc12feb9ccc
> > >
> > > (from Podman), this happens:
> > >
> > > [...]
> > >
> > > [pid 1763223] setns(7, CLONE_NEWNET) = -1 EPERM (Operation not permitted)
> >
> > Ah, "of course". Podman calls us with UID 0 in the user namespace it
> > just created, so if we drop CAP_SYS_ADMIN in isolate_initial() we can't
> > join the network namespace, and if we drop CAP_NET_ADMIN we can't
> > configure it.
> >
> > So for that case (and only for that, I suppose), we need something like
> > (tested):
> >
> > diff --git a/isolation.c b/isolation.c
> > index 1769180..fee6dbd 100644
> > --- a/isolation.c
> > +++ b/isolation.c
> > @@ -190,7 +190,7 @@ void isolate_initial(void)
> >  	 * namespace if we have it, so that we can forward low ports
> >  	 * into the guest/namespace
> >  	 */
> > -	drop_caps_ep_except((1UL << CAP_NET_BIND_SERVICE));
> > +	drop_caps_ep_except(BIT(CAP_SYS_ADMIN) | BIT(CAP_NET_ADMIN));
> >  }
> >
> > ...which is a bit pointless. Better than *any* capability, but not by
> > far.
> >
> > So, if we make this totally independent from configuration, we need
> > those two capabilities.
> >
> > We could add a "postconf" stage and cover a tiny bit more of conf.c.
> >
> > Or we could have a special path in isolate_initial() for the case we
> > know we're not in the init namespace.
> >
> > I'm not sure. If you have a specific preference/strong opinion I would
> > actually be happier. :)
>
> Further on, if we are started as root, we'll fail to drop to 'nobody'
> or any other user, if we lose CAP_SETUID and CAP_SETGID here. I have
> tested this version of isolate_initial():
>
> 	drop_caps_ep_except(BIT(CAP_SYS_ADMIN) | BIT(CAP_NET_ADMIN) |
> 			    BIT(CAP_SETGID) | BIT(CAP_SETUID) |
> 			    BIT(CAP_NET_BIND_SERVICE));
>
> for any use case I can reasonably think of. Yes, it's a lot -- we
> should make it really clear that those are not the capabilities we
> actually use at "run time".

Ah, right.  I didn't think through the --netns-only case.

I think what we need to do in the short term at least is:

  * Add all those caps and a comment in isolate_initial() (or even
    just remove the drop_caps there)
  * In isolate_user(), drop the caps we can at that point (SETGID and
    SETUID), whether or not we've joined a new userns
  * Remove the rest of the "runtime" caps in isolate_prefork() like we
    do now.

I'll rework accordingly.
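
As a rough sketch of those three steps (not the actual passt code: the
stage enum, caps_to_keep() and the pasta_mode flag are hypothetical
names, and I'm assuming ns_caps starts out as 0 in the quoted
isolate_prefork() hunk), the masks kept at each stage might look like
this:

```c
#include <assert.h>		/* static_assert */
#include <stdint.h>
#include <linux/capability.h>	/* CAP_* numbers */

#define BIT(n)	(UINT64_C(1) << (n))

static_assert(CAP_LAST_CAP < 64, "capability numbers must fit a 64-bit mask");

/* Hypothetical stage names, one per step in the list above */
enum stage { STAGE_INITIAL, STAGE_USER, STAGE_PREFORK };

/* Mask of capabilities to keep once the given stage has run.
 * pasta_mode is a hypothetical stand-in for c->mode == MODE_PASTA.
 */
uint64_t caps_to_keep(enum stage s, int pasta_mode)
{
	/* isolate_initial(): everything needed to join and configure the
	 * netns, switch user, and bind low ports
	 */
	uint64_t keep = BIT(CAP_SYS_ADMIN) | BIT(CAP_NET_ADMIN) |
			BIT(CAP_SETGID) | BIT(CAP_SETUID) |
			BIT(CAP_NET_BIND_SERVICE);

	if (s == STAGE_INITIAL)
		return keep;

	/* isolate_user(): UID/GID are settled, so drop SETGID and SETUID */
	keep &= ~(BIT(CAP_SETGID) | BIT(CAP_SETUID));
	if (s == STAGE_USER)
		return keep;

	/* isolate_prefork(): runtime set only, as in the quoted hunk
	 * (assumption: nothing is kept outside pasta mode)
	 */
	keep = 0;
	if (pasta_mode)
		keep = BIT(CAP_SYS_ADMIN) | BIT(CAP_NET_BIND_SERVICE);

	return keep;
}
```

Each mask would then be what gets passed to drop_caps_ep_except() at
that stage, so the set only ever shrinks from one stage to the next.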

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson