On Wed, Oct 22, 2025 at 10:59:16AM +0200, Stefano Brivio wrote: > On Wed, 22 Oct 2025 11:34:40 +1100 > David Gibson wrote: > > > On Tue, Oct 21, 2025 at 11:51:12PM +0200, Stefano Brivio wrote: > > > On Fri, 17 Oct 2025 11:34:47 +1100 > > > David Gibson wrote: > > > > > > > Currently, outbound forwards (-T, -U) are handled by sockets bound to the > > > > loopback address. Typically we create two sockets, one for 127.0.0.1 and > > > > one for ::1. > > > > > > > > This has some disadvantages: > > > > * The guest can't connect to these services using its global IP address, > > > > it must explicitly use 127.0.0.1 or ::1 (bug 100) > > > > * The guest can't even connect via 127.0.0.0/8 addresses other than > > > > 127.0.0.1 > > > > * We can't use dual-stack sockets, we have to have separate sockets for > > > > IPv4 and IPv6. > > > > > > > > The restriction exist for a reason though. If the guest has any interfaces > > > > other than pasta (e.g. a VPN tunnel) external hosts could reach the host > > > > via the forwards. Especially combined with -T auto / -U auto this would > > > > make it very easy to make a mistake with nasty security implications. > > > > > > > > We can achieve both goals, however, if we don't bind the outbound listening > > > > sockets to a particular address, but _do_ use SO_BINDTODEVICE to restrict > > > > them to the "lo" interface. > > > > > > Nice trick, I didn't think of it. I wonder if doing the same host-side > > > might help solving a part of https://bugs.passt.top/show_bug.cgi?id=113 > > > as well. > > > > I don't think we even need to do anything host side - bug 113 arises > > because of where we're listening on the guest side. > > Ah, you're right, I guess I picked the wrong bug. I have a vague memory > of another one where somebody is running a DNS proxy in a container > (PiHole maybe?), bound to something in 127.0.0.0/8 but not 127.0.0.1, > and we automatically bind, on the host side, to 127.0.0.1. > > > So this might be > > enough to fix it all on its own. I'm not certain, because the special > > case DNS handling complicates the picture there. > > I guess perhaps worth a quick test with socat if checking against > systemd-resolved isn't pratical, to see if we can close that one too? Right, I'll have a look when I can. > > > > Link: https://bugs.passt.top/show_bug.cgi?id=100 > > > > > > > > Signed-off-by: David Gibson > > > > --- > > > > pif.c | 6 ------ > > > > tcp.c | 18 ++---------------- > > > > udp.c | 27 ++++++++++----------------- > > > > 3 files changed, 12 insertions(+), 39 deletions(-) > > > > > > > > diff --git a/pif.c b/pif.c > > > > index 592fafaa..84e3ceae 100644 > > > > --- a/pif.c > > > > +++ b/pif.c > > > > @@ -87,12 +87,6 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif, > > > > > > > > ASSERT(pif_is_socket(pif)); > > > > > > > > - if (pif == PIF_SPLICE) { > > > > - /* Sanity checks */ > > > > - ASSERT(!ifname); > > > > - ASSERT(addr && inany_is_loopback(addr)); > > > > - } > > > > - > > > > if (!addr) > > > > return sock_l4_sa(c, type, &sa, sizeof(sa.sa6), > > > > ifname, false, data); > > > > diff --git a/tcp.c b/tcp.c > > > > index 15c012d7..982c9190 100644 > > > > --- a/tcp.c > > > > +++ b/tcp.c > > > > @@ -2592,20 +2592,6 @@ int tcp_sock_init(const struct ctx *c, uint8_t pif, > > > > > > > > return r4 < 0 ? r4 : r6; > > > > } > > > > -/** > > > > - * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections > > > > - * @c: Execution context > > > > - * @port: Port, host order > > > > - */ > > > > -static void tcp_ns_sock_init(const struct ctx *c, in_port_t port) > > > > -{ > > > > - ASSERT(!c->no_tcp); > > > > - > > > > - if (c->ifi4) > > > > - tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback4, NULL, port); > > > > - if (c->ifi6) > > > > - tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback6, NULL, port); > > > > -} > > > > > > > > /** > > > > * tcp_ns_socks_init() - Bind sockets in namespace for outbound connections > > > > @@ -2625,7 +2611,7 @@ static int tcp_ns_socks_init(void *arg) > > > > if (!bitmap_isset(c->tcp.fwd_out.map, port)) > > > > continue; > > > > > > > > - tcp_ns_sock_init(c, port); > > > > + tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port); > > > > > > I thought the "lo" string would be part of the Linux UAPI, but that's > > > not the case, and loopback_net_init() just calls: > > > > > > alloc_netdev(0, "lo", NET_NAME_PREDICTABLE, loopback_setup); > > > > > > so I think it's relatively unproblematic to hardcode that as well, and it > > > looks like we can't create a second loopback interface, even though: > > > > > > $ pasta -- sh -c 'ip link set dev lo down; ip link change dev lo name lol; ip link show lol' > > > 1: lol: mtu 65536 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 > > > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > > > > Hm, that is a point. *thinks*. So I believe loopback always has > > index 1, so we could potentially use that. > > Right, see pasta_ns_conf(): > > rc = nl_link_set_flags(nl_sock_ns, 1 /* lo */, IFF_UP, IFF_UP); > > > Except that BINDTODEVICE > > takes a name, not an index. I don't think looking up the name from > > the index then using BINDTODEVICE would do quite what we want either: > > IIUC, BINDTODEVICE is fixed by name not index, so if the guest changed > > the name of lo after we did BINDTODEVICE, then it wouldn't "follow" > > the interface name change (which is what you want for intermittently > > present interfaces, not so much here). > > Yes, we would need to keep querying it. > > > > I don't have any quick solution and I don't think we care enough as to > > > write a function in netlink.c fetching links with loopback type, so I'm > > > totally fine with this as it is. > > > > Yeah, given the above, I think this is another case of we can only go > > so far to stop the guest shooting itself in the foot. > > > > > By the way, if we fail to use SO_BINDTODEVICE, we already defensively > > > close the socket. The only possible flaw that occurs to me is that > > > somebody could rename 'lo' and then create a link called 'lo' of a > > > different type. But that needs CAP_NET_ADMIN in the container anyway. > > > > Right. And while that could expose host ports in ways we didn't > > expect, a malicious guest could already do that by running a port > > forwarder. So, again, I think this falls under the category of the > > guest being allowed to shoot itself in the foot. > > Indeed. > > > > > } > > > > > > > > return 0; > > > > @@ -2805,7 +2791,7 @@ static void tcp_port_rebind(struct ctx *c, bool outbound) > > > > if ((c->ifi4 && socks[port][V4] == -1) || > > > > (c->ifi6 && socks[port][V6] == -1)) { > > > > if (outbound) > > > > - tcp_ns_sock_init(c, port); > > > > + tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port); > > > > > > Should we have/keep a fallback for pre-5.7 / pre-c427bfec18f2 kernels? > > > > For a moment I thought we didn't need a fallback, because we'd be > > entering the guest ns and thereby gaining CAP_NET_RAW. But that's not > > the case: we only enter the guest netns for this operation, we're > > already in the userns and have dropped capabilities at this point > > (unlike the bindings we create at startup). > > > > So, good question. Having a fallback would make this a *lot* messier, > > Hmm, why? I didn't test this (and I'm quite confused by > inany_from_sockaddr() right now) but: Ah, I didn't think of doing the workaround at this level. That does help somewhat. > > --- > diff --git a/util.c b/util.c > index c492f90..6b04040 100644 > --- a/util.c > +++ b/util.c > @@ -126,17 +126,33 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type, > * ("net: core: enable SO_BINDTODEVICE for non-root users"). If > * it's unsupported, don't bind the socket at all, because the > * user might rely on this to filter incoming connections. > + * > + * As an exception, if we want to bind to 'lo', approximate that > + * by binding to 127.0.0.1 (which might be the wrong loopback > + * address for IPv4) or ::1 (always correct, for IPv6). This > + * adds https://bugs.passt.top/show_bug.cgi?id=100 back for > + * pre-5.7 kernels. > */ > if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, > ifname, strlen(ifname))) { > char str[SOCKADDR_STRLEN]; > > - ret = -errno; > - warn("Can't bind %s socket for %s to %s, closing", > - EPOLL_TYPE_STR(proto), > - sockaddr_ntop(sa, str, sizeof(str)), ifname); > - close(fd); > - return ret; > + if (errno == EPERM && !strcmp(ifname, "lo")) { > + /* I just realised inany_from_sockaddr() doesn't > + * actually take a sockaddr as source...? Do we It does, the description's just not great. It takes the sockaddr as a (void *) @addr so that callers can pass a struct sockaddr_whatever without casting. I'll try to improve this. > + * have another function doing that now? > + * > + * Well, change the address in 'sa' and warn(). We need to specially handle the dual stack case, to force the caller to split into separate v4 and v6 sockets here. > + */ > + ; > + } else { > + ret = -errno; > + warn("Can't bind %s socket for %s to %s, closing", > + EPOLL_TYPE_STR(proto), > + sockaddr_ntop(sa, str, sizeof(str)), ifname); > + close(fd); > + return ret; > + } > } > } > > --- > > > and perhaps more importantly means we'd get a subtle but real > > behavioural difference based on kernel version which sounds like it > > could be pretty surprising to the user. My inclination is to say that > > -T auto / -U auto requires a kernel with that patch, but if you > > overrule me, I'll see what I can do. > > See https://bugs.passt.top/show_bug.cgi?id=121 for a clear indication > that we have users on 4.18 (RHEL 8) kernels. On Debian, Buster (4.19 > kernels) was the 'oldoldstable' until recently, and extended long term > support ends in 2029. Poop. > If this was a new feature, I would agree. But with this, we are > introducing a regression and we might break some setups. Yeah, ok. I'll figure something out. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson