On Wed, Oct 22, 2025 at 10:59:16AM +0200, Stefano Brivio wrote:
> On Wed, 22 Oct 2025 11:34:40 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Tue, Oct 21, 2025 at 11:51:12PM +0200, Stefano Brivio wrote:
> > > On Fri, 17 Oct 2025 11:34:47 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > Currently, outbound forwards (-T, -U) are handled by sockets bound to the
> > > > loopback address.  Typically we create two sockets, one for 127.0.0.1 and
> > > > one for ::1.
> > > > 
> > > > This has some disadvantages:
> > > >  * The guest can't connect to these services using its global IP address,
> > > >    it must explicitly use 127.0.0.1 or ::1 (bug 100)
> > > >  * The guest can't even connect via 127.0.0.0/8 addresses other than
> > > >    127.0.0.1
> > > >  * We can't use dual-stack sockets, we have to have separate sockets for
> > > >    IPv4 and IPv6.
> > > > 
> > > > The restriction exist for a reason though.  If the guest has any interfaces
> > > > other than pasta (e.g. a VPN tunnel) external hosts could reach the host
> > > > via the forwards.  Especially combined with -T auto / -U auto this would
> > > > make it very easy to make a mistake with nasty security implications.
> > > > 
> > > > We can achieve both goals, however, if we don't bind the outbound listening
> > > > sockets to a particular address, but _do_ use SO_BINDTODEVICE to restrict
> > > > them to the "lo" interface.  
> > > 
> > > Nice trick, I didn't think of it. I wonder if doing the same host-side
> > > might help solving a part of https://bugs.passt.top/show_bug.cgi?id=113
> > > as well.  
> > 
> > I don't think we even need to do anything host side - bug 113 arises
> > because of where we're listening on the guest side.
> 
> Ah, you're right, I guess I picked the wrong bug. I have a vague memory
> of another one where somebody is running a DNS proxy in a container
> (PiHole maybe?), bound to something in 127.0.0.0/8 but not 127.0.0.1,
> and we automatically bind, on the host side, to 127.0.0.1.
> 
> > So this might be
> > enough to fix it all on its own.  I'm not certain, because the special
> > case DNS handling complicates the picture there.
> 
> I guess perhaps worth a quick test with socat if checking against
> systemd-resolved isn't pratical, to see if we can close that one too?

Right, I'll have a look when I can.

> > > > Link: https://bugs.passt.top/show_bug.cgi?id=100
> > > > 
> > > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > > ---
> > > >  pif.c |  6 ------
> > > >  tcp.c | 18 ++----------------
> > > >  udp.c | 27 ++++++++++-----------------
> > > >  3 files changed, 12 insertions(+), 39 deletions(-)
> > > > 
> > > > diff --git a/pif.c b/pif.c
> > > > index 592fafaa..84e3ceae 100644
> > > > --- a/pif.c
> > > > +++ b/pif.c
> > > > @@ -87,12 +87,6 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
> > > >  
> > > >  	ASSERT(pif_is_socket(pif));
> > > >  
> > > > -	if (pif == PIF_SPLICE) {
> > > > -		/* Sanity checks */
> > > > -		ASSERT(!ifname);
> > > > -		ASSERT(addr && inany_is_loopback(addr));
> > > > -	}
> > > > -
> > > >  	if (!addr)
> > > >  		return sock_l4_sa(c, type, &sa, sizeof(sa.sa6),
> > > >  				  ifname, false, data);
> > > > diff --git a/tcp.c b/tcp.c
> > > > index 15c012d7..982c9190 100644
> > > > --- a/tcp.c
> > > > +++ b/tcp.c
> > > > @@ -2592,20 +2592,6 @@ int tcp_sock_init(const struct ctx *c, uint8_t pif,
> > > >  
> > > >  	return r4 < 0 ? r4 : r6;
> > > >  }
> > > > -/**
> > > > - * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections
> > > > - * @c:		Execution context
> > > > - * @port:	Port, host order
> > > > - */
> > > > -static void tcp_ns_sock_init(const struct ctx *c, in_port_t port)
> > > > -{
> > > > -	ASSERT(!c->no_tcp);
> > > > -
> > > > -	if (c->ifi4)
> > > > -		tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback4, NULL, port);
> > > > -	if (c->ifi6)
> > > > -		tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback6, NULL, port);
> > > > -}
> > > >  
> > > >  /**
> > > >   * tcp_ns_socks_init() - Bind sockets in namespace for outbound connections
> > > > @@ -2625,7 +2611,7 @@ static int tcp_ns_socks_init(void *arg)
> > > >  		if (!bitmap_isset(c->tcp.fwd_out.map, port))
> > > >  			continue;
> > > >  
> > > > -		tcp_ns_sock_init(c, port);
> > > > +		tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);  
> > > 
> > > I thought the "lo" string would be part of the Linux UAPI, but that's
> > > not the case, and loopback_net_init() just calls:
> > > 
> > > 	alloc_netdev(0, "lo", NET_NAME_PREDICTABLE, loopback_setup);
> > > 
> > > so I think it's relatively unproblematic to hardcode that as well, and it
> > > looks like we can't create a second loopback interface, even though:
> > > 
> > > $ pasta -- sh -c 'ip link set dev lo down; ip link change dev lo name lol; ip link show lol'
> > > 1: lol: <LOOPBACK> mtu 65536 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
> > >     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00  
> > 
> > Hm, that is a point.  *thinks*.  So I believe loopback always has
> > index 1, so we could potentially use that.
> 
> Right, see pasta_ns_conf():
> 
>   rc = nl_link_set_flags(nl_sock_ns, 1 /* lo */, IFF_UP, IFF_UP);
> 
> > Except that BINDTODEVICE
> > takes a name, not an index.  I don't think looking up the name from
> > the index then using BINDTODEVICE would do quite what we want either:
> > IIUC, BINDTODEVICE is fixed by name not index, so if the guest changed
> > the name of lo after we did BINDTODEVICE, then it wouldn't "follow"
> > the interface name change (which is what you want for intermittently
> > present interfaces, not so much here).
> 
> Yes, we would need to keep querying it.
> 
> > > I don't have any quick solution and I don't think we care enough as to
> > > write a function in netlink.c fetching links with loopback type, so I'm
> > > totally fine with this as it is.  
> > 
> > Yeah, given the above, I think this is another case of we can only go
> > so far to stop the guest shooting itself in the foot.
> > 
> > > By the way, if we fail to use SO_BINDTODEVICE, we already defensively
> > > close the socket. The only possible flaw that occurs to me is that
> > > somebody could rename 'lo' and then create a link called 'lo' of a
> > > different type. But that needs CAP_NET_ADMIN in the container anyway.  
> > 
> > Right.  And while that could expose host ports in ways we didn't
> > expect, a malicious guest could already do that by running a port
> > forwarder.  So, again, I think this falls under the category of the
> > guest being allowed to shoot itself in the foot.
> 
> Indeed.
> 
> > > >  	}
> > > >  
> > > >  	return 0;
> > > > @@ -2805,7 +2791,7 @@ static void tcp_port_rebind(struct ctx *c, bool outbound)
> > > >  		if ((c->ifi4 && socks[port][V4] == -1) ||
> > > >  		    (c->ifi6 && socks[port][V6] == -1)) {
> > > >  			if (outbound)
> > > > -				tcp_ns_sock_init(c, port);
> > > > +				tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);  
> > > 
> > > Should we have/keep a fallback for pre-5.7 / pre-c427bfec18f2 kernels?  
> > 
> > For a moment I thought we didn't need a fallback, because we'd be
> > entering the guest ns and thereby gaining CAP_NET_RAW.  But that's not
> > the case: we only enter the guest netns for this operation, we're
> > already in the userns and have dropped capabilities at this point
> > (unlike the bindings we create at startup).
> > 
> > So, good question.  Having a fallback would make this a *lot* messier,
> 
> Hmm, why? I didn't test this (and I'm quite confused by
> inany_from_sockaddr() right now) but:

Ah, I didn't think of doing the workaround at this level.  That does
help somewhat.

> 
> ---
> diff --git a/util.c b/util.c
> index c492f90..6b04040 100644
> --- a/util.c
> +++ b/util.c
> @@ -126,17 +126,33 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
>  		 * ("net: core: enable SO_BINDTODEVICE for non-root users"). If
>  		 * it's unsupported, don't bind the socket at all, because the
>  		 * user might rely on this to filter incoming connections.
> +		 *
> +		 * As an exception, if we want to bind to 'lo', approximate that
> +		 * by binding to 127.0.0.1 (which might be the wrong loopback
> +		 * address for IPv4) or ::1 (always correct, for IPv6). This
> +		 * adds https://bugs.passt.top/show_bug.cgi?id=100 back for
> +		 * pre-5.7 kernels.
>  		 */
>  		if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
>  			       ifname, strlen(ifname))) {
>  			char str[SOCKADDR_STRLEN];
>  
> -			ret = -errno;
> -			warn("Can't bind %s socket for %s to %s, closing",
> -			     EPOLL_TYPE_STR(proto),
> -			     sockaddr_ntop(sa, str, sizeof(str)), ifname);
> -			close(fd);
> -			return ret;
> +			if (errno == EPERM && !strcmp(ifname, "lo")) {
> +				/* I just realised inany_from_sockaddr() doesn't
> +				 * actually take a sockaddr as source...? Do we

It does, the description's just not great.  It takes the sockaddr as a
(void *) @addr so that callers can pass a struct sockaddr_whatever
without casting.  I'll try to improve this.

> +				 * have another function doing that now?
> +				 *
> +				 * Well, change the address in 'sa' and warn().

We need to specially handle the dual stack case, to force the caller
to split into separate v4 and v6 sockets here.

> +				 */
> +				;
> +			} else {
> +				ret = -errno;
> +				warn("Can't bind %s socket for %s to %s, closing",
> +				     EPOLL_TYPE_STR(proto),
> +				     sockaddr_ntop(sa, str, sizeof(str)), ifname);
> +				close(fd);
> +				return ret;
> +			}
>  		}
>  	}
>  
> ---
> 
> > and perhaps more importantly means we'd get a subtle but real
> > behavioural difference based on kernel version which sounds like it
> > could be pretty surprising to the user.  My inclination is to say that
> > -T auto / -U auto requires a kernel with that patch, but if you
> > overrule me, I'll see what I can do.
> 
> See https://bugs.passt.top/show_bug.cgi?id=121 for a clear indication
> that we have users on 4.18 (RHEL 8) kernels. On Debian, Buster (4.19
> kernels) was the 'oldoldstable' until recently, and extended long term
> support ends in 2029.

Poop.

> If this was a new feature, I would agree. But with this, we are
> introducing a regression and we might break some setups.

Yeah, ok.  I'll figure something out.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson