public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top
Subject: Re: [PATCH 3/3] tcp, udp: Bind outbound listening sockets by interface instead of address
Date: Wed, 22 Oct 2025 10:59:16 +0200	[thread overview]
Message-ID: <20251022105916.53925523@elisabeth> (raw)
In-Reply-To: <aPgmoNyf8c0RkEO-@zatzit>

On Wed, 22 Oct 2025 11:34:40 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, Oct 21, 2025 at 11:51:12PM +0200, Stefano Brivio wrote:
> > On Fri, 17 Oct 2025 11:34:47 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > Currently, outbound forwards (-T, -U) are handled by sockets bound to the
> > > loopback address.  Typically we create two sockets, one for 127.0.0.1 and
> > > one for ::1.
> > > 
> > > This has some disadvantages:
> > >  * The guest can't connect to these services using its global IP address,
> > >    it must explicitly use 127.0.0.1 or ::1 (bug 100)
> > >  * The guest can't even connect via 127.0.0.0/8 addresses other than
> > >    127.0.0.1
> > >  * We can't use dual-stack sockets, we have to have separate sockets for
> > >    IPv4 and IPv6.
> > > 
> > > The restriction exist for a reason though.  If the guest has any interfaces
> > > other than pasta (e.g. a VPN tunnel) external hosts could reach the host
> > > via the forwards.  Especially combined with -T auto / -U auto this would
> > > make it very easy to make a mistake with nasty security implications.
> > > 
> > > We can achieve both goals, however, if we don't bind the outbound listening
> > > sockets to a particular address, but _do_ use SO_BINDTODEVICE to restrict
> > > them to the "lo" interface.  
> > 
> > Nice trick, I didn't think of it. I wonder if doing the same host-side
> > might help solving a part of https://bugs.passt.top/show_bug.cgi?id=113
> > as well.  
> 
> I don't think we even need to do anything host side - bug 113 arises
> because of where we're listening on the guest side.

Ah, you're right, I guess I picked the wrong bug. I have a vague memory
of another one where somebody is running a DNS proxy in a container
(PiHole maybe?), bound to something in 127.0.0.0/8 but not 127.0.0.1,
and we automatically bind, on the host side, to 127.0.0.1.

> So this might be
> enough to fix it all on its own.  I'm not certain, because the special
> case DNS handling complicates the picture there.

I guess perhaps worth a quick test with socat if checking against
systemd-resolved isn't pratical, to see if we can close that one too?

> > > Link: https://bugs.passt.top/show_bug.cgi?id=100
> > > 
> > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > ---
> > >  pif.c |  6 ------
> > >  tcp.c | 18 ++----------------
> > >  udp.c | 27 ++++++++++-----------------
> > >  3 files changed, 12 insertions(+), 39 deletions(-)
> > > 
> > > diff --git a/pif.c b/pif.c
> > > index 592fafaa..84e3ceae 100644
> > > --- a/pif.c
> > > +++ b/pif.c
> > > @@ -87,12 +87,6 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
> > >  
> > >  	ASSERT(pif_is_socket(pif));
> > >  
> > > -	if (pif == PIF_SPLICE) {
> > > -		/* Sanity checks */
> > > -		ASSERT(!ifname);
> > > -		ASSERT(addr && inany_is_loopback(addr));
> > > -	}
> > > -
> > >  	if (!addr)
> > >  		return sock_l4_sa(c, type, &sa, sizeof(sa.sa6),
> > >  				  ifname, false, data);
> > > diff --git a/tcp.c b/tcp.c
> > > index 15c012d7..982c9190 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -2592,20 +2592,6 @@ int tcp_sock_init(const struct ctx *c, uint8_t pif,
> > >  
> > >  	return r4 < 0 ? r4 : r6;
> > >  }
> > > -/**
> > > - * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections
> > > - * @c:		Execution context
> > > - * @port:	Port, host order
> > > - */
> > > -static void tcp_ns_sock_init(const struct ctx *c, in_port_t port)
> > > -{
> > > -	ASSERT(!c->no_tcp);
> > > -
> > > -	if (c->ifi4)
> > > -		tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback4, NULL, port);
> > > -	if (c->ifi6)
> > > -		tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback6, NULL, port);
> > > -}
> > >  
> > >  /**
> > >   * tcp_ns_socks_init() - Bind sockets in namespace for outbound connections
> > > @@ -2625,7 +2611,7 @@ static int tcp_ns_socks_init(void *arg)
> > >  		if (!bitmap_isset(c->tcp.fwd_out.map, port))
> > >  			continue;
> > >  
> > > -		tcp_ns_sock_init(c, port);
> > > +		tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);  
> > 
> > I thought the "lo" string would be part of the Linux UAPI, but that's
> > not the case, and loopback_net_init() just calls:
> > 
> > 	alloc_netdev(0, "lo", NET_NAME_PREDICTABLE, loopback_setup);
> > 
> > so I think it's relatively unproblematic to hardcode that as well, and it
> > looks like we can't create a second loopback interface, even though:
> > 
> > $ pasta -- sh -c 'ip link set dev lo down; ip link change dev lo name lol; ip link show lol'
> > 1: lol: <LOOPBACK> mtu 65536 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
> >     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00  
> 
> Hm, that is a point.  *thinks*.  So I believe loopback always has
> index 1, so we could potentially use that.

Right, see pasta_ns_conf():

  rc = nl_link_set_flags(nl_sock_ns, 1 /* lo */, IFF_UP, IFF_UP);

> Except that BINDTODEVICE
> takes a name, not an index.  I don't think looking up the name from
> the index then using BINDTODEVICE would do quite what we want either:
> IIUC, BINDTODEVICE is fixed by name not index, so if the guest changed
> the name of lo after we did BINDTODEVICE, then it wouldn't "follow"
> the interface name change (which is what you want for intermittently
> present interfaces, not so much here).

Yes, we would need to keep querying it.

> > I don't have any quick solution and I don't think we care enough as to
> > write a function in netlink.c fetching links with loopback type, so I'm
> > totally fine with this as it is.  
> 
> Yeah, given the above, I think this is another case of we can only go
> so far to stop the guest shooting itself in the foot.
> 
> > By the way, if we fail to use SO_BINDTODEVICE, we already defensively
> > close the socket. The only possible flaw that occurs to me is that
> > somebody could rename 'lo' and then create a link called 'lo' of a
> > different type. But that needs CAP_NET_ADMIN in the container anyway.  
> 
> Right.  And while that could expose host ports in ways we didn't
> expect, a malicious guest could already do that by running a port
> forwarder.  So, again, I think this falls under the category of the
> guest being allowed to shoot itself in the foot.

Indeed.

> > >  	}
> > >  
> > >  	return 0;
> > > @@ -2805,7 +2791,7 @@ static void tcp_port_rebind(struct ctx *c, bool outbound)
> > >  		if ((c->ifi4 && socks[port][V4] == -1) ||
> > >  		    (c->ifi6 && socks[port][V6] == -1)) {
> > >  			if (outbound)
> > > -				tcp_ns_sock_init(c, port);
> > > +				tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);  
> > 
> > Should we have/keep a fallback for pre-5.7 / pre-c427bfec18f2 kernels?  
> 
> For a moment I thought we didn't need a fallback, because we'd be
> entering the guest ns and thereby gaining CAP_NET_RAW.  But that's not
> the case: we only enter the guest netns for this operation, we're
> already in the userns and have dropped capabilities at this point
> (unlike the bindings we create at startup).
> 
> So, good question.  Having a fallback would make this a *lot* messier,

Hmm, why? I didn't test this (and I'm quite confused by
inany_from_sockaddr() right now) but:

---
diff --git a/util.c b/util.c
index c492f90..6b04040 100644
--- a/util.c
+++ b/util.c
@@ -126,17 +126,33 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
 		 * ("net: core: enable SO_BINDTODEVICE for non-root users"). If
 		 * it's unsupported, don't bind the socket at all, because the
 		 * user might rely on this to filter incoming connections.
+		 *
+		 * As an exception, if we want to bind to 'lo', approximate that
+		 * by binding to 127.0.0.1 (which might be the wrong loopback
+		 * address for IPv4) or ::1 (always correct, for IPv6). This
+		 * adds https://bugs.passt.top/show_bug.cgi?id=100 back for
+		 * pre-5.7 kernels.
 		 */
 		if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
 			       ifname, strlen(ifname))) {
 			char str[SOCKADDR_STRLEN];
 
-			ret = -errno;
-			warn("Can't bind %s socket for %s to %s, closing",
-			     EPOLL_TYPE_STR(proto),
-			     sockaddr_ntop(sa, str, sizeof(str)), ifname);
-			close(fd);
-			return ret;
+			if (errno == EPERM && !strcmp(ifname, "lo")) {
+				/* I just realised inany_from_sockaddr() doesn't
+				 * actually take a sockaddr as source...? Do we
+				 * have another function doing that now?
+				 *
+				 * Well, change the address in 'sa' and warn().
+				 */
+				;
+			} else {
+				ret = -errno;
+				warn("Can't bind %s socket for %s to %s, closing",
+				     EPOLL_TYPE_STR(proto),
+				     sockaddr_ntop(sa, str, sizeof(str)), ifname);
+				close(fd);
+				return ret;
+			}
 		}
 	}
 
---

> and perhaps more importantly means we'd get a subtle but real
> behavioural difference based on kernel version which sounds like it
> could be pretty surprising to the user.  My inclination is to say that
> -T auto / -U auto requires a kernel with that patch, but if you
> overrule me, I'll see what I can do.

See https://bugs.passt.top/show_bug.cgi?id=121 for a clear indication
that we have users on 4.18 (RHEL 8) kernels. On Debian, Buster (4.19
kernels) was the 'oldoldstable' until recently, and extended long term
support ends in 2029.

If this was a new feature, I would agree. But with this, we are
introducing a regression and we might break some setups.

> > >  			else
> > >  				tcp_sock_init(c, PIF_HOST, NULL, NULL, port);
> > >  		}
> > > diff --git a/udp.c b/udp.c
> > > index 49dd0144..e38114eb 100644
> > > --- a/udp.c
> > > +++ b/udp.c
> > > @@ -1127,26 +1127,16 @@ int udp_sock_init(const struct ctx *c, uint8_t pif,
> > >  	}
> > >  
> > >  	if ((!addr || inany_v4(addr)) && c->ifi4) {
> > > -		const union inany_addr *a = addr ?
> > > -			addr : &inany_any4;
> > > -
> > > -		if (pif == PIF_SPLICE)
> > > -			a = &inany_loopback4;
> > > -
> > > -		r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
> > > +		r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif,
> > > +				 addr ? addr : &inany_any4, ifname,
> > >  				 port, uref.u32);
> > >  
> > >  		socks[V4][port] = r4 < 0 ? -1 : r4;
> > >  	}
> > >  
> > >  	if ((!addr || !inany_v4(addr)) && c->ifi6) {
> > > -		const union inany_addr *a = addr ?
> > > -			addr : &inany_any6;
> > > -
> > > -		if (pif == PIF_SPLICE)
> > > -			a = &inany_loopback6;
> > > -
> > > -		r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
> > > +		r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif,
> > > +				 addr ? addr : &inany_any6, ifname,
> > >  				 port, uref.u32);
> > >  
> > >  		socks[V6][port] = r6 < 0 ? -1 : r6;
> > > @@ -1214,9 +1204,12 @@ static void udp_port_rebind(struct ctx *c, bool outbound)
> > >  			continue;
> > >  
> > >  		if ((c->ifi4 && socks[V4][port] == -1) ||
> > > -		    (c->ifi6 && socks[V6][port] == -1))
> > > -			udp_sock_init(c, outbound ? PIF_SPLICE : PIF_HOST,
> > > -				      NULL, NULL, port);
> > > +		    (c->ifi6 && socks[V6][port] == -1)) {
> > > +			if (outbound)
> > > +				udp_sock_init(c, PIF_SPLICE, NULL, "lo", port);
> > > +			else
> > > +				udp_sock_init(c, PIF_HOST, NULL, NULL, port);  
> > 
> > Same here, should we add a fallback case? The rest of the series looks
> > good to me.  
> 
> Same comments as for TCP.

-- 
Stefano


  reply	other threads:[~2025-10-22  8:59 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-17  0:34 [PATCH 0/3] RFC: Reduce differences between inbound and outbound socket binding David Gibson
2025-10-17  0:34 ` [PATCH 1/3] tcp: Merge tcp_ns_sock_init[46]() into tcp_sock_init_one() David Gibson
2025-10-20  6:08   ` Stefano Brivio
2025-10-20  9:24     ` David Gibson
2025-10-20  6:09   ` Stefano Brivio
2025-10-20  9:25     ` David Gibson
2025-10-17  0:34 ` [PATCH 2/3] udp: Unify some more inbound/outbound parts of udp_sock_init() David Gibson
2025-10-21 21:51   ` Stefano Brivio
2025-10-22  0:08     ` David Gibson
2025-10-17  0:34 ` [PATCH 3/3] tcp, udp: Bind outbound listening sockets by interface instead of address David Gibson
2025-10-21 21:51   ` Stefano Brivio
2025-10-22  0:34     ` David Gibson
2025-10-22  8:59       ` Stefano Brivio [this message]
2025-10-23  1:18         ` David Gibson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251022105916.53925523@elisabeth \
    --to=sbrivio@redhat.com \
    --cc=david@gibson.dropbear.id.au \
    --cc=passt-dev@passt.top \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).