[PATCH 0/3] RFC: Reduce differences between inbound and outbound socket binding

public inbox for passt-dev@passt.top

* [PATCH 0/3] RFC: Reduce differences between inbound and outbound socket binding
@ 2025-10-17  0:34 David Gibson
  2025-10-17  0:34 ` [PATCH 1/3] tcp: Merge tcp_ns_sock_init[46]() into tcp_sock_init_one() David Gibson
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: David Gibson @ 2025-10-17  0:34 UTC
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Outbound forwarding sockets are bound to the loopback address, whereas
inbound forwarding sockets are (by default) bound to the unspecified
address.  This leads to some unexpected differences between the code
paths that set up each of them.

Happily, there's an approach to tackling bug 100 which also reduces
those differences, allowing more code to be shared between the two
paths.  Amongst other things, this will make the next steps of
flexible forwarding configuration easier.

Link: https://bugs.passt.top/show_bug.cgi?id=100

David Gibson (3):
  tcp: Merge tcp_ns_sock_init[46]() into tcp_sock_init_one()
  udp: Unify some more inbound/outbound parts of udp_sock_init()
  tcp, udp: Bind outbound listening sockets by interface instead of
    address

 conf.c |   4 +--
 pif.c  |   6 ----
 tcp.c  | 110 +++++++++++++++------------------------------------------
 tcp.h  |   5 +--
 udp.c  |  55 +++++++++++++----------------
 udp.h  |   5 +--
 6 files changed, 61 insertions(+), 124 deletions(-)

-- 
2.51.0


* [PATCH 1/3] tcp: Merge tcp_ns_sock_init[46]() into tcp_sock_init_one()
  2025-10-17  0:34 [PATCH 0/3] RFC: Reduce differences between inbound and outbound socket binding David Gibson
@ 2025-10-17  0:34 ` David Gibson
  2025-10-20  6:08   ` Stefano Brivio
  2025-10-20  6:09   ` Stefano Brivio
  2025-10-17  0:34 ` [PATCH 2/3] udp: Unify some more inbound/outbound parts of udp_sock_init() David Gibson
  2025-10-17  0:34 ` [PATCH 3/3] tcp, udp: Bind outbound listening sockets by interface instead of address David Gibson
  2 siblings, 2 replies; 14+ messages in thread
From: David Gibson @ 2025-10-17  0:34 UTC
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Surprisingly little logic is shared between the path for creating a
listen()ing socket in the guest namespace versus in the host namespace.
Improve this, by extending tcp_sock_init_one() to take a pif parameter
indicating where it should open the socket.  This allows
tcp_ns_sock_init[46]() to be removed entirely.

We generalise tcp_sock_init() in the same way, although we don't use it
yet, due to some subtle differences in how we bind for -t versus -T.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 conf.c |  2 +-
 tcp.c  | 96 ++++++++++++++++++----------------------------------------
 tcp.h  |  5 +--
 3 files changed, 33 insertions(+), 70 deletions(-)
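
Note: the table selection that tcp_sock_init_one() gains in this patch
can be sketched in isolation as below.  This is only an illustration
under stated assumptions, not code from the patch: record_listen_sock(),
sock_host and sock_ns are hypothetical stand-ins for
tcp_sock_init_one(), tcp_sock_init_ext and tcp_sock_ns, the PIF_*
values are made up, and the FWD_AUTO check is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for passt's definitions, for illustration only */
enum { PIF_HOST = 1, PIF_SPLICE = 2 };
enum { V4 = 0, V6 = 1, IP_VERSIONS = 2 };
#define NUM_PORTS 65536

static int sock_host[NUM_PORTS][IP_VERSIONS];	/* like tcp_sock_init_ext */
static int sock_ns[NUM_PORTS][IP_VERSIONS];	/* like tcp_sock_ns */

/* Record fd @s (negative on failure) for @port in the table matching @pif.
 * @v4 and @v6 say which IP versions the socket covers (both set: dual stack).
 */
static void record_listen_sock(uint8_t pif, uint16_t port,
			       int v4, int v6, int s)
{
	int (*socks)[IP_VERSIONS];

	assert(pif == PIF_HOST || pif == PIF_SPLICE);

	/* One code path for both directions: only the table differs */
	if (pif == PIF_SPLICE)
		socks = sock_ns;
	else
		socks = sock_host;

	if (v4)
		socks[port][V4] = s < 0 ? -1 : s;
	if (v6)
		socks[port][V6] = s < 0 ? -1 : s;
}
```

A dual-stack socket would be recorded with both v4 and v6 set, which
corresponds to the !addr case in the patch.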

diff --git a/conf.c b/conf.c
index 66b9e634..26f1bcc0 100644
--- a/conf.c
+++ b/conf.c
@@ -169,7 +169,7 @@ static void conf_ports_range_except(const struct ctx *c, char optname,
 		fwd->delta[i] = to - first;
 
 		if (optname == 't')
-			ret = tcp_sock_init(c, addr, ifname, i);
+			ret = tcp_sock_init(c, PIF_HOST, addr, ifname, i);
 		else if (optname == 'u')
 			ret = udp_sock_init(c, 0, addr, ifname, i);
 		else
diff --git a/tcp.c b/tcp.c
index 0f9e9b3f..15c012d7 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2515,29 +2515,38 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
 /**
  * tcp_sock_init_one() - Initialise listening socket for address and port
  * @c:		Execution context
+ * @pif:	Interface to open the socket for (PIF_HOST or PIF_SPLICE)
  * @addr:	Pointer to address for binding, NULL for dual stack any
  * @ifname:	Name of interface to bind to, NULL if not configured
  * @port:	Port, host order
  *
  * Return: fd for the new listening socket, negative error code on failure
+ *
+ * If pif == PIF_SPLICE, must have already entered the namespace.
  */
-static int tcp_sock_init_one(const struct ctx *c, const union inany_addr *addr,
-			     const char *ifname, in_port_t port)
+static int tcp_sock_init_one(const struct ctx *c, uint8_t pif,
+			     const union inany_addr *addr, const char *ifname,
+			     in_port_t port)
 {
+	const struct fwd_ports *fwd = pif == PIF_HOST ?
+		&c->tcp.fwd_in : &c->tcp.fwd_out;
 	union tcp_listen_epoll_ref tref = {
 		.port = port,
-		.pif = PIF_HOST,
+		.pif = pif,
 	};
 	int s;
 
-	s = pif_sock_l4(c, EPOLL_TYPE_TCP_LISTEN, PIF_HOST, addr,
-				ifname, port, tref.u32);
+	s = pif_sock_l4(c, EPOLL_TYPE_TCP_LISTEN, pif, addr, ifname,
+			port, tref.u32);
+
+	if (fwd->mode == FWD_AUTO) {
+		int (*socks)[IP_VERSIONS] = pif == PIF_SPLICE ?
+			tcp_sock_ns : tcp_sock_init_ext;
 
-	if (c->tcp.fwd_in.mode == FWD_AUTO) {
 		if (!addr || inany_v4(addr))
-			tcp_sock_init_ext[port][V4] = s < 0 ? -1 : s;
+			socks[port][V4] = s < 0 ? -1 : s;
 		if (!addr || !inany_v4(addr))
-			tcp_sock_init_ext[port][V6] = s < 0 ? -1 : s;
+			socks[port][V6] = s < 0 ? -1 : s;
 	}
 
 	if (s < 0)
@@ -2549,14 +2558,16 @@ static int tcp_sock_init_one(const struct ctx *c, const union inany_addr *addr,
 /**
  * tcp_sock_init() - Create listening sockets for a given host ("inbound") port
  * @c:		Execution context
+ * @pif:	Interface to open the socket for (PIF_HOST or PIF_SPLICE)
  * @addr:	Pointer to address for binding, NULL if not configured
  * @ifname:	Name of interface to bind to, NULL if not configured
  * @port:	Port, host order
  *
  * Return: 0 on (partial) success, negative error code on (complete) failure
  */
-int tcp_sock_init(const struct ctx *c, const union inany_addr *addr,
-		  const char *ifname, in_port_t port)
+int tcp_sock_init(const struct ctx *c, uint8_t pif,
+		  const union inany_addr *addr, const char *ifname,
+		  in_port_t port)
 {
 	int r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1;
 
@@ -2564,72 +2575,23 @@ int tcp_sock_init(const struct ctx *c, const union inany_addr *addr,
 
 	if (!addr && c->ifi4 && c->ifi6)
 		/* Attempt to get a dual stack socket */
-		if (tcp_sock_init_one(c, NULL, ifname, port) >= 0)
+		if (tcp_sock_init_one(c, pif, NULL, ifname, port) >= 0)
 			return 0;
 
 	/* Otherwise create a socket per IP version */
 	if ((!addr || inany_v4(addr)) && c->ifi4)
-		r4 = tcp_sock_init_one(c, addr ? addr : &inany_any4,
-				       ifname, port);
+		r4 = tcp_sock_init_one(c, pif,
+				       addr ? addr : &inany_any4, ifname, port);
 
 	if ((!addr || !inany_v4(addr)) && c->ifi6)
-		r6 = tcp_sock_init_one(c, addr ? addr : &inany_any6,
-				       ifname, port);
+		r6 = tcp_sock_init_one(c, pif,
+				       addr ? addr : &inany_any6, ifname, port);
 
 	if (IN_INTERVAL(0, FD_REF_MAX, r4) || IN_INTERVAL(0, FD_REF_MAX, r6))
 		return 0;
 
 	return r4 < 0 ? r4 : r6;
 }
-
-/**
- * tcp_ns_sock_init4() - Init socket to listen for outbound IPv4 connections
- * @c:		Execution context
- * @port:	Port, host order
- */
-static void tcp_ns_sock_init4(const struct ctx *c, in_port_t port)
-{
-	union tcp_listen_epoll_ref tref = {
-		.port = port,
-		.pif = PIF_SPLICE,
-	};
-	int s;
-
-	ASSERT(c->mode == MODE_PASTA);
-
-	s = pif_sock_l4(c, EPOLL_TYPE_TCP_LISTEN, PIF_SPLICE, &inany_loopback4,
-			NULL, port, tref.u32);
-	if (s < 0)
-		s = -1;
-
-	if (c->tcp.fwd_out.mode == FWD_AUTO)
-		tcp_sock_ns[port][V4] = s;
-}
-
-/**
- * tcp_ns_sock_init6() - Init socket to listen for outbound IPv6 connections
- * @c:		Execution context
- * @port:	Port, host order
- */
-static void tcp_ns_sock_init6(const struct ctx *c, in_port_t port)
-{
-	union tcp_listen_epoll_ref tref = {
-		.port = port,
-		.pif = PIF_SPLICE,
-	};
-	int s;
-
-	ASSERT(c->mode == MODE_PASTA);
-
-	s = pif_sock_l4(c, EPOLL_TYPE_TCP_LISTEN, PIF_SPLICE, &inany_loopback6,
-			NULL, port, tref.u32);
-	if (s < 0)
-		s = -1;
-
-	if (c->tcp.fwd_out.mode == FWD_AUTO)
-		tcp_sock_ns[port][V6] = s;
-}
-
 /**
  * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections
  * @c:		Execution context
@@ -2640,9 +2602,9 @@ static void tcp_ns_sock_init(const struct ctx *c, in_port_t port)
 	ASSERT(!c->no_tcp);
 
 	if (c->ifi4)
-		tcp_ns_sock_init4(c, port);
+		tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback4, NULL, port);
 	if (c->ifi6)
-		tcp_ns_sock_init6(c, port);
+		tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback6, NULL, port);
 }
 
 /**
@@ -2845,7 +2807,7 @@ static void tcp_port_rebind(struct ctx *c, bool outbound)
 			if (outbound)
 				tcp_ns_sock_init(c, port);
 			else
-				tcp_sock_init(c, NULL, NULL, port);
+				tcp_sock_init(c, PIF_HOST, NULL, NULL, port);
 		}
 	}
 }
diff --git a/tcp.h b/tcp.h
index 234a8033..fb22bac0 100644
--- a/tcp.h
+++ b/tcp.h
@@ -18,8 +18,9 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
 int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		    const void *saddr, const void *daddr, uint32_t flow_lbl,
 		    const struct pool *p, int idx, const struct timespec *now);
-int tcp_sock_init(const struct ctx *c, const union inany_addr *addr,
-		  const char *ifname, in_port_t port);
+int tcp_sock_init(const struct ctx *c, uint8_t pif,
+		  const union inany_addr *addr, const char *ifname,
+		  in_port_t port);
 int tcp_init(struct ctx *c);
 void tcp_timer(struct ctx *c, const struct timespec *now);
 void tcp_defer_handler(struct ctx *c);
-- 
2.51.0



* [PATCH 2/3] udp: Unify some more inbound/outbound parts of udp_sock_init()
  2025-10-17  0:34 [PATCH 0/3] RFC: Reduce differences between inbound and outbound socket binding David Gibson
  2025-10-17  0:34 ` [PATCH 1/3] tcp: Merge tcp_ns_sock_init[46]() into tcp_sock_init_one() David Gibson
@ 2025-10-17  0:34 ` David Gibson
  2025-10-21 21:51   ` Stefano Brivio
  2025-10-17  0:34 ` [PATCH 3/3] tcp, udp: Bind outbound listening sockets by interface instead of address David Gibson
  2 siblings, 1 reply; 14+ messages in thread
From: David Gibson @ 2025-10-17  0:34 UTC
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

udp_sock_init() takes an 'ns' parameter determining if it creates a socket
in the guest namespace or host namespace.  Alter it to take a pif
parameter instead, like tcp_sock_init(), and use that change to slightly
reduce code duplication between the HOST and SPLICE cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 conf.c |  2 +-
 udp.c  | 60 +++++++++++++++++++++++++++++-----------------------------
 udp.h  |  5 +++--
 3 files changed, 34 insertions(+), 33 deletions(-)
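
Note: after this patch, the HOST and SPLICE cases differ only in which
bind address they end up with.  A minimal sketch of that selection for
IPv4, using plain struct in_addr instead of union inany_addr, could look
like the following.  bind_addr4() and the PIF_* values are hypothetical
and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical values, for illustration only */
enum { PIF_HOST = 1, PIF_SPLICE = 2 };

/* Pick the IPv4 bind address: the configured address if any, otherwise the
 * unspecified address.  For PIF_SPLICE, loopback always wins.
 */
static struct in_addr bind_addr4(uint8_t pif, const struct in_addr *addr)
{
	struct in_addr a;

	assert(pif == PIF_HOST || pif == PIF_SPLICE);

	a.s_addr = addr ? addr->s_addr : htonl(INADDR_ANY);

	if (pif == PIF_SPLICE)
		a.s_addr = htonl(INADDR_LOOPBACK);

	return a;
}
```

With the address chosen up front, a single socket-opening call can
serve both cases, which is where the reduction in duplication comes
from.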

diff --git a/conf.c b/conf.c
index 26f1bcc0..08cb50aa 100644
--- a/conf.c
+++ b/conf.c
@@ -171,7 +171,7 @@ static void conf_ports_range_except(const struct ctx *c, char optname,
 		if (optname == 't')
 			ret = tcp_sock_init(c, PIF_HOST, addr, ifname, i);
 		else if (optname == 'u')
-			ret = udp_sock_init(c, 0, addr, ifname, i);
+			ret = udp_sock_init(c, PIF_HOST, addr, ifname, i);
 		else
 			/* No way to check in advance for -T and -U */
 			ret = 0;
diff --git a/udp.c b/udp.c
index 86585b7e..49dd0144 100644
--- a/udp.c
+++ b/udp.c
@@ -1093,64 +1093,63 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
 /**
  * udp_sock_init() - Initialise listening sockets for a given port
  * @c:		Execution context
- * @ns:		In pasta mode, if set, bind with loopback address in namespace
+ * @pif:	Interface to open the socket for (PIF_HOST or PIF_SPLICE)
  * @addr:	Pointer to address for binding, NULL if not configured
  * @ifname:	Name of interface to bind to, NULL if not configured
  * @port:	Port, host order
  *
  * Return: 0 on (partial) success, negative error code on (complete) failure
  */
-int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
-		  const char *ifname, in_port_t port)
+int udp_sock_init(const struct ctx *c, uint8_t pif,
+		  const union inany_addr *addr, const char *ifname,
+		  in_port_t port)
 {
+	int (*socks)[NUM_PORTS] = pif == PIF_HOST ?
+		udp_splice_init : udp_splice_ns;
 	union udp_listen_epoll_ref uref = {
-		.pif = ns ? PIF_SPLICE : PIF_HOST,
+		.pif = pif,
 		.port = port,
 	};
 	int r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1;
 
 	ASSERT(!c->no_udp);
 
-	if (!addr && c->ifi4 && c->ifi6 && !ns) {
+	if (!addr && c->ifi4 && c->ifi6 && (pif == PIF_HOST)) {
 		int s;
 
 		/* Attempt to get a dual stack socket */
 		s = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST,
 				NULL, ifname, port, uref.u32);
-		udp_splice_init[V4][port] = s < 0 ? -1 : s;
-		udp_splice_init[V6][port] = s < 0 ? -1 : s;
+		socks[V4][port] = s < 0 ? -1 : s;
+		socks[V6][port] = s < 0 ? -1 : s;
 		if (IN_INTERVAL(0, FD_REF_MAX, s))
 			return 0;
 	}
 
 	if ((!addr || inany_v4(addr)) && c->ifi4) {
-		if (!ns) {
-			r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST,
-					 addr ? addr : &inany_any4, ifname,
-					 port, uref.u32);
+		const union inany_addr *a = addr ?
+			addr : &inany_any4;
 
-			udp_splice_init[V4][port] = r4 < 0 ? -1 : r4;
-		} else {
-			r4  = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_SPLICE,
-					  &inany_loopback4, ifname,
-					  port, uref.u32);
-			udp_splice_ns[V4][port] = r4 < 0 ? -1 : r4;
-		}
+		if (pif == PIF_SPLICE)
+			a = &inany_loopback4;
+
+		r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
+				 port, uref.u32);
+
+		socks[V4][port] = r4 < 0 ? -1 : r4;
 	}
 
 	if ((!addr || !inany_v4(addr)) && c->ifi6) {
-		if (!ns) {
-			r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST,
-					 addr ? addr : &inany_any6, ifname,
-					 port, uref.u32);
+		const union inany_addr *a = addr ?
+			addr : &inany_any6;
 
-			udp_splice_init[V6][port] = r6 < 0 ? -1 : r6;
-		} else {
-			r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_SPLICE,
-					 &inany_loopback6, ifname,
-					 port, uref.u32);
-			udp_splice_ns[V6][port] = r6 < 0 ? -1 : r6;
-		}
+		if (pif == PIF_SPLICE)
+			a = &inany_loopback6;
+
+		r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
+				 port, uref.u32);
+
+		socks[V6][port] = r6 < 0 ? -1 : r6;
 	}
 
 	if (IN_INTERVAL(0, FD_REF_MAX, r4) || IN_INTERVAL(0, FD_REF_MAX, r6))
@@ -1216,7 +1215,8 @@ static void udp_port_rebind(struct ctx *c, bool outbound)
 
 		if ((c->ifi4 && socks[V4][port] == -1) ||
 		    (c->ifi6 && socks[V6][port] == -1))
-			udp_sock_init(c, outbound, NULL, NULL, port);
+			udp_sock_init(c, outbound ? PIF_SPLICE : PIF_HOST,
+				      NULL, NULL, port);
 	}
 }
 
diff --git a/udp.h b/udp.h
index 8f8531ad..f78dc528 100644
--- a/udp.h
+++ b/udp.h
@@ -17,8 +17,9 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
 		    sa_family_t af, const void *saddr, const void *daddr,
 		    uint8_t ttl, const struct pool *p, int idx,
 		    const struct timespec *now);
-int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
-		  const char *ifname, in_port_t port);
+int udp_sock_init(const struct ctx *c, uint8_t pif,
+		  const union inany_addr *addr, const char *ifname,
+		  in_port_t port);
 int udp_init(struct ctx *c);
 void udp_timer(struct ctx *c, const struct timespec *now);
 void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);
-- 
2.51.0



* [PATCH 3/3] tcp, udp: Bind outbound listening sockets by interface instead of address
  2025-10-17  0:34 [PATCH 0/3] RFC: Reduce differences between inbound and outbound socket binding David Gibson
  2025-10-17  0:34 ` [PATCH 1/3] tcp: Merge tcp_ns_sock_init[46]() into tcp_sock_init_one() David Gibson
  2025-10-17  0:34 ` [PATCH 2/3] udp: Unify some more inbound/outbound parts of udp_sock_init() David Gibson
@ 2025-10-17  0:34 ` David Gibson
  2025-10-21 21:51   ` Stefano Brivio
  2 siblings, 1 reply; 14+ messages in thread
From: David Gibson @ 2025-10-17  0:34 UTC
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Currently, outbound forwards (-T, -U) are handled by sockets bound to the
loopback address.  Typically we create two sockets, one for 127.0.0.1 and
one for ::1.

This has some disadvantages:
 * The guest can't connect to these services using its global IP address;
   it must explicitly use 127.0.0.1 or ::1 (bug 100)
 * The guest can't even connect via 127.0.0.0/8 addresses other than
   127.0.0.1
 * We can't use dual-stack sockets; we need separate sockets for IPv4
   and IPv6.

The restriction exists for a reason, though.  If the guest has any interfaces
other than pasta (e.g. a VPN tunnel), external hosts could reach the host
via the forwards.  Especially combined with -T auto / -U auto, this would
make it very easy to make a mistake with nasty security implications.

We can lift these limitations and still keep that protection, however, if
we don't bind the outbound listening sockets to a particular address, but
_do_ use SO_BINDTODEVICE to restrict them to the "lo" interface.

Link: https://bugs.passt.top/show_bug.cgi?id=100

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 pif.c |  6 ------
 tcp.c | 18 ++----------------
 udp.c | 27 ++++++++++-----------------
 3 files changed, 12 insertions(+), 39 deletions(-)

diff --git a/pif.c b/pif.c
index 592fafaa..84e3ceae 100644
--- a/pif.c
+++ b/pif.c
@@ -87,12 +87,6 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
 
 	ASSERT(pif_is_socket(pif));
 
-	if (pif == PIF_SPLICE) {
-		/* Sanity checks */
-		ASSERT(!ifname);
-		ASSERT(addr && inany_is_loopback(addr));
-	}
-
 	if (!addr)
 		return sock_l4_sa(c, type, &sa, sizeof(sa.sa6),
 				  ifname, false, data);
diff --git a/tcp.c b/tcp.c
index 15c012d7..982c9190 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2592,20 +2592,6 @@ int tcp_sock_init(const struct ctx *c, uint8_t pif,
 
 	return r4 < 0 ? r4 : r6;
 }
-/**
- * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections
- * @c:		Execution context
- * @port:	Port, host order
- */
-static void tcp_ns_sock_init(const struct ctx *c, in_port_t port)
-{
-	ASSERT(!c->no_tcp);
-
-	if (c->ifi4)
-		tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback4, NULL, port);
-	if (c->ifi6)
-		tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback6, NULL, port);
-}
 
 /**
  * tcp_ns_socks_init() - Bind sockets in namespace for outbound connections
@@ -2625,7 +2611,7 @@ static int tcp_ns_socks_init(void *arg)
 		if (!bitmap_isset(c->tcp.fwd_out.map, port))
 			continue;
 
-		tcp_ns_sock_init(c, port);
+		tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);
 	}
 
 	return 0;
@@ -2805,7 +2791,7 @@ static void tcp_port_rebind(struct ctx *c, bool outbound)
 		if ((c->ifi4 && socks[port][V4] == -1) ||
 		    (c->ifi6 && socks[port][V6] == -1)) {
 			if (outbound)
-				tcp_ns_sock_init(c, port);
+				tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);
 			else
 				tcp_sock_init(c, PIF_HOST, NULL, NULL, port);
 		}
diff --git a/udp.c b/udp.c
index 49dd0144..e38114eb 100644
--- a/udp.c
+++ b/udp.c
@@ -1127,26 +1127,16 @@ int udp_sock_init(const struct ctx *c, uint8_t pif,
 	}
 
 	if ((!addr || inany_v4(addr)) && c->ifi4) {
-		const union inany_addr *a = addr ?
-			addr : &inany_any4;
-
-		if (pif == PIF_SPLICE)
-			a = &inany_loopback4;
-
-		r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
+		r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif,
+				 addr ? addr : &inany_any4, ifname,
 				 port, uref.u32);
 
 		socks[V4][port] = r4 < 0 ? -1 : r4;
 	}
 
 	if ((!addr || !inany_v4(addr)) && c->ifi6) {
-		const union inany_addr *a = addr ?
-			addr : &inany_any6;
-
-		if (pif == PIF_SPLICE)
-			a = &inany_loopback6;
-
-		r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
+		r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif,
+				 addr ? addr : &inany_any6, ifname,
 				 port, uref.u32);
 
 		socks[V6][port] = r6 < 0 ? -1 : r6;
@@ -1214,9 +1204,12 @@ static void udp_port_rebind(struct ctx *c, bool outbound)
 			continue;
 
 		if ((c->ifi4 && socks[V4][port] == -1) ||
-		    (c->ifi6 && socks[V6][port] == -1))
-			udp_sock_init(c, outbound ? PIF_SPLICE : PIF_HOST,
-				      NULL, NULL, port);
+		    (c->ifi6 && socks[V6][port] == -1)) {
+			if (outbound)
+				udp_sock_init(c, PIF_SPLICE, NULL, "lo", port);
+			else
+				udp_sock_init(c, PIF_HOST, NULL, NULL, port);
+		}
 	}
 }
 
-- 
2.51.0



* Re: [PATCH 1/3] tcp: Merge tcp_ns_sock_init[46]() into tcp_sock_init_one()
  2025-10-17  0:34 ` [PATCH 1/3] tcp: Merge tcp_ns_sock_init[46]() into tcp_sock_init_one() David Gibson
@ 2025-10-20  6:08   ` Stefano Brivio
  2025-10-20  9:24     ` David Gibson
  2025-10-20  6:09   ` Stefano Brivio
  1 sibling, 1 reply; 14+ messages in thread
From: Stefano Brivio @ 2025-10-20  6:08 UTC
  To: David Gibson; +Cc: passt-dev

On Fri, 17 Oct 2025 11:34:45 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> Surprisingly little logic is shared between the path for creating a
> listen()ing socket in the guest namespace versus in the host namespace.
> Improve this, by extending tcp_sock_init_one() to take a pif parameter
> indicating where it should open the socket.  This allows
> tcp_ns_sock_init[46]() to be removed entirely.
> 
> We generalise tcp_sock_init() in the same way, although we don't use it
> yet, due to some subtle differences in how we bind for -t versus -T.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  conf.c |  2 +-
>  tcp.c  | 96 ++++++++++++++++++----------------------------------------
>  tcp.h  |  5 +--
>  3 files changed, 33 insertions(+), 70 deletions(-)
> 
> diff --git a/conf.c b/conf.c
> index 66b9e634..26f1bcc0 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -169,7 +169,7 @@ static void conf_ports_range_except(const struct ctx *c, char optname,
>  		fwd->delta[i] = to - first;
>  
>  		if (optname == 't')
> -			ret = tcp_sock_init(c, addr, ifname, i);
> +			ret = tcp_sock_init(c, PIF_HOST, addr, ifname, i);
>  		else if (optname == 'u')
>  			ret = udp_sock_init(c, 0, addr, ifname, i);
>  		else
> diff --git a/tcp.c b/tcp.c
> index 0f9e9b3f..15c012d7 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -2515,29 +2515,38 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
>  /**
>   * tcp_sock_init_one() - Initialise listening socket for address and port
>   * @c:		Execution context
> + * @pif:	Interface to open the socket for (PIF_HOST or PIF_SPLICE)
>   * @addr:	Pointer to address for binding, NULL for dual stack any
>   * @ifname:	Name of interface to bind to, NULL if not configured
>   * @port:	Port, host order
>   *
>   * Return: fd for the new listening socket, negative error code on failure
> + *
> + * If pif == PIF_SPLICE, must have already entered the namespace.
>   */
> -static int tcp_sock_init_one(const struct ctx *c, const union inany_addr *addr,
> -			     const char *ifname, in_port_t port)
> +static int tcp_sock_init_one(const struct ctx *c, uint8_t pif,
> +			     const union inany_addr *addr, const char *ifname,
> +			     in_port_t port)
>  {
> +	const struct fwd_ports *fwd = pif == PIF_HOST ?
> +		&c->tcp.fwd_in : &c->tcp.fwd_out;

While I appreciate the resulting brevity, I wonder if it would make
more sense to have this as an explicit if / else clause, for
readability. Same for similar occurrences in the next patches (which I
didn't fully review, yet).

Another alternative is:

	const struct fwd_ports *fwd;

	fwd = (pif == PIF_HOST) ? &c->tcp.fwd_in : &c->tcp.fwd_out;

...still two lines of code, perhaps just slightly less readable than
the five obvious ones:

	const struct fwd_ports *fwd;

	if (pif == PIF_HOST)
		fwd = &c->tcp.fwd_in;
	else
		fwd = &c->tcp.fwd_out;

-- 
Stefano



* Re: [PATCH 1/3] tcp: Merge tcp_ns_sock_init[46]() into tcp_sock_init_one()
  2025-10-17  0:34 ` [PATCH 1/3] tcp: Merge tcp_ns_sock_init[46]() into tcp_sock_init_one() David Gibson
  2025-10-20  6:08   ` Stefano Brivio
@ 2025-10-20  6:09   ` Stefano Brivio
  2025-10-20  9:25     ` David Gibson
  1 sibling, 1 reply; 14+ messages in thread
From: Stefano Brivio @ 2025-10-20  6:09 UTC
  To: David Gibson; +Cc: passt-dev

On Fri, 17 Oct 2025 11:34:45 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> Surprisingly little logic is shared between the path for creating a
> listen()ing socket in the guest namespace versus in the host namespace.
> Improve this, by extending tcp_sock_init_one() to take a pif parameter
> indicating where it should open the socket.  This allows
> tcp_ns_sock_init[46]() to be removed entirely.
> 
> We generalise tcp_sock_init() in the same way, although we don't use it
> yet, due to some subtle differences in how we bind for -t versus -T.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  conf.c |  2 +-
>  tcp.c  | 96 ++++++++++++++++++----------------------------------------
>  tcp.h  |  5 +--
>  3 files changed, 33 insertions(+), 70 deletions(-)
> 
> diff --git a/conf.c b/conf.c
> index 66b9e634..26f1bcc0 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -169,7 +169,7 @@ static void conf_ports_range_except(const struct ctx *c, char optname,
>  		fwd->delta[i] = to - first;
>  
>  		if (optname == 't')
> -			ret = tcp_sock_init(c, addr, ifname, i);
> +			ret = tcp_sock_init(c, PIF_HOST, addr, ifname, i);
>  		else if (optname == 'u')
>  			ret = udp_sock_init(c, 0, addr, ifname, i);
>  		else
> diff --git a/tcp.c b/tcp.c
> index 0f9e9b3f..15c012d7 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -2515,29 +2515,38 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
>  /**
>   * tcp_sock_init_one() - Initialise listening socket for address and port
>   * @c:		Execution context
> + * @pif:	Interface to open the socket for (PIF_HOST or PIF_SPLICE)
>   * @addr:	Pointer to address for binding, NULL for dual stack any
>   * @ifname:	Name of interface to bind to, NULL if not configured
>   * @port:	Port, host order
>   *
>   * Return: fd for the new listening socket, negative error code on failure
> + *
> + * If pif == PIF_SPLICE, must have already entered the namespace.

...I forgot: nit: a subject in this sentence would be nice for
readability.

-- 
Stefano



* Re: [PATCH 1/3] tcp: Merge tcp_ns_sock_init[46]() into tcp_sock_init_one()
  2025-10-20  6:08   ` Stefano Brivio
@ 2025-10-20  9:24     ` David Gibson
  0 siblings, 0 replies; 14+ messages in thread
From: David Gibson @ 2025-10-20  9:24 UTC
  To: Stefano Brivio; +Cc: passt-dev


On Mon, Oct 20, 2025 at 08:08:39AM +0200, Stefano Brivio wrote:
> On Fri, 17 Oct 2025 11:34:45 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > Surprisingly little logic is shared between the path for creating a
> > listen()ing socket in the guest namespace versus in the host namespace.
> > Improve this, by extending tcp_sock_init_one() to take a pif parameter
> > indicating where it should open the socket.  This allows
> > tcp_ns_sock_init[46]() to be removed entirely.
> > 
> > We generalise tcp_sock_init() in the same way, although we don't use it
> > yet, due to some subtle differences in how we bind for -t versus -T.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  conf.c |  2 +-
> >  tcp.c  | 96 ++++++++++++++++++----------------------------------------
> >  tcp.h  |  5 +--
> >  3 files changed, 33 insertions(+), 70 deletions(-)
> > 
> > diff --git a/conf.c b/conf.c
> > index 66b9e634..26f1bcc0 100644
> > --- a/conf.c
> > +++ b/conf.c
> > @@ -169,7 +169,7 @@ static void conf_ports_range_except(const struct ctx *c, char optname,
> >  		fwd->delta[i] = to - first;
> >  
> >  		if (optname == 't')
> > -			ret = tcp_sock_init(c, addr, ifname, i);
> > +			ret = tcp_sock_init(c, PIF_HOST, addr, ifname, i);
> >  		else if (optname == 'u')
> >  			ret = udp_sock_init(c, 0, addr, ifname, i);
> >  		else
> > diff --git a/tcp.c b/tcp.c
> > index 0f9e9b3f..15c012d7 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -2515,29 +2515,38 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
> >  /**
> >   * tcp_sock_init_one() - Initialise listening socket for address and port
> >   * @c:		Execution context
> > + * @pif:	Interface to open the socket for (PIF_HOST or PIF_SPLICE)
> >   * @addr:	Pointer to address for binding, NULL for dual stack any
> >   * @ifname:	Name of interface to bind to, NULL if not configured
> >   * @port:	Port, host order
> >   *
> >   * Return: fd for the new listening socket, negative error code on failure
> > + *
> > + * If pif == PIF_SPLICE, must have already entered the namespace.
> >   */
> > -static int tcp_sock_init_one(const struct ctx *c, const union inany_addr *addr,
> > -			     const char *ifname, in_port_t port)
> > +static int tcp_sock_init_one(const struct ctx *c, uint8_t pif,
> > +			     const union inany_addr *addr, const char *ifname,
> > +			     in_port_t port)
> >  {
> > +	const struct fwd_ports *fwd = pif == PIF_HOST ?
> > +		&c->tcp.fwd_in : &c->tcp.fwd_out;
> 
> While I appreciate the resulting brevity, I wonder if it would make
> more sense to have this as an explicit if / else clause, for
> readability. Same for similar occurrences in the next patches (which I
> didn't fully review, yet).
> 
> Another alternative is:
> 
> 	const struct fwd_ports *fwd;
> 
> 	fwd = (pif == PIF_HOST) ? &c->tcp.fwd_in : &c->tcp.fwd_out;
> 
> ...still two lines of code, perhaps just slightly less readable than
> the five obvious ones:
> 
> 	const struct fwd_ports *fwd;
> 
> 	if (pif == PIF_HOST)
> 		fwd = &c->tcp.fwd_in;
> 	else
> 		fwd = &c->tcp.fwd_out;

Good point.  I suspect this will be shuffled again in later patches,
but I might as well go with the less obfuscated version in the
meantime.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson


* Re: [PATCH 1/3] tcp: Merge tcp_ns_sock_init[46]() into tcp_sock_init_one()
  2025-10-20  6:09   ` Stefano Brivio
@ 2025-10-20  9:25     ` David Gibson
  0 siblings, 0 replies; 14+ messages in thread
From: David Gibson @ 2025-10-20  9:25 UTC
  To: Stefano Brivio; +Cc: passt-dev


On Mon, Oct 20, 2025 at 08:09:59AM +0200, Stefano Brivio wrote:
> On Fri, 17 Oct 2025 11:34:45 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > Surprisingly little logic is shared between the path for creating a
> > listen()ing socket in the guest namespace versus in the host namespace.
> > Improve this, by extending tcp_sock_init_one() to take a pif parameter
> > indicating where it should open the socket.  This allows
> > tcp_ns_sock_init[46]() to be removed entirely.
> > 
> > We generalise tcp_sock_init() in the same way, although we don't use it
> > yet, due to some subtle differences in how we bind for -t versus -T.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  conf.c |  2 +-
> >  tcp.c  | 96 ++++++++++++++++++----------------------------------------
> >  tcp.h  |  5 +--
> >  3 files changed, 33 insertions(+), 70 deletions(-)
> > 
> > diff --git a/conf.c b/conf.c
> > index 66b9e634..26f1bcc0 100644
> > --- a/conf.c
> > +++ b/conf.c
> > @@ -169,7 +169,7 @@ static void conf_ports_range_except(const struct ctx *c, char optname,
> >  		fwd->delta[i] = to - first;
> >  
> >  		if (optname == 't')
> > -			ret = tcp_sock_init(c, addr, ifname, i);
> > +			ret = tcp_sock_init(c, PIF_HOST, addr, ifname, i);
> >  		else if (optname == 'u')
> >  			ret = udp_sock_init(c, 0, addr, ifname, i);
> >  		else
> > diff --git a/tcp.c b/tcp.c
> > index 0f9e9b3f..15c012d7 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -2515,29 +2515,38 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
> >  /**
> >   * tcp_sock_init_one() - Initialise listening socket for address and port
> >   * @c:		Execution context
> > + * @pif:	Interface to open the socket for (PIF_HOST or PIF_SPLICE)
> >   * @addr:	Pointer to address for binding, NULL for dual stack any
> >   * @ifname:	Name of interface to bind to, NULL if not configured
> >   * @port:	Port, host order
> >   *
> >   * Return: fd for the new listening socket, negative error code on failure
> > + *
> > + * If pif == PIF_SPLICE, must have already entered the namespace.
> 
> ...I forgot: nit: a subject in this sentence would be nice for
> readability.

Noted, will adjust.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson


* Re: [PATCH 2/3] udp: Unify some more inbound/outbound parts of udp_sock_init()
  2025-10-17  0:34 ` [PATCH 2/3] udp: Unify some more inbound/outbound parts of udp_sock_init() David Gibson
@ 2025-10-21 21:51   ` Stefano Brivio
  2025-10-22  0:08     ` David Gibson
  0 siblings, 1 reply; 14+ messages in thread
From: Stefano Brivio @ 2025-10-21 21:51 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 17 Oct 2025 11:34:46 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> udp_sock_init() takes an 'ns' parameter determining if it creates a socket
> in the guest namespace or host namespace.  Alter it to take a pif
> parameter instead, like tcp_sock_init(), and use that change to slightly
> reduce code duplication between the HOST and SPLICE cases.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  conf.c |  2 +-
>  udp.c  | 60 +++++++++++++++++++++++++++++-----------------------------
>  udp.h  |  5 +++--
>  3 files changed, 34 insertions(+), 33 deletions(-)
> 
> diff --git a/conf.c b/conf.c
> index 26f1bcc0..08cb50aa 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -171,7 +171,7 @@ static void conf_ports_range_except(const struct ctx *c, char optname,
>  		if (optname == 't')
>  			ret = tcp_sock_init(c, PIF_HOST, addr, ifname, i);
>  		else if (optname == 'u')
> -			ret = udp_sock_init(c, 0, addr, ifname, i);
> +			ret = udp_sock_init(c, PIF_HOST, addr, ifname, i);
>  		else
>  			/* No way to check in advance for -T and -U */
>  			ret = 0;
> diff --git a/udp.c b/udp.c
> index 86585b7e..49dd0144 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -1093,64 +1093,63 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
>  /**
>   * udp_sock_init() - Initialise listening sockets for a given port
>   * @c:		Execution context
> - * @ns:		In pasta mode, if set, bind with loopback address in namespace
> + * @pif:	Interface to open the socket for (PIF_HOST or PIF_SPLICE)
>   * @addr:	Pointer to address for binding, NULL if not configured
>   * @ifname:	Name of interface to bind to, NULL if not configured
>   * @port:	Port, host order
>   *
>   * Return: 0 on (partial) success, negative error code on (complete) failure
>   */
> -int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
> -		  const char *ifname, in_port_t port)
> +int udp_sock_init(const struct ctx *c, uint8_t pif,
> +		  const union inany_addr *addr, const char *ifname,
> +		  in_port_t port)
>  {
> +	int (*socks)[NUM_PORTS] = pif == PIF_HOST ?
> +		udp_splice_init : udp_splice_ns;

Same as on v1: I'd rather avoid cramping ternary operators like this.
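
To illustrate what I mean, a minimal sketch (not necessarily the form a later revision should take): pick the table with a plain if/else. The enum values, the two arrays and socks_for_pif() are stand-ins made up for this example, not passt's definitions.

```c
#include <stddef.h>

#define NUM_PORTS	(1U << 16)

/* Stand-ins for passt's pif and IP version constants (assumed values) */
enum { PIF_HOST = 1, PIF_SPLICE = 2 };
enum { V4, V6, IP_VERSIONS };

/* Hypothetical per-pif socket tables, indexed by [version][port] */
int socks_host[IP_VERSIONS][NUM_PORTS];
int socks_ns[IP_VERSIONS][NUM_PORTS];

/* Pick the table for a pif without a ternary; NULL for unknown pifs */
int (*socks_for_pif(int pif))[NUM_PORTS]
{
	if (pif == PIF_HOST)
		return socks_host;

	if (pif == PIF_SPLICE)
		return socks_ns;

	return NULL;
}
```

The caller can then index the result as socks[V4][port], exactly as with the ternary version.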

>  	union udp_listen_epoll_ref uref = {
> -		.pif = ns ? PIF_SPLICE : PIF_HOST,
> +		.pif = pif,
>  		.port = port,
>  	};
>  	int r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1;
>  
>  	ASSERT(!c->no_udp);
>  
> -	if (!addr && c->ifi4 && c->ifi6 && !ns) {
> +	if (!addr && c->ifi4 && c->ifi6 && (pif == PIF_HOST)) {

I think it's more readable without the extra parentheses around the
comparison, because when I see those I automatically think of an
assignment we want to use in a conditional clause, but that's just a
comparison.

>  		int s;
>  
>  		/* Attempt to get a dual stack socket */
>  		s = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST,
>  				NULL, ifname, port, uref.u32);
> -		udp_splice_init[V4][port] = s < 0 ? -1 : s;
> -		udp_splice_init[V6][port] = s < 0 ? -1 : s;
> +		socks[V4][port] = s < 0 ? -1 : s;
> +		socks[V6][port] = s < 0 ? -1 : s;
>  		if (IN_INTERVAL(0, FD_REF_MAX, s))
>  			return 0;
>  	}
>  
>  	if ((!addr || inany_v4(addr)) && c->ifi4) {
> -		if (!ns) {
> -			r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST,
> -					 addr ? addr : &inany_any4, ifname,
> -					 port, uref.u32);
> +		const union inany_addr *a = addr ?
> +			addr : &inany_any4;
>  
> -			udp_splice_init[V4][port] = r4 < 0 ? -1 : r4;
> -		} else {
> -			r4  = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_SPLICE,
> -					  &inany_loopback4, ifname,
> -					  port, uref.u32);
> -			udp_splice_ns[V4][port] = r4 < 0 ? -1 : r4;
> -		}
> +		if (pif == PIF_SPLICE)
> +			a = &inany_loopback4;
> +
> +		r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
> +				 port, uref.u32);
> +
> +		socks[V4][port] = r4 < 0 ? -1 : r4;
>  	}
>  
>  	if ((!addr || !inany_v4(addr)) && c->ifi6) {
> -		if (!ns) {
> -			r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST,
> -					 addr ? addr : &inany_any6, ifname,
> -					 port, uref.u32);
> +		const union inany_addr *a = addr ?
> +			addr : &inany_any6;
>  
> -			udp_splice_init[V6][port] = r6 < 0 ? -1 : r6;
> -		} else {
> -			r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_SPLICE,
> -					 &inany_loopback6, ifname,
> -					 port, uref.u32);
> -			udp_splice_ns[V6][port] = r6 < 0 ? -1 : r6;
> -		}
> +		if (pif == PIF_SPLICE)
> +			a = &inany_loopback6;
> +
> +		r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
> +				 port, uref.u32);
> +
> +		socks[V6][port] = r6 < 0 ? -1 : r6;
>  	}
>  
>  	if (IN_INTERVAL(0, FD_REF_MAX, r4) || IN_INTERVAL(0, FD_REF_MAX, r6))
> @@ -1216,7 +1215,8 @@ static void udp_port_rebind(struct ctx *c, bool outbound)
>  
>  		if ((c->ifi4 && socks[V4][port] == -1) ||
>  		    (c->ifi6 && socks[V6][port] == -1))
> -			udp_sock_init(c, outbound, NULL, NULL, port);
> +			udp_sock_init(c, outbound ? PIF_SPLICE : PIF_HOST,
> +				      NULL, NULL, port);
>  	}
>  }
>  
> diff --git a/udp.h b/udp.h
> index 8f8531ad..f78dc528 100644
> --- a/udp.h
> +++ b/udp.h
> @@ -17,8 +17,9 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
>  		    sa_family_t af, const void *saddr, const void *daddr,
>  		    uint8_t ttl, const struct pool *p, int idx,
>  		    const struct timespec *now);
> -int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
> -		  const char *ifname, in_port_t port);
> +int udp_sock_init(const struct ctx *c, uint8_t pif,
> +		  const union inany_addr *addr, const char *ifname,
> +		  in_port_t port);
>  int udp_init(struct ctx *c);
>  void udp_timer(struct ctx *c, const struct timespec *now);
>  void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);

-- 
Stefano



* Re: [PATCH 3/3] tcp, udp: Bind outbound listening sockets by interface instead of address
  2025-10-17  0:34 ` [PATCH 3/3] tcp, udp: Bind outbound listening sockets by interface instead of address David Gibson
@ 2025-10-21 21:51   ` Stefano Brivio
  2025-10-22  0:34     ` David Gibson
  0 siblings, 1 reply; 14+ messages in thread
From: Stefano Brivio @ 2025-10-21 21:51 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 17 Oct 2025 11:34:47 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> Currently, outbound forwards (-T, -U) are handled by sockets bound to the
> loopback address.  Typically we create two sockets, one for 127.0.0.1 and
> one for ::1.
> 
> This has some disadvantages:
>  * The guest can't connect to these services using its global IP address,
>    it must explicitly use 127.0.0.1 or ::1 (bug 100)
>  * The guest can't even connect via 127.0.0.0/8 addresses other than
>    127.0.0.1
>  * We can't use dual-stack sockets, we have to have separate sockets for
>    IPv4 and IPv6.
> 
> The restriction exists for a reason though.  If the guest has any interfaces
> other than pasta (e.g. a VPN tunnel) external hosts could reach the host
> via the forwards.  Especially combined with -T auto / -U auto this would
> make it very easy to make a mistake with nasty security implications.
> 
> We can achieve both goals, however, if we don't bind the outbound listening
> sockets to a particular address, but _do_ use SO_BINDTODEVICE to restrict
> them to the "lo" interface.

Nice trick, I didn't think of it. I wonder if doing the same host-side
might help solve part of https://bugs.passt.top/show_bug.cgi?id=113
as well.

> Link: https://bugs.passt.top/show_bug.cgi?id=100
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  pif.c |  6 ------
>  tcp.c | 18 ++----------------
>  udp.c | 27 ++++++++++-----------------
>  3 files changed, 12 insertions(+), 39 deletions(-)
> 
> diff --git a/pif.c b/pif.c
> index 592fafaa..84e3ceae 100644
> --- a/pif.c
> +++ b/pif.c
> @@ -87,12 +87,6 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
>  
>  	ASSERT(pif_is_socket(pif));
>  
> -	if (pif == PIF_SPLICE) {
> -		/* Sanity checks */
> -		ASSERT(!ifname);
> -		ASSERT(addr && inany_is_loopback(addr));
> -	}
> -
>  	if (!addr)
>  		return sock_l4_sa(c, type, &sa, sizeof(sa.sa6),
>  				  ifname, false, data);
> diff --git a/tcp.c b/tcp.c
> index 15c012d7..982c9190 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -2592,20 +2592,6 @@ int tcp_sock_init(const struct ctx *c, uint8_t pif,
>  
>  	return r4 < 0 ? r4 : r6;
>  }
> -/**
> - * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections
> - * @c:		Execution context
> - * @port:	Port, host order
> - */
> -static void tcp_ns_sock_init(const struct ctx *c, in_port_t port)
> -{
> -	ASSERT(!c->no_tcp);
> -
> -	if (c->ifi4)
> -		tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback4, NULL, port);
> -	if (c->ifi6)
> -		tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback6, NULL, port);
> -}
>  
>  /**
>   * tcp_ns_socks_init() - Bind sockets in namespace for outbound connections
> @@ -2625,7 +2611,7 @@ static int tcp_ns_socks_init(void *arg)
>  		if (!bitmap_isset(c->tcp.fwd_out.map, port))
>  			continue;
>  
> -		tcp_ns_sock_init(c, port);
> +		tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);

I thought the "lo" string would be part of the Linux UAPI, but that's
not the case, and loopback_net_init() just calls:

	alloc_netdev(0, "lo", NET_NAME_PREDICTABLE, loopback_setup);

so I think it's relatively unproblematic to hardcode that as well, and it
looks like we can't create a second loopback interface, even though:

$ pasta -- sh -c 'ip link set dev lo down; ip link change dev lo name lol; ip link show lol'
1: lol: <LOOPBACK> mtu 65536 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

I don't have any quick solution and I don't think we care enough to
write a function in netlink.c fetching links with loopback type, so I'm
totally fine with this as it is.

By the way, if we fail to use SO_BINDTODEVICE, we already defensively
close the socket. The only possible flaw that occurs to me is that
somebody could rename 'lo' and then create a link called 'lo' of a
different type. But that needs CAP_NET_ADMIN in the container anyway.

>  	}
>  
>  	return 0;
> @@ -2805,7 +2791,7 @@ static void tcp_port_rebind(struct ctx *c, bool outbound)
>  		if ((c->ifi4 && socks[port][V4] == -1) ||
>  		    (c->ifi6 && socks[port][V6] == -1)) {
>  			if (outbound)
> -				tcp_ns_sock_init(c, port);
> +				tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);

Should we have/keep a fallback for pre-5.7 / pre-c427bfec18f2 kernels?

>  			else
>  				tcp_sock_init(c, PIF_HOST, NULL, NULL, port);
>  		}
> diff --git a/udp.c b/udp.c
> index 49dd0144..e38114eb 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -1127,26 +1127,16 @@ int udp_sock_init(const struct ctx *c, uint8_t pif,
>  	}
>  
>  	if ((!addr || inany_v4(addr)) && c->ifi4) {
> -		const union inany_addr *a = addr ?
> -			addr : &inany_any4;
> -
> -		if (pif == PIF_SPLICE)
> -			a = &inany_loopback4;
> -
> -		r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
> +		r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif,
> +				 addr ? addr : &inany_any4, ifname,
>  				 port, uref.u32);
>  
>  		socks[V4][port] = r4 < 0 ? -1 : r4;
>  	}
>  
>  	if ((!addr || !inany_v4(addr)) && c->ifi6) {
> -		const union inany_addr *a = addr ?
> -			addr : &inany_any6;
> -
> -		if (pif == PIF_SPLICE)
> -			a = &inany_loopback6;
> -
> -		r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
> +		r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif,
> +				 addr ? addr : &inany_any6, ifname,
>  				 port, uref.u32);
>  
>  		socks[V6][port] = r6 < 0 ? -1 : r6;
> @@ -1214,9 +1204,12 @@ static void udp_port_rebind(struct ctx *c, bool outbound)
>  			continue;
>  
>  		if ((c->ifi4 && socks[V4][port] == -1) ||
> -		    (c->ifi6 && socks[V6][port] == -1))
> -			udp_sock_init(c, outbound ? PIF_SPLICE : PIF_HOST,
> -				      NULL, NULL, port);
> +		    (c->ifi6 && socks[V6][port] == -1)) {
> +			if (outbound)
> +				udp_sock_init(c, PIF_SPLICE, NULL, "lo", port);
> +			else
> +				udp_sock_init(c, PIF_HOST, NULL, NULL, port);

Same here, should we add a fallback case? The rest of the series looks
good to me.

> +		}
>  	}
>  }
>  

-- 
Stefano



* Re: [PATCH 2/3] udp: Unify some more inbound/outbound parts of udp_sock_init()
  2025-10-21 21:51   ` Stefano Brivio
@ 2025-10-22  0:08     ` David Gibson
  0 siblings, 0 replies; 14+ messages in thread
From: David Gibson @ 2025-10-22  0:08 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On Tue, Oct 21, 2025 at 11:51:07PM +0200, Stefano Brivio wrote:
> On Fri, 17 Oct 2025 11:34:46 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > udp_sock_init() takes an 'ns' parameter determining if it creates a socket
> > in the guest namespace or host namespace.  Alter it to take a pif
> > parameter instead, like tcp_sock_init(), and use that change to slightly
> > reduce code duplication between the HOST and SPLICE cases.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  conf.c |  2 +-
> >  udp.c  | 60 +++++++++++++++++++++++++++++-----------------------------
> >  udp.h  |  5 +++--
> >  3 files changed, 34 insertions(+), 33 deletions(-)
> > 
> > diff --git a/conf.c b/conf.c
> > index 26f1bcc0..08cb50aa 100644
> > --- a/conf.c
> > +++ b/conf.c
> > @@ -171,7 +171,7 @@ static void conf_ports_range_except(const struct ctx *c, char optname,
> >  		if (optname == 't')
> >  			ret = tcp_sock_init(c, PIF_HOST, addr, ifname, i);
> >  		else if (optname == 'u')
> > -			ret = udp_sock_init(c, 0, addr, ifname, i);
> > +			ret = udp_sock_init(c, PIF_HOST, addr, ifname, i);
> >  		else
> >  			/* No way to check in advance for -T and -U */
> >  			ret = 0;
> > diff --git a/udp.c b/udp.c
> > index 86585b7e..49dd0144 100644
> > --- a/udp.c
> > +++ b/udp.c
> > @@ -1093,64 +1093,63 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
> >  /**
> >   * udp_sock_init() - Initialise listening sockets for a given port
> >   * @c:		Execution context
> > - * @ns:		In pasta mode, if set, bind with loopback address in namespace
> > + * @pif:	Interface to open the socket for (PIF_HOST or PIF_SPLICE)
> >   * @addr:	Pointer to address for binding, NULL if not configured
> >   * @ifname:	Name of interface to bind to, NULL if not configured
> >   * @port:	Port, host order
> >   *
> >   * Return: 0 on (partial) success, negative error code on (complete) failure
> >   */
> > -int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
> > -		  const char *ifname, in_port_t port)
> > +int udp_sock_init(const struct ctx *c, uint8_t pif,
> > +		  const union inany_addr *addr, const char *ifname,
> > +		  in_port_t port)
> >  {
> > +	int (*socks)[NUM_PORTS] = pif == PIF_HOST ?
> > +		udp_splice_init : udp_splice_ns;
> 
> Same as on v1: I'd rather avoid cramping ternary operators like this.

Good idea, done.

> >  	union udp_listen_epoll_ref uref = {
> > -		.pif = ns ? PIF_SPLICE : PIF_HOST,
> > +		.pif = pif,
> >  		.port = port,
> >  	};
> >  	int r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1;
> >  
> >  	ASSERT(!c->no_udp);
> >  
> > -	if (!addr && c->ifi4 && c->ifi6 && !ns) {
> > +	if (!addr && c->ifi4 && c->ifi6 && (pif == PIF_HOST)) {
> 
> I think it's more readable without the extra parentheses around the
> comparison, because when I see those I automatically think of an
> assignment we want to use in a conditional clause, but that's just a
> comparison.

Done.  Thomas Huth finally convinced me not to put precautionary
parens on anything next to && and ||, but I forgot this time :).

> 
> >  		int s;
> >  
> >  		/* Attempt to get a dual stack socket */
> >  		s = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST,
> >  				NULL, ifname, port, uref.u32);
> > -		udp_splice_init[V4][port] = s < 0 ? -1 : s;
> > -		udp_splice_init[V6][port] = s < 0 ? -1 : s;
> > +		socks[V4][port] = s < 0 ? -1 : s;
> > +		socks[V6][port] = s < 0 ? -1 : s;
> >  		if (IN_INTERVAL(0, FD_REF_MAX, s))
> >  			return 0;
> >  	}
> >  
> >  	if ((!addr || inany_v4(addr)) && c->ifi4) {
> > -		if (!ns) {
> > -			r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST,
> > -					 addr ? addr : &inany_any4, ifname,
> > -					 port, uref.u32);
> > +		const union inany_addr *a = addr ?
> > +			addr : &inany_any4;
> >  
> > -			udp_splice_init[V4][port] = r4 < 0 ? -1 : r4;
> > -		} else {
> > -			r4  = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_SPLICE,
> > -					  &inany_loopback4, ifname,
> > -					  port, uref.u32);
> > -			udp_splice_ns[V4][port] = r4 < 0 ? -1 : r4;
> > -		}
> > +		if (pif == PIF_SPLICE)
> > +			a = &inany_loopback4;
> > +
> > +		r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
> > +				 port, uref.u32);
> > +
> > +		socks[V4][port] = r4 < 0 ? -1 : r4;
> >  	}
> >  
> >  	if ((!addr || !inany_v4(addr)) && c->ifi6) {
> > -		if (!ns) {
> > -			r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST,
> > -					 addr ? addr : &inany_any6, ifname,
> > -					 port, uref.u32);
> > +		const union inany_addr *a = addr ?
> > +			addr : &inany_any6;
> >  
> > -			udp_splice_init[V6][port] = r6 < 0 ? -1 : r6;
> > -		} else {
> > -			r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_SPLICE,
> > -					 &inany_loopback6, ifname,
> > -					 port, uref.u32);
> > -			udp_splice_ns[V6][port] = r6 < 0 ? -1 : r6;
> > -		}
> > +		if (pif == PIF_SPLICE)
> > +			a = &inany_loopback6;
> > +
> > +		r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
> > +				 port, uref.u32);
> > +
> > +		socks[V6][port] = r6 < 0 ? -1 : r6;
> >  	}
> >  
> >  	if (IN_INTERVAL(0, FD_REF_MAX, r4) || IN_INTERVAL(0, FD_REF_MAX, r6))
> > @@ -1216,7 +1215,8 @@ static void udp_port_rebind(struct ctx *c, bool outbound)
> >  
> >  		if ((c->ifi4 && socks[V4][port] == -1) ||
> >  		    (c->ifi6 && socks[V6][port] == -1))
> > -			udp_sock_init(c, outbound, NULL, NULL, port);
> > +			udp_sock_init(c, outbound ? PIF_SPLICE : PIF_HOST,
> > +				      NULL, NULL, port);
> >  	}
> >  }
> >  
> > diff --git a/udp.h b/udp.h
> > index 8f8531ad..f78dc528 100644
> > --- a/udp.h
> > +++ b/udp.h
> > @@ -17,8 +17,9 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
> >  		    sa_family_t af, const void *saddr, const void *daddr,
> >  		    uint8_t ttl, const struct pool *p, int idx,
> >  		    const struct timespec *now);
> > -int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
> > -		  const char *ifname, in_port_t port);
> > +int udp_sock_init(const struct ctx *c, uint8_t pif,
> > +		  const union inany_addr *addr, const char *ifname,
> > +		  in_port_t port);
> >  int udp_init(struct ctx *c);
> >  void udp_timer(struct ctx *c, const struct timespec *now);
> >  void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);
> 
> -- 
> Stefano
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson


* Re: [PATCH 3/3] tcp, udp: Bind outbound listening sockets by interface instead of address
  2025-10-21 21:51   ` Stefano Brivio
@ 2025-10-22  0:34     ` David Gibson
  2025-10-22  8:59       ` Stefano Brivio
  0 siblings, 1 reply; 14+ messages in thread
From: David Gibson @ 2025-10-22  0:34 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On Tue, Oct 21, 2025 at 11:51:12PM +0200, Stefano Brivio wrote:
> On Fri, 17 Oct 2025 11:34:47 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > Currently, outbound forwards (-T, -U) are handled by sockets bound to the
> > loopback address.  Typically we create two sockets, one for 127.0.0.1 and
> > one for ::1.
> > 
> > This has some disadvantages:
> >  * The guest can't connect to these services using its global IP address,
> >    it must explicitly use 127.0.0.1 or ::1 (bug 100)
> >  * The guest can't even connect via 127.0.0.0/8 addresses other than
> >    127.0.0.1
> >  * We can't use dual-stack sockets, we have to have separate sockets for
> >    IPv4 and IPv6.
> > 
> > The restriction exists for a reason though.  If the guest has any interfaces
> > other than pasta (e.g. a VPN tunnel) external hosts could reach the host
> > via the forwards.  Especially combined with -T auto / -U auto this would
> > make it very easy to make a mistake with nasty security implications.
> > 
> > We can achieve both goals, however, if we don't bind the outbound listening
> > sockets to a particular address, but _do_ use SO_BINDTODEVICE to restrict
> > them to the "lo" interface.
> 
> Nice trick, I didn't think of it. I wonder if doing the same host-side
> might help solve part of https://bugs.passt.top/show_bug.cgi?id=113
> as well.

I don't think we even need to do anything host side - bug 113 arises
because of where we're listening on the guest side.  So this might be
enough to fix it all on its own.  I'm not certain, because the special
case DNS handling complicates the picture there.

> 
> > Link: https://bugs.passt.top/show_bug.cgi?id=100
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  pif.c |  6 ------
> >  tcp.c | 18 ++----------------
> >  udp.c | 27 ++++++++++-----------------
> >  3 files changed, 12 insertions(+), 39 deletions(-)
> > 
> > diff --git a/pif.c b/pif.c
> > index 592fafaa..84e3ceae 100644
> > --- a/pif.c
> > +++ b/pif.c
> > @@ -87,12 +87,6 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
> >  
> >  	ASSERT(pif_is_socket(pif));
> >  
> > -	if (pif == PIF_SPLICE) {
> > -		/* Sanity checks */
> > -		ASSERT(!ifname);
> > -		ASSERT(addr && inany_is_loopback(addr));
> > -	}
> > -
> >  	if (!addr)
> >  		return sock_l4_sa(c, type, &sa, sizeof(sa.sa6),
> >  				  ifname, false, data);
> > diff --git a/tcp.c b/tcp.c
> > index 15c012d7..982c9190 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -2592,20 +2592,6 @@ int tcp_sock_init(const struct ctx *c, uint8_t pif,
> >  
> >  	return r4 < 0 ? r4 : r6;
> >  }
> > -/**
> > - * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections
> > - * @c:		Execution context
> > - * @port:	Port, host order
> > - */
> > -static void tcp_ns_sock_init(const struct ctx *c, in_port_t port)
> > -{
> > -	ASSERT(!c->no_tcp);
> > -
> > -	if (c->ifi4)
> > -		tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback4, NULL, port);
> > -	if (c->ifi6)
> > -		tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback6, NULL, port);
> > -}
> >  
> >  /**
> >   * tcp_ns_socks_init() - Bind sockets in namespace for outbound connections
> > @@ -2625,7 +2611,7 @@ static int tcp_ns_socks_init(void *arg)
> >  		if (!bitmap_isset(c->tcp.fwd_out.map, port))
> >  			continue;
> >  
> > -		tcp_ns_sock_init(c, port);
> > +		tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);
> 
> I thought the "lo" string would be part of the Linux UAPI, but that's
> not the case, and loopback_net_init() just calls:
> 
> 	alloc_netdev(0, "lo", NET_NAME_PREDICTABLE, loopback_setup);
> 
> so I think it's relatively unproblematic to hardcode that as well, and it
> looks like we can't create a second loopback interface, even though:
> 
> $ pasta -- sh -c 'ip link set dev lo down; ip link change dev lo name lol; ip link show lol'
> 1: lol: <LOOPBACK> mtu 65536 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

Hm, that is a point.  *thinks*.  So I believe loopback always has
index 1, so we could potentially use that.  Except that BINDTODEVICE
takes a name, not an index.  I don't think looking up the name from
the index then using BINDTODEVICE would do quite what we want either:
IIUC, BINDTODEVICE is fixed by name not index, so if the guest changed
the name of lo after we did BINDTODEVICE, then it wouldn't "follow"
the interface name change (which is what you want for intermittently
present interfaces, not so much here).
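
For what it's worth, the lookup itself would be a single libc call. A minimal sketch, where ifname_from_index() is a hypothetical wrapper and the idea that loopback always has index 1 is my belief from above, not something this code checks:

```c
#include <errno.h>
#include <net/if.h>

/* Hypothetical helper: current name of the interface with index @idx.
 * @name must have room for IF_NAMESIZE bytes.
 * Return: 0 on success, negative errno value on failure.
 */
int ifname_from_index(unsigned int idx, char *name)
{
	if (!if_indextoname(idx, name))
		return -errno;

	return 0;
}
```

Calling it with index 1 and passing the result to SO_BINDTODEVICE would pick up whatever the interface is called at that moment.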

> I don't have any quick solution and I don't think we care enough to
> write a function in netlink.c fetching links with loopback type, so I'm
> totally fine with this as it is.

Yeah, given the above, I think this is another case of we can only go
so far to stop the guest shooting itself in the foot.

> By the way, if we fail to use SO_BINDTODEVICE, we already defensively
> close the socket. The only possible flaw that occurs to me is that
> somebody could rename 'lo' and then create a link called 'lo' of a
> different type. But that needs CAP_NET_ADMIN in the container anyway.

Right.  And while that could expose host ports in ways we didn't
expect, a malicious guest could already do that by running a port
forwarder.  So, again, I think this falls under the category of the
guest being allowed to shoot itself in the foot.

> >  	}
> >  
> >  	return 0;
> > @@ -2805,7 +2791,7 @@ static void tcp_port_rebind(struct ctx *c, bool outbound)
> >  		if ((c->ifi4 && socks[port][V4] == -1) ||
> >  		    (c->ifi6 && socks[port][V6] == -1)) {
> >  			if (outbound)
> > -				tcp_ns_sock_init(c, port);
> > +				tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);
> 
> Should we have/keep a fallback for pre-5.7 / pre-c427bfec18f2 kernels?

For a moment I thought we didn't need a fallback, because we'd be
entering the guest ns and thereby gaining CAP_NET_RAW.  But that's not
the case: we only enter the guest netns for this operation, we're
already in the userns and have dropped capabilities at this point
(unlike the bindings we create at startup).

So, good question.  Having a fallback would make this a *lot* messier,
and perhaps more importantly means we'd get a subtle but real
behavioural difference based on kernel version which sounds like it
could be pretty surprising to the user.  My inclination is to say that
-T auto / -U auto requires a kernel with that patch, but if you
overrule me, I'll see what I can do.

> >  			else
> >  				tcp_sock_init(c, PIF_HOST, NULL, NULL, port);
> >  		}
> > diff --git a/udp.c b/udp.c
> > index 49dd0144..e38114eb 100644
> > --- a/udp.c
> > +++ b/udp.c
> > @@ -1127,26 +1127,16 @@ int udp_sock_init(const struct ctx *c, uint8_t pif,
> >  	}
> >  
> >  	if ((!addr || inany_v4(addr)) && c->ifi4) {
> > -		const union inany_addr *a = addr ?
> > -			addr : &inany_any4;
> > -
> > -		if (pif == PIF_SPLICE)
> > -			a = &inany_loopback4;
> > -
> > -		r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
> > +		r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif,
> > +				 addr ? addr : &inany_any4, ifname,
> >  				 port, uref.u32);
> >  
> >  		socks[V4][port] = r4 < 0 ? -1 : r4;
> >  	}
> >  
> >  	if ((!addr || !inany_v4(addr)) && c->ifi6) {
> > -		const union inany_addr *a = addr ?
> > -			addr : &inany_any6;
> > -
> > -		if (pif == PIF_SPLICE)
> > -			a = &inany_loopback6;
> > -
> > -		r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
> > +		r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif,
> > +				 addr ? addr : &inany_any6, ifname,
> >  				 port, uref.u32);
> >  
> >  		socks[V6][port] = r6 < 0 ? -1 : r6;
> > @@ -1214,9 +1204,12 @@ static void udp_port_rebind(struct ctx *c, bool outbound)
> >  			continue;
> >  
> >  		if ((c->ifi4 && socks[V4][port] == -1) ||
> > -		    (c->ifi6 && socks[V6][port] == -1))
> > -			udp_sock_init(c, outbound ? PIF_SPLICE : PIF_HOST,
> > -				      NULL, NULL, port);
> > +		    (c->ifi6 && socks[V6][port] == -1)) {
> > +			if (outbound)
> > +				udp_sock_init(c, PIF_SPLICE, NULL, "lo", port);
> > +			else
> > +				udp_sock_init(c, PIF_HOST, NULL, NULL, port);
> 
> Same here, should we add a fallback case? The rest of the series looks
> good to me.

Same comments as for TCP.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson


* Re: [PATCH 3/3] tcp, udp: Bind outbound listening sockets by interface instead of address
  2025-10-22  0:34     ` David Gibson
@ 2025-10-22  8:59       ` Stefano Brivio
  2025-10-23  1:18         ` David Gibson
  0 siblings, 1 reply; 14+ messages in thread
From: Stefano Brivio @ 2025-10-22  8:59 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Wed, 22 Oct 2025 11:34:40 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, Oct 21, 2025 at 11:51:12PM +0200, Stefano Brivio wrote:
> > On Fri, 17 Oct 2025 11:34:47 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > Currently, outbound forwards (-T, -U) are handled by sockets bound to the
> > > loopback address.  Typically we create two sockets, one for 127.0.0.1 and
> > > one for ::1.
> > > 
> > > This has some disadvantages:
> > >  * The guest can't connect to these services using its global IP address,
> > >    it must explicitly use 127.0.0.1 or ::1 (bug 100)
> > >  * The guest can't even connect via 127.0.0.0/8 addresses other than
> > >    127.0.0.1
> > >  * We can't use dual-stack sockets, we have to have separate sockets for
> > >    IPv4 and IPv6.
> > > 
> > > > The restriction exists for a reason though.  If the guest has any interfaces
> > > other than pasta (e.g. a VPN tunnel) external hosts could reach the host
> > > via the forwards.  Especially combined with -T auto / -U auto this would
> > > make it very easy to make a mistake with nasty security implications.
> > > 
> > > We can achieve both goals, however, if we don't bind the outbound listening
> > > sockets to a particular address, but _do_ use SO_BINDTODEVICE to restrict
> > > them to the "lo" interface.  
> > 
> > Nice trick, I didn't think of it. I wonder if doing the same host-side
> > > might help solve part of https://bugs.passt.top/show_bug.cgi?id=113
> > as well.  
> 
> I don't think we even need to do anything host side - bug 113 arises
> because of where we're listening on the guest side.

Ah, you're right, I guess I picked the wrong bug. I have a vague memory
of another one where somebody is running a DNS proxy in a container
(PiHole maybe?), bound to something in 127.0.0.0/8 but not 127.0.0.1,
and we automatically bind, on the host side, to 127.0.0.1.

> So this might be
> enough to fix it all on its own.  I'm not certain, because the special
> case DNS handling complicates the picture there.

I guess perhaps worth a quick test with socat if checking against
systemd-resolved isn't practical, to see if we can close that one too?

> > > Link: https://bugs.passt.top/show_bug.cgi?id=100
> > > 
> > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > ---
> > >  pif.c |  6 ------
> > >  tcp.c | 18 ++----------------
> > >  udp.c | 27 ++++++++++-----------------
> > >  3 files changed, 12 insertions(+), 39 deletions(-)
> > > 
> > > diff --git a/pif.c b/pif.c
> > > index 592fafaa..84e3ceae 100644
> > > --- a/pif.c
> > > +++ b/pif.c
> > > @@ -87,12 +87,6 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
> > >  
> > >  	ASSERT(pif_is_socket(pif));
> > >  
> > > -	if (pif == PIF_SPLICE) {
> > > -		/* Sanity checks */
> > > -		ASSERT(!ifname);
> > > -		ASSERT(addr && inany_is_loopback(addr));
> > > -	}
> > > -
> > >  	if (!addr)
> > >  		return sock_l4_sa(c, type, &sa, sizeof(sa.sa6),
> > >  				  ifname, false, data);
> > > diff --git a/tcp.c b/tcp.c
> > > index 15c012d7..982c9190 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -2592,20 +2592,6 @@ int tcp_sock_init(const struct ctx *c, uint8_t pif,
> > >  
> > >  	return r4 < 0 ? r4 : r6;
> > >  }
> > > -/**
> > > - * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections
> > > - * @c:		Execution context
> > > - * @port:	Port, host order
> > > - */
> > > -static void tcp_ns_sock_init(const struct ctx *c, in_port_t port)
> > > -{
> > > -	ASSERT(!c->no_tcp);
> > > -
> > > -	if (c->ifi4)
> > > -		tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback4, NULL, port);
> > > -	if (c->ifi6)
> > > -		tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback6, NULL, port);
> > > -}
> > >  
> > >  /**
> > >   * tcp_ns_socks_init() - Bind sockets in namespace for outbound connections
> > > @@ -2625,7 +2611,7 @@ static int tcp_ns_socks_init(void *arg)
> > >  		if (!bitmap_isset(c->tcp.fwd_out.map, port))
> > >  			continue;
> > >  
> > > -		tcp_ns_sock_init(c, port);
> > > +		tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);  
> > 
> > I thought the "lo" string would be part of the Linux UAPI, but that's
> > not the case, and loopback_net_init() just calls:
> > 
> > 	alloc_netdev(0, "lo", NET_NAME_PREDICTABLE, loopback_setup);
> > 
> > so I think it's relatively unproblematic to hardcode that as well, and it
> > looks like we can't create a second loopback interface, even though:
> > 
> > $ pasta -- sh -c 'ip link set dev lo down; ip link change dev lo name lol; ip link show lol'
> > 1: lol: <LOOPBACK> mtu 65536 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
> >     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00  
> 
> Hm, that is a point.  *thinks*.  So I believe loopback always has
> index 1, so we could potentially use that.

Right, see pasta_ns_conf():

  rc = nl_link_set_flags(nl_sock_ns, 1 /* lo */, IFF_UP, IFF_UP);

> Except that BINDTODEVICE
> takes a name, not an index.  I don't think looking up the name from
> the index then using BINDTODEVICE would do quite what we want either:
> IIUC, BINDTODEVICE is fixed by name not index, so if the guest changed
> the name of lo after we did BINDTODEVICE, then it wouldn't "follow"
> the interface name change (which is what you want for intermittently
> present interfaces, not so much here).

Yes, we would need to keep querying it.

> > I don't have any quick solution and I don't think we care enough as to
> > write a function in netlink.c fetching links with loopback type, so I'm
> > totally fine with this as it is.  
> 
> Yeah, given the above, I think this is another case of we can only go
> so far to stop the guest shooting itself in the foot.
> 
> > By the way, if we fail to use SO_BINDTODEVICE, we already defensively
> > close the socket. The only possible flaw that occurs to me is that
> > somebody could rename 'lo' and then create a link called 'lo' of a
> > different type. But that needs CAP_NET_ADMIN in the container anyway.  
> 
> Right.  And while that could expose host ports in ways we didn't
> expect, a malicious guest could already do that by running a port
> forwarder.  So, again, I think this falls under the category of the
> guest being allowed to shoot itself in the foot.

Indeed.

> > >  	}
> > >  
> > >  	return 0;
> > > @@ -2805,7 +2791,7 @@ static void tcp_port_rebind(struct ctx *c, bool outbound)
> > >  		if ((c->ifi4 && socks[port][V4] == -1) ||
> > >  		    (c->ifi6 && socks[port][V6] == -1)) {
> > >  			if (outbound)
> > > -				tcp_ns_sock_init(c, port);
> > > +				tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);  
> > 
> > Should we have/keep a fallback for pre-5.7 / pre-c427bfec18f2 kernels?  
> 
> For a moment I thought we didn't need a fallback, because we'd be
> entering the guest ns and thereby gaining CAP_NET_RAW.  But that's not
> the case: we only enter the guest netns for this operation, we're
> already in the userns and have dropped capabilities at this point
> (unlike the bindings we create at startup).
> 
> So, good question.  Having a fallback would make this a *lot* messier,

Hmm, why? I didn't test this (and I'm quite confused by
inany_from_sockaddr() right now) but:

---
diff --git a/util.c b/util.c
index c492f90..6b04040 100644
--- a/util.c
+++ b/util.c
@@ -126,17 +126,33 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
 		 * ("net: core: enable SO_BINDTODEVICE for non-root users"). If
 		 * it's unsupported, don't bind the socket at all, because the
 		 * user might rely on this to filter incoming connections.
+		 *
+		 * As an exception, if we want to bind to 'lo', approximate that
+		 * by binding to 127.0.0.1 (which might be the wrong loopback
+		 * address for IPv4) or ::1 (always correct, for IPv6). This
+		 * adds https://bugs.passt.top/show_bug.cgi?id=100 back for
+		 * pre-5.7 kernels.
 		 */
 		if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
 			       ifname, strlen(ifname))) {
 			char str[SOCKADDR_STRLEN];
 
-			ret = -errno;
-			warn("Can't bind %s socket for %s to %s, closing",
-			     EPOLL_TYPE_STR(proto),
-			     sockaddr_ntop(sa, str, sizeof(str)), ifname);
-			close(fd);
-			return ret;
+			if (errno == EPERM && !strcmp(ifname, "lo")) {
+				/* I just realised inany_from_sockaddr() doesn't
+				 * actually take a sockaddr as source...? Do we
+				 * have another function doing that now?
+				 *
+				 * Well, change the address in 'sa' and warn().
+				 */
+				;
+			} else {
+				ret = -errno;
+				warn("Can't bind %s socket for %s to %s, closing",
+				     EPOLL_TYPE_STR(proto),
+				     sockaddr_ntop(sa, str, sizeof(str)), ifname);
+				close(fd);
+				return ret;
+			}
 		}
 	}
 
---

> and perhaps more importantly means we'd get a subtle but real
> behavioural difference based on kernel version which sounds like it
> could be pretty surprising to the user.  My inclination is to say that
> -T auto / -U auto requires a kernel with that patch, but if you
> overrule me, I'll see what I can do.

See https://bugs.passt.top/show_bug.cgi?id=121 for a clear indication
that we have users on 4.18 (RHEL 8) kernels. On Debian, Buster (4.19
kernels) was the 'oldoldstable' until recently, and extended long term
support ends in 2029.

If this was a new feature, I would agree. But with this, we are
introducing a regression and we might break some setups.

> > >  			else
> > >  				tcp_sock_init(c, PIF_HOST, NULL, NULL, port);
> > >  		}
> > > diff --git a/udp.c b/udp.c
> > > index 49dd0144..e38114eb 100644
> > > --- a/udp.c
> > > +++ b/udp.c
> > > @@ -1127,26 +1127,16 @@ int udp_sock_init(const struct ctx *c, uint8_t pif,
> > >  	}
> > >  
> > >  	if ((!addr || inany_v4(addr)) && c->ifi4) {
> > > -		const union inany_addr *a = addr ?
> > > -			addr : &inany_any4;
> > > -
> > > -		if (pif == PIF_SPLICE)
> > > -			a = &inany_loopback4;
> > > -
> > > -		r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
> > > +		r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif,
> > > +				 addr ? addr : &inany_any4, ifname,
> > >  				 port, uref.u32);
> > >  
> > >  		socks[V4][port] = r4 < 0 ? -1 : r4;
> > >  	}
> > >  
> > >  	if ((!addr || !inany_v4(addr)) && c->ifi6) {
> > > -		const union inany_addr *a = addr ?
> > > -			addr : &inany_any6;
> > > -
> > > -		if (pif == PIF_SPLICE)
> > > -			a = &inany_loopback6;
> > > -
> > > -		r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
> > > +		r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif,
> > > +				 addr ? addr : &inany_any6, ifname,
> > >  				 port, uref.u32);
> > >  
> > >  		socks[V6][port] = r6 < 0 ? -1 : r6;
> > > @@ -1214,9 +1204,12 @@ static void udp_port_rebind(struct ctx *c, bool outbound)
> > >  			continue;
> > >  
> > >  		if ((c->ifi4 && socks[V4][port] == -1) ||
> > > -		    (c->ifi6 && socks[V6][port] == -1))
> > > -			udp_sock_init(c, outbound ? PIF_SPLICE : PIF_HOST,
> > > -				      NULL, NULL, port);
> > > +		    (c->ifi6 && socks[V6][port] == -1)) {
> > > +			if (outbound)
> > > +				udp_sock_init(c, PIF_SPLICE, NULL, "lo", port);
> > > +			else
> > > +				udp_sock_init(c, PIF_HOST, NULL, NULL, port);  
> > 
> > Same here, should we add a fallback case? The rest of the series looks
> > good to me.  
> 
> Same comments as for TCP.

-- 
Stefano



* Re: [PATCH 3/3] tcp, udp: Bind outbound listening sockets by interface instead of address
  2025-10-22  8:59       ` Stefano Brivio
@ 2025-10-23  1:18         ` David Gibson
  0 siblings, 0 replies; 14+ messages in thread
From: David Gibson @ 2025-10-23  1:18 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev


On Wed, Oct 22, 2025 at 10:59:16AM +0200, Stefano Brivio wrote:
> On Wed, 22 Oct 2025 11:34:40 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Tue, Oct 21, 2025 at 11:51:12PM +0200, Stefano Brivio wrote:
> > > On Fri, 17 Oct 2025 11:34:47 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > Currently, outbound forwards (-T, -U) are handled by sockets bound to the
> > > > loopback address.  Typically we create two sockets, one for 127.0.0.1 and
> > > > one for ::1.
> > > > 
> > > > This has some disadvantages:
> > > >  * The guest can't connect to these services using its global IP address,
> > > >    it must explicitly use 127.0.0.1 or ::1 (bug 100)
> > > >  * The guest can't even connect via 127.0.0.0/8 addresses other than
> > > >    127.0.0.1
> > > >  * We can't use dual-stack sockets, we have to have separate sockets for
> > > >    IPv4 and IPv6.
> > > > 
> > > > The restriction exists for a reason though.  If the guest has any interfaces
> > > > other than pasta (e.g. a VPN tunnel) external hosts could reach the host
> > > > via the forwards.  Especially combined with -T auto / -U auto this would
> > > > make it very easy to make a mistake with nasty security implications.
> > > > 
> > > > We can achieve both goals, however, if we don't bind the outbound listening
> > > > sockets to a particular address, but _do_ use SO_BINDTODEVICE to restrict
> > > > them to the "lo" interface.  
> > > 
> > > Nice trick, I didn't think of it. I wonder if doing the same host-side
> > > might help solving a part of https://bugs.passt.top/show_bug.cgi?id=113
> > > as well.  
> > 
> > I don't think we even need to do anything host side - bug 113 arises
> > because of where we're listening on the guest side.
> 
> Ah, you're right, I guess I picked the wrong bug. I have a vague memory
> of another one where somebody is running a DNS proxy in a container
> (PiHole maybe?), bound to something in 127.0.0.0/8 but not 127.0.0.1,
> and we automatically bind, on the host side, to 127.0.0.1.
> 
> > So this might be
> > enough to fix it all on its own.  I'm not certain, because the special
> > case DNS handling complicates the picture there.
> 
> I guess perhaps worth a quick test with socat if checking against
> systemd-resolved isn't practical, to see if we can close that one too?

Right, I'll have a look when I can.

> > > > Link: https://bugs.passt.top/show_bug.cgi?id=100
> > > > 
> > > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > > ---
> > > >  pif.c |  6 ------
> > > >  tcp.c | 18 ++----------------
> > > >  udp.c | 27 ++++++++++-----------------
> > > >  3 files changed, 12 insertions(+), 39 deletions(-)
> > > > 
> > > > diff --git a/pif.c b/pif.c
> > > > index 592fafaa..84e3ceae 100644
> > > > --- a/pif.c
> > > > +++ b/pif.c
> > > > @@ -87,12 +87,6 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
> > > >  
> > > >  	ASSERT(pif_is_socket(pif));
> > > >  
> > > > -	if (pif == PIF_SPLICE) {
> > > > -		/* Sanity checks */
> > > > -		ASSERT(!ifname);
> > > > -		ASSERT(addr && inany_is_loopback(addr));
> > > > -	}
> > > > -
> > > >  	if (!addr)
> > > >  		return sock_l4_sa(c, type, &sa, sizeof(sa.sa6),
> > > >  				  ifname, false, data);
> > > > diff --git a/tcp.c b/tcp.c
> > > > index 15c012d7..982c9190 100644
> > > > --- a/tcp.c
> > > > +++ b/tcp.c
> > > > @@ -2592,20 +2592,6 @@ int tcp_sock_init(const struct ctx *c, uint8_t pif,
> > > >  
> > > >  	return r4 < 0 ? r4 : r6;
> > > >  }
> > > > -/**
> > > > - * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections
> > > > - * @c:		Execution context
> > > > - * @port:	Port, host order
> > > > - */
> > > > -static void tcp_ns_sock_init(const struct ctx *c, in_port_t port)
> > > > -{
> > > > -	ASSERT(!c->no_tcp);
> > > > -
> > > > -	if (c->ifi4)
> > > > -		tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback4, NULL, port);
> > > > -	if (c->ifi6)
> > > > -		tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback6, NULL, port);
> > > > -}
> > > >  
> > > >  /**
> > > >   * tcp_ns_socks_init() - Bind sockets in namespace for outbound connections
> > > > @@ -2625,7 +2611,7 @@ static int tcp_ns_socks_init(void *arg)
> > > >  		if (!bitmap_isset(c->tcp.fwd_out.map, port))
> > > >  			continue;
> > > >  
> > > > -		tcp_ns_sock_init(c, port);
> > > > +		tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);  
> > > 
> > > I thought the "lo" string would be part of the Linux UAPI, but that's
> > > not the case, and loopback_net_init() just calls:
> > > 
> > > 	alloc_netdev(0, "lo", NET_NAME_PREDICTABLE, loopback_setup);
> > > 
> > > so I think it's relatively unproblematic to hardcode that as well, and it
> > > looks like we can't create a second loopback interface, even though:
> > > 
> > > $ pasta -- sh -c 'ip link set dev lo down; ip link change dev lo name lol; ip link show lol'
> > > 1: lol: <LOOPBACK> mtu 65536 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
> > >     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00  
> > 
> > Hm, that is a point.  *thinks*.  So I believe loopback always has
> > index 1, so we could potentially use that.
> 
> Right, see pasta_ns_conf():
> 
>   rc = nl_link_set_flags(nl_sock_ns, 1 /* lo */, IFF_UP, IFF_UP);
> 
> > Except that BINDTODEVICE
> > takes a name, not an index.  I don't think looking up the name from
> > the index then using BINDTODEVICE would do quite what we want either:
> > IIUC, BINDTODEVICE is fixed by name not index, so if the guest changed
> > the name of lo after we did BINDTODEVICE, then it wouldn't "follow"
> > the interface name change (which is what you want for intermittently
> > present interfaces, not so much here).
> 
> Yes, we would need to keep querying it.
> 
> > > I don't have any quick solution and I don't think we care enough as to
> > > write a function in netlink.c fetching links with loopback type, so I'm
> > > totally fine with this as it is.  
> > 
> > Yeah, given the above, I think this is another case of we can only go
> > so far to stop the guest shooting itself in the foot.
> > 
> > > By the way, if we fail to use SO_BINDTODEVICE, we already defensively
> > > close the socket. The only possible flaw that occurs to me is that
> > > somebody could rename 'lo' and then create a link called 'lo' of a
> > > different type. But that needs CAP_NET_ADMIN in the container anyway.  
> > 
> > Right.  And while that could expose host ports in ways we didn't
> > expect, a malicious guest could already do that by running a port
> > forwarder.  So, again, I think this falls under the category of the
> > guest being allowed to shoot itself in the foot.
> 
> Indeed.
> 
> > > >  	}
> > > >  
> > > >  	return 0;
> > > > @@ -2805,7 +2791,7 @@ static void tcp_port_rebind(struct ctx *c, bool outbound)
> > > >  		if ((c->ifi4 && socks[port][V4] == -1) ||
> > > >  		    (c->ifi6 && socks[port][V6] == -1)) {
> > > >  			if (outbound)
> > > > -				tcp_ns_sock_init(c, port);
> > > > +				tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);  
> > > 
> > > Should we have/keep a fallback for pre-5.7 / pre-c427bfec18f2 kernels?  
> > 
> > For a moment I thought we didn't need a fallback, because we'd be
> > entering the guest ns and thereby gaining CAP_NET_RAW.  But that's not
> > the case: we only enter the guest netns for this operation, we're
> > already in the userns and have dropped capabilities at this point
> > (unlike the bindings we create at startup).
> > 
> > So, good question.  Having a fallback would make this a *lot* messier,
> 
> Hmm, why? I didn't test this (and I'm quite confused by
> inany_from_sockaddr() right now) but:

Ah, I didn't think of doing the workaround at this level.  That does
help somewhat.

> 
> ---
> diff --git a/util.c b/util.c
> index c492f90..6b04040 100644
> --- a/util.c
> +++ b/util.c
> @@ -126,17 +126,33 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
>  		 * ("net: core: enable SO_BINDTODEVICE for non-root users"). If
>  		 * it's unsupported, don't bind the socket at all, because the
>  		 * user might rely on this to filter incoming connections.
> +		 *
> +		 * As an exception, if we want to bind to 'lo', approximate that
> +		 * by binding to 127.0.0.1 (which might be the wrong loopback
> +		 * address for IPv4) or ::1 (always correct, for IPv6). This
> +		 * adds https://bugs.passt.top/show_bug.cgi?id=100 back for
> +		 * pre-5.7 kernels.
>  		 */
>  		if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
>  			       ifname, strlen(ifname))) {
>  			char str[SOCKADDR_STRLEN];
>  
> -			ret = -errno;
> -			warn("Can't bind %s socket for %s to %s, closing",
> -			     EPOLL_TYPE_STR(proto),
> -			     sockaddr_ntop(sa, str, sizeof(str)), ifname);
> -			close(fd);
> -			return ret;
> +			if (errno == EPERM && !strcmp(ifname, "lo")) {
> +				/* I just realised inany_from_sockaddr() doesn't
> +				 * actually take a sockaddr as source...? Do we

It does, the description's just not great.  It takes the sockaddr as a
(void *) @addr so that callers can pass a struct sockaddr_whatever
without casting.  I'll try to improve this.

> +				 * have another function doing that now?
> +				 *
> +				 * Well, change the address in 'sa' and warn().

We need to handle the dual-stack case specially, forcing the caller to
split it into separate v4 and v6 sockets here.

> +				 */
> +				;
> +			} else {
> +				ret = -errno;
> +				warn("Can't bind %s socket for %s to %s, closing",
> +				     EPOLL_TYPE_STR(proto),
> +				     sockaddr_ntop(sa, str, sizeof(str)), ifname);
> +				close(fd);
> +				return ret;
> +			}
>  		}
>  	}
>  
> ---
> 
> > and perhaps more importantly means we'd get a subtle but real
> > behavioural difference based on kernel version which sounds like it
> > could be pretty surprising to the user.  My inclination is to say that
> > -T auto / -U auto requires a kernel with that patch, but if you
> > overrule me, I'll see what I can do.
> 
> See https://bugs.passt.top/show_bug.cgi?id=121 for a clear indication
> that we have users on 4.18 (RHEL 8) kernels. On Debian, Buster (4.19
> kernels) was the 'oldoldstable' until recently, and extended long term
> support ends in 2029.

Poop.

> If this was a new feature, I would agree. But with this, we are
> introducing a regression and we might break some setups.

Yeah, ok.  I'll figure something out.
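
On the dual-stack point above: a single AF_INET6 socket bound to ::1 only
sees IPv6 traffic, and one bound to ::ffff:127.0.0.1 only sees IPv4, so an
address-based fallback needs two sockets. A rough sketch of the addresses
involved, where lo_fallback_addrs() is a hypothetical name and not
something in the tree:

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/**
 * lo_fallback_addrs() - Fill per-family loopback addresses for one port
 * @port:	Port, host order
 * @sa4:	Set to 127.0.0.1:@port
 * @sa6:	Set to [::1]:@port
 *
 * Hypothetical helper: the caller is expected to open one AF_INET and one
 * AF_INET6 socket and bind() them separately to these addresses.
 */
static void lo_fallback_addrs(in_port_t port, struct sockaddr_in *sa4,
			      struct sockaddr_in6 *sa6)
{
	memset(sa4, 0, sizeof(*sa4));
	sa4->sin_family = AF_INET;
	sa4->sin_port = htons(port);
	sa4->sin_addr.s_addr = htonl(INADDR_LOOPBACK);

	memset(sa6, 0, sizeof(*sa6));
	sa6->sin6_family = AF_INET6;
	sa6->sin6_port = htons(port);
	sa6->sin6_addr = in6addr_loopback;
}
```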

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson



Thread overview: 14+ messages
2025-10-17  0:34 [PATCH 0/3] RFC: Reduce differences between inbound and outbound socket binding David Gibson
2025-10-17  0:34 ` [PATCH 1/3] tcp: Merge tcp_ns_sock_init[46]() into tcp_sock_init_one() David Gibson
2025-10-20  6:08   ` Stefano Brivio
2025-10-20  9:24     ` David Gibson
2025-10-20  6:09   ` Stefano Brivio
2025-10-20  9:25     ` David Gibson
2025-10-17  0:34 ` [PATCH 2/3] udp: Unify some more inbound/outbound parts of udp_sock_init() David Gibson
2025-10-21 21:51   ` Stefano Brivio
2025-10-22  0:08     ` David Gibson
2025-10-17  0:34 ` [PATCH 3/3] tcp, udp: Bind outbound listening sockets by interface instead of address David Gibson
2025-10-21 21:51   ` Stefano Brivio
2025-10-22  0:34     ` David Gibson
2025-10-22  8:59       ` Stefano Brivio
2025-10-23  1:18         ` David Gibson

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt
