public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
* [PATCH v3 0/8] Reduce differences between inbound and outbound socket binding
@ 2025-10-29  6:26 David Gibson
  2025-10-29  6:26 ` [PATCH v3 1/8] inany: Let length of sockaddr_inany be implicit from the family David Gibson
                   ` (8 more replies)
  0 siblings, 9 replies; 24+ messages in thread
From: David Gibson @ 2025-10-29  6:26 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

The fact that outbound forwarding sockets are bound to the loopback
address, whereas inbound forwarding sockets are (by default) bound to
the unspecified address leads to some unexpected differences between
the paths setting up each of them.

An idea for tackling bug 100 suggested a different approach which will
also reduce some of those differences and allow more code to be shared
between the two paths.  I've since discovered that this approach may
not help for bug 100, but it might still be worthwhile for the clean
up.

Patches 1..6/8 are cleanups which shouldn't change behaviour, and I
think are ready to merge.  7/8 is (arguably) a behavioural change, but
I've made my case for it in the patch comment.  8/8 needs some further
consideration, since I've discovered it does not fix bug 100 as is,
I'm including it for advance review, though.

v3:
 - A number of additional fixes covering the handling of IPV6_V6ONLY sockopt
 - Assorted trivial changes
v2:
 - Some rearrangements and rewordings for clarity

David Gibson (8):
  inany: Let length of sockaddr_inany be implicit from the family
  util, flow, pif: Simplify sock_l4_sa() interface
  tcp: Merge tcp_ns_sock_init[46]() into tcp_sock_init_one()
  udp: Unify some more inbound/outbound parts of udp_sock_init()
  udp: Move udp_sock_init() special case to its caller
  util: Fix setting of IPV6_V6ONLY socket option
  tcp, udp: Remove fallback if creating dual stack socket fails
  [RFC, DO NOT APPLY] tcp, udp: Bind outbound listening sockets by
    interface instead of address

 conf.c       |   4 +-
 flow.c       |  19 +++---
 icmp.c       |   3 +-
 inany.h      |  18 ++++++
 pif.c        |  26 ++------
 pif.h        |   2 +-
 tcp.c        | 164 +++++++++++++++++++--------------------------------
 tcp.h        |   5 +-
 tcp_splice.c |   5 +-
 udp.c        | 102 +++++++++++++++-----------------
 udp.h        |   5 +-
 util.c       |  55 +++++++++++++----
 util.h       |   9 ++-
 13 files changed, 201 insertions(+), 216 deletions(-)

-- 
2.51.0


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v3 1/8] inany: Let length of sockaddr_inany be implicit from the family
  2025-10-29  6:26 [PATCH v3 0/8] Reduce differences between inbound and outbound socket binding David Gibson
@ 2025-10-29  6:26 ` David Gibson
  2025-11-13  6:33   ` Stefano Brivio
  2025-10-29  6:26 ` [PATCH v3 2/8] util, flow, pif: Simplify sock_l4_sa() interface David Gibson
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: David Gibson @ 2025-10-29  6:26 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

sockaddr_inany can contain either an IPv4 or IPv6 socket address, so the
relevant length for bind() or connect() can vary.  In pif_sockaddr() we
return that length, and in sock_l4_sa() we take it as an extra parameter.

However, sockaddr_inany always contains exactly a sockaddr_in or a
sockaddr_in6 each with a fixed size.  Therefore we can derive the relevant
length from the family, and don't need to pass it around separately.

Make a tiny helper to get the relevant address length, and update all
interfaces to use that approach instead.

In the process, fix a buglet in tcp_flow_repair_bind(): we passed
sizeof(union sockaddr_inany) to bind() instead of the specific length for
the address family.  Since the sizeof() is always longer than the specific
length, this is probably fine, but not theoretically correct.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c       | 17 +++++++----------
 icmp.c       |  3 ++-
 inany.h      | 18 ++++++++++++++++++
 pif.c        | 14 ++++----------
 pif.h        |  2 +-
 tcp.c        | 20 +++++++++-----------
 tcp_splice.c |  5 ++---
 udp.c        |  8 +++-----
 util.c       |  9 ++++-----
 util.h       |  5 +++--
 10 files changed, 53 insertions(+), 48 deletions(-)

diff --git a/flow.c b/flow.c
index feefda3c..9926f408 100644
--- a/flow.c
+++ b/flow.c
@@ -170,8 +170,7 @@ struct flowside_sock_args {
 	int fd;
 	int err;
 	enum epoll_type type;
-	const struct sockaddr *sa;
-	socklen_t sl;
+	const union sockaddr_inany *sa;
 	const char *path;
 	uint32_t data;
 };
@@ -187,7 +186,7 @@ static int flowside_sock_splice(void *arg)
 
 	ns_enter(a->c);
 
-	a->fd = sock_l4_sa(a->c, a->type, a->sa, a->sl, NULL,
+	a->fd = sock_l4_sa(a->c, a->type, a->sa, NULL,
 	                   a->sa->sa_family == AF_INET6, a->data);
 	a->err = errno;
 
@@ -209,11 +208,10 @@ int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
 {
 	const char *ifname = NULL;
 	union sockaddr_inany sa;
-	socklen_t sl;
 
 	ASSERT(pif_is_socket(pif));
 
-	pif_sockaddr(c, &sa, &sl, pif, &tgt->oaddr, tgt->oport);
+	pif_sockaddr(c, &sa, pif, &tgt->oaddr, tgt->oport);
 
 	switch (pif) {
 	case PIF_HOST:
@@ -224,13 +222,13 @@ int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
 		else if (sa.sa_family == AF_INET6)
 			ifname = c->ip6.ifname_out;
 
-		return sock_l4_sa(c, type, &sa, sl, ifname,
+		return sock_l4_sa(c, type, &sa, ifname,
 				  sa.sa_family == AF_INET6, data);
 
 	case PIF_SPLICE: {
 		struct flowside_sock_args args = {
 			.c = c, .type = type,
-			.sa = &sa.sa, .sl = sl, .data = data,
+			.sa = &sa, .data = data,
 		};
 		NS_CALL(flowside_sock_splice, &args);
 		errno = args.err;
@@ -259,10 +257,9 @@ int flowside_connect(const struct ctx *c, int s,
 		     uint8_t pif, const struct flowside *tgt)
 {
 	union sockaddr_inany sa;
-	socklen_t sl;
 
-	pif_sockaddr(c, &sa, &sl, pif, &tgt->eaddr, tgt->eport);
-	return connect(s, &sa.sa, sl);
+	pif_sockaddr(c, &sa, pif, &tgt->eaddr, tgt->eport);
+	return connect(s, &sa.sa, socklen_inany(&sa));
 }
 
 /** flow_log_ - Log flow-related message
diff --git a/icmp.c b/icmp.c
index 6dffafb0..62b3bed8 100644
--- a/icmp.c
+++ b/icmp.c
@@ -301,8 +301,9 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 	ASSERT(flow_proto[pingf->f.type] == proto);
 	pingf->ts = now->tv_sec;
 
-	pif_sockaddr(c, &sa, &msh.msg_namelen, PIF_HOST, &tgt->eaddr, 0);
+	pif_sockaddr(c, &sa, PIF_HOST, &tgt->eaddr, 0);
 	msh.msg_name = &sa;
+	msh.msg_namelen = socklen_inany(&sa);
 	msh.msg_iov = iov;
 	msh.msg_iovlen = cnt;
 	msh.msg_control = NULL;
diff --git a/inany.h b/inany.h
index 7ca5cbd3..e3cae2a8 100644
--- a/inany.h
+++ b/inany.h
@@ -67,6 +67,24 @@ union sockaddr_inany {
 	struct sockaddr_in6	sa6;
 };
 
+/** socklen_inany() - Return relevant size of an sockaddr_inany
+ * @sa:		sockaddr_iany socket address
+ *
+ * Returns the correct socket address length to pass to bind() or connect() with
+ * @sa, based on whether it is an IPv4 or IPv6 address.
+ */
+static inline socklen_t socklen_inany(const union sockaddr_inany *sa)
+{
+	switch (sa->sa_family) {
+	case AF_INET:
+		return sizeof(sa->sa4);
+	case AF_INET6:
+		return sizeof(sa->sa6);
+	default:
+		ASSERT(0);
+	}
+}
+
 /** inany_v4 - Extract IPv4 address, if present, from IPv[46] address
  * @addr:	IPv4 or IPv6 address
  *
diff --git a/pif.c b/pif.c
index 592fafaa..31723b29 100644
--- a/pif.c
+++ b/pif.c
@@ -29,12 +29,11 @@ static_assert(ARRAY_SIZE(pif_type_str) == PIF_NUM_TYPES,
 /** pif_sockaddr() - Construct a socket address suitable for an interface
  * @c:		Execution context
  * @sa:		Pointer to sockaddr to fill in
- * @sl:		Updated to relevant length of initialised @sa
  * @pif:	Interface to create the socket address
  * @addr:	IPv[46] address
  * @port:	Port (host byte order)
  */
-void pif_sockaddr(const struct ctx *c, union sockaddr_inany *sa, socklen_t *sl,
+void pif_sockaddr(const struct ctx *c, union sockaddr_inany *sa,
 		  uint8_t pif, const union inany_addr *addr, in_port_t port)
 {
 	const struct in_addr *v4 = inany_v4(addr);
@@ -46,7 +45,6 @@ void pif_sockaddr(const struct ctx *c, union sockaddr_inany *sa, socklen_t *sl,
 		sa->sa4.sin_addr = *v4;
 		sa->sa4.sin_port = htons(port);
 		memset(&sa->sa4.sin_zero, 0, sizeof(sa->sa4.sin_zero));
-		*sl = sizeof(sa->sa4);
 	} else {
 		sa->sa_family = AF_INET6;
 		sa->sa6.sin6_addr = addr->a6;
@@ -56,7 +54,6 @@ void pif_sockaddr(const struct ctx *c, union sockaddr_inany *sa, socklen_t *sl,
 		else
 			sa->sa6.sin6_scope_id = 0;
 		sa->sa6.sin6_flowinfo = 0;
-		*sl = sizeof(sa->sa6);
 	}
 }
 
@@ -83,7 +80,6 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
 		.sa6.sin6_addr = in6addr_any,
 		.sa6.sin6_port = htons(port),
 	};
-	socklen_t sl;
 
 	ASSERT(pif_is_socket(pif));
 
@@ -94,10 +90,8 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
 	}
 
 	if (!addr)
-		return sock_l4_sa(c, type, &sa, sizeof(sa.sa6),
-				  ifname, false, data);
+		return sock_l4_sa(c, type, &sa, ifname, false, data);
 
-	pif_sockaddr(c, &sa, &sl, pif, addr, port);
-	return sock_l4_sa(c, type, &sa, sl,
-			  ifname, sa.sa_family == AF_INET6, data);
+	pif_sockaddr(c, &sa, pif, addr, port);
+	return sock_l4_sa(c, type, &sa, ifname, sa.sa_family == AF_INET6, data);
 }
diff --git a/pif.h b/pif.h
index f029282b..0f7f6672 100644
--- a/pif.h
+++ b/pif.h
@@ -57,7 +57,7 @@ static inline bool pif_is_socket(uint8_t pif)
 	return pif == PIF_HOST || pif == PIF_SPLICE;
 }
 
-void pif_sockaddr(const struct ctx *c, union sockaddr_inany *sa, socklen_t *sl,
+void pif_sockaddr(const struct ctx *c, union sockaddr_inany *sa,
 		  uint8_t pif, const union inany_addr *addr, in_port_t port);
 int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
 		const union inany_addr *addr, const char *ifname,
diff --git a/tcp.c b/tcp.c
index 0f9e9b3f..d63a16bf 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1443,12 +1443,11 @@ static void tcp_bind_outbound(const struct ctx *c,
 {
 	const struct flowside *tgt = &conn->f.side[TGTSIDE];
 	union sockaddr_inany bind_sa;
-	socklen_t sl;
 
 
-	pif_sockaddr(c, &bind_sa, &sl, PIF_HOST, &tgt->oaddr, tgt->oport);
+	pif_sockaddr(c, &bind_sa, PIF_HOST, &tgt->oaddr, tgt->oport);
 	if (!inany_is_unspecified(&tgt->oaddr) || tgt->oport) {
-		if (bind(s, &bind_sa.sa, sl)) {
+		if (bind(s, &bind_sa.sa, socklen_inany(&bind_sa))) {
 			char sstr[INANY_ADDRSTRLEN];
 
 			flow_dbg_perror(conn,
@@ -1508,7 +1507,6 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af,
 	union flow *flow;
 	int s = -1, mss;
 	uint64_t hash;
-	socklen_t sl;
 
 	if (!(flow = flow_alloc()))
 		return;
@@ -1541,7 +1539,7 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af,
 	if ((s = tcp_conn_sock(af)) < 0)
 		goto cancel;
 
-	pif_sockaddr(c, &sa, &sl, PIF_HOST, &tgt->eaddr, tgt->eport);
+	pif_sockaddr(c, &sa, PIF_HOST, &tgt->eaddr, tgt->eport);
 
 	/* Use bind() to check if the target address is local (EADDRINUSE or
 	 * similar) and already bound, and set the LOCAL flag in that case.
@@ -1553,7 +1551,7 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af,
 	 *
 	 * So, if bind() succeeds, close the socket, get a new one, and proceed.
 	 */
-	if (bind(s, &sa.sa, sl)) {
+	if (bind(s, &sa.sa, socklen_inany(&sa))) {
 		if (errno != EADDRNOTAVAIL && errno != EACCES)
 			conn_flag(c, conn, LOCAL);
 	} else {
@@ -1593,7 +1591,7 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af,
 
 	tcp_bind_outbound(c, conn, s);
 
-	if (connect(s, &sa.sa, sl)) {
+	if (connect(s, &sa.sa, socklen_inany(&sa))) {
 		if (errno != EINPROGRESS) {
 			tcp_rst(c, conn);
 			goto cancel;
@@ -1612,7 +1610,8 @@ static void tcp_conn_from_tap(const struct ctx *c, sa_family_t af,
 	tcp_epoll_ctl(c, conn);
 
 	if (c->mode == MODE_VU) { /* To rebind to same oport after migration */
-		sl = sizeof(sa);
+		socklen_t sl = sizeof(sa);
+
 		if (getsockname(s, &sa.sa, &sl) ||
 		    inany_from_sockaddr(&tgt->oaddr, &tgt->oport, &sa) < 0)
 			err_perror("Can't get local address for socket %i", s);
@@ -3592,11 +3591,10 @@ static int tcp_flow_repair_bind(const struct ctx *c, struct tcp_tap_conn *conn)
 {
 	const struct flowside *sockside = HOSTFLOW(conn);
 	union sockaddr_inany a;
-	socklen_t sl;
 
-	pif_sockaddr(c, &a, &sl, PIF_HOST, &sockside->oaddr, sockside->oport);
+	pif_sockaddr(c, &a, PIF_HOST, &sockside->oaddr, sockside->oport);
 
-	if (bind(conn->sock, &a.sa, sizeof(a))) {
+	if (bind(conn->sock, &a.sa, socklen_inany(&a))) {
 		int rc = -errno;
 		flow_perror(conn, "Failed to bind socket for migrated flow");
 		return rc;
diff --git a/tcp_splice.c b/tcp_splice.c
index 26cb6306..174a64d9 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -351,7 +351,6 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn)
 	sa_family_t af = inany_v4(&tgt->eaddr) ? AF_INET : AF_INET6;
 	uint8_t tgtpif = conn->f.pif[TGTSIDE];
 	union sockaddr_inany sa;
-	socklen_t sl;
 	int one = 1;
 
 	if (tgtpif == PIF_HOST)
@@ -379,9 +378,9 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn)
 			   conn->s[1]);
 	}
 
-	pif_sockaddr(c, &sa, &sl, tgtpif, &tgt->eaddr, tgt->eport);
+	pif_sockaddr(c, &sa, tgtpif, &tgt->eaddr, tgt->eport);
 
-	if (connect(conn->s[1], &sa.sa, sl)) {
+	if (connect(conn->s[1], &sa.sa, socklen_inany(&sa))) {
 		if (errno != EINPROGRESS) {
 			flow_trace(conn, "Couldn't connect socket for splice: %s",
 				   strerror_(errno));
diff --git a/udp.c b/udp.c
index 86585b7e..8f96495d 100644
--- a/udp.c
+++ b/udp.c
@@ -773,7 +773,6 @@ static void udp_sock_to_sock(const struct ctx *c, int from_s, int n,
 	const struct udp_flow *uflow = udp_at_sidx(tosidx);
 	uint8_t topif = pif_at_sidx(tosidx);
 	int to_s = uflow->s[tosidx.sidei];
-	socklen_t sl;
 	int i;
 
 	if ((n = udp_sock_recv(c, from_s, udp_mh_recv, n)) <= 0)
@@ -784,7 +783,7 @@ static void udp_sock_to_sock(const struct ctx *c, int from_s, int n,
 			= udp_mh_recv[i].msg_len;
 	}
 
-	pif_sockaddr(c, &udp_splice_to, &sl, topif,
+	pif_sockaddr(c, &udp_splice_to, topif,
 		     &toside->eaddr, toside->eport);
 
 	sendmmsg(to_s, udp_mh_splice, n, MSG_NOSIGNAL);
@@ -986,7 +985,6 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
 	flow_sidx_t tosidx;
 	in_port_t src, dst;
 	uint8_t topif;
-	socklen_t sl;
 
 	ASSERT(!c->no_udp);
 
@@ -1028,7 +1026,7 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
 	s = uflow->s[tosidx.sidei];
 	ASSERT(s >= 0);
 
-	pif_sockaddr(c, &to_sa, &sl, topif, &toside->eaddr, toside->eport);
+	pif_sockaddr(c, &to_sa, topif, &toside->eaddr, toside->eport);
 
 	for (i = 0, j = 0; i < (int)p->count - idx && j < UIO_MAXIOV; i++) {
 		const struct udphdr *uh_send;
@@ -1041,7 +1039,7 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
 			return p->count - idx;
 
 		mm[i].msg_hdr.msg_name = &to_sa;
-		mm[i].msg_hdr.msg_namelen = sl;
+		mm[i].msg_hdr.msg_namelen = socklen_inany(&to_sa);
 
 		if (data.cnt) {
 			int cnt;
diff --git a/util.c b/util.c
index c492f904..976fcabe 100644
--- a/util.c
+++ b/util.c
@@ -44,7 +44,6 @@
  * @c:		Execution context
  * @type:	epoll type
  * @sa:		Socket address to bind to
- * @sl:		Length of @sa
  * @ifname:	Interface for binding, NULL for any
  * @v6only:	Set IPV6_V6ONLY socket option
  * @data:	epoll reference portion for protocol handlers
@@ -52,10 +51,10 @@
  * Return: newly created socket, negative error code on failure
  */
 int sock_l4_sa(const struct ctx *c, enum epoll_type type,
-	       const void *sa, socklen_t sl,
-	       const char *ifname, bool v6only, uint32_t data)
+	       const union sockaddr_inany *sa, const char *ifname,
+	       bool v6only, uint32_t data)
 {
-	sa_family_t af = ((const struct sockaddr *)sa)->sa_family;
+	sa_family_t af = sa->sa_family;
 	union epoll_ref ref = { .type = type, .data = data };
 	bool freebind = false;
 	struct epoll_event ev;
@@ -152,7 +151,7 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
 		}
 	}
 
-	if (bind(fd, sa, sl) < 0) {
+	if (bind(fd, &sa->sa, socklen_inany(sa)) < 0) {
 		/* We'll fail to bind to low ports if we don't have enough
 		 * capabilities, and we'll fail to bind on already bound ports,
 		 * this is fine. This might also fail for ICMP because of a
diff --git a/util.h b/util.h
index 22eaac56..e1a1ebc9 100644
--- a/util.h
+++ b/util.h
@@ -201,10 +201,11 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
 #include "packet.h"
 
 struct ctx;
+union sockaddr_inany;
 
 int sock_l4_sa(const struct ctx *c, enum epoll_type type,
-	       const void *sa, socklen_t sl,
-	       const char *ifname, bool v6only, uint32_t data);
+	       const union sockaddr_inany *sa, const char *ifname,
+	       bool v6only, uint32_t data);
 int sock_unix(char *sock_path);
 void sock_probe_mem(struct ctx *c);
 long timespec_diff_ms(const struct timespec *a, const struct timespec *b);
-- 
2.51.0


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v3 2/8] util, flow, pif: Simplify sock_l4_sa() interface
  2025-10-29  6:26 [PATCH v3 0/8] Reduce differences between inbound and outbound socket binding David Gibson
  2025-10-29  6:26 ` [PATCH v3 1/8] inany: Let length of sockaddr_inany be implicit from the family David Gibson
@ 2025-10-29  6:26 ` David Gibson
  2025-11-13  6:33   ` Stefano Brivio
  2025-10-29  6:26 ` [PATCH v3 3/8] tcp: Merge tcp_ns_sock_init[46]() into tcp_sock_init_one() David Gibson
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: David Gibson @ 2025-10-29  6:26 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

sock_l4_sa() has a somewhat confusing 'v6only' option controlling whether
to set the IPV6_V6ONLY socket option.  Usually it's set when the given
address is IPv6, but not when we want to create a dual stack listening
socket.  The latter only makes sense when the address is :: however.

Clarify this by only keeping the v6only option in an internal helper
sock_l4_().  External users will call either sock_l4() which always creates
a socket bound to a specific IP version, or sock_l4_dualstack() which
creates a dual stack socket, but takes only a port not an address.

We drop the '_sa' suffix while we're at it - it exists because this used
to be an internal version with a sock_l4() wrapper.  The wrapper no longer
exists so the '_sa' is no longer useful.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c |  6 ++----
 pif.c  | 10 +++-------
 util.c | 27 +++++++++++++++++++++++----
 util.h |  8 +++++---
 4 files changed, 33 insertions(+), 18 deletions(-)

diff --git a/flow.c b/flow.c
index 9926f408..fd530ddb 100644
--- a/flow.c
+++ b/flow.c
@@ -186,8 +186,7 @@ static int flowside_sock_splice(void *arg)
 
 	ns_enter(a->c);
 
-	a->fd = sock_l4_sa(a->c, a->type, a->sa, NULL,
-	                   a->sa->sa_family == AF_INET6, a->data);
+	a->fd = sock_l4(a->c, a->type, a->sa, NULL, a->data);
 	a->err = errno;
 
 	return 0;
@@ -222,8 +221,7 @@ int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
 		else if (sa.sa_family == AF_INET6)
 			ifname = c->ip6.ifname_out;
 
-		return sock_l4_sa(c, type, &sa, ifname,
-				  sa.sa_family == AF_INET6, data);
+		return sock_l4(c, type, &sa, ifname, data);
 
 	case PIF_SPLICE: {
 		struct flowside_sock_args args = {
diff --git a/pif.c b/pif.c
index 31723b29..5fb1f455 100644
--- a/pif.c
+++ b/pif.c
@@ -75,11 +75,7 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
 		const union inany_addr *addr, const char *ifname,
 		in_port_t port, uint32_t data)
 {
-	union sockaddr_inany sa = {
-		.sa6.sin6_family = AF_INET6,
-		.sa6.sin6_addr = in6addr_any,
-		.sa6.sin6_port = htons(port),
-	};
+	union sockaddr_inany sa;
 
 	ASSERT(pif_is_socket(pif));
 
@@ -90,8 +86,8 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
 	}
 
 	if (!addr)
-		return sock_l4_sa(c, type, &sa, ifname, false, data);
+		return sock_l4_dualstack(c, type, port, ifname, data);
 
 	pif_sockaddr(c, &sa, pif, addr, port);
-	return sock_l4_sa(c, type, &sa, ifname, sa.sa_family == AF_INET6, data);
+	return sock_l4(c, type, &sa, ifname, data);
 }
diff --git a/util.c b/util.c
index 976fcabe..c94efae4 100644
--- a/util.c
+++ b/util.c
@@ -40,7 +40,7 @@
 #endif
 
 /**
- * sock_l4_sa() - Create and bind socket to socket address, add to epoll list
+ * sock_l4_() - Create and bind socket to socket address, add to epoll list
  * @c:		Execution context
  * @type:	epoll type
  * @sa:		Socket address to bind to
@@ -50,9 +50,9 @@
  *
  * Return: newly created socket, negative error code on failure
  */
-int sock_l4_sa(const struct ctx *c, enum epoll_type type,
-	       const union sockaddr_inany *sa, const char *ifname,
-	       bool v6only, uint32_t data)
+static int sock_l4_(const struct ctx *c, enum epoll_type type,
+		    const union sockaddr_inany *sa, const char *ifname,
+		    bool v6only, uint32_t data)
 {
 	sa_family_t af = sa->sa_family;
 	union epoll_ref ref = { .type = type, .data = data };
@@ -182,6 +182,25 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
 	return fd;
 }
 
+int sock_l4(const struct ctx *c, enum epoll_type type,
+	    const union sockaddr_inany *sa, const char *ifname,
+	    uint32_t data)
+{
+	return sock_l4_(c, type, sa, ifname, sa->sa_family == AF_INET6, data);
+}
+
+int sock_l4_dualstack(const struct ctx *c, enum epoll_type type,
+		      in_port_t port, const char *ifname, uint32_t data)
+{
+	union sockaddr_inany sa = {
+		.sa6.sin6_family = AF_INET6,
+		.sa6.sin6_addr = in6addr_any,
+		.sa6.sin6_port = htons(port),
+	};
+
+	return sock_l4_(c, type, &sa, ifname, 0, data);
+}
+
 /**
  * sock_unix() - Create and bind AF_UNIX socket
  * @sock_path:	Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
diff --git a/util.h b/util.h
index e1a1ebc9..7f0cf686 100644
--- a/util.h
+++ b/util.h
@@ -203,9 +203,11 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
 struct ctx;
 union sockaddr_inany;
 
-int sock_l4_sa(const struct ctx *c, enum epoll_type type,
-	       const union sockaddr_inany *sa, const char *ifname,
-	       bool v6only, uint32_t data);
+int sock_l4(const struct ctx *c, enum epoll_type type,
+	    const union sockaddr_inany *sa, const char *ifname,
+	    uint32_t data);
+int sock_l4_dualstack(const struct ctx *c, enum epoll_type type,
+		      in_port_t port, const char *ifname, uint32_t data);
 int sock_unix(char *sock_path);
 void sock_probe_mem(struct ctx *c);
 long timespec_diff_ms(const struct timespec *a, const struct timespec *b);
-- 
2.51.0


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v3 3/8] tcp: Merge tcp_ns_sock_init[46]() into tcp_sock_init_one()
  2025-10-29  6:26 [PATCH v3 0/8] Reduce differences between inbound and outbound socket binding David Gibson
  2025-10-29  6:26 ` [PATCH v3 1/8] inany: Let length of sockaddr_inany be implicit from the family David Gibson
  2025-10-29  6:26 ` [PATCH v3 2/8] util, flow, pif: Simplify sock_l4_sa() interface David Gibson
@ 2025-10-29  6:26 ` David Gibson
  2025-10-29  6:26 ` [PATCH v3 4/8] udp: Unify some more inbound/outbound parts of udp_sock_init() David Gibson
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: David Gibson @ 2025-10-29  6:26 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Surprisingly little logic is shared between the path for creating a
listen()ing socket in the guest namespace versus in the host namespace.
Improve this, by extending tcp_sock_init_one() to take a pif parameter
indicating where it should open the socket.  This allows
tcp_ns_sock_init[46]() to be removed entirely.

We generalise tcp_sock_init() in the same way, although we don't use it
yet, due to some subtle differences in how we bind for -t versus -T.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 conf.c |   2 +-
 tcp.c  | 100 +++++++++++++++++++--------------------------------------
 tcp.h  |   5 +--
 3 files changed, 37 insertions(+), 70 deletions(-)

diff --git a/conf.c b/conf.c
index 66b9e634..26f1bcc0 100644
--- a/conf.c
+++ b/conf.c
@@ -169,7 +169,7 @@ static void conf_ports_range_except(const struct ctx *c, char optname,
 		fwd->delta[i] = to - first;
 
 		if (optname == 't')
-			ret = tcp_sock_init(c, addr, ifname, i);
+			ret = tcp_sock_init(c, PIF_HOST, addr, ifname, i);
 		else if (optname == 'u')
 			ret = udp_sock_init(c, 0, addr, ifname, i);
 		else
diff --git a/tcp.c b/tcp.c
index d63a16bf..5f4fc12c 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2514,29 +2514,42 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
 /**
  * tcp_sock_init_one() - Initialise listening socket for address and port
  * @c:		Execution context
+ * @pif:	Interface to open the socket for (PIF_HOST or PIF_SPLICE)
  * @addr:	Pointer to address for binding, NULL for dual stack any
  * @ifname:	Name of interface to bind to, NULL if not configured
  * @port:	Port, host order
  *
  * Return: fd for the new listening socket, negative error code on failure
+ *
+ * If pif == PIF_SPLICE, the caller must have already entered the guest ns.
  */
-static int tcp_sock_init_one(const struct ctx *c, const union inany_addr *addr,
-			     const char *ifname, in_port_t port)
+static int tcp_sock_init_one(const struct ctx *c, uint8_t pif,
+			     const union inany_addr *addr, const char *ifname,
+			     in_port_t port)
 {
 	union tcp_listen_epoll_ref tref = {
 		.port = port,
-		.pif = PIF_HOST,
+		.pif = pif,
 	};
+	const struct fwd_ports *fwd;
 	int s;
 
-	s = pif_sock_l4(c, EPOLL_TYPE_TCP_LISTEN, PIF_HOST, addr,
-				ifname, port, tref.u32);
+	if (pif == PIF_HOST)
+		fwd = &c->tcp.fwd_in;
+	else
+		fwd = &c->tcp.fwd_out;
+
+	s = pif_sock_l4(c, EPOLL_TYPE_TCP_LISTEN, pif, addr, ifname,
+			port, tref.u32);
+
+	if (fwd->mode == FWD_AUTO) {
+		int (*socks)[IP_VERSIONS] = pif == PIF_SPLICE ?
+			tcp_sock_ns : tcp_sock_init_ext;
 
-	if (c->tcp.fwd_in.mode == FWD_AUTO) {
 		if (!addr || inany_v4(addr))
-			tcp_sock_init_ext[port][V4] = s < 0 ? -1 : s;
+			socks[port][V4] = s < 0 ? -1 : s;
 		if (!addr || !inany_v4(addr))
-			tcp_sock_init_ext[port][V6] = s < 0 ? -1 : s;
+			socks[port][V6] = s < 0 ? -1 : s;
 	}
 
 	if (s < 0)
@@ -2548,14 +2561,16 @@ static int tcp_sock_init_one(const struct ctx *c, const union inany_addr *addr,
 /**
  * tcp_sock_init() - Create listening sockets for a given host ("inbound") port
  * @c:		Execution context
+ * @pif:	Interface to open the socket for (PIF_HOST or PIF_SPLICE)
  * @addr:	Pointer to address for binding, NULL if not configured
  * @ifname:	Name of interface to bind to, NULL if not configured
  * @port:	Port, host order
  *
  * Return: 0 on (partial) success, negative error code on (complete) failure
  */
-int tcp_sock_init(const struct ctx *c, const union inany_addr *addr,
-		  const char *ifname, in_port_t port)
+int tcp_sock_init(const struct ctx *c, uint8_t pif,
+		  const union inany_addr *addr, const char *ifname,
+		  in_port_t port)
 {
 	int r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1;
 
@@ -2563,72 +2578,23 @@ int tcp_sock_init(const struct ctx *c, const union inany_addr *addr,
 
 	if (!addr && c->ifi4 && c->ifi6)
 		/* Attempt to get a dual stack socket */
-		if (tcp_sock_init_one(c, NULL, ifname, port) >= 0)
+		if (tcp_sock_init_one(c, pif, NULL, ifname, port) >= 0)
 			return 0;
 
 	/* Otherwise create a socket per IP version */
 	if ((!addr || inany_v4(addr)) && c->ifi4)
-		r4 = tcp_sock_init_one(c, addr ? addr : &inany_any4,
-				       ifname, port);
+		r4 = tcp_sock_init_one(c, pif,
+				       addr ? addr : &inany_any4, ifname, port);
 
 	if ((!addr || !inany_v4(addr)) && c->ifi6)
-		r6 = tcp_sock_init_one(c, addr ? addr : &inany_any6,
-				       ifname, port);
+		r6 = tcp_sock_init_one(c, pif,
+				       addr ? addr : &inany_any6, ifname, port);
 
 	if (IN_INTERVAL(0, FD_REF_MAX, r4) || IN_INTERVAL(0, FD_REF_MAX, r6))
 		return 0;
 
 	return r4 < 0 ? r4 : r6;
 }
-
-/**
- * tcp_ns_sock_init4() - Init socket to listen for outbound IPv4 connections
- * @c:		Execution context
- * @port:	Port, host order
- */
-static void tcp_ns_sock_init4(const struct ctx *c, in_port_t port)
-{
-	union tcp_listen_epoll_ref tref = {
-		.port = port,
-		.pif = PIF_SPLICE,
-	};
-	int s;
-
-	ASSERT(c->mode == MODE_PASTA);
-
-	s = pif_sock_l4(c, EPOLL_TYPE_TCP_LISTEN, PIF_SPLICE, &inany_loopback4,
-			NULL, port, tref.u32);
-	if (s < 0)
-		s = -1;
-
-	if (c->tcp.fwd_out.mode == FWD_AUTO)
-		tcp_sock_ns[port][V4] = s;
-}
-
-/**
- * tcp_ns_sock_init6() - Init socket to listen for outbound IPv6 connections
- * @c:		Execution context
- * @port:	Port, host order
- */
-static void tcp_ns_sock_init6(const struct ctx *c, in_port_t port)
-{
-	union tcp_listen_epoll_ref tref = {
-		.port = port,
-		.pif = PIF_SPLICE,
-	};
-	int s;
-
-	ASSERT(c->mode == MODE_PASTA);
-
-	s = pif_sock_l4(c, EPOLL_TYPE_TCP_LISTEN, PIF_SPLICE, &inany_loopback6,
-			NULL, port, tref.u32);
-	if (s < 0)
-		s = -1;
-
-	if (c->tcp.fwd_out.mode == FWD_AUTO)
-		tcp_sock_ns[port][V6] = s;
-}
-
 /**
  * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections
  * @c:		Execution context
@@ -2639,9 +2605,9 @@ static void tcp_ns_sock_init(const struct ctx *c, in_port_t port)
 	ASSERT(!c->no_tcp);
 
 	if (c->ifi4)
-		tcp_ns_sock_init4(c, port);
+		tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback4, NULL, port);
 	if (c->ifi6)
-		tcp_ns_sock_init6(c, port);
+		tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback6, NULL, port);
 }
 
 /**
@@ -2844,7 +2810,7 @@ static void tcp_port_rebind(struct ctx *c, bool outbound)
 			if (outbound)
 				tcp_ns_sock_init(c, port);
 			else
-				tcp_sock_init(c, NULL, NULL, port);
+				tcp_sock_init(c, PIF_HOST, NULL, NULL, port);
 		}
 	}
 }
diff --git a/tcp.h b/tcp.h
index 234a8033..fb22bac0 100644
--- a/tcp.h
+++ b/tcp.h
@@ -18,8 +18,9 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref,
 int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		    const void *saddr, const void *daddr, uint32_t flow_lbl,
 		    const struct pool *p, int idx, const struct timespec *now);
-int tcp_sock_init(const struct ctx *c, const union inany_addr *addr,
-		  const char *ifname, in_port_t port);
+int tcp_sock_init(const struct ctx *c, uint8_t pif,
+		  const union inany_addr *addr, const char *ifname,
+		  in_port_t port);
 int tcp_init(struct ctx *c);
 void tcp_timer(struct ctx *c, const struct timespec *now);
 void tcp_defer_handler(struct ctx *c);
-- 
2.51.0


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v3 4/8] udp: Unify some more inbound/outbound parts of udp_sock_init()
  2025-10-29  6:26 [PATCH v3 0/8] Reduce differences between inbound and outbound socket binding David Gibson
                   ` (2 preceding siblings ...)
  2025-10-29  6:26 ` [PATCH v3 3/8] tcp: Merge tcp_ns_sock_init[46]() into tcp_sock_init_one() David Gibson
@ 2025-10-29  6:26 ` David Gibson
  2025-11-13  6:33   ` Stefano Brivio
  2025-10-29  6:26 ` [PATCH v3 5/8] udp: Move udp_sock_init() special case to its caller David Gibson
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: David Gibson @ 2025-10-29  6:26 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

udp_sock_init() takes an 'ns' parameter determining if it creates a socket
in the guest namespace or host namespace.  Alter it to take a pif
parameter instead, like tcp_sock_init(), and use that change to slightly
reduce code duplication between the HOST and SPLICE cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 conf.c |  2 +-
 udp.c  | 65 +++++++++++++++++++++++++++++++---------------------------
 udp.h  |  5 +++--
 3 files changed, 39 insertions(+), 33 deletions(-)

diff --git a/conf.c b/conf.c
index 26f1bcc0..08cb50aa 100644
--- a/conf.c
+++ b/conf.c
@@ -171,7 +171,7 @@ static void conf_ports_range_except(const struct ctx *c, char optname,
 		if (optname == 't')
 			ret = tcp_sock_init(c, PIF_HOST, addr, ifname, i);
 		else if (optname == 'u')
-			ret = udp_sock_init(c, 0, addr, ifname, i);
+			ret = udp_sock_init(c, PIF_HOST, addr, ifname, i);
 		else
 			/* No way to check in advance for -T and -U */
 			ret = 0;
diff --git a/udp.c b/udp.c
index 8f96495d..f3237436 100644
--- a/udp.c
+++ b/udp.c
@@ -1091,64 +1091,68 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
 /**
  * udp_sock_init() - Initialise listening sockets for a given port
  * @c:		Execution context
- * @ns:		In pasta mode, if set, bind with loopback address in namespace
+ * @pif:	Interface to open the socket for (PIF_HOST or PIF_SPLICE)
  * @addr:	Pointer to address for binding, NULL if not configured
  * @ifname:	Name of interface to bind to, NULL if not configured
  * @port:	Port, host order
  *
  * Return: 0 on (partial) success, negative error code on (complete) failure
  */
-int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
-		  const char *ifname, in_port_t port)
+int udp_sock_init(const struct ctx *c, uint8_t pif,
+		  const union inany_addr *addr, const char *ifname,
+		  in_port_t port)
 {
 	union udp_listen_epoll_ref uref = {
-		.pif = ns ? PIF_SPLICE : PIF_HOST,
+		.pif = pif,
 		.port = port,
 	};
 	int r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1;
+	int (*socks)[NUM_PORTS];
 
 	ASSERT(!c->no_udp);
+	ASSERT(pif_is_socket(pif));
 
-	if (!addr && c->ifi4 && c->ifi6 && !ns) {
+	if (pif == PIF_HOST)
+		socks = udp_splice_init;
+	else
+		socks = udp_splice_ns;
+
+	if (!addr && c->ifi4 && c->ifi6 && pif == PIF_HOST) {
 		int s;
 
 		/* Attempt to get a dual stack socket */
 		s = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST,
 				NULL, ifname, port, uref.u32);
-		udp_splice_init[V4][port] = s < 0 ? -1 : s;
-		udp_splice_init[V6][port] = s < 0 ? -1 : s;
+		socks[V4][port] = s < 0 ? -1 : s;
+		socks[V6][port] = s < 0 ? -1 : s;
 		if (IN_INTERVAL(0, FD_REF_MAX, s))
 			return 0;
 	}
 
 	if ((!addr || inany_v4(addr)) && c->ifi4) {
-		if (!ns) {
-			r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST,
-					 addr ? addr : &inany_any4, ifname,
-					 port, uref.u32);
+		const union inany_addr *a = addr ?
+			addr : &inany_any4;
 
-			udp_splice_init[V4][port] = r4 < 0 ? -1 : r4;
-		} else {
-			r4  = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_SPLICE,
-					  &inany_loopback4, ifname,
-					  port, uref.u32);
-			udp_splice_ns[V4][port] = r4 < 0 ? -1 : r4;
-		}
+		if (pif == PIF_SPLICE)
+			a = &inany_loopback4;
+
+		r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
+				 port, uref.u32);
+
+		socks[V4][port] = r4 < 0 ? -1 : r4;
 	}
 
 	if ((!addr || !inany_v4(addr)) && c->ifi6) {
-		if (!ns) {
-			r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST,
-					 addr ? addr : &inany_any6, ifname,
-					 port, uref.u32);
+		const union inany_addr *a = addr ?
+			addr : &inany_any6;
 
-			udp_splice_init[V6][port] = r6 < 0 ? -1 : r6;
-		} else {
-			r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_SPLICE,
-					 &inany_loopback6, ifname,
-					 port, uref.u32);
-			udp_splice_ns[V6][port] = r6 < 0 ? -1 : r6;
-		}
+		if (pif == PIF_SPLICE)
+			a = &inany_loopback6;
+
+		r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
+				 port, uref.u32);
+
+		socks[V6][port] = r6 < 0 ? -1 : r6;
 	}
 
 	if (IN_INTERVAL(0, FD_REF_MAX, r4) || IN_INTERVAL(0, FD_REF_MAX, r6))
@@ -1214,7 +1218,8 @@ static void udp_port_rebind(struct ctx *c, bool outbound)
 
 		if ((c->ifi4 && socks[V4][port] == -1) ||
 		    (c->ifi6 && socks[V6][port] == -1))
-			udp_sock_init(c, outbound, NULL, NULL, port);
+			udp_sock_init(c, outbound ? PIF_SPLICE : PIF_HOST,
+				      NULL, NULL, port);
 	}
 }
 
diff --git a/udp.h b/udp.h
index 8f8531ad..f78dc528 100644
--- a/udp.h
+++ b/udp.h
@@ -17,8 +17,9 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
 		    sa_family_t af, const void *saddr, const void *daddr,
 		    uint8_t ttl, const struct pool *p, int idx,
 		    const struct timespec *now);
-int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
-		  const char *ifname, in_port_t port);
+int udp_sock_init(const struct ctx *c, uint8_t pif,
+		  const union inany_addr *addr, const char *ifname,
+		  in_port_t port);
 int udp_init(struct ctx *c);
 void udp_timer(struct ctx *c, const struct timespec *now);
 void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);
-- 
2.51.0


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v3 5/8] udp: Move udp_sock_init() special case to its caller
  2025-10-29  6:26 [PATCH v3 0/8] Reduce differences between inbound and outbound socket binding David Gibson
                   ` (3 preceding siblings ...)
  2025-10-29  6:26 ` [PATCH v3 4/8] udp: Unify some more inbound/outbound parts of udp_sock_init() David Gibson
@ 2025-10-29  6:26 ` David Gibson
  2025-10-29  6:26 ` [PATCH v3 6/8] util: Fix setting of IPV6_V6ONLY socket option David Gibson
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: David Gibson @ 2025-10-29  6:26 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Inbound sockets are bound to the unspecified address by default, whereas
outbound sockets are bound to the loopback address.  This is currently
handled in udp_sock_init() which ignores its addr argument in the outbound
case.

Move the handling of this special case to the caller, udp_port_rebind().
This makes the semantics of the 'addr' argument more consistent, and will
make future changes easier.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 udp.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/udp.c b/udp.c
index f3237436..ee8334fc 100644
--- a/udp.c
+++ b/udp.c
@@ -1117,7 +1117,7 @@ int udp_sock_init(const struct ctx *c, uint8_t pif,
 	else
 		socks = udp_splice_ns;
 
-	if (!addr && c->ifi4 && c->ifi6 && pif == PIF_HOST) {
+	if (!addr && c->ifi4 && c->ifi6) {
 		int s;
 
 		/* Attempt to get a dual stack socket */
@@ -1133,9 +1133,6 @@ int udp_sock_init(const struct ctx *c, uint8_t pif,
 		const union inany_addr *a = addr ?
 			addr : &inany_any4;
 
-		if (pif == PIF_SPLICE)
-			a = &inany_loopback4;
-
 		r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
 				 port, uref.u32);
 
@@ -1146,9 +1143,6 @@ int udp_sock_init(const struct ctx *c, uint8_t pif,
 		const union inany_addr *a = addr ?
 			addr : &inany_any6;
 
-		if (pif == PIF_SPLICE)
-			a = &inany_loopback6;
-
 		r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
 				 port, uref.u32);
 
@@ -1217,9 +1211,16 @@ static void udp_port_rebind(struct ctx *c, bool outbound)
 			continue;
 
 		if ((c->ifi4 && socks[V4][port] == -1) ||
-		    (c->ifi6 && socks[V6][port] == -1))
-			udp_sock_init(c, outbound ? PIF_SPLICE : PIF_HOST,
-				      NULL, NULL, port);
+		    (c->ifi6 && socks[V6][port] == -1)) {
+			if (outbound) {
+				udp_sock_init(c, PIF_SPLICE,
+					      &inany_loopback4, NULL, port);
+				udp_sock_init(c, PIF_SPLICE,
+					      &inany_loopback6, NULL, port);
+			} else {
+				udp_sock_init(c, PIF_HOST, NULL, NULL, port);
+			}
+		}
 	}
 }
 
-- 
2.51.0


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v3 6/8] util: Fix setting of IPV6_V6ONLY socket option
  2025-10-29  6:26 [PATCH v3 0/8] Reduce differences between inbound and outbound socket binding David Gibson
                   ` (4 preceding siblings ...)
  2025-10-29  6:26 ` [PATCH v3 5/8] udp: Move udp_sock_init() special case to its caller David Gibson
@ 2025-10-29  6:26 ` David Gibson
  2025-11-13  6:33   ` Stefano Brivio
  2025-10-29  6:26 ` [PATCH v3 7/8] tcp, udp: Remove fallback if creating dual stack socket fails David Gibson
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: David Gibson @ 2025-10-29  6:26 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Currently we only call setsockopt() on IPV6_V6ONLY when we want to set it
to 1, which we typically do on all IPv6 sockets except those explicitly for
dual stack listening.  That's not quite right in two ways:

 * Although IPV6_V6ONLY==0 is normally the default on Linux, that can be
   changed with the net.ipv6.bindv6only sysctl.  It may also have different
   defaults on other OSes if we ever support them.  We know we need it off
   for dual stack sockets, so explicitly set it to 0 in that case.

 * At the same time setting IPV6_V6ONLY to 1 for IPv6 sockets bound to a
   specific address is harmless but pointless.  Don't set the option at all
   in this case, saving a syscall.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 util.c | 27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/util.c b/util.c
index c94efae4..62f43895 100644
--- a/util.c
+++ b/util.c
@@ -45,14 +45,14 @@
  * @type:	epoll type
  * @sa:		Socket address to bind to
  * @ifname:	Interface for binding, NULL for any
- * @v6only:	Set IPV6_V6ONLY socket option
+ * @v6only:	If >= 0, set IPV6_V6ONLY socket option to this value
  * @data:	epoll reference portion for protocol handlers
  *
  * Return: newly created socket, negative error code on failure
  */
 static int sock_l4_(const struct ctx *c, enum epoll_type type,
 		    const union sockaddr_inany *sa, const char *ifname,
-		    bool v6only, uint32_t data)
+		    int v6only, uint32_t data)
 {
 	sa_family_t af = sa->sa_family;
 	union epoll_ref ref = { .type = type, .data = data };
@@ -101,9 +101,11 @@ static int sock_l4_(const struct ctx *c, enum epoll_type type,
 
 	ref.fd = fd;
 
-	if (v6only)
-		if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &y, sizeof(y)))
-			debug("Failed to set IPV6_V6ONLY on socket %i", fd);
+	if (v6only >= 0)
+		if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY,
+			       &v6only, sizeof(v6only)))
+			debug("Failed to set IPV6_V6ONLY to %d on socket %i",
+			      v6only, fd);
 
 	if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &y, sizeof(y)))
 		debug("Failed to set SO_REUSEADDR on socket %i", fd);
@@ -186,7 +188,16 @@ int sock_l4(const struct ctx *c, enum epoll_type type,
 	    const union sockaddr_inany *sa, const char *ifname,
 	    uint32_t data)
 {
-	return sock_l4_(c, type, sa, ifname, sa->sa_family == AF_INET6, data);
+	int v6only = -1;
+
+	/* The option doesn't exist for IPv4 sockets, and is irrelevant for IPv6
+	 * sockets with a non-wildcard address.
+	 */
+	if (sa->sa_family == AF_INET6 &&
+	    IN6_IS_ADDR_UNSPECIFIED(&sa->sa6.sin6_addr))
+		v6only = 1;
+
+	return sock_l4_(c, type, sa, ifname, v6only, data);
 }
 
 int sock_l4_dualstack(const struct ctx *c, enum epoll_type type,
@@ -198,6 +209,10 @@ int sock_l4_dualstack(const struct ctx *c, enum epoll_type type,
 		.sa6.sin6_port = htons(port),
 	};
 
+	/* Dual stack sockets require IPV6_V6ONLY == 0.  Usually that's the
+	 * default, but sysctl net.ipv6.bindv6only can change that, so set the
+	 * sockopt explicitly.
+	 */
 	return sock_l4_(c, type, &sa, ifname, 0, data);
 }
 
-- 
2.51.0


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v3 7/8] tcp, udp: Remove fallback if creating dual stack socket fails
  2025-10-29  6:26 [PATCH v3 0/8] Reduce differences between inbound and outbound socket binding David Gibson
                   ` (5 preceding siblings ...)
  2025-10-29  6:26 ` [PATCH v3 6/8] util: Fix setting of IPV6_V6ONLY socket option David Gibson
@ 2025-10-29  6:26 ` David Gibson
  2025-10-29  6:26 ` [PATCH v3 8/8] [RFC, DO NOT APPLY] tcp, udp: Bind outbound listening sockets by interface instead of address David Gibson
  2025-10-30  3:58 ` [PATCH v3 0/8] Reduce differences between inbound and outbound socket binding David Gibson
  8 siblings, 0 replies; 24+ messages in thread
From: David Gibson @ 2025-10-29  6:26 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

To save kernel memory we try to use "dual stack" sockets which can listen
for both IPv4 and IPv6 connections where possible.  To support kernels
which don't allow dual stack sockets, we fall back to creating individual
sockets for IPv4 and IPv6.

This fallback causes some mild ugliness now, and will cause more difficulty
with upcoming improvements to the forwarding logic.  I don't think we need
the fallback on the following grounds:

1) The fallback was broken since inception:

The fallback was triggered if pif_sock_l4() failed attempting to create the
dual stack socket.  But even if the kernel didn't suppor them,
pif_sock_l4() would not report a failure.
  - Dual stack sockets are distinguished by having the IPV6_V6ONLY sockopt
    set to 0.  However, until the last patch, we only called setsockopt()
    if we wanted to set this to 1, so there was no kernel operation which
    could fail for dual stack sockets - we'd silently create a IPv6 only
    socket instead.
  - Even if we did call the setsockopt(), we only printed a debug() message
    for failures, we didn't report it to the caller

2) Dual stack sockets are not just a Linux extension

The dual stack socket interface is described in RFC3493, specifically
section 3.7 and section 5.3.  It is supported on BSD:
  https://man.freebsd.org/cgi/man.cgi?query=ip6
and on Windows:
  https://learn.microsoft.com/en-us/windows/win32/winsock/ipproto-ipv6-socket-options

3) Linux has supported dual stack sockets for over 20 years

According to ipv6(7) the IPV6_V6ONLY socket option was introduced in
Linux 2.6 and Linux 2.4.21 both from 2003.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c | 43 +++++++++++++++++++++++++------------------
 udp.c | 58 ++++++++++++++++++++++++++--------------------------------
 2 files changed, 51 insertions(+), 50 deletions(-)

diff --git a/tcp.c b/tcp.c
index 5f4fc12c..a19f0ba9 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2559,42 +2559,49 @@ static int tcp_sock_init_one(const struct ctx *c, uint8_t pif,
 }
 
 /**
- * tcp_sock_init() - Create listening sockets for a given host ("inbound") port
+ * tcp_sock_init() - Create listening socket for a given host ("inbound") port
  * @c:		Execution context
  * @pif:	Interface to open the socket for (PIF_HOST or PIF_SPLICE)
  * @addr:	Pointer to address for binding, NULL if not configured
  * @ifname:	Name of interface to bind to, NULL if not configured
  * @port:	Port, host order
  *
- * Return: 0 on (partial) success, negative error code on (complete) failure
+ * Return: 0 on success, negative error code on failure
  */
 int tcp_sock_init(const struct ctx *c, uint8_t pif,
 		  const union inany_addr *addr, const char *ifname,
 		  in_port_t port)
 {
-	int r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1;
+	int s;
 
 	ASSERT(!c->no_tcp);
 
-	if (!addr && c->ifi4 && c->ifi6)
-		/* Attempt to get a dual stack socket */
-		if (tcp_sock_init_one(c, pif, NULL, ifname, port) >= 0)
+	if (!c->ifi4) {
+		if (!addr)
+			/* Restrict to v6 only */
+			addr = &inany_any6;
+		else if (inany_v4(addr))
+			/* Nothing to do */
 			return 0;
+	}
+	if (!c->ifi6) {
+		if (!addr)
+			/* Restrict to v4 only */
+			addr = &inany_any4;
+		else if (!inany_v4(addr))
+			/* Nothing to do */
+			return 0;
+	}
 
-	/* Otherwise create a socket per IP version */
-	if ((!addr || inany_v4(addr)) && c->ifi4)
-		r4 = tcp_sock_init_one(c, pif,
-				       addr ? addr : &inany_any4, ifname, port);
-
-	if ((!addr || !inany_v4(addr)) && c->ifi6)
-		r6 = tcp_sock_init_one(c, pif,
-				       addr ? addr : &inany_any6, ifname, port);
-
-	if (IN_INTERVAL(0, FD_REF_MAX, r4) || IN_INTERVAL(0, FD_REF_MAX, r6))
-		return 0;
+	s = tcp_sock_init_one(c, pif, addr, ifname, port);
+	if (s < 0)
+		return s;
+	if (s > FD_REF_MAX)
+		return -EIO;
 
-	return r4 < 0 ? r4 : r6;
+	return 0;
 }
+
 /**
  * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections
  * @c:		Execution context
diff --git a/udp.c b/udp.c
index ee8334fc..5754c436 100644
--- a/udp.c
+++ b/udp.c
@@ -1089,14 +1089,14 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
 }
 
 /**
- * udp_sock_init() - Initialise listening sockets for a given port
+ * udp_sock_init() - Initialise listening socket for a given port
  * @c:		Execution context
  * @pif:	Interface to open the socket for (PIF_HOST or PIF_SPLICE)
  * @addr:	Pointer to address for binding, NULL if not configured
  * @ifname:	Name of interface to bind to, NULL if not configured
  * @port:	Port, host order
  *
- * Return: 0 on (partial) success, negative error code on (complete) failure
+ * Return: 0 on success, negative error code on failure
  */
 int udp_sock_init(const struct ctx *c, uint8_t pif,
 		  const union inany_addr *addr, const char *ifname,
@@ -1106,8 +1106,8 @@ int udp_sock_init(const struct ctx *c, uint8_t pif,
 		.pif = pif,
 		.port = port,
 	};
-	int r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1;
 	int (*socks)[NUM_PORTS];
+	int s;
 
 	ASSERT(!c->no_udp);
 	ASSERT(pif_is_socket(pif));
@@ -1117,42 +1117,36 @@ int udp_sock_init(const struct ctx *c, uint8_t pif,
 	else
 		socks = udp_splice_ns;
 
-	if (!addr && c->ifi4 && c->ifi6) {
-		int s;
-
-		/* Attempt to get a dual stack socket */
-		s = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST,
-				NULL, ifname, port, uref.u32);
-		socks[V4][port] = s < 0 ? -1 : s;
-		socks[V6][port] = s < 0 ? -1 : s;
-		if (IN_INTERVAL(0, FD_REF_MAX, s))
+	if (!c->ifi4) {
+		if (!addr)
+			/* Restrict to v6 only */
+			addr = &inany_any6;
+		else if (inany_v4(addr))
+			/* Nothing to do */
 			return 0;
 	}
-
-	if ((!addr || inany_v4(addr)) && c->ifi4) {
-		const union inany_addr *a = addr ?
-			addr : &inany_any4;
-
-		r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
-				 port, uref.u32);
-
-		socks[V4][port] = r4 < 0 ? -1 : r4;
+	if (!c->ifi6) {
+		if (!addr)
+			/* Restrict to v4 only */
+			addr = &inany_any4;
+		else if (!inany_v4(addr))
+			/* Nothing to do */
+			return 0;
 	}
 
-	if ((!addr || !inany_v4(addr)) && c->ifi6) {
-		const union inany_addr *a = addr ?
-			addr : &inany_any6;
-
-		r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
-				 port, uref.u32);
-
-		socks[V6][port] = r6 < 0 ? -1 : r6;
+	s = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif,
+			addr, ifname, port, uref.u32);
+	if (s > FD_REF_MAX) {
+		close(s);
+		s = -EIO;
 	}
 
-	if (IN_INTERVAL(0, FD_REF_MAX, r4) || IN_INTERVAL(0, FD_REF_MAX, r6))
-		return 0;
+	if (!addr || inany_v4(addr))
+		socks[V4][port] = s < 0 ? -1 : s;
+	if (!addr || !inany_v4(addr))
+		socks[V6][port] = s < 0 ? -1 : s;
 
-	return r4 < 0 ? r4 : r6;
+	return s < 0 ? s : 0;
 }
 
 /**
-- 
2.51.0


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v3 8/8] [RFC, DO NOT APPLY] tcp, udp: Bind outbound listening sockets by interface instead of address
  2025-10-29  6:26 [PATCH v3 0/8] Reduce differences between inbound and outbound socket binding David Gibson
                   ` (6 preceding siblings ...)
  2025-10-29  6:26 ` [PATCH v3 7/8] tcp, udp: Remove fallback if creating dual stack socket fails David Gibson
@ 2025-10-29  6:26 ` David Gibson
  2025-10-30  3:58 ` [PATCH v3 0/8] Reduce differences between inbound and outbound socket binding David Gibson
  8 siblings, 0 replies; 24+ messages in thread
From: David Gibson @ 2025-10-29  6:26 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

NOTE: I've discovered this approach doesn't work to address bug 100,
although it may still be worthwhile for secondary reasons.

Currently, outbound forwards (-T, -U) are handled by sockets bound to the
loopback address.  Typically we create two sockets, one for 127.0.0.1 and
one for ::1.

This has some disadvantages:
 * The guest can't connect to these services using its global IP address,
   it must explicitly use 127.0.0.1 or ::1 (bug 100)
 * The guest can't even connect via 127.0.0.0/8 addresses other than
   127.0.0.1
 * We can't use dual-stack sockets, we have to have separate sockets for
   IPv4 and IPv6.

The restriction exist for a reason though.  If the guest has any interfaces
other than pasta (e.g. a VPN tunnel) external hosts could reach the host
via the forwards.  Especially combined with -T auto / -U auto this would
make it very easy to make a mistake with nasty security implications.

We can achieve both goals, however, if we don't bind the outbound listening
sockets to a particular address, but _do_ use SO_BINDTODEVICE to restrict
them to the "lo" interface.

Link: https://bugs.passt.top/show_bug.cgi?id=100

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 pif.c |  6 ------
 tcp.c | 19 ++-----------------
 udp.c | 10 +++-------
 3 files changed, 5 insertions(+), 30 deletions(-)

diff --git a/pif.c b/pif.c
index 5fb1f455..20b5cb99 100644
--- a/pif.c
+++ b/pif.c
@@ -79,12 +79,6 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
 
 	ASSERT(pif_is_socket(pif));
 
-	if (pif == PIF_SPLICE) {
-		/* Sanity checks */
-		ASSERT(!ifname);
-		ASSERT(addr && inany_is_loopback(addr));
-	}
-
 	if (!addr)
 		return sock_l4_dualstack(c, type, port, ifname, data);
 
diff --git a/tcp.c b/tcp.c
index a19f0ba9..841c60fa 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2602,21 +2602,6 @@ int tcp_sock_init(const struct ctx *c, uint8_t pif,
 	return 0;
 }
 
-/**
- * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections
- * @c:		Execution context
- * @port:	Port, host order
- */
-static void tcp_ns_sock_init(const struct ctx *c, in_port_t port)
-{
-	ASSERT(!c->no_tcp);
-
-	if (c->ifi4)
-		tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback4, NULL, port);
-	if (c->ifi6)
-		tcp_sock_init_one(c, PIF_SPLICE, &inany_loopback6, NULL, port);
-}
-
 /**
  * tcp_ns_socks_init() - Bind sockets in namespace for outbound connections
  * @arg:	Execution context
@@ -2635,7 +2620,7 @@ static int tcp_ns_socks_init(void *arg)
 		if (!bitmap_isset(c->tcp.fwd_out.map, port))
 			continue;
 
-		tcp_ns_sock_init(c, port);
+		tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);
 	}
 
 	return 0;
@@ -2815,7 +2800,7 @@ static void tcp_port_rebind(struct ctx *c, bool outbound)
 		if ((c->ifi4 && socks[port][V4] == -1) ||
 		    (c->ifi6 && socks[port][V6] == -1)) {
 			if (outbound)
-				tcp_ns_sock_init(c, port);
+				tcp_sock_init(c, PIF_SPLICE, NULL, "lo", port);
 			else
 				tcp_sock_init(c, PIF_HOST, NULL, NULL, port);
 		}
diff --git a/udp.c b/udp.c
index 5754c436..4f796077 100644
--- a/udp.c
+++ b/udp.c
@@ -1206,14 +1206,10 @@ static void udp_port_rebind(struct ctx *c, bool outbound)
 
 		if ((c->ifi4 && socks[V4][port] == -1) ||
 		    (c->ifi6 && socks[V6][port] == -1)) {
-			if (outbound) {
-				udp_sock_init(c, PIF_SPLICE,
-					      &inany_loopback4, NULL, port);
-				udp_sock_init(c, PIF_SPLICE,
-					      &inany_loopback6, NULL, port);
-			} else {
+			if (outbound)
+				udp_sock_init(c, PIF_SPLICE, NULL, "lo", port);
+			else
 				udp_sock_init(c, PIF_HOST, NULL, NULL, port);
-			}
 		}
 	}
 }
-- 
2.51.0


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 0/8] Reduce differences between inbound and outbound socket binding
  2025-10-29  6:26 [PATCH v3 0/8] Reduce differences between inbound and outbound socket binding David Gibson
                   ` (7 preceding siblings ...)
  2025-10-29  6:26 ` [PATCH v3 8/8] [RFC, DO NOT APPLY] tcp, udp: Bind outbound listening sockets by interface instead of address David Gibson
@ 2025-10-30  3:58 ` David Gibson
  8 siblings, 0 replies; 24+ messages in thread
From: David Gibson @ 2025-10-30  3:58 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio

[-- Attachment #1: Type: text/plain, Size: 1806 bytes --]

On Wed, Oct 29, 2025 at 05:26:20PM +1100, David Gibson wrote:
> The fact that outbound forwarding sockets are bound to the loopback
> address, whereas inbound forwarding sockets are (by default) bound to
> the unspecified address leads to some unexpected differences between
> the paths setting up each of them.
> 
> An idea for tackling bug 100 suggested a different approach which will
> also reduce some of those differences and allow more code to be shared
> between the two paths.  I've since discovered that this approach may
> not help for bug 100, but it might still be worthwhile for the clean
> up.
> 
> Patches 1..6/8 are cleanups which shouldn't change behaviour, and I
> think are ready to merge.  7/8 is (arguably) a behavioural change, but
> I've made my case for it in the patch comment.  8/8 needs some further
> consideration, since I've discovered it does not fix bug 100 as is,
> I'm including it for advance review, though.

Follow up on 8/8 now I've looked into SO_BINDTODEVICE semantics more
clearly.  Unfortunately this approach won't work to fix bug 100 [0].
I still think the change is useful for the internals and it may also
help with some other bugs such as 113.

So, I don't think the code of 8/8 needs to change, but the comments
and commit message do.

Here's my plan:
 - I'll wait for review on this draft
 - If it needs a respin, I'll make those doc changes in the next
   version
 - If the series doesn't need a respin, go ahead and apply 1..7, and
   I'll send a revised 8/8 separately.

[0] https://bugs.passt.top/show_bug.cgi?id=100#c8

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 1/8] inany: Let length of sockaddr_inany be implicit from the family
  2025-10-29  6:26 ` [PATCH v3 1/8] inany: Let length of sockaddr_inany be implicit from the family David Gibson
@ 2025-11-13  6:33   ` Stefano Brivio
  2025-11-13 22:53     ` David Gibson
  0 siblings, 1 reply; 24+ messages in thread
From: Stefano Brivio @ 2025-11-13  6:33 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

Apologies for the very substantial delay. I have to admit I
procrastinated around 3/8 and 4/8 quite a bit as they look scary (but
much needed, of course).

Nits only, here:

On Wed, 29 Oct 2025 17:26:21 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> sockaddr_inany can contain either an IPv4 or IPv6 socket address, so the
> relevant length for bind() or connect() can vary.  In pif_sockaddr() we
> return that length, and in sock_l4_sa() we take it as an extra parameter.
> 
> However, sockaddr_inany always contains exactly a sockaddr_in or a
> sockaddr_in6 each with a fixed size.  Therefore we can derive the relevant
> length from the family, and don't need to pass it around separately.
> 
> Make a tiny helper to get the relevant address length, and update all
> interfaces to use that approach instead.
> 
> In the process, fix a buglet in tcp_flow_repair_bind(): we passed
> sizeof(union sockaddr_inany) to bind() instead of the specific length for
> the address family.  Since the sizeof() is always longer than the specific
> length, this is probably fine, but not theoretically correct.

(Whoops.)
 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  flow.c       | 17 +++++++----------
>  icmp.c       |  3 ++-
>  inany.h      | 18 ++++++++++++++++++
>  pif.c        | 14 ++++----------
>  pif.h        |  2 +-
>  tcp.c        | 20 +++++++++-----------
>  tcp_splice.c |  5 ++---
>  udp.c        |  8 +++-----
>  util.c       |  9 ++++-----
>  util.h       |  5 +++--
>  10 files changed, 53 insertions(+), 48 deletions(-)
> 
> diff --git a/flow.c b/flow.c
> index feefda3c..9926f408 100644
> --- a/flow.c
> +++ b/flow.c
> @@ -170,8 +170,7 @@ struct flowside_sock_args {
>  	int fd;
>  	int err;
>  	enum epoll_type type;
> -	const struct sockaddr *sa;
> -	socklen_t sl;

This should be dropped from the struct documentation.

> +	const union sockaddr_inany *sa;
>  	const char *path;
>  	uint32_t data;

Note: you'll get a trivial conflict here with "util: Move epoll
registration out of sock_l4_sa()" if you rebase.

>  };
> @@ -187,7 +186,7 @@ static int flowside_sock_splice(void *arg)
>  
>  	ns_enter(a->c);
>  
> -	a->fd = sock_l4_sa(a->c, a->type, a->sa, a->sl, NULL,
> +	a->fd = sock_l4_sa(a->c, a->type, a->sa, NULL,
>  	                   a->sa->sa_family == AF_INET6, a->data);
>  	a->err = errno;
>  
> @@ -209,11 +208,10 @@ int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
>  {
>  	const char *ifname = NULL;
>  	union sockaddr_inany sa;
> -	socklen_t sl;
>  
>  	ASSERT(pif_is_socket(pif));
>  
> -	pif_sockaddr(c, &sa, &sl, pif, &tgt->oaddr, tgt->oport);
> +	pif_sockaddr(c, &sa, pif, &tgt->oaddr, tgt->oport);
>  
>  	switch (pif) {
>  	case PIF_HOST:
> @@ -224,13 +222,13 @@ int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
>  		else if (sa.sa_family == AF_INET6)
>  			ifname = c->ip6.ifname_out;
>  
> -		return sock_l4_sa(c, type, &sa, sl, ifname,
> +		return sock_l4_sa(c, type, &sa, ifname,
>  				  sa.sa_family == AF_INET6, data);
>  
>  	case PIF_SPLICE: {
>  		struct flowside_sock_args args = {
>  			.c = c, .type = type,
> -			.sa = &sa.sa, .sl = sl, .data = data,
> +			.sa = &sa, .data = data,
>  		};
>  		NS_CALL(flowside_sock_splice, &args);
>  		errno = args.err;
> @@ -259,10 +257,9 @@ int flowside_connect(const struct ctx *c, int s,
>  		     uint8_t pif, const struct flowside *tgt)
>  {
>  	union sockaddr_inany sa;
> -	socklen_t sl;
>  
> -	pif_sockaddr(c, &sa, &sl, pif, &tgt->eaddr, tgt->eport);
> -	return connect(s, &sa.sa, sl);
> +	pif_sockaddr(c, &sa, pif, &tgt->eaddr, tgt->eport);
> +	return connect(s, &sa.sa, socklen_inany(&sa));
>  }
>  
>  /** flow_log_ - Log flow-related message
> diff --git a/icmp.c b/icmp.c
> index 6dffafb0..62b3bed8 100644
> --- a/icmp.c
> +++ b/icmp.c
> @@ -301,8 +301,9 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
>  	ASSERT(flow_proto[pingf->f.type] == proto);
>  	pingf->ts = now->tv_sec;
>  
> -	pif_sockaddr(c, &sa, &msh.msg_namelen, PIF_HOST, &tgt->eaddr, 0);
> +	pif_sockaddr(c, &sa, PIF_HOST, &tgt->eaddr, 0);
>  	msh.msg_name = &sa;
> +	msh.msg_namelen = socklen_inany(&sa);
>  	msh.msg_iov = iov;
>  	msh.msg_iovlen = cnt;
>  	msh.msg_control = NULL;
> diff --git a/inany.h b/inany.h
> index 7ca5cbd3..e3cae2a8 100644
> --- a/inany.h
> +++ b/inany.h
> @@ -67,6 +67,24 @@ union sockaddr_inany {
>  	struct sockaddr_in6	sa6;
>  };
>  
> +/** socklen_inany() - Return relevant size of an sockaddr_inany

'of a'

...but the fact that it returns that sounds kind of pleonastic. What
about:

/** socklen_inany() - Get relevant address length for sockaddr_inany address

> + * @sa:		sockaddr_iany socket address

sockaddr_inany

> + *
> + * Returns the correct socket address length to pass to bind() or connect() with
> + * @sa, based on whether it is an IPv4 or IPv6 address.

Same here, I guess we obviously want it to be correct, and we have a
somewhat standard syntax for return values in documentation, what about:

 * Return: socket address length for bind() or connect(), from IP family in @sa

?

> + */
> +static inline socklen_t socklen_inany(const union sockaddr_inany *sa)
> +{
> +	switch (sa->sa_family) {
> +	case AF_INET:
> +		return sizeof(sa->sa4);

Wouldn't it be more obvious to return sizeof(struct sockaddr_in), and
sizeof(struct sockaddr_in6) below, instead? I actually had to check that
sa->sa4 matched that (I didn't remember if that was a union).

Not a strong preference from my side, both are obviously correct anyway.

> +	case AF_INET6:
> +		return sizeof(sa->sa6);
> +	default:
> +		ASSERT(0);
> +	}
> +}

-- 
Stefano


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 2/8] util, flow, pif: Simplify sock_l4_sa() interface
  2025-10-29  6:26 ` [PATCH v3 2/8] util, flow, pif: Simplify sock_l4_sa() interface David Gibson
@ 2025-11-13  6:33   ` Stefano Brivio
  2025-11-13 23:21     ` David Gibson
  0 siblings, 1 reply; 24+ messages in thread
From: Stefano Brivio @ 2025-11-13  6:33 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Wed, 29 Oct 2025 17:26:22 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> sock_l4_sa() has a somewhat confusing 'v6only' option controlling whether
> to set the IPV6_V6ONLY socket option.  Usually it's set when the given
> address is IPv6, but not when we want to create a dual stack listening
> socket.  The latter only makes sense when the address is :: however.
> 
> Clarify this by only keeping the v6only option in an internal helper
> sock_l4_().  External users will call either sock_l4() which always creates
> a socket bound to a specific IP version, or sock_l4_dualstack() which
> creates a dual stack socket, but takes only a port not an address.

I'm not sure if we'll ever need anything different, but I guess that
this is not the only obvious semantic of sock_l4_dualstack(), as it
could take a sockaddr_inany eventually, and bind() IPv6 address and its
v4-mapped equivalent (...does that even work?).

> We drop the '_sa' suffix while we're at it - it exists because this used
> to be an internal version with a sock_l4() wrapper.  The wrapper no longer
> exists so the '_sa' is no longer useful.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  flow.c |  6 ++----
>  pif.c  | 10 +++-------
>  util.c | 27 +++++++++++++++++++++++----
>  util.h |  8 +++++---
>  4 files changed, 33 insertions(+), 18 deletions(-)
> 
> diff --git a/flow.c b/flow.c
> index 9926f408..fd530ddb 100644
> --- a/flow.c
> +++ b/flow.c
> @@ -186,8 +186,7 @@ static int flowside_sock_splice(void *arg)
>  
>  	ns_enter(a->c);
>  
> -	a->fd = sock_l4_sa(a->c, a->type, a->sa, NULL,
> -	                   a->sa->sa_family == AF_INET6, a->data);
> +	a->fd = sock_l4(a->c, a->type, a->sa, NULL, a->data);
>  	a->err = errno;
>  
>  	return 0;
> @@ -222,8 +221,7 @@ int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
>  		else if (sa.sa_family == AF_INET6)
>  			ifname = c->ip6.ifname_out;
>  
> -		return sock_l4_sa(c, type, &sa, ifname,
> -				  sa.sa_family == AF_INET6, data);
> +		return sock_l4(c, type, &sa, ifname, data);
>  
>  	case PIF_SPLICE: {
>  		struct flowside_sock_args args = {
> diff --git a/pif.c b/pif.c
> index 31723b29..5fb1f455 100644
> --- a/pif.c
> +++ b/pif.c
> @@ -75,11 +75,7 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
>  		const union inany_addr *addr, const char *ifname,
>  		in_port_t port, uint32_t data)
>  {
> -	union sockaddr_inany sa = {
> -		.sa6.sin6_family = AF_INET6,
> -		.sa6.sin6_addr = in6addr_any,
> -		.sa6.sin6_port = htons(port),
> -	};
> +	union sockaddr_inany sa;
>  
>  	ASSERT(pif_is_socket(pif));
>  
> @@ -90,8 +86,8 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
>  	}
>  
>  	if (!addr)
> -		return sock_l4_sa(c, type, &sa, ifname, false, data);
> +		return sock_l4_dualstack(c, type, port, ifname, data);
>  
>  	pif_sockaddr(c, &sa, pif, addr, port);
> -	return sock_l4_sa(c, type, &sa, ifname, sa.sa_family == AF_INET6, data);
> +	return sock_l4(c, type, &sa, ifname, data);
>  }
> diff --git a/util.c b/util.c
> index 976fcabe..c94efae4 100644
> --- a/util.c
> +++ b/util.c
> @@ -40,7 +40,7 @@
>  #endif
>  
>  /**
> - * sock_l4_sa() - Create and bind socket to socket address, add to epoll list
> + * sock_l4_() - Create and bind socket to socket address, add to epoll list
>   * @c:		Execution context
>   * @type:	epoll type
>   * @sa:		Socket address to bind to
> @@ -50,9 +50,9 @@
>   *
>   * Return: newly created socket, negative error code on failure
>   */
> -int sock_l4_sa(const struct ctx *c, enum epoll_type type,
> -	       const union sockaddr_inany *sa, const char *ifname,
> -	       bool v6only, uint32_t data)
> +static int sock_l4_(const struct ctx *c, enum epoll_type type,
> +		    const union sockaddr_inany *sa, const char *ifname,
> +		    bool v6only, uint32_t data)
>  {
>  	sa_family_t af = sa->sa_family;
>  	union epoll_ref ref = { .type = type, .data = data };
> @@ -182,6 +182,25 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
>  	return fd;
>  }
>  
> +int sock_l4(const struct ctx *c, enum epoll_type type,
> +	    const union sockaddr_inany *sa, const char *ifname,
> +	    uint32_t data)

Not extremely useful but it saves one "lookup":

/**
 * sock_l4() - Create and bind socket to given address, add to epoll list
 * @c:		Execution context
 * @type:	epoll type
 * @sa:		Socket address to bind to
 * @ifname:	Interface for binding, NULL for any
 *
 * Return: newly created socket, negative error code on failure
 */

> +{
> +	return sock_l4_(c, type, sa, ifname, sa->sa_family == AF_INET6, data);
> +}
> +
> +int sock_l4_dualstack(const struct ctx *c, enum epoll_type type,
> +		      in_port_t port, const char *ifname, uint32_t data)

...same here, and the comment might be used to clarify the
functionality.

> +{
> +	union sockaddr_inany sa = {
> +		.sa6.sin6_family = AF_INET6,
> +		.sa6.sin6_addr = in6addr_any,
> +		.sa6.sin6_port = htons(port),
> +	};
> +
> +	return sock_l4_(c, type, &sa, ifname, 0, data);
> +}
> +
>  /**
>   * sock_unix() - Create and bind AF_UNIX socket
>   * @sock_path:	Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
> diff --git a/util.h b/util.h
> index e1a1ebc9..7f0cf686 100644
> --- a/util.h
> +++ b/util.h
> @@ -203,9 +203,11 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
>  struct ctx;
>  union sockaddr_inany;
>  
> -int sock_l4_sa(const struct ctx *c, enum epoll_type type,
> -	       const union sockaddr_inany *sa, const char *ifname,
> -	       bool v6only, uint32_t data);
> +int sock_l4(const struct ctx *c, enum epoll_type type,
> +	    const union sockaddr_inany *sa, const char *ifname,
> +	    uint32_t data);
> +int sock_l4_dualstack(const struct ctx *c, enum epoll_type type,
> +		      in_port_t port, const char *ifname, uint32_t data);
>  int sock_unix(char *sock_path);
>  void sock_probe_mem(struct ctx *c);
>  long timespec_diff_ms(const struct timespec *a, const struct timespec *b);

-- 
Stefano


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 4/8] udp: Unify some more inbound/outbound parts of udp_sock_init()
  2025-10-29  6:26 ` [PATCH v3 4/8] udp: Unify some more inbound/outbound parts of udp_sock_init() David Gibson
@ 2025-11-13  6:33   ` Stefano Brivio
  2025-11-13 23:33     ` David Gibson
  0 siblings, 1 reply; 24+ messages in thread
From: Stefano Brivio @ 2025-11-13  6:33 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Wed, 29 Oct 2025 17:26:24 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> udp_sock_init() takes an 'ns' parameter determining if it creates a socket
> in the guest namespace or host namespace.  Alter it to take a pif
> parameter instead, like tcp_sock_init(), and use that change to slightly
> reduce code duplication between the HOST and SPLICE cases.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  conf.c |  2 +-
>  udp.c  | 65 +++++++++++++++++++++++++++++++---------------------------
>  udp.h  |  5 +++--
>  3 files changed, 39 insertions(+), 33 deletions(-)
> 
> diff --git a/conf.c b/conf.c
> index 26f1bcc0..08cb50aa 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -171,7 +171,7 @@ static void conf_ports_range_except(const struct ctx *c, char optname,
>  		if (optname == 't')
>  			ret = tcp_sock_init(c, PIF_HOST, addr, ifname, i);
>  		else if (optname == 'u')
> -			ret = udp_sock_init(c, 0, addr, ifname, i);
> +			ret = udp_sock_init(c, PIF_HOST, addr, ifname, i);
>  		else
>  			/* No way to check in advance for -T and -U */
>  			ret = 0;
> diff --git a/udp.c b/udp.c
> index 8f96495d..f3237436 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -1091,64 +1091,68 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
>  /**
>   * udp_sock_init() - Initialise listening sockets for a given port
>   * @c:		Execution context
> - * @ns:		In pasta mode, if set, bind with loopback address in namespace
> + * @pif:	Interface to open the socket for (PIF_HOST or PIF_SPLICE)
>   * @addr:	Pointer to address for binding, NULL if not configured
>   * @ifname:	Name of interface to bind to, NULL if not configured
>   * @port:	Port, host order
>   *
>   * Return: 0 on (partial) success, negative error code on (complete) failure
>   */
> -int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
> -		  const char *ifname, in_port_t port)
> +int udp_sock_init(const struct ctx *c, uint8_t pif,
> +		  const union inany_addr *addr, const char *ifname,
> +		  in_port_t port)
>  {
>  	union udp_listen_epoll_ref uref = {
> -		.pif = ns ? PIF_SPLICE : PIF_HOST,
> +		.pif = pif,
>  		.port = port,
>  	};
>  	int r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1;
> +	int (*socks)[NUM_PORTS];
>  
>  	ASSERT(!c->no_udp);
> +	ASSERT(pif_is_socket(pif));
>  
> -	if (!addr && c->ifi4 && c->ifi6 && !ns) {
> +	if (pif == PIF_HOST)
> +		socks = udp_splice_init;
> +	else
> +		socks = udp_splice_ns;
> +
> +	if (!addr && c->ifi4 && c->ifi6 && pif == PIF_HOST) {
>  		int s;
>  
>  		/* Attempt to get a dual stack socket */
>  		s = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST,
>  				NULL, ifname, port, uref.u32);
> -		udp_splice_init[V4][port] = s < 0 ? -1 : s;
> -		udp_splice_init[V6][port] = s < 0 ? -1 : s;
> +		socks[V4][port] = s < 0 ? -1 : s;
> +		socks[V6][port] = s < 0 ? -1 : s;
>  		if (IN_INTERVAL(0, FD_REF_MAX, s))
>  			return 0;
>  	}
>  
>  	if ((!addr || inany_v4(addr)) && c->ifi4) {
> -		if (!ns) {
> -			r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST,
> -					 addr ? addr : &inany_any4, ifname,
> -					 port, uref.u32);
> +		const union inany_addr *a = addr ?
> +			addr : &inany_any4;

Nit: this fits on a single line.

>  
> -			udp_splice_init[V4][port] = r4 < 0 ? -1 : r4;
> -		} else {
> -			r4  = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_SPLICE,
> -					  &inany_loopback4, ifname,
> -					  port, uref.u32);
> -			udp_splice_ns[V4][port] = r4 < 0 ? -1 : r4;
> -		}
> +		if (pif == PIF_SPLICE)
> +			a = &inany_loopback4;
> +
> +		r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
> +				 port, uref.u32);
> +
> +		socks[V4][port] = r4 < 0 ? -1 : r4;
>  	}
>  
>  	if ((!addr || !inany_v4(addr)) && c->ifi6) {
> -		if (!ns) {
> -			r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST,
> -					 addr ? addr : &inany_any6, ifname,
> -					 port, uref.u32);
> +		const union inany_addr *a = addr ?
> +			addr : &inany_any6;

...same here.

>  
> -			udp_splice_init[V6][port] = r6 < 0 ? -1 : r6;
> -		} else {
> -			r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_SPLICE,
> -					 &inany_loopback6, ifname,
> -					 port, uref.u32);
> -			udp_splice_ns[V6][port] = r6 < 0 ? -1 : r6;
> -		}
> +		if (pif == PIF_SPLICE)
> +			a = &inany_loopback6;
> +
> +		r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
> +				 port, uref.u32);
> +
> +		socks[V6][port] = r6 < 0 ? -1 : r6;
>  	}
>  
>  	if (IN_INTERVAL(0, FD_REF_MAX, r4) || IN_INTERVAL(0, FD_REF_MAX, r6))
> @@ -1214,7 +1218,8 @@ static void udp_port_rebind(struct ctx *c, bool outbound)
>  
>  		if ((c->ifi4 && socks[V4][port] == -1) ||
>  		    (c->ifi6 && socks[V6][port] == -1))
> -			udp_sock_init(c, outbound, NULL, NULL, port);
> +			udp_sock_init(c, outbound ? PIF_SPLICE : PIF_HOST,
> +				      NULL, NULL, port);
>  	}
>  }
>  
> diff --git a/udp.h b/udp.h
> index 8f8531ad..f78dc528 100644
> --- a/udp.h
> +++ b/udp.h
> @@ -17,8 +17,9 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
>  		    sa_family_t af, const void *saddr, const void *daddr,
>  		    uint8_t ttl, const struct pool *p, int idx,
>  		    const struct timespec *now);
> -int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
> -		  const char *ifname, in_port_t port);
> +int udp_sock_init(const struct ctx *c, uint8_t pif,
> +		  const union inany_addr *addr, const char *ifname,
> +		  in_port_t port);
>  int udp_init(struct ctx *c);
>  void udp_timer(struct ctx *c, const struct timespec *now);
>  void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);

-- 
Stefano


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 6/8] util: Fix setting of IPV6_V6ONLY socket option
  2025-10-29  6:26 ` [PATCH v3 6/8] util: Fix setting of IPV6_V6ONLY socket option David Gibson
@ 2025-11-13  6:33   ` Stefano Brivio
  2025-11-14  0:24     ` David Gibson
  0 siblings, 1 reply; 24+ messages in thread
From: Stefano Brivio @ 2025-11-13  6:33 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Wed, 29 Oct 2025 17:26:26 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> Currently we only call setsockopt() on IPV6_V6ONLY when we want to set it
> to 1, which we typically do on all IPv6 sockets except those explicitly for
> dual stack listening.  That's not quite right in two ways:
> 
>  * Although IPV6_V6ONLY==0 is normally the default on Linux, that can be
>    changed with the net.ipv6.bindv6only sysctl.  It may also have different
>    defaults on other OSes if we ever support them.  We know we need it off
>    for dual stack sockets, so explicitly set it to 0 in that case.
> 
>  * At the same time setting IPV6_V6ONLY to 1 for IPv6 sockets bound to a
>    specific address is harmless but pointless.  Don't set the option at all
>    in this case, saving a syscall.

I haven't checked the implications of this but __inet6_bind() handles
address "types" IPV6_ADDR_ANY and IPV6_ADDR_MAPPED in the same way, for
IPV6_V6ONLY purposes. I'm not sure if this has any influence on
functionality though.

> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  util.c | 27 +++++++++++++++++++++------
>  1 file changed, 21 insertions(+), 6 deletions(-)
> 
> diff --git a/util.c b/util.c
> index c94efae4..62f43895 100644
> --- a/util.c
> +++ b/util.c
> @@ -45,14 +45,14 @@
>   * @type:	epoll type
>   * @sa:		Socket address to bind to
>   * @ifname:	Interface for binding, NULL for any
> - * @v6only:	Set IPV6_V6ONLY socket option
> + * @v6only:	If >= 0, set IPV6_V6ONLY socket option to this value
>   * @data:	epoll reference portion for protocol handlers
>   *
>   * Return: newly created socket, negative error code on failure
>   */
>  static int sock_l4_(const struct ctx *c, enum epoll_type type,
>  		    const union sockaddr_inany *sa, const char *ifname,
> -		    bool v6only, uint32_t data)
> +		    int v6only, uint32_t data)
>  {
>  	sa_family_t af = sa->sa_family;
>  	union epoll_ref ref = { .type = type, .data = data };
> @@ -101,9 +101,11 @@ static int sock_l4_(const struct ctx *c, enum epoll_type type,
>  
>  	ref.fd = fd;
>  
> -	if (v6only)
> -		if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &y, sizeof(y)))
> -			debug("Failed to set IPV6_V6ONLY on socket %i", fd);
> +	if (v6only >= 0)
> +		if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY,
> +			       &v6only, sizeof(v6only)))
> +			debug("Failed to set IPV6_V6ONLY to %d on socket %i",
> +			      v6only, fd);

Nit: curly brackets (two pairs) for consistency.

>  
>  	if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &y, sizeof(y)))
>  		debug("Failed to set SO_REUSEADDR on socket %i", fd);
> @@ -186,7 +188,16 @@ int sock_l4(const struct ctx *c, enum epoll_type type,
>  	    const union sockaddr_inany *sa, const char *ifname,
>  	    uint32_t data)
>  {
> -	return sock_l4_(c, type, sa, ifname, sa->sa_family == AF_INET6, data);
> +	int v6only = -1;
> +
> +	/* The option doesn't exist for IPv4 sockets, and is irrelevant for IPv6
> +	 * sockets with a non-wildcard address.

Same as above: I don't think this is true, strictly speaking, but I
didn't check whether this inaccuracy is in any way relevant.

> +	 */
> +	if (sa->sa_family == AF_INET6 &&
> +	    IN6_IS_ADDR_UNSPECIFIED(&sa->sa6.sin6_addr))
> +		v6only = 1;
> +
> +	return sock_l4_(c, type, sa, ifname, v6only, data);
>  }
>  
>  int sock_l4_dualstack(const struct ctx *c, enum epoll_type type,
> @@ -198,6 +209,10 @@ int sock_l4_dualstack(const struct ctx *c, enum epoll_type type,
>  		.sa6.sin6_port = htons(port),
>  	};
>  
> +	/* Dual stack sockets require IPV6_V6ONLY == 0.  Usually that's the
> +	 * default, but sysctl net.ipv6.bindv6only can change that, so set the
> +	 * sockopt explicitly.
> +	 */
>  	return sock_l4_(c, type, &sa, ifname, 0, data);
>  }
>  

The rest of the series looks good to me, including 8/8, but it's not
clear to me what the "secondary reasons" to consider 8/8 at this stage
might be, so I'm not actually sure what to do with it.

Is it because it drops some code?

-- 
Stefano


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 1/8] inany: Let length of sockaddr_inany be implicit from the family
  2025-11-13  6:33   ` Stefano Brivio
@ 2025-11-13 22:53     ` David Gibson
  0 siblings, 0 replies; 24+ messages in thread
From: David Gibson @ 2025-11-13 22:53 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 6890 bytes --]

On Thu, Nov 13, 2025 at 07:33:04AM +0100, Stefano Brivio wrote:
> Apologies for the very substantial delay. I have to admit I
> procrastinated around 3/8 and 4/8 quite a bit as they look scary (but
> much needed, of course).
> 
> Nits only, here:
> 
> On Wed, 29 Oct 2025 17:26:21 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > sockaddr_inany can contain either an IPv4 or IPv6 socket address, so the
> > relevant length for bind() or connect() can vary.  In pif_sockaddr() we
> > return that length, and in sock_l4_sa() we take it as an extra parameter.
> > 
> > However, sockaddr_inany always contains exactly a sockaddr_in or a
> > sockaddr_in6 each with a fixed size.  Therefore we can derive the relevant
> > length from the family, and don't need to pass it around separately.
> > 
> > Make a tiny helper to get the relevant address length, and update all
> > interfaces to use that approach instead.
> > 
> > In the process, fix a buglet in tcp_flow_repair_bind(): we passed
> > sizeof(union sockaddr_inany) to bind() instead of the specific length for
> > the address family.  Since the sizeof() is always longer than the specific
> > length, this is probably fine, but not theoretically correct.
> 
> (Whoops.)
>  
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  flow.c       | 17 +++++++----------
> >  icmp.c       |  3 ++-
> >  inany.h      | 18 ++++++++++++++++++
> >  pif.c        | 14 ++++----------
> >  pif.h        |  2 +-
> >  tcp.c        | 20 +++++++++-----------
> >  tcp_splice.c |  5 ++---
> >  udp.c        |  8 +++-----
> >  util.c       |  9 ++++-----
> >  util.h       |  5 +++--
> >  10 files changed, 53 insertions(+), 48 deletions(-)
> > 
> > diff --git a/flow.c b/flow.c
> > index feefda3c..9926f408 100644
> > --- a/flow.c
> > +++ b/flow.c
> > @@ -170,8 +170,7 @@ struct flowside_sock_args {
> >  	int fd;
> >  	int err;
> >  	enum epoll_type type;
> > -	const struct sockaddr *sa;
> > -	socklen_t sl;
> 
> This should be dropped from the struct documentation.

Good point, done.

> > +	const union sockaddr_inany *sa;
> >  	const char *path;
> >  	uint32_t data;
> 
> Note: you'll get a trivial conflict here with "util: Move epoll
> registration out of sock_l4_sa()" if you rebase.

Actually, there are a heap of conflicts with that patch.  I'll
definitely respin for that reason if no other.

> 
> >  };
> > @@ -187,7 +186,7 @@ static int flowside_sock_splice(void *arg)
> >  
> >  	ns_enter(a->c);
> >  
> > -	a->fd = sock_l4_sa(a->c, a->type, a->sa, a->sl, NULL,
> > +	a->fd = sock_l4_sa(a->c, a->type, a->sa, NULL,
> >  	                   a->sa->sa_family == AF_INET6, a->data);
> >  	a->err = errno;
> >  
> > @@ -209,11 +208,10 @@ int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
> >  {
> >  	const char *ifname = NULL;
> >  	union sockaddr_inany sa;
> > -	socklen_t sl;
> >  
> >  	ASSERT(pif_is_socket(pif));
> >  
> > -	pif_sockaddr(c, &sa, &sl, pif, &tgt->oaddr, tgt->oport);
> > +	pif_sockaddr(c, &sa, pif, &tgt->oaddr, tgt->oport);
> >  
> >  	switch (pif) {
> >  	case PIF_HOST:
> > @@ -224,13 +222,13 @@ int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
> >  		else if (sa.sa_family == AF_INET6)
> >  			ifname = c->ip6.ifname_out;
> >  
> > -		return sock_l4_sa(c, type, &sa, sl, ifname,
> > +		return sock_l4_sa(c, type, &sa, ifname,
> >  				  sa.sa_family == AF_INET6, data);
> >  
> >  	case PIF_SPLICE: {
> >  		struct flowside_sock_args args = {
> >  			.c = c, .type = type,
> > -			.sa = &sa.sa, .sl = sl, .data = data,
> > +			.sa = &sa, .data = data,
> >  		};
> >  		NS_CALL(flowside_sock_splice, &args);
> >  		errno = args.err;
> > @@ -259,10 +257,9 @@ int flowside_connect(const struct ctx *c, int s,
> >  		     uint8_t pif, const struct flowside *tgt)
> >  {
> >  	union sockaddr_inany sa;
> > -	socklen_t sl;
> >  
> > -	pif_sockaddr(c, &sa, &sl, pif, &tgt->eaddr, tgt->eport);
> > -	return connect(s, &sa.sa, sl);
> > +	pif_sockaddr(c, &sa, pif, &tgt->eaddr, tgt->eport);
> > +	return connect(s, &sa.sa, socklen_inany(&sa));
> >  }
> >  
> >  /** flow_log_ - Log flow-related message
> > diff --git a/icmp.c b/icmp.c
> > index 6dffafb0..62b3bed8 100644
> > --- a/icmp.c
> > +++ b/icmp.c
> > @@ -301,8 +301,9 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
> >  	ASSERT(flow_proto[pingf->f.type] == proto);
> >  	pingf->ts = now->tv_sec;
> >  
> > -	pif_sockaddr(c, &sa, &msh.msg_namelen, PIF_HOST, &tgt->eaddr, 0);
> > +	pif_sockaddr(c, &sa, PIF_HOST, &tgt->eaddr, 0);
> >  	msh.msg_name = &sa;
> > +	msh.msg_namelen = socklen_inany(&sa);
> >  	msh.msg_iov = iov;
> >  	msh.msg_iovlen = cnt;
> >  	msh.msg_control = NULL;
> > diff --git a/inany.h b/inany.h
> > index 7ca5cbd3..e3cae2a8 100644
> > --- a/inany.h
> > +++ b/inany.h
> > @@ -67,6 +67,24 @@ union sockaddr_inany {
> >  	struct sockaddr_in6	sa6;
> >  };
> >  
> > +/** socklen_inany() - Return relevant size of an sockaddr_inany
> 
> 'of a'
> 
> ...but the fact that it returns that sounds kind of pleonastic. What
> about:
> 
> /** socklen_inany() - Get relevant address length for sockaddr_inany address

Good idea, done.

> 
> > + * @sa:		sockaddr_iany socket address
> 
> sockaddr_inany

Fixed.

> > + *
> > + * Returns the correct socket address length to pass to bind() or connect() with
> > + * @sa, based on whether it is an IPv4 or IPv6 address.
> 
> Same here, I guess we obviously want it to be correct, and we have a
> somewhat standard syntax for return values in documentation, what about:
> 
>  * Return: socket address length for bind() or connect(), from IP family in @sa
> 
> ?

Again, good idea, done.

> 
> > + */
> > +static inline socklen_t socklen_inany(const union sockaddr_inany *sa)
> > +{
> > +	switch (sa->sa_family) {
> > +	case AF_INET:
> > +		return sizeof(sa->sa4);
> 
> Wouldn't it be more obvious to return sizeof(struct sockaddr_in), and
> sizeof(struct sockaddr_in6) below, instead? I actually had to check that
> sa->sa4 matched that (I didn't remember if that was a union).

Eh, maybe?

> Not a strong preference from my side, both are obviously correct
> anyway.

I usually prefer to use sizeof() on variables not types when possible
- it means things tend to either work or break more obviously if the
types of the variables get changed.  Leaving this one for now.

> 
> > +	case AF_INET6:
> > +		return sizeof(sa->sa6);
> > +	default:
> > +		ASSERT(0);
> > +	}
> > +}
> 
> -- 
> Stefano
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 2/8] util, flow, pif: Simplify sock_l4_sa() interface
  2025-11-13  6:33   ` Stefano Brivio
@ 2025-11-13 23:21     ` David Gibson
  2025-11-18  0:19       ` Stefano Brivio
  0 siblings, 1 reply; 24+ messages in thread
From: David Gibson @ 2025-11-13 23:21 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 8279 bytes --]

On Thu, Nov 13, 2025 at 07:33:13AM +0100, Stefano Brivio wrote:
> On Wed, 29 Oct 2025 17:26:22 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > sock_l4_sa() has a somewhat confusing 'v6only' option controlling whether
> > to set the IPV6_V6ONLY socket option.  Usually it's set when the given
> > address is IPv6, but not when we want to create a dual stack listening
> > socket.  The latter only makes sense when the address is :: however.
> > 
> > Clarify this by only keeping the v6only option in an internal helper
> > sock_l4_().  External users will call either sock_l4() which always creates
> > a socket bound to a specific IP version, or sock_l4_dualstack() which
> > creates a dual stack socket, but takes only a port not an address.
> 
> I'm not sure if we'll ever need anything different, but I guess that
> this is not the only obvious semantic of sock_l4_dualstack(), as it
> could take a sockaddr_inany eventually, and bind() IPv6 address and its
> v4-mapped equivalent (...does that even work?).

Do you mean that if we have a v4-mapped address, then using an IPv6
"dual stack" socket will listen both for IPv4 traffic and for IPv6
traffic actually using that v4-mapped address on the wire (presumably
as a result of a router translating to a local IPv6-only network)?  I
think that will work, though I haven't tested.

In that case we can determine that we need IPV6_V6ONLY from the
address.  The only case that doesn't cover is if we want to listen for
v4-mapped traffic already translated by a router but *not* native IPv4
traffic.  I don't see a lot of reason to ever do that, so it's in the
"refactor if we ever discover we need it" pile.

Otherwise, the only case in which a single dual stack socket actually
listens to traffic from both protocols is for a wildcard.  Maybe there
are obscure wildcard addresses other than :: / 0.0.0.0, but that's
also in the "worry about it later" pile.

Note that:

https://github.com/containers/podman/pull/14026/commits/772ead25318dfa340541197e92322bd2346df087

implies some sort of dual stack localhost support (it treats "dual
stack" ::1 as listening on both ::1 and 127.0.0.1).  However, AFAICT
that's just not correct.  On Linux, listening on ::1 listens only on
::1 even with V6ONLY explicitly set to 0.

> > We drop the '_sa' suffix while we're at it - it exists because this used
> > to be an internal version with a sock_l4() wrapper.  The wrapper no longer
> > exists so the '_sa' is no longer useful.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  flow.c |  6 ++----
> >  pif.c  | 10 +++-------
> >  util.c | 27 +++++++++++++++++++++++----
> >  util.h |  8 +++++---
> >  4 files changed, 33 insertions(+), 18 deletions(-)
> > 
> > diff --git a/flow.c b/flow.c
> > index 9926f408..fd530ddb 100644
> > --- a/flow.c
> > +++ b/flow.c
> > @@ -186,8 +186,7 @@ static int flowside_sock_splice(void *arg)
> >  
> >  	ns_enter(a->c);
> >  
> > -	a->fd = sock_l4_sa(a->c, a->type, a->sa, NULL,
> > -	                   a->sa->sa_family == AF_INET6, a->data);
> > +	a->fd = sock_l4(a->c, a->type, a->sa, NULL, a->data);
> >  	a->err = errno;
> >  
> >  	return 0;
> > @@ -222,8 +221,7 @@ int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
> >  		else if (sa.sa_family == AF_INET6)
> >  			ifname = c->ip6.ifname_out;
> >  
> > -		return sock_l4_sa(c, type, &sa, ifname,
> > -				  sa.sa_family == AF_INET6, data);
> > +		return sock_l4(c, type, &sa, ifname, data);
> >  
> >  	case PIF_SPLICE: {
> >  		struct flowside_sock_args args = {
> > diff --git a/pif.c b/pif.c
> > index 31723b29..5fb1f455 100644
> > --- a/pif.c
> > +++ b/pif.c
> > @@ -75,11 +75,7 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
> >  		const union inany_addr *addr, const char *ifname,
> >  		in_port_t port, uint32_t data)
> >  {
> > -	union sockaddr_inany sa = {
> > -		.sa6.sin6_family = AF_INET6,
> > -		.sa6.sin6_addr = in6addr_any,
> > -		.sa6.sin6_port = htons(port),
> > -	};
> > +	union sockaddr_inany sa;
> >  
> >  	ASSERT(pif_is_socket(pif));
> >  
> > @@ -90,8 +86,8 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
> >  	}
> >  
> >  	if (!addr)
> > -		return sock_l4_sa(c, type, &sa, ifname, false, data);
> > +		return sock_l4_dualstack(c, type, port, ifname, data);
> >  
> >  	pif_sockaddr(c, &sa, pif, addr, port);
> > -	return sock_l4_sa(c, type, &sa, ifname, sa.sa_family == AF_INET6, data);
> > +	return sock_l4(c, type, &sa, ifname, data);
> >  }
> > diff --git a/util.c b/util.c
> > index 976fcabe..c94efae4 100644
> > --- a/util.c
> > +++ b/util.c
> > @@ -40,7 +40,7 @@
> >  #endif
> >  
> >  /**
> > - * sock_l4_sa() - Create and bind socket to socket address, add to epoll list
> > + * sock_l4_() - Create and bind socket to socket address, add to epoll list
> >   * @c:		Execution context
> >   * @type:	epoll type
> >   * @sa:		Socket address to bind to
> > @@ -50,9 +50,9 @@
> >   *
> >   * Return: newly created socket, negative error code on failure
> >   */
> > -int sock_l4_sa(const struct ctx *c, enum epoll_type type,
> > -	       const union sockaddr_inany *sa, const char *ifname,
> > -	       bool v6only, uint32_t data)
> > +static int sock_l4_(const struct ctx *c, enum epoll_type type,
> > +		    const union sockaddr_inany *sa, const char *ifname,
> > +		    bool v6only, uint32_t data)
> >  {
> >  	sa_family_t af = sa->sa_family;
> >  	union epoll_ref ref = { .type = type, .data = data };
> > @@ -182,6 +182,25 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
> >  	return fd;
> >  }
> >  
> > +int sock_l4(const struct ctx *c, enum epoll_type type,
> > +	    const union sockaddr_inany *sa, const char *ifname,
> > +	    uint32_t data)
> 
> Not extremely useful but it saves one "lookup":
> 
> /**
>  * sock_l4() - Create and bind socket to given address, add to epoll list
>  * @c:		Execution context
>  * @type:	epoll type
>  * @sa:		Socket address to bind to
>  * @ifname:	Interface for binding, NULL for any
>  *
>  * Return: newly created socket, negative error code on failure
>  */

Oops, I meant to go back and add function comments here, but I
obviously forgot.  Fixed.

While there I removed the "add to epoll list" which is no longer
correct.


> > +{
> > +	return sock_l4_(c, type, sa, ifname, sa->sa_family == AF_INET6, data);
> > +}
> > +
> > +int sock_l4_dualstack(const struct ctx *c, enum epoll_type type,
> > +		      in_port_t port, const char *ifname, uint32_t data)
> 
> ...same here, and the comment might be used to clarify the
> functionality.

Done.

> 
> > +{
> > +	union sockaddr_inany sa = {
> > +		.sa6.sin6_family = AF_INET6,
> > +		.sa6.sin6_addr = in6addr_any,
> > +		.sa6.sin6_port = htons(port),
> > +	};
> > +
> > +	return sock_l4_(c, type, &sa, ifname, 0, data);
> > +}
> > +
> >  /**
> >   * sock_unix() - Create and bind AF_UNIX socket
> >   * @sock_path:	Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
> > diff --git a/util.h b/util.h
> > index e1a1ebc9..7f0cf686 100644
> > --- a/util.h
> > +++ b/util.h
> > @@ -203,9 +203,11 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
> >  struct ctx;
> >  union sockaddr_inany;
> >  
> > -int sock_l4_sa(const struct ctx *c, enum epoll_type type,
> > -	       const union sockaddr_inany *sa, const char *ifname,
> > -	       bool v6only, uint32_t data);
> > +int sock_l4(const struct ctx *c, enum epoll_type type,
> > +	    const union sockaddr_inany *sa, const char *ifname,
> > +	    uint32_t data);
> > +int sock_l4_dualstack(const struct ctx *c, enum epoll_type type,
> > +		      in_port_t port, const char *ifname, uint32_t data);
> >  int sock_unix(char *sock_path);
> >  void sock_probe_mem(struct ctx *c);
> >  long timespec_diff_ms(const struct timespec *a, const struct timespec *b);
> 
> -- 
> Stefano
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 4/8] udp: Unify some more inbound/outbound parts of udp_sock_init()
  2025-11-13  6:33   ` Stefano Brivio
@ 2025-11-13 23:33     ` David Gibson
  0 siblings, 0 replies; 24+ messages in thread
From: David Gibson @ 2025-11-13 23:33 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 6379 bytes --]

On Thu, Nov 13, 2025 at 07:33:26AM +0100, Stefano Brivio wrote:
> On Wed, 29 Oct 2025 17:26:24 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > udp_sock_init() takes an 'ns' parameter determining if it creates a socket
> > in the guest namespace or host namespace.  Alter it to take a pif
> > parameter instead, like tcp_sock_init(), and use that change to slightly
> > reduce code duplication between the HOST and SPLICE cases.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  conf.c |  2 +-
> >  udp.c  | 65 +++++++++++++++++++++++++++++++---------------------------
> >  udp.h  |  5 +++--
> >  3 files changed, 39 insertions(+), 33 deletions(-)
> > 
> > diff --git a/conf.c b/conf.c
> > index 26f1bcc0..08cb50aa 100644
> > --- a/conf.c
> > +++ b/conf.c
> > @@ -171,7 +171,7 @@ static void conf_ports_range_except(const struct ctx *c, char optname,
> >  		if (optname == 't')
> >  			ret = tcp_sock_init(c, PIF_HOST, addr, ifname, i);
> >  		else if (optname == 'u')
> > -			ret = udp_sock_init(c, 0, addr, ifname, i);
> > +			ret = udp_sock_init(c, PIF_HOST, addr, ifname, i);
> >  		else
> >  			/* No way to check in advance for -T and -U */
> >  			ret = 0;
> > diff --git a/udp.c b/udp.c
> > index 8f96495d..f3237436 100644
> > --- a/udp.c
> > +++ b/udp.c
> > @@ -1091,64 +1091,68 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
> >  /**
> >   * udp_sock_init() - Initialise listening sockets for a given port
> >   * @c:		Execution context
> > - * @ns:		In pasta mode, if set, bind with loopback address in namespace
> > + * @pif:	Interface to open the socket for (PIF_HOST or PIF_SPLICE)
> >   * @addr:	Pointer to address for binding, NULL if not configured
> >   * @ifname:	Name of interface to bind to, NULL if not configured
> >   * @port:	Port, host order
> >   *
> >   * Return: 0 on (partial) success, negative error code on (complete) failure
> >   */
> > -int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
> > -		  const char *ifname, in_port_t port)
> > +int udp_sock_init(const struct ctx *c, uint8_t pif,
> > +		  const union inany_addr *addr, const char *ifname,
> > +		  in_port_t port)
> >  {
> >  	union udp_listen_epoll_ref uref = {
> > -		.pif = ns ? PIF_SPLICE : PIF_HOST,
> > +		.pif = pif,
> >  		.port = port,
> >  	};
> >  	int r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1;
> > +	int (*socks)[NUM_PORTS];
> >  
> >  	ASSERT(!c->no_udp);
> > +	ASSERT(pif_is_socket(pif));
> >  
> > -	if (!addr && c->ifi4 && c->ifi6 && !ns) {
> > +	if (pif == PIF_HOST)
> > +		socks = udp_splice_init;
> > +	else
> > +		socks = udp_splice_ns;
> > +
> > +	if (!addr && c->ifi4 && c->ifi6 && pif == PIF_HOST) {
> >  		int s;
> >  
> >  		/* Attempt to get a dual stack socket */
> >  		s = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST,
> >  				NULL, ifname, port, uref.u32);
> > -		udp_splice_init[V4][port] = s < 0 ? -1 : s;
> > -		udp_splice_init[V6][port] = s < 0 ? -1 : s;
> > +		socks[V4][port] = s < 0 ? -1 : s;
> > +		socks[V6][port] = s < 0 ? -1 : s;
> >  		if (IN_INTERVAL(0, FD_REF_MAX, s))
> >  			return 0;
> >  	}
> >  
> >  	if ((!addr || inany_v4(addr)) && c->ifi4) {
> > -		if (!ns) {
> > -			r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST,
> > -					 addr ? addr : &inany_any4, ifname,
> > -					 port, uref.u32);
> > +		const union inany_addr *a = addr ?
> > +			addr : &inany_any4;
> 
> Nit: this fits on a single line.

Fixed.  Probably a relic from an earlier draft.

> 
> >  
> > -			udp_splice_init[V4][port] = r4 < 0 ? -1 : r4;
> > -		} else {
> > -			r4  = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_SPLICE,
> > -					  &inany_loopback4, ifname,
> > -					  port, uref.u32);
> > -			udp_splice_ns[V4][port] = r4 < 0 ? -1 : r4;
> > -		}
> > +		if (pif == PIF_SPLICE)
> > +			a = &inany_loopback4;
> > +
> > +		r4 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
> > +				 port, uref.u32);
> > +
> > +		socks[V4][port] = r4 < 0 ? -1 : r4;
> >  	}
> >  
> >  	if ((!addr || !inany_v4(addr)) && c->ifi6) {
> > -		if (!ns) {
> > -			r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_HOST,
> > -					 addr ? addr : &inany_any6, ifname,
> > -					 port, uref.u32);
> > +		const union inany_addr *a = addr ?
> > +			addr : &inany_any6;
> 
> ...same here.

Fixed.

> >  
> > -			udp_splice_init[V6][port] = r6 < 0 ? -1 : r6;
> > -		} else {
> > -			r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, PIF_SPLICE,
> > -					 &inany_loopback6, ifname,
> > -					 port, uref.u32);
> > -			udp_splice_ns[V6][port] = r6 < 0 ? -1 : r6;
> > -		}
> > +		if (pif == PIF_SPLICE)
> > +			a = &inany_loopback6;
> > +
> > +		r6 = pif_sock_l4(c, EPOLL_TYPE_UDP_LISTEN, pif, a, ifname,
> > +				 port, uref.u32);
> > +
> > +		socks[V6][port] = r6 < 0 ? -1 : r6;
> >  	}
> >  
> >  	if (IN_INTERVAL(0, FD_REF_MAX, r4) || IN_INTERVAL(0, FD_REF_MAX, r6))
> > @@ -1214,7 +1218,8 @@ static void udp_port_rebind(struct ctx *c, bool outbound)
> >  
> >  		if ((c->ifi4 && socks[V4][port] == -1) ||
> >  		    (c->ifi6 && socks[V6][port] == -1))
> > -			udp_sock_init(c, outbound, NULL, NULL, port);
> > +			udp_sock_init(c, outbound ? PIF_SPLICE : PIF_HOST,
> > +				      NULL, NULL, port);
> >  	}
> >  }
> >  
> > diff --git a/udp.h b/udp.h
> > index 8f8531ad..f78dc528 100644
> > --- a/udp.h
> > +++ b/udp.h
> > @@ -17,8 +17,9 @@ int udp_tap_handler(const struct ctx *c, uint8_t pif,
> >  		    sa_family_t af, const void *saddr, const void *daddr,
> >  		    uint8_t ttl, const struct pool *p, int idx,
> >  		    const struct timespec *now);
> > -int udp_sock_init(const struct ctx *c, int ns, const union inany_addr *addr,
> > -		  const char *ifname, in_port_t port);
> > +int udp_sock_init(const struct ctx *c, uint8_t pif,
> > +		  const union inany_addr *addr, const char *ifname,
> > +		  in_port_t port);
> >  int udp_init(struct ctx *c);
> >  void udp_timer(struct ctx *c, const struct timespec *now);
> >  void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);
> 
> -- 
> Stefano
> 

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 6/8] util: Fix setting of IPV6_V6ONLY socket option
  2025-11-13  6:33   ` Stefano Brivio
@ 2025-11-14  0:24     ` David Gibson
  2025-11-18  0:19       ` Stefano Brivio
  0 siblings, 1 reply; 24+ messages in thread
From: David Gibson @ 2025-11-14  0:24 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 5578 bytes --]

On Thu, Nov 13, 2025 at 07:33:35AM +0100, Stefano Brivio wrote:
> On Wed, 29 Oct 2025 17:26:26 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > Currently we only call setsockopt() on IPV6_V6ONLY when we want to set it
> > to 1, which we typically do on all IPv6 sockets except those explicitly for
> > dual stack listening.  That's not quite right in two ways:
> > 
> >  * Although IPV6_V6ONLY==0 is normally the default on Linux, that can be
> >    changed with the net.ipv6.bindv6only sysctl.  It may also have different
> >    defaults on other OSes if we ever support them.  We know we need it off
> >    for dual stack sockets, so explicitly set it to 0 in that case.
> > 
> >  * At the same time setting IPV6_V6ONLY to 1 for IPv6 sockets bound to a
> >    specific address is harmless but pointless.  Don't set the option at all
> >    in this case, saving a syscall.
> 
> I haven't checked the implications of this but __inet6_bind() handles
> address "types" IPV6_ADDR_ANY and IPV6_ADDR_MAPPED in the same way, for
> IPV6_V6ONLY purposes. I'm not sure if this has any influence on
> functionality though.

AFAICT, technically yes, but not in a way that really matters.  IIUC
for a v4-mapped address using IPV6_V6ONLY=0 will let the socket handle
IPv4 traffic with the corresponding address.  IPV6_V6ONLY will only
handle IPv6 traffic that actually has the v4-mapped address on it.

We won't actually get to the point of passing v4-mapped addresses to
the kernel in passt/pasta, because we already use those to mean "IPv4"
and will go to the IPv4 paths.

> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  util.c | 27 +++++++++++++++++++++------
> >  1 file changed, 21 insertions(+), 6 deletions(-)
> > 
> > diff --git a/util.c b/util.c
> > index c94efae4..62f43895 100644
> > --- a/util.c
> > +++ b/util.c
> > @@ -45,14 +45,14 @@
> >   * @type:	epoll type
> >   * @sa:		Socket address to bind to
> >   * @ifname:	Interface for binding, NULL for any
> > - * @v6only:	Set IPV6_V6ONLY socket option
> > + * @v6only:	If >= 0, set IPV6_V6ONLY socket option to this value
> >   * @data:	epoll reference portion for protocol handlers
> >   *
> >   * Return: newly created socket, negative error code on failure
> >   */
> >  static int sock_l4_(const struct ctx *c, enum epoll_type type,
> >  		    const union sockaddr_inany *sa, const char *ifname,
> > -		    bool v6only, uint32_t data)
> > +		    int v6only, uint32_t data)
> >  {
> >  	sa_family_t af = sa->sa_family;
> >  	union epoll_ref ref = { .type = type, .data = data };
> > @@ -101,9 +101,11 @@ static int sock_l4_(const struct ctx *c, enum epoll_type type,
> >  
> >  	ref.fd = fd;
> >  
> > -	if (v6only)
> > -		if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &y, sizeof(y)))
> > -			debug("Failed to set IPV6_V6ONLY on socket %i", fd);
> > +	if (v6only >= 0)
> > +		if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY,
> > +			       &v6only, sizeof(v6only)))
> > +			debug("Failed to set IPV6_V6ONLY to %d on socket %i",
> > +			      v6only, fd);
> 
> Nit: curly brackets (two pairs) for consistency.

Fixed.

> >  
> >  	if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &y, sizeof(y)))
> >  		debug("Failed to set SO_REUSEADDR on socket %i", fd);
> > @@ -186,7 +188,16 @@ int sock_l4(const struct ctx *c, enum epoll_type type,
> >  	    const union sockaddr_inany *sa, const char *ifname,
> >  	    uint32_t data)
> >  {
> > -	return sock_l4_(c, type, sa, ifname, sa->sa_family == AF_INET6, data);
> > +	int v6only = -1;
> > +
> > +	/* The option doesn't exist for IPv4 sockets, and is irrelevant for IPv6
> > +	 * sockets with a non-wildcard address.
> 
> Same as above: I don't think this is true, strictly speaking, but I
> didn't check whether this inaccuracy is in any way relevant.

Right, so technically yes, but I don't think it's relevant for the
reasons I gave above.  I've rephrased to "we don't care about it for
IPv6 sockets with ...".  Does that help?

> > +	 */
> > +	if (sa->sa_family == AF_INET6 &&
> > +	    IN6_IS_ADDR_UNSPECIFIED(&sa->sa6.sin6_addr))
> > +		v6only = 1;
> > +
> > +	return sock_l4_(c, type, sa, ifname, v6only, data);
> >  }
> >  
> >  int sock_l4_dualstack(const struct ctx *c, enum epoll_type type,
> > @@ -198,6 +209,10 @@ int sock_l4_dualstack(const struct ctx *c, enum epoll_type type,
> >  		.sa6.sin6_port = htons(port),
> >  	};
> >  
> > +	/* Dual stack sockets require IPV6_V6ONLY == 0.  Usually that's the
> > +	 * default, but sysctl net.ipv6.bindv6only can change that, so set the
> > +	 * sockopt explicitly.
> > +	 */
> >  	return sock_l4_(c, type, &sa, ifname, 0, data);
> >  }
> >  
> 
> The rest of the series looks good to me, including 8/8, but it's not
> clear to me what the "secondary reasons" to consider 8/8 at this stage
> might be, so I'm not actually sure what to do with it.
> 
> Is it because it drops some code?

 * It drops some code
 * I think it will address bug 113 (still need to check)
 * It might also help for some other systemd-resolved edge cases
 * It will make life a bit easier to implement bug 171
 * I think the improved symmetry will make other flexible forwarding
   changes a bit easier

I'll look into bug 113 and revise the commit message for 8/8 in the
next spin.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 2/8] util, flow, pif: Simplify sock_l4_sa() interface
  2025-11-13 23:21     ` David Gibson
@ 2025-11-18  0:19       ` Stefano Brivio
  2025-11-18  3:34         ` David Gibson
  0 siblings, 1 reply; 24+ messages in thread
From: Stefano Brivio @ 2025-11-18  0:19 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 14 Nov 2025 10:21:46 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Nov 13, 2025 at 07:33:13AM +0100, Stefano Brivio wrote:
> > On Wed, 29 Oct 2025 17:26:22 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > sock_l4_sa() has a somewhat confusing 'v6only' option controlling whether
> > > to set the IPV6_V6ONLY socket option.  Usually it's set when the given
> > > address is IPv6, but not when we want to create a dual stack listening
> > > socket.  The latter only makes sense when the address is :: however.
> > > 
> > > Clarify this by only keeping the v6only option in an internal helper
> > > sock_l4_().  External users will call either sock_l4() which always creates
> > > a socket bound to a specific IP version, or sock_l4_dualstack() which
> > > creates a dual stack socket, but takes only a port not an address.  
> > 
> > I'm not sure if we'll ever need anything different, but I guess that
> > this is not the only obvious semantic of sock_l4_dualstack(), as it
> > could take a sockaddr_inany eventually, and bind() IPv6 address and its
> > v4-mapped equivalent (...does that even work?).  
> 
> Do you mean that if we have a v4-mapped address, then using an IPv6
> "dual stack" socket will listen both for IPv4 traffic and for IPv6
> traffic actually using that v4-mapped address on the wire (presumably
> as a result of a router translating to a local IPv6-only network)?  I
> think that will work, though I haven't tested.

Yes, that's what I meant.

> In that case we can determine that we need IPV6_V6ONLY from the
> address.  The only case that doesn't cover is if we want to listen for
> v4-mapped traffic already translated by a router but *not* native IPv4
> traffic.  I don't see a lot of reason to ever do that, so it's in the
> "refactor if we ever discover we need it" pile.

I thought that we might want to listen on both IP versions for whatever
reason, on a single socket, with a specific address (say, that v4-mapped
address and the equivalent untranslated address...?).

I know it can't be done now anyway, I'm just saying that
sock_l4_dualstack() forcing wildcard addresses isn't something we should
imply as part of "dualstack".

> Otherwise, the only case in which a single dual stack socket actually
> listens to traffic from both protocols is for a wildcard.  Maybe there
> are obscure wildcard addresses other than :: / 0.0.0.0, but that's
> also in the "worry about it later" pile.

Sure.

> Note that:
> 
> https://github.com/containers/podman/pull/14026/commits/772ead25318dfa340541197e92322bd2346df087
> 
> implies some sort of dual stack localhost support (it treats "dual
> stack" ::1 as listening on both ::1 and 127.0.0.1).  However, AFAICT
> that's just not correct.  On Linux, listening on ::1 listens only on
> ::1 even with V6ONLY explicitly set to 0.

Right, I don't even know what "simulated" means there. Actually there's
no problem description at all. Go figure. I'm not sure if we want to
report something (I'm not even sure what we should report).

> > > We drop the '_sa' suffix while we're at it - it exists because this used
> > > to be an internal version with a sock_l4() wrapper.  The wrapper no longer
> > > exists so the '_sa' is no longer useful.
> > > 
> > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > ---
> > >  flow.c |  6 ++----
> > >  pif.c  | 10 +++-------
> > >  util.c | 27 +++++++++++++++++++++++----
> > >  util.h |  8 +++++---
> > >  4 files changed, 33 insertions(+), 18 deletions(-)
> > > 
> > > diff --git a/flow.c b/flow.c
> > > index 9926f408..fd530ddb 100644
> > > --- a/flow.c
> > > +++ b/flow.c
> > > @@ -186,8 +186,7 @@ static int flowside_sock_splice(void *arg)
> > >  
> > >  	ns_enter(a->c);
> > >  
> > > -	a->fd = sock_l4_sa(a->c, a->type, a->sa, NULL,
> > > -	                   a->sa->sa_family == AF_INET6, a->data);
> > > +	a->fd = sock_l4(a->c, a->type, a->sa, NULL, a->data);
> > >  	a->err = errno;
> > >  
> > >  	return 0;
> > > @@ -222,8 +221,7 @@ int flowside_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
> > >  		else if (sa.sa_family == AF_INET6)
> > >  			ifname = c->ip6.ifname_out;
> > >  
> > > -		return sock_l4_sa(c, type, &sa, ifname,
> > > -				  sa.sa_family == AF_INET6, data);
> > > +		return sock_l4(c, type, &sa, ifname, data);
> > >  
> > >  	case PIF_SPLICE: {
> > >  		struct flowside_sock_args args = {
> > > diff --git a/pif.c b/pif.c
> > > index 31723b29..5fb1f455 100644
> > > --- a/pif.c
> > > +++ b/pif.c
> > > @@ -75,11 +75,7 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
> > >  		const union inany_addr *addr, const char *ifname,
> > >  		in_port_t port, uint32_t data)
> > >  {
> > > -	union sockaddr_inany sa = {
> > > -		.sa6.sin6_family = AF_INET6,
> > > -		.sa6.sin6_addr = in6addr_any,
> > > -		.sa6.sin6_port = htons(port),
> > > -	};
> > > +	union sockaddr_inany sa;
> > >  
> > >  	ASSERT(pif_is_socket(pif));
> > >  
> > > @@ -90,8 +86,8 @@ int pif_sock_l4(const struct ctx *c, enum epoll_type type, uint8_t pif,
> > >  	}
> > >  
> > >  	if (!addr)
> > > -		return sock_l4_sa(c, type, &sa, ifname, false, data);
> > > +		return sock_l4_dualstack(c, type, port, ifname, data);
> > >  
> > >  	pif_sockaddr(c, &sa, pif, addr, port);
> > > -	return sock_l4_sa(c, type, &sa, ifname, sa.sa_family == AF_INET6, data);
> > > +	return sock_l4(c, type, &sa, ifname, data);
> > >  }
> > > diff --git a/util.c b/util.c
> > > index 976fcabe..c94efae4 100644
> > > --- a/util.c
> > > +++ b/util.c
> > > @@ -40,7 +40,7 @@
> > >  #endif
> > >  
> > >  /**
> > > - * sock_l4_sa() - Create and bind socket to socket address, add to epoll list
> > > + * sock_l4_() - Create and bind socket to socket address, add to epoll list
> > >   * @c:		Execution context
> > >   * @type:	epoll type
> > >   * @sa:		Socket address to bind to
> > > @@ -50,9 +50,9 @@
> > >   *
> > >   * Return: newly created socket, negative error code on failure
> > >   */
> > > -int sock_l4_sa(const struct ctx *c, enum epoll_type type,
> > > -	       const union sockaddr_inany *sa, const char *ifname,
> > > -	       bool v6only, uint32_t data)
> > > +static int sock_l4_(const struct ctx *c, enum epoll_type type,
> > > +		    const union sockaddr_inany *sa, const char *ifname,
> > > +		    bool v6only, uint32_t data)
> > >  {
> > >  	sa_family_t af = sa->sa_family;
> > >  	union epoll_ref ref = { .type = type, .data = data };
> > > @@ -182,6 +182,25 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
> > >  	return fd;
> > >  }
> > >  
> > > +int sock_l4(const struct ctx *c, enum epoll_type type,
> > > +	    const union sockaddr_inany *sa, const char *ifname,
> > > +	    uint32_t data)  
> > 
> > Not extremely useful but it saves one "lookup":
> > 
> > /**
> >  * sock_l4() - Create and bind socket to given address, add to epoll list
> >  * @c:		Execution context
> >  * @type:	epoll type
> >  * @sa:		Socket address to bind to
> >  * @ifname:	Interface for binding, NULL for any
> >  *
> >  * Return: newly created socket, negative error code on failure
> >  */  
> 
> Oops, I meant to go back and add function comments here, but I
> obviously forgot.  Fixed.
> 
> While there I removed the "add to epoll list" which is no longer
> correct.

Oops, I hadn't solved the merge conflict yet...

> > > +{
> > > +	return sock_l4_(c, type, sa, ifname, sa->sa_family == AF_INET6, data);
> > > +}
> > > +
> > > +int sock_l4_dualstack(const struct ctx *c, enum epoll_type type,
> > > +		      in_port_t port, const char *ifname, uint32_t data)  
> > 
> > ...same here, and the comment might be used to clarify the
> > functionality.  
> 
> Done.
> 
> >   
> > > +{
> > > +	union sockaddr_inany sa = {
> > > +		.sa6.sin6_family = AF_INET6,
> > > +		.sa6.sin6_addr = in6addr_any,
> > > +		.sa6.sin6_port = htons(port),
> > > +	};
> > > +
> > > +	return sock_l4_(c, type, &sa, ifname, 0, data);
> > > +}
> > > +
> > >  /**
> > >   * sock_unix() - Create and bind AF_UNIX socket
> > >   * @sock_path:	Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
> > > diff --git a/util.h b/util.h
> > > index e1a1ebc9..7f0cf686 100644
> > > --- a/util.h
> > > +++ b/util.h
> > > @@ -203,9 +203,11 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
> > >  struct ctx;
> > >  union sockaddr_inany;
> > >  
> > > -int sock_l4_sa(const struct ctx *c, enum epoll_type type,
> > > -	       const union sockaddr_inany *sa, const char *ifname,
> > > -	       bool v6only, uint32_t data);
> > > +int sock_l4(const struct ctx *c, enum epoll_type type,
> > > +	    const union sockaddr_inany *sa, const char *ifname,
> > > +	    uint32_t data);
> > > +int sock_l4_dualstack(const struct ctx *c, enum epoll_type type,
> > > +		      in_port_t port, const char *ifname, uint32_t data);
> > >  int sock_unix(char *sock_path);
> > >  void sock_probe_mem(struct ctx *c);
> > >  long timespec_diff_ms(const struct timespec *a, const struct timespec *b);  

-- 
Stefano


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 6/8] util: Fix setting of IPV6_V6ONLY socket option
  2025-11-14  0:24     ` David Gibson
@ 2025-11-18  0:19       ` Stefano Brivio
  0 siblings, 0 replies; 24+ messages in thread
From: Stefano Brivio @ 2025-11-18  0:19 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 14 Nov 2025 11:24:31 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Nov 13, 2025 at 07:33:35AM +0100, Stefano Brivio wrote:
> > On Wed, 29 Oct 2025 17:26:26 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > Currently we only call setsockopt() on IPV6_V6ONLY when we want to set it
> > > to 1, which we typically do on all IPv6 sockets except those explicitly for
> > > dual stack listening.  That's not quite right in two ways:
> > > 
> > >  * Although IPV6_V6ONLY==0 is normally the default on Linux, that can be
> > >    changed with the net.ipv6.bindv6only sysctl.  It may also have different
> > >    defaults on other OSes if we ever support them.  We know we need it off
> > >    for dual stack sockets, so explicitly set it to 0 in that case.
> > > 
> > >  * At the same time setting IPV6_V6ONLY to 1 for IPv6 sockets bound to a
> > >    specific address is harmless but pointless.  Don't set the option at all
> > >    in this case, saving a syscall.  
> > 
> > I haven't checked the implications of this but __inet6_bind() handles
> > address "types" IPV6_ADDR_ANY and IPV6_ADDR_MAPPED in the same way, for
> > IPV6_V6ONLY purposes. I'm not sure if this has any influence on
> > functionality though.  
> 
> AFAICT, technically yes, but not in a way that really matters.  IIUC
> for a v4-mapped address using IPV6_V6ONLY=0 will let the socket handle
> IPv4 traffic with the corresponding address.  IPV6_V6ONLY will only
> handle IPv6 traffic that actually has the v4-mapped address on it.
> 
> We won't actually get to the point of passing v4-mapped addresses to
> the kernel in passt/pasta, because we already use those to mean "IPv4"
> and will go to the IPv4 paths.
> 
> > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > ---
> > >  util.c | 27 +++++++++++++++++++++------
> > >  1 file changed, 21 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/util.c b/util.c
> > > index c94efae4..62f43895 100644
> > > --- a/util.c
> > > +++ b/util.c
> > > @@ -45,14 +45,14 @@
> > >   * @type:	epoll type
> > >   * @sa:		Socket address to bind to
> > >   * @ifname:	Interface for binding, NULL for any
> > > - * @v6only:	Set IPV6_V6ONLY socket option
> > > + * @v6only:	If >= 0, set IPV6_V6ONLY socket option to this value
> > >   * @data:	epoll reference portion for protocol handlers
> > >   *
> > >   * Return: newly created socket, negative error code on failure
> > >   */
> > >  static int sock_l4_(const struct ctx *c, enum epoll_type type,
> > >  		    const union sockaddr_inany *sa, const char *ifname,
> > > -		    bool v6only, uint32_t data)
> > > +		    int v6only, uint32_t data)
> > >  {
> > >  	sa_family_t af = sa->sa_family;
> > >  	union epoll_ref ref = { .type = type, .data = data };
> > > @@ -101,9 +101,11 @@ static int sock_l4_(const struct ctx *c, enum epoll_type type,
> > >  
> > >  	ref.fd = fd;
> > >  
> > > -	if (v6only)
> > > -		if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &y, sizeof(y)))
> > > -			debug("Failed to set IPV6_V6ONLY on socket %i", fd);
> > > +	if (v6only >= 0)
> > > +		if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY,
> > > +			       &v6only, sizeof(v6only)))
> > > +			debug("Failed to set IPV6_V6ONLY to %d on socket %i",
> > > +			      v6only, fd);  
> > 
> > Nit: curly brackets (two pairs) for consistency.  
> 
> Fixed.
> 
> > >  
> > >  	if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &y, sizeof(y)))
> > >  		debug("Failed to set SO_REUSEADDR on socket %i", fd);
> > > @@ -186,7 +188,16 @@ int sock_l4(const struct ctx *c, enum epoll_type type,
> > >  	    const union sockaddr_inany *sa, const char *ifname,
> > >  	    uint32_t data)
> > >  {
> > > -	return sock_l4_(c, type, sa, ifname, sa->sa_family == AF_INET6, data);
> > > +	int v6only = -1;
> > > +
> > > +	/* The option doesn't exist for IPv4 sockets, and is irrelevant for IPv6
> > > +	 * sockets with a non-wildcard address.  
> > 
> > Same as above: I don't think this is true, strictly speaking, but I
> > didn't check whether this inaccuracy is in any way relevant.  
> 
> Right, so technically yes, but I don't think it's relevant for the
> reasons I gave above.  I've rephrased to "we don't care about it for
> IPv6 sockets with ...".  Does that help?

Yes, it does, thanks.

> > > +	 */
> > > +	if (sa->sa_family == AF_INET6 &&
> > > +	    IN6_IS_ADDR_UNSPECIFIED(&sa->sa6.sin6_addr))
> > > +		v6only = 1;
> > > +
> > > +	return sock_l4_(c, type, sa, ifname, v6only, data);
> > >  }
> > >  
> > >  int sock_l4_dualstack(const struct ctx *c, enum epoll_type type,
> > > @@ -198,6 +209,10 @@ int sock_l4_dualstack(const struct ctx *c, enum epoll_type type,
> > >  		.sa6.sin6_port = htons(port),
> > >  	};
> > >  
> > > +	/* Dual stack sockets require IPV6_V6ONLY == 0.  Usually that's the
> > > +	 * default, but sysctl net.ipv6.bindv6only can change that, so set the
> > > +	 * sockopt explicitly.
> > > +	 */
> > >  	return sock_l4_(c, type, &sa, ifname, 0, data);
> > >  }
> > >    
> > 
> > The rest of the series looks good to me, including 8/8, but it's not
> > clear to me what the "secondary reasons" to consider 8/8 at this stage
> > might be, so I'm not actually sure what to do with it.
> > 
> > Is it because it drops some code?  
> 
>  * It drops some code
>  * I think it will address bug 113 (still need to check)

Right, I forgot about this part.

>  * It might also help for some other systemd-resolved edge cases
>  * It will make life a bit easier to implement bug 171
>  * I think the improved symmetry will make other flexible forwarding
>    changes a bit easier
> 
> I'll look into bug 113 and revise the commit message for 8/8 in the
> next spin.

-- 
Stefano


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 2/8] util, flow, pif: Simplify sock_l4_sa() interface
  2025-11-18  0:19       ` Stefano Brivio
@ 2025-11-18  3:34         ` David Gibson
  2025-11-19 11:42           ` Stefano Brivio
  0 siblings, 1 reply; 24+ messages in thread
From: David Gibson @ 2025-11-18  3:34 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 4033 bytes --]

On Tue, Nov 18, 2025 at 01:19:21AM +0100, Stefano Brivio wrote:
> On Fri, 14 Nov 2025 10:21:46 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Nov 13, 2025 at 07:33:13AM +0100, Stefano Brivio wrote:
> > > On Wed, 29 Oct 2025 17:26:22 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > sock_l4_sa() has a somewhat confusing 'v6only' option controlling whether
> > > > to set the IPV6_V6ONLY socket option.  Usually it's set when the given
> > > > address is IPv6, but not when we want to create a dual stack listening
> > > > socket.  The latter only makes sense when the address is :: however.
> > > > 
> > > > Clarify this by only keeping the v6only option in an internal helper
> > > > sock_l4_().  External users will call either sock_l4() which always creates
> > > > a socket bound to a specific IP version, or sock_l4_dualstack() which
> > > > creates a dual stack socket, but takes only a port not an address.  
> > > 
> > > I'm not sure if we'll ever need anything different, but I guess that
> > > this is not the only obvious semantic of sock_l4_dualstack(), as it
> > > could take a sockaddr_inany eventually, and bind() IPv6 address and its
> > > v4-mapped equivalent (...does that even work?).  
> > 
> > Do you mean that if we have a v4-mapped address, then using an IPv6
> > "dual stack" socket will listen both for IPv4 traffic and for IPv6
> > traffic actually using that v4-mapped address on the wire (presumably
> > as a result of a router translating to a local IPv6-only network)?  I
> > think that will work, though I haven't tested.
> 
> Yes, that's what I meant.
> 
> > In that case we can determine that we need IPV6_V6ONLY from the
> > address.  The only case that doesn't cover is if we want to listen for
> > v4-mapped traffic already translated by a router but *not* native IPv4
> > traffic.  I don't see a lot of reason to ever do that, so it's in the
> > "refactor if we ever discover we need it" pile.
> 
> I thought that we might want to listen on both IP versions for whatever
> reason, on a single socket, with a specific address (say, that v4-mapped
> address and the equivalent untranslated address...?).

I'm not really sure what you mean by an "equivalent untranslated
address".  AFAIK, the only non-wildcard case that will actually listen
on both IP versions is a v4-mapped address.

So, yes we probably should explicitly set IPV6_V6ONLY==0 for v4-mapped
addresses as well.

> I know it can't be done now anyway, I'm just saying that
> sock_l4_dualstack() forcing wildcard addresses isn't something we should
> imply as part of "dualstack".

Hm, ok.  What if I renamed it to sock_l4_dualwild()?

> > Otherwise, the only case in which a single dual stack socket actually
> > listens to traffic from both protocols is for a wildcard.  Maybe there
> > are obscure wildcard addresses other than :: / 0.0.0.0, but that's
> > also in the "worry about it later" pile.
> 
> Sure.
> 
> > Note that:
> > 
> > https://github.com/containers/podman/pull/14026/commits/772ead25318dfa340541197e92322bd2346df087
> > 
> > implies some sort of dual stack localhost support (it treats "dual
> > stack" ::1 as listening on both ::1 and 127.0.0.1).  However, AFAICT
> > that's just not correct.  On Linux, listening on ::1 listens only on
> > ::1 even with V6ONLY explicitly set to 0.
> 
> Right, I don't even know what "simulated" means there. Actually there's
> no problem description at all. Go figure. I'm not sure if we want to
> report something (I'm not even sure what we should report).

I think "simulated" there means using one v4 and one v6 socket instead
of a dual stack socket.

Looks like that patch came in response to
https://github.com/containers/podman/issues/12292

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 2/8] util, flow, pif: Simplify sock_l4_sa() interface
  2025-11-18  3:34         ` David Gibson
@ 2025-11-19 11:42           ` Stefano Brivio
  2025-11-20  0:05             ` David Gibson
  0 siblings, 1 reply; 24+ messages in thread
From: Stefano Brivio @ 2025-11-19 11:42 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Tue, 18 Nov 2025 14:34:58 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, Nov 18, 2025 at 01:19:21AM +0100, Stefano Brivio wrote:
> > On Fri, 14 Nov 2025 10:21:46 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Thu, Nov 13, 2025 at 07:33:13AM +0100, Stefano Brivio wrote:  
> > > > On Wed, 29 Oct 2025 17:26:22 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >     
> > > > > sock_l4_sa() has a somewhat confusing 'v6only' option controlling whether
> > > > > to set the IPV6_V6ONLY socket option.  Usually it's set when the given
> > > > > address is IPv6, but not when we want to create a dual stack listening
> > > > > socket.  The latter only makes sense when the address is :: however.
> > > > > 
> > > > > Clarify this by only keeping the v6only option in an internal helper
> > > > > sock_l4_().  External users will call either sock_l4() which always creates
> > > > > a socket bound to a specific IP version, or sock_l4_dualstack() which
> > > > > creates a dual stack socket, but takes only a port not an address.    
> > > > 
> > > > I'm not sure if we'll ever need anything different, but I guess that
> > > > this is not the only obvious semantic of sock_l4_dualstack(), as it
> > > > could take a sockaddr_inany eventually, and bind() IPv6 address and its
> > > > v4-mapped equivalent (...does that even work?).    
> > > 
> > > Do you mean that if we have a v4-mapped address, then using an IPv6
> > > "dual stack" socket will listen both for IPv4 traffic and for IPv6
> > > traffic actually using that v4-mapped address on the wire (presumably
> > > as a result of a router translating to a local IPv6-only network)?  I
> > > think that will work, though I haven't tested.  
> > 
> > Yes, that's what I meant.
> >   
> > > In that case we can determine that we need IPV6_V6ONLY from the
> > > address.  The only case that doesn't cover is if we want to listen for
> > > v4-mapped traffic already translated by a router but *not* native IPv4
> > > traffic.  I don't see a lot of reason to ever do that, so it's in the
> > > "refactor if we ever discover we need it" pile.  
> > 
> > I thought that we might want to listen on both IP versions for whatever
> > reason, on a single socket, with a specific address (say, that v4-mapped
> > address and the equivalent untranslated address...?).  
> 
> I'm not really sure what you mean by an "equivalent untranslated
> address".  AFAIK, the only non-wildcard case that will actually listen
> on both IP versions is a v4-mapped address.

I mean 192.0.2.1 (untranslated, IPv4) and ::ffff:192.0.2.1 (v4-mapped).
Will we ever want to listen to both?

I don't think we have to care about that right now, though.

> So, yes we probably should explicitly set IPV6_V6ONLY==0 for v4-mapped
> addresses as well.
> 
> > I know it can't be done now anyway, I'm just saying that
> > sock_l4_dualstack() forcing wildcard addresses isn't something we should
> > imply as part of "dualstack".  
> 
> Hm, ok.  What if I renamed it to sock_l4_dualwild()?

Short-hands for "wildcard" aren't necessarily obvious. I would have
gone with "dual_any" or "dualstack_any" or "v4v6_any" or "inany_any".

But actually it can also stay like that, I guess, especially as this
looks refactor-prone for https://bugs.passt.top/show_bug.cgi?id=140. I
just wanted to raise the fact it's not obvious that "dualstack" implies
:: and 0.0.0.0. It doesn't need to be addressed in code or comments now,
or ever.

> > > Otherwise, the only case in which a single dual stack socket actually
> > > listens to traffic from both protocols is for a wildcard.  Maybe there
> > > are obscure wildcard addresses other than :: / 0.0.0.0, but that's
> > > also in the "worry about it later" pile.  
> > 
> > Sure.
> >   
> > > Note that:
> > > 
> > > https://github.com/containers/podman/pull/14026/commits/772ead25318dfa340541197e92322bd2346df087
> > > 
> > > implies some sort of dual stack localhost support (it treats "dual
> > > stack" ::1 as listening on both ::1 and 127.0.0.1).  However, AFAICT
> > > that's just not correct.  On Linux, listening on ::1 listens only on
> > > ::1 even with V6ONLY explicitly set to 0.  
> > 
> > Right, I don't even know what "simulated" means there. Actually there's
> > no problem description at all. Go figure. I'm not sure if we want to
> > report something (I'm not even sure what we should report).  
> 
> I think "simulated" there means using one v4 and one v6 socket instead
> of a dual stack socket.
> 
> Looks like that patch came in response to
> https://github.com/containers/podman/issues/12292

-- 
Stefano


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 2/8] util, flow, pif: Simplify sock_l4_sa() interface
  2025-11-19 11:42           ` Stefano Brivio
@ 2025-11-20  0:05             ` David Gibson
  2025-11-20  2:22               ` David Gibson
  0 siblings, 1 reply; 24+ messages in thread
From: David Gibson @ 2025-11-20  0:05 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 4253 bytes --]

On Wed, Nov 19, 2025 at 12:42:04PM +0100, Stefano Brivio wrote:
> On Tue, 18 Nov 2025 14:34:58 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Tue, Nov 18, 2025 at 01:19:21AM +0100, Stefano Brivio wrote:
> > > On Fri, 14 Nov 2025 10:21:46 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > On Thu, Nov 13, 2025 at 07:33:13AM +0100, Stefano Brivio wrote:  
> > > > > On Wed, 29 Oct 2025 17:26:22 +1100
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >     
> > > > > > sock_l4_sa() has a somewhat confusing 'v6only' option controlling whether
> > > > > > to set the IPV6_V6ONLY socket option.  Usually it's set when the given
> > > > > > address is IPv6, but not when we want to create a dual stack listening
> > > > > > socket.  The latter only makes sense when the address is :: however.
> > > > > > 
> > > > > > Clarify this by only keeping the v6only option in an internal helper
> > > > > > sock_l4_().  External users will call either sock_l4() which always creates
> > > > > > a socket bound to a specific IP version, or sock_l4_dualstack() which
> > > > > > creates a dual stack socket, but takes only a port not an address.    
> > > > > 
> > > > > I'm not sure if we'll ever need anything different, but I guess that
> > > > > this is not the only obvious semantic of sock_l4_dualstack(), as it
> > > > > could take a sockaddr_inany eventually, and bind() IPv6 address and its
> > > > > v4-mapped equivalent (...does that even work?).    
> > > > 
> > > > Do you mean that if we have a v4-mapped address, then using an IPv6
> > > > "dual stack" socket will listen both for IPv4 traffic and for IPv6
> > > > traffic actually using that v4-mapped address on the wire (presumably
> > > > as a result of a router translating to a local IPv6-only network)?  I
> > > > think that will work, though I haven't tested.  
> > > 
> > > Yes, that's what I meant.
> > >   
> > > > In that case we can determine that we need IPV6_V6ONLY from the
> > > > address.  The only case that doesn't cover is if we want to listen for
> > > > v4-mapped traffic already translated by a router but *not* native IPv4
> > > > traffic.  I don't see a lot of reason to ever do that, so it's in the
> > > > "refactor if we ever discover we need it" pile.  
> > > 
> > > I thought that we might want to listen on both IP versions for whatever
> > > reason, on a single socket, with a specific address (say, that v4-mapped
> > > address and the equivalent untranslated address...?).  
> > 
> > I'm not really sure what you mean by an "equivalent untranslated
> > address".  AFAIK, the only non-wildcard case that will actually listen
> > on both IP versions is a v4-mapped address.
> 
> I mean 192.0.2.1 (untranslated, IPv4) and ::ffff:192.0.2.1 (v4-mapped).
> Will we ever want to listen to both?

Maybe one day, but not soon, I think.

> I don't think we have to care about that right now, though.

Agreed.

> > So, yes we probably should explicitly set IPV6_V6ONLY==0 for v4-mapped
> > addresses as well.
> > 
> > > I know it can't be done now anyway, I'm just saying that
> > > sock_l4_dualstack() forcing wildcard addresses isn't something we should
> > > imply as part of "dualstack".  
> > 
> > Hm, ok.  What if I renamed it to sock_l4_dualwild()?
> 
> Short-hands for "wildcard" aren't necessarily obvious. I would have
> gone with "dual_any" or "dualstack_any" or "v4v6_any" or "inany_any".

'dualstack_any' is definitely a better idea.  Unfortunately, I forget
to change this in my last spin.

> But actually it can also stay like that, I guess, especially as this
> looks refactor-prone for https://bugs.passt.top/show_bug.cgi?id=140.

True.

> I
> just wanted to raise the fact it's not obvious that "dualstack" implies
> :: and 0.0.0.0. It doesn't need to be addressed in code or comments now,
> or ever.

Ok.  Sounds like it's definitely worth a respin.  Perhaps I'll add a
rename patch into some future series.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 2/8] util, flow, pif: Simplify sock_l4_sa() interface
  2025-11-20  0:05             ` David Gibson
@ 2025-11-20  2:22               ` David Gibson
  0 siblings, 0 replies; 24+ messages in thread
From: David Gibson @ 2025-11-20  2:22 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 1342 bytes --]

On Thu, Nov 20, 2025 at 11:05:25AM +1100, David Gibson wrote:
> On Wed, Nov 19, 2025 at 12:42:04PM +0100, Stefano Brivio wrote:
> > On Tue, 18 Nov 2025 14:34:58 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> > 
> > > On Tue, Nov 18, 2025 at 01:19:21AM +0100, Stefano Brivio wrote:
> > > > On Fri, 14 Nov 2025 10:21:46 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
[snip]
> > > I'm not really sure what you mean by an "equivalent untranslated
> > > address".  AFAIK, the only non-wildcard case that will actually listen
> > > on both IP versions is a v4-mapped address.
> > 
> > I mean 192.0.2.1 (untranslated, IPv4) and ::ffff:192.0.2.1 (v4-mapped).
> > Will we ever want to listen to both?
> 
> Maybe one day, but not soon, I think.
> 
> > I don't think we have to care about that right now, though.
> 
> Agreed.

So, I couldn't find anything definitive, but fwiw I did find a couple
things suggesting that v4-mapped addresses probably shouldn't ever
appear on the wire:

https://datatracker.ietf.org/doc/html/draft-itojun-v6ops-v4mapped-harmful-02
https://lwn.net/Articles/688630/

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2025-11-20  2:22 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-29  6:26 [PATCH v3 0/8] Reduce differences between inbound and outbound socket binding David Gibson
2025-10-29  6:26 ` [PATCH v3 1/8] inany: Let length of sockaddr_inany be implicit from the family David Gibson
2025-11-13  6:33   ` Stefano Brivio
2025-11-13 22:53     ` David Gibson
2025-10-29  6:26 ` [PATCH v3 2/8] util, flow, pif: Simplify sock_l4_sa() interface David Gibson
2025-11-13  6:33   ` Stefano Brivio
2025-11-13 23:21     ` David Gibson
2025-11-18  0:19       ` Stefano Brivio
2025-11-18  3:34         ` David Gibson
2025-11-19 11:42           ` Stefano Brivio
2025-11-20  0:05             ` David Gibson
2025-11-20  2:22               ` David Gibson
2025-10-29  6:26 ` [PATCH v3 3/8] tcp: Merge tcp_ns_sock_init[46]() into tcp_sock_init_one() David Gibson
2025-10-29  6:26 ` [PATCH v3 4/8] udp: Unify some more inbound/outbound parts of udp_sock_init() David Gibson
2025-11-13  6:33   ` Stefano Brivio
2025-11-13 23:33     ` David Gibson
2025-10-29  6:26 ` [PATCH v3 5/8] udp: Move udp_sock_init() special case to its caller David Gibson
2025-10-29  6:26 ` [PATCH v3 6/8] util: Fix setting of IPV6_V6ONLY socket option David Gibson
2025-11-13  6:33   ` Stefano Brivio
2025-11-14  0:24     ` David Gibson
2025-11-18  0:19       ` Stefano Brivio
2025-10-29  6:26 ` [PATCH v3 7/8] tcp, udp: Remove fallback if creating dual stack socket fails David Gibson
2025-10-29  6:26 ` [PATCH v3 8/8] [RFC, DO NOT APPLY] tcp, udp: Bind outbound listening sockets by interface instead of address David Gibson
2025-10-30  3:58 ` [PATCH v3 0/8] Reduce differences between inbound and outbound socket binding David Gibson

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).