public inbox for passt-dev@passt.top
* [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections
@ 2022-11-16  4:41 David Gibson
  2022-11-16  4:41 ` [PATCH 01/32] clang-tidy: Suppress warning about assignments in if statements David Gibson
                   ` (31 more replies)
  0 siblings, 32 replies; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:41 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

When forwarding many ports, passt can consume a lot of kernel memory
because of the many listening sockets it opens.  There are not a lot
of ways we can reduce that, but here's one.

Currently we create separate listening sockets for each port for both
IPv4 and IPv6.  However, on Linux (and probably other platforms), it's
possible to listen for both IPv4 and IPv6 connections on an IPv6
socket.  This series uses such dual stack sockets to halve the number
of listening sockets needed for TCP.  When forwarding all TCP and UDP
ports, this reduces the kernel memory used from around 677 MiB to
around 487 MiB (kernel 6.0.8 on an x86_64 Fedora 37 machine).

This should also be possible for UDP, but that will require a mostly
separate implementation.

David Gibson (32):
  clang-tidy: Suppress warning about assignments in if statements
  style: Minor corrections to function comments
  tcp_splice: #include tcp_splice.h in tcp_splice.c
  tcp: Remove unused TCP_MAX_SOCKS constant
  tcp: Better helpers for converting between connection pointer and
    index
  tcp_splice: Helpers for converting from index to/from tcp_splice_conn
  tcp: Move connection state structures into a shared header
  tcp: Add connection union type
  tcp: Improved helpers to update connections after moving
  tcp: Unify spliced and non-spliced connection tables
  tcp: Unify tcp_defer_handler and tcp_splice_defer_handler()
  tcp: Partially unify tcp_timer() and tcp_splice_timer()
  tcp: Unify the IN_EPOLL flag
  tcp: Separate helpers to create ns listening sockets
  tcp: Unify part of spliced and non-spliced conn_from_sock path
  tcp: Use the same sockets to listen for spliced and non-spliced
    connections
  tcp: Remove splice from tcp_epoll_ref
  tcp: Don't store hash bucket in connection structures
  inany: Helper functions for handling addresses which could be IPv4 or
    IPv6
  tcp: Hash IPv4 and IPv4-mapped-IPv6 addresses the same
  tcp: Take tcp_hash_insert() address from struct tcp_conn
  tcp: Simplify tcp_hash_match() to take an inany_addr
  tcp: Unify initial sequence number calculation for IPv4 and IPv6
  tcp: Have tcp_seq_init() take its parameters from struct tcp_conn
  tcp: Fix small errors in tcp_seq_init() time handling
  tcp: Remove v6 flag from tcp_epoll_ref
  tcp: NAT IPv4-mapped IPv6 addresses like IPv4 addresses
  tcp_splice: Allow splicing of connections from IPv4-mapped loopback
  tcp: Consolidate tcp_sock_init[46]
  util: Allow sock_l4() to open dual stack sockets
  util: Always return -1 on error in sock_l4()
  tcp: Use dual stack sockets for port forwarding when possible

 Makefile     |  11 +-
 conf.c       |  12 +-
 inany.h      |  94 +++++
 siphash.c    |   2 +
 tap.c        |   6 +-
 tcp.c        | 978 ++++++++++++++++++++++-----------------------------
 tcp.h        |  11 +-
 tcp_conn.h   | 192 ++++++++++
 tcp_splice.c | 337 ++++++++----------
 tcp_splice.h |  12 +-
 util.c       |  19 +-
 11 files changed, 887 insertions(+), 787 deletions(-)
 create mode 100644 inany.h
 create mode 100644 tcp_conn.h

-- 
2.38.1



* [PATCH 01/32] clang-tidy: Suppress warning about assignments in if statements
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
@ 2022-11-16  4:41 ` David Gibson
  2022-11-16 23:10   ` Stefano Brivio
  2022-11-16  4:41 ` [PATCH 02/32] style: Minor corrections to function comments David Gibson
                   ` (30 subsequent siblings)
  31 siblings, 1 reply; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:41 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

clang-tools 15.0.0 appears to have added a new warning that will always
complain about assignments in if statements, which we use in a number of
places in passt/pasta.  Encountered on Fedora 37 with
clang-tools-extra-15.0.0-3.fc37.x86_64.

Suppress the new warning so that we can compile and test.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Makefile b/Makefile
index 6b22408..8bcbbc0 100644
--- a/Makefile
+++ b/Makefile
@@ -262,6 +262,7 @@ clang-tidy: $(SRCS) $(HEADERS)
 	clang-tidy -checks=*,-modernize-*,\
 	-clang-analyzer-valist.Uninitialized,\
 	-cppcoreguidelines-init-variables,\
+	-bugprone-assignment-in-if-condition,\
 	-bugprone-macro-parentheses,\
 	-google-readability-braces-around-statements,\
 	-hicpp-braces-around-statements,\
-- 
2.38.1



* [PATCH 02/32] style: Minor corrections to function comments
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
  2022-11-16  4:41 ` [PATCH 01/32] clang-tidy: Suppress warning about assignments in if statements David Gibson
@ 2022-11-16  4:41 ` David Gibson
  2022-11-16 23:11   ` Stefano Brivio
  2022-11-16  4:41 ` [PATCH 03/32] tcp_splice: #include tcp_splice.h in tcp_splice.c David Gibson
                   ` (29 subsequent siblings)
  31 siblings, 1 reply; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:41 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Some style issues and a typo.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 conf.c | 6 +++---
 tap.c  | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/conf.c b/conf.c
index 1adcf83..3ad247e 100644
--- a/conf.c
+++ b/conf.c
@@ -112,9 +112,9 @@ static int get_bound_ports_ns(void *arg)
  * @s:		String to search
  * @c:		Delimiter character
  *
- * Returns: If another @c is found in @s, returns a pointer to the
- *	    character *after* the delimiter, if no further @c is in
- *	    @s, return NULL
+ * Return: If another @c is found in @s, returns a pointer to the
+ *	   character *after* the delimiter, if no further @c is in @s,
+ *	   return NULL
  */
 static char *next_chunk(const char *s, char c)
 {
diff --git a/tap.c b/tap.c
index abeff25..707660c 100644
--- a/tap.c
+++ b/tap.c
@@ -90,7 +90,7 @@ int tap_send(const struct ctx *c, const void *data, size_t len)
  * tap_ip4_daddr() - Normal IPv4 destination address for inbound packets
  * @c:		Execution context
  *
- * Returns: IPv4 address, network order
+ * Return:	IPv4 address, network order
  */
 struct in_addr tap_ip4_daddr(const struct ctx *c)
 {
@@ -98,11 +98,11 @@ struct in_addr tap_ip4_daddr(const struct ctx *c)
 }
 
 /**
- * tap_ip6_daddr() - Normal IPv4 destination address for inbound packets
+ * tap_ip6_daddr() - Normal IPv6 destination address for inbound packets
  * @c:		Execution context
  * @src:	Source address
  *
- * Returns: pointer to IPv6 address
+ * Return:	pointer to IPv6 address
  */
 const struct in6_addr *tap_ip6_daddr(const struct ctx *c,
 				     const struct in6_addr *src)
-- 
2.38.1



* [PATCH 03/32] tcp_splice: #include tcp_splice.h in tcp_splice.c
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
  2022-11-16  4:41 ` [PATCH 01/32] clang-tidy: Suppress warning about assignments in if statements David Gibson
  2022-11-16  4:41 ` [PATCH 02/32] style: Minor corrections to function comments David Gibson
@ 2022-11-16  4:41 ` David Gibson
  2022-11-16  4:41 ` [PATCH 04/32] tcp: Remove unused TCP_MAX_SOCKS constant David Gibson
                   ` (28 subsequent siblings)
  31 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:41 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

This obvious include was omitted, which means that declarations in the
header weren't checked against definitions in the .c file.  Adding it
exposes a stale declaration for a function that is now static, and a
duplicate #define.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp_splice.c | 2 +-
 tcp_splice.h | 3 ---
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/tcp_splice.c b/tcp_splice.c
index 99c5fa7..3c5c111 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -49,9 +49,9 @@
 #include "util.h"
 #include "passt.h"
 #include "log.h"
+#include "tcp_splice.h"
 
 #define MAX_PIPE_SIZE			(8UL * 1024 * 1024)
-#define TCP_SPLICE_MAX_CONNS		(128 * 1024)
 #define TCP_SPLICE_PIPE_POOL_SIZE	16
 #define TCP_SPLICE_CONN_PRESSURE	30	/* % of splice_conn_count */
 #define TCP_SPLICE_FILE_PRESSURE	30	/* % of c->nofile */
diff --git a/tcp_splice.h b/tcp_splice.h
index 63ffc68..2c4bff3 100644
--- a/tcp_splice.h
+++ b/tcp_splice.h
@@ -8,11 +8,8 @@
 
 #define TCP_SPLICE_MAX_CONNS			(128 * 1024)
 
-struct tcp_splice_conn;
-
 void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
 			     uint32_t events);
-void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn);
 void tcp_splice_init(struct ctx *c);
 void tcp_splice_timer(struct ctx *c);
 void tcp_splice_defer_handler(struct ctx *c);
-- 
2.38.1



* [PATCH 04/32] tcp: Remove unused TCP_MAX_SOCKS constant
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (2 preceding siblings ...)
  2022-11-16  4:41 ` [PATCH 03/32] tcp_splice: #include tcp_splice.h in tcp_splice.c David Gibson
@ 2022-11-16  4:41 ` David Gibson
  2022-11-16  4:41 ` [PATCH 05/32] tcp: Better helpers for converting between connection pointer and index David Gibson
                   ` (27 subsequent siblings)
  31 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:41 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Presumably it meant something in the past, but it's no longer used.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tcp.h b/tcp.h
index 3fabb5a..bba0f38 100644
--- a/tcp.h
+++ b/tcp.h
@@ -10,7 +10,6 @@
 
 #define TCP_CONN_INDEX_BITS		17	/* 128k */
 #define TCP_MAX_CONNS			(1 << TCP_CONN_INDEX_BITS)
-#define TCP_MAX_SOCKS			(TCP_MAX_CONNS + USHRT_MAX * 2)
 
 #define TCP_SOCK_POOL_SIZE		32
 
-- 
2.38.1



* [PATCH 05/32] tcp: Better helpers for converting between connection pointer and index
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (3 preceding siblings ...)
  2022-11-16  4:41 ` [PATCH 04/32] tcp: Remove unused TCP_MAX_SOCKS constant David Gibson
@ 2022-11-16  4:41 ` David Gibson
  2022-11-16 23:11   ` Stefano Brivio
  2022-11-16  4:41 ` [PATCH 06/32] tcp_splice: Helpers for converting from index to/from tcp_splice_conn David Gibson
                   ` (26 subsequent siblings)
  31 siblings, 1 reply; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:41 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

The macro CONN_OR_NULL() is used to look up connections by index with
bounds checking.  Replace it with an inline function, conn_at_idx(),
which means:
    - Better type checking
    - No danger of multiple evaluation of an @index with side effects

Also add a macro, CONN_IDX(), to perform the reverse translation: from
connection pointer to index.  This will make later cleanups easier and
safer.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c | 83 ++++++++++++++++++++++++++++++++---------------------------
 1 file changed, 45 insertions(+), 38 deletions(-)

diff --git a/tcp.c b/tcp.c
index d043123..4e56a6c 100644
--- a/tcp.c
+++ b/tcp.c
@@ -518,14 +518,6 @@ struct tcp_conn {
 	 (conn->events & (SOCK_FIN_RCVD | TAP_FIN_RCVD)))
 #define CONN_HAS(conn, set)	((conn->events & (set)) == (set))
 
-#define CONN(index)		(tc + (index))
-
-/* We probably don't want to use gcc statement expressions (for portability), so
- * use this only after well-defined sequence points (no pre-/post-increments).
- */
-#define CONN_OR_NULL(index)						\
-	(((int)(index) >= 0 && (index) < TCP_MAX_CONNS) ? (tc + (index)) : NULL)
-
 static const char *tcp_event_str[] __attribute((__unused__)) = {
 	"SOCK_ACCEPTED", "TAP_SYN_RCVD", "ESTABLISHED", "TAP_SYN_ACK_SENT",
 
@@ -705,6 +697,21 @@ static size_t tcp6_l2_flags_buf_bytes;
 /* TCP connections */
 static struct tcp_conn tc[TCP_MAX_CONNS];
 
+#define CONN(index)		(tc + (index))
+#define CONN_IDX(conn)		((conn) - tc)
+
+/** conn_at_idx() - Find a connection by index, if present
+ * @index:	Index of connection to lookup
+ *
+ * Return:	Pointer to connection, or NULL if @index is out of bounds
+ */
+static inline struct tcp_conn *conn_at_idx(int index)
+{
+	if ((index < 0) || (index >= TCP_MAX_CONNS))
+		return NULL;
+	return CONN(index);
+}
+
 /* Table for lookup from remote address, local port, remote port */
 static struct tcp_conn *tc_hash[TCP_HASH_TABLE_SIZE];
 
@@ -761,7 +768,7 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_conn *conn)
 {
 	int m = (conn->flags & IN_EPOLL) ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
 	union epoll_ref ref = { .r.proto = IPPROTO_TCP, .r.s = conn->sock,
-				.r.p.tcp.tcp.index = conn - tc,
+				.r.p.tcp.tcp.index = CONN_IDX(conn),
 				.r.p.tcp.tcp.v6 = CONN_V6(conn) };
 	struct epoll_event ev = { .data.u64 = ref.u64 };
 
@@ -784,7 +791,7 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_conn *conn)
 		union epoll_ref ref_t = { .r.proto = IPPROTO_TCP,
 					  .r.s = conn->sock,
 					  .r.p.tcp.tcp.timer = 1,
-					  .r.p.tcp.tcp.index = conn - tc };
+					  .r.p.tcp.tcp.index = CONN_IDX(conn) };
 		struct epoll_event ev_t = { .data.u64 = ref_t.u64,
 					    .events = EPOLLIN | EPOLLET };
 
@@ -813,7 +820,7 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_conn *conn)
 		union epoll_ref ref = { .r.proto = IPPROTO_TCP,
 					.r.s = conn->sock,
 					.r.p.tcp.tcp.timer = 1,
-					.r.p.tcp.tcp.index = conn - tc };
+					.r.p.tcp.tcp.index = CONN_IDX(conn) };
 		struct epoll_event ev = { .data.u64 = ref.u64,
 					  .events = EPOLLIN | EPOLLET };
 		int fd;
@@ -846,7 +853,7 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_conn *conn)
 		it.it_value.tv_sec = ACT_TIMEOUT;
 	}
 
-	debug("TCP: index %li, timer expires in %lu.%03lus", conn - tc,
+	debug("TCP: index %li, timer expires in %lu.%03lus", CONN_IDX(conn),
 	      it.it_value.tv_sec, it.it_value.tv_nsec / 1000 / 1000);
 
 	timerfd_settime(conn->timer, 0, &it, NULL);
@@ -867,7 +874,7 @@ static void conn_flag_do(const struct ctx *c, struct tcp_conn *conn,
 
 		conn->flags &= flag;
 		if (fls(~flag) >= 0) {
-			debug("TCP: index %li: %s dropped", conn - tc,
+			debug("TCP: index %li: %s dropped", CONN_IDX(conn),
 			      tcp_flag_str[fls(~flag)]);
 		}
 	} else {
@@ -876,7 +883,7 @@ static void conn_flag_do(const struct ctx *c, struct tcp_conn *conn,
 
 		conn->flags |= flag;
 		if (fls(flag) >= 0) {
-			debug("TCP: index %li: %s", conn - tc,
+			debug("TCP: index %li: %s", CONN_IDX(conn),
 			      tcp_flag_str[fls(flag)]);
 		}
 	}
@@ -926,12 +933,12 @@ static void conn_event_do(const struct ctx *c, struct tcp_conn *conn,
 		new += 5;
 
 	if (prev != new) {
-		debug("TCP: index %li, %s: %s -> %s", conn - tc,
+		debug("TCP: index %li, %s: %s -> %s", CONN_IDX(conn),
 		      num == -1 	       ? "CLOSED" : tcp_event_str[num],
 		      prev == -1	       ? "CLOSED" : tcp_state_str[prev],
 		      (new == -1 || num == -1) ? "CLOSED" : tcp_state_str[new]);
 	} else {
-		debug("TCP: index %li, %s", conn - tc,
+		debug("TCP: index %li, %s", CONN_IDX(conn),
 		      num == -1 	       ? "CLOSED" : tcp_event_str[num]);
 	}
 
@@ -1355,12 +1362,12 @@ static void tcp_hash_insert(const struct ctx *c, struct tcp_conn *conn,
 	int b;
 
 	b = tcp_hash(c, af, addr, conn->tap_port, conn->sock_port);
-	conn->next_index = tc_hash[b] ? tc_hash[b] - tc : -1;
+	conn->next_index = tc_hash[b] ? CONN_IDX(tc_hash[b]) : -1;
 	tc_hash[b] = conn;
 	conn->hash_bucket = b;
 
 	debug("TCP: hash table insert: index %li, sock %i, bucket: %i, next: "
-	      "%p", conn - tc, conn->sock, b, CONN_OR_NULL(conn->next_index));
+	      "%p", CONN_IDX(conn), conn->sock, b, conn_at_idx(conn->next_index));
 }
 
 /**
@@ -1373,19 +1380,19 @@ static void tcp_hash_remove(const struct tcp_conn *conn)
 	int b = conn->hash_bucket;
 
 	for (entry = tc_hash[b]; entry;
-	     prev = entry, entry = CONN_OR_NULL(entry->next_index)) {
+	     prev = entry, entry = conn_at_idx(entry->next_index)) {
 		if (entry == conn) {
 			if (prev)
 				prev->next_index = conn->next_index;
 			else
-				tc_hash[b] = CONN_OR_NULL(conn->next_index);
+				tc_hash[b] = conn_at_idx(conn->next_index);
 			break;
 		}
 	}
 
 	debug("TCP: hash table remove: index %li, sock %i, bucket: %i, new: %p",
-	      conn - tc, conn->sock, b,
-	      prev ? CONN_OR_NULL(prev->next_index) : tc_hash[b]);
+	      CONN_IDX(conn), conn->sock, b,
+	      prev ? conn_at_idx(prev->next_index) : tc_hash[b]);
 }
 
 /**
@@ -1399,10 +1406,10 @@ static void tcp_hash_update(struct tcp_conn *old, struct tcp_conn *new)
 	int b = old->hash_bucket;
 
 	for (entry = tc_hash[b]; entry;
-	     prev = entry, entry = CONN_OR_NULL(entry->next_index)) {
+	     prev = entry, entry = conn_at_idx(entry->next_index)) {
 		if (entry == old) {
 			if (prev)
-				prev->next_index = new - tc;
+				prev->next_index = CONN_IDX(new);
 			else
 				tc_hash[b] = new;
 			break;
@@ -1411,7 +1418,7 @@ static void tcp_hash_update(struct tcp_conn *old, struct tcp_conn *new)
 
 	debug("TCP: hash table update: old index %li, new index %li, sock %i, "
 	      "bucket: %i, old: %p, new: %p",
-	      old - tc, new - tc, new->sock, b, old, new);
+	      CONN_IDX(old), CONN_IDX(new), new->sock, b, old, new);
 }
 
 /**
@@ -1431,7 +1438,7 @@ static struct tcp_conn *tcp_hash_lookup(const struct ctx *c, int af,
 	int b = tcp_hash(c, af, addr, tap_port, sock_port);
 	struct tcp_conn *conn;
 
-	for (conn = tc_hash[b]; conn; conn = CONN_OR_NULL(conn->next_index)) {
+	for (conn = tc_hash[b]; conn; conn = conn_at_idx(conn->next_index)) {
 		if (tcp_hash_match(conn, af, addr, tap_port, sock_port))
 			return conn;
 	}
@@ -1448,9 +1455,9 @@ static void tcp_table_compact(struct ctx *c, struct tcp_conn *hole)
 {
 	struct tcp_conn *from, *to;
 
-	if ((hole - tc) == --c->tcp.conn_count) {
+	if (CONN_IDX(hole) == --c->tcp.conn_count) {
 		debug("TCP: hash table compaction: maximum index was %li (%p)",
-		      hole - tc, hole);
+		      CONN_IDX(hole), hole);
 		memset(hole, 0, sizeof(*hole));
 		return;
 	}
@@ -1465,7 +1472,7 @@ static void tcp_table_compact(struct ctx *c, struct tcp_conn *hole)
 
 	debug("TCP: hash table compaction: old index %li, new index %li, "
 	      "sock %i, from: %p, to: %p",
-	      from - tc, to - tc, from->sock, from, to);
+	      CONN_IDX(from), CONN_IDX(to), from->sock, from, to);
 
 	memset(from, 0, sizeof(*from));
 }
@@ -1488,7 +1495,7 @@ static void tcp_conn_destroy(struct ctx *c, struct tcp_conn *conn)
 static void tcp_rst_do(struct ctx *c, struct tcp_conn *conn);
 #define tcp_rst(c, conn)						\
 	do {								\
-		debug("TCP: index %li, reset at %s:%i", conn - tc,	\
+		debug("TCP: index %li, reset at %s:%i", CONN_IDX(conn), \
 		      __func__, __LINE__);				\
 		tcp_rst_do(c, conn);					\
 	} while (0)
@@ -2734,7 +2741,7 @@ int tcp_tap_handler(struct ctx *c, int af, const void *addr,
 		return 1;
 	}
 
-	trace("TCP: packet length %lu from tap for index %lu", len, conn - tc);
+	trace("TCP: packet length %lu from tap for index %lu", len, CONN_IDX(conn));
 
 	if (th->rst) {
 		conn_event(c, conn, CLOSED);
@@ -2942,7 +2949,7 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
  */
 static void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
 {
-	struct tcp_conn *conn = CONN_OR_NULL(ref.r.p.tcp.tcp.index);
+	struct tcp_conn *conn = conn_at_idx(ref.r.p.tcp.tcp.index);
 	struct itimerspec check_armed = { { 0 }, { 0 } };
 
 	if (!conn)
@@ -2961,17 +2968,17 @@ static void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
 		conn_flag(c, conn, ~ACK_TO_TAP_DUE);
 	} else if (conn->flags & ACK_FROM_TAP_DUE) {
 		if (!(conn->events & ESTABLISHED)) {
-			debug("TCP: index %li, handshake timeout", conn - tc);
+			debug("TCP: index %li, handshake timeout", CONN_IDX(conn));
 			tcp_rst(c, conn);
 		} else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) {
-			debug("TCP: index %li, FIN timeout", conn - tc);
+			debug("TCP: index %li, FIN timeout", CONN_IDX(conn));
 			tcp_rst(c, conn);
 		} else if (conn->retrans == TCP_MAX_RETRANS) {
 			debug("TCP: index %li, retransmissions count exceeded",
-			      conn - tc);
+			      CONN_IDX(conn));
 			tcp_rst(c, conn);
 		} else {
-			debug("TCP: index %li, ACK timeout, retry", conn - tc);
+			debug("TCP: index %li, ACK timeout, retry", CONN_IDX(conn));
 			conn->retrans++;
 			conn->seq_to_tap = conn->seq_ack_from_tap;
 			tcp_data_from_sock(c, conn);
@@ -2989,7 +2996,7 @@ static void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
 		 */
 		timerfd_settime(conn->timer, 0, &new, &old);
 		if (old.it_value.tv_sec == ACT_TIMEOUT) {
-			debug("TCP: index %li, activity timeout", conn - tc);
+			debug("TCP: index %li, activity timeout", CONN_IDX(conn));
 			tcp_rst(c, conn);
 		}
 	}
@@ -3022,7 +3029,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
 		return;
 	}
 
-	if (!(conn = CONN_OR_NULL(ref.r.p.tcp.tcp.index)))
+	if (!(conn = conn_at_idx(ref.r.p.tcp.tcp.index)))
 		return;
 
 	if (conn->events == CLOSED)
-- 
2.38.1


* [PATCH 06/32] tcp_splice: Helpers for converting from index to/from tcp_splice_conn
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (4 preceding siblings ...)
  2022-11-16  4:41 ` [PATCH 05/32] tcp: Better helpers for converting between connection pointer and index David Gibson
@ 2022-11-16  4:41 ` David Gibson
  2022-11-16  4:41 ` [PATCH 07/32] tcp: Move connection state structures into a shared header David Gibson
                   ` (25 subsequent siblings)
  31 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:41 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

As we already do for non-spliced connections, create a CONN_IDX() macro
for looking up the index of a spliced connection structure.  Also rename
the array of spliced connections from tc to tc_splice, so that it no
longer shares a name with the array of non-spliced connections (even
though the two are in different modules).  This will make subsequent
changes a bit safer.
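
As a rough illustration of the pointer/index round trip these macros
provide, here is a minimal standalone sketch.  The struct, the table and
all EXAMPLE_ / example_ names are hypothetical stand-ins, not passt code;
only the shape of the two macros follows the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only: the real
 * struct tcp_splice_conn and TCP_SPLICE_MAX_CONNS live in passt.
 */
#define EXAMPLE_MAX_CONNS	8

struct example_conn {
	int a, b;	/* placeholder fields */
};

static struct example_conn example_table[EXAMPLE_MAX_CONNS];

/* Same shape as the CONN() and CONN_IDX() macros in the patch */
#define EXAMPLE_CONN(index)	(example_table + (index))
#define EXAMPLE_CONN_IDX(conn)	((conn) - example_table)

/* Return the index of the entry following @conn, or -1 at the end */
static ptrdiff_t example_next_idx(const struct example_conn *conn)
{
	ptrdiff_t idx;

	/* The subtraction is only meaningful for pointers into the table */
	assert(conn >= example_table &&
	       conn < example_table + EXAMPLE_MAX_CONNS);

	idx = EXAMPLE_CONN_IDX(conn);
	return (idx + 1 < EXAMPLE_MAX_CONNS) ? idx + 1 : -1;
}
```

Keeping the subtraction behind one macro means that a later change to
the backing array only has to touch the macro, not every debug() call.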

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp_splice.c | 43 +++++++++++++++++++++++++------------------
 1 file changed, 25 insertions(+), 18 deletions(-)

diff --git a/tcp_splice.c b/tcp_splice.c
index 3c5c111..4cc4ad2 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -13,10 +13,11 @@
  * DOC: Theory of Operation
  *
  *
- * For local traffic directed to TCP ports configured for direct mapping between
- * namespaces, packets are directly translated between L4 sockets using a pair
- * of splice() syscalls. These connections are tracked in the @tc array of
- * struct tcp_splice_conn, using these events:
+ * For local traffic directed to TCP ports configured for direct
+ * mapping between namespaces, packets are directly translated between
+ * L4 sockets using a pair of splice() syscalls. These connections are
+ * tracked in the @tc_splice array of struct tcp_splice_conn, using
+ * these events:
  *
  * - SPLICE_CONNECT:		connection accepted, connecting to target
  * - SPLICE_ESTABLISHED:	connection to target established
@@ -113,10 +114,11 @@ struct tcp_splice_conn {
 #define CONN_V6(x)			(x->flags & SOCK_V6)
 #define CONN_V4(x)			(!CONN_V6(x))
 #define CONN_HAS(conn, set)		((conn->events & (set)) == (set))
-#define CONN(index)			(tc + (index))
+#define CONN(index)			(tc_splice + (index))
+#define CONN_IDX(conn)			((conn) - tc_splice)
 
 /* Spliced connections */
-static struct tcp_splice_conn tc[TCP_SPLICE_MAX_CONNS];
+static struct tcp_splice_conn tc_splice[TCP_SPLICE_MAX_CONNS];
 
 /* Display strings for connection events */
 static const char *tcp_splice_event_str[] __attribute((__unused__)) = {
@@ -173,7 +175,7 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn,
 
 		conn->flags &= flag;
 		if (fls(~flag) >= 0) {
-			debug("TCP (spliced): index %li: %s dropped", conn - tc,
+			debug("TCP (spliced): index %li: %s dropped", CONN_IDX(conn),
 			      tcp_splice_flag_str[fls(~flag)]);
 		}
 	} else {
@@ -182,7 +184,7 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn,
 
 		conn->flags |= flag;
 		if (fls(flag) >= 0) {
-			debug("TCP (spliced): index %li: %s", conn - tc,
+			debug("TCP (spliced): index %li: %s", CONN_IDX(conn),
 			      tcp_splice_flag_str[fls(flag)]);
 		}
 	}
@@ -211,11 +213,11 @@ static int tcp_splice_epoll_ctl(const struct ctx *c,
 	int m = (conn->flags & IN_EPOLL) ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
 	union epoll_ref ref_a = { .r.proto = IPPROTO_TCP, .r.s = conn->a,
 				  .r.p.tcp.tcp.splice = 1,
-				  .r.p.tcp.tcp.index = conn - tc,
+				  .r.p.tcp.tcp.index = CONN_IDX(conn),
 				  .r.p.tcp.tcp.v6 = CONN_V6(conn) };
 	union epoll_ref ref_b = { .r.proto = IPPROTO_TCP, .r.s = conn->b,
 				  .r.p.tcp.tcp.splice = 1,
-				  .r.p.tcp.tcp.index = conn - tc,
+				  .r.p.tcp.tcp.index = CONN_IDX(conn),
 				  .r.p.tcp.tcp.v6 = CONN_V6(conn) };
 	struct epoll_event ev_a = { .data.u64 = ref_a.u64 };
 	struct epoll_event ev_b = { .data.u64 = ref_b.u64 };
@@ -257,7 +259,7 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn,
 
 		conn->events &= event;
 		if (fls(~event) >= 0) {
-			debug("TCP (spliced): index %li, ~%s", conn - tc,
+			debug("TCP (spliced): index %li, ~%s", CONN_IDX(conn),
 			      tcp_splice_event_str[fls(~event)]);
 		}
 	} else {
@@ -266,7 +268,7 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn,
 
 		conn->events |= event;
 		if (fls(event) >= 0) {
-			debug("TCP (spliced): index %li, %s", conn - tc,
+			debug("TCP (spliced): index %li, %s", CONN_IDX(conn),
 			      tcp_splice_event_str[fls(event)]);
 		}
 	}
@@ -292,8 +294,8 @@ static void tcp_table_splice_compact(struct ctx *c,
 {
 	struct tcp_splice_conn *move;
 
-	if ((hole - tc) == --c->tcp.splice_conn_count) {
-		debug("TCP (spliced): index %li (max) removed", hole - tc);
+	if (CONN_IDX(hole) == --c->tcp.splice_conn_count) {
+		debug("TCP (spliced): index %li (max) removed", CONN_IDX(hole));
 		return;
 	}
 
@@ -307,7 +309,8 @@ static void tcp_table_splice_compact(struct ctx *c,
 	move->pipe_b_a[0] = move->pipe_b_a[1] = -1;
 	move->flags = move->events = 0;
 
-	debug("TCP (spliced): index %li moved to %li", move - tc, hole - tc);
+	debug("TCP (spliced): index %li moved to %li",
+	      CONN_IDX(move), CONN_IDX(hole));
 	tcp_splice_epoll_ctl(c, hole);
 	if (tcp_splice_epoll_ctl(c, hole))
 		conn_flag(c, hole, CLOSING);
@@ -345,7 +348,7 @@ static void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn)
 
 	conn->events = CLOSED;
 	conn->flags = 0;
-	debug("TCP (spliced): index %li, CLOSED", conn - tc);
+	debug("TCP (spliced): index %li, CLOSED", CONN_IDX(conn));
 
 	tcp_table_splice_compact(c, conn);
 }
@@ -872,7 +875,9 @@ void tcp_splice_timer(struct ctx *c)
 {
 	struct tcp_splice_conn *conn;
 
-	for (conn = CONN(c->tcp.splice_conn_count - 1); conn >= tc; conn--) {
+	for (conn = CONN(c->tcp.splice_conn_count - 1);
+	     conn >= tc_splice;
+	     conn--) {
 		if (conn->flags & CLOSING) {
 			tcp_splice_destroy(c, conn);
 			return;
@@ -918,7 +923,9 @@ void tcp_splice_defer_handler(struct ctx *c)
 	if (c->tcp.splice_conn_count < MIN(max_files / 6, max_conns))
 		return;
 
-	for (conn = CONN(c->tcp.splice_conn_count - 1); conn >= tc; conn--) {
+	for (conn = CONN(c->tcp.splice_conn_count - 1);
+	     conn >= tc_splice;
+	     conn--) {
 		if (conn->flags & CLOSING)
 			tcp_splice_destroy(c, conn);
 	}
-- 
2.38.1


* [PATCH 07/32] tcp: Move connection state structures into a shared header
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (5 preceding siblings ...)
  2022-11-16  4:41 ` [PATCH 06/32] tcp_splice: Helpers for converting from index to/from tcp_splice_conn David Gibson
@ 2022-11-16  4:41 ` David Gibson
  2022-11-16  4:41 ` [PATCH 08/32] tcp: Add connection union type David Gibson
                   ` (24 subsequent siblings)
  31 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:41 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Currently spliced and non-spliced connections use completely independent
tracking structures.  We want to unify these, so as a preliminary step move
the definitions for both variants into a new tcp_conn.h header, shared by
tcp.c and tcp_splice.c.

This requires renaming some #defines that have the same name but different
meanings in the two cases.  In the process we also fix a few places where
the comments and the code disagree on event bit names.
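
To illustrate why the renames are needed, here is a minimal sketch of two
connection types sharing one header.  All ex_ / EX_ names and the bit
positions are hypothetical and chosen only for this example; they are not
the names used in tcp_conn.h.

```c
#include <assert.h>
#include <stdint.h>

#define EX_BIT(n)		(1U << (n))

/* Hypothetical names and bit positions, for illustration only.  With both
 * structures visible in one header, a flag macro that had the same name
 * but a different value in each module needs a distinct name per type.
 */
struct ex_tap_conn {
	uint8_t flags;
#define EX_TAP_IN_EPOLL		EX_BIT(3)
};

struct ex_splice_conn {
	uint8_t flags;
#define EX_SPLICE_IN_EPOLL	EX_BIT(1)
};

/* Return 1 if the (hypothetical) tap connection is registered in epoll */
static int ex_tap_in_epoll(const struct ex_tap_conn *conn)
{
	assert(conn);
	return !!(conn->flags & EX_TAP_IN_EPOLL);
}

/* Return 1 if the (hypothetical) spliced connection is registered */
static int ex_splice_in_epoll(const struct ex_splice_conn *conn)
{
	assert(conn);
	return !!(conn->flags & EX_SPLICE_IN_EPOLL);
}
```

With distinct names, testing a spliced connection against the tap flag by
mistake stands out in review rather than silently checking the wrong bit.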

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 Makefile     |   3 +-
 tcp.c        | 206 +++++++++++++--------------------------------------
 tcp_conn.h   | 168 +++++++++++++++++++++++++++++++++++++++++
 tcp_splice.c |  93 +++++++----------------
 4 files changed, 245 insertions(+), 225 deletions(-)
 create mode 100644 tcp_conn.h

diff --git a/Makefile b/Makefile
index 8bcbbc0..9046b0b 100644
--- a/Makefile
+++ b/Makefile
@@ -45,7 +45,8 @@ MANPAGES = passt.1 pasta.1 qrap.1
 
 PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h icmp.h \
 	isolation.h lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h \
-	pcap.h port_fwd.h siphash.h tap.h tcp.h tcp_splice.h udp.h util.h
+	pcap.h port_fwd.h siphash.h tap.h tcp.h tcp_conn.h tcp_splice.h udp.h \
+	util.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 # On gcc 11 and 12, with -O2 and -flto, tcp_hash() and siphash_20b(), if
diff --git a/tcp.c b/tcp.c
index 4e56a6c..1137f45 100644
--- a/tcp.c
+++ b/tcp.c
@@ -98,7 +98,7 @@
  * Connection tracking and storage
  * -------------------------------
  *
- * Connections are tracked by the @tc array of struct tcp_conn, containing
+ * Connections are tracked by the @tc array of struct tcp_tap_conn, containing
  * addresses, ports, TCP states and parameters. This is statically allocated and
  * indexed by an arbitrary connection number. The array is compacted whenever a
  * connection is closed, by remapping the highest connection index in use to the
@@ -301,6 +301,8 @@
 #include "tcp_splice.h"
 #include "log.h"
 
+#include "tcp_conn.h"
+
 #define TCP_FRAMES_MEM			128
 #define TCP_FRAMES							\
 	(c->mode == MODE_PASST ? TCP_FRAMES_MEM : 1)
@@ -308,7 +310,6 @@
 #define TCP_FILE_PRESSURE		30	/* % of c->nofile */
 #define TCP_CONN_PRESSURE		30	/* % of c->tcp.conn_count */
 
-#define TCP_HASH_BUCKET_BITS		(TCP_CONN_INDEX_BITS + 1)
 #define TCP_HASH_TABLE_LOAD		70		/* % */
 #define TCP_HASH_TABLE_SIZE		(TCP_MAX_CONNS * 100 /		\
 					 TCP_HASH_TABLE_LOAD)
@@ -402,117 +403,8 @@ struct tcp6_l2_head {	/* For MSS6 macro: keep in sync with tcp6_l2_buf_t */
 #define OPT_SACK	5
 #define OPT_TS		8
 
-/**
- * struct tcp_conn - Descriptor for a TCP connection (not spliced)
- * @next_index:		Connection index of next item in hash chain, -1 for none
- * @tap_mss:		MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
- * @sock:		Socket descriptor number
- * @events:		Connection events, implying connection states
- * @timer:		timerfd descriptor for timeout events
- * @flags:		Connection flags representing internal attributes
- * @hash_bucket:	Bucket index in connection lookup hash table
- * @retrans:		Number of retransmissions occurred due to ACK_TIMEOUT
- * @ws_from_tap:	Window scaling factor advertised from tap/guest
- * @ws_to_tap:		Window scaling factor advertised to tap/guest
- * @sndbuf:		Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS
- * @seq_dup_ack_approx:	Last duplicate ACK number sent to tap
- * @a.a6:		IPv6 remote address, can be IPv4-mapped
- * @a.a4.zero:		Zero prefix for IPv4-mapped, see RFC 6890, Table 20
- * @a.a4.one:		Ones prefix for IPv4-mapped
- * @a.a4.a:		IPv4 address
- * @tap_port:		Guest-facing tap port
- * @sock_port:		Remote, socket-facing port
- * @wnd_from_tap:	Last window size from tap, unscaled (as received)
- * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
- * @seq_to_tap:		Next sequence for packets to tap
- * @seq_ack_from_tap:	Last ACK number received from tap
- * @seq_from_tap:	Next sequence for packets from tap (not actually sent)
- * @seq_ack_to_tap:	Last ACK number sent to tap
- * @seq_init_from_tap:	Initial sequence number from tap
- */
-struct tcp_conn {
-	int	 	next_index	:TCP_CONN_INDEX_BITS + 2;
-
-#define TCP_RETRANS_BITS		3
-	unsigned int	retrans		:TCP_RETRANS_BITS;
-#define TCP_MAX_RETRANS			((1U << TCP_RETRANS_BITS) - 1)
-
-#define TCP_WS_BITS			4	/* RFC 7323 */
-#define TCP_WS_MAX			14
-	unsigned int	ws_from_tap	:TCP_WS_BITS;
-	unsigned int	ws_to_tap	:TCP_WS_BITS;
-
-
-	int		sock		:SOCKET_REF_BITS;
-
-	uint8_t		events;
-#define CLOSED			0
-#define SOCK_ACCEPTED		BIT(0)	/* implies SYN sent to tap */
-#define TAP_SYN_RCVD		BIT(1)	/* implies socket connecting */
-#define  TAP_SYN_ACK_SENT	BIT( 3)	/* implies socket connected */
-#define ESTABLISHED		BIT(2)
-#define  SOCK_FIN_RCVD		BIT( 3)
-#define  SOCK_FIN_SENT		BIT( 4)
-#define  TAP_FIN_RCVD		BIT( 5)
-#define  TAP_FIN_SENT		BIT( 6)
-#define  TAP_FIN_ACKED		BIT( 7)
-
-#define	CONN_STATE_BITS		/* Setting these clears other flags */	\
-	(SOCK_ACCEPTED | TAP_SYN_RCVD | ESTABLISHED)
-
-
-	int		timer		:SOCKET_REF_BITS;
-
-	uint8_t		flags;
-#define STALLED			BIT(0)
-#define LOCAL			BIT(1)
-#define WND_CLAMPED		BIT(2)
-#define IN_EPOLL		BIT(3)
-#define ACTIVE_CLOSE		BIT(4)
-#define ACK_TO_TAP_DUE		BIT(5)
-#define ACK_FROM_TAP_DUE	BIT(6)
-
-
-	unsigned int	hash_bucket	:TCP_HASH_BUCKET_BITS;
-
-#define TCP_MSS_BITS			14
-	unsigned int	tap_mss		:TCP_MSS_BITS;
-#define MSS_SET(conn, mss)	(conn->tap_mss = (mss >> (16 - TCP_MSS_BITS)))
-#define MSS_GET(conn)		(conn->tap_mss << (16 - TCP_MSS_BITS))
-
-
-#define SNDBUF_BITS		24
-	unsigned int	sndbuf		:SNDBUF_BITS;
-#define SNDBUF_SET(conn, bytes)	(conn->sndbuf = ((bytes) >> (32 - SNDBUF_BITS)))
-#define SNDBUF_GET(conn)	(conn->sndbuf << (32 - SNDBUF_BITS))
-
-	uint8_t		seq_dup_ack_approx;
-
-
-	union {
-		struct in6_addr a6;
-		struct {
-			uint8_t zero[10];
-			uint8_t one[2];
-			struct in_addr a;
-		} a4;
-	} a;
 #define CONN_V4(conn)		IN6_IS_ADDR_V4MAPPED(&conn->a.a6)
 #define CONN_V6(conn)		(!CONN_V4(conn))
-
-	in_port_t	tap_port;
-	in_port_t	sock_port;
-
-	uint16_t	wnd_from_tap;
-	uint16_t	wnd_to_tap;
-
-	uint32_t	seq_to_tap;
-	uint32_t	seq_ack_from_tap;
-	uint32_t	seq_from_tap;
-	uint32_t	seq_ack_to_tap;
-	uint32_t	seq_init_from_tap;
-};
-
 #define CONN_IS_CLOSING(conn)						\
 	((conn->events & ESTABLISHED) &&				\
 	 (conn->events & (SOCK_FIN_RCVD | TAP_FIN_RCVD)))
@@ -695,7 +587,7 @@ static unsigned int tcp6_l2_flags_buf_used;
 static size_t tcp6_l2_flags_buf_bytes;
 
 /* TCP connections */
-static struct tcp_conn tc[TCP_MAX_CONNS];
+static struct tcp_tap_conn tc[TCP_MAX_CONNS];
 
 #define CONN(index)		(tc + (index))
 #define CONN_IDX(conn)		((conn) - tc)
@@ -705,7 +597,7 @@ static struct tcp_conn tc[TCP_MAX_CONNS];
  *
  * Return:	Pointer to connection, or NULL if @index is out of bounds
  */
-static inline struct tcp_conn *conn_at_idx(int index)
+static inline struct tcp_tap_conn *conn_at_idx(int index)
 {
 	if ((index < 0) || (index >= TCP_MAX_CONNS))
 		return NULL;
@@ -713,7 +605,7 @@ static inline struct tcp_conn *conn_at_idx(int index)
 }
 
 /* Table for lookup from remote address, local port, remote port */
-static struct tcp_conn *tc_hash[TCP_HASH_TABLE_SIZE];
+static struct tcp_tap_conn *tc_hash[TCP_HASH_TABLE_SIZE];
 
 /* Pools for pre-opened sockets */
 int init_sock_pool4		[TCP_SOCK_POOL_SIZE];
@@ -749,7 +641,7 @@ static uint32_t tcp_conn_epoll_events(uint8_t events, uint8_t conn_flags)
 	return EPOLLRDHUP;
 }
 
-static void conn_flag_do(const struct ctx *c, struct tcp_conn *conn,
+static void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
 			 unsigned long flag);
 #define conn_flag(c, conn, flag)					\
 	do {								\
@@ -764,7 +656,7 @@ static void conn_flag_do(const struct ctx *c, struct tcp_conn *conn,
  *
  * Return: 0 on success, negative error code on failure (not on deletion)
  */
-static int tcp_epoll_ctl(const struct ctx *c, struct tcp_conn *conn)
+static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 {
 	int m = (conn->flags & IN_EPOLL) ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
 	union epoll_ref ref = { .r.proto = IPPROTO_TCP, .r.s = conn->sock,
@@ -809,7 +701,7 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_conn *conn)
  *
  * #syscalls timerfd_create timerfd_settime
  */
-static void tcp_timer_ctl(const struct ctx *c, struct tcp_conn *conn)
+static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 {
 	struct itimerspec it = { { 0 }, { 0 } };
 
@@ -865,7 +757,7 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_conn *conn)
  * @conn:	Connection pointer
  * @flag:	Flag to set, or ~flag to unset
  */
-static void conn_flag_do(const struct ctx *c, struct tcp_conn *conn,
+static void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
 			 unsigned long flag)
 {
 	if (flag & (flag - 1)) {
@@ -903,7 +795,7 @@ static void conn_flag_do(const struct ctx *c, struct tcp_conn *conn,
  * @conn:	Connection pointer
  * @event:	Connection event
  */
-static void conn_event_do(const struct ctx *c, struct tcp_conn *conn,
+static void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
 			  unsigned long event)
 {
 	int prev, new, num = fls(event);
@@ -963,7 +855,7 @@ static void conn_event_do(const struct ctx *c, struct tcp_conn *conn,
  *
  * Return: 1 if destination is in low RTT table, 0 otherwise
  */
-static int tcp_rtt_dst_low(const struct tcp_conn *conn)
+static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn)
 {
 	int i;
 
@@ -979,7 +871,7 @@ static int tcp_rtt_dst_low(const struct tcp_conn *conn)
  * @conn:	Connection pointer
  * @tinfo:	Pointer to struct tcp_info for socket
  */
-static void tcp_rtt_dst_check(const struct tcp_conn *conn,
+static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
 			      const struct tcp_info *tinfo)
 {
 #ifdef HAS_MIN_RTT
@@ -1016,7 +908,7 @@ static void tcp_rtt_dst_check(const struct tcp_conn *conn,
  * tcp_get_sndbuf() - Get, scale SO_SNDBUF between thresholds (1 to 0.5 usage)
  * @conn:	Connection pointer
  */
-static void tcp_get_sndbuf(struct tcp_conn *conn)
+static void tcp_get_sndbuf(struct tcp_tap_conn *conn)
 {
 	int s = conn->sock, sndbuf;
 	socklen_t sl;
@@ -1290,7 +1182,8 @@ static int tcp_opt_get(const char *opts, size_t len, uint8_t type_find,
  *
  * Return: 1 on match, 0 otherwise
  */
-static int tcp_hash_match(const struct tcp_conn *conn, int af, const void *addr,
+static int tcp_hash_match(const struct tcp_tap_conn *conn,
+			  int af, const void *addr,
 			  in_port_t tap_port, in_port_t sock_port)
 {
 	if (af == AF_INET && CONN_V4(conn)			&&
@@ -1356,7 +1249,7 @@ static unsigned int tcp_hash(const struct ctx *c, int af, const void *addr,
  * @af:		Address family, AF_INET or AF_INET6
  * @addr:	Remote address, pointer to in_addr or in6_addr
  */
-static void tcp_hash_insert(const struct ctx *c, struct tcp_conn *conn,
+static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn,
 			    int af, const void *addr)
 {
 	int b;
@@ -1374,9 +1267,9 @@ static void tcp_hash_insert(const struct ctx *c, struct tcp_conn *conn,
  * tcp_hash_remove() - Drop connection from hash table, chain unlink
  * @conn:	Connection pointer
  */
-static void tcp_hash_remove(const struct tcp_conn *conn)
+static void tcp_hash_remove(const struct tcp_tap_conn *conn)
 {
-	struct tcp_conn *entry, *prev = NULL;
+	struct tcp_tap_conn *entry, *prev = NULL;
 	int b = conn->hash_bucket;
 
 	for (entry = tc_hash[b]; entry;
@@ -1400,9 +1293,9 @@ static void tcp_hash_remove(const struct tcp_conn *conn)
  * @old:	Old connection pointer
  * @new:	New connection pointer
  */
-static void tcp_hash_update(struct tcp_conn *old, struct tcp_conn *new)
+static void tcp_hash_update(struct tcp_tap_conn *old, struct tcp_tap_conn *new)
 {
-	struct tcp_conn *entry, *prev = NULL;
+	struct tcp_tap_conn *entry, *prev = NULL;
 	int b = old->hash_bucket;
 
 	for (entry = tc_hash[b]; entry;
@@ -1431,12 +1324,13 @@ static void tcp_hash_update(struct tcp_conn *old, struct tcp_conn *new)
  *
  * Return: connection pointer, if found, -ENOENT otherwise
  */
-static struct tcp_conn *tcp_hash_lookup(const struct ctx *c, int af,
-					const void *addr,
-					in_port_t tap_port, in_port_t sock_port)
+static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c,
+					    int af, const void *addr,
+					    in_port_t tap_port,
+					    in_port_t sock_port)
 {
 	int b = tcp_hash(c, af, addr, tap_port, sock_port);
-	struct tcp_conn *conn;
+	struct tcp_tap_conn *conn;
 
 	for (conn = tc_hash[b]; conn; conn = conn_at_idx(conn->next_index)) {
 		if (tcp_hash_match(conn, af, addr, tap_port, sock_port))
@@ -1451,9 +1345,9 @@ static struct tcp_conn *tcp_hash_lookup(const struct ctx *c, int af,
  * @c:		Execution context
  * @hole:	Pointer to recently closed connection
  */
-static void tcp_table_compact(struct ctx *c, struct tcp_conn *hole)
+static void tcp_table_compact(struct ctx *c, struct tcp_tap_conn *hole)
 {
-	struct tcp_conn *from, *to;
+	struct tcp_tap_conn *from, *to;
 
 	if (CONN_IDX(hole) == --c->tcp.conn_count) {
 		debug("TCP: hash table compaction: maximum index was %li (%p)",
@@ -1482,7 +1376,7 @@ static void tcp_table_compact(struct ctx *c, struct tcp_conn *hole)
  * @c:		Execution context
  * @conn:	Connection pointer
  */
-static void tcp_conn_destroy(struct ctx *c, struct tcp_conn *conn)
+static void tcp_conn_destroy(struct ctx *c, struct tcp_tap_conn *conn)
 {
 	close(conn->sock);
 	if (conn->timer != -1)
@@ -1492,7 +1386,7 @@ static void tcp_conn_destroy(struct ctx *c, struct tcp_conn *conn)
 	tcp_table_compact(c, conn);
 }
 
-static void tcp_rst_do(struct ctx *c, struct tcp_conn *conn);
+static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
 #define tcp_rst(c, conn)						\
 	do {								\
 		debug("TCP: index %li, reset at %s:%i", CONN_IDX(conn), \
@@ -1627,7 +1521,7 @@ void tcp_defer_handler(struct ctx *c)
 {
 	int max_conns = c->tcp.conn_count / 100 * TCP_CONN_PRESSURE;
 	int max_files = c->nofile / 100 * TCP_FILE_PRESSURE;
-	struct tcp_conn *conn;
+	struct tcp_tap_conn *conn;
 
 	tcp_l2_flags_buf_flush(c);
 	tcp_l2_data_buf_flush(c);
@@ -1656,7 +1550,7 @@ void tcp_defer_handler(struct ctx *c)
  * Return: 802.3 length, host order
  */
 static size_t tcp_l2_buf_fill_headers(const struct ctx *c,
-				      const struct tcp_conn *conn,
+				      const struct tcp_tap_conn *conn,
 				      void *p, size_t plen,
 				      const uint16_t *check, uint32_t seq)
 {
@@ -1738,7 +1632,7 @@ do {									\
  *
  * Return: 1 if sequence or window were updated, 0 otherwise
  */
-static int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_conn *conn,
+static int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 				 int force_seq, struct tcp_info *tinfo)
 {
 	uint32_t prev_wnd_to_tap = conn->wnd_to_tap << conn->ws_to_tap;
@@ -1824,7 +1718,7 @@ out:
  *
  * Return: negative error code on connection reset, 0 otherwise
  */
-static int tcp_send_flag(struct ctx *c, struct tcp_conn *conn, int flags)
+static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
 {
 	uint32_t prev_ack_to_tap = conn->seq_ack_to_tap;
 	uint32_t prev_wnd_to_tap = conn->wnd_to_tap;
@@ -1971,7 +1865,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_conn *conn, int flags)
  * @c:		Execution context
  * @conn:	Connection pointer
  */
-static void tcp_rst_do(struct ctx *c, struct tcp_conn *conn)
+static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn)
 {
 	if (conn->events == CLOSED)
 		return;
@@ -1986,7 +1880,7 @@ static void tcp_rst_do(struct ctx *c, struct tcp_conn *conn)
  * @opts:	Pointer to start of TCP options
  * @optlen:	Bytes in options: caller MUST ensure available length
  */
-static void tcp_get_tap_ws(struct tcp_conn *conn,
+static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
 			   const char *opts, size_t optlen)
 {
 	int ws = tcp_opt_get(opts, optlen, OPT_WS, NULL, NULL);
@@ -2003,7 +1897,7 @@ static void tcp_get_tap_ws(struct tcp_conn *conn,
  * @conn:	Connection pointer
  * @window:	Window value, host order, unscaled
  */
-static void tcp_clamp_window(const struct ctx *c, struct tcp_conn *conn,
+static void tcp_clamp_window(const struct ctx *c, struct tcp_tap_conn *conn,
 			     unsigned wnd)
 {
 	uint32_t prev_scaled = conn->wnd_from_tap << conn->ws_from_tap;
@@ -2125,7 +2019,7 @@ static int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
  * Return: clamped MSS value
  */
 static uint16_t tcp_conn_tap_mss(const struct ctx *c,
-				 const struct tcp_conn *conn,
+				 const struct tcp_tap_conn *conn,
 				 const char *opts, size_t optlen)
 {
 	unsigned int mss;
@@ -2172,7 +2066,7 @@ static void tcp_conn_from_tap(struct ctx *c, int af, const void *addr,
 		.sin6_addr = *(struct in6_addr *)addr,
 	};
 	const struct sockaddr *sa;
-	struct tcp_conn *conn;
+	struct tcp_tap_conn *conn;
 	socklen_t sl;
 	int s, mss;
 
@@ -2280,7 +2174,7 @@ static void tcp_conn_from_tap(struct ctx *c, int af, const void *addr,
  *
  * Return: 0 on success, negative error code from recv() on failure
  */
-static int tcp_sock_consume(struct tcp_conn *conn, uint32_t ack_seq)
+static int tcp_sock_consume(struct tcp_tap_conn *conn, uint32_t ack_seq)
 {
 	/* Simply ignore out-of-order ACKs: we already consumed the data we
 	 * needed from the buffer, and we won't rewind back to a lower ACK
@@ -2307,7 +2201,7 @@ static int tcp_sock_consume(struct tcp_conn *conn, uint32_t ack_seq)
  * @seq:	Sequence number to be sent
  * @now:	Current timestamp
  */
-static void tcp_data_to_tap(struct ctx *c, struct tcp_conn *conn,
+static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
 			    ssize_t plen, int no_csum, uint32_t seq)
 {
 	struct iovec *iov;
@@ -2344,7 +2238,7 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_conn *conn,
  *
  * #syscalls recvmsg
  */
-static int tcp_data_from_sock(struct ctx *c, struct tcp_conn *conn)
+static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
 {
 	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
 	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
@@ -2475,7 +2369,7 @@ zero_len:
  *
  * #syscalls sendmsg
  */
-static void tcp_data_from_tap(struct ctx *c, struct tcp_conn *conn,
+static void tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
 			      const struct pool *p)
 {
 	int i, iov_i, ack = 0, fin = 0, retr = 0, keep = -1, partial_send = 0;
@@ -2675,7 +2569,7 @@ out:
  * @opts:	Pointer to start of options
  * @optlen:	Bytes in options: caller MUST ensure available length
  */
-static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_conn *conn,
+static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
 				      const struct tcphdr *th,
 				      const char *opts, size_t optlen)
 {
@@ -2714,7 +2608,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_conn *conn,
 int tcp_tap_handler(struct ctx *c, int af, const void *addr,
 		    const struct pool *p, const struct timespec *now)
 {
-	struct tcp_conn *conn;
+	struct tcp_tap_conn *conn;
 	size_t optlen, len;
 	struct tcphdr *th;
 	int ack_due = 0;
@@ -2829,7 +2723,7 @@ int tcp_tap_handler(struct ctx *c, int af, const void *addr,
  * @c:		Execution context
  * @conn:	Connection pointer
  */
-static void tcp_connect_finish(struct ctx *c, struct tcp_conn *conn)
+static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
 {
 	socklen_t sl;
 	int so;
@@ -2857,7 +2751,7 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
 			       const struct timespec *now)
 {
 	struct sockaddr_storage sa;
-	struct tcp_conn *conn;
+	struct tcp_tap_conn *conn;
 	socklen_t sl;
 	int s;
 
@@ -2949,7 +2843,7 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
  */
 static void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
 {
-	struct tcp_conn *conn = conn_at_idx(ref.r.p.tcp.tcp.index);
+	struct tcp_tap_conn *conn = conn_at_idx(ref.r.p.tcp.tcp.index);
 	struct itimerspec check_armed = { { 0 }, { 0 } };
 
 	if (!conn)
@@ -3012,7 +2906,7 @@ static void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
 void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
 		      const struct timespec *now)
 {
-	struct tcp_conn *conn;
+	struct tcp_tap_conn *conn;
 
 	if (ref.r.p.tcp.tcp.timer) {
 		tcp_timer_handler(c, ref);
@@ -3510,7 +3404,7 @@ static int tcp_port_rebind(void *arg)
 void tcp_timer(struct ctx *c, const struct timespec *ts)
 {
 	struct tcp_sock_refill_arg refill_arg = { c, 0 };
-	struct tcp_conn *conn;
+	struct tcp_tap_conn *conn;
 
 	(void)ts;
 
diff --git a/tcp_conn.h b/tcp_conn.h
new file mode 100644
index 0000000..db4c2d9
--- /dev/null
+++ b/tcp_conn.h
@@ -0,0 +1,168 @@
+/* SPDX-License-Identifier: AGPL-3.0-or-later
+ * Copyright Red Hat
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ * Author: David Gibson <david@gibson.dropbear.id.au>
+ *
+ * TCP connection tracking data structures, used by tcp.c and
+ * tcp_splice.c.  Shouldn't be included in non-TCP code.
+ */
+#ifndef TCP_CONN_H
+#define TCP_CONN_H
+
+#define TCP_HASH_BUCKET_BITS		(TCP_CONN_INDEX_BITS + 1)
+
+/**
+ * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced)
+ * @next_index:		Connection index of next item in hash chain, -1 for none
+ * @tap_mss:		MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
+ * @sock:		Socket descriptor number
+ * @events:		Connection events, implying connection states
+ * @timer:		timerfd descriptor for timeout events
+ * @flags:		Connection flags representing internal attributes
+ * @hash_bucket:	Bucket index in connection lookup hash table
+ * @retrans:		Number of retransmissions occurred due to ACK_TIMEOUT
+ * @ws_from_tap:	Window scaling factor advertised from tap/guest
+ * @ws_to_tap:		Window scaling factor advertised to tap/guest
+ * @sndbuf:		Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS
+ * @seq_dup_ack_approx:	Last duplicate ACK number sent to tap
+ * @a.a6:		IPv6 remote address, can be IPv4-mapped
+ * @a.a4.zero:		Zero prefix for IPv4-mapped, see RFC 6890, Table 20
+ * @a.a4.one:		Ones prefix for IPv4-mapped
+ * @a.a4.a:		IPv4 address
+ * @tap_port:		Guest-facing tap port
+ * @sock_port:		Remote, socket-facing port
+ * @wnd_from_tap:	Last window size from tap, unscaled (as received)
+ * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
+ * @seq_to_tap:		Next sequence for packets to tap
+ * @seq_ack_from_tap:	Last ACK number received from tap
+ * @seq_from_tap:	Next sequence for packets from tap (not actually sent)
+ * @seq_ack_to_tap:	Last ACK number sent to tap
+ * @seq_init_from_tap:	Initial sequence number from tap
+ */
+struct tcp_tap_conn {
+	int	 	next_index	:TCP_CONN_INDEX_BITS + 2;
+
+#define TCP_RETRANS_BITS		3
+	unsigned int	retrans		:TCP_RETRANS_BITS;
+#define TCP_MAX_RETRANS			((1U << TCP_RETRANS_BITS) - 1)
+
+#define TCP_WS_BITS			4	/* RFC 7323 */
+#define TCP_WS_MAX			14
+	unsigned int	ws_from_tap	:TCP_WS_BITS;
+	unsigned int	ws_to_tap	:TCP_WS_BITS;
+
+
+	int		sock		:SOCKET_REF_BITS;
+
+	uint8_t		events;
+#define CLOSED			0
+#define SOCK_ACCEPTED		BIT(0)	/* implies SYN sent to tap */
+#define TAP_SYN_RCVD		BIT(1)	/* implies socket connecting */
+#define  TAP_SYN_ACK_SENT	BIT( 3)	/* implies socket connected */
+#define ESTABLISHED		BIT(2)
+#define  SOCK_FIN_RCVD		BIT( 3)
+#define  SOCK_FIN_SENT		BIT( 4)
+#define  TAP_FIN_RCVD		BIT( 5)
+#define  TAP_FIN_SENT		BIT( 6)
+#define  TAP_FIN_ACKED		BIT( 7)
+
+#define	CONN_STATE_BITS		/* Setting these clears other flags */	\
+	(SOCK_ACCEPTED | TAP_SYN_RCVD | ESTABLISHED)
+
+
+	int		timer		:SOCKET_REF_BITS;
+
+	uint8_t		flags;
+#define STALLED			BIT(0)
+#define LOCAL			BIT(1)
+#define WND_CLAMPED		BIT(2)
+#define IN_EPOLL		BIT(3)
+#define ACTIVE_CLOSE		BIT(4)
+#define ACK_TO_TAP_DUE		BIT(5)
+#define ACK_FROM_TAP_DUE	BIT(6)
+
+
+	unsigned int	hash_bucket	:TCP_HASH_BUCKET_BITS;
+
+#define TCP_MSS_BITS			14
+	unsigned int	tap_mss		:TCP_MSS_BITS;
+#define MSS_SET(conn, mss)	(conn->tap_mss = (mss >> (16 - TCP_MSS_BITS)))
+#define MSS_GET(conn)		(conn->tap_mss << (16 - TCP_MSS_BITS))
+
+
+#define SNDBUF_BITS		24
+	unsigned int	sndbuf		:SNDBUF_BITS;
+#define SNDBUF_SET(conn, bytes)	(conn->sndbuf = ((bytes) >> (32 - SNDBUF_BITS)))
+#define SNDBUF_GET(conn)	(conn->sndbuf << (32 - SNDBUF_BITS))
+
+	uint8_t		seq_dup_ack_approx;
+
+
+	union {
+		struct in6_addr a6;
+		struct {
+			uint8_t zero[10];
+			uint8_t one[2];
+			struct in_addr a;
+		} a4;
+	} a;
+
+	in_port_t	tap_port;
+	in_port_t	sock_port;
+
+	uint16_t	wnd_from_tap;
+	uint16_t	wnd_to_tap;
+
+	uint32_t	seq_to_tap;
+	uint32_t	seq_ack_from_tap;
+	uint32_t	seq_from_tap;
+	uint32_t	seq_ack_to_tap;
+	uint32_t	seq_init_from_tap;
+};
+
+/**
+ * struct tcp_splice_conn - Descriptor for a spliced TCP connection
+ * @a:			File descriptor number of socket for accepted connection
+ * @pipe_a_b:		Pipe ends for splice() from @a to @b
+ * @b:			File descriptor number of peer connected socket
+ * @pipe_b_a:		Pipe ends for splice() from @b to @a
+ * @events:		Events observed/actions performed on connection
+ * @flags:		Connection flags (attributes, not events)
+ * @a_read:		Bytes read from @a (not fully written to @b in one shot)
+ * @a_written:		Bytes written to @a (not fully written from one @b read)
+ * @b_read:		Bytes read from @b (not fully written to @a in one shot)
+ * @b_written:		Bytes written to @b (not fully written from one @a read)
+*/
+struct tcp_splice_conn {
+	int a;
+	int pipe_a_b[2];
+	int b;
+	int pipe_b_a[2];
+
+	uint8_t events;
+#define SPLICE_CLOSED			0
+#define SPLICE_CONNECT			BIT(0)
+#define SPLICE_ESTABLISHED		BIT(1)
+#define A_OUT_WAIT			BIT(2)
+#define B_OUT_WAIT			BIT(3)
+#define A_FIN_RCVD			BIT(4)
+#define B_FIN_RCVD			BIT(5)
+#define A_FIN_SENT			BIT(6)
+#define B_FIN_SENT			BIT(7)
+
+	uint8_t flags;
+#define SPLICE_V6			BIT(0)
+#define SPLICE_IN_EPOLL			BIT(1)
+#define RCVLOWAT_SET_A			BIT(2)
+#define RCVLOWAT_SET_B			BIT(3)
+#define RCVLOWAT_ACT_A			BIT(4)
+#define RCVLOWAT_ACT_B			BIT(5)
+#define CLOSING				BIT(6)
+
+	uint32_t a_read;
+	uint32_t a_written;
+	uint32_t b_read;
+	uint32_t b_written;
+};
+
+#endif /* TCP_CONN_H */
diff --git a/tcp_splice.c b/tcp_splice.c
index 4cc4ad2..cbfab01 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -21,12 +21,12 @@
  *
  * - SPLICE_CONNECT:		connection accepted, connecting to target
  * - SPLICE_ESTABLISHED:	connection to target established
- * - SPLICE_A_OUT_WAIT:		pipe to accepted socket full, wait for EPOLLOUT
- * - SPLICE_B_OUT_WAIT:		pipe to target socket full, wait for EPOLLOUT
- * - SPLICE_A_FIN_RCVD:		FIN (EPOLLRDHUP) seen from accepted socket
- * - SPLICE_B_FIN_RCVD:		FIN (EPOLLRDHUP) seen from target socket
- * - SPLICE_A_FIN_RCVD:		FIN (write shutdown) sent to accepted socket
- * - SPLICE_B_FIN_RCVD:		FIN (write shutdown) sent to target socket
+ * - A_OUT_WAIT:		pipe to accepted socket full, wait for EPOLLOUT
+ * - B_OUT_WAIT:		pipe to target socket full, wait for EPOLLOUT
+ * - A_FIN_RCVD:		FIN (EPOLLRDHUP) seen from accepted socket
+ * - B_FIN_RCVD:		FIN (EPOLLRDHUP) seen from target socket
+ * - A_FIN_SENT:		FIN (write shutdown) sent to accepted socket
+ * - B_FIN_SENT:		FIN (write shutdown) sent to target socket
  *
  * #syscalls:pasta pipe2|pipe fcntl armv6l:fcntl64 armv7l:fcntl64 ppc64:fcntl64
  */
@@ -52,6 +52,8 @@
 #include "log.h"
 #include "tcp_splice.h"
 
+#include "tcp_conn.h"
+
 #define MAX_PIPE_SIZE			(8UL * 1024 * 1024)
 #define TCP_SPLICE_PIPE_POOL_SIZE	16
 #define TCP_SPLICE_CONN_PRESSURE	30	/* % of splice_conn_count */
@@ -66,52 +68,7 @@ extern int ns_sock_pool6		[TCP_SOCK_POOL_SIZE];
 /* Pool of pre-opened pipes */
 static int splice_pipe_pool		[TCP_SPLICE_PIPE_POOL_SIZE][2][2];
 
-/**
- * struct tcp_splice_conn - Descriptor for a spliced TCP connection
- * @a:			File descriptor number of socket for accepted connection
- * @pipe_a_b:		Pipe ends for splice() from @a to @b
- * @b:			File descriptor number of peer connected socket
- * @pipe_b_a:		Pipe ends for splice() from @b to @a
- * @events:		Events observed/actions performed on connection
- * @flags:		Connection flags (attributes, not events)
- * @a_read:		Bytes read from @a (not fully written to @b in one shot)
- * @a_written:		Bytes written to @a (not fully written from one @b read)
- * @b_read:		Bytes read from @b (not fully written to @a in one shot)
- * @b_written:		Bytes written to @b (not fully written from one @a read)
-*/
-struct tcp_splice_conn {
-	int a;
-	int pipe_a_b[2];
-	int b;
-	int pipe_b_a[2];
-
-	uint8_t events;
-#define CLOSED				0
-#define CONNECT				BIT(0)
-#define ESTABLISHED			BIT(1)
-#define A_OUT_WAIT			BIT(2)
-#define B_OUT_WAIT			BIT(3)
-#define A_FIN_RCVD			BIT(4)
-#define B_FIN_RCVD			BIT(5)
-#define A_FIN_SENT			BIT(6)
-#define B_FIN_SENT			BIT(7)
-
-	uint8_t flags;
-#define SOCK_V6				BIT(0)
-#define IN_EPOLL			BIT(1)
-#define RCVLOWAT_SET_A			BIT(2)
-#define RCVLOWAT_SET_B			BIT(3)
-#define RCVLOWAT_ACT_A			BIT(4)
-#define RCVLOWAT_ACT_B			BIT(5)
-#define CLOSING				BIT(6)
-
-	uint32_t a_read;
-	uint32_t a_written;
-	uint32_t b_read;
-	uint32_t b_written;
-};
-
-#define CONN_V6(x)			(x->flags & SOCK_V6)
+#define CONN_V6(x)			(x->flags & SPLICE_V6)
 #define CONN_V4(x)			(!CONN_V6(x))
 #define CONN_HAS(conn, set)		((conn->events & (set)) == (set))
 #define CONN(index)			(tc_splice + (index))
@@ -122,13 +79,13 @@ static struct tcp_splice_conn tc_splice[TCP_SPLICE_MAX_CONNS];
 
 /* Display strings for connection events */
 static const char *tcp_splice_event_str[] __attribute((__unused__)) = {
-	"CONNECT", "ESTABLISHED", "A_OUT_WAIT", "B_OUT_WAIT",
+	"SPLICE_CONNECT", "SPLICE_ESTABLISHED", "A_OUT_WAIT", "B_OUT_WAIT",
 	"A_FIN_RCVD", "B_FIN_RCVD", "A_FIN_SENT", "B_FIN_SENT",
 };
 
 /* Display strings for connection flags */
 static const char *tcp_splice_flag_str[] __attribute((__unused__)) = {
-	"SOCK_V6", "IN_EPOLL", "RCVLOWAT_SET_A", "RCVLOWAT_SET_B",
+	"SPLICE_V6", "SPLICE_IN_EPOLL", "RCVLOWAT_SET_A", "RCVLOWAT_SET_B",
 	"RCVLOWAT_ACT_A", "RCVLOWAT_ACT_B", "CLOSING",
 };
 
@@ -143,12 +100,12 @@ static void tcp_splice_conn_epoll_events(uint16_t events,
 {
 	*a = *b = 0;
 
-	if (events & ESTABLISHED) {
+	if (events & SPLICE_ESTABLISHED) {
 		if (!(events & B_FIN_SENT))
 			*a = EPOLLIN | EPOLLRDHUP;
 		if (!(events & A_FIN_SENT))
 			*b = EPOLLIN | EPOLLRDHUP;
-	} else if (events & CONNECT) {
+	} else if (events & SPLICE_CONNECT) {
 		*b = EPOLLOUT;
 	}
 
@@ -210,7 +167,7 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn,
 static int tcp_splice_epoll_ctl(const struct ctx *c,
 				struct tcp_splice_conn *conn)
 {
-	int m = (conn->flags & IN_EPOLL) ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
+	int m = (conn->flags & SPLICE_IN_EPOLL) ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
 	union epoll_ref ref_a = { .r.proto = IPPROTO_TCP, .r.s = conn->a,
 				  .r.p.tcp.tcp.splice = 1,
 				  .r.p.tcp.tcp.index = CONN_IDX(conn),
@@ -234,7 +191,7 @@ static int tcp_splice_epoll_ctl(const struct ctx *c,
 	    epoll_ctl(c->epollfd, m, conn->b, &ev_b))
 		goto delete;
 
-	conn->flags |= IN_EPOLL;		/* No need to log this */
+	conn->flags |= SPLICE_IN_EPOLL;		/* No need to log this */
 
 	return 0;
 
@@ -323,7 +280,7 @@ static void tcp_table_splice_compact(struct ctx *c,
  */
 static void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn)
 {
-	if (conn->events & ESTABLISHED) {
+	if (conn->events & SPLICE_ESTABLISHED) {
 		/* Flushing might need to block: don't recycle them. */
 		if (conn->pipe_a_b[0] != -1) {
 			close(conn->pipe_a_b[0]);
@@ -337,7 +294,7 @@ static void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn)
 		}
 	}
 
-	if (conn->events & CONNECT) {
+	if (conn->events & SPLICE_CONNECT) {
 		close(conn->b);
 		conn->b = -1;
 	}
@@ -346,7 +303,7 @@ static void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn)
 	conn->a = -1;
 	conn->a_read = conn->a_written = conn->b_read = conn->b_written = 0;
 
-	conn->events = CLOSED;
+	conn->events = SPLICE_CLOSED;
 	conn->flags = 0;
 	debug("TCP (spliced): index %li, CLOSED", CONN_IDX(conn));
 
@@ -397,8 +354,8 @@ static int tcp_splice_connect_finish(const struct ctx *c,
 		}
 	}
 
-	if (!(conn->events & ESTABLISHED))
-		conn_event(c, conn, ESTABLISHED);
+	if (!(conn->events & SPLICE_ESTABLISHED))
+		conn_event(c, conn, SPLICE_ESTABLISHED);
 
 	return 0;
 }
@@ -466,9 +423,9 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
 			close(sock_conn);
 			return ret;
 		}
-		conn_event(c, conn, CONNECT);
+		conn_event(c, conn, SPLICE_CONNECT);
 	} else {
-		conn_event(c, conn, ESTABLISHED);
+		conn_event(c, conn, SPLICE_ESTABLISHED);
 		return tcp_splice_connect_finish(c, conn);
 	}
 
@@ -598,7 +555,7 @@ void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
 
 		conn = CONN(c->tcp.splice_conn_count++);
 		conn->a = s;
-		conn->flags = ref.r.p.tcp.tcp.v6 ? SOCK_V6 : 0;
+		conn->flags = ref.r.p.tcp.tcp.v6 ? SPLICE_V6 : 0;
 
 		if (tcp_splice_new(c, conn, ref.r.p.tcp.tcp.index,
 				   ref.r.p.tcp.tcp.outbound))
@@ -609,13 +566,13 @@ void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
 
 	conn = CONN(ref.r.p.tcp.tcp.index);
 
-	if (conn->events == CLOSED)
+	if (conn->events == SPLICE_CLOSED)
 		return;
 
 	if (events & EPOLLERR)
 		goto close;
 
-	if (conn->events == CONNECT) {
+	if (conn->events == SPLICE_CONNECT) {
 		if (!(events & EPOLLOUT))
 			goto close;
 		if (tcp_splice_connect_finish(c, conn))
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 08/32] tcp: Add connection union type
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (6 preceding siblings ...)
  2022-11-16  4:41 ` [PATCH 07/32] tcp: Move connection state structures into a shared header David Gibson
@ 2022-11-16  4:41 ` David Gibson
  2022-11-16  4:41 ` [PATCH 09/32] tcp: Improved helpers to update connections after moving David Gibson
                   ` (23 subsequent siblings)
  31 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:41 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Currently, the tables for spliced and non-spliced connections are entirely
separate, with different types in different arrays.  We want to unify them.
As a first step, create a union type which can represent either a spliced
or non-spliced connection.  For them to be distinguishable, the individual
types need to have a common header added, with a bit indicating which type
this structure is.

This comes at the cost of increasing the size of tcp_tap_conn to over one
(64 byte) cacheline.  This isn't ideal, but it makes things simpler for now
and we'll re-optimize this later.
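
As a rough illustration of the layout described above, here is a minimal
sketch with cut-down stand-ins for the real structures.  The names
conn_common, tap_conn, splice_conn and conn_first_fd() are hypothetical and
exist only for this example; the patch's actual types are tcp_conn_common,
tcp_tap_conn, tcp_splice_conn and union tcp_conn.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the structures in tcp_conn.h */
struct conn_common {
	bool spliced	:1;	/* which variant is stored in this slot? */
};

struct tap_conn {
	struct conn_common c;	/* must be first, to match splice_conn */
	int sock;
	uint32_t seq_to_tap;
};

struct splice_conn {
	struct conn_common c;	/* must be first, to match tap_conn */
	int a;
	int b;
};

union conn {
	struct conn_common c;
	struct tap_conn tap;
	struct splice_conn splice;
};

/* Return the first socket of a connection, whichever variant it is */
static int conn_first_fd(const union conn *conn)
{
	if (conn->c.spliced)
		return conn->splice.a;

	return conn->tap.sock;
}
```

Because both variants start with the same header, code holding only a
union pointer can check c.spliced before touching variant-specific fields.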

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c        |  4 ++++
 tcp_conn.h   | 30 ++++++++++++++++++++++++++++++
 tcp_splice.c |  2 ++
 3 files changed, 36 insertions(+)

diff --git a/tcp.c b/tcp.c
index 1137f45..b9b15ee 100644
--- a/tcp.c
+++ b/tcp.c
@@ -288,6 +288,7 @@
 #include <sys/uio.h>
 #include <unistd.h>
 #include <time.h>
+#include <assert.h>
 
 #include <linux/tcp.h> /* For struct tcp_info */
 
@@ -601,6 +602,7 @@ static inline struct tcp_tap_conn *conn_at_idx(int index)
 {
 	if ((index < 0) || (index >= TCP_MAX_CONNS))
 		return NULL;
+	assert(!(CONN(index)->c.spliced));
 	return CONN(index);
 }
 
@@ -2096,6 +2098,7 @@ static void tcp_conn_from_tap(struct ctx *c, int af, const void *addr,
 	}
 
 	conn = CONN(c->tcp.conn_count++);
+	conn->c.spliced = false;
 	conn->sock = s;
 	conn->timer = -1;
 	conn_event(c, conn, TAP_SYN_RCVD);
@@ -2764,6 +2767,7 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
 		return;
 
 	conn = CONN(c->tcp.conn_count++);
+	conn->c.spliced = false;
 	conn->sock = s;
 	conn->timer = -1;
 	conn->ws_to_tap = conn->ws_from_tap = 0;
diff --git a/tcp_conn.h b/tcp_conn.h
index db4c2d9..39d104a 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -11,8 +11,19 @@
 
 #define TCP_HASH_BUCKET_BITS		(TCP_CONN_INDEX_BITS + 1)
 
+/**
+ * struct tcp_conn_common - Common fields for spliced and non-spliced
+ * @spliced:		Is this a spliced connection?
+ */
+struct tcp_conn_common {
+	bool spliced	:1;
+};
+
+extern const char *tcp_common_flag_str[];
+
 /**
  * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced)
+ * @c:			Fields common with tcp_splice_conn
  * @next_index:		Connection index of next item in hash chain, -1 for none
  * @tap_mss:		MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
  * @sock:		Socket descriptor number
@@ -40,6 +51,9 @@
  * @seq_init_from_tap:	Initial sequence number from tap
  */
 struct tcp_tap_conn {
+	/* Must be first element to match tcp_splice_conn */
+	struct tcp_conn_common c;
+
 	int	 	next_index	:TCP_CONN_INDEX_BITS + 2;
 
 #define TCP_RETRANS_BITS		3
@@ -122,6 +136,7 @@ struct tcp_tap_conn {
 
 /**
  * struct tcp_splice_conn - Descriptor for a spliced TCP connection
+ * @c:			Fields common with tcp_tap_conn
  * @a:			File descriptor number of socket for accepted connection
  * @pipe_a_b:		Pipe ends for splice() from @a to @b
  * @b:			File descriptor number of peer connected socket
@@ -134,6 +149,9 @@ struct tcp_tap_conn {
  * @b_written:		Bytes written to @b (not fully written from one @a read)
 */
 struct tcp_splice_conn {
+	/* Must be first element to match tcp_tap_conn */
+	struct tcp_conn_common c;
+
 	int a;
 	int pipe_a_b[2];
 	int b;
@@ -165,4 +183,16 @@ struct tcp_splice_conn {
 	uint32_t b_written;
 };
 
+/**
+ * union tcp_conn - Descriptor for a TCP connection (spliced or non-spliced)
+ * @c:			Fields common between all variants
+ * @tap:		Fields specific to non-spliced connections
+ * @splice:		Fields specific to spliced connections
+*/
+union tcp_conn {
+	struct tcp_conn_common c;
+	struct tcp_tap_conn tap;
+	struct tcp_splice_conn splice;
+};
+
 #endif /* TCP_CONN_H */
diff --git a/tcp_splice.c b/tcp_splice.c
index cbfab01..d8be91b 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -46,6 +46,7 @@
 #include <sys/epoll.h>
 #include <sys/types.h>
 #include <sys/socket.h>
+#include <assert.h>
 
 #include "util.h"
 #include "passt.h"
@@ -554,6 +555,7 @@ void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
 		}
 
 		conn = CONN(c->tcp.splice_conn_count++);
+		conn->c.spliced = true;
 		conn->a = s;
 		conn->flags = ref.r.p.tcp.tcp.v6 ? SPLICE_V6 : 0;
 
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 09/32] tcp: Improved helpers to update connections after moving
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (7 preceding siblings ...)
  2022-11-16  4:41 ` [PATCH 08/32] tcp: Add connection union type David Gibson
@ 2022-11-16  4:41 ` David Gibson
  2022-11-16  4:41 ` [PATCH 10/32] tcp: Unify spliced and non-spliced connection tables David Gibson
                   ` (22 subsequent siblings)
  31 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:41 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

When we compact the connection tables (both spliced and non-spliced) we
need to move entries from one slot to another.  That requires some updates
in the entries themselves.  Add helpers to make all the necessary updates
for the spliced and non-spliced cases.  This will simplify later cleanups.
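
The general pattern can be sketched as below: the last used entry is copied
into the hole, then a per-entry update hook fixes whatever depends on the
entry's location.  This is only an illustration under simplifying
assumptions; struct entry, entry_update() and table_compact() are
hypothetical names, and the real helpers also refresh hash chains and epoll
references.

```c
#include <assert.h>
#include <string.h>

#define TABLE_SIZE 8	/* hypothetical, tiny for illustration */

struct entry {
	int fd;
	int index_seen;	/* stands in for state that depends on the slot */
};

static struct entry table[TABLE_SIZE];
static int table_count;

/* Fix up whatever depends on the entry's location after a move */
static void entry_update(struct entry *new)
{
	new->index_seen = (int)(new - table);
}

/* Fill @hole with the last used entry so the used slots stay contiguous */
static void table_compact(struct entry *hole)
{
	struct entry *from = &table[--table_count];

	if (from != hole) {
		memcpy(hole, from, sizeof(*hole));
		entry_update(hole);
	}
	memset(from, 0, sizeof(*from));
}
```

Keeping every location-dependent fixup inside one update helper means the
compaction routine itself needs no knowledge of the entry type.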

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c        | 16 +++++++++-------
 tcp_splice.c | 17 ++++++++++++++---
 2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/tcp.c b/tcp.c
index b9b15ee..39b6176 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1291,11 +1291,13 @@ static void tcp_hash_remove(const struct tcp_tap_conn *conn)
 }
 
 /**
- * tcp_hash_update() - Update pointer for given connection
- * @old:	Old connection pointer
- * @new:	New connection pointer
+ * tcp_tap_conn_update() - Update tcp_tap_conn when being moved in the table
+ * @c:		Execution context
+ * @old:	Old location of tcp_tap_conn
+ * @new:	New location of tcp_tap_conn
  */
-static void tcp_hash_update(struct tcp_tap_conn *old, struct tcp_tap_conn *new)
+static void tcp_tap_conn_update(struct ctx *c, struct tcp_tap_conn *old,
+				struct tcp_tap_conn *new)
 {
 	struct tcp_tap_conn *entry, *prev = NULL;
 	int b = old->hash_bucket;
@@ -1314,6 +1316,8 @@ static void tcp_hash_update(struct tcp_tap_conn *old, struct tcp_tap_conn *new)
 	debug("TCP: hash table update: old index %li, new index %li, sock %i, "
 	      "bucket: %i, old: %p, new: %p",
 	      CONN_IDX(old), CONN_IDX(new), new->sock, b, old, new);
+
+	tcp_epoll_ctl(c, new);
 }
 
 /**
@@ -1362,9 +1366,7 @@ static void tcp_table_compact(struct ctx *c, struct tcp_tap_conn *hole)
 	memcpy(hole, from, sizeof(*hole));
 
 	to = hole;
-	tcp_hash_update(from, to);
-
-	tcp_epoll_ctl(c, to);
+	tcp_tap_conn_update(c, from, to);
 
 	debug("TCP: hash table compaction: old index %li, new index %li, "
 	      "sock %i, from: %p, to: %p",
diff --git a/tcp_splice.c b/tcp_splice.c
index d8be91b..7dcd1cb 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -242,6 +242,19 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn,
 		conn_event_do(c, conn, event);				\
 	} while (0)
 
+
+/**
+ * tcp_splice_conn_update() - Update tcp_splice_conn when being moved in the table
+ * @c:		Execution context
+ * @new:	New location of tcp_splice_conn
+ */
+static void tcp_splice_conn_update(struct ctx *c, struct tcp_splice_conn *new)
+{
+	tcp_splice_epoll_ctl(c, new);
+	if (tcp_splice_epoll_ctl(c, new))
+		conn_flag(c, new, CLOSING);
+}
+
 /**
  * tcp_table_splice_compact - Compact spliced connection table
  * @c:		Execution context
@@ -269,9 +282,7 @@ static void tcp_table_splice_compact(struct ctx *c,
 
 	debug("TCP (spliced): index %li moved to %li",
 	      CONN_IDX(move), CONN_IDX(hole));
-	tcp_splice_epoll_ctl(c, hole);
-	if (tcp_splice_epoll_ctl(c, hole))
-		conn_flag(c, hole, CLOSING);
+	tcp_splice_conn_update(c, hole);
 }
 
 /**
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 10/32] tcp: Unify spliced and non-spliced connection tables
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (8 preceding siblings ...)
  2022-11-16  4:41 ` [PATCH 09/32] tcp: Improved helpers to update connections after moving David Gibson
@ 2022-11-16  4:41 ` David Gibson
  2022-11-16  4:41 ` [PATCH 11/32] tcp: Unify tcp_defer_handler and tcp_splice_defer_handler() David Gibson
                   ` (21 subsequent siblings)
  31 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:41 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Currently spliced and non-spliced connections are stored in completely
separate tables, so there are completely independent limits on the number
of spliced and non-spliced connections.  This is a bit counter-intuitive.

More importantly, the fact that the tables are separate prevents us from
unifying some other logic between the two cases.  So, merge these two
tables into one, using the 'c.spliced' common field to distinguish between
them when necessary.

For now we keep a common limit of 128k connections, whether they're spliced
or non-spliced, which means we save memory overall.  If necessary we could
increase this to a 256k or higher total, which would cost memory but give
some more flexibility.

For now, the code paths which need to step through all extant connections
are still separate for the two cases, just skipping over entries which
aren't for them.  We'll improve that in later patches.
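
As an illustration only (this is not code from the series), the idea of
a single table holding both kinds of connection, told apart by a flag in
a shared leading struct, could be sketched roughly as below.  The names
conn_add(), table_compact() and MAX_CONNS, and the tiny table size, are
made up for the example; only the 'c.spliced' field mirrors the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_CONNS 8	/* hypothetical; the patch keeps a 128k limit */

/* Common header: must be the first member of each connection type */
struct conn_common { bool spliced :1; };

struct tap_conn    { struct conn_common c; int sock; };
struct splice_conn { struct conn_common c; int a, b; };

union conn {
	struct conn_common c;
	struct tap_conn tap;
	struct splice_conn splice;
};

static union conn table[MAX_CONNS];
static int conn_count;

/* Append a connection of either kind; returns its index, -1 if full */
static int conn_add(bool spliced)
{
	if (conn_count >= MAX_CONNS)
		return -1;

	memset(&table[conn_count], 0, sizeof(table[0]));
	table[conn_count].c.spliced = spliced;
	return conn_count++;
}

/* Fill the hole at @idx with the last entry; true if an entry moved */
static bool table_compact(int idx)
{
	int last = --conn_count;
	bool moved = idx != last;

	if (moved)
		table[idx] = table[last];

	memset(&table[last], 0, sizeof(table[last]));
	return moved;
}
```

In this sketch a caller that gets 'true' back from table_compact() would
then refresh whatever refers to the moved entry by index, checking
table[idx].c.spliced to pick the right kind of update.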

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c        | 46 ++++++++++++++++++++----------------
 tcp.h        |  2 +-
 tcp_conn.h   |  6 +++++
 tcp_splice.c | 66 ++++++++++++++--------------------------------------
 tcp_splice.h |  2 --
 5 files changed, 51 insertions(+), 71 deletions(-)

diff --git a/tcp.c b/tcp.c
index 39b6176..01b08c8 100644
--- a/tcp.c
+++ b/tcp.c
@@ -98,11 +98,11 @@
  * Connection tracking and storage
  * -------------------------------
  *
- * Connections are tracked by the @tc array of struct tcp_tap_conn, containing
- * addresses, ports, TCP states and parameters. This is statically allocated and
- * indexed by an arbitrary connection number. The array is compacted whenever a
- * connection is closed, by remapping the highest connection index in use to the
- * one freed up.
+ * Connections are tracked by struct tcp_tap_conn entries in the @tc
+ * array, containing addresses, ports, TCP states and parameters. This
+ * is statically allocated and indexed by an arbitrary connection
+ * number. The array is compacted whenever a connection is closed, by
+ * remapping the highest connection index in use to the one freed up.
  *
  * References used for the epoll interface report the connection index used for
  * the @tc array.
@@ -588,10 +588,10 @@ static unsigned int tcp6_l2_flags_buf_used;
 static size_t tcp6_l2_flags_buf_bytes;
 
 /* TCP connections */
-static struct tcp_tap_conn tc[TCP_MAX_CONNS];
+union tcp_conn tc[TCP_MAX_CONNS];
 
-#define CONN(index)		(tc + (index))
-#define CONN_IDX(conn)		((conn) - tc)
+#define CONN(index)		(&tc[(index)].tap)
+#define CONN_IDX(conn)		((union tcp_conn *)(conn) - tc)
 
 /** conn_at_idx() - Find a connection by index, if present
  * @index:	Index of connection to lookup
@@ -1351,26 +1351,28 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c,
  * @c:		Execution context
  * @hole:	Pointer to recently closed connection
  */
-static void tcp_table_compact(struct ctx *c, struct tcp_tap_conn *hole)
+void tcp_table_compact(struct ctx *c, union tcp_conn *hole)
 {
-	struct tcp_tap_conn *from, *to;
+	union tcp_conn *from;
 
 	if (CONN_IDX(hole) == --c->tcp.conn_count) {
-		debug("TCP: hash table compaction: maximum index was %li (%p)",
+		debug("TCP: table compaction: maximum index was %li (%p)",
 		      CONN_IDX(hole), hole);
 		memset(hole, 0, sizeof(*hole));
 		return;
 	}
 
-	from = CONN(c->tcp.conn_count);
+	from = tc + c->tcp.conn_count;
 	memcpy(hole, from, sizeof(*hole));
 
-	to = hole;
-	tcp_tap_conn_update(c, from, to);
+	if (from->c.spliced)
+		tcp_splice_conn_update(c, &hole->splice);
+	else
+		tcp_tap_conn_update(c, &from->tap, &hole->tap);
 
-	debug("TCP: hash table compaction: old index %li, new index %li, "
-	      "sock %i, from: %p, to: %p",
-	      CONN_IDX(from), CONN_IDX(to), from->sock, from, to);
+	debug("TCP: table compaction (spliced=%d): old index %li, new index %li, "
+	      "from: %p, to: %p",
+	      from->c.spliced, CONN_IDX(from), CONN_IDX(hole), from, hole);
 
 	memset(from, 0, sizeof(*from));
 }
@@ -1387,7 +1389,7 @@ static void tcp_conn_destroy(struct ctx *c, struct tcp_tap_conn *conn)
 		close(conn->timer);
 
 	tcp_hash_remove(conn);
-	tcp_table_compact(c, conn);
+	tcp_table_compact(c, (union tcp_conn *)conn);
 }
 
 static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
@@ -1535,7 +1537,9 @@ void tcp_defer_handler(struct ctx *c)
 	if (c->tcp.conn_count < MIN(max_files, max_conns))
 		return;
 
-	for (conn = CONN(c->tcp.conn_count - 1); conn >= tc; conn--) {
+	for (conn = CONN(c->tcp.conn_count - 1); conn >= CONN(0); conn--) {
+		if (conn->c.spliced)
+			continue;
 		if (conn->events == CLOSED)
 			tcp_conn_destroy(c, conn);
 	}
@@ -3433,7 +3437,9 @@ void tcp_timer(struct ctx *c, const struct timespec *ts)
 		}
 	}
 
-	for (conn = CONN(c->tcp.conn_count - 1); conn >= tc; conn--) {
+	for (conn = CONN(c->tcp.conn_count - 1); conn >= CONN(0); conn--) {
+		if (conn->c.spliced)
+			continue;
 		if (conn->events == CLOSED)
 			tcp_conn_destroy(c, conn);
 	}
diff --git a/tcp.h b/tcp.h
index bba0f38..49738ef 100644
--- a/tcp.h
+++ b/tcp.h
@@ -54,7 +54,7 @@ union tcp_epoll_ref {
 /**
  * struct tcp_ctx - Execution context for TCP routines
  * @hash_secret:	128-bit secret for hash functions, ISN and hash table
- * @conn_count:		Count of connections (not spliced) in connection table
+ * @conn_count:		Count of total connections in connection table
  * @splice_conn_count:	Count of spliced connections in connection table
  * @port_to_tap:	Ports bound host-side, packets to tap or spliced
  * @fwd_in:		Port forwarding configuration for inbound packets
diff --git a/tcp_conn.h b/tcp_conn.h
index 39d104a..4295f7d 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -195,4 +195,10 @@ union tcp_conn {
 	struct tcp_splice_conn splice;
 };
 
+/* TCP connections */
+extern union tcp_conn tc[];
+
+void tcp_splice_conn_update(struct ctx *c, struct tcp_splice_conn *new);
+void tcp_table_compact(struct ctx *c, union tcp_conn *hole);
+
 #endif /* TCP_CONN_H */
diff --git a/tcp_splice.c b/tcp_splice.c
index 7dcd1cb..c986a9c 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -16,7 +16,7 @@
  * For local traffic directed to TCP ports configured for direct
  * mapping between namespaces, packets are directly translated between
  * L4 sockets using a pair of splice() syscalls. These connections are
- * tracked in the @tc_splice array of struct tcp_splice_conn, using
+ * tracked by struct tcp_splice_conn entries in the @tc array, using
  * these events:
  *
  * - SPLICE_CONNECT:		connection accepted, connecting to target
@@ -57,7 +57,7 @@
 
 #define MAX_PIPE_SIZE			(8UL * 1024 * 1024)
 #define TCP_SPLICE_PIPE_POOL_SIZE	16
-#define TCP_SPLICE_CONN_PRESSURE	30	/* % of splice_conn_count */
+#define TCP_SPLICE_CONN_PRESSURE	30	/* % of conn_count */
 #define TCP_SPLICE_FILE_PRESSURE	30	/* % of c->nofile */
 
 /* From tcp.c */
@@ -72,11 +72,8 @@ static int splice_pipe_pool		[TCP_SPLICE_PIPE_POOL_SIZE][2][2];
 #define CONN_V6(x)			(x->flags & SPLICE_V6)
 #define CONN_V4(x)			(!CONN_V6(x))
 #define CONN_HAS(conn, set)		((conn->events & (set)) == (set))
-#define CONN(index)			(tc_splice + (index))
-#define CONN_IDX(conn)			((conn) - tc_splice)
-
-/* Spliced connections */
-static struct tcp_splice_conn tc_splice[TCP_SPLICE_MAX_CONNS];
+#define CONN(index)			(&tc[(index)].splice)
+#define CONN_IDX(conn)			((union tcp_conn *)(conn) - tc)
 
 /* Display strings for connection events */
 static const char *tcp_splice_event_str[] __attribute((__unused__)) = {
@@ -248,43 +245,13 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn,
  * @c:		Execution context
  * @new:	New location of tcp_splice_conn
  */
-static void tcp_splice_conn_update(struct ctx *c, struct tcp_splice_conn *new)
+void tcp_splice_conn_update(struct ctx *c, struct tcp_splice_conn *new)
 {
 	tcp_splice_epoll_ctl(c, new);
 	if (tcp_splice_epoll_ctl(c, new))
 		conn_flag(c, new, CLOSING);
 }
 
-/**
- * tcp_table_splice_compact - Compact spliced connection table
- * @c:		Execution context
- * @hole:	Pointer to recently closed connection
- */
-static void tcp_table_splice_compact(struct ctx *c,
-				     struct tcp_splice_conn *hole)
-{
-	struct tcp_splice_conn *move;
-
-	if (CONN_IDX(hole) == --c->tcp.splice_conn_count) {
-		debug("TCP (spliced): index %li (max) removed", CONN_IDX(hole));
-		return;
-	}
-
-	move = CONN(c->tcp.splice_conn_count);
-
-	memcpy(hole, move, sizeof(*hole));
-
-	move->a = move->b = -1;
-	move->a_read = move->a_written = move->b_read = move->b_written = 0;
-	move->pipe_a_b[0] = move->pipe_a_b[1] = -1;
-	move->pipe_b_a[0] = move->pipe_b_a[1] = -1;
-	move->flags = move->events = 0;
-
-	debug("TCP (spliced): index %li moved to %li",
-	      CONN_IDX(move), CONN_IDX(hole));
-	tcp_splice_conn_update(c, hole);
-}
-
 /**
  * tcp_splice_destroy() - Close spliced connection and pipes, clear
  * @c:		Execution context
@@ -319,7 +286,8 @@ static void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn)
 	conn->flags = 0;
 	debug("TCP (spliced): index %li, CLOSED", CONN_IDX(conn));
 
-	tcp_table_splice_compact(c, conn);
+	c->tcp.splice_conn_count--;
+	tcp_table_compact(c, (union tcp_conn *)conn);
 }
 
 /**
@@ -553,7 +521,7 @@ void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
 	if (ref.r.p.tcp.tcp.listen) {
 		int s;
 
-		if (c->tcp.splice_conn_count >= TCP_SPLICE_MAX_CONNS)
+		if (c->tcp.conn_count >= TCP_MAX_CONNS)
 			return;
 
 		if ((s = accept4(ref.r.s, NULL, NULL, SOCK_NONBLOCK)) < 0)
@@ -565,8 +533,9 @@ void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
 			      s);
 		}
 
-		conn = CONN(c->tcp.splice_conn_count++);
+		conn = CONN(c->tcp.conn_count++);
 		conn->c.spliced = true;
+		c->tcp.splice_conn_count++;
 		conn->a = s;
 		conn->flags = ref.r.p.tcp.tcp.v6 ? SPLICE_V6 : 0;
 
@@ -845,9 +814,10 @@ void tcp_splice_timer(struct ctx *c)
 {
 	struct tcp_splice_conn *conn;
 
-	for (conn = CONN(c->tcp.splice_conn_count - 1);
-	     conn >= tc_splice;
-	     conn--) {
+	for (conn = CONN(c->tcp.conn_count - 1); conn >= CONN(0); conn--) {
+		if (!conn->c.spliced)
+			continue;
+
 		if (conn->flags & CLOSING) {
 			tcp_splice_destroy(c, conn);
 			return;
@@ -890,12 +860,12 @@ void tcp_splice_defer_handler(struct ctx *c)
 	int max_files = c->nofile / 100 * TCP_SPLICE_FILE_PRESSURE;
 	struct tcp_splice_conn *conn;
 
-	if (c->tcp.splice_conn_count < MIN(max_files / 6, max_conns))
+	if (c->tcp.conn_count < MIN(max_files / 6, max_conns))
 		return;
 
-	for (conn = CONN(c->tcp.splice_conn_count - 1);
-	     conn >= tc_splice;
-	     conn--) {
+	for (conn = CONN(c->tcp.conn_count - 1); conn >= CONN(0); conn--) {
+		if (!conn->c.spliced)
+			continue;
 		if (conn->flags & CLOSING)
 			tcp_splice_destroy(c, conn);
 	}
diff --git a/tcp_splice.h b/tcp_splice.h
index 2c4bff3..e8c70e9 100644
--- a/tcp_splice.h
+++ b/tcp_splice.h
@@ -6,8 +6,6 @@
 #ifndef TCP_SPLICE_H
 #define TCP_SPLICE_H
 
-#define TCP_SPLICE_MAX_CONNS			(128 * 1024)
-
 void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
 			     uint32_t events);
 void tcp_splice_init(struct ctx *c);
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 11/32] tcp: Unify tcp_defer_handler and tcp_splice_defer_handler()
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (9 preceding siblings ...)
  2022-11-16  4:41 ` [PATCH 10/32] tcp: Unify spliced and non-spliced connection tables David Gibson
@ 2022-11-16  4:41 ` David Gibson
  2022-11-16  4:41 ` [PATCH 12/32] tcp: Partially unify tcp_timer() and tcp_splice_timer() David Gibson
                   ` (20 subsequent siblings)
  31 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:41 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

These two functions each step through non-spliced and spliced connections
respectively and clean up entries for closed connections.  To avoid
scanning the connection table twice, we merge these into a single function
which scans the unified table and performs the appropriate sort of cleanup
action on each one.
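
A rough, self-contained sketch of such a single pass follows; it is an
illustration under simplifying assumptions, not the code in this patch.
The struct layout, the 'closing'/'closed' booleans (stand-ins for
"flags & CLOSING" and "events == CLOSED"), and the destroy() and
defer_scan() names are all hypothetical.  It walks the table from the
highest index down, so that an entry moved into a freed slot by
compaction is always one that has already been examined.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CONNS 8	/* hypothetical, tiny table for the example */

struct conn {
	bool spliced;	/* which kind of connection this entry is */
	bool closing;	/* stand-in for "flags & CLOSING" (spliced) */
	bool closed;	/* stand-in for "events == CLOSED" (non-spliced) */
	int id;		/* only used to follow entries in the example */
};

static struct conn table[MAX_CONNS];
static int conn_count, splice_destroyed, tap_destroyed;

/* Drop @conn and compact: the last entry in use takes over its slot */
static void destroy(struct conn *conn)
{
	if (conn->spliced)
		splice_destroyed++;
	else
		tap_destroyed++;

	*conn = table[--conn_count];
}

/* One scan of the unified table, with per-type cleanup checks */
static void defer_scan(void)
{
	int i;

	/* Index-based loop: avoids forming a pointer before table[0] */
	for (i = conn_count - 1; i >= 0; i--) {
		struct conn *conn = &table[i];

		if (conn->spliced) {
			if (conn->closing)
				destroy(conn);
		} else {
			if (conn->closed)
				destroy(conn);
		}
	}
}
```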

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c        | 20 +++++++++++---------
 tcp_conn.h   |  1 +
 tcp_splice.c | 24 +-----------------------
 tcp_splice.h |  1 -
 4 files changed, 13 insertions(+), 33 deletions(-)

diff --git a/tcp.c b/tcp.c
index 01b08c8..598634f 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1527,21 +1527,23 @@ void tcp_defer_handler(struct ctx *c)
 {
 	int max_conns = c->tcp.conn_count / 100 * TCP_CONN_PRESSURE;
 	int max_files = c->nofile / 100 * TCP_FILE_PRESSURE;
-	struct tcp_tap_conn *conn;
+	union tcp_conn *conn;
 
 	tcp_l2_flags_buf_flush(c);
 	tcp_l2_data_buf_flush(c);
 
-	tcp_splice_defer_handler(c);
-
-	if (c->tcp.conn_count < MIN(max_files, max_conns))
+	if ((c->tcp.conn_count < MIN(max_files, max_conns)) &&
+	    (c->tcp.splice_conn_count < MIN(max_files / 6, max_conns)))
 		return;
 
-	for (conn = CONN(c->tcp.conn_count - 1); conn >= CONN(0); conn--) {
-		if (conn->c.spliced)
-			continue;
-		if (conn->events == CLOSED)
-			tcp_conn_destroy(c, conn);
+	for (conn = tc + c->tcp.conn_count - 1; conn >= tc; conn--) {
+		if (conn->c.spliced) {
+			if (conn->splice.flags & CLOSING)
+				tcp_splice_destroy(c, &conn->splice);
+		} else {
+			if (conn->tap.events == CLOSED)
+				tcp_conn_destroy(c, &conn->tap);
+		}
 	}
 
 }
diff --git a/tcp_conn.h b/tcp_conn.h
index 4295f7d..634e259 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -200,5 +200,6 @@ extern union tcp_conn tc[];
 
 void tcp_splice_conn_update(struct ctx *c, struct tcp_splice_conn *new);
 void tcp_table_compact(struct ctx *c, union tcp_conn *hole);
+void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn);
 
 #endif /* TCP_CONN_H */
diff --git a/tcp_splice.c b/tcp_splice.c
index c986a9c..ad2b216 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -111,7 +111,6 @@ static void tcp_splice_conn_epoll_events(uint16_t events,
 	*b |= (events & B_OUT_WAIT) ? EPOLLOUT : 0;
 }
 
-static void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn);
 static int tcp_splice_epoll_ctl(const struct ctx *c,
 				struct tcp_splice_conn *conn);
 
@@ -257,7 +256,7 @@ void tcp_splice_conn_update(struct ctx *c, struct tcp_splice_conn *new)
  * @c:		Execution context
  * @conn:	Connection pointer
  */
-static void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn)
+void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn)
 {
 	if (conn->events & SPLICE_ESTABLISHED) {
 		/* Flushing might need to block: don't recycle them. */
@@ -849,24 +848,3 @@ void tcp_splice_timer(struct ctx *c)
 
 	tcp_splice_pipe_refill(c);
 }
-
-/**
- * tcp_splice_defer_handler() - Close connections without timer on file pressure
- * @c:		Execution context
- */
-void tcp_splice_defer_handler(struct ctx *c)
-{
-	int max_conns = c->tcp.conn_count / 100 * TCP_SPLICE_CONN_PRESSURE;
-	int max_files = c->nofile / 100 * TCP_SPLICE_FILE_PRESSURE;
-	struct tcp_splice_conn *conn;
-
-	if (c->tcp.conn_count < MIN(max_files / 6, max_conns))
-		return;
-
-	for (conn = CONN(c->tcp.conn_count - 1); conn >= CONN(0); conn--) {
-		if (!conn->c.spliced)
-			continue;
-		if (conn->flags & CLOSING)
-			tcp_splice_destroy(c, conn);
-	}
-}
diff --git a/tcp_splice.h b/tcp_splice.h
index e8c70e9..82e057c 100644
--- a/tcp_splice.h
+++ b/tcp_splice.h
@@ -10,6 +10,5 @@ void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
 			     uint32_t events);
 void tcp_splice_init(struct ctx *c);
 void tcp_splice_timer(struct ctx *c);
-void tcp_splice_defer_handler(struct ctx *c);
 
 #endif /* TCP_SPLICE_H */
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 12/32] tcp: Partially unify tcp_timer() and tcp_splice_timer()
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (10 preceding siblings ...)
  2022-11-16  4:41 ` [PATCH 11/32] tcp: Unify tcp_defer_handler and tcp_splice_defer_handler() David Gibson
@ 2022-11-16  4:41 ` David Gibson
  2022-11-16  4:41 ` [PATCH 13/32] tcp: Unify the IN_EPOLL flag David Gibson
                   ` (19 subsequent siblings)
  31 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:41 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

These two functions scan all the non-spliced and spliced connections
respectively and perform timed updates on them.  Avoid scanning the
now-unified table twice by having tcp_timer() scan it once, calling the
relevant per-connection function for each entry.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c        | 18 ++++++++---------
 tcp_conn.h   |  3 +++
 tcp_splice.c | 57 +++++++++++++++++++++++-----------------------------
 tcp_splice.h |  1 -
 4 files changed, 37 insertions(+), 42 deletions(-)

diff --git a/tcp.c b/tcp.c
index 598634f..08169b6 100644
--- a/tcp.c
+++ b/tcp.c
@@ -3283,8 +3283,6 @@ int tcp_init(struct ctx *c)
 
 		refill_arg.ns = 1;
 		NS_CALL(tcp_sock_refill, &refill_arg);
-
-		tcp_splice_timer(c);
 	}
 
 	return 0;
@@ -3416,7 +3414,7 @@ static int tcp_port_rebind(void *arg)
 void tcp_timer(struct ctx *c, const struct timespec *ts)
 {
 	struct tcp_sock_refill_arg refill_arg = { c, 0 };
-	struct tcp_tap_conn *conn;
+	union tcp_conn *conn;
 
 	(void)ts;
 
@@ -3439,11 +3437,13 @@ void tcp_timer(struct ctx *c, const struct timespec *ts)
 		}
 	}
 
-	for (conn = CONN(c->tcp.conn_count - 1); conn >= CONN(0); conn--) {
-		if (conn->c.spliced)
-			continue;
-		if (conn->events == CLOSED)
-			tcp_conn_destroy(c, conn);
+	for (conn = tc + c->tcp.conn_count - 1; conn >= tc; conn--) {
+		if (conn->c.spliced) {
+			tcp_splice_timer(c, &conn->splice);
+		} else {
+			if (conn->tap.events == CLOSED)
+				tcp_conn_destroy(c, &conn->tap);
+		}
 	}
 
 	tcp_sock_refill(&refill_arg);
@@ -3453,6 +3453,6 @@ void tcp_timer(struct ctx *c, const struct timespec *ts)
 		    (c->ifi6 && ns_sock_pool6[TCP_SOCK_POOL_TSH] < 0))
 			NS_CALL(tcp_sock_refill, &refill_arg);
 
-		tcp_splice_timer(c);
+		tcp_splice_pipe_refill(c);
 	}
 }
diff --git a/tcp_conn.h b/tcp_conn.h
index 634e259..7c450a0 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -201,5 +201,8 @@ extern union tcp_conn tc[];
 void tcp_splice_conn_update(struct ctx *c, struct tcp_splice_conn *new);
 void tcp_table_compact(struct ctx *c, union tcp_conn *hole);
 void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn);
+void tcp_splice_timer(struct ctx *c, struct tcp_splice_conn *conn);
+void tcp_splice_pipe_refill(const struct ctx *c);
+
 
 #endif /* TCP_CONN_H */
diff --git a/tcp_splice.c b/tcp_splice.c
index ad2b216..0ac316d 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -766,7 +766,7 @@ smaller:
  * tcp_splice_pipe_refill() - Refill pool of pre-opened pipes
  * @c:		Execution context
  */
-static void tcp_splice_pipe_refill(const struct ctx *c)
+void tcp_splice_pipe_refill(const struct ctx *c)
 {
 	int i;
 
@@ -803,48 +803,41 @@ void tcp_splice_init(struct ctx *c)
 {
 	memset(splice_pipe_pool, 0xff, sizeof(splice_pipe_pool));
 	tcp_set_pipe_size(c);
+	tcp_splice_pipe_refill(c);
 }
 
 /**
  * tcp_splice_timer() - Timer for spliced connections
  * @c:		Execution context
+ * @conn:	Spliced connection
  */
-void tcp_splice_timer(struct ctx *c)
+void tcp_splice_timer(struct ctx *c, struct tcp_splice_conn *conn)
 {
-	struct tcp_splice_conn *conn;
-
-	for (conn = CONN(c->tcp.conn_count - 1); conn >= CONN(0); conn--) {
-		if (!conn->c.spliced)
-			continue;
-
-		if (conn->flags & CLOSING) {
-			tcp_splice_destroy(c, conn);
-			return;
-		}
+	if (conn->flags & CLOSING) {
+		tcp_splice_destroy(c, conn);
+		return;
+	}
 
-		if ( (conn->flags & RCVLOWAT_SET_A) &&
-		    !(conn->flags & RCVLOWAT_ACT_A)) {
-			if (setsockopt(conn->a, SOL_SOCKET, SO_RCVLOWAT,
-				       &((int){ 1 }), sizeof(int))) {
-				trace("TCP (spliced): can't set SO_RCVLOWAT on "
-				      "%i", conn->a);
-			}
-			conn_flag(c, conn, ~RCVLOWAT_SET_A);
+	if ( (conn->flags & RCVLOWAT_SET_A) &&
+	     !(conn->flags & RCVLOWAT_ACT_A)) {
+		if (setsockopt(conn->a, SOL_SOCKET, SO_RCVLOWAT,
+			       &((int){ 1 }), sizeof(int))) {
+			trace("TCP (spliced): can't set SO_RCVLOWAT on "
+			      "%i", conn->a);
 		}
+		conn_flag(c, conn, ~RCVLOWAT_SET_A);
+	}
 
-		if ( (conn->flags & RCVLOWAT_SET_B) &&
-		    !(conn->flags & RCVLOWAT_ACT_B)) {
-			if (setsockopt(conn->b, SOL_SOCKET, SO_RCVLOWAT,
-				       &((int){ 1 }), sizeof(int))) {
-				trace("TCP (spliced): can't set SO_RCVLOWAT on "
-				      "%i", conn->b);
-			}
-			conn_flag(c, conn, ~RCVLOWAT_SET_B);
+	if ( (conn->flags & RCVLOWAT_SET_B) &&
+	     !(conn->flags & RCVLOWAT_ACT_B)) {
+		if (setsockopt(conn->b, SOL_SOCKET, SO_RCVLOWAT,
+			       &((int){ 1 }), sizeof(int))) {
+			trace("TCP (spliced): can't set SO_RCVLOWAT on "
+			      "%i", conn->b);
 		}
-
-		conn_flag(c, conn, ~RCVLOWAT_ACT_A);
-		conn_flag(c, conn, ~RCVLOWAT_ACT_B);
+		conn_flag(c, conn, ~RCVLOWAT_SET_B);
 	}
 
-	tcp_splice_pipe_refill(c);
+	conn_flag(c, conn, ~RCVLOWAT_ACT_A);
+	conn_flag(c, conn, ~RCVLOWAT_ACT_B);
 }
diff --git a/tcp_splice.h b/tcp_splice.h
index 82e057c..22024d6 100644
--- a/tcp_splice.h
+++ b/tcp_splice.h
@@ -9,6 +9,5 @@
 void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
 			     uint32_t events);
 void tcp_splice_init(struct ctx *c);
-void tcp_splice_timer(struct ctx *c);
 
 #endif /* TCP_SPLICE_H */
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 13/32] tcp: Unify the IN_EPOLL flag
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (11 preceding siblings ...)
  2022-11-16  4:41 ` [PATCH 12/32] tcp: Partially unify tcp_timer() and tcp_splice_timer() David Gibson
@ 2022-11-16  4:41 ` David Gibson
  2022-11-16  4:41 ` [PATCH 14/32] tcp: Separate helpers to create ns listening sockets David Gibson
                   ` (18 subsequent siblings)
  31 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:41 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

There is very little in common between the tcp_tap_conn and
tcp_splice_conn structures.  However, both do have an IN_EPOLL flag which
has the same meaning in each case, though it's stored in a different
location.

Simplify things slightly by moving this bit into the common header of the
two structures.
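
For illustration only, the pattern this bit supports (add the descriptor
to the epoll set the first time, modify it afterwards) might look like
the sketch below.  The conn_epoll_ctl() helper and its signature are
hypothetical and much simpler than tcp_epoll_ctl() and
tcp_splice_epoll_ctl(); only the 'in_epoll' field name follows the
patch.  It assumes Linux, since it uses epoll.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Common header shared by both connection types */
struct conn_common {
	bool spliced	:1;
	bool in_epoll	:1;
};

/*
 * Hypothetical helper: register @fd on the first call, modify the
 * existing registration on later calls.  Returns 0 on success, -1 on
 * failure (errno set by epoll_ctl()).
 */
static int conn_epoll_ctl(int epollfd, struct conn_common *c, int fd,
			  uint32_t events)
{
	int op = c->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
	struct epoll_event ev = { .events = events, .data.fd = fd };

	if (epoll_ctl(epollfd, op, fd, &ev))
		return -1;

	c->in_epoll = true;
	return 0;
}
```

Because the flag lives in the common header, a helper like this does not
need to know which kind of connection it is dealing with.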

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c        | 10 +++++-----
 tcp_conn.h   | 20 ++++++++++----------
 tcp_splice.c |  8 ++++----
 3 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/tcp.c b/tcp.c
index 08169b6..aac70cd 100644
--- a/tcp.c
+++ b/tcp.c
@@ -429,8 +429,8 @@ static const char *tcp_state_str[] __attribute((__unused__)) = {
 };
 
 static const char *tcp_flag_str[] __attribute((__unused__)) = {
-	"STALLED", "LOCAL", "WND_CLAMPED", "IN_EPOLL", "ACTIVE_CLOSE",
-	"ACK_TO_TAP_DUE", "ACK_FROM_TAP_DUE",
+	"STALLED", "LOCAL", "WND_CLAMPED", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE",
+	"ACK_FROM_TAP_DUE",
 };
 
 /* Listening sockets, used for automatic port forwarding in pasta mode only */
@@ -660,14 +660,14 @@ static void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
  */
 static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 {
-	int m = (conn->flags & IN_EPOLL) ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
+	int m = conn->c.in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
 	union epoll_ref ref = { .r.proto = IPPROTO_TCP, .r.s = conn->sock,
 				.r.p.tcp.tcp.index = CONN_IDX(conn),
 				.r.p.tcp.tcp.v6 = CONN_V6(conn) };
 	struct epoll_event ev = { .data.u64 = ref.u64 };
 
 	if (conn->events == CLOSED) {
-		if (conn->flags & IN_EPOLL)
+		if (conn->c.in_epoll)
 			epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->sock, &ev);
 		if (conn->timer != -1)
 			epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->timer, &ev);
@@ -679,7 +679,7 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 	if (epoll_ctl(c->epollfd, m, conn->sock, &ev))
 		return -errno;
 
-	conn->flags |= IN_EPOLL;	/* No need to log this */
+	conn->c.in_epoll = true;
 
 	if (conn->timer != -1) {
 		union epoll_ref ref_t = { .r.proto = IPPROTO_TCP,
diff --git a/tcp_conn.h b/tcp_conn.h
index 7c450a0..faa63dc 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -14,9 +14,11 @@
 /**
  * struct tcp_conn_common - Common fields for spliced and non-spliced
  * @spliced:		Is this a spliced connection?
+ * @in_epoll:		Is the connection in the epoll set?
  */
 struct tcp_conn_common {
 	bool spliced	:1;
+	bool in_epoll	:1;
 };
 
 extern const char *tcp_common_flag_str[];
@@ -90,10 +92,9 @@ struct tcp_tap_conn {
 #define STALLED			BIT(0)
 #define LOCAL			BIT(1)
 #define WND_CLAMPED		BIT(2)
-#define IN_EPOLL		BIT(3)
-#define ACTIVE_CLOSE		BIT(4)
-#define ACK_TO_TAP_DUE		BIT(5)
-#define ACK_FROM_TAP_DUE	BIT(6)
+#define ACTIVE_CLOSE		BIT(3)
+#define ACK_TO_TAP_DUE		BIT(4)
+#define ACK_FROM_TAP_DUE	BIT(5)
 
 
 	unsigned int	hash_bucket	:TCP_HASH_BUCKET_BITS;
@@ -170,12 +171,11 @@ struct tcp_splice_conn {
 
 	uint8_t flags;
 #define SPLICE_V6			BIT(0)
-#define SPLICE_IN_EPOLL			BIT(1)
-#define RCVLOWAT_SET_A			BIT(2)
-#define RCVLOWAT_SET_B			BIT(3)
-#define RCVLOWAT_ACT_A			BIT(4)
-#define RCVLOWAT_ACT_B			BIT(5)
-#define CLOSING				BIT(6)
+#define RCVLOWAT_SET_A			BIT(1)
+#define RCVLOWAT_SET_B			BIT(2)
+#define RCVLOWAT_ACT_A			BIT(3)
+#define RCVLOWAT_ACT_B			BIT(4)
+#define CLOSING				BIT(5)
 
 	uint32_t a_read;
 	uint32_t a_written;
diff --git a/tcp_splice.c b/tcp_splice.c
index 0ac316d..7a06252 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -83,8 +83,8 @@ static const char *tcp_splice_event_str[] __attribute((__unused__)) = {
 
 /* Display strings for connection flags */
 static const char *tcp_splice_flag_str[] __attribute((__unused__)) = {
-	"SPLICE_V6", "SPLICE_IN_EPOLL", "RCVLOWAT_SET_A", "RCVLOWAT_SET_B",
-	"RCVLOWAT_ACT_A", "RCVLOWAT_ACT_B", "CLOSING",
+	"SPLICE_V6", "RCVLOWAT_SET_A", "RCVLOWAT_SET_B", "RCVLOWAT_ACT_A",
+	"RCVLOWAT_ACT_B", "CLOSING",
 };
 
 /**
@@ -164,7 +164,7 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn,
 static int tcp_splice_epoll_ctl(const struct ctx *c,
 				struct tcp_splice_conn *conn)
 {
-	int m = (conn->flags & SPLICE_IN_EPOLL) ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
+	int m = conn->c.in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
 	union epoll_ref ref_a = { .r.proto = IPPROTO_TCP, .r.s = conn->a,
 				  .r.p.tcp.tcp.splice = 1,
 				  .r.p.tcp.tcp.index = CONN_IDX(conn),
@@ -188,7 +188,7 @@ static int tcp_splice_epoll_ctl(const struct ctx *c,
 	    epoll_ctl(c->epollfd, m, conn->b, &ev_b))
 		goto delete;
 
-	conn->flags |= SPLICE_IN_EPOLL;		/* No need to log this */
+	conn->c.in_epoll = true;
 
 	return 0;
 
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 14/32] tcp: Separate helpers to create ns listening sockets
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (12 preceding siblings ...)
  2022-11-16  4:41 ` [PATCH 13/32] tcp: Unify the IN_EPOLL flag David Gibson
@ 2022-11-16  4:41 ` David Gibson
  2022-11-16 23:51   ` Stefano Brivio
  2022-11-16  4:41 ` [PATCH 15/32] tcp: Unify part of spliced and non-spliced conn_from_sock path David Gibson
                   ` (17 subsequent siblings)
  31 siblings, 1 reply; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:41 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

tcp_sock_init*() can create sockets listening either on the host, or in
the pasta network namespace (with @ns==1).  There are, however, a number
of differences in how these two cases work in practice.  "ns" sockets
are only used in pasta mode, and they always lead to spliced connections
only.  The functions are also only ever called in "ns" mode with a NULL
address and interface name, and it doesn't really make sense for them to
be called any other way.

Later changes will introduce further differences in behaviour between these
two cases, so it makes more sense to use separate functions for creating
the ns listening sockets than the regular external/host listening sockets.
---
 conf.c |   6 +--
 tcp.c  | 130 ++++++++++++++++++++++++++++++++++++++-------------------
 tcp.h  |   4 +-
 3 files changed, 92 insertions(+), 48 deletions(-)

diff --git a/conf.c b/conf.c
index 3ad247e..2b39d18 100644
--- a/conf.c
+++ b/conf.c
@@ -209,7 +209,7 @@ static int conf_ports(const struct ctx *c, char optname, const char *optarg,
 
 		for (i = 0; i < PORT_EPHEMERAL_MIN; i++) {
 			if (optname == 't')
-				tcp_sock_init(c, 0, AF_UNSPEC, NULL, NULL, i);
+				tcp_sock_init(c, AF_UNSPEC, NULL, NULL, i);
 			else if (optname == 'u')
 				udp_sock_init(c, 0, AF_UNSPEC, NULL, NULL, i);
 		}
@@ -287,7 +287,7 @@ static int conf_ports(const struct ctx *c, char optname, const char *optarg,
 			bitmap_set(fwd->map, i);
 
 			if (optname == 't')
-				tcp_sock_init(c, 0, af, addr, ifname, i);
+				tcp_sock_init(c, af, addr, ifname, i);
 			else if (optname == 'u')
 				udp_sock_init(c, 0, af, addr, ifname, i);
 		}
@@ -333,7 +333,7 @@ static int conf_ports(const struct ctx *c, char optname, const char *optarg,
 			fwd->delta[i] = mapped_range.first - orig_range.first;
 
 			if (optname == 't')
-				tcp_sock_init(c, 0, af, addr, ifname, i);
+				tcp_sock_init(c, af, addr, ifname, i);
 			else if (optname == 'u')
 				udp_sock_init(c, 0, af, addr, ifname, i);
 		}
diff --git a/tcp.c b/tcp.c
index aac70cd..72d3b49 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2987,15 +2987,15 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
 /**
  * tcp_sock_init4() - Initialise listening sockets for a given IPv4 port
  * @c:		Execution context
- * @ns:		In pasta mode, if set, bind with loopback address in namespace
  * @addr:	Pointer to address for binding, NULL if not configured
  * @ifname:	Name of interface to bind to, NULL if not configured
  * @port:	Port, host order
  */
-static void tcp_sock_init4(const struct ctx *c, int ns, const struct in_addr *addr,
+static void tcp_sock_init4(const struct ctx *c, const struct in_addr *addr,
 			   const char *ifname, in_port_t port)
 {
-	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = ns };
+	in_port_t idx = port + c->tcp.fwd_in.delta[port];
+	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.index = idx };
 	bool spliced = false, tap = true;
 	int s;
 
@@ -3006,14 +3006,9 @@ static void tcp_sock_init4(const struct ctx *c, int ns, const struct in_addr *ad
 		if (!addr)
 			addr = &c->ip4.addr;
 
-		tap = !ns && !IN4_IS_ADDR_LOOPBACK(addr);
+		tap = !IN4_IS_ADDR_LOOPBACK(addr);
 	}
 
-	if (ns)
-		tref.tcp.index = (in_port_t)(port + c->tcp.fwd_out.delta[port]);
-	else
-		tref.tcp.index = (in_port_t)(port + c->tcp.fwd_in.delta[port]);
-
 	if (tap) {
 		s = sock_l4(c, AF_INET, IPPROTO_TCP, addr, ifname, port,
 			    tref.u32);
@@ -3039,29 +3034,25 @@ static void tcp_sock_init4(const struct ctx *c, int ns, const struct in_addr *ad
 		else
 			s = -1;
 
-		if (c->tcp.fwd_out.mode == FWD_AUTO) {
-			if (ns)
-				tcp_sock_ns[port][V4] = s;
-			else
-				tcp_sock_init_lo[port][V4] = s;
-		}
+		if (c->tcp.fwd_out.mode == FWD_AUTO)
+			tcp_sock_init_lo[port][V4] = s;
 	}
 }
 
 /**
  * tcp_sock_init6() - Initialise listening sockets for a given IPv6 port
  * @c:		Execution context
- * @ns:		In pasta mode, if set, bind with loopback address in namespace
  * @addr:	Pointer to address for binding, NULL if not configured
  * @ifname:	Name of interface to bind to, NULL if not configured
  * @port:	Port, host order
  */
-static void tcp_sock_init6(const struct ctx *c, int ns,
+static void tcp_sock_init6(const struct ctx *c,
 			   const struct in6_addr *addr, const char *ifname,
 			   in_port_t port)
 {
-	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = ns,
-				     .tcp.v6 = 1 };
+	in_port_t idx = port + c->tcp.fwd_in.delta[port];
+	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.v6 = 1,
+				     .tcp.index = idx	};
 	bool spliced = false, tap = true;
 	int s;
 
@@ -3073,14 +3064,9 @@ static void tcp_sock_init6(const struct ctx *c, int ns,
 		if (!addr)
 			addr = &c->ip6.addr;
 
-		tap = !ns && !IN6_IS_ADDR_LOOPBACK(addr);
+		tap = !IN6_IS_ADDR_LOOPBACK(addr);
 	}
 
-	if (ns)
-		tref.tcp.index = (in_port_t)(port + c->tcp.fwd_out.delta[port]);
-	else
-		tref.tcp.index = (in_port_t)(port + c->tcp.fwd_in.delta[port]);
-
 	if (tap) {
 		s = sock_l4(c, AF_INET6, IPPROTO_TCP, addr, ifname, port,
 			    tref.u32);
@@ -3105,40 +3091,99 @@ static void tcp_sock_init6(const struct ctx *c, int ns,
 		else
 			s = -1;
 
-		if (c->tcp.fwd_out.mode == FWD_AUTO) {
-			if (ns)
-				tcp_sock_ns[port][V6] = s;
-			else
-				tcp_sock_init_lo[port][V6] = s;
-		}
+		if (c->tcp.fwd_out.mode == FWD_AUTO)
+			tcp_sock_init_lo[port][V6] = s;
 	}
 }
 
 /**
  * tcp_sock_init() - Initialise listening sockets for a given port
  * @c:		Execution context
- * @ns:		In pasta mode, if set, bind with loopback address in namespace
  * @af:		Address family to select a specific IP version, or AF_UNSPEC
  * @addr:	Pointer to address for binding, NULL if not configured
  * @ifname:	Name of interface to bind to, NULL if not configured
  * @port:	Port, host order
  */
-void tcp_sock_init(const struct ctx *c, int ns, sa_family_t af,
-		   const void *addr, const char *ifname, in_port_t port)
+void tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
+		   const char *ifname, in_port_t port)
 {
 	if ((af == AF_INET  || af == AF_UNSPEC) && c->ifi4)
-		tcp_sock_init4(c, ns, addr, ifname, port);
+		tcp_sock_init4(c, addr, ifname, port);
 	if ((af == AF_INET6 || af == AF_UNSPEC) && c->ifi6)
-		tcp_sock_init6(c, ns, addr, ifname, port);
+		tcp_sock_init6(c, addr, ifname, port);
+}
+
+/**
+ * tcp_ns_sock_init4() - Init socket to listen for outbound IPv4 connections
+ * @c:		Execution context
+ * @port:	Port, host order
+ */
+static void tcp_ns_sock_init4(const struct ctx *c, in_port_t port)
+{
+	in_port_t idx = port + c->tcp.fwd_out.delta[port];
+	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = 1,
+				     .tcp.splice = 1, .tcp.index = idx };
+	struct in_addr loopback = { htonl(INADDR_LOOPBACK) };
+	int s;
+
+	assert(c->mode == MODE_PASTA);
+
+	s = sock_l4(c, AF_INET, IPPROTO_TCP, &loopback, NULL, port, tref.u32);
+	if (s >= 0)
+		tcp_sock_set_bufsize(c, s);
+	else
+		s = -1;
+
+	if (c->tcp.fwd_out.mode == FWD_AUTO)
+		tcp_sock_ns[port][V4] = s;
 }
 
 /**
- * tcp_sock_init_ns() - Bind sockets in namespace for outbound connections
+ * tcp_ns_sock_init6() - Init socket to listen for outbound IPv6 connections
+ * @c:		Execution context
+ * @port:	Port, host order
+ */
+static void tcp_ns_sock_init6(const struct ctx *c, in_port_t port)
+{
+	in_port_t idx = port + c->tcp.fwd_out.delta[port];
+	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = 1,
+				     .tcp.splice = 1, .tcp.v6 = 1,
+				     .tcp.index = idx};
+	int s;
+
+	assert(c->mode == MODE_PASTA);
+
+	s = sock_l4(c, AF_INET6, IPPROTO_TCP, &in6addr_loopback, NULL, port,
+		    tref.u32);
+	if (s >= 0)
+		tcp_sock_set_bufsize(c, s);
+	else
+		s = -1;
+
+	if (c->tcp.fwd_out.mode == FWD_AUTO)
+		tcp_sock_ns[port][V6] = s;
+}
+
+/**
+ * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections
+ * @c:		Execution context
+ * @port:	Port, host order
+ */
+void tcp_ns_sock_init(const struct ctx *c, in_port_t port)
+{
+	if (c->ifi4)
+		tcp_ns_sock_init4(c, port);
+	if (c->ifi6)
+		tcp_ns_sock_init6(c, port);
+}
+
+/**
+ * tcp_ns_socks_init() - Bind sockets in namespace for outbound connections
  * @arg:	Execution context
  *
  * Return: 0
  */
-static int tcp_sock_init_ns(void *arg)
+static int tcp_ns_socks_init(void *arg)
 {
 	struct ctx *c = (struct ctx *)arg;
 	unsigned port;
@@ -3149,7 +3194,7 @@ static int tcp_sock_init_ns(void *arg)
 		if (!bitmap_isset(c->tcp.fwd_out.map, port))
 			continue;
 
-		tcp_sock_init(c, 1, AF_UNSPEC, NULL, NULL, port);
+		tcp_ns_sock_init(c, port);
 	}
 
 	return 0;
@@ -3279,7 +3324,7 @@ int tcp_init(struct ctx *c)
 	if (c->mode == MODE_PASTA) {
 		tcp_splice_init(c);
 
-		NS_CALL(tcp_sock_init_ns, c);
+		NS_CALL(tcp_ns_socks_init, c);
 
 		refill_arg.ns = 1;
 		NS_CALL(tcp_sock_refill, &refill_arg);
@@ -3364,8 +3409,7 @@ static int tcp_port_rebind(void *arg)
 
 			if ((a->c->ifi4 && tcp_sock_ns[port][V4] == -1) ||
 			    (a->c->ifi6 && tcp_sock_ns[port][V6] == -1))
-				tcp_sock_init(a->c, 1, AF_UNSPEC, NULL, NULL,
-					      port);
+				tcp_ns_sock_init(a->c, port);
 		}
 	} else {
 		for (port = 0; port < NUM_PORTS; port++) {
@@ -3398,7 +3442,7 @@ static int tcp_port_rebind(void *arg)
 
 			if ((a->c->ifi4 && tcp_sock_init_ext[port][V4] == -1) ||
 			    (a->c->ifi6 && tcp_sock_init_ext[port][V6] == -1))
-				tcp_sock_init(a->c, 0, AF_UNSPEC, NULL, NULL,
+				tcp_sock_init(a->c, AF_UNSPEC, NULL, NULL,
 					      port);
 		}
 	}
diff --git a/tcp.h b/tcp.h
index 49738ef..f4ed298 100644
--- a/tcp.h
+++ b/tcp.h
@@ -19,8 +19,8 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
 		      const struct timespec *now);
 int tcp_tap_handler(struct ctx *c, int af, const void *addr,
 		    const struct pool *p, const struct timespec *now);
-void tcp_sock_init(const struct ctx *c, int ns, sa_family_t af,
-		   const void *addr, const char *ifname, in_port_t port);
+void tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
+		   const char *ifname, in_port_t port);
 int tcp_init(struct ctx *c);
 void tcp_timer(struct ctx *c, const struct timespec *ts);
 void tcp_defer_handler(struct ctx *c);
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 15/32] tcp: Unify part of spliced and non-spliced conn_from_sock path
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (13 preceding siblings ...)
  2022-11-16  4:41 ` [PATCH 14/32] tcp: Separate helpers to create ns listening sockets David Gibson
@ 2022-11-16  4:41 ` David Gibson
  2022-11-16 23:53   ` Stefano Brivio
  2022-11-16  4:41 ` [PATCH 16/32] tcp: Use the same sockets to listen for spliced and non-spliced connections David Gibson
                   ` (16 subsequent siblings)
  31 siblings, 1 reply; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:41 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

In tcp_sock_handler() we split off to handle spliced sockets before
checking anything else.  However the first steps of the "new connection"
path for each case are the same: allocate a connection entry and accept()
the connection.

Remove this duplication by making tcp_conn_from_sock() handle both spliced
and non-spliced cases, with help from more specific tcp_tap_conn_from_sock
and tcp_splice_conn_from_sock functions for the later stages which differ.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c        | 68 ++++++++++++++++++++++++++++++++++------------------
 tcp_splice.c | 58 +++++++++++++++++++++++---------------------
 tcp_splice.h |  4 ++++
 3 files changed, 80 insertions(+), 50 deletions(-)

diff --git a/tcp.c b/tcp.c
index 72d3b49..e66a82a 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2753,28 +2753,19 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
 }
 
 /**
- * tcp_conn_from_sock() - Handle new connection request from listening socket
+ * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection
  * @c:		Execution context
  * @ref:	epoll reference of listening socket
+ * @conn:	connection structure to initialize
+ * @s:		Accepted socket
+ * @sa:		Peer socket address (from accept())
  * @now:	Current timestamp
  */
-static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
-			       const struct timespec *now)
+static void tcp_tap_conn_from_sock(struct ctx *c, union epoll_ref ref,
+				   struct tcp_tap_conn *conn, int s,
+				   struct sockaddr *sa,
+				   const struct timespec *now)
 {
-	struct sockaddr_storage sa;
-	struct tcp_tap_conn *conn;
-	socklen_t sl;
-	int s;
-
-	if (c->tcp.conn_count >= TCP_MAX_CONNS)
-		return;
-
-	sl = sizeof(sa);
-	s = accept4(ref.r.s, (struct sockaddr *)&sa, &sl, SOCK_NONBLOCK);
-	if (s < 0)
-		return;
-
-	conn = CONN(c->tcp.conn_count++);
 	conn->c.spliced = false;
 	conn->sock = s;
 	conn->timer = -1;
@@ -2784,7 +2775,7 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
 	if (ref.r.p.tcp.tcp.v6) {
 		struct sockaddr_in6 sa6;
 
-		memcpy(&sa6, &sa, sizeof(sa6));
+		memcpy(&sa6, sa, sizeof(sa6));
 
 		if (IN6_IS_ADDR_LOOPBACK(&sa6.sin6_addr) ||
 		    IN6_ARE_ADDR_EQUAL(&sa6.sin6_addr, &c->ip6.addr_seen) ||
@@ -2813,7 +2804,7 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
 	} else {
 		struct sockaddr_in sa4;
 
-		memcpy(&sa4, &sa, sizeof(sa4));
+		memcpy(&sa4, sa, sizeof(sa4));
 
 		memset(&conn->a.a4.zero,   0, sizeof(conn->a.a4.zero));
 		memset(&conn->a.a4.one, 0xff, sizeof(conn->a.a4.one));
@@ -2846,6 +2837,37 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
 	tcp_get_sndbuf(conn);
 }
 
+/**
+ * tcp_conn_from_sock() - Handle new connection request from listening socket
+ * @c:		Execution context
+ * @ref:	epoll reference of listening socket
+ * @now:	Current timestamp
+ */
+static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
+			       const struct timespec *now)
+{
+	struct sockaddr_storage sa;
+	union tcp_conn *conn;
+	socklen_t sl;
+	int s;
+
+	if (c->tcp.conn_count >= TCP_MAX_CONNS)
+		return;
+
+	sl = sizeof(sa);
+	s = accept4(ref.r.s, (struct sockaddr *)&sa, &sl, SOCK_NONBLOCK);
+	if (s < 0)
+		return;
+
+	conn = tc + c->tcp.conn_count++;
+
+	if (ref.r.p.tcp.tcp.splice)
+		tcp_splice_conn_from_sock(c, ref, &conn->splice, s);
+	else
+		tcp_tap_conn_from_sock(c, ref, &conn->tap, s,
+				       (struct sockaddr *)&sa, now);
+}
+
 /**
  * tcp_timer_handler() - timerfd events: close, send ACK, retransmit, or reset
  * @c:		Execution context
@@ -2925,13 +2947,13 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
 		return;
 	}
 
-	if (ref.r.p.tcp.tcp.splice) {
-		tcp_sock_handler_splice(c, ref, events);
+	if (ref.r.p.tcp.tcp.listen) {
+		tcp_conn_from_sock(c, ref, now);
 		return;
 	}
 
-	if (ref.r.p.tcp.tcp.listen) {
-		tcp_conn_from_sock(c, ref, now);
+	if (ref.r.p.tcp.tcp.splice) {
+		tcp_sock_handler_splice(c, ref, events);
 		return;
 	}
 
diff --git a/tcp_splice.c b/tcp_splice.c
index 7a06252..7007501 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -501,6 +501,36 @@ static void tcp_splice_dir(struct tcp_splice_conn *conn, int ref_sock,
 	*pipes = *from == conn->a ? conn->pipe_a_b : conn->pipe_b_a;
 }
 
+/**
+ * tcp_splice_conn_from_sock() - Initialize state for spliced connection
+ * @c:		Execution context
+ * @ref:	epoll reference of listening socket
+ * @conn:	connection structure to initialize
+ * @s:		Accepted socket
+ *
+ * #syscalls:pasta setsockopt
+ */
+void tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
+			       struct tcp_splice_conn *conn, int s)
+{
+	assert(c->mode == MODE_PASTA);
+
+	if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }),
+		       sizeof(int))) {
+		trace("TCP (spliced): failed to set TCP_QUICKACK on %i",
+		      s);
+	}
+
+	conn->c.spliced = true;
+	c->tcp.splice_conn_count++;
+	conn->a = s;
+	conn->flags = ref.r.p.tcp.tcp.v6 ? SPLICE_V6 : 0;
+
+	if (tcp_splice_new(c, conn, ref.r.p.tcp.tcp.index,
+			   ref.r.p.tcp.tcp.outbound))
+		conn_flag(c, conn, CLOSING);
+}
+
 /**
  * tcp_sock_handler_splice() - Handler for socket mapped to spliced connection
  * @c:		Execution context
@@ -517,33 +547,7 @@ void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
 	uint32_t *seq_read, *seq_write;
 	struct tcp_splice_conn *conn;
 
-	if (ref.r.p.tcp.tcp.listen) {
-		int s;
-
-		if (c->tcp.conn_count >= TCP_MAX_CONNS)
-			return;
-
-		if ((s = accept4(ref.r.s, NULL, NULL, SOCK_NONBLOCK)) < 0)
-			return;
-
-		if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }),
-			       sizeof(int))) {
-			trace("TCP (spliced): failed to set TCP_QUICKACK on %i",
-			      s);
-		}
-
-		conn = CONN(c->tcp.conn_count++);
-		conn->c.spliced = true;
-		c->tcp.splice_conn_count++;
-		conn->a = s;
-		conn->flags = ref.r.p.tcp.tcp.v6 ? SPLICE_V6 : 0;
-
-		if (tcp_splice_new(c, conn, ref.r.p.tcp.tcp.index,
-				   ref.r.p.tcp.tcp.outbound))
-			conn_flag(c, conn, CLOSING);
-
-		return;
-	}
+	assert(!ref.r.p.tcp.tcp.listen);
 
 	conn = CONN(ref.r.p.tcp.tcp.index);
 
diff --git a/tcp_splice.h b/tcp_splice.h
index 22024d6..f9462ae 100644
--- a/tcp_splice.h
+++ b/tcp_splice.h
@@ -6,8 +6,12 @@
 #ifndef TCP_SPLICE_H
 #define TCP_SPLICE_H
 
+struct tcp_splice_conn;
+
 void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
 			     uint32_t events);
+void tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
+			       struct tcp_splice_conn *conn, int s);
 void tcp_splice_init(struct ctx *c);
 
 #endif /* TCP_SPLICE_H */
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 16/32] tcp: Use the same sockets to listen for spliced and non-spliced connections
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (14 preceding siblings ...)
  2022-11-16  4:41 ` [PATCH 15/32] tcp: Unify part of spliced and non-spliced conn_from_sock path David Gibson
@ 2022-11-16  4:41 ` David Gibson
  2022-11-16 23:54   ` Stefano Brivio
  2022-11-16  4:41 ` [PATCH 17/32] tcp: Remove splice from tcp_epoll_ref David Gibson
                   ` (15 subsequent siblings)
  31 siblings, 1 reply; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:41 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

In pasta mode, tcp_sock_init[46]() create separate sockets to listen for
spliced connections (these are bound to localhost) and non-spliced
connections (these are bound to the host address).  This introduces a
subtle behavioural difference between pasta and passt: by default, pasta
will listen only on a single host address, whereas passt will listen on
all addresses (0.0.0.0 or ::).  This also prevents us using some additional
optimizations that only work with the unspecified (0.0.0.0 or ::) address.

However, it turns out we don't need to do this.  We can splice a connection
if and only if it originates from the loopback address.  Currently we
ensure this by having the "spliced" listening sockets listening only on
loopback.  Instead, defer the decision about whether to splice a connection
until after accept(), by checking if the connection was made from the
loopback address.
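
As a rough illustration of the post-accept() check described above, the
following standalone sketch classifies a peer address as loopback or not.
The helper names peer_is_loopback() and text_is_loopback() are
hypothetical and not part of passt; the patch itself uses the
IN4_IS_ADDR_LOOPBACK() and IN6_IS_ADDR_LOOPBACK() macros directly on the
address returned by accept4().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper (not in passt): true if the peer address filled in
 * by accept() is a loopback address, that is, 127.0.0.0/8 or ::1.
 * @sa must point to storage large enough for its address family, for
 * example a struct sockaddr_storage.
 */
static bool peer_is_loopback(const struct sockaddr *sa)
{
	if (sa->sa_family == AF_INET6) {
		struct sockaddr_in6 sa6;

		memcpy(&sa6, sa, sizeof(sa6));
		return IN6_IS_ADDR_LOOPBACK(&sa6.sin6_addr);
	}

	if (sa->sa_family == AF_INET) {
		struct sockaddr_in sa4;

		memcpy(&sa4, sa, sizeof(sa4));
		/* Top octet 127 means the address is in 127.0.0.0/8 */
		return (ntohl(sa4.sin_addr.s_addr) >> 24) == 127;
	}

	return false;
}

/* Hypothetical convenience wrapper for experimenting: parse @text as an
 * address of family @af and classify it.  Returns false on parse errors.
 */
static bool text_is_loopback(int af, const char *text)
{
	struct sockaddr_storage ss;

	memset(&ss, 0, sizeof(ss));
	ss.ss_family = af;

	if (af == AF_INET6) {
		struct sockaddr_in6 *sa6 = (struct sockaddr_in6 *)&ss;

		if (inet_pton(AF_INET6, text, &sa6->sin6_addr) != 1)
			return false;
	} else if (af == AF_INET) {
		struct sockaddr_in *sa4 = (struct sockaddr_in *)&ss;

		if (inet_pton(AF_INET, text, &sa4->sin_addr) != 1)
			return false;
	} else {
		return false;
	}

	return peer_is_loopback((const struct sockaddr *)&ss);
}
```

With a check of this kind, a single listening socket bound to the
unspecified address can serve both paths: connections whose peer is
loopback are candidates for splicing, and everything else goes through
the tap path.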

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c        | 127 +++++++++++++--------------------------------------
 tcp_splice.c |  25 ++++++++--
 tcp_splice.h |   5 +-
 3 files changed, 55 insertions(+), 102 deletions(-)

diff --git a/tcp.c b/tcp.c
index e66a82a..4065da7 100644
--- a/tcp.c
+++ b/tcp.c
@@ -434,7 +434,6 @@ static const char *tcp_flag_str[] __attribute((__unused__)) = {
 };
 
 /* Listening sockets, used for automatic port forwarding in pasta mode only */
-static int tcp_sock_init_lo	[NUM_PORTS][IP_VERSIONS];
 static int tcp_sock_init_ext	[NUM_PORTS][IP_VERSIONS];
 static int tcp_sock_ns		[NUM_PORTS][IP_VERSIONS];
 
@@ -2851,21 +2850,31 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
 	socklen_t sl;
 	int s;
 
+	assert(ref.r.p.tcp.tcp.listen);
+	assert(!ref.r.p.tcp.tcp.splice);
+
 	if (c->tcp.conn_count >= TCP_MAX_CONNS)
 		return;
 
 	sl = sizeof(sa);
+	/* FIXME: Workaround clang-tidy not realizing that accept4()
+	 * writes the socket address.  See
+	 * https://github.com/llvm/llvm-project/issues/58992
+	 */
+	memset(&sa, 0, sizeof(struct sockaddr_in6));
 	s = accept4(ref.r.s, (struct sockaddr *)&sa, &sl, SOCK_NONBLOCK);
 	if (s < 0)
 		return;
 
 	conn = tc + c->tcp.conn_count++;
 
-	if (ref.r.p.tcp.tcp.splice)
-		tcp_splice_conn_from_sock(c, ref, &conn->splice, s);
-	else
-		tcp_tap_conn_from_sock(c, ref, &conn->tap, s,
-				       (struct sockaddr *)&sa, now);
+	if (c->mode == MODE_PASTA &&
+	    tcp_splice_conn_from_sock(c, ref, &conn->splice,
+				      s, (struct sockaddr *)&sa))
+		return;
+
+	tcp_tap_conn_from_sock(c, ref, &conn->tap, s,
+			       (struct sockaddr *)&sa, now);
 }
 
 /**
@@ -3018,47 +3027,16 @@ static void tcp_sock_init4(const struct ctx *c, const struct in_addr *addr,
 {
 	in_port_t idx = port + c->tcp.fwd_in.delta[port];
 	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.index = idx };
-	bool spliced = false, tap = true;
 	int s;
 
-	if (c->mode == MODE_PASTA) {
-		spliced = !addr || IN4_IS_ADDR_UNSPECIFIED(addr) ||
-			IN4_IS_ADDR_LOOPBACK(addr);
-
-		if (!addr)
-			addr = &c->ip4.addr;
-
-		tap = !IN4_IS_ADDR_LOOPBACK(addr);
-	}
-
-	if (tap) {
-		s = sock_l4(c, AF_INET, IPPROTO_TCP, addr, ifname, port,
-			    tref.u32);
-		if (s >= 0)
-			tcp_sock_set_bufsize(c, s);
-		else
-			s = -1;
-
-		if (c->tcp.fwd_in.mode == FWD_AUTO)
-			tcp_sock_init_ext[port][V4] = s;
-	}
-
-	if (spliced) {
-		struct in_addr loopback = { htonl(INADDR_LOOPBACK) };
-		tref.tcp.splice = 1;
-
-		addr = &loopback;
-
-		s = sock_l4(c, AF_INET, IPPROTO_TCP, addr, ifname, port,
-			    tref.u32);
-		if (s >= 0)
-			tcp_sock_set_bufsize(c, s);
-		else
-			s = -1;
+	s = sock_l4(c, AF_INET, IPPROTO_TCP, addr, ifname, port, tref.u32);
+	if (s >= 0)
+		tcp_sock_set_bufsize(c, s);
+	else
+		s = -1;
 
-		if (c->tcp.fwd_out.mode == FWD_AUTO)
-			tcp_sock_init_lo[port][V4] = s;
-	}
+	if (c->tcp.fwd_in.mode == FWD_AUTO)
+		tcp_sock_init_ext[port][V4] = s;
 }
 
 /**
@@ -3075,47 +3053,16 @@ static void tcp_sock_init6(const struct ctx *c,
 	in_port_t idx = port + c->tcp.fwd_in.delta[port];
 	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.v6 = 1,
 				     .tcp.index = idx	};
-	bool spliced = false, tap = true;
 	int s;
 
-	if (c->mode == MODE_PASTA) {
-		spliced = !addr ||
-			  IN6_IS_ADDR_UNSPECIFIED(addr) ||
-			  IN6_IS_ADDR_LOOPBACK(addr);
-
-		if (!addr)
-			addr = &c->ip6.addr;
-
-		tap = !IN6_IS_ADDR_LOOPBACK(addr);
-	}
-
-	if (tap) {
-		s = sock_l4(c, AF_INET6, IPPROTO_TCP, addr, ifname, port,
-			    tref.u32);
-		if (s >= 0)
-			tcp_sock_set_bufsize(c, s);
-		else
-			s = -1;
-
-		if (c->tcp.fwd_in.mode == FWD_AUTO)
-			tcp_sock_init_ext[port][V6] = s;
-	}
-
-	if (spliced) {
-		tref.tcp.splice = 1;
-
-		addr = &in6addr_loopback;
-
-		s = sock_l4(c, AF_INET6, IPPROTO_TCP, addr, ifname, port,
-			    tref.u32);
-		if (s >= 0)
-			tcp_sock_set_bufsize(c, s);
-		else
-			s = -1;
+	s = sock_l4(c, AF_INET6, IPPROTO_TCP, addr, ifname, port, tref.u32);
+	if (s >= 0)
+		tcp_sock_set_bufsize(c, s);
+	else
+		s = -1;
 
-		if (c->tcp.fwd_out.mode == FWD_AUTO)
-			tcp_sock_init_lo[port][V6] = s;
-	}
+	if (c->tcp.fwd_in.mode == FWD_AUTO)
+		tcp_sock_init_ext[port][V6] = s;
 }
 
 /**
@@ -3144,7 +3091,7 @@ static void tcp_ns_sock_init4(const struct ctx *c, in_port_t port)
 {
 	in_port_t idx = port + c->tcp.fwd_out.delta[port];
 	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = 1,
-				     .tcp.splice = 1, .tcp.index = idx };
+				     .tcp.index = idx };
 	struct in_addr loopback = { htonl(INADDR_LOOPBACK) };
 	int s;
 
@@ -3169,8 +3116,7 @@ static void tcp_ns_sock_init6(const struct ctx *c, in_port_t port)
 {
 	in_port_t idx = port + c->tcp.fwd_out.delta[port];
 	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = 1,
-				     .tcp.splice = 1, .tcp.v6 = 1,
-				     .tcp.index = idx};
+				     .tcp.v6 = 1, .tcp.index = idx};
 	int s;
 
 	assert(c->mode == MODE_PASTA);
@@ -3337,7 +3283,6 @@ int tcp_init(struct ctx *c)
 	memset(init_sock_pool6,		0xff,	sizeof(init_sock_pool6));
 	memset(ns_sock_pool4,		0xff,	sizeof(ns_sock_pool4));
 	memset(ns_sock_pool6,		0xff,	sizeof(ns_sock_pool6));
-	memset(tcp_sock_init_lo,	0xff,	sizeof(tcp_sock_init_lo));
 	memset(tcp_sock_init_ext,	0xff,	sizeof(tcp_sock_init_ext));
 	memset(tcp_sock_ns,		0xff,	sizeof(tcp_sock_ns));
 
@@ -3445,16 +3390,6 @@ static int tcp_port_rebind(void *arg)
 					close(tcp_sock_init_ext[port][V6]);
 					tcp_sock_init_ext[port][V6] = -1;
 				}
-
-				if (tcp_sock_init_lo[port][V4] >= 0) {
-					close(tcp_sock_init_lo[port][V4]);
-					tcp_sock_init_lo[port][V4] = -1;
-				}
-
-				if (tcp_sock_init_lo[port][V6] >= 0) {
-					close(tcp_sock_init_lo[port][V6]);
-					tcp_sock_init_lo[port][V6] = -1;
-				}
 				continue;
 			}
 
diff --git a/tcp_splice.c b/tcp_splice.c
index 7007501..30d49d4 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -502,19 +502,35 @@ static void tcp_splice_dir(struct tcp_splice_conn *conn, int ref_sock,
 }
 
 /**
- * tcp_splice_conn_from_sock() - Initialize state for spliced connection
+ * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection
  * @c:		Execution context
  * @ref:	epoll reference of listening socket
  * @conn:	connection structure to initialize
  * @s:		Accepted socket
+ * @sa:		Peer address of connection
  *
+ * Return: true if able to create a spliced connection, false otherwise
  * #syscalls:pasta setsockopt
  */
-void tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
-			       struct tcp_splice_conn *conn, int s)
+bool tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
+			       struct tcp_splice_conn *conn, int s,
+			       const struct sockaddr *sa)
 {
 	assert(c->mode == MODE_PASTA);
 
+	if (ref.r.p.tcp.tcp.v6) {
+		const struct sockaddr_in6 *sa6
+			= (const struct sockaddr_in6 *)sa;
+		if (!IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr))
+			return false;
+		conn->flags = SPLICE_V6;
+	} else {
+		const struct sockaddr_in *sa4 = (const struct sockaddr_in *)sa;
+		if (!IN4_IS_ADDR_LOOPBACK(&sa4->sin_addr))
+			return false;
+		conn->flags = 0;
+	}
+
 	if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }),
 		       sizeof(int))) {
 		trace("TCP (spliced): failed to set TCP_QUICKACK on %i",
@@ -524,11 +540,12 @@ void tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
 	conn->c.spliced = true;
 	c->tcp.splice_conn_count++;
 	conn->a = s;
-	conn->flags = ref.r.p.tcp.tcp.v6 ? SPLICE_V6 : 0;
 
 	if (tcp_splice_new(c, conn, ref.r.p.tcp.tcp.index,
 			   ref.r.p.tcp.tcp.outbound))
 		conn_flag(c, conn, CLOSING);
+
+	return true;
 }
 
 /**
diff --git a/tcp_splice.h b/tcp_splice.h
index f9462ae..1a915dd 100644
--- a/tcp_splice.h
+++ b/tcp_splice.h
@@ -10,8 +10,9 @@ struct tcp_splice_conn;
 
 void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
 			     uint32_t events);
-void tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
-			       struct tcp_splice_conn *conn, int s);
+bool tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
+			       struct tcp_splice_conn *conn, int s,
+			       const struct sockaddr *sa);
 void tcp_splice_init(struct ctx *c);
 
 #endif /* TCP_SPLICE_H */
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 17/32] tcp: Remove splice from tcp_epoll_ref
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (15 preceding siblings ...)
  2022-11-16  4:41 ` [PATCH 16/32] tcp: Use the same sockets to listen for spliced and non-spliced connections David Gibson
@ 2022-11-16  4:41 ` David Gibson
  2022-11-16  4:41 ` [PATCH 18/32] tcp: Don't store hash bucket in connection structures David Gibson
                   ` (14 subsequent siblings)
  31 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:41 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Currently the epoll reference for TCP sockets includes a bit indicating
whether the socket maps to a spliced connection.  However, the reference
also carries the index of the connection structure, which itself records
whether the connection is spliced.  We can therefore drop the splice bit
from the epoll_ref by unifying the first part of the non-spliced and
spliced handlers, where we look up the connection state.
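
The idea can be sketched in isolation as below.  This is a simplified
model, not the passt code: the type names (conn_common, tap_conn,
splice_conn, union conn), the table size and the dispatch() helper are
all hypothetical stand-ins.  It only shows that a flag stored in a common
header of the connection entry is enough to pick the handler from the
index alone.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the connection types: both
 * variants start with the same common header, so the header can be read
 * through any member of the union.
 */
struct conn_common {
	bool spliced;
};

struct tap_conn {
	struct conn_common c;
	int sock;
};

struct splice_conn {
	struct conn_common c;
	int a, b;
};

union conn {
	struct conn_common c;
	struct tap_conn tap;
	struct splice_conn splice;
};

enum handler { HANDLER_TAP, HANDLER_SPLICE };

/* Hypothetical connection table, indexed by the value carried in the
 * epoll reference
 */
static union conn table[4];

/* Pick the handler from the table entry, with no separate "splice" bit
 * needed in the epoll reference
 */
static enum handler dispatch(unsigned index)
{
	const union conn *conn = table + index;

	return conn->c.spliced ? HANDLER_SPLICE : HANDLER_TAP;
}
```

The trade-off is one extra memory access to the connection table before
dispatching, in exchange for freeing a bit in the epoll reference.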

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c        | 60 +++++++++++++++++++++++++++++-----------------------
 tcp.h        |  2 --
 tcp_splice.c | 26 +++++++++--------------
 tcp_splice.h |  4 ++--
 4 files changed, 46 insertions(+), 46 deletions(-)

diff --git a/tcp.c b/tcp.c
index 4065da7..e46330e 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2851,7 +2851,6 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
 	int s;
 
 	assert(ref.r.p.tcp.tcp.listen);
-	assert(!ref.r.p.tcp.tcp.splice);
 
 	if (c->tcp.conn_count >= TCP_MAX_CONNS)
 		return;
@@ -2940,35 +2939,14 @@ static void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
 }
 
 /**
- * tcp_sock_handler() - Handle new data from socket, or timerfd event
+ * tcp_tap_sock_handler() - Handle new data from non-spliced socket
  * @c:		Execution context
- * @ref:	epoll reference
+ * @conn:	Connection state
  * @events:	epoll events bitmap
- * @now:	Current timestamp
  */
-void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
-		      const struct timespec *now)
+static void tcp_tap_sock_handler(struct ctx *c, struct tcp_tap_conn *conn,
+				 uint32_t events)
 {
-	struct tcp_tap_conn *conn;
-
-	if (ref.r.p.tcp.tcp.timer) {
-		tcp_timer_handler(c, ref);
-		return;
-	}
-
-	if (ref.r.p.tcp.tcp.listen) {
-		tcp_conn_from_sock(c, ref, now);
-		return;
-	}
-
-	if (ref.r.p.tcp.tcp.splice) {
-		tcp_sock_handler_splice(c, ref, events);
-		return;
-	}
-
-	if (!(conn = conn_at_idx(ref.r.p.tcp.tcp.index)))
-		return;
-
 	if (conn->events == CLOSED)
 		return;
 
@@ -3015,6 +2993,36 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
 	}
 }
 
+/**
+ * tcp_sock_handler() - Handle new data from socket, or timerfd event
+ * @c:		Execution context
+ * @ref:	epoll reference
+ * @events:	epoll events bitmap
+ * @now:	Current timestamp
+ */
+void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
+		      const struct timespec *now)
+{
+	union tcp_conn *conn;
+
+	if (ref.r.p.tcp.tcp.timer) {
+		tcp_timer_handler(c, ref);
+		return;
+	}
+
+	if (ref.r.p.tcp.tcp.listen) {
+		tcp_conn_from_sock(c, ref, now);
+		return;
+	}
+
+	conn = tc + ref.r.p.tcp.tcp.index;
+
+	if (conn->c.spliced)
+		tcp_splice_sock_handler(c, &conn->splice, ref.r.s, events);
+	else
+		tcp_tap_sock_handler(c, &conn->tap, events);
+}
+
 /**
  * tcp_sock_init4() - Initialise listening sockets for a given IPv4 port
  * @c:		Execution context
diff --git a/tcp.h b/tcp.h
index f4ed298..a940682 100644
--- a/tcp.h
+++ b/tcp.h
@@ -32,7 +32,6 @@ void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s,
 /**
  * union tcp_epoll_ref - epoll reference portion for TCP connections
  * @listen:		Set if this file descriptor is a listening socket
- * @splice:		Set if descriptor is associated to a spliced connection
  * @outbound:		Listening socket maps to outbound, spliced connection
  * @v6:			Set for IPv6 sockets or connections
  * @timer:		Reference is a timerfd descriptor for connection
@@ -42,7 +41,6 @@ void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s,
 union tcp_epoll_ref {
 	struct {
 		uint32_t	listen:1,
-				splice:1,
 				outbound:1,
 				v6:1,
 				timer:1,
diff --git a/tcp_splice.c b/tcp_splice.c
index 30d49d4..2852e76 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -166,11 +166,9 @@ static int tcp_splice_epoll_ctl(const struct ctx *c,
 {
 	int m = conn->c.in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
 	union epoll_ref ref_a = { .r.proto = IPPROTO_TCP, .r.s = conn->a,
-				  .r.p.tcp.tcp.splice = 1,
 				  .r.p.tcp.tcp.index = CONN_IDX(conn),
 				  .r.p.tcp.tcp.v6 = CONN_V6(conn) };
 	union epoll_ref ref_b = { .r.proto = IPPROTO_TCP, .r.s = conn->b,
-				  .r.p.tcp.tcp.splice = 1,
 				  .r.p.tcp.tcp.index = CONN_IDX(conn),
 				  .r.p.tcp.tcp.v6 = CONN_V6(conn) };
 	struct epoll_event ev_a = { .data.u64 = ref_a.u64 };
@@ -549,24 +547,20 @@ bool tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
 }
 
 /**
- * tcp_sock_handler_splice() - Handler for socket mapped to spliced connection
+ * tcp_splice_sock_handler() - Handler for socket mapped to spliced connection
  * @c:		Execution context
- * @ref:	epoll reference
+ * @conn:	Connection state
+ * @s:		Socket fd on which an event has occurred
  * @events:	epoll events bitmap
  *
  * #syscalls:pasta splice
  */
-void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
-			     uint32_t events)
+void tcp_splice_sock_handler(struct ctx *c, struct tcp_splice_conn *conn,
+			     int s, uint32_t events)
 {
 	uint8_t lowat_set_flag, lowat_act_flag;
 	int from, to, *pipes, eof, never_read;
 	uint32_t *seq_read, *seq_write;
-	struct tcp_splice_conn *conn;
-
-	assert(!ref.r.p.tcp.tcp.listen);
-
-	conn = CONN(ref.r.p.tcp.tcp.index);
 
 	if (conn->events == SPLICE_CLOSED)
 		return;
@@ -582,25 +576,25 @@ void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
 	}
 
 	if (events & EPOLLOUT) {
-		if (ref.r.s == conn->a)
+		if (s == conn->a)
 			conn_event(c, conn, ~A_OUT_WAIT);
 		else
 			conn_event(c, conn, ~B_OUT_WAIT);
 
-		tcp_splice_dir(conn, ref.r.s, 1, &from, &to, &pipes);
+		tcp_splice_dir(conn, s, 1, &from, &to, &pipes);
 	} else {
-		tcp_splice_dir(conn, ref.r.s, 0, &from, &to, &pipes);
+		tcp_splice_dir(conn, s, 0, &from, &to, &pipes);
 	}
 
 	if (events & EPOLLRDHUP) {
-		if (ref.r.s == conn->a)
+		if (s == conn->a)
 			conn_event(c, conn, A_FIN_RCVD);
 		else
 			conn_event(c, conn, B_FIN_RCVD);
 	}
 
 	if (events & EPOLLHUP) {
-		if (ref.r.s == conn->a)
+		if (s == conn->a)
 			conn_event(c, conn, A_FIN_SENT); /* Fake, but implied */
 		else
 			conn_event(c, conn, B_FIN_SENT);
diff --git a/tcp_splice.h b/tcp_splice.h
index 1a915dd..6814ae7 100644
--- a/tcp_splice.h
+++ b/tcp_splice.h
@@ -8,8 +8,8 @@
 
 struct tcp_splice_conn;
 
-void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
-			     uint32_t events);
+void tcp_splice_sock_handler(struct ctx *c, struct tcp_splice_conn *conn,
+			     int s, uint32_t events);
 bool tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
 			       struct tcp_splice_conn *conn, int s,
 			       const struct sockaddr *sa);
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 18/32] tcp: Don't store hash bucket in connection structures
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (16 preceding siblings ...)
  2022-11-16  4:41 ` [PATCH 17/32] tcp: Remove splice from tcp_epoll_ref David Gibson
@ 2022-11-16  4:41 ` David Gibson
  2022-11-16  4:41 ` [PATCH 19/32] inany: Helper functions for handling addresses which could be IPv4 or IPv6 David Gibson
                   ` (13 subsequent siblings)
  31 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:41 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Currently when we insert a connection into the hash table, we store its
bucket number so we can find it when removing entries.  However, we can
recompute the hash value from other contents of the structure so we don't
need to store it.  This brings the size of tcp_tap_conn down to 64 bytes
again, which means it will fit in a single cacheline on common machines.

This change also removes a non-obvious constraint that the hash table have
fewer than twice TCP_MAX_CONNS buckets, which arose from the way
TCP_HASH_BUCKET_BITS was constructed.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
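Note (not part of the patch): a minimal sketch of the idea, assuming a
cut-down connection structure.  The names struct conn, conn_hash(),
conn_insert(), conn_remove() and TABLE_SIZE are hypothetical, and the toy
hash only stands in for the siphash-based tcp_hash().  It shows that removal
can find the bucket by rehashing the key fields, so no bucket index needs to
be stored per connection.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 64	/* hypothetical, stands in for TCP_HASH_TABLE_SIZE */

/* Hypothetical, cut-down connection: only the fields the hash depends on */
struct conn {
	uint32_t addr;
	uint16_t tap_port, sock_port;
	struct conn *next;	/* chain link within a bucket */
};

static struct conn *table[TABLE_SIZE];

/* Toy hash, NOT siphash: mixes the key fields, returns a bucket index */
static unsigned int conn_hash(const struct conn *c)
{
	uint32_t h = c->addr * 2654435761u;

	h ^= ((uint32_t)c->tap_port << 16) | c->sock_port;
	h *= 2246822519u;
	return (h ^ (h >> 15)) % TABLE_SIZE;
}

/* Insert at the head of the bucket chain, without recording the bucket */
static void conn_insert(struct conn *c)
{
	unsigned int b;

	assert(c != NULL);
	b = conn_hash(c);
	c->next = table[b];
	table[b] = c;
}

/* Remove without a stored bucket: recompute it from the key fields.
 * Return: 1 if the connection was found and unlinked, 0 otherwise
 */
static int conn_remove(const struct conn *c)
{
	struct conn **pp;

	for (pp = &table[conn_hash(c)]; *pp; pp = &(*pp)->next) {
		if (*pp == c) {
			*pp = c->next;
			return 1;
		}
	}
	return 0;
}
```

The trade-off is one extra hash computation on removal, in exchange for a
smaller per-connection structure.  This only works as long as the key fields
are not modified while the connection is in the table.
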
 tcp.c      | 29 ++++++++++++++++++++++++-----
 tcp_conn.h |  5 -----
 2 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/tcp.c b/tcp.c
index e46330e..7686766 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1243,6 +1243,24 @@ static unsigned int tcp_hash(const struct ctx *c, int af, const void *addr,
 	return (unsigned int)(b % TCP_HASH_TABLE_SIZE);
 }
 
+/**
+ * tcp_conn_hash() - Calculate hash bucket of an existing connection
+ * @c:		Execution context
+ * @conn:	Connection
+ *
+ * Return: hash value, already modulo size of the hash table
+ */
+static unsigned int tcp_conn_hash(const struct ctx *c,
+				  const struct tcp_tap_conn *conn)
+{
+	if (CONN_V6(conn))
+		return tcp_hash(c, AF_INET6, &conn->a.a6,
+				conn->tap_port, conn->sock_port);
+	else
+		return tcp_hash(c, AF_INET, &conn->a.a4.a,
+				conn->tap_port, conn->sock_port);
+}
+
 /**
  * tcp_hash_insert() - Insert connection into hash table, chain link
  * @c:		Execution context
@@ -1258,7 +1276,6 @@ static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn,
 	b = tcp_hash(c, af, addr, conn->tap_port, conn->sock_port);
 	conn->next_index = tc_hash[b] ? CONN_IDX(tc_hash[b]) : -1;
 	tc_hash[b] = conn;
-	conn->hash_bucket = b;
 
 	debug("TCP: hash table insert: index %li, sock %i, bucket: %i, next: "
 	      "%p", CONN_IDX(conn), conn->sock, b, conn_at_idx(conn->next_index));
@@ -1266,12 +1283,14 @@ static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn,
 
 /**
  * tcp_hash_remove() - Drop connection from hash table, chain unlink
+ * @c:		Execution context
  * @conn:	Connection pointer
  */
-static void tcp_hash_remove(const struct tcp_tap_conn *conn)
+static void tcp_hash_remove(const struct ctx *c,
+			    const struct tcp_tap_conn *conn)
 {
 	struct tcp_tap_conn *entry, *prev = NULL;
-	int b = conn->hash_bucket;
+	int b = tcp_conn_hash(c, conn);
 
 	for (entry = tc_hash[b]; entry;
 	     prev = entry, entry = conn_at_idx(entry->next_index)) {
@@ -1299,7 +1318,7 @@ static void tcp_tap_conn_update(struct ctx *c, struct tcp_tap_conn *old,
 				struct tcp_tap_conn *new)
 {
 	struct tcp_tap_conn *entry, *prev = NULL;
-	int b = old->hash_bucket;
+	int b = tcp_conn_hash(c, old);
 
 	for (entry = tc_hash[b]; entry;
 	     prev = entry, entry = conn_at_idx(entry->next_index)) {
@@ -1387,7 +1406,7 @@ static void tcp_conn_destroy(struct ctx *c, struct tcp_tap_conn *conn)
 	if (conn->timer != -1)
 		close(conn->timer);
 
-	tcp_hash_remove(conn);
+	tcp_hash_remove(c, conn);
 	tcp_table_compact(c, (union tcp_conn *)conn);
 }
 
diff --git a/tcp_conn.h b/tcp_conn.h
index faa63dc..4bffe9a 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -9,8 +9,6 @@
 #ifndef TCP_CONN_H
 #define TCP_CONN_H
 
-#define TCP_HASH_BUCKET_BITS		(TCP_CONN_INDEX_BITS + 1)
-
 /**
  * struct tcp_conn_common - Common fields for spliced and non-spliced
  * @spliced:		Is this a spliced connection?
@@ -32,7 +30,6 @@ extern const char *tcp_common_flag_str[];
  * @events:		Connection events, implying connection states
  * @timer:		timerfd descriptor for timeout events
  * @flags:		Connection flags representing internal attributes
- * @hash_bucket:	Bucket index in connection lookup hash table
  * @retrans:		Number of retransmissions occurred due to ACK_TIMEOUT
  * @ws_from_tap:	Window scaling factor advertised from tap/guest
  * @ws_to_tap:		Window scaling factor advertised to tap/guest
@@ -97,8 +94,6 @@ struct tcp_tap_conn {
 #define ACK_FROM_TAP_DUE	BIT(5)
 
 
-	unsigned int	hash_bucket	:TCP_HASH_BUCKET_BITS;
-
 #define TCP_MSS_BITS			14
 	unsigned int	tap_mss		:TCP_MSS_BITS;
 #define MSS_SET(conn, mss)	(conn->tap_mss = (mss >> (16 - TCP_MSS_BITS)))
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 19/32] inany: Helper functions for handling addresses which could be IPv4 or IPv6
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (17 preceding siblings ...)
  2022-11-16  4:41 ` [PATCH 18/32] tcp: Don't store hash bucket in connection structures David Gibson
@ 2022-11-16  4:41 ` David Gibson
  2022-11-16 23:54   ` Stefano Brivio
  2022-11-16  4:42 ` [PATCH 20/32] tcp: Hash IPv4 and IPv4-mapped-IPv6 addresses the same David Gibson
                   ` (12 subsequent siblings)
  31 siblings, 1 reply; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:41 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

struct tcp_conn stores an address which could be IPv6 or IPv4 using a
union.  We can do this without an additional tag by encoding IPv4 addresses
as IPv4-mapped IPv6 addresses.

This approach is useful more widely than this one place in tcp_conn, so
expose it as a new 'union inany_addr' in a new inany.h.  Along with that,
create a number of helper functions to make working with these "inany"
addresses easier.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
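Note (not part of the patch): a self-contained sketch of how the new type
might be used.  The union and the three helpers are adapted from the inany.h
added below, with the system headers they rely on included explicitly.
inany_demo_pton() is a hypothetical helper written for this example only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Adapted from inany.h in this patch */
union inany_addr {
	struct in6_addr a6;
	struct {
		uint8_t _zero[10];
		uint8_t _one[2];
		struct in_addr a4;
	} _v4mapped;
};

/* Return: IPv4 address if @addr is IPv4-mapped, NULL otherwise */
static inline const struct in_addr *inany_v4(const union inany_addr *addr)
{
	if (!IN6_IS_ADDR_V4MAPPED(&addr->a6))
		return NULL;
	return &addr->_v4mapped.a4;
}

static inline bool inany_equals(const union inany_addr *a,
				const union inany_addr *b)
{
	return IN6_ARE_ADDR_EQUAL(&a->a6, &b->a6);
}

static inline void inany_from_af(union inany_addr *aa, int af, const void *addr)
{
	if (af == AF_INET6) {
		aa->a6 = *(const struct in6_addr *)addr;
	} else if (af == AF_INET) {
		memset(&aa->_v4mapped._zero, 0, sizeof(aa->_v4mapped._zero));
		memset(&aa->_v4mapped._one, 0xff, sizeof(aa->_v4mapped._one));
		aa->_v4mapped.a4 = *(const struct in_addr *)addr;
	} else {
		assert(0);	/* Not valid for other address families */
	}
}

/* Hypothetical demo helper: parse a textual address of either family.
 * Return: 0 on success, -1 if @s is neither a valid IPv4 nor IPv6 address
 */
static int inany_demo_pton(const char *s, union inany_addr *aa)
{
	struct in6_addr a6;
	struct in_addr a4;

	if (inet_pton(AF_INET, s, &a4) == 1) {
		inany_from_af(aa, AF_INET, &a4);
		return 0;
	}
	if (inet_pton(AF_INET6, s, &a6) == 1) {
		inany_from_af(aa, AF_INET6, &a6);
		return 0;
	}
	return -1;
}
```

With this, "192.0.2.1" and "::ffff:192.0.2.1" parse to the same inany_addr,
and inany_v4() returns NULL only for addresses that are not IPv4-mapped.
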
 Makefile     |  6 ++--
 inany.h      | 68 ++++++++++++++++++++++++++++++++++++++++
 tcp.c        | 88 +++++++++++++++++++++++++---------------------------
 tcp_conn.h   | 15 ++-------
 tcp_splice.c |  1 +
 5 files changed, 117 insertions(+), 61 deletions(-)
 create mode 100644 inany.h

diff --git a/Makefile b/Makefile
index 9046b0b..ca453aa 100644
--- a/Makefile
+++ b/Makefile
@@ -44,9 +44,9 @@ SRCS = $(PASST_SRCS) $(QRAP_SRCS)
 MANPAGES = passt.1 pasta.1 qrap.1
 
 PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h icmp.h \
-	isolation.h lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h \
-	pcap.h port_fwd.h siphash.h tap.h tcp.h tcp_conn.h tcp_splice.h udp.h \
-	util.h
+	inany.h isolation.h lineread.h log.h ndp.h netlink.h packet.h passt.h \
+	pasta.h pcap.h port_fwd.h siphash.h tap.h tcp.h tcp_conn.h \
+	tcp_splice.h udp.h util.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 # On gcc 11 and 12, with -O2 and -flto, tcp_hash() and siphash_20b(), if
diff --git a/inany.h b/inany.h
new file mode 100644
index 0000000..4e53da9
--- /dev/null
+++ b/inany.h
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: AGPL-3.0-or-later
+ * Copyright Red Hat
+ * Author: David Gibson <david@gibson.dropbear.id.au>
+ *
+ * inany.h - Types and helpers for handling addresses which could be
+ *           IPv6 or IPv4 (encoded as IPv4-mapped IPv6 addresses)
+ */
+
+#include <assert.h>
+
+/** union inany_addr - Represents either an IPv4 or IPv6 address
+ * @a6:		Address as an IPv6 address, may be IPv4-mapped
+ * @_v4mapped._zero:	All zero-bits for an IPv4 address
+ * @_v4mapped._one:	All one-bits for an IPv4 address
+ * @_v4mapped.a4:	If @a6 is an IPv4 mapped address, this is the raw IPv4 address
+ *
+ * Fields starting with _ shouldn't be accessed except via helpers.
+ */
+union inany_addr {
+	struct in6_addr a6;
+	struct {
+		uint8_t _zero[10];
+		uint8_t _one[2];
+		struct in_addr a4;
+	} _v4mapped;
+};
+
+/** inany_v4 - Extract IPv4 address, if present, from IPv[46] address
+ * @addr:	IPv4 or IPv6 address
+ *
+ * Return: IPv4 address if @addr is IPv4, NULL otherwise
+ */
+static inline const struct in_addr *inany_v4(const union inany_addr *addr)
+{
+	if (!IN6_IS_ADDR_V4MAPPED(&addr->a6))
+		return NULL;
+	return &addr->_v4mapped.a4;
+}
+
+/** inany_equals - Compare two IPv[46] addresses
+ * @a, @b:	IPv[46] addresses
+ *
+ * Return: true if @a and @b are the same address
+ */
+static inline bool inany_equals(const union inany_addr *a,
+				const union inany_addr *b)
+{
+	return IN6_ARE_ADDR_EQUAL(&a->a6, &b->a6);
+}
+
+/** inany_from_af - Set IPv[46] address from IPv4 or IPv6 address
+ * @aa:		Pointer to store IPv[46] address
+ * @af:		Address family of @addr
+ * @addr:	struct in_addr (IPv4) or struct in6_addr (IPv6)
+ */
+static inline void inany_from_af(union inany_addr *aa, int af, const void *addr)
+{
+	if (af == AF_INET6) {
+		aa->a6 = *((struct in6_addr *)addr);
+	} else if (af == AF_INET) {
+		memset(&aa->_v4mapped._zero, 0, sizeof(aa->_v4mapped._zero));
+		memset(&aa->_v4mapped._one, 0xff, sizeof(aa->_v4mapped._one));
+		aa->_v4mapped.a4 = *((struct in_addr *)addr);
+	} else {
+		/* Not valid to call with other address families */
+		assert(0);
+	}
+}
diff --git a/tcp.c b/tcp.c
index 7686766..4040198 100644
--- a/tcp.c
+++ b/tcp.c
@@ -301,6 +301,7 @@
 #include "conf.h"
 #include "tcp_splice.h"
 #include "log.h"
+#include "inany.h"
 
 #include "tcp_conn.h"
 
@@ -404,7 +405,7 @@ struct tcp6_l2_head {	/* For MSS6 macro: keep in sync with tcp6_l2_buf_t */
 #define OPT_SACK	5
 #define OPT_TS		8
 
-#define CONN_V4(conn)		IN6_IS_ADDR_V4MAPPED(&conn->a.a6)
+#define CONN_V4(conn)		(!!inany_v4(&(conn)->addr))
 #define CONN_V6(conn)		(!CONN_V4(conn))
 #define CONN_IS_CLOSING(conn)						\
 	((conn->events & ESTABLISHED) &&				\
@@ -438,7 +439,7 @@ static int tcp_sock_init_ext	[NUM_PORTS][IP_VERSIONS];
 static int tcp_sock_ns		[NUM_PORTS][IP_VERSIONS];
 
 /* Table of destinations with very low RTT (assumed to be local), LRU */
-static struct in6_addr low_rtt_dst[LOW_RTT_TABLE_SIZE];
+static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE];
 
 /* Static buffers */
 
@@ -861,7 +862,7 @@ static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn)
 	int i;
 
 	for (i = 0; i < LOW_RTT_TABLE_SIZE; i++)
-		if (IN6_ARE_ADDR_EQUAL(&conn->a.a6, low_rtt_dst + i))
+		if (inany_equals(&conn->addr, low_rtt_dst + i))
 			return 1;
 
 	return 0;
@@ -883,7 +884,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
 		return;
 
 	for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) {
-		if (IN6_ARE_ADDR_EQUAL(&conn->a.a6, low_rtt_dst + i))
+		if (inany_equals(&conn->addr, low_rtt_dst + i))
 			return;
 		if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i))
 			hole = i;
@@ -895,10 +896,10 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
 	if (hole == -1)
 		return;
 
-	memcpy(low_rtt_dst + hole++, &conn->a.a6, sizeof(conn->a.a6));
+	low_rtt_dst[hole++] = conn->addr;
 	if (hole == LOW_RTT_TABLE_SIZE)
 		hole = 0;
-	memcpy(low_rtt_dst + hole, &in6addr_any, sizeof(conn->a.a6));
+	inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any);
 #else
 	(void)conn;
 	(void)tinfo;
@@ -1187,13 +1188,14 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn,
 			  int af, const void *addr,
 			  in_port_t tap_port, in_port_t sock_port)
 {
-	if (af == AF_INET && CONN_V4(conn)			&&
-	    !memcmp(&conn->a.a4.a, addr, sizeof(conn->a.a4.a))	&&
+	const struct in_addr *a4 = inany_v4(&conn->addr);
+
+	if (af == AF_INET && a4	&& !memcmp(a4, addr, sizeof(*a4)) &&
 	    conn->tap_port == tap_port && conn->sock_port == sock_port)
 		return 1;
 
 	if (af == AF_INET6					&&
-	    IN6_ARE_ADDR_EQUAL(&conn->a.a6, addr)		&&
+	    IN6_ARE_ADDR_EQUAL(&conn->addr.a6, addr)		&&
 	    conn->tap_port == tap_port && conn->sock_port == sock_port)
 		return 1;
 
@@ -1253,11 +1255,13 @@ static unsigned int tcp_hash(const struct ctx *c, int af, const void *addr,
 static unsigned int tcp_conn_hash(const struct ctx *c,
 				  const struct tcp_tap_conn *conn)
 {
-	if (CONN_V6(conn))
-		return tcp_hash(c, AF_INET6, &conn->a.a6,
+	const struct in_addr *a4 = inany_v4(&conn->addr);
+
+	if (a4)
+		return tcp_hash(c, AF_INET, a4,
 				conn->tap_port, conn->sock_port);
 	else
-		return tcp_hash(c, AF_INET, &conn->a.a4.a,
+		return tcp_hash(c, AF_INET6, &conn->addr.a6,
 				conn->tap_port, conn->sock_port);
 }
 
@@ -1582,6 +1586,7 @@ static size_t tcp_l2_buf_fill_headers(const struct ctx *c,
 				      void *p, size_t plen,
 				      const uint16_t *check, uint32_t seq)
 {
+	const struct in_addr *a4 = inany_v4(&conn->addr);
 	size_t ip_len, eth_len;
 
 #define SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq)			\
@@ -1599,13 +1604,33 @@ do {									\
 	}								\
 } while (0)
 
-	if (CONN_V6(conn)) {
+	if (a4) {
+		struct tcp4_l2_buf_t *b = (struct tcp4_l2_buf_t *)p;
+
+		ip_len = plen + sizeof(struct iphdr) + sizeof(struct tcphdr);
+		b->iph.tot_len = htons(ip_len);
+		b->iph.saddr = a4->s_addr;
+		b->iph.daddr = c->ip4.addr_seen.s_addr;
+
+		if (check)
+			b->iph.check = *check;
+		else
+			tcp_update_check_ip4(b);
+
+		SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq);
+
+		tcp_update_check_tcp4(b);
+
+		eth_len = ip_len + sizeof(struct ethhdr);
+		if (c->mode == MODE_PASST)
+			b->vnet_len = htonl(eth_len);
+	} else {
 		struct tcp6_l2_buf_t *b = (struct tcp6_l2_buf_t *)p;
 
 		ip_len = plen + sizeof(struct ipv6hdr) + sizeof(struct tcphdr);
 
 		b->ip6h.payload_len = htons(plen + sizeof(struct tcphdr));
-		b->ip6h.saddr = conn->a.a6;
+		b->ip6h.saddr = conn->addr.a6;
 		if (IN6_IS_ADDR_LINKLOCAL(&b->ip6h.saddr))
 			b->ip6h.daddr = c->ip6.addr_ll_seen;
 		else
@@ -1621,26 +1646,6 @@ do {									\
 		b->ip6h.flow_lbl[1] = (conn->sock >> 8) & 0xff;
 		b->ip6h.flow_lbl[2] = (conn->sock >> 0) & 0xff;
 
-		eth_len = ip_len + sizeof(struct ethhdr);
-		if (c->mode == MODE_PASST)
-			b->vnet_len = htonl(eth_len);
-	} else {
-		struct tcp4_l2_buf_t *b = (struct tcp4_l2_buf_t *)p;
-
-		ip_len = plen + sizeof(struct iphdr) + sizeof(struct tcphdr);
-		b->iph.tot_len = htons(ip_len);
-		b->iph.saddr = conn->a.a4.a.s_addr;
-		b->iph.daddr = c->ip4.addr_seen.s_addr;
-
-		if (check)
-			b->iph.check = *check;
-		else
-			tcp_update_check_ip4(b);
-
-		SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq);
-
-		tcp_update_check_tcp4(b);
-
 		eth_len = ip_len + sizeof(struct ethhdr);
 		if (c->mode == MODE_PASST)
 			b->vnet_len = htonl(eth_len);
@@ -2144,18 +2149,14 @@ static void tcp_conn_from_tap(struct ctx *c, int af, const void *addr,
 	if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap)))
 		conn->wnd_from_tap = 1;
 
+	inany_from_af(&conn->addr, af, addr);
+
 	if (af == AF_INET) {
 		sa = (struct sockaddr *)&addr4;
 		sl = sizeof(addr4);
-
-		memset(&conn->a.a4.zero, 0,    sizeof(conn->a.a4.zero));
-		memset(&conn->a.a4.one,  0xff, sizeof(conn->a.a4.one));
-		memcpy(&conn->a.a4.a,    addr, sizeof(conn->a.a4.a));
 	} else {
 		sa = (struct sockaddr *)&addr6;
 		sl = sizeof(addr6);
-
-		memcpy(&conn->a.a6,      addr, sizeof(conn->a.a6));
 	}
 
 	conn->sock_port = ntohs(th->dest);
@@ -2808,7 +2809,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, union epoll_ref ref,
 			memcpy(&sa6.sin6_addr, src, sizeof(*src));
 		}
 
-		memcpy(&conn->a.a6, &sa6.sin6_addr, sizeof(conn->a.a6));
+		inany_from_af(&conn->addr, AF_INET6, &sa6.sin6_addr);
 
 		conn->sock_port = ntohs(sa6.sin6_port);
 		conn->tap_port = ref.r.p.tcp.tcp.index;
@@ -2824,15 +2825,12 @@ static void tcp_tap_conn_from_sock(struct ctx *c, union epoll_ref ref,
 
 		memcpy(&sa4, sa, sizeof(sa4));
 
-		memset(&conn->a.a4.zero,   0, sizeof(conn->a.a4.zero));
-		memset(&conn->a.a4.one, 0xff, sizeof(conn->a.a4.one));
-
 		if (IN4_IS_ADDR_LOOPBACK(&sa4.sin_addr) ||
 		    IN4_IS_ADDR_UNSPECIFIED(&sa4.sin_addr) ||
 		    IN4_ARE_ADDR_EQUAL(&sa4.sin_addr, &c->ip4.addr_seen))
 			sa4.sin_addr = c->ip4.gw;
 
-		conn->a.a4.a = sa4.sin_addr;
+		inany_from_af(&conn->addr, AF_INET, &sa4.sin_addr);
 
 		conn->sock_port = ntohs(sa4.sin_port);
 		conn->tap_port = ref.r.p.tcp.tcp.index;
diff --git a/tcp_conn.h b/tcp_conn.h
index 4bffe9a..bf50e1c 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -35,10 +35,7 @@ extern const char *tcp_common_flag_str[];
  * @ws_to_tap:		Window scaling factor advertised to tap/guest
  * @sndbuf:		Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS
  * @seq_dup_ack_approx:	Last duplicate ACK number sent to tap
- * @a.a6:		IPv6 remote address, can be IPv4-mapped
- * @a.a4.zero:		Zero prefix for IPv4-mapped, see RFC 6890, Table 20
- * @a.a4.one:		Ones prefix for IPv4-mapped
- * @a.a4.a:		IPv4 address
+ * @addr:		Remote address (IPv4 or IPv6)
  * @tap_port:		Guest-facing tap port
  * @sock_port:		Remote, socket-facing port
  * @wnd_from_tap:	Last window size from tap, unscaled (as received)
@@ -108,15 +105,7 @@ struct tcp_tap_conn {
 	uint8_t		seq_dup_ack_approx;
 
 
-	union {
-		struct in6_addr a6;
-		struct {
-			uint8_t zero[10];
-			uint8_t one[2];
-			struct in_addr a;
-		} a4;
-	} a;
-
+	union inany_addr addr;
 	in_port_t	tap_port;
 	in_port_t	sock_port;
 
diff --git a/tcp_splice.c b/tcp_splice.c
index 2852e76..30ab0eb 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -52,6 +52,7 @@
 #include "passt.h"
 #include "log.h"
 #include "tcp_splice.h"
+#include "inany.h"
 
 #include "tcp_conn.h"
 
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 20/32] tcp: Hash IPv4 and IPv4-mapped-IPv6 addresses the same
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (18 preceding siblings ...)
  2022-11-16  4:41 ` [PATCH 19/32] inany: Helper functions for handling addresses which could be IPv4 or IPv6 David Gibson
@ 2022-11-16  4:42 ` David Gibson
  2022-11-16  4:42 ` [PATCH 21/32] tcp: Take tcp_hash_insert() address from struct tcp_conn David Gibson
                   ` (11 subsequent siblings)
  31 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:42 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

In the tcp_conn structure, we represent the address with an inany_addr
which could be an IPv4 or IPv6 address.  However, we have different paths
which will calculate different hashes for IPv4 and equivalent IPv4-mapped
IPv6 addresses.  This will cause problems for some future changes.

Make the hash function work the same for these two cases, by taking an
inany_addr directly.  Since this represents IPv4 and IPv4-mapped IPv6
addresses the same way, it will trivially hash the same for both cases.

Callers are changed to construct an inany_addr from whatever they have.
Some of that will be elided in later changes.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
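Note (not part of the patch): a rough sketch of why hashing the inany form
gives the same bucket for an IPv4 address and its IPv4-mapped IPv6
equivalent.  All demo_* names and DEMO_TABLE_SIZE are hypothetical, and
FNV-1a is used here purely as an unkeyed stand-in for siphash_20b().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <netinet/in.h>

#define DEMO_TABLE_SIZE 1024	/* hypothetical table size */

/* Build the 16-byte "inany" form: IPv6 as-is, IPv4 as ::ffff:a.b.c.d */
static void demo_inany(struct in6_addr *out, int af, const void *addr)
{
	assert(af == AF_INET || af == AF_INET6);

	if (af == AF_INET) {
		memset(out->s6_addr, 0, 10);
		memset(out->s6_addr + 10, 0xff, 2);
		memcpy(out->s6_addr + 12, addr, 4);
	} else {
		memcpy(out, addr, sizeof(*out));
	}
}

/* FNV-1a over 20 bytes: stand-in for siphash_20b(), illustration only */
static uint64_t demo_hash_20b(const uint8_t *in)
{
	uint64_t h = 0xcbf29ce484222325ULL;
	int i;

	for (i = 0; i < 20; i++) {
		h ^= in[i];
		h *= 0x100000001b3ULL;
	}
	return h;
}

/* Hash address and ports through a single path, whatever the family */
static unsigned int demo_tcp_hash(int af, const void *addr,
				  in_port_t tap_port, in_port_t sock_port)
{
	struct in6_addr a;
	uint8_t in[20];

	demo_inany(&a, af, addr);
	memcpy(in, &a, 16);
	memcpy(in + 16, &tap_port, 2);
	memcpy(in + 18, &sock_port, 2);

	return (unsigned int)(demo_hash_20b(in) % DEMO_TABLE_SIZE);
}
```

Because both address families go through the same 20-byte input, there is no
separate 8-byte IPv4 path left to diverge from the IPv6 one.
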
 siphash.c |  1 +
 tcp.c     | 52 ++++++++++++++++++----------------------------------
 2 files changed, 19 insertions(+), 34 deletions(-)

diff --git a/siphash.c b/siphash.c
index 37a6d73..516a508 100644
--- a/siphash.c
+++ b/siphash.c
@@ -104,6 +104,7 @@
  *
  * Return: the 64-bit hash output
  */
+/* cppcheck-suppress unusedFunction */
 uint64_t siphash_8b(const uint8_t *in, const uint64_t *k)
 {
 	PREAMBLE(8);
diff --git a/tcp.c b/tcp.c
index 4040198..56da864 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1205,8 +1205,7 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn,
 /**
  * tcp_hash() - Calculate hash value for connection given address and ports
  * @c:		Execution context
- * @af:		Address family, AF_INET or AF_INET6
- * @addr:	Remote address, pointer to in_addr or in6_addr
+ * @addr:	Remote address
  * @tap_port:	tap-facing port
  * @sock_port:	Socket-facing port
  *
@@ -1215,32 +1214,19 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn,
 #if TCP_HASH_NOINLINE
 __attribute__((__noinline__))	/* See comment in Makefile */
 #endif
-static unsigned int tcp_hash(const struct ctx *c, int af, const void *addr,
+static unsigned int tcp_hash(const struct ctx *c, const union inany_addr *addr,
 			     in_port_t tap_port, in_port_t sock_port)
 {
+	struct {
+		union inany_addr addr;
+		in_port_t tap_port;
+		in_port_t sock_port;
+	} __attribute__((__packed__)) in = {
+		*addr, tap_port, sock_port
+	};
 	uint64_t b = 0;
 
-	if (af == AF_INET) {
-		struct {
-			struct in_addr addr;
-			in_port_t tap_port;
-			in_port_t sock_port;
-		} __attribute__((__packed__)) in = {
-			*(struct in_addr *)addr, tap_port, sock_port,
-		};
-
-		b = siphash_8b((uint8_t *)&in, c->tcp.hash_secret);
-	} else if (af == AF_INET6) {
-		struct {
-			struct in6_addr addr;
-			in_port_t tap_port;
-			in_port_t sock_port;
-		} __attribute__((__packed__)) in = {
-			*(struct in6_addr *)addr, tap_port, sock_port,
-		};
-
-		b = siphash_20b((uint8_t *)&in, c->tcp.hash_secret);
-	}
+	b = siphash_20b((uint8_t *)&in, c->tcp.hash_secret);
 
 	return (unsigned int)(b % TCP_HASH_TABLE_SIZE);
 }
@@ -1255,14 +1241,7 @@ static unsigned int tcp_hash(const struct ctx *c, int af, const void *addr,
 static unsigned int tcp_conn_hash(const struct ctx *c,
 				  const struct tcp_tap_conn *conn)
 {
-	const struct in_addr *a4 = inany_v4(&conn->addr);
-
-	if (a4)
-		return tcp_hash(c, AF_INET, a4,
-				conn->tap_port, conn->sock_port);
-	else
-		return tcp_hash(c, AF_INET6, &conn->addr.a6,
-				conn->tap_port, conn->sock_port);
+	return tcp_hash(c, &conn->addr, conn->tap_port, conn->sock_port);
 }
 
 /**
@@ -1275,9 +1254,11 @@ static unsigned int tcp_conn_hash(const struct ctx *c,
 static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn,
 			    int af, const void *addr)
 {
+	union inany_addr aany;
 	int b;
 
-	b = tcp_hash(c, af, addr, conn->tap_port, conn->sock_port);
+	inany_from_af(&aany, af, addr);
+	b = tcp_hash(c, &aany, conn->tap_port, conn->sock_port);
 	conn->next_index = tc_hash[b] ? CONN_IDX(tc_hash[b]) : -1;
 	tc_hash[b] = conn;
 
@@ -1357,9 +1338,12 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c,
 					    in_port_t tap_port,
 					    in_port_t sock_port)
 {
-	int b = tcp_hash(c, af, addr, tap_port, sock_port);
+	union inany_addr aany;
 	struct tcp_tap_conn *conn;
+	int b;
 
+	inany_from_af(&aany, af, addr);
+	b = tcp_hash(c, &aany, tap_port, sock_port);
 	for (conn = tc_hash[b]; conn; conn = conn_at_idx(conn->next_index)) {
 		if (tcp_hash_match(conn, af, addr, tap_port, sock_port))
 			return conn;
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 21/32] tcp: Take tcp_hash_insert() address from struct tcp_conn
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (19 preceding siblings ...)
  2022-11-16  4:42 ` [PATCH 20/32] tcp: Hash IPv4 and IPv4-mapped-IPv6 addresses the same David Gibson
@ 2022-11-16  4:42 ` David Gibson
  2022-11-16  4:42 ` [PATCH 22/32] tcp: Simplify tcp_hash_match() to take an inany_addr David Gibson
                   ` (10 subsequent siblings)
  31 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:42 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

tcp_hash_insert() takes an address to control which hash bucket the
connection will go into.  However, an inany_addr representation of that
address is already stored in struct tcp_conn.

Now that we've made the hashing of IPv4 and IPv4-mapped IPv6 addresses
equivalent, we can simplify tcp_hash_insert() to use the address in
struct tcp_conn, rather than taking it as an extra parameter.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/tcp.c b/tcp.c
index 56da864..ca7f295 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1248,17 +1248,12 @@ static unsigned int tcp_conn_hash(const struct ctx *c,
  * tcp_hash_insert() - Insert connection into hash table, chain link
  * @c:		Execution context
  * @conn:	Connection pointer
- * @af:		Address family, AF_INET or AF_INET6
- * @addr:	Remote address, pointer to in_addr or in6_addr
  */
-static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn,
-			    int af, const void *addr)
+static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn)
 {
-	union inany_addr aany;
 	int b;
 
-	inany_from_af(&aany, af, addr);
-	b = tcp_hash(c, &aany, conn->tap_port, conn->sock_port);
+	b = tcp_hash(c, &conn->addr, conn->tap_port, conn->sock_port);
 	conn->next_index = tc_hash[b] ? CONN_IDX(tc_hash[b]) : -1;
 	tc_hash[b] = conn;
 
@@ -2153,7 +2148,7 @@ static void tcp_conn_from_tap(struct ctx *c, int af, const void *addr,
 	conn->seq_to_tap = tcp_seq_init(c, af, addr, th->dest, th->source, now);
 	conn->seq_ack_from_tap = conn->seq_to_tap + 1;
 
-	tcp_hash_insert(c, conn, af, addr);
+	tcp_hash_insert(c, conn);
 
 	if (!bind(s, sa, sl)) {
 		tcp_rst(c, conn);	/* Nobody is listening then */
@@ -2802,8 +2797,6 @@ static void tcp_tap_conn_from_sock(struct ctx *c, union epoll_ref ref,
 						conn->sock_port,
 						conn->tap_port,
 						now);
-
-		tcp_hash_insert(c, conn, AF_INET6, &sa6.sin6_addr);
 	} else {
 		struct sockaddr_in sa4;
 
@@ -2823,10 +2816,10 @@ static void tcp_tap_conn_from_sock(struct ctx *c, union epoll_ref ref,
 						conn->sock_port,
 						conn->tap_port,
 						now);
-
-		tcp_hash_insert(c, conn, AF_INET, &sa4.sin_addr);
 	}
 
+	tcp_hash_insert(c, conn);
+
 	conn->seq_ack_from_tap = conn->seq_to_tap + 1;
 
 	conn->wnd_from_tap = WINDOW_DEFAULT;
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 22/32] tcp: Simplify tcp_hash_match() to take an inany_addr
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (20 preceding siblings ...)
  2022-11-16  4:42 ` [PATCH 21/32] tcp: Take tcp_hash_insert() address from struct tcp_conn David Gibson
@ 2022-11-16  4:42 ` David Gibson
  2022-11-16  4:42 ` [PATCH 23/32] tcp: Unify initial sequence number calculation for IPv4 and IPv6 David Gibson
                   ` (9 subsequent siblings)
  31 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:42 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

tcp_hash_match() can take either an IPv4 (struct in_addr) or IPv6 (struct
in6_addr) address.  It has two different paths for each of those cases.
However, its only caller has already constructed an equivalent inany
representation of the address, so we can have tcp_hash_match take that
directly and use a simpler comparison with the inany_equals() helper.
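
For readers without inany.h to hand, here is a minimal standalone sketch of why a single comparison is enough. The union any_addr and the any_from_af() and any_equals() helpers are hypothetical stand-ins written for this note, not the definitions from inany.h; the sketch assumes an IPv4 address is always stored in its IPv4-mapped IPv6 form.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for inany_addr: 16 bytes, always IPv6-shaped */
union any_addr {
	struct in6_addr a6;
	struct {
		uint8_t zero[10];	/* ten zero bytes */
		uint8_t one[2];		/* 0xff 0xff */
		struct in_addr a4;	/* the IPv4 address itself */
	} v4mapped;
};

/* Store an in_addr (as ::ffff:a.b.c.d) or an in6_addr (as is) */
void any_from_af(union any_addr *aa, int af, const void *addr)
{
	if (af == AF_INET) {
		memset(aa->v4mapped.zero, 0, sizeof(aa->v4mapped.zero));
		memset(aa->v4mapped.one, 0xff, sizeof(aa->v4mapped.one));
		memcpy(&aa->v4mapped.a4, addr, sizeof(aa->v4mapped.a4));
	} else {
		assert(af == AF_INET6);
		memcpy(&aa->a6, addr, sizeof(aa->a6));
	}
}

/* One byte-wise comparison covers IPv4, IPv4-mapped and plain IPv6 */
int any_equals(const union any_addr *a, const union any_addr *b)
{
	return !memcmp(a, b, sizeof(*a));
}
```

With this layout, 192.0.2.1 stored from AF_INET and ::ffff:192.0.2.1 stored from AF_INET6 compare equal, so the caller no longer needs a per-family branch.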

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c | 16 ++++------------
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/tcp.c b/tcp.c
index ca7f295..f566060 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1177,25 +1177,17 @@ static int tcp_opt_get(const char *opts, size_t len, uint8_t type_find,
 /**
  * tcp_hash_match() - Check if a connection entry matches address and ports
  * @conn:	Connection entry to match against
- * @af:		Address family, AF_INET or AF_INET6
- * @addr:	Remote address, pointer to in_addr or in6_addr
+ * @addr:	Remote address
  * @tap_port:	tap-facing port
  * @sock_port:	Socket-facing port
  *
  * Return: 1 on match, 0 otherwise
  */
 static int tcp_hash_match(const struct tcp_tap_conn *conn,
-			  int af, const void *addr,
+			  const union inany_addr *addr,
 			  in_port_t tap_port, in_port_t sock_port)
 {
-	const struct in_addr *a4 = inany_v4(&conn->addr);
-
-	if (af == AF_INET && a4	&& !memcmp(a4, addr, sizeof(*a4)) &&
-	    conn->tap_port == tap_port && conn->sock_port == sock_port)
-		return 1;
-
-	if (af == AF_INET6					&&
-	    IN6_ARE_ADDR_EQUAL(&conn->addr.a6, addr)		&&
+	if (inany_equals(&conn->addr, addr) &&
 	    conn->tap_port == tap_port && conn->sock_port == sock_port)
 		return 1;
 
@@ -1340,7 +1332,7 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c,
 	inany_from_af(&aany, af, addr);
 	b = tcp_hash(c, &aany, tap_port, sock_port);
 	for (conn = tc_hash[b]; conn; conn = conn_at_idx(conn->next_index)) {
-		if (tcp_hash_match(conn, af, addr, tap_port, sock_port))
+		if (tcp_hash_match(conn, &aany, tap_port, sock_port))
 			return conn;
 	}
 
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 23/32] tcp: Unify initial sequence number calculation for IPv4 and IPv6
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (21 preceding siblings ...)
  2022-11-16  4:42 ` [PATCH 22/32] tcp: Simplify tcp_hash_match() to take an inany_addr David Gibson
@ 2022-11-16  4:42 ` David Gibson
  2022-11-16  4:42 ` [PATCH 24/32] tcp: Have tcp_seq_init() take its parameters from struct tcp_conn David Gibson
                   ` (8 subsequent siblings)
  31 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:42 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

tcp_seq_init() has separate paths for IPv4 and IPv6 addresses, which means
we will calculate different sequence numbers for IPv4 and equivalent
IPv4-mapped IPv6 addresses.

Change it to treat these the same by always converting the input address
into an inany_addr representation and use that to calculate the sequence
number.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 siphash.c |  1 +
 tcp.c     | 46 ++++++++++++++++++----------------------------
 2 files changed, 19 insertions(+), 28 deletions(-)

diff --git a/siphash.c b/siphash.c
index 516a508..811918b 100644
--- a/siphash.c
+++ b/siphash.c
@@ -123,6 +123,7 @@ uint64_t siphash_8b(const uint8_t *in, const uint64_t *k)
  *
  * Return: 32 bits obtained by XORing the two halves of the 64-bit hash output
  */
+/* cppcheck-suppress unusedFunction */
 uint32_t siphash_12b(const uint8_t *in, const uint64_t *k)
 {
 	uint32_t *in32 = (uint32_t *)in;
diff --git a/tcp.c b/tcp.c
index f566060..d4e9838 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1942,37 +1942,27 @@ static uint32_t tcp_seq_init(const struct ctx *c, int af, const void *addr,
 			     in_port_t dstport, in_port_t srcport,
 			     const struct timespec *now)
 {
+	union inany_addr aany;
+	struct {
+		union inany_addr src;
+		in_port_t srcport;
+		union inany_addr dst;
+		in_port_t dstport;
+	} __attribute__((__packed__)) in = {
+		.srcport = srcport,
+		.dstport = dstport,
+	};
 	uint32_t ns, seq = 0;
 
-	if (af == AF_INET) {
-		struct {
-			struct in_addr src;
-			in_port_t srcport;
-			struct in_addr dst;
-			in_port_t dstport;
-		} __attribute__((__packed__)) in = {
-			.src = *(struct in_addr *)addr,
-			.srcport = srcport,
-			.dst = c->ip4.addr,
-			.dstport = dstport,
-		};
-
-		seq = siphash_12b((uint8_t *)&in, c->tcp.hash_secret);
-	} else if (af == AF_INET6) {
-		struct {
-			struct in6_addr src;
-			in_port_t srcport;
-			struct in6_addr dst;
-			in_port_t dstport;
-		} __attribute__((__packed__)) in = {
-			.src = *(struct in6_addr *)addr,
-			.srcport = srcport,
-			.dst = c->ip6.addr,
-			.dstport = dstport,
-		};
+	inany_from_af(&aany, af, addr);
+	in.src = aany;
+	if (af == AF_INET)
+		inany_from_af(&aany, AF_INET, &c->ip4.addr);
+	else
+		inany_from_af(&aany, AF_INET6, &c->ip6.addr);
+	in.dst = aany;
 
-		seq = siphash_36b((uint8_t *)&in, c->tcp.hash_secret);
-	}
+	seq = siphash_36b((uint8_t *)&in, c->tcp.hash_secret);
 
 	ns = now->tv_sec * 1E9;
 	ns += now->tv_nsec >> 5; /* 32ns ticks, overflows 32 bits every 137s */
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 24/32] tcp: Have tcp_seq_init() take its parameters from struct tcp_conn
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (22 preceding siblings ...)
  2022-11-16  4:42 ` [PATCH 23/32] tcp: Unify initial sequence number calculation for IPv4 and IPv6 David Gibson
@ 2022-11-16  4:42 ` David Gibson
  2022-11-16  4:42 ` [PATCH 25/32] tcp: Fix small errors in tcp_seq_init() time handling David Gibson
                   ` (7 subsequent siblings)
  31 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:42 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

tcp_seq_init() takes a number of parameters for the connection, but at
every call site, these are already populated in the tcp_conn structure.
Likewise we always store the result into the @seq_to_tap field.

Use this to simplify tcp_seq_init().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c | 36 ++++++++++--------------------------
 1 file changed, 10 insertions(+), 26 deletions(-)

diff --git a/tcp.c b/tcp.c
index d4e9838..7156246 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1930,17 +1930,11 @@ static void tcp_clamp_window(const struct ctx *c, struct tcp_tap_conn *conn,
 /**
  * tcp_seq_init() - Calculate initial sequence number according to RFC 6528
  * @c:		Execution context
- * @af:		Address family, AF_INET or AF_INET6
- * @addr:	Remote address, pointer to in_addr or in6_addr
- * @dstport:	Destination port, connection-wise, network order
- * @srcport:	Source port, connection-wise, network order
+ * @conn:	TCP connection, with addr, sock_port and tap_port populated
  * @now:	Current timestamp
- *
- * Return: initial TCP sequence
  */
-static uint32_t tcp_seq_init(const struct ctx *c, int af, const void *addr,
-			     in_port_t dstport, in_port_t srcport,
-			     const struct timespec *now)
+static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
+			 const struct timespec *now)
 {
 	union inany_addr aany;
 	struct {
@@ -1949,14 +1943,13 @@ static uint32_t tcp_seq_init(const struct ctx *c, int af, const void *addr,
 		union inany_addr dst;
 		in_port_t dstport;
 	} __attribute__((__packed__)) in = {
-		.srcport = srcport,
-		.dstport = dstport,
+		.src = conn->addr,
+		.srcport = conn->tap_port,
+		.dstport = conn->sock_port,
 	};
 	uint32_t ns, seq = 0;
 
-	inany_from_af(&aany, af, addr);
-	in.src = aany;
-	if (af == AF_INET)
+	if (CONN_V4(conn))
 		inany_from_af(&aany, AF_INET, &c->ip4.addr);
 	else
 		inany_from_af(&aany, AF_INET6, &c->ip6.addr);
@@ -1967,7 +1960,7 @@ static uint32_t tcp_seq_init(const struct ctx *c, int af, const void *addr,
 	ns = now->tv_sec * 1E9;
 	ns += now->tv_nsec >> 5; /* 32ns ticks, overflows 32 bits every 137s */
 
-	return seq + ns;
+	conn->seq_to_tap = seq + ns;
 }
 
 /**
@@ -2127,7 +2120,7 @@ static void tcp_conn_from_tap(struct ctx *c, int af, const void *addr,
 	conn->seq_from_tap = conn->seq_init_from_tap + 1;
 	conn->seq_ack_to_tap = conn->seq_from_tap;
 
-	conn->seq_to_tap = tcp_seq_init(c, af, addr, th->dest, th->source, now);
+	tcp_seq_init(c, conn, now);
 	conn->seq_ack_from_tap = conn->seq_to_tap + 1;
 
 	tcp_hash_insert(c, conn);
@@ -2774,11 +2767,6 @@ static void tcp_tap_conn_from_sock(struct ctx *c, union epoll_ref ref,
 
 		conn->sock_port = ntohs(sa6.sin6_port);
 		conn->tap_port = ref.r.p.tcp.tcp.index;
-
-		conn->seq_to_tap = tcp_seq_init(c, AF_INET6, &sa6.sin6_addr,
-						conn->sock_port,
-						conn->tap_port,
-						now);
 	} else {
 		struct sockaddr_in sa4;
 
@@ -2793,13 +2781,9 @@ static void tcp_tap_conn_from_sock(struct ctx *c, union epoll_ref ref,
 
 		conn->sock_port = ntohs(sa4.sin_port);
 		conn->tap_port = ref.r.p.tcp.tcp.index;
-
-		conn->seq_to_tap = tcp_seq_init(c, AF_INET, &sa4.sin_addr,
-						conn->sock_port,
-						conn->tap_port,
-						now);
 	}
 
+	tcp_seq_init(c, conn, now);
 	tcp_hash_insert(c, conn);
 
 	conn->seq_ack_from_tap = conn->seq_to_tap + 1;
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 25/32] tcp: Fix small errors in tcp_seq_init() time handling
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (23 preceding siblings ...)
  2022-11-16  4:42 ` [PATCH 24/32] tcp: Have tcp_seq_init() take its parameters from struct tcp_conn David Gibson
@ 2022-11-16  4:42 ` David Gibson
  2022-11-16  4:42 ` [PATCH 26/32] tcp: Remove v6 flag from tcp_epoll_ref David Gibson
                   ` (6 subsequent siblings)
  31 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:42 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

It looks like tcp_seq_init() is supposed to advance the sequence number
by one every 32ns.  However we only right shift the ns part of the timespec
not the seconds part, meaning that we'll advance by an extra 32 steps on
each second.

I don't know if that's exploitable in any way, but it doesn't appear to be
the intent, nor what RFC 6528 suggests.

In addition, we convert from seconds to nanoseconds with a multiplication
by '1E9'.  In C '1E9' is a floating point constant, forcing a conversion
to floating point and back for what should be an integer calculation
(confirmed with objdump and Makefile default compiler flags).  Spell out
1000000000 in full to avoid that.
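
As a rough illustration of both points, here is a standalone sketch rather than the tcp.c code. ticks_before(), ticks_after(), IS_DOUBLE() and NS_PER_SEC are names made up for this note, and ticks_before() uses an integer stand-in for the "* 1E9" multiplication so that the comparison stays free of floating point.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* 1 for expressions of type double, 0 otherwise (C11 _Generic) */
#define IS_DOUBLE(x)	_Generic((x), double: 1, default: 0)

#define NS_PER_SEC	1000000000ULL

/* Old shape: seconds stay in nanoseconds, only tv_nsec is scaled to ticks */
uint32_t ticks_before(const struct timespec *now)
{
	uint64_t ns;

	assert(now->tv_sec >= 0 && now->tv_nsec >= 0);
	ns = (uint64_t)now->tv_sec * NS_PER_SEC;
	ns += (uint64_t)now->tv_nsec >> 5;

	return (uint32_t)ns;	/* truncate to 32 bits, like a sequence number */
}

/* New shape: the whole timestamp is converted to 32ns ticks */
uint32_t ticks_after(const struct timespec *now)
{
	uint64_t ns;

	assert(now->tv_sec >= 0 && now->tv_nsec >= 0);
	ns = (uint64_t)now->tv_sec * NS_PER_SEC + (uint64_t)now->tv_nsec;

	return (uint32_t)(ns >> 5);
}
```

Across a one second step, ticks_before() advances by 1000000000 (modulo 2^32) while ticks_after() advances by 31250000, that is 10^9 / 32. IS_DOUBLE(1E9) evaluates to 1 and IS_DOUBLE(1000000000) to 0, which is the type difference behind the floating point round trip.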

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tcp.c b/tcp.c
index 7156246..0513b3b 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1957,8 +1957,8 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
 
 	seq = siphash_36b((uint8_t *)&in, c->tcp.hash_secret);
 
-	ns = now->tv_sec * 1E9;
-	ns += now->tv_nsec >> 5; /* 32ns ticks, overflows 32 bits every 137s */
+	/* 32ns ticks, overflows 32 bits every 137s */
+	ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
 
 	conn->seq_to_tap = seq + ns;
 }
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 26/32] tcp: Remove v6 flag from tcp_epoll_ref
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (24 preceding siblings ...)
  2022-11-16  4:42 ` [PATCH 25/32] tcp: Fix small errors in tcp_seq_init() time handling David Gibson
@ 2022-11-16  4:42 ` David Gibson
  2022-11-17  0:15   ` Stefano Brivio
  2022-11-16  4:42 ` [PATCH 27/32] tcp: NAT IPv4-mapped IPv6 addresses like IPv4 addresses David Gibson
                   ` (5 subsequent siblings)
  31 siblings, 1 reply; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:42 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

This bit in the TCP specific epoll reference indicates whether the
connection is IPv6 or IPv4.  However the sites which refer to it are
already calling accept() which (optionally) returns an address for the
remote end of the connection.  We can use the sa_family field in that
address to determine the connection type independent of the epoll
reference.

This does have a cost: for the spliced case, it means we now need to get
that address from accept() which introduces an extra copy_to_user().
However, in future we want to allow handling IPv4 connections through IPv6
sockets, which means we won't be able to determine the IP version at the
time we create the listening socket and epoll reference.  So, at some point
we'll have to pay this cost anyway.
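
A minimal sketch of the idea, independent of the passt code: sockaddr_is_v6() and accept_with_family() are hypothetical helpers for this note, and only accept() and the sa_family field are taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Classify a peer address by the family recorded in the sockaddr */
int sockaddr_is_v6(const struct sockaddr *sa)
{
	assert(sa != NULL);
	return sa->sa_family == AF_INET6;
}

/* Accept a connection and report the peer's address family.
 * Returns the new socket, or -1 on error (errno set by accept()).
 */
int accept_with_family(int listen_fd, int *is_v6)
{
	struct sockaddr_storage ss;	/* large enough for either family */
	socklen_t sl = sizeof(ss);
	int s;

	s = accept(listen_fd, (struct sockaddr *)&ss, &sl);
	if (s < 0)
		return -1;

	*is_v6 = sockaddr_is_v6((const struct sockaddr *)&ss);
	return s;
}
```

Note that on a dual stack listening socket, Linux reports an IPv4 peer as AF_INET6 with an IPv4-mapped address, so sa_family on its own gives the socket's family rather than the peer's IP version.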

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c        | 10 ++++------
 tcp.h        |  2 --
 tcp_splice.c |  9 ++++-----
 3 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/tcp.c b/tcp.c
index 0513b3b..b05ed6c 100644
--- a/tcp.c
+++ b/tcp.c
@@ -662,8 +662,7 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 {
 	int m = conn->c.in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
 	union epoll_ref ref = { .r.proto = IPPROTO_TCP, .r.s = conn->sock,
-				.r.p.tcp.tcp.index = CONN_IDX(conn),
-				.r.p.tcp.tcp.v6 = CONN_V6(conn) };
+				.r.p.tcp.tcp.index = CONN_IDX(conn) };
 	struct epoll_event ev = { .data.u64 = ref.u64 };
 
 	if (conn->events == CLOSED) {
@@ -2745,7 +2744,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, union epoll_ref ref,
 	conn->ws_to_tap = conn->ws_from_tap = 0;
 	conn_event(c, conn, SOCK_ACCEPTED);
 
-	if (ref.r.p.tcp.tcp.v6) {
+	if (sa->sa_family == AF_INET6) {
 		struct sockaddr_in6 sa6;
 
 		memcpy(&sa6, sa, sizeof(sa6));
@@ -3019,8 +3018,7 @@ static void tcp_sock_init6(const struct ctx *c,
 			   in_port_t port)
 {
 	in_port_t idx = port + c->tcp.fwd_in.delta[port];
-	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.v6 = 1,
-				     .tcp.index = idx	};
+	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.index = idx	};
 	int s;
 
 	s = sock_l4(c, AF_INET6, IPPROTO_TCP, addr, ifname, port, tref.u32);
@@ -3084,7 +3082,7 @@ static void tcp_ns_sock_init6(const struct ctx *c, in_port_t port)
 {
 	in_port_t idx = port + c->tcp.fwd_out.delta[port];
 	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = 1,
-				     .tcp.v6 = 1, .tcp.index = idx};
+				     .tcp.index = idx};
 	int s;
 
 	assert(c->mode == MODE_PASTA);
diff --git a/tcp.h b/tcp.h
index a940682..739b451 100644
--- a/tcp.h
+++ b/tcp.h
@@ -33,7 +33,6 @@ void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s,
  * union tcp_epoll_ref - epoll reference portion for TCP connections
  * @listen:		Set if this file descriptor is a listening socket
  * @outbound:		Listening socket maps to outbound, spliced connection
- * @v6:			Set for IPv6 sockets or connections
  * @timer:		Reference is a timerfd descriptor for connection
  * @index:		Index of connection in table, or port for bound sockets
  * @u32:		Opaque u32 value of reference
@@ -42,7 +41,6 @@ union tcp_epoll_ref {
 	struct {
 		uint32_t	listen:1,
 				outbound:1,
-				v6:1,
 				timer:1,
 				index:20;
 	} tcp;
diff --git a/tcp_splice.c b/tcp_splice.c
index 30ab0eb..7c2f667 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -167,11 +167,9 @@ static int tcp_splice_epoll_ctl(const struct ctx *c,
 {
 	int m = conn->c.in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
 	union epoll_ref ref_a = { .r.proto = IPPROTO_TCP, .r.s = conn->a,
-				  .r.p.tcp.tcp.index = CONN_IDX(conn),
-				  .r.p.tcp.tcp.v6 = CONN_V6(conn) };
+				  .r.p.tcp.tcp.index = CONN_IDX(conn) };
 	union epoll_ref ref_b = { .r.proto = IPPROTO_TCP, .r.s = conn->b,
-				  .r.p.tcp.tcp.index = CONN_IDX(conn),
-				  .r.p.tcp.tcp.v6 = CONN_V6(conn) };
+				  .r.p.tcp.tcp.index = CONN_IDX(conn) };
 	struct epoll_event ev_a = { .data.u64 = ref_a.u64 };
 	struct epoll_event ev_b = { .data.u64 = ref_b.u64 };
 	uint32_t events_a, events_b;
@@ -504,6 +502,7 @@ static void tcp_splice_dir(struct tcp_splice_conn *conn, int ref_sock,
  * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection
  * @c:		Execution context
  * @ref:	epoll reference of listening socket
+ * @ipv6:	Should this be an IPv6 connection?
  * @conn:	connection structure to initialize
  * @s:		Accepted socket
  * @sa:		Peer address of connection
@@ -517,7 +516,7 @@ bool tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
 {
 	assert(c->mode == MODE_PASTA);
 
-	if (ref.r.p.tcp.tcp.v6) {
+	if (sa->sa_family == AF_INET6) {
 		const struct sockaddr_in6 *sa6
 			= (const struct sockaddr_in6 *)sa;
 		if (!IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr))
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 27/32] tcp: NAT IPv4-mapped IPv6 addresses like IPv4 addresses
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (25 preceding siblings ...)
  2022-11-16  4:42 ` [PATCH 26/32] tcp: Remove v6 flag from tcp_epoll_ref David Gibson
@ 2022-11-16  4:42 ` David Gibson
  2022-11-17  0:15   ` Stefano Brivio
  2022-11-16  4:42 ` [PATCH 28/32] tcp_splice: Allow splicing of connections from IPv4-mapped loopback David Gibson
                   ` (4 subsequent siblings)
  31 siblings, 1 reply; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:42 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

passt usually doesn't NAT, but it does do so for the remapping of the
gateway address to refer to the host.  Currently we perform this NAT with
slightly different rules on both IPv4 addresses and IPv6 addresses, but not
on IPv4-mapped IPv6 addresses.  This means we won't correctly handle the
case of an IPv4 connection over an IPv6 socket, which is possible on Linux
(and probably other platforms).

Refactor tcp_conn_from_sock() to perform the NAT after converting either
address family into an inany_addr, so IPv4 and IPv4-mapped addresses
have the same representation.

With two new helpers this lets us remove the IPv4 and IPv6 specific paths
from tcp_conn_from_sock().
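
To make the IPv4-mapped case concrete, here is a small standalone sketch, not the helper added by this patch. snat_v4mapped() is a hypothetical name, and it covers only the loopback rule: a source in 127.0.0.0/8 carried as ::ffff:127.x.y.z is rewritten to a gateway address passed in by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Rewrite an IPv4-mapped loopback source to the IPv4 gateway @gw4.
 * Returns 1 if @addr was rewritten, 0 if it was left alone.
 */
int snat_v4mapped(struct in6_addr *addr, const struct in_addr *gw4)
{
	struct in_addr a4;

	assert(addr != NULL && gw4 != NULL);

	if (!IN6_IS_ADDR_V4MAPPED(addr))
		return 0;		/* plain IPv6, not handled here */

	/* The embedded IPv4 address is the last four bytes */
	memcpy(&a4, &addr->s6_addr[12], sizeof(a4));
	if ((ntohl(a4.s_addr) >> 24) != 127)
		return 0;		/* not in 127.0.0.0/8 */

	memcpy(&addr->s6_addr[12], gw4, sizeof(*gw4));
	return 1;
}
```

The mapped prefix is left untouched, so the result is still an IPv4-mapped address and later code can keep treating it as IPv4.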

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 inany.h | 30 ++++++++++++++++++++++++++--
 tcp.c   | 62 ++++++++++++++++++++++++---------------------------------
 2 files changed, 54 insertions(+), 38 deletions(-)

diff --git a/inany.h b/inany.h
index 4e53da9..a677aa7 100644
--- a/inany.h
+++ b/inany.h
@@ -30,11 +30,11 @@ union inany_addr {
  *
  * Return: IPv4 address if @addr is IPv4, NULL otherwise
  */
-static inline const struct in_addr *inany_v4(const union inany_addr *addr)
+static inline struct in_addr *inany_v4(const union inany_addr *addr)
 {
 	if (!IN6_IS_ADDR_V4MAPPED(&addr->a6))
 		return NULL;
-	return &addr->_v4mapped.a4;
+	return (struct in_addr *)&addr->_v4mapped.a4;
 }
 
 /** inany_equals - Compare two IPv[46] addresses
@@ -66,3 +66,29 @@ static inline void inany_from_af(union inany_addr *aa, int af, const void *addr)
 		assert(0);
 	}
 }
+
+/** inany_from_sockaddr - Extract IPv[46] address and port number from sockaddr
+ * @a:		Pointer to store IPv[46] address
+ * @port:	Pointer to store port number, host order
+ * @addr:	struct sockaddr_in (IPv4) or struct sockaddr_in6 (IPv6)
+ */
+static inline void inany_from_sockaddr(union inany_addr *aa, in_port_t *port,
+				       const void *sa_)
+{
+	const struct sockaddr *sa = (const struct sockaddr *)sa_;
+
+	if (sa->sa_family == AF_INET6) {
+		struct sockaddr_in6 *sa6 = (struct sockaddr_in6 *)sa;
+
+		inany_from_af(aa, AF_INET6, &sa6->sin6_addr);
+		*port = ntohs(sa6->sin6_port);
+	} else if (sa->sa_family == AF_INET) {
+		struct sockaddr_in *sa4 = (struct sockaddr_in *)sa;
+
+		inany_from_af(aa, AF_INET, &sa4->sin_addr);
+		*port = ntohs(sa4->sin_port);
+	} else {
+		/* Not valid to call with other address families */
+		assert(0);
+	}
+}
diff --git a/tcp.c b/tcp.c
index b05ed6c..fca5df4 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2724,6 +2724,29 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
 	conn_flag(c, conn, ACK_FROM_TAP_DUE);
 }
 
+static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr)
+{
+	struct in_addr *addr4 = inany_v4(addr);
+
+	if (addr4) {
+		if (IN4_IS_ADDR_LOOPBACK(addr4) ||
+		    IN4_IS_ADDR_UNSPECIFIED(addr4) ||
+		    IN4_ARE_ADDR_EQUAL(addr4, &c->ip4.addr_seen))
+			*addr4 = c->ip4.gw;
+	} else {
+		struct in6_addr *addr6 = &addr->a6;
+
+		if (IN6_IS_ADDR_LOOPBACK(addr6) ||
+		    IN6_ARE_ADDR_EQUAL(addr6, &c->ip6.addr_seen) ||
+		    IN6_ARE_ADDR_EQUAL(addr6, &c->ip6.addr)) {
+			if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
+				*addr6 = c->ip6.gw;
+			else
+				*addr6 = c->ip6.addr_ll;
+		}
+	}
+}
+
 /**
  * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection
  * @c:		Execution context
@@ -2744,43 +2767,10 @@ static void tcp_tap_conn_from_sock(struct ctx *c, union epoll_ref ref,
 	conn->ws_to_tap = conn->ws_from_tap = 0;
 	conn_event(c, conn, SOCK_ACCEPTED);
 
-	if (sa->sa_family == AF_INET6) {
-		struct sockaddr_in6 sa6;
-
-		memcpy(&sa6, sa, sizeof(sa6));
-
-		if (IN6_IS_ADDR_LOOPBACK(&sa6.sin6_addr) ||
-		    IN6_ARE_ADDR_EQUAL(&sa6.sin6_addr, &c->ip6.addr_seen) ||
-		    IN6_ARE_ADDR_EQUAL(&sa6.sin6_addr, &c->ip6.addr)) {
-			struct in6_addr *src;
+	inany_from_sockaddr(&conn->addr, &conn->sock_port, sa);
+	conn->tap_port = ref.r.p.tcp.tcp.index;
 
-			if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
-				src = &c->ip6.gw;
-			else
-				src = &c->ip6.addr_ll;
-
-			memcpy(&sa6.sin6_addr, src, sizeof(*src));
-		}
-
-		inany_from_af(&conn->addr, AF_INET6, &sa6.sin6_addr);
-
-		conn->sock_port = ntohs(sa6.sin6_port);
-		conn->tap_port = ref.r.p.tcp.tcp.index;
-	} else {
-		struct sockaddr_in sa4;
-
-		memcpy(&sa4, sa, sizeof(sa4));
-
-		if (IN4_IS_ADDR_LOOPBACK(&sa4.sin_addr) ||
-		    IN4_IS_ADDR_UNSPECIFIED(&sa4.sin_addr) ||
-		    IN4_ARE_ADDR_EQUAL(&sa4.sin_addr, &c->ip4.addr_seen))
-			sa4.sin_addr = c->ip4.gw;
-
-		inany_from_af(&conn->addr, AF_INET, &sa4.sin_addr);
-
-		conn->sock_port = ntohs(sa4.sin_port);
-		conn->tap_port = ref.r.p.tcp.tcp.index;
-	}
+	tcp_snat_inbound(c, &conn->addr);
 
 	tcp_seq_init(c, conn, now);
 	tcp_hash_insert(c, conn);
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 28/32] tcp_splice: Allow splicing of connections from IPv4-mapped loopback
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (26 preceding siblings ...)
  2022-11-16  4:42 ` [PATCH 27/32] tcp: NAT IPv4-mapped IPv6 addresses like IPv4 addresses David Gibson
@ 2022-11-16  4:42 ` David Gibson
  2022-11-17  0:15   ` Stefano Brivio
  2022-11-16  4:42 ` [PATCH 29/32] tcp: Consolidate tcp_sock_init[46] David Gibson
                   ` (3 subsequent siblings)
  31 siblings, 1 reply; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:42 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

For non-spliced connections we now treat IPv4-mapped IPv6 addresses the
same as the corresponding IPv4 addresses.  However, currently we won't
splice a connection from ::ffff:127.0.0.1 the way we would one from
127.0.0.1.  Correct this so that we can splice connections from IPv4
localhost that have been received on an IPv6 dual stack socket.
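
As a rough illustration of the check described above (not the code in this
patch, which goes through inany_from_sockaddr() and inany_v4()), a loopback
test that treats ::ffff:127.0.0.1 like 127.0.0.1 could look like the sketch
below.  The helper name addr6_is_any_loopback() is made up for this example
and uses only standard socket headers.

```c
#include <stdbool.h>
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical helper, not part of passt: true if @a6 is a loopback address,
 * either native IPv6 (::1) or an IPv4-mapped address in 127.0.0.0/8.
 */
static bool addr6_is_any_loopback(const struct in6_addr *a6)
{
	if (IN6_IS_ADDR_V4MAPPED(a6)) {
		struct in_addr a4;

		/* The IPv4 address is the last four bytes of the mapped form */
		memcpy(&a4, &a6->s6_addr[12], sizeof(a4));
		return (ntohl(a4.s_addr) >> 24) == 127;
	}

	return IN6_IS_ADDR_LOOPBACK(a6);
}
```

With this, an address accepted on a dual stack socket as ::ffff:127.0.0.1
is classified the same way as 127.0.0.1 accepted on a plain IPv4 socket,
while ::ffff:192.0.2.1 is not treated as loopback.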

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp_splice.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/tcp_splice.c b/tcp_splice.c
index 7c2f667..61c56be 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -514,19 +514,23 @@ bool tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
 			       struct tcp_splice_conn *conn, int s,
 			       const struct sockaddr *sa)
 {
+	union inany_addr aany;
+	const struct in_addr *a4;
+	in_port_t port;
+
 	assert(c->mode == MODE_PASTA);
 
-	if (sa->sa_family == AF_INET6) {
-		const struct sockaddr_in6 *sa6
-			= (const struct sockaddr_in6 *)sa;
-		if (!IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr))
+	inany_from_sockaddr(&aany, &port, sa);
+	a4 = inany_v4(&aany);
+
+	if (a4) {
+		if (!IN4_IS_ADDR_LOOPBACK(a4))
 			return false;
-		conn->flags = SPLICE_V6;
+		conn->flags = 0;
 	} else {
-		const struct sockaddr_in *sa4 = (const struct sockaddr_in *)sa;
-		if (!IN4_IS_ADDR_LOOPBACK(&sa4->sin_addr))
+		if (!IN6_IS_ADDR_LOOPBACK(&aany.a6))
 			return false;
-		conn->flags = 0;
+		conn->flags = SPLICE_V6;
 	}
 
 	if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }),
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 29/32] tcp: Consolidate tcp_sock_init[46]
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (27 preceding siblings ...)
  2022-11-16  4:42 ` [PATCH 28/32] tcp_splice: Allow splicing of connections from IPv4-mapped loopback David Gibson
@ 2022-11-16  4:42 ` David Gibson
  2022-11-16  4:42 ` [PATCH 30/32] util: Allow sock_l4() to open dual stack sockets David Gibson
                   ` (2 subsequent siblings)
  31 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:42 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Previous cleanups mean that tcp_sock_init4() and tcp_sock_init6() are
almost identical, and the remaining differences can be easily
parameterized.  Combine both into a single tcp_sock_init_af() function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c | 50 +++++++++++++++-----------------------------------
 1 file changed, 15 insertions(+), 35 deletions(-)

diff --git a/tcp.c b/tcp.c
index fca5df4..616b9d0 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2973,52 +2973,32 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
 }
 
 /**
- * tcp_sock_init4() - Initialise listening sockets for a given IPv4 port
+ * tcp_sock_init_af() - Initialise listening socket for a given af and port
  * @c:		Execution context
+ * @af:		Address family to listen on
+ * @port:	Port, host order
  * @addr:	Pointer to address for binding, NULL if not configured
  * @ifname:	Name of interface to bind to, NULL if not configured
- * @port:	Port, host order
+ *
+ * Return: fd for the new listening socket, or -1 on failure
  */
-static void tcp_sock_init4(const struct ctx *c, const struct in_addr *addr,
-			   const char *ifname, in_port_t port)
+static int tcp_sock_init_af(const struct ctx *c, int af, in_port_t port,
+			    const struct in_addr *addr, const char *ifname)
 {
 	in_port_t idx = port + c->tcp.fwd_in.delta[port];
 	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.index = idx };
 	int s;
 
-	s = sock_l4(c, AF_INET, IPPROTO_TCP, addr, ifname, port, tref.u32);
-	if (s >= 0)
-		tcp_sock_set_bufsize(c, s);
-	else
-		s = -1;
+	s = sock_l4(c, af, IPPROTO_TCP, addr, ifname, port, tref.u32);
 
 	if (c->tcp.fwd_in.mode == FWD_AUTO)
-		tcp_sock_init_ext[port][V4] = s;
-}
+		tcp_sock_init_ext[port][(af == AF_INET) ? V4 : V6] = s;
 
-/**
- * tcp_sock_init6() - Initialise listening sockets for a given IPv6 port
- * @c:		Execution context
- * @addr:	Pointer to address for binding, NULL if not configured
- * @ifname:	Name of interface to bind to, NULL if not configured
- * @port:	Port, host order
- */
-static void tcp_sock_init6(const struct ctx *c,
-			   const struct in6_addr *addr, const char *ifname,
-			   in_port_t port)
-{
-	in_port_t idx = port + c->tcp.fwd_in.delta[port];
-	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.index = idx	};
-	int s;
-
-	s = sock_l4(c, AF_INET6, IPPROTO_TCP, addr, ifname, port, tref.u32);
-	if (s >= 0)
-		tcp_sock_set_bufsize(c, s);
-	else
-		s = -1;
+	if (s < 0)
+		return -1;
 
-	if (c->tcp.fwd_in.mode == FWD_AUTO)
-		tcp_sock_init_ext[port][V6] = s;
+	tcp_sock_set_bufsize(c, s);
+	return s;
 }
 
 /**
@@ -3033,9 +3013,9 @@ void tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
 		   const char *ifname, in_port_t port)
 {
 	if ((af == AF_INET  || af == AF_UNSPEC) && c->ifi4)
-		tcp_sock_init4(c, addr, ifname, port);
+		tcp_sock_init_af(c, AF_INET, port, addr, ifname);
 	if ((af == AF_INET6 || af == AF_UNSPEC) && c->ifi6)
-		tcp_sock_init6(c, addr, ifname, port);
+		tcp_sock_init_af(c, AF_INET6, port, addr, ifname);
 }
 
 /**
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 30/32] util: Allow sock_l4() to open dual stack sockets
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (28 preceding siblings ...)
  2022-11-16  4:42 ` [PATCH 29/32] tcp: Consolidate tcp_sock_init[46] David Gibson
@ 2022-11-16  4:42 ` David Gibson
  2022-11-16  4:42 ` [PATCH 31/32] util: Always return -1 on error in sock_l4() David Gibson
  2022-11-16  4:42 ` [PATCH 32/32] tcp: Use dual stack sockets for port forwarding when possible David Gibson
  31 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:42 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Currently, when instructed to open an IPv6 socket, sock_l4() explicitly
sets the IPV6_V6ONLY socket option so that the socket will only respond to
IPv6 connections.  Linux (and probably other platforms) allows "dual stack"
sockets: IPv6 sockets which can also accept IPv4 connections.

Extend sock_l4() to be able to make such sockets, by passing AF_UNSPEC as
the address family and no bind address (binding to a specific address would
defeat the purpose).  We add a Makefile define 'DUAL_STACK_SOCKETS' to
indicate availability of this feature on the target platform.
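
For readers unfamiliar with dual stack sockets, here is a minimal
standalone sketch of the idea, independent of sock_l4().  The function
name dual_stack_listen() and the backlog value are assumptions made for
the example.  Note one difference from the patch: the patch simply skips
setting IPV6_V6ONLY, whereas this sketch clears it explicitly, since the
default can depend on system configuration (on Linux, the
net.ipv6.bindv6only sysctl).

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical helper, not passt's sock_l4(): open a TCP socket listening
 * on [::]:@port which also accepts IPv4 connections.
 *
 * Return: fd of the listening socket, or -1 on failure
 */
static int dual_stack_listen(uint16_t port)
{
	struct sockaddr_in6 sa;
	int fd, off = 0;

	fd = socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP);
	if (fd < 0)
		return -1;

	/* Make sure the socket is not restricted to IPv6 only */
	if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &off, sizeof(off)))
		goto fail;

	/* Bind to the unspecified address: a specific bind address would
	 * restrict the socket to one IP version again
	 */
	memset(&sa, 0, sizeof(sa));
	sa.sin6_family = AF_INET6;
	sa.sin6_port = htons(port);
	sa.sin6_addr = in6addr_any;

	if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) || listen(fd, 16))
		goto fail;

	return fd;

fail:
	close(fd);
	return -1;
}
```

IPv4 peers connecting to such a socket show up in accept() with an
IPv4-mapped source address (::ffff:a.b.c.d) in a struct sockaddr_in6,
which is why the earlier patches teach the TCP code to handle those.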

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 Makefile |  5 +++++
 util.c   | 17 +++++++++++++++--
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index ca453aa..0f92db5 100644
--- a/Makefile
+++ b/Makefile
@@ -11,6 +11,10 @@
 
 VERSION ?= $(shell git describe --tags HEAD 2>/dev/null || echo "unknown\ version")
 
+# Does the target platform allow IPv4 connections to be handled via
+# the IPv6 socket API? (Linux does)
+DUAL_STACK_SOCKETS := 1
+
 RLIMIT_STACK_VAL := $(shell /bin/sh -c 'ulimit -s')
 ifeq ($(RLIMIT_STACK_VAL),unlimited)
 RLIMIT_STACK_VAL := 1024
@@ -34,6 +38,7 @@ FLAGS += -DPASST_AUDIT_ARCH=AUDIT_ARCH_$(AUDIT_ARCH)
 FLAGS += -DRLIMIT_STACK_VAL=$(RLIMIT_STACK_VAL)
 FLAGS += -DARCH=\"$(TARGET_ARCH)\"
 FLAGS += -DVERSION=\"$(VERSION)\"
+FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c icmp.c igmp.c \
 	isolation.c lineread.c log.c mld.c ndp.c netlink.c packet.c passt.c \
diff --git a/util.c b/util.c
index 514bd44..fc629ed 100644
--- a/util.c
+++ b/util.c
@@ -22,6 +22,8 @@
 #include <string.h>
 #include <time.h>
 #include <errno.h>
+#include <stdbool.h>
+#include <assert.h>
 
 #include "util.h"
 #include "passt.h"
@@ -112,6 +114,7 @@ int sock_l4(const struct ctx *c, int af, uint8_t proto,
 		0, IN6ADDR_ANY_INIT, 0,
 	};
 	const struct sockaddr *sa;
+	bool dual_stack = false;
 	struct epoll_event ev;
 	int fd, sl, y = 1;
 
@@ -119,6 +122,13 @@ int sock_l4(const struct ctx *c, int af, uint8_t proto,
 	    proto != IPPROTO_ICMP && proto != IPPROTO_ICMPV6)
 		return -1;	/* Not implemented. */
 
+	if (af == AF_UNSPEC) {
+		if (!DUAL_STACK_SOCKETS || bind_addr)
+			return -1;
+		dual_stack = true;
+		af = AF_INET6;
+	}
+
 	if (proto == IPPROTO_TCP)
 		fd = socket(af, SOCK_STREAM | SOCK_NONBLOCK, proto);
 	else
@@ -158,8 +168,11 @@ int sock_l4(const struct ctx *c, int af, uint8_t proto,
 		sa = (const struct sockaddr *)&addr6;
 		sl = sizeof(addr6);
 
-		if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &y, sizeof(y)))
-			debug("Failed to set IPV6_V6ONLY on socket %i", fd);
+		if (!dual_stack)
+			if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY,
+				       &y, sizeof(y)))
+				debug("Failed to set IPV6_V6ONLY on socket %i",
+				      fd);
 	}
 
 	if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &y, sizeof(y)))
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 31/32] util: Always return -1 on error in sock_l4()
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (29 preceding siblings ...)
  2022-11-16  4:42 ` [PATCH 30/32] util: Allow sock_l4() to open dual stack sockets David Gibson
@ 2022-11-16  4:42 ` David Gibson
  2022-11-16  4:42 ` [PATCH 32/32] tcp: Use dual stack sockets for port forwarding when possible David Gibson
  31 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:42 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

According to its doc comments, sock_l4() returns -1 on error.  It does,
except in one case where it returns -EIO.  Fix this inconsistency to match
the docs and always return -1.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 util.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/util.c b/util.c
index fc629ed..e2222b8 100644
--- a/util.c
+++ b/util.c
@@ -141,7 +141,7 @@ int sock_l4(const struct ctx *c, int af, uint8_t proto,
 
 	if (fd > SOCKET_MAX) {
 		close(fd);
-		return -EIO;
+		return -1;
 	}
 
 	ref.r.s = fd;
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 32/32] tcp: Use dual stack sockets for port forwarding when possible
  2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
                   ` (30 preceding siblings ...)
  2022-11-16  4:42 ` [PATCH 31/32] util: Always return -1 on error in sock_l4() David Gibson
@ 2022-11-16  4:42 ` David Gibson
  2022-11-17  0:15   ` Stefano Brivio
  31 siblings, 1 reply; 57+ messages in thread
From: David Gibson @ 2022-11-16  4:42 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Platforms like Linux allow IPv6 sockets to listen for IPv4 connections as
well as native IPv6 connections.  By doing this we halve the number of
listening sockets we need for TCP (assuming passt/pasta is listening on the
same ports for IPv4 and IPv6).  When forwarding many ports (e.g. -t all)
this can significantly reduce the amount of kernel memory that passt
consumes.

When forwarding all TCP and UDP ports for both IPv4 and IPv6 (-t all
-u all), this reduces kernel memory usage from ~677MiB to ~487MiB
(kernel version 6.0.8 on Fedora 37, x86_64).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/tcp.c b/tcp.c
index 616b9d0..5860c9f 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2991,8 +2991,12 @@ static int tcp_sock_init_af(const struct ctx *c, int af, in_port_t port,
 
 	s = sock_l4(c, af, IPPROTO_TCP, addr, ifname, port, tref.u32);
 
-	if (c->tcp.fwd_in.mode == FWD_AUTO)
-		tcp_sock_init_ext[port][(af == AF_INET) ? V4 : V6] = s;
+	if (c->tcp.fwd_in.mode == FWD_AUTO) {
+		if (af == AF_INET || af == AF_UNSPEC)
+			tcp_sock_init_ext[port][V4] = s;
+		if (af == AF_INET6 || af == AF_UNSPEC)
+			tcp_sock_init_ext[port][V6] = s;
+	}
 
 	if (s < 0)
 		return -1;
@@ -3012,6 +3016,12 @@ static int tcp_sock_init_af(const struct ctx *c, int af, in_port_t port,
 void tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
 		   const char *ifname, in_port_t port)
 {
+	if (af == AF_UNSPEC && c->ifi4 && c->ifi6)
+		/* Attempt to get a dual stack socket */
+		if (tcp_sock_init_af(c, AF_UNSPEC, port, addr, ifname) >= 0)
+			return;
+
+	/* Otherwise create a socket per IP version */
 	if ((af == AF_INET  || af == AF_UNSPEC) && c->ifi4)
 		tcp_sock_init_af(c, AF_INET, port, addr, ifname);
 	if ((af == AF_INET6 || af == AF_UNSPEC) && c->ifi6)
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH 01/32] clang-tidy: Suppress warning about assignments in if statements
  2022-11-16  4:41 ` [PATCH 01/32] clang-tidy: Suppress warning about assignments in if statements David Gibson
@ 2022-11-16 23:10   ` Stefano Brivio
  2022-11-17  1:20     ` David Gibson
  0 siblings, 1 reply; 57+ messages in thread
From: Stefano Brivio @ 2022-11-16 23:10 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Wed, 16 Nov 2022 15:41:41 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> clang-tools 15.0.0 appears to have added a new warning that will always
> complain about assignments in if statements, which we use in a number of
> places in passt/pasta.  Encountered on Fedora 37 with
> clang-tools-extra-15.0.0-3.fc37.x86_64.
> 
> Suppress the new warning so that we can compile and test.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  Makefile | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/Makefile b/Makefile
> index 6b22408..8bcbbc0 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -262,6 +262,7 @@ clang-tidy: $(SRCS) $(HEADERS)
>  	clang-tidy -checks=*,-modernize-*,\
>  	-clang-analyzer-valist.Uninitialized,\
>  	-cppcoreguidelines-init-variables,\
> +	-bugprone-assignment-in-if-condition,\

I'm trying to keep, in the comment just above, a list of clang-tidy
warnings we disable and the reason. I think this could just be grouped
with:

# - cppcoreguidelines-init-variables
#	Dubious value, would kill readability

-- 
Stefano


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 02/32] style: Minor corrections to function comments
  2022-11-16  4:41 ` [PATCH 02/32] style: Minor corrections to function comments David Gibson
@ 2022-11-16 23:11   ` Stefano Brivio
  2022-11-17  1:21     ` David Gibson
  0 siblings, 1 reply; 57+ messages in thread
From: Stefano Brivio @ 2022-11-16 23:11 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Wed, 16 Nov 2022 15:41:42 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> Some style issues and a typo.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  conf.c | 6 +++---
>  tap.c  | 6 +++---
>  2 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/conf.c b/conf.c
> index 1adcf83..3ad247e 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -112,9 +112,9 @@ static int get_bound_ports_ns(void *arg)
>   * @s:		String to search
>   * @c:		Delimiter character
>   *
> - * Returns: If another @c is found in @s, returns a pointer to the
> - *	    character *after* the delimiter, if no further @c is in
> - *	    @s, return NULL
> + * Return: If another @c is found in @s, returns a pointer to the
> + *	   character *after* the delimiter, if no further @c is in @s,
> + *	   return NULL
>   */
>  static char *next_chunk(const char *s, char c)
>  {
> diff --git a/tap.c b/tap.c
> index abeff25..707660c 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -90,7 +90,7 @@ int tap_send(const struct ctx *c, const void *data, size_t len)
>   * tap_ip4_daddr() - Normal IPv4 destination address for inbound packets
>   * @c:		Execution context
>   *
> - * Returns: IPv4 address, network order
> + * Return:	IPv4 address, network order

Loosely based on kerneldoc style: single space after "Return: " is the
style adopted everywhere else. Rationale: it doesn't need to be aligned
with anything else.

>   */
>  struct in_addr tap_ip4_daddr(const struct ctx *c)
>  {
> @@ -98,11 +98,11 @@ struct in_addr tap_ip4_daddr(const struct ctx *c)
>  }
>  
>  /**
> - * tap_ip6_daddr() - Normal IPv4 destination address for inbound packets
> + * tap_ip6_daddr() - Normal IPv6 destination address for inbound packets
>   * @c:		Execution context
>   * @src:	Source address
>   *
> - * Returns: pointer to IPv6 address
> + * Return:	pointer to IPv6 address

Same here.

>   */
>  const struct in6_addr *tap_ip6_daddr(const struct ctx *c,
>  				     const struct in6_addr *src)

-- 
Stefano


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 05/32] tcp: Better helpers for converting between connection pointer and index
  2022-11-16  4:41 ` [PATCH 05/32] tcp: Better helpers for converting between connection pointer and index David Gibson
@ 2022-11-16 23:11   ` Stefano Brivio
  2022-11-17  1:24     ` David Gibson
  0 siblings, 1 reply; 57+ messages in thread
From: Stefano Brivio @ 2022-11-16 23:11 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Wed, 16 Nov 2022 15:41:45 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> The macro CONN_OR_NULL() is used to look up connections by index with
> bounds checking.  Replace it with an inline function, which means:
>     - Better type checking
>     - No danger of multiple evaluation of an @index with side effects
> 
> Also add a helper to perform the reverse translation: from connection
> pointer to index.  Introduce a macro for this which will make later
> cleanups easier and safer.

Ah, yes, much better, agreed. Just two things here:

> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  tcp.c | 83 ++++++++++++++++++++++++++++++++---------------------------
>  1 file changed, 45 insertions(+), 38 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index d043123..4e56a6c 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -518,14 +518,6 @@ struct tcp_conn {
>  	 (conn->events & (SOCK_FIN_RCVD | TAP_FIN_RCVD)))
>  #define CONN_HAS(conn, set)	((conn->events & (set)) == (set))
>  
> -#define CONN(index)		(tc + (index))
> -
> -/* We probably don't want to use gcc statement expressions (for portability), so
> - * use this only after well-defined sequence points (no pre-/post-increments).
> - */
> -#define CONN_OR_NULL(index)						\
> -	(((int)(index) >= 0 && (index) < TCP_MAX_CONNS) ? (tc + (index)) : NULL)
> -
>  static const char *tcp_event_str[] __attribute((__unused__)) = {
>  	"SOCK_ACCEPTED", "TAP_SYN_RCVD", "ESTABLISHED", "TAP_SYN_ACK_SENT",
>  
> @@ -705,6 +697,21 @@ static size_t tcp6_l2_flags_buf_bytes;
>  /* TCP connections */
>  static struct tcp_conn tc[TCP_MAX_CONNS];
>  
> +#define CONN(index)		(tc + (index))
> +#define CONN_IDX(conn)		((conn) - tc)
> +
> +/** conn_at_idx() - Find a connection by index, if present
> + * @index:	Index of connection to lookup
> + *
> + * Return:	Pointer to connection, or NULL if @index is out of bounds

Return: pointer [...]

> + */
> +static inline struct tcp_conn *conn_at_idx(int index)

The CONN_OR_NULL name made it very explicit that the pointer obtained
there could be NULL. On the other hand I find conn_at_idx() more
descriptive. But maybe conn_or_null() would be "safer". I don't really
have a preference.
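
To make the multiple evaluation point concrete, here is a small
self-contained sketch contrasting the two forms.  The table, struct and
MAX_CONNS are stand-ins invented for the example (the real code uses tc[]
and TCP_MAX_CONNS), and the inline function uses the conn_or_null() name
only as one of the two options discussed above.

```c
#include <stddef.h>

#define MAX_CONNS	4	/* hypothetical, stands in for TCP_MAX_CONNS */

struct conn {
	int sock;
};

static struct conn table[MAX_CONNS];

/* Macro form: (index) is expanded, and so evaluated, up to three times */
#define CONN_OR_NULL(index)						\
	(((int)(index) >= 0 && (index) < MAX_CONNS) ? (table + (index)) : NULL)

/* Inline form: @index is evaluated once at the call site, and its type is
 * checked by the compiler
 */
static inline struct conn *conn_or_null(int index)
{
	if (index < 0 || index >= MAX_CONNS)
		return NULL;

	return table + index;
}
```

Calling conn_or_null(i++) increments i exactly once, whereas
CONN_OR_NULL(i++) would increment it once per expansion of the argument,
which is the hazard the old comment about sequence points warned about.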

-- 
Stefano


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 14/32] tcp: Separate helpers to create ns listening sockets
  2022-11-16  4:41 ` [PATCH 14/32] tcp: Separate helpers to create ns listening sockets David Gibson
@ 2022-11-16 23:51   ` Stefano Brivio
  2022-11-17  1:32     ` David Gibson
  0 siblings, 1 reply; 57+ messages in thread
From: Stefano Brivio @ 2022-11-16 23:51 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Wed, 16 Nov 2022 15:41:54 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> tcp_sock_init*() can create either sockets listening on the host, or in
> the pasta network namespace (with @ns==1).  There are, however, a number
> of differences in how these two cases work in practice.  "ns"
> sockets are only used in pasta mode, and they always lead to spliced
> connections only.  The functions are also only ever called in "ns" mode
> with a NULL address and interface name, and it doesn't really make sense
> for them to be called any other way.
> 
> Later changes will introduce further differences in behaviour between these
> two cases, so it makes more sense to use separate functions for creating
> the ns listening sockets than the regular external/host listening sockets.
> ---
>  conf.c |   6 +--
>  tcp.c  | 130 ++++++++++++++++++++++++++++++++++++++-------------------
>  tcp.h  |   4 +-
>  3 files changed, 92 insertions(+), 48 deletions(-)
> 
> diff --git a/conf.c b/conf.c
> index 3ad247e..2b39d18 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -209,7 +209,7 @@ static int conf_ports(const struct ctx *c, char optname, const char *optarg,
>  
>  		for (i = 0; i < PORT_EPHEMERAL_MIN; i++) {
>  			if (optname == 't')
> -				tcp_sock_init(c, 0, AF_UNSPEC, NULL, NULL, i);
> +				tcp_sock_init(c, AF_UNSPEC, NULL, NULL, i);
>  			else if (optname == 'u')
>  				udp_sock_init(c, 0, AF_UNSPEC, NULL, NULL, i);
>  		}
> @@ -287,7 +287,7 @@ static int conf_ports(const struct ctx *c, char optname, const char *optarg,
>  			bitmap_set(fwd->map, i);
>  
>  			if (optname == 't')
> -				tcp_sock_init(c, 0, af, addr, ifname, i);
> +				tcp_sock_init(c, af, addr, ifname, i);
>  			else if (optname == 'u')
>  				udp_sock_init(c, 0, af, addr, ifname, i);
>  		}
> @@ -333,7 +333,7 @@ static int conf_ports(const struct ctx *c, char optname, const char *optarg,
>  			fwd->delta[i] = mapped_range.first - orig_range.first;
>  
>  			if (optname == 't')
> -				tcp_sock_init(c, 0, af, addr, ifname, i);
> +				tcp_sock_init(c, af, addr, ifname, i);
>  			else if (optname == 'u')
>  				udp_sock_init(c, 0, af, addr, ifname, i);
>  		}
> diff --git a/tcp.c b/tcp.c
> index aac70cd..72d3b49 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -2987,15 +2987,15 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
>  /**
>   * tcp_sock_init4() - Initialise listening sockets for a given IPv4 port
>   * @c:		Execution context
> - * @ns:		In pasta mode, if set, bind with loopback address in namespace
>   * @addr:	Pointer to address for binding, NULL if not configured
>   * @ifname:	Name of interface to bind to, NULL if not configured
>   * @port:	Port, host order
>   */
> -static void tcp_sock_init4(const struct ctx *c, int ns, const struct in_addr *addr,
> +static void tcp_sock_init4(const struct ctx *c, const struct in_addr *addr,
>  			   const char *ifname, in_port_t port)
>  {
> -	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = ns };
> +	in_port_t idx = port + c->tcp.fwd_in.delta[port];
> +	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.index = idx };

Usual order here...

>  	bool spliced = false, tap = true;
>  	int s;
>  
> @@ -3006,14 +3006,9 @@ static void tcp_sock_init4(const struct ctx *c, int ns, const struct in_addr *ad
>  		if (!addr)
>  			addr = &c->ip4.addr;
>  
> -		tap = !ns && !IN4_IS_ADDR_LOOPBACK(addr);
> +		tap = !IN4_IS_ADDR_LOOPBACK(addr);
>  	}
>  
> -	if (ns)
> -		tref.tcp.index = (in_port_t)(port + c->tcp.fwd_out.delta[port]);
> -	else
> -		tref.tcp.index = (in_port_t)(port + c->tcp.fwd_in.delta[port]);
> -
>  	if (tap) {
>  		s = sock_l4(c, AF_INET, IPPROTO_TCP, addr, ifname, port,
>  			    tref.u32);
> @@ -3039,29 +3034,25 @@ static void tcp_sock_init4(const struct ctx *c, int ns, const struct in_addr *ad
>  		else
>  			s = -1;
>  
> -		if (c->tcp.fwd_out.mode == FWD_AUTO) {
> -			if (ns)
> -				tcp_sock_ns[port][V4] = s;
> -			else
> -				tcp_sock_init_lo[port][V4] = s;
> -		}
> +		if (c->tcp.fwd_out.mode == FWD_AUTO)
> +			tcp_sock_init_lo[port][V4] = s;
>  	}
>  }
>  
>  /**
>   * tcp_sock_init6() - Initialise listening sockets for a given IPv6 port
>   * @c:		Execution context
> - * @ns:		In pasta mode, if set, bind with loopback address in namespace
>   * @addr:	Pointer to address for binding, NULL if not configured
>   * @ifname:	Name of interface to bind to, NULL if not configured
>   * @port:	Port, host order
>   */
> -static void tcp_sock_init6(const struct ctx *c, int ns,
> +static void tcp_sock_init6(const struct ctx *c,
>  			   const struct in6_addr *addr, const char *ifname,
>  			   in_port_t port)
>  {
> -	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = ns,
> -				     .tcp.v6 = 1 };
> +	in_port_t idx = port + c->tcp.fwd_in.delta[port];
> +	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.v6 = 1,
> +				     .tcp.index = idx	};

Excess whitespace.

>  	bool spliced = false, tap = true;
>  	int s;
>  
> @@ -3073,14 +3064,9 @@ static void tcp_sock_init6(const struct ctx *c, int ns,
>  		if (!addr)
>  			addr = &c->ip6.addr;
>  
> -		tap = !ns && !IN6_IS_ADDR_LOOPBACK(addr);
> +		tap = !IN6_IS_ADDR_LOOPBACK(addr);
>  	}
>  
> -	if (ns)
> -		tref.tcp.index = (in_port_t)(port + c->tcp.fwd_out.delta[port]);
> -	else
> -		tref.tcp.index = (in_port_t)(port + c->tcp.fwd_in.delta[port]);
> -
>  	if (tap) {
>  		s = sock_l4(c, AF_INET6, IPPROTO_TCP, addr, ifname, port,
>  			    tref.u32);
> @@ -3105,40 +3091,99 @@ static void tcp_sock_init6(const struct ctx *c, int ns,
>  		else
>  			s = -1;
>  
> -		if (c->tcp.fwd_out.mode == FWD_AUTO) {
> -			if (ns)
> -				tcp_sock_ns[port][V6] = s;
> -			else
> -				tcp_sock_init_lo[port][V6] = s;
> -		}
> +		if (c->tcp.fwd_out.mode == FWD_AUTO)
> +			tcp_sock_init_lo[port][V6] = s;
>  	}
>  }
>  
>  /**
>   * tcp_sock_init() - Initialise listening sockets for a given port

Maybe we should now indicate this is for "inbound" connections only
("for a given, inbound, port"?)

>   * @c:		Execution context
> - * @ns:		In pasta mode, if set, bind with loopback address in namespace
>   * @af:		Address family to select a specific IP version, or AF_UNSPEC
>   * @addr:	Pointer to address for binding, NULL if not configured
>   * @ifname:	Name of interface to bind to, NULL if not configured
>   * @port:	Port, host order
>   */
> -void tcp_sock_init(const struct ctx *c, int ns, sa_family_t af,
> -		   const void *addr, const char *ifname, in_port_t port)
> +void tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
> +		   const char *ifname, in_port_t port)
>  {
>  	if ((af == AF_INET  || af == AF_UNSPEC) && c->ifi4)
> -		tcp_sock_init4(c, ns, addr, ifname, port);
> +		tcp_sock_init4(c, addr, ifname, port);
>  	if ((af == AF_INET6 || af == AF_UNSPEC) && c->ifi6)
> -		tcp_sock_init6(c, ns, addr, ifname, port);
> +		tcp_sock_init6(c, addr, ifname, port);
> +}
> +
> +/**
> + * tcp_ns_sock_init4() - Init socket to listen for outbound IPv4 connections
> + * @c:		Execution context
> + * @port:	Port, host order
> + */
> +static void tcp_ns_sock_init4(const struct ctx *c, in_port_t port)
> +{
> +	in_port_t idx = port + c->tcp.fwd_out.delta[port];

Move after declaration of 'loopback'.

> +	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = 1,
> +				     .tcp.splice = 1, .tcp.index = idx };
> +	struct in_addr loopback = { htonl(INADDR_LOOPBACK) };
> +	int s;
> +
> +	assert(c->mode == MODE_PASTA);
> +
> +	s = sock_l4(c, AF_INET, IPPROTO_TCP, &loopback, NULL, port, tref.u32);
> +	if (s >= 0)
> +		tcp_sock_set_bufsize(c, s);
> +	else
> +		s = -1;
> +
> +	if (c->tcp.fwd_out.mode == FWD_AUTO)
> +		tcp_sock_ns[port][V4] = s;
>  }
>  
>  /**
> - * tcp_sock_init_ns() - Bind sockets in namespace for outbound connections
> + * tcp_ns_sock_init6() - Init socket to listen for outbound IPv6 connections
> + * @c:		Execution context
> + * @port:	Port, host order
> + */
> +static void tcp_ns_sock_init6(const struct ctx *c, in_port_t port)
> +{
> +	in_port_t idx = port + c->tcp.fwd_out.delta[port];
> +	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = 1,
> +				     .tcp.splice = 1, .tcp.v6 = 1,
> +				     .tcp.index = idx};

Missing whitespace between 'idx' and };

> +	int s;
> +
> +	assert(c->mode == MODE_PASTA);
> +
> +	s = sock_l4(c, AF_INET6, IPPROTO_TCP, &in6addr_loopback, NULL, port,
> +		    tref.u32);
> +	if (s >= 0)
> +		tcp_sock_set_bufsize(c, s);
> +	else
> +		s = -1;
> +
> +	if (c->tcp.fwd_out.mode == FWD_AUTO)
> +		tcp_sock_ns[port][V6] = s;
> +}
> +
> +/**
> + * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections
> + * @c:		Execution context
> + * @port:	Port, host order
> + */
> +void tcp_ns_sock_init(const struct ctx *c, in_port_t port)
> +{
> +	if (c->ifi4)
> +		tcp_ns_sock_init4(c, port);
> +	if (c->ifi6)
> +		tcp_ns_sock_init6(c, port);
> +}
> +
> +/**
> + * tcp_ns_socks_init() - Bind sockets in namespace for outbound connections
>   * @arg:	Execution context
>   *
>   * Return: 0
>   */
> -static int tcp_sock_init_ns(void *arg)
> +static int tcp_ns_socks_init(void *arg)
>  {
>  	struct ctx *c = (struct ctx *)arg;
>  	unsigned port;
> @@ -3149,7 +3194,7 @@ static int tcp_sock_init_ns(void *arg)
>  		if (!bitmap_isset(c->tcp.fwd_out.map, port))
>  			continue;
>  
> -		tcp_sock_init(c, 1, AF_UNSPEC, NULL, NULL, port);
> +		tcp_ns_sock_init(c, port);
>  	}
>  
>  	return 0;
> @@ -3279,7 +3324,7 @@ int tcp_init(struct ctx *c)
>  	if (c->mode == MODE_PASTA) {
>  		tcp_splice_init(c);
>  
> -		NS_CALL(tcp_sock_init_ns, c);
> +		NS_CALL(tcp_ns_socks_init, c);
>  
>  		refill_arg.ns = 1;
>  		NS_CALL(tcp_sock_refill, &refill_arg);
> @@ -3364,8 +3409,7 @@ static int tcp_port_rebind(void *arg)
>  
>  			if ((a->c->ifi4 && tcp_sock_ns[port][V4] == -1) ||
>  			    (a->c->ifi6 && tcp_sock_ns[port][V6] == -1))
> -				tcp_sock_init(a->c, 1, AF_UNSPEC, NULL, NULL,
> -					      port);
> +				tcp_ns_sock_init(a->c, port);
>  		}
>  	} else {
>  		for (port = 0; port < NUM_PORTS; port++) {
> @@ -3398,7 +3442,7 @@ static int tcp_port_rebind(void *arg)
>  
>  			if ((a->c->ifi4 && tcp_sock_init_ext[port][V4] == -1) ||
>  			    (a->c->ifi6 && tcp_sock_init_ext[port][V6] == -1))
> -				tcp_sock_init(a->c, 0, AF_UNSPEC, NULL, NULL,
> +				tcp_sock_init(a->c, AF_UNSPEC, NULL, NULL,
>  					      port);
>  		}
>  	}
> diff --git a/tcp.h b/tcp.h
> index 49738ef..f4ed298 100644
> --- a/tcp.h
> +++ b/tcp.h
> @@ -19,8 +19,8 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
>  		      const struct timespec *now);
>  int tcp_tap_handler(struct ctx *c, int af, const void *addr,
>  		    const struct pool *p, const struct timespec *now);
> -void tcp_sock_init(const struct ctx *c, int ns, sa_family_t af,
> -		   const void *addr, const char *ifname, in_port_t port);
> +void tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
> +		   const char *ifname, in_port_t port);
>  int tcp_init(struct ctx *c);
>  void tcp_timer(struct ctx *c, const struct timespec *ts);
>  void tcp_defer_handler(struct ctx *c);


* Re: [PATCH 15/32] tcp: Unify part of spliced and non-spliced conn_from_sock path
  2022-11-16  4:41 ` [PATCH 15/32] tcp: Unify part of spliced and non-spliced conn_from_sock path David Gibson
@ 2022-11-16 23:53   ` Stefano Brivio
  2022-11-17  1:37     ` David Gibson
  0 siblings, 1 reply; 57+ messages in thread
From: Stefano Brivio @ 2022-11-16 23:53 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Wed, 16 Nov 2022 15:41:55 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> In tcp_sock_handler() we split off to handle spliced sockets before
> checking anything else.  However the first steps of the "new connection"
> path for each case are the same: allocate a connection entry and accept()
> the connection.
> 
> Remove this duplication by making tcp_conn_from_sock() handle both spliced
> and non-spliced cases, with help from more specific tcp_tap_conn_from_sock
> and tcp_splice_conn_from_sock functions for the later stages which differ.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  tcp.c        | 68 ++++++++++++++++++++++++++++++++++------------------
>  tcp_splice.c | 58 +++++++++++++++++++++++---------------------
>  tcp_splice.h |  4 ++++
>  3 files changed, 80 insertions(+), 50 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index 72d3b49..e66a82a 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -2753,28 +2753,19 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
>  }
>  
>  /**
> - * tcp_conn_from_sock() - Handle new connection request from listening socket
> + * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection
>   * @c:		Execution context
>   * @ref:	epoll reference of listening socket
> + * @conn:	connection structure to initialize
> + * @s:		Accepted socket
> + * @sa:		Peer socket address (from accept())
>   * @now:	Current timestamp
>   */
> -static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
> -			       const struct timespec *now)
> +static void tcp_tap_conn_from_sock(struct ctx *c, union epoll_ref ref,
> +				   struct tcp_tap_conn *conn, int s,
> +				   struct sockaddr *sa,
> +				   const struct timespec *now)
>  {
> -	struct sockaddr_storage sa;
> -	struct tcp_tap_conn *conn;
> -	socklen_t sl;
> -	int s;
> -
> -	if (c->tcp.conn_count >= TCP_MAX_CONNS)
> -		return;
> -
> -	sl = sizeof(sa);
> -	s = accept4(ref.r.s, (struct sockaddr *)&sa, &sl, SOCK_NONBLOCK);
> -	if (s < 0)
> -		return;
> -
> -	conn = CONN(c->tcp.conn_count++);
>  	conn->c.spliced = false;
>  	conn->sock = s;
>  	conn->timer = -1;
> @@ -2784,7 +2775,7 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
>  	if (ref.r.p.tcp.tcp.v6) {
>  		struct sockaddr_in6 sa6;
>  
> -		memcpy(&sa6, &sa, sizeof(sa6));
> +		memcpy(&sa6, sa, sizeof(sa6));
>  
>  		if (IN6_IS_ADDR_LOOPBACK(&sa6.sin6_addr) ||
>  		    IN6_ARE_ADDR_EQUAL(&sa6.sin6_addr, &c->ip6.addr_seen) ||
> @@ -2813,7 +2804,7 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
>  	} else {
>  		struct sockaddr_in sa4;
>  
> -		memcpy(&sa4, &sa, sizeof(sa4));
> +		memcpy(&sa4, sa, sizeof(sa4));
>  
>  		memset(&conn->a.a4.zero,   0, sizeof(conn->a.a4.zero));
>  		memset(&conn->a.a4.one, 0xff, sizeof(conn->a.a4.one));
> @@ -2846,6 +2837,37 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
>  	tcp_get_sndbuf(conn);
>  }
>  
> +/**
> + * tcp_conn_from_sock() - Handle new connection request from listening socket
> + * @c:		Execution context
> + * @ref:	epoll reference of listening socket
> + * @now:	Current timestamp
> + */
> +static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
> +			       const struct timespec *now)
> +{
> +	struct sockaddr_storage sa;
> +	union tcp_conn *conn;
> +	socklen_t sl;
> +	int s;
> +
> +	if (c->tcp.conn_count >= TCP_MAX_CONNS)
> +		return;
> +
> +	sl = sizeof(sa);
> +	s = accept4(ref.r.s, (struct sockaddr *)&sa, &sl, SOCK_NONBLOCK);

Combined with 16/32 I'm not sure this is simplifying much -- it looks a
bit unnatural there to get the peer address not "directly" from
accept4(). On the other hand you drop a few lines -- I'm fine with
it either way.

> +	if (s < 0)
> +		return;
> +
> +	conn = tc + c->tcp.conn_count++;
> +
> +	if (ref.r.p.tcp.tcp.splice)
> +		tcp_splice_conn_from_sock(c, ref, &conn->splice, s);
> +	else
> +		tcp_tap_conn_from_sock(c, ref, &conn->tap, s,
> +				       (struct sockaddr *)&sa, now);
> +}
> +
>  /**
>   * tcp_timer_handler() - timerfd events: close, send ACK, retransmit, or reset
>   * @c:		Execution context
> @@ -2925,13 +2947,13 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
>  		return;
>  	}
>  
> -	if (ref.r.p.tcp.tcp.splice) {
> -		tcp_sock_handler_splice(c, ref, events);
> +	if (ref.r.p.tcp.tcp.listen) {
> +		tcp_conn_from_sock(c, ref, now);
>  		return;
>  	}
>  
> -	if (ref.r.p.tcp.tcp.listen) {
> -		tcp_conn_from_sock(c, ref, now);
> +	if (ref.r.p.tcp.tcp.splice) {
> +		tcp_sock_handler_splice(c, ref, events);
>  		return;
>  	}
>  
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 7a06252..7007501 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -501,6 +501,36 @@ static void tcp_splice_dir(struct tcp_splice_conn *conn, int ref_sock,
>  	*pipes = *from == conn->a ? conn->pipe_a_b : conn->pipe_b_a;
>  }
>  
> +/**
> + * tcp_splice_conn_from_sock() - Initialize state for spliced connection
> + * @c:		Execution context
> + * @ref:	epoll reference of listening socket
> + * @conn:	connection structure to initialize
> + * @s:		Accepted socket
> + *
> + * #syscalls:pasta setsockopt
> + */
> +void tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
> +			       struct tcp_splice_conn *conn, int s)
> +{
> +	assert(c->mode == MODE_PASTA);
> +
> +	if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }),
> +		       sizeof(int))) {
> +		trace("TCP (spliced): failed to set TCP_QUICKACK on %i",
> +		      s);

This could be indented sanely, now.

> +	}
> +
> +	conn->c.spliced = true;
> +	c->tcp.splice_conn_count++;
> +	conn->a = s;
> +	conn->flags = ref.r.p.tcp.tcp.v6 ? SPLICE_V6 : 0;
> +
> +	if (tcp_splice_new(c, conn, ref.r.p.tcp.tcp.index,
> +			   ref.r.p.tcp.tcp.outbound))
> +		conn_flag(c, conn, CLOSING);
> +}
> +
>  /**
>   * tcp_sock_handler_splice() - Handler for socket mapped to spliced connection
>   * @c:		Execution context
> @@ -517,33 +547,7 @@ void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
>  	uint32_t *seq_read, *seq_write;
>  	struct tcp_splice_conn *conn;
>  
> -	if (ref.r.p.tcp.tcp.listen) {
> -		int s;
> -
> -		if (c->tcp.conn_count >= TCP_MAX_CONNS)
> -			return;
> -
> -		if ((s = accept4(ref.r.s, NULL, NULL, SOCK_NONBLOCK)) < 0)
> -			return;
> -
> -		if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }),
> -			       sizeof(int))) {
> -			trace("TCP (spliced): failed to set TCP_QUICKACK on %i",
> -			      s);
> -		}
> -
> -		conn = CONN(c->tcp.conn_count++);
> -		conn->c.spliced = true;
> -		c->tcp.splice_conn_count++;
> -		conn->a = s;
> -		conn->flags = ref.r.p.tcp.tcp.v6 ? SPLICE_V6 : 0;
> -
> -		if (tcp_splice_new(c, conn, ref.r.p.tcp.tcp.index,
> -				   ref.r.p.tcp.tcp.outbound))
> -			conn_flag(c, conn, CLOSING);
> -
> -		return;
> -	}
> +	assert(!ref.r.p.tcp.tcp.listen);
>  
>  	conn = CONN(ref.r.p.tcp.tcp.index);
>  
> diff --git a/tcp_splice.h b/tcp_splice.h
> index 22024d6..f9462ae 100644
> --- a/tcp_splice.h
> +++ b/tcp_splice.h
> @@ -6,8 +6,12 @@
>  #ifndef TCP_SPLICE_H
>  #define TCP_SPLICE_H
>  
> +struct tcp_splice_conn;
> +
>  void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
>  			     uint32_t events);
> +void tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
> +			       struct tcp_splice_conn *conn, int s);
>  void tcp_splice_init(struct ctx *c);
>  
>  #endif /* TCP_SPLICE_H */

-- 
Stefano


* Re: [PATCH 16/32] tcp: Use the same sockets to listen for spliced and non-spliced connections
  2022-11-16  4:41 ` [PATCH 16/32] tcp: Use the same sockets to listen for spliced and non-spliced connections David Gibson
@ 2022-11-16 23:54   ` Stefano Brivio
  2022-11-17  1:43     ` David Gibson
  0 siblings, 1 reply; 57+ messages in thread
From: Stefano Brivio @ 2022-11-16 23:54 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Wed, 16 Nov 2022 15:41:56 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> In pasta mode, tcp_sock_init[46]() create separate sockets to listen for
> spliced connections (these are bound to localhost) and non-spliced
> connections (these are bound to the host address).  This introduces a
> subtle behavioural difference between pasta and passt: by default, pasta
> will listen only on a single host address, whereas passt will listen on
> all addresses (0.0.0.0 or ::).  This also prevents us using some additional
> optimizations that only work with the unspecified (0.0.0.0 or ::) address.
> 
> However, it turns out we don't need to do this.  We can splice a connection
> if and only if it originates from the loopback address.  Currently we
> ensure this by having the "spliced" listening sockets listening only on
> loopback.  Instead, defer the decision about whether to splice a connection
> until after accept(), by checking if the connection was made from the
> loopback address.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  tcp.c        | 127 +++++++++++++--------------------------------------
>  tcp_splice.c |  25 ++++++++--
>  tcp_splice.h |   5 +-
>  3 files changed, 55 insertions(+), 102 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index e66a82a..4065da7 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -434,7 +434,6 @@ static const char *tcp_flag_str[] __attribute((__unused__)) = {
>  };
>  
>  /* Listening sockets, used for automatic port forwarding in pasta mode only */
> -static int tcp_sock_init_lo	[NUM_PORTS][IP_VERSIONS];
>  static int tcp_sock_init_ext	[NUM_PORTS][IP_VERSIONS];
>  static int tcp_sock_ns		[NUM_PORTS][IP_VERSIONS];
>  
> @@ -2851,21 +2850,31 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
>  	socklen_t sl;
>  	int s;
>  
> +	assert(ref.r.p.tcp.tcp.listen);
> +	assert(!ref.r.p.tcp.tcp.splice);
> +
>  	if (c->tcp.conn_count >= TCP_MAX_CONNS)
>  		return;
>  
>  	sl = sizeof(sa);
> +	/* FIXME: Workaround clang-tidy not realizing that accept4()
> +	 * writes the socket address.  See
> +	 * https://github.com/llvm/llvm-project/issues/58992
> +	 */
> +	memset(&sa, 0, sizeof(struct sockaddr_in6));
>  	s = accept4(ref.r.s, (struct sockaddr *)&sa, &sl, SOCK_NONBLOCK);

Ah, interesting. That looks new by the way -- not even valgrind
complained about this.

>  	if (s < 0)
>  		return;
>  
>  	conn = tc + c->tcp.conn_count++;
>  
> -	if (ref.r.p.tcp.tcp.splice)
> -		tcp_splice_conn_from_sock(c, ref, &conn->splice, s);
> -	else
> -		tcp_tap_conn_from_sock(c, ref, &conn->tap, s,
> -				       (struct sockaddr *)&sa, now);
> +	if (c->mode == MODE_PASTA &&
> +	    tcp_splice_conn_from_sock(c, ref, &conn->splice,
> +				      s, (struct sockaddr *)&sa))
> +		return;
> +
> +	tcp_tap_conn_from_sock(c, ref, &conn->tap, s,
> +			       (struct sockaddr *)&sa, now);
>  }
>  
>  /**
> @@ -3018,47 +3027,16 @@ static void tcp_sock_init4(const struct ctx *c, const struct in_addr *addr,
>  {
>  	in_port_t idx = port + c->tcp.fwd_in.delta[port];
>  	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.index = idx };
> -	bool spliced = false, tap = true;
>  	int s;
>  
> -	if (c->mode == MODE_PASTA) {
> -		spliced = !addr || IN4_IS_ADDR_UNSPECIFIED(addr) ||
> -			IN4_IS_ADDR_LOOPBACK(addr);
> -
> -		if (!addr)
> -			addr = &c->ip4.addr;
> -
> -		tap = !IN4_IS_ADDR_LOOPBACK(addr);
> -	}
> -
> -	if (tap) {
> -		s = sock_l4(c, AF_INET, IPPROTO_TCP, addr, ifname, port,
> -			    tref.u32);
> -		if (s >= 0)
> -			tcp_sock_set_bufsize(c, s);
> -		else
> -			s = -1;
> -
> -		if (c->tcp.fwd_in.mode == FWD_AUTO)
> -			tcp_sock_init_ext[port][V4] = s;
> -	}
> -
> -	if (spliced) {
> -		struct in_addr loopback = { htonl(INADDR_LOOPBACK) };
> -		tref.tcp.splice = 1;
> -
> -		addr = &loopback;
> -
> -		s = sock_l4(c, AF_INET, IPPROTO_TCP, addr, ifname, port,
> -			    tref.u32);
> -		if (s >= 0)
> -			tcp_sock_set_bufsize(c, s);
> -		else
> -			s = -1;
> +	s = sock_l4(c, AF_INET, IPPROTO_TCP, addr, ifname, port, tref.u32);
> +	if (s >= 0)
> +		tcp_sock_set_bufsize(c, s);
> +	else
> +		s = -1;
>  
> -		if (c->tcp.fwd_out.mode == FWD_AUTO)
> -			tcp_sock_init_lo[port][V4] = s;
> -	}
> +	if (c->tcp.fwd_in.mode == FWD_AUTO)
> +		tcp_sock_init_ext[port][V4] = s;
>  }
>  
>  /**
> @@ -3075,47 +3053,16 @@ static void tcp_sock_init6(const struct ctx *c,
>  	in_port_t idx = port + c->tcp.fwd_in.delta[port];
>  	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.v6 = 1,
>  				     .tcp.index = idx	};
> -	bool spliced = false, tap = true;
>  	int s;
>  
> -	if (c->mode == MODE_PASTA) {
> -		spliced = !addr ||
> -			  IN6_IS_ADDR_UNSPECIFIED(addr) ||
> -			  IN6_IS_ADDR_LOOPBACK(addr);
> -
> -		if (!addr)
> -			addr = &c->ip6.addr;
> -
> -		tap = !IN6_IS_ADDR_LOOPBACK(addr);
> -	}
> -
> -	if (tap) {
> -		s = sock_l4(c, AF_INET6, IPPROTO_TCP, addr, ifname, port,
> -			    tref.u32);
> -		if (s >= 0)
> -			tcp_sock_set_bufsize(c, s);
> -		else
> -			s = -1;
> -
> -		if (c->tcp.fwd_in.mode == FWD_AUTO)
> -			tcp_sock_init_ext[port][V6] = s;
> -	}
> -
> -	if (spliced) {
> -		tref.tcp.splice = 1;
> -
> -		addr = &in6addr_loopback;
> -
> -		s = sock_l4(c, AF_INET6, IPPROTO_TCP, addr, ifname, port,
> -			    tref.u32);
> -		if (s >= 0)
> -			tcp_sock_set_bufsize(c, s);
> -		else
> -			s = -1;
> +	s = sock_l4(c, AF_INET6, IPPROTO_TCP, addr, ifname, port, tref.u32);
> +	if (s >= 0)
> +		tcp_sock_set_bufsize(c, s);
> +	else
> +		s = -1;
>  
> -		if (c->tcp.fwd_out.mode == FWD_AUTO)
> -			tcp_sock_init_lo[port][V6] = s;
> -	}
> +	if (c->tcp.fwd_in.mode == FWD_AUTO)
> +		tcp_sock_init_ext[port][V6] = s;
>  }
>  
>  /**
> @@ -3144,7 +3091,7 @@ static void tcp_ns_sock_init4(const struct ctx *c, in_port_t port)
>  {
>  	in_port_t idx = port + c->tcp.fwd_out.delta[port];
>  	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = 1,
> -				     .tcp.splice = 1, .tcp.index = idx };
> +				     .tcp.index = idx };
>  	struct in_addr loopback = { htonl(INADDR_LOOPBACK) };
>  	int s;
>  
> @@ -3169,8 +3116,7 @@ static void tcp_ns_sock_init6(const struct ctx *c, in_port_t port)
>  {
>  	in_port_t idx = port + c->tcp.fwd_out.delta[port];
>  	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = 1,
> -				     .tcp.splice = 1, .tcp.v6 = 1,
> -				     .tcp.index = idx};
> +				     .tcp.v6 = 1, .tcp.index = idx};

Space missing here (from 14/32).

>  	int s;
>  
>  	assert(c->mode == MODE_PASTA);
> @@ -3337,7 +3283,6 @@ int tcp_init(struct ctx *c)
>  	memset(init_sock_pool6,		0xff,	sizeof(init_sock_pool6));
>  	memset(ns_sock_pool4,		0xff,	sizeof(ns_sock_pool4));
>  	memset(ns_sock_pool6,		0xff,	sizeof(ns_sock_pool6));
> -	memset(tcp_sock_init_lo,	0xff,	sizeof(tcp_sock_init_lo));
>  	memset(tcp_sock_init_ext,	0xff,	sizeof(tcp_sock_init_ext));
>  	memset(tcp_sock_ns,		0xff,	sizeof(tcp_sock_ns));
>  
> @@ -3445,16 +3390,6 @@ static int tcp_port_rebind(void *arg)
>  					close(tcp_sock_init_ext[port][V6]);
>  					tcp_sock_init_ext[port][V6] = -1;
>  				}
> -
> -				if (tcp_sock_init_lo[port][V4] >= 0) {
> -					close(tcp_sock_init_lo[port][V4]);
> -					tcp_sock_init_lo[port][V4] = -1;
> -				}
> -
> -				if (tcp_sock_init_lo[port][V6] >= 0) {
> -					close(tcp_sock_init_lo[port][V6]);
> -					tcp_sock_init_lo[port][V6] = -1;
> -				}
>  				continue;
>  			}
>  
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 7007501..30d49d4 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -502,19 +502,35 @@ static void tcp_splice_dir(struct tcp_splice_conn *conn, int ref_sock,
>  }
>  
>  /**
> - * tcp_splice_conn_from_sock() - Initialize state for spliced connection
> + * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection
>   * @c:		Execution context
>   * @ref:	epoll reference of listening socket
>   * @conn:	connection structure to initialize
>   * @s:		Accepted socket
> + * @sa:		Peer address of connection
>   *
> + * Return: true if able to create a spliced connection, false otherwise
>   * #syscalls:pasta setsockopt
>   */
> -void tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
> -			       struct tcp_splice_conn *conn, int s)
> +bool tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
> +			       struct tcp_splice_conn *conn, int s,
> +			       const struct sockaddr *sa)
>  {
>  	assert(c->mode == MODE_PASTA);
>  
> +	if (ref.r.p.tcp.tcp.v6) {
> +		const struct sockaddr_in6 *sa6
> +			= (const struct sockaddr_in6 *)sa;

Maybe you could split declaration and assignment here.
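
As a standalone illustration of that style applied to the loopback
check, something like the sketch below would do. It is not the code
from the patch: peer_is_loopback() is a made-up helper name, it keys
off sa_family instead of the v6 bit in the epoll reference, and it
open-codes the 127.0.0.0/8 test rather than relying on passt's own
IN4_IS_ADDR_LOOPBACK macro.

```c
#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/**
 * peer_is_loopback() - Check if a peer address is a loopback address
 * @sa:		Peer address, as filled in by accept4()
 *
 * Return: true for ::1 or 127.0.0.0/8, false for anything else
 */
static bool peer_is_loopback(const struct sockaddr *sa)
{
	const struct sockaddr_in6 *sa6;
	const struct sockaddr_in *sa4;

	if (sa->sa_family == AF_INET6) {
		sa6 = (const struct sockaddr_in6 *)sa;
		return IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr);
	}

	if (sa->sa_family == AF_INET) {
		sa4 = (const struct sockaddr_in *)sa;
		/* 127.0.0.0/8: top octet of the host-order address is 127 */
		return (ntohl(sa4->sin_addr.s_addr) >> 24) == 127;
	}

	return false;
}
```

A caller would pass the sockaddr_storage filled in by accept4(), cast
to struct sockaddr *, and only attempt to splice when this returns
true.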

> +		if (!IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr))
> +			return false;
> +		conn->flags = SPLICE_V6;
> +	} else {
> +		const struct sockaddr_in *sa4 = (const struct sockaddr_in *)sa;
> +		if (!IN4_IS_ADDR_LOOPBACK(&sa4->sin_addr))
> +			return false;
> +		conn->flags = 0;
> +	}
> +
>  	if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }),
>  		       sizeof(int))) {
>  		trace("TCP (spliced): failed to set TCP_QUICKACK on %i",
> @@ -524,11 +540,12 @@ void tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
>  	conn->c.spliced = true;
>  	c->tcp.splice_conn_count++;
>  	conn->a = s;
> -	conn->flags = ref.r.p.tcp.tcp.v6 ? SPLICE_V6 : 0;
>  
>  	if (tcp_splice_new(c, conn, ref.r.p.tcp.tcp.index,
>  			   ref.r.p.tcp.tcp.outbound))
>  		conn_flag(c, conn, CLOSING);
> +
> +	return true;
>  }
>  
>  /**
> diff --git a/tcp_splice.h b/tcp_splice.h
> index f9462ae..1a915dd 100644
> --- a/tcp_splice.h
> +++ b/tcp_splice.h
> @@ -10,8 +10,9 @@ struct tcp_splice_conn;
>  
>  void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
>  			     uint32_t events);
> -void tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
> -			       struct tcp_splice_conn *conn, int s);
> +bool tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
> +			       struct tcp_splice_conn *conn, int s,
> +			       const struct sockaddr *sa);
>  void tcp_splice_init(struct ctx *c);
>  
>  #endif /* TCP_SPLICE_H */


* Re: [PATCH 19/32] inany: Helper functions for handling addresses which could be IPv4 or IPv6
  2022-11-16  4:41 ` [PATCH 19/32] inany: Helper functions for handling addresses which could be IPv4 or IPv6 David Gibson
@ 2022-11-16 23:54   ` Stefano Brivio
  2022-11-17  1:48     ` David Gibson
  0 siblings, 1 reply; 57+ messages in thread
From: Stefano Brivio @ 2022-11-16 23:54 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

[Reviewed until 25/32 so far]

On Wed, 16 Nov 2022 15:41:59 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> struct tcp_conn stores an address which could be IPv6 or IPv4 using a
> union.  We can do this without an additional tag by encoding IPv4 addresses
> as IPv4-mapped IPv6 addresses.
> 
> This approach is useful more widely than the specific place in tcp_conn, so
> expose a new 'union inany_addr' like this from a new inany.h.  Along with
> that create a number of helper functions to make working with these "inany"
> addresses easier.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  Makefile     |  6 ++--
>  inany.h      | 68 ++++++++++++++++++++++++++++++++++++++++
>  tcp.c        | 88 +++++++++++++++++++++++++---------------------------
>  tcp_conn.h   | 15 ++-------
>  tcp_splice.c |  1 +
>  5 files changed, 117 insertions(+), 61 deletions(-)
>  create mode 100644 inany.h
> 
> diff --git a/Makefile b/Makefile
> index 9046b0b..ca453aa 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -44,9 +44,9 @@ SRCS = $(PASST_SRCS) $(QRAP_SRCS)
>  MANPAGES = passt.1 pasta.1 qrap.1
>  
>  PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h icmp.h \
> -	isolation.h lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h \
> -	pcap.h port_fwd.h siphash.h tap.h tcp.h tcp_conn.h tcp_splice.h udp.h \
> -	util.h
> +	inany.h isolation.h lineread.h log.h ndp.h netlink.h packet.h passt.h \
> +	pasta.h pcap.h port_fwd.h siphash.h tap.h tcp.h tcp_conn.h \
> +	tcp_splice.h udp.h util.h
>  HEADERS = $(PASST_HEADERS) seccomp.h
>  
>  # On gcc 11 and 12, with -O2 and -flto, tcp_hash() and siphash_20b(), if
> diff --git a/inany.h b/inany.h
> new file mode 100644
> index 0000000..4e53da9
> --- /dev/null
> +++ b/inany.h
> @@ -0,0 +1,68 @@
> +/* SPDX-License-Identifier: AGPL-3.0-or-later
> + * Copyright Red Hat
> + * Author: David Gibson <david@gibson.dropbear.id.au>
> + *
> + * inany.h - Types and helpers for handling addresses which could be
> + *           IPv6 or IPv4 (encoded as IPv4-mapped IPv6 addresses)
> + */
> +
> +#include <assert.h>
> +
> +/** union inany_addr - Represents either an IPv4 or IPv6 address
> + * @a6:		Address as an IPv6 address, may be IPv4-mapped
> + * @_v4mapped._zero:	All zero-bits for an IPv4 address
> + * @_v4mapped._one:	All one-bits for an IPv4 address
> + * @_v4mapped.a4:	If @a6 is an IPv4 mapped address, this is the raw IPv4 address
> + *
> + * Fields starting with _ shouldn't be accessed except via helpers.
> + */
> +union inany_addr {
> +	struct in6_addr a6;
> +	struct {
> +		uint8_t _zero[10];
> +		uint8_t _one[2];
> +		struct in_addr a4;
> +	} _v4mapped;

I'm not sure the extra _ are really worth it. I mean, that's not really
enforceable, so saying that v4mapped should only be accessed by helpers
should be equivalent.
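
For illustration, a standalone sketch of the union without the extra
underscores might look like the following. The field names zero, one
and v4mapped are only a guess at what the renamed version could be, and
memcpy() is used here in place of the struct assignments in the patch
so that the sketch makes no assumption about the alignment of @addr.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Either an IPv6 address, or an IPv4 address encoded as IPv4-mapped */
union inany_addr {
	struct in6_addr a6;
	struct {
		uint8_t zero[10];	/* 80 zero bits of ::ffff:a.b.c.d */
		uint8_t one[2];		/* 16 one bits of ::ffff:a.b.c.d */
		struct in_addr a4;	/* raw IPv4 address, last 4 bytes */
	} v4mapped;
};

/* Return the IPv4 address if @addr is IPv4-mapped, NULL otherwise */
static inline const struct in_addr *inany_v4(const union inany_addr *addr)
{
	if (!IN6_IS_ADDR_V4MAPPED(&addr->a6))
		return NULL;
	return &addr->v4mapped.a4;
}

/* Fill @aa from a struct in_addr (AF_INET) or struct in6_addr (AF_INET6) */
static inline void inany_from_af(union inany_addr *aa, int af, const void *addr)
{
	if (af == AF_INET6) {
		memcpy(&aa->a6, addr, sizeof(aa->a6));
	} else if (af == AF_INET) {
		memset(aa->v4mapped.zero, 0, sizeof(aa->v4mapped.zero));
		memset(aa->v4mapped.one, 0xff, sizeof(aa->v4mapped.one));
		memcpy(&aa->v4mapped.a4, addr, sizeof(aa->v4mapped.a4));
	} else {
		/* Not valid to call with other address families */
		assert(0);
	}
}
```

The two prefix arrays take 12 bytes, so a4 lands on the last four bytes
of a6 and the union stays the size of a struct in6_addr, which is what
lets IN6_IS_ADDR_V4MAPPED() act as the tag.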

> +};
> +
> +/** inany_v4 - Extract IPv4 address, if present, from IPv[46] address
> + * @addr:	IPv4 or IPv6 address
> + *
> + * Return: IPv4 address if @addr is IPv4, NULL otherwise
> + */
> +static inline const struct in_addr *inany_v4(const union inany_addr *addr)
> +{
> +	if (!IN6_IS_ADDR_V4MAPPED(&addr->a6))
> +		return NULL;
> +	return &addr->_v4mapped.a4;
> +}
> +
> +/** inany_equals - Compare two IPv[46] addresses
> + * @a, @b:	IPv[46] addresses
> + *
> + * Return: true if @a and @b are the same address
> + */
> +static inline bool inany_equals(const union inany_addr *a,
> +				const union inany_addr *b)
> +{
> +	return IN6_ARE_ADDR_EQUAL(&a->a6, &b->a6);
> +}
> +
> +/** inany_from_af - Set IPv[46] address from IPv4 or IPv6 address
> + * @aa:		Pointer to store IPv[46] address
> + * @af:		Address family of @addr
> + * @addr:	struct in_addr (IPv4) or struct in6_addr (IPv6)
> + */
> +static inline void inany_from_af(union inany_addr *aa, int af, const void *addr)
> +{
> +	if (af == AF_INET6) {
> +		aa->a6 = *((struct in6_addr *)addr);
> +	} else if (af == AF_INET) {
> +		memset(&aa->_v4mapped._zero, 0, sizeof(aa->_v4mapped._zero));
> +		memset(&aa->_v4mapped._one, 0xff, sizeof(aa->_v4mapped._one));
> +		aa->_v4mapped.a4 = *((struct in_addr *)addr);
> +	} else {
> +		/* Not valid to call with other address families */
> +		assert(0);
> +	}
> +}
> diff --git a/tcp.c b/tcp.c
> index 7686766..4040198 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -301,6 +301,7 @@
>  #include "conf.h"
>  #include "tcp_splice.h"
>  #include "log.h"
> +#include "inany.h"
>  
>  #include "tcp_conn.h"
>  
> @@ -404,7 +405,7 @@ struct tcp6_l2_head {	/* For MSS6 macro: keep in sync with tcp6_l2_buf_t */
>  #define OPT_SACK	5
>  #define OPT_TS		8
>  
> -#define CONN_V4(conn)		IN6_IS_ADDR_V4MAPPED(&conn->a.a6)
> +#define CONN_V4(conn)		(!!inany_v4(&(conn)->addr))
>  #define CONN_V6(conn)		(!CONN_V4(conn))
>  #define CONN_IS_CLOSING(conn)						\
>  	((conn->events & ESTABLISHED) &&				\
> @@ -438,7 +439,7 @@ static int tcp_sock_init_ext	[NUM_PORTS][IP_VERSIONS];
>  static int tcp_sock_ns		[NUM_PORTS][IP_VERSIONS];
>  
>  /* Table of destinations with very low RTT (assumed to be local), LRU */
> -static struct in6_addr low_rtt_dst[LOW_RTT_TABLE_SIZE];
> +static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE];
>  
>  /* Static buffers */
>  
> @@ -861,7 +862,7 @@ static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn)
>  	int i;
>  
>  	for (i = 0; i < LOW_RTT_TABLE_SIZE; i++)
> -		if (IN6_ARE_ADDR_EQUAL(&conn->a.a6, low_rtt_dst + i))
> +		if (inany_equals(&conn->addr, low_rtt_dst + i))
>  			return 1;
>  
>  	return 0;
> @@ -883,7 +884,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
>  		return;
>  
>  	for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) {
> -		if (IN6_ARE_ADDR_EQUAL(&conn->a.a6, low_rtt_dst + i))
> +		if (inany_equals(&conn->addr, low_rtt_dst + i))
>  			return;
>  		if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i))
>  			hole = i;
> @@ -895,10 +896,10 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
>  	if (hole == -1)
>  		return;
>  
> -	memcpy(low_rtt_dst + hole++, &conn->a.a6, sizeof(conn->a.a6));
> +	low_rtt_dst[hole++] = conn->addr;
>  	if (hole == LOW_RTT_TABLE_SIZE)
>  		hole = 0;
> -	memcpy(low_rtt_dst + hole, &in6addr_any, sizeof(conn->a.a6));
> +	inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any);
>  #else
>  	(void)conn;
>  	(void)tinfo;
> @@ -1187,13 +1188,14 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn,
>  			  int af, const void *addr,
>  			  in_port_t tap_port, in_port_t sock_port)
>  {
> -	if (af == AF_INET && CONN_V4(conn)			&&
> -	    !memcmp(&conn->a.a4.a, addr, sizeof(conn->a.a4.a))	&&
> +	const struct in_addr *a4 = inany_v4(&conn->addr);
> +
> +	if (af == AF_INET && a4	&& !memcmp(a4, addr, sizeof(*a4)) &&
>  	    conn->tap_port == tap_port && conn->sock_port == sock_port)
>  		return 1;
>  
>  	if (af == AF_INET6					&&
> -	    IN6_ARE_ADDR_EQUAL(&conn->a.a6, addr)		&&
> +	    IN6_ARE_ADDR_EQUAL(&conn->addr.a6, addr)		&&
>  	    conn->tap_port == tap_port && conn->sock_port == sock_port)
>  		return 1;

Note to self or other reviewers: switch to inany_equals() in 22/32.
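
For reference, a minimal standalone sketch of how the two branches could
collapse into one comparison once both sides are inany addresses. The
union mirrors the one in inany.h above, but is redefined here (without
the underscore prefixes) so that the sketch compiles on its own.
addr_matches() is a hypothetical name, memcmp() stands in for
IN6_ARE_ADDR_EQUAL(), and this is not the code from 22/32:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Same layout idea as union inany_addr in the patch (illustration only) */
union inany_addr {
	struct in6_addr a6;
	struct {
		uint8_t zero[10];
		uint8_t one[2];
		struct in_addr a4;
	} v4mapped;
};

/* Store an IPv4 or IPv6 address, IPv4 as IPv4-mapped (::ffff:a.b.c.d) */
static inline void inany_from_af(union inany_addr *aa, int af,
				 const void *addr)
{
	if (af == AF_INET6) {
		memcpy(&aa->a6, addr, sizeof(aa->a6));
	} else if (af == AF_INET) {
		memset(aa->v4mapped.zero, 0, sizeof(aa->v4mapped.zero));
		memset(aa->v4mapped.one, 0xff, sizeof(aa->v4mapped.one));
		memcpy(&aa->v4mapped.a4, addr, sizeof(aa->v4mapped.a4));
	} else {
		/* Not valid to call with other address families */
		assert(0);
	}
}

/* Compare all 16 bytes: IPv4 and IPv4-mapped share one representation */
static inline bool inany_equals(const union inany_addr *a,
				const union inany_addr *b)
{
	return !memcmp(&a->a6, &b->a6, sizeof(a->a6));
}

/* Hypothetical helper: one comparison regardless of address family */
static inline bool addr_matches(const union inany_addr *conn_addr,
				int af, const void *addr)
{
	union inany_addr aany;

	inany_from_af(&aany, af, addr);
	return inany_equals(conn_addr, &aany);
}
```

With this shape, a connection stored from AF_INET 127.0.0.1 also matches
a lookup for AF_INET6 ::ffff:127.0.0.1, which is the point of the shared
representation.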

-- 
Stefano



* Re: [PATCH 26/32] tcp: Remove v6 flag from tcp_epoll_ref
  2022-11-16  4:42 ` [PATCH 26/32] tcp: Remove v6 flag from tcp_epoll_ref David Gibson
@ 2022-11-17  0:15   ` Stefano Brivio
  2022-11-17  1:50     ` David Gibson
  0 siblings, 1 reply; 57+ messages in thread
From: Stefano Brivio @ 2022-11-17  0:15 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Wed, 16 Nov 2022 15:42:06 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> This bit in the TCP specific epoll reference indicates whether the
> connection is IPv6 or IPv4.  However the sites which refer to it are
> already calling accept() which (optionally) returns an address for the
> remote end of the connection.  We can use the sa_family field in that
> address to determine the connection type independent of the epoll
> reference.
> 
> This does have a cost: for the spliced case, it means we now need to get
> that address from accept() which introduces an extra copy_to_user().
> However, in future we want to allow handling IPv4 connections through IPv6
> sockets, which means we won't be able to determine the IP version at the
> time we create the listening socket and epoll reference.  So, at some point
> we'll have to pay this cost anyway.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  tcp.c        | 10 ++++------
>  tcp.h        |  2 --
>  tcp_splice.c |  9 ++++-----
>  3 files changed, 8 insertions(+), 13 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index 0513b3b..b05ed6c 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -662,8 +662,7 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>  {
>  	int m = conn->c.in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
>  	union epoll_ref ref = { .r.proto = IPPROTO_TCP, .r.s = conn->sock,
> -				.r.p.tcp.tcp.index = CONN_IDX(conn),
> -				.r.p.tcp.tcp.v6 = CONN_V6(conn) };
> +				.r.p.tcp.tcp.index = CONN_IDX(conn) };
>  	struct epoll_event ev = { .data.u64 = ref.u64 };
>  
>  	if (conn->events == CLOSED) {
> @@ -2745,7 +2744,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, union epoll_ref ref,
>  	conn->ws_to_tap = conn->ws_from_tap = 0;
>  	conn_event(c, conn, SOCK_ACCEPTED);
>  
> -	if (ref.r.p.tcp.tcp.v6) {
> +	if (sa->sa_family == AF_INET6) {
>  		struct sockaddr_in6 sa6;
>  
>  		memcpy(&sa6, sa, sizeof(sa6));
> @@ -3019,8 +3018,7 @@ static void tcp_sock_init6(const struct ctx *c,
>  			   in_port_t port)
>  {
>  	in_port_t idx = port + c->tcp.fwd_in.delta[port];
> -	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.v6 = 1,
> -				     .tcp.index = idx	};
> +	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.index = idx	};

Excess whitespace (from earlier patch).

>  	int s;
>  
>  	s = sock_l4(c, AF_INET6, IPPROTO_TCP, addr, ifname, port, tref.u32);
> @@ -3084,7 +3082,7 @@ static void tcp_ns_sock_init6(const struct ctx *c, in_port_t port)
>  {
>  	in_port_t idx = port + c->tcp.fwd_out.delta[port];
>  	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = 1,
> -				     .tcp.v6 = 1, .tcp.index = idx};
> +				     .tcp.index = idx};

Missing whitespace (from earlier patch).

>  	int s;
>  
>  	assert(c->mode == MODE_PASTA);
> diff --git a/tcp.h b/tcp.h
> index a940682..739b451 100644
> --- a/tcp.h
> +++ b/tcp.h
> @@ -33,7 +33,6 @@ void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s,
>   * union tcp_epoll_ref - epoll reference portion for TCP connections
>   * @listen:		Set if this file descriptor is a listening socket
>   * @outbound:		Listening socket maps to outbound, spliced connection
> - * @v6:			Set for IPv6 sockets or connections
>   * @timer:		Reference is a timerfd descriptor for connection
>   * @index:		Index of connection in table, or port for bound sockets
>   * @u32:		Opaque u32 value of reference
> @@ -42,7 +41,6 @@ union tcp_epoll_ref {
>  	struct {
>  		uint32_t	listen:1,
>  				outbound:1,
> -				v6:1,
>  				timer:1,
>  				index:20;
>  	} tcp;
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 30ab0eb..7c2f667 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -167,11 +167,9 @@ static int tcp_splice_epoll_ctl(const struct ctx *c,
>  {
>  	int m = conn->c.in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
>  	union epoll_ref ref_a = { .r.proto = IPPROTO_TCP, .r.s = conn->a,
> -				  .r.p.tcp.tcp.index = CONN_IDX(conn),
> -				  .r.p.tcp.tcp.v6 = CONN_V6(conn) };
> +				  .r.p.tcp.tcp.index = CONN_IDX(conn) };
>  	union epoll_ref ref_b = { .r.proto = IPPROTO_TCP, .r.s = conn->b,
> -				  .r.p.tcp.tcp.index = CONN_IDX(conn),
> -				  .r.p.tcp.tcp.v6 = CONN_V6(conn) };
> +				  .r.p.tcp.tcp.index = CONN_IDX(conn) };
>  	struct epoll_event ev_a = { .data.u64 = ref_a.u64 };
>  	struct epoll_event ev_b = { .data.u64 = ref_b.u64 };
>  	uint32_t events_a, events_b;
> @@ -504,6 +502,7 @@ static void tcp_splice_dir(struct tcp_splice_conn *conn, int ref_sock,
>   * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection
>   * @c:		Execution context
>   * @ref:	epoll reference of listening socket
> + * @ipv6:	Should this be an IPv6 connection?

Leftover from a previous idea, I guess.

>   * @conn:	connection structure to initialize
>   * @s:		Accepted socket
>   * @sa:		Peer address of connection
> @@ -517,7 +516,7 @@ bool tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
>  {
>  	assert(c->mode == MODE_PASTA);
>  
> -	if (ref.r.p.tcp.tcp.v6) {
> +	if (sa->sa_family == AF_INET6) {
>  		const struct sockaddr_in6 *sa6
>  			= (const struct sockaddr_in6 *)sa;
>  		if (!IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr))
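
As a side note on relying on the address returned by accept(): a
minimal sketch of classifying a peer from its sockaddr alone, with no
flag in the epoll reference. peer_ip_version() is a hypothetical helper,
not something from this series. It also treats an IPv4-mapped peer as
IPv4, which is my assumption about what becomes necessary once IPv4
connections arrive on IPv6 sockets:

```c
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical helper: IP version of the peer described by @sa
 *
 * Return: 4 for IPv4 or IPv4-mapped peers, 6 for other IPv6 peers,
 *	   -1 for any other address family
 */
static int peer_ip_version(const struct sockaddr *sa)
{
	if (sa->sa_family == AF_INET)
		return 4;

	if (sa->sa_family == AF_INET6) {
		struct sockaddr_in6 sa6;

		/* Copy out to avoid alignment and aliasing concerns */
		memcpy(&sa6, sa, sizeof(sa6));

		/* ::ffff:a.b.c.d is an IPv4 peer seen through an IPv6 socket */
		return IN6_IS_ADDR_V4MAPPED(&sa6.sin6_addr) ? 4 : 6;
	}

	return -1;
}
```

Checking sa_family on its own, as this patch does, is enough as long as
every listening socket is bound to a single IP version.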

-- 
Stefano



* Re: [PATCH 27/32] tcp: NAT IPv4-mapped IPv6 addresses like IPv4 addresses
  2022-11-16  4:42 ` [PATCH 27/32] tcp: NAT IPv4-mapped IPv6 addresses like IPv4 addresses David Gibson
@ 2022-11-17  0:15   ` Stefano Brivio
  2022-11-17  2:00     ` David Gibson
  0 siblings, 1 reply; 57+ messages in thread
From: Stefano Brivio @ 2022-11-17  0:15 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Wed, 16 Nov 2022 15:42:07 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> passt usually doesn't NAT, but it does do so for the remapping of the
> gateway address to refer to the host.  Currently we perform this NAT with
> slightly different rules on both IPv4 addresses and IPv6 addresses, but not
> on IPv4-mapped IPv6 addresses.  This means we won't correctly handle the
> case of an IPv4 connection over an IPv6 socket, which is possible on Linux
> (and probably other platforms).

By the way, I really think it's just Linux, I can't think of other
examples.

> Refactor tcp_conn_from_sock() to perform the NAT after converting either
> address family into an inany_addr, so IPv4 and IPv4-mapped addresses
> have the same representation.
> 
> With two new helpers this lets us remove the IPv4 and IPv6 specific paths
> from tcp_conn_from_sock().
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  inany.h | 30 ++++++++++++++++++++++++++--
>  tcp.c   | 62 ++++++++++++++++++++++++---------------------------------
>  2 files changed, 54 insertions(+), 38 deletions(-)
> 
> diff --git a/inany.h b/inany.h
> index 4e53da9..a677aa7 100644
> --- a/inany.h
> +++ b/inany.h
> @@ -30,11 +30,11 @@ union inany_addr {
>   *
>   * Return: IPv4 address if @addr is IPv4, NULL otherwise
>   */
> -static inline const struct in_addr *inany_v4(const union inany_addr *addr)
> +static inline struct in_addr *inany_v4(const union inany_addr *addr)

There must be a reason, but I can't understand why this change is
needed here.

>  {
>  	if (!IN6_IS_ADDR_V4MAPPED(&addr->a6))
>  		return NULL;
> -	return &addr->_v4mapped.a4;
> +	return (struct in_addr *)&addr->_v4mapped.a4;
>  }
>  
>  /** inany_equals - Compare two IPv[46] addresses
> @@ -66,3 +66,29 @@ static inline void inany_from_af(union inany_addr *aa, int af, const void *addr)
>  		assert(0);
>  	}
>  }
> +
> +/** inany_from_sockaddr - Extract IPv[46] address and port number from sockaddr
> + * @a:		Pointer to store IPv[46] address

This is aa below, I'm not sure why.

> + * @port:	Pointer to store port number, host order
> + * @addr:	struct sockaddr_in (IPv4) or struct sockaddr_in6 (IPv6)

This became sa_ (needless to say, addr would make more sense).

> + */
> +static inline void inany_from_sockaddr(union inany_addr *aa, in_port_t *port,
> +				       const void *sa_)
> +{
> +	const struct sockaddr *sa = (const struct sockaddr *)sa_;
> +
> +	if (sa->sa_family == AF_INET6) {
> +		struct sockaddr_in6 *sa6 = (struct sockaddr_in6 *)sa;
> +
> +		inany_from_af(aa, AF_INET6, &sa6->sin6_addr);
> +		*port = ntohs(sa6->sin6_port);
> +	} else if (sa->sa_family == AF_INET) {
> +		struct sockaddr_in *sa4 = (struct sockaddr_in *)sa;
> +
> +		inany_from_af(aa, AF_INET, &sa4->sin_addr);
> +		*port = ntohs(sa4->sin_port);
> +	} else {
> +		/* Not valid to call with other address families */
> +		assert(0);
> +	}
> +}
> diff --git a/tcp.c b/tcp.c
> index b05ed6c..fca5df4 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -2724,6 +2724,29 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
>  	conn_flag(c, conn, ACK_FROM_TAP_DUE);
>  }
>  
> +static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr)

What this does is kind of obvious, still a comment would be nice.

> +{
> +	struct in_addr *addr4 = inany_v4(addr);
> +
> +	if (addr4) {
> +		if (IN4_IS_ADDR_LOOPBACK(addr4) ||
> +		    IN4_IS_ADDR_UNSPECIFIED(addr4) ||
> +		    IN4_ARE_ADDR_EQUAL(addr4, &c->ip4.addr_seen))
> +			*addr4 = c->ip4.gw;
> +	} else {
> +		struct in6_addr *addr6 = &addr->a6;
> +
> +		if (IN6_IS_ADDR_LOOPBACK(addr6) ||
> +		    IN6_ARE_ADDR_EQUAL(addr6, &c->ip6.addr_seen) ||
> +		    IN6_ARE_ADDR_EQUAL(addr6, &c->ip6.addr)) {
> +			if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
> +				*addr6 = c->ip6.gw;
> +			else
> +				*addr6 = c->ip6.addr_ll;
> +		}
> +	}
> +}
> +
>  /**
>   * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection
>   * @c:		Execution context
> @@ -2744,43 +2767,10 @@ static void tcp_tap_conn_from_sock(struct ctx *c, union epoll_ref ref,
>  	conn->ws_to_tap = conn->ws_from_tap = 0;
>  	conn_event(c, conn, SOCK_ACCEPTED);
>  
> -	if (sa->sa_family == AF_INET6) {
> -		struct sockaddr_in6 sa6;
> -
> -		memcpy(&sa6, sa, sizeof(sa6));
> -
> -		if (IN6_IS_ADDR_LOOPBACK(&sa6.sin6_addr) ||
> -		    IN6_ARE_ADDR_EQUAL(&sa6.sin6_addr, &c->ip6.addr_seen) ||
> -		    IN6_ARE_ADDR_EQUAL(&sa6.sin6_addr, &c->ip6.addr)) {
> -			struct in6_addr *src;
> +	inany_from_sockaddr(&conn->addr, &conn->sock_port, sa);
> +	conn->tap_port = ref.r.p.tcp.tcp.index;
>  
> -			if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
> -				src = &c->ip6.gw;
> -			else
> -				src = &c->ip6.addr_ll;
> -
> -			memcpy(&sa6.sin6_addr, src, sizeof(*src));
> -		}
> -
> -		inany_from_af(&conn->addr, AF_INET6, &sa6.sin6_addr);
> -
> -		conn->sock_port = ntohs(sa6.sin6_port);
> -		conn->tap_port = ref.r.p.tcp.tcp.index;
> -	} else {
> -		struct sockaddr_in sa4;
> -
> -		memcpy(&sa4, sa, sizeof(sa4));
> -
> -		if (IN4_IS_ADDR_LOOPBACK(&sa4.sin_addr) ||
> -		    IN4_IS_ADDR_UNSPECIFIED(&sa4.sin_addr) ||
> -		    IN4_ARE_ADDR_EQUAL(&sa4.sin_addr, &c->ip4.addr_seen))
> -			sa4.sin_addr = c->ip4.gw;
> -
> -		inany_from_af(&conn->addr, AF_INET, &sa4.sin_addr);
> -
> -		conn->sock_port = ntohs(sa4.sin_port);
> -		conn->tap_port = ref.r.p.tcp.tcp.index;
> -	}
> +	tcp_snat_inbound(c, &conn->addr);
>  
>  	tcp_seq_init(c, conn, now);
>  	tcp_hash_insert(c, conn);

-- 
Stefano



* Re: [PATCH 28/32] tcp_splice: Allow splicing of connections from IPv4-mapped loopback
  2022-11-16  4:42 ` [PATCH 28/32] tcp_splice: Allow splicing of connections from IPv4-mapped loopback David Gibson
@ 2022-11-17  0:15   ` Stefano Brivio
  2022-11-17  2:05     ` David Gibson
  0 siblings, 1 reply; 57+ messages in thread
From: Stefano Brivio @ 2022-11-17  0:15 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Wed, 16 Nov 2022 15:42:08 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> For non-spliced connections we now treat IPv4-mapped IPv6 addresses the
> same as the corresponding IPv4 addresses.  However currently we won't
> splice a connection from ::ffff:127.0.0.1 the way we would one from
> 127.0.0.1.  Correct this so that we can splice connections from IPv4
> localhost that have been received on an IPv6 dual stack socket.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  tcp_splice.c | 20 ++++++++++++--------
>  1 file changed, 12 insertions(+), 8 deletions(-)
> 
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 7c2f667..61c56be 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -514,19 +514,23 @@ bool tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
>  			       struct tcp_splice_conn *conn, int s,
>  			       const struct sockaddr *sa)
>  {
> +	union inany_addr aany;
> +	const struct in_addr *a4;

The usual order.

> +	in_port_t port;

Is the port actually needed here? I don't see how you use it.

> +
>  	assert(c->mode == MODE_PASTA);
>  
> -	if (sa->sa_family == AF_INET6) {
> -		const struct sockaddr_in6 *sa6
> -			= (const struct sockaddr_in6 *)sa;
> -		if (!IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr))
> +	inany_from_sockaddr(&aany, &port, sa);
> +	a4 = inany_v4(&aany);
> +
> +	if (a4) {
> +		if (!IN4_IS_ADDR_LOOPBACK(a4))
>  			return false;
> -		conn->flags = SPLICE_V6;
> +		conn->flags = 0;
>  	} else {
> -		const struct sockaddr_in *sa4 = (const struct sockaddr_in *)sa;
> -		if (!IN4_IS_ADDR_LOOPBACK(&sa4->sin_addr))
> +		if (!IN6_IS_ADDR_LOOPBACK(&aany.a6))
>  			return false;
> -		conn->flags = 0;
> +		conn->flags = SPLICE_V6;
>  	}
>  
>  	if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }),

-- 
Stefano



* Re: [PATCH 32/32] tcp: Use dual stack sockets for port forwarding when possible
  2022-11-16  4:42 ` [PATCH 32/32] tcp: Use dual stack sockets for port forwarding when possible David Gibson
@ 2022-11-17  0:15   ` Stefano Brivio
  2022-11-17  2:08     ` David Gibson
  0 siblings, 1 reply; 57+ messages in thread
From: Stefano Brivio @ 2022-11-17  0:15 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Wed, 16 Nov 2022 15:42:12 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> Platforms like Linux allow IPv6 sockets to listen for IPv4 connections as
> well as native IPv6 connections.  By doing this we halve the number of
> listening sockets we need for TCP (assuming passt/pasta is listening on the
> same ports for IPv4 and IPv6).  When forwarding many ports (e.g. -t all)
> this can significantly reduce the amount of kernel memory that passt
> consumes.
> 
> When forwarding all TCP and UDP ports for both IPv4 and IPv6 (-t all
> -u all), this reduces kernel memory usage from ~677MiB to ~487MiB
> (kernel version 6.0.8 on Fedora 37, x86_64).

Oh, nice, that's quite significant.

> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  tcp.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/tcp.c b/tcp.c
> index 616b9d0..5860c9f 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -2991,8 +2991,12 @@ static int tcp_sock_init_af(const struct ctx *c, int af, in_port_t port,
>  
>  	s = sock_l4(c, af, IPPROTO_TCP, addr, ifname, port, tref.u32);
>  
> -	if (c->tcp.fwd_in.mode == FWD_AUTO)
> -		tcp_sock_init_ext[port][(af == AF_INET) ? V4 : V6] = s;
> +	if (c->tcp.fwd_in.mode == FWD_AUTO) {
> +		if (af == AF_INET || af == AF_UNSPEC)
> +			tcp_sock_init_ext[port][V4] = s;
> +		if (af == AF_INET6 || af == AF_UNSPEC)

Nit: you could align the || af == AF_UNSPEC above with an extra
whitespace (as it's done in the context below).

> +			tcp_sock_init_ext[port][V6] = s;
> +	}
>  
>  	if (s < 0)
>  		return -1;
> @@ -3012,6 +3016,12 @@ static int tcp_sock_init_af(const struct ctx *c, int af, in_port_t port,
>  void tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
>  		   const char *ifname, in_port_t port)
>  {
> +	if (af == AF_UNSPEC && c->ifi4 && c->ifi6)
> +		/* Attempt to get a dual stack socket */
> +		if (tcp_sock_init_af(c, AF_UNSPEC, port, addr, ifname) >= 0)
> +			return;
> +
> +	/* Otherwise create a socket per IP version */

...this looks surprisingly clean by the way, at least much cleaner than
I expected.

>  	if ((af == AF_INET  || af == AF_UNSPEC) && c->ifi4)
>  		tcp_sock_init_af(c, AF_INET, port, addr, ifname);
>  	if ((af == AF_INET6 || af == AF_UNSPEC) && c->ifi6)

I just finished reviewing this series. In general it looks great to me;
I'll have another look (and test!) on Thursday, either using this
version or a re-spin.
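
For anybody following along, a minimal sketch of what a dual stack
listening socket amounts to at the plain socket API level: an AF_INET6
socket with IPV6_V6ONLY cleared, bound to the wildcard address.
listen_dual_stack() is a hypothetical name. sock_l4() isn't quoted in
this message, so I'm not claiming this is what it does for AF_UNSPEC:

```c
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical helper: open a TCP socket accepting IPv4 and IPv6
 * @port:	Port to listen on, host order (0 picks an ephemeral port)
 *
 * Return: listening socket, or -1 on failure (caller may fall back to
 *	   one socket per IP version)
 */
static int listen_dual_stack(in_port_t port)
{
	struct sockaddr_in6 sa6;
	int v6only = 0;
	int s;

	s = socket(AF_INET6, SOCK_STREAM, 0);
	if (s < 0)
		return -1;

	/* Clear IPV6_V6ONLY so that IPv4 peers are accepted as well, they
	 * show up with IPv4-mapped addresses (::ffff:a.b.c.d)
	 */
	if (setsockopt(s, IPPROTO_IPV6, IPV6_V6ONLY, &v6only, sizeof(v6only)))
		goto fail;

	memset(&sa6, 0, sizeof(sa6));
	sa6.sin6_family = AF_INET6;
	sa6.sin6_addr = in6addr_any;
	sa6.sin6_port = htons(port);

	if (bind(s, (struct sockaddr *)&sa6, sizeof(sa6)) || listen(s, 128))
		goto fail;

	return s;

fail:
	close(s);
	return -1;
}
```

A caller would try this first and create separate AF_INET and AF_INET6
sockets only if it returns -1, which is the same fallback shape as in
tcp_sock_init() above.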

-- 
Stefano



* Re: [PATCH 01/32] clang-tidy: Suppress warning about assignments in if statements
  2022-11-16 23:10   ` Stefano Brivio
@ 2022-11-17  1:20     ` David Gibson
  0 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-17  1:20 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev


On Thu, Nov 17, 2022 at 12:10:48AM +0100, Stefano Brivio wrote:
> On Wed, 16 Nov 2022 15:41:41 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > clang-tools 15.0.0 appears to have added a new warning that will always
> > complain about assignments in if statements, which we use in a number of
> > places in passt/pasta.  Encountered on Fedora 37 with
> > clang-tools-extra-15.0.0-3.fc37.x86_64.
> > 
> > Suppress the new warning so that we can compile and test.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  Makefile | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/Makefile b/Makefile
> > index 6b22408..8bcbbc0 100644
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -262,6 +262,7 @@ clang-tidy: $(SRCS) $(HEADERS)
> >  	clang-tidy -checks=*,-modernize-*,\
> >  	-clang-analyzer-valist.Uninitialized,\
> >  	-cppcoreguidelines-init-variables,\
> > +	-bugprone-assignment-in-if-condition,\
> 
> I'm trying to keep, in the comment just above, a list of clang-tidy
> warnings we disable and the reason. I think this could just be grouped
> with:

Good point, I've updated the comment.

> 
> # - cppcoreguidelines-init-variables
> #	Dubious value, would kill readability
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 02/32] style: Minor corrections to function comments
  2022-11-16 23:11   ` Stefano Brivio
@ 2022-11-17  1:21     ` David Gibson
  0 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-17  1:21 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev


On Thu, Nov 17, 2022 at 12:11:09AM +0100, Stefano Brivio wrote:
> On Wed, 16 Nov 2022 15:41:42 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > Some style issues and a typo.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  conf.c | 6 +++---
> >  tap.c  | 6 +++---
> >  2 files changed, 6 insertions(+), 6 deletions(-)
> > 
> > diff --git a/conf.c b/conf.c
> > index 1adcf83..3ad247e 100644
> > --- a/conf.c
> > +++ b/conf.c
> > @@ -112,9 +112,9 @@ static int get_bound_ports_ns(void *arg)
> >   * @s:		String to search
> >   * @c:		Delimiter character
> >   *
> > - * Returns: If another @c is found in @s, returns a pointer to the
> > - *	    character *after* the delimiter, if no further @c is in
> > - *	    @s, return NULL
> > + * Return: If another @c is found in @s, returns a pointer to the
> > + *	   character *after* the delimiter, if no further @c is in @s,
> > + *	   return NULL
> >   */
> >  static char *next_chunk(const char *s, char c)
> >  {
> > diff --git a/tap.c b/tap.c
> > index abeff25..707660c 100644
> > --- a/tap.c
> > +++ b/tap.c
> > @@ -90,7 +90,7 @@ int tap_send(const struct ctx *c, const void *data, size_t len)
> >   * tap_ip4_daddr() - Normal IPv4 destination address for inbound packets
> >   * @c:		Execution context
> >   *
> > - * Returns: IPv4 address, network order
> > + * Return:	IPv4 address, network order
> 
> Loosely based on kerneldoc style: single space after "Return: " is the
> style adopted everywhere else. Rationale: it doesn't need to be aligned
> with anything else.
> 
> >   */
> >  struct in_addr tap_ip4_daddr(const struct ctx *c)
> >  {
> > @@ -98,11 +98,11 @@ struct in_addr tap_ip4_daddr(const struct ctx *c)
> >  }
> >  
> >  /**
> > - * tap_ip6_daddr() - Normal IPv4 destination address for inbound packets
> > + * tap_ip6_daddr() - Normal IPv6 destination address for inbound packets
> >   * @c:		Execution context
> >   * @src:	Source address
> >   *
> > - * Returns: pointer to IPv6 address
> > + * Return:	pointer to IPv6 address
> 
> Same here.

Fixed, thanks.

> 
> >   */
> >  const struct in6_addr *tap_ip6_daddr(const struct ctx *c,
> >  				     const struct in6_addr *src)
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 05/32] tcp: Better helpers for converting between connection pointer and index
  2022-11-16 23:11   ` Stefano Brivio
@ 2022-11-17  1:24     ` David Gibson
  0 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-17  1:24 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev


On Thu, Nov 17, 2022 at 12:11:30AM +0100, Stefano Brivio wrote:
> On Wed, 16 Nov 2022 15:41:45 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > The macro CONN_OR_NULL() is used to look up connections by index with
> > bounds checking.  Replace it with an inline function, which means:
> >     - Better type checking
> >     - No danger of multiple evaluation of an @index with side effects
> > 
> > Also add a helper to perform the reverse translation: from connection
> > pointer to index.  Introduce a macro for this which will make later
> > cleanups easier and safer.
> 
> Ah, yes, much better, agreed. Just two things here:
> 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  tcp.c | 83 ++++++++++++++++++++++++++++++++---------------------------
> >  1 file changed, 45 insertions(+), 38 deletions(-)
> > 
> > diff --git a/tcp.c b/tcp.c
> > index d043123..4e56a6c 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -518,14 +518,6 @@ struct tcp_conn {
> >  	 (conn->events & (SOCK_FIN_RCVD | TAP_FIN_RCVD)))
> >  #define CONN_HAS(conn, set)	((conn->events & (set)) == (set))
> >  
> > -#define CONN(index)		(tc + (index))
> > -
> > -/* We probably don't want to use gcc statement expressions (for portability), so
> > - * use this only after well-defined sequence points (no pre-/post-increments).
> > - */
> > -#define CONN_OR_NULL(index)						\
> > -	(((int)(index) >= 0 && (index) < TCP_MAX_CONNS) ? (tc + (index)) : NULL)
> > -
> >  static const char *tcp_event_str[] __attribute((__unused__)) = {
> >  	"SOCK_ACCEPTED", "TAP_SYN_RCVD", "ESTABLISHED", "TAP_SYN_ACK_SENT",
> >  
> > @@ -705,6 +697,21 @@ static size_t tcp6_l2_flags_buf_bytes;
> >  /* TCP connections */
> >  static struct tcp_conn tc[TCP_MAX_CONNS];
> >  
> > +#define CONN(index)		(tc + (index))
> > +#define CONN_IDX(conn)		((conn) - tc)
> > +
> > +/** conn_at_idx() - Find a connection by index, if present
> > + * @index:	Index of connection to lookup
> > + *
> > + * Return:	Pointer to connection, or NULL if @index is out of bounds
> 
> Return: pointer [...]

Fixed.

> > + */
> > +static inline struct tcp_conn *conn_at_idx(int index)
> 
> The CONN_OR_NULL name made it very explicit that the pointer obtained
> there could be NULL. On the other hand I find conn_at_idx() more
> descriptive. But maybe conn_or_null() would be "safer". I don't really
> have a preference.

I see your point, but on balance I think I marginally prefer
conn_at_idx(), so I've left it.
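
To illustrate the multiple evaluation point from the commit message,
here's a small standalone sketch. The table, its size and struct conn
are simplified stand-ins (assumptions for the example), not the real
definitions from tcp.c:

```c
#include <stddef.h>

#define MAX_CONNS	8		/* stand-in for TCP_MAX_CONNS */

struct conn {
	int sock;
};

static struct conn table[MAX_CONNS];	/* stand-in for tc[] */

/* Macro form: @index appears three times in the expansion, so an
 * argument with side effects is evaluated up to three times
 */
#define CONN_OR_NULL(index)						\
	(((int)(index) >= 0 && (index) < MAX_CONNS) ?			\
	 (table + (index)) : NULL)

/* Inline function form: @index is evaluated exactly once, at the call */
static inline struct conn *conn_at_idx(int index)
{
	if (index < 0 || index >= MAX_CONNS)
		return NULL;

	return table + index;
}
```

Starting from i = 0, CONN_OR_NULL(i++) increments i three times and
yields table + 2, while conn_at_idx(i++) increments it once and yields
table + 0. The function also gets its argument type checked.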

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 14/32] tcp: Separate helpers to create ns listening sockets
  2022-11-16 23:51   ` Stefano Brivio
@ 2022-11-17  1:32     ` David Gibson
  0 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-17  1:32 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev


On Thu, Nov 17, 2022 at 12:51:38AM +0100, Stefano Brivio wrote:
> On Wed, 16 Nov 2022 15:41:54 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > tcp_sock_init*() can create either sockets listening on the host, or in
> > the pasta network namespace (with @ns==1).  There are, however, a number
> > > of differences in how these two cases work in practice.  "ns"
> > sockets are only used in pasta mode, and they always lead to spliced
> > connections only.  The functions are also only ever called in "ns" mode
> > with a NULL address and interface name, and it doesn't really make sense
> > for them to be called any other way.
> > 
> > Later changes will introduce further differences in behaviour between these
> > two cases, so it makes more sense to use separate functions for creating
> > the ns listening sockets than the regular external/host listening sockets.
> > ---
> >  conf.c |   6 +--
> >  tcp.c  | 130 ++++++++++++++++++++++++++++++++++++++-------------------
> >  tcp.h  |   4 +-
> >  3 files changed, 92 insertions(+), 48 deletions(-)
> > 
> > diff --git a/conf.c b/conf.c
> > index 3ad247e..2b39d18 100644
> > --- a/conf.c
> > +++ b/conf.c
> > @@ -209,7 +209,7 @@ static int conf_ports(const struct ctx *c, char optname, const char *optarg,
> >  
> >  		for (i = 0; i < PORT_EPHEMERAL_MIN; i++) {
> >  			if (optname == 't')
> > -				tcp_sock_init(c, 0, AF_UNSPEC, NULL, NULL, i);
> > +				tcp_sock_init(c, AF_UNSPEC, NULL, NULL, i);
> >  			else if (optname == 'u')
> >  				udp_sock_init(c, 0, AF_UNSPEC, NULL, NULL, i);
> >  		}
> > @@ -287,7 +287,7 @@ static int conf_ports(const struct ctx *c, char optname, const char *optarg,
> >  			bitmap_set(fwd->map, i);
> >  
> >  			if (optname == 't')
> > -				tcp_sock_init(c, 0, af, addr, ifname, i);
> > +				tcp_sock_init(c, af, addr, ifname, i);
> >  			else if (optname == 'u')
> >  				udp_sock_init(c, 0, af, addr, ifname, i);
> >  		}
> > @@ -333,7 +333,7 @@ static int conf_ports(const struct ctx *c, char optname, const char *optarg,
> >  			fwd->delta[i] = mapped_range.first - orig_range.first;
> >  
> >  			if (optname == 't')
> > -				tcp_sock_init(c, 0, af, addr, ifname, i);
> > +				tcp_sock_init(c, af, addr, ifname, i);
> >  			else if (optname == 'u')
> >  				udp_sock_init(c, 0, af, addr, ifname, i);
> >  		}
> > diff --git a/tcp.c b/tcp.c
> > index aac70cd..72d3b49 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -2987,15 +2987,15 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
> >  /**
> >   * tcp_sock_init4() - Initialise listening sockets for a given IPv4 port
> >   * @c:		Execution context
> > - * @ns:		In pasta mode, if set, bind with loopback address in namespace
> >   * @addr:	Pointer to address for binding, NULL if not configured
> >   * @ifname:	Name of interface to bind to, NULL if not configured
> >   * @port:	Port, host order
> >   */
> > -static void tcp_sock_init4(const struct ctx *c, int ns, const struct in_addr *addr,
> > +static void tcp_sock_init4(const struct ctx *c, const struct in_addr *addr,
> >  			   const char *ifname, in_port_t port)
> >  {
> > -	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = ns };
> > +	in_port_t idx = port + c->tcp.fwd_in.delta[port];
> > +	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.index = idx };
> 
> Usual order here...

You mean the reverse Christmas tree ordering?  I can't do that here,
because idx is used in the next declaration.
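
To illustrate the ordering constraint, here is a minimal sketch.  The
union and function names are hypothetical stand-ins, not the real
union tcp_epoll_ref, and the bit-field widths are assumptions: idx has
to be declared first because the initialiser of the next declaration
reads it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for union tcp_epoll_ref; widths are assumed */
union demo_ref {
	struct {
		unsigned int listen:1;
		unsigned int index:20;
	} tcp;
	uint32_t u32;
};

/* Return the index stored in the reference for @port shifted by @delta */
static uint32_t demo_listen_index(uint16_t port, uint16_t delta)
{
	/* Must come first: the initialiser below depends on it.  The cast
	 * makes the 16-bit wrap-around explicit.
	 */
	uint16_t idx = (uint16_t)(port + delta);
	union demo_ref tref = { .tcp.listen = 1, .tcp.index = idx };

	return tref.tcp.index;
}
```

Swapping the two declarations to get the longer line first would leave
the initialiser reading idx before it exists, which doesn't compile.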

> >  	bool spliced = false, tap = true;
> >  	int s;
> >  
> > @@ -3006,14 +3006,9 @@ static void tcp_sock_init4(const struct ctx *c, int ns, const struct in_addr *ad
> >  		if (!addr)
> >  			addr = &c->ip4.addr;
> >  
> > -		tap = !ns && !IN4_IS_ADDR_LOOPBACK(addr);
> > +		tap = !IN4_IS_ADDR_LOOPBACK(addr);
> >  	}
> >  
> > -	if (ns)
> > -		tref.tcp.index = (in_port_t)(port + c->tcp.fwd_out.delta[port]);
> > -	else
> > -		tref.tcp.index = (in_port_t)(port + c->tcp.fwd_in.delta[port]);
> > -
> >  	if (tap) {
> >  		s = sock_l4(c, AF_INET, IPPROTO_TCP, addr, ifname, port,
> >  			    tref.u32);
> > @@ -3039,29 +3034,25 @@ static void tcp_sock_init4(const struct ctx *c, int ns, const struct in_addr *ad
> >  		else
> >  			s = -1;
> >  
> > -		if (c->tcp.fwd_out.mode == FWD_AUTO) {
> > -			if (ns)
> > -				tcp_sock_ns[port][V4] = s;
> > -			else
> > -				tcp_sock_init_lo[port][V4] = s;
> > -		}
> > +		if (c->tcp.fwd_out.mode == FWD_AUTO)
> > +			tcp_sock_init_lo[port][V4] = s;
> >  	}
> >  }
> >  
> >  /**
> >   * tcp_sock_init6() - Initialise listening sockets for a given IPv6 port
> >   * @c:		Execution context
> > - * @ns:		In pasta mode, if set, bind with loopback address in namespace
> >   * @addr:	Pointer to address for binding, NULL if not configured
> >   * @ifname:	Name of interface to bind to, NULL if not configured
> >   * @port:	Port, host order
> >   */
> > -static void tcp_sock_init6(const struct ctx *c, int ns,
> > +static void tcp_sock_init6(const struct ctx *c,
> >  			   const struct in6_addr *addr, const char *ifname,
> >  			   in_port_t port)
> >  {
> > -	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = ns,
> > -				     .tcp.v6 = 1 };
> > +	in_port_t idx = port + c->tcp.fwd_in.delta[port];
> > +	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.v6 = 1,
> > +				     .tcp.index = idx	};
> 
> Excess whitespace.

Fixed.

> >  	bool spliced = false, tap = true;
> >  	int s;
> >  
> > @@ -3073,14 +3064,9 @@ static void tcp_sock_init6(const struct ctx *c, int ns,
> >  		if (!addr)
> >  			addr = &c->ip6.addr;
> >  
> > -		tap = !ns && !IN6_IS_ADDR_LOOPBACK(addr);
> > +		tap = !IN6_IS_ADDR_LOOPBACK(addr);
> >  	}
> >  
> > -	if (ns)
> > -		tref.tcp.index = (in_port_t)(port + c->tcp.fwd_out.delta[port]);
> > -	else
> > -		tref.tcp.index = (in_port_t)(port + c->tcp.fwd_in.delta[port]);
> > -
> >  	if (tap) {
> >  		s = sock_l4(c, AF_INET6, IPPROTO_TCP, addr, ifname, port,
> >  			    tref.u32);
> > @@ -3105,40 +3091,99 @@ static void tcp_sock_init6(const struct ctx *c, int ns,
> >  		else
> >  			s = -1;
> >  
> > -		if (c->tcp.fwd_out.mode == FWD_AUTO) {
> > -			if (ns)
> > -				tcp_sock_ns[port][V6] = s;
> > -			else
> > -				tcp_sock_init_lo[port][V6] = s;
> > -		}
> > +		if (c->tcp.fwd_out.mode == FWD_AUTO)
> > +			tcp_sock_init_lo[port][V6] = s;
> >  	}
> >  }
> >  
> >  /**
> >   * tcp_sock_init() - Initialise listening sockets for a given port
> 
> Maybe we should now indicate this is for "inbound" connections only
> ("for a given, inbound, port"?)

Updated.

> >   * @c:		Execution context
> > - * @ns:		In pasta mode, if set, bind with loopback address in namespace
> >   * @af:		Address family to select a specific IP version, or AF_UNSPEC
> >   * @addr:	Pointer to address for binding, NULL if not configured
> >   * @ifname:	Name of interface to bind to, NULL if not configured
> >   * @port:	Port, host order
> >   */
> > -void tcp_sock_init(const struct ctx *c, int ns, sa_family_t af,
> > -		   const void *addr, const char *ifname, in_port_t port)
> > +void tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
> > +		   const char *ifname, in_port_t port)
> >  {
> >  	if ((af == AF_INET  || af == AF_UNSPEC) && c->ifi4)
> > -		tcp_sock_init4(c, ns, addr, ifname, port);
> > +		tcp_sock_init4(c, addr, ifname, port);
> >  	if ((af == AF_INET6 || af == AF_UNSPEC) && c->ifi6)
> > -		tcp_sock_init6(c, ns, addr, ifname, port);
> > +		tcp_sock_init6(c, addr, ifname, port);
> > +}
> > +
> > +/**
> > + * tcp_ns_sock_init4() - Init socket to listen for outbound IPv4 connections
> > + * @c:		Execution context
> > + * @port:	Port, host order
> > + */
> > +static void tcp_ns_sock_init4(const struct ctx *c, in_port_t port)
> > +{
> > +	in_port_t idx = port + c->tcp.fwd_out.delta[port];
> 
> Move after declaration of 'loopback'.

Again, I can't do that because idx is used in the next declaration.

> > +	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = 1,
> > +				     .tcp.splice = 1, .tcp.index = idx };
> > +	struct in_addr loopback = { htonl(INADDR_LOOPBACK) };
> > +	int s;
> > +
> > +	assert(c->mode == MODE_PASTA);
> > +
> > +	s = sock_l4(c, AF_INET, IPPROTO_TCP, &loopback, NULL, port, tref.u32);
> > +	if (s >= 0)
> > +		tcp_sock_set_bufsize(c, s);
> > +	else
> > +		s = -1;
> > +
> > +	if (c->tcp.fwd_out.mode == FWD_AUTO)
> > +		tcp_sock_ns[port][V4] = s;
> >  }
> >  
> >  /**
> > - * tcp_sock_init_ns() - Bind sockets in namespace for outbound connections
> > + * tcp_ns_sock_init6() - Init socket to listen for outbound IPv6 connections
> > + * @c:		Execution context
> > + * @port:	Port, host order
> > + */
> > +static void tcp_ns_sock_init6(const struct ctx *c, in_port_t port)
> > +{
> > +	in_port_t idx = port + c->tcp.fwd_out.delta[port];
> > +	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = 1,
> > +				     .tcp.splice = 1, .tcp.v6 = 1,
> > +				     .tcp.index = idx};
> 
> Missing whitespace between 'idx' and };

Fixed.

> > +	int s;
> > +
> > +	assert(c->mode == MODE_PASTA);
> > +
> > +	s = sock_l4(c, AF_INET6, IPPROTO_TCP, &in6addr_loopback, NULL, port,
> > +		    tref.u32);
> > +	if (s >= 0)
> > +		tcp_sock_set_bufsize(c, s);
> > +	else
> > +		s = -1;
> > +
> > +	if (c->tcp.fwd_out.mode == FWD_AUTO)
> > +		tcp_sock_ns[port][V6] = s;
> > +}
> > +
> > +/**
> > + * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections
> > + * @c:		Execution context
> > + * @port:	Port, host order
> > + */
> > +void tcp_ns_sock_init(const struct ctx *c, in_port_t port)
> > +{
> > +	if (c->ifi4)
> > +		tcp_ns_sock_init4(c, port);
> > +	if (c->ifi6)
> > +		tcp_ns_sock_init6(c, port);
> > +}
> > +
> > +/**
> > + * tcp_ns_socks_init() - Bind sockets in namespace for outbound connections
> >   * @arg:	Execution context
> >   *
> >   * Return: 0
> >   */
> > -static int tcp_sock_init_ns(void *arg)
> > +static int tcp_ns_socks_init(void *arg)
> >  {
> >  	struct ctx *c = (struct ctx *)arg;
> >  	unsigned port;
> > @@ -3149,7 +3194,7 @@ static int tcp_sock_init_ns(void *arg)
> >  		if (!bitmap_isset(c->tcp.fwd_out.map, port))
> >  			continue;
> >  
> > -		tcp_sock_init(c, 1, AF_UNSPEC, NULL, NULL, port);
> > +		tcp_ns_sock_init(c, port);
> >  	}
> >  
> >  	return 0;
> > @@ -3279,7 +3324,7 @@ int tcp_init(struct ctx *c)
> >  	if (c->mode == MODE_PASTA) {
> >  		tcp_splice_init(c);
> >  
> > -		NS_CALL(tcp_sock_init_ns, c);
> > +		NS_CALL(tcp_ns_socks_init, c);
> >  
> >  		refill_arg.ns = 1;
> >  		NS_CALL(tcp_sock_refill, &refill_arg);
> > @@ -3364,8 +3409,7 @@ static int tcp_port_rebind(void *arg)
> >  
> >  			if ((a->c->ifi4 && tcp_sock_ns[port][V4] == -1) ||
> >  			    (a->c->ifi6 && tcp_sock_ns[port][V6] == -1))
> > -				tcp_sock_init(a->c, 1, AF_UNSPEC, NULL, NULL,
> > -					      port);
> > +				tcp_ns_sock_init(a->c, port);
> >  		}
> >  	} else {
> >  		for (port = 0; port < NUM_PORTS; port++) {
> > @@ -3398,7 +3442,7 @@ static int tcp_port_rebind(void *arg)
> >  
> >  			if ((a->c->ifi4 && tcp_sock_init_ext[port][V4] == -1) ||
> >  			    (a->c->ifi6 && tcp_sock_init_ext[port][V6] == -1))
> > -				tcp_sock_init(a->c, 0, AF_UNSPEC, NULL, NULL,
> > +				tcp_sock_init(a->c, AF_UNSPEC, NULL, NULL,
> >  					      port);
> >  		}
> >  	}
> > diff --git a/tcp.h b/tcp.h
> > index 49738ef..f4ed298 100644
> > --- a/tcp.h
> > +++ b/tcp.h
> > @@ -19,8 +19,8 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
> >  		      const struct timespec *now);
> >  int tcp_tap_handler(struct ctx *c, int af, const void *addr,
> >  		    const struct pool *p, const struct timespec *now);
> > -void tcp_sock_init(const struct ctx *c, int ns, sa_family_t af,
> > -		   const void *addr, const char *ifname, in_port_t port);
> > +void tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
> > +		   const char *ifname, in_port_t port);
> >  int tcp_init(struct ctx *c);
> >  void tcp_timer(struct ctx *c, const struct timespec *ts);
> >  void tcp_defer_handler(struct ctx *c);
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 15/32] tcp: Unify part of spliced and non-spliced conn_from_sock path
  2022-11-16 23:53   ` Stefano Brivio
@ 2022-11-17  1:37     ` David Gibson
  2022-11-17  7:30       ` Stefano Brivio
  0 siblings, 1 reply; 57+ messages in thread
From: David Gibson @ 2022-11-17  1:37 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On Thu, Nov 17, 2022 at 12:53:58AM +0100, Stefano Brivio wrote:
> On Wed, 16 Nov 2022 15:41:55 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > In tcp_sock_handler() we split off to handle spliced sockets before
> > checking anything else.  However the first steps of the "new connection"
> > path for each case are the same: allocate a connection entry and accept()
> > the connection.
> > 
> > Remove this duplication by making tcp_conn_from_sock() handle both spliced
> > and non-spliced cases, with help from more specific tcp_tap_conn_from_sock
> > and tcp_splice_conn_from_sock functions for the later stages which differ.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  tcp.c        | 68 ++++++++++++++++++++++++++++++++++------------------
> >  tcp_splice.c | 58 +++++++++++++++++++++++---------------------
> >  tcp_splice.h |  4 ++++
> >  3 files changed, 80 insertions(+), 50 deletions(-)
> > 
> > diff --git a/tcp.c b/tcp.c
> > index 72d3b49..e66a82a 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -2753,28 +2753,19 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
> >  }
> >  
> >  /**
> > - * tcp_conn_from_sock() - Handle new connection request from listening socket
> > + * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection
> >   * @c:		Execution context
> >   * @ref:	epoll reference of listening socket
> > + * @conn:	connection structure to initialize
> > + * @s:		Accepted socket
> > + * @sa:		Peer socket address (from accept())
> >   * @now:	Current timestamp
> >   */
> > -static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
> > -			       const struct timespec *now)
> > +static void tcp_tap_conn_from_sock(struct ctx *c, union epoll_ref ref,
> > +				   struct tcp_tap_conn *conn, int s,
> > +				   struct sockaddr *sa,
> > +				   const struct timespec *now)
> >  {
> > -	struct sockaddr_storage sa;
> > -	struct tcp_tap_conn *conn;
> > -	socklen_t sl;
> > -	int s;
> > -
> > -	if (c->tcp.conn_count >= TCP_MAX_CONNS)
> > -		return;
> > -
> > -	sl = sizeof(sa);
> > -	s = accept4(ref.r.s, (struct sockaddr *)&sa, &sl, SOCK_NONBLOCK);
> > -	if (s < 0)
> > -		return;
> > -
> > -	conn = CONN(c->tcp.conn_count++);
> >  	conn->c.spliced = false;
> >  	conn->sock = s;
> >  	conn->timer = -1;
> > @@ -2784,7 +2775,7 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
> >  	if (ref.r.p.tcp.tcp.v6) {
> >  		struct sockaddr_in6 sa6;
> >  
> > -		memcpy(&sa6, &sa, sizeof(sa6));
> > +		memcpy(&sa6, sa, sizeof(sa6));
> >  
> >  		if (IN6_IS_ADDR_LOOPBACK(&sa6.sin6_addr) ||
> >  		    IN6_ARE_ADDR_EQUAL(&sa6.sin6_addr, &c->ip6.addr_seen) ||
> > @@ -2813,7 +2804,7 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
> >  	} else {
> >  		struct sockaddr_in sa4;
> >  
> > -		memcpy(&sa4, &sa, sizeof(sa4));
> > +		memcpy(&sa4, sa, sizeof(sa4));
> >  
> >  		memset(&conn->a.a4.zero,   0, sizeof(conn->a.a4.zero));
> >  		memset(&conn->a.a4.one, 0xff, sizeof(conn->a.a4.one));
> > @@ -2846,6 +2837,37 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
> >  	tcp_get_sndbuf(conn);
> >  }
> >  
> > +/**
> > + * tcp_conn_from_sock() - Handle new connection request from listening socket
> > + * @c:		Execution context
> > + * @ref:	epoll reference of listening socket
> > + * @now:	Current timestamp
> > + */
> > +static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
> > +			       const struct timespec *now)
> > +{
> > +	struct sockaddr_storage sa;
> > +	union tcp_conn *conn;
> > +	socklen_t sl;
> > +	int s;
> > +
> > +	if (c->tcp.conn_count >= TCP_MAX_CONNS)
> > +		return;
> > +
> > +	sl = sizeof(sa);
> > +	s = accept4(ref.r.s, (struct sockaddr *)&sa, &sl, SOCK_NONBLOCK);
> 
> Combined with 16/32 I'm not sure this is simplifying much -- it looks a
> bit unnatural there to get the peer address not "directly" from
> accept4(). On the other hand you drop a few lines -- I'm fine with
> it either way.

Um... I'm not really sure what you're getting at here.

> > +	if (s < 0)
> > +		return;
> > +
> > +	conn = tc + c->tcp.conn_count++;
> > +
> > +	if (ref.r.p.tcp.tcp.splice)
> > +		tcp_splice_conn_from_sock(c, ref, &conn->splice, s);
> > +	else
> > +		tcp_tap_conn_from_sock(c, ref, &conn->tap, s,
> > +				       (struct sockaddr *)&sa, now);
> > +}
> > +
> >  /**
> >   * tcp_timer_handler() - timerfd events: close, send ACK, retransmit, or reset
> >   * @c:		Execution context
> > @@ -2925,13 +2947,13 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
> >  		return;
> >  	}
> >  
> > -	if (ref.r.p.tcp.tcp.splice) {
> > -		tcp_sock_handler_splice(c, ref, events);
> > +	if (ref.r.p.tcp.tcp.listen) {
> > +		tcp_conn_from_sock(c, ref, now);
> >  		return;
> >  	}
> >  
> > -	if (ref.r.p.tcp.tcp.listen) {
> > -		tcp_conn_from_sock(c, ref, now);
> > +	if (ref.r.p.tcp.tcp.splice) {
> > +		tcp_sock_handler_splice(c, ref, events);
> >  		return;
> >  	}
> >  
> > diff --git a/tcp_splice.c b/tcp_splice.c
> > index 7a06252..7007501 100644
> > --- a/tcp_splice.c
> > +++ b/tcp_splice.c
> > @@ -501,6 +501,36 @@ static void tcp_splice_dir(struct tcp_splice_conn *conn, int ref_sock,
> >  	*pipes = *from == conn->a ? conn->pipe_a_b : conn->pipe_b_a;
> >  }
> >  
> > +/**
> > + * tcp_splice_conn_from_sock() - Initialize state for spliced connection
> > + * @c:		Execution context
> > + * @ref:	epoll reference of listening socket
> > + * @conn:	connection structure to initialize
> > + * @s:		Accepted socket
> > + *
> > + * #syscalls:pasta setsockopt
> > + */
> > +void tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
> > +			       struct tcp_splice_conn *conn, int s)
> > +{
> > +	assert(c->mode == MODE_PASTA);
> > +
> > +	if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }),
> > +		       sizeof(int))) {
> > +		trace("TCP (spliced): failed to set TCP_QUICKACK on %i",
> > +		      s);
> 
> This could be indented sanely, now.

Fixed.

> > +	}
> > +
> > +	conn->c.spliced = true;
> > +	c->tcp.splice_conn_count++;
> > +	conn->a = s;
> > +	conn->flags = ref.r.p.tcp.tcp.v6 ? SPLICE_V6 : 0;
> > +
> > +	if (tcp_splice_new(c, conn, ref.r.p.tcp.tcp.index,
> > +			   ref.r.p.tcp.tcp.outbound))
> > +		conn_flag(c, conn, CLOSING);
> > +}
> > +
> >  /**
> >   * tcp_sock_handler_splice() - Handler for socket mapped to spliced connection
> >   * @c:		Execution context
> > @@ -517,33 +547,7 @@ void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
> >  	uint32_t *seq_read, *seq_write;
> >  	struct tcp_splice_conn *conn;
> >  
> > -	if (ref.r.p.tcp.tcp.listen) {
> > -		int s;
> > -
> > -		if (c->tcp.conn_count >= TCP_MAX_CONNS)
> > -			return;
> > -
> > -		if ((s = accept4(ref.r.s, NULL, NULL, SOCK_NONBLOCK)) < 0)
> > -			return;
> > -
> > -		if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }),
> > -			       sizeof(int))) {
> > -			trace("TCP (spliced): failed to set TCP_QUICKACK on %i",
> > -			      s);
> > -		}
> > -
> > -		conn = CONN(c->tcp.conn_count++);
> > -		conn->c.spliced = true;
> > -		c->tcp.splice_conn_count++;
> > -		conn->a = s;
> > -		conn->flags = ref.r.p.tcp.tcp.v6 ? SPLICE_V6 : 0;
> > -
> > -		if (tcp_splice_new(c, conn, ref.r.p.tcp.tcp.index,
> > -				   ref.r.p.tcp.tcp.outbound))
> > -			conn_flag(c, conn, CLOSING);
> > -
> > -		return;
> > -	}
> > +	assert(!ref.r.p.tcp.tcp.listen);
> >  
> >  	conn = CONN(ref.r.p.tcp.tcp.index);
> >  
> > diff --git a/tcp_splice.h b/tcp_splice.h
> > index 22024d6..f9462ae 100644
> > --- a/tcp_splice.h
> > +++ b/tcp_splice.h
> > @@ -6,8 +6,12 @@
> >  #ifndef TCP_SPLICE_H
> >  #define TCP_SPLICE_H
> >  
> > +struct tcp_splice_conn;
> > +
> >  void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
> >  			     uint32_t events);
> > +void tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
> > +			       struct tcp_splice_conn *conn, int s);
> >  void tcp_splice_init(struct ctx *c);
> >  
> >  #endif /* TCP_SPLICE_H */
> 



* Re: [PATCH 16/32] tcp: Use the same sockets to listen for spliced and non-spliced connections
  2022-11-16 23:54   ` Stefano Brivio
@ 2022-11-17  1:43     ` David Gibson
  0 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-17  1:43 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On Thu, Nov 17, 2022 at 12:54:05AM +0100, Stefano Brivio wrote:
> On Wed, 16 Nov 2022 15:41:56 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > In pasta mode, tcp_sock_init[46]() create separate sockets to listen for
> > spliced connections (these are bound to localhost) and non-spliced
> > connections (these are bound to the host address).  This introduces a
> > subtle behavioural difference between pasta and passt: by default, pasta
> > will listen only on a single host address, whereas passt will listen on
> > all addresses (0.0.0.0 or ::).  This also prevents us using some additional
> > optimizations that only work with the unspecified (0.0.0.0 or ::) address.
> > 
> > However, it turns out we don't need to do this.  We can splice a connection
> > if and only if it originates from the loopback address.  Currently we
> > ensure this by having the "spliced" listening sockets listening only on
> > loopback.  Instead, defer the decision about whether to splice a connection
> > until after accept(), by checking if the connection was made from the
> > loopback address.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  tcp.c        | 127 +++++++++++++--------------------------------------
> >  tcp_splice.c |  25 ++++++++--
> >  tcp_splice.h |   5 +-
> >  3 files changed, 55 insertions(+), 102 deletions(-)
> > 
> > diff --git a/tcp.c b/tcp.c
> > index e66a82a..4065da7 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -434,7 +434,6 @@ static const char *tcp_flag_str[] __attribute((__unused__)) = {
> >  };
> >  
> >  /* Listening sockets, used for automatic port forwarding in pasta mode only */
> > -static int tcp_sock_init_lo	[NUM_PORTS][IP_VERSIONS];
> >  static int tcp_sock_init_ext	[NUM_PORTS][IP_VERSIONS];
> >  static int tcp_sock_ns		[NUM_PORTS][IP_VERSIONS];
> >  
> > @@ -2851,21 +2850,31 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
> >  	socklen_t sl;
> >  	int s;
> >  
> > +	assert(ref.r.p.tcp.tcp.listen);
> > +	assert(!ref.r.p.tcp.tcp.splice);
> > +
> >  	if (c->tcp.conn_count >= TCP_MAX_CONNS)
> >  		return;
> >  
> >  	sl = sizeof(sa);
> > +	/* FIXME: Workaround clang-tidy not realizing that accept4()
> > +	 * writes the socket address.  See
> > +	 * https://github.com/llvm/llvm-project/issues/58992
> > +	 */
> > +	memset(&sa, 0, sizeof(struct sockaddr_in6));
> >  	s = accept4(ref.r.s, (struct sockaddr *)&sa, &sl, SOCK_NONBLOCK);
> 
> Ah, interesting. That looks new by the way -- not even valgrind
> complained about this.

Right, valgrind seems to have better modeling of the syscall here.
Note that this is harder for a static tool to get right, because the
number of bytes accept4() writes to the address buffer is variable: it
depends on the address family of the peer.

I got a reply on that bug report, saying apparently clang-tidy
*should* consider sa written here, and they weren't easily able to
reproduce the problem.  So, I'm not sure what's going on here :(.
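
For illustration, here is a minimal sketch of the variable-length
write.  The helper name is hypothetical and this is not code from the
series; it uses plain accept() in place of accept4() so that it doesn't
need _GNU_SOURCE, and it assumes loopback TCP is usable.  The buffer is
a full sockaddr_storage, but for an IPv4 peer only sizeof(struct
sockaddr_in) bytes are reported back in the length argument.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical demo helper: accept one loopback connection and return
 * the address length the kernel reported, or -1 on any failure.
 */
static int demo_accept_len(void)
{
	struct sockaddr_in laddr = { .sin_family = AF_INET };
	struct sockaddr_storage sa;
	socklen_t sl = sizeof(laddr);
	int l, c, s, ret = -1;

	laddr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

	l = socket(AF_INET, SOCK_STREAM, 0);
	c = socket(AF_INET, SOCK_STREAM, 0);
	if (l < 0 || c < 0)
		goto out;

	/* Port 0: let the kernel pick a port, then read it back */
	if (bind(l, (struct sockaddr *)&laddr, sizeof(laddr)) ||
	    listen(l, 1) ||
	    getsockname(l, (struct sockaddr *)&laddr, &sl) ||
	    connect(c, (struct sockaddr *)&laddr, sizeof(laddr)))
		goto out;

	/* Zero the buffer first, in the spirit of the workaround */
	memset(&sa, 0, sizeof(sa));
	sl = sizeof(sa);
	s = accept(l, (struct sockaddr *)&sa, &sl);
	if (s >= 0) {
		ret = (int)sl;	/* bytes of address actually reported */
		close(s);
	}
out:
	if (l >= 0)
		close(l);
	if (c >= 0)
		close(c);
	return ret;
}
```

A static checker has to know both that the buffer is written and that
the length argument is in/out, which may be part of why this is easy
to get wrong.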

> >  	if (s < 0)
> >  		return;
> >  
> >  	conn = tc + c->tcp.conn_count++;
> >  
> > -	if (ref.r.p.tcp.tcp.splice)
> > -		tcp_splice_conn_from_sock(c, ref, &conn->splice, s);
> > -	else
> > -		tcp_tap_conn_from_sock(c, ref, &conn->tap, s,
> > -				       (struct sockaddr *)&sa, now);
> > +	if (c->mode == MODE_PASTA &&
> > +	    tcp_splice_conn_from_sock(c, ref, &conn->splice,
> > +				      s, (struct sockaddr *)&sa))
> > +		return;
> > +
> > +	tcp_tap_conn_from_sock(c, ref, &conn->tap, s,
> > +			       (struct sockaddr *)&sa, now);
> >  }
> >  
> >  /**
> > @@ -3018,47 +3027,16 @@ static void tcp_sock_init4(const struct ctx *c, const struct in_addr *addr,
> >  {
> >  	in_port_t idx = port + c->tcp.fwd_in.delta[port];
> >  	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.index = idx };
> > -	bool spliced = false, tap = true;
> >  	int s;
> >  
> > -	if (c->mode == MODE_PASTA) {
> > -		spliced = !addr || IN4_IS_ADDR_UNSPECIFIED(addr) ||
> > -			IN4_IS_ADDR_LOOPBACK(addr);
> > -
> > -		if (!addr)
> > -			addr = &c->ip4.addr;
> > -
> > -		tap = !IN4_IS_ADDR_LOOPBACK(addr);
> > -	}
> > -
> > -	if (tap) {
> > -		s = sock_l4(c, AF_INET, IPPROTO_TCP, addr, ifname, port,
> > -			    tref.u32);
> > -		if (s >= 0)
> > -			tcp_sock_set_bufsize(c, s);
> > -		else
> > -			s = -1;
> > -
> > -		if (c->tcp.fwd_in.mode == FWD_AUTO)
> > -			tcp_sock_init_ext[port][V4] = s;
> > -	}
> > -
> > -	if (spliced) {
> > -		struct in_addr loopback = { htonl(INADDR_LOOPBACK) };
> > -		tref.tcp.splice = 1;
> > -
> > -		addr = &loopback;
> > -
> > -		s = sock_l4(c, AF_INET, IPPROTO_TCP, addr, ifname, port,
> > -			    tref.u32);
> > -		if (s >= 0)
> > -			tcp_sock_set_bufsize(c, s);
> > -		else
> > -			s = -1;
> > +	s = sock_l4(c, AF_INET, IPPROTO_TCP, addr, ifname, port, tref.u32);
> > +	if (s >= 0)
> > +		tcp_sock_set_bufsize(c, s);
> > +	else
> > +		s = -1;
> >  
> > -		if (c->tcp.fwd_out.mode == FWD_AUTO)
> > -			tcp_sock_init_lo[port][V4] = s;
> > -	}
> > +	if (c->tcp.fwd_in.mode == FWD_AUTO)
> > +		tcp_sock_init_ext[port][V4] = s;
> >  }
> >  
> >  /**
> > @@ -3075,47 +3053,16 @@ static void tcp_sock_init6(const struct ctx *c,
> >  	in_port_t idx = port + c->tcp.fwd_in.delta[port];
> >  	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.v6 = 1,
> >  				     .tcp.index = idx	};
> > -	bool spliced = false, tap = true;
> >  	int s;
> >  
> > -	if (c->mode == MODE_PASTA) {
> > -		spliced = !addr ||
> > -			  IN6_IS_ADDR_UNSPECIFIED(addr) ||
> > -			  IN6_IS_ADDR_LOOPBACK(addr);
> > -
> > -		if (!addr)
> > -			addr = &c->ip6.addr;
> > -
> > -		tap = !IN6_IS_ADDR_LOOPBACK(addr);
> > -	}
> > -
> > -	if (tap) {
> > -		s = sock_l4(c, AF_INET6, IPPROTO_TCP, addr, ifname, port,
> > -			    tref.u32);
> > -		if (s >= 0)
> > -			tcp_sock_set_bufsize(c, s);
> > -		else
> > -			s = -1;
> > -
> > -		if (c->tcp.fwd_in.mode == FWD_AUTO)
> > -			tcp_sock_init_ext[port][V6] = s;
> > -	}
> > -
> > -	if (spliced) {
> > -		tref.tcp.splice = 1;
> > -
> > -		addr = &in6addr_loopback;
> > -
> > -		s = sock_l4(c, AF_INET6, IPPROTO_TCP, addr, ifname, port,
> > -			    tref.u32);
> > -		if (s >= 0)
> > -			tcp_sock_set_bufsize(c, s);
> > -		else
> > -			s = -1;
> > +	s = sock_l4(c, AF_INET6, IPPROTO_TCP, addr, ifname, port, tref.u32);
> > +	if (s >= 0)
> > +		tcp_sock_set_bufsize(c, s);
> > +	else
> > +		s = -1;
> >  
> > -		if (c->tcp.fwd_out.mode == FWD_AUTO)
> > -			tcp_sock_init_lo[port][V6] = s;
> > -	}
> > +	if (c->tcp.fwd_in.mode == FWD_AUTO)
> > +		tcp_sock_init_ext[port][V6] = s;
> >  }
> >  
> >  /**
> > @@ -3144,7 +3091,7 @@ static void tcp_ns_sock_init4(const struct ctx *c, in_port_t port)
> >  {
> >  	in_port_t idx = port + c->tcp.fwd_out.delta[port];
> >  	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = 1,
> > -				     .tcp.splice = 1, .tcp.index = idx };
> > +				     .tcp.index = idx };
> >  	struct in_addr loopback = { htonl(INADDR_LOOPBACK) };
> >  	int s;
> >  
> > @@ -3169,8 +3116,7 @@ static void tcp_ns_sock_init6(const struct ctx *c, in_port_t port)
> >  {
> >  	in_port_t idx = port + c->tcp.fwd_out.delta[port];
> >  	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = 1,
> > -				     .tcp.splice = 1, .tcp.v6 = 1,
> > -				     .tcp.index = idx};
> > +				     .tcp.v6 = 1, .tcp.index = idx};
> 
> Space missing here (from 14/32).

Fixed.

> >  	int s;
> >  
> >  	assert(c->mode == MODE_PASTA);
> > @@ -3337,7 +3283,6 @@ int tcp_init(struct ctx *c)
> >  	memset(init_sock_pool6,		0xff,	sizeof(init_sock_pool6));
> >  	memset(ns_sock_pool4,		0xff,	sizeof(ns_sock_pool4));
> >  	memset(ns_sock_pool6,		0xff,	sizeof(ns_sock_pool6));
> > -	memset(tcp_sock_init_lo,	0xff,	sizeof(tcp_sock_init_lo));
> >  	memset(tcp_sock_init_ext,	0xff,	sizeof(tcp_sock_init_ext));
> >  	memset(tcp_sock_ns,		0xff,	sizeof(tcp_sock_ns));
> >  
> > @@ -3445,16 +3390,6 @@ static int tcp_port_rebind(void *arg)
> >  					close(tcp_sock_init_ext[port][V6]);
> >  					tcp_sock_init_ext[port][V6] = -1;
> >  				}
> > -
> > -				if (tcp_sock_init_lo[port][V4] >= 0) {
> > -					close(tcp_sock_init_lo[port][V4]);
> > -					tcp_sock_init_lo[port][V4] = -1;
> > -				}
> > -
> > -				if (tcp_sock_init_lo[port][V6] >= 0) {
> > -					close(tcp_sock_init_lo[port][V6]);
> > -					tcp_sock_init_lo[port][V6] = -1;
> > -				}
> >  				continue;
> >  			}
> >  
> > diff --git a/tcp_splice.c b/tcp_splice.c
> > index 7007501..30d49d4 100644
> > --- a/tcp_splice.c
> > +++ b/tcp_splice.c
> > @@ -502,19 +502,35 @@ static void tcp_splice_dir(struct tcp_splice_conn *conn, int ref_sock,
> >  }
> >  
> >  /**
> > - * tcp_splice_conn_from_sock() - Initialize state for spliced connection
> > + * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection
> >   * @c:		Execution context
> >   * @ref:	epoll reference of listening socket
> >   * @conn:	connection structure to initialize
> >   * @s:		Accepted socket
> > + * @sa:		Peer address of connection
> >   *
> > + * Return: true if able to create a spliced connection, false otherwise
> >   * #syscalls:pasta setsockopt
> >   */
> > -void tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
> > -			       struct tcp_splice_conn *conn, int s)
> > +bool tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
> > +			       struct tcp_splice_conn *conn, int s,
> > +			       const struct sockaddr *sa)
> >  {
> >  	assert(c->mode == MODE_PASTA);
> >  
> > +	if (ref.r.p.tcp.tcp.v6) {
> > +		const struct sockaddr_in6 *sa6
> > +			= (const struct sockaddr_in6 *)sa;
> 
> Maybe you could split declaration and assignment here.

Good idea, done.
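
Roughly, the split form looks like the sketch below.  This is a
standalone illustration rather than the code in the next spin: the
function name is hypothetical, and the IPv4 test assumes
IN4_IS_ADDR_LOOPBACK() matches anything in 127.0.0.0/8.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <sys/socket.h>

/* Hypothetical helper: is the peer address @sa a loopback address?
 * @v6 says whether @sa is a sockaddr_in6 (true) or sockaddr_in (false).
 */
static bool demo_peer_is_loopback(const struct sockaddr *sa, bool v6)
{
	if (v6) {
		const struct sockaddr_in6 *sa6;

		sa6 = (const struct sockaddr_in6 *)sa;
		return IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr);
	} else {
		const struct sockaddr_in *sa4;

		sa4 = (const struct sockaddr_in *)sa;
		/* Assumed equivalent of IN4_IS_ADDR_LOOPBACK() */
		return (ntohl(sa4->sin_addr.s_addr) >> 24) == 127;
	}
}
```

Only connections for which this kind of check succeeds would be
spliced; everything else falls through to the tap path.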

> > +		if (!IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr))
> > +			return false;
> > +		conn->flags = SPLICE_V6;
> > +	} else {
> > +		const struct sockaddr_in *sa4 = (const struct sockaddr_in *)sa;
> > +		if (!IN4_IS_ADDR_LOOPBACK(&sa4->sin_addr))
> > +			return false;
> > +		conn->flags = 0;
> > +	}
> > +
> >  	if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }),
> >  		       sizeof(int))) {
> >  		trace("TCP (spliced): failed to set TCP_QUICKACK on %i",
> > @@ -524,11 +540,12 @@ void tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
> >  	conn->c.spliced = true;
> >  	c->tcp.splice_conn_count++;
> >  	conn->a = s;
> > -	conn->flags = ref.r.p.tcp.tcp.v6 ? SPLICE_V6 : 0;
> >  
> >  	if (tcp_splice_new(c, conn, ref.r.p.tcp.tcp.index,
> >  			   ref.r.p.tcp.tcp.outbound))
> >  		conn_flag(c, conn, CLOSING);
> > +
> > +	return true;
> >  }
> >  
> >  /**
> > diff --git a/tcp_splice.h b/tcp_splice.h
> > index f9462ae..1a915dd 100644
> > --- a/tcp_splice.h
> > +++ b/tcp_splice.h
> > @@ -10,8 +10,9 @@ struct tcp_splice_conn;
> >  
> >  void tcp_sock_handler_splice(struct ctx *c, union epoll_ref ref,
> >  			     uint32_t events);
> > -void tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
> > -			       struct tcp_splice_conn *conn, int s);
> > +bool tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
> > +			       struct tcp_splice_conn *conn, int s,
> > +			       const struct sockaddr *sa);
> >  void tcp_splice_init(struct ctx *c);
> >  
> >  #endif /* TCP_SPLICE_H */
> 



* Re: [PATCH 19/32] inany: Helper functions for handling addresses which could be IPv4 or IPv6
  2022-11-16 23:54   ` Stefano Brivio
@ 2022-11-17  1:48     ` David Gibson
  0 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-17  1:48 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On Thu, Nov 17, 2022 at 12:54:08AM +0100, Stefano Brivio wrote:
> [Reviewed until 25/32 so far]
> 
> On Wed, 16 Nov 2022 15:41:59 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > struct tcp_conn stores an address which could be IPv6 or IPv4 using a
> > union.  We can do this without an additional tag by encoding IPv4 addresses
> > as IPv4-mapped IPv6 addresses.
> > 
> > This approach is useful wider than the specific place in tcp_conn, so
> > expose a new 'union inany_addr' like this from a new inany.h.  Along with
> > that create a number of helper functions to make working with these "inany"
> > addresses easier.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  Makefile     |  6 ++--
> >  inany.h      | 68 ++++++++++++++++++++++++++++++++++++++++
> >  tcp.c        | 88 +++++++++++++++++++++++++---------------------------
> >  tcp_conn.h   | 15 ++-------
> >  tcp_splice.c |  1 +
> >  5 files changed, 117 insertions(+), 61 deletions(-)
> >  create mode 100644 inany.h
> > 
> > diff --git a/Makefile b/Makefile
> > index 9046b0b..ca453aa 100644
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -44,9 +44,9 @@ SRCS = $(PASST_SRCS) $(QRAP_SRCS)
> >  MANPAGES = passt.1 pasta.1 qrap.1
> >  
> >  PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h icmp.h \
> > -	isolation.h lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h \
> > -	pcap.h port_fwd.h siphash.h tap.h tcp.h tcp_conn.h tcp_splice.h udp.h \
> > -	util.h
> > +	inany.h isolation.h lineread.h log.h ndp.h netlink.h packet.h passt.h \
> > +	pasta.h pcap.h port_fwd.h siphash.h tap.h tcp.h tcp_conn.h \
> > +	tcp_splice.h udp.h util.h
> >  HEADERS = $(PASST_HEADERS) seccomp.h
> >  
> >  # On gcc 11 and 12, with -O2 and -flto, tcp_hash() and siphash_20b(), if
> > diff --git a/inany.h b/inany.h
> > new file mode 100644
> > index 0000000..4e53da9
> > --- /dev/null
> > +++ b/inany.h
> > @@ -0,0 +1,68 @@
> > +/* SPDX-License-Identifier: AGPL-3.0-or-later
> > + * Copyright Red Hat
> > + * Author: David Gibson <david@gibson.dropbear.id.au>
> > + *
> > + * inany.h - Types and helpers for handling addresses which could be
> > + *           IPv6 or IPv4 (encoded as IPv4-mapped IPv6 addresses)
> > + */
> > +
> > +#include <assert.h>
> > +
> > +/** union inany_addr - Represents either an IPv4 or IPv6 address
> > + * @a6:		Address as an IPv6 address, may be IPv4-mapped
> > + * @_v4mapped._zero:	All zero-bits for an IPv4 address
> > + * @_v4mapped._one:	All one-bits for an IPv4 address
> > + * @_v4mapped.a4:	If @a6 is an IPv4 mapped address, this is the raw IPv4 address
> > + *
> > + * Fields starting with _ shouldn't be accessed except via helpers.
> > + */
> > +union inany_addr {
> > +	struct in6_addr a6;
> > +	struct {
> > +		uint8_t _zero[10];
> > +		uint8_t _one[2];
> > +		struct in_addr a4;
> > +	} _v4mapped;
> 
> I'm not sure the extra _ are really worth it. I mean, that's not really
> enforceable, so saying that v4mapped should only be accessed by helpers
> should be equivalent.

Fair call.  Adjusted.

> 
> > +};
> > +
> > +/** inany_v4 - Extract IPv4 address, if present, from IPv[46] address
> > + * @addr:	IPv4 or IPv6 address
> > + *
> > + * Return: IPv4 address if @addr is IPv4, NULL otherwise
> > + */
> > +static inline const struct in_addr *inany_v4(const union inany_addr *addr)
> > +{
> > +	if (!IN6_IS_ADDR_V4MAPPED(&addr->a6))
> > +		return NULL;
> > +	return &addr->_v4mapped.a4;
> > +}
> > +
> > +/** inany_equals - Compare two IPv[46] addresses
> > + * @a, @b:	IPv[46] addresses
> > + *
> > + * Return: true if @a and @b are the same address
> > + */
> > +static inline bool inany_equals(const union inany_addr *a,
> > +				const union inany_addr *b)
> > +{
> > +	return IN6_ARE_ADDR_EQUAL(&a->a6, &b->a6);
> > +}
> > +
> > +/** inany_from_af - Set IPv[46] address from IPv4 or IPv6 address
> > + * @aa:		Pointer to store IPv[46] address
> > + * @af:		Address family of @addr
> > + * @addr:	struct in_addr (IPv4) or struct in6_addr (IPv6)
> > + */
> > +static inline void inany_from_af(union inany_addr *aa, int af, const void *addr)
> > +{
> > +	if (af == AF_INET6) {
> > +		aa->a6 = *((struct in6_addr *)addr);
> > +	} else if (af == AF_INET) {
> > +		memset(&aa->_v4mapped._zero, 0, sizeof(aa->_v4mapped._zero));
> > +		memset(&aa->_v4mapped._one, 0xff, sizeof(aa->_v4mapped._one));
> > +		aa->_v4mapped.a4 = *((struct in_addr *)addr);
> > +	} else {
> > +		/* Not valid to call with other address families */
> > +		assert(0);
> > +	}
> > +}
> > diff --git a/tcp.c b/tcp.c
> > index 7686766..4040198 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -301,6 +301,7 @@
> >  #include "conf.h"
> >  #include "tcp_splice.h"
> >  #include "log.h"
> > +#include "inany.h"
> >  
> >  #include "tcp_conn.h"
> >  
> > @@ -404,7 +405,7 @@ struct tcp6_l2_head {	/* For MSS6 macro: keep in sync with tcp6_l2_buf_t */
> >  #define OPT_SACK	5
> >  #define OPT_TS		8
> >  
> > -#define CONN_V4(conn)		IN6_IS_ADDR_V4MAPPED(&conn->a.a6)
> > +#define CONN_V4(conn)		(!!inany_v4(&(conn)->addr))
> >  #define CONN_V6(conn)		(!CONN_V4(conn))
> >  #define CONN_IS_CLOSING(conn)						\
> >  	((conn->events & ESTABLISHED) &&				\
> > @@ -438,7 +439,7 @@ static int tcp_sock_init_ext	[NUM_PORTS][IP_VERSIONS];
> >  static int tcp_sock_ns		[NUM_PORTS][IP_VERSIONS];
> >  
> >  /* Table of destinations with very low RTT (assumed to be local), LRU */
> > -static struct in6_addr low_rtt_dst[LOW_RTT_TABLE_SIZE];
> > +static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE];
> >  
> >  /* Static buffers */
> >  
> > @@ -861,7 +862,7 @@ static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn)
> >  	int i;
> >  
> >  	for (i = 0; i < LOW_RTT_TABLE_SIZE; i++)
> > -		if (IN6_ARE_ADDR_EQUAL(&conn->a.a6, low_rtt_dst + i))
> > +		if (inany_equals(&conn->addr, low_rtt_dst + i))
> >  			return 1;
> >  
> >  	return 0;
> > @@ -883,7 +884,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
> >  		return;
> >  
> >  	for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) {
> > -		if (IN6_ARE_ADDR_EQUAL(&conn->a.a6, low_rtt_dst + i))
> > +		if (inany_equals(&conn->addr, low_rtt_dst + i))
> >  			return;
> >  		if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i))
> >  			hole = i;
> > @@ -895,10 +896,10 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
> >  	if (hole == -1)
> >  		return;
> >  
> > -	memcpy(low_rtt_dst + hole++, &conn->a.a6, sizeof(conn->a.a6));
> > +	low_rtt_dst[hole++] = conn->addr;
> >  	if (hole == LOW_RTT_TABLE_SIZE)
> >  		hole = 0;
> > -	memcpy(low_rtt_dst + hole, &in6addr_any, sizeof(conn->a.a6));
> > +	inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any);
> >  #else
> >  	(void)conn;
> >  	(void)tinfo;
> > @@ -1187,13 +1188,14 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn,
> >  			  int af, const void *addr,
> >  			  in_port_t tap_port, in_port_t sock_port)
> >  {
> > -	if (af == AF_INET && CONN_V4(conn)			&&
> > -	    !memcmp(&conn->a.a4.a, addr, sizeof(conn->a.a4.a))	&&
> > +	const struct in_addr *a4 = inany_v4(&conn->addr);
> > +
> > +	if (af == AF_INET && a4	&& !memcmp(a4, addr, sizeof(*a4)) &&
> >  	    conn->tap_port == tap_port && conn->sock_port == sock_port)
> >  		return 1;
> >  
> >  	if (af == AF_INET6					&&
> > -	    IN6_ARE_ADDR_EQUAL(&conn->a.a6, addr)		&&
> > +	    IN6_ARE_ADDR_EQUAL(&conn->addr.a6, addr)		&&
> >  	    conn->tap_port == tap_port && conn->sock_port == sock_port)
> >  		return 1;
> 
> Note to self or other reviewers: switch to inany_equals() in 22/32.
> 
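
For readers following along without the tree at hand, here is a minimal
standalone sketch of the encoding discussed in this patch.  It is not the
code from inany.h: the sketch_* names are made up for illustration, and
equality is done with memcmp() rather than IN6_ARE_ADDR_EQUAL() purely to
keep the example portable.  It shows that an address supplied as AF_INET
and the same address supplied as an IPv4-mapped AF_INET6 address end up
with an identical representation, so no separate tag is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Either an IPv6 address, or an IPv4 address stored as ::ffff:a.b.c.d */
union inany_sketch {
	struct in6_addr a6;
	struct {
		uint8_t zero[10];	/* all zero bits for IPv4 */
		uint8_t one[2];		/* all one bits for IPv4 */
		struct in_addr a4;	/* raw IPv4 address */
	} v4mapped;
};

/* Store an AF_INET or AF_INET6 address in the common representation */
static void sketch_from_af(union inany_sketch *aa, int af, const void *addr)
{
	if (af == AF_INET6) {
		memcpy(&aa->a6, addr, sizeof(aa->a6));
	} else if (af == AF_INET) {
		memset(aa->v4mapped.zero, 0, sizeof(aa->v4mapped.zero));
		memset(aa->v4mapped.one, 0xff, sizeof(aa->v4mapped.one));
		memcpy(&aa->v4mapped.a4, addr, sizeof(aa->v4mapped.a4));
	} else {
		/* Not valid to call with other address families */
		assert(0);
	}
}

/* Return the embedded IPv4 address, or NULL for a native IPv6 address */
static const struct in_addr *sketch_v4(const union inany_sketch *aa)
{
	if (!IN6_IS_ADDR_V4MAPPED(&aa->a6))
		return NULL;
	return &aa->v4mapped.a4;
}

/* Compare two addresses regardless of which family they came from */
static bool sketch_equals(const union inany_sketch *a,
			  const union inany_sketch *b)
{
	return memcmp(&a->a6, &b->a6, sizeof(a->a6)) == 0;
}
```

With this, 192.0.2.1 passed in as a struct in_addr and ::ffff:192.0.2.1
passed in as a struct in6_addr compare equal, and sketch_v4() returns
non-NULL for both.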

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 26/32] tcp: Remove v6 flag from tcp_epoll_ref
  2022-11-17  0:15   ` Stefano Brivio
@ 2022-11-17  1:50     ` David Gibson
  0 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-17  1:50 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On Thu, Nov 17, 2022 at 01:15:11AM +0100, Stefano Brivio wrote:
> On Wed, 16 Nov 2022 15:42:06 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > This bit in the TCP specific epoll reference indicates whether the
> > connection is IPv6 or IPv4.  However the sites which refer to it are
> > already calling accept() which (optionally) returns an address for the
> > remote end of the connection.  We can use the sa_family field in that
> > address to determine the connection type independent of the epoll
> > reference.
> > 
> > This does have a cost: for the spliced case, it means we now need to get
> > that address from accept() which introduces an extra copy_to_user().
> > However, in future we want to allow handling IPv4 connections through IPv6
> > sockets, which means we won't be able to determine the IP version at the
> > time we create the listening socket and epoll reference.  So, at some point
> > we'll have to pay this cost anyway.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  tcp.c        | 10 ++++------
> >  tcp.h        |  2 --
> >  tcp_splice.c |  9 ++++-----
> >  3 files changed, 8 insertions(+), 13 deletions(-)
> > 
> > diff --git a/tcp.c b/tcp.c
> > index 0513b3b..b05ed6c 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -662,8 +662,7 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
> >  {
> >  	int m = conn->c.in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
> >  	union epoll_ref ref = { .r.proto = IPPROTO_TCP, .r.s = conn->sock,
> > -				.r.p.tcp.tcp.index = CONN_IDX(conn),
> > -				.r.p.tcp.tcp.v6 = CONN_V6(conn) };
> > +				.r.p.tcp.tcp.index = CONN_IDX(conn) };
> >  	struct epoll_event ev = { .data.u64 = ref.u64 };
> >  
> >  	if (conn->events == CLOSED) {
> > @@ -2745,7 +2744,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, union epoll_ref ref,
> >  	conn->ws_to_tap = conn->ws_from_tap = 0;
> >  	conn_event(c, conn, SOCK_ACCEPTED);
> >  
> > -	if (ref.r.p.tcp.tcp.v6) {
> > +	if (sa->sa_family == AF_INET6) {
> >  		struct sockaddr_in6 sa6;
> >  
> >  		memcpy(&sa6, sa, sizeof(sa6));
> > @@ -3019,8 +3018,7 @@ static void tcp_sock_init6(const struct ctx *c,
> >  			   in_port_t port)
> >  {
> >  	in_port_t idx = port + c->tcp.fwd_in.delta[port];
> > -	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.v6 = 1,
> > -				     .tcp.index = idx	};
> > +	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.index = idx	};
> 
> Excess whitespace (from earlier patch).

Fixed

> >  	int s;
> >  
> >  	s = sock_l4(c, AF_INET6, IPPROTO_TCP, addr, ifname, port, tref.u32);
> > @@ -3084,7 +3082,7 @@ static void tcp_ns_sock_init6(const struct ctx *c, in_port_t port)
> >  {
> >  	in_port_t idx = port + c->tcp.fwd_out.delta[port];
> >  	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = 1,
> > -				     .tcp.v6 = 1, .tcp.index = idx};
> > +				     .tcp.index = idx};
> 
> Missing whitespace (from earlier patch).

Fixed.

> >  	int s;
> >  
> >  	assert(c->mode == MODE_PASTA);
> > diff --git a/tcp.h b/tcp.h
> > index a940682..739b451 100644
> > --- a/tcp.h
> > +++ b/tcp.h
> > @@ -33,7 +33,6 @@ void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s,
> >   * union tcp_epoll_ref - epoll reference portion for TCP connections
> >   * @listen:		Set if this file descriptor is a listening socket
> >   * @outbound:		Listening socket maps to outbound, spliced connection
> > - * @v6:			Set for IPv6 sockets or connections
> >   * @timer:		Reference is a timerfd descriptor for connection
> >   * @index:		Index of connection in table, or port for bound sockets
> >   * @u32:		Opaque u32 value of reference
> > @@ -42,7 +41,6 @@ union tcp_epoll_ref {
> >  	struct {
> >  		uint32_t	listen:1,
> >  				outbound:1,
> > -				v6:1,
> >  				timer:1,
> >  				index:20;
> >  	} tcp;
> > diff --git a/tcp_splice.c b/tcp_splice.c
> > index 30ab0eb..7c2f667 100644
> > --- a/tcp_splice.c
> > +++ b/tcp_splice.c
> > @@ -167,11 +167,9 @@ static int tcp_splice_epoll_ctl(const struct ctx *c,
> >  {
> >  	int m = conn->c.in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
> >  	union epoll_ref ref_a = { .r.proto = IPPROTO_TCP, .r.s = conn->a,
> > -				  .r.p.tcp.tcp.index = CONN_IDX(conn),
> > -				  .r.p.tcp.tcp.v6 = CONN_V6(conn) };
> > +				  .r.p.tcp.tcp.index = CONN_IDX(conn) };
> >  	union epoll_ref ref_b = { .r.proto = IPPROTO_TCP, .r.s = conn->b,
> > -				  .r.p.tcp.tcp.index = CONN_IDX(conn),
> > -				  .r.p.tcp.tcp.v6 = CONN_V6(conn) };
> > +				  .r.p.tcp.tcp.index = CONN_IDX(conn) };
> >  	struct epoll_event ev_a = { .data.u64 = ref_a.u64 };
> >  	struct epoll_event ev_b = { .data.u64 = ref_b.u64 };
> >  	uint32_t events_a, events_b;
> > @@ -504,6 +502,7 @@ static void tcp_splice_dir(struct tcp_splice_conn *conn, int ref_sock,
> >   * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection
> >   * @c:		Execution context
> >   * @ref:	epoll reference of listening socket
> > + * @ipv6:	Should this be an IPv6 connection?
> 
> Left-over from previous idea I guess.

Yes indeed, fixed.

> >   * @conn:	connection structure to initialize
> >   * @s:		Accepted socket
> >   * @sa:		Peer address of connection
> > @@ -517,7 +516,7 @@ bool tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
> >  {
> >  	assert(c->mode == MODE_PASTA);
> >  
> > -	if (ref.r.p.tcp.tcp.v6) {
> > +	if (sa->sa_family == AF_INET6) {
> >  		const struct sockaddr_in6 *sa6
> >  			= (const struct sockaddr_in6 *)sa;
> >  		if (!IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr))
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 27/32] tcp: NAT IPv4-mapped IPv6 addresses like IPv4 addresses
  2022-11-17  0:15   ` Stefano Brivio
@ 2022-11-17  2:00     ` David Gibson
  0 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-17  2:00 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On Thu, Nov 17, 2022 at 01:15:20AM +0100, Stefano Brivio wrote:
> On Wed, 16 Nov 2022 15:42:07 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > passt usually doesn't NAT, but it does do so for the remapping of the
> > gateway address to refer to the host.  Currently we perform this NAT with
> > slightly different rules on both IPv4 addresses and IPv6 addresses, but not
> > on IPv4-mapped IPv6 addresses.  This means we won't correctly handle the
> > case of an IPv4 connection over an IPv6 socket, which is possible on Linux
> > (and probably other platforms).
> 
> By the way, I really think it's just Linux, I can't think of other
> examples.

Hmm... so descriptions I've seen of the IPv4-mapped IPv6 addresses
seem to imply that accepting IPv4 connections on an IPv6 socket is the
behaviour on a number of systems, e.g.:

https://en.wikipedia.org/wiki/IPv6#IPv4-mapped_IPv6_addresses

> > Refactor tcp_conn_from_sock() to perform the NAT after converting either
> > address family into an inany_addr, so IPv4 and IPv4-mapped addresses
> > have the same representation.
> > 
> > With two new helpers this lets us remove the IPv4 and IPv6 specific paths
> > from tcp_conn_from_sock().
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  inany.h | 30 ++++++++++++++++++++++++++--
> >  tcp.c   | 62 ++++++++++++++++++++++++---------------------------------
> >  2 files changed, 54 insertions(+), 38 deletions(-)
> > 
> > diff --git a/inany.h b/inany.h
> > index 4e53da9..a677aa7 100644
> > --- a/inany.h
> > +++ b/inany.h
> > @@ -30,11 +30,11 @@ union inany_addr {
> >   *
> >   * Return: IPv4 address if @addr is IPv4, NULL otherwise
> >   */
> > -static inline const struct in_addr *inany_v4(const union inany_addr *addr)
> > +static inline struct in_addr *inany_v4(const union inany_addr *addr)
> 
> There must be a reason, but I can't understand why this change is
> needed here.

Because in tcp_snat_inbound() we want to modify, not just examine the
IPv4 address within the IPv6 address.  Ideally the return would be
const if and only if the input is, but C can't express that.  This
appears to be the conventional half-arsed solution (see, e.g. memchr()
or strstr()).
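
As a minimal sketch of that convention (struct rec and rec_find() are
hypothetical names, used only for illustration): the lookup takes a const
pointer so it can be called on read-only data, but returns a non-const
pointer so callers that do own a writable object can modify the result.
The cast is only safe when the underlying object really is writable,
which the caller has to guarantee.

```c
#include <stddef.h>

/* Hypothetical record type, only for illustration */
struct rec {
	int key;
	int val;
};

/* Like strstr(): const input, non-const result.  Writing through the
 * returned pointer is valid only if the caller's table is not actually
 * a const object.
 */
static struct rec *rec_find(const struct rec *tab, size_t n, int key)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tab[i].key == key)
			return (struct rec *)&tab[i];

	return NULL;
}
```

A caller with a writable table can do rec_find(tab, n, key)->val = x
directly, while a caller holding only a const pointer can still use the
same helper to examine an entry.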

> >  {
> >  	if (!IN6_IS_ADDR_V4MAPPED(&addr->a6))
> >  		return NULL;
> > -	return &addr->_v4mapped.a4;
> > +	return (struct in_addr *)&addr->_v4mapped.a4;
> >  }
> >  
> >  /** inany_equals - Compare two IPv[46] addresses
> > @@ -66,3 +66,29 @@ static inline void inany_from_af(union inany_addr *aa, int af, const void *addr)
> >  		assert(0);
> >  	}
> >  }
> > +
> > +/** inany_from_sockaddr - Extract IPv[46] address and port number from sockaddr
> > + * @a:		Pointer to store IPv[46] address
> 
> This is aa below, I'm not sure why.

Fixed.

> > + * @port:	Pointer to store port number, host order
> > + * @addr:	struct sockaddr_in (IPv4) or struct sockaddr_in6 (IPv6)
> 
> This became sa_ (needless to say, addr would make more sense).

Good call, changed.

> > + */
> > +static inline void inany_from_sockaddr(union inany_addr *aa, in_port_t *port,
> > +				       const void *sa_)
> > +{
> > +	const struct sockaddr *sa = (const struct sockaddr *)sa_;
> > +
> > +	if (sa->sa_family == AF_INET6) {
> > +		struct sockaddr_in6 *sa6 = (struct sockaddr_in6 *)sa;
> > +
> > +		inany_from_af(aa, AF_INET6, &sa6->sin6_addr);
> > +		*port = ntohs(sa6->sin6_port);
> > +	} else if (sa->sa_family == AF_INET) {
> > +		struct sockaddr_in *sa4 = (struct sockaddr_in *)sa;
> > +
> > +		inany_from_af(aa, AF_INET, &sa4->sin_addr);
> > +		*port = ntohs(sa4->sin_port);
> > +	} else {
> > +		/* Not valid to call with other address families */
> > +		assert(0);
> > +	}
> > +}
> > diff --git a/tcp.c b/tcp.c
> > index b05ed6c..fca5df4 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -2724,6 +2724,29 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
> >  	conn_flag(c, conn, ACK_FROM_TAP_DUE);
> >  }
> >  
> > +static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr)
> 
> What this does is kind of obvious, still a comment would be nice.

Good point, added.  Especially since I'm hoping to share this with UDP
at some later point.

> > +{
> > +	struct in_addr *addr4 = inany_v4(addr);
> > +
> > +	if (addr4) {
> > +		if (IN4_IS_ADDR_LOOPBACK(addr4) ||
> > +		    IN4_IS_ADDR_UNSPECIFIED(addr4) ||
> > +		    IN4_ARE_ADDR_EQUAL(addr4, &c->ip4.addr_seen))
> > +			*addr4 = c->ip4.gw;
> > +	} else {
> > +		struct in6_addr *addr6 = &addr->a6;
> > +
> > +		if (IN6_IS_ADDR_LOOPBACK(addr6) ||
> > +		    IN6_ARE_ADDR_EQUAL(addr6, &c->ip6.addr_seen) ||
> > +		    IN6_ARE_ADDR_EQUAL(addr6, &c->ip6.addr)) {
> > +			if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
> > +				*addr6 = c->ip6.gw;
> > +			else
> > +				*addr6 = c->ip6.addr_ll;
> > +		}
> > +	}
> > +}
> > +
> >  /**
> >   * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection
> >   * @c:		Execution context
> > @@ -2744,43 +2767,10 @@ static void tcp_tap_conn_from_sock(struct ctx *c, union epoll_ref ref,
> >  	conn->ws_to_tap = conn->ws_from_tap = 0;
> >  	conn_event(c, conn, SOCK_ACCEPTED);
> >  
> > -	if (sa->sa_family == AF_INET6) {
> > -		struct sockaddr_in6 sa6;
> > -
> > -		memcpy(&sa6, sa, sizeof(sa6));
> > -
> > -		if (IN6_IS_ADDR_LOOPBACK(&sa6.sin6_addr) ||
> > -		    IN6_ARE_ADDR_EQUAL(&sa6.sin6_addr, &c->ip6.addr_seen) ||
> > -		    IN6_ARE_ADDR_EQUAL(&sa6.sin6_addr, &c->ip6.addr)) {
> > -			struct in6_addr *src;
> > +	inany_from_sockaddr(&conn->addr, &conn->sock_port, sa);
> > +	conn->tap_port = ref.r.p.tcp.tcp.index;
> >  
> > -			if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
> > -				src = &c->ip6.gw;
> > -			else
> > -				src = &c->ip6.addr_ll;
> > -
> > -			memcpy(&sa6.sin6_addr, src, sizeof(*src));
> > -		}
> > -
> > -		inany_from_af(&conn->addr, AF_INET6, &sa6.sin6_addr);
> > -
> > -		conn->sock_port = ntohs(sa6.sin6_port);
> > -		conn->tap_port = ref.r.p.tcp.tcp.index;
> > -	} else {
> > -		struct sockaddr_in sa4;
> > -
> > -		memcpy(&sa4, sa, sizeof(sa4));
> > -
> > -		if (IN4_IS_ADDR_LOOPBACK(&sa4.sin_addr) ||
> > -		    IN4_IS_ADDR_UNSPECIFIED(&sa4.sin_addr) ||
> > -		    IN4_ARE_ADDR_EQUAL(&sa4.sin_addr, &c->ip4.addr_seen))
> > -			sa4.sin_addr = c->ip4.gw;
> > -
> > -		inany_from_af(&conn->addr, AF_INET, &sa4.sin_addr);
> > -
> > -		conn->sock_port = ntohs(sa4.sin_port);
> > -		conn->tap_port = ref.r.p.tcp.tcp.index;
> > -	}
> > +	tcp_snat_inbound(c, &conn->addr);
> >  
> >  	tcp_seq_init(c, conn, now);
> >  	tcp_hash_insert(c, conn);
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 28/32] tcp_splice: Allow splicing of connections from IPv4-mapped loopback
  2022-11-17  0:15   ` Stefano Brivio
@ 2022-11-17  2:05     ` David Gibson
  0 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-17  2:05 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On Thu, Nov 17, 2022 at 01:15:26AM +0100, Stefano Brivio wrote:
> On Wed, 16 Nov 2022 15:42:08 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > For non-spliced connections we now treat IPv4-mapped IPv6 addresses the
> > same as the corresponding IPv4 addresses.  However currently we won't
> > splice a connection from ::ffff:127.0.0.1 the way we would one from
> > 127.0.0.1.  Correct this so that we can splice connections from IPv4
> > localhost that have been received on an IPv6 dual stack socket.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  tcp_splice.c | 20 ++++++++++++--------
> >  1 file changed, 12 insertions(+), 8 deletions(-)
> > 
> > diff --git a/tcp_splice.c b/tcp_splice.c
> > index 7c2f667..61c56be 100644
> > --- a/tcp_splice.c
> > +++ b/tcp_splice.c
> > @@ -514,19 +514,23 @@ bool tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
> >  			       struct tcp_splice_conn *conn, int s,
> >  			       const struct sockaddr *sa)
> >  {
> > +	union inany_addr aany;
> > +	const struct in_addr *a4;
> 
> The usual order.

Fixed.

> > +	in_port_t port;
> 
> Is the port actually needed here? I don't see how you use it.

Well, inany_from_sockaddr() requires a port parameter.  I could make
it accept NULL, but then I'd have some noise there instead of here.
Since inany_from_sockaddr() is an inline, I'm hoping the compiler will
be smart enough to elide it anyway.

I don't have a strong preference, I can change it if you'd like.
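
For comparison, a rough sketch of what the NULL-accepting alternative
could look like.  This is not inany_from_sockaddr() itself:
sketch_sockaddr_port() is a made-up helper that only deals with the
family and port, to show where the extra branch would end up.

```c
#include <assert.h>
#include <stddef.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Return the address family of @addr, and store the port in host order
 * in @port unless @port is NULL.
 */
static sa_family_t sketch_sockaddr_port(const void *addr, in_port_t *port)
{
	const struct sockaddr *sa = (const struct sockaddr *)addr;
	in_port_t p = 0;

	if (sa->sa_family == AF_INET6) {
		p = ntohs(((const struct sockaddr_in6 *)sa)->sin6_port);
	} else if (sa->sa_family == AF_INET) {
		p = ntohs(((const struct sockaddr_in *)sa)->sin_port);
	} else {
		/* Not valid to call with other address families */
		assert(0);
	}

	/* This is the "noise" that moves from the caller into the helper */
	if (port)
		*port = p;

	return sa->sa_family;
}
```

The caller no longer needs a dummy variable, at the price of one more
conditional in a helper that every other caller pays for as well (unless
the compiler optimizes it away after inlining).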

> 
> > +
> >  	assert(c->mode == MODE_PASTA);
> >  
> > -	if (sa->sa_family == AF_INET6) {
> > -		const struct sockaddr_in6 *sa6
> > -			= (const struct sockaddr_in6 *)sa;
> > -		if (!IN6_IS_ADDR_LOOPBACK(&sa6->sin6_addr))
> > +	inany_from_sockaddr(&aany, &port, sa);
> > +	a4 = inany_v4(&aany);
> > +
> > +	if (a4) {
> > +		if (!IN4_IS_ADDR_LOOPBACK(a4))
> >  			return false;
> > -		conn->flags = SPLICE_V6;
> > +		conn->flags = 0;
> >  	} else {
> > -		const struct sockaddr_in *sa4 = (const struct sockaddr_in *)sa;
> > -		if (!IN4_IS_ADDR_LOOPBACK(&sa4->sin_addr))
> > +		if (!IN6_IS_ADDR_LOOPBACK(&aany.a6))
> >  			return false;
> > -		conn->flags = 0;
> > +		conn->flags = SPLICE_V6;
> >  	}
> >  
> >  	if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }),
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 32/32] tcp: Use dual stack sockets for port forwarding when possible
  2022-11-17  0:15   ` Stefano Brivio
@ 2022-11-17  2:08     ` David Gibson
  0 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-17  2:08 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On Thu, Nov 17, 2022 at 01:15:30AM +0100, Stefano Brivio wrote:
> On Wed, 16 Nov 2022 15:42:12 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > Platforms like Linux allow IPv6 sockets to listen for IPv4 connections as
> > well as native IPv6 connections.  By doing this we halve the number of
> > listening sockets we need for TCP (assuming passt/pasta is listening on the
> > same ports for IPv4 and IPv6).  When forwarding many ports (e.g. -t all)
> > this can significantly reduce the amount of kernel memory that passt
> > consumes.
> > 
> > When forwarding all TCP and UDP ports for both IPv4 and IPv6 (-t all
> > -u all), this reduces kernel memory usage from ~677MiB to ~487MiB
> > (kernel version 6.0.8 on Fedora 37, x86_64).
> 
> Oh, nice, that's quite significant.
> 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  tcp.c | 14 ++++++++++++--
> >  1 file changed, 12 insertions(+), 2 deletions(-)
> > 
> > diff --git a/tcp.c b/tcp.c
> > index 616b9d0..5860c9f 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -2991,8 +2991,12 @@ static int tcp_sock_init_af(const struct ctx *c, int af, in_port_t port,
> >  
> >  	s = sock_l4(c, af, IPPROTO_TCP, addr, ifname, port, tref.u32);
> >  
> > -	if (c->tcp.fwd_in.mode == FWD_AUTO)
> > -		tcp_sock_init_ext[port][(af == AF_INET) ? V4 : V6] = s;
> > +	if (c->tcp.fwd_in.mode == FWD_AUTO) {
> > +		if (af == AF_INET || af == AF_UNSPEC)
> > +			tcp_sock_init_ext[port][V4] = s;
> > +		if (af == AF_INET6 || af == AF_UNSPEC)
> 
> Nit: you could align the || af == AF_UNSPEC above with an extra
> whitespace (as it's done in the context below).

Done.

> > +			tcp_sock_init_ext[port][V6] = s;
> > +	}
> >  
> >  	if (s < 0)
> >  		return -1;
> > @@ -3012,6 +3016,12 @@ static int tcp_sock_init_af(const struct ctx *c, int af, in_port_t port,
> >  void tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
> >  		   const char *ifname, in_port_t port)
> >  {
> > +	if (af == AF_UNSPEC && c->ifi4 && c->ifi6)
> > +		/* Attempt to get a dual stack socket */
> > +		if (tcp_sock_init_af(c, AF_UNSPEC, port, addr, ifname) >= 0)
> > +			return;
> > +
> > +	/* Otherwise create a socket per IP version */
> 
> ...this looks surprisingly clean by the way, at least much cleaner than
> I expected.

Right.  The trick is in realizing that the properties (spliced, IP
version) of an established connection don't need to be tied to the
properties of the listening socket which created it in the first
place.
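
To illustrate the idea outside of passt, here is a minimal sketch of a
dual stack listener using only the standard sockets API.  How sock_l4()
sets this up for AF_UNSPEC is not shown in this patch, so nothing below
should be read as its implementation; dual_stack_listen() and
peer_is_v4() are made-up names.  The point is that the listening socket
is always AF_INET6, and the IP version of each connection is decided
per connection from the peer address.

```c
#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Open one IPv6 listening socket that also accepts IPv4 connections.
 * Returns the socket, or -1 on failure.
 */
static int dual_stack_listen(in_port_t port)
{
	struct sockaddr_in6 sa6;
	int s, off = 0;

	s = socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP);
	if (s < 0)
		return -1;

	memset(&sa6, 0, sizeof(sa6));
	sa6.sin6_family = AF_INET6;
	sa6.sin6_addr = in6addr_any;
	sa6.sin6_port = htons(port);

	/* Clearing IPV6_V6ONLY is what makes the socket dual stack */
	if (setsockopt(s, IPPROTO_IPV6, IPV6_V6ONLY, &off, sizeof(off)) ||
	    bind(s, (struct sockaddr *)&sa6, sizeof(sa6)) ||
	    listen(s, 128)) {
		close(s);
		return -1;
	}

	return s;
}

/* Peers accepted on such a socket are reported as AF_INET6; IPv4 peers
 * show up with an IPv4-mapped address (::ffff:a.b.c.d).
 */
static bool peer_is_v4(const struct sockaddr_in6 *peer)
{
	return IN6_IS_ADDR_V4MAPPED(&peer->sin6_addr);
}
```

After accept() on the socket returned by dual_stack_listen(), calling
peer_is_v4() on the returned address tells the two cases apart, so one
listening socket per port is enough for both IP versions.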

> >  	if ((af == AF_INET  || af == AF_UNSPEC) && c->ifi4)
> >  		tcp_sock_init_af(c, AF_INET, port, addr, ifname);
> >  	if ((af == AF_INET6 || af == AF_UNSPEC) && c->ifi6)
> 
> I just finished reviewing this series, in general it looks great to me,
> I would have another look (and test!) on Thursday -- either using this
> version or a re-spin.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH 15/32] tcp: Unify part of spliced and non-spliced conn_from_sock path
  2022-11-17  1:37     ` David Gibson
@ 2022-11-17  7:30       ` Stefano Brivio
  2022-11-17  8:58         ` David Gibson
  0 siblings, 1 reply; 57+ messages in thread
From: Stefano Brivio @ 2022-11-17  7:30 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Thu, 17 Nov 2022 12:37:04 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Nov 17, 2022 at 12:53:58AM +0100, Stefano Brivio wrote:
> > On Wed, 16 Nov 2022 15:41:55 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > In tcp_sock_handler() we split off to handle spliced sockets before
> > > checking anything else.  However the first steps of the "new connection"
> > > path for each case are the same: allocate a connection entry and accept()
> > > the connection.
> > > 
> > > Remove this duplication by making tcp_conn_from_sock() handle both spliced
> > > and non-spliced cases, with help from more specific tcp_tap_conn_from_sock
> > > and tcp_splice_conn_from_sock functions for the later stages which differ.
> > > 
> > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > ---
> > >  tcp.c        | 68 ++++++++++++++++++++++++++++++++++------------------
> > >  tcp_splice.c | 58 +++++++++++++++++++++++---------------------
> > >  tcp_splice.h |  4 ++++
> > >  3 files changed, 80 insertions(+), 50 deletions(-)
> > > 
> > > diff --git a/tcp.c b/tcp.c
> > > index 72d3b49..e66a82a 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -2753,28 +2753,19 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
> > >  }
> > >  
> > >  /**
> > > - * tcp_conn_from_sock() - Handle new connection request from listening socket
> > > + * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection
> > >   * @c:		Execution context
> > >   * @ref:	epoll reference of listening socket
> > > + * @conn:	connection structure to initialize
> > > + * @s:		Accepted socket
> > > + * @sa:		Peer socket address (from accept())
> > >   * @now:	Current timestamp
> > >   */
> > > -static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
> > > -			       const struct timespec *now)
> > > +static void tcp_tap_conn_from_sock(struct ctx *c, union epoll_ref ref,
> > > +				   struct tcp_tap_conn *conn, int s,
> > > +				   struct sockaddr *sa,
> > > +				   const struct timespec *now)
> > >  {
> > > -	struct sockaddr_storage sa;
> > > -	struct tcp_tap_conn *conn;
> > > -	socklen_t sl;
> > > -	int s;
> > > -
> > > -	if (c->tcp.conn_count >= TCP_MAX_CONNS)
> > > -		return;
> > > -
> > > -	sl = sizeof(sa);
> > > -	s = accept4(ref.r.s, (struct sockaddr *)&sa, &sl, SOCK_NONBLOCK);
> > > -	if (s < 0)
> > > -		return;
> > > -
> > > -	conn = CONN(c->tcp.conn_count++);
> > >  	conn->c.spliced = false;
> > >  	conn->sock = s;
> > >  	conn->timer = -1;
> > > @@ -2784,7 +2775,7 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
> > >  	if (ref.r.p.tcp.tcp.v6) {
> > >  		struct sockaddr_in6 sa6;
> > >  
> > > -		memcpy(&sa6, &sa, sizeof(sa6));
> > > +		memcpy(&sa6, sa, sizeof(sa6));
> > >  
> > >  		if (IN6_IS_ADDR_LOOPBACK(&sa6.sin6_addr) ||
> > >  		    IN6_ARE_ADDR_EQUAL(&sa6.sin6_addr, &c->ip6.addr_seen) ||
> > > @@ -2813,7 +2804,7 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
> > >  	} else {
> > >  		struct sockaddr_in sa4;
> > >  
> > > -		memcpy(&sa4, &sa, sizeof(sa4));
> > > +		memcpy(&sa4, sa, sizeof(sa4));
> > >  
> > >  		memset(&conn->a.a4.zero,   0, sizeof(conn->a.a4.zero));
> > >  		memset(&conn->a.a4.one, 0xff, sizeof(conn->a.a4.one));
> > > @@ -2846,6 +2837,37 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
> > >  	tcp_get_sndbuf(conn);
> > >  }
> > >  
> > > +/**
> > > + * tcp_conn_from_sock() - Handle new connection request from listening socket
> > > + * @c:		Execution context
> > > + * @ref:	epoll reference of listening socket
> > > + * @now:	Current timestamp
> > > + */
> > > +static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
> > > +			       const struct timespec *now)
> > > +{
> > > +	struct sockaddr_storage sa;
> > > +	union tcp_conn *conn;
> > > +	socklen_t sl;
> > > +	int s;
> > > +
> > > +	if (c->tcp.conn_count >= TCP_MAX_CONNS)
> > > +		return;
> > > +
> > > +	sl = sizeof(sa);
> > > +	s = accept4(ref.r.s, (struct sockaddr *)&sa, &sl, SOCK_NONBLOCK);  
> > 
> > Combined with 16/32 I'm not sure this is simplifying much -- it looks a
> > bit unnatural there to get the peer address not "directly" from
> > accept4(). On the other hand you drop a few lines -- I'm fine with
> > it either way.  
> 
> Um.. I'm not really sure what you're getting at here.

By "directly" I mean assigned by accept4() in the same function,
instead of accept4() being done in the caller.

That is, if I now look at tcp_tap_conn_from_sock() we have 'sa' there
which comes as an argument, not directly a couple of lines above from
accept4(), which would be quicker to review.

On the other hand the function comment says "from accept()", so it's
not much effort to figure that out either.

-- 
Stefano



* Re: [PATCH 15/32] tcp: Unify part of spliced and non-spliced conn_from_sock path
  2022-11-17  7:30       ` Stefano Brivio
@ 2022-11-17  8:58         ` David Gibson
  0 siblings, 0 replies; 57+ messages in thread
From: David Gibson @ 2022-11-17  8:58 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On Thu, Nov 17, 2022 at 08:30:29AM +0100, Stefano Brivio wrote:
> On Thu, 17 Nov 2022 12:37:04 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Nov 17, 2022 at 12:53:58AM +0100, Stefano Brivio wrote:
> > > On Wed, 16 Nov 2022 15:41:55 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > In tcp_sock_handler() we split off to handle spliced sockets before
> > > > checking anything else.  However the first steps of the "new connection"
> > > > path for each case are the same: allocate a connection entry and accept()
> > > > the connection.
> > > > 
> > > > Remove this duplication by making tcp_conn_from_sock() handle both spliced
> > > > and non-spliced cases, with help from more specific tcp_tap_conn_from_sock
> > > > and tcp_splice_conn_from_sock functions for the later stages which differ.
> > > > 
> > > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > > ---
> > > >  tcp.c        | 68 ++++++++++++++++++++++++++++++++++------------------
> > > >  tcp_splice.c | 58 +++++++++++++++++++++++---------------------
> > > >  tcp_splice.h |  4 ++++
> > > >  3 files changed, 80 insertions(+), 50 deletions(-)
> > > > 
> > > > diff --git a/tcp.c b/tcp.c
> > > > index 72d3b49..e66a82a 100644
> > > > --- a/tcp.c
> > > > +++ b/tcp.c
> > > > @@ -2753,28 +2753,19 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
> > > >  }
> > > >  
> > > >  /**
> > > > - * tcp_conn_from_sock() - Handle new connection request from listening socket
> > > > + * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection
> > > >   * @c:		Execution context
> > > >   * @ref:	epoll reference of listening socket
> > > > + * @conn:	connection structure to initialize
> > > > + * @s:		Accepted socket
> > > > + * @sa:		Peer socket address (from accept())
> > > >   * @now:	Current timestamp
> > > >   */
> > > > -static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
> > > > -			       const struct timespec *now)
> > > > +static void tcp_tap_conn_from_sock(struct ctx *c, union epoll_ref ref,
> > > > +				   struct tcp_tap_conn *conn, int s,
> > > > +				   struct sockaddr *sa,
> > > > +				   const struct timespec *now)
> > > >  {
> > > > -	struct sockaddr_storage sa;
> > > > -	struct tcp_tap_conn *conn;
> > > > -	socklen_t sl;
> > > > -	int s;
> > > > -
> > > > -	if (c->tcp.conn_count >= TCP_MAX_CONNS)
> > > > -		return;
> > > > -
> > > > -	sl = sizeof(sa);
> > > > -	s = accept4(ref.r.s, (struct sockaddr *)&sa, &sl, SOCK_NONBLOCK);
> > > > -	if (s < 0)
> > > > -		return;
> > > > -
> > > > -	conn = CONN(c->tcp.conn_count++);
> > > >  	conn->c.spliced = false;
> > > >  	conn->sock = s;
> > > >  	conn->timer = -1;
> > > > @@ -2784,7 +2775,7 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
> > > >  	if (ref.r.p.tcp.tcp.v6) {
> > > >  		struct sockaddr_in6 sa6;
> > > >  
> > > > -		memcpy(&sa6, &sa, sizeof(sa6));
> > > > +		memcpy(&sa6, sa, sizeof(sa6));
> > > >  
> > > >  		if (IN6_IS_ADDR_LOOPBACK(&sa6.sin6_addr) ||
> > > >  		    IN6_ARE_ADDR_EQUAL(&sa6.sin6_addr, &c->ip6.addr_seen) ||
> > > > @@ -2813,7 +2804,7 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
> > > >  	} else {
> > > >  		struct sockaddr_in sa4;
> > > >  
> > > > -		memcpy(&sa4, &sa, sizeof(sa4));
> > > > +		memcpy(&sa4, sa, sizeof(sa4));
> > > >  
> > > >  		memset(&conn->a.a4.zero,   0, sizeof(conn->a.a4.zero));
> > > >  		memset(&conn->a.a4.one, 0xff, sizeof(conn->a.a4.one));
> > > > @@ -2846,6 +2837,37 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
> > > >  	tcp_get_sndbuf(conn);
> > > >  }
> > > >  
> > > > +/**
> > > > + * tcp_conn_from_sock() - Handle new connection request from listening socket
> > > > + * @c:		Execution context
> > > > + * @ref:	epoll reference of listening socket
> > > > + * @now:	Current timestamp
> > > > + */
> > > > +static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
> > > > +			       const struct timespec *now)
> > > > +{
> > > > +	struct sockaddr_storage sa;
> > > > +	union tcp_conn *conn;
> > > > +	socklen_t sl;
> > > > +	int s;
> > > > +
> > > > +	if (c->tcp.conn_count >= TCP_MAX_CONNS)
> > > > +		return;
> > > > +
> > > > +	sl = sizeof(sa);
> > > > +	s = accept4(ref.r.s, (struct sockaddr *)&sa, &sl, SOCK_NONBLOCK);  
> > > 
> > > Combined with 16/32 I'm not sure this is simplifying much -- it looks a
> > > bit unnatural there to get the peer address not "directly" from
> > > accept4(). On the other hand you drop a few lines -- I'm fine with
> > > it either way.  
> > 
> > Um.. I'm not really sure what you're getting at here.
> 
> By "directly" I mean assigned by accept4() in the same function,
> instead of accept4() being done in the caller.
> 
> That is, if I now look at tcp_tap_conn_from_sock() we have 'sa' there
> which comes as an argument, not directly a couple of lines above from
> accept4(), which would be quicker to review.

Right, but this is unavoidable.  This patch is a preliminary to
deciding whether to take the spliced or non-spliced route based on the
address we get from accept4().
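
As a rough illustration of that direction (not the code from this
series): the caller accepts once, then picks a route from the peer
address. The names conn_route, peer_is_loopback(), route_for_peer() and
conn_from_listener() are hypothetical, and "loopback peer means spliced"
is an assumption made only for this sketch. Plain accept() is used here
to keep the example portable; the patch itself uses accept4() with
SOCK_NONBLOCK.

```c
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical routes, standing in for the tap and spliced paths */
enum conn_route { ROUTE_TAP, ROUTE_SPLICED };

/* Assumption for this sketch: loopback peers (IPv4, IPv6, or
 * IPv4-mapped IPv6) are the ones that would take the spliced path.
 */
bool peer_is_loopback(const struct sockaddr *sa)
{
	if (sa->sa_family == AF_INET) {
		struct sockaddr_in sa4;

		memcpy(&sa4, sa, sizeof(sa4));
		/* 127.0.0.0/8 */
		return (ntohl(sa4.sin_addr.s_addr) >> 24) == 127;
	}

	if (sa->sa_family == AF_INET6) {
		struct sockaddr_in6 sa6;

		memcpy(&sa6, sa, sizeof(sa6));
		if (IN6_IS_ADDR_LOOPBACK(&sa6.sin6_addr))
			return true;
		/* ::ffff:127.x.y.z, first byte of the embedded IPv4 */
		if (IN6_IS_ADDR_V4MAPPED(&sa6.sin6_addr))
			return sa6.sin6_addr.s6_addr[12] == 127;
	}

	return false;
}

/* The address arrives as an argument, as in tcp_tap_conn_from_sock() */
enum conn_route route_for_peer(const struct sockaddr *sa)
{
	return peer_is_loopback(sa) ? ROUTE_SPLICED : ROUTE_TAP;
}

/* Accept in the caller, then decide the route from the peer address.
 * Returns the accepted socket, or -1 on failure.
 */
int conn_from_listener(int listen_fd, enum conn_route *route)
{
	struct sockaddr_storage sa;
	socklen_t sl = sizeof(sa);
	int s;

	s = accept(listen_fd, (struct sockaddr *)&sa, &sl);
	if (s < 0)
		return -1;

	*route = route_for_peer((struct sockaddr *)&sa);
	return s;
}
```

The point is only that the route can't be chosen until accept() has
returned the address, so the accept has to sit in the common caller.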

> On the other hand the function comment says "from accept()", so it's
> not much effort to figure that out either.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


Thread overview: 57+ messages
2022-11-16  4:41 [PATCH 00/32] Use dual stack sockets to listen for inbound TCP connections David Gibson
2022-11-16  4:41 ` [PATCH 01/32] clang-tidy: Suppress warning about assignments in if statements David Gibson
2022-11-16 23:10   ` Stefano Brivio
2022-11-17  1:20     ` David Gibson
2022-11-16  4:41 ` [PATCH 02/32] style: Minor corrections to function comments David Gibson
2022-11-16 23:11   ` Stefano Brivio
2022-11-17  1:21     ` David Gibson
2022-11-16  4:41 ` [PATCH 03/32] tcp_splice: #include tcp_splice.h in tcp_splice.c David Gibson
2022-11-16  4:41 ` [PATCH 04/32] tcp: Remove unused TCP_MAX_SOCKS constant David Gibson
2022-11-16  4:41 ` [PATCH 05/32] tcp: Better helpers for converting between connection pointer and index David Gibson
2022-11-16 23:11   ` Stefano Brivio
2022-11-17  1:24     ` David Gibson
2022-11-16  4:41 ` [PATCH 06/32] tcp_splice: Helpers for converting from index to/from tcp_splice_conn David Gibson
2022-11-16  4:41 ` [PATCH 07/32] tcp: Move connection state structures into a shared header David Gibson
2022-11-16  4:41 ` [PATCH 08/32] tcp: Add connection union type David Gibson
2022-11-16  4:41 ` [PATCH 09/32] tcp: Improved helpers to update connections after moving David Gibson
2022-11-16  4:41 ` [PATCH 10/32] tcp: Unify spliced and non-spliced connection tables David Gibson
2022-11-16  4:41 ` [PATCH 11/32] tcp: Unify tcp_defer_handler and tcp_splice_defer_handler() David Gibson
2022-11-16  4:41 ` [PATCH 12/32] tcp: Partially unify tcp_timer() and tcp_splice_timer() David Gibson
2022-11-16  4:41 ` [PATCH 13/32] tcp: Unify the IN_EPOLL flag David Gibson
2022-11-16  4:41 ` [PATCH 14/32] tcp: Separate helpers to create ns listening sockets David Gibson
2022-11-16 23:51   ` Stefano Brivio
2022-11-17  1:32     ` David Gibson
2022-11-16  4:41 ` [PATCH 15/32] tcp: Unify part of spliced and non-spliced conn_from_sock path David Gibson
2022-11-16 23:53   ` Stefano Brivio
2022-11-17  1:37     ` David Gibson
2022-11-17  7:30       ` Stefano Brivio
2022-11-17  8:58         ` David Gibson
2022-11-16  4:41 ` [PATCH 16/32] tcp: Use the same sockets to listen for spliced and non-spliced connections David Gibson
2022-11-16 23:54   ` Stefano Brivio
2022-11-17  1:43     ` David Gibson
2022-11-16  4:41 ` [PATCH 17/32] tcp: Remove splice from tcp_epoll_ref David Gibson
2022-11-16  4:41 ` [PATCH 18/32] tcp: Don't store hash bucket in connection structures David Gibson
2022-11-16  4:41 ` [PATCH 19/32] inany: Helper functions for handling addresses which could be IPv4 or IPv6 David Gibson
2022-11-16 23:54   ` Stefano Brivio
2022-11-17  1:48     ` David Gibson
2022-11-16  4:42 ` [PATCH 20/32] tcp: Hash IPv4 and IPv4-mapped-IPv6 addresses the same David Gibson
2022-11-16  4:42 ` [PATCH 21/32] tcp: Take tcp_hash_insert() address from struct tcp_conn David Gibson
2022-11-16  4:42 ` [PATCH 22/32] tcp: Simplify tcp_hash_match() to take an inany_addr David Gibson
2022-11-16  4:42 ` [PATCH 23/32] tcp: Unify initial sequence number calculation for IPv4 and IPv6 David Gibson
2022-11-16  4:42 ` [PATCH 24/32] tcp: Have tcp_seq_init() take its parameters from struct tcp_conn David Gibson
2022-11-16  4:42 ` [PATCH 25/32] tcp: Fix small errors in tcp_seq_init() time handling David Gibson
2022-11-16  4:42 ` [PATCH 26/32] tcp: Remove v6 flag from tcp_epoll_ref David Gibson
2022-11-17  0:15   ` Stefano Brivio
2022-11-17  1:50     ` David Gibson
2022-11-16  4:42 ` [PATCH 27/32] tcp: NAT IPv4-mapped IPv6 addresses like IPv4 addresses David Gibson
2022-11-17  0:15   ` Stefano Brivio
2022-11-17  2:00     ` David Gibson
2022-11-16  4:42 ` [PATCH 28/32] tcp_splice: Allow splicing of connections from IPv4-mapped loopback David Gibson
2022-11-17  0:15   ` Stefano Brivio
2022-11-17  2:05     ` David Gibson
2022-11-16  4:42 ` [PATCH 29/32] tcp: Consolidate tcp_sock_init[46] David Gibson
2022-11-16  4:42 ` [PATCH 30/32] util: Allow sock_l4() to open dual stack sockets David Gibson
2022-11-16  4:42 ` [PATCH 31/32] util: Always return -1 on error in sock_l4() David Gibson
2022-11-16  4:42 ` [PATCH 32/32] tcp: Use dual stack sockets for port forwarding when possible David Gibson
2022-11-17  0:15   ` Stefano Brivio
2022-11-17  2:08     ` David Gibson

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt
