public inbox for passt-dev@passt.top
* [PATCH v3 00/13] Manage more flow related things from generic flow code
@ 2023-12-21  6:15 David Gibson
  2023-12-21  6:15 ` [PATCH v3 01/13] flow: Make flow_table.h #include the protocol specific headers it needs David Gibson
                   ` (12 more replies)
  0 siblings, 13 replies; 40+ messages in thread
From: David Gibson @ 2023-12-21  6:15 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

A number of things that are more or less general to flows are still
explicitly handled in tcp.c and tcp_splice.c, including allocation and
freeing of flow entries, and dispatch of deferred and timer functions.

Even without adding more fields to the common flow structure, we can
handle a number of these in a more flow-centric way.

Unlike v1, this version is based on the hash table rework series.

Changes since v2:
 * Realised the prealloc/commit functions were confusing and worked
   poorly for some future changes.  Replaced them with alloc/alloc_cancel
 * Fixed a bug where newly allocated flow entries might not be
   0-filled, because of the free-tracking information stored there.
   This could cause very subtle problems.
Changes since v1:
 * Store the timestamp of last flow timers run in a global, rather
   than a ctx field
 * Rebased on the TCP hash table rework
 * Add patches 9..13/13 with changes to allocation and freeing of flow
   entries.

David Gibson (13):
  flow: Make flow_table.h #include the protocol specific headers it
    needs
  treewide: Standardise on 'now' for current timestamp variables
  tcp, tcp_splice: Remove redundant handling from tcp_timer()
  tcp, tcp_splice: Move per-type cleanup logic into per-type helpers
  flow, tcp: Add flow-centric dispatch for deferred flow handling
  flow, tcp: Add handling for per-flow timers
  epoll: Better handling of number of epoll types
  tcp, tcp_splice: Avoid double layered dispatch for connected TCP
    sockets
  flow: Move flow_log_() to near top of flow.c
  flow: Move flow_count from context structure to a global
  flow: Abstract allocation of new flows with helper function
  flow: Enforce that freeing of closed flows must happen in deferred
    handlers
  flow: Avoid moving flow entries to compact table

 flow.c       | 223 ++++++++++++++++++++++++++++++++++++++++++---------
 flow.h       |   5 +-
 flow_table.h |  20 +++++
 icmp.c       |  12 +--
 icmp.h       |   2 +-
 log.c        |  34 ++++----
 passt.c      |  20 +++--
 passt.h      |   9 +--
 tcp.c        | 143 +++++++++------------------------
 tcp.h        |   2 +-
 tcp_conn.h   |   8 +-
 tcp_splice.c |  49 +++++------
 tcp_splice.h |   4 +-
 udp.c        |  16 ++--
 udp.h        |   2 +-
 15 files changed, 324 insertions(+), 225 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 01/13] flow: Make flow_table.h #include the protocol specific headers it needs
  2023-12-21  6:15 [PATCH v3 00/13] Manage more flow related things from generic flow code David Gibson
@ 2023-12-21  6:15 ` David Gibson
  2023-12-21  6:15 ` [PATCH v3 02/13] treewide: Standardise on 'now' for current timestamp variables David Gibson
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 40+ messages in thread
From: David Gibson @ 2023-12-21  6:15 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

flow_table.h, the lower level flow header, relies on having the struct
definitions for every protocol specific flow type - so far that means
tcp_conn.h.  It doesn't include that header itself, so tcp_conn.h must
be included before flow_table.h.

That's ok for now, but as we use the flow table for more things,
flow_table.h will need the structs for all of them.  That would mean
the protocol specific .c files including tcp_conn.h _and_ the
equivalents for every other flow type before flow_table.h, every time,
which is weird.

So, although we *mostly* lean towards the include style where .c files need
to handle the include dependencies, in this case it makes more sense to
have flow_table.h include all the protocol specific headers it needs.
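
As a hedged sketch of the result (not part of the patch itself), a
protocol specific .c file then only needs:

    #include "flow.h"
    #include "flow_table.h"  /* pulls in tcp_conn.h, and any future
                              * per-protocol flow headers */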

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c       | 1 -
 flow_table.h | 2 ++
 tcp.c        | 1 -
 tcp_splice.c | 1 -
 4 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/flow.c b/flow.c
index d24726d..a1c0a34 100644
--- a/flow.c
+++ b/flow.c
@@ -14,7 +14,6 @@
 #include "siphash.h"
 #include "inany.h"
 #include "flow.h"
-#include "tcp_conn.h"
 #include "flow_table.h"
 
 const char *flow_type_str[] = {
diff --git a/flow_table.h b/flow_table.h
index 0dee66f..e805f10 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -7,6 +7,8 @@
 #ifndef FLOW_TABLE_H
 #define FLOW_TABLE_H
 
+#include "tcp_conn.h"
+
 /**
  * union flow - Descriptor for a logical packet flow (e.g. connection)
  * @f:		Fields common between all variants
diff --git a/tcp.c b/tcp.c
index 63b39e0..422770f 100644
--- a/tcp.c
+++ b/tcp.c
@@ -298,7 +298,6 @@
 #include "inany.h"
 #include "flow.h"
 
-#include "tcp_conn.h"
 #include "flow_table.h"
 
 /* Sides of a flow as we use them in "tap" connections */
diff --git a/tcp_splice.c b/tcp_splice.c
index 69ea79d..052f989 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -56,7 +56,6 @@
 #include "inany.h"
 #include "flow.h"
 
-#include "tcp_conn.h"
 #include "flow_table.h"
 
 #define MAX_PIPE_SIZE			(8UL * 1024 * 1024)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 02/13] treewide: Standardise on 'now' for current timestamp variables
  2023-12-21  6:15 [PATCH v3 00/13] Manage more flow related things from generic flow code David Gibson
  2023-12-21  6:15 ` [PATCH v3 01/13] flow: Make flow_table.h #include the protocol specific headers it needs David Gibson
@ 2023-12-21  6:15 ` David Gibson
  2023-12-21  6:15 ` [PATCH v3 03/13] tcp, tcp_splice: Remove redundant handling from tcp_timer() David Gibson
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 40+ messages in thread
From: David Gibson @ 2023-12-21  6:15 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

In a number of places we pass around a struct timespec representing the
(more or less) current time.  Sometimes we call it 'now', and sometimes we
call it 'ts'.  Standardise on the more informative 'now'.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 icmp.c | 12 ++++++------
 icmp.h |  2 +-
 log.c  | 34 +++++++++++++++++-----------------
 tcp.c  |  6 +++---
 tcp.h  |  2 +-
 udp.c  | 16 ++++++++--------
 udp.h  |  2 +-
 7 files changed, 37 insertions(+), 37 deletions(-)

diff --git a/icmp.c b/icmp.c
index a1de8ae..a9fbcb2 100644
--- a/icmp.c
+++ b/icmp.c
@@ -290,14 +290,14 @@ fail_sock:
  * @c:		Execution context
  * @v6:		Set for IPv6 echo identifier bindings
  * @id:		Echo identifier, host order
- * @ts:		Timestamp from caller
+ * @now:	Current timestamp
  */
 static void icmp_timer_one(const struct ctx *c, int v6, uint16_t id,
-			   const struct timespec *ts)
+			   const struct timespec *now)
 {
 	struct icmp_id_sock *id_map = &icmp_id_map[v6 ? V6 : V4][id];
 
-	if (ts->tv_sec - id_map->ts <= ICMP_ECHO_TIMEOUT)
+	if (now->tv_sec - id_map->ts <= ICMP_ECHO_TIMEOUT)
 		return;
 
 	bitmap_clear(icmp_act[v6 ? V6 : V4], id);
@@ -311,9 +311,9 @@ static void icmp_timer_one(const struct ctx *c, int v6, uint16_t id,
 /**
  * icmp_timer() - Scan activity bitmap for identifiers with timed events
  * @c:		Execution context
- * @ts:		Timestamp from caller
+ * @now:	Current timestamp
  */
-void icmp_timer(const struct ctx *c, const struct timespec *ts)
+void icmp_timer(const struct ctx *c, const struct timespec *now)
 {
 	long *word, tmp;
 	unsigned int i;
@@ -325,7 +325,7 @@ v6:
 		tmp = *word;
 		while ((n = ffsl(tmp))) {
 			tmp &= ~(1UL << (n - 1));
-			icmp_timer_one(c, v6, i * 8 + n - 1, ts);
+			icmp_timer_one(c, v6, i * 8 + n - 1, now);
 		}
 	}
 
diff --git a/icmp.h b/icmp.h
index 44cc495..1a08594 100644
--- a/icmp.h
+++ b/icmp.h
@@ -15,7 +15,7 @@ void icmpv6_sock_handler(const struct ctx *c, union epoll_ref ref);
 int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af,
 		     const void *saddr, const void *daddr,
 		     const struct pool *p, const struct timespec *now);
-void icmp_timer(const struct ctx *c, const struct timespec *ts);
+void icmp_timer(const struct ctx *c, const struct timespec *now);
 void icmp_init(void);
 
 /**
diff --git a/log.c b/log.c
index b206f72..d7f6d35 100644
--- a/log.c
+++ b/log.c
@@ -216,11 +216,11 @@ void logfile_init(const char *name, const char *path, size_t size)
 /**
  * logfile_rotate_fallocate() - Write header, set log_written after fallocate()
  * @fd:		Log file descriptor
- * @ts:		Current timestamp
+ * @now:	Current timestamp
  *
  * #syscalls lseek ppc64le:_llseek ppc64:_llseek armv6l:_llseek armv7l:_llseek
  */
-static void logfile_rotate_fallocate(int fd, const struct timespec *ts)
+static void logfile_rotate_fallocate(int fd, const struct timespec *now)
 {
 	char buf[BUFSIZ], *nl;
 	int n;
@@ -232,8 +232,8 @@ static void logfile_rotate_fallocate(int fd, const struct timespec *ts)
 
 	n = snprintf(buf, BUFSIZ,
 		     "%s - log truncated at %lli.%04lli", log_header,
-		     (long long int)(ts->tv_sec - log_start),
-		     (long long int)(ts->tv_nsec / (100L * 1000)));
+		     (long long int)(now->tv_sec - log_start),
+		     (long long int)(now->tv_nsec / (100L * 1000)));
 
 	/* Avoid partial lines by padding the header with spaces */
 	nl = memchr(buf + n + 1, '\n', BUFSIZ - n - 1);
@@ -252,20 +252,20 @@ static void logfile_rotate_fallocate(int fd, const struct timespec *ts)
 /**
  * logfile_rotate_move() - Fallback: move recent entries toward start, then cut
  * @fd:		Log file descriptor
- * @ts:		Current timestamp
+ * @now:	Current timestamp
  *
  * #syscalls lseek ppc64le:_llseek ppc64:_llseek armv6l:_llseek armv7l:_llseek
  * #syscalls ftruncate
  */
-static void logfile_rotate_move(int fd, const struct timespec *ts)
+static void logfile_rotate_move(int fd, const struct timespec *now)
 {
 	int header_len, write_offset, end, discard, n;
 	char buf[BUFSIZ], *nl;
 
 	header_len = snprintf(buf, BUFSIZ,
 			      "%s - log truncated at %lli.%04lli\n", log_header,
-			      (long long int)(ts->tv_sec - log_start),
-			      (long long int)(ts->tv_nsec / (100L * 1000)));
+			      (long long int)(now->tv_sec - log_start),
+			      (long long int)(now->tv_nsec / (100L * 1000)));
 	if (lseek(fd, 0, SEEK_SET) == -1)
 		return;
 	if (write(fd, buf, header_len) == -1)
@@ -314,7 +314,7 @@ out:
 /**
  * logfile_rotate() - "Rotate" log file once it's full
  * @fd:		Log file descriptor
- * @ts:		Current timestamp
+ * @now:	Current timestamp
  *
  * Return: 0 on success, negative error code on failure
  *
@@ -322,7 +322,7 @@ out:
  *
  * fallocate() passed as EXTRA_SYSCALL only if FALLOC_FL_COLLAPSE_RANGE is there
  */
-static int logfile_rotate(int fd, const struct timespec *ts)
+static int logfile_rotate(int fd, const struct timespec *now)
 {
 	if (fcntl(fd, F_SETFL, O_RDWR /* Drop O_APPEND: explicit lseek() */))
 		return -errno;
@@ -330,10 +330,10 @@ static int logfile_rotate(int fd, const struct timespec *ts)
 #ifdef FALLOC_FL_COLLAPSE_RANGE
 	/* Only for Linux >= 3.15, extent-based ext4 or XFS, glibc >= 2.18 */
 	if (!fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, log_cut_size))
-		logfile_rotate_fallocate(fd, ts);
+		logfile_rotate_fallocate(fd, now);
 	else
 #endif
-		logfile_rotate_move(fd, ts);
+		logfile_rotate_move(fd, now);
 
 	if (fcntl(fd, F_SETFL, O_RDWR | O_APPEND))
 		return -errno;
@@ -349,16 +349,16 @@ static int logfile_rotate(int fd, const struct timespec *ts)
  */
 void logfile_write(int pri, const char *format, va_list ap)
 {
-	struct timespec ts;
+	struct timespec now;
 	char buf[BUFSIZ];
 	int n;
 
-	if (clock_gettime(CLOCK_REALTIME, &ts))
+	if (clock_gettime(CLOCK_REALTIME, &now))
 		return;
 
 	n = snprintf(buf, BUFSIZ, "%lli.%04lli: %s",
-		     (long long int)(ts.tv_sec - log_start),
-		     (long long int)(ts.tv_nsec / (100L * 1000)),
+		     (long long int)(now.tv_sec - log_start),
+		     (long long int)(now.tv_nsec / (100L * 1000)),
 		     logfile_prefix[pri]);
 
 	n += vsnprintf(buf + n, BUFSIZ - n, format, ap);
@@ -366,7 +366,7 @@ void logfile_write(int pri, const char *format, va_list ap)
 	if (format[strlen(format)] != '\n')
 		n += snprintf(buf + n, BUFSIZ - n, "\n");
 
-	if ((log_written + n >= log_size) && logfile_rotate(log_file, &ts))
+	if ((log_written + n >= log_size) && logfile_rotate(log_file, &now))
 		return;
 
 	if ((n = write(log_file, buf, n)) >= 0)
diff --git a/tcp.c b/tcp.c
index 422770f..1628896 100644
--- a/tcp.c
+++ b/tcp.c
@@ -3180,13 +3180,13 @@ static int tcp_port_rebind_outbound(void *arg)
 /**
  * tcp_timer() - Periodic tasks: port detection, closed connections, pool refill
  * @c:		Execution context
- * @ts:		Unused
+ * @now:	Current timestamp
  */
-void tcp_timer(struct ctx *c, const struct timespec *ts)
+void tcp_timer(struct ctx *c, const struct timespec *now)
 {
 	union flow *flow;
 
-	(void)ts;
+	(void)now;
 
 	if (c->mode == MODE_PASTA) {
 		if (c->tcp.fwd_out.mode == FWD_AUTO) {
diff --git a/tcp.h b/tcp.h
index 27b1166..594d71a 100644
--- a/tcp.h
+++ b/tcp.h
@@ -20,7 +20,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
 int tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
 		  const char *ifname, in_port_t port);
 int tcp_init(struct ctx *c);
-void tcp_timer(struct ctx *c, const struct timespec *ts);
+void tcp_timer(struct ctx *c, const struct timespec *now);
 void tcp_defer_handler(struct ctx *c);
 
 void tcp_sock_set_bufsize(const struct ctx *c, int s);
diff --git a/udp.c b/udp.c
index 1f8c306..0b4582c 100644
--- a/udp.c
+++ b/udp.c
@@ -1140,10 +1140,10 @@ int udp_init(struct ctx *c)
  * @v6:		Set for IPv6 connections
  * @type:	Socket type
  * @port:	Port number, host order
- * @ts:		Timestamp from caller
+ * @now:	Current timestamp
  */
 static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type,
-			  in_port_t port, const struct timespec *ts)
+			  in_port_t port, const struct timespec *now)
 {
 	struct udp_splice_port *sp;
 	struct udp_tap_port *tp;
@@ -1153,7 +1153,7 @@ static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type,
 	case UDP_ACT_TAP:
 		tp = &udp_tap_map[v6 ? V6 : V4][port];
 
-		if (ts->tv_sec - tp->ts > UDP_CONN_TIMEOUT) {
+		if (now->tv_sec - tp->ts > UDP_CONN_TIMEOUT) {
 			sockp = &tp->sock;
 			tp->flags = 0;
 		}
@@ -1162,14 +1162,14 @@ static void udp_timer_one(struct ctx *c, int v6, enum udp_act_type type,
 	case UDP_ACT_SPLICE_INIT:
 		sp = &udp_splice_init[v6 ? V6 : V4][port];
 
-		if (ts->tv_sec - sp->ts > UDP_CONN_TIMEOUT)
+		if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT)
 			sockp = &sp->sock;
 
 		break;
 	case UDP_ACT_SPLICE_NS:
 		sp = &udp_splice_ns[v6 ? V6 : V4][port];
 
-		if (ts->tv_sec - sp->ts > UDP_CONN_TIMEOUT)
+		if (now->tv_sec - sp->ts > UDP_CONN_TIMEOUT)
 			sockp = &sp->sock;
 
 		break;
@@ -1249,9 +1249,9 @@ static int udp_port_rebind_outbound(void *arg)
 /**
  * udp_timer() - Scan activity bitmaps for ports with associated timed events
  * @c:		Execution context
- * @ts:		Timestamp from caller
+ * @now:	Current timestamp
  */
-void udp_timer(struct ctx *c, const struct timespec *ts)
+void udp_timer(struct ctx *c, const struct timespec *now)
 {
 	int n, t, v6 = 0;
 	unsigned int i;
@@ -1281,7 +1281,7 @@ v6:
 			tmp = *word;
 			while ((n = ffsl(tmp))) {
 				tmp &= ~(1UL << (n - 1));
-				udp_timer_one(c, v6, t, i * 8 + n - 1, ts);
+				udp_timer_one(c, v6, t, i * 8 + n - 1, now);
 			}
 		}
 	}
diff --git a/udp.h b/udp.h
index 85ebaaa..087e482 100644
--- a/udp.h
+++ b/udp.h
@@ -17,7 +17,7 @@ int udp_tap_handler(struct ctx *c, uint8_t pif, int af,
 int udp_sock_init(const struct ctx *c, int ns, sa_family_t af,
 		  const void *addr, const char *ifname, in_port_t port);
 int udp_init(struct ctx *c);
-void udp_timer(struct ctx *c, const struct timespec *ts);
+void udp_timer(struct ctx *c, const struct timespec *now);
 void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);
 
 /**
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 03/13] tcp, tcp_splice: Remove redundant handling from tcp_timer()
  2023-12-21  6:15 [PATCH v3 00/13] Manage more flow related things from generic flow code David Gibson
  2023-12-21  6:15 ` [PATCH v3 01/13] flow: Make flow_table.h #include the protocol specific headers it needs David Gibson
  2023-12-21  6:15 ` [PATCH v3 02/13] treewide: Standardise on 'now' for current timestamp variables David Gibson
@ 2023-12-21  6:15 ` David Gibson
  2023-12-21  6:15 ` [PATCH v3 04/13] tcp, tcp_splice: Move per-type cleanup logic into per-type helpers David Gibson
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 40+ messages in thread
From: David Gibson @ 2023-12-21  6:15 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

tcp_timer() scans the connection table, expiring "tap" connections and
calling tcp_splice_timer() for "splice" connections.  tcp_splice_timer()
expires spliced connections and then does some other processing.

However, tcp_timer() is always called shortly after tcp_defer_handler()
(from post_handler()), which also scans the flow table expiring both tap
and spliced connections.  So remove the redundant handling, and only do
the extra tcp_splice_timer() work from tcp_timer().
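
As an illustrative sketch of that ordering (simplified: the real
post_handler() dispatches through the CALL_PROTO_HANDLER macro):

    /* post_handler(), roughly */
    tcp_defer_handler(c);  /* expires closed tap and spliced connections */
    tcp_timer(c, now);     /* runs shortly after, so its own expiry scan
                            * was redundant */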

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c        | 15 ++-------------
 tcp_conn.h   |  2 +-
 tcp_splice.c |  7 ++-----
 3 files changed, 5 insertions(+), 19 deletions(-)

diff --git a/tcp.c b/tcp.c
index 1628896..85d1146 100644
--- a/tcp.c
+++ b/tcp.c
@@ -3200,20 +3200,9 @@ void tcp_timer(struct ctx *c, const struct timespec *now)
 		}
 	}
 
-	for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--) {
-		switch (flow->f.type) {
-		case FLOW_TCP:
-			if (flow->tcp.events == CLOSED)
-				tcp_conn_destroy(c, flow);
-			break;
-		case FLOW_TCP_SPLICE:
+	for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--)
+		if (flow->f.type == FLOW_TCP_SPLICE)
 			tcp_splice_timer(c, flow);
-			break;
-		default:
-			die("Unexpected %s in tcp_timer()",
-			    FLOW_TYPE(&flow->f));
-		}
-	}
 
 	tcp_sock_refill_init(c);
 	if (c->mode == MODE_PASTA)
diff --git a/tcp_conn.h b/tcp_conn.h
index e3400bb..e98559c 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -159,7 +159,7 @@ void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
 			 struct tcp_tap_conn *new);
 void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new);
 void tcp_splice_destroy(struct ctx *c, union flow *flow);
-void tcp_splice_timer(struct ctx *c, union flow *flow);
+void tcp_splice_timer(const struct ctx *c, union flow *flow);
 int tcp_conn_pool_sock(int pool[]);
 int tcp_conn_new_sock(const struct ctx *c, sa_family_t af);
 void tcp_sock_refill_pool(const struct ctx *c, int pool[], int af);
diff --git a/tcp_splice.c b/tcp_splice.c
index 052f989..cf9b4e8 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -755,15 +755,12 @@ void tcp_splice_init(struct ctx *c)
  * @c:		Execution context
  * @flow:	Flow table entry
  */
-void tcp_splice_timer(struct ctx *c, union flow *flow)
+void tcp_splice_timer(const struct ctx *c, union flow *flow)
 {
 	struct tcp_splice_conn *conn = &flow->tcp_splice;
 	int side;
 
-	if (conn->flags & CLOSING) {
-		tcp_splice_destroy(c, flow);
-		return;
-	}
+	ASSERT(!(conn->flags & CLOSING));
 
 	for (side = 0; side < SIDES; side++) {
 		uint8_t set = side == 0 ? RCVLOWAT_SET_0 : RCVLOWAT_SET_1;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 04/13] tcp, tcp_splice: Move per-type cleanup logic into per-type helpers
  2023-12-21  6:15 [PATCH v3 00/13] Manage more flow related things from generic flow code David Gibson
                   ` (2 preceding siblings ...)
  2023-12-21  6:15 ` [PATCH v3 03/13] tcp, tcp_splice: Remove redundant handling from tcp_timer() David Gibson
@ 2023-12-21  6:15 ` David Gibson
  2023-12-21  6:15 ` [PATCH v3 05/13] flow, tcp: Add flow-centric dispatch for deferred flow handling David Gibson
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 40+ messages in thread
From: David Gibson @ 2023-12-21  6:15 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

tcp_conn_destroy() and tcp_splice_destroy() are always called conditionally
on the connection being closed or closing.  Move that logic into the
"destroy" functions themselves, renaming them tcp_flow_defer() and
tcp_splice_flow_defer().
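
A hedged before/after sketch of the caller's view:

    /* before: every caller checks the connection state */
    if (flow->tcp.events == CLOSED)
        tcp_conn_destroy(c, flow);

    /* after: the helper checks, callers call unconditionally */
    tcp_flow_defer(c, flow);  /* returns early unless CLOSED */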

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c        | 13 +++++++------
 tcp_conn.h   |  2 +-
 tcp_splice.c |  9 ++++++---
 3 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/tcp.c b/tcp.c
index 85d1146..ad1a70d 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1302,14 +1302,17 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c,
 }
 
 /**
- * tcp_conn_destroy() - Close sockets, trigger hash table removal and compaction
+ * tcp_flow_defer() - Deferred per-flow handling (clean up closed connections)
  * @c:		Execution context
  * @flow:	Flow table entry for this connection
  */
-static void tcp_conn_destroy(struct ctx *c, union flow *flow)
+static void tcp_flow_defer(struct ctx *c, union flow *flow)
 {
 	const struct tcp_tap_conn *conn = &flow->tcp;
 
+	if (flow->tcp.events != CLOSED)
+		return;
+
 	close(conn->sock);
 	if (conn->timer != -1)
 		close(conn->timer);
@@ -1371,12 +1374,10 @@ void tcp_defer_handler(struct ctx *c)
 	for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--) {
 		switch (flow->f.type) {
 		case FLOW_TCP:
-			if (flow->tcp.events == CLOSED)
-				tcp_conn_destroy(c, flow);
+			tcp_flow_defer(c, flow);
 			break;
 		case FLOW_TCP_SPLICE:
-			if (flow->tcp_splice.flags & CLOSING)
-				tcp_splice_destroy(c, flow);
+			tcp_splice_flow_defer(c, flow);
 			break;
 		default:
 			die("Unexpected %s in tcp_defer_handler()",
diff --git a/tcp_conn.h b/tcp_conn.h
index e98559c..4846565 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -158,7 +158,7 @@ extern int init_sock_pool6	[TCP_SOCK_POOL_SIZE];
 void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
 			 struct tcp_tap_conn *new);
 void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new);
-void tcp_splice_destroy(struct ctx *c, union flow *flow);
+void tcp_splice_flow_defer(struct ctx *c, union flow *flow);
 void tcp_splice_timer(const struct ctx *c, union flow *flow);
 int tcp_conn_pool_sock(int pool[]);
 int tcp_conn_new_sock(const struct ctx *c, sa_family_t af);
diff --git a/tcp_splice.c b/tcp_splice.c
index cf9b4e8..09aa20f 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -243,15 +243,18 @@ void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new)
 }
 
 /**
- * tcp_splice_destroy() - Close spliced connection and pipes, clear
+ * tcp_splice_flow_defer() - Deferred per-flow handling (clean up closed)
  * @c:		Execution context
- * @flow:	Flow table entry
+ * @flow:	Flow table entry for this connection
  */
-void tcp_splice_destroy(struct ctx *c, union flow *flow)
+void tcp_splice_flow_defer(struct ctx *c, union flow *flow)
 {
 	struct tcp_splice_conn *conn = &flow->tcp_splice;
 	unsigned side;
 
+	if (!(flow->tcp_splice.flags & CLOSING))
+		return;
+
 	for (side = 0; side < SIDES; side++) {
 		if (conn->events & SPLICE_ESTABLISHED) {
 			/* Flushing might need to block: don't recycle them. */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 05/13] flow, tcp: Add flow-centric dispatch for deferred flow handling
  2023-12-21  6:15 [PATCH v3 00/13] Manage more flow related things from generic flow code David Gibson
                   ` (3 preceding siblings ...)
  2023-12-21  6:15 ` [PATCH v3 04/13] tcp, tcp_splice: Move per-type cleanup logic into per-type helpers David Gibson
@ 2023-12-21  6:15 ` David Gibson
  2023-12-28 18:24   ` Stefano Brivio
  2023-12-21  6:15 ` [PATCH v3 06/13] flow, tcp: Add handling for per-flow timers David Gibson
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: David Gibson @ 2023-12-21  6:15 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

tcp_defer_handler(), amongst other things, scans the flow table and does
some processing for each TCP connection.  When we add other protocols to
the flow table, they're likely to want some similar scanning.  It makes
more sense for cache friendliness to perform a single scan of the flow
table and dispatch to the protocol specific handlers, rather than having
each protocol separately scan the table.

To that end, add a new flow_defer_handler() handling all flow-linked
deferred operations.
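
To sketch how another protocol could hook in later (the FLOW_UDP case
below is hypothetical, not part of this series), dispatch becomes a
single switch per flow:

    switch (flow->f.type) {
    case FLOW_TCP:
        tcp_flow_defer(c, flow);
        break;
    case FLOW_TCP_SPLICE:
        tcp_splice_flow_defer(c, flow);
        break;
    /* case FLOW_UDP: udp_flow_defer(c, flow); break;  (hypothetical) */
    default:
        ;  /* other flow types don't need any handling */
    }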

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c     | 23 +++++++++++++++++++++++
 flow.h     |  1 +
 passt.c    |  1 +
 tcp.c      | 19 ++-----------------
 tcp_conn.h |  1 +
 5 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/flow.c b/flow.c
index a1c0a34..0a0402d 100644
--- a/flow.c
+++ b/flow.c
@@ -83,3 +83,26 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 
 	logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
 }
+
+/**
+ * flow_defer_handler() - Handler for per-flow deferred tasks
+ * @c:		Execution context
+ */
+void flow_defer_handler(struct ctx *c)
+{
+	union flow *flow;
+
+	for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--) {
+		switch (flow->f.type) {
+		case FLOW_TCP:
+			tcp_flow_defer(c, flow);
+			break;
+		case FLOW_TCP_SPLICE:
+			tcp_splice_flow_defer(c, flow);
+			break;
+		default:
+			/* Assume other flow types don't need any handling */
+			;
+		}
+	}
+}
diff --git a/flow.h b/flow.h
index 959b461..6b17fa8 100644
--- a/flow.h
+++ b/flow.h
@@ -67,6 +67,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
 union flow;
 
 void flow_table_compact(struct ctx *c, union flow *hole);
+void flow_defer_handler(struct ctx *c);
 
 void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 	__attribute__((format(printf, 3, 4)));
diff --git a/passt.c b/passt.c
index 0246b04..5f72a28 100644
--- a/passt.c
+++ b/passt.c
@@ -103,6 +103,7 @@ static void post_handler(struct ctx *c, const struct timespec *now)
 	/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
 	CALL_PROTO_HANDLER(c, now, icmp, ICMP);
 
+	flow_defer_handler(c);
 #undef CALL_PROTO_HANDLER
 }
 
diff --git a/tcp.c b/tcp.c
index ad1a70d..9230d80 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1306,7 +1306,7 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c,
  * @c:		Execution context
  * @flow:	Flow table entry for this connection
  */
-static void tcp_flow_defer(struct ctx *c, union flow *flow)
+void tcp_flow_defer(struct ctx *c, union flow *flow)
 {
 	const struct tcp_tap_conn *conn = &flow->tcp;
 
@@ -1364,26 +1364,11 @@ static void tcp_l2_data_buf_flush(const struct ctx *c)
  * tcp_defer_handler() - Handler for TCP deferred tasks
  * @c:		Execution context
  */
+/* cppcheck-suppress constParameterPointer */
 void tcp_defer_handler(struct ctx *c)
 {
-	union flow *flow;
-
 	tcp_l2_flags_buf_flush(c);
 	tcp_l2_data_buf_flush(c);
-
-	for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--) {
-		switch (flow->f.type) {
-		case FLOW_TCP:
-			tcp_flow_defer(c, flow);
-			break;
-		case FLOW_TCP_SPLICE:
-			tcp_splice_flow_defer(c, flow);
-			break;
-		default:
-			die("Unexpected %s in tcp_defer_handler()",
-			    FLOW_TYPE(&flow->f));
-		}
-	}
 }
 
 /**
diff --git a/tcp_conn.h b/tcp_conn.h
index 4846565..72b9ecb 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -158,6 +158,7 @@ extern int init_sock_pool6	[TCP_SOCK_POOL_SIZE];
 void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
 			 struct tcp_tap_conn *new);
 void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new);
+void tcp_flow_defer(struct ctx *c, union flow *flow);
 void tcp_splice_flow_defer(struct ctx *c, union flow *flow);
 void tcp_splice_timer(const struct ctx *c, union flow *flow);
 int tcp_conn_pool_sock(int pool[]);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 06/13] flow, tcp: Add handling for per-flow timers
  2023-12-21  6:15 [PATCH v3 00/13] Manage more flow related things from generic flow code David Gibson
                   ` (4 preceding siblings ...)
  2023-12-21  6:15 ` [PATCH v3 05/13] flow, tcp: Add flow-centric dispatch for deferred flow handling David Gibson
@ 2023-12-21  6:15 ` David Gibson
  2023-12-21  6:15 ` [PATCH v3 07/13] epoll: Better handling of number of epoll types David Gibson
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 40+ messages in thread
From: David Gibson @ 2023-12-21  6:15 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

tcp_timer() scans the flow table so that it can run tcp_splice_timer() on
each spliced connection.  More generally, other flow types might want to
run similar timers in future.

We could add a flow_timer() analogous to tcp_timer(), udp_timer() etc.
However, this would need to scan the flow table, which we would have just
done in flow_defer_handler().  We'd prefer to just scan the flow table
once, dispatching both per-flow deferred events and per-flow timed events
if necessary.

So, extend flow_defer_handler() to do this.  For now we use the same timer
interval for all flow types (1s).  We can make that more flexible in future
if we need to.
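
The rate limiting is just a last-run timestamp, as in this sketch
(mirroring the patch below; timespec_diff_ms() is assumed to return
the difference between two timestamps in milliseconds):

    if (timespec_diff_ms(now, &flow_timer_run) >= FLOW_TIMER_INTERVAL) {
        timer = true;        /* run timed tasks on this scan */
        flow_timer_run = *now;
    }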

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c  | 16 ++++++++++++++--
 flow.h  |  4 +++-
 passt.c |  7 ++++---
 tcp.c   |  6 ------
 4 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/flow.c b/flow.c
index 0a0402d..ef129db 100644
--- a/flow.c
+++ b/flow.c
@@ -27,6 +27,9 @@ static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES,
 /* Global Flow Table */
 union flow flowtab[FLOW_MAX];
 
+/* Last time the flow timers ran */
+static struct timespec flow_timer_run;
+
 /**
  * flow_table_compact() - Perform compaction on flow table
  * @c:		Execution context
@@ -85,13 +88,20 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 }
 
 /**
- * flow_defer_handler() - Handler for per-flow deferred tasks
+ * flow_defer_handler() - Handler for per-flow deferred and timed tasks
  * @c:		Execution context
+ * @now:	Current timestamp
  */
-void flow_defer_handler(struct ctx *c)
+void flow_defer_handler(struct ctx *c, const struct timespec *now)
 {
+	bool timer = false;
 	union flow *flow;
 
+	if (timespec_diff_ms(now, &flow_timer_run) >= FLOW_TIMER_INTERVAL) {
+		timer = true;
+		flow_timer_run = *now;
+	}
+
 	for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--) {
 		switch (flow->f.type) {
 		case FLOW_TCP:
@@ -99,6 +109,8 @@ void flow_defer_handler(struct ctx *c)
 			break;
 		case FLOW_TCP_SPLICE:
 			tcp_splice_flow_defer(c, flow);
+			if (timer)
+				tcp_splice_timer(c, flow);
 			break;
 		default:
 			/* Assume other flow types don't need any handling */
diff --git a/flow.h b/flow.h
index 6b17fa8..423e685 100644
--- a/flow.h
+++ b/flow.h
@@ -7,6 +7,8 @@
 #ifndef FLOW_H
 #define FLOW_H
 
+#define FLOW_TIMER_INTERVAL		1000	/* ms */
+
 /**
  * enum flow_type - Different types of packet flows we track
  */
@@ -67,7 +69,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
 union flow;
 
 void flow_table_compact(struct ctx *c, union flow *hole);
-void flow_defer_handler(struct ctx *c);
+void flow_defer_handler(struct ctx *c, const struct timespec *now);
 
 void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 	__attribute__((format(printf, 3, 4)));
diff --git a/passt.c b/passt.c
index 5f72a28..870064f 100644
--- a/passt.c
+++ b/passt.c
@@ -53,8 +53,9 @@
 
 #define EPOLL_EVENTS		8
 
-#define __TIMER_INTERVAL	MIN(TCP_TIMER_INTERVAL, UDP_TIMER_INTERVAL)
-#define TIMER_INTERVAL		MIN(__TIMER_INTERVAL, ICMP_TIMER_INTERVAL)
+#define TIMER_INTERVAL__	MIN(TCP_TIMER_INTERVAL, UDP_TIMER_INTERVAL)
+#define TIMER_INTERVAL_		MIN(TIMER_INTERVAL__, ICMP_TIMER_INTERVAL)
+#define TIMER_INTERVAL		MIN(TIMER_INTERVAL_, FLOW_TIMER_INTERVAL)
 
 char pkt_buf[PKT_BUF_BYTES]	__attribute__ ((aligned(PAGE_SIZE)));
 
@@ -103,7 +104,7 @@ static void post_handler(struct ctx *c, const struct timespec *now)
 	/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
 	CALL_PROTO_HANDLER(c, now, icmp, ICMP);
 
-	flow_defer_handler(c);
+	flow_defer_handler(c, now);
 #undef CALL_PROTO_HANDLER
 }
 
diff --git a/tcp.c b/tcp.c
index 9230d80..b936fce 100644
--- a/tcp.c
+++ b/tcp.c
@@ -3170,8 +3170,6 @@ static int tcp_port_rebind_outbound(void *arg)
  */
 void tcp_timer(struct ctx *c, const struct timespec *now)
 {
-	union flow *flow;
-
 	(void)now;
 
 	if (c->mode == MODE_PASTA) {
@@ -3186,10 +3184,6 @@ void tcp_timer(struct ctx *c, const struct timespec *now)
 		}
 	}
 
-	for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--)
-		if (flow->f.type == FLOW_TCP_SPLICE)
-			tcp_splice_timer(c, flow);
-
 	tcp_sock_refill_init(c);
 	if (c->mode == MODE_PASTA)
 		tcp_splice_refill(c);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 07/13] epoll: Better handling of number of epoll types
  2023-12-21  6:15 [PATCH v3 00/13] Manage more flow related things from generic flow code David Gibson
                   ` (5 preceding siblings ...)
  2023-12-21  6:15 ` [PATCH v3 06/13] flow, tcp: Add handling for per-flow timers David Gibson
@ 2023-12-21  6:15 ` David Gibson
  2023-12-21  6:15 ` [PATCH v3 08/13] tcp, tcp_splice: Avoid double layered dispatch for connected TCP sockets David Gibson
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 40+ messages in thread
From: David Gibson @ 2023-12-21  6:15 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

As we already did for flow types, use an "EPOLL_NUM_TYPES" instead of
EPOLL_TYPE_MAX, which is a little bit safer and clearer.  Add a static
assert on the size of the matching names array.
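
The pattern, sketched: the enum's final member doubles as the entry
count, and the static assert ties the names array to it at compile
time:

    enum epoll_type {
        /* ... */
        EPOLL_TYPE_TAP_LISTEN,

        EPOLL_NUM_TYPES,  /* not a real type: number of entries */
    };
    static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
                  "epoll_type_str[] doesn't match enum epoll_type");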

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 passt.c | 4 +++-
 passt.h | 4 ++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/passt.c b/passt.c
index 870064f..a37a2f4 100644
--- a/passt.c
+++ b/passt.c
@@ -59,7 +59,7 @@
 
 char pkt_buf[PKT_BUF_BYTES]	__attribute__ ((aligned(PAGE_SIZE)));
 
-char *epoll_type_str[EPOLL_TYPE_MAX + 1] = {
+char *epoll_type_str[] = {
 	[EPOLL_TYPE_TCP]	= "connected TCP socket",
 	[EPOLL_TYPE_TCP_LISTEN]	= "listening TCP socket",
 	[EPOLL_TYPE_TCP_TIMER]	= "TCP timer",
@@ -71,6 +71,8 @@ char *epoll_type_str[EPOLL_TYPE_MAX + 1] = {
 	[EPOLL_TYPE_TAP_PASST]	= "connected qemu socket",
 	[EPOLL_TYPE_TAP_LISTEN]	= "listening qemu socket",
 };
+static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
+	      "epoll_type_str[] doesn't match enum epoll_type");
 
 /**
  * post_handler() - Run periodic and deferred tasks for L4 protocol handlers
diff --git a/passt.h b/passt.h
index c74887a..f54023a 100644
--- a/passt.h
+++ b/passt.h
@@ -70,7 +70,7 @@ enum epoll_type {
 	/* socket listening for qemu socket connections */
 	EPOLL_TYPE_TAP_LISTEN,
 
-	EPOLL_TYPE_MAX = EPOLL_TYPE_TAP_LISTEN,
+	EPOLL_NUM_TYPES,
 };
 
 /**
@@ -115,7 +115,7 @@ extern char pkt_buf		[PKT_BUF_BYTES];
 
 extern char *epoll_type_str[];
 #define EPOLL_TYPE_STR(n)						\
-	(((uint8_t)(n) <= EPOLL_TYPE_MAX && epoll_type_str[(n)]) ?	\
+	(((uint8_t)(n) < EPOLL_NUM_TYPES && epoll_type_str[(n)]) ?	\
 	                                    epoll_type_str[(n)] : "?")
 
 #include <resolv.h>	/* For MAXNS below */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 08/13] tcp, tcp_splice: Avoid double layered dispatch for connected TCP sockets
  2023-12-21  6:15 [PATCH v3 00/13] Manage more flow related things from generic flow code David Gibson
                   ` (6 preceding siblings ...)
  2023-12-21  6:15 ` [PATCH v3 07/13] epoll: Better handling of number of epoll types David Gibson
@ 2023-12-21  6:15 ` David Gibson
  2023-12-21  6:15 ` [PATCH v3 09/13] flow: Move flow_log_() to near top of flow.c David Gibson
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 40+ messages in thread
From: David Gibson @ 2023-12-21  6:15 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Currently connected TCP sockets have the same epoll type, whether they're
for a "tap" connection or a spliced connection.  This means that
tcp_sock_handler() has to do a secondary check on the type of the
connection to call the right function.  We can avoid this by adding a new
epoll type and dispatching directly to the right thing.
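
A hedged before/after view of the dispatch path in the epoll loop:

    /* before: one epoll type; tcp_sock_handler() then switches again
     * on flow->f.type to find the right per-type handler */
    case EPOLL_TYPE_TCP:
        tcp_sock_handler(&c, ref, eventmask);
        break;

    /* after: the epoll type encodes the flow type, direct dispatch */
    case EPOLL_TYPE_TCP:
        tcp_sock_handler(&c, ref, eventmask);
        break;
    case EPOLL_TYPE_TCP_SPLICE:
        tcp_splice_sock_handler(&c, ref, eventmask);
        break;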

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 passt.c      |  8 ++++++--
 passt.h      |  2 ++
 tcp.c        | 36 ++++++++----------------------------
 tcp_splice.c | 16 +++++++++-------
 tcp_splice.h |  4 ++--
 5 files changed, 27 insertions(+), 39 deletions(-)

diff --git a/passt.c b/passt.c
index a37a2f4..71bea8f 100644
--- a/passt.c
+++ b/passt.c
@@ -50,6 +50,7 @@
 #include "pasta.h"
 #include "arch.h"
 #include "log.h"
+#include "tcp_splice.h"
 
 #define EPOLL_EVENTS		8
 
@@ -61,6 +62,7 @@ char pkt_buf[PKT_BUF_BYTES]	__attribute__ ((aligned(PAGE_SIZE)));
 
 char *epoll_type_str[] = {
 	[EPOLL_TYPE_TCP]	= "connected TCP socket",
+	[EPOLL_TYPE_TCP_SPLICE]	= "connected spliced TCP socket",
 	[EPOLL_TYPE_TCP_LISTEN]	= "listening TCP socket",
 	[EPOLL_TYPE_TCP_TIMER]	= "TCP timer",
 	[EPOLL_TYPE_UDP]	= "UDP socket",
@@ -373,8 +375,10 @@ loop:
 			pasta_netns_quit_handler(&c, quit_fd);
 			break;
 		case EPOLL_TYPE_TCP:
-			if (!c.no_tcp)
-				tcp_sock_handler(&c, ref, eventmask);
+			tcp_sock_handler(&c, ref, eventmask);
+			break;
+		case EPOLL_TYPE_TCP_SPLICE:
+			tcp_splice_sock_handler(&c, ref, eventmask);
 			break;
 		case EPOLL_TYPE_TCP_LISTEN:
 			tcp_listen_handler(&c, ref, &now);
diff --git a/passt.h b/passt.h
index f54023a..82b0fcf 100644
--- a/passt.h
+++ b/passt.h
@@ -51,6 +51,8 @@ enum epoll_type {
 	EPOLL_TYPE_NONE = 0,
 	/* Connected TCP sockets */
 	EPOLL_TYPE_TCP,
+	/* Connected TCP sockets (spliced) */
+	EPOLL_TYPE_TCP_SPLICE,
 	/* Listening TCP sockets */
 	EPOLL_TYPE_TCP_LISTEN,
 	/* timerfds used for TCP timers */
diff --git a/tcp.c b/tcp.c
index b936fce..f28628a 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2803,14 +2803,18 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
 }
 
 /**
- * tcp_tap_sock_handler() - Handle new data from non-spliced socket
+ * tcp_sock_handler() - Handle new data from non-spliced socket
  * @c:		Execution context
- * @conn:	Connection state
+ * @ref:	epoll reference
  * @events:	epoll events bitmap
  */
-static void tcp_tap_sock_handler(struct ctx *c, struct tcp_tap_conn *conn,
-				 uint32_t events)
+void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
 {
+	struct tcp_tap_conn *conn = CONN(ref.flowside.flow);
+
+	ASSERT(conn->f.type == FLOW_TCP);
+	ASSERT(ref.flowside.side == SOCKSIDE);
+
 	if (conn->events == CLOSED)
 		return;
 
@@ -2857,30 +2861,6 @@ static void tcp_tap_sock_handler(struct ctx *c, struct tcp_tap_conn *conn,
 	}
 }
 
-/**
- * tcp_sock_handler() - Handle new data from socket, or timerfd event
- * @c:		Execution context
- * @ref:	epoll reference
- * @events:	epoll events bitmap
- */
-void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
-{
-	union flow *flow = FLOW(ref.flowside.flow);
-
-	switch (flow->f.type) {
-	case FLOW_TCP:
-		tcp_tap_sock_handler(c, &flow->tcp, events);
-		break;
-	case FLOW_TCP_SPLICE:
-		tcp_splice_sock_handler(c, &flow->tcp_splice, ref.flowside.side,
-					events);
-		break;
-	default:
-		die("Unexpected %s in tcp_sock_handler_compact()",
-		    FLOW_TYPE(&flow->f));
-	}
-}
-
 /**
  * tcp_sock_init_af() - Initialise listening socket for a given af and port
  * @c:		Execution context
diff --git a/tcp_splice.c b/tcp_splice.c
index 09aa20f..9b1d331 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -127,9 +127,9 @@ static int tcp_splice_epoll_ctl(const struct ctx *c,
 {
 	int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
 	union epoll_ref ref[SIDES] = {
-		{ .type = EPOLL_TYPE_TCP, .fd = conn->s[0],
+		{ .type = EPOLL_TYPE_TCP_SPLICE, .fd = conn->s[0],
 		  .flowside = FLOW_SIDX(conn, 0) },
-		{ .type = EPOLL_TYPE_TCP, .fd = conn->s[1],
+		{ .type = EPOLL_TYPE_TCP_SPLICE, .fd = conn->s[1],
 		  .flowside = FLOW_SIDX(conn, 1) }
 	};
 	struct epoll_event ev[SIDES] = { { .data.u64 = ref[0].u64 },
@@ -484,18 +484,20 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
 /**
  * tcp_splice_sock_handler() - Handler for socket mapped to spliced connection
  * @c:		Execution context
- * @conn:	Connection state
- * @side:	Side of the connection on which an event has occurred
+ * @ref:	epoll reference
  * @events:	epoll events bitmap
  *
  * #syscalls:pasta splice
  */
-void tcp_splice_sock_handler(struct ctx *c, struct tcp_splice_conn *conn,
-			     int side, uint32_t events)
+void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
+			     uint32_t events)
 {
+	struct tcp_splice_conn *conn = CONN(ref.flowside.flow);
+	unsigned side = ref.flowside.side, fromside;
 	uint8_t lowat_set_flag, lowat_act_flag;
 	int eof, never_read;
-	unsigned fromside;
+
+	ASSERT(conn->f.type == FLOW_TCP_SPLICE);
 
 	if (conn->events == SPLICE_CLOSED)
 		return;
diff --git a/tcp_splice.h b/tcp_splice.h
index aa85c7c..18193e4 100644
--- a/tcp_splice.h
+++ b/tcp_splice.h
@@ -8,8 +8,8 @@
 
 struct tcp_splice_conn;
 
-void tcp_splice_sock_handler(struct ctx *c, struct tcp_splice_conn *conn,
-			     int side, uint32_t events);
+void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
+			     uint32_t events);
 bool tcp_splice_conn_from_sock(const struct ctx *c,
 			       union tcp_listen_epoll_ref ref,
 			       struct tcp_splice_conn *conn, int s,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 09/13] flow: Move flow_log_() to near top of flow.c
  2023-12-21  6:15 [PATCH v3 00/13] Manage more flow related things from generic flow code David Gibson
                   ` (7 preceding siblings ...)
  2023-12-21  6:15 ` [PATCH v3 08/13] tcp, tcp_splice: Avoid double layered dispatch for connected TCP sockets David Gibson
@ 2023-12-21  6:15 ` David Gibson
  2023-12-21  6:15 ` [PATCH v3 10/13] flow: Move flow_count from context structure to a global David Gibson
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 40+ messages in thread
From: David Gibson @ 2023-12-21  6:15 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

flow_log_() is a very basic, widely used function that many other
functions in flow.c will end up needing.  At present it's below
flow_table_compact(), which happens not to need it, but that's likely
to change.  Move it near the top of flow.c to avoid forward
declarations.

Code motion only, no changes.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c | 36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/flow.c b/flow.c
index ef129db..79c6ae6 100644
--- a/flow.c
+++ b/flow.c
@@ -30,6 +30,24 @@ union flow flowtab[FLOW_MAX];
 /* Last time the flow timers ran */
 static struct timespec flow_timer_run;
 
+/** flow_log_ - Log flow-related message
+ * @f:		flow the message is related to
+ * @pri:	Log priority
+ * @fmt:	Format string
+ * @...:	printf-arguments
+ */
+void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
+{
+	char msg[BUFSIZ];
+	va_list args;
+
+	va_start(args, fmt);
+	(void)vsnprintf(msg, sizeof(msg), fmt, args);
+	va_end(args);
+
+	logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
+}
+
 /**
  * flow_table_compact() - Perform compaction on flow table
  * @c:		Execution context
@@ -69,24 +87,6 @@ void flow_table_compact(struct ctx *c, union flow *hole)
 	memset(from, 0, sizeof(*from));
 }
 
-/** flow_log_ - Log flow-related message
- * @f:		flow the message is related to
- * @pri:	Log priority
- * @fmt:	Format string
- * @...:	printf-arguments
- */
-void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
-{
-	char msg[BUFSIZ];
-	va_list args;
-
-	va_start(args, fmt);
-	(void)vsnprintf(msg, sizeof(msg), fmt, args);
-	va_end(args);
-
-	logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
-}
-
 /**
  * flow_defer_handler() - Handler for per-flow deferred and timed tasks
  * @c:		Execution context
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 10/13] flow: Move flow_count from context structure to a global
  2023-12-21  6:15 [PATCH v3 00/13] Manage more flow related things from generic flow code David Gibson
                   ` (8 preceding siblings ...)
  2023-12-21  6:15 ` [PATCH v3 09/13] flow: Move flow_log_() to near top of flow.c David Gibson
@ 2023-12-21  6:15 ` David Gibson
  2023-12-28 18:25   ` Stefano Brivio
  2023-12-21  6:15 ` [PATCH v3 11/13] flow: Abstract allocation of new flows with helper function David Gibson
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: David Gibson @ 2023-12-21  6:15 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

In general, the passt code is a bit haphazard about what's a true global
variable and what's in the quasi-global 'context structure'.  The
flow_count field is one such example: it's in the context structure,
although it's really part of the same data structure as flowtab[], which
is a genuine global.

Move flow_count to be a regular global to match.  For now it needs to be
public, rather than static, but we expect to be able to change that in
future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c       | 11 ++++++-----
 flow.h       |  4 ++--
 flow_table.h |  1 +
 passt.h      |  3 ---
 tcp.c        | 10 +++++-----
 tcp_conn.h   |  4 ++--
 tcp_splice.c |  2 +-
 7 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/flow.c b/flow.c
index 79c6ae6..294412a 100644
--- a/flow.c
+++ b/flow.c
@@ -25,6 +25,7 @@ static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES,
 	      "flow_type_str[] doesn't match enum flow_type");
 
 /* Global Flow Table */
+unsigned flow_count;
 union flow flowtab[FLOW_MAX];
 
 /* Last time the flow timers ran */
@@ -53,18 +54,18 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
  * @c:		Execution context
  * @hole:	Pointer to recently closed flow
  */
-void flow_table_compact(struct ctx *c, union flow *hole)
+void flow_table_compact(const struct ctx *c, union flow *hole)
 {
 	union flow *from;
 
-	if (FLOW_IDX(hole) == --c->flow_count) {
+	if (FLOW_IDX(hole) == --flow_count) {
 		debug("flow: table compaction: maximum index was %u (%p)",
 		      FLOW_IDX(hole), (void *)hole);
 		memset(hole, 0, sizeof(*hole));
 		return;
 	}
 
-	from = flowtab + c->flow_count;
+	from = flowtab + flow_count;
 	memcpy(hole, from, sizeof(*hole));
 
 	switch (from->f.type) {
@@ -92,7 +93,7 @@ void flow_table_compact(struct ctx *c, union flow *hole)
  * @c:		Execution context
  * @now:	Current timestamp
  */
-void flow_defer_handler(struct ctx *c, const struct timespec *now)
+void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 {
 	bool timer = false;
 	union flow *flow;
@@ -102,7 +103,7 @@ void flow_defer_handler(struct ctx *c, const struct timespec *now)
 		flow_timer_run = *now;
 	}
 
-	for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--) {
+	for (flow = flowtab + flow_count - 1; flow >= flowtab; flow--) {
 		switch (flow->f.type) {
 		case FLOW_TCP:
 			tcp_flow_defer(c, flow);
diff --git a/flow.h b/flow.h
index 423e685..44058bf 100644
--- a/flow.h
+++ b/flow.h
@@ -68,8 +68,8 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
 
 union flow;
 
-void flow_table_compact(struct ctx *c, union flow *hole);
-void flow_defer_handler(struct ctx *c, const struct timespec *now);
+void flow_table_compact(const struct ctx *c, union flow *hole);
+void flow_defer_handler(const struct ctx *c, const struct timespec *now);
 
 void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 	__attribute__((format(printf, 3, 4)));
diff --git a/flow_table.h b/flow_table.h
index e805f10..4aa2398 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -22,6 +22,7 @@ union flow {
 };
 
 /* Global Flow Table */
+extern unsigned flow_count;
 extern union flow flowtab[];
 
 
diff --git a/passt.h b/passt.h
index 82b0fcf..a9e8f15 100644
--- a/passt.h
+++ b/passt.h
@@ -224,7 +224,6 @@ struct ip6_ctx {
  * @pasta_conf_ns:	Configure namespace after creating it
  * @no_copy_routes:	Don't copy all routes when configuring target namespace
  * @no_copy_addrs:	Don't copy all addresses when configuring namespace
- * @flow_count:		Number of tracked packet flows (connections etc.)
  * @no_tcp:		Disable TCP operation
  * @tcp:		Context for TCP protocol handler
  * @no_tcp:		Disable UDP operation
@@ -284,8 +283,6 @@ struct ctx {
 	int no_copy_routes;
 	int no_copy_addrs;
 
-	unsigned flow_count;
-
 	int no_tcp;
 	struct tcp_ctx tcp;
 	int no_udp;
diff --git a/tcp.c b/tcp.c
index f28628a..6aefd0b 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1306,7 +1306,7 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c,
  * @c:		Execution context
  * @flow:	Flow table entry for this connection
  */
-void tcp_flow_defer(struct ctx *c, union flow *flow)
+void tcp_flow_defer(const struct ctx *c, union flow *flow)
 {
 	const struct tcp_tap_conn *conn = &flow->tcp;
 
@@ -1948,7 +1948,7 @@ static void tcp_conn_from_tap(struct ctx *c,
 
 	(void)saddr;
 
-	if (c->flow_count >= FLOW_MAX)
+	if (flow_count >= FLOW_MAX)
 		return;
 
 	if ((s = tcp_conn_pool_sock(pool)) < 0)
@@ -1974,7 +1974,7 @@ static void tcp_conn_from_tap(struct ctx *c,
 		}
 	}
 
-	conn = CONN(c->flow_count++);
+	conn = CONN(flow_count++);
 	conn->f.type = FLOW_TCP;
 	conn->sock = s;
 	conn->timer = -1;
@@ -2723,14 +2723,14 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
 	union flow *flow;
 	int s;
 
-	if (c->no_tcp || c->flow_count >= FLOW_MAX)
+	if (c->no_tcp || flow_count >= FLOW_MAX)
 		return;
 
 	s = accept4(ref.fd, (struct sockaddr *)&sa, &sl, SOCK_NONBLOCK);
 	if (s < 0)
 		return;
 
-	flow = flowtab + c->flow_count++;
+	flow = flowtab + flow_count++;
 
 	if (c->mode == MODE_PASTA &&
 	    tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice,
diff --git a/tcp_conn.h b/tcp_conn.h
index 72b9ecb..825155a 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -158,8 +158,8 @@ extern int init_sock_pool6	[TCP_SOCK_POOL_SIZE];
 void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
 			 struct tcp_tap_conn *new);
 void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new);
-void tcp_flow_defer(struct ctx *c, union flow *flow);
-void tcp_splice_flow_defer(struct ctx *c, union flow *flow);
+void tcp_flow_defer(const struct ctx *c, union flow *flow);
+void tcp_splice_flow_defer(const struct ctx *c, union flow *flow);
 void tcp_splice_timer(const struct ctx *c, union flow *flow);
 int tcp_conn_pool_sock(int pool[]);
 int tcp_conn_new_sock(const struct ctx *c, sa_family_t af);
diff --git a/tcp_splice.c b/tcp_splice.c
index 9b1d331..18cd2e6 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -247,7 +247,7 @@ void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new)
  * @c:		Execution context
  * @flow:	Flow table entry for this connection
  */
-void tcp_splice_flow_defer(struct ctx *c, union flow *flow)
+void tcp_splice_flow_defer(const struct ctx *c, union flow *flow)
 {
 	struct tcp_splice_conn *conn = &flow->tcp_splice;
 	unsigned side;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 11/13] flow: Abstract allocation of new flows with helper function
  2023-12-21  6:15 [PATCH v3 00/13] Manage more flow related things from generic flow code David Gibson
                   ` (9 preceding siblings ...)
  2023-12-21  6:15 ` [PATCH v3 10/13] flow: Move flow_count from context structure to a global David Gibson
@ 2023-12-21  6:15 ` David Gibson
  2023-12-21  6:15 ` [PATCH v3 12/13] flow: Enforce that freeing of closed flows must happen in deferred handlers David Gibson
  2023-12-21  6:15 ` [PATCH v3 13/13] flow: Avoid moving flow entries to compact table David Gibson
  12 siblings, 0 replies; 40+ messages in thread
From: David Gibson @ 2023-12-21  6:15 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Currently tcp.c open codes the process of allocating a new flow from the
flow table: twice, in fact, once for guest to host and once for host to
guest connections.  This duplication isn't ideal and will get worse as we
add more protocols to the flow table.  It also makes it harder to
experiment with different ways of handling flow table allocation.

Instead, introduce a function to allocate a new flow: flow_alloc().  In
some cases we currently check if we're able to allocate, but delay the
actual allocation.  We now handle that slightly differently with a
flow_alloc_cancel() function to back out a recent allocation.  We have that
separate from a flow_free() function, because future changes we have in
mind will need to handle this case a little differently.
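
For illustration, the intended calling pattern looks roughly like this
(a sketch only: setup_conn() stands in for the protocol specific setup
steps, it's not a function from this series):

	union flow *flow;

	if (!(flow = flow_alloc()))		/* Table is full */
		return;

	if (setup_conn(flow) < 0) {
		/* Nothing else (hash table, epoll) references the flow
		 * yet, and it's the most recent allocation, so we can
		 * cheaply back it out instead of doing a full free.
		 */
		flow_alloc_cancel(flow);
		return;
	}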

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c       | 26 ++++++++++++++++++++++++++
 flow_table.h |  3 +++
 tcp.c        | 29 ++++++++++++++++++-----------
 3 files changed, 47 insertions(+), 11 deletions(-)

diff --git a/flow.c b/flow.c
index 294412a..8fbe512 100644
--- a/flow.c
+++ b/flow.c
@@ -49,6 +49,32 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 	logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
 }
 
+/**
+ * flow_alloc() - Allocate a new flow
+ *
+ * Return: pointer to an unused flow entry, or NULL if the table is full
+ */
+union flow *flow_alloc(void)
+{
+	if (flow_count >= FLOW_MAX)
+		return NULL;
+
+	return &flowtab[flow_count++];
+}
+
+/**
+ * flow_alloc_cancel() - Free a newly allocated flow
+ * @flow:	Flow to deallocate
+ *
+ * @flow must be the last flow allocated by flow_alloc()
+ */
+void flow_alloc_cancel(union flow *flow)
+{
+	ASSERT(FLOW_IDX(flow) == flow_count - 1);
+	memset(flow, 0, sizeof(*flow));
+	flow_count--;
+}
+
 /**
  * flow_table_compact() - Perform compaction on flow table
  * @c:		Execution context
diff --git a/flow_table.h b/flow_table.h
index 4aa2398..2773a2b 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -88,4 +88,7 @@ static inline flow_sidx_t flow_sidx(const struct flow_common *f,
  */
 #define FLOW_SIDX(f_, side)	(flow_sidx(&(f_)->f, (side)))
 
+union flow *flow_alloc(void);
+void flow_alloc_cancel(union flow *flow);
+
 #endif /* FLOW_TABLE_H */
diff --git a/tcp.c b/tcp.c
index 6aefd0b..6477203 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1943,17 +1943,18 @@ static void tcp_conn_from_tap(struct ctx *c,
 	};
 	const struct sockaddr *sa;
 	struct tcp_tap_conn *conn;
+	union flow *flow;
 	socklen_t sl;
 	int s, mss;
 
 	(void)saddr;
 
-	if (flow_count >= FLOW_MAX)
+	if (!(flow = flow_alloc()))
 		return;
 
 	if ((s = tcp_conn_pool_sock(pool)) < 0)
 		if ((s = tcp_conn_new_sock(c, af)) < 0)
-			return;
+			goto cancel;
 
 	if (!c->no_map_gw) {
 		if (af == AF_INET && IN4_ARE_ADDR_EQUAL(daddr, &c->ip4.gw))
@@ -1968,13 +1969,11 @@ static void tcp_conn_from_tap(struct ctx *c,
 			.sin6_addr = c->ip6.addr_ll,
 			.sin6_scope_id = c->ifi6,
 		};
-		if (bind(s, (struct sockaddr *)&addr6_ll, sizeof(addr6_ll))) {
-			close(s);
-			return;
-		}
+		if (bind(s, (struct sockaddr *)&addr6_ll, sizeof(addr6_ll)))
+			goto cancel;
 	}
 
-	conn = CONN(flow_count++);
+	conn = &flow->tcp;
 	conn->f.type = FLOW_TCP;
 	conn->sock = s;
 	conn->timer = -1;
@@ -2046,6 +2045,12 @@ static void tcp_conn_from_tap(struct ctx *c,
 	}
 
 	tcp_epoll_ctl(c, conn);
+	return;
+
+cancel:
+	if (s >= 0)
+		close(s);
+	flow_alloc_cancel(flow);
 }
 
 /**
@@ -2723,14 +2728,12 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
 	union flow *flow;
 	int s;
 
-	if (c->no_tcp || flow_count >= FLOW_MAX)
+	if (c->no_tcp || !(flow = flow_alloc()))
 		return;
 
 	s = accept4(ref.fd, (struct sockaddr *)&sa, &sl, SOCK_NONBLOCK);
 	if (s < 0)
-		return;
-
-	flow = flowtab + flow_count++;
+		goto cancel;
 
 	if (c->mode == MODE_PASTA &&
 	    tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice,
@@ -2739,6 +2742,10 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
 
 	tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s,
 			       (struct sockaddr *)&sa, now);
+	return;
+
+cancel:
+	flow_alloc_cancel(flow);
 }
 
 /**
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 12/13] flow: Enforce that freeing of closed flows must happen in deferred handlers
  2023-12-21  6:15 [PATCH v3 00/13] Manage more flow related things from generic flow code David Gibson
                   ` (10 preceding siblings ...)
  2023-12-21  6:15 ` [PATCH v3 11/13] flow: Abstract allocation of new flows with helper function David Gibson
@ 2023-12-21  6:15 ` David Gibson
  2023-12-21  6:15 ` [PATCH v3 13/13] flow: Avoid moving flow entries to compact table David Gibson
  12 siblings, 0 replies; 40+ messages in thread
From: David Gibson @ 2023-12-21  6:15 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Currently, flows are only ever finally freed (and the table compacted)
from the deferred handlers.  Some future ways we want to optimise managing
the flow table will rely on this, so enforce it: rather than having the
TCP code directly call flow_table_compact(), add a boolean return value to
the per-flow deferred handlers.  If true, this indicates that the flow
code itself should free the flow.

This forces all freeing of flows to occur during the flow code's scan of
the table in flow_defer_handler() which opens possibilities for future
optimisations.
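
In prototype form, the contract for a per-flow deferred handler becomes
the following (proto_flow_defer() is a generic stand-in here, not a
function added by this patch):

	/* Clean up @flow if it's closed.  Return true if the entry is now
	 * unused, telling the generic flow code to free its slot; handlers
	 * must no longer call flow_table_compact() themselves.
	 */
	bool proto_flow_defer(union flow *flow);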

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c       | 13 +++++++++----
 flow.h       |  1 -
 tcp.c        |  9 +++++----
 tcp_conn.h   |  4 ++--
 tcp_splice.c |  9 +++++----
 5 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/flow.c b/flow.c
index 8fbe512..7b99b58 100644
--- a/flow.c
+++ b/flow.c
@@ -80,7 +80,7 @@ void flow_alloc_cancel(union flow *flow)
  * @c:		Execution context
  * @hole:	Pointer to recently closed flow
  */
-void flow_table_compact(const struct ctx *c, union flow *hole)
+static void flow_table_compact(const struct ctx *c, union flow *hole)
 {
 	union flow *from;
 
@@ -130,18 +130,23 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 	}
 
 	for (flow = flowtab + flow_count - 1; flow >= flowtab; flow--) {
+		bool closed = false;
+
 		switch (flow->f.type) {
 		case FLOW_TCP:
-			tcp_flow_defer(c, flow);
+			closed = tcp_flow_defer(flow);
 			break;
 		case FLOW_TCP_SPLICE:
-			tcp_splice_flow_defer(c, flow);
-			if (timer)
+			closed = tcp_splice_flow_defer(flow);
+			if (!closed && timer)
 				tcp_splice_timer(c, flow);
 			break;
 		default:
 			/* Assume other flow types don't need any handling */
 			;
 		}
+
+		if (closed)
+			flow_table_compact(c, flow);
 	}
 }
diff --git a/flow.h b/flow.h
index 44058bf..8064f0e 100644
--- a/flow.h
+++ b/flow.h
@@ -68,7 +68,6 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
 
 union flow;
 
-void flow_table_compact(const struct ctx *c, union flow *hole);
 void flow_defer_handler(const struct ctx *c, const struct timespec *now);
 
 void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
diff --git a/tcp.c b/tcp.c
index 6477203..a38deb0 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1303,21 +1303,22 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c,
 
 /**
  * tcp_flow_defer() - Deferred per-flow handling (clean up closed connections)
- * @c:		Execution context
  * @flow:	Flow table entry for this connection
+ *
+ * Return: true if the flow is ready to free, false otherwise
  */
-void tcp_flow_defer(const struct ctx *c, union flow *flow)
+bool tcp_flow_defer(union flow *flow)
 {
 	const struct tcp_tap_conn *conn = &flow->tcp;
 
 	if (flow->tcp.events != CLOSED)
-		return;
+		return false;
 
 	close(conn->sock);
 	if (conn->timer != -1)
 		close(conn->timer);
 
-	flow_table_compact(c, flow);
+	return true;
 }
 
 static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
diff --git a/tcp_conn.h b/tcp_conn.h
index 825155a..636224e 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -158,8 +158,8 @@ extern int init_sock_pool6	[TCP_SOCK_POOL_SIZE];
 void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
 			 struct tcp_tap_conn *new);
 void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new);
-void tcp_flow_defer(const struct ctx *c, union flow *flow);
-void tcp_splice_flow_defer(const struct ctx *c, union flow *flow);
+bool tcp_flow_defer(union flow *flow);
+bool tcp_splice_flow_defer(union flow *flow);
 void tcp_splice_timer(const struct ctx *c, union flow *flow);
 int tcp_conn_pool_sock(int pool[]);
 int tcp_conn_new_sock(const struct ctx *c, sa_family_t af);
diff --git a/tcp_splice.c b/tcp_splice.c
index 18cd2e6..0711b19 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -244,16 +244,17 @@ void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new)
 
 /**
  * tcp_splice_flow_defer() - Deferred per-flow handling (clean up closed)
- * @c:		Execution context
  * @flow:	Flow table entry for this connection
+ *
+ * Return: true if the flow is ready to free, false otherwise
  */
-void tcp_splice_flow_defer(const struct ctx *c, union flow *flow)
+bool tcp_splice_flow_defer(union flow *flow)
 {
 	struct tcp_splice_conn *conn = &flow->tcp_splice;
 	unsigned side;
 
 	if (!(flow->tcp_splice.flags & CLOSING))
-		return;
+		return false;
 
 	for (side = 0; side < SIDES; side++) {
 		if (conn->events & SPLICE_ESTABLISHED) {
@@ -277,7 +278,7 @@ void tcp_splice_flow_defer(const struct ctx *c, union flow *flow)
 	conn->flags = 0;
 	flow_dbg(conn, "CLOSED");
 
-	flow_table_compact(c, flow);
+	return true;
 }
 
 /**
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 13/13] flow: Avoid moving flow entries to compact table
  2023-12-21  6:15 [PATCH v3 00/13] Manage more flow related things from generic flow code David Gibson
                   ` (11 preceding siblings ...)
  2023-12-21  6:15 ` [PATCH v3 12/13] flow: Enforce that freeing of closed flows must happen in deferred handlers David Gibson
@ 2023-12-21  6:15 ` David Gibson
  2023-12-28 18:25   ` Stefano Brivio
  12 siblings, 1 reply; 40+ messages in thread
From: David Gibson @ 2023-12-21  6:15 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Currently we always keep the flow table maximally compact: that is all the
active entries are contiguous at the start of the table.  Doing this
sometimes requires moving an entry when one is freed.  That's kind of
fiddly, and potentially expensive: it requires updating the hash table for
the new location, and depending on flow type, it may require EPOLL_CTL_MOD
system calls to update epoll tags with the new location too.

Implement a new way of managing the flow table that doesn't ever move
entries.  It attempts to maintain some compactness by always using the
first free slot for a new connection, and mitigates the effect of non
compactness by cheaply skipping over contiguous blocks of free entries.
See the "theory of operation" comment in flow.c for details.
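
As a worked example (the indices and a FLOW_MAX of 8 are made up for
illustration), a table in which slots 2..4 and 7 are free looks like
this:

	flow_first_free = 2

	flowtab[0]: active flow
	flowtab[1]: active flow
	flowtab[2]: FLOW_TYPE_NONE, free.n = 3, free.next = 7
	flowtab[3]: FLOW_TYPE_NONE (interior of free block)
	flowtab[4]: FLOW_TYPE_NONE (interior of free block)
	flowtab[5]: active flow
	flowtab[6]: active flow
	flowtab[7]: FLOW_TYPE_NONE, free.n = 1, free.next = 8 (FLOW_MAX)

flow_alloc() hands out slot 2 and makes slot 3 the new block head
(free.n = 2, free.next still 7); a scan visits slots 0 and 1, skips
from 2 past the rest of the block using free.n, visits 5 and 6, then
skips 7 the same way.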

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c       | 177 +++++++++++++++++++++++++++++++++++++--------------
 flow.h       |   1 +
 flow_table.h |  16 ++++-
 passt.c      |   2 +
 tcp.c        |  23 -------
 tcp_conn.h   |   3 -
 tcp_splice.c |  11 ----
 7 files changed, 146 insertions(+), 87 deletions(-)

diff --git a/flow.c b/flow.c
index 7b99b58..421e6b5 100644
--- a/flow.c
+++ b/flow.c
@@ -25,7 +25,7 @@ static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES,
 	      "flow_type_str[] doesn't match enum flow_type");
 
 /* Global Flow Table */
-unsigned flow_count;
+unsigned flow_first_free;
 union flow flowtab[FLOW_MAX];
 
 /* Last time the flow timers ran */
@@ -49,6 +49,52 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 	logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
 }
 
+/**
+ * DOC: Theory of Operation - allocation and freeing of flow entries
+ *
+ * Each flow takes a single slot in flowtab[].  Moving entries in that table
+ * (which we used to do) is fiddly and possibly expensive: it requires updating
+ * the hash table indexing flows, and may require updating epoll data which
+ * references the flow by index.  However, we also want to keep the active
+ * entries in the table compact where possible, because otherwise scanning
+ * through the entire table becomes more expensive.  This describes the
+ * compromise implemented below.
+ *
+ * Free blocks
+ *    A "free block" is a contiguous sequence of unused (FLOW_TYPE_NONE) entries
+ *    in flowtab.  The first entry in each block contains metadata, specifically
+ *    the number of entries in the block, and the index of the next (non
+ *    contiguous) free block (struct flow_free_block).
+ *
+ * Free block list
+ *    flow_first_free gives the index of the first entry of the first (lowest
+ *    index) free block.  Each free block has the index of the next free block,
+ *    or MAX_FLOW if it is the last free block.  Together these form a linked
+ *    list of free blocks, in strictly increasing order of index.
+ *
+ * Allocation
+ *    When allocating a new flow, we always use the first entry of the first
+ *    free block, that is, at index flow_first_free.  If the block has more than
+ *    one entry, flow_first_free is updated to the next entry, which is updated
+ *    to represent the new smaller free block.  Otherwise the free block is
+ *    eliminated and flow_first_free is updated to the next free block.
+ *
+ * Scanning the table
+ *    Theoretically, scanning the table requires FLOW_MAX iterations.  However,
+ *    when we encounter the start of a free block, we can immediately skip to
+ *    its end, meaning that in practice we only need (number of active
+ *    connections) + (number of free blocks) iterations.
+ *
+ * Freeing
+ *    We can only free entries when scanning the whole flow table in
+ *    flow_defer_handler().  This is what lets us maintain the fee block list in
+ *    index sorted order.  As we scan we keep track of whether the previous
+ *    entry was in a free block or not.  If so when an entry is freed (its
+ *    deferred handler returns 'true'), we add it to that free block.  Otherwise
+ *    we create a new free block for it and chain it to the last free block we
+ *    scanned.
+ */
+
 /**
  * flow_alloc() - Allocate a new flow
  *
@@ -56,10 +102,31 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
  */
 union flow *flow_alloc(void)
 {
-	if (flow_count >= FLOW_MAX)
+	union flow *flow = &flowtab[flow_first_free];
+
+	if (flow_first_free >= FLOW_MAX)
 		return NULL;
 
-	return &flowtab[flow_count++];
+	ASSERT(flow->f.type == FLOW_TYPE_NONE);
+	ASSERT(flow->free.n >= 1);
+
+	if (flow->free.n > 1) {
+		/* Use one entry from the block */
+		union flow *next = &flowtab[++flow_first_free];
+
+		ASSERT(FLOW_IDX(next) < FLOW_MAX);
+		ASSERT(next->f.type == FLOW_TYPE_NONE);
+		ASSERT(next->free.n == 0);
+
+		next->free.n = flow->free.n - 1;
+		next->free.next = flow->free.next;
+	} else {
+		/* Use the entire block */
+		flow_first_free = flow->free.next;
+	}
+
+	memset(flow, 0, sizeof(*flow));
+	return flow;
 }
 
 /**
@@ -70,48 +137,15 @@ union flow *flow_alloc(void)
  */
 void flow_alloc_cancel(union flow *flow)
 {
-	ASSERT(FLOW_IDX(flow) == flow_count - 1);
-	memset(flow, 0, sizeof(*flow));
-	flow_count--;
-}
-
-/**
- * flow_table_compact() - Perform compaction on flow table
- * @c:		Execution context
- * @hole:	Pointer to recently closed flow
- */
-static void flow_table_compact(const struct ctx *c, union flow *hole)
-{
-	union flow *from;
-
-	if (FLOW_IDX(hole) == --flow_count) {
-		debug("flow: table compaction: maximum index was %u (%p)",
-		      FLOW_IDX(hole), (void *)hole);
-		memset(hole, 0, sizeof(*hole));
-		return;
-	}
-
-	from = flowtab + flow_count;
-	memcpy(hole, from, sizeof(*hole));
-
-	switch (from->f.type) {
-	case FLOW_TCP:
-		tcp_tap_conn_update(c, &from->tcp, &hole->tcp);
-		break;
-	case FLOW_TCP_SPLICE:
-		tcp_splice_conn_update(c, &hole->tcp_splice);
-		break;
-	default:
-		die("Unexpected %s in tcp_table_compact()",
-		    FLOW_TYPE(&from->f));
-	}
-
-	debug("flow: table compaction (%s): old index %u, new index %u, "
-	      "from: %p, to: %p",
-	      FLOW_TYPE(&from->f), FLOW_IDX(from), FLOW_IDX(hole),
-	      (void *)from, (void *)hole);
-
-	memset(from, 0, sizeof(*from));
+	ASSERT(flow_first_free > FLOW_IDX(flow));
+
+	flow->f.type = FLOW_TYPE_NONE;
+	/* Put it back in a length 1 free block, don't attempt to fully reverse
+	 * flow_alloc()'s steps.  This will get folded together the next time
+	 * flow_defer_handler() runs anyway */
+	flow->free.n = 1;
+	flow->free.next = flow_first_free;
+	flow_first_free = FLOW_IDX(flow);
 }
 
 /**
@@ -121,18 +155,35 @@ static void flow_table_compact(const struct ctx *c, union flow *hole)
  */
 void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 {
+	struct flow_free_block *free_head = NULL;
+	unsigned *last_next = &flow_first_free;
 	bool timer = false;
-	union flow *flow;
+	unsigned idx;
 
 	if (timespec_diff_ms(now, &flow_timer_run) >= FLOW_TIMER_INTERVAL) {
 		timer = true;
 		flow_timer_run = *now;
 	}
 
-	for (flow = flowtab + flow_count - 1; flow >= flowtab; flow--) {
+	for (idx = 0; idx < FLOW_MAX; idx++) {
+		union flow *flow = &flowtab[idx];
 		bool closed = false;
 
+		if (flow->f.type == FLOW_TYPE_NONE) {
+			/* Start of a free block */
+			free_head = &flow->free;
+			*last_next = idx;
+			last_next = &free_head->next;
+			/* Skip the rest of the block */
+			idx += free_head->n - 1;
+			continue;
+		}
+
+		
 		switch (flow->f.type) {
+		case FLOW_TYPE_NONE:
+			closed = true;
+			break;
 		case FLOW_TCP:
 			closed = tcp_flow_defer(flow);
 			break;
@@ -146,7 +197,35 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 			;
 		}
 
-		if (closed)
-			flow_table_compact(c, flow);
+		if (closed) {
+			flow->f.type = FLOW_TYPE_NONE;
+
+			if (free_head) {
+				/* Add slot to current free block */
+				ASSERT(idx == FLOW_IDX(free_head) + free_head->n);
+				free_head->n++;
+				flow->free.n = flow->free.next = 0;
+			} else {
+				/* Create new free block */
+				free_head = &flow->free;
+				free_head->n = 1;
+				*last_next = idx;
+				last_next = &free_head->next;
+			}
+		} else {
+			free_head = NULL;
+		}
 	}
+
+	*last_next = FLOW_MAX;
+}
+
+/**
+ * flow_init() - Initialise flow related data structures
+ */
+void flow_init(void)
+{
+	/* Initial state is a single free block containing the whole table */
+	flowtab[0].free.n = FLOW_MAX;
+	flowtab[0].free.next = FLOW_MAX;
 }
diff --git a/flow.h b/flow.h
index 8064f0e..48a0ab4 100644
--- a/flow.h
+++ b/flow.h
@@ -68,6 +68,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
 
 union flow;
 
+void flow_init(void);
 void flow_defer_handler(const struct ctx *c, const struct timespec *now);
 
 void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
diff --git a/flow_table.h b/flow_table.h
index 2773a2b..34fc679 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -9,6 +9,19 @@
 
 #include "tcp_conn.h"
 
+/**
+ * struct flow_free_block - Information about a block of free entries
+ * @f:		Generic flow information
+ * @n:		Number of entries in this free block (including this one)
+ * @next:	Index of next free block
+ */
+struct flow_free_block {
+	/* Must be first element */
+	struct flow_common f;
+	unsigned n;
+	unsigned next;
+};
+
 /**
  * union flow - Descriptor for a logical packet flow (e.g. connection)
  * @f:		Fields common between all variants
@@ -17,12 +30,13 @@
 */
 union flow {
 	struct flow_common f;
+	struct flow_free_block free;
 	struct tcp_tap_conn tcp;
 	struct tcp_splice_conn tcp_splice;
 };
 
 /* Global Flow Table */
-extern unsigned flow_count;
+extern unsigned flow_first_free;
 extern union flow flowtab[];
 
 
diff --git a/passt.c b/passt.c
index 71bea8f..d315438 100644
--- a/passt.c
+++ b/passt.c
@@ -285,6 +285,8 @@ int main(int argc, char **argv)
 
 	clock_gettime(CLOCK_MONOTONIC, &now);
 
+	flow_init();
+
 	if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c)))
 		exit(EXIT_FAILURE);
 
diff --git a/tcp.c b/tcp.c
index a38deb0..6aacb56 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1250,29 +1250,6 @@ static void tcp_hash_remove(const struct ctx *c,
 	tc_hash[b] = FLOW_SIDX_NONE;
 }
 
-/**
- * tcp_tap_conn_update() - Update tcp_tap_conn when being moved in the table
- * @c:		Execution context
- * @old:	Old location of tcp_tap_conn
- * @new:	New location of tcp_tap_conn
- */
-void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
-			 struct tcp_tap_conn *new)
-
-{
-	unsigned b = tcp_hash_probe(c, old);
-
-	if (!flow_at_sidx(tc_hash[b]))
-		return; /* Not in hash table, nothing to update */
-
-	tc_hash[b] = FLOW_SIDX(new, TAPSIDE);
-
-	debug("TCP: hash table update: old index %u, new index %u, sock %i, "
-	      "bucket: %u", FLOW_IDX(old), FLOW_IDX(new), new->sock, b);
-
-	tcp_epoll_ctl(c, new);
-}
-
 /**
  * tcp_hash_lookup() - Look up connection given remote address and ports
  * @c:		Execution context
diff --git a/tcp_conn.h b/tcp_conn.h
index 636224e..a5f5cfe 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -155,9 +155,6 @@ struct tcp_splice_conn {
 extern int init_sock_pool4	[TCP_SOCK_POOL_SIZE];
 extern int init_sock_pool6	[TCP_SOCK_POOL_SIZE];
 
-void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
-			 struct tcp_tap_conn *new);
-void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new);
 bool tcp_flow_defer(union flow *flow);
 bool tcp_splice_flow_defer(union flow *flow);
 void tcp_splice_timer(const struct ctx *c, union flow *flow);
diff --git a/tcp_splice.c b/tcp_splice.c
index 0711b19..7aae623 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -231,17 +231,6 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn,
 	} while (0)
 
 
-/**
- * tcp_splice_conn_update() - Update tcp_splice_conn when being moved in the table
- * @c:		Execution context
- * @new:	New location of tcp_splice_conn
- */
-void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new)
-{
-	if (tcp_splice_epoll_ctl(c, new))
-		conn_flag(c, new, CLOSING);
-}
-
 /**
  * tcp_splice_flow_defer() - Deferred per-flow handling (clean up closed)
  * @flow:	Flow table entry for this connection
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 05/13] flow, tcp: Add flow-centric dispatch for deferred flow handling
  2023-12-21  6:15 ` [PATCH v3 05/13] flow, tcp: Add flow-centric dispatch for deferred flow handling David Gibson
@ 2023-12-28 18:24   ` Stefano Brivio
  2023-12-31  5:56     ` David Gibson
  0 siblings, 1 reply; 40+ messages in thread
From: Stefano Brivio @ 2023-12-28 18:24 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Thu, 21 Dec 2023 17:15:41 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> tcp_defer_handler(), amongst other things, scans the flow table and does
> some processing for each TCP connection.  When we add other protocols to
> the flow table, they're likely to want some similar scanning.  It makes
> more sense for cache friendliness to perform a single scan of the flow
> table and dispatch to the protocol specific handlers, rather than having
> each protocol separately scan the table.
> 
> To that end, add a new flow_defer_handler() handling all flow-linked
> deferred operations.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  flow.c     | 23 +++++++++++++++++++++++
>  flow.h     |  1 +
>  passt.c    |  1 +
>  tcp.c      | 19 ++-----------------
>  tcp_conn.h |  1 +
>  5 files changed, 28 insertions(+), 17 deletions(-)
> 
> diff --git a/flow.c b/flow.c
> index a1c0a34..0a0402d 100644
> --- a/flow.c
> +++ b/flow.c
> @@ -83,3 +83,26 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
>  
>  	logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
>  }
> +
> +/**
> + * flow_defer_handler() - Handler for per-flow deferred tasks
> + * @c:		Execution context
> + */
> +void flow_defer_handler(struct ctx *c)
> +{
> +	union flow *flow;
> +
> +	for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--) {
> +		switch (flow->f.type) {
> +		case FLOW_TCP:
> +			tcp_flow_defer(c, flow);
> +			break;
> +		case FLOW_TCP_SPLICE:
> +			tcp_splice_flow_defer(c, flow);
> +			break;
> +		default:
> +			/* Assume other flow types don't need any handling */
> +			;
> +		}
> +	}
> +}
> diff --git a/flow.h b/flow.h
> index 959b461..6b17fa8 100644
> --- a/flow.h
> +++ b/flow.h
> @@ -67,6 +67,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
>  union flow;
>  
>  void flow_table_compact(struct ctx *c, union flow *hole);
> +void flow_defer_handler(struct ctx *c);
>  
>  void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
>  	__attribute__((format(printf, 3, 4)));
> diff --git a/passt.c b/passt.c
> index 0246b04..5f72a28 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -103,6 +103,7 @@ static void post_handler(struct ctx *c, const struct timespec *now)
>  	/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
>  	CALL_PROTO_HANDLER(c, now, icmp, ICMP);
>  
> +	flow_defer_handler(c);
>  #undef CALL_PROTO_HANDLER
>  }
>  
> diff --git a/tcp.c b/tcp.c
> index ad1a70d..9230d80 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1306,7 +1306,7 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c,
>   * @c:		Execution context
>   * @flow:	Flow table entry for this connection
>   */
> -static void tcp_flow_defer(struct ctx *c, union flow *flow)
> +void tcp_flow_defer(struct ctx *c, union flow *flow)
>  {
>  	const struct tcp_tap_conn *conn = &flow->tcp;
>  
> @@ -1364,26 +1364,11 @@ static void tcp_l2_data_buf_flush(const struct ctx *c)
>   * tcp_defer_handler() - Handler for TCP deferred tasks
>   * @c:		Execution context
>   */
> +/* cppcheck-suppress constParameterPointer */

This needs to be:

  /* cppcheck-suppress [constParameterPointer, unmatchedSuppression] */

otherwise we get warnings with cppcheck 2.10, and we'll get warnings if
cppcheck's behaviour ever changes again.

-- 
Stefano


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/13] flow: Move flow_count from context structure to a global
  2023-12-21  6:15 ` [PATCH v3 10/13] flow: Move flow_count from context structure to a global David Gibson
@ 2023-12-28 18:25   ` Stefano Brivio
  2023-12-31  5:58     ` David Gibson
  0 siblings, 1 reply; 40+ messages in thread
From: Stefano Brivio @ 2023-12-28 18:25 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Thu, 21 Dec 2023 17:15:46 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> In general, the passt code is a bit haphazard about what's a true global
> variable and what's in the quasi-global 'context structure'.  The
> flow_count field is one such example: it's in the context structure,
> although it's really part of the same data structure as flowtab[], which
> is a genuine global.

Well, the reason is that flow_tab[FLOW_MAX] might be problematically
too big to live on the stack, unlike flow_count.

But anyway, as far as thoughts of multithreading are concerned, both
should probably be global. And sure, it's more consistent this way.

> Move flow_count to be a regular global to match.  For now it needs to be
> public, rather than static, but we expect to be able to change that in
> future.

If it's not static, it should be initialised, and that's not done here.

This becomes 'flow_first_free' in 13/13, but it's not initialised
either, and that should also start off as zero.

-- 
Stefano


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 13/13] flow: Avoid moving flow entries to compact table
  2023-12-21  6:15 ` [PATCH v3 13/13] flow: Avoid moving flow entries to compact table David Gibson
@ 2023-12-28 18:25   ` Stefano Brivio
  2023-12-30 10:33     ` Stefano Brivio
  2024-01-01 10:44     ` David Gibson
  0 siblings, 2 replies; 40+ messages in thread
From: Stefano Brivio @ 2023-12-28 18:25 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Thu, 21 Dec 2023 17:15:49 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> Currently we always keep the flow table maximally compact: that is all the
> active entries are contiguous at the start of the table.  Doing this
> sometimes requires moving an entry when one is freed.  That's kind of
> fiddly, and potentially expensive: it requires updating the hash table for
> the new location, and depending on flow type, it may require EPOLL_CTL_MOD
> system calls to update epoll tags with the new location too.
> 
> Implement a new way of managing the flow table that doesn't ever move
> entries.  It attempts to maintain some compactness by always using the
> first free slot for a new connection, and mitigates the effect of non
> compactness by cheaply skipping over contiguous blocks of free entries.
> See the "theory of operation" comment in flow.c for details.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  flow.c       | 177 +++++++++++++++++++++++++++++++++++++--------------
>  flow.h       |   1 +
>  flow_table.h |  16 ++++-
>  passt.c      |   2 +
>  tcp.c        |  23 -------
>  tcp_conn.h   |   3 -
>  tcp_splice.c |  11 ----
>  7 files changed, 146 insertions(+), 87 deletions(-)
> 
> diff --git a/flow.c b/flow.c
> index 7b99b58..421e6b5 100644
> --- a/flow.c
> +++ b/flow.c
> @@ -25,7 +25,7 @@ static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES,
>  	      "flow_type_str[] doesn't match enum flow_type");
>  
>  /* Global Flow Table */
> -unsigned flow_count;
> +unsigned flow_first_free;

As mentioned in the comment to 10/13: this should be initialised to
zero.

>  union flow flowtab[FLOW_MAX];
>  
>  /* Last time the flow timers ran */
> @@ -49,6 +49,52 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
>  	logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
>  }
>  
> +/**

As a general comment: interesting -- I wonder if this already has a name,
but I couldn't find anything similar in the literature. Perhaps it's
equivalent to a flattened search tree where free/busy slots can be
found by following the links.

> + * DOC: Theory of Operation - allocation and freeing of flow entries

"allocating and freeing flow entries" ...flows better for me.

I think this comment should be before the declaration of flowtab[] --
above here, with a function in between, is not exactly where I expected
to find it as I started reading the next paragraph.

> + *
> + * Each flow takes a single slot in flowtab[].  Moving entries in that table

You jump right away to "moving entries", we know why, but I guess any
other reader would be lost at this point.

In general, it took me some effort to grasp this paragraph. What about:

 * Flows are entries in the flowtab[]. We need to routinely scan the whole table
 * to perform deferred bookkeeping tasks on active entries, and sparse empty
 * slots waste time and worsen data locality.
 *
 * But keeping the table compacted by moving entries on deletion is fiddly: it
 * requires updating hash tables, and the epoll references to the flows. Instead,
 * we implement the compromise described below.

?

> + * (which we used to do) is fiddly and possibly expensive: it requires updating
> + * the hash table indexing flows, and may require updating epoll data which
> + * references the flow by index.  However, we also want to keep the active
> + * entries in the table compact where possible, because otherwise scanning
> + * through the entire table becomes more expensive.  This describes the
> + * compromise implemented below.
> + *
> + * Free blocks
> + *    A "free block" is a contiguous sequence of unused (FLOW_TYPE_NONE) entries

This gave me the idea that blocks are some kind of "bucket" with a fixed
size. I think we should either mention that the blocks have variable
size, or call them sequences, or clusters (of free slots) perhaps?

> + *    in flowtab.  The first entry in each block contains metadata, specifically
> + *    the number of entries in the block, and the index of the next (non
> + *    contiguous) free block (struct flow_free_block).

...it doesn't waste any space, but I don't think we actually need to
store the number of "entries" (slots?)... more details below.

> + *
> + * Free block list
> + *    flow_first_free gives the index of the first entry of the first (lowest
> + *    index) free block.  Each free block has the index of the next free block,
> + *    or FLOW_MAX if it is the last free block.  Together these form a linked
> + *    list of free blocks, in strictly increasing order of index.
> + *
> + * Allocation
> + *    When allocating a new flow, we always use the first entry of the first
> + *    free block, that is, at index flow_first_free.  If the block has more than
> + *    one entry, flow_first_free is updated to the next entry, which is updated
> + *    to represent the new smaller free block.  Otherwise the free block is
> + *    eliminated and flow_first_free is updated to the next free block.
> + *
> + * Scanning the table
> + *    Theoretically, scanning the table requires FLOW_MAX iterations.  However,
> + *    when we encounter the start of a free block, we can immediately skip to
> + *    its end, meaning that in practice we only need (number of active

after its end

> + *    connections) + (number of free blocks) iterations.
> + *
> + * Freeing
> + *    We can only free entries when scanning the whole flow table in
> + *    flow_defer_handler().  This is what lets us maintain the fee block list in

s/fee/free/

> + *    index sorted order.  As we scan we keep track of whether the previous
> + *    entry was in a free block or not.  If so when an entry is freed (its
> + *    deferred handler returns 'true'), we add it to that free block.  Otherwise
> + *    we create a new free block for it and chain it to the last free block we
> + *    scanned.
> + */
> +
>  /**
>   * flow_alloc() - Allocate a new flow
>   *
> @@ -56,10 +102,31 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
>   */
>  union flow *flow_alloc(void)
>  {
> -	if (flow_count >= FLOW_MAX)
> +	union flow *flow = &flowtab[flow_first_free];
> +
> +	if (flow_first_free >= FLOW_MAX)
>  		return NULL;
>  
> -	return &flowtab[flow_count++];
> +	ASSERT(flow->f.type == FLOW_TYPE_NONE);
> +	ASSERT(flow->free.n >= 1);
> +
> +	if (flow->free.n > 1) {
> +		/* Use one entry from the block */
> +		union flow *next = &flowtab[++flow_first_free];

...but we just checked (flow_first_free < FLOW_MAX) above. Now you
increment flow_first_free by one.

I *guess* that if flow_first_free is FLOW_MAX - 1, then flow->free.n
should be 0, so we shouldn't end up in this case -- but it's a bit
confusing.

> +
> +		ASSERT(FLOW_IDX(next) < FLOW_MAX);
> +		ASSERT(next->f.type == FLOW_TYPE_NONE);
> +		ASSERT(next->free.n == 0);
> +
> +		next->free.n = flow->free.n - 1;
> +		next->free.next = flow->free.next;
> +	} else {
> +		/* Use the entire block */
> +		flow_first_free = flow->free.next;
> +	}

I wonder if we really have to keep track of the number of (non-)entries
in the free "block", and if we have to be explicit about the two cases.

I'm trying to find out if we can simplify the whole thing with slightly
different variants, for example:

  struct flow_free_block {
  	[...]
  	unsigned next_free;
  	unsigned next_busy;
  };

or:

  struct flow_free_block {
  	[...]
  	unsigned next_free;
  };

or even some sort of double-linked list:

  struct flow_free_block {
  	[...]
  	unsigned prev_free;
  	unsigned next_free;
  };

but I couldn't quite find a more elegant solution yet.

> +
> +	memset(flow, 0, sizeof(*flow));
> +	return flow;
>  }
>  
>  /**
> @@ -70,48 +137,15 @@ union flow *flow_alloc(void)
>   */
>  void flow_alloc_cancel(union flow *flow)
>  {
> -	ASSERT(FLOW_IDX(flow) == flow_count - 1);
> -	memset(flow, 0, sizeof(*flow));
> -	flow_count--;
> -}
> -
> -/**
> - * flow_table_compact() - Perform compaction on flow table
> - * @c:		Execution context
> - * @hole:	Pointer to recently closed flow
> - */
> -static void flow_table_compact(const struct ctx *c, union flow *hole)
> -{
> -	union flow *from;
> -
> -	if (FLOW_IDX(hole) == --flow_count) {
> -		debug("flow: table compaction: maximum index was %u (%p)",
> -		      FLOW_IDX(hole), (void *)hole);
> -		memset(hole, 0, sizeof(*hole));
> -		return;
> -	}
> -
> -	from = flowtab + flow_count;
> -	memcpy(hole, from, sizeof(*hole));
> -
> -	switch (from->f.type) {
> -	case FLOW_TCP:
> -		tcp_tap_conn_update(c, &from->tcp, &hole->tcp);
> -		break;
> -	case FLOW_TCP_SPLICE:
> -		tcp_splice_conn_update(c, &hole->tcp_splice);
> -		break;
> -	default:
> -		die("Unexpected %s in tcp_table_compact()",
> -		    FLOW_TYPE(&from->f));
> -	}
> -
> -	debug("flow: table compaction (%s): old index %u, new index %u, "
> -	      "from: %p, to: %p",
> -	      FLOW_TYPE(&from->f), FLOW_IDX(from), FLOW_IDX(hole),
> -	      (void *)from, (void *)hole);
> -
> -	memset(from, 0, sizeof(*from));
> +	ASSERT(flow_first_free > FLOW_IDX(flow));
> +
> +	flow->f.type = FLOW_TYPE_NONE;
> +	/* Put it back in a length 1 free block, don't attempt to fully reverse
> +	 * flow_alloc()s steps.  This will get folded together the next time
> +	 * flow_defer_handler runs anyway() */
> +	flow->free.n = 1;
> +	flow->free.next = flow_first_free;
> +	flow_first_free = FLOW_IDX(flow);
>  }
>  
>  /**
> @@ -121,18 +155,35 @@ static void flow_table_compact(const struct ctx *c, union flow *hole)
>   */
>  void flow_defer_handler(const struct ctx *c, const struct timespec *now)
>  {
> +	struct flow_free_block *free_head = NULL;
> +	unsigned *last_next = &flow_first_free;
>  	bool timer = false;
> -	union flow *flow;
> +	unsigned idx;
>  
>  	if (timespec_diff_ms(now, &flow_timer_run) >= FLOW_TIMER_INTERVAL) {
>  		timer = true;
>  		flow_timer_run = *now;
>  	}
>  
> -	for (flow = flowtab + flow_count - 1; flow >= flowtab; flow--) {
> +	for (idx = 0; idx < FLOW_MAX; idx++) {
> +		union flow *flow = &flowtab[idx];
>  		bool closed = false;
>  
> +		if (flow->f.type == FLOW_TYPE_NONE) {
> +			/* Start of a free block */
> +			free_head = &flow->free;
> +			*last_next = idx;
> +			last_next = &free_head->next;
> +			/* Skip the rest of the block */
> +			idx += free_head->n - 1;
> +			continue;
> +		}
> +
> +		

Stray tabs.

>  		switch (flow->f.type) {
> +		case FLOW_TYPE_NONE:
> +			closed = true;
> +			break;
>  		case FLOW_TCP:
>  			closed = tcp_flow_defer(flow);
>  			break;
> @@ -146,7 +197,35 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
>  			;
>  		}
>  
> -		if (closed)
> -			flow_table_compact(c, flow);
> +		if (closed) {

Should this be a flow_del() function perhaps, possibly taking flow and
free_head as argument?

> +			flow->f.type = FLOW_TYPE_NONE;
> +
> +			if (free_head) {
> +				/* Add slot to current free block */
> +				ASSERT(idx == FLOW_IDX(free_head) + free_head->n);
> +				free_head->n++;
> +				flow->free.n = flow->free.next = 0;
> +			} else {
> +				/* Create new free block */
> +				free_head = &flow->free;
> +				free_head->n = 1;
> +				*last_next = idx;
> +				last_next = &free_head->next;
> +			}
> +		} else {
> +			free_head = NULL;
> +		}
>  	}
> +
> +	*last_next = FLOW_MAX;
> +}
> +
> +/**
> + * flow_init() - Initialise flow related data structures
> + */
> +void flow_init(void)
> +{
> +	/* Initial state is a single free block containing the whole table */
> +	flowtab[0].free.n = FLOW_MAX;
> +	flowtab[0].free.next = FLOW_MAX;
>  }
> diff --git a/flow.h b/flow.h
> index 8064f0e..48a0ab4 100644
> --- a/flow.h
> +++ b/flow.h
> @@ -68,6 +68,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
>  
>  union flow;
>  
> +void flow_init(void);
>  void flow_defer_handler(const struct ctx *c, const struct timespec *now);
>  
>  void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> diff --git a/flow_table.h b/flow_table.h
> index 2773a2b..34fc679 100644
> --- a/flow_table.h
> +++ b/flow_table.h
> @@ -9,6 +9,19 @@
>  
>  #include "tcp_conn.h"
>  
> +/**
> + * struct flow_free_block - Information about a block of free entries
> + * @f:		Generic flow information
> + * @n:		Number of entries in this free block (including this one)
> + * @next:	Index of next free block
> + */
> +struct flow_free_block {
> +	/* Must be first element */
> +	struct flow_common f;
> +	unsigned n;
> +	unsigned next;
> +};
> +
>  /**
>   * union flow - Descriptor for a logical packet flow (e.g. connection)
>   * @f:		Fields common between all variants
> @@ -17,12 +30,13 @@
>  */
>  union flow {
>  	struct flow_common f;
> +	struct flow_free_block free;
>  	struct tcp_tap_conn tcp;
>  	struct tcp_splice_conn tcp_splice;
>  };
>  
>  /* Global Flow Table */
> -extern unsigned flow_count;
> +extern unsigned flow_first_free;
>  extern union flow flowtab[];
>  
>  
> diff --git a/passt.c b/passt.c
> index 71bea8f..d315438 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -285,6 +285,8 @@ int main(int argc, char **argv)
>  
>  	clock_gettime(CLOCK_MONOTONIC, &now);
>  
> +	flow_init();
> +
>  	if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c)))
>  		exit(EXIT_FAILURE);
>  
> diff --git a/tcp.c b/tcp.c
> index a38deb0..6aacb56 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1250,29 +1250,6 @@ static void tcp_hash_remove(const struct ctx *c,
>  	tc_hash[b] = FLOW_SIDX_NONE;
>  }
>  
> -/**
> - * tcp_tap_conn_update() - Update tcp_tap_conn when being moved in the table
> - * @c:		Execution context
> - * @old:	Old location of tcp_tap_conn
> - * @new:	New location of tcp_tap_conn
> - */
> -void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
> -			 struct tcp_tap_conn *new)
> -
> -{
> -	unsigned b = tcp_hash_probe(c, old);
> -
> -	if (!flow_at_sidx(tc_hash[b]))
> -		return; /* Not in hash table, nothing to update */
> -
> -	tc_hash[b] = FLOW_SIDX(new, TAPSIDE);
> -
> -	debug("TCP: hash table update: old index %u, new index %u, sock %i, "
> -	      "bucket: %u", FLOW_IDX(old), FLOW_IDX(new), new->sock, b);
> -
> -	tcp_epoll_ctl(c, new);
> -}
> -
>  /**
>   * tcp_hash_lookup() - Look up connection given remote address and ports
>   * @c:		Execution context
> diff --git a/tcp_conn.h b/tcp_conn.h
> index 636224e..a5f5cfe 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -155,9 +155,6 @@ struct tcp_splice_conn {
>  extern int init_sock_pool4	[TCP_SOCK_POOL_SIZE];
>  extern int init_sock_pool6	[TCP_SOCK_POOL_SIZE];
>  
> -void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
> -			 struct tcp_tap_conn *new);
> -void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new);
>  bool tcp_flow_defer(union flow *flow);
>  bool tcp_splice_flow_defer(union flow *flow);
>  void tcp_splice_timer(const struct ctx *c, union flow *flow);
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 0711b19..7aae623 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -231,17 +231,6 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn,
>  	} while (0)
>  
>  
> -/**
> - * tcp_splice_conn_update() - Update tcp_splice_conn when being moved in the table
> - * @c:		Execution context
> - * @new:	New location of tcp_splice_conn
> - */
> -void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new)
> -{
> -	if (tcp_splice_epoll_ctl(c, new))
> -		conn_flag(c, new, CLOSING);
> -}
> -
>  /**
>   * tcp_splice_flow_defer() - Deferred per-flow handling (clean up closed)
>   * @flow:	Flow table entry for this connection

-- 
Stefano


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 13/13] flow: Avoid moving flow entries to compact table
  2023-12-28 18:25   ` Stefano Brivio
@ 2023-12-30 10:33     ` Stefano Brivio
  2024-01-01 12:01       ` David Gibson
  2024-01-01 10:44     ` David Gibson
  1 sibling, 1 reply; 40+ messages in thread
From: Stefano Brivio @ 2023-12-30 10:33 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Thu, 28 Dec 2023 19:25:25 +0100
Stefano Brivio <sbrivio@redhat.com> wrote:

> > On Thu, 21 Dec 2023 17:15:49 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> > 
> > [...]
>
> [...]
>
> I wonder if we really have to keep track of the number of (non-)entries
> in the free "block", and if we have to be explicit about the two cases.
> 
> I'm trying to find out if we can simplify the whole thing with slightly
> different variants, for example:

So... I think the version with (explicit) blocks has this fundamental
advantage, on deletion:

> > +	flow->f.type = FLOW_TYPE_NONE;
> > +	/* Put it back in a length 1 free block, don't attempt to fully reverse
> > +	 * flow_alloc()'s steps.  This will get folded together the next time
> > +	 * flow_defer_handler() runs anyway */
> > +	flow->free.n = 1;
> > +	flow->free.next = flow_first_free;
> > +	flow_first_free = FLOW_IDX(flow);

which is doable even without explicit blocks, but much harder to
follow.

Other than the comments I have, I would go ahead with your approach. My
attempt below in case you're interested. The comment at the end of
flow_del() shows the issue I'm facing:

struct flow_free_block {
	/* Must be first element */
	struct flow_common f;

	unsigned next_free;
	unsigned next_used;
};

static union flow *flow_alloc(void)
{
	union flow *flow;

	if (flow_first_free >= FLOW_MAX)
		return NULL;

	flow = flowtab + flow_first_free;

	flow_first_free = flow->free.next_free;

	memset(flow, 0, sizeof(*flow));		/* TODO: select region */
	return flow;
}

static void flow_del(union flow *del)
{
	union flow *left, *right, *first_free;

	del->f.type = FLOW_TYPE_NONE;

	left =  (FLOW_IDX(del) > 0)            ? FLOW(FLOW_IDX(del) - 1) : NULL;
	right = (FLOW_IDX(del) < FLOW_MAX - 1) ? FLOW(FLOW_IDX(del) + 1) : NULL;

	first_free = flow_first_free < FLOW_MAX ? FLOW(flow_first_free) : NULL;

	if (right) {
		if (right->f.type == FLOW_TYPE_NONE)
			del->free.next_used = right->free.next_used;
		else
			del->free.next_used = FLOW_IDX(right);
	} else {
		del->free.next_used = FLOW_MAX;
	}

	if (left && left->f.type == FLOW_TYPE_NONE) {
		left->free.next_free = FLOW_IDX(del);
		left->free.next_used = del->free.next_used;
		return;
	}

	if (flow_first_free == FLOW_MAX) {
		del->free.next_free = FLOW_MAX;
		flow_first_free = FLOW_IDX(del);
	} else if (flow_first_free > FLOW_IDX(del)) {
		del->free.next_free = flow_first_free;
		flow_first_free = FLOW_IDX(del);
	} else if (flow_first_free < FLOW_IDX(del)) {
		;
		/* Walk free slots from flow_first_free until FLOW_IDX(del) to
		 * find insertion point... but that makes deletion at most O(n),
		 * perhaps O(log(n)), certainly not O(1).
		 *
		 * Or just link out-of-order for the moment, and fix this up in
		 * flow_next(), but then "blocks" help representing this.
		 */
	}
}

static union flow *flow_next(union flow *whence)
{
	if (FLOW_IDX(whence) >= FLOW_MAX - 1)
		return NULL;

	if (whence->f.type != FLOW_TYPE_NONE)
		whence++;

	if (whence->f.type == FLOW_TYPE_NONE) {
		if (whence->free.next_used >= FLOW_MAX)
			return NULL;
		return FLOW(whence->free.next_used);
	}

	return whence;
}

-- 
Stefano


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 05/13] flow, tcp: Add flow-centric dispatch for deferred flow handling
  2023-12-28 18:24   ` Stefano Brivio
@ 2023-12-31  5:56     ` David Gibson
  2024-01-02 18:13       ` Stefano Brivio
  0 siblings, 1 reply; 40+ messages in thread
From: David Gibson @ 2023-12-31  5:56 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On Thu, Dec 28, 2023 at 07:24:46PM +0100, Stefano Brivio wrote:
> On Thu, 21 Dec 2023 17:15:41 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > tcp_defer_handler(), amongst other things, scans the flow table and does
> > some processing for each TCP connection.  When we add other protocols to
> > the flow table, they're likely to want some similar scanning.  It makes
> > more sense for cache friendliness to perform a single scan of the flow
> > table and dispatch to the protocol specific handlers, rather than having
> > each protocol separately scan the table.
> > 
> > To that end, add a new flow_defer_handler() handling all flow-linked
> > deferred operations.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  flow.c     | 23 +++++++++++++++++++++++
> >  flow.h     |  1 +
> >  passt.c    |  1 +
> >  tcp.c      | 19 ++-----------------
> >  tcp_conn.h |  1 +
> >  5 files changed, 28 insertions(+), 17 deletions(-)
> > 
> > diff --git a/flow.c b/flow.c
> > index a1c0a34..0a0402d 100644
> > --- a/flow.c
> > +++ b/flow.c
> > @@ -83,3 +83,26 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> >  
> >  	logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
> >  }
> > +
> > +/**
> > + * flow_defer_handler() - Handler for per-flow deferred tasks
> > + * @c:		Execution context
> > + */
> > +void flow_defer_handler(struct ctx *c)
> > +{
> > +	union flow *flow;
> > +
> > +	for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--) {
> > +		switch (flow->f.type) {
> > +		case FLOW_TCP:
> > +			tcp_flow_defer(c, flow);
> > +			break;
> > +		case FLOW_TCP_SPLICE:
> > +			tcp_splice_flow_defer(c, flow);
> > +			break;
> > +		default:
> > +			/* Assume other flow types don't need any handling */
> > +			;
> > +		}
> > +	}
> > +}
> > diff --git a/flow.h b/flow.h
> > index 959b461..6b17fa8 100644
> > --- a/flow.h
> > +++ b/flow.h
> > @@ -67,6 +67,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
> >  union flow;
> >  
> >  void flow_table_compact(struct ctx *c, union flow *hole);
> > +void flow_defer_handler(struct ctx *c);
> >  
> >  void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> >  	__attribute__((format(printf, 3, 4)));
> > diff --git a/passt.c b/passt.c
> > index 0246b04..5f72a28 100644
> > --- a/passt.c
> > +++ b/passt.c
> > @@ -103,6 +103,7 @@ static void post_handler(struct ctx *c, const struct timespec *now)
> >  	/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
> >  	CALL_PROTO_HANDLER(c, now, icmp, ICMP);
> >  
> > +	flow_defer_handler(c);
> >  #undef CALL_PROTO_HANDLER
> >  }
> >  
> > diff --git a/tcp.c b/tcp.c
> > index ad1a70d..9230d80 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -1306,7 +1306,7 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c,
> >   * @c:		Execution context
> >   * @flow:	Flow table entry for this connection
> >   */
> > -static void tcp_flow_defer(struct ctx *c, union flow *flow)
> > +void tcp_flow_defer(struct ctx *c, union flow *flow)
> >  {
> >  	const struct tcp_tap_conn *conn = &flow->tcp;
> >  
> > @@ -1364,26 +1364,11 @@ static void tcp_l2_data_buf_flush(const struct ctx *c)
> >   * tcp_defer_handler() - Handler for TCP deferred tasks
> >   * @c:		Execution context
> >   */
> > +/* cppcheck-suppress constParameterPointer */
> 
> This needs to be:
> 
>   /* cppcheck-suppress [constParameterPointer, unmatchedSuppression] */
> 
> otherwise we get warnings with cppcheck 2.10,

Drat, do we?  I was hoping this was a new warning type with the newer
cppcheck, and it would ignore the suppression if it was for a warning
type it didn't know about.

> and we'll get warnings if
> cppcheck's behaviour ever changes again.

That's actually a good thing.  This one isn't a workaround for a
cppcheck false positive or weird semantic that we hope will go away.
Rhe warning is real and correct as far as it goes.  The problem is
that the signature needs to match that of other deferred handlers
because of how we generate the calls from a macro.  Some of those
others need write access to the context.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/13] flow: Move flow_count from context structure to a global
  2023-12-28 18:25   ` Stefano Brivio
@ 2023-12-31  5:58     ` David Gibson
  2024-01-02 18:13       ` Stefano Brivio
  0 siblings, 1 reply; 40+ messages in thread
From: David Gibson @ 2023-12-31  5:58 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 1525 bytes --]

On Thu, Dec 28, 2023 at 07:25:18PM +0100, Stefano Brivio wrote:
> On Thu, 21 Dec 2023 17:15:46 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > In general, the passt code is a bit haphazard about what's a true global
> > variable and what's in the quasi-global 'context structure'.  The
> > flow_count field is one such example: it's in the context structure,
> > although it's really part of the same data structure as flowtab[], which
> > is a genuine global.
> 
> Well, the reason is that flowtab[FLOW_MAX] might be problematically
> too big to live on the stack, unlike flow_count.
> 
> But anyway, as far as thoughts of multithreading are concerned, both
> should probably be global. And sure, it's more consistent this way.
> 
> > Move flow_count to be a regular global to match.  For now it needs to be
> > public, rather than static, but we expect to be able to change that in
> > future.
> 
> If it's not static, it should be initialised, and that's not done here.

Uh... what?  "static" here means module-global rather than
global-global, which has no bearing on initialisation.  AFAIK globals
are zero-initialised whether they're static or not.

> This becomes 'flow_first_free' in 13/13, but it's not initialised
> either, and that should also start off as zero.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 13/13] flow: Avoid moving flow entries to compact table
  2023-12-28 18:25   ` Stefano Brivio
  2023-12-30 10:33     ` Stefano Brivio
@ 2024-01-01 10:44     ` David Gibson
  2024-01-02 18:13       ` Stefano Brivio
  1 sibling, 1 reply; 40+ messages in thread
From: David Gibson @ 2024-01-01 10:44 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 18177 bytes --]

On Thu, Dec 28, 2023 at 07:25:25PM +0100, Stefano Brivio wrote:
> On Thu, 21 Dec 2023 17:15:49 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > Currently we always keep the flow table maximally compact: that is all the
> > active entries are contiguous at the start of the table.  Doing this
> > sometimes requires moving an entry when one is freed.  That's kind of
> > fiddly, and potentially expensive: it requires updating the hash table for
> > the new location, and depending on flow type, it may require EPOLL_CTL_MOD,
> > system calls to update epoll tags with the new location too.
> > 
> > Implement a new way of managing the flow table that doesn't ever move
> > entries.  It attempts to maintain some compactness by always using the
> > first free slot for a new connection, and mitigates the effect of non
> > compactness by cheaply skipping over contiguous blocks of free entries.
> > See the "theory of operation" comment in flow.c for details.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  flow.c       | 177 +++++++++++++++++++++++++++++++++++++--------------
> >  flow.h       |   1 +
> >  flow_table.h |  16 ++++-
> >  passt.c      |   2 +
> >  tcp.c        |  23 -------
> >  tcp_conn.h   |   3 -
> >  tcp_splice.c |  11 ----
> >  7 files changed, 146 insertions(+), 87 deletions(-)
> > 
> > diff --git a/flow.c b/flow.c
> > index 7b99b58..421e6b5 100644
> > --- a/flow.c
> > +++ b/flow.c
> > @@ -25,7 +25,7 @@ static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES,
> >  	      "flow_type_str[] doesn't match enum flow_type");
> >  
> >  /* Global Flow Table */
> > -unsigned flow_count;
> > +unsigned flow_first_free;
> 
> As mentioned in the comment to 10/13: this should be initialised to
> zero.

I'm pretty sure it already is...

> >  union flow flowtab[FLOW_MAX];
> >  
> >  /* Last time the flow timers ran */
> > @@ -49,6 +49,52 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> >  	logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
> >  }
> >  
> > +/**
> 
> As a general comment: interesting -- I wonder if this already has a name,
> but I couldn't find anything similar in the literature. Perhaps it's
> equivalent to a flattened search tree where free/busy slots can be
> found by following the links.

Yeah, I'm not aware of this technique being used elsewhere, though I
can't rule it out, of course.

> > + * DOC: Theory of Operation - allocation and freeing of flow entries
> 
> "allocating and freeing flow entries" ...flows better for me.

Good idea, changed.

> I think this comment should be before the declaration of flowtab[] --
> above here, with a function in between, is not exactly where I expected
> to find it as I started reading the next paragraph.

Fair enough, moved.

> > + *
> > + * Each flow takes a single slot in flowtab[].  Moving entries in that table
> 
> You jump right away to "moving entries", we know why, but I guess any
> other reader would be lost at this point.
> 
> In general, it took me some effort to grasp this paragraph. What about:
> 
>  * Flows are entries in the flowtab[]. We need to routinely scan the whole table
>  * to perform deferred bookkeeping tasks on active entries, and sparse empty
>  * slots waste time and worsen data locality.
>  *
>  * But keeping the table compacted by moving entries on deletion is fiddly: it
>  * requires updating hash tables, and the epoll references to the flows. Instead,
>  * we implement the compromise described below.
> 
> ?

Much better, thanks

> > + * (which we used to do) is fiddly and possibly expensive: it requires updating
> > + * the hash table indexing flows, and may require updating epoll data which
> > + * references the flow by index.  However, we also want to keep the active
> > + * entries in the table compact where possible, because otherwise scanning
> > + * through the entire table becomes more expensive.  This describes the
> > + * compromise implemented below.
> > + *
> > + * Free blocks
> > + *    A "free block" is a contiguous sequence of unused (FLOW_TYPE_NONE) entries
> 
> This gave me the idea that blocks are some kind of "bucket" with a fixed
> size. I think we should either mention that the blocks have variable
> size, or call them sequences, or clusters (of free slots) perhaps?

Fair point.  I've changed everything to the term "free cluster".

> > + *    in flowtab.  The first entry in each block contains metadata, specifically
> > + *    the number of entries in the block, and the index of the next (non
> > + *    contiguous) free block (struct flow_free_block).
> 
> ...it doesn't waste any space, but I don't think we actually need to
> store the number of "entries" (slots?)... more details below.

I haven't seen a way to avoid it, while maintaining the properties we
want, but I'm open to suggestions.

> > + *
> > + * Free block list
> > + *    flow_first_free gives the index of the first entry of the first (lowest
> > + *    index) free block.  Each free block has the index of the next free block,
> > + *    or MAX_FLOW if it is the last free block.  Together these form a linked
> > + *    list of free blocks, in strictly increasing order of index.
> > + *
> > + * Allocation
> > + *    When allocating a new flow, we always use the first entry of the first
> > + *    free block, that is, at index flow_first_free.  If the block has more than
> > + *    one entry, flow_first_free is updated to the next entry, which is updated
> > + *    to represent the new smaller free block.  Otherwise the free block is
> > + *    eliminated and flow_first_free is updated to the next free block.
> > + *
> > + * Scanning the table
> > + *    Theoretically, scanning the table requires FLOW_MAX iterations.  However,
> > + *    when we encounter the start of a free block, we can immediately skip to
> > + *    its end, meaning that in practice we only need (number of active
> 
> after its end

better yet, "...immediately skip past it,".

> > + *    connections) + (number of free blocks) iterations.
> > + *
> > + * Freeing
> > + *    We can only free entries when scanning the whole flow table in
> > + *    flow_defer_handler().  This is what lets us maintain the fee block list in
> 
> s/fee/free/

Fixed.

> > + *    index sorted order.  As we scan we keep track of whether the previous
> > + *    entry was in a free block or not.  If so when an entry is freed (its
> > + *    deferred handler returns 'true'), we add it to that free block.  Otherwise
> > + *    we create a new free block for it and chain it to the last free block we
> > + *    scanned.
> > + */
> > +
> >  /**
> >   * flow_alloc() - Allocate a new flow
> >   *
> > @@ -56,10 +102,31 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> >   */
> >  union flow *flow_alloc(void)
> >  {
> > -	if (flow_count >= FLOW_MAX)
> > +	union flow *flow = &flowtab[flow_first_free];
> > +
> > +	if (flow_first_free >= FLOW_MAX)
> >  		return NULL;
> >  
> > -	return &flowtab[flow_count++];
> > +	ASSERT(flow->f.type == FLOW_TYPE_NONE);
> > +	ASSERT(flow->free.n >= 1);
> > +
> > +	if (flow->free.n > 1) {
> > +		/* Use one entry from the block */
> > +		union flow *next = &flowtab[++flow_first_free];
> 
> ...but we just checked (flow_first_free < FLOW_MAX) above. Now you
> increment flow_first_free by one.
> 
> I *guess* that if flow_first_free is FLOW_MAX - 1, then flow->free.n
> should be 0, so we shouldn't end up in this case -- but it's a bit
> confusing.

That's correct, except that it's flow->free.n == 1, not 0 (we're
checking for strictly > 1 above).  If it's not, that's a free cluster
claiming it extends past the end of the table.

I've added
	ASSERT(flow_first_free + flow->free.n <= FLOW_MAX);

above the if to try to clarify this.
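
Spelled out with an invented size, say FLOW_MAX == 128:

	/* flow_first_free == 127, free.n == 1:  127 + 1 <= 128, fine;
	 * this is the "use the entire cluster" branch.
	 *
	 * flow_first_free == 127, free.n == 2:  127 + 2 > 128, i.e. a
	 * cluster claiming a slot past the end of the table -- the new
	 * ASSERT() fires before the "use one entry" branch can run.
	 */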

> > +
> > +		ASSERT(FLOW_IDX(next) < FLOW_MAX);
> > +		ASSERT(next->f.type == FLOW_TYPE_NONE);
> > +		ASSERT(next->free.n == 0);
> > +
> > +		next->free.n = flow->free.n - 1;
> > +		next->free.next = flow->free.next;
> > +	} else {
> > +		/* Use the entire block */
> > +		flow_first_free = flow->free.next;
> > +	}
> 
> I wonder if we really have to keep track of the number of (non-)entries
> in the free "block", and if we have to be explicit about the two cases.
> 
> I'm trying to find out if we can simplify the whole thing with slightly
> different variants, for example:
> 
>   struct flow_free_block {
>   	[...]
>   	unsigned next_free;
>   	unsigned next_busy;
>   };
> 
> or:
> 
>   struct flow_free_block {
>   	[...]
>   	unsigned next_free;
>   };
> 
> or even some sort of double-linked list:
> 
>   struct flow_free_block {
>   	[...]
>   	unsigned prev_free;
>   	unsigned next_free;
>   };
> 
> but I couldn't quite find a more elegant solution yet.
> 
> > +
> > +	memset(flow, 0, sizeof(*flow));
> > +	return flow;
> >  }
> >  
> >  /**
> > @@ -70,48 +137,15 @@ union flow *flow_alloc(void)
> >   */
> >  void flow_alloc_cancel(union flow *flow)
> >  {
> > -	ASSERT(FLOW_IDX(flow) == flow_count - 1);
> > -	memset(flow, 0, sizeof(*flow));
> > -	flow_count--;
> > -}
> > -
> > -/**
> > - * flow_table_compact() - Perform compaction on flow table
> > - * @c:		Execution context
> > - * @hole:	Pointer to recently closed flow
> > - */
> > -static void flow_table_compact(const struct ctx *c, union flow *hole)
> > -{
> > -	union flow *from;
> > -
> > -	if (FLOW_IDX(hole) == --flow_count) {
> > -		debug("flow: table compaction: maximum index was %u (%p)",
> > -		      FLOW_IDX(hole), (void *)hole);
> > -		memset(hole, 0, sizeof(*hole));
> > -		return;
> > -	}
> > -
> > -	from = flowtab + flow_count;
> > -	memcpy(hole, from, sizeof(*hole));
> > -
> > -	switch (from->f.type) {
> > -	case FLOW_TCP:
> > -		tcp_tap_conn_update(c, &from->tcp, &hole->tcp);
> > -		break;
> > -	case FLOW_TCP_SPLICE:
> > -		tcp_splice_conn_update(c, &hole->tcp_splice);
> > -		break;
> > -	default:
> > -		die("Unexpected %s in tcp_table_compact()",
> > -		    FLOW_TYPE(&from->f));
> > -	}
> > -
> > -	debug("flow: table compaction (%s): old index %u, new index %u, "
> > -	      "from: %p, to: %p",
> > -	      FLOW_TYPE(&from->f), FLOW_IDX(from), FLOW_IDX(hole),
> > -	      (void *)from, (void *)hole);
> > -
> > -	memset(from, 0, sizeof(*from));
> > +	ASSERT(flow_first_free > FLOW_IDX(flow));
> > +
> > +	flow->f.type = FLOW_TYPE_NONE;
> > +	/* Put it back in a length 1 free block, don't attempt to fully reverse
> > +	 * flow_alloc()'s steps.  This will get folded together the next time
> > +	 * flow_defer_handler() runs anyway */
> > +	flow->free.n = 1;
> > +	flow->free.next = flow_first_free;
> > +	flow_first_free = FLOW_IDX(flow);
> >  }
> >  
> >  /**
> > @@ -121,18 +155,35 @@ static void flow_table_compact(const struct ctx *c, union flow *hole)
> >   */
> >  void flow_defer_handler(const struct ctx *c, const struct timespec *now)
> >  {
> > +	struct flow_free_block *free_head = NULL;
> > +	unsigned *last_next = &flow_first_free;
> >  	bool timer = false;
> > -	union flow *flow;
> > +	unsigned idx;
> >  
> >  	if (timespec_diff_ms(now, &flow_timer_run) >= FLOW_TIMER_INTERVAL) {
> >  		timer = true;
> >  		flow_timer_run = *now;
> >  	}
> >  
> > -	for (flow = flowtab + flow_count - 1; flow >= flowtab; flow--) {
> > +	for (idx = 0; idx < FLOW_MAX; idx++) {
> > +		union flow *flow = &flowtab[idx];
> >  		bool closed = false;
> >  
> > +		if (flow->f.type == FLOW_TYPE_NONE) {
> > +			/* Start of a free block */
> > +			free_head = &flow->free;
> > +			*last_next = idx;
> > +			last_next = &free_head->next;
> > +			/* Skip the rest of the block */
> > +			idx += free_head->n - 1;
> > +			continue;
> > +		}
> > +
> > +		
> 
> Stray tabs.
> 
> >  		switch (flow->f.type) {
> > +		case FLOW_TYPE_NONE:
> > +			closed = true;
> > +			break;
> >  		case FLOW_TCP:
> >  			closed = tcp_flow_defer(flow);
> >  			break;
> > @@ -146,7 +197,35 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
> >  			;
> >  		}
> >  
> > -		if (closed)
> > -			flow_table_compact(c, flow);
> > +		if (closed) {
> 
> Should this be a flow_del() function perhaps, possibly taking flow and
> free_head as argument?

I suppose it could, but since it's not that long I'd prefer to inline
it to emphasise the fact that for the tracking scheme to work,
deletions *must* be done in the context of scanning the full table.

> 
> > +			flow->f.type = FLOW_TYPE_NONE;
> > +
> > +			if (free_head) {
> > +				/* Add slot to current free block */
> > +				ASSERT(idx == FLOW_IDX(free_head) + free_head->n);
> > +				free_head->n++;
> > +				flow->free.n = flow->free.next = 0;
> > +			} else {
> > +				/* Create new free block */
> > +				free_head = &flow->free;
> > +				free_head->n = 1;
> > +				*last_next = idx;
> > +				last_next = &free_head->next;
> > +			}
> > +		} else {
> > +			free_head = NULL;
> > +		}
> >  	}
> > +
> > +	*last_next = FLOW_MAX;
> > +}
> > +
> > +/**
> > + * flow_init() - Initialise flow related data structures
> > + */
> > +void flow_init(void)
> > +{
> > +	/* Initial state is a single free block containing the whole table */
> > +	flowtab[0].free.n = FLOW_MAX;
> > +	flowtab[0].free.next = FLOW_MAX;
> >  }
> > diff --git a/flow.h b/flow.h
> > index 8064f0e..48a0ab4 100644
> > --- a/flow.h
> > +++ b/flow.h
> > @@ -68,6 +68,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
> >  
> >  union flow;
> >  
> > +void flow_init(void);
> >  void flow_defer_handler(const struct ctx *c, const struct timespec *now);
> >  
> >  void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> > diff --git a/flow_table.h b/flow_table.h
> > index 2773a2b..34fc679 100644
> > --- a/flow_table.h
> > +++ b/flow_table.h
> > @@ -9,6 +9,19 @@
> >  
> >  #include "tcp_conn.h"
> >  
> > +/**
> > + * struct flow_free_block - Information about a block of free entries
> > + * @f:		Generic flow information
> > + * @n:		Number of entries in this free block (including this one)
> > + * @next:	Index of next free block
> > + */
> > +struct flow_free_block {
> > +	/* Must be first element */
> > +	struct flow_common f;
> > +	unsigned n;
> > +	unsigned next;
> > +};
> > +
> >  /**
> >   * union flow - Descriptor for a logical packet flow (e.g. connection)
> >   * @f:		Fields common between all variants
> > @@ -17,12 +30,13 @@
> >  */
> >  union flow {
> >  	struct flow_common f;
> > +	struct flow_free_block free;
> >  	struct tcp_tap_conn tcp;
> >  	struct tcp_splice_conn tcp_splice;
> >  };
> >  
> >  /* Global Flow Table */
> > -extern unsigned flow_count;
> > +extern unsigned flow_first_free;
> >  extern union flow flowtab[];
> >  
> >  
> > diff --git a/passt.c b/passt.c
> > index 71bea8f..d315438 100644
> > --- a/passt.c
> > +++ b/passt.c
> > @@ -285,6 +285,8 @@ int main(int argc, char **argv)
> >  
> >  	clock_gettime(CLOCK_MONOTONIC, &now);
> >  
> > +	flow_init();
> > +
> >  	if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c)))
> >  		exit(EXIT_FAILURE);
> >  
> > diff --git a/tcp.c b/tcp.c
> > index a38deb0..6aacb56 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -1250,29 +1250,6 @@ static void tcp_hash_remove(const struct ctx *c,
> >  	tc_hash[b] = FLOW_SIDX_NONE;
> >  }
> >  
> > -/**
> > - * tcp_tap_conn_update() - Update tcp_tap_conn when being moved in the table
> > - * @c:		Execution context
> > - * @old:	Old location of tcp_tap_conn
> > - * @new:	New location of tcp_tap_conn
> > - */
> > -void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
> > -			 struct tcp_tap_conn *new)
> > -
> > -{
> > -	unsigned b = tcp_hash_probe(c, old);
> > -
> > -	if (!flow_at_sidx(tc_hash[b]))
> > -		return; /* Not in hash table, nothing to update */
> > -
> > -	tc_hash[b] = FLOW_SIDX(new, TAPSIDE);
> > -
> > -	debug("TCP: hash table update: old index %u, new index %u, sock %i, "
> > -	      "bucket: %u", FLOW_IDX(old), FLOW_IDX(new), new->sock, b);
> > -
> > -	tcp_epoll_ctl(c, new);
> > -}
> > -
> >  /**
> >   * tcp_hash_lookup() - Look up connection given remote address and ports
> >   * @c:		Execution context
> > diff --git a/tcp_conn.h b/tcp_conn.h
> > index 636224e..a5f5cfe 100644
> > --- a/tcp_conn.h
> > +++ b/tcp_conn.h
> > @@ -155,9 +155,6 @@ struct tcp_splice_conn {
> >  extern int init_sock_pool4	[TCP_SOCK_POOL_SIZE];
> >  extern int init_sock_pool6	[TCP_SOCK_POOL_SIZE];
> >  
> > -void tcp_tap_conn_update(const struct ctx *c, struct tcp_tap_conn *old,
> > -			 struct tcp_tap_conn *new);
> > -void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new);
> >  bool tcp_flow_defer(union flow *flow);
> >  bool tcp_splice_flow_defer(union flow *flow);
> >  void tcp_splice_timer(const struct ctx *c, union flow *flow);
> > diff --git a/tcp_splice.c b/tcp_splice.c
> > index 0711b19..7aae623 100644
> > --- a/tcp_splice.c
> > +++ b/tcp_splice.c
> > @@ -231,17 +231,6 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn,
> >  	} while (0)
> >  
> >  
> > -/**
> > - * tcp_splice_conn_update() - Update tcp_splice_conn when being moved in the table
> > - * @c:		Execution context
> > - * @new:	New location of tcp_splice_conn
> > - */
> > -void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new)
> > -{
> > -	if (tcp_splice_epoll_ctl(c, new))
> > -		conn_flag(c, new, CLOSING);
> > -}
> > -
> >  /**
> >   * tcp_splice_flow_defer() - Deferred per-flow handling (clean up closed)
> >   * @flow:	Flow table entry for this connection
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 13/13] flow: Avoid moving flow entries to compact table
  2023-12-30 10:33     ` Stefano Brivio
@ 2024-01-01 12:01       ` David Gibson
  2024-01-02 18:13         ` Stefano Brivio
  0 siblings, 1 reply; 40+ messages in thread
From: David Gibson @ 2024-01-01 12:01 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 4401 bytes --]

On Sat, Dec 30, 2023 at 11:33:04AM +0100, Stefano Brivio wrote:
> On Thu, 28 Dec 2023 19:25:25 +0100
> Stefano Brivio <sbrivio@redhat.com> wrote:
> 
> > > On Thu, 21 Dec 2023 17:15:49 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > 
> > > [...]
> >
> > [...]
> >
> > I wonder if we really have to keep track of the number of (non-)entries
> > in the free "block", and if we have to be explicit about the two cases.
> > 
> > I'm trying to find out if we can simplify the whole thing with slightly
> > different variants, for example:
> 
> So... I think the version with (explicit) blocks has this fundamental
> advantage, on deletion:
> 
> > > +	flow->f.type = FLOW_TYPE_NONE;
> > > +	/* Put it back in a length 1 free block, don't attempt to fully reverse
> > > +	 * flow_alloc()'s steps.  This will get folded together the next time
> > > +	 * flow_defer_handler() runs anyway */
> > > +	flow->free.n = 1;
> > > +	flow->free.next = flow_first_free;
> > > +	flow_first_free = FLOW_IDX(flow);
> 
> which is doable even without explicit blocks, but much harder to
> follow.

Remember this is not a general deletion, only a "cancel" of the most
recent allocation.  To reduce fragmentation we are keeping the linked
list of free clusters in strictly ascending order.  The logic above is
only correct if the entry we're freeing is before any other free entry
in the table.  That will be the case for the most recent allocation,
because we always allocate the first free entry in the table.
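
A short trace, with invented indices:

	/* free cluster list: 5 -> 20 -> FLOW_MAX, flow_first_free == 5
	 *
	 * flow_alloc()         returns slot 5; flow_first_free becomes
	 *                      either 6 (cluster shrunk) or 20 (cluster
	 *                      used up)
	 * flow_alloc_cancel()  frees slot 5: it is lower than any
	 *                      remaining free index, so pushing it on
	 *                      the head keeps the list ascending
	 */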

> Other than the comments I have, I would go ahead with your approach. My
> attempt below in case you're interested. The comment at the end of
> flow_del() shows the issue I'm facing:
> 
> struct flow_free_block {
> 	/* Must be first element */
> 	struct flow_common f;
> 
> 	unsigned next_free;
> 	unsigned next_used;
> };
> 
> static union flow *flow_alloc(void)
> {
> 	union flow *flow;
> 
> 	if (flow_first_free >= FLOW_MAX)
> 		return NULL;
> 
> 	flow = flowtab + flow_first_free;
> 
> 	flow_first_free = flow->free.next_free;
> 
> 	memset(flow, 0, sizeof(*flow));		/* TODO: select region */
> 	return flow;
> }
> 
> static void flow_del(union flow *del)
> {
> 	union flow *left, *right, *next_used = NULL, *first_free;
> 
> 	del->f.type = FLOW_TYPE_NONE;
> 
> 	left =  (FLOW_IDX(del) > 0)            ? FLOW(FLOW_IDX(del) - 1) : NULL;
> 	right = (FLOW_IDX(del) < FLOW_MAX - 1) ? FLOW(FLOW_IDX(del) + 1) : NULL;
> 
> 	first_free = flow_first_free < FLOW_MAX ? FLOW(flow_first_free) : NULL;
> 
> 	if (right) {
> 		if (right->f.type == FLOW_TYPE_NONE)
> 			del->free.next_used = right->free.next_used;
> 		else
> 			del->free.next_used = FLOW_IDX(right);
> 	} else {
> 		del->free.next_used = FLOW_MAX;
> 	}
> 
> 	if (left && left->f.type == FLOW_TYPE_NONE) {
> 		left->free.next_free = FLOW_IDX(del);
> 		left->free.next_used = del->free.next_used;
> 		return;
> 	}
> 
> 	if (flow_first_free == FLOW_MAX) {
> 		flow_first_free = FLOW_IDX(del);
> 	} else if (flow_first_free > FLOW_IDX(del)) {
> 		flow_first_free = FLOW_IDX(del);
> 		del->free.next_free = flow_first_free;
> 	} else if (flow_first_free < FLOW_IDX(del)) {
> 		;
> 		/* Walk free slots from flow_first_free until FLOW_IDX(del) to
> 		 * find insertion point... but that makes deletion at most O(n),
> 		 * perhaps O(log(n)), certainly not O(1).

Exactly.  I can't see any way (short of far more complex data
structures) around having a linear scan somewhere, although you can
kind of choose whether it happens on alloc or free.

The idea of my implementation is to have it at free time - but to
merge it with an existing linear scan so it won't have an additional
cost.

> 		 *
> 		 * Or just link out-of-order for the moment, and fix this up in
> 		 * flow_next(), but then "blocks" help representing this.
> 		 */
> 	}
> }
> 
> static union flow *flow_next(union flow *whence)
> {
> 	if (FLOW_IDX(whence) >= FLOW_MAX - 1)
> 		return NULL;
> 
> 	if (whence->f.type != FLOW_TYPE_NONE)
> 		whence++;
> 
> 	if (whence->f.type == FLOW_TYPE_NONE)
> 		return FLOW(whence->free.next_used);
> 
> 	return whence;
> }
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 05/13] flow, tcp: Add flow-centric dispatch for deferred flow handling
  2023-12-31  5:56     ` David Gibson
@ 2024-01-02 18:13       ` Stefano Brivio
  2024-01-03  3:45         ` David Gibson
  0 siblings, 1 reply; 40+ messages in thread
From: Stefano Brivio @ 2024-01-02 18:13 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Sun, 31 Dec 2023 16:56:30 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Dec 28, 2023 at 07:24:46PM +0100, Stefano Brivio wrote:
> > On Thu, 21 Dec 2023 17:15:41 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > tcp_defer_handler(), amongst other things, scans the flow table and does
> > > some processing for each TCP connection.  When we add other protocols to
> > > the flow table, they're likely to want some similar scanning.  It makes
> > > more sense for cache friendliness to perform a single scan of the flow
> > > table and dispatch to the protocol specific handlers, rather than having
> > > each protocol separately scan the table.
> > > 
> > > To that end, add a new flow_defer_handler() handling all flow-linked
> > > deferred operations.
> > > 
> > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > ---
> > >  flow.c     | 23 +++++++++++++++++++++++
> > >  flow.h     |  1 +
> > >  passt.c    |  1 +
> > >  tcp.c      | 19 ++-----------------
> > >  tcp_conn.h |  1 +
> > >  5 files changed, 28 insertions(+), 17 deletions(-)
> > > 
> > > diff --git a/flow.c b/flow.c
> > > index a1c0a34..0a0402d 100644
> > > --- a/flow.c
> > > +++ b/flow.c
> > > @@ -83,3 +83,26 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> > >  
> > >  	logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
> > >  }
> > > +
> > > +/**
> > > + * flow_defer_handler() - Handler for per-flow deferred tasks
> > > + * @c:		Execution context
> > > + */
> > > +void flow_defer_handler(struct ctx *c)
> > > +{
> > > +	union flow *flow;
> > > +
> > > +	for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--) {
> > > +		switch (flow->f.type) {
> > > +		case FLOW_TCP:
> > > +			tcp_flow_defer(c, flow);
> > > +			break;
> > > +		case FLOW_TCP_SPLICE:
> > > +			tcp_splice_flow_defer(c, flow);
> > > +			break;
> > > +		default:
> > > +			/* Assume other flow types don't need any handling */
> > > +			;
> > > +		}
> > > +	}
> > > +}
> > > diff --git a/flow.h b/flow.h
> > > index 959b461..6b17fa8 100644
> > > --- a/flow.h
> > > +++ b/flow.h
> > > @@ -67,6 +67,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
> > >  union flow;
> > >  
> > >  void flow_table_compact(struct ctx *c, union flow *hole);
> > > +void flow_defer_handler(struct ctx *c);
> > >  
> > >  void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> > >  	__attribute__((format(printf, 3, 4)));
> > > diff --git a/passt.c b/passt.c
> > > index 0246b04..5f72a28 100644
> > > --- a/passt.c
> > > +++ b/passt.c
> > > @@ -103,6 +103,7 @@ static void post_handler(struct ctx *c, const struct timespec *now)
> > >  	/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
> > >  	CALL_PROTO_HANDLER(c, now, icmp, ICMP);
> > >  
> > > +	flow_defer_handler(c);
> > >  #undef CALL_PROTO_HANDLER
> > >  }
> > >  
> > > diff --git a/tcp.c b/tcp.c
> > > index ad1a70d..9230d80 100644
> > > --- a/tcp.c
> > > +++ b/tcp.c
> > > @@ -1306,7 +1306,7 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c,
> > >   * @c:		Execution context
> > >   * @flow:	Flow table entry for this connection
> > >   */
> > > -static void tcp_flow_defer(struct ctx *c, union flow *flow)
> > > +void tcp_flow_defer(struct ctx *c, union flow *flow)
> > >  {
> > >  	const struct tcp_tap_conn *conn = &flow->tcp;
> > >  
> > > @@ -1364,26 +1364,11 @@ static void tcp_l2_data_buf_flush(const struct ctx *c)
> > >   * tcp_defer_handler() - Handler for TCP deferred tasks
> > >   * @c:		Execution context
> > >   */
> > > +/* cppcheck-suppress constParameterPointer */  
> > 
> > This needs to be:
> > 
> >   /* cppcheck-suppress [constParameterPointer, unmatchedSuppression] */
> > 
> > otherwise we get warnings with cppcheck 2.10,  
> 
> Drat, do we?  I was hoping this was a new warning type with the newer
> cppcheck, and it would ignore the suppression if it was for a warning
> type it didn't know about.

Yeah... go figure. On the other hand it's not really a new type in the
sense that this _should_ have been covered by a "constParameter"
warning, before 2.11.

> > and we'll get warnings if
> > cppcheck's behaviour ever changes again.  
> 
> That's actually a good thing.  This one isn't a workaround for a
> cppcheck false positive or weird semantic that we hope will go away.
> > The warning is real and correct as far as it goes.  The problem is
> that the signature needs to match that of other deferred handlers
> because of how we generate the calls from a macro.  Some of those
> others need write access to the context.

Oops, I didn't realise. Well, in any case, then we don't expect
cppcheck's behaviour to ever change in this regard, so I don't see any
advantage in omitting unmatchedSuppression here.

-- 
Stefano


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/13] flow: Move flow_count from context structure to a global
  2023-12-31  5:58     ` David Gibson
@ 2024-01-02 18:13       ` Stefano Brivio
  2024-01-03  3:54         ` David Gibson
  0 siblings, 1 reply; 40+ messages in thread
From: Stefano Brivio @ 2024-01-02 18:13 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Sun, 31 Dec 2023 16:58:39 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Dec 28, 2023 at 07:25:18PM +0100, Stefano Brivio wrote:
> > On Thu, 21 Dec 2023 17:15:46 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > In general, the passt code is a bit haphazard about what's a true global
> > > variable and what's in the quasi-global 'context structure'.  The
> > > flow_count field is one such example: it's in the context structure,
> > > although it's really part of the same data structure as flowtab[], which
> > > is a genuine global.  
> > 
> > Well, the reason is that flowtab[FLOW_MAX] might be problematically
> > too big to live on the stack, unlike flow_count.
> > 
> > But anyway, as far as thoughts of multithreading are concerned, both
> > should probably be global. And sure, it's more consistent this way.
> >   
> > > Move flow_count to be a regular global to match.  For now it needs to be
> > > public, rather than static, but we expect to be able to change that in
> > > future.  
> > 
> > If it's not static, it should be initialised, and that's not done here.  
> 
> > Uh... what?  "static" here means module-global rather than
> global-global, which has no bearing on initialisation.  AFAIK globals
> are zero-initialised whether they're static or not.

...and to my utter surprise, I just discovered that if you talk C11,
you're right. From the N1570 draft (ISO/IEC 9899:201x), Section 6.7.9
"Initialization", clause 10:

  If an object that has automatic storage duration is not initialized
  explicitly, its value is indeterminate. If an object that has static
  or thread storage duration is not initialized explicitly, then:

  [...]

  — if it has arithmetic type, it is initialized to (positive or
    unsigned) zero;

And 'flow_count' has thread storage duration. In C99, however (draft
N1256), Section 6.7.8 "Initialization", clause 10:

  If an object that has automatic storage duration is not initialized
  explicitly, its value is indeterminate. If an object that has static
  storage duration is not initialized explicitly, then:

  [...]

note the missing "or thread storage duration".

C89, the one I was actually basing my observation on, says, at 3.5.7
"Initialization":

  If an object that has static storage duration is not initialized
  explicitly, it is initialized implicitly as if every member that has
  arithmetic type were assigned 0 and every member that has pointer type
  were assigned a null pointer constant.  If an object that has
  automatic storage duration is not initialized explicitly, its value is
  indeterminate.

so... um. We won't go back to C99. But to me, and maybe others, not
having a "= 0;" for a "global" means pretty much that we don't rely on
any particular initial value.

If you're strongly against it, fine, C11 says it's zero... but it makes
me a bit nervous nevertheless.

-- 
Stefano


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 13/13] flow: Avoid moving flow entries to compact table
  2024-01-01 12:01       ` David Gibson
@ 2024-01-02 18:13         ` Stefano Brivio
  2024-01-04 10:02           ` David Gibson
  0 siblings, 1 reply; 40+ messages in thread
From: Stefano Brivio @ 2024-01-02 18:13 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Mon, 1 Jan 2024 23:01:17 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Sat, Dec 30, 2023 at 11:33:04AM +0100, Stefano Brivio wrote:
> > On Thu, 28 Dec 2023 19:25:25 +0100
> > Stefano Brivio <sbrivio@redhat.com> wrote:
> >   
> > > > On Thu, 21 Dec 2023 17:15:49 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > 
> > > > [...]  
> > >
> > > [...]
> > >
> > > I wonder if we really have to keep track of the number of (non-)entries
> > > in the free "block", and if we have to be explicit about the two cases.
> > > 
> > > I'm trying to find out if we can simplify the whole thing with slightly
> > > different variants, for example:  
> > 
> > So... I think the version with (explicit) blocks has this fundamental
> > advantage, on deletion:
> >   
> > > > +	flow->f.type = FLOW_TYPE_NONE;
> > > > +	/* Put it back in a length 1 free block, don't attempt to fully reverse
> > > > +	 * flow_alloc()'s steps.  This will get folded together the next time
> > > > +	 * flow_defer_handler() runs anyway */
> > > > +	flow->free.n = 1;
> > > > +	flow->free.next = flow_first_free;
> > > > +	flow_first_free = FLOW_IDX(flow);  
> > 
> > which is doable even without explicit blocks, but much harder to
> > follow.  
> 
> Remember this is not a general deletion, only a "cancel" of the most
> recent allocation.

Oh, I thought that was only the case for this series and you would use
that as actual deletion in another pending series (which I haven't
finished reviewing yet).

But now I'm not sure anymore why I was thinking this...

Anyway... do we really need it, then? Can't we just mark the "failed"
flows as whatever means "closed" for a specific protocol, and clean
them up later, instead of calling cancel() right away?

>  To reduce fragmentation we are keeping the linked
> list of free clusters in strictly ascending order.  The logic above is
> only correct if the entry we're freeing is before any other free entry
> in the table.  That will be the case for the most recent allocation,
> > because we always allocate the first free entry in the table.

I see now. This isn't entirely clear from the "Theory of Operation";
on the other hand, it's probably a bad idea to overload that
description with all these details.

> > Other than the comments I have, I would go ahead with your approach. My
> > attempt below in case you're interested. The comment at the end of
> > flow_del() shows the issue I'm facing:
> > 
> > struct flow_free_block {
> > 	/* Must be first element */
> > 	struct flow_common f;
> > 
> > 	unsigned next_free;
> > 	unsigned next_used;
> > };
> > 
> > static union flow *flow_alloc(void)
> > {
> > 	union flow *flow;
> > 
> > 	if (flow_first_free >= FLOW_MAX)
> > 		return NULL;
> > 
> > 	flow = flowtab + flow_first_free;
> > 
> > 	flow_first_free = flow->free.next_free;
> > 
> > 	memset(flow, 0, sizeof(*flow));		/* TODO: select region */
> > 	return flow;
> > }
> > 
> > static void flow_del(union flow *del)
> > {
> > 	union flow *left, *right, *next_used = NULL, *first_free;  
> > 
> > 	del->f.type = FLOW_TYPE_NONE;
> > 
> > 	left =  (FLOW_IDX(del) > 0)            ? FLOW(FLOW_IDX(del) - 1) : NULL;
> > 	right = (FLOW_IDX(del) < FLOW_MAX - 1) ? FLOW(FLOW_IDX(del) + 1) : NULL;
> > 
> > 	first_free = flow_first_free < FLOW_MAX ? FLOW(flow_first_free) : NULL;
> > 
> > 	if (right) {
> > 		if (right->f.type == FLOW_TYPE_NONE)
> > 			del->free.next_used = right->free.next_used;
> > 		else
> > 			del->free.next_used = FLOW_IDX(right);
> > 	} else {
> > 		del->free.next_used = FLOW_MAX;
> > 	}
> > 
> > 	if (left && left->f.type == FLOW_TYPE_NONE) {
> > 		left->free.next_free = FLOW_IDX(del);
> > 		left->free.next_used = del->free.next_used;
> > 		return;
> > 	}
> > 
> > 	if (flow_first_free == FLOW_MAX) {
> > 		flow_first_free = FLOW_IDX(del);
> > 	} else if (flow_first_free > FLOW_IDX(del)) {
> > 		flow_first_free = FLOW_IDX(del);
> > 		del->free.next_free = flow_first_free;
> > 	} else if (flow_first_free < FLOW_IDX(del)) {
> > 		;
> > 		/* Walk free slots from flow_first_free until FLOW_IDX(del) to
> > 		 * find insertion point... but that makes deletion at most O(n),
> > 		 * perhaps O(log(n)), certainly not O(1).  
> 
> Exactly.  I can't see any way (short of far more complex data
> structures) around having a linear scan somewhere, although you can
> kind of choose whether it happens on alloc or free.
> 
> The idea of my implementation is to have it at free time - but to
> merge it with an existing linear scan so it won't have an additional
> cost.

Right, yes, I see now. This idea is also not terribly clear from the
description, by the way, but I don't have a good proposal right now.

-- 
Stefano


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 13/13] flow: Avoid moving flow entries to compact table
  2024-01-01 10:44     ` David Gibson
@ 2024-01-02 18:13       ` Stefano Brivio
  2024-01-05  9:45         ` David Gibson
  0 siblings, 1 reply; 40+ messages in thread
From: Stefano Brivio @ 2024-01-02 18:13 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Mon, 1 Jan 2024 21:44:54 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Dec 28, 2023 at 07:25:25PM +0100, Stefano Brivio wrote:
> > On Thu, 21 Dec 2023 17:15:49 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > [...]
> >
> > >  void flow_defer_handler(const struct ctx *c, const struct timespec *now)
> > >  {
> > > +	struct flow_free_block *free_head = NULL;
> > > +	unsigned *last_next = &flow_first_free;
> > >  	bool timer = false;
> > > -	union flow *flow;
> > > +	unsigned idx;
> > >  
> > >  	if (timespec_diff_ms(now, &flow_timer_run) >= FLOW_TIMER_INTERVAL) {
> > >  		timer = true;
> > >  		flow_timer_run = *now;
> > >  	}
> > >  
> > > -	for (flow = flowtab + flow_count - 1; flow >= flowtab; flow--) {
> > > +	for (idx = 0; idx < FLOW_MAX; idx++) {
> > > +		union flow *flow = &flowtab[idx];
> > >  		bool closed = false;
> > >  
> > > +		if (flow->f.type == FLOW_TYPE_NONE) {
> > > +			/* Start of a free block */
> > > +			free_head = &flow->free;
> > > +			*last_next = idx;
> > > +			last_next = &free_head->next;
> > > +			/* Skip the rest of the block */
> > > +			idx += free_head->n - 1;
> > > +			continue;
> > > +		}
> > > +
> > > +		  
> > 
> > Stray tabs.
> >   
> > >  		switch (flow->f.type) {
> > > +		case FLOW_TYPE_NONE:
> > > +			closed = true;
> > > +			break;

...more important than stray tabs, I noticed only now: how would we ever
hit this switch case, if you're checking for this in the if clause just
before?

> > >  		case FLOW_TCP:
> > >  			closed = tcp_flow_defer(flow);
> > >  			break;

-- 
Stefano


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 05/13] flow, tcp: Add flow-centric dispatch for deferred flow handling
  2024-01-02 18:13       ` Stefano Brivio
@ 2024-01-03  3:45         ` David Gibson
  0 siblings, 0 replies; 40+ messages in thread
From: David Gibson @ 2024-01-03  3:45 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 5486 bytes --]

On Tue, Jan 02, 2024 at 07:13:28PM +0100, Stefano Brivio wrote:
> On Sun, 31 Dec 2023 16:56:30 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Dec 28, 2023 at 07:24:46PM +0100, Stefano Brivio wrote:
> > > On Thu, 21 Dec 2023 17:15:41 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > tcp_defer_handler(), amongst other things, scans the flow table and does
> > > > some processing for each TCP connection.  When we add other protocols to
> > > > the flow table, they're likely to want some similar scanning.  It makes
> > > > more sense for cache friendliness to perform a single scan of the flow
> > > > table and dispatch to the protocol specific handlers, rather than having
> > > > each protocol separately scan the table.
> > > > 
> > > > To that end, add a new flow_defer_handler() handling all flow-linked
> > > > deferred operations.
> > > > 
> > > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > > ---
> > > >  flow.c     | 23 +++++++++++++++++++++++
> > > >  flow.h     |  1 +
> > > >  passt.c    |  1 +
> > > >  tcp.c      | 19 ++-----------------
> > > >  tcp_conn.h |  1 +
> > > >  5 files changed, 28 insertions(+), 17 deletions(-)
> > > > 
> > > > diff --git a/flow.c b/flow.c
> > > > index a1c0a34..0a0402d 100644
> > > > --- a/flow.c
> > > > +++ b/flow.c
> > > > @@ -83,3 +83,26 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> > > >  
> > > >  	logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
> > > >  }
> > > > +
> > > > +/**
> > > > + * flow_defer_handler() - Handler for per-flow deferred tasks
> > > > + * @c:		Execution context
> > > > + */
> > > > +void flow_defer_handler(struct ctx *c)
> > > > +{
> > > > +	union flow *flow;
> > > > +
> > > > +	for (flow = flowtab + c->flow_count - 1; flow >= flowtab; flow--) {
> > > > +		switch (flow->f.type) {
> > > > +		case FLOW_TCP:
> > > > +			tcp_flow_defer(c, flow);
> > > > +			break;
> > > > +		case FLOW_TCP_SPLICE:
> > > > +			tcp_splice_flow_defer(c, flow);
> > > > +			break;
> > > > +		default:
> > > > +			/* Assume other flow types don't need any handling */
> > > > +			;
> > > > +		}
> > > > +	}
> > > > +}
> > > > diff --git a/flow.h b/flow.h
> > > > index 959b461..6b17fa8 100644
> > > > --- a/flow.h
> > > > +++ b/flow.h
> > > > @@ -67,6 +67,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
> > > >  union flow;
> > > >  
> > > >  void flow_table_compact(struct ctx *c, union flow *hole);
> > > > +void flow_defer_handler(struct ctx *c);
> > > >  
> > > >  void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> > > >  	__attribute__((format(printf, 3, 4)));
> > > > diff --git a/passt.c b/passt.c
> > > > index 0246b04..5f72a28 100644
> > > > --- a/passt.c
> > > > +++ b/passt.c
> > > > @@ -103,6 +103,7 @@ static void post_handler(struct ctx *c, const struct timespec *now)
> > > >  	/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
> > > >  	CALL_PROTO_HANDLER(c, now, icmp, ICMP);
> > > >  
> > > > +	flow_defer_handler(c);
> > > >  #undef CALL_PROTO_HANDLER
> > > >  }
> > > >  
> > > > diff --git a/tcp.c b/tcp.c
> > > > index ad1a70d..9230d80 100644
> > > > --- a/tcp.c
> > > > +++ b/tcp.c
> > > > @@ -1306,7 +1306,7 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c,
> > > >   * @c:		Execution context
> > > >   * @flow:	Flow table entry for this connection
> > > >   */
> > > > -static void tcp_flow_defer(struct ctx *c, union flow *flow)
> > > > +void tcp_flow_defer(struct ctx *c, union flow *flow)
> > > >  {
> > > >  	const struct tcp_tap_conn *conn = &flow->tcp;
> > > >  
> > > > @@ -1364,26 +1364,11 @@ static void tcp_l2_data_buf_flush(const struct ctx *c)
> > > >   * tcp_defer_handler() - Handler for TCP deferred tasks
> > > >   * @c:		Execution context
> > > >   */
> > > > +/* cppcheck-suppress constParameterPointer */  
> > > 
> > > This needs to be:
> > > 
> > >   /* cppcheck-suppress [constParameterPointer, unmatchedSuppression] */
> > > 
> > > otherwise we get warnings with cppcheck 2.10,  
> > 
> > Drat, do we?  I was hoping this was a new warning type with the newer
> > cppcheck, and it would ignore the suppression if it was for a warning
> > type it didn't know about.
> 
> Yeah... go figure. On the other hand it's not really a new type in the
> sense that this _should_ have been covered by a "constParameter"
> warning, before 2.11.

Dang.  Oh well, updated.

> > > and we'll get warnings if
> > > cppcheck's behaviour ever changes again.  
> > 
> > That's actually a good thing.  This one isn't a workaround for a
> > cppcheck false positive or weird semantic that we hope will go away.
> > > The warning is real and correct as far as it goes.  The problem is
> > that the signature needs to match that of other deferred handlers
> > because of how we generate the calls from a macro.  Some of those
> > others need write access to the context.
> 
> Oops, I didn't realise. Well, in any case, then we don't expect
> cppcheck's behaviour to ever change in this regard, so I don't see any
> advantage in omitting unmatchedSuppression here.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/13] flow: Move flow_count from context structure to a global
  2024-01-02 18:13       ` Stefano Brivio
@ 2024-01-03  3:54         ` David Gibson
  2024-01-03  7:08           ` Stefano Brivio
  0 siblings, 1 reply; 40+ messages in thread
From: David Gibson @ 2024-01-03  3:54 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 4071 bytes --]

On Tue, Jan 02, 2024 at 07:13:35PM +0100, Stefano Brivio wrote:
> On Sun, 31 Dec 2023 16:58:39 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Dec 28, 2023 at 07:25:18PM +0100, Stefano Brivio wrote:
> > > On Thu, 21 Dec 2023 17:15:46 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > In general, the passt code is a bit haphazard about what's a true global
> > > > variable and what's in the quasi-global 'context structure'.  The
> > > > flow_count field is one such example: it's in the context structure,
> > > > although it's really part of the same data structure as flowtab[], which
> > > > is a genuine global.  
> > > 
> > > Well, the reason is that flowtab[FLOW_MAX] might be problematically
> > > too big to live on the stack, unlike flow_count.
> > > 
> > > But anyway, as far as thoughts of multithreading are concerned, both
> > > should probably be global. And sure, it's more consistent this way.
> > >   
> > > > Move flow_count to be a regular global to match.  For now it needs to be
> > > > public, rather than static, but we expect to be able to change that in
> > > > future.  
> > > 
> > > If it's not static, it should be initialised, and that's not done here.  
> > 
> > > Uh... what?  "static" here means module-global rather than
> > global-global, which has no bearing on initialisation.  AFAIK globals
> > are zero-initialised whether they're static or not.
> 
> ...and to my utter surprise, I just discovered that if you talk C11,
> you're right. From the N1570 draft (ISO/IEC 9899:201x), Section 6.7.9
> "Initialization", clause 10:
> 
>   If an object that has automatic storage duration is not initialized
>   explicitly, its value is indeterminate. If an object that has static
>   or thread storage duration is not initialized explicitly, then:
> 
>   [...]
> 
>   — if it has arithmetic type, it is initialized to (positive or
>     unsigned) zero;
> 
> And 'flow_count' has thread storage duration.

No... I don't think it does.  AFAICT only thread-local variables have
thread storage duration.  As a global, flow_count will have static
storage duration, even without the static keyword.
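
That is (names other than flow_count invented for illustration):

	unsigned flow_count;		/* file scope, external linkage:
					 * static storage duration */
	static unsigned module_local;	/* internal linkage: also static
					 * storage duration */
	_Thread_local unsigned scratch;	/* this, and only this, has
					 * thread storage duration (C11) */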

> In C99, however (draft
> N1256), Section 6.7.8 "Initialization", clause 10:
> 
>   If an object that has automatic storage duration is not initialized
>   explicitly, its value is indeterminate. If an object that has static
>   storage duration is not initialized explicitly, then:
> 
>   [...]
> 
> note the missing "or thread storage duration".
> 
> C89, the one I was actually basing my observation on, says, at 3.5.7
> "Initialization":
> 
>   If an object that has static storage duration is not initialized
>   explicitly, it is initialized implicitly as if every member that has
>   arithmetic type were assigned 0 and every member that has pointer type
>   were assigned a null pointer constant.  If an object that has
>   automatic storage duration is not initialized explicitly, its value is
>   indeterminate.
> 
> so... um. We won't go back to C99. But to me, and maybe others, not
> having a "= 0;" for a "global" means pretty much that we don't rely on
> any particular initial value.

Again, I'm pretty sure that's not true, even for C99 and C89.  AIUI,
> 'static' locals and *all* globals have "static storage duration".

I'm not sure where to get the actual text of the standards but see for
example

https://en.cppreference.com/w/c/language/static_storage_duration

Here 'flow_count' has external linkage, thus satisfying the conditions
for static storage duration.

Fwiw, I'm pretty sure the kernel has relied on zero-initialization of
non-static globals for many years.
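
A minimal check of that behaviour (assuming a hosted C11 environment;
flow_count stands in for any file-scope object):

	#include <stdio.h>

	unsigned flow_count;		/* external linkage, no "= 0" */

	int main(void)
	{
		/* Static storage duration implies implicit
		 * zero-initialization before program startup
		 * (C11 6.7.9p10), so this prints "0".
		 */
		printf("%u\n", flow_count);
		return 0;
	}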

> If you're strongly against it, fine, C11 says it's zero... but it makes
> me a bit nervous nevertheless.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/13] flow: Move flow_count from context structure to a global
  2024-01-03  3:54         ` David Gibson
@ 2024-01-03  7:08           ` Stefano Brivio
  2024-01-04  9:51             ` David Gibson
  0 siblings, 1 reply; 40+ messages in thread
From: Stefano Brivio @ 2024-01-03  7:08 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Wed, 3 Jan 2024 14:54:27 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> I'm not sure where to get the actual text of the standards

Let me answer this first: one (the?) trick is to use so-called final
drafts, which are made freely available (same as working drafts) by the
Working Group.

Those are not the same as the standards, but differences from the final
draft are also published... and they are usually not substantial.

This is _very_ informative:
  https://stackoverflow.com/questions/81656/where-do-i-find-the-current-c-or-c-standard-documents

Wikipedia also has the links, by the way. Anyway, in practice:

- C11: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf
- C99: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf
- C89:
  https://web.archive.org/web/20200909074736if_/https://www.pdf-archive.com/2014/10/02/ansi-iso-9899-1990-1/ansi-iso-9899-1990-1.pdf

> On Tue, Jan 02, 2024 at 07:13:35PM +0100, Stefano Brivio wrote:
> > On Sun, 31 Dec 2023 16:58:39 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Thu, Dec 28, 2023 at 07:25:18PM +0100, Stefano Brivio wrote:  
> > > > On Thu, 21 Dec 2023 17:15:46 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >     
> > > > > In general, the passt code is a bit haphazard about what's a true global
> > > > > variable and what's in the quasi-global 'context structure'.  The
> > > > > flow_count field is one such example: it's in the context structure,
> > > > > although it's really part of the same data structure as flowtab[], which
> > > > > is a genuine global.    
> > > > 
> > > > Well, the reason is that flowtab[FLOW_MAX] might be problematically
> > > > too big to live on the stack, unlike flow_count.
> > > > 
> > > > But anyway, as far as thoughts of multithreading are concerned, both
> > > > should probably be global. And sure, it's more consistent this way.
> > > >     
> > > > > Move flow_count to be a regular global to match.  For now it needs to be
> > > > > public, rather than static, but we expect to be able to change that in
> > > > > future.    
> > > > 
> > > > If it's not static, it should be initialised, and that's not done here.    
> > > 
> > > Uh... what?  "static" here is meaning module-global rather than
> > > global-global, which has no bearing on initialisation.  AFAIK globals
> > > are zero-initialised whether they're static or not.  
> > 
> > ...and to my utter surprise, I just discovered that if you talk C11,
> > you're right. From the N1570 draft (ISO/IEC 9899:201x), Section 6.7.9
> > "Initialization", clause 10:
> > 
> >   If an object that has automatic storage duration is not initialized
> >   explicitly, its value is indeterminate. If an object that has static
> >   or thread storage duration is not initialized explicitly, then:
> > 
> >   [...]
> > 
> >   — if it has arithmetic type, it is initialized to (positive or
> >     unsigned) zero;
> > 
> > And 'flow_count' has thread storage duration.  
> 
> No.. I don't think it does.  AFAICT only thread-local variables have
> thread storage duration.  As a global flow_count will have static
> storage duration, even without the static keyword.

So, C11 defines static storage duration here:

  6.2.4 Storage durations of objects

  [...]

  3 An object whose identifier is declared without the storage-class
  specifier _Thread_local, and either with external or internal linkage
  or with the storage-class specifier static, has static storage
  duration. Its lifetime is the entire execution of the program and its
  stored value is initialized only once, prior to program startup.

do we have any linkage here? I would have said no -- but, going back
to C99 for this, "6.2.2 Linkages of identifiers":

  5 [...] If the declaration of an identifier for an object has file
  scope and no storage-class specifier, its linkage is external.

which supports your paragraph below.

By the way, C11 now says:

  6.11.2 Linkages of identifiers

  1 Declaring an identifier with internal linkage at file scope without
  the static storage-class specifier is an obsolescent feature

> > In C99, however (draft
> > N1256), Section 6.7.8 "Initialization", clause 10:
> > 
> >   If an object that has automatic storage duration is not initialized
> >   explicitly, its value is indeterminate. If an object that has static
> >   storage duration is not initialized explicitly, then:
> > 
> >   [...]
> > 
> > note the missing "or thread storage duration".
> > 
> > C89, the one I was actually basing my observation on, says, at 3.5.7
> > "Initialization":
> > 
> >   If an object that has static storage duration is not initialized
> >   explicitly, it is initialized implicitly as if every member that has
> >   arithmetic type were assigned 0 and every member that has pointer type
> >   were assigned a null pointer constant.  If an object that has
> >   automatic storage duration is not initialized explicitly, its value is
> >   indeterminate.
> > 
> > so... um. We won't go back to C99. But to me, and maybe others, not
> > having a "= 0;" for a "global" means pretty much that we don't rely on
> > any particular initial value.  
> 
> Again, I'm pretty sure that's not true, even for C99 and C89.  AIUI,
> > 'static' locals and *all* globals have "static storage duration".
> 
> I'm not sure where to get the actual text of the standards but see for
> example
> 
> https://en.cppreference.com/w/c/language/static_storage_duration
> 
> Here 'flow_count' has external linkage, thus satisfying the conditions
> for static storage duration.

Right. Well, for C99 and C11 at least. For C89 things are slightly
different:

  6.1.2.4 Storage durations of objects

  [...]

  An object whose identifier is declared with external or internal
  linkage, or with the storage-class specifier static has static storage
  duration.

  [...]

  An object whose identifier is declared with no linkage and without the
  storage-class specifier static has automatic storage duration.

You might say it has external linkage. But it was not *declared with*
external linkage -- it just happens to have it (C89 and C99 don't
differ here).

> Fwiw, I'm pretty sure the kernel has relied on zero-initialization of
> non-static globals for many years.

True, and the opposite is even considered as a style issue since 2007,
commit f0a594c1c74f ("update checkpatch.pl to version 0.08"). I also
found a discussion similar to this one:
  https://lore.kernel.org/all/20201102184147.GA42288@localhost/#r

Anyway... a couple of years before that, it must have been a gcc version
in the late 2.x, I actually hit an issue with it. Was it a compiler
issue, or the correct interpretation of C89? Or maybe something on the
lines of:
  https://www.thegoodpenguin.co.uk/blog/u-boot-relocation-bss-hang/

? I'll never know/remember. And then, after reading the standard, I
started obsessively adding those = 0. Well, good to know I can stop
doing it now. :)
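
To make the rule concrete, a minimal standalone sketch -- illustrative
only, not code from the tree:

  /* File scope, no 'static', no initializer: external linkage, hence
   * static storage duration, hence zero-initialized before startup
   * (C99 6.7.8p10, C11 6.7.9p10). */
  unsigned flow_count;

  /* Internal linkage, same storage duration, same guarantee. */
  static unsigned timer_runs;

  int main(void)
  {
  	/* Both read as 0 here without any "= 0" at the definition. */
  	return (int)(flow_count + timer_runs);
  }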

-- 
Stefano


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/13] flow: Move flow_count from context structure to a global
  2024-01-03  7:08           ` Stefano Brivio
@ 2024-01-04  9:51             ` David Gibson
  2024-01-05  7:55               ` Stefano Brivio
  0 siblings, 1 reply; 40+ messages in thread
From: David Gibson @ 2024-01-04  9:51 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 8166 bytes --]

On Wed, Jan 03, 2024 at 08:08:34AM +0100, Stefano Brivio wrote:
> On Wed, 3 Jan 2024 14:54:27 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > I'm not sure where to get the actual text of the standards
> 
> Let me answer this first: one (the?) trick is to use so-called final
> drafts, which are made freely available (same as working drafts) by the
> Working Group.
> 
> Those are not the same as the standards, but differences from the final
> draft are also published... and they are usually not substantial.
> 
> This is _very_ informative:
>   https://stackoverflow.com/questions/81656/where-do-i-find-the-current-c-or-c-standard-documents

Ah, thanks.

> Wikipedia also has the links, by the way. Anyway, in practice:
> 
> - C11: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf
> - C99: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf
> - C89:
>   https://web.archive.org/web/20200909074736if_/https://www.pdf-archive.com/2014/10/02/ansi-iso-9899-1990-1/ansi-iso-9899-1990-1.pdf
> 
> > On Tue, Jan 02, 2024 at 07:13:35PM +0100, Stefano Brivio wrote:
> > > On Sun, 31 Dec 2023 16:58:39 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > On Thu, Dec 28, 2023 at 07:25:18PM +0100, Stefano Brivio wrote:  
> > > > > On Thu, 21 Dec 2023 17:15:46 +1100
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >     
> > > > > > In general, the passt code is a bit haphazard about what's a true global
> > > > > > variable and what's in the quasi-global 'context structure'.  The
> > > > > > flow_count field is one such example: it's in the context structure,
> > > > > > although it's really part of the same data structure as flowtab[], which
> > > > > > is a genuine global.    
> > > > > 
> > > > > Well, the reason is that flow_tab[FLOW_MAX] might be problematically
> > > > > too big to live on the stack, unlike flow_count.
> > > > > 
> > > > > But anyway, as far as thoughts of multithreading are concerned, both
> > > > > should probably be global. And sure, it's more consistent this way.
> > > > >     
> > > > > > Move flow_count to be a regular global to match.  For now it needs to be
> > > > > > public, rather than static, but we expect to be able to change that in
> > > > > > future.    
> > > > > 
> > > > > If it's not static, it should be initialised, and that's not done here.    
> > > > 
> > > > Uh... what?  "static" here means module-global rather than
> > > > global-global, which has no bearing on initialisation.  AFAIK globals
> > > > are zero-initialised whether they're static or not.  
> > > 
> > > ...and to my utter surprise, I just discovered that if you talk C11,
> > > you're right. From the N1570 draft (ISO/IEC 9899:201x), Section 6.7.9
> > > "Initialization", clause 10:
> > > 
> > >   If an object that has automatic storage duration is not initialized
> > >   explicitly, its value is indeterminate. If an object that has static
> > >   or thread storage duration is not initialized explicitly, then:
> > > 
> > >   [...]
> > > 
> > >   — if it has arithmetic type, it is initialized to (positive or
> > >     unsigned) zero;
> > > 
> > > And 'flow_count' has thread storage duration.  
> > 
> > No.. I don't think it does.  AFAICT only thread-local variables have
> > thread storage duration.  As a global flow_count will have static
> > storage duration, even without the static keyword.
> 
> So, C11 defines static storage duration here:
> 
>   6.2.4 Storage durations of objects
> 
>   [...]
> 
>   3 An object whose identifier is declared without the storage-class
>   specifier _Thread_local, and either with external or internal linkage
>   or with the storage-class specifier static, has static storage
>   duration. Its lifetime is the entire execution of the program and its
>   stored value is initialized only once, prior to program startup.
> 
> do we have any linkage here? I would have said no -- but, going back
> to C99 for this, "6.2.2 Linkages of identifiers":
> 
>   5 [...] If the declaration of an identifier for an object has file
>   scope and no storage-class specifier, its linkage is external.
> 
> which supports your paragraph below.

Right.

> By the way, C11 now says:
> 
>   6.11.2 Linkages of identifiers
> 
>   1 Declaring an identifier with internal linkage at file scope without
>   the static storage-class specifier is an obsolescent feature

Ok.  I'm not even sure how you would do that.

> > > In C99, however (draft
> > > N1256), Section 6.7.8 "Initialization", clause 10:
> > > 
> > >   If an object that has automatic storage duration is not initialized
> > >   explicitly, its value is indeterminate. If an object that has static
> > >   storage duration is not initialized explicitly, then:
> > > 
> > >   [...]
> > > 
> > > note the missing "or thread storage duration".
> > > 
> > > C89, the one I was actually basing my observation on, says, at 3.5.7
> > > "Initialization":
> > > 
> > >   If an object that has static storage duration is not initialized
> > >   explicitly, it is initialized implicitly as if every member that has
> > >   arithmetic type were assigned 0 and every member that has pointer type
> > >   were assigned a null pointer constant.  If an object that has
> > >   automatic storage duration is not initialized explicitly, its value is
> > >   indeterminate.
> > > 
> > > so... um. We won't go back to C99. But to me, and maybe others, not
> > > having a "= 0;" for a "global" means pretty much that we don't rely on
> > > any particular initial value.  
> > 
> > Again, I'm pretty sure that's not true, even for C99 and C89.  AIUI,
> > 'static' locals and *all* globals have "static storage duration".
> > 
> > I'm not sure where to get the actual text of the standards but see for
> > example
> > 
> > https://en.cppreference.com/w/c/language/static_storage_duration
> > 
> > Here 'flow_count' has external linkage, thus satisfying the conditions
> > for static storage duration.
> 
> Right. Well, for C99 and C11 at least. For C89 things are slightly
> different:
> 
>   6.1.2.4 Storage durations of objects
> 
>   [...]
> 
>   An object whose identifier is declared with external or internal
>   linkage, or with the storage-class specifier static has static storage
>   duration.
> 
>   [...]
> 
>   An object whose identifier is declared with no linkage and without the
>   storage-class specifier static has automatic storage duration.
> 
> You might say it has external linkage. But it was not *declared with*
> external linkage -- it just happens to have it (C89 and C99 don't
> differ here).

Hrm.  We do have:
	extern unsigned flow_first_free;
in flow_table.h.  Does that count as declaring with external linkage?
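
For concreteness, that's the usual two-file pattern -- the header path
is real, the comments are mine:

  /* flow_table.h */
  extern unsigned flow_first_free;	/* declaration, external linkage */

  /* flow.c */
  unsigned flow_first_free;		/* tentative definition; static
  					 * storage duration, implicitly 0 */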

> > Fwiw, I'm pretty sure the kernel has relied on zero-initialization of
> > non-static globals for many years.
> 
> True, and the opposite is even considered as a style issue since 2007,
> commit f0a594c1c74f ("update checkpatch.pl to version 0.08"). I also
> found a discussion similar to this one:
>   https://lore.kernel.org/all/20201102184147.GA42288@localhost/#r
> 
> Anyway... a couple of years before that, it must have been a gcc version
> in the late 2.x, I actually hit an issue with it. Was it a compiler
> issue, or the correct interpretation of C89? Or maybe something on the
> lines of:
>   https://www.thegoodpenguin.co.uk/blog/u-boot-relocation-bss-hang/

If it was an embedded setup, that last one is certainly possible.
Zeroing the BSS is typically the loader's job, and I've certainly seen
loader implementations - particularly in embedded firmware - that got
this wrong.
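
The kind of code involved, roughly -- a hypothetical bare-metal reset
handler, with __bss_start/__bss_end assumed to come from the linker
script:

  extern char __bss_start[], __bss_end[];
  extern int main(void);

  void reset_handler(void)
  {
  	char *p;

  	/* Objects with static storage duration end up in .bss and only
  	 * read as 0 if this loop actually runs and is correct. */
  	for (p = __bss_start; p < __bss_end; p++)
  		*p = 0;

  	main();
  }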

> ? I'll never know/remember. And then, after reading the standard, I
> started obsessively adding those = 0. Well, good to know I can stop
> doing it now. :)
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 13/13] flow: Avoid moving flow entries to compact table
  2024-01-02 18:13         ` Stefano Brivio
@ 2024-01-04 10:02           ` David Gibson
  2024-01-05  8:33             ` Stefano Brivio
  0 siblings, 1 reply; 40+ messages in thread
From: David Gibson @ 2024-01-04 10:02 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 6513 bytes --]

On Tue, Jan 02, 2024 at 07:13:41PM +0100, Stefano Brivio wrote:
> On Mon, 1 Jan 2024 23:01:17 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Sat, Dec 30, 2023 at 11:33:04AM +0100, Stefano Brivio wrote:
> > > On Thu, 28 Dec 2023 19:25:25 +0100
> > > Stefano Brivio <sbrivio@redhat.com> wrote:
> > >   
> > > > > On Thu, 21 Dec 2023 17:15:49 +1100
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > 
> > > > > [...]  
> > > >
> > > > [...]
> > > >
> > > > I wonder if we really have to keep track of the number of (non-)entries
> > > > in the free "block", and if we have to be explicit about the two cases.
> > > > 
> > > > I'm trying to find out if we can simplify the whole thing with slightly
> > > > different variants, for example:  
> > > 
> > > So... I think the version with (explicit) blocks has this fundamental
> > > advantage, on deletion:
> > >   
> > > > > +	flow->f.type = FLOW_TYPE_NONE;
> > > > > +	/* Put it back in a length 1 free block, don't attempt to fully reverse
> > > > > +	 * flow_alloc()'s steps.  This will get folded together the next time
> > > > > +	 * flow_defer_handler() runs anyway */
> > > > > +	flow->free.n = 1;
> > > > > +	flow->free.next = flow_first_free;
> > > > > +	flow_first_free = FLOW_IDX(flow);  
> > > 
> > > which is doable even without explicit blocks, but much harder to
> > > follow.  
> > 
> > Remember this is not a general deletion, only a "cancel" of the most
> > recent allocation.
> 
> Oh, I thought that was only the case for this series and you would use
> that as actual deletion in another pending series (which I haven't
> finished reviewing yet).

No.  Not allowing deletion of any entry at any time is what I'm
trading off to get both O(1) allocation and (effectively) O(1)
deletion.

> But now I'm not sure anymore why I was thinking this...
> 
> Anyway... do we really need it, then? Can't we just mark the "failed"
> flows as whatever means "closed" for a specific protocol, and clean
> them up later, instead of calling cancel() right away?

We could, but I'm not sure we want to.  For starters, that requires
protocol-specific behaviour whenever we need to back out an allocation
like this.  Not a big deal, since that's in protocol specific code
already, but I think it's uglier than calling cancel.

It also requires that the protocol specific deferred cleanup functions
(e.g. tcp_flow_defer()) handle partially initialised entries.  With
'cancel' we can back out just the initialisation steps we've already
done (because we know where we've failed during init), then remove the
entry.  The deferred cleanup function only needs to deal with
"complete" entries.  Again, certainly possible, but IMO uglier than
having 'cancel'.

Finally, leaving the "cancelled" entry there until the next deferred
pass means if we allocate other flows before that defer pass they'll
avoid this slot.  That works against trying to keep the entries
reasonably compact.

> >  To reduce fragmentation we are keeping the linked
> > list of free clusters in strictly ascending order.  The logic above is
> > only correct if the entry we're freeing is before any other free entry
> > in the table.  That will be the case for the most recent allocation,
> > because we always allocate the first free entry in the table.
> 
> I see now. This isn't entirely clear from the "Theory of Operation";
> on the other hand, it's probably a bad idea to overload that description
> with all these details.

I'll see what I can do in terms of making that clearer in the description.
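
As a reference point, the whole scheme boils down to something like
this standalone toy -- names and exact details here are illustrative,
not the patch:

  #include <stdio.h>
  #include <string.h>

  #define N 8

  struct entry {
  	int in_use;
  	unsigned run_len;	/* valid in the first entry of a free run */
  	unsigned run_next;	/* index of the next free run, or N */
  };

  static struct entry tab[N];
  static unsigned first_free;

  static void table_init(void)
  {
  	memset(tab, 0, sizeof(tab));
  	tab[0].run_len = N;	/* one free run covering the whole table */
  	tab[0].run_next = N;
  	first_free = 0;
  }

  static int alloc_entry(void)
  {
  	unsigned idx = first_free;

  	if (idx >= N)
  		return -1;	/* table full */

  	/* O(1): take the head of the first (lowest) free run */
  	if (tab[idx].run_len > 1) {
  		tab[idx + 1].run_len = tab[idx].run_len - 1;
  		tab[idx + 1].run_next = tab[idx].run_next;
  		first_free = idx + 1;
  	} else {
  		first_free = tab[idx].run_next;
  	}

  	memset(&tab[idx], 0, sizeof(tab[idx]));
  	tab[idx].in_use = 1;
  	return (int)idx;
  }

  static void cancel_entry(unsigned idx)
  {
  	/* Only valid for the most recent alloc_entry(): idx is then
  	 * lower than anything on the free list, so pushing a length-1
  	 * run on the front keeps the list strictly ascending. */
  	tab[idx].in_use = 0;
  	tab[idx].run_len = 1;
  	tab[idx].run_next = first_free;
  	first_free = idx;
  }

  int main(void)
  {
  	int a, b;

  	table_init();
  	a = alloc_entry();		/* takes slot 0 */
  	b = alloc_entry();		/* takes slot 1 */
  	cancel_entry((unsigned)b);	/* free list: [1], then [2..7] */
  	printf("%d %d first_free=%u\n", a, b, first_free);
  	return 0;
  }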

> > > Other than the comments I have, I would go ahead with your approach. My
> > > attempt below in case you're interested. The comment at the end of
> > > flow_del() shows the issue I'm facing:
> > > 
> > > struct flow_free_block {
> > > 	/* Must be first element */
> > > 	struct flow_common f;
> > > 
> > > 	unsigned next_free;
> > > 	unsigned next_used;
> > > };
> > > 
> > > static union flow *flow_alloc(void)
> > > {
> > > 	union flow *flow;
> > > 
> > > 	if (flow_first_free >= FLOW_MAX)
> > > 		return NULL;
> > > 
> > > 	flow = flowtab + flow_first_free;
> > > 
> > > 	flow_first_free = flow->free.next_free;
> > > 
> > > 	memset(flow, 0, sizeof(*flow));		/* TODO: select region */
> > > 	return flow;
> > > }
> > > 
> > > static void flow_del(union flow *del)
> > > {
> > > 	union flow *left, *right, *next_used = NULL, *first_free;  
> > > 
> > > 	del->f.type = FLOW_TYPE_NONE;
> > > 
> > > 	left =  (FLOW_IDX(del) > 0)            ? FLOW(FLOW_IDX(del) - 1) : NULL;
> > > 	right = (FLOW_IDX(del) < FLOW_MAX - 1) ? FLOW(FLOW_IDX(del) + 1) : NULL;
> > > 
> > > 	first_free = flow_first_free < FLOW_MAX ? FLOW(flow_first_free) : NULL;
> > > 
> > > 	if (right) {
> > > 		if (right->f.type == FLOW_TYPE_NONE)
> > > 			del->free.next_used = right->free.next_used;
> > > 		else
> > > 			del->free.next_used = right;
> > > 	} else {
> > > 		del->free.next_used = FLOW_MAX;
> > > 	}
> > > 
> > > 	if (left && left->f.type == FLOW_TYPE_NONE) {
> > > 		left->free.next_free = FLOW_IDX(del);
> > > 		left->free.next_used = del->free.next_used;
> > > 		return;
> > > 	}
> > > 
> > > 	if (flow_first_free == FLOW_MAX) {
> > > 		flow_first_free = FLOW_IDX(del);
> > > 	} else if (flow_first_free > FLOW_IDX(del)) {
> > > 		flow_first_free = FLOW_IDX(del);
> > > 		del->free.next_free = flow_first_free;
> > > 	} else if (flow_first_free < FLOW_IDX(del)) {
> > > 		;
> > > 		/* Walk free slots from flow_first_free until FLOW_IDX(del) to
> > > 		 * find insertion point... but that makes deletion at most O(n),
> > > 		 * perhaps O(log(n)), certainly not O(1).  
> > 
> > Exactly.  I can't see any way (short of far more complex data
> > structures) around having a linear scan somewhere, although you can
> > kind of choose whether it happens on alloc or free.
> > 
> > The idea of my implementation is to have it at free time - but to
> > merge it with an existing linear scan so it won't have an additional
> > cost.
> 
> Right, yes, I see now. This idea is also not terribly clear from the
> description, by the way, but I don't have a good proposal right now.

I'll see what I can come up with.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/13] flow: Move flow_count from context structure to a global
  2024-01-04  9:51             ` David Gibson
@ 2024-01-05  7:55               ` Stefano Brivio
  2024-01-07  5:23                 ` David Gibson
  0 siblings, 1 reply; 40+ messages in thread
From: Stefano Brivio @ 2024-01-05  7:55 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Thu, 4 Jan 2024 20:51:19 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Jan 03, 2024 at 08:08:34AM +0100, Stefano Brivio wrote:
> > On Wed, 3 Jan 2024 14:54:27 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > I'm not sure where to get the actual text of the standards  
> > 
> > Let me answer this first: one (the?) trick is to use so-called final
> > drafts, which are made freely available (same as working drafts) by the
> > Working Group.
> > 
> > Those are not the same as the standards, but differences from the final
> > draft are also published... and they are usually not substantial.
> > 
> > This is _very_ informative:
> >   https://stackoverflow.com/questions/81656/where-do-i-find-the-current-c-or-c-standard-documents  
> 
> Ah, thanks.
> 
> > Wikipedia also has the links, by the way. Anyway, in practice:
> > 
> > - C11: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf
> > - C99: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf
> > - C89:
> >   https://web.archive.org/web/20200909074736if_/https://www.pdf-archive.com/2014/10/02/ansi-iso-9899-1990-1/ansi-iso-9899-1990-1.pdf
> >   
> > > On Tue, Jan 02, 2024 at 07:13:35PM +0100, Stefano Brivio wrote:  
> > > > On Sun, 31 Dec 2023 16:58:39 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >     
> > > > > On Thu, Dec 28, 2023 at 07:25:18PM +0100, Stefano Brivio wrote:    
> > > > > > On Thu, 21 Dec 2023 17:15:46 +1100
> > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > >       
> > > > > > > In general, the passt code is a bit haphazard about what's a true global
> > > > > > > variable and what's in the quasi-global 'context structure'.  The
> > > > > > > flow_count field is one such example: it's in the context structure,
> > > > > > > although it's really part of the same data structure as flowtab[], which
> > > > > > > is a genuine global.      
> > > > > > 
> > > > > > Well, the reason is that flow_tab[FLOW_MAX] might be problematically
> > > > > > too big to live on the stack, unlike flow_count.
> > > > > > 
> > > > > > But anyway, as far as thoughts of multithreading are concerned, both
> > > > > > should probably be global. And sure, it's more consistent this way.
> > > > > >       
> > > > > > > Move flow_count to be a regular global to match.  For now it needs to be
> > > > > > > public, rather than static, but we expect to be able to change that in
> > > > > > > future.      
> > > > > > 
> > > > > > If it's not static, it should be initialised, and that's not done here.      
> > > > > 
> > > > > Uh... what?  "static" here means module-global rather than
> > > > > global-global, which has no bearing on initialisation.  AFAIK globals
> > > > > are zero-initialised whether they're static or not.    
> > > > 
> > > > ...and to my utter surprise, I just discovered that if you talk C11,
> > > > you're right. From the N1570 draft (ISO/IEC 9899:201x), Section 6.7.9
> > > > "Initialization", clause 10:
> > > > 
> > > >   If an object that has automatic storage duration is not initialized
> > > >   explicitly, its value is indeterminate. If an object that has static
> > > >   or thread storage duration is not initialized explicitly, then:
> > > > 
> > > >   [...]
> > > > 
> > > >   — if it has arithmetic type, it is initialized to (positive or
> > > >     unsigned) zero;
> > > > 
> > > > And 'flow_count' has thread storage duration.    
> > > 
> > > No.. I don't think it does.  AFAICT only thread-local variables have
> > > thread storage duration.  As a global flow_count will have static
> > > storage duration, even without the static keyword.  
> > 
> > So, C11 defines static storage duration here:
> > 
> >   6.2.4 Storage durations of objects
> > 
> >   [...]
> > 
> >   3 An object whose identifier is declared without the storage-class
> >   specifier _Thread_local, and either with external or internal linkage
> >   or with the storage-class specifier static, has static storage
> >   duration. Its lifetime is the entire execution of the program and its
> >   stored value is initialized only once, prior to program startup.
> > 
> > do we have any linkage here? I would have said no -- but, going back
> > to C99 for this, "6.2.2 Linkages of identifiers":
> > 
> >   5 [...] If the declaration of an identifier for an object has file
> >   scope and no storage-class specifier, its linkage is external.
> > 
> > which supports your paragraph below.  
> 
> Right.
> 
> > By the way, C11 now says:
> > 
> >   6.11.2 Linkages of identifiers
> > 
> >   1 Declaring an identifier with internal linkage at file scope without
> >   the static storage-class specifier is an obsolescent feature  
> 
> Ok.  I'm not even sure how you would do that.

By doing what I *thought* you were doing (see below): "int x" at file
scope (outside functions), no static, no extern declaration, nothing.

> > > > In C99, however (draft
> > > > N1256), Section 6.7.8 "Initialization", clause 10:
> > > > 
> > > >   If an object that has automatic storage duration is not initialized
> > > >   explicitly, its value is indeterminate. If an object that has static
> > > >   storage duration is not initialized explicitly, then:
> > > > 
> > > >   [...]
> > > > 
> > > > note the missing "or thread storage duration".
> > > > 
> > > > C89, the one I was actually basing my observation on, says, at 3.5.7
> > > > "Initialization":
> > > > 
> > > >   If an object that has static storage duration is not initialized
> > > >   explicitly, it is initialized implicitly as if every member that has
> > > >   arithmetic type were assigned 0 and every member that has pointer type
> > > >   were assigned a null pointer constant.  If an object that has
> > > >   automatic storage duration is not initialized explicitly, its value is
> > > >   indeterminate.
> > > > 
> > > > so... um. We won't go back to C99. But to me, and maybe others, not
> > > > having a "= 0;" for a "global" means pretty much that we don't rely on
> > > > any particular initial value.    
> > > 
> > > Again, I'm pretty sure that's not true, even for C99 and C89.  AIUI,
> > > 'static' locals and *all* globals have "static storage duration".
> > > 
> > > I'm not sure where to get the actual text of the standards but see for
> > > example
> > > 
> > > https://en.cppreference.com/w/c/language/static_storage_duration
> > > 
> > > Here 'flow_count' has external linkage, thus satisfying the conditions
> > > for static storage duration.  
> > 
> > Right. Well, for C99 and C11 at least. For C89 things are slightly
> > different:
> > 
> >   6.1.2.4 Storage durations of objects
> > 
> >   [...]
> > 
> >   An object whose identifier is declared with external or internal
> >   linkage, or with the storage-class specifier static has static storage
> >   duration.
> > 
> >   [...]
> > 
> >   An object whose identifier is declared with no linkage and without the
> >   storage-class specifier static has automatic storage duration.
> > 
> > You might say it has external linkage. But it was not *declared with*
> > external linkage -- it just happens to have it (C89 and C99 don't
> > differ here).  
> 
> Hrm.  We do have:
> 	extern unsigned flow_first_free;
> in flow_table.h.  Does that count as declaring with external linkage?

Gosh, sorry... yes, it also counts as me completely missing it.

> > > Fwiw, I'm pretty sure the kernel has relied on zero-initialization of
> > > non-static globals for many years.  
> > 
> > True, and the opposite is even considered as a style issue since 2007,
> > commit f0a594c1c74f ("update checkpatch.pl to version 0.08"). I also
> > found a discussion similar to this one:
> >   https://lore.kernel.org/all/20201102184147.GA42288@localhost/#r
> > 
> > Anyway... a couple of years before that, it must have been a gcc version
> > in the late 2.x, I actually hit an issue with it. Was it a compiler
> > issue, or the correct interpretation of C89? Or maybe something on the
> > lines of:
> >   https://www.thegoodpenguin.co.uk/blog/u-boot-relocation-bss-hang/  
> 
> If it was an embedded setup, that last one is certainly possible.
> Zeroing the BSS is typically the loader's job, and I've certainly seen
> loader implementations - particularly in embedded firmware - that got
> this wrong.

Embedded setup, yes, but in a Linux kernel (early 2.4.x). No 'extern'
there, though.

-- 
Stefano


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 13/13] flow: Avoid moving flow entries to compact table
  2024-01-04 10:02           ` David Gibson
@ 2024-01-05  8:33             ` Stefano Brivio
  2024-01-05  9:39               ` David Gibson
  0 siblings, 1 reply; 40+ messages in thread
From: Stefano Brivio @ 2024-01-05  8:33 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Thu, 4 Jan 2024 21:02:19 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, Jan 02, 2024 at 07:13:41PM +0100, Stefano Brivio wrote:
> > On Mon, 1 Jan 2024 23:01:17 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Sat, Dec 30, 2023 at 11:33:04AM +0100, Stefano Brivio wrote:  
> > > > On Thu, 28 Dec 2023 19:25:25 +0100
> > > > Stefano Brivio <sbrivio@redhat.com> wrote:
> > > >     
> > > > > > On Thu, 21 Dec 2023 17:15:49 +1100
> > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > > 
> > > > > > [...]    
> > > > >
> > > > > [...]
> > > > >
> > > > > I wonder if we really have to keep track of the number of (non-)entries
> > > > > in the free "block", and if we have to be explicit about the two cases.
> > > > > 
> > > > > I'm trying to find out if we can simplify the whole thing with slightly
> > > > > different variants, for example:    
> > > > 
> > > > So... I think the version with (explicit) blocks has this fundamental
> > > > advantage, on deletion:
> > > >     
> > > > > > +	flow->f.type = FLOW_TYPE_NONE;
> > > > > > +	/* Put it back in a length 1 free block, don't attempt to fully reverse
> > > > > > +	 * flow_alloc()'s steps.  This will get folded together the next time
> > > > > > +	 * flow_defer_handler() runs anyway */
> > > > > > +	flow->free.n = 1;
> > > > > > +	flow->free.next = flow_first_free;
> > > > > > +	flow_first_free = FLOW_IDX(flow);    
> > > > 
> > > > which is doable even without explicit blocks, but much harder to
> > > > follow.    
> > > 
> > > Remember this is not a general deletion, only a "cancel" of the most
> > > recent allocation.  
> > 
> > Oh, I thought that was only the case for this series and you would use
> > that as actual deletion in another pending series (which I haven't
> > finished reviewing yet).  
> 
> No.  Not allowing deletion of any entry at any time is what I'm
> trading off to get both O(1) allocation and (effectively) O(1)
> deletion.
> 
> > But now I'm not sure anymore why I was thinking this...
> > 
> > Anyway... do we really need it, then? Can't we just mark the "failed"
> > flows as whatever means "closed" for a specific protocol, and clean
> > them up later, instead of calling cancel() right away?  
> 
> We could, but I'm not sure we want to.  For starters, that requires
> protocol-specific behaviour whenever we need to back out an allocation
> like this.  Not a big deal, since that's in protocol specific code
> already, but I think it's uglier than calling cancel.
> 
> It also requires that the protocol specific deferred cleanup functions
> (e.g. tcp_flow_defer()) handle partially initialised entries.  With
> 'cancel' we can back out just the initialisation steps we've already
> done (because we know where we've failed during init), then remove the
> entry.  The deferred cleanup function only needs to deal with
> "complete" entries.  Again, certainly possible, but IMO uglier than
> having 'cancel'.

Okay, yes, I see now.

Another doubt that comes to me now is: if you don't plan to use this
alloc_cancel() thing anywhere else, the only reason why you are adding
it is to replace the (flow_count >= FLOW_MAX) check with a flow_alloc()
version that can fail.

But at this point, speaking of ugliness, couldn't we just have a
bool flow_may_alloc() { return flow_first_free < FLOW_MAX; } the caller
can use to decide to abort earlier? To me it looks so much simpler and
more robust.
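
Side by side, the two caller shapes I mean -- stub definitions standing
in for the real API, bodies hypothetical:

  #include <stdbool.h>
  #include <stddef.h>

  #define FLOW_MAX 128

  union flow;				/* opaque here */

  static unsigned flow_first_free;	/* stand-in for the real index */

  static union flow *flow_alloc(void)	/* failing alloc: NULL when full */
  {
  	return NULL;			/* stub */
  }

  static bool flow_may_alloc(void)	/* proposed early check */
  {
  	return flow_first_free < FLOW_MAX;
  }

  void handler_failing_alloc(void)
  {
  	union flow *flow = flow_alloc();

  	if (!flow)
  		return;		/* nothing to back out yet */
  	/* ... init, backing out via cancel on failure ... */
  }

  void handler_early_check(void)
  {
  	if (!flow_may_alloc())
  		return;		/* abort before touching the table */
  	/* ... populate fields; allocation itself can't fail now ... */
  }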

-- 
Stefano


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 13/13] flow: Avoid moving flow entries to compact table
  2024-01-05  8:33             ` Stefano Brivio
@ 2024-01-05  9:39               ` David Gibson
  2024-01-05 10:27                 ` Stefano Brivio
  0 siblings, 1 reply; 40+ messages in thread
From: David Gibson @ 2024-01-05  9:39 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 5104 bytes --]

On Fri, Jan 05, 2024 at 09:33:35AM +0100, Stefano Brivio wrote:
> On Thu, 4 Jan 2024 21:02:19 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Tue, Jan 02, 2024 at 07:13:41PM +0100, Stefano Brivio wrote:
> > > On Mon, 1 Jan 2024 23:01:17 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > On Sat, Dec 30, 2023 at 11:33:04AM +0100, Stefano Brivio wrote:  
> > > > > On Thu, 28 Dec 2023 19:25:25 +0100
> > > > > Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > >     
> > > > > > > On Thu, 21 Dec 2023 17:15:49 +1100
> > > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > > > 
> > > > > > > [...]    
> > > > > >
> > > > > > [...]
> > > > > >
> > > > > > I wonder if we really have to keep track of the number of (non-)entries
> > > > > > in the free "block", and if we have to be explicit about the two cases.
> > > > > > 
> > > > > > I'm trying to find out if we can simplify the whole thing with slightly
> > > > > > different variants, for example:    
> > > > > 
> > > > > So... I think the version with (explicit) blocks has this fundamental
> > > > > advantage, on deletion:
> > > > >     
> > > > > > > +	flow->f.type = FLOW_TYPE_NONE;
> > > > > > > +	/* Put it back in a length 1 free block, don't attempt to fully reverse
> > > > > > > +	 * flow_alloc()'s steps.  This will get folded together the next time
> > > > > > > +	 * flow_defer_handler() runs anyway */
> > > > > > > +	flow->free.n = 1;
> > > > > > > +	flow->free.next = flow_first_free;
> > > > > > > +	flow_first_free = FLOW_IDX(flow);    
> > > > > 
> > > > > which is doable even without explicit blocks, but much harder to
> > > > > follow.    
> > > > 
> > > > Remember this is not a general deletion, only a "cancel" of the most
> > > > recent allocation.  
> > > 
> > > Oh, I thought that was only the case for this series and you would use
> > > that as actual deletion in another pending series (which I haven't
> > > finished reviewing yet).  
> > 
> > No.  Not allowing deletion of any entry at any time is what I'm
> > trading off to get both O(1) allocation and (effectively) O(1)
> > deletion.
> > 
> > > But now I'm not sure anymore why I was thinking this...
> > > 
> > > Anyway... do we really need it, then? Can't we just mark the "failed"
> > > flows as whatever means "closed" for a specific protocol, and clean
> > > them up later, instead of calling cancel() right away?  
> > 
> > We could, but I'm not sure we want to.  For starters, that requires
> > protocol-specific behaviour whenever we need to back out an allocation
> > like this.  Not a big deal, since that's in protocol specific code
> > already, but I think it's uglier than calling cancel.
> > 
> > It also requires that the protocol specific deferred cleanup functions
> > (e.g. tcp_flow_defer()) handle partially initialised entries.  With
> > 'cancel' we can back out just the initialisation steps we've already
> > done (because we know where we've failed during init), then remove the
> > entry.  The deferred cleanup function only needs to deal with
> > "complete" entries.  Again, certainly possible, but IMO uglier than
> > having 'cancel'.
> 
> Okay, yes, I see now.
> 
> Another doubt that comes to me now is: if you don't plan to use this
> alloc_cancel() thing anywhere else, the only reason why you are adding
> it is to replace the (flow_count >= FLOW_MAX) check with a flow_alloc()
> version that can fail.
> 
> But at this point, speaking of ugliness, couldn't we just have a
> bool flow_may_alloc() { return flow_first_free < FLOW_MAX; } the caller
> can use to decide to abort earlier? To me it looks so much simpler and
> more robust.

Well, we could, but there are a couple of reasons I don't love it.
The first is abstraction: this returns explicit handling of the layout
of the table to the protocol specific callers.  It's not a huge deal
right now, but once we have 4 or 5 protocols doing this, having to
change all of them if we make any tiny change to the semantics of
flow_first_free isn't great.

The other issue is that to do this (without a bunch of fairly large
and ugly temporaries) means we'd populate at least some of the fields
in flow_common before we have officially "allocated" the entry.  At
that point it becomes a bit fuzzy as to when that allocation really
occurs.  Is it when we do the FLOW_MAX test?  Is it when we write to
f.type?  Is it when we update flow_first_free?  If we fail somewhere
in the middle of that, what steps do we need to reverse?

For those reasons I prefer the scheme presented.  Fwiw, in an earlier
draft I did this differently with a "flow_prealloc()", which was
essentially the check against FLOW_MAX, then a later
flow_alloc_commit().  I thought it turned out pretty confusing
compared to the alloc/cancel approach.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 13/13] flow: Avoid moving flow entries to compact table
  2024-01-02 18:13       ` Stefano Brivio
@ 2024-01-05  9:45         ` David Gibson
  0 siblings, 0 replies; 40+ messages in thread
From: David Gibson @ 2024-01-05  9:45 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 2235 bytes --]

On Tue, Jan 02, 2024 at 07:13:48PM +0100, Stefano Brivio wrote:
> On Mon, 1 Jan 2024 21:44:54 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Dec 28, 2023 at 07:25:25PM +0100, Stefano Brivio wrote:
> > > On Thu, 21 Dec 2023 17:15:49 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > [...]
> > >
> > > >  void flow_defer_handler(const struct ctx *c, const struct timespec *now)
> > > >  {
> > > > +	struct flow_free_block *free_head = NULL;
> > > > +	unsigned *last_next = &flow_first_free;
> > > >  	bool timer = false;
> > > > -	union flow *flow;
> > > > +	unsigned idx;
> > > >  
> > > >  	if (timespec_diff_ms(now, &flow_timer_run) >= FLOW_TIMER_INTERVAL) {
> > > >  		timer = true;
> > > >  		flow_timer_run = *now;
> > > >  	}
> > > >  
> > > > -	for (flow = flowtab + flow_count - 1; flow >= flowtab; flow--) {
> > > > +	for (idx = 0; idx < FLOW_MAX; idx++) {
> > > > +		union flow *flow = &flowtab[idx];
> > > >  		bool closed = false;
> > > >  
> > > > +		if (flow->f.type == FLOW_TYPE_NONE) {
> > > > +			/* Start of a free block */
> > > > +			free_head = &flow->free;
> > > > +			*last_next = idx;
> > > > +			last_next = &free_head->next;
> > > > +			/* Skip the rest of the block */
> > > > +			idx += free_head->n - 1;
> > > > +			continue;
> > > > +		}
> > > > +
> > > > +		  
> > > 
> > > Stray tabs.

Fixed.

> > > >  		switch (flow->f.type) {
> > > > +		case FLOW_TYPE_NONE:
> > > > +			closed = true;
> > > > +			break;
> 
> ...more important than stray tabs, I noticed only now: how would we ever
> hit this switch case, if you're checking for this in the if clause just
> before?

Ah.. it can't.. but fixing it is a bit more complex.  The intention
here is that if there are multiple contiguous free clusters we'll
merge them together, which this currently won't do.  I'll have to take
another look at that.
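
For the record, roughly what the fixed-up scan needs to do -- a sketch
over a simplified table, where tab[], run_len, run_next and first_free
are all illustrative, not the actual patch:

  #define N 8

  struct entry {
  	int in_use;
  	unsigned run_len;	/* valid in the first entry of a free run */
  	unsigned run_next;	/* index of the next free run, or N */
  };

  static struct entry tab[N];
  static unsigned first_free;

  /* One pass: rebuild the free list in ascending order and fold
   * adjacent free runs into a single run. */
  static void defer_scan_merge(void)
  {
  	unsigned *last_next = &first_free;
  	unsigned idx = 0;

  	while (idx < N) {
  		if (tab[idx].in_use) {
  			idx++;
  			continue;
  		}

  		/* A free run starts here: swallow any free run that
  		 * immediately follows it. */
  		while (idx + tab[idx].run_len < N &&
  		       !tab[idx + tab[idx].run_len].in_use)
  			tab[idx].run_len += tab[idx + tab[idx].run_len].run_len;

  		/* Relink as we go; runs are visited in ascending order. */
  		*last_next = idx;
  		last_next = &tab[idx].run_next;

  		idx += tab[idx].run_len;
  	}

  	*last_next = N;	/* terminate the free list */
  }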

> 
> > > >  		case FLOW_TCP:
> > > >  			closed = tcp_flow_defer(flow);
> > > >  			break;
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 13/13] flow: Avoid moving flow entries to compact table
  2024-01-05  9:39               ` David Gibson
@ 2024-01-05 10:27                 ` Stefano Brivio
  2024-01-06 11:32                   ` David Gibson
  0 siblings, 1 reply; 40+ messages in thread
From: Stefano Brivio @ 2024-01-05 10:27 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 5 Jan 2024 20:39:50 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Jan 05, 2024 at 09:33:35AM +0100, Stefano Brivio wrote:
> > On Thu, 4 Jan 2024 21:02:19 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Tue, Jan 02, 2024 at 07:13:41PM +0100, Stefano Brivio wrote:  
> > > > On Mon, 1 Jan 2024 23:01:17 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >     
> > > > > On Sat, Dec 30, 2023 at 11:33:04AM +0100, Stefano Brivio wrote:    
> > > > > > On Thu, 28 Dec 2023 19:25:25 +0100
> > > > > > Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > >       
> > > > > > > > On Thu, 21 Dec 2023 17:15:49 +1100
> > > > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > > > > 
> > > > > > > > [...]      
> > > > > > >
> > > > > > > [...]
> > > > > > >
> > > > > > > I wonder if we really have to keep track of the number of (non-)entries
> > > > > > > in the free "block", and if we have to be explicit about the two cases.
> > > > > > > 
> > > > > > > I'm trying to find out if we can simplify the whole thing with slightly
> > > > > > > different variants, for example:      
> > > > > > 
> > > > > > So... I think the version with (explicit) blocks has this fundamental
> > > > > > advantage, on deletion:
> > > > > >       
> > > > > > > > +	flow->f.type = FLOW_TYPE_NONE;
> > > > > > > > +	/* Put it back in a length 1 free block, don't attempt to fully reverse
> > > > > > > > +	 * flow_alloc()'s steps.  This will get folded together the next time
> > > > > > > > +	 * flow_defer_handler() runs anyway */
> > > > > > > > +	flow->free.n = 1;
> > > > > > > > +	flow->free.next = flow_first_free;
> > > > > > > > +	flow_first_free = FLOW_IDX(flow);      
> > > > > > 
> > > > > > which is doable even without explicit blocks, but much harder to
> > > > > > follow.      
> > > > > 
> > > > > Remember this is not a general deletion, only a "cancel" of the most
> > > > > recent allocation.    
> > > > 
> > > > Oh, I thought that was only the case for this series and you would use
> > > > that as actual deletion in another pending series (which I haven't
> > > > finished reviewing yet).    
> > > 
> > > No.  Not allowing deletion of any entry at any time is what I'm
> > > trading off to get both O(1) allocation and (effectively) O(1)
> > > deletion.
> > >   
> > > > But now I'm not sure anymore why I was thinking this...
> > > > 
> > > > Anyway... do we really need it, then? Can't we just mark the "failed"
> > > > flows as whatever means "closed" for a specific protocol, and clean
> > > > them up later, instead of calling cancel() right away?    
> > > 
> > > We could, but I'm not sure we want to.  For starters, that requires
> > > protocol-specific behaviour whenever we need to back out an allocation
> > > like this.  Not a big deal, since that's in protocol specific code
> > > already, but I think it's uglier than calling cancel.
> > > 
> > > It also requires that the protocol specific deferred cleanup functions
> > > (e.g. tcp_flow_defer()) handle partially initialised entries.  With
> > > 'cancel' we can back out just the initialisation steps we've already
> > > done (because we know where we've failed during init), then remove the
> > > entry.  The deferred cleanup function only needs to deal with
> > > "complete" entries.  Again, certainly possible, but IMO uglier than
> > > having 'cancel'.  
> > 
> > Okay, yes, I see now.
> > 
> > Another doubt that comes to me now is: if you don't plan to use this
> > alloc_cancel() thing anywhere else, the only reason why you are adding
> > it is to replace the (flow_count >= FLOW_MAX) check with a flow_alloc()
> > version that can fail.
> > 
> > But at this point, speaking of ugliness, couldn't we just have a
> > bool flow_may_alloc() { return flow_first_free < FLOW_MAX; } the caller
> > can use to decide to abort earlier? To me it looks so much simpler and
> > more robust.  
> 
> Well, we could, but there are a couple of reasons I don't love it.
> The first is abstraction: this returns explicit handling of the layout
> of the table to the protocol specific callers.  It's not a huge deal
> right now, but once we have 4 or 5 protocols doing this, having to
> change all of them if we make any tiny change to the semantics of
> flow_first_free isn't great.

Hmm, I don't get the difference in terms of abstraction between
checking the return value of flow_alloc() and checking the return value
of flow_may_alloc().

In both cases the protocol handlers know that there's a table, a
function to reserve entries, and that reserving entries might fail...
and not much else.

> The other issue is that to do this (without a bunch of fairly large
> and ugly temporaries) means we'd populate at least some of the fields
> in flow_common before we have officially "allocated" the entry.  At
> that point it becomes a bit fuzzy as to when that allocation really
> occurs.  Is it when we do the FLOW_MAX test?

I would say yes -- after that we can't fail.

I mean, we work with rather constrained structures for a number of
reasons, which comes with a number of hard problems... let's at least
reap the benefits of it?

> Is it when we write to
> f.type?  Is it when we update flow_first_free?  If we fail somewhere
> in the middle of that, what steps do we need to reverse?

We can't fail in the middle of it, at the moment. Of course, if the
"allocation" function changes, we might need to change the scheme. But
is it really likely? And anyway it's just a few lines in your current
version...

> For those reasons I prefer the scheme presented.  Fwiw, in an earlier
> draft I did this differently with a "flow_prealloc()", which was
> essentially the check against FLOW_MAX, then a later
> flow_alloc_commit().  I thought it turned out pretty confusing
> compared to the alloc/cancel approach.

The current flow_alloc_cancel() implementation is definitely simple and
semantically clear.

What worries me a bit is that you would have two different cases for
free "blocks" of size one, depending on the order of the events. So if
we want to debug something like that and we see a block with size one
it might be a failed bind(), that is, a "fake" one -- or not: it might be an
actual block with size one.

Thinking of multithreading: defining flow_may_alloc() becomes more
complicated because the caller can't just assume the "allocation" will
succeed (as long as we don't cross an "epoll cycle" or something like
that). But we'll probably need some form of locking or userspace RCU
giving us barriers around the pair may_alloc() / alloc().

If we stick to the failing alloc(), this part is simpler, but the
interpretation of flow_first_free and block sizes becomes non-trivial.

Well, on the other hand, it's all simple enough that we can change it
as needed (for example for multithreading). If we can hope that the new
scheme is reasonably low on bugs and we'll probably never have to guess
why a block has size one, I'm fine with the failing alloc() as well.

-- 
Stefano


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 13/13] flow: Avoid moving flow entries to compact table
  2024-01-05 10:27                 ` Stefano Brivio
@ 2024-01-06 11:32                   ` David Gibson
  2024-01-06 13:02                     ` Stefano Brivio
  0 siblings, 1 reply; 40+ messages in thread
From: David Gibson @ 2024-01-06 11:32 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 8938 bytes --]

On Fri, Jan 05, 2024 at 11:27:54AM +0100, Stefano Brivio wrote:
> On Fri, 5 Jan 2024 20:39:50 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Fri, Jan 05, 2024 at 09:33:35AM +0100, Stefano Brivio wrote:
> > > On Thu, 4 Jan 2024 21:02:19 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > On Tue, Jan 02, 2024 at 07:13:41PM +0100, Stefano Brivio wrote:  
> > > > > On Mon, 1 Jan 2024 23:01:17 +1100
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >     
> > > > > > On Sat, Dec 30, 2023 at 11:33:04AM +0100, Stefano Brivio wrote:    
> > > > > > > On Thu, 28 Dec 2023 19:25:25 +0100
> > > > > > > Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > > >       
> > > > > > > > > On Thu, 21 Dec 2023 17:15:49 +1100
> > > > > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > > > > > 
> > > > > > > > > [...]      
> > > > > > > >
> > > > > > > > [...]
> > > > > > > >
> > > > > > > > I wonder if we really have to keep track of the number of (non-)entries
> > > > > > > > in the free "block", and if we have to be explicit about the two cases.
> > > > > > > > 
> > > > > > > > I'm trying to find out if we can simplify the whole thing with slightly
> > > > > > > > different variants, for example:      
> > > > > > > 
> > > > > > > So... I think the version with (explicit) blocks has this fundamental
> > > > > > > advantage, on deletion:
> > > > > > >       
> > > > > > > > > +	flow->f.type = FLOW_TYPE_NONE;
> > > > > > > > > +	/* Put it back in a length 1 free block, don't attempt to fully reverse
> > > > > > > > > +	 * flow_alloc()'s steps.  This will get folded together the next time
> > > > > > > > > +	 * flow_defer_handler() runs anyway */
> > > > > > > > > +	flow->free.n = 1;
> > > > > > > > > +	flow->free.next = flow_first_free;
> > > > > > > > > +	flow_first_free = FLOW_IDX(flow);      
> > > > > > > 
> > > > > > > which is doable even without explicit blocks, but much harder to
> > > > > > > follow.      
> > > > > > 
> > > > > > Remember this is not a general deletion, only a "cancel" of the most
> > > > > > recent allocation.    
> > > > > 
> > > > > Oh, I thought that was only the case for this series and you would use
> > > > > that as actual deletion in another pending series (which I haven't
> > > > > finished reviewing yet).    
> > > > 
> > > > No.  Not allowing deletion of any entry at any time is what I'm
> > > > trading off to get both O(1) allocation and (effectively) O(1)
> > > > deletion.
> > > >   
> > > > > But now I'm not sure anymore why I was thinking this...
> > > > > 
> > > > > Anyway... do we really need it, then? Can't we just mark the "failed"
> > > > > flows as whatever means "closed" for a specific protocol, and clean
> > > > > them up later, instead of calling cancel() right away?    
> > > > 
> > > > We could, but I'm not sure we want to.  For starters, that requires
> > > > protocol-specific behaviour whenever we need to back out an allocation
> > > > like this.  Not a big deal, since that's in protocol specific code
> > > > already, but I think it's uglier than calling cancel.
> > > > 
> > > > It also requires that the protocol specific deferred cleanup functions
> > > > (e.g. tcp_flow_defer()) handle partially initialised entries.  With
> > > > 'cancel' we can back out just the initialisation steps we've already
> > > > done (because we know where we've failed during init), then remove the
> > > > entry.  The deferred cleanup function only needs to deal with
> > > > "complete" entries.  Again, certainly possible, but IMO uglier than
> > > > having 'cancel'.  
> > > 
> > > Okay, yes, I see now.
> > > 
> > > Another doubt that comes to me now is: if you don't plan to use this
> > > alloc_cancel() thing anywhere else, the only reason why you are adding
> > > it is to replace the (flow_count >= FLOW_MAX) check with a flow_alloc()
> > > version that can fail.
> > > 
> > > But at this point, speaking of ugliness, couldn't we just have a
> > > bool flow_may_alloc() { return flow_first_free < FLOW_MAX; } the caller
> > > can use to decide to abort earlier? To me it looks so much simpler and
> > > more robust.  
> > 
> > Well, we could, but there are a couple of reasons I don't love it.
> > The first is abstraction: this returns explicit handling of the layout
> > of the table to the protocol specific callers.  It's not a huge deal
> > right now, but once we have 4 or 5 protocols doing this, having to
> > change all of them if we make any tiny change to the semantics of
> > flow_first_free isn't great.
> 
> Hmm, I don't get the difference in terms of abstraction between
> checking the return value of flow_alloc() and checking the return value
> of flow_may_alloc().

Oh, sorry, I thought you were proposing open-coding the check against
FLOW_MAX, like it is at the moment.  See below for comments on a
flow_may_alloc() or similar call.

> In both cases the protocol handlers know that there's a table, a
> function to reserve entries, and that reserving entries might fail...
> and not much else.
> 
> > The other issue is that to do this (without a bunch of fairly large
> > and ugly temporaries) means we'd populate at least some of the fields
> > in flow_common before we have officially "allocated" the entry.  At
> > that point it becomes a bit fuzzy as to when that allocation really
> > occurs.  Is it when we do the FLOW_MAX test?
> 
> I would say yes -- after that we can't fail.

Uh.. we can't fail to allocate.  We can fail for other reasons.

> I mean, we work with rather constrained structures for a number of
> reasons, which comes with a number of hard problems... let's at least
> reap the benefits of it?

I'm not sure what you're getting at here.

> > Is it when we write to
> > f.type?  Is it when we update flow_first_free?  If we fail somewhere
> > in the middle of that, what steps do we need to reverse?
> 
> We can't fail in the middle of it, at the moment.

Yes we can, that's kind of the whole point of this.

> Of course, if the
> "allocation" function changes, we might need to change the scheme. But
> is it really likely? And anyway it's just a few lines in your current
> version...
> 
> > For those reasons I prefer the scheme presented.  Fwiw, in an earlier
> > draft I did this differently with a "flow_prealloc()", which was
> > essentially the check against FLOW_MAX, then a later
> > flow_alloc_commit().  I thought it turned out pretty confusing
> > compared to the alloc/cancel approach.
> 
> The current flow_alloc_cancel() implementation is definitely simple and
> semantically clear.
> 
> What worries me a bit is that you would have two different cases for
> free "blocks" of size one, depending on the order of the events. So if
> we want to debug something like that and we see a block with size one
> it might be a failed bind(), so a fake one, or also not: it might be an
> actual block with size one.

The cluster of size one from cancel is still a valid free cluster that
satisfies all the usual invariants; it's not "fake".  It does mean
that we could get two contiguous free clusters, which we wouldn't
otherwise.  The idea is that they'll be merged back together on the
next deferred scan, but as noted on a different subthread, that's not
currently working and I'll have to fix it.

> Thinking of multithreading: defining flow_may_alloc() becomes more
> complicated because the caller can't just assume the "allocation" will
> succeed (as long as we don't cross an "epoll cycle" or something like
> that). But we'll probably need some form of locking or userspace RCU
> giving us barriers around the pair may_alloc() / alloc().

For multi-threaded, I really think we'd want alloc/free semantics, not
may_alloc/alloc semantics.  We have to change state in some way at
"may alloc" time, or something else could try to allocate the same
slot.  "cancel" semantics also don't make sense here, because we can
no longer be confident that the alloc we did above is still the most
recent alloc.  So a bunch of things would need to change for
multi-threading.

> If we stick to the failing alloc(), this part is simpler, but the
> interpretation of flow_first_free and block sizes becomes non-trivial.

Again, not sure what you're getting at.

> Well, on the other hand, it's all simple enough that we can change it
> as needed (for example for multithreading). If we can hope that the new
> scheme is reasonably low on bugs and we'll probably never have to guess
> why a block has size one, I'm fine with the failing alloc() as well.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 13/13] flow: Avoid moving flow entries to compact table
  2024-01-06 11:32                   ` David Gibson
@ 2024-01-06 13:02                     ` Stefano Brivio
  2024-01-07  5:20                       ` David Gibson
  0 siblings, 1 reply; 40+ messages in thread
From: Stefano Brivio @ 2024-01-06 13:02 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Sat, 6 Jan 2024 22:32:10 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Jan 05, 2024 at 11:27:54AM +0100, Stefano Brivio wrote:
> > On Fri, 5 Jan 2024 20:39:50 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Fri, Jan 05, 2024 at 09:33:35AM +0100, Stefano Brivio wrote:  
> > > > On Thu, 4 Jan 2024 21:02:19 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >     
> > > > > On Tue, Jan 02, 2024 at 07:13:41PM +0100, Stefano Brivio wrote:    
> > > > > > On Mon, 1 Jan 2024 23:01:17 +1100
> > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > >       
> > > > > > > On Sat, Dec 30, 2023 at 11:33:04AM +0100, Stefano Brivio wrote:      
> > > > > > > > On Thu, 28 Dec 2023 19:25:25 +0100
> > > > > > > > Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > > > >         
> > > > > > > > > > On Thu, 21 Dec 2023 17:15:49 +1100
> > > > > > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > > > > > > 
> > > > > > > > > > [...]        
> > > > > > > > >
> > > > > > > > > [...]
> > > > > > > > >
> > > > > > > > > I wonder if we really have to keep track of the number of (non-)entries
> > > > > > > > > in the free "block", and if we have to be explicit about the two cases.
> > > > > > > > > 
> > > > > > > > > I'm trying to find out if we can simplify the whole thing with slightly
> > > > > > > > > different variants, for example:        
> > > > > > > > 
> > > > > > > > So... I think the version with (explicit) blocks has this fundamental
> > > > > > > > advantage, on deletion:
> > > > > > > >         
> > > > > > > > > > +	flow->f.type = FLOW_TYPE_NONE;
> > > > > > > > > > +	/* Put it back in a length 1 free block, don't attempt to fully reverse
> > > > > > > > > > +	 * flow_alloc()'s steps.  This will get folded together the next time
> > > > > > > > > > +	 * flow_defer_handler() runs anyway */
> > > > > > > > > > +	flow->free.n = 1;
> > > > > > > > > > +	flow->free.next = flow_first_free;
> > > > > > > > > > +	flow_first_free = FLOW_IDX(flow);        
> > > > > > > > 
> > > > > > > > which is doable even without explicit blocks, but much harder to
> > > > > > > > follow.        
> > > > > > > 
> > > > > > > Remember this is not a general deletion, only a "cancel" of the most
> > > > > > > recent allocation.      
> > > > > > 
> > > > > > Oh, I thought that was only the case for this series and you would use
> > > > > > that as actual deletion in another pending series (which I haven't
> > > > > > finished reviewing yet).      
> > > > > 
> > > > > No.  Not allowing deletion of any entry at any time is what I'm
> > > > > trading off to get both O(1) allocation and (effectively) O(1)
> > > > > deletion.
> > > > >     
> > > > > > But now I'm not sure anymore why I was thinking this...
> > > > > > 
> > > > > > Anyway... do we really need it, then? Can't we just mark the "failed"
> > > > > > flows as whatever means "closed" for a specific protocol, and clean
> > > > > > them up later, instead of calling cancel() right away?      
> > > > > 
> > > > > We could, but I'm not sure we want to.  For starters, that requires
> > > > > protocol-specific behaviour whenever we need to back out an allocation
> > > > > like this.  Not a big deal, since that's in protocol specific code
> > > > > already, but I think it's uglier than calling cancel.
> > > > > 
> > > > > It also requires that the protocol specific deferred cleanup functions
> > > > > (e.g. tcp_flow_defer()) handle partially initialised entries.  With
> > > > > 'cancel' we can back out just the initialisation steps we've already
> > > > > done (because we know where we've failed during init), then remove the
> > > > > entry.  The deferred cleanup function only needs to deal with
> > > > > "complete" entries.  Again, certainly possible, but IMO uglier than
> > > > > having 'cancel'.    
> > > > 
> > > > Okay, yes, I see now.
> > > > 
> > > > Another doubt that comes to me now is: if you don't plan to use this
> > > > alloc_cancel() thing anywhere else, the only reason why you are adding
> > > > it is to replace the (flow_count >= FLOW_MAX) check with a flow_alloc()
> > > > version that can fail.
> > > > 
> > > > But at this point, speaking of ugliness, couldn't we just have a
> > > > bool flow_may_alloc() { return flow_first_free < FLOW_MAX; } that the caller
> > > > can use to decide to abort earlier? To me it looks so much simpler and
> > > > more robust.    
> > > 
> > > Well, we could, but there are a couple of reasons I don't love it.
> > > The first is abstraction: this returns explicit handling of the layout
> > > of the table to the protocol specific callers.  It's not a huge deal
> > > right now, but once we have 4 or 5 protocols doing this, having to
> > > change all of them if we make any tiny change to the semantics of
> > > flow_first_free isn't great.  
> > 
> > Hmm, I don't get the difference in terms of abstraction between
> > checking the return value of flow_alloc() and checking the return value
> > of flow_may_alloc().  
> 
> Oh, sorry, I thought you were proposing open-coding the check against
> FLOW_MAX, like it is at the moment.  See below for comments on a
> flow_may_alloc() or similar call.
> 
> > In both cases the protocol handlers know that there's a table, a
> > function to reserve entries, and that reserving entries might fail...
> > and not much else.
> >   
> > > The other issue is that to do this (without a bunch of fairly large
> > > and ugly temporaries) means we'd populate at least some of the fields
> > > in flow_common before we have officially "allocated" the entry.  At
> > > that point it becomes a bit fuzzy as to when that allocation really
> > > occurs.  Is it when we do the FLOW_MAX test?  
> > 
> > I would say yes -- after that we can't fail.  
> 
> Uh.. we can't fail to allocate.  We can fail for other reasons.

Yes, I meant, if we pass the FLOW_MAX check, then we can't fail to
allocate -- therefore flow_may_alloc() would contain just that check.

> > I mean, we work with rather constrained structures for a number of
> > reasons, which comes with a number of hard problems... let's at least
> > reap the benefits of it?  
> 
> I'm not sure what you're getting at here.

...I'm saying that we don't actually allocate memory, which means that
we can have a simple test (on FLOW_MAX), and then we know that, at
least within the handling of a single epoll event, we will be able to
allocate memory, without possible races.

We couldn't do that with an actual heap allocation because that's out
of our control.

This is one of the few aspects where using statically allocated memory
only could make our lives easier (while in general we need to spend
more effort for other things).
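
That is, the whole check could boil down to something like this
(rough sketch, reusing the existing flow_first_free / FLOW_MAX names):

	/* With a statically allocated table this is the only check we
	 * need: within one epoll cycle, nothing else can consume
	 * entries behind our back.
	 */
	bool flow_may_alloc(void)
	{
		return flow_first_free < FLOW_MAX;
	}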

> > > Is it when we write to
> > > f.type?  Is it when we update flow_first_free?  If we fail somewhere
> > > in the middle of that, what steps do we need to reverse?  
> > 
> > We can't fail in the middle of it, at the moment.  
> 
> Yes we can, that's kind of the whole point of this.

But as of this patch flow_alloc() can only fail on the FLOW_MAX check...

> > Of course, if the
> > "allocation" function changes, we might need to change the scheme. But
> > is it really likely? And anyway it's just a few lines in your current
> > version...
> >   
> > > For those reasons I prefer the scheme presented.  Fwiw, in an earlier
> > > draft I did this differently with a "flow_prealloc()", which was
> > > essentially the check against FLOW_MAX, then a later
> > > flow_alloc_commit().  I thought it turned out pretty confusing
> > > compared to the alloc/cancel approach.  
> > 
> > The current flow_alloc_cancel() implementation is definitely simple and
> > semantically clear.
> > 
> > What worries me a bit is that you would have two different cases for
> > free "blocks" of size one, depending on the order of the events. So if
> > we want to debug something like that and we see a block with size one
> > it might be a failed bind(), so a fake one, or also not: it might be an
> > actual block with size one.  
> 
> The cluster of size one from cancel is still a valid free cluster that
> satisfies all the usual invariants, it's not "fake".  It does mean
> that we could get two contiguous free clusters, which we wouldn't
> otherwise.

Okay, yes, not having contiguous free clusters is probably not a
valuable invariant anyway.

> The idea is that they'll be merged back together on the
> next deferred scan, but as noted on a different subthread, that's not
> currently working and I'll have to fix it.

Yes yes that part is clear to me. I don't find it very problematic, or
in any way "wrong".

> > Thinking of multithreading: defining flow_may_alloc() becomes more
> > complicated because the caller can't just assume the "allocation" will
> > succeed (as long as we don't cross an "epoll cycle" or something like
> > that). But we'll probably need some form of locking or userspace RCU
> > giving us barriers around the pair may_alloc() / alloc().  
> 
> For multi-threaded, I really think we'd want alloc/free semantics, not
> may_alloc/alloc semantics.  We have to change state in some way at
> "may alloc" time, or something else could try to allocate the same
> slot.  "cancel" semantics also don't make sense here, because we can
> no longer be confident that the alloc we did above is still the most
> recent alloc.  So a bunch of things would need to change for
> multi-threading.

By the way, another possibility would be to just go ahead and call
socket() and bind(), or accept() the connection from the socket, then if
we fail to "allocate" a flow we'll close the socket. That doesn't solve
the synchronisation problem entirely but it makes it simpler to handle.

Now, a bound socket or an accepted connection is visible to the user,
which is (I guess) the reason why you want to avoid this, but if we
can't open/accept new connections it's getting pretty bad anyway... so
should we really care?

Actually, calling accept() and then close() on a socket (peer gets RST)
is probably a nicer behaviour than letting a peer hang because we ran
out of slots.
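
Roughly (sketch: listen_fd and the flow_may_alloc() check from above
are placeholders, error handling omitted):

	int s = accept(listen_fd, NULL, NULL);

	if (s >= 0 && !flow_may_alloc()) {
		/* Zero linger timeout: close() sends RST instead of
		 * the usual FIN, matching the behaviour above.
		 */
		struct linger zl = { .l_onoff = 1, .l_linger = 0 };

		setsockopt(s, SOL_SOCKET, SO_LINGER, &zl, sizeof(zl));
		close(s);
	}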

> > If we stick to the failing alloc(), this part is simpler, but the
> > interpretation of flow_first_free and block sizes becomes non-trivial.  
> 
> Again, not sure what you're getting at.

By "failing alloc()" I mean the alloc/cancel semantics, as opposed to
an allocation function that can't fail.

There, as you mentioned, we can no longer be confident that the
canceled allocation would be the most recent one (which changes the
meaning of flow_first_free).

> > Well, on the other hand, it's all simple enough that we can change it
> > as needed (for example for multithreading). If we can hope that the new
> > scheme is reasonably low on bugs and we'll probably never have to guess
> > why a block has size one, I'm fine with the failing alloc() as well.

-- 
Stefano


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 13/13] flow: Avoid moving flow entries to compact table
  2024-01-06 13:02                     ` Stefano Brivio
@ 2024-01-07  5:20                       ` David Gibson
  0 siblings, 0 replies; 40+ messages in thread
From: David Gibson @ 2024-01-07  5:20 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 13456 bytes --]

On Sat, Jan 06, 2024 at 02:02:01PM +0100, Stefano Brivio wrote:
> On Sat, 6 Jan 2024 22:32:10 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Fri, Jan 05, 2024 at 11:27:54AM +0100, Stefano Brivio wrote:
> > > On Fri, 5 Jan 2024 20:39:50 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > On Fri, Jan 05, 2024 at 09:33:35AM +0100, Stefano Brivio wrote:  
> > > > > On Thu, 4 Jan 2024 21:02:19 +1100
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >     
> > > > > > On Tue, Jan 02, 2024 at 07:13:41PM +0100, Stefano Brivio wrote:    
> > > > > > > On Mon, 1 Jan 2024 23:01:17 +1100
> > > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > > >       
> > > > > > > > On Sat, Dec 30, 2023 at 11:33:04AM +0100, Stefano Brivio wrote:      
> > > > > > > > > On Thu, 28 Dec 2023 19:25:25 +0100
> > > > > > > > > Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > > > > >         
> > > > > > > > > > > On Thu, 21 Dec 2023 17:15:49 +1100
> > > > > > > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > > > > > > > 
> > > > > > > > > > > [...]        
> > > > > > > > > >
> > > > > > > > > > [...]
> > > > > > > > > >
> > > > > > > > > > I wonder if we really have to keep track of the number of (non-)entries
> > > > > > > > > > in the free "block", and if we have to be explicit about the two cases.
> > > > > > > > > > 
> > > > > > > > > > I'm trying to find out if we can simplify the whole thing with slightly
> > > > > > > > > > different variants, for example:        
> > > > > > > > > 
> > > > > > > > > So... I think the version with (explicit) blocks has this fundamental
> > > > > > > > > advantage, on deletion:
> > > > > > > > >         
> > > > > > > > > > > +	flow->f.type = FLOW_TYPE_NONE;
> > > > > > > > > > > +	/* Put it back in a length 1 free block, don't attempt to fully reverse
> > > > > > > > > > > +	 * flow_alloc()'s steps.  This will get folded together the next time
> > > > > > > > > > > +	 * flow_defer_handler() runs anyway */
> > > > > > > > > > > +	flow->free.n = 1;
> > > > > > > > > > > +	flow->free.next = flow_first_free;
> > > > > > > > > > > +	flow_first_free = FLOW_IDX(flow);        
> > > > > > > > > 
> > > > > > > > > which is doable even without explicit blocks, but much harder to
> > > > > > > > > follow.        
> > > > > > > > 
> > > > > > > > Remember this is not a general deletion, only a "cancel" of the most
> > > > > > > > recent allocation.      
> > > > > > > 
> > > > > > > Oh, I thought that was only the case for this series and you would use
> > > > > > > that as actual deletion in another pending series (which I haven't
> > > > > > > finished reviewing yet).      
> > > > > > 
> > > > > > No.  Not allowing deletion of any entry at any time is what I'm
> > > > > > trading off to get both O(1) allocation and (effectively) O(1)
> > > > > > deletion.
> > > > > >     
> > > > > > > But now I'm not sure anymore why I was thinking this...
> > > > > > > 
> > > > > > > Anyway... do we really need it, then? Can't we just mark the "failed"
> > > > > > > flows as whatever means "closed" for a specific protocol, and clean
> > > > > > > them up later, instead of calling cancel() right away?      
> > > > > > 
> > > > > > We could, but I'm not sure we want to.  For starters, that requires
> > > > > > protocol-specific behaviour whenever we need to back out an allocation
> > > > > > like this.  Not a big deal, since that's in protocol specific code
> > > > > > already, but I think it's uglier than calling cancel.
> > > > > > 
> > > > > > It also requires that the protocol specific deferred cleanup functions
> > > > > > (e.g. tcp_flow_defer()) handle partially initialised entries.  With
> > > > > > 'cancel' we can back out just the initialisation steps we've already
> > > > > > done (because we know where we've failed during init), then remove the
> > > > > > entry.  The deferred cleanup function only needs to deal with
> > > > > > "complete" entries.  Again, certainly possible, but IMO uglier than
> > > > > > having 'cancel'.    
> > > > > 
> > > > > Okay, yes, I see now.
> > > > > 
> > > > > Another doubt that comes to me now is: if you don't plan to use this
> > > > > alloc_cancel() thing anywhere else, the only reason why you are adding
> > > > > it is to replace the (flow_count >= FLOW_MAX) check with a flow_alloc()
> > > > > version that can fail.
> > > > > 
> > > > > But at this point, speaking of ugliness, couldn't we just have a
> > > > > bool flow_may_alloc() { return flow_first_free < FLOW_MAX; } that the caller
> > > > > can use to decide to abort earlier? To me it looks so much simpler and
> > > > > more robust.    
> > > > 
> > > > Well, we could, but there are a couple of reasons I don't love it.
> > > > The first is abstraction: this returns explicit handling of the layout
> > > > of the table to the protocol specific callers.  It's not a huge deal
> > > > right now, but once we have 4 or 5 protocols doing this, having to
> > > > change all of them if we make any tiny change to the semantics of
> > > > flow_first_free isn't great.  
> > > 
> > > Hmm, I don't get the difference in terms of abstraction between
> > > checking the return value of flow_alloc() and checking the return value
> > > of flow_may_alloc().  
> > 
> > Oh, sorry, I thought you were proposing open-coding the check against
> > FLOW_MAX, like it is at the moment.  See below for comments on a
> > flow_may_alloc() or similar call.
> > 
> > > In both cases the protocol handlers know that there's a table, a
> > > function to reserve entries, and that reserving entries might fail...
> > > and not much else.
> > >   
> > > > The other issue is that to do this (without a bunch of fairly large
> > > > and ugly temporaries) means we'd populate at least some of the fields
> > > > in flow_common before we have officially "allocated" the entry.  At
> > > > that point it becomes a bit fuzzy as to when that allocation really
> > > > occurs.  Is it when we do the FLOW_MAX test?  
> > > 
> > > I would say yes -- after that we can't fail.  
> > 
> > Uh.. we can't fail to allocate.  We can fail for other reasons.
> 
> Yes, I meant, if we pass the FLOW_MAX check, then we can't fail to
> allocate -- therefore flow_may_alloc() would contain just that check.

Right, ok.

> > > I mean, we work with rather constrained structures for a number of
> > > reasons, which comes with a number of hard problems... let's at least
> > > reap the benefits of it?  
> > 
> > I'm not sure what you're getting at here.
> 
> ...I'm saying that we don't actually allocate memory, which means that
> we can have a simple test (on FLOW_MAX), and then we know that, at
> least within the handling of a single epoll event, we will be able to
> allocate memory, without possible races.
> 
> We couldn't do that with an actual heap allocation because that's out
> of our control.
> 
> This is one of the few aspects where using statically allocated memory
> only could make our lives easier (while in general we need to spend
> more effort for other things).

Ok, I understand.

> > > > Is it when we write to
> > > > f.type?  Is it when we update flow_first_free?  If we fail somewhere
> > > > in the middle of that, what steps do we need to reverse?  
> > > 
> > > We can't fail in the middle of it, at the moment.  
> > 
> > Yes we can, that's kind of the whole point of this.
> 
> But as of this patch flow_alloc() can only fail on the FLOW_MAX check...

I'm not talking about allocation failure, I'm talking about the other
failures in setting up a new flow.  We can fail between checking
FLOW_MAX and "committing" to the allocation by updating
flow_first_free (or whatever).

Before the point we finally commit, we do want to write some fields of
the flow.  That's the obvious place to put the address from accept(),
for example.  But there are some things we mustn't touch, because it
will mess up how the slot is interpreted as a free slot (type, and
anything in the protocol specific fields, because that will overlap
struct flow_free_cluster).

I don't like having this fragile intermediate state with subtle rules
about what we can and can't safely do to the flow entry.  With alloc
cancel, we can do whatever we like as soon as we've alloc()ed, and on
failure the cancel() will restore whatever free cluster information
needs to be there.
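
To make it concrete, the usage pattern I'm after is roughly this
(sketch, not the exact code from the series; 's' and 'addr' stand for
whatever the protocol handler has at hand):

	union flow *flow = flow_alloc();

	if (!flow)
		return;		/* table genuinely full */

	/* From here on, any field is fair game, f.type included */
	flow->f.type = FLOW_TCP;

	if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(s);
		/* Put the slot back as a length 1 free cluster */
		flow_alloc_cancel(flow);
		return;
	}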

> > > Of course, if the
> > > "allocation" function changes, we might need to change the scheme. But
> > > is it really likely? And anyway it's just a few lines in your current
> > > version...
> > >   
> > > > For those reasons I prefer the scheme presented.  Fwiw, in an earlier
> > > > draft I did this differently with a "flow_prealloc()", which was
> > > > essentially the check against FLOW_MAX, then a later
> > > > flow_alloc_commit().  I thought it turned out pretty confusing
> > > > compared to the alloc/cancel approach.  
> > > 
> > > The current flow_alloc_cancel() implementation is definitely simple and
> > > semantically clear.
> > > 
> > > What worries me a bit is that you would have two different cases for
> > > free "blocks" of size one, depending on the order of the events. So if
> > > we want to debug something like that and we see a block with size one
> > > it might be a failed bind(), so a fake one, or also not: it might be an
> > > actual block with size one.  
> > 
> > The cluster of size one from cancel is still a valid free cluster that
> > satisfies all the usual invariants, it's not "fake".  It does mean
> > that we could get two contiguous free clusters, which we wouldn't
> > otherwise.
> 
> Okay, yes, not having contiguous free clusters is probably not a
> valuable invariant anyway.

Right, that was my feeling.

> > The idea is that they'll be merged back together on the
> > next deferred scan, but as noted on a different subthread, that's not
> > currently working and I'll have to fix it.
> 
> Yes yes that part is clear to me. I don't find it very problematic, or
> in any way "wrong".
> 
> > > Thinking of multithreading: defining flow_may_alloc() becomes more
> > > complicated because the caller can't just assume the "allocation" will
> > > succeed (as long as we don't cross an "epoll cycle" or something like
> > > that). But we'll probably need some form of locking or userspace RCU
> > > giving us barriers around the pair may_alloc() / alloc().  
> > 
> > For multi-threaded, I really think we'd want alloc/free semantics, not
> > may_alloc/alloc semantics.  We have to change state in some way at
> > "may alloc" time, or something else could try to allocate the same
> > slot.  "cancel" semantics also don't make sense here, because we can
> > no longer be confident that the alloc we did above is still the most
> > recent alloc.  So a bunch of things would need to change for
> > multi-threading.
> 
> By the way, another possibility would be to just go ahead and call
> socket() and bind(), or accept() the connection from the socket, then if
> we fail to "allocate" a flow we'll close the socket. That doesn't solve
> the synchronisation problem entirely but it makes it simpler to handle.

We could.  But (a) the may-alloc check is much cheaper than the
syscalls and (b) that means we need to store all information from
those syscalls temporarily before copying it into the flow table,
which is kind of nasty.

> Now, a bound socket or an accepted connection is visible to the user,
> which is (I guess) the reason why you want to avoid this, but if we
> can't open/accept new connections it's getting pretty bad anyway... so
> should we really care?

Right, that's certainly not my concern about it.

> Actually, calling accept() and then close() on a socket (peer gets RST)
> is probably a nicer behaviour than letting a peer hang because we ran
> out of slots.

Maybe, maybe not.  If we are persistently overloaded, then yes.
However, if we just have a sudden spike of short-lived connections,
then we could just "bounce" against the flow limit, but we'll still
process those other connection attempts in a little bit.

> > > If we stick to the failing alloc(), this part is simpler, but the
> > > interpretation of flow_first_free and block sizes becomes non-trivial.  
> > 
> > Again, not sure what you're getting at.
> 
> By "failing alloc()" I mean the alloc/cancel semantics, as opposed to
> an allocation function that can't fail.

That bit I got.

> There, as you mentioned, we can no longer be confident that the
> canceled allocation would be the most recent one (which changes the
> meaning of flow_first_free).

If we're multithreading, then I think this allocation scheme needs
some revision anyway.

> > > Well, on the other hand, it's all simple enough that we can change it
> > > as needed (for example for multithreading). If we can hope that the new
> > > scheme is reasonably low on bugs and we'll probably never have to guess
> > > why a block has size one, I'm fine with the failing alloc() as well.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/13] flow: Move flow_count from context structure to a global
  2024-01-05  7:55               ` Stefano Brivio
@ 2024-01-07  5:23                 ` David Gibson
  0 siblings, 0 replies; 40+ messages in thread
From: David Gibson @ 2024-01-07  5:23 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 9771 bytes --]

On Fri, Jan 05, 2024 at 08:55:13AM +0100, Stefano Brivio wrote:
> On Thu, 4 Jan 2024 20:51:19 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Wed, Jan 03, 2024 at 08:08:34AM +0100, Stefano Brivio wrote:
> > > On Wed, 3 Jan 2024 14:54:27 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > I'm not sure where to get the actual text of the standards  
> > > 
> > > Let me answer this first: one (the?) trick is to use so-called final
> > > drafts, which are made freely available (same as working drafts) by the
> > > Working Group.
> > > 
> > > Those are not the same as the standards, but differences from the final
> > > draft are also published... and they are usually not substantial.
> > > 
> > > This is _very_ informative:
> > >   https://stackoverflow.com/questions/81656/where-do-i-find-the-current-c-or-c-standard-documents  
> > 
> > Ah, thanks.
> > 
> > > Wikipedia also has the links, by the way. Anyway, in practice:
> > > 
> > > - C11: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf
> > > - C99: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf
> > > - C89:
> > >   https://web.archive.org/web/20200909074736if_/https://www.pdf-archive.com/2014/10/02/ansi-iso-9899-1990-1/ansi-iso-9899-1990-1.pdf
> > >   
> > > > On Tue, Jan 02, 2024 at 07:13:35PM +0100, Stefano Brivio wrote:  
> > > > > On Sun, 31 Dec 2023 16:58:39 +1100
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >     
> > > > > > On Thu, Dec 28, 2023 at 07:25:18PM +0100, Stefano Brivio wrote:    
> > > > > > > On Thu, 21 Dec 2023 17:15:46 +1100
> > > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > > >       
> > > > > > > > In general, the passt code is a bit haphazard about what's a true global
> > > > > > > > variable and what's in the quasi-global 'context structure'.  The
> > > > > > > > flow_count field is one such example: it's in the context structure,
> > > > > > > > although it's really part of the same data structure as flowtab[], which
> > > > > > > > is a genuine global.      
> > > > > > > 
> > > > > > > Well, the reason is that flow_tab[FLOW_MAX] might be problematically
> > > > > > > too big to live on the stack, unlike flow_count.
> > > > > > > 
> > > > > > > But anyway, as far as thoughts of multithreading are concerned, both
> > > > > > > should probably be global. And sure, it's more consistent this way.
> > > > > > >       
> > > > > > > > Move flow_count to be a regular global to match.  For now it needs to be
> > > > > > > > public, rather than static, but we expect to be able to change that in
> > > > > > > > future.      
> > > > > > > 
> > > > > > > If it's not static, it should be initialised, and that's not done here.      
> > > > > > 
> > > > > > Uh... what?  "static" here is meaning module-global rather than
> > > > > > global-global, which has no bearing on initialisation.  AFAIK globals
> > > > > > are zero-initialised whether they're static or not.    
> > > > > 
> > > > > ...and to my utter surprise, I just discovered that if you talk C11,
> > > > > you're right. From the N1570 draft (ISO/IEC 9899:201x), Section 6.7.9
> > > > > "Initialization", clause 10:
> > > > > 
> > > > >   If an object that has automatic storage duration is not initialized
> > > > >   explicitly, its value is indeterminate. If an object that has static
> > > > >   or thread storage duration is not initialized explicitly, then:
> > > > > 
> > > > >   [...]
> > > > > 
> > > > >   — if it has arithmetic type, it is initialized to (positive or
> > > > >     unsigned) zero;
> > > > > 
> > > > > And 'flow_count' has thread storage duration.    
> > > > 
> > > > No.. I don't think it does.  AFAICT only thread-local variables have
> > > > thread storage duration.  As a global flow_count will have static
> > > > storage duration, even without the static keyword.  
> > > 
> > > So, C11 defines static storage duration here:
> > > 
> > >   6.2.4 Storage durations of objects
> > > 
> > >   [...]
> > > 
> > >   3 An object whose identifier is declared without the storage-class
> > >   specifier _Thread_local, and either with external or internal linkage
> > >   or with the storage-class specifier static, has static storage
> > >   duration. Its lifetime is the entire execution of the program and its
> > >   stored value is initialized only once, prior to program startup.
> > > 
> > > do we have any linkage here? I would have said no -- but, going back
> > > to C99 for this, "6.2.2 Linkages of identifiers":
> > > 
> > >   5 [...] If the declaration of an identifier for an object has file
> > >   scope and no storage-class specifier, its linkage is external.
> > > 
> > > which supports your paragraph below.  
> > 
> > Right.
> > 
> > > By the way, C11 now says:
> > > 
> > >   6.11.2 Linkages of identifiers
> > > 
> > >   1 Declaring an identifier with internal linkage at file scope without
> > >   the static storage-class specifier is an obsolescent feature  
> > 
> > Ok.  I'm not even sure how you would do that.
> 
> By doing what I *thought* you were doing (see below): "int x" at file
> scope (outside functions), no static, no extern declaration, nothing.

Oh, ok.  Even without that, I think the behaviour has to be the same.
Plain "int x" at file scope is a public variable, and a linked module
could "extern" it, even if that extern isn't visible while we're
compiling this module.
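
That is (minimal example):

	/* a.c */
	int x;			/* file scope, no 'static': external
				 * linkage, static storage duration,
				 * hence zero-initialised */

	/* b.c */
	extern int x;		/* same object, whether or not a.c
				 * ever sees this declaration */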

> > > > > In C99, however (draft
> > > > > N1256), Section 6.7.8 "Initialization", clause 10:
> > > > > 
> > > > >   If an object that has automatic storage duration is not initialized
> > > > >   explicitly, its value is indeterminate. If an object that has static
> > > > >   storage duration is not initialized explicitly, then:
> > > > > 
> > > > >   [...]
> > > > > 
> > > > > note the missing "or thread storage duration".
> > > > > 
> > > > > C89, the one I was actually basing my observation on, says, at 3.5.7
> > > > > "Initialization":
> > > > > 
> > > > >   If an object that has static storage duration is not initialized
> > > > >   explicitly, it is initialized implicitly as if every member that has
> > > > >   arithmetic type were assigned 0 and every member that has pointer type
> > > > >   were assigned a null pointer constant.  If an object that has
> > > > >   automatic storage duration is not initialized explicitly, its value is
> > > > >   indeterminate.
> > > > > 
> > > > > so... um. We won't go back to C99. But to me, and maybe others, not
> > > > > having a "= 0;" for a "global" means pretty much that we don't rely on
> > > > > any particular initial value.    
> > > > 
> > > > Again, I'm pretty sure that's not true, even for C99 and C89.  AIUI,
> > > > 'static' locals and *all* globals have "static storage duration".
> > > > 
> > > > I'm not sure where to get the actual text of the standards but see for
> > > > example
> > > > 
> > > > https://en.cppreference.com/w/c/language/static_storage_duration
> > > > 
> > > > Here 'flow_count' has external linkage, thus satisfying the conditions
> > > > for static storage duration.  
> > > 
> > > Right. Well, for C99 and C11 at least. For C89 things are slightly
> > > different:
> > > 
> > >   6.1.2.4 Storage durations of objects
> > > 
> > >   [...]
> > > 
> > >   An object whose identifier is declared with external or internal
> > >   linkage, or with the storage-class specifier static has static storage
> > >   duration.
> > > 
> > >   [...]
> > > 
> > >   An object whose identifier is declared with no linkage and without the
> > >   storage-class specifier static has automatic storage duration.
> > > 
> > > You might say it has external linkage. But it was not *declared with*
> > > external linkage -- it just happens to have it (C89 and C99 don't
> > > differ here).  
> > 
> > Hrm.  We do have:
> > 	extern unsigned flow_first_free;
> > in flow_table.h.  Does that count as declaring with external linkage?
> 
> Gosh, sorry... yes, it also counts as me completely missing it.
> 
> > > > Fwiw, I'm pretty sure the kernel has relied on zero-initialization of
> > > > non-static globals for many years.  
> > > 
> > > True, and the opposite has even been considered a style issue since 2007,
> > > commit f0a594c1c74f ("update checkpatch.pl to version 0.08"). I also
> > > found a discussion similar to this one:
> > >   https://lore.kernel.org/all/20201102184147.GA42288@localhost/#r
> > > 
> > > Anyway... a couple of years before that (it must have been a gcc version
> > > in the late 2.x), I actually hit an issue with it. Was it a compiler
> > > issue, or the correct interpretation of C89? Or maybe something on the
> > > lines of:
> > >   https://www.thegoodpenguin.co.uk/blog/u-boot-relocation-bss-hang/  
> > 
> > If it was an embedded setup, that last one is certainly possible.
> > Zeroing the BSS is typically the loader's job, and I've certainly seen
> > loader implementations - particularly in embedded firmware - that got
> > this wrong.
> 
> Embedded setup, yes, but in a Linux kernel (early 2.4.x). No 'extern'
> there, though.

Right.  Depending on platform, the kernel sometimes has a loader for a
later stage, so it does need to clear the BSS itself.  Or it might need
to as a workaround for a broken firmware loader.  I'm pretty sure I've
also seen Linux bugs on some embedded platforms that involved failing
to clear the BSS.
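
For reference, "clearing the BSS" in an early stage is just something
like this (sketch; the symbol names come from the linker script and
vary between platforms):

	extern char __bss_start[], __bss_stop[];

	static void clear_bss(void)
	{
		char *p;

		/* Byte-at-a-time to avoid depending on memset() this
		 * early in boot
		 */
		for (p = __bss_start; p < __bss_stop; p++)
			*p = 0;
	}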

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2024-01-07  5:23 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-21  6:15 [PATCH v3 00/13] Manage more flow related things from generic flow code David Gibson
2023-12-21  6:15 ` [PATCH v3 01/13] flow: Make flow_table.h #include the protocol specific headers it needs David Gibson
2023-12-21  6:15 ` [PATCH v3 02/13] treewide: Standardise on 'now' for current timestamp variables David Gibson
2023-12-21  6:15 ` [PATCH v3 03/13] tcp, tcp_splice: Remove redundant handling from tcp_timer() David Gibson
2023-12-21  6:15 ` [PATCH v3 04/13] tcp, tcp_splice: Move per-type cleanup logic into per-type helpers David Gibson
2023-12-21  6:15 ` [PATCH v3 05/13] flow, tcp: Add flow-centric dispatch for deferred flow handling David Gibson
2023-12-28 18:24   ` Stefano Brivio
2023-12-31  5:56     ` David Gibson
2024-01-02 18:13       ` Stefano Brivio
2024-01-03  3:45         ` David Gibson
2023-12-21  6:15 ` [PATCH v3 06/13] flow, tcp: Add handling for per-flow timers David Gibson
2023-12-21  6:15 ` [PATCH v3 07/13] epoll: Better handling of number of epoll types David Gibson
2023-12-21  6:15 ` [PATCH v3 08/13] tcp, tcp_splice: Avoid double layered dispatch for connected TCP sockets David Gibson
2023-12-21  6:15 ` [PATCH v3 09/13] flow: Move flow_log_() to near top of flow.c David Gibson
2023-12-21  6:15 ` [PATCH v3 10/13] flow: Move flow_count from context structure to a global David Gibson
2023-12-28 18:25   ` Stefano Brivio
2023-12-31  5:58     ` David Gibson
2024-01-02 18:13       ` Stefano Brivio
2024-01-03  3:54         ` David Gibson
2024-01-03  7:08           ` Stefano Brivio
2024-01-04  9:51             ` David Gibson
2024-01-05  7:55               ` Stefano Brivio
2024-01-07  5:23                 ` David Gibson
2023-12-21  6:15 ` [PATCH v3 11/13] flow: Abstract allocation of new flows with helper function David Gibson
2023-12-21  6:15 ` [PATCH v3 12/13] flow: Enforce that freeing of closed flows must happen in deferred handlers David Gibson
2023-12-21  6:15 ` [PATCH v3 13/13] flow: Avoid moving flow entries to compact table David Gibson
2023-12-28 18:25   ` Stefano Brivio
2023-12-30 10:33     ` Stefano Brivio
2024-01-01 12:01       ` David Gibson
2024-01-02 18:13         ` Stefano Brivio
2024-01-04 10:02           ` David Gibson
2024-01-05  8:33             ` Stefano Brivio
2024-01-05  9:39               ` David Gibson
2024-01-05 10:27                 ` Stefano Brivio
2024-01-06 11:32                   ` David Gibson
2024-01-06 13:02                     ` Stefano Brivio
2024-01-07  5:20                       ` David Gibson
2024-01-01 10:44     ` David Gibson
2024-01-02 18:13       ` Stefano Brivio
2024-01-05  9:45         ` David Gibson

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).