public inbox for passt-dev@passt.top
* [PATCH v5 00/19] RFC: Unified flow table
@ 2024-05-14  1:03 David Gibson
  2024-05-14  1:03 ` [PATCH v5 01/19] flow: Clarify and enforce flow state transitions David Gibson
                   ` (18 more replies)
  0 siblings, 19 replies; 28+ messages in thread
From: David Gibson @ 2024-05-14  1:03 UTC
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

This is a fourth draft of the first steps in implementing more general
"connection" tracking, as described at:
    https://pad.passt.top/p/NewForwardingModel

This series changes the TCP connection table and hash table into a
more general flow table that can track other protocols as well.  Each
flow uniformly keeps track of all the relevant addresses and ports,
which will allow for more robust control of NAT and port forwarding.

ICMP is converted to use the new flow table.

This doesn't include UDP, but I'm working on it right now and making
progress.  I'm posting this to give a head start on the review :)

Caveats:
 * We significantly increase the size of a connection/flow entry

Changes since v4:
 * flowside_from_af() no longer fills in unspecified addresses when
   passed NULL
 * Split and rename flow hash lookup function
 * Clarified flow state transitions, and enforced where practical
 * Made side 0 always the initiating side of a flow, rather than
   letting the protocol specific code decide
 * Separated pifs from flowside addresses to allow better structure
   packing

Changes since v3:
 * Complex rebase on top of the many things that have happened
   upstream since v2.
 * Assorted other changes.
 * Replace TAPFSIDE() and SOCKFSIDE() macros with local variables.

Changes since v2:
 * Cosmetic fixes based on review
 * Extra doc comments for enum flow_type
 * Rename flowside to flowaddrs which turns out to make more sense in
   light of future changes
 * Fix bug where the socket flowaddrs for tap initiated connections
   wasn't initialised to match the socket address we were using in the
   case of map-gw NAT
 * New flowaddrs_from_sock() helper used in most cases which is cleaner
   and should avoid bugs like the above
 * Using newer centralised workarounds for clang-tidy issue 58992
 * Remove duplicate definition of FLOW_MAX as maximum flow type and
   maximum number of tracked flows
 * Rebased on newer versions of preliminary work (ICMP, flow based
   dispatch and allocation, bind/address cleanups)
 * Unified hash table as well as base flow table
 * Integrated ICMP

Changes since v1:
 * Terminology changes
   - "Endpoint" address/port instead of "correspondent" address/port
   - "flowside" instead of "demiflow"
 * Actually move the connection table to a new flow table structure in
   new files
 * Significant rearrangement of earlier patches on top of that new
   table, to reduce churn

David Gibson (19):
  flow: Clarify and enforce flow state transitions
  flow: Make side 0 always be the initiating side
  flow: Record the pifs for each side of each flow
  tcp: Remove interim 'tapside' field from connection
  flow: Common data structures for tracking flow addresses
  flow: Populate address information for initiating side
  flow: Populate address information for non-initiating side
  tcp, flow: Remove redundant information, repack connection structures
  tcp: Obtain guest address from flowside
  tcp: Simplify endpoint validation using flowside information
  tcp_splice: Eliminate SPLICE_V6 flag
  tcp, flow: Replace TCP specific hash function with general flow hash
  flow, tcp: Generalise TCP hash table to general flow hash table
  tcp: Re-use flow hash for initial sequence number generation
  icmp: Use flowsides as the source of truth wherever possible
  icmp: Look up ping flows using flow hash
  icmp: Eliminate icmp_id_map
  flow, tcp: Flow based NAT and port forwarding for TCP
  flow, icmp: Use general flow forwarding rules for ICMP

 flow.c       | 538 +++++++++++++++++++++++++++++++++++++++++++++------
 flow.h       | 149 +++++++++++++-
 flow_table.h |  21 ++
 fwd.c        | 110 +++++++++++
 fwd.h        |  12 ++
 icmp.c       |  98 ++++++----
 icmp_flow.h  |   1 -
 inany.h      |  29 ++-
 passt.h      |   3 +
 pif.h        |   1 -
 tap.c        |  11 --
 tap.h        |   1 -
 tcp.c        | 484 ++++++++++++---------------------------------
 tcp_conn.h   |  36 ++--
 tcp_splice.c |  97 ++--------
 tcp_splice.h |   5 +-
 16 files changed, 999 insertions(+), 597 deletions(-)

-- 
2.45.0



* [PATCH v5 01/19] flow: Clarify and enforce flow state transitions
  2024-05-14  1:03 [PATCH v5 00/19] RFC: Unified flow table David Gibson
@ 2024-05-14  1:03 ` David Gibson
  2024-05-16  9:30   ` Stefano Brivio
  2024-05-14  1:03 ` [PATCH v5 02/19] flow: Make side 0 always be the initiating side David Gibson
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2024-05-14  1:03 UTC
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

Flows move through several different states during their lifetime.  The
rules for these are documented in comments, but they're fairly complex and
a number of the transitions are implicit, which makes the code fragile and
error prone.

Change the code to explicitly track the states in a field.  Make all
transitions explicit and logged.  To the extent that it's practical in C,
enforce what can and can't be done in various states with ASSERT()s.

While we're at it, tweak the docs to clarify the restrictions on each state
a bit.
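
As an illustration only, the idea of funnelling every transition through one
logged, asserted function can be sketched as below.  This is a minimal
stand-alone sketch, not the passt code: all the sketch_* names are
hypothetical, the "at most one entry being set up" tracking done by
flow_new_entry is left out, and plain assert() stands in for ASSERT().

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-ins for the states in this patch (hypothetical names) */
enum sketch_state {
	SKETCH_FREE,
	SKETCH_NEW,
	SKETCH_TYPED,
	SKETCH_ACTIVE,
	SKETCH_NUM_STATES,
};

static const char *sketch_state_str[SKETCH_NUM_STATES] = {
	[SKETCH_FREE]	= "FREE",
	[SKETCH_NEW]	= "NEW",
	[SKETCH_TYPED]	= "TYPED",
	[SKETCH_ACTIVE]	= "ACTIVE",
};

struct sketch_flow {
	uint8_t state;
	uint8_t type;	/* 0 means "no type set" in this sketch */
};

/* Single choke point: every state change is range checked and logged */
static void sketch_set_state(struct sketch_flow *f, enum sketch_state state)
{
	uint8_t old = f->state;

	assert(state < SKETCH_NUM_STATES);
	assert(old < SKETCH_NUM_STATES);

	f->state = state;
	fprintf(stderr, "flow: %s -> %s\n",
		sketch_state_str[old], sketch_state_str[state]);
}

/* FREE -> NEW */
static void sketch_alloc(struct sketch_flow *f)
{
	assert(f->state == SKETCH_FREE);
	sketch_set_state(f, SKETCH_NEW);
}

/* NEW -> TYPED, only legal once and only with a real type */
static void sketch_set_type(struct sketch_flow *f, uint8_t type)
{
	assert(type != 0);
	assert(f->state == SKETCH_NEW && f->type == 0);
	f->type = type;
	sketch_set_state(f, SKETCH_TYPED);
}

/* TYPED -> ACTIVE */
static void sketch_activate(struct sketch_flow *f)
{
	assert(f->state == SKETCH_TYPED);
	sketch_set_state(f, SKETCH_ACTIVE);
}

/* NEW or TYPED -> FREE; cancelling an ACTIVE flow trips the assert */
static void sketch_cancel(struct sketch_flow *f)
{
	assert(f->state == SKETCH_NEW || f->state == SKETCH_TYPED);
	sketch_set_state(f, SKETCH_FREE);
	f->type = 0;
}
```

A caller would go sketch_alloc(), sketch_set_type(), sketch_activate() in
that order; calling them out of order aborts instead of silently corrupting
the entry, which is the point of making the state explicit.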

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c       | 144 ++++++++++++++++++++++++++++++---------------------
 flow.h       |  67 ++++++++++++++++++++++--
 flow_table.h |  10 ++++
 icmp.c       |   4 +-
 tcp.c        |   8 ++-
 tcp_splice.c |   4 +-
 6 files changed, 168 insertions(+), 69 deletions(-)

diff --git a/flow.c b/flow.c
index 80dd269..768e0f6 100644
--- a/flow.c
+++ b/flow.c
@@ -18,6 +18,15 @@
 #include "flow.h"
 #include "flow_table.h"
 
+const char *flow_state_str[] = {
+	[FLOW_STATE_FREE]	= "FREE",
+	[FLOW_STATE_NEW]	= "NEW",
+	[FLOW_STATE_TYPED]	= "TYPED",
+	[FLOW_STATE_ACTIVE]	= "ACTIVE",
+};
+static_assert(ARRAY_SIZE(flow_state_str) == FLOW_NUM_STATES,
+	      "flow_state_str[] doesn't match enum flow_state");
+
 const char *flow_type_str[] = {
 	[FLOW_TYPE_NONE]	= "<none>",
 	[FLOW_TCP]		= "TCP connection",
@@ -39,46 +48,6 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
 
 /* Global Flow Table */
 
-/**
- * DOC: Theory of Operation - flow entry life cycle
- *
- * An individual flow table entry moves through these logical states, usually in
- * this order.
- *
- *    FREE - Part of the general pool of free flow table entries
- *        Operations:
- *            - flow_alloc() finds an entry and moves it to ALLOC state
- *
- *    ALLOC - A tentatively allocated entry
- *        Operations:
- *            - flow_alloc_cancel() returns the entry to FREE state
- *            - FLOW_START() set the entry's type and moves to START state
- *        Caveats:
- *            - It's not safe to write fields in the flow entry
- *            - It's not safe to allocate further entries with flow_alloc()
- *            - It's not safe to return to the main epoll loop (use FLOW_START()
- *              to move to START state before doing so)
- *            - It's not safe to use flow_*() logging functions
- *
- *    START - An entry being prepared by flow type specific code
- *        Operations:
- *            - Flow type specific fields may be accessed
- *            - flow_*() logging functions
- *            - flow_alloc_cancel() returns the entry to FREE state
- *        Caveats:
- *            - Returning to the main epoll loop or allocating another entry
- *              with flow_alloc() implicitly moves the entry to ACTIVE state.
- *
- *    ACTIVE - An active flow entry managed by flow type specific code
- *        Operations:
- *            - Flow type specific fields may be accessed
- *            - flow_*() logging functions
- *            - Flow may be expired by returning 'true' from flow type specific
- *              deferred or timer handler.  This will return it to FREE state.
- *        Caveats:
- *            - It's not safe to call flow_alloc_cancel()
- */
-
 /**
  * DOC: Theory of Operation - allocating and freeing flow entries
  *
@@ -132,6 +101,7 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
 
 unsigned flow_first_free;
 union flow flowtab[FLOW_MAX];
+static const union flow *flow_new_entry; /* = NULL */
 
 /* Last time the flow timers ran */
 static struct timespec flow_timer_run;
@@ -144,6 +114,7 @@ static struct timespec flow_timer_run;
  */
 void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 {
+	const char *typestate;
 	char msg[BUFSIZ];
 	va_list args;
 
@@ -151,40 +122,65 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 	(void)vsnprintf(msg, sizeof(msg), fmt, args);
 	va_end(args);
 
-	logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
+	/* Show type if it's set, otherwise the state */
+	if (f->state < FLOW_STATE_TYPED)
+		typestate = FLOW_STATE(f);
+	else
+		typestate = FLOW_TYPE(f);
+
+	logmsg(pri, "Flow %u (%s): %s", flow_idx(f), typestate, msg);
+}
+
+/**
+ * flow_set_state() - Change flow's state
+ * @f:		Flow to update
+ * @state:	New state
+ */
+static void flow_set_state(struct flow_common *f, enum flow_state state)
+{
+	uint8_t oldstate = f->state;
+
+	ASSERT(state < FLOW_NUM_STATES);
+	ASSERT(oldstate < FLOW_NUM_STATES);
+
+	f->state = state;
+	flow_log_(f, LOG_DEBUG, "%s -> %s", flow_state_str[oldstate],
+		  FLOW_STATE(f));
 }
 
 /**
- * flow_start() - Set flow type for new flow and log
- * @flow:	Flow to set type for
+ * flow_set_type() - Set type and mvoe to TYPED state
+ * @flow:	Flow to change state
  * @type:	Type for new flow
  * @iniside:	Which side initiated the new flow
  *
  * Return: @flow
- *
- * Should be called before setting any flow type specific fields in the flow
- * table entry.
  */
-union flow *flow_start(union flow *flow, enum flow_type type,
-		       unsigned iniside)
+union flow *flow_set_type(union flow *flow, enum flow_type type,
+			  unsigned iniside)
 {
+	struct flow_common *f = &flow->f;
+
+	ASSERT(type != FLOW_TYPE_NONE);
+	ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_NEW);
+	ASSERT(f->type == FLOW_TYPE_NONE);
+
 	(void)iniside;
-	flow->f.type = type;
-	flow_dbg(flow, "START %s", flow_type_str[flow->f.type]);
+	f->type = type;
+	flow_set_state(f, FLOW_STATE_TYPED);
 	return flow;
 }
 
 /**
- * flow_end() - Clear flow type for finished flow and log
- * @flow:	Flow to clear
+ * flow_activate() - Move flow to ACTIVE state
+ * @f:		Flow to change state
  */
-static void flow_end(union flow *flow)
+void flow_activate(struct flow_common *f)
 {
-	if (flow->f.type == FLOW_TYPE_NONE)
-		return; /* Nothing to do */
+	ASSERT(&flow_new_entry->f == f && f->state == FLOW_STATE_TYPED);
 
-	flow_dbg(flow, "END %s", flow_type_str[flow->f.type]);
-	flow->f.type = FLOW_TYPE_NONE;
+	flow_set_state(f, FLOW_STATE_ACTIVE);
+	flow_new_entry = NULL;
 }
 
 /**
@@ -196,9 +192,12 @@ union flow *flow_alloc(void)
 {
 	union flow *flow = &flowtab[flow_first_free];
 
+	ASSERT(!flow_new_entry);
+
 	if (flow_first_free >= FLOW_MAX)
 		return NULL;
 
+	ASSERT(flow->f.state == FLOW_STATE_FREE);
 	ASSERT(flow->f.type == FLOW_TYPE_NONE);
 	ASSERT(flow->free.n >= 1);
 	ASSERT(flow_first_free + flow->free.n <= FLOW_MAX);
@@ -221,7 +220,10 @@ union flow *flow_alloc(void)
 		flow_first_free = flow->free.next;
 	}
 
+	flow_new_entry = flow;
 	memset(flow, 0, sizeof(*flow));
+	flow_set_state(&flow->f, FLOW_STATE_NEW);
+
 	return flow;
 }
 
@@ -233,15 +235,21 @@ union flow *flow_alloc(void)
  */
 void flow_alloc_cancel(union flow *flow)
 {
+	ASSERT(flow_new_entry == flow);
+	ASSERT(flow->f.state == FLOW_STATE_NEW ||
+	       flow->f.state == FLOW_STATE_TYPED);
 	ASSERT(flow_first_free > FLOW_IDX(flow));
 
-	flow_end(flow);
+	flow_set_state(&flow->f, FLOW_STATE_FREE);
+	memset(flow, 0, sizeof(*flow));
+
 	/* Put it back in a length 1 free cluster, don't attempt to fully
 	 * reverse flow_alloc()s steps.  This will get folded together the next
 	 * time flow_defer_handler runs anyway() */
 	flow->free.n = 1;
 	flow->free.next = flow_first_free;
 	flow_first_free = FLOW_IDX(flow);
+	flow_new_entry = NULL;
 }
 
 /**
@@ -265,7 +273,8 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 		union flow *flow = &flowtab[idx];
 		bool closed = false;
 
-		if (flow->f.type == FLOW_TYPE_NONE) {
+		switch (flow->f.state) {
+		case FLOW_STATE_FREE: {
 			unsigned skip = flow->free.n;
 
 			/* First entry of a free cluster must have n >= 1 */
@@ -287,6 +296,20 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 			continue;
 		}
 
+		case FLOW_STATE_NEW:
+		case FLOW_STATE_TYPED:
+			flow_err(flow, "Incomplete flow at end of cycle");
+			ASSERT(false);
+			break;
+
+		case FLOW_STATE_ACTIVE:
+			/* Nothing to do */
+			break;
+
+		default:
+			ASSERT(false);
+		}
+
 		switch (flow->f.type) {
 		case FLOW_TYPE_NONE:
 			ASSERT(false);
@@ -310,7 +333,8 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
 		}
 
 		if (closed) {
-			flow_end(flow);
+			flow_set_state(&flow->f, FLOW_STATE_FREE);
+			memset(flow, 0, sizeof(*flow));
 
 			if (free_head) {
 				/* Add slot to current free cluster */
diff --git a/flow.h b/flow.h
index c943c44..073a734 100644
--- a/flow.h
+++ b/flow.h
@@ -9,6 +9,66 @@
 
 #define FLOW_TIMER_INTERVAL		1000	/* ms */
 
+/**
+ * enum flow_state - States of a flow table entry
+ *
+ * An individual flow table entry moves through these states, usually in this
+ * order.
+ *  General rules:
+ *    - Code outside flow.c should never write common fields of union flow.
+ *    - The state field may always be read.
+ *
+ *    FREE - Part of the general pool of free flow table entries
+ *        Operations:
+ *            - flow_alloc() finds an entry and moves it to NEW state
+ *
+ *    NEW - Freshly allocated, uninitialised entry
+ *        Operations:
+ *            - flow_alloc_cancel() returns the entry to FREE state
+ *            - FLOW_SET_TYPE() sets the entry's type and moves to TYPED state
+ *        Caveats:
+ *            - No fields other than state may be accessed.
+ *            - At most one entry may be in NEW or TYPED state at a time, so it's
+ *              unsafe to use flow_alloc() again until this entry moves to
+ *              ACTIVE or FREE state 
+ *            - You may not return to the main epoll loop while an entry is in
+ *              NEW state.
+ *
+ *    TYPED - Generic info initialised, type specific initialisation underway
+ *        Operations:
+ *            - All common fields may be read
+ *            - Type specific fields may be read and written
+ *            - flow_alloc_cancel() returns the entry to FREE state
+ *            - FLOW_ACTIVATE() moves the entry to ACTIVE STATE
+ *        Caveats:
+ *            - At most one entry may be in NEW or TYPED state at a time, so it's
+ *              unsafe to use flow_alloc() again until this entry moves to
+ *              ACTIVE or FREE state 
+ *            - You may not return to the main epoll loop while an entry is in
+ *              TYPED state.
+ *
+ *    ACTIVE - An active, fully-initialised flow entry
+ *        Operations:
+ *            - All common fields may be read
+ *            - Type specific fields may be read and written
+ *            - Flow may be expired by returning 'true' from flow type specific
+ *              deferred or timer handler.  This will return it to FREE state.
+ *        Caveats:
+ *            - flow_alloc_cancel() may not be called on it
+ */
+enum flow_state {
+	FLOW_STATE_FREE,
+	FLOW_STATE_NEW,
+	FLOW_STATE_TYPED,
+	FLOW_STATE_ACTIVE,
+
+	FLOW_NUM_STATES,
+};
+
+extern const char *flow_state_str[];
+#define FLOW_STATE(f)							\
+        ((f)->state < FLOW_NUM_STATES ? flow_state_str[(f)->state] : "?")
+
 /**
  * enum flow_type - Different types of packet flows we track
  */
@@ -37,9 +97,11 @@ extern const uint8_t flow_proto[];
 
 /**
  * struct flow_common - Common fields for packet flows
+ * @state:	State of the flow table entry
  * @type:	Type of packet flow
  */
 struct flow_common {
+	uint8_t		state;
 	uint8_t		type;
 };
 
@@ -49,11 +111,6 @@ struct flow_common {
 #define FLOW_TABLE_PRESSURE		30	/* % of FLOW_MAX */
 #define FLOW_FILE_PRESSURE		30	/* % of c->nofile */
 
-union flow *flow_start(union flow *flow, enum flow_type type,
-		       unsigned iniside);
-#define FLOW_START(flow_, t_, var_, i_)		\
-	(&flow_start((flow_), (t_), (i_))->var_)
-
 /**
  * struct flow_sidx - ID for one side of a specific flow
  * @side:	Side referenced (0 or 1)
diff --git a/flow_table.h b/flow_table.h
index b7e5529..58014d8 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -107,4 +107,14 @@ static inline flow_sidx_t flow_sidx(const struct flow_common *f,
 union flow *flow_alloc(void);
 void flow_alloc_cancel(union flow *flow);
 
+union flow *flow_set_type(union flow *flow, enum flow_type type,
+			  unsigned iniside);
+#define FLOW_SET_TYPE(flow_, t_, var_, i_)	\
+	(&flow_set_type((flow_), (t_), (i_))->var_)
+
+void flow_activate(struct flow_common *f);
+#define FLOW_ACTIVATE(flow_)			\
+	(flow_activate(&(flow_)->f))
+
+
 #endif /* FLOW_TABLE_H */
diff --git a/icmp.c b/icmp.c
index 1c5cf84..fda868d 100644
--- a/icmp.c
+++ b/icmp.c
@@ -167,7 +167,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 	if (!flow)
 		return NULL;
 
-	pingf = FLOW_START(flow, flowtype, ping, TAPSIDE);
+	pingf = FLOW_SET_TYPE(flow, flowtype, ping, TAPSIDE);
 
 	pingf->seq = -1;
 	pingf->id = id;
@@ -198,6 +198,8 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 
 	*id_sock = pingf;
 
+	FLOW_ACTIVATE(pingf);
+
 	return pingf;
 
 cancel:
diff --git a/tcp.c b/tcp.c
index 21d0af0..65208ca 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2006,7 +2006,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 			goto cancel;
 	}
 
-	conn = FLOW_START(flow, FLOW_TCP, tcp, TAPSIDE);
+	conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp, TAPSIDE);
 	conn->sock = s;
 	conn->timer = -1;
 	conn_event(c, conn, TAP_SYN_RCVD);
@@ -2077,6 +2077,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 	}
 
 	tcp_epoll_ctl(c, conn);
+	FLOW_ACTIVATE(conn);
 	return;
 
 cancel:
@@ -2724,7 +2725,8 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
 				   const union sockaddr_inany *sa,
 				   const struct timespec *now)
 {
-	struct tcp_tap_conn *conn = FLOW_START(flow, FLOW_TCP, tcp, SOCKSIDE);
+	struct tcp_tap_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp,
+						  SOCKSIDE);
 
 	conn->sock = s;
 	conn->timer = -1;
@@ -2747,6 +2749,8 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
 	conn_flag(c, conn, ACK_FROM_TAP_DUE);
 
 	tcp_get_sndbuf(conn);
+
+	FLOW_ACTIVATE(conn);
 }
 
 /**
diff --git a/tcp_splice.c b/tcp_splice.c
index 4c36b72..abe98a0 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -472,7 +472,7 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
 		return false;
 	}
 
-	conn = FLOW_START(flow, FLOW_TCP_SPLICE, tcp_splice, 0);
+	conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, tcp_splice, 0);
 
 	conn->flags = af == AF_INET ? 0 : SPLICE_V6;
 	conn->s[0] = s0;
@@ -486,6 +486,8 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
 	if (tcp_splice_connect(c, conn, af, pif1, dstport))
 		conn_flag(c, conn, CLOSING);
 
+	FLOW_ACTIVATE(conn);
+
 	return true;
 }
 
-- 
2.45.0



* [PATCH v5 02/19] flow: Make side 0 always be the initiating side
  2024-05-14  1:03 [PATCH v5 00/19] RFC: Unified flow table David Gibson
  2024-05-14  1:03 ` [PATCH v5 01/19] flow: Clarify and enforce flow state transitions David Gibson
@ 2024-05-14  1:03 ` David Gibson
  2024-05-16 12:06   ` Stefano Brivio
  2024-05-14  1:03 ` [PATCH v5 03/19] flow: Record the pifs for each side of each flow David Gibson
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2024-05-14  1:03 UTC
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

Each flow in the flow table has two sides, 0 and 1, representing the
two interfaces between which passt/pasta will forward data for that flow.
Which side is which is currently up to the protocol specific code:  TCP
uses side 0 for the host/"sock" side and 1 for the guest/"tap" side, except
for spliced connections where it uses 0 for the initiating side and 1 for
the accepting side.  ICMP also uses 0 for the host/"sock" side and 1 for
the guest/"tap" side, but in its case the latter is always also the
initiating side.

Make this generically consistent by always using side 0 for the initiating
side and 1 for the accepting side.  This doesn't simplify a lot for now,
and arguably makes TCP slightly more complex, since we add an extra field
to the connection structure to record which is the guest facing side.
This is an interim change, which we'll be able to remove later.
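
For illustration, the convention can be sketched as follows.  This is a
simplified stand-alone sketch rather than the code in this patch: the
sketch_* names are hypothetical, and the struct carries only the interim
'tapside' bit.

```c
#include <assert.h>

#define SKETCH_INISIDE	0	/* Initiating side, always side 0 */
#define SKETCH_FWDSIDE	1	/* Forwarded (accepting) side, always side 1 */

/* Hypothetical cut-down connection: only the interim 'tapside' bit */
struct sketch_conn {
	unsigned tapside :1;	/* Which side faces the tap/guest interface */
};

/* With exactly two sides, the socket side is whichever isn't the tap side */
static unsigned sketch_sockside(const struct sketch_conn *conn)
{
	return !conn->tapside;
}

/* Guest opened the connection: the tap side is the initiating side */
static void sketch_conn_from_tap(struct sketch_conn *conn)
{
	conn->tapside = SKETCH_INISIDE;
}

/* Connection accepted on a host socket: the tap side is the forwarded side */
static void sketch_conn_from_sock(struct sketch_conn *conn)
{
	conn->tapside = SKETCH_FWDSIDE;
}
```

The side numbers no longer say anything about tap versus socket by
themselves, which is why the TCP code needs the extra bit until later
patches make it redundant.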

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c       |  5 +----
 flow.h       |  5 +++++
 flow_table.h |  6 ++----
 icmp.c       |  8 ++------
 tcp.c        | 19 ++++++++-----------
 tcp_conn.h   |  3 ++-
 tcp_splice.c |  2 +-
 7 files changed, 21 insertions(+), 27 deletions(-)

diff --git a/flow.c b/flow.c
index 768e0f6..7456021 100644
--- a/flow.c
+++ b/flow.c
@@ -152,12 +152,10 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
  * flow_set_type() - Set type and mvoe to TYPED state
  * @flow:	Flow to change state
  * @type:	Type for new flow
- * @iniside:	Which side initiated the new flow
  *
  * Return: @flow
  */
-union flow *flow_set_type(union flow *flow, enum flow_type type,
-			  unsigned iniside)
+union flow *flow_set_type(union flow *flow, enum flow_type type)
 {
 	struct flow_common *f = &flow->f;
 
@@ -165,7 +163,6 @@ union flow *flow_set_type(union flow *flow, enum flow_type type,
 	ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_NEW);
 	ASSERT(f->type == FLOW_TYPE_NONE);
 
-	(void)iniside;
 	f->type = type;
 	flow_set_state(f, FLOW_STATE_TYPED);
 	return flow;
diff --git a/flow.h b/flow.h
index 073a734..28169a8 100644
--- a/flow.h
+++ b/flow.h
@@ -95,6 +95,11 @@ extern const uint8_t flow_proto[];
 #define FLOW_PROTO(f)				\
 	((f)->type < FLOW_NUM_TYPES ? flow_proto[(f)->type] : 0)
 
+#define SIDES			2
+
+#define INISIDE			0	/* Initiating side */
+#define FWDSIDE			1	/* Forwarded side */
+
 /**
  * struct flow_common - Common fields for packet flows
  * @state:	State of the flow table entry
diff --git a/flow_table.h b/flow_table.h
index 58014d8..7c98195 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -107,10 +107,8 @@ static inline flow_sidx_t flow_sidx(const struct flow_common *f,
 union flow *flow_alloc(void);
 void flow_alloc_cancel(union flow *flow);
 
-union flow *flow_set_type(union flow *flow, enum flow_type type,
-			  unsigned iniside);
-#define FLOW_SET_TYPE(flow_, t_, var_, i_)	\
-	(&flow_set_type((flow_), (t_), (i_))->var_)
+union flow *flow_set_type(union flow *flow, enum flow_type type);
+#define FLOW_SET_TYPE(flow_, t_, var_)	(&flow_set_type((flow_), (t_))->var_)
 
 void flow_activate(struct flow_common *f);
 #define FLOW_ACTIVATE(flow_)			\
diff --git a/icmp.c b/icmp.c
index fda868d..6df0989 100644
--- a/icmp.c
+++ b/icmp.c
@@ -45,10 +45,6 @@
 #define ICMP_ECHO_TIMEOUT	60 /* s, timeout for ICMP socket activity */
 #define ICMP_NUM_IDS		(1U << 16)
 
-/* Sides of a flow as we use them for ping streams */
-#define	SOCKSIDE	0
-#define	TAPSIDE		1
-
 #define PINGF(idx)		(&(FLOW(idx)->ping))
 
 /* Indexed by ICMP echo identifier */
@@ -167,7 +163,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 	if (!flow)
 		return NULL;
 
-	pingf = FLOW_SET_TYPE(flow, flowtype, ping, TAPSIDE);
+	pingf = FLOW_SET_TYPE(flow, flowtype, ping);
 
 	pingf->seq = -1;
 	pingf->id = id;
@@ -180,7 +176,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 		bind_if = c->ip6.ifname_out;
 	}
 
-	ref.flowside = FLOW_SIDX(flow, SOCKSIDE);
+	ref.flowside = FLOW_SIDX(flow, FWDSIDE);
 	pingf->sock = sock_l4(c, af, flow_proto[flowtype], bind_addr, bind_if,
 			      0, ref.data);
 
diff --git a/tcp.c b/tcp.c
index 65208ca..06401ba 100644
--- a/tcp.c
+++ b/tcp.c
@@ -303,10 +303,6 @@
 
 #include "flow_table.h"
 
-/* Sides of a flow as we use them in "tap" connections */
-#define	SOCKSIDE	0
-#define	TAPSIDE		1
-
 #define TCP_FRAMES_MEM			128
 #define TCP_FRAMES							\
 	(c->mode == MODE_PASST ? TCP_FRAMES_MEM : 1)
@@ -581,7 +577,7 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 {
 	int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
 	union epoll_ref ref = { .type = EPOLL_TYPE_TCP, .fd = conn->sock,
-				.flowside = FLOW_SIDX(conn, SOCKSIDE) };
+				.flowside = FLOW_SIDX(conn, !conn->tapside), };
 	struct epoll_event ev = { .data.u64 = ref.u64 };
 
 	if (conn->events == CLOSED) {
@@ -1134,7 +1130,7 @@ static uint64_t tcp_conn_hash(const struct ctx *c,
 static inline unsigned tcp_hash_probe(const struct ctx *c,
 				      const struct tcp_tap_conn *conn)
 {
-	flow_sidx_t sidx = FLOW_SIDX(conn, TAPSIDE);
+	flow_sidx_t sidx = FLOW_SIDX(conn, conn->tapside);
 	unsigned b = tcp_conn_hash(c, conn) % TCP_HASH_TABLE_SIZE;
 
 	/* Linear probing */
@@ -1154,7 +1150,7 @@ static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn)
 {
 	unsigned b = tcp_hash_probe(c, conn);
 
-	tc_hash[b] = FLOW_SIDX(conn, TAPSIDE);
+	tc_hash[b] = FLOW_SIDX(conn, conn->tapside);
 	flow_dbg(conn, "hash table insert: sock %i, bucket: %u", conn->sock, b);
 }
 
@@ -2006,7 +2002,8 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 			goto cancel;
 	}
 
-	conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp, TAPSIDE);
+	conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
+	conn->tapside = INISIDE;
 	conn->sock = s;
 	conn->timer = -1;
 	conn_event(c, conn, TAP_SYN_RCVD);
@@ -2725,9 +2722,9 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
 				   const union sockaddr_inany *sa,
 				   const struct timespec *now)
 {
-	struct tcp_tap_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp,
-						  SOCKSIDE);
+	struct tcp_tap_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
 
+	conn->tapside = FWDSIDE;
 	conn->sock = s;
 	conn->timer = -1;
 	conn->ws_to_tap = conn->ws_from_tap = 0;
@@ -2884,7 +2881,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
 	struct tcp_tap_conn *conn = CONN(ref.flowside.flow);
 
 	ASSERT(conn->f.type == FLOW_TCP);
-	ASSERT(ref.flowside.side == SOCKSIDE);
+	ASSERT(ref.flowside.side == !conn->tapside);
 
 	if (conn->events == CLOSED)
 		return;
diff --git a/tcp_conn.h b/tcp_conn.h
index d280b22..5df0076 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -13,6 +13,7 @@
  * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced)
  * @f:			Generic flow information
  * @in_epoll:		Is the connection in the epoll set?
+ * @tapside:		Which side of the flow faces the tap/guest interface
  * @tap_mss:		MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
  * @sock:		Socket descriptor number
  * @events:		Connection events, implying connection states
@@ -39,6 +40,7 @@ struct tcp_tap_conn {
 	struct flow_common f;
 
 	bool		in_epoll	:1;
+	unsigned	tapside		:1;
 
 #define TCP_RETRANS_BITS		3
 	unsigned int	retrans		:TCP_RETRANS_BITS;
@@ -106,7 +108,6 @@ struct tcp_tap_conn {
 	uint32_t	seq_init_from_tap;
 };
 
-#define SIDES			2
 /**
  * struct tcp_splice_conn - Descriptor for a spliced TCP connection
  * @f:			Generic flow information
diff --git a/tcp_splice.c b/tcp_splice.c
index abe98a0..5da7021 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -472,7 +472,7 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
 		return false;
 	}
 
-	conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, tcp_splice, 0);
+	conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, tcp_splice);
 
 	conn->flags = af == AF_INET ? 0 : SPLICE_V6;
 	conn->s[0] = s0;
-- 
2.45.0



* [PATCH v5 03/19] flow: Record the pifs for each side of each flow
  2024-05-14  1:03 [PATCH v5 00/19] RFC: Unified flow table David Gibson
  2024-05-14  1:03 ` [PATCH v5 01/19] flow: Clarify and enforce flow state transitions David Gibson
  2024-05-14  1:03 ` [PATCH v5 02/19] flow: Make side 0 always be the initiating side David Gibson
@ 2024-05-14  1:03 ` David Gibson
  2024-05-14  1:03 ` [PATCH v5 04/19] tcp: Remove interim 'tapside' field from connection David Gibson
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2024-05-14  1:03 UTC
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

Currently we have no generic information on flows apart from the type and
state, everything else is specific to the flow type.  Start introducing
generic flow information by recording the pifs which the flow connects.

To keep track of what information is valid, introduce new flow states:
INI for when the initiating side information is complete, and FWD for when
both sides' information is complete.  For now, these states seem like busy
work, but they'll become more important as we add more generic information.
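
As a rough stand-alone sketch of how the two new states gate which pif
fields are valid (hypothetical sketch_* and SK_* names, assert() in place of
ASSERT(), and assuming INI and FWD sit between NEW and TYPED in the enum
order, as the ordering in flow_state_str[] suggests):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for pif identifiers; 0 plays the role of PIF_NONE */
enum sketch_pif { SKETCH_PIF_NONE = 0, SKETCH_PIF_HOST, SKETCH_PIF_TAP };

/* States in the order assumed above: INI and FWD between NEW and TYPED */
enum sketch_fstate { SK_FREE, SK_NEW, SK_INI, SK_FWD, SK_TYPED, SK_ACTIVE };

struct sketch_common {
	uint8_t state;
	uint8_t pif[2];		/* [0] initiating side, [1] forwarded side */
};

/* NEW -> INI: record the pif the flow was initiated from */
static void sketch_initiate(struct sketch_common *f, uint8_t pif)
{
	assert(pif != SKETCH_PIF_NONE);
	assert(f->state == SK_NEW);
	assert(f->pif[0] == SKETCH_PIF_NONE && f->pif[1] == SKETCH_PIF_NONE);

	f->pif[0] = pif;
	f->state = SK_INI;
}

/* INI -> FWD: record the pif the flow is forwarded to */
static void sketch_forward(struct sketch_common *f, uint8_t pif)
{
	assert(pif != SKETCH_PIF_NONE);
	assert(f->state == SK_INI);
	assert(f->pif[0] != SKETCH_PIF_NONE && f->pif[1] == SKETCH_PIF_NONE);

	f->pif[1] = pif;
	f->state = SK_FWD;
}

/* How many pif[] entries may be trusted in the current state */
static int sketch_valid_sides(const struct sketch_common *f)
{
	if (f->state >= SK_FWD)
		return 2;
	if (f->state >= SK_INI)
		return 1;
	return 0;
}
```

Because the states are ordered, a reader of the entry can decide what is
safe to look at with a single comparison on the state, in the same spirit as
the MAX(state, oldstate) checks in the logging code.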

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c       | 56 +++++++++++++++++++++++++++++++++++++++++++++++-----
 flow.h       | 49 ++++++++++++++++++++++++++++++++++++++-------
 flow_table.h |  3 +++
 icmp.c       |  2 ++
 pif.h        |  1 -
 tcp.c        | 10 +++++++++-
 tcp_splice.c |  1 +
 7 files changed, 108 insertions(+), 14 deletions(-)

diff --git a/flow.c b/flow.c
index 7456021..aee2736 100644
--- a/flow.c
+++ b/flow.c
@@ -21,6 +21,8 @@
 const char *flow_state_str[] = {
 	[FLOW_STATE_FREE]	= "FREE",
 	[FLOW_STATE_NEW]	= "NEW",
+	[FLOW_STATE_INI]	= "INI",
+	[FLOW_STATE_FWD]	= "FWD",
 	[FLOW_STATE_TYPED]	= "TYPED",
 	[FLOW_STATE_ACTIVE]	= "ACTIVE",
 };
@@ -146,22 +148,63 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
 	f->state = state;
 	flow_log_(f, LOG_DEBUG, "%s -> %s", flow_state_str[oldstate],
 		  FLOW_STATE(f));
+
+	if (MAX(state, oldstate) >= FLOW_STATE_FWD)
+		flow_log_(f, LOG_DEBUG, "%s => %s", pif_name(f->pif[INISIDE]),
+			                            pif_name(f->pif[FWDSIDE]));
+	else if (MAX(state, oldstate) >= FLOW_STATE_INI)
+		flow_log_(f, LOG_DEBUG, "%s => ?", pif_name(f->pif[INISIDE]));
 }
 
 /**
- * flow_set_type() - Set type and mvoe to TYPED state
+ * flow_initiate() - Move flow to INI state, setting INISIDE details
  * @flow:	Flow to change state
- * @type:	Type for new flow
- *
- * Return: @flow
+ * @pif:	pif of the initiating side
+ */
+void flow_initiate(union flow *flow, uint8_t pif)
+{
+	struct flow_common *f = &flow->f;
+
+	ASSERT(pif != PIF_NONE);
+	ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_NEW);
+	ASSERT(f->type == FLOW_TYPE_NONE);
+	ASSERT(f->pif[INISIDE] == PIF_NONE && f->pif[FWDSIDE] == PIF_NONE);
+
+	f->pif[INISIDE] = pif;
+	flow_set_state(f, FLOW_STATE_INI);
+}
+
+/**
+ * flow_forward() - Move flow to FWD state, setting FWDSIDE details
+ * @flow:	Flow to change state
+ * @pif:	pif of the forwarded side
+ */
+void flow_forward(union flow *flow, uint8_t pif)
+{
+	struct flow_common *f = &flow->f;
+
+	ASSERT(pif != PIF_NONE);
+	ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_INI);
+	ASSERT(f->type == FLOW_TYPE_NONE);
+	ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[FWDSIDE] == PIF_NONE);
+
+	f->pif[FWDSIDE] = pif;
+	flow_set_state(f, FLOW_STATE_FWD);
+}
+
+/**
+ * flow_set_type() - Set type and move to TYPED state
+ * @flow:	Flow to change state
+ * @type:	Type for new flow
  */
 union flow *flow_set_type(union flow *flow, enum flow_type type)
 {
 	struct flow_common *f = &flow->f;
 
 	ASSERT(type != FLOW_TYPE_NONE);
-	ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_NEW);
+	ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_FWD);
 	ASSERT(f->type == FLOW_TYPE_NONE);
+	ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[FWDSIDE] != PIF_NONE);
 
 	f->type = type;
 	flow_set_state(f, FLOW_STATE_TYPED);
@@ -175,6 +218,7 @@ union flow *flow_set_type(union flow *flow, enum flow_type type)
 void flow_activate(struct flow_common *f)
 {
 	ASSERT(&flow_new_entry->f == f && f->state == FLOW_STATE_TYPED);
+	ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[FWDSIDE] != PIF_NONE);
 
 	flow_set_state(f, FLOW_STATE_ACTIVE);
 	flow_new_entry = NULL;
@@ -234,6 +278,8 @@ void flow_alloc_cancel(union flow *flow)
 {
 	ASSERT(flow_new_entry == flow);
 	ASSERT(flow->f.state == FLOW_STATE_NEW ||
+	       flow->f.state == FLOW_STATE_INI ||
+	       flow->f.state == FLOW_STATE_FWD ||
 	       flow->f.state == FLOW_STATE_TYPED);
 	ASSERT(flow_first_free > FLOW_IDX(flow));
 
diff --git a/flow.h b/flow.h
index 28169a8..9871e3b 100644
--- a/flow.h
+++ b/flow.h
@@ -25,25 +25,56 @@
  *    NEW - Freshly allocated, uninitialised entry
  *        Operations:
  *            - flow_alloc_cancel() returns the entry to FREE state
+ *            - flow_initiate() sets the entry's INISIDE details and moves to
+ *              INI state
  *            - FLOW_SET_TYPE() sets the entry's type and moves to TYPED state
  *        Caveats:
  *            - No fields other than state may be accessed.
- *            - At most one entry may be in NEW or TYPED state at a time, so it's
- *              unsafe to use flow_alloc() again until this entry moves to
- *              ACTIVE or FREE state 
+ *            - At most one entry may be in NEW, INI, FWD or TYPED state at a
+ *              time, so it's unsafe to use flow_alloc() again until this entry
+ *              moves to ACTIVE or FREE state 
  *            - You may not return to the main epoll loop while an entry is in
  *              NEW state.
  *
+ *    INI - An entry with INISIDE common information completed
+ *        Operations:
+ *            - Common fields related to INISIDE may be read
+ *            - flow_alloc_cancel() returns the entry to FREE state
+ *            - flow_forward() sets the entry's FWDSIDE details and moves to FWD
+ *              state
+ *        Caveats:
+ *            - Other common fields may not be read
+ *            - Type specific fields may not be read or written
+ *            - At most one entry may be in NEW, INI, FWD or TYPED state at a
+ *              time, so it's unsafe to use flow_alloc() again until this entry
+ *              moves to ACTIVE or FREE state 
+ *            - You may not return to the main epoll loop while an entry is in
+ *              INI state.
+ *
+ *    FWD - An entry with only INISIDE and FWDSIDE common information completed
+ *        Operations:
+ *            - Common fields related to INISIDE & FWDSIDE may be read
+ *            - flow_alloc_cancel() returns the entry to FREE state
+ *            - FLOW_SET_TYPE() sets the entry's type and moves to TYPED state
+ *        Caveats:
+ *            - Other common fields may not be read
+ *            - Type specific fields may not be read or written
+ *            - At most one entry may be in NEW, INI, FWD or TYPED state at a
+ *              time, so it's unsafe to use flow_alloc() again until this entry
+ *              moves to ACTIVE or FREE state 
+ *            - You may not return to the main epoll loop while an entry is in
+ *              FWD state.
+ *
  *    TYPED - Generic info initialised, type specific initialisation underway
  *        Operations:
  *            - All common fields may be read
  *            - Type specific fields may be read and written
  *            - flow_alloc_cancel() returns the entry to FREE state
- *            - FLOW_ACTIVATE() moves the entry to ACTIVE STATE
+ *            - FLOW_ACTIVATE() moves the entry to ACTIVE state
  *        Caveats:
- *            - At most one entry may be in NEW or TYPED state at a time, so it's
- *              unsafe to use flow_alloc() again until this entry moves to
- *              ACTIVE or FREE state 
+ *            - At most one entry may be in NEW, INI, FWD or TYPED state at a
+ *              time, so it's unsafe to use flow_alloc() again until this entry
+ *              moves to ACTIVE or FREE state 
  *            - You may not return to the main epoll loop while an entry is in
  *              TYPED state.
  *
@@ -59,6 +90,8 @@
 enum flow_state {
 	FLOW_STATE_FREE,
 	FLOW_STATE_NEW,
+	FLOW_STATE_INI,
+	FLOW_STATE_FWD,
 	FLOW_STATE_TYPED,
 	FLOW_STATE_ACTIVE,
 
@@ -104,10 +137,12 @@ extern const uint8_t flow_proto[];
  * struct flow_common - Common fields for packet flows
  * @state:	State of the flow table entry
  * @type:	Type of packet flow
+ * @pif[]:	Interface for each side of the flow
  */
 struct flow_common {
 	uint8_t		state;
 	uint8_t		type;
+	uint8_t		pif[SIDES];
 };
 
 #define FLOW_INDEX_BITS		17	/* 128k - 1 */
diff --git a/flow_table.h b/flow_table.h
index 7c98195..01c9326 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -107,6 +107,9 @@ static inline flow_sidx_t flow_sidx(const struct flow_common *f,
 union flow *flow_alloc(void);
 void flow_alloc_cancel(union flow *flow);
 
+void flow_initiate(union flow *flow, uint8_t pif);
+void flow_forward(union flow *flow, uint8_t pif);
+
 union flow *flow_set_type(union flow *flow, enum flow_type type);
 #define FLOW_SET_TYPE(flow_, t_, var_)	(&flow_set_type((flow_), (t_))->var_)
 
diff --git a/icmp.c b/icmp.c
index 6df0989..f5b8405 100644
--- a/icmp.c
+++ b/icmp.c
@@ -163,6 +163,8 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 	if (!flow)
 		return NULL;
 
+	flow_initiate(flow, PIF_TAP);
+	flow_forward(flow, PIF_HOST);
 	pingf = FLOW_SET_TYPE(flow, flowtype, ping);
 
 	pingf->seq = -1;
diff --git a/pif.h b/pif.h
index bd52936..ca85b34 100644
--- a/pif.h
+++ b/pif.h
@@ -38,7 +38,6 @@ static inline const char *pif_type(enum pif_type pt)
 		return "?";
 }
 
-/* cppcheck-suppress unusedFunction */
 static inline const char *pif_name(uint8_t pif)
 {
 	return pif_type(pif);
diff --git a/tcp.c b/tcp.c
index 06401ba..48aae30 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1950,6 +1950,8 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 	if (!(flow = flow_alloc()))
 		return;
 
+	flow_initiate(flow, PIF_TAP);
+
 	if (af == AF_INET) {
 		if (IN4_IS_ADDR_UNSPECIFIED(saddr) ||
 		    IN4_IS_ADDR_BROADCAST(saddr) ||
@@ -2002,6 +2004,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 			goto cancel;
 	}
 
+	flow_forward(flow, PIF_HOST);
 	conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
 	conn->tapside = INISIDE;
 	conn->sock = s;
@@ -2722,7 +2725,10 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
 				   const union sockaddr_inany *sa,
 				   const struct timespec *now)
 {
-	struct tcp_tap_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
+	struct tcp_tap_conn *conn;
+
+	flow_forward(flow, PIF_TAP);
+	conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
 
 	conn->tapside = FWDSIDE;
 	conn->sock = s;
@@ -2771,6 +2777,8 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
 	if (s < 0)
 		goto cancel;
 
+	flow_initiate(flow, ref.tcp_listen.pif);
+
 	if (sa.sa_family == AF_INET) {
 		const struct in_addr *addr = &sa.sa4.sin_addr;
 		in_port_t port = sa.sa4.sin_port;
diff --git a/tcp_splice.c b/tcp_splice.c
index 5da7021..0e02732 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -472,6 +472,7 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
 		return false;
 	}
 
+	flow_forward(flow, pif1);
 	conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, tcp_splice);
 
 	conn->flags = af == AF_INET ? 0 : SPLICE_V6;
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 04/19] tcp: Remove interim 'tapside' field from connection
  2024-05-14  1:03 [PATCH v5 00/19] RFC: Unified flow table David Gibson
                   ` (2 preceding siblings ...)
  2024-05-14  1:03 ` [PATCH v5 03/19] flow: Record the pifs for each side of each flow David Gibson
@ 2024-05-14  1:03 ` David Gibson
  2024-05-14  1:03 ` [PATCH v5 05/19] flow: Common data structures for tracking flow addresses David Gibson
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2024-05-14  1:03 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

We recently introduced this field to keep track of which side of a TCP flow
is the guest/tap facing one.  Now that we generically record which pif each
side of each flow is connected to, we can easily derive that, and no longer
need to keep track of it explicitly.
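
A minimal sketch of that derivation, assuming exactly one side of a
non-spliced TCP flow faces the tap interface.  The DEMO_PIF_* values and
demo_tapside() are hypothetical names used only for illustration; the
series itself uses the TAPSIDE() macro in the diff:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the pif identifiers */
enum demo_pif { DEMO_PIF_NONE, DEMO_PIF_HOST, DEMO_PIF_TAP };

/* Index of the tap-facing side: 1 if the forwarded side (index 1) is on
 * the tap pif, otherwise 0 (the initiating side).
 */
static inline unsigned demo_tapside(const uint8_t pif[2])
{
	/* Assumption: exactly one side is tap-facing */
	assert((pif[0] == DEMO_PIF_TAP) != (pif[1] == DEMO_PIF_TAP));

	return pif[1] == DEMO_PIF_TAP;
}
```

The socket-facing side is then simply !demo_tapside(pif), which is how
the epoll reference is derived in the diff.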

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c      | 12 ++++++------
 tcp_conn.h |  2 --
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/tcp.c b/tcp.c
index 48aae30..3895f3f 100644
--- a/tcp.c
+++ b/tcp.c
@@ -368,6 +368,8 @@
 #define OPT_SACK	5
 #define OPT_TS		8
 
+#define TAPSIDE(conn_)	((conn_)->f.pif[1] == PIF_TAP)
+
 #define CONN_V4(conn)		(!!inany_v4(&(conn)->faddr))
 #define CONN_V6(conn)		(!CONN_V4(conn))
 #define CONN_IS_CLOSING(conn)						\
@@ -577,7 +579,7 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 {
 	int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
 	union epoll_ref ref = { .type = EPOLL_TYPE_TCP, .fd = conn->sock,
-				.flowside = FLOW_SIDX(conn, !conn->tapside), };
+		                .flowside = FLOW_SIDX(conn, !TAPSIDE(conn)), };
 	struct epoll_event ev = { .data.u64 = ref.u64 };
 
 	if (conn->events == CLOSED) {
@@ -1130,8 +1132,8 @@ static uint64_t tcp_conn_hash(const struct ctx *c,
 static inline unsigned tcp_hash_probe(const struct ctx *c,
 				      const struct tcp_tap_conn *conn)
 {
-	flow_sidx_t sidx = FLOW_SIDX(conn, conn->tapside);
 	unsigned b = tcp_conn_hash(c, conn) % TCP_HASH_TABLE_SIZE;
+	flow_sidx_t sidx = FLOW_SIDX(conn, TAPSIDE(conn));
 
 	/* Linear probing */
 	while (!flow_sidx_eq(tc_hash[b], FLOW_SIDX_NONE) &&
@@ -1150,7 +1152,7 @@ static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn)
 {
 	unsigned b = tcp_hash_probe(c, conn);
 
-	tc_hash[b] = FLOW_SIDX(conn, conn->tapside);
+	tc_hash[b] = FLOW_SIDX(conn, TAPSIDE(conn));
 	flow_dbg(conn, "hash table insert: sock %i, bucket: %u", conn->sock, b);
 }
 
@@ -2006,7 +2008,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 
 	flow_forward(flow, PIF_HOST);
 	conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
-	conn->tapside = INISIDE;
 	conn->sock = s;
 	conn->timer = -1;
 	conn_event(c, conn, TAP_SYN_RCVD);
@@ -2730,7 +2731,6 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
 	flow_forward(flow, PIF_TAP);
 	conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
 
-	conn->tapside = FWDSIDE;
 	conn->sock = s;
 	conn->timer = -1;
 	conn->ws_to_tap = conn->ws_from_tap = 0;
@@ -2889,7 +2889,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
 	struct tcp_tap_conn *conn = CONN(ref.flowside.flow);
 
 	ASSERT(conn->f.type == FLOW_TCP);
-	ASSERT(ref.flowside.side == !conn->tapside);
+	ASSERT(conn->f.pif[ref.flowside.side] != PIF_TAP);
 
 	if (conn->events == CLOSED)
 		return;
diff --git a/tcp_conn.h b/tcp_conn.h
index 5df0076..1a07dd5 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -13,7 +13,6 @@
  * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced)
  * @f:			Generic flow information
  * @in_epoll:		Is the connection in the epoll set?
- * @tapside:		Which side of the flow faces the tap/guest interface
  * @tap_mss:		MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
  * @sock:		Socket descriptor number
  * @events:		Connection events, implying connection states
@@ -40,7 +39,6 @@ struct tcp_tap_conn {
 	struct flow_common f;
 
 	bool		in_epoll	:1;
-	unsigned	tapside		:1;
 
 #define TCP_RETRANS_BITS		3
 	unsigned int	retrans		:TCP_RETRANS_BITS;
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 05/19] flow: Common data structures for tracking flow addresses
  2024-05-14  1:03 [PATCH v5 00/19] RFC: Unified flow table David Gibson
                   ` (3 preceding siblings ...)
  2024-05-14  1:03 ` [PATCH v5 04/19] tcp: Remove interim 'tapside' field from connection David Gibson
@ 2024-05-14  1:03 ` David Gibson
  2024-05-14  1:03 ` [PATCH v5 06/19] flow: Populate address information for initiating side David Gibson
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2024-05-14  1:03 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

Handling of each protocol needs some degree of tracking of the
addresses and ports at the end of each connection or flow.  Sometimes
that's explicit (as in the guest visible addresses for TCP
connections), sometimes implicit (the bound and connected addresses of
sockets).

To allow more consistent handling across protocols we want to
uniformly track the address and port at each end of the connection.
Furthermore, because we allow port remapping, and we sometimes need to
apply NAT, the addresses and ports can be different as seen by the
guest/namespace and as seen by the host.

Introduce 'struct flowside' to keep track of address and port
information related to one side of a flow. Store two of these in the
common fields of a flow to track that information for both sides.  For
now we just introduce the structure itself; later patches will
actually populate and use it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.h  | 16 ++++++++++++++++
 passt.h |  3 +++
 2 files changed, 19 insertions(+)

diff --git a/flow.h b/flow.h
index 9871e3b..437579b 100644
--- a/flow.h
+++ b/flow.h
@@ -133,8 +133,23 @@ extern const uint8_t flow_proto[];
 #define INISIDE			0	/* Initiating side */
 #define FWDSIDE			1	/* Forwarded side */
 
+/**
+ * struct flowside - Address information for one side of a flow
+ * @eaddr:	Endpoint address (remote address from passt's PoV)
+ * @faddr:	Forwarding address (local address from passt's PoV)
+ * @eport:	Endpoint port
+ * @fport:	Forwarding port
+ */
+struct flowside {
+	union inany_addr	faddr;
+	union inany_addr	eaddr;
+	in_port_t		fport;
+	in_port_t		eport;
+};
+
 /**
  * struct flow_common - Common fields for packet flows
+ * @side[]:	Information for each side of the flow
  * @state:	State of the flow table entry
  * @type:	Type of packet flow
  * @pif[]:	Interface for each side of the flow
@@ -143,6 +158,7 @@ struct flow_common {
 	uint8_t		state;
 	uint8_t		type;
 	uint8_t		pif[SIDES];
+	struct flowside	side[SIDES];
 };
 
 #define FLOW_INDEX_BITS		17	/* 128k - 1 */
diff --git a/passt.h b/passt.h
index bc58d64..3db0b8e 100644
--- a/passt.h
+++ b/passt.h
@@ -17,6 +17,9 @@ union epoll_ref;
 
 #include "pif.h"
 #include "packet.h"
+#include "siphash.h"
+#include "ip.h"
+#include "inany.h"
 #include "flow.h"
 #include "icmp.h"
 #include "fwd.h"
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 06/19] flow: Populate address information for initiating side
  2024-05-14  1:03 [PATCH v5 00/19] RFC: Unified flow table David Gibson
                   ` (4 preceding siblings ...)
  2024-05-14  1:03 ` [PATCH v5 05/19] flow: Common data structures for tracking flow addresses David Gibson
@ 2024-05-14  1:03 ` David Gibson
       [not found]   ` <20240516202337.1b90e5f2@elisabeth>
  2024-05-14  1:03 ` [PATCH v5 07/19] flow: Populate address information for non-initiating side David Gibson
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2024-05-14  1:03 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

This requires the address and port information for the initiating side to
be populated when a flow enters INI state.  Implement that for TCP and ICMP.

For now this leaves some information redundantly recorded in both generic
and type specific fields.  We'll fix that in later patches.
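
To illustrate the mapping (a sketch only, under simplifying assumptions):
on the initiating side, the source of the first packet is recorded as the
endpoint and its destination as the forwarding address.  struct side4 and
side4_from_initiator() are hypothetical, IPv4-only stand-ins for struct
flowside and flowside_from_af():

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>

/* Hypothetical IPv4-only stand-in for struct flowside */
struct side4 {
	struct in_addr	faddr;	/* Forwarding (local to passt) address */
	struct in_addr	eaddr;	/* Endpoint (remote from passt) address */
	in_port_t	fport;	/* Forwarding port, host byte order */
	in_port_t	eport;	/* Endpoint port, host byte order */
};

/* Fill in the initiating side from the first packet's addresses:
 * source becomes the endpoint, destination becomes the forwarding side.
 */
static inline void side4_from_initiator(struct side4 *s,
					const struct in_addr *saddr,
					in_port_t sport,
					const struct in_addr *daddr,
					in_port_t dport)
{
	assert(s && saddr && daddr);

	s->eaddr = *saddr;
	s->eport = sport;
	s->faddr = *daddr;
	s->fport = dport;
}
```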

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c       | 92 +++++++++++++++++++++++++++++++++++++++++++++++++---
 flow_table.h |  8 ++++-
 icmp.c       | 10 ++++--
 tcp.c        |  4 +--
 4 files changed, 103 insertions(+), 11 deletions(-)

diff --git a/flow.c b/flow.c
index aee2736..3d5b3a5 100644
--- a/flow.c
+++ b/flow.c
@@ -108,6 +108,31 @@ static const union flow *flow_new_entry; /* = NULL */
 /* Last time the flow timers ran */
 static struct timespec flow_timer_run;
 
+/** flowside_from_af() - Initialise flowside from addresses
+ * @fside:	flowside to initialise
+ * @af:		Address family (AF_INET or AF_INET6)
+ * @eaddr:	Endpoint address (pointer to in_addr or in6_addr)
+ * @eport:	Endpoint port
+ * @faddr:	Forwarding address (pointer to in_addr or in6_addr)
+ * @fport:	Forwarding port
+ */
+static void flowside_from_af(struct flowside *fside, sa_family_t af,
+			     const void *eaddr, in_port_t eport,
+			     const void *faddr, in_port_t fport)
+{
+	if (faddr)
+		inany_from_af(&fside->faddr, af, faddr);
+	else
+		fside->faddr = inany_any6;
+	fside->fport = fport;
+
+	if (eaddr)
+		inany_from_af(&fside->eaddr, af, eaddr);
+	else
+		fside->eaddr = inany_any6;
+	fside->eport = eport;
+}
+
 /** flow_log_ - Log flow-related message
  * @f:		flow the message is related to
  * @pri:	Log priority
@@ -140,6 +165,8 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
  */
 static void flow_set_state(struct flow_common *f, enum flow_state state)
 {
+	char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN];
+	const struct flowside *ini = &f->side[INISIDE];
 	uint8_t oldstate = f->state;
 
 	ASSERT(state < FLOW_NUM_STATES);
@@ -150,18 +177,28 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
 		  FLOW_STATE(f));
 
 	if (MAX(state, oldstate) >= FLOW_STATE_FWD)
-		flow_log_(f, LOG_DEBUG, "%s => %s", pif_name(f->pif[INISIDE]),
-			                            pif_name(f->pif[FWDSIDE]));
+		flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => %s",
+			  pif_name(f->pif[INISIDE]),
+			  inany_ntop(&ini->eaddr, estr, sizeof(estr)),
+			  ini->eport,
+			  inany_ntop(&ini->faddr, fstr, sizeof(fstr)),
+			  ini->fport,
+			  pif_name(f->pif[FWDSIDE]));
 	else if (MAX(state, oldstate) >= FLOW_STATE_INI)
-		flow_log_(f, LOG_DEBUG, "%s => ?", pif_name(f->pif[INISIDE]));
+		flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?",
+			  pif_name(f->pif[INISIDE]),
+			  inany_ntop(&ini->eaddr, estr, sizeof(estr)),
+			  ini->eport,
+			  inany_ntop(&ini->faddr, fstr, sizeof(fstr)),
+			  ini->fport);
 }
 
 /**
- * flow_initiate() - Move flow to INI state, setting INISIDE details
+ * flow_initiate_() - Move flow to INI state, setting pif[INISIDE]
  * @flow:	Flow to change state
  * @pif:	pif of the initiating side
  */
-void flow_initiate(union flow *flow, uint8_t pif)
+static void flow_initiate_(union flow *flow, uint8_t pif)
 {
 	struct flow_common *f = &flow->f;
 
@@ -174,6 +211,51 @@ void flow_initiate(union flow *flow, uint8_t pif)
 	flow_set_state(f, FLOW_STATE_INI);
 }
 
+/**
+ * flow_initiate_af() - Move flow to INI state, setting INISIDE details
+ * @flow:	Flow to change state
+ * @pif:	pif of the initiating side
+ * @af:		Address family of @saddr and @daddr
+ * @saddr:	Source address (pointer to in_addr or in6_addr)
+ * @sport:	Source port
+ * @daddr:	Destination address (pointer to in_addr or in6_addr)
+ * @dport:	Destination port
+ *
+ * Return: pointer to the initiating flowside information
+ */
+const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif,
+					sa_family_t af,
+					const void *saddr, in_port_t sport,
+					const void *daddr, in_port_t dport)
+{
+	struct flowside *ini = &flow->f.side[INISIDE];
+
+	flowside_from_af(ini, af, saddr, sport, daddr, dport);
+	flow_initiate_(flow, pif);
+	return ini;
+}
+
+/**
+ * flow_initiate_sa() - Move flow to INI state, setting INISIDE details
+ * @flow:	Flow to change state
+ * @pif:	pif of the initiating side
+ * @ssa:	Source socket address
+ * @dport:	Destination port
+ *
+ * Return: pointer to the initiating flowside information
+ */
+const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
+					const union sockaddr_inany *ssa,
+					in_port_t dport)
+{
+	struct flowside *ini = &flow->f.side[INISIDE];
+
+	inany_from_sockaddr(&ini->eaddr, &ini->eport, ssa);
+	ini->fport = dport;
+	flow_initiate_(flow, pif);
+	return ini;
+}
+
 /**
  * flow_forward() - Move flow to FWD state, setting FWDSIDE details
  * @flow:	Flow to change state
diff --git a/flow_table.h b/flow_table.h
index 01c9326..ca7f228 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -107,7 +107,13 @@ static inline flow_sidx_t flow_sidx(const struct flow_common *f,
 union flow *flow_alloc(void);
 void flow_alloc_cancel(union flow *flow);
 
-void flow_initiate(union flow *flow, uint8_t pif);
+const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif,
+					sa_family_t af,
+					const void *saddr, in_port_t sport,
+					const void *daddr, in_port_t dport);
+const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
+					const union sockaddr_inany *ssa,
+					in_port_t dport);
 void flow_forward(union flow *flow, uint8_t pif);
 
 union flow *flow_set_type(union flow *flow, enum flow_type type);
diff --git a/icmp.c b/icmp.c
index f5b8405..90708fe 100644
--- a/icmp.c
+++ b/icmp.c
@@ -146,12 +146,15 @@ static void icmp_ping_close(const struct ctx *c,
  * @id_sock:	Pointer to ping flow entry slot in icmp_id_map[] to update
  * @af:		Address family, AF_INET or AF_INET6
  * @id:		ICMP id for the new socket
+ * @saddr:	Source address
+ * @daddr:	Destination address
  *
  * Return: Newly opened ping flow, or NULL on failure
  */
 static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 					    struct icmp_ping_flow **id_sock,
-					    sa_family_t af, uint16_t id)
+					    sa_family_t af, uint16_t id,
+					    const void *saddr, const void *daddr)
 {
 	uint8_t flowtype = af == AF_INET ? FLOW_PING4 : FLOW_PING6;
 	union epoll_ref ref = { .type = EPOLL_TYPE_PING };
@@ -163,7 +166,8 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 	if (!flow)
 		return NULL;
 
-	flow_initiate(flow, PIF_TAP);
+
+	flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id);
 	flow_forward(flow, PIF_HOST);
 	pingf = FLOW_SET_TYPE(flow, flowtype, ping);
 
@@ -269,7 +273,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 	}
 
 	if (!(pingf = *id_sock))
-		if (!(pingf = icmp_ping_new(c, id_sock, af, id)))
+		if (!(pingf = icmp_ping_new(c, id_sock, af, id, saddr, daddr)))
 			return 1;
 
 	pingf->ts = now->tv_sec;
diff --git a/tcp.c b/tcp.c
index 3895f3f..bcc36fb 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1952,7 +1952,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 	if (!(flow = flow_alloc()))
 		return;
 
-	flow_initiate(flow, PIF_TAP);
+	flow_initiate_af(flow, PIF_TAP, af, saddr, srcport, daddr, dstport);
 
 	if (af == AF_INET) {
 		if (IN4_IS_ADDR_UNSPECIFIED(saddr) ||
@@ -2777,7 +2777,7 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
 	if (s < 0)
 		goto cancel;
 
-	flow_initiate(flow, ref.tcp_listen.pif);
+	flow_initiate_sa(flow, ref.tcp_listen.pif, &sa, ref.tcp_listen.port);
 
 	if (sa.sa_family == AF_INET) {
 		const struct in_addr *addr = &sa.sa4.sin_addr;
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 07/19] flow: Populate address information for non-initiating side
  2024-05-14  1:03 [PATCH v5 00/19] RFC: Unified flow table David Gibson
                   ` (5 preceding siblings ...)
  2024-05-14  1:03 ` [PATCH v5 06/19] flow: Populate address information for initiating side David Gibson
@ 2024-05-14  1:03 ` David Gibson
  2024-05-14  1:03 ` [PATCH v5 08/19] tcp, flow: Remove redundant information, repack connection structures David Gibson
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2024-05-14  1:03 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

This requires the address and port information for the forwarded
(non-initiating) side to be populated when a flow enters FWD state.
Implement that for TCP and ICMP.  For now this leaves some information
redundantly recorded in both generic and type specific fields.  We'll fix
that in later patches.

For TCP we now use the information from the flow to construct the
destination socket address in both tcp_conn_from_tap() and
tcp_splice_connect().
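
For illustration, building a socket address from a stored address and a
host-order port can be sketched with the standard sockets API as below.
sketch_sockaddr() is a hypothetical helper that takes a raw in_addr or
in6_addr; the series instead adds sockaddr_from_inany(), which takes a
union inany_addr:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: fill @ss from a raw address (in_addr or in6_addr,
 * selected by @af) and a host-order port.  Returns the length to pass to
 * connect() or bind().
 */
static inline socklen_t sketch_sockaddr(struct sockaddr_storage *ss,
					sa_family_t af, const void *addr,
					in_port_t port)
{
	assert(af == AF_INET || af == AF_INET6);
	memset(ss, 0, sizeof(*ss));

	if (af == AF_INET) {
		struct sockaddr_in *sa4 = (struct sockaddr_in *)ss;

		sa4->sin_family = AF_INET;
		sa4->sin_port = htons(port);	/* network byte order */
		memcpy(&sa4->sin_addr, addr, sizeof(sa4->sin_addr));
		return sizeof(*sa4);
	} else {
		struct sockaddr_in6 *sa6 = (struct sockaddr_in6 *)ss;

		sa6->sin6_family = AF_INET6;
		sa6->sin6_port = htons(port);	/* network byte order */
		memcpy(&sa6->sin6_addr, addr, sizeof(sa6->sin6_addr));
		return sizeof(*sa6);
	}
}
```

Keeping the port in host byte order until this point means the only
htons() conversion happens where the sockaddr is built.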

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c       | 38 ++++++++++++++++++------
 flow_table.h |  5 +++-
 icmp.c       |  2 +-
 inany.h      | 28 ++++++++++++++++++
 tcp.c        | 83 ++++++++++++++++++++++++++++------------------------
 tcp_splice.c | 44 ++++++++++------------------
 6 files changed, 123 insertions(+), 77 deletions(-)

diff --git a/flow.c b/flow.c
index 3d5b3a5..aff077b 100644
--- a/flow.c
+++ b/flow.c
@@ -165,8 +165,10 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
  */
 static void flow_set_state(struct flow_common *f, enum flow_state state)
 {
-	char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN];
+	char estr0[INANY_ADDRSTRLEN], fstr0[INANY_ADDRSTRLEN];
+	char estr1[INANY_ADDRSTRLEN], fstr1[INANY_ADDRSTRLEN];
 	const struct flowside *ini = &f->side[INISIDE];
+	const struct flowside *fwd = &f->side[FWDSIDE];
 	uint8_t oldstate = f->state;
 
 	ASSERT(state < FLOW_NUM_STATES);
@@ -177,19 +179,24 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
 		  FLOW_STATE(f));
 
 	if (MAX(state, oldstate) >= FLOW_STATE_FWD)
-		flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => %s",
+		flow_log_(f, LOG_DEBUG,
+			  "%s [%s]:%hu -> [%s]:%hu => %s [%s]:%hu -> [%s]:%hu",
 			  pif_name(f->pif[INISIDE]),
-			  inany_ntop(&ini->eaddr, estr, sizeof(estr)),
+			  inany_ntop(&ini->eaddr, estr0, sizeof(estr0)),
 			  ini->eport,
-			  inany_ntop(&ini->faddr, fstr, sizeof(fstr)),
+			  inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)),
 			  ini->fport,
-			  pif_name(f->pif[FWDSIDE]));
+			  pif_name(f->pif[FWDSIDE]),
+			  inany_ntop(&fwd->faddr, fstr1, sizeof(fstr1)),
+			  fwd->fport,
+			  inany_ntop(&fwd->eaddr, estr1, sizeof(estr1)),
+			  fwd->eport);
 	else if (MAX(state, oldstate) >= FLOW_STATE_INI)
 		flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?",
 			  pif_name(f->pif[INISIDE]),
-			  inany_ntop(&ini->eaddr, estr, sizeof(estr)),
+			  inany_ntop(&ini->eaddr, estr0, sizeof(estr0)),
 			  ini->eport,
-			  inany_ntop(&ini->faddr, fstr, sizeof(fstr)),
+			  inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)),
 			  ini->fport);
 }
 
@@ -257,21 +264,34 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
 }
 
 /**
- * flow_forward() - Move flow to FWD state, setting FWDSIDE details
+ * flow_forward_af() - Move flow to FWD state, setting FWDSIDE details
  * @flow:	Flow to change state
  * @pif:	pif of the forwarded side
+ * @af:		Address family for @saddr and @daddr
+ * @saddr:	Source address (pointer to in_addr or in6_addr)
+ * @sport:	Source port
+ * @daddr:	Destination address (pointer to in_addr or in6_addr)
+ * @dport:	Destination port
+ *
+ * Return: pointer to the forwarded flowside information
  */
-void flow_forward(union flow *flow, uint8_t pif)
+const struct flowside *flow_forward_af(union flow *flow, uint8_t pif,
+				       sa_family_t af,
+				       const void *saddr, in_port_t sport,
+				       const void *daddr, in_port_t dport)
 {
 	struct flow_common *f = &flow->f;
+	struct flowside *fwd = &f->side[FWDSIDE];
 
 	ASSERT(pif != PIF_NONE);
 	ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_INI);
 	ASSERT(f->type == FLOW_TYPE_NONE);
 	ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[FWDSIDE] == PIF_NONE);
 
+	flowside_from_af(fwd, af, daddr, dport, saddr, sport);
 	f->pif[FWDSIDE] = pif;
 	flow_set_state(f, FLOW_STATE_FWD);
+	return fwd;
 }
 
 /**
diff --git a/flow_table.h b/flow_table.h
index ca7f228..91ade0a 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -114,7 +114,10 @@ const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif,
 const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
 					const union sockaddr_inany *ssa,
 					in_port_t dport);
-void flow_forward(union flow *flow, uint8_t pif);
+const struct flowside *flow_forward_af(union flow *flow, uint8_t pif,
+				       sa_family_t af,
+				       const void *saddr, in_port_t sport,
+				       const void *daddr, in_port_t dport);
 
 union flow *flow_set_type(union flow *flow, enum flow_type type);
 #define FLOW_SET_TYPE(flow_, t_, var_)	(&flow_set_type((flow_), (t_))->var_)
diff --git a/icmp.c b/icmp.c
index 90708fe..37a3586 100644
--- a/icmp.c
+++ b/icmp.c
@@ -168,7 +168,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 
 
 	flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id);
-	flow_forward(flow, PIF_HOST);
+	flow_forward_af(flow, PIF_HOST, af, NULL, 0, daddr, 0);
 	pingf = FLOW_SET_TYPE(flow, flowtype, ping);
 
 	pingf->seq = -1;
diff --git a/inany.h b/inany.h
index 407690e..d962ff3 100644
--- a/inany.h
+++ b/inany.h
@@ -184,4 +184,32 @@ static inline void inany_siphash_feed(struct siphash_state *state,
 
 const char *inany_ntop(const union inany_addr *src, char *dst, socklen_t size);
 
+/** sockaddr_from_inany - Construct a sockaddr from an inany
+ * @sa:		Pointer to sockaddr to fill in
+ * @sl:		Updated to relevant length of initialised @sa
+ * @addr:	IPv[46] address
+ * @port:	Port (host byte order)
+ * @scope:	Scope ID (ignored for IPv4 addresses)
+ */
+static inline void sockaddr_from_inany(union sockaddr_inany *sa, socklen_t *sl,
+				       const union inany_addr *addr,
+				       in_port_t port, uint32_t scope)
+{
+	const struct in_addr *v4 = inany_v4(addr);
+
+	if (v4) {
+		sa->sa_family = AF_INET;
+		sa->sa4.sin_addr = *v4;
+		sa->sa4.sin_port = htons(port);
+		*sl = sizeof(sa->sa4);
+	} else {
+		sa->sa_family = AF_INET6;
+		sa->sa6.sin6_addr = addr->a6;
+		sa->sa6.sin6_port = htons(port);
+		sa->sa6.sin6_scope_id = scope;
+		sa->sa6.sin6_flowinfo = 0;
+		*sl = sizeof(sa->sa6);
+	}
+}
+
 #endif /* INANY_H */
diff --git a/tcp.c b/tcp.c
index bcc36fb..5fb3ce9 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1933,18 +1933,10 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 {
 	in_port_t srcport = ntohs(th->source);
 	in_port_t dstport = ntohs(th->dest);
-	struct sockaddr_in addr4 = {
-		.sin_family = AF_INET,
-		.sin_port = htons(dstport),
-		.sin_addr = *(struct in_addr *)daddr,
-	};
-	struct sockaddr_in6 addr6 = {
-		.sin6_family = AF_INET6,
-		.sin6_port = htons(dstport),
-		.sin6_addr = *(struct in6_addr *)daddr,
-	};
-	const struct sockaddr *sa;
+	const struct flowside *ini, *fwd;
 	struct tcp_tap_conn *conn;
+	union inany_addr dstaddr; /* FIXME: Avoid bulky temporary */
+	union sockaddr_inany sa;
 	union flow *flow;
 	int s = -1, mss;
 	socklen_t sl;
@@ -1952,7 +1944,8 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 	if (!(flow = flow_alloc()))
 		return;
 
-	flow_initiate_af(flow, PIF_TAP, af, saddr, srcport, daddr, dstport);
+	ini = flow_initiate_af(flow, PIF_TAP,
+			       af, saddr, srcport, daddr, dstport);
 
 	if (af == AF_INET) {
 		if (IN4_IS_ADDR_UNSPECIFIED(saddr) ||
@@ -1984,19 +1977,28 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 			      dstport);
 			goto cancel;
 		}
+	} else {
+		ASSERT(0);
 	}
 
 	if ((s = tcp_conn_sock(c, af)) < 0)
 		goto cancel;
 
+	dstaddr = ini->faddr;
+
 	if (!c->no_map_gw) {
-		if (af == AF_INET && IN4_ARE_ADDR_EQUAL(daddr, &c->ip4.gw))
-			addr4.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
-		if (af == AF_INET6 && IN6_ARE_ADDR_EQUAL(daddr, &c->ip6.gw))
-			addr6.sin6_addr	= in6addr_loopback;
+		struct in_addr *v4 = inany_v4(&dstaddr);
+
+		if (v4 && IN4_ARE_ADDR_EQUAL(v4, &c->ip4.gw))
+			*v4 = in4addr_loopback;
+		if (IN6_ARE_ADDR_EQUAL(&dstaddr, &c->ip6.gw))
+			dstaddr.a6 = in6addr_loopback;
 	}
 
-	if (af == AF_INET6 && IN6_IS_ADDR_LINKLOCAL(&addr6.sin6_addr)) {
+	fwd = flow_forward_af(flow, PIF_HOST, AF_INET6,
+			      &inany_any6, srcport, &dstaddr, dstport);
+
+	if (IN6_IS_ADDR_LINKLOCAL(&fwd->eaddr)) {
 		struct sockaddr_in6 addr6_ll = {
 			.sin6_family = AF_INET6,
 			.sin6_addr = c->ip6.addr_ll,
@@ -2004,9 +2006,10 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 		};
 		if (bind(s, (struct sockaddr *)&addr6_ll, sizeof(addr6_ll)))
 			goto cancel;
+	} else if (!inany_is_loopback(&fwd->eaddr)) {
+		tcp_bind_outbound(c, s, af);
 	}
 
-	flow_forward(flow, PIF_HOST);
 	conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
 	conn->sock = s;
 	conn->timer = -1;
@@ -2029,14 +2032,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 
 	inany_from_af(&conn->faddr, af, daddr);
 
-	if (af == AF_INET) {
-		sa = (struct sockaddr *)&addr4;
-		sl = sizeof(addr4);
-	} else {
-		sa = (struct sockaddr *)&addr6;
-		sl = sizeof(addr6);
-	}
-
 	conn->fport = dstport;
 	conn->eport = srcport;
 
@@ -2049,19 +2044,16 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 
 	tcp_hash_insert(c, conn);
 
-	if (!bind(s, sa, sl)) {
+	sockaddr_from_inany(&sa, &sl, &fwd->eaddr, fwd->eport, c->ifi6);
+
+	if (!bind(s, &sa.sa, sl)) {
 		tcp_rst(c, conn);	/* Nobody is listening then */
 		return;
 	}
 	if (errno != EADDRNOTAVAIL && errno != EACCES)
 		conn_flag(c, conn, LOCAL);
 
-	if ((af == AF_INET &&  !IN4_IS_ADDR_LOOPBACK(&addr4.sin_addr)) ||
-	    (af == AF_INET6 && !IN6_IS_ADDR_LOOPBACK(&addr6.sin6_addr) &&
-			       !IN6_IS_ADDR_LINKLOCAL(&addr6.sin6_addr)))
-		tcp_bind_outbound(c, s, af);
-
-	if (connect(s, sa, sl)) {
+	if (connect(s, &sa.sa, sl)) {
 		if (errno != EINPROGRESS) {
 			tcp_rst(c, conn);
 			return;
@@ -2726,9 +2718,25 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
 				   const union sockaddr_inany *sa,
 				   const struct timespec *now)
 {
+	union inany_addr saddr, daddr; /* FIXME: avoid bulky temporaries */
 	struct tcp_tap_conn *conn;
+	in_port_t srcport;
+
+	inany_from_sockaddr(&saddr, &srcport, sa);
+	tcp_snat_inbound(c, &saddr);
 
-	flow_forward(flow, PIF_TAP);
+	if (inany_v4(&saddr)) {
+		inany_from_af(&daddr, AF_INET, &c->ip4.addr_seen);
+	} else {
+		if (IN6_IS_ADDR_LINKLOCAL(&saddr))
+			daddr.a6 = c->ip6.addr_ll_seen;
+		else
+			daddr.a6 = c->ip6.addr_seen;
+	}
+	dstport += c->tcp.fwd_in.delta[dstport];
+
+	flow_forward_af(flow,  PIF_TAP, AF_INET6,
+			&saddr, srcport, &daddr, dstport);
 	conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
 
 	conn->sock = s;
@@ -2736,10 +2744,9 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
 	conn->ws_to_tap = conn->ws_from_tap = 0;
 	conn_event(c, conn, SOCK_ACCEPTED);
 
-	inany_from_sockaddr(&conn->faddr, &conn->fport, sa);
-	conn->eport = dstport + c->tcp.fwd_in.delta[dstport];
-
-	tcp_snat_inbound(c, &conn->faddr);
+	conn->faddr = saddr;
+	conn->fport = srcport;
+	conn->eport = dstport;
 
 	tcp_seq_init(c, conn, now);
 	tcp_hash_insert(c, conn);
diff --git a/tcp_splice.c b/tcp_splice.c
index 0e02732..3a20b40 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -321,31 +321,20 @@ static int tcp_splice_connect_finish(const struct ctx *c,
  * tcp_splice_connect() - Create and connect socket for new spliced connection
  * @c:		Execution context
  * @conn:	Connection pointer
- * @af:		Address family
- * @pif:	pif on which to create socket
- * @port:	Destination port, host order
  *
  * Return: 0 for connect() succeeded or in progress, negative value on error
  */
-static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
-			      sa_family_t af, uint8_t pif, in_port_t port)
+static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn)
 {
-	struct sockaddr_in6 addr6 = {
-		.sin6_family = AF_INET6,
-		.sin6_port = htons(port),
-		.sin6_addr = IN6ADDR_LOOPBACK_INIT,
-	};
-	struct sockaddr_in addr4 = {
-		.sin_family = AF_INET,
-		.sin_port = htons(port),
-		.sin_addr = IN4ADDR_LOOPBACK_INIT,
-	};
-	const struct sockaddr *sa;
+	const struct flowside *fwd = &conn->f.side[FWDSIDE];
+	sa_family_t af = inany_v4(&fwd->eaddr) ? AF_INET : AF_INET6;
+	uint8_t pif1 = conn->f.pif[FWDSIDE];
+	union sockaddr_inany sa;
 	socklen_t sl;
 
-	if (pif == PIF_HOST)
+	if (pif1 == PIF_HOST)
 		conn->s[1] = tcp_conn_sock(c, af);
-	else if (pif == PIF_SPLICE)
+	else if (pif1 == PIF_SPLICE)
 		conn->s[1] = tcp_conn_sock_ns(c, af);
 	else
 		ASSERT(0);
@@ -359,15 +348,9 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
 			   conn->s[1]);
 	}
 
-	if (CONN_V6(conn)) {
-		sa = (struct sockaddr *)&addr6;
-		sl = sizeof(addr6);
-	} else {
-		sa = (struct sockaddr *)&addr4;
-		sl = sizeof(addr4);
-	}
+	sockaddr_from_inany(&sa, &sl, &fwd->eaddr, fwd->eport, 0);
 
-	if (connect(conn->s[1], sa, sl)) {
+	if (connect(conn->s[1], &sa.sa, sl)) {
 		if (errno != EINPROGRESS) {
 			flow_trace(conn, "Couldn't connect socket for splice: %s",
 				   strerror(errno));
@@ -472,7 +455,12 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
 		return false;
 	}
 
-	flow_forward(flow, pif1);
+	if (af == AF_INET)
+		flow_forward_af(flow, pif1, AF_INET,
+				NULL, 0, &in4addr_loopback, dstport);
+	else
+		flow_forward_af(flow, pif1, AF_INET6,
+				NULL, 0, &in6addr_loopback, dstport);
 	conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, tcp_splice);
 
 	conn->flags = af == AF_INET ? 0 : SPLICE_V6;
@@ -484,7 +472,7 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
 	if (setsockopt(s0, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int)))
 		flow_trace(conn, "failed to set TCP_QUICKACK on %i", s0);
 
-	if (tcp_splice_connect(c, conn, af, pif1, dstport))
+	if (tcp_splice_connect(c, conn))
 		conn_flag(c, conn, CLOSING);
 
 	FLOW_ACTIVATE(conn);
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 08/19] tcp, flow: Remove redundant information, repack connection structures
  2024-05-14  1:03 [PATCH v5 00/19] RFC: Unified flow table David Gibson
                   ` (6 preceding siblings ...)
  2024-05-14  1:03 ` [PATCH v5 07/19] flow: Populate address information for non-initiating side David Gibson
@ 2024-05-14  1:03 ` David Gibson
  2024-05-14  1:03 ` [PATCH v5 09/19] tcp: Obtain guest address from flowside David Gibson
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2024-05-14  1:03 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

Some information we explicitly store in the TCP connection is now
duplicated in the common flow structure.  Access it from there instead, and
remove it from the TCP-specific structure.  With that done, we can reorder
both the "tap" and "splice" TCP structures a bit to get better packing for
the new combined flow table entries.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
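As a rough illustration of why reordering members can improve packing,
here is a minimal sketch.  The struct names and members are hypothetical
stand-ins, not the real definitions from tcp_conn.h, and the exact sizes
depend on the ABI (the comments assume a 4-byte int with 4-byte
alignment).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: 1-byte members interleaved with 4-byte ones,
 * so the compiler has to insert padding after each small member.
 */
struct demo_conn_loose {
	bool		in_epoll;	/* padding follows, up to int alignment */
	int		s[2];
	uint8_t		events;		/* padding follows */
	uint32_t	read[2];
	uint8_t		flags;		/* padding follows */
	uint32_t	written[2];
};

/* Same members, with the 4-byte ones first and the 1-byte ones grouped
 * at the end, so at most one run of tail padding is needed.
 */
struct demo_conn_tight {
	int		s[2];
	uint32_t	read[2];
	uint32_t	written[2];
	uint8_t		events;
	uint8_t		flags;
	bool		in_epoll;
};

/* Bytes saved by the reordering on the current ABI */
size_t demo_padding_saved(void)
{
	return sizeof(struct demo_conn_loose) - sizeof(struct demo_conn_tight);
}
```

Calling demo_padding_saved() from a small test program shows how many
bytes per entry the regrouping recovers on a given platform.  In a table
with many entries, that difference is multiplied by the table size.
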
 tcp.c      | 53 +++++++++++++++++++++++++++++------------------------
 tcp_conn.h | 24 ++++++++----------------
 2 files changed, 37 insertions(+), 40 deletions(-)

diff --git a/tcp.c b/tcp.c
index 5fb3ce9..d4cd78c 100644
--- a/tcp.c
+++ b/tcp.c
@@ -369,8 +369,9 @@
 #define OPT_TS		8
 
 #define TAPSIDE(conn_)	((conn_)->f.pif[1] == PIF_TAP)
+#define TAPFLOW(conn_)	(&((conn_)->f.side[TAPSIDE(conn_)]))
 
-#define CONN_V4(conn)		(!!inany_v4(&(conn)->faddr))
+#define CONN_V4(conn)		(!!inany_v4(&TAPFLOW(conn)->faddr))
 #define CONN_V6(conn)		(!CONN_V4(conn))
 #define CONN_IS_CLOSING(conn)						\
 	((conn->events & ESTABLISHED) &&				\
@@ -793,10 +794,11 @@ static void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
  */
 static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn)
 {
+	const struct flowside *tapside = TAPFLOW(conn);
 	int i;
 
 	for (i = 0; i < LOW_RTT_TABLE_SIZE; i++)
-		if (inany_equals(&conn->faddr, low_rtt_dst + i))
+		if (inany_equals(&tapside->faddr, low_rtt_dst + i))
 			return 1;
 
 	return 0;
@@ -811,6 +813,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
 			      const struct tcp_info *tinfo)
 {
 #ifdef HAS_MIN_RTT
+	const struct flowside *tapside = TAPFLOW(conn);
 	int i, hole = -1;
 
 	if (!tinfo->tcpi_min_rtt ||
@@ -818,7 +821,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
 		return;
 
 	for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) {
-		if (inany_equals(&conn->faddr, low_rtt_dst + i))
+		if (inany_equals(&tapside->faddr, low_rtt_dst + i))
 			return;
 		if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i))
 			hole = i;
@@ -830,7 +833,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
 	if (hole == -1)
 		return;
 
-	low_rtt_dst[hole++] = conn->faddr;
+	low_rtt_dst[hole++] = tapside->faddr;
 	if (hole == LOW_RTT_TABLE_SIZE)
 		hole = 0;
 	inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any);
@@ -1083,8 +1086,10 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn,
 			  const union inany_addr *faddr,
 			  in_port_t eport, in_port_t fport)
 {
-	if (inany_equals(&conn->faddr, faddr) &&
-	    conn->eport == eport && conn->fport == fport)
+	const struct flowside *tapside = TAPFLOW(conn);
+
+	if (inany_equals(&tapside->faddr, faddr) &&
+	    tapside->eport == eport && tapside->fport == fport)
 		return 1;
 
 	return 0;
@@ -1118,7 +1123,10 @@ static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr,
 static uint64_t tcp_conn_hash(const struct ctx *c,
 			      const struct tcp_tap_conn *conn)
 {
-	return tcp_hash(c, &conn->faddr, conn->eport, conn->fport);
+	const struct flowside *tapside = TAPFLOW(conn);
+
+	return tcp_hash(c, &tapside->faddr, tapside->eport,
+			tapside->fport);
 }
 
 /**
@@ -1300,10 +1308,12 @@ void tcp_defer_handler(struct ctx *c)
  * @seq:	Sequence number
  */
 static void tcp_fill_header(struct tcphdr *th,
-			       const struct tcp_tap_conn *conn, uint32_t seq)
+			    const struct tcp_tap_conn *conn, uint32_t seq)
 {
-	th->source = htons(conn->fport);
-	th->dest = htons(conn->eport);
+	const struct flowside *tapside = TAPFLOW(conn);
+
+	th->source = htons(tapside->fport);
+	th->dest = htons(tapside->eport);
 	th->seq = htonl(seq);
 	th->ack_seq = htonl(conn->seq_ack_to_tap);
 	if (conn->events & ESTABLISHED)	{
@@ -1335,7 +1345,8 @@ static size_t tcp_fill_headers4(const struct ctx *c,
 				size_t dlen, const uint16_t *check,
 				uint32_t seq)
 {
-	const struct in_addr *a4 = inany_v4(&conn->faddr);
+	const struct flowside *tapside = TAPFLOW(conn);
+	const struct in_addr *a4 = inany_v4(&tapside->faddr);
 	size_t l4len = dlen + sizeof(*th);
 	size_t l3len = l4len + sizeof(*iph);
 
@@ -1377,10 +1388,11 @@ static size_t tcp_fill_headers6(const struct ctx *c,
 				struct ipv6hdr *ip6h, struct tcphdr *th,
 				size_t dlen, uint32_t seq)
 {
+	const struct flowside *tapside = TAPFLOW(conn);
 	size_t l4len = dlen + sizeof(*th);
 
 	ip6h->payload_len = htons(l4len);
-	ip6h->saddr = conn->faddr.a6;
+	ip6h->saddr = tapside->faddr.a6;
 	if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr))
 		ip6h->daddr = c->ip6.addr_ll_seen;
 	else
@@ -1419,7 +1431,8 @@ static size_t tcp_l2_buf_fill_headers(const struct ctx *c,
 				      struct iovec *iov, size_t dlen,
 				      const uint16_t *check, uint32_t seq)
 {
-	const struct in_addr *a4 = inany_v4(&conn->faddr);
+	const struct flowside *tapside = TAPFLOW(conn);
+	const struct in_addr *a4 = inany_v4(&tapside->faddr);
 
 	if (a4) {
 		return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base,
@@ -1743,6 +1756,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
 			 const struct timespec *now)
 {
 	struct siphash_state state = SIPHASH_INIT(c->hash_secret);
+	const struct flowside *tapside = TAPFLOW(conn);
 	union inany_addr aany;
 	uint64_t hash;
 	uint32_t ns;
@@ -1752,10 +1766,10 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
 	else
 		inany_from_af(&aany, AF_INET6, &c->ip6.addr);
 
-	inany_siphash_feed(&state, &conn->faddr);
+	inany_siphash_feed(&state, &tapside->faddr);
 	inany_siphash_feed(&state, &aany);
 	hash = siphash_final(&state, 36,
-			     (uint64_t)conn->fport << 16 | conn->eport);
+			     (uint64_t)tapside->fport << 16 | tapside->eport);
 
 	/* 32ns ticks, overflows 32 bits every 137s */
 	ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
@@ -2030,11 +2044,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 	if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap)))
 		conn->wnd_from_tap = 1;
 
-	inany_from_af(&conn->faddr, af, daddr);
-
-	conn->fport = dstport;
-	conn->eport = srcport;
-
 	conn->seq_init_from_tap = ntohl(th->seq);
 	conn->seq_from_tap = conn->seq_init_from_tap + 1;
 	conn->seq_ack_to_tap = conn->seq_from_tap;
@@ -2744,10 +2753,6 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
 	conn->ws_to_tap = conn->ws_from_tap = 0;
 	conn_event(c, conn, SOCK_ACCEPTED);
 
-	conn->faddr = saddr;
-	conn->fport = srcport;
-	conn->eport = dstport;
-
 	tcp_seq_init(c, conn, now);
 	tcp_hash_insert(c, conn);
 
diff --git a/tcp_conn.h b/tcp_conn.h
index 1a07dd5..efca9ef 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -23,9 +23,6 @@
  * @ws_to_tap:		Window scaling factor advertised to tap/guest
  * @sndbuf:		Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS
  * @seq_dup_ack_approx:	Last duplicate ACK number sent to tap
- * @faddr:		Guest side forwarding address (guest's remote address)
- * @eport:		Guest side endpoint port (guest's local port)
- * @fport:		Guest side forwarding port (guest's remote port)
  * @wnd_from_tap:	Last window size from tap, unscaled (as received)
  * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
  * @seq_to_tap:		Next sequence for packets to tap
@@ -49,6 +46,10 @@ struct tcp_tap_conn {
 	unsigned int	ws_from_tap	:TCP_WS_BITS;
 	unsigned int	ws_to_tap	:TCP_WS_BITS;
 
+#define TCP_MSS_BITS			14
+	unsigned int	tap_mss		:TCP_MSS_BITS;
+#define MSS_SET(conn, mss)	(conn->tap_mss = (mss >> (16 - TCP_MSS_BITS)))
+#define MSS_GET(conn)		(conn->tap_mss << (16 - TCP_MSS_BITS))
 
 	int		sock		:FD_REF_BITS;
 
@@ -78,11 +79,6 @@ struct tcp_tap_conn {
 #define ACK_FROM_TAP_DUE	BIT(4)
 
 
-#define TCP_MSS_BITS			14
-	unsigned int	tap_mss		:TCP_MSS_BITS;
-#define MSS_SET(conn, mss)	(conn->tap_mss = (mss >> (16 - TCP_MSS_BITS)))
-#define MSS_GET(conn)		(conn->tap_mss << (16 - TCP_MSS_BITS))
-
 
 #define SNDBUF_BITS		24
 	unsigned int	sndbuf		:SNDBUF_BITS;
@@ -91,11 +87,6 @@ struct tcp_tap_conn {
 
 	uint8_t		seq_dup_ack_approx;
 
-
-	union inany_addr faddr;
-	in_port_t	eport;
-	in_port_t	fport;
-
 	uint16_t	wnd_from_tap;
 	uint16_t	wnd_to_tap;
 
@@ -121,10 +112,12 @@ struct tcp_splice_conn {
 	/* Must be first element */
 	struct flow_common f;
 
-	bool in_epoll	:1;
 	int s[SIDES];
 	int pipe[SIDES][2];
 
+	uint32_t read[SIDES];
+	uint32_t written[SIDES];
+
 	uint8_t events;
 #define SPLICE_CLOSED			0
 #define SPLICE_CONNECT			BIT(0)
@@ -144,8 +137,7 @@ struct tcp_splice_conn {
 #define RCVLOWAT_ACT_1			BIT(4)
 #define CLOSING				BIT(5)
 
-	uint32_t read[SIDES];
-	uint32_t written[SIDES];
+	bool in_epoll	:1;
 };
 
 /* Socket pools */
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 09/19] tcp: Obtain guest address from flowside
  2024-05-14  1:03 [PATCH v5 00/19] RFC: Unified flow table David Gibson
                   ` (7 preceding siblings ...)
  2024-05-14  1:03 ` [PATCH v5 08/19] tcp, flow: Remove redundant information, repack connection structures David Gibson
@ 2024-05-14  1:03 ` David Gibson
  2024-05-14  1:03 ` [PATCH v5 10/19] tcp: Simplify endpoint validation using flowside information David Gibson
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2024-05-14  1:03 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

Currently we always deliver inbound TCP packets to the guest's most
recently observed IP address.  This has the odd side effect that if the
guest changes its IP address while it has active TCP connections, we might
deliver packets from old connections to the new address.  That won't
work; it will probably result in an RST from the guest.  Worse, if the
guest adds a new address but also retains the old one, then we could
break those old connections by redirecting them to the new address.

Now that we maintain flowside information, we have a record of the correct
guest side address and can just use it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
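The difference between the two approaches can be sketched as follows.
This is a simplified illustration only: struct demo_ctx, struct
demo_flowside and the demo_*() helpers are hypothetical stand-ins for the
real context and flowside structures, and the addresses are documentation
examples.

```c
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical stand-in for the global context */
struct demo_ctx {
	struct in_addr addr_seen;	/* guest address most recently observed */
};

/* Hypothetical stand-in for the guest side of one flow */
struct demo_flowside {
	struct in_addr eaddr;		/* guest address recorded at flow creation */
	struct in_addr faddr;		/* forwarding (remote) address */
};

/* Old approach: destination follows whichever address was seen last */
struct in_addr demo_daddr_from_ctx(const struct demo_ctx *c)
{
	return c->addr_seen;
}

/* New approach: destination is the address recorded for this flow */
struct in_addr demo_daddr_from_flow(const struct demo_flowside *tapside)
{
	return tapside->eaddr;
}

/* Return 1 if the per-flow address still differs from the global one
 * after the guest starts using a second address, 0 otherwise.
 */
int demo_flow_addr_is_stable(void)
{
	struct demo_flowside side;
	struct demo_ctx c;

	inet_pton(AF_INET, "192.0.2.10", &c.addr_seen);
	side.eaddr = c.addr_seen;	/* flow created with the first address */
	inet_pton(AF_INET, "198.51.100.1", &side.faddr);

	/* Guest is later observed using a different address */
	inet_pton(AF_INET, "192.0.2.20", &c.addr_seen);

	return demo_daddr_from_flow(&side).s_addr !=
	       demo_daddr_from_ctx(&c).s_addr;
}
```

In this sketch the existing flow keeps addressing 192.0.2.10, while a
lookup through the context would have switched to 192.0.2.20.
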
 tcp.c | 47 ++++++++++++++++-------------------------------
 1 file changed, 16 insertions(+), 31 deletions(-)

diff --git a/tcp.c b/tcp.c
index d4cd78c..078ec69 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1327,7 +1327,6 @@ static void tcp_fill_header(struct tcphdr *th,
 
 /**
  * tcp_fill_headers4() - Fill 802.3, IPv4, TCP headers in pre-cooked buffers
- * @c:		Execution context
  * @conn:	Connection pointer
  * @taph:	tap backend specific header
  * @iph:	Pointer to IPv4 header
@@ -1338,27 +1337,26 @@ static void tcp_fill_header(struct tcphdr *th,
  *
  * Return: The IPv4 payload length, host order
  */
-static size_t tcp_fill_headers4(const struct ctx *c,
-				const struct tcp_tap_conn *conn,
+static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
 				struct tap_hdr *taph,
 				struct iphdr *iph, struct tcphdr *th,
 				size_t dlen, const uint16_t *check,
 				uint32_t seq)
 {
 	const struct flowside *tapside = TAPFLOW(conn);
-	const struct in_addr *a4 = inany_v4(&tapside->faddr);
+	const struct in_addr *src4 = inany_v4(&tapside->faddr);
+	const struct in_addr *dst4 = inany_v4(&tapside->eaddr);
 	size_t l4len = dlen + sizeof(*th);
 	size_t l3len = l4len + sizeof(*iph);
 
-	ASSERT(a4);
+	ASSERT(src4 && dst4);
 
 	iph->tot_len = htons(l3len);
-	iph->saddr = a4->s_addr;
-	iph->daddr = c->ip4.addr_seen.s_addr;
+	iph->saddr = src4->s_addr;
+	iph->daddr = dst4->s_addr;
 
 	iph->check = check ? *check :
-			     csum_ip4_header(l3len, IPPROTO_TCP,
-					     *a4, c->ip4.addr_seen);
+			     csum_ip4_header(l3len, IPPROTO_TCP, *src4, *dst4);
 
 	tcp_fill_header(th, conn, seq);
 
@@ -1371,7 +1369,6 @@ static size_t tcp_fill_headers4(const struct ctx *c,
 
 /**
  * tcp_fill_headers6() - Fill 802.3, IPv6, TCP headers in pre-cooked buffers
- * @c:		Execution context
  * @conn:	Connection pointer
  * @taph:	tap backend specific header
  * @ip6h:	Pointer to IPv6 header
@@ -1382,8 +1379,7 @@ static size_t tcp_fill_headers4(const struct ctx *c,
  *
  * Return: The IPv6 payload length, host order
  */
-static size_t tcp_fill_headers6(const struct ctx *c,
-				const struct tcp_tap_conn *conn,
+static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
 				struct tap_hdr *taph,
 				struct ipv6hdr *ip6h, struct tcphdr *th,
 				size_t dlen, uint32_t seq)
@@ -1393,10 +1389,7 @@ static size_t tcp_fill_headers6(const struct ctx *c,
 
 	ip6h->payload_len = htons(l4len);
 	ip6h->saddr = tapside->faddr.a6;
-	if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr))
-		ip6h->daddr = c->ip6.addr_ll_seen;
-	else
-		ip6h->daddr = c->ip6.addr_seen;
+	ip6h->daddr = tapside->eaddr.a6;
 
 	ip6h->hop_limit = 255;
 	ip6h->version = 6;
@@ -1417,7 +1410,6 @@ static size_t tcp_fill_headers6(const struct ctx *c,
 
 /**
  * tcp_l2_buf_fill_headers() - Fill 802.3, IP, TCP headers in pre-cooked buffers
- * @c:		Execution context
  * @conn:	Connection pointer
  * @iov:	Pointer to an array of iovec of TCP pre-cooked buffers
  * @dlen:	TCP payload length
@@ -1426,8 +1418,7 @@ static size_t tcp_fill_headers6(const struct ctx *c,
  *
  * Return: IP payload length, host order
  */
-static size_t tcp_l2_buf_fill_headers(const struct ctx *c,
-				      const struct tcp_tap_conn *conn,
+static size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
 				      struct iovec *iov, size_t dlen,
 				      const uint16_t *check, uint32_t seq)
 {
@@ -1435,13 +1426,13 @@ static size_t tcp_l2_buf_fill_headers(const struct ctx *c,
 	const struct in_addr *a4 = inany_v4(&tapside->faddr);
 
 	if (a4) {
-		return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base,
+		return tcp_fill_headers4(conn, iov[TCP_IOV_TAP].iov_base,
 					 iov[TCP_IOV_IP].iov_base,
 					 iov[TCP_IOV_PAYLOAD].iov_base, dlen,
 					 check, seq);
 	}
 
-	return tcp_fill_headers6(c, conn, iov[TCP_IOV_TAP].iov_base,
+	return tcp_fill_headers6(conn, iov[TCP_IOV_TAP].iov_base,
 				 iov[TCP_IOV_IP].iov_base,
 				 iov[TCP_IOV_PAYLOAD].iov_base, dlen,
 				 seq);
@@ -1657,7 +1648,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	th->syn = !!(flags & SYN);
 	th->fin = !!(flags & FIN);
 
-	l4len = tcp_l2_buf_fill_headers(c, conn, iov, optlen, NULL,
+	l4len = tcp_l2_buf_fill_headers(conn, iov, optlen, NULL,
 					conn->seq_to_tap);
 	iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 
@@ -1757,17 +1748,11 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
 {
 	struct siphash_state state = SIPHASH_INIT(c->hash_secret);
 	const struct flowside *tapside = TAPFLOW(conn);
-	union inany_addr aany;
 	uint64_t hash;
 	uint32_t ns;
 
-	if (CONN_V4(conn))
-		inany_from_af(&aany, AF_INET, &c->ip4.addr);
-	else
-		inany_from_af(&aany, AF_INET6, &c->ip6.addr);
-
 	inany_siphash_feed(&state, &tapside->faddr);
-	inany_siphash_feed(&state, &aany);
+	inany_siphash_feed(&state, &tapside->eaddr);
 	hash = siphash_final(&state, 36,
 			     (uint64_t)tapside->fport << 16 | tapside->eport);
 
@@ -2149,7 +2134,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 		tcp4_seq_update[tcp4_payload_used].len = dlen;
 
 		iov = tcp4_l2_iov[tcp4_payload_used++];
-		l4len = tcp_l2_buf_fill_headers(c, conn, iov, dlen, check, seq);
+		l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, check, seq);
 		iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 		if (tcp4_payload_used > TCP_FRAMES_MEM - 1)
 			tcp_payload_flush(c);
@@ -2158,7 +2143,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 		tcp6_seq_update[tcp6_payload_used].len = dlen;
 
 		iov = tcp6_l2_iov[tcp6_payload_used++];
-		l4len = tcp_l2_buf_fill_headers(c, conn, iov, dlen, NULL, seq);
+		l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, NULL, seq);
 		iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 		if (tcp6_payload_used > TCP_FRAMES_MEM - 1)
 			tcp_payload_flush(c);
-- 
-	else
-		inany_from_af(&aany, AF_INET6, &c->ip6.addr);
-
 	inany_siphash_feed(&state, &tapside->faddr);
-	inany_siphash_feed(&state, &aany);
+	inany_siphash_feed(&state, &tapside->eaddr);
 	hash = siphash_final(&state, 36,
 			     (uint64_t)tapside->fport << 16 | tapside->eport);
 
@@ -2149,7 +2134,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 		tcp4_seq_update[tcp4_payload_used].len = dlen;
 
 		iov = tcp4_l2_iov[tcp4_payload_used++];
-		l4len = tcp_l2_buf_fill_headers(c, conn, iov, dlen, check, seq);
+		l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, check, seq);
 		iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 		if (tcp4_payload_used > TCP_FRAMES_MEM - 1)
 			tcp_payload_flush(c);
@@ -2158,7 +2143,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 		tcp6_seq_update[tcp6_payload_used].len = dlen;
 
 		iov = tcp6_l2_iov[tcp6_payload_used++];
-		l4len = tcp_l2_buf_fill_headers(c, conn, iov, dlen, NULL, seq);
+		l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, NULL, seq);
 		iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 		if (tcp6_payload_used > TCP_FRAMES_MEM - 1)
 			tcp_payload_flush(c);
-- 
2.45.0



* [PATCH v5 10/19] tcp: Simplify endpoint validation using flowside information
  2024-05-14  1:03 [PATCH v5 00/19] RFC: Unified flow table David Gibson
                   ` (8 preceding siblings ...)
  2024-05-14  1:03 ` [PATCH v5 09/19] tcp: Obtain guest address from flowside David Gibson
@ 2024-05-14  1:03 ` David Gibson
  2024-05-14  1:03 ` [PATCH v5 11/19] tcp_splice: Eliminate SPLICE_V6 flag David Gibson
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2024-05-14  1:03 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

Now that we store all our endpoints in the flowside structure, use some
inany helpers to make validation of those endpoints simpler.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 inany.h |  1 -
 tcp.c   | 71 ++++++++++++---------------------------------------------
 2 files changed, 15 insertions(+), 57 deletions(-)
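
For readers without the inany helpers at hand, here is a minimal standalone
sketch of the kind of check this patch relies on.  It is not the passt
implementation: struct any_addr, any_parse(), any_v4(), any_is_unicast() and
endpoint_valid() are made-up names, and the sketch assumes IPv4 addresses are
kept in IPv4-mapped form so that one code path can validate both families.
Rejecting the IPv4 broadcast address mirrors the checks being removed; how
inany_is_unicast() itself treats broadcast is not shown in this patch.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for union inany_addr: one 16-byte address, with IPv4
 * stored in IPv4-mapped form (::ffff:a.b.c.d).
 */
struct any_addr {
	uint8_t b[16];
};

/* Parse an IPv4 or IPv6 literal into the unified form */
static bool any_parse(struct any_addr *a, const char *s)
{
	struct in_addr v4;

	if (inet_pton(AF_INET, s, &v4) == 1) {
		memset(a->b, 0, 10);
		a->b[10] = a->b[11] = 0xff;
		memcpy(&a->b[12], &v4, sizeof(v4));
		return true;
	}

	return inet_pton(AF_INET6, s, a->b) == 1;
}

/* Return the four IPv4 bytes if @a is IPv4-mapped, NULL otherwise */
static const uint8_t *any_v4(const struct any_addr *a)
{
	static const uint8_t prefix[12] = { [10] = 0xff, [11] = 0xff };

	return memcmp(a->b, prefix, sizeof(prefix)) ? NULL : &a->b[12];
}

/* True if @a is neither unspecified, multicast nor IPv4 broadcast */
static bool any_is_unicast(const struct any_addr *a)
{
	static const uint8_t zero[16];
	const uint8_t *v4 = any_v4(a);

	if (v4) {
		bool unspec = !memcmp(v4, zero, 4);
		bool bcast = v4[0] == 0xff && v4[1] == 0xff &&
			     v4[2] == 0xff && v4[3] == 0xff;
		bool mcast = (v4[0] & 0xf0) == 0xe0;	/* 224.0.0.0/4 */

		return !unspec && !bcast && !mcast;
	}

	/* IPv6: reject :: and ff00::/8 */
	return memcmp(a->b, zero, sizeof(zero)) && a->b[0] != 0xff;
}

/* One family-independent check replaces the per-family branches */
static bool endpoint_valid(const struct any_addr *a, in_port_t port)
{
	return any_is_unicast(a) && port != 0;
}
```

With this shape, a caller only needs endpoint_valid(&addr, port) for each
endpoint, whatever the address family of the original packet or socket.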

diff --git a/inany.h b/inany.h
index d962ff3..6135f26 100644
--- a/inany.h
+++ b/inany.h
@@ -123,7 +123,6 @@ static inline bool inany_is_multicast(const union inany_addr *a)
  *
  * Return: true if @a is specified and a unicast address
  */
-/* cppcheck-suppress unusedFunction */
 static inline bool inany_is_unicast(const union inany_addr *a)
 {
 	return !inany_is_unspecified(a) && !inany_is_multicast(a);
diff --git a/tcp.c b/tcp.c
index 078ec69..9895c19 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1946,38 +1946,14 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 	ini = flow_initiate_af(flow, PIF_TAP,
 			       af, saddr, srcport, daddr, dstport);
 
-	if (af == AF_INET) {
-		if (IN4_IS_ADDR_UNSPECIFIED(saddr) ||
-		    IN4_IS_ADDR_BROADCAST(saddr) ||
-		    IN4_IS_ADDR_MULTICAST(saddr) || srcport == 0 ||
-		    IN4_IS_ADDR_UNSPECIFIED(daddr) ||
-		    IN4_IS_ADDR_BROADCAST(daddr) ||
-		    IN4_IS_ADDR_MULTICAST(daddr) || dstport == 0) {
-			char sstr[INET_ADDRSTRLEN], dstr[INET_ADDRSTRLEN];
-
-			debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu",
-			      inet_ntop(AF_INET, saddr, sstr, sizeof(sstr)),
-			      srcport,
-			      inet_ntop(AF_INET, daddr, dstr, sizeof(dstr)),
-			      dstport);
-			goto cancel;
-		}
-	} else if (af == AF_INET6) {
-		if (IN6_IS_ADDR_UNSPECIFIED(saddr) ||
-		    IN6_IS_ADDR_MULTICAST(saddr) || srcport == 0 ||
-		    IN6_IS_ADDR_UNSPECIFIED(daddr) ||
-		    IN6_IS_ADDR_MULTICAST(daddr) || dstport == 0) {
-			char sstr[INET6_ADDRSTRLEN], dstr[INET6_ADDRSTRLEN];
-
-			debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu",
-			      inet_ntop(AF_INET6, saddr, sstr, sizeof(sstr)),
-			      srcport,
-			      inet_ntop(AF_INET6, daddr, dstr, sizeof(dstr)),
-			      dstport);
-			goto cancel;
-		}
-	} else {
-		ASSERT(0);
+	if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0 ||
+	    !inany_is_unicast(&ini->faddr) || ini->fport == 0) {
+		char sstr[INANY_ADDRSTRLEN], dstr[INANY_ADDRSTRLEN];
+
+		debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu",
+		      inany_ntop(&ini->eaddr, sstr, sizeof(sstr)), ini->eport,
+		      inany_ntop(&ini->faddr, dstr, sizeof(dstr)), ini->fport);
+		goto cancel;
 	}
 
 	if ((s = tcp_conn_sock(c, af)) < 0)
@@ -2762,6 +2738,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
 void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
 			const struct timespec *now)
 {
+	const struct flowside *ini;
 	union sockaddr_inany sa;
 	socklen_t sl = sizeof(sa);
 	union flow *flow;
@@ -2775,32 +2752,14 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
 		goto cancel;
 
 	flow_initiate_sa(flow, ref.tcp_listen.pif, &sa, ref.tcp_listen.port);
+	ini = &flow->f.side[INISIDE];
 
-	if (sa.sa_family == AF_INET) {
-		const struct in_addr *addr = &sa.sa4.sin_addr;
-		in_port_t port = sa.sa4.sin_port;
+	if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0) {
+		char str[INANY_ADDRSTRLEN];
 
-		if (IN4_IS_ADDR_UNSPECIFIED(addr) ||
-		    IN4_IS_ADDR_BROADCAST(addr) ||
-		    IN4_IS_ADDR_MULTICAST(addr) || port == 0) {
-			char str[INET_ADDRSTRLEN];
-
-			err("Invalid endpoint from TCP accept(): %s:%hu",
-			    inet_ntop(AF_INET, addr, str, sizeof(str)), port);
-			goto cancel;
-		}
-	} else if (sa.sa_family == AF_INET6) {
-		const struct in6_addr *addr = &sa.sa6.sin6_addr;
-		in_port_t port = sa.sa6.sin6_port;
-
-		if (IN6_IS_ADDR_UNSPECIFIED(addr) ||
-		    IN6_IS_ADDR_MULTICAST(addr) || port == 0) {
-			char str[INET6_ADDRSTRLEN];
-
-			err("Invalid endpoint from TCP accept(): %s:%hu",
-			    inet_ntop(AF_INET6, addr, str, sizeof(str)), port);
-			goto cancel;
-		}
+		err("Invalid endpoint from TCP accept(): %s:%hu",
+		    inany_ntop(&ini->eaddr, str, sizeof(str)), ini->eport);
+		goto cancel;
 	}
 
 	if (tcp_splice_conn_from_sock(c, ref.tcp_listen.pif,
-- 
2.45.0



* [PATCH v5 11/19] tcp_splice: Eliminate SPLICE_V6 flag
  2024-05-14  1:03 [PATCH v5 00/19] RFC: Unified flow table David Gibson
                   ` (9 preceding siblings ...)
  2024-05-14  1:03 ` [PATCH v5 10/19] tcp: Simplify endpoint validation using flowside information David Gibson
@ 2024-05-14  1:03 ` David Gibson
  2024-05-14  1:03 ` [PATCH v5 12/19] tcp, flow: Replace TCP specific hash function with general flow hash David Gibson
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2024-05-14  1:03 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

Since we're now constructing socket addresses based on information in the
flowside, we no longer need an explicit flag to tell if we're dealing with
an IPv4 or IPv6 connection.  Hence, drop the now unused SPLICE_V6 flag.

As well as just simplifying the code, this allows for possible future
extensions where we could splice an IPv4 connection to an IPv6 connection
or vice versa.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp_conn.h   | 11 +++++------
 tcp_splice.c |  3 ---
 2 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/tcp_conn.h b/tcp_conn.h
index efca9ef..ccf7357 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -130,12 +130,11 @@ struct tcp_splice_conn {
 #define FIN_SENT_1			BIT(7)
 
 	uint8_t flags;
-#define SPLICE_V6			BIT(0)
-#define RCVLOWAT_SET_0			BIT(1)
-#define RCVLOWAT_SET_1			BIT(2)
-#define RCVLOWAT_ACT_0			BIT(3)
-#define RCVLOWAT_ACT_1			BIT(4)
-#define CLOSING				BIT(5)
+#define RCVLOWAT_SET_0			BIT(0)
+#define RCVLOWAT_SET_1			BIT(1)
+#define RCVLOWAT_ACT_0			BIT(2)
+#define RCVLOWAT_ACT_1			BIT(3)
+#define CLOSING				BIT(4)
 
 		bool in_epoll	:1;
 };
diff --git a/tcp_splice.c b/tcp_splice.c
index 3a20b40..aa92325 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -73,8 +73,6 @@ static int ns_sock_pool6	[TCP_SOCK_POOL_SIZE];
 /* Pool of pre-opened pipes */
 static int splice_pipe_pool		[TCP_SPLICE_PIPE_POOL_SIZE][2];
 
-#define CONN_V6(x)			(x->flags & SPLICE_V6)
-#define CONN_V4(x)			(!CONN_V6(x))
 #define CONN_HAS(conn, set)		((conn->events & (set)) == (set))
 #define CONN(idx)			(&FLOW(idx)->tcp_splice)
 
@@ -463,7 +461,6 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
 				NULL, 0, &in6addr_loopback, dstport);
 	conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, tcp_splice);
 
-	conn->flags = af == AF_INET ? 0 : SPLICE_V6;
 	conn->s[0] = s0;
 	conn->s[1] = -1;
 	conn->pipe[0][0] = conn->pipe[0][1] = -1;
-- 
2.45.0



* [PATCH v5 12/19] tcp, flow: Replace TCP specific hash function with general flow hash
  2024-05-14  1:03 [PATCH v5 00/19] RFC: Unified flow table David Gibson
                   ` (10 preceding siblings ...)
  2024-05-14  1:03 ` [PATCH v5 11/19] tcp_splice: Eliminate SPLICE_V6 flag David Gibson
@ 2024-05-14  1:03 ` David Gibson
  2024-05-14  1:03 ` [PATCH v5 13/19] flow, tcp: Generalise TCP hash table to general flow hash table David Gibson
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2024-05-14  1:03 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

Currently we match TCP packets received on the tap connection to a TCP
connection via a hash table based on the forwarding address and both
ports.  We hope in future to allow for multiple guest side addresses, or
for multiple interfaces, which means we may need to distinguish based on
the endpoint address and pif as well.  We also want a unified hash table
to cover multiple protocols, not just TCP.

Replace the TCP specific hash function with one suitable for general flows,
or rather for one side of a general flow.  This includes all the
information from struct flowside, plus the L4 protocol number.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c       | 35 +++++++++++++++++++++++++++---
 flow.h       | 19 ++++++++++++++++
 flow_table.h |  2 ++
 tcp.c        | 61 ++++++++++------------------------------------------
 4 files changed, 64 insertions(+), 53 deletions(-)
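
As an illustration of the "everything from struct flowside, plus the L4
protocol number" idea, the sketch below packs the non-address fields into a
single 64-bit word of the kind that could be fed to a keyed hash as its final
block.  flow_tail_word() is a hypothetical helper, not part of this series,
and the field layout is only an assumption modelled on the patch.  Every field
is widened to 64 bits before shifting, so a port with its top bit set cannot
be shifted as a plain int.

```c
#include <stdint.h>

/* Hypothetical helper: pack protocol, pif and both ports into one word.
 * Layout (assumed): bits 47..40 proto, 39..32 pif, 31..16 fport, 15..0 eport.
 */
static uint64_t flow_tail_word(uint8_t proto, uint8_t pif,
			       uint16_t fport, uint16_t eport)
{
	return (uint64_t)proto << 40 | (uint64_t)pif << 32 |
	       (uint64_t)fport << 16 | (uint64_t)eport;
}
```

Two sides that differ only in protocol or pif then produce different words,
which is what lets a single table hold sides from several protocols and
interfaces.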

diff --git a/flow.c b/flow.c
index aff077b..fdd22b7 100644
--- a/flow.c
+++ b/flow.c
@@ -116,9 +116,9 @@ static struct timespec flow_timer_run;
  * @faddr:	Forwarding address (pointer to in_addr or in6_addr)
  * @fport:	Forwarding port
  */
-static void flowside_from_af(struct flowside *fside, sa_family_t af,
-			     const void *eaddr, in_port_t eport,
-			     const void *faddr, in_port_t fport)
+void flowside_from_af(struct flowside *fside, sa_family_t af,
+		      const void *eaddr, in_port_t eport,
+		      const void *faddr, in_port_t fport)
 {
 	if (faddr)
 		inany_from_af(&fside->faddr, af, faddr);
@@ -397,6 +397,35 @@ void flow_alloc_cancel(union flow *flow)
 	flow_new_entry = NULL;
 }
 
+/**
+ * flow_hash() - Calculate hash value for one side of a flow
+ * @c:		Execution context
+ * @proto:	Protocol of this flow (IP L4 protocol number)
+ * @pif:	pif of the side to hash
+ * @fside:	Flowside (must not have unspecified parts)
+ *
+ * Return: hash value
+ */
+uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif,
+		   const struct flowside *fside)
+{
+	struct siphash_state state = SIPHASH_INIT(c->hash_secret);
+
+	/* For the hash table to work, we need complete information in the
+	 * flowside.
+	 */
+	ASSERT(pif != PIF_NONE &&
+	       !inany_is_unspecified(&fside->faddr) && fside->fport != 0 &&
+	       !inany_is_unspecified(&fside->eaddr) && fside->eport != 0);
+
+	inany_siphash_feed(&state, &fside->faddr);
+	inany_siphash_feed(&state, &fside->eaddr);
+
+	return siphash_final(&state, 38, (uint64_t)proto << 40 |
+			     (uint64_t)pif << 32 |
+			     fside->fport << 16 | fside->eport);
+}
+
 /**
  * flow_defer_handler() - Handler for per-flow deferred and timed tasks
  * @c:		Execution context
diff --git a/flow.h b/flow.h
index 437579b..6d68e09 100644
--- a/flow.h
+++ b/flow.h
@@ -147,6 +147,25 @@ struct flowside {
 	in_port_t		eport;
 };
 
+/**
+ * flowside_eq() - Check if two flowsides are equal
+ * @left, @right:	Flowsides to compare
+ *
+ * Return: true if equal, false otherwise
+ */
+static inline bool flowside_eq(const struct flowside *left,
+			       const struct flowside *right)
+{
+	return inany_equals(&left->eaddr, &right->eaddr) &&
+	       left->eport == right->eport &&
+	       inany_equals(&left->faddr, &right->faddr) &&
+	       left->fport == right->fport;
+}
+
+void flowside_from_af(struct flowside *fside, sa_family_t af,
+		      const void *eaddr, in_port_t eport,
+		      const void *faddr, in_port_t fport);
+
 /**
  * struct flow_common - Common fields for packet flows
  * @side[]:	Information for each side of the flow
diff --git a/flow_table.h b/flow_table.h
index 91ade0a..0083c87 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -126,5 +126,7 @@ void flow_activate(struct flow_common *f);
 #define FLOW_ACTIVATE(flow_)			\
 	(flow_activate(&(flow_)->f))
 
+uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif,
+		   const struct flowside *fside);
 
 #endif /* FLOW_TABLE_H */
diff --git a/tcp.c b/tcp.c
index 9895c19..983a537 100644
--- a/tcp.c
+++ b/tcp.c
@@ -523,7 +523,7 @@ static struct iovec	tcp_iov			[UIO_MAXIOV];
 
 #define CONN(idx)		(&(FLOW(idx)->tcp))
 
-/* Table for lookup from remote address, local port, remote port */
+/* Table for lookup from flowside information */
 static flow_sidx_t tc_hash[TCP_HASH_TABLE_SIZE];
 
 static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX,
@@ -1073,46 +1073,6 @@ static int tcp_opt_get(const char *opts, size_t len, uint8_t type_find,
 	return -1;
 }
 
-/**
- * tcp_hash_match() - Check if a connection entry matches address and ports
- * @conn:	Connection entry to match against
- * @faddr:	Guest side forwarding address
- * @eport:	Guest side endpoint port
- * @fport:	Guest side forwarding port
- *
- * Return: 1 on match, 0 otherwise
- */
-static int tcp_hash_match(const struct tcp_tap_conn *conn,
-			  const union inany_addr *faddr,
-			  in_port_t eport, in_port_t fport)
-{
-	const struct flowside *tapside = TAPFLOW(conn);
-
-	if (inany_equals(&tapside->faddr, faddr) &&
-	    tapside->eport == eport && tapside->fport == fport)
-		return 1;
-
-	return 0;
-}
-
-/**
- * tcp_hash() - Calculate hash value for connection given address and ports
- * @c:		Execution context
- * @faddr:	Guest side forwarding address
- * @eport:	Guest side endpoint port
- * @fport:	Guest side forwarding port
- *
- * Return: hash value, needs to be adjusted for table size
- */
-static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr,
-			 in_port_t eport, in_port_t fport)
-{
-	struct siphash_state state = SIPHASH_INIT(c->hash_secret);
-
-	inany_siphash_feed(&state, faddr);
-	return siphash_final(&state, 20, (uint64_t)eport << 16 | fport);
-}
-
 /**
  * tcp_conn_hash() - Calculate hash bucket of an existing connection
  * @c:		Execution context
@@ -1125,8 +1085,7 @@ static uint64_t tcp_conn_hash(const struct ctx *c,
 {
 	const struct flowside *tapside = TAPFLOW(conn);
 
-	return tcp_hash(c, &tapside->faddr, tapside->eport,
-			tapside->fport);
+	return flow_hash(c, IPPROTO_TCP, conn->f.pif[TAPSIDE(conn)], tapside);
 }
 
 /**
@@ -1201,25 +1160,26 @@ static void tcp_hash_remove(const struct ctx *c,
  * tcp_hash_lookup() - Look up connection given remote address and ports
  * @c:		Execution context
  * @af:		Address family, AF_INET or AF_INET6
+ * @eaddr:	Guest side endpoint address (guest local address)
  * @faddr:	Guest side forwarding address (guest remote address)
  * @eport:	Guest side endpoint port (guest local port)
  * @fport:	Guest side forwarding port (guest remote port)
  *
  * Return: connection pointer, if found, -ENOENT otherwise
  */
-static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c,
-					    sa_family_t af, const void *faddr,
+static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, sa_family_t af,
+					    const void *eaddr, const void *faddr,
 					    in_port_t eport, in_port_t fport)
 {
-	union inany_addr aany;
+	struct flowside fside;
 	union flow *flow;
 	unsigned b;
 
-	inany_from_af(&aany, af, faddr);
+	flowside_from_af(&fside, af, eaddr, eport, faddr, fport);
 
-	b = tcp_hash(c, &aany, eport, fport) % TCP_HASH_TABLE_SIZE;
+	b = flow_hash(c, IPPROTO_TCP, PIF_TAP, &fside) % TCP_HASH_TABLE_SIZE;
 	while ((flow = flow_at_sidx(tc_hash[b])) &&
-	       !tcp_hash_match(&flow->tcp, &aany, eport, fport))
+	       !flowside_eq(&flow->f.side[TAPSIDE(flow)], &fside))
 		b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE);
 
 	return &flow->tcp;
@@ -2523,7 +2483,8 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
 	optlen = MIN(optlen, ((1UL << 4) /* from doff width */ - 6) * 4UL);
 	opts = packet_get(p, idx, sizeof(*th), optlen, NULL);
 
-	conn = tcp_hash_lookup(c, af, daddr, ntohs(th->source), ntohs(th->dest));
+	conn = tcp_hash_lookup(c, af, saddr, daddr,
+			       ntohs(th->source), ntohs(th->dest));
 
 	/* New connection from tap */
 	if (!conn) {
-- 
2.45.0



* [PATCH v5 13/19] flow, tcp: Generalise TCP hash table to general flow hash table
  2024-05-14  1:03 [PATCH v5 00/19] RFC: Unified flow table David Gibson
                   ` (11 preceding siblings ...)
  2024-05-14  1:03 ` [PATCH v5 12/19] tcp, flow: Replace TCP specific hash function with general flow hash David Gibson
@ 2024-05-14  1:03 ` David Gibson
  2024-05-14  1:03 ` [PATCH v5 14/19] tcp: Re-use flow hash for initial sequence number generation David Gibson
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2024-05-14  1:03 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

Move the data structures and helper functions for the TCP hash table to
flow.c, making it a general hash table indexing sides of flows.  This is
largely code motion and straightforward renames.  There are two semantic
changes:

 * flow_lookup_af() now needs to verify that the entry has a matching
   protocol as well as matching addresses, ports and interface

 * We double the size of the hash table, because it's now at least
   theoretically possible for both sides of each flow to be hashed.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c       | 146 ++++++++++++++++++++++++++++++++++++++++++++++++-
 flow.h       |   7 +++
 flow_table.h |   3 --
 tcp.c        | 149 +++++----------------------------------------------
 4 files changed, 165 insertions(+), 140 deletions(-)
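
The table being moved here uses open addressing with linear probing, and
removal re-packs the rest of the cluster instead of leaving tombstones.  The
toy version below may help when reviewing the probe and remove logic.  It is a
sketch under stated assumptions, not the passt code: it stores non-negative
ints rather than flow_sidx_t, uses the key modulo the table size as its
"hash", and all names (tab, home(), probe(), tab_insert(), tab_remove(),
tab_contains()) are made up.  mod_sub() and mod_between() are local
re-implementations with the semantics the patch appears to rely on.  As with
the real table, the probe loop only terminates if at least one bucket is
always free.

```c
#include <stdbool.h>

#define TAB_SIZE	8
#define EMPTY		(-1)

static int tab[TAB_SIZE];

static void tab_init(void)
{
	for (unsigned i = 0; i < TAB_SIZE; i++)
		tab[i] = EMPTY;
}

/* (a - b) modulo m, for a, b < m */
static unsigned mod_sub(unsigned a, unsigned b, unsigned m)
{
	return (a + m - b) % m;
}

/* True if x is in the cyclic range [i, j) modulo m */
static bool mod_between(unsigned x, unsigned i, unsigned j, unsigned m)
{
	return mod_sub(x, i, m) < mod_sub(j, i, m);
}

/* Toy hash: the home bucket of a key */
static unsigned home(int key)
{
	return (unsigned)key % TAB_SIZE;
}

/* Bucket holding @key if present, otherwise the free bucket it would use.
 * Probing walks downwards from the home bucket.
 */
static unsigned probe(int key)
{
	unsigned b = home(key);

	while (tab[b] != EMPTY && tab[b] != key)
		b = mod_sub(b, 1, TAB_SIZE);

	return b;
}

static void tab_insert(int key)
{
	tab[probe(key)] = key;
}

static bool tab_contains(int key)
{
	return tab[probe(key)] == key;
}

static void tab_remove(int key)
{
	unsigned b = probe(key), s;

	if (tab[b] == EMPTY)
		return;	/* Not present */

	/* Scan the remainder of the cluster: any entry whose probe path
	 * passes through the hole at b must move up into it, otherwise a
	 * later lookup would stop early at the hole.
	 */
	for (s = mod_sub(b, 1, TAB_SIZE); tab[s] != EMPTY;
	     s = mod_sub(s, 1, TAB_SIZE)) {
		unsigned h = home(tab[s]);

		if (!mod_between(h, s, b, TAB_SIZE)) {
			tab[b] = tab[s];
			b = s;
		}
	}

	tab[b] = EMPTY;
}
```

Note that lookup, insert and remove must all walk the cluster in the same
direction (downwards here); a lookup that stepped the other way would miss
entries displaced from their home bucket.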

diff --git a/flow.c b/flow.c
index fdd22b7..30a6904 100644
--- a/flow.c
+++ b/flow.c
@@ -108,6 +108,16 @@ static const union flow *flow_new_entry; /* = NULL */
 /* Last time the flow timers ran */
 static struct timespec flow_timer_run;
 
+/* Hash table to index the flow table */
+#define FLOW_HASH_LOAD		70		/* % */
+#define FLOW_HASH_SIZE		((2 * FLOW_MAX * 100 / FLOW_HASH_LOAD))
+
+/* Table for lookup from flowside information */
+static flow_sidx_t flow_hashtab[FLOW_HASH_SIZE];
+
+static_assert(ARRAY_SIZE(flow_hashtab) >= 2 * FLOW_MAX,
+"Safe linear probing requires hash table with more entries than the number of sides in the flow table");
+
 /** flowside_from_af() - Initialise flowside from addresses
  * @fside:	flowside to initialise
  * @af:		Address family (AF_INET or AF_INET6)
@@ -406,8 +416,8 @@ void flow_alloc_cancel(union flow *flow)
  *
  * Return: hash value
  */
-uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif,
-		   const struct flowside *fside)
+static uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif,
+			  const struct flowside *fside)
 {
 	struct siphash_state state = SIPHASH_INIT(c->hash_secret);
 
@@ -426,6 +436,133 @@ uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif,
 			     fside->fport << 16 | fside->eport);
 }
 
+/**
+ * flow_sidx_hash() - Calculate hash value for given side of a given flow
+ * @c:		Execution context
+ * @sidx:	Flow & side index to get hash for
+ *
+ * Return: hash value of the flow & side represented by @sidx
+ */
+static uint64_t flow_sidx_hash(const struct ctx *c, flow_sidx_t sidx)
+{
+	const struct flow_common *f = &flow_at_sidx(sidx)->f;
+	return flow_hash(c, FLOW_PROTO(f),
+			 f->pif[sidx.side], &f->side[sidx.side]);
+}
+
+/**
+ * flow_hash_probe() - Find hash bucket for a flow
+ * @c:		Execution context
+ * @sidx:	Flow and side to find bucket for
+ *
+ * Return: If @sidx is in the hash table, its current bucket, otherwise a
+ *         suitable free bucket for it.
+ */
+static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx)
+{
+	unsigned b = flow_sidx_hash(c, sidx) % FLOW_HASH_SIZE;
+
+	/* Linear probing */
+	while (!flow_sidx_eq(flow_hashtab[b], FLOW_SIDX_NONE) &&
+	       !flow_sidx_eq(flow_hashtab[b], sidx))
+		b = mod_sub(b, 1, FLOW_HASH_SIZE);
+
+	return b;
+}
+
+/**
+ * flow_hash_insert() - Insert side of a flow into the hash table
+ * @c:		Execution context
+ * @sidx:	Flow & side index
+ */
+void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx)
+{
+	unsigned b = flow_hash_probe(c, sidx);
+
+	flow_hashtab[b] = sidx;
+	flow_dbg(flow_at_sidx(sidx), "hash table insert: bucket: %u", b);
+}
+
+/**
+ * flow_hash_remove() - Drop side of a flow from the hash table
+ * @c:		Execution context
+ * @sidx:	Side of flow to remove
+ */
+void flow_hash_remove(const struct ctx *c, flow_sidx_t sidx)
+{
+	unsigned b = flow_hash_probe(c, sidx), s;
+
+	if (flow_sidx_eq(flow_hashtab[b], FLOW_SIDX_NONE))
+		return; /* Redundant remove */
+
+	flow_dbg(flow_at_sidx(sidx), "hash table remove: bucket: %u", b);
+
+	/* Scan the remainder of the cluster */
+	for (s = mod_sub(b, 1, FLOW_HASH_SIZE);
+	     !flow_sidx_eq(flow_hashtab[s], FLOW_SIDX_NONE);
+	     s = mod_sub(s, 1, FLOW_HASH_SIZE)) {
+		unsigned h = flow_sidx_hash(c, flow_hashtab[s]) % FLOW_HASH_SIZE;
+
+		if (!mod_between(h, s, b, FLOW_HASH_SIZE)) {
+			/* flow_hashtab[s] can live in flow_hashtab[b]'s slot */
+			debug("hash table remove: shuffle %u -> %u", s, b);
+			flow_hashtab[b] = flow_hashtab[s];
+			b = s;
+		}
+	}
+
+	flow_hashtab[b] = FLOW_SIDX_NONE;
+}
+
+/**
+ * flowside_lookup() - Look for a matching flowside in the flow table
+ * @c:		Execution context
+ * @proto:	Protocol of the flow (IP L4 protocol number)
+ * @pif:	pif to look for in the table
+ * @fside:	Flowside to look for in the table
+ *
+ * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found
+ */
+static flow_sidx_t flowside_lookup(const struct ctx *c, uint8_t proto,
+				   uint8_t pif, const struct flowside *fside)
+{
+	union flow *flow;
+	unsigned b;
+
+	b = flow_hash(c, proto, pif, fside) % FLOW_HASH_SIZE;
+	while ((flow = flow_at_sidx(flow_hashtab[b])) &&
+	       !(FLOW_PROTO(&flow->f) == proto &&
+		 flow->f.pif[flow_hashtab[b].side] == pif &&
+		 flowside_eq(&flow->f.side[flow_hashtab[b].side], fside)))
+		b = mod_sub(b, 1, FLOW_HASH_SIZE);
+
+	return flow_hashtab[b];
+}
+
+/**
+ * flow_lookup_af() - Look up a flow given addressing information
+ * @c:		Execution context
+ * @proto:	Protocol of the flow (IP L4 protocol number)
+ * @pif:	Interface of the flow
+ * @af:		Address family, AF_INET or AF_INET6
+ * @eaddr:	Guest side endpoint address (guest local address)
+ * @faddr:	Guest side forwarding address (guest remote address)
+ * @eport:	Guest side endpoint port (guest local port)
+ * @fport:	Guest side forwarding port (guest remote port)
+ *
+ * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found
+ */
+flow_sidx_t flow_lookup_af(const struct ctx *c,
+			   uint8_t proto, uint8_t pif, sa_family_t af,
+			   const void *eaddr, const void *faddr,
+			   in_port_t eport, in_port_t fport)
+{
+	struct flowside fside;
+
+	flowside_from_af(&fside, af, eaddr, eport, faddr, fport);
+	return flowside_lookup(c, proto, pif, &fside);
+}
+
 /**
  * flow_defer_handler() - Handler for per-flow deferred and timed tasks
  * @c:		Execution context
@@ -535,7 +672,12 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
  */
 void flow_init(void)
 {
+	unsigned b;
+
 	/* Initial state is a single free cluster containing the whole table */
 	flowtab[0].free.n = FLOW_MAX;
 	flowtab[0].free.next = FLOW_MAX;
+
+	for (b = 0; b < FLOW_HASH_SIZE; b++)
+		flow_hashtab[b] = FLOW_SIDX_NONE;
 }
diff --git a/flow.h b/flow.h
index 6d68e09..0ba00da 100644
--- a/flow.h
+++ b/flow.h
@@ -211,6 +211,13 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
 	return (a.flow == b.flow) && (a.side == b.side);
 }
 
+void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx);
+void flow_hash_remove(const struct ctx *c, flow_sidx_t sidx);
+flow_sidx_t flow_lookup_af(const struct ctx *c,
+			   uint8_t proto, uint8_t pif, sa_family_t af,
+			   const void *eaddr, const void *faddr,
+			   in_port_t eport, in_port_t fport);
+
 union flow;
 
 void flow_init(void);
diff --git a/flow_table.h b/flow_table.h
index 0083c87..d17ffba 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -126,7 +126,4 @@ void flow_activate(struct flow_common *f);
 #define FLOW_ACTIVATE(flow_)			\
 	(flow_activate(&(flow_)->f))
 
-uint64_t flow_hash(const struct ctx *c, uint8_t proto, uint8_t pif,
-		   const struct flowside *fside);
-
 #endif /* FLOW_TABLE_H */
diff --git a/tcp.c b/tcp.c
index 983a537..8ab8c4d 100644
--- a/tcp.c
+++ b/tcp.c
@@ -307,9 +307,6 @@
 #define TCP_FRAMES							\
 	(c->mode == MODE_PASST ? TCP_FRAMES_MEM : 1)
 
-#define TCP_HASH_TABLE_LOAD		70		/* % */
-#define TCP_HASH_TABLE_SIZE		(FLOW_MAX * 100 / TCP_HASH_TABLE_LOAD)
-
 #define MAX_WS				8
 #define MAX_WINDOW			(1 << (16 + (MAX_WS)))
 
@@ -370,6 +367,7 @@
 
 #define TAPSIDE(conn_)	((conn_)->f.pif[1] == PIF_TAP)
 #define TAPFLOW(conn_)	(&((conn_)->f.side[TAPSIDE(conn_)]))
+#define TAP_SIDX(conn_)	(FLOW_SIDX((conn_), TAPSIDE(conn_)))
 
 #define CONN_V4(conn)		(!!inany_v4(&TAPFLOW(conn)->faddr))
 #define CONN_V6(conn)		(!CONN_V4(conn))
@@ -523,12 +521,6 @@ static struct iovec	tcp_iov			[UIO_MAXIOV];
 
 #define CONN(idx)		(&(FLOW(idx)->tcp))
 
-/* Table for lookup from flowside information */
-static flow_sidx_t tc_hash[TCP_HASH_TABLE_SIZE];
-
-static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX,
-	"Safe linear probing requires hash table larger than connection table");
-
 /* Pools for pre-opened sockets (in init) */
 int init_sock_pool4		[TCP_SOCK_POOL_SIZE];
 int init_sock_pool6		[TCP_SOCK_POOL_SIZE];
@@ -722,9 +714,6 @@ static void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
 		tcp_timer_ctl(c, conn);
 }
 
-static void tcp_hash_remove(const struct ctx *c,
-			    const struct tcp_tap_conn *conn);
-
 /**
  * conn_event_do() - Set and log connection events, update epoll state
  * @c:		Execution context
@@ -770,7 +759,7 @@ static void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
 			 num == -1 	       ? "CLOSED" : tcp_event_str[num]);
 
 	if (event == CLOSED)
-		tcp_hash_remove(c, conn);
+		flow_hash_remove(c, TAP_SIDX(conn));
 	else if ((event == TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_RCVD))
 		conn_flag(c, conn, ACTIVE_CLOSE);
 	else
@@ -1073,118 +1062,6 @@ static int tcp_opt_get(const char *opts, size_t len, uint8_t type_find,
 	return -1;
 }
 
-/**
- * tcp_conn_hash() - Calculate hash bucket of an existing connection
- * @c:		Execution context
- * @conn:	Connection
- *
- * Return: hash value, needs to be adjusted for table size
- */
-static uint64_t tcp_conn_hash(const struct ctx *c,
-			      const struct tcp_tap_conn *conn)
-{
-	const struct flowside *tapside = TAPFLOW(conn);
-
-	return flow_hash(c, IPPROTO_TCP, conn->f.pif[TAPSIDE(conn)], tapside);
-}
-
-/**
- * tcp_hash_probe() - Find hash bucket for a connection
- * @c:		Execution context
- * @conn:	Connection to find bucket for
- *
- * Return: If @conn is in the table, its current bucket, otherwise a suitable
- *         free bucket for it.
- */
-static inline unsigned tcp_hash_probe(const struct ctx *c,
-				      const struct tcp_tap_conn *conn)
-{
-	unsigned b = tcp_conn_hash(c, conn) % TCP_HASH_TABLE_SIZE;
-	flow_sidx_t sidx = FLOW_SIDX(conn, TAPSIDE(conn));
-
-	/* Linear probing */
-	while (!flow_sidx_eq(tc_hash[b], FLOW_SIDX_NONE) &&
-	       !flow_sidx_eq(tc_hash[b], sidx))
-		b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE);
-
-	return b;
-}
-
-/**
- * tcp_hash_insert() - Insert connection into hash table, chain link
- * @c:		Execution context
- * @conn:	Connection pointer
- */
-static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn)
-{
-	unsigned b = tcp_hash_probe(c, conn);
-
-	tc_hash[b] = FLOW_SIDX(conn, TAPSIDE(conn));
-	flow_dbg(conn, "hash table insert: sock %i, bucket: %u", conn->sock, b);
-}
-
-/**
- * tcp_hash_remove() - Drop connection from hash table, chain unlink
- * @c:		Execution context
- * @conn:	Connection pointer
- */
-static void tcp_hash_remove(const struct ctx *c,
-			    const struct tcp_tap_conn *conn)
-{
-	unsigned b = tcp_hash_probe(c, conn), s;
-	union flow *flow = flow_at_sidx(tc_hash[b]);
-
-	if (!flow)
-		return; /* Redundant remove */
-
-	flow_dbg(conn, "hash table remove: sock %i, bucket: %u", conn->sock, b);
-
-	/* Scan the remainder of the cluster */
-	for (s = mod_sub(b, 1, TCP_HASH_TABLE_SIZE);
-	     (flow = flow_at_sidx(tc_hash[s]));
-	     s = mod_sub(s, 1, TCP_HASH_TABLE_SIZE)) {
-		unsigned h = tcp_conn_hash(c, &flow->tcp) % TCP_HASH_TABLE_SIZE;
-
-		if (!mod_between(h, s, b, TCP_HASH_TABLE_SIZE)) {
-			/* tc_hash[s] can live in tc_hash[b]'s slot */
-			debug("hash table remove: shuffle %u -> %u", s, b);
-			tc_hash[b] = tc_hash[s];
-			b = s;
-		}
-	}
-
-	tc_hash[b] = FLOW_SIDX_NONE;
-}
-
-/**
- * tcp_hash_lookup() - Look up connection given remote address and ports
- * @c:		Execution context
- * @af:		Address family, AF_INET or AF_INET6
- * @eaddr:	Guest side endpoint address (guest local address)
- * @faddr:	Guest side forwarding address (guest remote address)
- * @eport:	Guest side endpoint port (guest local port)
- * @fport:	Guest side forwarding port (guest remote port)
- *
- * Return: connection pointer, if found, -ENOENT otherwise
- */
-static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, sa_family_t af,
-					    const void *eaddr, const void *faddr,
-					    in_port_t eport, in_port_t fport)
-{
-	struct flowside fside;
-	union flow *flow;
-	unsigned b;
-
-	flowside_from_af(&fside, af, eaddr, eport, faddr, fport);
-
-	b = flow_hash(c, IPPROTO_TCP, PIF_TAP, &fside) % TCP_HASH_TABLE_SIZE;
-	while ((flow = flow_at_sidx(tc_hash[b])) &&
-	       !flowside_eq(&flow->f.side[TAPSIDE(flow)], &fside))
-		b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE);
-
-	return &flow->tcp;
-}
-
 /**
  * tcp_flow_defer() - Deferred per-flow handling (clean up closed connections)
  * @flow:	Flow table entry for this connection
@@ -1972,7 +1849,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 	tcp_seq_init(c, conn, now);
 	conn->seq_ack_from_tap = conn->seq_to_tap;
 
-	tcp_hash_insert(c, conn);
+	flow_hash_insert(c, TAP_SIDX(conn));
 
 	sockaddr_from_inany(&sa, &sl, &fwd->eaddr, fwd->eport, c->ifi6);
 
@@ -2468,6 +2345,8 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
 	const struct tcphdr *th;
 	size_t optlen, len;
 	const char *opts;
+	union flow *flow;
+	flow_sidx_t sidx;
 	int ack_due = 0;
 	int count;
 
@@ -2483,17 +2362,22 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
 	optlen = MIN(optlen, ((1UL << 4) /* from doff width */ - 6) * 4UL);
 	opts = packet_get(p, idx, sizeof(*th), optlen, NULL);
 
-	conn = tcp_hash_lookup(c, af, saddr, daddr,
-			       ntohs(th->source), ntohs(th->dest));
+	sidx = flow_lookup_af(c, IPPROTO_TCP, PIF_TAP, af, saddr, daddr,
+			      ntohs(th->source), ntohs(th->dest));
+	flow = flow_at_sidx(sidx);
 
 	/* New connection from tap */
-	if (!conn) {
+	if (!flow) {
 		if (opts && th->syn && !th->ack)
 			tcp_conn_from_tap(c, af, saddr, daddr, th,
 					  opts, optlen, now);
 		return 1;
 	}
 
+	ASSERT(flow->f.type == FLOW_TCP);
+	ASSERT(flow->f.pif[sidx.side] == PIF_TAP);
+	conn = &flow->tcp;
+
 	flow_trace(conn, "packet length %zu from tap", len);
 
 	if (th->rst) {
@@ -2676,7 +2560,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
 	conn_event(c, conn, SOCK_ACCEPTED);
 
 	tcp_seq_init(c, conn, now);
-	tcp_hash_insert(c, conn);
+	flow_hash_insert(c, TAP_SIDX(conn));
 
 	conn->seq_ack_from_tap = conn->seq_to_tap;
 
@@ -3065,11 +2949,6 @@ static void tcp_sock_refill_init(const struct ctx *c)
  */
 int tcp_init(struct ctx *c)
 {
-	unsigned b;
-
-	for (b = 0; b < TCP_HASH_TABLE_SIZE; b++)
-		tc_hash[b] = FLOW_SIDX_NONE;
-
 	if (c->ifi4)
 		tcp_sock4_iov_init(c);
 
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 14/19] tcp: Re-use flow hash for initial sequence number generation
  2024-05-14  1:03 [PATCH v5 00/19] RFC: Unified flow table David Gibson
                   ` (12 preceding siblings ...)
  2024-05-14  1:03 ` [PATCH v5 13/19] flow, tcp: Generalise TCP hash table to general flow hash table David Gibson
@ 2024-05-14  1:03 ` David Gibson
  2024-05-14  1:03 ` [PATCH v5 15/19] icmp: Use flowsides as the source of truth wherever possible David Gibson
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2024-05-14  1:03 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

We generate TCP initial sequence numbers, when we need them, from a
hash of the source and destination addresses and ports, plus a
timestamp.  Moments later, we generate another hash of the same
information plus some more to insert the connection into the flow hash
table.

By making flow_hash_insert() return the raw hash it computes, and
inserting into the hash table before generating the sequence number,
we can re-use the hash table hash for the initial sequence number
rather than calculating another one.  It won't generate identical
results to the old code, but that doesn't matter as long as the
sequence numbers are well scattered.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
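For illustration only (not part of the patch): a minimal standalone
sketch of the idea, assuming a toy table of plain integer ids in place
of flow_sidx_t and a caller-supplied hash in place of the real
flow_hash().  TABLE_SIZE, SLOT_NONE, table_init(), probe_(), insert()
and init_seq() are hypothetical names, not passt's API.

```c
#include <stdint.h>
#include <time.h>

#define TABLE_SIZE	8u		/* hypothetical, tiny for illustration */
#define SLOT_NONE	UINT32_MAX	/* hypothetical "empty bucket" marker */

static uint32_t table[TABLE_SIZE];

/* Mark every bucket as free */
static void table_init(void)
{
	unsigned b;

	for (b = 0; b < TABLE_SIZE; b++)
		table[b] = SLOT_NONE;
}

/* Probe downwards from hash % TABLE_SIZE until @id or a free bucket */
static unsigned probe_(uint64_t hash, uint32_t id)
{
	unsigned b = hash % TABLE_SIZE;

	while (table[b] != SLOT_NONE && table[b] != id)
		b = (b + TABLE_SIZE - 1) % TABLE_SIZE;

	return b;
}

/* Insert @id, and hand the raw (un-modded) hash back to the caller */
static uint64_t insert(uint64_t hash, uint32_t id)
{
	table[probe_(hash, id)] = id;
	return hash;
}

/* Fold the 64-bit hash to 32 bits, then add a counter of 32ns ticks */
static uint32_t init_seq(uint64_t hash, const struct timespec *now)
{
	uint64_t t = (uint64_t)now->tv_sec * 1000000000ull +
		     (uint64_t)now->tv_nsec;
	uint32_t ns = (uint32_t)(t >> 5);

	return ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns;
}
```

A caller would do something like "seq = init_seq(insert(hash, id), &now)",
so the hash is computed once and used for both the bucket and the
sequence number.
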
 flow.c | 30 ++++++++++++++++++++++++------
 flow.h |  2 +-
 tcp.c  | 33 +++++++++++----------------------
 3 files changed, 36 insertions(+), 29 deletions(-)

diff --git a/flow.c b/flow.c
index 30a6904..4942075 100644
--- a/flow.c
+++ b/flow.c
@@ -451,16 +451,16 @@ static uint64_t flow_sidx_hash(const struct ctx *c, flow_sidx_t sidx)
 }
 
 /**
- * flow_hash_probe() - Find hash bucket for a flow
- * @c:		Execution context
+ * flow_hash_probe_() - Find hash bucket for a flow, given hash
+ * @hash:	Raw hash value for flow & side
  * @sidx:	Flow and side to find bucket for
  *
  * Return: If @sidx is in the hash table, its current bucket, otherwise a
  *         suitable free bucket for it.
  */
-static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx)
+static inline unsigned flow_hash_probe_(uint64_t hash, flow_sidx_t sidx)
 {
-	unsigned b = flow_sidx_hash(c, sidx) % FLOW_HASH_SIZE;
+	unsigned b = hash % FLOW_HASH_SIZE;
 
 	/* Linear probing */
 	while (!flow_sidx_eq(flow_hashtab[b], FLOW_SIDX_NONE) &&
@@ -470,17 +470,35 @@ static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx)
 	return b;
 }
 
+/**
+ * flow_hash_probe() - Find hash bucket for a flow
+ * @c:		Execution context
+ * @sidx:	Flow and side to find bucket for
+ *
+ * Return: If @sidx is in the hash table, its current bucket, otherwise a
+ *         suitable free bucket for it.
+ */
+static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx)
+{
+	return flow_hash_probe_(flow_sidx_hash(c, sidx), sidx);
+}
+
 /**
  * flow_hash_insert() - Insert side of a flow into hash table
  * @c:		Execution context
  * @sidx:	Flow & side index
+ *
+ * Return: raw (un-modded) hash value of side of flow
  */
-void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx)
+uint64_t flow_hash_insert(const struct ctx *c, flow_sidx_t sidx)
 {
-	unsigned b = flow_hash_probe(c, sidx);
+	uint64_t hash = flow_sidx_hash(c, sidx);
+	unsigned b = flow_hash_probe_(hash, sidx);
 
 	flow_hashtab[b] = sidx;
 	flow_dbg(flow_at_sidx(sidx), "hash table insert: bucket: %u", b);
+
+	return hash;
 }
 
 /**
diff --git a/flow.h b/flow.h
index 0ba00da..612459e 100644
--- a/flow.h
+++ b/flow.h
@@ -211,7 +211,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
 	return (a.flow == b.flow) && (a.side == b.side);
 }
 
-void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx);
+uint64_t flow_hash_insert(const struct ctx *c, flow_sidx_t sidx);
 void flow_hash_remove(const struct ctx *c, flow_sidx_t sidx);
 flow_sidx_t flow_lookup_af(const struct ctx *c,
 			   uint8_t proto, uint8_t pif, sa_family_t af,
diff --git a/tcp.c b/tcp.c
index 8ab8c4d..91b8a46 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1575,28 +1575,16 @@ static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
 }
 
 /**
- * tcp_seq_init() - Calculate initial sequence number according to RFC 6528
- * @c:		Execution context
- * @conn:	TCP connection, with faddr, fport and eport populated
+ * tcp_init_seq() - Calculate initial sequence number according to RFC 6528
+ * @hash:	Hash of connection details
  * @now:	Current timestamp
  */
-static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
-			 const struct timespec *now)
+static uint32_t tcp_init_seq(uint64_t hash, const struct timespec *now)
 {
-	struct siphash_state state = SIPHASH_INIT(c->hash_secret);
-	const struct flowside *tapside = TAPFLOW(conn);
-	uint64_t hash;
-	uint32_t ns;
-
-	inany_siphash_feed(&state, &tapside->faddr);
-	inany_siphash_feed(&state, &tapside->eaddr);
-	hash = siphash_final(&state, 36,
-			     (uint64_t)tapside->fport << 16 | tapside->eport);
-
 	/* 32ns ticks, overflows 32 bits every 137s */
-	ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
+	uint32_t ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
 
-	conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns;
+	return ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns;
 }
 
 /**
@@ -1775,6 +1763,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 	union sockaddr_inany sa;
 	union flow *flow;
 	int s = -1, mss;
+	uint64_t hash;
 	socklen_t sl;
 
 	if (!(flow = flow_alloc()))
@@ -1846,11 +1835,10 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 	conn->seq_from_tap = conn->seq_init_from_tap + 1;
 	conn->seq_ack_to_tap = conn->seq_from_tap;
 
-	tcp_seq_init(c, conn, now);
+	hash = flow_hash_insert(c, TAP_SIDX(conn));
+	conn->seq_to_tap = tcp_init_seq(hash, now);
 	conn->seq_ack_from_tap = conn->seq_to_tap;
 
-	flow_hash_insert(c, TAP_SIDX(conn));
-
 	sockaddr_from_inany(&sa, &sl, &fwd->eaddr, fwd->eport, c->ifi6);
 
 	if (!bind(s, &sa.sa, sl)) {
@@ -2536,6 +2524,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
 	union inany_addr saddr, daddr; /* FIXME: avoid bulky temporaries */
 	struct tcp_tap_conn *conn;
 	in_port_t srcport;
+	uint64_t hash;
 
 	inany_from_sockaddr(&saddr, &srcport, sa);
 	tcp_snat_inbound(c, &saddr);
@@ -2559,8 +2548,8 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
 	conn->ws_to_tap = conn->ws_from_tap = 0;
 	conn_event(c, conn, SOCK_ACCEPTED);
 
-	tcp_seq_init(c, conn, now);
-	flow_hash_insert(c, TAP_SIDX(conn));
+	hash = flow_hash_insert(c, TAP_SIDX(conn));
+	conn->seq_to_tap = tcp_init_seq(hash, now);
 
 	conn->seq_ack_from_tap = conn->seq_to_tap;
 
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 15/19] icmp: Use flowsides as the source of truth wherever possible
  2024-05-14  1:03 [PATCH v5 00/19] RFC: Unified flow table David Gibson
                   ` (13 preceding siblings ...)
  2024-05-14  1:03 ` [PATCH v5 14/19] tcp: Re-use flow hash for initial sequence number generation David Gibson
@ 2024-05-14  1:03 ` David Gibson
       [not found]   ` <20240516225350.06aebcd7@elisabeth>
  2024-05-14  1:03 ` [PATCH v5 16/19] icmp: Look up ping flows using flow hash David Gibson
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2024-05-14  1:03 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

icmp_sock_handler() obtains the guest address from its most recently
observed IP, and the ICMP id from the epoll reference.  Both of these
can be obtained readily from the flow.

icmp_tap_handler() builds its socket address for sendto() directly
from the destination address supplied by the incoming tap packet.
This can instead be generated from the flow.

struct icmp_ping_flow contains a field for the ICMP id of the ping, but
this is now redundant, since the id is also stored as the "port" in the
common flowsides.

Using the flowsides as the common source of truth here prepares us for
allowing more flexible NAT and forwarding by properly initialising
that flowside information.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
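For illustration only (not part of the patch): a minimal sketch of why
a separate id field becomes redundant once the ICMP id is stored as the
"port" of a flowside.  struct side, struct echo_hdr and
restore_guest_id() are simplified, hypothetical stand-ins, not passt's
real struct flowside or the kernel's ICMP header definitions.

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Hypothetical cut-down flowside: only the port fields matter here */
struct side {
	uint16_t eport;		/* endpoint "port": the guest's ICMP id */
	uint16_t fport;		/* forwarding "port" */
};

/* Hypothetical echo header layout, fields in network byte order */
struct echo_hdr {
	uint8_t type;
	uint8_t code;
	uint16_t checksum;
	uint16_t id;
	uint16_t sequence;
};

/* Rewrite the id in a reply to the guest-side id held in the flowside */
static void restore_guest_id(struct echo_hdr *ih, const struct side *ini)
{
	ih->id = htons(ini->eport);
}
```

The reply path then needs nothing but the flow itself to restore the
id the guest originally used.
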
 icmp.c      | 37 ++++++++++++++++++++++---------------
 icmp_flow.h |  1 -
 tap.c       | 11 -----------
 tap.h       |  1 -
 4 files changed, 22 insertions(+), 28 deletions(-)

diff --git a/icmp.c b/icmp.c
index 37a3586..1e9a05e 100644
--- a/icmp.c
+++ b/icmp.c
@@ -58,6 +58,7 @@ static struct icmp_ping_flow *icmp_id_map[IP_VERSIONS][ICMP_NUM_IDS];
 void icmp_sock_handler(const struct ctx *c, union epoll_ref ref)
 {
 	struct icmp_ping_flow *pingf = PINGF(ref.flowside.flow);
+	const struct flowside *ini = &pingf->f.side[INISIDE];
 	union sockaddr_inany sr;
 	socklen_t sl = sizeof(sr);
 	char buf[USHRT_MAX];
@@ -83,7 +84,7 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref)
 			goto unexpected;
 
 		/* Adjust packet back to guest-side ID */
-		ih4->un.echo.id = htons(pingf->id);
+		ih4->un.echo.id = htons(ini->eport);
 		seq = ntohs(ih4->un.echo.sequence);
 	} else if (pingf->f.type == FLOW_PING6) {
 		struct icmp6hdr *ih6 = (struct icmp6hdr *)buf;
@@ -93,7 +94,7 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref)
 			goto unexpected;
 
 		/* Adjust packet back to guest-side ID */
-		ih6->icmp6_identifier = htons(pingf->id);
+		ih6->icmp6_identifier = htons(ini->eport);
 		seq = ntohs(ih6->icmp6_sequence);
 	} else {
 		ASSERT(0);
@@ -108,13 +109,20 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref)
 	}
 
 	flow_dbg(pingf, "echo reply to tap, ID: %"PRIu16", seq: %"PRIu16,
-		 pingf->id, seq);
+		 ini->eport, seq);
 
-	if (pingf->f.type == FLOW_PING4)
-		tap_icmp4_send(c, sr.sa4.sin_addr, tap_ip4_daddr(c), buf, n);
-	else if (pingf->f.type == FLOW_PING6)
-		tap_icmp6_send(c, &sr.sa6.sin6_addr,
-			       tap_ip6_daddr(c, &sr.sa6.sin6_addr), buf, n);
+	if (pingf->f.type == FLOW_PING4) {
+		const struct in_addr *saddr = inany_v4(&ini->faddr);
+		const struct in_addr *daddr = inany_v4(&ini->eaddr);
+
+		ASSERT(saddr && daddr); /* Must have IPv4 addresses */
+		tap_icmp4_send(c, *saddr, *daddr, buf, n);
+	} else if (pingf->f.type == FLOW_PING6) {
+		const struct in6_addr *saddr = &ini->faddr.a6;
+		const struct in6_addr *daddr = &ini->eaddr.a6;
+
+		tap_icmp6_send(c, saddr, daddr, buf, n);
+	}
 	return;
 
 unexpected:
@@ -129,7 +137,7 @@ unexpected:
 static void icmp_ping_close(const struct ctx *c,
 			    const struct icmp_ping_flow *pingf)
 {
-	uint16_t id = pingf->id;
+	uint16_t id = pingf->f.side[INISIDE].eport;
 
 	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL);
 	close(pingf->sock);
@@ -172,7 +180,6 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 	pingf = FLOW_SET_TYPE(flow, flowtype, ping);
 
 	pingf->seq = -1;
-	pingf->id = id;
 
 	if (af == AF_INET) {
 		bind_addr = &c->ip4.addr_out;
@@ -225,11 +232,12 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		     const void *saddr, const void *daddr,
 		     const struct pool *p, const struct timespec *now)
 {
-	union sockaddr_inany sa = { .sa_family = af };
-	const socklen_t sl = af == AF_INET ? sizeof(sa.sa4) : sizeof(sa.sa6);
 	struct icmp_ping_flow *pingf, **id_sock;
+	const struct flowside *fwd;
+	union sockaddr_inany sa;
 	size_t dlen, l4len;
 	uint16_t id, seq;
+	socklen_t sl;
 	void *pkt;
 
 	(void)saddr;
@@ -250,7 +258,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		id = ntohs(ih->un.echo.id);
 		id_sock = &icmp_id_map[V4][id];
 		seq = ntohs(ih->un.echo.sequence);
-		sa.sa4.sin_addr = *(struct in_addr *)daddr;
 	} else if (af == AF_INET6) {
 		const struct icmp6hdr *ih;
 
@@ -266,8 +273,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		id = ntohs(ih->icmp6_identifier);
 		id_sock = &icmp_id_map[V6][id];
 		seq = ntohs(ih->icmp6_sequence);
-		sa.sa6.sin6_addr = *(struct in6_addr *)daddr;
-		sa.sa6.sin6_scope_id = c->ifi6;
 	} else {
 		ASSERT(0);
 	}
@@ -276,8 +281,10 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		if (!(pingf = icmp_ping_new(c, id_sock, af, id, saddr, daddr)))
 			return 1;
 
+	fwd = &pingf->f.side[FWDSIDE];
 	pingf->ts = now->tv_sec;
 
+	sockaddr_from_inany(&sa, &sl, &fwd->eaddr, 0, c->ifi6);
 	if (sendto(pingf->sock, pkt, l4len, MSG_NOSIGNAL, &sa.sa, sl) < 0) {
 		flow_dbg(pingf, "failed to relay request to socket: %s",
 			 strerror(errno));
diff --git a/icmp_flow.h b/icmp_flow.h
index 5a2eed9..f053211 100644
--- a/icmp_flow.h
+++ b/icmp_flow.h
@@ -22,7 +22,6 @@ struct icmp_ping_flow {
 	int seq;
 	int sock;
 	time_t ts;
-	uint16_t id;
 };
 
 bool icmp_ping_timer(const struct ctx *c, union flow *flow,
diff --git a/tap.c b/tap.c
index 91fd2e2..052f6f0 100644
--- a/tap.c
+++ b/tap.c
@@ -90,17 +90,6 @@ void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
 	tap_send_frames(c, iov, iovcnt, 1);
 }
 
-/**
- * tap_ip4_daddr() - Normal IPv4 destination address for inbound packets
- * @c:		Execution context
- *
- * Return: IPv4 address
- */
-struct in_addr tap_ip4_daddr(const struct ctx *c)
-{
-	return c->ip4.addr_seen;
-}
-
 /**
  * tap_ip6_daddr() - Normal IPv6 destination address for inbound packets
  * @c:		Execution context
diff --git a/tap.h b/tap.h
index d146d2f..a4981a6 100644
--- a/tap.h
+++ b/tap.h
@@ -43,7 +43,6 @@ static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
 	thdr->vnet_len = htonl(l2len);
 }
 
-struct in_addr tap_ip4_daddr(const struct ctx *c);
 void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
 		   struct in_addr dst, in_port_t dport,
 		   const void *in, size_t dlen);
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 16/19] icmp: Look up ping flows using flow hash
  2024-05-14  1:03 [PATCH v5 00/19] RFC: Unified flow table David Gibson
                   ` (14 preceding siblings ...)
  2024-05-14  1:03 ` [PATCH v5 15/19] icmp: Use flowsides as the source of truth wherever possible David Gibson
@ 2024-05-14  1:03 ` David Gibson
  2024-05-14  1:03 ` [PATCH v5 17/19] icmp: Eliminate icmp_id_map David Gibson
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2024-05-14  1:03 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

When we receive a ping packet from the tap interface, we currently locate
the correct flow entry (if present) using an ancillary data structure, the
icmp_id_map[] tables.  However, we can look this up using the flow hash
table - that's what it's for.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
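For readers unfamiliar with the idea, here is a minimal sketch of looking
a ping flow up by hash instead of by a per-ID array.  It is not the code
in flow.c: every sketch_* name, the table size, the IPv4-only key and the
linear probing scheme are assumptions made for illustration.  The only
point carried over from the patch is the shape of the handler logic: hash
the (protocol, pif, addresses, echo ID) tuple, and create a new flow only
when the lookup misses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SLOTS 64	/* hypothetical table size */

/* Hypothetical lookup key; real flowsides also cover IPv6 */
struct sketch_key {
	uint8_t proto;		/* e.g. 1 for ICMP, 58 for ICMPv6 */
	uint8_t pif;		/* interface the packet arrived on */
	uint32_t saddr, daddr;	/* IPv4 only, for brevity */
	uint16_t id;		/* echo ID, standing in for both ports */
};

struct sketch_flow {
	int used;
	struct sketch_key key;
	int sock;
};

static struct sketch_flow sketch_table[SKETCH_SLOTS];

/* Compare field by field, so struct padding never matters */
static int sketch_key_eq(const struct sketch_key *a,
			 const struct sketch_key *b)
{
	return a->proto == b->proto && a->pif == b->pif &&
	       a->saddr == b->saddr && a->daddr == b->daddr &&
	       a->id == b->id;
}

/* FNV-style mixing, one field at a time (an arbitrary choice) */
static unsigned sketch_hash(const struct sketch_key *k)
{
	uint32_t h = 2166136261u;

	h = (h ^ k->proto) * 16777619u;
	h = (h ^ k->pif) * 16777619u;
	h = (h ^ k->saddr) * 16777619u;
	h = (h ^ k->daddr) * 16777619u;
	h = (h ^ k->id) * 16777619u;
	return h % SKETCH_SLOTS;
}

/* Return the matching entry, or NULL if there is none */
struct sketch_flow *sketch_lookup(const struct sketch_key *k)
{
	unsigned i, b = sketch_hash(k);

	for (i = 0; i < SKETCH_SLOTS; i++) {
		struct sketch_flow *f = &sketch_table[(b + i) % SKETCH_SLOTS];

		if (!f->used)
			return NULL;	/* end of the probe chain */
		if (sketch_key_eq(&f->key, k))
			return f;
	}
	return NULL;
}

/* Insert into the first free slot; returns NULL if the table is full */
struct sketch_flow *sketch_insert(const struct sketch_key *k, int sock)
{
	unsigned i, b;

	assert(k);
	b = sketch_hash(k);
	for (i = 0; i < SKETCH_SLOTS; i++) {
		struct sketch_flow *f = &sketch_table[(b + i) % SKETCH_SLOTS];

		if (!f->used) {
			f->used = 1;
			f->key = *k;
			f->sock = sock;
			return f;
		}
	}
	return NULL;
}

/* Same shape as the handler: look up first, create only on a miss */
struct sketch_flow *sketch_find_or_new(const struct sketch_key *k, int sock)
{
	struct sketch_flow *f = sketch_lookup(k);

	return f ? f : sketch_insert(k, sock);
}
```

Removal is deliberately left out of the sketch, since deleting from a
linear-probing table needs extra care that would distract from the
lookup path.
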
 icmp.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/icmp.c b/icmp.c
index 1e9a05e..ef7362a 100644
--- a/icmp.c
+++ b/icmp.c
@@ -141,6 +141,7 @@ static void icmp_ping_close(const struct ctx *c,
 
 	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL);
 	close(pingf->sock);
+	flow_hash_remove(c, FLOW_SIDX(pingf, INISIDE));
 
 	if (pingf->f.type == FLOW_PING4)
 		icmp_id_map[V4][id] = NULL;
@@ -205,6 +206,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 
 	flow_dbg(pingf, "new socket %i for echo ID %"PRIu16, pingf->sock, id);
 
+	flow_hash_insert(c, FLOW_SIDX(pingf, INISIDE));
 	*id_sock = pingf;
 
 	FLOW_ACTIVATE(pingf);
@@ -237,6 +239,8 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 	union sockaddr_inany sa;
 	size_t dlen, l4len;
 	uint16_t id, seq;
+	union flow *flow;
+	uint8_t proto;
 	socklen_t sl;
 	void *pkt;
 
@@ -255,6 +259,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		if (ih->type != ICMP_ECHO)
 			return 1;
 
+		proto = IPPROTO_ICMP;
 		id = ntohs(ih->un.echo.id);
 		id_sock = &icmp_id_map[V4][id];
 		seq = ntohs(ih->un.echo.sequence);
@@ -270,6 +275,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		if (ih->icmp6_type != ICMPV6_ECHO_REQUEST)
 			return 1;
 
+		proto = IPPROTO_ICMPV6;
 		id = ntohs(ih->icmp6_identifier);
 		id_sock = &icmp_id_map[V6][id];
 		seq = ntohs(ih->icmp6_sequence);
@@ -277,11 +283,17 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		ASSERT(0);
 	}
 
-	if (!(pingf = *id_sock))
-		if (!(pingf = icmp_ping_new(c, id_sock, af, id, saddr, daddr)))
-			return 1;
+	flow = flow_at_sidx(flow_lookup_af(c, proto, PIF_TAP,
+					   af, saddr, daddr, id, id));
+
+	if (flow)
+		pingf = &flow->ping;
+	else if (!(pingf = icmp_ping_new(c, id_sock, af, id, saddr, daddr)))
+		return 1;
 
 	fwd = &pingf->f.side[FWDSIDE];
+
+	ASSERT(flow_proto[pingf->f.type] == proto);
 	pingf->ts = now->tv_sec;
 
 	sockaddr_from_inany(&sa, &sl, &fwd->eaddr, 0, c->ifi6);
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 17/19] icmp: Eliminate icmp_id_map
  2024-05-14  1:03 [PATCH v5 00/19] RFC: Unified flow table David Gibson
                   ` (15 preceding siblings ...)
  2024-05-14  1:03 ` [PATCH v5 16/19] icmp: Look up ping flows using flow hash David Gibson
@ 2024-05-14  1:03 ` David Gibson
  2024-05-14  1:03 ` [PATCH v5 18/19] flow, tcp: Flow based NAT and port forwarding for TCP David Gibson
  2024-05-14  1:03 ` [PATCH v5 19/19] flow, icmp: Use general flow forwarding rules for ICMP David Gibson
  18 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2024-05-14  1:03 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

With previous reworks the icmp_id_map data structure is now maintained, but
never used for anything.  Eliminate it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 icmp.c | 19 ++-----------------
 1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/icmp.c b/icmp.c
index ef7362a..0112fd9 100644
--- a/icmp.c
+++ b/icmp.c
@@ -47,9 +47,6 @@
 
 #define PINGF(idx)		(&(FLOW(idx)->ping))
 
-/* Indexed by ICMP echo identifier */
-static struct icmp_ping_flow *icmp_id_map[IP_VERSIONS][ICMP_NUM_IDS];
-
 /**
  * icmp_sock_handler() - Handle new data from ICMP or ICMPv6 socket
  * @c:		Execution context
@@ -137,22 +134,14 @@ unexpected:
 static void icmp_ping_close(const struct ctx *c,
 			    const struct icmp_ping_flow *pingf)
 {
-	uint16_t id = pingf->f.side[INISIDE].eport;
-
 	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL);
 	close(pingf->sock);
 	flow_hash_remove(c, FLOW_SIDX(pingf, INISIDE));
-
-	if (pingf->f.type == FLOW_PING4)
-		icmp_id_map[V4][id] = NULL;
-	else
-		icmp_id_map[V6][id] = NULL;
 }
 
 /**
  * icmp_ping_new() - Prepare a new ping socket for a new id
  * @c:		Execution context
- * @id_sock:	Pointer to ping flow entry slot in icmp_id_map[] to update
  * @af:		Address family, AF_INET or AF_INET6
  * @id:		ICMP id for the new socket
  * @saddr:	Source address
@@ -161,7 +150,6 @@ static void icmp_ping_close(const struct ctx *c,
  * Return: Newly opened ping flow, or NULL on failure
  */
 static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
-					    struct icmp_ping_flow **id_sock,
 					    sa_family_t af, uint16_t id,
 					    const void *saddr, const void *daddr)
 {
@@ -207,7 +195,6 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 	flow_dbg(pingf, "new socket %i for echo ID %"PRIu16, pingf->sock, id);
 
 	flow_hash_insert(c, FLOW_SIDX(pingf, INISIDE));
-	*id_sock = pingf;
 
 	FLOW_ACTIVATE(pingf);
 
@@ -234,8 +221,8 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		     const void *saddr, const void *daddr,
 		     const struct pool *p, const struct timespec *now)
 {
-	struct icmp_ping_flow *pingf, **id_sock;
 	const struct flowside *fwd;
+	struct icmp_ping_flow *pingf;
 	union sockaddr_inany sa;
 	size_t dlen, l4len;
 	uint16_t id, seq;
@@ -261,7 +248,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 
 		proto = IPPROTO_ICMP;
 		id = ntohs(ih->un.echo.id);
-		id_sock = &icmp_id_map[V4][id];
 		seq = ntohs(ih->un.echo.sequence);
 	} else if (af == AF_INET6) {
 		const struct icmp6hdr *ih;
@@ -277,7 +263,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 
 		proto = IPPROTO_ICMPV6;
 		id = ntohs(ih->icmp6_identifier);
-		id_sock = &icmp_id_map[V6][id];
 		seq = ntohs(ih->icmp6_sequence);
 	} else {
 		ASSERT(0);
@@ -288,7 +273,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 
 	if (flow)
 		pingf = &flow->ping;
-	else if (!(pingf = icmp_ping_new(c, id_sock, af, id, saddr, daddr)))
+	else if (!(pingf = icmp_ping_new(c, af, id, saddr, daddr)))
 		return 1;
 
 	fwd = &pingf->f.side[FWDSIDE];
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 18/19] flow, tcp: Flow based NAT and port forwarding for TCP
  2024-05-14  1:03 [PATCH v5 00/19] RFC: Unified flow table David Gibson
                   ` (16 preceding siblings ...)
  2024-05-14  1:03 ` [PATCH v5 17/19] icmp: Eliminate icmp_id_map David Gibson
@ 2024-05-14  1:03 ` David Gibson
       [not found]   ` <20240518001345.2d127b09@elisabeth>
  2024-05-14  1:03 ` [PATCH v5 19/19] flow, icmp: Use general flow forwarding rules for ICMP David Gibson
  18 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2024-05-14  1:03 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

Currently the code to translate host side addresses and ports to guest side
addresses and ports, and vice versa, is scattered across the TCP code.
This includes both port redirection as controlled by the -t and -T options,
and our special case NAT controlled by the --no-map-gw option.

Gather this logic into fwd_from_*() functions in fwd.c, one for each input
interface.  Each takes protocol and address information for the initiating
side and generates the pif and address information for the forwarded side,
performing any NAT or port forwarding needed.

We create a flow_forward() helper which applies those forwarding functions
as needed to automatically move a flow from INI to FWD state.  For now we
leave the older flow_forward_af() function taking explicit addresses as
a transitional tool.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
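As a reading aid, here is a rough sketch of the dispatch this patch
introduces: pick a forwarding function by initiating interface, let it
fill in the forwarded side, and report which interface the flow goes out
on.  All sk_* / SK_* names are hypothetical, addresses are IPv4 only in
host byte order, and the splice path is left out entirely, so treat this
as an outline of the idea rather than the logic in fwd.c.

```c
#include <assert.h>
#include <stdint.h>

#define SK_LOOPBACK	0x7f000001u	/* 127.0.0.1 */
#define SK_PROTO_TCP	6

enum sk_pif { SK_PIF_NONE, SK_PIF_HOST, SK_PIF_TAP, SK_PIF_SPLICE };

struct sk_side {
	uint32_t eaddr, faddr;	/* endpoint and forwarding addresses */
	uint16_t eport, fport;	/* endpoint and forwarding ports */
};

struct sk_ctx {
	uint32_t gw;			/* gateway address seen by guest */
	uint32_t guest_addr;		/* address the guest is using */
	int no_map_gw;			/* set: don't map gw to loopback */
	const uint16_t *tcp_delta_in;	/* 65536 inbound port deltas */
};

/* Guest-initiated: connect to where the guest asked, with gw mapping */
static enum sk_pif sk_from_tap(const struct sk_ctx *c,
			       const struct sk_side *a, struct sk_side *b)
{
	b->eaddr = a->faddr;
	b->eport = a->fport;
	if (!c->no_map_gw && b->eaddr == c->gw)
		b->eaddr = SK_LOOPBACK;
	return SK_PIF_HOST;
}

/* Host-initiated: hide loopback sources behind gw, apply port delta */
static enum sk_pif sk_from_host(const struct sk_ctx *c, uint8_t proto,
				const struct sk_side *a, struct sk_side *b)
{
	b->faddr = a->eaddr;
	b->fport = a->eport;
	if (b->faddr == SK_LOOPBACK)
		b->faddr = c->gw;

	b->eaddr = c->guest_addr;
	b->eport = a->fport;
	if (proto == SK_PROTO_TCP)	/* wraps modulo 2^16 on purpose */
		b->eport = (uint16_t)(b->eport + c->tcp_delta_in[b->eport]);
	return SK_PIF_TAP;
}

/* Returns the forwarded pif, or SK_PIF_NONE if no rule applies */
enum sk_pif sk_forward(const struct sk_ctx *c, enum sk_pif ini_pif,
		       uint8_t proto,
		       const struct sk_side *a, struct sk_side *b)
{
	switch (ini_pif) {
	case SK_PIF_TAP:
		return sk_from_tap(c, a, b);
	case SK_PIF_HOST:
		return sk_from_host(c, proto, a, b);
	default:
		return SK_PIF_NONE;
	}
}
```

Callers then only need to check the returned pif against what they can
handle, which is the pattern tcp_conn_from_tap() and tcp_listen_handler()
follow in the diff below.
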
 flow.c       |  53 +++++++++++++++++++++++++
 flow_table.h |   2 +
 fwd.c        | 110 +++++++++++++++++++++++++++++++++++++++++++++++++++
 fwd.h        |  12 ++++++
 tcp.c        | 102 +++++++++++++++--------------------------------
 tcp_splice.c |  63 ++---------------------------
 tcp_splice.h |   5 +--
 7 files changed, 213 insertions(+), 134 deletions(-)

diff --git a/flow.c b/flow.c
index 4942075..a6afe39 100644
--- a/flow.c
+++ b/flow.c
@@ -304,6 +304,59 @@ const struct flowside *flow_forward_af(union flow *flow, uint8_t pif,
 	return fwd;
 }
 
+
+/**
+ * flow_forward() - Determine where flow should forward to, and move to FWD
+ * @c:		Execution context
+ * @flow:	Flow to forward
+ * @proto:	Protocol
+ *
+ * Return: pointer to the forwarded flowside information
+ */
+const struct flowside *flow_forward(const struct ctx *c, union flow *flow,
+				    uint8_t proto)
+{
+	char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN];
+	struct flow_common *f = &flow->f;
+	const struct flowside *ini = &f->side[INISIDE];
+	struct flowside *fwd = &f->side[FWDSIDE];
+	uint8_t pif1 = PIF_NONE;
+
+	ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_INI);
+	ASSERT(f->type == FLOW_TYPE_NONE);
+	ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[FWDSIDE] == PIF_NONE);
+	ASSERT(flow->f.state == FLOW_STATE_INI);
+
+	switch (f->pif[INISIDE]) {
+	case PIF_TAP:
+		pif1 = fwd_from_tap(c, proto, ini, fwd);
+		break;
+
+	case PIF_SPLICE:
+		pif1 = fwd_from_splice(c, proto, ini, fwd);
+		break;
+
+	case PIF_HOST:
+		pif1 = fwd_from_host(c, proto, ini, fwd);
+		break;
+
+	default:
+		flow_err(flow, "No rules to forward %s [%s]:%hu -> [%s]:%hu",
+			 pif_name(f->pif[INISIDE]),
+			 inany_ntop(&ini->eaddr, estr, sizeof(estr)),
+			 ini->eport,
+			 inany_ntop(&ini->faddr, fstr, sizeof(fstr)),
+			 ini->fport);
+	}
+
+	if (pif1 == PIF_NONE)
+		return NULL;
+
+	f->pif[FWDSIDE] = pif1;
+	flow_set_state(f, FLOW_STATE_FWD);
+	return fwd;
+}
+
 /**
  * flow_set_type() - Set type and move to TYPED state
  * @flow:	Flow to change state
diff --git a/flow_table.h b/flow_table.h
index d17ffba..3ac0b8c 100644
--- a/flow_table.h
+++ b/flow_table.h
@@ -118,6 +118,8 @@ const struct flowside *flow_forward_af(union flow *flow, uint8_t pif,
 				       sa_family_t af,
 				       const void *saddr, in_port_t sport,
 				       const void *daddr, in_port_t dport);
+const struct flowside *flow_forward(const struct ctx *c, union flow *flow,
+				    uint8_t proto);
 
 union flow *flow_set_type(union flow *flow, enum flow_type type);
 #define FLOW_SET_TYPE(flow_, t_, var_)	(&flow_set_type((flow_), (t_))->var_)
diff --git a/fwd.c b/fwd.c
index b3d5a37..5fe2361 100644
--- a/fwd.c
+++ b/fwd.c
@@ -25,6 +25,7 @@
 #include "fwd.h"
 #include "passt.h"
 #include "lineread.h"
+#include "flow_table.h"
 
 /* See enum in kernel's include/net/tcp_states.h */
 #define UDP_LISTEN	0x07
@@ -154,3 +155,112 @@ void fwd_scan_ports_init(struct ctx *c)
 				   &c->tcp.fwd_out, &c->tcp.fwd_in);
 	}
 }
+
+uint8_t fwd_from_tap(const struct ctx *c, uint8_t proto,
+		     const struct flowside *a, struct flowside *b)
+{
+	(void)proto;
+
+	b->eaddr = a->faddr;
+	b->eport = a->fport;
+
+	if (!c->no_map_gw) {
+		struct in_addr *v4 = inany_v4(&b->eaddr);
+
+		if (v4 && IN4_ARE_ADDR_EQUAL(v4, &c->ip4.gw))
+			*v4 = in4addr_loopback;
+		if (IN6_ARE_ADDR_EQUAL(&b->eaddr, &c->ip6.gw))
+			b->eaddr.a6 = in6addr_loopback;
+	}
+
+	return PIF_HOST;
+}
+
+uint8_t fwd_from_splice(const struct ctx *c, uint8_t proto,
+			const struct flowside *a, struct flowside *b)
+{
+	const struct in_addr *ae4 = inany_v4(&a->eaddr);
+
+	if (!inany_is_loopback(&a->eaddr) ||
+	    (!inany_is_loopback(&a->faddr) && !inany_is_unspecified(&a->faddr))) {
+		char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN];
+
+		debug("Non loopback address on %s: [%s]:%hu -> [%s]:%hu",
+		      pif_name(PIF_SPLICE),
+		      inany_ntop(&a->eaddr, estr, sizeof(estr)), a->eport,
+		      inany_ntop(&a->faddr, fstr, sizeof(fstr)), a->fport);
+		return PIF_NONE;
+	}
+
+	if (ae4)
+		inany_from_af(&b->eaddr, AF_INET, &in4addr_loopback);
+	else
+		inany_from_af(&b->eaddr, AF_INET6, &in6addr_loopback);
+
+	b->eport = a->fport;
+
+	if (proto == IPPROTO_TCP)
+		b->eport += c->tcp.fwd_out.delta[b->eport];
+
+	return PIF_HOST;
+}
+
+uint8_t fwd_from_host(const struct ctx *c, uint8_t proto,
+		      const struct flowside *a, struct flowside *b)
+{
+	struct in_addr *bf4;
+
+	if (c->mode == MODE_PASTA && inany_is_loopback(&a->eaddr) &&
+	    proto == IPPROTO_TCP) {
+		/* spliceable */
+		b->faddr = a->eaddr;
+
+		if (inany_v4(&a->eaddr))
+			inany_from_af(&b->eaddr, AF_INET, &in4addr_loopback);
+		else
+			inany_from_af(&b->eaddr, AF_INET6, &in6addr_loopback);
+		b->eport = a->fport;
+		if (proto == IPPROTO_TCP)
+			b->eport += c->tcp.fwd_in.delta[b->eport];
+
+		return PIF_SPLICE;
+	}
+
+	b->faddr = a->eaddr;
+	b->fport = a->eport;
+
+	bf4 = inany_v4(&b->faddr);
+
+	if (bf4) {
+		if (IN4_IS_ADDR_LOOPBACK(bf4) ||
+		    IN4_IS_ADDR_UNSPECIFIED(bf4) ||
+		    IN4_ARE_ADDR_EQUAL(bf4, &c->ip4.addr_seen))
+			*bf4 = c->ip4.gw;
+	} else {
+		struct in6_addr *bf6 = &b->faddr.a6;
+
+		if (IN6_IS_ADDR_LOOPBACK(bf6) ||
+		    IN6_ARE_ADDR_EQUAL(bf6, &c->ip6.addr_seen) ||
+		    IN6_ARE_ADDR_EQUAL(bf6, &c->ip6.addr)) {
+			if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
+				*bf6 = c->ip6.gw;
+			else
+				*bf6 = c->ip6.addr_ll;
+		}
+	}
+
+	if (bf4) {
+		inany_from_af(&b->eaddr, AF_INET, &c->ip4.addr_seen);
+	} else {
+		if (IN6_IS_ADDR_LINKLOCAL(&b->faddr.a6))
+			b->eaddr.a6 = c->ip6.addr_ll_seen;
+		else
+			b->eaddr.a6 = c->ip6.addr_seen;
+	}
+
+	b->eport = a->fport;
+	if (proto == IPPROTO_TCP)
+		b->eport += c->tcp.fwd_in.delta[b->eport];
+
+	return PIF_TAP;
+}
diff --git a/fwd.h b/fwd.h
index 41645d7..eefe0f0 100644
--- a/fwd.h
+++ b/fwd.h
@@ -7,6 +7,8 @@
 #ifndef FWD_H
 #define FWD_H
 
+struct flowside;
+
 /* Number of ports for both TCP and UDP */
 #define	NUM_PORTS	(1U << 16)
 
@@ -42,4 +44,14 @@ void fwd_scan_ports_udp(struct fwd_ports *fwd, const struct fwd_ports *rev,
 			const struct fwd_ports *tcp_rev);
 void fwd_scan_ports_init(struct ctx *c);
 
+uint8_t fwd_from_tap(const struct ctx *c, uint8_t proto,
+		     const struct flowside *a, struct flowside *b);
+uint8_t fwd_from_splice(const struct ctx *c, uint8_t proto,
+			const struct flowside *a, struct flowside *b);
+uint8_t fwd_from_host(const struct ctx *c, uint8_t proto,
+		      const struct flowside *a, struct flowside *b);
+
+bool fwd_nat_flow(const struct ctx *c, uint8_t proto,
+		  const struct flowside *a, struct flowside *b);
+
 #endif /* FWD_H */
diff --git a/tcp.c b/tcp.c
index 91b8a46..7e08b53 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1759,7 +1759,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 	in_port_t dstport = ntohs(th->dest);
 	const struct flowside *ini, *fwd;
 	struct tcp_tap_conn *conn;
-	union inany_addr dstaddr; /* FIXME: Avoid bulky temporary */
 	union sockaddr_inany sa;
 	union flow *flow;
 	int s = -1, mss;
@@ -1782,22 +1781,18 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 		goto cancel;
 	}
 
-	if ((s = tcp_conn_sock(c, af)) < 0)
+	if (!(fwd = flow_forward(c, flow, IPPROTO_TCP)))
 		goto cancel;
 
-	dstaddr = ini->faddr;
-
-	if (!c->no_map_gw) {
-		struct in_addr *v4 = inany_v4(&dstaddr);
-
-		if (v4 && IN4_ARE_ADDR_EQUAL(v4, &c->ip4.gw))
-			*v4 = in4addr_loopback;
-		if (IN6_ARE_ADDR_EQUAL(&dstaddr, &c->ip6.gw))
-			dstaddr.a6 = in6addr_loopback;
+	if (flow->f.pif[FWDSIDE] != PIF_HOST) {
+		flow_err(flow, "No support for forwarding TCP from %s to %s",
+			 pif_name(flow->f.pif[INISIDE]),
+			 pif_name(flow->f.pif[FWDSIDE]));
+		goto cancel;
 	}
 
-	fwd = flow_forward_af(flow, PIF_HOST, AF_INET6,
-			      &inany_any6, srcport, &dstaddr, dstport);
+	if ((s = tcp_conn_sock(c, af)) < 0)
+		goto cancel;
 
 	if (IN6_IS_ADDR_LINKLOCAL(&fwd->eaddr)) {
 		struct sockaddr_in6 addr6_ll = {
@@ -2479,70 +2474,21 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
 	conn_flag(c, conn, ACK_FROM_TAP_DUE);
 }
 
-/**
- * tcp_snat_inbound() - Translate source address for inbound data if needed
- * @c:		Execution context
- * @addr:	Source address of inbound packet/connection
- */
-static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr)
-{
-	struct in_addr *addr4 = inany_v4(addr);
-
-	if (addr4) {
-		if (IN4_IS_ADDR_LOOPBACK(addr4) ||
-		    IN4_IS_ADDR_UNSPECIFIED(addr4) ||
-		    IN4_ARE_ADDR_EQUAL(addr4, &c->ip4.addr_seen))
-			*addr4 = c->ip4.gw;
-	} else {
-		struct in6_addr *addr6 = &addr->a6;
-
-		if (IN6_IS_ADDR_LOOPBACK(addr6) ||
-		    IN6_ARE_ADDR_EQUAL(addr6, &c->ip6.addr_seen) ||
-		    IN6_ARE_ADDR_EQUAL(addr6, &c->ip6.addr)) {
-			if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
-				*addr6 = c->ip6.gw;
-			else
-				*addr6 = c->ip6.addr_ll;
-		}
-	}
-}
-
 /**
  * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection
  * @c:		Execution context
- * @dstport:	Destination port for connection (host side)
  * @flow:	flow to initialise
  * @s:		Accepted socket
  * @sa:		Peer socket address (from accept())
  * @now:	Current timestamp
  */
-static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
-				   union flow *flow, int s,
-				   const union sockaddr_inany *sa,
+static void tcp_tap_conn_from_sock(struct ctx *c, union flow *flow, int s,
 				   const struct timespec *now)
 {
-	union inany_addr saddr, daddr; /* FIXME: avoid bulky temporaries */
-	struct tcp_tap_conn *conn;
-	in_port_t srcport;
+	struct tcp_tap_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
 	uint64_t hash;
 
-	inany_from_sockaddr(&saddr, &srcport, sa);
-	tcp_snat_inbound(c, &saddr);
-
-	if (inany_v4(&saddr)) {
-		inany_from_af(&daddr, AF_INET, &c->ip4.addr_seen);
-	} else {
-		if (IN6_IS_ADDR_LINKLOCAL(&saddr))
-			daddr.a6 = c->ip6.addr_ll_seen;
-		else
-			daddr.a6 = c->ip6.addr_seen;
-	}
-	dstport += c->tcp.fwd_in.delta[dstport];
-
-	flow_forward_af(flow,  PIF_TAP, AF_INET6,
-			&saddr, srcport, &daddr, dstport);
-	conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
-
+	
 	conn->sock = s;
 	conn->timer = -1;
 	conn->ws_to_tap = conn->ws_from_tap = 0;
@@ -2585,8 +2531,7 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
 	if (s < 0)
 		goto cancel;
 
-	flow_initiate_sa(flow, ref.tcp_listen.pif, &sa, ref.tcp_listen.port);
-	ini = &flow->f.side[INISIDE];
+	ini = flow_initiate_sa(flow, ref.tcp_listen.pif, &sa, ref.tcp_listen.port);
 
 	if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0) {
 		char str[INANY_ADDRSTRLEN];
@@ -2596,11 +2541,26 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
 		goto cancel;
 	}
 
-	if (tcp_splice_conn_from_sock(c, ref.tcp_listen.pif,
-				      ref.tcp_listen.port, flow, s, &sa))
-		return;
+	if (!flow_forward(c, flow, IPPROTO_TCP))
+		goto cancel;
+
+	switch (flow->f.pif[FWDSIDE]) {
+	case PIF_SPLICE:
+	case PIF_HOST:
+		tcp_splice_conn_from_sock(c, flow, s);
+		break;
+
+	case PIF_TAP:
+		tcp_tap_conn_from_sock(c, flow, s, now);
+		break;
+
+	default:
+		flow_err(flow, "No support for forwarding TCP from %s to %s",
+			 pif_name(flow->f.pif[INISIDE]),
+			 pif_name(flow->f.pif[FWDSIDE]));
+		goto cancel;
+	}
 
-	tcp_tap_conn_from_sock(c, ref.tcp_listen.port, flow, s, &sa, now);
 	return;
 
 cancel:
diff --git a/tcp_splice.c b/tcp_splice.c
index aa92325..a0581f0 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -395,71 +395,18 @@ static int tcp_conn_sock_ns(const struct ctx *c, sa_family_t af)
 /**
  * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection
  * @c:		Execution context
- * @pif0:	pif id of side 0
- * @dstport:	Side 0 destination port of connection
  * @flow:	flow to initialise
  * @s0:		Accepted (side 0) socket
  * @sa:		Peer address of connection
  *
- * Return: true if able to create a spliced connection, false otherwise
  * #syscalls:pasta setsockopt
  */
-bool tcp_splice_conn_from_sock(const struct ctx *c,
-			       uint8_t pif0, in_port_t dstport,
-			       union flow *flow, int s0,
-			       const union sockaddr_inany *sa)
+void tcp_splice_conn_from_sock(const struct ctx *c, union flow *flow, int s0)
 {
-	struct tcp_splice_conn *conn;
-	union inany_addr src;
-	in_port_t srcport;
-	sa_family_t af;
-	uint8_t pif1;
+	struct tcp_splice_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE,
+						     tcp_splice);
 
-	if (c->mode != MODE_PASTA)
-		return false;
-
-	inany_from_sockaddr(&src, &srcport, sa);
-	af = inany_v4(&src) ? AF_INET : AF_INET6;
-
-	switch (pif0) {
-	case PIF_SPLICE:
-		if (!inany_is_loopback(&src)) {
-			char str[INANY_ADDRSTRLEN];
-
-			/* We can't use flow_err() etc. because we haven't set
-			 * the flow type yet
-			 */
-			warn("Bad source address %s for splice, closing",
-			     inany_ntop(&src, str, sizeof(str)));
-
-			/* We *don't* want to fall back to tap */
-			flow_alloc_cancel(flow);
-			return true;
-		}
-
-		pif1 = PIF_HOST;
-		dstport += c->tcp.fwd_out.delta[dstport];
-		break;
-
-	case PIF_HOST:
-		if (!inany_is_loopback(&src))
-			return false;
-
-		pif1 = PIF_SPLICE;
-		dstport += c->tcp.fwd_in.delta[dstport];
-		break;
-
-	default:
-		return false;
-	}
-
-	if (af == AF_INET)
-		flow_forward_af(flow, pif1, AF_INET,
-				NULL, 0, &in4addr_loopback, dstport);
-	else
-		flow_forward_af(flow, pif1, AF_INET6,
-				NULL, 0, &in6addr_loopback, dstport);
-	conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, tcp_splice);
+	ASSERT(c->mode == MODE_PASTA);
 
 	conn->s[0] = s0;
 	conn->s[1] = -1;
@@ -473,8 +420,6 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
 		conn_flag(c, conn, CLOSING);
 
 	FLOW_ACTIVATE(conn);
-
-	return true;
 }
 
 /**
diff --git a/tcp_splice.h b/tcp_splice.h
index ed8f0c5..a20f3e2 100644
--- a/tcp_splice.h
+++ b/tcp_splice.h
@@ -11,10 +11,7 @@ union sockaddr_inany;
 
 void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
 			     uint32_t events);
-bool tcp_splice_conn_from_sock(const struct ctx *c,
-			       uint8_t pif0, in_port_t dstport,
-			       union flow *flow, int s0,
-			       const union sockaddr_inany *sa);
+void tcp_splice_conn_from_sock(const struct ctx *c, union flow *flow, int s0);
 void tcp_splice_init(struct ctx *c);
 
 #endif /* TCP_SPLICE_H */
-- 
2.45.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v5 19/19] flow, icmp: Use general flow forwarding rules for ICMP
  2024-05-14  1:03 [PATCH v5 00/19] RFC: Unified flow table David Gibson
                   ` (17 preceding siblings ...)
  2024-05-14  1:03 ` [PATCH v5 18/19] flow, tcp: Flow based NAT and port forwarding for TCP David Gibson
@ 2024-05-14  1:03 ` David Gibson
       [not found]   ` <20240518001408.004011b2@elisabeth>
  18 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2024-05-14  1:03 UTC (permalink / raw)
  To: Stefano Brivio, passt-dev; +Cc: David Gibson

Currently ICMP hard codes its forwarding rules, and never applies any
translations.  Change it to use the flow_forward() function, so that
it's translated the same way as TCP (excluding TCP specific port
redirection).

This means that gw mapping now applies to ICMP, so "ping <gw address>" will
now ping the host's loopback instead of the actual gw machine.  This
removes the surprising behaviour that the target you ping might not be the
same one you connect to with TCP.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
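To make the behaviour change concrete, here is a small sketch of the
mapping a ping now goes through.  The names (sk_ping_fwd,
sk_ping_forward) are hypothetical and the sketch handles IPv4 in host
byte order only.  It assumes, as the diff suggests, that the echo ID is
carried in the port field and that the gateway address is rewritten to
loopback unless gw mapping is disabled.

```c
#include <assert.h>
#include <stdint.h>

#define SK_LOOPBACK4 0x7f000001u	/* 127.0.0.1, host byte order */

struct sk_ping_fwd {
	uint32_t eaddr;	/* address the ping socket will send to */
	uint16_t eport;	/* echo ID carried in the port field */
};

/* Fill in the forwarded side for an echo request from the guest */
void sk_ping_forward(uint32_t gw, int no_map_gw,
		     uint32_t daddr, uint16_t id, struct sk_ping_fwd *fwd)
{
	fwd->eaddr = daddr;
	fwd->eport = id;

	/* Same gw mapping as TCP; no port delta, that is TCP specific */
	if (!no_map_gw && fwd->eaddr == gw)
		fwd->eaddr = SK_LOOPBACK4;
}
```

With gw set to 10.0.0.1, a guest "ping 10.0.0.1" ends up addressed to
127.0.0.1 on the host, while any other destination, or the same one with
no_map_gw set, is left alone.
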
 flow.c |  1 +
 icmp.c | 14 ++++++++++++--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/flow.c b/flow.c
index a6afe39..b43a079 100644
--- a/flow.c
+++ b/flow.c
@@ -285,6 +285,7 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
  *
  * Return: pointer to the forwarded flowside information
  */
+/* cppcheck-suppress unusedFunction */
 const struct flowside *flow_forward_af(union flow *flow, uint8_t pif,
 				       sa_family_t af,
 				       const void *saddr, in_port_t sport,
diff --git a/icmp.c b/icmp.c
index 0112fd9..6310178 100644
--- a/icmp.c
+++ b/icmp.c
@@ -153,6 +153,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 					    sa_family_t af, uint16_t id,
 					    const void *saddr, const void *daddr)
 {
+	uint8_t proto = af == AF_INET ? IPPROTO_ICMP : IPPROTO_ICMPV6;
 	uint8_t flowtype = af == AF_INET ? FLOW_PING4 : FLOW_PING6;
 	union epoll_ref ref = { .type = EPOLL_TYPE_PING };
 	union flow *flow = flow_alloc();
@@ -163,9 +164,18 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 	if (!flow)
 		return NULL;
 
-
 	flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id);
-	flow_forward_af(flow, PIF_HOST,	af, NULL, 0, daddr, 0);
+	if (!flow_forward(c, flow, proto))
+		goto cancel;
+
+	if (flow->f.pif[FWDSIDE] != PIF_HOST) {
+		flow_err(flow, "No support for forwarding %s from %s to %s",
+			 proto == IPPROTO_ICMP ? "ICMP" : "ICMPv6",
+			 pif_name(flow->f.pif[INISIDE]),
+			 pif_name(flow->f.pif[FWDSIDE]));
+		goto cancel;
+	}
+
 	pingf = FLOW_SET_TYPE(flow, flowtype, ping);
 
 	pingf->seq = -1;
-- 
2.45.0


* Re: [PATCH v5 01/19] flow: Clarify and enforce flow state transitions
  2024-05-14  1:03 ` [PATCH v5 01/19] flow: Clarify and enforce flow state transitions David Gibson
@ 2024-05-16  9:30   ` Stefano Brivio
       [not found]     ` <ZkbVxtvmP7f0aL1S@zatzit>
  0 siblings, 1 reply; 28+ messages in thread
From: Stefano Brivio @ 2024-05-16  9:30 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Tue, 14 May 2024 11:03:19 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> Flows move over several different states in their lifetime.  The rules for
> these are documented in comments, but they're pretty complex and a number
> of the transitions are implicit, which makes this pretty fragile and
> error prone.
> 
> Change the code to explicitly track the states in a field.  Make all
> transitions explicit and logged.  To the extent that it's practical in C,
> enforce what can and can't be done in various states with ASSERT()s.
> 
> While we're at it, tweak the docs to clarify the restrictions on each state
> a bit.

Now it looks much clearer to me.

> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  flow.c       | 144 ++++++++++++++++++++++++++++++---------------------
>  flow.h       |  67 ++++++++++++++++++++++--
>  flow_table.h |  10 ++++
>  icmp.c       |   4 +-
>  tcp.c        |   8 ++-
>  tcp_splice.c |   4 +-
>  6 files changed, 168 insertions(+), 69 deletions(-)
> 
> diff --git a/flow.c b/flow.c
> index 80dd269..768e0f6 100644
> --- a/flow.c
> +++ b/flow.c
> @@ -18,6 +18,15 @@
>  #include "flow.h"
>  #include "flow_table.h"
>  
> +const char *flow_state_str[] = {
> +	[FLOW_STATE_FREE]	= "FREE",
> +	[FLOW_STATE_NEW]	= "NEW",
> +	[FLOW_STATE_TYPED]	= "TYPED",
> +	[FLOW_STATE_ACTIVE]	= "ACTIVE",
> +};
> +static_assert(ARRAY_SIZE(flow_state_str) == FLOW_NUM_STATES,
> +	      "flow_state_str[] doesn't match enum flow_state");
> +
>  const char *flow_type_str[] = {
>  	[FLOW_TYPE_NONE]	= "<none>",
>  	[FLOW_TCP]		= "TCP connection",
> @@ -39,46 +48,6 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
>  
>  /* Global Flow Table */
>  
> -/**
> - * DOC: Theory of Operation - flow entry life cycle
> - *
> - * An individual flow table entry moves through these logical states, usually in
> - * this order.
> - *
> - *    FREE - Part of the general pool of free flow table entries
> - *        Operations:
> - *            - flow_alloc() finds an entry and moves it to ALLOC state
> - *
> - *    ALLOC - A tentatively allocated entry
> - *        Operations:
> - *            - flow_alloc_cancel() returns the entry to FREE state
> - *            - FLOW_START() set the entry's type and moves to START state
> - *        Caveats:
> - *            - It's not safe to write fields in the flow entry
> - *            - It's not safe to allocate further entries with flow_alloc()
> - *            - It's not safe to return to the main epoll loop (use FLOW_START()
> - *              to move to START state before doing so)
> - *            - It's not safe to use flow_*() logging functions
> - *
> - *    START - An entry being prepared by flow type specific code
> - *        Operations:
> - *            - Flow type specific fields may be accessed
> - *            - flow_*() logging functions
> - *            - flow_alloc_cancel() returns the entry to FREE state
> - *        Caveats:
> - *            - Returning to the main epoll loop or allocating another entry
> - *              with flow_alloc() implicitly moves the entry to ACTIVE state.
> - *
> - *    ACTIVE - An active flow entry managed by flow type specific code
> - *        Operations:
> - *            - Flow type specific fields may be accessed
> - *            - flow_*() logging functions
> - *            - Flow may be expired by returning 'true' from flow type specific
> - *              deferred or timer handler.  This will return it to FREE state.
> - *        Caveats:
> - *            - It's not safe to call flow_alloc_cancel()
> - */
> -
>  /**
>   * DOC: Theory of Operation - allocating and freeing flow entries
>   *
> @@ -132,6 +101,7 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
>  
>  unsigned flow_first_free;
>  union flow flowtab[FLOW_MAX];
> +static const union flow *flow_new_entry; /* = NULL */
>  
>  /* Last time the flow timers ran */
>  static struct timespec flow_timer_run;
> @@ -144,6 +114,7 @@ static struct timespec flow_timer_run;
>   */
>  void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
>  {
> +	const char *typestate;

type_or_state? It took me a while to figure this out (well, because I
didn't read the rest, my bad, but still it could be clearer).

>  	char msg[BUFSIZ];
>  	va_list args;
>  
> @@ -151,40 +122,65 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
>  	(void)vsnprintf(msg, sizeof(msg), fmt, args);
>  	va_end(args);
>  
> -	logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
> +	/* Show type if it's set, otherwise the state */
> +	if (f->state < FLOW_STATE_TYPED)
> +		typestate = FLOW_STATE(f);
> +	else
> +		typestate = FLOW_TYPE(f);
> +
> +	logmsg(pri, "Flow %u (%s): %s", flow_idx(f), typestate, msg);
> +}
> +
> +/**
> + * flow_set_state() - Change flow's state
> + * @f:		Flow to update
> + * @state:	New state
> + */
> +static void flow_set_state(struct flow_common *f, enum flow_state state)
> +{
> +	uint8_t oldstate = f->state;
> +
> +	ASSERT(state < FLOW_NUM_STATES);
> +	ASSERT(oldstate < FLOW_NUM_STATES);
> +
> +	f->state = state;
> +	flow_log_(f, LOG_DEBUG, "%s -> %s", flow_state_str[oldstate],
> +		  FLOW_STATE(f));
>  }
>  
>  /**
> - * flow_start() - Set flow type for new flow and log
> - * @flow:	Flow to set type for
> + * flow_set_type() - Set type and mvoe to TYPED state

move

> + * @flow:	Flow to change state

...for? Or Flow changing state?

>   * @type:	Type for new flow
>   * @iniside:	Which side initiated the new flow
>   *
>   * Return: @flow
> - *
> - * Should be called before setting any flow type specific fields in the flow
> - * table entry.
>   */
> -union flow *flow_start(union flow *flow, enum flow_type type,
> -		       unsigned iniside)
> +union flow *flow_set_type(union flow *flow, enum flow_type type,
> +			  unsigned iniside)
>  {
> +	struct flow_common *f = &flow->f;
> +
> +	ASSERT(type != FLOW_TYPE_NONE);
> +	ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_NEW);
> +	ASSERT(f->type == FLOW_TYPE_NONE);
> +
>  	(void)iniside;
> -	flow->f.type = type;
> -	flow_dbg(flow, "START %s", flow_type_str[flow->f.type]);
> +	f->type = type;
> +	flow_set_state(f, FLOW_STATE_TYPED);
>  	return flow;
>  }
>  
>  /**
> - * flow_end() - Clear flow type for finished flow and log
> - * @flow:	Flow to clear
> + * flow_activate() - Move flow to ACTIVE state
> + * @f:		Flow to change state
>   */
> -static void flow_end(union flow *flow)
> +void flow_activate(struct flow_common *f)
>  {
> -	if (flow->f.type == FLOW_TYPE_NONE)
> -		return; /* Nothing to do */
> +	ASSERT(&flow_new_entry->f == f && f->state == FLOW_STATE_TYPED);
>  
> -	flow_dbg(flow, "END %s", flow_type_str[flow->f.type]);
> -	flow->f.type = FLOW_TYPE_NONE;
> +	flow_set_state(f, FLOW_STATE_ACTIVE);
> +	flow_new_entry = NULL;
>  }
>  
>  /**
> @@ -196,9 +192,12 @@ union flow *flow_alloc(void)
>  {
>  	union flow *flow = &flowtab[flow_first_free];
>  
> +	ASSERT(!flow_new_entry);
> +
>  	if (flow_first_free >= FLOW_MAX)
>  		return NULL;
>  
> +	ASSERT(flow->f.state == FLOW_STATE_FREE);
>  	ASSERT(flow->f.type == FLOW_TYPE_NONE);
>  	ASSERT(flow->free.n >= 1);
>  	ASSERT(flow_first_free + flow->free.n <= FLOW_MAX);
> @@ -221,7 +220,10 @@ union flow *flow_alloc(void)
>  		flow_first_free = flow->free.next;
>  	}
>  
> +	flow_new_entry = flow;
>  	memset(flow, 0, sizeof(*flow));
> +	flow_set_state(&flow->f, FLOW_STATE_NEW);
> +
>  	return flow;
>  }
>  
> @@ -233,15 +235,21 @@ union flow *flow_alloc(void)
>   */
>  void flow_alloc_cancel(union flow *flow)
>  {
> +	ASSERT(flow_new_entry == flow);
> +	ASSERT(flow->f.state == FLOW_STATE_NEW ||
> +	       flow->f.state == FLOW_STATE_TYPED);
>  	ASSERT(flow_first_free > FLOW_IDX(flow));
>  
> -	flow_end(flow);
> +	flow_set_state(&flow->f, FLOW_STATE_FREE);
> +	memset(flow, 0, sizeof(*flow));
> +
>  	/* Put it back in a length 1 free cluster, don't attempt to fully
>  	 * reverse flow_alloc()s steps.  This will get folded together the next
>  	 * time flow_defer_handler runs anyway() */
>  	flow->free.n = 1;
>  	flow->free.next = flow_first_free;
>  	flow_first_free = FLOW_IDX(flow);
> +	flow_new_entry = NULL;
>  }
>  
>  /**
> @@ -265,7 +273,8 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
>  		union flow *flow = &flowtab[idx];
>  		bool closed = false;
>  
> -		if (flow->f.type == FLOW_TYPE_NONE) {
> +		switch (flow->f.state) {
> +		case FLOW_STATE_FREE: {
>  			unsigned skip = flow->free.n;
>  
>  			/* First entry of a free cluster must have n >= 1 */
> @@ -287,6 +296,20 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
>  			continue;
>  		}
>  
> +		case FLOW_STATE_NEW:
> +		case FLOW_STATE_TYPED:
> +			flow_err(flow, "Incomplete flow at end of cycle");
> +			ASSERT(false);
> +			break;
> +
> +		case FLOW_STATE_ACTIVE:
> +			/* Nothing to do */
> +			break;
> +
> +		default:
> +			ASSERT(false);
> +		}
> +
>  		switch (flow->f.type) {
>  		case FLOW_TYPE_NONE:
>  			ASSERT(false);
> @@ -310,7 +333,8 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
>  		}
>  
>  		if (closed) {
> -			flow_end(flow);
> +			flow_set_state(&flow->f, FLOW_STATE_FREE);
> +			memset(flow, 0, sizeof(*flow));
>  
>  			if (free_head) {
>  				/* Add slot to current free cluster */
> diff --git a/flow.h b/flow.h
> index c943c44..073a734 100644
> --- a/flow.h
> +++ b/flow.h
> @@ -9,6 +9,66 @@
>  
>  #define FLOW_TIMER_INTERVAL		1000	/* ms */
>  
> +/**
> + * enum flow_state - States of a flow table entry
> + *
> + * An individual flow table entry moves through these states, usually in this
> + * order.
> + *  General rules:
> + *    - Code outside flow.c should never write common fields of union flow.
> + *    - The state field may always be read.
> + *
> + *    FREE - Part of the general pool of free flow table entries
> + *        Operations:
> + *            - flow_alloc() finds an entry and moves it to NEW state

even s/ state// (same below) maybe? It's a bit redundant. No strong
preference though.

> + *
> + *    NEW - Freshly allocated, uninitialised entry
> + *        Operations:
> + *            - flow_alloc_cancel() returns the entry to FREE state
> + *            - FLOW_SET_TYPE() sets the entry's type and moves to TYPED state
> + *        Caveats:
> + *            - No fields other than state may be accessed.

s/\.//

> + *            - At most one entry may be in NEW or TYPED state at a time, so it's
> + *              unsafe to use flow_alloc() again until this entry moves to
> + *              ACTIVE or FREE state 
> + *            - You may not return to the main epoll loop while an entry is in
> + *              NEW state.
> + *
> + *    TYPED - Generic info initialised, type specific initialisation underway
> + *        Operations:
> + *            - All common fields may be read
> + *            - Type specific fields may be read and written
> + *            - flow_alloc_cancel() returns the entry to FREE state
> + *            - FLOW_ACTIVATE() moves the entry to ACTIVE STATE

s/STATE/state/ (if you want to keep it)

> + *        Caveats:
> + *            - At most one entry may be in NEW or TYPED state at a time, so it's
> + *              unsafe to use flow_alloc() again until this entry moves to
> + *              ACTIVE or FREE state 
> + *            - You may not return to the main epoll loop while an entry is in
> + *              TYPED state.
> + *
> + *    ACTIVE - An active, fully-initialised flow entry
> + *        Operations:
> + *            - All common fields may be read
> + *            - Type specific fields may be read and written
> + *            - Flow may be expired by returning 'true' from flow type specific

'to expire' in this sense is actually intransitive. What you mean is
perfectly clear after reading this a couple of times, but it might
confuse non-native English speakers I guess?

> + *              deferred or timer handler.  This will return it to FREE state.
> + *        Caveats:
> + *            - flow_alloc_cancel() may not be called on it
> + */
> +enum flow_state {
> +	FLOW_STATE_FREE,
> +	FLOW_STATE_NEW,
> +	FLOW_STATE_TYPED,
> +	FLOW_STATE_ACTIVE,
> +
> +	FLOW_NUM_STATES,
> +};
> +
> +extern const char *flow_state_str[];
> +#define FLOW_STATE(f)							\
> +        ((f)->state < FLOW_NUM_STATES ? flow_state_str[(f)->state] : "?")
> +
>  /**
>   * enum flow_type - Different types of packet flows we track
>   */
> @@ -37,9 +97,11 @@ extern const uint8_t flow_proto[];
>  
>  /**
>   * struct flow_common - Common fields for packet flows
> + * @state:	State of the flow table entry
>   * @type:	Type of packet flow
>   */
>  struct flow_common {
> +	uint8_t		state;

In this case, I would typically do
(https://seitan.rocks/seitan/tree/common/gluten.h?id=5a9302bab9c9bb3d1577f04678d074fb7af4115f#n53):

#ifdef __GNUC__
	enum flow_state		state:8;
#else
	uint8_t			state;
#endif

...and in any case we need to make sure to assign single values in the
enum above: there are no guarantees that FLOW_STATE_ACTIVE is 3
otherwise (except for that static_assert(), but that's not its purpose).

>  	uint8_t		type;
>  };
>  
> @@ -49,11 +111,6 @@ struct flow_common {
>  #define FLOW_TABLE_PRESSURE		30	/* % of FLOW_MAX */
>  #define FLOW_FILE_PRESSURE		30	/* % of c->nofile */
>  
> -union flow *flow_start(union flow *flow, enum flow_type type,
> -		       unsigned iniside);
> -#define FLOW_START(flow_, t_, var_, i_)		\
> -	(&flow_start((flow_), (t_), (i_))->var_)
> -
>  /**
>   * struct flow_sidx - ID for one side of a specific flow
>   * @side:	Side referenced (0 or 1)
> diff --git a/flow_table.h b/flow_table.h
> index b7e5529..58014d8 100644
> --- a/flow_table.h
> +++ b/flow_table.h
> @@ -107,4 +107,14 @@ static inline flow_sidx_t flow_sidx(const struct flow_common *f,
>  union flow *flow_alloc(void);
>  void flow_alloc_cancel(union flow *flow);
>  
> +union flow *flow_set_type(union flow *flow, enum flow_type type,
> +			  unsigned iniside);
> +#define FLOW_SET_TYPE(flow_, t_, var_, i_)	\
> +	(&flow_set_type((flow_), (t_), (i_))->var_)
> +
> +void flow_activate(struct flow_common *f);
> +#define FLOW_ACTIVATE(flow_)			\
> +	(flow_activate(&(flow_)->f))
> +
> +
>  #endif /* FLOW_TABLE_H */
> diff --git a/icmp.c b/icmp.c
> index 1c5cf84..fda868d 100644
> --- a/icmp.c
> +++ b/icmp.c
> @@ -167,7 +167,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
>  	if (!flow)
>  		return NULL;
>  
> -	pingf = FLOW_START(flow, flowtype, ping, TAPSIDE);
> +	pingf = FLOW_SET_TYPE(flow, flowtype, ping, TAPSIDE);
>  
>  	pingf->seq = -1;
>  	pingf->id = id;
> @@ -198,6 +198,8 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
>  
>  	*id_sock = pingf;
>  
> +	FLOW_ACTIVATE(pingf);
> +
>  	return pingf;
>  
>  cancel:
> diff --git a/tcp.c b/tcp.c
> index 21d0af0..65208ca 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -2006,7 +2006,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
>  			goto cancel;
>  	}
>  
> -	conn = FLOW_START(flow, FLOW_TCP, tcp, TAPSIDE);
> +	conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp, TAPSIDE);
>  	conn->sock = s;
>  	conn->timer = -1;
>  	conn_event(c, conn, TAP_SYN_RCVD);
> @@ -2077,6 +2077,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
>  	}
>  
>  	tcp_epoll_ctl(c, conn);
> +	FLOW_ACTIVATE(conn);
>  	return;
>  
>  cancel:
> @@ -2724,7 +2725,8 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
>  				   const union sockaddr_inany *sa,
>  				   const struct timespec *now)
>  {
> -	struct tcp_tap_conn *conn = FLOW_START(flow, FLOW_TCP, tcp, SOCKSIDE);
> +	struct tcp_tap_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp,
> +						  SOCKSIDE);
>  
>  	conn->sock = s;
>  	conn->timer = -1;
> @@ -2747,6 +2749,8 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
>  	conn_flag(c, conn, ACK_FROM_TAP_DUE);
>  
>  	tcp_get_sndbuf(conn);
> +
> +	FLOW_ACTIVATE(conn);
>  }
>  
>  /**
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 4c36b72..abe98a0 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -472,7 +472,7 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
>  		return false;
>  	}
>  
> -	conn = FLOW_START(flow, FLOW_TCP_SPLICE, tcp_splice, 0);
> +	conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, tcp_splice, 0);
>  
>  	conn->flags = af == AF_INET ? 0 : SPLICE_V6;
>  	conn->s[0] = s0;
> @@ -486,6 +486,8 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
>  	if (tcp_splice_connect(c, conn, af, pif1, dstport))
>  		conn_flag(c, conn, CLOSING);
>  
> +	FLOW_ACTIVATE(conn);
> +
>  	return true;
>  }
>  

Everything else looks good to me.
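
For readers following along, the life cycle documented in the new enum
flow_state comment can be sketched as a small transition table. This is
only an illustration under assumptions: the sketch_* names are
hypothetical, and unlike the patch (where flow_set_state() just records
and logs, and the callers ASSERT() on the expected state) the check here
is centralised and returns an error instead of aborting.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the patch's flow states (illustration only) */
enum sketch_state {
	SKETCH_FREE,
	SKETCH_NEW,
	SKETCH_TYPED,
	SKETCH_ACTIVE,

	SKETCH_NUM_STATES,
};

/* Hypothetical, cut-down version of the common flow fields */
struct sketch_flow {
	uint8_t state;
	uint8_t type;	/* 0 means "no type set" */
};

/* Return 1 if @from -> @to is one of the documented transitions */
static int sketch_transition_ok(enum sketch_state from, enum sketch_state to)
{
	switch (from) {
	case SKETCH_FREE:
		return to == SKETCH_NEW;	/* flow_alloc() */
	case SKETCH_NEW:
		return to == SKETCH_TYPED ||	/* FLOW_SET_TYPE() */
		       to == SKETCH_FREE;	/* flow_alloc_cancel() */
	case SKETCH_TYPED:
		return to == SKETCH_ACTIVE ||	/* FLOW_ACTIVATE() */
		       to == SKETCH_FREE;	/* flow_alloc_cancel() */
	case SKETCH_ACTIVE:
		return to == SKETCH_FREE;	/* flow closed by its handler */
	default:
		return 0;
	}
}

/* Apply a transition if it is valid; return 0 on success, -1 if rejected */
static int sketch_set_state(struct sketch_flow *f, enum sketch_state to)
{
	if (to >= SKETCH_NUM_STATES || f->state >= SKETCH_NUM_STATES)
		return -1;
	if (!sketch_transition_ok((enum sketch_state)f->state, to))
		return -1;

	f->state = (uint8_t)to;
	return 0;
}
```

Starting from SKETCH_FREE, a jump straight to SKETCH_ACTIVE is rejected,
while FREE -> NEW -> TYPED -> ACTIVE -> FREE is accepted step by step.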

-- 
Stefano


* Re: [PATCH v5 02/19] flow: Make side 0 always be the initiating side
  2024-05-14  1:03 ` [PATCH v5 02/19] flow: Make side 0 always be the initiating side David Gibson
@ 2024-05-16 12:06   ` Stefano Brivio
  0 siblings, 0 replies; 28+ messages in thread
From: Stefano Brivio @ 2024-05-16 12:06 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Tue, 14 May 2024 11:03:20 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> Each flow in the flow table has two sides, 0 and 1, representing the
> two interfaces between which passt/pasta will forward data for that flow.
> Which side is which is currently up to the protocol specific code:  TCP
> uses side 0 for the host/"sock" side and 1 for the guest/"tap" side, except
> for spliced connections where it uses 0 for the initiating side and 1 for
> the accepting side.  ICMP also uses 0 for the host/"sock" side and 1 for
> the guest/"tap" side, but in its case the latter is always also the
> initiating side.
> 
> Make this generically consistent by always using side 0 for the initiating
> side and 1 for the accepting side.  This doesn't simplify a lot for now,
> and arguably makes TCP slightly more complex, since we add an extra field
> to the connection structure to record which is the guest facing side.
> This is an interim change, which we'll be able to remove later.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  flow.c       |  5 +----
>  flow.h       |  5 +++++
>  flow_table.h |  6 ++----
>  icmp.c       |  8 ++------
>  tcp.c        | 19 ++++++++-----------
>  tcp_conn.h   |  3 ++-
>  tcp_splice.c |  2 +-
>  7 files changed, 21 insertions(+), 27 deletions(-)
> 
> diff --git a/flow.c b/flow.c
> index 768e0f6..7456021 100644
> --- a/flow.c
> +++ b/flow.c
> @@ -152,12 +152,10 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
>   * flow_set_type() - Set type and mvoe to TYPED state
>   * @flow:	Flow to change state
>   * @type:	Type for new flow
> - * @iniside:	Which side initiated the new flow
>   *
>   * Return: @flow
>   */
> -union flow *flow_set_type(union flow *flow, enum flow_type type,
> -			  unsigned iniside)
> +union flow *flow_set_type(union flow *flow, enum flow_type type)
>  {
>  	struct flow_common *f = &flow->f;
>  
> @@ -165,7 +163,6 @@ union flow *flow_set_type(union flow *flow, enum flow_type type,
>  	ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_NEW);
>  	ASSERT(f->type == FLOW_TYPE_NONE);
>  
> -	(void)iniside;
>  	f->type = type;
>  	flow_set_state(f, FLOW_STATE_TYPED);
>  	return flow;
> diff --git a/flow.h b/flow.h
> index 073a734..28169a8 100644
> --- a/flow.h
> +++ b/flow.h
> @@ -95,6 +95,11 @@ extern const uint8_t flow_proto[];
>  #define FLOW_PROTO(f)				\
>  	((f)->type < FLOW_NUM_TYPES ? flow_proto[(f)->type] : 0)
>  
> +#define SIDES			2
> +
> +#define INISIDE			0	/* Initiating side */
> +#define FWDSIDE			1	/* Forwarded side */
> +
>  /**
>   * struct flow_common - Common fields for packet flows
>   * @state:	State of the flow table entry
> diff --git a/flow_table.h b/flow_table.h
> index 58014d8..7c98195 100644
> --- a/flow_table.h
> +++ b/flow_table.h
> @@ -107,10 +107,8 @@ static inline flow_sidx_t flow_sidx(const struct flow_common *f,
>  union flow *flow_alloc(void);
>  void flow_alloc_cancel(union flow *flow);
>  
> -union flow *flow_set_type(union flow *flow, enum flow_type type,
> -			  unsigned iniside);
> -#define FLOW_SET_TYPE(flow_, t_, var_, i_)	\
> -	(&flow_set_type((flow_), (t_), (i_))->var_)
> +union flow *flow_set_type(union flow *flow, enum flow_type type);
> +#define FLOW_SET_TYPE(flow_, t_, var_)	(&flow_set_type((flow_), (t_))->var_)
>  
>  void flow_activate(struct flow_common *f);
>  #define FLOW_ACTIVATE(flow_)			\
> diff --git a/icmp.c b/icmp.c
> index fda868d..6df0989 100644
> --- a/icmp.c
> +++ b/icmp.c
> @@ -45,10 +45,6 @@
>  #define ICMP_ECHO_TIMEOUT	60 /* s, timeout for ICMP socket activity */
>  #define ICMP_NUM_IDS		(1U << 16)
>  
> -/* Sides of a flow as we use them for ping streams */
> -#define	SOCKSIDE	0
> -#define	TAPSIDE		1
> -
>  #define PINGF(idx)		(&(FLOW(idx)->ping))
>  
>  /* Indexed by ICMP echo identifier */
> @@ -167,7 +163,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
>  	if (!flow)
>  		return NULL;
>  
> -	pingf = FLOW_SET_TYPE(flow, flowtype, ping, TAPSIDE);
> +	pingf = FLOW_SET_TYPE(flow, flowtype, ping);
>  
>  	pingf->seq = -1;
>  	pingf->id = id;
> @@ -180,7 +176,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
>  		bind_if = c->ip6.ifname_out;
>  	}
>  
> -	ref.flowside = FLOW_SIDX(flow, SOCKSIDE);
> +	ref.flowside = FLOW_SIDX(flow, FWDSIDE);
>  	pingf->sock = sock_l4(c, af, flow_proto[flowtype], bind_addr, bind_if,
>  			      0, ref.data);
>  
> diff --git a/tcp.c b/tcp.c
> index 65208ca..06401ba 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -303,10 +303,6 @@
>  
>  #include "flow_table.h"
>  
> -/* Sides of a flow as we use them in "tap" connections */
> -#define	SOCKSIDE	0
> -#define	TAPSIDE		1
> -
>  #define TCP_FRAMES_MEM			128
>  #define TCP_FRAMES							\
>  	(c->mode == MODE_PASST ? TCP_FRAMES_MEM : 1)
> @@ -581,7 +577,7 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
>  {
>  	int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
>  	union epoll_ref ref = { .type = EPOLL_TYPE_TCP, .fd = conn->sock,
> -				.flowside = FLOW_SIDX(conn, SOCKSIDE) };
> +				.flowside = FLOW_SIDX(conn, !conn->tapside), };
>  	struct epoll_event ev = { .data.u64 = ref.u64 };
>  
>  	if (conn->events == CLOSED) {
> @@ -1134,7 +1130,7 @@ static uint64_t tcp_conn_hash(const struct ctx *c,
>  static inline unsigned tcp_hash_probe(const struct ctx *c,
>  				      const struct tcp_tap_conn *conn)
>  {
> -	flow_sidx_t sidx = FLOW_SIDX(conn, TAPSIDE);
> +	flow_sidx_t sidx = FLOW_SIDX(conn, conn->tapside);
>  	unsigned b = tcp_conn_hash(c, conn) % TCP_HASH_TABLE_SIZE;
>  
>  	/* Linear probing */
> @@ -1154,7 +1150,7 @@ static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn)
>  {
>  	unsigned b = tcp_hash_probe(c, conn);
>  
> -	tc_hash[b] = FLOW_SIDX(conn, TAPSIDE);
> +	tc_hash[b] = FLOW_SIDX(conn, conn->tapside);
>  	flow_dbg(conn, "hash table insert: sock %i, bucket: %u", conn->sock, b);
>  }
>  
> @@ -2006,7 +2002,8 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
>  			goto cancel;
>  	}
>  
> -	conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp, TAPSIDE);
> +	conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
> +	conn->tapside = INISIDE;
>  	conn->sock = s;
>  	conn->timer = -1;
>  	conn_event(c, conn, TAP_SYN_RCVD);
> @@ -2725,9 +2722,9 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
>  				   const union sockaddr_inany *sa,
>  				   const struct timespec *now)
>  {
> -	struct tcp_tap_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp,
> -						  SOCKSIDE);
> +	struct tcp_tap_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
>  
> +	conn->tapside = FWDSIDE;
>  	conn->sock = s;
>  	conn->timer = -1;
>  	conn->ws_to_tap = conn->ws_from_tap = 0;
> @@ -2884,7 +2881,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
>  	struct tcp_tap_conn *conn = CONN(ref.flowside.flow);
>  
>  	ASSERT(conn->f.type == FLOW_TCP);
> -	ASSERT(ref.flowside.side == SOCKSIDE);
> +	ASSERT(ref.flowside.side == !conn->tapside);
>  
>  	if (conn->events == CLOSED)
>  		return;
> diff --git a/tcp_conn.h b/tcp_conn.h
> index d280b22..5df0076 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -13,6 +13,7 @@
>   * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced)
>   * @f:			Generic flow information
>   * @in_epoll:		Is the connection in the epoll set?
> + * @tapside:		Which side of the flow faces the tap/guest interface
>   * @tap_mss:		MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
>   * @sock:		Socket descriptor number
>   * @events:		Connection events, implying connection states
> @@ -39,6 +40,7 @@ struct tcp_tap_conn {
>  	struct flow_common f;
>  
>  	bool		in_epoll	:1;
> +	unsigned	tapside		:1;

This is a bit "far" from where the bit meaning is defined (flow.h).
Perhaps, in the comment:

 * @tapside: Which side (INISIDE/FWDSIDE) corresponds to the tap/guest interface

?

And this is almost too obvious to ask, but I'm not sure: why isn't this in
flow_common? I guess we'll need it for all the protocols, eventually, right?
Is it because otherwise we have 17 bits there?
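
To make the meaning of the bit concrete, here is a minimal sketch of how
I read it. INISIDE and FWDSIDE are taken from the flow.h hunk above;
struct sketch_conn and sketch_sockside() are hypothetical names used only
for illustration, not something from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define INISIDE	0	/* Initiating side, as defined in the patch */
#define FWDSIDE	1	/* Forwarded side, as defined in the patch */

/* Hypothetical, cut-down connection: only the bits relevant here */
struct sketch_conn {
	bool		in_epoll	:1;
	unsigned	tapside		:1;	/* INISIDE or FWDSIDE */
};

/* With exactly two sides, the socket side is whichever isn't the tap side */
static unsigned sketch_sockside(const struct sketch_conn *conn)
{
	return !conn->tapside;
}
```

So a connection started from the tap has tapside == INISIDE and its
socket on FWDSIDE, and the other way around for one accepted from a
socket, which matches the "!conn->tapside" uses in tcp.c.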

>  
>  #define TCP_RETRANS_BITS		3
>  	unsigned int	retrans		:TCP_RETRANS_BITS;
> @@ -106,7 +108,6 @@ struct tcp_tap_conn {
>  	uint32_t	seq_init_from_tap;
>  };
>  
> -#define SIDES			2
>  /**
>   * struct tcp_splice_conn - Descriptor for a spliced TCP connection
>   * @f:			Generic flow information
> diff --git a/tcp_splice.c b/tcp_splice.c
> index abe98a0..5da7021 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -472,7 +472,7 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
>  		return false;
>  	}
>  
> -	conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, tcp_splice, 0);
> +	conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, tcp_splice);
>  
>  	conn->flags = af == AF_INET ? 0 : SPLICE_V6;
>  	conn->s[0] = s0;

-- 
Stefano


* Re: [PATCH v5 01/19] flow: Clarify and enforce flow state transitions
       [not found]     ` <ZkbVxtvmP7f0aL1S@zatzit>
@ 2024-05-17 11:00       ` Stefano Brivio
  2024-05-18  6:47         ` David Gibson
  0 siblings, 1 reply; 28+ messages in thread
From: Stefano Brivio @ 2024-05-17 11:00 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri, 17 May 2024 13:57:58 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, May 16, 2024 at 11:30:58AM +0200, Stefano Brivio wrote:
> > On Tue, 14 May 2024 11:03:19 +1000
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > [...]
> >
> > >  /**
> > >   * struct flow_common - Common fields for packet flows
> > > + * @state:	State of the flow table entry
> > >   * @type:	Type of packet flow
> > >   */
> > >  struct flow_common {
> > > +	uint8_t		state;  
> > 
> > In this case, I would typically do
> > (https://seitan.rocks/seitan/tree/common/gluten.h?id=5a9302bab9c9bb3d1577f04678d074fb7af4115f#n53):
> > 
> > #ifdef __GNUC__
> > 	enum flow_state		state:8;
> > #else
> > 	uint8_t			state;
> > #endif  
> 
> I don't object to that, but I have two questions
> 
>   - What's the advantage to using the explicit enum?  Is that for the
>     benefit of static checkers and/or compiler diagnostics?

Yes: if we assign a value that's not in the enum, I expect static
checkers to complain. But also for humans: even with that ifdef, a
reader would know right away what values that might have.

>     AFAIK C
>     itself doesn't really treat enums any differently to integer
>     types.

Right.

>   - What's the need for GNUC?  Are enum bitfields a gnu extension?

They're rather permitted by gcc's interpretation of the standard:

  https://gcc.gnu.org/onlinedocs/gcc-14.1.0/gcc/Structures-unions-enumerations-and-bit-fields-implementation.html

  Allowable bit-field types other than _Bool, signed int, and unsigned
  int (C99 and C11 6.7.2.1).

  Other integer types, such as long int, and enumerated types are permitted
  even in strictly conforming mode.

but C11 says (6.7.2.1):

  A bit-field shall have a type that is a qualified or unqualified
  version of _Bool, signed int, unsigned int, or some other
  implementation-defined type. It is implementation-defined whether
  atomic types are permitted.

Is an enum a "version" of those? Maybe not. That was at least the
interpretation adopted by older gcc versions, up to 4.7.4:

  https://gcc.gnu.org/onlinedocs/gcc-4.7.4/gcc/Structures-unions-enumerations-and-bit-fields-implementation.html

  Allowable bit-field types other than _Bool, signed int, and unsigned
  int (C99 6.7.2.1).

  No other types are permitted in strictly conforming mode.

Did that change from C99? Not really:

  A bit-field shall have a type that is a qualified or unqualified
  version of _Bool, signed int, unsigned int, or some other
  implementation-defined type.

>     Even versus C11, which we already require?

Yes, I would say it doesn't change things. Only C23 would improve that:
  https://open-std.org/JTC1/SC22/WG14/www/docs/n3030.htm#design-constant.type

by allowing us to define the underlying type.
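
For reference, a self-contained sketch of the pattern I have in mind.
The bf_* names are hypothetical, and the enum is a stand-in for enum
flow_state. Note that I'm only showing the field declaration and access
here: the overall size and alignment of the struct may differ between
the two branches, which this sketch doesn't check.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for enum flow_state */
enum bf_state {
	BF_FREE,
	BF_NEW,
	BF_TYPED,
	BF_ACTIVE,
};

/* Hypothetical stand-in for struct flow_common */
struct bf_common {
#ifdef __GNUC__
	/* Enumerated bit-field: accepted by gcc as an implementation-defined
	 * bit-field type, and tells readers and checkers the valid values */
	enum bf_state	state:8;
#else
	/* Portable fallback for compilers without enum bit-fields */
	uint8_t		state;
#endif
	uint8_t		type;
};

/* Read the state back as the enum type, whichever branch was compiled */
static enum bf_state bf_get_state(const struct bf_common *f)
{
	return (enum bf_state)f->state;
}
```

Either way the users of the field look the same, only the declaration
changes.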

> > ...and in any case we need to make sure to assign single values in the
> > enum above: there are no guarantees that FLOW_STATE_ACTIVE is 3
> > otherwise (except for that static_assert(), but that's not its purpose).  
> 
> I'm not clear how this comment relates to the one before.

It's unrelated, but:

> AFAIK
> nothing in here (or the rest of the series) relies on the specific
> numeric values of the flow state values (although we do rely on them
> being ordered as written in some places).

while we rely on the fact that no value is bigger than 255, I realised
that the standards already guarantee that values start from 0 and every
subsequent constant is defined as one more than the previous one, all
the way from C90 to C11, so this would actually be fine.

Sorry, I don't know exactly why I thought that wouldn't be the case, I
was pretty sure of the opposite until I checked.

-- 
Stefano



* Re: [PATCH v5 01/19] flow: Clarify and enforce flow state transitions
  2024-05-17 11:00       ` Stefano Brivio
@ 2024-05-18  6:47         ` David Gibson
  0 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2024-05-18  6:47 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On Fri, May 17, 2024 at 01:00:44PM +0200, Stefano Brivio wrote:
> On Fri, 17 May 2024 13:57:58 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, May 16, 2024 at 11:30:58AM +0200, Stefano Brivio wrote:
> > > On Tue, 14 May 2024 11:03:19 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > [...]
> > >
> > > >  /**
> > > >   * struct flow_common - Common fields for packet flows
> > > > + * @state:	State of the flow table entry
> > > >   * @type:	Type of packet flow
> > > >   */
> > > >  struct flow_common {
> > > > +	uint8_t		state;  
> > > 
> > > In this case, I would typically do
> > > (https://seitan.rocks/seitan/tree/common/gluten.h?id=5a9302bab9c9bb3d1577f04678d074fb7af4115f#n53):
> > > 
> > > #ifdef __GNUC__
> > > 	enum flow_state		state:8;
> > > #else
> > > 	uint8_t			state;
> > > #endif  
> > 
> > I don't object to that, but I have two questions
> > 
> >   - What's the advantage to using the explicit enum?  Is that for the
> >     benefit of static checkers and/or compiler diagnostics?
> 
> Yes: if we assign a value that's not in the enum, I expect static
> checkers to complain. But also for humans: even with that ifdef, a
> reader would know right away what values that might have.
> 
> >     AFAIK C
> >     itself doesn't really treat enums any differently to integer
> >     types.
> 
> Right.
> 
> >   - What's the need for GNUC?  Are enum bitfields a gnu extension?
> 
> They're rather permitted by gcc's interpretation of the standard:
> 
>   https://gcc.gnu.org/onlinedocs/gcc-14.1.0/gcc/Structures-unions-enumerations-and-bit-fields-implementation.html
> 
>   Allowable bit-field types other than _Bool, signed int, and unsigned
>   int (C99 and C11 6.7.2.1).
> 
>   Other integer types, such as long int, and enumerated types are permitted
>   even in strictly conforming mode.
> 
> but C11 says (6.7.2.1):
> 
>   A bit-field shall have a type that is a qualified or unqualified
>   version of _Bool, signed int, unsigned int, or some other
>   implementation-defined type. It is implementation-defined whether
>   atomic types are permitted.
> 
> Is an enum a "version" of those? Maybe not. That was at least the
> interpretation adopted by older gcc versions, up to 4.7.4:
> 
>   https://gcc.gnu.org/onlinedocs/gcc-4.7.4/gcc/Structures-unions-enumerations-and-bit-fields-implementation.html
> 
>   Allowable bit-field types other than _Bool, signed int, and unsigned
>   int (C99 6.7.2.1).
> 
>   No other types are permitted in strictly conforming mode.
> 
> Did that change from C99? Not really:
> 
>   A bit-field shall have a type that is a qualified or unqualified
>   version of _Bool, signed int, unsigned int, or some other
>   implementation-defined type.
> 
> >     Even versus C11, which we already require?
> 
> Yes, I would say it doesn't change things. Only C23 would improve that:
>   https://open-std.org/JTC1/SC22/WG14/www/docs/n3030.htm#design-constant.type
> 
> by allowing us to define the underlying type.

Ah, ok.  Makes sense, I've made that change.
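For reference, a minimal sketch of the construct being discussed.  This
is not the struct from the series: the ex_ names and the shortened state
list are hypothetical, only the #ifdef pattern is taken from the quoted
suggestion.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, shortened state list, for illustration only */
enum ex_flow_state {
	EX_STATE_FREE,
	EX_STATE_NEW,
	EX_STATE_ACTIVE,
	EX_NUM_STATES,
};

struct ex_flow_common {
#ifdef __GNUC__
	/* gcc documents enumerated types as allowable bit-field types */
	enum ex_flow_state	state:8;
#else
	/* Portable fallback: a plain 8-bit integer, enum type is lost */
	uint8_t			state;
#endif
	uint8_t			type;
};

/* Whichever branch is compiled, every state has to fit in 8 bits */
static_assert(EX_NUM_STATES <= UINT8_MAX + 1, "state must fit in 8 bits");
```

With the enum-typed field a reader (and, one would hope, a static
checker) can see which values are legal; the fallback keeps the code
building where enum bit-fields are not accepted.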

> > > ...and in any case we need to make sure to assign single values in the
> > > enum above: there are no guarantees that FLOW_STATE_ACTIVE is 3
> > > otherwise (except for that static_assert(), but that's not its purpose).  
> > 
> > I'm not clear how this comment relates to the one before.
> 
> It's unrelated, but:
> 
> > AFAIK
> > nothing in here (or the rest of the series) relies on the specific
> > numeric values of the flow state values (although we do rely on them
> > being ordered as written in some places).
> 
> while we rely on the fact that no value is bigger than 255, I realised
> that the standards already guarantee that values start from 0 and every
> subsequent constant is defined as one more than the previous one, all
> the way from C90 to C11, so this would actually be fine.

Right, I was expecting that behaviour - the way we define
FLOW_NUM_STATES and similar things in a bunch of places relies on
this.

> Sorry, I don't know exactly why I thought that wouldn't be the case, I
> was pretty sure of the opposite until I checked.

Ok.  I've scattered in some extra static_assert()s in the vicinity, too.
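As an illustration of the numbering guarantee and of the kind of
static_assert() that can sit next to it, here is a small sketch.  The
ex_ names and the table are hypothetical, not the checks actually added
to the series.

```c
#include <assert.h>
#include <stddef.h>

enum ex_state {
	EX_FREE,	/* first enumerator without '=' has value 0 */
	EX_NEW,		/* each later one is the previous value plus one */
	EX_INI,
	EX_ACTIVE,
	EX_NUM_STATES,	/* hence equal to the number of states above */
};

/* Name table indexed by state, sized implicitly by its initialisers */
static const char *const ex_state_name[] = {
	[EX_FREE]	= "FREE",
	[EX_NEW]	= "NEW",
	[EX_INI]	= "INI",
	[EX_ACTIVE]	= "ACTIVE",
};

/* The table must cover every state exactly */
static_assert(sizeof(ex_state_name) / sizeof(ex_state_name[0]) ==
	      EX_NUM_STATES, "state name table out of sync with enum");

/* States are stored in 8 bits */
static_assert(EX_NUM_STATES <= 256, "too many states for 8 bits");

/* Some code compares states, so the order as written matters */
static_assert(EX_NEW < EX_INI && EX_INI < EX_ACTIVE,
	      "state ordering changed");
```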

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH v5 06/19] flow: Populate address information for initiating side
       [not found]       ` <20240517215845.4d09eaae@elisabeth>
@ 2024-05-18  7:00         ` David Gibson
  0 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2024-05-18  7:00 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On Fri, May 17, 2024 at 09:58:45PM +0200, Stefano Brivio wrote:
> On Fri, 17 May 2024 14:27:46 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, May 16, 2024 at 08:23:37PM +0200, Stefano Brivio wrote:
> > > On Tue, 14 May 2024 11:03:24 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > This requires the address and port information for the initiating side be
> > > > populated when a flow enters INI state.  Implement that for TCP and ICMP.
> > > > 
> > > > For now this leaves some information redundantly recorded in both generic
> > > > and type specific fields.  We'll fix that in later patches.
> > > > 
> > > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > > ---
> > > >  flow.c       | 92 +++++++++++++++++++++++++++++++++++++++++++++++++---
> > > >  flow_table.h |  8 ++++-
> > > >  icmp.c       | 10 ++++--
> > > >  tcp.c        |  4 +--
> > > >  4 files changed, 103 insertions(+), 11 deletions(-)
> > > > 
> > > > diff --git a/flow.c b/flow.c
> > > > index aee2736..3d5b3a5 100644
> > > > --- a/flow.c
> > > > +++ b/flow.c
> > > > @@ -108,6 +108,31 @@ static const union flow *flow_new_entry; /* = NULL */
> > > >  /* Last time the flow timers ran */
> > > >  static struct timespec flow_timer_run;
> > > >  
> > > > +/** flowside_from_af() - Initialise flowside from addresses
> > > > + * @fside:	flowside to initialise
> > > > + * @af:		Address family (AF_INET or AF_INET6)
> > > > + * @eaddr:	Endpoint address (pointer to in_addr or in6_addr)
> > > > + * @eport:	Endpoint port
> > > > + * @faddr:	Forwarding address (pointer to in_addr or in6_addr)
> > > > + * @fport:	Forwarding port  
> > > 
> > > This starts showing a mild inconsistency:
> > > 
> > > - initiating / forwarding sides  
> > 
> > So I was trying to go with "initiating" and "forwarded" sides here,
> > distinct from "forwarding" used for addresses.  But the potential for
> > confusion between "forwarded" and "forwarding" is way too high, I
> > agree.
> > 
> > I considered "accepting side", but I don't like that because a) in the
> > case of UDP "accepting" isn't really a thing and b) there's no
> > guarantee that the peer on that side will actually "accept" the
> > "connection".
> > 
> > "receiving side" has all those problems and more.
> > 
> > > - endpoint / forwarding addresses
> > > 
> > > The initiating side can correspond to the endpoint address ("ours") or
> > > to the forwarding address ("theirs").  
> > 
> > Not sure what you mean by that - both initiating side and forwarded
> > side each have both an endpoint and a forwarding address.
> > 
> > > The forwarding side doesn't necessarily correspond to the forwarding
> > > address.  
> > 
> > Right :/
> > 
> > > But I have no better ideas than "initiating" and "forwarding" for
> > > sides. Should we consider something different for addresses and ports,
> > > such as "ours" and "theirs"?  
> > 
> > I dislike those because I think it's not always clear if "us" is just
> > pasta, or includes the guest.  Plus the distinction here is between
> > the side that initiates the flow and the.. other one.. that doesn't
> > correspond to "ours" or "theirs" in any consistent way I can see.
> > 
> > > Local/remote might be misleading as well.
> > > Peer, passt and pasta all share the same initial.  
> > 
> > "initiating side" / "target side" maybe?  Or just "primary side" /
> > "secondary side" or "A side" / "B side"?
> > 
> > Actually, I think I like "target" - I feel like "flow_target()"
> > conveys what "flow_forward()" does pretty well.  I'll go with that
> > pending any further suggestions you have.
> 
> Yes, "target" sounds pretty good to me as well. So we would have
> "endpoint" and "forwarding" addresses, but "initiating" and "target"
> side, right?

Yes, and all four combinations are valid.  We have an initiating
endpoint address, an initiating forwarding address, a target
forwarding address and a target endpoint address.
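To make the four combinations concrete, a rough sketch of the layout.
It assumes struct in6_addr as a stand-in for the real address type and
uses hypothetical ex_ names; the actual structures in the series
differ.

```c
#include <assert.h>
#include <stdint.h>
#include <netinet/in.h>

enum { EX_INISIDE = 0, EX_TGTSIDE = 1, EX_SIDES = 2 };

/* One side of a flow: an endpoint and a forwarding address/port pair */
struct ex_flowside {
	struct in6_addr	eaddr;	/* endpoint address */
	struct in6_addr	faddr;	/* forwarding address */
	in_port_t	eport;	/* endpoint port */
	in_port_t	fport;	/* forwarding port */
};

/* pifs are kept apart from the addresses, as in the cover letter */
struct ex_flow {
	uint8_t			pif[EX_SIDES];
	struct ex_flowside	side[EX_SIDES];
};

/* Pick one of the four addresses: (initiating|target) x (endpoint|forwarding) */
static const struct in6_addr *ex_flow_addr(const struct ex_flow *f,
					   unsigned side, int forwarding)
{
	assert(f && side < EX_SIDES);

	return forwarding ? &f->side[side].faddr : &f->side[side].eaddr;
}
```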

> An alternative could be "client" / "server", it should be always
> correct, but it would sound a bit strange for ICMP and UDP.

I tend to agree.

> > [snip]
> > > > @@ -150,18 +177,28 @@ static void flow_set_state(struct flow_common *f, enum flow_state state)
> > > >  		  FLOW_STATE(f));
> > > >  
> > > >  	if (MAX(state, oldstate) >= FLOW_STATE_FWD)
> > > > -		flow_log_(f, LOG_DEBUG, "%s => %s", pif_name(f->pif[INISIDE]),
> > > > -			                            pif_name(f->pif[FWDSIDE]));
> > > > +		flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => %s",
> > > > +			  pif_name(f->pif[INISIDE]),
> > > > +			  inany_ntop(&ini->eaddr, estr, sizeof(estr)),
> > > > +			  ini->eport,
> > > > +			  inany_ntop(&ini->faddr, fstr, sizeof(fstr)),
> > > > +			  ini->fport,
> > > > +			  pif_name(f->pif[FWDSIDE]));
> > > >  	else if (MAX(state, oldstate) >= FLOW_STATE_INI)
> > > > -		flow_log_(f, LOG_DEBUG, "%s => ?", pif_name(f->pif[INISIDE]));
> > > > +		flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?",
> > > > +			  pif_name(f->pif[INISIDE]),
> > > > +			  inany_ntop(&ini->eaddr, estr, sizeof(estr)),
> > > > +			  ini->eport,
> > > > +			  inany_ntop(&ini->faddr, fstr, sizeof(fstr)),
> > > > +			  ini->fport);  
> > > 
> > > If I review v2 of "util, tcp: Add helper to display socket addresses",
> > > can I skip reviewing this? :)  
> > 
> > Alas, no - this isn't based on a socket address.  However, I have been
> > looking at introducing a flowside_ntop() function which would help in
> > a similar way here.
> 
> Right, I realised only later that your patch only took care of socket
> addresses... anyway, reviewed, this hunk also looks correct to me.

It turns out to be harder to write / use flowside_ntop() than I
thought.  In the current log messages, the two flowsides are actually
displayed in reverse order: endpoint then forwarding for the
initiating side, forwarding then endpoint for the target side.  That
gives the overall order that makes the most sense to me, because it's
the order the payload of the initiating packet goes through.

But that means we'd need an order parameter for flowside_ntop() or
something, and it all becomes a bit of a mess.
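For what it's worth, the "order parameter" variant might look roughly
like the sketch below.  ex_side_fmt() and struct ex_side are
hypothetical, and addresses are pre-formatted strings here to keep the
example short; this is not a proposed flowside_ntop().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified side: addresses already converted to strings */
struct ex_side {
	const char	*eaddr;
	const char	*faddr;
	uint16_t	eport;
	uint16_t	fport;
};

/* Format one side, either endpoint first (initiating side) or
 * forwarding first (target side).  Returns what snprintf() returns.
 */
static int ex_side_fmt(char *buf, size_t len, const struct ex_side *s,
		       bool fwd_first)
{
	assert(buf && s);

	if (fwd_first)
		return snprintf(buf, len, "[%s]:%hu -> [%s]:%hu",
				s->faddr, s->fport, s->eaddr, s->eport);

	return snprintf(buf, len, "[%s]:%hu -> [%s]:%hu",
			s->eaddr, s->eport, s->faddr, s->fport);
}
```

Every caller then has to remember which order goes with which side,
which is the mess referred to above.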

> > > >  }
> > > >  
> > > >  /**
> > > > - * flow_initiate() - Move flow to INI state, setting INISIDE details
> > > > + * flow_initiate_() - Move flow to INI state, setting pif[INISIDE]
> > > >   * @flow:	Flow to change state
> > > >   * @pif:	pif of the initiating side
> > > >   */
> > > > -void flow_initiate(union flow *flow, uint8_t pif)
> > > > +static void flow_initiate_(union flow *flow, uint8_t pif)  
> > > 
> > > I don't feel like the underscore here is really necessary.  
> > 
> > Yeah, I'm trying to convey that it's not safe to call this directly -
> > it won't perform a correct state transition without the extra logic in
> > flow_initiate_af() and flow_initiate_sa().
> 
> Ah, I see now, okay.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH v5 15/19] icmp: Use flowsides as the source of truth wherever possible
       [not found]       ` <20240517221123.1c7197a3@elisabeth>
@ 2024-05-18  7:08         ` David Gibson
  0 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2024-05-18  7:08 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On Fri, May 17, 2024 at 10:11:23PM +0200, Stefano Brivio wrote:
> On Fri, 17 May 2024 16:58:38 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, May 16, 2024 at 10:53:50PM +0200, Stefano Brivio wrote:
> > > On Tue, 14 May 2024 11:03:33 +1000
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > icmp_sock_handler() obtains the guest address from its most recently
> > > > observed IP, and the ICMP id from the epoll reference.  Both of these
> > > > can be obtained readily from the flow.
> > > > 
> > > > icmp_tap_handler() builds its socket address for sendto() directly
> > > > from the destination address supplied by the incoming tap packet.
> > > > This can instead be generated from the flow.
> > > > 
> > > > struct icmp_ping_flow contains a field for the ICMP id of the ping, but
> > > > this is now redundant, since the id is also stored as the "port" in the
> > > > common flowsides.
> > > > 
> > > > Using the flowsides as the common source of truth here prepares us for
> > > > allowing more flexible NAT and forwarding by properly initialising
> > > > that flowside information.
> > > > 
> > > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > > ---
> > > >  icmp.c      | 37 ++++++++++++++++++++++---------------
> > > >  icmp_flow.h |  1 -
> > > >  tap.c       | 11 -----------
> > > >  tap.h       |  1 -
> > > >  4 files changed, 22 insertions(+), 28 deletions(-)
> > > > 
> > > > diff --git a/icmp.c b/icmp.c
> > > > index 37a3586..1e9a05e 100644
> > > > --- a/icmp.c
> > > > +++ b/icmp.c
> > > > @@ -58,6 +58,7 @@ static struct icmp_ping_flow *icmp_id_map[IP_VERSIONS][ICMP_NUM_IDS];
> > > >  void icmp_sock_handler(const struct ctx *c, union epoll_ref ref)
> > > >  {
> > > >  	struct icmp_ping_flow *pingf = PINGF(ref.flowside.flow);
> > > > +	const struct flowside *ini = &pingf->f.side[INISIDE];
> > > >  	union sockaddr_inany sr;
> > > >  	socklen_t sl = sizeof(sr);
> > > >  	char buf[USHRT_MAX];
> > > > @@ -83,7 +84,7 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref)
> > > >  			goto unexpected;
> > > >  
> > > >  		/* Adjust packet back to guest-side ID */
> > > > -		ih4->un.echo.id = htons(pingf->id);
> > > > +		ih4->un.echo.id = htons(ini->eport);
> > > >  		seq = ntohs(ih4->un.echo.sequence);
> > > >  	} else if (pingf->f.type == FLOW_PING6) {
> > > >  		struct icmp6hdr *ih6 = (struct icmp6hdr *)buf;
> > > > @@ -93,7 +94,7 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref)
> > > >  			goto unexpected;
> > > >  
> > > >  		/* Adjust packet back to guest-side ID */
> > > > -		ih6->icmp6_identifier = htons(pingf->id);
> > > > +		ih6->icmp6_identifier = htons(ini->eport);
> > > >  		seq = ntohs(ih6->icmp6_sequence);
> > > >  	} else {
> > > >  		ASSERT(0);
> > > > @@ -108,13 +109,20 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref)
> > > >  	}
> > > >  
> > > >  	flow_dbg(pingf, "echo reply to tap, ID: %"PRIu16", seq: %"PRIu16,
> > > > -		 pingf->id, seq);
> > > > +		 ini->eport, seq);
> > > >  
> > > > -	if (pingf->f.type == FLOW_PING4)
> > > > -		tap_icmp4_send(c, sr.sa4.sin_addr, tap_ip4_daddr(c), buf, n);
> > > > -	else if (pingf->f.type == FLOW_PING6)
> > > > -		tap_icmp6_send(c, &sr.sa6.sin6_addr,
> > > > -			       tap_ip6_daddr(c, &sr.sa6.sin6_addr), buf, n);
> > > > +	if (pingf->f.type == FLOW_PING4) {
> > > > +		const struct in_addr *saddr = inany_v4(&ini->faddr);
> > > > +		const struct in_addr *daddr = inany_v4(&ini->eaddr);
> > > > +
> > > > +		ASSERT(saddr && daddr); /* Must have IPv4 addresses */  
> > > 
> > > ...are those somehow special compared to IPv6 ones?  
> > 
> > Well, we're about to call a function that's specific to IPv4 ICMP and
> > will require IPv4 addresses for both ends.
> 
> Yes, I understand that part, but I was wondering why "Must have IPv4
> addresses" and not IPv6 ones below. On the other hand, due to how the
> inany thing works, we'll always have IPv6 addresses "set" there.

Well, there are different strengths of requirement here.  We simply
can't proceed without an IPv4 address here: if we tried we'd either
dereference a NULL pointer returned by inany_v4() or we'd take a
garbage chunk out of an IPv6 address.

For the IPv6 case, if the addresses are v4 it means we'll send ICMPv6
packets with IPv4 mapped addresses in them.  That's kinda weird, but
not necessarily wrong.
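The asymmetry comes from how an IPv4 address is carried inside an IPv6
one.  A minimal sketch, using a hypothetical ex_v4_from_mapped() helper
that reports failure through its return value instead of returning a
possibly NULL pointer the way inany_v4() does:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <netinet/in.h>

/* Extract the IPv4 address from an IPv4-mapped IPv6 address
 * (::ffff:a.b.c.d).  Returns false, leaving *out untouched, if @a6 is
 * not IPv4-mapped: there is then no meaningful IPv4 address to use.
 */
static bool ex_v4_from_mapped(const struct in6_addr *a6, struct in_addr *out)
{
	assert(a6 && out);

	if (!IN6_IS_ADDR_V4MAPPED(a6))
		return false;

	/* The last four bytes hold the IPv4 address, network order */
	memcpy(out, &a6->s6_addr[12], sizeof(*out));
	return true;
}
```

Going the other way always works: any IPv4 address has a mapped IPv6
representation, which is why only the IPv4 path needs the check.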

> 
> > 
> > [snip]
> > > > diff --git a/icmp_flow.h b/icmp_flow.h
> > > > index 5a2eed9..f053211 100644
> > > > --- a/icmp_flow.h
> > > > +++ b/icmp_flow.h
> > > > @@ -22,7 +22,6 @@ struct icmp_ping_flow {
> > > >  	int seq;
> > > >  	int sock;
> > > >  	time_t ts;
> > > > -	uint16_t id;  
> > > 
> > > Nit: drop 'id' from struct comment.  
> > 
> > Fixed, thanks.
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH v5 18/19] flow, tcp: Flow based NAT and port forwarding for TCP
       [not found]   ` <20240518001345.2d127b09@elisabeth>
@ 2024-05-20  5:44     ` David Gibson
  0 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2024-05-20  5:44 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On Sat, May 18, 2024 at 12:13:45AM +0200, Stefano Brivio wrote:
> On Tue, 14 May 2024 11:03:36 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > Currently the code to translate host side addresses and ports to guest side
> > addresses and ports, and vice versa, is scattered across the TCP code.
> > This includes both port redirection as controlled by the -t and -T options,
> > and our special case NAT controlled by the --no-map-gw option.
> > 
> > Gather this logic into fwd_from_*() functions for each input interface
> > in fwd.c which take protocol and address information for the initiating
> > side and generates the pif and address information for the forwarded side.
> > This performs any NAT or port forwarding needed.
> > 
> > We create a flow_forward() helper which applies those forwarding functions
> > as needed to automatically move a flow from INI to FWD state.  For now we
> > leave the older flow_forward_af() function taking explicit addresses as
> > a transitional tool.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  flow.c       |  53 +++++++++++++++++++++++++
> >  flow_table.h |   2 +
> >  fwd.c        | 110 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  fwd.h        |  12 ++++++
> >  tcp.c        | 102 +++++++++++++++--------------------------------
> >  tcp_splice.c |  63 ++---------------------------
> >  tcp_splice.h |   5 +--
> >  7 files changed, 213 insertions(+), 134 deletions(-)
> > 
> > diff --git a/flow.c b/flow.c
> > index 4942075..a6afe39 100644
> > --- a/flow.c
> > +++ b/flow.c
> > @@ -304,6 +304,59 @@ const struct flowside *flow_forward_af(union flow *flow, uint8_t pif,
> >  	return fwd;
> >  }
> >  
> > +
> > +/**
> > + * flow_forward() - Determine where flow should forward to, and move to FWD
> > + * @c:		Execution context
> > + * @flow:	Flow to forward
> > + * @proto:	Protocol
> > + *
> > + * Return: pointer to the forwarded flowside information
> > + */
> > +const struct flowside *flow_forward(const struct ctx *c, union flow *flow,
> > +				    uint8_t proto)
> > +{
> > +	char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN];
> > +	struct flow_common *f = &flow->f;
> > +	const struct flowside *ini = &f->side[INISIDE];
> > +	struct flowside *fwd = &f->side[FWDSIDE];
> > +	uint8_t pif1 = PIF_NONE;
> 
> This could now be 'pif_fwd' / 'pif_tgt', right?

Good idea, changed.

[snip]
> > diff --git a/fwd.c b/fwd.c
> > index b3d5a37..5fe2361 100644
> > --- a/fwd.c
> > +++ b/fwd.c
> > @@ -25,6 +25,7 @@
> >  #include "fwd.h"
> >  #include "passt.h"
> >  #include "lineread.h"
> > +#include "flow_table.h"
> >  
> >  /* See enum in kernel's include/net/tcp_states.h */
> >  #define UDP_LISTEN	0x07
> > @@ -154,3 +155,112 @@ void fwd_scan_ports_init(struct ctx *c)
> >  				   &c->tcp.fwd_out, &c->tcp.fwd_in);
> >  	}
> >  }
> > +
> > +uint8_t fwd_from_tap(const struct ctx *c, uint8_t proto,
> > +		     const struct flowside *a, struct flowside *b)
> 
> A function comment would be nice to have, albeit a bit redundant.

Ah, yes.  I meant to go back and add these, but obviously forgot.
Fixed now.

> Now
> 'a' and 'b' could also be called 'ini' and 'tgt' I guess?

Also a good idea, done.

> > +{
> > +	(void)proto;
> > +
> > +	b->eaddr = a->faddr;
> > +	b->eport = a->fport;
> > +
> > +	if (!c->no_map_gw) {
> > +		struct in_addr *v4 = inany_v4(&b->eaddr);
> > +
> > +		if (v4 && IN4_ARE_ADDR_EQUAL(v4, &c->ip4.gw))
> > +			*v4 = in4addr_loopback;
> > +		if (IN6_ARE_ADDR_EQUAL(&b->eaddr, &c->ip6.gw))
> > +			b->eaddr.a6 = in6addr_loopback;
> 
> I haven't tested this, but I'm a bit lost: I thought that in this case
> we would also set b->faddr here. Where does that happen?

Ah.. right.  So notionally we should set tgt->faddr here.  However,
because in this case we're forwarding to PIF_HOST we don't actually
know tgt->faddr (or tgt->fport) without a getsockname() call, so we're
leaving them blank.  They will, in fact, be blank because we zero the
entire entry in flow_alloc().

That's pretty non-obvious though, I'll change this to explicitly set
faddr and fport with a comment.
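Roughly what that explicit version could look like, reduced to IPv4
only.  This is a sketch under simplifying assumptions: struct ex_side4,
struct ex_ctx4 and ex_fwd_from_tap4() are hypothetical and it is not
the code that went into the next revision.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <netinet/in.h>

struct ex_side4 {
	struct in_addr	eaddr, faddr;
	in_port_t	eport, fport;	/* host byte order */
};

struct ex_ctx4 {
	struct in_addr	gw;		/* guest's default gateway */
	bool		no_map_gw;	/* --no-map-gw given */
};

static void ex_fwd_from_tap4(const struct ex_ctx4 *c,
			     const struct ex_side4 *ini,
			     struct ex_side4 *tgt)
{
	assert(c && ini && tgt);

	/* What the guest addressed is the endpoint we connect to */
	tgt->eaddr = ini->faddr;
	tgt->eport = ini->fport;

	/* Unless disabled, the gateway address stands for the host */
	if (!c->no_map_gw && tgt->eaddr.s_addr == c->gw.s_addr)
		tgt->eaddr.s_addr = htonl(INADDR_LOOPBACK);

	/* Our own address and port on the host side aren't known without
	 * getsockname(): mark them unspecified explicitly rather than
	 * relying on the entry having been zeroed at allocation.
	 */
	tgt->faddr.s_addr = htonl(INADDR_ANY);
	tgt->fport = 0;
}
```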

> > +	}
> > +
> > +	return PIF_HOST;
> > +}
> > +
> > +uint8_t fwd_from_splice(const struct ctx *c, uint8_t proto,
> > +			const struct flowside *a, struct flowside *b)
> > +{
> > +	const struct in_addr *ae4 = inany_v4(&a->eaddr);
> > +
> > +	if (!inany_is_loopback(&a->eaddr) ||
> > +	    (!inany_is_loopback(&a->faddr) && !inany_is_unspecified(&a->faddr))) {
> > +		char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN];
> > +
> > +		debug("Non loopback address on %s: [%s]:%hu -> [%s]:%hu",
> > +		      pif_name(PIF_SPLICE),
> > +		      inany_ntop(&a->eaddr, estr, sizeof(estr)), a->eport,
> > +		      inany_ntop(&a->faddr, fstr, sizeof(fstr)), a->fport);
> > +		return PIF_NONE;
> > +	}
> > +
> > +	if (ae4)
> > +		inany_from_af(&b->eaddr, AF_INET, &in4addr_loopback);
> > +	else
> > +		inany_from_af(&b->eaddr, AF_INET6, &in6addr_loopback);
> > +
> > +	b->eport = a->fport;
> > +
> > +	if (proto == IPPROTO_TCP)
> > +		b->eport += c->tcp.fwd_out.delta[b->eport];
> > +
> > +	return PIF_HOST;
> > +}
> > +
> > +uint8_t fwd_from_host(const struct ctx *c, uint8_t proto,
> > +		      const struct flowside *a, struct flowside *b)
> > +{
> > +	struct in_addr *bf4;
> > +
> > +	if (c->mode == MODE_PASTA && inany_is_loopback(&a->eaddr) &&
> > +	    proto == IPPROTO_TCP) {
> > +		/* spliceable */
> 
> Before we conclude this, does f->pif[INISIDE] == PIF_HOST in the caller
> guarantee that inany_is_loopback(&a->faddr), too?

Only in the sense that if we accept()ed a connection from a loopback
address on a socket not bound to a loopback address (or ANY), then the
kernel has done something wrong.  This kind of has the inverse of the
issue above: we don't necessarily know the forwarding address here -
we only know that with either a getsockname(), or by looking at the
bound address of the listening socket (which might be unspecified).

> If not, we shouldn't
> splice unless that's true as well.

So I'm pretty confident what we do here is equivalent to what we did
before.  That might not be correct, but fixing that is for a different
patch.  Making problems like that more obvious is one of the
advantages I expect for gathering all this forwarding logic into one
place.

> > +		b->faddr = a->eaddr;
> > +
> > +		if (inany_v4(&a->eaddr))
> > +			inany_from_af(&b->eaddr, AF_INET, &in4addr_loopback);
> > +		else
> > +			inany_from_af(&b->eaddr, AF_INET6, &in6addr_loopback);
> > +		b->eport = a->fport;
> > +		if (proto == IPPROTO_TCP)
> > +			b->eport += c->tcp.fwd_in.delta[b->eport];
> > +
> > +		return PIF_SPLICE;
> > +	}
> > +
> > +	b->faddr = a->eaddr;
> > +	b->fport = a->eport;
> > +
> > +	bf4 = inany_v4(&b->faddr);
> > +
> > +	if (bf4) {
> > +		if (IN4_IS_ADDR_LOOPBACK(bf4) ||
> > +		    IN4_IS_ADDR_UNSPECIFIED(bf4) ||
> > +		    IN4_ARE_ADDR_EQUAL(bf4, &c->ip4.addr_seen))
> > +			*bf4 = c->ip4.gw;
> > +	} else {
> > +		struct in6_addr *bf6 = &b->faddr.a6;
> > +
> > +		if (IN6_IS_ADDR_LOOPBACK(bf6) ||
> > +		    IN6_ARE_ADDR_EQUAL(bf6, &c->ip6.addr_seen) ||
> > +		    IN6_ARE_ADDR_EQUAL(bf6, &c->ip6.addr)) {
> > +			if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
> > +				*bf6 = c->ip6.gw;
> > +			else
> > +				*bf6 = c->ip6.addr_ll;
> > +		}
> > +	}
> > +
> > +	if (bf4) {
> > +		inany_from_af(&b->eaddr, AF_INET, &c->ip4.addr_seen);
> > +	} else {
> > +		if (IN6_IS_ADDR_LINKLOCAL(&b->faddr.a6))
> > +			b->eaddr.a6 = c->ip6.addr_ll_seen;
> > +		else
> > +			b->eaddr.a6 = c->ip6.addr_seen;
> > +	}
> > +
> > +	b->eport = a->fport;
> > +	if (proto == IPPROTO_TCP)
> > +		b->eport += c->tcp.fwd_in.delta[b->eport];
> 
> As we do this in any case, spliced or not spliced, I would find it less
> confusing to have these assignments in common, earlier (I just spent
> half an hour trying to figure out why you wouldn't set b->eport for the
> non-spliced case...).

Fair point.  This was just because I thought my way through the two
cases separately.  I've made this stanza common.
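The port part of that common stanza boils down to a table lookup.  A
small sketch, assuming a per-port delta table of 16-bit values like the
one indexed in the quoted code; struct ex_fwd_ports and
ex_target_port() are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>
#include <netinet/in.h>

#define EX_NUM_PORTS	(1U << 16)

struct ex_fwd_ports {
	in_port_t delta[EX_NUM_PORTS];	/* offset per port, 0 if unmapped */
};

/* Translate the port the initiating side addressed into the target
 * endpoint port.  in_port_t is 16 bits wide, so the sum wraps modulo
 * 65536: mapping a port downwards by n is stored as 65536 - n.
 */
static in_port_t ex_target_port(const struct ex_fwd_ports *fwd,
				in_port_t port)
{
	assert(fwd);

	return (in_port_t)(port + fwd->delta[port]);
}
```

Since this doesn't depend on whether the flow ends up spliced, it can
be done once before the two cases diverge.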

[snip]
> > +static void tcp_tap_conn_from_sock(struct ctx *c, union flow *flow, int s,
> >  				   const struct timespec *now)
> >  {
> > -	union inany_addr saddr, daddr; /* FIXME: avoid bulky temporaries */
> > -	struct tcp_tap_conn *conn;
> > -	in_port_t srcport;
> > +	struct tcp_tap_conn *conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
> >  	uint64_t hash;
> >  
> > -	inany_from_sockaddr(&saddr, &srcport, sa);
> > -	tcp_snat_inbound(c, &saddr);
> > -
> > -	if (inany_v4(&saddr)) {
> > -		inany_from_af(&daddr, AF_INET, &c->ip4.addr_seen);
> > -	} else {
> > -		if (IN6_IS_ADDR_LINKLOCAL(&saddr))
> > -			daddr.a6 = c->ip6.addr_ll_seen;
> > -		else
> > -			daddr.a6 = c->ip6.addr_seen;
> > -	}
> > -	dstport += c->tcp.fwd_in.delta[dstport];
> > -
> > -	flow_forward_af(flow,  PIF_TAP, AF_INET6,
> > -			&saddr, srcport, &daddr, dstport);
> > -	conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
> > -
> > +	
> 
> Excess newline and tab.

Looks like I already fixed that.

[snip]
> > --- a/tcp_splice.c
> > +++ b/tcp_splice.c
> > @@ -395,71 +395,18 @@ static int tcp_conn_sock_ns(const struct ctx *c, sa_family_t af)
> >  /**
> >   * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection
> >   * @c:		Execution context
> > - * @pif0:	pif id of side 0
> > - * @dstport:	Side 0 destination port of connection
> >   * @flow:	flow to initialise
> >   * @s0:		Accepted (side 0) socket
> >   * @sa:		Peer address of connection
> >   *
> > - * Return: true if able to create a spliced connection, false otherwise
> 
> Not related to this patch, but I think we should probably describe in
> the theory of operation for flows what's the threshold between calling
> flow_alloc_cancel() on a flow (which would imply returning something
> here, in case tcp_splice_connect() fails), and deferring that instead
> to a CLOSING state.

That's included in the new descriptions of the flow states.  There
might be a way to make it more obvious, but I'm not immediately sure
of it.  In any case the answer is: you can't cancel once you
FLOW_ACTIVATE().
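Expressed as code, the rule is a single comparison on the state.  The
sketch below uses a hypothetical state list and ex_ helpers; the exact
set of states and the checks in the series may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states, in the order a flow moves through them */
enum ex_fstate {
	EX_FS_FREE,
	EX_FS_NEW,
	EX_FS_INI,
	EX_FS_FWD,
	EX_FS_TYPED,
	EX_FS_ACTIVE,
};

struct ex_f {
	enum ex_fstate state;
};

/* After this point the flow can only be closed, not cancelled */
static void ex_activate(struct ex_f *f)
{
	assert(f && f->state != EX_FS_FREE);
	f->state = EX_FS_ACTIVE;
}

/* Cancel a flow that is still being set up.  Returns false if it's
 * already active, in which case the caller has to close it instead.
 */
static bool ex_cancel(struct ex_f *f)
{
	assert(f);

	if (f->state >= EX_FS_ACTIVE)
		return false;

	f->state = EX_FS_FREE;
	return true;
}
```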

[snip]

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH v5 19/19] flow, icmp: Use general flow forwarding rules for ICMP
       [not found]   ` <20240518001408.004011b2@elisabeth>
@ 2024-05-20  5:56     ` David Gibson
  0 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2024-05-20  5:56 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

On Sat, May 18, 2024 at 12:14:08AM +0200, Stefano Brivio wrote:
> On Tue, 14 May 2024 11:03:37 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > Currently ICMP hard codes its forwarding rules, and never applies any
> > translations.  Change it to use the flow_forward() function, so that
> > it's translated the same as TCP (excluding TCP specific port
> > redirection).
> > 
> > This means that gw mapping now applies to ICMP so "ping <gw address>" will
> > now ping the host's loopback instead of the actual gw machine.  This
> > removes the surprising behaviour that the target you ping might not be the
> > same as you connect to with TCP.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  flow.c |  1 +
> >  icmp.c | 14 ++++++++++++--
> >  2 files changed, 13 insertions(+), 2 deletions(-)
> > 
> > diff --git a/flow.c b/flow.c
> > index a6afe39..b43a079 100644
> > --- a/flow.c
> > +++ b/flow.c
> > @@ -285,6 +285,7 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif,
> >   *
> >   * Return: pointer to the forwarded flowside information
> >   */
> > +/* cppcheck-suppress unusedFunction */
> >  const struct flowside *flow_forward_af(union flow *flow, uint8_t pif,
> >  				       sa_family_t af,
> >  				       const void *saddr, in_port_t sport,
> > diff --git a/icmp.c b/icmp.c
> > index 0112fd9..6310178 100644
> > --- a/icmp.c
> > +++ b/icmp.c
> > @@ -153,6 +153,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
> >  					    sa_family_t af, uint16_t id,
> >  					    const void *saddr, const void *daddr)
> >  {
> > +	uint8_t proto = af == AF_INET ? IPPROTO_ICMP : IPPROTO_ICMPV6;
> >  	uint8_t flowtype = af == AF_INET ? FLOW_PING4 : FLOW_PING6;
> >  	union epoll_ref ref = { .type = EPOLL_TYPE_PING };
> >  	union flow *flow = flow_alloc();
> > @@ -163,9 +164,18 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
> >  	if (!flow)
> >  		return NULL;
> >  
> > -
> >  	flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id);
> > -	flow_forward_af(flow, PIF_HOST,	af, NULL, 0, daddr, 0);
> > +	if (!flow_forward(c, flow, proto))
> > +		goto cancel;
> > +
> > +	if (flow->f.pif[FWDSIDE] != PIF_HOST) {
> > +		flow_err(flow, "No support for forwarding %s from %s to %s",
> > +			 proto == IPPROTO_ICMP ? "ICMP" : "ICMPv6",
> 
> Which brings me to two remarks:
> 
> - having the protocol name also in the flow_err() message printed in
>   flow_forward() could be helpful

It would, and I've thought about it, but haven't seen a great way to
go about it.

 * The flow type is not set at this point, so we can't use that.  We
   can't trivially move setting the type earlier, because for TCP at
   least we need the information from flow_forward() to determine if
   we're spliced and set the type based on that.

 * Including both flow type and protocol in flow_common is annoyingly
   redundant, as well as adding a full 32 bits to the structure
   because of padding.

 * We could possibly eliminate flow type and make it implicit based on
   the protocol+pifs: regular TCP flow is (TCP, HOST, TAP) or (TCP,
   TAP, HOST), whereas TCP spliced is (TCP, HOST, SPLICE) or (TCP,
   SPLICE, HOST).  Type is the selector field for which of the union
   variants is valid, and I don't love that being something with a
   kind of complicated calculation behind it.

 * We could make the type field hold the protocol until
   FLOW_SET_TYPE(), but I don't love the semantics of a field changing
   like that.

 * We could just pass either a protocol number or a string to
   flow_forward() etc., but that seems a bit awkward.
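
To make the "implicit type" option from the list above concrete, here is
a minimal sketch of what such a calculation could look like.  The ex_*
enums and the function name are hypothetical stand-ins, not passt's
actual definitions; the ping case assumes only the tap to host direction
is supported, as in the check added by the quoted patch.

```c
#include <netinet/in.h>	/* IPPROTO_TCP, IPPROTO_ICMP, IPPROTO_ICMPV6 */
#include <stdint.h>

/* Hypothetical stand-ins for the real pif and flow type enums */
enum ex_pif { EX_PIF_NONE, EX_PIF_HOST, EX_PIF_TAP, EX_PIF_SPLICE };

enum ex_flow_type {
	EX_FLOW_NONE,
	EX_FLOW_TCP,
	EX_FLOW_TCP_SPLICE,
	EX_FLOW_PING4,
	EX_FLOW_PING6,
};

/* True if (a, b) is the pair (x, y) in either order */
static int ex_pif_pair(enum ex_pif a, enum ex_pif b,
		       enum ex_pif x, enum ex_pif y)
{
	return (a == x && b == y) || (a == y && b == x);
}

/* Derive the flow type from protocol plus the pifs of both sides,
 * instead of storing it explicitly (assumption: this mapping covers
 * only the cases named in the discussion).
 */
enum ex_flow_type ex_flow_type_from(uint8_t proto,
				    enum ex_pif ini, enum ex_pif fwd)
{
	switch (proto) {
	case IPPROTO_TCP:
		if (ex_pif_pair(ini, fwd, EX_PIF_HOST, EX_PIF_TAP))
			return EX_FLOW_TCP;
		if (ex_pif_pair(ini, fwd, EX_PIF_HOST, EX_PIF_SPLICE))
			return EX_FLOW_TCP_SPLICE;
		break;
	case IPPROTO_ICMP:
		if (ini == EX_PIF_TAP && fwd == EX_PIF_HOST)
			return EX_FLOW_PING4;
		break;
	case IPPROTO_ICMPV6:
		if (ini == EX_PIF_TAP && fwd == EX_PIF_HOST)
			return EX_FLOW_PING6;
		break;
	}

	return EX_FLOW_NONE;
}
```

Even in this reduced form, the selector for the union variant depends on
three fields and a switch, which illustrates the objection raised in that
bullet.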

Hrm... actually thinking on that last one.  It might make sense to add
a descriptive string to flow_initiate(), not just the protocol but
something like "TCP SYN" versus "TCP accept()" or the like.  That
wouldn't directly help flow_forward(), but the info line from
flow_initiate() is likely to be in close proximity, so it would help.
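
As a rough illustration of the descriptive string idea, the sketch below
formats an info line that carries a caller-supplied description.  The
function name, its signature and the message format are all
hypothetical; the real flow_initiate() takes different parameters and
logs through the flow debugging helpers.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: build the info line a flow_initiate()-like
 * function might emit, tagged with a caller-supplied description such
 * as "TCP SYN" or "TCP accept()".
 *
 * Returns the snprintf() result: the length the full message needs,
 * excluding the terminating NUL.
 */
int ex_flow_initiate_msg(char *buf, size_t len, unsigned idx,
			 const char *desc)
{
	return snprintf(buf, len, "Flow %u: initiated (%s)", idx, desc);
}
```

A later failure message from the forwarding step would still not name
the protocol, but the line above would normally sit right next to it in
the log.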


> - then, perhaps, we should re-introduce ip_proto_str[] which was dropped
>   with 340164445341 ("epoll: Generalize epoll_ref to cover things other
>   than sockets")

Maybe.  I guess the standard way to do that is with getprotobyname(3),
but that probably won't work in our isolated namespace.
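
For comparison, a static table needs no lookup in the protocols database
at run time, so it does not depend on what is visible inside the
sandbox.  The sketch below is a hypothetical reconstruction, not the
ip_proto_str[] that was dropped; the ex_* names and the "?" fallback are
assumptions.

```c
#include <netinet/in.h>	/* IPPROTO_* constants */
#include <stddef.h>
#include <stdint.h>

/* Sparse table indexed by IP protocol number; unlisted entries are NULL.
 * IPPROTO_ICMPV6 (58) is the largest of the protocols listed here.
 */
static const char *const ex_ip_proto_str[IPPROTO_ICMPV6 + 1] = {
	[IPPROTO_ICMP]		= "ICMP",
	[IPPROTO_TCP]		= "TCP",
	[IPPROTO_UDP]		= "UDP",
	[IPPROTO_ICMPV6]	= "ICMPv6",
};

/* Return a printable name for @proto, or "?" if it is not in the table */
const char *ex_proto_name(uint8_t proto)
{
	size_t n = sizeof(ex_ip_proto_str) / sizeof(ex_ip_proto_str[0]);

	if (proto < n && ex_ip_proto_str[proto])
		return ex_ip_proto_str[proto];

	return "?";
}
```

With something like this, the ternary in the quoted flow_err() call could
become a single ex_proto_name(proto) lookup.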

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


Thread overview: 28+ messages
2024-05-14  1:03 [PATCH v5 00/19] RFC: Unified flow table David Gibson
2024-05-14  1:03 ` [PATCH v5 01/19] flow: Clarify and enforce flow state transitions David Gibson
2024-05-16  9:30   ` Stefano Brivio
     [not found]     ` <ZkbVxtvmP7f0aL1S@zatzit>
2024-05-17 11:00       ` Stefano Brivio
2024-05-18  6:47         ` David Gibson
2024-05-14  1:03 ` [PATCH v5 02/19] flow: Make side 0 always be the initiating side David Gibson
2024-05-16 12:06   ` Stefano Brivio
2024-05-14  1:03 ` [PATCH v5 03/19] flow: Record the pifs for each side of each flow David Gibson
2024-05-14  1:03 ` [PATCH v5 04/19] tcp: Remove interim 'tapside' field from connection David Gibson
2024-05-14  1:03 ` [PATCH v5 05/19] flow: Common data structures for tracking flow addresses David Gibson
2024-05-14  1:03 ` [PATCH v5 06/19] flow: Populate address information for initiating side David Gibson
     [not found]   ` <20240516202337.1b90e5f2@elisabeth>
     [not found]     ` <ZkbcwkdEwjGv6uwG@zatzit>
     [not found]       ` <20240517215845.4d09eaae@elisabeth>
2024-05-18  7:00         ` David Gibson
2024-05-14  1:03 ` [PATCH v5 07/19] flow: Populate address information for non-initiating side David Gibson
2024-05-14  1:03 ` [PATCH v5 08/19] tcp, flow: Remove redundant information, repack connection structures David Gibson
2024-05-14  1:03 ` [PATCH v5 09/19] tcp: Obtain guest address from flowside David Gibson
2024-05-14  1:03 ` [PATCH v5 10/19] tcp: Simplify endpoint validation using flowside information David Gibson
2024-05-14  1:03 ` [PATCH v5 11/19] tcp_splice: Eliminate SPLICE_V6 flag David Gibson
2024-05-14  1:03 ` [PATCH v5 12/19] tcp, flow: Replace TCP specific hash function with general flow hash David Gibson
2024-05-14  1:03 ` [PATCH v5 13/19] flow, tcp: Generalise TCP hash table to general flow hash table David Gibson
2024-05-14  1:03 ` [PATCH v5 14/19] tcp: Re-use flow hash for initial sequence number generation David Gibson
2024-05-14  1:03 ` [PATCH v5 15/19] icmp: Use flowsides as the source of truth wherever possible David Gibson
     [not found]   ` <20240516225350.06aebcd7@elisabeth>
     [not found]     ` <ZkcAHhCpx3F0SW2K@zatzit>
     [not found]       ` <20240517221123.1c7197a3@elisabeth>
2024-05-18  7:08         ` David Gibson
2024-05-14  1:03 ` [PATCH v5 16/19] icmp: Look up ping flows using flow hash David Gibson
2024-05-14  1:03 ` [PATCH v5 17/19] icmp: Eliminate icmp_id_map David Gibson
2024-05-14  1:03 ` [PATCH v5 18/19] flow, tcp: Flow based NAT and port forwarding for TCP David Gibson
     [not found]   ` <20240518001345.2d127b09@elisabeth>
2024-05-20  5:44     ` David Gibson
2024-05-14  1:03 ` [PATCH v5 19/19] flow, icmp: Use general flow forwarding rules for ICMP David Gibson
     [not found]   ` <20240518001408.004011b2@elisabeth>
2024-05-20  5:56     ` David Gibson
