public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed
* [PATCH v4 00/16] RFC: Unified flow table
@ 2024-05-03  1:11 David Gibson
  2024-05-03  1:11 ` [PATCH v4 01/16] flow: Common data structures for tracking flow addresses David Gibson
                   ` (15 more replies)
  0 siblings, 16 replies; 23+ messages in thread
From: David Gibson @ 2024-05-03  1:11 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

This is a fourth draft of the first steps in implementing more general
"connection" tracking, as described at:
    https://pad.passt.top/p/NewForwardingModel

This series changes the TCP connection table and hash table into a
more general flow table that can track other protocols as well.  Each
flow uniformly keeps track of all the relevant addresses and ports,
which will allow for more robust control of NAT and port forwarding.

ICMP is converted to use the new flow table.

This doesn't include UDP, but I'm working on it right now and making
progress.  I'm posting this to give a head start on the review :)

Caveats:
 * We significantly increase the size of a connection/flow entry
   - Can probably be mitigated, but I haven't investigated much yet
 * We perform a number of extra getsockname() calls to know some of
   the socket endpoints
    - Haven't yet measured how much performance impact that has
    - Can be mitigated in at least some cases, but again, haven't
      tried yet

Changes since v4:
 * Complex rebase on top of the many things that have happened
   upstream since v3.
 * Assorted other changes.

Changes since v3:
 * Replace TAPFSIDE() and SOCKFSIDE() macros with local variables.

Changes since v2:
 * Cosmetic fixes based on review
 * Extra doc comments for enum flow_type
 * Rename flowside to flowaddrs which turns out to make more sense in
   light of future changes
 * Fix bug where the socket flowaddrs for tap initiated connections
   wasn't initialised to match the socket address we were using in the
   case of map-gw NAT
 * New flowaddrs_from_sock() helper used in most cases which is cleaner
   and should avoid bugs like the above
 * Using newer centralised workarounds for clang-tidy issue 58992
 * Remove duplicate definition of FLOW_MAX as maximum flow type and
   maximum number of tracked flows
 * Rebased on newer versions of preliminary work (ICMP, flow based
   dispatch and allocation, bind/address cleanups)
 * Unified hash table as well as base flow table
 * Integrated ICMP

Changes since v1:
 * Terminology changes
   - "Endpoint" address/port instead of "correspondent" address/port
   - "flowside" instead of "demiflow"
 * Actually move the connection table to a new flow table structure in
   new files
 * Significant rearrangement of earlier patchs on top of that new
   table, to reduce churn

David Gibson (16):
  flow: Common data structures for tracking flow addresses
  tcp: Maintain flowside information for "tap" connections
  tcp_splice: Maintain flowside information for spliced connections
  tcp: Obtain guest address from flowside
  tcp: Simplify endpoint validation using flowside information
  tcp, tcp_splice: Construct sockaddrs for connect() from flowside
  tcp_splice: Eliminate SPLICE_V6 flag
  tcp, flow: Replace TCP specific hash function with general flow hash
  flow, tcp: Generalise TCP hash table to general flow hash table
  tcp: Re-use flow hash for initial sequence number generation
  icmp: Populate flowside information
  icmp: Use flowsides as the source of truth wherever possible
  icmp: Look up ping flows using flow hash
  icmp: Eliminate icmp_id_map
  flow, tcp: flow based NAT and port forwarding for TCP
  flow, icmp: Use general flow forwarding rules for ICMP

 flow.c       | 199 ++++++++++++++++++++-
 flow.h       |  97 +++++++++++
 fwd.c        | 139 +++++++++++++++
 fwd.h        |   5 +
 icmp.c       |  83 +++++----
 icmp_flow.h  |   1 -
 inany.h      |  29 +++-
 passt.h      |   3 +
 pif.h        |   1 -
 tap.c        |  11 --
 tap.h        |   1 -
 tcp.c        | 475 +++++++++++++--------------------------------------
 tcp_conn.h   |  20 +--
 tcp_splice.c |  93 ++--------
 tcp_splice.h |   5 +-
 15 files changed, 649 insertions(+), 513 deletions(-)

-- 
2.44.0


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v4 01/16] flow: Common data structures for tracking flow addresses
  2024-05-03  1:11 [PATCH v4 00/16] RFC: Unified flow table David Gibson
@ 2024-05-03  1:11 ` David Gibson
  2024-05-13 18:07   ` Stefano Brivio
  2024-05-03  1:11 ` [PATCH v4 02/16] tcp: Maintain flowside information for "tap" connections David Gibson
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 23+ messages in thread
From: David Gibson @ 2024-05-03  1:11 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Handling of each protocol needs some degree of tracking of the addresses
and ports at the end of each connection or flow.  Sometimes that's explicit
(as in the guest visible addresses for TCP connections), sometimes implicit
(the bound and connected addresses of sockets).

To allow more general and robust handling, and more consistency across
protocols we want to uniformly track the address and port at each end of
the connection.  Furthermore, because we allow port remapping, and we
sometimes need to apply NAT, the addresses and ports can be different as
seen by the guest/namespace and as by the host.

Introduce 'struct flowside' to keep track of common information
related to one side of each flow.  For now that's the addresses, ports
and the pif id.  Store two of these in the common fields of a flow to
track that information for both sides.  For now we just introduce the
structure itself, helpers to populate it, and logging of the contents
when starting and ending flows.  Later patches will actually put
something useful there.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c     | 28 ++++++++++++++++++--
 flow.h     | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 passt.h    |  3 +++
 pif.h      |  1 -
 tcp_conn.h |  1 -
 5 files changed, 104 insertions(+), 4 deletions(-)

diff --git a/flow.c b/flow.c
index 80dd269..02d6008 100644
--- a/flow.c
+++ b/flow.c
@@ -51,10 +51,11 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
  *
  *    ALLOC - A tentatively allocated entry
  *        Operations:
+ *            - Common flow fields other than type may be accessed
  *            - flow_alloc_cancel() returns the entry to FREE state
  *            - FLOW_START() set the entry's type and moves to START state
  *        Caveats:
- *            - It's not safe to write fields in the flow entry
+ *            - It's not safe to write flow type specific fields in the entry
  *            - It's not safe to allocate further entries with flow_alloc()
  *            - It's not safe to return to the main epoll loop (use FLOW_START()
  *              to move to START state before doing so)
@@ -62,6 +63,7 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
  *
  *    START - An entry being prepared by flow type specific code
  *        Operations:
+ *            - Common flow fields other than type may be accessed
  *            - Flow type specific fields may be accessed
  *            - flow_*() logging functions
  *            - flow_alloc_cancel() returns the entry to FREE state
@@ -168,9 +170,21 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
 union flow *flow_start(union flow *flow, enum flow_type type,
 		       unsigned iniside)
 {
-	(void)iniside;
+	char ebuf[INANY_ADDRSTRLEN], fbuf[INANY_ADDRSTRLEN];
+	const struct flowside *a = &flow->f.side[iniside];
+	const struct flowside *b = &flow->f.side[!iniside];
+
 	flow->f.type = type;
 	flow_dbg(flow, "START %s", flow_type_str[flow->f.type]);
+	flow_dbg(flow, "  from side %u (%s): [%s]:%hu -> [%s]:%hu",
+		 iniside, pif_name(a->pif),
+		 inany_ntop(&a->eaddr, ebuf, sizeof(ebuf)), a->eport,
+		 inany_ntop(&a->faddr, fbuf, sizeof(fbuf)), a->fport);
+	flow_dbg(flow, "    to side %u (%s): [%s]:%hu -> [%s]:%hu",
+		 !iniside, pif_name(b->pif),
+		 inany_ntop(&b->faddr, fbuf, sizeof(fbuf)), b->fport,
+		 inany_ntop(&b->eaddr, ebuf, sizeof(ebuf)), b->eport);
+
 	return flow;
 }
 
@@ -180,10 +194,20 @@ union flow *flow_start(union flow *flow, enum flow_type type,
  */
 static void flow_end(union flow *flow)
 {
+	char ebuf[INANY_ADDRSTRLEN], fbuf[INANY_ADDRSTRLEN];
+	const struct flowside *a = &flow->f.side[0];
+	const struct flowside *b = &flow->f.side[1];
+
 	if (flow->f.type == FLOW_TYPE_NONE)
 		return; /* Nothing to do */
 
 	flow_dbg(flow, "END %s", flow_type_str[flow->f.type]);
+	flow_dbg(flow, "  side 0 (%s): [%s]:%hu <-> [%s]:%hu", pif_name(a->pif),
+		 inany_ntop(&a->faddr, fbuf, sizeof(fbuf)), a->fport,
+		 inany_ntop(&a->eaddr, ebuf, sizeof(ebuf)), a->eport);
+	flow_dbg(flow, "  side 1 (%s): [%s]:%hu <-> [%s]:%hu", pif_name(b->pif),
+		 inany_ntop(&b->faddr, fbuf, sizeof(fbuf)), b->fport,
+		 inany_ntop(&b->eaddr, ebuf, sizeof(ebuf)), b->eport);
 	flow->f.type = FLOW_TYPE_NONE;
 }
 
diff --git a/flow.h b/flow.h
index c943c44..f7fb537 100644
--- a/flow.h
+++ b/flow.h
@@ -35,11 +35,86 @@ extern const uint8_t flow_proto[];
 #define FLOW_PROTO(f)				\
 	((f)->type < FLOW_NUM_TYPES ? flow_proto[(f)->type] : 0)
 
+/**
+ * struct flowside - Common information for one side of a flow
+ * @eaddr:	Endpoint address (remote address from passt's PoV)
+ * @faddr:	Forwarding address (local address from passt's PoV)
+ * @eport:	Endpoint port
+ * @fport:	Forwarding port
+ * @pif:	pif ID on which this side of the flow exists
+ */
+struct flowside {
+	union inany_addr	faddr;
+	union inany_addr	eaddr;
+	in_port_t		fport;
+	in_port_t		eport;
+	uint8_t			pif;
+};
+static_assert(_Alignof(struct flowside) == _Alignof(uint32_t),
+	      "Unexpected alignment for struct flowside");
+
+/** flowside_from_inany - Initialize flowside from inany addresses
+ * @fside:	flowside to initialize
+ * @pif:	pif id of this flowside
+ * @faddr:	Forwarding address (inany)
+ * @fport:	Forwarding port
+ * @eaddr:	Endpoint address (inany)
+ * @eport:	Endpoint port
+ */
+/* cppcheck-suppress unusedFunction */
+static inline void flowside_from_inany(struct flowside *fside, uint8_t pif,
+				const union inany_addr *faddr, in_port_t fport,
+				const union inany_addr *eaddr, in_port_t eport)
+{
+	fside->pif = pif;
+	fside->faddr = *faddr;
+	fside->eaddr = *eaddr;
+	fside->fport = fport;
+	fside->eport = eport;
+}
+
+/** flowside_from_af - Initialize flowside from addresses
+ * @fside:	flowside to initialize
+ * @pif:	pif id of this flowside
+ * @af:		Address family (AF_INET or AF_INET6)
+ * @faddr:	Forwarding address (pointer to in_addr or in6_addr, or NULL)
+ * @fport:	Forwarding port
+ * @eaddr:	Endpoint address (pointer to in_addr or in6_addr, or NULL)
+ * @eport:	Endpoint port
+ *
+ * If NULL is given for either address, the appropriate unspecified/any address
+ * for the address family is substituted.
+ */
+/* cppcheck-suppress unusedFunction */
+static inline void flowside_from_af(struct flowside *fside,
+				    uint8_t pif, sa_family_t af,
+				    const void *faddr, in_port_t fport,
+				    const void *eaddr, in_port_t eport)
+{
+	const union inany_addr *any = af == AF_INET ? &inany_any4 : &inany_any6;
+
+	fside->pif = pif;
+	if (faddr)
+		inany_from_af(&fside->faddr, af, faddr);
+	else
+		fside->faddr = *any;
+	if (eaddr)
+		inany_from_af(&fside->eaddr, af, eaddr);
+	else
+		fside->eaddr = *any;
+	fside->fport = fport;
+	fside->eport = eport;
+}
+
+#define SIDES			2
+
 /**
  * struct flow_common - Common fields for packet flows
+ * @side[]:	Information for each side of the flow
  * @type:	Type of packet flow
  */
 struct flow_common {
+	struct flowside	side[SIDES];
 	uint8_t		type;
 };
 
diff --git a/passt.h b/passt.h
index bc58d64..3db0b8e 100644
--- a/passt.h
+++ b/passt.h
@@ -17,6 +17,9 @@ union epoll_ref;
 
 #include "pif.h"
 #include "packet.h"
+#include "siphash.h"
+#include "ip.h"
+#include "inany.h"
 #include "flow.h"
 #include "icmp.h"
 #include "fwd.h"
diff --git a/pif.h b/pif.h
index bd52936..ca85b34 100644
--- a/pif.h
+++ b/pif.h
@@ -38,7 +38,6 @@ static inline const char *pif_type(enum pif_type pt)
 		return "?";
 }
 
-/* cppcheck-suppress unusedFunction */
 static inline const char *pif_name(uint8_t pif)
 {
 	return pif_type(pif);
diff --git a/tcp_conn.h b/tcp_conn.h
index d280b22..1a07dd5 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -106,7 +106,6 @@ struct tcp_tap_conn {
 	uint32_t	seq_init_from_tap;
 };
 
-#define SIDES			2
 /**
  * struct tcp_splice_conn - Descriptor for a spliced TCP connection
  * @f:			Generic flow information
-- 
@@ -106,7 +106,6 @@ struct tcp_tap_conn {
 	uint32_t	seq_init_from_tap;
 };
 
-#define SIDES			2
 /**
  * struct tcp_splice_conn - Descriptor for a spliced TCP connection
  * @f:			Generic flow information
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v4 02/16] tcp: Maintain flowside information for "tap" connections
  2024-05-03  1:11 [PATCH v4 00/16] RFC: Unified flow table David Gibson
  2024-05-03  1:11 ` [PATCH v4 01/16] flow: Common data structures for tracking flow addresses David Gibson
@ 2024-05-03  1:11 ` David Gibson
  2024-05-13 18:07   ` Stefano Brivio
  2024-05-03  1:11 ` [PATCH v4 03/16] tcp_splice: Maintain flowside information for spliced connections David Gibson
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 23+ messages in thread
From: David Gibson @ 2024-05-03  1:11 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

tcp_tap_conn has several fields to track addresses and ports as seen
by the guest/namespace.  We now have general fields for this in the
common flowside struct so use those instead of protocol specific
fields.  The flowside also has space for the guest side endpoint
address (local address from the guest's PoV) so we fill that in as
well.

We didn't previously store equivalent information for the connection
as it appears to the host; that was implicit in the state of the host
side socket.  For future generalisations of flow/connection tracking,
we're going to need that information, so populate the other flowside
in each flow table entry with as much of this information as we can
easily obtain.  For connections initiated by the guest that's the
endpoint address and port.  To get the forwarding address and port
we'd need to call getsockname() in general, so leave that blank for
now.  For connections initiated from outside, we also have the
endpoint address from accept().  We have the forwarding port from the
epoll ref, but we leave the forwarding address blank.

For now we just fill the information in without really using it for
anything.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.h     |  1 -
 tcp.c      | 88 +++++++++++++++++++++++++++++++++++++-----------------
 tcp_conn.h |  8 -----
 3 files changed, 60 insertions(+), 37 deletions(-)

diff --git a/flow.h b/flow.h
index f7fb537..88caa76 100644
--- a/flow.h
+++ b/flow.h
@@ -85,7 +85,6 @@ static inline void flowside_from_inany(struct flowside *fside, uint8_t pif,
  * If NULL is given for either address, the appropriate unspecified/any address
  * for the address family is substituted.
  */
-/* cppcheck-suppress unusedFunction */
 static inline void flowside_from_af(struct flowside *fside,
 				    uint8_t pif, sa_family_t af,
 				    const void *faddr, in_port_t fport,
diff --git a/tcp.c b/tcp.c
index 21d0af0..1835b86 100644
--- a/tcp.c
+++ b/tcp.c
@@ -372,7 +372,7 @@
 #define OPT_SACK	5
 #define OPT_TS		8
 
-#define CONN_V4(conn)		(!!inany_v4(&(conn)->faddr))
+#define CONN_V4(conn)		(!!inany_v4(&conn->f.side[TAPSIDE].faddr))
 #define CONN_V6(conn)		(!CONN_V4(conn))
 #define CONN_IS_CLOSING(conn)						\
 	((conn->events & ESTABLISHED) &&				\
@@ -795,10 +795,11 @@ static void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
  */
 static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn)
 {
+	const struct flowside *tapside = &conn->f.side[TAPSIDE];
 	int i;
 
 	for (i = 0; i < LOW_RTT_TABLE_SIZE; i++)
-		if (inany_equals(&conn->faddr, low_rtt_dst + i))
+		if (inany_equals(&tapside->faddr, low_rtt_dst + i))
 			return 1;
 
 	return 0;
@@ -813,6 +814,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
 			      const struct tcp_info *tinfo)
 {
 #ifdef HAS_MIN_RTT
+	const struct flowside *tapside = &conn->f.side[TAPSIDE];
 	int i, hole = -1;
 
 	if (!tinfo->tcpi_min_rtt ||
@@ -820,7 +822,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
 		return;
 
 	for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) {
-		if (inany_equals(&conn->faddr, low_rtt_dst + i))
+		if (inany_equals(&tapside->faddr, low_rtt_dst + i))
 			return;
 		if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i))
 			hole = i;
@@ -832,7 +834,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
 	if (hole == -1)
 		return;
 
-	low_rtt_dst[hole++] = conn->faddr;
+	low_rtt_dst[hole++] = tapside->faddr;
 	if (hole == LOW_RTT_TABLE_SIZE)
 		hole = 0;
 	inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any);
@@ -1085,8 +1087,10 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn,
 			  const union inany_addr *faddr,
 			  in_port_t eport, in_port_t fport)
 {
-	if (inany_equals(&conn->faddr, faddr) &&
-	    conn->eport == eport && conn->fport == fport)
+	const struct flowside *tapside = &conn->f.side[TAPSIDE];
+
+	if (inany_equals(&tapside->faddr, faddr) &&
+	    tapside->eport == eport && tapside->fport == fport)
 		return 1;
 
 	return 0;
@@ -1120,7 +1124,9 @@ static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr,
 static uint64_t tcp_conn_hash(const struct ctx *c,
 			      const struct tcp_tap_conn *conn)
 {
-	return tcp_hash(c, &conn->faddr, conn->eport, conn->fport);
+	const struct flowside *tapside = &conn->f.side[TAPSIDE];
+
+	return tcp_hash(c, &tapside->faddr, tapside->eport, tapside->fport);
 }
 
 /**
@@ -1302,10 +1308,12 @@ void tcp_defer_handler(struct ctx *c)
  * @seq:	Sequence number
  */
 static void tcp_fill_header(struct tcphdr *th,
-			       const struct tcp_tap_conn *conn, uint32_t seq)
+			    const struct tcp_tap_conn *conn, uint32_t seq)
 {
-	th->source = htons(conn->fport);
-	th->dest = htons(conn->eport);
+	const struct flowside *tapside = &conn->f.side[TAPSIDE];	
+
+	th->source = htons(tapside->fport);
+	th->dest = htons(tapside->eport);
 	th->seq = htonl(seq);
 	th->ack_seq = htonl(conn->seq_ack_to_tap);
 	if (conn->events & ESTABLISHED)	{
@@ -1337,7 +1345,8 @@ static size_t tcp_fill_headers4(const struct ctx *c,
 				size_t dlen, const uint16_t *check,
 				uint32_t seq)
 {
-	const struct in_addr *a4 = inany_v4(&conn->faddr);
+	const struct flowside *tapside = &conn->f.side[TAPSIDE];
+	const struct in_addr *a4 = inany_v4(&tapside->faddr);
 	size_t l4len = dlen + sizeof(*th);
 	size_t l3len = l4len + sizeof(*iph);
 
@@ -1379,10 +1388,11 @@ static size_t tcp_fill_headers6(const struct ctx *c,
 				struct ipv6hdr *ip6h, struct tcphdr *th,
 				size_t dlen, uint32_t seq)
 {
+	const struct flowside *tapside = &conn->f.side[TAPSIDE];
 	size_t l4len = dlen + sizeof(*th);
 
 	ip6h->payload_len = htons(l4len);
-	ip6h->saddr = conn->faddr.a6;
+	ip6h->saddr = tapside->faddr.a6;
 	if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr))
 		ip6h->daddr = c->ip6.addr_ll_seen;
 	else
@@ -1421,9 +1431,7 @@ static size_t tcp_l2_buf_fill_headers(const struct ctx *c,
 				      struct iovec *iov, size_t dlen,
 				      const uint16_t *check, uint32_t seq)
 {
-	const struct in_addr *a4 = inany_v4(&conn->faddr);
-
-	if (a4) {
+	if (CONN_V4(conn)) {
 		return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base,
 					 iov[TCP_IOV_IP].iov_base,
 					 iov[TCP_IOV_PAYLOAD].iov_base, dlen,
@@ -1738,7 +1746,7 @@ static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
 /**
  * tcp_seq_init() - Calculate initial sequence number according to RFC 6528
  * @c:		Execution context
- * @conn:	TCP connection, with faddr, fport and eport populated
+ * @conn:	TCP connection, with tap flowside faddr, fport and eport
  * @now:	Current timestamp
  */
 static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
@@ -1746,6 +1754,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
 {
 	struct siphash_state state = SIPHASH_INIT(c->hash_secret);
 	union inany_addr aany;
+	const struct flowside *tapside = &conn->f.side[TAPSIDE];
 	uint64_t hash;
 	uint32_t ns;
 
@@ -1754,10 +1763,10 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
 	else
 		inany_from_af(&aany, AF_INET6, &c->ip6.addr);
 
-	inany_siphash_feed(&state, &conn->faddr);
+	inany_siphash_feed(&state, &tapside->faddr);
 	inany_siphash_feed(&state, &aany);
 	hash = siphash_final(&state, 36,
-			     (uint64_t)conn->fport << 16 | conn->eport);
+			     (uint64_t)tapside->fport << 16 | tapside->eport);
 
 	/* 32ns ticks, overflows 32 bits every 137s */
 	ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
@@ -1945,6 +1954,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 		.sin6_port = htons(dstport),
 		.sin6_addr = *(struct in6_addr *)daddr,
 	};
+	struct flowside *tapside, *sockside;
 	const struct sockaddr *sa;
 	struct tcp_tap_conn *conn;
 	union flow *flow;
@@ -1954,6 +1964,11 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 	if (!(flow = flow_alloc()))
 		return;
 
+	tapside = &flow->f.side[TAPSIDE];
+	sockside = &flow->f.side[SOCKSIDE];
+
+	flowside_from_af(tapside, PIF_TAP, af, daddr, dstport, saddr, srcport);
+
 	if (af == AF_INET) {
 		if (IN4_IS_ADDR_UNSPECIFIED(saddr) ||
 		    IN4_IS_ADDR_BROADCAST(saddr) ||
@@ -2026,19 +2041,19 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 	if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap)))
 		conn->wnd_from_tap = 1;
 
-	inany_from_af(&conn->faddr, af, daddr);
+	sockside->pif = PIF_HOST;
+	sockside->eport = dstport;
 
 	if (af == AF_INET) {
+		inany_from_af(&sockside->eaddr, AF_INET, &addr4.sin_addr);
 		sa = (struct sockaddr *)&addr4;
 		sl = sizeof(addr4);
 	} else {
+		inany_from_af(&sockside->eaddr, AF_INET6, &addr6.sin6_addr);
 		sa = (struct sockaddr *)&addr6;
 		sl = sizeof(addr6);
 	}
 
-	conn->fport = dstport;
-	conn->eport = srcport;
-
 	conn->seq_init_from_tap = ntohl(th->seq);
 	conn->seq_from_tap = conn->seq_init_from_tap + 1;
 	conn->seq_ack_to_tap = conn->seq_from_tap;
@@ -2724,18 +2739,35 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
 				   const union sockaddr_inany *sa,
 				   const struct timespec *now)
 {
-	struct tcp_tap_conn *conn = FLOW_START(flow, FLOW_TCP, tcp, SOCKSIDE);
+	struct flowside *sockside = &flow->f.side[SOCKSIDE];
+	struct flowside *tapside = &flow->f.side[TAPSIDE];
+	struct tcp_tap_conn *conn;
+
+	sockside->pif = PIF_HOST;
+	inany_from_sockaddr(&sockside->eaddr, &sockside->eport, sa);
+	sockside->fport = dstport;
+
+	tapside->pif = PIF_TAP;
+	tapside->faddr = sockside->eaddr;
+	tapside->fport = sockside->eport;
+	tcp_snat_inbound(c, &tapside->faddr);
+	if (CONN_V4(flow)) {
+		inany_from_af(&tapside->eaddr, AF_INET, &c->ip4.addr_seen);
+	} else {
+		if (IN6_IS_ADDR_LINKLOCAL(&tapside->faddr.a6))
+			tapside->eaddr.a6 = c->ip6.addr_ll_seen;
+		else
+			tapside->eaddr.a6 = c->ip6.addr_seen;
+	}
+	tapside->eport = dstport + c->tcp.fwd_in.delta[dstport];
+
+	conn = FLOW_START(flow, FLOW_TCP, tcp, SOCKSIDE);
 
 	conn->sock = s;
 	conn->timer = -1;
 	conn->ws_to_tap = conn->ws_from_tap = 0;
 	conn_event(c, conn, SOCK_ACCEPTED);
 
-	inany_from_sockaddr(&conn->faddr, &conn->fport, sa);
-	conn->eport = dstport + c->tcp.fwd_in.delta[dstport];
-
-	tcp_snat_inbound(c, &conn->faddr);
-
 	tcp_seq_init(c, conn, now);
 	tcp_hash_insert(c, conn);
 
diff --git a/tcp_conn.h b/tcp_conn.h
index 1a07dd5..f55f144 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -23,9 +23,6 @@
  * @ws_to_tap:		Window scaling factor advertised to tap/guest
  * @sndbuf:		Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS
  * @seq_dup_ack_approx:	Last duplicate ACK number sent to tap
- * @faddr:		Guest side forwarding address (guest's remote address)
- * @eport:		Guest side endpoint port (guest's local port)
- * @fport:		Guest side forwarding port (guest's remote port)
  * @wnd_from_tap:	Last window size from tap, unscaled (as received)
  * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
  * @seq_to_tap:		Next sequence for packets to tap
@@ -91,11 +88,6 @@ struct tcp_tap_conn {
 
 	uint8_t		seq_dup_ack_approx;
 
-
-	union inany_addr faddr;
-	in_port_t	eport;
-	in_port_t	fport;
-
 	uint16_t	wnd_from_tap;
 	uint16_t	wnd_to_tap;
 
-- 
@@ -23,9 +23,6 @@
  * @ws_to_tap:		Window scaling factor advertised to tap/guest
  * @sndbuf:		Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS
  * @seq_dup_ack_approx:	Last duplicate ACK number sent to tap
- * @faddr:		Guest side forwarding address (guest's remote address)
- * @eport:		Guest side endpoint port (guest's local port)
- * @fport:		Guest side forwarding port (guest's remote port)
  * @wnd_from_tap:	Last window size from tap, unscaled (as received)
  * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
  * @seq_to_tap:		Next sequence for packets to tap
@@ -91,11 +88,6 @@ struct tcp_tap_conn {
 
 	uint8_t		seq_dup_ack_approx;
 
-
-	union inany_addr faddr;
-	in_port_t	eport;
-	in_port_t	fport;
-
 	uint16_t	wnd_from_tap;
 	uint16_t	wnd_to_tap;
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v4 03/16] tcp_splice: Maintain flowside information for spliced connections
  2024-05-03  1:11 [PATCH v4 00/16] RFC: Unified flow table David Gibson
  2024-05-03  1:11 ` [PATCH v4 01/16] flow: Common data structures for tracking flow addresses David Gibson
  2024-05-03  1:11 ` [PATCH v4 02/16] tcp: Maintain flowside information for "tap" connections David Gibson
@ 2024-05-03  1:11 ` David Gibson
  2024-05-03  1:11 ` [PATCH v4 04/16] tcp: Obtain guest address from flowside David Gibson
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2024-05-03  1:11 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Every flow in the flow table now has space for the the addresses as seen by
both the host and guest side.  We already fill that information in for
regular "tap" TCP connections, now do it for spliced connections too.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c        | 20 ++++++++++----------
 tcp_splice.c | 31 +++++++++++++++++--------------
 tcp_splice.h |  6 ++----
 3 files changed, 29 insertions(+), 28 deletions(-)

diff --git a/tcp.c b/tcp.c
index 1835b86..cd5bffe 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2731,22 +2731,16 @@ static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr)
  * @dstport:	Destination port for connection (host side)
  * @flow:	flow to initialise
  * @s:		Accepted socket
- * @sa:		Peer socket address (from accept())
  * @now:	Current timestamp
  */
 static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
 				   union flow *flow, int s,
-				   const union sockaddr_inany *sa,
 				   const struct timespec *now)
 {
-	struct flowside *sockside = &flow->f.side[SOCKSIDE];
+	const struct flowside *sockside = &flow->f.side[SOCKSIDE];
 	struct flowside *tapside = &flow->f.side[TAPSIDE];
 	struct tcp_tap_conn *conn;
 
-	sockside->pif = PIF_HOST;
-	inany_from_sockaddr(&sockside->eaddr, &sockside->eport, sa);
-	sockside->fport = dstport;
-
 	tapside->pif = PIF_TAP;
 	tapside->faddr = sockside->eaddr;
 	tapside->fport = sockside->eport;
@@ -2792,16 +2786,23 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
 {
 	union sockaddr_inany sa;
 	socklen_t sl = sizeof(sa);
+	struct flowside *side0;
 	union flow *flow;
 	int s;
 
 	if (c->no_tcp || !(flow = flow_alloc()))
 		return;
 
+	side0 = &flow->f.side[0];
+
 	s = accept4(ref.fd, &sa.sa, &sl, SOCK_NONBLOCK);
 	if (s < 0)
 		goto cancel;
 
+	side0->pif = ref.tcp_listen.pif;
+	inany_from_sockaddr(&side0->eaddr, &side0->eport, &sa);
+	side0->fport = ref.tcp_listen.port;
+
 	if (sa.sa_family == AF_INET) {
 		const struct in_addr *addr = &sa.sa4.sin_addr;
 		in_port_t port = sa.sa4.sin_port;
@@ -2829,11 +2830,10 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
 		}
 	}
 
-	if (tcp_splice_conn_from_sock(c, ref.tcp_listen.pif,
-				      ref.tcp_listen.port, flow, s, &sa))
+	if (tcp_splice_conn_from_sock(c, ref.tcp_listen.port, flow, s))
 		return;
 
-	tcp_tap_conn_from_sock(c, ref.tcp_listen.port, flow, s, &sa, now);
+	tcp_tap_conn_from_sock(c, ref.tcp_listen.port, flow, s, now);
 	return;
 
 cancel:
diff --git a/tcp_splice.c b/tcp_splice.c
index 42b7be0..462ed0c 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -414,42 +414,38 @@ static int tcp_conn_sock_ns(const struct ctx *c, sa_family_t af)
 /**
  * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection
  * @c:		Execution context
- * @pif0:	pif id of side 0
  * @dstport:	Side 0 destination port of connection
  * @flow:	flow to initialise
  * @s0:		Accepted (side 0) socket
- * @sa:		Peer address of connection
  *
  * Return: true if able to create a spliced connection, false otherwise
  * #syscalls:pasta setsockopt
  */
-bool tcp_splice_conn_from_sock(const struct ctx *c,
-			       uint8_t pif0, in_port_t dstport,
-			       union flow *flow, int s0,
-			       const union sockaddr_inany *sa)
+bool tcp_splice_conn_from_sock(const struct ctx *c, in_port_t dstport,
+			       union flow *flow, int s0)
 {
+	const struct flowside *side0 = &flow->f.side[0];
+	const union inany_addr *src = &side0->eaddr;
+	struct flowside *side1 = &flow->f.side[1];
 	struct tcp_splice_conn *conn;
-	union inany_addr src;
-	in_port_t srcport;
 	sa_family_t af;
 	uint8_t pif1;
 
 	if (c->mode != MODE_PASTA)
 		return false;
 
-	inany_from_sockaddr(&src, &srcport, sa);
-	af = inany_v4(&src) ? AF_INET : AF_INET6;
+	af = inany_v4(src) ? AF_INET : AF_INET6;
 
-	switch (pif0) {
+	switch (side0->pif) {
 	case PIF_SPLICE:
-		if (!inany_is_loopback(&src)) {
+		if (!inany_is_loopback(src)) {
 			char str[INANY_ADDRSTRLEN];
 
 			/* We can't use flow_err() etc. because we haven't set
 			 * the flow type yet
 			 */
 			warn("Bad source address %s for splice, closing",
-			     inany_ntop(&src, str, sizeof(str)));
+			     inany_ntop(src, str, sizeof(str)));
 
 			/* We *don't* want to fall back to tap */
 			flow_alloc_cancel(flow);
@@ -461,7 +457,7 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
 		break;
 
 	case PIF_HOST:
-		if (!inany_is_loopback(&src))
+		if (!inany_is_loopback(src))
 			return false;
 
 		pif1 = PIF_SPLICE;
@@ -472,6 +468,13 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
 		return false;
 	}
 
+	if (af == AF_INET)
+		flowside_from_af(side1, pif1, AF_INET, NULL, 0,
+				 &in4addr_loopback, dstport);
+	else
+		flowside_from_af(side1, pif1, AF_INET6, NULL, 0,
+				 &in6addr_loopback, dstport);
+
 	conn = FLOW_START(flow, FLOW_TCP_SPLICE, tcp_splice, 0);
 
 	conn->flags = af == AF_INET ? 0 : SPLICE_V6;
diff --git a/tcp_splice.h b/tcp_splice.h
index ed8f0c5..e523c7e 100644
--- a/tcp_splice.h
+++ b/tcp_splice.h
@@ -11,10 +11,8 @@ union sockaddr_inany;
 
 void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
 			     uint32_t events);
-bool tcp_splice_conn_from_sock(const struct ctx *c,
-			       uint8_t pif0, in_port_t dstport,
-			       union flow *flow, int s0,
-			       const union sockaddr_inany *sa);
+bool tcp_splice_conn_from_sock(const struct ctx *c, in_port_t dstport,
+			       union flow *flow, int s0);
 void tcp_splice_init(struct ctx *c);
 
 #endif /* TCP_SPLICE_H */
-- 
@@ -11,10 +11,8 @@ union sockaddr_inany;
 
 void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
 			     uint32_t events);
-bool tcp_splice_conn_from_sock(const struct ctx *c,
-			       uint8_t pif0, in_port_t dstport,
-			       union flow *flow, int s0,
-			       const union sockaddr_inany *sa);
+bool tcp_splice_conn_from_sock(const struct ctx *c, in_port_t dstport,
+			       union flow *flow, int s0);
 void tcp_splice_init(struct ctx *c);
 
 #endif /* TCP_SPLICE_H */
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v4 04/16] tcp: Obtain guest address from flowside
  2024-05-03  1:11 [PATCH v4 00/16] RFC: Unified flow table David Gibson
                   ` (2 preceding siblings ...)
  2024-05-03  1:11 ` [PATCH v4 03/16] tcp_splice: Maintain flowside information for spliced connections David Gibson
@ 2024-05-03  1:11 ` David Gibson
  2024-05-13 18:07   ` Stefano Brivio
  2024-05-03  1:11 ` [PATCH v4 05/16] tcp: Simplify endpoint validation using flowside information David Gibson
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 23+ messages in thread
From: David Gibson @ 2024-05-03  1:11 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Currently we always deliver inbound TCP packets to the guest's most
recent observed IP address.  This has the odd side effect that if the
guest changes its IP address with active TCP connections we might
deliver packets from old connections to the new address.  That won't
work; it will will probably result in an RST from the guest.  Worse,
if the guest added a new address but also retains the old one, then we
could break those old connections by redirecting them to the new
address.

Now that we maintain flowside information, we have a record of the correct
guest side address and can just use it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp.c | 47 ++++++++++++++++-------------------------------
 1 file changed, 16 insertions(+), 31 deletions(-)

diff --git a/tcp.c b/tcp.c
index cd5bffe..5ff7480 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1327,7 +1327,6 @@ static void tcp_fill_header(struct tcphdr *th,
 
 /**
  * tcp_fill_headers4() - Fill 802.3, IPv4, TCP headers in pre-cooked buffers
- * @c:		Execution context
  * @conn:	Connection pointer
  * @taph:	tap backend specific header
  * @iph:	Pointer to IPv4 header
@@ -1338,27 +1337,26 @@ static void tcp_fill_header(struct tcphdr *th,
  *
  * Return: The IPv4 payload length, host order
  */
-static size_t tcp_fill_headers4(const struct ctx *c,
-				const struct tcp_tap_conn *conn,
+static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
 				struct tap_hdr *taph,
 				struct iphdr *iph, struct tcphdr *th,
 				size_t dlen, const uint16_t *check,
 				uint32_t seq)
 {
 	const struct flowside *tapside = &conn->f.side[TAPSIDE];
-	const struct in_addr *a4 = inany_v4(&tapside->faddr);
+	const struct in_addr *src4 = inany_v4(&tapside->faddr);
+	const struct in_addr *dst4 = inany_v4(&tapside->eaddr);
 	size_t l4len = dlen + sizeof(*th);
 	size_t l3len = l4len + sizeof(*iph);
 
-	ASSERT(a4);
+	ASSERT(src4 && dst4);
 
 	iph->tot_len = htons(l3len);
-	iph->saddr = a4->s_addr;
-	iph->daddr = c->ip4.addr_seen.s_addr;
+	iph->saddr = src4->s_addr;
+	iph->daddr = dst4->s_addr;
 
 	iph->check = check ? *check :
-			     csum_ip4_header(l3len, IPPROTO_TCP,
-					     *a4, c->ip4.addr_seen);
+			     csum_ip4_header(l3len, IPPROTO_TCP, *src4, *dst4);
 
 	tcp_fill_header(th, conn, seq);
 
@@ -1371,7 +1369,6 @@ static size_t tcp_fill_headers4(const struct ctx *c,
 
 /**
  * tcp_fill_headers6() - Fill 802.3, IPv6, TCP headers in pre-cooked buffers
- * @c:		Execution context
  * @conn:	Connection pointer
  * @taph:	tap backend specific header
  * @ip6h:	Pointer to IPv6 header
@@ -1382,8 +1379,7 @@ static size_t tcp_fill_headers4(const struct ctx *c,
  *
  * Return: The IPv6 payload length, host order
  */
-static size_t tcp_fill_headers6(const struct ctx *c,
-				const struct tcp_tap_conn *conn,
+static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
 				struct tap_hdr *taph,
 				struct ipv6hdr *ip6h, struct tcphdr *th,
 				size_t dlen, uint32_t seq)
@@ -1393,10 +1389,7 @@ static size_t tcp_fill_headers6(const struct ctx *c,
 
 	ip6h->payload_len = htons(l4len);
 	ip6h->saddr = tapside->faddr.a6;
-	if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr))
-		ip6h->daddr = c->ip6.addr_ll_seen;
-	else
-		ip6h->daddr = c->ip6.addr_seen;
+	ip6h->daddr = tapside->eaddr.a6;
 
 	ip6h->hop_limit = 255;
 	ip6h->version = 6;
@@ -1417,7 +1410,6 @@ static size_t tcp_fill_headers6(const struct ctx *c,
 
 /**
  * tcp_l2_buf_fill_headers() - Fill 802.3, IP, TCP headers in pre-cooked buffers
- * @c:		Execution context
  * @conn:	Connection pointer
  * @iov:	Pointer to an array of iovec of TCP pre-cooked buffers
  * @dlen:	TCP payload length
@@ -1426,19 +1418,18 @@ static size_t tcp_fill_headers6(const struct ctx *c,
  *
  * Return: IP payload length, host order
  */
-static size_t tcp_l2_buf_fill_headers(const struct ctx *c,
-				      const struct tcp_tap_conn *conn,
+static size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
 				      struct iovec *iov, size_t dlen,
 				      const uint16_t *check, uint32_t seq)
 {
 	if (CONN_V4(conn)) {
-		return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base,
+		return tcp_fill_headers4(conn, iov[TCP_IOV_TAP].iov_base,
 					 iov[TCP_IOV_IP].iov_base,
 					 iov[TCP_IOV_PAYLOAD].iov_base, dlen,
 					 check, seq);
 	}
 
-	return tcp_fill_headers6(c, conn, iov[TCP_IOV_TAP].iov_base,
+	return tcp_fill_headers6(conn, iov[TCP_IOV_TAP].iov_base,
 				 iov[TCP_IOV_IP].iov_base,
 				 iov[TCP_IOV_PAYLOAD].iov_base, dlen,
 				 seq);
@@ -1654,7 +1645,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	th->syn = !!(flags & SYN);
 	th->fin = !!(flags & FIN);
 
-	l4len = tcp_l2_buf_fill_headers(c, conn, iov, optlen, NULL,
+	l4len = tcp_l2_buf_fill_headers(conn, iov, optlen, NULL,
 					conn->seq_to_tap);
 	iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 
@@ -1753,18 +1744,12 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
 			 const struct timespec *now)
 {
 	struct siphash_state state = SIPHASH_INIT(c->hash_secret);
-	union inany_addr aany;
 	const struct flowside *tapside = &conn->f.side[TAPSIDE];
 	uint64_t hash;
 	uint32_t ns;
 
-	if (CONN_V4(conn))
-		inany_from_af(&aany, AF_INET, &c->ip4.addr);
-	else
-		inany_from_af(&aany, AF_INET6, &c->ip6.addr);
-
 	inany_siphash_feed(&state, &tapside->faddr);
-	inany_siphash_feed(&state, &aany);
+	inany_siphash_feed(&state, &tapside->eaddr);
 	hash = siphash_final(&state, 36,
 			     (uint64_t)tapside->fport << 16 | tapside->eport);
 
@@ -2161,7 +2146,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 		tcp4_seq_update[tcp4_payload_used].len = dlen;
 
 		iov = tcp4_l2_iov[tcp4_payload_used++];
-		l4len = tcp_l2_buf_fill_headers(c, conn, iov, dlen, check, seq);
+		l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, check, seq);
 		iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 		if (tcp4_payload_used > TCP_FRAMES_MEM - 1)
 			tcp_payload_flush(c);
@@ -2170,7 +2155,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 		tcp6_seq_update[tcp6_payload_used].len = dlen;
 
 		iov = tcp6_l2_iov[tcp6_payload_used++];
-		l4len = tcp_l2_buf_fill_headers(c, conn, iov, dlen, NULL, seq);
+		l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, NULL, seq);
 		iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 		if (tcp6_payload_used > TCP_FRAMES_MEM - 1)
 			tcp_payload_flush(c);
-- 
@@ -1327,7 +1327,6 @@ static void tcp_fill_header(struct tcphdr *th,
 
 /**
  * tcp_fill_headers4() - Fill 802.3, IPv4, TCP headers in pre-cooked buffers
- * @c:		Execution context
  * @conn:	Connection pointer
  * @taph:	tap backend specific header
  * @iph:	Pointer to IPv4 header
@@ -1338,27 +1337,26 @@ static void tcp_fill_header(struct tcphdr *th,
  *
  * Return: The IPv4 payload length, host order
  */
-static size_t tcp_fill_headers4(const struct ctx *c,
-				const struct tcp_tap_conn *conn,
+static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
 				struct tap_hdr *taph,
 				struct iphdr *iph, struct tcphdr *th,
 				size_t dlen, const uint16_t *check,
 				uint32_t seq)
 {
 	const struct flowside *tapside = &conn->f.side[TAPSIDE];
-	const struct in_addr *a4 = inany_v4(&tapside->faddr);
+	const struct in_addr *src4 = inany_v4(&tapside->faddr);
+	const struct in_addr *dst4 = inany_v4(&tapside->eaddr);
 	size_t l4len = dlen + sizeof(*th);
 	size_t l3len = l4len + sizeof(*iph);
 
-	ASSERT(a4);
+	ASSERT(src4 && dst4);
 
 	iph->tot_len = htons(l3len);
-	iph->saddr = a4->s_addr;
-	iph->daddr = c->ip4.addr_seen.s_addr;
+	iph->saddr = src4->s_addr;
+	iph->daddr = dst4->s_addr;
 
 	iph->check = check ? *check :
-			     csum_ip4_header(l3len, IPPROTO_TCP,
-					     *a4, c->ip4.addr_seen);
+			     csum_ip4_header(l3len, IPPROTO_TCP, *src4, *dst4);
 
 	tcp_fill_header(th, conn, seq);
 
@@ -1371,7 +1369,6 @@ static size_t tcp_fill_headers4(const struct ctx *c,
 
 /**
  * tcp_fill_headers6() - Fill 802.3, IPv6, TCP headers in pre-cooked buffers
- * @c:		Execution context
  * @conn:	Connection pointer
  * @taph:	tap backend specific header
  * @ip6h:	Pointer to IPv6 header
@@ -1382,8 +1379,7 @@ static size_t tcp_fill_headers4(const struct ctx *c,
  *
  * Return: The IPv6 payload length, host order
  */
-static size_t tcp_fill_headers6(const struct ctx *c,
-				const struct tcp_tap_conn *conn,
+static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
 				struct tap_hdr *taph,
 				struct ipv6hdr *ip6h, struct tcphdr *th,
 				size_t dlen, uint32_t seq)
@@ -1393,10 +1389,7 @@ static size_t tcp_fill_headers6(const struct ctx *c,
 
 	ip6h->payload_len = htons(l4len);
 	ip6h->saddr = tapside->faddr.a6;
-	if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr))
-		ip6h->daddr = c->ip6.addr_ll_seen;
-	else
-		ip6h->daddr = c->ip6.addr_seen;
+	ip6h->daddr = tapside->eaddr.a6;
 
 	ip6h->hop_limit = 255;
 	ip6h->version = 6;
@@ -1417,7 +1410,6 @@ static size_t tcp_fill_headers6(const struct ctx *c,
 
 /**
  * tcp_l2_buf_fill_headers() - Fill 802.3, IP, TCP headers in pre-cooked buffers
- * @c:		Execution context
  * @conn:	Connection pointer
  * @iov:	Pointer to an array of iovec of TCP pre-cooked buffers
  * @dlen:	TCP payload length
@@ -1426,19 +1418,18 @@ static size_t tcp_fill_headers6(const struct ctx *c,
  *
  * Return: IP payload length, host order
  */
-static size_t tcp_l2_buf_fill_headers(const struct ctx *c,
-				      const struct tcp_tap_conn *conn,
+static size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
 				      struct iovec *iov, size_t dlen,
 				      const uint16_t *check, uint32_t seq)
 {
 	if (CONN_V4(conn)) {
-		return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base,
+		return tcp_fill_headers4(conn, iov[TCP_IOV_TAP].iov_base,
 					 iov[TCP_IOV_IP].iov_base,
 					 iov[TCP_IOV_PAYLOAD].iov_base, dlen,
 					 check, seq);
 	}
 
-	return tcp_fill_headers6(c, conn, iov[TCP_IOV_TAP].iov_base,
+	return tcp_fill_headers6(conn, iov[TCP_IOV_TAP].iov_base,
 				 iov[TCP_IOV_IP].iov_base,
 				 iov[TCP_IOV_PAYLOAD].iov_base, dlen,
 				 seq);
@@ -1654,7 +1645,7 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
 	th->syn = !!(flags & SYN);
 	th->fin = !!(flags & FIN);
 
-	l4len = tcp_l2_buf_fill_headers(c, conn, iov, optlen, NULL,
+	l4len = tcp_l2_buf_fill_headers(conn, iov, optlen, NULL,
 					conn->seq_to_tap);
 	iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 
@@ -1753,18 +1744,12 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
 			 const struct timespec *now)
 {
 	struct siphash_state state = SIPHASH_INIT(c->hash_secret);
-	union inany_addr aany;
 	const struct flowside *tapside = &conn->f.side[TAPSIDE];
 	uint64_t hash;
 	uint32_t ns;
 
-	if (CONN_V4(conn))
-		inany_from_af(&aany, AF_INET, &c->ip4.addr);
-	else
-		inany_from_af(&aany, AF_INET6, &c->ip6.addr);
-
 	inany_siphash_feed(&state, &tapside->faddr);
-	inany_siphash_feed(&state, &aany);
+	inany_siphash_feed(&state, &tapside->eaddr);
 	hash = siphash_final(&state, 36,
 			     (uint64_t)tapside->fport << 16 | tapside->eport);
 
@@ -2161,7 +2146,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 		tcp4_seq_update[tcp4_payload_used].len = dlen;
 
 		iov = tcp4_l2_iov[tcp4_payload_used++];
-		l4len = tcp_l2_buf_fill_headers(c, conn, iov, dlen, check, seq);
+		l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, check, seq);
 		iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 		if (tcp4_payload_used > TCP_FRAMES_MEM - 1)
 			tcp_payload_flush(c);
@@ -2170,7 +2155,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 		tcp6_seq_update[tcp6_payload_used].len = dlen;
 
 		iov = tcp6_l2_iov[tcp6_payload_used++];
-		l4len = tcp_l2_buf_fill_headers(c, conn, iov, dlen, NULL, seq);
+		l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, NULL, seq);
 		iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 		if (tcp6_payload_used > TCP_FRAMES_MEM - 1)
 			tcp_payload_flush(c);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v4 05/16] tcp: Simplify endpoint validation using flowside information
  2024-05-03  1:11 [PATCH v4 00/16] RFC: Unified flow table David Gibson
                   ` (3 preceding siblings ...)
  2024-05-03  1:11 ` [PATCH v4 04/16] tcp: Obtain guest address from flowside David Gibson
@ 2024-05-03  1:11 ` David Gibson
  2024-05-03  1:11 ` [PATCH v4 06/16] tcp, tcp_splice: Construct sockaddrs for connect() from flowside David Gibson
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2024-05-03  1:11 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Now that we store all our endpoints in the flowside structure, use some
inany helpers to make validation of those endpoints simpler.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 inany.h |  1 -
 tcp.c   | 69 +++++++++++++--------------------------------------------
 2 files changed, 15 insertions(+), 55 deletions(-)

diff --git a/inany.h b/inany.h
index 407690e..c0228a1 100644
--- a/inany.h
+++ b/inany.h
@@ -123,7 +123,6 @@ static inline bool inany_is_multicast(const union inany_addr *a)
  *
  * Return: true if @a is specified and a unicast address
  */
-/* cppcheck-suppress unusedFunction */
 static inline bool inany_is_unicast(const union inany_addr *a)
 {
 	return !inany_is_unspecified(a) && !inany_is_multicast(a);
diff --git a/tcp.c b/tcp.c
index 5ff7480..e669b18 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1954,36 +1954,16 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 
 	flowside_from_af(tapside, PIF_TAP, af, daddr, dstport, saddr, srcport);
 
-	if (af == AF_INET) {
-		if (IN4_IS_ADDR_UNSPECIFIED(saddr) ||
-		    IN4_IS_ADDR_BROADCAST(saddr) ||
-		    IN4_IS_ADDR_MULTICAST(saddr) || srcport == 0 ||
-		    IN4_IS_ADDR_UNSPECIFIED(daddr) ||
-		    IN4_IS_ADDR_BROADCAST(daddr) ||
-		    IN4_IS_ADDR_MULTICAST(daddr) || dstport == 0) {
-			char sstr[INET_ADDRSTRLEN], dstr[INET_ADDRSTRLEN];
-
-			debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu",
-			      inet_ntop(AF_INET, saddr, sstr, sizeof(sstr)),
-			      srcport,
-			      inet_ntop(AF_INET, daddr, dstr, sizeof(dstr)),
-			      dstport);
-			goto cancel;
-		}
-	} else if (af == AF_INET6) {
-		if (IN6_IS_ADDR_UNSPECIFIED(saddr) ||
-		    IN6_IS_ADDR_MULTICAST(saddr) || srcport == 0 ||
-		    IN6_IS_ADDR_UNSPECIFIED(daddr) ||
-		    IN6_IS_ADDR_MULTICAST(daddr) || dstport == 0) {
-			char sstr[INET6_ADDRSTRLEN], dstr[INET6_ADDRSTRLEN];
-
-			debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu",
-			      inet_ntop(AF_INET6, saddr, sstr, sizeof(sstr)),
-			      srcport,
-			      inet_ntop(AF_INET6, daddr, dstr, sizeof(dstr)),
-			      dstport);
-			goto cancel;
-		}
+	if (!inany_is_unicast(&tapside->eaddr) || tapside->eport == 0 ||
+	    !inany_is_unicast(&tapside->faddr) || tapside->fport == 0) {
+		char sstr[INANY_ADDRSTRLEN], dstr[INANY_ADDRSTRLEN];
+
+		debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu",
+		      inany_ntop(&tapside->eaddr, sstr, sizeof(sstr)),
+		      tapside->eport,
+		      inany_ntop(&tapside->faddr, dstr, sizeof(dstr)),
+		      tapside->fport);
+		goto cancel;
 	}
 
 	if ((s = tcp_conn_sock(c, af)) < 0)
@@ -2788,31 +2768,12 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
 	inany_from_sockaddr(&side0->eaddr, &side0->eport, &sa);
 	side0->fport = ref.tcp_listen.port;
 
-	if (sa.sa_family == AF_INET) {
-		const struct in_addr *addr = &sa.sa4.sin_addr;
-		in_port_t port = sa.sa4.sin_port;
-
-		if (IN4_IS_ADDR_UNSPECIFIED(addr) ||
-		    IN4_IS_ADDR_BROADCAST(addr) ||
-		    IN4_IS_ADDR_MULTICAST(addr) || port == 0) {
-			char str[INET_ADDRSTRLEN];
+	if (!inany_is_unicast(&side0->eaddr) || side0->eport == 0) {
+		char str[INANY_ADDRSTRLEN];
 
-			err("Invalid endpoint from TCP accept(): %s:%hu",
-			    inet_ntop(AF_INET, addr, str, sizeof(str)), port);
-			goto cancel;
-		}
-	} else if (sa.sa_family == AF_INET6) {
-		const struct in6_addr *addr = &sa.sa6.sin6_addr;
-		in_port_t port = sa.sa6.sin6_port;
-
-		if (IN6_IS_ADDR_UNSPECIFIED(addr) ||
-		    IN6_IS_ADDR_MULTICAST(addr) || port == 0) {
-			char str[INET6_ADDRSTRLEN];
-
-			err("Invalid endpoint from TCP accept(): %s:%hu",
-			    inet_ntop(AF_INET6, addr, str, sizeof(str)), port);
-			goto cancel;
-		}
+		err("Invalid endpoint from TCP accept(): %s:%hu",
+		    inany_ntop(&side0->eaddr, str, sizeof(str)), side0->eport);
+		goto cancel;
 	}
 
 	if (tcp_splice_conn_from_sock(c, ref.tcp_listen.port, flow, s))
-- 
@@ -1954,36 +1954,16 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 
 	flowside_from_af(tapside, PIF_TAP, af, daddr, dstport, saddr, srcport);
 
-	if (af == AF_INET) {
-		if (IN4_IS_ADDR_UNSPECIFIED(saddr) ||
-		    IN4_IS_ADDR_BROADCAST(saddr) ||
-		    IN4_IS_ADDR_MULTICAST(saddr) || srcport == 0 ||
-		    IN4_IS_ADDR_UNSPECIFIED(daddr) ||
-		    IN4_IS_ADDR_BROADCAST(daddr) ||
-		    IN4_IS_ADDR_MULTICAST(daddr) || dstport == 0) {
-			char sstr[INET_ADDRSTRLEN], dstr[INET_ADDRSTRLEN];
-
-			debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu",
-			      inet_ntop(AF_INET, saddr, sstr, sizeof(sstr)),
-			      srcport,
-			      inet_ntop(AF_INET, daddr, dstr, sizeof(dstr)),
-			      dstport);
-			goto cancel;
-		}
-	} else if (af == AF_INET6) {
-		if (IN6_IS_ADDR_UNSPECIFIED(saddr) ||
-		    IN6_IS_ADDR_MULTICAST(saddr) || srcport == 0 ||
-		    IN6_IS_ADDR_UNSPECIFIED(daddr) ||
-		    IN6_IS_ADDR_MULTICAST(daddr) || dstport == 0) {
-			char sstr[INET6_ADDRSTRLEN], dstr[INET6_ADDRSTRLEN];
-
-			debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu",
-			      inet_ntop(AF_INET6, saddr, sstr, sizeof(sstr)),
-			      srcport,
-			      inet_ntop(AF_INET6, daddr, dstr, sizeof(dstr)),
-			      dstport);
-			goto cancel;
-		}
+	if (!inany_is_unicast(&tapside->eaddr) || tapside->eport == 0 ||
+	    !inany_is_unicast(&tapside->faddr) || tapside->fport == 0) {
+		char sstr[INANY_ADDRSTRLEN], dstr[INANY_ADDRSTRLEN];
+
+		debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu",
+		      inany_ntop(&tapside->eaddr, sstr, sizeof(sstr)),
+		      tapside->eport,
+		      inany_ntop(&tapside->faddr, dstr, sizeof(dstr)),
+		      tapside->fport);
+		goto cancel;
 	}
 
 	if ((s = tcp_conn_sock(c, af)) < 0)
@@ -2788,31 +2768,12 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
 	inany_from_sockaddr(&side0->eaddr, &side0->eport, &sa);
 	side0->fport = ref.tcp_listen.port;
 
-	if (sa.sa_family == AF_INET) {
-		const struct in_addr *addr = &sa.sa4.sin_addr;
-		in_port_t port = sa.sa4.sin_port;
-
-		if (IN4_IS_ADDR_UNSPECIFIED(addr) ||
-		    IN4_IS_ADDR_BROADCAST(addr) ||
-		    IN4_IS_ADDR_MULTICAST(addr) || port == 0) {
-			char str[INET_ADDRSTRLEN];
+	if (!inany_is_unicast(&side0->eaddr) || side0->eport == 0) {
+		char str[INANY_ADDRSTRLEN];
 
-			err("Invalid endpoint from TCP accept(): %s:%hu",
-			    inet_ntop(AF_INET, addr, str, sizeof(str)), port);
-			goto cancel;
-		}
-	} else if (sa.sa_family == AF_INET6) {
-		const struct in6_addr *addr = &sa.sa6.sin6_addr;
-		in_port_t port = sa.sa6.sin6_port;
-
-		if (IN6_IS_ADDR_UNSPECIFIED(addr) ||
-		    IN6_IS_ADDR_MULTICAST(addr) || port == 0) {
-			char str[INET6_ADDRSTRLEN];
-
-			err("Invalid endpoint from TCP accept(): %s:%hu",
-			    inet_ntop(AF_INET6, addr, str, sizeof(str)), port);
-			goto cancel;
-		}
+		err("Invalid endpoint from TCP accept(): %s:%hu",
+		    inany_ntop(&side0->eaddr, str, sizeof(str)), side0->eport);
+		goto cancel;
 	}
 
 	if (tcp_splice_conn_from_sock(c, ref.tcp_listen.port, flow, s))
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v4 06/16] tcp, tcp_splice: Construct sockaddrs for connect() from flowside
  2024-05-03  1:11 [PATCH v4 00/16] RFC: Unified flow table David Gibson
                   ` (4 preceding siblings ...)
  2024-05-03  1:11 ` [PATCH v4 05/16] tcp: Simplify endpoint validation using flowside information David Gibson
@ 2024-05-03  1:11 ` David Gibson
  2024-05-03  1:11 ` [PATCH v4 07/16] tcp_splice: Eliminate SPLICE_V6 flag David Gibson
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2024-05-03  1:11 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

In tcp_conn_from_tap() we currently generate the sockaddr we need to
connect, and from it we fill in the sock flowside information.  To make
flowsides to be the primary source of truth, reverse this.  We use a new
sockaddr_from_inany() helper to build the sockaddr from the flowside.
We also rearrange things a bit in a way that's now more natural.

Similarly tcp_splice_connect() is given parameters from which it constructs
a sockaddr in parallel to the address already in the flowside.  Construct
the sockaddr from the flowside instead.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 inany.h      | 28 +++++++++++++++++++++++++
 tcp.c        | 59 +++++++++++++++++++---------------------------------
 tcp_splice.c | 36 ++++++++------------------------
 3 files changed, 58 insertions(+), 65 deletions(-)

diff --git a/inany.h b/inany.h
index c0228a1..6135f26 100644
--- a/inany.h
+++ b/inany.h
@@ -183,4 +183,32 @@ static inline void inany_siphash_feed(struct siphash_state *state,
 
 const char *inany_ntop(const union inany_addr *src, char *dst, socklen_t size);
 
+/** sockaddr_from_inany - Construct a sockaddr from an inany
+ * @sa:		Pointer to sockaddr to fill in
+ * @sl:		Updated to relevant of length of initialised @sa
+ * @addr:	IPv[46] address
+ * @port:	Port (host byte order)
+ * @scope:	Scope ID (ignored for IPv4 addresses)
+ */
+static inline void sockaddr_from_inany(union sockaddr_inany *sa, socklen_t *sl,
+				       const union inany_addr *addr,
+				       in_port_t port, uint32_t scope)
+{
+	const struct in_addr *v4 = inany_v4(addr);
+
+	if (v4) {
+		sa->sa_family = AF_INET;
+		sa->sa4.sin_addr = *v4;
+		sa->sa4.sin_port = htons(port);
+		*sl = sizeof(sa->sa4);
+	} else {
+		sa->sa_family = AF_INET6;
+		sa->sa6.sin6_addr = addr->a6;
+		sa->sa6.sin6_port = htons(port);
+		sa->sa6.sin6_scope_id = scope;
+		sa->sa6.sin6_flowinfo = 0;
+		*sl = sizeof(sa->sa6);
+	}
+}
+
 #endif /* INANY_H */
diff --git a/tcp.c b/tcp.c
index e669b18..9ba2b07 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1929,19 +1929,9 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 {
 	in_port_t srcport = ntohs(th->source);
 	in_port_t dstport = ntohs(th->dest);
-	struct sockaddr_in addr4 = {
-		.sin_family = AF_INET,
-		.sin_port = htons(dstport),
-		.sin_addr = *(struct in_addr *)daddr,
-	};
-	struct sockaddr_in6 addr6 = {
-		.sin6_family = AF_INET6,
-		.sin6_port = htons(dstport),
-		.sin6_addr = *(struct in6_addr *)daddr,
-	};
 	struct flowside *tapside, *sockside;
-	const struct sockaddr *sa;
 	struct tcp_tap_conn *conn;
+	union sockaddr_inany sa;
 	union flow *flow;
 	int s = -1, mss;
 	socklen_t sl;
@@ -1966,17 +1956,23 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 		goto cancel;
 	}
 
-	if ((s = tcp_conn_sock(c, af)) < 0)
-		goto cancel;
+	sockside->pif = PIF_HOST;
+	sockside->eaddr = tapside->faddr;
+	sockside->eport = tapside->fport;
 
 	if (!c->no_map_gw) {
-		if (af == AF_INET && IN4_ARE_ADDR_EQUAL(daddr, &c->ip4.gw))
-			addr4.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
-		if (af == AF_INET6 && IN6_ARE_ADDR_EQUAL(daddr, &c->ip6.gw))
-			addr6.sin6_addr	= in6addr_loopback;
+		struct in_addr *v4 = inany_v4(&sockside->eaddr);
+
+		if (v4 && IN4_ARE_ADDR_EQUAL(v4, &c->ip4.gw))
+			*v4 = in4addr_loopback;
+		if (IN6_ARE_ADDR_EQUAL(&sockside->eaddr, &c->ip6.gw))
+			sockside->eaddr.a6 = in6addr_loopback;
 	}
 
-	if (af == AF_INET6 && IN6_IS_ADDR_LINKLOCAL(&addr6.sin6_addr)) {
+	if ((s = tcp_conn_sock(c, af)) < 0)
+		goto cancel;
+
+	if (IN6_IS_ADDR_LINKLOCAL(&sockside->eaddr)) {
 		struct sockaddr_in6 addr6_ll = {
 			.sin6_family = AF_INET6,
 			.sin6_addr = c->ip6.addr_ll,
@@ -1984,6 +1980,8 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 		};
 		if (bind(s, (struct sockaddr *)&addr6_ll, sizeof(addr6_ll)))
 			goto cancel;
+	} else if (!inany_is_loopback(&sockside->eaddr)) {
+		tcp_bind_outbound(c, s, af);
 	}
 
 	conn = FLOW_START(flow, FLOW_TCP, tcp, TAPSIDE);
@@ -2006,19 +2004,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 	if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap)))
 		conn->wnd_from_tap = 1;
 
-	sockside->pif = PIF_HOST;
-	sockside->eport = dstport;
-
-	if (af == AF_INET) {
-		inany_from_af(&sockside->eaddr, AF_INET, &addr4.sin_addr);
-		sa = (struct sockaddr *)&addr4;
-		sl = sizeof(addr4);
-	} else {
-		inany_from_af(&sockside->eaddr, AF_INET6, &addr6.sin6_addr);
-		sa = (struct sockaddr *)&addr6;
-		sl = sizeof(addr6);
-	}
-
 	conn->seq_init_from_tap = ntohl(th->seq);
 	conn->seq_from_tap = conn->seq_init_from_tap + 1;
 	conn->seq_ack_to_tap = conn->seq_from_tap;
@@ -2028,19 +2013,17 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 
 	tcp_hash_insert(c, conn);
 
-	if (!bind(s, sa, sl)) {
+	sockaddr_from_inany(&sa, &sl, &sockside->eaddr, sockside->eport,
+			    c->ifi6);
+
+	if (!bind(s, &sa.sa, sl)) {
 		tcp_rst(c, conn);	/* Nobody is listening then */
 		return;
 	}
 	if (errno != EADDRNOTAVAIL && errno != EACCES)
 		conn_flag(c, conn, LOCAL);
 
-	if ((af == AF_INET &&  !IN4_IS_ADDR_LOOPBACK(&addr4.sin_addr)) ||
-	    (af == AF_INET6 && !IN6_IS_ADDR_LOOPBACK(&addr6.sin6_addr) &&
-			       !IN6_IS_ADDR_LINKLOCAL(&addr6.sin6_addr)))
-		tcp_bind_outbound(c, s, af);
-
-	if (connect(s, sa, sl)) {
+	if (connect(s, &sa.sa, sl)) {
 		if (errno != EINPROGRESS) {
 			tcp_rst(c, conn);
 			return;
diff --git a/tcp_splice.c b/tcp_splice.c
index 462ed0c..aa04a9b 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -321,31 +321,19 @@ static int tcp_splice_connect_finish(const struct ctx *c,
  * tcp_splice_connect() - Create and connect socket for new spliced connection
  * @c:		Execution context
  * @conn:	Connection pointer
- * @af:		Address family
- * @pif:	pif on which to create socket
- * @port:	Destination port, host order
  *
  * Return: 0 for connect() succeeded or in progress, negative value on error
  */
-static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
-			      sa_family_t af, uint8_t pif, in_port_t port)
+static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn)
 {
-	struct sockaddr_in6 addr6 = {
-		.sin6_family = AF_INET6,
-		.sin6_port = htons(port),
-		.sin6_addr = IN6ADDR_LOOPBACK_INIT,
-	};
-	struct sockaddr_in addr4 = {
-		.sin_family = AF_INET,
-		.sin_port = htons(port),
-		.sin_addr = IN4ADDR_LOOPBACK_INIT,
-	};
-	const struct sockaddr *sa;
+	const struct flowside *side1 = &conn->f.side[1];
+	sa_family_t af = inany_v4(&side1->eaddr) ? AF_INET : AF_INET6;
+	union sockaddr_inany sa;
 	socklen_t sl;
 
-	if (pif == PIF_HOST)
+	if (side1->pif == PIF_HOST)
 		conn->s[1] = tcp_conn_sock(c, af);
-	else if (pif == PIF_SPLICE)
+	else if (side1->pif == PIF_SPLICE)
 		conn->s[1] = tcp_conn_sock_ns(c, af);
 	else
 		ASSERT(0);
@@ -359,15 +347,9 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
 			   conn->s[1]);
 	}
 
-	if (CONN_V6(conn)) {
-		sa = (struct sockaddr *)&addr6;
-		sl = sizeof(addr6);
-	} else {
-		sa = (struct sockaddr *)&addr4;
-		sl = sizeof(addr4);
-	}
+	sockaddr_from_inany(&sa, &sl, &side1->eaddr, side1->eport, 0);
 
-	if (connect(conn->s[1], sa, sl)) {
+	if (connect(conn->s[1], &sa.sa, sl)) {
 		if (errno != EINPROGRESS) {
 			flow_trace(conn, "Couldn't connect socket for splice: %s",
 				   strerror(errno));
@@ -486,7 +468,7 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, in_port_t dstport,
 	if (setsockopt(s0, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int)))
 		flow_trace(conn, "failed to set TCP_QUICKACK on %i", s0);
 
-	if (tcp_splice_connect(c, conn, af, pif1, dstport))
+	if (tcp_splice_connect(c, conn))
 		conn_flag(c, conn, CLOSING);
 
 	return true;
-- 
@@ -321,31 +321,19 @@ static int tcp_splice_connect_finish(const struct ctx *c,
  * tcp_splice_connect() - Create and connect socket for new spliced connection
  * @c:		Execution context
  * @conn:	Connection pointer
- * @af:		Address family
- * @pif:	pif on which to create socket
- * @port:	Destination port, host order
  *
  * Return: 0 for connect() succeeded or in progress, negative value on error
  */
-static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
-			      sa_family_t af, uint8_t pif, in_port_t port)
+static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn)
 {
-	struct sockaddr_in6 addr6 = {
-		.sin6_family = AF_INET6,
-		.sin6_port = htons(port),
-		.sin6_addr = IN6ADDR_LOOPBACK_INIT,
-	};
-	struct sockaddr_in addr4 = {
-		.sin_family = AF_INET,
-		.sin_port = htons(port),
-		.sin_addr = IN4ADDR_LOOPBACK_INIT,
-	};
-	const struct sockaddr *sa;
+	const struct flowside *side1 = &conn->f.side[1];
+	sa_family_t af = inany_v4(&side1->eaddr) ? AF_INET : AF_INET6;
+	union sockaddr_inany sa;
 	socklen_t sl;
 
-	if (pif == PIF_HOST)
+	if (side1->pif == PIF_HOST)
 		conn->s[1] = tcp_conn_sock(c, af);
-	else if (pif == PIF_SPLICE)
+	else if (side1->pif == PIF_SPLICE)
 		conn->s[1] = tcp_conn_sock_ns(c, af);
 	else
 		ASSERT(0);
@@ -359,15 +347,9 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
 			   conn->s[1]);
 	}
 
-	if (CONN_V6(conn)) {
-		sa = (struct sockaddr *)&addr6;
-		sl = sizeof(addr6);
-	} else {
-		sa = (struct sockaddr *)&addr4;
-		sl = sizeof(addr4);
-	}
+	sockaddr_from_inany(&sa, &sl, &side1->eaddr, side1->eport, 0);
 
-	if (connect(conn->s[1], sa, sl)) {
+	if (connect(conn->s[1], &sa.sa, sl)) {
 		if (errno != EINPROGRESS) {
 			flow_trace(conn, "Couldn't connect socket for splice: %s",
 				   strerror(errno));
@@ -486,7 +468,7 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, in_port_t dstport,
 	if (setsockopt(s0, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int)))
 		flow_trace(conn, "failed to set TCP_QUICKACK on %i", s0);
 
-	if (tcp_splice_connect(c, conn, af, pif1, dstport))
+	if (tcp_splice_connect(c, conn))
 		conn_flag(c, conn, CLOSING);
 
 	return true;
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v4 07/16] tcp_splice: Eliminate SPLICE_V6 flag
  2024-05-03  1:11 [PATCH v4 00/16] RFC: Unified flow table David Gibson
                   ` (5 preceding siblings ...)
  2024-05-03  1:11 ` [PATCH v4 06/16] tcp, tcp_splice: Construct sockaddrs for connect() from flowside David Gibson
@ 2024-05-03  1:11 ` David Gibson
  2024-05-03  1:11 ` [PATCH v4 08/16] tcp, flow: Replace TCP specific hash function with general flow hash David Gibson
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2024-05-03  1:11 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Since we're now constructing socket addresses based on information in the
flowside, we no longer need an explicit flag to tell if we're dealing with
an IPv4 or IPv6 connection.  Hence, drop the now unused SPLICE_V6 flag.

As well as just simplifying the code, this allows for possible future
extensions where we could splice an IPv4 connection to an IPv6 connection
or vice versa.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 tcp_conn.h   | 11 +++++------
 tcp_splice.c |  3 ---
 2 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/tcp_conn.h b/tcp_conn.h
index f55f144..dd31d10 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -129,12 +129,11 @@ struct tcp_splice_conn {
 #define FIN_SENT_1			BIT(7)
 
 	uint8_t flags;
-#define SPLICE_V6			BIT(0)
-#define RCVLOWAT_SET_0			BIT(1)
-#define RCVLOWAT_SET_1			BIT(2)
-#define RCVLOWAT_ACT_0			BIT(3)
-#define RCVLOWAT_ACT_1			BIT(4)
-#define CLOSING				BIT(5)
+#define RCVLOWAT_SET_0			BIT(0)
+#define RCVLOWAT_SET_1			BIT(1)
+#define RCVLOWAT_ACT_0			BIT(2)
+#define RCVLOWAT_ACT_1			BIT(3)
+#define CLOSING				BIT(4)
 
 	uint32_t read[SIDES];
 	uint32_t written[SIDES];
diff --git a/tcp_splice.c b/tcp_splice.c
index aa04a9b..00b88c5 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -73,8 +73,6 @@ static int ns_sock_pool6	[TCP_SOCK_POOL_SIZE];
 /* Pool of pre-opened pipes */
 static int splice_pipe_pool		[TCP_SPLICE_PIPE_POOL_SIZE][2];
 
-#define CONN_V6(x)			(x->flags & SPLICE_V6)
-#define CONN_V4(x)			(!CONN_V6(x))
 #define CONN_HAS(conn, set)		((conn->events & (set)) == (set))
 #define CONN(idx)			(&FLOW(idx)->tcp_splice)
 
@@ -459,7 +457,6 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, in_port_t dstport,
 
 	conn = FLOW_START(flow, FLOW_TCP_SPLICE, tcp_splice, 0);
 
-	conn->flags = af == AF_INET ? 0 : SPLICE_V6;
 	conn->s[0] = s0;
 	conn->s[1] = -1;
 	conn->pipe[0][0] = conn->pipe[0][1] = -1;
-- 
@@ -73,8 +73,6 @@ static int ns_sock_pool6	[TCP_SOCK_POOL_SIZE];
 /* Pool of pre-opened pipes */
 static int splice_pipe_pool		[TCP_SPLICE_PIPE_POOL_SIZE][2];
 
-#define CONN_V6(x)			(x->flags & SPLICE_V6)
-#define CONN_V4(x)			(!CONN_V6(x))
 #define CONN_HAS(conn, set)		((conn->events & (set)) == (set))
 #define CONN(idx)			(&FLOW(idx)->tcp_splice)
 
@@ -459,7 +457,6 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, in_port_t dstport,
 
 	conn = FLOW_START(flow, FLOW_TCP_SPLICE, tcp_splice, 0);
 
-	conn->flags = af == AF_INET ? 0 : SPLICE_V6;
 	conn->s[0] = s0;
 	conn->s[1] = -1;
 	conn->pipe[0][0] = conn->pipe[0][1] = -1;
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v4 08/16] tcp, flow: Replace TCP specific hash function with general flow hash
  2024-05-03  1:11 [PATCH v4 00/16] RFC: Unified flow table David Gibson
                   ` (6 preceding siblings ...)
  2024-05-03  1:11 ` [PATCH v4 07/16] tcp_splice: Eliminate SPLICE_V6 flag David Gibson
@ 2024-05-03  1:11 ` David Gibson
  2024-05-03  1:11 ` [PATCH v4 09/16] flow, tcp: Generalise TCP hash table to general flow hash table David Gibson
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2024-05-03  1:11 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Currently we match TCP packets received on the tap connection to a TCP
connection via a hash table based on the forwarding address and both
ports.  We hope in future to allow for multiple guest side addresses, or
for multiple interfaces which means we may need to distinguish based on
the endpoint address and pif as well.  We also want a unified hash table
to cover multiple protocols, not just TCP.

Replace the TCP specific hash function with one suitable for general flows,
or rather for one side of a general flow.  This includes all the
information from struct flowside, plus the L4 protocol number.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c | 28 ++++++++++++++++++++++++++
 flow.h | 19 ++++++++++++++++++
 tcp.c  | 62 +++++++++++-----------------------------------------------
 3 files changed, 58 insertions(+), 51 deletions(-)

diff --git a/flow.c b/flow.c
index 02d6008..370d422 100644
--- a/flow.c
+++ b/flow.c
@@ -268,6 +268,34 @@ void flow_alloc_cancel(union flow *flow)
 	flow_first_free = FLOW_IDX(flow);
 }
 
+/**
+ * flow_hash() - Calculate hash value for one side of a flow
+ * @c:		Execution context
+ * @proto:	Protocol of this flow (IP L4 protocol number)
+ * @fside:	Flowside (must not have unspecified parts)
+ *
+ * Return: hash value
+ */
+uint64_t flow_hash(const struct ctx *c, uint8_t proto,
+		   const struct flowside *fside)
+{
+	struct siphash_state state = SIPHASH_INIT(c->hash_secret);
+
+	/* For the hash table to work, we need complete information in the
+	 * flowside.
+	 */
+	ASSERT(fside->pif != PIF_NONE &&
+	       !inany_is_unspecified(&fside->faddr) && fside->fport != 0 &&
+	       !inany_is_unspecified(&fside->eaddr) && fside->eport != 0);
+
+	inany_siphash_feed(&state, &fside->faddr);
+	inany_siphash_feed(&state, &fside->eaddr);
+
+	return siphash_final(&state, 38, (uint64_t)proto << 40 |
+			     (uint64_t)fside->pif << 32 |
+			     fside->fport << 16 | fside->eport);
+}
+
 /**
  * flow_defer_handler() - Handler for per-flow deferred and timed tasks
  * @c:		Execution context
diff --git a/flow.h b/flow.h
index 88caa76..218edc6 100644
--- a/flow.h
+++ b/flow.h
@@ -107,6 +107,22 @@ static inline void flowside_from_af(struct flowside *fside,
 
 #define SIDES			2
 
+/**
+ * flowside_eq() - Check if two flowsides are equal
+ * @left, @right:	Flowsides to compare
+ *
+ * Return: true if equal, false otherwise
+ */
+static inline bool flowside_eq(const struct flowside *left,
+			       const struct flowside *right)
+{
+	return left->pif == right->pif &&
+	       inany_equals(&left->eaddr, &right->eaddr) &&
+	       left->eport == right->eport &&
+	       inany_equals(&left->faddr, &right->faddr) &&
+	       left->fport == right->fport;
+}
+
 /**
  * struct flow_common - Common fields for packet flows
  * @side[]:	Information for each side of the flow
@@ -128,6 +144,9 @@ union flow *flow_start(union flow *flow, enum flow_type type,
 #define FLOW_START(flow_, t_, var_, i_)		\
 	(&flow_start((flow_), (t_), (i_))->var_)
 
+uint64_t flow_hash(const struct ctx *c, uint8_t proto,
+		   const struct flowside *fside);
+
 /**
  * struct flow_sidx - ID for one side of a specific flow
  * @side:	Side referenced (0 or 1)
diff --git a/tcp.c b/tcp.c
index 9ba2b07..2ad47c5 100644
--- a/tcp.c
+++ b/tcp.c
@@ -524,7 +524,7 @@ static struct iovec	tcp_iov			[UIO_MAXIOV];
 
 #define CONN(idx)		(&(FLOW(idx)->tcp))
 
-/* Table for lookup from remote address, local port, remote port */
+/* Table for lookup from flowside information */
 static flow_sidx_t tc_hash[TCP_HASH_TABLE_SIZE];
 
 static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX,
@@ -1074,46 +1074,6 @@ static int tcp_opt_get(const char *opts, size_t len, uint8_t type_find,
 	return -1;
 }
 
-/**
- * tcp_hash_match() - Check if a connection entry matches address and ports
- * @conn:	Connection entry to match against
- * @faddr:	Guest side forwarding address
- * @eport:	Guest side endpoint port
- * @fport:	Guest side forwarding port
- *
- * Return: 1 on match, 0 otherwise
- */
-static int tcp_hash_match(const struct tcp_tap_conn *conn,
-			  const union inany_addr *faddr,
-			  in_port_t eport, in_port_t fport)
-{
-	const struct flowside *tapside = &conn->f.side[TAPSIDE];
-
-	if (inany_equals(&tapside->faddr, faddr) &&
-	    tapside->eport == eport && tapside->fport == fport)
-		return 1;
-
-	return 0;
-}
-
-/**
- * tcp_hash() - Calculate hash value for connection given address and ports
- * @c:		Execution context
- * @faddr:	Guest side forwarding address
- * @eport:	Guest side endpoint port
- * @fport:	Guest side forwarding port
- *
- * Return: hash value, needs to be adjusted for table size
- */
-static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr,
-			 in_port_t eport, in_port_t fport)
-{
-	struct siphash_state state = SIPHASH_INIT(c->hash_secret);
-
-	inany_siphash_feed(&state, faddr);
-	return siphash_final(&state, 20, (uint64_t)eport << 16 | fport);
-}
-
 /**
  * tcp_conn_hash() - Calculate hash bucket of an existing connection
  * @c:		Execution context
@@ -1124,9 +1084,7 @@ static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr,
 static uint64_t tcp_conn_hash(const struct ctx *c,
 			      const struct tcp_tap_conn *conn)
 {
-	const struct flowside *tapside = &conn->f.side[TAPSIDE];
-
-	return tcp_hash(c, &tapside->faddr, tapside->eport, tapside->fport);
+	return flow_hash(c, IPPROTO_TCP, &conn->f.side[TAPSIDE]);
 }
 
 /**
@@ -1201,25 +1159,26 @@ static void tcp_hash_remove(const struct ctx *c,
  * tcp_hash_lookup() - Look up connection given remote address and ports
  * @c:		Execution context
  * @af:		Address family, AF_INET or AF_INET6
+ * @eaddr:	Guest side endpoint address (guest local address)
  * @faddr:	Guest side forwarding address (guest remote address)
  * @eport:	Guest side endpoint port (guest local port)
  * @fport:	Guest side forwarding port (guest remote port)
  *
  * Return: connection pointer, if found, -ENOENT otherwise
  */
-static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c,
-					    sa_family_t af, const void *faddr,
+static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, sa_family_t af,
+					    const void *eaddr, const void *faddr,
 					    in_port_t eport, in_port_t fport)
 {
-	union inany_addr aany;
+	struct flowside fside;
 	union flow *flow;
 	unsigned b;
 
-	inany_from_af(&aany, af, faddr);
+	flowside_from_af(&fside, PIF_TAP, af, faddr, fport, eaddr, eport);
 
-	b = tcp_hash(c, &aany, eport, fport) % TCP_HASH_TABLE_SIZE;
+	b = flow_hash(c, IPPROTO_TCP, &fside) % TCP_HASH_TABLE_SIZE;
 	while ((flow = flow_at_sidx(tc_hash[b])) &&
-	       !tcp_hash_match(&flow->tcp, &aany, eport, fport))
+	       !flowside_eq(&flow->f.side[TAPSIDE], &fside))
 		b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE);
 
 	return &flow->tcp;
@@ -2522,7 +2481,8 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
 	optlen = MIN(optlen, ((1UL << 4) /* from doff width */ - 6) * 4UL);
 	opts = packet_get(p, idx, sizeof(*th), optlen, NULL);
 
-	conn = tcp_hash_lookup(c, af, daddr, ntohs(th->source), ntohs(th->dest));
+	conn = tcp_hash_lookup(c, af, saddr, daddr,
+			       ntohs(th->source), ntohs(th->dest));
 
 	/* New connection from tap */
 	if (!conn) {
-- 
@@ -524,7 +524,7 @@ static struct iovec	tcp_iov			[UIO_MAXIOV];
 
 #define CONN(idx)		(&(FLOW(idx)->tcp))
 
-/* Table for lookup from remote address, local port, remote port */
+/* Table for lookup from flowside information */
 static flow_sidx_t tc_hash[TCP_HASH_TABLE_SIZE];
 
 static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX,
@@ -1074,46 +1074,6 @@ static int tcp_opt_get(const char *opts, size_t len, uint8_t type_find,
 	return -1;
 }
 
-/**
- * tcp_hash_match() - Check if a connection entry matches address and ports
- * @conn:	Connection entry to match against
- * @faddr:	Guest side forwarding address
- * @eport:	Guest side endpoint port
- * @fport:	Guest side forwarding port
- *
- * Return: 1 on match, 0 otherwise
- */
-static int tcp_hash_match(const struct tcp_tap_conn *conn,
-			  const union inany_addr *faddr,
-			  in_port_t eport, in_port_t fport)
-{
-	const struct flowside *tapside = &conn->f.side[TAPSIDE];
-
-	if (inany_equals(&tapside->faddr, faddr) &&
-	    tapside->eport == eport && tapside->fport == fport)
-		return 1;
-
-	return 0;
-}
-
-/**
- * tcp_hash() - Calculate hash value for connection given address and ports
- * @c:		Execution context
- * @faddr:	Guest side forwarding address
- * @eport:	Guest side endpoint port
- * @fport:	Guest side forwarding port
- *
- * Return: hash value, needs to be adjusted for table size
- */
-static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr,
-			 in_port_t eport, in_port_t fport)
-{
-	struct siphash_state state = SIPHASH_INIT(c->hash_secret);
-
-	inany_siphash_feed(&state, faddr);
-	return siphash_final(&state, 20, (uint64_t)eport << 16 | fport);
-}
-
 /**
  * tcp_conn_hash() - Calculate hash bucket of an existing connection
  * @c:		Execution context
@@ -1124,9 +1084,7 @@ static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr,
 static uint64_t tcp_conn_hash(const struct ctx *c,
 			      const struct tcp_tap_conn *conn)
 {
-	const struct flowside *tapside = &conn->f.side[TAPSIDE];
-
-	return tcp_hash(c, &tapside->faddr, tapside->eport, tapside->fport);
+	return flow_hash(c, IPPROTO_TCP, &conn->f.side[TAPSIDE]);
 }
 
 /**
@@ -1201,25 +1159,26 @@ static void tcp_hash_remove(const struct ctx *c,
  * tcp_hash_lookup() - Look up connection given remote address and ports
  * @c:		Execution context
  * @af:		Address family, AF_INET or AF_INET6
+ * @eaddr:	Guest side endpoint address (guest local address)
  * @faddr:	Guest side forwarding address (guest remote address)
  * @eport:	Guest side endpoint port (guest local port)
  * @fport:	Guest side forwarding port (guest remote port)
  *
  * Return: connection pointer, if found, -ENOENT otherwise
  */
-static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c,
-					    sa_family_t af, const void *faddr,
+static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, sa_family_t af,
+					    const void *eaddr, const void *faddr,
 					    in_port_t eport, in_port_t fport)
 {
-	union inany_addr aany;
+	struct flowside fside;
 	union flow *flow;
 	unsigned b;
 
-	inany_from_af(&aany, af, faddr);
+	flowside_from_af(&fside, PIF_TAP, af, faddr, fport, eaddr, eport);
 
-	b = tcp_hash(c, &aany, eport, fport) % TCP_HASH_TABLE_SIZE;
+	b = flow_hash(c, IPPROTO_TCP, &fside) % TCP_HASH_TABLE_SIZE;
 	while ((flow = flow_at_sidx(tc_hash[b])) &&
-	       !tcp_hash_match(&flow->tcp, &aany, eport, fport))
+	       !flowside_eq(&flow->f.side[TAPSIDE], &fside))
 		b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE);
 
 	return &flow->tcp;
@@ -2522,7 +2481,8 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
 	optlen = MIN(optlen, ((1UL << 4) /* from doff width */ - 6) * 4UL);
 	opts = packet_get(p, idx, sizeof(*th), optlen, NULL);
 
-	conn = tcp_hash_lookup(c, af, daddr, ntohs(th->source), ntohs(th->dest));
+	conn = tcp_hash_lookup(c, af, saddr, daddr,
+			       ntohs(th->source), ntohs(th->dest));
 
 	/* New connection from tap */
 	if (!conn) {
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v4 09/16] flow, tcp: Generalise TCP hash table to general flow hash table
  2024-05-03  1:11 [PATCH v4 00/16] RFC: Unified flow table David Gibson
                   ` (7 preceding siblings ...)
  2024-05-03  1:11 ` [PATCH v4 08/16] tcp, flow: Replace TCP specific hash function with general flow hash David Gibson
@ 2024-05-03  1:11 ` David Gibson
  2024-05-03  1:11 ` [PATCH v4 10/16] tcp: Re-use flow hash for initial sequence number generation David Gibson
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2024-05-03  1:11 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Move the data structures and helper functions for the TCP hash table to
flow.c, making it a general hash table indexing sides of flows.  This is
largely code motion and straightforward renames.  There are two semantic
changes:

 * flow_hash_lookup() now needs to verify that the entry has a matching
   protocol as well as matching addresses, ports and interface

 * We double the size of the hash table, because it's now at least
   theoretically possible for both sides of each flow to be hashed.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c | 129 +++++++++++++++++++++++++++++++++++++++++++++++++-
 flow.h |  10 ++--
 tcp.c  | 146 +++++----------------------------------------------------
 3 files changed, 147 insertions(+), 138 deletions(-)

diff --git a/flow.c b/flow.c
index 370d422..8cefbed 100644
--- a/flow.c
+++ b/flow.c
@@ -138,6 +138,16 @@ union flow flowtab[FLOW_MAX];
 /* Last time the flow timers ran */
 static struct timespec flow_timer_run;
 
+/* Hash table to index it */
+#define FLOW_HASH_LOAD		70		/* % */
+#define FLOW_HASH_SIZE		((2 * FLOW_MAX * 100 / FLOW_HASH_LOAD))
+
+/* Table for lookup from flowside information */
+static flow_sidx_t flow_hashtab[FLOW_HASH_SIZE];
+
+static_assert(ARRAY_SIZE(flow_hashtab) >= 2 * FLOW_MAX,
+"Safe linear probing requires hash table with more entries than the number of sides in the flow table");
+
 /** flow_log_ - Log flow-related message
  * @f:		flow the message is related to
  * @pri:	Log priority
@@ -276,8 +286,8 @@ void flow_alloc_cancel(union flow *flow)
  *
  * Return: hash value
  */
-uint64_t flow_hash(const struct ctx *c, uint8_t proto,
-		   const struct flowside *fside)
+static uint64_t flow_hash(const struct ctx *c, uint8_t proto,
+			  const struct flowside *fside)
 {
 	struct siphash_state state = SIPHASH_INIT(c->hash_secret);
 
@@ -296,6 +306,116 @@ uint64_t flow_hash(const struct ctx *c, uint8_t proto,
 			     fside->fport << 16 | fside->eport);
 }
 
+/**
+ * flow_sidx_hash() - Calculate hash value for given side of a given flow
+ * @c:		Execution context
+ * @sidx:	Flow & side index to get hash for
+ *
+ * Return: hash value, of the flow & side represented by @sidx
+ */
+static uint64_t flow_sidx_hash(const struct ctx *c, flow_sidx_t sidx)
+{
+	const struct flow_common *f = &flow_at_sidx(sidx)->f;
+	return flow_hash(c, FLOW_PROTO(f), &f->side[sidx.side]);
+}
+
+/**
+ * flow_hash_probe() - Find hash bucket for a flow
+ * @c:		Execution context
+ * @sidx:	Flow and side to find bucket for
+ *
+ * Return: If @sidx is in the hash table, its current bucket, otherwise a
+ *         suitable free bucket for it.
+ */
+static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx)
+{
+	unsigned b = flow_sidx_hash(c, sidx) % FLOW_HASH_SIZE;
+
+	/* Linear probing */
+	while (!flow_sidx_eq(flow_hashtab[b], FLOW_SIDX_NONE) &&
+	       !flow_sidx_eq(flow_hashtab[b], sidx))
+		b = mod_sub(b, 1, FLOW_HASH_SIZE);
+
+	return b;
+}
+
+/**
+ * flow_hash_insert() - Insert side of a flow into into hash table
+ * @c:		Execution context
+ * @sidx:	Flow & side index
+ */
+void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx)
+{
+	unsigned b = flow_hash_probe(c, sidx);
+
+	flow_hashtab[b] = sidx;
+	flow_dbg(flow_at_sidx(sidx), "hash table insert: bucket: %u", b);
+}
+
+/**
+ * flow_hash_remove() - Drop side of a flow from the hash table
+ * @c:		Execution context
+ * @sidx:	Side of flow to remove
+ */
+void flow_hash_remove(const struct ctx *c, flow_sidx_t sidx)
+{
+	unsigned b = flow_hash_probe(c, sidx), s;
+
+	if (flow_sidx_eq(flow_hashtab[b], FLOW_SIDX_NONE))
+		return; /* Redundant remove */
+
+	flow_dbg(flow_at_sidx(sidx), "hash table remove: bucket: %u", b);
+
+	/* Scan the remainder of the cluster */
+	for (s = mod_sub(b, 1, FLOW_HASH_SIZE);
+	     !flow_sidx_eq(flow_hashtab[s], FLOW_SIDX_NONE);
+	     s = mod_sub(s, 1, FLOW_HASH_SIZE)) {
+		unsigned h = flow_sidx_hash(c, flow_hashtab[s]) % FLOW_HASH_SIZE;
+
+		if (!mod_between(h, s, b, FLOW_HASH_SIZE)) {
+			/* flow_hashtab[s] can live in flow_hashtab[b]'s slot */
+			debug("hash table remove: shuffle %u -> %u", s, b);
+			flow_hashtab[b] = flow_hashtab[s];
+			b = s;
+		}
+	}
+
+	flow_hashtab[b] = FLOW_SIDX_NONE;
+}
+
+/**
+ * flow_hash_lookup() - Look up a flow given addressing information
+ * @c:		Execution context
+ * @proto:	Protocol of the flow (IP L4 protocol number)
+ * @pif:	Interface of the flow
+ * @af:		Address family, AF_INET or AF_INET6
+ * @eaddr:	Guest side endpoint address (guest local address)
+ * @faddr:	Guest side forwarding address (guest remote address)
+ * @eport:	Guest side endpoint port (guest local port)
+ * @fport:	Guest side forwarding port (guest remote port)
+ *
+ * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found
+ */
+flow_sidx_t flow_hash_lookup(const struct ctx *c,
+			     uint8_t proto, uint8_t pif, sa_family_t af,
+			     const void *eaddr, const void *faddr,
+			     in_port_t eport, in_port_t fport)
+{
+	struct flowside fside;
+	union flow *flow;
+	int b;
+
+	flowside_from_af(&fside, pif, af, faddr, fport, eaddr, eport);
+
+	b = flow_hash(c, proto, &fside) % FLOW_HASH_SIZE;
+	while ((flow = flow_at_sidx(flow_hashtab[b])) &&
+	       FLOW_PROTO(&flow->f) == proto &&
+	       !flowside_eq(&flow->f.side[flow_hashtab[b].side], &fside))
+		b = (b + 1) % FLOW_HASH_SIZE;
+
+	return flow_hashtab[b];
+}
+
 /**
  * flow_defer_handler() - Handler for per-flow deferred and timed tasks
  * @c:		Execution context
@@ -389,7 +509,12 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
  */
 void flow_init(void)
 {
+	unsigned b;
+
 	/* Initial state is a single free cluster containing the whole table */
 	flowtab[0].free.n = FLOW_MAX;
 	flowtab[0].free.next = FLOW_MAX;
+
+	for (b = 0; b < FLOW_HASH_SIZE; b++)
+		flow_hashtab[b] = FLOW_SIDX_NONE;
 }
diff --git a/flow.h b/flow.h
index 218edc6..c1181cc 100644
--- a/flow.h
+++ b/flow.h
@@ -144,9 +144,6 @@ union flow *flow_start(union flow *flow, enum flow_type type,
 #define FLOW_START(flow_, t_, var_, i_)		\
 	(&flow_start((flow_), (t_), (i_))->var_)
 
-uint64_t flow_hash(const struct ctx *c, uint8_t proto,
-		   const struct flowside *fside);
-
 /**
  * struct flow_sidx - ID for one side of a specific flow
  * @side:	Side referenced (0 or 1)
@@ -172,6 +169,13 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
 	return (a.flow == b.flow) && (a.side == b.side);
 }
 
+void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx);
+void flow_hash_remove(const struct ctx *c, flow_sidx_t sidx);
+flow_sidx_t flow_hash_lookup(const struct ctx *c,
+			     uint8_t proto, uint8_t pif, sa_family_t af,
+			     const void *eaddr, const void *faddr,
+			     in_port_t eport, in_port_t fport);
+
 union flow;
 
 void flow_init(void);
diff --git a/tcp.c b/tcp.c
index 2ad47c5..7f83014 100644
--- a/tcp.c
+++ b/tcp.c
@@ -311,9 +311,6 @@
 #define TCP_FRAMES							\
 	(c->mode == MODE_PASST ? TCP_FRAMES_MEM : 1)
 
-#define TCP_HASH_TABLE_LOAD		70		/* % */
-#define TCP_HASH_TABLE_SIZE		(FLOW_MAX * 100 / TCP_HASH_TABLE_LOAD)
-
 #define MAX_WS				8
 #define MAX_WINDOW			(1 << (16 + (MAX_WS)))
 
@@ -524,12 +521,6 @@ static struct iovec	tcp_iov			[UIO_MAXIOV];
 
 #define CONN(idx)		(&(FLOW(idx)->tcp))
 
-/* Table for lookup from flowside information */
-static flow_sidx_t tc_hash[TCP_HASH_TABLE_SIZE];
-
-static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX,
-	"Safe linear probing requires hash table larger than connection table");
-
 /* Pools for pre-opened sockets (in init) */
 int init_sock_pool4		[TCP_SOCK_POOL_SIZE];
 int init_sock_pool6		[TCP_SOCK_POOL_SIZE];
@@ -723,9 +714,6 @@ static void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
 		tcp_timer_ctl(c, conn);
 }
 
-static void tcp_hash_remove(const struct ctx *c,
-			    const struct tcp_tap_conn *conn);
-
 /**
  * conn_event_do() - Set and log connection events, update epoll state
  * @c:		Execution context
@@ -771,7 +759,7 @@ static void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
 			 num == -1 	       ? "CLOSED" : tcp_event_str[num]);
 
 	if (event == CLOSED)
-		tcp_hash_remove(c, conn);
+		flow_hash_remove(c, FLOW_SIDX(conn, TAPSIDE));
 	else if ((event == TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_RCVD))
 		conn_flag(c, conn, ACTIVE_CLOSE);
 	else
@@ -1074,116 +1062,6 @@ static int tcp_opt_get(const char *opts, size_t len, uint8_t type_find,
 	return -1;
 }
 
-/**
- * tcp_conn_hash() - Calculate hash bucket of an existing connection
- * @c:		Execution context
- * @conn:	Connection
- *
- * Return: hash value, needs to be adjusted for table size
- */
-static uint64_t tcp_conn_hash(const struct ctx *c,
-			      const struct tcp_tap_conn *conn)
-{
-	return flow_hash(c, IPPROTO_TCP, &conn->f.side[TAPSIDE]);
-}
-
-/**
- * tcp_hash_probe() - Find hash bucket for a connection
- * @c:		Execution context
- * @conn:	Connection to find bucket for
- *
- * Return: If @conn is in the table, its current bucket, otherwise a suitable
- *         free bucket for it.
- */
-static inline unsigned tcp_hash_probe(const struct ctx *c,
-				      const struct tcp_tap_conn *conn)
-{
-	flow_sidx_t sidx = FLOW_SIDX(conn, TAPSIDE);
-	unsigned b = tcp_conn_hash(c, conn) % TCP_HASH_TABLE_SIZE;
-
-	/* Linear probing */
-	while (!flow_sidx_eq(tc_hash[b], FLOW_SIDX_NONE) &&
-	       !flow_sidx_eq(tc_hash[b], sidx))
-		b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE);
-
-	return b;
-}
-
-/**
- * tcp_hash_insert() - Insert connection into hash table, chain link
- * @c:		Execution context
- * @conn:	Connection pointer
- */
-static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn)
-{
-	unsigned b = tcp_hash_probe(c, conn);
-
-	tc_hash[b] = FLOW_SIDX(conn, TAPSIDE);
-	flow_dbg(conn, "hash table insert: sock %i, bucket: %u", conn->sock, b);
-}
-
-/**
- * tcp_hash_remove() - Drop connection from hash table, chain unlink
- * @c:		Execution context
- * @conn:	Connection pointer
- */
-static void tcp_hash_remove(const struct ctx *c,
-			    const struct tcp_tap_conn *conn)
-{
-	unsigned b = tcp_hash_probe(c, conn), s;
-	union flow *flow = flow_at_sidx(tc_hash[b]);
-
-	if (!flow)
-		return; /* Redundant remove */
-
-	flow_dbg(conn, "hash table remove: sock %i, bucket: %u", conn->sock, b);
-
-	/* Scan the remainder of the cluster */
-	for (s = mod_sub(b, 1, TCP_HASH_TABLE_SIZE);
-	     (flow = flow_at_sidx(tc_hash[s]));
-	     s = mod_sub(s, 1, TCP_HASH_TABLE_SIZE)) {
-		unsigned h = tcp_conn_hash(c, &flow->tcp) % TCP_HASH_TABLE_SIZE;
-
-		if (!mod_between(h, s, b, TCP_HASH_TABLE_SIZE)) {
-			/* tc_hash[s] can live in tc_hash[b]'s slot */
-			debug("hash table remove: shuffle %u -> %u", s, b);
-			tc_hash[b] = tc_hash[s];
-			b = s;
-		}
-	}
-
-	tc_hash[b] = FLOW_SIDX_NONE;
-}
-
-/**
- * tcp_hash_lookup() - Look up connection given remote address and ports
- * @c:		Execution context
- * @af:		Address family, AF_INET or AF_INET6
- * @eaddr:	Guest side endpoint address (guest local address)
- * @faddr:	Guest side forwarding address (guest remote address)
- * @eport:	Guest side endpoint port (guest local port)
- * @fport:	Guest side forwarding port (guest remote port)
- *
- * Return: connection pointer, if found, -ENOENT otherwise
- */
-static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, sa_family_t af,
-					    const void *eaddr, const void *faddr,
-					    in_port_t eport, in_port_t fport)
-{
-	struct flowside fside;
-	union flow *flow;
-	unsigned b;
-
-	flowside_from_af(&fside, PIF_TAP, af, faddr, fport, eaddr, eport);
-
-	b = flow_hash(c, IPPROTO_TCP, &fside) % TCP_HASH_TABLE_SIZE;
-	while ((flow = flow_at_sidx(tc_hash[b])) &&
-	       !flowside_eq(&flow->f.side[TAPSIDE], &fside))
-		b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE);
-
-	return &flow->tcp;
-}
-
 /**
  * tcp_flow_defer() - Deferred per-flow handling (clean up closed connections)
  * @flow:	Flow table entry for this connection
@@ -1970,7 +1848,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 	tcp_seq_init(c, conn, now);
 	conn->seq_ack_from_tap = conn->seq_to_tap;
 
-	tcp_hash_insert(c, conn);
+	flow_hash_insert(c, FLOW_SIDX(conn, TAPSIDE));
 
 	sockaddr_from_inany(&sa, &sl, &sockside->eaddr, sockside->eport,
 			    c->ifi6);
@@ -2466,6 +2344,8 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
 	const struct tcphdr *th;
 	size_t optlen, len;
 	const char *opts;
+	union flow *flow;
+	flow_sidx_t sidx;
 	int ack_due = 0;
 	int count;
 
@@ -2481,17 +2361,22 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
 	optlen = MIN(optlen, ((1UL << 4) /* from doff width */ - 6) * 4UL);
 	opts = packet_get(p, idx, sizeof(*th), optlen, NULL);
 
-	conn = tcp_hash_lookup(c, af, saddr, daddr,
-			       ntohs(th->source), ntohs(th->dest));
+	sidx = flow_hash_lookup(c, IPPROTO_TCP, PIF_TAP, af, saddr, daddr,
+				ntohs(th->source), ntohs(th->dest));
+	flow = flow_at_sidx(sidx);
 
 	/* New connection from tap */
-	if (!conn) {
+	if (!flow) {
 		if (opts && th->syn && !th->ack)
 			tcp_conn_from_tap(c, af, saddr, daddr, th,
 					  opts, optlen, now);
 		return 1;
 	}
 
+	ASSERT(flow->f.type == FLOW_TCP);
+	ASSERT(sidx.side == TAPSIDE);
+	conn = &flow->tcp;
+
 	flow_trace(conn, "packet length %zu from tap", len);
 
 	if (th->rst) {
@@ -2671,7 +2556,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
 	conn_event(c, conn, SOCK_ACCEPTED);
 
 	tcp_seq_init(c, conn, now);
-	tcp_hash_insert(c, conn);
+	flow_hash_insert(c, FLOW_SIDX(conn, TAPSIDE));
 
 	conn->seq_ack_from_tap = conn->seq_to_tap;
 
@@ -3060,11 +2945,6 @@ static void tcp_sock_refill_init(const struct ctx *c)
  */
 int tcp_init(struct ctx *c)
 {
-	unsigned b;
-
-	for (b = 0; b < TCP_HASH_TABLE_SIZE; b++)
-		tc_hash[b] = FLOW_SIDX_NONE;
-
 	if (c->ifi4)
 		tcp_sock4_iov_init(c);
 
-- 
@@ -311,9 +311,6 @@
 #define TCP_FRAMES							\
 	(c->mode == MODE_PASST ? TCP_FRAMES_MEM : 1)
 
-#define TCP_HASH_TABLE_LOAD		70		/* % */
-#define TCP_HASH_TABLE_SIZE		(FLOW_MAX * 100 / TCP_HASH_TABLE_LOAD)
-
 #define MAX_WS				8
 #define MAX_WINDOW			(1 << (16 + (MAX_WS)))
 
@@ -524,12 +521,6 @@ static struct iovec	tcp_iov			[UIO_MAXIOV];
 
 #define CONN(idx)		(&(FLOW(idx)->tcp))
 
-/* Table for lookup from flowside information */
-static flow_sidx_t tc_hash[TCP_HASH_TABLE_SIZE];
-
-static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX,
-	"Safe linear probing requires hash table larger than connection table");
-
 /* Pools for pre-opened sockets (in init) */
 int init_sock_pool4		[TCP_SOCK_POOL_SIZE];
 int init_sock_pool6		[TCP_SOCK_POOL_SIZE];
@@ -723,9 +714,6 @@ static void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
 		tcp_timer_ctl(c, conn);
 }
 
-static void tcp_hash_remove(const struct ctx *c,
-			    const struct tcp_tap_conn *conn);
-
 /**
  * conn_event_do() - Set and log connection events, update epoll state
  * @c:		Execution context
@@ -771,7 +759,7 @@ static void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
 			 num == -1 	       ? "CLOSED" : tcp_event_str[num]);
 
 	if (event == CLOSED)
-		tcp_hash_remove(c, conn);
+		flow_hash_remove(c, FLOW_SIDX(conn, TAPSIDE));
 	else if ((event == TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_RCVD))
 		conn_flag(c, conn, ACTIVE_CLOSE);
 	else
@@ -1074,116 +1062,6 @@ static int tcp_opt_get(const char *opts, size_t len, uint8_t type_find,
 	return -1;
 }
 
-/**
- * tcp_conn_hash() - Calculate hash bucket of an existing connection
- * @c:		Execution context
- * @conn:	Connection
- *
- * Return: hash value, needs to be adjusted for table size
- */
-static uint64_t tcp_conn_hash(const struct ctx *c,
-			      const struct tcp_tap_conn *conn)
-{
-	return flow_hash(c, IPPROTO_TCP, &conn->f.side[TAPSIDE]);
-}
-
-/**
- * tcp_hash_probe() - Find hash bucket for a connection
- * @c:		Execution context
- * @conn:	Connection to find bucket for
- *
- * Return: If @conn is in the table, its current bucket, otherwise a suitable
- *         free bucket for it.
- */
-static inline unsigned tcp_hash_probe(const struct ctx *c,
-				      const struct tcp_tap_conn *conn)
-{
-	flow_sidx_t sidx = FLOW_SIDX(conn, TAPSIDE);
-	unsigned b = tcp_conn_hash(c, conn) % TCP_HASH_TABLE_SIZE;
-
-	/* Linear probing */
-	while (!flow_sidx_eq(tc_hash[b], FLOW_SIDX_NONE) &&
-	       !flow_sidx_eq(tc_hash[b], sidx))
-		b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE);
-
-	return b;
-}
-
-/**
- * tcp_hash_insert() - Insert connection into hash table, chain link
- * @c:		Execution context
- * @conn:	Connection pointer
- */
-static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn)
-{
-	unsigned b = tcp_hash_probe(c, conn);
-
-	tc_hash[b] = FLOW_SIDX(conn, TAPSIDE);
-	flow_dbg(conn, "hash table insert: sock %i, bucket: %u", conn->sock, b);
-}
-
-/**
- * tcp_hash_remove() - Drop connection from hash table, chain unlink
- * @c:		Execution context
- * @conn:	Connection pointer
- */
-static void tcp_hash_remove(const struct ctx *c,
-			    const struct tcp_tap_conn *conn)
-{
-	unsigned b = tcp_hash_probe(c, conn), s;
-	union flow *flow = flow_at_sidx(tc_hash[b]);
-
-	if (!flow)
-		return; /* Redundant remove */
-
-	flow_dbg(conn, "hash table remove: sock %i, bucket: %u", conn->sock, b);
-
-	/* Scan the remainder of the cluster */
-	for (s = mod_sub(b, 1, TCP_HASH_TABLE_SIZE);
-	     (flow = flow_at_sidx(tc_hash[s]));
-	     s = mod_sub(s, 1, TCP_HASH_TABLE_SIZE)) {
-		unsigned h = tcp_conn_hash(c, &flow->tcp) % TCP_HASH_TABLE_SIZE;
-
-		if (!mod_between(h, s, b, TCP_HASH_TABLE_SIZE)) {
-			/* tc_hash[s] can live in tc_hash[b]'s slot */
-			debug("hash table remove: shuffle %u -> %u", s, b);
-			tc_hash[b] = tc_hash[s];
-			b = s;
-		}
-	}
-
-	tc_hash[b] = FLOW_SIDX_NONE;
-}
-
-/**
- * tcp_hash_lookup() - Look up connection given remote address and ports
- * @c:		Execution context
- * @af:		Address family, AF_INET or AF_INET6
- * @eaddr:	Guest side endpoint address (guest local address)
- * @faddr:	Guest side forwarding address (guest remote address)
- * @eport:	Guest side endpoint port (guest local port)
- * @fport:	Guest side forwarding port (guest remote port)
- *
- * Return: connection pointer, if found, -ENOENT otherwise
- */
-static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, sa_family_t af,
-					    const void *eaddr, const void *faddr,
-					    in_port_t eport, in_port_t fport)
-{
-	struct flowside fside;
-	union flow *flow;
-	unsigned b;
-
-	flowside_from_af(&fside, PIF_TAP, af, faddr, fport, eaddr, eport);
-
-	b = flow_hash(c, IPPROTO_TCP, &fside) % TCP_HASH_TABLE_SIZE;
-	while ((flow = flow_at_sidx(tc_hash[b])) &&
-	       !flowside_eq(&flow->f.side[TAPSIDE], &fside))
-		b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE);
-
-	return &flow->tcp;
-}
-
 /**
  * tcp_flow_defer() - Deferred per-flow handling (clean up closed connections)
  * @flow:	Flow table entry for this connection
@@ -1970,7 +1848,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 	tcp_seq_init(c, conn, now);
 	conn->seq_ack_from_tap = conn->seq_to_tap;
 
-	tcp_hash_insert(c, conn);
+	flow_hash_insert(c, FLOW_SIDX(conn, TAPSIDE));
 
 	sockaddr_from_inany(&sa, &sl, &sockside->eaddr, sockside->eport,
 			    c->ifi6);
@@ -2466,6 +2344,8 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
 	const struct tcphdr *th;
 	size_t optlen, len;
 	const char *opts;
+	union flow *flow;
+	flow_sidx_t sidx;
 	int ack_due = 0;
 	int count;
 
@@ -2481,17 +2361,22 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
 	optlen = MIN(optlen, ((1UL << 4) /* from doff width */ - 6) * 4UL);
 	opts = packet_get(p, idx, sizeof(*th), optlen, NULL);
 
-	conn = tcp_hash_lookup(c, af, saddr, daddr,
-			       ntohs(th->source), ntohs(th->dest));
+	sidx = flow_hash_lookup(c, IPPROTO_TCP, PIF_TAP, af, saddr, daddr,
+				ntohs(th->source), ntohs(th->dest));
+	flow = flow_at_sidx(sidx);
 
 	/* New connection from tap */
-	if (!conn) {
+	if (!flow) {
 		if (opts && th->syn && !th->ack)
 			tcp_conn_from_tap(c, af, saddr, daddr, th,
 					  opts, optlen, now);
 		return 1;
 	}
 
+	ASSERT(flow->f.type == FLOW_TCP);
+	ASSERT(sidx.side == TAPSIDE);
+	conn = &flow->tcp;
+
 	flow_trace(conn, "packet length %zu from tap", len);
 
 	if (th->rst) {
@@ -2671,7 +2556,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
 	conn_event(c, conn, SOCK_ACCEPTED);
 
 	tcp_seq_init(c, conn, now);
-	tcp_hash_insert(c, conn);
+	flow_hash_insert(c, FLOW_SIDX(conn, TAPSIDE));
 
 	conn->seq_ack_from_tap = conn->seq_to_tap;
 
@@ -3060,11 +2945,6 @@ static void tcp_sock_refill_init(const struct ctx *c)
  */
 int tcp_init(struct ctx *c)
 {
-	unsigned b;
-
-	for (b = 0; b < TCP_HASH_TABLE_SIZE; b++)
-		tc_hash[b] = FLOW_SIDX_NONE;
-
 	if (c->ifi4)
 		tcp_sock4_iov_init(c);
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v4 10/16] tcp: Re-use flow hash for initial sequence number generation
  2024-05-03  1:11 [PATCH v4 00/16] RFC: Unified flow table David Gibson
                   ` (8 preceding siblings ...)
  2024-05-03  1:11 ` [PATCH v4 09/16] flow, tcp: Generalise TCP hash table to general flow hash table David Gibson
@ 2024-05-03  1:11 ` David Gibson
  2024-05-03  1:11 ` [PATCH v4 11/16] icmp: Populate flowside information David Gibson
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2024-05-03  1:11 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

We generate TCP initial sequence numbers, when we need them, from a
hash of the source and destination addresses and ports, plus a
timestamp.  Moments later, we generate another hash of the same
information plus some more to insert the connection into the flow hash
table.

With some tweaks to the flow_hash_insert() interface and changing the
order we can re-use that hash table hash for the initial sequence
number, rather than calculating another one.  It won't generate
identical results, but that doesn't matter as long as the sequence
numbers are well scattered.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 flow.c | 30 ++++++++++++++++++++++++------
 flow.h |  2 +-
 tcp.c  | 34 ++++++++++++----------------------
 3 files changed, 37 insertions(+), 29 deletions(-)

diff --git a/flow.c b/flow.c
index 8cefbed..c30910e 100644
--- a/flow.c
+++ b/flow.c
@@ -320,16 +320,16 @@ static uint64_t flow_sidx_hash(const struct ctx *c, flow_sidx_t sidx)
 }
 
 /**
- * flow_hash_probe() - Find hash bucket for a flow
- * @c:		Execution context
+ * flow_hash_probe_() - Find hash bucket for a flow, given hash
+ * @hash:	Raw hash value for flow & side
  * @sidx:	Flow and side to find bucket for
  *
  * Return: If @sidx is in the hash table, its current bucket, otherwise a
  *         suitable free bucket for it.
  */
-static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx)
+static inline unsigned flow_hash_probe_(uint64_t hash, flow_sidx_t sidx)
 {
-	unsigned b = flow_sidx_hash(c, sidx) % FLOW_HASH_SIZE;
+	unsigned b = hash % FLOW_HASH_SIZE;
 
 	/* Linear probing */
 	while (!flow_sidx_eq(flow_hashtab[b], FLOW_SIDX_NONE) &&
@@ -339,17 +339,35 @@ static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx)
 	return b;
 }
 
+/**
+ * flow_hash_probe() - Find hash bucket for a flow
+ * @c:		Execution context
+ * @sidx:	Flow and side to find bucket for
+ *
+ * Return: If @sidx is in the hash table, its current bucket, otherwise a
+ *         suitable free bucket for it.
+ */
+static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx)
+{
+	return flow_hash_probe_(flow_sidx_hash(c, sidx), sidx);
+}
+
 /**
  * flow_hash_insert() - Insert side of a flow into into hash table
  * @c:		Execution context
  * @sidx:	Flow & side index
+ *
+ * Return: raw (un-modded) hash value of side of flow
  */
-void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx)
+uint64_t flow_hash_insert(const struct ctx *c, flow_sidx_t sidx)
 {
-	unsigned b = flow_hash_probe(c, sidx);
+	uint64_t hash = flow_sidx_hash(c, sidx);
+	unsigned b = flow_hash_probe_(hash, sidx);
 
 	flow_hashtab[b] = sidx;
 	flow_dbg(flow_at_sidx(sidx), "hash table insert: bucket: %u", b);
+
+	return hash;
 }
 
 /**
diff --git a/flow.h b/flow.h
index c1181cc..c186f24 100644
--- a/flow.h
+++ b/flow.h
@@ -169,7 +169,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b)
 	return (a.flow == b.flow) && (a.side == b.side);
 }
 
-void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx);
+uint64_t flow_hash_insert(const struct ctx *c, flow_sidx_t sidx);
 void flow_hash_remove(const struct ctx *c, flow_sidx_t sidx);
 flow_sidx_t flow_hash_lookup(const struct ctx *c,
 			     uint8_t proto, uint8_t pif, sa_family_t af,
diff --git a/tcp.c b/tcp.c
index 7f83014..8b4d792 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1572,28 +1572,16 @@ static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
 }
 
 /**
- * tcp_seq_init() - Calculate initial sequence number according to RFC 6528
- * @c:		Execution context
- * @conn:	TCP connection, with tap flowside faddr, fport and eport
+ * tcp_init_seq() - Calculate initial sequence number according to RFC 6528
+ * @hash:	Hash of connection details
  * @now:	Current timestamp
  */
-static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
-			 const struct timespec *now)
+static uint32_t tcp_init_seq(uint64_t hash, const struct timespec *now)
 {
-	struct siphash_state state = SIPHASH_INIT(c->hash_secret);
-	const struct flowside *tapside = &conn->f.side[TAPSIDE];
-	uint64_t hash;
-	uint32_t ns;
-
-	inany_siphash_feed(&state, &tapside->faddr);
-	inany_siphash_feed(&state, &tapside->eaddr);
-	hash = siphash_final(&state, 36,
-			     (uint64_t)tapside->fport << 16 | tapside->eport);
-
 	/* 32ns ticks, overflows 32 bits every 137s */
-	ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
+	uint32_t ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
 
-	conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns;
+	return ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns;
 }
 
 /**
@@ -1771,6 +1759,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 	union sockaddr_inany sa;
 	union flow *flow;
 	int s = -1, mss;
+	uint64_t hash;
 	socklen_t sl;
 
 	if (!(flow = flow_alloc()))
@@ -1845,10 +1834,10 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 	conn->seq_from_tap = conn->seq_init_from_tap + 1;
 	conn->seq_ack_to_tap = conn->seq_from_tap;
 
-	tcp_seq_init(c, conn, now);
-	conn->seq_ack_from_tap = conn->seq_to_tap;
+	hash = flow_hash_insert(c, FLOW_SIDX(conn, TAPSIDE));
 
-	flow_hash_insert(c, FLOW_SIDX(conn, TAPSIDE));
+	conn->seq_to_tap = tcp_init_seq(hash, now);
+	conn->seq_ack_from_tap = conn->seq_to_tap;
 
 	sockaddr_from_inany(&sa, &sl, &sockside->eaddr, sockside->eport,
 			    c->ifi6);
@@ -2533,6 +2522,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
 	const struct flowside *sockside = &flow->f.side[SOCKSIDE];
 	struct flowside *tapside = &flow->f.side[TAPSIDE];
 	struct tcp_tap_conn *conn;
+	uint64_t hash;
 
 	tapside->pif = PIF_TAP;
 	tapside->faddr = sockside->eaddr;
@@ -2555,8 +2545,8 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
 	conn->ws_to_tap = conn->ws_from_tap = 0;
 	conn_event(c, conn, SOCK_ACCEPTED);
 
-	tcp_seq_init(c, conn, now);
-	flow_hash_insert(c, FLOW_SIDX(conn, TAPSIDE));
+	hash = flow_hash_insert(c, FLOW_SIDX(conn, TAPSIDE));
+	conn->seq_to_tap = tcp_init_seq(hash, now);
 
 	conn->seq_ack_from_tap = conn->seq_to_tap;
 
-- 
@@ -1572,28 +1572,16 @@ static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
 }
 
 /**
- * tcp_seq_init() - Calculate initial sequence number according to RFC 6528
- * @c:		Execution context
- * @conn:	TCP connection, with tap flowside faddr, fport and eport
+ * tcp_init_seq() - Calculate initial sequence number according to RFC 6528
+ * @hash:	Hash of connection details
  * @now:	Current timestamp
  */
-static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
-			 const struct timespec *now)
+static uint32_t tcp_init_seq(uint64_t hash, const struct timespec *now)
 {
-	struct siphash_state state = SIPHASH_INIT(c->hash_secret);
-	const struct flowside *tapside = &conn->f.side[TAPSIDE];
-	uint64_t hash;
-	uint32_t ns;
-
-	inany_siphash_feed(&state, &tapside->faddr);
-	inany_siphash_feed(&state, &tapside->eaddr);
-	hash = siphash_final(&state, 36,
-			     (uint64_t)tapside->fport << 16 | tapside->eport);
-
 	/* 32ns ticks, overflows 32 bits every 137s */
-	ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
+	uint32_t ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
 
-	conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns;
+	return ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns;
 }
 
 /**
@@ -1771,6 +1759,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 	union sockaddr_inany sa;
 	union flow *flow;
 	int s = -1, mss;
+	uint64_t hash;
 	socklen_t sl;
 
 	if (!(flow = flow_alloc()))
@@ -1845,10 +1834,10 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 	conn->seq_from_tap = conn->seq_init_from_tap + 1;
 	conn->seq_ack_to_tap = conn->seq_from_tap;
 
-	tcp_seq_init(c, conn, now);
-	conn->seq_ack_from_tap = conn->seq_to_tap;
+	hash = flow_hash_insert(c, FLOW_SIDX(conn, TAPSIDE));
 
-	flow_hash_insert(c, FLOW_SIDX(conn, TAPSIDE));
+	conn->seq_to_tap = tcp_init_seq(hash, now);
+	conn->seq_ack_from_tap = conn->seq_to_tap;
 
 	sockaddr_from_inany(&sa, &sl, &sockside->eaddr, sockside->eport,
 			    c->ifi6);
@@ -2533,6 +2522,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
 	const struct flowside *sockside = &flow->f.side[SOCKSIDE];
 	struct flowside *tapside = &flow->f.side[TAPSIDE];
 	struct tcp_tap_conn *conn;
+	uint64_t hash;
 
 	tapside->pif = PIF_TAP;
 	tapside->faddr = sockside->eaddr;
@@ -2555,8 +2545,8 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
 	conn->ws_to_tap = conn->ws_from_tap = 0;
 	conn_event(c, conn, SOCK_ACCEPTED);
 
-	tcp_seq_init(c, conn, now);
-	flow_hash_insert(c, FLOW_SIDX(conn, TAPSIDE));
+	hash = flow_hash_insert(c, FLOW_SIDX(conn, TAPSIDE));
+	conn->seq_to_tap = tcp_init_seq(hash, now);
 
 	conn->seq_ack_from_tap = conn->seq_to_tap;
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v4 11/16] icmp: Populate flowside information
  2024-05-03  1:11 [PATCH v4 00/16] RFC: Unified flow table David Gibson
                   ` (9 preceding siblings ...)
  2024-05-03  1:11 ` [PATCH v4 10/16] tcp: Re-use flow hash for initial sequence number generation David Gibson
@ 2024-05-03  1:11 ` David Gibson
  2024-05-03  1:11 ` [PATCH v4 12/16] icmp: Use flowsides as the source of truth wherever possible David Gibson
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2024-05-03  1:11 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

For ICMP (ping) flows we currently don't populate the common flowside
fields.   Fill out those parts of the common information that we can easily
obtain, using the ICMP id as the "port" on both ends.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 icmp.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/icmp.c b/icmp.c
index 1c5cf84..c000175 100644
--- a/icmp.c
+++ b/icmp.c
@@ -150,16 +150,23 @@ static void icmp_ping_close(const struct ctx *c,
  * @id_sock:	Pointer to ping flow entry slot in icmp_id_map[] to update
  * @af:		Address family, AF_INET or AF_INET6
  * @id:		ICMP id for the new socket
+ * @pif:	pif originating the ping request
+ * @saddr:	Source address
+ * @daddr:	Destination address
  *
  * Return: Newly opened ping flow, or NULL on failure
  */
 static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 					    struct icmp_ping_flow **id_sock,
-					    sa_family_t af, uint16_t id)
+					    sa_family_t af, uint16_t id,
+					    uint8_t pif, const void *saddr,
+					    const void *daddr)
 {
 	uint8_t flowtype = af == AF_INET ? FLOW_PING4 : FLOW_PING6;
 	union epoll_ref ref = { .type = EPOLL_TYPE_PING };
 	union flow *flow = flow_alloc();
+	struct flowside *sockside = &flow->f.side[SOCKSIDE];
+	struct flowside *tapside = &flow->f.side[TAPSIDE];
 	struct icmp_ping_flow *pingf;
 	const void *bind_addr;
 	const char *bind_if;
@@ -167,6 +174,9 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 	if (!flow)
 		return NULL;
 
+	flowside_from_af(tapside, pif, af, daddr, id, saddr, id);
+	flowside_from_af(sockside, PIF_HOST, af, NULL, 0, daddr, 0);
+
 	pingf = FLOW_START(flow, flowtype, ping, TAPSIDE);
 
 	pingf->seq = -1;
@@ -228,7 +238,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 	uint16_t id, seq;
 	void *pkt;
 
-	(void)saddr;
 	ASSERT(pif == PIF_TAP);
 
 	if (af == AF_INET) {
@@ -269,7 +278,8 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 	}
 
 	if (!(pingf = *id_sock))
-		if (!(pingf = icmp_ping_new(c, id_sock, af, id)))
+		if (!(pingf = icmp_ping_new(c, id_sock, af, id,
+					    pif, saddr, daddr)))
 			return 1;
 
 	pingf->ts = now->tv_sec;
-- 
@@ -150,16 +150,23 @@ static void icmp_ping_close(const struct ctx *c,
  * @id_sock:	Pointer to ping flow entry slot in icmp_id_map[] to update
  * @af:		Address family, AF_INET or AF_INET6
  * @id:		ICMP id for the new socket
+ * @pif:	pif originating the ping request
+ * @saddr:	Source address
+ * @daddr:	Destination address
  *
  * Return: Newly opened ping flow, or NULL on failure
  */
 static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 					    struct icmp_ping_flow **id_sock,
-					    sa_family_t af, uint16_t id)
+					    sa_family_t af, uint16_t id,
+					    uint8_t pif, const void *saddr,
+					    const void *daddr)
 {
 	uint8_t flowtype = af == AF_INET ? FLOW_PING4 : FLOW_PING6;
 	union epoll_ref ref = { .type = EPOLL_TYPE_PING };
 	union flow *flow = flow_alloc();
+	struct flowside *sockside = &flow->f.side[SOCKSIDE];
+	struct flowside *tapside = &flow->f.side[TAPSIDE];
 	struct icmp_ping_flow *pingf;
 	const void *bind_addr;
 	const char *bind_if;
@@ -167,6 +174,9 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 	if (!flow)
 		return NULL;
 
+	flowside_from_af(tapside, pif, af, daddr, id, saddr, id);
+	flowside_from_af(sockside, PIF_HOST, af, NULL, 0, daddr, 0);
+
 	pingf = FLOW_START(flow, flowtype, ping, TAPSIDE);
 
 	pingf->seq = -1;
@@ -228,7 +238,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 	uint16_t id, seq;
 	void *pkt;
 
-	(void)saddr;
 	ASSERT(pif == PIF_TAP);
 
 	if (af == AF_INET) {
@@ -269,7 +278,8 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 	}
 
 	if (!(pingf = *id_sock))
-		if (!(pingf = icmp_ping_new(c, id_sock, af, id)))
+		if (!(pingf = icmp_ping_new(c, id_sock, af, id,
+					    pif, saddr, daddr)))
 			return 1;
 
 	pingf->ts = now->tv_sec;
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v4 12/16] icmp: Use flowsides as the source of truth wherever possible
  2024-05-03  1:11 [PATCH v4 00/16] RFC: Unified flow table David Gibson
                   ` (10 preceding siblings ...)
  2024-05-03  1:11 ` [PATCH v4 11/16] icmp: Populate flowside information David Gibson
@ 2024-05-03  1:11 ` David Gibson
  2024-05-03  1:11 ` [PATCH v4 13/16] icmp: Look up ping flows using flow hash David Gibson
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2024-05-03  1:11 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

icmp_sock_handler() obtains the guest address from it's most recently
observed IP, and the ICMP id from the epoll reference.  Both of these
can be obtained readily from the flow.

icmp_tap_handler() builds its socket address for sendto() directly
from the destination address supplied by the incoming tap packet.
This can instead be generated from the flow.

struct icmp_ping_flow contains a field for the ICMP id of the ping, but
this is now redundant, since the id is also stored as the "port" in the
common flowsides.

Using the flowsides as the common source of truth here prepares us for
allowing more flexible NAT and forwarding by properly initialising
that flowside information.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 icmp.c      | 37 ++++++++++++++++++++++---------------
 icmp_flow.h |  1 -
 tap.c       | 11 -----------
 tap.h       |  1 -
 4 files changed, 22 insertions(+), 28 deletions(-)

diff --git a/icmp.c b/icmp.c
index c000175..e29416f 100644
--- a/icmp.c
+++ b/icmp.c
@@ -62,6 +62,7 @@ static struct icmp_ping_flow *icmp_id_map[IP_VERSIONS][ICMP_NUM_IDS];
 void icmp_sock_handler(const struct ctx *c, union epoll_ref ref)
 {
 	struct icmp_ping_flow *pingf = PINGF(ref.flowside.flow);
+	const struct flowside *tapside = &pingf->f.side[TAPSIDE];
 	union sockaddr_inany sr;
 	socklen_t sl = sizeof(sr);
 	char buf[USHRT_MAX];
@@ -87,7 +88,7 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref)
 			goto unexpected;
 
 		/* Adjust packet back to guest-side ID */
-		ih4->un.echo.id = htons(pingf->id);
+		ih4->un.echo.id = htons(tapside->eport);
 		seq = ntohs(ih4->un.echo.sequence);
 	} else if (pingf->f.type == FLOW_PING6) {
 		struct icmp6hdr *ih6 = (struct icmp6hdr *)buf;
@@ -97,7 +98,7 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref)
 			goto unexpected;
 
 		/* Adjust packet back to guest-side ID */
-		ih6->icmp6_identifier = htons(pingf->id);
+		ih6->icmp6_identifier = htons(tapside->eport);
 		seq = ntohs(ih6->icmp6_sequence);
 	} else {
 		ASSERT(0);
@@ -112,13 +113,20 @@ void icmp_sock_handler(const struct ctx *c, union epoll_ref ref)
 	}
 
 	flow_dbg(pingf, "echo reply to tap, ID: %"PRIu16", seq: %"PRIu16,
-		 pingf->id, seq);
+		 tapside->eport, seq);
 
-	if (pingf->f.type == FLOW_PING4)
-		tap_icmp4_send(c, sr.sa4.sin_addr, tap_ip4_daddr(c), buf, n);
-	else if (pingf->f.type == FLOW_PING6)
-		tap_icmp6_send(c, &sr.sa6.sin6_addr,
-			       tap_ip6_daddr(c, &sr.sa6.sin6_addr), buf, n);
+	if (pingf->f.type == FLOW_PING4) {
+		const struct in_addr *saddr = inany_v4(&tapside->faddr);
+		const struct in_addr *daddr = inany_v4(&tapside->eaddr);
+
+		ASSERT(saddr && daddr); /* Must have IPv4 addresses */
+		tap_icmp4_send(c, *saddr, *daddr, buf, n);
+	} else if (pingf->f.type == FLOW_PING6) {
+		const struct in6_addr *saddr = &tapside->faddr.a6;
+		const struct in6_addr *daddr = &tapside->eaddr.a6;
+
+		tap_icmp6_send(c, saddr, daddr, buf, n);
+	}
 	return;
 
 unexpected:
@@ -133,7 +141,7 @@ unexpected:
 static void icmp_ping_close(const struct ctx *c,
 			    const struct icmp_ping_flow *pingf)
 {
-	uint16_t id = pingf->id;
+	uint16_t id = pingf->f.side[TAPSIDE].eport;
 
 	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL);
 	close(pingf->sock);
@@ -180,7 +188,6 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 	pingf = FLOW_START(flow, flowtype, ping, TAPSIDE);
 
 	pingf->seq = -1;
-	pingf->id = id;
 
 	if (af == AF_INET) {
 		bind_addr = &c->ip4.addr_out;
@@ -231,11 +238,12 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		     const void *saddr, const void *daddr,
 		     const struct pool *p, const struct timespec *now)
 {
-	union sockaddr_inany sa = { .sa_family = af };
-	const socklen_t sl = af == AF_INET ? sizeof(sa.sa4) : sizeof(sa.sa6);
 	struct icmp_ping_flow *pingf, **id_sock;
+	const struct flowside *sockside;
+	union sockaddr_inany sa;
 	size_t dlen, l4len;
 	uint16_t id, seq;
+	socklen_t sl;
 	void *pkt;
 
 	ASSERT(pif == PIF_TAP);
@@ -255,7 +263,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		id = ntohs(ih->un.echo.id);
 		id_sock = &icmp_id_map[V4][id];
 		seq = ntohs(ih->un.echo.sequence);
-		sa.sa4.sin_addr = *(struct in_addr *)daddr;
 	} else if (af == AF_INET6) {
 		const struct icmp6hdr *ih;
 
@@ -271,8 +278,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		id = ntohs(ih->icmp6_identifier);
 		id_sock = &icmp_id_map[V6][id];
 		seq = ntohs(ih->icmp6_sequence);
-		sa.sa6.sin6_addr = *(struct in6_addr *)daddr;
-		sa.sa6.sin6_scope_id = c->ifi6;
 	} else {
 		ASSERT(0);
 	}
@@ -282,8 +287,10 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 					    pif, saddr, daddr)))
 			return 1;
 
+	sockside = &pingf->f.side[SOCKSIDE];
 	pingf->ts = now->tv_sec;
 
+	sockaddr_from_inany(&sa, &sl, &sockside->eaddr, 0, c->ifi6);
 	if (sendto(pingf->sock, pkt, l4len, MSG_NOSIGNAL, &sa.sa, sl) < 0) {
 		flow_dbg(pingf, "failed to relay request to socket: %s",
 			 strerror(errno));
diff --git a/icmp_flow.h b/icmp_flow.h
index 5a2eed9..f053211 100644
--- a/icmp_flow.h
+++ b/icmp_flow.h
@@ -22,7 +22,6 @@ struct icmp_ping_flow {
 	int seq;
 	int sock;
 	time_t ts;
-	uint16_t id;
 };
 
 bool icmp_ping_timer(const struct ctx *c, union flow *flow,
diff --git a/tap.c b/tap.c
index 91fd2e2..052f6f0 100644
--- a/tap.c
+++ b/tap.c
@@ -90,17 +90,6 @@ void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
 	tap_send_frames(c, iov, iovcnt, 1);
 }
 
-/**
- * tap_ip4_daddr() - Normal IPv4 destination address for inbound packets
- * @c:		Execution context
- *
- * Return: IPv4 address
- */
-struct in_addr tap_ip4_daddr(const struct ctx *c)
-{
-	return c->ip4.addr_seen;
-}
-
 /**
  * tap_ip6_daddr() - Normal IPv6 destination address for inbound packets
  * @c:		Execution context
diff --git a/tap.h b/tap.h
index d146d2f..a4981a6 100644
--- a/tap.h
+++ b/tap.h
@@ -43,7 +43,6 @@ static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
 	thdr->vnet_len = htonl(l2len);
 }
 
-struct in_addr tap_ip4_daddr(const struct ctx *c);
 void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
 		   struct in_addr dst, in_port_t dport,
 		   const void *in, size_t dlen);
-- 
@@ -43,7 +43,6 @@ static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
 	thdr->vnet_len = htonl(l2len);
 }
 
-struct in_addr tap_ip4_daddr(const struct ctx *c);
 void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
 		   struct in_addr dst, in_port_t dport,
 		   const void *in, size_t dlen);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v4 13/16] icmp: Look up ping flows using flow hash
  2024-05-03  1:11 [PATCH v4 00/16] RFC: Unified flow table David Gibson
                   ` (11 preceding siblings ...)
  2024-05-03  1:11 ` [PATCH v4 12/16] icmp: Use flowsides as the source of truth wherever possible David Gibson
@ 2024-05-03  1:11 ` David Gibson
  2024-05-03  1:11 ` [PATCH v4 14/16] icmp: Eliminate icmp_id_map David Gibson
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2024-05-03  1:11 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

When we receive a ping packet from the tap interface, we currently locate
the correct flow entry (if present) using an anciliary data structure, the
icmp_id_map[] tables.  However, we can look this up using the flow hash
table - that's what it's for.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 icmp.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/icmp.c b/icmp.c
index e29416f..3b8f282 100644
--- a/icmp.c
+++ b/icmp.c
@@ -145,6 +145,7 @@ static void icmp_ping_close(const struct ctx *c,
 
 	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL);
 	close(pingf->sock);
+	flow_hash_remove(c, FLOW_SIDX(pingf, TAPSIDE));
 
 	if (pingf->f.type == FLOW_PING4)
 		icmp_id_map[V4][id] = NULL;
@@ -213,6 +214,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 
 	flow_dbg(pingf, "new socket %i for echo ID %"PRIu16, pingf->sock, id);
 
+	flow_hash_insert(c, FLOW_SIDX(pingf, TAPSIDE));
 	*id_sock = pingf;
 
 	return pingf;
@@ -243,6 +245,8 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 	union sockaddr_inany sa;
 	size_t dlen, l4len;
 	uint16_t id, seq;
+	union flow *flow;
+	uint8_t proto;
 	socklen_t sl;
 	void *pkt;
 
@@ -260,6 +264,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		if (ih->type != ICMP_ECHO)
 			return 1;
 
+		proto = IPPROTO_ICMP;
 		id = ntohs(ih->un.echo.id);
 		id_sock = &icmp_id_map[V4][id];
 		seq = ntohs(ih->un.echo.sequence);
@@ -275,6 +280,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		if (ih->icmp6_type != ICMPV6_ECHO_REQUEST)
 			return 1;
 
+		proto = IPPROTO_ICMPV6;
 		id = ntohs(ih->icmp6_identifier);
 		id_sock = &icmp_id_map[V6][id];
 		seq = ntohs(ih->icmp6_sequence);
@@ -282,12 +288,17 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		ASSERT(0);
 	}
 
-	if (!(pingf = *id_sock))
-		if (!(pingf = icmp_ping_new(c, id_sock, af, id,
-					    pif, saddr, daddr)))
-			return 1;
+	flow = flow_at_sidx(flow_hash_lookup(c, proto, PIF_TAP,
+					     af, saddr, daddr, id, id));
+
+	if (flow)
+		pingf = &flow->ping;
+	else if (!(pingf = icmp_ping_new(c, id_sock, af, id, pif, saddr, daddr)))
+		return 1;
 
 	sockside = &pingf->f.side[SOCKSIDE];
+
+	ASSERT(flow_proto[pingf->f.type] == proto);
 	pingf->ts = now->tv_sec;
 
 	sockaddr_from_inany(&sa, &sl, &sockside->eaddr, 0, c->ifi6);
-- 
@@ -145,6 +145,7 @@ static void icmp_ping_close(const struct ctx *c,
 
 	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL);
 	close(pingf->sock);
+	flow_hash_remove(c, FLOW_SIDX(pingf, TAPSIDE));
 
 	if (pingf->f.type == FLOW_PING4)
 		icmp_id_map[V4][id] = NULL;
@@ -213,6 +214,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 
 	flow_dbg(pingf, "new socket %i for echo ID %"PRIu16, pingf->sock, id);
 
+	flow_hash_insert(c, FLOW_SIDX(pingf, TAPSIDE));
 	*id_sock = pingf;
 
 	return pingf;
@@ -243,6 +245,8 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 	union sockaddr_inany sa;
 	size_t dlen, l4len;
 	uint16_t id, seq;
+	union flow *flow;
+	uint8_t proto;
 	socklen_t sl;
 	void *pkt;
 
@@ -260,6 +264,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		if (ih->type != ICMP_ECHO)
 			return 1;
 
+		proto = IPPROTO_ICMP;
 		id = ntohs(ih->un.echo.id);
 		id_sock = &icmp_id_map[V4][id];
 		seq = ntohs(ih->un.echo.sequence);
@@ -275,6 +280,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		if (ih->icmp6_type != ICMPV6_ECHO_REQUEST)
 			return 1;
 
+		proto = IPPROTO_ICMPV6;
 		id = ntohs(ih->icmp6_identifier);
 		id_sock = &icmp_id_map[V6][id];
 		seq = ntohs(ih->icmp6_sequence);
@@ -282,12 +288,17 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		ASSERT(0);
 	}
 
-	if (!(pingf = *id_sock))
-		if (!(pingf = icmp_ping_new(c, id_sock, af, id,
-					    pif, saddr, daddr)))
-			return 1;
+	flow = flow_at_sidx(flow_hash_lookup(c, proto, PIF_TAP,
+					     af, saddr, daddr, id, id));
+
+	if (flow)
+		pingf = &flow->ping;
+	else if (!(pingf = icmp_ping_new(c, id_sock, af, id, pif, saddr, daddr)))
+		return 1;
 
 	sockside = &pingf->f.side[SOCKSIDE];
+
+	ASSERT(flow_proto[pingf->f.type] == proto);
 	pingf->ts = now->tv_sec;
 
 	sockaddr_from_inany(&sa, &sl, &sockside->eaddr, 0, c->ifi6);
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v4 14/16] icmp: Eliminate icmp_id_map
  2024-05-03  1:11 [PATCH v4 00/16] RFC: Unified flow table David Gibson
                   ` (12 preceding siblings ...)
  2024-05-03  1:11 ` [PATCH v4 13/16] icmp: Look up ping flows using flow hash David Gibson
@ 2024-05-03  1:11 ` David Gibson
  2024-05-03  1:11 ` [PATCH v4 15/16] flow, tcp: flow based NAT and port forwarding for TCP David Gibson
  2024-05-03  1:11 ` [PATCH v4 16/16] flow, icmp: Use general flow forwarding rules for ICMP David Gibson
  15 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2024-05-03  1:11 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

With previous reworks the icmp_id_map data structure is now maintained, but
never used for anything.  Eliminate it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 icmp.c | 19 ++-----------------
 1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/icmp.c b/icmp.c
index 3b8f282..6a76d05 100644
--- a/icmp.c
+++ b/icmp.c
@@ -51,9 +51,6 @@
 
 #define PINGF(idx)		(&(FLOW(idx)->ping))
 
-/* Indexed by ICMP echo identifier */
-static struct icmp_ping_flow *icmp_id_map[IP_VERSIONS][ICMP_NUM_IDS];
-
 /**
  * icmp_sock_handler() - Handle new data from ICMP or ICMPv6 socket
  * @c:		Execution context
@@ -141,22 +138,14 @@ unexpected:
 static void icmp_ping_close(const struct ctx *c,
 			    const struct icmp_ping_flow *pingf)
 {
-	uint16_t id = pingf->f.side[TAPSIDE].eport;
-
 	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL);
 	close(pingf->sock);
 	flow_hash_remove(c, FLOW_SIDX(pingf, TAPSIDE));
-
-	if (pingf->f.type == FLOW_PING4)
-		icmp_id_map[V4][id] = NULL;
-	else
-		icmp_id_map[V6][id] = NULL;
 }
 
 /**
  * icmp_ping_new() - Prepare a new ping socket for a new id
  * @c:		Execution context
- * @id_sock:	Pointer to ping flow entry slot in icmp_id_map[] to update
  * @af:		Address family, AF_INET or AF_INET6
  * @id:		ICMP id for the new socket
  * @pif:	pif originating the ping request
@@ -166,7 +155,6 @@ static void icmp_ping_close(const struct ctx *c,
  * Return: Newly opened ping flow, or NULL on failure
  */
 static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
-					    struct icmp_ping_flow **id_sock,
 					    sa_family_t af, uint16_t id,
 					    uint8_t pif, const void *saddr,
 					    const void *daddr)
@@ -215,7 +203,6 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 	flow_dbg(pingf, "new socket %i for echo ID %"PRIu16, pingf->sock, id);
 
 	flow_hash_insert(c, FLOW_SIDX(pingf, TAPSIDE));
-	*id_sock = pingf;
 
 	return pingf;
 
@@ -240,8 +227,8 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		     const void *saddr, const void *daddr,
 		     const struct pool *p, const struct timespec *now)
 {
-	struct icmp_ping_flow *pingf, **id_sock;
 	const struct flowside *sockside;
+	struct icmp_ping_flow *pingf;
 	union sockaddr_inany sa;
 	size_t dlen, l4len;
 	uint16_t id, seq;
@@ -266,7 +253,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 
 		proto = IPPROTO_ICMP;
 		id = ntohs(ih->un.echo.id);
-		id_sock = &icmp_id_map[V4][id];
 		seq = ntohs(ih->un.echo.sequence);
 	} else if (af == AF_INET6) {
 		const struct icmp6hdr *ih;
@@ -282,7 +268,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 
 		proto = IPPROTO_ICMPV6;
 		id = ntohs(ih->icmp6_identifier);
-		id_sock = &icmp_id_map[V6][id];
 		seq = ntohs(ih->icmp6_sequence);
 	} else {
 		ASSERT(0);
@@ -293,7 +278,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 
 	if (flow)
 		pingf = &flow->ping;
-	else if (!(pingf = icmp_ping_new(c, id_sock, af, id, pif, saddr, daddr)))
+	else if (!(pingf = icmp_ping_new(c, af, id, pif, saddr, daddr)))
 		return 1;
 
 	sockside = &pingf->f.side[SOCKSIDE];
-- 
@@ -51,9 +51,6 @@
 
 #define PINGF(idx)		(&(FLOW(idx)->ping))
 
-/* Indexed by ICMP echo identifier */
-static struct icmp_ping_flow *icmp_id_map[IP_VERSIONS][ICMP_NUM_IDS];
-
 /**
  * icmp_sock_handler() - Handle new data from ICMP or ICMPv6 socket
  * @c:		Execution context
@@ -141,22 +138,14 @@ unexpected:
 static void icmp_ping_close(const struct ctx *c,
 			    const struct icmp_ping_flow *pingf)
 {
-	uint16_t id = pingf->f.side[TAPSIDE].eport;
-
 	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL);
 	close(pingf->sock);
 	flow_hash_remove(c, FLOW_SIDX(pingf, TAPSIDE));
-
-	if (pingf->f.type == FLOW_PING4)
-		icmp_id_map[V4][id] = NULL;
-	else
-		icmp_id_map[V6][id] = NULL;
 }
 
 /**
  * icmp_ping_new() - Prepare a new ping socket for a new id
  * @c:		Execution context
- * @id_sock:	Pointer to ping flow entry slot in icmp_id_map[] to update
  * @af:		Address family, AF_INET or AF_INET6
  * @id:		ICMP id for the new socket
  * @pif:	pif originating the ping request
@@ -166,7 +155,6 @@ static void icmp_ping_close(const struct ctx *c,
  * Return: Newly opened ping flow, or NULL on failure
  */
 static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
-					    struct icmp_ping_flow **id_sock,
 					    sa_family_t af, uint16_t id,
 					    uint8_t pif, const void *saddr,
 					    const void *daddr)
@@ -215,7 +203,6 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 	flow_dbg(pingf, "new socket %i for echo ID %"PRIu16, pingf->sock, id);
 
 	flow_hash_insert(c, FLOW_SIDX(pingf, TAPSIDE));
-	*id_sock = pingf;
 
 	return pingf;
 
@@ -240,8 +227,8 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 		     const void *saddr, const void *daddr,
 		     const struct pool *p, const struct timespec *now)
 {
-	struct icmp_ping_flow *pingf, **id_sock;
 	const struct flowside *sockside;
+	struct icmp_ping_flow *pingf;
 	union sockaddr_inany sa;
 	size_t dlen, l4len;
 	uint16_t id, seq;
@@ -266,7 +253,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 
 		proto = IPPROTO_ICMP;
 		id = ntohs(ih->un.echo.id);
-		id_sock = &icmp_id_map[V4][id];
 		seq = ntohs(ih->un.echo.sequence);
 	} else if (af == AF_INET6) {
 		const struct icmp6hdr *ih;
@@ -282,7 +268,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 
 		proto = IPPROTO_ICMPV6;
 		id = ntohs(ih->icmp6_identifier);
-		id_sock = &icmp_id_map[V6][id];
 		seq = ntohs(ih->icmp6_sequence);
 	} else {
 		ASSERT(0);
@@ -293,7 +278,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
 
 	if (flow)
 		pingf = &flow->ping;
-	else if (!(pingf = icmp_ping_new(c, id_sock, af, id, pif, saddr, daddr)))
+	else if (!(pingf = icmp_ping_new(c, af, id, pif, saddr, daddr)))
 		return 1;
 
 	sockside = &pingf->f.side[SOCKSIDE];
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v4 15/16] flow, tcp: flow based NAT and port forwarding for TCP
  2024-05-03  1:11 [PATCH v4 00/16] RFC: Unified flow table David Gibson
                   ` (13 preceding siblings ...)
  2024-05-03  1:11 ` [PATCH v4 14/16] icmp: Eliminate icmp_id_map David Gibson
@ 2024-05-03  1:11 ` David Gibson
  2024-05-03  1:11 ` [PATCH v4 16/16] flow, icmp: Use general flow forwarding rules for ICMP David Gibson
  15 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2024-05-03  1:11 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Currently the code to translate host side addresses and ports to guest side
addresses and ports, and vice versa, is scattered across the TCP code.
This includes both port redirection as controlled by the -t and -T options,
and our special case NAT controlled by the --no-map-gw option.

Gather all this logic into a new fwd_nat_flow() function in fwd.c which
takes protocol and flowside as input and generates the protocol and
flowside to which to forward the flow, including applying any NAT or port
translation.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 fwd.c        | 139 +++++++++++++++++++++++++++++++++++++++++++++++++++
 fwd.h        |   5 ++
 tcp.c        |  92 +++++++++++-----------------------
 tcp_splice.c |  57 +--------------------
 tcp_splice.h |   3 +-
 5 files changed, 175 insertions(+), 121 deletions(-)

diff --git a/fwd.c b/fwd.c
index b3d5a37..d0ca492 100644
--- a/fwd.c
+++ b/fwd.c
@@ -25,6 +25,7 @@
 #include "fwd.h"
 #include "passt.h"
 #include "lineread.h"
+#include "flow_table.h"
 
 /* See enum in kernel's include/net/tcp_states.h */
 #define UDP_LISTEN	0x07
@@ -154,3 +155,141 @@ void fwd_scan_ports_init(struct ctx *c)
 				   &c->tcp.fwd_out, &c->tcp.fwd_in);
 	}
 }
+
+static bool fwd_from_tap(const struct ctx *c, uint8_t proto,
+			 const struct flowside *a, struct flowside *b)
+{
+	(void)proto;
+
+	b->pif = PIF_HOST;
+	b->eaddr = a->faddr;
+	b->eport = a->fport;
+
+	if (!c->no_map_gw) {
+		struct in_addr *v4 = inany_v4(&b->eaddr);
+
+		if (v4 && IN4_ARE_ADDR_EQUAL(v4, &c->ip4.gw))
+			*v4 = in4addr_loopback;
+		if (IN6_ARE_ADDR_EQUAL(&b->eaddr, &c->ip6.gw))
+			b->eaddr.a6 = in6addr_loopback;
+	}
+
+	return true;
+}
+
+static bool fwd_from_splice(const struct ctx *c, uint8_t proto,
+			 const struct flowside *a, struct flowside *b)
+{
+	const struct in_addr *ae4 = inany_v4(&a->eaddr);
+
+	if (!inany_is_loopback(&a->eaddr) ||
+	    (!inany_is_loopback(&a->faddr) && !inany_is_unspecified(&a->faddr))) {
+		char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN];
+
+		debug("Non loopback address on %s: [%s]:%hu -> [%s]:%hu",
+		      pif_name(a->pif),
+		      inany_ntop(&a->eaddr, estr, sizeof(estr)), a->eport,
+		      inany_ntop(&a->faddr, fstr, sizeof(fstr)), a->fport);
+		return false;
+	}
+
+	b->pif = PIF_HOST;
+
+	if (ae4)
+		inany_from_af(&b->eaddr, AF_INET, &in4addr_loopback);
+	else
+		inany_from_af(&b->eaddr, AF_INET6, &in6addr_loopback);
+
+	b->eport = a->fport;
+
+	if (proto == IPPROTO_TCP)
+		b->eport += c->tcp.fwd_out.delta[b->eport];
+
+	return true;
+}
+
+static bool fwd_from_host(const struct ctx *c, uint8_t proto,
+			  const struct flowside *a, struct flowside *b)
+{
+	struct in_addr *bf4;
+
+	if (c->mode == MODE_PASTA && inany_is_loopback(&a->eaddr) &&
+	    proto == IPPROTO_TCP) {
+		/* spliceable */
+		b->pif = PIF_SPLICE;
+		b->faddr = a->eaddr;
+
+		if (inany_v4(&a->eaddr))
+			inany_from_af(&b->eaddr, AF_INET, &in4addr_loopback);
+		else
+			inany_from_af(&b->eaddr, AF_INET6, &in6addr_loopback);
+		b->eport = a->fport;
+		if (proto == IPPROTO_TCP)
+			b->eport += c->tcp.fwd_in.delta[b->eport];
+
+		return true;
+	}
+
+	b->pif = PIF_TAP;
+	b->faddr = a->eaddr;
+	b->fport = a->eport;
+
+	bf4 = inany_v4(&b->faddr);
+
+	if (bf4) {
+		if (IN4_IS_ADDR_LOOPBACK(bf4) ||
+		    IN4_IS_ADDR_UNSPECIFIED(bf4) ||
+		    IN4_ARE_ADDR_EQUAL(bf4, &c->ip4.addr_seen))
+			*bf4 = c->ip4.gw;
+	} else {
+		struct in6_addr *bf6 = &b->faddr.a6;
+
+		if (IN6_IS_ADDR_LOOPBACK(bf6) ||
+		    IN6_ARE_ADDR_EQUAL(bf6, &c->ip6.addr_seen) ||
+		    IN6_ARE_ADDR_EQUAL(bf6, &c->ip6.addr)) {
+			if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
+				*bf6 = c->ip6.gw;
+			else
+				*bf6 = c->ip6.addr_ll;
+		}
+	}
+
+	if (bf4) {
+		inany_from_af(&b->eaddr, AF_INET, &c->ip4.addr_seen);
+	} else {
+		if (IN6_IS_ADDR_LINKLOCAL(&b->faddr.a6))
+			b->eaddr.a6 = c->ip6.addr_ll_seen;
+		else
+			b->eaddr.a6 = c->ip6.addr_seen;
+	}
+
+	b->eport = a->fport;
+	if (proto == IPPROTO_TCP)
+		b->eport += c->tcp.fwd_in.delta[b->eport];
+
+	return true;
+}
+
+bool fwd_nat_flow(const struct ctx *c, uint8_t proto,
+		  const struct flowside *a, struct flowside *b)
+{
+	char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN];
+
+	switch (a->pif) {
+	case PIF_TAP:
+		return fwd_from_tap(c, proto, a, b);
+
+	case PIF_SPLICE:
+		return fwd_from_splice(c, proto, a, b);
+
+	case PIF_HOST:
+		return fwd_from_host(c, proto, a, b);
+
+	default:
+		debug("No rules to forward from %s: [%s]:%hu -> [%s]:%hu",
+		      pif_name(a->pif),
+		      inany_ntop(&a->eaddr, estr, sizeof(estr)), a->eport,
+		      inany_ntop(&a->faddr, fstr, sizeof(fstr)), a->fport);
+		return false;
+	}
+}
diff --git a/fwd.h b/fwd.h
index 23281d9..e884f4e 100644
--- a/fwd.h
+++ b/fwd.h
@@ -7,6 +7,8 @@
 #ifndef FWD_H
 #define FWD_H
 
+struct flowside;
+
 /* Number of ports for both TCP and UDP */
 #define	NUM_PORTS	(1U << 16)
 
@@ -41,4 +43,7 @@ void fwd_scan_ports_udp(struct fwd_ports *fwd, const struct fwd_ports *rev,
 			const struct fwd_ports *tcp_rev);
 void fwd_scan_ports_init(struct ctx *c);
 
+bool fwd_nat_flow(const struct ctx *c, uint8_t proto,
+		  const struct flowside *a, struct flowside *b);
+
 #endif /* FWD_H */
diff --git a/tcp.c b/tcp.c
index 8b4d792..e131280 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1782,17 +1782,13 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
 		goto cancel;
 	}
 
-	sockside->pif = PIF_HOST;
-	sockside->eaddr = tapside->faddr;
-	sockside->eport = tapside->fport;
-
-	if (!c->no_map_gw) {
-		struct in_addr *v4 = inany_v4(&sockside->eaddr);
+	if (!fwd_nat_flow(c, IPPROTO_TCP, tapside, sockside))
+		goto cancel;
 
-		if (v4 && IN4_ARE_ADDR_EQUAL(v4, &c->ip4.gw))
-			*v4 = in4addr_loopback;
-		if (IN6_ARE_ADDR_EQUAL(&sockside->eaddr, &c->ip6.gw))
-			sockside->eaddr.a6 = in6addr_loopback;
+	if (sockside->pif != PIF_HOST) {
+		err("No support for forwarding TCP from %s to %s",
+		    pif_name(tapside->pif), pif_name(sockside->pif));
+		goto cancel;
 	}
 
 	if ((s = tcp_conn_sock(c, af)) < 0)
@@ -2479,67 +2475,19 @@ static void tcp_connect_finish(struct ctx *c, struct tcp_tap_conn *conn)
 	conn_flag(c, conn, ACK_FROM_TAP_DUE);
 }
 
-/**
- * tcp_snat_inbound() - Translate source address for inbound data if needed
- * @c:		Execution context
- * @addr:	Source address of inbound packet/connection
- */
-static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr)
-{
-	struct in_addr *addr4 = inany_v4(addr);
-
-	if (addr4) {
-		if (IN4_IS_ADDR_LOOPBACK(addr4) ||
-		    IN4_IS_ADDR_UNSPECIFIED(addr4) ||
-		    IN4_ARE_ADDR_EQUAL(addr4, &c->ip4.addr_seen))
-			*addr4 = c->ip4.gw;
-	} else {
-		struct in6_addr *addr6 = &addr->a6;
-
-		if (IN6_IS_ADDR_LOOPBACK(addr6) ||
-		    IN6_ARE_ADDR_EQUAL(addr6, &c->ip6.addr_seen) ||
-		    IN6_ARE_ADDR_EQUAL(addr6, &c->ip6.addr)) {
-			if (IN6_IS_ADDR_LINKLOCAL(&c->ip6.gw))
-				*addr6 = c->ip6.gw;
-			else
-				*addr6 = c->ip6.addr_ll;
-		}
-	}
-}
-
 /**
  * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection
  * @c:		Execution context
- * @dstport:	Destination port for connection (host side)
  * @flow:	flow to initialise
  * @s:		Accepted socket
  * @now:	Current timestamp
  */
-static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
-				   union flow *flow, int s,
+static void tcp_tap_conn_from_sock(struct ctx *c, union flow *flow, int s,
 				   const struct timespec *now)
 {
-	const struct flowside *sockside = &flow->f.side[SOCKSIDE];
-	struct flowside *tapside = &flow->f.side[TAPSIDE];
-	struct tcp_tap_conn *conn;
+	struct tcp_tap_conn *conn = FLOW_START(flow, FLOW_TCP, tcp, SOCKSIDE);
 	uint64_t hash;
 
-	tapside->pif = PIF_TAP;
-	tapside->faddr = sockside->eaddr;
-	tapside->fport = sockside->eport;
-	tcp_snat_inbound(c, &tapside->faddr);
-	if (CONN_V4(flow)) {
-		inany_from_af(&tapside->eaddr, AF_INET, &c->ip4.addr_seen);
-	} else {
-		if (IN6_IS_ADDR_LINKLOCAL(&tapside->faddr.a6))
-			tapside->eaddr.a6 = c->ip6.addr_ll_seen;
-		else
-			tapside->eaddr.a6 = c->ip6.addr_seen;
-	}
-	tapside->eport = dstport + c->tcp.fwd_in.delta[dstport];
-
-	conn = FLOW_START(flow, FLOW_TCP, tcp, SOCKSIDE);
-
 	conn->sock = s;
 	conn->timer = -1;
 	conn->ws_to_tap = conn->ws_from_tap = 0;
@@ -2567,9 +2515,9 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
 void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
 			const struct timespec *now)
 {
+	struct flowside *side0, *side1;
 	union sockaddr_inany sa;
 	socklen_t sl = sizeof(sa);
-	struct flowside *side0;
 	union flow *flow;
 	int s;
 
@@ -2577,6 +2525,7 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
 		return;
 
 	side0 = &flow->f.side[0];
+	side1 = &flow->f.side[1];
 
 	s = accept4(ref.fd, &sa.sa, &sl, SOCK_NONBLOCK);
 	if (s < 0)
@@ -2594,10 +2543,25 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
 		goto cancel;
 	}
 
-	if (tcp_splice_conn_from_sock(c, ref.tcp_listen.port, flow, s))
-		return;
+	if (!fwd_nat_flow(c, IPPROTO_TCP, side0, side1))
+		goto cancel;
+
+	switch (side1->pif) {
+	case PIF_SPLICE:
+	case PIF_HOST:
+		tcp_splice_conn_from_sock(c, flow, s);
+		break;
+
+	case PIF_TAP:
+		tcp_tap_conn_from_sock(c, flow, s, now);
+		break;
+
+	default:
+		err("No support for forwarding TCP from %s to %s",
+		    pif_name(side0->pif), pif_name(side1->pif));
+		goto cancel;
+	}
 
-	tcp_tap_conn_from_sock(c, ref.tcp_listen.port, flow, s, now);
 	return;
 
 cancel:
diff --git a/tcp_splice.c b/tcp_splice.c
index 00b88c5..c4fbbdd 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -394,67 +394,16 @@ static int tcp_conn_sock_ns(const struct ctx *c, sa_family_t af)
 /**
  * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection
  * @c:		Execution context
- * @dstport:	Side 0 destination port of connection
  * @flow:	flow to initialise
  * @s0:		Accepted (side 0) socket
  *
- * Return: true if able to create a spliced connection, false otherwise
  * #syscalls:pasta setsockopt
  */
-bool tcp_splice_conn_from_sock(const struct ctx *c, in_port_t dstport,
-			       union flow *flow, int s0)
+void tcp_splice_conn_from_sock(const struct ctx *c, union flow *flow, int s0)
 {
-	const struct flowside *side0 = &flow->f.side[0];
-	const union inany_addr *src = &side0->eaddr;
-	struct flowside *side1 = &flow->f.side[1];
 	struct tcp_splice_conn *conn;
-	sa_family_t af;
-	uint8_t pif1;
-
-	if (c->mode != MODE_PASTA)
-		return false;
-
-	af = inany_v4(src) ? AF_INET : AF_INET6;
-
-	switch (side0->pif) {
-	case PIF_SPLICE:
-		if (!inany_is_loopback(src)) {
-			char str[INANY_ADDRSTRLEN];
-
-			/* We can't use flow_err() etc. because we haven't set
-			 * the flow type yet
-			 */
-			warn("Bad source address %s for splice, closing",
-			     inany_ntop(src, str, sizeof(str)));
-
-			/* We *don't* want to fall back to tap */
-			flow_alloc_cancel(flow);
-			return true;
-		}
-
-		pif1 = PIF_HOST;
-		dstport += c->tcp.fwd_out.delta[dstport];
-		break;
-
-	case PIF_HOST:
-		if (!inany_is_loopback(src))
-			return false;
-
-		pif1 = PIF_SPLICE;
-		dstport += c->tcp.fwd_in.delta[dstport];
-		break;
-
-	default:
-		return false;
-	}
-
-	if (af == AF_INET)
-		flowside_from_af(side1, pif1, AF_INET, NULL, 0,
-				 &in4addr_loopback, dstport);
-	else
-		flowside_from_af(side1, pif1, AF_INET6, NULL, 0,
-				 &in6addr_loopback, dstport);
 
+	ASSERT(c->mode == MODE_PASTA);
 	conn = FLOW_START(flow, FLOW_TCP_SPLICE, tcp_splice, 0);
 
 	conn->s[0] = s0;
@@ -467,8 +416,6 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, in_port_t dstport,
 
 	if (tcp_splice_connect(c, conn))
 		conn_flag(c, conn, CLOSING);
-
-	return true;
 }
 
 /**
diff --git a/tcp_splice.h b/tcp_splice.h
index e523c7e..a20f3e2 100644
--- a/tcp_splice.h
+++ b/tcp_splice.h
@@ -11,8 +11,7 @@ union sockaddr_inany;
 
 void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
 			     uint32_t events);
-bool tcp_splice_conn_from_sock(const struct ctx *c, in_port_t dstport,
-			       union flow *flow, int s0);
+void tcp_splice_conn_from_sock(const struct ctx *c, union flow *flow, int s0);
 void tcp_splice_init(struct ctx *c);
 
 #endif /* TCP_SPLICE_H */
-- 
@@ -11,8 +11,7 @@ union sockaddr_inany;
 
 void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
 			     uint32_t events);
-bool tcp_splice_conn_from_sock(const struct ctx *c, in_port_t dstport,
-			       union flow *flow, int s0);
+void tcp_splice_conn_from_sock(const struct ctx *c, union flow *flow, int s0);
 void tcp_splice_init(struct ctx *c);
 
 #endif /* TCP_SPLICE_H */
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v4 16/16] flow, icmp: Use general flow forwarding rules for ICMP
  2024-05-03  1:11 [PATCH v4 00/16] RFC: Unified flow table David Gibson
                   ` (14 preceding siblings ...)
  2024-05-03  1:11 ` [PATCH v4 15/16] flow, tcp: flow based NAT and port forwarding for TCP David Gibson
@ 2024-05-03  1:11 ` David Gibson
  15 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2024-05-03  1:11 UTC (permalink / raw)
  To: passt-dev, Stefano Brivio; +Cc: David Gibson

Current ICMP hard codes its forwarding rules, and never applies any
translations.  Change it to use the fwd_nat_flow() function, so that it's
translated the same as TCP (excluding TCP specific port redirection).

This means that gw mapping now applies to ICMP so "ping <gw address>" will
now ping the host's loopback instead of the actual gw machine.  This
removes the surprising behaviour that the target you ping might not be the
same as you connect to with TCP.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 icmp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/icmp.c b/icmp.c
index 6a76d05..c960abf 100644
--- a/icmp.c
+++ b/icmp.c
@@ -172,7 +172,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 		return NULL;
 
 	flowside_from_af(tapside, pif, af, daddr, id, saddr, id);
-	flowside_from_af(sockside, PIF_HOST, af, NULL, 0, daddr, 0);
+	fwd_nat_flow(c, flow_proto[flowtype], tapside, sockside);
 
 	pingf = FLOW_START(flow, flowtype, ping, TAPSIDE);
 
-- 
@@ -172,7 +172,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c,
 		return NULL;
 
 	flowside_from_af(tapside, pif, af, daddr, id, saddr, id);
-	flowside_from_af(sockside, PIF_HOST, af, NULL, 0, daddr, 0);
+	fwd_nat_flow(c, flow_proto[flowtype], tapside, sockside);
 
 	pingf = FLOW_START(flow, flowtype, ping, TAPSIDE);
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v4 01/16] flow: Common data structures for tracking flow addresses
  2024-05-03  1:11 ` [PATCH v4 01/16] flow: Common data structures for tracking flow addresses David Gibson
@ 2024-05-13 18:07   ` Stefano Brivio
  2024-05-14  0:11     ` David Gibson
  0 siblings, 1 reply; 23+ messages in thread
From: Stefano Brivio @ 2024-05-13 18:07 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

Minor comments/nits only:

On Fri,  3 May 2024 11:11:20 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> Handling of each protocol needs some degree of tracking of the addresses
> and ports at the end of each connection or flow.  Sometimes that's explicit
> (as in the guest visible addresses for TCP connections), sometimes implicit
> (the bound and connected addresses of sockets).
> 
> To allow more general and robust handling, and more consistency across
> protocols we want to uniformly track the address and port at each end of
> the connection.  Furthermore, because we allow port remapping, and we
> sometimes need to apply NAT, the addresses and ports can be different as
> seen by the guest/namespace and as by the host.
> 
> Introduce 'struct flowside' to keep track of common information
> related to one side of each flow.  For now that's the addresses, ports
> and the pif id.  Store two of these in the common fields of a flow to
> track that information for both sides.  For now we just introduce the
> structure itself, helpers to populate it, and logging of the contents
> when starting and ending flows.  Later patches will actually put
> something useful there.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  flow.c     | 28 ++++++++++++++++++--
>  flow.h     | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  passt.h    |  3 +++
>  pif.h      |  1 -
>  tcp_conn.h |  1 -
>  5 files changed, 104 insertions(+), 4 deletions(-)
> 
> diff --git a/flow.c b/flow.c
> index 80dd269..02d6008 100644
> --- a/flow.c
> +++ b/flow.c
> @@ -51,10 +51,11 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
>   *
>   *    ALLOC - A tentatively allocated entry
>   *        Operations:
> + *            - Common flow fields other than type may be accessed
>   *            - flow_alloc_cancel() returns the entry to FREE state
>   *            - FLOW_START() set the entry's type and moves to START state
>   *        Caveats:
> - *            - It's not safe to write fields in the flow entry
> + *            - It's not safe to write flow type specific fields in the entry
>   *            - It's not safe to allocate further entries with flow_alloc()
>   *            - It's not safe to return to the main epoll loop (use FLOW_START()
>   *              to move to START state before doing so)
> @@ -62,6 +63,7 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
>   *
>   *    START - An entry being prepared by flow type specific code
>   *        Operations:
> + *            - Common flow fields other than type may be accessed
>   *            - Flow type specific fields may be accessed
>   *            - flow_*() logging functions
>   *            - flow_alloc_cancel() returns the entry to FREE state
> @@ -168,9 +170,21 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
>  union flow *flow_start(union flow *flow, enum flow_type type,
>  		       unsigned iniside)
>  {
> -	(void)iniside;
> +	char ebuf[INANY_ADDRSTRLEN], fbuf[INANY_ADDRSTRLEN];
> +	const struct flowside *a = &flow->f.side[iniside];

As long as iniside is used as a binary value (I guess it's unsigned
because you have in mind that it could eventually be extended, right?),
I think '!!iniside' would be clearer and perhaps more robust here.

> +	const struct flowside *b = &flow->f.side[!iniside];
> +
>  	flow->f.type = type;
>  	flow_dbg(flow, "START %s", flow_type_str[flow->f.type]);
> +	flow_dbg(flow, "  from side %u (%s): [%s]:%hu -> [%s]:%hu",
> +		 iniside, pif_name(a->pif),
> +		 inany_ntop(&a->eaddr, ebuf, sizeof(ebuf)), a->eport,
> +		 inany_ntop(&a->faddr, fbuf, sizeof(fbuf)), a->fport);
> +	flow_dbg(flow, "    to side %u (%s): [%s]:%hu -> [%s]:%hu",
> +		 !iniside, pif_name(b->pif),
> +		 inany_ntop(&b->faddr, fbuf, sizeof(fbuf)), b->fport,
> +		 inany_ntop(&b->eaddr, ebuf, sizeof(ebuf)), b->eport);
> +
>  	return flow;
>  }
>  
> @@ -180,10 +194,20 @@ union flow *flow_start(union flow *flow, enum flow_type type,
>   */
>  static void flow_end(union flow *flow)
>  {
> +	char ebuf[INANY_ADDRSTRLEN], fbuf[INANY_ADDRSTRLEN];
> +	const struct flowside *a = &flow->f.side[0];
> +	const struct flowside *b = &flow->f.side[1];
> +
>  	if (flow->f.type == FLOW_TYPE_NONE)
>  		return; /* Nothing to do */
>  
>  	flow_dbg(flow, "END %s", flow_type_str[flow->f.type]);
> +	flow_dbg(flow, "  side 0 (%s): [%s]:%hu <-> [%s]:%hu", pif_name(a->pif),
> +		 inany_ntop(&a->faddr, fbuf, sizeof(fbuf)), a->fport,
> +		 inany_ntop(&a->eaddr, ebuf, sizeof(ebuf)), a->eport);
> +	flow_dbg(flow, "  side 1 (%s): [%s]:%hu <-> [%s]:%hu", pif_name(b->pif),
> +		 inany_ntop(&b->faddr, fbuf, sizeof(fbuf)), b->fport,
> +		 inany_ntop(&b->eaddr, ebuf, sizeof(ebuf)), b->eport);
>  	flow->f.type = FLOW_TYPE_NONE;
>  }
>  
> diff --git a/flow.h b/flow.h
> index c943c44..f7fb537 100644
> --- a/flow.h
> +++ b/flow.h
> @@ -35,11 +35,86 @@ extern const uint8_t flow_proto[];
>  #define FLOW_PROTO(f)				\
>  	((f)->type < FLOW_NUM_TYPES ? flow_proto[(f)->type] : 0)
>  
> +/**
> + * struct flowside - Common information for one side of a flow
> + * @eaddr:	Endpoint address (remote address from passt's PoV)
> + * @faddr:	Forwarding address (local address from passt's PoV)
> + * @eport:	Endpoint port
> + * @fport:	Forwarding port
> + * @pif:	pif ID on which this side of the flow exists
> + */
> +struct flowside {
> +	union inany_addr	faddr;
> +	union inany_addr	eaddr;
> +	in_port_t		fport;
> +	in_port_t		eport;
> +	uint8_t			pif;
> +};
> +static_assert(_Alignof(struct flowside) == _Alignof(uint32_t),
> +	      "Unexpected alignment for struct flowside");

I'm too thick to understand the reason behind this assert.

> +
> +/** flowside_from_inany - Initialize flowside from inany addresses

flowside_from_inany(), it's a function.

> + * @fside:	flowside to initialize
> + * @pif:	pif id of this flowside
> + * @faddr:	Forwarding address (inany)
> + * @fport:	Forwarding port
> + * @eaddr:	Endpoint address (inany)
> + * @eport:	Endpoint port
> + */
> +/* cppcheck-suppress unusedFunction */
> +static inline void flowside_from_inany(struct flowside *fside, uint8_t pif,
> +				const union inany_addr *faddr, in_port_t fport,
> +				const union inany_addr *eaddr, in_port_t eport)
> +{
> +	fside->pif = pif;
> +	fside->faddr = *faddr;
> +	fside->eaddr = *eaddr;
> +	fside->fport = fport;
> +	fside->eport = eport;
> +}
> +
> +/** flowside_from_af - Initialize flowside from addresses

flowside_from_af()

> + * @fside:	flowside to initialize
> + * @pif:	pif id of this flowside
> + * @af:		Address family (AF_INET or AF_INET6)
> + * @faddr:	Forwarding address (pointer to in_addr or in6_addr, or NULL)
> + * @fport:	Forwarding port
> + * @eaddr:	Endpoint address (pointer to in_addr or in6_addr, or NULL)
> + * @eport:	Endpoint port
> + *
> + * If NULL is given for either address, the appropriate unspecified/any address

s/any/wildcard/ makes it a bit easier to follow, I guess.

> + * for the address family is substituted.
> + */
> +/* cppcheck-suppress unusedFunction */
> +static inline void flowside_from_af(struct flowside *fside,
> +				    uint8_t pif, sa_family_t af,
> +				    const void *faddr, in_port_t fport,
> +				    const void *eaddr, in_port_t eport)
> +{
> +	const union inany_addr *any = af == AF_INET ? &inany_any4 : &inany_any6;
> +
> +	fside->pif = pif;
> +	if (faddr)
> +		inany_from_af(&fside->faddr, af, faddr);
> +	else
> +		fside->faddr = *any;
> +	if (eaddr)
> +		inany_from_af(&fside->eaddr, af, eaddr);
> +	else
> +		fside->eaddr = *any;
> +	fside->fport = fport;
> +	fside->eport = eport;
> +}
> +
> +#define SIDES			2
> +
>  /**
>   * struct flow_common - Common fields for packet flows
> + * @side[]:	Information for each side of the flow
>   * @type:	Type of packet flow
>   */
>  struct flow_common {
> +	struct flowside	side[SIDES];
>  	uint8_t		type;
>  };
>  
> diff --git a/passt.h b/passt.h
> index bc58d64..3db0b8e 100644
> --- a/passt.h
> +++ b/passt.h
> @@ -17,6 +17,9 @@ union epoll_ref;
>  
>  #include "pif.h"
>  #include "packet.h"
> +#include "siphash.h"
> +#include "ip.h"
> +#include "inany.h"
>  #include "flow.h"
>  #include "icmp.h"
>  #include "fwd.h"
> diff --git a/pif.h b/pif.h
> index bd52936..ca85b34 100644
> --- a/pif.h
> +++ b/pif.h
> @@ -38,7 +38,6 @@ static inline const char *pif_type(enum pif_type pt)
>  		return "?";
>  }
>  
> -/* cppcheck-suppress unusedFunction */
>  static inline const char *pif_name(uint8_t pif)
>  {
>  	return pif_type(pif);
> diff --git a/tcp_conn.h b/tcp_conn.h
> index d280b22..1a07dd5 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -106,7 +106,6 @@ struct tcp_tap_conn {
>  	uint32_t	seq_init_from_tap;
>  };
>  
> -#define SIDES			2
>  /**
>   * struct tcp_splice_conn - Descriptor for a spliced TCP connection
>   * @f:			Generic flow information

-- 
Stefano


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v4 02/16] tcp: Maintain flowside information for "tap" connections
  2024-05-03  1:11 ` [PATCH v4 02/16] tcp: Maintain flowside information for "tap" connections David Gibson
@ 2024-05-13 18:07   ` Stefano Brivio
  2024-05-14  0:15     ` David Gibson
  0 siblings, 1 reply; 23+ messages in thread
From: Stefano Brivio @ 2024-05-13 18:07 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri,  3 May 2024 11:11:21 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> tcp_tap_conn has several fields to track addresses and ports as seen
> by the guest/namespace.  We now have general fields for this in the
> common flowside struct so use those instead of protocol specific
> fields.  The flowside also has space for the guest side endpoint
> address (local address from the guest's PoV) so we fill that in as
> well.
> 
> We didn't previously store equivalent information for the connection
> as it appears to the host; that was implicit in the state of the host
> side socket.  For future generalisations of flow/connection tracking,
> we're going to need that information, so populate the other flowside
> in each flow table entry with as much of this information as we can
> easily obtain.  For connections initiated by the guest that's the
> endpoint address and port.  To get the forwarding address and port
> we'd need to call getsockname() in general, so leave that blank for
> now.  For connections initiated from outside, we also have the
> endpoint address from accept().  We have the forwarding port from the
> epoll ref, but we leave the forwarding address blank.
> 
> For now we just fill the information in without really using it for
> anything.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  flow.h     |  1 -
>  tcp.c      | 88 +++++++++++++++++++++++++++++++++++++-----------------
>  tcp_conn.h |  8 -----
>  3 files changed, 60 insertions(+), 37 deletions(-)
> 
> diff --git a/flow.h b/flow.h
> index f7fb537..88caa76 100644
> --- a/flow.h
> +++ b/flow.h
> @@ -85,7 +85,6 @@ static inline void flowside_from_inany(struct flowside *fside, uint8_t pif,
>   * If NULL is given for either address, the appropriate unspecified/any address
>   * for the address family is substituted.
>   */
> -/* cppcheck-suppress unusedFunction */
>  static inline void flowside_from_af(struct flowside *fside,
>  				    uint8_t pif, sa_family_t af,
>  				    const void *faddr, in_port_t fport,
> diff --git a/tcp.c b/tcp.c
> index 21d0af0..1835b86 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -372,7 +372,7 @@
>  #define OPT_SACK	5
>  #define OPT_TS		8
>  
> -#define CONN_V4(conn)		(!!inany_v4(&(conn)->faddr))
> +#define CONN_V4(conn)		(!!inany_v4(&conn->f.side[TAPSIDE].faddr))

...which reminds me: I guess CONN_V4() and CONN_V6() should eventually
go away, just like SPLICE_V6 in 7/16.

>  #define CONN_V6(conn)		(!CONN_V4(conn))
>  #define CONN_IS_CLOSING(conn)						\
>  	((conn->events & ESTABLISHED) &&				\
> @@ -795,10 +795,11 @@ static void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
>   */
>  static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn)
>  {
> +	const struct flowside *tapside = &conn->f.side[TAPSIDE];
>  	int i;
>  
>  	for (i = 0; i < LOW_RTT_TABLE_SIZE; i++)
> -		if (inany_equals(&conn->faddr, low_rtt_dst + i))
> +		if (inany_equals(&tapside->faddr, low_rtt_dst + i))
>  			return 1;
>  
>  	return 0;
> @@ -813,6 +814,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
>  			      const struct tcp_info *tinfo)
>  {
>  #ifdef HAS_MIN_RTT
> +	const struct flowside *tapside = &conn->f.side[TAPSIDE];
>  	int i, hole = -1;
>  
>  	if (!tinfo->tcpi_min_rtt ||
> @@ -820,7 +822,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
>  		return;
>  
>  	for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) {
> -		if (inany_equals(&conn->faddr, low_rtt_dst + i))
> +		if (inany_equals(&tapside->faddr, low_rtt_dst + i))
>  			return;
>  		if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i))
>  			hole = i;
> @@ -832,7 +834,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
>  	if (hole == -1)
>  		return;
>  
> -	low_rtt_dst[hole++] = conn->faddr;
> +	low_rtt_dst[hole++] = tapside->faddr;
>  	if (hole == LOW_RTT_TABLE_SIZE)
>  		hole = 0;
>  	inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any);
> @@ -1085,8 +1087,10 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn,
>  			  const union inany_addr *faddr,
>  			  in_port_t eport, in_port_t fport)
>  {
> -	if (inany_equals(&conn->faddr, faddr) &&
> -	    conn->eport == eport && conn->fport == fport)
> +	const struct flowside *tapside = &conn->f.side[TAPSIDE];
> +
> +	if (inany_equals(&tapside->faddr, faddr) &&
> +	    tapside->eport == eport && tapside->fport == fport)
>  		return 1;
>  
>  	return 0;
> @@ -1120,7 +1124,9 @@ static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr,
>  static uint64_t tcp_conn_hash(const struct ctx *c,
>  			      const struct tcp_tap_conn *conn)
>  {
> -	return tcp_hash(c, &conn->faddr, conn->eport, conn->fport);
> +	const struct flowside *tapside = &conn->f.side[TAPSIDE];
> +
> +	return tcp_hash(c, &tapside->faddr, tapside->eport, tapside->fport);
>  }
>  
>  /**
> @@ -1302,10 +1308,12 @@ void tcp_defer_handler(struct ctx *c)
>   * @seq:	Sequence number
>   */
>  static void tcp_fill_header(struct tcphdr *th,
> -			       const struct tcp_tap_conn *conn, uint32_t seq)
> +			    const struct tcp_tap_conn *conn, uint32_t seq)
>  {
> -	th->source = htons(conn->fport);
> -	th->dest = htons(conn->eport);
> +	const struct flowside *tapside = &conn->f.side[TAPSIDE];	
> +
> +	th->source = htons(tapside->fport);
> +	th->dest = htons(tapside->eport);
>  	th->seq = htonl(seq);
>  	th->ack_seq = htonl(conn->seq_ack_to_tap);
>  	if (conn->events & ESTABLISHED)	{
> @@ -1337,7 +1345,8 @@ static size_t tcp_fill_headers4(const struct ctx *c,
>  				size_t dlen, const uint16_t *check,
>  				uint32_t seq)
>  {
> -	const struct in_addr *a4 = inany_v4(&conn->faddr);
> +	const struct flowside *tapside = &conn->f.side[TAPSIDE];
> +	const struct in_addr *a4 = inany_v4(&tapside->faddr);
>  	size_t l4len = dlen + sizeof(*th);
>  	size_t l3len = l4len + sizeof(*iph);
>  
> @@ -1379,10 +1388,11 @@ static size_t tcp_fill_headers6(const struct ctx *c,
>  				struct ipv6hdr *ip6h, struct tcphdr *th,
>  				size_t dlen, uint32_t seq)
>  {
> +	const struct flowside *tapside = &conn->f.side[TAPSIDE];
>  	size_t l4len = dlen + sizeof(*th);
>  
>  	ip6h->payload_len = htons(l4len);
> -	ip6h->saddr = conn->faddr.a6;
> +	ip6h->saddr = tapside->faddr.a6;
>  	if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr))
>  		ip6h->daddr = c->ip6.addr_ll_seen;
>  	else
> @@ -1421,9 +1431,7 @@ static size_t tcp_l2_buf_fill_headers(const struct ctx *c,
>  				      struct iovec *iov, size_t dlen,
>  				      const uint16_t *check, uint32_t seq)
>  {
> -	const struct in_addr *a4 = inany_v4(&conn->faddr);
> -
> -	if (a4) {
> +	if (CONN_V4(conn)) {
>  		return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base,
>  					 iov[TCP_IOV_IP].iov_base,
>  					 iov[TCP_IOV_PAYLOAD].iov_base, dlen,
> @@ -1738,7 +1746,7 @@ static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
>  /**
>   * tcp_seq_init() - Calculate initial sequence number according to RFC 6528
>   * @c:		Execution context
> - * @conn:	TCP connection, with faddr, fport and eport populated
> + * @conn:	TCP connection, with tap flowside faddr, fport and eport
>   * @now:	Current timestamp
>   */
>  static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
> @@ -1746,6 +1754,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
>  {
>  	struct siphash_state state = SIPHASH_INIT(c->hash_secret);
>  	union inany_addr aany;
> +	const struct flowside *tapside = &conn->f.side[TAPSIDE];

One line up.

>  	uint64_t hash;
>  	uint32_t ns;
>  
> @@ -1754,10 +1763,10 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
>  	else
>  		inany_from_af(&aany, AF_INET6, &c->ip6.addr);
>  
> -	inany_siphash_feed(&state, &conn->faddr);
> +	inany_siphash_feed(&state, &tapside->faddr);
>  	inany_siphash_feed(&state, &aany);
>  	hash = siphash_final(&state, 36,
> -			     (uint64_t)conn->fport << 16 | conn->eport);
> +			     (uint64_t)tapside->fport << 16 | tapside->eport);
>  
>  	/* 32ns ticks, overflows 32 bits every 137s */
>  	ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
> @@ -1945,6 +1954,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
>  		.sin6_port = htons(dstport),
>  		.sin6_addr = *(struct in6_addr *)daddr,
>  	};
> +	struct flowside *tapside, *sockside;
>  	const struct sockaddr *sa;
>  	struct tcp_tap_conn *conn;
>  	union flow *flow;
> @@ -1954,6 +1964,11 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
>  	if (!(flow = flow_alloc()))
>  		return;
>  
> +	tapside = &flow->f.side[TAPSIDE];
> +	sockside = &flow->f.side[SOCKSIDE];
> +
> +	flowside_from_af(tapside, PIF_TAP, af, daddr, dstport, saddr, srcport);
> +
>  	if (af == AF_INET) {
>  		if (IN4_IS_ADDR_UNSPECIFIED(saddr) ||
>  		    IN4_IS_ADDR_BROADCAST(saddr) ||
> @@ -2026,19 +2041,19 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
>  	if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap)))
>  		conn->wnd_from_tap = 1;
>  
> -	inany_from_af(&conn->faddr, af, daddr);
> +	sockside->pif = PIF_HOST;
> +	sockside->eport = dstport;
>  
>  	if (af == AF_INET) {
> +		inany_from_af(&sockside->eaddr, AF_INET, &addr4.sin_addr);
>  		sa = (struct sockaddr *)&addr4;
>  		sl = sizeof(addr4);
>  	} else {
> +		inany_from_af(&sockside->eaddr, AF_INET6, &addr6.sin6_addr);
>  		sa = (struct sockaddr *)&addr6;
>  		sl = sizeof(addr6);
>  	}
>  
> -	conn->fport = dstport;
> -	conn->eport = srcport;
> -
>  	conn->seq_init_from_tap = ntohl(th->seq);
>  	conn->seq_from_tap = conn->seq_init_from_tap + 1;
>  	conn->seq_ack_to_tap = conn->seq_from_tap;
> @@ -2724,18 +2739,35 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
>  				   const union sockaddr_inany *sa,
>  				   const struct timespec *now)
>  {
> -	struct tcp_tap_conn *conn = FLOW_START(flow, FLOW_TCP, tcp, SOCKSIDE);
> +	struct flowside *sockside = &flow->f.side[SOCKSIDE];
> +	struct flowside *tapside = &flow->f.side[TAPSIDE];
> +	struct tcp_tap_conn *conn;
> +
> +	sockside->pif = PIF_HOST;
> +	inany_from_sockaddr(&sockside->eaddr, &sockside->eport, sa);
> +	sockside->fport = dstport;
> +
> +	tapside->pif = PIF_TAP;
> +	tapside->faddr = sockside->eaddr;
> +	tapside->fport = sockside->eport;
> +	tcp_snat_inbound(c, &tapside->faddr);
> +	if (CONN_V4(flow)) {
> +		inany_from_af(&tapside->eaddr, AF_INET, &c->ip4.addr_seen);
> +	} else {
> +		if (IN6_IS_ADDR_LINKLOCAL(&tapside->faddr.a6))
> +			tapside->eaddr.a6 = c->ip6.addr_ll_seen;
> +		else
> +			tapside->eaddr.a6 = c->ip6.addr_seen;
> +	}
> +	tapside->eport = dstport + c->tcp.fwd_in.delta[dstport];

Pre-existing, but I wonder: doesn't this port translation also belong
to tcp_snat_inbound()?

> +
> +	conn = FLOW_START(flow, FLOW_TCP, tcp, SOCKSIDE);
>  
>  	conn->sock = s;
>  	conn->timer = -1;
>  	conn->ws_to_tap = conn->ws_from_tap = 0;
>  	conn_event(c, conn, SOCK_ACCEPTED);
>  
> -	inany_from_sockaddr(&conn->faddr, &conn->fport, sa);
> -	conn->eport = dstport + c->tcp.fwd_in.delta[dstport];
> -
> -	tcp_snat_inbound(c, &conn->faddr);
> -
>  	tcp_seq_init(c, conn, now);
>  	tcp_hash_insert(c, conn);
>  
> diff --git a/tcp_conn.h b/tcp_conn.h
> index 1a07dd5..f55f144 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -23,9 +23,6 @@
>   * @ws_to_tap:		Window scaling factor advertised to tap/guest
>   * @sndbuf:		Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS
>   * @seq_dup_ack_approx:	Last duplicate ACK number sent to tap
> - * @faddr:		Guest side forwarding address (guest's remote address)
> - * @eport:		Guest side endpoint port (guest's local port)
> - * @fport:		Guest side forwarding port (guest's remote port)
>   * @wnd_from_tap:	Last window size from tap, unscaled (as received)
>   * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
>   * @seq_to_tap:		Next sequence for packets to tap
> @@ -91,11 +88,6 @@ struct tcp_tap_conn {
>  
>  	uint8_t		seq_dup_ack_approx;
>  
> -
> -	union inany_addr faddr;
> -	in_port_t	eport;
> -	in_port_t	fport;
> -
>  	uint16_t	wnd_from_tap;
>  	uint16_t	wnd_to_tap;
>  

-- 
Stefano


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v4 04/16] tcp: Obtain guest address from flowside
  2024-05-03  1:11 ` [PATCH v4 04/16] tcp: Obtain guest address from flowside David Gibson
@ 2024-05-13 18:07   ` Stefano Brivio
  2024-05-14  0:18     ` David Gibson
  0 siblings, 1 reply; 23+ messages in thread
From: Stefano Brivio @ 2024-05-13 18:07 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On Fri,  3 May 2024 11:11:23 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> Currently we always deliver inbound TCP packets to the guest's most
> recent observed IP address.  This has the odd side effect that if the
> guest changes its IP address with active TCP connections we might
> deliver packets from old connections to the new address.  That won't
> work; it will will probably result in an RST from the guest.  Worse,

s/will will/will/

...if I recall correctly, that was actually working, as long as we
don't swap link-local with global unicast addresses (hence those
conditions sprinkled all over the place).

But it doesn't matter in any case, this is surely the way forward.

-- 
Stefano


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v4 01/16] flow: Common data structures for tracking flow addresses
  2024-05-13 18:07   ` Stefano Brivio
@ 2024-05-14  0:11     ` David Gibson
  0 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2024-05-14  0:11 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 10733 bytes --]

On Mon, May 13, 2024 at 08:07:00PM +0200, Stefano Brivio wrote:
> Minor comments/nits only:
> 
> On Fri,  3 May 2024 11:11:20 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > Handling of each protocol needs some degree of tracking of the addresses
> > and ports at the end of each connection or flow.  Sometimes that's explicit
> > (as in the guest visible addresses for TCP connections), sometimes implicit
> > (the bound and connected addresses of sockets).
> > 
> > To allow more general and robust handling, and more consistency across
> > protocols we want to uniformly track the address and port at each end of
> > the connection.  Furthermore, because we allow port remapping, and we
> > sometimes need to apply NAT, the addresses and ports can be different as
> > seen by the guest/namespace and as by the host.
> > 
> > Introduce 'struct flowside' to keep track of common information
> > related to one side of each flow.  For now that's the addresses, ports
> > and the pif id.  Store two of these in the common fields of a flow to
> > track that information for both sides.  For now we just introduce the
> > structure itself, helpers to populate it, and logging of the contents
> > when starting and ending flows.  Later patches will actually put
> > something useful there.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  flow.c     | 28 ++++++++++++++++++--
> >  flow.h     | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  passt.h    |  3 +++
> >  pif.h      |  1 -
> >  tcp_conn.h |  1 -
> >  5 files changed, 104 insertions(+), 4 deletions(-)
> > 
> > diff --git a/flow.c b/flow.c
> > index 80dd269..02d6008 100644
> > --- a/flow.c
> > +++ b/flow.c
> > @@ -51,10 +51,11 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
> >   *
> >   *    ALLOC - A tentatively allocated entry
> >   *        Operations:
> > + *            - Common flow fields other than type may be accessed
> >   *            - flow_alloc_cancel() returns the entry to FREE state
> >   *            - FLOW_START() set the entry's type and moves to START state
> >   *        Caveats:
> > - *            - It's not safe to write fields in the flow entry
> > + *            - It's not safe to write flow type specific fields in the entry
> >   *            - It's not safe to allocate further entries with flow_alloc()
> >   *            - It's not safe to return to the main epoll loop (use FLOW_START()
> >   *              to move to START state before doing so)
> > @@ -62,6 +63,7 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
> >   *
> >   *    START - An entry being prepared by flow type specific code
> >   *        Operations:
> > + *            - Common flow fields other than type may be accessed
> >   *            - Flow type specific fields may be accessed
> >   *            - flow_*() logging functions
> >   *            - flow_alloc_cancel() returns the entry to FREE state
> > @@ -168,9 +170,21 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
> >  union flow *flow_start(union flow *flow, enum flow_type type,
> >  		       unsigned iniside)
> >  {
> > -	(void)iniside;
> > +	char ebuf[INANY_ADDRSTRLEN], fbuf[INANY_ADDRSTRLEN];
> > +	const struct flowside *a = &flow->f.side[iniside];
> 
> As long as iniside is used as a binary value (I guess it's unsigned
> because you have in mind that it could eventually be extended, right?),

Not really.  My intention is that it's fundamentally a two value
variable.  However, it's used as an array index and doesn't represent
true/false values, so bool didn't seem right.  Signs added extra
complications in some cases, hence unsigned.  I'd use uint1_t if that
were a thing...

> I think '!!iniside' would be clearer and perhaps more robust here.

Hm.  I don't really like that.  If iniside ever has a value other than
0 or 1, that's a bug.  Fwiw, this particular instance is gone in the
latest version and there are more places where we use just constants,
but it's not all of them.  I guess see what you think on the new
version.

> > +	const struct flowside *b = &flow->f.side[!iniside];
> > +
> >  	flow->f.type = type;
> >  	flow_dbg(flow, "START %s", flow_type_str[flow->f.type]);
> > +	flow_dbg(flow, "  from side %u (%s): [%s]:%hu -> [%s]:%hu",
> > +		 iniside, pif_name(a->pif),
> > +		 inany_ntop(&a->eaddr, ebuf, sizeof(ebuf)), a->eport,
> > +		 inany_ntop(&a->faddr, fbuf, sizeof(fbuf)), a->fport);
> > +	flow_dbg(flow, "    to side %u (%s): [%s]:%hu -> [%s]:%hu",
> > +		 !iniside, pif_name(b->pif),
> > +		 inany_ntop(&b->faddr, fbuf, sizeof(fbuf)), b->fport,
> > +		 inany_ntop(&b->eaddr, ebuf, sizeof(ebuf)), b->eport);
> > +
> >  	return flow;
> >  }
> >  
> > @@ -180,10 +194,20 @@ union flow *flow_start(union flow *flow, enum flow_type type,
> >   */
> >  static void flow_end(union flow *flow)
> >  {
> > +	char ebuf[INANY_ADDRSTRLEN], fbuf[INANY_ADDRSTRLEN];
> > +	const struct flowside *a = &flow->f.side[0];
> > +	const struct flowside *b = &flow->f.side[1];
> > +
> >  	if (flow->f.type == FLOW_TYPE_NONE)
> >  		return; /* Nothing to do */
> >  
> >  	flow_dbg(flow, "END %s", flow_type_str[flow->f.type]);
> > +	flow_dbg(flow, "  side 0 (%s): [%s]:%hu <-> [%s]:%hu", pif_name(a->pif),
> > +		 inany_ntop(&a->faddr, fbuf, sizeof(fbuf)), a->fport,
> > +		 inany_ntop(&a->eaddr, ebuf, sizeof(ebuf)), a->eport);
> > +	flow_dbg(flow, "  side 1 (%s): [%s]:%hu <-> [%s]:%hu", pif_name(b->pif),
> > +		 inany_ntop(&b->faddr, fbuf, sizeof(fbuf)), b->fport,
> > +		 inany_ntop(&b->eaddr, ebuf, sizeof(ebuf)), b->eport);
> >  	flow->f.type = FLOW_TYPE_NONE;
> >  }
> >  
> > diff --git a/flow.h b/flow.h
> > index c943c44..f7fb537 100644
> > --- a/flow.h
> > +++ b/flow.h
> > @@ -35,11 +35,86 @@ extern const uint8_t flow_proto[];
> >  #define FLOW_PROTO(f)				\
> >  	((f)->type < FLOW_NUM_TYPES ? flow_proto[(f)->type] : 0)
> >  
> > +/**
> > + * struct flowside - Common information for one side of a flow
> > + * @eaddr:	Endpoint address (remote address from passt's PoV)
> > + * @faddr:	Forwarding address (local address from passt's PoV)
> > + * @eport:	Endpoint port
> > + * @fport:	Forwarding port
> > + * @pif:	pif ID on which this side of the flow exists
> > + */
> > +struct flowside {
> > +	union inany_addr	faddr;
> > +	union inany_addr	eaddr;
> > +	in_port_t		fport;
> > +	in_port_t		eport;
> > +	uint8_t			pif;
> > +};
> > +static_assert(_Alignof(struct flowside) == _Alignof(uint32_t),
> > +	      "Unexpected alignment for struct flowside");
> 
> I'm too thick to understand the reason behind this assert.

I guess there isn't a particularly strong reason.  This was mostly so
I didn't get surprised by some weird alignment padding.

> > +
> > +/** flowside_from_inany - Initialize flowside from inany addresses
> 
> flowside_from_inany(), it's a function.

Gone in the latest version anyway.

> 
> > + * @fside:	flowside to initialize
> > + * @pif:	pif id of this flowside
> > + * @faddr:	Forwarding address (inany)
> > + * @fport:	Forwarding port
> > + * @eaddr:	Endpoint address (inany)
> > + * @eport:	Endpoint port
> > + */
> > +/* cppcheck-suppress unusedFunction */
> > +static inline void flowside_from_inany(struct flowside *fside, uint8_t pif,
> > +				const union inany_addr *faddr, in_port_t fport,
> > +				const union inany_addr *eaddr, in_port_t eport)
> > +{
> > +	fside->pif = pif;
> > +	fside->faddr = *faddr;
> > +	fside->eaddr = *eaddr;
> > +	fside->fport = fport;
> > +	fside->eport = eport;
> > +}
> > +
> > +/** flowside_from_af - Initialize flowside from addresses
> 
> flowside_from_af()

Fixed.  I changed to the british spelling of initialise while I was at
it.

> > + * @fside:	flowside to initialize
> > + * @pif:	pif id of this flowside
> > + * @af:		Address family (AF_INET or AF_INET6)
> > + * @faddr:	Forwarding address (pointer to in_addr or in6_addr, or NULL)
> > + * @fport:	Forwarding port
> > + * @eaddr:	Endpoint address (pointer to in_addr or in6_addr, or NULL)
> > + * @eport:	Endpoint port
> > + *
> > + * If NULL is given for either address, the appropriate unspecified/any address
> 
> s/any/wildcard/ makes it a bit easier to follow, I guess.

That behaviour and comment is gone in the latest version.

> > + * for the address family is substituted.
> > + */
> > +/* cppcheck-suppress unusedFunction */
> > +static inline void flowside_from_af(struct flowside *fside,
> > +				    uint8_t pif, sa_family_t af,
> > +				    const void *faddr, in_port_t fport,
> > +				    const void *eaddr, in_port_t eport)
> > +{
> > +	const union inany_addr *any = af == AF_INET ? &inany_any4 : &inany_any6;
> > +
> > +	fside->pif = pif;
> > +	if (faddr)
> > +		inany_from_af(&fside->faddr, af, faddr);
> > +	else
> > +		fside->faddr = *any;
> > +	if (eaddr)
> > +		inany_from_af(&fside->eaddr, af, eaddr);
> > +	else
> > +		fside->eaddr = *any;
> > +	fside->fport = fport;
> > +	fside->eport = eport;
> > +}
> > +
> > +#define SIDES			2
> > +
> >  /**
> >   * struct flow_common - Common fields for packet flows
> > + * @side[]:	Information for each side of the flow
> >   * @type:	Type of packet flow
> >   */
> >  struct flow_common {
> > +	struct flowside	side[SIDES];
> >  	uint8_t		type;
> >  };
> >  
> > diff --git a/passt.h b/passt.h
> > index bc58d64..3db0b8e 100644
> > --- a/passt.h
> > +++ b/passt.h
> > @@ -17,6 +17,9 @@ union epoll_ref;
> >  
> >  #include "pif.h"
> >  #include "packet.h"
> > +#include "siphash.h"
> > +#include "ip.h"
> > +#include "inany.h"
> >  #include "flow.h"
> >  #include "icmp.h"
> >  #include "fwd.h"
> > diff --git a/pif.h b/pif.h
> > index bd52936..ca85b34 100644
> > --- a/pif.h
> > +++ b/pif.h
> > @@ -38,7 +38,6 @@ static inline const char *pif_type(enum pif_type pt)
> >  		return "?";
> >  }
> >  
> > -/* cppcheck-suppress unusedFunction */
> >  static inline const char *pif_name(uint8_t pif)
> >  {
> >  	return pif_type(pif);
> > diff --git a/tcp_conn.h b/tcp_conn.h
> > index d280b22..1a07dd5 100644
> > --- a/tcp_conn.h
> > +++ b/tcp_conn.h
> > @@ -106,7 +106,6 @@ struct tcp_tap_conn {
> >  	uint32_t	seq_init_from_tap;
> >  };
> >  
> > -#define SIDES			2
> >  /**
> >   * struct tcp_splice_conn - Descriptor for a spliced TCP connection
> >   * @f:			Generic flow information
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v4 02/16] tcp: Maintain flowside information for "tap" connections
  2024-05-13 18:07   ` Stefano Brivio
@ 2024-05-14  0:15     ` David Gibson
  0 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2024-05-14  0:15 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 13539 bytes --]

On Mon, May 13, 2024 at 08:07:22PM +0200, Stefano Brivio wrote:
> On Fri,  3 May 2024 11:11:21 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > tcp_tap_conn has several fields to track addresses and ports as seen
> > by the guest/namespace.  We now have general fields for this in the
> > common flowside struct so use those instead of protocol specific
> > fields.  The flowside also has space for the guest side endpoint
> > address (local address from the guest's PoV) so we fill that in as
> > well.
> > 
> > We didn't previously store equivalent information for the connection
> > as it appears to the host; that was implicit in the state of the host
> > side socket.  For future generalisations of flow/connection tracking,
> > we're going to need that information, so populate the other flowside
> > in each flow table entry with as much of this information as we can
> > easily obtain.  For connections initiated by the guest that's the
> > endpoint address and port.  To get the forwarding address and port
> > we'd need to call getsockname() in general, so leave that blank for
> > now.  For connections initiated from outside, we also have the
> > endpoint address from accept().  We have the forwarding port from the
> > epoll ref, but we leave the forwarding address blank.
> > 
> > For now we just fill the information in without really using it for
> > anything.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  flow.h     |  1 -
> >  tcp.c      | 88 +++++++++++++++++++++++++++++++++++++-----------------
> >  tcp_conn.h |  8 -----
> >  3 files changed, 60 insertions(+), 37 deletions(-)
> > 
> > diff --git a/flow.h b/flow.h
> > index f7fb537..88caa76 100644
> > --- a/flow.h
> > +++ b/flow.h
> > @@ -85,7 +85,6 @@ static inline void flowside_from_inany(struct flowside *fside, uint8_t pif,
> >   * If NULL is given for either address, the appropriate unspecified/any address
> >   * for the address family is substituted.
> >   */
> > -/* cppcheck-suppress unusedFunction */
> >  static inline void flowside_from_af(struct flowside *fside,
> >  				    uint8_t pif, sa_family_t af,
> >  				    const void *faddr, in_port_t fport,
> > diff --git a/tcp.c b/tcp.c
> > index 21d0af0..1835b86 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -372,7 +372,7 @@
> >  #define OPT_SACK	5
> >  #define OPT_TS		8
> >  
> > -#define CONN_V4(conn)		(!!inany_v4(&(conn)->faddr))
> > +#define CONN_V4(conn)		(!!inany_v4(&conn->f.side[TAPSIDE].faddr))
> 
> ...which reminds me: I guess CONN_V4() and CONN_V6() should eventually
> go away, just like SPLICE_V6 in 7/16.

Yes.  I've thought about doing that, but haven't quite gotten there yet.

> >  #define CONN_V6(conn)		(!CONN_V4(conn))
> >  #define CONN_IS_CLOSING(conn)						\
> >  	((conn->events & ESTABLISHED) &&				\
> > @@ -795,10 +795,11 @@ static void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn,
> >   */
> >  static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn)
> >  {
> > +	const struct flowside *tapside = &conn->f.side[TAPSIDE];
> >  	int i;
> >  
> >  	for (i = 0; i < LOW_RTT_TABLE_SIZE; i++)
> > -		if (inany_equals(&conn->faddr, low_rtt_dst + i))
> > +		if (inany_equals(&tapside->faddr, low_rtt_dst + i))
> >  			return 1;
> >  
> >  	return 0;
> > @@ -813,6 +814,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
> >  			      const struct tcp_info *tinfo)
> >  {
> >  #ifdef HAS_MIN_RTT
> > +	const struct flowside *tapside = &conn->f.side[TAPSIDE];
> >  	int i, hole = -1;
> >  
> >  	if (!tinfo->tcpi_min_rtt ||
> > @@ -820,7 +822,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
> >  		return;
> >  
> >  	for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) {
> > -		if (inany_equals(&conn->faddr, low_rtt_dst + i))
> > +		if (inany_equals(&tapside->faddr, low_rtt_dst + i))
> >  			return;
> >  		if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i))
> >  			hole = i;
> > @@ -832,7 +834,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn,
> >  	if (hole == -1)
> >  		return;
> >  
> > -	low_rtt_dst[hole++] = conn->faddr;
> > +	low_rtt_dst[hole++] = tapside->faddr;
> >  	if (hole == LOW_RTT_TABLE_SIZE)
> >  		hole = 0;
> >  	inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any);
> > @@ -1085,8 +1087,10 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn,
> >  			  const union inany_addr *faddr,
> >  			  in_port_t eport, in_port_t fport)
> >  {
> > -	if (inany_equals(&conn->faddr, faddr) &&
> > -	    conn->eport == eport && conn->fport == fport)
> > +	const struct flowside *tapside = &conn->f.side[TAPSIDE];
> > +
> > +	if (inany_equals(&tapside->faddr, faddr) &&
> > +	    tapside->eport == eport && tapside->fport == fport)
> >  		return 1;
> >  
> >  	return 0;
> > @@ -1120,7 +1124,9 @@ static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr,
> >  static uint64_t tcp_conn_hash(const struct ctx *c,
> >  			      const struct tcp_tap_conn *conn)
> >  {
> > -	return tcp_hash(c, &conn->faddr, conn->eport, conn->fport);
> > +	const struct flowside *tapside = &conn->f.side[TAPSIDE];
> > +
> > +	return tcp_hash(c, &tapside->faddr, tapside->eport, tapside->fport);
> >  }
> >  
> >  /**
> > @@ -1302,10 +1308,12 @@ void tcp_defer_handler(struct ctx *c)
> >   * @seq:	Sequence number
> >   */
> >  static void tcp_fill_header(struct tcphdr *th,
> > -			       const struct tcp_tap_conn *conn, uint32_t seq)
> > +			    const struct tcp_tap_conn *conn, uint32_t seq)
> >  {
> > -	th->source = htons(conn->fport);
> > -	th->dest = htons(conn->eport);
> > +	const struct flowside *tapside = &conn->f.side[TAPSIDE];	
> > +
> > +	th->source = htons(tapside->fport);
> > +	th->dest = htons(tapside->eport);
> >  	th->seq = htonl(seq);
> >  	th->ack_seq = htonl(conn->seq_ack_to_tap);
> >  	if (conn->events & ESTABLISHED)	{
> > @@ -1337,7 +1345,8 @@ static size_t tcp_fill_headers4(const struct ctx *c,
> >  				size_t dlen, const uint16_t *check,
> >  				uint32_t seq)
> >  {
> > -	const struct in_addr *a4 = inany_v4(&conn->faddr);
> > +	const struct flowside *tapside = &conn->f.side[TAPSIDE];
> > +	const struct in_addr *a4 = inany_v4(&tapside->faddr);
> >  	size_t l4len = dlen + sizeof(*th);
> >  	size_t l3len = l4len + sizeof(*iph);
> >  
> > @@ -1379,10 +1388,11 @@ static size_t tcp_fill_headers6(const struct ctx *c,
> >  				struct ipv6hdr *ip6h, struct tcphdr *th,
> >  				size_t dlen, uint32_t seq)
> >  {
> > +	const struct flowside *tapside = &conn->f.side[TAPSIDE];
> >  	size_t l4len = dlen + sizeof(*th);
> >  
> >  	ip6h->payload_len = htons(l4len);
> > -	ip6h->saddr = conn->faddr.a6;
> > +	ip6h->saddr = tapside->faddr.a6;
> >  	if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr))
> >  		ip6h->daddr = c->ip6.addr_ll_seen;
> >  	else
> > @@ -1421,9 +1431,7 @@ static size_t tcp_l2_buf_fill_headers(const struct ctx *c,
> >  				      struct iovec *iov, size_t dlen,
> >  				      const uint16_t *check, uint32_t seq)
> >  {
> > -	const struct in_addr *a4 = inany_v4(&conn->faddr);
> > -
> > -	if (a4) {
> > +	if (CONN_V4(conn)) {
> >  		return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base,
> >  					 iov[TCP_IOV_IP].iov_base,
> >  					 iov[TCP_IOV_PAYLOAD].iov_base, dlen,
> > @@ -1738,7 +1746,7 @@ static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
> >  /**
> >   * tcp_seq_init() - Calculate initial sequence number according to RFC 6528
> >   * @c:		Execution context
> > - * @conn:	TCP connection, with faddr, fport and eport populated
> > + * @conn:	TCP connection, with tap flowside faddr, fport and eport
> >   * @now:	Current timestamp
> >   */
> >  static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
> > @@ -1746,6 +1754,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
> >  {
> >  	struct siphash_state state = SIPHASH_INIT(c->hash_secret);
> >  	union inany_addr aany;
> > +	const struct flowside *tapside = &conn->f.side[TAPSIDE];
> 
> One line up.

Already fixed in the latest equivalent code.

> >  	uint64_t hash;
> >  	uint32_t ns;
> >  
> > @@ -1754,10 +1763,10 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
> >  	else
> >  		inany_from_af(&aany, AF_INET6, &c->ip6.addr);
> >  
> > -	inany_siphash_feed(&state, &conn->faddr);
> > +	inany_siphash_feed(&state, &tapside->faddr);
> >  	inany_siphash_feed(&state, &aany);
> >  	hash = siphash_final(&state, 36,
> > -			     (uint64_t)conn->fport << 16 | conn->eport);
> > +			     (uint64_t)tapside->fport << 16 | tapside->eport);
> >  
> >  	/* 32ns ticks, overflows 32 bits every 137s */
> >  	ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
> > @@ -1945,6 +1954,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
> >  		.sin6_port = htons(dstport),
> >  		.sin6_addr = *(struct in6_addr *)daddr,
> >  	};
> > +	struct flowside *tapside, *sockside;
> >  	const struct sockaddr *sa;
> >  	struct tcp_tap_conn *conn;
> >  	union flow *flow;
> > @@ -1954,6 +1964,11 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
> >  	if (!(flow = flow_alloc()))
> >  		return;
> >  
> > +	tapside = &flow->f.side[TAPSIDE];
> > +	sockside = &flow->f.side[SOCKSIDE];
> > +
> > +	flowside_from_af(tapside, PIF_TAP, af, daddr, dstport, saddr, srcport);
> > +
> >  	if (af == AF_INET) {
> >  		if (IN4_IS_ADDR_UNSPECIFIED(saddr) ||
> >  		    IN4_IS_ADDR_BROADCAST(saddr) ||
> > @@ -2026,19 +2041,19 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
> >  	if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap)))
> >  		conn->wnd_from_tap = 1;
> >  
> > -	inany_from_af(&conn->faddr, af, daddr);
> > +	sockside->pif = PIF_HOST;
> > +	sockside->eport = dstport;
> >  
> >  	if (af == AF_INET) {
> > +		inany_from_af(&sockside->eaddr, AF_INET, &addr4.sin_addr);
> >  		sa = (struct sockaddr *)&addr4;
> >  		sl = sizeof(addr4);
> >  	} else {
> > +		inany_from_af(&sockside->eaddr, AF_INET6, &addr6.sin6_addr);
> >  		sa = (struct sockaddr *)&addr6;
> >  		sl = sizeof(addr6);
> >  	}
> >  
> > -	conn->fport = dstport;
> > -	conn->eport = srcport;
> > -
> >  	conn->seq_init_from_tap = ntohl(th->seq);
> >  	conn->seq_from_tap = conn->seq_init_from_tap + 1;
> >  	conn->seq_ack_to_tap = conn->seq_from_tap;
> > @@ -2724,18 +2739,35 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
> >  				   const union sockaddr_inany *sa,
> >  				   const struct timespec *now)
> >  {
> > -	struct tcp_tap_conn *conn = FLOW_START(flow, FLOW_TCP, tcp, SOCKSIDE);
> > +	struct flowside *sockside = &flow->f.side[SOCKSIDE];
> > +	struct flowside *tapside = &flow->f.side[TAPSIDE];
> > +	struct tcp_tap_conn *conn;
> > +
> > +	sockside->pif = PIF_HOST;
> > +	inany_from_sockaddr(&sockside->eaddr, &sockside->eport, sa);
> > +	sockside->fport = dstport;
> > +
> > +	tapside->pif = PIF_TAP;
> > +	tapside->faddr = sockside->eaddr;
> > +	tapside->fport = sockside->eport;
> > +	tcp_snat_inbound(c, &tapside->faddr);
> > +	if (CONN_V4(flow)) {
> > +		inany_from_af(&tapside->eaddr, AF_INET, &c->ip4.addr_seen);
> > +	} else {
> > +		if (IN6_IS_ADDR_LINKLOCAL(&tapside->faddr.a6))
> > +			tapside->eaddr.a6 = c->ip6.addr_ll_seen;
> > +		else
> > +			tapside->eaddr.a6 = c->ip6.addr_seen;
> > +	}
> > +	tapside->eport = dstport + c->tcp.fwd_in.delta[dstport];
> 
> Pre-existing, but I wonder: doesn't this port translation also belong
> to tcp_snat_inbound()?

Not really, because "snat" here is for "source nat".  But in any case
both are subsumed into common NAT functions later in the series.

> > +
> > +	conn = FLOW_START(flow, FLOW_TCP, tcp, SOCKSIDE);
> >  
> >  	conn->sock = s;
> >  	conn->timer = -1;
> >  	conn->ws_to_tap = conn->ws_from_tap = 0;
> >  	conn_event(c, conn, SOCK_ACCEPTED);
> >  
> > -	inany_from_sockaddr(&conn->faddr, &conn->fport, sa);
> > -	conn->eport = dstport + c->tcp.fwd_in.delta[dstport];
> > -
> > -	tcp_snat_inbound(c, &conn->faddr);
> > -
> >  	tcp_seq_init(c, conn, now);
> >  	tcp_hash_insert(c, conn);
> >  
> > diff --git a/tcp_conn.h b/tcp_conn.h
> > index 1a07dd5..f55f144 100644
> > --- a/tcp_conn.h
> > +++ b/tcp_conn.h
> > @@ -23,9 +23,6 @@
> >   * @ws_to_tap:		Window scaling factor advertised to tap/guest
> >   * @sndbuf:		Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS
> >   * @seq_dup_ack_approx:	Last duplicate ACK number sent to tap
> > - * @faddr:		Guest side forwarding address (guest's remote address)
> > - * @eport:		Guest side endpoint port (guest's local port)
> > - * @fport:		Guest side forwarding port (guest's remote port)
> >   * @wnd_from_tap:	Last window size from tap, unscaled (as received)
> >   * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
> >   * @seq_to_tap:		Next sequence for packets to tap
> > @@ -91,11 +88,6 @@ struct tcp_tap_conn {
> >  
> >  	uint8_t		seq_dup_ack_approx;
> >  
> > -
> > -	union inany_addr faddr;
> > -	in_port_t	eport;
> > -	in_port_t	fport;
> > -
> >  	uint16_t	wnd_from_tap;
> >  	uint16_t	wnd_to_tap;
> >  
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v4 04/16] tcp: Obtain guest address from flowside
  2024-05-13 18:07   ` Stefano Brivio
@ 2024-05-14  0:18     ` David Gibson
  0 siblings, 0 replies; 23+ messages in thread
From: David Gibson @ 2024-05-14  0:18 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 1287 bytes --]

On Mon, May 13, 2024 at 08:07:43PM +0200, Stefano Brivio wrote:
> On Fri,  3 May 2024 11:11:23 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > Currently we always deliver inbound TCP packets to the guest's most
> > recent observed IP address.  This has the odd side effect that if the
> > guest changes its IP address with active TCP connections we might
> > deliver packets from old connections to the new address.  That won't
> > work; it will will probably result in an RST from the guest.  Worse,
> 
> s/will will/will/

Fixed.

> ...if I recall correctly, that was actually working, as long as we
> don't swap link-local with global unicast addresses (hence those
> conditions sprinkled all over the place).

Um.. I don't see how that's possible.  Linux - and I imagine any peer
- will index TCP connections by both endpoint addresses, so if we
deliver packets from one connection to a different address, the peer
won't recognize them as belonging to the old connection.

> But it doesn't matter in any case, this is surely the way forward.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2024-05-14  0:18 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-03  1:11 [PATCH v4 00/16] RFC: Unified flow table David Gibson
2024-05-03  1:11 ` [PATCH v4 01/16] flow: Common data structures for tracking flow addresses David Gibson
2024-05-13 18:07   ` Stefano Brivio
2024-05-14  0:11     ` David Gibson
2024-05-03  1:11 ` [PATCH v4 02/16] tcp: Maintain flowside information for "tap" connections David Gibson
2024-05-13 18:07   ` Stefano Brivio
2024-05-14  0:15     ` David Gibson
2024-05-03  1:11 ` [PATCH v4 03/16] tcp_splice: Maintain flowside information for spliced connections David Gibson
2024-05-03  1:11 ` [PATCH v4 04/16] tcp: Obtain guest address from flowside David Gibson
2024-05-13 18:07   ` Stefano Brivio
2024-05-14  0:18     ` David Gibson
2024-05-03  1:11 ` [PATCH v4 05/16] tcp: Simplify endpoint validation using flowside information David Gibson
2024-05-03  1:11 ` [PATCH v4 06/16] tcp, tcp_splice: Construct sockaddrs for connect() from flowside David Gibson
2024-05-03  1:11 ` [PATCH v4 07/16] tcp_splice: Eliminate SPLICE_V6 flag David Gibson
2024-05-03  1:11 ` [PATCH v4 08/16] tcp, flow: Replace TCP specific hash function with general flow hash David Gibson
2024-05-03  1:11 ` [PATCH v4 09/16] flow, tcp: Generalise TCP hash table to general flow hash table David Gibson
2024-05-03  1:11 ` [PATCH v4 10/16] tcp: Re-use flow hash for initial sequence number generation David Gibson
2024-05-03  1:11 ` [PATCH v4 11/16] icmp: Populate flowside information David Gibson
2024-05-03  1:11 ` [PATCH v4 12/16] icmp: Use flowsides as the source of truth wherever possible David Gibson
2024-05-03  1:11 ` [PATCH v4 13/16] icmp: Look up ping flows using flow hash David Gibson
2024-05-03  1:11 ` [PATCH v4 14/16] icmp: Eliminate icmp_id_map David Gibson
2024-05-03  1:11 ` [PATCH v4 15/16] flow, tcp: flow based NAT and port forwarding for TCP David Gibson
2024-05-03  1:11 ` [PATCH v4 16/16] flow, icmp: Use general flow forwarding rules for ICMP David Gibson

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).