* [PATCH v2 00/22] More flow table preliminaries: address handling improvements
@ 2024-02-06 1:17 David Gibson
2024-02-06 1:17 ` [PATCH v2 01/22] treewide: Use sa_family_t for address family variables David Gibson
` (22 more replies)
0 siblings, 23 replies; 38+ messages in thread
From: David Gibson @ 2024-02-06 1:17 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
Here's another batch of cleanups and tweaks in preparation for the
flow table. This set focuses on improved helpers for handling
addresses, particularly in the TCP splice path.
Changes since v1:
* Rebased, and reordered in a way I hope is clearer
* Add patch to rename port_fwd.[ch]
* Added doc comments to clarify flow life cycle
* Added uniform logging of flow start / end to match that lifecycle
* union inany_addr typed special address constants
* inany based tests for unspecified and multicast addresses, as well
as loopback
* Dropped patch allowing NULL parameter to inany_from_af(), turned
out not to be that useful
* Dropped sockaddr_any_init function, turned out not to be very
useful in that form
* Added patch enforcing no loopback addresses on tap interface
* Added logic to sanity check TCP endpoint addresses
* Moved socket creation into tcp_splice_connect()
* Moved epoll ref parsing into tcp_listen_handler()
* Allowed IN4_IS_*() helpers to work on void * addresses
David Gibson (22):
treewide: Use sa_family_t for address family variables
inany: Helper to test for various address types
inany: Add inany_ntop() helper
inany: Provide more conveniently typed constants for special addresses
inany: Introduce union sockaddr_inany
util: Allow IN4_IS_* macros to operate on untyped addresses
tcp, udp: Don't precompute port remappings in epoll references
flow: Add helper to determine a flow's protocol
tcp_splice: Simplify clean up logic
tcp_splice: Don't use flow_trace() before setting flow type
flow: Clarify flow entry life cycle, introduce uniform logging
tcp, tcp_splice: Helpers for getting sockets from the pools
tcp_splice: More specific variable names in new splice path
tcp_splice: Merge tcp_splice_new() into its caller
tcp_splice: Make tcp_splice_connect() create its own sockets
tcp_splice: Improve error reporting on connect path
tcp_splice: Improve logic deciding when to splice
tcp, tcp_splice: Parse listening socket epoll ref in
tcp_listen_handler()
tcp: Validate TCP endpoint addresses
tap: Disallow loopback addresses on tap interface
port_fwd: Fix copypasta error in port_fwd_scan_udp() comments
fwd: Rename port_fwd.[ch] and their contents
Makefile | 14 +--
conf.c | 8 +-
flow.c | 83 +++++++++++++++++-
flow.h | 9 ++
port_fwd.c => fwd.c | 32 +++----
port_fwd.h => fwd.h | 24 +++---
icmp.c | 24 ++----
icmp.h | 4 +-
inany.c | 50 +++++++++++
inany.h | 99 ++++++++++++++++++---
passt.h | 2 +-
tap.c | 19 ++++
tcp.c | 158 ++++++++++++++++++++++++---------
tcp.h | 8 +-
tcp_conn.h | 4 +-
tcp_splice.c | 206 +++++++++++++++++++++++++-------------------
tcp_splice.h | 7 +-
udp.c | 34 ++++----
udp.h | 12 +--
util.c | 2 +-
util.h | 10 +--
21 files changed, 569 insertions(+), 240 deletions(-)
rename port_fwd.c => fwd.c (83%)
rename port_fwd.h => fwd.h (62%)
create mode 100644 inany.c
--
2.43.0
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH v2 01/22] treewide: Use sa_family_t for address family variables
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
@ 2024-02-06 1:17 ` David Gibson
2024-02-06 1:17 ` [PATCH v2 02/22] inany: Helper to test for various address types David Gibson
` (21 subsequent siblings)
22 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2024-02-06 1:17 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
Sometimes we use sa_family_t for variables and parameters containing a
socket address family, other times we use a plain int. Since sa_family_t
is what's actually used in struct sockaddr and friends, standardise on
that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
# Conflicts:
# tcp_conn.h
---
icmp.c | 6 +++---
icmp.h | 4 ++--
inany.h | 3 ++-
tcp.c | 12 ++++++------
tcp.h | 2 +-
tcp_conn.h | 2 +-
tcp_splice.c | 2 +-
udp.c | 2 +-
udp.h | 2 +-
util.c | 2 +-
util.h | 2 +-
11 files changed, 20 insertions(+), 19 deletions(-)
diff --git a/icmp.c b/icmp.c
index 9434fc5a..faa38c81 100644
--- a/icmp.c
+++ b/icmp.c
@@ -62,7 +62,7 @@ static struct icmp_id_sock icmp_id_map[IP_VERSIONS][ICMP_NUM_IDS];
* @af: Address family (AF_INET or AF_INET6)
* @ref: epoll reference
*/
-void icmp_sock_handler(const struct ctx *c, int af, union epoll_ref ref)
+void icmp_sock_handler(const struct ctx *c, sa_family_t af, union epoll_ref ref)
{
struct icmp_id_sock *const id_sock = af == AF_INET
? &icmp_id_map[V4][ref.icmp.id] : &icmp_id_map[V6][ref.icmp.id];
@@ -156,7 +156,7 @@ static void icmp_ping_close(const struct ctx *c, struct icmp_id_sock *id_sock)
* Return: Newly opened ping socket fd, or -1 on failure
*/
static int icmp_ping_new(const struct ctx *c, struct icmp_id_sock *id_sock,
- int af, uint16_t id)
+ sa_family_t af, uint16_t id)
{
uint8_t proto = af == AF_INET ? IPPROTO_ICMP : IPPROTO_ICMPV6;
const char *const pname = af == AF_INET ? "ICMP" : "ICMPv6";
@@ -209,7 +209,7 @@ cancel:
*
* Return: count of consumed packets (always 1, even if malformed)
*/
-int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af,
+int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
const void *saddr, const void *daddr,
const struct pool *p, const struct timespec *now)
{
diff --git a/icmp.h b/icmp.h
index 00835971..263101db 100644
--- a/icmp.h
+++ b/icmp.h
@@ -10,8 +10,8 @@
struct ctx;
-void icmp_sock_handler(const struct ctx *c, int af, union epoll_ref ref);
-int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af,
+void icmp_sock_handler(const struct ctx *c, sa_family_t af, union epoll_ref ref);
+int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
const void *saddr, const void *daddr,
const struct pool *p, const struct timespec *now);
void icmp_timer(const struct ctx *c, const struct timespec *now);
diff --git a/inany.h b/inany.h
index 90e98f16..fe652ff7 100644
--- a/inany.h
+++ b/inany.h
@@ -60,7 +60,8 @@ static inline bool inany_equals(const union inany_addr *a,
* @af: Address family of @addr
* @addr: struct in_addr (IPv4) or struct in6_addr (IPv6)
*/
-static inline void inany_from_af(union inany_addr *aa, int af, const void *addr)
+static inline void inany_from_af(union inany_addr *aa,
+ sa_family_t af, const void *addr)
{
if (af == AF_INET6) {
aa->a6 = *((struct in6_addr *)addr);
diff --git a/tcp.c b/tcp.c
index 905d26f6..fdf56713 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1262,7 +1262,7 @@ static void tcp_hash_remove(const struct ctx *c,
* Return: connection pointer, if found, -ENOENT otherwise
*/
static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c,
- int af, const void *faddr,
+ sa_family_t af, const void *faddr,
in_port_t eport, in_port_t fport)
{
union inany_addr aany;
@@ -1904,8 +1904,8 @@ static void tcp_bind_outbound(const struct ctx *c, int s, sa_family_t af)
* @optlen: Bytes in options: caller MUST ensure available length
* @now: Current timestamp
*/
-static void tcp_conn_from_tap(struct ctx *c,
- int af, const void *saddr, const void *daddr,
+static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
+ const void *saddr, const void *daddr,
const struct tcphdr *th, const char *opts,
size_t optlen, const struct timespec *now)
{
@@ -2480,7 +2480,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
*
* Return: count of consumed packets
*/
-int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
+int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
const void *saddr, const void *daddr,
const struct pool *p, int idx, const struct timespec *now)
{
@@ -2857,7 +2857,7 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events)
*
* Return: fd for the new listening socket, negative error code on failure
*/
-static int tcp_sock_init_af(const struct ctx *c, int af, in_port_t port,
+static int tcp_sock_init_af(const struct ctx *c, sa_family_t af, in_port_t port,
const void *addr, const char *ifname)
{
union tcp_listen_epoll_ref tref = {
@@ -3009,7 +3009,7 @@ static int tcp_ns_socks_init(void *arg)
* @pool: Pool of sockets to refill
* @af: Address family to use
*/
-void tcp_sock_refill_pool(const struct ctx *c, int pool[], int af)
+void tcp_sock_refill_pool(const struct ctx *c, int pool[], sa_family_t af)
{
int i;
diff --git a/tcp.h b/tcp.h
index b9f546d3..875006ed 100644
--- a/tcp.h
+++ b/tcp.h
@@ -14,7 +14,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref);
void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
const struct timespec *now);
void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events);
-int tcp_tap_handler(struct ctx *c, uint8_t pif, int af,
+int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
const void *saddr, const void *daddr,
const struct pool *p, int idx, const struct timespec *now);
int tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
diff --git a/tcp_conn.h b/tcp_conn.h
index a5f5cfe0..20c7cb8b 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -160,7 +160,7 @@ bool tcp_splice_flow_defer(union flow *flow);
void tcp_splice_timer(const struct ctx *c, union flow *flow);
int tcp_conn_pool_sock(int pool[]);
int tcp_conn_new_sock(const struct ctx *c, sa_family_t af);
-void tcp_sock_refill_pool(const struct ctx *c, int pool[], int af);
+void tcp_sock_refill_pool(const struct ctx *c, int pool[], sa_family_t af);
void tcp_splice_refill(const struct ctx *c);
#endif /* TCP_CONN_H */
diff --git a/tcp_splice.c b/tcp_splice.c
index 26d32065..cc9745e8 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -399,7 +399,7 @@ static int tcp_splice_new(const struct ctx *c, struct tcp_splice_conn *conn,
*/
if (pif == PIF_SPLICE) {
int *p = CONN_V6(conn) ? init_sock_pool6 : init_sock_pool4;
- int af = CONN_V6(conn) ? AF_INET6 : AF_INET;
+ sa_family_t af = CONN_V6(conn) ? AF_INET6 : AF_INET;
s = tcp_conn_pool_sock(p);
if (s < 0)
diff --git a/udp.c b/udp.c
index b5b8f8a7..c839e269 100644
--- a/udp.c
+++ b/udp.c
@@ -814,7 +814,7 @@ void udp_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
* #syscalls sendmmsg
*/
int udp_tap_handler(struct ctx *c, uint8_t pif,
- int af, const void *saddr, const void *daddr,
+ sa_family_t af, const void *saddr, const void *daddr,
const struct pool *p, int idx, const struct timespec *now)
{
struct mmsghdr mm[UIO_MAXIOV];
diff --git a/udp.h b/udp.h
index 087e4820..f3d5d6d2 100644
--- a/udp.h
+++ b/udp.h
@@ -11,7 +11,7 @@
void udp_portmap_clear(void);
void udp_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
const struct timespec *now);
-int udp_tap_handler(struct ctx *c, uint8_t pif, int af,
+int udp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
const void *saddr, const void *daddr,
const struct pool *p, int idx, const struct timespec *now);
int udp_sock_init(const struct ctx *c, int ns, sa_family_t af,
diff --git a/util.c b/util.c
index 21b35ff9..8acce233 100644
--- a/util.c
+++ b/util.c
@@ -97,7 +97,7 @@ found:
*
* Return: newly created socket, negative error code on failure
*/
-int sock_l4(const struct ctx *c, int af, uint8_t proto,
+int sock_l4(const struct ctx *c, sa_family_t af, uint8_t proto,
const void *bind_addr, const char *ifname, uint16_t port,
uint32_t data)
{
diff --git a/util.h b/util.h
index d2320f8c..e0df26c6 100644
--- a/util.h
+++ b/util.h
@@ -212,7 +212,7 @@ struct ipv6_opt_hdr {
__attribute__ ((weak)) int ffsl(long int i) { return __builtin_ffsl(i); }
char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
size_t *dlen);
-int sock_l4(const struct ctx *c, int af, uint8_t proto,
+int sock_l4(const struct ctx *c, sa_family_t af, uint8_t proto,
const void *bind_addr, const char *ifname, uint16_t port,
uint32_t data);
void sock_probe_mem(struct ctx *c);
--
@@ -212,7 +212,7 @@ struct ipv6_opt_hdr {
__attribute__ ((weak)) int ffsl(long int i) { return __builtin_ffsl(i); }
char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
size_t *dlen);
-int sock_l4(const struct ctx *c, int af, uint8_t proto,
+int sock_l4(const struct ctx *c, sa_family_t af, uint8_t proto,
const void *bind_addr, const char *ifname, uint16_t port,
uint32_t data);
void sock_probe_mem(struct ctx *c);
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 02/22] inany: Helper to test for various address types
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
2024-02-06 1:17 ` [PATCH v2 01/22] treewide: Use sa_family_t for address family variables David Gibson
@ 2024-02-06 1:17 ` David Gibson
2024-02-18 20:58 ` Stefano Brivio
2024-02-06 1:17 ` [PATCH v2 03/22] inany: Add inany_ntop() helper David Gibson
` (20 subsequent siblings)
22 siblings, 1 reply; 38+ messages in thread
From: David Gibson @ 2024-02-06 1:17 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
Add helpers to determine if an inany is loopback, unspecified or multicast,
regardless of whether it's a "true" IPv6 address or an IPv4 address
represented as v4-mapped.
Sometimes we need to know if an inany is a loopback address, unspecified or
some other particular kind of address, but we don't really care if it is
IPv4.
Use the loopback helper to simplify tcp_splice_conn_from_sock() slightly.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
inany.h | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
tcp_splice.c | 15 +++------------
2 files changed, 53 insertions(+), 12 deletions(-)
diff --git a/inany.h b/inany.h
index fe652ff7..2058f145 100644
--- a/inany.h
+++ b/inany.h
@@ -55,6 +55,56 @@ static inline bool inany_equals(const union inany_addr *a,
return IN6_ARE_ADDR_EQUAL(&a->a6, &b->a6);
}
+/** inany_is_loopback - Check if address is loopback
+ * @a: IPv[46] address
+ *
+ * Return: true if @a is either ::1 or in 127.0.0.1/8
+ */
+static inline bool inany_is_loopback(const union inany_addr *a)
+{
+ const struct in_addr *v4 = inany_v4(a);
+
+ return IN6_IS_ADDR_LOOPBACK(&a->a6) || (v4 && IN4_IS_ADDR_LOOPBACK(v4));
+}
+
+/** inany_is_unspecified - Check if address is unspecified
+ * @a: IPv[46] address
+ *
+ * Return: true if @a is either :: or 0.0.0.0
+ */
+static inline bool inany_is_unspecified(const union inany_addr *a)
+{
+ const struct in_addr *v4 = inany_v4(a);
+
+ return IN6_IS_ADDR_UNSPECIFIED(&a->a6) ||
+ (v4 && IN4_IS_ADDR_UNSPECIFIED(v4));
+}
+
+/** inany_is_multicast - Check if address is multicast or broadcast
+ * @a: IPv[46] address
+ *
+ * Return: true if @a is IPv6 multicast or IPv4 multicast or broadcast
+ */
+static inline bool inany_is_multicast(const union inany_addr *a)
+{
+ const struct in_addr *v4 = inany_v4(a);
+
+ return IN6_IS_ADDR_MULTICAST(&a->a6) ||
+ (v4 && (IN4_IS_ADDR_MULTICAST(v4) ||
+ IN4_IS_ADDR_BROADCAST(v4)));
+}
+
+/** inany_is_unicast - Check if address is specified and unicast
+ * @a: IPv[46] address
+ *
+ * Return: true if @a is specified and a unicast address
+ */
+/* cppcheck-suppress unusedFunction */
+static inline bool inany_is_unicast(const union inany_addr *a)
+{
+ return !inany_is_unspecified(a) && !inany_is_multicast(a);
+}
+
/** inany_from_af - Set IPv[46] address from IPv4 or IPv6 address
* @aa: Pointer to store IPv[46] address
* @af: Address family of @addr
diff --git a/tcp_splice.c b/tcp_splice.c
index cc9745e8..4ecc178b 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -440,29 +440,20 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
struct tcp_splice_conn *conn, int s,
const struct sockaddr *sa)
{
- const struct in_addr *a4;
union inany_addr aany;
in_port_t port;
ASSERT(c->mode == MODE_PASTA);
inany_from_sockaddr(&aany, &port, sa);
- a4 = inany_v4(&aany);
-
- if (a4) {
- if (!IN4_IS_ADDR_LOOPBACK(a4))
- return false;
- conn->flags = 0;
- } else {
- if (!IN6_IS_ADDR_LOOPBACK(&aany.a6))
- return false;
- conn->flags = SPLICE_V6;
- }
+ if (!inany_is_loopback(&aany))
+ return false;
if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int)))
flow_trace(conn, "failed to set TCP_QUICKACK on %i", s);
conn->f.type = FLOW_TCP_SPLICE;
+ conn->flags = inany_v4(&aany) ? 0 : SPLICE_V6;
conn->s[0] = s;
if (tcp_splice_new(c, conn, ref.port, ref.pif))
--
@@ -440,29 +440,20 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
struct tcp_splice_conn *conn, int s,
const struct sockaddr *sa)
{
- const struct in_addr *a4;
union inany_addr aany;
in_port_t port;
ASSERT(c->mode == MODE_PASTA);
inany_from_sockaddr(&aany, &port, sa);
- a4 = inany_v4(&aany);
-
- if (a4) {
- if (!IN4_IS_ADDR_LOOPBACK(a4))
- return false;
- conn->flags = 0;
- } else {
- if (!IN6_IS_ADDR_LOOPBACK(&aany.a6))
- return false;
- conn->flags = SPLICE_V6;
- }
+ if (!inany_is_loopback(&aany))
+ return false;
if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int)))
flow_trace(conn, "failed to set TCP_QUICKACK on %i", s);
conn->f.type = FLOW_TCP_SPLICE;
+ conn->flags = inany_v4(&aany) ? 0 : SPLICE_V6;
conn->s[0] = s;
if (tcp_splice_new(c, conn, ref.port, ref.pif))
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 03/22] inany: Add inany_ntop() helper
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
2024-02-06 1:17 ` [PATCH v2 01/22] treewide: Use sa_family_t for address family variables David Gibson
2024-02-06 1:17 ` [PATCH v2 02/22] inany: Helper to test for various address types David Gibson
@ 2024-02-06 1:17 ` David Gibson
2024-02-06 1:17 ` [PATCH v2 04/22] inany: Provide more conveniently typed constants for special addresses David Gibson
` (19 subsequent siblings)
22 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2024-02-06 1:17 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
Add this helper to format an inany into either IPv4 or IPv6 text format as
appropriate.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
Makefile | 6 +++---
inany.c | 35 +++++++++++++++++++++++++++++++++++
inany.h | 4 ++++
3 files changed, 42 insertions(+), 3 deletions(-)
create mode 100644 inany.c
diff --git a/Makefile b/Makefile
index af4fa87e..1c709229 100644
--- a/Makefile
+++ b/Makefile
@@ -45,9 +45,9 @@ FLAGS += -DVERSION=\"$(VERSION)\"
FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c icmp.c \
- igmp.c isolation.c lineread.c log.c mld.c ndp.c netlink.c packet.c \
- passt.c pasta.c pcap.c pif.c port_fwd.c tap.c tcp.c tcp_splice.c udp.c \
- util.c
+ igmp.c inany.c isolation.c lineread.c log.c mld.c ndp.c netlink.c \
+ packet.c passt.c pasta.c pcap.c pif.c port_fwd.c tap.c tcp.c \
+ tcp_splice.c udp.c util.c
QRAP_SRCS = qrap.c
SRCS = $(PASST_SRCS) $(QRAP_SRCS)
diff --git a/inany.c b/inany.c
new file mode 100644
index 00000000..edf0b055
--- /dev/null
+++ b/inany.c
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright Red Hat
+ * Author: David Gibson <david@gibson.dropbear.id.au>
+ *
+ * inany.c - Types and helpers for handling addresses which could be
+ * IPv6 or IPv4 (encoded as IPv4-mapped IPv6 addresses)
+ */
+
+#include <stdlib.h>
+#include <stdbool.h>
+#include <assert.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+
+#include "util.h"
+#include "siphash.h"
+#include "inany.h"
+
+/** inany_ntop - Convert an IPv[46] address to text format
+ * @src: IPv[46] address
+ * @dst: output buffer, minimum INANY_ADDRSTRLEN bytes
+ * @size: size of buffer at @dst
+ *
+ * Return: On success, a non-null pointer to @dst, NULL on failure
+ */
+/* cppcheck-suppress unusedFunction */
+const char *inany_ntop(const union inany_addr *src, char *dst, socklen_t size)
+{
+ const struct in_addr *v4 = inany_v4(src);
+
+ if (v4)
+ return inet_ntop(AF_INET, v4, dst, size);
+
+ return inet_ntop(AF_INET6, &src->a6, dst, size);
+}
diff --git a/inany.h b/inany.h
index 2058f145..bf731166 100644
--- a/inany.h
+++ b/inany.h
@@ -162,4 +162,8 @@ static inline void inany_siphash_feed(struct siphash_state *state,
siphash_feed(state, (uint64_t)aa->u32[2] << 32 | aa->u32[3]);
}
+#define INANY_ADDRSTRLEN MAX(INET_ADDRSTRLEN, INET6_ADDRSTRLEN)
+
+const char *inany_ntop(const union inany_addr *src, char *dst, socklen_t size);
+
#endif /* INANY_H */
--
@@ -162,4 +162,8 @@ static inline void inany_siphash_feed(struct siphash_state *state,
siphash_feed(state, (uint64_t)aa->u32[2] << 32 | aa->u32[3]);
}
+#define INANY_ADDRSTRLEN MAX(INET_ADDRSTRLEN, INET6_ADDRSTRLEN)
+
+const char *inany_ntop(const union inany_addr *src, char *dst, socklen_t size);
+
#endif /* INANY_H */
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 04/22] inany: Provide more conveniently typed constants for special addresses
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
` (2 preceding siblings ...)
2024-02-06 1:17 ` [PATCH v2 03/22] inany: Add inany_ntop() helper David Gibson
@ 2024-02-06 1:17 ` David Gibson
2024-02-06 1:17 ` [PATCH v2 05/22] inany: Introduce union sockaddr_inany David Gibson
` (18 subsequent siblings)
22 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2024-02-06 1:17 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
Our inany_addr type is used in some places to represent either IPv4 or
IPv6 addresses, and we plan to use it more widely. We don't yet
provide constants of this type for special addresses (loopback and
"any"). Add some of these, both the IPv4 and IPv6 variants of those
addresses, but typed as union inany_addr.
To avoid actually adding more things to .data we can use some macros and
casting to overlay the IPv6 versions of these with the standard library's
in6addr_loopback and in6addr_any. For the IPv4 versions we need to create
new constant globals.
For complicated historical reasons, the standard library doesn't provide
constants for IPv4 loopback and any addresses as struct in_addr. It just
has macros of type in_addr_t == uint32_t, which has confusing semantics
w.r.t. endianness. We can use some more macros to address this lack,
using macros to effectively create these IPv4 constants as pieces of the
inany constants above.
We use this last to avoid some awkward temporary variables just used to
get an address of an IPv4 loopback address.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
inany.c | 16 ++++++++++++++++
inany.h | 9 +++++++++
tcp.c | 4 ++--
udp.c | 8 +++++---
4 files changed, 32 insertions(+), 5 deletions(-)
diff --git a/inany.c b/inany.c
index edf0b055..c11e2aa9 100644
--- a/inany.c
+++ b/inany.c
@@ -16,6 +16,22 @@
#include "siphash.h"
#include "inany.h"
+const union inany_addr inany_loopback4 = {
+ .v4mapped = {
+ .zero = { 0 },
+ .one = { 0xff, 0xff, },
+ .a4 = IN4ADDR_LOOPBACK_INIT,
+ },
+};
+
+const union inany_addr inany_any4 = {
+ .v4mapped = {
+ .zero = { 0 },
+ .one = { 0xff, 0xff, },
+ .a4 = IN4ADDR_ANY_INIT,
+ },
+};
+
/** inany_ntop - Convert an IPv[46] address to text format
* @src: IPv[46] address
* @dst: output buffer, minimum INANY_ADDRSTRLEN bytes
diff --git a/inany.h b/inany.h
index bf731166..4c4b36c3 100644
--- a/inany.h
+++ b/inany.h
@@ -32,6 +32,15 @@ static_assert(sizeof(union inany_addr) == sizeof(struct in6_addr),
static_assert(_Alignof(union inany_addr) == _Alignof(uint32_t),
"union inany_addr has unexpected alignment");
+#define inany_loopback6 (*(const union inany_addr *)(&in6addr_loopback))
+extern const union inany_addr inany_loopback4;
+
+#define inany_any6 (*(const union inany_addr *)(&in6addr_any))
+extern const union inany_addr inany_any4;
+
+#define in4addr_loopback (inany_loopback4.v4mapped.a4)
+#define in4addr_any (inany_any4.v4mapped.a4)
+
/** inany_v4 - Extract IPv4 address, if present, from IPv[46] address
* @addr: IPv4 or IPv6 address
*
diff --git a/tcp.c b/tcp.c
index fdf56713..79b7b4d5 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2926,12 +2926,12 @@ static void tcp_ns_sock_init4(const struct ctx *c, in_port_t port)
.port = port + c->tcp.fwd_out.delta[port],
.pif = PIF_SPLICE,
};
- struct in_addr loopback = IN4ADDR_LOOPBACK_INIT;
int s;
ASSERT(c->mode == MODE_PASTA);
- s = sock_l4(c, AF_INET, IPPROTO_TCP, &loopback, NULL, port, tref.u32);
+ s = sock_l4(c, AF_INET, IPPROTO_TCP, &in4addr_loopback, NULL, port,
+ tref.u32);
if (s >= 0)
tcp_sock_set_bufsize(c, s);
else
diff --git a/udp.c b/udp.c
index c839e269..f2be0080 100644
--- a/udp.c
+++ b/udp.c
@@ -96,6 +96,7 @@
#include <stdio.h>
#include <errno.h>
#include <limits.h>
+#include <assert.h>
#include <net/ethernet.h>
#include <net/if.h>
#include <netinet/in.h>
@@ -112,6 +113,8 @@
#include "checksum.h"
#include "util.h"
+#include "siphash.h"
+#include "inany.h"
#include "passt.h"
#include "tap.h"
#include "pcap.h"
@@ -1010,9 +1013,8 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af,
udp_tap_map[V4][uref.port].sock = s < 0 ? -1 : s;
udp_splice_init[V4][port].sock = s < 0 ? -1 : s;
} else {
- struct in_addr loopback = IN4ADDR_LOOPBACK_INIT;
-
- r4 = s = sock_l4(c, AF_INET, IPPROTO_UDP, &loopback,
+ r4 = s = sock_l4(c, AF_INET, IPPROTO_UDP,
+ &in4addr_loopback,
ifname, port, uref.u32);
udp_splice_ns[V4][port].sock = s < 0 ? -1 : s;
}
--
@@ -96,6 +96,7 @@
#include <stdio.h>
#include <errno.h>
#include <limits.h>
+#include <assert.h>
#include <net/ethernet.h>
#include <net/if.h>
#include <netinet/in.h>
@@ -112,6 +113,8 @@
#include "checksum.h"
#include "util.h"
+#include "siphash.h"
+#include "inany.h"
#include "passt.h"
#include "tap.h"
#include "pcap.h"
@@ -1010,9 +1013,8 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af,
udp_tap_map[V4][uref.port].sock = s < 0 ? -1 : s;
udp_splice_init[V4][port].sock = s < 0 ? -1 : s;
} else {
- struct in_addr loopback = IN4ADDR_LOOPBACK_INIT;
-
- r4 = s = sock_l4(c, AF_INET, IPPROTO_UDP, &loopback,
+ r4 = s = sock_l4(c, AF_INET, IPPROTO_UDP,
+ &in4addr_loopback,
ifname, port, uref.u32);
udp_splice_ns[V4][port].sock = s < 0 ? -1 : s;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 05/22] inany: Introduce union sockaddr_inany
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
` (3 preceding siblings ...)
2024-02-06 1:17 ` [PATCH v2 04/22] inany: Provide more conveniently typed constants for special addresses David Gibson
@ 2024-02-06 1:17 ` David Gibson
2024-02-06 1:17 ` [PATCH v2 06/22] util: Allow IN4_IS_* macros to operate on untyped addresses David Gibson
` (17 subsequent siblings)
22 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2024-02-06 1:17 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
There are a number of places where we want to handle either a sockaddr_in
or a sockaddr_in6. In some of those we use a void *, which works ok and
matches some standard library interfaces, but doesn't give a signature
level hint that we're dealing with only sockaddr_in or sockaddr_in6, not
(say) sockaddr_un or another type of socket address. Other places we use a
sockaddr_storage, which also works, but has the same problem in addition to
allocating more on the stack than we need to.
Introduce union sockaddr_inany to explictly handle this case: it has
variants for sockaddr_in and sockaddr_in6. Use it in a number of
places where it's easy to do so.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
icmp.c | 18 ++++++------------
inany.h | 33 +++++++++++++++++++++------------
tcp.c | 11 +++++------
tcp_splice.c | 2 +-
tcp_splice.h | 3 ++-
5 files changed, 35 insertions(+), 32 deletions(-)
diff --git a/icmp.c b/icmp.c
index faa38c81..fb2fcafc 100644
--- a/icmp.c
+++ b/icmp.c
@@ -36,6 +36,8 @@
#include "passt.h"
#include "tap.h"
#include "log.h"
+#include "siphash.h"
+#include "inany.h"
#include "icmp.h"
#define ICMP_ECHO_TIMEOUT 60 /* s, timeout for ICMP socket activity */
@@ -67,13 +69,9 @@ void icmp_sock_handler(const struct ctx *c, sa_family_t af, union epoll_ref ref)
struct icmp_id_sock *const id_sock = af == AF_INET
? &icmp_id_map[V4][ref.icmp.id] : &icmp_id_map[V6][ref.icmp.id];
const char *const pname = af == AF_INET ? "ICMP" : "ICMPv6";
- char buf[USHRT_MAX];
- union {
- struct sockaddr sa;
- struct sockaddr_in sa4;
- struct sockaddr_in6 sa6;
- } sr;
+ union sockaddr_inany sr;
socklen_t sl = sizeof(sr);
+ char buf[USHRT_MAX];
uint16_t seq;
ssize_t n;
@@ -86,7 +84,7 @@ void icmp_sock_handler(const struct ctx *c, sa_family_t af, union epoll_ref ref)
pname, strerror(errno));
return;
}
- if (sr.sa.sa_family != af)
+ if (sr.sa_family != af)
goto unexpected;
if (af == AF_INET) {
@@ -214,11 +212,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,
const struct pool *p, const struct timespec *now)
{
const char *const pname = af == AF_INET ? "ICMP" : "ICMPv6";
- union {
- struct sockaddr sa;
- struct sockaddr_in sa4;
- struct sockaddr_in6 sa6;
- } sa = { .sa.sa_family = af };
+ union sockaddr_inany sa = { .sa_family = af };
const socklen_t sl = af == AF_INET ? sizeof(sa.sa4) : sizeof(sa.sa6);
struct icmp_id_sock *id_sock;
uint16_t id, seq;
diff --git a/inany.h b/inany.h
index 4c4b36c3..ae6d9b25 100644
--- a/inany.h
+++ b/inany.h
@@ -9,6 +9,8 @@
#ifndef INANY_H
#define INANY_H
+struct siphash_state;
+
/** union inany_addr - Represents either an IPv4 or IPv6 address
* @a6: Address as an IPv6 address, may be IPv4-mapped
* @v4mapped.zero: All zero-bits for an IPv4 address
@@ -41,6 +43,19 @@ extern const union inany_addr inany_any4;
#define in4addr_loopback (inany_loopback4.v4mapped.a4)
#define in4addr_any (inany_any4.v4mapped.a4)
+/** union sockaddr_inany - Either a sockaddr_in or a sockaddr_in6
+ * @sa_family: Address family, AF_INET or AF_INET6
+ * @sa: Plain struct sockaddr (useful to avoid casts)
+ * @sa4: IPv4 socket address, valid if sa_family == AF_INET
+ * @sa6: IPv6 socket address, valid if sa_family == AF_INET6
+ */
+union sockaddr_inany {
+ sa_family_t sa_family;
+ struct sockaddr sa;
+ struct sockaddr_in sa4;
+ struct sockaddr_in6 sa6;
+};
+
/** inany_v4 - Extract IPv4 address, if present, from IPv[46] address
* @addr: IPv4 or IPv6 address
*
@@ -137,23 +152,17 @@ static inline void inany_from_af(union inany_addr *aa,
/** inany_from_sockaddr - Extract IPv[46] address and port number from sockaddr
* @aa: Pointer to store IPv[46] address
* @port: Pointer to store port number, host order
- * @addr: struct sockaddr_in (IPv4) or struct sockaddr_in6 (IPv6)
+ * @addr: AF_INET or AF_INET6 socket address
*/
static inline void inany_from_sockaddr(union inany_addr *aa, in_port_t *port,
- const void *addr)
+ const union sockaddr_inany *sa)
{
- const struct sockaddr *sa = (const struct sockaddr *)addr;
-
if (sa->sa_family == AF_INET6) {
- struct sockaddr_in6 *sa6 = (struct sockaddr_in6 *)sa;
-
- inany_from_af(aa, AF_INET6, &sa6->sin6_addr);
- *port = ntohs(sa6->sin6_port);
+ inany_from_af(aa, AF_INET6, &sa->sa6.sin6_addr);
+ *port = ntohs(sa->sa6.sin6_port);
} else if (sa->sa_family == AF_INET) {
- struct sockaddr_in *sa4 = (struct sockaddr_in *)sa;
-
- inany_from_af(aa, AF_INET, &sa4->sin_addr);
- *port = ntohs(sa4->sin_port);
+ inany_from_af(aa, AF_INET, &sa->sa4.sin_addr);
+ *port = ntohs(sa->sa4.sin_port);
} else {
/* Not valid to call with other address families */
ASSERT(0);
diff --git a/tcp.c b/tcp.c
index 79b7b4d5..2bba3000 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2666,7 +2666,7 @@ static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr)
static void tcp_tap_conn_from_sock(struct ctx *c,
union tcp_listen_epoll_ref ref,
struct tcp_tap_conn *conn, int s,
- const struct sockaddr *sa,
+ const union sockaddr_inany *sa,
const struct timespec *now)
{
conn->f.type = FLOW_TCP;
@@ -2702,7 +2702,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c,
void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
const struct timespec *now)
{
- struct sockaddr_storage sa;
+ union sockaddr_inany sa;
socklen_t sl = sizeof(sa);
union flow *flow;
int s;
@@ -2710,17 +2710,16 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
if (c->no_tcp || !(flow = flow_alloc()))
return;
- s = accept4(ref.fd, (struct sockaddr *)&sa, &sl, SOCK_NONBLOCK);
+ s = accept4(ref.fd, &sa.sa, &sl, SOCK_NONBLOCK);
if (s < 0)
goto cancel;
if (c->mode == MODE_PASTA &&
tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice,
- s, (struct sockaddr *)&sa))
+ s, &sa))
return;
- tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s,
- (struct sockaddr *)&sa, now);
+ tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, &sa, now);
return;
cancel:
diff --git a/tcp_splice.c b/tcp_splice.c
index 4ecc178b..9fd49412 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -438,7 +438,7 @@ static int tcp_splice_new(const struct ctx *c, struct tcp_splice_conn *conn,
bool tcp_splice_conn_from_sock(const struct ctx *c,
union tcp_listen_epoll_ref ref,
struct tcp_splice_conn *conn, int s,
- const struct sockaddr *sa)
+ const union sockaddr_inany *sa)
{
union inany_addr aany;
in_port_t port;
diff --git a/tcp_splice.h b/tcp_splice.h
index 18193e4e..20f41b39 100644
--- a/tcp_splice.h
+++ b/tcp_splice.h
@@ -7,13 +7,14 @@
#define TCP_SPLICE_H
struct tcp_splice_conn;
+union sockaddr_inany;
void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
uint32_t events);
bool tcp_splice_conn_from_sock(const struct ctx *c,
union tcp_listen_epoll_ref ref,
struct tcp_splice_conn *conn, int s,
- const struct sockaddr *sa);
+ const union sockaddr_inany *sa);
void tcp_splice_init(struct ctx *c);
#endif /* TCP_SPLICE_H */
--
@@ -7,13 +7,14 @@
#define TCP_SPLICE_H
struct tcp_splice_conn;
+union sockaddr_inany;
void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
uint32_t events);
bool tcp_splice_conn_from_sock(const struct ctx *c,
union tcp_listen_epoll_ref ref,
struct tcp_splice_conn *conn, int s,
- const struct sockaddr *sa);
+ const union sockaddr_inany *sa);
void tcp_splice_init(struct ctx *c);
#endif /* TCP_SPLICE_H */
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 06/22] util: Allow IN4_IS_* macros to operate on untyped addresses
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
` (4 preceding siblings ...)
2024-02-06 1:17 ` [PATCH v2 05/22] inany: Introduce union sockaddr_inany David Gibson
@ 2024-02-06 1:17 ` David Gibson
2024-02-06 1:17 ` [PATCH v2 07/22] tcp, udp: Don't precompute port remappings in epoll references David Gibson
` (16 subsequent siblings)
22 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2024-02-06 1:17 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
The IN4_IS_*() macros expect a pointer to a struct in_addr. That makes
sense, but sometimes we have an IPv4 address as a void * pointer or union
type which makes these less convenient. More importantly, this doesn't
match the behaviour of the standard library's IN6_IS_*() macros on which
they're modelled, nor our own IN4_ARE_ADDR_EQUAL().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
util.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/util.h b/util.h
index e0df26c6..7679eade 100644
--- a/util.h
+++ b/util.h
@@ -111,13 +111,13 @@
#endif
#define IN4_IS_ADDR_UNSPECIFIED(a) \
- ((a)->s_addr == htonl_constant(INADDR_ANY))
+ (((struct in_addr *)(a))->s_addr == htonl_constant(INADDR_ANY))
#define IN4_IS_ADDR_BROADCAST(a) \
- ((a)->s_addr == htonl_constant(INADDR_BROADCAST))
+ (((struct in_addr *)(a))->s_addr == htonl_constant(INADDR_BROADCAST))
#define IN4_IS_ADDR_LOOPBACK(a) \
- (ntohl((a)->s_addr) >> IN_CLASSA_NSHIFT == IN_LOOPBACKNET)
+ (ntohl(((struct in_addr *)(a))->s_addr) >> IN_CLASSA_NSHIFT == IN_LOOPBACKNET)
#define IN4_IS_ADDR_MULTICAST(a) \
- (IN_MULTICAST(ntohl((a)->s_addr)))
+ (IN_MULTICAST(ntohl(((struct in_addr *)(a))->s_addr)))
#define IN4_ARE_ADDR_EQUAL(a, b) \
(((struct in_addr *)(a))->s_addr == ((struct in_addr *)b)->s_addr)
#define IN4ADDR_LOOPBACK_INIT \
--
@@ -111,13 +111,13 @@
#endif
#define IN4_IS_ADDR_UNSPECIFIED(a) \
- ((a)->s_addr == htonl_constant(INADDR_ANY))
+ (((struct in_addr *)(a))->s_addr == htonl_constant(INADDR_ANY))
#define IN4_IS_ADDR_BROADCAST(a) \
- ((a)->s_addr == htonl_constant(INADDR_BROADCAST))
+ (((struct in_addr *)(a))->s_addr == htonl_constant(INADDR_BROADCAST))
#define IN4_IS_ADDR_LOOPBACK(a) \
- (ntohl((a)->s_addr) >> IN_CLASSA_NSHIFT == IN_LOOPBACKNET)
+ (ntohl(((struct in_addr *)(a))->s_addr) >> IN_CLASSA_NSHIFT == IN_LOOPBACKNET)
#define IN4_IS_ADDR_MULTICAST(a) \
- (IN_MULTICAST(ntohl((a)->s_addr)))
+ (IN_MULTICAST(ntohl(((struct in_addr *)(a))->s_addr)))
#define IN4_ARE_ADDR_EQUAL(a, b) \
(((struct in_addr *)(a))->s_addr == ((struct in_addr *)b)->s_addr)
#define IN4ADDR_LOOPBACK_INIT \
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 07/22] tcp, udp: Don't precompute port remappings in epoll references
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
` (5 preceding siblings ...)
2024-02-06 1:17 ` [PATCH v2 06/22] util: Allow IN4_IS_* macros to operate on untyped addresses David Gibson
@ 2024-02-06 1:17 ` David Gibson
2024-02-06 1:17 ` [PATCH v2 08/22] flow: Add helper to determine a flow's protocol David Gibson
` (15 subsequent siblings)
22 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2024-02-06 1:17 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
The epoll references for both TCP listening sockets and UDP sockets
includes a port number. This gives the destination port that traffic to
that socket will be sent to on the other side. That will usually be the
same as the socket's bound port, but might not if the -t, -u, -T or -U
options are given with different original and forwarded port numbers.
As we move towards a more flexible forwarding model for passt, it's going
to become possible for that destination port to vary depending on more
things (for example the source or destination address). So, it will no
longer make sense to have a fixed value for a listening socket.
Change to simpler semantics where this field in the reference gives the
bound port of the socket. We apply the translations to the correct
destination port later on, when we're actually forwarding.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp.c | 8 ++++----
tcp.h | 2 +-
tcp_splice.c | 4 ++++
udp.c | 14 ++++++++------
4 files changed, 17 insertions(+), 11 deletions(-)
diff --git a/tcp.c b/tcp.c
index 2bba3000..3722dc09 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2676,7 +2676,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c,
conn_event(c, conn, SOCK_ACCEPTED);
inany_from_sockaddr(&conn->faddr, &conn->fport, sa);
- conn->eport = ref.port;
+ conn->eport = ref.port + c->tcp.fwd_in.delta[ref.port];
tcp_snat_inbound(c, &conn->faddr);
@@ -2860,7 +2860,7 @@ static int tcp_sock_init_af(const struct ctx *c, sa_family_t af, in_port_t port,
const void *addr, const char *ifname)
{
union tcp_listen_epoll_ref tref = {
- .port = port + c->tcp.fwd_in.delta[port],
+ .port = port,
.pif = PIF_HOST,
};
int s;
@@ -2922,7 +2922,7 @@ int tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
static void tcp_ns_sock_init4(const struct ctx *c, in_port_t port)
{
union tcp_listen_epoll_ref tref = {
- .port = port + c->tcp.fwd_out.delta[port],
+ .port = port,
.pif = PIF_SPLICE,
};
int s;
@@ -2948,7 +2948,7 @@ static void tcp_ns_sock_init4(const struct ctx *c, in_port_t port)
static void tcp_ns_sock_init6(const struct ctx *c, in_port_t port)
{
union tcp_listen_epoll_ref tref = {
- .port = port + c->tcp.fwd_out.delta[port],
+ .port = port,
.pif = PIF_SPLICE,
};
int s;
diff --git a/tcp.h b/tcp.h
index 875006ed..5e6756d4 100644
--- a/tcp.h
+++ b/tcp.h
@@ -37,7 +37,7 @@ union tcp_epoll_ref {
/**
* union tcp_listen_epoll_ref - epoll reference portion for TCP listening
- * @port: Port number we're forwarding *to* (listening port plus delta)
+ * @port: Bound port number of the socket
* @pif: pif in which the socket is listening
* @u32: Opaque u32 value of reference
*/
diff --git a/tcp_splice.c b/tcp_splice.c
index 9fd49412..40ecb5d4 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -401,6 +401,8 @@ static int tcp_splice_new(const struct ctx *c, struct tcp_splice_conn *conn,
int *p = CONN_V6(conn) ? init_sock_pool6 : init_sock_pool4;
sa_family_t af = CONN_V6(conn) ? AF_INET6 : AF_INET;
+ port += c->tcp.fwd_out.delta[port];
+
s = tcp_conn_pool_sock(p);
if (s < 0)
s = tcp_conn_new_sock(c, af);
@@ -409,6 +411,8 @@ static int tcp_splice_new(const struct ctx *c, struct tcp_splice_conn *conn,
ASSERT(pif == PIF_HOST);
+ port += c->tcp.fwd_in.delta[port];
+
/* If pool is empty, refill it first */
if (p[TCP_SOCK_POOL_SIZE-1] < 0)
NS_CALL(tcp_sock_refill_ns, c);
diff --git a/udp.c b/udp.c
index f2be0080..f5b86568 100644
--- a/udp.c
+++ b/udp.c
@@ -765,6 +765,11 @@ void udp_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
if (c->no_udp || !(events & EPOLLIN))
return;
+ if (ref.udp.pif == PIF_SPLICE)
+ dstport += c->udp.fwd_out.f.delta[dstport];
+ else if (ref.udp.pif == PIF_HOST)
+ dstport += c->udp.fwd_in.f.delta[dstport];
+
if (v6) {
mmh_recv = udp6_l2_mh_sock;
udp6_localname.sin6_port = htons(dstport);
@@ -992,16 +997,13 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af,
const void *addr, const char *ifname, in_port_t port)
{
union udp_epoll_ref uref = { .splice = (c->mode == MODE_PASTA),
- .orig = true };
+ .orig = true, .port = port };
int s, r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1;
- if (ns) {
+ if (ns)
uref.pif = PIF_SPLICE;
- uref.port = (in_port_t)(port + c->udp.fwd_out.f.delta[port]);
- } else {
+ else
uref.pif = PIF_HOST;
- uref.port = (in_port_t)(port + c->udp.fwd_in.f.delta[port]);
- }
if ((af == AF_INET || af == AF_UNSPEC) && c->ifi4) {
uref.v6 = 0;
--
@@ -765,6 +765,11 @@ void udp_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
if (c->no_udp || !(events & EPOLLIN))
return;
+ if (ref.udp.pif == PIF_SPLICE)
+ dstport += c->udp.fwd_out.f.delta[dstport];
+ else if (ref.udp.pif == PIF_HOST)
+ dstport += c->udp.fwd_in.f.delta[dstport];
+
if (v6) {
mmh_recv = udp6_l2_mh_sock;
udp6_localname.sin6_port = htons(dstport);
@@ -992,16 +997,13 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af,
const void *addr, const char *ifname, in_port_t port)
{
union udp_epoll_ref uref = { .splice = (c->mode == MODE_PASTA),
- .orig = true };
+ .orig = true, .port = port };
int s, r4 = FD_REF_MAX + 1, r6 = FD_REF_MAX + 1;
- if (ns) {
+ if (ns)
uref.pif = PIF_SPLICE;
- uref.port = (in_port_t)(port + c->udp.fwd_out.f.delta[port]);
- } else {
+ else
uref.pif = PIF_HOST;
- uref.port = (in_port_t)(port + c->udp.fwd_in.f.delta[port]);
- }
if ((af == AF_INET || af == AF_UNSPEC) && c->ifi4) {
uref.v6 = 0;
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 08/22] flow: Add helper to determine a flow's protocol
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
` (6 preceding siblings ...)
2024-02-06 1:17 ` [PATCH v2 07/22] tcp, udp: Don't precompute port remappings in epoll references David Gibson
@ 2024-02-06 1:17 ` David Gibson
2024-02-06 1:17 ` [PATCH v2 09/22] tcp_splice: Simplify clean up logic David Gibson
` (14 subsequent siblings)
22 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2024-02-06 1:17 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
Each flow already has a type field. This implies the protocol the flow
represents, but also has more information: we have two ways to represent
TCP flows, "tap" and "spliced". In order to generalise some of the flow
mechanics, we'll need to determine a flow's protocol in terms of the IP
(L4) protocol number.
Introduce a constant table and helper macro to derive this from the flow
type.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
flow.c | 7 +++++++
flow.h | 4 ++++
2 files changed, 11 insertions(+)
diff --git a/flow.c b/flow.c
index 5e94a7a9..beb9749c 100644
--- a/flow.c
+++ b/flow.c
@@ -25,6 +25,13 @@ const char *flow_type_str[] = {
static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES,
"flow_type_str[] doesn't match enum flow_type");
+const uint8_t flow_proto[] = {
+ [FLOW_TCP] = IPPROTO_TCP,
+ [FLOW_TCP_SPLICE] = IPPROTO_TCP,
+};
+static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
+ "flow_proto[] doesn't match enum flow_type");
+
/* Global Flow Table */
/**
diff --git a/flow.h b/flow.h
index 48a0ab4b..e9b3ce3e 100644
--- a/flow.h
+++ b/flow.h
@@ -27,6 +27,10 @@ extern const char *flow_type_str[];
#define FLOW_TYPE(f) \
((f)->type < FLOW_NUM_TYPES ? flow_type_str[(f)->type] : "?")
+extern const uint8_t flow_proto[];
+#define FLOW_PROTO(f) \
+ ((f)->type < FLOW_NUM_TYPES ? flow_proto[(f)->type] : 0)
+
/**
* struct flow_common - Common fields for packet flows
* @type: Type of packet flow
--
@@ -27,6 +27,10 @@ extern const char *flow_type_str[];
#define FLOW_TYPE(f) \
((f)->type < FLOW_NUM_TYPES ? flow_type_str[(f)->type] : "?")
+extern const uint8_t flow_proto[];
+#define FLOW_PROTO(f) \
+ ((f)->type < FLOW_NUM_TYPES ? flow_proto[(f)->type] : 0)
+
/**
* struct flow_common - Common fields for packet flows
* @type: Type of packet flow
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 09/22] tcp_splice: Simplify clean up logic
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
` (7 preceding siblings ...)
2024-02-06 1:17 ` [PATCH v2 08/22] flow: Add helper to determine a flow's protocol David Gibson
@ 2024-02-06 1:17 ` David Gibson
2024-02-18 20:59 ` Stefano Brivio
2024-02-06 1:17 ` [PATCH v2 10/22] tcp_splice: Don't use flow_trace() before setting flow type David Gibson
` (13 subsequent siblings)
22 siblings, 1 reply; 38+ messages in thread
From: David Gibson @ 2024-02-06 1:17 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
Currently tcp_splice_flow_defer() contains specific logic to determine if
we're far enough initialised that we need to close pipes and/or sockets.
This is potentially fragile if we change something about the order in which
we do things. We can simplify this by initialising the pipe and socket
fields to -1 very early, then close()ing them if and only if they're non
negative.
This lets us remove a special case cleanup if our connect() fails. This
will already trigger a CLOSING event, and the socket fd in question is
populated in the connection structure. Thus we can let the new cleanup
logic handle it rather than requiring an explicit close().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp_splice.c | 27 +++++++++++----------------
1 file changed, 11 insertions(+), 16 deletions(-)
diff --git a/tcp_splice.c b/tcp_splice.c
index 40ecb5d4..f0343eb5 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -246,16 +246,14 @@ bool tcp_splice_flow_defer(union flow *flow)
return false;
for (side = 0; side < SIDES; side++) {
- if (conn->events & SPLICE_ESTABLISHED) {
- /* Flushing might need to block: don't recycle them. */
- if (conn->pipe[side][0] != -1) {
- close(conn->pipe[side][0]);
- close(conn->pipe[side][1]);
- conn->pipe[side][0] = conn->pipe[side][1] = -1;
- }
+ /* Flushing might need to block: don't recycle them. */
+ if (conn->pipe[side][0] >= 0) {
+ close(conn->pipe[side][0]);
+ close(conn->pipe[side][1]);
+ conn->pipe[side][0] = conn->pipe[side][1] = -1;
}
- if (side == 0 || conn->events & SPLICE_CONNECT) {
+ if (conn->s[side] >= 0) {
close(conn->s[side]);
conn->s[side] = -1;
}
@@ -284,8 +282,6 @@ static int tcp_splice_connect_finish(const struct ctx *c,
int i = 0;
for (side = 0; side < SIDES; side++) {
- conn->pipe[side][0] = conn->pipe[side][1] = -1;
-
for (; i < TCP_SPLICE_PIPE_POOL_SIZE; i++) {
if (splice_pipe_pool[i][0] >= 0) {
SWAP(conn->pipe[side][0],
@@ -361,12 +357,8 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
}
if (connect(conn->s[1], sa, sl)) {
- if (errno != EINPROGRESS) {
- int ret = -errno;
-
- close(sock_conn);
- return ret;
- }
+ if (errno != EINPROGRESS)
+ return -errno;
conn_event(c, conn, SPLICE_CONNECT);
} else {
conn_event(c, conn, SPLICE_ESTABLISHED);
@@ -459,6 +451,9 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
conn->f.type = FLOW_TCP_SPLICE;
conn->flags = inany_v4(&aany) ? 0 : SPLICE_V6;
conn->s[0] = s;
+ conn->s[1] = -1;
+ conn->pipe[0][0] = conn->pipe[0][1] = -1;
+ conn->pipe[1][0] = conn->pipe[1][1] = -1;
if (tcp_splice_new(c, conn, ref.port, ref.pif))
conn_flag(c, conn, CLOSING);
--
@@ -246,16 +246,14 @@ bool tcp_splice_flow_defer(union flow *flow)
return false;
for (side = 0; side < SIDES; side++) {
- if (conn->events & SPLICE_ESTABLISHED) {
- /* Flushing might need to block: don't recycle them. */
- if (conn->pipe[side][0] != -1) {
- close(conn->pipe[side][0]);
- close(conn->pipe[side][1]);
- conn->pipe[side][0] = conn->pipe[side][1] = -1;
- }
+ /* Flushing might need to block: don't recycle them. */
+ if (conn->pipe[side][0] >= 0) {
+ close(conn->pipe[side][0]);
+ close(conn->pipe[side][1]);
+ conn->pipe[side][0] = conn->pipe[side][1] = -1;
}
- if (side == 0 || conn->events & SPLICE_CONNECT) {
+ if (conn->s[side] >= 0) {
close(conn->s[side]);
conn->s[side] = -1;
}
@@ -284,8 +282,6 @@ static int tcp_splice_connect_finish(const struct ctx *c,
int i = 0;
for (side = 0; side < SIDES; side++) {
- conn->pipe[side][0] = conn->pipe[side][1] = -1;
-
for (; i < TCP_SPLICE_PIPE_POOL_SIZE; i++) {
if (splice_pipe_pool[i][0] >= 0) {
SWAP(conn->pipe[side][0],
@@ -361,12 +357,8 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
}
if (connect(conn->s[1], sa, sl)) {
- if (errno != EINPROGRESS) {
- int ret = -errno;
-
- close(sock_conn);
- return ret;
- }
+ if (errno != EINPROGRESS)
+ return -errno;
conn_event(c, conn, SPLICE_CONNECT);
} else {
conn_event(c, conn, SPLICE_ESTABLISHED);
@@ -459,6 +451,9 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
conn->f.type = FLOW_TCP_SPLICE;
conn->flags = inany_v4(&aany) ? 0 : SPLICE_V6;
conn->s[0] = s;
+ conn->s[1] = -1;
+ conn->pipe[0][0] = conn->pipe[0][1] = -1;
+ conn->pipe[1][0] = conn->pipe[1][1] = -1;
if (tcp_splice_new(c, conn, ref.port, ref.pif))
conn_flag(c, conn, CLOSING);
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 10/22] tcp_splice: Don't use flow_trace() before setting flow type
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
` (8 preceding siblings ...)
2024-02-06 1:17 ` [PATCH v2 09/22] tcp_splice: Simplify clean up logic David Gibson
@ 2024-02-06 1:17 ` David Gibson
2024-02-06 1:17 ` [PATCH v2 11/22] flow: Clarify flow entry life cycle, introduce uniform logging David Gibson
` (12 subsequent siblings)
22 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2024-02-06 1:17 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
In tcp_splice_conn_from_sock() we can call flow_trace() if there's an
error setting TCP_QUICKACK. However, we do so before we've set the flow
type in the flow entry. That means that flow_trace() will print nonsense
when it tries to print the flow type.
There's no reason the setsockopt() has to happen before initialising the
flow entry, so just move it after.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp_splice.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/tcp_splice.c b/tcp_splice.c
index f0343eb5..180a9ea7 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -445,9 +445,6 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
if (!inany_is_loopback(&aany))
return false;
- if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int)))
- flow_trace(conn, "failed to set TCP_QUICKACK on %i", s);
-
conn->f.type = FLOW_TCP_SPLICE;
conn->flags = inany_v4(&aany) ? 0 : SPLICE_V6;
conn->s[0] = s;
@@ -455,6 +452,9 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
conn->pipe[0][0] = conn->pipe[0][1] = -1;
conn->pipe[1][0] = conn->pipe[1][1] = -1;
+ if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int)))
+ flow_trace(conn, "failed to set TCP_QUICKACK on %i", s);
+
if (tcp_splice_new(c, conn, ref.port, ref.pif))
conn_flag(c, conn, CLOSING);
--
@@ -445,9 +445,6 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
if (!inany_is_loopback(&aany))
return false;
- if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int)))
- flow_trace(conn, "failed to set TCP_QUICKACK on %i", s);
-
conn->f.type = FLOW_TCP_SPLICE;
conn->flags = inany_v4(&aany) ? 0 : SPLICE_V6;
conn->s[0] = s;
@@ -455,6 +452,9 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
conn->pipe[0][0] = conn->pipe[0][1] = -1;
conn->pipe[1][0] = conn->pipe[1][1] = -1;
+ if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int)))
+ flow_trace(conn, "failed to set TCP_QUICKACK on %i", s);
+
if (tcp_splice_new(c, conn, ref.port, ref.pif))
conn_flag(c, conn, CLOSING);
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 11/22] flow: Clarify flow entry life cycle, introduce uniform logging
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
` (9 preceding siblings ...)
2024-02-06 1:17 ` [PATCH v2 10/22] tcp_splice: Don't use flow_trace() before setting flow type David Gibson
@ 2024-02-06 1:17 ` David Gibson
2024-02-18 21:00 ` Stefano Brivio
2024-02-06 1:17 ` [PATCH v2 12/22] tcp, tcp_splice: Helpers for getting sockets from the pools David Gibson
` (11 subsequent siblings)
22 siblings, 1 reply; 38+ messages in thread
From: David Gibson @ 2024-02-06 1:17 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
Our allocation scheme for flow entries means there are some non-obvious
constraints on when what things can be done with an entry. Add a big doc
comment explaining the life cycle.
In addition, make a FLOW_START() macro to mark one of the important
transitions. This encourages correct usage, by making it natural to only
access the flow type specific structure after calling it. It also logs
that a new flow has been created, which is useful for debugging.
We also add logging when a flow's lifecycle ends. This doesn't need a new
helper, because it can only happen either from flow_alloc_cancel() or from
the flow deferred handler.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
flow.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++--
flow.h | 5 ++++
tcp.c | 15 +++++------
tcp_splice.c | 11 ++++----
tcp_splice.h | 5 ++--
5 files changed, 94 insertions(+), 18 deletions(-)
diff --git a/flow.c b/flow.c
index beb9749c..a155b54b 100644
--- a/flow.c
+++ b/flow.c
@@ -34,6 +34,45 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
/* Global Flow Table */
+/**
+ * DOC: Theory of Operation - flow entry life cycle
+ *
+ * An individual flow table entry moves through these logical states, usually in
+ * this order.
+ *
+ * FREE - Part of the general pool of free flow table entries
+ * Operations:
+ * - flow_alloc() finds an entry and moves it to ALLOC state
+ *
+ * ALLOC - A tentatively allocated entry
+ * Operations:
+ * - flow_alloc_cancel() returns the entry to FREE state
+ * - FLOW_START() set the entry's type and moves to START state
+ * Caveats:
+ * - It's not safe to write fields in the flow entry
+ * - It's not safe to allocate other entries with flow_alloc()
+ * - It's not safe to return to the main epoll loop
+ * - It's not safe to use flow_*() logging functions
+ *
+ * START - An entry being prepared by flow type specific code
+ * Operations:
+ * - Flow type specific fields may be accessed
+ * - flow_*() logging functions
+ * - flow_alloc_cancel() returns the entry to FREE state
+ * Caveats:
+ * - Returning to the main epoll loop or allocating another entry
+ * with flow_alloc() implicitly moves the entry to ACTIVE state.
+ *
+ * ACTIVE - An active flow entry managed by flow type specific code
+ * Operations:
+ * - Flow type specific fields may be accessed
+ * - flow_*() logging functions
+ * - Flow may be expired by returning 'true' from flow type specific
+ * deferred or timer handler. This will return it to FREE state.
+ * Caveats:
+ * - It's not safe to call flow_alloc_cancel()
+ */
+
/**
* DOC: Theory of Operation - allocating and freeing flow entries
*
@@ -109,6 +148,39 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...)
logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg);
}
+/**
+ * flow_start() - Set flow type for new flow and log
+ * @flow: Flow to set type for
+ * @type: Type for new flow
+ * @iniside: Which side initiated the new flow
+ *
+ * Return: @flow
+ *
+ * Should be called before setting any flow type specific fields in the flow
+ * table entry.
+ */
+union flow *flow_start(union flow *flow, enum flow_type type,
+ unsigned iniside)
+{
+ (void)iniside;
+ flow->f.type = type;
+ flow_dbg(flow, "START %s", flow_type_str[flow->f.type]);
+ return flow;
+}
+
+/**
+ * flow_end() - Clear flow type for finished flow and log
+ * @flow: Flow to clear
+ */
+static void flow_end(union flow *flow)
+{
+ if (flow->f.type == FLOW_TYPE_NONE)
+ return; /* Nothing to do */
+
+ flow_dbg(flow, "END %s", flow_type_str[flow->f.type]);
+ flow->f.type = FLOW_TYPE_NONE;
+}
+
/**
* flow_alloc() - Allocate a new flow
*
@@ -157,7 +229,7 @@ void flow_alloc_cancel(union flow *flow)
{
ASSERT(flow_first_free > FLOW_IDX(flow));
- flow->f.type = FLOW_TYPE_NONE;
+ flow_end(flow);
/* Put it back in a length 1 free cluster, don't attempt to fully
* reverse flow_alloc()s steps. This will get folded together the next
* time flow_defer_handler runs anyway() */
@@ -227,7 +299,7 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
}
if (closed) {
- flow->f.type = FLOW_TYPE_NONE;
+ flow_end(flow);
if (free_head) {
/* Add slot to current free cluster */
diff --git a/flow.h b/flow.h
index e9b3ce3e..8b66751b 100644
--- a/flow.h
+++ b/flow.h
@@ -45,6 +45,11 @@ struct flow_common {
#define FLOW_TABLE_PRESSURE 30 /* % of FLOW_MAX */
#define FLOW_FILE_PRESSURE 30 /* % of c->nofile */
+union flow *flow_start(union flow *flow, enum flow_type type,
+ unsigned iniside);
+#define FLOW_START(flow_, t_, var_, i_) \
+ (&flow_start((flow_), (t_), (i_))->var_)
+
/**
* struct flow_sidx - ID for one side of a specific flow
* @side: Side referenced (0 or 1)
diff --git a/tcp.c b/tcp.c
index 3722dc09..e15b932f 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1952,8 +1952,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
goto cancel;
}
- conn = &flow->tcp;
- conn->f.type = FLOW_TCP;
+ conn = FLOW_START(flow, FLOW_TCP, tcp, TAPSIDE);
conn->sock = s;
conn->timer = -1;
conn_event(c, conn, TAP_SYN_RCVD);
@@ -2658,18 +2657,19 @@ static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr)
* tcp_tap_conn_from_sock() - Initialize state for non-spliced connection
* @c: Execution context
* @ref: epoll reference of listening socket
- * @conn: connection structure to initialize
+ * @flow: flow to initialise
* @s: Accepted socket
* @sa: Peer socket address (from accept())
* @now: Current timestamp
*/
static void tcp_tap_conn_from_sock(struct ctx *c,
union tcp_listen_epoll_ref ref,
- struct tcp_tap_conn *conn, int s,
+ union flow *flow, int s,
const union sockaddr_inany *sa,
const struct timespec *now)
{
- conn->f.type = FLOW_TCP;
+ struct tcp_tap_conn *conn = FLOW_START(flow, FLOW_TCP, tcp, SOCKSIDE);
+
conn->sock = s;
conn->timer = -1;
conn->ws_to_tap = conn->ws_from_tap = 0;
@@ -2715,11 +2715,10 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
goto cancel;
if (c->mode == MODE_PASTA &&
- tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice,
- s, &sa))
+ tcp_splice_conn_from_sock(c, ref.tcp_listen, flow, s, &sa))
return;
- tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, &sa, now);
+ tcp_tap_conn_from_sock(c, ref.tcp_listen, flow, s, &sa, now);
return;
cancel:
diff --git a/tcp_splice.c b/tcp_splice.c
index 180a9ea7..576fe9be 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -424,7 +424,7 @@ static int tcp_splice_new(const struct ctx *c, struct tcp_splice_conn *conn,
* tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection
* @c: Execution context
* @ref: epoll reference of listening socket
- * @conn: connection structure to initialize
+ * @flow: flow to initialise
* @s: Accepted socket
* @sa: Peer address of connection
*
@@ -432,10 +432,10 @@ static int tcp_splice_new(const struct ctx *c, struct tcp_splice_conn *conn,
* #syscalls:pasta setsockopt
*/
bool tcp_splice_conn_from_sock(const struct ctx *c,
- union tcp_listen_epoll_ref ref,
- struct tcp_splice_conn *conn, int s,
- const union sockaddr_inany *sa)
+ union tcp_listen_epoll_ref ref, union flow *flow,
+ int s, const union sockaddr_inany *sa)
{
+ struct tcp_splice_conn *conn;
union inany_addr aany;
in_port_t port;
@@ -445,7 +445,8 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
if (!inany_is_loopback(&aany))
return false;
- conn->f.type = FLOW_TCP_SPLICE;
+ conn = FLOW_START(flow, FLOW_TCP_SPLICE, tcp_splice, 0);
+
conn->flags = inany_v4(&aany) ? 0 : SPLICE_V6;
conn->s[0] = s;
conn->s[1] = -1;
diff --git a/tcp_splice.h b/tcp_splice.h
index 20f41b39..5a471af0 100644
--- a/tcp_splice.h
+++ b/tcp_splice.h
@@ -12,9 +12,8 @@ union sockaddr_inany;
void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
uint32_t events);
bool tcp_splice_conn_from_sock(const struct ctx *c,
- union tcp_listen_epoll_ref ref,
- struct tcp_splice_conn *conn, int s,
- const union sockaddr_inany *sa);
+ union tcp_listen_epoll_ref ref, union flow *flow,
+ int s, const union sockaddr_inany *sa);
void tcp_splice_init(struct ctx *c);
#endif /* TCP_SPLICE_H */
--
@@ -12,9 +12,8 @@ union sockaddr_inany;
void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
uint32_t events);
bool tcp_splice_conn_from_sock(const struct ctx *c,
- union tcp_listen_epoll_ref ref,
- struct tcp_splice_conn *conn, int s,
- const union sockaddr_inany *sa);
+ union tcp_listen_epoll_ref ref, union flow *flow,
+ int s, const union sockaddr_inany *sa);
void tcp_splice_init(struct ctx *c);
#endif /* TCP_SPLICE_H */
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 12/22] tcp, tcp_splice: Helpers for getting sockets from the pools
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
` (10 preceding siblings ...)
2024-02-06 1:17 ` [PATCH v2 11/22] flow: Clarify flow entry life cycle, introduce uniform logging David Gibson
@ 2024-02-06 1:17 ` David Gibson
2024-02-18 21:00 ` Stefano Brivio
2024-02-06 1:17 ` [PATCH v2 13/22] tcp_splice: More specific variable names in new splice path David Gibson
` (10 subsequent siblings)
22 siblings, 1 reply; 38+ messages in thread
From: David Gibson @ 2024-02-06 1:17 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
We maintain pools of ready-to-connect sockets in both the original and
(for pasta) guest namespace to reduce latency when starting new TCP
connections. If we exhaust those pools we have to take a higher
latency path to get a new socket.
Currently we open-code that fallback in the places we need it. To
improve clarity encapsulate that into helper functions.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp.c | 29 ++++++++++++++++++++++++-----
tcp_conn.h | 2 +-
tcp_splice.c | 46 +++++++++++++++++++++++++---------------------
3 files changed, 50 insertions(+), 27 deletions(-)
diff --git a/tcp.c b/tcp.c
index e15b932f..c06d1cc4 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1792,7 +1792,7 @@ int tcp_conn_pool_sock(int pool[])
*
* Return: socket number on success, negative code if socket creation failed
*/
-int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
+static int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
{
int s;
@@ -1811,6 +1811,27 @@ int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
return s;
}
+/**
+ * tcp_conn_sock() - Obtain a connectable socket in the host/init namespace
+ * @c: Execution context
+ * @af: Address family (AF_INET or AF_INET6)
+ *
+ * Return: Socket fd on success, -errno on failure
+ */
+int tcp_conn_sock(const struct ctx *c, sa_family_t af)
+{
+ int *pool = af == AF_INET6 ? init_sock_pool6 : init_sock_pool4;
+ int s;
+
+ if ((s = tcp_conn_pool_sock(pool)) >= 0)
+ return s;
+
+ /* If the pool is empty we just open a new one without refilling the
+ * pool to keep latency down.
+ */
+ return tcp_conn_new_sock(c, af);
+}
+
/**
* tcp_conn_tap_mss() - Get MSS value advertised by tap/guest
* @conn: Connection pointer
@@ -1909,7 +1930,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
const struct tcphdr *th, const char *opts,
size_t optlen, const struct timespec *now)
{
- int *pool = af == AF_INET6 ? init_sock_pool6 : init_sock_pool4;
struct sockaddr_in addr4 = {
.sin_family = AF_INET,
.sin_port = th->dest,
@@ -1931,9 +1951,8 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
if (!(flow = flow_alloc()))
return;
- if ((s = tcp_conn_pool_sock(pool)) < 0)
- if ((s = tcp_conn_new_sock(c, af)) < 0)
- goto cancel;
+ if ((s = tcp_conn_sock(c, af)) < 0)
+ goto cancel;
if (!c->no_map_gw) {
if (af == AF_INET && IN4_ARE_ADDR_EQUAL(daddr, &c->ip4.gw))
diff --git a/tcp_conn.h b/tcp_conn.h
index 20c7cb8b..e55edafe 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -159,7 +159,7 @@ bool tcp_flow_defer(union flow *flow);
bool tcp_splice_flow_defer(union flow *flow);
void tcp_splice_timer(const struct ctx *c, union flow *flow);
int tcp_conn_pool_sock(int pool[]);
-int tcp_conn_new_sock(const struct ctx *c, sa_family_t af);
+int tcp_conn_sock(const struct ctx *c, sa_family_t af);
void tcp_sock_refill_pool(const struct ctx *c, int pool[], sa_family_t af);
void tcp_splice_refill(const struct ctx *c);
diff --git a/tcp_splice.c b/tcp_splice.c
index 576fe9be..609f3242 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -90,7 +90,7 @@ static const char *tcp_splice_flag_str[] __attribute((__unused__)) = {
};
/* Forward declaration */
-static int tcp_sock_refill_ns(void *arg);
+static int tcp_conn_sock_ns(const struct ctx *c, sa_family_t af);
/**
* tcp_splice_conn_epoll_events() - epoll events masks for given state
@@ -380,36 +380,19 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
static int tcp_splice_new(const struct ctx *c, struct tcp_splice_conn *conn,
in_port_t port, uint8_t pif)
{
+ sa_family_t af = CONN_V6(conn) ? AF_INET6 : AF_INET;
int s = -1;
- /* If the pool is empty we take slightly different approaches
- * for init or ns sockets. For init sockets we just open a
- * new one without refilling the pool to keep latency down.
- * For ns sockets, we're going to incur the latency of
- * entering the ns anyway, so we might as well refill the
- * pool.
- */
if (pif == PIF_SPLICE) {
- int *p = CONN_V6(conn) ? init_sock_pool6 : init_sock_pool4;
- sa_family_t af = CONN_V6(conn) ? AF_INET6 : AF_INET;
-
port += c->tcp.fwd_out.delta[port];
- s = tcp_conn_pool_sock(p);
- if (s < 0)
- s = tcp_conn_new_sock(c, af);
+ s = tcp_conn_sock(c, af);
} else {
- int *p = CONN_V6(conn) ? ns_sock_pool6 : ns_sock_pool4;
-
ASSERT(pif == PIF_HOST);
port += c->tcp.fwd_in.delta[port];
- /* If pool is empty, refill it first */
- if (p[TCP_SOCK_POOL_SIZE-1] < 0)
- NS_CALL(tcp_sock_refill_ns, c);
-
- s = tcp_conn_pool_sock(p);
+ s = tcp_conn_sock_ns(c, af);
}
if (s < 0) {
@@ -709,6 +692,27 @@ static int tcp_sock_refill_ns(void *arg)
return 0;
}
+/**
+ * tcp_conn_sock_ns() - Obtain a connectable socket in the namespace
+ * @c: Execution context
+ * @af: Address family (AF_INET or AF_INET6)
+ *
+ * Return: Socket fd in the namespace on success, -errno on failure
+ */
+static int tcp_conn_sock_ns(const struct ctx *c, sa_family_t af)
+{
+ int *p = af == AF_INET6 ? ns_sock_pool6 : ns_sock_pool4;
+
+ /* If the pool is empty we have to incur the latency of entering the ns.
+ * Therefore, we might as well refill the whole pool while we're at it,
+ * which differs from tcp_conn_sock().
+ */
+ if (p[TCP_SOCK_POOL_SIZE-1] < 0)
+ NS_CALL(tcp_sock_refill_ns, c);
+
+ return tcp_conn_pool_sock(p);
+}
+
/**
* tcp_splice_refill() - Refill pools of resources needed for splicing
* @c: Execution context
--
@@ -90,7 +90,7 @@ static const char *tcp_splice_flag_str[] __attribute((__unused__)) = {
};
/* Forward declaration */
-static int tcp_sock_refill_ns(void *arg);
+static int tcp_conn_sock_ns(const struct ctx *c, sa_family_t af);
/**
* tcp_splice_conn_epoll_events() - epoll events masks for given state
@@ -380,36 +380,19 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
static int tcp_splice_new(const struct ctx *c, struct tcp_splice_conn *conn,
in_port_t port, uint8_t pif)
{
+ sa_family_t af = CONN_V6(conn) ? AF_INET6 : AF_INET;
int s = -1;
- /* If the pool is empty we take slightly different approaches
- * for init or ns sockets. For init sockets we just open a
- * new one without refilling the pool to keep latency down.
- * For ns sockets, we're going to incur the latency of
- * entering the ns anyway, so we might as well refill the
- * pool.
- */
if (pif == PIF_SPLICE) {
- int *p = CONN_V6(conn) ? init_sock_pool6 : init_sock_pool4;
- sa_family_t af = CONN_V6(conn) ? AF_INET6 : AF_INET;
-
port += c->tcp.fwd_out.delta[port];
- s = tcp_conn_pool_sock(p);
- if (s < 0)
- s = tcp_conn_new_sock(c, af);
+ s = tcp_conn_sock(c, af);
} else {
- int *p = CONN_V6(conn) ? ns_sock_pool6 : ns_sock_pool4;
-
ASSERT(pif == PIF_HOST);
port += c->tcp.fwd_in.delta[port];
- /* If pool is empty, refill it first */
- if (p[TCP_SOCK_POOL_SIZE-1] < 0)
- NS_CALL(tcp_sock_refill_ns, c);
-
- s = tcp_conn_pool_sock(p);
+ s = tcp_conn_sock_ns(c, af);
}
if (s < 0) {
@@ -709,6 +692,27 @@ static int tcp_sock_refill_ns(void *arg)
return 0;
}
+/**
+ * tcp_conn_sock_ns() - Obtain a connectable socket in the namespace
+ * @c: Execution context
+ * @af: Address family (AF_INET or AF_INET6)
+ *
+ * Return: Socket fd in the namespace on success, -errno on failure
+ */
+static int tcp_conn_sock_ns(const struct ctx *c, sa_family_t af)
+{
+ int *p = af == AF_INET6 ? ns_sock_pool6 : ns_sock_pool4;
+
+ /* If the pool is empty we have to incur the latency of entering the ns.
+ * Therefore, we might as well refill the whole pool while we're at it,
+ * which differs from tcp_conn_sock().
+ */
+ if (p[TCP_SOCK_POOL_SIZE-1] < 0)
+ NS_CALL(tcp_sock_refill_ns, c);
+
+ return tcp_conn_pool_sock(p);
+}
+
/**
* tcp_splice_refill() - Refill pools of resources needed for splicing
* @c: Execution context
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 13/22] tcp_splice: More specific variable names in new splice path
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
` (11 preceding siblings ...)
2024-02-06 1:17 ` [PATCH v2 12/22] tcp, tcp_splice: Helpers for getting sockets from the pools David Gibson
@ 2024-02-06 1:17 ` David Gibson
2024-02-18 21:00 ` Stefano Brivio
2024-02-06 1:17 ` [PATCH v2 14/22] tcp_splice: Merge tcp_splice_new() into its caller David Gibson
` (9 subsequent siblings)
22 siblings, 1 reply; 38+ messages in thread
From: David Gibson @ 2024-02-06 1:17 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
In tcp_splice_conn_from_sock(), the 'port' variable stores the source port
of the connection on the originating side. In tcp_splice_new(), called
directly from it, the 'port' parameter gives the _destination_ port of the
originating connection and is then updated to the destination port of the
connection on the other side.
Similarly, in tcp_splice_conn_from_sock(), 's' is the fd of the accetped
socket (on side 0), whereas in tcp_splice_new(), 's' is the fd of the
connecting socket (side 1).
I, for one, find having the same variable name with different meanings in
such close proximity in the flow of control pretty confusing. Alter the
names for greater specificity and clarity.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp_splice.c | 40 ++++++++++++++++++++--------------------
tcp_splice.h | 2 +-
2 files changed, 21 insertions(+), 21 deletions(-)
diff --git a/tcp_splice.c b/tcp_splice.c
index 609f3242..d1263f21 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -378,29 +378,29 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
* Return: return code from connect()
*/
static int tcp_splice_new(const struct ctx *c, struct tcp_splice_conn *conn,
- in_port_t port, uint8_t pif)
+ in_port_t dstport, uint8_t pif)
{
sa_family_t af = CONN_V6(conn) ? AF_INET6 : AF_INET;
- int s = -1;
+ int s1;
if (pif == PIF_SPLICE) {
- port += c->tcp.fwd_out.delta[port];
+ dstport += c->tcp.fwd_out.delta[dstport];
- s = tcp_conn_sock(c, af);
+ s1 = tcp_conn_sock(c, af);
} else {
ASSERT(pif == PIF_HOST);
- port += c->tcp.fwd_in.delta[port];
+ dstport += c->tcp.fwd_in.delta[dstport];
- s = tcp_conn_sock_ns(c, af);
+ s1 = tcp_conn_sock_ns(c, af);
}
- if (s < 0) {
- warn("Couldn't open connectable socket for splice (%d)", s);
- return s;
+ if (s1 < 0) {
+ warn("Couldn't open connectable socket for splice (%d)", s1);
+ return s1;
}
- return tcp_splice_connect(c, conn, s, port);
+ return tcp_splice_connect(c, conn, s1, dstport);
}
/**
@@ -408,7 +408,7 @@ static int tcp_splice_new(const struct ctx *c, struct tcp_splice_conn *conn,
* @c: Execution context
* @ref: epoll reference of listening socket
* @flow: flow to initialise
- * @s: Accepted socket
+ * @s0: Accepted (side 0) socket
* @sa: Peer address of connection
*
* Return: true if able to create a spliced connection, false otherwise
@@ -416,28 +416,28 @@ static int tcp_splice_new(const struct ctx *c, struct tcp_splice_conn *conn,
*/
bool tcp_splice_conn_from_sock(const struct ctx *c,
union tcp_listen_epoll_ref ref, union flow *flow,
- int s, const union sockaddr_inany *sa)
+ int s0, const union sockaddr_inany *sa)
{
struct tcp_splice_conn *conn;
- union inany_addr aany;
- in_port_t port;
+ union inany_addr src;
+ in_port_t srcport;
ASSERT(c->mode == MODE_PASTA);
- inany_from_sockaddr(&aany, &port, sa);
- if (!inany_is_loopback(&aany))
+ inany_from_sockaddr(&src, &srcport, sa);
+ if (!inany_is_loopback(&src))
return false;
conn = FLOW_START(flow, FLOW_TCP_SPLICE, tcp_splice, 0);
- conn->flags = inany_v4(&aany) ? 0 : SPLICE_V6;
- conn->s[0] = s;
+ conn->flags = inany_v4(&src) ? 0 : SPLICE_V6;
+ conn->s[0] = s0;
conn->s[1] = -1;
conn->pipe[0][0] = conn->pipe[0][1] = -1;
conn->pipe[1][0] = conn->pipe[1][1] = -1;
- if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int)))
- flow_trace(conn, "failed to set TCP_QUICKACK on %i", s);
+ if (setsockopt(s0, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int)))
+ flow_trace(conn, "failed to set TCP_QUICKACK on %i", s0);
if (tcp_splice_new(c, conn, ref.port, ref.pif))
conn_flag(c, conn, CLOSING);
diff --git a/tcp_splice.h b/tcp_splice.h
index 5a471af0..d0c6ff79 100644
--- a/tcp_splice.h
+++ b/tcp_splice.h
@@ -13,7 +13,7 @@ void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
uint32_t events);
bool tcp_splice_conn_from_sock(const struct ctx *c,
union tcp_listen_epoll_ref ref, union flow *flow,
- int s, const union sockaddr_inany *sa);
+ int s0, const union sockaddr_inany *sa);
void tcp_splice_init(struct ctx *c);
#endif /* TCP_SPLICE_H */
--
@@ -13,7 +13,7 @@ void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
uint32_t events);
bool tcp_splice_conn_from_sock(const struct ctx *c,
union tcp_listen_epoll_ref ref, union flow *flow,
- int s, const union sockaddr_inany *sa);
+ int s0, const union sockaddr_inany *sa);
void tcp_splice_init(struct ctx *c);
#endif /* TCP_SPLICE_H */
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 14/22] tcp_splice: Merge tcp_splice_new() into its caller
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
` (12 preceding siblings ...)
2024-02-06 1:17 ` [PATCH v2 13/22] tcp_splice: More specific variable names in new splice path David Gibson
@ 2024-02-06 1:17 ` David Gibson
2024-02-06 1:17 ` [PATCH v2 15/22] tcp_splice: Make tcp_splice_connect() create its own sockets David Gibson
` (8 subsequent siblings)
22 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2024-02-06 1:17 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
The only caller of tcp_splice_new() is tcp_splice_conn_from_sock(). Both
are quite short, and the division of responsibilities between the two isn't
particularly obvious. Simplify by merging the former into the latter.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp_splice.c | 63 +++++++++++++++++++++-------------------------------
1 file changed, 25 insertions(+), 38 deletions(-)
diff --git a/tcp_splice.c b/tcp_splice.c
index d1263f21..de5fd95c 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -368,41 +368,6 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
return 0;
}
-/**
- * tcp_splice_new() - Handle new spliced connection
- * @c: Execution context
- * @conn: Connection pointer
- * @port: Destination port, host order
- * @pif: Originating pif of the splice
- *
- * Return: return code from connect()
- */
-static int tcp_splice_new(const struct ctx *c, struct tcp_splice_conn *conn,
- in_port_t dstport, uint8_t pif)
-{
- sa_family_t af = CONN_V6(conn) ? AF_INET6 : AF_INET;
- int s1;
-
- if (pif == PIF_SPLICE) {
- dstport += c->tcp.fwd_out.delta[dstport];
-
- s1 = tcp_conn_sock(c, af);
- } else {
- ASSERT(pif == PIF_HOST);
-
- dstport += c->tcp.fwd_in.delta[dstport];
-
- s1 = tcp_conn_sock_ns(c, af);
- }
-
- if (s1 < 0) {
- warn("Couldn't open connectable socket for splice (%d)", s1);
- return s1;
- }
-
- return tcp_splice_connect(c, conn, s1, dstport);
-}
-
/**
* tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection
* @c: Execution context
@@ -418,9 +383,11 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
union tcp_listen_epoll_ref ref, union flow *flow,
int s0, const union sockaddr_inany *sa)
{
+ in_port_t srcport, dstport = ref.port;
struct tcp_splice_conn *conn;
union inany_addr src;
- in_port_t srcport;
+ sa_family_t af;
+ int s1;
ASSERT(c->mode == MODE_PASTA);
@@ -428,9 +395,11 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
if (!inany_is_loopback(&src))
return false;
+ af = inany_v4(&src) ? AF_INET : AF_INET6;
+
conn = FLOW_START(flow, FLOW_TCP_SPLICE, tcp_splice, 0);
- conn->flags = inany_v4(&src) ? 0 : SPLICE_V6;
+ conn->flags = af == AF_INET ? 0 : SPLICE_V6;
conn->s[0] = s0;
conn->s[1] = -1;
conn->pipe[0][0] = conn->pipe[0][1] = -1;
@@ -439,7 +408,25 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
if (setsockopt(s0, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int)))
flow_trace(conn, "failed to set TCP_QUICKACK on %i", s0);
- if (tcp_splice_new(c, conn, ref.port, ref.pif))
+ if (ref.pif == PIF_SPLICE) {
+ dstport += c->tcp.fwd_out.delta[dstport];
+
+ s1 = tcp_conn_sock(c, af);
+ } else {
+ ASSERT(ref.pif == PIF_HOST);
+
+ dstport += c->tcp.fwd_in.delta[dstport];
+
+ s1 = tcp_conn_sock_ns(c, af);
+ }
+
+ if (s1 < 0) {
+ warn("Couldn't open connectable socket for splice (%d)", s1);
+ conn_flag(c, conn, CLOSING);
+ return true;
+ }
+
+ if (tcp_splice_connect(c, conn, s1, dstport))
conn_flag(c, conn, CLOSING);
return true;
--
@@ -368,41 +368,6 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
return 0;
}
-/**
- * tcp_splice_new() - Handle new spliced connection
- * @c: Execution context
- * @conn: Connection pointer
- * @port: Destination port, host order
- * @pif: Originating pif of the splice
- *
- * Return: return code from connect()
- */
-static int tcp_splice_new(const struct ctx *c, struct tcp_splice_conn *conn,
- in_port_t dstport, uint8_t pif)
-{
- sa_family_t af = CONN_V6(conn) ? AF_INET6 : AF_INET;
- int s1;
-
- if (pif == PIF_SPLICE) {
- dstport += c->tcp.fwd_out.delta[dstport];
-
- s1 = tcp_conn_sock(c, af);
- } else {
- ASSERT(pif == PIF_HOST);
-
- dstport += c->tcp.fwd_in.delta[dstport];
-
- s1 = tcp_conn_sock_ns(c, af);
- }
-
- if (s1 < 0) {
- warn("Couldn't open connectable socket for splice (%d)", s1);
- return s1;
- }
-
- return tcp_splice_connect(c, conn, s1, dstport);
-}
-
/**
* tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection
* @c: Execution context
@@ -418,9 +383,11 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
union tcp_listen_epoll_ref ref, union flow *flow,
int s0, const union sockaddr_inany *sa)
{
+ in_port_t srcport, dstport = ref.port;
struct tcp_splice_conn *conn;
union inany_addr src;
- in_port_t srcport;
+ sa_family_t af;
+ int s1;
ASSERT(c->mode == MODE_PASTA);
@@ -428,9 +395,11 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
if (!inany_is_loopback(&src))
return false;
+ af = inany_v4(&src) ? AF_INET : AF_INET6;
+
conn = FLOW_START(flow, FLOW_TCP_SPLICE, tcp_splice, 0);
- conn->flags = inany_v4(&src) ? 0 : SPLICE_V6;
+ conn->flags = af == AF_INET ? 0 : SPLICE_V6;
conn->s[0] = s0;
conn->s[1] = -1;
conn->pipe[0][0] = conn->pipe[0][1] = -1;
@@ -439,7 +408,25 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
if (setsockopt(s0, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int)))
flow_trace(conn, "failed to set TCP_QUICKACK on %i", s0);
- if (tcp_splice_new(c, conn, ref.port, ref.pif))
+ if (ref.pif == PIF_SPLICE) {
+ dstport += c->tcp.fwd_out.delta[dstport];
+
+ s1 = tcp_conn_sock(c, af);
+ } else {
+ ASSERT(ref.pif == PIF_HOST);
+
+ dstport += c->tcp.fwd_in.delta[dstport];
+
+ s1 = tcp_conn_sock_ns(c, af);
+ }
+
+ if (s1 < 0) {
+ warn("Couldn't open connectable socket for splice (%d)", s1);
+ conn_flag(c, conn, CLOSING);
+ return true;
+ }
+
+ if (tcp_splice_connect(c, conn, s1, dstport))
conn_flag(c, conn, CLOSING);
return true;
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 15/22] tcp_splice: Make tcp_splice_connect() create its own sockets
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
` (13 preceding siblings ...)
2024-02-06 1:17 ` [PATCH v2 14/22] tcp_splice: Merge tcp_splice_new() into its caller David Gibson
@ 2024-02-06 1:17 ` David Gibson
2024-02-06 1:17 ` [PATCH v2 16/22] tcp_splice: Improve error reporting on connect path David Gibson
` (7 subsequent siblings)
22 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2024-02-06 1:17 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
Currently creating the connected socket for a splice is split between
tcp_splice_conn_from_sock(), which opens the socket, and
tcp_splice_connect() which connects it. Alter tcp_splice_connect() to
open its own socket based on an address family and pif we pass it.
This does require a second conditional on pif, but makes for a more logical
split of functionality: tcp_splice_conn_from_sock() picks the target,
tcp_splice_connect() creates the connection. While we're there improve
reporting of errors
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp_splice.c | 34 +++++++++++++++++++---------------
1 file changed, 19 insertions(+), 15 deletions(-)
diff --git a/tcp_splice.c b/tcp_splice.c
index de5fd95c..5ba9c8ea 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -319,13 +319,14 @@ static int tcp_splice_connect_finish(const struct ctx *c,
* tcp_splice_connect() - Create and connect socket for new spliced connection
* @c: Execution context
* @conn: Connection pointer
- * @s: Accepted socket
+ * @af: Address family
+ * @pif: pif on which to create socket
* @port: Destination port, host order
*
* Return: 0 for connect() succeeded or in progress, negative value on error
*/
static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
- int sock_conn, in_port_t port)
+ sa_family_t af, uint8_t pif, in_port_t port)
{
struct sockaddr_in6 addr6 = {
.sin6_family = AF_INET6,
@@ -340,7 +341,18 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
const struct sockaddr *sa;
socklen_t sl;
- conn->s[1] = sock_conn;
+ if (pif == PIF_HOST)
+ conn->s[1] = tcp_conn_sock(c, af);
+ else if (pif == PIF_SPLICE)
+ conn->s[1] = tcp_conn_sock_ns(c, af);
+ else
+ ASSERT(0);
+
+ if (conn->s[1] < 0) {
+ warn("Couldn't open connectable socket for splice (%d)",
+ conn->s[1]);
+ return conn->s[1];
+ }
if (setsockopt(conn->s[1], SOL_TCP, TCP_QUICKACK,
&((int){ 1 }), sizeof(int))) {
@@ -387,7 +399,7 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
struct tcp_splice_conn *conn;
union inany_addr src;
sa_family_t af;
- int s1;
+ uint8_t pif1;
ASSERT(c->mode == MODE_PASTA);
@@ -409,24 +421,16 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
flow_trace(conn, "failed to set TCP_QUICKACK on %i", s0);
if (ref.pif == PIF_SPLICE) {
+ pif1 = PIF_HOST;
dstport += c->tcp.fwd_out.delta[dstport];
-
- s1 = tcp_conn_sock(c, af);
} else {
ASSERT(ref.pif == PIF_HOST);
+ pif1 = PIF_SPLICE;
dstport += c->tcp.fwd_in.delta[dstport];
-
- s1 = tcp_conn_sock_ns(c, af);
- }
-
- if (s1 < 0) {
- warn("Couldn't open connectable socket for splice (%d)", s1);
- conn_flag(c, conn, CLOSING);
- return true;
}
- if (tcp_splice_connect(c, conn, s1, dstport))
+ if (tcp_splice_connect(c, conn, af, pif1, dstport))
conn_flag(c, conn, CLOSING);
return true;
--
@@ -319,13 +319,14 @@ static int tcp_splice_connect_finish(const struct ctx *c,
* tcp_splice_connect() - Create and connect socket for new spliced connection
* @c: Execution context
* @conn: Connection pointer
- * @s: Accepted socket
+ * @af: Address family
+ * @pif: pif on which to create socket
* @port: Destination port, host order
*
* Return: 0 for connect() succeeded or in progress, negative value on error
*/
static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
- int sock_conn, in_port_t port)
+ sa_family_t af, uint8_t pif, in_port_t port)
{
struct sockaddr_in6 addr6 = {
.sin6_family = AF_INET6,
@@ -340,7 +341,18 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
const struct sockaddr *sa;
socklen_t sl;
- conn->s[1] = sock_conn;
+ if (pif == PIF_HOST)
+ conn->s[1] = tcp_conn_sock(c, af);
+ else if (pif == PIF_SPLICE)
+ conn->s[1] = tcp_conn_sock_ns(c, af);
+ else
+ ASSERT(0);
+
+ if (conn->s[1] < 0) {
+ warn("Couldn't open connectable socket for splice (%d)",
+ conn->s[1]);
+ return conn->s[1];
+ }
if (setsockopt(conn->s[1], SOL_TCP, TCP_QUICKACK,
&((int){ 1 }), sizeof(int))) {
@@ -387,7 +399,7 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
struct tcp_splice_conn *conn;
union inany_addr src;
sa_family_t af;
- int s1;
+ uint8_t pif1;
ASSERT(c->mode == MODE_PASTA);
@@ -409,24 +421,16 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
flow_trace(conn, "failed to set TCP_QUICKACK on %i", s0);
if (ref.pif == PIF_SPLICE) {
+ pif1 = PIF_HOST;
dstport += c->tcp.fwd_out.delta[dstport];
-
- s1 = tcp_conn_sock(c, af);
} else {
ASSERT(ref.pif == PIF_HOST);
+ pif1 = PIF_SPLICE;
dstport += c->tcp.fwd_in.delta[dstport];
-
- s1 = tcp_conn_sock_ns(c, af);
- }
-
- if (s1 < 0) {
- warn("Couldn't open connectable socket for splice (%d)", s1);
- conn_flag(c, conn, CLOSING);
- return true;
}
- if (tcp_splice_connect(c, conn, s1, dstport))
+ if (tcp_splice_connect(c, conn, af, pif1, dstport))
conn_flag(c, conn, CLOSING);
return true;
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 16/22] tcp_splice: Improve error reporting on connect path
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
` (14 preceding siblings ...)
2024-02-06 1:17 ` [PATCH v2 15/22] tcp_splice: Make tcp_splice_connect() create its own sockets David Gibson
@ 2024-02-06 1:17 ` David Gibson
2024-02-18 21:01 ` Stefano Brivio
2024-02-06 1:17 ` [PATCH v2 17/22] tcp_splice: Improve logic deciding when to splice David Gibson
` (6 subsequent siblings)
22 siblings, 1 reply; 38+ messages in thread
From: David Gibson @ 2024-02-06 1:17 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
This makes a number of changes to improve error reporting while connecting
a new spliced socket:
* We use flow_err() and similar functions so all messages include info
on which specific flow was affected
* We use strerror() to interpret raw error values
* We now report errors on connection (at "trace" level, since this would
allow spamming the logs)
* We also look up and report some details on EPOLLERR events, which can
include connection errors, since we use a non-blocking connect(). Again
we use "trace" level since this can spam the logs.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp_splice.c | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)
diff --git a/tcp_splice.c b/tcp_splice.c
index 5ba9c8ea..49075e5c 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -349,8 +349,9 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
ASSERT(0);
if (conn->s[1] < 0) {
- warn("Couldn't open connectable socket for splice (%d)",
- conn->s[1]);
+ flow_err(conn,
+ "Couldn't open connectable socket for splice: %s",
+ strerror(-conn->s[1]));
return conn->s[1];
}
@@ -369,8 +370,11 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
}
if (connect(conn->s[1], sa, sl)) {
- if (errno != EINPROGRESS)
+ if (errno != EINPROGRESS) {
+ flow_trace(conn, "Couldn't connect socket for splice: %s",
+ strerror(errno));
return -errno;
+ }
conn_event(c, conn, SPLICE_CONNECT);
} else {
conn_event(c, conn, SPLICE_ESTABLISHED);
@@ -457,8 +461,20 @@ void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
if (conn->events == SPLICE_CLOSED)
return;
- if (events & EPOLLERR)
+ if (events & EPOLLERR) {
+ int err, rc;
+ socklen_t sl = sizeof(err);
+
+ rc = getsockopt(ref.fd, SOL_SOCKET, SO_ERROR, &err, &sl);
+ if (rc)
+ flow_err(conn, "Error retrieving SO_ERROR: %s",
+ strerror(errno));
+ else
+ flow_trace(conn, "Error event on socket: %s",
+ strerror(err));
+
goto close;
+ }
if (conn->events == SPLICE_CONNECT) {
if (!(events & EPOLLOUT))
--
@@ -349,8 +349,9 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
ASSERT(0);
if (conn->s[1] < 0) {
- warn("Couldn't open connectable socket for splice (%d)",
- conn->s[1]);
+ flow_err(conn,
+ "Couldn't open connectable socket for splice: %s",
+ strerror(-conn->s[1]));
return conn->s[1];
}
@@ -369,8 +370,11 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
}
if (connect(conn->s[1], sa, sl)) {
- if (errno != EINPROGRESS)
+ if (errno != EINPROGRESS) {
+ flow_trace(conn, "Couldn't connect socket for splice: %s",
+ strerror(errno));
return -errno;
+ }
conn_event(c, conn, SPLICE_CONNECT);
} else {
conn_event(c, conn, SPLICE_ESTABLISHED);
@@ -457,8 +461,20 @@ void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
if (conn->events == SPLICE_CLOSED)
return;
- if (events & EPOLLERR)
+ if (events & EPOLLERR) {
+ int err, rc;
+ socklen_t sl = sizeof(err);
+
+ rc = getsockopt(ref.fd, SOL_SOCKET, SO_ERROR, &err, &sl);
+ if (rc)
+ flow_err(conn, "Error retrieving SO_ERROR: %s",
+ strerror(errno));
+ else
+ flow_trace(conn, "Error event on socket: %s",
+ strerror(err));
+
goto close;
+ }
if (conn->events == SPLICE_CONNECT) {
if (!(events & EPOLLOUT))
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 17/22] tcp_splice: Improve logic deciding when to splice
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
` (15 preceding siblings ...)
2024-02-06 1:17 ` [PATCH v2 16/22] tcp_splice: Improve error reporting on connect path David Gibson
@ 2024-02-06 1:17 ` David Gibson
2024-02-06 1:17 ` [PATCH v2 18/22] tcp, tcp_splice: Parse listening socket epoll ref in tcp_listen_handler() David Gibson
` (5 subsequent siblings)
22 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2024-02-06 1:17 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
This makes several tweaks to improve the logic which decides whether we're
able to use the splice method for a new connection.
* Rather than only calling tcp_splice_conn_from_sock() in pasta mode, we
check for pasta mode within it, better localising the checks.
* Previously if we got a connection from a non-loopback address we'd
always fall back to the "tap" path, even if the connection was on a
socket in the namespace. If we did get a non-loopback address on a
namespace socket, something has gone wrong and the "tap" path certainly
won't be able to handle it. Report the error and close, rather than
passing it along to tap.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
inany.c | 1 -
tcp.c | 3 +--
tcp_splice.c | 48 ++++++++++++++++++++++++++++++++++--------------
3 files changed, 35 insertions(+), 17 deletions(-)
diff --git a/inany.c b/inany.c
index c11e2aa9..1c165b14 100644
--- a/inany.c
+++ b/inany.c
@@ -39,7 +39,6 @@ const union inany_addr inany_any4 = {
*
* Return: On success, a non-null pointer to @dst, NULL on failure
*/
-/* cppcheck-suppress unusedFunction */
const char *inany_ntop(const union inany_addr *src, char *dst, socklen_t size)
{
const struct in_addr *v4 = inany_v4(src);
diff --git a/tcp.c b/tcp.c
index c06d1cc4..d61fb17b 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2733,8 +2733,7 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
if (s < 0)
goto cancel;
- if (c->mode == MODE_PASTA &&
- tcp_splice_conn_from_sock(c, ref.tcp_listen, flow, s, &sa))
+ if (tcp_splice_conn_from_sock(c, ref.tcp_listen, flow, s, &sa))
return;
tcp_tap_conn_from_sock(c, ref.tcp_listen, flow, s, &sa, now);
diff --git a/tcp_splice.c b/tcp_splice.c
index 49075e5c..1937850f 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -405,14 +405,44 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
sa_family_t af;
uint8_t pif1;
- ASSERT(c->mode == MODE_PASTA);
-
- inany_from_sockaddr(&src, &srcport, sa);
- if (!inany_is_loopback(&src))
+ if (c->mode != MODE_PASTA)
return false;
+ inany_from_sockaddr(&src, &srcport, sa);
af = inany_v4(&src) ? AF_INET : AF_INET6;
+ switch (ref.pif) {
+ case PIF_SPLICE:
+ if (!inany_is_loopback(&src)) {
+ char str[INANY_ADDRSTRLEN];
+
+ /* We can't use flow_err() etc. because we haven't set
+ * the flow type yet
+ */
+ warn("Bad source address %s for splice, closing",
+ inany_ntop(&src, str, sizeof(str)));
+
+ /* We *don't* want to fall back to tap */
+ flow_alloc_cancel(flow);
+ return true;
+ }
+
+ pif1 = PIF_HOST;
+ dstport += c->tcp.fwd_out.delta[dstport];
+ break;
+
+ case PIF_HOST:
+ if (!inany_is_loopback(&src))
+ return false;
+
+ pif1 = PIF_SPLICE;
+ dstport += c->tcp.fwd_in.delta[dstport];
+ break;
+
+ default:
+ return false;
+ }
+
conn = FLOW_START(flow, FLOW_TCP_SPLICE, tcp_splice, 0);
conn->flags = af == AF_INET ? 0 : SPLICE_V6;
@@ -424,16 +454,6 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
if (setsockopt(s0, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int)))
flow_trace(conn, "failed to set TCP_QUICKACK on %i", s0);
- if (ref.pif == PIF_SPLICE) {
- pif1 = PIF_HOST;
- dstport += c->tcp.fwd_out.delta[dstport];
- } else {
- ASSERT(ref.pif == PIF_HOST);
-
- pif1 = PIF_SPLICE;
- dstport += c->tcp.fwd_in.delta[dstport];
- }
-
if (tcp_splice_connect(c, conn, af, pif1, dstport))
conn_flag(c, conn, CLOSING);
--
@@ -405,14 +405,44 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
sa_family_t af;
uint8_t pif1;
- ASSERT(c->mode == MODE_PASTA);
-
- inany_from_sockaddr(&src, &srcport, sa);
- if (!inany_is_loopback(&src))
+ if (c->mode != MODE_PASTA)
return false;
+ inany_from_sockaddr(&src, &srcport, sa);
af = inany_v4(&src) ? AF_INET : AF_INET6;
+ switch (ref.pif) {
+ case PIF_SPLICE:
+ if (!inany_is_loopback(&src)) {
+ char str[INANY_ADDRSTRLEN];
+
+ /* We can't use flow_err() etc. because we haven't set
+ * the flow type yet
+ */
+ warn("Bad source address %s for splice, closing",
+ inany_ntop(&src, str, sizeof(str)));
+
+ /* We *don't* want to fall back to tap */
+ flow_alloc_cancel(flow);
+ return true;
+ }
+
+ pif1 = PIF_HOST;
+ dstport += c->tcp.fwd_out.delta[dstport];
+ break;
+
+ case PIF_HOST:
+ if (!inany_is_loopback(&src))
+ return false;
+
+ pif1 = PIF_SPLICE;
+ dstport += c->tcp.fwd_in.delta[dstport];
+ break;
+
+ default:
+ return false;
+ }
+
conn = FLOW_START(flow, FLOW_TCP_SPLICE, tcp_splice, 0);
conn->flags = af == AF_INET ? 0 : SPLICE_V6;
@@ -424,16 +454,6 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
if (setsockopt(s0, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int)))
flow_trace(conn, "failed to set TCP_QUICKACK on %i", s0);
- if (ref.pif == PIF_SPLICE) {
- pif1 = PIF_HOST;
- dstport += c->tcp.fwd_out.delta[dstport];
- } else {
- ASSERT(ref.pif == PIF_HOST);
-
- pif1 = PIF_SPLICE;
- dstport += c->tcp.fwd_in.delta[dstport];
- }
-
if (tcp_splice_connect(c, conn, af, pif1, dstport))
conn_flag(c, conn, CLOSING);
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 18/22] tcp, tcp_splice: Parse listening socket epoll ref in tcp_listen_handler()
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
` (16 preceding siblings ...)
2024-02-06 1:17 ` [PATCH v2 17/22] tcp_splice: Improve logic deciding when to splice David Gibson
@ 2024-02-06 1:17 ` David Gibson
2024-02-06 1:17 ` [PATCH v2 19/22] tcp: Validate TCP endpoint addresses David Gibson
` (4 subsequent siblings)
22 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2024-02-06 1:17 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
tcp_listen_handler() uses the epoll reference for the listening socket it
handles, and also passes on one variant of it to tcp_tap_conn_from_sock()
and tcp_splice_conn_from_sock(). The latter two functions only need a
couple of specific fields from the reference.
Pass those specific values instead of the whole reference, which localises
the handling of the listening (as opposed to accepted) socket and its
reference entirely within tcp_listen_handler().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp.c | 12 ++++++------
tcp_splice.c | 12 +++++++-----
tcp_splice.h | 5 +++--
3 files changed, 16 insertions(+), 13 deletions(-)
diff --git a/tcp.c b/tcp.c
index d61fb17b..31d4e87b 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2675,14 +2675,13 @@ static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr)
/**
* tcp_tap_conn_from_sock() - Initialize state for non-spliced connection
* @c: Execution context
- * @ref: epoll reference of listening socket
+ * @dstport: Destination port for connection (host side)
* @flow: flow to initialise
* @s: Accepted socket
* @sa: Peer socket address (from accept())
* @now: Current timestamp
*/
-static void tcp_tap_conn_from_sock(struct ctx *c,
- union tcp_listen_epoll_ref ref,
+static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport,
union flow *flow, int s,
const union sockaddr_inany *sa,
const struct timespec *now)
@@ -2695,7 +2694,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c,
conn_event(c, conn, SOCK_ACCEPTED);
inany_from_sockaddr(&conn->faddr, &conn->fport, sa);
- conn->eport = ref.port + c->tcp.fwd_in.delta[ref.port];
+ conn->eport = dstport + c->tcp.fwd_in.delta[dstport];
tcp_snat_inbound(c, &conn->faddr);
@@ -2733,10 +2732,11 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
if (s < 0)
goto cancel;
- if (tcp_splice_conn_from_sock(c, ref.tcp_listen, flow, s, &sa))
+ if (tcp_splice_conn_from_sock(c, ref.tcp_listen.pif,
+ ref.tcp_listen.port, flow, s, &sa))
return;
- tcp_tap_conn_from_sock(c, ref.tcp_listen, flow, s, &sa, now);
+ tcp_tap_conn_from_sock(c, ref.tcp_listen.port, flow, s, &sa, now);
return;
cancel:
diff --git a/tcp_splice.c b/tcp_splice.c
index 1937850f..97eecc5b 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -387,7 +387,8 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
/**
* tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection
* @c: Execution context
- * @ref: epoll reference of listening socket
+ * @pif0: pif id of side 0
+ * @dstport: Side 0 destination port of connection
* @flow: flow to initialise
* @s0: Accepted (side 0) socket
* @sa: Peer address of connection
@@ -396,12 +397,13 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
* #syscalls:pasta setsockopt
*/
bool tcp_splice_conn_from_sock(const struct ctx *c,
- union tcp_listen_epoll_ref ref, union flow *flow,
- int s0, const union sockaddr_inany *sa)
+ uint8_t pif0, in_port_t dstport,
+ union flow *flow, int s0,
+ const union sockaddr_inany *sa)
{
- in_port_t srcport, dstport = ref.port;
struct tcp_splice_conn *conn;
union inany_addr src;
+ in_port_t srcport;
sa_family_t af;
uint8_t pif1;
@@ -411,7 +413,7 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
inany_from_sockaddr(&src, &srcport, sa);
af = inany_v4(&src) ? AF_INET : AF_INET6;
- switch (ref.pif) {
+ switch (pif0) {
case PIF_SPLICE:
if (!inany_is_loopback(&src)) {
char str[INANY_ADDRSTRLEN];
diff --git a/tcp_splice.h b/tcp_splice.h
index d0c6ff79..ed8f0c58 100644
--- a/tcp_splice.h
+++ b/tcp_splice.h
@@ -12,8 +12,9 @@ union sockaddr_inany;
void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
uint32_t events);
bool tcp_splice_conn_from_sock(const struct ctx *c,
- union tcp_listen_epoll_ref ref, union flow *flow,
- int s0, const union sockaddr_inany *sa);
+ uint8_t pif0, in_port_t dstport,
+ union flow *flow, int s0,
+ const union sockaddr_inany *sa);
void tcp_splice_init(struct ctx *c);
#endif /* TCP_SPLICE_H */
--
@@ -12,8 +12,9 @@ union sockaddr_inany;
void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref,
uint32_t events);
bool tcp_splice_conn_from_sock(const struct ctx *c,
- union tcp_listen_epoll_ref ref, union flow *flow,
- int s0, const union sockaddr_inany *sa);
+ uint8_t pif0, in_port_t dstport,
+ union flow *flow, int s0,
+ const union sockaddr_inany *sa);
void tcp_splice_init(struct ctx *c);
#endif /* TCP_SPLICE_H */
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 19/22] tcp: Validate TCP endpoint addresses
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
` (17 preceding siblings ...)
2024-02-06 1:17 ` [PATCH v2 18/22] tcp, tcp_splice: Parse listening socket epoll ref in tcp_listen_handler() David Gibson
@ 2024-02-06 1:17 ` David Gibson
2024-02-22 12:45 ` Stefano Brivio
2024-02-06 1:17 ` [PATCH v2 20/22] tap: Disallow loopback addresses on tap interface David Gibson
` (3 subsequent siblings)
22 siblings, 1 reply; 38+ messages in thread
From: David Gibson @ 2024-02-06 1:17 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
TCP connections should typically not have wildcard addresses (0.0.0.0 or
::) nor a zero port number for either endpoint. It's not entirely clear
(at least to me) if it's strictly against the RFCs to do so, but at any
rate the socket interfaces often treat those values specially[1], so it's
not really possible to manipulate such connections. Likewise they should
not have broadcast or multicast addresses for either endpoint.
However, nothing prevents a guest from creating a SYN packet with such
values, and it's not entirely clear what the effect on passt would be. To
ensure sane behaviour, explicitly check for this case and drop such
packets, logging a debug warning (we don't want a higher level, because
that would allow a guest to spam the logs).
We never expect such an address on an accept()ed socket either, but just
in case, check for it as well.
[1] Depending on context as "unknown", "match any" or "kernel, pick
something for me"
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
tcp.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 67 insertions(+), 7 deletions(-)
diff --git a/tcp.c b/tcp.c
index 31d4e87b..236a8d23 100644
--- a/tcp.c
+++ b/tcp.c
@@ -284,6 +284,7 @@
#include <sys/types.h>
#include <sys/uio.h>
#include <time.h>
+#include <arpa/inet.h>
#include <linux/tcp.h> /* For struct tcp_info */
@@ -1930,27 +1931,59 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
const struct tcphdr *th, const char *opts,
size_t optlen, const struct timespec *now)
{
+ in_port_t srcport = ntohs(th->source);
+ in_port_t dstport = ntohs(th->dest);
struct sockaddr_in addr4 = {
.sin_family = AF_INET,
- .sin_port = th->dest,
+ .sin_port = htons(dstport),
.sin_addr = *(struct in_addr *)daddr,
};
struct sockaddr_in6 addr6 = {
.sin6_family = AF_INET6,
- .sin6_port = th->dest,
+ .sin6_port = htons(dstport),
.sin6_addr = *(struct in6_addr *)daddr,
};
const struct sockaddr *sa;
struct tcp_tap_conn *conn;
union flow *flow;
+ int s = -1, mss;
socklen_t sl;
- int s, mss;
-
- (void)saddr;
if (!(flow = flow_alloc()))
return;
+ if (af == AF_INET) {
+ if (IN4_IS_ADDR_UNSPECIFIED(saddr) ||
+ IN4_IS_ADDR_BROADCAST(saddr) ||
+ IN4_IS_ADDR_MULTICAST(saddr) || srcport == 0 ||
+ IN4_IS_ADDR_UNSPECIFIED(daddr) ||
+ IN4_IS_ADDR_BROADCAST(daddr) ||
+ IN4_IS_ADDR_MULTICAST(daddr) || dstport == 0) {
+ char sstr[INET_ADDRSTRLEN], dstr[INET_ADDRSTRLEN];
+
+ debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu",
+ inet_ntop(AF_INET, saddr, sstr, sizeof(sstr)),
+ srcport,
+ inet_ntop(AF_INET, daddr, dstr, sizeof(dstr)),
+ dstport);
+ goto cancel;
+ }
+ } else if (af == AF_INET6) {
+ if (IN6_IS_ADDR_UNSPECIFIED(saddr) ||
+ IN6_IS_ADDR_MULTICAST(saddr) || srcport == 0 ||
+ IN6_IS_ADDR_UNSPECIFIED(daddr) ||
+ IN6_IS_ADDR_MULTICAST(daddr) || dstport == 0) {
+ char sstr[INET6_ADDRSTRLEN], dstr[INET6_ADDRSTRLEN];
+
+ debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu",
+ inet_ntop(AF_INET6, saddr, sstr, sizeof(sstr)),
+ srcport,
+ inet_ntop(AF_INET6, daddr, dstr, sizeof(dstr)),
+ dstport);
+ goto cancel;
+ }
+ }
+
if ((s = tcp_conn_sock(c, af)) < 0)
goto cancel;
@@ -2001,8 +2034,8 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
sl = sizeof(addr6);
}
- conn->fport = ntohs(th->dest);
- conn->eport = ntohs(th->source);
+ conn->fport = dstport;
+ conn->eport = srcport;
conn->seq_init_from_tap = ntohl(th->seq);
conn->seq_from_tap = conn->seq_init_from_tap + 1;
@@ -2732,6 +2765,33 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
if (s < 0)
goto cancel;
+ if (sa.sa_family == AF_INET) {
+ const struct in_addr *addr = &sa.sa4.sin_addr;
+ in_port_t port = sa.sa4.sin_port;
+
+ if (IN4_IS_ADDR_UNSPECIFIED(addr) ||
+ IN4_IS_ADDR_BROADCAST(addr) ||
+ IN4_IS_ADDR_MULTICAST(addr) || port == 0) {
+ char str[INET_ADDRSTRLEN];
+
+ err("Invalid endpoint from TCP accept(): %s:%hu",
+ inet_ntop(AF_INET, addr, str, sizeof(str)), port);
+ goto cancel;
+ }
+ } else if (sa.sa_family == AF_INET6) {
+ const struct in6_addr *addr = &sa.sa6.sin6_addr;
+ in_port_t port = sa.sa6.sin6_port;
+
+ if (IN6_IS_ADDR_UNSPECIFIED(addr) ||
+ IN6_IS_ADDR_MULTICAST(addr) || port == 0) {
+ char str[INET6_ADDRSTRLEN];
+
+ err("Invalid endpoint from TCP accept(): %s:%hu",
+ inet_ntop(AF_INET6, addr, str, sizeof(str)), port);
+ goto cancel;
+ }
+ }
+
if (tcp_splice_conn_from_sock(c, ref.tcp_listen.pif,
ref.tcp_listen.port, flow, s, &sa))
return;
--
@@ -284,6 +284,7 @@
#include <sys/types.h>
#include <sys/uio.h>
#include <time.h>
+#include <arpa/inet.h>
#include <linux/tcp.h> /* For struct tcp_info */
@@ -1930,27 +1931,59 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
const struct tcphdr *th, const char *opts,
size_t optlen, const struct timespec *now)
{
+ in_port_t srcport = ntohs(th->source);
+ in_port_t dstport = ntohs(th->dest);
struct sockaddr_in addr4 = {
.sin_family = AF_INET,
- .sin_port = th->dest,
+ .sin_port = htons(dstport),
.sin_addr = *(struct in_addr *)daddr,
};
struct sockaddr_in6 addr6 = {
.sin6_family = AF_INET6,
- .sin6_port = th->dest,
+ .sin6_port = htons(dstport),
.sin6_addr = *(struct in6_addr *)daddr,
};
const struct sockaddr *sa;
struct tcp_tap_conn *conn;
union flow *flow;
+ int s = -1, mss;
socklen_t sl;
- int s, mss;
-
- (void)saddr;
if (!(flow = flow_alloc()))
return;
+ if (af == AF_INET) {
+ if (IN4_IS_ADDR_UNSPECIFIED(saddr) ||
+ IN4_IS_ADDR_BROADCAST(saddr) ||
+ IN4_IS_ADDR_MULTICAST(saddr) || srcport == 0 ||
+ IN4_IS_ADDR_UNSPECIFIED(daddr) ||
+ IN4_IS_ADDR_BROADCAST(daddr) ||
+ IN4_IS_ADDR_MULTICAST(daddr) || dstport == 0) {
+ char sstr[INET_ADDRSTRLEN], dstr[INET_ADDRSTRLEN];
+
+ debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu",
+ inet_ntop(AF_INET, saddr, sstr, sizeof(sstr)),
+ srcport,
+ inet_ntop(AF_INET, daddr, dstr, sizeof(dstr)),
+ dstport);
+ goto cancel;
+ }
+ } else if (af == AF_INET6) {
+ if (IN6_IS_ADDR_UNSPECIFIED(saddr) ||
+ IN6_IS_ADDR_MULTICAST(saddr) || srcport == 0 ||
+ IN6_IS_ADDR_UNSPECIFIED(daddr) ||
+ IN6_IS_ADDR_MULTICAST(daddr) || dstport == 0) {
+ char sstr[INET6_ADDRSTRLEN], dstr[INET6_ADDRSTRLEN];
+
+ debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu",
+ inet_ntop(AF_INET6, saddr, sstr, sizeof(sstr)),
+ srcport,
+ inet_ntop(AF_INET6, daddr, dstr, sizeof(dstr)),
+ dstport);
+ goto cancel;
+ }
+ }
+
if ((s = tcp_conn_sock(c, af)) < 0)
goto cancel;
@@ -2001,8 +2034,8 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
sl = sizeof(addr6);
}
- conn->fport = ntohs(th->dest);
- conn->eport = ntohs(th->source);
+ conn->fport = dstport;
+ conn->eport = srcport;
conn->seq_init_from_tap = ntohl(th->seq);
conn->seq_from_tap = conn->seq_init_from_tap + 1;
@@ -2732,6 +2765,33 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
if (s < 0)
goto cancel;
+ if (sa.sa_family == AF_INET) {
+ const struct in_addr *addr = &sa.sa4.sin_addr;
+ in_port_t port = sa.sa4.sin_port;
+
+ if (IN4_IS_ADDR_UNSPECIFIED(addr) ||
+ IN4_IS_ADDR_BROADCAST(addr) ||
+ IN4_IS_ADDR_MULTICAST(addr) || port == 0) {
+ char str[INET_ADDRSTRLEN];
+
+ err("Invalid endpoint from TCP accept(): %s:%hu",
+ inet_ntop(AF_INET, addr, str, sizeof(str)), port);
+ goto cancel;
+ }
+ } else if (sa.sa_family == AF_INET6) {
+ const struct in6_addr *addr = &sa.sa6.sin6_addr;
+ in_port_t port = sa.sa6.sin6_port;
+
+ if (IN6_IS_ADDR_UNSPECIFIED(addr) ||
+ IN6_IS_ADDR_MULTICAST(addr) || port == 0) {
+ char str[INET6_ADDRSTRLEN];
+
+ err("Invalid endpoint from TCP accept(): %s:%hu",
+ inet_ntop(AF_INET6, addr, str, sizeof(str)), port);
+ goto cancel;
+ }
+ }
+
if (tcp_splice_conn_from_sock(c, ref.tcp_listen.pif,
ref.tcp_listen.port, flow, s, &sa))
return;
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 20/22] tap: Disallow loopback addresses on tap interface
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
` (18 preceding siblings ...)
2024-02-06 1:17 ` [PATCH v2 19/22] tcp: Validate TCP endpoint addresses David Gibson
@ 2024-02-06 1:17 ` David Gibson
2024-02-06 1:17 ` [PATCH v2 21/22] port_fwd: Fix copypasta error in port_fwd_scan_udp() comments David Gibson
` (2 subsequent siblings)
22 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2024-02-06 1:17 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
The "tap" interface, whether it's actually a tuntap device or a qemu
socket, presents a virtual external link between different network hosts.
Hence, loopback addresses make no sense there. However, nothing prevents
the guest from putting bogus packets with loopback addresses onto the
interface and it's not entirely clear what effect that will have on passt.
Explicitly test for such packets and drop them.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
tap.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/tap.c b/tap.c
index 396dee7e..9c839abe 100644
--- a/tap.c
+++ b/tap.c
@@ -633,6 +633,16 @@ resume:
l4_len = htons(iph->tot_len) - hlen;
+ if (IN4_IS_ADDR_LOOPBACK(&iph->saddr) ||
+ IN4_IS_ADDR_LOOPBACK(&iph->daddr)) {
+ char sstr[INET_ADDRSTRLEN], dstr[INET_ADDRSTRLEN];
+
+ debug("Loopback address on tap interface: %s -> %s",
+ inet_ntop(AF_INET, &iph->saddr, sstr, sizeof(sstr)),
+ inet_ntop(AF_INET, &iph->daddr, dstr, sizeof(dstr)));
+ continue;
+ }
+
if (iph->saddr && c->ip4.addr_seen.s_addr != iph->saddr)
c->ip4.addr_seen.s_addr = iph->saddr;
@@ -789,6 +799,15 @@ resume:
if (!(l4h = ipv6_l4hdr(in, i, sizeof(*eh), &proto, &l4_len)))
continue;
+ if (IN6_IS_ADDR_LOOPBACK(saddr) || IN6_IS_ADDR_LOOPBACK(daddr)) {
+ char sstr[INET6_ADDRSTRLEN], dstr[INET6_ADDRSTRLEN];
+
+ debug("Loopback address on tap interface: %s -> %s",
+ inet_ntop(AF_INET6, saddr, sstr, sizeof(sstr)),
+ inet_ntop(AF_INET6, daddr, dstr, sizeof(dstr)));
+ continue;
+ }
+
if (IN6_IS_ADDR_LINKLOCAL(saddr)) {
c->ip6.addr_ll_seen = *saddr;
--
@@ -633,6 +633,16 @@ resume:
l4_len = htons(iph->tot_len) - hlen;
+ if (IN4_IS_ADDR_LOOPBACK(&iph->saddr) ||
+ IN4_IS_ADDR_LOOPBACK(&iph->daddr)) {
+ char sstr[INET_ADDRSTRLEN], dstr[INET_ADDRSTRLEN];
+
+ debug("Loopback address on tap interface: %s -> %s",
+ inet_ntop(AF_INET, &iph->saddr, sstr, sizeof(sstr)),
+ inet_ntop(AF_INET, &iph->daddr, dstr, sizeof(dstr)));
+ continue;
+ }
+
if (iph->saddr && c->ip4.addr_seen.s_addr != iph->saddr)
c->ip4.addr_seen.s_addr = iph->saddr;
@@ -789,6 +799,15 @@ resume:
if (!(l4h = ipv6_l4hdr(in, i, sizeof(*eh), &proto, &l4_len)))
continue;
+ if (IN6_IS_ADDR_LOOPBACK(saddr) || IN6_IS_ADDR_LOOPBACK(daddr)) {
+ char sstr[INET6_ADDRSTRLEN], dstr[INET6_ADDRSTRLEN];
+
+ debug("Loopback address on tap interface: %s -> %s",
+ inet_ntop(AF_INET6, saddr, sstr, sizeof(sstr)),
+ inet_ntop(AF_INET6, daddr, dstr, sizeof(dstr)));
+ continue;
+ }
+
if (IN6_IS_ADDR_LINKLOCAL(saddr)) {
c->ip6.addr_ll_seen = *saddr;
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 21/22] port_fwd: Fix copypasta error in port_fwd_scan_udp() comments
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
` (19 preceding siblings ...)
2024-02-06 1:17 ` [PATCH v2 20/22] tap: Disallow loopback addresses on tap interface David Gibson
@ 2024-02-06 1:17 ` David Gibson
2024-02-06 1:17 ` [PATCH v2 22/22] fwd: Rename port_fwd.[ch] and their contents David Gibson
2024-02-27 14:22 ` [PATCH v2 00/22] More flow table preliminaries: address handling improvements Stefano Brivio
22 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2024-02-06 1:17 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
port_fwd_scan_udp() handles UDP, as the name suggests, but its function
comment has the wrong function name and references TCP, due to a bad
copy/pasta from port_fwd_scan_tcp().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
port_fwd.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/port_fwd.c b/port_fwd.c
index 6f6c836c..a7121fe2 100644
--- a/port_fwd.c
+++ b/port_fwd.c
@@ -85,7 +85,7 @@ void port_fwd_scan_tcp(struct port_fwd *fwd, const struct port_fwd *rev)
}
/**
- * port_fwd_scan_tcp() - Scan /proc to update TCP forwarding map
+ * port_fwd_scan_udp() - Scan /proc to update UDP forwarding map
* @fwd: Forwarding information to update
* @rev: Forwarding information for the reverse direction
* @tcp_fwd: Corresponding TCP forwarding information
--
@@ -85,7 +85,7 @@ void port_fwd_scan_tcp(struct port_fwd *fwd, const struct port_fwd *rev)
}
/**
- * port_fwd_scan_tcp() - Scan /proc to update TCP forwarding map
+ * port_fwd_scan_udp() - Scan /proc to update UDP forwarding map
* @fwd: Forwarding information to update
* @rev: Forwarding information for the reverse direction
* @tcp_fwd: Corresponding TCP forwarding information
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 22/22] fwd: Rename port_fwd.[ch] and their contents
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
` (20 preceding siblings ...)
2024-02-06 1:17 ` [PATCH v2 21/22] port_fwd: Fix copypasta error in port_fwd_scan_udp() comments David Gibson
@ 2024-02-06 1:17 ` David Gibson
2024-02-27 14:22 ` [PATCH v2 00/22] More flow table preliminaries: address handling improvements Stefano Brivio
22 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2024-02-06 1:17 UTC (permalink / raw)
To: passt-dev, Stefano Brivio; +Cc: David Gibson
Currently port_fwd.[ch] contains helpers related to port forwarding,
particular automatic port forwarding. We're planning to allow much more
flexible sorts of forwarding, including both port translation and NAT based
on the flow table. This will subsume the existing port forwarding logic,
so rename port_fwd.[ch] to fwd.[ch] with matching updates to all the names
within.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
# Conflicts:
# Makefile
---
Makefile | 12 ++++++------
conf.c | 8 ++++----
port_fwd.c => fwd.c | 32 ++++++++++++++++----------------
port_fwd.h => fwd.h | 24 ++++++++++++------------
passt.h | 2 +-
tcp.c | 4 ++--
tcp.h | 4 ++--
udp.c | 10 +++++-----
udp.h | 10 +++++-----
9 files changed, 53 insertions(+), 53 deletions(-)
rename port_fwd.c => fwd.c (83%)
rename port_fwd.h => fwd.h (62%)
diff --git a/Makefile b/Makefile
index 1c709229..04c62f9d 100644
--- a/Makefile
+++ b/Makefile
@@ -44,9 +44,9 @@ FLAGS += -DARCH=\"$(TARGET_ARCH)\"
FLAGS += -DVERSION=\"$(VERSION)\"
FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
-PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c icmp.c \
- igmp.c inany.c isolation.c lineread.c log.c mld.c ndp.c netlink.c \
- packet.c passt.c pasta.c pcap.c pif.c port_fwd.c tap.c tcp.c \
+PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
+ icmp.c igmp.c inany.c isolation.c lineread.c log.c mld.c ndp.c \
+ netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
tcp_splice.c udp.c util.c
QRAP_SRCS = qrap.c
SRCS = $(PASST_SRCS) $(QRAP_SRCS)
@@ -54,9 +54,9 @@ SRCS = $(PASST_SRCS) $(QRAP_SRCS)
MANPAGES = passt.1 pasta.1 qrap.1
PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h \
- flow_table.h icmp.h inany.h isolation.h lineread.h log.h ndp.h \
- netlink.h packet.h passt.h pasta.h pcap.h pif.h port_fwd.h siphash.h \
- tap.h tcp.h tcp_conn.h tcp_splice.h udp.h util.h
+ flow_table.h fwd.h icmp.h inany.h isolation.h lineread.h log.h ndp.h \
+ netlink.h packet.h passt.h pasta.h pcap.h pif.h siphash.h tap.h tcp.h \
+ tcp_conn.h tcp_splice.h udp.h util.h
HEADERS = $(PASST_HEADERS) seccomp.h
C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
diff --git a/conf.c b/conf.c
index 5e15b665..7ae8fc9c 100644
--- a/conf.c
+++ b/conf.c
@@ -109,10 +109,10 @@ static int parse_port_range(const char *s, char **endptr,
* @c: Execution context
* @optname: Short option name, t, T, u, or U
* @optarg: Option argument (port specification)
- * @fwd: Pointer to @port_fwd to be updated
+ * @fwd: Pointer to @fwd_ports to be updated
*/
static void conf_ports(const struct ctx *c, char optname, const char *optarg,
- struct port_fwd *fwd)
+ struct fwd_ports *fwd)
{
char addr_buf[sizeof(struct in6_addr)] = { 0 }, *addr = addr_buf;
char buf[BUFSIZ], *spec, *ifname = NULL, *p;
@@ -1172,7 +1172,7 @@ void conf(struct ctx *c, int argc, char **argv)
};
char userns[PATH_MAX] = { 0 }, netns[PATH_MAX] = { 0 };
bool copy_addrs_opt = false, copy_routes_opt = false;
- enum port_fwd_mode fwd_default = FWD_NONE;
+ enum fwd_ports_mode fwd_default = FWD_NONE;
bool v4_only = false, v6_only = false;
struct in6_addr *dns6 = c->ip6.dns;
struct fqdn *dnss = c->dns_search;
@@ -1750,7 +1750,7 @@ void conf(struct ctx *c, int argc, char **argv)
if (!c->udp.fwd_out.f.mode)
c->udp.fwd_out.f.mode = fwd_default;
- port_fwd_init(c);
+ fwd_scan_ports_init(c);
if (!c->quiet)
conf_print(c);
diff --git a/port_fwd.c b/fwd.c
similarity index 83%
rename from port_fwd.c
rename to fwd.c
index a7121fe2..09650b26 100644
--- a/port_fwd.c
+++ b/fwd.c
@@ -6,7 +6,7 @@
* PASTA - Pack A Subtle Tap Abstraction
* for network namespace/tap device mode
*
- * port_fwd.c - Port forwarding helpers
+ * fwd.c - Port forwarding helpers
*
* Copyright Red Hat
* Author: Stefano Brivio <sbrivio@redhat.com>
@@ -21,7 +21,7 @@
#include <stdio.h>
#include "util.h"
-#include "port_fwd.h"
+#include "fwd.h"
#include "passt.h"
#include "lineread.h"
@@ -73,11 +73,11 @@ static void procfs_scan_listen(int fd, unsigned int lstate,
}
/**
- * port_fwd_scan_tcp() - Scan /proc to update TCP forwarding map
+ * fwd_scan_ports_tcp() - Scan /proc to update TCP forwarding map
* @fwd: Forwarding information to update
* @rev: Forwarding information for the reverse direction
*/
-void port_fwd_scan_tcp(struct port_fwd *fwd, const struct port_fwd *rev)
+void fwd_scan_ports_tcp(struct fwd_ports *fwd, const struct fwd_ports *rev)
{
memset(fwd->map, 0, PORT_BITMAP_SIZE);
procfs_scan_listen(fwd->scan4, TCP_LISTEN, fwd->map, rev->map);
@@ -85,15 +85,15 @@ void port_fwd_scan_tcp(struct port_fwd *fwd, const struct port_fwd *rev)
}
/**
- * port_fwd_scan_udp() - Scan /proc to update UDP forwarding map
+ * fwd_scan_ports_udp() - Scan /proc to update UDP forwarding map
* @fwd: Forwarding information to update
* @rev: Forwarding information for the reverse direction
* @tcp_fwd: Corresponding TCP forwarding information
* @tcp_rev: TCP forwarding information for the reverse direction
*/
-void port_fwd_scan_udp(struct port_fwd *fwd, const struct port_fwd *rev,
- const struct port_fwd *tcp_fwd,
- const struct port_fwd *tcp_rev)
+void fwd_scan_ports_udp(struct fwd_ports *fwd, const struct fwd_ports *rev,
+ const struct fwd_ports *tcp_fwd,
+ const struct fwd_ports *tcp_rev)
{
uint8_t exclude[PORT_BITMAP_SIZE];
@@ -118,10 +118,10 @@ void port_fwd_scan_udp(struct port_fwd *fwd, const struct port_fwd *rev,
}
/**
- * port_fwd_init() - Initial setup for port forwarding
+ * fwd_scan_ports_init() - Initial setup for automatic port forwarding
* @c: Execution context
*/
-void port_fwd_init(struct ctx *c)
+void fwd_scan_ports_init(struct ctx *c)
{
const int flags = O_RDONLY | O_CLOEXEC;
@@ -133,23 +133,23 @@ void port_fwd_init(struct ctx *c)
if (c->tcp.fwd_in.mode == FWD_AUTO) {
c->tcp.fwd_in.scan4 = open_in_ns(c, "/proc/net/tcp", flags);
c->tcp.fwd_in.scan6 = open_in_ns(c, "/proc/net/tcp6", flags);
- port_fwd_scan_tcp(&c->tcp.fwd_in, &c->tcp.fwd_out);
+ fwd_scan_ports_tcp(&c->tcp.fwd_in, &c->tcp.fwd_out);
}
if (c->udp.fwd_in.f.mode == FWD_AUTO) {
c->udp.fwd_in.f.scan4 = open_in_ns(c, "/proc/net/udp", flags);
c->udp.fwd_in.f.scan6 = open_in_ns(c, "/proc/net/udp6", flags);
- port_fwd_scan_udp(&c->udp.fwd_in.f, &c->udp.fwd_out.f,
- &c->tcp.fwd_in, &c->tcp.fwd_out);
+ fwd_scan_ports_udp(&c->udp.fwd_in.f, &c->udp.fwd_out.f,
+ &c->tcp.fwd_in, &c->tcp.fwd_out);
}
if (c->tcp.fwd_out.mode == FWD_AUTO) {
c->tcp.fwd_out.scan4 = open("/proc/net/tcp", flags);
c->tcp.fwd_out.scan6 = open("/proc/net/tcp6", flags);
- port_fwd_scan_tcp(&c->tcp.fwd_out, &c->tcp.fwd_in);
+ fwd_scan_ports_tcp(&c->tcp.fwd_out, &c->tcp.fwd_in);
}
if (c->udp.fwd_out.f.mode == FWD_AUTO) {
c->udp.fwd_out.f.scan4 = open("/proc/net/udp", flags);
c->udp.fwd_out.f.scan6 = open("/proc/net/udp6", flags);
- port_fwd_scan_udp(&c->udp.fwd_out.f, &c->udp.fwd_in.f,
- &c->tcp.fwd_out, &c->tcp.fwd_in);
+ fwd_scan_ports_udp(&c->udp.fwd_out.f, &c->udp.fwd_in.f,
+ &c->tcp.fwd_out, &c->tcp.fwd_in);
}
}
diff --git a/port_fwd.h b/fwd.h
similarity index 62%
rename from port_fwd.h
rename to fwd.h
index f6bf1b53..23281d90 100644
--- a/port_fwd.h
+++ b/fwd.h
@@ -4,13 +4,13 @@
* Author: David Gibson <david@gibson.dropbear.id.au>
*/
-#ifndef PORT_FWD_H
-#define PORT_FWD_H
+#ifndef FWD_H
+#define FWD_H
/* Number of ports for both TCP and UDP */
#define NUM_PORTS (1U << 16)
-enum port_fwd_mode {
+enum fwd_ports_mode {
FWD_SPEC = 1,
FWD_NONE,
FWD_AUTO,
@@ -20,25 +20,25 @@ enum port_fwd_mode {
#define PORT_BITMAP_SIZE DIV_ROUND_UP(NUM_PORTS, 8)
/**
- * port_fwd - Describes port forwarding for one protocol and direction
+ * fwd_ports - Describes port forwarding for one protocol and direction
* @mode: Overall forwarding mode (all, none, auto, specific ports)
* @scan4: /proc/net fd to scan for IPv4 ports when in AUTO mode
* @scan6: /proc/net fd to scan for IPv6 ports when in AUTO mode
* @map: Bitmap describing which ports are forwarded
* @delta: Offset between the original destination and mapped port number
*/
-struct port_fwd {
- enum port_fwd_mode mode;
+struct fwd_ports {
+ enum fwd_ports_mode mode;
int scan4;
int scan6;
uint8_t map[PORT_BITMAP_SIZE];
in_port_t delta[NUM_PORTS];
};
-void port_fwd_scan_tcp(struct port_fwd *fwd, const struct port_fwd *rev);
-void port_fwd_scan_udp(struct port_fwd *fwd, const struct port_fwd *rev,
- const struct port_fwd *tcp_fwd,
- const struct port_fwd *tcp_rev);
-void port_fwd_init(struct ctx *c);
+void fwd_scan_ports_tcp(struct fwd_ports *fwd, const struct fwd_ports *rev);
+void fwd_scan_ports_udp(struct fwd_ports *fwd, const struct fwd_ports *rev,
+ const struct fwd_ports *tcp_fwd,
+ const struct fwd_ports *tcp_rev);
+void fwd_scan_ports_init(struct ctx *c);
-#endif /* PORT_FWD_H */
+#endif /* FWD_H */
diff --git a/passt.h b/passt.h
index a9e8f15a..21a2e4f4 100644
--- a/passt.h
+++ b/passt.h
@@ -39,7 +39,7 @@ union epoll_ref;
#include "packet.h"
#include "flow.h"
#include "icmp.h"
-#include "port_fwd.h"
+#include "fwd.h"
#include "tcp.h"
#include "udp.h"
diff --git a/tcp.c b/tcp.c
index 236a8d23..18298128 100644
--- a/tcp.c
+++ b/tcp.c
@@ -3216,12 +3216,12 @@ void tcp_timer(struct ctx *c, const struct timespec *now)
if (c->mode == MODE_PASTA) {
if (c->tcp.fwd_out.mode == FWD_AUTO) {
- port_fwd_scan_tcp(&c->tcp.fwd_out, &c->tcp.fwd_in);
+ fwd_scan_ports_tcp(&c->tcp.fwd_out, &c->tcp.fwd_in);
NS_CALL(tcp_port_rebind_outbound, c);
}
if (c->tcp.fwd_in.mode == FWD_AUTO) {
- port_fwd_scan_tcp(&c->tcp.fwd_in, &c->tcp.fwd_out);
+ fwd_scan_ports_tcp(&c->tcp.fwd_in, &c->tcp.fwd_out);
tcp_port_rebind(c, false);
}
}
diff --git a/tcp.h b/tcp.h
index 5e6756d4..a9b8bf87 100644
--- a/tcp.h
+++ b/tcp.h
@@ -59,8 +59,8 @@ union tcp_listen_epoll_ref {
* @pipe_size: Size of pipes for spliced connections
*/
struct tcp_ctx {
- struct port_fwd fwd_in;
- struct port_fwd fwd_out;
+ struct fwd_ports fwd_in;
+ struct fwd_ports fwd_out;
struct timespec timer_run;
#ifdef HAS_SND_WND
int kernel_snd_wnd;
diff --git a/udp.c b/udp.c
index f5b86568..dc2022ac 100644
--- a/udp.c
+++ b/udp.c
@@ -259,7 +259,7 @@ void udp_portmap_clear(void)
* udp_invert_portmap() - Compute reverse port translations for return packets
* @fwd: Port forwarding configuration to compute reverse map for
*/
-static void udp_invert_portmap(struct udp_port_fwd *fwd)
+static void udp_invert_portmap(struct udp_fwd_ports *fwd)
{
int i;
@@ -1261,14 +1261,14 @@ void udp_timer(struct ctx *c, const struct timespec *now)
if (c->mode == MODE_PASTA) {
if (c->udp.fwd_out.f.mode == FWD_AUTO) {
- port_fwd_scan_udp(&c->udp.fwd_out.f, &c->udp.fwd_in.f,
- &c->tcp.fwd_out, &c->tcp.fwd_in);
+ fwd_scan_ports_udp(&c->udp.fwd_out.f, &c->udp.fwd_in.f,
+ &c->tcp.fwd_out, &c->tcp.fwd_in);
NS_CALL(udp_port_rebind_outbound, c);
}
if (c->udp.fwd_in.f.mode == FWD_AUTO) {
- port_fwd_scan_udp(&c->udp.fwd_in.f, &c->udp.fwd_out.f,
- &c->tcp.fwd_in, &c->tcp.fwd_out);
+ fwd_scan_ports_udp(&c->udp.fwd_in.f, &c->udp.fwd_out.f,
+ &c->tcp.fwd_in, &c->tcp.fwd_out);
udp_port_rebind(c, false);
}
}
diff --git a/udp.h b/udp.h
index f3d5d6d2..9976b623 100644
--- a/udp.h
+++ b/udp.h
@@ -43,12 +43,12 @@ union udp_epoll_ref {
/**
- * udp_port_fwd - UDP specific port forwarding configuration
+ * udp_fwd_ports - UDP specific port forwarding configuration
* @f: Generic forwarding configuration
* @rdelta: Reversed delta map to translate source ports on return packets
*/
-struct udp_port_fwd {
- struct port_fwd f;
+struct udp_fwd_ports {
+ struct fwd_ports f;
in_port_t rdelta[NUM_PORTS];
};
@@ -59,8 +59,8 @@ struct udp_port_fwd {
* @timer_run: Timestamp of most recent timer run
*/
struct udp_ctx {
- struct udp_port_fwd fwd_in;
- struct udp_port_fwd fwd_out;
+ struct udp_fwd_ports fwd_in;
+ struct udp_fwd_ports fwd_out;
struct timespec timer_run;
};
--
@@ -43,12 +43,12 @@ union udp_epoll_ref {
/**
- * udp_port_fwd - UDP specific port forwarding configuration
+ * udp_fwd_ports - UDP specific port forwarding configuration
* @f: Generic forwarding configuration
* @rdelta: Reversed delta map to translate source ports on return packets
*/
-struct udp_port_fwd {
- struct port_fwd f;
+struct udp_fwd_ports {
+ struct fwd_ports f;
in_port_t rdelta[NUM_PORTS];
};
@@ -59,8 +59,8 @@ struct udp_port_fwd {
* @timer_run: Timestamp of most recent timer run
*/
struct udp_ctx {
- struct udp_port_fwd fwd_in;
- struct udp_port_fwd fwd_out;
+ struct udp_fwd_ports fwd_in;
+ struct udp_fwd_ports fwd_out;
struct timespec timer_run;
};
--
2.43.0
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2 02/22] inany: Helper to test for various address types
2024-02-06 1:17 ` [PATCH v2 02/22] inany: Helper to test for various address types David Gibson
@ 2024-02-18 20:58 ` Stefano Brivio
2024-02-19 1:48 ` David Gibson
0 siblings, 1 reply; 38+ messages in thread
From: Stefano Brivio @ 2024-02-18 20:58 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev
On Tue, 6 Feb 2024 12:17:14 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> Add helpers to determine if an inany is loopback, unspecified or multicast,
> regardless of whether it's a "true" IPv6 address or an IPv4 address
> represented as v4-mapped.
> Sometimes we need to know if an inany is a loopback address, unspecified or
> some other particular kind of address, but we don't really care if it is
> IPv4.
>
> Use the loopback helper to simplify tcp_splice_conn_from_sock() slightly.
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> inany.h | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
> tcp_splice.c | 15 +++------------
> 2 files changed, 53 insertions(+), 12 deletions(-)
>
> diff --git a/inany.h b/inany.h
> index fe652ff7..2058f145 100644
> --- a/inany.h
> +++ b/inany.h
> @@ -55,6 +55,56 @@ static inline bool inany_equals(const union inany_addr *a,
> return IN6_ARE_ADDR_EQUAL(&a->a6, &b->a6);
> }
>
> +/** inany_is_loopback - Check if address is loopback
Nit: inany_is_loopback() (and four occurrences below).
> + * @a: IPv[46] address
> + *
> + * Return: true if @a is either ::1 or in 127.0.0.1/8
> + */
> +static inline bool inany_is_loopback(const union inany_addr *a)
> +{
> + const struct in_addr *v4 = inany_v4(a);
> +
> + return IN6_IS_ADDR_LOOPBACK(&a->a6) || (v4 && IN4_IS_ADDR_LOOPBACK(v4));
> +}
> +
> +/** inany_is_unspecified - Check if address is unspecified
> + * @a: IPv[46] address
> + *
> + * Return: true if @a is either :: or 0.0.0.0
> + */
> +static inline bool inany_is_unspecified(const union inany_addr *a)
> +{
> + const struct in_addr *v4 = inany_v4(a);
> +
> + return IN6_IS_ADDR_UNSPECIFIED(&a->a6) ||
> + (v4 && IN4_IS_ADDR_UNSPECIFIED(v4));
> +}
> +
> +/** inany_is_multicast - Check if address is multicast or broadcast
> + * @a: IPv[46] address
> + *
> + * Return: true if @a is IPv6 multicast or IPv4 multicast or broadcast
> + */
> +static inline bool inany_is_multicast(const union inany_addr *a)
> +{
> + const struct in_addr *v4 = inany_v4(a);
> +
> + return IN6_IS_ADDR_MULTICAST(&a->a6) ||
> + (v4 && (IN4_IS_ADDR_MULTICAST(v4) ||
> + IN4_IS_ADDR_BROADCAST(v4)));
> +}
> +
> +/** inany_is_unicast - Check if address is specified and unicast
> + * @a: IPv[46] address
> + *
> + * Return: true if @a is specified and a unicast address
> + */
> +/* cppcheck-suppress unusedFunction */
> +static inline bool inany_is_unicast(const union inany_addr *a)
> +{
> + return !inany_is_unspecified(a) && !inany_is_multicast(a);
> +}
> +
> /** inany_from_af - Set IPv[46] address from IPv4 or IPv6 address
> * @aa: Pointer to store IPv[46] address
> * @af: Address family of @addr
> diff --git a/tcp_splice.c b/tcp_splice.c
> index cc9745e8..4ecc178b 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -440,29 +440,20 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
> struct tcp_splice_conn *conn, int s,
> const struct sockaddr *sa)
> {
> - const struct in_addr *a4;
> union inany_addr aany;
> in_port_t port;
>
> ASSERT(c->mode == MODE_PASTA);
>
> inany_from_sockaddr(&aany, &port, sa);
> - a4 = inany_v4(&aany);
> -
> - if (a4) {
> - if (!IN4_IS_ADDR_LOOPBACK(a4))
> - return false;
> - conn->flags = 0;
> - } else {
> - if (!IN6_IS_ADDR_LOOPBACK(&aany.a6))
> - return false;
> - conn->flags = SPLICE_V6;
> - }
> + if (!inany_is_loopback(&aany))
> + return false;
>
> if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int)))
> flow_trace(conn, "failed to set TCP_QUICKACK on %i", s);
>
> conn->f.type = FLOW_TCP_SPLICE;
> + conn->flags = inany_v4(&aany) ? 0 : SPLICE_V6;
> conn->s[0] = s;
>
> if (tcp_splice_new(c, conn, ref.port, ref.pif))
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 09/22] tcp_splice: Simplify clean up logic
2024-02-06 1:17 ` [PATCH v2 09/22] tcp_splice: Simplify clean up logic David Gibson
@ 2024-02-18 20:59 ` Stefano Brivio
2024-02-19 1:50 ` David Gibson
0 siblings, 1 reply; 38+ messages in thread
From: Stefano Brivio @ 2024-02-18 20:59 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev
On Tue, 6 Feb 2024 12:17:21 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> Currently tcp_splice_flow_defer() contains specific logic to determine if
> we're far enough initialised that we need to close pipes and/or sockets.
> This is potentially fragile if we change something about the order in which
> we do things. We can simplify this by initialising the pipe and socket
> fields to -1 very early, then close()ing them if and only if they're non
> negative.
>
> This lets us remove a special case cleanup if our connect() fails. This
> will already trigger a CLOSING event, and the socket fd in question is
> populated in the connection structure. Thus we can let the new cleanup
> logic handle it rather than requiring an explicit close().
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp_splice.c | 27 +++++++++++----------------
> 1 file changed, 11 insertions(+), 16 deletions(-)
>
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 40ecb5d4..f0343eb5 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -246,16 +246,14 @@ bool tcp_splice_flow_defer(union flow *flow)
> return false;
>
> for (side = 0; side < SIDES; side++) {
> - if (conn->events & SPLICE_ESTABLISHED) {
> - /* Flushing might need to block: don't recycle them. */
> - if (conn->pipe[side][0] != -1) {
> - close(conn->pipe[side][0]);
> - close(conn->pipe[side][1]);
> - conn->pipe[side][0] = conn->pipe[side][1] = -1;
> - }
> + /* Flushing might need to block: don't recycle them. */
> + if (conn->pipe[side][0] >= 0) {
> + close(conn->pipe[side][0]);
> + close(conn->pipe[side][1]);
> + conn->pipe[side][0] = conn->pipe[side][1] = -1;
> }
>
> - if (side == 0 || conn->events & SPLICE_CONNECT) {
> + if (conn->s[side] >= 0) {
> close(conn->s[side]);
> conn->s[side] = -1;
> }
> @@ -284,8 +282,6 @@ static int tcp_splice_connect_finish(const struct ctx *c,
> int i = 0;
>
> for (side = 0; side < SIDES; side++) {
> - conn->pipe[side][0] = conn->pipe[side][1] = -1;
> -
> for (; i < TCP_SPLICE_PIPE_POOL_SIZE; i++) {
> if (splice_pipe_pool[i][0] >= 0) {
> SWAP(conn->pipe[side][0],
> @@ -361,12 +357,8 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
> }
>
> if (connect(conn->s[1], sa, sl)) {
> - if (errno != EINPROGRESS) {
> - int ret = -errno;
> -
> - close(sock_conn);
> - return ret;
> - }
> + if (errno != EINPROGRESS)
> + return -errno;
Nit: a newline here would be nice.
> conn_event(c, conn, SPLICE_CONNECT);
> } else {
> conn_event(c, conn, SPLICE_ESTABLISHED);
> @@ -459,6 +451,9 @@ bool tcp_splice_conn_from_sock(const struct ctx *c,
> conn->f.type = FLOW_TCP_SPLICE;
> conn->flags = inany_v4(&aany) ? 0 : SPLICE_V6;
> conn->s[0] = s;
> + conn->s[1] = -1;
> + conn->pipe[0][0] = conn->pipe[0][1] = -1;
> + conn->pipe[1][0] = conn->pipe[1][1] = -1;
>
> if (tcp_splice_new(c, conn, ref.port, ref.pif))
> conn_flag(c, conn, CLOSING);
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/22] flow: Clarify flow entry life cycle, introduce uniform logging
2024-02-06 1:17 ` [PATCH v2 11/22] flow: Clarify flow entry life cycle, introduce uniform logging David Gibson
@ 2024-02-18 21:00 ` Stefano Brivio
2024-02-19 1:58 ` David Gibson
0 siblings, 1 reply; 38+ messages in thread
From: Stefano Brivio @ 2024-02-18 21:00 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev
On Tue, 6 Feb 2024 12:17:23 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> Our allocation scheme for flow entries means there are some non-obvious
> constraints on when what things can be done with an entry. Add a big doc
> comment explaining the life cycle.
>
> In addition, make a FLOW_START() macro to mark one of the important
> transitions. This encourages correct usage, by making it natural to only
> access the flow type specific structure after calling it. It also logs
> that a new flow has been created, which is useful for debugging.
>
> We also add logging when a flow's lifecycle ends. This doesn't need a new
> helper, because it can only happen either from flow_alloc_cancel() or from
> the flow deferred handler.
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> flow.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> flow.h | 5 ++++
> tcp.c | 15 +++++------
> tcp_splice.c | 11 ++++----
> tcp_splice.h | 5 ++--
> 5 files changed, 94 insertions(+), 18 deletions(-)
>
> diff --git a/flow.c b/flow.c
> index beb9749c..a155b54b 100644
> --- a/flow.c
> +++ b/flow.c
> @@ -34,6 +34,45 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
>
> /* Global Flow Table */
>
> +/**
> + * DOC: Theory of Operation - flow entry life cycle
> + *
> + * An individual flow table entry moves through these logical states, usually in
> + * this order.
> + *
> + * FREE - Part of the general pool of free flow table entries
> + * Operations:
> + * - flow_alloc() finds an entry and moves it to ALLOC state
> + *
> + * ALLOC - A tentatively allocated entry
> + * Operations:
> + * - flow_alloc_cancel() returns the entry to FREE state
> + * - FLOW_START() set the entry's type and moves to START state
> + * Caveats:
> + * - It's not safe to write fields in the flow entry
> + * - It's not safe to allocate other entries with flow_alloc()
I'm not entirely sure what you mean here -- is this in the sense of
s/other/further/ ?
> + * - It's not safe to return to the main epoll loop
..."before FLOW_START() is called on the entry"?
--
Stefano
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 12/22] tcp, tcp_splice: Helpers for getting sockets from the pools
2024-02-06 1:17 ` [PATCH v2 12/22] tcp, tcp_splice: Helpers for getting sockets from the pools David Gibson
@ 2024-02-18 21:00 ` Stefano Brivio
2024-02-19 1:51 ` David Gibson
0 siblings, 1 reply; 38+ messages in thread
From: Stefano Brivio @ 2024-02-18 21:00 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev
On Tue, 6 Feb 2024 12:17:24 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> We maintain pools of ready-to-connect sockets in both the original and
> (for pasta) guest namespace to reduce latency when starting new TCP
> connections. If we exhaust those pools we have to take a higher
> latency path to get a new socket.
>
> Currently we open-code that fallback in the places we need it. To
> improve clarity encapsulate that into helper functions.
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp.c | 29 ++++++++++++++++++++++++-----
> tcp_conn.h | 2 +-
> tcp_splice.c | 46 +++++++++++++++++++++++++---------------------
> 3 files changed, 50 insertions(+), 27 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index e15b932f..c06d1cc4 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1792,7 +1792,7 @@ int tcp_conn_pool_sock(int pool[])
> *
> * Return: socket number on success, negative code if socket creation failed
> */
> -int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
> +static int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
> {
> int s;
>
> @@ -1811,6 +1811,27 @@ int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
> return s;
> }
>
> +/**
> + * tcp_conn_sock() - Obtain a connectable socket in the host/init namespace
> + * @c: Execution context
> + * @af: Address family (AF_INET or AF_INET6)
> + *
> + * Return: Socket fd on success, -errno on failure
> + */
> +int tcp_conn_sock(const struct ctx *c, sa_family_t af)
> +{
> + int *pool = af == AF_INET6 ? init_sock_pool6 : init_sock_pool4;
> + int s;
> +
> + if ((s = tcp_conn_pool_sock(pool)) >= 0)
> + return s;
> +
> + /* If the pool is empty we just open a new one without refilling the
> + * pool to keep latency down.
> + */
> + return tcp_conn_new_sock(c, af);
> +}
> +
> /**
> * tcp_conn_tap_mss() - Get MSS value advertised by tap/guest
> * @conn: Connection pointer
> @@ -1909,7 +1930,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
> const struct tcphdr *th, const char *opts,
> size_t optlen, const struct timespec *now)
> {
> - int *pool = af == AF_INET6 ? init_sock_pool6 : init_sock_pool4;
> struct sockaddr_in addr4 = {
> .sin_family = AF_INET,
> .sin_port = th->dest,
> @@ -1931,9 +1951,8 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
> if (!(flow = flow_alloc()))
> return;
>
> - if ((s = tcp_conn_pool_sock(pool)) < 0)
> - if ((s = tcp_conn_new_sock(c, af)) < 0)
> - goto cancel;
> + if ((s = tcp_conn_sock(c, af)) < 0)
> + goto cancel;
>
> if (!c->no_map_gw) {
> if (af == AF_INET && IN4_ARE_ADDR_EQUAL(daddr, &c->ip4.gw))
> diff --git a/tcp_conn.h b/tcp_conn.h
> index 20c7cb8b..e55edafe 100644
> --- a/tcp_conn.h
> +++ b/tcp_conn.h
> @@ -159,7 +159,7 @@ bool tcp_flow_defer(union flow *flow);
> bool tcp_splice_flow_defer(union flow *flow);
> void tcp_splice_timer(const struct ctx *c, union flow *flow);
> int tcp_conn_pool_sock(int pool[]);
> -int tcp_conn_new_sock(const struct ctx *c, sa_family_t af);
> +int tcp_conn_sock(const struct ctx *c, sa_family_t af);
> void tcp_sock_refill_pool(const struct ctx *c, int pool[], sa_family_t af);
> void tcp_splice_refill(const struct ctx *c);
>
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 576fe9be..609f3242 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -90,7 +90,7 @@ static const char *tcp_splice_flag_str[] __attribute((__unused__)) = {
> };
>
> /* Forward declaration */
> -static int tcp_sock_refill_ns(void *arg);
> +static int tcp_conn_sock_ns(const struct ctx *c, sa_family_t af);
>
> /**
> * tcp_splice_conn_epoll_events() - epoll events masks for given state
> @@ -380,36 +380,19 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
> static int tcp_splice_new(const struct ctx *c, struct tcp_splice_conn *conn,
> in_port_t port, uint8_t pif)
> {
> + sa_family_t af = CONN_V6(conn) ? AF_INET6 : AF_INET;
> int s = -1;
>
> - /* If the pool is empty we take slightly different approaches
> - * for init or ns sockets. For init sockets we just open a
> - * new one without refilling the pool to keep latency down.
> - * For ns sockets, we're going to incur the latency of
> - * entering the ns anyway, so we might as well refill the
> - * pool.
> - */
> if (pif == PIF_SPLICE) {
> - int *p = CONN_V6(conn) ? init_sock_pool6 : init_sock_pool4;
> - sa_family_t af = CONN_V6(conn) ? AF_INET6 : AF_INET;
> -
> port += c->tcp.fwd_out.delta[port];
>
> - s = tcp_conn_pool_sock(p);
> - if (s < 0)
> - s = tcp_conn_new_sock(c, af);
> + s = tcp_conn_sock(c, af);
> } else {
> - int *p = CONN_V6(conn) ? ns_sock_pool6 : ns_sock_pool4;
> -
> ASSERT(pif == PIF_HOST);
>
> port += c->tcp.fwd_in.delta[port];
>
> - /* If pool is empty, refill it first */
> - if (p[TCP_SOCK_POOL_SIZE-1] < 0)
> - NS_CALL(tcp_sock_refill_ns, c);
> -
> - s = tcp_conn_pool_sock(p);
> + s = tcp_conn_sock_ns(c, af);
> }
>
> if (s < 0) {
> @@ -709,6 +692,27 @@ static int tcp_sock_refill_ns(void *arg)
> return 0;
> }
>
> +/**
> + * tcp_conn_sock_ns() - Obtain a connectable socket in the namespace
> + * @c: Execution context
> + * @af: Address family (AF_INET or AF_INET6)
> + *
> + * Return: Socket fd in the namespace on success, -errno on failure
> + */
> +static int tcp_conn_sock_ns(const struct ctx *c, sa_family_t af)
> +{
> + int *p = af == AF_INET6 ? ns_sock_pool6 : ns_sock_pool4;
> +
> + /* If the pool is empty we have to incur the latency of entering the ns.
> + * Therefore, we might as well refill the whole pool while we're at it,
> + * which differs from tcp_conn_sock().
> + */
> + if (p[TCP_SOCK_POOL_SIZE-1] < 0)
Nit, for consistency (but yes, it was already like this):
if (p[TCP_SOCK_POOL_SIZE - 1] < 0)
--
Stefano
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 13/22] tcp_splice: More specific variable names in new splice path
2024-02-06 1:17 ` [PATCH v2 13/22] tcp_splice: More specific variable names in new splice path David Gibson
@ 2024-02-18 21:00 ` Stefano Brivio
2024-02-19 1:53 ` David Gibson
0 siblings, 1 reply; 38+ messages in thread
From: Stefano Brivio @ 2024-02-18 21:00 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev
On Tue, 6 Feb 2024 12:17:25 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> In tcp_splice_conn_from_sock(), the 'port' variable stores the source port
> of the connection on the originating side. In tcp_splice_new(), called
> directly from it, the 'port' parameter gives the _destination_ port of the
> originating connection and is then updated to the destination port of the
> connection on the other side.
>
> Similarly, in tcp_splice_conn_from_sock(), 's' is the fd of the accetped
> socket (on side 0), whereas in tcp_splice_new(), 's' is the fd of the
> connecting socket (side 1).
>
> I, for one, find having the same variable name with different meanings in
> such close proximity in the flow of control pretty confusing.
Yeah, same here, it took me a while to check that what you're changing
to 'dstport' is actually a destination port (and not just because a
comment says that).
--
Stefano
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 16/22] tcp_splice: Improve error reporting on connect path
2024-02-06 1:17 ` [PATCH v2 16/22] tcp_splice: Improve error reporting on connect path David Gibson
@ 2024-02-18 21:01 ` Stefano Brivio
2024-02-19 3:23 ` David Gibson
0 siblings, 1 reply; 38+ messages in thread
From: Stefano Brivio @ 2024-02-18 21:01 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev
On Tue, 6 Feb 2024 12:17:28 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> This makes a number of changes to improve error reporting while connecting
> a new spliced socket:
> * We use flow_err() and similar functions so all messages include info
> on which specific flow was affected
> * We use strerror() to interpret raw error values
> * We now report errors on connection (at "trace" level, since this would
> allow spamming the logs)
> * We also look up and report some details on EPOLLERR events, which can
> include connection errors, since we use a non-blocking connect(). Again
> we use "trace" level since this can spam the logs.
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp_splice.c | 24 ++++++++++++++++++++----
> 1 file changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/tcp_splice.c b/tcp_splice.c
> index 5ba9c8ea..49075e5c 100644
> --- a/tcp_splice.c
> +++ b/tcp_splice.c
> @@ -349,8 +349,9 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
> ASSERT(0);
>
> if (conn->s[1] < 0) {
> - warn("Couldn't open connectable socket for splice (%d)",
> - conn->s[1]);
> + flow_err(conn,
> + "Couldn't open connectable socket for splice: %s",
> + strerror(-conn->s[1]));
It took me a bit to convince myself that we actually store negative
error codes in pools, which sounds like a neat idea, except for two
things:
- in tcp_sock_refill_pool(), once we get an error from
tcp_conn_new_sock(), should we really continue to call it and try to
get more sockets?
Presumably, that will have something to do with some kind of resource
exhaustion, so perhaps we shouldn't risk making it worse and just
stop refilling the pool. If we do, we might have up to one negative
error code in the pool, I guess.
- if we have an error code in the pool, that error might have occurred
a while ago, but here, we're logging it at the moment we need that
socket.
Maybe it would be simpler and saner to just have -1 values in the pool
(error or no socket available) and report errors in tcp_conn_new_sock()
instead?
I'm still reviewing patches 17/22 to 22/22, sorry for the delay.
--
Stefano
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 02/22] inany: Helper to test for various address types
2024-02-18 20:58 ` Stefano Brivio
@ 2024-02-19 1:48 ` David Gibson
0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2024-02-19 1:48 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
[-- Attachment #1: Type: text/plain, Size: 1454 bytes --]
On Sun, Feb 18, 2024 at 09:58:44PM +0100, Stefano Brivio wrote:
> On Tue, 6 Feb 2024 12:17:14 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > Add helpers to determine if an inany is loopback, unspecified or multicast,
> > regardless of whether it's a "true" IPv6 address or an IPv4 address
> > represented as v4-mapped.
> > Sometimes we need to know if an inany is a loopback address, unspecified or
> > some other particular kind of address, but we don't really care if it is
> > IPv4.
> >
> > Use the loopback helper to simplify tcp_splice_conn_from_sock() slightly.
> >
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > inany.h | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
> > tcp_splice.c | 15 +++------------
> > 2 files changed, 53 insertions(+), 12 deletions(-)
> >
> > diff --git a/inany.h b/inany.h
> > index fe652ff7..2058f145 100644
> > --- a/inany.h
> > +++ b/inany.h
> > @@ -55,6 +55,56 @@ static inline bool inany_equals(const union inany_addr *a,
> > return IN6_ARE_ADDR_EQUAL(&a->a6, &b->a6);
> > }
> >
> > +/** inany_is_loopback - Check if address is loopback
>
> Nit: inany_is_loopback() (and four occurrences below).
Corrected, thanks.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 09/22] tcp_splice: Simplify clean up logic
2024-02-18 20:59 ` Stefano Brivio
@ 2024-02-19 1:50 ` David Gibson
0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2024-02-19 1:50 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
[-- Attachment #1: Type: text/plain, Size: 3010 bytes --]
On Sun, Feb 18, 2024 at 09:59:37PM +0100, Stefano Brivio wrote:
> On Tue, 6 Feb 2024 12:17:21 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > Currently tcp_splice_flow_defer() contains specific logic to determine if
> > we're far enough initialised that we need to close pipes and/or sockets.
> > This is potentially fragile if we change something about the order in which
> > we do things. We can simplify this by initialising the pipe and socket
> > fields to -1 very early, then close()ing them if and only if they're non
> > negative.
> >
> > This lets us remove a special case cleanup if our connect() fails. This
> > will already trigger a CLOSING event, and the socket fd in question is
> > populated in the connection structure. Thus we can let the new cleanup
> > logic handle it rather than requiring an explicit close().
> >
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > tcp_splice.c | 27 +++++++++++----------------
> > 1 file changed, 11 insertions(+), 16 deletions(-)
> >
> > diff --git a/tcp_splice.c b/tcp_splice.c
> > index 40ecb5d4..f0343eb5 100644
> > --- a/tcp_splice.c
> > +++ b/tcp_splice.c
> > @@ -246,16 +246,14 @@ bool tcp_splice_flow_defer(union flow *flow)
> > return false;
> >
> > for (side = 0; side < SIDES; side++) {
> > - if (conn->events & SPLICE_ESTABLISHED) {
> > - /* Flushing might need to block: don't recycle them. */
> > - if (conn->pipe[side][0] != -1) {
> > - close(conn->pipe[side][0]);
> > - close(conn->pipe[side][1]);
> > - conn->pipe[side][0] = conn->pipe[side][1] = -1;
> > - }
> > + /* Flushing might need to block: don't recycle them. */
> > + if (conn->pipe[side][0] >= 0) {
> > + close(conn->pipe[side][0]);
> > + close(conn->pipe[side][1]);
> > + conn->pipe[side][0] = conn->pipe[side][1] = -1;
> > }
> >
> > - if (side == 0 || conn->events & SPLICE_CONNECT) {
> > + if (conn->s[side] >= 0) {
> > close(conn->s[side]);
> > conn->s[side] = -1;
> > }
> > @@ -284,8 +282,6 @@ static int tcp_splice_connect_finish(const struct ctx *c,
> > int i = 0;
> >
> > for (side = 0; side < SIDES; side++) {
> > - conn->pipe[side][0] = conn->pipe[side][1] = -1;
> > -
> > for (; i < TCP_SPLICE_PIPE_POOL_SIZE; i++) {
> > if (splice_pipe_pool[i][0] >= 0) {
> > SWAP(conn->pipe[side][0],
> > @@ -361,12 +357,8 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
> > }
> >
> > if (connect(conn->s[1], sa, sl)) {
> > - if (errno != EINPROGRESS) {
> > - int ret = -errno;
> > -
> > - close(sock_conn);
> > - return ret;
> > - }
> > + if (errno != EINPROGRESS)
> > + return -errno;
>
> Nit: a newline here would be nice.
Done, thanks.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 12/22] tcp, tcp_splice: Helpers for getting sockets from the pools
2024-02-18 21:00 ` Stefano Brivio
@ 2024-02-19 1:51 ` David Gibson
0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2024-02-19 1:51 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
[-- Attachment #1: Type: text/plain, Size: 6636 bytes --]
On Sun, Feb 18, 2024 at 10:00:29PM +0100, Stefano Brivio wrote:
> On Tue, 6 Feb 2024 12:17:24 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > We maintain pools of ready-to-connect sockets in both the original and
> > (for pasta) guest namespace to reduce latency when starting new TCP
> > connections. If we exhaust those pools we have to take a higher
> > latency path to get a new socket.
> >
> > Currently we open-code that fallback in the places we need it. To
> > improve clarity encapsulate that into helper functions.
> >
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > tcp.c | 29 ++++++++++++++++++++++++-----
> > tcp_conn.h | 2 +-
> > tcp_splice.c | 46 +++++++++++++++++++++++++---------------------
> > 3 files changed, 50 insertions(+), 27 deletions(-)
> >
> > diff --git a/tcp.c b/tcp.c
> > index e15b932f..c06d1cc4 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -1792,7 +1792,7 @@ int tcp_conn_pool_sock(int pool[])
> > *
> > * Return: socket number on success, negative code if socket creation failed
> > */
> > -int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
> > +static int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
> > {
> > int s;
> >
> > @@ -1811,6 +1811,27 @@ int tcp_conn_new_sock(const struct ctx *c, sa_family_t af)
> > return s;
> > }
> >
> > +/**
> > + * tcp_conn_sock() - Obtain a connectable socket in the host/init namespace
> > + * @c: Execution context
> > + * @af: Address family (AF_INET or AF_INET6)
> > + *
> > + * Return: Socket fd on success, -errno on failure
> > + */
> > +int tcp_conn_sock(const struct ctx *c, sa_family_t af)
> > +{
> > + int *pool = af == AF_INET6 ? init_sock_pool6 : init_sock_pool4;
> > + int s;
> > +
> > + if ((s = tcp_conn_pool_sock(pool)) >= 0)
> > + return s;
> > +
> > + /* If the pool is empty we just open a new one without refilling the
> > + * pool to keep latency down.
> > + */
> > + return tcp_conn_new_sock(c, af);
> > +}
> > +
> > /**
> > * tcp_conn_tap_mss() - Get MSS value advertised by tap/guest
> > * @conn: Connection pointer
> > @@ -1909,7 +1930,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
> > const struct tcphdr *th, const char *opts,
> > size_t optlen, const struct timespec *now)
> > {
> > - int *pool = af == AF_INET6 ? init_sock_pool6 : init_sock_pool4;
> > struct sockaddr_in addr4 = {
> > .sin_family = AF_INET,
> > .sin_port = th->dest,
> > @@ -1931,9 +1951,8 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
> > if (!(flow = flow_alloc()))
> > return;
> >
> > - if ((s = tcp_conn_pool_sock(pool)) < 0)
> > - if ((s = tcp_conn_new_sock(c, af)) < 0)
> > - goto cancel;
> > + if ((s = tcp_conn_sock(c, af)) < 0)
> > + goto cancel;
> >
> > if (!c->no_map_gw) {
> > if (af == AF_INET && IN4_ARE_ADDR_EQUAL(daddr, &c->ip4.gw))
> > diff --git a/tcp_conn.h b/tcp_conn.h
> > index 20c7cb8b..e55edafe 100644
> > --- a/tcp_conn.h
> > +++ b/tcp_conn.h
> > @@ -159,7 +159,7 @@ bool tcp_flow_defer(union flow *flow);
> > bool tcp_splice_flow_defer(union flow *flow);
> > void tcp_splice_timer(const struct ctx *c, union flow *flow);
> > int tcp_conn_pool_sock(int pool[]);
> > -int tcp_conn_new_sock(const struct ctx *c, sa_family_t af);
> > +int tcp_conn_sock(const struct ctx *c, sa_family_t af);
> > void tcp_sock_refill_pool(const struct ctx *c, int pool[], sa_family_t af);
> > void tcp_splice_refill(const struct ctx *c);
> >
> > diff --git a/tcp_splice.c b/tcp_splice.c
> > index 576fe9be..609f3242 100644
> > --- a/tcp_splice.c
> > +++ b/tcp_splice.c
> > @@ -90,7 +90,7 @@ static const char *tcp_splice_flag_str[] __attribute((__unused__)) = {
> > };
> >
> > /* Forward declaration */
> > -static int tcp_sock_refill_ns(void *arg);
> > +static int tcp_conn_sock_ns(const struct ctx *c, sa_family_t af);
> >
> > /**
> > * tcp_splice_conn_epoll_events() - epoll events masks for given state
> > @@ -380,36 +380,19 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
> > static int tcp_splice_new(const struct ctx *c, struct tcp_splice_conn *conn,
> > in_port_t port, uint8_t pif)
> > {
> > + sa_family_t af = CONN_V6(conn) ? AF_INET6 : AF_INET;
> > int s = -1;
> >
> > - /* If the pool is empty we take slightly different approaches
> > - * for init or ns sockets. For init sockets we just open a
> > - * new one without refilling the pool to keep latency down.
> > - * For ns sockets, we're going to incur the latency of
> > - * entering the ns anyway, so we might as well refill the
> > - * pool.
> > - */
> > if (pif == PIF_SPLICE) {
> > - int *p = CONN_V6(conn) ? init_sock_pool6 : init_sock_pool4;
> > - sa_family_t af = CONN_V6(conn) ? AF_INET6 : AF_INET;
> > -
> > port += c->tcp.fwd_out.delta[port];
> >
> > - s = tcp_conn_pool_sock(p);
> > - if (s < 0)
> > - s = tcp_conn_new_sock(c, af);
> > + s = tcp_conn_sock(c, af);
> > } else {
> > - int *p = CONN_V6(conn) ? ns_sock_pool6 : ns_sock_pool4;
> > -
> > ASSERT(pif == PIF_HOST);
> >
> > port += c->tcp.fwd_in.delta[port];
> >
> > - /* If pool is empty, refill it first */
> > - if (p[TCP_SOCK_POOL_SIZE-1] < 0)
> > - NS_CALL(tcp_sock_refill_ns, c);
> > -
> > - s = tcp_conn_pool_sock(p);
> > + s = tcp_conn_sock_ns(c, af);
> > }
> >
> > if (s < 0) {
> > @@ -709,6 +692,27 @@ static int tcp_sock_refill_ns(void *arg)
> > return 0;
> > }
> >
> > +/**
> > + * tcp_conn_sock_ns() - Obtain a connectable socket in the namespace
> > + * @c: Execution context
> > + * @af: Address family (AF_INET or AF_INET6)
> > + *
> > + * Return: Socket fd in the namespace on success, -errno on failure
> > + */
> > +static int tcp_conn_sock_ns(const struct ctx *c, sa_family_t af)
> > +{
> > + int *p = af == AF_INET6 ? ns_sock_pool6 : ns_sock_pool4;
> > +
> > + /* If the pool is empty we have to incur the latency of entering the ns.
> > + * Therefore, we might as well refill the whole pool while we're at it,
> > + * which differs from tcp_conn_sock().
> > + */
> > + if (p[TCP_SOCK_POOL_SIZE-1] < 0)
>
> Nit, for consistency (but yes, it was already like this):
>
> if (p[TCP_SOCK_POOL_SIZE - 1] < 0)
Adjusted, thanks.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 13/22] tcp_splice: More specific variable names in new splice path
2024-02-18 21:00 ` Stefano Brivio
@ 2024-02-19 1:53 ` David Gibson
0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2024-02-19 1:53 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
[-- Attachment #1: Type: text/plain, Size: 1360 bytes --]
On Sun, Feb 18, 2024 at 10:00:55PM +0100, Stefano Brivio wrote:
> On Tue, 6 Feb 2024 12:17:25 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > In tcp_splice_conn_from_sock(), the 'port' variable stores the source port
> > of the connection on the originating side. In tcp_splice_new(), called
> > directly from it, the 'port' parameter gives the _destination_ port of the
> > originating connection and is then updated to the destination port of the
> > connection on the other side.
> >
> > Similarly, in tcp_splice_conn_from_sock(), 's' is the fd of the accetped
> > socket (on side 0), whereas in tcp_splice_new(), 's' is the fd of the
> > connecting socket (side 1).
> >
> > I, for one, find having the same variable name with different meanings in
> > such close proximity in the flow of control pretty confusing.
>
> Yeah, same here, it took me a while to check that what you're changing
> to 'dstport' is actually a destination port (and not just because a
> comment says that).
Yeah, I had a lot of confusion until I did that trace and realised
these were different things with the same name. Hence the patch.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 11/22] flow: Clarify flow entry life cycle, introduce uniform logging
2024-02-18 21:00 ` Stefano Brivio
@ 2024-02-19 1:58 ` David Gibson
0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2024-02-19 1:58 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
[-- Attachment #1: Type: text/plain, Size: 2869 bytes --]
On Sun, Feb 18, 2024 at 10:00:04PM +0100, Stefano Brivio wrote:
> On Tue, 6 Feb 2024 12:17:23 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > Our allocation scheme for flow entries means there are some non-obvious
> > constraints on when what things can be done with an entry. Add a big doc
> > comment explaining the life cycle.
> >
> > In addition, make a FLOW_START() macro to mark one of the important
> > transitions. This encourages correct usage, by making it natural to only
> > access the flow type specific structure after calling it. It also logs
> > that a new flow has been created, which is useful for debugging.
> >
> > We also add logging when a flow's lifecycle ends. This doesn't need a new
> > helper, because it can only happen either from flow_alloc_cancel() or from
> > the flow deferred handler.
> >
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > flow.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> > flow.h | 5 ++++
> > tcp.c | 15 +++++------
> > tcp_splice.c | 11 ++++----
> > tcp_splice.h | 5 ++--
> > 5 files changed, 94 insertions(+), 18 deletions(-)
> >
> > diff --git a/flow.c b/flow.c
> > index beb9749c..a155b54b 100644
> > --- a/flow.c
> > +++ b/flow.c
> > @@ -34,6 +34,45 @@ static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
> >
> > /* Global Flow Table */
> >
> > +/**
> > + * DOC: Theory of Operation - flow entry life cycle
> > + *
> > + * An individual flow table entry moves through these logical states, usually in
> > + * this order.
> > + *
> > + * FREE - Part of the general pool of free flow table entries
> > + * Operations:
> > + * - flow_alloc() finds an entry and moves it to ALLOC state
> > + *
> > + * ALLOC - A tentatively allocated entry
> > + * Operations:
> > + * - flow_alloc_cancel() returns the entry to FREE state
> > + * - FLOW_START() set the entry's type and moves to START state
> > + * Caveats:
> > + * - It's not safe to write fields in the flow entry
> > + * - It's not safe to allocate other entries with flow_alloc()
>
> I'm not entirely sure what you mean here -- is this in the sense of
> s/other/further/ ?
Yes. I've changed the word to "further", which I agree is clearer.
> > + * - It's not safe to return to the main epoll loop
>
> ..."before FLOW_START() is called on the entry"?
Right, because FLOW_START() moves to START state, where returning to
the epoll loop *is* allowed. I've added a note which I hope makes
that clearer.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 16/22] tcp_splice: Improve error reporting on connect path
2024-02-18 21:01 ` Stefano Brivio
@ 2024-02-19 3:23 ` David Gibson
0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2024-02-19 3:23 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
[-- Attachment #1: Type: text/plain, Size: 2817 bytes --]
On Sun, Feb 18, 2024 at 10:01:24PM +0100, Stefano Brivio wrote:
> On Tue, 6 Feb 2024 12:17:28 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > This makes a number of changes to improve error reporting while connecting
> > a new spliced socket:
> > * We use flow_err() and similar functions so all messages include info
> > on which specific flow was affected
> > * We use strerror() to interpret raw error values
> > * We now report errors on connection (at "trace" level, since this would
> > allow spamming the logs)
> > * We also look up and report some details on EPOLLERR events, which can
> > include connection errors, since we use a non-blocking connect(). Again
> > we use "trace" level since this can spam the logs.
> >
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > tcp_splice.c | 24 ++++++++++++++++++++----
> > 1 file changed, 20 insertions(+), 4 deletions(-)
> >
> > diff --git a/tcp_splice.c b/tcp_splice.c
> > index 5ba9c8ea..49075e5c 100644
> > --- a/tcp_splice.c
> > +++ b/tcp_splice.c
> > @@ -349,8 +349,9 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn,
> > ASSERT(0);
> >
> > if (conn->s[1] < 0) {
> > - warn("Couldn't open connectable socket for splice (%d)",
> > - conn->s[1]);
> > + flow_err(conn,
> > + "Couldn't open connectable socket for splice: %s",
> > + strerror(-conn->s[1]));
>
> It took me a bit to convince myself that we actually store negative
> error codes in pools, which sounds like a neat idea, except for two
> things:
Yeah, it took me a while to discover that too :)
> - in tcp_sock_refill_pool(), once we get an error from
> tcp_conn_new_sock(), should we really continue to call it and try to
> get more sockets?
>
> Presumably, that will have something to do with some kind of resource
> exhaustion, so perhaps we shouldn't risk making it worse and just
> stop refilling the pool. If we do, we might have up to one negative
> error code in the pool, I guess.
Some sort of exhaustion is certainly the most likely cause, yes.
> - if we have an error code in the pool, that error might have occurred
> a while ago, but here, we're logging it at the moment we need that
> socket.
Fair point.
> Maybe it would be simpler and saner to just have -1 values in the pool
> (error or no socket available) and report errors in tcp_conn_new_sock()
> instead?
Yeah, that makes sense. I'll rework to that goal.
> I'm still reviewing patches 17/22 to 22/22, sorry for the delay.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 19/22] tcp: Validate TCP endpoint addresses
2024-02-06 1:17 ` [PATCH v2 19/22] tcp: Validate TCP endpoint addresses David Gibson
@ 2024-02-22 12:45 ` Stefano Brivio
2024-02-23 3:56 ` David Gibson
0 siblings, 1 reply; 38+ messages in thread
From: Stefano Brivio @ 2024-02-22 12:45 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev
On Tue, 6 Feb 2024 12:17:31 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> TCP connections should typically not have wildcard addresses (0.0.0.0 or
> ::) nor a zero port number for either endpoint.
This is only true for non-local connections, see also:
https://archives.passt.top/passt-dev/20240119105630.089c5d34@elisabeth/
Just looking at RFC 6890, which should be sufficient:
+----------------------+----------------------------+
| Attribute | Value |
+----------------------+----------------------------+
| Address Block | 0.0.0.0/8 |
| Name | "This host on this network"|
| RFC | [RFC1122], Section 3.2.1.3 |
| Allocation Date | September 1981 |
| Termination Date | N/A |
| Source | True |
| Destination | False |
| Forwardable | False |
| Global | False |
| Reserved-by-Protocol | True |
+----------------------+----------------------------+
...it can be used as source, but surely not by a non-loopback
interface, so not by the guest.
About using it as destination: the table says it can't be done, but:
$ nc -l -p 8818 & echo abcd | nc 0.0.0.0 8818
[2] 2624647
abcd
and:
# tcpdump -ni lo tcp port 8818
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on lo, link-type EN10MB (Ethernet), snapshot length 262144 bytes
13:24:51.379436 IP 127.0.0.1.58574 > 127.0.0.1.8818: Flags [S], seq 3285989360, win 65495, options [mss 65495,sackOK,TS val 1253719321 ecr 0,nop,wscale 7], length 0
13:24:51.379447 IP 127.0.0.1.8818 > 127.0.0.1.58574: Flags [S.], seq 3128422158, ack 3285989361, win 65483, options [mss 65495,sackOK,TS val 1253719321 ecr 1253719321,nop,wscale 7], length 0
13:24:51.379457 IP 127.0.0.1.58574 > 127.0.0.1.8818: Flags [.], ack 1, win 512, options [nop,nop,TS val 1253719321 ecr 1253719321], length 0
13:24:51.379508 IP 127.0.0.1.58574 > 127.0.0.1.8818: Flags [P.], seq 1:6, ack 1, win 512, options [nop,nop,TS val 1253719321 ecr 1253719321], length 5
13:24:51.379513 IP 127.0.0.1.8818 > 127.0.0.1.58574: Flags [.], ack 6, win 512, options [nop,nop,TS val 1253719321 ecr 1253719321], length 0
it's not effectively used, but the kernel understands what I mean by
0.0.0.0:
connect(3, {sa_family=AF_INET, sin_port=htons(8818),
sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EINPROGRESS (Operation now in progress)
So I wouldn't exclude its usage in tcp_listen_handler() -- even though
sure, the kernel translates that, so I don't think it can ever become a
practical issue.
If it annoys you in the perspective of the flow table, maybe we can
turn it into a loopback address, when we see it's used as source or
destination?
If that's problematic as well, I can totally live with this patch,
though.
About IPv6:
+----------------------+---------------------+
| Attribute | Value |
+----------------------+---------------------+
| Address Block | ::/128 |
| Name | Unspecified Address |
| RFC | [RFC4291] |
| Allocation Date | February 2006 |
| Termination Date | N/A |
| Source | True |
| Destination | False |
| Forwardable | False |
| Global | False |
| Reserved-by-Protocol | True |
+----------------------+---------------------+
...this is more specific, and its usage as source address is exemplified
in RFC 4291, 2.5.2:
One example of its use is in the Source Address field of
any IPv6 packets sent by an initializing host before it has learned
its own address.
...which should never be the case for any flow passt has to handle, so
I think it's fine to refuse it in tcp_listen_handler().
The rest of the series, up to 22/22, looks good to me.
> It's not entirely clear
> (at least to me) if it's strictly against the RFCs to do so, but at any
> rate the socket interfaces often treat those values specially[1], so it's
> not really possible to manipulate such connections. Likewise they should
> not have broadcast or multicast addresses for either endpoint.
>
> However, nothing prevents a guest from creating a SYN packet with such
> values, and it's not entirely clear what the effect on passt would be. To
> ensure sane behaviour, explicitly check for this case and drop such
> packets, logging a debug warning (we don't want a higher level, because
> that would allow a guest to spam the logs).
>
> We never expect such an address on an accept()ed socket either, but just
> in case, check for it as well.
>
> [1] Depending on context as "unknown", "match any" or "kernel, pick
> something for me"
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> tcp.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 67 insertions(+), 7 deletions(-)
>
> diff --git a/tcp.c b/tcp.c
> index 31d4e87b..236a8d23 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -284,6 +284,7 @@
> #include <sys/types.h>
> #include <sys/uio.h>
> #include <time.h>
> +#include <arpa/inet.h>
>
> #include <linux/tcp.h> /* For struct tcp_info */
>
> @@ -1930,27 +1931,59 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
> const struct tcphdr *th, const char *opts,
> size_t optlen, const struct timespec *now)
> {
> + in_port_t srcport = ntohs(th->source);
> + in_port_t dstport = ntohs(th->dest);
> struct sockaddr_in addr4 = {
> .sin_family = AF_INET,
> - .sin_port = th->dest,
> + .sin_port = htons(dstport),
> .sin_addr = *(struct in_addr *)daddr,
> };
> struct sockaddr_in6 addr6 = {
> .sin6_family = AF_INET6,
> - .sin6_port = th->dest,
> + .sin6_port = htons(dstport),
> .sin6_addr = *(struct in6_addr *)daddr,
> };
> const struct sockaddr *sa;
> struct tcp_tap_conn *conn;
> union flow *flow;
> + int s = -1, mss;
> socklen_t sl;
> - int s, mss;
> -
> - (void)saddr;
>
> if (!(flow = flow_alloc()))
> return;
>
> + if (af == AF_INET) {
> + if (IN4_IS_ADDR_UNSPECIFIED(saddr) ||
> + IN4_IS_ADDR_BROADCAST(saddr) ||
> + IN4_IS_ADDR_MULTICAST(saddr) || srcport == 0 ||
> + IN4_IS_ADDR_UNSPECIFIED(daddr) ||
> + IN4_IS_ADDR_BROADCAST(daddr) ||
> + IN4_IS_ADDR_MULTICAST(daddr) || dstport == 0) {
> + char sstr[INET_ADDRSTRLEN], dstr[INET_ADDRSTRLEN];
> +
> + debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu",
> + inet_ntop(AF_INET, saddr, sstr, sizeof(sstr)),
> + srcport,
> + inet_ntop(AF_INET, daddr, dstr, sizeof(dstr)),
> + dstport);
> + goto cancel;
> + }
> + } else if (af == AF_INET6) {
> + if (IN6_IS_ADDR_UNSPECIFIED(saddr) ||
> + IN6_IS_ADDR_MULTICAST(saddr) || srcport == 0 ||
> + IN6_IS_ADDR_UNSPECIFIED(daddr) ||
> + IN6_IS_ADDR_MULTICAST(daddr) || dstport == 0) {
> + char sstr[INET6_ADDRSTRLEN], dstr[INET6_ADDRSTRLEN];
> +
> + debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu",
> + inet_ntop(AF_INET6, saddr, sstr, sizeof(sstr)),
> + srcport,
> + inet_ntop(AF_INET6, daddr, dstr, sizeof(dstr)),
> + dstport);
> + goto cancel;
> + }
> + }
> +
> if ((s = tcp_conn_sock(c, af)) < 0)
> goto cancel;
>
> @@ -2001,8 +2034,8 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
> sl = sizeof(addr6);
> }
>
> - conn->fport = ntohs(th->dest);
> - conn->eport = ntohs(th->source);
> + conn->fport = dstport;
> + conn->eport = srcport;
>
> conn->seq_init_from_tap = ntohl(th->seq);
> conn->seq_from_tap = conn->seq_init_from_tap + 1;
> @@ -2732,6 +2765,33 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
> if (s < 0)
> goto cancel;
>
> + if (sa.sa_family == AF_INET) {
> + const struct in_addr *addr = &sa.sa4.sin_addr;
> + in_port_t port = sa.sa4.sin_port;
> +
> + if (IN4_IS_ADDR_UNSPECIFIED(addr) ||
> + IN4_IS_ADDR_BROADCAST(addr) ||
> + IN4_IS_ADDR_MULTICAST(addr) || port == 0) {
> + char str[INET_ADDRSTRLEN];
> +
> + err("Invalid endpoint from TCP accept(): %s:%hu",
> + inet_ntop(AF_INET, addr, str, sizeof(str)), port);
> + goto cancel;
> + }
> + } else if (sa.sa_family == AF_INET6) {
> + const struct in6_addr *addr = &sa.sa6.sin6_addr;
> + in_port_t port = sa.sa6.sin6_port;
> +
> + if (IN6_IS_ADDR_UNSPECIFIED(addr) ||
> + IN6_IS_ADDR_MULTICAST(addr) || port == 0) {
> + char str[INET6_ADDRSTRLEN];
> +
> + err("Invalid endpoint from TCP accept(): %s:%hu",
> + inet_ntop(AF_INET6, addr, str, sizeof(str)), port);
> + goto cancel;
> + }
> + }
> +
> if (tcp_splice_conn_from_sock(c, ref.tcp_listen.pif,
> ref.tcp_listen.port, flow, s, &sa))
> return;
--
Stefano
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 19/22] tcp: Validate TCP endpoint addresses
2024-02-22 12:45 ` Stefano Brivio
@ 2024-02-23 3:56 ` David Gibson
0 siblings, 0 replies; 38+ messages in thread
From: David Gibson @ 2024-02-23 3:56 UTC (permalink / raw)
To: Stefano Brivio; +Cc: passt-dev
[-- Attachment #1: Type: text/plain, Size: 11405 bytes --]
On Thu, Feb 22, 2024 at 01:45:44PM +0100, Stefano Brivio wrote:
> On Tue, 6 Feb 2024 12:17:31 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > TCP connections should typically not have wildcard addresses (0.0.0.0 or
> > ::) nor a zero port number for either endpoint.
>
> This is only true for non-local connections, see also:
> https://archives.passt.top/passt-dev/20240119105630.089c5d34@elisabeth/
>
> Just looking at RFC 6890, which should be sufficient:
>
> +----------------------+----------------------------+
> | Attribute | Value |
> +----------------------+----------------------------+
> | Address Block | 0.0.0.0/8 |
> | Name | "This host on this network"|
> | RFC | [RFC1122], Section 3.2.1.3 |
> | Allocation Date | September 1981 |
> | Termination Date | N/A |
> | Source | True |
> | Destination | False |
> | Forwardable | False |
> | Global | False |
> | Reserved-by-Protocol | True |
> +----------------------+----------------------------+
It's unclear to me from either rfc 6890 or rfc 1122 whether this is
describing something meaningful on the wire, or only something
meaningful in APIs.
> ...it can be used as source, but surely not by a non-loopback
> interface, so not by the guest.
>
> About using it as destination: the table says it can't be done, but:
>
> $ nc -l -p 8818 & echo abcd | nc 0.0.0.0 8818
> [2] 2624647
> abcd
>
> and:
>
> # tcpdump -ni lo tcp port 8818
> tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
> listening on lo, link-type EN10MB (Ethernet), snapshot length 262144 bytes
> 13:24:51.379436 IP 127.0.0.1.58574 > 127.0.0.1.8818: Flags [S], seq 3285989360, win 65495, options [mss 65495,sackOK,TS val 1253719321 ecr 0,nop,wscale 7], length 0
> 13:24:51.379447 IP 127.0.0.1.8818 > 127.0.0.1.58574: Flags [S.], seq 3128422158, ack 3285989361, win 65483, options [mss 65495,sackOK,TS val 1253719321 ecr 1253719321,nop,wscale 7], length 0
> 13:24:51.379457 IP 127.0.0.1.58574 > 127.0.0.1.8818: Flags [.], ack 1, win 512, options [nop,nop,TS val 1253719321 ecr 1253719321], length 0
> 13:24:51.379508 IP 127.0.0.1.58574 > 127.0.0.1.8818: Flags [P.], seq 1:6, ack 1, win 512, options [nop,nop,TS val 1253719321 ecr 1253719321], length 5
> 13:24:51.379513 IP 127.0.0.1.8818 > 127.0.0.1.58574: Flags [.], ack 6, win 512, options [nop,nop,TS val 1253719321 ecr 1253719321], length 0
>
> it's not effectively used, but the kernel understands what I mean by
> 0.0.0.0:
Right: the 0.0.0.0 means something at the API level, but not AFAICT,
at the wire level. At least for the "from tap" side of this change,
it's the wire level that I'm checking.
> connect(3, {sa_family=AF_INET, sin_port=htons(8818),
> sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EINPROGRESS (Operation now in progress)
>
> So I wouldn't exclude its usage in tcp_listen_handler() -- even though
> sure, the kernel translates that, so I don't think it can ever become a
> practical issue.
Hmm. In listen_handler() it's not "on the wire", but reading from an
API still seems different from sending to an API to me. At the very
least 0.0.0.0 doesn't have the same meaning in all API contexts, which
makes it, IMO, unsafe to carry around from one API to another.
> If it annoys you in the perspective of the flow table, maybe we can
> turn it into a loopback address, when we see it's used as source or
> destination?
It appears the kernel already does that (IMO, correctly). If we do
see a 0.0.0.0 here, I think that would be the kernel indicating some
sort of special case that would need special handling. So, again, I
don't think we should just blindly copy this address to other APIs.
> If that's problematic as well, I can totally live with this patch,
> though.
>
> About IPv6:
>
>
> +----------------------+---------------------+
> | Attribute | Value |
> +----------------------+---------------------+
> | Address Block | ::/128 |
> | Name | Unspecified Address |
> | RFC | [RFC4291] |
> | Allocation Date | February 2006 |
> | Termination Date | N/A |
> | Source | True |
> | Destination | False |
> | Forwardable | False |
> | Global | False |
> | Reserved-by-Protocol | True |
> +----------------------+---------------------+
>
> ...this is more specific, and its usage as source address is exemplified
> in RFC 4291, 2.5.2:
>
> One example of its use is in the Source Address field of
> any IPv6 packets sent by an initializing host before it has learned
> its own address.
>
> ...which should never be the case for any flow passt has to handle, so
> I think it's fine to refuse it in tcp_listen_handler().
Well, depends how broadly you define "flow". We may need to handle
this for NDP and/or DHCPv6. But this is specifically for TCP.
> The rest of the series, up to 22/22, looks good to me.
>
> > It's not entirely clear
> > (at least to me) if it's strictly against the RFCs to do so, but at any
> > rate the socket interfaces often treat those values specially[1], so it's
> > not really possible to manipulate such connections. Likewise they should
> > not have broadcast or multicast addresses for either endpoint.
> >
> > However, nothing prevents a guest from creating a SYN packet with such
> > values, and it's not entirely clear what the effect on passt would be. To
> > ensure sane behaviour, explicitly check for this case and drop such
> > packets, logging a debug warning (we don't want a higher level, because
> > that would allow a guest to spam the logs).
> >
> > We never expect such an address on an accept()ed socket either, but just
> > in case, check for it as well.
> >
> > [1] Depending on context as "unknown", "match any" or "kernel, pick
> > something for me"
> >
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> > tcp.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
> > 1 file changed, 67 insertions(+), 7 deletions(-)
> >
> > diff --git a/tcp.c b/tcp.c
> > index 31d4e87b..236a8d23 100644
> > --- a/tcp.c
> > +++ b/tcp.c
> > @@ -284,6 +284,7 @@
> > #include <sys/types.h>
> > #include <sys/uio.h>
> > #include <time.h>
> > +#include <arpa/inet.h>
> >
> > #include <linux/tcp.h> /* For struct tcp_info */
> >
> > @@ -1930,27 +1931,59 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
> > const struct tcphdr *th, const char *opts,
> > size_t optlen, const struct timespec *now)
> > {
> > + in_port_t srcport = ntohs(th->source);
> > + in_port_t dstport = ntohs(th->dest);
> > struct sockaddr_in addr4 = {
> > .sin_family = AF_INET,
> > - .sin_port = th->dest,
> > + .sin_port = htons(dstport),
> > .sin_addr = *(struct in_addr *)daddr,
> > };
> > struct sockaddr_in6 addr6 = {
> > .sin6_family = AF_INET6,
> > - .sin6_port = th->dest,
> > + .sin6_port = htons(dstport),
> > .sin6_addr = *(struct in6_addr *)daddr,
> > };
> > const struct sockaddr *sa;
> > struct tcp_tap_conn *conn;
> > union flow *flow;
> > + int s = -1, mss;
> > socklen_t sl;
> > - int s, mss;
> > -
> > - (void)saddr;
> >
> > if (!(flow = flow_alloc()))
> > return;
> >
> > + if (af == AF_INET) {
> > + if (IN4_IS_ADDR_UNSPECIFIED(saddr) ||
> > + IN4_IS_ADDR_BROADCAST(saddr) ||
> > + IN4_IS_ADDR_MULTICAST(saddr) || srcport == 0 ||
> > + IN4_IS_ADDR_UNSPECIFIED(daddr) ||
> > + IN4_IS_ADDR_BROADCAST(daddr) ||
> > + IN4_IS_ADDR_MULTICAST(daddr) || dstport == 0) {
> > + char sstr[INET_ADDRSTRLEN], dstr[INET_ADDRSTRLEN];
> > +
> > + debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu",
> > + inet_ntop(AF_INET, saddr, sstr, sizeof(sstr)),
> > + srcport,
> > + inet_ntop(AF_INET, daddr, dstr, sizeof(dstr)),
> > + dstport);
> > + goto cancel;
> > + }
> > + } else if (af == AF_INET6) {
> > + if (IN6_IS_ADDR_UNSPECIFIED(saddr) ||
> > + IN6_IS_ADDR_MULTICAST(saddr) || srcport == 0 ||
> > + IN6_IS_ADDR_UNSPECIFIED(daddr) ||
> > + IN6_IS_ADDR_MULTICAST(daddr) || dstport == 0) {
> > + char sstr[INET6_ADDRSTRLEN], dstr[INET6_ADDRSTRLEN];
> > +
> > + debug("Invalid endpoint in TCP SYN: %s:%hu -> %s:%hu",
> > + inet_ntop(AF_INET6, saddr, sstr, sizeof(sstr)),
> > + srcport,
> > + inet_ntop(AF_INET6, daddr, dstr, sizeof(dstr)),
> > + dstport);
> > + goto cancel;
> > + }
> > + }
> > +
> > if ((s = tcp_conn_sock(c, af)) < 0)
> > goto cancel;
> >
> > @@ -2001,8 +2034,8 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
> > sl = sizeof(addr6);
> > }
> >
> > - conn->fport = ntohs(th->dest);
> > - conn->eport = ntohs(th->source);
> > + conn->fport = dstport;
> > + conn->eport = srcport;
> >
> > conn->seq_init_from_tap = ntohl(th->seq);
> > conn->seq_from_tap = conn->seq_init_from_tap + 1;
> > @@ -2732,6 +2765,33 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref,
> > if (s < 0)
> > goto cancel;
> >
> > + if (sa.sa_family == AF_INET) {
> > + const struct in_addr *addr = &sa.sa4.sin_addr;
> > + in_port_t port = sa.sa4.sin_port;
> > +
> > + if (IN4_IS_ADDR_UNSPECIFIED(addr) ||
> > + IN4_IS_ADDR_BROADCAST(addr) ||
> > + IN4_IS_ADDR_MULTICAST(addr) || port == 0) {
> > + char str[INET_ADDRSTRLEN];
> > +
> > + err("Invalid endpoint from TCP accept(): %s:%hu",
> > + inet_ntop(AF_INET, addr, str, sizeof(str)), port);
> > + goto cancel;
> > + }
> > + } else if (sa.sa_family == AF_INET6) {
> > + const struct in6_addr *addr = &sa.sa6.sin6_addr;
> > + in_port_t port = sa.sa6.sin6_port;
> > +
> > + if (IN6_IS_ADDR_UNSPECIFIED(addr) ||
> > + IN6_IS_ADDR_MULTICAST(addr) || port == 0) {
> > + char str[INET6_ADDRSTRLEN];
> > +
> > + err("Invalid endpoint from TCP accept(): %s:%hu",
> > + inet_ntop(AF_INET6, addr, str, sizeof(str)), port);
> > + goto cancel;
> > + }
> > + }
> > +
> > if (tcp_splice_conn_from_sock(c, ref.tcp_listen.pif,
> > ref.tcp_listen.port, flow, s, &sa))
> > return;
>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 00/22] More flow table preliminaries: address handling improvements
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
` (21 preceding siblings ...)
2024-02-06 1:17 ` [PATCH v2 22/22] fwd: Rename port_fwd.[ch] and their contents David Gibson
@ 2024-02-27 14:22 ` Stefano Brivio
22 siblings, 0 replies; 38+ messages in thread
From: Stefano Brivio @ 2024-02-27 14:22 UTC (permalink / raw)
To: David Gibson; +Cc: passt-dev
On Tue, 6 Feb 2024 12:17:12 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> Here's another batch of cleanups and tweaks in preparation for the
> flow table. This set focuses on improved helpers for handling
> addresses, particularly in the TCP splice path.
Sorry for the delay, I made sure I'm done reviewing this and I have no
further comments -- I'm fine with 19/22 as it is. Please post v3 so
that I can apply this.
--
Stefano
^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2024-02-27 14:23 UTC | newest]
Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-06 1:17 [PATCH v2 00/22] More flow table preliminaries: address handling improvements David Gibson
2024-02-06 1:17 ` [PATCH v2 01/22] treewide: Use sa_family_t for address family variables David Gibson
2024-02-06 1:17 ` [PATCH v2 02/22] inany: Helper to test for various address types David Gibson
2024-02-18 20:58 ` Stefano Brivio
2024-02-19 1:48 ` David Gibson
2024-02-06 1:17 ` [PATCH v2 03/22] inany: Add inany_ntop() helper David Gibson
2024-02-06 1:17 ` [PATCH v2 04/22] inany: Provide more conveniently typed constants for special addresses David Gibson
2024-02-06 1:17 ` [PATCH v2 05/22] inany: Introduce union sockaddr_inany David Gibson
2024-02-06 1:17 ` [PATCH v2 06/22] util: Allow IN4_IS_* macros to operate on untyped addresses David Gibson
2024-02-06 1:17 ` [PATCH v2 07/22] tcp, udp: Don't precompute port remappings in epoll references David Gibson
2024-02-06 1:17 ` [PATCH v2 08/22] flow: Add helper to determine a flow's protocol David Gibson
2024-02-06 1:17 ` [PATCH v2 09/22] tcp_splice: Simplify clean up logic David Gibson
2024-02-18 20:59 ` Stefano Brivio
2024-02-19 1:50 ` David Gibson
2024-02-06 1:17 ` [PATCH v2 10/22] tcp_splice: Don't use flow_trace() before setting flow type David Gibson
2024-02-06 1:17 ` [PATCH v2 11/22] flow: Clarify flow entry life cycle, introduce uniform logging David Gibson
2024-02-18 21:00 ` Stefano Brivio
2024-02-19 1:58 ` David Gibson
2024-02-06 1:17 ` [PATCH v2 12/22] tcp, tcp_splice: Helpers for getting sockets from the pools David Gibson
2024-02-18 21:00 ` Stefano Brivio
2024-02-19 1:51 ` David Gibson
2024-02-06 1:17 ` [PATCH v2 13/22] tcp_splice: More specific variable names in new splice path David Gibson
2024-02-18 21:00 ` Stefano Brivio
2024-02-19 1:53 ` David Gibson
2024-02-06 1:17 ` [PATCH v2 14/22] tcp_splice: Merge tcp_splice_new() into its caller David Gibson
2024-02-06 1:17 ` [PATCH v2 15/22] tcp_splice: Make tcp_splice_connect() create its own sockets David Gibson
2024-02-06 1:17 ` [PATCH v2 16/22] tcp_splice: Improve error reporting on connect path David Gibson
2024-02-18 21:01 ` Stefano Brivio
2024-02-19 3:23 ` David Gibson
2024-02-06 1:17 ` [PATCH v2 17/22] tcp_splice: Improve logic deciding when to splice David Gibson
2024-02-06 1:17 ` [PATCH v2 18/22] tcp, tcp_splice: Parse listening socket epoll ref in tcp_listen_handler() David Gibson
2024-02-06 1:17 ` [PATCH v2 19/22] tcp: Validate TCP endpoint addresses David Gibson
2024-02-22 12:45 ` Stefano Brivio
2024-02-23 3:56 ` David Gibson
2024-02-06 1:17 ` [PATCH v2 20/22] tap: Disallow loopback addresses on tap interface David Gibson
2024-02-06 1:17 ` [PATCH v2 21/22] port_fwd: Fix copypasta error in port_fwd_scan_udp() comments David Gibson
2024-02-06 1:17 ` [PATCH v2 22/22] fwd: Rename port_fwd.[ch] and their contents David Gibson
2024-02-27 14:22 ` [PATCH v2 00/22] More flow table preliminaries: address handling improvements Stefano Brivio
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).