From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au Authentication-Results: passt.top; dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202502 header.b=S7GxaxVQ; dkim-atps=neutral Received: from mail.ozlabs.org (gandalf.ozlabs.org [150.107.74.76]) by passt.top (Postfix) with ESMTPS id 97CF05A061D for ; Fri, 04 Apr 2025 12:16:00 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202502; t=1743761743; bh=WW5gQJOLyX5YuYjB5kAb5ga+T4uApNleJS5VbZzsUGE=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=S7GxaxVQ2bdFQbAjPvLlHRe95YEY2Ug0PUsWtZNu9J2ms5pnzglUaanZhlNDBv8DJ 6GCsRTlIgjQ2pSh8G3uA25QECk38x1+BlV1anzpBh6rsA9COfkO/91xSlZhIaplc12 ihDOD7TrMHbfOSJAGEsSxBlVc/c6dKbYD2IWmhlyggOikgxb1bEZx7nITu+CYk9aTw LJB6qb3XulZkDIk4SUBAv2k5HeHJHcZIqoK5joRPLtxaeoaN02gAcoiO36YMLgJbwa Xe/EAin9kgqTN2Uffe/ZeSo12unfnXhZLnw+fimc0HVI/UA41g+GPb690Pd88uSZmH Gr5vvDRnUEWiw== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4ZTZHg5R1tz4xNg; Fri, 4 Apr 2025 21:15:43 +1100 (AEDT) From: David Gibson To: Stefano Brivio , passt-dev@passt.top Subject: [PATCH 01/12] udp: Use connect()ed sockets for initiating side Date: Fri, 4 Apr 2025 21:15:31 +1100 Message-ID: <20250404101542.3729316-2-david@gibson.dropbear.id.au> X-Mailer: git-send-email 2.49.0 In-Reply-To: <20250404101542.3729316-1-david@gibson.dropbear.id.au> References: <20250404101542.3729316-1-david@gibson.dropbear.id.au> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Message-ID-Hash: JPWZLWTSXY57ITYRBWDRBYCC2XKJTW4K X-Message-ID-Hash: JPWZLWTSXY57ITYRBWDRBYCC2XKJTW4K X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: David Gibson X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: Currently we have an asymmetry in how we handle UDP sockets. For flows where the target side is a socket, we create a new connect()ed socket - the "reply socket" specifically for that flow used for sending and receiving datagrams on that flow and only that flow. For flows where the initiating side is a socket, we continue to use the "listening" socket (or rather, a dup() of it). This has some disadvantages: * We need a hash lookup for every datagram on the listening socket in order to work out what flow it belongs to * The dup() keeps the socket alive even if automatic forwarding removes the listening socket. However, the epoll data remains the same including containing the now stale original fd. This causes bug 103. * We can't (easily) set flow-specific options on an initiating side socket, because that could affect other flows as well Alter the code to use a connect()ed socket on the initiating side as well as the target side. There's no way to "clone and connect" the listening socket (a loose equivalent of accept() for UDP), so we have to create a new socket. We have to bind() this socket before we connect() it, which is allowed thanks to SO_REUSEADDR, but does leave a small window where it could receive datagrams not intended for this flow. For now we handle this by simply discarding any datagrams received between bind() and connect(), but I intend to improve this in a later patch. Link: https://bugs.passt.top/show_bug.cgi?id=103 Signed-off-by: David Gibson --- epoll_type.h | 4 ++-- passt.c | 6 +++--- udp.c | 46 ++++++++++++++++++++++++---------------------- udp.h | 4 ++-- udp_flow.c | 32 +++++++++----------------------- util.c | 2 +- 6 files changed, 41 insertions(+), 53 deletions(-) diff --git a/epoll_type.h b/epoll_type.h index 7f2a1217..12ac59b1 100644 --- a/epoll_type.h +++ b/epoll_type.h @@ -22,8 +22,8 @@ enum epoll_type { EPOLL_TYPE_TCP_TIMER, /* UDP "listening" sockets */ EPOLL_TYPE_UDP_LISTEN, - /* UDP socket for replies on a specific flow */ - EPOLL_TYPE_UDP_REPLY, + /* UDP socket for a specific flow */ + EPOLL_TYPE_UDP, /* ICMP/ICMPv6 ping sockets */ EPOLL_TYPE_PING, /* inotify fd watching for end of netns (pasta) */ diff --git a/passt.c b/passt.c index cd067728..388d10f1 100644 --- a/passt.c +++ b/passt.c @@ -68,7 +68,7 @@ char *epoll_type_str[] = { [EPOLL_TYPE_TCP_LISTEN] = "listening TCP socket", [EPOLL_TYPE_TCP_TIMER] = "TCP timer", [EPOLL_TYPE_UDP_LISTEN] = "listening UDP socket", - [EPOLL_TYPE_UDP_REPLY] = "UDP reply socket", + [EPOLL_TYPE_UDP] = "UDP flow socket", [EPOLL_TYPE_PING] = "ICMP/ICMPv6 ping socket", [EPOLL_TYPE_NSQUIT_INOTIFY] = "namespace inotify watch", [EPOLL_TYPE_NSQUIT_TIMER] = "namespace timer watch", @@ -339,8 +339,8 @@ loop: case EPOLL_TYPE_UDP_LISTEN: udp_listen_sock_handler(&c, ref, eventmask, &now); break; - case EPOLL_TYPE_UDP_REPLY: - udp_reply_sock_handler(&c, ref, eventmask, &now); + case EPOLL_TYPE_UDP: + udp_sock_handler(&c, ref, eventmask, &now); break; case EPOLL_TYPE_PING: icmp_sock_handler(&c, ref); diff --git a/udp.c b/udp.c index ab3e9d20..fa6fccdc 100644 --- a/udp.c +++ b/udp.c @@ -39,27 +39,30 @@ * could receive packets from multiple flows, so we use a hash table match to * find the specific flow for a datagram. * - * When a UDP flow is initiated from a listening socket we take a duplicate of - * the socket and store it in uflow->s[INISIDE]. This will last for the - * lifetime of the flow, even if the original listening socket is closed due to - * port auto-probing. The duplicate is used to deliver replies back to the - * originating side. - * - * Reply sockets - * ============= + * Flow sockets + * ============ * - * When a UDP flow targets a socket, we create a "reply" socket in + * When a UDP flow targets a socket, we create a "flow" socket in * uflow->s[TGTSIDE] both to deliver datagrams to the target side and receive * replies on the target side. This socket is both bound and connected and has - * EPOLL_TYPE_UDP_REPLY. The connect() means it will only receive datagrams + * EPOLL_TYPE_UDP. The connect() means it will only receive datagrams * associated with this flow, so the epoll reference directly points to the flow * and we don't need a hash lookup. * - * NOTE: it's possible that the reply socket could have a bound address - * overlapping with an unrelated listening socket. We assume datagrams for the - * flow will come to the reply socket in preference to a listening socket. The - * sample program doc/platform-requirements/reuseaddr-priority.c documents and - * tests that assumption. + * When a flow is initiated from a listening socket, we create a "flow" socket + * with the same bound address as the listening socket, but also connect()ed to + * the flow's peer. This is stored in uflow->s[INISIDE] and will last for the + * lifetime of the flow, even if the original listening socket is closed due to + * port auto-probing. The duplicate is used to deliver replies back to the + * originating side. + * + * NOTE: A flow socket can have a bound address overlapping with a listening + * socket. That will happen naturally for flows initiated from a socket, but is + * also possible (though unlikely) for tap initiated flows, depending on the + * source port. We assume datagrams for the flow will come to a connect()ed + * socket in preference to a listening socket. The sample program + * doc/platform-requirements/reuseaddr-priority.c documents and tests that + * assumption. * * "Spliced" flows * =============== @@ -71,8 +74,7 @@ * actually used; it doesn't make sense for datagrams and instead a pair of * recvmmsg() and sendmmsg() is used to forward the datagrams. * - * Note that a spliced flow will have *both* a duplicated listening socket and a - * reply socket (see above). + * Note that a spliced flow will have two flow sockets (see above). */ #include @@ -557,7 +559,7 @@ static int udp_sock_recverr(const struct ctx *c, union epoll_ref ref) } eh = (const struct errhdr *)CMSG_DATA(hdr); - if (ref.type == EPOLL_TYPE_UDP_REPLY) { + if (ref.type == EPOLL_TYPE_UDP) { flow_sidx_t sidx = flow_sidx_opposite(ref.flowside); const struct flowside *toside = flowside_at_sidx(sidx); size_t dlen = rc; @@ -792,14 +794,14 @@ static bool udp_buf_reply_sock_data(const struct ctx *c, } /** - * udp_reply_sock_handler() - Handle new data from flow specific socket + * udp_sock_handler() - Handle new data from flow specific socket * @c: Execution context * @ref: epoll reference * @events: epoll events bitmap * @now: Current timestamp */ -void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref, - uint32_t events, const struct timespec *now) +void udp_sock_handler(const struct ctx *c, union epoll_ref ref, + uint32_t events, const struct timespec *now) { struct udp_flow *uflow = udp_at_sidx(ref.flowside); @@ -807,7 +809,7 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref, if (events & EPOLLERR) { if (udp_sock_errs(c, ref) < 0) { - flow_err(uflow, "Unrecoverable error on reply socket"); + flow_err(uflow, "Unrecoverable error on flow socket"); goto fail; } } diff --git a/udp.h b/udp.h index de2df6de..8fc42838 100644 --- a/udp.h +++ b/udp.h @@ -11,8 +11,8 @@ void udp_portmap_clear(void); void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events, const struct timespec *now); -void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref, - uint32_t events, const struct timespec *now); +void udp_sock_handler(const struct ctx *c, union epoll_ref ref, + uint32_t events, const struct timespec *now); int udp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, const void *saddr, const void *daddr, const struct pool *p, int idx, const struct timespec *now); diff --git a/udp_flow.c b/udp_flow.c index bf4b8965..d50bddb2 100644 --- a/udp_flow.c +++ b/udp_flow.c @@ -49,10 +49,7 @@ void udp_flow_close(const struct ctx *c, struct udp_flow *uflow) flow_foreach_sidei(sidei) { flow_hash_remove(c, FLOW_SIDX(uflow, sidei)); if (uflow->s[sidei] >= 0) { - /* The listening socket needs to stay in epoll, but the - * flow specific one needs to be removed */ - if (sidei == TGTSIDE) - epoll_del(c, uflow->s[sidei]); + epoll_del(c, uflow->s[sidei]); close(uflow->s[sidei]); uflow->s[sidei] = -1; } @@ -81,7 +78,7 @@ static int udp_flow_sock(const struct ctx *c, } fref = { .sidx = FLOW_SIDX(uflow, sidei) }; int rc, s; - s = flowside_sock_l4(c, EPOLL_TYPE_UDP_REPLY, pif, side, fref.data); + s = flowside_sock_l4(c, EPOLL_TYPE_UDP, pif, side, fref.data); if (s < 0) { flow_dbg_perror(uflow, "Couldn't open flow specific socket"); return s; @@ -120,13 +117,12 @@ static int udp_flow_sock(const struct ctx *c, * udp_flow_new() - Common setup for a new UDP flow * @c: Execution context * @flow: Initiated flow - * @s_ini: Initiating socket (or -1) * @now: Timestamp * * Return: UDP specific flow, if successful, NULL on failure */ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow, - int s_ini, const struct timespec *now) + const struct timespec *now) { struct udp_flow *uflow = NULL; unsigned sidei; @@ -138,22 +134,12 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow, uflow->ts = now->tv_sec; uflow->s[INISIDE] = uflow->s[TGTSIDE] = -1; - if (s_ini >= 0) { - /* When using auto port-scanning the listening port could go - * away, so we need to duplicate the socket - */ - uflow->s[INISIDE] = fcntl(s_ini, F_DUPFD_CLOEXEC, 0); - if (uflow->s[INISIDE] < 0) { - flow_perror(uflow, - "Couldn't duplicate listening socket"); - goto cancel; - } + flow_foreach_sidei(sidei) { + if (pif_is_socket(uflow->f.pif[sidei])) + if ((uflow->s[sidei] = udp_flow_sock(c, uflow, sidei)) < 0) + goto cancel; } - if (pif_is_socket(flow->f.pif[TGTSIDE])) - if ((uflow->s[TGTSIDE] = udp_flow_sock(c, uflow, TGTSIDE)) < 0) - goto cancel; - /* Tap sides always need to be looked up by hash. Socket sides don't * always, but sometimes do (receiving packets on a socket not specific * to one flow). Unconditionally hash both sides so all our bases are @@ -224,7 +210,7 @@ flow_sidx_t udp_flow_from_sock(const struct ctx *c, union epoll_ref ref, return FLOW_SIDX_NONE; } - return udp_flow_new(c, flow, ref.fd, now); + return udp_flow_new(c, flow, now); } /** @@ -280,7 +266,7 @@ flow_sidx_t udp_flow_from_tap(const struct ctx *c, return FLOW_SIDX_NONE; } - return udp_flow_new(c, flow, -1, now); + return udp_flow_new(c, flow, now); } /** diff --git a/util.c b/util.c index b9a3d434..0f68cf57 100644 --- a/util.c +++ b/util.c @@ -71,7 +71,7 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type, case EPOLL_TYPE_UDP_LISTEN: freebind = c->freebind; /* fallthrough */ - case EPOLL_TYPE_UDP_REPLY: + case EPOLL_TYPE_UDP: proto = IPPROTO_UDP; socktype = SOCK_DGRAM | SOCK_NONBLOCK; break; -- 2.49.0