From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>, passt-dev@passt.top
Cc: David Gibson <david@gibson.dropbear.id.au>
Subject: [PATCH 12/12] udp_flow: Don't discard packets that arrive between bind() and connect()
Date: Fri, 4 Apr 2025 21:15:42 +1100 [thread overview]
Message-ID: <20250404101542.3729316-13-david@gibson.dropbear.id.au> (raw)
In-Reply-To: <20250404101542.3729316-1-david@gibson.dropbear.id.au>
When we establish a new UDP flow we create connect()ed sockets that will
only handle datagrams for this flow. However, there is a race between
bind() and connect() where they might get some packets queued for a
different flow. Currently we handle this by simply discarding any
queued datagrams after the connect. UDP protocols should be able to handle
such packet loss, but it's not ideal.
We now have the tools we need to handle this better, by redirecting any
datagrams received during that race to the appropriate flow. We need to
use a deferred handler for this to avoid unexpectedly re-ordering datagrams
in some edge cases.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
flow.c | 2 +-
udp.c | 4 +--
udp_flow.c | 73 +++++++++++++++++++++++++++++++++++---------------
udp_flow.h | 6 ++++-
udp_internal.h | 2 ++
5 files changed, 61 insertions(+), 26 deletions(-)
diff --git a/flow.c b/flow.c
index 86222426..29a83e18 100644
--- a/flow.c
+++ b/flow.c
@@ -850,7 +850,7 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
closed = icmp_ping_timer(c, &flow->ping, now);
break;
case FLOW_UDP:
- closed = udp_flow_defer(&flow->udp);
+ closed = udp_flow_defer(c, &flow->udp, now);
if (!closed && timer)
closed = udp_flow_timer(c, &flow->udp, now);
break;
diff --git a/udp.c b/udp.c
index 7c8b7a2c..b275db3d 100644
--- a/udp.c
+++ b/udp.c
@@ -697,8 +697,8 @@ static void udp_buf_sock_to_tap(const struct ctx *c, int s, int n,
* @port: Our (local) port number of @s
* @now: Current timestamp
*/
-static void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif,
- in_port_t port, const struct timespec *now)
+void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif,
+ in_port_t port, const struct timespec *now)
{
union sockaddr_inany src;
diff --git a/udp_flow.c b/udp_flow.c
index b95c3176..af15d7f2 100644
--- a/udp_flow.c
+++ b/udp_flow.c
@@ -9,10 +9,12 @@
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>
+#include <netinet/udp.h>
#include "util.h"
#include "passt.h"
#include "flow_table.h"
+#include "udp_internal.h"
#define UDP_CONN_TIMEOUT 180 /* s, timeout for ephemeral or local bind */
@@ -67,16 +69,15 @@ void udp_flow_close(const struct ctx *c, struct udp_flow *uflow)
* Return: fd of new socket on success, -ve error code on failure
*/
static int udp_flow_sock(const struct ctx *c,
- const struct udp_flow *uflow, unsigned sidei)
+ struct udp_flow *uflow, unsigned sidei)
{
const struct flowside *side = &uflow->f.side[sidei];
- struct mmsghdr discard[UIO_MAXIOV] = { 0 };
uint8_t pif = uflow->f.pif[sidei];
union {
flow_sidx_t sidx;
uint32_t data;
} fref = { .sidx = FLOW_SIDX(uflow, sidei) };
- int rc, s;
+ int s;
s = flowside_sock_l4(c, EPOLL_TYPE_UDP, pif, side, fref.data);
if (s < 0) {
@@ -85,30 +86,32 @@ static int udp_flow_sock(const struct ctx *c,
}
if (flowside_connect(c, s, pif, side) < 0) {
- rc = -errno;
+ int rc = -errno;
flow_dbg_perror(uflow, "Couldn't connect flow socket");
return rc;
}
- /* It's possible, if unlikely, that we could receive some unrelated
- * packets in between the bind() and connect() of this socket. For now
- * we just discard these.
+ /* It's possible, if unlikely, that we could receive some packets in
+ * between the bind() and connect() which may or may not be for this
+ * flow. Being UDP we could just discard them, but it's not ideal.
*
- * FIXME: Redirect these to an appropriate handler
+ * There's also a tricky case if a bunch of datagrams for a new flow
+ * arrive in rapid succession, the first going to the original listening
+ * socket and later ones going to this new socket. If we forwarded the
+ * datagrams from the new socket immediately here they would go before
+ * the datagram which established the flow. Again, not strictly wrong
+ * for UDP, but not ideal.
+ *
+ * So, we flag that the new socket is in a transient state where it
+ * might have datagrams for a different flow queued. Before the next
+ * epoll cycle, udp_flow_defer() will flush out any such datagrams, and
+ * thereafter everything on the new socket should be strictly for this
+ * flow.
*/
- rc = recvmmsg(s, discard, ARRAY_SIZE(discard), MSG_DONTWAIT, NULL);
- if (rc >= ARRAY_SIZE(discard)) {
- flow_dbg(uflow, "Too many (%d) spurious reply datagrams", rc);
- return -E2BIG;
- }
-
- if (rc > 0) {
- flow_trace(uflow, "Discarded %d spurious reply datagrams", rc);
- } else if (errno != EAGAIN) {
- rc = -errno;
- flow_perror(uflow, "Unexpected error discarding datagrams");
- return rc;
- }
+ if (sidei)
+ uflow->flush1 = true;
+ else
+ uflow->flush0 = true;
return s;
}
@@ -268,14 +271,40 @@ flow_sidx_t udp_flow_from_tap(const struct ctx *c,
return udp_flow_new(c, flow, now);
}
+/**
+ * udp_flush_flow() - Flush datagrams that might not be for this flow
+ * @ctx: Execution context
+ * @uflow: Flow to handle
+ * @sidei: Side of the flow to flush
+ * @now: Current timestamp
+ */
+static void udp_flush_flow(const struct ctx *c,
+ const struct udp_flow *uflow, unsigned sidei,
+ const struct timespec *now)
+{
+ /* We don't know exactly where the datagrams will come from, but we know
+ * they'll have an interface and oport matching this flow */
+ udp_sock_fwd(c, uflow->s[sidei], uflow->f.pif[sidei],
+ uflow->f.side[sidei].oport, now);
+}
+
/**
* udp_flow_defer() - Deferred per-flow handling (clean up aborted flows)
* @uflow: Flow to handle
*
* Return: true if the connection is ready to free, false otherwise
*/
-bool udp_flow_defer(const struct udp_flow *uflow)
+bool udp_flow_defer(const struct ctx *c, struct udp_flow *uflow,
+ const struct timespec *now)
{
+ if (uflow->flush0) {
+ udp_flush_flow(c, uflow, INISIDE, now);
+ uflow->flush0 = false;
+ }
+ if (uflow->flush1) {
+ udp_flush_flow(c, uflow, TGTSIDE, now);
+ uflow->flush1 = false;
+ }
return uflow->closed;
}
diff --git a/udp_flow.h b/udp_flow.h
index d4e4c8b9..d518737e 100644
--- a/udp_flow.h
+++ b/udp_flow.h
@@ -11,6 +11,8 @@
* struct udp - Descriptor for a flow of UDP packets
* @f: Generic flow information
* @closed: Flow is already closed
+ * @flush0: @s[0] may have datagrams queued for other flows
+ * @flush1: @s[1] may have datagrams queued for other flows
* @ts: Activity timestamp
* @s: Socket fd (or -1) for each side of the flow
*/
@@ -19,6 +21,7 @@ struct udp_flow {
struct flow_common f;
bool closed :1;
+ bool flush0, flush1 :1;
time_t ts;
int s[SIDES];
};
@@ -33,7 +36,8 @@ flow_sidx_t udp_flow_from_tap(const struct ctx *c,
in_port_t srcport, in_port_t dstport,
const struct timespec *now);
void udp_flow_close(const struct ctx *c, struct udp_flow *uflow);
-bool udp_flow_defer(const struct udp_flow *uflow);
+bool udp_flow_defer(const struct ctx *c, struct udp_flow *uflow,
+ const struct timespec *now);
bool udp_flow_timer(const struct ctx *c, struct udp_flow *uflow,
const struct timespec *now);
diff --git a/udp_internal.h b/udp_internal.h
index f7d84267..96d11cff 100644
--- a/udp_internal.h
+++ b/udp_internal.h
@@ -28,5 +28,7 @@ size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
const struct flowside *toside, size_t dlen,
bool no_udp_csum);
+void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif,
+ in_port_t port, const struct timespec *now);
#endif /* UDP_INTERNAL_H */
--
@@ -28,5 +28,7 @@ size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
const struct flowside *toside, size_t dlen,
bool no_udp_csum);
+void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif,
+ in_port_t port, const struct timespec *now);
#endif /* UDP_INTERNAL_H */
--
2.49.0
prev parent reply other threads:[~2025-04-04 10:15 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-04 10:15 [PATCH 00/12] Use connect()ed sockets for both sides of UDP flows David Gibson
2025-04-04 10:15 ` [PATCH 01/12] udp: Use connect()ed sockets for initiating side David Gibson
2025-04-04 10:15 ` [PATCH 02/12] udp: Make udp_sock_recv() take max number of frames as a parameter David Gibson
2025-04-04 10:15 ` [PATCH 03/12] udp: Polish udp_vu_sock_info() and remove from vu specific code David Gibson
2025-04-04 10:15 ` [PATCH 04/12] udp: Don't bother to batch datagrams from "listening" socket David Gibson
2025-04-04 10:15 ` [PATCH 05/12] udp: Parameterize number of datagrams handled by udp_*_reply_sock_data() David Gibson
2025-04-04 10:15 ` [PATCH 06/12] udp: Split spliced forwarding path from udp_buf_reply_sock_data() David Gibson
2025-04-04 10:15 ` [PATCH 07/12] udp: Merge vhost-user and "buf" listening socket paths David Gibson
2025-04-04 10:15 ` [PATCH 08/12] udp: Move UDP_MAX_FRAMES to udp.c David Gibson
2025-04-04 10:15 ` [PATCH 09/12] udp_flow: Take pif and port as explicit parameters to udp_flow_from_sock() David Gibson
2025-04-04 10:15 ` [PATCH 10/12] udp: Rework udp_listen_sock_data() into udp_sock_fwd() David Gibson
2025-04-04 10:15 ` [PATCH 11/12] udp: Fold udp_splice_prepare and udp_splice_send into udp_sock_to_sock David Gibson
2025-04-04 10:15 ` David Gibson [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250404101542.3729316-13-david@gibson.dropbear.id.au \
--to=david@gibson.dropbear.id.au \
--cc=passt-dev@passt.top \
--cc=sbrivio@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).