From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top
Subject: Re: [PATCH 12/12] udp_flow: Don't discard packets that arrive between bind() and connect()
Date: Mon, 7 Apr 2025 23:49:41 +0200 [thread overview]
Message-ID: <20250407234941.4a0b811b@elisabeth> (raw)
In-Reply-To: <20250404101542.3729316-13-david@gibson.dropbear.id.au>
On Fri, 4 Apr 2025 21:15:42 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:
> When we establish a new UDP flow we create connect()ed sockets that will
> only handle datagrams for this flow. However, there is a race between
> bind() and connect() where they might get some packets queued for a
> different flow. Currently we handle this by simply discarding any
> queued datagrams after the connect. UDP protocols should be able to handle
> such packet loss, but it's not ideal.
>
> We now have the tools we need to handle this better, by redirecting any
> datagrams received during that race to the appropriate flow. We need to
> use a deferred handler for this to avoid unexpectedly re-ordering datagrams
> in some edge cases.
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> flow.c | 2 +-
> udp.c | 4 +--
> udp_flow.c | 73 +++++++++++++++++++++++++++++++++++---------------
> udp_flow.h | 6 ++++-
> udp_internal.h | 2 ++
> 5 files changed, 61 insertions(+), 26 deletions(-)
>
> diff --git a/flow.c b/flow.c
> index 86222426..29a83e18 100644
> --- a/flow.c
> +++ b/flow.c
> @@ -850,7 +850,7 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
> closed = icmp_ping_timer(c, &flow->ping, now);
> break;
> case FLOW_UDP:
> - closed = udp_flow_defer(&flow->udp);
> + closed = udp_flow_defer(c, &flow->udp, now);
> if (!closed && timer)
> closed = udp_flow_timer(c, &flow->udp, now);
> break;
> diff --git a/udp.c b/udp.c
> index 7c8b7a2c..b275db3d 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -697,8 +697,8 @@ static void udp_buf_sock_to_tap(const struct ctx *c, int s, int n,
> * @port: Our (local) port number of @s
> * @now: Current timestamp
> */
> -static void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif,
> - in_port_t port, const struct timespec *now)
> +void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif,
> + in_port_t port, const struct timespec *now)
> {
> union sockaddr_inany src;
>
> diff --git a/udp_flow.c b/udp_flow.c
> index b95c3176..af15d7f2 100644
> --- a/udp_flow.c
> +++ b/udp_flow.c
> @@ -9,10 +9,12 @@
> #include <fcntl.h>
> #include <sys/uio.h>
> #include <unistd.h>
> +#include <netinet/udp.h>
>
> #include "util.h"
> #include "passt.h"
> #include "flow_table.h"
> +#include "udp_internal.h"
>
> #define UDP_CONN_TIMEOUT 180 /* s, timeout for ephemeral or local bind */
>
> @@ -67,16 +69,15 @@ void udp_flow_close(const struct ctx *c, struct udp_flow *uflow)
> * Return: fd of new socket on success, -ve error code on failure
> */
> static int udp_flow_sock(const struct ctx *c,
> - const struct udp_flow *uflow, unsigned sidei)
> + struct udp_flow *uflow, unsigned sidei)
> {
> const struct flowside *side = &uflow->f.side[sidei];
> - struct mmsghdr discard[UIO_MAXIOV] = { 0 };
> uint8_t pif = uflow->f.pif[sidei];
> union {
> flow_sidx_t sidx;
> uint32_t data;
> } fref = { .sidx = FLOW_SIDX(uflow, sidei) };
> - int rc, s;
> + int s;
>
> s = flowside_sock_l4(c, EPOLL_TYPE_UDP, pif, side, fref.data);
> if (s < 0) {
> @@ -85,30 +86,32 @@ static int udp_flow_sock(const struct ctx *c,
> }
>
> if (flowside_connect(c, s, pif, side) < 0) {
> - rc = -errno;
> + int rc = -errno;
> flow_dbg_perror(uflow, "Couldn't connect flow socket");
> return rc;
> }
>
> - /* It's possible, if unlikely, that we could receive some unrelated
> - * packets in between the bind() and connect() of this socket. For now
> - * we just discard these.
> + /* It's possible, if unlikely, that we could receive some packets in
> + * between the bind() and connect() which may or may not be for this
> + * flow. Being UDP we could just discard them, but it's not ideal.
> *
> - * FIXME: Redirect these to an appropriate handler
> + * There's also a tricky case if a bunch of datagrams for a new flow
> + * arrive in rapid succession, the first going to the original listening
> + * socket and later ones going to this new socket. If we forwarded the
> + * datagrams from the new socket immediately here they would go before
> + * the datagram which established the flow. Again, not strictly wrong
> + * for UDP, but not ideal.
> + *
> + * So, we flag that the new socket is in a transient state where it
> + * might have datagrams for a different flow queued. Before the next
> + * epoll cycle, udp_flow_defer() will flush out any such datagrams, and
> + * thereafter everything on the new socket should be strictly for this
> + * flow.
> */
> - rc = recvmmsg(s, discard, ARRAY_SIZE(discard), MSG_DONTWAIT, NULL);
> - if (rc >= ARRAY_SIZE(discard)) {
> - flow_dbg(uflow, "Too many (%d) spurious reply datagrams", rc);
> - return -E2BIG;
> - }
> -
> - if (rc > 0) {
> - flow_trace(uflow, "Discarded %d spurious reply datagrams", rc);
> - } else if (errno != EAGAIN) {
> - rc = -errno;
> - flow_perror(uflow, "Unexpected error discarding datagrams");
> - return rc;
> - }
> + if (sidei)
> + uflow->flush1 = true;
> + else
> + uflow->flush0 = true;
>
> return s;
> }
> @@ -268,14 +271,40 @@ flow_sidx_t udp_flow_from_tap(const struct ctx *c,
> return udp_flow_new(c, flow, now);
> }
>
> +/**
> + * udp_flush_flow() - Flush datagrams that might not be for this flow
> + * @ctx: Execution context
> + * @uflow: Flow to handle
> + * @sidei: Side of the flow to flush
> + * @now: Current timestamp
> + */
> +static void udp_flush_flow(const struct ctx *c,
> + const struct udp_flow *uflow, unsigned sidei,
> + const struct timespec *now)
> +{
> + /* We don't know exactly where the datagrams will come from, but we know
> + * they'll have an interface and oport matching this flow */
> + udp_sock_fwd(c, uflow->s[sidei], uflow->f.pif[sidei],
> + uflow->f.side[sidei].oport, now);
> +}
> +
> /**
> * udp_flow_defer() - Deferred per-flow handling (clean up aborted flows)
> * @uflow: Flow to handle
> *
> * Return: true if the connection is ready to free, false otherwise
> */
> -bool udp_flow_defer(const struct udp_flow *uflow)
> +bool udp_flow_defer(const struct ctx *c, struct udp_flow *uflow,
> + const struct timespec *now)
Function comment not updated. Updated on merge.
> {
> + if (uflow->flush0) {
> + udp_flush_flow(c, uflow, INISIDE, now);
> + uflow->flush0 = false;
> + }
> + if (uflow->flush1) {
> + udp_flush_flow(c, uflow, TGTSIDE, now);
> + uflow->flush1 = false;
> + }
> return uflow->closed;
> }
>
> diff --git a/udp_flow.h b/udp_flow.h
> index d4e4c8b9..d518737e 100644
> --- a/udp_flow.h
> +++ b/udp_flow.h
> @@ -11,6 +11,8 @@
> * struct udp - Descriptor for a flow of UDP packets
> * @f: Generic flow information
> * @closed: Flow is already closed
> + * @flush0: @s[0] may have datagrams queued for other flows
> + * @flush1: @s[1] may have datagrams queued for other flows
> * @ts: Activity timestamp
> * @s: Socket fd (or -1) for each side of the flow
> */
> @@ -19,6 +21,7 @@ struct udp_flow {
> struct flow_common f;
>
> bool closed :1;
> + bool flush0, flush1 :1;
> time_t ts;
> int s[SIDES];
> };
> @@ -33,7 +36,8 @@ flow_sidx_t udp_flow_from_tap(const struct ctx *c,
> in_port_t srcport, in_port_t dstport,
> const struct timespec *now);
> void udp_flow_close(const struct ctx *c, struct udp_flow *uflow);
> -bool udp_flow_defer(const struct udp_flow *uflow);
> +bool udp_flow_defer(const struct ctx *c, struct udp_flow *uflow,
> + const struct timespec *now);
> bool udp_flow_timer(const struct ctx *c, struct udp_flow *uflow,
> const struct timespec *now);
>
> diff --git a/udp_internal.h b/udp_internal.h
> index f7d84267..96d11cff 100644
> --- a/udp_internal.h
> +++ b/udp_internal.h
> @@ -28,5 +28,7 @@ size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
> size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
> const struct flowside *toside, size_t dlen,
> bool no_udp_csum);
> +void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif,
> + in_port_t port, const struct timespec *now);
>
> #endif /* UDP_INTERNAL_H */
--
Stefano
next prev parent reply other threads:[~2025-04-07 21:49 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-04 10:15 [PATCH 00/12] Use connect()ed sockets for both sides of UDP flows David Gibson
2025-04-04 10:15 ` [PATCH 01/12] udp: Use connect()ed sockets for initiating side David Gibson
2025-04-04 10:15 ` [PATCH 02/12] udp: Make udp_sock_recv() take max number of frames as a parameter David Gibson
2025-04-07 21:49 ` Stefano Brivio
2025-04-04 10:15 ` [PATCH 03/12] udp: Polish udp_vu_sock_info() and remove from vu specific code David Gibson
2025-04-07 21:49 ` Stefano Brivio
2025-04-04 10:15 ` [PATCH 04/12] udp: Don't bother to batch datagrams from "listening" socket David Gibson
2025-04-04 10:15 ` [PATCH 05/12] udp: Parameterize number of datagrams handled by udp_*_reply_sock_data() David Gibson
2025-04-04 10:15 ` [PATCH 06/12] udp: Split spliced forwarding path from udp_buf_reply_sock_data() David Gibson
2025-04-07 21:49 ` Stefano Brivio
2025-04-04 10:15 ` [PATCH 07/12] udp: Merge vhost-user and "buf" listening socket paths David Gibson
2025-04-04 10:15 ` [PATCH 08/12] udp: Move UDP_MAX_FRAMES to udp.c David Gibson
2025-04-04 10:15 ` [PATCH 09/12] udp_flow: Take pif and port as explicit parameters to udp_flow_from_sock() David Gibson
2025-04-07 21:49 ` Stefano Brivio
2025-04-04 10:15 ` [PATCH 10/12] udp: Rework udp_listen_sock_data() into udp_sock_fwd() David Gibson
2025-04-04 10:15 ` [PATCH 11/12] udp: Fold udp_splice_prepare and udp_splice_send into udp_sock_to_sock David Gibson
2025-04-07 21:49 ` Stefano Brivio
2025-04-04 10:15 ` [PATCH 12/12] udp_flow: Don't discard packets that arrive between bind() and connect() David Gibson
2025-04-07 21:49 ` Stefano Brivio [this message]
2025-04-07 21:49 ` [PATCH 00/12] Use connect()ed sockets for both sides of UDP flows Stefano Brivio
2025-04-07 23:50 ` David Gibson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250407234941.4a0b811b@elisabeth \
--to=sbrivio@redhat.com \
--cc=david@gibson.dropbear.id.au \
--cc=passt-dev@passt.top \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).