Date: Mon, 7 Apr 2025 23:49:41 +0200
From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: passt-dev@passt.top
Subject: Re: [PATCH 12/12] udp_flow: Don't discard packets that arrive between bind() and connect()
Message-ID: <20250407234941.4a0b811b@elisabeth>
In-Reply-To: <20250404101542.3729316-13-david@gibson.dropbear.id.au>
References: <20250404101542.3729316-1-david@gibson.dropbear.id.au>
 <20250404101542.3729316-13-david@gibson.dropbear.id.au>
Organization: Red Hat

On Fri, 4 Apr 2025 21:15:42 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> When we establish a new UDP flow we create connect()ed sockets that will
> only handle datagrams for this flow. However, there is a race between
> bind() and connect() where they might get some packets queued for a
> different flow. Currently we handle this by simply discarding any
> queued datagrams after the connect. UDP protocols should be able to handle
> such packet loss, but it's not ideal.
>
> We now have the tools we need to handle this better, by redirecting any
> datagrams received during that race to the appropriate flow. We need to
> use a deferred handler for this to avoid unexpectedly re-ordering datagrams
> in some edge cases.
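
[For context: the race described above is the generic bind()/connect()
window on a datagram socket. A minimal standalone sketch, with
hypothetical names and no passt specifics:]

#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Illustrative demo only, not passt code */
static int udp_race_demo(const struct sockaddr_in *local,
			 const struct sockaddr_in *peer)
{
	int s = socket(AF_INET, SOCK_DGRAM, 0);

	if (s < 0)
		return -1;

	if (bind(s, (const struct sockaddr *)local, sizeof(*local)) < 0) {
		close(s);
		return -1;
	}

	/* Window: s is bound but not yet connected, so datagrams from
	 * *any* source addressed to 'local' can be queued on it,
	 * including traffic belonging to other flows.
	 */

	if (connect(s, (const struct sockaddr *)peer, sizeof(*peer)) < 0) {
		close(s);
		return -1;
	}

	/* From here on, only datagrams from 'peer' are delivered, but
	 * anything queued during the window is still in the receive
	 * queue, waiting to be read.
	 */
	return s;
}
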
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  flow.c         |  2 +-
>  udp.c          |  4 +--
>  udp_flow.c     | 73 +++++++++++++++++++++++++++++++++++---------------
>  udp_flow.h     |  6 ++++-
>  udp_internal.h |  2 ++
>  5 files changed, 61 insertions(+), 26 deletions(-)
>
> diff --git a/flow.c b/flow.c
> index 86222426..29a83e18 100644
> --- a/flow.c
> +++ b/flow.c
> @@ -850,7 +850,7 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
>  			closed = icmp_ping_timer(c, &flow->ping, now);
>  			break;
>  		case FLOW_UDP:
> -			closed = udp_flow_defer(&flow->udp);
> +			closed = udp_flow_defer(c, &flow->udp, now);
>  			if (!closed && timer)
>  				closed = udp_flow_timer(c, &flow->udp, now);
>  			break;
> diff --git a/udp.c b/udp.c
> index 7c8b7a2c..b275db3d 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -697,8 +697,8 @@ static void udp_buf_sock_to_tap(const struct ctx *c, int s, int n,
>   * @port:	Our (local) port number of @s
>   * @now:	Current timestamp
>   */
> -static void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif,
> -			 in_port_t port, const struct timespec *now)
> +void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif,
> +		  in_port_t port, const struct timespec *now)
>  {
>  	union sockaddr_inany src;
>
> diff --git a/udp_flow.c b/udp_flow.c
> index b95c3176..af15d7f2 100644
> --- a/udp_flow.c
> +++ b/udp_flow.c
> @@ -9,10 +9,12 @@
>  #include <fcntl.h>
>  #include <sys/uio.h>
>  #include <unistd.h>
> +#include <netinet/udp.h>
>
>  #include "util.h"
>  #include "passt.h"
>  #include "flow_table.h"
> +#include "udp_internal.h"
>
>  #define UDP_CONN_TIMEOUT	180 /* s, timeout for ephemeral or local bind */
>
> @@ -67,16 +69,15 @@ void udp_flow_close(const struct ctx *c, struct udp_flow *uflow)
>   * Return: fd of new socket on success, -ve error code on failure
>   */
>  static int udp_flow_sock(const struct ctx *c,
> -			 const struct udp_flow *uflow, unsigned sidei)
> +			 struct udp_flow *uflow, unsigned sidei)
>  {
>  	const struct flowside *side = &uflow->f.side[sidei];
> -	struct mmsghdr discard[UIO_MAXIOV] = { 0 };
>  	uint8_t pif = uflow->f.pif[sidei];
>  	union {
>  		flow_sidx_t sidx;
>  		uint32_t data;
>  	} fref = { .sidx = FLOW_SIDX(uflow, sidei) };
> -	int rc, s;
> +	int s;
>
>  	s = flowside_sock_l4(c, EPOLL_TYPE_UDP, pif, side, fref.data);
>  	if (s < 0) {
> @@ -85,30 +86,32 @@ static int udp_flow_sock(const struct ctx *c,
>  	}
>
>  	if (flowside_connect(c, s, pif, side) < 0) {
> -		rc = -errno;
> +		int rc = -errno;
>  		flow_dbg_perror(uflow, "Couldn't connect flow socket");
>  		return rc;
>  	}
>
> -	/* It's possible, if unlikely, that we could receive some unrelated
> -	 * packets in between the bind() and connect() of this socket. For now
> -	 * we just discard these.
> +	/* It's possible, if unlikely, that we could receive some packets in
> +	 * between the bind() and connect() which may or may not be for this
> +	 * flow. Being UDP we could just discard them, but it's not ideal.
>  	 *
> -	 * FIXME: Redirect these to an appropriate handler
> +	 * There's also a tricky case if a bunch of datagrams for a new flow
> +	 * arrive in rapid succession, the first going to the original listening
> +	 * socket and later ones going to this new socket. If we forwarded the
> +	 * datagrams from the new socket immediately here they would go before
> +	 * the datagram which established the flow. Again, not strictly wrong
> +	 * for UDP, but not ideal.
> +	 *
> +	 * So, we flag that the new socket is in a transient state where it
> +	 * might have datagrams for a different flow queued. Before the next
> +	 * epoll cycle, udp_flow_defer() will flush out any such datagrams, and
> +	 * thereafter everything on the new socket should be strictly for this
> +	 * flow.
>  	 */
> -	rc = recvmmsg(s, discard, ARRAY_SIZE(discard), MSG_DONTWAIT, NULL);
> -	if (rc >= ARRAY_SIZE(discard)) {
> -		flow_dbg(uflow, "Too many (%d) spurious reply datagrams", rc);
> -		return -E2BIG;
> -	}
> -
> -	if (rc > 0) {
> -		flow_trace(uflow, "Discarded %d spurious reply datagrams", rc);
> -	} else if (errno != EAGAIN) {
> -		rc = -errno;
> -		flow_perror(uflow, "Unexpected error discarding datagrams");
> -		return rc;
> -	}
> +	if (sidei)
> +		uflow->flush1 = true;
> +	else
> +		uflow->flush0 = true;
>
>  	return s;
>  }
> @@ -268,14 +271,40 @@ flow_sidx_t udp_flow_from_tap(const struct ctx *c,
>  	return udp_flow_new(c, flow, now);
>  }
>
> +/**
> + * udp_flush_flow() - Flush datagrams that might not be for this flow
> + * @ctx:	Execution context
> + * @uflow:	Flow to handle
> + * @sidei:	Side of the flow to flush
> + * @now:	Current timestamp
> + */
> +static void udp_flush_flow(const struct ctx *c,
> +			   const struct udp_flow *uflow, unsigned sidei,
> +			   const struct timespec *now)
> +{
> +	/* We don't know exactly where the datagrams will come from, but we know
> +	 * they'll have an interface and oport matching this flow */
> +	udp_sock_fwd(c, uflow->s[sidei], uflow->f.pif[sidei],
> +		     uflow->f.side[sidei].oport, now);
> +}
> +
>  /**
>   * udp_flow_defer() - Deferred per-flow handling (clean up aborted flows)
>   * @uflow:	Flow to handle
>   *
>   * Return: true if the connection is ready to free, false otherwise
>   */
> -bool udp_flow_defer(const struct udp_flow *uflow)
> +bool udp_flow_defer(const struct ctx *c, struct udp_flow *uflow,
> +		    const struct timespec *now)

Function comment not updated. Updated on merge.
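
[Presumably along these lines, following the style of the surrounding
kerneldoc comments; an illustrative sketch only, the wording actually
merged may differ:]

/**
 * udp_flow_defer() - Deferred per-flow handling (clean up aborted flows)
 * @c:		Execution context
 * @uflow:	Flow to handle
 * @now:	Current timestamp
 *
 * Return: true if the connection is ready to free, false otherwise
 */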

>  {
> +	if (uflow->flush0) {
> +		udp_flush_flow(c, uflow, INISIDE, now);
> +		uflow->flush0 = false;
> +	}
> +	if (uflow->flush1) {
> +		udp_flush_flow(c, uflow, TGTSIDE, now);
> +		uflow->flush1 = false;
> +	}
>  	return uflow->closed;
>  }
>
> diff --git a/udp_flow.h b/udp_flow.h
> index d4e4c8b9..d518737e 100644
> --- a/udp_flow.h
> +++ b/udp_flow.h
> @@ -11,6 +11,8 @@
>   * struct udp - Descriptor for a flow of UDP packets
>   * @f:		Generic flow information
>   * @closed:	Flow is already closed
> + * @flush0:	@s[0] may have datagrams queued for other flows
> + * @flush1:	@s[1] may have datagrams queued for other flows
>   * @ts:		Activity timestamp
>   * @s:		Socket fd (or -1) for each side of the flow
>   */
> @@ -19,6 +21,7 @@ struct udp_flow {
>  	struct flow_common f;
>
>  	bool closed :1;
> +	bool flush0, flush1 :1;
>  	time_t ts;
>  	int s[SIDES];
>  };
> @@ -33,7 +36,8 @@ flow_sidx_t udp_flow_from_tap(const struct ctx *c,
>  			      in_port_t srcport, in_port_t dstport,
>  			      const struct timespec *now);
>  void udp_flow_close(const struct ctx *c, struct udp_flow *uflow);
> -bool udp_flow_defer(const struct udp_flow *uflow);
> +bool udp_flow_defer(const struct ctx *c, struct udp_flow *uflow,
> +		    const struct timespec *now);
>  bool udp_flow_timer(const struct ctx *c, struct udp_flow *uflow,
>  		    const struct timespec *now);
>
> diff --git a/udp_internal.h b/udp_internal.h
> index f7d84267..96d11cff 100644
> --- a/udp_internal.h
> +++ b/udp_internal.h
> @@ -28,5 +28,7 @@ size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
>  size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
>  		       const struct flowside *toside, size_t dlen,
>  		       bool no_udp_csum);
> +void udp_sock_fwd(const struct ctx *c, int s, uint8_t frompif,
> +		  in_port_t port, const struct timespec *now);
>
>  #endif /* UDP_INTERNAL_H */

-- 
Stefano