From: David Gibson
To: Stefano Brivio
CC: passt-dev@passt.top, jmaloy@redhat.com
Date: Wed, 10 Jul 2024 09:59:08 +1000
Subject: Re: [PATCH v7 20/27] udp: Create flows for datagrams from originating sockets
References: <20240705020724.3447719-1-david@gibson.dropbear.id.au> <20240705020724.3447719-21-david@gibson.dropbear.id.au> <20240710003202.2909199a@elisabeth>
In-Reply-To: <20240710003202.2909199a@elisabeth>
List-Id: Development discussion and patches for passt

On Wed, Jul 10, 2024 at 12:32:02AM +0200, Stefano Brivio wrote:
> Nits only, here:
>
> On Fri, 5 Jul 2024 12:07:17 +1000
> David Gibson wrote:
>
> > This implements the first steps of tracking UDP packets with the flow table
> > rather than it's own (buggy) set of port maps. Specifically we create flow
>
> its

Oops, fixed.

> > table entries for datagrams received from a socket (PIF_HOST or
> > PIF_SPLICE).
> >
> > When splitting datagrams from sockets into batches, we group by the flow
> > as well as splicesrc. This may result in smaller batches, but makes things
> > easier down the line. We can re-optimise this later if necessary. For now
> > we don't do anything else with the flow, not even match reply packets to
> > the same flow.
> >
> > Signed-off-by: David Gibson
> > ---
> >  Makefile     |   2 +-
> >  flow.c       |  31 ++++++++++
> >  flow.h       |   4 ++
> >  flow_table.h |  14 +++++
> >  udp.c        | 169 +++++++++++++++++++++++++++++++++++++++++++++++++--
> >  udp_flow.h   |  25 ++++++++
> >  6 files changed, 240 insertions(+), 5 deletions(-)
> >  create mode 100644 udp_flow.h
> >
> > diff --git a/Makefile b/Makefile
> > index 09fc461d..92cbd5a6 100644
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -57,7 +57,7 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
> >  	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
> >  	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
> >  	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
> > -	udp.h util.h
> > +	udp.h udp_flow.h util.h
> >  HEADERS = $(PASST_HEADERS) seccomp.h
> >
> >  C := \#include \nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
> > diff --git a/flow.c b/flow.c
> > index 218033ae..0cb9495b 100644
> > --- a/flow.c
> > +++ b/flow.c
> > @@ -37,6 +37,7 @@ const char *flow_type_str[] = {
> >  	[FLOW_TCP_SPLICE]	= "TCP connection (spliced)",
> >  	[FLOW_PING4]		= "ICMP ping sequence",
> >  	[FLOW_PING6]		= "ICMPv6 ping sequence",
> > +	[FLOW_UDP]		= "UDP flow",
> >  };
> >  static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES,
> >  	      "flow_type_str[] doesn't match enum flow_type");
> > @@ -46,6 +47,7 @@ const uint8_t flow_proto[] = {
> >  	[FLOW_TCP_SPLICE]	= IPPROTO_TCP,
> >  	[FLOW_PING4]		= IPPROTO_ICMP,
> >  	[FLOW_PING6]		= IPPROTO_ICMPV6,
> > +	[FLOW_UDP]		= IPPROTO_UDP,
> >  };
> >  static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES,
> >  	      "flow_proto[] doesn't match enum flow_type");
> > @@ -700,6 +702,31 @@ flow_sidx_t flow_lookup_af(const struct ctx *c,
> >  	return flowside_lookup(c, proto, pif, &fside);
> >  }
> >
> > +/**
> > + * flow_lookup_sa() - Look up a flow given and endpoint socket address
>
> s/and/an/

Fixed.

> > + * @c:		Execution context
> > + * @proto:	Protocol of the flow (IP L4 protocol number)
> > + * @pif:	Interface of the flow
> > + * @esa:	Socket address of the endpoint
> > + * @fport:	Forwarding port number
> > + *
> > + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found
> > + */
> > +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif,
> > +			   const void *esa, in_port_t fport)
> > +{
> > +	struct flowside fside = {
>
> And the "f" in "fside" stands for "forwarding"... I don't have any
> quick fix in mind, and it's _kind of_ clear anyway, but this makes me
> doubt a bit about the "forwarding" / "endpoint" choice of words.

Heh, no, here "fside" is simply short for "flowside". Every flowside
has both forwarding and endpoint elements. So it is confusing, but
for a different reason. I need to find a different convention for
naming struct flowside variables. I'd say 'side', but sometimes
that's used for the 1-bit integer indicating which side in a flow.
Hrm.. now that pif has been removed from here, maybe I could rename
struct flowside back to 'flowaddrs' or 'sideaddrs' perhaps?

> > +		.fport = fport,
> > +	};
> > +
> > +	inany_from_sockaddr(&fside.eaddr, &fside.eport, esa);
> > +	if (inany_v4(&fside.eaddr))
> > +		fside.faddr = inany_any4;
> > +	else
> > +		fside.faddr = inany_any6;
>
> The usual extra newline here?

Done.

> > +	return flowside_lookup(c, proto, pif, &fside);
> > +}
> > +
> >  /**
> >   * flow_defer_handler() - Handler for per-flow deferred and timed tasks
> >   * @c:		Execution context
> > @@ -779,6 +806,10 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now)
> >  			if (timer)
> >  				closed = icmp_ping_timer(c, &flow->ping, now);
> >  			break;
> > +		case FLOW_UDP:
> > +			if (timer)
> > +				closed = udp_flow_timer(c, &flow->udp, now);
> > +			break;
> >  		default:
> >  			/* Assume other flow types don't need any handling */
> >  			;
> > diff --git a/flow.h b/flow.h
> > index e27f99be..3752e5ee 100644
> > --- a/flow.h
> > +++ b/flow.h
> > @@ -115,6 +115,8 @@ enum flow_type {
> >  	FLOW_PING4,
> >  	/* ICMPv6 echo requests from guest to host and matching replies back */
> >  	FLOW_PING6,
> > +	/* UDP pseudo-connection */
> > +	FLOW_UDP,
> >
> >  	FLOW_NUM_TYPES,
> >  };
> > @@ -238,6 +240,8 @@ flow_sidx_t flow_lookup_af(const struct ctx *c,
> >  			   uint8_t proto, uint8_t pif, sa_family_t af,
> >  			   const void *eaddr, const void *faddr,
> >  			   in_port_t eport, in_port_t fport);
> > +flow_sidx_t flow_lookup_sa(const struct ctx *c, uint8_t proto, uint8_t pif,
> > +			   const void *esa, in_port_t fport);
> >
> >  union flow;
> >
> > diff --git a/flow_table.h b/flow_table.h
> > index 457f27b1..3fbc7c8d 100644
> > --- a/flow_table.h
> > +++ b/flow_table.h
> > @@ -9,6 +9,7 @@
> >
> >  #include "tcp_conn.h"
> >  #include "icmp_flow.h"
> > +#include "udp_flow.h"
> >
> >  /**
> >   * struct flow_free_cluster - Information about a cluster of free entries
> > @@ -35,6 +36,7 @@ union flow {
> >  	struct tcp_tap_conn tcp;
> >  	struct tcp_splice_conn tcp_splice;
> >  	struct icmp_ping_flow ping;
> > +	struct udp_flow udp;
> >  };
> >
> >  /* Global Flow Table */
> > @@ -78,6 +80,18 @@ static inline union flow *flow_at_sidx(flow_sidx_t sidx)
> >  	return FLOW(sidx.flow);
> >  }
> >
> > +/** flow_sidx_opposite - Get the other side of the same flow
>
> flow_sidx_opposite()

Done.

> > + * @sidx:	Flow & side index
> > + *
> > + * Return: sidx for the other side of the same flow as @sidx
> > + */
> > +static inline flow_sidx_t flow_sidx_opposite(flow_sidx_t sidx)
> > +{
> > +	if (!flow_sidx_valid(sidx))
> > +		return FLOW_SIDX_NONE;
>
> Same here with the extra newline.

Done.

> > +	return (flow_sidx_t){.flow = sidx.flow, .side = !sidx.side};
> > +}
> > +
> >  /** flow_sidx_t - Index of one side of a flow from common structure
> >   * @f:		Common flow fields pointer
> >   * @side:	Which side to refer to (0 or 1)
> > diff --git a/udp.c b/udp.c
> > index 6427b9ce..daf4fe26 100644
> > --- a/udp.c
> > +++ b/udp.c
> > @@ -15,6 +15,30 @@
> >  /**
> >   * DOC: Theory of Operation
> >   *
> > + * UDP Flows
> > + * =========
> > + *
> > + * UDP doesn't have true connections, but many protocols use a connection-like
> > + * format. The flow is initiated by a client sending a datagram from a port of
> > + * its choosing (usually ephemeral) to a specific port (usually well known) on a
> > + * server. Both client and server address must be unicast. The server sends
> > + * replies using the same addresses & ports with src/dest swapped.
> > + *
> > + * We track pseudo-connections of this type as flow table entries of type
> > + * FLOW_UDP. We store the time of the last traffic on the flow in uflow->ts,
> > + * and let the flow expire if there is no traffic for UDP_CONN_TIMEOUT seconds.
> > + *
> > + * NOTE: This won't handle multicast protocols, or some protocols with different
> > + * port usage. We'll need specific logic if we want to handle those.
> > + *
> > + * "Listening" sockets
> > + * ===================
> > + *
> > + * UDP doesn't use listen(), but we consider long term sockets which are allowed
> > + * to create new flows "listening" by analogy with TCP.
> > + *
> > + * Port tracking
> > + * =============
> >   *
> >   * For UDP, a reduced version of port-based connection tracking is implemented
> >   * with two purposes:
> > @@ -121,6 +145,7 @@
> >  #include "tap.h"
> >  #include "pcap.h"
> >  #include "log.h"
> > +#include "flow_table.h"
> >
> >  #define UDP_CONN_TIMEOUT	180 /* s, timeout for ephemeral or local bind */
> >  #define UDP_MAX_FRAMES		32 /* max # of frames to receive at once */
> > @@ -199,6 +224,7 @@ static struct ethhdr udp6_eth_hdr;
> >   * @taph:	Tap backend specific header
> >   * @s_in:	Source socket address, filled in by recvmmsg()
> >   * @splicesrc:	Source port for splicing, or -1 if not spliceable
> > + * @tosidx:	sidx for the destination side of this datagram's flow
> >   */
> >  static struct udp_meta_t {
> >  	struct ipv6hdr ip6h;
> > @@ -207,6 +233,7 @@ static struct udp_meta_t {
> >
> >  	union sockaddr_inany s_in;
> >  	int splicesrc;
> > +	flow_sidx_t tosidx;
> >  }
> >  #ifdef __AVX2__
> >  __attribute__ ((aligned(32)))
> > @@ -490,6 +517,115 @@ static int udp_mmh_splice_port(union epoll_ref ref, const struct mmsghdr *mmh)
> >  	return -1;
> >  }
> >
> > +/**
> > + * udp_at_sidx() - Get UDP specific flow at given sidx
> > + * @sidx:	Flow and side to retrieve
> > + *
> > + * Return: UDP specific flow at @sidx, or NULL of @sidx is invalid. Asserts if
> > + *         the flow at @sidx is not FLOW_UDP.
> > + */
> > +struct udp_flow *udp_at_sidx(flow_sidx_t sidx)
> > +{
> > +	union flow *flow = flow_at_sidx(sidx);
> > +
> > +	if (!flow)
> > +		return NULL;
> > +
> > +	ASSERT(flow->f.type == FLOW_UDP);
> > +	return &flow->udp;
> > +}
> > +
> > +/*
> > + * udp_flow_close() - Close and clean up UDP flow
> > + * @c:		Execution context
> > + * @uflow:	UDP flow
> > + */
> > +static void udp_flow_close(const struct ctx *c, const struct udp_flow *uflow)
> > +{
> > +	flow_hash_remove(c, FLOW_SIDX(uflow, INISIDE));
> > +}
> > +
> > +/**
> > + * udp_flow_new() - Common setup for a new UDP flow
> > + * @c:		Execution context
> > + * @flow:	Initiated flow
> > + * @now:	Timestamp
> > + *
> > + * Return: UDP specific flow, if successful, NULL on failure
> > + */
> > +static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
> > +				const struct timespec *now)
> > +{
> > +	const struct flowside *ini = &flow->f.side[INISIDE];
> > +	struct udp_flow *uflow = NULL;
> > +
> > +	if (!inany_is_unicast(&ini->eaddr) || ini->eport == 0) {
> > +		flow_dbg(flow, "Invalid endpoint to initiate UDP flow");
>
> Do we risk making debug logs unusable if we see multicast traffic?

Um.. I'm not sure.

> Maybe this could be flow_trace() instead.

Sure, why not.

-- 
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
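
For anyone following the flow_sidx_opposite() hunk quoted above outside the passt tree, here is a minimal standalone sketch of the same idea. The sketch_* names, the 31-bit index width and SKETCH_FLOW_MAX are stand-ins made up for illustration and are not passt's actual definitions. Only the logic of checking validity and then flipping the one-bit side comes from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical table size, standing in for the real flow table limit */
#define SKETCH_FLOW_MAX	128u

/* Simplified stand-in for flow_sidx_t: a flow index plus a 1-bit side */
typedef struct {
	unsigned side :1;	/* which side of the flow: 0 or 1 */
	unsigned flow :31;	/* index of the flow in the table */
} sketch_sidx_t;

/* Stand-in for FLOW_SIDX_NONE: an index just past the end of the table */
#define SKETCH_SIDX_NONE ((sketch_sidx_t){ .flow = SKETCH_FLOW_MAX })

/* An sidx is valid if its flow index lies inside the table */
static inline bool sketch_sidx_valid(sketch_sidx_t sidx)
{
	return sidx.flow < SKETCH_FLOW_MAX;
}

/* Two sidx values are equal if both flow index and side match */
static inline bool sketch_sidx_eq(sketch_sidx_t a, sketch_sidx_t b)
{
	return a.flow == b.flow && a.side == b.side;
}

/* Same flow, other side; invalid input maps to the "none" value */
static inline sketch_sidx_t sketch_sidx_opposite(sketch_sidx_t sidx)
{
	if (!sketch_sidx_valid(sidx))
		return SKETCH_SIDX_NONE;

	return (sketch_sidx_t){ .flow = sidx.flow, .side = !sidx.side };
}

/* Usage example: flipping twice gets back to the starting side */
static inline void sketch_selfcheck(void)
{
	sketch_sidx_t ini = { .side = 0, .flow = 5 };
	sketch_sidx_t tgt = sketch_sidx_opposite(ini);

	assert(tgt.flow == 5 && tgt.side == 1);
	assert(sketch_sidx_eq(sketch_sidx_opposite(tgt), ini));
}
```

Since the side is a single bit, applying the helper twice returns the original sidx, and an invalid sidx stays invalid instead of pointing at some unrelated flow.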