From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail.ozlabs.org (mail.ozlabs.org [IPv6:2404:9400:2221:ea00::3]) by passt.top (Postfix) with ESMTPS id 92E165A004E for ; Fri, 14 Jun 2024 03:10:00 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202312; t=1718327396; bh=j11ufca5tcqiWA4hLiPRZFVR66OC6vnIK4If4ZKZIQo=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=muW8qAdEjdnRAn34ykVqdjMGeqblR6354jXTguitn2QAQ0EJiGH67pDisHf3pL4dG DMaqM+Q0FpGjvoKObp6/ecowQBWn9cvlAwIrme4COd5smFT8Is5t/23S15BoCXli8c rfFPaW+9CvBrIAxctTOxKW0VjTAonFnDVHw7lOrAxc7wLacR1Kwddf6mLDn2F/Jvpc dQpg/LEiXvWO+pNEmnxhPJ2D27rXyInlVXlEaEym/hUMCDMAG4X6nKB87YJG0OJkMd NLxDR3aeAbFTQ11ii9WUK41JalI23T2y/NeGNKq8+/z8ZSoRiwutsz2+y91D+41cB1 hsQM2lohKSGFg== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4W0h5c6zqvz4wc5; Fri, 14 Jun 2024 11:09:56 +1000 (AEST) Date: Fri, 14 Jun 2024 11:08:26 +1000 From: David Gibson To: Stefano Brivio Subject: Re: [PATCH 3/4] udp: Rework how we divide queued datagrams between sending methods Message-ID: References: <20240605013903.3694452-1-david@gibson.dropbear.id.au> <20240605013903.3694452-4-david@gibson.dropbear.id.au> <20240613202112.57a059ee@elisabeth> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="ClURflDKhEXRc83M" Content-Disposition: inline In-Reply-To: <20240613202112.57a059ee@elisabeth> Message-ID-Hash: PBNIEK3ADPTPBNRZNRLKSWYHU6HYYQXH X-Message-ID-Hash: PBNIEK3ADPTPBNRZNRLKSWYHU6HYYQXH X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --ClURflDKhEXRc83M Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Jun 13, 2024 at 08:21:12PM +0200, Stefano Brivio wrote: > On Wed, 5 Jun 2024 11:39:02 +1000 > David Gibson wrote: >=20 > > udp_sock_handler() takes a number of datagrams from sockets that depend= ing > > on their addresses could be forwarded either to the L2 interface ("tap") > > or to another socket ("spliced"). In the latter case we can also only > > send packets together if they have the same source port, and therefore > > are sent via the same socket. > >=20 > > To reduce the total number of system calls we gather contiguous batches= of > > datagrams with the same destination interface and socket where applicab= le. > > The determination of what the target is is made by udp_mmh_splice_port(= ). > > It returns the source port for splice packets and -1 for "tap" packets. > > We find batches by looking ahead in our queue until we find a datagram > > whose "splicefrom" port doesn't match the first in our current batch. > >=20 > > udp_mmh_splice_port() is moderately expensive, since it must examine IP= v6 > > addresses. But unfortunately we can call it twice on the same datagram: > > once as the (last + 1) entry in one batch (showing that it's not in that > > match, then again as the first entry in the next batch. >=20 > This paragraph took me an embarrassingly long time to grasp, if you > re-spin it would be nice to fix it: >=20 > - "And unfortunately [...]", I guess: otherwise it looks like we're > lucky that udp_mmh_splice_port() is expensive or something like that > (because of the "But" implying contrast). >=20 > I initially assumed "unfortunately" was a typo and tried to > understand why it was a good thing we'd call udp_mmh_splice_port() > twice on the same datagram (faster than calling it on two > datagrams!), then started reading the change and got even more > confused... >=20 > - "(to check that it's not that batch)" ? Yeah, didn't help that there were some other typos in there. Updated referring to these suggestions. > > Avoid this by keeping track of the "splice port" in the metadata struct= ure, > > and filling it in one entry ahead of the one we're currently considerin= g. > > This is a bit subtle, but not that hard. It will also generalise better > > when we have more complex possibilities based on the flow table. >=20 > I guess this is the actual, main reason for this change. :) I should > have read this paragraph first. Yes. In the flow table we'll replace this splicesrc array with an array of flow sidxs, updated in the same way. > > Signed-off-by: David Gibson > > --- > > udp.c | 147 ++++++++++++++++++++++++++++++++++------------------------ > > 1 file changed, 86 insertions(+), 61 deletions(-) > >=20 > > diff --git a/udp.c b/udp.c > > index 7487d2b2..757c10ab 100644 > > --- a/udp.c > > +++ b/udp.c > > @@ -198,6 +198,7 @@ static struct ethhdr udp6_eth_hdr; > > * @ip4h: Pre-filled IPv4 header (except for tot_len and saddr) > > * @taph: Tap backend specific header > > * @s_in: Source socket address, filled in by recvmmsg() > > + * @splicesrc: Source port for splicing, or -1 if not spliceable > > */ > > static struct udp_meta_t { > > struct ipv6hdr ip6h; > > @@ -205,6 +206,7 @@ static struct udp_meta_t { > > struct tap_hdr taph; > > =20 > > union sockaddr_inany s_in; > > + int splicesrc; > > } > > #ifdef __AVX2__ > > __attribute__ ((aligned(32))) > > @@ -492,28 +494,32 @@ static int udp_mmh_splice_port(union udp_epoll_re= f uref, > > } > > =20 > > /** > > - * udp_splice_sendfrom() - Send datagrams from given port to given port > > + * udp_splice_send() - Send datagrams from socket to socket > > * @c: Execution context > > * @start: Index of first datagram in udp[46]_l2_buf > > - * @n: Number of datagrams to send > > - * @src: Datagrams will be sent from this port (on origin side) > > - * @dst: Datagrams will be send to this port (on destination side) > > - * @from_pif: pif from which the packet originated > > - * @v6: Send as IPv6? > > - * @allow_new: If true create sending socket if needed, if false disca= rd > > - * if no sending socket is available > > + * @n: Total number of datagrams in udp[46]_l2_buf pool > > + * @dst: Datagrams will be sent to this port (on destination side) > > + * @uref: UDP epoll reference for origin socket > > * @now: Timestamp > > + * > > + * This consumes as many frames as are sendable via a single socket. = It >=20 > s/frames/datagrams/ ...or messages. Good point, adjusted. > > + * requires that udp_meta[@start].splicesrc is initialised, and will i= nitialise > > + * udp_meta[].splicesrc for each frame it consumes *and one more* (if = present). > > + * > > + * Return: Number of frames sent >=20 > I'd say it's rather the number of datagrams (not frames) we tried to > send. >=20 > In some sense, it's also the number of frames sent _by us_ (well, after > calling sendmmsg(), messages were sent), but we call sendmmsg() > ignoring the result, so this comment might look a bit misleading. Right... I've gone with "Number of datagrams forwarded" how's that? > > */ > > -static void udp_splice_sendfrom(const struct ctx *c, unsigned start, u= nsigned n, > > - in_port_t src, in_port_t dst, uint8_t from_pif, > > - bool v6, bool allow_new, > > +static unsigned udp_splice_send(const struct ctx *c, size_t start, siz= e_t n, > > + in_port_t dst, union udp_epoll_ref uref, > > const struct timespec *now) > > { > > + in_port_t src =3D udp_meta[start].splicesrc; > > struct mmsghdr *mmh_recv, *mmh_send; > > - unsigned int i; > > + unsigned int i =3D start; > > int s; > > =20 > > - if (v6) { > > + ASSERT(udp_meta[start].splicesrc >=3D 0); > > + > > + if (uref.v6) { > > mmh_recv =3D udp6_l2_mh_sock; > > mmh_send =3D udp6_mh_splice; > > } else { > > @@ -521,40 +527,48 @@ static void udp_splice_sendfrom(const struct ctx = *c, unsigned start, unsigned n, > > mmh_send =3D udp4_mh_splice; > > } > > =20 > > - if (from_pif =3D=3D PIF_SPLICE) { > > + do { > > + mmh_send[i].msg_hdr.msg_iov->iov_len =3D mmh_recv[i].msg_len; > > + > > + if (++i >=3D n) > > + break; > > + > > + udp_meta[i].splicesrc =3D udp_mmh_splice_port(uref, &mmh_recv[i]); > > + } while (udp_meta[i].splicesrc =3D=3D src); >=20 > I don't have a strong preference, but a for loop like this: >=20 > for (; i < n && udp_meta[i].splicesrc =3D=3D src; i++) { > mmh_send[i].msg_hdr.msg_iov->iov_len =3D mmh_recv[i].msg_len; > udp_meta[i].splicesrc =3D udp_mmh_splice_port(uref, &mmh_recv[i]); This needs to update udp_meta[i+1], not udp_meta[i], and therefore also be conditional on i+1 < n. > } >=20 > if (i++ < n) /* Set splicesrc for first mismatching entry, too */ This needs to be ++i, not i++. > udp_meta[i].splicesrc =3D udp_mmh_splice_port(uref, &mmh_recv[i]); >=20 > looks a bit more readable to me. Same for udp_tap_send(). At which point I'm not sure it's more readable after all. It also redundantly checks that udp_meta[start].splicesrc =3D=3D src. Like I said, subtle. Putting the increment in the middle of the loop body, rather than beginning or end, while unusual, was the sanest way I could see to do this. Well, other than havine one pass to set splicesrc[], then another using it, which works but I'm concerned about the effect on cache locality. --=20 David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson --ClURflDKhEXRc83M Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmZrmAkACgkQzQJF27ox 2Geu2hAAn1hppNFZC8Wb2QbFWIG6x0bYrymtN1eptWrwA/Y39/6O4YCc4alNydri dpi3U2EFkrjfZiyBYDWC8tGvgp+2bz0MzLZSJ94u/lDB+fwqxViWkkY1fomsqwJi jqJOuvjfFo324r/X3A6OEb8bab//GRL/NU2miADr/mwMD6hr9S9RkmIGZPcVYs4u dwSLDc3galv4b73P4Xn+cmaXXrXbdjWVhSyjxruXRds4UKhjqIvztNQAuavcy7qX ktGBKNTop+HDXuii+eL3Qy1Iie/Duy4VdJqq0W1q6QmpCRfc2FOfDNIO6Ujeee/G gofEnKtaNhG/Vrm7OygC5/onkK0sAsG+gyW5ZVmUTaeUV+nkcsPwzJU+PgHlLsK+ cXt1S8t+hIIT2cb0DGC1Oc8YaWvJYW5LuJSDEN93vWYemEfsuMznfqZ07EeOWKI+ OElYdTOt81sDqI7Pb+wNoy8VD2bBIBVOblGbJxK/3wagbIJNG0z87LqcdaIasHZl N6jO82sCuFci8bOhIqHlMn+BBYw6wYr/eyYtnhjorrvgCFaUB0lSwt2sxfKGk86L k3fApChvavH6YotC0h1SIyARjRiQ0uYdqRidC96fFRw5jFPnXrkjvoPuBaCxq+Tg 8DhCAxgro78J9PRROQgfTuJA5WlShfaK16nHpmIq2M7xKwgr9L8= =LuIE -----END PGP SIGNATURE----- --ClURflDKhEXRc83M--