Date: Wed, 4 Jan 2023 15:53:21 +1100
From: David Gibson
To: Stefano Brivio
Cc: passt-dev@passt.top
Subject: Re: [PATCH 4/8] udp: Receive multiple datagrams at once on the pasta sock->tap path

On Wed, Jan 04, 2023 at 01:08:52AM +0100, Stefano Brivio wrote:
> On Wed, 21 Dec 2022 17:00:24 +1100
> David Gibson wrote:
> 
> > On Tue, Dec 20, 2022 at 11:42:46AM +0100, Stefano Brivio wrote:
> > > Sorry for the further delay,
> > > 
> > > On Wed, 14 Dec 2022 11:35:46 +0100
> > > Stefano Brivio wrote:
> > > 
> > > > On Wed, 14 Dec 2022 12:42:14 +1100
> > > > David Gibson wrote:
> > > > 
> > > > > On Tue, Dec 13, 2022 at 11:48:47PM +0100, Stefano Brivio wrote:
> > > > > > Sorry for the long delay here,
> > > > > > 
> > > > > > On Mon, 5 Dec 2022 19:14:21 +1100
> > > > > > David Gibson wrote:
> > > > > > 
> > > > > > > Usually udp_sock_handler() will receive and forward multiple (up to 32)
> > > > > > > datagrams in udp_sock_handler(), then forward them all to the tap
> > > > > > > interface.  For unclear reasons, though, when in pasta mode we will only
> > > > > > > receive and forward a single datagram at a time.  Change it to receive
> > > > > > > multiple datagrams at once, like the other paths.
> > > > > > 
> > > > > > This is explained in the commit message of 6c931118643c ("tcp, udp:
> > > > > > Receive batching doesn't pay off when writing single frames to tap").
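For context, the batched receive described in the quoted commit message is
the standard recvmmsg() pattern: one syscall pulls up to UDP_MAX_FRAMES
datagrams off the socket. The sketch below is only an illustration, with
made-up buffer sizes and array names rather than the actual passt/pasta
data structures:

/*
 * Minimal sketch of the batched receive being discussed: recvmmsg()
 * pulls up to UDP_MAX_FRAMES datagrams from a socket in one syscall.
 * Buffer sizes and array names are illustrative only.
 */
#define _GNU_SOURCE
#include <string.h>
#include <sys/uio.h>
#include <sys/socket.h>

#define UDP_MAX_FRAMES	32		/* batch size mentioned in the thread */
#define FRAME_SIZE	65536

static char bufs[UDP_MAX_FRAMES][FRAME_SIZE];
static struct iovec iovs[UDP_MAX_FRAMES];
static struct mmsghdr mmh_recv[UDP_MAX_FRAMES];

/* Receive up to UDP_MAX_FRAMES datagrams in a single syscall; returns
 * the number of datagrams received, or -1 on error.
 */
int udp_receive_batch(int s)
{
	int i;

	for (i = 0; i < UDP_MAX_FRAMES; i++) {
		iovs[i].iov_base = bufs[i];
		iovs[i].iov_len  = FRAME_SIZE;

		memset(&mmh_recv[i].msg_hdr, 0, sizeof(mmh_recv[i].msg_hdr));
		mmh_recv[i].msg_hdr.msg_iov    = &iovs[i];
		mmh_recv[i].msg_hdr.msg_iovlen = 1;
	}

	return recvmmsg(s, mmh_recv, UDP_MAX_FRAMES, 0, NULL);
}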
> > > > > > 
> > > > > > I think it's worth re-checking the throughput now as this path is a bit
> > > > > > different, but unfortunately I didn't include this in the "perf" tests :(
> > > > > > because at the time I introduced those I wasn't sure it even made sense to
> > > > > > have traffic from the same host being directed to the tap device.
> > > > > > 
> > > > > > The iperf3 runs where I observed this are actually the ones from the Podman
> > > > > > demo. Ideally that case should also be checked in the perf/pasta_udp tests.
> > > > > 
> > > > > Hm, ok.
> > > > > 
> > > > > > How fundamental is this for the rest of the series? I couldn't find any
> > > > > > actual dependency on this but I might be missing something.
> > > > > 
> > > > > So the issue is that prior to this change in pasta we receive multiple
> > > > > frames at once on the splice path, but one frame at a time on the tap
> > > > > path.  By the end of this series we can't do that any more, because we
> > > > > don't know before the recvmmsg() which one we'll be doing.
> > > > 
> > > > Oh, right, I see. Then let me add this path to the perf/pasta_udp test
> > > > and check how relevant this is now, I'll get back to you in a bit.
> > > 
> > > I was checking the wrong path. With this:
> > > 
> > > diff --git a/test/perf/pasta_udp b/test/perf/pasta_udp
> > > index 27ea724..973c2f4 100644
> > > --- a/test/perf/pasta_udp
> > > +++ b/test/perf/pasta_udp
> > > @@ -31,6 +31,14 @@ report pasta lo_udp 1 __FREQ__
> > >  
> > >  th	MTU 1500B 4000B 16384B 65535B
> > >  
> > > +tr	UDP throughput over IPv6: host to ns
> > > +nsout	IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
> > > +nsout	ADDR6 ip -j -6 addr show|jq -rM '.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global" and .prefixlen == 64).local'
> > > +bw	-
> > > +bw	-
> > > +bw	-
> > > +iperf3	BW host ns __ADDR6__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 15G
> > > +bw	__BW__ 7.0 9.0
> > >  
> > >  tr	UDP throughput over IPv6: ns to host
> > >  ns	ip link set dev lo mtu 1500
> > > diff --git a/test/run b/test/run
> > > index e07513f..b53182b 100755
> > > --- a/test/run
> > > +++ b/test/run
> > > @@ -67,6 +67,14 @@ run() {
> > >  	test build/clang_tidy
> > >  	teardown build
> > >  
> > > +	VALGRIND=0
> > > +	setup passt_in_ns
> > > +	test passt/ndp
> > > +	test passt/dhcp
> > > +	test perf/pasta_udp
> > > +	test passt_in_ns/shutdown
> > > +	teardown passt_in_ns
> > > +
> > >  	setup pasta
> > >  	test pasta/ndp
> > >  	test pasta/dhcp
> > 
> > Ah, ok.  Can we add that to the standard set of tests ASAP, please.
> > 
> > > I get 21.6 Gbps after this series, and 29.7 Gbps before -- it's quite
> > > significant.
> > 
> > Drat.
> > 
> > > And there's nothing strange in perf's output, really, the distribution
> > > of overhead per function is pretty much the same, but writing multiple
> > > messages to the tap device just takes more cycles per message compared
> > > to a single message.
> > 
> > That's so weird.  It should be basically an identical set of write()s,
> > except that they happen in a batch, rather than a bit spread out.  I
> > guess it has to be some kind of cache locality thing.  I wonder if the
> > difference would go away or reverse if we had a way to submit multiple
> > frames with a single syscall.
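Part of why batching looks so different on the two sides is that a tap file
descriptor takes exactly one frame per write(), so the outbound syscall
count scales with the number of datagrams no matter how many were received
in one go. A rough sketch of that per-frame loop, with hypothetical names
rather than the actual passt tap code:

#include <stddef.h>
#include <unistd.h>

/*
 * Sketch only: forward n already-built L2 frames to a tap device.
 * Each write() on a tap file descriptor carries exactly one frame, so
 * this side costs one syscall per datagram even when the socket side
 * used a single batched recvmmsg().
 */
struct tap_frame {
	const void *base;	/* Ethernet + IP + UDP headers + payload */
	size_t len;
};

static void tap_send_frames(int tap_fd, const struct tap_frame *f, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		if (write(tap_fd, f[i].base, f[i].len) < 0)	/* one frame per syscall */
			break;
	}
}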
> > 
> > > I'm a bit ashamed to propose this, but do you think about something
> > > like:
> > 
> > >	if (c->mode == MODE_PASTA) {
> > >		if (recvmmsg(ref.r.s, mmh_recv, 1, 0, NULL) <= 0)
> > >			return;
> > 
> > >		if (udp_mmh_splice_port(v6, mmh_recv)) {
> > >			n = recvmmsg(ref.r.s, mmh_recv + 1,
> > >				     UDP_MAX_FRAMES - 1, 0, NULL);
> > >		}
> > 
> > >		if (n > 0)
> > >			n++;
> > >		else
> > >			n = 1;
> > >	} else {
> > >		n = recvmmsg(ref.r.s, mmh_recv, UDP_MAX_FRAMES, 0, NULL);
> > >		if (n <= 0)
> > >			return;
> > >	}
> > 
> > > ? Other than the inherent ugliness, it looks like a good
> > > approximation to me.
> > 
> > Hmm.  Well, the first question is how much impact does going 1 message
> > at a time have on the spliced throughput.  If it's not too bad, then
> > we could just always go one at a time for pasta, regardless of
> > splicing.  And we could even abstract that difference into the tap
> > backend with a callback like tap_batch_size(c).
> 
> So, finally I had the chance to try this out.
> 
> First off, baseline with the patch adding the new tests I just sent,
> and the series you posted:
> 
> === perf/pasta_udp
> > pasta: throughput and latency (local traffic)
> Throughput in Gbps, latency in µs, one thread at 3.6 GHz, 4 streams
>                                 MTU: |  1500B |  4000B | 16384B | 65535B |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv6: ns to host |    4.4 |    8.5 |   19.5 |   23.0 |
> UDP RR latency over IPv6: ns to host |      - |      - |      - |     27 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv4: ns to host |    4.3 |    8.8 |   18.5 |   24.4 |
> UDP RR latency over IPv4: ns to host |      - |      - |      - |     26 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv6: host to ns |      - |      - |      - |   22.5 |
> UDP RR latency over IPv6: host to ns |      - |      - |      - |     30 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv4: host to ns |      - |      - |      - |   24.5 |
> UDP RR latency over IPv4: host to ns |      - |      - |      - |     25 |
>                                      '--------'--------'--------'--------'
> ...passed.
> 
> > pasta: throughput and latency (traffic via tap)
> Throughput in Gbps, latency in µs, one thread at 3.6 GHz, 4 streams
>                                 MTU: |  1500B |  4000B | 16384B | 65520B |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv6: ns to host |    4.4 |   10.4 |   16.0 |   23.4 |
> UDP RR latency over IPv6: ns to host |      - |      - |      - |     27 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv4: ns to host |    5.2 |   10.8 |   16.0 |   24.0 |
> UDP RR latency over IPv4: ns to host |      - |      - |      - |     28 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv6: host to ns |      - |      - |      - |   21.5 |
> UDP RR latency over IPv6: host to ns |      - |      - |      - |     29 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv4: host to ns |      - |      - |      - |   26.3 |
> UDP RR latency over IPv4: host to ns |      - |      - |      - |     26 |
>                                      '--------'--------'--------'--------'
> 
> which seems to indicate the whole "splicing" thing is pretty much
> useless, for UDP (except for that 16 KiB MTU case, but I wonder how
> relevant that is).
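The tap_batch_size(c) callback floated above could take roughly the shape
below; tap_batch_size() itself is only a suggestion from this thread, and
the context struct and mode enum are simplified stand-ins, not the real
passt declarations:

#define _GNU_SOURCE
#include <sys/socket.h>

#define UDP_MAX_FRAMES	32

/* Simplified stand-ins for the real passt context and mode enum */
enum exec_mode { MODE_PASST, MODE_PASTA };
struct ctx { enum exec_mode mode; };

/*
 * Hypothetical helper from the discussion: clamp the receive batch to
 * one message for pasta (the tap device is written one frame at a time
 * anyway) and keep the full batch for passt.
 */
static int tap_batch_size(const struct ctx *c)
{
	return c->mode == MODE_PASTA ? 1 : UDP_MAX_FRAMES;
}

/* The socket handler could then use a single, mode-independent path */
static int udp_recv_batch(const struct ctx *c, int s, struct mmsghdr *mmh)
{
	return recvmmsg(s, mmh, tap_batch_size(c), 0, NULL);
}

Whether batching pays off then becomes a property of the tap backend rather
than something the UDP socket handler has to special-case.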
> 
> If I set UDP_MAX_FRAMES to 1, with a quick workaround for the resulting
> warning in udp_tap_send() (single frame to send, hence single message),
> it gets somewhat weird:
> 
> === perf/pasta_udp
> > pasta: throughput and latency (local traffic)
> Throughput in Gbps, latency in µs, one thread at 3.6 GHz, 4 streams
>                                 MTU: |  1500B |  4000B | 16384B | 65535B |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv6: ns to host |    3.4 |    7.0 |   21.6 |   31.6 |
> UDP RR latency over IPv6: ns to host |      - |      - |      - |     30 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv4: ns to host |    3.8 |    7.0 |   22.0 |   32.4 |
> UDP RR latency over IPv4: ns to host |      - |      - |      - |     26 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv6: host to ns |      - |      - |      - |   29.3 |
> UDP RR latency over IPv6: host to ns |      - |      - |      - |     31 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv4: host to ns |      - |      - |      - |   33.8 |
> UDP RR latency over IPv4: host to ns |      - |      - |      - |     25 |
>                                      '--------'--------'--------'--------'
> ...passed.
> 
> > pasta: throughput and latency (traffic via tap)
> Throughput in Gbps, latency in µs, one thread at 3.6 GHz, 4 streams
>                                 MTU: |  1500B |  4000B | 16384B | 65520B |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv6: ns to host |    4.7 |   10.3 |   16.0 |   24.0 |
> UDP RR latency over IPv6: ns to host |      - |      - |      - |     27 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv4: ns to host |    5.6 |   11.4 |   16.0 |   24.0 |
> UDP RR latency over IPv4: ns to host |      - |      - |      - |     26 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv6: host to ns |      - |      - |      - |   21.5 |
> UDP RR latency over IPv6: host to ns |      - |      - |      - |     29 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv4: host to ns |      - |      - |      - |   28.7 |
> UDP RR latency over IPv4: host to ns |      - |      - |      - |     29 |
>                                      '--------'--------'--------'--------'
> 
> ...except for the cases with low MTUs, throughput is significantly
> higher if we read and send one message at a time on the "spliced" path.
> 
> Next, I would like to:
> 
> - bisect between 32 and 1 for UDP_MAX_FRAMES: maybe 32 affects data
>   locality too much, but some lower value would still be beneficial by
>   lowering syscall overhead

Ok.

> - try with sendmsg() instead of sendmmsg(), at this point. Looking at
>   the kernel, that doesn't seem to make a real difference.

Which sendmmsg() specifically are you looking at changing?

> About this series: should we just go ahead and apply it with
> UDP_MAX_FRAMES set to 1 for the time being? It's anyway better than
> the existing situation.

I think that's a good idea - or rather, not setting UDP_MAX_FRAMES to
1, but clamping the batch size to 1 for pasta - I'm pretty sure we
still want the batching for passt.  We lose a little bit on
small-packet spliced, but we gain on both tap and large-packet
spliced.  This will unblock the dual stack udp stuff and we can
further tune it later.
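On the sendmsg()-versus-sendmmsg() point above, the trade-off is simply one
syscall per batch against one syscall per message; a minimal sketch of the
two shapes using the plain Linux socket API, not the actual udp_tap_send()
code:

#define _GNU_SOURCE
#include <sys/socket.h>

/* Batched: a single syscall submits up to n messages. */
static int send_batch(int s, struct mmsghdr *mmh, unsigned int n)
{
	return sendmmsg(s, mmh, n, 0);
}

/* Unbatched: one sendmsg() syscall per message. */
static int send_one_by_one(int s, const struct mmsghdr *mmh, unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++) {
		if (sendmsg(s, &mmh[i].msg_hdr, 0) < 0)
			return -1;
	}

	return (int)n;
}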
-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson