Date: Wed, 4 Jan 2023 15:53:21 +1100
From: David Gibson
To: Stefano Brivio
Cc: passt-dev@passt.top
Subject: Re: [PATCH 4/8] udp: Receive multiple datagrams at once on the pasta sock->tap path

On Wed, Jan 04, 2023 at 01:08:52AM +0100, Stefano Brivio wrote:
> On Wed, 21 Dec 2022 17:00:24 +1100
> David Gibson wrote:
> 
> > On Tue, Dec 20, 2022 at 11:42:46AM +0100, Stefano Brivio wrote:
> > > Sorry for the further delay,
> > > 
> > > On Wed, 14 Dec 2022 11:35:46 +0100
> > > Stefano Brivio wrote:
> > > 
> > > > On Wed, 14 Dec 2022 12:42:14 +1100
> > > > David Gibson wrote:
> > > > 
> > > > > On Tue, Dec 13, 2022 at 11:48:47PM +0100, Stefano Brivio wrote:
> > > > > > Sorry for the long delay here,
> > > > > > 
> > > > > > On Mon, 5 Dec 2022 19:14:21 +1100
> > > > > > David Gibson wrote:
> > > > > > 
> > > > > > > Usually udp_sock_handler() will receive and forward multiple (up to 32)
> > > > > > > datagrams in udp_sock_handler(), then forward them all to the tap
> > > > > > > interface.  For unclear reasons, though, when in pasta mode we will only
> > > > > > > receive and forward a single datagram at a time.  Change it to receive
> > > > > > > multiple datagrams at once, like the other paths.
> > > > > > 
> > > > > > This is explained in the commit message of 6c931118643c ("tcp, udp:
> > > > > > Receive batching doesn't pay off when writing single frames to tap").
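For context, the batched receive described in the quoted commit message is
the standard recvmmsg() pattern: one syscall pulls up to UDP_MAX_FRAMES
datagrams off the socket. The sketch below is only an illustration, with
made-up buffer sizes and array names rather than the actual passt/pasta
data structures:

/*
 * Minimal sketch of the batched receive being discussed: recvmmsg()
 * pulls up to UDP_MAX_FRAMES datagrams from a socket in one syscall.
 * Buffer sizes and array names are illustrative only.
 */
#define _GNU_SOURCE
#include <string.h>
#include <sys/uio.h>
#include <sys/socket.h>

#define UDP_MAX_FRAMES	32		/* batch size mentioned in the thread */
#define FRAME_SIZE	65536

static char bufs[UDP_MAX_FRAMES][FRAME_SIZE];
static struct iovec iovs[UDP_MAX_FRAMES];
static struct mmsghdr mmh_recv[UDP_MAX_FRAMES];

/* Receive up to UDP_MAX_FRAMES datagrams in a single syscall; returns
 * the number of datagrams received, or -1 on error.
 */
int udp_receive_batch(int s)
{
	int i;

	for (i = 0; i < UDP_MAX_FRAMES; i++) {
		iovs[i].iov_base = bufs[i];
		iovs[i].iov_len  = FRAME_SIZE;

		memset(&mmh_recv[i].msg_hdr, 0, sizeof(mmh_recv[i].msg_hdr));
		mmh_recv[i].msg_hdr.msg_iov    = &iovs[i];
		mmh_recv[i].msg_hdr.msg_iovlen = 1;
	}

	return recvmmsg(s, mmh_recv, UDP_MAX_FRAMES, 0, NULL);
}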
> > > > > > 
> > > > > > I think it's worth re-checking the throughput now as this path is a bit
> > > > > > different, but unfortunately I didn't include this in the "perf" tests :(
> > > > > > because at the time I introduced those I wasn't sure it even made sense to
> > > > > > have traffic from the same host being directed to the tap device.
> > > > > > 
> > > > > > The iperf3 runs where I observed this are actually the ones from the Podman
> > > > > > demo. Ideally that case should also be checked in the perf/pasta_udp tests.
> > > > > 
> > > > > Hm, ok.
> > > > > 
> > > > > > How fundamental is this for the rest of the series? I couldn't find any
> > > > > > actual dependency on this but I might be missing something.
> > > > > 
> > > > > So the issue is that prior to this change in pasta we receive multiple
> > > > > frames at once on the splice path, but one frame at a time on the tap
> > > > > path.  By the end of this series we can't do that any more, because we
> > > > > don't know before the recvmmsg() which one we'll be doing.
> > > > 
> > > > Oh, right, I see. Then let me add this path to the perf/pasta_udp test
> > > > and check how relevant this is now, I'll get back to you in a bit.
> > > 
> > > I was checking the wrong path. With this:
> > > 
> > > diff --git a/test/perf/pasta_udp b/test/perf/pasta_udp
> > > index 27ea724..973c2f4 100644
> > > --- a/test/perf/pasta_udp
> > > +++ b/test/perf/pasta_udp
> > > @@ -31,6 +31,14 @@ report pasta lo_udp 1 __FREQ__
> > >  
> > >  th	MTU 1500B 4000B 16384B 65535B
> > >  
> > > +tr	UDP throughput over IPv6: host to ns
> > > +nsout	IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
> > > +nsout	ADDR6 ip -j -6 addr show|jq -rM '.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global" and .prefixlen == 64).local'
> > > +bw	-
> > > +bw	-
> > > +bw	-
> > > +iperf3	BW host ns __ADDR6__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 15G
> > > +bw	__BW__ 7.0 9.0
> > >  
> > >  tr	UDP throughput over IPv6: ns to host
> > >  ns	ip link set dev lo mtu 1500
> > > diff --git a/test/run b/test/run
> > > index e07513f..b53182b 100755
> > > --- a/test/run
> > > +++ b/test/run
> > > @@ -67,6 +67,14 @@ run() {
> > >  	test build/clang_tidy
> > >  	teardown build
> > >  
> > > +	VALGRIND=0
> > > +	setup passt_in_ns
> > > +	test passt/ndp
> > > +	test passt/dhcp
> > > +	test perf/pasta_udp
> > > +	test passt_in_ns/shutdown
> > > +	teardown passt_in_ns
> > > +
> > >  	setup pasta
> > >  	test pasta/ndp
> > >  	test pasta/dhcp
> > 
> > Ah, ok.  Can we add that to the standard set of tests ASAP, please.
> > 
> > > I get 21.6 Gbps after this series, and 29.7 Gbps before -- it's quite
> > > significant.
> > 
> > Drat.
> > 
> > > And there's nothing strange in perf's output, really, the distribution
> > > of overhead per function is pretty much the same, but writing multiple
> > > messages to the tap device just takes more cycles per message compared
> > > to a single message.
> > 
> > That's so weird.  It should be basically an identical set of write()s,
> > except that they happen in a batch, rather than a bit spread out.  I
> > guess it has to be some kind of cache locality thing.  I wonder if the
> > difference would go away or reverse if we had a way to submit multiple
> > frames with a single syscall.
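Part of why batching looks so different on the two sides is that a tap file
descriptor takes exactly one frame per write(), so the outbound syscall
count scales with the number of datagrams no matter how many were received
in one go. A rough sketch of that per-frame loop, with hypothetical names
rather than the actual passt tap code:

#include <stddef.h>
#include <unistd.h>

/*
 * Sketch only: forward n already-built L2 frames to a tap device.
 * Each write() on a tap file descriptor carries exactly one frame, so
 * this side costs one syscall per datagram even when the socket side
 * used a single batched recvmmsg().
 */
struct tap_frame {
	const void *base;	/* Ethernet + IP + UDP headers + payload */
	size_t len;
};

static void tap_send_frames(int tap_fd, const struct tap_frame *f, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		if (write(tap_fd, f[i].base, f[i].len) < 0)	/* one frame per syscall */
			break;
	}
}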
> > 
> > > I'm a bit ashamed to propose this, but do you think about something
> > > like:
> > 
> > >	if (c->mode == MODE_PASTA) {
> > >		if (recvmmsg(ref.r.s, mmh_recv, 1, 0, NULL) <= 0)
> > >			return;
> > 
> > >		if (udp_mmh_splice_port(v6, mmh_recv)) {
> > >			n = recvmmsg(ref.r.s, mmh_recv + 1,
> > >				     UDP_MAX_FRAMES - 1, 0, NULL);
> > >		}
> > 
> > >		if (n > 0)
> > >			n++;
> > >		else
> > >			n = 1;
> > >	} else {
> > >		n = recvmmsg(ref.r.s, mmh_recv, UDP_MAX_FRAMES, 0, NULL);
> > >		if (n <= 0)
> > >			return;
> > >	}
> > 
> > > ? Other than the inherent ugliness, it looks like a good
> > > approximation to me.
> > 
> > Hmm.  Well, the first question is how much impact does going 1 message
> > at a time have on the spliced throughput.  If it's not too bad, then
> > we could just always go one at a time for pasta, regardless of
> > splicing.  And we could even abstract that difference into the tap
> > backend with a callback like tap_batch_size(c).
> 
> So, finally I had the chance to try this out.
> 
> First off, baseline with the patch adding the new tests I just sent,
> and the series you posted:
> 
> === perf/pasta_udp
> > pasta: throughput and latency (local traffic)
> Throughput in Gbps, latency in µs, one thread at 3.6 GHz, 4 streams
>                                 MTU: |  1500B |  4000B | 16384B | 65535B |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv6: ns to host |    4.4 |    8.5 |   19.5 |   23.0 |
> UDP RR latency over IPv6: ns to host |      - |      - |      - |     27 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv4: ns to host |    4.3 |    8.8 |   18.5 |   24.4 |
> UDP RR latency over IPv4: ns to host |      - |      - |      - |     26 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv6: host to ns |      - |      - |      - |   22.5 |
> UDP RR latency over IPv6: host to ns |      - |      - |      - |     30 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv4: host to ns |      - |      - |      - |   24.5 |
> UDP RR latency over IPv4: host to ns |      - |      - |      - |     25 |
>                                      '--------'--------'--------'--------'
> ...passed.
> 
> > pasta: throughput and latency (traffic via tap)
> Throughput in Gbps, latency in µs, one thread at 3.6 GHz, 4 streams
>                                 MTU: |  1500B |  4000B | 16384B | 65520B |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv6: ns to host |    4.4 |   10.4 |   16.0 |   23.4 |
> UDP RR latency over IPv6: ns to host |      - |      - |      - |     27 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv4: ns to host |    5.2 |   10.8 |   16.0 |   24.0 |
> UDP RR latency over IPv4: ns to host |      - |      - |      - |     28 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv6: host to ns |      - |      - |      - |   21.5 |
> UDP RR latency over IPv6: host to ns |      - |      - |      - |     29 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv4: host to ns |      - |      - |      - |   26.3 |
> UDP RR latency over IPv4: host to ns |      - |      - |      - |     26 |
>                                      '--------'--------'--------'--------'
> 
> which seems to indicate the whole "splicing" thing is pretty much
> useless, for UDP (except for that 16 KiB MTU case, but I wonder how
> relevant that is).
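The tap_batch_size(c) callback floated above could take roughly the shape
below; tap_batch_size() itself is only a suggestion from this thread, and
the context struct and mode enum are simplified stand-ins, not the real
passt declarations:

#define _GNU_SOURCE
#include <sys/socket.h>

#define UDP_MAX_FRAMES	32

/* Simplified stand-ins for the real passt context and mode enum */
enum exec_mode { MODE_PASST, MODE_PASTA };
struct ctx { enum exec_mode mode; };

/*
 * Hypothetical helper from the discussion: clamp the receive batch to
 * one message for pasta (the tap device is written one frame at a time
 * anyway) and keep the full batch for passt.
 */
static int tap_batch_size(const struct ctx *c)
{
	return c->mode == MODE_PASTA ? 1 : UDP_MAX_FRAMES;
}

/* The socket handler could then use a single, mode-independent path */
static int udp_recv_batch(const struct ctx *c, int s, struct mmsghdr *mmh)
{
	return recvmmsg(s, mmh, tap_batch_size(c), 0, NULL);
}

Whether batching pays off then becomes a property of the tap backend rather
than something the UDP socket handler has to special-case.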
> 
> If I set UDP_MAX_FRAMES to 1, with a quick workaround for the resulting
> warning in udp_tap_send() (single frame to send, hence single message),
> it gets somewhat weird:
> 
> === perf/pasta_udp
> > pasta: throughput and latency (local traffic)
> Throughput in Gbps, latency in µs, one thread at 3.6 GHz, 4 streams
>                                 MTU: |  1500B |  4000B | 16384B | 65535B |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv6: ns to host |    3.4 |    7.0 |   21.6 |   31.6 |
> UDP RR latency over IPv6: ns to host |      - |      - |      - |     30 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv4: ns to host |    3.8 |    7.0 |   22.0 |   32.4 |
> UDP RR latency over IPv4: ns to host |      - |      - |      - |     26 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv6: host to ns |      - |      - |      - |   29.3 |
> UDP RR latency over IPv6: host to ns |      - |      - |      - |     31 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv4: host to ns |      - |      - |      - |   33.8 |
> UDP RR latency over IPv4: host to ns |      - |      - |      - |     25 |
>                                      '--------'--------'--------'--------'
> ...passed.
> 
> > pasta: throughput and latency (traffic via tap)
> Throughput in Gbps, latency in µs, one thread at 3.6 GHz, 4 streams
>                                 MTU: |  1500B |  4000B | 16384B | 65520B |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv6: ns to host |    4.7 |   10.3 |   16.0 |   24.0 |
> UDP RR latency over IPv6: ns to host |      - |      - |      - |     27 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv4: ns to host |    5.6 |   11.4 |   16.0 |   24.0 |
> UDP RR latency over IPv4: ns to host |      - |      - |      - |     26 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv6: host to ns |      - |      - |      - |   21.5 |
> UDP RR latency over IPv6: host to ns |      - |      - |      - |     29 |
>                                      |--------|--------|--------|--------|
> UDP throughput over IPv4: host to ns |      - |      - |      - |   28.7 |
> UDP RR latency over IPv4: host to ns |      - |      - |      - |     29 |
>                                      '--------'--------'--------'--------'
> 
> ...except for the cases with low MTUs, throughput is significantly
> higher if we read and send one message at a time on the "spliced" path.
> 
> Next, I would like to:
> 
> - bisect between 32 and 1 for UDP_MAX_FRAMES: maybe 32 affects data
>   locality too much, but some lower value would still be beneficial by
>   lowering syscall overhead

Ok.

> - try with sendmsg() instead of sendmmsg(), at this point. Looking at
>   the kernel, that doesn't seem to make a real difference.

Which sendmmsg() specifically are you looking at changing?

> About this series: should we just go ahead and apply it with
> UDP_MAX_FRAMES set to 1 for the time being? It's anyway better than
> the existing situation.

I think that's a good idea - or rather, not setting UDP_MAX_FRAMES to
1, but clamping the batch size to 1 for pasta - I'm pretty sure we
still want the batching for passt.  We lose a little bit on
small-packet spliced, but we gain on both tap and large-packet
spliced.  This will unblock the dual stack udp stuff and we can
further tune it later.
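On the sendmsg()-versus-sendmmsg() point above, the trade-off is simply one
syscall per batch against one syscall per message; a minimal sketch of the
two shapes using the plain Linux socket API, not the actual udp_tap_send()
code:

#define _GNU_SOURCE
#include <sys/socket.h>

/* Batched: a single syscall submits up to n messages. */
static int send_batch(int s, struct mmsghdr *mmh, unsigned int n)
{
	return sendmmsg(s, mmh, n, 0);
}

/* Unbatched: one sendmsg() syscall per message. */
static int send_one_by_one(int s, const struct mmsghdr *mmh, unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++) {
		if (sendmsg(s, &mmh[i].msg_hdr, 0) < 0)
			return -1;
	}

	return (int)n;
}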
-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson