From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 3 Jan 2025 17:06:49 +1100
From: David Gibson
To: Stefano Brivio
Cc: passt-dev@passt.top
Subject: Re: [PATCH v2 11/12] tap: Don't size pool_tap[46] for the maximum number of packets
References: <20241220083535.1372523-1-david@gibson.dropbear.id.au>
 <20241220083535.1372523-12-david@gibson.dropbear.id.au>
 <20250101225444.130c1034@elisabeth>
 <20250102230014.3dcc957d@elisabeth>
In-Reply-To: <20250102230014.3dcc957d@elisabeth>

On Thu, Jan 02, 2025 at 11:00:14PM +0100, Stefano Brivio wrote:
> On Thu, 2 Jan 2025 14:46:45 +1100
> David Gibson wrote:
> 
> > On Wed, Jan 01, 2025 at 10:54:44PM +0100, Stefano Brivio wrote:
> > > On Fri, 20 Dec 2024 19:35:34 +1100
> > > David Gibson wrote:
> > > 
> > > > Currently we attempt to size pool_tap[46] so they have room for the maximum
> > > > possible number of packets that could fit in pkt_buf, TAP_MSGS.  However,
> > > > the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as
> > > > the minimum possible L2 frame size.  But, we don't enforce that L2 frames
> > > > are at least ETH_ZLEN when we receive them from the tap backend, and since
> > > > we're dealing with virtual interfaces we don't have the physical Ethernet
> > > > limitations requiring that length.  Indeed it is possible to generate a
> > > > legitimate frame smaller than that (e.g. a zero-payload UDP/IPv4 frame on
> > > > the 'pasta' backend is only 42 bytes long).
> > > >
> > > > It's also unclear if this limit is sufficient for vhost-user which isn't
> > > > limited by the size of pkt_buf as the other modes are.
> > > >
> > > > We could attempt to correct the calculation, but that would leave us with
> > > > even larger arrays, which in practice rarely accumulate more than a handful
> > > > of packets.  So, instead, put an arbitrary cap on the number of packets we
> > > > can put in a batch, and if we run out of space, process and flush the
> > > > batch.
> > >
> > > I ran a few more tests with this, keeping TAP_MSGS at 256, and in
> > > general I couldn't really see a difference in latency (especially for
> > > UDP streams with small packets) or throughput. Figures from short
> > > throughput tests (such as the ones from the test suite) look a bit more
> > > variable, but I don't have any statistically meaningful data.
> > >
> > > Then I looked into how many messages we might have in the array without
> > > this change, and I realised that, with the throughput tests from the
> > > suite, we very easily exceed the 256 limit.
> >
> > Ah, interesting.
> >
> > > Perhaps surprisingly we get the highest buffer counts with TCP transfers
> > > and intermediate MTUs: we're at about 4000-5000 with 1500 bytes (and
> > > more like ~1000 with 1280 bytes) meaning that we move 6 to 8 megabytes
> > > in one shot, every 5-10ms (at 8 Gbps). With that kind of time interval,
> > > the extra system call overhead from forcibly flushing batches might
> > > become rather relevant.
> >
> > Really?  I thought syscall overhead (as in the part that's
> > per-syscall, rather than per-work) was generally in the tens of µs
> > range, rather than the ms range.
> 
> Tens or hundreds of µs (mind that it's several of them), so there
> could be just one order of magnitude between the two.

Hm, ok.

> > But in any case, I'd be fine with upping the size of the array to 4k
> > or 8k based on that empirical data.  That's still much smaller than
> > the >150k we have now.
> 
> I would even go with 32k -- there are embedded systems with a ton of
> memory but still much slower clocks compared to my typical setup. Go
> figure. Again, I think we should test and profile this, ideally, but if
> not, then let's pick something that's ~10x of what I see.
> 
> > > With lower MTUs, it looks like we have a lower CPU load and
> > > transmissions are scheduled differently (resulting in smaller batches),
> > > but I didn't really trace things.
> >
> > Ok.  I wonder if with the small MTUs we're hitting throughput
> > bottlenecks elsewhere which mean this particular path isn't
> > over-exercised.
> >
> > > So I start thinking that this has the *potential* to introduce a
> > > performance regression in some cases and we shouldn't just assume that
> > > some arbitrary 256 limit is good enough. I didn't check with perf(1),
> > > though.
> > >
> > > Right now that array takes, effectively, less than 100 KiB (it's ~5000
> > > copies of struct iovec, 16 bytes each), and in theory that could be
> > > ~2.5 MiB (at 161319 items). Even if we double or triple that (let's
> > > assume we use 2 * ETH_ALEN to keep it simple) it's not much... and will
> > > have no practical effect anyway.
> >
> > Yeah, I guess.
> > I don't love the fact that currently for correctness
> > (not spuriously dropping packets) we rely on a fairly complex
> > calculation that's based on information from different layers: the
> > buffer size and enforcement is in the packet pool layer and is
> > independent of packet layout, but the minimum frame size comes from
> > the tap layer and depends quite specifically on which L2 encapsulation
> > we're using.
> 
> Well but it's exactly one line, and we're talking about the same
> project and tool, not something that's separated by several API layers.
> 
> By the way: on one hand you have that. On the other hand, you're adding
> an arbitrary limit that comes from a test I just did, which is also
> based on information from different layers.

The difference is that it used to be a hard limit: if we exceeded it,
we dropped packets.  Now, we just do a little more work.

Actually... that makes me realise the actual size of the batches isn't
the relevant factor in choosing the size.  The amortized cost of
processing a packet will be pretty much:

    (per byte costs) * (packet size)
        + (per packet costs)
        + (per batch costs) / (batch size)

So, we just need to allow (batch size) to become large enough to make
the last term negligible compared to the other two.  I find it hard to
believe that we'd need more than ~1000 to do that.

> > > All in all, I think we shouldn't change this limit without a deeper
> > > understanding of the practical impact. While this change doesn't bring
> > > any practical advantage, the current behaviour is somewhat tested by
> > > now, and a small limit isn't.
> 
> ...and this still applies, I think.

-- 
David Gibson (he or they)         | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au    | minimalist, thank you, not the other way
                                  | around.
http://www.ozlabs.org/~dgibson
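
For reference on the 42-byte figure in the quoted commit message near the
top of the thread: a zero-payload UDP/IPv4 frame is just an Ethernet header
(14 bytes), a minimal IPv4 header (20 bytes) and a UDP header (8 bytes).
A standalone sketch of that arithmetic, using the usual Linux uapi/libc
headers rather than anything from the passt tree:

/* Standalone check of the frame sizes the commit message compares:
 * a zero-payload UDP/IPv4 frame vs the ETH_ZLEN minimum. */
#include <stdio.h>
#include <linux/if_ether.h>     /* ETH_HLEN (14), ETH_ZLEN (60) */
#include <netinet/ip.h>         /* struct iphdr, 20 bytes without options */
#include <netinet/udp.h>        /* struct udphdr, 8 bytes */

int main(void)
{
        size_t min_frame = ETH_HLEN + sizeof(struct iphdr) +
                           sizeof(struct udphdr);

        /* 14 + 20 + 8 = 42 bytes, well below ETH_ZLEN */
        printf("smallest legitimate frame: %zu bytes, ETH_ZLEN: %d bytes\n",
               min_frame, ETH_ZLEN);
        return 0;
}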
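
And a back-of-the-envelope check of the amortized-cost argument above,
assuming, purely for illustration, something like 10 µs for one extra batch
flush and 1 µs of per-packet processing; neither figure is a measurement
from passt, so this is a sketch of the shape of the curve, not a benchmark:

/* How much of the per-batch (flush/syscall) cost each packet carries,
 * for a few candidate batch sizes.  The cost figures are assumptions. */
#include <stdio.h>

int main(void)
{
        const double per_batch_us = 10.0;   /* assumed cost of one extra flush */
        const double per_packet_us = 1.0;   /* assumed per-packet processing cost */
        const int sizes[] = { 256, 1024, 4096, 32768 };

        for (int i = 0; i < (int)(sizeof(sizes) / sizeof(sizes[0])); i++) {
                /* last term of the formula above, amortized over the batch */
                double share = per_batch_us / sizes[i];

                printf("batch size %5d: +%.4f us/packet (%.2f%% of per-packet cost)\n",
                       sizes[i], share, 100.0 * share / per_packet_us);
        }
        return 0;
}

Under those assumed numbers the flush overhead is already down to roughly
1% of the per-packet cost at a batch size of ~1000, so going to something
like 32k mostly buys extra headroom rather than measurable throughput.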