Date: Tue, 7 Jan 2025 23:56:34 +0100
From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson
Cc: passt-dev@passt.top
Subject: Re: [PATCH v2 11/12] tap: Don't size pool_tap[46] for the maximum number of packets
Message-ID: <20250107235634.45584ac7@elisabeth>
References: <20241220083535.1372523-1-david@gibson.dropbear.id.au>
 <20241220083535.1372523-12-david@gibson.dropbear.id.au>
 <20250101225444.130c1034@elisabeth>
 <20250102230014.3dcc957d@elisabeth>

On Fri, 3 Jan 2025 17:06:49 +1100
David Gibson wrote:

> On Thu, Jan 02, 2025 at 11:00:14PM +0100, Stefano Brivio wrote:
> > On Thu, 2 Jan 2025 14:46:45 +1100
> > David Gibson wrote:
> >
> > > On Wed, Jan 01, 2025 at 10:54:44PM +0100, Stefano Brivio wrote:
> > > > On Fri, 20 Dec 2024 19:35:34 +1100
> > > > David Gibson wrote:
> > > >
> > > > > Currently we attempt to size pool_tap[46] so they have room for the maximum
> > > > > possible number of packets that could fit in pkt_buf, TAP_MSGS. However,
> > > > > the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as
> > > > > the minimum possible L2 frame size. But, we don't enforce that L2 frames
> > > > > are at least ETH_ZLEN when we receive them from the tap backend, and since
> > > > > we're dealing with virtual interfaces we don't have the physical Ethernet
> > > > > limitations requiring that length. Indeed it is possible to generate a
> > > > > legitimate frame smaller than that (e.g. a zero-payload UDP/IPv4 frame on
> > > > > the 'pasta' backend is only 42 bytes long).
> > > > >
> > > > > It's also unclear if this limit is sufficient for vhost-user which isn't
> > > > > limited by the size of pkt_buf as the other modes are.
> > > > >
> > > > > We could attempt to correct the calculation, but that would leave us with
> > > > > even larger arrays, which in practice rarely accumulate more than a handful
> > > > > of packets. So, instead, put an arbitrary cap on the number of packets we
> > > > > can put in a batch, and if we run out of space, process and flush the
> > > > > batch.
> > > >
> > > > I ran a few more tests with this, keeping TAP_MSGS at 256, and in
> > > > general I couldn't really see a difference in latency (especially for
> > > > UDP streams with small packets) or throughput.
> > > > Figures from short throughput tests (such as the ones from the test
> > > > suite) look a bit more variable, but I don't have any statistically
> > > > meaningful data.
> > > >
> > > > Then I looked into how many messages we might have in the array without
> > > > this change, and I realised that, with the throughput tests from the
> > > > suite, we very easily exceed the 256 limit.
> > >
> > > Ah, interesting.
> > >
> > > > Perhaps surprisingly we get the highest buffer counts with TCP transfers
> > > > and intermediate MTUs: we're at about 4000-5000 with 1500 bytes (and
> > > > more like ~1000 with 1280 bytes) meaning that we move 6 to 8 megabytes
> > > > in one shot, every 5-10ms (at 8 Gbps). With that kind of time interval,
> > > > the extra system call overhead from forcibly flushing batches might
> > > > become rather relevant.
> > >
> > > Really? I thought syscall overhead (as in the part that's
> > > per-syscall, rather than per-work) was generally in the tens of µs
> > > range, rather than the ms range.
> >
> > Tens or hundreds of µs (mind that it's several of them), so there
> > could be just one order of magnitude between the two.
>
> Hm, ok.
>
> > > But in any case, I'd be fine with upping the size of the array to 4k
> > > or 8k based on that empirical data. That's still much smaller than
> > > the >150k we have now.
> >
> > I would even go with 32k -- there are embedded systems with a ton of
> > memory but still much slower clocks compared to my typical setup. Go
> > figure. Again, I think we should test and profile this, ideally, but if
> > not, then let's pick something that's ~10x of what I see.
> >
> > > > With lower MTUs, it looks like we have a lower CPU load and
> > > > transmissions are scheduled differently (resulting in smaller batches),
> > > > but I didn't really trace things.
> > >
> > > Ok. I wonder if with the small MTUs we're hitting throughput
> > > bottlenecks elsewhere which mean this particular path isn't
> > > over-exercised.
> > >
> > > > So I start thinking that this has the *potential* to introduce a
> > > > performance regression in some cases and we shouldn't just assume that
> > > > some arbitrary 256 limit is good enough. I didn't check with perf(1),
> > > > though.
> > > >
> > > > Right now that array takes, effectively, less than 100 KiB (it's ~5000
> > > > copies of struct iovec, 16 bytes each), and in theory that could be
> > > > ~2.5 MiB (at 161319 items). Even if we double or triple that (let's
> > > > assume we use 2 * ETH_ALEN to keep it simple) it's not much... and will
> > > > have no practical effect anyway.
> > >
> > > Yeah, I guess. I don't love the fact that currently for correctness
> > > (not spuriously dropping packets) we rely on a fairly complex
> > > calculation that's based on information from different layers: the
> > > buffer size and enforcement is in the packet pool layer and is
> > > independent of packet layout, but the minimum frame size comes from
> > > the tap layer and depends quite specifically on which L2 encapsulation
> > > we're using.
> >
> > Well but it's exactly one line, and we're talking about the same
> > project and tool, not something that's separated by several API layers.
> >
> > By the way: on one hand you have that. On the other hand, you're adding
> > an arbitrary limit that comes from a test I just did, which is also
> > based on information from different layers.
>
> The difference is that it used to be a hard limit: if we exceeded it
> we drop packets. Now, we just do a little more work.
>
> Actually... that makes me realise the actual size of the batches isn't
> the relevant factor in choosing the size. The amortized cost of
> processing a packet will be pretty much:
>
>      (per byte costs) * (packet size)
>    + (per packet costs)

Which is not independent of batch size, though (cache locality).

>    + (per batch costs) / (batch size)
>
> So, we just need to allow (batch size) to become large enough to make
> the last term negligible compared to the other two. I find it hard to
> believe that we'd need more than ~1000 to do that.

...but anyway, yes, that's probably a minor detail, so in theory it's
like that, and probably also in practice.

But my whole point here is: let's not change the behaviour (regardless
of the theory) if we don't have time to profile things, because it's
not great to let users do that, and the current behaviour with the
current threshold is something we had for a long time. Quoting myself:

> > Again, I think we should test and profile this, ideally, but if
> > not, then let's pick something that's ~10x of what I see.

-- 
Stefano
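
[For reference, a minimal sketch of the cap-and-flush idea being debated
above. This is not the actual passt code: TAP_MSGS_CAP, struct tap_batch,
tap_batch_add() and the process callback are hypothetical names used only
for illustration.]

#include <stddef.h>
#include <sys/uio.h>

#define TAP_MSGS_CAP 256	/* arbitrary batch cap under discussion */

struct tap_batch {
	struct iovec frames[TAP_MSGS_CAP];
	size_t count;
};

/* Process and empty the current batch; 'process' stands in for whatever
 * handler would walk the queued frames.
 */
static void tap_batch_flush(struct tap_batch *b,
			    void (*process)(const struct iovec *, size_t))
{
	if (b->count) {
		process(b->frames, b->count);
		b->count = 0;
	}
}

/* Queue one frame; if the batch is already full, flush it first instead
 * of dropping the frame or sizing the array for the theoretical maximum.
 */
static void tap_batch_add(struct tap_batch *b, void *base, size_t len,
			  void (*process)(const struct iovec *, size_t))
{
	if (b->count == TAP_MSGS_CAP)
		tap_batch_flush(b, process);

	b->frames[b->count].iov_base = base;
	b->frames[b->count].iov_len = len;
	b->count++;
}

[With a cap like this, the per-batch cost (e.g. one extra writev() round
per flush) is amortized over up to TAP_MSGS_CAP frames, which is exactly
the trade-off against batch size weighed in the thread.]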