Date: Thu, 2 Jan 2025 23:00:14 +0100
From: Stefano Brivio
To: David Gibson
Cc: passt-dev@passt.top
Subject: Re: [PATCH v2 11/12] tap: Don't size pool_tap[46] for the maximum number of packets
Message-ID: <20250102230014.3dcc957d@elisabeth>
References: <20241220083535.1372523-1-david@gibson.dropbear.id.au> <20241220083535.1372523-12-david@gibson.dropbear.id.au> <20250101225444.130c1034@elisabeth>

On Thu, 2 Jan 2025 14:46:45 +1100
David Gibson wrote:

> On Wed, Jan 01, 2025 at 10:54:44PM +0100, Stefano Brivio wrote:
> > On Fri, 20 Dec 2024 19:35:34 +1100
> > David Gibson wrote:
> >
> > > Currently we attempt to size pool_tap[46] so they have room for the
> > > maximum possible number of packets that could fit in pkt_buf,
> > > TAP_MSGS. However, the calculation isn't quite correct: TAP_MSGS is
> > > based on ETH_ZLEN (60) as the minimum possible L2 frame size. But,
> > > we don't enforce that L2 frames are at least ETH_ZLEN when we
> > > receive them from the tap backend, and since we're dealing with
> > > virtual interfaces we don't have the physical Ethernet limitations
> > > requiring that length. Indeed it is possible to generate a
> > > legitimate frame smaller than that (e.g. a zero-payload UDP/IPv4
> > > frame on the 'pasta' backend is only 42 bytes long).
> > >
> > > It's also unclear if this limit is sufficient for vhost-user which
> > > isn't limited by the size of pkt_buf as the other modes are.
> > >
> > > We could attempt to correct the calculation, but that would leave us
> > > with even larger arrays, which in practice rarely accumulate more
> > > than a handful of packets. So, instead, put an arbitrary cap on the
> > > number of packets we can put in a batch, and if we run out of space,
> > > process and flush the batch.
> >
> > I ran a few more tests with this, keeping TAP_MSGS at 256, and in
> > general I couldn't really see a difference in latency (especially for
> > UDP streams with small packets) or throughput. Figures from short
> > throughput tests (such as the ones from the test suite) look a bit
> > more variable, but I don't have any statistically meaningful data.
> >
> > Then I looked into how many messages we might have in the array
> > without this change, and I realised that, with the throughput tests
> > from the suite, we very easily exceed the 256 limit.
>
> Ah, interesting.
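(For context on the figures below: the existing bound is derived by
assuming that every frame in pkt_buf could be as small as ETH_ZLEN,
roughly along these lines -- just a sketch, not the exact expression
from tap.h, and with a made-up stand-in for sizeof(pkt_buf):

	#include <stdio.h>
	#include <linux/if_ether.h>	/* ETH_ZLEN (60) */

	#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))
	/* Stand-in for sizeof(pkt_buf), assumed to be 8 MiB here */
	#define PKT_BUF_BYTES		(8UL * 1024 * 1024)
	/* Worst case: every slot holds a minimum-sized Ethernet frame */
	#define TAP_MSGS		DIV_ROUND_UP(PKT_BUF_BYTES, ETH_ZLEN)

	int main(void)
	{
		/* Prints ~140000 with the 8 MiB assumption above, the
		 * same order of magnitude as the ~160k iovec slots I
		 * mention below */
		printf("%lu\n", (unsigned long)TAP_MSGS);
		return 0;
	}

that is, the pool is currently sized for well over 100k minimum-sized
frames, which is the worst case that the flat cap of 256 would replace.)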
> > Perhaps surprisingly, we get the highest buffer counts with TCP
> > transfers and intermediate MTUs: we're at about 4000-5000 with 1500
> > bytes (and more like ~1000 with 1280 bytes), meaning that we move 6
> > to 8 megabytes in one shot, every 5-10 ms (at 8 Gbps). With that kind
> > of time interval, the extra system call overhead from forcibly
> > flushing batches might become rather relevant.
>
> Really?  I thought syscall overhead (as in the part that's
> per-syscall, rather than per-work) was generally in the tens of µs
> range, rather than the ms range.

Tens or hundreds of µs (mind that it's several of them), so there could
be just one order of magnitude between the two.

> But in any case, I'd be fine with upping the size of the array to 4k
> or 8k based on that empirical data.  That's still much smaller than
> the >150k we have now.

I would even go with 32k -- there are embedded systems with a ton of
memory but still much slower clocks compared to my typical setup. Go
figure. Again, I think we should test and profile this, ideally, but if
not, then let's pick something that's ~10x of what I see.

> > With lower MTUs, it looks like we have a lower CPU load and
> > transmissions are scheduled differently (resulting in smaller
> > batches), but I didn't really trace things.
>
> Ok.  I wonder if with the small MTUs we're hitting throughput
> bottlenecks elsewhere which mean this particular path isn't
> over-exercised.
>
> > So I start thinking that this has the *potential* to introduce a
> > performance regression in some cases, and we shouldn't just assume
> > that some arbitrary 256 limit is good enough. I didn't check with
> > perf(1), though.
> >
> > Right now that array takes, effectively, less than 100 KiB (it's
> > ~5000 copies of struct iovec, 16 bytes each), and in theory that
> > could be ~2.5 MiB (at 161319 items). Even if we double or triple that
> > (let's assume we use 2 * ETH_ALEN to keep it simple) it's not much...
> > and will have no practical effect anyway.
>
> Yeah, I guess.  I don't love the fact that currently for correctness
> (not spuriously dropping packets) we rely on a fairly complex
> calculation that's based on information from different layers: the
> buffer size and enforcement is in the packet pool layer and is
> independent of packet layout, but the minimum frame size comes from
> the tap layer and depends quite specifically on which L2 encapsulation
> we're using.

Well, but it's exactly one line, and we're talking about the same
project and tool, not something that's separated by several API layers.

By the way: on one hand you have that. On the other hand, you're adding
an arbitrary limit that comes from a test I just did, which is also
based on information from different layers.

> > All in all, I think we shouldn't change this limit without a deeper
> > understanding of the practical impact. While this change doesn't
> > bring any practical advantage, the current behaviour is somewhat
> > tested by now, and a small limit isn't.

...and this still applies, I think.

-- 
Stefano