Date: Thu, 2 Jan 2025 23:00:14 +0100
From: Stefano Brivio
To: David Gibson
Cc: passt-dev@passt.top
Subject: Re: [PATCH v2 11/12] tap: Don't size pool_tap[46] for the maximum number of packets
Message-ID: <20250102230014.3dcc957d@elisabeth>
References: <20241220083535.1372523-1-david@gibson.dropbear.id.au> <20241220083535.1372523-12-david@gibson.dropbear.id.au> <20250101225444.130c1034@elisabeth>

On Thu, 2 Jan 2025 14:46:45 +1100
David Gibson wrote:

> On Wed, Jan 01, 2025 at 10:54:44PM +0100, Stefano Brivio wrote:
> > On Fri, 20 Dec 2024 19:35:34 +1100
> > David Gibson wrote:
> >
> > > Currently we attempt to size pool_tap[46] so they have room for the
> > > maximum possible number of packets that could fit in pkt_buf,
> > > TAP_MSGS. However, the calculation isn't quite correct: TAP_MSGS is
> > > based on ETH_ZLEN (60) as the minimum possible L2 frame size. But,
> > > we don't enforce that L2 frames are at least ETH_ZLEN when we
> > > receive them from the tap backend, and since we're dealing with
> > > virtual interfaces we don't have the physical Ethernet limitations
> > > requiring that length. Indeed it is possible to generate a
> > > legitimate frame smaller than that (e.g. a zero-payload UDP/IPv4
> > > frame on the 'pasta' backend is only 42 bytes long).
> > >
> > > It's also unclear if this limit is sufficient for vhost-user which
> > > isn't limited by the size of pkt_buf as the other modes are.
> > >
> > > We could attempt to correct the calculation, but that would leave us
> > > with even larger arrays, which in practice rarely accumulate more
> > > than a handful of packets. So, instead, put an arbitrary cap on the
> > > number of packets we can put in a batch, and if we run out of space,
> > > process and flush the batch.
> >
> > I ran a few more tests with this, keeping TAP_MSGS at 256, and in
> > general I couldn't really see a difference in latency (especially for
> > UDP streams with small packets) or throughput. Figures from short
> > throughput tests (such as the ones from the test suite) look a bit
> > more variable, but I don't have any statistically meaningful data.
> >
> > Then I looked into how many messages we might have in the array
> > without this change, and I realised that, with the throughput tests
> > from the suite, we very easily exceed the 256 limit.
>
> Ah, interesting.
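(For context on the figures below: the existing bound is derived by
assuming that every frame in pkt_buf could be as small as ETH_ZLEN,
roughly along these lines -- just a sketch, not the exact expression
from tap.h, and with a made-up stand-in for sizeof(pkt_buf):

	#include <stdio.h>
	#include <linux/if_ether.h>	/* ETH_ZLEN (60) */

	#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))
	/* Stand-in for sizeof(pkt_buf), assumed to be 8 MiB here */
	#define PKT_BUF_BYTES		(8UL * 1024 * 1024)
	/* Worst case: every slot holds a minimum-sized Ethernet frame */
	#define TAP_MSGS		DIV_ROUND_UP(PKT_BUF_BYTES, ETH_ZLEN)

	int main(void)
	{
		/* Prints ~140000 with the 8 MiB assumption above, the
		 * same order of magnitude as the ~160k iovec slots I
		 * mention below */
		printf("%lu\n", (unsigned long)TAP_MSGS);
		return 0;
	}

that is, the pool is currently sized for well over 100k minimum-sized
frames, which is the worst case that the flat cap of 256 would replace.)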
> > Perhaps surprisingly, we get the highest buffer counts with TCP
> > transfers and intermediate MTUs: we're at about 4000-5000 with 1500
> > bytes (and more like ~1000 with 1280 bytes), meaning that we move 6
> > to 8 megabytes in one shot, every 5-10 ms (at 8 Gbps). With that kind
> > of time interval, the extra system call overhead from forcibly
> > flushing batches might become rather relevant.
>
> Really?  I thought syscall overhead (as in the part that's
> per-syscall, rather than per-work) was generally in the tens of µs
> range, rather than the ms range.

Tens or hundreds of µs (mind that it's several of them), so there could
be just one order of magnitude between the two.

> But in any case, I'd be fine with upping the size of the array to 4k
> or 8k based on that empirical data.  That's still much smaller than
> the >150k we have now.

I would even go with 32k -- there are embedded systems with a ton of
memory but still much slower clocks compared to my typical setup. Go
figure. Again, I think we should test and profile this, ideally, but if
not, then let's pick something that's ~10x of what I see.

> > With lower MTUs, it looks like we have a lower CPU load and
> > transmissions are scheduled differently (resulting in smaller
> > batches), but I didn't really trace things.
>
> Ok.  I wonder if with the small MTUs we're hitting throughput
> bottlenecks elsewhere which mean this particular path isn't
> over-exercised.
>
> > So I start thinking that this has the *potential* to introduce a
> > performance regression in some cases, and we shouldn't just assume
> > that some arbitrary 256 limit is good enough. I didn't check with
> > perf(1), though.
> >
> > Right now that array takes, effectively, less than 100 KiB (it's
> > ~5000 copies of struct iovec, 16 bytes each), and in theory that
> > could be ~2.5 MiB (at 161319 items). Even if we double or triple that
> > (let's assume we use 2 * ETH_ALEN to keep it simple) it's not much...
> > and will have no practical effect anyway.
>
> Yeah, I guess.  I don't love the fact that currently for correctness
> (not spuriously dropping packets) we rely on a fairly complex
> calculation that's based on information from different layers: the
> buffer size and enforcement is in the packet pool layer and is
> independent of packet layout, but the minimum frame size comes from
> the tap layer and depends quite specifically on which L2 encapsulation
> we're using.

Well, but it's exactly one line, and we're talking about the same
project and tool, not something that's separated by several API layers.

By the way: on one hand you have that. On the other hand, you're adding
an arbitrary limit that comes from a test I just did, which is also
based on information from different layers.

> > All in all, I think we shouldn't change this limit without a deeper
> > understanding of the practical impact. While this change doesn't
> > bring any practical advantage, the current behaviour is somewhat
> > tested by now, and a small limit isn't.

...and this still applies, I think.

-- 
Stefano