From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eugenio Perez Martin <eperezma@redhat.com>
Date: Wed, 11 Jun 2025 09:04:57 +0200
Subject: Re: vhost-kernel net on pasta: from 26 to 37Gbit/s
To: Stefano Brivio
Cc: passt-dev@passt.top, Jason Wang, Jeff Nelson, Paul Holzinger
In-Reply-To: <20250610172931.4c730f04@elisabeth>
References: <20250521120855.5cdaeb04@elisabeth> <20250606183702.0ff9a3c7@elisabeth> <20250610172931.4c730f04@elisabeth>
Message-ID-Hash: B2DP47TVFISHLKRR6KSR276KOIZBMXZN
List-Id: Development discussion and patches for passt

On Tue, Jun 10, 2025 at 5:29 PM Stefano Brivio wrote:
>
> [Adding Paul as Podman developer]
>
> On Mon, 9 Jun 2025 11:59:21 +0200
> Eugenio Perez Martin wrote:
>
> > On Fri, Jun 6, 2025 at 6:37 PM Stefano Brivio wrote:
> > >
> > > On Fri, 6 Jun 2025 16:32:38 +0200
> > > Eugenio Perez Martin wrote:
> > >
> > > > On Wed, May 21, 2025 at 12:35 PM Eugenio Perez Martin wrote:
> > > > >
> > > > > On Wed, May 21, 2025 at 12:09 PM Stefano Brivio wrote:
> > > > > >
> > > > > > On Tue, 20 May 2025 17:09:44 +0200
> > > > > > Eugenio Perez Martin wrote:
> > > > > >
> > > > > > > [...]
> > > > > > >
> > > > > > > Now if I isolate the vhost kernel thread [1] I get way more
> > > > > > > performance as expected:
> > > > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > > > [  5]   0.00-10.00  sec  43.1 GBytes  37.1 Gbits/sec    0            sender
> > > > > > > [  5]   0.00-10.04  sec  43.1 GBytes  36.9 Gbits/sec                 receiver
> > > > > > >
> > > > > > > After analyzing perf output, rep_movs_alternative is the most called
> > > > > > > function in all three: iperf3 (~20% Self), passt.avx2 (~15% Self) and
> > > > > > > vhost (~15% Self)
> > > > > >
> > > > > > Interesting... s/most called function/function using the most cycles/, I
> > > > > > suppose.
> > > > > >
> > > > >
> > > > > Right!
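As a quick sanity check of the figures above (not part of the original exchange): iperf3 reports Transfer in binary GBytes (2^30 bytes) but Bitrate in decimal Gbits/sec, so 43.1 GBytes over 10 seconds works out to roughly 37 Gbit/s:

```shell
# Convert iperf3's binary GBytes over an interval into decimal Gbits/sec.
# The small difference from the quoted 37.1 comes from iperf3 rounding the
# transfer figure before printing it.
awk 'BEGIN {
    gbytes = 43.1; secs = 10.00
    printf "%.1f Gbits/sec\n", gbytes * 8 * 1073741824 / 1e9 / secs
}'
```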
> > > > >
> > > > > > So it looks somewhat similar to
> > > > > >
> > > > > > https://archives.passt.top/passt-dev/20241017021027.2ac9ea53@elisabeth/
> > > > > >
> > > > > > now?
> > > > > >
> > > > >
> > > > > Kind of. Below tcp_sendmsg_locked I don't see sk_page_frag_refill but
> > > > > skb_do_copy_data_nocache. Not sure if that means something, as it
> > > > > should not be affected by vhost.
> > > > >
> > > > > > > But I don't see any of them consuming 100% of CPU in
> > > > > > > top: pasta consumes ~85% CPU, both iperf3 client and server consume
> > > > > > > ~60%, and vhost consumes ~53%.
> > > > > > >
> > > > > > > So... I have mixed feelings about this :). By "default" it seems to
> > > > > > > have less performance, but my test is maybe too synthetic.
> > > > > >
> > > > > > Well, surely we can't ask Podman users to pin specific stuff to given
> > > > > > CPU threads. :)
> > > > > >
> > > > >
> > > > > Yes, but maybe the result changes under the right schedule? I'm
> > > > > isolating the CPUs entirely, which is not the usual case for pasta for
> > > > > sure :).
> > > > >
> > > > > > > There is room for improvement with the mentioned optimizations so I'd
> > > > > > > continue applying them, continuing with UDP and TCP zerocopy, and
> > > > > > > developing zerocopy vhost rx.
> > > > > >
> > > > > > That definitely makes sense to me.
> > > > > >
> > > > >
> > > > > Good!
> > > > >
> > > > > > > With these numbers I think the series should not be
> > > > > > > merged at the moment. I could send it as RFC if you want but I've not
> > > > > > > applied the comments the first one received, POC style :).
> > > > > >
> > > > > > I don't think it's really needed for you to spend time on
> > > > > > semi-polishing something just to have an RFC if you're still working on
> > > > > > it. I guess the implementation will change substantially anyway once
> > > > > > you factor in further optimisations.
> > > > > >
> > > > >
> > > > > Agree!
> > > > > I'll keep iterating on this then.
> > > >
> > > > Actually, if I remove all the taskset etc, and trust the kernel
> > > > scheduler, vanilla pasta gives me:
> > > > [pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
> > > > Connecting to host 10.6.68.254, port 5201
> > > > [  5] local 10.6.68.20 port 40408 connected to 10.6.68.254 port 5201
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> > > > [  5]   1.00-2.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> > > > [  5]   2.00-3.00   sec  3.12 GBytes  26.8 Gbits/sec    0   25.4 MBytes
> > > > [  5]   3.00-4.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> > > > [  5]   4.00-5.00   sec  3.10 GBytes  26.6 Gbits/sec    0   25.4 MBytes
> > > > [  5]   5.00-6.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> > > > [  5]   6.00-7.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> > > > [  5]   7.00-8.00   sec  3.09 GBytes  26.6 Gbits/sec    0   25.4 MBytes
> > > > [  5]   8.00-9.00   sec  3.08 GBytes  26.5 Gbits/sec    0   25.4 MBytes
> > > > [  5]   9.00-10.00  sec  3.10 GBytes  26.6 Gbits/sec    0   25.4 MBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec  31.0 GBytes  26.7 Gbits/sec    0   sender
> > > > [  5]   0.00-10.04  sec  31.0 GBytes  26.5 Gbits/sec        receiver
> > > >
> > > > And with vhost-net:
> > > > [pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
> > > > ...
> > > > Connecting to host 10.6.68.254, port 5201
> > > > [  5] local 10.6.68.20 port 46720 connected to 10.6.68.254 port 5201
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec  4.17 GBytes  35.8 Gbits/sec    0   11.9 MBytes
> > > > [  5]   1.00-2.00   sec  4.17 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> > > > [  5]   2.00-3.00   sec  4.16 GBytes  35.7 Gbits/sec    0   11.9 MBytes
> > > > [  5]   3.00-4.00   sec  4.14 GBytes  35.6 Gbits/sec    0   11.9 MBytes
> > > > [  5]   4.00-5.00   sec  4.16 GBytes  35.7 Gbits/sec    0   11.9 MBytes
> > > > [  5]   5.00-6.00   sec  4.16 GBytes  35.8 Gbits/sec    0   11.9 MBytes
> > > > [  5]   6.00-7.00   sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> > > > [  5]   7.00-8.00   sec  4.19 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> > > > [  5]   8.00-9.00   sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> > > > [  5]   9.00-10.00  sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec  41.7 GBytes  35.8 Gbits/sec    0   sender
> > > > [  5]   0.00-10.04  sec  41.7 GBytes  35.7 Gbits/sec        receiver
> > > >
> > > > If I go the extra mile and disable notifications (it might be just
> > > > noise, but...)
> > > > [pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
> > > > ...
> > > > Connecting to host 10.6.68.254, port 5201
> > > > [  5] local 10.6.68.20 port 56590 connected to 10.6.68.254 port 5201
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec  4.19 GBytes  36.0 Gbits/sec    0   12.4 MBytes
> > > > [  5]   1.00-2.00   sec  4.18 GBytes  35.9 Gbits/sec    0   12.4 MBytes
> > > > [  5]   2.00-3.00   sec  4.18 GBytes  35.9 Gbits/sec    0   12.4 MBytes
> > > > [  5]   3.00-4.00   sec  4.20 GBytes  36.1 Gbits/sec    0   12.4 MBytes
> > > > [  5]   4.00-5.00   sec  4.21 GBytes  36.2 Gbits/sec    0   12.4 MBytes
> > > > [  5]   5.00-6.00   sec  4.21 GBytes  36.1 Gbits/sec    0   12.4 MBytes
> > > > [  5]   6.00-7.00   sec  4.20 GBytes  36.1 Gbits/sec    0   12.4 MBytes
> > > > [  5]   7.00-8.00   sec  4.23 GBytes  36.4 Gbits/sec    0   12.4 MBytes
> > > > [  5]   8.00-9.00   sec  4.24 GBytes  36.4 Gbits/sec    0   12.4 MBytes
> > > > [  5]   9.00-10.00  sec  4.21 GBytes  36.2 Gbits/sec    0   12.4 MBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec  42.1 GBytes  36.1 Gbits/sec    0   sender
> > > > [  5]   0.00-10.04  sec  42.1 GBytes  36.0 Gbits/sec        receiver
> > > >
> > > > So I guess the best is to actually run performance tests closer to
> > > > real-world workloads against the new version and see if it works
> > > > better?
> > >
> > > Well, that's certainly a possibility.
> > >
> > > I'd say the biggest value for vhost-net usage in pasta is reaching
> > > throughput figures that are comparable with veth, with or without
> > > multithreading (keeping an eye on bytes per cycle, of course), with or
> > > without kernel changes, so that users won't need to choose between
> > > rootless and performance anymore.
> > >
> > > It would also simplify things in Podman quite a lot (and to some extent
> > > in rootlesskit / Docker as well). We're pretty much there with virtual
> > > machines, just not quite with containers (which is somewhat ironic, but
> > > of course there's a good reason for that).
> > >
> > > If we're clearly wasting cycles in vhost-net (because of the bounce
> > > buffer, plus something else perhaps?) *and* there's a somewhat possible
> > > solution for that in sight *and* the interface would change anyway,
> > > running throughput tests and polishing up the current version with a
> > > half-baked solution at the moment sounds a bit wasteful to me.
> >
> > My point is that I'm testing a very synthetic scenario. If everybody
> > agrees this is close enough to real-world ones, I'm ok to continue
> > improving the edges we see. If not, maybe we're picking the wrong
> > fruit even if it is low-hanging?
> >
> > Getting a table like [1] would shed some light on this, especially if
> > it is just a matter of running "make performance" or similar. Maybe we
> > need to include longer queues? Focus on a given scenario? UDP goes
> > better, but TCP?
>
> Well, it's a matter of running ./run under tests (or 'make' there).
> Have you tried that with your patch? It's kind of representative in the
> sense that it uses several message sizes and different values for the
> sending window.
>

Yes, but it freezes in my env. Copying the different windows, top-left:
# tail -f --retry /home/passt/test/test_logs/context_unshare.log /home/passt/test/test_logs/context_ns.log
tail: warning: --retry only effective for the initial open

==> /home/passt/test/test_logs/context_unshare.log <==
unshare$
tail: cannot open '/home/passt/test/test_logs/context_ns.log' for reading: No such file or directory

---

top-right:
# while cat /tmp/passt-tests-7HpbEm/log_pipe; do :; done
Test layout: single pasta instance with namespace.

---

bottom-left:
# tail -f --retry /home/passt/test/test_logs/context_host.log
tail: warning: --retry only effective for the initial open
host$

---

bottom-right:
# tail -f --retry /home/passt/test/test_logs/context_passt.log
tail: warning: --retry only effective for the initial open
passt$

---

And test/test_logs/test.log:
=== build/all
> Build passt
? ! [ -e passt ]
? [ -f passt ]
...passed.

> Build pasta
? ! [ -e pasta ]
? [ -h pasta ]
...passed.

> Build qrap
? ! [ -e qrap ]
? [ -f qrap ]
...passed.

> Build all
? ! [ -e passt ]
? ! [ -e pasta ]
? ! [ -e qrap ]
? [ -f passt ]
? [ -h pasta ]
? [ -f qrap ]
...passed.

> Install
? [ -f /tmp/passt-tests-7HpbEm/build/all/prefix/bin/passt ]
? [ -h /tmp/passt-tests-7HpbEm/build/all/prefix/bin/pasta ]
? [ -f /tmp/passt-tests-7HpbEm/build/all/prefix/bin/qrap ]
? man -M /tmp/passt-tests-7HpbEm/build/all/prefix/share/man -W passt
? man -M /tmp/passt-tests-7HpbEm/build/all/prefix/share/man -W pasta
? man -M /tmp/passt-tests-7HpbEm/build/all/prefix/share/man -W qrap
...passed.

> Uninstall
? ! [ -f /tmp/passt-tests-7HpbEm/build/all/prefix/bin/passt ]
? ! [ -h /tmp/passt-tests-7HpbEm/build/all/prefix/bin/pasta ]
? ! [ -f /tmp/passt-tests-7HpbEm/build/all/prefix/bin/qrap ]
? ! man -M /tmp/passt-tests-7HpbEm/build/all/prefix/share/man -W passt 2>/dev/null
? ! man -M /tmp/passt-tests-7HpbEm/build/all/prefix/share/man -W pasta 2>/dev/null
? ! man -M /tmp/passt-tests-7HpbEm/build/all/prefix/share/man -W qrap 2>/dev/null
...passed.

=== build/cppcheck
...skipped.
=== build/clang_tidy
...skipped.

---

> > Now more points about this scenario:
> > 1) I don't see 100% CPU usage in any element:
> >  CPU%
> > 84.2 passt.avx2
> > 57.9 iperf3
> > 57.2 iperf3
> > 50.7 vhost-1805109
>
> Still, I bet we're using an awful amount of cycles compared to veth.
>
> > 2) The most used (Self%) function in vhost is rep_movs_alternative,
> > called from skb_copy_datagram_iter, so yes, zero-copy should help a lot
> > here.
> >
> > Now, is "iperf3 -w 8M" representative? I'm sure ZC helps in this
> > scenario, does it make it worse if we have small packets? Do we care?
>
> We don't care _a lot_ about small packets because we can typically use
> large packets, inbound and outbound, at least for TCP (bulk) transfers.
> But users are doing all sorts of things with containers, including bulk
> transfers and VPN traffic over UDP, so we do, a bit.
>
> Again, the main value of using vhost-net, I think, is making "rootful"
> networking essentially unnecessary, or necessary just for niche use
> cases (say, non-TCP, non-UDP traffic, or macvlan-like cases). If there
> are relatively common use cases where pasta performs pretty badly
> compared to veth, we'll still need rootful networking.
>
> So, yes, it is representative, but not necessarily universal.
>
> > I'm totally ok with continuing trying with ZC, I just want to make
> > sure we're not missing anything :).
>
> In any case, it looks like vhost-net zero-copy is a bigger task than we
> thought, so, even if we don't reach a universal solution that makes
> rootful networking essentially unnecessary but do have a big
> improvement ready, there's of course a lot of value in it. Your call...
>

Got it, thanks!
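For reference, a sketch of the CPU isolation discussed earlier in the thread (pinning pasta and its vhost kernel thread to dedicated cores). The CPU numbers are arbitrary placeholders, and the vhost worker is assumed to appear as "vhost-<owner pid>", as the vhost-1805109 entry in the top output above suggests; this only prints the commands rather than running them:

```shell
# Placeholder PID taken from the top output above; on a live host you
# would obtain it with e.g. "pgrep -o pasta" instead.
pasta_pid=1805109

# Pin pasta to CPU 2, and the vhost-<pid> kernel thread to CPU 3.
echo "taskset -cp 2 $pasta_pid"
echo "taskset -cp 3 \$(pgrep -o vhost-$pasta_pid)"
```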