* vhost-kernel net on pasta: from 26 to 37Gbit/s
From: Eugenio Perez Martin @ 2025-05-20 15:09 UTC
To: passt-dev; +Cc: Jason Wang, Jeff Nelson

Hi!

Some updates on the integration. The main culprit was allowing pasta to
keep reading packets with the regular read() on the tap device. I thought
that part was completely disabled, but I guess the kernel is able to omit
the notification on tap as long as userspace does not read from it.

My scenario: everything runs on different CPUs, all in the same NUMA node.
I run the iperf3 server on CPU 11 with "iperf3 -A 11 -s". All odd CPUs are
isolated with isolcpus=1,3,... nohz=on nohz_full=1,3,...

With vanilla pasta isolated to CPUs 1,3 with taskset, just the --config-net
option, and iperf3 run as "iperf3 -A 5 -c 10.6.68.254 -w 8M":

- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  30.7 GBytes  26.4 Gbits/sec    0          sender
[  5]   0.00-10.04  sec  30.7 GBytes  26.3 Gbits/sec               receiver

Now, trying the same with the vhost patches, we get slightly worse
performance:

- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  25.5 GBytes  21.9 Gbits/sec    0          sender
[  5]   0.00-10.04  sec  25.5 GBytes  21.8 Gbits/sec               receiver

The vhost patch still lacks optimizations such as disabling notifications
or batching rx available buffer notifications. At the moment it refills the
rx buffers on every iteration, and it does not set the no_notify bit, which
would make the kernel skip used buffer notifications while pasta is
actively checking the queue; that is not optimal.

Now, if I isolate the vhost kernel thread [1], I get way more performance,
as expected:

- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  43.1 GBytes  37.1 Gbits/sec    0          sender
[  5]   0.00-10.04  sec  43.1 GBytes  36.9 Gbits/sec               receiver

Analyzing the perf output, rep_movs_alternative is the most called function
in all three of iperf3 (~20% Self), passt.avx2 (~15% Self) and vhost
(~15% Self). But I don't see any of them consuming 100% of a CPU in top:
pasta consumes ~85% CPU, the iperf3 client and server each consume ~60%,
and vhost consumes ~53%.

So... I have mixed feelings about this :). By "default" it seems to have
less performance, but my test is maybe too synthetic. There is room for
improvement with the mentioned optimizations, so I'd continue applying
them, continuing with UDP and TCP zerocopy, and developing zerocopy vhost
rx. With these numbers I think the series should not be merged at the
moment. I could send it as an RFC if you want, but I've not applied the
comments the first one received, POC style :).

Thanks!

[1] Notes to reproduce it: I can spot the vhost thread with top -H and then
pin it with taskset. Either the latest changes in the module or the way
pasta behaves does not let me see it in classical ps output.

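For reference, a minimal sketch of the pinning described above, as a reader
might reproduce it. The exact taskset invocations are not given in the
message, so the command lines below are assumptions pieced together from
the CPU numbers and commands mentioned in the thread (including the pasta
binary path):

  # iperf3 server pinned to CPU 11 (host side)
  iperf3 -A 11 -s

  # pasta pinned to isolated CPUs 1,3; it spawns the client in its namespace
  taskset -c 1,3 ./pasta --config-net iperf3 -A 5 -c 10.6.68.254 -w 8M

  # find the vhost worker thread TID with "top -H", then pin it, e.g. to CPU 7
  taskset -p -c 7 <vhost worker TID>
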
* Re: vhost-kernel net on pasta: from 26 to 37Gbit/s
From: Jason Wang @ 2025-05-21  0:57 UTC
To: Eugenio Perez Martin; +Cc: passt-dev, Jeff Nelson

On Tue, May 20, 2025 at 11:10 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> [...]
>
> My scenario: everything runs on different CPUs, all in the same NUMA node.
> I run the iperf3 server on CPU 11 with "iperf3 -A 11 -s". All odd CPUs are
> isolated with isolcpus=1,3,... nohz=on nohz_full=1,3,...
>
> With vanilla pasta isolated to CPUs 1,3 with taskset, just the --config-net
> option, and iperf3 run as "iperf3 -A 5 -c 10.6.68.254 -w 8M":
>
> [...]
>
> So... I have mixed feelings about this :). By "default" it seems to have
> less performance, but my test is maybe too synthetic. There is room for
> improvement with the mentioned optimizations, so I'd continue applying
> them, continuing with UDP and TCP zerocopy, and developing zerocopy vhost
> rx. With these numbers I think the series should not be merged at the
> moment. I could send it as an RFC if you want, but I've not applied the
> comments the first one received, POC style :).

Have you pinned pasta to a specific CPU? Note that vhost will inherit
the affinity, so there could be some contention if you do that.

Thanks

* Re: vhost-kernel net on pasta: from 26 to 37Gbit/s
From: Eugenio Perez Martin @ 2025-05-21  5:37 UTC
To: Jason Wang; +Cc: passt-dev, Jeff Nelson

On Wed, May 21, 2025 at 2:57 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Tue, May 20, 2025 at 11:10 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > [...]
>
> Have you pinned pasta to a specific CPU? Note that vhost will inherit
> the affinity, so there could be some contention if you do that.
>

Yes, pasta was pinned to 1,3 and vhost to 7.

* Re: vhost-kernel net on pasta: from 26 to 37Gbit/s
From: Stefano Brivio @ 2025-05-21 10:08 UTC
To: Eugenio Perez Martin; +Cc: passt-dev, Jason Wang, Jeff Nelson

On Tue, 20 May 2025 17:09:44 +0200
Eugenio Perez Martin <eperezma@redhat.com> wrote:

> [...]
>
> Now, if I isolate the vhost kernel thread [1], I get way more performance,
> as expected:
>
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  43.1 GBytes  37.1 Gbits/sec    0          sender
> [  5]   0.00-10.04  sec  43.1 GBytes  36.9 Gbits/sec               receiver
>
> Analyzing the perf output, rep_movs_alternative is the most called function
> in all three of iperf3 (~20% Self), passt.avx2 (~15% Self) and vhost
> (~15% Self).

Interesting... s/most called function/function using the most cycles/, I
suppose.

So it looks somewhat similar to

  https://archives.passt.top/passt-dev/20241017021027.2ac9ea53@elisabeth/

now?

> But I don't see any of them consuming 100% of a CPU in top: pasta consumes
> ~85% CPU, the iperf3 client and server each consume ~60%, and vhost
> consumes ~53%.
>
> So... I have mixed feelings about this :). By "default" it seems to have
> less performance, but my test is maybe too synthetic.

Well, surely we can't ask Podman users to pin specific stuff to given
CPU threads. :)

> There is room for improvement with the mentioned optimizations, so I'd
> continue applying them, continuing with UDP and TCP zerocopy, and
> developing zerocopy vhost rx.

That definitely makes sense to me.

> With these numbers I think the series should not be merged at the moment.
> I could send it as an RFC if you want, but I've not applied the comments
> the first one received, POC style :).

I don't think it's really needed for you to spend time on semi-polishing
something just to have an RFC if you're still working on it. I guess the
implementation will change substantially anyway once you factor in further
optimisations.

-- 
Stefano

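The "Self" percentages discussed above come from a perf profile. The exact
perf invocation is not given in the thread, so the following is only one
possible way to collect a comparable system-wide profile while the iperf3
run is in progress:

  perf record -a -g -- sleep 10    # sample all CPUs for the duration of the test
  perf report --no-children        # per-process, per-symbol "Self" cycle shares
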
* Re: vhost-kernel net on pasta: from 26 to 37Gbit/s
From: Eugenio Perez Martin @ 2025-05-21 10:35 UTC
To: Stefano Brivio; +Cc: passt-dev, Jason Wang, Jeff Nelson

On Wed, May 21, 2025 at 12:09 PM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Tue, 20 May 2025 17:09:44 +0200
> Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> > [...]
> >
> > Analyzing the perf output, rep_movs_alternative is the most called
> > function in all three of iperf3 (~20% Self), passt.avx2 (~15% Self) and
> > vhost (~15% Self).
>
> Interesting... s/most called function/function using the most cycles/, I
> suppose.

Right!

> So it looks somewhat similar to
>
>   https://archives.passt.top/passt-dev/20241017021027.2ac9ea53@elisabeth/
>
> now?

Kind of. Below tcp_sendmsg_locked I don't see sk_page_frag_refill but
skb_do_copy_data_nocache. Not sure if that means something, as it should
not be affected by vhost.

> > But I don't see any of them consuming 100% of a CPU in top: pasta
> > consumes ~85% CPU, the iperf3 client and server each consume ~60%, and
> > vhost consumes ~53%.
> >
> > So... I have mixed feelings about this :). By "default" it seems to have
> > less performance, but my test is maybe too synthetic.
>
> Well, surely we can't ask Podman users to pin specific stuff to given
> CPU threads. :)

Yes, but maybe the result changes under the right schedule? I'm isolating
the CPUs entirely, which is not the usual case for pasta, for sure :).

> > There is room for improvement with the mentioned optimizations, so I'd
> > continue applying them, continuing with UDP and TCP zerocopy, and
> > developing zerocopy vhost rx.
>
> That definitely makes sense to me.

Good!

> > With these numbers I think the series should not be merged at the moment.
> > I could send it as an RFC if you want, but I've not applied the comments
> > the first one received, POC style :).
>
> I don't think it's really needed for you to spend time on semi-polishing
> something just to have an RFC if you're still working on it. I guess the
> implementation will change substantially anyway once you factor in further
> optimisations.

Agree! I'll keep iterating on this then.

Thanks!

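The "UDP and TCP zerocopy" in the plan above usually refers to the kernel's
MSG_ZEROCOPY interface. The snippet below is only a minimal sketch of that
standard API, not code from the series: the helper name send_zc() and the
error handling are illustrative assumptions; SO_ZEROCOPY, MSG_ZEROCOPY and
the error-queue completion model are the stock kernel interface.

#include <errno.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY	60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY	0x4000000
#endif

/* Opt a connected TCP socket into zerocopy transmission and queue one
 * buffer. The kernel can transmit from the pinned pages instead of copying
 * (it may still fall back to a copy); completion notifications, which tell
 * the sender when 'buf' can be reused, arrive on the socket error queue
 * (MSG_ERRQUEUE, SO_EE_ORIGIN_ZEROCOPY) and must be reaped separately. */
static ssize_t send_zc(int s, const void *buf, size_t len)
{
	int one = 1;

	if (setsockopt(s, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
		return -errno;

	return send(s, buf, len, MSG_ZEROCOPY);
}
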
* Re: vhost-kernel net on pasta: from 26 to 37Gbit/s
From: Eugenio Perez Martin @ 2025-06-06 14:32 UTC
To: Stefano Brivio; +Cc: passt-dev, Jason Wang, Jeff Nelson

On Wed, May 21, 2025 at 12:35 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Wed, May 21, 2025 at 12:09 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > [...]
> >
> > I don't think it's really needed for you to spend time on semi-polishing
> > something just to have an RFC if you're still working on it. I guess the
> > implementation will change substantially anyway once you factor in
> > further optimisations.
>
> Agree! I'll keep iterating on this then.

Actually, if I remove all the taskset etc. and trust the kernel scheduler,
vanilla pasta gives me:

[pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
Connecting to host 10.6.68.254, port 5201
[  5] local 10.6.68.20 port 40408 connected to 10.6.68.254 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
[  5]   1.00-2.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
[  5]   2.00-3.00   sec  3.12 GBytes  26.8 Gbits/sec    0   25.4 MBytes
[  5]   3.00-4.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
[  5]   4.00-5.00   sec  3.10 GBytes  26.6 Gbits/sec    0   25.4 MBytes
[  5]   5.00-6.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
[  5]   6.00-7.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
[  5]   7.00-8.00   sec  3.09 GBytes  26.6 Gbits/sec    0   25.4 MBytes
[  5]   8.00-9.00   sec  3.08 GBytes  26.5 Gbits/sec    0   25.4 MBytes
[  5]   9.00-10.00  sec  3.10 GBytes  26.6 Gbits/sec    0   25.4 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  31.0 GBytes  26.7 Gbits/sec    0          sender
[  5]   0.00-10.04  sec  31.0 GBytes  26.5 Gbits/sec               receiver

And with vhost-net:

[pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
...
Connecting to host 10.6.68.254, port 5201
[  5] local 10.6.68.20 port 46720 connected to 10.6.68.254 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  4.17 GBytes  35.8 Gbits/sec    0   11.9 MBytes
[  5]   1.00-2.00   sec  4.17 GBytes  35.9 Gbits/sec    0   11.9 MBytes
[  5]   2.00-3.00   sec  4.16 GBytes  35.7 Gbits/sec    0   11.9 MBytes
[  5]   3.00-4.00   sec  4.14 GBytes  35.6 Gbits/sec    0   11.9 MBytes
[  5]   4.00-5.00   sec  4.16 GBytes  35.7 Gbits/sec    0   11.9 MBytes
[  5]   5.00-6.00   sec  4.16 GBytes  35.8 Gbits/sec    0   11.9 MBytes
[  5]   6.00-7.00   sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
[  5]   7.00-8.00   sec  4.19 GBytes  35.9 Gbits/sec    0   11.9 MBytes
[  5]   8.00-9.00   sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
[  5]   9.00-10.00  sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  41.7 GBytes  35.8 Gbits/sec    0          sender
[  5]   0.00-10.04  sec  41.7 GBytes  35.7 Gbits/sec               receiver

If I go the extra mile and disable notifications (it might be just noise,
but...):

[pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
...
Connecting to host 10.6.68.254, port 5201
[  5] local 10.6.68.20 port 56590 connected to 10.6.68.254 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  4.19 GBytes  36.0 Gbits/sec    0   12.4 MBytes
[  5]   1.00-2.00   sec  4.18 GBytes  35.9 Gbits/sec    0   12.4 MBytes
[  5]   2.00-3.00   sec  4.18 GBytes  35.9 Gbits/sec    0   12.4 MBytes
[  5]   3.00-4.00   sec  4.20 GBytes  36.1 Gbits/sec    0   12.4 MBytes
[  5]   4.00-5.00   sec  4.21 GBytes  36.2 Gbits/sec    0   12.4 MBytes
[  5]   5.00-6.00   sec  4.21 GBytes  36.1 Gbits/sec    0   12.4 MBytes
[  5]   6.00-7.00   sec  4.20 GBytes  36.1 Gbits/sec    0   12.4 MBytes
[  5]   7.00-8.00   sec  4.23 GBytes  36.4 Gbits/sec    0   12.4 MBytes
[  5]   8.00-9.00   sec  4.24 GBytes  36.4 Gbits/sec    0   12.4 MBytes
[  5]   9.00-10.00  sec  4.21 GBytes  36.2 Gbits/sec    0   12.4 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  42.1 GBytes  36.1 Gbits/sec    0          sender
[  5]   0.00-10.04  sec  42.1 GBytes  36.0 Gbits/sec               receiver

So I guess the best thing is to actually run performance tests closer to a
real-world workload against the new version and see if it works better?

Thanks!

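For context, "disable notifications" above refers, on the driver side, to
the standard virtio mechanism of flagging the available ring so that the
device (here, vhost) skips signalling used buffers while the driver is
polling. A minimal sketch of that mechanism using the Linux UAPI ring
layout follows; it illustrates the technique, it is not code from the
series, and the function names are made up:

#include <linux/virtio_ring.h>

/* pasta acts as the virtio driver here: set VRING_AVAIL_F_NO_INTERRUPT
 * while busy-polling the used ring so vhost does not signal the call
 * eventfd, and clear it again before going back to sleep on that fd. */
static void vq_poll_begin(struct vring *vr)
{
	vr->avail->flags |= VRING_AVAIL_F_NO_INTERRUPT;
}

static void vq_poll_end(struct vring *vr)
{
	vr->avail->flags &= ~VRING_AVAIL_F_NO_INTERRUPT;
	/* With VIRTIO_F_EVENT_IDX negotiated, the driver would update
	 * vring_used_event() instead of toggling this flag. */
}
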
* Re: vhost-kernel net on pasta: from 26 to 37Gbit/s
From: Stefano Brivio @ 2025-06-06 16:37 UTC
To: Eugenio Perez Martin; +Cc: passt-dev, Jason Wang, Jeff Nelson

On Fri, 6 Jun 2025 16:32:38 +0200
Eugenio Perez Martin <eperezma@redhat.com> wrote:

> [...]
>
> Actually, if I remove all the taskset etc. and trust the kernel scheduler,
> vanilla pasta gives me:
>
> [...]
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  31.0 GBytes  26.7 Gbits/sec    0          sender
> [  5]   0.00-10.04  sec  31.0 GBytes  26.5 Gbits/sec               receiver
>
> And with vhost-net:
>
> [...]
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  41.7 GBytes  35.8 Gbits/sec    0          sender
> [  5]   0.00-10.04  sec  41.7 GBytes  35.7 Gbits/sec               receiver
>
> If I go the extra mile and disable notifications (it might be just noise,
> but...):
>
> [...]
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  42.1 GBytes  36.1 Gbits/sec    0          sender
> [  5]   0.00-10.04  sec  42.1 GBytes  36.0 Gbits/sec               receiver
>
> So I guess the best thing is to actually run performance tests closer to a
> real-world workload against the new version and see if it works better?

Well, that's certainly a possibility.

I'd say the biggest value of vhost-net usage in pasta is reaching
throughput figures that are comparable with veth, with or without
multithreading (keeping an eye on bytes per cycle, of course), with or
without kernel changes, so that users won't need to choose between rootless
and performance anymore. It would also simplify things in Podman quite a
lot (and to some extent in rootlesskit / Docker as well). We're pretty much
there with virtual machines, just not quite with containers (which is
somewhat ironic, but of course there's a good reason for that).

If we're clearly wasting cycles in vhost-net (because of the bounce buffer,
plus something else perhaps?) *and* there's a somewhat possible solution
for that in sight *and* the interface would change anyway, running
throughput tests and polishing up the current version with a half-baked
solution at the moment sounds a bit wasteful to me.

But if one of those assumptions doesn't hold, or if you feel the need to
consolidate the current status, perhaps polishing up the current version
right now and actually evaluating throughput (as well as overhead) makes
sense to me, yes.

-- 
Stefano

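One possible way to put a number on the "bytes per cycle" point above is to
count cycles for the pasta process (and the vhost worker thread) over the
iperf3 run and divide the transferred bytes by that count. The commands
below are an assumption about how one might do it, including the process
name to match; they are not taken from the thread:

  perf stat -e cycles -p $(pgrep passt.avx2) -- sleep 10
  # then divide the bytes iperf3 reports for the same 10 s window by the
  # cycle count (repeat with the vhost worker TID for the kernel side)
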
Thread overview: 7+ messages (newest: 2025-06-06 16:37 UTC)

2025-05-20 15:09 vhost-kernel net on pasta: from 26 to 37Gbit/s  Eugenio Perez Martin
2025-05-21  0:57 ` Jason Wang
2025-05-21  5:37   ` Eugenio Perez Martin
2025-05-21 10:08 ` Stefano Brivio
2025-05-21 10:35   ` Eugenio Perez Martin
2025-06-06 14:32     ` Eugenio Perez Martin
2025-06-06 16:37       ` Stefano Brivio
Code repository associated with this list: https://passt.top/passt