vhost-kernel net on pasta: from 26 to 37Gbit/s

public inbox for passt-dev@passt.top
 help / color / mirror / code / Atom feed

* vhost-kernel net on pasta: from 26 to 37Gbit/s
@ 2025-05-20 15:09 Eugenio Perez Martin
  2025-05-21  0:57 ` Jason Wang
  2025-05-21 10:08 ` Stefano Brivio
  0 siblings, 2 replies; 11+ messages in thread
From: Eugenio Perez Martin @ 2025-05-20 15:09 UTC (permalink / raw)
  To: passt-dev; +Cc: Jason Wang, Jeff Nelson

Hi!

Some updates on the integration. The main culprit was to allow pasta
to keep reading packets using the regular read() on the tap device. I
thought that part was completely disabled, but I guess the kernel is
able to omit the notification on tap as long as the userspace does not
read it.

My scenario: in different CPUs, all in the same NUMA. I run iperf
server on CPU 11 with "iperf3 -A 11 -s". All odds CPUs are isolated
with isolcpus=1,3,... nohz=on nohz_full=1,3,...

With vanilla pasta isolated to CPUs 1,3 with taskset, and just
--config-net option, and running iperf with "iperf3 -A 5 -c
10.6.68.254 -w 8M":
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  30.7 GBytes  26.4 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  30.7 GBytes  26.3 Gbits/sec                  receiver

Now trying with the vhost patches we get a slightly worse performance:
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  25.5 GBytes  21.9 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  25.5 GBytes  21.8 Gbits/sec                  receiver

Now vhost patch still lacks optimizations like disabling notifications
or batch more rx available buffer notifications. At the moment it
refills the rx buffers in each iteration, and does not set the
no_notify bit which makes the kernel skip the used buffer
notifications if pasta is actively checking the queue, which is not
optimal.

Now if I isolate the vhost kernel thread [1] I get way more
performance as expected:
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  43.1 GBytes  37.1 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  43.1 GBytes  36.9 Gbits/sec                  receiver

After analyzing perf output, rep_movs_alternative is the most called
function in the three iperf3 (~20%Self), passt.avx2 (~15%Self) and
vhost (~15%Self), But I don't see any of them consuming 100% of CPU in
top: pasta consumes ~85% %CPU, both iperf3 client and server consumes
60%, and vhost consumes ~53%.

So... I have mixed feelings about this :). By "default" it seems to
have less performance, but my test is maybe too synthetic. There is
room for improvement with the mentioned optimizations so I'd continue
applying them, continuing with UDP and TCP zerocopy, and developing
zerocopy vhost rx. With these numbers I think the series should not be
merged at the moment. I could send it as RFC if you want but I've not
applied the comments the first one received, POC style :).

Thanks!

[1] Notes to reproduce it, I'm able to see it with top -H and then set
with taskset. Either the latest changes on the module or the way pasta
behaves does not allow me to see in classical ps output.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: vhost-kernel net on pasta: from 26 to 37Gbit/s
  2025-05-20 15:09 vhost-kernel net on pasta: from 26 to 37Gbit/s Eugenio Perez Martin
@ 2025-05-21  0:57 ` Jason Wang
  2025-05-21  5:37   ` Eugenio Perez Martin
  2025-05-21 10:08 ` Stefano Brivio
  1 sibling, 1 reply; 11+ messages in thread
From: Jason Wang @ 2025-05-21  0:57 UTC (permalink / raw)
  To: Eugenio Perez Martin; +Cc: passt-dev, Jeff Nelson

On Tue, May 20, 2025 at 11:10 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> Hi!
>
> Some updates on the integration. The main culprit was to allow pasta
> to keep reading packets using the regular read() on the tap device. I
> thought that part was completely disabled, but I guess the kernel is
> able to omit the notification on tap as long as the userspace does not
> read it.
>
> My scenario: in different CPUs, all in the same NUMA. I run iperf
> server on CPU 11 with "iperf3 -A 11 -s". All odds CPUs are isolated
> with isolcpus=1,3,... nohz=on nohz_full=1,3,...
>
> With vanilla pasta isolated to CPUs 1,3 with taskset, and just
> --config-net option, and running iperf with "iperf3 -A 5 -c
> 10.6.68.254 -w 8M":
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  30.7 GBytes  26.4 Gbits/sec    0             sender
> [  5]   0.00-10.04  sec  30.7 GBytes  26.3 Gbits/sec                  receiver
>
> Now trying with the vhost patches we get a slightly worse performance:
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  25.5 GBytes  21.9 Gbits/sec    0             sender
> [  5]   0.00-10.04  sec  25.5 GBytes  21.8 Gbits/sec                  receiver
>
> Now vhost patch still lacks optimizations like disabling notifications
> or batch more rx available buffer notifications. At the moment it
> refills the rx buffers in each iteration, and does not set the
> no_notify bit which makes the kernel skip the used buffer
> notifications if pasta is actively checking the queue, which is not
> optimal.
>
> Now if I isolate the vhost kernel thread [1] I get way more
> performance as expected:
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  43.1 GBytes  37.1 Gbits/sec    0             sender
> [  5]   0.00-10.04  sec  43.1 GBytes  36.9 Gbits/sec                  receiver
>
> After analyzing perf output, rep_movs_alternative is the most called
> function in the three iperf3 (~20%Self), passt.avx2 (~15%Self) and
> vhost (~15%Self), But I don't see any of them consuming 100% of CPU in
> top: pasta consumes ~85% %CPU, both iperf3 client and server consumes
> 60%, and vhost consumes ~53%.
>
> So... I have mixed feelings about this :). By "default" it seems to
> have less performance, but my test is maybe too synthetic. There is
> room for improvement with the mentioned optimizations so I'd continue
> applying them, continuing with UDP and TCP zerocopy, and developing
> zerocopy vhost rx. With these numbers I think the series should not be
> merged at the moment. I could send it as RFC if you want but I've not
> applied the comments the first one received, POC style :).

Have you pinned pasta in a specific CPU? Note that vhost will inherit
the affinity so there could be some contention if you do that.

Thanks

>
> Thanks!
>
> [1] Notes to reproduce it, I'm able to see it with top -H and then set
> with taskset. Either the latest changes on the module or the way pasta
> behaves does not allow me to see in classical ps output.
>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: vhost-kernel net on pasta: from 26 to 37Gbit/s
  2025-05-21  0:57 ` Jason Wang
@ 2025-05-21  5:37   ` Eugenio Perez Martin
  0 siblings, 0 replies; 11+ messages in thread
From: Eugenio Perez Martin @ 2025-05-21  5:37 UTC (permalink / raw)
  To: Jason Wang; +Cc: passt-dev, Jeff Nelson

On Wed, May 21, 2025 at 2:57 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Tue, May 20, 2025 at 11:10 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > Hi!
> >
> > Some updates on the integration. The main culprit was to allow pasta
> > to keep reading packets using the regular read() on the tap device. I
> > thought that part was completely disabled, but I guess the kernel is
> > able to omit the notification on tap as long as the userspace does not
> > read it.
> >
> > My scenario: in different CPUs, all in the same NUMA. I run iperf
> > server on CPU 11 with "iperf3 -A 11 -s". All odds CPUs are isolated
> > with isolcpus=1,3,... nohz=on nohz_full=1,3,...
> >
> > With vanilla pasta isolated to CPUs 1,3 with taskset, and just
> > --config-net option, and running iperf with "iperf3 -A 5 -c
> > 10.6.68.254 -w 8M":
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  30.7 GBytes  26.4 Gbits/sec    0             sender
> > [  5]   0.00-10.04  sec  30.7 GBytes  26.3 Gbits/sec                  receiver
> >
> > Now trying with the vhost patches we get a slightly worse performance:
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  25.5 GBytes  21.9 Gbits/sec    0             sender
> > [  5]   0.00-10.04  sec  25.5 GBytes  21.8 Gbits/sec                  receiver
> >
> > Now vhost patch still lacks optimizations like disabling notifications
> > or batch more rx available buffer notifications. At the moment it
> > refills the rx buffers in each iteration, and does not set the
> > no_notify bit which makes the kernel skip the used buffer
> > notifications if pasta is actively checking the queue, which is not
> > optimal.
> >
> > Now if I isolate the vhost kernel thread [1] I get way more
> > performance as expected:
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  43.1 GBytes  37.1 Gbits/sec    0             sender
> > [  5]   0.00-10.04  sec  43.1 GBytes  36.9 Gbits/sec                  receiver
> >
> > After analyzing perf output, rep_movs_alternative is the most called
> > function in the three iperf3 (~20%Self), passt.avx2 (~15%Self) and
> > vhost (~15%Self), But I don't see any of them consuming 100% of CPU in
> > top: pasta consumes ~85% %CPU, both iperf3 client and server consumes
> > 60%, and vhost consumes ~53%.
> >
> > So... I have mixed feelings about this :). By "default" it seems to
> > have less performance, but my test is maybe too synthetic. There is
> > room for improvement with the mentioned optimizations so I'd continue
> > applying them, continuing with UDP and TCP zerocopy, and developing
> > zerocopy vhost rx. With these numbers I think the series should not be
> > merged at the moment. I could send it as RFC if you want but I've not
> > applied the comments the first one received, POC style :).
>
> Have you pinned pasta in a specific CPU? Note that vhost will inherit
> the affinity so there could be some contention if you do that.
>

Yes, pasta was pinned to 1,3 and vhost to 7.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: vhost-kernel net on pasta: from 26 to 37Gbit/s
  2025-05-20 15:09 vhost-kernel net on pasta: from 26 to 37Gbit/s Eugenio Perez Martin
  2025-05-21  0:57 ` Jason Wang
@ 2025-05-21 10:08 ` Stefano Brivio
  2025-05-21 10:35   ` Eugenio Perez Martin
  1 sibling, 1 reply; 11+ messages in thread
From: Stefano Brivio @ 2025-05-21 10:08 UTC (permalink / raw)
  To: Eugenio Perez Martin; +Cc: passt-dev, Jason Wang, Jeff Nelson

On Tue, 20 May 2025 17:09:44 +0200
Eugenio Perez Martin <eperezma@redhat.com> wrote:

> [...]
>
> Now if I isolate the vhost kernel thread [1] I get way more
> performance as expected:
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  43.1 GBytes  37.1 Gbits/sec    0             sender
> [  5]   0.00-10.04  sec  43.1 GBytes  36.9 Gbits/sec                  receiver
> 
> After analyzing perf output, rep_movs_alternative is the most called
> function in the three iperf3 (~20%Self), passt.avx2 (~15%Self) and
> vhost (~15%Self)

Interesting... s/most called function/function using the most cycles/, I
suppose.

So it looks somewhat similar to

  https://archives.passt.top/passt-dev/20241017021027.2ac9ea53@elisabeth/

now?

> But I don't see any of them consuming 100% of CPU in
> top: pasta consumes ~85% %CPU, both iperf3 client and server consumes
> 60%, and vhost consumes ~53%.
> 
> So... I have mixed feelings about this :). By "default" it seems to
> have less performance, but my test is maybe too synthetic.

Well, surely we can't ask Podman users to pin specific stuff to given
CPU threads. :)

> There is room for improvement with the mentioned optimizations so I'd
> continue applying them, continuing with UDP and TCP zerocopy, and
> developing zerocopy vhost rx.

That definitely makes sense to me.

> With these numbers I think the series should not be
> merged at the moment. I could send it as RFC if you want but I've not
> applied the comments the first one received, POC style :).

I don't think it's really needed for you to spend time on
semi-polishing something just to have an RFC if you're still working on
it. I guess the implementation will change substantially anyway once
you factor in further optimisations.

-- 
Stefano


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: vhost-kernel net on pasta: from 26 to 37Gbit/s
  2025-05-21 10:08 ` Stefano Brivio
@ 2025-05-21 10:35   ` Eugenio Perez Martin
  2025-06-06 14:32     ` Eugenio Perez Martin
  0 siblings, 1 reply; 11+ messages in thread
From: Eugenio Perez Martin @ 2025-05-21 10:35 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Jason Wang, Jeff Nelson

On Wed, May 21, 2025 at 12:09 PM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Tue, 20 May 2025 17:09:44 +0200
> Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> > [...]
> >
> > Now if I isolate the vhost kernel thread [1] I get way more
> > performance as expected:
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  43.1 GBytes  37.1 Gbits/sec    0             sender
> > [  5]   0.00-10.04  sec  43.1 GBytes  36.9 Gbits/sec                  receiver
> >
> > After analyzing perf output, rep_movs_alternative is the most called
> > function in the three iperf3 (~20%Self), passt.avx2 (~15%Self) and
> > vhost (~15%Self)
>
> Interesting... s/most called function/function using the most cycles/, I
> suppose.
>

Right!

> So it looks somewhat similar to
>
>   https://archives.passt.top/passt-dev/20241017021027.2ac9ea53@elisabeth/
>
> now?
>

Kind of. Below tcp_sendmsg_locked I don't see sk_page_frag_refill but
skb_do_copy_data_nocache. Not sure if that means something, as it
should not be affected by vhost.

> > But I don't see any of them consuming 100% of CPU in
> > top: pasta consumes ~85% %CPU, both iperf3 client and server consumes
> > 60%, and vhost consumes ~53%.
> >
> > So... I have mixed feelings about this :). By "default" it seems to
> > have less performance, but my test is maybe too synthetic.
>
> Well, surely we can't ask Podman users to pin specific stuff to given
> CPU threads. :)
>

Yes but maybe the result changes under the right schedule? I'm
isolating the CPUs entirely, which is not the usual case for pasta for
sure :).

> > There is room for improvement with the mentioned optimizations so I'd
> > continue applying them, continuing with UDP and TCP zerocopy, and
> > developing zerocopy vhost rx.
>
> That definitely makes sense to me.
>

Good!

> > With these numbers I think the series should not be
> > merged at the moment. I could send it as RFC if you want but I've not
> > applied the comments the first one received, POC style :).
>
> I don't think it's really needed for you to spend time on
> semi-polishing something just to have an RFC if you're still working on
> it. I guess the implementation will change substantially anyway once
> you factor in further optimisations.
>

Agree! I'll keep iterating on this then.

Thanks!


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: vhost-kernel net on pasta: from 26 to 37Gbit/s
  2025-05-21 10:35   ` Eugenio Perez Martin
@ 2025-06-06 14:32     ` Eugenio Perez Martin
  2025-06-06 16:37       ` Stefano Brivio
  0 siblings, 1 reply; 11+ messages in thread
From: Eugenio Perez Martin @ 2025-06-06 14:32 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Jason Wang, Jeff Nelson

On Wed, May 21, 2025 at 12:35 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Wed, May 21, 2025 at 12:09 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > On Tue, 20 May 2025 17:09:44 +0200
> > Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >
> > > [...]
> > >
> > > Now if I isolate the vhost kernel thread [1] I get way more
> > > performance as expected:
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [  5]   0.00-10.00  sec  43.1 GBytes  37.1 Gbits/sec    0             sender
> > > [  5]   0.00-10.04  sec  43.1 GBytes  36.9 Gbits/sec                  receiver
> > >
> > > After analyzing perf output, rep_movs_alternative is the most called
> > > function in the three iperf3 (~20%Self), passt.avx2 (~15%Self) and
> > > vhost (~15%Self)
> >
> > Interesting... s/most called function/function using the most cycles/, I
> > suppose.
> >
>
> Right!
>
> > So it looks somewhat similar to
> >
> >   https://archives.passt.top/passt-dev/20241017021027.2ac9ea53@elisabeth/
> >
> > now?
> >
>
> Kind of. Below tcp_sendmsg_locked I don't see sk_page_frag_refill but
> skb_do_copy_data_nocache. Not sure if that means something, as it
> should not be affected by vhost.
>
> > > But I don't see any of them consuming 100% of CPU in
> > > top: pasta consumes ~85% %CPU, both iperf3 client and server consumes
> > > 60%, and vhost consumes ~53%.
> > >
> > > So... I have mixed feelings about this :). By "default" it seems to
> > > have less performance, but my test is maybe too synthetic.
> >
> > Well, surely we can't ask Podman users to pin specific stuff to given
> > CPU threads. :)
> >
>
> Yes but maybe the result changes under the right schedule? I'm
> isolating the CPUs entirely, which is not the usual case for pasta for
> sure :).
>
> > > There is room for improvement with the mentioned optimizations so I'd
> > > continue applying them, continuing with UDP and TCP zerocopy, and
> > > developing zerocopy vhost rx.
> >
> > That definitely makes sense to me.
> >
>
> Good!
>
> > > With these numbers I think the series should not be
> > > merged at the moment. I could send it as RFC if you want but I've not
> > > applied the comments the first one received, POC style :).
> >
> > I don't think it's really needed for you to spend time on
> > semi-polishing something just to have an RFC if you're still working on
> > it. I guess the implementation will change substantially anyway once
> > you factor in further optimisations.
> >
>
> Agree! I'll keep iterating on this then.
>

Actually, if I remove all the taskset etc, and trust the kernel
scheduler, vanilla pasta gives me:
[pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
Connecting to host 10.6.68.254, port 5201
[  5] local 10.6.68.20 port 40408 connected to 10.6.68.254 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
[  5]   1.00-2.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
[  5]   2.00-3.00   sec  3.12 GBytes  26.8 Gbits/sec    0   25.4 MBytes
[  5]   3.00-4.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
[  5]   4.00-5.00   sec  3.10 GBytes  26.6 Gbits/sec    0   25.4 MBytes
[  5]   5.00-6.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
[  5]   6.00-7.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
[  5]   7.00-8.00   sec  3.09 GBytes  26.6 Gbits/sec    0   25.4 MBytes
[  5]   8.00-9.00   sec  3.08 GBytes  26.5 Gbits/sec    0   25.4 MBytes
[  5]   9.00-10.00  sec  3.10 GBytes  26.6 Gbits/sec    0   25.4 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  31.0 GBytes  26.7 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  31.0 GBytes  26.5 Gbits/sec                  receiver

And with vhost-net :
[pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
...
Connecting to host 10.6.68.254, port 5201
[  5] local 10.6.68.20 port 46720 connected to 10.6.68.254 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  4.17 GBytes  35.8 Gbits/sec    0   11.9 MBytes
[  5]   1.00-2.00   sec  4.17 GBytes  35.9 Gbits/sec    0   11.9 MBytes
[  5]   2.00-3.00   sec  4.16 GBytes  35.7 Gbits/sec    0   11.9 MBytes
[  5]   3.00-4.00   sec  4.14 GBytes  35.6 Gbits/sec    0   11.9 MBytes
[  5]   4.00-5.00   sec  4.16 GBytes  35.7 Gbits/sec    0   11.9 MBytes
[  5]   5.00-6.00   sec  4.16 GBytes  35.8 Gbits/sec    0   11.9 MBytes
[  5]   6.00-7.00   sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
[  5]   7.00-8.00   sec  4.19 GBytes  35.9 Gbits/sec    0   11.9 MBytes
[  5]   8.00-9.00   sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
[  5]   9.00-10.00  sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  41.7 GBytes  35.8 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  41.7 GBytes  35.7 Gbits/sec                  receiver

If I  go the extra mile and disable notifications (it might be just
noise, but...)
[pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
...
Connecting to host 10.6.68.254, port 5201
[  5] local 10.6.68.20 port 56590 connected to 10.6.68.254 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  4.19 GBytes  36.0 Gbits/sec    0   12.4 MBytes
[  5]   1.00-2.00   sec  4.18 GBytes  35.9 Gbits/sec    0   12.4 MBytes
[  5]   2.00-3.00   sec  4.18 GBytes  35.9 Gbits/sec    0   12.4 MBytes
[  5]   3.00-4.00   sec  4.20 GBytes  36.1 Gbits/sec    0   12.4 MBytes
[  5]   4.00-5.00   sec  4.21 GBytes  36.2 Gbits/sec    0   12.4 MBytes
[  5]   5.00-6.00   sec  4.21 GBytes  36.1 Gbits/sec    0   12.4 MBytes
[  5]   6.00-7.00   sec  4.20 GBytes  36.1 Gbits/sec    0   12.4 MBytes
[  5]   7.00-8.00   sec  4.23 GBytes  36.4 Gbits/sec    0   12.4 MBytes
[  5]   8.00-9.00   sec  4.24 GBytes  36.4 Gbits/sec    0   12.4 MBytes
[  5]   9.00-10.00  sec  4.21 GBytes  36.2 Gbits/sec    0   12.4 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  42.1 GBytes  36.1 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  42.1 GBytes  36.0 Gbits/sec                  receiver

So I guess the best is to actually run performance tests closer to
real-world workload against the new version and see if it works
better?

Thanks!


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: vhost-kernel net on pasta: from 26 to 37Gbit/s
  2025-06-06 14:32     ` Eugenio Perez Martin
@ 2025-06-06 16:37       ` Stefano Brivio
  2025-06-09  9:59         ` Eugenio Perez Martin
  0 siblings, 1 reply; 11+ messages in thread
From: Stefano Brivio @ 2025-06-06 16:37 UTC (permalink / raw)
  To: Eugenio Perez Martin; +Cc: passt-dev, Jason Wang, Jeff Nelson

On Fri, 6 Jun 2025 16:32:38 +0200
Eugenio Perez Martin <eperezma@redhat.com> wrote:

> On Wed, May 21, 2025 at 12:35 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Wed, May 21, 2025 at 12:09 PM Stefano Brivio <sbrivio@redhat.com> wrote:  
> > >
> > > On Tue, 20 May 2025 17:09:44 +0200
> > > Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > >  
> > > > [...]
> > > >
> > > > Now if I isolate the vhost kernel thread [1] I get way more
> > > > performance as expected:
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec  43.1 GBytes  37.1 Gbits/sec    0             sender
> > > > [  5]   0.00-10.04  sec  43.1 GBytes  36.9 Gbits/sec                  receiver
> > > >
> > > > After analyzing perf output, rep_movs_alternative is the most called
> > > > function in the three iperf3 (~20%Self), passt.avx2 (~15%Self) and
> > > > vhost (~15%Self)  
> > >
> > > Interesting... s/most called function/function using the most cycles/, I
> > > suppose.
> > >  
> >
> > Right!
> >  
> > > So it looks somewhat similar to
> > >
> > >   https://archives.passt.top/passt-dev/20241017021027.2ac9ea53@elisabeth/
> > >
> > > now?
> > >  
> >
> > Kind of. Below tcp_sendmsg_locked I don't see sk_page_frag_refill but
> > skb_do_copy_data_nocache. Not sure if that means something, as it
> > should not be affected by vhost.
> >  
> > > > But I don't see any of them consuming 100% of CPU in
> > > > top: pasta consumes ~85% %CPU, both iperf3 client and server consumes
> > > > 60%, and vhost consumes ~53%.
> > > >
> > > > So... I have mixed feelings about this :). By "default" it seems to
> > > > have less performance, but my test is maybe too synthetic.  
> > >
> > > Well, surely we can't ask Podman users to pin specific stuff to given
> > > CPU threads. :)
> > >  
> >
> > Yes but maybe the result changes under the right schedule? I'm
> > isolating the CPUs entirely, which is not the usual case for pasta for
> > sure :).
> >  
> > > > There is room for improvement with the mentioned optimizations so I'd
> > > > continue applying them, continuing with UDP and TCP zerocopy, and
> > > > developing zerocopy vhost rx.  
> > >
> > > That definitely makes sense to me.
> > >  
> >
> > Good!
> >  
> > > > With these numbers I think the series should not be
> > > > merged at the moment. I could send it as RFC if you want but I've not
> > > > applied the comments the first one received, POC style :).  
> > >
> > > I don't think it's really needed for you to spend time on
> > > semi-polishing something just to have an RFC if you're still working on
> > > it. I guess the implementation will change substantially anyway once
> > > you factor in further optimisations.
> > >  
> >
> > Agree! I'll keep iterating on this then.
> >  
> 
> Actually, if I remove all the taskset etc, and trust the kernel
> scheduler, vanilla pasta gives me:
> [pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
> Connecting to host 10.6.68.254, port 5201
> [  5] local 10.6.68.20 port 40408 connected to 10.6.68.254 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> [  5]   1.00-2.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> [  5]   2.00-3.00   sec  3.12 GBytes  26.8 Gbits/sec    0   25.4 MBytes
> [  5]   3.00-4.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> [  5]   4.00-5.00   sec  3.10 GBytes  26.6 Gbits/sec    0   25.4 MBytes
> [  5]   5.00-6.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> [  5]   6.00-7.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> [  5]   7.00-8.00   sec  3.09 GBytes  26.6 Gbits/sec    0   25.4 MBytes
> [  5]   8.00-9.00   sec  3.08 GBytes  26.5 Gbits/sec    0   25.4 MBytes
> [  5]   9.00-10.00  sec  3.10 GBytes  26.6 Gbits/sec    0   25.4 MBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  31.0 GBytes  26.7 Gbits/sec    0             sender
> [  5]   0.00-10.04  sec  31.0 GBytes  26.5 Gbits/sec                  receiver
> 
> And with vhost-net :
> [pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
> ...
> Connecting to host 10.6.68.254, port 5201
> [  5] local 10.6.68.20 port 46720 connected to 10.6.68.254 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  4.17 GBytes  35.8 Gbits/sec    0   11.9 MBytes
> [  5]   1.00-2.00   sec  4.17 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> [  5]   2.00-3.00   sec  4.16 GBytes  35.7 Gbits/sec    0   11.9 MBytes
> [  5]   3.00-4.00   sec  4.14 GBytes  35.6 Gbits/sec    0   11.9 MBytes
> [  5]   4.00-5.00   sec  4.16 GBytes  35.7 Gbits/sec    0   11.9 MBytes
> [  5]   5.00-6.00   sec  4.16 GBytes  35.8 Gbits/sec    0   11.9 MBytes
> [  5]   6.00-7.00   sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> [  5]   7.00-8.00   sec  4.19 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> [  5]   8.00-9.00   sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> [  5]   9.00-10.00  sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  41.7 GBytes  35.8 Gbits/sec    0             sender
> [  5]   0.00-10.04  sec  41.7 GBytes  35.7 Gbits/sec                  receiver
> 
> If I  go the extra mile and disable notifications (it might be just
> noise, but...)
> [pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
> ...
> Connecting to host 10.6.68.254, port 5201
> [  5] local 10.6.68.20 port 56590 connected to 10.6.68.254 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  4.19 GBytes  36.0 Gbits/sec    0   12.4 MBytes
> [  5]   1.00-2.00   sec  4.18 GBytes  35.9 Gbits/sec    0   12.4 MBytes
> [  5]   2.00-3.00   sec  4.18 GBytes  35.9 Gbits/sec    0   12.4 MBytes
> [  5]   3.00-4.00   sec  4.20 GBytes  36.1 Gbits/sec    0   12.4 MBytes
> [  5]   4.00-5.00   sec  4.21 GBytes  36.2 Gbits/sec    0   12.4 MBytes
> [  5]   5.00-6.00   sec  4.21 GBytes  36.1 Gbits/sec    0   12.4 MBytes
> [  5]   6.00-7.00   sec  4.20 GBytes  36.1 Gbits/sec    0   12.4 MBytes
> [  5]   7.00-8.00   sec  4.23 GBytes  36.4 Gbits/sec    0   12.4 MBytes
> [  5]   8.00-9.00   sec  4.24 GBytes  36.4 Gbits/sec    0   12.4 MBytes
> [  5]   9.00-10.00  sec  4.21 GBytes  36.2 Gbits/sec    0   12.4 MBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec  42.1 GBytes  36.1 Gbits/sec    0             sender
> [  5]   0.00-10.04  sec  42.1 GBytes  36.0 Gbits/sec                  receiver
> 
> So I guess the best is to actually run performance tests closer to
> real-world workload against the new version and see if it works
> better?

Well, that's certainly a possibility.

I'd say the biggest value for vhost-net usage in pasta is reaching
throughput figures that are comparable with veth, with or without
multithreading (keeping an eye on bytes per cycle, of course), with or
without kernel changes, so that users won't need to choose between
rootless and performance anymore.

It would also simplify things in Podman quite a lot (and to some extent
in rootlesskit / Docker as well). We're pretty much there with virtual
machines, just not quite with containers (which is somewhat ironic, but
of course there's a good reason for that).

If we're clearly wasting cycles in vhost-net (because of the bounce
buffer, plus something else perhaps?) *and* there's a somewhat possible
solution for that in sight *and* the interface would change anyway,
running throughput tests and polishing up the current version with a
half-baked solution at the moment sounds a bit wasteful to me.

But if one of those assumptions doesn't hold, or if you feel the need to
consolidate the current status, perhaps polishing up the current
version right now and actually evaluating throughput (as well as
overhead) makes sense to me, yes.

-- 
Stefano


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: vhost-kernel net on pasta: from 26 to 37Gbit/s
  2025-06-06 16:37       ` Stefano Brivio
@ 2025-06-09  9:59         ` Eugenio Perez Martin
  2025-06-10 15:29           ` Stefano Brivio
  0 siblings, 1 reply; 11+ messages in thread
From: Eugenio Perez Martin @ 2025-06-09  9:59 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Jason Wang, Jeff Nelson

On Fri, Jun 6, 2025 at 6:37 PM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Fri, 6 Jun 2025 16:32:38 +0200
> Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> > On Wed, May 21, 2025 at 12:35 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Wed, May 21, 2025 at 12:09 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > >
> > > > On Tue, 20 May 2025 17:09:44 +0200
> > > > Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > >
> > > > > [...]
> > > > >
> > > > > Now if I isolate the vhost kernel thread [1] I get way more
> > > > > performance as expected:
> > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > [  5]   0.00-10.00  sec  43.1 GBytes  37.1 Gbits/sec    0             sender
> > > > > [  5]   0.00-10.04  sec  43.1 GBytes  36.9 Gbits/sec                  receiver
> > > > >
> > > > > After analyzing perf output, rep_movs_alternative is the most called
> > > > > function in the three iperf3 (~20%Self), passt.avx2 (~15%Self) and
> > > > > vhost (~15%Self)
> > > >
> > > > Interesting... s/most called function/function using the most cycles/, I
> > > > suppose.
> > > >
> > >
> > > Right!
> > >
> > > > So it looks somewhat similar to
> > > >
> > > >   https://archives.passt.top/passt-dev/20241017021027.2ac9ea53@elisabeth/
> > > >
> > > > now?
> > > >
> > >
> > > Kind of. Below tcp_sendmsg_locked I don't see sk_page_frag_refill but
> > > skb_do_copy_data_nocache. Not sure if that means something, as it
> > > should not be affected by vhost.
> > >
> > > > > But I don't see any of them consuming 100% of CPU in
> > > > > top: pasta consumes ~85% %CPU, both iperf3 client and server consumes
> > > > > 60%, and vhost consumes ~53%.
> > > > >
> > > > > So... I have mixed feelings about this :). By "default" it seems to
> > > > > have less performance, but my test is maybe too synthetic.
> > > >
> > > > Well, surely we can't ask Podman users to pin specific stuff to given
> > > > CPU threads. :)
> > > >
> > >
> > > Yes but maybe the result changes under the right schedule? I'm
> > > isolating the CPUs entirely, which is not the usual case for pasta for
> > > sure :).
> > >
> > > > > There is room for improvement with the mentioned optimizations so I'd
> > > > > continue applying them, continuing with UDP and TCP zerocopy, and
> > > > > developing zerocopy vhost rx.
> > > >
> > > > That definitely makes sense to me.
> > > >
> > >
> > > Good!
> > >
> > > > > With these numbers I think the series should not be
> > > > > merged at the moment. I could send it as RFC if you want but I've not
> > > > > applied the comments the first one received, POC style :).
> > > >
> > > > I don't think it's really needed for you to spend time on
> > > > semi-polishing something just to have an RFC if you're still working on
> > > > it. I guess the implementation will change substantially anyway once
> > > > you factor in further optimisations.
> > > >
> > >
> > > Agree! I'll keep iterating on this then.
> > >
> >
> > Actually, if I remove all the taskset etc, and trust the kernel
> > scheduler, vanilla pasta gives me:
> > [pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
> > Connecting to host 10.6.68.254, port 5201
> > [  5] local 10.6.68.20 port 40408 connected to 10.6.68.254 port 5201
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> > [  5]   1.00-2.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> > [  5]   2.00-3.00   sec  3.12 GBytes  26.8 Gbits/sec    0   25.4 MBytes
> > [  5]   3.00-4.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> > [  5]   4.00-5.00   sec  3.10 GBytes  26.6 Gbits/sec    0   25.4 MBytes
> > [  5]   5.00-6.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> > [  5]   6.00-7.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> > [  5]   7.00-8.00   sec  3.09 GBytes  26.6 Gbits/sec    0   25.4 MBytes
> > [  5]   8.00-9.00   sec  3.08 GBytes  26.5 Gbits/sec    0   25.4 MBytes
> > [  5]   9.00-10.00  sec  3.10 GBytes  26.6 Gbits/sec    0   25.4 MBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  31.0 GBytes  26.7 Gbits/sec    0             sender
> > [  5]   0.00-10.04  sec  31.0 GBytes  26.5 Gbits/sec                  receiver
> >
> > And with vhost-net :
> > [pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
> > ...
> > Connecting to host 10.6.68.254, port 5201
> > [  5] local 10.6.68.20 port 46720 connected to 10.6.68.254 port 5201
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec  4.17 GBytes  35.8 Gbits/sec    0   11.9 MBytes
> > [  5]   1.00-2.00   sec  4.17 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> > [  5]   2.00-3.00   sec  4.16 GBytes  35.7 Gbits/sec    0   11.9 MBytes
> > [  5]   3.00-4.00   sec  4.14 GBytes  35.6 Gbits/sec    0   11.9 MBytes
> > [  5]   4.00-5.00   sec  4.16 GBytes  35.7 Gbits/sec    0   11.9 MBytes
> > [  5]   5.00-6.00   sec  4.16 GBytes  35.8 Gbits/sec    0   11.9 MBytes
> > [  5]   6.00-7.00   sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> > [  5]   7.00-8.00   sec  4.19 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> > [  5]   8.00-9.00   sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> > [  5]   9.00-10.00  sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  41.7 GBytes  35.8 Gbits/sec    0             sender
> > [  5]   0.00-10.04  sec  41.7 GBytes  35.7 Gbits/sec                  receiver
> >
> > If I  go the extra mile and disable notifications (it might be just
> > noise, but...)
> > [pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
> > ...
> > Connecting to host 10.6.68.254, port 5201
> > [  5] local 10.6.68.20 port 56590 connected to 10.6.68.254 port 5201
> > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > [  5]   0.00-1.00   sec  4.19 GBytes  36.0 Gbits/sec    0   12.4 MBytes
> > [  5]   1.00-2.00   sec  4.18 GBytes  35.9 Gbits/sec    0   12.4 MBytes
> > [  5]   2.00-3.00   sec  4.18 GBytes  35.9 Gbits/sec    0   12.4 MBytes
> > [  5]   3.00-4.00   sec  4.20 GBytes  36.1 Gbits/sec    0   12.4 MBytes
> > [  5]   4.00-5.00   sec  4.21 GBytes  36.2 Gbits/sec    0   12.4 MBytes
> > [  5]   5.00-6.00   sec  4.21 GBytes  36.1 Gbits/sec    0   12.4 MBytes
> > [  5]   6.00-7.00   sec  4.20 GBytes  36.1 Gbits/sec    0   12.4 MBytes
> > [  5]   7.00-8.00   sec  4.23 GBytes  36.4 Gbits/sec    0   12.4 MBytes
> > [  5]   8.00-9.00   sec  4.24 GBytes  36.4 Gbits/sec    0   12.4 MBytes
> > [  5]   9.00-10.00  sec  4.21 GBytes  36.2 Gbits/sec    0   12.4 MBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  42.1 GBytes  36.1 Gbits/sec    0             sender
> > [  5]   0.00-10.04  sec  42.1 GBytes  36.0 Gbits/sec                  receiver
> >
> > So I guess the best is to actually run performance tests closer to
> > real-world workload against the new version and see if it works
> > better?
>
> Well, that's certainly a possibility.
>
> I'd say the biggest value for vhost-net usage in pasta is reaching
> throughput figures that are comparable with veth, with or without
> multithreading (keeping an eye on bytes per cycle, of course), with or
> without kernel changes, so that users won't need to choose between
> rootless and performance anymore.
>
> It would also simplify things in Podman quite a lot (and to some extent
> in rootlesskit / Docker as well). We're pretty much there with virtual
> machines, just not quite with containers (which is somewhat ironic, but
> of course there's a good reason for that).
>
> If we're clearly wasting cycles in vhost-net (because of the bounce
> buffer, plus something else perhaps?) *and* there's a somewhat possible
> solution for that in sight *and* the interface would change anyway,
> running throughput tests and polishing up the current version with a
> half-baked solution at the moment sounds a bit wasteful to me.
>

My point is that I'm testing a very synthetic scenario. If everybody
agree this is close enough to real world ones, I'm ok to continue
improving the edges we see. If not, maybe we're picking the wrong
fruit even if it is low hand?

Getting a table like [1] would give us light about this, especially if
it is just a matter of running "make performance" or similar. Maybe we
need to include longer queues? Focus on a given scenario? UDP goes
better but TCP?

Now more points about this scenario:
1) I don't see 100% CPU usage in any element:
CPU%
84.2  passt.avx2
57.9 iperf3
57.2 iperf3
50.7 vhost-1805109

2) The most used (Self%) function in vhost is rep_movs_alternative,
called from skb_copy_datagram_iter, so yes, ZeroCopy should help a lot
here.

Now, is "iperf3 -w 8M" representative? I'm sure ZC helps in this
scenario, does it make it worse if we have small packets? Do we care?

I'm totally ok with continuing trying with ZC, I just want to make
sure we're not missing anything :).

Thanks!

[1] https://passt.top/passt/about/#performance_1

More details

> But if one of those assumptions doesn't hold, or if you feel the need to
> consolidate the current status, perhaps polishing up the current
> version right now and actually evaluating throughput (as well as
> overhead) makes sense to me, yes.
>
> --
> Stefano
>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: vhost-kernel net on pasta: from 26 to 37Gbit/s
  2025-06-09  9:59         ` Eugenio Perez Martin
@ 2025-06-10 15:29           ` Stefano Brivio
  2025-06-11  7:04             ` Eugenio Perez Martin
  0 siblings, 1 reply; 11+ messages in thread
From: Stefano Brivio @ 2025-06-10 15:29 UTC (permalink / raw)
  To: Eugenio Perez Martin; +Cc: passt-dev, Jason Wang, Jeff Nelson, Paul Holzinger

[Adding Paul as Podman developer]

On Mon, 9 Jun 2025 11:59:21 +0200
Eugenio Perez Martin <eperezma@redhat.com> wrote:

> On Fri, Jun 6, 2025 at 6:37 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > On Fri, 6 Jun 2025 16:32:38 +0200
> > Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >  
> > > On Wed, May 21, 2025 at 12:35 PM Eugenio Perez Martin
> > > <eperezma@redhat.com> wrote:  
> > > >
> > > > On Wed, May 21, 2025 at 12:09 PM Stefano Brivio <sbrivio@redhat.com> wrote:  
> > > > >
> > > > > On Tue, 20 May 2025 17:09:44 +0200
> > > > > Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > >  
> > > > > > [...]
> > > > > >
> > > > > > Now if I isolate the vhost kernel thread [1] I get way more
> > > > > > performance as expected:
> > > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > > [  5]   0.00-10.00  sec  43.1 GBytes  37.1 Gbits/sec    0             sender
> > > > > > [  5]   0.00-10.04  sec  43.1 GBytes  36.9 Gbits/sec                  receiver
> > > > > >
> > > > > > After analyzing perf output, rep_movs_alternative is the most called
> > > > > > function in the three iperf3 (~20%Self), passt.avx2 (~15%Self) and
> > > > > > vhost (~15%Self)  
> > > > >
> > > > > Interesting... s/most called function/function using the most cycles/, I
> > > > > suppose.
> > > > >  
> > > >
> > > > Right!
> > > >  
> > > > > So it looks somewhat similar to
> > > > >
> > > > >   https://archives.passt.top/passt-dev/20241017021027.2ac9ea53@elisabeth/
> > > > >
> > > > > now?
> > > > >  
> > > >
> > > > Kind of. Below tcp_sendmsg_locked I don't see sk_page_frag_refill but
> > > > skb_do_copy_data_nocache. Not sure if that means something, as it
> > > > should not be affected by vhost.
> > > >  
> > > > > > But I don't see any of them consuming 100% of CPU in
> > > > > > top: pasta consumes ~85% %CPU, both iperf3 client and server consumes
> > > > > > 60%, and vhost consumes ~53%.
> > > > > >
> > > > > > So... I have mixed feelings about this :). By "default" it seems to
> > > > > > have less performance, but my test is maybe too synthetic.  
> > > > >
> > > > > Well, surely we can't ask Podman users to pin specific stuff to given
> > > > > CPU threads. :)
> > > > >  
> > > >
> > > > Yes but maybe the result changes under the right schedule? I'm
> > > > isolating the CPUs entirely, which is not the usual case for pasta for
> > > > sure :).
> > > >  
> > > > > > There is room for improvement with the mentioned optimizations so I'd
> > > > > > continue applying them, continuing with UDP and TCP zerocopy, and
> > > > > > developing zerocopy vhost rx.  
> > > > >
> > > > > That definitely makes sense to me.
> > > > >  
> > > >
> > > > Good!
> > > >  
> > > > > > With these numbers I think the series should not be
> > > > > > merged at the moment. I could send it as RFC if you want but I've not
> > > > > > applied the comments the first one received, POC style :).  
> > > > >
> > > > > I don't think it's really needed for you to spend time on
> > > > > semi-polishing something just to have an RFC if you're still working on
> > > > > it. I guess the implementation will change substantially anyway once
> > > > > you factor in further optimisations.
> > > > >  
> > > >
> > > > Agree! I'll keep iterating on this then.
> > > >  
> > >
> > > Actually, if I remove all the taskset etc, and trust the kernel
> > > scheduler, vanilla pasta gives me:
> > > [pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
> > > Connecting to host 10.6.68.254, port 5201
> > > [  5] local 10.6.68.20 port 40408 connected to 10.6.68.254 port 5201
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> > > [  5]   1.00-2.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> > > [  5]   2.00-3.00   sec  3.12 GBytes  26.8 Gbits/sec    0   25.4 MBytes
> > > [  5]   3.00-4.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> > > [  5]   4.00-5.00   sec  3.10 GBytes  26.6 Gbits/sec    0   25.4 MBytes
> > > [  5]   5.00-6.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> > > [  5]   6.00-7.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> > > [  5]   7.00-8.00   sec  3.09 GBytes  26.6 Gbits/sec    0   25.4 MBytes
> > > [  5]   8.00-9.00   sec  3.08 GBytes  26.5 Gbits/sec    0   25.4 MBytes
> > > [  5]   9.00-10.00  sec  3.10 GBytes  26.6 Gbits/sec    0   25.4 MBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [  5]   0.00-10.00  sec  31.0 GBytes  26.7 Gbits/sec    0             sender
> > > [  5]   0.00-10.04  sec  31.0 GBytes  26.5 Gbits/sec                  receiver
> > >
> > > And with vhost-net :
> > > [pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
> > > ...
> > > Connecting to host 10.6.68.254, port 5201
> > > [  5] local 10.6.68.20 port 46720 connected to 10.6.68.254 port 5201
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec  4.17 GBytes  35.8 Gbits/sec    0   11.9 MBytes
> > > [  5]   1.00-2.00   sec  4.17 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> > > [  5]   2.00-3.00   sec  4.16 GBytes  35.7 Gbits/sec    0   11.9 MBytes
> > > [  5]   3.00-4.00   sec  4.14 GBytes  35.6 Gbits/sec    0   11.9 MBytes
> > > [  5]   4.00-5.00   sec  4.16 GBytes  35.7 Gbits/sec    0   11.9 MBytes
> > > [  5]   5.00-6.00   sec  4.16 GBytes  35.8 Gbits/sec    0   11.9 MBytes
> > > [  5]   6.00-7.00   sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> > > [  5]   7.00-8.00   sec  4.19 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> > > [  5]   8.00-9.00   sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> > > [  5]   9.00-10.00  sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [  5]   0.00-10.00  sec  41.7 GBytes  35.8 Gbits/sec    0             sender
> > > [  5]   0.00-10.04  sec  41.7 GBytes  35.7 Gbits/sec                  receiver
> > >
> > > If I  go the extra mile and disable notifications (it might be just
> > > noise, but...)
> > > [pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
> > > ...
> > > Connecting to host 10.6.68.254, port 5201
> > > [  5] local 10.6.68.20 port 56590 connected to 10.6.68.254 port 5201
> > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > [  5]   0.00-1.00   sec  4.19 GBytes  36.0 Gbits/sec    0   12.4 MBytes
> > > [  5]   1.00-2.00   sec  4.18 GBytes  35.9 Gbits/sec    0   12.4 MBytes
> > > [  5]   2.00-3.00   sec  4.18 GBytes  35.9 Gbits/sec    0   12.4 MBytes
> > > [  5]   3.00-4.00   sec  4.20 GBytes  36.1 Gbits/sec    0   12.4 MBytes
> > > [  5]   4.00-5.00   sec  4.21 GBytes  36.2 Gbits/sec    0   12.4 MBytes
> > > [  5]   5.00-6.00   sec  4.21 GBytes  36.1 Gbits/sec    0   12.4 MBytes
> > > [  5]   6.00-7.00   sec  4.20 GBytes  36.1 Gbits/sec    0   12.4 MBytes
> > > [  5]   7.00-8.00   sec  4.23 GBytes  36.4 Gbits/sec    0   12.4 MBytes
> > > [  5]   8.00-9.00   sec  4.24 GBytes  36.4 Gbits/sec    0   12.4 MBytes
> > > [  5]   9.00-10.00  sec  4.21 GBytes  36.2 Gbits/sec    0   12.4 MBytes
> > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > [ ID] Interval           Transfer     Bitrate         Retr
> > > [  5]   0.00-10.00  sec  42.1 GBytes  36.1 Gbits/sec    0             sender
> > > [  5]   0.00-10.04  sec  42.1 GBytes  36.0 Gbits/sec                  receiver
> > >
> > > So I guess the best is to actually run performance tests closer to
> > > real-world workload against the new version and see if it works
> > > better?  
> >
> > Well, that's certainly a possibility.
> >
> > I'd say the biggest value for vhost-net usage in pasta is reaching
> > throughput figures that are comparable with veth, with or without
> > multithreading (keeping an eye on bytes per cycle, of course), with or
> > without kernel changes, so that users won't need to choose between
> > rootless and performance anymore.
> >
> > It would also simplify things in Podman quite a lot (and to some extent
> > in rootlesskit / Docker as well). We're pretty much there with virtual
> > machines, just not quite with containers (which is somewhat ironic, but
> > of course there's a good reason for that).
> >
> > If we're clearly wasting cycles in vhost-net (because of the bounce
> > buffer, plus something else perhaps?) *and* there's a somewhat possible
> > solution for that in sight *and* the interface would change anyway,
> > running throughput tests and polishing up the current version with a
> > half-baked solution at the moment sounds a bit wasteful to me.
> 
> My point is that I'm testing a very synthetic scenario. If everybody
> agree this is close enough to real world ones, I'm ok to continue
> improving the edges we see. If not, maybe we're picking the wrong
> fruit even if it is low hand?
> 
> Getting a table like [1] would give us light about this, especially if
> it is just a matter of running "make performance" or similar. Maybe we
> need to include longer queues? Focus on a given scenario? UDP goes
> better but TCP?

Well, it's a matter of running ./run under tests (or 'make' there).
Have you tried that with your patch? It's kind of representative in the
sense that it uses several message sizes and different values for the
sending window.

> Now more points about this scenario:
> 1) I don't see 100% CPU usage in any element:
> CPU%
> 84.2  passt.avx2
> 57.9 iperf3
> 57.2 iperf3
> 50.7 vhost-1805109

Still, I bet we're using an awful amount of cycles compared to veth.

> 2) The most used (Self%) function in vhost is rep_movs_alternative,
> called from skb_copy_datagram_iter, so yes, ZeroCopy should help a lot
> here.
> 
> Now, is "iperf3 -w 8M" representative? I'm sure ZC helps in this
> scenario, does it make it worse if we have small packets? Do we care?

We don't care _a lot_ about small packets because we can typically use
large packets, inbound and outbound, at least for TCP (bulk) transfers.
But users are doing all sort of things with containers, including bulk
transfers and VPN traffic over UDP, so we do, a bit.

Again, the main value of using vhost-net, I think, is making "rootful"
networking essentially unnecessary, or necessary just for niche use
cases (say, non-TCP, non-UDP traffic, or macvlan-like cases). If there
are relatively common use cases where pasta performs pretty bad
compared to veth, we'll still need rootful networking.

So, yes, it is representative, but not necessarily universal.

> I'm totally ok with continuing trying with ZC, I just want to make
> sure we're not missing anything :).

In any case, it looks like vhost-net zero-copy is a bigger task than we
thought, so, even if we don't reach a universal solution that makes
rootful networking essentially unnecessary, but we have a big
improvement ready, there's of course a lot of value in it. Your call...

-- 
Stefano


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: vhost-kernel net on pasta: from 26 to 37Gbit/s
  2025-06-10 15:29           ` Stefano Brivio
@ 2025-06-11  7:04             ` Eugenio Perez Martin
  2025-06-11  8:08               ` Stefano Brivio
  0 siblings, 1 reply; 11+ messages in thread
From: Eugenio Perez Martin @ 2025-06-11  7:04 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: passt-dev, Jason Wang, Jeff Nelson, Paul Holzinger

On Tue, Jun 10, 2025 at 5:29 PM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> [Adding Paul as Podman developer]
>
> On Mon, 9 Jun 2025 11:59:21 +0200
> Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> > On Fri, Jun 6, 2025 at 6:37 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > >
> > > On Fri, 6 Jun 2025 16:32:38 +0200
> > > Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > >
> > > > On Wed, May 21, 2025 at 12:35 PM Eugenio Perez Martin
> > > > <eperezma@redhat.com> wrote:
> > > > >
> > > > > On Wed, May 21, 2025 at 12:09 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, 20 May 2025 17:09:44 +0200
> > > > > > Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > > >
> > > > > > > [...]
> > > > > > >
> > > > > > > Now if I isolate the vhost kernel thread [1] I get way more
> > > > > > > performance as expected:
> > > > > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > > > > [  5]   0.00-10.00  sec  43.1 GBytes  37.1 Gbits/sec    0             sender
> > > > > > > [  5]   0.00-10.04  sec  43.1 GBytes  36.9 Gbits/sec                  receiver
> > > > > > >
> > > > > > > After analyzing perf output, rep_movs_alternative is the most called
> > > > > > > function in the three iperf3 (~20%Self), passt.avx2 (~15%Self) and
> > > > > > > vhost (~15%Self)
> > > > > >
> > > > > > Interesting... s/most called function/function using the most cycles/, I
> > > > > > suppose.
> > > > > >
> > > > >
> > > > > Right!
> > > > >
> > > > > > So it looks somewhat similar to
> > > > > >
> > > > > >   https://archives.passt.top/passt-dev/20241017021027.2ac9ea53@elisabeth/
> > > > > >
> > > > > > now?
> > > > > >
> > > > >
> > > > > Kind of. Below tcp_sendmsg_locked I don't see sk_page_frag_refill but
> > > > > skb_do_copy_data_nocache. Not sure if that means something, as it
> > > > > should not be affected by vhost.
> > > > >
> > > > > > > But I don't see any of them consuming 100% of CPU in
> > > > > > > top: pasta consumes ~85% %CPU, both iperf3 client and server consumes
> > > > > > > 60%, and vhost consumes ~53%.
> > > > > > >
> > > > > > > So... I have mixed feelings about this :). By "default" it seems to
> > > > > > > have less performance, but my test is maybe too synthetic.
> > > > > >
> > > > > > Well, surely we can't ask Podman users to pin specific stuff to given
> > > > > > CPU threads. :)
> > > > > >
> > > > >
> > > > > Yes but maybe the result changes under the right schedule? I'm
> > > > > isolating the CPUs entirely, which is not the usual case for pasta for
> > > > > sure :).
> > > > >
> > > > > > > There is room for improvement with the mentioned optimizations so I'd
> > > > > > > continue applying them, continuing with UDP and TCP zerocopy, and
> > > > > > > developing zerocopy vhost rx.
> > > > > >
> > > > > > That definitely makes sense to me.
> > > > > >
> > > > >
> > > > > Good!
> > > > >
> > > > > > > With these numbers I think the series should not be
> > > > > > > merged at the moment. I could send it as RFC if you want but I've not
> > > > > > > applied the comments the first one received, POC style :).
> > > > > >
> > > > > > I don't think it's really needed for you to spend time on
> > > > > > semi-polishing something just to have an RFC if you're still working on
> > > > > > it. I guess the implementation will change substantially anyway once
> > > > > > you factor in further optimisations.
> > > > > >
> > > > >
> > > > > Agree! I'll keep iterating on this then.
> > > > >
> > > >
> > > > Actually, if I remove all the taskset etc, and trust the kernel
> > > > scheduler, vanilla pasta gives me:
> > > > [pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
> > > > Connecting to host 10.6.68.254, port 5201
> > > > [  5] local 10.6.68.20 port 40408 connected to 10.6.68.254 port 5201
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> > > > [  5]   1.00-2.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> > > > [  5]   2.00-3.00   sec  3.12 GBytes  26.8 Gbits/sec    0   25.4 MBytes
> > > > [  5]   3.00-4.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> > > > [  5]   4.00-5.00   sec  3.10 GBytes  26.6 Gbits/sec    0   25.4 MBytes
> > > > [  5]   5.00-6.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> > > > [  5]   6.00-7.00   sec  3.11 GBytes  26.7 Gbits/sec    0   25.4 MBytes
> > > > [  5]   7.00-8.00   sec  3.09 GBytes  26.6 Gbits/sec    0   25.4 MBytes
> > > > [  5]   8.00-9.00   sec  3.08 GBytes  26.5 Gbits/sec    0   25.4 MBytes
> > > > [  5]   9.00-10.00  sec  3.10 GBytes  26.6 Gbits/sec    0   25.4 MBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec  31.0 GBytes  26.7 Gbits/sec    0             sender
> > > > [  5]   0.00-10.04  sec  31.0 GBytes  26.5 Gbits/sec                  receiver
> > > >
> > > > And with vhost-net :
> > > > [pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
> > > > ...
> > > > Connecting to host 10.6.68.254, port 5201
> > > > [  5] local 10.6.68.20 port 46720 connected to 10.6.68.254 port 5201
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec  4.17 GBytes  35.8 Gbits/sec    0   11.9 MBytes
> > > > [  5]   1.00-2.00   sec  4.17 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> > > > [  5]   2.00-3.00   sec  4.16 GBytes  35.7 Gbits/sec    0   11.9 MBytes
> > > > [  5]   3.00-4.00   sec  4.14 GBytes  35.6 Gbits/sec    0   11.9 MBytes
> > > > [  5]   4.00-5.00   sec  4.16 GBytes  35.7 Gbits/sec    0   11.9 MBytes
> > > > [  5]   5.00-6.00   sec  4.16 GBytes  35.8 Gbits/sec    0   11.9 MBytes
> > > > [  5]   6.00-7.00   sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> > > > [  5]   7.00-8.00   sec  4.19 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> > > > [  5]   8.00-9.00   sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> > > > [  5]   9.00-10.00  sec  4.18 GBytes  35.9 Gbits/sec    0   11.9 MBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec  41.7 GBytes  35.8 Gbits/sec    0             sender
> > > > [  5]   0.00-10.04  sec  41.7 GBytes  35.7 Gbits/sec                  receiver
> > > >
> > > > If I  go the extra mile and disable notifications (it might be just
> > > > noise, but...)
> > > > [pasta@virtlab716 ~]$ /home/passt/pasta --config-net iperf3 -c 10.6.68.254 -w 8M
> > > > ...
> > > > Connecting to host 10.6.68.254, port 5201
> > > > [  5] local 10.6.68.20 port 56590 connected to 10.6.68.254 port 5201
> > > > [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> > > > [  5]   0.00-1.00   sec  4.19 GBytes  36.0 Gbits/sec    0   12.4 MBytes
> > > > [  5]   1.00-2.00   sec  4.18 GBytes  35.9 Gbits/sec    0   12.4 MBytes
> > > > [  5]   2.00-3.00   sec  4.18 GBytes  35.9 Gbits/sec    0   12.4 MBytes
> > > > [  5]   3.00-4.00   sec  4.20 GBytes  36.1 Gbits/sec    0   12.4 MBytes
> > > > [  5]   4.00-5.00   sec  4.21 GBytes  36.2 Gbits/sec    0   12.4 MBytes
> > > > [  5]   5.00-6.00   sec  4.21 GBytes  36.1 Gbits/sec    0   12.4 MBytes
> > > > [  5]   6.00-7.00   sec  4.20 GBytes  36.1 Gbits/sec    0   12.4 MBytes
> > > > [  5]   7.00-8.00   sec  4.23 GBytes  36.4 Gbits/sec    0   12.4 MBytes
> > > > [  5]   8.00-9.00   sec  4.24 GBytes  36.4 Gbits/sec    0   12.4 MBytes
> > > > [  5]   9.00-10.00  sec  4.21 GBytes  36.2 Gbits/sec    0   12.4 MBytes
> > > > - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > [ ID] Interval           Transfer     Bitrate         Retr
> > > > [  5]   0.00-10.00  sec  42.1 GBytes  36.1 Gbits/sec    0             sender
> > > > [  5]   0.00-10.04  sec  42.1 GBytes  36.0 Gbits/sec                  receiver
> > > >
> > > > So I guess the best is to actually run performance tests closer to
> > > > real-world workload against the new version and see if it works
> > > > better?
> > >
> > > Well, that's certainly a possibility.
> > >
> > > I'd say the biggest value for vhost-net usage in pasta is reaching
> > > throughput figures that are comparable with veth, with or without
> > > multithreading (keeping an eye on bytes per cycle, of course), with or
> > > without kernel changes, so that users won't need to choose between
> > > rootless and performance anymore.
> > >
> > > It would also simplify things in Podman quite a lot (and to some extent
> > > in rootlesskit / Docker as well). We're pretty much there with virtual
> > > machines, just not quite with containers (which is somewhat ironic, but
> > > of course there's a good reason for that).
> > >
> > > If we're clearly wasting cycles in vhost-net (because of the bounce
> > > buffer, plus something else perhaps?) *and* there's a somewhat possible
> > > solution for that in sight *and* the interface would change anyway,
> > > running throughput tests and polishing up the current version with a
> > > half-baked solution at the moment sounds a bit wasteful to me.
> >
> > My point is that I'm testing a very synthetic scenario. If everybody
> > agree this is close enough to real world ones, I'm ok to continue
> > improving the edges we see. If not, maybe we're picking the wrong
> > fruit even if it is low hand?
> >
> > Getting a table like [1] would give us light about this, especially if
> > it is just a matter of running "make performance" or similar. Maybe we
> > need to include longer queues? Focus on a given scenario? UDP goes
> > better but TCP?
>
> Well, it's a matter of running ./run under tests (or 'make' there).
> Have you tried that with your patch? It's kind of representative in the
> sense that it uses several message sizes and different values for the
> sending window.
>

Yes but it freezes in my env.

Copying the different windows, top-left:
# tail -f --retry /home/passt/test/test_logs/context_unshare.log
/home/passt/test/test_logs/context_ns.log
tail: warning: --retry only effective for the initial open
==> /home/passt/test/test_logs/context_unshare.log <==
unshare$ tail: cannot open '/home/passt/test/test_logs/context_ns.log'
for reading: No such file or directory
---

top-right:
# while cat /tmp/passt-tests-7HpbEm/log_pipe; do :; done
Test layout: single pasta instance with namespace.
---

bottom-left:
# tail -f --retry /home/passt/test/test_logs/context_host.log
tail: warning: --retry only effective for the initial open
host$
---

bottom-right:
# tail -f --ontext_passt.logt/test/test_logs/c
tail: warning: --retry only effective for the initial open
passt$
---

And test/test_logs/test.log:
=== build/all
> Build passt
? ! [ -e passt ]
? [ -f passt ]
...passed.

> Build pasta
? ! [ -e pasta ]
? [ -h pasta ]
...passed.

> Build qrap
? ! [ -e qrap ]
? [ -f qrap ]
...passed.

> Build all
? ! [ -e passt ]
? ! [ -e pasta ]
? ! [ -e qrap ]
? [ -f passt ]
? [ -h pasta ]
? [ -f qrap ]
...passed.

> Install
? [ -f /tmp/passt-tests-7HpbEm/build/all/prefix/bin/passt ]
? [ -h /tmp/passt-tests-7HpbEm/build/all/prefix/bin/pasta ]
? [ -f /tmp/passt-tests-7HpbEm/build/all/prefix/bin/qrap ]
? man -M /tmp/passt-tests-7HpbEm/build/all/prefix/share/man -W passt
? man -M /tmp/passt-tests-7HpbEm/build/all/prefix/share/man -W pasta
? man -M /tmp/passt-tests-7HpbEm/build/all/prefix/share/man -W qrap
...passed.

> Uninstall
? ! [ -f /tmp/passt-tests-7HpbEm/build/all/prefix/bin/passt ]
? ! [ -h /tmp/passt-tests-7HpbEm/build/all/prefix/bin/pasta ]
? ! [ -f /tmp/passt-tests-7HpbEm/build/all/prefix/bin/qrap ]
? ! man -M /tmp/passt-tests-7HpbEm/build/all/prefix/share/man -W passt
2>/dev/null
? ! man -M /tmp/passt-tests-7HpbEm/build/all/prefix/share/man -W pasta
2>/dev/null
? ! man -M /tmp/passt-tests-7HpbEm/build/all/prefix/share/man -W qrap
2>/dev/null
...passed.


=== build/cppcheck
...skipped.


=== build/clang_tidy
...skipped.
---

> > Now more points about this scenario:
> > 1) I don't see 100% CPU usage in any element:
> > CPU%
> > 84.2  passt.avx2
> > 57.9 iperf3
> > 57.2 iperf3
> > 50.7 vhost-1805109
>
> Still, I bet we're using an awful amount of cycles compared to veth.
>
> > 2) The most used (Self%) function in vhost is rep_movs_alternative,
> > called from skb_copy_datagram_iter, so yes, ZeroCopy should help a lot
> > here.
> >
> > Now, is "iperf3 -w 8M" representative? I'm sure ZC helps in this
> > scenario, does it make it worse if we have small packets? Do we care?
>
> We don't care _a lot_ about small packets because we can typically use
> large packets, inbound and outbound, at least for TCP (bulk) transfers.
> But users are doing all sort of things with containers, including bulk
> transfers and VPN traffic over UDP, so we do, a bit.
>
> Again, the main value of using vhost-net, I think, is making "rootful"
> networking essentially unnecessary, or necessary just for niche use
> cases (say, non-TCP, non-UDP traffic, or macvlan-like cases). If there
> are relatively common use cases where pasta performs pretty bad
> compared to veth, we'll still need rootful networking.
>
> So, yes, it is representative, but not necessarily universal.
>
> > I'm totally ok with continuing trying with ZC, I just want to make
> > sure we're not missing anything :).
>
> In any case, it looks like vhost-net zero-copy is a bigger task than we
> thought, so, even if we don't reach a universal solution that makes
> rootful networking essentially unnecessary, but we have a big
> improvement ready, there's of course a lot of value in it. Your call...
>

Got it, thanks!


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: vhost-kernel net on pasta: from 26 to 37Gbit/s
  2025-06-11  7:04             ` Eugenio Perez Martin
@ 2025-06-11  8:08               ` Stefano Brivio
  0 siblings, 0 replies; 11+ messages in thread
From: Stefano Brivio @ 2025-06-11  8:08 UTC (permalink / raw)
  To: Eugenio Perez Martin; +Cc: passt-dev, Jason Wang, Jeff Nelson, Paul Holzinger

On Wed, 11 Jun 2025 09:04:57 +0200
Eugenio Perez Martin <eperezma@redhat.com> wrote:

> On Tue, Jun 10, 2025 at 5:29 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > On Mon, 9 Jun 2025 11:59:21 +0200
> > Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >  
> > > Getting a table like [1] would give us light about this, especially if
> > > it is just a matter of running "make performance" or similar. Maybe we
> > > need to include longer queues? Focus on a given scenario? UDP goes
> > > better but TCP?  
> >
> > Well, it's a matter of running ./run under tests (or 'make' there).
> > Have you tried that with your patch? It's kind of representative in the
> > sense that it uses several message sizes and different values for the
> > sending window.
> 
> Yes but it freezes in my env.
> 
> Copying the different windows, top-left:
> # tail -f --retry /home/passt/test/test_logs/context_unshare.log
> /home/passt/test/test_logs/context_ns.log
> tail: warning: --retry only effective for the initial open
> ==> /home/passt/test/test_logs/context_unshare.log <==  
> unshare$ tail: cannot open '/home/passt/test/test_logs/context_ns.log'
> for reading: No such file or directory
>
> [...]
>
> === build/cppcheck
> ...skipped.
> 
> 
> === build/clang_tidy
> ...skipped.
> ---

I guess you're missing some dependencies, nstool ('make assets', or
'make' under tests/), or similar.

There's a test/README.md file listing most dependencies and also
describing how to debug things, essentially:

  DEBUG=1 ./run

and have a look at script.log. Patches for whatever issue you
might be hitting in your setup are warmly welcome, of course. :)

-- 
Stefano


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2025-06-11  8:08 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-20 15:09 vhost-kernel net on pasta: from 26 to 37Gbit/s Eugenio Perez Martin
2025-05-21  0:57 ` Jason Wang
2025-05-21  5:37   ` Eugenio Perez Martin
2025-05-21 10:08 ` Stefano Brivio
2025-05-21 10:35   ` Eugenio Perez Martin
2025-06-06 14:32     ` Eugenio Perez Martin
2025-06-06 16:37       ` Stefano Brivio
2025-06-09  9:59         ` Eugenio Perez Martin
2025-06-10 15:29           ` Stefano Brivio
2025-06-11  7:04             ` Eugenio Perez Martin
2025-06-11  8:08               ` Stefano Brivio

Code repositories for project(s) associated with this public inbox

	https://passt.top/passt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).