On Tue, Dec 20, 2022 at 11:42:46AM +0100, Stefano Brivio wrote:
> Sorry for the further delay,
> 
> On Wed, 14 Dec 2022 11:35:46 +0100
> Stefano Brivio wrote:
> 
> > On Wed, 14 Dec 2022 12:42:14 +1100
> > David Gibson wrote:
> > 
> > > On Tue, Dec 13, 2022 at 11:48:47PM +0100, Stefano Brivio wrote:
> > > > Sorry for the long delay here,
> > > > 
> > > > On Mon, 5 Dec 2022 19:14:21 +1100
> > > > David Gibson wrote:
> > > > 
> > > > > Usually udp_sock_handler() will receive multiple (up to 32) datagrams
> > > > > at once, then forward them all to the tap interface.  For unclear
> > > > > reasons, though, when in pasta mode we will only receive and forward
> > > > > a single datagram at a time.  Change it to receive multiple datagrams
> > > > > at once, like the other paths.
> > > > 
> > > > This is explained in the commit message of 6c931118643c ("tcp, udp:
> > > > Receive batching doesn't pay off when writing single frames to tap").
> > > > 
> > > > I think it's worth re-checking the throughput now as this path is a bit
> > > > different, but unfortunately I didn't include this in the "perf" tests :(
> > > > because at the time I introduced those I wasn't sure it even made sense
> > > > to have traffic from the same host being directed to the tap device.
> > > > 
> > > > The iperf3 runs where I observed this are actually the ones from the
> > > > Podman demo. Ideally that case should also be checked in the
> > > > perf/pasta_udp tests.
> > > 
> > > Hm, ok.
> > > 
> > > > How fundamental is this for the rest of the series? I couldn't find any
> > > > actual dependency on this but I might be missing something.
> > > 
> > > So the issue is that prior to this change in pasta we receive multiple
> > > frames at once on the splice path, but one frame at a time on the tap
> > > path.  By the end of this series we can't do that any more, because we
> > > don't know before the recvmmsg() which one we'll be doing.
> > 
> > Oh, right, I see. Then let me add this path to the perf/pasta_udp test
> > and check how relevant this is now, I'll get back to you in a bit.
> 
> I was checking the wrong path. With this:
> 
> diff --git a/test/perf/pasta_udp b/test/perf/pasta_udp
> index 27ea724..973c2f4 100644
> --- a/test/perf/pasta_udp
> +++ b/test/perf/pasta_udp
> @@ -31,6 +31,14 @@ report pasta lo_udp 1 __FREQ__
>  
>  th	MTU 1500B 4000B 16384B 65535B
>  
> +tr	UDP throughput over IPv6: host to ns
> +nsout	IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
> +nsout	ADDR6 ip -j -6 addr show|jq -rM '.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global" and .prefixlen == 64).local'
> +bw	-
> +bw	-
> +bw	-
> +iperf3	BW host ns __ADDR6__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 15G
> +bw	__BW__ 7.0 9.0
>  
>  tr	UDP throughput over IPv6: ns to host
>  ns	ip link set dev lo mtu 1500
> diff --git a/test/run b/test/run
> index e07513f..b53182b 100755
> --- a/test/run
> +++ b/test/run
> @@ -67,6 +67,14 @@ run() {
>  	test build/clang_tidy
>  	teardown build
>  
> +	VALGRIND=0
> +	setup passt_in_ns
> +	test passt/ndp
> +	test passt/dhcp
> +	test perf/pasta_udp
> +	test passt_in_ns/shutdown
> +	teardown passt_in_ns
> +
>  	setup pasta
>  	test pasta/ndp
>  	test pasta/dhcp

Ah, ok.  Can we add that to the standard set of tests ASAP, please?

> I get 21.6 gbps after this series, and 29.7 gbps before -- it's quite
> significant.

Drat.
> And there's nothing strange in perf's output, really, the distribution
> of overhead per function is pretty much the same, but writing multiple
> messages to the tap device just takes more cycles per message compared
> to a single message.

That's so weird.  It should be basically an identical set of write()s,
except that they happen in a batch, rather than a bit spread out.  I
guess it has to be some kind of cache locality thing.  I wonder if the
difference would go away or reverse if we had a way to submit multiple
frames with a single syscall.

> I'm a bit ashamed to propose this, but what do you think about
> something like:
> 
> 	if (c->mode == MODE_PASTA) {
> 		if (recvmmsg(ref.r.s, mmh_recv, 1, 0, NULL) <= 0)
> 			return;
> 
> 		if (udp_mmh_splice_port(v6, mmh_recv)) {
> 			n = recvmmsg(ref.r.s, mmh_recv + 1,
> 				     UDP_MAX_FRAMES - 1, 0, NULL);
> 		}
> 
> 		if (n > 0)
> 			n++;
> 		else
> 			n = 1;
> 	} else {
> 		n = recvmmsg(ref.r.s, mmh_recv, UDP_MAX_FRAMES, 0, NULL);
> 		if (n <= 0)
> 			return;
> 	}
> 
> ? Other than the inherent ugliness, it looks like a good
> approximation to me.

Hmm.  Well, the first question is how much impact going one message at
a time has on the spliced throughput.  If it's not too bad, then we
could just always go one at a time for pasta, regardless of splicing.
And we could even abstract that difference into the tap backend with a
callback like tap_batch_size(c).

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
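A minimal sketch of what a tap_batch_size(c) callback along these lines
could look like, for illustration only: the struct ctx member, MODE_PASTA,
UDP_MAX_FRAMES, ref.r.s and mmh_recv names are taken from the snippets
quoted above, but the helper itself, its exact signature and its call site
are hypothetical, not actual passt code.

	/* tap_batch_size() - hypothetical helper: how many datagrams to
	 * request per recvmmsg(), depending on the tap backend in use.
	 * Pasta writes single frames to the tap device, where receive
	 * batching doesn't pay off, so it reads one datagram at a time;
	 * passt keeps the full batch.
	 */
	static unsigned int tap_batch_size(const struct ctx *c)
	{
		if (c->mode == MODE_PASTA)
			return 1;

		return UDP_MAX_FRAMES;
	}

	/* ...so the receive step in udp_sock_handler() could collapse to: */
	n = recvmmsg(ref.r.s, mmh_recv, tap_batch_size(c), 0, NULL);
	if (n <= 0)
		return;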