From: David Gibson <david@gibson.dropbear.id.au>
To: Stefano Brivio <sbrivio@redhat.com>
Cc: passt-dev@passt.top
Subject: Re: [PATCH 4/8] udp: Receive multiple datagrams at once on the pasta sock->tap path
Date: Wed, 4 Jan 2023 15:53:21 +1100 [thread overview]
Message-ID: <Y7UGQRuDITy7BdZK@yekko> (raw)
In-Reply-To: <20230104010852.02e96a70@elisabeth>
On Wed, Jan 04, 2023 at 01:08:52AM +0100, Stefano Brivio wrote:
> On Wed, 21 Dec 2022 17:00:24 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
>
> > On Tue, Dec 20, 2022 at 11:42:46AM +0100, Stefano Brivio wrote:
> > > Sorry for the further delay,
> > >
> > > On Wed, 14 Dec 2022 11:35:46 +0100
> > > Stefano Brivio <sbrivio@redhat.com> wrote:
> > >
> > > > On Wed, 14 Dec 2022 12:42:14 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >
> > > > > On Tue, Dec 13, 2022 at 11:48:47PM +0100, Stefano Brivio wrote:
> > > > > > Sorry for the long delay here,
> > > > > >
> > > > > > On Mon, 5 Dec 2022 19:14:21 +1100
> > > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > > >
> > > > > > > Usually udp_sock_handler() will receive multiple (up to 32)
> > > > > > > datagrams at once, then forward them all to the tap
> > > > > > > interface. For unclear reasons, though, in pasta mode we will only
> > > > > > > receive and forward a single datagram at a time. Change it to receive
> > > > > > > multiple datagrams at once, like the other paths.
> > > > > >
> > > > > > This is explained in the commit message of 6c931118643c ("tcp, udp:
> > > > > > Receive batching doesn't pay off when writing single frames to tap").
> > > > > >
> > > > > > I think it's worth re-checking the throughput now as this path is a bit
> > > > > > different, but unfortunately I didn't include this in the "perf" tests :(
> > > > > > because at the time I introduced those I wasn't sure it even made sense to
> > > > > > have traffic from the same host being directed to the tap device.
> > > > > >
> > > > > > The iperf3 runs where I observed this are actually the ones from the Podman
> > > > > > demo. Ideally that case should also be checked in the perf/pasta_udp tests.
> > > > >
> > > > > Hm, ok.
> > > > >
> > > > > > How fundamental is this for the rest of the series? I couldn't find any
> > > > > > actual dependency on this but I might be missing something.
> > > > >
> > > > > So the issue is that prior to this change in pasta we receive multiple
> > > > > frames at once on the splice path, but one frame at a time on the tap
> > > > > path. By the end of this series we can't do that any more, because we
> > > > > don't know before the recvmmsg() which one we'll be doing.
> > > >
> > > > Oh, right, I see. Then let me add this path to the perf/pasta_udp test
> > > > and check how relevant this is now, I'll get back to you in a bit.
> > >
> > > I was checking the wrong path. With this:
> > >
> > > diff --git a/test/perf/pasta_udp b/test/perf/pasta_udp
> > > index 27ea724..973c2f4 100644
> > > --- a/test/perf/pasta_udp
> > > +++ b/test/perf/pasta_udp
> > > @@ -31,6 +31,14 @@ report pasta lo_udp 1 __FREQ__
> > >
> > > th MTU 1500B 4000B 16384B 65535B
> > >
> > > +tr UDP throughput over IPv6: host to ns
> > > +nsout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
> > > +nsout ADDR6 ip -j -6 addr show|jq -rM '.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global" and .prefixlen == 64).local'
> > > +bw -
> > > +bw -
> > > +bw -
> > > +iperf3 BW host ns __ADDR6__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 15G
> > > +bw __BW__ 7.0 9.0
> > >
> > > tr UDP throughput over IPv6: ns to host
> > > ns ip link set dev lo mtu 1500
> > > diff --git a/test/run b/test/run
> > > index e07513f..b53182b 100755
> > > --- a/test/run
> > > +++ b/test/run
> > > @@ -67,6 +67,14 @@ run() {
> > > test build/clang_tidy
> > > teardown build
> > >
> > > + VALGRIND=0
> > > + setup passt_in_ns
> > > + test passt/ndp
> > > + test passt/dhcp
> > > + test perf/pasta_udp
> > > + test passt_in_ns/shutdown
> > > + teardown passt_in_ns
> > > +
> > > setup pasta
> > > test pasta/ndp
> > > test pasta/dhcp
> >
> > Ah, ok. Can we add that to the standard set of tests ASAP, please.
> >
> > > I get 21.6 Gbps after this series, and 29.7 Gbps before -- it's quite
> > > significant.
> >
> > Drat.
> >
> > > And there's nothing strange in perf's output, really: the distribution
> > > of overhead per function is pretty much the same, but writing multiple
> > > messages to the tap device just takes more cycles per message compared
> > > to writing a single one.
> >
> > That's so weird. It should be basically an identical set of write()s,
> > except that they happen in a batch, rather than a bit spread out. I
> > guess it has to be some kind of cache locality thing. I wonder if the
> > difference would go away or reverse if we had a way to submit multiple
> > frames with a single syscall.
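
To make "multiple frames with a single syscall" concrete: something
like the io_uring sketch below is what I had in mind -- one SQE per
frame, a single io_uring_enter() submitting the whole batch. Untested
and purely illustrative; the liburing calls (io_uring_get_sqe(),
io_uring_prep_write(), io_uring_submit()) are real, but the
tap_send_batch() helper and its arguments are made up here:

	#include <liburing.h>
	#include <sys/uio.h>

	/* Sketch only: queue one write() per frame, submit in one syscall.
	 * Each SQE completes as a separate write(), so every frame is
	 * still delivered as a single packet on the tap device.
	 */
	static int tap_send_batch(struct io_uring *ring, int tapfd,
				  const struct iovec *frames, unsigned int n)
	{
		unsigned int i;

		for (i = 0; i < n; i++) {
			struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

			if (!sqe)
				break;	/* SQ full: submit what we have */

			io_uring_prep_write(sqe, tapfd, frames[i].iov_base,
					    frames[i].iov_len, 0);
		}

		/* One io_uring_enter() for everything queued above */
		return io_uring_submit(ring);
	}
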
> >
> > > I'm a bit ashamed to propose this, but what do you think about
> > > something like:
> >
> > > 	int n = 0;
> > >
> > > 	if (c->mode == MODE_PASTA) {
> > > 		if (recvmmsg(ref.r.s, mmh_recv, 1, 0, NULL) <= 0)
> > > 			return;
> > >
> > > 		if (udp_mmh_splice_port(v6, mmh_recv)) {
> > > 			n = recvmmsg(ref.r.s, mmh_recv + 1,
> > > 				     UDP_MAX_FRAMES - 1, 0, NULL);
> > > 		}
> > >
> > > 		if (n > 0)
> > > 			n++;
> > > 		else
> > > 			n = 1;
> > > 	} else {
> > > 		n = recvmmsg(ref.r.s, mmh_recv, UDP_MAX_FRAMES, 0, NULL);
> > > 		if (n <= 0)
> > > 			return;
> > > 	}
> >
> > > ? Other than the inherent ugliness, it looks like a good
> > > approximation to me.
> >
> > Hmm. Well, the first question is how much impact going one message
> > at a time has on the spliced throughput. If it's not too bad, then
> > we could just always go one at a time for pasta, regardless of
> > splicing. And we could even abstract that difference into the tap
> > backend with a callback like tap_batch_size(c).
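
To be concrete about the tap_batch_size() idea above, something like
the sketch below is what I had in mind -- hypothetical, the callback
doesn't exist yet, though struct ctx, MODE_PASTA and UDP_MAX_FRAMES
do:

	/* Sketch: how many datagrams to receive and forward per cycle */
	static unsigned int tap_batch_size(const struct ctx *c)
	{
		/* For pasta, one write() per frame to the tap device
		 * seems to beat batching.
		 */
		if (c->mode == MODE_PASTA)
			return 1;

		return UDP_MAX_FRAMES;	/* keep batching for passt */
	}
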
>
> So, finally I had the chance to try this out.
>
> First off, baseline with the patch adding the new tests I just sent,
> and the series you posted:
>
> === perf/pasta_udp
> > pasta: throughput and latency (local traffic)
> Throughput in Gbps, latency in µs, one thread at 3.6 GHz, 4 streams
> MTU: | 1500B | 4000B | 16384B | 65535B |
> |--------|--------|--------|--------|
> UDP throughput over IPv6: ns to host | 4.4 | 8.5 | 19.5 | 23.0 |
> UDP RR latency over IPv6: ns to host | - | - | - | 27 |
> |--------|--------|--------|--------|
> UDP throughput over IPv4: ns to host | 4.3 | 8.8 | 18.5 | 24.4 |
> UDP RR latency over IPv4: ns to host | - | - | - | 26 |
> |--------|--------|--------|--------|
> UDP throughput over IPv6: host to ns | - | - | - | 22.5 |
> UDP RR latency over IPv6: host to ns | - | - | - | 30 |
> |--------|--------|--------|--------|
> UDP throughput over IPv4: host to ns | - | - | - | 24.5 |
> UDP RR latency over IPv4: host to ns | - | - | - | 25 |
> '--------'--------'--------'--------'
> ...passed.
>
> > pasta: throughput and latency (traffic via tap)
> Throughput in Gbps, latency in µs, one thread at 3.6 GHz, 4 streams
> MTU: | 1500B | 4000B | 16384B | 65520B |
> |--------|--------|--------|--------|
> UDP throughput over IPv6: ns to host | 4.4 | 10.4 | 16.0 | 23.4 |
> UDP RR latency over IPv6: ns to host | - | - | - | 27 |
> |--------|--------|--------|--------|
> UDP throughput over IPv4: ns to host | 5.2 | 10.8 | 16.0 | 24.0 |
> UDP RR latency over IPv4: ns to host | - | - | - | 28 |
> |--------|--------|--------|--------|
> UDP throughput over IPv6: host to ns | - | - | - | 21.5 |
> UDP RR latency over IPv6: host to ns | - | - | - | 29 |
> |--------|--------|--------|--------|
> UDP throughput over IPv4: host to ns | - | - | - | 26.3 |
> UDP RR latency over IPv4: host to ns | - | - | - | 26 |
> '--------'--------'--------'--------'
>
> which seems to indicate the whole "splicing" thing is pretty much
> useless for UDP (except for that 16 KiB MTU case, but I wonder how
> relevant that is).
>
> If I set UDP_MAX_FRAMES to 1, with a quick workaround for the resulting
> warning in udp_tap_send() (single frame to send, hence single message),
> it gets somewhat weird:
>
> === perf/pasta_udp
> > pasta: throughput and latency (local traffic)
> Throughput in Gbps, latency in µs, one thread at 3.6 GHz, 4 streams
> MTU: | 1500B | 4000B | 16384B | 65535B |
> |--------|--------|--------|--------|
> UDP throughput over IPv6: ns to host | 3.4 | 7.0 | 21.6 | 31.6 |
> UDP RR latency over IPv6: ns to host | - | - | - | 30 |
> |--------|--------|--------|--------|
> UDP throughput over IPv4: ns to host | 3.8 | 7.0 | 22.0 | 32.4 |
> UDP RR latency over IPv4: ns to host | - | - | - | 26 |
> |--------|--------|--------|--------|
> UDP throughput over IPv6: host to ns | - | - | - | 29.3 |
> UDP RR latency over IPv6: host to ns | - | - | - | 31 |
> |--------|--------|--------|--------|
> UDP throughput over IPv4: host to ns | - | - | - | 33.8 |
> UDP RR latency over IPv4: host to ns | - | - | - | 25 |
> '--------'--------'--------'--------'
> ...passed.
>
> > pasta: throughput and latency (traffic via tap)
> Throughput in Gbps, latency in µs, one thread at 3.6 GHz, 4 streams
> MTU: | 1500B | 4000B | 16384B | 65520B |
> |--------|--------|--------|--------|
> UDP throughput over IPv6: ns to host | 4.7 | 10.3 | 16.0 | 24.0 |
> UDP RR latency over IPv6: ns to host | - | - | - | 27 |
> |--------|--------|--------|--------|
> UDP throughput over IPv4: ns to host | 5.6 | 11.4 | 16.0 | 24.0 |
> UDP RR latency over IPv4: ns to host | - | - | - | 26 |
> |--------|--------|--------|--------|
> UDP throughput over IPv6: host to ns | - | - | - | 21.5 |
> UDP RR latency over IPv6: host to ns | - | - | - | 29 |
> |--------|--------|--------|--------|
> UDP throughput over IPv4: host to ns | - | - | - | 28.7 |
> UDP RR latency over IPv4: host to ns | - | - | - | 29 |
> '--------'--------'--------'--------'
>
> ...except for the cases with low MTUs, throughput is significantly
> higher if we read and send one message at a time on the "spliced" path.
>
> Next, I would like to:
>
> - bisect between 32 and 1 for UDP_MAX_FRAMES: maybe 32 affects data
> locality too much, but some lower value would still be beneficial by
> lowering syscall overhead
Ok.
> - try with sendmsg() instead of sendmmsg(), at this point. Looking at
> the kernel, that doesn't seem to make a real difference.
Which sendmmsg() specifically are you looking at changing?
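
For reference, the comparison as I picture it -- a generic sketch,
not tied to any particular call site in our code:

	#define _GNU_SOURCE
	#include <sys/socket.h>

	/* One syscall for the whole batch... */
	static int send_batch(int s, struct mmsghdr *mmh, unsigned int n)
	{
		return sendmmsg(s, mmh, n, MSG_NOSIGNAL);
	}

	/* ...versus one sendmsg() per message */
	static int send_loop(int s, const struct mmsghdr *mmh, unsigned int n)
	{
		unsigned int i;

		for (i = 0; i < n; i++) {
			if (sendmsg(s, &mmh[i].msg_hdr, MSG_NOSIGNAL) < 0)
				return -1;
		}

		return (int)n;
	}
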
> About this series: should we just go ahead and apply it with
> UDP_MAX_FRAMES set to 1 for the time being? It's better than
> the existing situation anyway.
I think that's a good idea - or rather, not setting UDP_MAX_FRAMES to
1, but clamping the batch size to 1 for pasta - I'm pretty sure we
still want the batching for passt. We lose a little bit on
small-packet spliced, but we gain on both tap and large-packet
spliced. This will unblock the dual stack udp stuff and we can
further tune it later.
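
Roughly, borrowing the names from your snippet above (sketch only,
untested):

	n = recvmmsg(ref.r.s, mmh_recv,
		     c->mode == MODE_PASTA ? 1 : UDP_MAX_FRAMES, 0, NULL);
	if (n <= 0)
		return;
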
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson