Date: Thu, 22 Dec 2022 03:37:00 +0100
From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson
Cc: passt-dev@passt.top
Subject: Re: [PATCH 4/8] udp: Receive multiple datagrams at once on the pasta sock->tap path
Message-ID: <20221222033700.3ca2b46e@elisabeth>
References: <20221205081425.2614425-1-david@gibson.dropbear.id.au>
 <20221205081425.2614425-5-david@gibson.dropbear.id.au>
 <20221213234847.6c723ad9@elisabeth>
 <20221214113546.16942d3a@elisabeth>
 <20221220114246.737b0c3e@elisabeth>
Organization: Red Hat

On Wed, 21 Dec 2022 17:00:24 +1100
David Gibson wrote:

> On Tue, Dec 20, 2022 at 11:42:46AM +0100, Stefano Brivio wrote:
> > Sorry for the further delay,
> > 
> > On Wed, 14 Dec 2022 11:35:46 +0100
> > Stefano Brivio wrote:
> > 
> > > On Wed, 14 Dec 2022 12:42:14 +1100
> > > David Gibson wrote:
> > > 
> > > > On Tue, Dec 13, 2022 at 11:48:47PM +0100, Stefano Brivio wrote:
> > > > > Sorry for the long delay here,
> > > > > 
> > > > > On Mon, 5 Dec 2022 19:14:21 +1100
> > > > > David Gibson wrote:
> > > > > 
> > > > > > Usually udp_sock_handler() will receive multiple (up to 32)
> > > > > > datagrams at once, then forward them all to the tap interface.
> > > > > > For unclear reasons, though, when in pasta mode we will only
> > > > > > receive and forward a single datagram at a time. Change it to
> > > > > > receive multiple datagrams at once, like the other paths.
> > > > > 
> > > > > This is explained in the commit message of 6c931118643c ("tcp, udp:
> > > > > Receive batching doesn't pay off when writing single frames to tap").
> > > > > 
> > > > > I think it's worth re-checking the throughput now, as this path is a
> > > > > bit different, but unfortunately I didn't include it in the "perf"
> > > > > tests :( because, at the time I introduced those, I wasn't sure it
> > > > > even made sense to have traffic from the same host directed to the
> > > > > tap device.
> > > > > 
> > > > > The iperf3 runs where I observed this are actually the ones from the
> > > > > Podman demo. Ideally that case should also be checked in the
> > > > > perf/pasta_udp tests.
> > > > 
> > > > Hm, ok.
> > > > 
> > > > > How fundamental is this for the rest of the series? I couldn't find
> > > > > any actual dependency on it, but I might be missing something.
> > > > 
> > > > So the issue is that, prior to this change, in pasta we receive
> > > > multiple frames at once on the splice path, but one frame at a time on
> > > > the tap path. By the end of this series we can't do that any more,
> > > > because we don't know before the recvmmsg() which one we'll be doing.
> > > 
> > > Oh, right, I see. Then let me add this path to the perf/pasta_udp test
> > > and check how relevant this is now; I'll get back to you in a bit.
> > 
> > I was checking the wrong path. With this:
> > 
> > diff --git a/test/perf/pasta_udp b/test/perf/pasta_udp
> > index 27ea724..973c2f4 100644
> > --- a/test/perf/pasta_udp
> > +++ b/test/perf/pasta_udp
> > @@ -31,6 +31,14 @@ report pasta lo_udp 1 __FREQ__
> >  
> >  th	MTU 1500B 4000B 16384B 65535B
> >  
> > +tr	UDP throughput over IPv6: host to ns
> > +nsout	IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
> > +nsout	ADDR6 ip -j -6 addr show|jq -rM '.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global" and .prefixlen == 64).local'
> > +bw	-
> > +bw	-
> > +bw	-
> > +iperf3	BW host ns __ADDR6__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 15G
> > +bw	__BW__ 7.0 9.0
> >  
> >  tr	UDP throughput over IPv6: ns to host
> >  ns	ip link set dev lo mtu 1500
> > diff --git a/test/run b/test/run
> > index e07513f..b53182b 100755
> > --- a/test/run
> > +++ b/test/run
> > @@ -67,6 +67,14 @@ run() {
> >  	test build/clang_tidy
> >  	teardown build
> >  
> > +	VALGRIND=0
> > +	setup passt_in_ns
> > +	test passt/ndp
> > +	test passt/dhcp
> > +	test perf/pasta_udp
> > +	test passt_in_ns/shutdown
> > +	teardown passt_in_ns
> > +
> >  	setup pasta
> >  	test pasta/ndp
> >  	test pasta/dhcp
> 
> Ah, ok. Can we add that to the standard set of tests ASAP, please?

Yes -- that part itself was easy, but now I'm fighting against my own
finest write-only code that generates the JavaScript snippet for the
performance report (perf_fill_lines() in test/lib/perf_report -- and
this is not a suggestion to have a look at it ;)). I'm trying to rework
it a bit together with the "new" test.

> > I get 21.6 gbps after this series, and 29.7 gbps before -- it's quite
> > significant.
> 
> Drat.
> 
> > And there's nothing strange in perf's output, really: the distribution
> > of overhead per function is pretty much the same, but writing multiple
> > messages to the tap device just takes more cycles per message than
> > writing a single one.
> 
> That's so weird.  It should be basically an identical set of write()s,
> except that they happen in a batch, rather than a bit spread out.  I
> guess it has to be some kind of cache locality thing.  I wonder if the
> difference would go away or reverse if we had a way to submit multiple
> frames with a single syscall.

I haven't tried, but to test this, I think we could actually just write
multiple frames in a single call, with subsequent headers and
everything, and the iperf3 server will simply report how many bytes it
received.
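
Something like this is what I have in mind -- completely untested, just
to show the shape of the measurement hack. udp_frame_add() is a made-up
name standing for the existing header-plus-payload assembly, and the
buffer sizing and field names are from the top of my head:

	/* Measurement hack only: build each frame as usual, with its own
	 * Ethernet/IP/UDP headers, but append the frames back to back in
	 * one buffer and issue a single write() to the tap descriptor,
	 * instead of one write() per frame. */
	static char wbuf[UDP_MAX_FRAMES * USHRT_MAX];
	size_t len = 0;
	int i;

	for (i = 0; i < n; i++)
		len += udp_frame_add(c, wbuf + len, mmh_recv + i);

	if (write(c->fd_tap, wbuf, len) < 0)
		perror("single tap write");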

> > I'm a bit ashamed to propose this, but what do you think of something
> > like:
> > 
> > 	if (c->mode == MODE_PASTA) {
> > 		if (recvmmsg(ref.r.s, mmh_recv, 1, 0, NULL) <= 0)
> > 			return;
> > 
> > 		n = 0;
> > 		if (udp_mmh_splice_port(v6, mmh_recv)) {
> > 			n = recvmmsg(ref.r.s, mmh_recv + 1,
> > 				     UDP_MAX_FRAMES - 1, 0, NULL);
> > 		}
> > 
> > 		if (n > 0)
> > 			n++;
> > 		else
> > 			n = 1;
> > 	} else {
> > 		n = recvmmsg(ref.r.s, mmh_recv, UDP_MAX_FRAMES, 0, NULL);
> > 		if (n <= 0)
> > 			return;
> > 	}
> > 
> > ? Other than the inherent ugliness, it looks like a good approximation
> > to me.
> 
> Hmm.  Well, the first question is how much impact going one message at a
> time has on the spliced throughput.  If it's not too bad, then we could
> just always go one at a time for pasta, regardless of splicing.  And we
> could even abstract that difference into the tap backend with a callback
> like tap_batch_size(c).

Right... it used to be significantly worse in the "spliced" case: I
checked that when I did the commit switching to 1 instead of
UDP_MAX_FRAMES in the other case, but I don't have the data. I'll test
this again.
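
Just to make sure I read you right, I guess the callback would boil down
to something like this -- signature and placement invented, of course:

	/* Hypothetical helper: let the tap backend tell the handler how
	 * many messages to receive in one batch, so that udp_sock_handler()
	 * doesn't need to special-case MODE_PASTA itself. */
	static int tap_batch_size(const struct ctx *c)
	{
		if (c->mode == MODE_PASTA)
			return 1;

		return UDP_MAX_FRAMES;
	}

with udp_sock_handler() then simply doing:

	n = recvmmsg(ref.r.s, mmh_recv, tap_batch_size(c), 0, NULL);
	if (n <= 0)
		return;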

-- 
Stefano