From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 13 Jun 2024 20:21:12 +0200
From: Stefano Brivio <sbrivio@redhat.com>
To: David Gibson
Cc: passt-dev@passt.top
Subject: Re: [PATCH 3/4] udp: Rework how we divide queued datagrams between sending methods
Message-ID: <20240613202112.57a059ee@elisabeth>
In-Reply-To: <20240605013903.3694452-4-david@gibson.dropbear.id.au>
References: <20240605013903.3694452-1-david@gibson.dropbear.id.au>
 <20240605013903.3694452-4-david@gibson.dropbear.id.au>
Organization: Red Hat
List-Id: Development discussion and patches for passt

On Wed, 5 Jun 2024 11:39:02 +1000
David Gibson wrote:

> udp_sock_handler() takes a number of datagrams from sockets that depending
> on their addresses could be forwarded either to the L2 interface ("tap")
> or to another socket ("spliced"). In the latter case we can also only
> send packets together if they have the same source port, and therefore
> are sent via the same socket.
>
> To reduce the total number of system calls we gather contiguous batches of
> datagrams with the same destination interface and socket where applicable.
> The determination of what the target is is made by udp_mmh_splice_port().
> It returns the source port for splice packets and -1 for "tap" packets.
> We find batches by looking ahead in our queue until we find a datagram
> whose "splicefrom" port doesn't match the first in our current batch.
>
> udp_mmh_splice_port() is moderately expensive, since it must examine IPv6
> addresses. But unfortunately we can call it twice on the same datagram:
> once as the (last + 1) entry in one batch (showing that it's not in that
> match, then again as the first entry in the next batch.

This paragraph took me an embarrassingly long time to grasp; if you
re-spin, it would be nice to fix it:

- "And unfortunately [...]", I guess: otherwise it looks like we're
  lucky that udp_mmh_splice_port() is expensive or something like that
  (because of the "But" implying contrast). I initially assumed
  "unfortunately" was a typo and tried to understand why it was a good
  thing we'd call udp_mmh_splice_port() twice on the same datagram
  (faster than calling it on two datagrams!), then started reading the
  change and got even more confused...

- "(to check that it's not that batch)" ?

> Avoid this by keeping track of the "splice port" in the metadata structure,
> and filling it in one entry ahead of the one we're currently considering.
> This is a bit subtle, but not that hard. It will also generalise better
> when we have more complex possibilities based on the flow table.

I guess this is the actual, main reason for this change. :) I should
have read this paragraph first.
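By the way, what eventually made the scheme click for me was writing a
toy, standalone version of it -- made-up names, nothing to do with the
actual code: each batch helper consumes entries of one class and
classifies the entry just past its batch, so classify() runs exactly
once per entry:

	#include <stdio.h>

	#define N 8

	static int cls[N];		/* like udp_meta[].splicesrc */

	static int classify(int x)	/* like udp_mmh_splice_port() */
	{
		return x % 2;
	}

	/* Consume entries matching cls[start], filling cls[] one ahead */
	static int send_batch(const int *data, int start, int n)
	{
		int i = start;

		do {
			printf("entry %d -> class %d\n", i, cls[start]);

			if (++i >= n)
				break;

			cls[i] = classify(data[i]);	/* one ahead */
		} while (cls[i] == cls[start]);

		return i - start;	/* batch length */
	}

	int main(void)
	{
		int data[N] = { 1, 3, 5, 2, 4, 7, 9, 6 };
		int i, m;

		cls[0] = classify(data[0]);	/* prime the first entry */
		for (i = 0; i < N; i += m)
			m = send_batch(data, i, N);

		return 0;
	}

Not suggesting to add anything like this anywhere, of course -- it just
helped me convince myself that no entry is ever classified twice.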
>
> Signed-off-by: David Gibson
> ---
>  udp.c | 147 ++++++++++++++++++++++++++++++++++------------------------
>  1 file changed, 86 insertions(+), 61 deletions(-)
>
> diff --git a/udp.c b/udp.c
> index 7487d2b2..757c10ab 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -198,6 +198,7 @@ static struct ethhdr udp6_eth_hdr;
>   * @ip4h:	Pre-filled IPv4 header (except for tot_len and saddr)
>   * @taph:	Tap backend specific header
>   * @s_in:	Source socket address, filled in by recvmmsg()
> + * @splicesrc:	Source port for splicing, or -1 if not spliceable
>   */
>  static struct udp_meta_t {
>  	struct ipv6hdr ip6h;
> @@ -205,6 +206,7 @@ static struct udp_meta_t {
>  	struct tap_hdr taph;
>
>  	union sockaddr_inany s_in;
> +	int splicesrc;
>  }
>  #ifdef __AVX2__
>  __attribute__ ((aligned(32)))
> @@ -492,28 +494,32 @@ static int udp_mmh_splice_port(union udp_epoll_ref uref,
>  }
>
>  /**
> - * udp_splice_sendfrom() - Send datagrams from given port to given port
> + * udp_splice_send() - Send datagrams from socket to socket
>   * @c:		Execution context
>   * @start:	Index of first datagram in udp[46]_l2_buf
> - * @n:		Number of datagrams to send
> - * @src:	Datagrams will be sent from this port (on origin side)
> - * @dst:	Datagrams will be send to this port (on destination side)
> - * @from_pif:	pif from which the packet originated
> - * @v6:		Send as IPv6?
> - * @allow_new:	If true create sending socket if needed, if false discard
> - *		if no sending socket is available
> + * @n:		Total number of datagrams in udp[46]_l2_buf pool
> + * @dst:	Datagrams will be sent to this port (on destination side)
> + * @uref:	UDP epoll reference for origin socket
>   * @now:	Timestamp
> + *
> + * This consumes as many frames as are sendable via a single socket. It

s/frames/datagrams/

...or messages.

> + * requires that udp_meta[@start].splicesrc is initialised, and will initialise
> + * udp_meta[].splicesrc for each frame it consumes *and one more* (if present).
> + *
> + * Return: Number of frames sent

I'd say it's rather the number of datagrams (not frames) we tried to
send. In some sense, it's also the number of frames sent _by us_ (well,
after calling sendmmsg(), messages were sent), but we call sendmmsg()
ignoring the result, so this comment might look a bit misleading.
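That is, if we wanted the comment to be literally accurate, we'd need
something like (just to illustrate the point, not suggesting it for
this series):

	int sent = sendmmsg(s, mmh_send + start, i - start, MSG_NOSIGNAL);

	/* sendmmsg() returns the number of messages actually sent, or -1
	 * on error: that count, rather than i - start, would be "the
	 * number of datagrams sent" */

...and even then, the caller needs i - start (datagrams consumed) to
advance its loop, not the count of what the kernel accepted.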
>   */
> -static void udp_splice_sendfrom(const struct ctx *c, unsigned start, unsigned n,
> -				in_port_t src, in_port_t dst, uint8_t from_pif,
> -				bool v6, bool allow_new,
> +static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n,
> +				in_port_t dst, union udp_epoll_ref uref,
>  				const struct timespec *now)
>  {
> +	in_port_t src = udp_meta[start].splicesrc;
>  	struct mmsghdr *mmh_recv, *mmh_send;
> -	unsigned int i;
> +	unsigned int i = start;
>  	int s;
>
> -	if (v6) {
> +	ASSERT(udp_meta[start].splicesrc >= 0);
> +
> +	if (uref.v6) {
>  		mmh_recv = udp6_l2_mh_sock;
>  		mmh_send = udp6_mh_splice;
>  	} else {
> @@ -521,40 +527,48 @@ static void udp_splice_sendfrom(const struct ctx *c, unsigned start, unsigned n,
>  		mmh_send = udp4_mh_splice;
>  	}
>
> -	if (from_pif == PIF_SPLICE) {
> +	do {
> +		mmh_send[i].msg_hdr.msg_iov->iov_len = mmh_recv[i].msg_len;
> +
> +		if (++i >= n)
> +			break;
> +
> +		udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]);
> +	} while (udp_meta[i].splicesrc == src);

I don't have a strong preference, but a for loop like this:

	for (; i < n && udp_meta[i].splicesrc == src; i++) {
		mmh_send[i].msg_hdr.msg_iov->iov_len = mmh_recv[i].msg_len;

		if (i + 1 < n)	/* fill in splicesrc one entry ahead, too */
			udp_meta[i + 1].splicesrc =
				udp_mmh_splice_port(uref, &mmh_recv[i + 1]);
	}

looks a bit more readable to me (note that the lookahead has to stay one
entry past the current one, so that the loop condition always reads an
initialised value, including for the first mismatching entry). Same for
udp_tap_send().

> +
> +	if (uref.pif == PIF_SPLICE) {
>  		src += c->udp.fwd_in.rdelta[src];
> -		s = udp_splice_init[v6][src].sock;
> -		if (s < 0 && allow_new)
> -			s = udp_splice_new(c, v6, src, false);
> +		s = udp_splice_init[uref.v6][src].sock;
> +		if (s < 0 && uref.orig)
> +			s = udp_splice_new(c, uref.v6, src, false);
>
>  		if (s < 0)
> -			return;
> +			goto out;
>
> -		udp_splice_ns[v6][dst].ts = now->tv_sec;
> -		udp_splice_init[v6][src].ts = now->tv_sec;
> +		udp_splice_ns[uref.v6][dst].ts = now->tv_sec;
> +		udp_splice_init[uref.v6][src].ts = now->tv_sec;
>  	} else {
> -		ASSERT(from_pif == PIF_HOST);
> +		ASSERT(uref.pif == PIF_HOST);
>  		src += c->udp.fwd_out.rdelta[src];
> -		s = udp_splice_ns[v6][src].sock;
> -		if (s < 0 && allow_new) {
> +		s = udp_splice_ns[uref.v6][src].sock;
> +		if (s < 0 && uref.orig) {
>  			struct udp_splice_new_ns_arg arg = {
> -				c, v6, src, -1,
> +				c, uref.v6, src, -1,
>  			};
>
>  			NS_CALL(udp_splice_new_ns, &arg);
>  			s = arg.s;
>  		}
>  		if (s < 0)
> -			return;
> +			goto out;
>
> -		udp_splice_init[v6][dst].ts = now->tv_sec;
> -		udp_splice_ns[v6][src].ts = now->tv_sec;
> +		udp_splice_init[uref.v6][dst].ts = now->tv_sec;
> +		udp_splice_ns[uref.v6][src].ts = now->tv_sec;
>  	}
>
> -	for (i = start; i < start + n; i++)
> -		mmh_send[i].msg_hdr.msg_iov->iov_len = mmh_recv[i].msg_len;
> -
> -	sendmmsg(s, mmh_send + start, n, MSG_NOSIGNAL);
> +	sendmmsg(s, mmh_send + start, i - start, MSG_NOSIGNAL);
> +out:
> +	return i - start;
>  }
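One more thing, while at it -- nothing wrong with the patch, but maybe
worth a comment in the code, because it's easy to miss on a first read:
taking the "out" path means the datagrams are still consumed (the
caller advances by the return value), just silently dropped. That is,
something like:

		if (s < 0)
			goto out;	/* no socket to send to: drop
					 * datagrams, but still report
					 * them as consumed */

...would have saved me a second read.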
>
>  /**
> @@ -687,31 +701,41 @@ static size_t udp_update_hdr6(const struct ctx *c,
>   * udp_tap_send() - Prepare UDP datagrams and send to tap interface
>   * @c:		Execution context
>   * @start:	Index of first datagram in udp[46]_l2_buf pool
> - * @n:		Number of datagrams to send
> - * @dstport:	Destination port number
> - * @v6:		True if using IPv6
> + * @n:		Total number of datagrams in udp[46]_l2_buf pool
> + * @dstport:	Destination port number on destination side
> + * @uref:	UDP epoll reference for origin socket
>   * @now:	Current timestamp
>   *
> - * Return: size of tap frame with headers
> + * This consumes as many frames as are sendable via tap. It requires that
> + * udp_meta[@start].splicesrc is initialised, and will initialise
> + * udp_meta[].splicesrc for each frame it consumes *and one more* (if present).
> + *
> + * Return: Number of frames sent via tap
>   */
> -static void udp_tap_send(const struct ctx *c,
> -			 unsigned int start, unsigned int n,
> -			 in_port_t dstport, bool v6, const struct timespec *now)
> +static unsigned udp_tap_send(const struct ctx *c, size_t start, size_t n,
> +			     in_port_t dstport, union udp_epoll_ref uref,
> +			     const struct timespec *now)
>  {
>  	struct iovec (*tap_iov)[UDP_NUM_IOVS];
> -	size_t i;
> +	struct mmsghdr *mmh_recv;
> +	size_t i = start;
>
> -	if (v6)
> +	ASSERT(udp_meta[start].splicesrc == -1);
> +
> +	if (uref.v6) {
>  		tap_iov = udp6_l2_iov_tap;
> -	else
> +		mmh_recv = udp6_l2_mh_sock;
> +	} else {
> +		mmh_recv = udp4_l2_mh_sock;
>  		tap_iov = udp4_l2_iov_tap;
> +	}
>
> -	for (i = start; i < start + n; i++) {
> +	do {
>  		struct udp_payload_t *bp = &udp_payload[i];
>  		struct udp_meta_t *bm = &udp_meta[i];
>  		size_t l4len;
>
> -		if (v6) {
> +		if (uref.v6) {
>  			l4len = udp_update_hdr6(c, bm, bp, dstport,
>  						udp6_l2_mh_sock[i].msg_len, now);
>  		} else {
> @@ -719,9 +743,15 @@
>  			l4len = udp_update_hdr4(c, bm, bp, dstport,
>  						udp4_l2_mh_sock[i].msg_len, now);
>  		}
>  		tap_iov[i][UDP_IOV_PAYLOAD].iov_len = l4len;
> -	}
>
> -	tap_send_frames(c, &tap_iov[start][0], UDP_NUM_IOVS, n);
> +		if (++i >= n)
> +			break;
> +
> +		udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]);
> +	} while (udp_meta[i].splicesrc == -1);
> +
> +	tap_send_frames(c, &tap_iov[start][0], UDP_NUM_IOVS, i - start);
> +	return i - start;
>  }
>
>  /**
> @@ -770,24 +800,19 @@ void udp_sock_handler(const struct ctx *c, union epoll_ref ref, uint32_t events,
>  	if (n <= 0)
>  		return;
>
> +	/* We divide things into batches based on how we need to send them,
> +	 * determined by udp_meta[i].splicesrc. To avoid either two passes
> +	 * through the array, or recalculating splicesrc for a single entry, we
> +	 * have to populate it one entry *ahead* of the loop counter (if
> +	 * present). So we fill in entry 0 before the loop, then udp_*_send()
> +	 * populate one entry past where they consume.
> +	 */
> +	udp_meta[0].splicesrc = udp_mmh_splice_port(ref.udp, mmh_recv);
>  	for (i = 0; i < n; i += m) {
> -		int splicefrom = -1;
> -
> -		splicefrom = udp_mmh_splice_port(ref.udp, mmh_recv + i);
> -
> -		for (m = 1; i + m < n; m++) {
> -			int p;
> -
> -			p = udp_mmh_splice_port(ref.udp, mmh_recv + i + m);
> -			if (p != splicefrom)
> -				break;
> -		}
> -
> -		if (splicefrom >= 0)
> -			udp_splice_sendfrom(c, i, m, splicefrom, dstport,
> -					    ref.udp.pif, v6, ref.udp.orig, now);
> +		if (udp_meta[i].splicesrc >= 0)
> +			m = udp_splice_send(c, i, n, dstport, ref.udp, now);
>  		else
> -			udp_tap_send(c, i, m, dstport, v6, now);
> +			m = udp_tap_send(c, i, n, dstport, ref.udp, now);
>  	}
>  }
>

The rest of the series looks good to me.

-- 
Stefano