On Thu, Jun 13, 2024 at 08:21:12PM +0200, Stefano Brivio wrote: > On Wed, 5 Jun 2024 11:39:02 +1000 > David Gibson wrote: > > > udp_sock_handler() takes a number of datagrams from sockets that depending > > on their addresses could be forwarded either to the L2 interface ("tap") > > or to another socket ("spliced"). In the latter case we can also only > > send packets together if they have the same source port, and therefore > > are sent via the same socket. > > > > To reduce the total number of system calls we gather contiguous batches of > > datagrams with the same destination interface and socket where applicable. > > The determination of what the target is is made by udp_mmh_splice_port(). > > It returns the source port for splice packets and -1 for "tap" packets. > > We find batches by looking ahead in our queue until we find a datagram > > whose "splicefrom" port doesn't match the first in our current batch. > > > > udp_mmh_splice_port() is moderately expensive, since it must examine IPv6 > > addresses. But unfortunately we can call it twice on the same datagram: > > once as the (last + 1) entry in one batch (showing that it's not in that > > match, then again as the first entry in the next batch. > > This paragraph took me an embarrassingly long time to grasp, if you > re-spin it would be nice to fix it: > > - "And unfortunately [...]", I guess: otherwise it looks like we're > lucky that udp_mmh_splice_port() is expensive or something like that > (because of the "But" implying contrast). > > I initially assumed "unfortunately" was a typo and tried to > understand why it was a good thing we'd call udp_mmh_splice_port() > twice on the same datagram (faster than calling it on two > datagrams!), then started reading the change and got even more > confused... > > - "(to check that it's not that batch)" ? Yeah, didn't help that there were some other typos in there. Updated referring to these suggestions. > > Avoid this by keeping track of the "splice port" in the metadata structure, > > and filling it in one entry ahead of the one we're currently considering. > > This is a bit subtle, but not that hard. It will also generalise better > > when we have more complex possibilities based on the flow table. > > I guess this is the actual, main reason for this change. :) I should > have read this paragraph first. Yes. In the flow table we'll replace this splicesrc array with an array of flow sidxs, updated in the same way. > > Signed-off-by: David Gibson > > --- > > udp.c | 147 ++++++++++++++++++++++++++++++++++------------------------ > > 1 file changed, 86 insertions(+), 61 deletions(-) > > > > diff --git a/udp.c b/udp.c > > index 7487d2b2..757c10ab 100644 > > --- a/udp.c > > +++ b/udp.c > > @@ -198,6 +198,7 @@ static struct ethhdr udp6_eth_hdr; > > * @ip4h: Pre-filled IPv4 header (except for tot_len and saddr) > > * @taph: Tap backend specific header > > * @s_in: Source socket address, filled in by recvmmsg() > > + * @splicesrc: Source port for splicing, or -1 if not spliceable > > */ > > static struct udp_meta_t { > > struct ipv6hdr ip6h; > > @@ -205,6 +206,7 @@ static struct udp_meta_t { > > struct tap_hdr taph; > > > > union sockaddr_inany s_in; > > + int splicesrc; > > } > > #ifdef __AVX2__ > > __attribute__ ((aligned(32))) > > @@ -492,28 +494,32 @@ static int udp_mmh_splice_port(union udp_epoll_ref uref, > > } > > > > /** > > - * udp_splice_sendfrom() - Send datagrams from given port to given port > > + * udp_splice_send() - Send datagrams from socket to socket > > * @c: Execution context > > * @start: Index of first datagram in udp[46]_l2_buf > > - * @n: Number of datagrams to send > > - * @src: Datagrams will be sent from this port (on origin side) > > - * @dst: Datagrams will be send to this port (on destination side) > > - * @from_pif: pif from which the packet originated > > - * @v6: Send as IPv6? > > - * @allow_new: If true create sending socket if needed, if false discard > > - * if no sending socket is available > > + * @n: Total number of datagrams in udp[46]_l2_buf pool > > + * @dst: Datagrams will be sent to this port (on destination side) > > + * @uref: UDP epoll reference for origin socket > > * @now: Timestamp > > + * > > + * This consumes as many frames as are sendable via a single socket. It > > s/frames/datagrams/ ...or messages. Good point, adjusted. > > + * requires that udp_meta[@start].splicesrc is initialised, and will initialise > > + * udp_meta[].splicesrc for each frame it consumes *and one more* (if present). > > + * > > + * Return: Number of frames sent > > I'd say it's rather the number of datagrams (not frames) we tried to > send. > > In some sense, it's also the number of frames sent _by us_ (well, after > calling sendmmsg(), messages were sent), but we call sendmmsg() > ignoring the result, so this comment might look a bit misleading. Right... I've gone with "Number of datagrams forwarded" how's that? > > */ > > -static void udp_splice_sendfrom(const struct ctx *c, unsigned start, unsigned n, > > - in_port_t src, in_port_t dst, uint8_t from_pif, > > - bool v6, bool allow_new, > > +static unsigned udp_splice_send(const struct ctx *c, size_t start, size_t n, > > + in_port_t dst, union udp_epoll_ref uref, > > const struct timespec *now) > > { > > + in_port_t src = udp_meta[start].splicesrc; > > struct mmsghdr *mmh_recv, *mmh_send; > > - unsigned int i; > > + unsigned int i = start; > > int s; > > > > - if (v6) { > > + ASSERT(udp_meta[start].splicesrc >= 0); > > + > > + if (uref.v6) { > > mmh_recv = udp6_l2_mh_sock; > > mmh_send = udp6_mh_splice; > > } else { > > @@ -521,40 +527,48 @@ static void udp_splice_sendfrom(const struct ctx *c, unsigned start, unsigned n, > > mmh_send = udp4_mh_splice; > > } > > > > - if (from_pif == PIF_SPLICE) { > > + do { > > + mmh_send[i].msg_hdr.msg_iov->iov_len = mmh_recv[i].msg_len; > > + > > + if (++i >= n) > > + break; > > + > > + udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]); > > + } while (udp_meta[i].splicesrc == src); > > I don't have a strong preference, but a for loop like this: > > for (; i < n && udp_meta[i].splicesrc == src; i++) { > mmh_send[i].msg_hdr.msg_iov->iov_len = mmh_recv[i].msg_len; > udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]); This needs to update udp_meta[i+1], not udp_meta[i], and therefore also be conditional on i+1 < n. > } > > if (i++ < n) /* Set splicesrc for first mismatching entry, too */ This needs to be ++i, not i++. > udp_meta[i].splicesrc = udp_mmh_splice_port(uref, &mmh_recv[i]); > > looks a bit more readable to me. Same for udp_tap_send(). At which point I'm not sure it's more readable after all. It also redundantly checks that udp_meta[start].splicesrc == src. Like I said, subtle. Putting the increment in the middle of the loop body, rather than beginning or end, while unusual, was the sanest way I could see to do this. Well, other than havine one pass to set splicesrc[], then another using it, which works but I'm concerned about the effect on cache locality. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson