Date: Wed, 27 Mar 2024 12:35:25 +1100
From: David Gibson
To: Laurent Vivier
CC: passt-dev@passt.top
Subject: Re: [RFC v4] tcp: Replace TCP buffer structure by an iovec array
References: <20240321102655.2003763-1-lvivier@redhat.com> <91c44be3-7643-407c-a58d-41476a721d59@redhat.com>
In-Reply-To: <91c44be3-7643-407c-a58d-41476a721d59@redhat.com>
List-Id: Development discussion and patches for passt

On Tue, Mar 26, 2024 at 11:19:22AM +0100, Laurent Vivier wrote:
> Hi,
>
> I compared perf result using this patch and a patch changing tap_send_frames_passt() to:
>
> static size_t tap_send_frames_passt(const struct ctx *c,
>                                     const struct iovec *iov,
>                                     size_t bufs_per_frame, size_t nframes)
> {
>         struct msghdr mh = {
>                 .msg_iovlen = bufs_per_frame,
>         };
>         size_t buf_offset;
>         unsigned int i;
>         ssize_t sent;
>
>         for (i = 0; i < nframes; i++) {
>                 unsigned int j;
>
>                 if (bufs_per_frame > 1) {
>                         /* if we have more than 1 iovec, the first one is vnet_len */
>                         uint32_t *p = iov[i * bufs_per_frame].iov_base;
>                         uint32_t vnet_len = 0;
>
>                         for (j = 1; j < bufs_per_frame; j++)
>                                 vnet_len += iov[i * bufs_per_frame + j].iov_len;
>                         vnet_len = htonl(vnet_len);
>
>                         *p = vnet_len;
>                 }
>
>                 mh.msg_iov = (void *)&iov[i * bufs_per_frame];
>
>                 sent = sendmsg(c->fd_tap, &mh, MSG_NOSIGNAL | MSG_DONTWAIT);
>                 if (sent < 0)
>                         return i;
>
>                 /* Check for any partial frames due to short send */
>                 j = iov_skip_bytes(&iov[i * bufs_per_frame], bufs_per_frame,
>                                    sent, &buf_offset);
>
>                 if (buf_offset && j < bufs_per_frame) {
>                         if (write_remainder(c->fd_tap, &iov[i * bufs_per_frame + j],
>                                             bufs_per_frame - j,
>                                             buf_offset) < 0) {
>                                 err("tap: partial frame send: %s",
>                                     strerror(errno));
>                                 return i;
>                         }
>                 }
>         }
>
>         return i;
> }
>
> And the result of 'perf record -e cache-misses' gives:
>
> slow
>
>   83.95%  passt.avx2  passt.avx2  [.] csum_avx2
>    4.39%  passt.avx2  passt.avx2  [.] tap4_handler
>    2.37%  passt.avx2  libc.so.6   [.] __printf_buffer
>    0.84%  passt.avx2  passt.avx2  [.] udp_timer
>
> fast
>
>   22.15%  passt.avx2  passt.avx2  [.] csum_avx2
>   14.91%  passt.avx2  passt.avx2  [.] udp_timer
>    7.60%  passt.avx2  libc.so.6   [.] __printf_buffer
>    5.10%  passt.avx2  passt.avx2  [.] ffsl

Well.. I *guess* that means we're getting more cache misses in the
batched version, as we suspected. I'm a bit mystified as to how to
interpret those percentages, though. Is that the percentage of total
cache misses that occur in that function? The percentage of times
that function generates a cache miss (what if it generates more than
one)? Something else..

If this does indicate many more cache misses computing the checksum,
I'm still a bit baffled as to what's going on. It doesn't quite fit
with the theory I had: the csum_avx2() calls are in the "first loop"
in both these scenarios - my theory would suggest more cache misses in
the "second loop" instead (in the kernel inside sendmsg()).

What happens if you fill in the vnet_len field in the first loop, but
still use a sendmsg() per frame, instead of one batched one?

> From d4b3e12132ceaf5484de215e9c84cbedcbbb8188 Mon Sep 17 00:00:00 2001
> From: Laurent Vivier
> Date: Tue, 19 Mar 2024 18:20:20 +0100
> Subject: [PATCH] tap: compute vnet_len inside tap_send_frames_passt()
>
> Signed-off-by: Laurent Vivier
> ---
>  tap.c | 49 +++++++++++++++++++++++++++++++++----------------
>  tcp.c | 39 ++++++++++-----------------------------
>  2 files changed, 43 insertions(+), 45 deletions(-)
>
> diff --git a/tap.c b/tap.c
> index 13e4da79d690..1096272b411a 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -74,7 +74,7 @@ static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS, pkt_buf);
>   */
>  void tap_send_single(const struct ctx *c, const void *data, size_t len)
>  {
> -	uint32_t vnet_len = htonl(len);
> +	uint32_t vnet_len;
>  	struct iovec iov[2];
>  	size_t iovcnt = 0;
>  
> @@ -365,34 +365,51 @@ static size_t tap_send_frames_passt(const struct ctx *c,
>  				    const struct iovec *iov,
>  				    size_t bufs_per_frame, size_t nframes)
>  {
> -	size_t nbufs = bufs_per_frame * nframes;
>  	struct msghdr mh = {
> -		.msg_iov = (void *)iov,
> -		.msg_iovlen = nbufs,
> +		.msg_iovlen = bufs_per_frame,
>  	};
>  	size_t buf_offset;
>  	unsigned int i;
>  	ssize_t sent;
>  
> -	sent = sendmsg(c->fd_tap, &mh, MSG_NOSIGNAL | MSG_DONTWAIT);
> -	if (sent < 0)
> -		return 0;
> +	for (i = 0; i < nframes; i++) {
> +		unsigned int j;
> +
> +		if (bufs_per_frame > 1) {
> +			/* if we have more than one iovec, the first one is
> +			 * vnet_len
> +			 */
> +			uint32_t *p = iov[i * bufs_per_frame].iov_base;
> +			uint32_t vnet_len = 0;
>  
> -	/* Check for any partial frames due to short send */
> -	i = iov_skip_bytes(iov, nbufs, sent, &buf_offset);
> +			for (j = 1; j < bufs_per_frame; j++)
> +				vnet_len += iov[i * bufs_per_frame + j].iov_len;
> +			vnet_len = htonl(vnet_len);
> +
> +			*p = vnet_len;
> +		}
>  
> -	if (i < nbufs && (buf_offset || (i % bufs_per_frame))) {
> -		/* Number of unsent or partially sent buffers for the frame */
> -		size_t rembufs = bufs_per_frame - (i % bufs_per_frame);
> +		mh.msg_iov = (void *)&iov[i * bufs_per_frame];
>  
> -		if (write_remainder(c->fd_tap, &iov[i], rembufs, buf_offset) < 0) {
> -			err("tap: partial frame send: %s", strerror(errno));
> +		sent = sendmsg(c->fd_tap, &mh, MSG_NOSIGNAL | MSG_DONTWAIT);
> +		if (sent < 0)
>  			return i;
> +
> +		/* Check for any partial frames due to short send */
> +		j = iov_skip_bytes(&iov[i * bufs_per_frame], bufs_per_frame, sent, &buf_offset);
> +
> +		if (buf_offset && j < bufs_per_frame) {
> +			if (write_remainder(c->fd_tap, &iov[i * bufs_per_frame + j],
> +					    bufs_per_frame - j,
> +					    buf_offset) < 0) {
> +				err("tap: partial frame send: %s",
> +				    strerror(errno));
> +				return i;
> +			}
>  		}
> -		i += rembufs;
>  	}
>  
> -	return i / bufs_per_frame;
> +	return i;
>  }
>  
>  /**
> diff --git a/tcp.c b/tcp.c
> index cc705064f059..d147e2c41648 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -443,10 +443,11 @@ struct tcp_flags_t {
>  } __attribute__ ((packed, aligned(__alignof__(unsigned int))));
>  #endif
>  
> +static uint32_t vnet_len;
> +
>  /* Ethernet header for IPv4 frames */
>  static struct ethhdr tcp4_eth_src;
>  
> -static uint32_t tcp4_payload_vnet_len[TCP_FRAMES_MEM];
>  /* IPv4 headers */
>  static struct iphdr tcp4_payload_ip[TCP_FRAMES_MEM];
>  /* TCP headers and data for IPv4 frames */
> @@ -457,7 +458,6 @@ static_assert(MSS4 <= sizeof(tcp4_payload[0].data), "MSS4 is greater than 65516"
>  static struct tcp_buf_seq_update tcp4_seq_update[TCP_FRAMES_MEM];
>  static unsigned int tcp4_payload_used;
>  
> -static uint32_t tcp4_flags_vnet_len[TCP_FRAMES_MEM];
>  /* IPv4 headers for TCP option flags frames */
>  static struct iphdr tcp4_flags_ip[TCP_FRAMES_MEM];
>  /* TCP headers and option flags for IPv4 frames */
> @@ -468,7 +468,6 @@ static unsigned int tcp4_flags_used;
>  /* Ethernet header for IPv6 frames */
>  static struct ethhdr tcp6_eth_src;
>  
> -static uint32_t tcp6_payload_vnet_len[TCP_FRAMES_MEM];
>  /* IPv6 headers */
>  static struct ipv6hdr tcp6_payload_ip[TCP_FRAMES_MEM];
>  /* TCP headers and data for IPv6 frames */
> @@ -479,7 +478,6 @@ static_assert(MSS6 <= sizeof(tcp6_payload[0].data), "MSS6 is greater than 65516"
>  static struct tcp_buf_seq_update tcp6_seq_update[TCP_FRAMES_MEM];
>  static unsigned int tcp6_payload_used;
>  
> -static uint32_t tcp6_flags_vnet_len[TCP_FRAMES_MEM];
>  /* IPv6 headers for TCP option flags frames */
>  static struct ipv6hdr tcp6_flags_ip[TCP_FRAMES_MEM];
>  /* TCP headers and option flags for IPv6 frames */
> @@ -944,9 +942,8 @@ static void tcp_sock4_iov_init(const struct ctx *c)
>  
>  		/* iovecs */
>  		iov = tcp4_l2_iov[i];
> -		iov[TCP_IOV_TAP].iov_base = &tcp4_payload_vnet_len[i];
> -		iov[TCP_IOV_TAP].iov_len = c->mode == MODE_PASST ?
> -					   sizeof(tcp4_payload_vnet_len[i]) : 0;
> +		iov[TCP_IOV_TAP].iov_base = &vnet_len;
> +		iov[TCP_IOV_TAP].iov_len = sizeof(vnet_len);
>  		iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
>  		iov[TCP_IOV_ETH].iov_len = sizeof(tcp4_eth_src);
>  		iov[TCP_IOV_IP].iov_base = &tcp4_payload_ip[i];
> @@ -954,9 +951,8 @@ static void tcp_sock4_iov_init(const struct ctx *c)
>  		iov[TCP_IOV_PAYLOAD].iov_base = &tcp4_payload[i];
>  
>  		iov = tcp4_l2_flags_iov[i];
> -		iov[TCP_IOV_TAP].iov_base = &tcp4_flags_vnet_len[i];
> -		iov[TCP_IOV_TAP].iov_len = c->mode == MODE_PASST ?
> -					   sizeof(tcp4_flags_vnet_len[i]) : 0;
> +		iov[TCP_IOV_TAP].iov_base = &vnet_len;
> +		iov[TCP_IOV_TAP].iov_len = sizeof(vnet_len);
>  		iov[TCP_IOV_ETH].iov_base = &tcp4_eth_src;
>  		iov[TCP_IOV_ETH].iov_len = sizeof(tcp4_eth_src);
>  		iov[TCP_IOV_IP].iov_base = &tcp4_flags_ip[i];
> @@ -989,9 +985,8 @@ static void tcp_sock6_iov_init(const struct ctx *c)
>  
>  		/* iovecs */
>  		iov = tcp6_l2_iov[i];
> -		iov[TCP_IOV_TAP].iov_base = &tcp6_payload_vnet_len[i];
> -		iov[TCP_IOV_TAP].iov_len = c->mode == MODE_PASST ?
> -					   sizeof(tcp6_payload_vnet_len[i]) : 0;
> +		iov[TCP_IOV_TAP].iov_base = &vnet_len;
> +		iov[TCP_IOV_TAP].iov_len = sizeof(vnet_len);
>  		iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
>  		iov[TCP_IOV_ETH].iov_len = sizeof(tcp6_eth_src);
>  		iov[TCP_IOV_IP].iov_base = &tcp6_payload_ip[i];
> @@ -999,9 +994,8 @@ static void tcp_sock6_iov_init(const struct ctx *c)
>  		iov[TCP_IOV_PAYLOAD].iov_base = &tcp6_payload[i];
>  
>  		iov = tcp6_l2_flags_iov[i];
> -		iov[TCP_IOV_TAP].iov_base = &tcp6_flags_vnet_len[i];
> -		iov[TCP_IOV_TAP].iov_len = c->mode == MODE_PASST ?
> -					   sizeof(tcp6_flags_vnet_len[i]) : 0;
> +		iov[TCP_IOV_TAP].iov_base = &vnet_len;
> +		iov[TCP_IOV_TAP].iov_len = sizeof(vnet_len);
>  		iov[TCP_IOV_ETH].iov_base = &tcp6_eth_src;
>  		iov[TCP_IOV_ETH].iov_len = sizeof(tcp6_eth_src);
>  		iov[TCP_IOV_IP].iov_base = &tcp6_flags_ip[i];
> @@ -1558,7 +1552,6 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  	struct tcp_info tinfo = { 0 };
>  	socklen_t sl = sizeof(tinfo);
>  	int s = conn->sock;
> -	uint32_t vnet_len;
>  	size_t optlen = 0;
>  	struct tcphdr *th;
>  	struct iovec *iov;
> @@ -1587,10 +1580,8 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  
>  	if (CONN_V4(conn)) {
>  		iov = tcp4_l2_flags_iov[tcp4_flags_used++];
> -		vnet_len = sizeof(struct ethhdr) + sizeof(struct iphdr);
>  	} else {
>  		iov = tcp6_l2_flags_iov[tcp6_flags_used++];
> -		vnet_len = sizeof(struct ethhdr) + sizeof(struct ipv6hdr);
>  	}
>  
>  	payload = iov[TCP_IOV_PAYLOAD].iov_base;
> @@ -1649,8 +1640,6 @@ static int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  					 conn->seq_to_tap);
>  	iov[TCP_IOV_PAYLOAD].iov_len = ip_len;
>  
> -	*(uint32_t *)iov[TCP_IOV_TAP].iov_base = htonl(vnet_len + ip_len);
> -
>  	if (th->ack) {
>  		if (SEQ_GE(conn->seq_ack_to_tap, conn->seq_from_tap))
>  			conn_flag(c, conn, ~ACK_TO_TAP_DUE);
> @@ -2150,10 +2139,6 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>  			ip_len = tcp_l2_buf_fill_headers(c, conn, iov, plen, check,
>  							 seq);
>  			iov[TCP_IOV_PAYLOAD].iov_len = ip_len;
> -			*(uint32_t *)iov[TCP_IOV_TAP].iov_base =
> -				htonl(sizeof(struct ethhdr) +
> -				      sizeof(struct iphdr) +
> -				      ip_len);
>  			if (tcp4_payload_used > TCP_FRAMES_MEM - 1)
>  				tcp_payload_flush(c);
>  		} else if (CONN_V6(conn)) {
> @@ -2163,10 +2148,6 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
>  			iov = tcp6_l2_iov[tcp6_payload_used++];
>  			ip_len = tcp_l2_buf_fill_headers(c, conn, iov, plen, NULL, seq);
>  			iov[TCP_IOV_PAYLOAD].iov_len = ip_len;
> -			*(uint32_t *)iov[TCP_IOV_TAP].iov_base =
> -				htonl(sizeof(struct ethhdr) +
> -				      sizeof(struct ipv6hdr) +
> -				      ip_len);
>  			if (tcp6_payload_used > TCP_FRAMES_MEM - 1)
>  				tcp_payload_flush(c);
>  		}

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you. NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
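
For reference, the experiment asked about in this message (fill in the
vnet_len field while the frames are being built, then still issue one
sendmsg() per frame) might look roughly like the sketch below. It is a
standalone illustration, not passt code: frame_fill_vnet_len() and
send_frames_unbatched() are hypothetical names, the frame layout (a
4-byte length slot followed by the data buffers) is assumed from the
quoted patch, and a short send is simply treated as a failure here
rather than completed with write_remainder().

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper (not from passt): "first loop" step. Store the
 * big-endian total length of frame[1..] in the 4-byte slot that
 * frame[0] points to. Returns that length in host byte order.
 */
static uint32_t frame_fill_vnet_len(const struct iovec *frame,
				    size_t bufs_per_frame)
{
	uint32_t len = 0, be;
	size_t j;

	assert(bufs_per_frame > 1 && frame[0].iov_len == sizeof(be));

	for (j = 1; j < bufs_per_frame; j++)
		len += frame[j].iov_len;

	be = htonl(len);
	memcpy(frame[0].iov_base, &be, sizeof(be));
	return len;
}

/* Hypothetical helper (not from passt): "second loop" step. One
 * sendmsg() per frame, with the length slots already filled in.
 * Simplification: a short send counts as a failure.
 * Returns the number of frames sent completely.
 */
static size_t send_frames_unbatched(int fd, const struct iovec *iov,
				    size_t bufs_per_frame, size_t nframes)
{
	struct msghdr mh = { .msg_iovlen = bufs_per_frame };
	size_t i, j;

	for (i = 0; i < nframes; i++) {
		const struct iovec *frame = &iov[i * bufs_per_frame];
		size_t want = 0;
		ssize_t sent;

		for (j = 0; j < bufs_per_frame; j++)
			want += frame[j].iov_len;

		mh.msg_iov = (struct iovec *)frame;
		sent = sendmsg(fd, &mh, MSG_NOSIGNAL | MSG_DONTWAIT);
		if (sent < 0 || (size_t)sent != want)
			break;
	}

	return i;
}
```

Keeping the two steps in separate functions means the only difference
from the batched variant is the number of sendmsg() calls, which should
make it easier to attribute any change in the perf numbers.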