From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 24 Jul 2025 10:24:39 +1000
From: David Gibson
To: Eugenio Pérez
Subject: Re: [RFC v2 07/11] tap: support tx through vhost
References: <20250709174748.3514693-1-eperezma@redhat.com> <20250709174748.3514693-8-eperezma@redhat.com>
In-Reply-To: <20250709174748.3514693-8-eperezma@redhat.com>
Cc: passt-dev@passt.top, jasowang@redhat.com
List-Id: Development discussion and patches for passt

On Wed, Jul 09, 2025 at 07:47:44PM +0200, Eugenio Pérez wrote:
> No users enable vhost right now, just defining the functions.
> 
> The use of the virtqueue is similar to the rx case: pasta fills the
> descriptor table with the packet data it wants to send to the
> namespace.  Each descriptor points to a buffer in memory, with an
> address and a length.  The number of descriptors is again defined by
> VHOST_NDESCS.

Does the number of descriptors have to be equal for the Tx and Rx
queues?  For Rx, the number of descriptors is basically determined by
how many frames we can fit in pkt_buf.  That doesn't really apply to
the Tx path, though; it would be more natural to define the number of
Tx descriptors independently.

> Afterwards it writes the descriptor index into the avail->ring[]
> array, then increments avail->idx to make it visible to the kernel,
> and then kicks the virtqueue 1 eventfd.
> 
> When the kernel does not need the buffer anymore, it writes its id
> into used_ring->ring[] and increments used_ring->idx.  Normally, the
> kernel also notifies pasta through the call eventfd of virtqueue 1,
> but we don't monitor that eventfd.  Instead, we check whether the
> buffers can be reused at the moment we produce, making the code
> simpler and more performant.

Oh, that's pretty neat.

Nit: "performant" is informal; perhaps s/more performant/faster/.

> Like on the rx path, we assume descriptors are used in the same order
> they were made available.  This is also consistent with the behaviour
> seen in QEMU's virtio-net implementation.
> 
> Signed-off-by: Eugenio Pérez
> ---
>  arp.c     |  2 +-
>  tap.c     | 84 +++++++++++++++++++++++++++++++++++++++++++++++--------
>  tap.h     |  4 +--
>  tcp.c     |  2 +-
>  tcp_buf.c |  2 +-
>  udp.c     |  2 +-
>  6 files changed, 79 insertions(+), 17 deletions(-)
> 
> diff --git a/arp.c b/arp.c
> index fc482bb..ea786a0 100644
> --- a/arp.c
> +++ b/arp.c
> @@ -80,7 +80,7 @@ int arp(const struct ctx *c, const struct pool *p)
>  	memcpy(eh->h_dest, eh->h_source, sizeof(eh->h_dest));
>  	memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
>  
> -	tap_send_single(c, eh, l2len);
> +	tap_send_single(c, eh, l2len, false);
>  
>  	return 1;
>  }
> diff --git a/tap.c b/tap.c
> index 5667fbe..7ccac86 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -121,11 +121,19 @@ static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS_IP6, pkt_buf);
>  static_assert(!(VHOST_NDESCS & (VHOST_NDESCS - 1)),
>  	      "Number of vhost descs must be a power of two by standard");
>  static struct {
> +	/* Descriptor index we're using. This is not the same as avail idx in
> +	 * split: this takes into account the chained descs */
> +	uint16_t vring_idx;
> +
>  	/* Number of free descriptors */
>  	uint16_t num_free;
>  
>  	/* Last used idx processed */
>  	uint16_t last_used_idx;
> +
> +	/* Descriptors in use */
> +	/* desc info: number of descriptors in the chain */
> +	uint16_t ndescs[VHOST_NDESCS];
>  } vqs[2];

Do we need all these fields for both vqs?  Or should we have different
structs or a union to store different info for the Rx versus Tx queue?
>  
>  static struct vring_desc vring_desc[2][VHOST_NDESCS] __attribute__((aligned(PAGE_SIZE)));
> @@ -176,7 +184,7 @@ unsigned long tap_l2_max_len(const struct ctx *c)
>   * @data:	Packet buffer
>   * @l2len:	Total L2 packet length
>   */
> -void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
> +void tap_send_single(const struct ctx *c, const void *data, size_t l2len, bool vhost)

This function is explicitly a slow path, so I don't think we need to
offer the option of using vhost.  Likewise with the various wrappers
below.

>  {
>  	uint32_t vnet_len = htonl(l2len);
>  	struct iovec iov[2];
> @@ -192,7 +200,7 @@ void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
>  		iov[iovcnt].iov_len = l2len;
>  		iovcnt++;
>  
> -		tap_send_frames(c, iov, iovcnt, 1);
> +		tap_send_frames(c, iov, iovcnt, 1, vhost);
>  		break;
>  	case MODE_VU:
>  		vu_send_single(c, data, l2len);
> @@ -314,7 +322,7 @@ void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
>  	char *data = tap_push_uh4(uh, src, sport, dst, dport, in, dlen);
>  
>  	memcpy(data, in, dlen);
> -	tap_send_single(c, buf, dlen + (data - buf));
> +	tap_send_single(c, buf, dlen + (data - buf), false);
>  }
>  
>  /**
> @@ -336,7 +344,7 @@ void tap_icmp4_send(const struct ctx *c, struct in_addr src, struct in_addr dst,
>  	memcpy(icmp4h, in, l4len);
>  	csum_icmp4(icmp4h, icmp4h + 1, l4len - sizeof(*icmp4h));
>  
> -	tap_send_single(c, buf, l4len + ((char *)icmp4h - buf));
> +	tap_send_single(c, buf, l4len + ((char *)icmp4h - buf), false);
>  }
>  
>  /**
> @@ -421,7 +429,7 @@ void tap_udp6_send(const struct ctx *c,
>  	char *data = tap_push_uh6(uh, src, sport, dst, dport, in, dlen);
>  
>  	memcpy(data, in, dlen);
> -	tap_send_single(c, buf, dlen + (data - buf));
> +	tap_send_single(c, buf, dlen + (data - buf), false);
>  }
>  
>  /**
> @@ -444,7 +452,7 @@ void tap_icmp6_send(const struct ctx *c,
>  	memcpy(icmp6h, in, l4len);
>  	csum_icmp6(icmp6h, src, dst, icmp6h + 1,
>  		   l4len - sizeof(*icmp6h));
>  
> -	tap_send_single(c, buf, l4len + ((char *)icmp6h - buf));
> +	tap_send_single(c, buf, l4len + ((char *)icmp6h - buf), false);
>  }
>  
>  static void vhost_kick(struct vring_used *used, int kick_fd) {
> @@ -459,8 +467,9 @@ static void vhost_kick(struct vring_used *used, int kick_fd) {
>  	eventfd_write(kick_fd, 1);
>  }
>  
> +
>  /**
> - * tap_send_frames_pasta() - Send multiple frames to the pasta tap
> + * tap_send_frames_vhost() - Send multiple frames to the pasta tap
>   * @c:			Execution context
>   * @iov:		Array of buffers
>   * @bufs_per_frame:	Number of buffers (iovec entries) per frame
> @@ -470,16 +479,68 @@ static void vhost_kick(struct vring_used *used, int kick_fd) {
>   * @bufs_per_frame contiguous buffers representing a single frame.
>   *
>   * Return: number of frames successfully sent
> + */
> +static size_t tap_send_frames_vhost(const struct ctx *c,
> +				    const struct iovec *iov,
> +				    size_t bufs_per_frame, size_t nframes)
> +{
> +	size_t i;
> +
> +	for (i = 0; i < nframes; i++) {
> +		size_t j;
> +
> +		if (vqs[1].num_free < bufs_per_frame)
> +			return i;
> +
> +		vring_avail_1.avail.ring[(vring_avail_1.avail.idx + i) % VHOST_NDESCS] = htole16(vqs[1].vring_idx) % VHOST_NDESCS;

Seems like it would be worth putting ((vring_avail_1.avail.idx + i) %
VHOST_NDESCS) into a local.

> +		vqs[1].ndescs[(vring_avail_1.avail.idx + i) % VHOST_NDESCS] = bufs_per_frame;
> +		vqs[1].num_free -= bufs_per_frame;
> +
> +		for (j = 0; j < bufs_per_frame; ++j) {
> +			struct vring_desc *desc = &vring_desc[1][vqs[1].vring_idx % VHOST_NDESCS];
> +			const struct iovec *iov_i = &iov[i * bufs_per_frame + j];
> +
> +			desc->addr = (uint64_t)iov_i->iov_base;
> +			desc->len = iov_i->iov_len;
> +			desc->flags = (j == bufs_per_frame - 1) ?
> 				      0 : htole16(VRING_DESC_F_NEXT);
> +			vqs[1].vring_idx++;
> +		}
> +	}
> +
> +	smp_wmb();
> +	vring_avail_1.avail.idx = htole16(le16toh(vring_avail_1.avail.idx) + nframes);
> +
> +	vhost_kick(&vring_used_1.used, c->vq[1].kick_fd);
> +
> +	return nframes;
> +}
> +
> +
> +/**
> + * tap_send_frames_pasta() - Send multiple frames to the pasta tap
> + * @c:			Execution context
> + * @iov:		Array of buffers
> + * @bufs_per_frame:	Number of buffers (iovec entries) per frame
> + * @nframes:		Number of frames to send
> + * @vhost:		Use vhost-kernel or not
> + *
> + * @iov must have total length @bufs_per_frame * @nframes, with each set of
> + * @bufs_per_frame contiguous buffers representing a single frame.
> + *
> + * Return: number of frames successfully sent (or queued)
>   *
>   * #syscalls:pasta write
>   */
>  static size_t tap_send_frames_pasta(const struct ctx *c,
>  				    const struct iovec *iov,
> -				    size_t bufs_per_frame, size_t nframes)
> +				    size_t bufs_per_frame, size_t nframes, bool vhost)
>  {
>  	size_t nbufs = bufs_per_frame * nframes;
>  	size_t i;
>  
> +	if (vhost)
> +		return tap_send_frames_vhost(c, iov, bufs_per_frame, nframes);
> +
>  	for (i = 0; i < nbufs; i += bufs_per_frame) {
>  		ssize_t rc = writev(c->fd_tap, iov + i, bufs_per_frame);
>  		size_t framelen = iov_size(iov + i, bufs_per_frame);
> @@ -563,14 +624,15 @@ static size_t tap_send_frames_passt(const struct ctx *c,
>   * @iov:		Array of buffers, each containing one frame (with L2 headers)
>   * @bufs_per_frame:	Number of buffers (iovec entries) per frame
>   * @nframes:		Number of frames to send
> + * @vhost:		Use vhost-kernel or not

Hrm.  I don't love this parameter.  tap_send_frames() is supposed to
more or less abstract away the differences between the different
backends, but here we're adding a tuntap (pasta mode) specific option.
I don't quickly see a way to avoid it though; making this a different
mode / a sub-mode in struct ctx would take away the option of using
vanilla tap for slow paths.
>   *
>   * @iov must have total length @bufs_per_frame * @nframes, with each set of
>   * @bufs_per_frame contiguous buffers representing a single frame.
>   *
> - * Return: number of frames actually sent
> + * Return: number of frames actually sent (or queued)
>   */
>  size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
> -		       size_t bufs_per_frame, size_t nframes)
> +		       size_t bufs_per_frame, size_t nframes, bool vhost)
>  {
>  	size_t m;
>  
> @@ -579,7 +641,7 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
>  
>  	switch (c->mode) {
>  	case MODE_PASTA:
> -		m = tap_send_frames_pasta(c, iov, bufs_per_frame, nframes);
> +		m = tap_send_frames_pasta(c, iov, bufs_per_frame, nframes, vhost);
>  		break;
>  	case MODE_PASST:
>  		m = tap_send_frames_passt(c, iov, bufs_per_frame, nframes);
> diff --git a/tap.h b/tap.h
> index ff8cee5..e924dfb 100644
> --- a/tap.h
> +++ b/tap.h
> @@ -111,9 +111,9 @@ void tap_udp6_send(const struct ctx *c,
>  void tap_icmp6_send(const struct ctx *c,
>  		    const struct in6_addr *src, const struct in6_addr *dst,
>  		    const void *in, size_t l4len);
> -void tap_send_single(const struct ctx *c, const void *data, size_t l2len);
> +void tap_send_single(const struct ctx *c, const void *data, size_t l2len, bool vhost);
>  size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
> -		       size_t bufs_per_frame, size_t nframes);
> +		       size_t bufs_per_frame, size_t nframes, bool vhost);
>  void eth_update_mac(struct ethhdr *eh,
>  		    const unsigned char *eth_d, const unsigned char *eth_s);
>  void tap_listen_handler(struct ctx *c, uint32_t events);
> diff --git a/tcp.c b/tcp.c
> index f43c1e2..05f5b4c 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -1935,7 +1935,7 @@ static void tcp_rst_no_conn(const struct ctx *c, int af,
>  
>  	tcp_update_csum(psum, rsth, &payload);
>  	rst_l2len = ((char *)rsth - buf) + sizeof(*rsth);
> -	tap_send_single(c, buf, rst_l2len);
> +	tap_send_single(c, buf, rst_l2len, false);
>  }
>  
>  /**
> diff --git
> a/tcp_buf.c b/tcp_buf.c
> index 6d79d67..242086d 100644
> --- a/tcp_buf.c
> +++ b/tcp_buf.c
> @@ -141,7 +141,7 @@ void tcp_payload_flush(const struct ctx *c)
>  	size_t m;
>  
>  	m = tap_send_frames(c, &tcp_l2_iov[0][0], TCP_NUM_IOVS,
> -			    tcp_payload_used);
> +			    tcp_payload_used, false);
>  	if (m != tcp_payload_used) {
>  		tcp_revert_seq(c, &tcp_frame_conns[m], &tcp_l2_iov[m],
>  			       tcp_payload_used - m);
> diff --git a/udp.c b/udp.c
> index 65a52e0..d017d99 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -809,7 +809,7 @@ static void udp_buf_sock_to_tap(const struct ctx *c, int s, int n,
>  	for (i = 0; i < n; i++)
>  		udp_tap_prepare(udp_mh_recv, i, toside, false);
>  
> -	tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, n);
> +	tap_send_frames(c, &udp_l2_iov[0][0], UDP_NUM_IOVS, n, false);
>  }
>  
>  /**

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.  http://www.ozlabs.org/~dgibson