From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au Authentication-Results: passt.top; dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202506 header.b=w6R1neFs; dkim-atps=neutral Received: from mail.ozlabs.org (gandalf.ozlabs.org [150.107.74.76]) by passt.top (Postfix) with ESMTPS id 849265A0278 for ; Tue, 29 Jul 2025 02:32:55 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gibson.dropbear.id.au; s=202506; t=1753748965; bh=UQLX/6t7oaadp8QFZJB6z1URS5dVMr91wZFqs785GQQ=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=w6R1neFsoEXhkgx0SCxL7YNEYXg2jRl3eB3T4AQVelKfCef/WwN/NTMDUAg8HuwUh II7Ooq+Y7UbV4uTfGT7ZaF7W/dGmwJ92i0OUYBxTrv6xy9Y23EV5izyLq5EWVFolTu 5OjFPSeYslk47HuPOIwLTYCZheb7diZ+VHPqjon8T4X2XKVbbEPUEc8C5tmKoTLuAS 0HXBZWDKHzffiQMNR4joogxx/PDzBNm71XYMOrD9unFagNsLzCRAcAU8FUTyGoXPLV 8Te+pb+HfYPVE9rNguO7Dy8i4CB6I5nEDY9O95n9YZmtYnoYEUk4VR1c7xzuTENII7 0XFVRQxOsc6mA== Received: by gandalf.ozlabs.org (Postfix, from userid 1007) id 4brbnd2cqkz4wb0; Tue, 29 Jul 2025 10:29:25 +1000 (AEST) Date: Tue, 29 Jul 2025 10:11:58 +1000 From: David Gibson To: Eugenio Perez Martin Subject: Re: [RFC v2 01/11] tap: implement vhost_call_cb Message-ID: References: <20250709174748.3514693-1-eperezma@redhat.com> <20250709174748.3514693-2-eperezma@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="gbrL8p4LdLvgZCXx" Content-Disposition: inline In-Reply-To: Message-ID-Hash: VTENIQMU7UYSSZXWX6UQNV6YLYQIHNPW X-Message-ID-Hash: VTENIQMU7UYSSZXWX6UQNV6YLYQIHNPW X-MailFrom: dgibson@gandalf.ozlabs.org X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: passt-dev@passt.top, jasowang@redhat.com X-Mailman-Version: 3.3.8 Precedence: list List-Id: Development discussion and patches for passt Archived-At: Archived-At: List-Archive: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --gbrL8p4LdLvgZCXx Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Mon, Jul 28, 2025 at 06:33:43PM +0200, Eugenio Perez Martin wrote: > On Wed, Jul 23, 2025 at 8:57=E2=80=AFAM David Gibson > wrote: > > > > On Wed, Jul 09, 2025 at 07:47:38PM +0200, Eugenio P=C3=A9rez wrote: [snip] > > > @@ -310,7 +311,8 @@ loop: > > > > > > switch (ref.type) { > > > case EPOLL_TYPE_TAP_PASTA: > > > - tap_handler_pasta(&c, eventmask, &now); > > > + // TODO: Find a better way, maybe reuse the fd. > > > + // tap_handler_pasta(&c, eventmask, &now); > > > > This change seems bogus. We still need this if we want to be able to > > fall back to tap without vhost, which I think we want to be able to > > do. I'm also not clear why you need it - if you don't want these > > events, can't you just not register them (at least not with this event > > type) in epoll_ctl? > > >=20 > Absolutely, it was commented out for testing purposes but I forgot to not= e it. Ok. > > > break; > > > case EPOLL_TYPE_TAP_PASST: > > > tap_handler_passt(&c, eventmask, &now); > > > @@ -357,6 +359,9 @@ loop: > > > case EPOLL_TYPE_REPAIR: > > > repair_handler(&c, eventmask); > > > break; > > > + case EPOLL_TYPE_VHOST_CALL: > > > + tap_vhost_input(&c, ref, &now); > > > > I'd suggest calling this tap_handler_vhost() for consistency with > > tap_handler_pasta() and others. I've also found it usually ends up a > > bit cleaner to pass the relevant individual members of ref, rather > > than the whole thing. > > >=20 > I'm ok with tap_handler_vhost, but tap_vhost_input was your > suggestion :) [1]. Looks.... ah, I see. I was right the first time, sorry. The difference here is that a vhost call event is always Rx path, whereas tap_handler_*() handle multiple epoll events that could be Rx or Tx path. tap_vhost_input() is good. [snip] > > > +#define VHOST_VIRTIO 0xAF > > > +#define VHOST_GET_FEATURES _IOR(VHOST_VIRTIO, 0x00, __u64) > > > +#define VHOST_SET_FEATURES _IOW(VHOST_VIRTIO, 0x00, __u64) > > > +#define VHOST_SET_OWNER _IO(VHOST_VIRTIO, 0x01) > > > +#define VHOST_SET_MEM_TABLE _IOW(VHOST_VIRTIO, 0x03, struct vhost_m= emory) > > > +#define VHOST_SET_VRING_NUM _IOW(VHOST_VIRTIO, 0x10, struct vhost_v= ring_state) > > > +#define VHOST_SET_VRING_ADDR _IOW(VHOST_VIRTIO, 0x11, struct vhost_v= ring_addr) > > > +#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_v= ring_file) > > > +#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_v= ring_file) > > > +#define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_v= ring_file) > > > +#define VHOST_SET_BACKEND_FEATURES _IOW(VHOST_VIRTIO, 0x25, __u64) > > > +#define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_= vring_file) > > > > This stuff probably belongs in linux_dep.h (if you can't get it from a > > system or kernel header). > > > > > +#define VHOST_NDESCS (PKT_BUF_BYTES / 65520) > > > > I suspect this is wrong. For starters we should definitely be using a > > define, not open-coded 65520. But more importantly, 65520 is the > > (default) MTU - the maximum size *of the IP packet*, excluding L2 > > headers. I'm assuming these buffers will also need to contain the L2 > > header, so will need to be larger. Looking at the later code, it > > looks like these buffers also include the virtio_net_hdr_mrg_rxbuf > > pseudo-physical header, so will need to be bigger again. > > > > I think you'll want a new define for this, which would be > > L2_MAX_LEN_PASTA plus whatever you need on top of the Ethernet header > > (virtion_net_hdr_mrg_rxbuf, AFAICT). For now, I'm going to call this > > L1_MAX_LEN_VHOST, which is not a great name, but I want to refer to it > > later in this mail. You may also need/want to round it up to > > 4/8/whatever bytes to avoid poorly aligned buffers. > > >=20 > Very good points, applying it for the next series. >=20 > > > +static_assert(!(VHOST_NDESCS & (VHOST_NDESCS - 1)), > > > + "Number of vhost descs must be a power of two = by standard"); > > > > Relying on this calculation come out to a power of 2 seems kind of > > fragile. Maybe just round down to a power of 2 instead? Although I > > guess that could mean a bunch of wasted space. >=20 > To me it's the classical kernel include/linux/log2.h:is_power_of_2 > function, but we can replace (or entirely omit) the assert for sure. The test for a power of 2 is fine (although you introduce an explicit macro for it later, which should be used here as well). The part I was concerned about was requiring this calculated value to be a power of 2, rather than finding a power of 2 that fits. > > To avoid wasting space if this doesn't come neatly to a power of 2, > > would it be safe to round _up_ to a power of 2, as long as we never, > > ever put an index beyond what we actually have space for into the > > available queue? >=20 > All descriptors are exposed to the kernel always in the rx queue, so > rounding is not possible. This number must be a power of 2 in the > split case, but I'll be happy if it is just a comment in the final > version. Ah, ok. *thinks*.... I withdraw my objection. Rounding down could lost a *lot* of buffers and could therefore lead to hard to track performance issues. Forcing it to be a power of 2 in the first place, and adjusting the buffer size to make that true is a better approach. >=20 > > > +static struct { > > > + /* Number of free descriptors */ > > > + uint16_t num_free; > > > + > > > + /* Last used idx processed */ > > > + uint16_t last_used_idx; > > > +} vqs[2]; > > > + > > > +static struct vring_desc vring_desc[2][VHOST_NDESCS] __attribute__((= aligned(PAGE_SIZE))); > > > +static union { > > > + struct vring_avail avail; > > > + char buf[offsetof(struct vring_avail, ring[VHOST_NDESCS])]; > > > > The purpose of this idiom of a union with a char buffer based on > > offsetof isn't obvious to me yet. >=20 > It's because vring_avail contains a flexible array member. >=20 > In VirtIO the avail ring is defined as: > struct virtq_avail { > le16 flags; > le16 idx; > le16 ring[ /* Queue Size */ ]; > /* ommiting last member here, as it is not used by pasta */ > }; >=20 > But the kernel uapi definition of vring_avail is: > struct vring_avail { > __virtio16 flags; > __virtio16 idx; > __virtio16 ring[]; > }; >=20 > So I need to allocate enough bytes for ring[]. To me the easiest way > is by using char buf[offsetof(last_member_of_ring[])]. Ah, ok, now I understand what it's doing at least. > I would be happy to redefine it for pasta, as we know the size in > advance, or to hide it in a macro. Redefining it for our purpose sounds promising to me, but I might have to see what it looks like. > > > +} vring_avail_0 __attribute__((aligned(PAGE_SIZE))), vring_avail_1 _= _attribute__((aligned(PAGE_SIZE))); > > > +static union { > > > + struct vring_used used; > > > + char buf[offsetof(struct vring_used, ring[VHOST_NDESCS])]; > > > +} vring_used_0 __attribute__((aligned(PAGE_SIZE))), vring_used_1 __a= ttribute__((aligned(PAGE_SIZE))); > > > > Maybe 2 entry arrays, rather than *_0 and *_1? > > >=20 > Flexible array members again :(. Right. Yeah, defining our own version or wrapper of vring_used would be cleaner, I think. [snip] > > You should check that there are enough bytes for the L2 header as well > > as the mrg_rxbuf header - tap_add_packet() doesn't appear to check > > that itself (maybe it should). > > >=20 > That's right, I thought tap_add_packet did it but now I realize it > assumes there is. > Why not check it in tap_add_packet, as all namespace input needs to go > there? We need to adapt the function signature though. Yeah, I think it probably should. > > > + warn("vhost: invalid len %u", len); > > > + continue; > > > + } > > > + > > > + /* TODO this will break from this moment */ > > > + if (hdr->num_buffers !=3D 1) { > > > + warn("vhost: Too many buffers %u, %zu bytes sho= uld be enough for everybody!", hdr->num_buffers, PKT_BUF_BYTES/VHOST_NDESCS= ); > > > > L1_MAX_LEN_VHOST again. >=20 > I'm ok with moving to L1_MAX_LEN_VHOST, but then another bunch of > places would need L1_MAX_LEN_VHOST*VHOST_NDESCS. Or should we keep the > two macros in parallel? We can have macros for both the individual buffer size and the total buffer size, as long as we make sure they're consistent (either by construction or with static_assert()s). --=20 David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson --gbrL8p4LdLvgZCXx Content-Type: application/pgp-signature; name=signature.asc -----BEGIN PGP SIGNATURE----- iQIzBAEBCgAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmiIEc4ACgkQzQJF27ox 2Gdi2A/+I4ztCnTe0TPPmTm4qFGW7G/28UCAcW2jkiZCj9S2icdFfYxcimfxgp/E OrvBMXcntUdqdo9fBNqyCkr2te7PL7v67qI810Ol+Sz0MTpgSR3IaM0J/5ZOH+Kq O7RevCQtrbiZ1D0LlWnGyTDNYqcWU+q+nNHwlMD544l3aE0gocblM5kiEhQGJTxh cw94AWydJvr5ks5nmVwBdeiXvJDom90TL4aXpcLAhxwTm8fu6Dg/Cxgc0PhNRQGp zQeWAeMeBECc0qjKJOkf/VH8WOkE4JCszNBy+m8PWc2CiV+bdBftx0gAxzDeVYCc FbF2QiT8jLQKcpZU9+e+A2PR3t31WBLxsytK1F0AKC/Onrs0Uh6eU2+BC8lmSQFZ 0ihRHK1FOdpNxQCUjnyc82tO4Qf4ObLOlX45vqMPbp7BOx6GdDZjUHC4aluR4EHF Ap5W8aZsHCUiMHo06j5+KB0M8POHmozdopISfGZq9ibDPG4CHpfQvX3j/Tz+zHSs oSQmyoqgOUd9FJt3KpqQNci1Gan6tVov/5+slWFfYApvkXxyb7m42471zd98sGXX yBmV+DFTVL8y58e+dOOMvuWFahrnkVzi1chVXm3OMUIw8IVju7NaJyeYL+0SFHVB E0E5zksNuoZdb+L6j4ubNHNiqnRoiHc4qAsDgft/ximY77oPUz8= =KRhh -----END PGP SIGNATURE----- --gbrL8p4LdLvgZCXx--