From: David Gibson <david@gibson.dropbear.id.au>
To: Eugenio Perez Martin <eperezma@redhat.com>
Cc: passt-dev@passt.top, jasowang@redhat.com
Subject: Re: [RFC v2 01/11] tap: implement vhost_call_cb
Date: Tue, 29 Jul 2025 10:11:58 +1000 [thread overview]
Message-ID: <aIgRzu152RR_j_Jk@zatzit> (raw)
In-Reply-To: <CAJaqyWc42NLNU5anS=G4Uf52v2e-ZOVix5CFt3Wg=adns32QPw@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 9101 bytes --]
On Mon, Jul 28, 2025 at 06:33:43PM +0200, Eugenio Perez Martin wrote:
> On Wed, Jul 23, 2025 at 8:57 AM David Gibson
> <david@gibson.dropbear.id.au> wrote:
> >
> > On Wed, Jul 09, 2025 at 07:47:38PM +0200, Eugenio Pérez wrote:
[snip]
> > > @@ -310,7 +311,8 @@ loop:
> > >
> > > switch (ref.type) {
> > > case EPOLL_TYPE_TAP_PASTA:
> > > - tap_handler_pasta(&c, eventmask, &now);
> > > + // TODO: Find a better way, maybe reuse the fd.
> > > + // tap_handler_pasta(&c, eventmask, &now);
> >
> > This change seems bogus. We still need this if we want to be able to
> > fall back to tap without vhost, which I think we want to be able to
> > do. I'm also not clear why you need it - if you don't want these
> > events, can't you just not register them (at least not with this event
> > type) in epoll_ctl?
> >
>
> Absolutely, it was commented out for testing purposes but I forgot to note it.
Ok.
> > > break;
> > > case EPOLL_TYPE_TAP_PASST:
> > > tap_handler_passt(&c, eventmask, &now);
> > > @@ -357,6 +359,9 @@ loop:
> > > case EPOLL_TYPE_REPAIR:
> > > repair_handler(&c, eventmask);
> > > break;
> > > + case EPOLL_TYPE_VHOST_CALL:
> > > + tap_vhost_input(&c, ref, &now);
> >
> > I'd suggest calling this tap_handler_vhost() for consistency with
> > tap_handler_pasta() and others. I've also found it usually ends up a
> > bit cleaner to pass the relevant individual members of ref, rather
> > than the whole thing.
> >
>
> I'm ok with tap_handler_vhost, but tap_vhost_input was your
> suggestion :) [1].
Looks.... ah, I see. I was right the first time, sorry. The
difference here is that a vhost call event is always Rx path, whereas
tap_handler_*() handle multiple epoll events that could be Rx or Tx
path. tap_vhost_input() is good.
[snip]
> > > +#define VHOST_VIRTIO 0xAF
> > > +#define VHOST_GET_FEATURES _IOR(VHOST_VIRTIO, 0x00, __u64)
> > > +#define VHOST_SET_FEATURES _IOW(VHOST_VIRTIO, 0x00, __u64)
> > > +#define VHOST_SET_OWNER _IO(VHOST_VIRTIO, 0x01)
> > > +#define VHOST_SET_MEM_TABLE _IOW(VHOST_VIRTIO, 0x03, struct vhost_memory)
> > > +#define VHOST_SET_VRING_NUM _IOW(VHOST_VIRTIO, 0x10, struct vhost_vring_state)
> > > +#define VHOST_SET_VRING_ADDR _IOW(VHOST_VIRTIO, 0x11, struct vhost_vring_addr)
> > > +#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vring_file)
> > > +#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vring_file)
> > > +#define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
> > > +#define VHOST_SET_BACKEND_FEATURES _IOW(VHOST_VIRTIO, 0x25, __u64)
> > > +#define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file)
> >
> > This stuff probably belongs in linux_dep.h (if you can't get it from a
> > system or kernel header).
> >
> > > +#define VHOST_NDESCS (PKT_BUF_BYTES / 65520)
> >
> > I suspect this is wrong. For starters we should definitely be using a
> > define, not open-coded 65520. But more importantly, 65520 is the
> > (default) MTU - the maximum size *of the IP packet*, excluding L2
> > headers. I'm assuming these buffers will also need to contain the L2
> > header, so will need to be larger. Looking at the later code, it
> > looks like these buffers also include the virtio_net_hdr_mrg_rxbuf
> > pseudo-physical header, so will need to be bigger again.
> >
> > I think you'll want a new define for this, which would be
> > L2_MAX_LEN_PASTA plus whatever you need on top of the Ethernet header
> > (virtion_net_hdr_mrg_rxbuf, AFAICT). For now, I'm going to call this
> > L1_MAX_LEN_VHOST, which is not a great name, but I want to refer to it
> > later in this mail. You may also need/want to round it up to
> > 4/8/whatever bytes to avoid poorly aligned buffers.
> >
>
> Very good points, applying it for the next series.
>
> > > +static_assert(!(VHOST_NDESCS & (VHOST_NDESCS - 1)),
> > > + "Number of vhost descs must be a power of two by standard");
> >
> > Relying on this calculation come out to a power of 2 seems kind of
> > fragile. Maybe just round down to a power of 2 instead? Although I
> > guess that could mean a bunch of wasted space.
>
> To me it's the classical kernel include/linux/log2.h:is_power_of_2
> function, but we can replace (or entirely omit) the assert for sure.
The test for a power of 2 is fine (although you introduce an explicit
macro for it later, which should be used here as well). The part I
was concerned about was requiring this calculated value to be a power
of 2, rather than finding a power of 2 that fits.
> > To avoid wasting space if this doesn't come neatly to a power of 2,
> > would it be safe to round _up_ to a power of 2, as long as we never,
> > ever put an index beyond what we actually have space for into the
> > available queue?
>
> All descriptors are exposed to the kernel always in the rx queue, so
> rounding is not possible. This number must be a power of 2 in the
> split case, but I'll be happy if it is just a comment in the final
> version.
Ah, ok. *thinks*.... I withdraw my objection. Rounding down could
lost a *lot* of buffers and could therefore lead to hard to track
performance issues. Forcing it to be a power of 2 in the first place,
and adjusting the buffer size to make that true is a better approach.
>
> > > +static struct {
> > > + /* Number of free descriptors */
> > > + uint16_t num_free;
> > > +
> > > + /* Last used idx processed */
> > > + uint16_t last_used_idx;
> > > +} vqs[2];
> > > +
> > > +static struct vring_desc vring_desc[2][VHOST_NDESCS] __attribute__((aligned(PAGE_SIZE)));
> > > +static union {
> > > + struct vring_avail avail;
> > > + char buf[offsetof(struct vring_avail, ring[VHOST_NDESCS])];
> >
> > The purpose of this idiom of a union with a char buffer based on
> > offsetof isn't obvious to me yet.
>
> It's because vring_avail contains a flexible array member.
>
> In VirtIO the avail ring is defined as:
> struct virtq_avail {
> le16 flags;
> le16 idx;
> le16 ring[ /* Queue Size */ ];
> /* ommiting last member here, as it is not used by pasta */
> };
>
> But the kernel uapi definition of vring_avail is:
> struct vring_avail {
> __virtio16 flags;
> __virtio16 idx;
> __virtio16 ring[];
> };
>
> So I need to allocate enough bytes for ring[]. To me the easiest way
> is by using char buf[offsetof(last_member_of_ring[])].
Ah, ok, now I understand what it's doing at least.
> I would be happy to redefine it for pasta, as we know the size in
> advance, or to hide it in a macro.
Redefining it for our purpose sounds promising to me, but I might have
to see what it looks like.
> > > +} vring_avail_0 __attribute__((aligned(PAGE_SIZE))), vring_avail_1 __attribute__((aligned(PAGE_SIZE)));
> > > +static union {
> > > + struct vring_used used;
> > > + char buf[offsetof(struct vring_used, ring[VHOST_NDESCS])];
> > > +} vring_used_0 __attribute__((aligned(PAGE_SIZE))), vring_used_1 __attribute__((aligned(PAGE_SIZE)));
> >
> > Maybe 2 entry arrays, rather than *_0 and *_1?
> >
>
> Flexible array members again :(.
Right. Yeah, defining our own version or wrapper of vring_used would
be cleaner, I think.
[snip]
> > You should check that there are enough bytes for the L2 header as well
> > as the mrg_rxbuf header - tap_add_packet() doesn't appear to check
> > that itself (maybe it should).
> >
>
> That's right, I thought tap_add_packet did it but now I realize it
> assumes there is.
> Why not check it in tap_add_packet, as all namespace input needs to go
> there? We need to adapt the function signature though.
Yeah, I think it probably should.
> > > + warn("vhost: invalid len %u", len);
> > > + continue;
> > > + }
> > > +
> > > + /* TODO this will break from this moment */
> > > + if (hdr->num_buffers != 1) {
> > > + warn("vhost: Too many buffers %u, %zu bytes should be enough for everybody!", hdr->num_buffers, PKT_BUF_BYTES/VHOST_NDESCS);
> >
> > L1_MAX_LEN_VHOST again.
>
> I'm ok with moving to L1_MAX_LEN_VHOST, but then another bunch of
> places would need L1_MAX_LEN_VHOST*VHOST_NDESCS. Or should we keep the
> two macros in parallel?
We can have macros for both the individual buffer size and the total
buffer size, as long as we make sure they're consistent (either by
construction or with static_assert()s).
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
next prev parent reply other threads:[~2025-07-29 0:32 UTC|newest]
Thread overview: 42+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-07-09 17:47 [RFC v2 00/11] Add vhost-net kernel support Eugenio Pérez
2025-07-09 17:47 ` [RFC v2 01/11] tap: implement vhost_call_cb Eugenio Pérez
2025-07-23 6:56 ` David Gibson
2025-07-28 16:33 ` Eugenio Perez Martin
2025-07-29 0:11 ` David Gibson [this message]
2025-07-09 17:47 ` [RFC v2 02/11] tap: add die() on vhost error Eugenio Pérez
2025-07-23 6:58 ` David Gibson
2025-07-09 17:47 ` [RFC v2 03/11] tap: replace tx tap hdr with virtio_nethdr_mrg_rxbuf Eugenio Pérez
2025-07-24 0:17 ` David Gibson
2025-07-28 16:37 ` Eugenio Perez Martin
2025-07-09 17:47 ` [RFC v2 04/11] tcp: export memory regions to vhost Eugenio Pérez
2025-07-23 7:06 ` David Gibson
2025-07-28 16:41 ` Eugenio Perez Martin
2025-07-29 0:25 ` David Gibson
2025-07-09 17:47 ` [RFC v2 05/11] virtio: Fill .next in tx queue Eugenio Pérez
2025-07-23 7:07 ` David Gibson
2025-07-28 16:44 ` Eugenio Perez Martin
2025-07-09 17:47 ` [RFC v2 06/11] tap: move static iov_sock to tcp_buf_data_from_sock Eugenio Pérez
2025-07-23 7:09 ` David Gibson
2025-07-28 16:43 ` Eugenio Perez Martin
2025-07-29 0:28 ` David Gibson
2025-07-09 17:47 ` [RFC v2 07/11] tap: support tx through vhost Eugenio Pérez
2025-07-24 0:24 ` David Gibson
2025-07-24 14:30 ` Stefano Brivio
2025-07-25 0:23 ` David Gibson
2025-07-09 17:47 ` [RFC v2 08/11] tap: add tap_free_old_xmit Eugenio Pérez
2025-07-24 0:32 ` David Gibson
2025-07-28 16:45 ` Eugenio Perez Martin
2025-07-09 17:47 ` [RFC v2 09/11] tcp: start conversion to circular buffer Eugenio Pérez
2025-07-24 1:03 ` David Gibson
2025-07-28 16:55 ` Eugenio Perez Martin
2025-07-29 0:30 ` David Gibson
2025-07-09 17:47 ` [RFC v2 10/11] tap: add poll(2) to used_idx Eugenio Pérez
2025-07-24 1:20 ` David Gibson
2025-07-28 17:03 ` Eugenio Perez Martin
2025-07-29 0:32 ` David Gibson
2025-07-29 7:04 ` Eugenio Perez Martin
2025-07-30 0:32 ` David Gibson
2025-07-09 17:47 ` [RFC v2 11/11] tcp_buf: adding TCP tx circular buffer Eugenio Pérez
2025-07-24 1:33 ` David Gibson
2025-07-28 17:04 ` Eugenio Perez Martin
2025-07-10 9:46 ` [RFC v2 00/11] Add vhost-net kernel support Eugenio Perez Martin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aIgRzu152RR_j_Jk@zatzit \
--to=david@gibson.dropbear.id.au \
--cc=eperezma@redhat.com \
--cc=jasowang@redhat.com \
--cc=passt-dev@passt.top \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://passt.top/passt
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for IMAP folder(s).