On Mon, Jul 28, 2025 at 06:33:43PM +0200, Eugenio Perez Martin wrote:
> On Wed, Jul 23, 2025 at 8:57 AM David Gibson
> <david@gibson.dropbear.id.au> wrote:
> >
> > On Wed, Jul 09, 2025 at 07:47:38PM +0200, Eugenio Pérez wrote:
[snip]
> > > @@ -310,7 +311,8 @@ loop:
> > >
> > >               switch (ref.type) {
> > >               case EPOLL_TYPE_TAP_PASTA:
> > > -                     tap_handler_pasta(&c, eventmask, &now);
> > > +                     // TODO: Find a better way, maybe reuse the fd.
> > > +                     // tap_handler_pasta(&c, eventmask, &now);
> >
> > This change seems bogus.  We still need this if we want to be able to
> > fall back to tap without vhost, which I think we want to be able to
> > do.  I'm also not clear why you need it - if you don't want these
> > events, can't you just not register them (at least not with this event
> > type) in epoll_ctl?
> >
> 
> Absolutely, it was commented out for testing purposes but I forgot to note it.

Ok.

> > >                       break;
> > >               case EPOLL_TYPE_TAP_PASST:
> > >                       tap_handler_passt(&c, eventmask, &now);
> > > @@ -357,6 +359,9 @@ loop:
> > >               case EPOLL_TYPE_REPAIR:
> > >                       repair_handler(&c, eventmask);
> > >                       break;
> > > +             case EPOLL_TYPE_VHOST_CALL:
> > > +                     tap_vhost_input(&c, ref, &now);
> >
> > I'd suggest calling this tap_handler_vhost() for consistency with
> > tap_handler_pasta() and others.  I've also found it usually ends up a
> > bit cleaner to pass the relevant individual members of ref, rather
> > than the whole thing.
> >
> 
> I'm ok with tap_handler_vhost, but tap_vhost_input was your
> suggestion :) [1].

Looks.... ah, I see.  I was right the first time, sorry.  The
difference here is that a vhost call event is always Rx path, whereas
tap_handler_*() handle multiple epoll events that could be Rx or Tx
path.  tap_vhost_input() is good.

[snip]
> > > +#define VHOST_VIRTIO         0xAF
> > > +#define VHOST_GET_FEATURES   _IOR(VHOST_VIRTIO, 0x00, __u64)
> > > +#define VHOST_SET_FEATURES   _IOW(VHOST_VIRTIO, 0x00, __u64)
> > > +#define VHOST_SET_OWNER           _IO(VHOST_VIRTIO, 0x01)
> > > +#define VHOST_SET_MEM_TABLE  _IOW(VHOST_VIRTIO, 0x03, struct vhost_memory)
> > > +#define VHOST_SET_VRING_NUM  _IOW(VHOST_VIRTIO, 0x10, struct vhost_vring_state)
> > > +#define VHOST_SET_VRING_ADDR _IOW(VHOST_VIRTIO, 0x11, struct vhost_vring_addr)
> > > +#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vring_file)
> > > +#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vring_file)
> > > +#define VHOST_SET_VRING_ERR  _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
> > > +#define VHOST_SET_BACKEND_FEATURES _IOW(VHOST_VIRTIO, 0x25, __u64)
> > > +#define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file)
> >
> > This stuff probably belongs in linux_dep.h (if you can't get it from a
> > system or kernel header).
> >
> > > +#define VHOST_NDESCS (PKT_BUF_BYTES / 65520)
> >
> > I suspect this is wrong.  For starters we should definitely be using a
> > define, not open-coded 65520.  But more importantly, 65520 is the
> > (default) MTU - the maximum size *of the IP packet*, excluding L2
> > headers.  I'm assuming these buffers will also need to contain the L2
> > header, so will need to be larger.  Looking at the later code, it
> > looks like these buffers also include the virtio_net_hdr_mrg_rxbuf
> > pseudo-physical header, so will need to be bigger again.
> >
> > I think you'll want a new define for this, which would be
> > L2_MAX_LEN_PASTA plus whatever you need on top of the Ethernet header
> > (virtion_net_hdr_mrg_rxbuf, AFAICT).  For now, I'm going to call this
> > L1_MAX_LEN_VHOST, which is not a great name, but I want to refer to it
> > later in this mail.  You may also need/want to round it up to
> > 4/8/whatever bytes to avoid poorly aligned buffers.
> >
> 
> Very good points, applying it for the next series.
> 
> > > +static_assert(!(VHOST_NDESCS & (VHOST_NDESCS - 1)),
> > > +                      "Number of vhost descs must be a power of two by standard");
> >
> > Relying on this calculation come out to a power of 2 seems kind of
> > fragile.  Maybe just round down to a power of 2 instead?  Although I
> > guess that could mean a bunch of wasted space.
> 
> To me it's the classical kernel include/linux/log2.h:is_power_of_2
> function, but we can replace (or entirely omit) the assert for sure.

The test for a power of 2 is fine (although you introduce an explicit
macro for it later, which should be used here as well).  The part I
was concerned about was requiring this calculated value to be a power
of 2, rather than finding a power of 2 that fits.

> > To avoid wasting space if this doesn't come neatly to a power of 2,
> > would it be safe to round _up_ to a power of 2, as long as we never,
> > ever put an index beyond what we actually have space for into the
> > available queue?
> 
> All descriptors are exposed to the kernel always in the rx queue, so
> rounding is not possible. This number must be a power of 2 in the
> split case, but I'll be happy if it is just a comment in the final
> version.

Ah, ok.  *thinks*.... I withdraw my objection.  Rounding down could
lost a *lot* of buffers and could therefore lead to hard to track
performance issues.  Forcing it to be a power of 2 in the first place,
and adjusting the buffer size to make that true is a better approach.

> 
> > > +static struct {
> > > +     /* Number of free descriptors */
> > > +     uint16_t num_free;
> > > +
> > > +     /* Last used idx processed */
> > > +     uint16_t last_used_idx;
> > > +} vqs[2];
> > > +
> > > +static struct vring_desc vring_desc[2][VHOST_NDESCS] __attribute__((aligned(PAGE_SIZE)));
> > > +static union {
> > > +     struct vring_avail avail;
> > > +     char buf[offsetof(struct vring_avail, ring[VHOST_NDESCS])];
> >
> > The purpose of this idiom of a union with a char buffer based on
> > offsetof isn't obvious to me yet.
> 
> It's because vring_avail contains a flexible array member.
> 
> In VirtIO the avail ring is defined as:
> struct virtq_avail {
>         le16 flags;
>         le16 idx;
>         le16 ring[ /* Queue Size */ ];
>         /* ommiting last member here, as it is not used by pasta */
> };
> 
> But the kernel uapi definition of vring_avail is:
> struct vring_avail {
>         __virtio16 flags;
>         __virtio16 idx;
>         __virtio16 ring[];
> };
> 
> So I need to allocate enough bytes for ring[]. To me the easiest way
> is by using char buf[offsetof(last_member_of_ring[])].

Ah, ok, now I understand what it's doing at least.

> I would be happy to redefine it for pasta, as we know the size in
> advance, or to hide it in a macro.

Redefining it for our purpose sounds promising to me, but I might have
to see what it looks like.

> > > +} vring_avail_0 __attribute__((aligned(PAGE_SIZE))), vring_avail_1 __attribute__((aligned(PAGE_SIZE)));
> > > +static union {
> > > +     struct vring_used used;
> > > +     char buf[offsetof(struct vring_used, ring[VHOST_NDESCS])];
> > > +} vring_used_0 __attribute__((aligned(PAGE_SIZE))), vring_used_1 __attribute__((aligned(PAGE_SIZE)));
> >
> > Maybe 2 entry arrays, rather than *_0 and *_1?
> >
> 
> Flexible array members again :(.

Right.  Yeah, defining our own version or wrapper of vring_used would
be cleaner, I think.

[snip]
> > You should check that there are enough bytes for the L2 header as well
> > as the mrg_rxbuf header - tap_add_packet() doesn't appear to check
> > that itself (maybe it should).
> >
> 
> That's right, I thought tap_add_packet did it but now I realize it
> assumes there is.

> Why not check it in tap_add_packet, as all namespace input needs to go
> there? We need to adapt the function signature though.

Yeah, I think it probably should.

> > > +                     warn("vhost: invalid len %u", len);
> > > +                     continue;
> > > +             }
> > > +
> > > +             /* TODO this will break from this moment */
> > > +             if (hdr->num_buffers != 1) {
> > > +                     warn("vhost: Too many buffers %u, %zu bytes should be enough for everybody!", hdr->num_buffers, PKT_BUF_BYTES/VHOST_NDESCS);
> >
> > L1_MAX_LEN_VHOST again.
> 
> I'm ok with moving to L1_MAX_LEN_VHOST, but then another bunch of
> places would need L1_MAX_LEN_VHOST*VHOST_NDESCS. Or should we keep the
> two macros in parallel?

We can have macros for both the individual buffer size and the total
buffer size, as long as we make sure they're consistent (either by
construction or with static_assert()s).

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson