From mboxrd@z Thu Jan  1 00:00:00 1970
Authentication-Results: passt.top; dmarc=pass (p=quarantine dis=none) header.from=redhat.com
Authentication-Results: passt.top;
	dkim=pass (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=PEz80VYB;
	dkim-atps=neutral
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	by passt.top (Postfix) with ESMTPS id 31F115A0278
	for <passt-dev@passt.top>; Mon, 28 Jul 2025 18:34:26 +0200 (CEST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1753720464;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=cN2QNIZRTHnagONqfVGRgKCcI2/2rKqMuSv6RPMlj3c=;
	b=PEz80VYBglWzGSaO3jY392LzBbigi63t84eUlFheVC7vZVlFlc871xejESnXNsqf5IaGBT
	Anssa8pxI+kDl6wdLgPe3HswtMOQpbbkkyZIQ7txrxIZi7LzVjMtgwTi/o4XxAdyzcMrSr
	V99hByDPYFr4wiMBhrRJTYYSN950CF0=
Received: from mail-pj1-f72.google.com (mail-pj1-f72.google.com
 [209.85.216.72]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id
 us-mta-186-23iHVpRFMAC5mMbW_aFQhQ-1; Mon, 28 Jul 2025 12:34:22 -0400
X-MC-Unique: 23iHVpRFMAC5mMbW_aFQhQ-1
X-Mimecast-MFC-AGG-ID: 23iHVpRFMAC5mMbW_aFQhQ_1753720462
Received: by mail-pj1-f72.google.com with SMTP id 98e67ed59e1d1-31effad1358so58615a91.1
        for <passt-dev@passt.top>; Mon, 28 Jul 2025 09:34:22 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1753720461; x=1754325261;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=cN2QNIZRTHnagONqfVGRgKCcI2/2rKqMuSv6RPMlj3c=;
        b=UQSBxKaHGpCVPxjGeai8G5WaXnG+4mCzwGbUWNL5kARkFxq3vqS0fMURoHcFb05+v+
         RvZvCFvn2A4SwMo7tcz0W1MlHZCe3lpDh4x8W5XUBBQvAKuQLddUBC3MJarcD33+OzM5
         vmwUI+9LGu4IuKoMHyLxhFlRNieIY74u0Olh/HawKU0GUSQ9fft9ShijCKQlww+M2RLB
         Y3b6vWoiNOpMG4VrYHXTJYdFYspbyrgkH/ExpJJUjA/AovdV44zIsI3PoCHqF58R0AeG
         fD8wotRXddZUqdAwv/twmDvdyE600R8Xj5nlXl10HHXGfSoxlQ2A9boD15OGnAb+5mHF
         2jXw==
X-Gm-Message-State: AOJu0YxlJMKWy2jiTMwlfmuo96jvIBMbFs2lY1jDKIwdgS7EsB2hQwsQ
	Hl3RRYNvnDvBEcfqXKWYqMwLpU5qTTQICU+KNwBKMA5mqn7JB+o9WaHedKY37s9leh30XLDb6u5
	iK0Q2slOCcBbUtoxB+g76BGDhP0IpCf7Mto8Qme2jG042bXE7qCTwovvjxXxkqtvxpq5OTwYR2u
	swrjC3iSOaQA4OFI6H+bdaDGY/Rwgi
X-Gm-Gg: ASbGncucwbbVBqWWuxuyCIMR9C51Ec3s6GJdFRtl7JKFIVZ9ax/LGTBychZPZt4pl3u
	AF1AJpVty7bxkft1ohCfGXt2k7l+Qed91MdN+xJAjDGJVxyUJsOml3MJpBxU3LlYRzKGCufKEBg
	Ebb86csW3DyM9QXcUZtx3f
X-Received: by 2002:a17:90b:3a8a:b0:31e:b772:dfcb with SMTP id 98e67ed59e1d1-31f28cdb992mr264396a91.10.1753720460878;
        Mon, 28 Jul 2025 09:34:20 -0700 (PDT)
X-Google-Smtp-Source: AGHT+IFF+immgcssyYX4De08yGz2MIZVZqul78LkDlTEs0GLQ5eJzlDS1CzJaHrQiwkIkqnJ4LaOSoua/JQRYrl5gdQ=
X-Received: by 2002:a17:90b:3a8a:b0:31e:b772:dfcb with SMTP id
 98e67ed59e1d1-31f28cdb992mr264320a91.10.1753720459975; Mon, 28 Jul 2025
 09:34:19 -0700 (PDT)
MIME-Version: 1.0
References: <20250709174748.3514693-1-eperezma@redhat.com> <20250709174748.3514693-2-eperezma@redhat.com>
 <aICHsmIC0GrUQbnC@zatzit>
In-Reply-To: <aICHsmIC0GrUQbnC@zatzit>
From: Eugenio Perez Martin <eperezma@redhat.com>
Date: Mon, 28 Jul 2025 18:33:43 +0200
X-Gm-Features: Ac12FXwyR0WdQOjdlF6ne0ppkABAAWfI1kOxhS-9DchyOxvhENcGpst0RjqQFDw
Message-ID: <CAJaqyWc42NLNU5anS=G4Uf52v2e-ZOVix5CFt3Wg=adns32QPw@mail.gmail.com>
Subject: Re: [RFC v2 01/11] tap: implement vhost_call_cb
To: David Gibson <david@gibson.dropbear.id.au>
X-Mimecast-Spam-Score: 0
X-Mimecast-MFC-PROC-ID: oYYFcqzlqtRbAWZeJO0x7FsyzTCTWYPmBI2XmT0iBYE_1753720462
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Message-ID-Hash: YECZTSPU4VJXTO6GGE3MZDURPRTVKP7N
X-Message-ID-Hash: YECZTSPU4VJXTO6GGE3MZDURPRTVKP7N
X-MailFrom: eperezma@redhat.com
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header
CC: passt-dev@passt.top, jasowang@redhat.com
X-Mailman-Version: 3.3.8
Precedence: list
List-Id: Development discussion and patches for passt <passt-dev.passt.top>
Archived-At: <https://archives.passt.top/passt-dev/CAJaqyWc42NLNU5anS=G4Uf52v2e-ZOVix5CFt3Wg=adns32QPw@mail.gmail.com/>
Archived-At: <https://passt.top/hyperkitty/list/passt-dev@passt.top/message/YECZTSPU4VJXTO6GGE3MZDURPRTVKP7N/>
List-Archive: <https://archives.passt.top/passt-dev/>
List-Archive: <https://passt.top/hyperkitty/list/passt-dev@passt.top/>
List-Help: <mailto:passt-dev-request@passt.top?subject=help>
List-Owner: <mailto:passt-dev-owner@passt.top>
List-Post: <mailto:passt-dev@passt.top>
List-Subscribe: <mailto:passt-dev-join@passt.top>
List-Unsubscribe: <mailto:passt-dev-leave@passt.top>

On Wed, Jul 23, 2025 at 8:57=E2=80=AFAM David Gibson
<david@gibson.dropbear.id.au> wrote:
>
> On Wed, Jul 09, 2025 at 07:47:38PM +0200, Eugenio P=C3=A9rez wrote:
> > Implement the rx side of the vhost-kernel, where the namespace send
> > packets through its tap device and the kernel passes them to
> > pasta using a virtqueue.  The virtqueue is build using a
> > virtual ring (vring), which includes:
> > * A descriptor table (buffer info)
> > * An available ring (buffers ready for the kernel)
> > * A used ring (buffers that the kernel has filled)
> >
> > The descriptor table holds an array of buffers defined by address and
> > length. The kernel writes the packets transmitted by the namespace into
> > these buffers.  The number of descriptors (vq size) is set by
> > VHOST_NDESCS.  Pasta fills this table using pkt_buf, splitting it
> > evenly across all descriptors.  This table is read-only for the kernel.
> >
> > The available ring is where pasta marks which buffers the kernel can
> > use.  It's read only for the kernel.  It includes a ring[] array with
> > the descriptor indexes and an avail->idx index. Pasta increments
> > avail->idx when a new buffer is added, modulo the size of the
> > virtqueue.  As pasta writes the rx buffers sequentially, ring[] is
> > always [0, 1, 2...] and only avail->idx is incremented when new buffers
> > are available for the kernel. avail->idx can be incremented by more
> > than one at a time.
> >
> > Pasta also notifies the kernel of new available buffers by writing to
> > the kick eventfd.
> >
> > Once the kernel has written a frame in a descriptor it writes its index
> > into used_ring->ring[] and increments the used_ring->idx accordly.
> > Like the avail idx the kernel can increase it by more than one.  Pasta
> > gets a notification in the call eventfd, so we add it into the epoll ct=
x.
> >
> > Pasta assumes buffers are used in order. QEMU also assumes it in the
> > virtio-net migration code so it is safe.
> >
> > Now, vhost-kernel is designed to read the virtqueues and the buffers as
> > *guest* physical addresses (GPA), not process virtual addresses (HVA).
> > The way QEMU tells the translations is through the memory regions.
> > Since we don't have GPAs, let's just send the memory regions as a 1:1
> > translations of the HVA.
> >
> > TODO: Evaluate if we can reuse the tap fd code instead of making a new
> > epoll event type.
>
> I'm not sure this is a particularly desirable code, unless it obvious
> removes a bunch of duplicated code.  As a rule I'd prefer to use
> different epoll event types rather than a single epoll type that then
> needs to consult additional information to determine how to handle it.
>

Sure, it can be done that way.

> > TODO: Split a new file for vhost (Stefano)
> >
> > Signed-off-by: Eugenio P=C3=A9rez <eperezma@redhat.com>
> > ---
> > RFC v2:
> > * Need to integrate "now" parameter in tap_add_packet and replace
> > TAP_MSGS to TAP_MSGS{4,6}.
> > * Actually refill rx queue at the end of operation.
> > * Explain virtqueues and memory regions theory of operation.
> > * Extrack vhost_kick so it can be reused by tx.
> > * Add macro for VHOST regions.
> > * Only register call_cb in rx queue, as tx calls are handled just if
> >   needed.
> > * Renamed vhost_call_cb to tap_vhost_input (David)
> > * Move the inuse and last_used_idx to a writable struct from vhost tx
> >   function. Context is readonly for vhost tx code path.
> > * Changed from inuse to num_free tracking, more aligned with kernel drv
> > * Use always the same tap_pool instead of having the two splitted for v=
4
> >   and v6.
> > * Space between (struct vhost_memory_region) {.XXX =3D ...}. (Stefano)
> > * Expand comments about memory region size (David)
> > * Add some TODOs.
> >
> > Signed-off-by: Eugenio P=C3=A9rez <eperezma@redhat.com>
> > ---
> >  epoll_type.h |   2 +
> >  passt.c      |   7 +-
> >  passt.h      |  10 +-
> >  tap.c        | 311 ++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  tap.h        |   8 ++
> >  5 files changed, 335 insertions(+), 3 deletions(-)
> >
> > diff --git a/epoll_type.h b/epoll_type.h
> > index 12ac59b..0371c14 100644
> > --- a/epoll_type.h
> > +++ b/epoll_type.h
> > @@ -44,6 +44,8 @@ enum epoll_type {
> >       EPOLL_TYPE_REPAIR_LISTEN,
> >       /* TCP_REPAIR helper socket */
> >       EPOLL_TYPE_REPAIR,
> > +     /* vhost-kernel call socket */
> > +     EPOLL_TYPE_VHOST_CALL,
> >
> >       EPOLL_NUM_TYPES,
> >  };
> > diff --git a/passt.c b/passt.c
> > index 388d10f..0f2659c 100644
> > --- a/passt.c
> > +++ b/passt.c
> > @@ -79,6 +79,7 @@ char *epoll_type_str[] =3D {
> >       [EPOLL_TYPE_VHOST_KICK]         =3D "vhost-user kick socket",
> >       [EPOLL_TYPE_REPAIR_LISTEN]      =3D "TCP_REPAIR helper listening =
socket",
> >       [EPOLL_TYPE_REPAIR]             =3D "TCP_REPAIR helper socket",
> > +     [EPOLL_TYPE_VHOST_CALL]         =3D "vhost-kernel call socket",
> >  };
> >  static_assert(ARRAY_SIZE(epoll_type_str) =3D=3D EPOLL_NUM_TYPES,
> >             "epoll_type_str[] doesn't match enum epoll_type");
> > @@ -310,7 +311,8 @@ loop:
> >
> >               switch (ref.type) {
> >               case EPOLL_TYPE_TAP_PASTA:
> > -                     tap_handler_pasta(&c, eventmask, &now);
> > +                     // TODO: Find a better way, maybe reuse the fd.
> > +                     // tap_handler_pasta(&c, eventmask, &now);
>
> This change seems bogus.  We still need this if we want to be able to
> fall back to tap without vhost, which I think we want to be able to
> do.  I'm also not clear why you need it - if you don't want these
> events, can't you just not register them (at least not with this event
> type) in epoll_ctl?
>

Absolutely, it was commented out for testing purposes but I forgot to note =
it.

> >                       break;
> >               case EPOLL_TYPE_TAP_PASST:
> >                       tap_handler_passt(&c, eventmask, &now);
> > @@ -357,6 +359,9 @@ loop:
> >               case EPOLL_TYPE_REPAIR:
> >                       repair_handler(&c, eventmask);
> >                       break;
> > +             case EPOLL_TYPE_VHOST_CALL:
> > +                     tap_vhost_input(&c, ref, &now);
>
> I'd suggest calling this tap_handler_vhost() for consistency with
> tap_handler_pasta() and others.  I've also found it usually ends up a
> bit cleaner to pass the relevant individual members of ref, rather
> than the whole thing.
>

I'm ok with tap_handler_vhost, but tap_vhost_input was your suggestion :) [=
1].

> > +                     break;
> >               default:
> >                       /* Can't happen */
> >                       ASSERT(0);
> > diff --git a/passt.h b/passt.h
> > index 8693794..7bb86c4 100644
> > --- a/passt.h
> > +++ b/passt.h
> > @@ -45,7 +45,7 @@ union epoll_ref;
> >   * @icmp:    ICMP-specific reference part
> >   * @data:    Data handled by protocol handlers
> >   * @nsdir_fd:        netns dirfd for fallback timer checking if namesp=
ace is gone
> > - * @queue:   vhost-user queue index for this fd
> > + * @queue:   vhost queue index for this fd
> >   * @u64:     Opaque reference for epoll_ctl() and epoll_wait()
> >   */
> >  union epoll_ref {
> > @@ -269,11 +269,14 @@ struct ctx {
> >       int fd_tap;
> >       int fd_repair_listen;
> >       int fd_repair;
> > +     /* TODO document all added fields */
> > +     int fd_vhost;
> >       unsigned char our_tap_mac[ETH_ALEN];
> >       unsigned char guest_mac[ETH_ALEN];
> >       uint16_t mtu;
> >
> >       uint64_t hash_secret[2];
> > +     uint64_t virtio_features;
> >
> >       int ifi4;
> >       struct ip4_ctx ip4;
> > @@ -297,6 +300,11 @@ struct ctx {
> >       int no_icmp;
> >       struct icmp_ctx icmp;
> >
> > +     struct {
> > +             int kick_fd;
> > +             int call_fd;
> > +     } vq[2];
>
> Maybe not in scope for this series, but this is reminding me that we
> should really split out the tap-backend specific fields into their own
> union of substructures to make it clearer what fields are relevant in
> which configurations.
>
>
> >       int no_dns;
> >       int no_dns_search;
> >       int no_dhcp_dns;
> > diff --git a/tap.c b/tap.c
> > index 6db5d88..e4a3822 100644
> > --- a/tap.c
> > +++ b/tap.c
> > @@ -31,6 +31,7 @@
> >  #include <sys/types.h>
> >  #include <sys/stat.h>
> >  #include <fcntl.h>
> > +#include <sys/eventfd.h>
> >  #include <sys/uio.h>
> >  #include <stdbool.h>
> >  #include <stdlib.h>
> > @@ -101,6 +102,51 @@ static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS_IP6,=
 pkt_buf);
> >  #define TAP_SEQS             128 /* Different L4 tuples in one batch *=
/
> >  #define FRAGMENT_MSG_RATE    10  /* # seconds between fragment warning=
s */
> >
> > +#define VHOST_VIRTIO         0xAF
> > +#define VHOST_GET_FEATURES   _IOR(VHOST_VIRTIO, 0x00, __u64)
> > +#define VHOST_SET_FEATURES   _IOW(VHOST_VIRTIO, 0x00, __u64)
> > +#define VHOST_SET_OWNER           _IO(VHOST_VIRTIO, 0x01)
> > +#define VHOST_SET_MEM_TABLE  _IOW(VHOST_VIRTIO, 0x03, struct vhost_mem=
ory)
> > +#define VHOST_SET_VRING_NUM  _IOW(VHOST_VIRTIO, 0x10, struct vhost_vri=
ng_state)
> > +#define VHOST_SET_VRING_ADDR _IOW(VHOST_VIRTIO, 0x11, struct vhost_vri=
ng_addr)
> > +#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vri=
ng_file)
> > +#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vri=
ng_file)
> > +#define VHOST_SET_VRING_ERR  _IOW(VHOST_VIRTIO, 0x22, struct vhost_vri=
ng_file)
> > +#define VHOST_SET_BACKEND_FEATURES _IOW(VHOST_VIRTIO, 0x25, __u64)
> > +#define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vr=
ing_file)
>
> This stuff probably belongs in linux_dep.h (if you can't get it from a
> system or kernel header).
>
> > +#define VHOST_NDESCS (PKT_BUF_BYTES / 65520)
>
> I suspect this is wrong.  For starters we should definitely be using a
> define, not open-coded 65520.  But more importantly, 65520 is the
> (default) MTU - the maximum size *of the IP packet*, excluding L2
> headers.  I'm assuming these buffers will also need to contain the L2
> header, so will need to be larger.  Looking at the later code, it
> looks like these buffers also include the virtio_net_hdr_mrg_rxbuf
> pseudo-physical header, so will need to be bigger again.
>
> I think you'll want a new define for this, which would be
> L2_MAX_LEN_PASTA plus whatever you need on top of the Ethernet header
> (virtion_net_hdr_mrg_rxbuf, AFAICT).  For now, I'm going to call this
> L1_MAX_LEN_VHOST, which is not a great name, but I want to refer to it
> later in this mail.  You may also need/want to round it up to
> 4/8/whatever bytes to avoid poorly aligned buffers.
>

Very good points, applying it for the next series.

> > +static_assert(!(VHOST_NDESCS & (VHOST_NDESCS - 1)),
> > +                      "Number of vhost descs must be a power of two by=
 standard");
>
> Relying on this calculation come out to a power of 2 seems kind of
> fragile.  Maybe just round down to a power of 2 instead?  Although I
> guess that could mean a bunch of wasted space.
>

To me it's the classical kernel include/linux/log2.h:is_power_of_2
function, but we can replace (or entirely omit) the assert for sure.

> To avoid wasting space if this doesn't come neatly to a power of 2,
> would it be safe to round _up_ to a power of 2, as long as we never,
> ever put an index beyond what we actually have space for into the
> available queue?
>

All descriptors are exposed to the kernel always in the rx queue, so
rounding is not possible. This number must be a power of 2 in the
split case, but I'll be happy if it is just a comment in the final
version.

> > +static struct {
> > +     /* Number of free descriptors */
> > +     uint16_t num_free;
> > +
> > +     /* Last used idx processed */
> > +     uint16_t last_used_idx;
> > +} vqs[2];
> > +
> > +static struct vring_desc vring_desc[2][VHOST_NDESCS] __attribute__((al=
igned(PAGE_SIZE)));
> > +static union {
> > +     struct vring_avail avail;
> > +     char buf[offsetof(struct vring_avail, ring[VHOST_NDESCS])];
>
> The purpose of this idiom of a union with a char buffer based on
> offsetof isn't obvious to me yet.
>

It's because vring_avail contains a flexible array member.

In VirtIO the avail ring is defined as:
struct virtq_avail {
        le16 flags;
        le16 idx;
        le16 ring[ /* Queue Size */ ];
        /* ommiting last member here, as it is not used by pasta */
};

But the kernel uapi definition of vring_avail is:
struct vring_avail {
        __virtio16 flags;
        __virtio16 idx;
        __virtio16 ring[];
};

So I need to allocate enough bytes for ring[]. To me the easiest way
is by using char buf[offsetof(last_member_of_ring[])].

I would be happy to redefine it for pasta, as we know the size in
advance, or to hide it in a macro.

> > +} vring_avail_0 __attribute__((aligned(PAGE_SIZE))), vring_avail_1 __a=
ttribute__((aligned(PAGE_SIZE)));
> > +static union {
> > +     struct vring_used used;
> > +     char buf[offsetof(struct vring_used, ring[VHOST_NDESCS])];
> > +} vring_used_0 __attribute__((aligned(PAGE_SIZE))), vring_used_1 __att=
ribute__((aligned(PAGE_SIZE)));
>
> Maybe 2 entry arrays, rather than *_0 and *_1?
>

Flexible array members again :(.

> > +
> > +/* all descs ring + 2rings * 2vqs + tx pkt buf + rx pkt buf */
> > +#define N_VHOST_REGIONS 6
> > +union {
> > +     struct vhost_memory mem;
> > +     char buf[offsetof(struct vhost_memory, regions[N_VHOST_REGIONS])]=
;
> > +} vhost_memory =3D {
> > +     .mem =3D {
> > +             .nregions =3D N_VHOST_REGIONS,
> > +     },
> > +};
> > +
> >  /**
> >   * tap_l2_max_len() - Maximum frame size (including L2 header) for cur=
rent mode
> >   * @c:               Execution context
> > @@ -399,6 +445,18 @@ void tap_icmp6_send(const struct ctx *c,
> >       tap_send_single(c, buf, l4len + ((char *)icmp6h - buf));
> >  }
> >
> > +static void vhost_kick(struct vring_used *used, int kick_fd) {
>
> passt style puts the opening { of a function on the nect line.
>

Ouch, I'll fix it in the next version!

> > +     /* We need to expose available array entries before checking avai=
l
> > +      * event.
> > +      *
> > +      * TODO: Does eventfd_write already do this?
> > +      */
> > +     smp_mb();
> > +
> > +     if (!(used->flags & VRING_USED_F_NO_NOTIFY))
> > +             eventfd_write(kick_fd, 1);
> > +}
> > +
> >  /**
> >   * tap_send_frames_pasta() - Send multiple frames to the pasta tap
> >   * @c:                       Execution context
> > @@ -1386,6 +1444,89 @@ void tap_listen_handler(struct ctx *c, uint32_t =
events)
> >       tap_start_connection(c);
> >  }
> >
> > +static void *virtqueue_get_rx_buf(unsigned qid, unsigned *len)
>
> None of these functions have descriptive comment blocks yet.  I
> realise this is an RFC, but it can certainly make things easier to
> review if you know what to expect before reading the function.
>
> > +{
> > +     struct vring_used *used =3D !qid ? &vring_used_0.used : &vring_us=
ed_1.used;
>
> Using an array as suggested above would avoid this awkward construct.
>
> > +     uint32_t i;
> > +     uint16_t used_idx, last_used;
>
> passt style orders declarations in reverse order of line length (when
> possible), so this line should be before the previous one.
>

Got it, I'll use it from now on.

> > +
> > +     /* TODO think if this has races with previous eventfd_read */
> > +     /* TODO we could improve performance with a shadow_used_idx */
> > +     used_idx =3D le16toh(used->idx);
> > +
> > +     smp_rmb();
> > +
> > +     if (used_idx =3D=3D vqs[0].last_used_idx) {
> > +             *len =3D 0;
> > +             return NULL;
> > +     }
> > +
> > +     last_used =3D vqs[0].last_used_idx % VHOST_NDESCS;
> > +     i =3D le32toh(used->ring[last_used].id);
> > +     *len =3D le32toh(used->ring[last_used].len);
> > +
> > +     /* Make sure the kernel is consuming the descriptors in order */
> > +     if (i !=3D last_used) {
> > +             die("vhost: id %u at used position %u !=3D %u", i, last_u=
sed, i);
> > +             return NULL;
> > +     }
> > +
> > +     if (*len > PKT_BUF_BYTES/VHOST_NDESCS) {
>
> You want your L1_MAX_LEN_VHOST (or whatever it's called) again here.
>
> > +             warn("vhost: id %d len %u > %zu", i, *len, PKT_BUF_BYTES/=
VHOST_NDESCS);
>
> And here.
>
> > +             return NULL;
> > +     }
> > +
> > +     /* TODO check if the id is valid and it has not been double used =
*/
> > +     vqs[0].last_used_idx++;
> > +     vqs[0].num_free++;
> > +     return pkt_buf + i * (PKT_BUF_BYTES/VHOST_NDESCS);
>
> And here.
>
> > +}
> > +
> > +/* TODO this assumes the kernel consumes descriptors in order */
> > +static void rx_pkt_refill(struct ctx *c)
> > +{
> > +     /* TODO: tune this threshold */
> > +     if (!vqs[0].num_free)
> > +             return;
> > +
> > +     vring_avail_0.avail.idx +=3D vqs[0].num_free;
> > +     vqs[0].num_free =3D 0;
> > +     vhost_kick(&vring_used_0.used, c->vq[0].kick_fd);
> > +}
> > +
> > +void tap_vhost_input(struct ctx *c, union epoll_ref ref, const struct =
timespec *now)
> > +{
> > +     eventfd_read(ref.fd, (eventfd_t[]){ 0 });
> > +
> > +     tap_flush_pools();
> > +
> > +     while (true) {
> > +             struct virtio_net_hdr_mrg_rxbuf *hdr;
> > +             unsigned len;
> > +
> > +             hdr =3D virtqueue_get_rx_buf(ref.queue, &len);
> > +             if (!hdr)
> > +                     break;
>
> You could use while ((hdr =3D virtqueue_get_rx_buf(...))) instead of
> while (true), couldn't you?
>

Sure.

> > +
> > +             if (len < sizeof(*hdr)) {
>
> You should check that there are enough bytes for the L2 header as well
> as the mrg_rxbuf header - tap_add_packet() doesn't appear to check
> that itself (maybe it should).
>

That's right, I thought tap_add_packet did it but now I realize it
assumes there is.

Why not check it in tap_add_packet, as all namespace input needs to go
there? We need to adapt the function signature though.

> > +                     warn("vhost: invalid len %u", len);
> > +                     continue;
> > +             }
> > +
> > +             /* TODO this will break from this moment */
> > +             if (hdr->num_buffers !=3D 1) {
> > +                     warn("vhost: Too many buffers %u, %zu bytes shoul=
d be enough for everybody!", hdr->num_buffers, PKT_BUF_BYTES/VHOST_NDESCS);
>
> L1_MAX_LEN_VHOST again.

I'm ok with moving to L1_MAX_LEN_VHOST, but then another bunch of
places would need L1_MAX_LEN_VHOST*VHOST_NDESCS. Or should we keep the
two macros in parallel?

> Also passt style keeps lines below 80
> columns.
>

Fixing for the next RFC.

> > +                     continue;
> > +             }
> > +
> > +             /* TODO fix the v6 pool to support ipv6 */
>
> Not clear on why anything particular is needed for ipv6 here.
>

This comment is outdated actually, I only used pool_tap4 here. Fixing
in the next revision!

> > +             tap_add_packet(c, len - sizeof(*hdr), (void *)(hdr+1), no=
w);
> > +     }
> > +
> > +     tap_handler(c, now);
> > +     rx_pkt_refill(c);
> > +}
> > +
> >  /**
> >   * tap_ns_tun() - Get tuntap fd in namespace
> >   * @c:               Execution context
> > @@ -1396,10 +1537,14 @@ void tap_listen_handler(struct ctx *c, uint32_t=
 events)
> >   */
> >  static int tap_ns_tun(void *arg)
> >  {
> > +     /* TODO we need to check if vhost support VIRTIO_NET_F_MRG_RXBUF =
and VHOST_NET_F_VIRTIO_NET_HDR actually */
> > +     static const uint64_t features =3D
> > +             (1ULL << VIRTIO_F_VERSION_1) | (1ULL << VIRTIO_NET_F_MRG_=
RXBUF) | (1ULL << VHOST_NET_F_VIRTIO_NET_HDR);
>
> If I understand the code above properly, for the Rx path at least we
> don't need or want F_MRG_RXBUF.  So why require it here?  If we want
> it for the Tx path, we can omit it here and add it later in the
> series.
>

I can try without it, sure.

> >       struct ifreq ifr =3D { .ifr_flags =3D IFF_TAP | IFF_NO_PI };
> >       int flags =3D O_RDWR | O_NONBLOCK | O_CLOEXEC;
> >       struct ctx *c =3D (struct ctx *)arg;
> > -     int fd, rc;
> > +     unsigned i;
> > +     int fd, vhost_fd, rc;
> >
> >       c->fd_tap =3D -1;
> >       memcpy(ifr.ifr_name, c->pasta_ifn, IFNAMSIZ);
> > @@ -1409,6 +1554,143 @@ static int tap_ns_tun(void *arg)
> >       if (fd < 0)
> >               die_perror("Failed to open() /dev/net/tun");
> >
> > +     vhost_fd =3D open("/dev/vhost-net", flags);
> > +     if (vhost_fd < 0)
> > +             die_perror("Failed to open() /dev/vhost-net");
> > +
> > +     rc =3D ioctl(vhost_fd, VHOST_SET_OWNER, NULL);
> > +     if (rc < 0)
> > +             die_perror("VHOST_SET_OWNER ioctl on /dev/vhost-net faile=
d");
> > +
> > +     rc =3D ioctl(vhost_fd, VHOST_GET_FEATURES, &c->virtio_features);
> > +     if (rc < 0)
> > +             die_perror("VHOST_GET_FEATURES ioctl on /dev/vhost-net fa=
iled");
> > +
> > +     /* TODO inform more explicitely */
> > +     fprintf(stderr, "vhost features: %lx\n", c->virtio_features);
> > +     fprintf(stderr, "req features: %lx\n", features);
>
> You should use info() or debug() here.
>

Right, I'll change.

> > +     c->virtio_features &=3D features;
> > +     if (c->virtio_features !=3D features)
> > +             die("vhost does not support required features");
> > +
> > +     for (i =3D 0; i < ARRAY_SIZE(c->vq); i++) {
> > +             struct vhost_vring_file file =3D {
> > +                     .index =3D i,
> > +             };
> > +             union epoll_ref ref =3D { .type =3D EPOLL_TYPE_VHOST_CALL=
,
> > +                                     .queue =3D i };
> > +             struct epoll_event ev;
> > +
> > +             file.fd =3D eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
> > +             ref.fd =3D file.fd;
> > +             if (file.fd < 0)
> > +                     die_perror("Failed to create call eventfd");
>
> I know it's technically harmless, but it reads better for me if you
> check for the error immediately after the eventfd() call, before doing
> something else with the value.
>

Changing in the next version.

> > +
> > +             rc =3D ioctl(vhost_fd, VHOST_SET_VRING_CALL, &file);
> > +             if (rc < 0)
> > +                     die_perror(
> > +                             "VHOST_SET_VRING_CALL ioctl on /dev/vhost=
-net failed");
> > +
> > +             ev =3D (struct epoll_event){ .data.u64 =3D ref.u64, .even=
ts =3D EPOLLIN };
> > +             if (i =3D=3D 0) {
> > +                     rc =3D epoll_ctl(c->epollfd, EPOLL_CTL_ADD, ref.f=
d, &ev);
> > +                     if (rc < 0)
> > +                             die_perror("Failed to add call eventfd to=
 epoll");
> > +             }
> > +             c->vq[i].call_fd =3D file.fd;
> > +     }
> > +
> > +     /*
> > +      * Unlike most of C-style APIs, userspace_addr+memory_size is
> > +      * also accesible by the kernel.  Include a -1 to adjust.
> > +      */
> > +#define VHOST_MEMORY_REGION_PTR(addr, size) \
> > +     (struct vhost_memory_region) { \
> > +             .guest_phys_addr =3D (uintptr_t)addr, \
> > +             .memory_size =3D size - 1, \
> > +             .userspace_addr =3D (uintptr_t)addr, \
> > +     }
> > +#define VHOST_MEMORY_REGION(elem) VHOST_MEMORY_REGION_PTR(&elem, sizeo=
f(elem))
> > +
> > +     /* 1:1 translation */
> > +     vhost_memory.mem.regions[0] =3D VHOST_MEMORY_REGION(vring_desc);
> > +     vhost_memory.mem.regions[1] =3D VHOST_MEMORY_REGION(vring_avail_0=
);
> > +     vhost_memory.mem.regions[2] =3D VHOST_MEMORY_REGION(vring_avail_1=
);
> > +     vhost_memory.mem.regions[3] =3D VHOST_MEMORY_REGION(vring_used_0)=
;
> > +     vhost_memory.mem.regions[4] =3D VHOST_MEMORY_REGION(vring_used_1)=
;
> > +     vhost_memory.mem.regions[5] =3D VHOST_MEMORY_REGION(pkt_buf);
> > +     static_assert(5 < N_VHOST_REGIONS);
> > +#undef VHOST_MEMORY_REGION
> > +#undef VHOST_MEMORY_REGION_PTR
> > +
> > +     rc =3D ioctl(vhost_fd, VHOST_SET_MEM_TABLE, &vhost_memory.mem);
> > +     if (rc < 0)
> > +             die_perror(
> > +                     "VHOST_SET_MEM_TABLE ioctl on /dev/vhost-net fail=
ed");
> > +
> > +     /* TODO: probably it increases RX perf */
> > +#if 0
> > +     struct ifreq ifr;
> > +     memset(&ifr, 0, sizeof(ifr));
> > +
> > +     if (ioctl(fd, TUNGETIFF, &ifr) !=3D 0)
> > +             die_perror("Unable to query TUNGETIFF on FD %d", fd);
> > +     }
> > +
> > +     if (ifr.ifr_flags & IFF_VNET_HDR)
> > +             net->dev.features &=3D ~(1ULL << VIRTIO_NET_F_MRG_RXBUF);
> > +#endif
> > +     rc =3D ioctl(vhost_fd, VHOST_SET_FEATURES, &c->virtio_features);
> > +     if (rc < 0)
> > +             die_perror("VHOST_SET_FEATURES ioctl on /dev/vhost-net fa=
iled");
> > +
> > +     /* Duplicating foreach queue to follow the exact order from QEMU =
*/
> > +     for (i =3D 0; i < ARRAY_SIZE(c->vq); i++) {
> > +             struct vhost_vring_addr addr =3D {
> > +                     .index =3D i,
> > +                     .desc_user_addr =3D (unsigned long)vring_desc[i],
> > +                     .avail_user_addr =3D i =3D=3D 0 ? (unsigned long)=
&vring_avail_0 :
> > +                                                                      =
       (unsigned long)&vring_avail_1,
> > +                     .used_user_addr =3D i =3D=3D 0 ? (unsigned long)&=
vring_used_0 :
> > +                                                                      =
       (unsigned long)&vring_used_1,
> > +                     /* GPA addr */
> > +                     .log_guest_addr =3D i =3D=3D 0 ? (unsigned long)&=
vring_used_0 :
> > +                                                                      =
  (unsigned long)&vring_used_1,
> > +             };
> > +             struct vhost_vring_state state =3D {
> > +                     .index =3D i,
> > +                     .num =3D VHOST_NDESCS,
> > +             };
> > +             struct vhost_vring_file file =3D {
> > +                     .index =3D i,
> > +             };
> > +
> > +             rc =3D ioctl(vhost_fd, VHOST_SET_VRING_NUM, &state);
> > +             if (rc < 0)
> > +                     die_perror(
> > +                             "VHOST_SET_VRING_NUM ioctl on /dev/vhost-=
net failed");
> > +
> > +             fprintf(stderr, "qid: %d\n", i);
> > +             fprintf(stderr, "vhost desc addr: 0x%llx\n", addr.desc_us=
er_addr);
> > +             fprintf(stderr, "vhost avail addr: 0x%llx\n", addr.avail_=
user_addr);
> > +             fprintf(stderr, "vhost used addr: 0x%llx\n", addr.used_us=
er_addr);
> > +             rc =3D ioctl(vhost_fd, VHOST_SET_VRING_ADDR, &addr);
> > +             if (rc < 0)
> > +                     die_perror(
> > +                             "VHOST_SET_VRING_ADDR ioctl on /dev/vhost=
-net failed");
> > +
> > +             file.fd =3D eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
> > +             if (file.fd < 0)
> > +                     die_perror("Failed to create kick eventfd");
> > +             rc =3D ioctl(vhost_fd, VHOST_SET_VRING_KICK, &file);
> > +             if (rc < 0)
> > +                     die_perror(
> > +                             "VHOST_SET_VRING_KICK ioctl on /dev/vhost=
-net failed");
> > +             c->vq[i].kick_fd =3D file.fd;
> > +
> > +             vqs[i].num_free =3D VHOST_NDESCS;
> > +     }
> > +
> >       rc =3D ioctl(fd, (int)TUNSETIFF, &ifr);
> >       if (rc < 0)
> >               die_perror("TUNSETIFF ioctl on /dev/net/tun failed");
> > @@ -1416,7 +1698,34 @@ static int tap_ns_tun(void *arg)
> >       if (!(c->pasta_ifi =3D if_nametoindex(c->pasta_ifn)))
> >               die("Tap device opened but no network interface found");
> >
> > +     for (i =3D 0; i < VHOST_NDESCS; ++i) {
> > +             vring_desc[0][i].addr =3D (uintptr_t)pkt_buf + i * (PKT_B=
UF_BYTES/VHOST_NDESCS);
> > +             vring_desc[0][i].len =3D PKT_BUF_BYTES/VHOST_NDESCS;
> > +             vring_desc[0][i].flags =3D VRING_DESC_F_WRITE;
> > +     }
> > +     for (i =3D 0; i < VHOST_NDESCS; ++i) {
>
> No { } for single line blocks in passt style.
>

I'll change for the next version.

> > +             vring_avail_0.avail.ring[i] =3D i;
> > +     }
> > +
> > +     rx_pkt_refill(c);
> > +
> > +     /* Duplicating foreach queue to follow the exact order from QEMU =
*/
> > +     for (int i =3D 0; i < ARRAY_SIZE(c->vq); i++) {
> > +             struct vhost_vring_file file =3D {
> > +                     .index =3D i,
> > +                     .fd =3D fd,
> > +             };
> > +
> > +             fprintf(stderr, "qid: %d\n", file.index);
> > +             fprintf(stderr, "tap fd: %d\n", file.fd);
> > +             rc =3D ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &file);
> > +             if (rc < 0)
> > +                     die_perror(
> > +                             "VHOST_NET_SET_BACKEND ioctl on /dev/vhos=
t-net failed");
> > +     }
> > +
> >       c->fd_tap =3D fd;
> > +     c->fd_vhost =3D vhost_fd;
> >
> >       return 0;
> >  }
> > diff --git a/tap.h b/tap.h
> > index 6fe3d15..ff8cee5 100644
> > --- a/tap.h
> > +++ b/tap.h
> > @@ -69,6 +69,14 @@ static inline void tap_hdr_update(struct tap_hdr *th=
dr, size_t l2len)
> >               thdr->vnet_len =3D htonl(l2len);
> >  }
> >
> > +/**
> > + * tap_vhost_input() - Handler for new data on the socket to hyperviso=
r vq
> > + * @c:               Execution context
> > + * @ref:     epoll reference
> > + * @now:     Current timestamp
> > + */
>
> passt convention puts these comments in the .c file not the .h, even
> for exported functions.
>

Ouch, I missed that.

> > +void tap_vhost_input(struct ctx *c, union epoll_ref ref, const struct =
timespec *now);
> > +
> >  unsigned long tap_l2_max_len(const struct ctx *c);
> >  void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto);
> >  void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,
>

[1] https://archives.passt.top/passt-dev/Z-3oQcL28QuIh6LT@zatzit/