From mboxrd@z Thu Jan  1 00:00:00 1970
Authentication-Results: passt.top; dmarc=none (p=none dis=none) header.from=gibson.dropbear.id.au
Authentication-Results: passt.top;
	dkim=pass (2048-bit key; secure) header.d=gibson.dropbear.id.au header.i=@gibson.dropbear.id.au header.a=rsa-sha256 header.s=202506 header.b=VxfqDkxN;
	dkim-atps=neutral
Received: from mail.ozlabs.org (gandalf.ozlabs.org [150.107.74.76])
	by passt.top (Postfix) with ESMTPS id B45A05A0280
	for <passt-dev@passt.top>; Wed, 23 Jul 2025 08:56:59 +0200 (CEST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=gibson.dropbear.id.au; s=202506; t=1753253650;
	bh=2LP3BYpo35JDH8dKDe4O9J46asyrYI/JtnGyNX8baic=;
	h=Date:From:To:Cc:Subject:References:In-Reply-To:From;
	b=VxfqDkxNz4ADWLMyz9RnSyrjpq3XeiIPbRK1owevf20AO1A8s63op/5GWLMG+qR30
	 Q60Wq7SBqaF+MQiLMrpq2DbKWysynXTUeeOLEFFda+n5jw7YnqdBUlCY2zZ8/rSbgk
	 56Nv/+pf6Ggp8RUzjsAWvwSCJaHtxDxMMQZjYbuCO1REEOOXa5LEOgnTy4MtjCAsPf
	 nMhwqhGYXJDKnHa/pLcK5kRcyAmRNMYezqgqgxIDQoapd7RSQS7oAtwe/ZZ0HtYKPx
	 mcTsUfL/5lHr2puRMxd/FNuPhA5TlCQVWdoqQuhQpnilDMswgBUeFGfsQeQ6YNJT/9
	 RPEg+790/bdQg==
Received: by gandalf.ozlabs.org (Postfix, from userid 1007)
	id 4bn4cL3czwz4x8Y; Wed, 23 Jul 2025 16:54:10 +1000 (AEST)
Date: Wed, 23 Jul 2025 16:56:50 +1000
From: David Gibson <david@gibson.dropbear.id.au>
To: Eugenio =?iso-8859-1?Q?P=E9rez?= <eperezma@redhat.com>
Subject: Re: [RFC v2 01/11] tap: implement vhost_call_cb
Message-ID: <aICHsmIC0GrUQbnC@zatzit>
References: <20250709174748.3514693-1-eperezma@redhat.com>
 <20250709174748.3514693-2-eperezma@redhat.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha512;
	protocol="application/pgp-signature"; boundary="0yHg5seefeSV+JWO"
Content-Disposition: inline
In-Reply-To: <20250709174748.3514693-2-eperezma@redhat.com>
Message-ID-Hash: NEK6RUAJ5RP2JJ6JHHVKXG5AMVD5R54J
X-Message-ID-Hash: NEK6RUAJ5RP2JJ6JHHVKXG5AMVD5R54J
X-MailFrom: dgibson@gandalf.ozlabs.org
X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header
CC: passt-dev@passt.top, jasowang@redhat.com
X-Mailman-Version: 3.3.8
Precedence: list
List-Id: Development discussion and patches for passt <passt-dev.passt.top>
Archived-At: <https://archives.passt.top/passt-dev/aICHsmIC0GrUQbnC@zatzit/>
Archived-At: <https://passt.top/hyperkitty/list/passt-dev@passt.top/message/NEK6RUAJ5RP2JJ6JHHVKXG5AMVD5R54J/>
List-Archive: <https://archives.passt.top/passt-dev/>
List-Archive: <https://passt.top/hyperkitty/list/passt-dev@passt.top/>
List-Help: <mailto:passt-dev-request@passt.top?subject=help>
List-Owner: <mailto:passt-dev-owner@passt.top>
List-Post: <mailto:passt-dev@passt.top>
List-Subscribe: <mailto:passt-dev-join@passt.top>
List-Unsubscribe: <mailto:passt-dev-leave@passt.top>


--0yHg5seefeSV+JWO
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Wed, Jul 09, 2025 at 07:47:38PM +0200, Eugenio P=E9rez wrote:
> Implement the rx side of the vhost-kernel, where the namespace send
> packets through its tap device and the kernel passes them to
> pasta using a virtqueue.  The virtqueue is build using a
> virtual ring (vring), which includes:
> * A descriptor table (buffer info)
> * An available ring (buffers ready for the kernel)
> * A used ring (buffers that the kernel has filled)
>=20
> The descriptor table holds an array of buffers defined by address and
> length. The kernel writes the packets transmitted by the namespace into
> these buffers.  The number of descriptors (vq size) is set by
> VHOST_NDESCS.  Pasta fills this table using pkt_buf, splitting it
> evenly across all descriptors.  This table is read-only for the kernel.
>=20
> The available ring is where pasta marks which buffers the kernel can
> use.  It's read only for the kernel.  It includes a ring[] array with
> the descriptor indexes and an avail->idx index. Pasta increments
> avail->idx when a new buffer is added, modulo the size of the
> virtqueue.  As pasta writes the rx buffers sequentially, ring[] is
> always [0, 1, 2...] and only avail->idx is incremented when new buffers
> are available for the kernel. avail->idx can be incremented by more
> than one at a time.
>=20
> Pasta also notifies the kernel of new available buffers by writing to
> the kick eventfd.
>=20
> Once the kernel has written a frame in a descriptor it writes its index
> into used_ring->ring[] and increments the used_ring->idx accordly.
> Like the avail idx the kernel can increase it by more than one.  Pasta
> gets a notification in the call eventfd, so we add it into the epoll ctx.
>=20
> Pasta assumes buffers are used in order. QEMU also assumes it in the
> virtio-net migration code so it is safe.
>=20
> Now, vhost-kernel is designed to read the virtqueues and the buffers as
> *guest* physical addresses (GPA), not process virtual addresses (HVA).
> The way QEMU tells the translations is through the memory regions.
> Since we don't have GPAs, let's just send the memory regions as a 1:1
> translations of the HVA.
>=20
> TODO: Evaluate if we can reuse the tap fd code instead of making a new
> epoll event type.

I'm not sure this is a particularly desirable code, unless it obvious
removes a bunch of duplicated code.  As a rule I'd prefer to use
different epoll event types rather than a single epoll type that then
needs to consult additional information to determine how to handle it.

> TODO: Split a new file for vhost (Stefano)
>=20
> Signed-off-by: Eugenio P=E9rez <eperezma@redhat.com>
> ---
> RFC v2:
> * Need to integrate "now" parameter in tap_add_packet and replace
> TAP_MSGS to TAP_MSGS{4,6}.
> * Actually refill rx queue at the end of operation.
> * Explain virtqueues and memory regions theory of operation.
> * Extrack vhost_kick so it can be reused by tx.
> * Add macro for VHOST regions.
> * Only register call_cb in rx queue, as tx calls are handled just if
>   needed.
> * Renamed vhost_call_cb to tap_vhost_input (David)
> * Move the inuse and last_used_idx to a writable struct from vhost tx
>   function. Context is readonly for vhost tx code path.
> * Changed from inuse to num_free tracking, more aligned with kernel drv
> * Use always the same tap_pool instead of having the two splitted for v4
>   and v6.
> * Space between (struct vhost_memory_region) {.XXX =3D ...}. (Stefano)
> * Expand comments about memory region size (David)
> * Add some TODOs.
>=20
> Signed-off-by: Eugenio P=E9rez <eperezma@redhat.com>
> ---
>  epoll_type.h |   2 +
>  passt.c      |   7 +-
>  passt.h      |  10 +-
>  tap.c        | 311 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>  tap.h        |   8 ++
>  5 files changed, 335 insertions(+), 3 deletions(-)
>=20
> diff --git a/epoll_type.h b/epoll_type.h
> index 12ac59b..0371c14 100644
> --- a/epoll_type.h
> +++ b/epoll_type.h
> @@ -44,6 +44,8 @@ enum epoll_type {
>  	EPOLL_TYPE_REPAIR_LISTEN,
>  	/* TCP_REPAIR helper socket */
>  	EPOLL_TYPE_REPAIR,
> +	/* vhost-kernel call socket */
> +	EPOLL_TYPE_VHOST_CALL,
> =20
>  	EPOLL_NUM_TYPES,
>  };
> diff --git a/passt.c b/passt.c
> index 388d10f..0f2659c 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -79,6 +79,7 @@ char *epoll_type_str[] =3D {
>  	[EPOLL_TYPE_VHOST_KICK]		=3D "vhost-user kick socket",
>  	[EPOLL_TYPE_REPAIR_LISTEN]	=3D "TCP_REPAIR helper listening socket",
>  	[EPOLL_TYPE_REPAIR]		=3D "TCP_REPAIR helper socket",
> +	[EPOLL_TYPE_VHOST_CALL]		=3D "vhost-kernel call socket",
>  };
>  static_assert(ARRAY_SIZE(epoll_type_str) =3D=3D EPOLL_NUM_TYPES,
>  	      "epoll_type_str[] doesn't match enum epoll_type");
> @@ -310,7 +311,8 @@ loop:
> =20
>  		switch (ref.type) {
>  		case EPOLL_TYPE_TAP_PASTA:
> -			tap_handler_pasta(&c, eventmask, &now);
> +			// TODO: Find a better way, maybe reuse the fd.
> +			// tap_handler_pasta(&c, eventmask, &now);

This change seems bogus.  We still need this if we want to be able to
fall back to tap without vhost, which I think we want to be able to
do.  I'm also not clear why you need it - if you don't want these
events, can't you just not register them (at least not with this event
type) in epoll_ctl?

>  			break;
>  		case EPOLL_TYPE_TAP_PASST:
>  			tap_handler_passt(&c, eventmask, &now);
> @@ -357,6 +359,9 @@ loop:
>  		case EPOLL_TYPE_REPAIR:
>  			repair_handler(&c, eventmask);
>  			break;
> +		case EPOLL_TYPE_VHOST_CALL:
> +			tap_vhost_input(&c, ref, &now);

I'd suggest calling this tap_handler_vhost() for consistency with
tap_handler_pasta() and others.  I've also found it usually ends up a
bit cleaner to pass the relevant individual members of ref, rather
than the whole thing.

> +			break;
>  		default:
>  			/* Can't happen */
>  			ASSERT(0);
> diff --git a/passt.h b/passt.h
> index 8693794..7bb86c4 100644
> --- a/passt.h
> +++ b/passt.h
> @@ -45,7 +45,7 @@ union epoll_ref;
>   * @icmp:	ICMP-specific reference part
>   * @data:	Data handled by protocol handlers
>   * @nsdir_fd:	netns dirfd for fallback timer checking if namespace is go=
ne
> - * @queue:	vhost-user queue index for this fd
> + * @queue:	vhost queue index for this fd
>   * @u64:	Opaque reference for epoll_ctl() and epoll_wait()
>   */
>  union epoll_ref {
> @@ -269,11 +269,14 @@ struct ctx {
>  	int fd_tap;
>  	int fd_repair_listen;
>  	int fd_repair;
> +	/* TODO document all added fields */
> +	int fd_vhost;
>  	unsigned char our_tap_mac[ETH_ALEN];
>  	unsigned char guest_mac[ETH_ALEN];
>  	uint16_t mtu;
> =20
>  	uint64_t hash_secret[2];
> +	uint64_t virtio_features;
> =20
>  	int ifi4;
>  	struct ip4_ctx ip4;
> @@ -297,6 +300,11 @@ struct ctx {
>  	int no_icmp;
>  	struct icmp_ctx icmp;
> =20
> +	struct {
> +		int kick_fd;
> +		int call_fd;
> +	} vq[2];

Maybe not in scope for this series, but this is reminding me that we
should really split out the tap-backend specific fields into their own
union of substructures to make it clearer what fields are relevant in
which configurations.


>  	int no_dns;
>  	int no_dns_search;
>  	int no_dhcp_dns;
> diff --git a/tap.c b/tap.c
> index 6db5d88..e4a3822 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -31,6 +31,7 @@
>  #include <sys/types.h>
>  #include <sys/stat.h>
>  #include <fcntl.h>
> +#include <sys/eventfd.h>
>  #include <sys/uio.h>
>  #include <stdbool.h>
>  #include <stdlib.h>
> @@ -101,6 +102,51 @@ static PACKET_POOL_NOINIT(pool_tap6, TAP_MSGS_IP6, p=
kt_buf);
>  #define TAP_SEQS		128 /* Different L4 tuples in one batch */
>  #define FRAGMENT_MSG_RATE	10  /* # seconds between fragment warnings */
> =20
> +#define VHOST_VIRTIO         0xAF
> +#define VHOST_GET_FEATURES   _IOR(VHOST_VIRTIO, 0x00, __u64)
> +#define VHOST_SET_FEATURES   _IOW(VHOST_VIRTIO, 0x00, __u64)
> +#define VHOST_SET_OWNER	     _IO(VHOST_VIRTIO, 0x01)
> +#define VHOST_SET_MEM_TABLE  _IOW(VHOST_VIRTIO, 0x03, struct vhost_memor=
y)
> +#define VHOST_SET_VRING_NUM  _IOW(VHOST_VIRTIO, 0x10, struct vhost_vring=
_state)
> +#define VHOST_SET_VRING_ADDR _IOW(VHOST_VIRTIO, 0x11, struct vhost_vring=
_addr)
> +#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vring=
_file)
> +#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vring=
_file)
> +#define VHOST_SET_VRING_ERR  _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring=
_file)
> +#define VHOST_SET_BACKEND_FEATURES _IOW(VHOST_VIRTIO, 0x25, __u64)
> +#define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vrin=
g_file)

This stuff probably belongs in linux_dep.h (if you can't get it from a
system or kernel header).

> +#define VHOST_NDESCS (PKT_BUF_BYTES / 65520)

I suspect this is wrong.  For starters we should definitely be using a
define, not open-coded 65520.  But more importantly, 65520 is the
(default) MTU - the maximum size *of the IP packet*, excluding L2
headers.  I'm assuming these buffers will also need to contain the L2
header, so will need to be larger.  Looking at the later code, it
looks like these buffers also include the virtio_net_hdr_mrg_rxbuf
pseudo-physical header, so will need to be bigger again.

I think you'll want a new define for this, which would be
L2_MAX_LEN_PASTA plus whatever you need on top of the Ethernet header
(virtion_net_hdr_mrg_rxbuf, AFAICT).  For now, I'm going to call this
L1_MAX_LEN_VHOST, which is not a great name, but I want to refer to it
later in this mail.  You may also need/want to round it up to
4/8/whatever bytes to avoid poorly aligned buffers.

> +static_assert(!(VHOST_NDESCS & (VHOST_NDESCS - 1)),
> +			 "Number of vhost descs must be a power of two by standard");

Relying on this calculation come out to a power of 2 seems kind of
fragile.  Maybe just round down to a power of 2 instead?  Although I
guess that could mean a bunch of wasted space.

To avoid wasting space if this doesn't come neatly to a power of 2,
would it be safe to round _up_ to a power of 2, as long as we never,
ever put an index beyond what we actually have space for into the
available queue?

> +static struct {
> +	/* Number of free descriptors */
> +	uint16_t num_free;
> +
> +	/* Last used idx processed */
> +	uint16_t last_used_idx;
> +} vqs[2];
> +
> +static struct vring_desc vring_desc[2][VHOST_NDESCS] __attribute__((alig=
ned(PAGE_SIZE)));
> +static union {
> +	struct vring_avail avail;
> +	char buf[offsetof(struct vring_avail, ring[VHOST_NDESCS])];

The purpose of this idiom of a union with a char buffer based on
offsetof isn't obvious to me yet.

> +} vring_avail_0 __attribute__((aligned(PAGE_SIZE))), vring_avail_1 __att=
ribute__((aligned(PAGE_SIZE)));
> +static union {
> +	struct vring_used used;
> +	char buf[offsetof(struct vring_used, ring[VHOST_NDESCS])];
> +} vring_used_0 __attribute__((aligned(PAGE_SIZE))), vring_used_1 __attri=
bute__((aligned(PAGE_SIZE)));

Maybe 2 entry arrays, rather than *_0 and *_1?

> +
> +/* all descs ring + 2rings * 2vqs + tx pkt buf + rx pkt buf */
> +#define N_VHOST_REGIONS 6
> +union {
> +	struct vhost_memory mem;
> +	char buf[offsetof(struct vhost_memory, regions[N_VHOST_REGIONS])];
> +} vhost_memory =3D {
> +	.mem =3D {
> +		.nregions =3D N_VHOST_REGIONS,
> +	},
> +};
> +
>  /**
>   * tap_l2_max_len() - Maximum frame size (including L2 header) for curre=
nt mode
>   * @c:		Execution context
> @@ -399,6 +445,18 @@ void tap_icmp6_send(const struct ctx *c,
>  	tap_send_single(c, buf, l4len + ((char *)icmp6h - buf));
>  }
> =20
> +static void vhost_kick(struct vring_used *used, int kick_fd) {

passt style puts the opening { of a function on the nect line.

> +	/* We need to expose available array entries before checking avail
> +	 * event.
> +	 *
> +	 * TODO: Does eventfd_write already do this?
> +	 */
> +	smp_mb();
> +
> +	if (!(used->flags & VRING_USED_F_NO_NOTIFY))
> +		eventfd_write(kick_fd, 1);
> +}
> +
>  /**
>   * tap_send_frames_pasta() - Send multiple frames to the pasta tap
>   * @c:			Execution context
> @@ -1386,6 +1444,89 @@ void tap_listen_handler(struct ctx *c, uint32_t ev=
ents)
>  	tap_start_connection(c);
>  }
>
> +static void *virtqueue_get_rx_buf(unsigned qid, unsigned *len)

None of these functions have descriptive comment blocks yet.  I
realise this is an RFC, but it can certainly make things easier to
review if you know what to expect before reading the function.

> +{
> +	struct vring_used *used =3D !qid ? &vring_used_0.used : &vring_used_1.u=
sed;

Using an array as suggested above would avoid this awkward construct.

> +	uint32_t i;
> +	uint16_t used_idx, last_used;

passt style orders declarations in reverse order of line length (when
possible), so this line should be before the previous one.

> +
> +	/* TODO think if this has races with previous eventfd_read */
> +	/* TODO we could improve performance with a shadow_used_idx */
> +	used_idx =3D le16toh(used->idx);
> +
> +	smp_rmb();
> +
> +	if (used_idx =3D=3D vqs[0].last_used_idx) {
> +		*len =3D 0;
> +		return NULL;
> +	}
> +
> +	last_used =3D vqs[0].last_used_idx % VHOST_NDESCS;
> +	i =3D le32toh(used->ring[last_used].id);
> +	*len =3D le32toh(used->ring[last_used].len);
> +
> +	/* Make sure the kernel is consuming the descriptors in order */
> +	if (i !=3D last_used) {
> +		die("vhost: id %u at used position %u !=3D %u", i, last_used, i);
> +		return NULL;
> +	}
> +
> +	if (*len > PKT_BUF_BYTES/VHOST_NDESCS) {

You want your L1_MAX_LEN_VHOST (or whatever it's called) again here.

> +		warn("vhost: id %d len %u > %zu", i, *len, PKT_BUF_BYTES/VHOST_NDESCS);

And here.

> +		return NULL;
> +	}
> +
> +	/* TODO check if the id is valid and it has not been double used */
> +	vqs[0].last_used_idx++;
> +	vqs[0].num_free++;
> +	return pkt_buf + i * (PKT_BUF_BYTES/VHOST_NDESCS);

And here.

> +}
> +
> +/* TODO this assumes the kernel consumes descriptors in order */
> +static void rx_pkt_refill(struct ctx *c)
> +{
> +	/* TODO: tune this threshold */
> +	if (!vqs[0].num_free)
> +		return;
> +
> +	vring_avail_0.avail.idx +=3D vqs[0].num_free;
> +	vqs[0].num_free =3D 0;
> +	vhost_kick(&vring_used_0.used, c->vq[0].kick_fd);
> +}
> +
> +void tap_vhost_input(struct ctx *c, union epoll_ref ref, const struct ti=
mespec *now)
> +{
> +	eventfd_read(ref.fd, (eventfd_t[]){ 0 });
> +
> +	tap_flush_pools();
> +
> +	while (true) {
> +		struct virtio_net_hdr_mrg_rxbuf *hdr;
> +		unsigned len;
> +
> +		hdr =3D virtqueue_get_rx_buf(ref.queue, &len);
> +		if (!hdr)
> +			break;

You could use while ((hdr =3D virtqueue_get_rx_buf(...))) instead of
while (true), couldn't you?

> +
> +		if (len < sizeof(*hdr)) {

You should check that there are enough bytes for the L2 header as well
as the mrg_rxbuf header - tap_add_packet() doesn't appear to check
that itself (maybe it should).

> +			warn("vhost: invalid len %u", len);
> +			continue;
> +		}
> +
> +		/* TODO this will break from this moment */
> +		if (hdr->num_buffers !=3D 1) {
> +			warn("vhost: Too many buffers %u, %zu bytes should be enough for ever=
ybody!", hdr->num_buffers, PKT_BUF_BYTES/VHOST_NDESCS);

L1_MAX_LEN_VHOST again.  Also passt style keeps lines below 80
columns.

> +			continue;
> +		}
> +
> +		/* TODO fix the v6 pool to support ipv6 */

Not clear on why anything particular is needed for ipv6 here.

> +		tap_add_packet(c, len - sizeof(*hdr), (void *)(hdr+1), now);
> +	}
> +
> +	tap_handler(c, now);
> +	rx_pkt_refill(c);
> +}
> +
>  /**
>   * tap_ns_tun() - Get tuntap fd in namespace
>   * @c:		Execution context
> @@ -1396,10 +1537,14 @@ void tap_listen_handler(struct ctx *c, uint32_t e=
vents)
>   */
>  static int tap_ns_tun(void *arg)
>  {
> +	/* TODO we need to check if vhost support VIRTIO_NET_F_MRG_RXBUF and VH=
OST_NET_F_VIRTIO_NET_HDR actually */
> +	static const uint64_t features =3D
> +		(1ULL << VIRTIO_F_VERSION_1) | (1ULL << VIRTIO_NET_F_MRG_RXBUF) | (1UL=
L << VHOST_NET_F_VIRTIO_NET_HDR);

If I understand the code above properly, for the Rx path at least we
don't need or want F_MRG_RXBUF.  So why require it here?  If we want
it for the Tx path, we can omit it here and add it later in the
series.

>  	struct ifreq ifr =3D { .ifr_flags =3D IFF_TAP | IFF_NO_PI };
>  	int flags =3D O_RDWR | O_NONBLOCK | O_CLOEXEC;
>  	struct ctx *c =3D (struct ctx *)arg;
> -	int fd, rc;
> +	unsigned i;
> +	int fd, vhost_fd, rc;
> =20
>  	c->fd_tap =3D -1;
>  	memcpy(ifr.ifr_name, c->pasta_ifn, IFNAMSIZ);
> @@ -1409,6 +1554,143 @@ static int tap_ns_tun(void *arg)
>  	if (fd < 0)
>  		die_perror("Failed to open() /dev/net/tun");
> =20
> +	vhost_fd =3D open("/dev/vhost-net", flags);
> +	if (vhost_fd < 0)
> +		die_perror("Failed to open() /dev/vhost-net");
> +
> +	rc =3D ioctl(vhost_fd, VHOST_SET_OWNER, NULL);
> +	if (rc < 0)
> +		die_perror("VHOST_SET_OWNER ioctl on /dev/vhost-net failed");
> +
> +	rc =3D ioctl(vhost_fd, VHOST_GET_FEATURES, &c->virtio_features);
> +	if (rc < 0)
> +		die_perror("VHOST_GET_FEATURES ioctl on /dev/vhost-net failed");
> +
> +	/* TODO inform more explicitely */
> +	fprintf(stderr, "vhost features: %lx\n", c->virtio_features);
> +	fprintf(stderr, "req features: %lx\n", features);

You should use info() or debug() here.

> +	c->virtio_features &=3D features;
> +	if (c->virtio_features !=3D features)
> +		die("vhost does not support required features");
> +
> +	for (i =3D 0; i < ARRAY_SIZE(c->vq); i++) {
> +		struct vhost_vring_file file =3D {
> +			.index =3D i,
> +		};
> +		union epoll_ref ref =3D { .type =3D EPOLL_TYPE_VHOST_CALL,
> +					.queue =3D i };
> +		struct epoll_event ev;
> +
> +		file.fd =3D eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
> +		ref.fd =3D file.fd;
> +		if (file.fd < 0)
> +			die_perror("Failed to create call eventfd");

I know it's technically harmless, but it reads better for me if you
check for the error immediately after the eventfd() call, before doing
something else with the value.

> +
> +		rc =3D ioctl(vhost_fd, VHOST_SET_VRING_CALL, &file);
> +		if (rc < 0)
> +			die_perror(
> +				"VHOST_SET_VRING_CALL ioctl on /dev/vhost-net failed");
> +
> +		ev =3D (struct epoll_event){ .data.u64 =3D ref.u64, .events =3D EPOLLI=
N };
> +		if (i =3D=3D 0) {
> +			rc =3D epoll_ctl(c->epollfd, EPOLL_CTL_ADD, ref.fd, &ev);
> +			if (rc < 0)
> +				die_perror("Failed to add call eventfd to epoll");
> +		}
> +		c->vq[i].call_fd =3D file.fd;
> +	}
> +
> +	/*
> +	 * Unlike most of C-style APIs, userspace_addr+memory_size is
> +	 * also accesible by the kernel.  Include a -1 to adjust.
> +	 */
> +#define VHOST_MEMORY_REGION_PTR(addr, size) \
> +	(struct vhost_memory_region) { \
> +		.guest_phys_addr =3D (uintptr_t)addr, \
> +		.memory_size =3D size - 1, \
> +		.userspace_addr =3D (uintptr_t)addr, \
> +	}
> +#define VHOST_MEMORY_REGION(elem) VHOST_MEMORY_REGION_PTR(&elem, sizeof(=
elem))
> +
> +	/* 1:1 translation */
> +	vhost_memory.mem.regions[0] =3D VHOST_MEMORY_REGION(vring_desc);
> +	vhost_memory.mem.regions[1] =3D VHOST_MEMORY_REGION(vring_avail_0);
> +	vhost_memory.mem.regions[2] =3D VHOST_MEMORY_REGION(vring_avail_1);
> +	vhost_memory.mem.regions[3] =3D VHOST_MEMORY_REGION(vring_used_0);
> +	vhost_memory.mem.regions[4] =3D VHOST_MEMORY_REGION(vring_used_1);
> +	vhost_memory.mem.regions[5] =3D VHOST_MEMORY_REGION(pkt_buf);
> +	static_assert(5 < N_VHOST_REGIONS);
> +#undef VHOST_MEMORY_REGION
> +#undef VHOST_MEMORY_REGION_PTR
> +
> +	rc =3D ioctl(vhost_fd, VHOST_SET_MEM_TABLE, &vhost_memory.mem);
> +	if (rc < 0)
> +		die_perror(
> +			"VHOST_SET_MEM_TABLE ioctl on /dev/vhost-net failed");
> +
> +	/* TODO: probably it increases RX perf */
> +#if 0
> +	struct ifreq ifr;
> +	memset(&ifr, 0, sizeof(ifr));
> +
> +	if (ioctl(fd, TUNGETIFF, &ifr) !=3D 0)
> +		die_perror("Unable to query TUNGETIFF on FD %d", fd);
> +	}
> +
> +	if (ifr.ifr_flags & IFF_VNET_HDR)
> +		net->dev.features &=3D ~(1ULL << VIRTIO_NET_F_MRG_RXBUF);
> +#endif
> +	rc =3D ioctl(vhost_fd, VHOST_SET_FEATURES, &c->virtio_features);
> +	if (rc < 0)
> +		die_perror("VHOST_SET_FEATURES ioctl on /dev/vhost-net failed");
> +
> +	/* Duplicating foreach queue to follow the exact order from QEMU */
> +	for (i =3D 0; i < ARRAY_SIZE(c->vq); i++) {
> +		struct vhost_vring_addr addr =3D {
> +			.index =3D i,
> +			.desc_user_addr =3D (unsigned long)vring_desc[i],
> +			.avail_user_addr =3D i =3D=3D 0 ? (unsigned long)&vring_avail_0 :
> +										(unsigned long)&vring_avail_1,
> +			.used_user_addr =3D i =3D=3D 0 ? (unsigned long)&vring_used_0 :
> +										(unsigned long)&vring_used_1,
> +			/* GPA addr */
> +			.log_guest_addr =3D i =3D=3D 0 ? (unsigned long)&vring_used_0 :
> +									   (unsigned long)&vring_used_1,
> +		};
> +		struct vhost_vring_state state =3D {
> +			.index =3D i,
> +			.num =3D VHOST_NDESCS,
> +		};
> +		struct vhost_vring_file file =3D {
> +			.index =3D i,
> +		};
> +
> +		rc =3D ioctl(vhost_fd, VHOST_SET_VRING_NUM, &state);
> +		if (rc < 0)
> +			die_perror(
> +				"VHOST_SET_VRING_NUM ioctl on /dev/vhost-net failed");
> +
> +		fprintf(stderr, "qid: %d\n", i);
> +		fprintf(stderr, "vhost desc addr: 0x%llx\n", addr.desc_user_addr);
> +		fprintf(stderr, "vhost avail addr: 0x%llx\n", addr.avail_user_addr);
> +		fprintf(stderr, "vhost used addr: 0x%llx\n", addr.used_user_addr);
> +		rc =3D ioctl(vhost_fd, VHOST_SET_VRING_ADDR, &addr);
> +		if (rc < 0)
> +			die_perror(
> +				"VHOST_SET_VRING_ADDR ioctl on /dev/vhost-net failed");
> +
> +		file.fd =3D eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
> +		if (file.fd < 0)
> +			die_perror("Failed to create kick eventfd");
> +		rc =3D ioctl(vhost_fd, VHOST_SET_VRING_KICK, &file);
> +		if (rc < 0)
> +			die_perror(
> +				"VHOST_SET_VRING_KICK ioctl on /dev/vhost-net failed");
> +		c->vq[i].kick_fd =3D file.fd;
> +
> +		vqs[i].num_free =3D VHOST_NDESCS;
> +	}
> +
>  	rc =3D ioctl(fd, (int)TUNSETIFF, &ifr);
>  	if (rc < 0)
>  		die_perror("TUNSETIFF ioctl on /dev/net/tun failed");
> @@ -1416,7 +1698,34 @@ static int tap_ns_tun(void *arg)
>  	if (!(c->pasta_ifi =3D if_nametoindex(c->pasta_ifn)))
>  		die("Tap device opened but no network interface found");
> =20
> +	for (i =3D 0; i < VHOST_NDESCS; ++i) {
> +		vring_desc[0][i].addr =3D (uintptr_t)pkt_buf + i * (PKT_BUF_BYTES/VHOS=
T_NDESCS);
> +		vring_desc[0][i].len =3D PKT_BUF_BYTES/VHOST_NDESCS;
> +		vring_desc[0][i].flags =3D VRING_DESC_F_WRITE;
> +	}
> +	for (i =3D 0; i < VHOST_NDESCS; ++i) {

No { } for single line blocks in passt style.

> +		vring_avail_0.avail.ring[i] =3D i;
> +	}
> +
> +	rx_pkt_refill(c);
> +
> +	/* Duplicating foreach queue to follow the exact order from QEMU */
> +	for (int i =3D 0; i < ARRAY_SIZE(c->vq); i++) {
> +		struct vhost_vring_file file =3D {
> +			.index =3D i,
> +			.fd =3D fd,
> +		};
> +
> +		fprintf(stderr, "qid: %d\n", file.index);
> +		fprintf(stderr, "tap fd: %d\n", file.fd);
> +		rc =3D ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &file);
> +		if (rc < 0)
> +			die_perror(
> +				"VHOST_NET_SET_BACKEND ioctl on /dev/vhost-net failed");
> +	}
> +
>  	c->fd_tap =3D fd;
> +	c->fd_vhost =3D vhost_fd;
> =20
>  	return 0;
>  }
> diff --git a/tap.h b/tap.h
> index 6fe3d15..ff8cee5 100644
> --- a/tap.h
> +++ b/tap.h
> @@ -69,6 +69,14 @@ static inline void tap_hdr_update(struct tap_hdr *thdr=
, size_t l2len)
>  		thdr->vnet_len =3D htonl(l2len);
>  }
> =20
> +/**
> + * tap_vhost_input() - Handler for new data on the socket to hypervisor =
vq
> + * @c:		Execution context
> + * @ref:	epoll reference
> + * @now:	Current timestamp
> + */

passt convention puts these comments in the .c file not the .h, even
for exported functions.

> +void tap_vhost_input(struct ctx *c, union epoll_ref ref, const struct ti=
mespec *now);
> +
>  unsigned long tap_l2_max_len(const struct ctx *c);
>  void *tap_push_l2h(const struct ctx *c, void *buf, uint16_t proto);
>  void *tap_push_ip4h(struct iphdr *ip4h, struct in_addr src,

--=20
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

--0yHg5seefeSV+JWO
Content-Type: application/pgp-signature; name=signature.asc

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCgAdFiEEO+dNsU4E3yXUXRK2zQJF27ox2GcFAmiAh6oACgkQzQJF27ox
2GcTYQ//aRG3k54OX4TXgOd+cJJDueUERXen5IRU4mQzei8+soDmi9B+4Zh64ieb
1O3vffv6AM2iLIqGlwhCPqW+SuJ6ErQOgajaHs0l3Q2p2QdwDp/MJgynidwMLrmn
4ih7ckUeFrh0NgUEY9rnfCKtkC9toDBrm1+NQZ0bvjY/qDri1TxVfPLhueXDSCc0
4dh4enj9Aao9Oh1gkZjV/n8RYJ9gDXyLCbFdISBxymZwCZmD9AinWV6ekCWW69IX
Xyimt/RnCRObMygn8FaJe83X+R1QT7ggFvDLnjqx6AUartQFDEFrlwzIL0hwHI7u
oaconnOaUmJYlIrbabVvrR8v5NHSIPq1OfV/Z42fclN+mX1Smla4NuXYSgL/e/bE
sxwiCLUUC0jiOLl52p7uVmJl62GWNx1ZgZrojrwqpiXeOPxRDO9RpKqlMA91uSFa
LXZ4bZSyhrJKl1KEGxY7hmFncmHp+Jo5BziPj9WPBc0ws8dWt1K1xySB1M94iPco
RJTt3zeFP+xXTL5xlTw1ZbSziRWh3Y0shslDS5cuIR+/jgb9TNGLQLKQ5gh44IvA
qmUFV989p4yZSye8+kkVw3G9B+xjgkhDOda1GWOqfxMUnuJkttBXCiqYeYhtS4kM
H6UGPcborjtNM5Yxh5gZHpXEbNRB8tPMESMVWyo+pS9PJT94DAM=
=7ufQ
-----END PGP SIGNATURE-----

--0yHg5seefeSV+JWO--