public inbox for passt-dev@passt.top
* [PATCH v3 0/4] Add vhost-user support to passt. (part 3)
@ 2024-08-15 15:50 Laurent Vivier
  2024-08-15 15:50 ` [PATCH v3 1/4] packet: replace struct desc by struct iovec Laurent Vivier
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Laurent Vivier @ 2024-08-15 15:50 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

This series of patches adds vhost-user support to passt,
allowing passt to connect to the QEMU network backend through
virtqueues rather than through a socket.

With QEMU, rather than connecting with:

  -netdev stream,id=s,server=off,addr.type=unix,addr.path=/tmp/passt_1.socket

we will use:

  -chardev socket,id=chr0,path=/tmp/passt_1.socket
  -netdev vhost-user,id=netdev0,chardev=chr0
  -device virtio-net,netdev=netdev0
  -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE
  -numa node,memdev=memfd0

The memory backend is needed to share data between passt and QEMU.
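
On the passt side, the same UNIX domain socket is given to passt
itself. As a rough illustration (the exact command-line switch is
defined by the conf.c changes in this series; the option name used
here is only an assumption):

  ./passt --vhost-user -s /tmp/passt_1.socket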

Performance comparison between "-netdev stream" and "-netdev vhost-user":

$ iperf3 -c localhost -p 10001  -t 60 -6 -u -b 50G

socket:
[  5]   0.00-60.05  sec  95.6 GBytes  13.7 Gbits/sec  0.017 ms  6998988/10132413 (69%)  receiver
vhost-user:
[  5]   0.00-60.04  sec   237 GBytes  33.9 Gbits/sec  0.006 ms  53673/7813770 (0.69%)  receiver

$ iperf3 -c localhost -p 10001  -t 60 -4 -u -b 50G

socket:
[  5]   0.00-60.05  sec  98.9 GBytes  14.1 Gbits/sec  0.018 ms  6260735/9501832 (66%)  receiver
vhost-user:
[  5]   0.00-60.05  sec   235 GBytes  33.7 Gbits/sec  0.008 ms  37581/7752699 (0.48%)  receiver

$ iperf3 -c localhost -p 10001  -t 60 -6

socket:
[  5]   0.00-60.00  sec  17.3 GBytes  2.48 Gbits/sec    0             sender
[  5]   0.00-60.06  sec  17.3 GBytes  2.48 Gbits/sec                  receiver
vhost-user:
[  5]   0.00-60.00  sec   191 GBytes  27.4 Gbits/sec    0             sender
[  5]   0.00-60.05  sec   191 GBytes  27.3 Gbits/sec                  receiver

$ iperf3 -c localhost -p 10001  -t 60 -4

socket:
[  5]   0.00-60.00  sec  15.6 GBytes  2.24 Gbits/sec    0             sender
[  5]   0.00-60.06  sec  15.6 GBytes  2.24 Gbits/sec                  receiver
vhost-user:
[  5]   0.00-60.00  sec   189 GBytes  27.1 Gbits/sec    0             sender
[  5]   0.00-60.04  sec   189 GBytes  27.0 Gbits/sec                  receiver

v3:
  - rebase on top of flow table
  - update tcp_vu.c to look like udp_vu.c (recv()/prepare()/send_frame())
  - address comments from Stefano and David on version 2

v2:
  - remove PATCH 4
  - rewrite PATCH 2 and 3 to follow passt coding style
  - move some code from PATCH 3 to PATCH 4 (previously PATCH 5)
  - partially addressed David's comment on PATCH 5

Laurent Vivier (4):
  packet: replace struct desc by struct iovec
  vhost-user: introduce virtio API
  vhost-user: introduce vhost-user API
  vhost-user: add vhost-user

 Makefile       |    6 +-
 checksum.c     |    1 -
 conf.c         |   24 +-
 epoll_type.h   |    4 +
 iov.c          |    1 -
 isolation.c    |   15 +-
 packet.c       |   93 ++--
 packet.h       |   16 +-
 passt.c        |   25 +-
 passt.h        |    6 +
 pcap.c         |    1 -
 tap.c          |  106 +++-
 tap.h          |    5 +-
 tcp.c          |   33 +-
 tcp_buf.c      |    6 +-
 tcp_internal.h |    3 +-
 tcp_vu.c       |  593 ++++++++++++++++++++++
 tcp_vu.h       |   12 +
 udp.c          |   71 +--
 udp.h          |    8 +-
 udp_internal.h |   34 ++
 udp_vu.c       |  338 +++++++++++++
 udp_vu.h       |   13 +
 util.h         |    8 +
 vhost_user.c   | 1277 ++++++++++++++++++++++++++++++++++++++++++++++++
 vhost_user.h   |  202 ++++++++
 virtio.c       |  656 +++++++++++++++++++++++++
 virtio.h       |  185 +++++++
 vu_common.c    |   27 +
 vu_common.h    |   34 ++
 30 files changed, 3666 insertions(+), 137 deletions(-)
 create mode 100644 tcp_vu.c
 create mode 100644 tcp_vu.h
 create mode 100644 udp_internal.h
 create mode 100644 udp_vu.c
 create mode 100644 udp_vu.h
 create mode 100644 vhost_user.c
 create mode 100644 vhost_user.h
 create mode 100644 virtio.c
 create mode 100644 virtio.h
 create mode 100644 vu_common.c
 create mode 100644 vu_common.h

-- 
2.45.2



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v3 1/4] packet: replace struct desc by struct iovec
  2024-08-15 15:50 [PATCH v3 0/4] Add vhost-user support to passt. (part 3) Laurent Vivier
@ 2024-08-15 15:50 ` Laurent Vivier
  2024-08-20  0:27   ` David Gibson
  2024-08-15 15:50 ` [PATCH v3 2/4] vhost-user: introduce virtio API Laurent Vivier
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Laurent Vivier @ 2024-08-15 15:50 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

To manage buffers inside shared memory provided by a VM via a
vhost-user interface, we cannot rely on the buffers being located
in a pre-defined memory area, addressed with a base address and a
32-bit offset.

We need a full 64-bit address, so replace struct desc by struct
iovec and update the range checking.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
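Note for reviewers (not part of the commit message): a minimal
sketch of the addressing change. With struct desc, a packet could
only be described as a 32-bit offset into the pool's single
pre-defined buffer; a struct iovec carries a full 64-bit pointer,
so a descriptor can now reference guest memory mapped anywhere in
our address space:

	/* before: offset-based, confined to the pool buffer */
	char *data = p->buf + p->pkt[idx].offset + offset;

	/* after: iov_base is a self-contained 64-bit pointer */
	char *data = (char *)p->pkt[idx].iov_base + offset;
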
 packet.c | 80 ++++++++++++++++++++++++++++++--------------------------
 packet.h | 14 ++--------
 2 files changed, 45 insertions(+), 49 deletions(-)

diff --git a/packet.c b/packet.c
index ccfc84607709..37489961a37e 100644
--- a/packet.c
+++ b/packet.c
@@ -22,6 +22,35 @@
 #include "util.h"
 #include "log.h"
 
+/**
+ * packet_check_range() - Check if a packet memory range is valid
+ * @p:		Packet pool
+ * @offset:	Offset of data range in packet descriptor
+ * @len:	Length of desired data range
+ * @start:	Start of the packet descriptor
+ * @func:	For tracing: name of calling function
+ * @line:	For tracing: caller line of function call
+ *
+ * Return: 0 if the range is valid, -1 otherwise
+ */
+static int packet_check_range(const struct pool *p, size_t offset, size_t len,
+			      const char *start, const char *func, int line)
+{
+	if (start < p->buf) {
+		trace("packet start %p before buffer start %p, "
+		      "%s:%i", (void *)start, (void *)p->buf, func, line);
+		return -1;
+	}
+
+	if (start + len + offset > p->buf + p->buf_size) {
+		trace("packet offset plus length %lu from size %lu, "
+		      "%s:%i", start - p->buf + len + offset,
+		      p->buf_size, func, line);
+		return -1;
+	}
+
+	return 0;
+}
 /**
  * packet_add_do() - Add data as packet descriptor to given pool
  * @p:		Existing pool
@@ -41,34 +70,16 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 		return;
 	}
 
-	if (start < p->buf) {
-		trace("add packet start %p before buffer start %p, %s:%i",
-		      (void *)start, (void *)p->buf, func, line);
+	if (packet_check_range(p, 0, len, start, func, line))
 		return;
-	}
-
-	if (start + len > p->buf + p->buf_size) {
-		trace("add packet start %p, length: %zu, buffer end %p, %s:%i",
-		      (void *)start, len, (void *)(p->buf + p->buf_size),
-		      func, line);
-		return;
-	}
 
 	if (len > UINT16_MAX) {
 		trace("add packet length %zu, %s:%i", len, func, line);
 		return;
 	}
 
-#if UINTPTR_MAX == UINT64_MAX
-	if ((uintptr_t)start - (uintptr_t)p->buf > UINT32_MAX) {
-		trace("add packet start %p, buffer start %p, %s:%i",
-		      (void *)start, (void *)p->buf, func, line);
-		return;
-	}
-#endif
-
-	p->pkt[idx].offset = start - p->buf;
-	p->pkt[idx].len = len;
+	p->pkt[idx].iov_base = (void *)start;
+	p->pkt[idx].iov_len = len;
 
 	p->count++;
 }
@@ -96,36 +107,31 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 		return NULL;
 	}
 
-	if (len > UINT16_MAX || len + offset > UINT32_MAX) {
+	if (len > UINT16_MAX) {
 		if (func) {
-			trace("packet data length %zu, offset %zu, %s:%i",
-			      len, offset, func, line);
+			trace("packet data length %zu, %s:%i",
+			      len, func, line);
 		}
 		return NULL;
 	}
 
-	if (p->pkt[idx].offset + len + offset > p->buf_size) {
+	if (len + offset > p->pkt[idx].iov_len) {
 		if (func) {
-			trace("packet offset plus length %zu from size %zu, "
-			      "%s:%i", p->pkt[idx].offset + len + offset,
-			      p->buf_size, func, line);
+			trace("data length %zu, offset %zu from length %zu, "
+			      "%s:%i", len, offset, p->pkt[idx].iov_len,
+			      func, line);
 		}
 		return NULL;
 	}
 
-	if (len + offset > p->pkt[idx].len) {
-		if (func) {
-			trace("data length %zu, offset %zu from length %u, "
-			      "%s:%i", len, offset, p->pkt[idx].len,
-			      func, line);
-		}
+	if (packet_check_range(p, offset, len, p->pkt[idx].iov_base,
+			       func, line))
 		return NULL;
-	}
 
 	if (left)
-		*left = p->pkt[idx].len - offset - len;
+		*left = p->pkt[idx].iov_len - offset - len;
 
-	return p->buf + p->pkt[idx].offset + offset;
+	return (char *)p->pkt[idx].iov_base + offset;
 }
 
 /**
diff --git a/packet.h b/packet.h
index a784b07bbed5..8377dcf678bb 100644
--- a/packet.h
+++ b/packet.h
@@ -6,16 +6,6 @@
 #ifndef PACKET_H
 #define PACKET_H
 
-/**
- * struct desc - Generic offset-based descriptor within buffer
- * @offset:	Offset of descriptor relative to buffer start, 32-bit limit
- * @len:	Length of descriptor, host order, 16-bit limit
- */
-struct desc {
-	uint32_t offset;
-	uint16_t len;
-};
-
 /**
  * struct pool - Generic pool of packets stored in a buffer
  * @buf:	Buffer storing packet descriptors
@@ -29,7 +19,7 @@ struct pool {
 	size_t buf_size;
 	size_t size;
 	size_t count;
-	struct desc pkt[1];
+	struct iovec pkt[1];
 };
 
 void packet_add_do(struct pool *p, size_t len, const char *start,
@@ -54,7 +44,7 @@ struct _name ## _t {							\
 	size_t buf_size;						\
 	size_t size;							\
 	size_t count;							\
-	struct desc pkt[_size];						\
+	struct iovec pkt[_size];					\
 }
 
 #define PACKET_POOL_INIT_NOCAST(_size, _buf, _buf_size)			\
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v3 2/4] vhost-user: introduce virtio API
  2024-08-15 15:50 [PATCH v3 0/4] Add vhost-user support to passt. (part 3) Laurent Vivier
  2024-08-15 15:50 ` [PATCH v3 1/4] packet: replace struct desc by struct iovec Laurent Vivier
@ 2024-08-15 15:50 ` Laurent Vivier
  2024-08-20  1:00   ` David Gibson
  2024-08-22 22:14   ` Stefano Brivio
  2024-08-15 15:50 ` [PATCH v3 3/4] vhost-user: introduce vhost-user API Laurent Vivier
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 22+ messages in thread
From: Laurent Vivier @ 2024-08-15 15:50 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

Add virtio.c and virtio.h that define the functions needed
to manage virtqueues.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
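Note for reviewers (not part of the commit message): the expected
calling pattern for this API is a pop/fill/flush/notify loop; the
real users arrive with tcp_vu.c and udp_vu.c in the next patches.
A minimal sketch, assuming a struct vu_dev *vdev and a struct
vu_virtq *vq already set up by the vhost-user layer, and a
hypothetical process_buffers() standing in for actual packet
handling:

	struct iovec in_sg[VIRTQUEUE_MAX_SIZE], out_sg[VIRTQUEUE_MAX_SIZE];
	unsigned int count = 0;

	for (;;) {
		struct vu_virtq_element elem = {
			/* in_num/out_num are the iov capacity on input,
			 * and hold the number of mapped buffers on return
			 */
			.in_num  = VIRTQUEUE_MAX_SIZE, .in_sg  = in_sg,
			.out_num = VIRTQUEUE_MAX_SIZE, .out_sg = out_sg,
		};
		size_t len;

		if (vu_queue_pop(vdev, vq, &elem) < 0)
			break;			/* queue empty */

		len = process_buffers(&elem);	/* bytes written to in_sg */
		vu_queue_fill(vq, &elem, len, count++);
	}
	if (count) {
		vu_queue_flush(vq, count);	/* publish used entries */
		vu_queue_notify(vdev, vq);	/* kick the guest via call_fd */
	}
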
 Makefile |   4 +-
 util.h   |   8 +
 virtio.c | 662 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 virtio.h | 185 ++++++++++++++++
 4 files changed, 857 insertions(+), 2 deletions(-)
 create mode 100644 virtio.c
 create mode 100644 virtio.h

diff --git a/Makefile b/Makefile
index b6329e35f884..f171c7955ac9 100644
--- a/Makefile
+++ b/Makefile
@@ -47,7 +47,7 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
 	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
 	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
-	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c
+	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c virtio.c
 QRAP_SRCS = qrap.c
 SRCS = $(PASST_SRCS) $(QRAP_SRCS)
 
@@ -57,7 +57,7 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
 	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
 	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
 	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
-	udp.h udp_flow.h util.h
+	udp.h udp_flow.h util.h virtio.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
diff --git a/util.h b/util.h
index b7541ce24e5a..7944cfe1219d 100644
--- a/util.h
+++ b/util.h
@@ -132,6 +132,14 @@ static inline uint32_t ntohl_unaligned(const void *p)
 	return ntohl(val);
 }
 
+static inline void barrier(void) { __asm__ __volatile__("" ::: "memory"); }
+#define smp_mb()		do { barrier(); __atomic_thread_fence(__ATOMIC_SEQ_CST); } while (0)
+#define smp_mb_release()	do { barrier(); __atomic_thread_fence(__ATOMIC_RELEASE); } while (0)
+#define smp_mb_acquire()	do { barrier(); __atomic_thread_fence(__ATOMIC_ACQUIRE); } while (0)
+
+#define smp_wmb()	smp_mb_release()
+#define smp_rmb()	smp_mb_acquire()
+
 #define NS_FN_STACK_SIZE	(RLIMIT_STACK_VAL * 1024 / 8)
 int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
 	     void *arg);
diff --git a/virtio.c b/virtio.c
new file mode 100644
index 000000000000..8354f6052aee
--- /dev/null
+++ b/virtio.c
@@ -0,0 +1,662 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * virtio API, vring and virtqueue functions definition
+ *
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ */
+
+/* some parts copied from QEMU subprojects/libvhost-user/libvhost-user.c
+ * licensed under the following terms:
+ *
+ * Copyright IBM, Corp. 2007
+ * Copyright (c) 2016 Red Hat, Inc.
+ *
+ * Authors:
+ *  Anthony Liguori <aliguori@us.ibm.com>
+ *  Marc-André Lureau <mlureau@redhat.com>
+ *  Victor Kaplansky <victork@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ *
+ * Some parts copied from QEMU hw/virtio/virtio.c
+ * licensed under the following terms:
+ *
+ * Copyright IBM, Corp. 2007
+ *
+ * Authors:
+ *  Anthony Liguori   <aliguori@us.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * virtq_used_event() and virtq_avail_event() from
+ * https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-712000A
+ * licensed under the following terms:
+ *
+ * This header is BSD licensed so anyone can use the definitions
+ * to implement compatible drivers/servers.
+ *
+ * Copyright 2007, 2009, IBM Corporation
+ * Copyright 2011, Red Hat, Inc
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of IBM nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ‘‘AS IS’’ AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <stddef.h>
+#include <endian.h>
+#include <string.h>
+#include <errno.h>
+#include <sys/eventfd.h>
+#include <sys/socket.h>
+
+#include "util.h"
+#include "virtio.h"
+
+#define VIRTQUEUE_MAX_SIZE 1024
+
+/**
+ * vu_gpa_to_va() - Translate guest physical address to our virtual address.
+ * @dev:	Vhost-user device
+ * @plen:	Physical length to map (input), virtual address mapped (output)
+ * @guest_addr:	Guest physical address
+ *
+ * Return: virtual address in our address space of the guest physical address
+ */
+static void *vu_gpa_to_va(struct vu_dev *dev, uint64_t *plen, uint64_t guest_addr)
+{
+	unsigned int i;
+
+	if (*plen == 0)
+		return NULL;
+
+	/* Find matching memory region. */
+	for (i = 0; i < dev->nregions; i++) {
+		const struct vu_dev_region *r = &dev->regions[i];
+
+		if ((guest_addr >= r->gpa) &&
+		    (guest_addr < (r->gpa + r->size))) {
+			if ((guest_addr + *plen) > (r->gpa + r->size))
+				*plen = r->gpa + r->size - guest_addr;
+			/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
+			return (void *)(guest_addr - r->gpa + r->mmap_addr +
+						     r->mmap_offset);
+		}
+	}
+
+	return NULL;
+}
+
+/**
+ * vring_avail_flags() - Read the available ring flags
+ * @vq:		Virtqueue
+ *
+ * Return: the available ring descriptor flags of the given virtqueue
+ */
+static inline uint16_t vring_avail_flags(const struct vu_virtq *vq)
+{
+	return le16toh(vq->vring.avail->flags);
+}
+
+/**
+ * vring_avail_idx() - Read the available ring index
+ * @vq:		Virtqueue
+ *
+ * Return: the available ring index of the given virtqueue
+ */
+static inline uint16_t vring_avail_idx(struct vu_virtq *vq)
+{
+	vq->shadow_avail_idx = le16toh(vq->vring.avail->idx);
+
+	return vq->shadow_avail_idx;
+}
+
+/**
+ * vring_avail_ring() - Read an available ring entry
+ * @vq:		Virtqueue
+ * @i:		Index of the entry to read
+ *
+ * Return: the ring entry content (head of the descriptor chain)
+ */
+static inline uint16_t vring_avail_ring(const struct vu_virtq *vq, int i)
+{
+	return le16toh(vq->vring.avail->ring[i]);
+}
+
+/**
+ * virtq_used_event() - Get the location of the used event index
+ *		      (only with VIRTIO_F_EVENT_IDX)
+ * @vq:		Virtqueue
+ *
+ * Return: the location of the used event index
+ */
+static inline uint16_t *virtq_used_event(const struct vu_virtq *vq)
+{
+        /* For backwards compat, used event index is at *end* of avail ring. */
+        return &vq->vring.avail->ring[vq->vring.num];
+}
+
+/**
+ * vring_get_used_event() - Get the used event from the available ring
+ * @vq:		Virtqueue
+ *
+ * Return: the used event (available only if VIRTIO_RING_F_EVENT_IDX is set)
+ *         used_event is a performant alternative where the driver
+ *         specifies how far the device can progress before a notification
+ *         is required.
+ */
+static inline uint16_t vring_get_used_event(const struct vu_virtq *vq)
+{
+	return le16toh(*virtq_used_event(vq));
+}
+
+/**
+ * virtqueue_get_head() - Get the head of the descriptor chain for a given
+ *                        index
+ * @vq:		Virtqueue
+ * @idx:	Available ring entry index
+ * @head:	Head of the descriptor chain
+ */
+static void virtqueue_get_head(const struct vu_virtq *vq,
+			       unsigned int idx, unsigned int *head)
+{
+	/* Grab the next descriptor number they're advertising, and increment
+	 * the index we've seen.
+	 */
+	*head = vring_avail_ring(vq, idx % vq->vring.num);
+
+	/* If their number is silly, that's a fatal mistake. */
+	if (*head >= vq->vring.num)
+		die("Guest says index %u is available", *head);
+}
+
+/**
+ * virtqueue_read_indirect_desc() - Copy virtio ring descriptors from guest
+ *                                  memory
+ * @dev:	Vhost-user device
+ * @desc:	Destination address to copy the descriptors
+ * @addr:	Guest memory address to copy from
+ * @len:	Length of memory to copy
+ *
+ * Return: -1 if there is an error, 0 otherwise
+ */
+static int virtqueue_read_indirect_desc(struct vu_dev *dev, struct vring_desc *desc,
+					uint64_t addr, size_t len)
+{
+	uint64_t read_len;
+
+	if (len > (VIRTQUEUE_MAX_SIZE * sizeof(struct vring_desc)))
+		return -1;
+
+	if (len == 0)
+		return -1;
+
+	while (len) {
+		const struct vring_desc *orig_desc;
+
+		read_len = len;
+		orig_desc = vu_gpa_to_va(dev, &read_len, addr);
+		if (!orig_desc)
+			return -1;
+
+		memcpy(desc, orig_desc, read_len);
+		len -= read_len;
+		addr += read_len;
+		desc += read_len / sizeof(struct vring_desc);
+	}
+
+	return 0;
+}
+
+/**
+ * enum virtqueue_read_desc_state - State in the descriptor chain
+ * @VIRTQUEUE_READ_DESC_ERROR	Found an invalid descriptor
+ * @VIRTQUEUE_READ_DESC_DONE	No more descriptors in the chain
+ * @VIRTQUEUE_READ_DESC_MORE	there are more descriptors in the chain
+ */
+enum virtqueue_read_desc_state {
+	VIRTQUEUE_READ_DESC_ERROR = -1,
+	VIRTQUEUE_READ_DESC_DONE = 0,   /* end of chain */
+	VIRTQUEUE_READ_DESC_MORE = 1,   /* more buffers in chain */
+};
+
+/**
+ * virtqueue_read_next_desc() - Read the next descriptor in the chain
+ * @desc:	Virtio ring descriptors
+ * @i:		Index of the current descriptor
+ * @max:	Maximum value of the descriptor index
+ * @next:	Index of the next descriptor in the chain (output value)
+ *
+ * Return: current chain descriptor state (error, next, done)
+ */
+static int virtqueue_read_next_desc(const struct vring_desc *desc,
+				    int i, unsigned int max, unsigned int *next)
+{
+	/* If this descriptor says it doesn't chain, we're done. */
+	if (!(le16toh(desc[i].flags) & VRING_DESC_F_NEXT))
+		return VIRTQUEUE_READ_DESC_DONE;
+
+	/* Check they're not leading us off end of descriptors. */
+	*next = le16toh(desc[i].next);
+	/* Make sure compiler knows to grab that: we don't want it changing! */
+	smp_wmb();
+
+	if (*next >= max)
+		return VIRTQUEUE_READ_DESC_ERROR;
+
+	return VIRTQUEUE_READ_DESC_MORE;
+}
+
+/**
+ * vu_queue_empty() - Check if virtqueue is empty
+ * @vq:		Virtqueue
+ *
+ * Return: true if the virtqueue is empty, false otherwise
+ */
+bool vu_queue_empty(struct vu_virtq *vq)
+{
+	if (!vq->vring.avail)
+		return true;
+
+	if (vq->shadow_avail_idx != vq->last_avail_idx)
+		return false;
+
+	return vring_avail_idx(vq) == vq->last_avail_idx;
+}
+
+/**
+ * vring_can_notify() - Check if a notification can be sent
+ * @dev:	Vhost-user device
+ * @vq:		Virtqueue
+ *
+ * Return: true if notification can be sent
+ */
+static bool vring_can_notify(const struct vu_dev *dev, struct vu_virtq *vq)
+{
+	uint16_t old, new;
+	bool v;
+
+	/* We need to expose used array entries before checking used event. */
+	smp_mb();
+
+	/* Always notify when the queue is empty, if the feature is negotiated */
+	if (vu_has_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY) &&
+		!vq->inuse && vu_queue_empty(vq)) {
+		return true;
+	}
+
+	if (!vu_has_feature(dev, VIRTIO_RING_F_EVENT_IDX))
+		return !(vring_avail_flags(vq) & VRING_AVAIL_F_NO_INTERRUPT);
+
+	v = vq->signalled_used_valid;
+	vq->signalled_used_valid = true;
+	old = vq->signalled_used;
+	new = vq->signalled_used = vq->used_idx;
+	return !v || vring_need_event(vring_get_used_event(vq), new, old);
+}
+
+/**
+ * vu_queue_notify() - Send a notification to the given virtqueue
+ * @dev:	Vhost-user device
+ * @vq:		Virtqueue
+ */
+/* cppcheck-suppress unusedFunction */
+void vu_queue_notify(const struct vu_dev *dev, struct vu_virtq *vq)
+{
+	if (!vq->vring.avail)
+		return;
+
+	if (!vring_can_notify(dev, vq)) {
+		debug("vhost-user: virtqueue can skip notify...");
+		return;
+	}
+
+	if (eventfd_write(vq->call_fd, 1) < 0)
+		die_perror("Error writing eventfd");
+}
+
+/**
+ * virtq_avail_event() - Get the location of the available event index
+ *			 (only with VIRTIO_F_EVENT_IDX)
+ * @vq:		Virtqueue
+ *
+ * Return: the location of the available event index
+ */
+static inline uint16_t *virtq_avail_event(const struct vu_virtq *vq)
+{
+        /* For backwards compat, avail event index is at *end* of used ring. */
+        return (uint16_t *)&vq->vring.used->ring[vq->vring.num];
+}
+
+/**
+ * vring_set_avail_event() - Set avail_event
+ * @vq:		Virtqueue
+ * @val:	Value to set to avail_event
+ *		avail_event is used in the same way the used_event is in the
+ *		avail_ring.
+ *		avail_event is used to advise the driver that notifications
+ *		are unnecessary until the driver writes entry with an index
+ *		specified by avail_event into the available ring.
+ */
+static inline void vring_set_avail_event(const struct vu_virtq *vq,
+					 uint16_t val)
+{
+	uint16_t val_le = htole16(val);
+
+	if (!vq->notification)
+		return;
+
+	memcpy(virtq_avail_event(vq), &val_le, sizeof(val_le));
+}
+
+/**
+ * virtqueue_map_desc() - Translate descriptor ring physical address into our
+ * 			  virtual address space
+ * @dev:	Vhost-user device
+ * @p_num_sg:	First iov entry to use (input),
+ *		first iov entry not used (output)
+ * @iov:	Iov array to use to store buffer virtual addresses
+ * @max_num_sg:	Maximum number of iov entries
+ * @pa:		Guest physical address of the buffer to map into our virtual
+ * 		address
+ * @sz:		Size of the buffer
+ *
+ * Return: false on error, true otherwise
+ */
+static bool virtqueue_map_desc(struct vu_dev *dev,
+			       unsigned int *p_num_sg, struct iovec *iov,
+			       unsigned int max_num_sg,
+			       uint64_t pa, size_t sz)
+{
+	unsigned int num_sg = *p_num_sg;
+
+	ASSERT(num_sg < max_num_sg);
+	ASSERT(sz);
+
+	while (sz) {
+		uint64_t len = sz;
+
+		iov[num_sg].iov_base = vu_gpa_to_va(dev, &len, pa);
+		if (iov[num_sg].iov_base == NULL)
+			die("virtio: invalid address for buffers");
+		iov[num_sg].iov_len = len;
+		num_sg++;
+		sz -= len;
+		pa += len;
+	}
+
+	*p_num_sg = num_sg;
+	return true;
+}
+
+/**
+ * vu_queue_map_desc() - Map the virtqueue descriptor ring into our virtual
+ * 		       address space
+ * @dev:	Vhost-user device
+ * @vq:		Virtqueue
+ * @idx:	First descriptor ring entry to map
+ * @elem:	Virtqueue element to store descriptor ring iov
+ *
+ * Return: -1 if there is an error, 0 otherwise
+ */
+static int vu_queue_map_desc(struct vu_dev *dev, struct vu_virtq *vq, unsigned int idx,
+			     struct vu_virtq_element *elem)
+{
+	const struct vring_desc *desc = vq->vring.desc;
+	struct vring_desc desc_buf[VIRTQUEUE_MAX_SIZE];
+	unsigned int out_num = 0, in_num = 0;
+	unsigned int max = vq->vring.num;
+	unsigned int i = idx;
+	uint64_t read_len;
+	int rc;
+
+	if (le16toh(desc[i].flags) & VRING_DESC_F_INDIRECT) {
+		unsigned int desc_len;
+		uint64_t desc_addr;
+
+		if (le32toh(desc[i].len) % sizeof(struct vring_desc))
+			die("Invalid size for indirect buffer table");
+
+		/* loop over the indirect descriptor table */
+		desc_addr = le64toh(desc[i].addr);
+		desc_len = le32toh(desc[i].len);
+		max = desc_len / sizeof(struct vring_desc);
+		read_len = desc_len;
+		desc = vu_gpa_to_va(dev, &read_len, desc_addr);
+		if (desc && read_len != desc_len) {
+			/* Failed to use zero copy */
+			desc = NULL;
+			if (!virtqueue_read_indirect_desc(dev, desc_buf, desc_addr, desc_len))
+				desc = desc_buf;
+		}
+		if (!desc)
+			die("Invalid indirect buffer table");
+		i = 0;
+	}
+
+	/* Collect all the descriptors */
+	do {
+		if (le16toh(desc[i].flags) & VRING_DESC_F_WRITE) {
+			if (!virtqueue_map_desc(dev, &in_num, elem->in_sg,
+						elem->in_num,
+						le64toh(desc[i].addr),
+						le32toh(desc[i].len))) {
+				return -1;
+			}
+		} else {
+			if (in_num)
+				die("Incorrect order for descriptors");
+			if (!virtqueue_map_desc(dev, &out_num, elem->out_sg,
+						elem->out_num,
+						le64toh(desc[i].addr),
+						le32toh(desc[i].len))) {
+				return -1;
+			}
+		}
+
+		/* If we've got too many, that implies a descriptor loop. */
+		if ((in_num + out_num) > max)
+			die("Looped descriptor");
+		rc = virtqueue_read_next_desc(desc, i, max, &i);
+	} while (rc == VIRTQUEUE_READ_DESC_MORE);
+
+	if (rc == VIRTQUEUE_READ_DESC_ERROR)
+		die("read descriptor error");
+
+	elem->index = idx;
+	elem->in_num = in_num;
+	elem->out_num = out_num;
+
+	return 0;
+}
+
+/**
+ * vu_queue_pop() - Pop an entry from the virtqueue
+ * @dev:	Vhost-user device
+ * @vq:		Virtqueue
+ * @elem:	Virtqueue element to fill with the entry information
+ *
+ * Return: -1 if there is an error, 0 otherwise
+ */
+/* cppcheck-suppress unusedFunction */
+int vu_queue_pop(struct vu_dev *dev, struct vu_virtq *vq, struct vu_virtq_element *elem)
+{
+	unsigned int head;
+	int ret;
+
+	if (!vq->vring.avail)
+		return -1;
+
+	if (vu_queue_empty(vq))
+		return -1;
+
+	/* Barrier needed after vu_queue_empty(): read the descriptor
+	 * content only after we have seen the updated available index.
+	 */
+	smp_rmb();
+
+	if (vq->inuse >= vq->vring.num)
+		die("Virtqueue size exceeded");
+
+	virtqueue_get_head(vq, vq->last_avail_idx++, &head);
+
+	if (vu_has_feature(dev, VIRTIO_RING_F_EVENT_IDX))
+		vring_set_avail_event(vq, vq->last_avail_idx);
+
+	ret = vu_queue_map_desc(dev, vq, head, elem);
+
+	if (ret < 0)
+		return ret;
+
+	vq->inuse++;
+
+	return 0;
+}
+
+/**
+ * vu_queue_detach_element() - Detach an element from the virtqueue
+ * @vq:		Virtqueue
+ */
+void vu_queue_detach_element(struct vu_virtq *vq)
+{
+	vq->inuse--;
+	/* unmap, when DMA support is added */
+}
+
+/**
+ * vu_queue_unpop() - Push back the previously popped element from the virtqueue
+ * @vq:		Virtqueue
+ */
+/* cppcheck-suppress unusedFunction */
+void vu_queue_unpop(struct vu_virtq *vq)
+{
+	vq->last_avail_idx--;
+	vu_queue_detach_element(vq);
+}
+
+/**
+ * vu_queue_rewind() - Push back a given number of popped elements
+ * @vq:		Virtqueue
+ * @num:	Number of elements to unpop
+ */
+/* cppcheck-suppress unusedFunction */
+bool vu_queue_rewind(struct vu_virtq *vq, unsigned int num)
+{
+	if (num > vq->inuse)
+		return false;
+
+	vq->last_avail_idx -= num;
+	vq->inuse -= num;
+	return true;
+}
+
+/**
+ * vring_used_write() - Write an entry in the used ring
+ * @vq:		Virtqueue
+ * @uelem:	Entry to write
+ * @i:		Index of the entry in the used ring
+ */
+static inline void vring_used_write(struct vu_virtq *vq,
+				    const struct vring_used_elem *uelem, int i)
+{
+	struct vring_used *used = vq->vring.used;
+
+	used->ring[i] = *uelem;
+}
+
+/**
+ * vu_queue_fill_by_index() - Update information of a descriptor ring entry
+ *			      in the used ring
+ * @vq:		Virtqueue
+ * @index:	Descriptor ring index
+ * @len:	Size of the element
+ * @idx:	Used ring entry index
+ */
+void vu_queue_fill_by_index(struct vu_virtq *vq, unsigned int index,
+			    unsigned int len, unsigned int idx)
+{
+	struct vring_used_elem uelem;
+
+	if (!vq->vring.avail)
+		return;
+
+	idx = (idx + vq->used_idx) % vq->vring.num;
+
+	uelem.id = htole32(index);
+	uelem.len = htole32(len);
+	vring_used_write(vq, &uelem, idx);
+}
+
+/**
+ * vu_queue_fill() - Update information of a given element in the used ring
+ * @vq:		Virtqueue
+ * @elem:	Element information to fill
+ * @len:	Size of the element
+ * @idx:	Used ring entry index
+ */
+/* cppcheck-suppress unusedFunction */
+void vu_queue_fill(struct vu_virtq *vq, const struct vu_virtq_element *elem,
+		   unsigned int len, unsigned int idx)
+{
+	vu_queue_fill_by_index(vq, elem->index, len, idx);
+}
+
+/**
+ * vring_used_idx_set() - Set the descriptor ring current index
+ * @vq:		Virtqueue
+ * @val:	Value to set in the index
+ */
+static inline void vring_used_idx_set(struct vu_virtq *vq, uint16_t val)
+{
+	vq->vring.used->idx = htole16(val);
+
+	vq->used_idx = val;
+}
+
+/**
+ * vu_queue_flush() - Flush the virtqueue
+ * @vq:		Virtqueue
+ * @count:	Number of entries to flush
+ */
+/* cppcheck-suppress unusedFunction */
+void vu_queue_flush(struct vu_virtq *vq, unsigned int count)
+{
+	uint16_t old, new;
+
+	if (!vq->vring.avail)
+		return;
+
+	/* Make sure buffer is written before we update index. */
+	smp_wmb();
+
+	old = vq->used_idx;
+	new = old + count;
+	vring_used_idx_set(vq, new);
+	vq->inuse -= count;
+	if ((uint16_t)(new - vq->signalled_used) < (uint16_t)(new - old))
+		vq->signalled_used_valid = false;
+}
diff --git a/virtio.h b/virtio.h
new file mode 100644
index 000000000000..af9cadc990b9
--- /dev/null
+++ b/virtio.h
@@ -0,0 +1,185 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * virtio API, vring and virtqueue functions definition
+ *
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ */
+
+#ifndef VIRTIO_H
+#define VIRTIO_H
+
+#include <stdbool.h>
+#include <linux/vhost_types.h>
+
+/* Maximum size of a virtqueue */
+#define VIRTQUEUE_MAX_SIZE 1024
+
+/**
+ * struct vu_ring - Virtqueue rings
+ * @num:		Size of the queue
+ * @desc:		Descriptor ring
+ * @avail:		Available ring
+ * @used:		Used ring
+ * @log_guest_addr:	Guest address for logging
+ * @flags:		Vring flags
+ * 			VHOST_VRING_F_LOG is set if log address is valid
+ */
+struct vu_ring {
+	unsigned int num;
+	struct vring_desc *desc;
+	struct vring_avail *avail;
+	struct vring_used *used;
+	uint64_t log_guest_addr;
+	uint32_t flags;
+};
+
+/**
+ * struct vu_virtq - Virtqueue definition
+ * @vring:			Virtqueue rings
+ * @last_avail_idx:		Next head to pop
+ * @shadow_avail_idx:		Last avail_idx read from VQ.
+ * @used_idx:			Descriptor ring current index
+ * @signalled_used:		Last used index value we have signalled on
+ * @signalled_used_valid:	True if signalled_used is valid
+ * @notification:		True if the queues notify (via event
+ * 				index or interrupt)
+ * @inuse:			Number of entries in use
+ * @call_fd:			The event file descriptor to signal when
+ * 				buffers are used.
+ * @kick_fd:			The event file descriptor for adding
+ * 				buffers to the vring
+ * @err_fd:			The event file descriptor to signal when
+ * 				error occurs
+ * @enable:			True if the virtqueue is enabled
+ * @started:			True if the virtqueue is started
+ * @vra:			QEMU address of our rings
+ */
+struct vu_virtq {
+	struct vu_ring vring;
+	uint16_t last_avail_idx;
+	uint16_t shadow_avail_idx;
+	uint16_t used_idx;
+	uint16_t signalled_used;
+	bool signalled_used_valid;
+	bool notification;
+	unsigned int inuse;
+	int call_fd;
+	int kick_fd;
+	int err_fd;
+	unsigned int enable;
+	bool started;
+	struct vhost_vring_addr vra;
+};
+
+/**
+ * struct vu_dev_region - guest shared memory region
+ * @gpa:		Guest physical address of the region
+ * @size:		Memory size in bytes
+ * @qva:		QEMU virtual address
+ * @mmap_offset:	Offset where the region starts in the mapped memory
+ * @mmap_addr:		Address of the mapped memory
+ */
+struct vu_dev_region {
+	uint64_t gpa;
+	uint64_t size;
+	uint64_t qva;
+	uint64_t mmap_offset;
+	uint64_t mmap_addr;
+};
+
+#define VHOST_USER_MAX_QUEUES 2
+
+/*
+ * Set a reasonable maximum number of ram slots, which will be supported by
+ * any architecture.
+ */
+#define VHOST_USER_MAX_RAM_SLOTS 32
+
+/**
+ * struct vu_dev - vhost-user device information
+ * @nregions:		Number of shared memory regions
+ * @regions:		Guest shared memory regions
+ * @vq:			Virtqueues
+ * @features:		Vhost-user features
+ * @protocol_features:	Vhost-user protocol features
+ * @hdrlen:		Virtio-net header length
+ */
+struct vu_dev {
+	uint32_t nregions;
+	struct vu_dev_region regions[VHOST_USER_MAX_RAM_SLOTS];
+	struct vu_virtq vq[VHOST_USER_MAX_QUEUES];
+	uint64_t features;
+	uint64_t protocol_features;
+	int hdrlen;
+};
+
+/**
+ * struct vu_virtq_element - virtqueue element
+ * @index:	Descriptor ring index
+ * @out_num:	Number of outgoing iovec buffers
+ * @in_num:	Number of incoming iovec buffers
+ * @in_sg:	Incoming iovec buffers
+ * @out_sg:	Outgoing iovec buffers
+ */
+struct vu_virtq_element {
+	unsigned int index;
+	unsigned int out_num;
+	unsigned int in_num;
+	struct iovec *in_sg;
+	struct iovec *out_sg;
+};
+
+/**
+ * has_feature() - Check a feature bit in a features set
+ * @features:	Features set
+ * @fbit:	Feature bit to check
+ *
+ * Return:	True if the feature bit is set
+ */
+static inline bool has_feature(uint64_t features, unsigned int fbit)
+{
+	return !!(features & (1ULL << fbit));
+}
+
+/**
+ * vu_has_feature() - Check if a virtio-net feature is available
+ * @vdev:	Vhost-user device
+ * @fbit:	Feature bit to check
+ *
+ * Return:	True if the feature is available
+ */
+static inline bool vu_has_feature(const struct vu_dev *vdev,
+				  unsigned int fbit)
+{
+	return has_feature(vdev->features, fbit);
+}
+
+/**
+ * vu_has_protocol_feature() - Check if a vhost-user feature is available
+ * @vdev:	Vhost-user device
+ * @fbit:	Feature bit to check
+ *
+ * Return:	True if the feature is available
+ */
+/* cppcheck-suppress unusedFunction */
+static inline bool vu_has_protocol_feature(const struct vu_dev *vdev,
+					   unsigned int fbit)
+{
+	return has_feature(vdev->protocol_features, fbit);
+}
+
+bool vu_queue_empty(struct vu_virtq *vq);
+void vu_queue_notify(const struct vu_dev *dev, struct vu_virtq *vq);
+int vu_queue_pop(struct vu_dev *dev, struct vu_virtq *vq,
+		 struct vu_virtq_element *elem);
+void vu_queue_detach_element(struct vu_virtq *vq);
+void vu_queue_unpop(struct vu_virtq *vq);
+bool vu_queue_rewind(struct vu_virtq *vq, unsigned int num);
+void vu_queue_fill_by_index(struct vu_virtq *vq, unsigned int index,
+			    unsigned int len, unsigned int idx);
+void vu_queue_fill(struct vu_virtq *vq,
+		   const struct vu_virtq_element *elem, unsigned int len,
+		   unsigned int idx);
+void vu_queue_flush(struct vu_virtq *vq, unsigned int count);
+#endif /* VIRTIO_H */
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v3 3/4] vhost-user: introduce vhost-user API
  2024-08-15 15:50 [PATCH v3 0/4] Add vhost-user support to passt. (part 3) Laurent Vivier
  2024-08-15 15:50 ` [PATCH v3 1/4] packet: replace struct desc by struct iovec Laurent Vivier
  2024-08-15 15:50 ` [PATCH v3 2/4] vhost-user: introduce virtio API Laurent Vivier
@ 2024-08-15 15:50 ` Laurent Vivier
  2024-08-22 22:14   ` Stefano Brivio
  2024-08-26  5:26   ` David Gibson
  2024-08-15 15:50 ` [PATCH v3 4/4] vhost-user: add vhost-user Laurent Vivier
  2024-08-20 22:41 ` [PATCH v3 0/4] Add vhost-user support to passt. (part 3) Stefano Brivio
  4 siblings, 2 replies; 22+ messages in thread
From: Laurent Vivier @ 2024-08-15 15:50 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

Add vhost_user.c and vhost_user.h that define the functions needed
to implement the vhost-user backend.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
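Note for reviewers (not part of the commit message): the control
socket handling added here boils down to reading one framed message
(header plus ancillary file descriptors, then payload) and
dispatching on the request number. A simplified sketch of that
shape, using the helpers defined in this patch (the actual dispatch
function appears further down in vhost_user.c):

	struct vhost_user_msg msg = { 0 };
	bool reply_requested = false;

	if (vu_message_read_default(conn_fd, &msg) <= 0)
		return;		/* interrupted (0) or error (-1) */

	debug("Got request %s", vu_request_to_string(msg.hdr.request));

	switch (msg.hdr.request) {
	case VHOST_USER_GET_FEATURES:
		reply_requested = vu_get_features_exec(&msg);
		break;
	case VHOST_USER_SET_FEATURES:
		reply_requested = vu_set_features_exec(vdev, &msg);
		break;
	/* ... one case per supported VHOST_USER_* request ... */
	default:
		vmsg_close_fds(&msg);
		return;
	}

	if (reply_requested)
		vu_send_reply(conn_fd, &msg);
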
 Makefile     |    4 +-
 iov.c        |    1 -
 vhost_user.c | 1271 ++++++++++++++++++++++++++++++++++++++++++++++++++
 vhost_user.h |  202 ++++++++
 virtio.c     |    5 -
 virtio.h     |    2 +-
 6 files changed, 1476 insertions(+), 9 deletions(-)
 create mode 100644 vhost_user.c
 create mode 100644 vhost_user.h

diff --git a/Makefile b/Makefile
index f171c7955ac9..4ccefffacfde 100644
--- a/Makefile
+++ b/Makefile
@@ -47,7 +47,7 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
 	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
 	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
-	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c virtio.c
+	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c vhost_user.c virtio.c
 QRAP_SRCS = qrap.c
 SRCS = $(PASST_SRCS) $(QRAP_SRCS)
 
@@ -57,7 +57,7 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
 	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
 	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
 	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
-	udp.h udp_flow.h util.h virtio.h
+	udp.h udp_flow.h util.h vhost_user.h virtio.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
diff --git a/iov.c b/iov.c
index 3f9e229a305f..3741db21790f 100644
--- a/iov.c
+++ b/iov.c
@@ -68,7 +68,6 @@ size_t iov_skip_bytes(const struct iovec *iov, size_t n,
  *
  * Returns:    The number of bytes successfully copied.
  */
-/* cppcheck-suppress unusedFunction */
 size_t iov_from_buf(const struct iovec *iov, size_t iov_cnt,
 		    size_t offset, const void *buf, size_t bytes)
 {
diff --git a/vhost_user.c b/vhost_user.c
new file mode 100644
index 000000000000..c4cd25fae84e
--- /dev/null
+++ b/vhost_user.c
@@ -0,0 +1,1271 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * vhost-user API, command management and virtio interface
+ *
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ */
+/* some parts from QEMU subprojects/libvhost-user/libvhost-user.c
+ * licensed under the following terms:
+ *
+ * Copyright IBM, Corp. 2007
+ * Copyright (c) 2016 Red Hat, Inc.
+ *
+ * Authors:
+ *  Anthony Liguori <aliguori@us.ibm.com>
+ *  Marc-André Lureau <mlureau@redhat.com>
+ *  Victor Kaplansky <victork@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <stddef.h>
+#include <string.h>
+#include <assert.h>
+#include <stdbool.h>
+#include <inttypes.h>
+#include <time.h>
+#include <net/ethernet.h>
+#include <netinet/in.h>
+#include <sys/epoll.h>
+#include <sys/eventfd.h>
+#include <sys/mman.h>
+#include <linux/vhost_types.h>
+#include <linux/virtio_net.h>
+
+#include "util.h"
+#include "passt.h"
+#include "tap.h"
+#include "vhost_user.h"
+
+/* vhost-user version we are compatible with */
+#define VHOST_USER_VERSION 1
+
+/**
+ * vu_print_capabilities() - print vhost-user capabilities
+ * 			     this is part of the vhost-user backend
+ * 			     convention.
+ */
+/* cppcheck-suppress unusedFunction */
+void vu_print_capabilities(void)
+{
+	info("{");
+	info("  \"type\": \"net\"");
+	info("}");
+	exit(EXIT_SUCCESS);
+}
+
+/**
+ * vu_request_to_string() - convert a vhost-user request number to its name
+ * @req:	request number
+ *
+ * Return: the name of the request number
+ */
+static const char *vu_request_to_string(unsigned int req)
+{
+	if (req < VHOST_USER_MAX) {
+#define REQ(req) [req] = #req
+		static const char * const vu_request_str[] = {
+			REQ(VHOST_USER_NONE),
+			REQ(VHOST_USER_GET_FEATURES),
+			REQ(VHOST_USER_SET_FEATURES),
+			REQ(VHOST_USER_SET_OWNER),
+			REQ(VHOST_USER_RESET_OWNER),
+			REQ(VHOST_USER_SET_MEM_TABLE),
+			REQ(VHOST_USER_SET_LOG_BASE),
+			REQ(VHOST_USER_SET_LOG_FD),
+			REQ(VHOST_USER_SET_VRING_NUM),
+			REQ(VHOST_USER_SET_VRING_ADDR),
+			REQ(VHOST_USER_SET_VRING_BASE),
+			REQ(VHOST_USER_GET_VRING_BASE),
+			REQ(VHOST_USER_SET_VRING_KICK),
+			REQ(VHOST_USER_SET_VRING_CALL),
+			REQ(VHOST_USER_SET_VRING_ERR),
+			REQ(VHOST_USER_GET_PROTOCOL_FEATURES),
+			REQ(VHOST_USER_SET_PROTOCOL_FEATURES),
+			REQ(VHOST_USER_GET_QUEUE_NUM),
+			REQ(VHOST_USER_SET_VRING_ENABLE),
+			REQ(VHOST_USER_SEND_RARP),
+			REQ(VHOST_USER_NET_SET_MTU),
+			REQ(VHOST_USER_SET_BACKEND_REQ_FD),
+			REQ(VHOST_USER_IOTLB_MSG),
+			REQ(VHOST_USER_SET_VRING_ENDIAN),
+			REQ(VHOST_USER_GET_CONFIG),
+			REQ(VHOST_USER_SET_CONFIG),
+			REQ(VHOST_USER_POSTCOPY_ADVISE),
+			REQ(VHOST_USER_POSTCOPY_LISTEN),
+			REQ(VHOST_USER_POSTCOPY_END),
+			REQ(VHOST_USER_GET_INFLIGHT_FD),
+			REQ(VHOST_USER_SET_INFLIGHT_FD),
+			REQ(VHOST_USER_GPU_SET_SOCKET),
+			REQ(VHOST_USER_VRING_KICK),
+			REQ(VHOST_USER_GET_MAX_MEM_SLOTS),
+			REQ(VHOST_USER_ADD_MEM_REG),
+			REQ(VHOST_USER_REM_MEM_REG),
+			REQ(VHOST_USER_MAX),
+		};
+#undef REQ
+		return vu_request_str[req];
+	}
+
+	return "unknown";
+}
+
+/**
+ * qva_to_va() -  Translate front-end (QEMU) virtual address to our virtual
+ * 		  address
+ * @dev:		Vhost-user device
+ * @qemu_addr:		front-end userspace address
+ *
+ * Return: the memory address in our process virtual address space.
+ */
+static void *qva_to_va(struct vu_dev *dev, uint64_t qemu_addr)
+{
+	unsigned int i;
+
+	/* Find matching memory region.  */
+	for (i = 0; i < dev->nregions; i++) {
+		const struct vu_dev_region *r = &dev->regions[i];
+
+		if ((qemu_addr >= r->qva) && (qemu_addr < (r->qva + r->size))) {
+			/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
+			return (void *)(qemu_addr - r->qva + r->mmap_addr +
+					r->mmap_offset);
+		}
+	}
+
+	return NULL;
+}
+
+/**
+ * vmsg_close_fds() - Close all file descriptors of a given message
+ * @vmsg:	Vhost-user message with the list of the file descriptors
+ */
+static void vmsg_close_fds(const struct vhost_user_msg *vmsg)
+{
+	int i;
+
+	for (i = 0; i < vmsg->fd_num; i++)
+		close(vmsg->fds[i]);
+}
+
+/**
+ * vu_remove_watch() - Remove a file descriptor from our passt epoll
+ * 		       file descriptor
+ * @vdev:	Vhost-user device
+ * @fd:		file descriptor to remove
+ */
+static void vu_remove_watch(const struct vu_dev *vdev, int fd)
+{
+	(void)vdev;
+	(void)fd;
+}
+
+/**
+ * vmsg_set_reply_u64() - Set reply payload.u64 and clear request flags
+ * 			  and fd_num
+ * @vmsg:	Vhost-user message
+ * @val:	64bit value to reply
+ */
+static void vmsg_set_reply_u64(struct vhost_user_msg *vmsg, uint64_t val)
+{
+	vmsg->hdr.flags = 0; /* defaults will be set by vu_send_reply() */
+	vmsg->hdr.size = sizeof(vmsg->payload.u64);
+	vmsg->payload.u64 = val;
+	vmsg->fd_num = 0;
+}
+
+/**
+ * vu_message_read_default() - Read incoming vhost-user message from the
+ * 			       front-end
+ * @conn_fd:	Vhost-user command socket
+ * @vmsg:	Vhost-user message
+ *
+ * Return: -1 if there is an error,
+ *          0 if recvmsg() has been interrupted,
+ *          1 if a message has been received
+ */
+static int vu_message_read_default(int conn_fd, struct vhost_user_msg *vmsg)
+{
+	char control[CMSG_SPACE(VHOST_MEMORY_BASELINE_NREGIONS *
+		     sizeof(int))] = { 0 };
+	struct iovec iov = {
+		.iov_base = (char *)vmsg,
+		.iov_len = VHOST_USER_HDR_SIZE,
+	};
+	struct msghdr msg = {
+		.msg_iov = &iov,
+		.msg_iovlen = 1,
+		.msg_control = control,
+		.msg_controllen = sizeof(control),
+	};
+	ssize_t ret, sz_payload;
+	struct cmsghdr *cmsg;
+	size_t fd_size;
+
+	ret = recvmsg(conn_fd, &msg, MSG_DONTWAIT);
+	if (ret < 0) {
+		if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK)
+			return 0;
+		return -1;
+	}
+
+	vmsg->fd_num = 0;
+	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
+	     cmsg = CMSG_NXTHDR(&msg, cmsg)) {
+		if (cmsg->cmsg_level == SOL_SOCKET &&
+		    cmsg->cmsg_type == SCM_RIGHTS) {
+			fd_size = cmsg->cmsg_len - CMSG_LEN(0);
+			ASSERT(fd_size / sizeof(int) <=
+			       VHOST_MEMORY_BASELINE_NREGIONS);
+			vmsg->fd_num = fd_size / sizeof(int);
+			memcpy(vmsg->fds, CMSG_DATA(cmsg), fd_size);
+			break;
+		}
+	}
+
+	sz_payload = vmsg->hdr.size;
+	if ((size_t)sz_payload > sizeof(vmsg->payload)) {
+		die("Error: too big message request: %d,"
+			 " size: vmsg->size: %zd, "
+			 "while sizeof(vmsg->payload) = %zu",
+			 vmsg->hdr.request, sz_payload, sizeof(vmsg->payload));
+	}
+
+	if (sz_payload) {
+		do {
+			ret = recv(conn_fd, &vmsg->payload, sz_payload, 0);
+		} while (ret < 0 && (errno == EINTR || errno == EAGAIN));
+
+		if (ret < sz_payload)
+			die_perror("Error while reading");
+	}
+
+	return 1;
+}
+
+/**
+ * vu_message_write() - send a message to the front-end
+ * @conn_fd:	Vhost-user command socket
+ * @vmsg:	Vhost-user message
+ *
+ * #syscalls:vu sendmsg
+ */
+static void vu_message_write(int conn_fd, struct vhost_user_msg *vmsg)
+{
+	char control[CMSG_SPACE(VHOST_MEMORY_BASELINE_NREGIONS * sizeof(int))] = { 0 };
+	struct iovec iov = {
+		.iov_base = (char *)vmsg,
+		.iov_len = VHOST_USER_HDR_SIZE,
+	};
+	struct msghdr msg = {
+		.msg_iov = &iov,
+		.msg_iovlen = 1,
+		.msg_control = control,
+	};
+	const uint8_t *p = (uint8_t *)vmsg;
+	int rc;
+
+	memset(control, 0, sizeof(control));
+	ASSERT(vmsg->fd_num <= VHOST_MEMORY_BASELINE_NREGIONS);
+	if (vmsg->fd_num > 0) {
+		size_t fdsize = vmsg->fd_num * sizeof(int);
+		struct cmsghdr *cmsg;
+
+		msg.msg_controllen = CMSG_SPACE(fdsize);
+		cmsg = CMSG_FIRSTHDR(&msg);
+		cmsg->cmsg_len = CMSG_LEN(fdsize);
+		cmsg->cmsg_level = SOL_SOCKET;
+		cmsg->cmsg_type = SCM_RIGHTS;
+		memcpy(CMSG_DATA(cmsg), vmsg->fds, fdsize);
+	} else {
+		msg.msg_controllen = 0;
+	}
+
+	do {
+		rc = sendmsg(conn_fd, &msg, 0);
+	} while (rc < 0 && (errno == EINTR || errno == EAGAIN));
+
+	if (vmsg->hdr.size) {
+		do {
+			rc = write(conn_fd, p + VHOST_USER_HDR_SIZE,
+				   vmsg->hdr.size);
+		} while (rc < 0 && (errno == EINTR || errno == EAGAIN));
+	}
+
+	if (rc <= 0)
+		die_perror("Error while writing");
+}
+
+/**
+ * vu_send_reply() - Update message flags and send it to front-end
+ * @conn_fd:	Vhost-user command socket
+ * @msg:	Vhost-user message
+ */
+static void vu_send_reply(int conn_fd, struct vhost_user_msg *msg)
+{
+	msg->hdr.flags &= ~VHOST_USER_VERSION_MASK;
+	msg->hdr.flags |= VHOST_USER_VERSION;
+	msg->hdr.flags |= VHOST_USER_REPLY_MASK;
+
+	vu_message_write(conn_fd, msg);
+}
+
+/**
+ * vu_get_features_exec() - Provide back-end features bitmask to front-end
+ * @msg:	Vhost-user message
+ *
+ * Return: true as a reply is requested
+ */
+static bool vu_get_features_exec(struct vhost_user_msg *msg)
+{
+	uint64_t features =
+		1ULL << VIRTIO_F_VERSION_1 |
+		1ULL << VIRTIO_NET_F_MRG_RXBUF |
+		1ULL << VHOST_USER_F_PROTOCOL_FEATURES;
+
+	vmsg_set_reply_u64(msg, features);
+
+	debug("Sending back to guest u64: 0x%016"PRIx64, msg->payload.u64);
+
+	return true;
+}
+
+/**
+ * vu_set_enable_all_rings() - Enable/disable all the virtqueues
+ * @vdev:	Vhost-user device
+ * @enable:	New virtqueues state
+ */
+static void vu_set_enable_all_rings(struct vu_dev *vdev, bool enable)
+{
+	uint16_t i;
+
+	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++)
+		vdev->vq[i].enable = enable;
+}
+
+/**
+ * vu_set_features_exec() - Enable features of the back-end
+ * @vdev:	Vhost-user device
+ * @msg:	Vhost-user message
+ *
+ * Return: false as no reply is requested
+ */
+static bool vu_set_features_exec(struct vu_dev *vdev,
+				 struct vhost_user_msg *msg)
+{
+	debug("u64: 0x%016"PRIx64, msg->payload.u64);
+
+	vdev->features = msg->payload.u64;
+	/* We only support devices conforming to VIRTIO 1.0 or
+	 * later
+	 */
+	if (!vu_has_feature(vdev, VIRTIO_F_VERSION_1))
+		die("virtio legacy devices aren't supported by passt");
+
+	if (!vu_has_feature(vdev, VHOST_USER_F_PROTOCOL_FEATURES))
+		vu_set_enable_all_rings(vdev, true);
+
+	/* virtio-net features */
+
+	if (vu_has_feature(vdev, VIRTIO_F_VERSION_1) ||
+	    vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) {
+		vdev->hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	} else {
+		vdev->hdrlen = sizeof(struct virtio_net_hdr);
+	}
+
+	return false;
+}
+
+/**
+ * vu_set_owner_exec() - Session start flag, do nothing in our case
+ *
+ * Return: false as no reply is requested
+ */
+static bool vu_set_owner_exec(void)
+{
+	return false;
+}
+
+/**
+ * map_ring() - Convert ring front-end (QEMU) addresses to our process
+ * 		virtual address space.
+ * @vdev:	Vhost-user device
+ * @vq:		Virtqueue
+ *
+ * Return: true if ring cannot be mapped to our address space
+ */
+static bool map_ring(struct vu_dev *vdev, struct vu_virtq *vq)
+{
+	vq->vring.desc = qva_to_va(vdev, vq->vra.desc_user_addr);
+	vq->vring.used = qva_to_va(vdev, vq->vra.used_user_addr);
+	vq->vring.avail = qva_to_va(vdev, vq->vra.avail_user_addr);
+
+	debug("Setting virtq addresses:");
+	debug("    vring_desc  at %p", (void *)vq->vring.desc);
+	debug("    vring_used  at %p", (void *)vq->vring.used);
+	debug("    vring_avail at %p", (void *)vq->vring.avail);
+
+	return !(vq->vring.desc && vq->vring.used && vq->vring.avail);
+}
+
+/**
+ * vu_packet_check_range() - Check if a given memory zone is contained in
+ * 			     a mapped guest memory region
+ * @buf:	Array of the available memory regions
+ * @offset:	Offset of data range in packet descriptor
+ * @len:	Length of desired data range
+ * @start:	Start of the packet descriptor
+ *
+ * Return: 0 if the zone is in a mapped memory region, -1 otherwise
+ */
+/* cppcheck-suppress unusedFunction */
+int vu_packet_check_range(void *buf, size_t offset, size_t len,
+			  const char *start)
+{
+	struct vu_dev_region *dev_region;
+
+	for (dev_region = buf; dev_region->mmap_addr; dev_region++) {
+		/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
+		char *m = (char *)dev_region->mmap_addr;
+
+		if (m <= start &&
+		    start + offset + len < m + dev_region->mmap_offset +
+					       dev_region->size)
+			return 0;
+	}
+
+	return -1;
+}
+
+/**
+ * vu_set_mem_table_exec() - Set the memory map regions to be able to
+ * 			     translate the vring addresses
+ * @vdev:	Vhost-user device
+ * @msg:	Vhost-user message
+ *
+ * Return: false as no reply is requested
+ *
+ * #syscalls:vu mmap munmap
+ */
+static bool vu_set_mem_table_exec(struct vu_dev *vdev,
+				  struct vhost_user_msg *msg)
+{
+	struct vhost_user_memory m = msg->payload.memory, *memory = &m;
+	unsigned int i;
+
+	for (i = 0; i < vdev->nregions; i++) {
+		struct vu_dev_region *r = &vdev->regions[i];
+		/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
+		void *mm = (void *)r->mmap_addr;
+
+		if (mm)
+			munmap(mm, r->size + r->mmap_offset);
+	}
+	vdev->nregions = memory->nregions;
+
+	debug("Nregions: %u", memory->nregions);
+	for (i = 0; i < vdev->nregions; i++) {
+		struct vhost_user_memory_region *msg_region = &memory->regions[i];
+		struct vu_dev_region *dev_region = &vdev->regions[i];
+		void *mmap_addr;
+
+		debug("Region %d", i);
+		debug("    guest_phys_addr: 0x%016"PRIx64,
+		      msg_region->guest_phys_addr);
+		debug("    memory_size:     0x%016"PRIx64,
+		      msg_region->memory_size);
+		debug("    userspace_addr   0x%016"PRIx64,
+		      msg_region->userspace_addr);
+		debug("    mmap_offset      0x%016"PRIx64,
+		      msg_region->mmap_offset);
+
+		dev_region->gpa = msg_region->guest_phys_addr;
+		dev_region->size = msg_region->memory_size;
+		dev_region->qva = msg_region->userspace_addr;
+		dev_region->mmap_offset = msg_region->mmap_offset;
+
+		/* We don't use offset argument of mmap() since the
+		 * mapped address has to be page aligned, and we use huge
+		 * pages.
+		 */
+		mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset,
+				 PROT_READ | PROT_WRITE, MAP_SHARED |
+				 MAP_NORESERVE, msg->fds[i], 0);
+
+		if (mmap_addr == MAP_FAILED)
+			die_perror("region mmap error");
+
+		dev_region->mmap_addr = (uint64_t)(uintptr_t)mmap_addr;
+		debug("    mmap_addr:       0x%016"PRIx64,
+		      dev_region->mmap_addr);
+
+		close(msg->fds[i]);
+	}
+
+	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
+		if (vdev->vq[i].vring.desc) {
+			if (map_ring(vdev, &vdev->vq[i]))
+				die("remapping queue %d during setmemtable", i);
+		}
+	}
+
+	return false;
+}
+
+/**
+ * vu_set_vring_num_exec() - Set the size of the queue (vring size)
+ * @vdev:	Vhost-user device
+ * @msg:	Vhost-user message
+ *
+ * Return: false as no reply is requested
+ */
+static bool vu_set_vring_num_exec(struct vu_dev *vdev,
+				  struct vhost_user_msg *msg)
+{
+	unsigned int idx = msg->payload.state.index;
+	unsigned int num = msg->payload.state.num;
+
+	debug("State.index: %u", idx);
+	debug("State.num:   %u", num);
+	vdev->vq[idx].vring.num = num;
+
+	return false;
+}
+
+/**
+ * vu_set_vring_addr_exec() - Set the addresses of the vring
+ * @vdev:	Vhost-user device
+ * @msg:	Vhost-user message
+ *
+ * Return: false as no reply is requested
+ */
+static bool vu_set_vring_addr_exec(struct vu_dev *vdev,
+				   struct vhost_user_msg *msg)
+{
+	struct vhost_vring_addr addr = msg->payload.addr, *vra = &addr;
+	struct vu_virtq *vq = &vdev->vq[vra->index];
+
+	debug("vhost_vring_addr:");
+	debug("    index:  %d", vra->index);
+	debug("    flags:  %d", vra->flags);
+	debug("    desc_user_addr:   0x%016" PRIx64, (uint64_t)vra->desc_user_addr);
+	debug("    used_user_addr:   0x%016" PRIx64, (uint64_t)vra->used_user_addr);
+	debug("    avail_user_addr:  0x%016" PRIx64, (uint64_t)vra->avail_user_addr);
+	debug("    log_guest_addr:   0x%016" PRIx64, (uint64_t)vra->log_guest_addr);
+
+	vq->vra = *vra;
+	vq->vring.flags = vra->flags;
+	vq->vring.log_guest_addr = vra->log_guest_addr;
+
+	if (map_ring(vdev, vq))
+		die("Invalid vring_addr message");
+
+	vq->used_idx = le16toh(vq->vring.used->idx);
+
+	if (vq->last_avail_idx != vq->used_idx) {
+		debug("Last avail index != used index: %u != %u",
+		      vq->last_avail_idx, vq->used_idx);
+	}
+
+	return false;
+}
+
+/**
+ * vu_set_vring_base_exec() - Set the next index to use for descriptors
+ * 			      in this vring
+ * @vdev:	Vhost-user device
+ * @msg:	Vhost-user message
+ *
+ * Return: false as no reply is requested
+ */
+static bool vu_set_vring_base_exec(struct vu_dev *vdev,
+				   struct vhost_user_msg *msg)
+{
+	unsigned int idx = msg->payload.state.index;
+	unsigned int num = msg->payload.state.num;
+
+	debug("State.index: %u", idx);
+	debug("State.num:   %u", num);
+	vdev->vq[idx].shadow_avail_idx = vdev->vq[idx].last_avail_idx = num;
+
+	return false;
+}
+
+/**
+ * vu_get_vring_base_exec() - Stop the vring and return the current
+ * 			      descriptor index or indices
+ * @vdev:	Vhost-user device
+ * @msg:	Vhost-user message
+ *
+ * Return: true as a reply is requested
+ */
+static bool vu_get_vring_base_exec(struct vu_dev *vdev,
+				   struct vhost_user_msg *msg)
+{
+	unsigned int idx = msg->payload.state.index;
+
+	debug("State.index: %u", idx);
+	msg->payload.state.num = vdev->vq[idx].last_avail_idx;
+	msg->hdr.size = sizeof(msg->payload.state);
+
+	vdev->vq[idx].started = false;
+
+	if (vdev->vq[idx].call_fd != -1) {
+		close(vdev->vq[idx].call_fd);
+		vdev->vq[idx].call_fd = -1;
+	}
+	if (vdev->vq[idx].kick_fd != -1) {
+		vu_remove_watch(vdev, vdev->vq[idx].kick_fd);
+		close(vdev->vq[idx].kick_fd);
+		vdev->vq[idx].kick_fd = -1;
+	}
+
+	return true;
+}
+
+/**
+ * vu_set_watch() - Add a file descriptor to the passt epoll file descriptor
+ * @vdev:	vhost-user device
+ * @fd:		file descriptor to add
+ */
+static void vu_set_watch(const struct vu_dev *vdev, int fd)
+{
+	(void)vdev;
+	(void)fd;
+}
+
+/**
+ * vu_wait_queue() - Wait for new free entries in the virtqueue
+ * @vq:		Virtqueue to wait on
+ *
+ * Return: 0 on success, -1 on error
+ */
+static int vu_wait_queue(const struct vu_virtq *vq)
+{
+	eventfd_t kick_data;
+	ssize_t rc;
+	int flags;
+
+	/* wait for the kernel to put new entries in the queue */
+	flags = fcntl(vq->kick_fd, F_GETFL);
+	if (flags == -1)
+		return -1;
+
+	/* temporarily clear O_NONBLOCK so that eventfd_read() blocks */
+	if (fcntl(vq->kick_fd, F_SETFL, flags & ~O_NONBLOCK) == -1)
+		return -1;
+
+	rc = eventfd_read(vq->kick_fd, &kick_data);
+
+	/* restore the original file status flags */
+	if (fcntl(vq->kick_fd, F_SETFL, flags) == -1)
+		return -1;
+
+	if (rc == -1)
+		return -1;
+
+	return 0;
+}
+
+/**
+ * vu_send() - Send a buffer to the front-end using the RX virtqueue
+ * @vdev:	vhost-user device
+ * @buf:	address of the buffer
+ * @size:	size of the buffer
+ *
+ * Return: number of bytes sent, -1 if there is an error
+ */
+/* cppcheck-suppress unusedFunction */
+int vu_send(struct vu_dev *vdev, const void *buf, size_t size)
+{
+	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
+	struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
+	size_t lens[VIRTQUEUE_MAX_SIZE];
+	__virtio16 *num_buffers_ptr = NULL;
+	size_t hdrlen = vdev->hdrlen;
+	int in_sg_count = 0;
+	size_t offset = 0;
+	int i = 0, j;
+
+	debug("vu_send size %zu hdrlen %zu", size, hdrlen);
+
+	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
+		err("Got packet, but RX virtqueue is not usable yet");
+		return 0;
+	}
+
+	while (offset < size) {
+		size_t len;
+		int total;
+		int ret;
+
+		total = 0;
+
+		if (i == ARRAY_SIZE(elem) ||
+		    in_sg_count == ARRAY_SIZE(in_sg)) {
+			err("virtio-net unexpected long buffer chain");
+			goto err;
+		}
+
+		elem[i].out_num = 0;
+		elem[i].out_sg = NULL;
+		elem[i].in_num = ARRAY_SIZE(in_sg) - in_sg_count;
+		elem[i].in_sg = &in_sg[in_sg_count];
+
+		ret = vu_queue_pop(vdev, vq, &elem[i]);
+		if (ret < 0) {
+			if (vu_wait_queue(vq) != -1)
+				continue;
+			if (i) {
+				err("virtio-net unexpected empty queue: "
+				    "i %d mergeable %d offset %zd, size %zd, "
+				    "features 0x%" PRIx64,
+				    i, vu_has_feature(vdev,
+						      VIRTIO_NET_F_MRG_RXBUF),
+				    offset, size, vdev->features);
+			}
+			offset = -1;
+			goto err;
+		}
+		in_sg_count += elem[i].in_num;
+
+		if (elem[i].in_num < 1) {
+			err("virtio-net receive queue contains no in buffers");
+			vu_queue_detach_element(vq);
+			offset = -1;
+			goto err;
+		}
+
+		if (i == 0) {
+			struct virtio_net_hdr hdr = {
+				.flags = VIRTIO_NET_HDR_F_DATA_VALID,
+				.gso_type = VIRTIO_NET_HDR_GSO_NONE,
+			};
+
+			ASSERT(offset == 0);
+			ASSERT(elem[i].in_sg[0].iov_len >= hdrlen);
+
+			len = iov_from_buf(elem[i].in_sg, elem[i].in_num, 0,
+					   &hdr, sizeof(hdr));
+
+			num_buffers_ptr = (__virtio16 *)((char *)elem[i].in_sg[0].iov_base +
+							 len);
+
+			total += hdrlen;
+		}
+
+		len = iov_from_buf(elem[i].in_sg, elem[i].in_num, total,
+				   (char *)buf + offset, size - offset);
+
+		total += len;
+		offset += len;
+
+		/* If buffers can't be merged, at this point we
+		 * must have consumed the complete packet.
+		 * Otherwise, drop it.
+		 */
+		if (!vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF) &&
+		    offset < size) {
+			vu_queue_unpop(vq);
+			goto err;
+		}
+
+		lens[i] = total;
+		i++;
+	}
+
+	if (num_buffers_ptr && vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
+		*num_buffers_ptr = htole16(i);
+
+	for (j = 0; j < i; j++) {
+		debug("filling total %zd idx %d", lens[j], j);
+		vu_queue_fill(vq, &elem[j], lens[j], j);
+	}
+
+	vu_queue_flush(vq, i);
+	vu_queue_notify(vdev, vq);
+
+	debug("vhost-user sent %zu", offset);
+
+	return offset;
+err:
+	for (j = 0; j < i; j++)
+		vu_queue_detach_element(vq);
+
+	return offset;
+}
+
+/**
+ * vu_handle_tx() - Receive data from the TX virtqueue
+ * @vdev:	vhost-user device
+ * @index:	index of the virtqueue
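+ * @now:	Current timestamp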
+ */
+static void vu_handle_tx(struct vu_dev *vdev, int index,
+			 const struct timespec *now)
+{
+	struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
+	struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
+	struct vu_virtq *vq = &vdev->vq[index];
+	int hdrlen = vdev->hdrlen;
+	int out_sg_count;
+	int count;
+
+	if (!VHOST_USER_IS_QUEUE_TX(index)) {
+		debug("index %d is not a TX queue", index);
+		return;
+	}
+
+	tap_flush_pools();
+
+	count = 0;
+	out_sg_count = 0;
+	while (1) {
+		int ret;
+
+		elem[count].out_num = 1;
+		elem[count].out_sg = &out_sg[out_sg_count];
+		elem[count].in_num = 0;
+		elem[count].in_sg = NULL;
+		ret = vu_queue_pop(vdev, vq, &elem[count]);
+		if (ret < 0)
+			break;
+		out_sg_count += elem[count].out_num;
+
+		if (elem[count].out_num < 1) {
+			debug("virtio-net header not in first element");
+			break;
+		}
+		ASSERT(elem[count].out_num == 1);
+
+		tap_add_packet(vdev->context,
+			       elem[count].out_sg[0].iov_len - hdrlen,
+			       (char *)elem[count].out_sg[0].iov_base + hdrlen);
+		count++;
+	}
+	tap_handler(vdev->context, now);
+
+	if (count) {
+		int i;
+
+		for (i = 0; i < count; i++)
+			vu_queue_fill(vq, &elem[i], 0, i);
+		vu_queue_flush(vq, count);
+		vu_queue_notify(vdev, vq);
+	}
+}
+
+/**
+ * vu_kick_cb() - Called on a kick event to start to receive data
+ * @vdev:	vhost-user device
+ * @ref:	epoll reference information
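+ * @now:	Current timestamp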
+ */
+/* cppcheck-suppress unusedFunction */
+void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
+		const struct timespec *now)
+{
+	eventfd_t kick_data;
+	ssize_t rc;
+	int idx;
+
+	for (idx = 0; idx < VHOST_USER_MAX_QUEUES; idx++)
+		if (vdev->vq[idx].kick_fd == ref.fd)
+			break;
+
+	if (idx == VHOST_USER_MAX_QUEUES)
+		return;
+
+	rc = eventfd_read(ref.fd, &kick_data);
+	if (rc == -1)
+		die_perror("kick eventfd_read()");
+
+	debug("Got kick_data: %016"PRIx64" idx:%d",
+	      kick_data, idx);
+	if (VHOST_USER_IS_QUEUE_TX(idx))
+		vu_handle_tx(vdev, idx, now);
+}
+
+/**
+ * vu_check_queue_msg_file() - Check if a message is valid,
+ * 			       close fds if NOFD bit is set
+ * @msg:	Vhost-user message
+ */
+static void vu_check_queue_msg_file(struct vhost_user_msg *msg)
+{
+	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
+	int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
+
+	if (idx >= VHOST_USER_MAX_QUEUES)
+		die("Invalid queue index: %u", idx);
+
+	if (nofd) {
+		vmsg_close_fds(msg);
+		return;
+	}
+
+	if (msg->fd_num != 1)
+		die("Invalid fds in request: %d", msg->hdr.request);
+}
+
+/**
+ * vu_set_vring_kick_exec() - Set the event file descriptor for adding buffers
+ * 			      to the vring
+ * @vdev:	Vhost-user device
+ * @msg:	Vhost-user message
+ *
+ * Return: false as no reply is requested
+ */
+static bool vu_set_vring_kick_exec(struct vu_dev *vdev,
+				   struct vhost_user_msg *msg)
+{
+	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
+	int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
+
+	debug("u64: 0x%016"PRIx64, msg->payload.u64);
+
+	vu_check_queue_msg_file(msg);
+
+	if (vdev->vq[idx].kick_fd != -1) {
+		vu_remove_watch(vdev, vdev->vq[idx].kick_fd);
+		close(vdev->vq[idx].kick_fd);
+	}
+
+	vdev->vq[idx].kick_fd = nofd ? -1 : msg->fds[0];
+	debug("Got kick_fd: %d for vq: %d", vdev->vq[idx].kick_fd, idx);
+
+	vdev->vq[idx].started = true;
+
+	if (vdev->vq[idx].kick_fd != -1 && VHOST_USER_IS_QUEUE_TX(idx)) {
+		vu_set_watch(vdev, vdev->vq[idx].kick_fd);
+		debug("Waiting for kicks on fd: %d for vq: %d",
+		      vdev->vq[idx].kick_fd, idx);
+	}
+
+	return false;
+}
+
+/**
+ * vu_set_vring_call_exec() - Set the event file descriptor to signal when
+ * 			      buffers are used
+ * @vdev:	Vhost-user device
+ * @msg:	Vhost-user message
+ *
+ * Return: false as no reply is requested
+ */
+static bool vu_set_vring_call_exec(struct vu_dev *vdev,
+				   struct vhost_user_msg *msg)
+{
+	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
+	int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
+
+	debug("u64: 0x%016"PRIx64, msg->payload.u64);
+
+	vu_check_queue_msg_file(msg);
+
+	if (vdev->vq[idx].call_fd != -1)
+		close(vdev->vq[idx].call_fd);
+
+	vdev->vq[idx].call_fd = nofd ? -1 : msg->fds[0];
+
+	/* in case of I/O hang after reconnecting */
+	if (vdev->vq[idx].call_fd != -1)
+		eventfd_write(msg->fds[0], 1);
+
+	debug("Got call_fd: %d for vq: %d", vdev->vq[idx].call_fd, idx);
+
+	return false;
+}
+
+/**
+ * vu_set_vring_err_exec() - Set the event file descriptor to signal when
+ * 			     an error occurs
+ * @vdev:	Vhost-user device
+ * @msg:	Vhost-user message
+ *
+ * Return: false as no reply is requested
+ */
+static bool vu_set_vring_err_exec(struct vu_dev *vdev,
+				  struct vhost_user_msg *msg)
+{
+	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
+	int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
+
+	debug("u64: 0x%016"PRIx64, msg->payload.u64);
+
+	vu_check_queue_msg_file(msg);
+
+	if (vdev->vq[idx].err_fd != -1) {
+		close(vdev->vq[idx].err_fd);
+		vdev->vq[idx].err_fd = -1;
+	}
+
+	/* cppcheck-suppress redundantAssignment */
+	vdev->vq[idx].err_fd = nofd ? -1 : msg->fds[0];
+
+	return false;
+}
+
+/**
+ * vu_get_protocol_features_exec() - Provide the protocol (vhost-user) features
+ * 				     to the front-end
+ * @msg:	Vhost-user message
+ *
+ * Return: true as a reply is requested
+ */
+static bool vu_get_protocol_features_exec(struct vhost_user_msg *msg)
+{
+	uint64_t features = 1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK;
+
+	vmsg_set_reply_u64(msg, features);
+
+	return true;
+}
+
+/**
+ * vu_set_protocol_features_exec() - Enable protocol (vhost-user) features
+ * @vdev:	Vhost-user device
+ * @msg:	Vhost-user message
+ *
+ * Return: false as no reply is requested
+ */
+static bool vu_set_protocol_features_exec(struct vu_dev *vdev,
+					  struct vhost_user_msg *msg)
+{
+	uint64_t features = msg->payload.u64;
+
+	debug("u64: 0x%016"PRIx64, features);
+
+	vdev->protocol_features = msg->payload.u64;
+
+	if (vu_has_protocol_feature(vdev,
+				    VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS) &&
+	    (!vu_has_protocol_feature(vdev, VHOST_USER_PROTOCOL_F_BACKEND_REQ) ||
+	     !vu_has_protocol_feature(vdev, VHOST_USER_PROTOCOL_F_REPLY_ACK))) {
+		/* The use case for using messages for kick/call is
+		 * simulation, to make the kick and call synchronous. To
+		 * actually get that behaviour, both of the other features
+		 * are required.
+		 * Theoretically, one could use only kick messages, or do
+		 * them without having F_REPLY_ACK, but too many (possibly
+		 * pending) messages on the socket will eventually cause the
+		 * front-end to hang. To avoid this in scenarios where it's
+		 * not desired, enforce settings that actually enable the
+		 * simulation case.
+		 */
+		die("F_IN_BAND_NOTIFICATIONS requires F_BACKEND_REQ && F_REPLY_ACK");
+	}
+
+	return false;
+}
+
+/**
+ * vu_get_queue_num_exec() - Tell how many queues we support
+ * @msg:	Vhost-user message
+ *
+ * Return: true as a reply is requested
+ */
+static bool vu_get_queue_num_exec(struct vhost_user_msg *msg)
+{
+	vmsg_set_reply_u64(msg, VHOST_USER_MAX_QUEUES);
+	return true;
+}
+
+/**
+ * vu_set_vring_enable_exec() - Enable or disable corresponding vring
+ * @vdev:	Vhost-user device
+ * @msg:	Vhost-user message
+ *
+ * Return: false as no reply is requested
+ */
+static bool vu_set_vring_enable_exec(struct vu_dev *vdev,
+				     struct vhost_user_msg *msg)
+{
+	unsigned int enable = msg->payload.state.num;
+	unsigned int idx = msg->payload.state.index;
+
+	debug("State.index:  %u", idx);
+	debug("State.enable: %u", enable);
+
+	if (idx >= VHOST_USER_MAX_QUEUES)
+		die("Invalid vring_enable index: %u", idx);
+
+	vdev->vq[idx].enable = enable;
+	return false;
+}
+
+/**
+ * vu_init() - Initialize vhost-user device structure
+ * @c:		execution context
+ * @vdev:	vhost-user device
+ */
+/* cppcheck-suppress unusedFunction */
+void vu_init(struct ctx *c, struct vu_dev *vdev)
+{
+	int i;
+
+	vdev->context = c;
+	vdev->hdrlen = 0;
+	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
+		vdev->vq[i] = (struct vu_virtq){
+			.call_fd = -1,
+			.kick_fd = -1,
+			.err_fd = -1,
+			.notification = true,
+		};
+	}
+}
+
+/**
+ * vu_cleanup() - Reset vhost-user device
+ * @vdev:	vhost-user device
+ */
+void vu_cleanup(struct vu_dev *vdev)
+{
+	unsigned int i;
+
+	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
+		struct vu_virtq *vq = &vdev->vq[i];
+
+		vq->started = false;
+		vq->notification = true;
+
+		if (vq->call_fd != -1) {
+			close(vq->call_fd);
+			vq->call_fd = -1;
+		}
+		if (vq->err_fd != -1) {
+			close(vq->err_fd);
+			vq->err_fd = -1;
+		}
+		if (vq->kick_fd != -1) {
+			vu_remove_watch(vdev, vq->kick_fd);
+			close(vq->kick_fd);
+			vq->kick_fd = -1;
+		}
+
+		vq->vring.desc = 0;
+		vq->vring.used = 0;
+		vq->vring.avail = 0;
+	}
+	vdev->hdrlen = 0;
+
+	for (i = 0; i < vdev->nregions; i++) {
+		const struct vu_dev_region *r = &vdev->regions[i];
+		/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
+		void *m = (void *)r->mmap_addr;
+
+		if (m)
+			munmap(m, r->size + r->mmap_offset);
+	}
+	vdev->nregions = 0;
+}
+
+/**
+ * vu_sock_reset() - Reset connection socket
+ * @vdev:	vhost-user device
+ */
+static void vu_sock_reset(struct vu_dev *vdev)
+{
+	(void)vdev;
+}
+
+/**
+ * tap_handler_vu() - Packet handler for vhost-user
+ * @vdev:	vhost-user device
+ * @fd:		vhost-user message socket
+ * @events:	epoll events
+ */
+/* cppcheck-suppress unusedFunction */
+void tap_handler_vu(struct vu_dev *vdev, int fd, uint32_t events)
+{
+	struct vhost_user_msg msg = { 0 };
+	bool need_reply, reply_requested;
+	int ret;
+
+	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) {
+		vu_sock_reset(vdev);
+		return;
+	}
+
+	ret = vu_message_read_default(fd, &msg);
+	if (ret < 0)
+		die_perror("Error while receiving message");
+	if (ret == 0) {
+		vu_sock_reset(vdev);
+		return;
+	}
+	debug("================ Vhost user message ================");
+	debug("Request: %s (%d)", vu_request_to_string(msg.hdr.request),
+		msg.hdr.request);
+	debug("Flags:   0x%x", msg.hdr.flags);
+	debug("Size:    %u", msg.hdr.size);
+
+	need_reply = msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK;
+	switch (msg.hdr.request) {
+	case VHOST_USER_GET_FEATURES:
+		reply_requested = vu_get_features_exec(&msg);
+		break;
+	case VHOST_USER_SET_FEATURES:
+		reply_requested = vu_set_features_exec(vdev, &msg);
+		break;
+	case VHOST_USER_GET_PROTOCOL_FEATURES:
+		reply_requested = vu_get_protocol_features_exec(&msg);
+		break;
+	case VHOST_USER_SET_PROTOCOL_FEATURES:
+		reply_requested = vu_set_protocol_features_exec(vdev, &msg);
+		break;
+	case VHOST_USER_GET_QUEUE_NUM:
+		reply_requested = vu_get_queue_num_exec(&msg);
+		break;
+	case VHOST_USER_SET_OWNER:
+		reply_requested = vu_set_owner_exec();
+		break;
+	case VHOST_USER_SET_MEM_TABLE:
+		reply_requested = vu_set_mem_table_exec(vdev, &msg);
+		break;
+	case VHOST_USER_SET_VRING_NUM:
+		reply_requested = vu_set_vring_num_exec(vdev, &msg);
+		break;
+	case VHOST_USER_SET_VRING_ADDR:
+		reply_requested = vu_set_vring_addr_exec(vdev, &msg);
+		break;
+	case VHOST_USER_SET_VRING_BASE:
+		reply_requested = vu_set_vring_base_exec(vdev, &msg);
+		break;
+	case VHOST_USER_GET_VRING_BASE:
+		reply_requested = vu_get_vring_base_exec(vdev, &msg);
+		break;
+	case VHOST_USER_SET_VRING_KICK:
+		reply_requested = vu_set_vring_kick_exec(vdev, &msg);
+		break;
+	case VHOST_USER_SET_VRING_CALL:
+		reply_requested = vu_set_vring_call_exec(vdev, &msg);
+		break;
+	case VHOST_USER_SET_VRING_ERR:
+		reply_requested = vu_set_vring_err_exec(vdev, &msg);
+		break;
+	case VHOST_USER_SET_VRING_ENABLE:
+		reply_requested = vu_set_vring_enable_exec(vdev, &msg);
+		break;
+	case VHOST_USER_NONE:
+		vu_cleanup(vdev);
+		return;
+	default:
+		die("Unhandled request: %d", msg.hdr.request);
+	}
+
+	if (!reply_requested && need_reply) {
+		msg.payload.u64 = 0;
+		msg.hdr.flags = 0;
+		msg.hdr.size = sizeof(msg.payload.u64);
+		msg.fd_num = 0;
+		reply_requested = true;
+	}
+
+	if (reply_requested)
+		vu_send_reply(fd, &msg);
+}
diff --git a/vhost_user.h b/vhost_user.h
new file mode 100644
index 000000000000..135856dc2873
--- /dev/null
+++ b/vhost_user.h
@@ -0,0 +1,202 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ *
+ * vhost-user API, command management and virtio interface
+ */
+
+/* some parts from subprojects/libvhost-user/libvhost-user.h */
+
+#ifndef VHOST_USER_H
+#define VHOST_USER_H
+
+#include "virtio.h"
+#include "iov.h"
+
+#define VHOST_USER_F_PROTOCOL_FEATURES 30
+
+#define VHOST_MEMORY_BASELINE_NREGIONS 8
+
+/**
+ * enum vhost_user_protocol_feature - List of vhost-user protocol features
+ */
+enum vhost_user_protocol_feature {
+	VHOST_USER_PROTOCOL_F_MQ = 0,
+	VHOST_USER_PROTOCOL_F_LOG_SHMFD = 1,
+	VHOST_USER_PROTOCOL_F_RARP = 2,
+	VHOST_USER_PROTOCOL_F_REPLY_ACK = 3,
+	VHOST_USER_PROTOCOL_F_NET_MTU = 4,
+	VHOST_USER_PROTOCOL_F_BACKEND_REQ = 5,
+	VHOST_USER_PROTOCOL_F_CROSS_ENDIAN = 6,
+	VHOST_USER_PROTOCOL_F_CRYPTO_SESSION = 7,
+	VHOST_USER_PROTOCOL_F_PAGEFAULT = 8,
+	VHOST_USER_PROTOCOL_F_CONFIG = 9,
+	VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD = 10,
+	VHOST_USER_PROTOCOL_F_HOST_NOTIFIER = 11,
+	VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD = 12,
+	VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS = 14,
+	VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
+
+	VHOST_USER_PROTOCOL_F_MAX
+};
+
+/**
+ * enum vhost_user_request - List of available vhost-user requests
+ */
+enum vhost_user_request {
+	VHOST_USER_NONE = 0,
+	VHOST_USER_GET_FEATURES = 1,
+	VHOST_USER_SET_FEATURES = 2,
+	VHOST_USER_SET_OWNER = 3,
+	VHOST_USER_RESET_OWNER = 4,
+	VHOST_USER_SET_MEM_TABLE = 5,
+	VHOST_USER_SET_LOG_BASE = 6,
+	VHOST_USER_SET_LOG_FD = 7,
+	VHOST_USER_SET_VRING_NUM = 8,
+	VHOST_USER_SET_VRING_ADDR = 9,
+	VHOST_USER_SET_VRING_BASE = 10,
+	VHOST_USER_GET_VRING_BASE = 11,
+	VHOST_USER_SET_VRING_KICK = 12,
+	VHOST_USER_SET_VRING_CALL = 13,
+	VHOST_USER_SET_VRING_ERR = 14,
+	VHOST_USER_GET_PROTOCOL_FEATURES = 15,
+	VHOST_USER_SET_PROTOCOL_FEATURES = 16,
+	VHOST_USER_GET_QUEUE_NUM = 17,
+	VHOST_USER_SET_VRING_ENABLE = 18,
+	VHOST_USER_SEND_RARP = 19,
+	VHOST_USER_NET_SET_MTU = 20,
+	VHOST_USER_SET_BACKEND_REQ_FD = 21,
+	VHOST_USER_IOTLB_MSG = 22,
+	VHOST_USER_SET_VRING_ENDIAN = 23,
+	VHOST_USER_GET_CONFIG = 24,
+	VHOST_USER_SET_CONFIG = 25,
+	VHOST_USER_CREATE_CRYPTO_SESSION = 26,
+	VHOST_USER_CLOSE_CRYPTO_SESSION = 27,
+	VHOST_USER_POSTCOPY_ADVISE  = 28,
+	VHOST_USER_POSTCOPY_LISTEN  = 29,
+	VHOST_USER_POSTCOPY_END     = 30,
+	VHOST_USER_GET_INFLIGHT_FD = 31,
+	VHOST_USER_SET_INFLIGHT_FD = 32,
+	VHOST_USER_GPU_SET_SOCKET = 33,
+	VHOST_USER_VRING_KICK = 35,
+	VHOST_USER_GET_MAX_MEM_SLOTS = 36,
+	VHOST_USER_ADD_MEM_REG = 37,
+	VHOST_USER_REM_MEM_REG = 38,
+	VHOST_USER_MAX
+};
+
+/**
+ * struct vhost_user_header - Vhost-user message header
+ * @request:	Request type of the message
+ * @flags:	Request flags
+ * @size:	The following payload size
+ */
+struct vhost_user_header {
+	enum vhost_user_request request;
+
+#define VHOST_USER_VERSION_MASK     0x3
+#define VHOST_USER_REPLY_MASK       (0x1 << 2)
+#define VHOST_USER_NEED_REPLY_MASK  (0x1 << 3)
+	uint32_t flags;
+	uint32_t size; /* the following payload size */
+} __attribute__ ((__packed__));
+
+/**
+ * struct vhost_user_memory_region - Front-end shared memory region information
+ * @guest_phys_addr:	Guest physical address of the region
+ * @memory_size:	Memory size
+ * @userspace_addr:	Front-end (QEMU) userspace address
+ * @mmap_offset:	region offset in the shared memory area
+ */
+struct vhost_user_memory_region {
+	uint64_t guest_phys_addr;
+	uint64_t memory_size;
+	uint64_t userspace_addr;
+	uint64_t mmap_offset;
+};
+
+/**
+ * struct vhost_user_memory - List of all the shared memory regions
+ * @nregions:	Number of memory regions
+ * @padding:	Padding
+ * @regions:	Memory regions list
+ */
+struct vhost_user_memory {
+	uint32_t nregions;
+	uint32_t padding;
+	struct vhost_user_memory_region regions[VHOST_MEMORY_BASELINE_NREGIONS];
+};
+
+/**
+ * union vhost_user_payload - Vhost-user message payload
+ * @u64:		64-bit payload
+ * @state:		Vring state payload
+ * @addr:		Vring addresses payload
+ * @memory:		Memory regions information payload
+ */
+union vhost_user_payload {
+#define VHOST_USER_VRING_IDX_MASK   0xff
+#define VHOST_USER_VRING_NOFD_MASK  (0x1 << 8)
+	uint64_t u64;
+	struct vhost_vring_state state;
+	struct vhost_vring_addr addr;
+	struct vhost_user_memory memory;
+};
+
+/**
+ * struct vhost_user_msg - Vhost-user message
+ * @hdr:		Message header
+ * @payload:		Message payload
+ * @fds:		File descriptors associated with the message
+ * 			in the ancillary data.
+ * 			(shared memory or event file descriptors)
+ * @fd_num:		Number of file descriptors
+ */
+struct vhost_user_msg {
+	struct vhost_user_header hdr;
+	union vhost_user_payload payload;
+
+	int fds[VHOST_MEMORY_BASELINE_NREGIONS];
+	int fd_num;
+} __attribute__ ((__packed__));
+#define VHOST_USER_HDR_SIZE sizeof(struct vhost_user_header)
+
+/* index of the RX virtqueue */
+#define VHOST_USER_RX_QUEUE 0
+/* index of the TX virtqueue */
+#define VHOST_USER_TX_QUEUE 1
+
+/* in case of multiqueue, RX and TX queues are interleaved */
+#define VHOST_USER_IS_QUEUE_TX(n)	((n) % 2)
+#define VHOST_USER_IS_QUEUE_RX(n)	(!((n) % 2))
+
+/**
+ * vu_queue_enabled() - Return state of a virtqueue
+ * @vq:		Virtqueue to check
+ *
+ * Return: true if the virtqueue is enabled, false otherwise
+ */
+static inline bool vu_queue_enabled(const struct vu_virtq *vq)
+{
+	return vq->enable;
+}
+
+/**
+ * vu_queue_started() - Return state of a virtqueue
+ * @vq:		Virtqueue to check
+ *
+ * Return: true if the virtqueue is started, false otherwise
+ */
+static inline bool vu_queue_started(const struct vu_virtq *vq)
+{
+	return vq->started;
+}
+
+int vu_send(struct vu_dev *vdev, const void *buf, size_t size);
+void vu_print_capabilities(void);
+void vu_init(struct ctx *c, struct vu_dev *vdev);
+void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
+		const struct timespec *now);
+void vu_cleanup(struct vu_dev *vdev);
+void tap_handler_vu(struct vu_dev *vdev, int fd, uint32_t events);
+#endif /* VHOST_USER_H */
diff --git a/virtio.c b/virtio.c
index 8354f6052aee..d02e6e04701d 100644
--- a/virtio.c
+++ b/virtio.c
@@ -323,7 +323,6 @@ static bool vring_can_notify(const struct vu_dev *dev, struct vu_virtq *vq)
  * @dev:	Vhost-user device
  * @vq:		Virtqueue
  */
-/* cppcheck-suppress unusedFunction */
 void vu_queue_notify(const struct vu_dev *dev, struct vu_virtq *vq)
 {
 	if (!vq->vring.avail)
@@ -500,7 +499,6 @@ static int vu_queue_map_desc(struct vu_dev *dev, struct vu_virtq *vq, unsigned i
  *
  * Return: -1 if there is an error, 0 otherwise
  */
-/* cppcheck-suppress unusedFunction */
 int vu_queue_pop(struct vu_dev *dev, struct vu_virtq *vq, struct vu_virtq_element *elem)
 {
 	unsigned int head;
@@ -550,7 +548,6 @@ void vu_queue_detach_element(struct vu_virtq *vq)
  * vu_queue_unpop() - Push back the previously popped element from the virqueue
  * @vq:		Virtqueue
  */
-/* cppcheck-suppress unusedFunction */
 void vu_queue_unpop(struct vu_virtq *vq)
 {
 	vq->last_avail_idx--;
@@ -618,7 +615,6 @@ void vu_queue_fill_by_index(struct vu_virtq *vq, unsigned int index,
  * @len:	Size of the element
  * @idx:	Used ring entry index
  */
-/* cppcheck-suppress unusedFunction */
 void vu_queue_fill(struct vu_virtq *vq, const struct vu_virtq_element *elem,
 		   unsigned int len, unsigned int idx)
 {
@@ -642,7 +638,6 @@ static inline void vring_used_idx_set(struct vu_virtq *vq, uint16_t val)
  * @vq:		Virtqueue
  * @count:	Number of entry to flush
  */
-/* cppcheck-suppress unusedFunction */
 void vu_queue_flush(struct vu_virtq *vq, unsigned int count)
 {
 	uint16_t old, new;
diff --git a/virtio.h b/virtio.h
index af9cadc990b9..242e788e07e9 100644
--- a/virtio.h
+++ b/virtio.h
@@ -106,6 +106,7 @@ struct vu_dev_region {
  * @hdrlen:		Virtio -net header length
  */
 struct vu_dev {
+	struct ctx *context;
 	uint32_t nregions;
 	struct vu_dev_region regions[VHOST_USER_MAX_RAM_SLOTS];
 	struct vu_virtq vq[VHOST_USER_MAX_QUEUES];
@@ -162,7 +163,6 @@ static inline bool vu_has_feature(const struct vu_dev *vdev,
  *
  * Return:	True if the feature is available
  */
-/* cppcheck-suppress unusedFunction */
 static inline bool vu_has_protocol_feature(const struct vu_dev *vdev,
 					   unsigned int fbit)
 {
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v3 4/4] vhost-user: add vhost-user
  2024-08-15 15:50 [PATCH v3 0/4] Add vhost-user support to passt. (part 3) Laurent Vivier
                   ` (2 preceding siblings ...)
  2024-08-15 15:50 ` [PATCH v3 3/4] vhost-user: introduce vhost-user API Laurent Vivier
@ 2024-08-15 15:50 ` Laurent Vivier
  2024-08-22  9:59   ` Stefano Brivio
                     ` (2 more replies)
  2024-08-20 22:41 ` [PATCH v3 0/4] Add vhost-user support to passt. (part 3) Stefano Brivio
  4 siblings, 3 replies; 22+ messages in thread
From: Laurent Vivier @ 2024-08-15 15:50 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

Add virtio and vhost-user functions to connect with QEMU.

  $ ./passt --vhost-user

and

  # qemu-system-x86_64 ... -m 4G \
        -object memory-backend-memfd,id=memfd0,share=on,size=4G \
        -numa node,memdev=memfd0 \
        -chardev socket,id=chr0,path=/tmp/passt_1.socket \
        -netdev vhost-user,id=netdev0,chardev=chr0 \
        -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
        ...

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 Makefile       |   6 +-
 checksum.c     |   1 -
 conf.c         |  24 +-
 epoll_type.h   |   4 +
 isolation.c    |  15 +-
 packet.c       |  13 ++
 packet.h       |   2 +
 passt.c        |  25 ++-
 passt.h        |   6 +
 pcap.c         |   1 -
 tap.c          | 106 +++++++--
 tap.h          |   5 +-
 tcp.c          |  33 ++-
 tcp_buf.c      |   6 +-
 tcp_internal.h |   3 +-
 tcp_vu.c       | 593 +++++++++++++++++++++++++++++++++++++++++++++++++
 tcp_vu.h       |  12 +
 udp.c          |  71 +++---
 udp.h          |   8 +-
 udp_internal.h |  34 +++
 udp_vu.c       | 338 ++++++++++++++++++++++++++++
 udp_vu.h       |  13 ++
 vhost_user.c   |  28 ++-
 virtio.c       |   1 -
 vu_common.c    |  27 +++
 vu_common.h    |  34 +++
 26 files changed, 1310 insertions(+), 99 deletions(-)
 create mode 100644 tcp_vu.c
 create mode 100644 tcp_vu.h
 create mode 100644 udp_internal.h
 create mode 100644 udp_vu.c
 create mode 100644 udp_vu.h
 create mode 100644 vu_common.c
 create mode 100644 vu_common.h

diff --git a/Makefile b/Makefile
index 4ccefffacfde..4fb178932f8e 100644
--- a/Makefile
+++ b/Makefile
@@ -47,7 +47,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
 	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
 	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
-	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c vhost_user.c virtio.c
+	tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
+	vhost_user.c virtio.c vu_common.c
 QRAP_SRCS = qrap.c
 SRCS = $(PASST_SRCS) $(QRAP_SRCS)
 
@@ -57,7 +58,8 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
 	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
 	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
 	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
-	udp.h udp_flow.h util.h vhost_user.h virtio.h
+	tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
+	virtio.h vu_common.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
diff --git a/checksum.c b/checksum.c
index 006614fcbb28..aa5b7ae1cb66 100644
--- a/checksum.c
+++ b/checksum.c
@@ -501,7 +501,6 @@ uint16_t csum(const void *buf, size_t len, uint32_t init)
  *
  * Return: 16-bit folded, complemented checksum
  */
-/* cppcheck-suppress unusedFunction */
 uint16_t csum_iov(const struct iovec *iov, size_t n, uint32_t init)
 {
 	unsigned int i;
diff --git a/conf.c b/conf.c
index 46fcd9126b4c..c684dbaac694 100644
--- a/conf.c
+++ b/conf.c
@@ -45,6 +45,7 @@
 #include "lineread.h"
 #include "isolation.h"
 #include "log.h"
+#include "vhost_user.h"
 
 /**
  * next_chunk - Return the next piece of a string delimited by a character
@@ -759,9 +760,14 @@ static void usage(const char *name, FILE *f, int status)
 			"    default: same interface name as external one\n");
 	} else {
 		fprintf(f,
-			"  -s, --socket PATH	UNIX domain socket path\n"
+			"  -s, --socket, --socket-path PATH	UNIX domain socket path\n"
 			"    default: probe free path starting from "
 			UNIX_SOCK_PATH "\n", 1);
+		fprintf(f,
+			"  --vhost-user		Enable vhost-user mode\n"
+			"    UNIX domain socket is provided by -s option\n"
+			"  --print-capabilities	print back-end capabilities in JSON format,\n"
+			"    only meaningful for vhost-user mode\n");
 	}
 
 	fprintf(f,
@@ -1230,6 +1236,10 @@ void conf(struct ctx *c, int argc, char **argv)
 		{"no-copy-routes", no_argument,		NULL,		18 },
 		{"no-copy-addrs", no_argument,		NULL,		19 },
 		{"netns-only",	no_argument,		NULL,		20 },
+		{"vhost-user",	no_argument,		NULL,		21 },
+		/* vhost-user backend program convention */
+		{"print-capabilities", no_argument,	NULL,		22 },
+		{"socket-path",	required_argument,	NULL,		's' },
 		{ 0 },
 	};
 	const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
@@ -1359,14 +1369,12 @@ void conf(struct ctx *c, int argc, char **argv)
 				       sizeof(c->ip4.ifname_out), "%s", optarg);
 			if (ret <= 0 || ret >= (int)sizeof(c->ip4.ifname_out))
 				die("Invalid interface name: %s", optarg);
-
 			break;
 		case 16:
 			ret = snprintf(c->ip6.ifname_out,
 				       sizeof(c->ip6.ifname_out), "%s", optarg);
 			if (ret <= 0 || ret >= (int)sizeof(c->ip6.ifname_out))
 				die("Invalid interface name: %s", optarg);
-
 			break;
 		case 17:
 			if (c->mode != MODE_PASTA)
@@ -1395,6 +1403,16 @@ void conf(struct ctx *c, int argc, char **argv)
 			netns_only = 1;
 			*userns = 0;
 			break;
+		case 21:
+			if (c->mode == MODE_PASTA) {
+				err("--vhost-user is for passt mode only");
+				usage(argv[0], stderr, EXIT_FAILURE);
+			}
+			c->mode = MODE_VU;
+			break;
+		case 22:
+			vu_print_capabilities();
+			break;
 		case 'd':
 			c->debug = 1;
 			c->quiet = 0;
diff --git a/epoll_type.h b/epoll_type.h
index 0ad1efa0ccec..f3ef41584757 100644
--- a/epoll_type.h
+++ b/epoll_type.h
@@ -36,6 +36,10 @@ enum epoll_type {
 	EPOLL_TYPE_TAP_PASST,
 	/* socket listening for qemu socket connections */
 	EPOLL_TYPE_TAP_LISTEN,
+	/* vhost-user command socket */
+	EPOLL_TYPE_VHOST_CMD,
+	/* vhost-user kick event socket */
+	EPOLL_TYPE_VHOST_KICK,
 
 	EPOLL_NUM_TYPES,
 };
diff --git a/isolation.c b/isolation.c
index 4956d7e6f331..1a27f066c2ba 100644
--- a/isolation.c
+++ b/isolation.c
@@ -373,12 +373,19 @@ void isolate_postfork(const struct ctx *c)
 
 	prctl(PR_SET_DUMPABLE, 0);
 
-	if (c->mode == MODE_PASTA) {
-		prog.len = (unsigned short)ARRAY_SIZE(filter_pasta);
-		prog.filter = filter_pasta;
-	} else {
+	switch (c->mode) {
+	case MODE_PASST:
 		prog.len = (unsigned short)ARRAY_SIZE(filter_passt);
 		prog.filter = filter_passt;
+		break;
+	case MODE_PASTA:
+		prog.len = (unsigned short)ARRAY_SIZE(filter_pasta);
+		prog.filter = filter_pasta;
+		break;
+	case MODE_VU:
+		prog.len = (unsigned short)ARRAY_SIZE(filter_vu);
+		prog.filter = filter_vu;
+		break;
 	}
 
 	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
diff --git a/packet.c b/packet.c
index 37489961a37e..36c7e5070831 100644
--- a/packet.c
+++ b/packet.c
@@ -36,6 +36,19 @@
 static int packet_check_range(const struct pool *p, size_t offset, size_t len,
 			      const char *start, const char *func, int line)
 {
+	ASSERT(p->buf);
+
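+	/* A buf_size of zero means the packets are stored in guest
+	 * memory mapped via vhost-user: check the range against the
+	 * mapped memory regions instead of the pool buffer
+	 */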
+	if (p->buf_size == 0) {
+		int ret;
+
+		ret = vu_packet_check_range((void *)p->buf, offset, len, start);
+
+		if (ret == -1)
+			trace("cannot find region, %s:%i", func, line);
+
+		return ret;
+	}
+
 	if (start < p->buf) {
 		trace("packet start %p before buffer start %p, "
 		      "%s:%i", (void *)start, (void *)p->buf, func, line);
diff --git a/packet.h b/packet.h
index 8377dcf678bb..d32688d8a0a4 100644
--- a/packet.h
+++ b/packet.h
@@ -22,6 +22,8 @@ struct pool {
 	struct iovec pkt[1];
 };
 
+int vu_packet_check_range(void *buf, size_t offset, size_t len,
+			  const char *start);
 void packet_add_do(struct pool *p, size_t len, const char *start,
 		   const char *func, int line);
 void *packet_get_do(const struct pool *p, const size_t idx,
diff --git a/passt.c b/passt.c
index 6401730dae65..a931a55ab31b 100644
--- a/passt.c
+++ b/passt.c
@@ -74,6 +74,8 @@ char *epoll_type_str[] = {
 	[EPOLL_TYPE_TAP_PASTA]		= "/dev/net/tun device",
 	[EPOLL_TYPE_TAP_PASST]		= "connected qemu socket",
 	[EPOLL_TYPE_TAP_LISTEN]		= "listening qemu socket",
+	[EPOLL_TYPE_VHOST_CMD]		= "vhost-user command socket",
+	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
 };
 static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
 	      "epoll_type_str[] doesn't match enum epoll_type");
@@ -206,6 +208,7 @@ int main(int argc, char **argv)
 	struct rlimit limit;
 	struct timespec now;
 	struct sigaction sa;
+	struct vu_dev vdev;
 
 	clock_gettime(CLOCK_MONOTONIC, &log_start);
 
@@ -262,6 +265,8 @@ int main(int argc, char **argv)
 	pasta_netns_quit_init(&c);
 
 	tap_sock_init(&c);
+	if (c.mode == MODE_VU)
+		vu_init(&c, &vdev);
 
 	secret_init(&c);
 
@@ -350,14 +355,30 @@ loop:
 			tcp_timer_handler(&c, ref);
 			break;
 		case EPOLL_TYPE_UDP_LISTEN:
-			udp_listen_sock_handler(&c, ref, eventmask, &now);
+			if (c.mode == MODE_VU)
+				udp_vu_listen_sock_handler(&c, ref, eventmask,
+							   &now);
+			else
+				udp_buf_listen_sock_handler(&c, ref, eventmask,
+							    &now);
 			break;
 		case EPOLL_TYPE_UDP_REPLY:
-			udp_reply_sock_handler(&c, ref, eventmask, &now);
+			if (c.mode == MODE_VU)
+				udp_vu_reply_sock_handler(&c, ref, eventmask,
+							  &now);
+			else
+				udp_buf_reply_sock_handler(&c, ref, eventmask,
+							   &now);
 			break;
 		case EPOLL_TYPE_PING:
 			icmp_sock_handler(&c, ref);
 			break;
+		case EPOLL_TYPE_VHOST_CMD:
+			tap_handler_vu(&vdev, c.fd_tap, eventmask);
+			break;
+		case EPOLL_TYPE_VHOST_KICK:
+			vu_kick_cb(&vdev, ref, &now);
+			break;
 		default:
 			/* Can't happen */
 			ASSERT(0);
diff --git a/passt.h b/passt.h
index d0f31a230976..71ad32aa3dd0 100644
--- a/passt.h
+++ b/passt.h
@@ -25,6 +25,8 @@ union epoll_ref;
 #include "fwd.h"
 #include "tcp.h"
 #include "udp.h"
+#include "udp_vu.h"
+#include "vhost_user.h"
 
 /**
  * union epoll_ref - Breakdown of reference for epoll fd bookkeeping
@@ -87,6 +89,7 @@ struct fqdn {
 enum passt_modes {
 	MODE_PASST,
 	MODE_PASTA,
+	MODE_VU,
 };
 
 /**
@@ -193,6 +196,7 @@ struct ip6_ctx {
  * @no_map_gw:		Don't map connections, untracked UDP to gateway to host
  * @low_wmem:		Low probed net.core.wmem_max
  * @low_rmem:		Low probed net.core.rmem_max
+ * @vdev:		vhost-user device
  */
 struct ctx {
 	enum passt_modes mode;
@@ -256,6 +260,8 @@ struct ctx {
 
 	int low_wmem;
 	int low_rmem;
+
+	struct vu_dev *vdev;
 };
 
 void proto_update_l2_buf(const unsigned char *eth_d,
diff --git a/pcap.c b/pcap.c
index 46cc4b0d72b6..7e9c56090041 100644
--- a/pcap.c
+++ b/pcap.c
@@ -140,7 +140,6 @@ void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
  *		containing packet data to write, including L2 header
  * @iovcnt:	Number of buffers (@iov entries)
  */
-/* cppcheck-suppress unusedFunction */
 void pcap_iov(const struct iovec *iov, size_t iovcnt)
 {
 	struct timespec now;
diff --git a/tap.c b/tap.c
index 5852705b897c..a25e4e494287 100644
--- a/tap.c
+++ b/tap.c
@@ -58,6 +58,7 @@
 #include "packet.h"
 #include "tap.h"
 #include "log.h"
+#include "vhost_user.h"
 
 /* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */
 static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf);
@@ -78,16 +79,22 @@ void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
 	struct iovec iov[2];
 	size_t iovcnt = 0;
 
-	if (c->mode == MODE_PASST) {
+	switch (c->mode) {
+	case MODE_PASST:
 		iov[iovcnt] = IOV_OF_LVALUE(vnet_len);
 		iovcnt++;
-	}
-
-	iov[iovcnt].iov_base = (void *)data;
-	iov[iovcnt].iov_len = l2len;
-	iovcnt++;
+		/* fall through */
+	case MODE_PASTA:
+		iov[iovcnt].iov_base = (void *)data;
+		iov[iovcnt].iov_len = l2len;
+		iovcnt++;
 
-	tap_send_frames(c, iov, iovcnt, 1);
+		tap_send_frames(c, iov, iovcnt, 1);
+		break;
+	case MODE_VU:
+		vu_send(c->vdev, data, l2len);
+		break;
+	}
 }
 
 /**
@@ -406,10 +413,18 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
 	if (!nframes)
 		return 0;
 
-	if (c->mode == MODE_PASTA)
+	switch (c->mode) {
+	case MODE_PASTA:
 		m = tap_send_frames_pasta(c, iov, bufs_per_frame, nframes);
-	else
+		break;
+	case MODE_PASST:
 		m = tap_send_frames_passt(c, iov, bufs_per_frame, nframes);
+		break;
+	case MODE_VU:
+		/* fall through */
+	default:
+		ASSERT(0);
+	}
 
 	if (m < nframes)
 		debug("tap: failed to send %zu frames of %zu",
@@ -967,7 +982,7 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
  * tap_sock_reset() - Handle closing or failure of connect AF_UNIX socket
  * @c:		Execution context
  */
-static void tap_sock_reset(struct ctx *c)
+void tap_sock_reset(struct ctx *c)
 {
 	info("Client connection closed%s", c->one_off ? ", exiting" : "");
 
@@ -978,6 +993,8 @@ static void tap_sock_reset(struct ctx *c)
 	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_tap, NULL);
 	close(c->fd_tap);
 	c->fd_tap = -1;
+	if (c->mode == MODE_VU)
+		vu_cleanup(c->vdev);
 }
 
 /**
@@ -1177,11 +1194,17 @@ static void tap_sock_unix_init(struct ctx *c)
 	ev.data.u64 = ref.u64;
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap_listen, &ev);
 
-	info("\nYou can now start qemu (>= 7.2, with commit 13c6be96618c):");
-	info("    kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
-	     c->sock_path);
-	info("or qrap, for earlier qemu versions:");
-	info("    ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
+	if (c->mode == MODE_VU) {
+		info("You can start qemu with:");
+		info("    kvm ... -chardev socket,id=chr0,path=%s -netdev vhost-user,id=netdev0,chardev=chr0 -device virtio-net,netdev=netdev0 -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE -numa node,memdev=memfd0\n",
+		     c->sock_path);
+	} else {
+		info("\nYou can now start qemu (>= 7.2, with commit 13c6be96618c):");
+		info("    kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
+		     c->sock_path);
+		info("or qrap, for earlier qemu versions:");
+		info("    ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
+	}
 }
 
 /**
@@ -1191,8 +1214,8 @@ static void tap_sock_unix_init(struct ctx *c)
  */
 void tap_listen_handler(struct ctx *c, uint32_t events)
 {
-	union epoll_ref ref = { .type = EPOLL_TYPE_TAP_PASST };
 	struct epoll_event ev = { 0 };
+	union epoll_ref ref;
 	int v = INT_MAX / 2;
 	struct ucred ucred;
 	socklen_t len;
@@ -1232,6 +1255,10 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
 		trace("tap: failed to set SO_SNDBUF to %i", v);
 
 	ref.fd = c->fd_tap;
+	if (c->mode == MODE_VU)
+		ref.type = EPOLL_TYPE_VHOST_CMD;
+	else
+		ref.type = EPOLL_TYPE_TAP_PASST;
 	ev.events = EPOLLIN | EPOLLRDHUP;
 	ev.data.u64 = ref.u64;
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
@@ -1293,21 +1320,47 @@ static void tap_sock_tun_init(struct ctx *c)
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 }
 
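+/**
+ * tap_sock_update_buf() - Update the buffer base and size of the packet pools
+ * @base:	New buffer base address
+ * @size:	New buffer size
+ */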
+void tap_sock_update_buf(void *base, size_t size)
+{
+	int i;
+
+	pool_tap4_storage.buf = base;
+	pool_tap4_storage.buf_size = size;
+	pool_tap6_storage.buf = base;
+	pool_tap6_storage.buf_size = size;
+
+	for (i = 0; i < TAP_SEQS; i++) {
+		tap4_l4[i].p.buf = base;
+		tap4_l4[i].p.buf_size = size;
+		tap6_l4[i].p.buf = base;
+		tap6_l4[i].p.buf_size = size;
+	}
+}
+
 /**
  * tap_sock_init() - Create and set up AF_UNIX socket or tuntap file descriptor
  * @c:		Execution context
  */
 void tap_sock_init(struct ctx *c)
 {
-	size_t sz = sizeof(pkt_buf);
+	size_t sz;
+	char *buf;
 	int i;
 
-	pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS, pkt_buf, sz);
-	pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS, pkt_buf, sz);
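+	/* In vhost-user mode, packets are in guest memory: don't use a
+	 * local buffer for the pools, see vu_packet_check_range()
+	 */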
+	if (c->mode == MODE_VU) {
+		buf = NULL;
+		sz = 0;
+	} else {
+		buf = pkt_buf;
+		sz = sizeof(pkt_buf);
+	}
+
+	pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS, buf, sz);
+	pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS, buf, sz);
 
 	for (i = 0; i < TAP_SEQS; i++) {
-		tap4_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, pkt_buf, sz);
-		tap6_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, pkt_buf, sz);
+		tap4_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, buf, sz);
+		tap6_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, buf, sz);
 	}
 
 	if (c->fd_tap != -1) { /* Passed as --fd */
@@ -1316,10 +1369,17 @@ void tap_sock_init(struct ctx *c)
 
 		ASSERT(c->one_off);
 		ref.fd = c->fd_tap;
-		if (c->mode == MODE_PASST)
+		switch (c->mode) {
+		case MODE_PASST:
 			ref.type = EPOLL_TYPE_TAP_PASST;
-		else
+			break;
+		case MODE_PASTA:
 			ref.type = EPOLL_TYPE_TAP_PASTA;
+			break;
+		case MODE_VU:
+			ref.type = EPOLL_TYPE_VHOST_CMD;
+			break;
+		}
 
 		ev.events = EPOLLIN | EPOLLRDHUP;
 		ev.data.u64 = ref.u64;
diff --git a/tap.h b/tap.h
index ec9e2acec460..c5447f7077eb 100644
--- a/tap.h
+++ b/tap.h
@@ -40,7 +40,8 @@ static inline struct iovec tap_hdr_iov(const struct ctx *c,
  */
 static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
 {
-	thdr->vnet_len = htonl(l2len);
+	if (thdr)
+		thdr->vnet_len = htonl(l2len);
 }
 
 void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
@@ -68,6 +69,8 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
 void tap_handler_passt(struct ctx *c, uint32_t events,
 		       const struct timespec *now);
 int tap_sock_unix_open(char *sock_path);
+void tap_sock_reset(struct ctx *c);
+void tap_sock_update_buf(void *base, size_t size);
 void tap_sock_init(struct ctx *c);
 void tap_flush_pools(void);
 void tap_handler(struct ctx *c, const struct timespec *now);
diff --git a/tcp.c b/tcp.c
index c0820ce7a391..1af99b2ae042 100644
--- a/tcp.c
+++ b/tcp.c
@@ -304,6 +304,7 @@
 #include "flow_table.h"
 #include "tcp_internal.h"
 #include "tcp_buf.h"
+#include "tcp_vu.h"
 
 /* MSS rounding: see SET_MSS() */
 #define MSS_DEFAULT			536
@@ -896,6 +897,7 @@ static void tcp_fill_header(struct tcphdr *th,
 
 /**
  * tcp_fill_headers4() - Fill 802.3, IPv4, TCP headers in pre-cooked buffers
+ * @c:		Execution context
  * @conn:	Connection pointer
  * @taph:	tap backend specific header
  * @iph:	Pointer to IPv4 header
@@ -906,7 +908,8 @@ static void tcp_fill_header(struct tcphdr *th,
  *
  * Return: The IPv4 payload length, host order
  */
-static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
+static size_t tcp_fill_headers4(const struct ctx *c,
+				const struct tcp_tap_conn *conn,
 				struct tap_hdr *taph,
 				struct iphdr *iph, struct tcphdr *th,
 				size_t dlen, const uint16_t *check,
@@ -929,7 +932,10 @@ static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
 
 	tcp_fill_header(th, conn, seq);
 
-	tcp_update_check_tcp4(iph, th);
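+	/* In vhost-user mode, the virtio-net header has
+	 * VIRTIO_NET_HDR_F_DATA_VALID set: the guest doesn't need a
+	 * valid TCP checksum
+	 */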
+	if (c->mode != MODE_VU)
+		tcp_update_check_tcp4(iph, th);
+	else
+		th->check = 0;
 
 	tap_hdr_update(taph, l3len + sizeof(struct ethhdr));
 
@@ -938,6 +944,7 @@ static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
 
 /**
  * tcp_fill_headers6() - Fill 802.3, IPv6, TCP headers in pre-cooked buffers
+ * @c:		Execution context
  * @conn:	Connection pointer
  * @taph:	tap backend specific header
  * @ip6h:	Pointer to IPv6 header
@@ -948,7 +955,8 @@ static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
  *
  * Return: The IPv6 payload length, host order
  */
-static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
+static size_t tcp_fill_headers6(const struct ctx *c,
+				const struct tcp_tap_conn *conn,
 				struct tap_hdr *taph,
 				struct ipv6hdr *ip6h, struct tcphdr *th,
 				size_t dlen, uint32_t seq)
@@ -970,7 +978,10 @@ static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
 
 	tcp_fill_header(th, conn, seq);
 
-	tcp_update_check_tcp6(ip6h, th);
+	if (c->mode != MODE_VU)
+		tcp_update_check_tcp6(ip6h, th);
+	else
+		th->check = 0;
 
 	tap_hdr_update(taph, l4len + sizeof(*ip6h) + sizeof(struct ethhdr));
 
@@ -979,6 +990,7 @@ static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
 
 /**
  * tcp_l2_buf_fill_headers() - Fill 802.3, IP, TCP headers in pre-cooked buffers
+ * @c:		Execution context
  * @conn:	Connection pointer
  * @iov:	Pointer to an array of iovec of TCP pre-cooked buffers
  * @dlen:	TCP payload length
@@ -987,7 +999,8 @@ static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
  *
  * Return: IP payload length, host order
  */
-size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
+size_t tcp_l2_buf_fill_headers(const struct ctx *c,
+			       const struct tcp_tap_conn *conn,
 			       struct iovec *iov, size_t dlen,
 			       const uint16_t *check, uint32_t seq)
 {
@@ -995,13 +1008,13 @@ size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
 	const struct in_addr *a4 = inany_v4(&tapside->faddr);
 
 	if (a4) {
-		return tcp_fill_headers4(conn, iov[TCP_IOV_TAP].iov_base,
+		return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base,
 					 iov[TCP_IOV_IP].iov_base,
 					 iov[TCP_IOV_PAYLOAD].iov_base, dlen,
 					 check, seq);
 	}
 
-	return tcp_fill_headers6(conn, iov[TCP_IOV_TAP].iov_base,
+	return tcp_fill_headers6(c, conn, iov[TCP_IOV_TAP].iov_base,
 				 iov[TCP_IOV_IP].iov_base,
 				 iov[TCP_IOV_PAYLOAD].iov_base, dlen,
 				 seq);
@@ -1237,6 +1250,9 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn,
  */
 int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
 {
+	if (c->mode == MODE_VU)
+		return tcp_vu_send_flag(c, conn, flags);
+
 	return tcp_buf_send_flag(c, conn, flags);
 }
 
@@ -1630,6 +1646,9 @@ static int tcp_sock_consume(const struct tcp_tap_conn *conn, uint32_t ack_seq)
  */
 static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
 {
+	if (c->mode == MODE_VU)
+		return tcp_vu_data_from_sock(c, conn);
+
 	return tcp_buf_data_from_sock(c, conn);
 }
 
diff --git a/tcp_buf.c b/tcp_buf.c
index c31e9f31b438..6b702b00be89 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -321,7 +321,7 @@ int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
 		return ret;
 	}
 
-	l4len = tcp_l2_buf_fill_headers(conn, iov, optlen, NULL, seq);
+	l4len = tcp_l2_buf_fill_headers(c, conn, iov, optlen, NULL, seq);
 	iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 
 	if (flags & DUP_ACK) {
@@ -378,7 +378,7 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
 		tcp4_frame_conns[tcp4_payload_used] = conn;
 
 		iov = tcp4_l2_iov[tcp4_payload_used++];
-		l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, check, seq);
+		l4len = tcp_l2_buf_fill_headers(c, conn, iov, dlen, check, seq);
 		iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 		if (tcp4_payload_used > TCP_FRAMES_MEM - 1)
 			tcp_payload_flush(c);
@@ -386,7 +386,7 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
 		tcp6_frame_conns[tcp6_payload_used] = conn;
 
 		iov = tcp6_l2_iov[tcp6_payload_used++];
-		l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, NULL, seq);
+		l4len = tcp_l2_buf_fill_headers(c, conn, iov, dlen, NULL, seq);
 		iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 		if (tcp6_payload_used > TCP_FRAMES_MEM - 1)
 			tcp_payload_flush(c);
diff --git a/tcp_internal.h b/tcp_internal.h
index 8b60aabc1b33..3dd4b49a4441 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -89,7 +89,8 @@ void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
 		tcp_rst_do(c, conn);					\
 	} while (0)
 
-size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
+size_t tcp_l2_buf_fill_headers(const struct ctx *c,
+			       const struct tcp_tap_conn *conn,
 			       struct iovec *iov, size_t dlen,
 			       const uint16_t *check, uint32_t seq);
 int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
diff --git a/tcp_vu.c b/tcp_vu.c
new file mode 100644
index 000000000000..6eef9187dbd7
--- /dev/null
+++ b/tcp_vu.c
@@ -0,0 +1,593 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ *
+ * tcp_vu.c - TCP L2 vhost-user management functions
+ */
+
+#include <errno.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#include <netinet/ip.h>
+
+#include <sys/socket.h>
+
+#include <linux/tcp.h>
+#include <linux/virtio_net.h>
+
+#include "util.h"
+#include "ip.h"
+#include "passt.h"
+#include "siphash.h"
+#include "inany.h"
+#include "vhost_user.h"
+#include "tcp.h"
+#include "pcap.h"
+#include "flow.h"
+#include "tcp_conn.h"
+#include "flow_table.h"
+#include "tcp_vu.h"
+#include "tcp_internal.h"
+#include "checksum.h"
+#include "vu_common.h"
+
+/**
+ * struct tcp_payload_t - TCP header and data to send segments with payload
+ * @th:		TCP header
+ * @data:	TCP data
+ */
+struct tcp_payload_t {
+	struct tcphdr th;
+	uint8_t data[IP_MAX_MTU - sizeof(struct tcphdr)];
+};
+
+/**
+ * struct tcp_flags_t - TCP header and data to send zero-length
+ *                      segments (flags)
+ * @th:		TCP header
+ * @opts:	TCP options
+ */
+struct tcp_flags_t {
+	struct tcphdr th;
+	char opts[OPT_MSS_LEN + OPT_WS_LEN + 1];
+};
+
+/* virtio-net header template for frames sent to the guest */
+static const struct virtio_net_hdr vu_header = {
+	.flags = VIRTIO_NET_HDR_F_DATA_VALID,
+	.gso_type = VIRTIO_NET_HDR_GSO_NONE,
+};
+
+static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE];
+static struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
+
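+/**
+ * tcp_vu_l2_hdrlen() - Size of the L2 headers of a TCP frame
+ * @vdev:	vhost-user device
+ * @v6:		Set for IPv6 frames
+ *
+ * Return: size of the virtio-net, Ethernet, IP and TCP headers
+ */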
+static size_t tcp_vu_l2_hdrlen(const struct vu_dev *vdev, bool v6)
+{
+	size_t l2_hdrlen;
+
+	l2_hdrlen = vdev->hdrlen + sizeof(struct ethhdr) +
+		    sizeof(struct tcphdr);
+
+	if (v6)
+		l2_hdrlen += sizeof(struct ipv6hdr);
+	else
+		l2_hdrlen += sizeof(struct iphdr);
+
+	return l2_hdrlen;
+}
+
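+/**
+ * tcp_vu_pcap() - Capture a single frame to pcap file
+ * @c:		Execution context
+ * @tapside:	Address information for one side of the flow
+ * @iov:	Pointer to the array of IO vectors
+ * @iov_used:	Length of the array
+ * @l4len:	IP payload length (TCP header and data)
+ *
+ * The TCP checksum is not computed on the vhost-user path, so fill it
+ * in here before logging the frame.
+ */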
+static void tcp_vu_pcap(const struct ctx *c, const struct flowside *tapside,
+			struct iovec *iov, int iov_used, size_t l4len)
+{
+	const struct in_addr *src = inany_v4(&tapside->faddr);
+	const struct in_addr *dst = inany_v4(&tapside->eaddr);
+	const struct vu_dev *vdev = c->vdev;
+	char *base = iov[0].iov_base;
+	size_t size = iov[0].iov_len;
+	struct tcp_payload_t *bp;
+	uint32_t sum;
+
+	if (!*c->pcap)
+		return;
+
+	if (src && dst) {
+		bp = vu_payloadv4(vdev, base);
+		sum = proto_ipv4_header_psum(l4len, IPPROTO_TCP,
+					     *src, *dst);
+	} else {
+		bp = vu_payloadv6(vdev, base);
+		sum = proto_ipv6_header_psum(l4len, IPPROTO_TCP,
+					     &tapside->faddr.a6,
+					     &tapside->eaddr.a6);
+	}
+	iov[0].iov_base = &bp->th;
+	iov[0].iov_len = size - ((char *)iov[0].iov_base - base);
+	bp->th.check = 0;
+	bp->th.check = csum_iov(iov, iov_used, sum);
+
+	/* set iov for pcap logging */
+	iov[0].iov_base = base + vdev->hdrlen;
+	iov[0].iov_len = size - vdev->hdrlen;
+
+	pcap_iov(iov, iov_used);
+
+	/* restore iov[0] */
+	iov[0].iov_base = base;
+	iov[0].iov_len = size;
+}
+
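+/**
+ * tcp_vu_send_flag() - Send segment with flags to vhost-user (no payload)
+ * @c:		Execution context
+ * @conn:	Connection pointer
+ * @flags:	TCP flags: if not set, send segment only if ACK is due
+ *
+ * Return: negative error code on connection reset, 0 otherwise
+ */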
+int tcp_vu_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
+{
+	struct vu_dev *vdev = c->vdev;
+	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	const struct flowside *tapside = TAPFLOW(conn);
+	struct virtio_net_hdr_mrg_rxbuf *vh;
+	struct iovec l2_iov[TCP_NUM_IOVS];
+	size_t l2len, l4len, optlen;
+	struct iovec in_sg;
+	struct ethhdr *eh;
+	int nb_ack;
+	int ret;
+
+	elem[0].out_num = 0;
+	elem[0].out_sg = NULL;
+	elem[0].in_num = 1;
+	elem[0].in_sg = &in_sg;
+	ret = vu_queue_pop(vdev, vq, &elem[0]);
+	if (ret < 0)
+		return 0;
+
+	if (elem[0].in_num < 1) {
+		err("virtio-net receive queue contains no in buffers");
+		vu_queue_rewind(vq, 1);
+		return 0;
+	}
+
+	vh = elem[0].in_sg[0].iov_base;
+
+	vh->hdr = vu_header;
+	if (vdev->hdrlen == sizeof(struct virtio_net_hdr_mrg_rxbuf))
+		vh->num_buffers = htole16(1);
+
+	l2_iov[TCP_IOV_TAP].iov_base = NULL;
+	l2_iov[TCP_IOV_TAP].iov_len = 0;
+	l2_iov[TCP_IOV_ETH].iov_base = (char *)elem[0].in_sg[0].iov_base + vdev->hdrlen;
+	l2_iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
+
+	eh = l2_iov[TCP_IOV_ETH].iov_base;
+
+	memcpy(eh->h_dest, c->mac_guest, sizeof(eh->h_dest));
+	memcpy(eh->h_source, c->mac, sizeof(eh->h_source));
+
+	if (CONN_V4(conn)) {
+		struct tcp_flags_t *payload;
+		struct iphdr *iph;
+		uint32_t seq;
+
+		l2_iov[TCP_IOV_IP].iov_base = (char *)l2_iov[TCP_IOV_ETH].iov_base +
+						      l2_iov[TCP_IOV_ETH].iov_len;
+		l2_iov[TCP_IOV_IP].iov_len = sizeof(struct iphdr);
+		l2_iov[TCP_IOV_PAYLOAD].iov_base = (char *)l2_iov[TCP_IOV_IP].iov_base +
+							   l2_iov[TCP_IOV_IP].iov_len;
+
+		eh->h_proto = htons(ETH_P_IP);
+
+		iph = l2_iov[TCP_IOV_IP].iov_base;
+		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
+
+		payload = l2_iov[TCP_IOV_PAYLOAD].iov_base;
+		payload->th = (struct tcphdr){
+			.doff = offsetof(struct tcp_flags_t, opts) / 4,
+			.ack = 1
+		};
+
+		seq = conn->seq_to_tap;
+		ret = tcp_prepare_flags(c, conn, flags, &payload->th, payload->opts, &optlen);
+		if (ret <= 0) {
+			vu_queue_rewind(vq, 1);
+			return ret;
+		}
+
+		l4len = tcp_l2_buf_fill_headers(c, conn, l2_iov, optlen, NULL,
+						seq);
+		/* keep the following assignment for clarity */
+		/* cppcheck-suppress unreadVariable */
+		l2_iov[TCP_IOV_PAYLOAD].iov_len = l4len;
+
+		l2len = l4len + sizeof(*iph) + sizeof(struct ethhdr);
+	} else {
+		struct tcp_flags_t *payload;
+		struct ipv6hdr *ip6h;
+		uint32_t seq;
+
+		l2_iov[TCP_IOV_IP].iov_base = (char *)l2_iov[TCP_IOV_ETH].iov_base +
+						      l2_iov[TCP_IOV_ETH].iov_len;
+		l2_iov[TCP_IOV_IP].iov_len = sizeof(struct ipv6hdr);
+		l2_iov[TCP_IOV_PAYLOAD].iov_base = (char *)l2_iov[TCP_IOV_IP].iov_base +
+							   l2_iov[TCP_IOV_IP].iov_len;
+
+		eh->h_proto = htons(ETH_P_IPV6);
+
+		ip6h = l2_iov[TCP_IOV_IP].iov_base;
+		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
+
+		payload = l2_iov[TCP_IOV_PAYLOAD].iov_base;
+		payload->th = (struct tcphdr){
+			.doff = offsetof(struct tcp_flags_t, opts) / 4,
+			.ack = 1
+		};
+
+		seq = conn->seq_to_tap;
+		ret = tcp_prepare_flags(c, conn, flags, &payload->th, payload->opts, &optlen);
+		if (ret <= 0) {
+			vu_queue_rewind(vq, 1);
+			return ret;
+		}
+
+		l4len = tcp_l2_buf_fill_headers(c, conn, l2_iov, optlen, NULL,
+						seq);
+		/* keep the following assignment for clarity */
+		/* cppcheck-suppress unreadVariable */
+		l2_iov[TCP_IOV_PAYLOAD].iov_len = l4len;
+
+		l2len = l4len + sizeof(*ip6h) + sizeof(struct ethhdr);
+	}
+	l2len += vdev->hdrlen;
+	ASSERT(l2len <= elem[0].in_sg[0].iov_len);
+
+	elem[0].in_sg[0].iov_len = l2len;
+	tcp_vu_pcap(c, tapside, &elem[0].in_sg[0], 1, l4len);
+
+	vu_queue_fill(vq, &elem[0], l2len, 0);
+	nb_ack = 1;
+
+	if (flags & DUP_ACK) {
+		struct iovec in_sg_dup;
+
+		elem[1].out_num = 0;
+		elem[1].out_sg = NULL;
+		elem[1].in_num = 1;
+		elem[1].in_sg = &in_sg_dup;
+		ret = vu_queue_pop(vdev, vq, &elem[1]);
+		if (ret == 0) {
+			if (elem[1].in_num < 1 || elem[1].in_sg[0].iov_len < l2len) {
+				vu_queue_rewind(vq, 1);
+			} else {
+				memcpy(elem[1].in_sg[0].iov_base, vh, l2len);
+				nb_ack++;
+
+				tcp_vu_pcap(c, tapside, &elem[1].in_sg[0], 1,
+					    l4len);
+
+				vu_queue_fill(vq, &elem[1], l2len, 1);
+			}
+		}
+	}
+
+	vu_queue_flush(vq, nb_ack);
+	vu_queue_notify(vdev, vq);
+
+	return 0;
+}
+
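+/**
+ * tcp_vu_sock_recv() - Receive data from connection socket into vhost-user
+ *			buffers
+ * @c:		Execution context
+ * @conn:	Connection pointer
+ * @v4:		Set for IPv4 connections
+ * @fillsize:	Number of bytes we can receive
+ * @mss:	Maximum segment size for the connection
+ * @data_len:	Size of the read data (output)
+ *
+ * Return: number of iov entries used to store the data, negative error
+ *	   code on socket error, 0 if there is no data to send
+ */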
+static ssize_t tcp_vu_sock_recv(struct ctx *c,
+				struct tcp_tap_conn *conn, bool v4,
+				size_t fillsize, uint16_t mss,
+				ssize_t *data_len)
+{
+	struct vu_dev *vdev = c->vdev;
+	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	static struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
+	struct msghdr mh_sock = { 0 };
+	static int in_sg_count;
+	int s = conn->sock;
+	size_t l2_hdrlen;
+	int segment_size;
+	int iov_cnt;
+	ssize_t ret;
+
+	l2_hdrlen = tcp_vu_l2_hdrlen(vdev, !v4);
+
+	iov_cnt = 0;
+	in_sg_count = 0;
+	segment_size = 0;
+	*data_len = 0;
+	while (fillsize > 0 && iov_cnt < VIRTQUEUE_MAX_SIZE - 1 &&
+			       in_sg_count < ARRAY_SIZE(in_sg)) {
+
+		elem[iov_cnt].out_num = 0;
+		elem[iov_cnt].out_sg = NULL;
+		elem[iov_cnt].in_num = ARRAY_SIZE(in_sg) - in_sg_count;
+		elem[iov_cnt].in_sg = &in_sg[in_sg_count];
+		ret = vu_queue_pop(vdev, vq, &elem[iov_cnt]);
+		if (ret < 0)
+			break;
+
+		if (elem[iov_cnt].in_num < 1)
+			die("virtio-net receive queue contains no in buffers");
+
+		in_sg_count += elem[iov_cnt].in_num;
+
+		ASSERT(elem[iov_cnt].in_num == 1);
+		ASSERT(elem[iov_cnt].in_sg[0].iov_len >= l2_hdrlen);
+
+		if (segment_size == 0) {
+			iov_vu[iov_cnt + 1].iov_base =
+					(char *)elem[iov_cnt].in_sg[0].iov_base + l2_hdrlen;
+			iov_vu[iov_cnt + 1].iov_len =
+					elem[iov_cnt].in_sg[0].iov_len - l2_hdrlen;
+		} else {
+			iov_vu[iov_cnt + 1].iov_base = elem[iov_cnt].in_sg[0].iov_base;
+			iov_vu[iov_cnt + 1].iov_len = elem[iov_cnt].in_sg[0].iov_len;
+		}
+
+		if (iov_vu[iov_cnt + 1].iov_len > fillsize)
+			iov_vu[iov_cnt + 1].iov_len = fillsize;
+
+		segment_size += iov_vu[iov_cnt + 1].iov_len;
+		if (vdev->hdrlen != sizeof(struct virtio_net_hdr_mrg_rxbuf)) {
+			segment_size = 0;
+		} else if (segment_size >= mss) {
+			iov_vu[iov_cnt + 1].iov_len -= segment_size - mss;
+			segment_size = 0;
+		}
+		fillsize -= iov_vu[iov_cnt + 1].iov_len;
+
+		iov_cnt++;
+	}
+	if (iov_cnt == 0)
+		return 0;
+
+	mh_sock.msg_iov = iov_vu;
+	mh_sock.msg_iovlen = iov_cnt + 1;
+
+	do
+		ret = recvmsg(s, &mh_sock, MSG_PEEK);
+	while (ret < 0 && errno == EINTR);
+
+	if (ret < 0) {
+		vu_queue_rewind(vq, iov_cnt);
+		if (errno != EAGAIN && errno != EWOULDBLOCK) {
+			ret = -errno;
+			tcp_rst(c, conn);
+		}
+		return ret;
+	}
+	if (!ret) {
+		vu_queue_rewind(vq, iov_cnt);
+
+		if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
+			int retf = tcp_vu_send_flag(c, conn, FIN | ACK);
+			if (retf) {
+				tcp_rst(c, conn);
+				return retf;
+			}
+
+			conn_event(c, conn, TAP_FIN_SENT);
+		}
+		return 0;
+	}
+
+	*data_len = ret;
+	return iov_cnt;
+}
+
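+/**
+ * tcp_vu_prepare() - Fill the L2 headers of one TCP frame
+ * @c:		Execution context
+ * @conn:	Connection pointer
+ * @first:	Pointer to the array of IO vectors
+ * @data_len:	Packet data length
+ * @check:	Checksum, if already known
+ *
+ * Return: L4 length of the frame (TCP header and data)
+ */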
+static size_t tcp_vu_prepare(const struct ctx *c,
+			     struct tcp_tap_conn *conn, struct iovec *first,
+			     size_t data_len, const uint16_t **check)
+{
+	const struct flowside *toside = TAPFLOW(conn);
+	const struct vu_dev *vdev = c->vdev;
+	struct iovec l2_iov[TCP_NUM_IOVS];
+	char *base = first->iov_base;
+	struct ethhdr *eh;
+	size_t l4len;
+
+	/* We assume the first iovec provided by the guest can embed
+	 * all the headers needed by the L2 frame
+	 */
+
+	l2_iov[TCP_IOV_TAP].iov_base = NULL;
+	l2_iov[TCP_IOV_TAP].iov_len = 0;
+	l2_iov[TCP_IOV_ETH].iov_base = base + vdev->hdrlen;
+	l2_iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
+
+	eh = l2_iov[TCP_IOV_ETH].iov_base;
+
+	memcpy(eh->h_dest, c->mac_guest, sizeof(eh->h_dest));
+	memcpy(eh->h_source, c->mac, sizeof(eh->h_source));
+
+	/* initialize header */
+	if (inany_v4(&toside->eaddr) && inany_v4(&toside->faddr)) {
+		struct tcp_payload_t *payload;
+		struct iphdr *iph;
+
+		ASSERT(first[0].iov_len >= vdev->hdrlen +
+		       sizeof(struct ethhdr) + sizeof(struct iphdr) +
+		       sizeof(struct tcphdr));
+
+		l2_iov[TCP_IOV_IP].iov_base = (char *)l2_iov[TCP_IOV_ETH].iov_base +
+						      l2_iov[TCP_IOV_ETH].iov_len;
+		l2_iov[TCP_IOV_IP].iov_len = sizeof(struct iphdr);
+		l2_iov[TCP_IOV_PAYLOAD].iov_base = (char *)l2_iov[TCP_IOV_IP].iov_base +
+							   l2_iov[TCP_IOV_IP].iov_len;
+
+		eh->h_proto = htons(ETH_P_IP);
+
+		iph = l2_iov[TCP_IOV_IP].iov_base;
+		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
+		payload = l2_iov[TCP_IOV_PAYLOAD].iov_base;
+		payload->th = (struct tcphdr){
+			.doff = offsetof(struct tcp_payload_t, data) / 4,
+			.ack = 1
+		};
+
+		l4len = tcp_l2_buf_fill_headers(c, conn, l2_iov,
+						data_len, *check,
+						conn->seq_to_tap);
+		/* keep the following assignment for clarity */
+		/* cppcheck-suppress unreadVariable */
+		l2_iov[TCP_IOV_PAYLOAD].iov_len = l4len;
+
+		*check = &iph->check;
+	} else {
+		struct tcp_payload_t *payload;
+		struct ipv6hdr *ip6h;
+
+		ASSERT(first[0].iov_len >= vdev->hdrlen +
+		       sizeof(struct ethhdr) + sizeof(struct ipv6hdr) +
+		       sizeof(struct tcphdr));
+
+		l2_iov[TCP_IOV_IP].iov_base = (char *)l2_iov[TCP_IOV_ETH].iov_base +
+						      l2_iov[TCP_IOV_ETH].iov_len;
+		l2_iov[TCP_IOV_IP].iov_len = sizeof(struct ipv6hdr);
+		l2_iov[TCP_IOV_PAYLOAD].iov_base = (char *)l2_iov[TCP_IOV_IP].iov_base +
+							   l2_iov[TCP_IOV_IP].iov_len;
+
+		eh->h_proto = htons(ETH_P_IPV6);
+
+		ip6h = l2_iov[TCP_IOV_IP].iov_base;
+		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
+
+		payload = l2_iov[TCP_IOV_PAYLOAD].iov_base;
+		payload->th = (struct tcphdr){
+			.doff = offsetof(struct tcp_payload_t, data) / 4,
+			.ack = 1
+		};
+		l4len = tcp_l2_buf_fill_headers(c, conn, l2_iov,
+						data_len,
+						NULL, conn->seq_to_tap);
+		/* keep the following assignment for clarity */
+		/* cppcheck-suppress unreadVariable */
+		l2_iov[TCP_IOV_PAYLOAD].iov_len = l4len;
+	}
+
+	return l4len;
+}
+
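+/**
+ * tcp_vu_data_from_sock() - Handle new data from socket, queue to vhost-user,
+ *			     in window
+ * @c:		Execution context
+ * @conn:	Connection pointer
+ *
+ * Return: negative on connection reset, 0 otherwise
+ */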
+int tcp_vu_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
+{
+	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
+	struct vu_dev *vdev = c->vdev;
+	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	const struct flowside *tapside = TAPFLOW(conn);
+	uint16_t mss = MSS_GET(conn);
+	size_t l2_hdrlen, fillsize;
+	int i, iov_cnt, iov_used;
+	int v4 = CONN_V4(conn);
+	uint32_t already_sent;
+	const uint16_t *check;
+	struct iovec *first;
+	int segment_size;
+	int num_buffers;
+	ssize_t len;
+
+	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
+		flow_err(conn,
+			 "Got packet, but no available descriptors on RX virtq.");
+		return 0;
+	}
+
+	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
+
+	if (SEQ_LT(already_sent, 0)) {
+		/* RFC 761, section 2.1. */
+		flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
+			   conn->seq_ack_from_tap, conn->seq_to_tap);
+		conn->seq_to_tap = conn->seq_ack_from_tap;
+		already_sent = 0;
+	}
+
+	if (!wnd_scaled || already_sent >= wnd_scaled) {
+		conn_flag(c, conn, STALLED);
+		conn_flag(c, conn, ACK_FROM_TAP_DUE);
+		return 0;
+	}
+
+	/* Set up buffer descriptors we'll fill completely and partially. */
+
+	fillsize = wnd_scaled;
+
+	iov_vu[0].iov_base = tcp_buf_discard;
+	iov_vu[0].iov_len = already_sent;
+	fillsize -= already_sent;
+
+	iov_cnt = tcp_vu_sock_recv(c, conn, v4, fillsize, mss, &len);
+	if (iov_cnt <= 0)
+		return iov_cnt;
+
+	len -= already_sent;
+	if (len <= 0) {
+		conn_flag(c, conn, STALLED);
+		vu_queue_rewind(vq, iov_cnt);
+		return 0;
+	}
+
+	conn_flag(c, conn, ~STALLED);
+
+	/* Likely, some new data was acked too. */
+	tcp_update_seqack_wnd(c, conn, 0, NULL);
+
+	/* initialize headers */
+	l2_hdrlen = tcp_vu_l2_hdrlen(vdev, !v4);
+	iov_used = 0;
+	num_buffers = 0;
+	check = NULL;
+	segment_size = 0;
+	for (i = 0; i < iov_cnt && len; i++) {
+
+		if (segment_size == 0)
+			first = &iov_vu[i + 1];
+
+		if (iov_vu[i + 1].iov_len > (size_t)len)
+			iov_vu[i + 1].iov_len = len;
+
+		len -= iov_vu[i + 1].iov_len;
+		iov_used++;
+
+		segment_size += iov_vu[i + 1].iov_len;
+		num_buffers++;
+
+		if (segment_size >= mss || len == 0 ||
+		    i + 1 == iov_cnt || vdev->hdrlen != sizeof(struct virtio_net_hdr_mrg_rxbuf)) {
+			struct virtio_net_hdr_mrg_rxbuf *vh;
+			size_t l4len;
+
+			if (i + 1 == iov_cnt)
+				check = NULL;
+
+			/* restore first iovec base: point to vnet header */
+			first->iov_base = (char *)first->iov_base - l2_hdrlen;
+			first->iov_len = first->iov_len + l2_hdrlen;
+
+			vh = first->iov_base;
+
+			vh->hdr = vu_header;
+			if (vdev->hdrlen == sizeof(struct virtio_net_hdr_mrg_rxbuf))
+				vh->num_buffers = htole16(num_buffers);
+
+			l4len = tcp_vu_prepare(c, conn, first, segment_size, &check);
+
+			tcp_vu_pcap(c, tapside, first, num_buffers, l4len);
+
+			conn->seq_to_tap += segment_size;
+
+			segment_size = 0;
+			num_buffers = 0;
+		}
+	}
+
+	/* release unused buffers */
+	vu_queue_rewind(vq, iov_cnt - iov_used);
+
+	/* send packets */
+	vu_send_frame(vdev, vq, elem, &iov_vu[1], iov_used);
+
+	conn_flag(c, conn, ACK_FROM_TAP_DUE);
+
+	return 0;
+}
diff --git a/tcp_vu.h b/tcp_vu.h
new file mode 100644
index 000000000000..99daba5b34ed
--- /dev/null
+++ b/tcp_vu.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ */
+
+#ifndef TCP_VU_H
+#define TCP_VU_H
+
+int tcp_vu_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags);
+int tcp_vu_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn);
+
+#endif /* TCP_VU_H */
diff --git a/udp.c b/udp.c
index 7731257292e1..4d2afc62478a 100644
--- a/udp.c
+++ b/udp.c
@@ -109,8 +109,7 @@
 #include "pcap.h"
 #include "log.h"
 #include "flow_table.h"
-
-#define UDP_MAX_FRAMES		32  /* max # of frames to receive at once */
+#include "udp_internal.h"
 
 /* "Spliced" sockets indexed by bound port (host order) */
 static int udp_splice_ns  [IP_VERSIONS][NUM_PORTS];
@@ -118,20 +117,8 @@ static int udp_splice_init[IP_VERSIONS][NUM_PORTS];
 
 /* Static buffers */
 
-/**
- * struct udp_payload_t - UDP header and data for inbound messages
- * @uh:		UDP header
- * @data:	UDP data
- */
-static struct udp_payload_t {
-	struct udphdr uh;
-	char data[USHRT_MAX - sizeof(struct udphdr)];
-#ifdef __AVX2__
-} __attribute__ ((packed, aligned(32)))
-#else
-} __attribute__ ((packed, aligned(__alignof__(unsigned int))))
-#endif
-udp_payload[UDP_MAX_FRAMES];
+/* UDP header and data for inbound messages */
+static struct udp_payload_t udp_payload[UDP_MAX_FRAMES];
 
 /* Ethernet header for IPv4 frames */
 static struct ethhdr udp4_eth_hdr;
@@ -311,6 +298,7 @@ static void udp_splice_send(const struct ctx *c, size_t start, size_t n,
 
 /**
  * udp_update_hdr4() - Update headers for one IPv4 datagram
+ * @c:		Execution context
  * @ip4h:	Pre-filled IPv4 header (except for tot_len and saddr)
  * @bp:		Pointer to udp_payload_t to update
  * @toside:	Flowside for destination side
@@ -318,8 +306,9 @@ static void udp_splice_send(const struct ctx *c, size_t start, size_t n,
  *
  * Return: size of IPv4 payload (UDP header + data)
  */
-static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
-			      const struct flowside *toside, size_t dlen)
+size_t udp_update_hdr4(const struct ctx *c,
+		       struct iphdr *ip4h, struct udp_payload_t *bp,
+		       const struct flowside *toside, size_t dlen)
 {
 	const struct in_addr *src = inany_v4(&toside->faddr);
 	const struct in_addr *dst = inany_v4(&toside->eaddr);
@@ -336,13 +325,17 @@ static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
 	bp->uh.source = htons(toside->fport);
 	bp->uh.dest = htons(toside->eport);
 	bp->uh.len = htons(l4len);
-	csum_udp4(&bp->uh, *src, *dst, bp->data, dlen);
+	if (c->mode != MODE_VU)
+		csum_udp4(&bp->uh, *src, *dst, bp->data, dlen);
+	else
+		bp->uh.check = 0;
 
 	return l4len;
 }
 
 /**
  * udp_update_hdr6() - Update headers for one IPv6 datagram
+ * @c:		Execution context
  * @ip6h:	Pre-filled IPv6 header (except for payload_len and addresses)
  * @bp:		Pointer to udp_payload_t to update
  * @toside:	Flowside for destination side
@@ -350,8 +343,9 @@ static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
  *
  * Return: size of IPv6 payload (UDP header + data)
  */
-static size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
-			      const struct flowside *toside, size_t dlen)
+size_t udp_update_hdr6(const struct ctx *c,
+		       struct ipv6hdr *ip6h, struct udp_payload_t *bp,
+		       const struct flowside *toside, size_t dlen)
 {
 	uint16_t l4len = dlen + sizeof(bp->uh);
 
@@ -365,19 +359,24 @@ static size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
 	bp->uh.source = htons(toside->fport);
 	bp->uh.dest = htons(toside->eport);
 	bp->uh.len = ip6h->payload_len;
-	csum_udp6(&bp->uh, &toside->faddr.a6, &toside->eaddr.a6, bp->data, dlen);
+	if (c->mode != MODE_VU)
+		csum_udp6(&bp->uh, &toside->faddr.a6, &toside->eaddr.a6,
+			  bp->data, dlen);
+	else
+		bp->uh.check = 0xffff; /* zero checksum is invalid with IPv6 */
 
 	return l4len;
 }
 
 /**
  * udp_tap_prepare() - Convert one datagram into a tap frame
+ * @c:		Execution context
  * @mmh:	Receiving mmsghdr array
  * @idx:	Index of the datagram to prepare
  * @toside:	Flowside for destination side
  */
-static void udp_tap_prepare(const struct mmsghdr *mmh, unsigned idx,
-			    const struct flowside *toside)
+static void udp_tap_prepare(const struct ctx *c, const struct mmsghdr *mmh,
+			    unsigned idx, const struct flowside *toside)
 {
 	struct iovec (*tap_iov)[UDP_NUM_IOVS] = &udp_l2_iov[idx];
 	struct udp_payload_t *bp = &udp_payload[idx];
@@ -385,13 +384,15 @@ static void udp_tap_prepare(const struct mmsghdr *mmh, unsigned idx,
 	size_t l4len;
 
 	if (!inany_v4(&toside->eaddr) || !inany_v4(&toside->faddr)) {
-		l4len = udp_update_hdr6(&bm->ip6h, bp, toside, mmh[idx].msg_len);
+		l4len = udp_update_hdr6(c, &bm->ip6h, bp, toside,
+					mmh[idx].msg_len);
 		tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip6h) +
 			       sizeof(udp6_eth_hdr));
 		(*tap_iov)[UDP_IOV_ETH] = IOV_OF_LVALUE(udp6_eth_hdr);
 		(*tap_iov)[UDP_IOV_IP] = IOV_OF_LVALUE(bm->ip6h);
 	} else {
-		l4len = udp_update_hdr4(&bm->ip4h, bp, toside, mmh[idx].msg_len);
+		l4len = udp_update_hdr4(c, &bm->ip4h, bp, toside,
+					mmh[idx].msg_len);
 		tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip4h) +
 			       sizeof(udp4_eth_hdr));
 		(*tap_iov)[UDP_IOV_ETH] = IOV_OF_LVALUE(udp4_eth_hdr);
@@ -408,7 +409,7 @@ static void udp_tap_prepare(const struct mmsghdr *mmh, unsigned idx,
  *
  * #syscalls recvmsg
  */
-static bool udp_sock_recverr(int s)
+bool udp_sock_recverr(int s)
 {
 	const struct sock_extended_err *ee;
 	const struct cmsghdr *hdr;
@@ -495,7 +496,7 @@ static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
 }
 
 /**
- * udp_listen_sock_handler() - Handle new data from socket
+ * udp_buf_listen_sock_handler() - Handle new data from socket
  * @c:		Execution context
  * @ref:	epoll reference
  * @events:	epoll events bitmap
@@ -503,8 +504,8 @@ static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
  *
  * #syscalls recvmmsg
  */
-void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
-			     uint32_t events, const struct timespec *now)
+void udp_buf_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
+				 uint32_t events, const struct timespec *now)
 {
 	struct mmsghdr *mmh_recv = ref.udp.v6 ? udp6_mh_recv : udp4_mh_recv;
 	int n, i;
@@ -527,7 +528,7 @@ void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
 			if (pif_is_socket(batchpif)) {
 				udp_splice_prepare(mmh_recv, i);
 			} else if (batchpif == PIF_TAP) {
-				udp_tap_prepare(mmh_recv, i,
+				udp_tap_prepare(c, mmh_recv, i,
 						flowside_at_sidx(batchsidx));
 			}
 
@@ -561,7 +562,7 @@ void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
 }
 
 /**
- * udp_reply_sock_handler() - Handle new data from flow specific socket
+ * udp_buf_reply_sock_handler() - Handle new data from flow specific socket
  * @c:		Execution context
  * @ref:	epoll reference
  * @events:	epoll events bitmap
@@ -569,8 +570,8 @@ void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
  *
  * #syscalls recvmmsg
  */
-void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
-			    uint32_t events, const struct timespec *now)
+void udp_buf_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
+				uint32_t events, const struct timespec *now)
 {
 	const struct flowside *fromside = flowside_at_sidx(ref.flowside);
 	flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
@@ -594,7 +595,7 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
 		if (pif_is_socket(topif))
 			udp_splice_prepare(mmh_recv, i);
 		else if (topif == PIF_TAP)
-			udp_tap_prepare(mmh_recv, i, toside);
+			udp_tap_prepare(c, mmh_recv, i, toside);
 	}
 
 	if (pif_is_socket(topif)) {
diff --git a/udp.h b/udp.h
index fb42e1c50d70..77b29260e8d1 100644
--- a/udp.h
+++ b/udp.h
@@ -9,10 +9,10 @@
 #define UDP_TIMER_INTERVAL		1000 /* ms */
 
 void udp_portmap_clear(void);
-void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
-			     uint32_t events, const struct timespec *now);
-void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
-			    uint32_t events, const struct timespec *now);
+void udp_buf_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
+				 uint32_t events, const struct timespec *now);
+void udp_buf_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
+				uint32_t events, const struct timespec *now);
 int udp_tap_handler(const struct ctx *c, uint8_t pif,
 		    sa_family_t af, const void *saddr, const void *daddr,
 		    const struct pool *p, int idx, const struct timespec *now);
diff --git a/udp_internal.h b/udp_internal.h
new file mode 100644
index 000000000000..7dd45753698f
--- /dev/null
+++ b/udp_internal.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright (c) 2021 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#ifndef UDP_INTERNAL_H
+#define UDP_INTERNAL_H
+
+#include "tap.h" /* needed by udp_meta_t */
+
+#define UDP_MAX_FRAMES		32  /* max # of frames to receive at once */
+
+/**
+ * struct udp_payload_t - UDP header and data for inbound messages
+ * @uh:		UDP header
+ * @data:	UDP data
+ */
+struct udp_payload_t {
+	struct udphdr uh;
+	char data[USHRT_MAX - sizeof(struct udphdr)];
+#ifdef __AVX2__
+} __attribute__ ((packed, aligned(32)));
+#else
+} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
+#endif
+
+size_t udp_update_hdr4(const struct ctx *c,
+		       struct iphdr *ip4h, struct udp_payload_t *bp,
+		       const struct flowside *toside, size_t dlen);
+size_t udp_update_hdr6(const struct ctx *c,
+		       struct ipv6hdr *ip6h, struct udp_payload_t *bp,
+		       const struct flowside *toside, size_t dlen);
+bool udp_sock_recverr(int s);
+#endif /* UDP_INTERNAL_H */
diff --git a/udp_vu.c b/udp_vu.c
new file mode 100644
index 000000000000..f9e7afcf4ddb
--- /dev/null
+++ b/udp_vu.c
@@ -0,0 +1,338 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ *
+ * udp_vu.c - UDP L2 vhost-user management functions
+ */
+
+#include <unistd.h>
+#include <assert.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <netinet/in.h>
+#include <netinet/ip.h>
+#include <netinet/udp.h>
+#include <stdint.h>
+#include <stddef.h>
+#include <sys/uio.h>
+#include <linux/virtio_net.h>
+
+#include "checksum.h"
+#include "util.h"
+#include "ip.h"
+#include "siphash.h"
+#include "inany.h"
+#include "passt.h"
+#include "pcap.h"
+#include "log.h"
+#include "vhost_user.h"
+#include "udp_internal.h"
+#include "flow.h"
+#include "flow_table.h"
+#include "udp_flow.h"
+#include "udp_vu.h"
+#include "vu_common.h"
+
+/* vhost-user */
+static const struct virtio_net_hdr vu_header = {
+	.flags = VIRTIO_NET_HDR_F_DATA_VALID,
+	.gso_type = VIRTIO_NET_HDR_GSO_NONE,
+};
+
+static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE];
+static struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
+static struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
+static int in_sg_count;
+
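+/**
+ * udp_vu_l2_hdrlen() - Size of the L2 headers of a UDP frame
+ * @vdev:	vhost-user device
+ * @v6:		Set for IPv6 frames
+ *
+ * Return: size of the virtio-net, Ethernet, IP and UDP headers
+ */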
+static size_t udp_vu_l2_hdrlen(const struct vu_dev *vdev, bool v6)
+{
+	size_t l2_hdrlen;
+
+	l2_hdrlen = vdev->hdrlen + sizeof(struct ethhdr) +
+		    sizeof(struct udphdr);
+
+	if (v6)
+		l2_hdrlen += sizeof(struct ipv6hdr);
+	else
+		l2_hdrlen += sizeof(struct iphdr);
+
+	return l2_hdrlen;
+}
+
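+/**
+ * udp_vu_sock_recv() - Receive a datagram from socket into vhost-user buffers
+ * @c:		Execution context
+ * @s_in:	Source socket address, filled in by recvmsg()
+ * @s:		Socket to receive from
+ * @events:	epoll events bitmap
+ * @v6:		Set for IPv6 flows
+ * @data_len:	Size of the received data (output)
+ *
+ * Return: number of iov entries used to store the datagram, 0 if nothing
+ *	   was received or on error
+ */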
+static int udp_vu_sock_recv(const struct ctx *c, union sockaddr_inany *s_in,
+			    int s, uint32_t events, bool v6, ssize_t *data_len)
+{
+	struct vu_dev *vdev = c->vdev;
+	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	int virtqueue_max, iov_cnt, idx, iov_used;
+	size_t fillsize, size, off, l2_hdrlen;
+	struct virtio_net_hdr_mrg_rxbuf *vh;
+	struct msghdr msg  = { 0 };
+	char *base;
+
+	ASSERT(!c->no_udp);
+
+	/* Clear any errors first */
+	if (events & EPOLLERR) {
+		while (udp_sock_recverr(s))
+			;
+	}
+
+	if (!(events & EPOLLIN))
+		return 0;
+
+	/* compute L2 header length */
+
+	if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
+		virtqueue_max = VIRTQUEUE_MAX_SIZE;
+	else
+		virtqueue_max = 1;
+
+	l2_hdrlen = udp_vu_l2_hdrlen(vdev, v6);
+
+	msg.msg_name = s_in;
+	msg.msg_namelen = sizeof(union sockaddr_inany);
+
+	fillsize = USHRT_MAX;
+	iov_cnt = 0;
+	in_sg_count = 0;
+	while (fillsize && iov_cnt < virtqueue_max &&
+			in_sg_count < ARRAY_SIZE(in_sg)) {
+		int ret;
+
+		elem[iov_cnt].out_num = 0;
+		elem[iov_cnt].out_sg = NULL;
+		elem[iov_cnt].in_num = ARRAY_SIZE(in_sg) - in_sg_count;
+		elem[iov_cnt].in_sg = &in_sg[in_sg_count];
+		ret = vu_queue_pop(vdev, vq, &elem[iov_cnt]);
+		if (ret < 0)
+			break;
+		in_sg_count += elem[iov_cnt].in_num;
+
+		if (elem[iov_cnt].in_num < 1) {
+			err("virtio-net receive queue contains no in buffers");
+			vu_queue_rewind(vq, iov_cnt);
+			return 0;
+		}
+		ASSERT(elem[iov_cnt].in_num == 1);
+		ASSERT(elem[iov_cnt].in_sg[0].iov_len >= l2_hdrlen);
+
+		if (iov_cnt == 0) {
+			base = elem[iov_cnt].in_sg[0].iov_base;
+			size = elem[iov_cnt].in_sg[0].iov_len;
+
+			/* keep space for the headers */
+			iov_vu[0].iov_base = base + l2_hdrlen;
+			iov_vu[0].iov_len = size - l2_hdrlen;
+		} else {
+			iov_vu[iov_cnt].iov_base = elem[iov_cnt].in_sg[0].iov_base;
+			iov_vu[iov_cnt].iov_len = elem[iov_cnt].in_sg[0].iov_len;
+		}
+
+		if (iov_vu[iov_cnt].iov_len > fillsize)
+			iov_vu[iov_cnt].iov_len = fillsize;
+
+		fillsize -= iov_vu[iov_cnt].iov_len;
+
+		iov_cnt++;
+	}
+	if (iov_cnt == 0)
+		return 0;
+
+	msg.msg_iov = iov_vu;
+	msg.msg_iovlen = iov_cnt;
+
+	*data_len = recvmsg(s, &msg, 0);
+	if (*data_len < 0) {
+		vu_queue_rewind(vq, iov_cnt);
+		return 0;
+	}
+
+	/* restore original values */
+	iov_vu[0].iov_base = base;
+	iov_vu[0].iov_len = size;
+
+	/* count the number of buffers filled by recvmsg() */
+	idx = iov_skip_bytes(iov_vu, iov_cnt, l2_hdrlen + *data_len,
+			     &off);
+	/* adjust last iov length */
+	if (idx < iov_cnt)
+		iov_vu[idx].iov_len = off;
+	iov_used = idx + !!off;
+
+	/* release unused buffers */
+	vu_queue_rewind(vq, iov_cnt - iov_used);
+
+	vh = (struct virtio_net_hdr_mrg_rxbuf *)base;
+	vh->hdr = vu_header;
+	if (vdev->hdrlen == sizeof(struct virtio_net_hdr_mrg_rxbuf))
+		vh->num_buffers = htole16(iov_used);
+
+	return iov_used;
+}
+
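+/**
+ * udp_vu_prepare() - Fill the L2 headers of one UDP frame
+ * @c:		Execution context
+ * @toside:	Flowside for destination side
+ * @data_len:	Length of the UDP data
+ *
+ * Return: size of the L4 payload (UDP header + data)
+ */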
+static size_t udp_vu_prepare(const struct ctx *c,
+			     const struct flowside *toside, ssize_t data_len)
+{
+	const struct vu_dev *vdev = c->vdev;
+	struct ethhdr *eh;
+	size_t l4len;
+
+	/* ethernet header */
+	eh = vu_eth(vdev, iov_vu[0].iov_base);
+
+	memcpy(eh->h_dest, c->mac_guest, sizeof(eh->h_dest));
+	memcpy(eh->h_source, c->mac, sizeof(eh->h_source));
+
+	/* initialize header */
+	if (inany_v4(&toside->eaddr) && inany_v4(&toside->faddr)) {
+		struct iphdr *iph = vu_ip(vdev, iov_vu[0].iov_base);
+		struct udp_payload_t *bp = vu_payloadv4(vdev,
+							    iov_vu[0].iov_base);
+
+		eh->h_proto = htons(ETH_P_IP);
+
+		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_UDP);
+
+		l4len = udp_update_hdr4(c, iph, bp, toside, data_len);
+	} else {
+		struct ipv6hdr *ip6h = vu_ip(vdev, iov_vu[0].iov_base);
+		struct udp_payload_t *bp = vu_payloadv6(vdev,
+							    iov_vu[0].iov_base);
+
+		eh->h_proto = htons(ETH_P_IPV6);
+
+		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_UDP);
+
+		l4len = udp_update_hdr6(c, ip6h, bp, toside, data_len);
+	}
+
+	return l4len;
+}
+
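+/**
+ * udp_vu_pcap() - Capture a single frame to pcap file
+ * @c:		Execution context
+ * @toside:	Flowside for destination side
+ * @l4len:	IP payload length (UDP header and data)
+ * @iov_used:	Number of iov_vu entries used by the frame
+ *
+ * The UDP checksum is not computed on the vhost-user path, so fill it
+ * in here before logging the frame.
+ */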
+static void udp_vu_pcap(const struct ctx *c, const struct flowside *toside,
+			size_t l4len, int iov_used)
+{
+	const struct in_addr *src = inany_v4(&toside->faddr);
+	const struct in_addr *dst = inany_v4(&toside->eaddr);
+	const struct vu_dev *vdev = c->vdev;
+	char *base = iov_vu[0].iov_base;
+	size_t size = iov_vu[0].iov_len;
+	struct udp_payload_t *bp;
+	uint32_t sum;
+
+	if (!*c->pcap)
+		return;
+
+	if (src && dst) {
+		bp = vu_payloadv4(vdev, base);
+		sum = proto_ipv4_header_psum(l4len, IPPROTO_UDP, *src, *dst);
+	} else {
+		bp = vu_payloadv6(vdev, base);
+		sum = proto_ipv6_header_psum(l4len, IPPROTO_UDP,
+					     &toside->faddr.a6,
+					     &toside->eaddr.a6);
+		bp->uh.check = 0; /* clear the 0xffff set by udp_update_hdr6() */
+	}
+
+	iov_vu[0].iov_base = &bp->uh;
+	iov_vu[0].iov_len = size - ((char *)iov_vu[0].iov_base - base);
+
+	bp->uh.check = csum_iov(iov_vu, iov_used, sum);
+
+	/* set iov for pcap logging */
+	iov_vu[0].iov_base = base + vdev->hdrlen;
+	iov_vu[0].iov_len = size - vdev->hdrlen;
+	pcap_iov(iov_vu, iov_used);
+
+	/* restore iov_vu[0] */
+	iov_vu[0].iov_base = base;
+	iov_vu[0].iov_len = size;
+}
+
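+/**
+ * udp_vu_listen_sock_handler() - Handle new data from socket
+ * @c:		Execution context
+ * @ref:	epoll reference
+ * @events:	epoll events bitmap
+ * @now:	Current timestamp
+ */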
+void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
+				uint32_t events, const struct timespec *now)
+{
+	struct vu_dev *vdev = c->vdev;
+	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	bool v6 = ref.udp.v6;
+	int i;
+
+	for (i = 0; i < UDP_MAX_FRAMES; i++) {
+		union sockaddr_inany s_in;
+		flow_sidx_t batchsidx;
+		uint8_t batchpif;
+		ssize_t data_len;
+		int iov_used;
+
+		iov_used = udp_vu_sock_recv(c, &s_in, ref.fd,
+					    events, v6, &data_len);
+		if (iov_used <= 0)
+			return;
+
+		batchsidx = udp_flow_from_sock(c, ref, &s_in, now);
+		batchpif = pif_at_sidx(batchsidx);
+
+		if (batchpif == PIF_TAP) {
+			size_t l4len;
+
+			l4len = udp_vu_prepare(c, flowside_at_sidx(batchsidx),
+					       data_len);
+			udp_vu_pcap(c, flowside_at_sidx(batchsidx), l4len,
+				    iov_used);
+			vu_send_frame(vdev, vq, elem, iov_vu, iov_used);
+		} else if (flow_sidx_valid(batchsidx)) {
+			flow_sidx_t fromsidx = flow_sidx_opposite(batchsidx);
+			struct udp_flow *uflow = udp_at_sidx(batchsidx);
+
+			flow_err(uflow,
+				 "No support for forwarding UDP from %s to %s",
+				 pif_name(pif_at_sidx(fromsidx)),
+				 pif_name(batchpif));
+		} else {
+			debug("Discarding 1 datagram without flow");
+		}
+	}
+}
+
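+/**
+ * udp_vu_reply_sock_handler() - Handle new data from flow specific socket
+ * @c:		Execution context
+ * @ref:	epoll reference
+ * @events:	epoll events bitmap
+ * @now:	Current timestamp
+ */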
+void udp_vu_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
+			       uint32_t events, const struct timespec *now)
+{
+	flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
+	const struct flowside *toside = flowside_at_sidx(tosidx);
+	struct vu_dev *vdev = c->vdev;
+	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	struct udp_flow *uflow = udp_at_sidx(ref.flowside);
+	uint8_t topif = pif_at_sidx(tosidx);
+	bool v6 = ref.udp.v6;
+	int i;
+
+	ASSERT(!c->no_udp && uflow);
+
+	for (i = 0; i < UDP_MAX_FRAMES; i++) {
+		union sockaddr_inany s_in;
+		ssize_t data_len;
+		int iov_used;
+
+		iov_used = udp_vu_sock_recv(c, &s_in, ref.fd,
+					    events, v6, &data_len);
+		if (iov_used <= 0)
+			return;
+		flow_trace(uflow, "Received 1 datagram on reply socket");
+		uflow->ts = now->tv_sec;
+
+		if (topif == PIF_TAP) {
+			size_t l4len;
+
+			l4len = udp_vu_prepare(c, toside, data_len);
+			udp_vu_pcap(c, toside, l4len, iov_used);
+			vu_send_frame(vdev, vq, elem, iov_vu, iov_used);
+		} else {
+			uint8_t frompif = pif_at_sidx(ref.flowside);
+
+			flow_err(uflow,
+				 "No support for forwarding UDP from %s to %s",
+				 pif_name(frompif), pif_name(topif));
+		}
+	}
+}
diff --git a/udp_vu.h b/udp_vu.h
new file mode 100644
index 000000000000..0db7558914d9
--- /dev/null
+++ b/udp_vu.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ */
+
+#ifndef UDP_VU_H
+#define UDP_VU_H
+
+void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
+				uint32_t events, const struct timespec *now);
+void udp_vu_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
+			       uint32_t events, const struct timespec *now);
+#endif /* UDP_VU_H */
diff --git a/vhost_user.c b/vhost_user.c
index c4cd25fae84e..e65b550774b7 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -52,7 +52,6 @@
  * 			     this is part of the vhost-user backend
  * 			     convention.
  */
-/* cppcheck-suppress unusedFunction */
 void vu_print_capabilities(void)
 {
 	info("{");
@@ -163,8 +162,7 @@ static void vmsg_close_fds(const struct vhost_user_msg *vmsg)
  */
 static void vu_remove_watch(const struct vu_dev *vdev, int fd)
 {
-	(void)vdev;
-	(void)fd;
+	epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL, fd, NULL);
 }
 
 /**
@@ -426,7 +424,6 @@ static bool map_ring(struct vu_dev *vdev, struct vu_virtq *vq)
  *
  * Return: 0 if the zone in a mapped memory region, -1 otherwise
  */
-/* cppcheck-suppress unusedFunction */
 int vu_packet_check_range(void *buf, size_t offset, size_t len,
 			  const char *start)
 {
@@ -517,6 +514,14 @@ static bool vu_set_mem_table_exec(struct vu_dev *vdev,
 		}
 	}
 
+	/* As vu_packet_check_range() has no access to the number of
+	 * memory regions, mark the end of the array with mmap_addr = 0
+	 */
+	ASSERT(vdev->nregions < VHOST_USER_MAX_RAM_SLOTS - 1);
+	vdev->regions[vdev->nregions].mmap_addr = 0;
+
+	tap_sock_update_buf(vdev->regions, 0);
+
 	return false;
 }
 
@@ -637,8 +642,12 @@ static bool vu_get_vring_base_exec(struct vu_dev *vdev,
  */
 static void vu_set_watch(const struct vu_dev *vdev, int fd)
 {
-	(void)vdev;
-	(void)fd;
+	union epoll_ref ref = { .type = EPOLL_TYPE_VHOST_KICK, .fd = fd };
+	struct epoll_event ev = { 0 };
+
+	ev.data.u64 = ref.u64;
+	ev.events = EPOLLIN;
+	epoll_ctl(vdev->context->epollfd, EPOLL_CTL_ADD, fd, &ev);
 }
 
 /**
@@ -678,7 +687,6 @@ static int vu_wait_queue(const struct vu_virtq *vq)
  *
  * Return: number of bytes sent, -1 if there is an error
  */
-/* cppcheck-suppress unusedFunction */
 int vu_send(struct vu_dev *vdev, const void *buf, size_t size)
 {
 	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
@@ -864,7 +872,6 @@ static void vu_handle_tx(struct vu_dev *vdev, int index,
  * @vdev:	vhost-user device
  * @ref:	epoll reference information
  */
-/* cppcheck-suppress unusedFunction */
 void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
 		const struct timespec *now)
 {
@@ -1102,11 +1109,11 @@ static bool vu_set_vring_enable_exec(struct vu_dev *vdev,
  * @c:		execution context
  * @vdev:	vhost-user device
  */
-/* cppcheck-suppress unusedFunction */
 void vu_init(struct ctx *c, struct vu_dev *vdev)
 {
 	int i;
 
+	c->vdev = vdev;
 	vdev->context = c;
 	vdev->hdrlen = 0;
 	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
@@ -1170,7 +1177,7 @@ void vu_cleanup(struct vu_dev *vdev)
  */
 static void vu_sock_reset(struct vu_dev *vdev)
 {
-	(void)vdev;
+	tap_sock_reset(vdev->context);
 }
 
 /**
@@ -1179,7 +1186,6 @@ static void vu_sock_reset(struct vu_dev *vdev)
  * @fd:		vhost-user message socket
  * @events:	epoll events
  */
-/* cppcheck-suppress unusedFunction */
 void tap_handler_vu(struct vu_dev *vdev, int fd, uint32_t events)
 {
 	struct vhost_user_msg msg = { 0 };
diff --git a/virtio.c b/virtio.c
index d02e6e04701d..55fc647842bb 100644
--- a/virtio.c
+++ b/virtio.c
@@ -559,7 +559,6 @@ void vu_queue_unpop(struct vu_virtq *vq)
  * @vq:		Virtqueue
  * @num:	Number of element to unpop
  */
-/* cppcheck-suppress unusedFunction */
 bool vu_queue_rewind(struct vu_virtq *vq, unsigned int num)
 {
 	if (num > vq->inuse)
diff --git a/vu_common.c b/vu_common.c
new file mode 100644
index 000000000000..611c44a39142
--- /dev/null
+++ b/vu_common.c
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ *
+ * vu_common.c - vhost-user common UDP and TCP functions
+ */
+
+#include <unistd.h>
+#include <sys/uio.h>
+
+#include "util.h"
+#include "passt.h"
+#include "vhost_user.h"
+#include "vu_common.h"
+
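+/**
+ * vu_send_frame() - Send one frame to the vhost-user interface
+ * @vdev:	vhost-user device
+ * @vq:		virtqueue to use to send the frame
+ * @elem:	virtqueue element array to return to the virtqueue
+ * @iov_vu:	iovec array containing the data to send
+ * @iov_used:	Length of the array
+ */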
+void vu_send_frame(const struct vu_dev *vdev, struct vu_virtq *vq,
+		   struct vu_virtq_element *elem, const struct iovec *iov_vu,
+		   int iov_used)
+{
+	int i;
+
+	for (i = 0; i < iov_used; i++)
+		vu_queue_fill(vq, &elem[i], iov_vu[i].iov_len, i);
+
+	vu_queue_flush(vq, iov_used);
+	vu_queue_notify(vdev, vq);
+}
diff --git a/vu_common.h b/vu_common.h
new file mode 100644
index 000000000000..d2ea46bd379b
--- /dev/null
+++ b/vu_common.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ *
+ * vhost-user common UDP and TCP functions
+ */
+
+#ifndef VU_COMMON_H
+#define VU_COMMON_H
+
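+/* Helpers to locate the Ethernet, IP and payload headers of a frame
+ * stored in a vhost-user buffer, which starts with a virtio-net
+ * header of vdev->hdrlen bytes.
+ */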
+static inline void *vu_eth(const struct vu_dev *vdev, void *base)
+{
+	return ((char *)base + vdev->hdrlen);
+}
+
+static inline void *vu_ip(const struct vu_dev *vdev, void *base)
+{
+	return (struct ethhdr *)vu_eth(vdev, base) + 1;
+}
+
+static inline void *vu_payloadv4(const struct vu_dev *vdev, void *base)
+{
+	return (struct iphdr *)vu_ip(vdev, base) + 1;
+}
+
+static inline void *vu_payloadv6(const struct vu_dev *vdev, void *base)
+{
+	return (struct ipv6hdr *)vu_ip(vdev, base) + 1;
+}
+
+void vu_send_frame(const struct vu_dev *vdev, struct vu_virtq *vq,
+		   struct vu_virtq_element *elem, const struct iovec *iov_vu,
+		   int iov_used);
+#endif /* VU_COMMON_H */
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 1/4] packet: replace struct desc by struct iovec
  2024-08-15 15:50 ` [PATCH v3 1/4] packet: replace struct desc by struct iovec Laurent Vivier
@ 2024-08-20  0:27   ` David Gibson
  0 siblings, 0 replies; 22+ messages in thread
From: David Gibson @ 2024-08-20  0:27 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 5883 bytes --]

On Thu, Aug 15, 2024 at 05:50:20PM +0200, Laurent Vivier wrote:
> To be able to manage buffers inside a shared memory provided
> by a VM via a vhost-user interface, we cannot rely on the fact
> that buffers are located in a pre-defined memory area and use
> a base address and a 32bit offset to address them.
> 
> We need a 64bit address, so replace struct desc by struct iovec
> and update range checking.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  packet.c | 80 ++++++++++++++++++++++++++++++--------------------------
>  packet.h | 14 ++--------
>  2 files changed, 45 insertions(+), 49 deletions(-)
> 
> diff --git a/packet.c b/packet.c
> index ccfc84607709..37489961a37e 100644
> --- a/packet.c
> +++ b/packet.c
> @@ -22,6 +22,35 @@
>  #include "util.h"
>  #include "log.h"
>  
> +/**
> + * packet_check_range() - Check if a packet memory range is valid
> + * @p:		Packet pool
> + * @offset:	Offset of data range in packet descriptor
> + * @len:	Length of desired data range
> + * @start:	Start of the packet descriptor
> + * @func:	For tracing: name of calling function
> + * @line:	For tracing: caller line of function call
> + *
> + * Return: 0 if the range is valid, -1 otherwise
> + */
> +static int packet_check_range(const struct pool *p, size_t offset, size_t len,
> +			      const char *start, const char *func, int line)
> +{
> +	if (start < p->buf) {
> +		trace("packet start %p before buffer start %p, "
> +		      "%s:%i", (void *)start, (void *)p->buf, func, line);
> +		return -1;
> +	}
> +
> +	if (start + len + offset > p->buf + p->buf_size) {
> +		trace("packet offset plus length %lu from size %lu, "
> +		      "%s:%i", start - p->buf + len + offset,
> +		      p->buf_size, func, line);
> +		return -1;
> +	}
> +
> +	return 0;
> +}
>  /**
>   * packet_add_do() - Add data as packet descriptor to given pool
>   * @p:		Existing pool
> @@ -41,34 +70,16 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
>  		return;
>  	}
>  
> -	if (start < p->buf) {
> -		trace("add packet start %p before buffer start %p, %s:%i",
> -		      (void *)start, (void *)p->buf, func, line);
> +	if (packet_check_range(p, 0, len, start, func, line))
>  		return;
> -	}
> -
> -	if (start + len > p->buf + p->buf_size) {
> -		trace("add packet start %p, length: %zu, buffer end %p, %s:%i",
> -		      (void *)start, len, (void *)(p->buf + p->buf_size),
> -		      func, line);
> -		return;
> -	}
>  
>  	if (len > UINT16_MAX) {
>  		trace("add packet length %zu, %s:%i", len, func, line);
>  		return;
>  	}
>  
> -#if UINTPTR_MAX == UINT64_MAX
> -	if ((uintptr_t)start - (uintptr_t)p->buf > UINT32_MAX) {
> -		trace("add packet start %p, buffer start %p, %s:%i",
> -		      (void *)start, (void *)p->buf, func, line);
> -		return;
> -	}
> -#endif
> -
> -	p->pkt[idx].offset = start - p->buf;
> -	p->pkt[idx].len = len;
> +	p->pkt[idx].iov_base = (void *)start;
> +	p->pkt[idx].iov_len = len;
>  
>  	p->count++;
>  }
> @@ -96,36 +107,31 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
>  		return NULL;
>  	}
>  
> -	if (len > UINT16_MAX || len + offset > UINT32_MAX) {
> +	if (len > UINT16_MAX) {
>  		if (func) {
> -			trace("packet data length %zu, offset %zu, %s:%i",
> -			      len, offset, func, line);
> +			trace("packet data length %zu, %s:%i",
> +			      len, func, line);
>  		}
>  		return NULL;
>  	}
>  
> -	if (p->pkt[idx].offset + len + offset > p->buf_size) {
> +	if (len + offset > p->pkt[idx].iov_len) {
>  		if (func) {
> -			trace("packet offset plus length %zu from size %zu, "
> -			      "%s:%i", p->pkt[idx].offset + len + offset,
> -			      p->buf_size, func, line);
> +			trace("data length %zu, offset %zu from length %zu, "
> +			      "%s:%i", len, offset, p->pkt[idx].iov_len,
> +			      func, line);
>  		}
>  		return NULL;
>  	}
>  
> -	if (len + offset > p->pkt[idx].len) {
> -		if (func) {
> -			trace("data length %zu, offset %zu from length %u, "
> -			      "%s:%i", len, offset, p->pkt[idx].len,
> -			      func, line);
> -		}
> +	if (packet_check_range(p, offset, len, p->pkt[idx].iov_base,
> +			       func, line))
>  		return NULL;
> -	}
>  
>  	if (left)
> -		*left = p->pkt[idx].len - offset - len;
> +		*left = p->pkt[idx].iov_len - offset - len;
>  
> -	return p->buf + p->pkt[idx].offset + offset;
> +	return (char *)p->pkt[idx].iov_base + offset;
>  }
>  
>  /**
> diff --git a/packet.h b/packet.h
> index a784b07bbed5..8377dcf678bb 100644
> --- a/packet.h
> +++ b/packet.h
> @@ -6,16 +6,6 @@
>  #ifndef PACKET_H
>  #define PACKET_H
>  
> -/**
> - * struct desc - Generic offset-based descriptor within buffer
> - * @offset:	Offset of descriptor relative to buffer start, 32-bit limit
> - * @len:	Length of descriptor, host order, 16-bit limit
> - */
> -struct desc {
> -	uint32_t offset;
> -	uint16_t len;
> -};
> -
>  /**
>   * struct pool - Generic pool of packets stored in a buffer
>   * @buf:	Buffer storing packet descriptors
> @@ -29,7 +19,7 @@ struct pool {
>  	size_t buf_size;
>  	size_t size;
>  	size_t count;
> -	struct desc pkt[1];
> +	struct iovec pkt[1];
>  };
>  
>  void packet_add_do(struct pool *p, size_t len, const char *start,
> @@ -54,7 +44,7 @@ struct _name ## _t {							\
>  	size_t buf_size;						\
>  	size_t size;							\
>  	size_t count;							\
> -	struct desc pkt[_size];						\
> +	struct iovec pkt[_size];					\
>  }
>  
>  #define PACKET_POOL_INIT_NOCAST(_size, _buf, _buf_size)			\

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 2/4] vhost-user: introduce virtio API
  2024-08-15 15:50 ` [PATCH v3 2/4] vhost-user: introduce virtio API Laurent Vivier
@ 2024-08-20  1:00   ` David Gibson
  2024-08-22 22:14   ` Stefano Brivio
  1 sibling, 0 replies; 22+ messages in thread
From: David Gibson @ 2024-08-20  1:00 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 29623 bytes --]

On Thu, Aug 15, 2024 at 05:50:21PM +0200, Laurent Vivier wrote:
> Add virtio.c and virtio.h that define the functions needed
> to manage virtqueues.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Although one tiny nit noted below.

> ---
>  Makefile |   4 +-
>  util.h   |   8 +
>  virtio.c | 662 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  virtio.h | 185 ++++++++++++++++
>  4 files changed, 857 insertions(+), 2 deletions(-)
>  create mode 100644 virtio.c
>  create mode 100644 virtio.h
> 
> diff --git a/Makefile b/Makefile
> index b6329e35f884..f171c7955ac9 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -47,7 +47,7 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
>  	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
>  	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
> -	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c
> +	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c virtio.c
>  QRAP_SRCS = qrap.c
>  SRCS = $(PASST_SRCS) $(QRAP_SRCS)
>  
> @@ -57,7 +57,7 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
>  	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
>  	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
>  	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
> -	udp.h udp_flow.h util.h
> +	udp.h udp_flow.h util.h virtio.h
>  HEADERS = $(PASST_HEADERS) seccomp.h
>  
>  C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
> diff --git a/util.h b/util.h
> index b7541ce24e5a..7944cfe1219d 100644
> --- a/util.h
> +++ b/util.h
> @@ -132,6 +132,14 @@ static inline uint32_t ntohl_unaligned(const void *p)
>  	return ntohl(val);
>  }
>  
> +static inline void barrier(void) { __asm__ __volatile__("" ::: "memory"); }
> +#define smp_mb()		do { barrier(); __atomic_thread_fence(__ATOMIC_SEQ_CST); } while (0)
> +#define smp_mb_release()	do { barrier(); __atomic_thread_fence(__ATOMIC_RELEASE); } while (0)
> +#define smp_mb_acquire()	do { barrier(); __atomic_thread_fence(__ATOMIC_ACQUIRE); } while (0)
> +
> +#define smp_wmb()	smp_mb_release()
> +#define smp_rmb()	smp_mb_acquire()
> +
>  #define NS_FN_STACK_SIZE	(RLIMIT_STACK_VAL * 1024 / 8)
>  int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
>  	     void *arg);
> diff --git a/virtio.c b/virtio.c
> new file mode 100644
> index 000000000000..8354f6052aee
> --- /dev/null
> +++ b/virtio.c
> @@ -0,0 +1,662 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * virtio API, vring and virtqueue functions definition
> + *
> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + */
> +
> +/* some parts copied from QEMU subprojects/libvhost-user/libvhost-user.c
> + * licensed under the following terms:
> + *
> + * Copyright IBM, Corp. 2007
> + * Copyright (c) 2016 Red Hat, Inc.
> + *
> + * Authors:
> + *  Anthony Liguori <aliguori@us.ibm.com>
> + *  Marc-André Lureau <mlureau@redhat.com>
> + *  Victor Kaplansky <victork@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * later.  See the COPYING file in the top-level directory.
> + *
> + * Some parts copied from QEMU hw/virtio/virtio.c
> + * licensed under the following terms:
> + *
> + * Copyright IBM, Corp. 2007
> + *
> + * Authors:
> + *  Anthony Liguori   <aliguori@us.ibm.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + * virtq_used_event() and virtq_avail_event() from
> + * https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-712000A
> + * licensed under the following terms:
> + *
> + * This header is BSD licensed so anyone can use the definitions
> + * to implement compatible drivers/servers.
> + *
> + * Copyright 2007, 2009, IBM Corporation
> + * Copyright 2011, Red Hat, Inc
> + * All rights reserved.
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + *    notice, this list of conditions and the following disclaimer.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + *    notice, this list of conditions and the following disclaimer in the
> + *    documentation and/or other materials provided with the distribution.
> + * 3. Neither the name of IBM nor the names of its contributors
> + *    may be used to endorse or promote products derived from this software
> + *    without specific prior written permission.
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ‘‘AS IS’’ AND
> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> + * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> + * SUCH DAMAGE.
> + */
> +
> +#include <stddef.h>
> +#include <endian.h>
> +#include <string.h>
> +#include <errno.h>
> +#include <sys/eventfd.h>
> +#include <sys/socket.h>
> +
> +#include "util.h"
> +#include "virtio.h"
> +
> +#define VIRTQUEUE_MAX_SIZE 1024
> +
> +/**
> + * vu_gpa_to_va() - Translate guest physical address to our virtual address.
> + * @dev:	Vhost-user device
> + * @plen:	Physical length to map (input), virtual address mapped (output)
> + * @guest_addr:	Guest physical address
> + *
> + * Return: virtual address in our address space of the guest physical address
> + */
> +static void *vu_gpa_to_va(struct vu_dev *dev, uint64_t *plen, uint64_t guest_addr)
> +{
> +	unsigned int i;
> +
> +	if (*plen == 0)
> +		return NULL;
> +
> +	/* Find matching memory region. */
> +	for (i = 0; i < dev->nregions; i++) {
> +		const struct vu_dev_region *r = &dev->regions[i];
> +
> +		if ((guest_addr >= r->gpa) &&
> +		    (guest_addr < (r->gpa + r->size))) {
> +			if ((guest_addr + *plen) > (r->gpa + r->size))
> +				*plen = r->gpa + r->size - guest_addr;
> +			/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
> +			return (void *)(guest_addr - r->gpa + r->mmap_addr +
> +						     r->mmap_offset);
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +/**
> + * vring_avail_flags() - Read the available ring flags
> + * @vq:		Virtqueue
> + *
> + * Return: the available ring descriptor flags of the given virtqueue
> + */
> +static inline uint16_t vring_avail_flags(const struct vu_virtq *vq)
> +{
> +	return le16toh(vq->vring.avail->flags);
> +}
> +
> +/**
> + * vring_avail_idx() - Read the available ring index
> + * @vq:		Virtqueue
> + *
> + * Return: the available ring index of the given virtqueue
> + */
> +static inline uint16_t vring_avail_idx(struct vu_virtq *vq)
> +{
> +	vq->shadow_avail_idx = le16toh(vq->vring.avail->idx);
> +
> +	return vq->shadow_avail_idx;
> +}
> +
> +/**
> + * vring_avail_ring() - Read an available ring entry
> + * @vq:		Virtqueue
> + * @i:		Index of the entry to read
> + *
> + * Return: the ring entry content (head of the descriptor chain)
> + */
> +static inline uint16_t vring_avail_ring(const struct vu_virtq *vq, int i)
> +{
> +	return le16toh(vq->vring.avail->ring[i]);
> +}
> +
> +/**
> + * virtq_used_event - Get location of used event indices
> + *		      (only with VIRTIO_F_EVENT_IDX)
> + * @vq:		Virtqueue
> + *
> + * Return: the location of the used event index
> + */
> +static inline uint16_t *virtq_used_event(const struct vu_virtq *vq)
> +{
> +        /* For backwards compat, used event index is at *end* of avail ring. */
> +        return &vq->vring.avail->ring[vq->vring.num];
> +}
> +
> +/**
> + * vring_get_used_event() - Get the used event from the available ring
> + * @vq:		Virtqueue
> + *
> + * Return: the used event (available only if VIRTIO_RING_F_EVENT_IDX is set)
> + *         used_event is a performant alternative where the driver
> + *         specifies how far the device can progress before a notification
> + *         is required.
> + */
> +static inline uint16_t vring_get_used_event(const struct vu_virtq *vq)
> +{
> +	return le16toh(*virtq_used_event(vq));
> +}
> +
> +/**
> + * virtqueue_get_head() - Get the head of the descriptor chain for a given
> + *                        index
> + * @vq:		Virtqueue
> + * @idx:	Available ring entry index
> + * @head:	Head of the descriptor chain
> + */
> +static void virtqueue_get_head(const struct vu_virtq *vq,
> +			       unsigned int idx, unsigned int *head)
> +{
> +	/* Grab the next descriptor number they're advertising, and increment
> +	 * the index we've seen.
> +	 */
> +	*head = vring_avail_ring(vq, idx % vq->vring.num);
> +
> +	/* If their number is silly, that's a fatal mistake. */
> +	if (*head >= vq->vring.num)
> +		die("Guest says index %u is available", *head);
> +}
> +
> +/**
> + * virtqueue_read_indirect_desc() - Copy virtio ring descriptors from guest
> + *                                  memory
> + * @dev:	Vhost-user device
> + * @desc:	Destination address to copy the descriptors
> + * @addr:	Guest memory address to copy from
> + * @len:	Length of memory to copy
> + *
> + * Return: -1 if there is an error, 0 otherwise
> + */
> +static int virtqueue_read_indirect_desc(struct vu_dev *dev, struct vring_desc *desc,
> +					uint64_t addr, size_t len)
> +{
> +	uint64_t read_len;
> +
> +	if (len > (VIRTQUEUE_MAX_SIZE * sizeof(struct vring_desc)))
> +		return -1;
> +
> +	if (len == 0)
> +		return -1;
> +
> +	while (len) {
> +		const struct vring_desc *orig_desc;
> +
> +		read_len = len;
> +		orig_desc = vu_gpa_to_va(dev, &read_len, addr);
> +		if (!orig_desc)
> +			return -1;
> +
> +		memcpy(desc, orig_desc, read_len);
> +		len -= read_len;
> +		addr += read_len;
> +		desc += read_len / sizeof(struct vring_desc);
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * enum virtqueue_read_desc_state - State in the descriptor chain
> + * @VIRTQUEUE_READ_DESC_ERROR:	Found an invalid descriptor
> + * @VIRTQUEUE_READ_DESC_DONE:	No more descriptors in the chain
> + * @VIRTQUEUE_READ_DESC_MORE:	There are more descriptors in the chain
> + */
> +enum virtqueue_read_desc_state {
> +	VIRTQUEUE_READ_DESC_ERROR = -1,
> +	VIRTQUEUE_READ_DESC_DONE = 0,   /* end of chain */
> +	VIRTQUEUE_READ_DESC_MORE = 1,   /* more buffers in chain */
> +};
> +
> +/**
> + * virtqueue_read_next_desc() - Read the next descriptor in the chain
> + * @desc:	Virtio ring descriptors
> + * @i:		Index of the current descriptor
> + * @max:	Maximum value of the descriptor index
> + * @next:	Index of the next descriptor in the chain (output value)
> + *
> + * Return: current chain descriptor state (error, next, done)
> + */
> +static int virtqueue_read_next_desc(const struct vring_desc *desc,
> +				    int i, unsigned int max, unsigned int *next)
> +{
> +	/* If this descriptor says it doesn't chain, we're done. */
> +	if (!(le16toh(desc[i].flags) & VRING_DESC_F_NEXT))
> +		return VIRTQUEUE_READ_DESC_DONE;
> +
> +	/* Check they're not leading us off end of descriptors. */
> +	*next = le16toh(desc[i].next);
> +	/* Make sure compiler knows to grab that: we don't want it changing! */
> +	smp_wmb();
> +
> +	if (*next >= max)
> +		return VIRTQUEUE_READ_DESC_ERROR;
> +
> +	return VIRTQUEUE_READ_DESC_MORE;
> +}
> +
> +/**
> + * vu_queue_empty() - Check if virtqueue is empty
> + * @vq:		Virtqueue
> + *
> + * Return: true if the virtqueue is empty, false otherwise
> + */
> +bool vu_queue_empty(struct vu_virtq *vq)
> +{
> +	if (!vq->vring.avail)
> +		return true;
> +
> +	if (vq->shadow_avail_idx != vq->last_avail_idx)
> +		return false;
> +
> +	return vring_avail_idx(vq) == vq->last_avail_idx;
> +}
> +
> +/**
> + * vring_can_notify() - Check if a notification can be sent
> + * @dev:	Vhost-user device
> + * @vq:		Virtqueue
> + *
> + * Return: true if notification can be sent
> + */
> +static bool vring_can_notify(const struct vu_dev *dev, struct vu_virtq *vq)
> +{
> +	uint16_t old, new;
> +	bool v;
> +
> +	/* We need to expose used array entries before checking used event. */
> +	smp_mb();
> +
> +	/* Always notify when queue is empty (if the feature is acknowledged) */
> +	if (vu_has_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY) &&
> +		!vq->inuse && vu_queue_empty(vq)) {
> +		return true;
> +	}
> +
> +	if (!vu_has_feature(dev, VIRTIO_RING_F_EVENT_IDX))
> +		return !(vring_avail_flags(vq) & VRING_AVAIL_F_NO_INTERRUPT);
> +
> +	v = vq->signalled_used_valid;
> +	vq->signalled_used_valid = true;
> +	old = vq->signalled_used;
> +	new = vq->signalled_used = vq->used_idx;
> +	return !v || vring_need_event(vring_get_used_event(vq), new, old);
> +}
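
For readers not familiar with VIRTIO_RING_F_EVENT_IDX: the last check
relies on vring_need_event() from <linux/virtio_ring.h>, which tests
whether the used_event index written by the driver falls in the window
of entries we published since the last signal:

	/* From <linux/virtio_ring.h>: notify iff event_idx is in the
	 * window (old, new], with 16-bit wrap-around
	 */
	static inline int vring_need_event(__u16 event_idx,
					   __u16 new_idx, __u16 old)
	{
		return (__u16)(new_idx - event_idx - 1) <
		       (__u16)(new_idx - old);
	}

so we only kick the guest once we cross the index it asked to be woken
at.
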
> +
> +/**
> + * vu_queue_notify() - Send a notification to the given virtqueue
> + * @dev:	Vhost-user device
> + * @vq:		Virtqueue
> + */
> +/* cppcheck-suppress unusedFunction */
> +void vu_queue_notify(const struct vu_dev *dev, struct vu_virtq *vq)
> +{
> +	if (!vq->vring.avail)
> +		return;
> +
> +	if (!vring_can_notify(dev, vq)) {
> +		debug("vhost-user: virtqueue can skip notify...");
> +		return;
> +	}
> +
> +	if (eventfd_write(vq->call_fd, 1) < 0)
> +		die_perror("Error writing eventfd");
> +}
> +
> +/**
> + * virtq_avail_event() - Get location of available event index
> + *			      (only with VIRTIO_RING_F_EVENT_IDX)
> + * @vq:		Virtqueue
> + *
> + * Return: the location of the available event index
> + */
> +static inline uint16_t *virtq_avail_event(const struct vu_virtq *vq)
> +{
> +        /* For backwards compat, avail event index is at *end* of used ring. */
> +        return (uint16_t *)&vq->vring.used->ring[vq->vring.num];
> +}
> +
> +/**
> + * vring_set_avail_event() - Set avail_event
> + * @vq:		Virtqueue
> + * @val:	Value to set to avail_event
> + *		avail_event is used in the same way the used_event is in the
> + *		avail_ring.
> + *		avail_event is used to advise the driver that notifications
> + *		are unnecessary until the driver writes entry with an index
> + *		specified by avail_event into the available ring.
> + */
> +static inline void vring_set_avail_event(const struct vu_virtq *vq,
> +					 uint16_t val)
> +{
> +	uint16_t val_le = htole16(val);
> +
> +	if (!vq->notification)
> +		return;
> +
> +	memcpy(virtq_avail_event(vq), &val_le, sizeof(val_le));
> +}
> +
> +/**
> + * virtqueue_map_desc() - Translate a guest buffer address into our virtual
> + * 			  address space and add it to an iov array
> + * @dev:	Vhost-user device
> + * @p_num_sg:	Index of the first iov entry to use (input),
> + *		index of the first unused iov entry (output)
> + * @iov:	Iov array to use to store buffer virtual addresses
> + * @max_num_sg:	Maximum number of iov entries
> + * @pa:		Guest physical address of the buffer to map into our virtual
> + * 		address space
> + * @sz:		Size of the buffer
> + *
> + * Return: false on error, true otherwise
> + */
> +static bool virtqueue_map_desc(struct vu_dev *dev,
> +			       unsigned int *p_num_sg, struct iovec *iov,
> +			       unsigned int max_num_sg,
> +			       uint64_t pa, size_t sz)
> +{
> +	unsigned int num_sg = *p_num_sg;
> +
> +	ASSERT(num_sg < max_num_sg);
> +	ASSERT(sz);
> +
> +	while (sz) {
> +		uint64_t len = sz;
> +
> +		iov[num_sg].iov_base = vu_gpa_to_va(dev, &len, pa);
> +		if (iov[num_sg].iov_base == NULL)
> +			die("virtio: invalid address for buffers");
> +		iov[num_sg].iov_len = len;
> +		num_sg++;
> +		sz -= len;
> +		pa += len;
> +	}
> +
> +	*p_num_sg = num_sg;
> +	return true;
> +}
> +
> +/**
> + * vu_queue_map_desc() - Map the virtqueue descriptor ring into our virtual
> + * 		       address space
> + * @dev:	Vhost-user device
> + * @vq:		Virtqueue
> + * @idx:	First descriptor ring entry to map
> + * @elem:	Virtqueue element to store descriptor ring iov
> + *
> + * Return: -1 if there is an error, 0 otherwise
> + */
> +static int vu_queue_map_desc(struct vu_dev *dev, struct vu_virtq *vq, unsigned int idx,
> +			     struct vu_virtq_element *elem)
> +{
> +	const struct vring_desc *desc = vq->vring.desc;
> +	struct vring_desc desc_buf[VIRTQUEUE_MAX_SIZE];
> +	unsigned int out_num = 0, in_num = 0;
> +	unsigned int max = vq->vring.num;
> +	unsigned int i = idx;
> +	uint64_t read_len;
> +	int rc;
> +
> +	if (le16toh(desc[i].flags) & VRING_DESC_F_INDIRECT) {
> +		unsigned int desc_len;
> +		uint64_t desc_addr;
> +
> +		if (le32toh(desc[i].len) % sizeof(struct vring_desc))
> +			die("Invalid size for indirect buffer table");
> +
> +		/* loop over the indirect descriptor table */
> +		desc_addr = le64toh(desc[i].addr);
> +		desc_len = le32toh(desc[i].len);
> +		max = desc_len / sizeof(struct vring_desc);
> +		read_len = desc_len;
> +		desc = vu_gpa_to_va(dev, &read_len, desc_addr);
> +		if (desc && read_len != desc_len) {
> +			/* Failed to use zero copy */
> +			desc = NULL;
> +			if (!virtqueue_read_indirect_desc(dev, desc_buf, desc_addr, desc_len))
> +				desc = desc_buf;
> +		}
> +		if (!desc)
> +			die("Invalid indirect buffer table");
> +		i = 0;
> +	}
> +
> +	/* Collect all the descriptors */
> +	do {
> +		if (le16toh(desc[i].flags) & VRING_DESC_F_WRITE) {
> +			if (!virtqueue_map_desc(dev, &in_num, elem->in_sg,
> +						elem->in_num,
> +						le64toh(desc[i].addr),
> +						le32toh(desc[i].len))) {
> +				return -1;
> +			}
> +		} else {
> +			if (in_num)
> +				die("Incorrect order for descriptors");
> +			if (!virtqueue_map_desc(dev, &out_num, elem->out_sg,
> +						elem->out_num,
> +						le64toh(desc[i].addr),
> +						le32toh(desc[i].len))) {
> +				return -1;
> +			}
> +		}
> +
> +		/* If we've got too many, that implies a descriptor loop. */
> +		if ((in_num + out_num) > max)
> +			die("Looped descriptor");
> +		rc = virtqueue_read_next_desc(desc, i, max, &i);
> +	} while (rc == VIRTQUEUE_READ_DESC_MORE);
> +
> +	if (rc == VIRTQUEUE_READ_DESC_ERROR)
> +		die("read descriptor error");
> +
> +	elem->index = idx;
> +	elem->in_num = in_num;
> +	elem->out_num = out_num;
> +
> +	return 0;
> +}
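
Perhaps worth spelling out, as the naming is easy to mix up: on return,
elem->out_sg covers the driver-to-device buffers (frames the guest
transmits) and elem->in_sg the device-writable buffers. A rough sketch
of draining the TX queue with the helpers below, tx_handle_frame()
being a made-up consumer:

	struct vu_virtq_element elem;
	struct iovec out_sg[VIRTQUEUE_MAX_SIZE];

	for (;;) {
		unsigned int i;

		elem.out_num = VIRTQUEUE_MAX_SIZE;
		elem.out_sg = out_sg;
		elem.in_num = 0;
		elem.in_sg = NULL;

		if (vu_queue_pop(dev, vq, &elem) < 0)
			break;

		for (i = 0; i < elem.out_num; i++)
			tx_handle_frame(elem.out_sg[i].iov_base,
					elem.out_sg[i].iov_len);

		vu_queue_fill(vq, &elem, 0, 0);
		vu_queue_flush(vq, 1);
	}
	vu_queue_notify(dev, vq);
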
> +
> +/**
> + * vu_queue_pop() - Pop an entry from the virtqueue
> + * @dev:	Vhost-user device
> + * @vq:		Virtqueue
> + * @elem:	Virtqueue element to fill with the entry information
> + *
> + * Return: -1 if there is an error, 0 otherwise
> + */
> +/* cppcheck-suppress unusedFunction */
> +int vu_queue_pop(struct vu_dev *dev, struct vu_virtq *vq, struct vu_virtq_element *elem)
> +{
> +	unsigned int head;
> +	int ret;
> +
> +	if (!vq->vring.avail)
> +		return -1;
> +
> +	if (vu_queue_empty(vq))
> +		return -1;
> +
> +	/* Barrier needed after vu_queue_empty(): the available ring
> +	 * index must be read before the ring entries (see the comment
> +	 * in QEMU's virtqueue_num_heads())
> +	 */
> +	smp_rmb();
> +
> +	if (vq->inuse >= vq->vring.num)
> +		die("Virtqueue size exceeded");
> +
> +	virtqueue_get_head(vq, vq->last_avail_idx++, &head);
> +
> +	if (vu_has_feature(dev, VIRTIO_RING_F_EVENT_IDX))
> +		vring_set_avail_event(vq, vq->last_avail_idx);
> +
> +	ret = vu_queue_map_desc(dev, vq, head, elem);
> +
> +	if (ret < 0)
> +		return ret;
> +
> +	vq->inuse++;
> +
> +	return 0;
> +}
> +
> +/**
> + * vu_queue_detach_element() - Detach an element from the virtqueue
> + * @vq:		Virtqueue
> + */
> +void vu_queue_detach_element(struct vu_virtq *vq)
> +{
> +	vq->inuse--;
> +	/* unmap, when DMA support is added */
> +}
> +
> +/**
> + * vu_queue_unpop() - Push the previously popped element back into the virtqueue
> + * @vq:		Virtqueue
> + */
> +/* cppcheck-suppress unusedFunction */
> +void vu_queue_unpop(struct vu_virtq *vq)
> +{
> +	vq->last_avail_idx--;
> +	vu_queue_detach_element(vq);
> +}
> +
> +/**
> + * vu_queue_rewind() - Push back a given number of popped elements
> + * @vq:		Virtqueue
> + * @num:	Number of elements to unpop
> + *
> + * Return: true on success, false if @num is greater than the number of
> + *         elements in use
> + */
> +/* cppcheck-suppress unusedFunction */
> +bool vu_queue_rewind(struct vu_virtq *vq, unsigned int num)
> +{
> +	if (num > vq->inuse)
> +		return false;
> +
> +	vq->last_avail_idx -= num;
> +	vq->inuse -= num;
> +	return true;
> +}
> +
> +/**
> + * vring_used_write() - Write an entry in the used ring
> + * @vq:		Virtqueue
> + * @uelem:	Entry to write
> + * @i:		Index of the entry in the used ring
> + */
> +static inline void vring_used_write(struct vu_virtq *vq,
> +				    const struct vring_used_elem *uelem, int i)
> +{
> +	struct vring_used *used = vq->vring.used;
> +
> +	used->ring[i] = *uelem;
> +}
> +
> +/**
> + * vu_queue_fill_by_index() - Update information of a descriptor ring entry
> + *			      in the used ring
> + * @vq:		Virtqueue
> + * @index:	Descriptor ring index
> + * @len:	Size of the element
> + * @idx:	Used ring entry index
> + */
> +void vu_queue_fill_by_index(struct vu_virtq *vq, unsigned int index,
> +			    unsigned int len, unsigned int idx)
> +{
> +	struct vring_used_elem uelem;
> +
> +	if (!vq->vring.avail)
> +		return;
> +
> +	idx = (idx + vq->used_idx) % vq->vring.num;
> +
> +	uelem.id = htole32(index);
> +	uelem.len = htole32(len);
> +	vring_used_write(vq, &uelem, idx);
> +}
> +
> +/**
> + * vu_queue_fill() - Update information of a given element in the used ring
> + * @vq:		Virtqueue
> + * @elem:	Element information to fill
> + * @len:	Size of the element
> + * @idx:	Used ring entry index
> + */
> +/* cppcheck-suppress unusedFunction */
> +void vu_queue_fill(struct vu_virtq *vq, const struct vu_virtq_element *elem,
> +		   unsigned int len, unsigned int idx)
> +{
> +	vu_queue_fill_by_index(vq, elem->index, len, idx);
> +}
> +
> +/**
> + * vring_used_idx_set() - Set the descriptor ring current index
> + * @vq:		Virtqueue
> + * @val:	Value to set in the index
> + */
> +static inline void vring_used_idx_set(struct vu_virtq *vq, uint16_t val)
> +{
> +	vq->vring.used->idx = htole16(val);
> +
> +	vq->used_idx = val;
> +}
> +
> +/**
> + * vu_queue_flush() - Flush the virtqueue
> + * @vq:		Virtqueue
> + * @count:	Number of entries to flush
> + */
> +/* cppcheck-suppress unusedFunction */
> +void vu_queue_flush(struct vu_virtq *vq, unsigned int count)
> +{
> +	uint16_t old, new;
> +
> +	if (!vq->vring.avail)
> +		return;
> +
> +	/* Make sure buffer is written before we update index. */
> +	smp_wmb();
> +
> +	old = vq->used_idx;
> +	new = old + count;
> +	vring_used_idx_set(vq, new);
> +	vq->inuse -= count;
> +	if ((uint16_t)(new - vq->signalled_used) < (uint16_t)(new - old))
> +		vq->signalled_used_valid = false;
> +}
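
To summarise the API this file ends up exposing, the device-to-guest
(RX) path goes, if I read 4/4 right, something like this (frame and
frame_len standing for whatever we have to send):

	struct vu_virtq_element elem;
	struct iovec in_sg;

	elem.out_num = 0;
	elem.out_sg = NULL;
	elem.in_num = 1;
	elem.in_sg = &in_sg;

	if (vu_queue_pop(dev, vq, &elem) < 0)
		return;				/* no buffer available */

	if (in_sg.iov_len < frame_len) {
		vu_queue_rewind(vq, 1);		/* put the buffer back */
		return;
	}

	memcpy(in_sg.iov_base, frame, frame_len);
	vu_queue_fill(vq, &elem, frame_len, 0);	/* write used entry */
	vu_queue_flush(vq, 1);			/* publish used index */
	vu_queue_notify(dev, vq);		/* kick the guest */
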
> diff --git a/virtio.h b/virtio.h
> new file mode 100644
> index 000000000000..af9cadc990b9
> --- /dev/null
> +++ b/virtio.h
> @@ -0,0 +1,185 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * virtio API, vring and virtqueue functions definition
> + *
> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + */
> +
> +#ifndef VIRTIO_H
> +#define VIRTIO_H
> +
> +#include <stdbool.h>
> +#include <linux/vhost_types.h>
> +
> +/* Maximum size of a virtqueue */
> +#define VIRTQUEUE_MAX_SIZE 1024
> +
> +/**
> + * struct vu_ring - Virtqueue rings
> + * @num:		Size of the queue
> + * @desc:		Descriptor ring
> + * @avail:		Available ring
> + * @used:		Used ring
> + * @log_guest_addr:	Guest address for logging
> + * @flags:		Vring flags
> + * 			VHOST_VRING_F_LOG is set if log address is valid
> + */
> +struct vu_ring {
> +	unsigned int num;
> +	struct vring_desc *desc;
> +	struct vring_avail *avail;
> +	struct vring_used *used;
> +	uint64_t log_guest_addr;
> +	uint32_t flags;
> +};
> +
> +/**
> + * struct vu_virtq - Virtqueue definition
> + * @vring:			Virtqueue rings
> + * @last_avail_idx:		Next head to pop
> + * @shadow_avail_idx:		Last avail_idx read from VQ
> + * @used_idx:			Descriptor ring current index
> + * @signalled_used:		Last used index value we have signalled on
> + * @signalled_used_valid:	True if signalled_used is valid
> + * @notification:		True if the queues notify (via event
> + * 				index or interrupt)
> + * @inuse:			Number of entries in use
> + * @call_fd:			The event file descriptor to signal when
> + * 				buffers are used.
> + * @kick_fd:			The event file descriptor for adding
> + * 				buffers to the vring
> + * @err_fd:			The event file descriptor to signal when
> + * 				an error occurs
> + * @enable:			True if the virtqueue is enabled
> + * @started:			True if the virtqueue is started
> + * @vra:			QEMU address of our rings
> + */
> +struct vu_virtq {
> +	struct vu_ring vring;
> +	uint16_t last_avail_idx;
> +	uint16_t shadow_avail_idx;
> +	uint16_t used_idx;
> +	uint16_t signalled_used;
> +	bool signalled_used_valid;
> +	bool notification;
> +	unsigned int inuse;
> +	int call_fd;
> +	int kick_fd;
> +	int err_fd;
> +	unsigned int enable;
> +	bool started;
> +	struct vhost_vring_addr vra;
> +};
> +
> +/**
> + * struct vu_dev_region - guest shared memory region
> + * @gpa:		Guest physical address of the region
> + * @size:		Memory size in bytes
> + * @qva:		QEMU virtual address
> + * @mmap_offset:	Offset where the region starts in the mapped memory
> + * @mmap_addr:		Address of the mapped memory
> + */
> +struct vu_dev_region {
> +	uint64_t gpa;
> +	uint64_t size;
> +	uint64_t qva;
> +	uint64_t mmap_offset;
> +	uint64_t mmap_addr;
> +};
> +
> +#define VHOST_USER_MAX_QUEUES 2
> +
> +/*
> + * Set a reasonable maximum number of ram slots, which will be supported by
> + * any architecture.
> + */
> +#define VHOST_USER_MAX_RAM_SLOTS 32
> +
> +/**
> + * struct vu_dev - vhost-user device information
> + * *nregions:		Number of shared memory regions

Nit: s/*nregions/@nregions/

> + * @regions:		Guest shared memory regions
> + * @vq:			Virtqueues
> + * @features:		Vhost-user features
> + * @protocol_features:	Vhost-user protocol features
> + * @hdrlen:		Virtio-net header length
> + */
> +struct vu_dev {
> +	uint32_t nregions;
> +	struct vu_dev_region regions[VHOST_USER_MAX_RAM_SLOTS];
> +	struct vu_virtq vq[VHOST_USER_MAX_QUEUES];
> +	uint64_t features;
> +	uint64_t protocol_features;
> +	int hdrlen;
> +};
> +
> +/**
> + * struct vu_virtq_element - virtqueue element
> + * @index:	Descriptor ring index
> + * @out_num:	Number of outgoing iovec buffers
> + * @in_num:	Number of incoming iovec buffers
> + * @in_sg:	Incoming iovec buffers
> + * @out_sg:	Outgoing iovec buffers
> + */
> +struct vu_virtq_element {
> +	unsigned int index;
> +	unsigned int out_num;
> +	unsigned int in_num;
> +	struct iovec *in_sg;
> +	struct iovec *out_sg;
> +};
> +
> +/**
> + * has_feature() - Check a feature bit in a features set
> + * @features:	Features set
> + * @fbit:	Feature bit to check
> + *
> + * Return:	True if the feature bit is set
> + */
> +static inline bool has_feature(uint64_t features, unsigned int fbit)
> +{
> +	return !!(features & (1ULL << fbit));
> +}
> +
> +/**
> + * vu_has_feature() - Check if a virtio-net feature is available
> + * @vdev:	Vhost-user device
> + * @fbit:	Feature bit to check
> + *
> + * Return:	True if the feature is available
> + */
> +static inline bool vu_has_feature(const struct vu_dev *vdev,
> +				  unsigned int fbit)
> +{
> +	return has_feature(vdev->features, fbit);
> +}
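
One usage example, to make the connection with @hdrlen above explicit
(assuming this is how 4/4 sets it up): the virtio-net header size
depends on whether VIRTIO_NET_F_MRG_RXBUF was negotiated,

	if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
		vdev->hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
	else
		vdev->hdrlen = sizeof(struct virtio_net_hdr);

which is why tcp_vu.c compares hdrlen against
sizeof(struct virtio_net_hdr_mrg_rxbuf) before setting num_buffers.
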
> +
> +/**
> + * vu_has_protocol_feature() - Check if a vhost-user protocol feature is available
> + * @vdev:	Vhost-user device
> + * @fbit:	Feature bit to check
> + *
> + * Return:	True if the feature is available
> + */
> +/* cppcheck-suppress unusedFunction */
> +static inline bool vu_has_protocol_feature(const struct vu_dev *vdev,
> +					   unsigned int fbit)
> +{
> +	return has_feature(vdev->protocol_features, fbit);
> +}
> +
> +bool vu_queue_empty(struct vu_virtq *vq);
> +void vu_queue_notify(const struct vu_dev *dev, struct vu_virtq *vq);
> +int vu_queue_pop(struct vu_dev *dev, struct vu_virtq *vq,
> +		 struct vu_virtq_element *elem);
> +void vu_queue_detach_element(struct vu_virtq *vq);
> +void vu_queue_unpop(struct vu_virtq *vq);
> +bool vu_queue_rewind(struct vu_virtq *vq, unsigned int num);
> +void vu_queue_fill_by_index(struct vu_virtq *vq, unsigned int index,
> +			    unsigned int len, unsigned int idx);
> +void vu_queue_fill(struct vu_virtq *vq,
> +		   const struct vu_virtq_element *elem, unsigned int len,
> +		   unsigned int idx);
> +void vu_queue_flush(struct vu_virtq *vq, unsigned int count);
> +#endif /* VIRTIO_H */

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 0/4] Add vhost-user support to passt. (part 3)
  2024-08-15 15:50 [PATCH v3 0/4] Add vhost-user support to passt. (part 3) Laurent Vivier
                   ` (3 preceding siblings ...)
  2024-08-15 15:50 ` [PATCH v3 4/4] vhost-user: add vhost-user Laurent Vivier
@ 2024-08-20 22:41 ` Stefano Brivio
  2024-08-22 16:53   ` Stefano Brivio
  4 siblings, 1 reply; 22+ messages in thread
From: Stefano Brivio @ 2024-08-20 22:41 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

[-- Attachment #1: Type: text/plain, Size: 956 bytes --]

On Thu, 15 Aug 2024 17:50:19 +0200
Laurent Vivier <lvivier@redhat.com> wrote:

> This series of patches adds vhost-user support to passt
> and then allows passt to connect to QEMU network backend using
> virtqueue rather than a socket.
> 
> With QEMU, rather than using to connect:
> 
>   -netdev stream,id=s,server=off,addr.type=unix,addr.path=/tmp/passt_1.socket
> 
> we will use:
> 
>   -chardev socket,id=chr0,path=/tmp/passt_1.socket
>   -netdev vhost-user,id=netdev0,chardev=chr0
>   -device virtio-net,netdev=netdev0
>   -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE
>   -numa node,memdev=memfd0
> 
> The memory backend is needed to share data between passt and QEMU.
> 
> Performance comparison between "-netdev stream" and "-netdev vhost-user":

By the way, I attached a quick patch adding vhost-user-based tests to
the usual throughput and latency tests.

UDP doesn't work (I didn't look into that at all), TCP does.

-- 
Stefano

[-- Attachment #2: vhost_user_tests.patch --]
[-- Type: text/x-patch, Size: 3981 bytes --]

diff --git a/test/lib/setup b/test/lib/setup
index 58371bd..cc24cb0 100755
--- a/test/lib/setup
+++ b/test/lib/setup
@@ -17,6 +17,7 @@ INITRAMFS="${BASEPATH}/mbuto.img"
 VCPUS="$( [ $(nproc) -ge 8 ] && echo 6 || echo $(( $(nproc) / 2 + 1 )) )"
 __mem_kib="$(sed -n 's/MemTotal:[ ]*\([0-9]*\) kB/\1/p' /proc/meminfo)"
 VMEM="$((${__mem_kib} / 1024 / 4))"
+VMEM_ROUND="$(((${VMEM} + 500) / 1000))G"
 QEMU_ARCH="$(uname -m)"
 [ "${QEMU_ARCH}" = "i686" ] && QEMU_ARCH=i386
 
@@ -145,23 +146,45 @@ setup_passt_in_ns() {
 	else
 		context_run passt "make clean"
 		context_run passt "make"
-		context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid"
+		if [ ${VHOST_USER} -eq 1 ]; then
+			context_run_bg passt "./passt -f ${__opts} --vhost-user -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid"
+		else
+			context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid"
+		fi
 	fi
 	wait_for [ -f "${STATESETUP}/passt.pid" ]
 
 	GUEST_CID=94557
-	context_run_bg qemu 'qemu-system-'"${QEMU_ARCH}"		   \
-		' -machine accel=kvm'                                      \
-		' -M accel=kvm:tcg'                                        \
-		' -m '${VMEM}' -cpu host -smp '${VCPUS}                    \
-		' -kernel ' "/boot/vmlinuz-$(uname -r)"			   \
-		' -initrd '${INITRAMFS}' -nographic -serial stdio'	   \
-		' -nodefaults'						   \
-		' -append "console=ttyS0 mitigations=off apparmor=0" '	   \
-		' -device virtio-net-pci,netdev=s0 '			   \
-		" -netdev stream,id=s0,server=off,addr.type=unix,addr.path=${STATESETUP}/passt.socket " \
-		" -pidfile ${STATESETUP}/qemu.pid"			   \
-		" -device vhost-vsock-pci,guest-cid=$GUEST_CID"
+	if [ ${VHOST_USER} -eq 1 ]; then
+		context_run_bg qemu 'qemu-system-'"${QEMU_ARCH}"		   \
+			' -machine accel=kvm'                                      \
+			' -M accel=kvm:tcg'                                        \
+			' -m '${VMEM_ROUND}' -cpu host -smp '${VCPUS}              \
+			' -kernel ' "/boot/vmlinuz-$(uname -r)"			   \
+			' -initrd '${INITRAMFS}' -nographic -serial stdio'	   \
+			' -nodefaults'						   \
+			' -append "console=ttyS0 mitigations=off apparmor=0" '	   \
+			" -chardev socket,id=chr0,path=${STATESETUP}/passt.socket" \
+			' -netdev vhost-user,id=netdev0,chardev=chr0'		   \
+			' -device virtio-net,netdev=netdev0'			   \
+			" -object memory-backend-memfd,id=memfd0,share=on,size=${VMEM_ROUND}" \
+			' -numa node,memdev=memfd0'				   \
+			" -pidfile ${STATESETUP}/qemu.pid"			   \
+			" -device vhost-vsock-pci,guest-cid=$GUEST_CID"
+	else
+		context_run_bg qemu 'qemu-system-'"${QEMU_ARCH}"		   \
+			' -machine accel=kvm'                                      \
+			' -M accel=kvm:tcg'                                        \
+			' -m '${VMEM}' -cpu host -smp '${VCPUS}                    \
+			' -kernel ' "/boot/vmlinuz-$(uname -r)"			   \
+			' -initrd '${INITRAMFS}' -nographic -serial stdio'	   \
+			' -nodefaults'						   \
+			' -append "console=ttyS0 mitigations=off apparmor=0" '	   \
+			' -device virtio-net-pci,netdev=s0 '			   \
+			" -netdev stream,id=s0,server=off,addr.type=unix,addr.path=${STATESETUP}/passt.socket " \
+			" -pidfile ${STATESETUP}/qemu.pid"			   \
+			" -device vhost-vsock-pci,guest-cid=$GUEST_CID"
+	fi
 
 	context_setup_guest guest $GUEST_CID
 }
diff --git a/test/run b/test/run
index 3b37663..b522a69 100755
--- a/test/run
+++ b/test/run
@@ -70,6 +70,18 @@ run() {
 	test build/clang_tidy
 	teardown build
 
+	VALGRIND=0
+	VHOST_USER=1
+	setup passt_in_ns
+	test passt/ndp
+	test passt/dhcp
+	test perf_vhost_user/passt_tcp
+	test perf_vhost_user/passt_udp
+	test perf_vhost_user/pasta_tcp
+	test perf_vhost_user/pasta_udp
+	test passt_in_ns/shutdown
+	teardown passt_in_ns
+
 	setup pasta
 	test pasta/ndp
 	test pasta/dhcp

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 4/4] vhost-user: add vhost-user
  2024-08-15 15:50 ` [PATCH v3 4/4] vhost-user: add vhost-user Laurent Vivier
@ 2024-08-22  9:59   ` Stefano Brivio
  2024-08-22 22:14   ` Stefano Brivio
  2024-08-23 12:32   ` Stefano Brivio
  2 siblings, 0 replies; 22+ messages in thread
From: Stefano Brivio @ 2024-08-22  9:59 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

[-- Attachment #1: Type: text/plain, Size: 797 bytes --]

On Thu, 15 Aug 2024 17:50:23 +0200
Laurent Vivier <lvivier@redhat.com> wrote:

> add virtio and vhost-user functions to connect with QEMU.
> 
>   $ ./passt --vhost-user
> 
> and
> 
>   # qemu-system-x86_64 ... -m 4G \
>         -object memory-backend-memfd,id=memfd0,share=on,size=4G \
>         -numa node,memdev=memfd0 \
>         -chardev socket,id=chr0,path=/tmp/passt_1.socket \
>         -netdev vhost-user,id=netdev0,chardev=chr0 \
>         -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
>         ...
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>

This patch (and only this patch) now has some trivial conflicts with
the current HEAD. As I wanted to apply it for review anyway, I solved
them on my local branch; patch attached, in case it saves you some
time.

-- 
Stefano

[-- Attachment #2: 0001-vhost-user-add-vhost-user.patch --]
[-- Type: text/x-patch, Size: 64201 bytes --]

From aa0c68bd9f76b66c70d27cc269b505f8ee9d5cd9 Mon Sep 17 00:00:00 2001
From: Laurent Vivier <lvivier@redhat.com>
Date: Thu, 15 Aug 2024 17:50:23 +0200
Subject: [PATCH] vhost-user: add vhost-user

add virtio and vhost-user functions to connect with QEMU.

  $ ./passt --vhost-user

and

  # qemu-system-x86_64 ... -m 4G \
        -object memory-backend-memfd,id=memfd0,share=on,size=4G \
        -numa node,memdev=memfd0 \
        -chardev socket,id=chr0,path=/tmp/passt_1.socket \
        -netdev vhost-user,id=netdev0,chardev=chr0 \
        -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
        ...

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 Makefile       |   6 +-
 checksum.c     |   1 -
 conf.c         |  24 +-
 epoll_type.h   |   4 +
 isolation.c    |  15 +-
 packet.c       |  13 ++
 packet.h       |   2 +
 passt.c        |  25 ++-
 passt.h        |   6 +
 pcap.c         |   1 -
 tap.c          | 106 +++++++--
 tap.h          |   5 +-
 tcp.c          |  33 ++-
 tcp_buf.c      |   6 +-
 tcp_internal.h |   3 +-
 tcp_vu.c       | 593 +++++++++++++++++++++++++++++++++++++++++++++++++
 tcp_vu.h       |  12 +
 udp.c          |  72 +++---
 udp.h          |   8 +-
 udp_internal.h |  34 +++
 udp_vu.c       | 338 ++++++++++++++++++++++++++++
 udp_vu.h       |  13 ++
 vhost_user.c   |  28 ++-
 virtio.c       |   1 -
 vu_common.c    |  27 +++
 vu_common.h    |  34 +++
 26 files changed, 1311 insertions(+), 99 deletions(-)
 create mode 100644 tcp_vu.c
 create mode 100644 tcp_vu.h
 create mode 100644 udp_internal.h
 create mode 100644 udp_vu.c
 create mode 100644 udp_vu.h
 create mode 100644 vu_common.c
 create mode 100644 vu_common.h

diff --git a/Makefile b/Makefile
index 01e95ac..e481a94 100644
--- a/Makefile
+++ b/Makefile
@@ -47,7 +47,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
 	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
 	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
-	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c vhost_user.c virtio.c
+	tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
+	vhost_user.c virtio.c vu_common.c
 QRAP_SRCS = qrap.c
 SRCS = $(PASST_SRCS) $(QRAP_SRCS)
 
@@ -57,7 +58,8 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
 	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
 	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
 	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
-	udp.h udp_flow.h util.h vhost_user.h virtio.h
+	tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
+	virtio.h vu_common.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 
 C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
diff --git a/checksum.c b/checksum.c
index 006614f..aa5b7ae 100644
--- a/checksum.c
+++ b/checksum.c
@@ -501,7 +501,6 @@ uint16_t csum(const void *buf, size_t len, uint32_t init)
  *
  * Return: 16-bit folded, complemented checksum
  */
-/* cppcheck-suppress unusedFunction */
 uint16_t csum_iov(const struct iovec *iov, size_t n, uint32_t init)
 {
 	unsigned int i;
diff --git a/conf.c b/conf.c
index e29b6a9..ff84e8c 100644
--- a/conf.c
+++ b/conf.c
@@ -45,6 +45,7 @@
 #include "lineread.h"
 #include "isolation.h"
 #include "log.h"
+#include "vhost_user.h"
 
 /**
  * next_chunk - Return the next piece of a string delimited by a character
@@ -759,9 +760,14 @@ static void usage(const char *name, FILE *f, int status)
 			"    default: same interface name as external one\n");
 	} else {
 		fprintf(f,
-			"  -s, --socket PATH	UNIX domain socket path\n"
+			"  -s, --socket, --socket-path PATH	UNIX domain socket path\n"
 			"    default: probe free path starting from "
 			UNIX_SOCK_PATH "\n", 1);
+		fprintf(f,
+			"  --vhost-user		Enable vhost-user mode\n"
+			"    UNIX domain socket is provided by -s option\n"
+			"  --print-capabilities	print back-end capabilities in JSON format,\n"
+			"    only meaningful for vhost-user mode\n");
 	}
 
 	fprintf(f,
@@ -1281,6 +1287,10 @@ void conf(struct ctx *c, int argc, char **argv)
 		{"netns-only",	no_argument,		NULL,		20 },
 		{"map-host-loopback", required_argument, NULL,		21 },
 		{"map-guest-addr", required_argument,	NULL,		22 },
+		{"vhost-user",	no_argument,		NULL,		23 },
+		/* vhost-user backend program convention */
+		{"print-capabilities", no_argument,	NULL,		24 },
+		{"socket-path",	required_argument,	NULL,		's' },
 		{ 0 },
 	};
 	const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
@@ -1412,14 +1422,12 @@ void conf(struct ctx *c, int argc, char **argv)
 				       sizeof(c->ip4.ifname_out), "%s", optarg);
 			if (ret <= 0 || ret >= (int)sizeof(c->ip4.ifname_out))
 				die("Invalid interface name: %s", optarg);
-
 			break;
 		case 16:
 			ret = snprintf(c->ip6.ifname_out,
 				       sizeof(c->ip6.ifname_out), "%s", optarg);
 			if (ret <= 0 || ret >= (int)sizeof(c->ip6.ifname_out))
 				die("Invalid interface name: %s", optarg);
-
 			break;
 		case 17:
 			if (c->mode != MODE_PASTA)
@@ -1458,6 +1466,16 @@ void conf(struct ctx *c, int argc, char **argv)
 			conf_nat(optarg, &c->ip4.map_guest_addr,
 				 &c->ip6.map_guest_addr, NULL);
 			break;
+		case 23:
+			if (c->mode == MODE_PASTA) {
+				err("--vhost-user is for passt mode only");
+				usage(argv[0], stdout, EXIT_SUCCESS);
+			}
+			c->mode = MODE_VU;
+			break;
+		case 24:
+			vu_print_capabilities();
+			break;
 		case 'd':
 			c->debug = 1;
 			c->quiet = 0;
diff --git a/epoll_type.h b/epoll_type.h
index 0ad1efa..f3ef415 100644
--- a/epoll_type.h
+++ b/epoll_type.h
@@ -36,6 +36,10 @@ enum epoll_type {
 	EPOLL_TYPE_TAP_PASST,
 	/* socket listening for qemu socket connections */
 	EPOLL_TYPE_TAP_LISTEN,
+	/* vhost-user command socket */
+	EPOLL_TYPE_VHOST_CMD,
+	/* vhost-user kick event socket */
+	EPOLL_TYPE_VHOST_KICK,
 
 	EPOLL_NUM_TYPES,
 };
diff --git a/isolation.c b/isolation.c
index 45fba1e..c2a3c7b 100644
--- a/isolation.c
+++ b/isolation.c
@@ -379,12 +379,19 @@ void isolate_postfork(const struct ctx *c)
 
 	prctl(PR_SET_DUMPABLE, 0);
 
-	if (c->mode == MODE_PASTA) {
-		prog.len = (unsigned short)ARRAY_SIZE(filter_pasta);
-		prog.filter = filter_pasta;
-	} else {
+	switch (c->mode) {
+	case MODE_PASST:
 		prog.len = (unsigned short)ARRAY_SIZE(filter_passt);
 		prog.filter = filter_passt;
+		break;
+	case MODE_PASTA:
+		prog.len = (unsigned short)ARRAY_SIZE(filter_pasta);
+		prog.filter = filter_pasta;
+		break;
+	case MODE_VU:
+		prog.len = (unsigned short)ARRAY_SIZE(filter_vu);
+		prog.filter = filter_vu;
+		break;
 	}
 
 	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
diff --git a/packet.c b/packet.c
index 3748996..36c7e50 100644
--- a/packet.c
+++ b/packet.c
@@ -36,6 +36,19 @@
 static int packet_check_range(const struct pool *p, size_t offset, size_t len,
 			      const char *start, const char *func, int line)
 {
+	ASSERT(p->buf);
+
+	if (p->buf_size == 0) {
+		int ret;
+
+		ret = vu_packet_check_range((void *)p->buf, offset, len, start);
+
+		if (ret == -1)
+			trace("cannot find region, %s:%i", func, line);
+
+		return ret;
+	}
+
 	if (start < p->buf) {
 		trace("packet start %p before buffer start %p, "
 		      "%s:%i", (void *)start, (void *)p->buf, func, line);
diff --git a/packet.h b/packet.h
index 8377dcf..d32688d 100644
--- a/packet.h
+++ b/packet.h
@@ -22,6 +22,8 @@ struct pool {
 	struct iovec pkt[1];
 };
 
+int vu_packet_check_range(void *buf, size_t offset, size_t len,
+			  const char *start);
 void packet_add_do(struct pool *p, size_t len, const char *start,
 		   const char *func, int line);
 void *packet_get_do(const struct pool *p, const size_t idx,
diff --git a/passt.c b/passt.c
index ad6f0bc..8c8c170 100644
--- a/passt.c
+++ b/passt.c
@@ -74,6 +74,8 @@ char *epoll_type_str[] = {
 	[EPOLL_TYPE_TAP_PASTA]		= "/dev/net/tun device",
 	[EPOLL_TYPE_TAP_PASST]		= "connected qemu socket",
 	[EPOLL_TYPE_TAP_LISTEN]		= "listening qemu socket",
+	[EPOLL_TYPE_VHOST_CMD]		= "vhost-user command socket",
+	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
 };
 static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
 	      "epoll_type_str[] doesn't match enum epoll_type");
@@ -206,6 +208,7 @@ int main(int argc, char **argv)
 	struct rlimit limit;
 	struct timespec now;
 	struct sigaction sa;
+	struct vu_dev vdev;
 
 	clock_gettime(CLOCK_MONOTONIC, &log_start);
 
@@ -262,6 +265,8 @@ int main(int argc, char **argv)
 	pasta_netns_quit_init(&c);
 
 	tap_sock_init(&c);
+	if (c.mode == MODE_VU)
+		vu_init(&c, &vdev);
 
 	secret_init(&c);
 
@@ -352,14 +357,30 @@ loop:
 			tcp_timer_handler(&c, ref);
 			break;
 		case EPOLL_TYPE_UDP_LISTEN:
-			udp_listen_sock_handler(&c, ref, eventmask, &now);
+			if (c.mode == MODE_VU)
+				udp_vu_listen_sock_handler(&c, ref, eventmask,
+							   &now);
+			else
+				udp_buf_listen_sock_handler(&c, ref, eventmask,
+							    &now);
 			break;
 		case EPOLL_TYPE_UDP_REPLY:
-			udp_reply_sock_handler(&c, ref, eventmask, &now);
+			if (c.mode == MODE_VU)
+				udp_vu_reply_sock_handler(&c, ref, eventmask,
+							  &now);
+			else
+				udp_buf_reply_sock_handler(&c, ref, eventmask,
+							   &now);
 			break;
 		case EPOLL_TYPE_PING:
 			icmp_sock_handler(&c, ref);
 			break;
+		case EPOLL_TYPE_VHOST_CMD:
+			tap_handler_vu(&vdev, c.fd_tap, eventmask);
+			break;
+		case EPOLL_TYPE_VHOST_KICK:
+			vu_kick_cb(&vdev, ref, &now);
+			break;
 		default:
 			/* Can't happen */
 			ASSERT(0);
diff --git a/passt.h b/passt.h
index 031c9b6..a98f043 100644
--- a/passt.h
+++ b/passt.h
@@ -25,6 +25,8 @@ union epoll_ref;
 #include "fwd.h"
 #include "tcp.h"
 #include "udp.h"
+#include "udp_vu.h"
+#include "vhost_user.h"
 
 /* Default address for our end on the tap interface.  Bit 0 of byte 0 must be 0
  * (unicast) and bit 1 of byte 1 must be 1 (locally administered).  Otherwise
@@ -94,6 +96,7 @@ struct fqdn {
 enum passt_modes {
 	MODE_PASST,
 	MODE_PASTA,
+	MODE_VU,
 };
 
 /**
@@ -227,6 +230,7 @@ struct ip6_ctx {
  * @no_ra:		Disable router advertisements
  * @low_wmem:		Low probed net.core.wmem_max
  * @low_rmem:		Low probed net.core.rmem_max
+ * @vdev:		vhost-user device
  */
 struct ctx {
 	enum passt_modes mode;
@@ -287,6 +291,8 @@ struct ctx {
 
 	int low_wmem;
 	int low_rmem;
+
+	struct vu_dev *vdev;
 };
 
 void proto_update_l2_buf(const unsigned char *eth_d,
diff --git a/pcap.c b/pcap.c
index 46cc4b0..7e9c560 100644
--- a/pcap.c
+++ b/pcap.c
@@ -140,7 +140,6 @@ void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
  *		containing packet data to write, including L2 header
  * @iovcnt:	Number of buffers (@iov entries)
  */
-/* cppcheck-suppress unusedFunction */
 void pcap_iov(const struct iovec *iov, size_t iovcnt)
 {
 	struct timespec now;
diff --git a/tap.c b/tap.c
index 852d837..9cb2092 100644
--- a/tap.c
+++ b/tap.c
@@ -58,6 +58,7 @@
 #include "packet.h"
 #include "tap.h"
 #include "log.h"
+#include "vhost_user.h"
 
 /* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */
 static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf);
@@ -78,16 +79,22 @@ void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
 	struct iovec iov[2];
 	size_t iovcnt = 0;
 
-	if (c->mode == MODE_PASST) {
+	switch (c->mode) {
+	case MODE_PASST:
 		iov[iovcnt] = IOV_OF_LVALUE(vnet_len);
 		iovcnt++;
-	}
-
-	iov[iovcnt].iov_base = (void *)data;
-	iov[iovcnt].iov_len = l2len;
-	iovcnt++;
+		/* fall through */
+	case MODE_PASTA:
+		iov[iovcnt].iov_base = (void *)data;
+		iov[iovcnt].iov_len = l2len;
+		iovcnt++;
 
-	tap_send_frames(c, iov, iovcnt, 1);
+		tap_send_frames(c, iov, iovcnt, 1);
+		break;
+	case MODE_VU:
+		vu_send(c->vdev, data, l2len);
+		break;
+	}
 }
 
 /**
@@ -406,10 +413,18 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
 	if (!nframes)
 		return 0;
 
-	if (c->mode == MODE_PASTA)
+	switch (c->mode) {
+	case MODE_PASTA:
 		m = tap_send_frames_pasta(c, iov, bufs_per_frame, nframes);
-	else
+		break;
+	case MODE_PASST:
 		m = tap_send_frames_passt(c, iov, bufs_per_frame, nframes);
+		break;
+	case MODE_VU:
+		/* fall through */
+	default:
+		ASSERT(0);
+	}
 
 	if (m < nframes)
 		debug("tap: failed to send %zu frames of %zu",
@@ -968,7 +983,7 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
  * tap_sock_reset() - Handle closing or failure of connect AF_UNIX socket
  * @c:		Execution context
  */
-static void tap_sock_reset(struct ctx *c)
+void tap_sock_reset(struct ctx *c)
 {
 	info("Client connection closed%s", c->one_off ? ", exiting" : "");
 
@@ -979,6 +994,8 @@ static void tap_sock_reset(struct ctx *c)
 	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_tap, NULL);
 	close(c->fd_tap);
 	c->fd_tap = -1;
+	if (c->mode == MODE_VU)
+		vu_cleanup(c->vdev);
 }
 
 /**
@@ -1178,11 +1195,17 @@ static void tap_sock_unix_init(struct ctx *c)
 	ev.data.u64 = ref.u64;
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap_listen, &ev);
 
-	info("\nYou can now start qemu (>= 7.2, with commit 13c6be96618c):");
-	info("    kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
-	     c->sock_path);
-	info("or qrap, for earlier qemu versions:");
-	info("    ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
+	if (c->mode == MODE_VU) {
+		info("You can start qemu with:");
+		info("    kvm ... -chardev socket,id=chr0,path=%s -netdev vhost-user,id=netdev0,chardev=chr0 -device virtio-net,netdev=netdev0 -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE -numa node,memdev=memfd0\n",
+		     c->sock_path);
+	} else {
+		info("\nYou can now start qemu (>= 7.2, with commit 13c6be96618c):");
+		info("    kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
+		     c->sock_path);
+		info("or qrap, for earlier qemu versions:");
+		info("    ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
+	}
 }
 
 /**
@@ -1192,8 +1215,8 @@ static void tap_sock_unix_init(struct ctx *c)
  */
 void tap_listen_handler(struct ctx *c, uint32_t events)
 {
-	union epoll_ref ref = { .type = EPOLL_TYPE_TAP_PASST };
 	struct epoll_event ev = { 0 };
+	union epoll_ref ref;
 	int v = INT_MAX / 2;
 	struct ucred ucred;
 	socklen_t len;
@@ -1233,6 +1256,10 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
 		trace("tap: failed to set SO_SNDBUF to %i", v);
 
 	ref.fd = c->fd_tap;
+	if (c->mode == MODE_VU)
+		ref.type = EPOLL_TYPE_VHOST_CMD;
+	else
+		ref.type = EPOLL_TYPE_TAP_PASST;
 	ev.events = EPOLLIN | EPOLLRDHUP;
 	ev.data.u64 = ref.u64;
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
@@ -1294,21 +1321,47 @@ static void tap_sock_tun_init(struct ctx *c)
 	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
 }
 
+void tap_sock_update_buf(void *base, size_t size)
+{
+	int i;
+
+	pool_tap4_storage.buf = base;
+	pool_tap4_storage.buf_size = size;
+	pool_tap6_storage.buf = base;
+	pool_tap6_storage.buf_size = size;
+
+	for (i = 0; i < TAP_SEQS; i++) {
+		tap4_l4[i].p.buf = base;
+		tap4_l4[i].p.buf_size = size;
+		tap6_l4[i].p.buf = base;
+		tap6_l4[i].p.buf_size = size;
+	}
+}
+
 /**
  * tap_sock_init() - Create and set up AF_UNIX socket or tuntap file descriptor
  * @c:		Execution context
  */
 void tap_sock_init(struct ctx *c)
 {
-	size_t sz = sizeof(pkt_buf);
+	size_t sz;
+	char *buf;
 	int i;
 
-	pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS, pkt_buf, sz);
-	pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS, pkt_buf, sz);
+	if (c->mode == MODE_VU) {
+		buf = NULL;
+		sz = 0;
+	} else {
+		buf = pkt_buf;
+		sz = sizeof(pkt_buf);
+	}
+
+	pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS, buf, sz);
+	pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS, buf, sz);
 
 	for (i = 0; i < TAP_SEQS; i++) {
-		tap4_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, pkt_buf, sz);
-		tap6_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, pkt_buf, sz);
+		tap4_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, buf, sz);
+		tap6_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, buf, sz);
 	}
 
 	if (c->fd_tap != -1) { /* Passed as --fd */
@@ -1317,10 +1370,17 @@ void tap_sock_init(struct ctx *c)
 
 		ASSERT(c->one_off);
 		ref.fd = c->fd_tap;
-		if (c->mode == MODE_PASST)
+		switch (c->mode) {
+		case MODE_PASST:
 			ref.type = EPOLL_TYPE_TAP_PASST;
-		else
+			break;
+		case MODE_PASTA:
 			ref.type = EPOLL_TYPE_TAP_PASTA;
+			break;
+		case MODE_VU:
+			ref.type = EPOLL_TYPE_VHOST_CMD;
+			break;
+		}
 
 		ev.events = EPOLLIN | EPOLLRDHUP;
 		ev.data.u64 = ref.u64;
diff --git a/tap.h b/tap.h
index ec9e2ac..c5447f7 100644
--- a/tap.h
+++ b/tap.h
@@ -40,7 +40,8 @@ static inline struct iovec tap_hdr_iov(const struct ctx *c,
  */
 static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
 {
-	thdr->vnet_len = htonl(l2len);
+	if (thdr)
+		thdr->vnet_len = htonl(l2len);
 }
 
 void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
@@ -68,6 +69,8 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
 void tap_handler_passt(struct ctx *c, uint32_t events,
 		       const struct timespec *now);
 int tap_sock_unix_open(char *sock_path);
+void tap_sock_reset(struct ctx *c);
+void tap_sock_update_buf(void *base, size_t size);
 void tap_sock_init(struct ctx *c);
 void tap_flush_pools(void);
 void tap_handler(struct ctx *c, const struct timespec *now);
diff --git a/tcp.c b/tcp.c
index 77c62f0..dabac96 100644
--- a/tcp.c
+++ b/tcp.c
@@ -304,6 +304,7 @@
 #include "flow_table.h"
 #include "tcp_internal.h"
 #include "tcp_buf.h"
+#include "tcp_vu.h"
 
 /* MSS rounding: see SET_MSS() */
 #define MSS_DEFAULT			536
@@ -896,6 +897,7 @@ static void tcp_fill_header(struct tcphdr *th,
 
 /**
  * tcp_fill_headers4() - Fill 802.3, IPv4, TCP headers in pre-cooked buffers
+ * @c:		Execution context
  * @conn:	Connection pointer
  * @taph:	tap backend specific header
  * @iph:	Pointer to IPv4 header
@@ -906,7 +908,8 @@ static void tcp_fill_header(struct tcphdr *th,
  *
  * Return: The IPv4 payload length, host order
  */
-static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
+static size_t tcp_fill_headers4(const struct ctx *c,
+				const struct tcp_tap_conn *conn,
 				struct tap_hdr *taph,
 				struct iphdr *iph, struct tcphdr *th,
 				size_t dlen, const uint16_t *check,
@@ -929,7 +932,10 @@ static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
 
 	tcp_fill_header(th, conn, seq);
 
-	tcp_update_check_tcp4(iph, th);
+	if (c->mode != MODE_VU)
+		tcp_update_check_tcp4(iph, th);
+	else
+		th->check = 0;
 
 	tap_hdr_update(taph, l3len + sizeof(struct ethhdr));
 
@@ -938,6 +944,7 @@ static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
 
 /**
  * tcp_fill_headers6() - Fill 802.3, IPv6, TCP headers in pre-cooked buffers
+ * @c:		Execution context
  * @conn:	Connection pointer
  * @taph:	tap backend specific header
  * @ip6h:	Pointer to IPv6 header
@@ -948,7 +955,8 @@ static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
  *
  * Return: The IPv6 payload length, host order
  */
-static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
+static size_t tcp_fill_headers6(const struct ctx *c,
+				const struct tcp_tap_conn *conn,
 				struct tap_hdr *taph,
 				struct ipv6hdr *ip6h, struct tcphdr *th,
 				size_t dlen, uint32_t seq)
@@ -970,7 +978,10 @@ static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
 
 	tcp_fill_header(th, conn, seq);
 
-	tcp_update_check_tcp6(ip6h, th);
+	if (c->mode != MODE_VU)
+		tcp_update_check_tcp6(ip6h, th);
+	else
+		th->check = 0;
 
 	tap_hdr_update(taph, l4len + sizeof(*ip6h) + sizeof(struct ethhdr));
 
@@ -979,6 +990,7 @@ static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
 
 /**
  * tcp_l2_buf_fill_headers() - Fill 802.3, IP, TCP headers in pre-cooked buffers
+ * @c:		Execution context
  * @conn:	Connection pointer
  * @iov:	Pointer to an array of iovec of TCP pre-cooked buffers
  * @dlen:	TCP payload length
@@ -987,7 +999,8 @@ static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
  *
  * Return: IP payload length, host order
  */
-size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
+size_t tcp_l2_buf_fill_headers(const struct ctx *c,
+			       const struct tcp_tap_conn *conn,
 			       struct iovec *iov, size_t dlen,
 			       const uint16_t *check, uint32_t seq)
 {
@@ -995,13 +1008,13 @@ size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
 	const struct in_addr *a4 = inany_v4(&tapside->oaddr);
 
 	if (a4) {
-		return tcp_fill_headers4(conn, iov[TCP_IOV_TAP].iov_base,
+		return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base,
 					 iov[TCP_IOV_IP].iov_base,
 					 iov[TCP_IOV_PAYLOAD].iov_base, dlen,
 					 check, seq);
 	}
 
-	return tcp_fill_headers6(conn, iov[TCP_IOV_TAP].iov_base,
+	return tcp_fill_headers6(c, conn, iov[TCP_IOV_TAP].iov_base,
 				 iov[TCP_IOV_IP].iov_base,
 				 iov[TCP_IOV_PAYLOAD].iov_base, dlen,
 				 seq);
@@ -1237,6 +1250,9 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn,
  */
 int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
 {
+	if (c->mode == MODE_VU)
+		return tcp_vu_send_flag(c, conn, flags);
+
 	return tcp_buf_send_flag(c, conn, flags);
 }
 
@@ -1630,6 +1646,9 @@ static int tcp_sock_consume(const struct tcp_tap_conn *conn, uint32_t ack_seq)
  */
 static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
 {
+	if (c->mode == MODE_VU)
+		return tcp_vu_data_from_sock(c, conn);
+
 	return tcp_buf_data_from_sock(c, conn);
 }
 
diff --git a/tcp_buf.c b/tcp_buf.c
index c31e9f3..6b702b0 100644
--- a/tcp_buf.c
+++ b/tcp_buf.c
@@ -321,7 +321,7 @@ int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
 		return ret;
 	}
 
-	l4len = tcp_l2_buf_fill_headers(conn, iov, optlen, NULL, seq);
+	l4len = tcp_l2_buf_fill_headers(c, conn, iov, optlen, NULL, seq);
 	iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 
 	if (flags & DUP_ACK) {
@@ -378,7 +378,7 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
 		tcp4_frame_conns[tcp4_payload_used] = conn;
 
 		iov = tcp4_l2_iov[tcp4_payload_used++];
-		l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, check, seq);
+		l4len = tcp_l2_buf_fill_headers(c, conn, iov, dlen, check, seq);
 		iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 		if (tcp4_payload_used > TCP_FRAMES_MEM - 1)
 			tcp_payload_flush(c);
@@ -386,7 +386,7 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
 		tcp6_frame_conns[tcp6_payload_used] = conn;
 
 		iov = tcp6_l2_iov[tcp6_payload_used++];
-		l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, NULL, seq);
+		l4len = tcp_l2_buf_fill_headers(c, conn, iov, dlen, NULL, seq);
 		iov[TCP_IOV_PAYLOAD].iov_len = l4len;
 		if (tcp6_payload_used > TCP_FRAMES_MEM - 1)
 			tcp_payload_flush(c);
diff --git a/tcp_internal.h b/tcp_internal.h
index aa8bb64..04b0011 100644
--- a/tcp_internal.h
+++ b/tcp_internal.h
@@ -89,7 +89,8 @@ void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
 		tcp_rst_do(c, conn);					\
 	} while (0)
 
-size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
+size_t tcp_l2_buf_fill_headers(const struct ctx *c,
+			       const struct tcp_tap_conn *conn,
 			       struct iovec *iov, size_t dlen,
 			       const uint16_t *check, uint32_t seq);
 int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
diff --git a/tcp_vu.c b/tcp_vu.c
new file mode 100644
index 0000000..b5f5d2c
--- /dev/null
+++ b/tcp_vu.c
@@ -0,0 +1,593 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ *
+ * tcp_vu.c - TCP L2 vhost-user management functions
+ */
+
+#include <errno.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#include <netinet/ip.h>
+
+#include <sys/socket.h>
+
+#include <linux/tcp.h>
+#include <linux/virtio_net.h>
+
+#include "util.h"
+#include "ip.h"
+#include "passt.h"
+#include "siphash.h"
+#include "inany.h"
+#include "vhost_user.h"
+#include "tcp.h"
+#include "pcap.h"
+#include "flow.h"
+#include "tcp_conn.h"
+#include "flow_table.h"
+#include "tcp_vu.h"
+#include "tcp_internal.h"
+#include "checksum.h"
+#include "vu_common.h"
+
+/**
+ * struct tcp_payload_t - TCP header and data to send segments with payload
+ * @th:		TCP header
+ * @data:	TCP data
+ */
+struct tcp_payload_t {
+	struct tcphdr th;
+	uint8_t data[IP_MAX_MTU - sizeof(struct tcphdr)];
+};
+
+/**
+ * struct tcp_flags_t - TCP header and data to send zero-length
+ *                      segments (flags)
+ * @th:		TCP header
+ * @opts	TCP options
+ */
+struct tcp_flags_t {
+	struct tcphdr th;
+	char opts[OPT_MSS_LEN + OPT_WS_LEN + 1];
+};
+
+/* virtio-net header template for frames we send to the guest */
+static const struct virtio_net_hdr vu_header = {
+	.flags = VIRTIO_NET_HDR_F_DATA_VALID,
+	.gso_type = VIRTIO_NET_HDR_GSO_NONE,
+};
+
+static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE];
+static struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
+
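+/**
+ * tcp_vu_l2_hdrlen() - Compute the size of the headers of an L2 frame
+ * @vdev:	vhost-user device
+ * @v6:		Set for IPv6 frames
+ *
+ * Return: the length of the virtio-net, Ethernet, IP and TCP headers
+ */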
+static size_t tcp_vu_l2_hdrlen(const struct vu_dev *vdev, bool v6)
+{
+	size_t l2_hdrlen;
+
+	l2_hdrlen = vdev->hdrlen + sizeof(struct ethhdr) +
+		    sizeof(struct tcphdr);
+
+	if (v6)
+		l2_hdrlen += sizeof(struct ipv6hdr);
+	else
+		l2_hdrlen += sizeof(struct iphdr);
+
+	return l2_hdrlen;
+}
+
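+/**
+ * tcp_vu_pcap() - Capture a single frame to pcap file, if capture is enabled
+ * @c:		Execution context
+ * @tapside:	Tap-side flow information
+ * @iov:	iovec array containing the frame
+ * @iov_used:	Number of iovec entries used by the frame
+ * @l4len:	TCP header and payload length
+ *
+ * The TCP checksum is computed here as, in vhost-user mode, it's not
+ * calculated on the data path.
+ */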
+static void tcp_vu_pcap(const struct ctx *c, const struct flowside *tapside,
+			struct iovec *iov, int iov_used, size_t l4len)
+{
+	const struct in_addr *src = inany_v4(&tapside->oaddr);
+	const struct in_addr *dst = inany_v4(&tapside->eaddr);
+	const struct vu_dev *vdev = c->vdev;
+	char *base = iov[0].iov_base;
+	size_t size = iov[0].iov_len;
+	struct tcp_payload_t *bp;
+	uint32_t sum;
+
+	if (!*c->pcap)
+		return;
+
+	if (src && dst) {
+		bp = vu_payloadv4(vdev, base);
+		sum = proto_ipv4_header_psum(l4len, IPPROTO_TCP,
+					     *src, *dst);
+	} else {
+		bp = vu_payloadv6(vdev, base);
+		sum = proto_ipv6_header_psum(l4len, IPPROTO_TCP,
+					     &tapside->oaddr.a6,
+					     &tapside->eaddr.a6);
+	}
+	iov[0].iov_base = &bp->th;
+	iov[0].iov_len = size - ((char *)iov[0].iov_base - base);
+	bp->th.check = 0;
+	bp->th.check = csum_iov(iov, iov_used, sum);
+
+	/* set iov for pcap logging */
+	iov[0].iov_base = base + vdev->hdrlen;
+	iov[0].iov_len = size - vdev->hdrlen;
+
+	pcap_iov(iov, iov_used);
+
+	/* restore iov[0] */
+	iov[0].iov_base = base;
+	iov[0].iov_len = size;
+}
+
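+/**
+ * tcp_vu_send_flag() - Send segment with flags to vhost-user (no payload)
+ * @c:		Execution context
+ * @conn:	Connection pointer
+ * @flags:	TCP flags: if not set, send segment only if ACK is due
+ *
+ * Return: negative error code on connection reset, 0 otherwise
+ */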
+int tcp_vu_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
+{
+	struct vu_dev *vdev = c->vdev;
+	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	const struct flowside *tapside = TAPFLOW(conn);
+	struct virtio_net_hdr_mrg_rxbuf *vh;
+	struct iovec l2_iov[TCP_NUM_IOVS];
+	size_t l2len, l4len, optlen;
+	struct iovec in_sg;
+	struct ethhdr *eh;
+	int nb_ack;
+	int ret;
+
+	elem[0].out_num = 0;
+	elem[0].out_sg = NULL;
+	elem[0].in_num = 1;
+	elem[0].in_sg = &in_sg;
+	ret = vu_queue_pop(vdev, vq, &elem[0]);
+	if (ret < 0)
+		return 0;
+
+	if (elem[0].in_num < 1) {
+		err("virtio-net receive queue contains no in buffers");
+		vu_queue_rewind(vq, 1);
+		return 0;
+	}
+
+	vh = elem[0].in_sg[0].iov_base;
+
+	vh->hdr = vu_header;
+	if (vdev->hdrlen == sizeof(struct virtio_net_hdr_mrg_rxbuf))
+		vh->num_buffers = htole16(1);
+
+	l2_iov[TCP_IOV_TAP].iov_base = NULL;
+	l2_iov[TCP_IOV_TAP].iov_len = 0;
+	l2_iov[TCP_IOV_ETH].iov_base = (char *)elem[0].in_sg[0].iov_base + vdev->hdrlen;
+	l2_iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
+
+	eh = l2_iov[TCP_IOV_ETH].iov_base;
+
+	memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
+	memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
+
+	if (CONN_V4(conn)) {
+		struct tcp_flags_t *payload;
+		struct iphdr *iph;
+		uint32_t seq;
+
+		l2_iov[TCP_IOV_IP].iov_base = (char *)l2_iov[TCP_IOV_ETH].iov_base +
+						      l2_iov[TCP_IOV_ETH].iov_len;
+		l2_iov[TCP_IOV_IP].iov_len = sizeof(struct iphdr);
+		l2_iov[TCP_IOV_PAYLOAD].iov_base = (char *)l2_iov[TCP_IOV_IP].iov_base +
+							   l2_iov[TCP_IOV_IP].iov_len;
+
+		eh->h_proto = htons(ETH_P_IP);
+
+		iph = l2_iov[TCP_IOV_IP].iov_base;
+		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
+
+		payload = l2_iov[TCP_IOV_PAYLOAD].iov_base;
+		payload->th = (struct tcphdr){
+			.doff = offsetof(struct tcp_flags_t, opts) / 4,
+			.ack = 1
+		};
+
+		seq = conn->seq_to_tap;
+		ret = tcp_prepare_flags(c, conn, flags, &payload->th, payload->opts, &optlen);
+		if (ret <= 0) {
+			vu_queue_rewind(vq, 1);
+			return ret;
+		}
+
+		l4len = tcp_l2_buf_fill_headers(c, conn, l2_iov, optlen, NULL,
+						seq);
+		/* keep the following assignment for clarity */
+		/* cppcheck-suppress unreadVariable */
+		l2_iov[TCP_IOV_PAYLOAD].iov_len = l4len;
+
+		l2len = l4len + sizeof(*iph) + sizeof(struct ethhdr);
+	} else {
+		struct tcp_flags_t *payload;
+		struct ipv6hdr *ip6h;
+		uint32_t seq;
+
+		l2_iov[TCP_IOV_IP].iov_base = (char *)l2_iov[TCP_IOV_ETH].iov_base +
+						      l2_iov[TCP_IOV_ETH].iov_len;
+		l2_iov[TCP_IOV_IP].iov_len = sizeof(struct ipv6hdr);
+		l2_iov[TCP_IOV_PAYLOAD].iov_base = (char *)l2_iov[TCP_IOV_IP].iov_base +
+							   l2_iov[TCP_IOV_IP].iov_len;
+
+		eh->h_proto = htons(ETH_P_IPV6);
+
+		ip6h = l2_iov[TCP_IOV_IP].iov_base;
+		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
+
+		payload = l2_iov[TCP_IOV_PAYLOAD].iov_base;
+		payload->th = (struct tcphdr){
+			.doff = offsetof(struct tcp_flags_t, opts) / 4,
+			.ack = 1
+		};
+
+		seq = conn->seq_to_tap;
+		ret = tcp_prepare_flags(c, conn, flags, &payload->th, payload->opts, &optlen);
+		if (ret <= 0) {
+			vu_queue_rewind(vq, 1);
+			return ret;
+		}
+
+		l4len = tcp_l2_buf_fill_headers(c, conn, l2_iov, optlen, NULL,
+						seq);
+		/* keep the following assignment for clarity */
+		/* cppcheck-suppress unreadVariable */
+		l2_iov[TCP_IOV_PAYLOAD].iov_len = l4len;
+
+		l2len = l4len + sizeof(*ip6h) + sizeof(struct ethhdr);
+	}
+	l2len += vdev->hdrlen;
+	ASSERT(l2len <= elem[0].in_sg[0].iov_len);
+
+	elem[0].in_sg[0].iov_len = l2len;
+	tcp_vu_pcap(c, tapside, &elem[0].in_sg[0], 1, l4len);
+
+	vu_queue_fill(vq, &elem[0], l2len, 0);
+	nb_ack = 1;
+
+	if (flags & DUP_ACK) {
+		struct iovec in_sg_dup;
+
+		elem[1].out_num = 0;
+		elem[1].out_sg = NULL;
+		elem[1].in_num = 1;
+		elem[1].in_sg = &in_sg_dup;
+		ret = vu_queue_pop(vdev, vq, &elem[1]);
+		if (ret == 0) {
+			if (elem[1].in_num < 1 || elem[1].in_sg[0].iov_len < l2len) {
+				vu_queue_rewind(vq, 1);
+			} else {
+				memcpy(elem[1].in_sg[0].iov_base, vh, l2len);
+				nb_ack++;
+
+				tcp_vu_pcap(c, tapside, &elem[1].in_sg[0], 1,
+					    l4len);
+
+				vu_queue_fill(vq, &elem[1], l2len, 1);
+			}
+		}
+	}
+
+	vu_queue_flush(vq, nb_ack);
+	vu_queue_notify(vdev, vq);
+
+	return 0;
+}
+
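+/**
+ * tcp_vu_sock_recv() - Receive socket data into vhost-user buffers
+ * @c:		Execution context
+ * @conn:	Connection pointer
+ * @v4:		Set for IPv4 connections
+ * @fillsize:	Maximum number of bytes to receive
+ * @mss:	Connection MSS
+ * @data_len:	Size of the data received (output)
+ *
+ * Return: number of iov entries used to store the data, 0 if no buffers
+ *         are available or nothing was received, negative error code on
+ *         socket error
+ */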
+static ssize_t tcp_vu_sock_recv(struct ctx *c,
+				struct tcp_tap_conn *conn, bool v4,
+				size_t fillsize, uint16_t mss,
+				ssize_t *data_len)
+{
+	struct vu_dev *vdev = c->vdev;
+	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	static struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
+	struct msghdr mh_sock = { 0 };
+	static int in_sg_count;
+	int s = conn->sock;
+	size_t l2_hdrlen;
+	int segment_size;
+	int iov_cnt;
+	ssize_t ret;
+
+	l2_hdrlen = tcp_vu_l2_hdrlen(vdev, !v4);
+
+	iov_cnt = 0;
+	in_sg_count = 0;
+	segment_size = 0;
+	*data_len = 0;
+	while (fillsize > 0 && iov_cnt < VIRTQUEUE_MAX_SIZE - 1 &&
+			       in_sg_count < ARRAY_SIZE(in_sg)) {
+
+		elem[iov_cnt].out_num = 0;
+		elem[iov_cnt].out_sg = NULL;
+		elem[iov_cnt].in_num = ARRAY_SIZE(in_sg) - in_sg_count;
+		elem[iov_cnt].in_sg = &in_sg[in_sg_count];
+		ret = vu_queue_pop(vdev, vq, &elem[iov_cnt]);
+		if (ret < 0)
+			break;
+
+		if (elem[iov_cnt].in_num < 1)
+			die("virtio-net receive queue contains no in buffers");
+
+		in_sg_count += elem[iov_cnt].in_num;
+
+		ASSERT(elem[iov_cnt].in_num == 1);
+		ASSERT(elem[iov_cnt].in_sg[0].iov_len >= l2_hdrlen);
+
+		if (segment_size == 0) {
+			iov_vu[iov_cnt + 1].iov_base =
+					(char *)elem[iov_cnt].in_sg[0].iov_base + l2_hdrlen;
+			iov_vu[iov_cnt + 1].iov_len =
+					elem[iov_cnt].in_sg[0].iov_len - l2_hdrlen;
+		} else {
+			iov_vu[iov_cnt + 1].iov_base = elem[iov_cnt].in_sg[0].iov_base;
+			iov_vu[iov_cnt + 1].iov_len = elem[iov_cnt].in_sg[0].iov_len;
+		}
+
+		if (iov_vu[iov_cnt + 1].iov_len > fillsize)
+			iov_vu[iov_cnt + 1].iov_len = fillsize;
+
+		segment_size += iov_vu[iov_cnt + 1].iov_len;
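+		/* Without VIRTIO_NET_F_MRG_RXBUF, a frame can't span
+		 * several buffers: start a new segment on the next
+		 * element. Otherwise, close the segment once it
+		 * reaches the MSS.
+		 */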
+		if (vdev->hdrlen != sizeof(struct virtio_net_hdr_mrg_rxbuf)) {
+			segment_size = 0;
+		} else if (segment_size >= mss) {
+			iov_vu[iov_cnt + 1].iov_len -= segment_size - mss;
+			segment_size = 0;
+		}
+		fillsize -= iov_vu[iov_cnt + 1].iov_len;
+
+		iov_cnt++;
+	}
+	if (iov_cnt == 0)
+		return 0;
+
+	mh_sock.msg_iov = iov_vu;
+	mh_sock.msg_iovlen = iov_cnt + 1;
+
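+	/* Peek only: data stays queued on the socket until the guest
+	 * acknowledges it, and the already-sent part is re-read into
+	 * the discard buffer at iov_vu[0]
+	 */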
+	do
+		ret = recvmsg(s, &mh_sock, MSG_PEEK);
+	while (ret < 0 && errno == EINTR);
+
+	if (ret < 0) {
+		vu_queue_rewind(vq, iov_cnt);
+		if (errno != EAGAIN && errno != EWOULDBLOCK) {
+			ret = -errno;
+			tcp_rst(c, conn);
+		}
+		return ret;
+	}
+	if (!ret) {
+		vu_queue_rewind(vq, iov_cnt);
+
+		if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
+			int retf = tcp_vu_send_flag(c, conn, FIN | ACK);
+			if (retf) {
+				tcp_rst(c, conn);
+				return retf;
+			}
+
+			conn_event(c, conn, TAP_FIN_SENT);
+		}
+		return 0;
+	}
+
+	*data_len = ret;
+	return iov_cnt;
+}
+
+static size_t tcp_vu_prepare(const struct ctx *c,
+			     struct tcp_tap_conn *conn, struct iovec *first,
+			     size_t data_len, const uint16_t **check)
+{
+	const struct flowside *toside = TAPFLOW(conn);
+	const struct vu_dev *vdev = c->vdev;
+	struct iovec l2_iov[TCP_NUM_IOVS];
+	char *base = first->iov_base;
+	struct ethhdr *eh;
+	size_t l4len;
+
+	/* We guess the first iovec provided by the guest can embed
+	 * all the headers needed by the L2 frame
+	 */
+
+	l2_iov[TCP_IOV_TAP].iov_base = NULL;
+	l2_iov[TCP_IOV_TAP].iov_len = 0;
+	l2_iov[TCP_IOV_ETH].iov_base = base + vdev->hdrlen;
+	l2_iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
+
+	eh = l2_iov[TCP_IOV_ETH].iov_base;
+
+	memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
+	memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
+
+	/* initialize header */
+	if (inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr)) {
+		struct tcp_payload_t *payload;
+		struct iphdr *iph;
+
+		ASSERT(first[0].iov_len >= vdev->hdrlen +
+		       sizeof(struct ethhdr) + sizeof(struct iphdr) +
+		       sizeof(struct tcphdr));
+
+		l2_iov[TCP_IOV_IP].iov_base = (char *)l2_iov[TCP_IOV_ETH].iov_base +
+						      l2_iov[TCP_IOV_ETH].iov_len;
+		l2_iov[TCP_IOV_IP].iov_len = sizeof(struct iphdr);
+		l2_iov[TCP_IOV_PAYLOAD].iov_base = (char *)l2_iov[TCP_IOV_IP].iov_base +
+							   l2_iov[TCP_IOV_IP].iov_len;
+
+		eh->h_proto = htons(ETH_P_IP);
+
+		iph = l2_iov[TCP_IOV_IP].iov_base;
+		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
+		payload = l2_iov[TCP_IOV_PAYLOAD].iov_base;
+		payload->th = (struct tcphdr){
+			.doff = offsetof(struct tcp_payload_t, data) / 4,
+			.ack = 1
+		};
+
+		l4len = tcp_l2_buf_fill_headers(c, conn, l2_iov,
+						data_len, *check,
+						conn->seq_to_tap);
+		/* keep the following assignment for clarity */
+		/* cppcheck-suppress unreadVariable */
+		l2_iov[TCP_IOV_PAYLOAD].iov_len = l4len;
+
+		*check = &iph->check;
+	} else {
+		struct tcp_payload_t *payload;
+		struct ipv6hdr *ip6h;
+
+		ASSERT(first[0].iov_len >= vdev->hdrlen +
+		       sizeof(struct ethhdr) + sizeof(struct ipv6hdr) +
+		       sizeof(struct tcphdr));
+
+		l2_iov[TCP_IOV_IP].iov_base = (char *)l2_iov[TCP_IOV_ETH].iov_base +
+						      l2_iov[TCP_IOV_ETH].iov_len;
+		l2_iov[TCP_IOV_IP].iov_len = sizeof(struct ipv6hdr);
+		l2_iov[TCP_IOV_PAYLOAD].iov_base = (char *)l2_iov[TCP_IOV_IP].iov_base +
+							   l2_iov[TCP_IOV_IP].iov_len;
+
+		eh->h_proto = htons(ETH_P_IPV6);
+
+		ip6h = l2_iov[TCP_IOV_IP].iov_base;
+		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
+
+		payload = l2_iov[TCP_IOV_PAYLOAD].iov_base;
+		payload->th = (struct tcphdr){
+			.doff = offsetof(struct tcp_payload_t, data) / 4,
+			.ack = 1
+		};
+
+		l4len = tcp_l2_buf_fill_headers(c, conn, l2_iov,
+						data_len,
+						NULL, conn->seq_to_tap);
+		/* keep the following assignment for clarity */
+		/* cppcheck-suppress unreadVariable */
+		l2_iov[TCP_IOV_PAYLOAD].iov_len = l4len;
+	}
+
+	return l4len;
+}
+
+int tcp_vu_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
+{
+	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
+	struct vu_dev *vdev = c->vdev;
+	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	const struct flowside *tapside = TAPFLOW(conn);
+	uint16_t mss = MSS_GET(conn);
+	size_t l2_hdrlen, fillsize;
+	int i, iov_cnt, iov_used;
+	int v4 = CONN_V4(conn);
+	uint32_t already_sent;
+	const uint16_t *check;
+	struct iovec *first;
+	int segment_size;
+	int num_buffers;
+	ssize_t len;
+
+	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
+		flow_err(conn,
+			 "Got packet, but no available descriptors on RX virtq.");
+		return 0;
+	}
+
+	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
+
+	if (SEQ_LT(already_sent, 0)) {
+		/* RFC 761, section 2.1. */
+		flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
+			   conn->seq_ack_from_tap, conn->seq_to_tap);
+		conn->seq_to_tap = conn->seq_ack_from_tap;
+		already_sent = 0;
+	}
+
+	if (!wnd_scaled || already_sent >= wnd_scaled) {
+		conn_flag(c, conn, STALLED);
+		conn_flag(c, conn, ACK_FROM_TAP_DUE);
+		return 0;
+	}
+
+	/* Set up buffer descriptors we'll fill completely and partially. */
+
+	fillsize = wnd_scaled;
+
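+	/* Bytes between seq_ack_from_tap and seq_to_tap were already
+	 * sent: recvmsg() with MSG_PEEK re-reads them, so direct them
+	 * to the discard buffer
+	 */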
+	iov_vu[0].iov_base = tcp_buf_discard;
+	iov_vu[0].iov_len = already_sent;
+	fillsize -= already_sent;
+
+	iov_cnt = tcp_vu_sock_recv(c, conn, v4, fillsize, mss, &len);
+	if (iov_cnt <= 0)
+		return iov_cnt;
+
+	len -= already_sent;
+	if (len <= 0) {
+		conn_flag(c, conn, STALLED);
+		vu_queue_rewind(vq, iov_cnt);
+		return 0;
+	}
+
+	conn_flag(c, conn, ~STALLED);
+
+	/* Likely, some new data was acked too. */
+	tcp_update_seqack_wnd(c, conn, 0, NULL);
+
+	/* initialize headers */
+	l2_hdrlen = tcp_vu_l2_hdrlen(vdev, !v4);
+	iov_used = 0;
+	num_buffers = 0;
+	check = NULL;
+	segment_size = 0;
+	for (i = 0; i < iov_cnt && len; i++) {
+
+		if (segment_size == 0)
+			first = &iov_vu[i + 1];
+
+		if (iov_vu[i + 1].iov_len > (size_t)len)
+			iov_vu[i + 1].iov_len = len;
+
+		len -= iov_vu[i + 1].iov_len;
+		iov_used++;
+
+		segment_size += iov_vu[i + 1].iov_len;
+		num_buffers++;
+
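+		/* Close the segment when it reaches the MSS, when data
+		 * or buffers run out, or when frames can't span several
+		 * buffers (no VIRTIO_NET_F_MRG_RXBUF)
+		 */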
+		if (segment_size >= mss || len == 0 ||
+		    i + 1 == iov_cnt || vdev->hdrlen != sizeof(struct virtio_net_hdr_mrg_rxbuf)) {
+			struct virtio_net_hdr_mrg_rxbuf *vh;
+			size_t l4len;
+
+			if (i + 1 == iov_cnt)
+				check = NULL;
+
+			/* restore first iovec base: point to vnet header */
+			first->iov_base = (char *)first->iov_base - l2_hdrlen;
+			first->iov_len = first->iov_len + l2_hdrlen;
+
+			vh = first->iov_base;
+
+			vh->hdr = vu_header;
+			if (vdev->hdrlen == sizeof(struct virtio_net_hdr_mrg_rxbuf))
+				vh->num_buffers = htole16(num_buffers);
+
+			l4len = tcp_vu_prepare(c, conn, first, segment_size, &check);
+
+			tcp_vu_pcap(c, tapside, first, num_buffers, l4len);
+
+			conn->seq_to_tap += segment_size;
+
+			segment_size = 0;
+			num_buffers = 0;
+		}
+	}
+
+	/* release unused buffers */
+	vu_queue_rewind(vq, iov_cnt - iov_used);
+
+	/* send packets */
+	vu_send_frame(vdev, vq, elem, &iov_vu[1], iov_used);
+
+	conn_flag(c, conn, ACK_FROM_TAP_DUE);
+
+	return 0;
+}
diff --git a/tcp_vu.h b/tcp_vu.h
new file mode 100644
index 0000000..99daba5
--- /dev/null
+++ b/tcp_vu.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ */
+
+#ifndef TCP_VU_H
+#define TCP_VU_H
+
+int tcp_vu_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags);
+int tcp_vu_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn);
+
+#endif /* TCP_VU_H */
diff --git a/udp.c b/udp.c
index 8a93aad..b5792a5 100644
--- a/udp.c
+++ b/udp.c
@@ -109,8 +109,7 @@
 #include "pcap.h"
 #include "log.h"
 #include "flow_table.h"
-
-#define UDP_MAX_FRAMES		32  /* max # of frames to receive at once */
+#include "udp_internal.h"
 
 /* "Spliced" sockets indexed by bound port (host order) */
 static int udp_splice_ns  [IP_VERSIONS][NUM_PORTS];
@@ -118,20 +117,8 @@ static int udp_splice_init[IP_VERSIONS][NUM_PORTS];
 
 /* Static buffers */
 
-/**
- * struct udp_payload_t - UDP header and data for inbound messages
- * @uh:		UDP header
- * @data:	UDP data
- */
-static struct udp_payload_t {
-	struct udphdr uh;
-	char data[USHRT_MAX - sizeof(struct udphdr)];
-#ifdef __AVX2__
-} __attribute__ ((packed, aligned(32)))
-#else
-} __attribute__ ((packed, aligned(__alignof__(unsigned int))))
-#endif
-udp_payload[UDP_MAX_FRAMES];
+/* UDP header and data for inbound messages */
+static struct udp_payload_t udp_payload[UDP_MAX_FRAMES];
 
 /* Ethernet header for IPv4 frames */
 static struct ethhdr udp4_eth_hdr;
@@ -311,6 +298,7 @@ static void udp_splice_send(const struct ctx *c, size_t start, size_t n,
 
 /**
  * udp_update_hdr4() - Update headers for one IPv4 datagram
+ * @c:		Execution context
  * @ip4h:	Pre-filled IPv4 header (except for tot_len and saddr)
  * @bp:		Pointer to udp_payload_t to update
  * @toside:	Flowside for destination side
@@ -318,8 +306,9 @@ static void udp_splice_send(const struct ctx *c, size_t start, size_t n,
  *
  * Return: size of IPv4 payload (UDP header + data)
  */
-static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
-			      const struct flowside *toside, size_t dlen)
+size_t udp_update_hdr4(const struct ctx *c,
+		       struct iphdr *ip4h, struct udp_payload_t *bp,
+		       const struct flowside *toside, size_t dlen)
 {
 	const struct in_addr *src = inany_v4(&toside->oaddr);
 	const struct in_addr *dst = inany_v4(&toside->eaddr);
@@ -336,13 +325,17 @@ static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
 	bp->uh.source = htons(toside->oport);
 	bp->uh.dest = htons(toside->eport);
 	bp->uh.len = htons(l4len);
-	csum_udp4(&bp->uh, *src, *dst, bp->data, dlen);
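+	/* In vhost-user mode, VIRTIO_NET_HDR_F_DATA_VALID already tells
+	 * the guest the checksum is valid, so don't compute it here
+	 */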
+	if (c->mode != MODE_VU)
+		csum_udp4(&bp->uh, *src, *dst, bp->data, dlen);
+	else
+		bp->uh.check = 0;
 
 	return l4len;
 }
 
 /**
  * udp_update_hdr6() - Update headers for one IPv6 datagram
+ * @c:		Execution context
  * @ip6h:	Pre-filled IPv6 header (except for payload_len and addresses)
  * @bp:		Pointer to udp_payload_t to update
  * @toside:	Flowside for destination side
@@ -350,8 +343,9 @@ static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
  *
  * Return: size of IPv6 payload (UDP header + data)
  */
-static size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
-			      const struct flowside *toside, size_t dlen)
+size_t udp_update_hdr6(const struct ctx *c,
+		       struct ipv6hdr *ip6h, struct udp_payload_t *bp,
+		       const struct flowside *toside, size_t dlen)
 {
 	uint16_t l4len = dlen + sizeof(bp->uh);
 
@@ -365,19 +359,25 @@ static size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
 	bp->uh.source = htons(toside->oport);
 	bp->uh.dest = htons(toside->eport);
 	bp->uh.len = ip6h->payload_len;
-	csum_udp6(&bp->uh, &toside->oaddr.a6, &toside->eaddr.a6, bp->data, dlen);
+	if (c->mode != MODE_VU) {
+		csum_udp6(&bp->uh, &toside->oaddr.a6, &toside->eaddr.a6,
+			  bp->data, dlen);
+	} else {
+		bp->uh.check = 0xffff; /* zero checksum is invalid with IPv6 */
+	}
 
 	return l4len;
 }
 
 /**
  * udp_tap_prepare() - Convert one datagram into a tap frame
+ * @c:		Execution context
  * @mmh:	Receiving mmsghdr array
  * @idx:	Index of the datagram to prepare
  * @toside:	Flowside for destination side
  */
-static void udp_tap_prepare(const struct mmsghdr *mmh, unsigned idx,
-			    const struct flowside *toside)
+static void udp_tap_prepare(const struct ctx *c, const struct mmsghdr *mmh,
+			    unsigned idx, const struct flowside *toside)
 {
 	struct iovec (*tap_iov)[UDP_NUM_IOVS] = &udp_l2_iov[idx];
 	struct udp_payload_t *bp = &udp_payload[idx];
@@ -385,13 +385,15 @@ static void udp_tap_prepare(const struct mmsghdr *mmh, unsigned idx,
 	size_t l4len;
 
 	if (!inany_v4(&toside->eaddr) || !inany_v4(&toside->oaddr)) {
-		l4len = udp_update_hdr6(&bm->ip6h, bp, toside, mmh[idx].msg_len);
+		l4len = udp_update_hdr6(c, &bm->ip6h, bp, toside,
+					mmh[idx].msg_len);
 		tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip6h) +
 			       sizeof(udp6_eth_hdr));
 		(*tap_iov)[UDP_IOV_ETH] = IOV_OF_LVALUE(udp6_eth_hdr);
 		(*tap_iov)[UDP_IOV_IP] = IOV_OF_LVALUE(bm->ip6h);
 	} else {
-		l4len = udp_update_hdr4(&bm->ip4h, bp, toside, mmh[idx].msg_len);
+		l4len = udp_update_hdr4(c, &bm->ip4h, bp, toside,
+					mmh[idx].msg_len);
 		tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip4h) +
 			       sizeof(udp4_eth_hdr));
 		(*tap_iov)[UDP_IOV_ETH] = IOV_OF_LVALUE(udp4_eth_hdr);
@@ -408,7 +410,7 @@ static void udp_tap_prepare(const struct mmsghdr *mmh, unsigned idx,
  *
  * #syscalls recvmsg
  */
-static bool udp_sock_recverr(int s)
+bool udp_sock_recverr(int s)
 {
 	const struct sock_extended_err *ee;
 	const struct cmsghdr *hdr;
@@ -495,7 +497,7 @@ static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
 }
 
 /**
- * udp_listen_sock_handler() - Handle new data from socket
+ * udp_buf_listen_sock_handler() - Handle new data from socket
  * @c:		Execution context
  * @ref:	epoll reference
  * @events:	epoll events bitmap
@@ -503,8 +505,8 @@ static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
  *
  * #syscalls recvmmsg
  */
-void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
-			     uint32_t events, const struct timespec *now)
+void udp_buf_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
+				 uint32_t events, const struct timespec *now)
 {
 	struct mmsghdr *mmh_recv = ref.udp.v6 ? udp6_mh_recv : udp4_mh_recv;
 	int n, i;
@@ -527,7 +529,7 @@ void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
 			if (pif_is_socket(batchpif)) {
 				udp_splice_prepare(mmh_recv, i);
 			} else if (batchpif == PIF_TAP) {
-				udp_tap_prepare(mmh_recv, i,
+				udp_tap_prepare(c, mmh_recv, i,
 						flowside_at_sidx(batchsidx));
 			}
 
@@ -561,7 +563,7 @@ void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
 }
 
 /**
- * udp_reply_sock_handler() - Handle new data from flow specific socket
+ * udp_buf_reply_sock_handler() - Handle new data from flow specific socket
  * @c:		Execution context
  * @ref:	epoll reference
  * @events:	epoll events bitmap
@@ -569,8 +571,8 @@ void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
  *
  * #syscalls recvmmsg
  */
-void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
-			    uint32_t events, const struct timespec *now)
+void udp_buf_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
+				uint32_t events, const struct timespec *now)
 {
 	const struct flowside *fromside = flowside_at_sidx(ref.flowside);
 	flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
@@ -594,7 +596,7 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
 		if (pif_is_socket(topif))
 			udp_splice_prepare(mmh_recv, i);
 		else if (topif == PIF_TAP)
-			udp_tap_prepare(mmh_recv, i, toside);
+			udp_tap_prepare(c, mmh_recv, i, toside);
 	}
 
 	if (pif_is_socket(topif)) {
diff --git a/udp.h b/udp.h
index fb42e1c..77b2926 100644
--- a/udp.h
+++ b/udp.h
@@ -9,10 +9,10 @@
 #define UDP_TIMER_INTERVAL		1000 /* ms */
 
 void udp_portmap_clear(void);
-void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
-			     uint32_t events, const struct timespec *now);
-void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
-			    uint32_t events, const struct timespec *now);
+void udp_buf_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
+				 uint32_t events, const struct timespec *now);
+void udp_buf_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
+				uint32_t events, const struct timespec *now);
 int udp_tap_handler(const struct ctx *c, uint8_t pif,
 		    sa_family_t af, const void *saddr, const void *daddr,
 		    const struct pool *p, int idx, const struct timespec *now);
diff --git a/udp_internal.h b/udp_internal.h
new file mode 100644
index 0000000..7dd4575
--- /dev/null
+++ b/udp_internal.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright (c) 2021 Red Hat GmbH
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#ifndef UDP_INTERNAL_H
+#define UDP_INTERNAL_H
+
+#include "tap.h" /* needed by udp_meta_t */
+
+#define UDP_MAX_FRAMES		32  /* max # of frames to receive at once */
+
+/**
+ * struct udp_payload_t - UDP header and data for inbound messages
+ * @uh:		UDP header
+ * @data:	UDP data
+ */
+struct udp_payload_t {
+	struct udphdr uh;
+	char data[USHRT_MAX - sizeof(struct udphdr)];
+#ifdef __AVX2__
+} __attribute__ ((packed, aligned(32)));
+#else
+} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
+#endif
+
+size_t udp_update_hdr4(const struct ctx *c,
+		       struct iphdr *ip4h, struct udp_payload_t *bp,
+		       const struct flowside *toside, size_t dlen);
+size_t udp_update_hdr6(const struct ctx *c,
+		       struct ipv6hdr *ip6h, struct udp_payload_t *bp,
+		       const struct flowside *toside, size_t dlen);
+bool udp_sock_recverr(int s);
+
+#endif /* UDP_INTERNAL_H */
diff --git a/udp_vu.c b/udp_vu.c
new file mode 100644
index 0000000..a25a183
--- /dev/null
+++ b/udp_vu.c
@@ -0,0 +1,338 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ *
+ * udp_vu.c - UDP L2 vhost-user management functions
+ */
+
+#include <unistd.h>
+#include <assert.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <netinet/in.h>
+#include <netinet/ip.h>
+#include <netinet/udp.h>
+#include <stdint.h>
+#include <stddef.h>
+#include <sys/uio.h>
+#include <linux/virtio_net.h>
+
+#include "checksum.h"
+#include "util.h"
+#include "ip.h"
+#include "siphash.h"
+#include "inany.h"
+#include "passt.h"
+#include "pcap.h"
+#include "log.h"
+#include "vhost_user.h"
+#include "udp_internal.h"
+#include "flow.h"
+#include "flow_table.h"
+#include "udp_flow.h"
+#include "udp_vu.h"
+#include "vu_common.h"
+
+/* vhost-user */
+static const struct virtio_net_hdr vu_header = {
+	.flags = VIRTIO_NET_HDR_F_DATA_VALID,
+	.gso_type = VIRTIO_NET_HDR_GSO_NONE,
+};
+
+static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE];
+static struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
+static struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
+static int in_sg_count;
+
+static size_t udp_vu_l2_hdrlen(const struct vu_dev *vdev, bool v6)
+{
+	size_t l2_hdrlen;
+
+	l2_hdrlen = vdev->hdrlen + sizeof(struct ethhdr) +
+		    sizeof(struct udphdr);
+
+	if (v6)
+		l2_hdrlen += sizeof(struct ipv6hdr);
+	else
+		l2_hdrlen += sizeof(struct iphdr);
+
+	return l2_hdrlen;
+}
+
+static int udp_vu_sock_recv(const struct ctx *c, union sockaddr_inany *s_in,
+			    int s, uint32_t events, bool v6, ssize_t *data_len)
+{
+	struct vu_dev *vdev = c->vdev;
+	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	int virtqueue_max, iov_cnt, idx, iov_used;
+	size_t fillsize, size, off, l2_hdrlen;
+	struct virtio_net_hdr_mrg_rxbuf *vh;
+	struct msghdr msg  = { 0 };
+	char *base;
+
+	ASSERT(!c->no_udp);
+
+	/* Clear any errors first */
+	if (events & EPOLLERR) {
+		while (udp_sock_recverr(s))
+			;
+	}
+
+	if (!(events & EPOLLIN))
+		return 0;
+
+	if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
+		virtqueue_max = VIRTQUEUE_MAX_SIZE;
+	else
+		virtqueue_max = 1;
+
+	/* compute L2 header length */
+	l2_hdrlen = udp_vu_l2_hdrlen(vdev, v6);
+
+	msg.msg_name = s_in;
+	msg.msg_namelen = sizeof(union sockaddr_inany);
+
+	fillsize = USHRT_MAX;
+	iov_cnt = 0;
+	in_sg_count = 0;
+	while (fillsize && iov_cnt < virtqueue_max &&
+			in_sg_count < ARRAY_SIZE(in_sg)) {
+		int ret;
+
+		elem[iov_cnt].out_num = 0;
+		elem[iov_cnt].out_sg = NULL;
+		elem[iov_cnt].in_num = ARRAY_SIZE(in_sg) - in_sg_count;
+		elem[iov_cnt].in_sg = &in_sg[in_sg_count];
+		ret = vu_queue_pop(vdev, vq, &elem[iov_cnt]);
+		if (ret < 0)
+			break;
+		in_sg_count += elem[iov_cnt].in_num;
+
+		if (elem[iov_cnt].in_num < 1) {
+			err("virtio-net receive queue contains no in buffers");
+			vu_queue_rewind(vq, iov_cnt);
+			return 0;
+		}
+		ASSERT(elem[iov_cnt].in_num == 1);
+		ASSERT(elem[iov_cnt].in_sg[0].iov_len >= l2_hdrlen);
+
+		if (iov_cnt == 0) {
+			base = elem[iov_cnt].in_sg[0].iov_base;
+			size = elem[iov_cnt].in_sg[0].iov_len;
+
+			/* keep space for the headers */
+			iov_vu[0].iov_base = base + l2_hdrlen;
+			iov_vu[0].iov_len = size - l2_hdrlen;
+		} else {
+			iov_vu[iov_cnt].iov_base = elem[iov_cnt].in_sg[0].iov_base;
+			iov_vu[iov_cnt].iov_len = elem[iov_cnt].in_sg[0].iov_len;
+		}
+
+		if (iov_vu[iov_cnt].iov_len > fillsize)
+			iov_vu[iov_cnt].iov_len = fillsize;
+
+		fillsize -= iov_vu[iov_cnt].iov_len;
+
+		iov_cnt++;
+	}
+	if (iov_cnt == 0)
+		return 0;
+
+	msg.msg_iov = iov_vu;
+	msg.msg_iovlen = iov_cnt;
+
+	*data_len = recvmsg(s, &msg, 0);
+	if (*data_len < 0) {
+		vu_queue_rewind(vq, iov_cnt);
+		return 0;
+	}
+
+	/* restore original values */
+	iov_vu[0].iov_base = base;
+	iov_vu[0].iov_len = size;
+
+	/* count the number of buffers filled by recvmsg() */
+	idx = iov_skip_bytes(iov_vu, iov_cnt, l2_hdrlen + *data_len,
+			     &off);
+	/* adjust last iov length */
+	if (idx < iov_cnt)
+		iov_vu[idx].iov_len = off;
+	iov_used = idx + !!off;
+
+	/* release unused buffers */
+	vu_queue_rewind(vq, iov_cnt - iov_used);
+
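+	/* Fill in the vnet header: with VIRTIO_NET_F_MRG_RXBUF, the
+	 * guest also needs the number of buffers the datagram spans
+	 */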
+	vh = (struct virtio_net_hdr_mrg_rxbuf *)base;
+	vh->hdr = vu_header;
+	if (vdev->hdrlen == sizeof(struct virtio_net_hdr_mrg_rxbuf))
+		vh->num_buffers = htole16(iov_used);
+
+	return iov_used;
+}
+
+static size_t udp_vu_prepare(const struct ctx *c,
+			     const struct flowside *toside, ssize_t data_len)
+{
+	const struct vu_dev *vdev = c->vdev;
+	struct ethhdr *eh;
+	size_t l4len;
+
+	/* ethernet header */
+	eh = vu_eth(vdev, iov_vu[0].iov_base);
+
+	memcpy(eh->h_dest, c->guest_mac, sizeof(eh->h_dest));
+	memcpy(eh->h_source, c->our_tap_mac, sizeof(eh->h_source));
+
+	/* initialize header */
+	if (inany_v4(&toside->eaddr) && inany_v4(&toside->oaddr)) {
+		struct iphdr *iph = vu_ip(vdev, iov_vu[0].iov_base);
+		struct udp_payload_t *bp = vu_payloadv4(vdev,
+							    iov_vu[0].iov_base);
+
+		eh->h_proto = htons(ETH_P_IP);
+
+		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_UDP);
+
+		l4len = udp_update_hdr4(c, iph, bp, toside, data_len);
+	} else {
+		struct ipv6hdr *ip6h = vu_ip(vdev, iov_vu[0].iov_base);
+		struct udp_payload_t *bp = vu_payloadv6(vdev,
+							    iov_vu[0].iov_base);
+
+		eh->h_proto = htons(ETH_P_IPV6);
+
+		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_UDP);
+
+		l4len = udp_update_hdr6(c, ip6h, bp, toside, data_len);
+	}
+
+	return l4len;
+}
+
+static void udp_vu_pcap(const struct ctx *c, const struct flowside *toside,
+			size_t l4len, int iov_used)
+{
+	const struct in_addr *src = inany_v4(&toside->oaddr);
+	const struct in_addr *dst = inany_v4(&toside->eaddr);
+	const struct vu_dev *vdev = c->vdev;
+	char *base = iov_vu[0].iov_base;
+	size_t size = iov_vu[0].iov_len;
+	struct udp_payload_t *bp;
+	uint32_t sum;
+
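+	/* The guest doesn't need the UDP checksum, but the capture
+	 * file should carry a valid one: compute it for pcap only
+	 */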
+	if (!*c->pcap)
+		return;
+
+	if (src && dst) {
+		bp = vu_payloadv4(vdev, base);
+		sum = proto_ipv4_header_psum(l4len, IPPROTO_UDP, *src, *dst);
+	} else {
+		bp = vu_payloadv6(vdev, base);
+		sum = proto_ipv6_header_psum(l4len, IPPROTO_UDP,
+					     &toside->oaddr.a6,
+					     &toside->eaddr.a6);
+		bp->uh.check = 0; /* reset the 0xffff default before computing */
+	}
+
+	iov_vu[0].iov_base = &bp->uh;
+	iov_vu[0].iov_len = size - ((char *)iov_vu[0].iov_base - base);
+
+	bp->uh.check = csum_iov(iov_vu, iov_used, sum);
+
+	/* set iov for pcap logging */
+	iov_vu[0].iov_base = base + vdev->hdrlen;
+	iov_vu[0].iov_len = size - vdev->hdrlen;
+	pcap_iov(iov_vu, iov_used);
+
+	/* restore iov_vu[0] */
+	iov_vu[0].iov_base = base;
+	iov_vu[0].iov_len = size;
+}
+
+void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
+				uint32_t events, const struct timespec *now)
+{
+	struct vu_dev *vdev = c->vdev;
+	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	bool v6 = ref.udp.v6;
+	int i;
+
+	for (i = 0; i < UDP_MAX_FRAMES; i++) {
+		union sockaddr_inany s_in;
+		flow_sidx_t batchsidx;
+		uint8_t batchpif;
+		ssize_t data_len;
+		int iov_used;
+
+		iov_used = udp_vu_sock_recv(c, &s_in, ref.fd,
+					    events, v6, &data_len);
+		if (iov_used <= 0)
+			return;
+
+		batchsidx = udp_flow_from_sock(c, ref, &s_in, now);
+		batchpif = pif_at_sidx(batchsidx);
+
+		if (batchpif == PIF_TAP) {
+			size_t l4len;
+
+			l4len = udp_vu_prepare(c, flowside_at_sidx(batchsidx),
+					       data_len);
+			udp_vu_pcap(c, flowside_at_sidx(batchsidx), l4len,
+				    iov_used);
+			vu_send_frame(vdev, vq, elem, iov_vu, iov_used);
+		} else if (flow_sidx_valid(batchsidx)) {
+			flow_sidx_t fromsidx = flow_sidx_opposite(batchsidx);
+			struct udp_flow *uflow = udp_at_sidx(batchsidx);
+
+			flow_err(uflow,
+				 "No support for forwarding UDP from %s to %s",
+				 pif_name(pif_at_sidx(fromsidx)),
+				 pif_name(batchpif));
+		} else {
+			debug("Discarding 1 datagram without flow");
+		}
+	}
+}
+
+void udp_vu_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
+			       uint32_t events, const struct timespec *now)
+{
+	flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
+	const struct flowside *toside = flowside_at_sidx(tosidx);
+	struct vu_dev *vdev = c->vdev;
+	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
+	struct udp_flow *uflow = udp_at_sidx(ref.flowside);
+	uint8_t topif = pif_at_sidx(tosidx);
+	bool v6 = ref.udp.v6;
+	int i;
+
+	ASSERT(!c->no_udp && uflow);
+
+	for (i = 0; i < UDP_MAX_FRAMES; i++) {
+		union sockaddr_inany s_in;
+		ssize_t data_len;
+		int iov_used;
+
+		iov_used = udp_vu_sock_recv(c, &s_in, ref.fd,
+					    events, v6, &data_len);
+		if (iov_used <= 0)
+			return;
+		flow_trace(uflow, "Received 1 datagram on reply socket");
+		uflow->ts = now->tv_sec;
+
+		if (topif == PIF_TAP) {
+			size_t l4len;
+
+			l4len = udp_vu_prepare(c, toside, data_len);
+			udp_vu_pcap(c, toside, l4len, iov_used);
+			vu_send_frame(vdev, vq, elem, iov_vu, iov_used);
+		} else {
+			uint8_t frompif = pif_at_sidx(ref.flowside);
+
+			flow_err(uflow,
+				 "No support for forwarding UDP from %s to %s",
+				 pif_name(frompif), pif_name(topif));
+		}
+	}
+}
diff --git a/udp_vu.h b/udp_vu.h
new file mode 100644
index 0000000..0db7558
--- /dev/null
+++ b/udp_vu.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ */
+
+#ifndef UDP_VU_H
+#define UDP_VU_H
+
+void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
+				uint32_t events, const struct timespec *now);
+void udp_vu_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
+			       uint32_t events, const struct timespec *now);
+#endif /* UDP_VU_H */
diff --git a/vhost_user.c b/vhost_user.c
index c4cd25f..e65b550 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -52,7 +52,6 @@
  * 			     this is part of the vhost-user backend
  * 			     convention.
  */
-/* cppcheck-suppress unusedFunction */
 void vu_print_capabilities(void)
 {
 	info("{");
@@ -163,8 +162,7 @@ static void vmsg_close_fds(const struct vhost_user_msg *vmsg)
  */
 static void vu_remove_watch(const struct vu_dev *vdev, int fd)
 {
-	(void)vdev;
-	(void)fd;
+	epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL, fd, NULL);
 }
 
 /**
@@ -426,7 +424,6 @@ static bool map_ring(struct vu_dev *vdev, struct vu_virtq *vq)
  *
  * Return: 0 if the zone in a mapped memory region, -1 otherwise
  */
-/* cppcheck-suppress unusedFunction */
 int vu_packet_check_range(void *buf, size_t offset, size_t len,
 			  const char *start)
 {
@@ -517,6 +514,14 @@ static bool vu_set_mem_table_exec(struct vu_dev *vdev,
 		}
 	}
 
+	/* As vu_packet_check_range() has no access to the number of
+	 * memory regions, mark the end of the array with mmap_addr = 0
+	 */
+	ASSERT(vdev->nregions < VHOST_USER_MAX_RAM_SLOTS - 1);
+	vdev->regions[vdev->nregions].mmap_addr = 0;
+
+	tap_sock_update_buf(vdev->regions, 0);
+
 	return false;
 }
 
@@ -637,8 +642,12 @@ static bool vu_get_vring_base_exec(struct vu_dev *vdev,
  */
 static void vu_set_watch(const struct vu_dev *vdev, int fd)
 {
-	(void)vdev;
-	(void)fd;
+	union epoll_ref ref = { .type = EPOLL_TYPE_VHOST_KICK, .fd = fd };
+	struct epoll_event ev = { 0 };
+
+	ev.data.u64 = ref.u64;
+	ev.events = EPOLLIN;
+	epoll_ctl(vdev->context->epollfd, EPOLL_CTL_ADD, fd, &ev);
 }
 
 /**
@@ -678,7 +687,6 @@ static int vu_wait_queue(const struct vu_virtq *vq)
  *
  * Return: number of bytes sent, -1 if there is an error
  */
-/* cppcheck-suppress unusedFunction */
 int vu_send(struct vu_dev *vdev, const void *buf, size_t size)
 {
 	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
@@ -864,7 +872,6 @@ static void vu_handle_tx(struct vu_dev *vdev, int index,
  * @vdev:	vhost-user device
  * @ref:	epoll reference information
  */
-/* cppcheck-suppress unusedFunction */
 void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
 		const struct timespec *now)
 {
@@ -1102,11 +1109,11 @@ static bool vu_set_vring_enable_exec(struct vu_dev *vdev,
  * @c:		execution context
  * @vdev:	vhost-user device
  */
-/* cppcheck-suppress unusedFunction */
 void vu_init(struct ctx *c, struct vu_dev *vdev)
 {
 	int i;
 
+	c->vdev = vdev;
 	vdev->context = c;
 	vdev->hdrlen = 0;
 	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
@@ -1170,7 +1177,7 @@ void vu_cleanup(struct vu_dev *vdev)
  */
 static void vu_sock_reset(struct vu_dev *vdev)
 {
-	(void)vdev;
+	tap_sock_reset(vdev->context);
 }
 
 /**
@@ -1179,7 +1186,6 @@ static void vu_sock_reset(struct vu_dev *vdev)
  * @fd:		vhost-user message socket
  * @events:	epoll events
  */
-/* cppcheck-suppress unusedFunction */
 void tap_handler_vu(struct vu_dev *vdev, int fd, uint32_t events)
 {
 	struct vhost_user_msg msg = { 0 };
diff --git a/virtio.c b/virtio.c
index d02e6e0..55fc647 100644
--- a/virtio.c
+++ b/virtio.c
@@ -559,7 +559,6 @@ void vu_queue_unpop(struct vu_virtq *vq)
  * @vq:		Virtqueue
  * @num:	Number of element to unpop
  */
-/* cppcheck-suppress unusedFunction */
 bool vu_queue_rewind(struct vu_virtq *vq, unsigned int num)
 {
 	if (num > vq->inuse)
diff --git a/vu_common.c b/vu_common.c
new file mode 100644
index 0000000..611c44a
--- /dev/null
+++ b/vu_common.c
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ *
+ * vu_common.c - vhost-user common UDP and TCP functions
+ */
+
+#include <unistd.h>
+#include <sys/uio.h>
+
+#include "util.h"
+#include "passt.h"
+#include "vhost_user.h"
+#include "vu_common.h"
+
+void vu_send_frame(const struct vu_dev *vdev, struct vu_virtq *vq,
+		   struct vu_virtq_element *elem, const struct iovec *iov_vu,
+		   int iov_used)
+{
+	int i;
+
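+	/* Add each buffer to the used ring, then make them visible to
+	 * the guest and kick it
+	 */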
+	for (i = 0; i < iov_used; i++)
+		vu_queue_fill(vq, &elem[i], iov_vu[i].iov_len, i);
+
+	vu_queue_flush(vq, iov_used);
+	vu_queue_notify(vdev, vq);
+}
diff --git a/vu_common.h b/vu_common.h
new file mode 100644
index 0000000..d2ea46b
--- /dev/null
+++ b/vu_common.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright Red Hat
+ * Author: Laurent Vivier <lvivier@redhat.com>
+ *
+ * vhost-user common UDP and TCP functions
+ */
+
+#ifndef VU_COMMON_H
+#define VU_COMMON_H
+
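+/* Helpers to locate the protocol headers in a vhost-user buffer: the
+ * frame starts with the vnet header (vdev->hdrlen), followed by the
+ * Ethernet, IP and transport headers
+ */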
+static inline void *vu_eth(const struct vu_dev *vdev, void *base)
+{
+	return ((char *)base + vdev->hdrlen);
+}
+
+static inline void *vu_ip(const struct vu_dev *vdev, void *base)
+{
+	return (struct ethhdr *)vu_eth(vdev, base) + 1;
+}
+
+static inline void *vu_payloadv4(const struct vu_dev *vdev, void *base)
+{
+	return (struct iphdr *)vu_ip(vdev, base) + 1;
+}
+
+static inline void *vu_payloadv6(const struct vu_dev *vdev, void *base)
+{
+	return (struct ipv6hdr *)vu_ip(vdev, base) + 1;
+}
+
+void vu_send_frame(const struct vu_dev *vdev, struct vu_virtq *vq,
+		   struct vu_virtq_element *elem, const struct iovec *iov_vu,
+		   int iov_used);
+#endif /* VU_COMMON_H */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 0/4] Add vhost-user support to passt. (part 3)
  2024-08-20 22:41 ` [PATCH v3 0/4] Add vhost-user support to passt. (part 3) Stefano Brivio
@ 2024-08-22 16:53   ` Stefano Brivio
  2024-08-23 12:32     ` Stefano Brivio
  0 siblings, 1 reply; 22+ messages in thread
From: Stefano Brivio @ 2024-08-22 16:53 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

[-- Attachment #1: Type: text/plain, Size: 1208 bytes --]

On Wed, 21 Aug 2024 00:41:14 +0200
Stefano Brivio <sbrivio@redhat.com> wrote:

> On Thu, 15 Aug 2024 17:50:19 +0200
> Laurent Vivier <lvivier@redhat.com> wrote:
> 
> > This series of patches adds vhost-user support to passt
> > and then allows passt to connect to QEMU network backend using
> > virtqueue rather than a socket.
> > 
> > With QEMU, rather than using to connect:
> > 
> >   -netdev stream,id=s,server=off,addr.type=unix,addr.path=/tmp/passt_1.socket
> > 
> > we will use:
> > 
> >   -chardev socket,id=chr0,path=/tmp/passt_1.socket
> >   -netdev vhost-user,id=netdev0,chardev=chr0
> >   -device virtio-net,netdev=netdev0
> >   -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE
> >   -numa node,memdev=memfd0
> > 
> > The memory backend is needed to share data between passt and QEMU.
> > 
> > Performance comparison between "-netdev stream" and "-netdev vhost-user":  
> 
> By the way, I attached a quick patch adding vhost-user-based tests to
> the usual throughput and latency tests.
> 
> UDP doesn't work (I didn't look into that at all), TCP does.

Complete/fixed patch attached. The only part of UDP that's not working
is actually over IPv6 -- it works over IPv4.
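
For reference (illustrative only, with the addresses and ports used by
the attached tests), the failing case is a guest-to-host UDP transfer
over IPv6, e.g.:

  iperf3 -c 2001:db8:9a55::2 -p 10002 -u -b 3G -l 1232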

-- 
Stefano

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-test-Add-vhost-user-performance-tests-for-TCP-and-UD.patch --]
[-- Type: text/x-patch, Size: 17110 bytes --]

From 1ea5bf7942d1fdbea37d9d2cf3ce3c2c335360cf Mon Sep 17 00:00:00 2001
From: Stefano Brivio <sbrivio@redhat.com>
Date: Thu, 22 Aug 2024 18:50:43 +0200
Subject: [PATCH] test: Add vhost-user performance tests for TCP and UDP

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 test/lib/perf_report  |  15 +++
 test/lib/setup        |  49 +++++++---
 test/perf/passtvu_tcp | 211 ++++++++++++++++++++++++++++++++++++++++++
 test/perf/passtvu_udp | 159 +++++++++++++++++++++++++++++++
 test/run              |  10 ++
 5 files changed, 431 insertions(+), 13 deletions(-)
 create mode 100644 test/perf/passtvu_tcp
 create mode 100644 test/perf/passtvu_udp

diff --git a/test/lib/perf_report b/test/lib/perf_report
index d1ef50b..dfab32e 100755
--- a/test/lib/perf_report
+++ b/test/lib/perf_report
@@ -49,6 +49,21 @@ td:empty { visibility: hidden; }
 	__passt_tcp_LINE__ __passt_udp_LINE__
 </table>
 
+</li><li><p>passt with vhost-user back-end</p>
+<table class="passt" width="70%">
+	<tr>
+		<th/>
+		<th id="perf_passtvu_tcp" colspan="__passtvu_tcp_cols__">TCP, __passtvu_tcp_threads__ at __passtvu_tcp_freq__ GHz</th>
+		<th id="perf_passtvu_udp" colspan="__passtvu_udp_cols__">UDP, __passtvu_udp_threads__ at __passtvu_udp_freq__ GHz</th>
+	</tr>
+	<tr>
+		<td align="right">MTU:</td>
+		__passtvu_tcp_header__
+		__passtvu_udp_header__
+	</tr>
+	__passtvu_tcp_LINE__ __passtvu_udp_LINE__
+</table>
+
 <style type="text/CSS">
 table.pasta_local td { border: 0px solid; padding: 6px; line-height: 1; }
 table.pasta_local td { text-align: right; }
diff --git a/test/lib/setup b/test/lib/setup
index d764138..31e24d5 100755
--- a/test/lib/setup
+++ b/test/lib/setup
@@ -17,6 +17,7 @@ INITRAMFS="${BASEPATH}/mbuto.img"
 VCPUS="$( [ $(nproc) -ge 8 ] && echo 6 || echo $(( $(nproc) / 2 + 1 )) )"
 __mem_kib="$(sed -n 's/MemTotal:[ ]*\([0-9]*\) kB/\1/p' /proc/meminfo)"
 VMEM="$((${__mem_kib} / 1024 / 4))"
+VMEM_ROUND="$(((${VMEM} + 512) / 1024))G"
 QEMU_ARCH="$(uname -m)"
 [ "${QEMU_ARCH}" = "i686" ] && QEMU_ARCH=i386
 
@@ -150,23 +151,45 @@ setup_passt_in_ns() {
 	else
 		context_run passt "make clean"
 		context_run passt "make"
-		context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --map-host-loopback ${__map_ns4} --map-host-loopback ${__map_ns6}"
+		if [ ${VHOST_USER} -eq 1 ]; then
+			context_run_bg passt "./passt -f ${__opts} --vhost-user -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --map-host-loopback ${__map_ns4} --map-host-loopback ${__map_ns6}"
+		else
+			context_run_bg passt "./passt -f ${__opts} -s ${STATESETUP}/passt.socket -t 10001,10011,10021,10031 -u 10001,10011,10021,10031 -P ${STATESETUP}/passt.pid --map-host-loopback ${__map_ns4} --map-host-loopback ${__map_ns6}"
+		fi
 	fi
 	wait_for [ -f "${STATESETUP}/passt.pid" ]
 
 	GUEST_CID=94557
-	context_run_bg qemu 'qemu-system-'"${QEMU_ARCH}"		   \
-		' -machine accel=kvm'                                      \
-		' -M accel=kvm:tcg'                                        \
-		' -m '${VMEM}' -cpu host -smp '${VCPUS}                    \
-		' -kernel ' "/boot/vmlinuz-$(uname -r)"			   \
-		' -initrd '${INITRAMFS}' -nographic -serial stdio'	   \
-		' -nodefaults'						   \
-		' -append "console=ttyS0 mitigations=off apparmor=0" '	   \
-		' -device virtio-net-pci,netdev=s0 '			   \
-		" -netdev stream,id=s0,server=off,addr.type=unix,addr.path=${STATESETUP}/passt.socket " \
-		" -pidfile ${STATESETUP}/qemu.pid"			   \
-		" -device vhost-vsock-pci,guest-cid=$GUEST_CID"
+	if [ ${VHOST_USER} -eq 1 ]; then
+		context_run_bg qemu 'qemu-system-'"${QEMU_ARCH}"		   \
+			' -machine accel=kvm'                                      \
+			' -M accel=kvm:tcg'                                        \
+			' -m '${VMEM_ROUND}' -cpu host -smp '${VCPUS}		   \
+			' -kernel ' "/boot/vmlinuz-$(uname -r)"			   \
+			' -initrd '${INITRAMFS}' -nographic -serial stdio'	   \
+			' -nodefaults'						   \
+			' -append "console=ttyS0 mitigations=off apparmor=0" '	   \
+			" -chardev socket,id=chr0,path=${STATESETUP}/passt.socket" \
+			' -netdev vhost-user,id=netdev0,chardev=chr0'		   \
+			' -device virtio-net,netdev=netdev0'			   \
+			" -object memory-backend-memfd,id=memfd0,share=on,size=${VMEM_ROUND}" \
+			' -numa node,memdev=memfd0'				   \
+			" -pidfile ${STATESETUP}/qemu.pid"			   \
+			" -device vhost-vsock-pci,guest-cid=$GUEST_CID"
+	else
+		context_run_bg qemu 'qemu-system-'"${QEMU_ARCH}"		   \
+			' -machine accel=kvm'                                      \
+			' -M accel=kvm:tcg'                                        \
+			' -m '${VMEM}' -cpu host -smp '${VCPUS}                    \
+			' -kernel ' "/boot/vmlinuz-$(uname -r)"			   \
+			' -initrd '${INITRAMFS}' -nographic -serial stdio'	   \
+			' -nodefaults'						   \
+			' -append "console=ttyS0 mitigations=off apparmor=0" '	   \
+			' -device virtio-net-pci,netdev=s0 '			   \
+			" -netdev stream,id=s0,server=off,addr.type=unix,addr.path=${STATESETUP}/passt.socket " \
+			" -pidfile ${STATESETUP}/qemu.pid"			   \
+			" -device vhost-vsock-pci,guest-cid=$GUEST_CID"
+	fi
 
 	context_setup_guest guest $GUEST_CID
 }
diff --git a/test/perf/passtvu_tcp b/test/perf/passtvu_tcp
new file mode 100644
index 0000000..a30af7b
--- /dev/null
+++ b/test/perf/passtvu_tcp
@@ -0,0 +1,211 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# test/perf/passtvu_tcp - Check TCP performance in passt vhost-user mode
+#
+# Copyright (c) 2021 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+gtools	/sbin/sysctl ip jq nproc seq sleep iperf3 tcp_rr tcp_crr # From neper
+nstools	/sbin/sysctl ip jq nproc seq sleep iperf3 tcp_rr tcp_crr
+htools	bc head sed seq
+
+set	MAP_NS4 192.0.2.2
+set	MAP_NS6 2001:db8:9a55::2
+
+test	passt with vhost-user: throughput and latency
+
+guest	/sbin/sysctl -w net.core.rmem_max=536870912
+guest	/sbin/sysctl -w net.core.wmem_max=536870912
+guest	/sbin/sysctl -w net.core.rmem_default=33554432
+guest	/sbin/sysctl -w net.core.wmem_default=33554432
+guest	/sbin/sysctl -w net.ipv4.tcp_rmem="4096 131072 268435456"
+guest	/sbin/sysctl -w net.ipv4.tcp_wmem="4096 131072 268435456"
+guest	/sbin/sysctl -w net.ipv4.tcp_timestamps=0
+
+ns	/sbin/sysctl -w net.ipv4.tcp_rmem="4096 524288 134217728"
+ns	/sbin/sysctl -w net.ipv4.tcp_wmem="4096 524288 134217728"
+ns	/sbin/sysctl -w net.ipv4.tcp_timestamps=0
+
+gout	IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname'
+
+hout	FREQ_PROCFS (echo "scale=1"; sed -n 's/cpu MHz.*: \([0-9]*\)\..*$/(\1+10^2\/2)\/10^3/p' /proc/cpuinfo) | bc -l | head -n1
+hout	FREQ_CPUFREQ (echo "scale=1"; printf '( %i + 10^5 / 2 ) / 10^6\n' $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq) ) | bc -l
+hout	FREQ [ -n "__FREQ_CPUFREQ__" ] && echo __FREQ_CPUFREQ__ || echo __FREQ_PROCFS__
+
+set	THREADS 2
+set	TIME 2
+set	OMIT 0.1
+set	OPTS -Z -N -P __THREADS__ -l 60k -O__OMIT__
+
+info	Throughput in Gbps, latency in µs, __THREADS__ threads at __FREQ__ GHz
+report	passtvu tcp __THREADS__ __FREQ__
+
+th	MTU 256B 576B 1280B 1500B 9000B 65520B
+
+
+tr	TCP throughput over IPv6: guest to host
+iperf3s	ns 10002
+
+bw	-
+bw	-
+guest	ip link set dev __IFNAME__ mtu 1280
+iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 8M
+bw	__BW__ 1.2 1.5
+guest	ip link set dev __IFNAME__ mtu 1500
+iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 16M
+bw	__BW__ 1.6 1.8
+guest	ip link set dev __IFNAME__ mtu 9000
+iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 32M
+bw	__BW__ 4.0 5.0
+guest	ip link set dev __IFNAME__ mtu 65520
+iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -w 32M
+bw	__BW__ 7.0 8.0
+
+iperf3k	ns
+
+tl	TCP RR latency over IPv6: guest to host
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+nsb	tcp_rr --nolog -6
+gout	LAT tcp_rr --nolog -l1 -6 -c -H __MAP_NS6__ | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 200 150
+
+tl	TCP CRR latency over IPv6: guest to host
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+nsb	tcp_crr --nolog -6
+gout	LAT tcp_crr --nolog -l1 -6 -c -H __MAP_NS6__ | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 500 400
+
+tr	TCP throughput over IPv4: guest to host
+iperf3s	ns 10002
+
+guest	ip link set dev __IFNAME__ mtu 256
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 8M
+bw	__BW__ 0.2 0.3
+guest	ip link set dev __IFNAME__ mtu 576
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 8M
+bw	__BW__ 0.5 0.8
+guest	ip link set dev __IFNAME__ mtu 1280
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 16M
+bw	__BW__ 1.2 1.5
+guest	ip link set dev __IFNAME__ mtu 1500
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 16M
+bw	__BW__ 1.6 1.8
+guest	ip link set dev __IFNAME__ mtu 9000
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 32M
+bw	__BW__ 4.0 5.0
+guest	ip link set dev __IFNAME__ mtu 65520
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -w 32M
+bw	__BW__ 7.0 8.0
+
+iperf3k	ns
+
+# Reducing MTU below 1280 deconfigures IPv6, get our address back
+guest	dhclient -6 -x
+guest	dhclient -6 __IFNAME__
+
+tl	TCP RR latency over IPv4: guest to host
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+nsb	tcp_rr --nolog -4
+gout	LAT tcp_rr --nolog -l1 -4 -c -H __MAP_NS4__ | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 200 150
+
+tl	TCP CRR latency over IPv4: guest to host
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+nsb	tcp_crr --nolog -4
+gout	LAT tcp_crr --nolog -l1 -4 -c -H __MAP_NS4__ | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 500 400
+
+tr	TCP throughput over IPv6: host to guest
+iperf3s	guest 10001
+
+bw	-
+bw	-
+bw	-
+bw	-
+bw	-
+iperf3	BW ns ::1 10001 __TIME__ __OPTS__ -w 32M
+bw	__BW__ 6.0 6.8
+
+iperf3k	guest
+
+tl	TCP RR latency over IPv6: host to guest
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+guestb	tcp_rr --nolog -P 10001 -C 10011 -6
+sleep	1
+nsout	LAT tcp_rr --nolog -l1 -P 10001 -C 10011 -6 -c -H ::1 | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 200 150
+
+tl	TCP CRR latency over IPv6: host to guest
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+guestb	tcp_crr --nolog -P 10001 -C 10011 -6
+sleep	1
+nsout	LAT tcp_crr --nolog -l1 -P 10001 -C 10011 -6 -c -H ::1 | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 500 350
+
+
+tr	TCP throughput over IPv4: host to guest
+iperf3s	guest 10001
+
+bw	-
+bw	-
+bw	-
+bw	-
+bw	-
+iperf3	BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -w 32M
+bw	__BW__ 6.0 6.8
+
+iperf3k	guest
+
+tl	TCP RR latency over IPv4: host to guest
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+guestb	tcp_rr --nolog -P 10001 -C 10011 -4
+sleep	1
+nsout	LAT tcp_rr --nolog -l1 -P 10001 -C 10011 -4 -c -H 127.0.0.1 | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 200 150
+
+tl	TCP CRR latency over IPv6: host to guest
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+guestb	tcp_crr --nolog -P 10001 -C 10011 -4
+sleep	1
+nsout	LAT tcp_crr --nolog -l1 -P 10001 -C 10011 -4 -c -H 127.0.0.1 | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 500 300
+
+te
diff --git a/test/perf/passtvu_udp b/test/perf/passtvu_udp
new file mode 100644
index 0000000..e9941a0
--- /dev/null
+++ b/test/perf/passtvu_udp
@@ -0,0 +1,159 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+#  for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+#  for network namespace/tap device mode
+#
+# test/perf/passtvu_udp - Check UDP performance in passt vhost-user mode
+#
+# Copyright (c) 2021 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+gtools	/sbin/sysctl ip jq nproc sleep iperf3 udp_rr # From neper
+nstools	ip jq sleep iperf3 udp_rr
+htools	bc head sed
+
+set	MAP_NS4 192.0.2.2
+set	MAP_NS6 2001:db8:9a55::2
+
+test	passt with vhost-user: throughput and latency
+
+guest	/sbin/sysctl -w net.core.rmem_max=16777216
+guest	/sbin/sysctl -w net.core.wmem_max=16777216
+guest	/sbin/sysctl -w net.core.rmem_default=16777216
+guest	/sbin/sysctl -w net.core.wmem_default=16777216
+
+hout	FREQ_PROCFS (echo "scale=1"; sed -n 's/cpu MHz.*: \([0-9]*\)\..*$/(\1+10^2\/2)\/10^3/p' /proc/cpuinfo) | bc -l | head -n1
+hout	FREQ_CPUFREQ (echo "scale=1"; printf '( %i + 10^5 / 2 ) / 10^6\n' $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq) ) | bc -l
+hout	FREQ [ -n "__FREQ_CPUFREQ__" ] && echo __FREQ_CPUFREQ__ || echo __FREQ_PROCFS__
+
+set	THREADS 2
+set	TIME 1
+set	OPTS -u -P __THREADS__ --pacing-timer 1000
+
+info	Throughput in Gbps, latency in µs, __THREADS__ threads at __FREQ__ GHz
+
+report	passtvu udp __THREADS__ __FREQ__
+
+th	pktlen 256B 576B 1280B 1500B 9000B 65520B
+
+tr	UDP throughput over IPv6: guest to host
+iperf3s	ns 10002
+# (datagram size) = (packet size) - 48: 40 bytes of IPv6 header, 8 of UDP header
+
+bw	-
+bw	-
+iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -b 3G -l 1232
+bw	__BW__ 0.8 1.2
+iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -b 4G -l 1452
+bw	__BW__ 1.0 1.5
+iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -b 10G -l 8952
+bw	__BW__ 4.0 5.0
+iperf3	BW guest __MAP_NS6__ 10002 __TIME__ __OPTS__ -b 20G -l 64372
+bw	__BW__ 4.0 5.0
+
+iperf3k	ns
+
+tl	UDP RR latency over IPv6: guest to host
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+nsb	udp_rr --nolog -6
+gout	LAT udp_rr --nolog -6 -c -H __MAP_NS6__ | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 200 150
+
+
+tr	UDP throughput over IPv4: guest to host
+iperf3s	ns 10002
+# (datagram size) = (packet size) - 28: 20 bytes of IPv4 header, 8 of UDP header
+
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 1G -l 228
+bw	__BW__ 0.0 0.0
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 2G -l 548
+bw	__BW__ 0.4 0.6
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 3G -l 1252
+bw	__BW__ 0.8 1.2
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 4G -l 1472
+bw	__BW__ 1.0 1.5
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 10G -l 8972
+bw	__BW__ 4.0 5.0
+iperf3	BW guest __MAP_NS4__ 10002 __TIME__ __OPTS__ -b 20G -l 65492
+bw	__BW__ 4.0 5.0
+
+iperf3k	ns
+
+tl	UDP RR latency over IPv4: guest to host
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+nsb	udp_rr --nolog -4
+gout	LAT udp_rr --nolog -4 -c -H __MAP_NS4__ | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 200 150
+
+
+tr	UDP throughput over IPv6: host to guest
+iperf3s	guest 10001
+# (datagram size) = (packet size) - 48: 40 bytes of IPv6 header, 8 of UDP header
+
+bw	-
+bw	-
+iperf3	BW ns ::1 10001 __TIME__ __OPTS__ -b 3G -l 1232
+bw	__BW__ 0.8 1.2
+iperf3	BW ns ::1 10001 __TIME__ __OPTS__ -b 4G -l 1452
+bw	__BW__ 1.0 1.5
+iperf3	BW ns ::1 10001 __TIME__ __OPTS__ -b 10G -l 8952
+bw	__BW__ 3.0 4.0
+iperf3	BW ns ::1 10001 __TIME__ __OPTS__ -b 20G -l 64372
+bw	__BW__ 3.0 4.0
+
+iperf3k	guest
+
+tl	UDP RR latency over IPv6: host to guest
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+guestb	udp_rr --nolog -P 10001 -C 10011 -6
+sleep	1
+nsout	LAT udp_rr --nolog -P 10001 -C 10011 -6 -c -H ::1 | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 200 150
+
+
+tr	UDP throughput over IPv4: host to guest
+iperf3s	guest 10001
+# (datagram size) = (packet size) - 28: 20 bytes of IPv4 header, 8 of UDP header
+
+iperf3	BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 1G -l 228
+bw	__BW__ 0.0 0.0
+iperf3	BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 2G -l 548
+bw	__BW__ 0.4 0.6
+iperf3	BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 3G -l 1252
+bw	__BW__ 0.8 1.2
+iperf3	BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 4G -l 1472
+bw	__BW__ 1.0 1.5
+iperf3	BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 10G -l 8972
+bw	__BW__ 3.0 4.0
+iperf3	BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 20G -l 65492
+bw	__BW__ 3.0 4.0
+
+iperf3k	guest
+
+tl	UDP RR latency over IPv4: host to guest
+lat	-
+lat	-
+lat	-
+lat	-
+lat	-
+guestb	udp_rr --nolog -P 10001 -C 10011 -4
+sleep	1
+nsout	LAT udp_rr --nolog -P 10001 -C 10011 -4 -c -H 127.0.0.1 | sed -n 's/^throughput=\(.*\)/\1/p'
+lat	__LAT__ 200 150
+
+te
diff --git a/test/run b/test/run
index cd6d707..ebcfcf5 100755
--- a/test/run
+++ b/test/run
@@ -113,6 +113,7 @@ run() {
 	teardown two_guests
 
 	VALGRIND=0
+	VHOST_USER=0
 	setup passt_in_ns
 	test passt/ndp
 	test passt_in_ns/dhcp
@@ -123,6 +124,15 @@ run() {
 	test passt_in_ns/shutdown
 	teardown passt_in_ns
 
+	VHOST_USER=1
+	setup passt_in_ns
+	test passt/ndp
+	test passt_in_ns/dhcp
+	test perf/passtvu_tcp
+	test perf/passtvu_udp
+	test passt_in_ns/shutdown
+	teardown passt_in_ns
+
 	# TODO: Make those faster by at least pre-installing gcc and make on
 	# non-x86 images, then re-enable.
 skip_distro() {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 2/4] vhost-user: introduce virtio API
  2024-08-15 15:50 ` [PATCH v3 2/4] vhost-user: introduce virtio API Laurent Vivier
  2024-08-20  1:00   ` David Gibson
@ 2024-08-22 22:14   ` Stefano Brivio
  1 sibling, 0 replies; 22+ messages in thread
From: Stefano Brivio @ 2024-08-22 22:14 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Thu, 15 Aug 2024 17:50:21 +0200
Laurent Vivier <lvivier@redhat.com> wrote:

> Add virtio.c and virtio.h that define the functions needed
> to manage virtqueues.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  Makefile |   4 +-
>  util.h   |   8 +
>  virtio.c | 662 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  virtio.h | 185 ++++++++++++++++
>  4 files changed, 857 insertions(+), 2 deletions(-)
>  create mode 100644 virtio.c
>  create mode 100644 virtio.h
> 
> diff --git a/Makefile b/Makefile
> index b6329e35f884..f171c7955ac9 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -47,7 +47,7 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
>  	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
>  	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
> -	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c
> +	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c virtio.c
>  QRAP_SRCS = qrap.c
>  SRCS = $(PASST_SRCS) $(QRAP_SRCS)
>  
> @@ -57,7 +57,7 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
>  	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
>  	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
>  	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
> -	udp.h udp_flow.h util.h
> +	udp.h udp_flow.h util.h virtio.h
>  HEADERS = $(PASST_HEADERS) seccomp.h
>  
>  C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
> diff --git a/util.h b/util.h
> index b7541ce24e5a..7944cfe1219d 100644
> --- a/util.h
> +++ b/util.h
> @@ -132,6 +132,14 @@ static inline uint32_t ntohl_unaligned(const void *p)
>  	return ntohl(val);
>  }
>  
> +static inline void barrier(void) { __asm__ __volatile__("" ::: "memory"); }
> +#define smp_mb()		do { barrier(); __atomic_thread_fence(__ATOMIC_SEQ_CST); } while (0)
> +#define smp_mb_release()	do { barrier(); __atomic_thread_fence(__ATOMIC_RELEASE); } while (0)
> +#define smp_mb_acquire()	do { barrier(); __atomic_thread_fence(__ATOMIC_ACQUIRE); } while (0)
> +
> +#define smp_wmb()	smp_mb_release()
> +#define smp_rmb()	smp_mb_acquire()
> +
>  #define NS_FN_STACK_SIZE	(RLIMIT_STACK_VAL * 1024 / 8)
>  int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
>  	     void *arg);
> diff --git a/virtio.c b/virtio.c
> new file mode 100644
> index 000000000000..8354f6052aee
> --- /dev/null
> +++ b/virtio.c
> @@ -0,0 +1,662 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later

Even though I don't see the point, this needs to include "AND
BSD-3-Clause" according to REUSE and SPDX guidelines (similarly to what
Debian and Fedora require in package tags). And the SPDX tag needs to
be in a C++-style comment. So, altogether:

// SPDX-License-Identifier: GPL-2.0-or-later AND BSD-3-Clause

> + *
> + * virtio API, vring and virtqueue functions definition
> + *
> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + */
> +
> +/* some parts copied from QEMU subprojects/libvhost-user/libvhost-user.c

s/some/Some/

> + * licensed under the following terms:

I would say "originally licensed", to make it clear that we have no
intention of modifying the original licensing terms. And given that
it's multiple notices we're quoting here, perhaps separate them with
something like:

 * --
 *
 * notice
 *
 * --

> + *
> + * Copyright IBM, Corp. 2007
> + * Copyright (c) 2016 Red Hat, Inc.
> + *
> + * Authors:
> + *  Anthony Liguori <aliguori@us.ibm.com>
> + *  Marc-André Lureau <mlureau@redhat.com>
> + *  Victor Kaplansky <victork@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * later.  See the COPYING file in the top-level directory.
> + *
> + * Some parts copied from QEMU hw/virtio/virtio.c
> + * licensed under the following terms:

Same here.

> + *
> + * Copyright IBM, Corp. 2007
> + *
> + * Authors:
> + *  Anthony Liguori   <aliguori@us.ibm.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + * virtq_used_event() and virtq_avail_event() from
> + * https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-712000A
> + * licensed under the following terms:
> + *
> + * This header is BSD licensed so anyone can use the definitions
> + * to implement compatible drivers/servers.
> + *
> + * Copyright 2007, 2009, IBM Corporation
> + * Copyright 2011, Red Hat, Inc
> + * All rights reserved.
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + *    notice, this list of conditions and the following disclaimer.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + *    notice, this list of conditions and the following disclaimer in the
> + *    documentation and/or other materials provided with the distribution.
> + * 3. Neither the name of IBM nor the names of its contributors
> + *    may be used to endorse or promote products derived from this software
> + *    without specific prior written permission.
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ‘‘AS IS’’ AND
> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> + * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> + * SUCH DAMAGE.
> + */
> +
> +#include <stddef.h>
> +#include <endian.h>
> +#include <string.h>
> +#include <errno.h>
> +#include <sys/eventfd.h>
> +#include <sys/socket.h>
> +
> +#include "util.h"
> +#include "virtio.h"
> +
> +#define VIRTQUEUE_MAX_SIZE 1024
> +
> +/**
> + * vu_gpa_to_va() - Translate guest physical address to our virtual address.
> + * @dev:	Vhost-user device
> + * @plen:	Physical length to map (input), virtual address mapped (output)

I found this a bit misleading, in the sense that the output is not a
virtual address: it's the size we pass as input, capped as needed if
the remainder of the region is smaller... correct? In that case, I
would say: ", capped to region (output)" or something like that.

> + * @guest_addr:	Guest physical address
> + *
> + * Return: virtual address in our address space of the guest physical address
> + */
> +static void *vu_gpa_to_va(struct vu_dev *dev, uint64_t *plen, uint64_t guest_addr)
> +{
> +	unsigned int i;
> +
> +	if (*plen == 0)
> +		return NULL;
> +
> +	/* Find matching memory region. */
> +	for (i = 0; i < dev->nregions; i++) {
> +		const struct vu_dev_region *r = &dev->regions[i];
> +
> +		if ((guest_addr >= r->gpa) &&
> +		    (guest_addr < (r->gpa + r->size))) {
> +			if ((guest_addr + *plen) > (r->gpa + r->size))
> +				*plen = r->gpa + r->size - guest_addr;
> +			/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
> +			return (void *)(guest_addr - r->gpa + r->mmap_addr +
> +						     r->mmap_offset);
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +/**
> + * vring_avail_flags() - Read the available ring flags
> + * @vq:		Virtqueue
> + *
> + * Return: the available ring descriptor flags of the given virtqueue
> + */
> +static inline uint16_t vring_avail_flags(const struct vu_virtq *vq)
> +{
> +	return le16toh(vq->vring.avail->flags);
> +}
> +
> +/**
> + * vring_avail_idx() - Read the available ring index
> + * @vq:		Virtqueue
> + *
> + * Return: the available ring index of the given virtqueue
> + */
> +static inline uint16_t vring_avail_idx(struct vu_virtq *vq)
> +{
> +	vq->shadow_avail_idx = le16toh(vq->vring.avail->idx);
> +
> +	return vq->shadow_avail_idx;
> +}
> +
> +/**
> + * vring_avail_ring() - Read an available ring entry
> + * @vq:		Virtqueue
> + * @i:		Index of the entry to read
> + *
> + * Return: the ring entry content (head of the descriptor chain)
> + */
> +static inline uint16_t vring_avail_ring(const struct vu_virtq *vq, int i)
> +{
> +	return le16toh(vq->vring.avail->ring[i]);
> +}
> +
> +/**
> + * virtq_used_event - Get location of used event indices
> + *		      (only with VIRTIO_F_EVENT_IDX)
> + * @vq		Virtqueue
> + *
> + * Return: return the location of the used event index
> + */
> +static inline uint16_t *virtq_used_event(const struct vu_virtq *vq)
> +{
> +        /* For backwards compat, used event index is at *end* of avail ring. */
> +        return &vq->vring.avail->ring[vq->vring.num];
> +}
> +
> +/**
> + * vring_get_used_event() - Get the used event from the available ring
> + * @vq		Virtqueue
> + *
> + * Return: the used event (available only if VIRTIO_RING_F_EVENT_IDX is set)
> + *         used_event is a performant alternative where the driver
> + *         specifies how far the device can progress before a notification
> + *         is required.
> + */
> +static inline uint16_t vring_get_used_event(const struct vu_virtq *vq)
> +{
> +	return le16toh(*virtq_used_event(vq));
> +}
> +
> +/**
> + * virtqueue_get_head() - Get the head of the descriptor chain for a given
> + *                        index
> + * @vq:		Virtqueue
> + * @idx:	Available ring entry index
> + * @head:	Head of the descriptor chain
> + */
> +static void virtqueue_get_head(const struct vu_virtq *vq,
> +			       unsigned int idx, unsigned int *head)
> +{
> +	/* Grab the next descriptor number they're advertising, and increment
> +	 * the index we've seen.
> +	 */
> +	*head = vring_avail_ring(vq, idx % vq->vring.num);
> +
> +	/* If their number is silly, that's a fatal mistake. */

For the original code doing this, yes. But for passt, not necessarily:
I think we should clarify some aspects of the lifecycle when we run
with vhost-user support.

At the moment, you allow mmap(2). But eventually I think it would be
nice if we could postpone the application of the seccomp filter,
perhaps just for the vhost-user case, so that we can lock things back
up after receiving VHOST_USER_SET_MEM_TABLE.

Should we get another such command, we can report failure and quit;
the user would need to restart us (libvirt already does that), and the
hypervisor would need to reconnect.

If we can go that way, then we can very well die() here. But if that's
not possible for some reason (perhaps we already discussed this?), then
I think it would be more reasonable to close the connection here,
instead of terminating.

> +	if (*head >= vq->vring.num)
> +		die("Guest says index %u is available", *head);

In any case, this is a bit mysterious if one just sees it printed.
Maybe:

		die("vhost-user: Guest sent invalid descriptor %u"...

?

> +}
> +
> +/**
> + * virtqueue_read_indirect_desc() - Copy virtio ring descriptors from guest
> + *                                  memory
> + * @dev:	Vhost-user device
> + * @desc:	Destination address to copy the descriptors

...to copy the descriptors to

> + * @addr:	Guest memory address to copy from
> + * @len:	Length of memory to copy
> + *
> + * Return: -1 if there is an error, 0 otherwise
> + */
> +static int virtqueue_read_indirect_desc(struct vu_dev *dev, struct vring_desc *desc,
> +					uint64_t addr, size_t len)
> +{
> +	uint64_t read_len;
> +
> +	if (len > (VIRTQUEUE_MAX_SIZE * sizeof(struct vring_desc)))
> +		return -1;
> +
> +	if (len == 0)
> +		return -1;
> +
> +	while (len) {
> +		const struct vring_desc *orig_desc;
> +
> +		read_len = len;
> +		orig_desc = vu_gpa_to_va(dev, &read_len, addr);

Should we also return if read_len < sizeof(struct vring_desc) after
this call? Can that ever happen, if we pick a particular value of addr
so that it's almost at the end of a region?
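
Something like this, I mean (just a sketch):

		orig_desc = vu_gpa_to_va(dev, &read_len, addr);
		if (!orig_desc || read_len < sizeof(struct vring_desc))
			return -1;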

> +		if (!orig_desc)
> +			return -1;
> +
> +		memcpy(desc, orig_desc, read_len);
> +		len -= read_len;
> +		addr += read_len;
> +		desc += read_len / sizeof(struct vring_desc);
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * enum virtqueue_read_desc_state - State in the descriptor chain
> + * @VIRTQUEUE_READ_DESC_ERROR	Found an invalid descriptor
> + * @VIRTQUEUE_READ_DESC_DONE	No more descriptors in the chain
> + * @VIRTQUEUE_READ_DESC_MORE	there are more descriptors in the chain
> + */
> +enum virtqueue_read_desc_state {
> +	VIRTQUEUE_READ_DESC_ERROR = -1,
> +	VIRTQUEUE_READ_DESC_DONE = 0,   /* end of chain */
> +	VIRTQUEUE_READ_DESC_MORE = 1,   /* more buffers in chain */
> +};
> +
> +/**
> + * virtqueue_read_next_desc() - Read the next descriptor in the chain
> + * @desc:	Virtio ring descriptors
> + * @i:		Index of the current descriptor
> + * @max:	Maximum value of the descriptor index
> + * @next:	Index of the next descriptor in the chain (output value)
> + *
> + * Return: current chain descriptor state (error, next, done)
> + */
> +static int virtqueue_read_next_desc(const struct vring_desc *desc,
> +				    int i, unsigned int max, unsigned int *next)
> +{
> +	/* If this descriptor says it doesn't chain, we're done. */
> +	if (!(le16toh(desc[i].flags) & VRING_DESC_F_NEXT))
> +		return VIRTQUEUE_READ_DESC_DONE;
> +
> +	/* Check they're not leading us off end of descriptors. */
> +	*next = le16toh(desc[i].next);
> +	/* Make sure compiler knows to grab that: we don't want it changing! */
> +	smp_wmb();
> +
> +	if (*next >= max)
> +		return VIRTQUEUE_READ_DESC_ERROR;
> +
> +	return VIRTQUEUE_READ_DESC_MORE;
> +}
> +
> +/**
> + * vu_queue_empty() - Check if virtqueue is empty
> + * @vq:		Virtqueue
> + *
> + * Return: true if the virtqueue is empty, false otherwise
> + */
> +bool vu_queue_empty(struct vu_virtq *vq)
> +{
> +	if (!vq->vring.avail)
> +		return true;
> +
> +	if (vq->shadow_avail_idx != vq->last_avail_idx)
> +		return false;
> +
> +	return vring_avail_idx(vq) == vq->last_avail_idx;
> +}
> +
> +/**
> + * vring_can_notify() - Check if a notification can be sent
> + * @dev:	Vhost-user device
> + * @vq:		Virtqueue
> + *
> + * Return: true if notification can be sent
> + */
> +static bool vring_can_notify(const struct vu_dev *dev, struct vu_virtq *vq)
> +{
> +	uint16_t old, new;
> +	bool v;
> +
> +	/* We need to expose used array entries before checking used event. */
> +	smp_mb();
> +
> +	/* Always notify when queue is empty (when feature acknowledge) */
> +	if (vu_has_feature(dev, VIRTIO_F_NOTIFY_ON_EMPTY) &&
> +		!vq->inuse && vu_queue_empty(vq)) {

Nit: indentation:

	if (vu_has_feature(...) &&
	    !vq->inuse && ...)

and curly brackets can go away.

> +		return true;
> +	}
> +
> +	if (!vu_has_feature(dev, VIRTIO_RING_F_EVENT_IDX))
> +		return !(vring_avail_flags(vq) & VRING_AVAIL_F_NO_INTERRUPT);
> +
> +	v = vq->signalled_used_valid;
> +	vq->signalled_used_valid = true;
> +	old = vq->signalled_used;
> +	new = vq->signalled_used = vq->used_idx;
> +	return !v || vring_need_event(vring_get_used_event(vq), new, old);
> +}
> +
> +/**
> + * vu_queue_notify() - Send a notification to the given virtqueue
> + * @dev:	Vhost-user device
> + * @vq:		Virtqueue
> + */
> +/* cppcheck-suppress unusedFunction */
> +void vu_queue_notify(const struct vu_dev *dev, struct vu_virtq *vq)
> +{
> +	if (!vq->vring.avail)
> +		return;
> +
> +	if (!vring_can_notify(dev, vq)) {
> +		debug("vhost-user: virtqueue can skip notify...");
> +		return;
> +	}
> +
> +	if (eventfd_write(vq->call_fd, 1) < 0)
> +		die_perror("Error writing eventfd");

Same as the error above, this is missing a bit of context. It could say
"Error writing vhost-user queue eventfd", perhaps.

> +}
> +
> +/* virtq_avail_event() -  Get location of available event indices
> + *			      (only with VIRTIO_F_EVENT_IDX)
> + * @vq:		Virtqueue
> + *
> + * Return: return the location of the available event index
> + */
> +static inline uint16_t *virtq_avail_event(const struct vu_virtq *vq)
> +{
> +        /* For backwards compat, avail event index is at *end* of used ring. */
> +        return (uint16_t *)&vq->vring.used->ring[vq->vring.num];
> +}
> +
> +/**
> + * vring_set_avail_event() - Set avail_event
> + * @vq:		Virtqueue
> + * @val:	Value to set to avail_event
> + *		avail_event is used in the same way the used_event is in the
> + *		avail_ring.
> + *		avail_event is used to advise the driver that notifications
> + *		are unnecessary until the driver writes entry with an index
> + *		specified by avail_event into the available ring.
> + */
> +static inline void vring_set_avail_event(const struct vu_virtq *vq,
> +					 uint16_t val)
> +{
> +	uint16_t val_le = htole16(val);
> +
> +	if (!vq->notification)
> +		return;
> +
> +	memcpy(virtq_avail_event(vq), &val_le, sizeof(val_le));
> +}
> +
> +/**
> + * virtqueue_map_desc() - Translate descriptor ring physical address into our
> + * 			  virtual address space
> + * @dev:	Vhost-user device
> + * @p_num_sg:	First iov entry to use (input),
> + *		first iov entry not used (output)
> + * @iov:	Iov array to use to store buffer virtual addresses
> + * @max_num_sg:	Maximum number of iov entries
> + * @pa:		Guest physical address of the buffer to map into our virtual
> + * 		address
> + * @sz:		Size of the buffer
> + *
> + * Return: false on error, true otherwise
> + */
> +static bool virtqueue_map_desc(struct vu_dev *dev,
> +			       unsigned int *p_num_sg, struct iovec *iov,
> +			       unsigned int max_num_sg,
> +			       uint64_t pa, size_t sz)
> +{
> +	unsigned int num_sg = *p_num_sg;
> +
> +	ASSERT(num_sg < max_num_sg);
> +	ASSERT(sz);
> +
> +	while (sz) {
> +		uint64_t len = sz;
> +
> +		iov[num_sg].iov_base = vu_gpa_to_va(dev, &len, pa);
> +		if (iov[num_sg].iov_base == NULL)
> +			die("virtio: invalid address for buffers");
> +		iov[num_sg].iov_len = len;
> +		num_sg++;
> +		sz -= len;
> +		pa += len;
> +	}
> +
> +	*p_num_sg = num_sg;
> +	return true;
> +}
> +
> +/**
> + * vu_queue_map_desc - Map the virqueue descriptor ring into our virtual

virtqueue

into our virtual one

> + * 		       address space
> + * @dev:	Vhost-user device
> + * @vq:		Virtqueue
> + * @idx:	First descriptor ring entry to map
> + * @elem:	Virtqueue element to store descriptor ring iov
> + *
> + * Return: -1 if there is an error, 0 otherwise
> + */
> +static int vu_queue_map_desc(struct vu_dev *dev, struct vu_virtq *vq, unsigned int idx,
> +			     struct vu_virtq_element *elem)
> +{
> +	const struct vring_desc *desc = vq->vring.desc;
> +	struct vring_desc desc_buf[VIRTQUEUE_MAX_SIZE];
> +	unsigned int out_num = 0, in_num = 0;
> +	unsigned int max = vq->vring.num;
> +	unsigned int i = idx;
> +	uint64_t read_len;
> +	int rc;
> +
> +	if (le16toh(desc[i].flags) & VRING_DESC_F_INDIRECT) {
> +		unsigned int desc_len;
> +		uint64_t desc_addr;
> +
> +		if (le32toh(desc[i].len) % sizeof(struct vring_desc))
> +			die("Invalid size for indirect buffer table");

Same as above: prefix by "virtio:" or "vhost-user"?

> +
> +		/* loop over the indirect descriptor table */
> +		desc_addr = le64toh(desc[i].addr);
> +		desc_len = le32toh(desc[i].len);
> +		max = desc_len / sizeof(struct vring_desc);
> +		read_len = desc_len;
> +		desc = vu_gpa_to_va(dev, &read_len, desc_addr);
> +		if (desc && read_len != desc_len) {
> +			/* Failed to use zero copy */
> +			desc = NULL;
> +			if (!virtqueue_read_indirect_desc(dev, desc_buf, desc_addr, desc_len))
> +				desc = desc_buf;
> +		}
> +		if (!desc)
> +			die("Invalid indirect buffer table");

Same here.

> +		i = 0;
> +	}
> +
> +	/* Collect all the descriptors */
> +	do {
> +		if (le16toh(desc[i].flags) & VRING_DESC_F_WRITE) {
> +			if (!virtqueue_map_desc(dev, &in_num, elem->in_sg,
> +						elem->in_num,
> +						le64toh(desc[i].addr),
> +						le32toh(desc[i].len))) {
> +				return -1;
> +			}

No need for curly brackets here: we have a single line in the block.

> +		} else {
> +			if (in_num)
> +				die("Incorrect order for descriptors");
> +			if (!virtqueue_map_desc(dev, &out_num, elem->out_sg,
> +						elem->out_num,
> +						le64toh(desc[i].addr),
> +						le32toh(desc[i].len))) {
> +				return -1;

Same here.

> +			}
> +		}
> +
> +		/* If we've got too many, that implies a descriptor loop. */
> +		if ((in_num + out_num) > max)
> +			die("Looped descriptor");

"vhost-user: Loop in queue descriptor list"?

> +		rc = virtqueue_read_next_desc(desc, i, max, &i);
> +	} while (rc == VIRTQUEUE_READ_DESC_MORE);
> +
> +	if (rc == VIRTQUEUE_READ_DESC_ERROR)
> +		die("read descriptor error");

"vhost-user: Failed to read descriptor list"?

> +
> +	elem->index = idx;
> +	elem->in_num = in_num;
> +	elem->out_num = out_num;
> +
> +	return 0;
> +}
> +
> +/**
> + * vu_queue_pop() - Pop an entry from the virtqueue
> + * @dev:	Vhost-user device
> + * @vq:		Virtqueue
> + * @elem:	Virtqueue element to fill with the entry information
> + *
> + * Return: -1 if there is an error, 0 otherwise
> + */
> +/* cppcheck-suppress unusedFunction */
> +int vu_queue_pop(struct vu_dev *dev, struct vu_virtq *vq, struct vu_virtq_element *elem)
> +{
> +	unsigned int head;
> +	int ret;
> +
> +	if (!vq->vring.avail)
> +		return -1;
> +
> +	if (vu_queue_empty(vq))
> +		return -1;
> +
> +	/*
> +	 * Needed after vu_queue_empty(), see comment in

No need for extra newline at the beginning of the comment,

	/* Needed after ...

> +	 * virtqueue_num_heads().
> +	 */
> +	smp_rmb();
> +
> +	if (vq->inuse >= vq->vring.num)
> +		die("Virtqueue size exceeded");

"vhost-user queue size exceeded"?

> +
> +	virtqueue_get_head(vq, vq->last_avail_idx++, &head);
> +
> +	if (vu_has_feature(dev, VIRTIO_RING_F_EVENT_IDX))
> +		vring_set_avail_event(vq, vq->last_avail_idx);
> +
> +	ret = vu_queue_map_desc(dev, vq, head, elem);
> +
> +	if (ret < 0)
> +		return ret;
> +
> +	vq->inuse++;
> +
> +	return 0;
> +}
> +
> +/**
> + * vu_queue_detach_element() - Detach an element from the virtqueue
> + * @vq:		Virtqueue
> + */
> +void vu_queue_detach_element(struct vu_virtq *vq)
> +{
> +	vq->inuse--;
> +	/* unmap, when DMA support is added */
> +}
> +
> +/**
> + * vu_queue_unpop() - Push back the previously popped element from the virtqueue
> + * @vq:		Virtqueue
> + */
> +/* cppcheck-suppress unusedFunction */
> +void vu_queue_unpop(struct vu_virtq *vq)
> +{
> +	vq->last_avail_idx--;
> +	vu_queue_detach_element(vq);
> +}
> +
> +/**
> + * vu_queue_rewind() - Push back a given number of popped elements
> + * @vq:		Virtqueue
> + * @num:	Number of element to unpop
> + */
> +/* cppcheck-suppress unusedFunction */
> +bool vu_queue_rewind(struct vu_virtq *vq, unsigned int num)
> +{
> +	if (num > vq->inuse)
> +		return false;
> +
> +	vq->last_avail_idx -= num;
> +	vq->inuse -= num;
> +	return true;
> +}
> +
> +/**
> + * vring_used_write() - Write an entry in the used ring
> + * @vq:		Virtqueue
> + * @uelem:	Entry to write
> + * @i:		Index of the entry in the used ring
> + */
> +static inline void vring_used_write(struct vu_virtq *vq,
> +				    const struct vring_used_elem *uelem, int i)
> +{
> +	struct vring_used *used = vq->vring.used;
> +
> +	used->ring[i] = *uelem;
> +}
> +
> +/**
> + * vu_queue_fill_by_index() - Update information of a descriptor ring entry
> + *			      in the used ring
> + * @vq:		Virtqueue
> + * @index:	Descriptor ring index
> + * @len:	Size of the element
> + * @idx:	Used ring entry index
> + */
> +void vu_queue_fill_by_index(struct vu_virtq *vq, unsigned int index,
> +			    unsigned int len, unsigned int idx)
> +{
> +	struct vring_used_elem uelem;
> +
> +	if (!vq->vring.avail)
> +		return;
> +
> +	idx = (idx + vq->used_idx) % vq->vring.num;
> +
> +	uelem.id = htole32(index);
> +	uelem.len = htole32(len);
> +	vring_used_write(vq, &uelem, idx);
> +}
> +
> +/**
> + * vu_queue_fill() - Update information of a given element in the used ring
> + * @dev:	Vhost-user device
> + * @vq:		Virtqueue
> + * @elem:	Element information to fill
> + * @len:	Size of the element
> + * @idx:	Used ring entry index
> + */
> +/* cppcheck-suppress unusedFunction */
> +void vu_queue_fill(struct vu_virtq *vq, const struct vu_virtq_element *elem,
> +		   unsigned int len, unsigned int idx)
> +{
> +	vu_queue_fill_by_index(vq, elem->index, len, idx);
> +}
> +
> +/**
> + * vring_used_idx_set() - Set the descriptor ring current index
> + * @vq:		Virtqueue
> + * @val:	Value to set in the index
> + */
> +static inline void vring_used_idx_set(struct vu_virtq *vq, uint16_t val)
> +{
> +	vq->vring.used->idx = htole16(val);
> +
> +	vq->used_idx = val;
> +}
> +
> +/**
> + * vu_queue_flush() - Flush the virtqueue
> + * @vq:		Virtqueue
> + * @count:	Number of entry to flush
> + */
> +/* cppcheck-suppress unusedFunction */
> +void vu_queue_flush(struct vu_virtq *vq, unsigned int count)
> +{
> +	uint16_t old, new;
> +
> +	if (!vq->vring.avail)
> +		return;
> +
> +	/* Make sure buffer is written before we update index. */
> +	smp_wmb();
> +
> +	old = vq->used_idx;
> +	new = old + count;
> +	vring_used_idx_set(vq, new);
> +	vq->inuse -= count;
> +	if ((uint16_t)(new - vq->signalled_used) < (uint16_t)(new - old))
> +		vq->signalled_used_valid = false;
> +}
> diff --git a/virtio.h b/virtio.h
> new file mode 100644
> index 000000000000..af9cadc990b9
> --- /dev/null
> +++ b/virtio.h
> @@ -0,0 +1,185 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * virtio API, vring and virtqueue functions definition
> + *
> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + */
> +
> +#ifndef VIRTIO_H
> +#define VIRTIO_H
> +
> +#include <stdbool.h>
> +#include <linux/vhost_types.h>
> +
> +/* Maximum size of a virtqueue */
> +#define VIRTQUEUE_MAX_SIZE 1024
> +
> +/**
> + * struct vu_ring - Virtqueue rings
> + * @num:		Size of the queue
> + * @desc:		Descriptor ring
> + * @avail:		Available ring
> + * @used:		Used ring
> + * @log_guest_addr:	Guest address for logging
> + * @flags:		Vring flags
> + * 			VHOST_VRING_F_LOG is set if log address is valid
> + */
> +struct vu_ring {
> +	unsigned int num;
> +	struct vring_desc *desc;
> +	struct vring_avail *avail;
> +	struct vring_used *used;
> +	uint64_t log_guest_addr;
> +	uint32_t flags;
> +};
> +
> +/**
> + * struct vu_virtq - Virtqueue definition
> + * @vring:			Virtqueue rings
> + * @last_avail_idx:		Next head to pop
> + * @shadow_avail_idx:		Last avail_idx read from VQ.
> + * @used_idx:			Descriptor ring current index
> + * @signalled_used:		Last used index value we have signalled on
> + * @signalled_used_valid:	True if signalled_used is valid
> + * @notification:		True if the queues notify (via event
> + * 				index or interrupt)
> + * @inuse:			Number of entries in use
> + * @call_fd:			The event file descriptor to signal when
> + * 				buffers are used.
> + * @kick_fd:			The event file descriptor for adding
> + * 				buffers to the vring
> + * @err_fd:			The event file descriptor to signal when
> + * 				error occurs
> + * @enable:			True if the virtqueue is enabled
> + * @started:			True if the virtqueue is started
> + * @vra:			QEMU address of our rings
> + */
> +struct vu_virtq {
> +	struct vu_ring vring;
> +	uint16_t last_avail_idx;
> +	uint16_t shadow_avail_idx;
> +	uint16_t used_idx;
> +	uint16_t signalled_used;
> +	bool signalled_used_valid;
> +	bool notification;
> +	unsigned int inuse;
> +	int call_fd;
> +	int kick_fd;
> +	int err_fd;
> +	unsigned int enable;
> +	bool started;
> +	struct vhost_vring_addr vra;
> +};
> +
> +/**
> + * struct vu_dev_region - guest shared memory region
> + * @gpa:		Guest physical address of the region
> + * @size:		Memory size in bytes
> + * @qva:		QEMU virtual address
> + * @mmap_offset:	Offset where the region starts in the mapped memory
> + * @mmap_addr:		Address of the mapped memory
> + */
> +struct vu_dev_region {
> +	uint64_t gpa;
> +	uint64_t size;
> +	uint64_t qva;
> +	uint64_t mmap_offset;
> +	uint64_t mmap_addr;
> +};
> +
> +#define VHOST_USER_MAX_QUEUES 2
> +
> +/*
> + * Set a reasonable maximum number of ram slots, which will be supported by
> + * any architecture.
> + */
> +#define VHOST_USER_MAX_RAM_SLOTS 32
> +
> +/**
> + * struct vu_dev - vhost-user device information
> + * @context:		Execution context
> + * @nregions:		Number of shared memory regions
> + * @regions:		Guest shared memory regions
> + * @features:		Vhost-user features
> + * @protocol_features:	Vhost-user protocol features
> + * @hdrlen:		Virtio-net header length
> + */
> +struct vu_dev {
> +	uint32_t nregions;
> +	struct vu_dev_region regions[VHOST_USER_MAX_RAM_SLOTS];
> +	struct vu_virtq vq[VHOST_USER_MAX_QUEUES];
> +	uint64_t features;
> +	uint64_t protocol_features;
> +	int hdrlen;
> +};
> +
> +/**
> + * struct vu_virtq_element - virtqueue element
> + * @index:	Descriptor ring index
> + * @out_num:	Number of outgoing iovec buffers
> + * @in_num:	Number of incoming iovec buffers
> + * @in_sg:	Incoming iovec buffers
> + * @out_sg:	Outgoing iovec buffers
> + */
> +struct vu_virtq_element {
> +	unsigned int index;
> +	unsigned int out_num;
> +	unsigned int in_num;
> +	struct iovec *in_sg;
> +	struct iovec *out_sg;
> +};
> +
> +/**
> + * has_feature() - Check a feature bit in a features set
> + * @features:	Features set
> + * @fb:		Feature bit to check
> + *
> + * Return:	True if the feature bit is set
> + */
> +static inline bool has_feature(uint64_t features, unsigned int fbit)
> +{
> +	return !!(features & (1ULL << fbit));
> +}
> +
> +/**
> + * vu_has_feature() - Check if a virtio-net feature is available
> + * @vdev:	Vhost-user device
> + * @bit:	Feature to check
> + *
> + * Return:	True if the feature is available
> + */
> +static inline bool vu_has_feature(const struct vu_dev *vdev,
> +				  unsigned int fbit)
> +{
> +	return has_feature(vdev->features, fbit);
> +}
> +
> +/**
> + * vu_has_protocol_feature() - Check if a vhost-user feature is available
> + * @vdev:	Vhost-user device
> + * @bit:	Feature to check
> + *
> + * Return:	True if the feature is available
> + */
> +/* cppcheck-suppress unusedFunction */
> +static inline bool vu_has_protocol_feature(const struct vu_dev *vdev,
> +					   unsigned int fbit)
> +{
> +	return has_feature(vdev->protocol_features, fbit);
> +}
> +
> +bool vu_queue_empty(struct vu_virtq *vq);
> +void vu_queue_notify(const struct vu_dev *dev, struct vu_virtq *vq);
> +int vu_queue_pop(struct vu_dev *dev, struct vu_virtq *vq,
> +		 struct vu_virtq_element *elem);
> +void vu_queue_detach_element(struct vu_virtq *vq);
> +void vu_queue_unpop(struct vu_virtq *vq);
> +bool vu_queue_rewind(struct vu_virtq *vq, unsigned int num);
> +void vu_queue_fill_by_index(struct vu_virtq *vq, unsigned int index,
> +			    unsigned int len, unsigned int idx);
> +void vu_queue_fill(struct vu_virtq *vq,
> +		   const struct vu_virtq_element *elem, unsigned int len,
> +		   unsigned int idx);
> +void vu_queue_flush(struct vu_virtq *vq, unsigned int count);
> +#endif /* VIRTIO_H */

The rest looks good to me.

-- 
Stefano


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 3/4] vhost-user: introduce vhost-user API
  2024-08-15 15:50 ` [PATCH v3 3/4] vhost-user: introduce vhost-user API Laurent Vivier
@ 2024-08-22 22:14   ` Stefano Brivio
  2024-08-26  5:27     ` David Gibson
  2024-08-26  5:26   ` David Gibson
  1 sibling, 1 reply; 22+ messages in thread
From: Stefano Brivio @ 2024-08-22 22:14 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Thu, 15 Aug 2024 17:50:22 +0200
Laurent Vivier <lvivier@redhat.com> wrote:

> Add vhost_user.c and vhost_user.h that define the functions needed
> to implement vhost-user backend.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  Makefile     |    4 +-
>  iov.c        |    1 -
>  vhost_user.c | 1271 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  vhost_user.h |  202 ++++++++
>  virtio.c     |    5 -
>  virtio.h     |    2 +-
>  6 files changed, 1476 insertions(+), 9 deletions(-)
>  create mode 100644 vhost_user.c
>  create mode 100644 vhost_user.h
> 
> diff --git a/Makefile b/Makefile
> index f171c7955ac9..4ccefffacfde 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -47,7 +47,7 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
>  	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
>  	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
> -	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c virtio.c
> +	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c vhost_user.c virtio.c
>  QRAP_SRCS = qrap.c
>  SRCS = $(PASST_SRCS) $(QRAP_SRCS)
>  
> @@ -57,7 +57,7 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
>  	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
>  	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
>  	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
> -	udp.h udp_flow.h util.h virtio.h
> +	udp.h udp_flow.h util.h vhost_user.h virtio.h
>  HEADERS = $(PASST_HEADERS) seccomp.h
>  
>  C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
> diff --git a/iov.c b/iov.c
> index 3f9e229a305f..3741db21790f 100644
> --- a/iov.c
> +++ b/iov.c
> @@ -68,7 +68,6 @@ size_t iov_skip_bytes(const struct iovec *iov, size_t n,
>   *
>   * Returns:    The number of bytes successfully copied.
>   */
> -/* cppcheck-suppress unusedFunction */
>  size_t iov_from_buf(const struct iovec *iov, size_t iov_cnt,
>  		    size_t offset, const void *buf, size_t bytes)
>  {
> diff --git a/vhost_user.c b/vhost_user.c
> new file mode 100644
> index 000000000000..c4cd25fae84e
> --- /dev/null
> +++ b/vhost_user.c
> @@ -0,0 +1,1271 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later

Same as 2/4 with the SPDX tag:

// SPDX-License-Identifier: GPL-2.0-or-later

> + *
> + * vhost-user API, command management and virtio interface
> + *
> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + */
> +/* some parts from QEMU subprojects/libvhost-user/libvhost-user.c
> + * licensed under the following terms:
> + *
> + * Copyright IBM, Corp. 2007
> + * Copyright (c) 2016 Red Hat, Inc.
> + *
> + * Authors:
> + *  Anthony Liguori <aliguori@us.ibm.com>
> + *  Marc-André Lureau <mlureau@redhat.com>
> + *  Victor Kaplansky <victork@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * later.  See the COPYING file in the top-level directory.
> + */
> +
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include <stdint.h>
> +#include <stddef.h>
> +#include <string.h>
> +#include <assert.h>
> +#include <stdbool.h>
> +#include <inttypes.h>
> +#include <time.h>
> +#include <net/ethernet.h>
> +#include <netinet/in.h>
> +#include <sys/epoll.h>
> +#include <sys/eventfd.h>
> +#include <sys/mman.h>
> +#include <linux/vhost_types.h>
> +#include <linux/virtio_net.h>
> +
> +#include "util.h"
> +#include "passt.h"
> +#include "tap.h"
> +#include "vhost_user.h"
> +
> +/* vhost-user version we are compatible with */
> +#define VHOST_USER_VERSION 1
> +
> +/**
> + * vu_print_capabilities() - print vhost-user capabilities
> + * 			     this is part of the vhost-user backend
> + * 			     convention.
> + */
> +/* cppcheck-suppress unusedFunction */
> +void vu_print_capabilities(void)
> +{
> +	info("{");
> +	info("  \"type\": \"net\"");
> +	info("}");
> +	exit(EXIT_SUCCESS);
> +}
> +
> +/**
> + * vu_request_to_string() - convert a vhost-user request number to its name
> + * @req:	request number
> + *
> + * Return: the name of request number
> + */
> +static const char *vu_request_to_string(unsigned int req)
> +{
> +	if (req < VHOST_USER_MAX) {
> +#define REQ(req) [req] = #req

Oh, neat, I had never thought of a macro like this.

> +		static const char * const vu_request_str[] = {
> +			REQ(VHOST_USER_NONE),
> +			REQ(VHOST_USER_GET_FEATURES),
> +			REQ(VHOST_USER_SET_FEATURES),
> +			REQ(VHOST_USER_SET_OWNER),
> +			REQ(VHOST_USER_RESET_OWNER),
> +			REQ(VHOST_USER_SET_MEM_TABLE),
> +			REQ(VHOST_USER_SET_LOG_BASE),
> +			REQ(VHOST_USER_SET_LOG_FD),
> +			REQ(VHOST_USER_SET_VRING_NUM),
> +			REQ(VHOST_USER_SET_VRING_ADDR),
> +			REQ(VHOST_USER_SET_VRING_BASE),
> +			REQ(VHOST_USER_GET_VRING_BASE),
> +			REQ(VHOST_USER_SET_VRING_KICK),
> +			REQ(VHOST_USER_SET_VRING_CALL),
> +			REQ(VHOST_USER_SET_VRING_ERR),
> +			REQ(VHOST_USER_GET_PROTOCOL_FEATURES),
> +			REQ(VHOST_USER_SET_PROTOCOL_FEATURES),
> +			REQ(VHOST_USER_GET_QUEUE_NUM),
> +			REQ(VHOST_USER_SET_VRING_ENABLE),
> +			REQ(VHOST_USER_SEND_RARP),
> +			REQ(VHOST_USER_NET_SET_MTU),
> +			REQ(VHOST_USER_SET_BACKEND_REQ_FD),
> +			REQ(VHOST_USER_IOTLB_MSG),
> +			REQ(VHOST_USER_SET_VRING_ENDIAN),
> +			REQ(VHOST_USER_GET_CONFIG),
> +			REQ(VHOST_USER_SET_CONFIG),
> +			REQ(VHOST_USER_POSTCOPY_ADVISE),
> +			REQ(VHOST_USER_POSTCOPY_LISTEN),
> +			REQ(VHOST_USER_POSTCOPY_END),
> +			REQ(VHOST_USER_GET_INFLIGHT_FD),
> +			REQ(VHOST_USER_SET_INFLIGHT_FD),
> +			REQ(VHOST_USER_GPU_SET_SOCKET),
> +			REQ(VHOST_USER_VRING_KICK),
> +			REQ(VHOST_USER_GET_MAX_MEM_SLOTS),
> +			REQ(VHOST_USER_ADD_MEM_REG),
> +			REQ(VHOST_USER_REM_MEM_REG),
> +			REQ(VHOST_USER_MAX),

REQ(VHOST_USER_MAX) isn't really needed here: you already check that
req is less than that.

> +		};
> +#undef REQ
> +		return vu_request_str[req];
> +	}
> +
> +	return "unknown";
> +}
> +
> +/**
> + * qva_to_va() -  Translate front-end (QEMU) virtual address to our virtual
> + * 		  address
> + * @dev:		Vhost-user device

vhost-user device

> + * @qemu_addr:		front-end userspace address
> + *
> + * Return: the memory address in our process virtual address space.
> + */
> +static void *qva_to_va(struct vu_dev *dev, uint64_t qemu_addr)
> +{
> +	unsigned int i;
> +
> +	/* Find matching memory region.  */
> +	for (i = 0; i < dev->nregions; i++) {
> +		const struct vu_dev_region *r = &dev->regions[i];
> +
> +		if ((qemu_addr >= r->qva) && (qemu_addr < (r->qva + r->size))) {
> +			/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
> +			return (void *)(qemu_addr - r->qva + r->mmap_addr +
> +					r->mmap_offset);
> +		}
> +	}

Not a strong preference, only if you find it convenient: this could be
vu_gpa_to_va(), if it optionally took NULL as plen (in that case, it
wouldn't be used or set).
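
That is, roughly (just a sketch of the optional-plen handling):

	if (plen && *plen == 0)
		return NULL;
	...
	if (plen && (guest_addr + *plen) > (r->gpa + r->size))
		*plen = r->gpa + r->size - guest_addr;

...keeping in mind that the lookup itself still differs (qva here,
gpa there), of course.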

> +
> +	return NULL;
> +}
> +
> +/**
> + * vmsg_close_fds() - Close all file descriptors of a given message
> + * @vmsg:	Vhost-user message with the list of the file descriptors

vhost-user

> + */
> +static void vmsg_close_fds(const struct vhost_user_msg *vmsg)
> +{
> +	int i;
> +
> +	for (i = 0; i < vmsg->fd_num; i++)
> +		close(vmsg->fds[i]);
> +}
> +
> +/**
> + * vu_remove_watch() - Remove a file descriptor from an our passt epoll

s/an //

> + * 		       file descriptor
> + * @vdev:	Vhost-user device
> + * @fd:		file descriptor to remove
> + */
> +static void vu_remove_watch(const struct vu_dev *vdev, int fd)
> +{
> +	(void)vdev;
> +	(void)fd;
> +}
> +
> +/**
> + * vmsg_set_reply_u64() - Set reply payload.u64 and clear request flags
> + * 			  and fd_num
> + * @vmsg:	Vhost-user message

vhost-user

> + * @val:	64bit value to reply

64-bit

> + */
> +static void vmsg_set_reply_u64(struct vhost_user_msg *vmsg, uint64_t val)
> +{
> +	vmsg->hdr.flags = 0; /* defaults will be set by vu_send_reply() */
> +	vmsg->hdr.size = sizeof(vmsg->payload.u64);
> +	vmsg->payload.u64 = val;
> +	vmsg->fd_num = 0;
> +}
> +
> +/**
> + * vu_message_read_default() - Read incoming vhost-user message from the
> + * 			       front-end
> + * @conn_fd:	Vhost-user command socket
> + * @vmsg:	Vhost-user message

vhost-user

> + *
> + * Return: -1 if there is an error,
> + *          0 if recvmsg() has been interrupted,

or if there's no data to read

> + *          1 if a message has been received
> + */
> +static int vu_message_read_default(int conn_fd, struct vhost_user_msg *vmsg)
> +{
> +	char control[CMSG_SPACE(VHOST_MEMORY_BASELINE_NREGIONS *
> +		     sizeof(int))] = { 0 };
> +	struct iovec iov = {
> +		.iov_base = (char *)vmsg,
> +		.iov_len = VHOST_USER_HDR_SIZE,
> +	};
> +	struct msghdr msg = {
> +		.msg_iov = &iov,
> +		.msg_iovlen = 1,
> +		.msg_control = control,
> +		.msg_controllen = sizeof(control),
> +	};
> +	ssize_t ret, sz_payload;
> +	struct cmsghdr *cmsg;
> +	size_t fd_size;
> +
> +	ret = recvmsg(conn_fd, &msg, MSG_DONTWAIT);
> +	if (ret < 0) {
> +		if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK)
> +			return 0;
> +		return -1;
> +	}
> +
> +	vmsg->fd_num = 0;
> +	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
> +	     cmsg = CMSG_NXTHDR(&msg, cmsg)) {
> +		if (cmsg->cmsg_level == SOL_SOCKET &&
> +		    cmsg->cmsg_type == SCM_RIGHTS) {
> +			fd_size = cmsg->cmsg_len - CMSG_LEN(0);
> +			ASSERT(fd_size / sizeof(int) <=
> +			       VHOST_MEMORY_BASELINE_NREGIONS);
> +			vmsg->fd_num = fd_size / sizeof(int);
> +			memcpy(vmsg->fds, CMSG_DATA(cmsg), fd_size);

Coverity doesn't quite like the fact that fd_size is used without an
appropriate check.

If sizeof(int) is 4, VHOST_MEMORY_BASELINE_NREGIONS is 8, and fd_size
is 35, we'll pass the ASSERT(), because 35 / 4 = 8, but we'll have
three extra bytes here.

This looks safer:

			size_t fd_size;

			ASSERT(cmsg->cmsg_len >= CMSG_LEN(0));
			fd_size = cmsg->cmsg_len - CMSG_LEN(0);
			ASSERT(fd_size <=
			       VHOST_MEMORY_BASELINE_NREGIONS * sizeof(int));

			vmsg->fd_num = fd_size / sizeof(int);
			memcpy(vmsg->fds, CMSG_DATA(cmsg), fd_size);

or even:

			ASSERT(fd_size <=
			       sizeof(((struct vhost_user_msg *)0)->fds));

> +			break;
> +		}
> +	}
> +
> +	sz_payload = vmsg->hdr.size;
> +	if ((size_t)sz_payload > sizeof(vmsg->payload)) {
> +		die("Error: too big message request: %d,"

It's not clear that it's about a vhost-user message, perhaps:

	vhost-user message request too big: ...

> +			 " size: vmsg->size: %zd, "
> +			 "while sizeof(vmsg->payload) = %zu",
> +			 vmsg->hdr.request, sz_payload, sizeof(vmsg->payload));
> +	}
> +
> +	if (sz_payload) {
> +		do {
> +			ret = recv(conn_fd, &vmsg->payload, sz_payload, 0);
> +		} while (ret < 0 && (errno == EINTR || errno == EAGAIN));

No need for curly brackets: it's a one-line statement.

> +
> +		if (ret < sz_payload)
> +			die_perror("Error while reading");

errno will not necessarily indicate _this_ error here, because you can
also hit this with a positive or zero value.

And I'm not sure if partial reads are a risk, but if they are, you
should keep a count of how much you read, say:

		for (n = 0; n < sz_payload; n += rc) {
			rc = recv(conn_fd, (char *)&vmsg->payload + n,
				  sz_payload - n, 0);

			if (rc < 0 && (errno == EINTR || errno == EAGAIN)) {
				rc = 0;	/* interrupted, retry */
				continue;
			}
			if (rc < 0)
				die_perror("vhost-user message receive");
			if (rc == 0)
				die("EOF on vhost-user message receive");
		}

By the way, the socket is actually blocking, and if you really meant to
keep it blocking, you'll never get EAGAIN, and you don't need to loop.
The same goes for the first recvmsg() in this function: you wouldn't
need to check for EAGAIN or EWOULDBLOCK.

> +	}
> +
> +	return 1;
> +}
> +
> +/**
> + * vu_message_write() - send a message to the front-end

Send

> + * @conn_fd:	Vhost-user command socket
> + * @vmsg:	Vhost-user message

vhost-user

> + *
> + * #syscalls:vu sendmsg
> + */
> +static void vu_message_write(int conn_fd, struct vhost_user_msg *vmsg)
> +{
> +	char control[CMSG_SPACE(VHOST_MEMORY_BASELINE_NREGIONS * sizeof(int))] = { 0 };
> +	struct iovec iov = {
> +		.iov_base = (char *)vmsg,
> +		.iov_len = VHOST_USER_HDR_SIZE,
> +	};
> +	struct msghdr msg = {
> +		.msg_iov = &iov,
> +		.msg_iovlen = 1,
> +		.msg_control = control,
> +	};
> +	const uint8_t *p = (uint8_t *)vmsg;
> +	int rc;
> +
> +	memset(control, 0, sizeof(control));
> +	ASSERT(vmsg->fd_num <= VHOST_MEMORY_BASELINE_NREGIONS);
> +	if (vmsg->fd_num > 0) {
> +		size_t fdsize = vmsg->fd_num * sizeof(int);
> +		struct cmsghdr *cmsg;
> +
> +		msg.msg_controllen = CMSG_SPACE(fdsize);
> +		cmsg = CMSG_FIRSTHDR(&msg);
> +		cmsg->cmsg_len = CMSG_LEN(fdsize);
> +		cmsg->cmsg_level = SOL_SOCKET;
> +		cmsg->cmsg_type = SCM_RIGHTS;
> +		memcpy(CMSG_DATA(cmsg), vmsg->fds, fdsize);
> +	} else {
> +		msg.msg_controllen = 0;
> +	}
> +
> +	do {
> +		rc = sendmsg(conn_fd, &msg, 0);
> +	} while (rc < 0 && (errno == EINTR || errno == EAGAIN));

Same as above: if you keep the socket blocking, you don't need to check
for EAGAIN...

> +
> +	if (vmsg->hdr.size) {
> +		do {
> +			rc = write(conn_fd, p + VHOST_USER_HDR_SIZE,
> +				   vmsg->hdr.size);
> +		} while (rc < 0 && (errno == EINTR || errno == EAGAIN));

and you don't need to loop here, either.

> +	}
> +
> +	if (rc <= 0)
> +		die_perror("Error while writing");

"vhost-user message send"?

> +}
> +
> +/**
> + * vu_send_reply() - Update message flags and send it to front-end
> + * @conn_fd:	Vhost-user command socket
> + * @vmsg:	Vhost-user message
> + */
> +static void vu_send_reply(int conn_fd, struct vhost_user_msg *msg)
> +{
> +	msg->hdr.flags &= ~VHOST_USER_VERSION_MASK;
> +	msg->hdr.flags |= VHOST_USER_VERSION;
> +	msg->hdr.flags |= VHOST_USER_REPLY_MASK;
> +
> +	vu_message_write(conn_fd, msg);
> +}
> +
> +/**
> + * vu_get_features_exec() - Provide back-end features bitmask to front-end
> + * @vmsg:	Vhost-user message
> + *
> + * Return: true as a reply is requested
> + */
> +static bool vu_get_features_exec(struct vhost_user_msg *msg)
> +{
> +	uint64_t features =
> +		1ULL << VIRTIO_F_VERSION_1 |
> +		1ULL << VIRTIO_NET_F_MRG_RXBUF |
> +		1ULL << VHOST_USER_F_PROTOCOL_FEATURES;
> +
> +	vmsg_set_reply_u64(msg, features);
> +
> +	debug("Sending back to guest u64: 0x%016"PRIx64, msg->payload.u64);
> +
> +	return true;
> +}
> +
> +/**
> + * vu_set_enable_all_rings() - Enable/disable all the virtqueues
> + * @vdev:	Vhost-user device
> + * @enable:	New virtqueues state
> + */
> +static void vu_set_enable_all_rings(struct vu_dev *vdev, bool enable)
> +{
> +	uint16_t i;
> +
> +	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++)
> +		vdev->vq[i].enable = enable;
> +}
> +
> +/**
> + * vu_set_features_exec() - Enable features of the back-end
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message

vhost-user

> + *
> + * Return: false as no reply is requested
> + */
> +static bool vu_set_features_exec(struct vu_dev *vdev,
> +				 struct vhost_user_msg *msg)
> +{
> +	debug("u64: 0x%016"PRIx64, msg->payload.u64);
> +
> +	vdev->features = msg->payload.u64;
> +	/* We only support devices conforming to VIRTIO 1.0 or
> +	 * later
> +	 */
> +	if (!vu_has_feature(vdev, VIRTIO_F_VERSION_1))
> +		die("virtio legacy devices aren't supported by passt");
> +
> +	if (!vu_has_feature(vdev, VHOST_USER_F_PROTOCOL_FEATURES))
> +		vu_set_enable_all_rings(vdev, true);
> +
> +	/* virtio-net features */
> +
> +	if (vu_has_feature(vdev, VIRTIO_F_VERSION_1) ||
> +	    vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) {
> +		vdev->hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +	} else {
> +		vdev->hdrlen = sizeof(struct virtio_net_hdr);
> +	}
> +
> +	return false;
> +}
> +
> +/**
> + * vu_set_owner_exec() - Session start flag, do nothing in our case
> + *
> + * Return: false as no reply is requested
> + */
> +static bool vu_set_owner_exec(void)
> +{
> +	return false;
> +}
> +
> +/**
> + * map_ring() - Convert ring front-end (QEMU) addresses to our process
> + * 		virtual address space.
> + * @vdev:	Vhost-user device

vhost-user

> + * @vq:		Virtqueue
> + *
> + * Return: true if ring cannot be mapped to our address space
> + */
> +static bool map_ring(struct vu_dev *vdev, struct vu_virtq *vq)
> +{
> +	vq->vring.desc = qva_to_va(vdev, vq->vra.desc_user_addr);
> +	vq->vring.used = qva_to_va(vdev, vq->vra.used_user_addr);
> +	vq->vring.avail = qva_to_va(vdev, vq->vra.avail_user_addr);
> +
> +	debug("Setting virtq addresses:");
> +	debug("    vring_desc  at %p", (void *)vq->vring.desc);
> +	debug("    vring_used  at %p", (void *)vq->vring.used);
> +	debug("    vring_avail at %p", (void *)vq->vring.avail);
> +
> +	return !(vq->vring.desc && vq->vring.used && vq->vring.avail);
> +}
> +
> +/**
> + * vu_packet_check_range() - Check if a given memory zone is contained in
> + * 			     a mapped guest memory region
> + * @buf:	Array of the available memory regions
> + * @offset:	Offset of data range in packet descriptor
> + * @size:	Length of desired data range
> + * @start:	Start of the packet descriptor
> + *
> + * Return: 0 if the zone in a mapped memory region, -1 otherwise

s/in/is in/

> + */
> +/* cppcheck-suppress unusedFunction */
> +int vu_packet_check_range(void *buf, size_t offset, size_t len,
> +			  const char *start)
> +{
> +	struct vu_dev_region *dev_region;
> +
> +	for (dev_region = buf; dev_region->mmap_addr; dev_region++) {
> +		/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
> +		char *m = (char *)dev_region->mmap_addr;
> +
> +		if (m <= start &&
> +		    start + offset + len < m + dev_region->mmap_offset +

Shouldn't this be <= as well? If the packet length matches the size of
the region, we're not out of it.
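
That is:

		if (m <= start &&
		    start + offset + len <= m + dev_region->mmap_offset +
					    dev_region->size)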

> +					       dev_region->size)
> +			return 0;
> +	}
> +
> +	return -1;
> +}
> +
> +/**
> + * vu_set_mem_table_exec() - Sets the memory map regions to be able to
> + * 			     translate the vring addresses.
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message
> + *
> + * Return: false as no reply is requested
> + *
> + * #syscalls:vu mmap munmap

As I mentioned in my comments to 2/4: it would be great if we could
assume a model where this function is invoked during initialisation,
and then we go ahead and apply an appropriate seccomp profile. I'm not
sure if it's possible.

If it helps: seccomp-bpf profiles can be appended, so we could also
allow mmap() until this function is called, and then have an extra jump
at the end of the BPF filter where, after this function is called, we
add one instruction denying mmap(). If it's called again, we would
report an error.

> + */
> +static bool vu_set_mem_table_exec(struct vu_dev *vdev,
> +				  struct vhost_user_msg *msg)
> +{
> +	struct vhost_user_memory m = msg->payload.memory, *memory = &m;
> +	unsigned int i;
> +
> +	for (i = 0; i < vdev->nregions; i++) {
> +		struct vu_dev_region *r = &vdev->regions[i];
> +		/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
> +		void *mm = (void *)r->mmap_addr;
> +
> +		if (mm)
> +			munmap(mm, r->size + r->mmap_offset);
> +	}
> +	vdev->nregions = memory->nregions;
> +
> +	debug("Nregions: %u", memory->nregions);

It's debug(), so it doesn't need to be perfectly clear, but still it
would be nice to prefix this and "Region" below with "vhost-user".

> +	for (i = 0; i < vdev->nregions; i++) {
> +		struct vhost_user_memory_region *msg_region = &memory->regions[i];
> +		struct vu_dev_region *dev_region = &vdev->regions[i];
> +		void *mmap_addr;
> +
> +		debug("Region %d", i);
> +		debug("    guest_phys_addr: 0x%016"PRIx64,
> +		      msg_region->guest_phys_addr);
> +		debug("    memory_size:     0x%016"PRIx64,
> +		      msg_region->memory_size);
> +		debug("    userspace_addr   0x%016"PRIx64,
> +		      msg_region->userspace_addr);
> +		debug("    mmap_offset      0x%016"PRIx64,
> +		      msg_region->mmap_offset);
> +
> +		dev_region->gpa = msg_region->guest_phys_addr;
> +		dev_region->size = msg_region->memory_size;
> +		dev_region->qva = msg_region->userspace_addr;
> +		dev_region->mmap_offset = msg_region->mmap_offset;
> +
> +		/* We don't use offset argument of mmap() since the
> +		 * mapped address has to be page aligned, and we use huge
> +		 * pages.
> +		 */
> +		mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset,
> +				 PROT_READ | PROT_WRITE, MAP_SHARED |
> +				 MAP_NORESERVE, msg->fds[i], 0);
> +
> +		if (mmap_addr == MAP_FAILED)
> +			die_perror("region mmap error");

Also here, "vhost-user region...".

> +
> +		dev_region->mmap_addr = (uint64_t)(uintptr_t)mmap_addr;
> +		debug("    mmap_addr:       0x%016"PRIx64,
> +		      dev_region->mmap_addr);
> +
> +		close(msg->fds[i]);
> +	}
> +
> +	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
> +		if (vdev->vq[i].vring.desc) {
> +			if (map_ring(vdev, &vdev->vq[i]))
> +				die("remapping queue %d during setmemtable", i);
> +		}
> +	}
> +
> +	return false;
> +}
> +
> +/**
> + * vu_set_vring_num_exec() - Set the size of the queue (vring size)
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message
> + *
> + * Return: false as no reply is requested
> + */
> +static bool vu_set_vring_num_exec(struct vu_dev *vdev,
> +				  struct vhost_user_msg *msg)
> +{
> +	unsigned int idx = msg->payload.state.index;
> +	unsigned int num = msg->payload.state.num;
> +
> +	debug("State.index: %u", idx);
> +	debug("State.num:   %u", num);
> +	vdev->vq[idx].vring.num = num;
> +
> +	return false;
> +}
> +
> +/**
> + * vu_set_vring_addr_exec() - Set the addresses of the vring
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message
> + *
> + * Return: false as no reply is requested
> + */
> +static bool vu_set_vring_addr_exec(struct vu_dev *vdev,
> +				   struct vhost_user_msg *msg)
> +{
> +	struct vhost_vring_addr addr = msg->payload.addr, *vra = &addr;
> +	struct vu_virtq *vq = &vdev->vq[vra->index];
> +
> +	debug("vhost_vring_addr:");
> +	debug("    index:  %d", vra->index);
> +	debug("    flags:  %d", vra->flags);
> +	debug("    desc_user_addr:   0x%016" PRIx64, (uint64_t)vra->desc_user_addr);
> +	debug("    used_user_addr:   0x%016" PRIx64, (uint64_t)vra->used_user_addr);
> +	debug("    avail_user_addr:  0x%016" PRIx64, (uint64_t)vra->avail_user_addr);
> +	debug("    log_guest_addr:   0x%016" PRIx64, (uint64_t)vra->log_guest_addr);
> +
> +	vq->vra = *vra;
> +	vq->vring.flags = vra->flags;
> +	vq->vring.log_guest_addr = vra->log_guest_addr;
> +
> +	if (map_ring(vdev, vq))
> +		die("Invalid vring_addr message");
> +
> +	vq->used_idx = le16toh(vq->vring.used->idx);
> +
> +	if (vq->last_avail_idx != vq->used_idx) {
> +		debug("Last avail index != used index: %u != %u",
> +		      vq->last_avail_idx, vq->used_idx);
> +	}
> +
> +	return false;
> +}
> +/**
> + * vu_set_vring_base_exec() - Sets the next index to use for descriptors
> + * 			      in this vring
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message
> + *
> + * Return: false as no reply is requested
> + */
> +static bool vu_set_vring_base_exec(struct vu_dev *vdev,
> +				   struct vhost_user_msg *msg)
> +{
> +	unsigned int idx = msg->payload.state.index;
> +	unsigned int num = msg->payload.state.num;
> +
> +	debug("State.index: %u", idx);
> +	debug("State.num:   %u", num);
> +	vdev->vq[idx].shadow_avail_idx = vdev->vq[idx].last_avail_idx = num;
> +
> +	return false;
> +}
> +
> +/**
> + * vu_get_vring_base_exec() - Stops the vring and returns the current
> + * 			      descriptor index or indices
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message
> + *
> + * Return: false as a reply is requested

True, then. :)

> + */
> +static bool vu_get_vring_base_exec(struct vu_dev *vdev,
> +				   struct vhost_user_msg *msg)
> +{
> +	unsigned int idx = msg->payload.state.index;
> +
> +	debug("State.index: %u", idx);
> +	msg->payload.state.num = vdev->vq[idx].last_avail_idx;
> +	msg->hdr.size = sizeof(msg->payload.state);
> +
> +	vdev->vq[idx].started = false;
> +
> +	if (vdev->vq[idx].call_fd != -1) {
> +		close(vdev->vq[idx].call_fd);
> +		vdev->vq[idx].call_fd = -1;
> +	}
> +	if (vdev->vq[idx].kick_fd != -1) {
> +		vu_remove_watch(vdev, vdev->vq[idx].kick_fd);
> +		close(vdev->vq[idx].kick_fd);
> +		vdev->vq[idx].kick_fd = -1;
> +	}
> +
> +	return true;
> +}
> +
> +/**
> + * vu_set_watch() - Add a file descriptor to the passt epoll file descriptor
> + * @vdev:	vhost-user device
> + * @fd:		file descriptor to add
> + */
> +static void vu_set_watch(const struct vu_dev *vdev, int fd)
> +{
> +	(void)vdev;
> +	(void)fd;
> +}
> +
> +/**
> + * vu_wait_queue() - wait new free entries in the virtqueue
> + * @vq:		virtqueue to wait on
> + */
> +static int vu_wait_queue(const struct vu_virtq *vq)
> +{
> +	eventfd_t kick_data;
> +	ssize_t rc;
> +	int status;
> +
> +	/* wait the kernel to put new entries in the queue */

s/the/for the/

> +	status = fcntl(vq->kick_fd, F_GETFL);
> +	if (status == -1)
> +		return -1;
> +
> +	status = fcntl(vq->kick_fd, F_SETFL, status & ~O_NONBLOCK);

This value is not used; the function could be a bit shorter by omitting
the store, say:
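
	if (fcntl(vq->kick_fd, F_SETFL, status & ~O_NONBLOCK) == -1)
		return -1;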

> +	if (status == -1)
> +		return -1;
> +	rc = eventfd_read(vq->kick_fd, &kick_data);
> +	status = fcntl(vq->kick_fd, F_SETFL, status);

Same here.

> +	if (status == -1)
> +		return -1;
> +
> +	if (rc == -1)
> +		return -1;
> +
> +	return 0;
> +}
> +
> +/**
> + * vu_send() - Send a buffer to the front-end using the RX virtqueue
> + * @vdev:	vhost-user device
> + * @buf:	address of the buffer
> + * @size:	size of the buffer
> + *
> + * Return: number of bytes sent, -1 if there is an error
> + */
> +/* cppcheck-suppress unusedFunction */
> +int vu_send(struct vu_dev *vdev, const void *buf, size_t size)
> +{
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
> +	struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
> +	size_t lens[VIRTQUEUE_MAX_SIZE];
> +	__virtio16 *num_buffers_ptr = NULL;
> +	size_t hdrlen = vdev->hdrlen;
> +	int in_sg_count = 0;
> +	size_t offset = 0;
> +	int i = 0, j;
> +
> +	debug("vu_send size %zu hdrlen %zu", size, hdrlen);
> +
> +	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
> +		err("Got packet, but no available descriptors on RX virtq.");
> +		return 0;
> +	}
> +
> +	while (offset < size) {
> +		size_t len;
> +		int total;
> +		int ret;
> +
> +		total = 0;
> +
> +		if (i == ARRAY_SIZE(elem) ||
> +		    in_sg_count == ARRAY_SIZE(in_sg)) {
> +			err("virtio-net unexpected long buffer chain");
> +			goto err;
> +		}
> +
> +		elem[i].out_num = 0;
> +		elem[i].out_sg = NULL;
> +		elem[i].in_num = ARRAY_SIZE(in_sg) - in_sg_count;
> +		elem[i].in_sg = &in_sg[in_sg_count];
> +
> +		ret = vu_queue_pop(vdev, vq, &elem[i]);
> +		if (ret < 0) {
> +			if (vu_wait_queue(vq) != -1)
> +				continue;
> +			if (i) {
> +				err("virtio-net unexpected empty queue: "
> +				    "i %d mergeable %d offset %zd, size %zd, "
> +				    "features 0x%" PRIx64,
> +				    i, vu_has_feature(vdev,
> +						      VIRTIO_NET_F_MRG_RXBUF),
> +				    offset, size, vdev->features);
> +			}
> +			offset = -1;
> +			goto err;
> +		}
> +		in_sg_count += elem[i].in_num;
> +
> +		if (elem[i].in_num < 1) {
> +			err("virtio-net receive queue contains no in buffers");
> +			vu_queue_detach_element(vq);
> +			offset = -1;
> +			goto err;
> +		}
> +
> +		if (i == 0) {
> +			struct virtio_net_hdr hdr = {
> +				.flags = VIRTIO_NET_HDR_F_DATA_VALID,
> +				.gso_type = VIRTIO_NET_HDR_GSO_NONE,
> +			};
> +
> +			ASSERT(offset == 0);
> +			ASSERT(elem[i].in_sg[0].iov_len >= hdrlen);
> +
> +			len = iov_from_buf(elem[i].in_sg, elem[i].in_num, 0,
> +					   &hdr, sizeof(hdr));
> +
> +			num_buffers_ptr = (__virtio16 *)((char *)elem[i].in_sg[0].iov_base +
> +							 len);
> +
> +			total += hdrlen;
> +		}
> +
> +		len = iov_from_buf(elem[i].in_sg, elem[i].in_num, total,
> +				   (char *)buf + offset, size - offset);
> +
> +		total += len;
> +		offset += len;
> +
> +		/* If buffers can't be merged, at this point we
> +		 * must have consumed the complete packet.
> +		 * Otherwise, drop it.
> +		 */
> +		if (!vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF) &&
> +		    offset < size) {
> +			vu_queue_unpop(vq);
> +			goto err;
> +		}
> +
> +		lens[i] = total;
> +		i++;
> +	}
> +
> +	if (num_buffers_ptr && vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
> +		*num_buffers_ptr = htole16(i);
> +
> +	for (j = 0; j < i; j++) {
> +		debug("filling total %zd idx %d", lens[j], j);
> +		vu_queue_fill(vq, &elem[j], lens[j], j);
> +	}
> +
> +	vu_queue_flush(vq, i);
> +	vu_queue_notify(vdev, vq);
> +
> +	debug("vhost-user sent %zu", offset);
> +
> +	return offset;
> +err:
> +	for (j = 0; j < i; j++)
> +		vu_queue_detach_element(vq);
> +
> +	return offset;
> +}
> +
> +/**
> + * vu_handle_tx() - Receive data from the TX virtqueue
> + * @vdev:	vhost-user device
> + * @index:	index of the virtqueue

 * @now:	Current timestamp

> + */
> +static void vu_handle_tx(struct vu_dev *vdev, int index,
> +			 const struct timespec *now)
> +{
> +	struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
> +	struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
> +	struct vu_virtq *vq = &vdev->vq[index];
> +	int hdrlen = vdev->hdrlen;
> +	int out_sg_count;
> +	int count;
> +
> +	if (!VHOST_USER_IS_QUEUE_TX(index)) {
> +		debug("index %d is not a TX queue", index);
> +		return;
> +	}
> +
> +	tap_flush_pools();
> +
> +	count = 0;
> +	out_sg_count = 0;
> +	while (1) {
> +		int ret;
> +
> +

Excess newline.

> +		elem[count].out_num = 1;
> +		elem[count].out_sg = &out_sg[out_sg_count];
> +		elem[count].in_num = 0;
> +		elem[count].in_sg = NULL;
> +		ret = vu_queue_pop(vdev, vq, &elem[count]);
> +		if (ret < 0)
> +			break;

Perhaps I already asked but I can't remember/find the conclusion.

Shouldn't we assign a budget limit to this function, so that we break
the loop after a maximum number (1024?) of descriptors, to guarantee
some amount of fairness?
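
Something like this, say, where VHOST_USER_TX_BUDGET (name and value)
is just made up for the sake of the example:

	#define VHOST_USER_TX_BUDGET	1024

	count = 0;
	out_sg_count = 0;
	while (count < VHOST_USER_TX_BUDGET) {
		int ret;

		elem[count].out_num = 1;
		elem[count].out_sg = &out_sg[out_sg_count];
		elem[count].in_num = 0;
		elem[count].in_sg = NULL;
		ret = vu_queue_pop(vdev, vq, &elem[count]);
		if (ret < 0)
			break;

		/* ...process the element and count++, as below... */
	}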

> +		out_sg_count += elem[count].out_num;
> +
> +		if (elem[count].out_num < 1) {
> +			debug("virtio-net header not in first element");
> +			break;
> +		}
> +		ASSERT(elem[count].out_num == 1);
> +
> +		tap_add_packet(vdev->context,
> +			       elem[count].out_sg[0].iov_len - hdrlen,
> +			       (char *)elem[count].out_sg[0].iov_base + hdrlen);
> +		count++;
> +	}
> +	tap_handler(vdev->context, now);
> +
> +	if (count) {
> +		int i;
> +
> +		for (i = 0; i < count; i++)
> +			vu_queue_fill(vq, &elem[i], 0, i);
> +		vu_queue_flush(vq, count);
> +		vu_queue_notify(vdev, vq);
> +	}
> +}
> +
> +/**
> + * vu_kick_cb() - Called on a kick event to start to receive data
> + * @vdev:	vhost-user device
> + * @ref:	epoll reference information
> + */
> +/* cppcheck-suppress unusedFunction */
> +void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
> +		const struct timespec *now)
> +{
> +	eventfd_t kick_data;
> +	ssize_t rc;
> +	int idx;
> +
> +	for (idx = 0; idx < VHOST_USER_MAX_QUEUES; idx++)

Multi-line body loop, use curly brackets.
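
That is:

	for (idx = 0; idx < VHOST_USER_MAX_QUEUES; idx++) {
		if (vdev->vq[idx].kick_fd == ref.fd)
			break;
	}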

> +		if (vdev->vq[idx].kick_fd == ref.fd)
> +			break;
> +
> +	if (idx == VHOST_USER_MAX_QUEUES)
> +		return;
> +
> +	rc = eventfd_read(ref.fd, &kick_data);
> +	if (rc == -1)
> +		die_perror("kick eventfd_read()");

"vhost-user kick ..."

> +
> +	debug("Got kick_data: %016"PRIx64" idx:%d",
> +	      kick_data, idx);
> +	if (VHOST_USER_IS_QUEUE_TX(idx))
> +		vu_handle_tx(vdev, idx, now);
> +}
> +
> +/**
> + * vu_check_queue_msg_file() - Check if a message is valid,
> + * 			       close fds if NOFD bit is set
> + * @vmsg:	Vhost-user message

vhost-user

> + */
> +static void vu_check_queue_msg_file(struct vhost_user_msg *msg)
> +{
> +	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
> +	int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
> +
> +	if (idx >= VHOST_USER_MAX_QUEUES)
> +		die("Invalid queue index: %u", idx);

Invalid vhost-user queue...

> +
> +	if (nofd) {
> +		vmsg_close_fds(msg);
> +		return;
> +	}
> +
> +	if (msg->fd_num != 1)
> +		die("Invalid fds in request: %d", msg->hdr.request);

in vhost-user request...

> +}
> +
> +/**
> + * vu_set_vring_kick_exec() - Set the event file descriptor for adding buffers
> + * 			      to the vring
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message

vhost-user

> + *
> + * Return: false as no reply is requested
> + */
> +static bool vu_set_vring_kick_exec(struct vu_dev *vdev,
> +				   struct vhost_user_msg *msg)
> +{
> +	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
> +	int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
> +
> +	debug("u64: 0x%016"PRIx64, msg->payload.u64);
> +
> +	vu_check_queue_msg_file(msg);
> +
> +	if (vdev->vq[idx].kick_fd != -1) {
> +		vu_remove_watch(vdev, vdev->vq[idx].kick_fd);
> +		close(vdev->vq[idx].kick_fd);
> +	}
> +
> +	vdev->vq[idx].kick_fd = nofd ? -1 : msg->fds[0];
> +	debug("Got kick_fd: %d for vq: %d", vdev->vq[idx].kick_fd, idx);
> +
> +	vdev->vq[idx].started = true;
> +
> +	if (vdev->vq[idx].kick_fd != -1 && VHOST_USER_IS_QUEUE_TX(idx)) {
> +		vu_set_watch(vdev, vdev->vq[idx].kick_fd);
> +		debug("Waiting for kicks on fd: %d for vq: %d",
> +		      vdev->vq[idx].kick_fd, idx);
> +	}
> +
> +	return false;
> +}
> +
> +/**
> + * vu_set_vring_call_exec() - Set the event file descriptor to signal when
> + * 			      buffers are used
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message

vhost-user

> + *
> + * Return: false as no reply is requested
> + */
> +static bool vu_set_vring_call_exec(struct vu_dev *vdev,
> +				   struct vhost_user_msg *msg)
> +{
> +	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
> +	int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
> +
> +	debug("u64: 0x%016"PRIx64, msg->payload.u64);
> +
> +	vu_check_queue_msg_file(msg);
> +
> +	if (vdev->vq[idx].call_fd != -1)
> +		close(vdev->vq[idx].call_fd);
> +
> +	vdev->vq[idx].call_fd = nofd ? -1 : msg->fds[0];
> +
> +	/* in case of I/O hang after reconnecting */
> +	if (vdev->vq[idx].call_fd != -1)
> +		eventfd_write(msg->fds[0], 1);
> +
> +	debug("Got call_fd: %d for vq: %d", vdev->vq[idx].call_fd, idx);
> +
> +	return false;
> +}
> +
> +/**
> + * vu_set_vring_err_exec() - Set the event file descriptor to signal when
> + * 			     error occurs
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message

vhost-user

> + *
> + * Return: false as no reply is requested
> + */
> +static bool vu_set_vring_err_exec(struct vu_dev *vdev,
> +				  struct vhost_user_msg *msg)
> +{
> +	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
> +	int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
> +
> +	debug("u64: 0x%016"PRIx64, msg->payload.u64);
> +
> +	vu_check_queue_msg_file(msg);
> +
> +	if (vdev->vq[idx].err_fd != -1) {
> +		close(vdev->vq[idx].err_fd);
> +		vdev->vq[idx].err_fd = -1;
> +	}
> +
> +	/* cppcheck-suppress redundantAssignment */
> +	vdev->vq[idx].err_fd = nofd ? -1 : msg->fds[0];

Wouldn't it be easier (and not require a suppression) to say:

	if (!nofd)
		vdev->vq[idx].err_fd = msg->fds[0];

?

> +
> +	return false;
> +}
> +
> +/**
> + * vu_get_protocol_features_exec() - Provide the protocol (vhost-user) features
> + * 				     to the front-end
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message

vhost-user

> + *
> + * Return: false as a reply is requested

True, actually: the function returns true, so the comment should say
"true as a reply is requested".

> + */
> +static bool vu_get_protocol_features_exec(struct vhost_user_msg *msg)
> +{
> +	uint64_t features = 1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK;
> +
> +	vmsg_set_reply_u64(msg, features);
> +
> +	return true;
> +}
> +
> +/**
> + * vu_set_protocol_features_exec() - Enable protocol (vhost-user) features
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message
> + *
> + * Return: false as no reply is requested
> + */
> +static bool vu_set_protocol_features_exec(struct vu_dev *vdev,
> +					  struct vhost_user_msg *msg)
> +{
> +	uint64_t features = msg->payload.u64;
> +
> +	debug("u64: 0x%016"PRIx64, features);
> +
> +	vdev->protocol_features = msg->payload.u64;
> +
> +	if (vu_has_protocol_feature(vdev,
> +				    VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS) &&

Do we actually care about VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS at
all, I wonder? This whole part (coming from ff1320050a3a "libvhost-user:
implement in-band notifications") is rather hard to read/understand, so it
would be great if we could just get rid of it altogether.

But if not, sure, let's leave it like the original, I'd say.

> +	    (!vu_has_protocol_feature(vdev, VHOST_USER_PROTOCOL_F_BACKEND_REQ) ||
> +	     !vu_has_protocol_feature(vdev, VHOST_USER_PROTOCOL_F_REPLY_ACK))) {
> +	/*
> +	 * The use case for using messages for kick/call is simulation, to make
> +	 * the kick and call synchronous. To actually get that behaviour, both
> +	 * of the other features are required.
> +	 * Theoretically, one could use only kick messages, or do them without
> +	 * having F_REPLY_ACK, but too many (possibly pending) messages on the
> +	 * socket will eventually cause the master to hang, to avoid this in
> +	 * scenarios where not desired enforce that the settings are in a way
> +	 * that actually enables the simulation case.
> +	 */
> +		die("F_IN_BAND_NOTIFICATIONS requires F_BACKEND_REQ && F_REPLY_ACK");
> +	}
> +
> +	return false;
> +}
> +
> +/**
> + * vu_get_queue_num_exec() - Tell how many queues we support
> + * @vmsg:	Vhost-user message
> + *
> + * Return: true as a reply is requested
> + */
> +static bool vu_get_queue_num_exec(struct vhost_user_msg *msg)
> +{
> +	vmsg_set_reply_u64(msg, VHOST_USER_MAX_QUEUES);
> +	return true;
> +}
> +
> +/**
> + * vu_set_vring_enable_exec() - Enable or disable corresponding vring
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message
> + *
> + * Return: false as no reply is requested
> + */
> +static bool vu_set_vring_enable_exec(struct vu_dev *vdev,
> +				     struct vhost_user_msg *msg)
> +{
> +	unsigned int enable = msg->payload.state.num;
> +	unsigned int idx = msg->payload.state.index;
> +
> +	debug("State.index:  %u", idx);
> +	debug("State.enable: %u", enable);
> +
> +	if (idx >= VHOST_USER_MAX_QUEUES)
> +		die("Invalid vring_enable index: %u", idx);
> +
> +	vdev->vq[idx].enable = enable;
> +	return false;
> +}
> +
> +/**
> + * vu_init() - Initialize vhost-user device structure
> + * @c:		execution context
> + * @vdev:	vhost-user device
> + */
> +/* cppcheck-suppress unusedFunction */
> +void vu_init(struct ctx *c, struct vu_dev *vdev)
> +{
> +	int i;
> +
> +	vdev->context = c;
> +	vdev->hdrlen = 0;
> +	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
> +		vdev->vq[i] = (struct vu_virtq){
> +			.call_fd = -1,
> +			.kick_fd = -1,
> +			.err_fd = -1,
> +			.notification = true,
> +		};
> +	}
> +}
> +
> +/**
> + * vu_cleanup() - Reset vhost-user device

On the same topic as mmap(): if we just terminate after one connection
(implying --one-off / -1), we don't need to clean up after ourselves.

> + * @vdev:	vhost-user device
> + */
> +void vu_cleanup(struct vu_dev *vdev)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
> +		struct vu_virtq *vq = &vdev->vq[i];
> +
> +		vq->started = false;
> +		vq->notification = true;
> +
> +		if (vq->call_fd != -1) {
> +			close(vq->call_fd);
> +			vq->call_fd = -1;
> +		}
> +		if (vq->err_fd != -1) {
> +			close(vq->err_fd);
> +			vq->err_fd = -1;
> +		}
> +		if (vq->kick_fd != -1) {
> +			vu_remove_watch(vdev, vq->kick_fd);
> +			close(vq->kick_fd);
> +			vq->kick_fd = -1;
> +		}
> +
> +		vq->vring.desc = 0;
> +		vq->vring.used = 0;
> +		vq->vring.avail = 0;
> +	}
> +	vdev->hdrlen = 0;
> +
> +	for (i = 0; i < vdev->nregions; i++) {
> +		const struct vu_dev_region *r = &vdev->regions[i];
> +		/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
> +		void *m = (void *)r->mmap_addr;
> +
> +		if (m)
> +			munmap(m, r->size + r->mmap_offset);
> +	}
> +	vdev->nregions = 0;
> +}
> +
> +/**
> + * vu_sock_reset() - Reset connection socket
> + * @vdev:	vhost-user device
> + */
> +static void vu_sock_reset(struct vu_dev *vdev)
> +{
> +	(void)vdev;
> +}
> +
> +/**
> + * tap_handler_vu() - Packet handler for vhost-user
> + * @vdev:	vhost-user device
> + * @fd:		vhost-user message socket
> + * @events:	epoll events
> + */
> +/* cppcheck-suppress unusedFunction */
> +void tap_handler_vu(struct vu_dev *vdev, int fd, uint32_t events)
> +{
> +	struct vhost_user_msg msg = { 0 };
> +	bool need_reply, reply_requested;
> +	int ret;
> +
> +	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) {
> +		vu_sock_reset(vdev);
> +		return;
> +	}
> +
> +	ret = vu_message_read_default(fd, &msg);
> +	if (ret < 0)
> +		die_perror("Error while recvmsg");

Right now this looks correct: we should only hit this if
vu_message_read_default() fails on the recvmsg(). But I think it's
rather bug-prone, as errno could be clobbered between the recvmsg() and
this point.

And this is the only call site, so we could die_perror() directly there.
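
That is, right after the recvmsg() in vu_message_read_default(),
something like this (a sketch: I'm guessing the names of the socket
descriptor and of the message header variable there):

	ret = recvmsg(conn_fd, &mh, 0);
	if (ret < 0)
		die_perror("vhost-user message receive (recvmsg)");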

> +	if (ret == 0) {
> +		vu_sock_reset(vdev);

This sounds a bit harsh on EINTR. Again, if we just terminate on EOF or
error, we don't need to handle this.

> +		return;
> +	}
> +	debug("================ Vhost user message ================");
> +	debug("Request: %s (%d)", vu_request_to_string(msg.hdr.request),
> +		msg.hdr.request);
> +	debug("Flags:   0x%x", msg.hdr.flags);
> +	debug("Size:    %u", msg.hdr.size);
> +
> +	need_reply = msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK;
> +	switch (msg.hdr.request) {
> +	case VHOST_USER_GET_FEATURES:
> +		reply_requested = vu_get_features_exec(&msg);

Maybe we could have an array of function pointers (and always pass vdev
and &msg), say:

	bool (*handle[VHOST_USER_MAX])(struct vu_dev *vdev,
				       struct vhost_user_msg *msg) = {
		[VHOST_USER_SET_FEATURES]	   = vu_set_features,
		[VHOST_USER_GET_PROTOCOL_FEATURES] = vu_get_protocol_features,
		...
	};

	if (msg.hdr.request >= 0 && msg.hdr.request < VHOST_USER_MAX &&
	    handle[msg.hdr.request])
		handle[msg.hdr.request](vdev, &msg);

> +		break;
> +	case VHOST_USER_SET_FEATURES:
> +		reply_requested = vu_set_features_exec(vdev, &msg);
> +		break;
> +	case VHOST_USER_GET_PROTOCOL_FEATURES:
> +		reply_requested = vu_get_protocol_features_exec(&msg);
> +		break;
> +	case VHOST_USER_SET_PROTOCOL_FEATURES:
> +		reply_requested = vu_set_protocol_features_exec(vdev, &msg);
> +		break;
> +	case VHOST_USER_GET_QUEUE_NUM:
> +		reply_requested = vu_get_queue_num_exec(&msg);
> +		break;
> +	case VHOST_USER_SET_OWNER:
> +		reply_requested = vu_set_owner_exec();
> +		break;
> +	case VHOST_USER_SET_MEM_TABLE:
> +		reply_requested = vu_set_mem_table_exec(vdev, &msg);
> +		break;
> +	case VHOST_USER_SET_VRING_NUM:
> +		reply_requested = vu_set_vring_num_exec(vdev, &msg);
> +		break;
> +	case VHOST_USER_SET_VRING_ADDR:
> +		reply_requested = vu_set_vring_addr_exec(vdev, &msg);
> +		break;
> +	case VHOST_USER_SET_VRING_BASE:
> +		reply_requested = vu_set_vring_base_exec(vdev, &msg);
> +		break;
> +	case VHOST_USER_GET_VRING_BASE:
> +		reply_requested = vu_get_vring_base_exec(vdev, &msg);
> +		break;
> +	case VHOST_USER_SET_VRING_KICK:
> +		reply_requested = vu_set_vring_kick_exec(vdev, &msg);
> +		break;
> +	case VHOST_USER_SET_VRING_CALL:
> +		reply_requested = vu_set_vring_call_exec(vdev, &msg);
> +		break;
> +	case VHOST_USER_SET_VRING_ERR:
> +		reply_requested = vu_set_vring_err_exec(vdev, &msg);
> +		break;
> +	case VHOST_USER_SET_VRING_ENABLE:
> +		reply_requested = vu_set_vring_enable_exec(vdev, &msg);
> +		break;
> +	case VHOST_USER_NONE:
> +		vu_cleanup(vdev);
> +		return;
> +	default:
> +		die("Unhandled request: %d", msg.hdr.request);
> +	}
> +
> +	if (!reply_requested && need_reply) {
> +		msg.payload.u64 = 0;
> +		msg.hdr.flags = 0;
> +		msg.hdr.size = sizeof(msg.payload.u64);
> +		msg.fd_num = 0;
> +		reply_requested = true;
> +	}
> +
> +	if (reply_requested)
> +		vu_send_reply(fd, &msg);
> +}
> diff --git a/vhost_user.h b/vhost_user.h
> new file mode 100644
> index 000000000000..135856dc2873
> --- /dev/null
> +++ b/vhost_user.h
> @@ -0,0 +1,202 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later

Same here with the SPDX tag.

> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + *
> + * vhost-user API, command management and virtio interface
> + */
> +
> +/* some parts from subprojects/libvhost-user/libvhost-user.h */
> +
> +#ifndef VHOST_USER_H
> +#define VHOST_USER_H
> +
> +#include "virtio.h"
> +#include "iov.h"
> +
> +#define VHOST_USER_F_PROTOCOL_FEATURES 30
> +
> +#define VHOST_MEMORY_BASELINE_NREGIONS 8
> +
> +/**
> + * enum vhost_user_protocol_feature - List of available vhost-user features
> + */
> +enum vhost_user_protocol_feature {
> +	VHOST_USER_PROTOCOL_F_MQ = 0,
> +	VHOST_USER_PROTOCOL_F_LOG_SHMFD = 1,
> +	VHOST_USER_PROTOCOL_F_RARP = 2,
> +	VHOST_USER_PROTOCOL_F_REPLY_ACK = 3,
> +	VHOST_USER_PROTOCOL_F_NET_MTU = 4,
> +	VHOST_USER_PROTOCOL_F_BACKEND_REQ = 5,
> +	VHOST_USER_PROTOCOL_F_CROSS_ENDIAN = 6,
> +	VHOST_USER_PROTOCOL_F_CRYPTO_SESSION = 7,
> +	VHOST_USER_PROTOCOL_F_PAGEFAULT = 8,
> +	VHOST_USER_PROTOCOL_F_CONFIG = 9,
> +	VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD = 10,
> +	VHOST_USER_PROTOCOL_F_HOST_NOTIFIER = 11,
> +	VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD = 12,
> +	VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS = 14,
> +	VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
> +
> +	VHOST_USER_PROTOCOL_F_MAX
> +};
> +
> +/**
> + * enum vhost_user_request - list of available vhost-user request

s/list/List/, s/request/requests/

> + */
> +enum vhost_user_request {
> +	VHOST_USER_NONE = 0,
> +	VHOST_USER_GET_FEATURES = 1,
> +	VHOST_USER_SET_FEATURES = 2,
> +	VHOST_USER_SET_OWNER = 3,
> +	VHOST_USER_RESET_OWNER = 4,
> +	VHOST_USER_SET_MEM_TABLE = 5,
> +	VHOST_USER_SET_LOG_BASE = 6,
> +	VHOST_USER_SET_LOG_FD = 7,
> +	VHOST_USER_SET_VRING_NUM = 8,
> +	VHOST_USER_SET_VRING_ADDR = 9,
> +	VHOST_USER_SET_VRING_BASE = 10,
> +	VHOST_USER_GET_VRING_BASE = 11,
> +	VHOST_USER_SET_VRING_KICK = 12,
> +	VHOST_USER_SET_VRING_CALL = 13,
> +	VHOST_USER_SET_VRING_ERR = 14,
> +	VHOST_USER_GET_PROTOCOL_FEATURES = 15,
> +	VHOST_USER_SET_PROTOCOL_FEATURES = 16,
> +	VHOST_USER_GET_QUEUE_NUM = 17,
> +	VHOST_USER_SET_VRING_ENABLE = 18,
> +	VHOST_USER_SEND_RARP = 19,
> +	VHOST_USER_NET_SET_MTU = 20,
> +	VHOST_USER_SET_BACKEND_REQ_FD = 21,
> +	VHOST_USER_IOTLB_MSG = 22,
> +	VHOST_USER_SET_VRING_ENDIAN = 23,
> +	VHOST_USER_GET_CONFIG = 24,
> +	VHOST_USER_SET_CONFIG = 25,
> +	VHOST_USER_CREATE_CRYPTO_SESSION = 26,
> +	VHOST_USER_CLOSE_CRYPTO_SESSION = 27,
> +	VHOST_USER_POSTCOPY_ADVISE  = 28,
> +	VHOST_USER_POSTCOPY_LISTEN  = 29,
> +	VHOST_USER_POSTCOPY_END     = 30,
> +	VHOST_USER_GET_INFLIGHT_FD = 31,
> +	VHOST_USER_SET_INFLIGHT_FD = 32,
> +	VHOST_USER_GPU_SET_SOCKET = 33,
> +	VHOST_USER_VRING_KICK = 35,
> +	VHOST_USER_GET_MAX_MEM_SLOTS = 36,
> +	VHOST_USER_ADD_MEM_REG = 37,
> +	VHOST_USER_REM_MEM_REG = 38,
> +	VHOST_USER_MAX
> +};
> +
> +/**
> + * struct vhost_user_header - Vhost-user message header

vhost-user

> + * @request:	Request type of the message
> + * @flags:	Request flags
> + * @size:	The following payload size
> + */
> +struct vhost_user_header {
> +	enum vhost_user_request request;
> +
> +#define VHOST_USER_VERSION_MASK     0x3
> +#define VHOST_USER_REPLY_MASK       (0x1 << 2)
> +#define VHOST_USER_NEED_REPLY_MASK  (0x1 << 3)
> +	uint32_t flags;
> +	uint32_t size; /* the following payload size */

It's already in the struct comment.

> +} __attribute__ ((__packed__));
> +
> +/**
> + * struct vhost_user_memory_region - Front-end shared memory region information
> + * @guest_phys_addr:	Guest physical address of the region
> + * @memory_size:	Memory size
> + * @userspace_addr:	front-end (QEMU) userspace address
> + * @mmap_offset:	region offset in the shared memory area
> + */
> +struct vhost_user_memory_region {
> +	uint64_t guest_phys_addr;
> +	uint64_t memory_size;
> +	uint64_t userspace_addr;
> +	uint64_t mmap_offset;
> +};
> +
> +/**
> + * struct vhost_user_memory - List of all the shared memory regions
> + * @nregions:	Number of memory regions
> + * @padding:	Padding
> + * @regions:	Memory regions list
> + */
> +struct vhost_user_memory {
> +	uint32_t nregions;
> +	uint32_t padding;
> +	struct vhost_user_memory_region regions[VHOST_MEMORY_BASELINE_NREGIONS];
> +};
> +
> +/**
> + * union vhost_user_payload - Vhost-user message payload
> + * @u64:		64bit payload
> + * @state:		Vring state payload
> + * @addr:		Vring addresses payload
> + * vhost_user_memory:	Memory regions information payload

vhost-user, 64-bit, vring

> + */
> +union vhost_user_payload {
> +#define VHOST_USER_VRING_IDX_MASK   0xff
> +#define VHOST_USER_VRING_NOFD_MASK  (0x1 << 8)
> +	uint64_t u64;
> +	struct vhost_vring_state state;
> +	struct vhost_vring_addr addr;
> +	struct vhost_user_memory memory;
> +};
> +
> +/**
> + * struct vhost_user_msg - Vhost-use message

vhost-user

> + * @hdr:		Message header
> + * @payload:		Message payload
> + * @fds:		File descriptors associated with the message
> + * 			in the ancillary data.
> + * 			(shared memory or event file descriptors)
> + * @fd_num:		Number of file descriptors
> + */
> +struct vhost_user_msg {
> +	struct vhost_user_header hdr;
> +	union vhost_user_payload payload;
> +
> +	int fds[VHOST_MEMORY_BASELINE_NREGIONS];
> +	int fd_num;
> +} __attribute__ ((__packed__));
> +#define VHOST_USER_HDR_SIZE sizeof(struct vhost_user_header)
> +
> +/* index of the RX virtqueue */
> +#define VHOST_USER_RX_QUEUE 0
> +/* index of the TX virtqueue */
> +#define VHOST_USER_TX_QUEUE 1
> +
> +/* in case of multiqueue, we RX and TX queues are interleaved */

s/we/the/

> +#define VHOST_USER_IS_QUEUE_TX(n)	(n % 2)
> +#define VHOST_USER_IS_QUEUE_RX(n)	(!(n % 2))
> +
> +/**
> + * vu_queue_enabled - Return state of a virtqueue
> + * @vq:		Virtqueue to check

virtqueue

> + *
> + * Return: true if the virqueue is enabled, false otherwise

virtqueue

> + */
> +static inline bool vu_queue_enabled(const struct vu_virtq *vq)
> +{
> +	return vq->enable;
> +}
> +
> +/**
> + * vu_queue_started - Return state of a virtqueue
> + * @vq:		Virtqueue to check

virtqueue

> + *
> + * Return: true if the virqueue is started, false otherwise

virtqueue

> + */
> +static inline bool vu_queue_started(const struct vu_virtq *vq)
> +{
> +	return vq->started;
> +}
> +
> +int vu_send(struct vu_dev *vdev, const void *buf, size_t size);
> +void vu_print_capabilities(void);
> +void vu_init(struct ctx *c, struct vu_dev *vdev);
> +void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
> +		const struct timespec *now);
> +void vu_cleanup(struct vu_dev *vdev);
> +void tap_handler_vu(struct vu_dev *vdev, int fd, uint32_t events);
> +#endif /* VHOST_USER_H */
> diff --git a/virtio.c b/virtio.c
> index 8354f6052aee..d02e6e04701d 100644
> --- a/virtio.c
> +++ b/virtio.c
> @@ -323,7 +323,6 @@ static bool vring_can_notify(const struct vu_dev *dev, struct vu_virtq *vq)
>   * @dev:	Vhost-user device
>   * @vq:		Virtqueue
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_queue_notify(const struct vu_dev *dev, struct vu_virtq *vq)
>  {
>  	if (!vq->vring.avail)
> @@ -500,7 +499,6 @@ static int vu_queue_map_desc(struct vu_dev *dev, struct vu_virtq *vq, unsigned i
>   *
>   * Return: -1 if there is an error, 0 otherwise
>   */
> -/* cppcheck-suppress unusedFunction */
>  int vu_queue_pop(struct vu_dev *dev, struct vu_virtq *vq, struct vu_virtq_element *elem)
>  {
>  	unsigned int head;
> @@ -550,7 +548,6 @@ void vu_queue_detach_element(struct vu_virtq *vq)
>   * vu_queue_unpop() - Push back the previously popped element from the virqueue
>   * @vq:		Virtqueue
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_queue_unpop(struct vu_virtq *vq)
>  {
>  	vq->last_avail_idx--;
> @@ -618,7 +615,6 @@ void vu_queue_fill_by_index(struct vu_virtq *vq, unsigned int index,
>   * @len:	Size of the element
>   * @idx:	Used ring entry index
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_queue_fill(struct vu_virtq *vq, const struct vu_virtq_element *elem,
>  		   unsigned int len, unsigned int idx)
>  {
> @@ -642,7 +638,6 @@ static inline void vring_used_idx_set(struct vu_virtq *vq, uint16_t val)
>   * @vq:		Virtqueue
>   * @count:	Number of entry to flush
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_queue_flush(struct vu_virtq *vq, unsigned int count)
>  {
>  	uint16_t old, new;
> diff --git a/virtio.h b/virtio.h
> index af9cadc990b9..242e788e07e9 100644
> --- a/virtio.h
> +++ b/virtio.h
> @@ -106,6 +106,7 @@ struct vu_dev_region {
>   * @hdrlen:		Virtio -net header length
>   */
>  struct vu_dev {
> +	struct ctx *context;
>  	uint32_t nregions;
>  	struct vu_dev_region regions[VHOST_USER_MAX_RAM_SLOTS];
>  	struct vu_virtq vq[VHOST_USER_MAX_QUEUES];
> @@ -162,7 +163,6 @@ static inline bool vu_has_feature(const struct vu_dev *vdev,
>   *
>   * Return:	True if the feature is available
>   */
> -/* cppcheck-suppress unusedFunction */
>  static inline bool vu_has_protocol_feature(const struct vu_dev *vdev,
>  					   unsigned int fbit)
>  {

-- 
Stefano


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 4/4] vhost-user: add vhost-user
  2024-08-15 15:50 ` [PATCH v3 4/4] vhost-user: add vhost-user Laurent Vivier
  2024-08-22  9:59   ` Stefano Brivio
@ 2024-08-22 22:14   ` Stefano Brivio
  2024-08-23 12:32   ` Stefano Brivio
  2 siblings, 0 replies; 22+ messages in thread
From: Stefano Brivio @ 2024-08-22 22:14 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Thu, 15 Aug 2024 17:50:23 +0200
Laurent Vivier <lvivier@redhat.com> wrote:

> add virtio and vhost-user functions to connect with QEMU.
> 
>   $ ./passt --vhost-user
> 
> and
> 
>   # qemu-system-x86_64 ... -m 4G \
>         -object memory-backend-memfd,id=memfd0,share=on,size=4G \
>         -numa node,memdev=memfd0 \
>         -chardev socket,id=chr0,path=/tmp/passt_1.socket \
>         -netdev vhost-user,id=netdev0,chardev=chr0 \
>         -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
>         ...
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  Makefile       |   6 +-
>  checksum.c     |   1 -
>  conf.c         |  24 +-
>  epoll_type.h   |   4 +
>  isolation.c    |  15 +-
>  packet.c       |  13 ++
>  packet.h       |   2 +
>  passt.c        |  25 ++-
>  passt.h        |   6 +
>  pcap.c         |   1 -
>  tap.c          | 106 +++++++--
>  tap.h          |   5 +-
>  tcp.c          |  33 ++-
>  tcp_buf.c      |   6 +-
>  tcp_internal.h |   3 +-
>  tcp_vu.c       | 593 +++++++++++++++++++++++++++++++++++++++++++++++++
>  tcp_vu.h       |  12 +
>  udp.c          |  71 +++---
>  udp.h          |   8 +-
>  udp_internal.h |  34 +++
>  udp_vu.c       | 338 ++++++++++++++++++++++++++++
>  udp_vu.h       |  13 ++
>  vhost_user.c   |  28 ++-
>  virtio.c       |   1 -
>  vu_common.c    |  27 +++
>  vu_common.h    |  34 +++
>  26 files changed, 1310 insertions(+), 99 deletions(-)
>  create mode 100644 tcp_vu.c
>  create mode 100644 tcp_vu.h
>  create mode 100644 udp_internal.h
>  create mode 100644 udp_vu.c
>  create mode 100644 udp_vu.h
>  create mode 100644 vu_common.c
>  create mode 100644 vu_common.h
> 
> diff --git a/Makefile b/Makefile
> index 4ccefffacfde..4fb178932f8e 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -47,7 +47,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
>  	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
>  	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
> -	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c vhost_user.c virtio.c
> +	tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
> +	vhost_user.c virtio.c vu_common.c
>  QRAP_SRCS = qrap.c
>  SRCS = $(PASST_SRCS) $(QRAP_SRCS)
>  
> @@ -57,7 +58,8 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
>  	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
>  	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
>  	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
> -	udp.h udp_flow.h util.h vhost_user.h virtio.h
> +	tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
> +	virtio.h vu_common.h
>  HEADERS = $(PASST_HEADERS) seccomp.h
>  
>  C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
> diff --git a/checksum.c b/checksum.c
> index 006614fcbb28..aa5b7ae1cb66 100644
> --- a/checksum.c
> +++ b/checksum.c
> @@ -501,7 +501,6 @@ uint16_t csum(const void *buf, size_t len, uint32_t init)
>   *
>   * Return: 16-bit folded, complemented checksum
>   */
> -/* cppcheck-suppress unusedFunction */
>  uint16_t csum_iov(const struct iovec *iov, size_t n, uint32_t init)
>  {
>  	unsigned int i;
> diff --git a/conf.c b/conf.c
> index 46fcd9126b4c..c684dbaac694 100644
> --- a/conf.c
> +++ b/conf.c
> @@ -45,6 +45,7 @@
>  #include "lineread.h"
>  #include "isolation.h"
>  #include "log.h"
> +#include "vhost_user.h"
>  
>  /**
>   * next_chunk - Return the next piece of a string delimited by a character
> @@ -759,9 +760,14 @@ static void usage(const char *name, FILE *f, int status)
>  			"    default: same interface name as external one\n");
>  	} else {
>  		fprintf(f,
> -			"  -s, --socket PATH	UNIX domain socket path\n"
> +			"  -s, --socket, --socket-path PATH	UNIX domain socket path\n"
>  			"    default: probe free path starting from "
>  			UNIX_SOCK_PATH "\n", 1);
> +		fprintf(f,
> +			"  --vhost-user		Enable vhost-user mode\n"
> +			"    UNIX domain socket is provided by -s option\n"
> +			"  --print-capabilities	print back-end capabilities in JSON format,\n"
> +			"    only meaningful for vhost-user mode\n");

I actually wonder if we should even advertise --socket-path and
--print-capabilities here. I mean, the specification requires us to
support them, not to document them, right?

In any case, they (at least --vhost-user) should also be mentioned in
the man page.

>  	}
>  
>  	fprintf(f,
> @@ -1230,6 +1236,10 @@ void conf(struct ctx *c, int argc, char **argv)
>  		{"no-copy-routes", no_argument,		NULL,		18 },
>  		{"no-copy-addrs", no_argument,		NULL,		19 },
>  		{"netns-only",	no_argument,		NULL,		20 },
> +		{"vhost-user",	no_argument,		NULL,		21 },
> +		/* vhost-user backend program convention */
> +		{"print-capabilities", no_argument,	NULL,		22 },
> +		{"socket-path",	required_argument,	NULL,		's' },
>  		{ 0 },
>  	};
>  	const char *logname = (c->mode == MODE_PASTA) ? "pasta" : "passt";
> @@ -1359,14 +1369,12 @@ void conf(struct ctx *c, int argc, char **argv)
>  				       sizeof(c->ip4.ifname_out), "%s", optarg);
>  			if (ret <= 0 || ret >= (int)sizeof(c->ip4.ifname_out))
>  				die("Invalid interface name: %s", optarg);
> -

Unrelated change.

>  			break;
>  		case 16:
>  			ret = snprintf(c->ip6.ifname_out,
>  				       sizeof(c->ip6.ifname_out), "%s", optarg);
>  			if (ret <= 0 || ret >= (int)sizeof(c->ip6.ifname_out))
>  				die("Invalid interface name: %s", optarg);
> -

Unrelated change.

>  			break;
>  		case 17:
>  			if (c->mode != MODE_PASTA)
> @@ -1395,6 +1403,16 @@ void conf(struct ctx *c, int argc, char **argv)
>  			netns_only = 1;
>  			*userns = 0;
>  			break;
> +		case 21:
> +			if (c->mode == MODE_PASTA) {
> +				err("--vhost-user is for passt mode only");
> +				usage(argv[0], stdout, EXIT_SUCCESS);
> +			}
> +			c->mode = MODE_VU;
> +			break;
> +		case 22:
> +			vu_print_capabilities();
> +			break;
>  		case 'd':
>  			c->debug = 1;
>  			c->quiet = 0;
> diff --git a/epoll_type.h b/epoll_type.h
> index 0ad1efa0ccec..f3ef41584757 100644
> --- a/epoll_type.h
> +++ b/epoll_type.h
> @@ -36,6 +36,10 @@ enum epoll_type {
>  	EPOLL_TYPE_TAP_PASST,
>  	/* socket listening for qemu socket connections */
>  	EPOLL_TYPE_TAP_LISTEN,
> +	/* vhost-user command socket */
> +	EPOLL_TYPE_VHOST_CMD,
> +	/* vhost-user kick event socket */
> +	EPOLL_TYPE_VHOST_KICK,
>  
>  	EPOLL_NUM_TYPES,
>  };
> diff --git a/isolation.c b/isolation.c
> index 4956d7e6f331..1a27f066c2ba 100644
> --- a/isolation.c
> +++ b/isolation.c
> @@ -373,12 +373,19 @@ void isolate_postfork(const struct ctx *c)
>  
>  	prctl(PR_SET_DUMPABLE, 0);
>  
> -	if (c->mode == MODE_PASTA) {
> -		prog.len = (unsigned short)ARRAY_SIZE(filter_pasta);
> -		prog.filter = filter_pasta;
> -	} else {
> +	switch (c->mode) {
> +	case MODE_PASST:
>  		prog.len = (unsigned short)ARRAY_SIZE(filter_passt);
>  		prog.filter = filter_passt;
> +		break;
> +	case MODE_PASTA:
> +		prog.len = (unsigned short)ARRAY_SIZE(filter_pasta);
> +		prog.filter = filter_pasta;
> +		break;
> +	case MODE_VU:
> +		prog.len = (unsigned short)ARRAY_SIZE(filter_vu);
> +		prog.filter = filter_vu;
> +		break;
>  	}
>  
>  	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
> diff --git a/packet.c b/packet.c
> index 37489961a37e..36c7e5070831 100644
> --- a/packet.c
> +++ b/packet.c
> @@ -36,6 +36,19 @@
>  static int packet_check_range(const struct pool *p, size_t offset, size_t len,
>  			      const char *start, const char *func, int line)
>  {
> +	ASSERT(p->buf);

I would rather keep this path ASSERT-free, because it's meant as a
safety net.
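
For example, instead of the ASSERT(), something like:

	if (!p->buf) {
		trace("packet pool with NULL buffer, %s:%i", func, line);
		return -1;
	}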

> +
> +	if (p->buf_size == 0) {

It would be convenient to have this special value of buf_size (if I
understood it correctly) documented for struct pool:

 @buf_size:	Total size of buffer, 0 for passt vhost-user mode

And I guess we shouldn't call vu_packet_check_range() by accident if
MODE_VU isn't set.

> +		int ret;
> +
> +		ret = vu_packet_check_range((void *)p->buf, offset, len, start);
> +
> +		if (ret == -1)
> +			trace("cannot find region, %s:%i", func, line);
> +
> +		return ret;
> +	}
> +
>  	if (start < p->buf) {
>  		trace("packet start %p before buffer start %p, "
>  		      "%s:%i", (void *)start, (void *)p->buf, func, line);
> diff --git a/packet.h b/packet.h
> index 8377dcf678bb..d32688d8a0a4 100644
> --- a/packet.h
> +++ b/packet.h
> @@ -22,6 +22,8 @@ struct pool {
>  	struct iovec pkt[1];
>  };
>  
> +int vu_packet_check_range(void *buf, size_t offset, size_t len,
> +			  const char *start);
>  void packet_add_do(struct pool *p, size_t len, const char *start,
>  		   const char *func, int line);
>  void *packet_get_do(const struct pool *p, const size_t idx,
> diff --git a/passt.c b/passt.c
> index 6401730dae65..a931a55ab31b 100644
> --- a/passt.c
> +++ b/passt.c
> @@ -74,6 +74,8 @@ char *epoll_type_str[] = {
>  	[EPOLL_TYPE_TAP_PASTA]		= "/dev/net/tun device",
>  	[EPOLL_TYPE_TAP_PASST]		= "connected qemu socket",
>  	[EPOLL_TYPE_TAP_LISTEN]		= "listening qemu socket",
> +	[EPOLL_TYPE_VHOST_CMD]		= "vhost-user command socket",
> +	[EPOLL_TYPE_VHOST_KICK]		= "vhost-user kick socket",
>  };
>  static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
>  	      "epoll_type_str[] doesn't match enum epoll_type");
> @@ -206,6 +208,7 @@ int main(int argc, char **argv)
>  	struct rlimit limit;
>  	struct timespec now;
>  	struct sigaction sa;
> +	struct vu_dev vdev;
>  
>  	clock_gettime(CLOCK_MONOTONIC, &log_start);
>  
> @@ -262,6 +265,8 @@ int main(int argc, char **argv)
>  	pasta_netns_quit_init(&c);
>  
>  	tap_sock_init(&c);
> +	if (c.mode == MODE_VU)
> +		vu_init(&c, &vdev);
>  
>  	secret_init(&c);
>  
> @@ -350,14 +355,30 @@ loop:
>  			tcp_timer_handler(&c, ref);
>  			break;
>  		case EPOLL_TYPE_UDP_LISTEN:
> -			udp_listen_sock_handler(&c, ref, eventmask, &now);
> +			if (c.mode == MODE_VU)

Curly brackets for multi-line statements.

> +				udp_vu_listen_sock_handler(&c, ref, eventmask,
> +							   &now);
> +			else
> +				udp_buf_listen_sock_handler(&c, ref, eventmask,
> +							    &now);
>  			break;
>  		case EPOLL_TYPE_UDP_REPLY:
> -			udp_reply_sock_handler(&c, ref, eventmask, &now);
> +			if (c.mode == MODE_VU)
> +				udp_vu_reply_sock_handler(&c, ref, eventmask,
> +							  &now);
> +			else
> +				udp_buf_reply_sock_handler(&c, ref, eventmask,
> +							   &now);
>  			break;
>  		case EPOLL_TYPE_PING:
>  			icmp_sock_handler(&c, ref);
>  			break;
> +		case EPOLL_TYPE_VHOST_CMD:
> +			tap_handler_vu(&vdev, c.fd_tap, eventmask);
> +			break;
> +		case EPOLL_TYPE_VHOST_KICK:
> +			vu_kick_cb(&vdev, ref, &now);
> +			break;
>  		default:
>  			/* Can't happen */
>  			ASSERT(0);
> diff --git a/passt.h b/passt.h
> index d0f31a230976..71ad32aa3dd0 100644
> --- a/passt.h
> +++ b/passt.h
> @@ -25,6 +25,8 @@ union epoll_ref;
>  #include "fwd.h"
>  #include "tcp.h"
>  #include "udp.h"
> +#include "udp_vu.h"
> +#include "vhost_user.h"
>  
>  /**
>   * union epoll_ref - Breakdown of reference for epoll fd bookkeeping
> @@ -87,6 +89,7 @@ struct fqdn {
>  enum passt_modes {
>  	MODE_PASST,
>  	MODE_PASTA,
> +	MODE_VU,
>  };
>  
>  /**
> @@ -193,6 +196,7 @@ struct ip6_ctx {
>   * @no_map_gw:		Don't map connections, untracked UDP to gateway to host
>   * @low_wmem:		Low probed net.core.wmem_max
>   * @low_rmem:		Low probed net.core.rmem_max
> + * @vdev:		vhost-user device
>   */
>  struct ctx {
>  	enum passt_modes mode;
> @@ -256,6 +260,8 @@ struct ctx {
>  
>  	int low_wmem;
>  	int low_rmem;
> +
> +	struct vu_dev *vdev;
>  };
>  
>  void proto_update_l2_buf(const unsigned char *eth_d,
> diff --git a/pcap.c b/pcap.c
> index 46cc4b0d72b6..7e9c56090041 100644
> --- a/pcap.c
> +++ b/pcap.c
> @@ -140,7 +140,6 @@ void pcap_multiple(const struct iovec *iov, size_t frame_parts, unsigned int n,
>   *		containing packet data to write, including L2 header
>   * @iovcnt:	Number of buffers (@iov entries)
>   */
> -/* cppcheck-suppress unusedFunction */
>  void pcap_iov(const struct iovec *iov, size_t iovcnt)
>  {
>  	struct timespec now;
> diff --git a/tap.c b/tap.c
> index 5852705b897c..a25e4e494287 100644
> --- a/tap.c
> +++ b/tap.c
> @@ -58,6 +58,7 @@
>  #include "packet.h"
>  #include "tap.h"
>  #include "log.h"
> +#include "vhost_user.h"
>  
>  /* IPv4 (plus ARP) and IPv6 message batches from tap/guest to IP handlers */
>  static PACKET_POOL_NOINIT(pool_tap4, TAP_MSGS, pkt_buf);
> @@ -78,16 +79,22 @@ void tap_send_single(const struct ctx *c, const void *data, size_t l2len)
>  	struct iovec iov[2];
>  	size_t iovcnt = 0;
>  
> -	if (c->mode == MODE_PASST) {
> +	switch (c->mode) {
> +	case MODE_PASST:
>  		iov[iovcnt] = IOV_OF_LVALUE(vnet_len);
>  		iovcnt++;
> -	}
> -
> -	iov[iovcnt].iov_base = (void *)data;
> -	iov[iovcnt].iov_len = l2len;
> -	iovcnt++;
> +		/* fall through */
> +	case MODE_PASTA:
> +		iov[iovcnt].iov_base = (void *)data;
> +		iov[iovcnt].iov_len = l2len;
> +		iovcnt++;
>  
> -	tap_send_frames(c, iov, iovcnt, 1);
> +		tap_send_frames(c, iov, iovcnt, 1);
> +		break;
> +	case MODE_VU:
> +		vu_send(c->vdev, data, l2len);
> +		break;
> +	}
>  }
>  
>  /**
> @@ -406,10 +413,18 @@ size_t tap_send_frames(const struct ctx *c, const struct iovec *iov,
>  	if (!nframes)
>  		return 0;
>  
> -	if (c->mode == MODE_PASTA)
> +	switch (c->mode) {
> +	case MODE_PASTA:
>  		m = tap_send_frames_pasta(c, iov, bufs_per_frame, nframes);
> -	else
> +		break;
> +	case MODE_PASST:
>  		m = tap_send_frames_passt(c, iov, bufs_per_frame, nframes);
> +		break;
> +	case MODE_VU:
> +		/* fall through */
> +	default:
> +		ASSERT(0);
> +	}
>  
>  	if (m < nframes)
>  		debug("tap: failed to send %zu frames of %zu",
> @@ -967,7 +982,7 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
>   * tap_sock_reset() - Handle closing or failure of connect AF_UNIX socket
>   * @c:		Execution context
>   */
> -static void tap_sock_reset(struct ctx *c)
> +void tap_sock_reset(struct ctx *c)
>  {
>  	info("Client connection closed%s", c->one_off ? ", exiting" : "");
>  
> @@ -978,6 +993,8 @@ static void tap_sock_reset(struct ctx *c)
>  	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_tap, NULL);
>  	close(c->fd_tap);
>  	c->fd_tap = -1;
> +	if (c->mode == MODE_VU)
> +		vu_cleanup(c->vdev);
>  }
>  
>  /**
> @@ -1177,11 +1194,17 @@ static void tap_sock_unix_init(struct ctx *c)
>  	ev.data.u64 = ref.u64;
>  	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap_listen, &ev);
>  
> -	info("\nYou can now start qemu (>= 7.2, with commit 13c6be96618c):");
> -	info("    kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
> -	     c->sock_path);
> -	info("or qrap, for earlier qemu versions:");
> -	info("    ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
> +	if (c->mode == MODE_VU) {
> +		info("You can start qemu with:");
> +		info("    kvm ... -chardev socket,id=chr0,path=%s -netdev vhost-user,id=netdev0,chardev=chr0 -device virtio-net,netdev=netdev0 -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE -numa node,memdev=memfd0\n",
> +		     c->sock_path);
> +	} else {
> +		info("\nYou can now start qemu (>= 7.2, with commit 13c6be96618c):");
> +		info("    kvm ... -device virtio-net-pci,netdev=s -netdev stream,id=s,server=off,addr.type=unix,addr.path=%s",
> +		     c->sock_path);
> +		info("or qrap, for earlier qemu versions:");
> +		info("    ./qrap 5 kvm ... -net socket,fd=5 -net nic,model=virtio");
> +	}
>  }
>  
>  /**
> @@ -1191,8 +1214,8 @@ static void tap_sock_unix_init(struct ctx *c)
>   */
>  void tap_listen_handler(struct ctx *c, uint32_t events)
>  {
> -	union epoll_ref ref = { .type = EPOLL_TYPE_TAP_PASST };
>  	struct epoll_event ev = { 0 };
> +	union epoll_ref ref;
>  	int v = INT_MAX / 2;
>  	struct ucred ucred;
>  	socklen_t len;
> @@ -1232,6 +1255,10 @@ void tap_listen_handler(struct ctx *c, uint32_t events)
>  		trace("tap: failed to set SO_SNDBUF to %i", v);
>  
>  	ref.fd = c->fd_tap;
> +	if (c->mode == MODE_VU)
> +		ref.type = EPOLL_TYPE_VHOST_CMD;
> +	else
> +		ref.type = EPOLL_TYPE_TAP_PASST;
>  	ev.events = EPOLLIN | EPOLLRDHUP;
>  	ev.data.u64 = ref.u64;
>  	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
> @@ -1293,21 +1320,47 @@ static void tap_sock_tun_init(struct ctx *c)
>  	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
>  }
>  
> +void tap_sock_update_buf(void *base, size_t size)

Function comment missing.
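
Something like:

	/**
	 * tap_sock_update_buf() - Update the packet pools with a new buffer
	 * @base:	New buffer base address
	 * @size:	New buffer total size
	 */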

> +{
> +	int i;
> +
> +	pool_tap4_storage.buf = base;
> +	pool_tap4_storage.buf_size = size;
> +	pool_tap6_storage.buf = base;
> +	pool_tap6_storage.buf_size = size;
> +
> +	for (i = 0; i < TAP_SEQS; i++) {
> +		tap4_l4[i].p.buf = base;
> +		tap4_l4[i].p.buf_size = size;
> +		tap6_l4[i].p.buf = base;
> +		tap6_l4[i].p.buf_size = size;
> +	}
> +}
> +
>  /**
>   * tap_sock_init() - Create and set up AF_UNIX socket or tuntap file descriptor
>   * @c:		Execution context
>   */
>  void tap_sock_init(struct ctx *c)
>  {
> -	size_t sz = sizeof(pkt_buf);
> +	size_t sz;
> +	char *buf;
>  	int i;
>  
> -	pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS, pkt_buf, sz);
> -	pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS, pkt_buf, sz);
> +	if (c->mode == MODE_VU) {
> +		buf = NULL;
> +		sz = 0;
> +	} else {
> +		buf = pkt_buf;
> +		sz = sizeof(pkt_buf);
> +	}
> +
> +	pool_tap4_storage = PACKET_INIT(pool_tap4, TAP_MSGS, buf, sz);
> +	pool_tap6_storage = PACKET_INIT(pool_tap6, TAP_MSGS, buf, sz);
>  
>  	for (i = 0; i < TAP_SEQS; i++) {
> -		tap4_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, pkt_buf, sz);
> -		tap6_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, pkt_buf, sz);
> +		tap4_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, buf, sz);
> +		tap6_l4[i].p = PACKET_INIT(pool_l4, UIO_MAXIOV, buf, sz);
>  	}
>  
>  	if (c->fd_tap != -1) { /* Passed as --fd */
> @@ -1316,10 +1369,17 @@ void tap_sock_init(struct ctx *c)
>  
>  		ASSERT(c->one_off);
>  		ref.fd = c->fd_tap;
> -		if (c->mode == MODE_PASST)
> +		switch (c->mode) {
> +		case MODE_PASST:
>  			ref.type = EPOLL_TYPE_TAP_PASST;
> -		else
> +			break;
> +		case MODE_PASTA:
>  			ref.type = EPOLL_TYPE_TAP_PASTA;
> +			break;
> +		case MODE_VU:
> +			ref.type = EPOLL_TYPE_VHOST_CMD;
> +			break;
> +		}
>  
>  		ev.events = EPOLLIN | EPOLLRDHUP;
>  		ev.data.u64 = ref.u64;
> diff --git a/tap.h b/tap.h
> index ec9e2acec460..c5447f7077eb 100644
> --- a/tap.h
> +++ b/tap.h
> @@ -40,7 +40,8 @@ static inline struct iovec tap_hdr_iov(const struct ctx *c,
>   */
>  static inline void tap_hdr_update(struct tap_hdr *thdr, size_t l2len)
>  {
> -	thdr->vnet_len = htonl(l2len);
> +	if (thdr)
> +		thdr->vnet_len = htonl(l2len);
>  }
>  
>  void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport,
> @@ -68,6 +69,8 @@ void tap_handler_pasta(struct ctx *c, uint32_t events,
>  void tap_handler_passt(struct ctx *c, uint32_t events,
>  		       const struct timespec *now);
>  int tap_sock_unix_open(char *sock_path);
> +void tap_sock_reset(struct ctx *c);
> +void tap_sock_update_buf(void *base, size_t size);
>  void tap_sock_init(struct ctx *c);
>  void tap_flush_pools(void);
>  void tap_handler(struct ctx *c, const struct timespec *now);
> diff --git a/tcp.c b/tcp.c
> index c0820ce7a391..1af99b2ae042 100644
> --- a/tcp.c
> +++ b/tcp.c
> @@ -304,6 +304,7 @@
>  #include "flow_table.h"
>  #include "tcp_internal.h"
>  #include "tcp_buf.h"
> +#include "tcp_vu.h"
>  
>  /* MSS rounding: see SET_MSS() */
>  #define MSS_DEFAULT			536
> @@ -896,6 +897,7 @@ static void tcp_fill_header(struct tcphdr *th,
>  
>  /**
>   * tcp_fill_headers4() - Fill 802.3, IPv4, TCP headers in pre-cooked buffers
> + * @c:		Execution context
>   * @conn:	Connection pointer
>   * @taph:	tap backend specific header
>   * @iph:	Pointer to IPv4 header
> @@ -906,7 +908,8 @@ static void tcp_fill_header(struct tcphdr *th,
>   *
>   * Return: The IPv4 payload length, host order
>   */
> -static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
> +static size_t tcp_fill_headers4(const struct ctx *c,
> +				const struct tcp_tap_conn *conn,
>  				struct tap_hdr *taph,
>  				struct iphdr *iph, struct tcphdr *th,
>  				size_t dlen, const uint16_t *check,
> @@ -929,7 +932,10 @@ static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
>  
>  	tcp_fill_header(th, conn, seq);
>  
> -	tcp_update_check_tcp4(iph, th);
> +	if (c->mode != MODE_VU)

Instead of passing 'c', we could pass, all the way down, a
"no_tcp_csum" flag, which is set when you call
tcp_l2_buf_fill_headers() from tcp_vu.c. Eventually, we'll want to be
able to skip the checksum for pasta as well, if I recall correctly.
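
Sketch, for the IPv4 case:

	static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
					struct tap_hdr *taph,
					struct iphdr *iph, struct tcphdr *th,
					size_t dlen, const uint16_t *check,
					uint32_t seq, bool no_tcp_csum)
	{
		...

		if (no_tcp_csum)
			th->check = 0;
		else
			tcp_update_check_tcp4(iph, th);

		...
	}

with tcp_vu.c passing true, and the buffer paths passing false.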

> +		tcp_update_check_tcp4(iph, th);
> +	else
> +		th->check = 0;
>  
>  	tap_hdr_update(taph, l3len + sizeof(struct ethhdr));
>  
> @@ -938,6 +944,7 @@ static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
>  
>  /**
>   * tcp_fill_headers6() - Fill 802.3, IPv6, TCP headers in pre-cooked buffers
> + * @c:		Execution context
>   * @conn:	Connection pointer
>   * @taph:	tap backend specific header
>   * @ip6h:	Pointer to IPv6 header
> @@ -948,7 +955,8 @@ static size_t tcp_fill_headers4(const struct tcp_tap_conn *conn,
>   *
>   * Return: The IPv6 payload length, host order
>   */
> -static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
> +static size_t tcp_fill_headers6(const struct ctx *c,
> +				const struct tcp_tap_conn *conn,
>  				struct tap_hdr *taph,
>  				struct ipv6hdr *ip6h, struct tcphdr *th,
>  				size_t dlen, uint32_t seq)
> @@ -970,7 +978,10 @@ static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
>  
>  	tcp_fill_header(th, conn, seq);
>  
> -	tcp_update_check_tcp6(ip6h, th);
> +	if (c->mode != MODE_VU)
> +		tcp_update_check_tcp6(ip6h, th);
> +	else
> +		th->check = 0;
>  
>  	tap_hdr_update(taph, l4len + sizeof(*ip6h) + sizeof(struct ethhdr));
>  
> @@ -979,6 +990,7 @@ static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
>  
>  /**
>   * tcp_l2_buf_fill_headers() - Fill 802.3, IP, TCP headers in pre-cooked buffers
> + * @c:		Execution context
>   * @conn:	Connection pointer
>   * @iov:	Pointer to an array of iovec of TCP pre-cooked buffers
>   * @dlen:	TCP payload length
> @@ -987,7 +999,8 @@ static size_t tcp_fill_headers6(const struct tcp_tap_conn *conn,
>   *
>   * Return: IP payload length, host order
>   */
> -size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
> +size_t tcp_l2_buf_fill_headers(const struct ctx *c,
> +			       const struct tcp_tap_conn *conn,
>  			       struct iovec *iov, size_t dlen,
>  			       const uint16_t *check, uint32_t seq)
>  {
> @@ -995,13 +1008,13 @@ size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
>  	const struct in_addr *a4 = inany_v4(&tapside->faddr);
>  
>  	if (a4) {
> -		return tcp_fill_headers4(conn, iov[TCP_IOV_TAP].iov_base,
> +		return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base,
>  					 iov[TCP_IOV_IP].iov_base,
>  					 iov[TCP_IOV_PAYLOAD].iov_base, dlen,
>  					 check, seq);
>  	}
>  
> -	return tcp_fill_headers6(conn, iov[TCP_IOV_TAP].iov_base,
> +	return tcp_fill_headers6(c, conn, iov[TCP_IOV_TAP].iov_base,
>  				 iov[TCP_IOV_IP].iov_base,
>  				 iov[TCP_IOV_PAYLOAD].iov_base, dlen,
>  				 seq);
> @@ -1237,6 +1250,9 @@ int tcp_prepare_flags(struct ctx *c, struct tcp_tap_conn *conn,
>   */
>  int tcp_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  {
> +	if (c->mode == MODE_VU)
> +		return tcp_vu_send_flag(c, conn, flags);
> +
>  	return tcp_buf_send_flag(c, conn, flags);
>  }
>  
> @@ -1630,6 +1646,9 @@ static int tcp_sock_consume(const struct tcp_tap_conn *conn, uint32_t ack_seq)
>   */
>  static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
>  {
> +	if (c->mode == MODE_VU)
> +		return tcp_vu_data_from_sock(c, conn);
> +
>  	return tcp_buf_data_from_sock(c, conn);
>  }
>  
> diff --git a/tcp_buf.c b/tcp_buf.c
> index c31e9f31b438..6b702b00be89 100644
> --- a/tcp_buf.c
> +++ b/tcp_buf.c
> @@ -321,7 +321,7 @@ int tcp_buf_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)
>  		return ret;
>  	}
>  
> -	l4len = tcp_l2_buf_fill_headers(conn, iov, optlen, NULL, seq);
> +	l4len = tcp_l2_buf_fill_headers(c, conn, iov, optlen, NULL, seq);
>  	iov[TCP_IOV_PAYLOAD].iov_len = l4len;
>  
>  	if (flags & DUP_ACK) {
> @@ -378,7 +378,7 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
>  		tcp4_frame_conns[tcp4_payload_used] = conn;
>  
>  		iov = tcp4_l2_iov[tcp4_payload_used++];
> -		l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, check, seq);
> +		l4len = tcp_l2_buf_fill_headers(c, conn, iov, dlen, check, seq);
>  		iov[TCP_IOV_PAYLOAD].iov_len = l4len;
>  		if (tcp4_payload_used > TCP_FRAMES_MEM - 1)
>  			tcp_payload_flush(c);
> @@ -386,7 +386,7 @@ static void tcp_data_to_tap(struct ctx *c, struct tcp_tap_conn *conn,
>  		tcp6_frame_conns[tcp6_payload_used] = conn;
>  
>  		iov = tcp6_l2_iov[tcp6_payload_used++];
> -		l4len = tcp_l2_buf_fill_headers(conn, iov, dlen, NULL, seq);
> +		l4len = tcp_l2_buf_fill_headers(c, conn, iov, dlen, NULL, seq);
>  		iov[TCP_IOV_PAYLOAD].iov_len = l4len;
>  		if (tcp6_payload_used > TCP_FRAMES_MEM - 1)
>  			tcp_payload_flush(c);
> diff --git a/tcp_internal.h b/tcp_internal.h
> index 8b60aabc1b33..3dd4b49a4441 100644
> --- a/tcp_internal.h
> +++ b/tcp_internal.h
> @@ -89,7 +89,8 @@ void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
>  		tcp_rst_do(c, conn);					\
>  	} while (0)
>  
> -size_t tcp_l2_buf_fill_headers(const struct tcp_tap_conn *conn,
> +size_t tcp_l2_buf_fill_headers(const struct ctx *c,
> +			       const struct tcp_tap_conn *conn,
>  			       struct iovec *iov, size_t dlen,
>  			       const uint16_t *check, uint32_t seq);
>  int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
> diff --git a/tcp_vu.c b/tcp_vu.c
> new file mode 100644
> index 000000000000..6eef9187dbd7
> --- /dev/null
> +++ b/tcp_vu.c
> @@ -0,0 +1,593 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later

Same here with the SPDX tag.

> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + *
> + * tcp_vu.c - TCP L2 vhost-user management functions
> + */
> +
> +#include <errno.h>
> +#include <stddef.h>
> +#include <stdint.h>
> +
> +#include <netinet/ip.h>
> +
> +#include <sys/socket.h>
> +
> +#include <linux/tcp.h>
> +#include <linux/virtio_net.h>
> +
> +#include "util.h"
> +#include "ip.h"
> +#include "passt.h"
> +#include "siphash.h"
> +#include "inany.h"
> +#include "vhost_user.h"
> +#include "tcp.h"
> +#include "pcap.h"
> +#include "flow.h"
> +#include "tcp_conn.h"
> +#include "flow_table.h"
> +#include "tcp_vu.h"
> +#include "tcp_internal.h"
> +#include "checksum.h"
> +#include "vu_common.h"
> +
> +/**
> + * struct tcp_payload_t - TCP header and data to send segments with payload
> + * @th:		TCP header
> + * @data:	TCP data
> + */
> +struct tcp_payload_t {
> +	struct tcphdr th;
> +	uint8_t data[IP_MAX_MTU - sizeof(struct tcphdr)];
> +};
> +
> +/**
> + * struct tcp_flags_t - TCP header and data to send zero-length
> + *                      segments (flags)
> + * @th:		TCP header
> + * @opts	TCP options
> + */
> +struct tcp_flags_t {
> +	struct tcphdr th;
> +	char opts[OPT_MSS_LEN + OPT_WS_LEN + 1];
> +};
> +
> +/* vhost-user */
> +static const struct virtio_net_hdr vu_header = {
> +	.flags = VIRTIO_NET_HDR_F_DATA_VALID,
> +	.gso_type = VIRTIO_NET_HDR_GSO_NONE,
> +};
> +
> +static struct iovec iov_vu[VIRTQUEUE_MAX_SIZE];
> +static struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
> +
> +static size_t tcp_vu_l2_hdrlen(const struct vu_dev *vdev, bool v6)

Function comment missing.

> +{
> +	size_t l2_hdrlen;
> +
> +	l2_hdrlen = vdev->hdrlen + sizeof(struct ethhdr) +
> +		    sizeof(struct tcphdr);
> +
> +	if (v6)
> +		l2_hdrlen += sizeof(struct ipv6hdr);
> +	else
> +		l2_hdrlen += sizeof(struct iphdr);
> +
> +	return l2_hdrlen;
> +}
> +
> +static void tcp_vu_pcap(const struct ctx *c, const struct flowside *tapside,
> +			struct iovec *iov, int iov_used, size_t l4len)

Function comment missing.

> +{
> +	const struct in_addr *src = inany_v4(&tapside->faddr);
> +	const struct in_addr *dst = inany_v4(&tapside->eaddr);
> +	const struct vu_dev *vdev = c->vdev;
> +	char *base = iov[0].iov_base;
> +	size_t size = iov[0].iov_len;
> +	struct tcp_payload_t *bp;
> +	uint32_t sum;
> +
> +	if (!*c->pcap)
> +		return;
> +
> +	if (src && dst) {
> +		bp = vu_payloadv4(vdev, base);
> +		sum = proto_ipv4_header_psum(l4len, IPPROTO_TCP,
> +					     *src, *dst);
> +	} else {
> +		bp = vu_payloadv6(vdev, base);
> +		sum = proto_ipv6_header_psum(l4len, IPPROTO_TCP,
> +					     &tapside->faddr.a6,
> +					     &tapside->eaddr.a6);
> +	}
> +	iov[0].iov_base = &bp->th;
> +	iov[0].iov_len = size - ((char *)iov[0].iov_base - base);
> +	bp->th.check = 0;
> +	bp->th.check = csum_iov(iov, iov_used, sum);
> +
> +	/* set iov for pcap logging */
> +	iov[0].iov_base = base + vdev->hdrlen;
> +	iov[0].iov_len = size - vdev->hdrlen;
> +
> +	pcap_iov(iov, iov_used);
> +
> +	/* restore iov[0] */
> +	iov[0].iov_base = base;
> +	iov[0].iov_len = size;
> +}
> +
> +int tcp_vu_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags)

Function comment missing.

> +{
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	const struct flowside *tapside = TAPFLOW(conn);
> +	struct virtio_net_hdr_mrg_rxbuf *vh;
> +	struct iovec l2_iov[TCP_NUM_IOVS];
> +	size_t l2len, l4len, optlen;
> +	struct iovec in_sg;
> +	struct ethhdr *eh;
> +	int nb_ack;
> +	int ret;
> +
> +	elem[0].out_num = 0;
> +	elem[0].out_sg = NULL;
> +	elem[0].in_num = 1;
> +	elem[0].in_sg = &in_sg;
> +	ret = vu_queue_pop(vdev, vq, &elem[0]);
> +	if (ret < 0)
> +		return 0;
> +
> +	if (elem[0].in_num < 1) {
> +		err("virtio-net receive queue contains no in buffers");

Should this really be err(), or debug()? I wonder if it can happen in
bursts.

> +		vu_queue_rewind(vq, 1);
> +		return 0;
> +	}
> +
> +	vh = elem[0].in_sg[0].iov_base;
> +
> +	vh->hdr = vu_header;
> +	if (vdev->hdrlen == sizeof(struct virtio_net_hdr_mrg_rxbuf))
> +		vh->num_buffers = htole16(1);
> +
> +	l2_iov[TCP_IOV_TAP].iov_base = NULL;
> +	l2_iov[TCP_IOV_TAP].iov_len = 0;
> +	l2_iov[TCP_IOV_ETH].iov_base = (char *)elem[0].in_sg[0].iov_base + vdev->hdrlen;
> +	l2_iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
> +
> +	eh = l2_iov[TCP_IOV_ETH].iov_base;
> +
> +	memcpy(eh->h_dest, c->mac_guest, sizeof(eh->h_dest));
> +	memcpy(eh->h_source, c->mac, sizeof(eh->h_source));
> +
> +	if (CONN_V4(conn)) {
> +		struct tcp_flags_t *payload;
> +		struct iphdr *iph;
> +		uint32_t seq;
> +
> +		l2_iov[TCP_IOV_IP].iov_base = (char *)l2_iov[TCP_IOV_ETH].iov_base +
> +						      l2_iov[TCP_IOV_ETH].iov_len;
> +		l2_iov[TCP_IOV_IP].iov_len = sizeof(struct iphdr);
> +		l2_iov[TCP_IOV_PAYLOAD].iov_base = (char *)l2_iov[TCP_IOV_IP].iov_base +
> +							   l2_iov[TCP_IOV_IP].iov_len;
> +
> +		eh->h_proto = htons(ETH_P_IP);
> +
> +		iph = l2_iov[TCP_IOV_IP].iov_base;
> +		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
> +
> +		payload = l2_iov[TCP_IOV_PAYLOAD].iov_base;
> +		payload->th = (struct tcphdr){
> +			.doff = offsetof(struct tcp_flags_t, opts) / 4,
> +			.ack = 1
> +		};
> +
> +		seq = conn->seq_to_tap;
> +		ret = tcp_prepare_flags(c, conn, flags, &payload->th, payload->opts, &optlen);
> +		if (ret <= 0) {
> +			vu_queue_rewind(vq, 1);
> +			return ret;
> +		}
> +
> +		l4len = tcp_l2_buf_fill_headers(c, conn, l2_iov, optlen, NULL,
> +						seq);
> +		/* keep the following assignment for clarity */
> +		/* cppcheck-suppress unreadVariable */
> +		l2_iov[TCP_IOV_PAYLOAD].iov_len = l4len;
> +
> +		l2len = l4len + sizeof(*iph) + sizeof(struct ethhdr);
> +	} else {
> +		struct tcp_flags_t *payload;
> +		struct ipv6hdr *ip6h;
> +		uint32_t seq;
> +
> +		l2_iov[TCP_IOV_IP].iov_base = (char *)l2_iov[TCP_IOV_ETH].iov_base +
> +						      l2_iov[TCP_IOV_ETH].iov_len;
> +		l2_iov[TCP_IOV_IP].iov_len = sizeof(struct ipv6hdr);
> +		l2_iov[TCP_IOV_PAYLOAD].iov_base = (char *)l2_iov[TCP_IOV_IP].iov_base +
> +							   l2_iov[TCP_IOV_IP].iov_len;
> +
> +		eh->h_proto = htons(ETH_P_IPV6);
> +
> +		ip6h = l2_iov[TCP_IOV_IP].iov_base;
> +		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
> +
> +		payload = l2_iov[TCP_IOV_PAYLOAD].iov_base;
> +		payload->th = (struct tcphdr){
> +			.doff = offsetof(struct tcp_flags_t, opts) / 4,
> +			.ack = 1
> +		};
> +
> +		seq = conn->seq_to_tap;
> +		ret = tcp_prepare_flags(c, conn, flags, &payload->th, payload->opts, &optlen);
> +		if (ret <= 0) {
> +			vu_queue_rewind(vq, 1);
> +			return ret;
> +		}
> +
> +		l4len = tcp_l2_buf_fill_headers(c, conn, l2_iov, optlen, NULL,
> +						seq);
> +		/* keep the following assignment for clarity */
> +		/* cppcheck-suppress unreadVariable */
> +		l2_iov[TCP_IOV_PAYLOAD].iov_len = l4len;
> +
> +		l2len = l4len + sizeof(*ip6h) + sizeof(struct ethhdr);
> +	}
> +	l2len += vdev->hdrlen;
> +	ASSERT(l2len <= elem[0].in_sg[0].iov_len);
> +
> +	elem[0].in_sg[0].iov_len = l2len;
> +	tcp_vu_pcap(c, tapside, &elem[0].in_sg[0], 1, l4len);
> +
> +	vu_queue_fill(vq, &elem[0], l2len, 0);
> +	nb_ack = 1;
> +
> +	if (flags & DUP_ACK) {
> +		struct iovec in_sg_dup;
> +
> +		elem[1].out_num = 0;
> +		elem[1].out_sg = NULL;
> +		elem[1].in_num = 1;
> +		elem[1].in_sg = &in_sg_dup;
> +		ret = vu_queue_pop(vdev, vq, &elem[1]);
> +		if (ret == 0) {
> +			if (elem[1].in_num < 1 || elem[1].in_sg[0].iov_len < l2len) {
> +				vu_queue_rewind(vq, 1);
> +			} else {
> +				memcpy(elem[1].in_sg[0].iov_base, vh, l2len);
> +				nb_ack++;
> +
> +				tcp_vu_pcap(c, tapside, &elem[1].in_sg[0], 1,
> +					    l4len);
> +
> +				vu_queue_fill(vq, &elem[1], l2len, 1);
> +			}
> +		}
> +	}

I guess it would be nice one day to decrease code duplication here, but
let's keep this simple for the moment.

> +
> +	vu_queue_flush(vq, nb_ack);
> +	vu_queue_notify(vdev, vq);
> +
> +	return 0;
> +}
> +
> +static ssize_t tcp_vu_sock_recv(struct ctx *c,
> +				struct tcp_tap_conn *conn, bool v4,
> +				size_t fillsize, uint16_t mss,
> +				ssize_t *data_len)

Function comment missing.

> +{
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	static struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
> +	struct msghdr mh_sock = { 0 };
> +	static int in_sg_count;
> +	int s = conn->sock;
> +	size_t l2_hdrlen;
> +	int segment_size;
> +	int iov_cnt;
> +	ssize_t ret;
> +
> +	l2_hdrlen = tcp_vu_l2_hdrlen(vdev, !v4);
> +
> +	iov_cnt = 0;
> +	in_sg_count = 0;
> +	segment_size = 0;
> +	*data_len = 0;
> +	while (fillsize > 0 && iov_cnt < VIRTQUEUE_MAX_SIZE - 1 &&
> +			       in_sg_count < ARRAY_SIZE(in_sg)) {
> +
> +		elem[iov_cnt].out_num = 0;
> +		elem[iov_cnt].out_sg = NULL;
> +		elem[iov_cnt].in_num = ARRAY_SIZE(in_sg) - in_sg_count;
> +		elem[iov_cnt].in_sg = &in_sg[in_sg_count];
> +		ret = vu_queue_pop(vdev, vq, &elem[iov_cnt]);
> +		if (ret < 0)
> +			break;
> +
> +		if (elem[iov_cnt].in_num < 1)
> +			die("virtio-net receive queue contains no in buffers");

We could easily recover (without doing anything) from this error,
right? If that's the case, I think this should be a warn().
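
Something like this, say, mirroring what tcp_vu_send_flag() does above
(sketch):

		if (elem[iov_cnt].in_num < 1) {
			warn("virtio-net receive queue contains no in buffers");
			vu_queue_rewind(vq, 1);
			break;
		}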

> +
> +		in_sg_count += elem[iov_cnt].in_num;
> +
> +		ASSERT(elem[iov_cnt].in_num == 1);
> +		ASSERT(elem[iov_cnt].in_sg[0].iov_len >= l2_hdrlen);
> +
> +		if (segment_size == 0) {
> +			iov_vu[iov_cnt + 1].iov_base =
> +					(char *)elem[iov_cnt].in_sg[0].iov_base + l2_hdrlen;
> +			iov_vu[iov_cnt + 1].iov_len =
> +					elem[iov_cnt].in_sg[0].iov_len - l2_hdrlen;
> +		} else {
> +			iov_vu[iov_cnt + 1].iov_base = elem[iov_cnt].in_sg[0].iov_base;
> +			iov_vu[iov_cnt + 1].iov_len = elem[iov_cnt].in_sg[0].iov_len;
> +		}
> +
> +		if (iov_vu[iov_cnt + 1].iov_len > fillsize)
> +			iov_vu[iov_cnt + 1].iov_len = fillsize;

Perhaps storing an "iov" pointer to iov_vu[iov_cnt + 1] would make this
less crowded.
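
That is, something like (sketch):

		struct iovec *iov = &iov_vu[iov_cnt + 1];

		iov->iov_base = elem[iov_cnt].in_sg[0].iov_base;
		iov->iov_len = elem[iov_cnt].in_sg[0].iov_len;
		if (segment_size == 0) {
			iov->iov_base = (char *)iov->iov_base + l2_hdrlen;
			iov->iov_len -= l2_hdrlen;
		}

		if (iov->iov_len > fillsize)
			iov->iov_len = fillsize;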

> +
> +		segment_size += iov_vu[iov_cnt + 1].iov_len;
> +		if (vdev->hdrlen != sizeof(struct virtio_net_hdr_mrg_rxbuf)) {
> +			segment_size = 0;
> +		} else if (segment_size >= mss) {
> +			iov_vu[iov_cnt + 1].iov_len -= segment_size - mss;
> +			segment_size = 0;
> +		}
> +		fillsize -= iov_vu[iov_cnt + 1].iov_len;
> +
> +		iov_cnt++;
> +	}
> +	if (iov_cnt == 0)
> +		return 0;
> +
> +	mh_sock.msg_iov = iov_vu;
> +	mh_sock.msg_iovlen = iov_cnt + 1;
> +
> +	do
> +		ret = recvmsg(s, &mh_sock, MSG_PEEK);
> +	while (ret < 0 && errno == EINTR);
> +
> +	if (ret < 0) {
> +		vu_queue_rewind(vq, iov_cnt);
> +		if (errno != EAGAIN && errno != EWOULDBLOCK) {
> +			ret = -errno;
> +			tcp_rst(c, conn);
> +		}
> +		return ret;
> +	}
> +	if (!ret) {

Maybe make the non-error path explicit, say, if (ret) goto out;
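
That is (sketch):

	if (ret)
		goto out;

	vu_queue_rewind(vq, iov_cnt);
	/* ...FIN handling as above... */
	return 0;

out:
	*data_len = ret;
	return iov_cnt;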

> +		vu_queue_rewind(vq, iov_cnt);
> +
> +		if ((conn->events & (SOCK_FIN_RCVD | TAP_FIN_SENT)) == SOCK_FIN_RCVD) {
> +			int retf = tcp_vu_send_flag(c, conn, FIN | ACK);
> +			if (retf) {
> +				tcp_rst(c, conn);
> +				return retf;
> +			}
> +
> +			conn_event(c, conn, TAP_FIN_SENT);
> +		}
> +		return 0;
> +	}
> +
> +	*data_len = ret;
> +	return iov_cnt;
> +}
> +
> +static size_t tcp_vu_prepare(const struct ctx *c,
> +			     struct tcp_tap_conn *conn, struct iovec *first,
> +			     size_t data_len, const uint16_t **check)
> +{

Missing function comment.

> +	const struct flowside *toside = TAPFLOW(conn);
> +	const struct vu_dev *vdev = c->vdev;
> +	struct iovec l2_iov[TCP_NUM_IOVS];
> +	char *base = first->iov_base;
> +	struct ethhdr *eh;
> +	size_t l4len;
> +
> +	/* we guess the first iovec provided by the guest can embed
> +         * all the headers needed by L2 frame
> +         */

Mixed tabs and spaces.

> +
> +	l2_iov[TCP_IOV_TAP].iov_base = NULL;
> +	l2_iov[TCP_IOV_TAP].iov_len = 0;
> +	l2_iov[TCP_IOV_ETH].iov_base = base + vdev->hdrlen;
> +	l2_iov[TCP_IOV_ETH].iov_len = sizeof(struct ethhdr);
> +
> +	eh = l2_iov[TCP_IOV_ETH].iov_base;
> +
> +	memcpy(eh->h_dest, c->mac_guest, sizeof(eh->h_dest));
> +	memcpy(eh->h_source, c->mac, sizeof(eh->h_source));
> +
> +	/* initialize header */
> +	if (inany_v4(&toside->eaddr) && inany_v4(&toside->faddr)) {
> +		struct tcp_payload_t *payload;
> +		struct iphdr *iph;
> +
> +		ASSERT(first[0].iov_len >= vdev->hdrlen +
> +		       sizeof(struct ethhdr) + sizeof(struct iphdr) +
> +		       sizeof(struct tcphdr));
> +
> +		l2_iov[TCP_IOV_IP].iov_base = (char *)l2_iov[TCP_IOV_ETH].iov_base +
> +						      l2_iov[TCP_IOV_ETH].iov_len;
> +		l2_iov[TCP_IOV_IP].iov_len = sizeof(struct iphdr);
> +		l2_iov[TCP_IOV_PAYLOAD].iov_base = (char *)l2_iov[TCP_IOV_IP].iov_base +
> +							   l2_iov[TCP_IOV_IP].iov_len;
> +
> +
> +		eh->h_proto = htons(ETH_P_IP);
> +
> +		iph = l2_iov[TCP_IOV_IP].iov_base;
> +		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_TCP);
> +		payload = l2_iov[TCP_IOV_PAYLOAD].iov_base;
> +		payload->th = (struct tcphdr){
> +			.doff = offsetof(struct tcp_payload_t, data) / 4,
> +			.ack = 1
> +		};
> +
> +		l4len = tcp_l2_buf_fill_headers(c, conn, l2_iov,
> +						data_len, *check,
> +						conn->seq_to_tap);
> +		/* keep the following assignment for clarity */
> +		/* cppcheck-suppress unreadVariable */
> +		l2_iov[TCP_IOV_PAYLOAD].iov_len = l4len;
> +
> +		*check = &iph->check;
> +	} else {
> +		struct tcp_payload_t *payload;
> +		struct ipv6hdr *ip6h;
> +
> +		ASSERT(first[0].iov_len >= vdev->hdrlen +
> +		       sizeof(struct ethhdr) + sizeof(struct ipv6hdr) +
> +		       sizeof(struct tcphdr));
> +
> +		l2_iov[TCP_IOV_IP].iov_base = (char *)l2_iov[TCP_IOV_ETH].iov_base +
> +						      l2_iov[TCP_IOV_ETH].iov_len;
> +		l2_iov[TCP_IOV_IP].iov_len = sizeof(struct ipv6hdr);
> +		l2_iov[TCP_IOV_PAYLOAD].iov_base = (char *)l2_iov[TCP_IOV_IP].iov_base +
> +							   l2_iov[TCP_IOV_IP].iov_len;
> +
> +
> +		eh->h_proto = htons(ETH_P_IPV6);
> +
> +		ip6h = l2_iov[TCP_IOV_IP].iov_base;
> +		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_TCP);
> +
> +		payload = l2_iov[TCP_IOV_PAYLOAD].iov_base;
> +		payload->th = (struct tcphdr){
> +			.doff = offsetof(struct tcp_payload_t, data) / 4,
> +			.ack = 1
> +		};
> +		l4len = tcp_l2_buf_fill_headers(c, conn, l2_iov,
> +						data_len,
> +						NULL, conn->seq_to_tap);
> +		/* keep the following assignment for clarity */
> +		/* cppcheck-suppress unreadVariable */
> +		l2_iov[TCP_IOV_PAYLOAD].iov_len = l4len;
> +	}
> +
> +	return l4len;
> +}

I reviewed until here for the moment, the rest will come in a bit.

-- 
Stefano


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 0/4] Add vhost-user support to passt. (part 3)
  2024-08-22 16:53   ` Stefano Brivio
@ 2024-08-23 12:32     ` Stefano Brivio
  0 siblings, 0 replies; 22+ messages in thread
From: Stefano Brivio @ 2024-08-23 12:32 UTC (permalink / raw)
  To: passt-dev; +Cc: Laurent Vivier

On Thu, 22 Aug 2024 18:53:00 +0200
Stefano Brivio <sbrivio@redhat.com> wrote:

> On Wed, 21 Aug 2024 00:41:14 +0200
> Stefano Brivio <sbrivio@redhat.com> wrote:
> 
> > On Thu, 15 Aug 2024 17:50:19 +0200
> > Laurent Vivier <lvivier@redhat.com> wrote:
> >   
> > > This series of patches adds vhost-user support to passt
> > > and then allows passt to connect to QEMU network backend using
> > > virtqueue rather than a socket.
> > > 
> > > With QEMU, rather than using to connect:
> > > 
> > >   -netdev stream,id=s,server=off,addr.type=unix,addr.path=/tmp/passt_1.socket
> > > 
> > > we will use:
> > > 
> > >   -chardev socket,id=chr0,path=/tmp/passt_1.socket
> > >   -netdev vhost-user,id=netdev0,chardev=chr0
> > >   -device virtio-net,netdev=netdev0
> > >   -object memory-backend-memfd,id=memfd0,share=on,size=$RAMSIZE
> > >   -numa node,memdev=memfd0
> > > 
> > > The memory backend is needed to share data between passt and QEMU.
> > > 
> > > Performance comparison between "-netdev stream" and "-netdev vhost-user":    
> > 
> > By the way, I attached a quick patch adding vhost-user-based tests to
> > the usual throughput and latency tests.
> > 
> > UDP doesn't work (I didn't look into that at all), TCP does.  
> 
> Complete/fixed patch attached. The only part of UDP that's not working
> is actually over IPv6 -- it works over IPv4.

...guest to host, that is. The other direction works.

-- 
Stefano


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 4/4] vhost-user: add vhost-user
  2024-08-15 15:50 ` [PATCH v3 4/4] vhost-user: add vhost-user Laurent Vivier
  2024-08-22  9:59   ` Stefano Brivio
  2024-08-22 22:14   ` Stefano Brivio
@ 2024-08-23 12:32   ` Stefano Brivio
  2 siblings, 0 replies; 22+ messages in thread
From: Stefano Brivio @ 2024-08-23 12:32 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

Second part of review:

On Thu, 15 Aug 2024 17:50:23 +0200
Laurent Vivier <lvivier@redhat.com> wrote:

> +int tcp_vu_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)

Missing function comment. This one is especially relevant because I'm
not sure where the boundary is between this and tcp_vu_sock_recv().
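
From what I understand, something like this (sketch, the description is
my guess, please fix it):

/**
 * tcp_vu_data_from_sock() - Handle new data from socket, queue to vhost-user
 * @c:		Execution context
 * @conn:	Connection pointer
 *
 * Return: negative on connection reset, 0 otherwise
 */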

> +{
> +	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	const struct flowside *tapside = TAPFLOW(conn);
> +	uint16_t mss = MSS_GET(conn);
> +	size_t l2_hdrlen, fillsize;
> +	int i, iov_cnt, iov_used;
> +	int v4 = CONN_V4(conn);
> +	uint32_t already_sent;
> +	const uint16_t *check;
> +	struct iovec *first;
> +	int segment_size;
> +	int num_buffers;
> +	ssize_t len;
> +
> +	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
> +		flow_err(conn,
> +			 "Got packet, but no available descriptors on RX virtq.");

That message seems, rather, to describe the case where
tcp_vu_sock_recv() gets < 0 from vu_queue_pop().

Strictly speaking, here, there are no descriptors available either, but
the message looks misleading. What about:

		flow_err(conn, "Got packet, but RX virtqueue not usable yet");

?

> +		return 0;
> +	}
> +
> +	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
> +
> +	if (SEQ_LT(already_sent, 0)) {
> +		/* RFC 761, section 2.1. */
> +		flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
> +			   conn->seq_ack_from_tap, conn->seq_to_tap);
> +		conn->seq_to_tap = conn->seq_ack_from_tap;
> +		already_sent = 0;
> +	}
> +
> +	if (!wnd_scaled || already_sent >= wnd_scaled) {
> +		conn_flag(c, conn, STALLED);
> +		conn_flag(c, conn, ACK_FROM_TAP_DUE);
> +		return 0;
> +	}
> +
> +	/* Set up buffer descriptors we'll fill completely and partially. */
> +
> +	fillsize = wnd_scaled;
> +
> +	iov_vu[0].iov_base = tcp_buf_discard;

This should now be conditional to if (!peek_offset_cap), see the
(updated) tcp_buf_data_from_sock().
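
That is, roughly (sketch, names as in the tcp_buf version):

	if (!peek_offset_cap) {
		iov_vu[0].iov_base = tcp_buf_discard;
		iov_vu[0].iov_len = already_sent;
	} else {
		iov_vu[0].iov_base = NULL;
		iov_vu[0].iov_len = 0;
	}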

> +	iov_vu[0].iov_len = already_sent;
> +	fillsize -= already_sent;
> +
> +	iov_cnt = tcp_vu_sock_recv(c, conn, v4, fillsize, mss, &len);
> +	if (iov_cnt <= 0)
> +		return iov_cnt;
> +
> +	len -= already_sent;
> +	if (len <= 0) {
> +		conn_flag(c, conn, STALLED);
> +		vu_queue_rewind(vq, iov_cnt);
> +		return 0;
> +	}
> +
> +	conn_flag(c, conn, ~STALLED);
> +
> +	/* Likely, some new data was acked too. */
> +	tcp_update_seqack_wnd(c, conn, 0, NULL);
> +
> +	/* initialize headers */
> +	l2_hdrlen = tcp_vu_l2_hdrlen(vdev, !v4);
> +	iov_used = 0;
> +	num_buffers = 0;
> +	check = NULL;
> +	segment_size = 0;
> +	for (i = 0; i < iov_cnt && len; i++) {
> +
> +		if (segment_size == 0)
> +			first = &iov_vu[i + 1];

I don't understand why we have this loop on top of the loop from
tcp_vu_sock_recv(). I mean, it works, but this is a bit obscure
to me. I didn't really manage to review this function.

> +
> +		if (iov_vu[i + 1].iov_len > (size_t)len)
> +			iov_vu[i + 1].iov_len = len;
> +
> +		len -= iov_vu[i + 1].iov_len;
> +		iov_used++;
> +
> +		segment_size += iov_vu[i + 1].iov_len;
> +		num_buffers++;
> +
> +		if (segment_size >= mss || len == 0 ||
> +		    i + 1 == iov_cnt || vdev->hdrlen != sizeof(struct virtio_net_hdr_mrg_rxbuf)) {
> +			struct virtio_net_hdr_mrg_rxbuf *vh;
> +			size_t l4len;
> +
> +			if (i + 1 == iov_cnt)
> +				check = NULL;
> +
> +			/* restore first iovec base: point to vnet header */
> +			first->iov_base = (char *)first->iov_base - l2_hdrlen;
> +			first->iov_len = first->iov_len + l2_hdrlen;
> +
> +			vh = first->iov_base;
> +
> +			vh->hdr = vu_header;
> +			if (vdev->hdrlen == sizeof(struct virtio_net_hdr_mrg_rxbuf))
> +				vh->num_buffers = htole16(num_buffers);
> +
> +			l4len = tcp_vu_prepare(c, conn, first, segment_size, &check);
> +
> +			tcp_vu_pcap(c, tapside, first, num_buffers, l4len);
> +
> +			conn->seq_to_tap += segment_size;
> +
> +			segment_size = 0;
> +			num_buffers = 0;
> +		}
> +	}
> +
> +	/* release unused buffers */
> +	vu_queue_rewind(vq, iov_cnt - iov_used);
> +
> +	/* send packets */
> +	vu_send_frame(vdev, vq, elem, &iov_vu[1], iov_used);
> +
> +	conn_flag(c, conn, ACK_FROM_TAP_DUE);
> +
> +	return 0;
> +}
> diff --git a/tcp_vu.h b/tcp_vu.h
> new file mode 100644
> index 000000000000..99daba5b34ed
> --- /dev/null
> +++ b/tcp_vu.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later
> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + */
> +
> +#ifndef TCP_VU_H
> +#define TCP_VU_H
> +
> +int tcp_vu_send_flag(struct ctx *c, struct tcp_tap_conn *conn, int flags);
> +int tcp_vu_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn);
> +
> +#endif  /* TCP_VU_H */
> diff --git a/udp.c b/udp.c
> index 7731257292e1..4d2afc62478a 100644
> --- a/udp.c
> +++ b/udp.c
> @@ -109,8 +109,7 @@
>  #include "pcap.h"
>  #include "log.h"
>  #include "flow_table.h"
> -
> -#define UDP_MAX_FRAMES		32  /* max # of frames to receive at once */
> +#include "udp_internal.h"
>  
>  /* "Spliced" sockets indexed by bound port (host order) */
>  static int udp_splice_ns  [IP_VERSIONS][NUM_PORTS];
> @@ -118,20 +117,8 @@ static int udp_splice_init[IP_VERSIONS][NUM_PORTS];
>  
>  /* Static buffers */
>  
> -/**
> - * struct udp_payload_t - UDP header and data for inbound messages
> - * @uh:		UDP header
> - * @data:	UDP data
> - */
> -static struct udp_payload_t {
> -	struct udphdr uh;
> -	char data[USHRT_MAX - sizeof(struct udphdr)];
> -#ifdef __AVX2__
> -} __attribute__ ((packed, aligned(32)))
> -#else
> -} __attribute__ ((packed, aligned(__alignof__(unsigned int))))
> -#endif
> -udp_payload[UDP_MAX_FRAMES];
> +/* UDP header and data for inbound messages */
> +static struct udp_payload_t udp_payload[UDP_MAX_FRAMES];
>  
>  /* Ethernet header for IPv4 frames */
>  static struct ethhdr udp4_eth_hdr;
> @@ -311,6 +298,7 @@ static void udp_splice_send(const struct ctx *c, size_t start, size_t n,
>  
>  /**
>   * udp_update_hdr4() - Update headers for one IPv4 datagram
> + * @c:		Execution context
>   * @ip4h:	Pre-filled IPv4 header (except for tot_len and saddr)
>   * @bp:		Pointer to udp_payload_t to update
>   * @toside:	Flowside for destination side
> @@ -318,8 +306,9 @@ static void udp_splice_send(const struct ctx *c, size_t start, size_t n,
>   *
>   * Return: size of IPv4 payload (UDP header + data)
>   */
> -static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
> -			      const struct flowside *toside, size_t dlen)
> +size_t udp_update_hdr4(const struct ctx *c,
> +		       struct iphdr *ip4h, struct udp_payload_t *bp,
> +		       const struct flowside *toside, size_t dlen)
>  {
>  	const struct in_addr *src = inany_v4(&toside->faddr);
>  	const struct in_addr *dst = inany_v4(&toside->eaddr);
> @@ -336,13 +325,17 @@ static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
>  	bp->uh.source = htons(toside->fport);
>  	bp->uh.dest = htons(toside->eport);
>  	bp->uh.len = htons(l4len);
> -	csum_udp4(&bp->uh, *src, *dst, bp->data, dlen);
> +	if (c->mode != MODE_VU)
> +		csum_udp4(&bp->uh, *src, *dst, bp->data, dlen);
> +	else
> +		bp->uh.check = 0;
>  
>  	return l4len;
>  }
>  
>  /**
>   * udp_update_hdr6() - Update headers for one IPv6 datagram
> + * @c:		Execution context
>   * @ip6h:	Pre-filled IPv6 header (except for payload_len and addresses)
>   * @bp:		Pointer to udp_payload_t to update
>   * @toside:	Flowside for destination side
> @@ -350,8 +343,9 @@ static size_t udp_update_hdr4(struct iphdr *ip4h, struct udp_payload_t *bp,
>   *
>   * Return: size of IPv6 payload (UDP header + data)
>   */
> -static size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
> -			      const struct flowside *toside, size_t dlen)
> +size_t udp_update_hdr6(const struct ctx *c,
> +		       struct ipv6hdr *ip6h, struct udp_payload_t *bp,
> +		       const struct flowside *toside, size_t dlen)
>  {
>  	uint16_t l4len = dlen + sizeof(bp->uh);
>  
> @@ -365,19 +359,24 @@ static size_t udp_update_hdr6(struct ipv6hdr *ip6h, struct udp_payload_t *bp,
>  	bp->uh.source = htons(toside->fport);
>  	bp->uh.dest = htons(toside->eport);
>  	bp->uh.len = ip6h->payload_len;
> -	csum_udp6(&bp->uh, &toside->faddr.a6, &toside->eaddr.a6, bp->data, dlen);
> +	if (c->mode != MODE_VU)

Curly brackets for multi-line statement.
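
That is:

	if (c->mode != MODE_VU) {
		csum_udp6(&bp->uh, &toside->faddr.a6, &toside->eaddr.a6,
			  bp->data, dlen);
	} else {
		bp->uh.check = 0xffff; /* zero checksum is invalid with IPv6 */
	}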

> +		csum_udp6(&bp->uh, &toside->faddr.a6, &toside->eaddr.a6,
> +			  bp->data, dlen);
> +	else
> +		bp->uh.check = 0xffff; /* zero checksum is invalid with IPv6 */

0xffff is the value virtio requires in case we want to ignore the
checksum, right? Or would any non-zero value work? The comment should
say that.

>  
>  	return l4len;
>  }
>  
>  /**
>   * udp_tap_prepare() - Convert one datagram into a tap frame
> + * @c:		Execution context
>   * @mmh:	Receiving mmsghdr array
>   * @idx:	Index of the datagram to prepare
>   * @toside:	Flowside for destination side
>   */
> -static void udp_tap_prepare(const struct mmsghdr *mmh, unsigned idx,
> -			    const struct flowside *toside)
> +static void udp_tap_prepare(const struct ctx *c, const struct mmsghdr *mmh,
> +			    unsigned idx, const struct flowside *toside)
>  {
>  	struct iovec (*tap_iov)[UDP_NUM_IOVS] = &udp_l2_iov[idx];
>  	struct udp_payload_t *bp = &udp_payload[idx];
> @@ -385,13 +384,15 @@ static void udp_tap_prepare(const struct mmsghdr *mmh, unsigned idx,
>  	size_t l4len;
>  
>  	if (!inany_v4(&toside->eaddr) || !inany_v4(&toside->faddr)) {
> -		l4len = udp_update_hdr6(&bm->ip6h, bp, toside, mmh[idx].msg_len);
> +		l4len = udp_update_hdr6(c, &bm->ip6h, bp, toside,
> +					mmh[idx].msg_len);
>  		tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip6h) +
>  			       sizeof(udp6_eth_hdr));
>  		(*tap_iov)[UDP_IOV_ETH] = IOV_OF_LVALUE(udp6_eth_hdr);
>  		(*tap_iov)[UDP_IOV_IP] = IOV_OF_LVALUE(bm->ip6h);
>  	} else {
> -		l4len = udp_update_hdr4(&bm->ip4h, bp, toside, mmh[idx].msg_len);
> +		l4len = udp_update_hdr4(c, &bm->ip4h, bp, toside,
> +					mmh[idx].msg_len);
>  		tap_hdr_update(&bm->taph, l4len + sizeof(bm->ip4h) +
>  			       sizeof(udp4_eth_hdr));
>  		(*tap_iov)[UDP_IOV_ETH] = IOV_OF_LVALUE(udp4_eth_hdr);
> @@ -408,7 +409,7 @@ static void udp_tap_prepare(const struct mmsghdr *mmh, unsigned idx,
>   *
>   * #syscalls recvmsg
>   */
> -static bool udp_sock_recverr(int s)
> +bool udp_sock_recverr(int s)
>  {
>  	const struct sock_extended_err *ee;
>  	const struct cmsghdr *hdr;
> @@ -495,7 +496,7 @@ static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
>  }
>  
>  /**
> - * udp_listen_sock_handler() - Handle new data from socket
> + * udp_buf_listen_sock_handler() - Handle new data from socket
>   * @c:		Execution context
>   * @ref:	epoll reference
>   * @events:	epoll events bitmap
> @@ -503,8 +504,8 @@ static int udp_sock_recv(const struct ctx *c, int s, uint32_t events,
>   *
>   * #syscalls recvmmsg
>   */
> -void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
> -			     uint32_t events, const struct timespec *now)
> +void udp_buf_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
> +				 uint32_t events, const struct timespec *now)
>  {
>  	struct mmsghdr *mmh_recv = ref.udp.v6 ? udp6_mh_recv : udp4_mh_recv;
>  	int n, i;
> @@ -527,7 +528,7 @@ void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
>  			if (pif_is_socket(batchpif)) {
>  				udp_splice_prepare(mmh_recv, i);
>  			} else if (batchpif == PIF_TAP) {
> -				udp_tap_prepare(mmh_recv, i,
> +				udp_tap_prepare(c, mmh_recv, i,
>  						flowside_at_sidx(batchsidx));
>  			}
>  
> @@ -561,7 +562,7 @@ void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
>  }
>  
>  /**
> - * udp_reply_sock_handler() - Handle new data from flow specific socket
> + * udp_buf_reply_sock_handler() - Handle new data from flow specific socket
>   * @c:		Execution context
>   * @ref:	epoll reference
>   * @events:	epoll events bitmap
> @@ -569,8 +570,8 @@ void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
>   *
>   * #syscalls recvmmsg
>   */
> -void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
> -			    uint32_t events, const struct timespec *now)
> +void udp_buf_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
> +				uint32_t events, const struct timespec *now)
>  {
>  	const struct flowside *fromside = flowside_at_sidx(ref.flowside);
>  	flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
> @@ -594,7 +595,7 @@ void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
>  		if (pif_is_socket(topif))
>  			udp_splice_prepare(mmh_recv, i);
>  		else if (topif == PIF_TAP)
> -			udp_tap_prepare(mmh_recv, i, toside);
> +			udp_tap_prepare(c, mmh_recv, i, toside);
>  	}
>  
>  	if (pif_is_socket(topif)) {
> diff --git a/udp.h b/udp.h
> index fb42e1c50d70..77b29260e8d1 100644
> --- a/udp.h
> +++ b/udp.h
> @@ -9,10 +9,10 @@
>  #define UDP_TIMER_INTERVAL		1000 /* ms */
>  
>  void udp_portmap_clear(void);
> -void udp_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
> -			     uint32_t events, const struct timespec *now);
> -void udp_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
> -			    uint32_t events, const struct timespec *now);
> +void udp_buf_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
> +				 uint32_t events, const struct timespec *now);
> +void udp_buf_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
> +				uint32_t events, const struct timespec *now);
>  int udp_tap_handler(const struct ctx *c, uint8_t pif,
>  		    sa_family_t af, const void *saddr, const void *daddr,
>  		    const struct pool *p, int idx, const struct timespec *now);
> diff --git a/udp_internal.h b/udp_internal.h
> new file mode 100644
> index 000000000000..7dd45753698f
> --- /dev/null
> +++ b/udp_internal.h
> @@ -0,0 +1,34 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later

// SPDX...

> + * Copyright (c) 2021 Red Hat GmbH
> + * Author: Stefano Brivio <sbrivio@redhat.com>
> + */
> +
> +#ifndef UDP_INTERNAL_H
> +#define UDP_INTERNAL_H
> +
> +#include "tap.h" /* needed by udp_meta_t */
> +
> +#define UDP_MAX_FRAMES		32  /* max # of frames to receive at once */
> +
> +/**
> + * struct udp_payload_t - UDP header and data for inbound messages
> + * @uh:		UDP header
> + * @data:	UDP data
> + */
> +struct udp_payload_t {
> +	struct udphdr uh;
> +	char data[USHRT_MAX - sizeof(struct udphdr)];
> +#ifdef __AVX2__
> +} __attribute__ ((packed, aligned(32)));
> +#else
> +} __attribute__ ((packed, aligned(__alignof__(unsigned int))));
> +#endif
> +
> +size_t udp_update_hdr4(const struct ctx *c,
> +		       struct iphdr *ip4h, struct udp_payload_t *bp,
> +		       const struct flowside *toside, size_t dlen);
> +size_t udp_update_hdr6(const struct ctx *c,
> +                       struct ipv6hdr *ip6h, struct udp_payload_t *bp,
> +                       const struct flowside *toside, size_t dlen);
> +bool udp_sock_recverr(int s);
> +#endif /* UDP_INTERNAL_H */
> diff --git a/udp_vu.c b/udp_vu.c
> new file mode 100644
> index 000000000000..f9e7afcf4ddb
> --- /dev/null
> +++ b/udp_vu.c
> @@ -0,0 +1,338 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later

// SPDX...

> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + *
> + * udp_vu.c - UDP L2 vhost-user management functions
> + */
> +
> +#include <unistd.h>
> +#include <assert.h>
> +#include <net/ethernet.h>
> +#include <net/if.h>
> +#include <netinet/in.h>
> +#include <netinet/ip.h>
> +#include <netinet/udp.h>
> +#include <stdint.h>
> +#include <stddef.h>
> +#include <sys/uio.h>
> +#include <linux/virtio_net.h>
> +
> +#include "checksum.h"
> +#include "util.h"
> +#include "ip.h"
> +#include "siphash.h"
> +#include "inany.h"
> +#include "passt.h"
> +#include "pcap.h"
> +#include "log.h"
> +#include "vhost_user.h"
> +#include "udp_internal.h"
> +#include "flow.h"
> +#include "flow_table.h"
> +#include "udp_flow.h"
> +#include "udp_vu.h"
> +#include "vu_common.h"
> +
> +/* vhost-user */
> +static const struct virtio_net_hdr vu_header = {
> +	.flags = VIRTIO_NET_HDR_F_DATA_VALID,
> +	.gso_type = VIRTIO_NET_HDR_GSO_NONE,
> +};
> +
> +static struct iovec     iov_vu		[VIRTQUEUE_MAX_SIZE];
> +static struct vu_virtq_element	elem		[VIRTQUEUE_MAX_SIZE];
> +static struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
> +static int in_sg_count;
> +
> +static size_t udp_vu_l2_hdrlen(const struct vu_dev *vdev, bool v6)

Function comment missing.

> +{
> +	size_t l2_hdrlen;
> +
> +	l2_hdrlen = vdev->hdrlen + sizeof(struct ethhdr) +
> +		    sizeof(struct udphdr);
> +
> +	if (v6)
> +		l2_hdrlen += sizeof(struct ipv6hdr);
> +	else
> +		l2_hdrlen += sizeof(struct iphdr);
> +
> +	return l2_hdrlen;
> +}
> +
> +static int udp_vu_sock_recv(const struct ctx *c, union sockaddr_inany *s_in,
> +			    int s, uint32_t events, bool v6, ssize_t *data_len)

Function comment missing.

> +{
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	int virtqueue_max, iov_cnt, idx, iov_used;
> +	size_t fillsize, size, off, l2_hdrlen;
> +	struct virtio_net_hdr_mrg_rxbuf *vh;
> +	struct msghdr msg  = { 0 };
> +	char *base;
> +
> +	ASSERT(!c->no_udp);
> +
> +	/* Clear any errors first */
> +	if (events & EPOLLERR) {
> +		while (udp_sock_recverr(s))
> +			;
> +	}
> +
> +	if (!(events & EPOLLIN))
> +		return 0;
> +
> +	/* compute L2 header length */
> +
> +	if (vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
> +		virtqueue_max = VIRTQUEUE_MAX_SIZE;
> +	else
> +		virtqueue_max = 1;
> +
> +	l2_hdrlen = udp_vu_l2_hdrlen(vdev, v6);
> +
> +	msg.msg_name = s_in;
> +	msg.msg_namelen = sizeof(union sockaddr_inany);
> +
> +	fillsize = USHRT_MAX;
> +	iov_cnt = 0;
> +	in_sg_count = 0;
> +	while (fillsize && iov_cnt < virtqueue_max &&
> +			in_sg_count < ARRAY_SIZE(in_sg)) {

This is much easier to understand compared to the TCP version, by the
way, but of course the TCP version needs to be a bit more complicated
because of the segmentation.

Both iov_cnt and in_sg_count are indices rather than... counts, right?
What about iov_idx (or i) and in_sg_idx?

> +		int ret;
> +
> +		elem[iov_cnt].out_num = 0;
> +		elem[iov_cnt].out_sg = NULL;
> +		elem[iov_cnt].in_num = ARRAY_SIZE(in_sg) - in_sg_count;
> +		elem[iov_cnt].in_sg = &in_sg[in_sg_count];
> +		ret = vu_queue_pop(vdev, vq, &elem[iov_cnt]);
> +		if (ret < 0)
> +			break;
> +		in_sg_count += elem[iov_cnt].in_num;
> +
> +		if (elem[iov_cnt].in_num < 1) {
> +			err("virtio-net receive queue contains no in buffers");
> +			vu_queue_rewind(vq, iov_cnt);
> +			return 0;
> +		}
> +		ASSERT(elem[iov_cnt].in_num == 1);
> +		ASSERT(elem[iov_cnt].in_sg[0].iov_len >= l2_hdrlen);
> +
> +		if (iov_cnt == 0) {
> +			base = elem[iov_cnt].in_sg[0].iov_base;
> +			size = elem[iov_cnt].in_sg[0].iov_len;
> +
> +			/* keep space for the headers */
> +			iov_vu[0].iov_base = base + l2_hdrlen;
> +			iov_vu[0].iov_len = size - l2_hdrlen;
> +		} else {
> +			iov_vu[iov_cnt].iov_base = elem[iov_cnt].in_sg[0].iov_base;
> +			iov_vu[iov_cnt].iov_len = elem[iov_cnt].in_sg[0].iov_len;
> +		}
> +
> +		if (iov_vu[iov_cnt].iov_len > fillsize)
> +			iov_vu[iov_cnt].iov_len = fillsize;
> +
> +		fillsize -= iov_vu[iov_cnt].iov_len;
> +
> +		iov_cnt++;
> +	}
> +	if (iov_cnt == 0)
> +		return 0;
> +
> +	msg.msg_iov = iov_vu;
> +	msg.msg_iovlen = iov_cnt;
> +
> +	*data_len = recvmsg(s, &msg, 0);

Is recvmsg() instead of recvmmsg() by choice/constraint, or just to keep
the initial implementation simpler?

> +	if (*data_len < 0) {
> +		vu_queue_rewind(vq, iov_cnt);
> +		return 0;
> +	}
> +
> +	/* restore original values */
> +	iov_vu[0].iov_base = base;
> +	iov_vu[0].iov_len = size;
> +
> +	/* count the number of buffers filled by recvmsg() */
> +	idx = iov_skip_bytes(iov_vu, iov_cnt, l2_hdrlen + *data_len,
> +			     &off);
> +	/* adjust last iov length */
> +	if (idx < iov_cnt)
> +		iov_vu[idx].iov_len = off;
> +	iov_used = idx + !!off;
> +
> +	/* release unused buffers */
> +	vu_queue_rewind(vq, iov_cnt - iov_used);
> +
> +	vh = (struct virtio_net_hdr_mrg_rxbuf *)base;
> +	vh->hdr = vu_header;
> +	if (vdev->hdrlen == sizeof(struct virtio_net_hdr_mrg_rxbuf))
> +		vh->num_buffers = htole16(iov_used);
> +
> +	return iov_used;
> +}
> +
> +static size_t udp_vu_prepare(const struct ctx *c,
> +			     const struct flowside *toside, ssize_t data_len)

Function comment missing.

> +{
> +	const struct vu_dev *vdev = c->vdev;
> +	struct ethhdr *eh;
> +	size_t l4len;
> +
> +	/* ethernet header */
> +	eh = vu_eth(vdev, iov_vu[0].iov_base);
> +
> +	memcpy(eh->h_dest, c->mac_guest, sizeof(eh->h_dest));
> +	memcpy(eh->h_source, c->mac, sizeof(eh->h_source));
> +
> +	/* initialize header */
> +	if (inany_v4(&toside->eaddr) && inany_v4(&toside->faddr)) {
> +		struct iphdr *iph = vu_ip(vdev, iov_vu[0].iov_base);
> +		struct udp_payload_t *bp = vu_payloadv4(vdev,
> +							    iov_vu[0].iov_base);
> +
> +		eh->h_proto = htons(ETH_P_IP);
> +
> +		*iph = (struct iphdr)L2_BUF_IP4_INIT(IPPROTO_UDP);
> +
> +		l4len = udp_update_hdr4(c, iph, bp, toside, data_len);
> +	} else {
> +		struct ipv6hdr *ip6h = vu_ip(vdev, iov_vu[0].iov_base);
> +		struct udp_payload_t *bp = vu_payloadv6(vdev,
> +							    iov_vu[0].iov_base);
> +
> +		eh->h_proto = htons(ETH_P_IPV6);
> +
> +		*ip6h = (struct ipv6hdr)L2_BUF_IP6_INIT(IPPROTO_UDP);
> +
> +		l4len = udp_update_hdr6(c, ip6h, bp, toside, data_len);
> +	}
> +
> +	return l4len;
> +}
> +
> +static void udp_vu_pcap(const struct ctx *c, const struct flowside *toside,
> +			size_t l4len, int iov_used)

Function comment missing.

> +{
> +	const struct in_addr *src = inany_v4(&toside->faddr);
> +	const struct in_addr *dst = inany_v4(&toside->eaddr);
> +	const struct vu_dev *vdev = c->vdev;
> +	char *base = iov_vu[0].iov_base;
> +	size_t size = iov_vu[0].iov_len;
> +	struct udp_payload_t *bp;
> +	uint32_t sum;
> +
> +	if (!*c->pcap)
> +		return;
> +
> +	if (src && dst) {

If we call them src4 and dst4, this logic becomes easier to understand.
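
That is:

	const struct in_addr *src4 = inany_v4(&toside->faddr);
	const struct in_addr *dst4 = inany_v4(&toside->eaddr);
	[...]
	if (src4 && dst4) {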

> +		bp = vu_payloadv4(vdev, base);
> +		sum = proto_ipv4_header_psum(l4len, IPPROTO_UDP, *src, *dst);
> +	} else {
> +		bp = vu_payloadv6(vdev, base);
> +		sum = proto_ipv6_header_psum(l4len, IPPROTO_UDP,
> +					     &toside->faddr.a6,
> +					     &toside->eaddr.a6);
> +		bp->uh.check = 0; /* by default, set to 0xffff */
> +	}
> +
> +	iov_vu[0].iov_base = &bp->uh;
> +	iov_vu[0].iov_len = size - ((char *)iov_vu[0].iov_base - base);
> +
> +	bp->uh.check = csum_iov(iov_vu, iov_used, sum);
> +
> +	/* set iov for pcap logging */
> +	iov_vu[0].iov_base = base + vdev->hdrlen;
> +	iov_vu[0].iov_len = size - vdev->hdrlen;
> +	pcap_iov(iov_vu, iov_used);
> +
> +	/* restore iov_vu[0] */
> +	iov_vu[0].iov_base = base;
> +	iov_vu[0].iov_len = size;
> +}
> +
> +void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
> +				uint32_t events, const struct timespec *now)

Function comment missing.

> +{
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	bool v6 = ref.udp.v6;
> +	int i;
> +
> +	for (i = 0; i < UDP_MAX_FRAMES; i++) {
> +		union sockaddr_inany s_in;
> +		flow_sidx_t batchsidx;
> +		uint8_t batchpif;
> +		ssize_t data_len;
> +		int iov_used;
> +
> +		iov_used = udp_vu_sock_recv(c, &s_in, ref.fd,
> +					    events, v6, &data_len);
> +		if (iov_used <= 0)
> +			return;
> +
> +		batchsidx = udp_flow_from_sock(c, ref, &s_in, now);
> +		batchpif = pif_at_sidx(batchsidx);
> +
> +		if (batchpif == PIF_TAP) {
> +			size_t l4len;
> +
> +			l4len = udp_vu_prepare(c, flowside_at_sidx(batchsidx),
> +					       data_len);
> +			udp_vu_pcap(c, flowside_at_sidx(batchsidx), l4len,
> +				    iov_used);
> +			vu_send_frame(vdev, vq, elem, iov_vu, iov_used);
> +		} else if (flow_sidx_valid(batchsidx)) {
> +			flow_sidx_t fromsidx = flow_sidx_opposite(batchsidx);
> +			struct udp_flow *uflow = udp_at_sidx(batchsidx);
> +
> +			flow_err(uflow,
> +				 "No support for forwarding UDP from %s to %s",
> +				 pif_name(pif_at_sidx(fromsidx)),
> +				 pif_name(batchpif));
> +		} else {
> +			debug("Discarding 1 datagram without flow");
> +		}
> +	}
> +}
> +
> +void udp_vu_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
> +			        uint32_t events, const struct timespec *now)

Function comment missing.

> +{
> +	flow_sidx_t tosidx = flow_sidx_opposite(ref.flowside);
> +	const struct flowside *toside = flowside_at_sidx(tosidx);
> +	struct vu_dev *vdev = c->vdev;
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	struct udp_flow *uflow = udp_at_sidx(ref.flowside);
> +	uint8_t topif = pif_at_sidx(tosidx);
> +	bool v6 = ref.udp.v6;
> +	int i;
> +
> +	ASSERT(!c->no_udp && uflow);

Pre-existing (in udp_buf_reply_sock_handler()), but I'd rather keep this
as two assertions, so that, should one ever trigger, we know which case
it is.
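
That is:

	ASSERT(!c->no_udp);
	ASSERT(uflow);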

> +
> +	for (i = 0; i < UDP_MAX_FRAMES; i++) {
> +		union sockaddr_inany s_in;
> +		ssize_t data_len;
> +		int iov_used;
> +
> +		iov_used = udp_vu_sock_recv(c, &s_in, ref.fd,
> +					    events, v6, &data_len);
> +		if (iov_used <= 0)
> +			return;
> +		flow_trace(uflow, "Received 1 datagram on reply socket");
> +		uflow->ts = now->tv_sec;
> +
> +		if (topif == PIF_TAP) {
> +			size_t l4len;
> +
> +			l4len = udp_vu_prepare(c, toside, data_len);
> +			udp_vu_pcap(c, toside, l4len, iov_used);
> +			vu_send_frame(vdev, vq, elem, iov_vu, iov_used);
> +		} else {
> +			uint8_t frompif = pif_at_sidx(ref.flowside);
> +
> +			flow_err(uflow,
> +				 "No support for forwarding UDP from %s to %s",
> +				 pif_name(frompif), pif_name(topif));
> +		}
> +	}
> +}
> diff --git a/udp_vu.h b/udp_vu.h
> new file mode 100644
> index 000000000000..0db7558914d9
> --- /dev/null
> +++ b/udp_vu.h
> @@ -0,0 +1,13 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later
> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + */
> +
> +#ifndef UDP_VU_H
> +#define UDP_VU_H
> +
> +void udp_vu_listen_sock_handler(const struct ctx *c, union epoll_ref ref,
> +				uint32_t events, const struct timespec *now);
> +void udp_vu_reply_sock_handler(const struct ctx *c, union epoll_ref ref,
> +			       uint32_t events, const struct timespec *now);
> +#endif /* UDP_VU_H */
> diff --git a/vhost_user.c b/vhost_user.c
> index c4cd25fae84e..e65b550774b7 100644
> --- a/vhost_user.c
> +++ b/vhost_user.c
> @@ -52,7 +52,6 @@
>   * 			     this is part of the vhost-user backend
>   * 			     convention.
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_print_capabilities(void)
>  {
>  	info("{");
> @@ -163,8 +162,7 @@ static void vmsg_close_fds(const struct vhost_user_msg *vmsg)
>   */
>  static void vu_remove_watch(const struct vu_dev *vdev, int fd)
>  {
> -	(void)vdev;
> -	(void)fd;
> +	epoll_ctl(vdev->context->epollfd, EPOLL_CTL_DEL, fd, NULL);
>  }
>  
>  /**
> @@ -426,7 +424,6 @@ static bool map_ring(struct vu_dev *vdev, struct vu_virtq *vq)
>   *
>   * Return: 0 if the zone in a mapped memory region, -1 otherwise
>   */
> -/* cppcheck-suppress unusedFunction */
>  int vu_packet_check_range(void *buf, size_t offset, size_t len,
>  			  const char *start)
>  {
> @@ -517,6 +514,14 @@ static bool vu_set_mem_table_exec(struct vu_dev *vdev,
>  		}
>  	}
>  
> +	/* As vu_packet_check_range() has no access to the number of
> +	 * memory regions, mark the end of the array with mmap_addr = 0
> +	 */
> +	ASSERT(vdev->nregions < VHOST_USER_MAX_RAM_SLOTS - 1);
> +	vdev->regions[vdev->nregions].mmap_addr = 0;
> +
> +	tap_sock_update_buf(vdev->regions, 0);
> +
>  	return false;
>  }
>  
> @@ -637,8 +642,12 @@ static bool vu_get_vring_base_exec(struct vu_dev *vdev,
>   */
>  static void vu_set_watch(const struct vu_dev *vdev, int fd)
>  {
> -	(void)vdev;
> -	(void)fd;
> +	union epoll_ref ref = { .type = EPOLL_TYPE_VHOST_KICK, .fd = fd };
> +	struct epoll_event ev = { 0 };
> +
> +	ev.data.u64 = ref.u64;
> +	ev.events = EPOLLIN;
> +	epoll_ctl(vdev->context->epollfd, EPOLL_CTL_ADD, fd, &ev);
>  }
>  
>  /**
> @@ -678,7 +687,6 @@ static int vu_wait_queue(const struct vu_virtq *vq)
>   *
>   * Return: number of bytes sent, -1 if there is an error
>   */
> -/* cppcheck-suppress unusedFunction */
>  int vu_send(struct vu_dev *vdev, const void *buf, size_t size)
>  {
>  	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> @@ -864,7 +872,6 @@ static void vu_handle_tx(struct vu_dev *vdev, int index,
>   * @vdev:	vhost-user device
>   * @ref:	epoll reference information
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
>  		const struct timespec *now)
>  {
> @@ -1102,11 +1109,11 @@ static bool vu_set_vring_enable_exec(struct vu_dev *vdev,
>   * @c:		execution context
>   * @vdev:	vhost-user device
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_init(struct ctx *c, struct vu_dev *vdev)
>  {
>  	int i;
>  
> +	c->vdev = vdev;
>  	vdev->context = c;
>  	vdev->hdrlen = 0;
>  	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
> @@ -1170,7 +1177,7 @@ void vu_cleanup(struct vu_dev *vdev)
>   */
>  static void vu_sock_reset(struct vu_dev *vdev)
>  {
> -	(void)vdev;
> +	tap_sock_reset(vdev->context);
>  }
>  
>  /**
> @@ -1179,7 +1186,6 @@ static void vu_sock_reset(struct vu_dev *vdev)
>   * @fd:		vhost-user message socket
>   * @events:	epoll events
>   */
> -/* cppcheck-suppress unusedFunction */
>  void tap_handler_vu(struct vu_dev *vdev, int fd, uint32_t events)
>  {
>  	struct vhost_user_msg msg = { 0 };
> diff --git a/virtio.c b/virtio.c
> index d02e6e04701d..55fc647842bb 100644
> --- a/virtio.c
> +++ b/virtio.c
> @@ -559,7 +559,6 @@ void vu_queue_unpop(struct vu_virtq *vq)
>   * @vq:		Virtqueue
>   * @num:	Number of element to unpop
>   */
> -/* cppcheck-suppress unusedFunction */
>  bool vu_queue_rewind(struct vu_virtq *vq, unsigned int num)
>  {
>  	if (num > vq->inuse)
> diff --git a/vu_common.c b/vu_common.c
> new file mode 100644
> index 000000000000..611c44a39142
> --- /dev/null
> +++ b/vu_common.c
> @@ -0,0 +1,27 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later

// SPDX...

> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + *
> + * vu_common.c - vhost-user common UDP and TCP functions
> + */
> +
> +#include <unistd.h>
> +#include <sys/uio.h>
> +
> +#include "util.h"
> +#include "passt.h"
> +#include "vhost_user.h"
> +#include "vu_common.h"
> +
> +void vu_send_frame(const struct vu_dev *vdev, struct vu_virtq *vq,
> +		   struct vu_virtq_element *elem, const struct iovec *iov_vu,
> +		   int iov_used)

Function comment missing.

> +{
> +	int i;
> +
> +	for (i = 0; i < iov_used; i++)
> +		vu_queue_fill(vq, &elem[i], iov_vu[i].iov_len, i);
> +
> +	vu_queue_flush(vq, iov_used);
> +	vu_queue_notify(vdev, vq);
> +}
> diff --git a/vu_common.h b/vu_common.h
> new file mode 100644
> index 000000000000..d2ea46bd379b
> --- /dev/null
> +++ b/vu_common.h
> @@ -0,0 +1,34 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later
> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + *
> + * vhost-user common UDP and TCP functions
> + */
> +
> +#ifndef VU_COMMON_H
> +#define VU_COMMON_H
> +
> +static inline void *vu_eth(const struct vu_dev *vdev, void *base)

Function comments missing (but these are really obvious ones, so I
don't actually care that much).

> +{
> +	return ((char *)base + vdev->hdrlen);
> +}
> +
> +static inline void *vu_ip(const struct vu_dev *vdev, void *base)
> +{
> +	return (struct ethhdr *)vu_eth(vdev, base) + 1;
> +}
> +
> +static inline void *vu_payloadv4(const struct vu_dev *vdev, void *base)
> +{
> +	return (struct iphdr *)vu_ip(vdev, base) + 1;
> +}
> +
> +static inline void *vu_payloadv6(const struct vu_dev *vdev, void *base)
> +{
> +	return (struct ipv6hdr *)vu_ip(vdev, base) + 1;
> +}
> +
> +void vu_send_frame(const struct vu_dev *vdev, struct vu_virtq *vq,
> +		   struct vu_virtq_element *elem, const struct iovec *iov_vu,
> +		   int iov_used);
> +#endif /* VU_COMMON_H */

-- 
Stefano


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 3/4] vhost-user: introduce vhost-user API
  2024-08-15 15:50 ` [PATCH v3 3/4] vhost-user: introduce vhost-user API Laurent Vivier
  2024-08-22 22:14   ` Stefano Brivio
@ 2024-08-26  5:26   ` David Gibson
  2024-08-26 22:14     ` Stefano Brivio
  2024-09-05  9:58     ` Laurent Vivier
  1 sibling, 2 replies; 22+ messages in thread
From: David Gibson @ 2024-08-26  5:26 UTC (permalink / raw)
  To: Laurent Vivier; +Cc: passt-dev

On Thu, Aug 15, 2024 at 05:50:22PM +0200, Laurent Vivier wrote:
> Add vhost_user.c and vhost_user.h that define the functions needed
> to implement vhost-user backend.
> 
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  Makefile     |    4 +-
>  iov.c        |    1 -
>  vhost_user.c | 1271 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  vhost_user.h |  202 ++++++++
>  virtio.c     |    5 -
>  virtio.h     |    2 +-
>  6 files changed, 1476 insertions(+), 9 deletions(-)
>  create mode 100644 vhost_user.c
>  create mode 100644 vhost_user.h
> 
> diff --git a/Makefile b/Makefile
> index f171c7955ac9..4ccefffacfde 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -47,7 +47,7 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
>  PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
>  	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
>  	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
> -	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c virtio.c
> +	tcp_buf.c tcp_splice.c udp.c udp_flow.c util.c vhost_user.c virtio.c
>  QRAP_SRCS = qrap.c
>  SRCS = $(PASST_SRCS) $(QRAP_SRCS)
>  
> @@ -57,7 +57,7 @@ PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
>  	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
>  	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
>  	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
> -	udp.h udp_flow.h util.h virtio.h
> +	udp.h udp_flow.h util.h vhost_user.h virtio.h
>  HEADERS = $(PASST_HEADERS) seccomp.h
>  
>  C := \#include <linux/tcp.h>\nstruct tcp_info x = { .tcpi_snd_wnd = 0 };
> diff --git a/iov.c b/iov.c
> index 3f9e229a305f..3741db21790f 100644
> --- a/iov.c
> +++ b/iov.c
> @@ -68,7 +68,6 @@ size_t iov_skip_bytes(const struct iovec *iov, size_t n,
>   *
>   * Returns:    The number of bytes successfully copied.
>   */
> -/* cppcheck-suppress unusedFunction */
>  size_t iov_from_buf(const struct iovec *iov, size_t iov_cnt,
>  		    size_t offset, const void *buf, size_t bytes)
>  {
> diff --git a/vhost_user.c b/vhost_user.c
> new file mode 100644
> index 000000000000..c4cd25fae84e
> --- /dev/null
> +++ b/vhost_user.c
> @@ -0,0 +1,1271 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * vhost-user API, command management and virtio interface
> + *
> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + */
> +/* some parts from QEMU subprojects/libvhost-user/libvhost-user.c
> + * licensed under the following terms:
> + *
> + * Copyright IBM, Corp. 2007
> + * Copyright (c) 2016 Red Hat, Inc.
> + *
> + * Authors:
> + *  Anthony Liguori <aliguori@us.ibm.com>
> + *  Marc-André Lureau <mlureau@redhat.com>
> + *  Victor Kaplansky <victork@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * later.  See the COPYING file in the top-level directory.
> + */
> +
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <stdlib.h>
> +#include <stdio.h>
> +#include <stdint.h>
> +#include <stddef.h>
> +#include <string.h>
> +#include <assert.h>
> +#include <stdbool.h>
> +#include <inttypes.h>
> +#include <time.h>
> +#include <net/ethernet.h>
> +#include <netinet/in.h>
> +#include <sys/epoll.h>
> +#include <sys/eventfd.h>
> +#include <sys/mman.h>
> +#include <linux/vhost_types.h>
> +#include <linux/virtio_net.h>
> +
> +#include "util.h"
> +#include "passt.h"
> +#include "tap.h"
> +#include "vhost_user.h"
> +
> +/* vhost-user version we are compatible with */
> +#define VHOST_USER_VERSION 1
> +
> +/**
> + * vu_print_capabilities() - print vhost-user capabilities
> + * 			     this is part of the vhost-user backend
> + * 			     convention.
> + */
> +/* cppcheck-suppress unusedFunction */
> +void vu_print_capabilities(void)
> +{
> +	info("{");
> +	info("  \"type\": \"net\"");
> +	info("}");
> +	exit(EXIT_SUCCESS);
> +}
> +
> +/**
> + * vu_request_to_string() - convert a vhost-user request number to its name
> + * @req:	request number
> + *
> + * Return: the name of request number
> + */
> +static const char *vu_request_to_string(unsigned int req)
> +{
> +	if (req < VHOST_USER_MAX) {
> +#define REQ(req) [req] = #req
> +		static const char * const vu_request_str[] = {

Adding VHOST_USER_MAX as an explicit array length can act as a useful
sanity check here.
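
That is:

		static const char * const vu_request_str[VHOST_USER_MAX] = {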

> +			REQ(VHOST_USER_NONE),
> +			REQ(VHOST_USER_GET_FEATURES),
> +			REQ(VHOST_USER_SET_FEATURES),
> +			REQ(VHOST_USER_SET_OWNER),
> +			REQ(VHOST_USER_RESET_OWNER),
> +			REQ(VHOST_USER_SET_MEM_TABLE),
> +			REQ(VHOST_USER_SET_LOG_BASE),
> +			REQ(VHOST_USER_SET_LOG_FD),
> +			REQ(VHOST_USER_SET_VRING_NUM),
> +			REQ(VHOST_USER_SET_VRING_ADDR),
> +			REQ(VHOST_USER_SET_VRING_BASE),
> +			REQ(VHOST_USER_GET_VRING_BASE),
> +			REQ(VHOST_USER_SET_VRING_KICK),
> +			REQ(VHOST_USER_SET_VRING_CALL),
> +			REQ(VHOST_USER_SET_VRING_ERR),
> +			REQ(VHOST_USER_GET_PROTOCOL_FEATURES),
> +			REQ(VHOST_USER_SET_PROTOCOL_FEATURES),
> +			REQ(VHOST_USER_GET_QUEUE_NUM),
> +			REQ(VHOST_USER_SET_VRING_ENABLE),
> +			REQ(VHOST_USER_SEND_RARP),
> +			REQ(VHOST_USER_NET_SET_MTU),
> +			REQ(VHOST_USER_SET_BACKEND_REQ_FD),
> +			REQ(VHOST_USER_IOTLB_MSG),
> +			REQ(VHOST_USER_SET_VRING_ENDIAN),
> +			REQ(VHOST_USER_GET_CONFIG),
> +			REQ(VHOST_USER_SET_CONFIG),
> +			REQ(VHOST_USER_POSTCOPY_ADVISE),
> +			REQ(VHOST_USER_POSTCOPY_LISTEN),
> +			REQ(VHOST_USER_POSTCOPY_END),
> +			REQ(VHOST_USER_GET_INFLIGHT_FD),
> +			REQ(VHOST_USER_SET_INFLIGHT_FD),
> +			REQ(VHOST_USER_GPU_SET_SOCKET),
> +			REQ(VHOST_USER_VRING_KICK),
> +			REQ(VHOST_USER_GET_MAX_MEM_SLOTS),
> +			REQ(VHOST_USER_ADD_MEM_REG),
> +			REQ(VHOST_USER_REM_MEM_REG),
> +			REQ(VHOST_USER_MAX),
> +		};
> +#undef REQ
> +		return vu_request_str[req];
> +	}
> +
> +	return "unknown";
> +}
> +
> +/**
> + * qva_to_va() -  Translate front-end (QEMU) virtual address to our virtual
> + * 		  address
> + * @dev:		Vhost-user device
> + * @qemu_addr:		front-end userspace address
> + *
> + * Return: the memory address in our process virtual address space.
> + */
> +static void *qva_to_va(struct vu_dev *dev, uint64_t qemu_addr)
> +{
> +	unsigned int i;
> +
> +	/* Find matching memory region.  */
> +	for (i = 0; i < dev->nregions; i++) {
> +		const struct vu_dev_region *r = &dev->regions[i];
> +
> +		if ((qemu_addr >= r->qva) && (qemu_addr < (r->qva + r->size))) {
> +			/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
> +			return (void *)(qemu_addr - r->qva + r->mmap_addr +
> +					r->mmap_offset);
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +/**
> + * vmsg_close_fds() - Close all file descriptors of a given message
> + * @vmsg:	Vhost-user message with the list of the file descriptors
> + */
> +static void vmsg_close_fds(const struct vhost_user_msg *vmsg)
> +{
> +	int i;
> +
> +	for (i = 0; i < vmsg->fd_num; i++)
> +		close(vmsg->fds[i]);
> +}
> +
> +/**
> + * vu_remove_watch() - Remove a file descriptor from an our passt epoll
> + * 		       file descriptor
> + * @vdev:	Vhost-user device
> + * @fd:		file descriptor to remove
> + */
> +static void vu_remove_watch(const struct vu_dev *vdev, int fd)
> +{
> +	(void)vdev;
> +	(void)fd;

Uh... this doesn't seem to do what the function comment says.

> +}
> +
> +/**
> + * vmsg_set_reply_u64() - Set reply payload.u64 and clear request flags
> + * 			  and fd_num
> + * @vmsg:	Vhost-user message
> + * @val:	64bit value to reply
> + */
> +static void vmsg_set_reply_u64(struct vhost_user_msg *vmsg, uint64_t val)
> +{
> +	vmsg->hdr.flags = 0; /* defaults will be set by vu_send_reply() */
> +	vmsg->hdr.size = sizeof(vmsg->payload.u64);
> +	vmsg->payload.u64 = val;
> +	vmsg->fd_num = 0;
> +}
> +
> +/**
> + * vu_message_read_default() - Read incoming vhost-user message from the
> + * 			       front-end
> + * @conn_fd:	Vhost-user command socket
> + * @vmsg:	Vhost-user message
> + *
> + * Return: -1 if there is an error,
> + *          0 if recvmsg() has been interrupted,
> + *          1 if a message has been received
> + */
> +static int vu_message_read_default(int conn_fd, struct vhost_user_msg *vmsg)
> +{
> +	char control[CMSG_SPACE(VHOST_MEMORY_BASELINE_NREGIONS *
> +		     sizeof(int))] = { 0 };
> +	struct iovec iov = {
> +		.iov_base = (char *)vmsg,
> +		.iov_len = VHOST_USER_HDR_SIZE,
> +	};
> +	struct msghdr msg = {
> +		.msg_iov = &iov,
> +		.msg_iovlen = 1,
> +		.msg_control = control,
> +		.msg_controllen = sizeof(control),
> +	};
> +	ssize_t ret, sz_payload;
> +	struct cmsghdr *cmsg;
> +	size_t fd_size;
> +
> +	ret = recvmsg(conn_fd, &msg, MSG_DONTWAIT);
> +	if (ret < 0) {
> +		if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK)
> +			return 0;
> +		return -1;
> +	}
> +
> +	vmsg->fd_num = 0;
> +	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
> +	     cmsg = CMSG_NXTHDR(&msg, cmsg)) {
> +		if (cmsg->cmsg_level == SOL_SOCKET &&
> +		    cmsg->cmsg_type == SCM_RIGHTS) {
> +			fd_size = cmsg->cmsg_len - CMSG_LEN(0);
> +			ASSERT(fd_size / sizeof(int) <=
> +			       VHOST_MEMORY_BASELINE_NREGIONS);

IIUC, this could be tripped by a bug in the peer (qemu?) rather than
in our own code.  In which case I think a die() would be more
appropriate than an ASSERT().
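
Something like (sketch):

			if (fd_size / sizeof(int) > VHOST_MEMORY_BASELINE_NREGIONS)
				die("vhost-user message with too many descriptors");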

> +			vmsg->fd_num = fd_size / sizeof(int);
> +			memcpy(vmsg->fds, CMSG_DATA(cmsg), fd_size);
> +			break;
> +		}
> +	}
> +
> +	sz_payload = vmsg->hdr.size;
> +	if ((size_t)sz_payload > sizeof(vmsg->payload)) {
> +		die("Error: too big message request: %d,"
> +			 " size: vmsg->size: %zd, "
> +			 "while sizeof(vmsg->payload) = %zu",
> +			 vmsg->hdr.request, sz_payload, sizeof(vmsg->payload));
> +	}
> +
> +	if (sz_payload) {
> +		do {
> +			ret = recv(conn_fd, &vmsg->payload, sz_payload, 0);
> +		} while (ret < 0 && (errno == EINTR || errno == EAGAIN));
> +
> +		if (ret < sz_payload)
> +			die_perror("Error while reading");
> +	}
> +
> +	return 1;
> +}
> +
> +/**
> + * vu_message_write() - send a message to the front-end
> + * @conn_fd:	Vhost-user command socket
> + * @vmsg:	Vhost-user message
> + *
> + * #syscalls:vu sendmsg
> + */
> +static void vu_message_write(int conn_fd, struct vhost_user_msg *vmsg)
> +{
> +	char control[CMSG_SPACE(VHOST_MEMORY_BASELINE_NREGIONS * sizeof(int))] = { 0 };
> +	struct iovec iov = {
> +		.iov_base = (char *)vmsg,
> +		.iov_len = VHOST_USER_HDR_SIZE,
> +	};
> +	struct msghdr msg = {
> +		.msg_iov = &iov,
> +		.msg_iovlen = 1,
> +		.msg_control = control,
> +	};
> +	const uint8_t *p = (uint8_t *)vmsg;
> +	int rc;
> +
> +	memset(control, 0, sizeof(control));

I think this is redundant with the { 0 } initialiser.

> +	ASSERT(vmsg->fd_num <= VHOST_MEMORY_BASELINE_NREGIONS);
> +	if (vmsg->fd_num > 0) {
> +		size_t fdsize = vmsg->fd_num * sizeof(int);
> +		struct cmsghdr *cmsg;
> +
> +		msg.msg_controllen = CMSG_SPACE(fdsize);
> +		cmsg = CMSG_FIRSTHDR(&msg);
> +		cmsg->cmsg_len = CMSG_LEN(fdsize);
> +		cmsg->cmsg_level = SOL_SOCKET;
> +		cmsg->cmsg_type = SCM_RIGHTS;
> +		memcpy(CMSG_DATA(cmsg), vmsg->fds, fdsize);
> +	} else {
> +		msg.msg_controllen = 0;

I believe that, since you have a C99 initialiser on 'msg', fields not
explicitly mentioned will be initialised to 0, making this redundant.

> +	}
> +
> +	do {
> +		rc = sendmsg(conn_fd, &msg, 0);
> +	} while (rc < 0 && (errno == EINTR || errno == EAGAIN));
> +
> +	if (vmsg->hdr.size) {
> +		do {
> +			rc = write(conn_fd, p + VHOST_USER_HDR_SIZE,
> +				   vmsg->hdr.size);

Is there any particular reason to send the payload as a separate
write(), rather than including it as a second entry in the iov to
sendmsg above?  Or indeed as part of the first entry, since AFAICT the
payload is contiguous with the header.
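
That is, something like this should work, I think (sketch):

	struct iovec iov = {
		.iov_base = (char *)vmsg,
		.iov_len = VHOST_USER_HDR_SIZE + vmsg->hdr.size,
	};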

> +		} while (rc < 0 && (errno == EINTR || errno == EAGAIN));
> +	}

Checking for short writes seems like a good idea.  Even if it
shouldn't ever happen, a die() would be much easier to debug than some
cryptic failure because of truncated data.
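
Say (sketch):

		if (rc < (int)vmsg->hdr.size)
			die("vhost-user: short write of payload");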

> +	if (rc <= 0)
> +		die_perror("Error while writing");
> +}
> +
> +/**
> + * vu_send_reply() - Update message flags and send it to front-end
> + * @conn_fd:	Vhost-user command socket
> + * @vmsg:	Vhost-user message
> + */
> +static void vu_send_reply(int conn_fd, struct vhost_user_msg *msg)
> +{
> +	msg->hdr.flags &= ~VHOST_USER_VERSION_MASK;
> +	msg->hdr.flags |= VHOST_USER_VERSION;
> +	msg->hdr.flags |= VHOST_USER_REPLY_MASK;
> +
> +	vu_message_write(conn_fd, msg);
> +}
> +
> +/**
> + * vu_get_features_exec() - Provide back-end features bitmask to front-end
> + * @vmsg:	Vhost-user message
> + *
> + * Return: true as a reply is requested
> + */
> +static bool vu_get_features_exec(struct vhost_user_msg *msg)
> +{
> +	uint64_t features =
> +		1ULL << VIRTIO_F_VERSION_1 |
> +		1ULL << VIRTIO_NET_F_MRG_RXBUF |
> +		1ULL << VHOST_USER_F_PROTOCOL_FEATURES;
> +
> +	vmsg_set_reply_u64(msg, features);
> +
> +	debug("Sending back to guest u64: 0x%016"PRIx64, msg->payload.u64);
> +
> +	return true;
> +}
> +
> +/**
> + * vu_set_enable_all_rings() - Enable/disable all the virtqueues
> + * @vdev:	Vhost-user device
> + * @enable:	New virtqueues state
> + */
> +static void vu_set_enable_all_rings(struct vu_dev *vdev, bool enable)
> +{
> +	uint16_t i;
> +
> +	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++)
> +		vdev->vq[i].enable = enable;
> +}
> +
> +/**
> + * vu_set_features_exec() - Enable features of the back-end
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message
> + *
> + * Return: false as no reply is requested
> + */
> +static bool vu_set_features_exec(struct vu_dev *vdev,
> +				 struct vhost_user_msg *msg)
> +{
> +	debug("u64: 0x%016"PRIx64, msg->payload.u64);

A number of these debug() messages look like they'd be pretty cryptic,
with no indication of which part of passt they're coming from.  This
one is especially bad.
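
Prefixing them would already help, say:

	debug("vhost-user: SET_FEATURES: 0x%016"PRIx64, msg->payload.u64);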

> +
> +	vdev->features = msg->payload.u64;
> +	/* We only support devices conforming to VIRTIO 1.0 or
> +	 * later
> +	 */
> +	if (!vu_has_feature(vdev, VIRTIO_F_VERSION_1))
> +		die("virtio legacy devices aren't supported by passt");
> +
> +	if (!vu_has_feature(vdev, VHOST_USER_F_PROTOCOL_FEATURES))
> +		vu_set_enable_all_rings(vdev, true);
> +
> +	/* virtio-net features */
> +
> +	if (vu_has_feature(vdev, VIRTIO_F_VERSION_1) ||

You checked this is set above, making this test redundant, no?

> +	    vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) {
> +		vdev->hdrlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +	} else {
> +		vdev->hdrlen = sizeof(struct virtio_net_hdr);
> +	}
> +
> +	return false;
> +}
> +
> +/**
> + * vu_set_owner_exec() - Session start flag, do nothing in our case
> + *
> + * Return: false as no reply is requested
> + */
> +static bool vu_set_owner_exec(void)
> +{
> +	return false;
> +}
> +
> +/**
> + * map_ring() - Convert ring front-end (QEMU) addresses to our process
> + * 		virtual address space.
> + * @vdev:	Vhost-user device
> + * @vq:		Virtqueue
> + *
> + * Return: true if ring cannot be mapped to our address space
> + */
> +static bool map_ring(struct vu_dev *vdev, struct vu_virtq *vq)
> +{
> +	vq->vring.desc = qva_to_va(vdev, vq->vra.desc_user_addr);
> +	vq->vring.used = qva_to_va(vdev, vq->vra.used_user_addr);
> +	vq->vring.avail = qva_to_va(vdev, vq->vra.avail_user_addr);
> +
> +	debug("Setting virtq addresses:");
> +	debug("    vring_desc  at %p", (void *)vq->vring.desc);
> +	debug("    vring_used  at %p", (void *)vq->vring.used);
> +	debug("    vring_avail at %p", (void *)vq->vring.avail);
> +
> +	return !(vq->vring.desc && vq->vring.used && vq->vring.avail);
> +}
> +
> +/**
> + * vu_packet_check_range() - Check if a given memory zone is contained in
> + * 			     a mapped guest memory region
> + * @buf:	Array of the available memory regions
> + * @offset:	Offset of data range in packet descriptor
> + * @len:	Length of desired data range
> + * @start:	Start of the packet descriptor
> + *
> + * Return: 0 if the zone is in a mapped memory region, -1 otherwise
> + */
> +/* cppcheck-suppress unusedFunction */
> +int vu_packet_check_range(void *buf, size_t offset, size_t len,
> +			  const char *start)
> +{
> +	struct vu_dev_region *dev_region;
> +
> +	for (dev_region = buf; dev_region->mmap_addr; dev_region++) {
> +		/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
> +		char *m = (char *)dev_region->mmap_addr;
> +
> +		if (m <= start &&
> +		    start + offset + len < m + dev_region->mmap_offset +
> +					       dev_region->size)
> +			return 0;
> +	}
> +
> +	return -1;
> +}
> +
> +/**
> + * vu_set_mem_table_exec() - Sets the memory map regions to be able to
> + * 			     translate the vring addresses.
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message
> + *
> + * Return: false as no reply is requested
> + *
> + * #syscalls:vu mmap munmap
> + */
> +static bool vu_set_mem_table_exec(struct vu_dev *vdev,
> +				  struct vhost_user_msg *msg)
> +{
> +	struct vhost_user_memory m = msg->payload.memory, *memory = &m;

Is there a reason to take a copy of the message, rather than just
referencing into msg as passed?
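
i.e., wouldn't this do?

	struct vhost_user_memory *memory = &msg->payload.memory;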

> +	unsigned int i;
> +
> +	for (i = 0; i < vdev->nregions; i++) {
> +		struct vu_dev_region *r = &vdev->regions[i];
> +		/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
> +		void *mm = (void *)r->mmap_addr;
> +
> +		if (mm)
> +			munmap(mm, r->size + r->mmap_offset);

Do we actually ever need to change the mapping of the regions?  If not
we can avoid this unmapping loop.

> +	}
> +	vdev->nregions = memory->nregions;
> +
> +	debug("Nregions: %u", memory->nregions);
> +	for (i = 0; i < vdev->nregions; i++) {
> +		struct vhost_user_memory_region *msg_region = &memory->regions[i];
> +		struct vu_dev_region *dev_region = &vdev->regions[i];
> +		void *mmap_addr;
> +
> +		debug("Region %d", i);
> +		debug("    guest_phys_addr: 0x%016"PRIx64,
> +		      msg_region->guest_phys_addr);
> +		debug("    memory_size:     0x%016"PRIx64,
> +		      msg_region->memory_size);
> +		debug("    userspace_addr   0x%016"PRIx64,
> +		      msg_region->userspace_addr);
> +		debug("    mmap_offset      0x%016"PRIx64,
> +		      msg_region->mmap_offset);
> +
> +		dev_region->gpa = msg_region->guest_phys_addr;
> +		dev_region->size = msg_region->memory_size;
> +		dev_region->qva = msg_region->userspace_addr;
> +		dev_region->mmap_offset = msg_region->mmap_offset;
> +
> +		/* We don't use offset argument of mmap() since the
> +		 * mapped address has to be page aligned, and we use huge
> +		 * pages.

We do what now?

> +		 */
> +		mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset,
> +				 PROT_READ | PROT_WRITE, MAP_SHARED |
> +				 MAP_NORESERVE, msg->fds[i], 0);
> +
> +		if (mmap_addr == MAP_FAILED)
> +			die_perror("region mmap error");
> +
> +		dev_region->mmap_addr = (uint64_t)(uintptr_t)mmap_addr;
> +		debug("    mmap_addr:       0x%016"PRIx64,
> +		      dev_region->mmap_addr);
> +
> +		close(msg->fds[i]);
> +	}
> +
> +	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
> +		if (vdev->vq[i].vring.desc) {
> +			if (map_ring(vdev, &vdev->vq[i]))
> +				die("remapping queue %d during setmemtable", i);
> +		}
> +	}
> +
> +	return false;
> +}
> +
> +/**
> + * vu_set_vring_num_exec() - Set the size of the queue (vring size)
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message
> + *
> + * Return: false as no reply is requested
> + */
> +static bool vu_set_vring_num_exec(struct vu_dev *vdev,
> +				  struct vhost_user_msg *msg)
> +{
> +	unsigned int idx = msg->payload.state.index;
> +	unsigned int num = msg->payload.state.num;
> +
> +	debug("State.index: %u", idx);
> +	debug("State.num:   %u", num);
> +	vdev->vq[idx].vring.num = num;
> +
> +	return false;
> +}
> +
> +/**
> + * vu_set_vring_addr_exec() - Set the addresses of the vring
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message
> + *
> + * Return: false as no reply is requested
> + */
> +static bool vu_set_vring_addr_exec(struct vu_dev *vdev,
> +				   struct vhost_user_msg *msg)
> +{
> +	struct vhost_vring_addr addr = msg->payload.addr, *vra = &addr;

Again, any reason to copy the message?
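
Same pattern as for set_mem_table, i.e.:

	struct vhost_vring_addr *vra = &msg->payload.addr;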

> +	struct vu_virtq *vq = &vdev->vq[vra->index];
> +
> +	debug("vhost_vring_addr:");
> +	debug("    index:  %d", vra->index);
> +	debug("    flags:  %d", vra->flags);
> +	debug("    desc_user_addr:   0x%016" PRIx64, (uint64_t)vra->desc_user_addr);
> +	debug("    used_user_addr:   0x%016" PRIx64, (uint64_t)vra->used_user_addr);
> +	debug("    avail_user_addr:  0x%016" PRIx64, (uint64_t)vra->avail_user_addr);
> +	debug("    log_guest_addr:   0x%016" PRIx64, (uint64_t)vra->log_guest_addr);
> +
> +	vq->vra = *vra;

.. and then copy it again?

> +	vq->vring.flags = vra->flags;
> +	vq->vring.log_guest_addr = vra->log_guest_addr;
> +
> +	if (map_ring(vdev, vq))
> +		die("Invalid vring_addr message");
> +
> +	vq->used_idx = le16toh(vq->vring.used->idx);
> +
> +	if (vq->last_avail_idx != vq->used_idx) {
> +		debug("Last avail index != used index: %u != %u",
> +		      vq->last_avail_idx, vq->used_idx);
> +	}
> +
> +	return false;
> +}
> +
> +/**
> + * vu_set_vring_base_exec() - Sets the next index to use for descriptors
> + * 			      in this vring
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message
> + *
> + * Return: false as no reply is requested
> + */
> +static bool vu_set_vring_base_exec(struct vu_dev *vdev,
> +				   struct vhost_user_msg *msg)
> +{
> +	unsigned int idx = msg->payload.state.index;
> +	unsigned int num = msg->payload.state.num;
> +
> +	debug("State.index: %u", idx);
> +	debug("State.num:   %u", num);
> +	vdev->vq[idx].shadow_avail_idx = vdev->vq[idx].last_avail_idx = num;
> +
> +	return false;
> +}
> +
> +/**
> + * vu_get_vring_base_exec() - Stops the vring and returns the current
> + * 			      descriptor index or indices
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message
> + *
> + * Return: true as a reply is requested
> + */
> +static bool vu_get_vring_base_exec(struct vu_dev *vdev,
> +				   struct vhost_user_msg *msg)
> +{
> +	unsigned int idx = msg->payload.state.index;
> +
> +	debug("State.index: %u", idx);
> +	msg->payload.state.num = vdev->vq[idx].last_avail_idx;
> +	msg->hdr.size = sizeof(msg->payload.state);
> +
> +	vdev->vq[idx].started = false;
> +
> +	if (vdev->vq[idx].call_fd != -1) {
> +		close(vdev->vq[idx].call_fd);
> +		vdev->vq[idx].call_fd = -1;
> +	}
> +	if (vdev->vq[idx].kick_fd != -1) {
> +		vu_remove_watch(vdev, vdev->vq[idx].kick_fd);
> +		close(vdev->vq[idx].kick_fd);
> +		vdev->vq[idx].kick_fd = -1;
> +	}
> +
> +	return true;
> +}
> +
> +/**
> + * vu_set_watch() - Add a file descriptor to the passt epoll file descriptor
> + * @vdev:	vhost-user device
> + * @fd:		file descriptor to add
> + */
> +static void vu_set_watch(const struct vu_dev *vdev, int fd)
> +{
> +	(void)vdev;
> +	(void)fd;

As with remove, this doesn't appear to do what the function comment
says.  Are these placeholders?  A TODO comment would make that
clearer, if so.
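
Something like this, say (assuming it's wired up by a later patch in
the series):

	static void vu_set_watch(const struct vu_dev *vdev, int fd)
	{
		/* TODO: add @fd to the passt epoll set; hooked up by
		 * a later patch */
		(void)vdev;
		(void)fd;
	}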

> +}
> +
> +/**
> + * vu_wait_queue() - wait new free entries in the virtqueue

s/wait/wait for/?

> + * @vq:		virtqueue to wait on
> + */
> +static int vu_wait_queue(const struct vu_virtq *vq)
> +{
> +	eventfd_t kick_data;
> +	ssize_t rc;
> +	int status;
> +
> +	/* wait for the kernel to put new entries in the queue */
> +	status = fcntl(vq->kick_fd, F_GETFL);
> +	if (status == -1)
> +		return -1;
> +
> +	status = fcntl(vq->kick_fd, F_SETFL, status & ~O_NONBLOCK);
> +	if (status == -1)
> +		return -1;
> +	rc = eventfd_read(vq->kick_fd, &kick_data);
> +	status = fcntl(vq->kick_fd, F_SETFL, status);
> +	if (status == -1)
> +		return -1;
> +
> +	if (rc == -1)
> +		return -1;
> +
> +	return 0;
> +}
> +
> +/**
> + * vu_send() - Send a buffer to the front-end using the RX virtqueue
> + * @vdev:	vhost-user device
> + * @buf:	address of the buffer
> + * @size:	size of the buffer
> + *
> + * Return: number of bytes sent, -1 if there is an error
> + */
> +/* cppcheck-suppress unusedFunction */
> +int vu_send(struct vu_dev *vdev, const void *buf, size_t size)
> +{
> +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> +	struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
> +	struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
> +	size_t lens[VIRTQUEUE_MAX_SIZE];
> +	__virtio16 *num_buffers_ptr = NULL;
> +	size_t hdrlen = vdev->hdrlen;
> +	int in_sg_count = 0;
> +	size_t offset = 0;
> +	int i = 0, j;
> +
> +	debug("vu_send size %zu hdrlen %zu", size, hdrlen);
> +
> +	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
> +		err("Got packet, but no available descriptors on RX virtq.");
> +		return 0;
> +	}
> +
> +	while (offset < size) {
> +		size_t len;
> +		int total;
> +		int ret;
> +
> +		total = 0;
> +
> +		if (i == ARRAY_SIZE(elem) ||
> +		    in_sg_count == ARRAY_SIZE(in_sg)) {
> +			err("virtio-net unexpected long buffer chain");
> +			goto err;
> +		}
> +
> +		elem[i].out_num = 0;
> +		elem[i].out_sg = NULL;
> +		elem[i].in_num = ARRAY_SIZE(in_sg) - in_sg_count;
> +		elem[i].in_sg = &in_sg[in_sg_count];
> +
> +		ret = vu_queue_pop(vdev, vq, &elem[i]);
> +		if (ret < 0) {
> +			if (vu_wait_queue(vq) != -1)
> +				continue;
> +			if (i) {
> +				err("virtio-net unexpected empty queue: "
> +				    "i %d mergeable %d offset %zd, size %zd, "
> +				    "features 0x%" PRIx64,
> +				    i, vu_has_feature(vdev,
> +						      VIRTIO_NET_F_MRG_RXBUF),
> +				    offset, size, vdev->features);
> +			}
> +			offset = -1;
> +			goto err;
> +		}
> +		in_sg_count += elem[i].in_num;

Initially I thought this would consume the entire in_sg array on the
first loop iteration, but I guess vu_queue_pop() reduces in_num from
the value we initialise above.

> +		if (elem[i].in_num < 1) {

I realise it doesn't really matter in this context, but it makes more
sense to me for this check to go _before_ we use in_num to update
in_sg_count.
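
i.e., just the reordered version of the quoted code:

	if (elem[i].in_num < 1) {
		err("virtio-net receive queue contains no in buffers");
		vu_queue_detach_element(vq);
		offset = -1;
		goto err;
	}
	in_sg_count += elem[i].in_num;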


> +			err("virtio-net receive queue contains no in buffers");
> +			vu_queue_detach_element(vq);
> +			offset = -1;
> +			goto err;
> +		}
> +
> +		if (i == 0) {
> +			struct virtio_net_hdr hdr = {
> +				.flags = VIRTIO_NET_HDR_F_DATA_VALID,
> +				.gso_type = VIRTIO_NET_HDR_GSO_NONE,
> +			};
> +
> +			ASSERT(offset == 0);
> +			ASSERT(elem[i].in_sg[0].iov_len >= hdrlen);

Is this necessarily our bug, or could it be caused by the peer giving
unreasonably small buffers?  If the latter, then a die() would make
more sense.
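
i.e., if the guest can trigger it, something like:

	if (elem[i].in_sg[0].iov_len < hdrlen)
		die("vhost-user: first RX buffer too small (%zu < %zu)",
		    elem[i].in_sg[0].iov_len, hdrlen);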

> +
> +			len = iov_from_buf(elem[i].in_sg, elem[i].in_num, 0,
> +					   &hdr, sizeof(hdr));
> +
> +			num_buffers_ptr = (__virtio16 *)((char *)elem[i].in_sg[0].iov_base +
> +							 len);
> +
> +			total += hdrlen;
> +		}
> +
> +		len = iov_from_buf(elem[i].in_sg, elem[i].in_num, total,
> +				   (char *)buf + offset, size - offset);
> +
> +		total += len;
> +		offset += len;
> +
> +		/* If buffers can't be merged, at this point we
> +		 * must have consumed the complete packet.
> +		 * Otherwise, drop it.
> +		 */
> +		if (!vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF) &&
> +		    offset < size) {
> +			vu_queue_unpop(vq);
> +			goto err;
> +		}
> +
> +		lens[i] = total;
> +		i++;
> +	}
> +
> +	if (num_buffers_ptr && vu_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
> +		*num_buffers_ptr = htole16(i);
> +
> +	for (j = 0; j < i; j++) {
> +		debug("filling total %zd idx %d", lens[j], j);
> +		vu_queue_fill(vq, &elem[j], lens[j], j);
> +	}
> +
> +	vu_queue_flush(vq, i);
> +	vu_queue_notify(vdev, vq);
> +
> +	debug("vhost-user sent %zu", offset);
> +
> +	return offset;
> +err:
> +	for (j = 0; j < i; j++)
> +		vu_queue_detach_element(vq);
> +
> +	return offset;
> +}
> +
> +/**
> + * vu_handle_tx() - Receive data from the TX virtqueue
> + * @vdev:	vhost-user device
> + * @index:	index of the virtqueue
> + */
> +static void vu_handle_tx(struct vu_dev *vdev, int index,
> +			 const struct timespec *now)
> +{
> +	struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
> +	struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
> +	struct vu_virtq *vq = &vdev->vq[index];
> +	int hdrlen = vdev->hdrlen;
> +	int out_sg_count;
> +	int count;
> +
> +	if (!VHOST_USER_IS_QUEUE_TX(index)) {
> +		debug("index %d is not a TX queue", index);
> +		return;
> +	}
> +
> +	tap_flush_pools();
> +
> +	count = 0;
> +	out_sg_count = 0;
> +	while (1) {
> +		int ret;
> +
> +		elem[count].out_num = 1;
> +		elem[count].out_sg = &out_sg[out_sg_count];
> +		elem[count].in_num = 0;
> +		elem[count].in_sg = NULL;
> +		ret = vu_queue_pop(vdev, vq, &elem[count]);
> +		if (ret < 0)
> +			break;
> +		out_sg_count += elem[count].out_num;
> +
> +		if (elem[count].out_num < 1) {
> +			debug("virtio-net header not in first element");
> +			break;
> +		}
> +		ASSERT(elem[count].out_num == 1);
> +
> +		tap_add_packet(vdev->context,
> +			       elem[count].out_sg[0].iov_len - hdrlen,
> +			       (char *)elem[count].out_sg[0].iov_base + hdrlen);
> +		count++;
> +	}
> +	tap_handler(vdev->context, now);
> +
> +	if (count) {
> +		int i;
> +
> +		for (i = 0; i < count; i++)
> +			vu_queue_fill(vq, &elem[i], 0, i);
> +		vu_queue_flush(vq, count);
> +		vu_queue_notify(vdev, vq);
> +	}
> +}
> +
> +/**
> + * vu_kick_cb() - Called on a kick event to start to receive data
> + * @vdev:	vhost-user device
> + * @ref:	epoll reference information

Missing @now argument

> + */
> +/* cppcheck-suppress unusedFunction */
> +void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
> +		const struct timespec *now)
> +{
> +	eventfd_t kick_data;
> +	ssize_t rc;
> +	int idx;
> +
> +	for (idx = 0; idx < VHOST_USER_MAX_QUEUES; idx++)
> +		if (vdev->vq[idx].kick_fd == ref.fd)

Eventually I think it would be preferable to put the vq index directly
into the epoll ref, rather than having to scan through the queues for
the right one.  I'm ok with that being a follow up change, though.
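
Roughly (the 'queue' field in the vhost-user part of union epoll_ref
is hypothetical here, filled in when the kick fd is registered):

	/* hypothetical 'queue' field, set when the fd was added */
	idx = ref.queue;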

> +			break;
> +
> +	if (idx == VHOST_USER_MAX_QUEUES)
> +		return;
> +
> +	rc = eventfd_read(ref.fd, &kick_data);
> +	if (rc == -1)
> +		die_perror("kick eventfd_read()");
> +
> +	debug("Got kick_data: %016"PRIx64" idx:%d",
> +	      kick_data, idx);
> +	if (VHOST_USER_IS_QUEUE_TX(idx))
> +		vu_handle_tx(vdev, idx, now);
> +}
> +
> +/**
> + * vu_check_queue_msg_file() - Check if a message is valid,
> + * 			       close fds if NOFD bit is set
> + * @vmsg:	Vhost-user message
> + */
> +static void vu_check_queue_msg_file(struct vhost_user_msg *msg)
> +{
> +	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
> +	int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
> +
> +	if (idx >= VHOST_USER_MAX_QUEUES)
> +		die("Invalid queue index: %u", idx);
> +
> +	if (nofd) {
> +		vmsg_close_fds(msg);
> +		return;
> +	}
> +
> +	if (msg->fd_num != 1)
> +		die("Invalid fds in request: %d", msg->hdr.request);
> +}
> +
> +/**
> + * vu_set_vring_kick_exec() - Set the event file descriptor for adding buffers
> + * 			      to the vring
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message
> + *
> + * Return: false as no reply is requested
> + */
> +static bool vu_set_vring_kick_exec(struct vu_dev *vdev,
> +				   struct vhost_user_msg *msg)
> +{
> +	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
> +	int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
> +
> +	debug("u64: 0x%016"PRIx64, msg->payload.u64);
> +
> +	vu_check_queue_msg_file(msg);
> +
> +	if (vdev->vq[idx].kick_fd != -1) {
> +		vu_remove_watch(vdev, vdev->vq[idx].kick_fd);
> +		close(vdev->vq[idx].kick_fd);
> +	}
> +
> +	vdev->vq[idx].kick_fd = nofd ? -1 : msg->fds[0];
> +	debug("Got kick_fd: %d for vq: %d", vdev->vq[idx].kick_fd, idx);
> +
> +	vdev->vq[idx].started = true;
> +
> +	if (vdev->vq[idx].kick_fd != -1 && VHOST_USER_IS_QUEUE_TX(idx)) {
> +		vu_set_watch(vdev, vdev->vq[idx].kick_fd);
> +		debug("Waiting for kicks on fd: %d for vq: %d",
> +		      vdev->vq[idx].kick_fd, idx);
> +	}
> +
> +	return false;
> +}
> +
> +/**
> + * vu_set_vring_call_exec() - Set the event file descriptor to signal when
> + * 			      buffers are used
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message
> + *
> + * Return: false as no reply is requested
> + */
> +static bool vu_set_vring_call_exec(struct vu_dev *vdev,
> +				   struct vhost_user_msg *msg)
> +{
> +	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
> +	int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
> +
> +	debug("u64: 0x%016"PRIx64, msg->payload.u64);
> +
> +	vu_check_queue_msg_file(msg);
> +
> +	if (vdev->vq[idx].call_fd != -1)
> +		close(vdev->vq[idx].call_fd);
> +
> +	vdev->vq[idx].call_fd = nofd ? -1 : msg->fds[0];
> +
> +	/* in case of I/O hang after reconnecting */
> +	if (vdev->vq[idx].call_fd != -1)
> +		eventfd_write(msg->fds[0], 1);
> +
> +	debug("Got call_fd: %d for vq: %d", vdev->vq[idx].call_fd, idx);
> +
> +	return false;
> +}
> +
> +/**
> + * vu_set_vring_err_exec() - Set the event file descriptor to signal when
> + * 			     error occurs
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message
> + *
> + * Return: false as no reply is requested
> + */
> +static bool vu_set_vring_err_exec(struct vu_dev *vdev,
> +				  struct vhost_user_msg *msg)
> +{
> +	bool nofd = msg->payload.u64 & VHOST_USER_VRING_NOFD_MASK;
> +	int idx = msg->payload.u64 & VHOST_USER_VRING_IDX_MASK;
> +
> +	debug("u64: 0x%016"PRIx64, msg->payload.u64);
> +
> +	vu_check_queue_msg_file(msg);
> +
> +	if (vdev->vq[idx].err_fd != -1) {
> +		close(vdev->vq[idx].err_fd);
> +		vdev->vq[idx].err_fd = -1;
> +	}
> +
> +	/* cppcheck-suppress redundantAssignment */
> +	vdev->vq[idx].err_fd = nofd ? -1 : msg->fds[0];
> +
> +	return false;
> +}
> +
> +/**
> + * vu_get_protocol_features_exec() - Provide the protocol (vhost-user) features
> + * 				     to the front-end
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message
> + *
> + * Return: true as a reply is requested
> + */
> +static bool vu_get_protocol_features_exec(struct vhost_user_msg *msg)
> +{
> +	uint64_t features = 1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK;
> +
> +	vmsg_set_reply_u64(msg, features);
> +
> +	return true;
> +}
> +
> +/**
> + * vu_set_protocol_features_exec() - Enable protocol (vhost-user) features
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message
> + *
> + * Return: false as no reply is requested
> + */
> +static bool vu_set_protocol_features_exec(struct vu_dev *vdev,
> +					  struct vhost_user_msg *msg)
> +{
> +	uint64_t features = msg->payload.u64;
> +
> +	debug("u64: 0x%016"PRIx64, features);
> +
> +	vdev->protocol_features = msg->payload.u64;
> +
> +	if (vu_has_protocol_feature(vdev,
> +				    VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS) &&
> +	    (!vu_has_protocol_feature(vdev, VHOST_USER_PROTOCOL_F_BACKEND_REQ) ||
> +	     !vu_has_protocol_feature(vdev, VHOST_USER_PROTOCOL_F_REPLY_ACK))) {
> +	/*
> +	 * The use case for using messages for kick/call is simulation, to make
> +	 * the kick and call synchronous. To actually get that behaviour, both
> +	 * of the other features are required.
> +	 * Theoretically, one could use only kick messages, or do them without
> +	 * having F_REPLY_ACK, but too many (possibly pending) messages on the
> +	 * socket will eventually cause the master to hang. To avoid this in
> +	 * scenarios where it's not desired, enforce settings that actually
> +	 * enable the simulation case.
> +	 */
> +		die("F_IN_BAND_NOTIFICATIONS requires F_BACKEND_REQ && F_REPLY_ACK");
> +	}
> +
> +	return false;
> +}
> +
> +/**
> + * vu_get_queue_num_exec() - Tell how many queues we support
> + * @vmsg:	Vhost-user message
> + *
> + * Return: true as a reply is requested
> + */
> +static bool vu_get_queue_num_exec(struct vhost_user_msg *msg)
> +{
> +	vmsg_set_reply_u64(msg, VHOST_USER_MAX_QUEUES);
> +	return true;
> +}
> +
> +/**
> + * vu_set_vring_enable_exec() - Enable or disable corresponding vring
> + * @vdev:	Vhost-user device
> + * @vmsg:	Vhost-user message
> + *
> + * Return: false as no reply is requested
> + */
> +static bool vu_set_vring_enable_exec(struct vu_dev *vdev,
> +				     struct vhost_user_msg *msg)
> +{
> +	unsigned int enable = msg->payload.state.num;
> +	unsigned int idx = msg->payload.state.index;
> +
> +	debug("State.index:  %u", idx);
> +	debug("State.enable: %u", enable);
> +
> +	if (idx >= VHOST_USER_MAX_QUEUES)
> +		die("Invalid vring_enable index: %u", idx);
> +
> +	vdev->vq[idx].enable = enable;
> +	return false;
> +}
> +
> +/**
> + * vu_init() - Initialize vhost-user device structure
> + * @c:		execution context
> + * @vdev:	vhost-user device
> + */
> +/* cppcheck-suppress unusedFunction */
> +void vu_init(struct ctx *c, struct vu_dev *vdev)
> +{
> +	int i;
> +
> +	vdev->context = c;
> +	vdev->hdrlen = 0;
> +	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
> +		vdev->vq[i] = (struct vu_virtq){
> +			.call_fd = -1,
> +			.kick_fd = -1,
> +			.err_fd = -1,
> +			.notification = true,
> +		};
> +	}
> +}
> +
> +/**
> + * vu_cleanup() - Reset vhost-user device
> + * @vdev:	vhost-user device
> + */
> +void vu_cleanup(struct vu_dev *vdev)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < VHOST_USER_MAX_QUEUES; i++) {
> +		struct vu_virtq *vq = &vdev->vq[i];
> +
> +		vq->started = false;
> +		vq->notification = true;
> +
> +		if (vq->call_fd != -1) {
> +			close(vq->call_fd);
> +			vq->call_fd = -1;
> +		}
> +		if (vq->err_fd != -1) {
> +			close(vq->err_fd);
> +			vq->err_fd = -1;
> +		}
> +		if (vq->kick_fd != -1) {
> +			vu_remove_watch(vdev, vq->kick_fd);
> +			close(vq->kick_fd);
> +			vq->kick_fd = -1;
> +		}
> +
> +		vq->vring.desc = 0;
> +		vq->vring.used = 0;
> +		vq->vring.avail = 0;
> +	}
> +	vdev->hdrlen = 0;
> +
> +	for (i = 0; i < vdev->nregions; i++) {
> +		const struct vu_dev_region *r = &vdev->regions[i];
> +		/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
> +		void *m = (void *)r->mmap_addr;
> +
> +		if (m)
> +			munmap(m, r->size + r->mmap_offset);
> +	}
> +	vdev->nregions = 0;
> +}
> +
> +/**
> + * vu_sock_reset() - Reset connection socket
> + * @vdev:	vhost-user device
> + */
> +static void vu_sock_reset(struct vu_dev *vdev)
> +{
> +	(void)vdev;

Placeholder?

> +}
> +
> +/**
> + * tap_handler_vu() - Packet handler for vhost-user
> + * @vdev:	vhost-user device
> + * @fd:		vhost-user message socket
> + * @events:	epoll events
> + */
> +/* cppcheck-suppress unusedFunction */
> +void tap_handler_vu(struct vu_dev *vdev, int fd, uint32_t events)

I think this name is misleading.  While we are re-using fd_tap for the
vhost-user control socket, this is quite unlike most of the other
tap_handler functions: those are generally related to getting a new
packet from the "tap" interface - it's the main entry point into the
data path from the guest.  This is, instead, a control path function,
more akin to tap_listen_handler() (also not a great name).  Maybe
"vu_socket_handler()" or "vu_control_handler()"?  tap_handler_vu() I'd
expect to be the function that handles notifications on the queue
receiving packets from the guest.

> +{
> +	struct vhost_user_msg msg = { 0 };
> +	bool need_reply, reply_requested;
> +	int ret;
> +
> +	if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) {
> +		vu_sock_reset(vdev);
> +		return;
> +	}
> +
> +	ret = vu_message_read_default(fd, &msg);
> +	if (ret < 0)
> +		die_perror("Error while recvmsg");
> +	if (ret == 0) {
> +		vu_sock_reset(vdev);
> +		return;
> +	}
> +	debug("================ Vhost user message ================");
> +	debug("Request: %s (%d)", vu_request_to_string(msg.hdr.request),
> +		msg.hdr.request);
> +	debug("Flags:   0x%x", msg.hdr.flags);
> +	debug("Size:    %u", msg.hdr.size);
> +
> +	need_reply = msg.hdr.flags & VHOST_USER_NEED_REPLY_MASK;
> +	switch (msg.hdr.request) {
> +	case VHOST_USER_GET_FEATURES:
> +		reply_requested = vu_get_features_exec(&msg);
> +		break;
> +	case VHOST_USER_SET_FEATURES:
> +		reply_requested = vu_set_features_exec(vdev, &msg);
> +		break;
> +	case VHOST_USER_GET_PROTOCOL_FEATURES:
> +		reply_requested = vu_get_protocol_features_exec(&msg);
> +		break;
> +	case VHOST_USER_SET_PROTOCOL_FEATURES:
> +		reply_requested = vu_set_protocol_features_exec(vdev, &msg);
> +		break;
> +	case VHOST_USER_GET_QUEUE_NUM:
> +		reply_requested = vu_get_queue_num_exec(&msg);
> +		break;
> +	case VHOST_USER_SET_OWNER:
> +		reply_requested = vu_set_owner_exec();
> +		break;
> +	case VHOST_USER_SET_MEM_TABLE:
> +		reply_requested = vu_set_mem_table_exec(vdev, &msg);
> +		break;
> +	case VHOST_USER_SET_VRING_NUM:
> +		reply_requested = vu_set_vring_num_exec(vdev, &msg);
> +		break;
> +	case VHOST_USER_SET_VRING_ADDR:
> +		reply_requested = vu_set_vring_addr_exec(vdev, &msg);
> +		break;
> +	case VHOST_USER_SET_VRING_BASE:
> +		reply_requested = vu_set_vring_base_exec(vdev, &msg);
> +		break;
> +	case VHOST_USER_GET_VRING_BASE:
> +		reply_requested = vu_get_vring_base_exec(vdev, &msg);
> +		break;
> +	case VHOST_USER_SET_VRING_KICK:
> +		reply_requested = vu_set_vring_kick_exec(vdev, &msg);
> +		break;
> +	case VHOST_USER_SET_VRING_CALL:
> +		reply_requested = vu_set_vring_call_exec(vdev, &msg);
> +		break;
> +	case VHOST_USER_SET_VRING_ERR:
> +		reply_requested = vu_set_vring_err_exec(vdev, &msg);
> +		break;
> +	case VHOST_USER_SET_VRING_ENABLE:
> +		reply_requested = vu_set_vring_enable_exec(vdev, &msg);
> +		break;
> +	case VHOST_USER_NONE:
> +		vu_cleanup(vdev);
> +		return;
> +	default:
> +		die("Unhandled request: %d", msg.hdr.request);
> +	}
> +
> +	if (!reply_requested && need_reply) {
> +		msg.payload.u64 = 0;
> +		msg.hdr.flags = 0;
> +		msg.hdr.size = sizeof(msg.payload.u64);
> +		msg.fd_num = 0;
> +		reply_requested = true;
> +	}
> +
> +	if (reply_requested)
> +		vu_send_reply(fd, &msg);
> +}
> diff --git a/vhost_user.h b/vhost_user.h
> new file mode 100644
> index 000000000000..135856dc2873
> --- /dev/null
> +++ b/vhost_user.h
> @@ -0,0 +1,202 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later
> + * Copyright Red Hat
> + * Author: Laurent Vivier <lvivier@redhat.com>
> + *
> + * vhost-user API, command management and virtio interface
> + */
> +
> +/* some parts from subprojects/libvhost-user/libvhost-user.h */
> +
> +#ifndef VHOST_USER_H
> +#define VHOST_USER_H
> +
> +#include "virtio.h"
> +#include "iov.h"
> +
> +#define VHOST_USER_F_PROTOCOL_FEATURES 30
> +
> +#define VHOST_MEMORY_BASELINE_NREGIONS 8
> +
> +/**
> + * enum vhost_user_protocol_feature - List of available vhost-user features
> + */
> +enum vhost_user_protocol_feature {
> +	VHOST_USER_PROTOCOL_F_MQ = 0,
> +	VHOST_USER_PROTOCOL_F_LOG_SHMFD = 1,
> +	VHOST_USER_PROTOCOL_F_RARP = 2,
> +	VHOST_USER_PROTOCOL_F_REPLY_ACK = 3,
> +	VHOST_USER_PROTOCOL_F_NET_MTU = 4,
> +	VHOST_USER_PROTOCOL_F_BACKEND_REQ = 5,
> +	VHOST_USER_PROTOCOL_F_CROSS_ENDIAN = 6,
> +	VHOST_USER_PROTOCOL_F_CRYPTO_SESSION = 7,
> +	VHOST_USER_PROTOCOL_F_PAGEFAULT = 8,
> +	VHOST_USER_PROTOCOL_F_CONFIG = 9,
> +	VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD = 10,
> +	VHOST_USER_PROTOCOL_F_HOST_NOTIFIER = 11,
> +	VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD = 12,
> +	VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS = 14,
> +	VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
> +
> +	VHOST_USER_PROTOCOL_F_MAX
> +};
> +
> +/**
> + * enum vhost_user_request - list of available vhost-user request
> + */
> +enum vhost_user_request {
> +	VHOST_USER_NONE = 0,
> +	VHOST_USER_GET_FEATURES = 1,
> +	VHOST_USER_SET_FEATURES = 2,
> +	VHOST_USER_SET_OWNER = 3,
> +	VHOST_USER_RESET_OWNER = 4,
> +	VHOST_USER_SET_MEM_TABLE = 5,
> +	VHOST_USER_SET_LOG_BASE = 6,
> +	VHOST_USER_SET_LOG_FD = 7,
> +	VHOST_USER_SET_VRING_NUM = 8,
> +	VHOST_USER_SET_VRING_ADDR = 9,
> +	VHOST_USER_SET_VRING_BASE = 10,
> +	VHOST_USER_GET_VRING_BASE = 11,
> +	VHOST_USER_SET_VRING_KICK = 12,
> +	VHOST_USER_SET_VRING_CALL = 13,
> +	VHOST_USER_SET_VRING_ERR = 14,
> +	VHOST_USER_GET_PROTOCOL_FEATURES = 15,
> +	VHOST_USER_SET_PROTOCOL_FEATURES = 16,
> +	VHOST_USER_GET_QUEUE_NUM = 17,
> +	VHOST_USER_SET_VRING_ENABLE = 18,
> +	VHOST_USER_SEND_RARP = 19,
> +	VHOST_USER_NET_SET_MTU = 20,
> +	VHOST_USER_SET_BACKEND_REQ_FD = 21,
> +	VHOST_USER_IOTLB_MSG = 22,
> +	VHOST_USER_SET_VRING_ENDIAN = 23,
> +	VHOST_USER_GET_CONFIG = 24,
> +	VHOST_USER_SET_CONFIG = 25,
> +	VHOST_USER_CREATE_CRYPTO_SESSION = 26,
> +	VHOST_USER_CLOSE_CRYPTO_SESSION = 27,
> +	VHOST_USER_POSTCOPY_ADVISE  = 28,
> +	VHOST_USER_POSTCOPY_LISTEN  = 29,
> +	VHOST_USER_POSTCOPY_END     = 30,
> +	VHOST_USER_GET_INFLIGHT_FD = 31,
> +	VHOST_USER_SET_INFLIGHT_FD = 32,
> +	VHOST_USER_GPU_SET_SOCKET = 33,
> +	VHOST_USER_VRING_KICK = 35,
> +	VHOST_USER_GET_MAX_MEM_SLOTS = 36,
> +	VHOST_USER_ADD_MEM_REG = 37,
> +	VHOST_USER_REM_MEM_REG = 38,
> +	VHOST_USER_MAX
> +};
> +
> +/**
> + * struct vhost_user_header - Vhost-user message header
> + * @request:	Request type of the message
> + * @flags:	Request flags
> + * @size:	The following payload size
> + */
> +struct vhost_user_header {
> +	enum vhost_user_request request;
> +
> +#define VHOST_USER_VERSION_MASK     0x3
> +#define VHOST_USER_REPLY_MASK       (0x1 << 2)
> +#define VHOST_USER_NEED_REPLY_MASK  (0x1 << 3)
> +	uint32_t flags;
> +	uint32_t size; /* the following payload size */
> +} __attribute__ ((__packed__));
> +
> +/**
> + * struct vhost_user_memory_region - Front-end shared memory region information
> + * @guest_phys_addr:	Guest physical address of the region
> + * @memory_size:	Memory size
> + * @userspace_addr:	front-end (QEMU) userspace address
> + * @mmap_offset:	region offset in the shared memory area
> + */
> +struct vhost_user_memory_region {
> +	uint64_t guest_phys_addr;
> +	uint64_t memory_size;
> +	uint64_t userspace_addr;
> +	uint64_t mmap_offset;
> +};
> +
> +/**
> + * struct vhost_user_memory - List of all the shared memory regions
> + * @nregions:	Number of memory regions
> + * @padding:	Padding
> + * @regions:	Memory regions list
> + */
> +struct vhost_user_memory {
> +	uint32_t nregions;
> +	uint32_t padding;
> +	struct vhost_user_memory_region regions[VHOST_MEMORY_BASELINE_NREGIONS];
> +};
> +
> +/**
> + * union vhost_user_payload - Vhost-user message payload
> + * @u64:		64bit payload
> + * @state:		Vring state payload
> + * @addr:		Vring addresses payload
> + * @memory:		Memory regions information payload
> + */
> +union vhost_user_payload {
> +#define VHOST_USER_VRING_IDX_MASK   0xff
> +#define VHOST_USER_VRING_NOFD_MASK  (0x1 << 8)
> +	uint64_t u64;
> +	struct vhost_vring_state state;
> +	struct vhost_vring_addr addr;
> +	struct vhost_user_memory memory;
> +};
> +
> +/**
> + * struct vhost_user_msg - Vhost-user message
> + * @hdr:		Message header
> + * @payload:		Message payload
> + * @fds:		File descriptors associated with the message
> + * 			in the ancillary data.
> + * 			(shared memory or event file descriptors)
> + * @fd_num:		Number of file descriptors
> + */
> +struct vhost_user_msg {
> +	struct vhost_user_header hdr;
> +	union vhost_user_payload payload;
> +
> +	int fds[VHOST_MEMORY_BASELINE_NREGIONS];
> +	int fd_num;
> +} __attribute__ ((__packed__));
> +#define VHOST_USER_HDR_SIZE sizeof(struct vhost_user_header)
> +
> +/* index of the RX virtqueue */
> +#define VHOST_USER_RX_QUEUE 0
> +/* index of the TX virtqueue */
> +#define VHOST_USER_TX_QUEUE 1
> +
> +/* in case of multiqueue, the RX and TX queues are interleaved */
> +#define VHOST_USER_IS_QUEUE_TX(n)	(n % 2)
> +#define VHOST_USER_IS_QUEUE_RX(n)	(!(n % 2))
> +
> +/**
> + * vu_queue_enabled - Return state of a virtqueue
> + * @vq:		Virtqueue to check
> + *
> + * Return: true if the virtqueue is enabled, false otherwise
> + */
> +static inline bool vu_queue_enabled(const struct vu_virtq *vq)
> +{
> +	return vq->enable;
> +}
> +
> +/**
> + * vu_queue_started - Return state of a virtqueue
> + * @vq:		Virtqueue to check
> + *
> + * Return: true if the virtqueue is started, false otherwise
> + */
> +static inline bool vu_queue_started(const struct vu_virtq *vq)
> +{
> +	return vq->started;
> +}
> +
> +int vu_send(struct vu_dev *vdev, const void *buf, size_t size);
> +void vu_print_capabilities(void);
> +void vu_init(struct ctx *c, struct vu_dev *vdev);
> +void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
> +		const struct timespec *now);
> +void vu_cleanup(struct vu_dev *vdev);
> +void tap_handler_vu(struct vu_dev *vdev, int fd, uint32_t events);
> +#endif /* VHOST_USER_H */
> diff --git a/virtio.c b/virtio.c
> index 8354f6052aee..d02e6e04701d 100644
> --- a/virtio.c
> +++ b/virtio.c
> @@ -323,7 +323,6 @@ static bool vring_can_notify(const struct vu_dev *dev, struct vu_virtq *vq)
>   * @dev:	Vhost-user device
>   * @vq:		Virtqueue
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_queue_notify(const struct vu_dev *dev, struct vu_virtq *vq)
>  {
>  	if (!vq->vring.avail)
> @@ -500,7 +499,6 @@ static int vu_queue_map_desc(struct vu_dev *dev, struct vu_virtq *vq, unsigned i
>   *
>   * Return: -1 if there is an error, 0 otherwise
>   */
> -/* cppcheck-suppress unusedFunction */
>  int vu_queue_pop(struct vu_dev *dev, struct vu_virtq *vq, struct vu_virtq_element *elem)
>  {
>  	unsigned int head;
> @@ -550,7 +548,6 @@ void vu_queue_detach_element(struct vu_virtq *vq)
>   * vu_queue_unpop() - Push back the previously popped element from the virqueue
>   * @vq:		Virtqueue
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_queue_unpop(struct vu_virtq *vq)
>  {
>  	vq->last_avail_idx--;
> @@ -618,7 +615,6 @@ void vu_queue_fill_by_index(struct vu_virtq *vq, unsigned int index,
>   * @len:	Size of the element
>   * @idx:	Used ring entry index
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_queue_fill(struct vu_virtq *vq, const struct vu_virtq_element *elem,
>  		   unsigned int len, unsigned int idx)
>  {
> @@ -642,7 +638,6 @@ static inline void vring_used_idx_set(struct vu_virtq *vq, uint16_t val)
>   * @vq:		Virtqueue
>   * @count:	Number of entry to flush
>   */
> -/* cppcheck-suppress unusedFunction */
>  void vu_queue_flush(struct vu_virtq *vq, unsigned int count)
>  {
>  	uint16_t old, new;
> diff --git a/virtio.h b/virtio.h
> index af9cadc990b9..242e788e07e9 100644
> --- a/virtio.h
> +++ b/virtio.h
> @@ -106,6 +106,7 @@ struct vu_dev_region {
>   * @hdrlen:		Virtio -net header length
>   */
>  struct vu_dev {
> +	struct ctx *context;
>  	uint32_t nregions;
>  	struct vu_dev_region regions[VHOST_USER_MAX_RAM_SLOTS];
>  	struct vu_virtq vq[VHOST_USER_MAX_QUEUES];
> @@ -162,7 +163,6 @@ static inline bool vu_has_feature(const struct vu_dev *vdev,
>   *
>   * Return:	True if the feature is available
>   */
> -/* cppcheck-suppress unusedFunction */
>  static inline bool vu_has_protocol_feature(const struct vu_dev *vdev,
>  					   unsigned int fbit)
>  {

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

* Re: [PATCH v3 3/4] vhost-user: introduce vhost-user API
  2024-08-22 22:14   ` Stefano Brivio
@ 2024-08-26  5:27     ` David Gibson
  2024-08-26  7:55       ` Stefano Brivio
  0 siblings, 1 reply; 22+ messages in thread
From: David Gibson @ 2024-08-26  5:27 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Laurent Vivier, passt-dev

On Fri, Aug 23, 2024 at 12:14:22AM +0200, Stefano Brivio wrote:
> On Thu, 15 Aug 2024 17:50:22 +0200
> Laurent Vivier <lvivier@redhat.com> wrote:
[snip]

> > +	if (sz_payload) {
> > +		do {
> > +			ret = recv(conn_fd, &vmsg->payload, sz_payload, 0);
> > +		} while (ret < 0 && (errno == EINTR || errno == EAGAIN));
> 
> No need for curly brackets, it's a one-line statement.

Unlike if, while or for, I'm pretty sure the braces are mandatory for
do {} while.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

* Re: [PATCH v3 3/4] vhost-user: introduce vhost-user API
  2024-08-26  5:27     ` David Gibson
@ 2024-08-26  7:55       ` Stefano Brivio
  2024-08-26  9:53         ` David Gibson
  0 siblings, 1 reply; 22+ messages in thread
From: Stefano Brivio @ 2024-08-26  7:55 UTC (permalink / raw)
  To: David Gibson; +Cc: Laurent Vivier, passt-dev

On Mon, 26 Aug 2024 15:27:59 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Aug 23, 2024 at 12:14:22AM +0200, Stefano Brivio wrote:
> > On Thu, 15 Aug 2024 17:50:22 +0200
> > Laurent Vivier <lvivier@redhat.com> wrote:  
> [snip]
> 
> > > +	if (sz_payload) {
> > > +		do {
> > > +			ret = recv(conn_fd, &vmsg->payload, sz_payload, 0);
> > > +		} while (ret < 0 && (errno == EINTR || errno == EAGAIN));  
> > 
> > No need for curly brackets, it's a one-line statement.  
> 
> Unlike if, while or for, I'm pretty sure the braces are mandatory for
> do {} while.

What do you mean by mandatory? This is not covered in any special way
by the kernel coding style documentation, and that statement is not a
compound statement:

$ cat dowhile.c
#include <stdio.h>

int main()
{
    int a = 3;

    do
        printf("%i\n", a--);
    while (a);

    return 0;
}
$ gcc -Wall -Wextra -pedantic -std=c89 -o dowhile dowhile.c 
$ ./dowhile 
3
2
1

but sure, if you suggest that curly brackets improve clarity here, I
have nothing against them.

-- 
Stefano


* Re: [PATCH v3 3/4] vhost-user: introduce vhost-user API
  2024-08-26  7:55       ` Stefano Brivio
@ 2024-08-26  9:53         ` David Gibson
  0 siblings, 0 replies; 22+ messages in thread
From: David Gibson @ 2024-08-26  9:53 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Laurent Vivier, passt-dev

On Mon, Aug 26, 2024 at 09:55:42AM +0200, Stefano Brivio wrote:
> On Mon, 26 Aug 2024 15:27:59 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Fri, Aug 23, 2024 at 12:14:22AM +0200, Stefano Brivio wrote:
> > > On Thu, 15 Aug 2024 17:50:22 +0200
> > > Laurent Vivier <lvivier@redhat.com> wrote:  
> > [snip]
> > 
> > > > +	if (sz_payload) {
> > > > +		do {
> > > > +			ret = recv(conn_fd, &vmsg->payload, sz_payload, 0);
> > > > +		} while (ret < 0 && (errno == EINTR || errno == EAGAIN));  
> > > 
> > > No need for curly brackets, it's a one-line statement.  
> > 
> > Unlike if, while or for, I'm pretty sure the braces are mandatory for
> > do {} while.
> 
> What do you mean by mandatory? This is not covered in any special way
> by the kernel coding style documentation, and that statement is not a
> compound statement:
> 
> $ cat dowhile.c
> #include <stdio.h>
> 
> int main()
> {
>     int a = 3;
> 
>     do
>         printf("%i\n", a--);
>     while (a);

Huh.  Ok, I'm just wrong.

For some reason I thought the braces were required by C for do while.

> 
>     return 0;
> }
> $ gcc -Wall -Wextra -pedantic -std=c89 -o dowhile dowhile.c 
> $ ./dowhile 
> 3
> 2
> 1
> 
> but sure, if you suggest that curly brackets improve clarity here, I
> have nothing against them.

Not what I was getting at, but I do think that's true.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

* Re: [PATCH v3 3/4] vhost-user: introduce vhost-user API
  2024-08-26  5:26   ` David Gibson
@ 2024-08-26 22:14     ` Stefano Brivio
  2024-08-27  4:42       ` David Gibson
  2024-09-05  9:58     ` Laurent Vivier
  1 sibling, 1 reply; 22+ messages in thread
From: Stefano Brivio @ 2024-08-26 22:14 UTC (permalink / raw)
  To: David Gibson, Laurent Vivier; +Cc: passt-dev

On Mon, 26 Aug 2024 15:26:44 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Aug 15, 2024 at 05:50:22PM +0200, Laurent Vivier wrote:
> > Add vhost_user.c and vhost_user.h that define the functions needed
> > to implement vhost-user backend.
> >
> > [...]
> > 
> > +static int vu_message_read_default(int conn_fd, struct vhost_user_msg *vmsg)
> > +{
> > +	char control[CMSG_SPACE(VHOST_MEMORY_BASELINE_NREGIONS *
> > +		     sizeof(int))] = { 0 };
> > +	struct iovec iov = {
> > +		.iov_base = (char *)vmsg,
> > +		.iov_len = VHOST_USER_HDR_SIZE,
> > +	};
> > +	struct msghdr msg = {
> > +		.msg_iov = &iov,
> > +		.msg_iovlen = 1,
> > +		.msg_control = control,
> > +		.msg_controllen = sizeof(control),
> > +	};
> > +	ssize_t ret, sz_payload;
> > +	struct cmsghdr *cmsg;
> > +	size_t fd_size;
> > +
> > +	ret = recvmsg(conn_fd, &msg, MSG_DONTWAIT);
> > +	if (ret < 0) {
> > +		if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK)
> > +			return 0;
> > +		return -1;
> > +	}
> > +
> > +	vmsg->fd_num = 0;
> > +	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
> > +	     cmsg = CMSG_NXTHDR(&msg, cmsg)) {
> > +		if (cmsg->cmsg_level == SOL_SOCKET &&
> > +		    cmsg->cmsg_type == SCM_RIGHTS) {
> > +			fd_size = cmsg->cmsg_len - CMSG_LEN(0);
> > +			ASSERT(fd_size / sizeof(int) <=
> > +			       VHOST_MEMORY_BASELINE_NREGIONS);  
> 
> IIUC, this could be tripped by a bug in the peer (qemu?) rather than
> in our own code.  In which case I think a die() would be more
> appropriate than an ASSERT().

Ah, right, it wouldn't be our issue... what about neither, so that we
don't crash if QEMU has an issue we could easily recover from?
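
Sketch of what I mean (assuming the caller would then reset the
socket instead of dying; the fds already received should be closed
too, omitted here):

	if (fd_size / sizeof(int) > VHOST_MEMORY_BASELINE_NREGIONS) {
		err("Too many file descriptors in vhost-user message");
		return -1;
	}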

> > [...]
> >
> > +/**
> > + * vu_set_mem_table_exec() - Sets the memory map regions to be able to
> > + * 			     translate the vring addresses.
> > + * @vdev:	Vhost-user device
> > + * @vmsg:	Vhost-user message
> > + *
> > + * Return: false as no reply is requested
> > + *
> > + * #syscalls:vu mmap munmap
> > + */
> > +static bool vu_set_mem_table_exec(struct vu_dev *vdev,
> > +				  struct vhost_user_msg *msg)
> > +{
> > +	struct vhost_user_memory m = msg->payload.memory, *memory = &m;  
> 
> Is there a reason to take a copy of the message, rather than just
> referencing into msg as passed?
> 
> > +	unsigned int i;
> > +
> > +	for (i = 0; i < vdev->nregions; i++) {
> > +		struct vu_dev_region *r = &vdev->regions[i];
> > +		/* NOLINTNEXTLINE(performance-no-int-to-ptr) */
> > +		void *mm = (void *)r->mmap_addr;
> > +
> > +		if (mm)
> > +			munmap(mm, r->size + r->mmap_offset);  
> 
> Do we actually ever need to change the mapping of the regions?  If not
> we can avoid this unmapping loop.
> 
> > +	}
> > +	vdev->nregions = memory->nregions;
> > +
> > +	debug("Nregions: %u", memory->nregions);
> > +	for (i = 0; i < vdev->nregions; i++) {
> > +		struct vhost_user_memory_region *msg_region = &memory->regions[i];
> > +		struct vu_dev_region *dev_region = &vdev->regions[i];
> > +		void *mmap_addr;
> > +
> > +		debug("Region %d", i);
> > +		debug("    guest_phys_addr: 0x%016"PRIx64,
> > +		      msg_region->guest_phys_addr);
> > +		debug("    memory_size:     0x%016"PRIx64,
> > +		      msg_region->memory_size);
> > +		debug("    userspace_addr   0x%016"PRIx64,
> > +		      msg_region->userspace_addr);
> > +		debug("    mmap_offset      0x%016"PRIx64,
> > +		      msg_region->mmap_offset);
> > +
> > +		dev_region->gpa = msg_region->guest_phys_addr;
> > +		dev_region->size = msg_region->memory_size;
> > +		dev_region->qva = msg_region->userspace_addr;
> > +		dev_region->mmap_offset = msg_region->mmap_offset;
> > +
> > +		/* We don't use offset argument of mmap() since the
> > +		 * mapped address has to be page aligned, and we use huge
> > +		 * pages.  
> 
> We do what now?

We do madvise(pkt_buf, TAP_BUF_BYTES, MADV_HUGEPAGE) in main(), but
we're not using pkt_buf in this case, so I guess it's not relevant. I'm
not sure if _passt_ calling madvise(..., MADV_HUGEPAGE) on the memory
regions we get would have any effect, by the way.

> > [...]
> >
> > +/**
> > + * vu_send() - Send a buffer to the front-end using the RX virtqueue
> > + * @vdev:	vhost-user device
> > + * @buf:	address of the buffer
> > + * @size:	size of the buffer
> > + *
> > + * Return: number of bytes sent, -1 if there is an error
> > + */
> > +/* cppcheck-suppress unusedFunction */
> > +int vu_send(struct vu_dev *vdev, const void *buf, size_t size)
> > +{
> > +	struct vu_virtq *vq = &vdev->vq[VHOST_USER_RX_QUEUE];
> > +	struct vu_virtq_element elem[VIRTQUEUE_MAX_SIZE];
> > +	struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
> > +	size_t lens[VIRTQUEUE_MAX_SIZE];
> > +	__virtio16 *num_buffers_ptr = NULL;
> > +	size_t hdrlen = vdev->hdrlen;
> > +	int in_sg_count = 0;
> > +	size_t offset = 0;
> > +	int i = 0, j;
> > +
> > +	debug("vu_send size %zu hdrlen %zu", size, hdrlen);
> > +
> > +	if (!vu_queue_enabled(vq) || !vu_queue_started(vq)) {
> > +		err("Got packet, but no available descriptors on RX virtq.");
> > +		return 0;
> > +	}
> > +
> > +	while (offset < size) {
> > +		size_t len;
> > +		int total;
> > +		int ret;
> > +
> > +		total = 0;
> > +
> > +		if (i == ARRAY_SIZE(elem) ||
> > +		    in_sg_count == ARRAY_SIZE(in_sg)) {
> > +			err("virtio-net unexpected long buffer chain");
> > +			goto err;
> > +		}
> > +
> > +		elem[i].out_num = 0;
> > +		elem[i].out_sg = NULL;
> > +		elem[i].in_num = ARRAY_SIZE(in_sg) - in_sg_count;
> > +		elem[i].in_sg = &in_sg[in_sg_count];
> > +
> > +		ret = vu_queue_pop(vdev, vq, &elem[i]);
> > +		if (ret < 0) {
> > +			if (vu_wait_queue(vq) != -1)
> > +				continue;
> > +			if (i) {
> > +				err("virtio-net unexpected empty queue: "
> > +				    "i %d mergeable %d offset %zd, size %zd, "
> > +				    "features 0x%" PRIx64,
> > +				    i, vu_has_feature(vdev,
> > +						      VIRTIO_NET_F_MRG_RXBUF),
> > +				    offset, size, vdev->features);
> > +			}
> > +			offset = -1;
> > +			goto err;
> > +		}
> > +		in_sg_count += elem[i].in_num;  
> 
> Initially I thought this would consume the entire in_sg array on the
> first loop iteration, but I guess vu_queue_pop() reduces in_num from
> the value we initialise above.
> 
> > +		if (elem[i].in_num < 1) {  
> 
> I realise it doesn't really matter in this context, but it makes more
> sense to me for this check to go _before_ we use in_num to update
> in_sg_count.
> 
> 
> > +			err("virtio-net receive queue contains no in buffers");
> > +			vu_queue_detach_element(vq);
> > +			offset = -1;
> > +			goto err;
> > +		}
> > +
> > +		if (i == 0) {
> > +			struct virtio_net_hdr hdr = {
> > +				.flags = VIRTIO_NET_HDR_F_DATA_VALID,
> > +				.gso_type = VIRTIO_NET_HDR_GSO_NONE,
> > +			};
> > +
> > +			ASSERT(offset == 0);
> > +			ASSERT(elem[i].in_sg[0].iov_len >= hdrlen);  
> 
> > Is this necessarily our bug, or could it be caused by the peer giving
> unreasonably small buffers?  If the latter, then a die() would make
> more sense.

...same here.

-- 
Stefano


* Re: [PATCH v3 3/4] vhost-user: introduce vhost-user API
  2024-08-26 22:14     ` Stefano Brivio
@ 2024-08-27  4:42       ` David Gibson
  0 siblings, 0 replies; 22+ messages in thread
From: David Gibson @ 2024-08-27  4:42 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Laurent Vivier, passt-dev

On Tue, Aug 27, 2024 at 12:14:20AM +0200, Stefano Brivio wrote:
> On Mon, 26 Aug 2024 15:26:44 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Aug 15, 2024 at 05:50:22PM +0200, Laurent Vivier wrote:
> > > Add vhost_user.c and vhost_user.h that define the functions needed
> > > to implement vhost-user backend.
> > >
> > > [...]
> > > 
> > > +static int vu_message_read_default(int conn_fd, struct vhost_user_msg *vmsg)
> > > +{
> > > +	char control[CMSG_SPACE(VHOST_MEMORY_BASELINE_NREGIONS *
> > > +		     sizeof(int))] = { 0 };
> > > +	struct iovec iov = {
> > > +		.iov_base = (char *)vmsg,
> > > +		.iov_len = VHOST_USER_HDR_SIZE,
> > > +	};
> > > +	struct msghdr msg = {
> > > +		.msg_iov = &iov,
> > > +		.msg_iovlen = 1,
> > > +		.msg_control = control,
> > > +		.msg_controllen = sizeof(control),
> > > +	};
> > > +	ssize_t ret, sz_payload;
> > > +	struct cmsghdr *cmsg;
> > > +	size_t fd_size;
> > > +
> > > +	ret = recvmsg(conn_fd, &msg, MSG_DONTWAIT);
> > > +	if (ret < 0) {
> > > +		if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK)
> > > +			return 0;
> > > +		return -1;
> > > +	}
> > > +
> > > +	vmsg->fd_num = 0;
> > > +	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
> > > +	     cmsg = CMSG_NXTHDR(&msg, cmsg)) {
> > > +		if (cmsg->cmsg_level == SOL_SOCKET &&
> > > +		    cmsg->cmsg_type == SCM_RIGHTS) {
> > > +			fd_size = cmsg->cmsg_len - CMSG_LEN(0);
> > > +			ASSERT(fd_size / sizeof(int) <=
> > > +			       VHOST_MEMORY_BASELINE_NREGIONS);  
> > 
> > IIUC, this could be tripped by a bug in the peer (qemu?) rather than
> > in our own code.  In which case I think a die() would be more
> > appropriate than an ASSERT().
> 
> Ah, right, it wouldn't be our issue... what about neither, so that we
> don't crash if QEMU has an issue we could easily recover from?

It wasn't immediately obvious to me if we could easily recover from
that or not.

[snip]
> > > +	vdev->nregions = memory->nregions;
> > > +
> > > +	debug("Nregions: %u", memory->nregions);
> > > +	for (i = 0; i < vdev->nregions; i++) {
> > > +		struct vhost_user_memory_region *msg_region = &memory->regions[i];
> > > +		struct vu_dev_region *dev_region = &vdev->regions[i];
> > > +		void *mmap_addr;
> > > +
> > > +		debug("Region %d", i);
> > > +		debug("    guest_phys_addr: 0x%016"PRIx64,
> > > +		      msg_region->guest_phys_addr);
> > > +		debug("    memory_size:     0x%016"PRIx64,
> > > +		      msg_region->memory_size);
> > > +		debug("    userspace_addr   0x%016"PRIx64,
> > > +		      msg_region->userspace_addr);
> > > +		debug("    mmap_offset      0x%016"PRIx64,
> > > +		      msg_region->mmap_offset);
> > > +
> > > +		dev_region->gpa = msg_region->guest_phys_addr;
> > > +		dev_region->size = msg_region->memory_size;
> > > +		dev_region->qva = msg_region->userspace_addr;
> > > +		dev_region->mmap_offset = msg_region->mmap_offset;
> > > +
> > > +		/* We don't use offset argument of mmap() since the
> > > +		 * mapped address has to be page aligned, and we use huge
> > > +		 * pages.  
> > 
> > We do what now?
> 
> We do madvise(pkt_buf, TAP_BUF_BYTES, MADV_HUGEPAGE) in main(), but
> we're not using pkt_buf in this case, so I guess it's not relevant. I'm
> not sure if _passt_ calling madvise(..., MADV_HUGEPAGE) on the memory
> regions we get would have any effect, by the way.

Huh, I'd forgotten about that.  AIUI qemu allocates the memory and we
map it into passt, so I don't think our madvise() would have any
effect here.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

* Re: [PATCH v3 3/4] vhost-user: introduce vhost-user API
  2024-08-26  5:26   ` David Gibson
  2024-08-26 22:14     ` Stefano Brivio
@ 2024-09-05  9:58     ` Laurent Vivier
  1 sibling, 0 replies; 22+ messages in thread
From: Laurent Vivier @ 2024-09-05  9:58 UTC (permalink / raw)
  To: David Gibson; +Cc: passt-dev

On 26/08/2024 07:26, David Gibson wrote:
>> +/**
>> + * vu_set_features_exec() - Enable features of the back-end
>> + * @vdev:	Vhost-user device
>> + * @vmsg:	Vhost-user message
>> + *
>> + * Return: false as no reply is requested
>> + */
>> +static bool vu_set_features_exec(struct vu_dev *vdev,
>> +				 struct vhost_user_msg *msg)
>> +{
>> +	debug("u64: 0x%016"PRIx64, msg->payload.u64);
> A number of these debug() messages look like they'd be pretty cryptic,
> with no indication of which part of passt they're coming from.  This
> one is especially bad.

In the main loop that calls each vu_ function, we have this trace:

         debug("================ Vhost user message ================");
         debug("Request: %s (%d)", vu_request_to_string(msg.hdr.request),
                 msg.hdr.request);
         debug("Flags:   0x%x", msg.hdr.flags);
         debug("Size:    %u", msg.hdr.size);

That will give:

33.1458: passt: epoll event on vhost-user command socket 73 (events: 0x00000001)
33.1458: ================ Vhost user message ================
33.1458: Request: VHOST_USER_SET_FEATURES (2)
33.1458: Flags:   0x1
33.1458: Size:    8
33.1458: u64: 0x0000000140008000

I think this provides enough context to understand the trace.

Thanks,
Laurent


end of thread

Thread overview: 22+ messages
2024-08-15 15:50 [PATCH v3 0/4] Add vhost-user support to passt. (part 3) Laurent Vivier
2024-08-15 15:50 ` [PATCH v3 1/4] packet: replace struct desc by struct iovec Laurent Vivier
2024-08-20  0:27   ` David Gibson
2024-08-15 15:50 ` [PATCH v3 2/4] vhost-user: introduce virtio API Laurent Vivier
2024-08-20  1:00   ` David Gibson
2024-08-22 22:14   ` Stefano Brivio
2024-08-15 15:50 ` [PATCH v3 3/4] vhost-user: introduce vhost-user API Laurent Vivier
2024-08-22 22:14   ` Stefano Brivio
2024-08-26  5:27     ` David Gibson
2024-08-26  7:55       ` Stefano Brivio
2024-08-26  9:53         ` David Gibson
2024-08-26  5:26   ` David Gibson
2024-08-26 22:14     ` Stefano Brivio
2024-08-27  4:42       ` David Gibson
2024-09-05  9:58     ` Laurent Vivier
2024-08-15 15:50 ` [PATCH v3 4/4] vhost-user: add vhost-user Laurent Vivier
2024-08-22  9:59   ` Stefano Brivio
2024-08-22 22:14   ` Stefano Brivio
2024-08-23 12:32   ` Stefano Brivio
2024-08-20 22:41 ` [PATCH v3 0/4] Add vhost-user support to passt. (part 3) Stefano Brivio
2024-08-22 16:53   ` Stefano Brivio
2024-08-23 12:32     ` Stefano Brivio
